Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
7082
Daniel Cremers Marcus Magnor Martin R. Oswald Lihi Zelnik-Manor (Eds.)
Video Processing and Computational Video International Seminar Dagstuhl Castle, Germany, October 10-15, 2010 Revised Papers
Volume Editors Daniel Cremers Technische Universität München, Germany Email:
[email protected] Marcus Magnor Technische Universität Braunschweig, Germany Email:
[email protected] Martin R. Oswald Technische Universität München, Germany Email:
[email protected] Lihi Zelnik-Manor The Technion, Israel Institute of Technology, Haifa, Israel Email:
[email protected] ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-24869-6 e-ISBN 978-3-642-24870-2 DOI 10.1007/978-3-642-24870-2 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011938798 CR Subject Classification (1998): I.4, I.2.10, I.5.4-5, F.2.2, I.3.5 LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
With the swift development of video imaging technology and the drastic improvements in CPU speed and memory, both video processing and computational video are becoming more and more popular. Similar to the digital revolution in photography fifteen years ago, digital methods are today revolutionizing the way television and movies are being made. With the advent of professional digital movie cameras, digital projector technology for movie theaters, and 3D movies, the movie and television production pipeline is turning all-digital, opening up numerous new opportunities for the way dynamic scenes are acquired, video footage is edited, and visual media are experienced. This book provides a compilation of selected articles resulting from a workshop on "Video Processing and Computational Video", held at Dagstuhl Castle, Germany, in October 2010. During this workshop, 43 researchers from all over the world discussed the state of the art, contemporary challenges, and future research in imaging, processing, analyzing, modeling, and rendering of real-world, dynamic scenes. The seminar was organized into 11 sessions of presentations, discussions, and special-topic meetings. The seminar brought together junior and senior researchers from computer vision, computer graphics, and image communication, from both academia and industry, to address the challenges in computational video. For five days, workshop participants discussed the impact of, as well as the opportunities arising from, digital video acquisition, processing, representation, and display. Over the course of the seminar, the participants addressed contemporary challenges in digital TV and movie production; pointed at new opportunities in an all-digital production pipeline; discussed novel ways to acquire, represent, and experience dynamic content; accrued a wish list for future video equipment; proposed new ways to interact with visual content; and debated possible future mass-market applications for computational video.
Viable research areas in computational video identified during the seminar included motion capture of faces, non-rigid surfaces, and entire performances; reconstruction and modeling of non-rigid objects; acquisition of scene illumination; time-of-flight cameras; motion field and segmentation estimation for video editing; as well as free-viewpoint navigation and video-based rendering. With regard to technological challenges, seminar participants agreed that the "rolling shutter" effect of CMOS-based video imagers currently poses a serious problem for existing computer vision algorithms. It is expected, however, that this problem will be overcome by future video imaging technology. Another item on the seminar participants' wish list for future camera hardware concerned high frame-rate acquisition to enable more robust motion field estimation or time-multiplexed acquisition. Finally, it was expected that plenoptic cameras will hit
the commercial market within the next few years, allowing for advanced post-processing features such as variable depth-of-field, stereopsis, or motion parallax. The papers presented in these post-workshop proceedings were carefully selected through a blind peer-review process with three independent reviewers for each paper. We are grateful to the people at Dagstuhl Castle for supporting this seminar. We thank all participants for their talks and contributions to discussions and all authors who contributed to this book. Moreover, we thank all reviewers for their elaborate assessment and constructive criticism, which helped to further improve the quality of the presented articles. August 2011
Daniel Cremers Marcus Magnor Martin R. Oswald Lihi Zelnik-Manor
Table of Contents
Video Processing and Computational Video

Towards Plenoptic Raumzeit Reconstruction .................. 1
   Martin Eisemann, Felix Klose, and Marcus Magnor
Two Algorithms for Motion Estimation from Alternate Exposure Images .................. 25
   Anita Sellent, Martin Eisemann, and Marcus Magnor
Understanding What We Cannot See: Automatic Analysis of 4D Digital In-Line Holographic Microscopy Data .................. 52
   Laura Leal-Taixé, Matthias Heydt, Axel Rosenhahn, and Bodo Rosenhahn
3D Reconstruction and Video-Based Rendering of Casually Captured Videos .................. 77
   Aparna Taneja, Luca Ballan, Jens Puwein, Gabriel J. Brostow, and Marc Pollefeys
Silhouette-Based Variational Methods for Single View Reconstruction .................. 104
   Eno Töppe, Martin R. Oswald, Daniel Cremers, and Carsten Rother
Single Image Blind Deconvolution with Higher-Order Texture Statistics .................. 124
   Manuel Martinello and Paolo Favaro
Compressive Rendering of Multidimensional Scenes .................. 152
   Pradeep Sen, Soheil Darabi, and Lei Xiao
Efficient Rendering of Light Field Images .................. 184
   Daniel Jung and Reinhard Koch
Author Index .................. 213
Towards Plenoptic Raumzeit Reconstruction

Martin Eisemann, Felix Klose, and Marcus Magnor

Computer Graphics Lab, TU Braunschweig, Germany
Abstract. The goal of image-based rendering is to evoke a visceral sense of presence in a scene using only photographs or videos. A huge variety of approaches has been developed during the last decade. Examining the underlying models, we find three main categories: view interpolation based on geometry proxies, pure image interpolation techniques, and complete scene flow reconstruction. In this paper we present three approaches for free-viewpoint video, one for each of these categories, and discuss their individual benefits and drawbacks. We hope that studying the different approaches will help others in making important design decisions when planning a free-viewpoint video system.

Keywords: Free-Viewpoint Video, Image-Based Rendering, Dynamic Scene Reconstruction.
1 Introduction
D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 1–24, 2011. © Springer-Verlag Berlin Heidelberg 2011

As humans we perceive most of our surroundings through our eyes, and visual stimuli affect all of our senses, drive emotion, arouse memories, and much more. That is one of the reasons why we like to look at pictures. A major revolution occurred with the introduction of moving images, or videos. The dimension of time was suddenly added, which gave film and movie makers incredible freedom to tell their story to the audience. With more powerful hardware, greater computation power, and clever algorithms, we are now able to add a new dimension to videos, namely the third spatial dimension. This gives users or producers the possibility to change the camera viewpoint on the fly. But there is a difference between the spatial dimension and time. While free-viewpoint video allows the viewpoint to be changed not only to positions captured by the input cameras but also to any other position in between, the dimension of time is usually only captured at discrete time steps, dictated by the recording frame rate of the input cameras. For a complete scene flow representation, not only space but also time needs to be reconstructed faithfully. In this paper we present three different approaches for free-viewpoint video and space-time interpolation. After reviewing related work in the next section, we continue with our Floating Textures [1] in Section 3 as an example of high-quality free-viewpoint video with discrete time steps. For each discrete time step, a geometry of the scene is reconstructed and textured by multi-view projective texture mapping; as this process is error-prone, we present a warping-based refinement to correct the resulting artifacts. In Section 4 we describe
the transition from discrete to continuous space-time interpolation: by discarding the geometry and working on image correspondences alone, we can create perceptually plausible image interpolations [2,3,4]. As purely image-based approaches place restrictions on the viewing position, we introduce an algorithm towards complete scene flow reconstruction in Section 5 [5]. All three approaches have their benefits and drawbacks, and the choice should always be based on the requirements of one's application.
2 Related Work
In a slightly simplified form, the plenoptic function P(x, y, z, θ, φ, t) describes radiance as a function of 3D position in space (x, y, z), direction (θ, φ), and time t [6]. With sufficiently many input views, a direct reconstruction of this function is possible. Initially developed for static scenes, light field rendering [7] is possibly the purest and closest variant to direct resampling. Light field rendering can be directly extended to incorporate discrete [8,9] or even continuous time steps [10]. To cover a larger range of viewing angles at acceptable image quality, however, a large number of densely packed images is necessary [11,12,13]. By employing a prefiltering step the number of necessary samples can be reduced, but at the cost of blurrier output images [14,15,16]. Wider camera spacings require more sophisticated interpolation techniques. One possibility is the incorporation of a geometry proxy. Given the input images, a 3D representation for each discrete time step is reconstructed and used for depth-guided resampling of the plenoptic function [17,18,19,20,21,22]. For restricted scene setups, the incorporation of template models proves beneficial [23,24,25]. Only few approaches reconstruct a temporally consistent mesh, which allows for continuous time interpolation [26,27]. To deal with insufficient reconstruction accuracy, Aganj et al. [28] and Takai et al. [29] deform the input images, or both the mesh and the input images, respectively, to diminish rendering artifacts. Unfortunately, none of these approaches allows for real-time rendering without a time-intensive preprocessing phase. Instead of reconstructing a geometry proxy, purely image-based interpolation techniques rely only on image correspondences. If complete correspondences between image pixels can be established, accurate image warping becomes possible [30]. Mark et al. [31] follow the seminal approach of Chen et al. [30] but also handle occlusion and discontinuities during rendering.
While useful to speed up rendering performance, their approaches are only applicable to synthetic scenes. Beier et al. [32] propose a manually guided line-based warping method to interpolate between two images, known from its use in Michael Jackson's music video "Black or White". A physically valid view synthesis by image interpolation is proposed by Seitz et al. [33,34]. For very similar images, optical flow techniques have proven useful [35,36]. Highly precise approaches exist which can be computed at real-time or (almost) interactive rates [37,38,39]. Einarsson et al. [40] created a complete acquisition system, the so-called light stage 6, for acquiring and relighting human locomotion. Due to the high number of images acquired, they could directly incorporate
optical flow techniques to create virtual camera views in a light field renderer by directly warping the input images. Correspondence estimation is only one part of an image-based renderer; the image interpolation itself is another critical part. Fitzgibbon et al. [41] use image-based priors, i.e., they enforce similarity to the input images, to remove ghosting artifacts. Drawbacks are very long computation times, and the input images must be relatively similar in order to achieve good results. Mahajan et al. [42] proposed a method for plausible image interpolation that searches for the optimal path of a pixel transitioning from one image to the other in the gradient domain. As each output pixel in the interpolated view is taken from only a single source image, ghosting and blurring artifacts are avoided, but if wrong correspondences are estimated, unaesthetic deformations might occur. Linz et al. [43] extend the approach of Mahajan et al. [42] to space-time interpolation with multi-image interpolation based on graph cuts and symmetric optical flow. In the unstructured video rendering of Ballan et al. [44], the static background of a scene is directly reconstructed, while the actor in the foreground is projected onto a billboard, and the view switches between the cameras at a specific point where the transition is least visible.
3 Floating Textures
Image-based rendering (IBR) systems using a geometry proxy have the benefit of free camera movement for each reconstructed time step. The drawback is that any reconstruction method with an insufficient number of input images is imprecise. While this may be no big problem when looking at the mesh alone, it becomes rather obvious when the mesh is to be textured. The challenge is therefore to generate a perceptually plausible rendering with only a sparse setup of cameras and a possibly imperfect geometry proxy. Commonly in IBR, the full bidirectional reflectance distribution function, i.e., how a point on a surface appears depending on viewpoint and lighting, is approximated by projective texture mapping [45] and image blending. Typically, the blending factors are based on the angular deviation of the view vector, to capture view-dependent effects. If too few input images are available or the geometry is too imprecise, ghosting artifacts appear because the projected textures do not match on the surface. In this section we assume that a set of input images, the corresponding (possibly imprecise) calibration data, and a geometry proxy are given. The task is to texture this proxy without noticeable artifacts, masking the imprecision of the underlying geometry. Without occlusion, any novel viewpoint can, in theory, be rendered directly from the input images by warping, i.e., by simply deforming the images, so that the following property holds:

I_j = W_{I_i→I_j} ∘ I_i ,   (1)

where W_{I_i→I_j} ∘ I_i warps an image I_i towards I_j according to the warp field W_{I_i→I_j}. The problem of determining the warp field W_{I_i→I_j} between two images
Fig. 1. Rendering with Floating Textures [1]. The input photos are projected from camera positions C_i onto the approximate geometry G_A and onto the desired image plane of viewpoint V. The resulting intermediate images I_i^v exhibit mismatches, which are compensated by warping all I_i^v based on the optical flow to obtain the final image I_v^Float.
I_i, I_j is a heavily researched area in computer graphics and vision. If pixel distances between corresponding image features are not too large, algorithms to robustly estimate per-pixel optical flow are available [37,46]. The issue here is that in most cases these distances will be too large. In order to relax the correspondence-finding problem, the problem can literally be projected into another space, namely the output image domain. By first projecting the photographs from cameras C_i onto the approximate geometry surface G_A and rendering the scene from the desired viewpoint C_v, creating the intermediate images I_i^v, the corresponding image features are brought much closer together than they were in the original input images, Figure 1. This opens up the possibility of applying optical flow estimation to the intermediate images I_i^v to robustly determine the pairwise flow fields W_{I_i^v→I_j^v}. To accommodate more than two input images, a linear combination of the flow fields according to (3) can be applied to all intermediate images I_i^v, which can then be blended together to obtain the final rendering result I_v^Float. To reduce computational cost, instead of establishing (n−1)n flow fields for n input photos, it often suffices to consider only the 3 input images closest to the current viewpoint. If more than 3 input images are needed, the quadratic effort can be reduced to linear complexity by using intermediate results. The processing steps are summarized in the following equations and visualized in Figure 1:

I_v^Float = Σ_{i=1}^{n} ω_i (W_{I_i^v} ∘ I_i^v)   (2)

W_{I_i^v} = Σ_{j=1}^{n} ω_j W_{I_i^v→I_j^v}   (3)
W_{I_i^v} is the combined flow field which is used for warping image I_i^v. The weight map ω_i contains the weights of I_i^v for each output pixel, with Σ_{i=1}^{n} ω_i = 1, based on the camera position [47].
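The combination and blending steps of Eqs. (2) and (3) can be sketched in a few lines of NumPy. This is only an illustrative CPU sketch, not the authors' GPU implementation: it uses scalar per-camera weights instead of the per-pixel weight maps ω_i, and a nearest-neighbor backward warp instead of bilinear GPU warping.

```python
import numpy as np

def warp(img, flow):
    """Backward warp: output(y, x) = img(y + v, x + u), nearest neighbor."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    y2 = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return img[y2, x2]

def floating_texture(intermediate, pairwise_flows, weights):
    """Eqs. (2)-(3): per intermediate image I_i^v, linearly combine the
    pairwise flow fields W_{I_i^v -> I_j^v} with the camera weights,
    then warp and blend.  pairwise_flows[(i, j)] is an (H, W, 2) array;
    weights are scalars here (per-pixel maps in the paper)."""
    n = len(intermediate)
    out = np.zeros_like(intermediate[0], dtype=float)
    for i in range(n):
        # Eq. (3); the j = i term is a zero flow and can be skipped
        W_i = sum(weights[j] * pairwise_flows[(i, j)]
                  for j in range(n) if j != i)
        out += weights[i] * warp(intermediate[i], W_i)   # Eq. (2)
    return out
```

With zero flow fields this reduces to plain weighted blending of the intermediate images; non-zero flows shift each intermediate image before blending.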
3.1 Soft Visibility
Up to now, only occlusion-free situations can be handled precisely, which is seldom the case in real-world scenarios. Simple projection of imprecisely calibrated photos onto an approximate 3D geometry model typically causes unsatisfactory results in the vicinity of occlusion boundaries, Figure 2(a). Texture information from occluding parts of the mesh projects incorrectly onto other geometry parts. With respect to Floating Textures, this affects not only rendering quality but also the reliability of the flow field estimation.
Fig. 2. (a) Projection errors occur if occlusion is ignored. (b) Optical ﬂow estimation goes astray if occluded image regions are not properly ﬁlled. (c) Final result after texture projection using a weight map with binary visibility. (d) Final result after texture projection using a weight map with soft visibility. Note that most visible seams and false projections have been eﬀectively removed.
A common approach to handling the occlusion problem is to establish a binary visibility map for each camera, multiply it with the weight map ω_i, and normalize the weights afterwards so that they sum to one. In theory, this efficiently discards occluded pixels in the input cameras for texture generation. One drawback of such an approach is that the underlying geometry must be assumed to be precise and the cameras precisely calibrated. In the presence of coarse geometry, the use of such binary visibility maps can create occlusion boundary artifacts at pixels where the value of the visibility map suddenly changes, Figure 2(c). To counter these effects, a "soft" visibility map Ω for the current viewpoint and every input camera can be generated using a distance filter on the binary map:

Ω(x, y) = 0                 if the scene point is occluded
Ω(x, y) = occDist(x, y)/r   if occDist(x, y) ≤ r   (4)
Ω(x, y) = 1                 otherwise

Here r is a user-defined radius, and occDist(x, y) is the distance to the nearest occluded pixel. If Ω is multiplied with the weight map ω, (4) makes sure that occluded regions stay occluded, while hard edges in the final weight map are
removed. Using this "soft" visibility map, the above-mentioned occlusion artifacts effectively disappear, Figure 2(d). To improve optical flow estimation, occluded areas in the projected input images I_i^v need to be filled with the corresponding color values from the camera whose weight ω_i is highest at this pixel, as this camera most probably provides the correct color. Otherwise, the erroneously projected parts could seriously influence the Floating Texture output, as wrong correspondences could be established, Figure 2(b). Applying the described filling procedure noticeably improves the quality of the flow calculation, Figure 2(d).

3.2 GPU Implementation
The nonlinear optimization, namely the optical flow calculation before the blending step, is computationally very intensive and cannot be sufficiently precomputed. Therefore, for immediate feedback it is important to compute the whole rendering part on the fly. The geometry representation can be of almost arbitrary type, e.g., a triangle mesh, a voxel representation, or a depth map (even though correct occlusion handling with a single depth map is not always possible due to the 2.5D scene representation). First, given a novel viewpoint, the closest camera positions are queried. For sparse camera arrangements, typically the two or three closest input images are chosen. The geometry model is rendered from the cameras' viewpoints into different depth buffers. These depth maps are then used to establish, for each camera, a binary visibility map for the current viewpoint. The visibility maps serve as input to the soft visibility shader, which can be efficiently implemented as a two-pass fragment shader. Next, a weight map is established by calculating the camera weights per output pixel, based on the Unstructured Lumigraph weighting scheme [47]. The final camera weights for each pixel in the output image are obtained by multiplying the weight map with the visibility map and normalizing the result. To create the input images for the flow field calculation, the geometry proxy is rendered from the desired viewpoint into multiple render targets, projecting each input photo onto the geometry. If the weight for a specific camera is 0 at a pixel, the color from the input camera with the highest weight at this position is used instead. To compute the optical flow between two images, efficient GPU implementations are needed [1,46]. Even though this processing step is computationally expensive and takes approximately 90% of the rendering time, interactive to real-time frame rates are possible with modern GPUs.
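The soft visibility map of Eq. (4) amounts to feathering a binary map with a distance transform. The paper implements this as a two-pass fragment shader; as a CPU sketch of the same computation, using SciPy's Euclidean distance transform:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def soft_visibility(binary_vis, r):
    """Eq. (4): feather a binary visibility map (1 = visible, 0 = occluded)
    with a distance filter of user-defined radius r (in pixels)."""
    occ_dist = distance_transform_edt(binary_vis)  # distance to nearest occluded pixel
    omega = np.minimum(occ_dist / float(r), 1.0)   # ramp up to 1 within radius r
    omega[binary_vis == 0] = 0.0                   # occluded regions stay occluded
    return omega
```

Multiplying this map with the camera weight map ω and renormalizing yields the final, seam-free blending weights.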
Once all needed computations have been carried out, the results can be combined in a final render pass, which warps and blends the projected images according to the weight map and flow fields. The benefits of the Floating Textures approach are best seen in the images in Figure 3, which compares different image-based rendering approaches.
4 View and Time Interpolation in Image Space
Up to now we considered the case where at least an approximate geometry could be reconstructed. In some cases, however, it is beneficial not to reconstruct any geometry at all but instead to work solely in image space. 3D reconstruction poses several constraints on the acquisition setup. First of all, many methods only reconstruct foreground objects that can be easily segmented from the rest of the image. Second, the scene to be reconstructed must either be static or the recording cameras must be synchronized, so that frames are captured at exactly the same time instant; otherwise reconstruction will fail for fast-moving parts. Even though modern state-of-the-art cameras can be triggered for synchronized capture, this still poses a problem in outdoor environments or for moving cameras, due to the amount of cables and connectors. Third, if automatic reconstruction fails, laborious modelling by hand might be necessary; sometimes even this seems infeasible due to fine, complicated structures in the image such as hair. Working directly in image space can solve, or at least ease, most of the aforementioned problems, as the task changes from a 3D reconstruction problem to a 2D correspondence problem. If perfect correspondences are found between every pixel of two or more images, morphing techniques can create the impression of a real moving camera for the human observer, and time and space can be treated equally in a common framework. While this enforces some constraints, e.g., limiting the possible camera movement to the camera hull, it also opens up new possibilities: easier acquisition and rendering of more complex scenes. Because perceptually plausible motion is interpreted as physically correct motion by a human observer, we can rely on the capabilities of the human visual system to understand the visual input correctly.
It is thus sufficient to focus on the aspects that are important to human motion perception to solve the interpolation problem.

4.1 Image Morphing and Spatial Transformations
Image morphing aims at creating smooth transitions between pairs or arbitrary numbers of images. For simplicity of explanation, we first examine the case of two images. The basic procedure is to warp, i.e., deform, the input images I_1 and I_2 towards each other according to warp functions W_{I_1→I_2}, W_{I_2→I_1} and a time step α ∈ [0, 1], so that in the best case (αW_{I_1→I_2}) ∘ I_1 = ((1−α)W_{I_2→I_1}) ∘ I_2 and vice versa. This optimal warp function can usually only be approximated, so to achieve more convincing results when warping image I_1 towards I_2, one usually also computes the corresponding warp from I_2 towards I_1 and blends the results together. The mathematical formulation is the same as in Eq. (2), only ω is now a global parameter for the whole image.

4.2 Image Deformation Model for Time and View Interpolation
Analyzing properties of the human visual system shows that it is sensitive to three main aspects [49,50,51,52]. These are:
Fig. 3. Comparison of different texturing schemes in conjunction with a number of image-based modelling and rendering (IBMR) approaches. From left to right: ground truth image (where available), band-limited reconstruction [11], Filtered Blending [16], Unstructured Lumigraph Rendering [47], and Floating Textures. The different IBMR methods are (from top to bottom): synthetic data set, Polyhedral Visual Hull Rendering [48], Free-Viewpoint Video [23], SurfCap [19], and Light Field Rendering [7].
1. Edges
2. Coherent motion for parts belonging to the same object
3. Motion discontinuities at object borders

It is therefore important to pay special attention to these aspects for high-quality interpolation.
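The two-image morphing procedure of Section 4.1 can be made concrete with a minimal NumPy sketch. The nearest-neighbor backward warp and the assumption of given dense flow fields are simplifications for illustration, not the authors' renderer:

```python
import numpy as np

def warp(img, flow):
    """Backward warp of img by a dense flow field (nearest neighbor)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    y2 = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return img[y2, x2]

def morph(img1, img2, flow12, flow21, alpha):
    """Symmetric morph: warp I1 by alpha * W_{1->2} and I2 by
    (1 - alpha) * W_{2->1}, then cross-blend (Eq. (2) with a global omega)."""
    a = warp(img1, alpha * flow12)
    b = warp(img2, (1.0 - alpha) * flow21)
    return (1.0 - alpha) * a + alpha * b
```

At α = 0 the morph returns I_1, at α = 1 it returns I_2, and intermediate values give the in-between views.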
Observing our surroundings, we notice that objects in the real world are seldom completely flat, even though many man-made objects are quite flat. However, they can be approximated quite well by flat structures, like planes or triangles, as long as these are small enough. Usually this limit is given by the amount of detail the eye can actually perceive; in computer graphics it is usually set by the screen resolution. If it is assumed that the world consists of such planes, then the relation between two projections of such a 3D plane can be directly described via a homography in image space. Such homographies describe, for example, the relation between a 3D plane seen from two different cameras, the 3D rigid motion of a plane between two points in time seen from a single camera, or a combination of both. Thus, the interpolation between images depicting a dynamic 3D plane can be achieved by a per-pixel deformation according to the homography directly in image space, without the need to reconstruct the underlying 3D plane, motion, and camera parameters explicitly. Only the assumption that natural images can be decomposed into regions, for which the deformation of each element is sufficiently well described by a homography, has to be made, which is surprisingly often the case. Stich et al. [2,3] introduced translets, which are spatially restricted homographies. A translet is therefore described by a 3×3 matrix H and a corresponding image segment. To obtain a dense deformation, they enforce that the set of all translets is a complete partitioning of the image, and thus each pixel is part of exactly one translet; an example can be seen in Figure 4.
Fig. 4. An image (left) and its decomposition into homogeneous regions (middle). Since the transformation estimation is based on the matched edgelets, only superpixels that contain actual edgelets (right) are of interest.
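The translet model, one homography per image segment with every pixel covered by exactly one translet, yields a dense deformation field by applying each segment's homography to its own pixels. A hedged NumPy sketch, where the label map and the per-translet homographies are assumed to be given (e.g., from the segmentation and estimation steps described in the text):

```python
import numpy as np

def dense_deformation(labels, homographies):
    """Dense correspondence field from per-translet homographies.
    labels: (H, W) int array assigning every pixel to exactly one translet;
    homographies: dict mapping a translet label to its 3x3 matrix H.
    Returns (H, W, 2) target positions (x', y') for every source pixel."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N homogeneous
    out = np.zeros((h, w, 2))
    flat = out.reshape(-1, 2)          # view into out
    for label, H in homographies.items():
        mask = (labels == label).ravel()
        p = H @ pts[:, mask]           # apply the translet's homography
        flat[mask] = (p[:2] / p[2]).T  # dehomogenize
    return out
```

Warping the source image along this field realizes the per-pixel deformation described above without any explicit 3D reconstruction.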
The first step in estimating the parameters of the deformation model is to find a set of point correspondences between the images from which the translet transformations can be derived. This may sound contradictory, as we stated earlier that this is the overall goal. However, at this stage we are not yet interested in a complete correspondence field for every pixel. Rather, we are looking for a subset for which the transformation can be more reliably established and which already conveys most of the important information concerning the apparent motion in the image. As it turns out, classic point features such as edges and corners, which have a long history of research in computer vision, are best suited for this task. This is in accordance with human vision, which measures edge and corner features early on.
Using the Compass operator [53], a set of edge pixels, called edgelets, is obtained in both images. The task is now to find for each edgelet in image I_1 a corresponding edgelet in image I_2, and this matching should be as complete a 1-1 matching as possible. This problem can be posed as a maximum weighted bipartite graph matching problem; in other words, one does not simply assign the best match to each edgelet but instead minimizes an energy function to find the best overall solution. Therefore, descriptors for each edgelet need to be established. The shape context descriptor [54] has been shown to perform very well at capturing the spatial context C_shape of edgelets and is robust against the expected deformations. To reduce computational effort and increase robustness in the matching process, only the k nearest-neighbor edgelets are considered as potential matches for each edgelet. One can also assume that edgelets will not move from one end of image I_1 to the other in image I_2, as considerable overlap is always needed to establish a reliable matching. Therefore, an additional distance term C_dist can be added. One prerequisite for the reformulation is that for each edgelet in the first set a match in the second set exists; otherwise completeness cannot be achieved. While this is true for most edgelets, some will not have a correspondence in the other set due to occlusion or small instabilities of the edge detector at faint edges. However, this is easily addressed by inserting a virtual occluder edgelet for each edgelet in the first set. Each edge pixel of the first image is connected by a weighted edge to its possibly corresponding edge pixels in the second image and additionally to its virtual occluder edgelet. The weight or cost function for edgelets e_i in I_1 and e_j in I_2 is then defined as

C(e_i, e_j) = C_dist(e_i, e_j) + C_shape(e_i, e_j)   (5)
where the cost for the shape is the χ² test between the two shape contexts and the cost for the distance is defined as

C_dist(e_i, e_j) = a / (1 + e^(−b·‖e_i − e_j‖))    (6)
with a, b > 0 such that the maximal cost for the Euclidean distance is limited by a. The cost C_occluded to assign an edgelet to its occluder edgelet is user-defined and controls how aggressively the algorithm tries to find a match with an edgelet of the second image. The lower C_occluded, the more conservative the resulting matching will be, as more edges will be matched to their virtual occluder edgelets. Now that the first reliable matches have been found, this information can be used to find good homographies for the translets of both images. But first the spatial support for these translets needs to be established, i.e., the image needs to be segmented into coherent, disjoint regions. From Gestalt theory [55] it is known that for natural scenes these regions share not only a common motion but in general also other properties such as similar color and texture. Superpixel segmentation [56] can be exploited to find an initial partitioning of the image into regions to become translets, based on neighboring pixel similarities. Then, from the matching between the edge pixels of the input images, local homographies for each set of edge pixels in the source image that are within one
Towards Plenoptic Raumzeit Reconstruction
superpixel are robustly estimated using 4-point correspondences and RANSAC [57]. Usually, between 20% and 40% of the computed matches are still outliers, and thus some translets will have wrongly estimated transformations. Using a greedy iterative approach, the neighboring translets with the most similar transformations are merged into one, as depicted in Figure 5, until the ratio of outliers to inliers is lower than a user-defined threshold. When two translets are merged, the resulting translet contains both edgelet sets and has the combined spatial support. The homographies are re-estimated based on the new edgelet set, and the influence of the outliers is again reduced by the RANSAC filtering.
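The global edgelet matching described in the previous section can be sketched as an assignment problem: each edgelet of the first image gets one extra, private virtual-occluder column, and the overall cheapest 1:1 assignment is selected. The following is a minimal sketch with an exhaustive search (fine for tiny examples; a real implementation would use the Hungarian algorithm, e.g. scipy.optimize.linear_sum_assignment); the cost values and C_occluded are illustrative:

```python
import numpy as np
from itertools import permutations

def match_edgelets(cost, c_occluded):
    """Minimum-cost 1:1 edgelet matching with virtual occluders.

    cost       : (n1, n2) array of pairwise costs C(e_i, e_j)
    c_occluded : cost of matching an edgelet to its private occluder
    Returns a list of (i, j) pairs; j = None means 'occluded/unmatched'.
    """
    n1, n2 = cost.shape
    # Extend the cost matrix: column n2+i is the virtual occluder of
    # edgelet i; all other occluder entries are forbidden (infinite).
    full = np.full((n1, n2 + n1), np.inf)
    full[:, :n2] = cost
    full[np.arange(n1), n2 + np.arange(n1)] = c_occluded
    best, best_cols = np.inf, None
    for cols in permutations(range(n2 + n1), n1):  # exhaustive search
        total = full[np.arange(n1), cols].sum()
        if total < best:
            best, best_cols = total, cols
    return [(i, j if j < n2 else None) for i, j in enumerate(best_cols)]
```

A low c_occluded yields a conservative matching, exactly as described above: edgelets prefer their occluder over any doubtful partner.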
Fig. 5. During optimization, neighboring translets with similar transformations are merged into a single translet. After merging, the resulting translet consists of the combined spatial support of both initial translets (light and dark blue) and their edgelets (light and dark red).
In this last step, a transformation towards the other image was established for each pixel in the input images. Assuming linear motion only, the deformation vector d(x) for a pixel x is thus computed as

d(x) = H_t x − x.    (7)

H_t is the homography matrix of the translet t, with x being part of the spatial support of t. However, when only a part of a translet boundary lies at a true motion discontinuity, incorrect discontinuities still produce noticeable artifacts along the rest of the boundary. Imagine, for example, the motion of an arm in front of the body. It is discontinuous along the silhouette of the arm, while the motion at the shoulder changes continuously. We can then resolve the per-pixel smoothing by an anisotropic diffusion [58] on this vector field using the diffusion equation

∂I/∂t = div( g(min(‖∇d‖, ‖∇I‖)) ∇I )    (8)

which depends on the image gradient ∇I and the gradient of the deformation vector field ∇d. The function g is a simple mapping function as defined in [58]. Thus, the deformation vector field is smoothed in regions that have similar color or similar deformation, while discontinuities that are present in both the color image and the vector field are preserved. During the anisotropic diffusion, edgelets that have an inlier match, i.e., that deviate only slightly from the planar model, are considered as boundary conditions of the diffusion process. This results in exact edge transformations, also handles non-linear deformations for each translet, and significantly improves the achieved quality.
4.3 Optimizing the Image Deformation Model
There are three ways to further optimize the image deformation model from the previous section: using motion priors, using coarse-to-fine translet estimation, and using a scale-space hierarchy. Since the matching energy function (Eq. 5) is based on spatial proximity and local geometric similarity, a motion prior can be introduced by pre-warping the edgelets with a given deformation field. The estimated dense correspondences described above can be used as such a prior, so the algorithm described in Section 4.2 can be iterated using the result of the i-th iteration as the input to the (i+1)-th iteration. To overcome local matching minima, a coarse-to-fine iterative approach on the translets can be applied. In the first iteration, the number of translets is reduced until the coarsest possible deformation model with only one translet is obtained. Thus, the underlying motion is approximated by a single perspective transformation. During consecutive iterations, the merging threshold is decreased to allow for more accurate deformations as the number of final translets increases. Additionally, solving on different image resolutions, similar to scale-space [59], further improves robustness. A first matching solution is found on the coarse-resolution images and is then propagated to higher resolutions. Using previous solutions as motion priors significantly reduces the risk of getting stuck in local matching minima. In rare cases, some scenes still cannot be matched sufficiently well automatically. For example, when similar structures appear multiple times in the images, the matching can become ambiguous; this can only be addressed by high-level reasoning. To resolve this, a fallback to manual intervention is necessary. The user can select regions in both images, and the automatic matching is computed again only for the selected subset of edgelets. Due to this restriction of the matching, the correct match is found and used to correct the solution.

4.4 Rendering
Given the pixel-wise displacements from the previous sections, the rendering can be efficiently implemented on graphics hardware to allow for real-time image interpolation. Two problems arise with simple warping at motion discontinuities: fold-overs and missing regions. Fold-overs occur when two or more pixels in the image end up in the same position during warping. This is the case when the foreground occludes parts of the background. Consistent with motion parallax, it is assumed that the pixel moving faster in x-direction is closer to the viewpoint, which resolves this conflict. When, on the other hand, regions become disoccluded during warping, the information of these regions is missing in the image and must be filled in from the other image. During rendering we place a regular triangle mesh over the image plane and use a connectedness criterion to decide whether a triangle should be drawn or not. If the motion of a vertex differs by more than a given threshold from the neighboring vertices, the triangles spanned between these vertices are simply discarded. An example is given in Figure 6.
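The connectedness test for the warping mesh can be sketched as follows, with one mesh vertex per pixel and the quad between four neighboring vertices discarded when their motions disagree; the threshold value is illustrative:

```python
import numpy as np

def connectedness_mask(flow, thresh=3.0):
    """For each quad of four neighbouring vertices, keep it only if
    their motion vectors differ by at most `thresh` pixels.

    flow : (H, W, 2) per-vertex motion field
    Returns an (H-1, W-1) bool mask; False marks discarded quads,
    typically those straddling a motion discontinuity."""
    tl = flow[:-1, :-1]; tr = flow[:-1, 1:]
    bl = flow[1:, :-1];  br = flow[1:, 1:]
    corners = np.stack([tl, tr, bl, br])           # (4, H-1, W-1, 2)
    spread = np.linalg.norm(corners.max(0) - corners.min(0), axis=-1)
    return spread <= thresh
```

Quads failing the test are simply not rasterized; the resulting holes correspond to the disoccluded regions filled in from the other image.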
Fig. 6. Left: Per-vertex mesh deformation is used to compute the forward warping of the image, where each pixel corresponds to a vertex in the mesh. The depicted mesh is shown at a coarser resolution for visualization purposes. Right: The connectedness of each pixel, which is used during blending to avoid a possibly incorrect influence of missing regions.
As opposed to recordings with cameras, rendered pixels at motion boundaries are no longer a mixture of background and foreground color but are either pure foreground or pure background color. In a second rendering pass, the color mixing of foreground and background at boundaries can be modelled using a small selective low-pass filter applied only to the detected motion boundary pixels. This effectively removes the artifacts with minimal impact on rendering speed and without affecting rendering quality in the non-discontinuous regions. As can be seen in Table 1, the proposed algorithm produces high-quality results, e.g., on the Middlebury examples [60].
Table 1. Interpolation, normalized interpolation and angular errors computed on the Middlebury optical flow examples by comparison to ground truth, with results obtained by our method and by other methods taken from Baker et al. [60]

Venus            Int.   Norm. Int.   Ang.
Stich et al.     2.88   0.55        16.24
Pyramid LK       3.67   0.64        14.61
Bruhn et al.     3.73   0.63         8.73
Black et al.     3.93   0.64         7.64
Mediaplayer      4.54   0.74        15.48
Zitnick et al.   5.33   0.76        11.42

Dimetrodon       Int.   Norm. Int.   Ang.
Stich et al.     1.78   0.62        26.36
Pyramid LK       2.49   0.62        10.27
Bruhn et al.     2.59   0.63        10.99
Black et al.     2.56   0.62         9.26
Mediaplayer      2.68   0.63        15.82
Zitnick et al.   3.06   0.67        30.10
The results have been obtained without user interaction. As can be seen, the approach performs best with respect to the interpolation error, and is best or on par with respect to the normalized interpolation error. It is important to point out that, from a perception point of view, the normalized error is less expressive than the unnormalized error, since discrepancies at edges in the image (i.e., large gradients) are dampened. Interestingly, relatively large angular errors are observed with the presented method, emphasizing that the requirements of optical flow estimation and image interpolation are different.
5 Plenoptic Raumzeit Reconstruction
The two previously proposed methods are different approaches towards a semi-complete reconstruction of the plenoptic function. However, neither of them allows for the full degrees of freedom in space and time. In this section we present a complete space-time reconstruction. The idea is to represent the 4D space-time by a 6D scene representation. Each point in the scene is characterized not only by a 3D position but also by a 3D velocity vector, assuming linear motion between two discrete time steps. As we treated space and time similarly in the last section, we treat time and scene movement similarly here.

5.1 Overview
We assume that the input video streams show multiple views of the same scene. Since we aim to reconstruct a geometric model, we expect the scene to consist of opaque objects with mostly diffuse reflective properties and the camera parameters to be given, e.g., by [61]. In a preprocessing step, the sub-frame time offsets between the cameras are determined automatically [62,63]. Our scene model represents the scene geometry as a set of small tangent plane patches. The goal is to reconstruct a tangent patch for each point of the entire visible surface. Each patch is described by its position, normal and velocity vector. The presented algorithm processes one image group at a time, which consists of images chosen by their respective temporal and spatial parameters. The image group timespan is the time interval ranging from the acquisition of the first image of the group to the time the last selected image was recorded. The result of our processing pipeline is a patch cloud. While it is unordered in scene space, we save, for each pixel in the input cameras, a list of patches intersected by the corresponding viewing ray. A visualization of our quasi-dense scene reconstruction is shown in Fig. 7.
Fig. 7. Visualization of reconstructed scenes. The patches are textured according to their reference image. Motion is visualized by red arrows.
5.2 Image Selection and Processing Order
To reconstruct the scene for a given time t, a group of images is selected from the input images. The image group G contains three consecutive images from each camera, where the second image is the one taken closest to t in time.
The acquisition time of an image from camera C_j is determined by the camera time offset c_offset, the camera framerate c_fps, and the frame number m:

time(I_j) = c_offset + m / c_fps

During the initialization step of the algorithm, the processing order of the images is important, and it is favorable to use the center images first. For camera setups where the cameras roughly point at the same scene center, the following heuristic is used to sort the image group in ascending order:

s(I_j) = Σ_{I_i ∈ G} ‖C_j − C_i‖    (9)
where C_i is the position of the camera that acquired image I_i. When at least one camera is static, s(I_j) can evaluate to identical values for different images. Images with identical values s(I_j) are then ordered by the distance of their acquisition time from t.

5.3 Initialization
To reconstruct an initial set of patches, it is necessary to find pixel correspondences within the image group. Due to the assumption of asynchronous cameras, no epipolar geometry constraints can be used to reduce the search region for the pixel correspondence search. We compute a list of interest points for each image I_i ∈ G using the Harris corner detector. The intention is to select points which can be identified across multiple images. A local maximum suppression is performed, i.e., only the strongest response within a small radius is kept. Every interest point is then described by a SURF [64] descriptor. In the following, an interest point together with its descriptor is referred to as a feature. For each image I_i, every feature f extracted from that image is processed serially and potentially transformed into a space-time patch. A given feature f_0 is matched against all features of all I_i ∈ G. The best match from each image is added to a candidate set C. To find a robust subset of C without outliers, a RANSAC-based method is used: First, a set S of m features is randomly sampled from C. Then the currently processed feature f_0 is added to the set S. The size |S| = m + 1 can be varied depending on the input data. For all our experiments we chose m = 6. The sampled features in S are assumed to be evidence of a single surface. Using the constraints from the feature positions and camera parameters, and assuming a linear motion model, a center position c and a velocity v are calculated. The details of the geometric reconstruction are given below. The vectors c and v represent the first two parameters of a new patch P. The next RANSAC step is to determine which features from the original candidate set C consent to the reconstructed patch P. The patch center is reprojected into the images I_i ∈ G, and the distance from the projected position to the feature position in I_i is evaluated. After multiple RANSAC iterations, the largest set
T ⊂ C of consenting features found is selected. For further robustness, additional features f ∈ C \ T are added to T if the patch reconstructed from T′ = T ∪ {f} is consenting on all features in T. After this enrichment of T, the final set of consenting features is used to calculate the position and velocity of the patch P. As reference image, the image from which the original feature was taken is chosen. The surface orientation of P is coarsely approximated by the vector pointing from c to the center of the reference image camera. When the patch has been fully initialized, it is added to the initial patch set. After all image features have been processed, the initial patch generation is optimized and filtered once before the expand and filter iterations start.

Geometric Patch Reconstruction. Input for the geometric patch reconstruction is a list of corresponding pixel positions in multiple images combined with the temporal and spatial position of the cameras. The result is a patch center c and velocity v. Assuming a linear movement of the scene point, its position x(t) at time t is specified by a line

x(t) = c + t · v.    (10)

To determine c and v, a linear equation system is formulated. The line of movement, Eq. 10, must intersect the viewing rays q_i that originate from the camera center C_i and are cast through the image plane at the pixel position where the patch was observed in image I_i at time t_i = time(I_i):

\begin{pmatrix}
\mathrm{Id}_{3\times 3} & t_0 \cdot \mathrm{Id}_{3\times 3} & -q_0 &        & 0 \\
\vdots                  & \vdots                             &      & \ddots &   \\
\mathrm{Id}_{3\times 3} & t_j \cdot \mathrm{Id}_{3\times 3}  & 0    &        & -q_j
\end{pmatrix}
\cdot
\begin{pmatrix} c^T \\ v^T \\ a_0 \\ \vdots \\ a_j \end{pmatrix}
=
\begin{pmatrix} C_0^T \\ \vdots \\ C_j^T \end{pmatrix}    (11)
The variables a_0 to a_j give the scene depth with respect to the camera centers C_0 to C_j and are not needed further. The overdetermined linear system is solved with an SVD solver.

Patch Visibility Model. There are two sets of visibilities associated with every patch P: the set of images V(P) where P might be visible, and the set of images V^t(P) ⊆ V(P) where P is considered truly visible. The two different sets exist to deal with specular highlights or not yet reconstructed occluders. During the initialization process, the visibilities are determined by thresholding a normalized cross correlation. Let ν(P, I) be the normalized cross correlation calculated from the reference image of P to the image I within the patch region; then V(P) = {I | ν(P, I) > α} and V^t(P) = {I | ν(P, I) > β} are determined. The threshold parameters used in all our experiments are α = 0.45 and β = 0.8. The correlation function ν takes the patch normal into account when determining the correlation windows. In order to have an efficient lookup structure for patches later on, we overlay a grid of cells over every image. Every grid cell lists all patches that, when
Fig. 8. Computing cross correlation of moving patches. (a) A patch P is described by its position c, orientation, recording time t_r and its reference image I_r. (b) Positions of sampling points are obtained by casting rays through the image plane (red) of I_r and intersecting them with the plane of P. (c) According to the difference in recording times (t′ − t_r) and the motion v of the patch, the sampling points are translated before they are projected back to the image plane of I′. Cross correlation is computed using the obtained coordinates in image space of I′.
projected to the image plane, fall into the given cell and are considered possibly or truly visible in the given image. The size of the grid cells λ determines the final resolution of our scene reconstruction, as only one truly visible patch per cell in every image is calculated. After the initialization and during the algorithm iterations, the visibility of P is estimated by a depth comparison. All images for which P is closer to the camera than the currently closest patch are added to V(P). The images I ∈ V^t(P), where the patch is considered truly visible, are determined using a similar thresholding as before: V^t(P) = {I | I ∈ V(P) ∧ ν(P, I) > β}. The β in this comparison is lowered with increasing expansion iteration count to cover poorly textured regions.

5.4 Expansion Phase

The initial set of patches is usually very sparse. To incrementally cover the entire visible surface, the existing patches are expanded along the object surfaces. The expansion algorithm processes each patch from the current generation. In order to verify whether a given patch P should be expanded, all images I ∈ V^t(P) where P is truly visible are considered. Given the patch P and a single image I, the patch is projected into the image plane and the surrounding grid cells are inspected. If a cell is found in which no truly visible patch exists yet, a surface expansion of P to that cell is calculated. A viewing ray is cast through the center of the empty cell and intersected with the plane defined by the patch's position at time(I) and its normal. The intersection point is the center position for the newly created patch P′. The velocity and normal of the new patch are initialized with the values from
the source patch P. At this stage, P′ is compared to all other patches listed in its grid cell and is discarded if another similar patch is found. To determine whether two patches are similar in a given image, their positions x_0, x_1 and normals n_0, n_1 are used to evaluate the inequality

|(x_0 − x_1) · n_0| + |(x_1 − x_0) · n_1| < κ.    (12)
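A direct transcription of this similarity test, reading the two terms as absolute out-of-plane offsets measured along each patch's unit normal (an assumption of this sketch, so that offsets of opposite sign cannot cancel):

```python
import numpy as np

def patches_similar(x0, n0, x1, n1, kappa):
    """Eq. (12): two patches are similar if their mutual out-of-plane
    offsets, projected onto each patch's unit normal, stay below kappa."""
    x0, n0, x1, n1 = map(np.asarray, (x0, n0, x1, n1))
    return abs(np.dot(x0 - x1, n0)) + abs(np.dot(x1 - x0, n1)) < kappa
```

Two patches lying in the same plane always pass the test regardless of their in-plane distance, which is exactly what the grid-cell duplicate check needs.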
The comparison value κ is calculated from the pixel displacement of λ pixels in image I and corresponds to the depth displacement which can arise within one grid cell. We usually start with λ ≥ 2 and approach λ = 1 in successive iterations of the algorithm. If the inequality holds, the two patches are similar. Patches that are not discarded are processed further. The reference image of the new patch P′ is set to be the image I in which the empty grid cell was found. The visibility of P′ is estimated by a depth comparison as described in Section 5.3. Because the presence of outliers may result in a too conservative estimate of V(P′), the visibility information of the original patch is added, V(P′) = V(P′) ∪ V(P), before calculating V^t(P′). After the new patch has been fully initialized, it is handed into the optimization process. Finally, the new patch is accepted into the current patch generation if |V^t(P′)| ≥ φ. The least number of images required to accept a patch depends on the camera setup and image type. With increasing φ, less surface can be covered with patches in the outer cameras, since each surface has to be observed multiple times. Choosing φ too small may result in unreliable reconstruction results.

5.5 Patch Optimization
The patch parameters calculated in the initial reconstruction or the expansion are the starting point for a conjugate gradient based optimization. The function ρ to be maximized is a visibility score of the patch. To determine the visibility score, a normalized cross correlation ν(P, I) is calculated from the reference image of P to all images I ∈ V(P) where P is expected to be visible:

ρ(P) = 1 / (|V(P)| + a · |V^t(P)|) · ( Σ_{I ∈ V(P)} ν(P, I) + a · Σ_{I ∈ V^t(P)} ν(P, I) )    (13)
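In code, the score of Eq. (13) might read as follows; this is a sketch in which the ν values are placeholders for the window correlations and the weight a is illustrative:

```python
def patch_score(nu, V, Vt, a=2.0):
    """Visibility score rho(P) of Eq. (13).

    nu : dict mapping image id -> normalized cross correlation nu(P, I)
    V  : iterable of image ids where P might be visible
    Vt : iterable of image ids where P is truly visible (subset of V)
    a  : weight of the reliable, truly visible images (illustrative)
    """
    V, Vt = list(V), list(Vt)
    total = sum(nu[i] for i in V) + a * sum(nu[i] for i in Vt)
    return total / (len(V) + a * len(Vt))
```

The score is a weighted mean of the correlations, so it always stays in the range of the ν values themselves.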
The weighting factor a accounts for the fact that images from V^t(P) are considered reliable information, while images from V(P) \ V^t(P) might not actually show the scene point corresponding to P. The visibility function ρ(P) is then maximized with a conjugate gradient method. To constrain the optimization, the position of P is changed not in three dimensions but in a single dimension representing the depth of P in the reference image. The variation of the normal is specified by two rotation angles, and finally the velocity is left as a three-dimensional vector.

5.6 Filtering
After the expansion step, the set of surface patches possibly contains visual inconsistencies. These inconsistencies fall into three groups: outliers outside
the surface, outliers that lie inside the actual surface, and patches that do not satisfy a regularization criterion. Three distinct filters are used to eliminate the different types of inconsistencies. The first filter deals with outliers outside the surface. To detect an outlier, a support value s and a doubt value d are computed for each patch P. The support is the patch score of Eq. (13) multiplied by the number of images in which P is truly visible, s = ρ(P) · |V^t(P)|. Summing the scores of all patches P′ that are occluded by P gives a measure for the visual inconsistency introduced by P; this is the doubt d. If the doubt outweighs the support, d > s, the patch is considered an outlier and removed. Patches lying inside the surface will be occluded by the patch representing the real surface; therefore, the visibilities of all patches are recalculated as described in Section 5.3. Afterwards, all patches that are not visible in at least φ images are discarded as outliers. The regularization is done with the help of the patch similarity defined in Eq. (12). In the images where a patch P is visible, all c surrounding patches are evaluated. The quotient of the number c′ of patches similar to P in relation to the total number of surrounding patches c is the regularization criterion: c′/c < z. The threshold for the quotient of similarly aligned patches was z = 0.25 in all our experiments. Fig. 9 shows results of the presented approach for both a synthetic and a real-world scenario. The most important characteristics of the scene, such as continuous depth changes on the floor plane or walls as well as depth discontinuities, are well preserved, as can be seen in the depth maps. The small irregularities
Fig. 9. (a) Input views, (b) quasi-dense depth reconstruction and (c) optical flow to the next frame. For the synthetic windmill scene, high-quality results are obtained. When applied to a more challenging real-world scene, the results are still robust and accurate. The conservative filtering prevents expansion into ambiguous regions; e.g., most pixels in the asphalt region of the skateboarder scene are not recovered.
Fig. 10. Reconstruction results from the Middlebury MVS evaluation datasets. (a) Input views. (b) Closed meshes from reconstructed patch clouds. (c) Textured patches. While allowing the reconstruction of all six degrees of freedom (including 3D motion), the dynamic patch reconstruction approach still reconstructs the static geometry faithfully.
where no patch was created stem from the conservative filtering step. Plausible motion is created even for smoothly changing areas such as the wings of the windmill, where the motion decreases towards the center. To demonstrate the static reconstruction capabilities, results obtained from the Middlebury "ring" datasets [65] are shown in Fig. 10. Poisson surface reconstruction [66] was used to create the closed meshes. The static object is retrieved although no prior knowledge about the dynamics of the scene was given, i.e., all six degrees of freedom for reconstruction were used.
6 Discussion and Conclusion
In this paper we presented three different approaches towards complete free-viewpoint video. Our first method, Floating Textures [1], deals with the weaknesses of the classic technique of reconstructing a proxy geometry for discrete timesteps and retexturing it with the input images. To deal with geometric uncertainties, camera calibration errors and visibility problems, Floating Textures warps the input images in the output domain and reweights the warping and blending parameters based on their angular and occlusion distances. This approach is especially appealing for real-time scenarios, such as sports events, where only fast, and therefore imprecise, 3D reconstruction methods can be applied. Our second approach is based on a perceptually plausible image interpolation method [2,3,4]. Though sacrificing full viewpoint control, as the
camera is restricted to move on the camera manifold, this approach has the benefit that space and time can be treated equally, as there is no difference between the two in image correspondence estimation, and high-quality results are possible. If the complete degrees of freedom for space and time interpolation are needed, our third approach is the method of choice [5]. The scene flow plus dynamic surface reconstruction, or, as we would like to call it, plenoptic Raumzeit, is described by small geometric patches representing the scene geometry and corresponding velocity vectors representing the movement over time. The additional degrees of freedom, however, come at the cost of lower rendering quality. Improving the renderings is a fruitful direction for further research. To conclude, image-based rendering systems vary largely in the way they approach the problem of image interpolation for free-viewpoint video, and the decision on which approach to base your application needs to be made early in the development process. Do you have synchronized cameras? Are discrete time steps enough, or do you need a continuous representation of time? Do you need full view control, or is restricted movement enough? We hope that the knowledge provided in this paper helps to make the important design decisions necessary when building a free-viewpoint video system.

Acknowledgements. We would like to thank Jonathan Starck for providing us with the SurfCap test data (www.ee.surrey.ac.uk/CVSSP/VMRG/surfcap.htm) and the Stanford Computer Graphics lab for the buddha light field data set. The authors gratefully acknowledge funding by the German Science Foundation from project DFG MA2555/42.
References

1. Eisemann, M., De Decker, B., Magnor, M., Bekaert, P., de Aguiar, E., Ahmed, N., Theobalt, C., Sellent, A.: Floating Textures. Computer Graphics Forum 27, 409–418 (2008)
2. Stich, T., Linz, C., Wallraven, C., Cunningham, D., Magnor, M.: Perception-motivated Interpolation of Image Sequences. In: Symposium on Applied Perception in Graphics and Visualization, pp. 97–106 (2008)
3. Stich, T., Linz, C., Albuquerque, G., Magnor, M.: View and Time Interpolation in Image Space. Computer Graphics Forum 27, 1781–1787 (2008)
4. Stich, T.: Space-Time Interpolation Techniques. PhD thesis, Computer Graphics Lab, TU Braunschweig, Germany (2009)
5. Klose, F., Lipski, C., Magnor, M.: Reconstructing Shape and Motion from Asynchronous Cameras. In: Proceedings of Vision, Modeling, and Visualization (VMV 2010), Siegen, Germany, pp. 171–177 (2010)
6. Adelson, E.H., Bergen, J.R.: The Plenoptic Function and the Elements of Early Vision. In: Landy, M., Movshon, J.A. (eds.) Computational Models of Visual Processing, pp. 3–20 (1991)
7. Levoy, M., Hanrahan, P.: Light Field Rendering. In: SIGGRAPH, pp. 31–42 (1996)
8. Fujii, T., Tanimoto, M.: Free viewpoint TV system based on ray-space representation. In: SPIE, vol. 4864, pp. 175–189 (2002)
9. Matusik, W., Pfister, H.: 3D TV: A scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes. ACM Transactions on Graphics 23, 814–824 (2004)
10. Wilburn, B., Joshi, N., Vaish, V., Talvala, E.V., Antunez, E., Barth, A., Adams, A., Horowitz, M., Levoy, M.: High performance imaging using large camera arrays. ACM Transactions on Graphics 24, 765–776 (2005)
11. Chai, J.X., Chan, S.C., Shum, H.Y., Tong, X.: Plenoptic Sampling. In: SIGGRAPH, pp. 307–318 (2000)
12. Lin, Z., Shum, H.Y.: On the Number of Samples Needed in Light Field Rendering with Constant-Depth Assumption. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 579–588 (2000)
13. Lin, Z., Shum, H.Y.: A Geometric Analysis of Light Field Rendering. International Journal of Computer Vision 58, 121–138 (2004)
14. Stewart, J., Yu, J., Gortler, S.J., McMillan, L.: A New Reconstruction Filter for Undersampled Light Fields. In: Eurographics Workshop on Rendering, pp. 150–156 (2003)
15. Zwicker, M., Matusik, W., Durand, F., Pfister, H.: Antialiasing for Automultiscopic 3D Displays. In: Eurographics Symposium on Rendering, pp. 107–114 (2006)
16. Eisemann, M., Sellent, A., Magnor, M.: Filtered Blending: A New, Minimal Reconstruction Filter for Ghosting-Free Projective Texturing with Multiple Images. In: Vision, Modeling, and Visualization, pp. 119–126 (2007)
17. Gortler, S.J., Grzeszczuk, R., Szeliski, R., Cohen, M.F.: The Lumigraph. In: SIGGRAPH, pp. 43–54 (1996)
18. Debevec, P.E., Taylor, C.J., Malik, J.: Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach. In: SIGGRAPH, pp. 11–20 (1996)
19. Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEE Computer Graphics and Applications 27, 21–31 (2007)
20. Hornung, A., Kobbelt, L.: Interactive pixel-accurate free viewpoint rendering from images with silhouette-aware sampling. Computer Graphics Forum 28, 2090–2103 (2009)
21.
Goldluecke, B., Cremers, D.: A super-resolution framework for high-accuracy multi-view reconstruction. In: Proc. DAGM Pattern Recognition, Jena, Germany (2009)
22. Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: Towards Internet-scale multi-view stereo. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1434–1441 (2010)
23. Carranza, J., Theobalt, C., Magnor, M., Seidel, H.P.: Free-viewpoint video of human actors. ACM Transactions on Graphics 22, 569–577 (2003)
24. de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM Transactions on Graphics 27(3), 1–10 (2008)
25. Hasler, N., Stoll, C., Sunkel, M., Rosenhahn, B., Seidel, H.P.: A statistical model of human pose and body shape. Computer Graphics Forum 28 (2009)
26. Goldluecke, B., Magnor, M.: Weighted Minimal Hypersurfaces and Their Applications in Computer Vision. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 366–378. Springer, Heidelberg (2004)
27. Vedula, S., Baker, S., Kanade, T.: Image-based spatio-temporal modeling and view interpolation of dynamic events. ACM Transactions on Graphics 24, 240–261 (2005)
28. Aganj, E., Monasse, P., Keriven, R.: Multi-view texturing of imprecise mesh. In: Asian Conference on Computer Vision, pp. 468–476 (2009)
Towards Plenoptic Raumzeit Reconstruction
23
29. Takai, T., Hilton, A., Matsuyama, T.: Harmonized Texture Mapping. In: International Symposium on 3D Data Processing, Visualization and Transmission, pp. 1–8 (2010) 30. Chen, S.E., Williams, L.: View Interpolation for Image Synthesis. In: SIGGRAPH, pp. 279–288 (1993) 31. Mark, W., McMillan, L., Bishop, G.: PostRendering 3D Warping. In: Symposium on Interactive 3D Graphics, pp. 7–16 (1997) 32. Beier, T., Neely, S.: Featurebased Image Metamorphosis. In: SIGGRAPH, pp. 35–42 (1992) 33. Seitz, S., Dyer, C.: Physicallyvalid view synthesis by image interplation. In: IEEE Workshop on Representation of Visual Scenes, pp. 18–26 (1995) 34. Seitz, S., Dyer, C.: View Morphing. In: SIGGRAPH, pp. 21–30 (1996) 35. Horn, B., Schunck, B.: Determining Optical Flow. Artiﬁcial Intelligence 16, 185– 203 (1981) 36. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artiﬁcial Intelligence, pp. 674–679 (1981) 37. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical ﬂow estimation based on a theory for warping. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004) 38. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tvl1 optical ﬂow. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007) 39. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.: Anisotropic HuberL1 optical ﬂow. In: British Machine Vision Conference (2009) 40. Einarsson, P., Chabert, C.F., Jones, A., Ma, W.C., Lamond, B., Hawkins, T., Bolas, M., Sylwan, S., Debevec, P.: Relighting Human Locomotion with Flowed Reﬂectance Fields. In: Eurographics Symposium on Rendering, pp. 183–194 (2006) 41. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Imagebased rendering using imagebased priors. International Journal of Computer Vision 63, 141–151 (2005) 42. 
Mahajan, D., Huang, F.C., Matusik, W., Ramamoorthi, R., Belhumeur, P.: Moving Gradients: A PathBased Method for Plausible Image Interpolation. ACM Transactions on Graphics 28, 1–10 (2009) 43. Linz, C., Lipski, C., Magnor, M.: Multiimage Interpolation based on GraphCuts and Symmetric Optic Flow. In: Vision, Modeling and Visualization, pp. 115–122 (2010) 44. Ballan, L., Brostow, G.J., Puwein, J., Pollefeys, M.: Unstructured videobased rendering: Interactive exploration of casually captured videos. ACM Transactions on Graphics, 1–11 (2010) 45. Segal, M., Korobkin, C., van Widenfelt, R., Foran, J., Haeberli, P.: Fast Shadows and Lighting Eﬀects using Texture Mapping. Computer Graphics 26, 249–252 (1992) 46. Pock, T., Urschler, M., Zach, C., Beichel, R., Bischof, H.: A Duality Based Algorithm for TVL1OpticalFlow Image Registration. In: International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 511–518 (2007) 47. Buehler, C., Bosse, M., McMillan, L., Gortler, S., Cohen, M.: Unstructured Lumigraph Rendering. In: Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, pp. 425–432 (2001) 48. Franco, J.S., Boyer, E.: Exact polyhedral visual hulls. In: British Machine Vision Conference, pp. 329–338 (2003)
24
M. Eisemann, F. Klose, and M. Magnor
¨ 49. Wallach, H.: Uber visuell wahrgenommene Bewegungsrichtung. Psychologische Forschung 20, 325–380 (1935) 50. Reichardt, W.: Autocorrelation, A principle for the evaluation of sensory information by the central nervous system. In: Rosenblith, W. (ed.) Sensory Communication, pp. 303–317. MIT PressWilley, New York (1961) 51. Qian, N., Andersen, R.: A physiological model for motionstereo integration and a uniﬁed explanation of Pulfrichlike phenomena. Vision Research 37, 1683–1698 (1997) 52. Heeger, D., Boynton, G., Demb, J., Seidemann, E., Newsome, W.: Motion opponency in visual cortex. Journal of Neuroscience 19, 7162–7174 (1999) 53. Ruzon, M., Tomasi, C.: Color Edge Detection with the Compass Operator. In: Conference on Computer Vision and Pattern Recognition, pp. 160–166 (1999) 54. Belongie, S., Malik, J., Puzicha, J.: Matching Shapes. In: International Conference on Computer Vision, pp. 454–461 (2001) 55. Wertheimer, M.: Laws of organization in perceptual forms. In: Ellis, W. (ed.) A Source Book of Gestalt Psychology, pp. 71–88. Trubner & Co. Ltd., Kegan Paul (1938) 56. Felzenszwalb, P., Huttenlocher, D.: Eﬃcient GraphBased Image Segmentation. International Journal of Computer Vision 59, 167–181 (2004) 57. Fischler, M., Bolles, R.: Random Sample Consensus. A Paradigm for Model Fitting With Applications to Image Analysis and Automated Cartography. Communications of the ACM 24, 381–395 (1981) 58. Perona, P., Malik, J.: ScaleSpace and Edge Detection using Anisotropic Diﬀusion. Transactions on Pattern Analysis and Machine Intelligence 12, 629–639 (1990) 59. Yuille, A.L., Poggio, T.A.: Scaling theorems for zero crossings. Transactions on Pattern Analyis and Machine Intelligence 8, 15–25 (1986) 60. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A Database and Evaluation Methodology for Optical Flow. In: International Conference on Computer Vision, pp. 1–8 (2007) 61. 
Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. ACM Transactions on Graphics 25, 835–846 (2006) 62. Meyer, B., Stich, T., Magnor, M., Pollefeys, M.: Subframe Temporal Alignment of NonStationary Cameras. In: British Machine Vision Conference (2008) 63. Hasler, N., Rosenhahn, B., Thorm¨ ahlen, T., Wand, M., Gall, J., Seidel, H.P.: Markerless motion capture with unsynchronized moving cameras. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 224–231 (2009) 64. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Surf: Speeded up robust features. Computer Vision and Image Understanding 110, 346–359 (2008) 65. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multiview stereo reconstruction algorithms. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528 (2006) 66. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of the Fourth Eurographics Symposium on Geometry Processing, pp. 61–70 (2006)
Two Algorithms for Motion Estimation from Alternate Exposure Images

Anita Sellent, Martin Eisemann, and Marcus Magnor

Institut für Computergraphik, TU Braunschweig, Germany
Abstract. Most algorithms for dense 2D motion estimation assume pairs of images that are acquired with an idealized, infinitesimally short exposure time. In this work we compare two approaches that use an additional, motion-blurred image of a scene to estimate highly accurate, dense correspondence fields. We consider video sequences that are acquired with alternating exposure times, so that a short-exposure image is followed by a long-exposure image that exhibits motion blur. For both motion estimation algorithms we employ an image formation model that relates the motion-blurred image to the two enframing short-exposure images. With this model we can not only decipher the motion information encoded in the long-exposure image, but also estimate occlusion timings, which are a prerequisite for artifact-free frame interpolation. The first approach solves for the motion in a pointwise least squares formulation, while the second formulates a global, total variation regularized problem. Both approaches are evaluated in detail and compared to each other and to state-of-the-art motion estimation algorithms.

Keywords: motion estimation, motion blur, total variation.
1 Introduction
Estimating the dense motion field between two consecutive images has been a heavily investigated field of computer vision research for decades [1, 2]. To approximate the actual 2D motion field, typically the optical flow between consecutive video frames is estimated. If regarded individually, however, short-exposure images capture no motion information at all. Instead, traditional optical flow methods reconstruct motion indirectly by modeling the image difference. Sampling-theoretic considerations show that this approach is prone to temporal aliasing if the maximum 2D displacement in the image plane exceeds one pixel [3]. To prevent aliasing, multi-scale optical flow methods prefilter the image globally in the image domain because the motion is a priori unknown [3]. This, however, is not the correct temporal filter: high spatial frequencies should be suppressed only in those Fourier domain regions where aliasing actually occurs, i.e., only in the direction of local motion. There exists a simple way to achieve correct temporal prefiltering by exposing the image sensor for an extended period of time [4]. In long-exposure images

D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 25-51, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Alternate exposure imaging: (a) exposure timing diagram of (b) a short-exposure image I1 followed by (c) a long-exposure image IB and (d) another short-exposure image I2
of a moving scene, high image frequencies that can cause aliasing are suppressed only in motion direction. Motion estimation from motion-blurred images is often performed as one step of blind deblurring approaches [5]. In poor lighting conditions, long exposure times are necessary to obtain a reasonable signal-to-noise ratio. Motion in the scene suppresses high image frequencies in motion direction, which deblurring approaches then try to reconstruct by solving a severely ill-posed problem. Alternate exposure imaging combines short-exposure images that capture high-frequency content with long-exposure images that integrate the motion of scene points (Fig. 1). Apart from capturing motion information directly, long-exposure images bear the advantage that occlusion enters into the image formation process: a scene point and its motion contribute to a motion-blurred image exactly for as long as the point is not occluded. Only recently have optical flow algorithms begun to address occlusion [6, 7, 8, 9], assigning occlusion labels per pixel. The moment of occlusion, however, cannot be easily determined from short-exposure images. Two approaches to motion and occlusion estimation from alternate exposure images have been proposed in the literature [10, 11]. They are based on the same image formation model, which is equally valid for occluded and non-occluded points and incorporates occlusion time estimation. Each of these approaches adds different additional assumptions to make motion estimation computationally manageable. To compare the two approaches, we give a detailed description of the assumptions and evaluate them on synthetic as well as on real test scenes.
2 Related Work
The number of articles on optical flow computation is tremendous, which indicates the significance of the problem as well as its difficulty [12, 1, 2]. Related to our work, scale-space approaches [13] and iterative warping [14, 15] obtain reliable optical flow results in the presence of disparities larger than a few pixels. Alternatively, Lim et al. circumvent the problem by employing high-speed camera recordings [16]. None of these approaches, however, considers occlusion.
In contrast, Alvarez et al. determine occlusion masks by calculating forward and backward optical flow and checking for consistency [8]. Areas with large forward/backward optical flow discrepancies are considered occluded and are excluded from further computations. Xiao et al. propose interpolating motion into occluded areas from nearby regions by bilateral filtering [6]. This approach is refined by Sand and Teller [7] in the context of particle video. Xu et al. consider the inflow into a target pixel as an occlusion measure [9] (see also [17]). While explicit occlusion handling is incorporated, the moment of occlusion cannot be determined. The advantages of occlusion handling and occlusion timings for image interpolation are demonstrated by Mahajan et al. [18]. Similar to the alternate exposure approach, they use a path-based image formation model. However, paths are calculated between two short-exposure images based on a discrete optimization framework, yielding only full-pixel accuracy. Motion estimation is also possible from a single, motion-blurred image. Assuming spatially invariant, constant-velocity motion, Yitzhaky and Kopeika determine direction and extent of motion blur via autocorrelation [19]. Their approach was extended to rotational motion by Pao and Kuo [20]. Similarly, Rekleitis obtains locally constant motion by considering the Fourier spectrum of a motion-blurred image [21]. The recent user-assisted approach of Jia [22] and the fully automatic approach of Dai and Wu [23] are both able to estimate constant-velocity motion by formulating a constraint on the alpha channel of the blurred image, shifting the problem from motion estimation to the ill-posed problem of alpha-matte extraction [24]. Motion estimation from a single motion-blurred image is also part of blind image deconvolution approaches [5].
As blind deconvolution is, in general, ill-posed, these approaches are restricted to spatially invariant point spread functions (PSF) [5, 25, 26] or a locally invariant PSF [27, 28]. Other deconvolution approaches use additional images to gain information about the underlying motion as well as about the frequencies suppressed by the PSF: Tico and Vehvilainen use pairs of blurred and noisy images to determine a spatially invariant blur kernel after image registration [29]. Yuan et al. [30] and Lim and Silverstein [31] assume small offsets between the blurred and the noisy image and include them in the spatially invariant blur kernel estimation. Additionally, they use the noisy image to reduce ringing artifacts during deconvolution. The hybrid camera of Ben-Ezra and Nayar acquires a long-exposure image of the scene, while a detector with lower spatial and higher temporal resolution acquires a sequence of short-exposure images to detect the camera motion [32]. A recent extension of the hybrid camera permits the kernel to be a local mixture of predefined basis kernels, which can be handled by modern deblurring methods [33]. The deconvolution approaches of Rav-Acha and Peleg use two motion-blurred images with spatially invariant linear motion blurs in different directions to obtain improved deconvolution results [34, 35]. However, for a dynamic scene and a static camera, different motion-blur directions are hard to obtain. Therefore,
Cho et al. use two cameras for motion blur estimation that are accelerated in orthogonal directions [36]. The motion-from-smear approach of Chen et al. [37, 38] as well as the approaches of Favaro and Soatto [39] and Agrawal et al. [40] instead employ images with different degrees of motion blur, i.e., different exposure times, making different simplifying assumptions about the motion. These assumptions range from constant motion [37] over object-wise constant motion [38, 39] to motion computable from neighboring frames with the same exposure time [40]. Pixelwise varying motion and occlusion are not considered. In our approach, we are interested in recovering high-quality, dense motion fields that may vary from pixel to pixel and that are accurate enough to be used for a broad range of applications. In addition, we are interested in adequate motion estimates also for occluded points and in a well-founded estimate of occlusion timings.
3 Image Formation Model
In order to exploit the information provided by the additional long-exposure image, we need an image formation model that relates the acquired images via a dense 2D motion field. As input, we assume two short-exposure images I_1, I_2 : \Omega \to \mathbb{R}, \Omega \subset \mathbb{R}^2, which are taken before and after the exposure time of a third, long-exposure input image I_B : \Omega \to \mathbb{R}. An image formation model that describes a motion-blurred image B : \Omega \to \mathbb{R} in terms of I_1 and I_2 and the unknown motion was introduced in [10]. It exploits the fact that during the exposure time of the long-exposure image, a certain set of scene points contributes to the color of the motion-blurred image at any point x \in \Omega in the image plane. Assuming that a scene point is visible either in I_1 or in I_2, the model

B(x, p_1, p_2, s) = \int_0^{s(x)} I_1(p_1(x, t)) \, dt + \int_0^{1 - s(x)} I_2(p_2(x, t)) \, dt   (1)
is established. p1 (x, ) : [0, 1] → Ω and p2 (x, ) : [0, 1] → Ω are spatially varying, planar curves on the image plane with p1 (x, 0) = x and p2 (x, 0) = x and s(x) ∈ [0, 1] is the occlusion time that is well deﬁned only at points where occlusion actually takes place. In the case of no occlusion, any choice of s yields the same intensity B(x) and diﬀerentiating with respect to s yields 0 = I1 (p1 (x, s)) − I2 (p2 (x, 1 − s)) ,
(2)
a generalization of the brightness constancy assumption used in optical ﬂow estimation. The image formation also gives rise to the frame interpolation I1 (p1 (x, t)) if t ≤ s(x) (3) It (x) = I2 (p2 (x, 1 − t)) if t > s(x). for intermediate frames It for any t ∈ [0, 1]. This interpolation formulation allows to interpolate occluded and disoccluded points correctly without the need for explicit occlusion detection.
In the image formation model described so far, general motion curves were used. To simplify computations and obtain a parameterization with the minimum number of unknowns, a linear motion model is adopted, so that

p_1(x, t) = x - t \, w_1(x) \quad \text{and} \quad p_2(x, t) = x + t \, w_2(x),   (4)

where w_j : \Omega \to \mathbb{R}^2, w_j(x) = (w_{j,1}(x), w_{j,2}(x))^T for j \in \{1, 2\}. This turns out to be a suitable approximation also for more general types of motion, Sect. 6. Since for many applications a forward or backward motion field is needed, motion curves are warped according to the estimated motion and occlusion parameters to obtain a displacement field for I_1 and I_2, respectively.
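To make the model concrete, Eqs. (1), (3), and (4) can be discretized per pixel with a midpoint rule. The following NumPy sketch is illustrative only and not the authors' implementation; image access is nearest-neighbor for brevity, and the sample count `n` is an arbitrary choice. The second integral of Eq. (1) is substituted via t -> 1 - t so that a single loop over t in [0, 1] covers both terms:

```python
import numpy as np

def blur_model(I1, I2, w1, w2, s, x, y, n=32):
    """Discretized Eq. (1) at pixel (x, y) with linear paths, Eq. (4).

    I1, I2 : 2-D gray-value images
    w1, w2 : forward/backward motion vectors (dx, dy) at this pixel
    s      : occlusion time in [0, 1]
    """
    def sample(I, px, py):
        # nearest-neighbor lookup, clamped to the image domain
        h, w = I.shape
        return I[min(max(int(round(py)), 0), h - 1),
                 min(max(int(round(px)), 0), w - 1)]

    b = 0.0
    for k in range(n):                 # midpoint rule on [0, 1]
        t = (k + 0.5) / n
        if t <= s:                     # I1 along p1(x, t) = x - t*w1
            b += sample(I1, x - t * w1[0], y - t * w1[1]) / n
        else:                          # I2 along p2(x, 1-t) = x + (1-t)*w2
            b += sample(I2, x + (1 - t) * w2[0], y + (1 - t) * w2[1]) / n
    return b

def interpolate_frame(I1, I2, w1, w2, s, x, y, t):
    """Frame interpolation of Eq. (3) at a single pixel."""
    def sample(I, px, py):
        h, w = I.shape
        return I[min(max(int(round(py)), 0), h - 1),
                 min(max(int(round(px)), 0), w - 1)]
    if t <= s:
        return sample(I1, x - t * w1[0], y - t * w1[1])
    return sample(I2, x + (1 - t) * w2[0], y + (1 - t) * w2[1])
```

Note how the occlusion time s partitions the exposure interval: before s the pixel integrates the scene point visible in I_1, afterwards the one visible in I_2, which is exactly the case distinction of Eq. (3).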
4 Least Squares Approach
The image formation model for a motion-blurred image B considered in the previous section yields a pointwise error measure for estimates of the motion paths [10]. Given two short-exposure images I_1, I_2 and a long-exposure image I_B, i.e., the actual measurement, we can compare the blurred image I_B to the result B predicted by the model (1):

e(x, w_1, w_2, s) = B(x, w_1, w_2, s) - I_B(x).   (5)
In this distance measure there are 5 unknowns for every pixel x in the image domain. The minimization of e with respect to these variables can have several equally valid solutions; e.g., by letting s = 0 for an unoccluded point, the backward motion path w_2 can be chosen arbitrarily. In the next section we give a first approach that makes the problem computationally manageable by introducing additional assumptions. Different additional assumptions, which give rise to a second approach, are introduced in Sect. 5.

4.1 Additional Assumptions
In order to reduce the number of unknowns in the energy formulation, we first consider a point that is neither occluded nor disoccluded during the exposure interval. It is reasonable to assume that the motion within one object changes only slightly, so that we can approximate the forward and backward paths to be equal, w_1 ≈ w ≈ w_2. For a non-occluded point all occlusion times s are equally valid, so we can additionally evaluate the integral for a fixed sequence 0 ≤ s_1 < ... < s_N ≤ 1. Fixing the occlusion times not only renders the estimation of s superfluous, but also provides us with N equations, each contributing to finding the correct motion path, i.e., the two remaining unknowns per pixel. If a point is occluded, forward and backward motion differ. Thus, optimization under the assumption w_1 ≈ w ≈ w_2 is expected to lead to a comparably high residual. Only for points with a high residual do we assume different forward and backward motion paths. To enable computation of the occlusion time, a crucial
variable for occluded points, the assumption of locally constant motion paths is made, so that the motion information can be inferred from neighboring non-occluded pixels. Applying the above assumptions, we now consider the resulting optimization problem and its solution more specifically. An overview of the algorithm is shown in Fig. 2.
Fig. 2. The workflow of the least squares approach assumes forward and backward motion paths to be symmetric in the first step (select a sequence s_i ∈ [0, 1], initialize w = 0, minimize Eq. (7) on each level of an image pyramid, and reject outliers). Only in the second step is the possibility of occlusion considered for points with a high residual (mark high-residual neighborhoods, assign forward and backward motion paths by superpixel similarity, and optimize for the occlusion time). With the motion paths and occlusion timings, images can be interpolated directly, or traditional motion vector fields for each pixel in the short-exposure images can be determined.
4.2 Pointwise Optimization Problem
With the assumption w_1 ≈ w ≈ w_2, for a fixed sequence 0 ≤ s_1 < ... < s_N ≤ 1 and for each i ∈ {1, ..., N}, we consider the deviation of the measured motion-blurred image from the model value for a given motion path w ∈ \mathbb{R}^2 using the differentiable squared distance

F_i(x, w) = \left( I_B(x) - \int_0^{s_i} I_1(x - t w) \, dt - \int_0^{1 - s_i} I_2(x + t w) \, dt \right)^2.   (6)
If all the assumptions hold exactly, F_i = 0 for the true motion path and for all i ∈ {1, ..., N}. Using different values for s allows us to restrict the solution space for the symmetric motion path w. Increasing the number N of samples for s also increases the amount of computation. We keep N small, e.g., N = 5, and additionally include the differentiated version, Eq. (2), for s = 1/2 as

F_{N+1}(x, w) = \left( I_1(x - \tfrac{1}{2} w) - I_2(x + \tfrac{1}{2} w) \right)^2,

with F_{N+1} = 0 for the true motion path. We now try to find a w ∈ \mathbb{R}^2 that minimizes the pointwise energy

E_{LS}(x, w) = \sum_{i=1}^{N+1} F_i(x, w).   (7)
Dennis and Schnabel [41] describe several numerical methods to solve this nonlinear least squares problem. We use a model-trust region implementation of the well-known Levenberg-Marquardt algorithm because of its robustness and reasonable speed. The path integral over the images is calculated using linear interpolation for the image functions I_1 and I_2. The derivatives of the function F = (F_1, ..., F_{N+1})^T are determined numerically. We solve this nonlinear optimization problem on a multi-scale image pyramid. In order to attenuate the impact of local noise, we iterate the optimization and smooth intermediate results by replacing motion paths that differ by more than 0.25 pixels from the motion paths of the majority of their 8 neighbors with the average motion path of the majority.

4.3 Occlusion Detection
In occluded regions we expect the pointwise energy E_{LS} in Eq. (7) to remain high, as the symmetry of the paths w_1 ≈ w ≈ w_2 is invalid. We therefore mark a pixel and its immediate eight neighbors as possibly occluded if E_{LS} exceeds a threshold T_E. Instead of setting the threshold T_E absolutely, and thus also in dependency on N, we choose a percentage of occluded points, e.g., 10%, and set T_E to the corresponding quantile T_E = Q_{.90} of all optimization residuals in the image. For an occluded/disoccluded pixel, two motion paths and the occlusion time are necessary to describe the gray value in the blurred image. We extrapolate forward and backward motion paths in the occluded regions from neighboring non-occluded regions. Given estimates for the motion paths, we determine the occlusion time on the basis of these estimates. Considering a possibly occluded point, we build two clusters C_a and C_b from the motion paths of probably unoccluded points in a neighborhood with a radius of r = 20 pixels. With the centers of these clusters, we obtain two motion paths. We use superpixel segmentation [42] to determine which motion path is appropriate for which image. Let S_i^x be the superpixel of I_i(x), let S_i^a and S_i^b be the union of superpixels in I_i containing the pixels that contribute to C_a and C_b, respectively, and let d(·, ·) be the superpixel distance also defined in [42]. The superpixel of an occluded point and the superpixels containing the background motion should belong to the same object in the first short-exposure image, and thus the superpixel distance between them is expected to be small or zero. In the second image, the superpixel of the occluded point belongs to the foreground and is therefore expected to be similar or equivalent to the superpixels of the foreground motion in this image. More generally, if the inequality

d(S_i^x, S_i^a) + d(S_j^x, S_j^b) < d(S_i^x, S_i^b) + d(S_j^x, S_j^a)   (8)

holds for i = 1 and j = 2 or for i = 2 and j = 1, we assign the motion of C_a to w_i and that of C_b to w_j. Otherwise, we deduce that the point is not occluded after all and assign the motion path with the smallest residual in Eq. (7).
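The two candidate motion paths for a possibly occluded pixel come from clustering neighboring motion estimates into C_a and C_b. A minimal 2-means sketch (the superpixel-based assignment of Eq. (8) is not included; the initialization and iteration count are arbitrary choices for illustration, not the authors' clustering procedure):

```python
import numpy as np

def two_motion_clusters(paths, n_iter=20):
    """Cluster 2-D motion paths of nearby unoccluded pixels into two
    clusters C_a and C_b and return the two cluster centers.

    Which center becomes w_1 and which becomes w_2 is decided afterwards
    by the superpixel comparison of Eq. (8).
    """
    paths = np.asarray(paths, dtype=float)
    # initialize with the two most dissimilar paths
    d = np.linalg.norm(paths[:, None] - paths[None, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    centers = paths[[i, j]].copy()
    for _ in range(n_iter):
        # assign each path to its nearest center, then recompute the centers
        lab = np.argmin(np.linalg.norm(paths[:, None] - centers[None], axis=2),
                        axis=1)
        for c in (0, 1):
            if np.any(lab == c):
                centers[c] = paths[lab == c].mean(axis=0)
    return centers
```

For the test scene described later (foreground moving 10 px horizontally, background 15 px vertically), such a clustering separates the two motions cleanly.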
Given the motion paths w_1 and w_2, only the occlusion time s remains to be estimated. We minimize

E_s(x, s) = \left( I_B(x) - \int_0^{s} I_1(x - t w_1) \, dt - \int_0^{1 - s} I_2(x + t w_2) \, dt \right)^2   (9)

by a straightforward line search algorithm as described in [43].
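The paper defers to [43] for the line search; a plain grid-search sketch over a discretization of Eq. (9) illustrates the idea (the helper name, midpoint sampling, and grid resolution are assumptions for this sketch):

```python
import numpy as np

def occlusion_time(b_measured, path1_vals, path2_vals, n_s=101):
    """Line search for the occlusion time s minimizing Eq. (9).

    b_measured : measured long-exposure value I_B(x)
    path1_vals : I1 sampled along p1(x, t) at midpoints t_k = (k + 0.5)/n
    path2_vals : I2 sampled along p2(x, t) at the same midpoints
    """
    path1_vals = np.asarray(path1_vals, dtype=float)
    path2_vals = np.asarray(path2_vals, dtype=float)
    n = len(path1_vals)
    t = (np.arange(n) + 0.5) / n
    best_s, best_e = 0.0, np.inf
    for s in np.linspace(0.0, 1.0, n_s):
        # E_s(x, s): first integral over [0, s], second over [0, 1 - s]
        b = (path1_vals[t <= s].sum() + path2_vals[t <= 1.0 - s].sum()) / n
        e = (b_measured - b) ** 2
        if e < best_e:
            best_s, best_e = s, e
    return best_s
```

For a foreground point occluded at s, the blurred value mixes the two path averages in the ratio s : (1 - s), so the quadratic objective has a well-defined minimum in s.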
4.4 Parameter Sensitivity
The above algorithm depends on the choice of the intermediate timings s_i and of the occlusion threshold T_E. We test the parameter sensitivity on the basic test scene square (Fig. 6, first row), where the foreground translates 10 pixels horizontally and the background translates 15 pixels vertically. We evaluate the average angular error (AAE) and the average endpoint error (AEE) between the known ground-truth motion and the displacement fields obtained from the estimated motion paths [2] to measure the impact of the parameters. In the first experiment, we vary the number N of intermediate values for s while keeping all other parameters fixed, i.e., using a 6-level image pyramid, 3 iterations on each scale, and an outlier threshold of 0.25 pixels. To obtain optimal cover for any length of the motion paths, we distribute the s_i equally in the interval [0, 1], i.e., s_i = (i - 1)/(N - 1) for i ∈ {1, ..., N}.
If the number N of the equidistant intermediate values for s is chosen 3 or larger it has only a small inﬂuence on the resulting error, Tab. 1. Also, as the pointwise optimization implementation works with a minimum number of function evaluations, the impact of N on the total computation time is small compared to the total computation time. Apart from determining the number and the spacing of the si , the number N also inﬂuences the weight of the color constancy assumption in FN+1 . As a tradeoﬀ between the equations Fi , i ∈ {1 . . . N } based on the motionblurred image and the equation FN+1 based on the shortexposure images, N = 5 results in the smallest angular error and the smallest endpoint error. In the next experiment, we ﬁx N = 5 and change the number of points that are considered as occluded by setting TE to the corresponding quantile. Considering
Table 2. Fixing the number N = 5 of intermediate values for s, the smallest average angular error (AAE) and the smallest average endpoint error (AEE) are obtained for T_E = Q_{.90}, i.e., when considering 90% of the pixels as non-occluded.

T_E       Q.95  Q.90  Q.85  Q.80  Q.75  Q.70  Q.65  Q.60  Q.55  Q.50
AAE [°]   6.96  6.24  6.90  6.88  6.84  6.83  7.32  7.40  7.75  7.76
AEE [px]  1.96  1.73  1.82  1.80  1.78  1.78  1.91  1.92  2.02  2.04
up to 30% of the pixels as occluded has only a small impact on the AAE and AEE, Tab. 2. Figs. 3c and 3d show occlusion maps for T_E = Q_{.90} and T_E = Q_{.75}, respectively, using the color code of Fig. 3b. While in the first case mainly truly occluded points are assigned an occlusion time, many unoccluded points obtain an occlusion label in the second case; their motion estimate is disabled in the superpixel comparison. Nevertheless, some occluded points are still not detected in the case T_E = Q_{.75}, and their arbitrary motion estimate is considered in the superpixel comparison. By changing the balance from correct motion estimates to arbitrary motion estimates, an occlusion threshold that is too conservative deteriorates the quality of the motion estimation. Still, the interval in which the motion estimation is robust is quite large.
5 Total Variation Approach
Considering the difference between a recorded motion-blurred image and the blurred image predicted by the image formation model gives a pointwise error measure for the path vectors and the occlusion time. As the solution to this problem is not unique for all image points, additional assumptions were introduced in the last section. Yet, these assumptions impose new restrictions on the motion. This section considers different, less restrictive assumptions on the motion paths by exploiting the similarity of path vectors and occlusion times for neighboring pixels [11].

5.1 Additional Assumptions
In natural images, spatially neighboring pixels often belong to the same real-world object and therefore exhibit similar properties such as color, texture, or motion. For the underdetermined pointwise error function, Eq. (5), we can therefore look for the solution of the pointwise problem that is most similar to the solutions of neighboring pixels. We achieve this by adding a regularization term to the pointwise error functional. Regularization is a typical way to estimate solutions of underdetermined problems [44] and is often applied in optical flow estimation to overcome the aperture problem [1, 2]. For image points belonging to the same object, the spatial gradient of the motion field is assumed to be small. Yet, at object boundaries, motion changes abruptly and the spatial gradient of the motion field is large. As demonstrated in previous work [45], using the total variation as a regularizer for flow fields yields promising results. While
Fig. 3. (a) The color map used to display flow fields in this work. (b) Where defined, occlusion timings are encoded with a continuous scale between green for s = 0 and red for s = 1; elsewhere they are set to blue. Evaluating the scene squares (Fig. 6), thresholding the optimization residual for occlusion detection considers mainly truly occluded points (c) for T = Q_{.90} but does not detect all occluded points. (d) Setting T = Q_{.75} considers many non-occluded points as occluded but still does not detect all occluded points.
the total variation of a steep monotonous function and of a smoothly increasing monotonous function with the same endpoints is the same, the customary squared norm of the gradient punishes large deviations from a constant function much more severely than a gradual change (Fig. 4). Total variation regularization of the motion field allows piecewise constant vector fields, which is in accordance with our understanding of only slightly deforming scene objects moving with individual velocities.
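This difference between the two penalties can be checked numerically for the functions shown in Fig. 4, f(x) = arctan(x) and g(x) = 0.1 arctan(10) x. The following NumPy computation is illustrative only; the grid resolution is an arbitrary choice:

```python
import numpy as np

# Discretize the two functions of Fig. 4 on [-10, 10].
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
f = np.arctan(x)                 # steep transition at the origin
g = 0.1 * np.arctan(10.0) * x    # gradual ramp with the same endpoints

def tv(u):
    """Discrete total variation: sum of |u_{k+1} - u_k|."""
    return np.abs(np.diff(u)).sum()

def grad_sq(u):
    """Discrete squared-gradient energy, approximating the integral of u'^2."""
    return ((np.diff(u) / dx) ** 2).sum() * dx

# tv(f) == tv(g): both functions are monotone with identical endpoints,
# so the total variation penalty cannot distinguish them.
# grad_sq(f) >> grad_sq(g): the quadratic penalty strongly punishes
# the sharp edge of f, favoring the gradual ramp g.
```

This is why total variation regularization tolerates the sharp motion discontinuities at object boundaries that a quadratic regularizer would smooth away.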
5.2 Global Optimization Problem
The central part of the optimization problem is, as before, the pointwise comparison of the recorded motion-blurred image IB and the result B predicted by the image formation model. We consider the data term with a robust penalizer φ(x) = √(x² + ε), where ε = 10⁻³, i.e., we consider

G1(x, s, w1, w2) = φ(B(x) − IB(x)).  (10)

Introduced to motion estimation by Black and Anandan [46], robust penalizers like φ are a differentiable version of the absolute value and allow for accurate motion estimation also in the presence of outliers and deviations from the assumptions. As in Sect. 4.2, we also include the differentiated version and consider it as an additional data term

G2(x, s, w1, w2) = φ(I1(x − (1/2)w1) − I2(x + (1/2)w2)).  (11)
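To illustrate why such a penalizer is robust, the following is a minimal sketch of φ(x) = √(x² + ε) with ε = 10⁻³ and its derivative; the exact placement of ε (inside the root, unsquared) is an assumption about the authors' formula, and the function names are ours.

```python
import numpy as np

EPS = 1e-3  # the small constant from the text; its exact role is assumed here

def phi(x):
    """Differentiable approximation of the absolute value."""
    return np.sqrt(x**2 + EPS)

def phi_prime(x):
    """Derivative of phi: bounded by 1 in magnitude, so a single large
    residual gets a constant, not a growing, weight (unlike x**2)."""
    return x / np.sqrt(x**2 + EPS)
```

Far from zero, phi(x) behaves like |x| (e.g. phi(2.0) ≈ 2.0), while near zero it stays smooth, which is what makes gradient-based minimization possible.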
Integrating the weighted sum of the pointwise errors over the image domain Ω, we obtain the data term

Edata(s, w1, w2) = ∫Ω G1(x, s, w1, w2) + γ G2(x, s, w1, w2) dx  (12)
with γ ≥ 0. Regularizing both path vectors as well as the occlusion time with their total variation results in the ﬁnal energy functional
Motion Estimation from Alternate Exposure Images
35
[Plot of f(x) = arctan(x) and g(x) = 0.1 · arctan(10) · x for x ∈ [−10, 10]; both functions share the same endpoints.]
Fig. 4. The total variation of the steep function f and the continuously increasing function g are equal. The squared value of the gradient of g is much smaller than the squared value of the gradient of f . Thus total variation regularization models the assumption of objectwise smooth motion ﬁelds better than regularization with the squared value of the gradient.
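The argument of Fig. 4 can be checked numerically: on [−10, 10] the steep f(x) = arctan(x) and the gradual g(x) = 0.1 · arctan(10) · x share endpoints, so their discrete total variation agrees while the squared gradient penalizes f much more. A small sketch (discretization choices are ours):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)
f = np.arctan(x)                    # steep transition around 0
g = 0.1 * np.arctan(10.0) * x       # gradual ramp with the same endpoints

def total_variation(u):
    # discrete TV: sum of absolute finite differences
    return np.sum(np.abs(np.diff(u)))

def squared_gradient(u, dx):
    # discrete approximation of the integral of |u'|^2
    return np.sum(np.diff(u) ** 2) / dx

dx = x[1] - x[0]
print(total_variation(f) - total_variation(g))            # ~0: TV cannot tell them apart
print(squared_gradient(f, dx) / squared_gradient(g, dx))  # > 1: f is penalized much more
```

Since both functions are monotonic with equal endpoints, their total variations are exactly equal, so a TV regularizer is indifferent between a sharp motion boundary and a smooth transition, whereas the squared gradient strongly prefers the smooth one.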
ETV(s, w1, w2) = ∫Ω G1 + γ G2 + α Σ_{i=1}^{2} (‖∇w1,i‖ + ‖∇w2,i‖) + β ‖∇s‖ dx  (13)

where α, β > 0 are two free parameters of the approach. This energy functional interconnects the pointwise error measures given by G1 and G2 via the regularization terms, so that a global minimization now has to be performed. The absolute value considered in the total variation is not differentiable, and we therefore adopt a minimization scheme that is presented in the next section.

5.3 TV-L1 Minimization
Our minimization scheme is based on the primal-dual algorithm used for TV-L1 optical flow [45]. We briefly review the method here and show in the next section how we adapt the framework to minimize our more complex energy functional. For the very general case of minimizing a total variation energy of the form

E(u) = ∫Ω λ ψ(ρ(u)) + Σ_{i=1}^{k} ‖∇ui‖ dx  (14)

where λ > 0 is a constant, ψ : R → R⁺ a scalar function, u = u(x) = (u1, . . . , uk) a k-dimensional function on the domain Ω, and ρ(u) = ρ(u, x) a pointwise error term, an auxiliary vector field v = v(x) = (v1, . . . , vk) on Ω is introduced and the approximation

Eθ(u, v) = ∫Ω λ ψ(ρ(v)) + (1/2θ) ‖u − v‖² + Σ_{i=1}^{k} ‖∇ui‖ dx  (15)
is considered instead. If θ is small, v will be close to u near the minimum, and thus E will be close to Eθ. The key result of [45] is that Eq. (15) can be minimized very efficiently using an alternating scheme that iterates between solving a global minimization problem for each ui, keeping v fixed,

argmin_{ui} ∫Ω (1/2θ)(ui − vi)² + ‖∇ui‖ dx,  (16)

and a minimization problem for v with fixed u,

argmin_{v} ∫Ω λ ψ(ρ(v)) + (1/2θ) ‖u − v‖² dx,  (17)

which can be solved pointwise. Details and a proof of convergence can be found in [45,47]. Eq. (16) searches for a differentiable scalar field ui that is on the one hand close to the fixed field vi but on the other hand has small total variation. Chambolle has introduced a very elegant, quickly computable, and globally convergent solution to this problem, which we also employ in our minimization framework [47]. In Eq. (17) we use the alternate exposure image formation model and its differentiated version as data term ρ(v). In the next section we show in more detail how we employ the minimization scheme in our framework.

5.4 Implementation

In our case, we employ some small modifications adapted to our problem of minimizing the energy in terms of w1, w2 and s. First, we employ the above
Initialize w1 = 0, w2 = 0, s = 0.5
For each level of the image pyramid
    For a number of warps
        Compute error from current estimates
        For each unknown w1, w2, s
            For a number of iterations
                Solve pointwise problem, Eq. (17)
                Solve denoising problem, Eq. (16)
Outputs: motion fields; frame interpolation

Fig. 5. The workflow of the total variation approach determines forward and backward motion paths and occlusion times iteratively
scheme, i.e., iterating between Eq. (16) and Eq. (17), by considering u = w1, u = w2 or u = s, respectively, to solve for each of the unknowns given a fixed approximation of the others. As the thresholding scheme of [45] is not directly applicable to our nonlinear data term, we apply a descent scheme for Eq. (17), profiting from the use of the differentiable function φ. To speed up convergence, we implemented the algorithm on a scale pyramid with factor 0.5, initializing with s = 0.5 for the occlusion timing and w1 = w2 = 0 on the coarsest level. On each level of the pyramid we compute the remaining error with the current estimates and use this error to solve for s, w1 and w2. For each variable an instance of Eq. (16) and Eq. (17) has to be solved (Fig. 5). For Eq. (16), we employ the dual formulation detailed in [45], Proposition 1, using 5 iterations and a time step of τ = 1/8. For all experimental results with the total variation algorithm we use a 5-level image pyramid, 10 error-update iterations and 10 iterations to solve Eq. (17) and Eq. (16). Suitable values for the parameters α, β, γ and θ were found experimentally. For normalized intensity values we found θ ∈ (0, 1], α, β ∈ (0, 0.1] and γ ∈ [0, 0.5] to be suitable value ranges. An evaluation of the sensitivity of the algorithm to the parameter choice was performed in [4] and showed that the algorithm yields high-quality results quite independently of the actual parameter values. Working on the 320 × 225 pixel test scene square (Fig. 6), the computation time of 191 seconds on a 3.06 GHz processor is independent of the parameters.
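The denoising subproblem Eq. (16) admits Chambolle's dual projection algorithm with step size τ = 1/8. The following is a minimal NumPy sketch, not the authors' implementation: the forward/backward-difference discretization, Neumann boundaries, the function names, and the test image are all our own choices.

```python
import numpy as np

def grad(u):
    # forward differences with Neumann (zero-gradient) boundary
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    # backward differences: the negative adjoint of grad
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]; dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]; dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]; dy[1:-1, :] = py[1:-1, :] - py[:-2, :]; dy[-1, :] = -py[-2, :]
    return dx + dy

def tv_denoise(v, theta=0.1, tau=1.0 / 8.0, iters=100):
    """Chambolle-style projection for argmin_u (1/(2*theta))*(u - v)^2 + TV(u)."""
    px = np.zeros_like(v); py = np.zeros_like(v)
    for _ in range(iters):
        gx, gy = grad(div(px, py) - v / theta)
        norm = 1.0 + tau * np.sqrt(gx**2 + gy**2)
        px = (px + tau * gx) / norm
        py = (py + tau * gy) / norm
    return v - theta * div(px, py)

# a noisy step image is pushed toward a piecewise constant result
v = np.zeros((32, 32)); v[:, 16:] = 1.0
v_noisy = v + 0.1 * np.random.default_rng(0).normal(size=v.shape)
u = tv_denoise(v_noisy, theta=0.1)
```

Because TV tolerates the step edge while flattening small oscillations, the noise is removed but the discontinuity survives, which is exactly the behavior motivated by Fig. 4.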
6 Comparison of Different Motion Estimation Algorithms
To evaluate motion field estimation with alternate exposure imaging we consider synthetic test data as well as real-world recordings. For synthetic scenes with known ground-truth motion fields we estimate motion fields with our algorithms as well as with related approaches [16,7,45]. We interpolate intermediate frames using the estimated motion paths and occlusion timings and compare them to ground-truth images and to images interpolated with ground-truth motion. We also show results for real-world recordings. The recordings were made with a Point Grey Flea2 camera that is able to acquire short- and long-exposure images alternately.

6.1 Motion Fields for Synthetic Test Scenes
We consider synthetic test scenes containing different kinds of motion. The scene square (Fig. 6, first row) combines 10 pixels per time unit of horizontal translational motion of the square with 15 pixels per time unit of vertical motion of the background on a 225 × 320 pixels image. The 300 × 380 pixels scene Ben (Fig. 6, second row) contains 14 pixels per time unit of translational motion in front of a static background. The scene windmill (Fig. 6, third row) contains 7° per time unit of rotational motion approximately parallel to the image plane in front of a static background on 800 × 600 pixels images. In the 512 × 512 pixels images of the wheel scene
(Fig. 6, fourth row) the wheel in the background rotates by 7° per time unit while the foreground remains static. The challenge of the 800 × 600 pixels images in the scene corner (Fig. 6, fifth row) is out-of-plane rotation of 10° around an axis parallel to the vertical image dimension, while the 320 × 240 pixels images of the scene fence (Fig. 6, sixth row) contain translational motion of the same extent as the moving object's width. To obtain the motion-blurred image IB we render and average 220–500 images. The first and the last rendered image represent the short-exposure images I1 and I2. Ground-truth 2D motion is determined from the known 3D scene motion. First of all, we test our pointwise least squares approach from Sect. 4 and the total variation approach from Sect. 5 on the synthetic datasets. We compare the results to state-of-the-art optical flow algorithms [16,7,45]. For a fair comparison, we also provide the competing optical flow algorithms with the short-exposure image I1.5, depicting the scene halfway between I1 and I2. We estimate the motion fields between I1 and I1.5 as well as between I1.5 and I2. The two results are then concatenated before comparing them to the ground-truth displacement field. As optical flow works best for small displacements [16], the error of the concatenation is considerably smaller than that of estimating the motion field between I1 and I2 directly. We choose the algorithm of Zach et al. [45] because it relies on the same mathematical framework as our total variation approach; however, our method uses a long-exposure image instead of a higher frame rate of short-exposure images. We also compare to the algorithm of Sand and Teller [7] on three images, because both our methods and their approach consider occlusion effects while estimating motion. As our algorithms are based on signal-theoretical ideas to prevent temporal aliasing, we incorporate a comparison to the algorithm of Lim et al. [16] that requires high-speed recordings as input.
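Concatenating the two half-interval flow fields requires sampling the second field at the positions reached by the first. A sketch of this composition (nearest-neighbour sampling and the function name are our illustrative choices, not the evaluation code used in the paper):

```python
import numpy as np

def concat_flows(flow_a, flow_b):
    """Compose flow_a (I1 -> I1.5) with flow_b (I1.5 -> I2):
    total(x) = flow_a(x) + flow_b(x + flow_a(x)).
    Each flow has shape (h, w, 2) holding (dx, dy) per pixel."""
    h, w = flow_a.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # sample flow_b where flow_a lands (nearest neighbour, clamped to the image)
    xt = np.clip(np.round(xs + flow_a[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_a[..., 1]).astype(int), 0, h - 1)
    return flow_a + flow_b[yt, xt]

# two constant translations compose to their sum
fa = np.full((8, 8, 2), 1.0)
fb = np.full((8, 8, 2), 2.0)
total = concat_flows(fa, fb)  # every vector becomes (3.0, 3.0)
```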
We simulate the high-speed camera with intermediate images such that motion between two frames is smaller than 1 pixel. Tab. 3 shows that the total variation algorithm has the smallest average angular error (AAE) for all test scenes. Also, in all test scenes except for the rotational motion parallel to the image plane of the scenes windmill and wheel,

Table 3. Comparison of different motion estimation methods for six synthetic test scenes: the motion fields computed using the total variation approach to alternate exposure imaging (AEI) consistently yield a smaller average angular error (AAE) than the least squares approach and competitive optical flow algorithms given three images [45,7] or sequences of temporally oversampled images [16]

AAE [°]                Ben   square  windmill  wheel  corner  fence
Sand, Teller [7]       8.42   6.48     6.78    13.39   6.40   19.12
Zach et al. [45]       5.81   2.25     4.87     2.59   5.05   19.44
Lim et al. [16]        9.01  12.19    49.63    27.29  38.40   34.17
AEI, least squares     6.31   6.24     8.64     4.19  12.87   34.41
AEI, total variation   4.27   1.70     4.56     2.21   4.57   12.97
Fig. 6. (a) Short-exposure images I1 and (b) motion-blurred images IB were rendered so that (c) the ground-truth motion field is known in each of the scenes (from top to bottom) square, Ben, windmill, wheel, corner and fence. For comparison, motion fields are calculated with several different algorithms. (d) The algorithm of Lim et al. [16] needs a high number of input images and returns noisy motion fields. (e) While the approach of Sand and Teller [7] is prone to oversmoothing, (f) the approach of Zach et al. [45] assigns unpredictable motion fields to occluded points. (g) Spurious assignments at occlusion boundaries and insufficient regularization in textureless regions deteriorate the quality of our least-squares approach. (h) The total variation approach to alternate exposure images shows the most promising motion fields of all approaches.
(e) Sand, Teller [7]   (f) Zach et al. [45]   (g) Least squares   (h) Total variation

Fig. 6. Continued.
Table 4. For the six synthetic test scenes, the average endpoint error (AEE) of the total variation approach to alternate exposure imaging is among the smallest in comparison to competitive optical flow estimation algorithms given three images [45,7] or sequences of temporally oversampled images [16]

AEE [px]               Ben   square  windmill  wheel  corner  fence
Sand, Teller [7]       0.91   5.72     2.95     1.27   2.85    3.36
Zach et al. [45]       0.59   0.62     1.69     0.60   1.27   14.75
Lim et al. [16]        1.46   4.88     7.69     1.82   7.73    5.23
AEI, least squares     0.99   1.73     5.47     1.02   6.30   12.64
AEI, total variation   0.57   0.52     2.16     0.61   0.92    2.62
the total variation algorithm has the smallest average endpoint error (AEE), Tab. 4. The rotation within the image plane directly violates the assumption of linear motion paths in the image formation model, so here the alternate exposure algorithm is outperformed by the TV-L1 optical flow, which does not model the motion paths in the intermediate time between the frames. However, in the corner scene with out-of-plane rotation and severe self-occlusion, the total variation algorithm produces the most accurate motion fields in average angular error as well as in average endpoint error. The least squares approach shows a higher numerical error than the total variation approach in all test cases. Though not competitive with the highly accurate approach of Zach et al. [45], the least squares approach outperforms the anti-aliased approach of Lim et al. [16] in all but the fence scene. In the fence scene the least squares approach fails to assign correct motion to the large occluded areas, as nearly all moving points in the image are occluded or disoccluded between I1 and I2. For the test scenes with planar motion, the least squares algorithm achieves results competitive with the occlusion-aware optical flow algorithm of Sand and Teller [7], while the motion field for the out-of-plane rotation of the corner scene, with its changing motion at the occluded points, is less accurate. Visual comparison of the motion fields (Fig. 6) shows that the small numerical error of the total variation approach is due to several reasons: While the algorithm of Lim et al. [16] returns noisy motion fields (Fig. 6d), the algorithm of Sand and Teller [7] tends to oversmooth motion discontinuities (Fig. 6e). The TV-L1 optical flow algorithm [45] assigns outlier motion vectors to occluded points (Fig. 6f). The quality of the least squares alternate exposure algorithm suffers considerably from noisy motion path detection and spurious motion assignments at non-detected occluded points (Fig. 6g).
In contrast, the total variation approach to alternate exposure imaging stands out due to sharp motion boundaries and appropriate motion assignment at occlusion borders (Fig. 6h). As the explicit occlusion detection of the pointwise approach and the implicit occlusion detection of the global optimization approach are hard to compare visually (Fig. 7), we compare the results via frame interpolation under consideration of occlusion, Eq. (3).
Fig. 7. Shown for the scenes Ben and Ball: Occlusion timings of the least squares approach are determined only where the optimization residual exceeds a threshold (a) and (b). With the total variation approach, occlusion timings are determined for every pixel but are only well-defined at occlusion boundaries (c) and (d). Easier comparison of occlusion timings can be obtained by considering frame interpolation (see Fig. 9).
6.2 Frame Interpolation for Synthetic Test Scenes
We evaluate the estimated motion fields and occlusion timings of alternate exposure imaging in frame interpolation. For comparison we also interpolate intermediate frames between I1 and I1.5 using the method introduced by Baker et al. [2] and using blending of forward- and backward-warped images. Neither of the two methods considers occlusion. We compare the interpolated frames to the ground-truth intermediate images. Fig. 8 gives an overview of the sum of squared differences (SSD) for all test scenes. Note that though the least squares algorithm has a higher AAE/AEE than the optical flow algorithm of Zach et al. [45], the interpolation error for some of the images, e.g. in the scene Ben, is considerably smaller than when using the optical flow algorithm with either of the two interpolation methods. The interpolation with the motion paths from the total variation approach consistently shows better interpolation results than the optical flow based interpolation. Especially for translational motion, both the least squares and the total variation algorithm occasionally obtain a smaller SSD than interpolation with ground-truth motion. This is due to the fact that inaccuracies in the motion fields can be balanced by the successful handling of occlusion boundaries (Fig. 9).

6.3 Real-World Recordings
We also test our methods on real-world recordings. We use the built-in HDR mode of a Point Grey Flea2 camera to alter exposure time and gain between successive frames. By adjusting the gain, we ensure that corresponding pixels of static regions in the short-exposure and long-exposure images are approximately of the same intensity. With the HDR mode we are able to acquire I1, IB and I2 with a minimal time gap between the images. The remaining gap is due to the fixed 30 fps camera frame rate and the readout time of the sensor. As for the synthetic test scenes, we record a number of real test scenes with different challenges. All images are recorded with the same Point Grey Flea2 camera at a resolution of 640 × 480 pixels.
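The gain adjustment between exposures can be thought of as a single multiplicative factor that makes static regions agree in mean intensity. This is a simplification of what the camera does in hardware, and the assumption of a known mask of static pixels is ours:

```python
import numpy as np

def gain_correct(short_img, long_img, static_mask):
    """Scale the long exposure so that static regions match the short exposure
    in mean intensity. A single global gain is an assumption; real sensors
    may additionally need an offset or a nonlinear response correction."""
    gain = np.mean(short_img[static_mask]) / np.mean(long_img[static_mask])
    return long_img * gain

# a long exposure that is 4x brighter is mapped back onto the short exposure
short = np.full((4, 4), 0.5)
long_ = np.full((4, 4), 2.0)
corrected = gain_correct(short, long_, np.ones((4, 4), dtype=bool))
```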
[Plots of the SSD over interpolation time t ∈ [0, 0.5] for the six test scenes Ben, square, windmill, wheel, corner and fence.]
Fig. 8. The sum of squared differences (SSD) between interpolated images and ground-truth images. The dashed green (circled) line shows the SSD for forward interpolation with optical flow [45], while the continuous green (circled) line shows the SSD for forward-backward interpolation using the same optical flow. Red (crossed) dashed and continuous lines indicate the SSD for forward interpolation [2] or forward-backward interpolation, respectively, using ground-truth motion fields. The SSD obtained using least squares optimization for motion paths from alternate exposure imaging is indicated by the dashed blue line (diamonds), and the SSD obtained using total variation regularization for the motion paths is indicated by the continuous blue line (squares).
Fig. 9. (a) Interpolation at t = 0.25 with the method proposed in [2] and (b) blending of forward- and backward-warped images show artifacts at occlusion boundaries even when ground-truth motion fields are used, because occlusion information is not available. (c) Thresholded occlusion detection in the least squares approach to alternate exposure imaging fails to detect occlusion at some boundaries and exhibits remaining artifacts. (d) Interpolation with total variation regularized motion paths and occlusion timings reduces artifacts at occlusion boundaries.
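The forward-backward blending baseline referenced above can be sketched as follows; nearest-neighbour sampling, the absence of any occlusion reasoning, and the function name are our illustrative choices, which is exactly why such a scheme leaves artifacts at motion boundaries:

```python
import numpy as np

def interpolate_frame(img1, img2, flow, t):
    """Blend of warped frames at time t in (0, 1), assuming flow maps
    I1 to I2: I_t(x) = (1-t)*I1(x - t*flow) + t*I2(x + (1-t)*flow)."""
    h, w = img1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x1 = np.clip(np.round(xs - t * flow[..., 0]).astype(int), 0, w - 1)
    y1 = np.clip(np.round(ys - t * flow[..., 1]).astype(int), 0, h - 1)
    x2 = np.clip(np.round(xs + (1 - t) * flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.round(ys + (1 - t) * flow[..., 1]).astype(int), 0, h - 1)
    return (1 - t) * img1[y1, x1] + t * img2[y2, x2]

# with zero flow this degenerates to a plain cross-fade of the two frames
z = np.zeros((8, 8)); o = np.ones((8, 8))
mid = interpolate_frame(z, o, np.zeros((8, 8, 2)), 0.25)  # constant 0.25
```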
The scene juggling (Fig. 10, first row) contains large motion of a small ball that additionally vanishes from the field of view of the camera. To ensure that the short-exposure images contain no or only little motion blur, their exposure time is set to 6.02 ms. However, the camera can only process an image every 33.33 ms. Using only short-exposure images, this would lead to 27.31 ms of unrecorded motion between sharp images. For our method, we record a long-exposure image with an exposure time of 39.65 ms. With our camera setup we measured a remaining gap of 0.48 ms between IB and the succeeding short-exposure image, which is due to the readout time of the sensor and other hardware constraints. IB reduces the gap and provides us with temporally anti-aliased information. The same camera setting was used for the walking scene (Fig. 10, second row), where a person walks by on a street and the leg moves on the order of its width. The scenes model train 1 and 2 (Fig. 10, third and fourth row) are also recorded with the same camera setting. Challenges in these scenes are the moving shadows and the highlight on the wagons, which violate the assumption that motion is the only reason for brightness changes in the scene. To test the flexibility of the approach with respect to different foreground and background motions, the scene tracking (Fig. 10, fifth row) was recorded with a camera following the motion of the person in the foreground, i.e., objects in the background have a relative motion to the camera according to their depth. For the waving scene (Fig. 10, sixth row) we use exposure times of 20.71 ms and 124.27 ms, resulting in measured gaps of 12.45 ms and 0.48 ms, respectively. This scene provides different motions, i.e., the hands moving in opposite directions, the static background, and the occluded texture of the eye. The motion fields estimated with the least squares and the total variation approach are also shown in Fig. 10.
While the motion fields estimated by the least squares approach are mainly dominated by noise, closer inspection shows that in places where motion actually occurs it is often detected correctly, for example the ball flying out of the image in the juggling scene. Only the large, sparsely textured regions in the background do not provide enough information for the pointwise approach, so that any noise in the image can produce pronounced incorrect motion estimates. The results of the total variation approach look more promising. Although the background often provides only little texture, motion is generally estimated correctly. In the walking scene, the total variation approach is not only able to detect the motion of the leg, which moves approximately as far as its width, but also faithfully detects the motion of the hand. In the scenes model train 1 and 2 the total variation approach shows robustness to the moving shadows and the highlights on the last wagon. In the tracking scene both algorithms correctly detect the motion of the dark backpack in front of the dark background, and the total variation algorithm is additionally able to faithfully detect the motion of both hands. In the waving scene, the total variation algorithm is able to cope with the motion and the occluded texture.
Fig. 10. The built-in HDR mode of Point Grey cameras is able to alter exposure time and gain between succeeding frames so that (a) short, (b) long and (c) short exposures can be acquired in succession at comparable brightness and with a minimal temporal gap between frames. Motion fields for the real-world scenes (from top to bottom) juggling, walking, model train 1, model train 2, tracking and waving are estimated with (d) the least squares approach, Sect. 4, and with (e) the total variation approach, Sect. 5.
7 Discussion
In this section we first discuss the advantages and disadvantages of the two approaches to alternate exposure motion estimation, i.e., of the least squares approach, Sect. 4, and the total variation approach, Sect. 5. Then we compare both to optical flow approaches that consider only short-exposure images.

7.1 Comparison of the Two Alternate Exposure Approaches
The least squares approach to alternate exposure imaging is able to estimate motion paths, forward/backward motion ﬁelds and occlusion timings from a set
of three alternate exposure images. It makes some additional assumptions on the motion paths, e.g. symmetry, but requires no further regularization such as a smoothness constraint. The resulting error functional can be evaluated pointwise. Occlusion is detected by thresholding the optimization residual. For occluded pixels, motion paths are inferred based on superpixel comparison, but neighboring pixels are not considered in motion path assignment. Although the assumption of symmetric forward and backward motion paths is actually only satisfied if an object moves parallel to the image plane, we also evaluate the algorithm on test scenes with more complex motion. In some of the synthetic scenes it outperforms some modern optical flow algorithms [16,7] that are designed to handle occlusion or deal with temporal aliasing. As no regularization is necessary and the approach solves ambiguities by additional assumptions, the resulting motion fields seem visually quite noisy but are of reasonable accuracy. In the real-world test scenes, the least squares algorithm turns out to be very susceptible to noise and to inaccuracies in the gain correction of long- and short-exposure images. This is partially due to the pointwise estimation that assigns large motion to noisy pixels, especially in regions with little texture. Additionally, using a squared error term weights every outlier among the N + 1 equations very heavily, occasionally pushing the solution far from the desired one to satisfy the contribution of one noisy pixel. The total variation approach requires regularization to solve the ambiguities of the image formation model for unoccluded points but makes no further assumptions. Considering spatial gradients in the regularization requires solving for the motion paths of all pixels simultaneously, so that a more sophisticated solution framework has to be applied.
Occlusion time estimation is incorporated into the optimization process, so that a separate occlusion detection step is no longer necessary. Due to the regularization, the estimated motion fields look visually more pleasing and a desirable fill-in effect of motion into textureless regions occurs, while oversmoothing is prevented by the choice of the total variation as regularizer. Numerical evaluation for synthetic scenes shows that the estimated motion fields are indeed more accurate than those of comparable state-of-the-art optical flow algorithms [45,7,16]. Due to the implicit occlusion handling, the total variation approach can also deal with objects where every moving pixel is an occluding pixel, a situation like in the fence scene where the least squares approach fails. The images interpolated using the motion paths and occlusion timings of the total variation approach also have more exact occlusion borders than those of the least squares approach, where undetected occlusion borders occasionally corrupt the interpolation. Finally, the total variation approach estimates convincing motion fields also for real-world recordings.

7.2 Limitations and Advantages of Alternate Exposure Imaging in Image-Based Motion Estimation
Motion field estimation from alternate exposure imaging shares some of the limitations inherent to all optical flow methods. As in all purely image-based
methods, motion in poorly textured regions cannot be detected robustly. This can be seen in the black background of the waving scene (Fig. 10). Also common to all optical flow methods, we assume that motion is the only source of change in brightness, excluding highly reflective and transparent surfaces from the calculations. Furthermore, we made the assumption that the short-exposure images are free of motion blur. Practically this is true if motion during the short exposure time is smaller than half a pixel. Image noise is also a common problem in motion estimation. While the least squares approach is indeed susceptible to noise, the use of a suitable penalizer for the data term and the total variation regularization deal with noise successfully. Additionally, for non-occluded points the total variation algorithm can choose the occlusion timing s so that zero-mean noise in the path integral cancels out much better than in the customary comparison of two single pixels. In contrast to most optical flow methods, we are able to include occlusion explicitly in our image formation model. With the total variation approach, arbitrarily large occlusions as well as disocclusions can be handled under the assumption that a scene point changes its state of visibility only once. This assumption on the visibility state implies that, e.g., for a static background point, an occluding object can move at most as far as its width before the background point reappears. Our image formation model works with motion paths instead of displacement fields. While motion paths can theoretically have arbitrary forms, the assumption that they are linear allows for a simple parametrization. Actually, linear motion paths imply that the displacement of all pixels on the path is uniform and of constant speed. But as motion paths are allowed to vary for neighboring pixels, the approach can successfully handle much more complex motions as well.
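The role of linear motion paths in the image formation model can be illustrated in one dimension: a long exposure of a uniformly translating signal is the average of the signal shifted along its linear path. This is a toy sketch under our own assumptions (linear interpolation, clamped borders, no occlusion), not the full per-pixel path model of the paper:

```python
import numpy as np

def blur_along_path(signal, w, n_samples=50):
    """Average a 1-D signal translated linearly by a total displacement of w
    pixels, mimicking a long exposure of a uniformly moving scene."""
    pos = np.arange(signal.size, dtype=float)
    acc = np.zeros_like(signal, dtype=float)
    for k in range(n_samples):
        shift = w * k / (n_samples - 1)
        # sample the signal at x - shift (linear interpolation, clamped ends)
        acc += np.interp(pos - shift, pos, signal)
    return acc / n_samples

edge = np.zeros(40)
edge[20:] = 1.0
blurred = blur_along_path(edge, w=6.0)  # the sharp edge becomes a ramp
```

Averaging along the path is exactly what makes the long exposure temporally anti-aliased: the per-pixel value integrates the motion instead of sampling it at one instant.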
Finally, while recording the alternate exposure sequence, we replace one short-exposure image with a long-exposure image. To show the sequence to a viewer uninterested in motion detection, the long-exposure frame may simply be skipped, or, to ensure a sufficient frame rate, intermediate images can be easily and quite faithfully interpolated with the proposed method.
8 Conclusion
Alternate exposure imaging has been introduced to record anti-aliased motion information as well as high-frequency content of a scene. From an image formation model connecting a long-exposure image with a preceding and a succeeding short-exposure image via motion paths and occlusion information, two algorithms can be derived that estimate motion fields as well as occlusion timings. The first algorithm is able to perform the estimation without the regularization that is usually necessary to solve the aperture problem in optical flow estimation. Although competitive on synthetic data, the lack of regularization makes the pointwise least squares approach susceptible to image noise and gain maladjustment in real-world recordings. In contrast, the total variation approach
is not only more accurate than state-of-the-art optical flow on synthetic scenes, but also shows convincing performance on real-world scenes. Notably, it is able to handle occlusion situations where the state of the art in optical flow, based on two successive images, is destined to fail. In our experiments, we also observed that accuracy of the motion field is not the most important issue for frame interpolation. With our estimated motion fields, which contain some residual error, together with the occlusion timings, we are able to obtain interpolated frames that have a smaller numerical error than interpolation with ground-truth motion. In addition, the interpolated frames also look perceptually convincing, as, in contrast to traditional interpolation, our algorithms are able to reproduce occlusion borders correctly by making use of the estimated occlusion timings.

Acknowledgements. The authors gratefully acknowledge funding by the German Science Foundation through project DFG MA2555/4-1.
References

1. Barron, J., Fleet, D., Beauchemin, S.: Performance of optical flow techniques. IJCV 12(1), 43–77 (1994)
2. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A database and evaluation methodology for optical flow. In: Proc. ICCV, pp. 1–8. IEEE, Los Alamitos (2007)
3. Christmas, W.: Filtering requirements for gradient-based optical flow measurement. TIP 9, 1817–1820 (2000)
4. Sellent, A., Eisemann, M., Goldlücke, B., Cremers, D., Magnor, M.: Motion field estimation from alternate exposure images. TPAMI (to appear)
5. Kundur, D., Hatzinakos, D.: Blind image deconvolution. IEEE Signal Processing Magazine 13, 43–64 (1996)
6. Xiao, J., Cheng, H., Sawhney, H., Rao, C., Isnardi, M.: Bilateral filtering-based optical flow estimation with occlusion detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 211–224. Springer, Heidelberg (2006)
7. Sand, P., Teller, S.: Particle video: Long-range motion estimation using point trajectories. IJCV 80, 72–91 (2008)
8. Alvarez, L., Deriche, R., Papadopoulo, T., Sanchez, J.: Symmetrical dense optical flow estimation with occlusions detection. IJCV 75, 371–385 (2007)
9. Xu, L., Jia, J., Matsushita, Y.: Motion detail preserving optical flow estimation. In: Proc. CVPR, pp. 1293–1300. IEEE, San Francisco (2010)
10. Sellent, A., Eisemann, M., Magnor, M.: Motion field and occlusion time estimation via alternate exposure flow. In: Proc. ICCP. IEEE, Los Alamitos (2009)
11. Sellent, A., Eisemann, M., Goldlücke, B., Pock, T., Cremers, D., Magnor, M.: Variational optical flow from alternate exposure images. In: Proc. VMV, pp. 135–143 (2009)
12. Aggarwal, J., Nandhakumar, N.: On the computation of motion from sequences of images: a review. Proc. of the IEEE 76, 917–935 (1988)
13. Anandan, P.: A computational framework and an algorithm for the measurement of visual motion. IJCV 2, 283–310 (1989)
50
A. Sellent, M. Eisemann, and M. Magnor
Motion Estimation from Alternate Exposure Images
Understanding What We Cannot See: Automatic Analysis of 4D Digital In-Line Holographic Microscopy Data
Laura Leal-Taixé¹,*, Matthias Heydt², Axel Rosenhahn²,³, and Bodo Rosenhahn¹
¹ Leibniz Universität Hannover, Appelstr. 9A, Hannover, Germany
[email protected]
² Applied Physical Chemistry, University of Heidelberg, INF 253, Heidelberg, Germany
³ Institute of Functional Interfaces, Karlsruhe Institute of Technology, Campus Nord, 76344 Eggenstein-Leopoldshafen, Germany
Abstract. Digital in-line holography is a microscopy technique which has attracted increasing attention over the last few years in the fields of microbiology, medicine and physics, as it provides an efficient way of measuring 3D microscopic data over time. In this paper, we present a complete system for the automatic analysis of digital in-line holographic data; we detect the 3D positions of the microorganisms, compute their trajectories over time and finally classify these trajectories according to their motion patterns. Tracking is performed using a robust method which evolves from the Hungarian bipartite weighted graph matching algorithm and allows us to deal with newly entering and leaving particles and to compensate for missing data and outliers. In order to fully understand the behavior of the microorganisms, we make use of Hidden Markov Models (HMMs) to classify four different motion patterns of a microorganism and to separate multiple patterns occurring within a trajectory. We present a complete set of experiments which show that our tracking method has an accuracy between 76% and 91% compared to ground truth data. The obtained classification rates on four full sequences (2500 frames) range between 83.5% and 100%. Keywords: digital in-line holographic microscopy, particle tracking, graph matching, multilevel Hungarian, Hidden Markov Models, motion pattern classification.
1 Introduction

Many fields of interest in biology and other scientific research areas deal with intrinsically three-dimensional problems. The motility of swimming microorganisms such as bacteria or algae is of fundamental importance for topics like pathogen-host interactions [1], predator-prey interactions [1], biofilm formation [2], or biofouling by marine microorganisms [3, 4]. Understanding the motility and behavioral patterns of microorganisms allows us to understand their interaction with the environment and thus to control environmental
* Corresponding author.
D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 52–76, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. (a) The input data, the projections obtained with digital inline holography (inverted colors for better visualization). Sample trajectory in red. (b) The output data we want to obtain from each volume, the classification into four motion patterns, colored according to speed: orientation (1), wobbling (2), gyration (3) and intensive surface probing (4).
parameters to avoid unwanted consequences such as infections or biofouling. To study these effects in 3D, several attempts have been made: tracking light microscopy, capable of tracking one bacterium at a time [5], stereoscopy [6] or confocal microscopy [7]. Berg built a pioneering tracking light microscope, capable of following one bacterium at a time in 3D, which has been used to investigate bacteria like Escherichia coli [5]. Another way of measuring 3D trajectories is stereoscopy, which requires two synchronized cameras [6]. Confocal microscopy has also been used to study the motion of particles in colloidal systems over time; however, the nature of this scanning technique limits the obtainable frame rate [7]. For any of these techniques, in order to draw statistically relevant conclusions, thousands of images have to be analyzed. Nowadays, this analysis is still heavily dependent on manual intervention. Recent work [8] presents a complete vision system for 2D cell tracking, which demonstrates the increasing demand for efficient computer vision approaches in the field of microscopy as an emerging discipline. The search for a nearly automatic analysis of biological images has been studied extensively [9], but most of the work focuses on the position as well as the shape of the particle [10]. Several methods exist for multiple object detection, based on techniques such as Markov Chain Monte Carlo (MCMC) [11], inference in Bayesian networks [12] or the Nash equilibrium of game theory [13]. These have proven useful for tracking a fairly small number of targets, but are less appropriate when the number of targets is very large, as in our case. Statistical methods like Kalman filters [8], particle filters or recursive Bayesian filters [14] are widely used for tracking, but they need a dynamical model of the target, a task that can be challenging depending on the microorganism under study and to which we dedicate the second part of this paper.
In contrast to [14, 8], we do not use the output predictions of the filters to deal with occlusions, but rather use past and future information to complete broken trajectories and detect false alarms. Therefore, we do not need an extra track linking step as in [8]. Furthermore, we deal with 3D trajectories and random
and fast motions which are unsuited for a prediction-based approach. In this work we propose a globally optimal matching solution and not a local one as suggested in [15]. Besides generating motion trajectories from microscopic data, a subsequent classification allows biologists to obtain the desired information from the large image sets in a compact and compressed fashion. Indeed, the classification of motion patterns in biology is a well-studied topic [16], but identifying these patterns manually is a complicated and time-consuming task. Recently, machine learning and pattern recognition techniques have been introduced to analyze such complex movements in detail. These techniques include: Principal Component Analysis (PCA) [17], a linear transformation used to analyze high-dimensional data; Bayesian models [18], which use a graph model and the rules of probability theory to select among different hypotheses; or Support Vector Machines (SVM) [19], which use training data to find the optimum parameters of the model representing each class. A comparison of machine learning approaches applied to biology can be found in [20]. In order to classify biological patterns, we need an approach able to handle time-varying signals. Hidden Markov Models [21] are statistical models especially known for their application in temporal pattern recognition. They were first used in speech recognition and since then HMMs have been extensively applied to vision; applications range from handwritten word recognition [22] and face recognition [23] to human action recognition [24, 25]. In this paper, we present a complete system for the automatic analysis of digital in-line holographic data. This microscopy technique provides videos of a 3D volume and is used to study complex movements of microorganisms. The huge amount of information that we can extract from holographic images makes it necessary to have an automatic method to analyze this complex 4D data.
Our system performs the detection of the 3D positions, tracking of the complete trajectories and classification of motion patterns. For multiple microorganism tracking, we propose a geometrically motivated and globally optimal multilevel Hungarian to compensate for leaving and entering particles, recover from missing data and erase the outliers to reconstruct the whole trajectory of the microorganisms [26]. Afterwards, we focus on the classification of four motion patterns of the green alga Ulva linza with the use of Hidden Markov Models [27]. Furthermore, our system is able to find and separate different patterns within a single sequence. Besides classification of motion patterns, a key issue is the choice of features used to classify and distinguish the involved patterns. For this reason we perform an extensive analysis of the importance of typical motion parameters, such as velocity, curvature and orientation. Our developed system is highly flexible and can easily be extended. Especially for forthcoming work on cells, microorganisms or human behavior, such automated algorithms are of pivotal importance as they allow high-throughput analysis of individual segments in motion data.
2 Detection of 3D Positions

In this section we present the details of digital in-line holography, how this microscopy technique allows us to obtain 3D positions of the microorganisms, as well as the image processing methods used to robustly extract these positions from the images.
2.1 Digital In-Line Holographic Microscopy (DIHM)

Digital in-line holographic microscopy provides an alternative, lensless microscopy technique which intrinsically contains three-dimensional information about the investigated volume. It does not require a feedback control which responds to the motion, and it uses only one CCD chip. This makes the method very straightforward; it can be implemented with a very simple setup as shown in Figure 2.
Fig. 2. Schematic setup for a digital in-line holographic experiment consisting of the laser, a spatial filter to create the divergent light cone, the objects of interest (e.g. microorganisms) and a detector which records the hologram
The holographic microscope requires only a divergent wavefront, which is produced by diffraction of laser light from a pinhole. A CCD chip finally captures the hologram. The holographic microscope setup directly follows Gabor's initial idea [28] and has been implemented for laser radiation by Xu et al. [29]. A hologram recorded without the presence of particles, called the source, is subtracted from each hologram. This is used to reduce the constant illumination background and other artifacts; there are filtering methods [30, 31] to achieve this in case a source image is not readily available. The resulting holograms can then be reconstructed back into real space by a Kirchhoff-Helmholtz transformation [29], shown in Equation (1):

$K(\mathbf{r}) = \int_S d^2\xi \, I(\boldsymbol{\xi}) \, e^{ik\,\mathbf{r}\cdot\boldsymbol{\xi}/|\boldsymbol{\xi}|}$   (1)
The integration extends over the 2D surface of the screen with coordinates ξ = (X, Y, L), where L is the distance from the source (pinhole) to the center of the detector (CCD chip), I(ξ) is the contrast image (hologram) on the screen obtained by subtracting the images with and without the object present, and k is the wave number, k = 2π/λ. As we can see in Figure 3, the idea behind the reconstruction is to obtain a series of stacked XY projections from the hologram image. These projections contain the information at different depth values. From these images, we can obtain the 3 final projections XY, XZ and YZ, as described in [32]. These projections contain the image
Fig. 3. Illustration of the reconstruction process. From the hologram a stack of XY projections is obtained at several depths and from those, the final 3 projections (XY, XZ and YZ) are obtained.
information of the complete observation volume, i.e. from every object located in the light cone between pinhole and detector. The resolution in X and Y is $\delta_{x,y} = \lambda / NA$, where $NA$ stands for the numerical aperture, given by $NA = D/(2L)$, with D the detector's side length. The resolution in the Z direction, which is the direction of the laser, is worse: $\delta_z = \lambda / (2\,NA^2)$. This is because the third dimension, Z, is obtained by a mathematical reconstruction, unlike in confocal microscopy, where the value of every voxel is returned. On the other hand, a confocal microscope takes a long time to return the values of all the voxels in a volume, and is therefore unsuited for tracking at a high frame rate. Using video sequences of holograms, it is possible to track multiple objects in 3D over time at a high frame rate, and multiple spores present in a single frame can be tracked simultaneously [3, 15, 33]. Using this advantage of digital in-line holographic microscopy, a number of 3D phenomena in microbiology have been investigated: Lewis et al. [34] examined the swimming speed of Alexandrium (Dinophyceae), Sheng et al. [35, 36] studied the swimming behavior of predatory dinoflagellates in the presence of prey, and Sun et al. [37] used a submersible device to investigate in situ plankton in the ocean.

2.2 Detection of the Microorganisms

In our sequences we observe the green alga Ulva linza, which has a spherical spore body and four flagella. Since the body scatters most of the light, the particles have a circular shape in the projected images. In order to preserve and enhance the particle shape (see Figure 4(a)) but reduce noise and illumination irregularities of the image (see Figure 4(b)), we apply the Laplacian of Gaussian (LoG) filter which, due to its shape, is a blob detector [38]:

$\mathrm{LoG}(x, y) = -\frac{1}{\pi\sigma^4}\left(1 - \frac{x^2 + y^2}{2\sigma^2}\right)e^{-\frac{x^2 + y^2}{2\sigma^2}}$   (2)
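As a concrete illustration, this blob-detection step might be sketched as follows. This is a simplified version under our own assumptions: the function name, scale list and threshold are hypothetical, and SciPy's `gaussian_laplace` (which computes the LoG of Equation (2) up to sign) stands in for an explicit convolution:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def detect_blobs(projection, sigmas, threshold):
    """Multi-scale LoG blob detection on one projection (sketch).

    Bright circular blobs become maxima of the negated, scale-normalized
    LoG response; pixels above `threshold` are kept as candidates.
    """
    candidates = set()
    for sigma in sigmas:
        response = -sigma**2 * gaussian_laplace(projection.astype(float), sigma)
        ys, xs = np.nonzero(response > threshold)
        candidates.update(zip(xs.tolist(), ys.tolist()))
    return candidates
```

Running the filter at several scales, as described below for the divergent light cone, simply means passing a list of sigmas matched to the magnification at each depth.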
Due to the divergent nature of the light cone, the particles can appear smaller or larger in the projections depending on the z-plane. Therefore, the LoG filter is applied at several scales [38] according to the magnification. Note that the whole algorithm is extremely
Fig. 4. (a) Enhancement of the shape of the microorganisms. (b) Reduction of the noise.
Fig. 5. From the 3D positions obtained at each time frame, we use the method in Section 3 to obtain the full trajectory of each microorganism
adaptable, since we can detect particles with any shape by just changing the filter. After this, we threshold each projection XY, XZ and YZ to obtain the positions of the candidate particles in each image; the final 3D positions (Figure 6, green box labeled "Candidate particles") are determined by cross-checking the information of the three projections. Once we have computed the 3D positions of all the microorganisms in all frames, we are interested in linking these 3D positions in order to find their complete 3D trajectories over time, a problem that is generally called multiple object tracking (see Figure 5).
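The cross-checking of the three projections into 3D candidates might look like the following sketch. It is a simplified version: the pixel tolerance `tol` and the function name are our assumptions, and the inputs are the already-thresholded 2D detections:

```python
def cross_projections(xy_pts, xz_pts, yz_pts, tol=1.0):
    """Combine 2D detections from the XY, XZ and YZ projections into 3D
    candidates: a 3D point is kept only if all three projections agree."""
    candidates = []
    for (x, y) in xy_pts:
        for (x2, z) in xz_pts:
            if abs(x - x2) > tol:
                continue
            # Require a consistent detection in the YZ projection as well.
            if any(abs(y - y2) <= tol and abs(z - z2) <= tol
                   for (y2, z2) in yz_pts):
                candidates.append((x, y, z))
    return candidates
```

Demanding agreement of all three projections suppresses spurious detections that appear in only one view.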
3 Automatic Extraction of 3D Trajectories

In this section we present the complete method to estimate the 3D trajectories of the microorganisms over time. Our algorithm, the Multilevel Hungarian, is a robust method
Fig. 6. Diagram of the algorithm described in Section 3.2
evolved from the Hungarian (Munkres') assignment method, and is capable of dealing with entering and leaving particles, missing data and outliers. The diagram of the method is presented in Figure 6.

3.1 Cost Function and Bipartite Graph Matching

Graph matching is one of the fundamental problems in graph theory and can be defined as follows: given a graph G = (V, E), where E represents its set of edges and V its set of nodes, a matching M in G is a set of pairwise non-adjacent edges, which means that no edges share a common vertex. For our application, we are especially interested in the assignment problem, which consists of finding a maximum weight matching in a weighted bipartite graph. In a general form, the problem can be expressed as: "There are N jobs and N workers. Any worker can be assigned to any job, incurring some cost that varies depending on the job-worker assignment. All jobs must be performed by assigning exactly one worker to each job in such a way that the total cost is minimized (or maximized)." For the subsets of vertices X and Y, we build a cost matrix in which the element C(i, j) represents the weight or cost related to vertex i in X and vertex j in Y. For numerical optimization we use the Hungarian or Munkres' assignment algorithm, a combinatorial optimization algorithm [39, 40] that solves the bipartite graph matching problem in polynomial time. For implementation details on the Hungarian we recommend [41]. Our initial problem configuration is: there are M particles in frame
Table 1. Summary of the advantages and disadvantages of the Hungarian algorithm

Advantages:
– Finds a global solution for all vertices
– Cost matrix is versatile
– Easy to solve; bipartite matching is the simplest of all graph problems

Disadvantages:
– Cannot handle missing vertices (a)
– Cannot handle entering or leaving particles (b)
– No discrimination of matches even if the cost is very high (c)
t1 and N particles in frame t2. The Hungarian will help us find which particle in t1 corresponds to which particle in t2, allowing us to reconstruct their full trajectories in 3D space. Nonetheless, the Hungarian algorithm has some disadvantages which we should be aware of. In the context of our project, we summarize in Table 1 some of the advantages and disadvantages of the Hungarian algorithm. In the following sections, we present how to solve the three disadvantages: (a) is solved with the multilevel Hungarian method explained in Section 3.2, (b) is solved with the IN/OUT states of Section 3.1, and finally a solution for (c) is presented in Section 3.1 as a maximum cost restriction. The cost function C, the key input for the Hungarian algorithm, is created using the Euclidean distances between particles; that is, element C(i, j) of the matrix represents the distance between particle i of frame t1 and particle j of frame t2. With this matrix, we need to solve a minimum assignment problem, since we are interested in matching those particles which are close to each other. Note that it is also possible to include in the cost function other characteristics of the particle, like speed, size or gray level distribution. Such parameters can act as additional regularizers during trajectory estimation.

IN and OUT States. In order to include more knowledge about the environment in the Hungarian algorithm and avoid matches with very high costs, we have created a variation of the cost matrix. In our experiments, particles can only enter and leave the scene by crossing the borders of the field of view (FOV) of the holographic microscope; therefore, the creation and deletion of particles depends on their distance to the borders of the FOV. Nonetheless, the method can be easily extended to situations where trajectories are created (for example by cell division) or terminated (when the predator eats the prey) away from the FOV borders.
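The basic frame-to-frame assignment with a Euclidean cost matrix can be sketched with an off-the-shelf Hungarian solver. This is an illustration with names of our choosing, using SciPy's `linear_sum_assignment` rather than a hand-written Munkres implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frames(pts1, pts2):
    """Match particles of frame t1 to frame t2 by minimizing the total
    Euclidean distance (the minimum assignment problem)."""
    pts1, pts2 = np.asarray(pts1, float), np.asarray(pts2, float)
    # C[i, j] = distance between particle i in t1 and particle j in t2.
    cost = np.linalg.norm(pts1[:, None, :] - pts2[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```

Additional regularizers (speed, size, gray level distribution) would simply be added as further terms of the `cost` matrix.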
As shown in Figure 7, we introduce the IN/OUT states in the cost matrix by adding extra rows and columns. If we are matching the particles in frame f to particles in frame f + 1, we add as many columns as there are particles in frame f and as many rows as there are particles in frame f + 1. This way, all the particles have the possibility to enter/leave the scene. Additionally, this allows us to obtain a square matrix, needed for the matching algorithm, even if the number of particles is not the same in consecutive frames.
Fig. 7. Change in the cost matrix to include the IN/OUT states. Each particle is represented by a different color. The value of each extra element added is the distance between the particle position and the closest volume boundary.

$C_{BB}(i, k) = \begin{cases} \min\left(P_i - \{M_x, m_x, M_y, m_y, M_z\}\right) & 1 \leq i \leq M \text{ and } k > N \\ \min\left(P_k - \{M_x, m_x, M_y, m_y, M_z\}\right) & 1 \leq k \leq N \text{ and } i > M \\ \min\left(C_{BB}(i, 1{:}k-1),\, C_{BB}(1{:}i-1, k)\right) & i > M \text{ and } k > N \end{cases}$   (3)
The cost of the added elements includes the information of the environment by calculating the distance of each particle to the nearest edge of the FOV, as in Equation (3), where M is the number of particles in frame t1, N is the number of particles in frame t2, mx, my, mz are the low borders and Mx, My, Mz are the high borders of each axis. Note that the low border of the z axis is not included, as it represents the surface where the microorganisms might settle and, therefore, no particles can enter or leave from there. If the distance is small enough, the Hungarian algorithm matches the particle with an IN/OUT state. In Figure 8 we consider the simple scenario in which we have 4 particles in one frame and 4 in the next frame. As we can see, there is a particle which leaves the scene through the lower edge and a particle which enters the scene in the next frame from the upper right corner. As shown in Figure 8(a), the Hungarian algorithm finds a wrong matching, since the result is completely altered by the entering/leaving particles. With the introduction of the IN/OUT state feature, the particles are now correctly matched (see Figure 8(b)) and the ones which enter/leave the scene are identified as independent particles.
Fig. 8. Representation of the particles in frame t1 (left) and t2 (right). The lines represent the matchings. (a) Wrongly matched. (b) Correctly matched as a result of the IN/OUT state feature.
Maximum Cost Restriction. Due to noise and illumination irregularities of the holograms, it is common that a particle is not detected in several frames, which means a particle can virtually disappear in the middle of the scene. If a particle is no longer detected, all the matches can be greatly affected. That is why we introduce a maximum cost restriction for the cost matrix, which does not allow matches whose cost is higher than a given threshold V. This threshold is the observed maximum speed of the algae spores under study [32]. The restriction is guaranteed by using the same added elements as the ones used for the IN/OUT states; therefore, the final value of the added elements of the cost matrix is

$C(i, k) = \min(C_{BB}(i, k), V \Delta t)$   (4)
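A sketch of the augmented cost matrix combining the IN/OUT border costs and the V·Δt cap. Names and the exact layout are our assumptions; in particular, the lower-right block of added entries is simply filled with the cap here, and the low z border is excluded as described in the text:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def augmented_cost(pts1, pts2, lo, hi, v_max, dt):
    """(M+N)x(M+N) cost matrix with IN/OUT states: extra rows/columns are
    priced by each particle's distance to the nearest volume border, and
    every added entry is capped at V*dt. `lo`/`hi` are the volume bounds."""
    M, N = len(pts1), len(pts2)
    cap = v_max * dt

    def border_dist(p):
        x, y, z = p
        # Low z border is the settlement surface: no entry/exit there.
        return min(x - lo[0], hi[0] - x, y - lo[1], hi[1] - y, hi[2] - z)

    C = np.full((M + N, M + N), cap)
    for i, p in enumerate(pts1):
        C[i, N:] = min(border_dist(p), cap)   # particle i may leave (OUT)
        for j, q in enumerate(pts2):
            C[i, j] = np.linalg.norm(np.subtract(p, q))
    for j, q in enumerate(pts2):
        C[M:, j] = min(border_dist(q), cap)   # particle j may enter (IN)
    return C
```

Because every added entry is at most V·Δt, any real match more expensive than that loses to an IN/OUT assignment, which enforces the maximum cost restriction.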
Overall, if a particle is near a volume border or cannot be matched to another particle within a reachable distance, it will be matched to an IN/OUT state. This ensures that the resulting matches are all physically possible. Still, if we have missing data and a certain particle is matched to an IN/OUT state, we will recover two trajectories instead of the complete one. In the next section we present a hierarchical solution to recover missing data by extending the matching to the temporal dimension.

3.2 Multilevel Hungarian for Missing Data

If we consider just the particles detected using the thresholding, we see that there are many gaps within a trajectory (see Figure 12(a)). These gaps can be a result of morphing (different object orientations yield different contrast), changes in the illumination, etc. The standard Hungarian is not capable of filling in the missing data and creating full trajectories; therefore, we now introduce a method based on the standard Hungarian that allows us to treat missing data and outliers and create complete trajectories. The general routine of the algorithm, the multilevel Hungarian, is:
– Find the matchings between particles in frames [i − 2 . . . i + 2], so that we know the position of each particle in each of these frames (if present) (Section 3.2).
– Build a table with all these positions and fill the gaps given some strict conditions. Let the algorithm converge until no particles are added (Section 3.2).
– On the same table and given some conditions, erase the outliers. Let the algorithm converge until no particles are deleted (Section 3.2).
The Levels of the Multilevel Hungarian. The multilevel Hungarian takes advantage of the temporal information in 5 consecutive frames and is able to recover from occlusions and gaps in up to two consecutive frames. The standard Hungarian gives us the matching between the particles in frame t1 and frame t2, and we use this to find matchings of the same particle in 5 consecutive frames, [i − 2, . . . , i + 2]. In order to find these matchings, the Hungarian is applied on different levels. The first two levels, represented in Figure 9 by red arrows, are created to find the matching of the particles in the frame of study, frame i. But it can also be the case that a particle is not present in frame i but is present in the other frames. To solve all the possible combinations given this fact, we use Levels 3, 4 and 5, represented in Figure 9 by green arrows.

Fig. 9. Represented frames: [i − 2, i − 1, i, i + 1, i + 2]. Levels of the multilevel Hungarian.
Below we give a detailed description and the purpose of each level of the multilevel Hungarian:
– Level 1: Matches particles in frame i with frames i ± 1.
– Level 2: Matches particles in frame i with frames i ± 2. With the first two levels, we know, for all the particles in frame i, their position in the neighboring frames (if they appear).
– Level 3: Matches particles in frame i − 1 with frame i + 1.
– Level 4: Matches particles in frame i ± 1 with frame i ∓ 2. Levels 3 and 4 solve the detection of matchings when a particle appears in frames i ± 1 and might appear in i ± 2, but is not present in frame i.
– Level 5: Matches particles in frame i ± 1 with frame i ± 2.

Conditions to Add/Delete Particles. Once all the levels have been applied hierarchically, a table with the matching information is created. On one axis we have the number of particles and on the other the 5 frames from [i − 2 . . . i + 2], as shown in Figure 10. To change the table information, we use two iterations, the adding iteration and the deleting iteration, which appear in Figure 6 as blue boxes. During the adding iteration,
we look for empty cells in the table where there is likely to be a particle. A new particle position is added if, and only if, two conditions are met: 1. There are at least 3 particles present in the row. Particles have continuity while noise points do not. 2. It is not the first or last particle of the row. We use this strict condition to avoid the creation of false particle positions or the incorrect elongation of trajectories. If we look at particle 6 of the table in Figure 10. In this case, we do not want to add any particle in frames i − 2 and i − 1, since the trajectory could be starting at frame i. In the case of particle 4, we do not want to add a particle in frame i + 2 because the trajectory could be ending at i + 1. Each iteration repeats this process for all frames, and we iterate until the number of particles added converges. After convergence, the deleting iteration starts and we erase the outliers considered as noise. A new particle position is deleted if, and only if, two conditions are met: 1. The particle is present in the frame of study i. 2. There are less than 3 particles in the same row. We only erase particles from the frame of study i because it can be the case that a particle appears blurry in the first frames but is later correctly detected and has more continuity. Therefore, we only delete particles from which we know the complete neighborhood. Each iteration repeats this process for all frames, and we iterate until the number of particles deleted converges. The resulting particles are shown in Figure 10. i1
Fig. 10. Table with: the initial particles detected by the multilevel Hungarian (green ellipses), the ones added in the adding iteration (yellow squares) and the ones deleted in the deleting iteration (red crosses). In the blank spaces no position has been added or deleted.
Missing Data Interpolation. During the adding iteration, we use the information of the filtered projection to find the correct position of the new particle (Figure 6). For example, if we want to add a particle in frame i − 1, we go to the filtered projections XY, XZ, YZ at t = i − 1, take the position of the corresponding particle at t = i or t = i − 2 and search for the maximum value within a window w. If the position found is already present in the candidate particle list of that frame, we go back to the projection and determine the position of the second maximum value. This allows us to distinguish two particles which are close to each other.
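A minimal sketch of this window search, assuming the filtered projection is a 2D intensity array; the window size `w` and the `occupied` set of already-claimed positions stand in for the paper's candidate list:

```python
import numpy as np

def find_particle(projection, ref, w, occupied):
    """Search the filtered projection near `ref` for a particle center.

    Takes the maximum inside a (2w+1)^2 window around the reference
    position; if that pixel is already claimed by another candidate,
    falls back to the next-brightest pixel (the 'second maximum')."""
    x, y = ref
    x0, y0 = max(0, x - w), max(0, y - w)
    win = projection[x0:x + w + 1, y0:y + w + 1]
    # window pixels sorted by intensity, brightest first
    order = np.argsort(win, axis=None)[::-1]
    for flat in order:
        dx, dy = np.unravel_index(flat, win.shape)
        pos = (x0 + int(dx), y0 + int(dy))
        if pos not in occupied:
            return pos
    return None

proj = np.zeros((5, 5))
proj[2, 2] = 5.0          # brightest pixel
proj[1, 1] = 3.0          # second-brightest pixel
print(find_particle(proj, (2, 2), 1, set()))      # (2, 2)
print(find_particle(proj, (2, 2), 1, {(2, 2)}))   # (1, 1)
```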
L. Leal-Taixé et al.
There are many studies on how to improve the particle depth-position resolution (z-position). As in [42], we use the traditional method of considering the maximum value of the particle as its center. Other, more complex methods [31] have been developed which also deal with different particle sizes, but the flexibility of morphological filtering already allows us to easily adapt our algorithm.

3.3 The Final Hungarian

Once the final particle positions are obtained (in Figure 6, orange box labeled "Final particles"), we perform one last step to determine the trajectories. We use the standard Hungarian to match particles from frame i to frame i + 1.
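This frame-to-frame step is a minimum-cost assignment on Euclidean distances. For clarity the toy version below enumerates permutations instead of running the Hungarian algorithm proper; a real implementation would use a solver such as `scipy.optimize.linear_sum_assignment`:

```python
from itertools import permutations
import numpy as np

def match_frames(pts_a, pts_b):
    """Toy minimum-cost assignment between two equally sized frames.

    Exhaustive (O(n!)) for clarity only; on real data use a proper
    Hungarian/Munkres routine, e.g. scipy.optimize.linear_sum_assignment."""
    cost = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=2)
    n = len(pts_a)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return [(i, int(best[i])) for i in range(n)]

a = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])   # frame i
b = np.array([[9.5, 0.0, 0.0], [0.5, 0.0, 0.0]])    # frame i+1
print(match_frames(a, b))  # [(0, 1), (1, 0)]
```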
4 Motion Pattern Classification

In this section we describe the different types of motion patterns, as well as the design of the complete HMM and the features used for their classification.

4.1 Hidden Markov Models

Hidden Markov Models [21] are statistical models of sequential data widely used in many applications in artificial intelligence, speech and pattern recognition, and the modeling of biological processes. In an HMM it is assumed that the system being modeled is a Markov process with unobserved states. This hidden stochastic process can only be observed through another set of stochastic processes that produce the sequence of symbols O = o1, o2, ..., oM. An HMM consists of a number N of states S1, S2, ..., SN, and the system is in one of these states at any given time. Every HMM can be defined by the triple λ = (Π, A, B):

– Π = {πi} is the vector of initial state probabilities.
– A = {aij} is the state transition matrix, where each transition from Si to Sj occurs with probability aij and Σj aij = 1.
– B = {bik} is the emission matrix, where each state Si generates the output ok with probability bik = P(ok | Si).

There are three main problems related to HMMs:

1. The evaluation problem: for a sequence of observations O, compute the probability P(O | λ) that an HMM λ generated O. This is solved using the Forward-Backward algorithm.
2. The estimation problem: given O and an HMM λ, recover the most likely state sequence S1, S2, ..., SN that generated O. This is solved by the Viterbi algorithm, a dynamic programming algorithm that computes the most likely sequence of hidden states in O(N²T) time.
3. The optimization problem: find the parameters of the HMM λ which maximize P(O | λ) for some output sequence O. A local maximum likelihood can be derived efficiently using the Baum-Welch algorithm.

For a more detailed introduction to HMM theory, we refer to [21].
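As an illustration of problem 2, a textbook Viterbi decoder (not the authors' implementation) can be written in a few lines:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for observation indices `obs`
    under the HMM lambda = (pi, A, B); runs in O(N^2 T) as noted above."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        states.append(int(psi[t, states[-1]]))
    return states[::-1]

pi = np.array([0.9, 0.1])
A = np.array([[0.9, 0.1], [0.1, 0.9]])   # "sticky" two-state chain
B = np.array([[0.9, 0.1], [0.1, 0.9]])   # state k mostly emits symbol k
print(viterbi(pi, A, B, [0, 0, 1, 1, 1]))  # [0, 0, 1, 1, 1]
```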
4.2 Types of Patterns

In our experimental setup we are interested in four patterns shown by the green alga Ulva linza, as depicted in Figure 1(b): Orientation (1), Wobbling (2), Gyration (3) and intensive surface probing or Spinning (4). These characteristic swimming patterns are highly similar to the patterns observed before in [43] for the brown alga Hincksia irregularis.

Orientation. Trajectory 1 in Figure 1(b) is an example of the Orientation pattern. This pattern typically occurs in solution and far away from surfaces. The most important characteristics of the pattern are the high swimming speed (a mean of 150 μm/s) and a straight swimming motion with moderate turning angles.

Wobbling. Pattern 2 is called the Wobbling pattern and its main characteristic is a much slower mean velocity of around 50 μm/s. The spores assigned to this pattern often change their direction of movement and only swim in straight lines for very short distances. Compared to the Orientation pattern, this leads to less smooth trajectories.

Gyration. Trajectory 3 is an example of the Gyration pattern. This pattern is extremely important for the exploration of surfaces, as occasional surface contacts are observable. The behavior in solution is similar to the Orientation pattern. Since in this pattern spores often switch between swimming towards and away from the surfaces, it can be interpreted as a pre-stage to surface probing.

Intensive Surface Probing and Spinning. Pattern 4 involves swimming in circles close to the surface within a very limited region. After a certain exploration time, the spores can either permanently attach, or leave the surface, move to the next position and start swimming in circular patterns again. This motion is characterized by decreased mean velocities of about 30 μm/s in combination with a higher tendency to change direction (see Figure 1(b), case 4).

4.3 Features Used for Classification

An analysis of the features used for classification is presented in this section.
Most of the features are commonly used in motion analysis problems. An intrinsic characteristic of digital inline holographic microscopy is the lower resolution of the z position compared to the x, y resolution [31]. Since many of the following features depend on the depth value, we average the measurements over 5 frames in order to reduce the noise of such features. The four characteristic features used are:

– v, velocity: the speed of the particles is an important descriptive feature, as we can see in Figure 1(b). We use only the magnitude of the speed vector, since the direction is described by the next two parameters. Range is [0, maxSpeed], where maxSpeed is the maximum speed of the particles as found experimentally in [32].
– α, angle between velocities: measures the change in direction, distinguishing stable patterns from random ones. Range is [0, 180].
– β, angle to the normal of the surface: measures how a particle approaches the surface or how it swims above it. Range is [0, 180].
– D, distance to the surface: this can be a key feature to differentiate surface-induced movements from general movements. Range is (mz, Mz], where mz and Mz are the z limits of the volume under study.

In order to work with Hidden Markov Models, we need to represent the features of each pattern with a fixed set of symbols. The total number of symbols depends on the number of symbols used to represent each feature: Nsymbols = Nv Nα Nβ ND. To convert the symbols of the individual features into a unique symbol for the HMM, we use Equation (5), where J is the final symbol we are looking for and J1, ..., J4 are the symbols of the individual features, each Jk in the range [1..NJk], where NJk is the number of symbols for feature k.

J = J1 + (J2 − 1)NJ1 + (J3 − 1)NJ1 NJ2 + (J4 − 1)NJ1 NJ2 NJ3
(5)
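Equation (5) is a mixed-radix encoding, which could be implemented as follows (the function name and tuple layout are ours):

```python
def combined_symbol(j, n):
    """Map per-feature symbols j = (J1, ..., J4), each Jk in 1..n[k],
    to the single HMM symbol J of Equation (5) -- a mixed-radix code."""
    J, scale = 0, 1
    for jk, nk in zip(j, n):
        J += (jk - 1) * scale   # (Jk - 1) weighted by the product of lower radices
        scale *= nk
    return J + 1                # symbols are 1-based

# with Nv = 4, Nalpha = 3, Nbeta = 3, ND = 3 there are 4*3*3*3 = 108 symbols
print(combined_symbol((1, 1, 1, 1), (4, 3, 3, 3)))  # 1
print(combined_symbol((4, 3, 3, 3), (4, 3, 3, 3)))  # 108
```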
In the next sections we present how to use the resulting symbols to train the HMMs. The symbols are the observations of the HMM; therefore, the training process gives us the probability of emitting each symbol in each of the states.

4.4 Building and Training the HMMs

In speech recognition, an HMM is trained for each of the phonemes of a language. Words are then constructed by concatenating several HMMs of the phonemes that form the word; HMMs for sentences can even be created by concatenating HMMs of words, and so on. We take a similar hierarchical approach in this paper: we train one HMM for each of the patterns and then combine them into a unique Markov chain with a simple yet effective design that is able to describe any pattern or combination of patterns. This approach can be used in any problem where multiple motion patterns are present.

Individual HMM Per Pattern. In order to represent each pattern, we build a Markov chain with N states in which the model may only stay in the same state or move one state forward; from state N it can also go back to state 1. The number of states N is found empirically using the training data (we use N = 4 for all the experiments, see Section 5.4). The HMM is trained using the Baum-Welch algorithm to obtain the transition and emission matrices.

Complete HMM. The idea of having a complete HMM that represents all the patterns is that we can classify not only sequences where a single pattern is present, but also sequences where the particle transitions between different patterns. Figure 11(a) shows a representation of the complete model, while the design of the transition matrix is depicted in Figure 11(b). The four individual HMMs for each of the patterns are placed in parallel (blue). In order to deal with the transitions we create two special states: the START and the SWITCH state. The START state is created only to allow the system to begin at any pattern (orange).
We define Pstart = PSwitchToModel = (1 − Pswitch)/NP, where NP is the number of patterns. As START does not contain any information about the pattern, it does not emit any symbol.
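The left-to-right chain described under "Individual HMM Per Pattern" could be initialized as below before Baum-Welch training; the initial value p_stay = 0.5 is an assumption, not a number from the paper:

```python
import numpy as np

def pattern_chain(N, p_stay=0.5):
    """Initial transition matrix of one pattern HMM: each state either
    stays or moves one state forward, and state N loops back to state 1.
    p_stay = 0.5 is an assumed initialization; Baum-Welch re-estimates
    the actual probabilities from the training data."""
    A = np.zeros((N, N))
    for s in range(N):
        A[s, s] = p_stay
        A[s, (s + 1) % N] = 1.0 - p_stay   # forward; the last state wraps to 1
    return A

A = pattern_chain(4)                      # N = 4 as used in the experiments
assert np.allclose(A.sum(axis=1), 1.0)    # every row is a probability distribution
```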
Fig. 11. (a) Complete HMM created to include changes between patterns within one trajectory. (b) Transition matrix of the complete HMM.
The purpose of the new state SWITCH is to make transitions easier. Imagine a given trajectory which makes a transition from Pattern 1 to Pattern 2. While transitioning, the features create a symbol that belongs to neither Pattern 1 nor 2. The system can then go to state SWITCH to emit that symbol and continue to Pattern 2. Therefore, all SWITCH emission probabilities are 1/Nsymbols. Since SWITCH is such a convenient state, we need to impose restrictive conditions so that the system does not enter or stay in SWITCH too often. This is controlled by the parameter Pswitch, set to the minimum value of all the Pmodel minus a small ε. This way, we ensure that Pswitch is the lowest transition probability in the system. Finally, the sequence of states given by the Viterbi algorithm determines the motion pattern observed. Our implementation uses the standard MATLAB HMM functions.
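Putting the pieces together, the complete transition matrix of Figure 11(b) might be assembled as below. The paper specifies the design (parallel pattern chains, START, SWITCH, Pswitch = min Pmodel − ε) but not every matrix entry, so the exact scaling chosen here — which simply keeps every row a probability distribution — is our assumption:

```python
import numpy as np

def complete_model(pattern_As, eps=1e-3):
    """Toy assembly of the complete transition matrix: NP pattern chains
    in parallel plus START and SWITCH. Pswitch is the smallest positive
    within-pattern probability minus eps, as in the text; the remaining
    scaling is an assumption chosen so that all rows stay stochastic."""
    NP, N = len(pattern_As), pattern_As[0].shape[0]
    p_switch = min(A[A > 0].min() for A in pattern_As) - eps
    p_start = (1.0 - p_switch) / NP               # = P_SwitchToModel
    size = NP * N + 2                             # pattern states + START + SWITCH
    T = np.zeros((size, size))
    start, switch = NP * N, NP * N + 1
    for k, A in enumerate(pattern_As):
        o = k * N
        T[o:o + N, o:o + N] = A * (1.0 - p_switch)  # stay within the pattern
        T[o:o + N, switch] = p_switch               # or leave via SWITCH
        T[start, o] = p_start                       # START enters pattern k
        T[switch, o] = p_start                      # SWITCH re-enters pattern k
    T[start, switch] = p_switch                     # keep these rows stochastic too
    T[switch, switch] = p_switch
    return T

A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])     # a 3-state left-to-right chain as a toy pattern
T = complete_model([A, A])
assert np.allclose(T.sum(axis=1), 1.0)
```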
5 Experimental Results

In order to test our algorithm we use 6 sequences (labeled S1 to S6) in which the swimming motion of Ulva linza spores is observed [3]. All the sequences have some particle positions which have been semi-automatically reconstructed and manually labeled and inspected (our ground truth) for later comparison with our fully automatic results.

5.1 Performance of the Standard Hungarian

First, we show the performance of the final standard Hungarian described in Section 3.3. For this, we use the ground truth particle positions and apply the Hungarian algorithm to determine the complete trajectories of the microorganisms. Comparing the automatic matches to the ground truth, we see that in 67% of the sequences the total number of particles is correctly detected, while in the remaining 33% there is just a 5% difference in the number of particles. The average matching accuracy reaches 96.61%.
To further test the robustness of the Hungarian algorithm, we add random noise to each position of our particles. The added noise is of the same order as the noise intrinsically present in the reconstructed images, determined experimentally in [32]. N = 100 experiments are performed on each of the sequences and the accuracy is recorded. Results show that the average matching accuracy is only reduced from 96.61% to 93.51%, making the Hungarian algorithm very robust to the noise present in the holographic images and therefore well suited to find the trajectories of the particles.

5.2 Performance of the Multilevel Hungarian

To test the performance of the multilevel Hungarian we apply the method to three sets of particles:

– Set A: particles determined by the threshold (before the multilevel Hungarian)
– Set B: particles corrected after the multilevel Hungarian
– Set C: ground truth particles, containing all the manually labeled particles

We start by comparing the number of particles detected, as shown in Table 2. The number of particles detected in Set A is drastically reduced in Set B, after applying the multilevel Hungarian, demonstrating its ability to compensate for missing data and to merge trajectories. If we compare it to Set C, we see that
Fig. 12. (a) Three separate trajectories are detected with the standard Hungarian (blue dashed line); the merged trajectory detected with our method (with a smoothing term, red line); missing data spots marked by arrows. (b), (c) Ground truth trajectories (blue dashed line); trajectories automatically detected with our method (red line).
Table 2. Comparison of the number of particles detected by thresholding (Set A), by the multilevel Hungarian (Set B) and the ground truth (Set C)

          S1    S2    S3    S4    S5    S6
Set A   1599  1110   579   668  1148  2336
Set B    236   163   130   142   189   830
Set C     40   143    44    54    49    48
Table 3. Comparison of the trajectories' average length

          S1    S2    S3    S4    S5    S6
Set A      3     5     5     4     6     7
Set B     19    31    27    23    38    23
Set C     58    54    54    70   126   105
the number is still too high, indicating possible tracks which were not merged and were thus detected as independent. Nonetheless, since we do not know exactly how many particles are present in a volume (not all particle positions have been labeled), it is more informative to compare the average length of the trajectories, defined as the number of frames in which the same particle is present. The results are shown in Table 3, where we can clearly see that the average length of a trajectory is greatly improved by the multilevel Hungarian. This is crucial, since long trajectories give us more information on the behavior of the particles. Now let us consider only the trajectories useful for particle analysis, i.e., trajectories with a length of more than 25 frames, which are the trajectories that will be useful later for motion pattern classification. Tracking with the standard Hungarian returns 20.7% of useful trajectories from a volume, while the multilevel Hungarian allows us to extract 30.1%. This means that we can obtain more useful information from each analyzed volume and, ultimately, that fewer volumes have to be analyzed in order to have enough information to draw conclusions about the behavior of a microorganism.

5.3 Performance of the Complete Algorithm

Finally, we are interested in determining the performance of the complete algorithm, including detection and tracking. For this comparison, we present two values:

– Missing: percentage of ground truth particles which are not present in the automatic determination
– Extra: percentage of automatic particles that do not appear in the ground truth data

In Table 4 we show the detailed results for each surface. Our automatic algorithm detects between 76% and 91% of the particles present in the volume. This gives us a measure of how reliable our method is, since it is able to
Table 4. Missing labeled and extra automatic particles

               S1    S2    S3    S4    S5    S6
Missing (%)   8.9  20.7  19.1  23.6  11.5  12.9
Extra (%)    54.9  34.1  46.5  13.3  25.8  74.6
detect most of our verified particle positions. Putting this information together with the percentage of particles detected by our algorithm but not labeled, we can see that our method extracts much more information from the volume of study. This is clear in the case of S6, where we have a volume with many crossing particles which are difficult to label manually and where our algorithm gives us almost 75% more information. We now consider the actual trajectories and particle positions and measure the position error of our method. The error is measured as the Euclidean distance between each point of the ground truth and the automatic trajectories, both at time t. In Figure 12(a) we can see the 3 independent trajectories found with the standard Hungarian and the final merged trajectory, which proves the power of our algorithm to fill in the gaps (marked by arrows). In Figure 12(b) we can see that the automatic trajectory is much shorter (there is a length difference of 105 frames), although the common part is very similar, with an error of just 4.2 μm. Figure 12(c), on the other hand, shows a perfectly matched trajectory with a length difference of 8 frames and an error of 6.4 μm for the whole trajectory, which is around twice the diameter of the spore body. This proves that the determination of the particle position is accurate, but the merging of trajectories can be improved. The next sections are dedicated to several experimental results on the automatic classification of biological motion patterns. All the trajectories used from now on are obtained automatically with the method described in Section 3 and are classified manually by experts, which we refer to as our ground truth classification data.
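The position error used above amounts to averaging point-wise Euclidean distances at the same time t; a sketch (restricting to the overlapping frames is our assumption):

```python
import numpy as np

def position_error(gt, auto):
    """Mean Euclidean distance between ground-truth and automatic
    trajectories, compared point by point at the same time t over
    their common frames. `gt`, `auto`: (T, 3) arrays of positions."""
    n = min(len(gt), len(auto))
    return float(np.linalg.norm(gt[:n] - auto[:n], axis=1).mean())

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
au = np.array([[0.0, 3.0, 0.0], [1.0, 4.0, 0.0]])   # shorter automatic track
print(position_error(gt, au))  # 3.5
```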
We perform leaveoneout tests on our training data which consists of 525 trajectories: 78 for wobbling, 181 for gyration, 202 for orientation and 64 for intensive surface probing. The first experiment that we conduct (see Figure 13) is to determine the effect of each parameter for the classification of all the patterns. The number of symbols and states can only be determined empirically since they depend heavily on the amount of training data. In our experiments, we found the best set of parameters to be N = 4, Nv = 4, Nα = 3, Nβ = 3 and ND = 3, for which we obtain a classification rate of 83.86%. For each test, we set one parameter to 1, which means that the corresponding feature has no effect in the classification process. For example, the first bar in blue labeled ”No Depth” is done with ND = 1. The classification rate for each pattern (labeled from 1 to 4) as well as the mean for all the patterns (labeled Total) is recorded. As we can see, the angle α and the normal β information are the less relevant features, since the classification rate with and without these features is almost the same.
Fig. 13. Classification rate for parameters N = 4, Nv = 4, Nα = 3, Nβ = 3 and ND = 3. On each experiment, one of the features is not used. In the last experiment all features are used.
The angle information depends on the z component and, as explained in Section 4.3, the lower resolution in z can result in noisy measurements. In this case, the trade-off is between having noisy angle data which can be unreliable, or an averaged measure which is less discriminative for classification. The most distinguishing feature according to Figure 13 is the speed. Without it, the total classification rate decreases to 55.51%, and down to just 11.05% for the orientation pattern. Based on the previous results, we could think of using just the depth and speed information for classification. But if Nα = Nβ = 1, the rate goes down to 79.69%; that means that we need one of the two angle measures for correct classification. The parameters used are: N = 4, Nv = 4, Nα = 1, Nβ = 3 and ND = 3, for which we obtain a classification rate of 83.5%. This rate is very close to the result with Nα = 3, with the advantage that we now use fewer symbols to represent the same information. Several tests lead us to choose N = 4 states. The confusion matrix for these parameters is shown in Figure 14. As we can see, patterns 3 and 4 are correctly classified. The common misclassifications occur when Orientation (1) is classified as Gyration (3), or when Wobbling (2) is classified as Spinning (4). In the next section we discuss these misclassifications in detail.

Fig. 14. Confusion matrix; parameters N = 4, Nv = 4, Nα = 1, Nβ = 3 and ND = 3 (– denotes an empty cell in the original figure):

            1 - Ori   2 - Wob   3 - Gyr   4 - Spin
1 - Ori       0.75      0.09      0.16       –
2 - Wob       0.07      0.68      0.01      0.24
3 - Gyr       0.01       –        0.94      0.05
4 - Spin      0.02       –         –        0.98
Fig. 15. (a) Wobbling (pattern 2) misclassified as Spinning (4). (b) Gyration (3) misclassified as Orientation (1). Color coded according to speed as in Figure 1(b).
Fig. 16. Sequences containing two patterns within one trajectory. (a) Gyration (3) + Spinning (4), with a zoom on the spinning part; color coded according to speed as in Figure 1(b). (b) Orientation (1, red) + Gyration (3, yellow); transition marked in blue and pointed to by an arrow.
Fig. 17. Complete volume with patterns: Orientation (1, red), Wobbling (2, green), Gyration (3, yellow). The Spinning (4) pattern is not present in this sequence. Trajectories which are too short to be classified are plotted in black.
5.5 Classification on Other Sequences

In this section we present the performance of the algorithm when several patterns appear within one trajectory, and we analyze the typical misclassifications. As test data we use four sequences which contain 27, 40, 49 and 11 trajectories, respectively. We obtain classification rates of 100%, 85%, 89.8% and 100%, respectively. Note that for the third sequence, 60% of the misclassifications are only partial, which means that the model detects that there are several patterns but only one of them is misclassified. One misclassification that can occur is that Wobbling (2) is classified as Spinning (4). Both motion patterns have similar speed values and the only truly differentiating characteristics are the depth and the angle α. Since we use 3 symbols for depth, the fact that the microorganism touches the surface or merely swims near it leads to the same classification. That is the case in Figure 15(a), in which the model chooses Spinning (4) because the speed is very low (dark blue), while the speed in the Wobbling pattern can sometimes be a little higher (light blue). As discussed in Section 4.2, Gyration (3) and Orientation (1) are two linked patterns: the behavior of gyration in solution is similar to the orientation pattern, which is why the misclassification shown in Figure 15(b) can happen. In this case, since the microorganism does not interact with the surface and the speed of the pattern is high (red color), the model detects it as an Orientation pattern. We note that this pattern is difficult to classify even for a trained expert, since the transition from orientation into gyration usually occurs gradually as spores swim towards the surface and interrupt the swimming pattern (which is very similar to the orientation pattern) with short surface contacts. In general, the model has proven to handle changes between patterns extremely well.
In Figure 16(a), we see the transition between Gyration (3) and Spinning (4).
In Figure 16(b), color coded according to classification, we can see how the model detects the Orientation part (red) and the Gyration part (yellow) perfectly well. It performs a quick transition (marked in blue), during which it stays in the SWITCH state. We have verified that all the transition periods detected by the model lie within the manually annotated transition boundaries marked by experts, even when there is more than one transition present in a trajectory. The classification results on a full sequence are shown in Figure 17. Finally, we can obtain the probability of each transition (e.g. from Orientation to Spinning) for a given dataset under study. This is extremely useful for experts to understand the behavior of a certain microorganism under varying conditions.
6 Conclusions

In this paper we presented a fully automatic method to analyze 4D digital inline holographic microscopy videos of moving microorganisms by detecting the microorganisms, tracking their full trajectories and classifying the obtained trajectories into meaningful motion patterns. The detection of the microorganisms is based on a simple blob detector and can be easily adapted to any microorganism shape. To perform multiple object tracking, we modified the standard Hungarian graph matching algorithm so that it is able to overcome the disadvantages of the classical approach. The new multilevel Hungarian recovers from missing data, discards outliers and is able to incorporate geometrical information in order to account for entering and leaving particles. The automatically determined trajectories are compared with ground truth data, showing that the method detects between 75% and 90% of the labeled particles. For motion pattern classification, we presented a simple yet effective hierarchical design which combines multiple trained Hidden Markov Models (one for each of the patterns) and has proved successful at identifying different patterns within one single trajectory. The experiments performed on four full sequences result in a total classification rate between 83.5% and 100%. Our system has proved to be a helpful tool for biologists and physicists as it provides a vast amount of analyzed data in an easy and fast way. As future work, we plan to further improve the tracking results by using a network flow approach, which will be especially useful for volumes with a high density of microorganisms.

Acknowledgements. This work has been funded by the German Research Foundation, DFG projects RO 2497/7-1 and RO 2524/2-1, and by the Office of Naval Research, grant N00014-08-1-1116.
References

1. Ginger, M., Portman, N., McKean, P.: Swimming with protists: perception, motility and flagellum assembly. Nature Reviews Microbiology 6(11), 838–850 (2008)
2. Stoodley, P., Sauer, K., Davies, D., Costerton, J.: Biofilms as complex differentiated communities. Annual Review of Microbiology 56, 187–209 (2002)
3. Heydt, M., Rosenhahn, A., Grunze, M., Pettitt, M., Callow, M.E., Callow, J.A.: Digital inline holography as a 3D tool to study motile marine organisms during their exploration of surfaces. The Journal of Adhesion 83(5), 417–430 (2007)
4. Rosenhahn, A., Ederth, T., Pettitt, M.: Advanced nanostructures for the control of biofouling: The FP6 EU integrated project AMBIO. Biointerphases 3(1), IR1–IR5 (2008)
5. Frymier, P., Ford, R., Berg, H., Cummings, P.: 3D tracking of motile bacteria near a solid planar surface. Proc. Natl. Acad. Sci. U.S.A. 92(13), 6195–6199 (1995)
6. Baba, S., Inomata, S., Ooya, M., Mogami, Y., Izumi-Kurotani, A.: 3-dimensional recording and measurement of swimming paths of microorganisms with 2 synchronized monochrome cameras. Review of Scientific Instruments 62(2), 540–541 (1991)
7. Weeks, E., Crocker, J., Levitt, A., Schofield, A., Weitz, D.: 3D direct imaging of structural relaxation near the colloidal glass transition. Science 287(5452), 627–631 (2000)
8. Li, K., Miller, E., Chen, M., Kanade, T., Weiss, L., Campbell, P.: Cell population tracking and lineage construction with spatiotemporal context. Medical Image Analysis 12(5), 546–566 (2008)
9. Miura, K.: Tracking movement in cell biology. Microscopy Techniques, 267–295 (2005)
10. Tsechpenakis, G., Bianchi, L., Metaxas, D., Driscoll, M.: A novel computational approach for simultaneous tracking and feature extraction of C. elegans populations in fluid environments. IEEE Transactions on Biomedical Engineering 55(5), 1539–1549 (2008)
11. Khan, Z., Balch, T., Dellaert, F.: MCMC-based particle filtering for tracking a variable number of interacting targets. TPAMI (2005)
12. Nillius, P., Sullivan, J., Carlsson, S.: Multi-target tracking – linking identities using Bayesian network inference. In: CVPR (2006)
13. Yang, M., Yu, T., Wu, Y.: Game-theoretic multiple target tracking. In: ICCV (2007)
14.
Betke, M., Hirsh, D., Bagchi, A., Hristov, N., Makris, N., Kunz, T.: Tracking large variable number of objects in clutter. In: CVPR (2007)
15. Lu, J., Fugal, J., Nordsiek, H., Saw, E., Shaw, R., Yang, W.: Lagrangian particle tracking in three dimensions via single-camera inline digital holography. New J. Phys. 10 (2008)
16. Berg, H.: Random Walks in Biology. Princeton University Press, Princeton (1993)
17. Hoyle, D., Rattray, M.: PCA learning for sparse high-dimensional data. Europhysics Letters 62(1) (2003)
18. Wang, X., Grimson, E.: Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In: CVPR (2008)
19. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46(1–3), 389–422 (2002)
20. Sbalzarini, I., Theriot, J., Koumoutsakos, P.: Machine learning for biological trajectory classification applications. Center for Turbulence Research, 305–316 (2002)
21. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2) (1989)
22. Chen, M., Kundu, A., Zhou, J.: Off-line handwritten word recognition using a hidden Markov model type stochastic network. TPAMI 16 (1994)
23. Nefian, A., Hayes, M.H.: Hidden Markov models for face recognition. In: ICASSP (1998)
24. Yamato, J., Ohya, J., Ishii, K.: Recognizing human action in time-sequential images using hidden Markov model. In: CVPR (1992)
25. Brand, M., Kettnaker, V.: Discovery and segmentation of activities in video. TPAMI 22(8), 844–851 (2000)
26. Leal-Taixé, L., Heydt, M., Rosenhahn, A., Rosenhahn, B.: Automatic tracking of swimming microorganisms in 4D digital inline holography data. In: IEEE WMVC (2009)
27. Leal-Taixé, L., Heydt, M., Weisse, S., Rosenhahn, A., Rosenhahn, B.: Classification of swimming microorganisms motion patterns in 4D digital in-line holography data. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) Pattern Recognition. LNCS, vol. 6376, pp. 283–292. Springer, Heidelberg (2010)
28. Gabor, D.: A new microscopic principle. Nature 161(8), 777 (1948)
29. Xu, W., Jericho, M., Meinertzhagen, I., Kreuzer, H.: Digital in-line holography for biological applications. Proc. Natl. Acad. Sci. U.S.A. 98(20), 11301–11305 (2001)
30. Raupach, S., Vossing, H., Curtius, J., Borrman, S.: Digital crossed-beam holography for in situ imaging of atmospheric particles. J. Opt. A: Pure Appl. Opt. 8, 796–806 (2006)
31. Fugal, J., Schulz, T., Shaw, R.: Practical methods for automated reconstruction and characterization of particles in digital in-line holograms. Meas. Sci. Technol. 20, 75501 (2009)
32. Heydt, M., Divós, P., Grunze, M., Rosenhahn, A.: Analysis of holographic microscopy data to quantitatively investigate three-dimensional settlement dynamics of algal zoospores in the vicinity of surfaces. Eur. Phys. J. E: Soft Matter and Biological Physics (2009)
33. Garcia-Sucerquia, J., Xu, W., Jericho, S., Jericho, M.H., Tamblyn, I., Kreuzer, H.: Digital in-line holography: 4D imaging and tracking of microstructures and organisms in microfluidics and biology. In: Proc. SPIE, vol. 6026, pp. 267–275 (2006)
34. Lewis, N.I., Xu, W., Jericho, S., Kreuzer, H., Jericho, M., Cembella, A.: Swimming speed of three species of Alexandrium (Dinophyceae) as determined by digital in-line holography. Phycologia 45(1), 61–70 (2006)
35. Sheng, J., Malkiel, E., Katz, J., Adolf, J., Belas, R., Place, A.: Digital holographic microscopy reveals prey-induced changes in swimming behavior of predatory dinoflagellates. Proc. Natl. Acad. Sci. U.S.A. 104(44), 17512–17517 (2007)
36.
Sheng, J., Malkiel, E., Katz, J., Adolf, J., Place, A.: A dinoflagellate exploits toxins to immobilize prey prior to ingestion. Proc. Natl. Acad. Sci. U.S.A. 107(5), 2082–2087 (2010) 37. Sun, H., Hendry, D., Player, M., Watson, J.: In situ underwater electronic holographic camera for studies of plankton. IEE Journal of Oceanic Engineering 32(2), 373–382 (2007) 38. Lindeberg, T.: Scalespace theory in computer vision. Springer, Heidelberg (1994) 39. Kuhn, H.: The hungarian method for the assignment problem. Nav. Res. Logist. 2, 83–87 (1955) 40. Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society of Industrial and Applied Mathematics 5(1), 32–38 (1957) 41. Pilgrim, R.: Munkres’ assignment algorithm; modified for rectangular matrices. Course Notes, Murray State University, http://csclab.murraystate.edu/bob.pilgrim/445/munkres.html 42. Masuda, N., Ito, T., Kayama, K., Kono, H., Satake, S., Kunugi, T., Sato, K.: Special purpose computer for digital holographic particle tracking velocimetry. Optics Express 14, 587–592 (2006) 43. Iken, K., Amsler, C., Greer, S., McClintock, J.: Qualitative and quantitative studies of the swimming behaviour of hincksia irregularis (phaeophyceae) spores: ecological implications and parameters for quantitative swimming assays. Phycologia 40, 359–366 (2001)
3D Reconstruction and Video-Based Rendering of Casually Captured Videos

Aparna Taneja¹, Luca Ballan¹, Jens Puwein¹, Gabriel J. Brostow², and Marc Pollefeys¹

¹ Computer Vision and Geometry Group, ETH Zurich, Switzerland
² Department of Computer Science, University College London, UK
Abstract. In this chapter we explore the possibility of interactively navigating a collection of casually captured videos of a performance: real-world footage captured on hand-held cameras by a few members of the audience. The aim is to navigate the video collection in 3D by generating video-based renderings of the performance using an offline precomputed reconstruction of the event. We propose two different techniques to obtain this reconstruction, considering that the video collection may have been recorded in complex, uncontrolled outdoor environments. One approach recovers the event geometry by exploring the temporal domain of each video independently, while the other explores the spatial domain of the video collection at each time instant, independently. The pros and cons of the two methods, and their applicability to the addressed navigation problem, are also discussed. In the end, we propose an interactive GPU-accelerated viewing tool to navigate the video collection.

Keywords: Video-Based Rendering, Dynamic Scene Reconstruction, Free Viewpoint Video, 3D-Video, 3D Reconstruction.
1 Introduction
Photo and video collections exist online with copious amounts of footage. Community-contributed photos of scenery can already be registered together offline, allowing for navigation of specific landmarks using a fast Image-Based Rendering (IBR) representation [1]. We propose that similar capabilities should exist for videos of performances or events, filmed by members of the audience, for instance with hand-held cameras or mobile phones. In particular, we want to give the user the ability to replay the event by seamlessly navigating around a performer, using the collection of videos. One can only make weak assumptions about this footage, because the audience members doing the filming may have various video-recording devices, they could sit or move about far apart from each other, they may be indoors or outdoors, and they may have a partially obstructed view of the action. Due to all these possibilities, a full 3D reconstruction of the dynamic scene observed by the videos in the collection is very challenging. In this chapter we propose two different techniques to obtain such a reconstruction. Both these techniques first recover the geometry of the static elements

D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 77–103, 2011.
© Springer-Verlag Berlin Heidelberg 2011
of the scene and then exploit this information to infer the geometry of the dynamic elements, exploring the video collection in either the temporal or the spatial domain. We also propose a full pipeline to generate a hybrid representation of the video collection which can be navigated using an interactive GPU-accelerated viewing tool. The chapter is organized as follows: Section 2 discusses the related work. Section 3 describes the offline procedure that computes the hybrid representation of the video collection. In particular, Sections 3.3.1 and 3.3.2 present the two methods to recover the geometry of the dynamic elements of the scene, and Section 3.3.3 draws a comparison between the two methods. Section 4 describes how the video collection can be navigated interactively. Section 5 discusses the experimental results and Section 6 draws the conclusions.
2 Related Work
Research in the area of image-based rendering has culminated in the Photo Tourism work of [1,2] and the commercially supported online PhotoSynth community. One of their main contributions was the pivotal insight that, instead of stitching many people's disparate photos together into a panorama, it is possible and useful to compute a 3D point cloud from the 2D features that the photos have in common. The point cloud in turn serves as a scaffold and a non-photorealistic backdrop that provides spatial context. While a "visitor" navigates the original photos, they see the point cloud and hints of other photos in a way that reflects the real spatial layout of, for example, the Trevi Fountain. The recent work of [3] extends view interpolation to scenarios with erroneous and incomplete 3D scene geometry. However, all these approaches treat the environment as static and do not deal with moving foreground elements (e.g., people), excluding them from the visualization. On the contrary, dynamic environments with moving foregrounds have been widely studied in both the video-based rendering community and the surface capture community. Surface capture techniques have exploited silhouettes [4,5,6,7,8], photo consistency [9,10,11], shadows and shading [12,13,14], and motion [15] to recover the geometries of the dynamic elements of a scene. Multi-modal techniques have also been explored, combining multiple cues at the same time to make the reconstruction robust to inaccurate input data. Relevant examples are [16,17,18], which combine silhouette and photo consistency, [14], which combines silhouette, shadow and shading, and [19], which combines narrow-baseline stereo with wide-baseline stereo. Prior knowledge on the foreground objects of the scene has also been exploited. For instance, [20,21,22,23,24] assume a prior on the possible shapes of the foreground objects; specifically, they are assumed to be human.
Most of these works, however, focus on indoor controlled environments where the cameras are typically static and both the lighting and the background are controlled. Approaches that have dealt with outdoor uncontrolled scenarios have resorted to a large and dense arrangement of cameras, as in [25,26], to priors on the scene, as in [27,28,29], or to priors on the foreground objects [30].
In situations where the inter-camera baseline is small, some methods have been proposed to generate video-based rendering content. As an example, [25] used narrow-baseline stereo and spanned a total of 30° of viewing angle using a chain of eight cameras, and could tolerate 100 pixels of disparity by focusing special computations on depth discontinuities. The method proposed in [31] demonstrates that, even under 15° of angular separation, it can be sufficient to model the whole scene with homography transformations of 2D superpixels whose correspondence is computed as an alternative to per-pixel optical flow. In a studio setting, but starting with crude geometry of a performer, [32] shows how good optical flow can fix texture-assignment problems that occur where views of some geometry overlap. Earlier, view interpolation based on epipolar constraints was demonstrated in [33], where correspondences were specified manually. Normally, however, these view interpolation algorithms rely heavily on correlation-based stereo and nearby cameras. To deal with a casually captured video collection, we need to consider the fact that these videos may have been captured in outdoor environments where the background can be complex and its appearance may change over time, and the cameras may move and their internal settings may change during the recording. No assumptions can be made on the arrangement or density of the cameras, nor on the dynamic structure of the environment.
3 Offline Processing
The aim of this stage is to synthesize a hybrid representation of the video collection that will be subsequently navigated. More precisely, the aim is to i) recover the geometry and the appearance of all the static elements of the scene, ii) calibrate the video collection spatially, temporally and photometrically, and in the end, iii) recover an approximate representation of the shape of the dynamic elements of the scene.

3.1 Static Elements Reconstruction
A 3D reconstruction of all the static elements of the scene is necessary to: i) provide context while rendering transitions, ii) calibrate the camera poses for each video frame, and iii) refine each camera's video matte. A variety of methods exist for static scene reconstruction [34,35,36,37,38,39,40]. Aside from photos and videos of a specific event, one could also use online photo collections of specific places to build dense 3D models [38]. For the sake of simplicity, we refer to the static elements of the scene as the background and to the dynamic elements as the foreground or the middleground. In particular, we refer to the foreground as the main object of interest in the footage (e.g., the main performer), while we refer to the middleground as the remaining dynamic elements of the scene. We follow the same Structure from Motion (SfM) strategy as in [1], matching SIFT features [41] between photos, estimating initial camera poses and 3D points, and refining the 3D solutions via bundle adjustment. We then proceed
Fig. 1. (a) Collection of images of the filming location. (b) Geometry of the static elements of the scene. (c) Textured geometry.

Fig. 2. Refining the camera poses by minimizing the error between the actual video and the synthesized video
by computing a depth map for each photo using standard multi-view plane-sweep stereo (MVS) based on normalized cross-correlation [42]. The final polygonal surface mesh is generated using the robust range image fusion presented in [43]. A static texture for the background geometry is also extracted from the photos and baked on (see Figure 1), using a wavelet-based pyramidal fusion technique as presented in [44,45]. Since the background scene is fairly dynamic in places, much of that texture will be replaced during the interactive stage of the system, by sampling the view-dependent colors opportunistically from each camera's video.

3.2 Spatial, Temporal and Photometric Calibration
The camera poses for all the video frames are computed relative to the reconstructed background geometry. We refer to the real image seen by camera A at time t as I_t^A. The intrinsics K_t^A ∈ R^{3×3} and the extrinsics E_t^A ∈ R^{3×4} for each image I_t^A are estimated as follows. First, the SIFT features found in the images that had been used to reconstruct the background are searched for potential matches
to features found in each video frame. These matches generate correspondences between 2D points in the current frame and 3D points in the background geometry. The pose of that camera at that specific time is recovered by applying the Direct Linear Transform (DLT) [46] and a refinement step based on the reprojection error. This approach, however, does not guarantee that the poses of different cameras are recovered with the same 3D accuracy. Similar reprojection errors of sparse features, as measured in pixels, could indicate very different qualities of pose estimation, especially when depths and resolutions vary greatly. The key is to achieve a calibration that looks correct when the textured geometry is rendered in conjunction with the performer during the interaction stage, even if it is off by a few meters. Therefore, we perform a second optimization of the camera poses. We use particle filtering [47] to minimize the sum of squared differences between each I_t^A and the image obtained by rendering the texture of the geometry at the current calibration estimate (see Figure 2). In this case, the texture is obtained as the median reprojected texture from a temporal window of 1000 frames of the same camera A (subsampled for efficiency). The video collection is synchronized, as proposed in [30], by performing correlation of the audio signals. We silence the quieter 90% of each video, align on the rest, and still need to manually time-shift about one in four videos. Sound travels slowly, so video-only synchronization [48,49] may be preferable despite being more costly computationally. To account for different settings in the cameras, like different exposure times, gain and white balancing, the video streams are also calibrated photometrically with respect to each other and with respect to the textured static geometry. More specifically, we used the method proposed in [50] to compute a color transfer function mapping the color space of one camera into the color space of another.
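As a rough illustration of such a per-channel mapping (a generic histogram-matching sketch, not the specific transfer function of [50]; the function names and the 8-bit assumption are ours):

```python
import numpy as np

def color_transfer_lut(src_pixels, dst_pixels, bins=256):
    """Per-channel lookup tables mapping one camera's color space onto
    another by matching cumulative histograms of corresponding pixel sets
    (histogram specification). Pixels are (N, 3) uint8 RGB samples."""
    luts = []
    for c in range(3):
        s = np.histogram(src_pixels[:, c], bins=bins, range=(0, 256))[0]
        d = np.histogram(dst_pixels[:, c], bins=bins, range=(0, 256))[0]
        cs = np.cumsum(s) / s.sum()   # source CDF
        cd = np.cumsum(d) / d.sum()   # destination CDF
        # map each source level to the destination level with matching CDF value
        luts.append(np.searchsorted(cd, cs).clip(0, bins - 1).astype(np.uint8))
    return luts

def apply_transfer(img, luts):
    """Apply the per-channel lookup tables to an (H, W, 3) uint8 image."""
    out = np.empty_like(img)
    for c in range(3):
        out[..., c] = luts[c][img[..., c]]
    return out
```

In practice the samples would come from corresponding background regions seen by both cameras, so that the mapping is not biased by the dynamic foreground.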
This mapping is computed for all the pairs of cameras, and also between each camera and the textured geometry.

3.3 Dynamic Elements Reconstruction
As discussed before in the related work section, many techniques have been developed to recover the shapes of dynamic elements in a scene, exploiting all possible kinds of depth cues, like multi-view stereo and silhouettes, and their possible combinations. However, in the considered scenario of casually captured videos, only a few assumptions can be made about the scene and the way it was recorded. As a consequence, algorithms like multi-view stereo are not applicable in general, since they rely on a dense arrangement of cameras close to the object of interest, which might not always be the case. On the contrary, the usage of silhouettes as a depth cue does not suffer from such constraints, but estimating silhouettes in an uncontrolled outdoor environment is not trivial. In fact, this procedure relies on an accurate knowledge of the appearance of the background. This appearance is easy to infer when the cameras are static and a few frames representing the empty background are available.
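In that simple static-camera setting, the background model and silhouette extraction can be sketched in a few lines (a generic baseline for contrast, not part of the pipeline described here; the threshold is an arbitrary illustrative value):

```python
import numpy as np

def median_background(frames):
    """Per-pixel temporal median over a stack of (mostly empty) frames,
    giving a simple static-camera background appearance model."""
    return np.median(np.stack(frames).astype(np.float32), axis=0)

def silhouette(frame, background, thresh=30.0):
    """Binary foreground mask: pixels whose color deviates from the
    background model by more than `thresh` (L1 distance over RGB)."""
    diff = np.abs(frame.astype(np.float32) - background).sum(axis=-1)
    return diff > thresh
```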
However, in a casually captured outdoor scenario, the cameras may move and the background appearance may change over time due to changes in illumination, exposure or white balance compensations, and moving objects in the background. While variations in illumination and camera settings can be partially dealt with using the photometric calibration data, the changes in appearance due to moving objects remain a big issue. For instance, in the case of the Rothman sequence (Figure 3), the audience was standing throughout the event and was therefore modeled by the system as static geometry. While the motion of some people within the crowd did not influence the crowd geometry, it does influence the crowd/background appearance. In the past, some approaches have already explored segmentation in a moving-camera scenario, but mainly resorting to priors on the shape of the dynamic elements of the scene, as in [30,51,52], to priors on their appearance, as in [53], or to priors on their motions, as in [54]. We choose to use silhouette information for reconstruction and do not resort to any of these priors, but instead exploit the known information about the geometry of the static background. Unlike Video SnapCut [55], Video Cutout [56] and Background Cut [57], we prefer a more automatic segmentation technique which requires only a little user interaction, at the cost of segmentation accuracy. This is motivated by the fact that, for a video collection where each video contains thousands of frames, the above-mentioned techniques would demand a lot of time and effort from the user. Our downstream rendering process is designed specifically to cope with our lower-quality segmentation. In the next sections, we propose two methods to recover the appearance of the background, namely the ones presented in [58] and [59].
The former estimates the background appearance for each video individually, looking back and forth in its timeline, while the latter recovers the background appearance by transferring information about the same time instant across different cameras. While the latter approach provides a full 3D volumetric representation of the dynamic elements of the scene, the former approach approximates this geometry as a set of billboards. Section 3.3.3 compares these two approaches and the corresponding representations for the purpose of video-based rendering.

3.3.1 Reconstruction Using Temporal Information. In this approach, each dynamic element is represented as a set of billboards, where each billboard corresponds to a specific camera. All the billboards related to a dynamic object are centered on the object's actual center and their normal is aligned perpendicular to the ground plane. Each billboard faces the related camera at each time instant and has a texture and transparency map provided by the corresponding camera (see Figure 10). While the texture is obtained by trivially projecting the corresponding video frame onto the billboard, the transparency map, implicitly representing the silhouette of the dynamic element, is inferred by exploring the temporal domain of each video independently. This procedure is performed as follows.
Fig. 3. (a) One of the input frames of the Rothman sequence. (b) The obtained initial segmentation. (c) The mean of the obtained per-pixel color distribution of the background for that specific frame. (d) The final segmentation.
A user is first required to paint the pixels of two random images from each video with the binary labels Ω ∈ {1, 0}, to indicate foreground pixels that belong to the performer vs. background pixels that do not. With multiple videos, each lasting potentially thousands of frames, all subsequent segmentation is computed automatically, despite the complications this poses for our video-based rendering approach. Even using a primitive paint program, the user effort does not exceed 10 min. per input video. The user-labeled training pixels define a foreground and a background color model. We simply use a k-nearest-neighbor classifier (k = 60) in RGB space, so the pixel-wise independent posterior probability is k_Ω / k, amounting to the fraction of a pixel's color neighbors that had been labeled Ω. To compute a conservative foreground mask γ_t^A efficiently, we store the class-conditional likelihood ratio of foreground to background in a discretized 256³ color-cube lookup table. The table usually takes 5 min. to compute, and each frame is then segmented in 2–3 sec., using 0.6 as the necessary distance ratio to label a pixel as foreground (see Figure 3). To keep the foreground mask conservative, mean-shift tracking [60] is used to predict the area of the foreground pixels. Only pixels labeled as foreground and belonging to that area are considered as foreground objects. This decreases the number of false positive foreground pixels. The quality of this initial segmentation is, however, insufficient for our rendering purposes, as shown in Figure 3(b). To improve it, we use a new background color model, the same foreground color model as above, and graph cuts [61] to optimize the boundary. Each image I_t is treated as a moving foreground f_t composited over a changing background b_t by the compositing equation

I_t = α_t f_t + (1 − α_t) b_t

where α_t is the per-pixel alpha matte. With a binary initial segmentation γ_t in hand, we now seek to estimate f, b and a refined α for each frame.
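A minimal sketch of this k-NN color classifier (brute-force neighbor search and a coarser color cube than the 256³ table used here, for brevity; the function names and window of parameters are illustrative):

```python
import numpy as np

def build_ratio_table(fg_rgb, bg_rgb, k=60, res=16):
    """Discretized color-cube lookup of the k-NN foreground fraction k_fg/k.
    fg_rgb, bg_rgb: (N, 3) arrays of user-labeled training colors in [0, 255].
    `res` is the cube resolution per channel (coarser than 256 to keep the
    sketch small)."""
    train = np.vstack([fg_rgb, bg_rgb]).astype(np.float32)
    labels = np.concatenate([np.ones(len(fg_rgb)), np.zeros(len(bg_rgb))])
    k = min(k, len(train))
    # cell-center colors of the discretized cube
    axis = (np.arange(res) + 0.5) * (256.0 / res)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)
    table = np.empty(len(grid), dtype=np.float32)
    for i, c in enumerate(grid):                 # brute-force k-NN per cell
        d = np.abs(train - c).sum(axis=1)        # L1 distance in RGB
        nn = np.argpartition(d, k - 1)[:k]
        table[i] = labels[nn].mean()             # fraction of neighbors labeled fg
    return table.reshape(res, res, res)

def classify(img, table, ratio=0.6):
    """Label a pixel foreground when its k-NN foreground fraction exceeds `ratio`."""
    res = table.shape[0]
    idx = (img.astype(np.int64) * res) // 256
    return table[idx[..., 0], idx[..., 1], idx[..., 2]] > ratio
```

Precomputing the table makes per-frame segmentation a pure lookup, which is what keeps the per-frame cost to a few seconds.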
A per-pixel color model for the background b_t of each video frame is estimated first. Dilation of the initial segmentation γ_t by 10 pixels gives a conservative background mask, removing the need for a manually specified traveling garbage matte. Knowing both the background geometry and the calibration parameters,
we can render the "empty" scene seen at time t from camera A using the colors from elsewhere in A's timeline (see Figure 3(c)). In one sense our approach is similar to that of [62], where a model of the background is generated and textured using the input video. Here, much like Chuang et al., we determine the probability distribution of b_t by sampling from temporally proximate frames. Our algorithm collects samples of b_t(m) for those m which are not labeled as foreground at time t, i.e., those where γ_t(m) = 0. Further samples are collected by searching backward and forward in time with increasing Δ, projecting the images I_{t±Δ} with their related γ_{t±Δ} onto the scene, according to A's calibrations. Once 10 samples for the same pixel m have been collected, a Gaussian is fitted to model b_t(m), though we save the medians instead of the means. This procedure has been parallelized and runs with GPU acceleration.

We first solve the compositing equation assuming the α's are binary, leading to a trimap that is ready for further processing. Graph cuts is applied to maximize the conditional probability P(α_t | I_t), which is proportional to P(I_t | α_t) P(α_t). Applying the logarithm and under the usual assumptions of conditional independence, log P(α_t) represents the binary potential, while log P(I_t | α_t) represents the unary potential. For each pixel m in I_t,

P(I_t(m) | α_t) = P(f_t)^{α_t(m)} · P(b_t(m))^{(1 − α_t(m))}    (1)

where P(f_t) is the foreground color model estimated above and P(b_t(m)) is the aforementioned Gaussian distribution. Due to the inevitable presence of small calibration and background geometry errors, the projection of I_{t±Δ} can be imperfect by some small local transformations. To account for this, P(b_t(m)) is actually taken as the maximum over all the pixels in a 5 × 5 neighborhood. The binary potential is formulated as the standard smoothness term, but modified to take into account both spatial and temporal gradients in the video. Once this discrete solution for α_t is found, a trimap is automatically generated by erosion and dilation (3 and 1 pixels respectively). For all grey pixels in the trimap, we apply the matting technique proposed in [63]. An example result is shown in Figure 3(d).

The presented segmentation procedure extracts both the foreground and the middleground elements of the scene. The mean-shift tracker is able to distinguish between the dynamic elements so that, during rendering, those elements are modeled as separate sets of billboards. When instead the 3D position of a middleground element cannot be triangulated, as happens when it appears in only one camera, our system makes it disappear before a transition starts, and reappear as the transition concludes. This situation can be observed in the Magician and the Juggler sequences, when people stand in front of somebody's cameras.

3.3.2 Reconstruction Using Spatial Information. The method proposed in the previous section explores the temporal domain of the videos to segment the dynamic elements in each video independently. An alternative solution would be to exploit the spatial domain of these videos and retrieve the appearance of
Fig. 4. (Top row) Source images acquired respectively by cameras 1, 2 and 3. (Middle row) Images R_t^{2j} computed by projecting the previous images into camera 2 (black pixels indicate missing color information, i.e., β = 0). (Bottom row) Difference images D_t^{2j}.
the background from the images provided by the other cameras at the same time instant. In other words, we propose to jointly segment the dynamic elements by using the known 3D geometry of the background to transfer color information across multiple views. This results in a full volumetric reconstruction of the dynamic elements of the scene. The proposed technique is explained in detail in the following paragraphs. Given some images captured at the same time instant t, we aim to project each image onto the other images and exploit their differences. Let i and j be two cameras given at time t, and let π_t^i be the projection function mapping 3D points in the world coordinate system to 2D points in the image coordinate system of camera i, according to both the intrinsic and the extrinsic parameters. Since both the background geometry and the projection function π_t^i are known, the depth map of the background geometry seen by camera i can be computed. Let us denote this depth map by Z_t^i. The value stored in each of its pixels represents the depth of the closest 3D point of the background geometry that projects to that pixel using π_t^i. In practice, Z_t^i can be easily computed on the GPU by rendering the background geometry from the point of view of camera i and extracting the resulting Z-buffer. Let R_t^{ij} denote the image obtained by projecting the image I_t^j into camera i, i.e., the image obtained by rendering the background geometry from the point of view of camera i using the color information of camera j, taking into account the color transfer function between i and j. More formally, for each pixel p in R_t^{ij}, we know that (π_t^i)^{-1}([p, Z_t^i(p)]^T) represents the coordinates of the closest 3D point in the background geometry projecting to p. Note that (π_t^i)^{-1} is the
Fig. 5. Image formation process for a reprojection image R_t^{ij}. Since the scene element g is not a part of the background geometry, it generates a ghost image on camera i which is far away from the region it would ideally project to if it were a part of the background geometry.
inverse of the projection function π_t^i, where the depth is assumed to be known and equal to Z_t^i(p). Therefore, the coordinates of pixel p in image j are equal to

π_t^j((π_t^i)^{-1}([p, Z_t^i(p)]^T))    (2)

In the end, the color of the pixel p in R_t^{ij} is defined as follows:

R_t^{ij}(p) = I_t^j(π_t^j((π_t^i)^{-1}([p, Z_t^i(p)]^T)))    (3)
Let us note that no color information can be retrieved for pixels of R_t^{ij} that map outside the field of view of camera j, nor for those which have no depth information in Z_t^i, e.g., those projecting onto regions not modeled by the background geometry. We keep track of such pixels by defining a binary mask β_t^{ij} such that β_t^{ij}(p) = 0 indicates the absence of color information at pixel p in R_t^{ij}. The procedure of computing R_t^{ij} is performed on the GPU using shaders. Figure 4 shows some example images R_t^{ij} obtained by projecting the images captured by three different cameras, namely #1, #2 and #3, into camera #2. The reader can notice that, when the background geometry matches the current scene geometry, the captured image I_t^i and the image R_t^{ij} look alike in all the pixels with β_t^{ij}(p) equal to one. On the contrary, a dynamic object which was not present in the background geometry gets projected onto the background points behind it. This reprojection is referred to as the ghost of the foreground object in the image R_t^{ij}. Figure 5 explains this concept visually. In Figure 4, the ghost of the juggler can be observed in both images R_t^{21} and R_t^{23}, while it is not visible in R_t^{22} since the image is projected onto itself. Let us call D_t^{ij} the image obtained by a per-pixel comparison between the image I_t^i and R_t^{ij}. In order to make our comparison method robust to errors that may be
Fig. 6. Results obtained by applying different color similarity measures to compare the two images I_t^i and R_t^{ij} in order to build the image D_t^{ij}. (a) Result obtained by applying Equation 4. (b) Result obtained by applying Equation 4 with I_t^i and R_t^{ij} swapped. (c) Result obtained by applying Equation 5.
present in either the calibration or in the background geometry, the similarity measure used to compare these two images accounts for local affine transformations in the image space. We propose to compute D_t^{ij} as

D_t^{ij}(p) = min_{q ∈ W_p} ‖I_t^i(p) − R_t^{ij}(q)‖    (4)

where W_p is a window around p and ‖·‖ is the L1 norm in the RGB color space. This similarity measure proved to be more robust but, unfortunately, some details around the ghost borders are lost. This can be seen in Figure 6(a), where the ghost of the foreground object gets shrunk by half the window size used. In order to avoid these artifacts, the same approach is repeated by comparing, this time, the pixel p in R_t^{ij} to a corresponding window W_p in I_t^i. A result obtained by using this second approach is shown in Figure 6(b) where, this time, the silhouette of the foreground object gets shrunk by half the window size. In the end we chose to use the following metric, which combines the advantages of both the previous metrics:

D_t^{ij}(p) = max( min_{q ∈ W_p} ‖I_t^i(p) − R_t^{ij}(q)‖ , min_{q ∈ W_p} ‖R_t^{ij}(p) − I_t^i(q)‖ )    (5)
A result obtained by applying this new metric can be seen in Figure 6(c). Given the input images I_t^i, all the possible images D_t^{ij} for each i > j are computed. This leads to a set of (n² − n)/2 difference images D_t^{ij} that we will refer to as D. The problem of recovering the 3D geometry of the foreground object is formulated in a probabilistic way, using the computed set of images D as observation. The scene to be reconstructed is discretized as a voxel grid. Let G be the random vector representing the occupancy state of all the voxels inside this grid, where G_t^k = 1 indicates that voxel k is full, and G_t^k = 0 that it is empty. The aim is to find a labeling L* for G which maximizes the posterior probability P(G = L | D), i.e.,

L* = argmax_L P(G = L | D)    (6)

By Bayes' rule, this is equivalent to

L* = argmax_L ( log P(D | G = L) + log P(G = L) )    (7)
We first describe how the probability P(D | G = L) is computed for a given labeling of the voxel grid, while P(G = L) is described later. Let φ_t^i(k) denote the footprint of the voxel k in camera i, i.e., the projection of all the 3D points belonging to k onto the image plane of camera i. Furthermore, denote with χ_t^{ij}(k) the set of the ghost pixels of voxel k in the image R_t^{ij}. Since these pixels are the ones corresponding to the background geometry points occluded by the foreground object in camera j, i.e., (π_t^j)^{-1}([φ_t^j(k), Z_t^j(φ_t^j(k))]^T), χ_t^{ij}(k) can be computed as follows:

χ_t^{ij}(k) = π_t^i((π_t^j)^{-1}([φ_t^j(k), Z_t^j(φ_t^j(k))]^T))    (8)
i.e., by projecting those background points into camera i (see Figure 5). We make three conditional independence assumptions for computing the probability P(D | G = L): first, the states of the voxels are assumed to be conditionally independent; second, the image formation process is assumed to be independent for all the images; and third, the color of a pixel in an image is independent from the others. Using these assumptions, the probability P(D | G = L) can be expressed as

P(D | G = L) = ∏_k P(D | G_t^k = L_t^k)    (9)

where

P(D | G_t^k) = ∏_{i,j,p} P(D_t^{ij}(p) | G_t^k),  ∀p ∈ φ_t^i(k) ∪ χ_t^{ij}(k)    (10)
Let us now introduce another random variable C_t^ij representing the consensus between the pixels in image I_t^i and the ones in image R_t^ij. C_t^ij(p) = 1 indicates that the color information at pixel p in I_t^i agrees with the color information at p in R_t^ij. Clearly, this variable strongly depends on the image D_t^ij. Specifically, P(D_t^ij(p) | G_t^k) is modeled using a formulation similar to the one proposed by Franco and Boyer in [6], i.e.,

$$
P(D_t^{ij}(p) \mid G_t^k) = P(D_t^{ij}(p) \mid C_t^{ij}(p) = 1)\, P(C_t^{ij}(p) = 1 \mid G_t^k) + P(D_t^{ij}(p) \mid C_t^{ij}(p) = 0)\, P(C_t^{ij}(p) = 0 \mid G_t^k) \tag{11}
$$
While in their work they used background images to determine P(D_t^ij(p) | C_t^ij(p)), we assume the following: in case of consensus (C_t^ij(p) = 1), the probability of D_t^ij(p) being high is low, and vice versa. Therefore, P(D_t^ij(p) | C_t^ij(p) = 1) is chosen to be a Gaussian distribution centered at zero and truncated for values less than zero. For pixels with no color information, i.e., those with β_t^ij(p) = 0, we assume this probability to be uniform. Therefore,

$$
P(D_t^{ij}(p) \mid C_t^{ij}(p) = 1) =
\begin{cases}
\kappa\, e^{-\frac{(D_t^{ij}(p))^2}{2\sigma^2}} & \beta_t^{ij}(p) = 1 \\
U & \beta_t^{ij}(p) = 0
\end{cases} \tag{12}
$$
3D Reconstruction and Video-Based Rendering of Casually Captured Videos
where κ is the normalization factor of the Gaussian distribution, and U the uniform distribution. On the contrary, when there is no consensus (C_t^ij(p) = 0), no information can be stated about D_t^ij(p), and therefore P(D_t^ij(p) | C_t^ij(p) = 0) is set to the uniform distribution. P(C_t^ij(p) = 1 | G_t^k) and P(C_t^ij(p) = 0 | G_t^k) are defined in a similar way as in [6], but while in their formulation the state of voxel k is influenced only by the background state of the pixels in φ_t^i(k), in our formulation its state is also influenced by the pixels in χ_t^ij(k). While this property adds additional dependence between the voxels, it provides more information on the state of each voxel. In fact, we rely not only on the consensus observed in the voxel's footprint φ_t^i(k), but also on the consensus observed in χ_t^ij(k). This allows us to recover from two kinds of situations, namely: when the colors of the foreground object are similar to the colors of the actual background points behind it, and when the information corresponding to the foreground object in the image R_t^ij is missing. However, our approach will not help if the colors of the actual background points in χ_t^ij(k) are also similar to the colors of the foreground element. Concerning P(G = L), we assume dependency only between neighboring voxels (26-neighborhood). In this way, Equation 7 can be solved entirely using graph cuts [64,65,61]. More precisely, the pairwise potential log P(G_t^a = L_t^a, G_t^b = L_t^b) between two neighboring voxels a and b is defined considering that if these voxels project to pixels lying on edges of the original images I_t^i, there should be a low cost for cutting across these voxels, and vice versa. To account for this, in our implementation we compute the projection of the centers of each pair of neighboring voxels a and b onto each image I_t^i. Subsequently, we check all the pixels on the line connecting these two projections, looking for an edge.
If an edge is not found, the pairwise potential is increased. To account for temporal continuity in the final mesh, the voxel state prior takes into account the labeling computed in the previous frame, according to P(G_t^a = 1) = 0.3 + ξ((L*)_{t−1}^a), where ξ defines the temporal smoothness. Once graph cuts provide a grid labeling L* as a solution to Equation 7, marching cubes [66] can be applied to obtain a continuous mesh of the dynamic object.

3.3.3 Comparison. In the previous sections we presented two techniques to segment the dynamic elements of a scene. These two methods differ from each other in the way they infer the appearance of the background. The former searches for this information over time, looking back and forth inside a time window of a single video. The latter, instead, recovers the background appearance by transferring information about the same time instant across different cameras. While the former provides pixel-level segmentation accuracy, the latter is limited by the voxel resolution, which cannot be pushed beyond a certain limit without considering calibration errors. However, since the latter approach models the background appearance independently for each time instant, it is more robust to abrupt changes in the background appearance. On the contrary, the
Fig. 7. (Left and Right) Example of segmentation obtained on the Rothman sequence using the method presented in Section 3.3.1. (Center) Reconstruction obtained using deterministic visual hull on these segmentations. It is evident that segmentation errors in even a single image significantly influence the reconstruction.
former approach would have serious problems in scenarios, such as a concert, where moving spotlights can change the background appearance quickly. Concerning the shape representations proposed by the two methods, we observe that a billboard can only make a simple planar approximation of the shape of a 3D object. Perspective artifacts will appear if the observer's viewing angle is larger than 10°. Merging multiple billboards together tends to approximate the actual geometry of the object, hence reducing the visible artifacts significantly. However, using the temporal-domain method of Section 3.3.1, this would require a large number of cameras viewing the same object. An accurate volumetric or mesh representation would also provide a nice visualization of an object. However, extracting these representations using shape-from-silhouette approaches involves the fusion of information coming from multiple cameras. This fusion is sensitive to the inevitable errors inherent in the calibration and the segmentation. Since one cannot assume that the obtained segmentations have the same accuracy, this becomes a significant issue in our scenario. For instance, an error in the segmentation or calibration of even a single camera can corrupt the entire reconstruction (e.g., see Figure 7). Probabilistic approaches, like the one proposed in Section 3.3.2, can deal with such issues, since their formulation is more robust to misleading information. However, this technique results in visible quantization artifacts, since the scene has to be discretized into voxels whose size is bounded by the available information. For the purpose of video-based rendering, we chose to use the billboard representation, since the quantization was a big issue. The following sections describe how the visual artifacts generated by the planar approximations introduced by this representation can be minimized.
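Before moving on, the per-pixel likelihood of Equations 11 and 12 can be made concrete with a small sketch. The noise scale σ, the uniform density U, and the consensus probability passed to the mixture are assumed values for illustration, not those used in the chapter.

```python
import math

SIGMA = 0.1    # assumed noise scale (sigma in Eq. 12)
UNIFORM = 1.0  # assumed uniform density U over a normalized difference range

def p_d_given_consensus(d, beta, consensus):
    """P(D(p) | C(p)), Eq. (12): a zero-centered Gaussian truncated to
    d >= 0 under consensus, uniform otherwise or when no color information
    is available (beta = 0)."""
    if not consensus or beta == 0:
        return UNIFORM
    kappa = 2.0 / (math.sqrt(2.0 * math.pi) * SIGMA)  # truncation renormalizer
    return kappa * math.exp(-d * d / (2.0 * SIGMA ** 2))

def p_d_given_voxel(d, beta, p_consensus):
    """Eq. (11) for a single pixel: marginalize over the consensus C(p).
    p_consensus stands for P(C(p) = 1 | G_t^k)."""
    return (p_d_given_consensus(d, beta, True) * p_consensus
            + p_d_given_consensus(d, beta, False) * (1.0 - p_consensus))

# Small color differences are far more likely under consensus than large ones:
assert p_d_given_consensus(0.05, 1, True) > p_d_given_consensus(0.5, 1, True)
```

Under this model, a voxel hypothesized as full makes consensus with the rendered background unlikely, so large difference values support the "full" label, while small differences support "empty".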
4 Online Navigation
Once a hybrid representation has been computed, the video collection can be navigated. In the following sections, we present our online navigation tool which allows a user to interactively explore the event from multiple viewpoints.
Fig. 8. Interactive Navigation Interface: Regular Mode (left) is a live preview of the content being rendered to the final output video. Orbit Mode (right) has the same functionality, but also depicts the scenery, performer, and moving cameras. Users can switch between the two modes, and always have jog/shuttle control over the timeline of the input footage.
4.1 User Interface
The largely GPU-driven user interface of the system lets the user smoothly navigate the video collection in both space and time. The GUI can be operated in two different modes, Regular Mode and Orbit Mode, where the same jog/shuttle and camera-transition commands are available by keyboard or mouse at all times. Those commands can be recorded and used as an edit-list for more elaborate post-processing of an output video. The Regular Mode is essentially a rendering of the event from either a real camera's perspective, or the virtual camera's transition when the user clicks on the navigation arrows (see Figure 8). The navigator icon in the lower right corner of the interface indicates the possible directions in which the user can go (up, down, left, right, forward, and backward, depending on the availability of nearby videos). Each camera's neighbors are determined relative to its image plane. Orbit Mode has a live preview window to the side, and serves primarily as a digital production control room, where the scenery, performer, and all the moving cameras are depicted as elements of a dynamic 3D world. Orbit Mode also has a video-wall option where inset views of each camera are fixed in place on the screen, but some users preferred when these individual videos played as moving screens inside the scene.

4.2 Video-Based Rendering
The user always watches the scene from the point of view of the virtual camera V. The virtual camera's intrinsic parameters K^V are assumed to be fixed and equal to those of one of the cameras recording the scene. Its extrinsic parameters E_t^V are always locked to one of the cameras of the collection, and unlocked only during a transition from one camera to another. When a camera change is requested, the virtual camera V performs the view interpolation from a starting camera A to an ending camera B over a period of time [t0, t1], with E_{t0}^V = E_{t0}^A at the start of the transition and E_{t1}^V = E_{t1}^B by the end.
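A minimal sketch of such an interpolated extrinsic path is given below, using a rotation slerp and a linear translation blend (via SciPy). The actual system additionally constrains the actor's image-space motion, as described next, so this plain interpolation is only a simplified stand-in; the poses are invented for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_extrinsics(R_A, t_A, R_B, t_B, s):
    """s in [0, 1]: s = 0 gives camera A's pose, s = 1 gives camera B's."""
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix(np.stack([R_A, R_B])))
    R_V = slerp(s).as_matrix()                # spherical rotation interpolation
    t_V = (1.0 - s) * t_A + s * t_B           # linear translation blend
    return R_V, t_V

R_A, t_A = np.eye(3), np.zeros(3)
R_B = Rotation.from_euler("y", 30, degrees=True).as_matrix()
t_B = np.array([1.0, 0.0, 0.0])

R_V, t_V = interpolate_extrinsics(R_A, t_A, R_B, t_B, 0.5)
# Halfway through, the camera has rotated 15 degrees and moved half the way.
assert np.allclose(R_V, Rotation.from_euler("y", 15, degrees=True).as_matrix())
assert np.allclose(t_V, [0.5, 0.0, 0.0])
```

Mapping the transition period [t0, t1] to s ∈ [0, 1] gives the intermediate poses E_t^V along the path.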
In order to compute the intermediate rotation and translation matrices for camera V, we use a camera interpolation technique designed to maintain a constant and linear translation of the image of the actor. Formally, given the point in space representing the barycenter of the actor, we force the image of this point in the virtual camera, at any time t ∈ [t0, t1], to be the exact linear interpolation between what it was at time t0 (in view A) and what it will be at t1 (in view B). For more details, please refer to [58]. As mentioned in Section 3.3, the shape of a dynamic element of the scene can be represented either using a set of billboards or using a volumetric representation. In the next sections, we explain how the scene can be rendered using both representations.

4.2.1 Volumetric Representation. A volumetric representation of the dynamic elements of the scene can be computed for each time instant t as described in Section 3.3.2. Once a transition is requested, the dynamic elements are rendered as part of the background geometry using an approach similar to the Unstructured Lumigraph [67]. A detailed description of this procedure is presented in Section 4.2.3. Figure 9 shows a reconstruction result obtained on the Juggler sequence. The voxel grid used was of resolution 140 × 140 × 140, covering the
Fig. 9. Results obtained on one frame of the juggler sequence. (a) Volumetric reconstruction. (b) One frame of the videos used for the reconstruction. (c) Reconstructed volume projected back to the previous image. (d) Reconstruction rendered from another viewpoint.
entire extent of the scene where the action took place. Figures 9(a) and 9(d) show one reconstructed frame where two people, one in the foreground and one in the middle ground, are present in the scene. While the volumetric reconstruction is robust enough to deal with segmentation inaccuracies and to recover the small balls being juggled by the performer, the quantization introduced by the voxelization makes this representation inappropriate for video-based rendering purposes, since the resolution of the voxel grid cannot be increased indefinitely without considering calibration and background geometry errors.

4.2.2 Billboard Representation and Transition Optimization. In this case, when a transition is requested from camera A to camera B, each dynamic element of the scene is modeled by the proxy shape of the two billboards ζ^A and ζ^B, related to cameras A and B, respectively (see Figure 10). Each billboard approximates the actor's geometry using a planar proxy, and can therefore introduce significant artifacts while one navigates between cameras. However, billboards can actually be quite effective, as long as we use them in tandem with a good measure of the expected visual disturbance. Ideally, while V is traveling along its path between A and B, V would cross-fade imperceptibly from rendering mostly the billboard ζ^A to showing mostly ζ^B. We have observed that a well-placed billboard is a convincing enough proxy shape for viewing-angle changes of around 10°, but the illusion can quickly be lost when the second billboard comes into view. The enhanced Cross Dissolve presented in [68] could help, but we have found that, if timed correctly, a cut from one billboard to the next can be almost unnoticeable. [69] made a similar observation. It is preferable for the user to confuse a sharper appearance changeover with the performer's natural ongoing motions. The best time for an appearance changeover is when the action is at its most fronto-parallel to the two cameras. Choosing
Fig. 10. As the virtual camera transitions from view A to view B, the foreground object is represented by two video sprites on planar billboards, one for each view. The video footage from each camera is rendered onto the respective billboard with the segmentation mask applied.
a bad time will reveal the actor's current 3D shape as non-planar. We will explain later the simple strategy that finds the best changeover time, but first we introduce the error measure to be optimized, namely the Inter-Billboard Distance. The Inter-Billboard Distance at time t for camera V is computed using the following procedure. Each billboard, ζ^A and ζ^B, is first rendered separately from the viewpoint of the virtual camera V at time t, using the masks α_t^A and α_t^B as texture. Those two images are then thresholded, producing two silhouette images, S1 and S2. Overlaying S2 on S1, as seen in camera V in Figure 10, allows one to evaluate how much change a user can perceive if the two billboards are suddenly swapped during the transition. The more these two images agree, the less perceptible the change is. Mathematically, the distance measure used is

$$
D(S_1, S_2) = \frac{1}{\#S_1} \sum_{m \in S_1} d(m, S_2) + \frac{1}{\#S_2} \sum_{m \in S_2} d(m, S_1), \tag{13}
$$

where m represents a pixel inside the silhouette, and d(m, S) is the ℓ2 distance between this point and a silhouette S. #S1 and #S2 represent the number of points in S1 and S2, respectively. This error can be quickly computed in a fragment shader using the distance transform [70]. We also tried a correlation-based distance, but found it less effective at matching the perceptual differences observed by the user. In fact, changes of appearance within the silhouette that occur during a changeover are often perceptually confused with subject motion.

The Inter-Billboard Distance (Equation 13) largely dictates the right moment to switch billboards. We have found that also including the start time as a parameter can lead to a much better optimum. Thus we optimize over two variables: ρ, which is the fraction of the transition interval at which the billboard changeover occurs, and Δ, which is the transition delay, i.e., the time between the user's request and the actual start of the transition. This search is similar in spirit to the approach proposed in [71] for combining motion capture data. The start time is delayed by no more than 3 sec., and the transition time was set to 1.5 sec. Since the search domain is limited and known, a fast grid search optimizes both parameters in a separate thread to preserve real-time playback. Once the user requests a transition, the exact timing is optimized as described, such that the transition is performed so as to minimize disturbing visual artifacts.

Fig. 11. (Top) Background rendered from left, right, and view-independent texture; (center) corresponding suitability maps; (bottom) final rendered background and generated motion-blurred background.

4.2.3 Rendering the Virtual Camera. During normal playback of a video in Regular Mode, the virtual camera position is locked to the real camera's extrinsics. Depending on camera A's intrinsic parameters K_t^A, the original video is played at a different size in relation to the virtual camera intrinsics K^V. Black borders are added to the video if its size is smaller than that of the virtual camera. This happens, for instance, when one camera has landscape orientation while the other is in portrait mode, or if the zooms are different. While it would be possible to adapt the intrinsic camera parameters of the virtual camera to those of the real one, doing so can create perceptually undesirable effects (i.e., the Vertigo effect). During the first 20% of the transition, the virtual camera remains locked to the original viewpoint, but the scene rendering fades from the original video to the synthetically rendered scene (at which point the black borders disappear). Then the virtual camera starts moving along the computed transition path while the video is still playing. Like the start of the transition, the virtual camera is locked to the target camera position for the last 20% of the transition, when the video of the target camera fades in.
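The lock-and-fade schedule described above can be sketched as a simple weight function over the transition progress. Only the 20% lock intervals come from the text; the linear cross-fades are an assumption for illustration.

```python
def transition_weights(s, lock=0.2):
    """Blend weights at transition progress s in [0, 1].

    Returns (w_source_video, w_rendered_scene, w_target_video): the camera
    stays locked to the source view while fading to the rendered scene for
    the first `lock` fraction, moves on the rendered scene in between, and
    locks to the target view while the target video fades in at the end.
    """
    if s < lock:                      # locked to source: video -> rendered
        f = s / lock
        return (1.0 - f, f, 0.0)
    if s > 1.0 - lock:                # locked to target: rendered -> video
        f = (s - (1.0 - lock)) / lock
        return (0.0, 1.0 - f, f)
    return (0.0, 1.0, 0.0)            # camera moving along the path

assert transition_weights(0.0) == (1.0, 0.0, 0.0)   # pure source video
assert transition_weights(0.5) == (0.0, 1.0, 0.0)   # pure rendered scene
```

The middle interval, where only the rendered scene is shown, is where the view interpolation and billboard changeover take place.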
During the entire transition, the video is rendered using the color space of the original camera. This is done by using precomputed 3 × 3 color transformations, approximately mapping the appearance between videos, and also from the view-independent texture to the videos. Only during the last 20% is the appearance gradually transformed to that of the target video. Next, the middle of the synthetically rendered video transition is created. Although a very large amount of footage is available for rendering, a real-time rendering application must take bandwidth and other system hardware limitations into account. Using all the available videos, masks, and background videos simultaneously would require far too many resources to render the scene interactively. To render a transition from camera A to camera B, we chose to load and use only data extracted from videos A and B, and the static information of the scene. These two cameras are normally also the closest to the virtual camera path, and the benefits of using more videos are often limited. This trade-off is similar to the one made for IBR of static scenes by [72], where at most three views were used to texture each scene element. We adapt the Unstructured Lumigraph Rendering framework [67] to cope with the fact that some parts of the background scene are occluded by the foreground, and that we can only afford to use two videos. At each time t, the images I_t^A and
I_t^B are used to color the geometry of the background scene, as in [67]. The generated α-masks, α_t^A and α_t^B, are used to mask the foreground pixels of I_t^A and I_t^B, respectively. Three images of the scene from the point of view of V are generated: the first one uses only the color information from I_t^A, the second uses colors from I_t^B, while the last one uses the view-independent texture extracted in the preprocessing stage. The view-independent texture is necessary because, on the path between A and B, the virtual camera can see parts of the scene that are hidden in both A and B. For each generated image, a per-pixel suitability mask is generated in parallel, taking into account the α-masks (i.e., whether a pixel is background or not), occlusions, and viewing angles. Occlusions are handled by rendering the depth maps of both I_t^A and I_t^B. We use the angle differences with respect to the surface normals to weight each pixel from the two sources. This is important in the presence of miscalibrations and geometry errors. The suitability mask of the image generated using the static texture is given a constant low value, so that its colors are used only where neither of the other images can provide useful information. After the suitability has been computed, a dilation/erosion and smoothing filter is applied to ensure a spatially smooth transition between the textures during the blending, and to account for discontinuities and blobs that can appear due to occlusion handling and matting errors. The entire procedure is implemented on the GPU using a 3-pass rendering. As a final step, a motion blur filter is applied to all the pixels belonging to the background scene, which makes the foreground object stand out. This is a user-controllable option in the software, and we found that, when enabled, the user's attention is focused on the performer, i.e., the center of the action, while the motion blur gives peripheral cues about the direction and the speed of the transition.
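A minimal sketch of this suitability-weighted blend is given below, with placeholder renders and suitability maps standing in for the real α-mask, occlusion, and viewing-angle terms.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4

render_A   = np.full((H, W, 3), 0.8)   # background colored from I_t^A
render_B   = np.full((H, W, 3), 0.4)   # background colored from I_t^B
render_tex = np.full((H, W, 3), 0.6)   # view-independent texture

suit_A   = rng.random((H, W))          # placeholder per-pixel suitability
suit_B   = rng.random((H, W))
suit_tex = np.full((H, W), 0.05)       # constant low value for static texture

stack   = np.stack([render_A, render_B, render_tex])     # (3, H, W, 3)
weights = np.stack([suit_A, suit_B, suit_tex])           # (3, H, W)
weights = weights / weights.sum(axis=0, keepdims=True)   # normalize per pixel
blended = (weights[..., None] * stack).sum(axis=0)       # final (H, W, 3)

assert blended.shape == (H, W, 3)
```

Because the weights are normalized per pixel, the result is a convex combination of the three candidate colors; the low constant weight of the static texture means it only dominates where both video-based renders are unsuitable.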
The whole background rendering approach is illustrated in Figure 11. The foreground elements of the scene are then rendered using a similar technique with the images I_t^A and I_t^B, and the appropriate alpha masks α_t^A and α_t^B. In the case of billboards, only the two billboards ζ^A and ζ^B are rendered for each dynamic element, and the transition from ζ^A to ζ^B is decided as described in Section 3.3.1.
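To close out the rendering pipeline, the transition optimization of Section 4.2.2 can be sketched as follows: the Inter-Billboard Distance of Equation 13 computed with distance transforms (here via SciPy rather than a fragment shader), and a grid search over (ρ, Δ). The silhouette renderer is a hypothetical callback standing in for rendering and thresholding the two billboards.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def inter_billboard_distance(S1, S2):
    """Symmetric mean l2 distance between two binary silhouettes, Eq. (13)."""
    d_to_S2 = distance_transform_edt(~S2)   # distance of every pixel to S2
    d_to_S1 = distance_transform_edt(~S1)
    return d_to_S2[S1].mean() + d_to_S1[S2].mean()

def best_changeover(render_silhouettes, rhos, deltas):
    """Grid search over the changeover fraction rho and the delay delta;
    render_silhouettes(rho, delta) -> (S1, S2) is a hypothetical callback."""
    return min(((r, d) for r in rhos for d in deltas),
               key=lambda p: inter_billboard_distance(*render_silhouettes(*p)))

# Identical silhouettes: a perfectly timed changeover has zero distance.
S = np.zeros((8, 8), dtype=bool)
S[2:6, 2:6] = True
assert inter_billboard_distance(S, S) == 0.0
```

Since the search domain is small and known, an exhaustive grid evaluation like this one is cheap enough to run in a separate thread, as the text describes.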
5 Experiments
There are many casually filmed events, but multi-view footage in the public domain is so far readily available only when citizen journalists are provided with a designated portal for submissions, e.g., after a U2 concert. We obtained the Climber and Dancer datasets from [30] and from INRIA Grenoble Rhône-Alpes, respectively, and the Juggler, Magician, and Rothman data by attending real events, handing out cameras to members of the public with instructions to play with the settings, and, where needed, obtaining signatures allowing for the use and dissemination of the footage. These events were chosen because, together, they explore a variety of challenges in terms of inter-camera distance, large out-of-plane motion, fast performances of skill, complicated outdoor and indoor lighting conditions, and intrusive objects in the field of view. We performed the manual
part of the process ourselves, labeling the performer in two frames per video, and locating ∼40 12 MP photos of each new environment. The Climber videos are 720 × 544, the Dancer videos are 780 × 582, and the new footage measures 960 × 544 pixels, with people filming in landscape or portrait mode (or switching), with different settings for zoom, automatic gain, and white balance. Some people adjusted these manually at times. Naturally, the results of this interactive system are best evaluated in video, so please see our web site [73], where a fully working prototype of our video-based rendering player is also downloadable. Among the videos available on our web site, several demonstrate specific stages of the algorithm, such as rendering/formatting, and several show events produced by volunteer test subjects. Similar colors on the performer and the background are inevitable, which our initial segmentation confirms repeatedly. Even drastically increasing the amount of training data had no effect. The γ masks are frequently exaggerated in size, but, that being only an intermediate stage, this simply meant that the adaptive scene renderer had to seek further out in the timeline to obtain enough samples. With our new form of background subtraction, even significant imperfections in the reconstructed scene geometry did not hinder us from pulling a useful matte, probably because those imperfections coincided with textureless areas. The bigger segmentation problems occur when the subject exhibits significant motion blur, because mixed pixels can match the rendered background quite well. Clutter in the scene is caused both by objects that change and by people who move around the performer. For the Juggler sequence, there are sufficient moving cameras to obtain a reconstruction of all the dynamic elements.
However, while scenes like Magician and Rothman have enough cameras in positions to triangulate billboards (of the performer and the clutter), their coverage is sparse, and their calibrations and segmentations are off by too much to yield acceptable 3D shapes. We also experimented with computing height fields to augment our billboards, but without structured lights like those of [20], the results were disappointing. These findings seem consistent with [74]. The modicum of clutter in the scenes we tested was handled with relatively few artifacts, because elements that were rejected from the background model either ended up as middle-ground billboards due to their 3D separation from the performer, or, when incorrectly merged with the performer in one view, were deemed too costly by the transition optimization. The current prototype is real-time, running at 25 fps on an Intel i7 2.93 GHz quad-core with 8 GB of memory, an NVIDIA GTX 280 GPU, and a normal 7200 rpm hard drive. Even events filmed with at least six cameras can be explored without impacting performance, because videos are streamed locally, and could be subsampled if HD footage were available. The information necessary for the next frames is preloaded by a separate thread to allow undisturbed real-time rendering. Each scene's geometry takes ∼1 hr to reconstruct and is automatic, except that, part way through, the pipeline presented in [43] requires the user to designate a
Fig. 12. Examples of different transitions for the Juggler and the Rothman sequences. In the Juggler sequence, the three consecutive frames span the best changeover (i.e., switch between billboards) found within a given time frame. The optimization is successful if this changeover is hard to perceive. The background can be motion blurred or not.
bounding box for the volume reconstruction and, afterwards, fit a plane to the ground. After this user effort, the automatic processing takes multiple hours. In the left column of Figure 12, three of our many example transitions between different cameras are shown for the Juggler sequence. The right column shows an example transition from the Rothman sequence.
6 Conclusions
In this chapter we explored the possibility of interactively navigating a casually captured video collection of a performance. We proposed an offline preprocessing stage to recover the geometry of all the static elements of the scene and to calibrate the video collection spatially, temporally, and photometrically relative to this geometry.
We proposed two different approaches to obtain the reconstruction of the dynamic elements of the scene, both exploiting the previously computed information. These methods differ from each other in the way they infer the appearance of the static geometry. The first method searches for this information over time, looking back and forth inside a time window of each video independently. The second, instead, recovers the background appearance by transferring color information about the same time instant across different videos. While the first method provides pixel-level segmentation accuracy, the second is limited by the voxel resolution, which is bound to be more than one pixel. Since the second approach models the background appearance independently for each time instant, the final segmentation is more robust to abrupt changes in the background appearance. For the purpose of interactively navigating a video collection, we chose to use the first method to represent the shape of the dynamic elements of the scene, since it offers better visualization quality during the rendering process. To overcome the limitations introduced by the billboard representation during the rendering process, we presented an optimization technique aimed at reducing the visual artifacts. In the future, the two methods could be combined to benefit from the strengths of each, depending on the scenario. In summary, we presented an approach to convert a collection of handheld videos into a digital performance that can easily be navigated. We encourage the reader to try this visualization tool, available on our web site [73]. Acknowledgements and Credits. The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement #210806, and from the Packard Foundation. We are grateful for the support of the SNF R'Equip program.
Some parts of this chapter contain a revised and re-adapted version of two of our previously published works, namely [58] and [59]. In particular, Figures 8, 10, 11, 12 were taken from [58] (© 2010 Association for Computing Machinery, Inc. Reprinted by permission) and Figures 4, 5, 6, 9 were taken from [59] with kind permission from Springer Science+Business Media.
References
1. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. In: SIGGRAPH Conference Proceedings, pp. 835–846 (2006)
2. Snavely, N., Garg, R., Seitz, S.M., Szeliski, R.: Finding paths through the world's photos. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2008) 27, 11–21 (2008)
3. Goesele, M., Ackermann, J., Fuhrmann, S., Haubold, C., Klowsky, R., Steedly, D., Szeliski, R.: Ambient point clouds for view interpolation. In: SIGGRAPH (2010)
4. Kim, H., Sarim, M., Takai, T., Guillemaut, J.Y., Hilton, A.: Dynamic 3D scene reconstruction in outdoor environments. In: 3DPVT (2010)
5. Guan, L., Franco, J.S., Pollefeys, M.: Multi-object shape estimation and tracking from silhouette cues. In: CVPR (2008)
6. Franco, J.S., Boyer, E.: Fusion of multi-view silhouette cues using a space occupancy grid. In: ICCV, pp. 1747–1753 (2005)
7. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: Proceedings of ACM SIGGRAPH, pp. 369–374 (2000)
8. Sarim, M., Hilton, A., Guillemaut, J.Y., Kim, H., Takai, T.: Multiple view wide-baseline trimap propagation for natural video matting. In: 2010 Conference on Visual Media Production (CVMP), pp. 82–91 (2010)
9. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR (2006)
10. Seitz, S.M., Dyer, C.R.: Photorealistic scene reconstruction by voxel coloring. In: CVPR, p. 1067 (1997)
11. Furukawa, Y., Ponce, J.: Dense 3D motion capture for human faces. In: CVPR, pp. 1674–1681 (2009)
12. Vlasic, D., Peers, P., Baran, I., Debevec, P., Popović, J., Rusinkiewicz, S., Matusik, W.: Dynamic shape capture using multi-view photometric stereo. In: SIGGRAPH Asia (2009)
13. Ahmed, N., Theobalt, C., Dobrev, P., Seidel, H.P., Thrun, S.: Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In: CVPR (2008)
14. Hernández, C., Vogiatzis, G., Cipolla, R.: Shadows in three-source photometric stereo. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 290–303. Springer, Heidelberg (2008)
15. Vedula, S., Baker, S., Seitz, S., Kanade, T.: Shape and motion carving in 6D. In: CVPR (2000)
16. Goldlucke, B., Ihrke, I., Linz, C., Magnor, M.: Weighted minimal hypersurface reconstruction. PAMI, 1194–1208 (2007)
17. Hilton, A., Starck, J.: Multiple view reconstruction of people. In: 3DPVT (2004)
18. Sinha, S.N., Pollefeys, M.: Multi-view reconstruction using photo-consistency and exact silhouette constraints: A maximum-flow formulation. In: ICCV, pp. 349–356 (2005)
19. Tung, T., Nobuhara, S., Matsuyama, T.: Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In: ICCV (2009)
20. Waschbüsch, M., Würmlin, S., Gross, M.H.: 3D video billboard clouds. Computer Graphics Forum (Proc. Eurographics EG 2007) 26, 561–569 (2007)
21. Ballan, L., Cortelazzo, G.M.: Marker-less motion capture of skinned models in a four camera setup using optical flow and silhouettes. In: 3DPVT (June 2008)
22. Carranza, J., Theobalt, C., Magnor, M.A., Seidel, H.P.: Free-viewpoint video of human actors. ACM Transactions on Graphics, 569–577 (2003)
23. Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics 27, 97:1–97:9 (2008)
24. de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM Trans. Graph. 27, 1–10 (2008)
25. Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-quality video view interpolation using a layered representation. ACM Transactions on Graphics 23, 600–608 (2004)
26. Kanade, T.: Carnegie Mellon goes to the Superbowl (2001), http://www.ri.cmu.edu/events/sb35/tksuperbowl.html
27. Würmlin, S., Niederberger, C.: Realistic virtual replays for sports broadcasts (2010), http://www.liberovision.com/
3D Reconstruction and VideoBased Rendering of Casually Captured Videos
101
28. Guillemaut, J.Y., Kilner, J., Hilton, A.: Robust graphcut scene segmentation and reconstruction for freeviewpoint video of complex dynamic scenes. In: ICCV (2009) 29. Hayashi, K., Saito, H.: Synthesizing freeviewpoint images from multiple view videos in soccer stadium. In: CGIV, pp. 220–225 (2006) 30. Hasler, N., Rosenhahn, B., Thorm¨ ahlen, T., Wand, M., Gall, J., Seidel, H.P.: Markerless motion capture with unsynchronized moving cameras. In: CVPR, pp. 224– 231 (2009) 31. Lipski, C., Linz, C., Berger, K., Sellent, A., Magnor, M.: Virtual video camera: Imagebased viewpoint navigation through space and time. Computer Graphics Forum 29, 2555–2568 (2010) 32. Eisemann, M., Decker, B.D., Magnor, M., Bekaert, P., de Aguiar, E., Ahmed, N., Theobalt, C., Sellent, A.: Floating Textures. Computer Graphics Forum (Proc. Eurographics EG 2008) 27, 409–418 (2008) 33. Seitz, S.M., Dyer, C.R.: View morphing. In: Proceedings of ACM SIGGRAPH, pp. 21–30 (1996) 34. Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual modeling with a handheld camera. IJCV 59, 207–232 (2004) 35. Lhuillier, M., Quan, L.: A quasidense approach to surface reconstruction from uncalibrated images. IEEE Trans. Pattern Anal. Mach. Intell. 27, 418–433 (2005) 36. Ballan, L., Cortelazzo, G.M.: Multimodal 3D shape recovery from texture, silhouette and shadow information. In: 3DPVT. Chapel Hill, USA (2006) 37. Campbell, N.D., Vogiatzis, G., Hern´ andez, C., Cipolla, R.: Automatic 3d object segmentation in multiple views using volumetric graphcuts. In: 18th British Machine Vision Conference, vol. 1, pp. 530–539 (2007) 38. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multiview stereo for community photo collections. In: ICCV, pp. 1–8 (2007) 39. Ballan, L., Brusco, N., Cortelazzo, G.M.: 3D Content Creation by Passive Optical Methods. In: 3D Online Multimedia and Games: Processing, Visualization and Transmission. 
World Scientiﬁc Publishing, Singapore (2008) 40. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multiview stereo reconstruction algorithms. In: CVPR, pp. 519–528 (2006) 41. Lowe, D.G.: Distinctive image features from scaleinvariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 42. Gallup, D., Frahm, J.M., Mordohai, P., Yang, Q., Pollefeys, M.: Realtime planesweeping stereo with multiple sweeping directions. In: CVPR (2007) 43. Zach, C., Pock, T., Bischof, H.: A globally optimal algorithm for robust tvl1 range image integration. In: ICCV (2007) 44. Sheﬀer, A., Praun, E., Rose, K.: Mesh parameterization methods and their applications. Foundations and Trends in Computer Graphics and Vision 2, 105–171 (2006) 45. Brusco, N., Ballan, L., Cortelazzo, G.M.: Passive reconstruction of high quality textured 3D models of works of art. In: 6th International Symposium on Virtual Reality, Archeology and Cultural Heritage, VAST (2005) 46. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000); ISBN: 0521623049 47. Arulampalam, M.S., Maskell, S., Gordon, N.: A tutorial on particle ﬁlters for online nonlinear/nongaussian bayesian tracking. IEEE Trans. Signal Processing 50, 174– 188 (2002)
102
A. Taneja et al.
48. Sinha, S.N., Pollefeys, M.: Synchronization and calibration of camera networks from silhouettes. In: ICPR 2004: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR 2004), vol. 1, pp. 116–119 (2004) 49. Tuytelaars, T., Van Gool, L.: Synchronizing video sequences. In: CVPR, vol. 1, pp. 762–768 (2004) 50. Reinhard, E., Adhikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Computer Graphics and Applications 21, 34–41 (2001) 51. Baumberg, A., Hogg, D.: An eﬃcient method for contour tracking using active shape models. In: Motion of NonRigid and Articulated Objects, pp. 194–199 (1994) 52. Leibe, B., Cornelis, N., Cornelis, K., Gool, L.V.: Dynamic 3d scene analysis from a moving vehicle. In: CVPR (2007) 53. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90, 1151–1163 (2002) 54. Sheikh, Y., Javed, O., Kanade, T.: Background subtraction for freely moving cameras. In: ICCV (2009) 55. Bai, X., Wang, J., Simons, D., Sapiro, G.: Video snapcut: robust video object cutout using localized classiﬁers. ACM Trans. Graph. 28 (2009) 56. Wang, J., Bhat, P., Colburn, R.A., Agrawala, M., Cohen, M.F.: Interactive video cutout. ACM Trans. Graph. 24, 585–594 (2005) 57. Sun, J., Zhang, W., Tang, X., Shum, H.Y.: Background cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006) 58. Ballan, L., Brostow, G.J., Puwein, J., Pollefeys, M.: Unstructured videobased rendering: Interactive exploration of casually captured videos. ACM Transactions on Graphics (Proceedings of SIGGRAPH), 1–11 (2010), http://doi.acm.org/10.1145/1833349.1778824 59. Taneja, A., Ballan, L., Pollefeys, M.: Modeling dynamic scenes recorded with freely moving cameras. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 613–626. 
Springer, Heidelberg (2011) 60. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 790–799 (1995) 61. Boykov, Y., Kolmogorov, V.: An experimental comparison of mincut/maxﬂow algorithms for energy minimization in vision. PAMI 26, 1124–1137 (2004) 62. RavAcha, A., Kohli, P., Rother, C., Fitzgibbon, A.: Unwrap mosaics: A new representation for video editing. ACM Transactions on Graphics (SIGGRAPH 2008) (2008) 63. Chuang, Y.Y., Curless, B., Salesin, D.H., Szeliski, R.: A bayesian approach to digital matting. In: Proceedings of IEEE CVPR 2001, Kauai, Hawaii, vol. 2, pp. 264–271 (2001) 64. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. PAMI 23, 1222–1239 (2001) 65. Kolmogorov, V., Zabin, R.: What energy functions can be minimized via graph cuts? PAMI 26, 147–159 (2004) 66. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. SIGGRAPH 21, 163–169 (1987) 67. Buehler, C., Bosse, M., McMillan, L., Gortler, S.J., Cohen, M.F.: Unstructured lumigraph rendering. In: SIGGRAPH, pp. 425–432 (2001)
3D Reconstruction and VideoBased Rendering of Casually Captured Videos
103
68. Grundland, M., Vohra, R., Williams, G.P., Dodgson, N.A.: Cross dissolve without cross fade: Preserving contrast, color and salience in image compositing. In: Proceedings of EUROGRAPHICS, Computer Graphics Forum, pp. 577–586 (2006) 69. Sch¨ odl, A., Szeliski, R., Salesin, D.H., Essa, I.: Video textures. In: SIGGRAPH 2000: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 489–498 (2000) 70. Rong, G., Tan, T.S.: Jump ﬂooding in gpu with applications to voronoi diagram and distance transform. In: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), pp. 109–116. ACM, New York (2006) 71. Wang, J., Bodenheimer, B.: Synthesis and evaluation of linear motion transitions. ACM Trans. Graph 27, 1–15 (2008) 72. Debevec, P., Borshukov, G., Yu, Y.: Eﬃcient viewdependent imagebased rendering with projective texturemapping. In: 9th Eurographics Workshop on Rendering (1998) 73. Unstructured VBR, http://www.cvg.ethz.ch/research/unstructuredvbr/ 74. Kilner, J., Starck, J., Hilton, A.: A comparative study of freeviewpoint video techniques for sports events. In: CVMP (2006)
Silhouette-Based Variational Methods for Single View Reconstruction

Eno Töppe¹,², Martin R. Oswald¹, Daniel Cremers¹, and Carsten Rother²

¹ Technische Universität München, Germany
² Microsoft Research, Cambridge, UK
Abstract. We explore the 3D reconstruction of objects from a single view within an interactive framework by using silhouette information. In order to deal with the highly ill-posed nature of the problem we propose two different reconstruction priors, a shape prior and a volume prior, and cast them into a variational problem formulation. For both priors we show that the corresponding relaxed optimization problem is convex. This leads to unique solutions which are independent of initialization and which are either globally optimal (shape prior) or can be shown to lie within a bound of the optimal solution (volume prior). We analyze properties of the proposed priors with regard to the reconstruction results as well as their impact on the minimization problem. By employing an implicit volumetric representation, our reconstructions enjoy complete topological freedom. Being parameter-based, our interactive reconstruction tool allows for intuitive and easy-to-use modeling of the reconstruction result.

Keywords: Single View Reconstruction, Image-Based Modeling, Convex Optimization.
1 Introduction

1.1 Single View Reconstruction
The general problem of 3D reconstruction has been considered in a plethora of works in computer vision, at least in the case of given multiple views and in stereo vision. With the help of multi-view concepts like point correspondences and photo-consistency it has been shown that high-quality reconstructions can be inferred from a set of photographs of a single object. However, there are relatively few works on single view reconstruction, although its underlying problem may be considered one of the most fundamental in vision. This may to a great extent be due to the high ill-posedness of the corresponding mathematical problem, but is nevertheless astonishing, as humans excel in solving the task in everyday life. The main difficulty in inferring 3D geometry from a single image lies in the fact that the problem is inherently ill-posed: the process of image formation is not invertible, and it is impossible to retrieve exact depth values from a single image. Thus, we have to make use of a strong prior.

D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 104–123, 2011. © Springer-Verlag Berlin Heidelberg 2011

Such priors can either be obtained by
statistical learning of shape or by restraining the solution space, e.g. by assuming smoothness and compactness. In addition to a prior, user input can be incorporated into the reconstruction process, which can be realized as a modeling tool. While the growing amount of image data on the Internet increases the availability of multiple views for certain scenes, this only applies to a few places of strong public interest, such as tourist hot spots. Single view reconstruction becomes particularly important in situations where a rough estimate of object geometry is desired rather than an exact reconstruction. This is the case when generating an alternate view of a single photograph or changing the illumination of the depicted scene. In this work we follow the idea of modeling an object from a single view gradually by user input, but with the ultimate goal of keeping the process simple for the user. Instead of an involved modeling stage that amounts to the specification of absolute vertex positions and normal directions, we rather rely on user-provided global and local constraints that, together with a strong prior, lead to a reconstruction estimate. This work recapitulates two different priors for single view reconstruction, which were proposed in [1] and [2]. One is based on a shape prior formulation; the other amounts to a global constraint on the volume of the reconstruction. We evaluate both approaches and compare them to each other.

1.2 Issues and Related Work
Existing work on single view reconstruction and on interactive 3D modeling can be roughly classified into the categories planar versus curved and implicit versus parametric approaches. Many approaches, such as that of Horry et al. [3], aim to reconstruct planar surfaces by evaluating user-defined vanishing points and lines. This has been extended by Liebowitz [4] and Criminisi [5]. The process has been completely automated by Hoiem et al. [6], yielding appealing results on a limited number of input images. Sturm et al. [7] make use of user-specified constraints such as coplanarity, parallelism and perpendicularity in order to reconstruct piecewise planar surfaces. An early work on the reconstruction of curved objects is that of Terzopoulos et al. [8], in which symmetry-seeking models are reconstructed from a user-defined silhouette and symmetry axis using snakes. However, this approach is restricted to the class of tube-like shapes. Moreover, reconstructions are merely locally optimal. The work of Zhang et al. [9] addresses this problem and proposes a model which globally optimizes a smoothness criterion. However, it concentrates on estimating height fields rather than reconstructing real 3D representations. Moreover, it requires a huge amount of user interaction in order to obtain appealing reconstructions. Also related to the field are easy-to-use tools like Teddy [10] and FiberMesh [11], which have pioneered sketch-based modeling but are not image-based.
All of the cited works use explicit surface representations. While surface manipulation is often straightforward with such representations, and a variety of cues are easily integrated, leading to respective forces or constraints on the surface, there are two major limitations: First, numerical solutions are generally not independent of the choice of parameterization. Second, parametric representations are not easily extended to objects of varying topology. While Prasad et al. [12] were able to extend their approach to surfaces with one or two holes, the generalization to objects of arbitrary topology is by no means straightforward and quite involved for the user. Similarly, topology-changing interaction in the FiberMesh system requires a complex remeshing of the modeled object, leading to computationally challenging numerical optimization schemes. For these reasons, in this work we pursue an implicit representation of the reconstructed object. Joshi et al. [13] also suggest a silhouette-based surface inflation method and minimize an energy similar to [9] or [12] in order to obtain a smooth surface. However, like Zhang et al. [9], Joshi et al. aim to reconstruct depth maps rather than full 3D objects. Another problem of all existing works is the fact that they resort to inflation heuristics in order to avoid surface collapsing. These techniques boil down to fixing absolute depth values, which undesirably restricts the solution space. We show that a prior on the volume of the reconstruction solves this problem. A precursor to volume constraints are the volume inflation terms pioneered for deformable models by Cohen and Cohen [14]. However, no constant volume constraints were considered and no implicit representations were used.

1.3 Contribution
In this paper, we focus on the reconstruction of curved objects of arbitrary topology with a minimum of user input in an interactive and intuitive framework. We propose a convex variational method which generates a 3D object in a matter of seconds using silhouette information only. To this end, we revert to an implicit representation of the surface given by the indicator function of its interior (sometimes referred to as voxel occupancy). In this representation, the weighted minimal surface problem is a convex functional, and relaxation of the binary function leads to an overall convex problem. Two approaches are presented to overcome the ambiguity in the reconstruction process: In the first one we formulate a shape prior which determines the basic shape and at the same time inflates the reconstruction geometry. In the second approach we introduce a constraint on the volume of the reconstruction. We discuss advantages and shortcomings of both approaches. In both cases we detail how to solve the resulting optimization problem by means of relaxation. This leads to a solution to the unrelaxed problem that is globally optimal in the case of the shape prior, and that we show to lie within a bound of the optimum in the case of the volume prior.
[Figure 1 panels: user strokes for segmentation; silhouette; first estimate; final result]
Fig. 1. The basic workflow of the single view reconstruction process: The user marks the input image with scribbles (left), from which a silhouette is generated by segmentation (second from left). A first reconstruction estimate is generated automatically from the silhouette (third from left). The user can then iteratively adapt the model in an interactive manner (right).
2 Reconstruction Workflow
A good silhouette is the main prerequisite for a reasonable reconstruction result with the algorithms proposed in Sections 4 and 5. The number of holes in the segmentation of the target object determines the topology of the reconstructed surface. Notably, the proposed reconstruction methods can also cope with disconnected regions of the object silhouette. The segmentation is obtained with an interactive graph cuts scheme similar to the ones described in [15] and [16]. The algorithm calculates two distinct regions based on respective color histograms, which are defined by representational pen strokes given by the user (see Fig. 1). From the input image and silhouette a first reconstruction is generated automatically, which, depending on the complexity and the class of the object, can already be satisfactory. However, for some object classes, and due to the general over-smoothing of the resulting mesh (see Section 3), the user can subsequently adapt the reconstruction by specifying intuitive and simple global and local constraints. These editing tools are completely parameter-based. The editing stage can be reiterated by the user until the desired result is obtained.
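The histogram side of this segmentation step can be sketched as a per-pixel likelihood test between two quantized color histograms built from the user scribbles. This is only a sketch: it omits the graph-cut smoothness term of [15,16], and the function names and bin count are illustrative assumptions.

```python
import numpy as np

def scribble_histograms(image, fg_mask, bg_mask, bins=8):
    """Build smoothed, normalized color histograms from user scribbles.

    image: (H, W, 3) uint8; fg_mask / bg_mask: boolean scribble masks.
    """
    def hist(mask):
        pix = image[mask] // (256 // bins)           # quantize each channel
        idx = pix[:, 0] * bins * bins + pix[:, 1] * bins + pix[:, 2]
        h = np.bincount(idx, minlength=bins ** 3).astype(float)
        return (h + 1.0) / (h.sum() + bins ** 3)     # Laplace smoothing

    return hist(fg_mask), hist(bg_mask)

def classify(image, h_fg, h_bg, bins=8):
    """Label a pixel foreground where its color is likelier under h_fg."""
    pix = image.reshape(-1, 3) // (256 // bins)
    idx = pix[:, 0] * bins * bins + pix[:, 1] * bins + pix[:, 2]
    return (h_fg[idx] > h_bg[idx]).reshape(image.shape[:2])
```

In the actual system this unary likelihood would be combined with a pairwise smoothness term and minimized by graph cuts; the pure histogram test above already separates regions of clearly distinct color.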
3 Implicit Variational Surfaces
Assume we are given the silhouette of an object in an image, as returned by an interactive segmentation tool. The goal is then to obtain a smooth 3D model of the object which is consistent with the silhouette. How should we select the correct 3D model among the infinitely many that match the silhouette? Clearly, we need to impose additional information; at the same time we want to keep this information to a minimum, since user interaction is always tedious and slow. Formally, we are given an image plane Ω ⊂ R³ which contains the input image. As part of the image we also have an object silhouette Σ ⊂ Ω. We now seek to compute reconstructions as weighted minimal surfaces S ⊂ R³ that are compliant with the object silhouette Σ:
min_S ∫_S g(s) ds    subject to    π(S) = Σ        (1)
where π : R³ → Ω is the orthographic projection onto the image plane Ω, g : R³ → R⁺ is a smoothness weighting function, and s ∈ S is an element of the surface S. We now introduce an implicit representation by replacing the surface S with the binary indicator function u ∈ BV(R³; {0,1}) of its interior, representing the voxel occupancy (0 = exterior, 1 = interior), where BV denotes the functions of bounded variation [17]. The desired weighted minimal surface area is then given by minimizing the total variation (TV) over a suitable set UΣ of feasible functions u:

min_{u ∈ UΣ} ∫ g(x) |∇u(x)| d³x        (2)
where ∇u denotes the derivative in the distributional sense. Eq. (2) favors smooth solutions. However, smoothness is locally affected by the function g : R³ → R⁺, which will be used later for modeling. Without any modeling, g is constant by default, i.e. g(x) ≡ 1. What does the set UΣ of feasible functions look like? For simplicity, we assume the silhouette to be enclosed by the surface. Then all surface functions that are consistent with the silhouette Σ must be in the set

UΣ = { u ∈ BV(R³; {0,1})  |  u(x) = 0 if π(x) ∉ Σ;  u(x) = 1 if x ∈ Σ;  u(x) arbitrary otherwise }        (3)

Obviously, solving problem (1)/(2) over this set results in the silhouette itself. Therefore we need further assumptions in order to rule out trivial solutions. In the subsequent sections we propose two different approaches to the problem.

Using the Weighting Function for Modeling. The weight g(x) of the TV-norm in Eq. (2) can be used to locally control the smoothness of the reconstruction: with a low value 0 ≤ g(x) < 1, the smoothness condition on the surface is locally relaxed, allowing creases and sharp edges to form. Conversely, a higher value g(x) > 1 locally enforces surface smoothness. For controlling the weighting function we employ a user scribble interface. The parameter associated to each scribble sets the local smoothness g(x) within the respective scribble area and is propagated through the volume along the projection direction. This approach of parametric local smoothness adaptation can be applied in the case of a data term (Section 4) as well as in the case of a constant volume constraint (Section 5).
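On a discrete voxel grid, the feasible set (3) and the scribble-controlled weight g translate directly into masks. A minimal sketch, assuming an orthographic projection along the first grid axis and illustrative function names:

```python
import numpy as np

def silhouette_constraints(sil, depth, plane=0):
    """Voxel masks encoding Eq. (3) for an orthographic projection along axis 0.

    sil: (H, W) boolean silhouette.
    Returns (fixed0, fixed1, free): u = 0, u = 1, and unconstrained voxels.
    """
    proj_in = np.broadcast_to(sil, (depth,) + sil.shape)
    fixed0 = ~proj_in                       # u(x) = 0 where the ray misses Sigma
    fixed1 = np.zeros_like(fixed0)
    fixed1[plane] = sil                     # u(x) = 1 on the silhouette itself
    free = proj_in & ~fixed1                # u(x) arbitrary on the rest
    return fixed0, fixed1, free

def smoothness_volume(shape, scribbles, depth, default=1.0):
    """Propagate per-scribble smoothness values g along the projection rays.

    scribbles: list of (mask, value) pairs on the image plane.
    """
    g2d = np.full(shape, default)
    for mask, value in scribbles:
        g2d[mask] = value
    return np.broadcast_to(g2d, (depth,) + shape).copy()
```

The three masks partition the voxel grid, which is exactly the case distinction of Eq. (3); the weight volume is constant along each viewing ray, matching the scribble propagation described above.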
4 Inflation via Shape Prior
By introducing a data term, we realize two objectives: volume inflation and determination of the basic reconstructed shape. Since there is no inherent data term in the single view setting, we have to define one heuristically. We choose a term of the following form:

∫ u(x) φ(x) d³x        (4)

The function φ : R³ → R can be adapted to achieve the desired object shape and may also be adapted by user interaction later on. Adding this term to the energy in Equation (2) amounts to the following problem:

min_{u ∈ UΣ}  ∫ u(x) φ(x) d³x + λ ∫ g(x) |∇u(x)| d³x        (5)
where λ is a weighting parameter that determines the relative smoothness of the solution. In order to fix a definition for φ, we make the simple assumption that the thickness of the observed object increases as we move inward from its silhouette. For any point p ∈ V let

dist(p, ∂S) = min_{s ∈ ∂S} ‖p − s‖        (6)

denote its distance to the silhouette contour ∂S ⊂ Ω. Then we set

φ(x) = −1 if dist(x, Ω) ≤ h(π(x)),  +1 otherwise        (7)

where the height map h : Ω → R depends on the distance of the projected 3D point to the silhouette according to the function

h(p) = min( μ_cutoff , μ_offset + μ_factor · dist(p, ∂S)^k )        (8)

with four parameters k, μ_offset, μ_factor, μ_cutoff ∈ R⁺ affecting the shape of the function φ. How the user can employ these parameters to modify the computed 3D shape will be discussed in the following paragraph. Note that this choice of φ implies symmetry of the resulting model with respect to the image plane. Since the backside of the object is unobservable, it will be reconstructed properly for plane-symmetric objects.

Data Term Parameters. By altering the parameters μ_offset, μ_factor, μ_cutoff and the exponent k of the height map function (8), users can intuitively change the data term (4) and thus the overall shape of the reconstruction. Note that the impact of these parameters is attenuated with increasing importance of the smoothness term. The effects of the offset, factor and cutoff parameters on the height map are shown in Fig. 2 and are quite intuitive to grasp. The exponent k of the distance function in (8) mainly influences the object's curvature in the proximity of the silhouette contour. This can be observed in Fig. 2, showing an evolution from a cone to a cylinder just by decreasing k.
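The data term φ of Eqs. (7) and (8) can be assembled from a distance transform of the silhouette. A sketch, assuming a brute-force distance computation, a volume of depth slices symmetric about the image plane, and illustrative function names:

```python
import numpy as np

def contour_distance(sil):
    """Distance of every silhouette pixel to the silhouette contour (brute force)."""
    inside = np.argwhere(sil)
    # contour pixels: silhouette pixels with at least one 4-neighbor outside
    pad = np.pad(sil, 1)
    nb = pad[:-2, 1:-1] & pad[2:, 1:-1] & pad[1:-1, :-2] & pad[1:-1, 2:]
    contour = np.argwhere(sil & ~nb)
    d = np.zeros(sil.shape)
    diff = inside[:, None, :] - contour[None, :, :]          # (M, K, 2)
    d[tuple(inside.T)] = np.sqrt((diff ** 2).sum(-1)).min(1)
    return d

def data_term(sil, depth, k=1.0, mu_off=0.0, mu_fac=1.0, mu_cut=np.inf):
    """phi of Eqs. (7)-(8): -1 inside the inflated volume, +1 outside."""
    h = np.minimum(mu_cut, mu_off + mu_fac * contour_distance(sil) ** k)
    z = np.abs(np.arange(depth) - depth // 2)                # distance to image plane
    phi = np.where(z[:, None, None] <= h[None, :, :], -1.0, 1.0)
    phi[:, ~sil] = 1.0                                       # outside Sigma: always exterior
    return phi
```

For large images a proper Euclidean distance transform (e.g. `scipy.ndimage.distance_transform_edt`) would replace the brute-force loop; ignoring user-marked contour sections, as described below for Fig. 3, amounts to dropping those pixels from the `contour` set.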
[Figure 2 panel labels: cutoff, factor, offset; k = 2, k = 1, k = 1/100]
Fig. 2. Effect of μ_offset, μ_factor, μ_cutoff (left) and various values of the parameter k with the resulting (scaled) height map plots for a circular silhouette.
Altering the Data Term Locally. Due to the incorporation of a distance transform in the data term, the reconstruction will always become flat at the silhouette border. However, this is not always desired, for instance at the bottom and top of the vase in Fig. 3. A simple remedy to this problem is to ignore parts of the contour during the calculation of the distance function. The user indicates the sections of the silhouette contour that should be ignored. To keep user interaction simple, we approximate the object contour by a polygon which is laid over the input image. By clicking on an edge, the user marks the corresponding contour pixels to be ignored during distance map calculation (see Fig. 3, top right).

4.1 Optimization via Convex Relaxation
To minimize energy (2) plus the data term (4), we follow the framework developed in [18]. To this end, we relax the binary problem, looking for functions u : V → [0,1] instead. We can globally minimize the resulting convex functional by solving the corresponding Euler-Lagrange equation

0 = φ − λ div( g ∇u / |∇u| )        (9)

using a fixed-point iteration in combination with Successive Over-Relaxation (SOR). A global optimum of the original binary labeling problem is then obtained by simple thresholding of the solution of the relaxed problem; see [18] for details. In [19] it was shown that such relaxation techniques have several advantages over graph cut methods. In this work, the two main advantages are the lack of metrication errors and the parallelization potential. These two aspects allow us to compute smooth single view reconstructions with no grid bias within a few seconds using standard graphics hardware.
Fig. 3. Top row: height maps and corresponding reconstructions with and without marked sharp contour edges. Bottom row: input image with marked contour edges (blue) and line strokes (red) for local discontinuities, which are shown on the right.
[Figure 4 panels: input image; reconstructed geometry with input image; reconstructed geometry only; textured geometry]
Fig. 4. An example with intricate topology. Due to the implicit representation of the reconstruction surface, the algorithms of Sections 4 and 5 can handle any genus.
4.2 Experiments

In the following we apply our reconstruction method with the data term to several input images. We show different aspects of the reconstruction process for typical classes of target objects and point out limitations of the approach. The experimental results are shown in Figures 4, 5, 6 and 7. Default values for the data term parameters (8) are k = 1, μ_offset = 0, μ_factor = 1, μ_cutoff = ∞. Each row depicts several views of a single object reconstruction, starting with the input image.
[Figure 5 panels: input image; textured geometry with input image; reconstructed geometry only; textured geometry]
Fig. 5. For the cockatoo very little additional user input was necessary. The smoothness was reduced locally by a single user scribble (see Section 3).
[Figure 6 panels: input image; textured geometry with input image; reconstructed geometry with input image; textured geometry]
Fig. 6. The Cristo statue is composed of smooth and non-smooth parts. The socket part was marked as non-smooth by a user scribble, adapting the weight of the minimal surface locally.
The fence in Fig. 4 is an example of the complex topology the algorithm can handle. The reconstruction was generated automatically right after the segmentation stage, i.e. without changing the surface smoothness. Figures 5, 6 and 7 demonstrate the potential of user editing as described in Sections 3 and 4. The reconstructions were edited by adapting the local smoothness and locally editing the data term. It can be seen that elaborate modeling effects can be achieved by these simple operations. For the cockatoo in particular, a single curve suffices to add the characteristic indentation to the beak; no expert knowledge is necessary. For the socket of the Cristo statue, creases help to attain sharp edges while keeping the rest of the statue smooth. It should be stressed that no other post-processing operations were used. The experiments in Figure 7 represent a more complex class of target objects. A closer look reveals that the algorithm clearly reaches its limit here: the structure of the opera building (third row) as well as the elaborate geometry of the bike and its drivers cannot be correctly reconstructed with the proposed method, due to a lack of information and of more sophisticated tools. Yet the results are appealing and could be refined with the given tools.
[Figure 7 panels: input image; textured geometry with input image; reconstructed geometry only; textured geometry]
Fig. 7. Reconstruction examples where the algorithm reaches its limit. Nevertheless the results are pleasing and could be used for tasks like new view synthesis.
5 Inflation via Volume Prior
Adding a data term to the variational problem (2) delivers reasonable results for single view reconstruction, as shown in the last section. However, we have also seen that a data term imposes a strong bias on the shape. Ideally we would like a non-heuristic inflation approach that does not restrict the shape variety while at the same time exhibiting the same natural compactness as seen in the experiments above. As an alternative inflation strategy we propose to use a constraint on the size of the volume enclosed by the minimal surface. We formulate this as a hard constraint by further constraining the feasible set of problem (2):

min_{u ∈ UΣ ∩ UV} E(u)    where    E(u) = ∫ g(x) |∇u(x)| d³x        (10)

and

UV = { u ∈ BV(R³; {0,1})  |  ∫ u(x) d³x = Vt }        (11)

where UV denotes all reconstructions of bounded variation that have the specified target volume Vt. Different approaches to choosing Vt can be considered. Since in the implementation the optimization domain is naturally bounded, we choose Vt to be a fixed
fraction of the volume of this domain. In a fast interactive framework the user can then adapt the target volume with the help of instant visual feedback. Most importantly, as opposed to a data-term-driven model, volume constraints do not dictate where inflation takes place.

5.1 Optimization via Convex Relaxation
As in Section 4, we choose to relax the binary problem. This amounts to replacing UV and UΣ with their respective convex hulls UV^r and UΣ^r. The corresponding optimization problem is then convex:

Proposition 1. The relaxed set U^r := UΣ^r ∩ UV^r is convex.

Proof. The constraint in the definition of UV is clearly linear in u, and therefore UV^r is convex. The same argument holds for UΣ^r. Being an intersection of two convex sets, U^r is convex as well.

One standard way of finding the globally optimal solution to this problem is gradient descent, which is known to converge very slowly. Since optimization speed is an integral part of an interactive reconstruction framework, we convert our problem to a form for which a primal-dual scheme [20] can be applied. We start by replacing the TV-norm in our minimal surface problem by its weak equivalent:

min_{u ∈ U^r} ∫ g(x) |∇u| d³x  =  min_{u ∈ U^r}  max_{‖ξ(x)‖₂ ≤ g(x)}  ∫ −u div ξ d³x        (12)

where ξ ∈ C¹_c(R³, R³). The main problem is that we are dealing with an optimization problem over a constrained set: u needs to fulfill three constraints, namely silhouette consistency, constant volume, and u ∈ [0,1]. In order to maintain silhouette consistency (3) of the solution, we simply restrict updates to those voxels which project onto the silhouette interior, excluding the silhouette itself. Furthermore we reformulate the volume constraint with a Lagrange multiplier λ, which together with Equation (12) leads to the following Lagrangian dual problem [21]:

max_{‖ξ(x)‖₂ ≤ g(x), λ}  min_{u ∈ UΣ^r}  ∫ −u div ξ d³x + λ ( ∫ u d³x − Vt )        (13)

Equation (13) is a saddle point problem. In [20] it was shown how to solve problems of this special form with a primal-dual scheme. We employ this scheme, which is fast and provably convergent. It consists of alternating a gradient ascent for the dual variables ξ and λ with a gradient descent with respect to the function u, interlaced with an over-relaxation step on the primal variable:

ξ^{k+1} = Π_{‖ξ(x)‖₂ ≤ g(x)} ( ξ^k + τ · ∇ū^k )
λ^{k+1} = λ^k + τ · ( ∫ ū^k dx − Vt )
u^{k+1} = Π_{[0,1]} ( u^k − σ · ( −div ξ^{k+1} + λ^{k+1} ) )
ū^{k+1} = 2 u^{k+1} − u^k        (14)
SilhouetteBased Variational Methods for Single View Reconstruction
where $\Pi_A$ denotes the projection onto the set $A$ (see [20] for details). Note that the projection for the primal variable $u$ now reduces to a clipping operation; the projection of $\xi$ is done by simple clipping as well. The scheme (14) is numerically attractive since it avoids division by the potentially zero-valued gradient norm which appears in the Euler-Lagrange equation of the TV-norm. Moreover, it is parallelizable, and we therefore implemented it on the GPU.
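The two projections in scheme (14) are simple pointwise operations. A minimal NumPy sketch (not the GPU implementation; function names and the axis convention for the dual field are our own):

```python
import numpy as np

def project_primal(u):
    """Pi_{u in [0,1]}: pointwise clipping of the primal variable."""
    return np.clip(u, 0.0, 1.0)

def project_dual(xi, g):
    """Pi_{|xi(x)|_2 <= g(x)}: pointwise projection of the dual field onto
    the ball of radius g(x), i.e., rescale every vector whose Euclidean
    norm exceeds the local weight. xi has shape (3, D, H, W)."""
    norm = np.linalg.norm(xi, axis=0)                    # |xi(x)|_2 per voxel
    scale = np.maximum(norm / np.maximum(g, 1e-12), 1.0)
    return xi / scale
```

After the rescaling, $\|\xi(x)\|_2 \le g(x)$ holds everywhere, while vectors already inside the constraint set are left untouched.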
5.2 Optimality Bounds
Having computed the globally optimal solution of Equation (12), the question remains how we obtain a binary solution and how the two solutions relate to one another energetically. Unfortunately, no thresholding theorem holds which would imply energetic equivalence of the relaxed optimum and its thresholded version for arbitrary thresholds. Nevertheless, we can construct a binary solution $u_{bin}$ as follows:

Proposition 2. The relaxed solution can be projected to the set of binary functions in such a way that the resulting binary function preserves the user-specified volume $V_t$.

Proof. It suffices to order the voxels $x$ by decreasing values $u(x)$. Subsequently, one sets the value of the first $V_t$ voxels to 1 and the value of the remaining voxels to 0.

Concerning an optimality bound, the following holds:

Proposition 3. Let $u^r_{opt}$ be the globally optimal solution of the relaxed energy and $u_{opt}$ the globally optimal solution of the binary problem. Then

$$E(u_{bin}) - E(u_{opt}) \;\le\; E(u_{bin}) - E(u^r_{opt}). \qquad (15)$$
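The volume-preserving projection from the proof of Proposition 2 is a simple sorting procedure; a sketch (function name ours, with $V_t$ given in voxels):

```python
import numpy as np

def binarize_preserving_volume(u, V_t):
    """Order the voxels by decreasing u(x) and set the first V_t of them
    to 1 and the rest to 0, so the binary result has volume exactly V_t."""
    u_bin = np.zeros_like(u)
    order = np.argsort(u, axis=None)[::-1]   # flat indices, largest u first
    u_bin.flat[order[:V_t]] = 1.0
    return u_bin
```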
5.3 Theoretical Analysis of Material Concentration
As we have seen above, the proposed convex relaxation technique does not guarantee global optimality of the binary solution. The thresholding theorem [22], applicable in the unconstrained problem, no longer applies to the volume-constrained problem. While the relaxation naturally gives rise to a posteriori optimality bounds, one may take a closer look at the given problem and ask why the relaxed volume labeling u should favor the emergence of solid objects rather than distributing the prescribed volume equally over all voxels. In the following, we prove analytically that the proposed functional has an energetic preference for material concentration. For simplicity, we consider the case that the object silhouette in the image is a disk, and we compare the two extreme cases of all volume being concentrated in a half-ball (a known solution of the Cheeger problem) versus the same volume being distributed equally over the feasible space, namely a cylinder (see Figure 8).
E. Töppe et al.
Fig. 8. The two cases considered in the analysis of the material concentration for the approach in Section 5. On the left hand side we assume a hemispherical condensation of the material. On the right hand side the material is distributed evenly over the volume.
Proposition 4. Let $u_{sphere}$ denote the binary solution which is 1 inside the half-sphere and 0 outside (Fig. 8, left side), and let $u_{cyl}$ denote the solution which is uniformly distributed (i.e., constant) over the entire cylinder (Fig. 8, right side). Then we have

$$E(u_{sphere}) < E(u_{cyl}), \qquad (16)$$

independent of the height of the cylinder.

Proof. Let $R$ denote the radius of the disk. Then the energy of $u_{sphere}$ is simply given by the area of the half-sphere:

$$E(u_{sphere}) = \int \|\nabla u_{sphere}\|\; d^2x = 2\pi R^2. \qquad (17)$$

If, instead of being concentrated in the half-sphere, the same volume $V = \frac{2\pi}{3}R^3$ is distributed uniformly over the cylinder of height $h \in (0, \infty)$, we have

$$u_{cyl}(x) = \frac{V}{\pi R^2 h} = \frac{2\pi R^3}{3\pi R^2 h} = \frac{2R}{3h} \qquad (18)$$

inside the entire cylinder, and $u_{cyl}(x) = 0$ outside the cylinder. The respective surface energy of $u_{cyl}$ is given by the area of the cylinder weighted by the respective jump size:

$$E(u_{cyl}) = \int \|\nabla u_{cyl}\|\; d^2x = \frac{2R}{3h}\left(\pi R^2 + 2\pi R h\right) + \left(1 - \frac{2R}{3h}\right)\pi R^2 = \frac{7}{3}\pi R^2 > E(u_{sphere}). \qquad (19)$$
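The arithmetic in Proposition 4 can be checked numerically; a sketch in plain Python (the function name is ours, and we restrict h so that the label stays in [0,1]):

```python
import math

def cylinder_energy(R, h):
    """Weighted surface energy E(u_cyl) of the uniform label u_cyl = 2R/(3h):
    jump of size u_cyl across the top disk and lateral wall, jump of size
    (1 - u_cyl) across the bottom disk (the silhouette)."""
    u = (2.0 / 3.0) * math.pi * R**3 / (math.pi * R**2 * h)   # = 2R/(3h)
    return u * (math.pi * R**2 + 2 * math.pi * R * h) + (1 - u) * math.pi * R**2

R = 1.0
E_sphere = 2 * math.pi * R**2            # area of the half-sphere, eq. (17)
for h in (1.0, 2.0, 4.0, 100.0):         # the height h drops out, eq. (19)
    assert abs(cylinder_energy(R, h) - 7 * math.pi / 3 * R**2) < 1e-12
    assert cylinder_energy(R, h) > E_sphere
```

The loop confirms both claims of the proposition: the cylinder energy equals $\frac{7}{3}\pi R^2$ regardless of h and always exceeds the half-sphere energy $2\pi R^2$.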
5.4 Experiments
In this section we study the properties of constant-volume weighted minimal surfaces, again within an interactive reconstruction environment. We show that appealing and realistic 3D models can be generated with minimal user input.
[Figure 9 panels: Input Image | Reconstruction | +30% volume | +40% volume]
Fig. 9. By increasing the target volume with the help of a slider, the reconstruction is intuitively inflated. In this example the initial rendering of the volume with 175 × 135 × 80 voxels took 3.3 seconds. Starting from there, each subsequent volume adaptation took only about 1 second.
[Figure 10 panels: Input Image | Reconstructed Geometry | Textured Geometry]
Fig. 10. The constant volume approach favors minimal surfaces for a user-specified volume. This amounts to solving a Cheeger set problem.
Cheeger Sets and Single View Reconstruction. Solutions to the problem in Eq. (10) are so-called Cheeger sets, i.e., minimal surfaces for a fixed volume. In the simplest case of a circle-shaped silhouette, the corresponding Cheeger set is a ball. Fig. 10 demonstrates that round silhouette boundaries (in the unweighted case g(x) ≡ 1) do in fact result in round shapes. In the example of the balloon it also becomes apparent that thinner structures in the silhouette are inflated less than compact parts: Going from the top of the balloon toward the basket at the bottom, the inflation gradually decreases along with the silhouette width. Varying the Volume. In the constant volume formalism presented in this section, the only parameter we have to determine for our reconstruction is the target volume V_t (apart from the weighting function g(x) of the TV-norm in Eq. (12)). The effect of changing this scalar parameter on the appearance of the reconstructed surface can be seen in Fig. 9. The adaptation of the target volume has an intuitive effect on the resulting shape. This is important for an interactive, user-driven reconstruction process, which is made possible by the small computation times that we gain through a parallel implementation of the algorithm in Eq. (14).
[Figure 11 panels: Image with User Input | Reconstructions | Pure Geometry]
Fig. 11. The proposed approach allows generating 3D models with sharp edges, which the user marks as locations of low smoothness (see Section 3). Along the red user strokes (second from left) the local smoothness weighting is decreased.
[Figure 12 panels: Input | Reconstruction | Different View | Geometry]
Fig. 12. Volume inflation dominates where the silhouette area is large (bird), whereas thin structures (twigs) are inflated less.
Sharp Edges. Similar to Section 4, we examine the effects of adapting the smoothness locally, but now using the volume prior instead of the shape prior. Fig. 11 shows that by adapting the weighting function g(x) of Eq. (12), not only round but also other very characteristic shapes can be modeled with minimal user interaction. The 2D user input is shown alongside the reconstruction results. More experiments with smoothness adaptation for the constant volume case are presented in Section 6.
6 Comparison
In this section we compare the two proposed priors with respect to their reconstruction results, usability and runtime. Comparison of Experimental Results. We have already indicated that the data term acts as a strong prior on the resulting shape of the reconstruction. This can be veriﬁed in Fig. 14: The left reconstruction was done with the shape prior as described in Section 4. Clearly the silhouette distance transform dominates
[Figure 13 panels: Input Image | Data Term as Shape Prior | Reconstruction with Shape Prior | Reconstruction with Volume Prior]
Fig. 13. Using a silhouette distance transform as shape prior, the relation between the data term (second from left) and the reconstruction (third from left) is not easy for a user to assess. The Cheeger set approach of Section 5 behaves more naturally in this respect (right).
the shape in the resulting reconstruction. This might of course be advantageous for particular shapes like the examples shown in Fig. 2. Still, a Cheeger set (right of Fig. 14) is often a better guess for natural shapes. Increasing the smoothness parameter in the data term approach will mitigate the influence of the distance transform. However, with higher smoothness the result tends to be less voluminous, making it hard to achieve ball-like shapes (see Fig. 13). In both approaches, thin structures in the silhouette are inflated less than more compact parts. This is a basic property of minimal surfaces. Nevertheless, in the data term approach thin structures tend to be too flat, especially in the presence of a high smoothness parameter (see Fig. 14). In principle, a data term inhibits the flexibility of the reconstructions. The airplane in Fig. 15 represents an example in which a parametric shape prior, just like the proposed data term of Section 4, would fail to offer the flexibility required for modeling protrusions. Since our fixed-volume approach does not impose points of inflation, user input can influence the reconstruction more freely: Marking the wings as highly non-smooth (i.e., low weights 0 < g(x) < 0.3) effectively makes them pop out. From a user perspective, the shape prior approach is much more involved in terms of the amount of user input. The shape prior has four parameters, offering reasonable but still limited flexibility to adapt its shape. On the other hand, for the volume prior approach only one parameter needs to be specified by the user. In Fig. 14 and 13 we use the same input images as in [12]. Comparing both results with the ones in their work reveals that we obtain comparable results with significantly less user input. As opposed to their work, our method exhibits complete topological freedom. Figure 16 depicts a direct comparison of the two proposed priors on several reconstruction results.
Again one can observe that the volume prior generally yields rounder shapes, while the distance function dominates the results with the shape prior. In sum, both priors yield comparable reconstruction results. Although the shape prior approach gives the user slightly more possibilities to adapt the final shape of the reconstruction, this freedom comes at the cost of a significant amount of additional user input, as it is not always simple to find the right combination of
[Figure 14 panels: Input Image | Reconstruction with Data Term as Shape Prior | Reconstruction with Volume Prior]
Fig. 14. In contrast to a solution with shape prior (Eq. 5) (center), the solution with volume prior (right) does not favor a specific shape and generates more natural looking results. Although in the center reconstruction the dominating shape prior can be mitigated by a higher smoothness (λ in Eq. (5)), this ultimately leads to the flattening of thin structures like the handle.
[Figure 15 panels: Image with User Input | Reconstructed Geometry | Textured Geometry]
Fig. 15. An example of a minimal surface with prescribed volume and local smoothness adaptation. Colored lines in the input image mark user input, which locally alters the surface smoothness. Red marks low, yellow marks high smoothness (see Section 3 for details).
parameters. In contrast, the volume prior has only a single parameter, making it simpler to adapt the shape of the reconstruction. A limitation of both methods is the implicit assumption that the plane of object symmetry is approximately parallel to the image plane, since the topology of the reconstructed object is directly inferred from the topology of the object's silhouette. Runtime Comparison. The two priors lead to different optimization problems, which we solved with different optimization schemes. Both have been implemented in a parallel manner using the NVIDIA CUDA framework. All computation times refer to a PC with a 2.27 GHz Intel Xeon CPU and an NVIDIA GeForce GTX 580 graphics device running a recent Linux distribution. The computation times for the shape prior approach are slightly lower than those of the volume prior approach. For instance, for the teapot example (Fig. 14) at a resolution of 189 × 139 × 83, the method with shape prior needs 2 seconds while the method with volume prior needs 4.7 seconds. However, the computation times depend mainly on the volume resolution and also on the
[Figure 16 panels: Input Image | Reconstruction with Shape Prior | Reconstruction with Volume Prior | Geometry with Shape Prior | Geometry with Volume Prior]
Fig. 16. Direct comparison of the methods with shape and volume prior for several examples
object to be reconstructed. Lower volume resolutions may also suffice for a reasonable reconstruction quality. Using e.g. 63 × 47 × 32, the computation times drop to 0.03 s and 0.13 s for the methods with shape and volume prior, respectively.
7 Conclusion
In this work we considered a variational approach to the problem of 3D reconstruction from a single view by searching for a weighted minimal surface that is consistent with the given silhouette. A major part of our contribution is to show
that this can be done in an interactive way by providing a tool that is intuitive and computes solutions within seconds. Two paradigms were proposed in order to deal with this highly ill-posed task. In the first one, we introduce a shape prior that is incorporated as a data term in order to avoid flat solutions. This approach is along the lines of other works, as it boils down to fixing depth values of the reconstruction in order to inflate it. In the second method, we search for a weighted minimal surface that complies with a fixed, user-given volume. The resulting Cheeger set problem does not require specifying expected depths of any sort, thus providing more geometric flexibility in the result. In the former case we compute globally optimal solutions to the variational problem. In the latter case we showed that the solution lies within a bound of the optimum and exactly fulfills the prescribed volume. We compared both priors and found that the volume prior is more flexible and thus better suited for the task of single view reconstruction. On a variety of challenging real-world images, we showed that the proposed method compares favorably with existing approaches, that volume variations lead to families of realistic reconstructions, and that additional user scribbles allow locally reducing smoothness so as to easily create protrusions.
References
1. Oswald, M.R., Töppe, E., Kolev, K., Cremers, D.: Non-parametric single view reconstruction of curved objects using convex optimization. In: Denzler, J., Notni, G., Süße, H. (eds.) Pattern Recognition. LNCS, vol. 5748, pp. 171–180. Springer, Heidelberg (2009)
2. Toeppe, E., Oswald, M.R., Cremers, D., Rother, C.: Image-based 3D modeling via Cheeger sets. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part I. LNCS, vol. 6492, pp. 53–64. Springer, Heidelberg (2011)
3. Horry, Y., Anjyo, K.I., Arai, K.: Tour into the picture: Using a spidery mesh interface to make animation from a single image. In: SIGGRAPH 1997: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 225–232. ACM Press/Addison-Wesley, New York (1997)
4. Liebowitz, D., Criminisi, A., Zisserman, A.: Creating architectural models from images. In: Proc. EuroGraphics, vol. 18, pp. 39–50 (1999)
5. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. Int. J. Comput. Vision 40, 123–148 (2000)
6. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. ACM Trans. Graph. 24, 577–584 (2005)
7. Sturm, P.F., Maybank, S.J.: A method for interactive 3D reconstruction of piecewise planar objects from single images. In: Proc. BMVC, pp. 265–274 (1999)
8. Terzopoulos, D., Witkin, A., Kass, M.: Symmetry-seeking models and 3D object reconstruction. IJCV 1, 211–221 (1987)
9. Zhang, L., Dugas-Phocion, G., Samson, J.S., Seitz, S.M.: Single view modeling of free-form scenes. In: Proc. of CVPR, pp. 990–997 (2001)
10. Igarashi, T., Matsuoka, S., Tanaka, H.: Teddy: A sketching interface for 3D freeform design. In: SIGGRAPH 1999, pp. 409–416. ACM Press/Addison-Wesley, New York (1999)
11. Nealen, A., Igarashi, T., Sorkine, O., Alexa, M.: FiberMesh: Designing freeform surfaces with 3D curves. ACM Trans. Graph. 26, 41 (2007)
12. Prasad, M., Zisserman, A., Fitzgibbon, A.W.: Single view reconstruction of curved surfaces. In: CVPR, pp. 1345–1354 (2006)
13. Joshi, P., Carr, N.: Repoussé: Automatic inflation of 2D art. In: Eurographics Workshop on Sketch-Based Modeling (2008)
14. Cohen, L.D., Cohen, I.: Finite-element methods for active contour models and balloons for 2D and 3D images. IEEE Trans. on Patt. Anal. and Mach. Intell. 15, 1131–1147 (1993)
15. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: ICCV, vol. 1, pp. 105–112 (2001)
16. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23, 309–314 (2004)
17. Ambrosio, L., Fusco, N., Pallara, D.: Functions of Bounded Variation and Free Discontinuity Problems. The Clarendon Press/Oxford University Press, New York (2000)
18. Kolev, K., Klodt, M., Brox, T., Cremers, D.: Continuous global optimization in multiview 3D reconstruction. International Journal of Computer Vision (2009)
19. Klodt, M., Schoenemann, T., Kolev, K., Schikora, M., Cremers, D.: An experimental comparison of discrete and continuous shape optimization methods. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 332–345. Springer, Heidelberg (2008)
20. Pock, T., Cremers, D., Bischof, H., Chambolle, A.: An algorithm for minimizing the piecewise smooth Mumford-Shah functional. In: IEEE Int. Conf. on Computer Vision, Kyoto, Japan (2009)
21. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
22. Chan, T., Esedoğlu, S., Nikolova, M.: Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal on Applied Mathematics 66, 1632–1648 (2006)
Single Image Blind Deconvolution with Higher-Order Texture Statistics

Manuel Martinello and Paolo Favaro
Heriot-Watt University, School of EPS, Edinburgh EH14 4AS, UK
Abstract. We present a novel method for solving blind deconvolution, i.e., the task of recovering a sharp image given a blurry one. We focus on blurry images obtained from a coded aperture camera, where both the camera and the scene are static, and allow blur to vary across the image domain. Like most methods for blind deconvolution, we solve the problem in two steps: First, we estimate the coded blur scale at each pixel; second, we deconvolve the blurry image given the estimated blur. Our approach is to use linear higher-order priors for texture and second-order priors for the blur scale map, i.e., constraints involving two pixels at a time. We show that by incorporating the texture priors in a least-squares energy minimization we can transform the initial blind deconvolution task into a simpler optimization problem. One of the striking features of the simplified problem is that the parameters that define the functional can be learned offline directly from natural images via singular value decomposition. We also give a geometrical interpretation of image blurring and explain our method from this viewpoint. In doing so, we devise a novel technique to design optimally coded apertures. Finally, our coded blur identification reduces to computing convolutions, rather than deconvolutions, which are stable operations. We demonstrate in several experiments that this additional stability allows the method to deal with large blur. We also compare our method to existing algorithms in the literature and show that we achieve state-of-the-art performance with both synthetic and real data.

Keywords: coded aperture, single image, image deblurring, depth estimation.
1 Introduction
Recently there has been enormous progress in image deblurring from a single image. Perhaps one of the most remarkable results is to have shown that it is possible to extend the depth of ﬁeld of a camera by modifying the camera optical response [1,2,3,4,5,6,7]. Moreover, techniques based on applying a mask at the lens aperture have demonstrated the ability to recover a coarse depth of the
This research was partly supported by SELEX Galileo grant SELEX/HWU/2010/SOW3.
D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 124–151, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. Results on an outdoor scene [exposure time 1/200s]. (a) Blurry coded image captured with mask b (see Fig. 4). (b) Sharp image reconstructed with our method.
scene [4,5,8]. Depth has then been used for digital refocusing [9] and advanced image editing. In this paper we present a novel method for image deblurring and demonstrate it on blurred images obtained from a coded aperture camera. Our algorithm uses as input a single blurred image (see Fig. 1(a)) and automatically returns the corresponding sharp one (see Fig. 1(b)). Our main contribution is to provide a computationally efficient method that achieves state-of-the-art performance in terms of depth and image reconstruction with coded aperture cameras. We demonstrate experimentally that our algorithm can deal with larger amounts of blur than previous coded aperture methods. One of the leading approaches in the literature [5] recovers a sharp image by sequentially testing a deconvolution method for several given hypotheses of the blur scale. The blur scale that yields a sharp image consistent with both the model and the texture priors is then chosen. In contrast, in our approach we show that one can identify the blur scale by computing convolutions, rather than deconvolutions, of the blurry image with a finite set of filters. As a consequence, our method is numerically stable, especially when dealing with large blur scales. In the next sections, we present all the steps needed to define our algorithm for image deblurring. The task is split into two steps: First the blur scale is identified; second, the coded image is deblurred with the estimated blur scale. We present an algorithm for blur scale identification in section 3.1. Image deblurring is then solved iteratively in section 3.2. A discussion on mask selection is presented in section 4.1. Comparisons to existing methods are shown in section 5.
1.1 Prior Work
This work relates to several ﬁelds ranging from computer vision to image and signal processing, and from optics to astronomy and computer graphics. For simplicity, we group past work based on the technique being employed.
Coded Imaging: Early work in coded imaging appears in the field of astronomy. One of the most interesting pattern designs is the Modified Uniformly Redundant Arrays (MURA) [10], for which a simple coding and decoding procedure was devised (see one such pattern in Fig. 4). In our tests the MURA pattern is very well behaved, but too sensitive to noise (see Fig. 5). Coded patterns have also been used to design lensless systems, but these systems require either long exposures or are sensitive to noise [11]. More recently, coding of the exposure [12] or of the aperture [4] has been used to preserve high spatial frequencies in blurred images so that deblurring is well-posed. We test the mask proposed in [4] and find that it works well for image deblurring, but not for blur scale identification. A mask that we have tested and that has yielded good performance is the four-hole mask of Hiura and Matsuyama [13]. In [13], however, the authors used multiple images. A study on good apertures for deblurring multiple coded images via Wiener filtering has instead led to two novel designs [14,15]. Although these masks were designed to be used together, we have tested each of them independently for comparison purposes. We found, as predicted by the authors, that the masks are quite robust to noise and quite well designed for image deblurring. Image deblurring and depth estimation with a coded aperture camera has also been demonstrated by Levin et al. [5]. One of their main contributions is the design of an optimal mask. We indeed find this mask quite effective on both synthetic and real data. However, as already noticed in [16], we have found that the coded aperture technique, if approached as in [5], fails when dealing with large blur amounts. The method we propose in this paper overcomes this limitation, especially when using the four-hole mask. Finally, a design based on annular masks has been proposed in [17] and has been exploited for depth estimation in [3].
We also tested this mask in our experiments, but, contrary to our expectations, we did not find its performance superior to the other masks. 3D Point Spread Functions: While there are several techniques to extract depth from images, we briefly mention some recent work by Greengard et al. [18] because their optical design includes and exploits diffraction effects. They investigated 3D point spread functions (PSF) whose transverse cross sections rotate as a result of diffraction, and showed that such PSFs yield an order of magnitude increase in sensitivity with respect to depth variations. The main drawback, however, is that the depth range and resolution are limited by the angular resolution of the reconstructed PSF. Depth-Invariant Blur: An alternative approach to coded imaging is wavefront coding. The key idea is to use aspheric lenses to render the lens point spread function (PSF) depth-invariant. Then, shift-invariant deblurring with a fixed known blur can be applied to sharpen the image [19,20]. However, while the results are quite promising, the PSF is not fully depth-invariant and artifacts remain in the reconstructed image. Other techniques based on depth-invariant PSFs exploit the chromatic aberrations of lenses [7] or use diffusion [21]. However, in the first case, as the focal sweep is across the spectrum, the method is mostly designed for grayscale imaging. While the results shown in
these recent works are stunning, there are two inherent limitations: 1) Depth is lost in the imaging process; 2) In general, as methods based on focal sweep are not exactly depth-invariant, the deblurring performance decays for objects that are too close to or too far away from the camera. Multiple Viewpoint: The extension of the depth of field can also be achieved by using multiple images and/or multiple viewpoints. One technique is to obtain multiple viewpoints by capturing multiple coded images [8,13,22] or by capturing a single image with a plenoptic camera [9,6,23,24]. These methods, however, exploit multiple images or require a more costly optical design (e.g., a calibrated microlens array). Motion Deblurring and Blind Deconvolution: This work also relates to work on blind deconvolution, and in particular on motion deblurring. There has been quite steady progress in uniform motion deblurring [25,26,27,28,29] thanks to the modeling and exploitation of texture statistics. Although these methods deal with an unknown and general blur pattern, they assume that blur does not change across the image domain. More recently, the space-varying case has been studied [30,31,32], albeit with some restrictions on the type of motion or the scene depth structure. Blurred Face Recognition: Work on the recognition of blurred faces [33] is also related to our method. Their approach extracts features from motion-blurred images of faces and then uses the subspace distance to identify the blur. In contrast, our method can be applied to space-varying blur, and our analysis provides a novel method to evaluate (and design) masks.
2 Single Image Blind Deconvolution
Blind deconvolution from a single image is a very challenging problem: We need to recover more unknowns than the available observations. This challenge will be illustrated in the next section, where we present the image formation model of a blurred image obtained from a coded aperture camera. To make the problem feasible and well-behaved, one can introduce additional constraints on the solution. In particular, we constrain the higher-order statistics of sharp texture (sec. 2.2) and impose that the blur scale be piecewise smooth across the image pixels (sec. 2.3).
Image Model
In the simplest instance, a blurred image of a plane facing the camera can be described via the convolution of a sharp image with the blur kernel. However, the convolutional model breaks down with more general surfaces and, in particular, at occlusion boundaries. In this case, one can describe a blurred image with a linear model. For the sake of notational simplicity, we write images as column vectors, where all pixels are sorted in lexicographical order. Thus, a blurred
image with N pixels is a column vector $g \in \mathbb{R}^N$. Similarly, a sharp image with M pixels is a column vector $f \in \mathbb{R}^M$. Then, g satisfies

$$g = H_d f, \qquad (1)$$

where the N × M matrix $H_d$ represents the coded blur, and d is a column vector with M entries collecting the blur scale corresponding to each pixel of f. The i-th column of $H_d$ is an image, rearranged as a vector, of the coded blur with scale $d_i$ generated by the i-th pixel of f. Notice that this model is a generalization of the convolutional case, in which $H_d$ reduces to a Toeplitz matrix. Our task is to recover the unknown sharp image f given the blurred image g. To achieve this goal it is necessary to recover the blur scale d at each pixel. The theory of linear algebra tells us that if N = M, the equations in eq. (1) are linearly independent, and we are given both g and $H_d$, then we can recover the sharp image f. However, in our case we are not given the matrix $H_d$, and the blurred image g is affected by noise. This introduces two challenges: First, to obtain $H_d$ we need to retrieve the blur scale d; second, because of noise in g and of the ill-conditioning of the linear system in eq. (1), the estimation of f might be unstable. The first challenge implies that we do not have a unique solution. The second challenge implies that even if the solution were unique, its estimation would not be reliable. However, not all is lost. It is possible to add more equations to eq. (1) until a unique, reliable solution can be obtained. This approach is based on observing that, typically, one expects the unknown sharp image and blur scale map to have some regularity. For instance, neither sharp textures nor blur scale maps are likely to look like noise. In the next two sections we will present and illustrate our sharp image and blur scale priors.
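As a toy illustration of the linear model in eq. (1), one can assemble $H_d$ explicitly for a 1-D signal. A normalized box kernel stands in here for the coded aperture pattern (an assumption for illustration; the actual columns of $H_d$ are the scaled mask shape):

```python
import numpy as np

def blur_matrix(M, scales):
    """Column i of H_d is the blur kernel of scale scales[i] centered at
    pixel i; a box kernel serves as a stand-in for the coded pattern."""
    H = np.zeros((M, M))
    for i, s in enumerate(scales):
        lo, hi = max(0, i - s), min(M, i + s + 1)
        H[lo:hi, i] = 1.0 / (hi - lo)      # each column sums to one
    return H

f = np.zeros(9)
f[4] = 1.0                                 # a single bright pixel
g = blur_matrix(9, [1] * 9) @ f            # blurred observation g = H_d f
# g spreads the unit pixel over entries 3..5, each equal to 1/3
```

With a constant scale, $H_d$ is Toeplitz up to boundary effects; with per-pixel scales it is not, which is exactly the space-varying case eq. (1) is meant to capture.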
2.2 Sharp Image Prior
Images of the real world exhibit statistical regularities that have been studied intensively over the past 20 years and have been linked to the human visual system and its evolution [34]. For the purpose of image deblurring, the most important aspect of this study is that natural images form a much smaller subset of all possible images. In general, the characterization of the statistical properties of natural images is done by applying a given transform, typically related to a component of human vision. Among the most common statistics used in image processing are second-order statistics, i.e., relations between pairs of pixels. For instance, this category includes the distributions of image gradients [35,36]. However, a more accurate account of the image structure can be captured with higher-order statistics, i.e., relations between several pixels. In this work, we consider this general case, but restrict the relations to linear ones of the form

$$\Sigma f \approx 0 \qquad (2)$$
where Σ is a rectangular matrix. Eq. (2) implies that all sharp images live approximately on a subspace. Despite their crude simplicity, these linear constraints allow for some flexibility. For example, the case of second-order statistics
Single Image Blind Deconvolution with HigherOrder Texture Statistics
results in rows of Σ with only two nonzero values. Also, by designing Σ one can selectively apply the constraints only to some of the pixels. Another example is to choose each row of Σ as a Haar feature applied to some pixels. Notice that in our approach we do not make any of these choices. Rather, we estimate Σ directly from natural images. Natural image statistics, such as gradients, typically exhibit a peaked distribution. However, performing inference on such distributions results in the minimization of non-convex functionals for which we do not have provably optimal algorithms. Furthermore, we are interested in simplifying the optimization task as much as possible to gain in computational efficiency. This has led us to enforce the linear relation above by minimizing the convex cost

‖Σf‖₂².    (3)
As we do not have an analytical expression for Σ that satisfies eq. (2), we need to learn it directly from the data. We will see later that this step is necessary only when performing the deconvolution step given the estimated blur. When estimating the blur scale, instead, our method allows us to use Σ implicitly, i.e., without ever recovering it.
2.3 Blur Scale Prior
The statistics of range images can be characterized with an approach similar to that for optical images [37]. The study in [37] verified the random collage model, i.e., that a scene is a collection of piecewise constant surfaces. This has been observed in the distributions of Haar filter responses on the logarithm of the range data, which show strong cusps in the iso-probability contours. Unfortunately, a prior following these distributions faithfully would result in a non-convex energy minimization. A practical convex alternative that enforces the piecewise constant model is total variation [38]. Common choices are the isotropic and the anisotropic total variation. In our algorithm we have implemented the latter: we minimize ‖∇d‖₁, i.e., the sum of the absolute values of the components of the gradient of d.
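As a small illustration of this prior, the following sketch computes the anisotropic total variation of a 2-D blur scale map as the sum of absolute forward differences along each axis (the function name and the toy maps are our own):

```python
import numpy as np

def anisotropic_tv(d):
    """Anisotropic total variation ||grad d||_1 of a 2-D map d:
    the sum of absolute forward differences along each axis."""
    dx = np.abs(np.diff(d, axis=1)).sum()
    dy = np.abs(np.diff(d, axis=0)).sum()
    return dx + dy

# A piecewise constant map has low TV; a random map has high TV,
# which is why this prior favors the random collage model.
step = np.zeros((8, 8)); step[:, 4:] = 5.0
rng = np.random.default_rng(0)
noisy = rng.uniform(0, 5, size=(8, 8))
assert anisotropic_tv(step) < anisotropic_tv(noisy)
```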
3 Blur Scale Identification and Image Deblurring
We can combine the image model introduced in sec. 2.1 with the priors in sec. 2.2 and 2.3 and formulate the following energy minimization problem:

(d̂, f̂) = argmin_{d,f} ‖g − H_d f‖₂² + α‖Σf‖₂² + β‖∇d‖₁,    (4)
where the parameters α, β > 0 determine the amount of regularization for texture and blur scale respectively. Notice that the formulation above is common to many approaches including, in particular, [5]. Our approach, however, in addition to using a more accurate blur matrix H_d, considers different priors and a different depth identification procedure.
M. Martinello and P. Favaro
Our next step is to notice that, given d, the proposed cost is simply a least-squares problem in the unknown sharp texture f. Hence, it is possible to compute f in closed form and plug it back into the cost functional. The result is a much simpler problem to solve. We summarize all the steps in the following theorem:

Theorem 1. The set of extrema of the minimization (4) coincides with the set of extrema of the minimization

d̂ = argmin_d ‖H_d^⊥ g‖₂² + β‖∇d‖₁
f̂ = (αΣᵀΣ + H_d̂ᵀ H_d̂)⁻¹ H_d̂ᵀ g    (5)

where H_d^⊥ = I − H_d (αΣᵀΣ + H_dᵀ H_d)⁻¹ H_dᵀ, and I is the identity matrix.

Proof. See Appendix.

Notice that the new formulation requires the definition of a square and symmetric matrix H_d^⊥. This matrix depends on the parameter α and the prior matrix Σ, both of which are unknown. However, for the purpose of estimating the unknown blur scale map d, it is possible to bypass the estimation of α and Σ by learning the matrix H_d^⊥ directly from data.
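Theorem 1 can be checked numerically. The sketch below uses small random matrices as stand-ins for H_d, Σ, and g (all dimensions are arbitrary) and verifies that the closed-form f̂ zeroes the gradient of the cost in (4), and that the minimized cost equals gᵀH_d^⊥g, so the minimization over d only involves H_d^⊥:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, alpha = 12, 10, 0.1
H = rng.standard_normal((N, M))      # stand-in for the coded blur matrix H_d
Sigma = rng.standard_normal((6, M))  # stand-in for the learned prior matrix
g = rng.standard_normal(N)

A = alpha * Sigma.T @ Sigma + H.T @ H
f_star = np.linalg.solve(A, H.T @ g)          # closed-form f from Theorem 1

# The gradient of ||g - H f||^2 + alpha ||Sigma f||^2 vanishes at f_star:
grad = A @ f_star - H.T @ g
assert np.allclose(grad, 0)

# The residual cost depends on the data only through H_perp:
H_perp = np.eye(N) - H @ np.linalg.solve(A, H.T)
cost = np.sum((g - H @ f_star) ** 2) + alpha * np.sum((Sigma @ f_star) ** 2)
assert np.isclose(cost, g @ H_perp @ g)
```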
3.1 Learning Procedure and Blur Scale Identification
We break down the complexity of solving eq. (5) by using local blur uniformity, i.e., by assuming that the blur is constant within a small region of pixels. Then, we further simplify the problem by considering only a finite set of L blur sizes d_1, ..., d_L. In practice, we find that both assumptions work well. Local blur uniformity holds reasonably well except at occluding boundaries, which form a small subset of the image domain; at occluding boundaries the solution tends to favor small blur estimates. We also found experimentally that the discretization is not a limiting factor in our method. The number of blur sizes L can be set to a value that matches the level of accuracy of the method without reaching a prohibitive computational load. Now, by combining the assumptions we find that eq. (5) at one pixel x,

d̂(x) = argmin_{d(x)} ‖H_d^⊥(x) g‖₂² + β‖∇d(x)‖₁,    (6)

can be approximated by

d̂(x) = argmin_{d(x)} ‖H_{d(x)}^⊥ g_x‖₂²    (7)

where g_x is a column vector of δ² pixels extracted from a δ × δ patch centered at the pixel x of g. Experimentally, we find that the size δ of the patch should not be smaller than the maximum scale of the coded blur in the captured image
g. H_{d(x)}^⊥ is a δ² × δ² matrix that depends on the blur size d(x) ∈ {d_1, ..., d_L}. So we assume that H_d^⊥(x, y) ≈ 0 for y such that ‖y − x‖₁ > δ/2. Notice that the term β‖∇d‖₁ drops because of the local blur uniformity assumption. The next step is to explicitly compute H_{d(x)}^⊥. Since the blur size d(x) is one of L values, we only need to compute the matrices H_{d_1}^⊥, ..., H_{d_L}^⊥. As each H_{d_i}^⊥ depends on α and the local Σ, we propose to learn each H_{d_i}^⊥ directly from data. Suppose that we are given a set of T column vectors g_{x_1}, ..., g_{x_T} extracted from blurry images of a plane parallel to the camera image plane. The column vectors then all share the same blur scale d_i. Hence, we can rewrite the cost functional in eq. (7) for all x as

‖H_{d_i}^⊥ G_i‖₂²    (8)

where G_i = [g_{x_1} · · · g_{x_T}]. By definition of G_i, ‖H_{d_i}^⊥ G_i‖₂² = 0. Hence, we find that H_{d_i}^⊥ can be computed via the singular value decomposition of G_i = U_i S_i V_iᵀ. If U_i = [U_{d_i} Q_{d_i}], where Q_{d_i} corresponds to the singular values of S_i that are zero (or negligible), then H_{d_i}^⊥ = Q_{d_i} Q_{d_i}ᵀ. The procedure is then repeated for each blur scale d_i with i = 1, ..., L. Next, we can use the estimated matrices H_{d_1}^⊥, ..., H_{d_L}^⊥ on a new image g and optimize with respect to d:

d̂ = argmin_d ∑_x ‖H_{d(x)}^⊥ g_x‖₂² + β‖∇d(x)‖₁.    (9)
The first term represents unary terms, i.e., terms that are defined on single pixels; the second term represents binary terms, i.e., terms that are defined on pairs of pixels. The minimization problem (9) can then be solved efficiently via graph cuts [39]. Notice that the procedure above can be applied to other surfaces as well, so that instead of a collection of parallel planes one can consider, for example, a collection of quadratic surfaces. Also, notice that there are no restrictions on the size of a patch. In particular, the same procedure can be applied to a patch the size of the input image. In our experiments for depth estimation, however, we consider only small patches and parallel planes as local surfaces.
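The learning step built around eq. (8) can be sketched as follows: patches sharing a blur scale are stacked into G_i, and H_{d_i}^⊥ is assembled from the left singular vectors with negligible singular values. The toy "coded" subspace and the function name below are our own illustration:

```python
import numpy as np

def learn_h_perp(G, tol=1e-8):
    """Learn H_perp for one blur scale from training patches.

    G is a (patch_dim x T) matrix whose columns are vectorized patches
    blurred with the same scale d_i. The left singular vectors with
    negligible singular values span the orthogonal complement Q of the
    coded-patch subspace, and H_perp = Q Q^T (sketch of eq. (8))."""
    U, s, _ = np.linalg.svd(G, full_matrices=True)
    rank = np.sum(s > tol * s.max())
    Q = U[:, rank:]
    return Q @ Q.T

# Toy data: patches of dimension 6 drawn from a 2-D "coded" subspace.
rng = np.random.default_rng(2)
basis = rng.standard_normal((6, 2))
G = basis @ rng.standard_normal((2, 50))

H_perp = learn_h_perp(G)
# Patches from the training subspace are annihilated ...
assert np.allclose(H_perp @ basis, 0)
# ... while a generic patch is not.
assert np.linalg.norm(H_perp @ rng.standard_normal(6)) > 1e-3
```

This mirrors the offline stage: one such matrix is learned per candidate blur scale, and the online stage simply compares the residuals ‖H_{d_i}^⊥ g_x‖₂².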
3.2 Image Deblurring
In the previous section we devised a procedure to compute the blur scale d at each pixel. In this section we assume that d is given and devise a procedure to compute the image f. In principle, one could use the closed-form solution

f = (αΣᵀΣ + H_d̂ᵀ H_d̂)⁻¹ H_d̂ᵀ g.    (10)

However, notice that computing this equation entails solving a large matrix inversion, which is not practical even for moderate image dimensions. A simpler approach is to solve the least-squares problem (4) in f via an iterative method. Therefore, we consider solving the problem

f̂ = argmin_f ‖g − H_d̂ f‖₂² + α‖Σf‖₂²    (11)
by using a least-squares conjugate gradient descent algorithm in f [40]. The main component of the iteration in f is the gradient ∇E_f of the cost (11) with respect to f:

∇E_f = (αΣᵀΣ + H_d̂ᵀ H_d̂) f − H_d̂ᵀ g.    (12)

The descent algorithm iterates until ∇E_f ≈ 0. Because of the convexity of the cost functional with respect to f, the solution is also a global minimum. To compute Σ we use a database of sharp images F = [f_1 · · · f_T], where {f_i}_{i=1,...,T} are sharp images rearranged as column vectors, and compute the singular value decomposition F = U_F Σ_F V_Fᵀ. Then, we partition U_F = [U_{F,1} U_{F,2}] such that U_{F,2} corresponds to the smallest singular values of Σ_F. The higher-order prior is defined as Σ = U_{F,2} U_{F,2}ᵀ, such that Σf_i ≈ 0. The regularization parameter α is instead tuned manually. The matrix H_d̂ is computed as described in Section 2.1.
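A compact sketch of this deconvolution step, assuming matrices small enough to form explicitly (the paper applies H_d and Σ implicitly on full images): a standard conjugate gradient loop driven by the gradient in eq. (12), checked against the closed-form solution (10) on a toy problem.

```python
import numpy as np

def deblur_cg(H, Sigma, g, alpha, iters=200):
    """Minimize ||g - H f||^2 + alpha ||Sigma f||^2 by conjugate
    gradient on A f = H^T g with A = alpha Sigma^T Sigma + H^T H,
    i.e., by driving the gradient of eq. (12) to zero."""
    A = alpha * Sigma.T @ Sigma + H.T @ H
    b = H.T @ g
    f = np.zeros(A.shape[0])
    r = b - A @ f                    # negative gradient at the current f
    p = r.copy()
    for _ in range(iters):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)
        f += a * p
        r_new = r - a * Ap
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return f

rng = np.random.default_rng(3)
H = rng.standard_normal((15, 10))
Sigma = rng.standard_normal((5, 10))
g = rng.standard_normal(15)
f_cg = deblur_cg(H, Sigma, g, alpha=0.1)
f_direct = np.linalg.solve(0.1 * Sigma.T @ Sigma + H.T @ H, H.T @ g)
assert np.allclose(f_cg, f_direct, atol=1e-6)
```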
4 A Geometric Viewpoint on Blur Scale Identification
In the previous sections we have seen that the blur scale at each pixel can be obtained by minimizing eq. (9): we search among the matrices H_{d_1}^⊥, ..., H_{d_L}^⊥ for the one that yields the minimum ℓ₂ norm when applied to the vector g_x. This has a geometric interpretation: each matrix H_{d_i}^⊥ defines a subspace, and ‖H_{d_i}^⊥ g_x‖₂² is the distance of the vector g_x from that subspace. Recall that H_{d_i}^⊥ = Q_{d_i} Q_{d_i}ᵀ and that U_i = [U_{d_i} Q_{d_i}] is an orthonormal matrix. Then, we obtain

‖H_{d_i}^⊥ g_x‖₂² = ‖Q_{d_i} Q_{d_i}ᵀ g_x‖₂² = ‖Q_{d_i}ᵀ g_x‖₂² = ‖g_x‖₂² − ‖U_{d_i}ᵀ g_x‖₂².

If we now divide by the scalar ‖g_x‖₂², we obtain exactly the square of the subspace distance [41]
M(g, U_{d_i}) = ( 1 − ∑_{j=1}^{K} (U_{d_i,j}ᵀ g/‖g‖)² )^{1/2}    (13)

where K is the rank of the subspace U_{d_i}, U_{d_i} = [U_{d_i,1} ... U_{d_i,K}], and the U_{d_i,j}, j = 1, ..., K, are orthonormal vectors. The geometric interpretation brings a fresh look at image blurring and deblurring. Consider the image model (1), and take the singular value decomposition of the blur matrix H_d:

H_d = U_d S_d V_dᵀ    (14)
where S_d is a diagonal matrix with positive entries, and both U_d and V_d are orthonormal matrices. Formally, the vector f undergoes a rotation (V_dᵀ), then a scaling (S_d), and then another rotation (U_d). This means that if f lives in a subspace, the initial subspace is mapped to another, rotated subspace, possibly of smaller dimension (see Fig. 2, middle). Notice that as we change the blur scale, the rotations and scaling also change and may result in yet a different subspace (see Fig. 2, right).
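The identity behind this subspace-distance view can be verified directly. The snippet below (our own toy example) builds a random orthonormal basis, splits it into a subspace U and its complement Q, and checks the relation ‖QQᵀg‖₂² = ‖g‖₂² − ‖Uᵀg‖₂² together with the normalized distance of eq. (13):

```python
import numpy as np

rng = np.random.default_rng(4)
# Random orthonormal basis of R^5, split into a subspace U and complement Q.
B, _ = np.linalg.qr(rng.standard_normal((5, 5)))
U, Q = B[:, :2], B[:, 2:]
g = rng.standard_normal(5)

# The identity used in the text: ||Q Q^T g||^2 = ||g||^2 - ||U^T g||^2.
lhs = np.sum((Q @ Q.T @ g) ** 2)
rhs = np.sum(g ** 2) - np.sum((U.T @ g) ** 2)
assert np.isclose(lhs, rhs)

# Normalizing by ||g||^2 gives the squared subspace distance of eq. (13).
dist_sq = 1.0 - np.sum((U.T @ (g / np.linalg.norm(g))) ** 2)
assert np.isclose(dist_sq, lhs / np.sum(g ** 2))
```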
Fig. 2. Coded image subspaces. (a) Image patches on a subspace. (b) Subspace containing images blurred with H_{d_1}; blurring has the effect of rotating and possibly reducing the dimensionality of the original subspace. (c) Subspace containing images blurred with H_{d_2}.
It is important to understand that rotations of the vector f can result in blurring. To clarify this, consider blurred and sharp images with only 3 pixels (we cannot visualize the case of more than 3 pixels), i.e., g₁ = [g_{1,x} g_{1,y} g_{1,z}]ᵀ and f₁ = [f_{1,x} f_{1,y} f_{1,z}]ᵀ. Then, we can plot the vectors g₁ and f₁ as 3D points (see Fig. 2). Let ‖g₁‖ = 1 and ‖f₁‖ = 1. Then, we can rotate f₁ about the origin and overlap it exactly with g₁; in this case the rotation corresponds to blurring. The opposite is also true: we can rotate the vector g₁ onto the vector f₁ and thus perform deblurring. Furthermore, notice that in this simple example the most blurred images are vectors with identical entries. Such blurred images lie along the diagonal direction [1 1 1]ᵀ. In general, blurry images tend to have entries with similar values and hence tend to cluster around the diagonal direction. Our ability to discriminate between different blur scales in a blurry image boils down to being able to determine the subspaces where the patches of that blurry image live. If sharp images do not live on a subspace, but uniformly in the entire space, our only way to distinguish the blur size is that the blurring H_d scales some dimensions of f to zero and that the scaling varies with blur size. This case has links to the zero-sheet approach in the Fourier domain [42]. However, if the sharp images live on a subspace, the blurring H_d may preserve all the directions, and blur scale identification is still possible by determining the rotation of the sharp image subspace. This is the principle that we exploit.
Input: A single coded image g and a collection of coded images of L planar scenes.
Output: The blur scale map d of the scene.

Preprocessing (offline)
  Pick an image patch size larger than twice the maximum blur scale;
  for i = 1, ..., L do
    Compute the singular value decomposition U_i S_i V_iᵀ of a collection of image patches coded with blur scale d_i;
    Calculate the subspace U_{d_i} as the columns of U_i corresponding to nonzero singular values of S_i;
  end

Blur identification (online)
  Solve d̂ = argmin_{d ∈ {d_1,...,d_L}} ∑_x M²(g_x, U_d) + (β/‖g_x‖₂²)‖∇d(x)‖₁.

Algorithm 1. Blur scale identification from a single coded image via the subspace distance method.
Notice that the evaluation of the subspace distance M involves the calculation of the inner product between a patch and a column of U_{d_i}. Hence, this calculation can be carried out exactly as the convolution of a column of U_{d_i}, rearranged as an image patch, with the whole image g. We can conclude that the algorithm requires computing a set of L × K convolutions with the coded image, which is a stable operation of polynomial computational complexity. As we have shown that minimizing eq. (13) is equivalent to minimizing ‖H_{d_i}^⊥ g_x‖₂² up to a scalar value, we summarize the blur scale identification procedure in Algorithm 1.
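Algorithm 1's online stage can be sketched as a per-patch nearest-subspace search. The version below is our own simplification: β = 0 (no regularization) and direct inner products in place of the convolution formulation; the function name and the two toy one-dimensional subspaces are assumptions for illustration.

```python
import numpy as np

def identify_blur_map(g, subspaces, delta):
    """Per-pixel blur scale identification (Algorithm 1 with beta = 0).

    subspaces[i] is an orthonormal basis U_{d_i} (delta^2 x K). For
    each patch g_x the scale whose subspace is closest in the sense of
    eq. (13) is selected; maximizing the projection ||U^T g_x||^2 is
    equivalent to minimizing the subspace distance."""
    H, W = g.shape
    r = delta // 2
    d_map = np.zeros((H - 2 * r, W - 2 * r), dtype=int)
    for y in range(r, H - r):
        for x in range(r, W - r):
            gx = g[y - r:y + r + 1, x - r:x + r + 1].ravel()
            gx = gx / (np.linalg.norm(gx) + 1e-12)
            proj = [np.sum((U.T @ gx) ** 2) for U in subspaces]
            d_map[y - r, x - r] = int(np.argmax(proj))
    return d_map

# Two toy subspaces over 3x3 patches: constant patches vs. ramp patches.
delta = 3
ones = np.ones(9) / 3.0
ramp = np.tile(np.array([-1.0, 0.0, 1.0]), 3)
ramp /= np.linalg.norm(ramp)
subspaces = [ones.reshape(9, 1), ramp.reshape(9, 1)]

g = np.full((6, 6), 5.0)          # a flat image: every patch is constant
assert np.all(identify_blur_map(g, subspaces, delta) == 0)
```

In practice the inner products Uᵀg_x would be computed as the L × K convolutions described above, rather than patch by patch.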
4.1 Coded Aperture Selection
In this section we discuss how to obtain an optimal pattern for the purpose of image deblurring. As pointed out in [19], we identify two main challenges: the first is that accurate deblurring requires accurate identification of the blur scale; the second is that accurate deblurring requires little texture loss due to blurring. A first step towards addressing these challenges is to define a metric for blur scale identification and a metric for texture loss. Our metric for blur scale identification can be defined directly from section 4. Indeed, the ability to determine which subspace a coded image patch belongs to can be measured via the distance between the subspaces associated with each blur scale:

M̄(U_{d_1}, U_{d_2}) = ( K − ∑_{i,j} (U_{d_1,i}ᵀ U_{d_2,j})² )^{1/2}.    (15)
Clearly, the wider apart all the subspaces are, the less prone to noise the subspace association is. We find that a good visual summary of the "spacing" between all the subspaces is a (symmetric) matrix with the distances between any
Fig. 3. Distance matrix computation. (a) Ideal distance matrix. (b) Circular aperture. The top-left corner of each matrix is the distance between subspaces corresponding to small blur scales and, vice versa, the bottom-right corner is the distance between subspaces corresponding to large blur scales. Notice that large subspace distances are bright and small subspace distances are dark. The maximum distance (√K) is achieved when two subspaces are orthogonal to each other.
two subspaces. We compute such a matrix for a conventional camera and show the results in Fig. 3, together with the ideal distance matrix. In each distance matrix, the subspaces associated with blur scales ranging from the smallest to the largest are arranged along the rows from left to right and along the columns from top to bottom. Along the diagonal the distance is necessarily 0, as we compare identical subspaces. Also, by definition the metric cannot exceed √K, where K is the minimum rank among the subspaces. In Fig. 5 we report the distance matrices computed for each of the apertures we consider in this work (see Fig. 4). Notice that the subspace distance map for a conventional camera (Fig. 3(b)) is overall darker than the matrices for coded aperture cameras (Fig. 5). This shows the poor blur scale identifiability of the circular aperture and the improvement that can be achieved with a more elaborate pattern. The rank K can be used to address the second challenge, i.e., the definition of a metric for texture loss. So far we have seen that blurring can be interpreted as a combination of rotations and scaling. Deblurring can then be interpreted as a combination of rotations and scaling in the opposite direction. However, when blurring scales some directions to 0, part of the texture content is lost. This suggests that a simple measure for texture loss is the dimension of the coded subspace: the higher the dimension, the more texture content we can restore. As the (coded image) subspace dimension is K, we can immediately conclude that the subspace distance matrix that most closely resembles the ideal distance matrix (see Fig. 3(a)) is the one that simultaneously achieves the best depth identification and the least texture loss. Finally, we propose to use the average L1 fitting of any distance matrix to the ideal distance matrix scaled by √K, i.e., ‖√K(11ᵀ − I) − M̄‖₁. The fitting yields the values in Table 1. We can also see visually in Fig. 5 that mask 4(b) and mask 4(d) are the coded apertures that we can expect to achieve the best results in texture deblurring.
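Eq. (15) can be turned into a small routine that computes the full pairwise distance matrix for a set of learned subspaces (a sketch with a hypothetical function name; the toy bases are chosen so the two extreme cases of the metric are visible):

```python
import numpy as np

def subspace_distance_matrix(subspaces):
    """Pairwise subspace distances of eq. (15):
    M(U1, U2) = sqrt(K - sum_{i,j} (U1_i^T U2_j)^2),
    where K is the common subspace rank. The result is the (symmetric)
    matrix used to visualize how well a mask separates blur scales."""
    L = len(subspaces)
    K = subspaces[0].shape[1]
    D = np.zeros((L, L))
    for a in range(L):
        for b in range(L):
            s = np.sum((subspaces[a].T @ subspaces[b]) ** 2)
            D[a, b] = np.sqrt(max(K - s, 0.0))  # guard tiny negatives
    return D

# Identical subspaces have distance 0; orthogonal subspaces attain
# the maximum distance sqrt(K).
U1 = np.eye(4)[:, :2]          # span of e1, e2
U2 = np.eye(4)[:, 2:]          # span of e3, e4
D = subspace_distance_matrix([U1, U2])
assert np.isclose(D[0, 0], 0.0) and np.isclose(D[0, 1], np.sqrt(2))
```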
Fig. 4. Coded aperture patterns and PSFs. All the aperture patterns we consider in this work (top row) and their calibrated PSFs for two different blur scales (second and bottom rows). (a) and (b): aperture masks used in both [13] and [43]; (c): annular mask used in [17]; (d): pattern proposed by [5]; (e): pattern proposed by [4]; (f) and (g): aperture masks used in [15]; (h): MURA pattern used in [10].
Fig. 5. Subspace distances for the eight masks in Fig. 4 (panels (a) through (h) correspond to masks 4(a) through 4(h)). Notice that the subspace rank K determines the maximum achievable distance; therefore, coded apertures with overall darker subspace distance maps have poor blur scale identifiability (i.e., they are sensitive to noise).
The quest for the optimal mask is, however, still an open problem. Even a brute-force search for the optimal mask requires, for each candidate aperture pattern, the evaluation of eq. (15) and the computation of all the subspaces associated with each blur scale. In particular, the latter process takes about 15 minutes on a Quad-Core 2.8GHz machine with Matlab 7, which makes the evaluation of a large number of masks unfeasible. Devising a fast procedure to determine the optimal mask will be the subject of future work.

Table 1. L1 fitting of each distance matrix to the ideal distance matrix scaled by √K.

Mask        4(a)  4(b)  4(c)  4(d)  4(e)  4(f)   4(g)  4(h)
L1 fitting  8.24  6.62  8.21  5.63  8.37  16.96  8.17  16.13
5 Experiments
In this section we demonstrate the effectiveness of our approach on both synthetic and real data. We show that the proposed algorithm performs better than previous methods on different coded apertures and different datasets. We also show that the masks proposed in the literature do not always yield the best performance.
5.1 Performance Comparison
Before proceeding with tests on real images, we perform extensive simulations to compare the accuracy and robustness of our algorithm with those of 4 competing methods, including the current state-of-the-art approach. The methods are all based on the hypothesis plane deconvolution used by [5], as explained in the Introduction. The main difference among the competing methods is that the deconvolution step is performed using either the Lucy-Richardson method [44], regularized filtering (i.e., with image gradient smoothness), Wiener filtering [45], or Levin's procedure [5]. We use the 8 masks shown in Fig. 4. All the patterns have been proposed and used by other researchers [4,5,10,13,15,17]. For each mask and a given blur scale map d, we simulate a coded image by using eq. (1), where f is an image of 4875 × 125 pixels with either random texture or a set of patches from natural images (examples of these patches are shown in Fig. 6). Then, for each algorithm we obtain a blur scale map estimate d̂ and compute its discrepancy with the ground truth. The ground-truth blur scale map d that we use is shown in pseudo-colors at the top left of both Fig. 7 and Fig. 8; it represents a stair composed of 39 steps at different distances (and thus different blur scales) from the camera. We assume that the focal plane is set between the camera and the first object of interest in the scene. With this setting, the bottom part of the blur scale map (small blur sizes) corresponds to points close to the camera, and the top part (large blur sizes) to points far from the camera. Each step of the stair is a square of 125 × 125 pixels; we have squeezed the illustration along the vertical axis to fit in the paper. The size of the blur ranges from 7 to 30 pixels. Notice that in measuring the errors we consider all pixels, including those at the blur scale discontinuities given by the difference of blur scale between neighboring steps. In Fig. 7 we show, for each mask in Fig. 4, the results of the proposed method (right) together with the results obtained by
Fig. 6. Real texture. Some of the patches extracted from real images that have been used in our tests. The same patches are shown with no noise (top, σ = 0) and with added Gaussian noise (bottom, σ = 0.002).
the current state-of-the-art algorithm (left) on random texture. The same procedure, but with texture from natural images, is reported in Fig. 8. For the three best-performing masks (mask 4(a), mask 4(b), and mask 4(d)), we report the results with the same graphical layout in Fig. 9, in order to better appreciate the improvement of our method over previous ones, especially for large blur scales. Every plot shows, for each of the 39 steps we consider, the mean and 3 times the standard deviation of the estimated blur scale values (ordinate axis) against the true blur scale level (abscissa axis). The ideal estimate is the diagonal line where each estimated level corresponds to the correct true blur scale level. If there is no bias in the estimation of the blur scale map, the ideal estimate should lie within 3 times the standard deviation about the mean with probability close to 1. Our method performs consistently well with all the masks and at different blur scale levels. In particular, the best performances are observed for mask 4(b) (Fig. 9(b)) and mask 4(d) (Fig. 9(c)), while the performance of the competing methods rapidly degenerates with increasing pattern scales. This demonstrates that our method has the potential to restore objects over a wider range of blur scales and with higher accuracy than previous algorithms. A quantitative comparison among all the methods and masks is given in Table 2 and Table 4 (for random texture) and in Table 3 and Table 5 (for real texture). In each table, the left half reports the average error of the blur scale estimate (measured as ‖d − d̂‖₁, where d and d̂ are the ground-truth and estimated blur scale maps respectively); the right half reports the error on the reconstructed sharp image f̂, measured as ‖f − f̂‖₂² + ‖∇f − ∇f̂‖₂², where f is the ground-truth image. The gradient term is added to improve sensitivity to artifacts in the reconstruction. As one can see from Tables 2–5, several levels of noise have been considered in the performance comparison: σ = 0 (Table 2 and
Fig. 7. Blur scale estimation, random texture. GT: ground-truth blur scale map (far to close). (a)–(h): estimated blur scale maps for all eight masks we consider in the paper. For each mask, the figure reports the blur scale map estimated with both Levin et al.'s method (left) and our method (right).
Table 3), σ = 0.001, σ = 0.002, and σ = 0.005 (Table 4 and Table 5). The noise level is, however, adjusted to accommodate the difference in overall incoming light between the masks, i.e., if mask i has an incoming light fraction l_i¹, the noise level for that mask is given by

σ_i = (1/l_i) σ.    (16)

Thus, masks such as 4(f), 4(g) and 4(h) are subject to lower noise levels than masks such as 4(a) and 4(b). Our method produces more consistent and accurate blur scale maps than previous methods for both random texture and natural images, and across the 8 masks it has been tested with.
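As a rough sketch of how the two error measures reported in Tables 2–5 could be computed (the per-pixel normalization is our assumption; the paper does not spell it out, and both function names are ours):

```python
import numpy as np

def blur_scale_error(d_true, d_est):
    """Mean L1 discrepancy between ground-truth and estimated blur
    scale maps, i.e., ||d - d_hat||_1 averaged over pixels (sketch)."""
    return np.mean(np.abs(d_true - d_est))

def image_error(f_true, f_est):
    """Image error ||f - f_hat||_2^2 + ||grad f - grad f_hat||_2^2;
    the gradient term penalizes reconstruction artifacts. Normalizing
    by the pixel count is our assumption (sketch)."""
    diff = f_true - f_est
    gx = np.diff(diff, axis=1)     # horizontal gradient of the residual
    gy = np.diff(diff, axis=0)     # vertical gradient of the residual
    return (np.sum(diff ** 2) + np.sum(gx ** 2) + np.sum(gy ** 2)) / diff.size
```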
5.2 Results on Real Data
We now apply the proposed blur scale estimation algorithm to coded aperture images captured by inserting the selected mask into a Canon 50mm f/1.4 lens mounted on a Canon EOS 5D DSLR, as described in [5,15]. Based on the analysis

¹ The value of l_i represents the fraction of the lens aperture that is open: when the lens aperture is fully open, l_i = 1; when the mask completely blocks the light, l_i = 0.
Fig. 8. Blur scale estimation, real texture. GT: ground-truth blur scale map (far to close). (a)–(h): estimated blur scale maps for all eight masks we consider in the paper. For each mask, the figure reports the blur scale map estimated with both Levin et al.'s method (left) and our method (right).
in section 4.1, we choose mask 4(b) and mask 4(d). Each of the 4 holes in the first mask is 3.5mm wide, which corresponds to the same overall open area as a conventional (circular) aperture with diameter 7.9mm (f/6.3 on a 50mm lens). All indoor images have been captured by setting the shutter speed to 30ms (ISO 320–500), while outdoors the exposure has been set to 2ms or lower (ISO 100). First, we need to collect (or synthesize) a sequence of L coded images, where L is the number of blur scale levels we want to distinguish. There are two techniques to acquire these coded images: (1) if the aim is just to estimate the depth map (or blur scale map), one can capture real coded images of a planar surface with sharp natural texture (e.g., a newspaper) at different blur scale levels; (2) if the goal is to reconstruct both the depth map and the all-in-focus image, one has to capture the PSF of the camera at each depth level, by projecting a grid of bright dots on a plane and using a long exposure; coded images are then simulated by applying the measured PSFs to sharp natural images collected from the web. In the experiments presented in this paper, we use the latter approach, since we estimate both the blur scale map and the all-in-focus image. The PSFs have been captured on a plane at 40 different depths between 60cm and 140cm from the camera. The focal plane of the camera was set at 150cm. In the first experiments, we show the advantage of our approach over Levin et al.'s method on a scene with blur sizes similar to the ones used in the performance
Fig. 9. Comparison of the estimated blur scale levels obtained from the 3 best methods (Lucy-Richardson, Levin, and our method) using both random (top) and real (bottom) texture. Each graph reports the performance of the algorithms with (a) mask 4(a), (b) mask 4(b), and (c) mask 4(d). Both the mean and the standard deviation (we show three times the computed standard deviation) of the estimated blur scale are shown as error bars with the algorithms' performances (solid lines) over the ideal characteristic curve (diagonal dashed line) for 39 blur sizes. Notice how the performance changes dramatically based on the nature of the texture (top row vs. bottom row). Moreover, in the case of real images the standard deviation of the estimates obtained with our method is more uniform for mask 4(b) than for mask 4(d). In the case of mask 4(d) the performance is reasonably accurate only with small blur scales.
test. The same dataset has been captured using mask 4(b) (see Fig. 11) and mask 4(d) (see Fig. 12). The size of the blur, especially in the background, is very large; this can be appreciated in Fig. 10(a), which shows the same scene captured with the same camera settings but without a mask on the lens. For a fair comparison, we do not apply any regularization or user intervention to the estimated blur scale maps. As already seen in Section 5.1 (especially in Fig. 9), Levin et al.'s method yields an accurate blur scale estimate with mask 4(d) when the size of the blur is small, but it fails with large amounts of blur. The proposed approach overcomes this limitation and yields a deblurred image that, in both cases (Fig. 11(e) and Fig. 12(e)), is closer to the ground truth (Fig. 10(b)). Notice also that our method gives an accurate reconstruction of the blur scale even without using regularization (β = 0 in eq. (9)). Some artefacts are still present in the reconstructed all-in-focus images. These are mainly due to the very large size of the blur and to the raw blur-scale map: when regularization is added to the blur-scale map (β > 0), the deblurring algorithm yields better results, as one can see in the next examples.
Table 2. Random texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, assuming there is no noise (σ = 0).

                       Blur scale estimation                     Image deblurring
Method                 a    b    c    d    e    f    g    h      a    b    c    d    e    f    g    h
Lucy-Richardson       16.8 14.4 17.2  2.9 17.0 18.1 17.8 15.4   0.22 0.22 0.21 0.22 0.22 0.22 0.22 0.21
Regularized filtering 18.4 17.2 18.6  6.8 16.7 12.3 18.8 13.4   0.30 0.32 0.27 0.32 0.25 0.42 0.23 0.25
Wiener filtering       8.8 13.8 14.4 16.6 16.3 15.3 14.1 15.3   0.23 0.29 0.29 0.33 0.31 0.32 0.27 0.30
Levin et al. [5]      16.7 13.7 16.7  1.4 16.6 16.8 17.6 13.3   0.22 0.21 0.22 0.21 0.21 0.22 0.22 0.21
Our method             1.2  0.9  3.7  0.9  4.2 10.3  3.8  9.6   0.20 0.20 0.21 0.21 0.21 0.22 0.21 0.22
Table 3. Real texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, assuming there is no noise (σ = 0).

                       Blur scale estimation                     Image deblurring
Method                 a    b    c    d    e    f    g    h      a    b    c    d    e    f    g    h
Lucy-Richardson       17.0 16.4 18.4 15.6 17.9 18.5 18.0 18.3   0.22 0.20 0.22 0.18 0.20 0.20 0.20 0.20
Regularized filtering 18.5 16.8 18.2  8.6 16.8 11.4 17.9 15.4   0.51 0.49 0.52 1.08 0.28 0.67 0.28 0.40
Wiener filtering      17.1 16.4 18.2 14.4 17.0 18.0 17.5 17.6   0.25 0.22 0.26 0.21 0.21 0.24 0.23 0.21
Levin et al. [5]      16.3 14.8 17.9  9.9 17.0 18.2 17.6 17.0   0.25 0.21 0.23 0.19 0.20 0.21 0.21 0.20
Our method             3.3  3.3  6.8  3.3  6.1 12.6  5.9 11.7   0.18 0.16 0.21 0.16 0.17 0.21 0.19 0.21
In Fig. 13 we have the same indoor scenario, but now the items are slightly closer to the focal plane of the camera; then the maximum amount of blur is reduced. Although the background is still very blur in the coded image (Fig. 13(a)), our accurate blurscale estimation yields to a deblurred image (Fig. 13(b)), where the text of the magazine becomes readable. Since the reconstructed blurscale map corresponds to the depth map (relative depth) of the scene, we can use it together with the allinfocus image to generate a 3D image 2 . This image, when watched with redcyan glasses, allows one to perceive the depth information extracted with our approach. All the regularized blurscale maps in this work are estimated from eq. (9) by setting β = 0.5; the raw maps, instead, are obtained without regularization term (β = 0). We have tested our approach on diﬀerent outdoor scenes: Fig. 15 and Fig. 14. In these scenarios we apply the subspaces we have learned within 150cm from the camera to a very large range of depths. Several challenges are present in these scenes, such as occlusions, shadows, and lack of texture. Our method demonstrates robustness to all of them. Notice again that the raw blurscale maps shown in Fig. 15(c) and Fig. 14(c) are already very close to the maps that include regularization (Fig. 15(d) and Fig. 14(d) respectively). For each dataset, a 2
² In this work, a 3D image corresponds to an image captured with a stereo camera, where one lens has a red filter and the other a cyan filter. When one views such an image with red-cyan glasses, each eye sees only one view: the shift between the two views gives the perception of depth.
Single Image Blind Deconvolution with HigherOrder Texture Statistics
Table 4. Random texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, under different levels of noise.

Image noise level σ = 0.001

Blur scale estimation:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.1  18.2  11.7  16.6  16.2  18.3  17.3
Regularized filtering  19.0  17.5  19.0  14.3  16.8  18.3  18.9  15.6
Wiener filtering       15.7  16.7  16.8  17.5  17.2  17.6  16.8  17.0
Levin et al. [5]       18.4  16.3  18.1  11.0  16.7  17.3  18.3  17.5
Our method              9.6   8.7  12.7  10.1  12.5  13.2  12.9  13.9

Image deblurring:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.39  0.36  0.27  0.28  0.35  0.29  0.26  0.27
Regularized filtering  0.88  0.96  0.61  1.03  0.93  0.61  0.61  0.91
Wiener filtering       0.35  0.37  0.36  0.39  0.38  0.38  0.35  0.38
Levin et al. [5]       0.32  0.31  0.26  0.28  0.30  0.28  0.25  0.26
Our method             0.20  0.21  0.22  0.22  0.21  0.23  0.21  0.23

Image noise level σ = 0.002

Blur scale estimation:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.1  18.2  12.1  16.6  16.3  18.3  17.3
Regularized filtering  18.9  17.4  18.8  12.7  16.7  16.9  18.9  16.9
Wiener filtering       15.5  16.4  16.7  17.3  17.1  17.5  16.8  17.0
Levin et al. [5]       18.5  16.9  18.0  12.1  16.7  17.6  18.4  17.7
Our method             11.3  11.1  13.2  11.3  12.6  13.5  12.8  14.0

Image deblurring:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.49  0.46  0.31  0.34  0.44  0.33  0.30  0.32
Regularized filtering  0.76  0.69  0.47  0.50  0.67  0.46  0.49  0.46
Wiener filtering       0.35  0.37  0.37  0.39  0.38  0.39  0.35  0.38
Levin et al. [5]       0.39  0.38  0.29  0.34  0.37  0.31  0.28  0.29
Our method             0.22  0.22  0.23  0.23  0.22  0.23  0.23  0.24

Image noise level σ = 0.005

Blur scale estimation:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.4  17.0  18.2  12.6  16.5  16.6  18.4  17.3
Regularized filtering  18.9  17.4  18.8  13.1  16.6  17.1  18.8  16.9
Wiener filtering       15.4  16.2  16.5  17.3  17.2  17.3  16.7  17.0
Levin et al. [5]       18.5  16.9  18.0  12.5  16.7  17.7  18.4  17.7
Our method             12.8  12.6  13.4  12.0  12.8  13.5  13.5  14.0

Image deblurring:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.66  0.62  0.41  0.47  0.61  0.40  0.40  0.43
Regularized filtering  1.17  1.04  0.69  0.75  1.03  0.59  0.73  0.68
Wiener filtering       0.35  0.37  0.37  0.39  0.38  0.39  0.35  0.38
Levin et al. [5]       0.55  0.54  0.37  0.45  0.51  0.37  0.36  0.39
Our method             0.25  0.25  0.26  0.25  0.25  0.26  0.26  0.27
Table 5. Real texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, under different levels of noise.

Image noise level σ = 0.001

Blur scale estimation:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.2  18.3  13.7  16.8  17.8  18.4  18.1
Regularized filtering  19.0  17.5  19.0  14.0  16.8  17.6  19.0  15.6
Wiener filtering       13.8  14.5  14.1  14.6  15.2  14.4  14.8  14.5
Levin et al. [5]       18.4  16.8  18.1  10.6  16.7  17.0  18.2  17.8
Our method              8.7   7.8  11.8   7.7  11.9  13.5  11.5  13.8

Image deblurring:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.38  0.35  0.26  0.24  0.32  0.23  0.24  0.25
Regularized filtering  0.96  1.05  0.66  1.39  0.94  0.68  0.64  1.02
Wiener filtering       0.21  0.23  0.22  0.22  0.23  0.21  0.21  0.23
Levin et al. [5]       0.34  0.33  0.27  0.30  0.30  0.27  0.24  0.25
Our method             0.21  0.18  0.22  0.17  0.19  0.20  0.20  0.20

Image noise level σ = 0.002

Blur scale estimation:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.2  18.3  13.2  16.7  17.5  18.4  17.9
Regularized filtering  19.0  17.5  19.0  14.1  16.8  18.1  19.0  15.7
Wiener filtering       14.7  15.8  15.2  15.8  16.0  15.1  15.0  15.7
Levin et al. [5]       18.4  16.8  18.1  11.1  16.7  17.1  18.3  17.7
Our method             10.6   9.5  12.1   9.0  12.3  13.5  12.1  14.1

Image deblurring:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.47  0.44  0.30  0.29  0.40  0.26  0.27  0.30
Regularized filtering  1.26  1.38  0.87  1.72  1.30  0.74  0.87  1.34
Wiener filtering       0.23  0.25  0.24  0.24  0.25  0.24  0.22  0.25
Levin et al. [5]       0.41  0.40  0.30  0.37  0.37  0.30  0.28  0.29
Our method             0.24  0.19  0.23  0.17  0.19  0.20  0.21  0.20

Image noise level σ = 0.005

Blur scale estimation:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.3  17.1  18.2  12.9  16.6  17.4  18.4  17.9
Regularized filtering  19.0  17.5  19.0  14.1  16.8  18.1  18.9  15.7
Wiener filtering       15.6  16.5  16.1  16.8  16.7  16.5  16.0  16.7
Levin et al. [5]       18.5  16.9  18.1  11.3  16.7  17.4  18.4  17.7
Our method             12.2  11.8  13.3  10.8  12.7  13.7  13.4  13.7

Image deblurring:
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.61  0.58  0.39  0.40  0.55  0.34  0.37  0.40
Regularized filtering  1.89  2.07  1.31  2.38  2.03  0.88  1.31  2.02
Wiener filtering       0.26  0.27  0.26  0.27  0.27  0.26  0.24  0.26
Levin et al. [5]       0.56  0.55  0.38  0.49  0.51  0.37  0.35  0.39
Our method             0.26  0.22  0.24  0.19  0.21  0.22  0.22  0.25
M. Martinello and P. Favaro
Fig. 10. (a) Picture taken with the conventional camera without placing the mask on the lens. (b) Image captured by simulating a pinhole camera (f/22.0), which can be used as ground truth for the image texture.
3D image (Fig. 14(e) and Fig. 15(e)) has been generated by using just the output of our method: the deblurred images (b) and the blur-scale maps (d). The ground-truth images have been taken by simulating a pinhole camera (f/22.0).

5.3 Computational Cost
To keep the algorithm efficient, we downsample the input images by a factor of 4 from their original resolution of 12.8 megapixels (4,368 × 2,912) and use subpixel accuracy. We have seen from experiments on real data that the raw blur-scale map is already very close to the regularized map. This means that we can obtain a reasonable blur-scale map very efficiently: when β = 0, the value of the blur scale at one pixel is independent of the other pixels, and the calculations can be carried out in parallel. Since the algorithm takes about 5 ms to process 40 blur-scale levels at each pixel, it is suitable for real-time applications. We have run the algorithm on a quad-core 2.8 GHz machine with 16 GB of memory; the code is written mainly in Matlab 7. The deblurring procedure, instead, takes about 100 s to process the whole image for 40 blur-scale levels.
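Because the per-pixel computation with β = 0 is embarrassingly parallel, the raw blur-scale map reduces to a vectorized argmin over candidate blur scales. The following sketch is our own illustration of that step (the function name, the projector representation, and the patch layout are assumptions, not the authors' Matlab code):

```python
import numpy as np

def raw_blur_scale_map(patches, projectors):
    """Per-pixel blur-scale selection for beta = 0 (no regularization).

    patches:    (N, p) array, one vectorized local patch per pixel
    projectors: list of (p, p) orthogonal-complement projectors, one per
                candidate blur scale (playing the role of H_d-perp)
    Returns, for each pixel, the index of the blur scale that minimizes
    the residual energy ||H_d-perp g||^2.
    """
    residuals = np.stack([np.sum((patches @ P.T) ** 2, axis=1)
                          for P in projectors])   # (num_scales, N)
    return np.argmin(residuals, axis=0)           # (N,) raw blur-scale map
```

Since every pixel is independent, the loop over scales is the only sequential part; on a multicore machine or GPU the whole map can be evaluated in one pass.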
6 Conclusions

We have presented a novel method to recover the all-in-focus image from a single blurred image captured with a coded aperture camera. The method is split into two steps: a subspace-based blur-scale identification approach and an image deblurring algorithm based on conjugate gradient descent. The method is simple, general, and computationally efficient. We have compared our method to existing algorithms in the literature and shown that we achieve state-of-the-art performance in blur-scale identification and image deblurring on both synthetic and real data, while retaining polynomial time complexity.
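The deblurring step minimizes a regularized least-squares energy of the form ||H_d f − g||² + α||Σf||², whose normal equations are symmetric positive definite and therefore well suited to conjugate gradient descent. A minimal dense sketch of this idea (our illustration under stated assumptions; the matrix names and sizes are ours, and a practical implementation would work matrix-free on full images):

```python
import numpy as np

def cg_solve(A, b, iters=100, tol=1e-12):
    # conjugate gradient for a symmetric positive-definite system A x = b
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def deblur(H, Sigma, g, alpha):
    # normal equations of ||H f - g||^2 + alpha ||Sigma f||^2
    A = H.T @ H + alpha * Sigma.T @ Sigma
    return cg_solve(A, H.T @ g)
```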
Fig. 11. Comparison on real data, mask 4(b). (a) Input image captured using mask 4(b). (b-c) Blur-scale map and all-in-focus image reconstructed with Levin et al.'s method [5]; (d-e) results obtained with our method.
Fig. 12. Comparison on real data, mask 4(d). (a) Input image captured using mask 4(d). (b-c) Blur-scale map and all-in-focus image reconstructed with Levin et al.'s method [5]; (d-e) results obtained with our method.
Fig. 13. Closerange indoor scene [exposure time: 1/30s]. (a) coded image captured with mask 4(b); (b) estimated allinfocus image; (c) estimated blurscale map; (d) 3D image (to be watched with redcyan glasses).
Fig. 14. Longrange outdoor scene [exposure time: 1/200s]. (a) coded image captured with mask 4(b); (b) estimated allinfocus image; (c) raw blurscale map (without regularization); (d) regularized blurscale map; (e) 3D image (to be watched with redcyan glasses); (f) groundtruth image.
Fig. 15. Midrange outdoor scene [exposure time: 1/200s]. (a) coded image captured with mask 4(b); (b) estimated allinfocus image; (c) raw blurscale map (without regularization); (d) regularized blurscale map; (e) 3D image (to be watched with redcyan glasses); (f) groundtruth image.
Appendix

Proof of Theorem 1

To prove the theorem we rewrite the least-squares problem in f as

\[ \|H_d f - g\|_2^2 + \alpha\|\Sigma f\|_2^2 = \left\| \begin{bmatrix} H_d \\ \sqrt{\alpha}\,\Sigma \end{bmatrix} f - \begin{bmatrix} g \\ 0 \end{bmatrix} \right\|_2^2 = \|\bar{H}_d f - \bar{g}\|_2^2 \tag{17} \]

where we have defined \(\bar{H}_d = [H_d^T\ \sqrt{\alpha}\,\Sigma^T]^T\) and \(\bar{g} = [g^T\ 0^T]^T\). Then, we can write the solution in f as \(\hat{f} = (\bar{H}_d^T \bar{H}_d)^{-1}\bar{H}_d^T \bar{g}\). By substituting this solution for f back into the least-squares problem, we obtain

\[ \|H_d f - g\|_2^2 + \alpha\|\Sigma f\|_2^2 = \|\bar{H}_d^{\perp} \bar{g}\|_2^2 \tag{18} \]

where \(\bar{H}_d^{\perp} = I - \bar{H}_d(\bar{H}_d^T \bar{H}_d)^{-1}\bar{H}_d^T\).

We have shown that we can use \(\bar{H}_d^{\perp}\) rather than \(H_d^{\perp}\) and \(\bar{g}\) rather than \(g\) in the minimization problem (5) without affecting the solution. The rest of the proof then assumes that the energy in eq. (5) is based on \(\|\bar{H}_d^{\perp} \bar{g}\|_2^2\). The step above is necessary to fully exploit the properties of \(\bar{H}_d^{\perp}\): it is a symmetric matrix (i.e., \((\bar{H}_d^{\perp})^T = \bar{H}_d^{\perp}\)) and is also idempotent (i.e., \(\bar{H}_d^{\perp} = (\bar{H}_d^{\perp})^2\)). By applying these properties we can write the argument of the first term of the cost in eq. (5) as

\[ \bar{g}^T \bar{H}_d^{\perp} \bar{g} = \bar{g}^T (\bar{H}_d^{\perp})^T \bar{H}_d^{\perp} \bar{g} = \|\bar{H}_d^{\perp} \bar{g}\|^2. \tag{19} \]

Moreover, from the definition of \(\bar{H}_d^{\perp}\) we know that

\[ \bar{H}_d^{\perp} = I - \bar{H}_d(\bar{H}_d^T \bar{H}_d)^{-1}\bar{H}_d^T = I - \bar{H}_d \bar{H}_d^{\dagger}. \tag{20} \]

Thus, the necessary conditions for an extremum of eq. (5) become

\[ \begin{cases} \left(\bar{g} - \bar{H}_d \bar{H}_d^{\dagger}\bar{g}\right)^T \left(\nabla\bar{H}_d\,\bar{H}_d^{\dagger} + \bar{H}_d\,\nabla\bar{H}_d^{\dagger}\right)\bar{g} = \nabla\cdot\dfrac{\nabla d}{|\nabla d|_1} \\ f = \bar{H}_d^{\dagger}\bar{g} \end{cases} \tag{21} \]

where \(\nabla\bar{H}_d\) is the gradient of \(\bar{H}_d\) with respect to d, and the right-hand side of the first equation is the gradient of \(\|\nabla d\|_1\) with respect to d. Similarly, the necessary conditions for eq. (4) are

\[ \begin{cases} \left(\bar{g} - \bar{H}_d f\right)^T \nabla\bar{H}_d\, f = \nabla\cdot\dfrac{\nabla d}{|\nabla d|_1} \\ \bar{H}_d^T\left(\bar{g} - \bar{H}_d f\right) = 0 \end{cases} \tag{22} \]

It is now immediate to apply the same derivation as in [46] and demonstrate that the left-hand sides of the first equations in systems (22) and (21) are identical. Since the right-hand sides are also identical, the first equations have the same solutions. The second equations in (22) and (21) are identical by construction.
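The two properties of the projector used in the proof (symmetry and idempotence), together with the identity from eq. (18) that the projected energy equals the least-squares residual, are easy to verify numerically. A quick sanity check with a random full-rank matrix (our own illustration, not part of the original proof):

```python
import numpy as np

rng = np.random.default_rng(0)
Hbar = rng.standard_normal((8, 3))   # stand-in for H-bar_d (full column rank)
gbar = rng.standard_normal(8)

# orthogonal-complement projector, as in eq. (20)
P = np.eye(8) - Hbar @ np.linalg.inv(Hbar.T @ Hbar) @ Hbar.T

# least-squares solution f-hat and its residual
f_hat = np.linalg.inv(Hbar.T @ Hbar) @ Hbar.T @ gbar
residual = Hbar @ f_hat - gbar
```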
References

1. Jones, D., Lamb, D.: Analyzing the visual echo: Passive 3d imaging with a multiple aperture camera. Technical report, McGill University (1993)
2. Dowski, E.R., Cathey, W.T.: Extended depth of field through wavefront coding. Applied Optics 34, 1859–1866 (1995)
3. Farid, H.: Range Estimation by Optical Differentiation. PhD thesis, University of Pennsylvania (1997)
4. Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Trans. Graph. 26, 69 (2007)
5. Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional camera with a coded aperture. ACM Trans. Graph. 26, 70 (2007)
6. Bishop, T., Zanetti, S., Favaro, P.: Light field superresolution. In: ICCP (2009)
7. Cossairt, O., Nayar, S.: Spectral focal sweep: Extended depth of field from chromatic aberrations. In: ICCP (2010)
8. Liang, C.K., Lin, T.H., Wong, B.Y., Liu, C., Chen, H.: Programmable aperture photography: Multiplexed light field acquisition. ACM Trans. Graph. 27, 55:1–55:10 (2008)
9. Ng, R., Levoy, M., Brédif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field photography with a hand-held plenoptic camera. Technical Report CSTR 2005-02, Stanford University CS (2005)
10. Gottesman, S.R., Fenimore, E.E.: New family of binary arrays for coded aperture imaging. Applied Optics 28, 4344–4352 (1989)
11. Zomet, A., Nayar, S.K.: Lensless imaging with a controllable aperture. In: CVPR, vol. 1, pp. 339–346 (2006)
12. Raskar, R., Agrawal, A.K., Tumblin, J.: Coded exposure photography: Motion deblurring using fluttered shutter. ACM Trans. Graph. 25, 795–804 (2006)
13. Hiura, S., Matsuyama, T.: Depth measurement by the multi-focus camera. In: CVPR, vol. 2, pp. 953–961 (1998)
14. Zhou, C., Lin, S., Nayar, S.K.: Coded aperture pairs for depth from defocus. In: ICCV (2009)
15. Zhou, C., Nayar, S.: What are good apertures for defocus deblurring? In: IEEE ICCP (2009)
16. Levin, A., Hasinoff, S., Green, P., Durand, F., Freeman, W.T.: 4D frequency analysis of computational cameras for depth of field extension. ACM Trans. Graph. 28 (2009)
17. McLean, D.: The improvement of images obtained with annular apertures. Royal Society of London 263, 545–551 (1961)
18. Greengard, A., Schechner, Y.Y., Piestun, R.: Depth from diffracted rotation. Optics Letters 31, 181–183 (2006)
19. Dowski, E.R., Cathey, W.T.: Single-lens single-image incoherent passive-ranging systems. Applied Optics 33, 6762–6773 (1994)
20. Johnson, G.E., Dowski, E.R., Cathey, W.T.: Passive ranging through wavefront coding: Information and application. Applied Optics 39, 1700–1710 (2000)
21. Cossairt, O., Zhou, C., Nayar, S.K.: Diffusion coding photography for extended depth of field. ACM Trans. Graph. (2010)
22. Dou, Q., Favaro, P.: Off-axis aperture camera: 3d shape reconstruction and image restoration. In: CVPR (2008)
23. Georgiev, T., Zheng, K., Curless, B., Salesin, D., Nayar, S., Intwala, C.: Spatio-angular resolution tradeoffs in integral photography. In: Eurographics Workshop on Rendering, pp. 263–272 (2006)
24. Levoy, M., Ng, R., Adams, A., Footer, M., Horowitz, M.: Light field microscopy. ACM Trans. Graph. 25, 924–934 (2006)
25. Fergus, R., Singh, B., Hertzmann, A., Roweis, S., Freeman, W.: Removing camera shake from a single photograph. ACM Trans. Graph. 25, 787–794 (2006)
26. Shan, Q., Jia, J., Agarwala, A.: High-quality motion deblurring from a single image. ACM Trans. Graph. (2008)
27. Levin, A., Weiss, Y., Durand, F., Freeman, W.T.: Understanding and evaluating blind deconvolution algorithms. In: CVPR, pp. 1964–1971 (2009)
28. Cho, S., Lee, S.: Fast motion deblurring. Siggraph Asia 28 (2009)
29. Xu, L., Jia, J.: Two-phase kernel estimation for robust motion deblurring. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 157–170. Springer, Heidelberg (2010)
30. Shan, Q., Xiong, W., Jia, J.: Rotational motion deblurring of a rigid object from a single image. In: ICCV, pp. 1–8 (2007)
31. Whyte, O., Sivic, J., Zisserman, A., Ponce, J.: Non-uniform deblurring for shaken images. In: CVPR, pp. 491–498 (2010)
32. Gupta, A., Joshi, N., Zitnick, C., Cohen, M., Curless, B.: Single image deblurring using motion density functions. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 171–184. Springer, Heidelberg (2010)
33. Nishiyama, M., Hadid, A., Takeshima, H., Shotton, J., Kozakaya, T., Yamaguchi, O.: Facial deblur inference using subspace analysis for recognition of blurred faces. IEEE T-PAMI 33, 1–8 (2011)
34. Pouli, T., Cunningham, D.W., Reinhard, E.: Image statistics and their applications in computer graphics. Eurographics, State of the Art Report (2010)
35. Ruderman, D.L.: The statistics of natural images. Network: Computation in Neural Systems 5, 517–548 (1994)
36. Huang, J., Mumford, D.: Statistics of natural images and models. In: CVPR, vol. 1, pp. 1541–1548 (1999)
37. Huang, J., Lee, A., Mumford, D.: Statistics of range images. In: CVPR, pp. 324–331 (2000)
38. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
39. Kolmogorov, V., Zabih, R.: Multi-camera scene reconstruction via graph cuts. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 82–96. Springer, Heidelberg (2002)
40. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C. Cambridge University Press, Cambridge (1988)
41. Sun, X., Cheng, Q.: On subspace distance. In: Image Analysis and Recognition, pp. 81–89 (2006)
42. Premaratne, P., Ko, C.: Zero sheet separation of blurred images with symmetrical point spread functions. Signals, Systems, and Computers, 1297–1299 (1999)
43. Martinello, M., Bishop, T.E., Favaro, P.: A bayesian approach to shape from coded aperture. In: ICIP (2010)
44. Snyder, D., Schulz, T., O'Sullivan, J.: Deblurring subject to nonnegativity constraints. IEEE Trans. on Signal Processing 40(5), 1143–1150 (1992)
45. Bertero, M., Boccacci, P.: Introduction to inverse problems in imaging. Institute of Physics Publishing, Bristol (1998)
46. Favaro, P., Soatto, S.: A geometric approach to shape from defocus. T-PAMI 27, 406–417 (2005)
Compressive Rendering of Multidimensional Scenes

Pradeep Sen, Soheil Darabi, and Lei Xiao

Advanced Graphics Lab, University of New Mexico, Albuquerque, NM 87113
Abstract. Recently, we proposed the idea of using compressed sensing to reconstruct the 2D images produced by a rendering system, a process we called compressive rendering. In this work, we present the natural extension of this idea to multidimensional scene signals as evaluated by a Monte Carlo rendering system. Basically, we think of a distributed ray tracing system as taking point samples of a multidimensional scene function that is sparse in a transform domain. We measure a relatively small set of point samples and then use compressed sensing algorithms to reconstruct the original multidimensional signal by looking for sparsity in a transform domain. Once we reconstruct an approximation to the original scene signal, we can integrate it down to a ﬁnal 2D image which is output by the rendering system. This general form of compressive rendering allows us to produce eﬀects such as depthofﬁeld, motion blur, and area light sources, and also renders animated sequences eﬃciently.
1 Introduction
The process of rendering an image as computed by Monte Carlo (MC) rendering systems involves the estimation of a set of integrals of a multidimensional function that describes the scene. For example, for a scene with depth-of-field and motion blur, we can think of the distributed ray tracing system as taking point samples of a 5D continuous "scene signal" f(x, y, u, v, t), where f() is the scene-dependent function, (x, y) represents the position of the sample on the image, (u, v) is its position on the aperture for the depth-of-field, and t is the time at which the sample is calculated. The ray tracing system can compute point samples of this function by fixing the parameters (x, y, u, v, t) and evaluating the radiance of a ray with those parameters. The basic idea of Monte Carlo rendering is that by taking a large set of random point samples of this function, we can approximate the definite integral:

\[ I(i, j) = \int_{j-\frac{1}{2}}^{j+\frac{1}{2}} \int_{i-\frac{1}{2}}^{i+\frac{1}{2}} \int_{t_0}^{t_1} \int_{-1}^{1} \int_{-1}^{1} f(x, y, u, v, t)\, du\, dv\, dt\, dx\, dy, \tag{1} \]
D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 152–183, 2011. © Springer-Verlag Berlin Heidelberg 2011

which gives us the value of the final image I at pixel (i, j) by integrating over the camera aperture from [−1, 1], over the time that the shutter is open [t_0, t_1], and over the pixel for antialiasing. In rendering, we use Monte Carlo integration
to estimate integrals like these because ﬁnding an analytical solution to these integrals is nearly impossible for real scene functions f (). Unfortunately, Monte Carlo rendering systems require a large number of multidimensional samples in order to converge to the actual value of the integral, because the variance of the estimate of the integral decreases as O(1/k) with the number of samples k. If a small number of samples is used, the resulting image is very noisy and cannot be used for highend rendering applications. The noise in the Monte Carlo result is caused by variance in the estimate, and there have been many approaches proposed in the past for reducing the variance in MC rendering. One common method for variance reduction is stratified sampling, wherein the integration domain is broken up into a set of equallysized nonoverlapping regions (or strata) and a single sample is placed randomly in each, which reduces the variance of the overall estimate [1]. Other techniques for variance reduction exist, but they typically require more information about f (). For example, importance sampling positions samples with a distribution p(x) that mimics f () as closely as possible. It can be shown that if p(x) is set to a normalized version of f (), then the variance of our estimator will be exactly zero [2]. However, this normalization involves knowing the integral of f (), which is obviously unknown in our case. Nevertheless, importance sampling can be useful when some information about the shape of f () is known, such as the position of light sources in a scene. In this work, however, we assume that we do not know anything about the shape of f () that we can use to position samples, which makes our approach a kind of technique often known as blind Monte Carlo. The only assumption we will make is that f () is a realworld signal that is sparse or compressible in a transform domain. 
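A minimal 1D sketch of the two unbiased estimators discussed above, plain and stratified Monte Carlo (our own illustration; the function names are ours):

```python
import random

def mc_estimate(f, k):
    # plain Monte Carlo: average f at k uniform random samples in [0, 1]
    return sum(f(random.random()) for _ in range(k)) / k

def stratified_estimate(f, k):
    # stratified sampling: one random sample per stratum [i/k, (i+1)/k)
    return sum(f((i + random.random()) / k) for i in range(k)) / k
```

Both estimators are unbiased for the integral of f over [0, 1], but the stratified one has lower variance because each stratum can contribute at most its local variation.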
Other kinds of variance reduction techniques have been proposed that introduce biased estimators, meaning that the expected value of the estimator is not equal to the exact value of the integral. Although stratified and importance sampling are both unbiased, biased Monte Carlo algorithms are also common in computer graphics (e.g., photon mapping [3]) because they sometimes converge much faster while yielding plausible results. The approach proposed in this chapter also converges much more quickly than the traditional unbiased approaches, but it produces a slightly biased result. As we shall see, this occurs because of the discretization of the function when we pose the problem within the framework of compressed sensing (CS). However, this bias is small while the improvement in the convergence rate is considerable. This chapter is based on ideas presented by the first two authors in work published in the IEEE Transactions on Visualization and Computer Graphics [4] and the Eurographics Symposium on Rendering (EGSR) [5]. In the first work, we introduced the idea of using compressed sensing as a way of filling in missing pixel information in order to accelerate rendering. In that approach, we first render only a fraction of the pixels in the image (which provides the speedup) and then estimate the values of the missing pixels using compressed sensing, by assuming that the final image is compressible in a transform domain. In the second work, we began to expand this idea to the concept of estimating an
Fig. 1. The effect of dimensionality on the compressibility of the signal in the Fourier domain. As the dimension of our scene function f() increases, the compressibility of the data increases as well. Here we show a 4D scene with pixel antialiasing (2D) and depth-of-field (2D), which we have sparsified to 98% sparsity in the Fourier domain by zeroing out 98% of the Fourier coefficients. (a) Reference image. (b) Image generated by integrating the function down to 2D and then sparsifying it to 98% in the Fourier domain. We can see a significant amount of ringing and artifacts, which indicates that the 2D signal is not very compressible in the Fourier domain. This is the reason we use wavelets for compression when handling 2D signals (see Secs. 5 and 6). (c) Image generated by integrating the function down to 3D (by integrating out the u parameter) and then sparsifying it to 98% in the Fourier domain. There are fewer artifacts than before, although they are still visible. (d) Image generated by sparsifying the original scene function f() to 98% in the Fourier domain. The artifacts are greatly reduced here, indicating that as the dimensionality of the signal goes up, the transform-domain sparsity also increases.
underlying multidimensional signal which we then integrate down to produce our ﬁnal image. At the Dagstuhl workshop on Computational Video [6], we presented initial results on applying these ideas to animated video sequences (see Sec. 8). This work presents a more general framework for compressive rendering that ties all of these ideas together into an algorithm that can handle a general set of Monte Carlo eﬀects by estimating a multidimensional scene function from a small set of samples. By moving to higherdimensional data sets, we improve the quality of the reconstructions because compressed sensing algorithms improve as the signal becomes more sparse, and as the dimension of the problem increases the sparsity (or technically, compressibility) of the signal also increases. The reason for this is that the amount of data in the signal goes up exponentially with the dimension, but the amount of actual information does not increase at this rate. As shown in Fig. 1, a 4D signal sparsiﬁed to 98% produces a much better quality image than a 2D signal sparsiﬁed the same amount. To present this work, we ﬁrst begin by describing previous work in rendering as it relates to Monte Carlo rendering and transformdomain accelerations proposed in the past. Next, we present a brief introduction to the theory of compressed sensing, since it is a ﬁeld still relatively new to computer graphics. In Sec. 4, we present an overview of our general approach as well as a simple 1D example to compare the reconstruction of a signal from CS with those of traditional techniques such as parametric ﬁtting. Secs. 5 – 10 then show applications of this
framework starting with 2D signals and building up to more complex 4D scenes. Finally, we end the chapter with some discussion and conclusions. We note that since this chapter is in fact a generalization of two previous papers [5, 4], we have taken the liberty of drawing heavily from our own text in these papers and the associated technical report [7], often verbatim, to maintain consistency across all the publications. We also duplicate results as necessary for completeness of this text.
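As a self-contained illustration of the compressed-sensing reconstruction idea, the sketch below recovers a signal that is sparse in the DCT domain from a subset of its point samples using orthogonal matching pursuit. This is our own toy example, not the chapter's actual solver:

```python
import numpy as np

def dct_basis(n):
    # orthonormal DCT-II basis B, so that x = B @ c
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    B = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    B[:, 0] /= np.sqrt(2.0)
    return B

def omp(A, y, s):
    # orthogonal matching pursuit: greedy s-sparse solution of A @ c ~= y
    r, support = y.copy(), []
    c_sub = np.zeros(0)
    for _ in range(s):
        support.append(int(np.argmax(np.abs(A.T @ r))))
        c_sub, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        r = y - A[:, support] @ c_sub
    c = np.zeros(A.shape[1])
    c[support] = c_sub
    return c

# toy "rendering" setting: sample m of n points, then reconstruct
n, m, s = 64, 24, 3
rng = np.random.default_rng(0)
B = dct_basis(n)
c_true = np.zeros(n)
c_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x = B @ c_true                        # signal, sparse in the DCT domain
rows = rng.choice(n, m, replace=False)
x_hat = B @ omp(B[rows], x[rows], s)  # typically recovers x from m << n samples
```

Note that the number of samples m needed depends on the sparsity s, not on the signal's bandlimit, which is the central point of compressed sensing exploited throughout this chapter.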
2 Previous Work
Our framework produces noise-free Monte Carlo rendered images from a small set of samples by filling in the missing samples of the multidimensional function using compressed sensing. Similar topics have been the subject of research in the graphics community for many years. We break up the previous work into algorithms that exploit transform-domain sparsity for rendering, algorithms that accelerate the rendering process outright, algorithms that are used to fill in missing sample information, and finally applications of compressed sensing in computer graphics.

2.1 Transform Compression in Rendering
There is a long history of research into transformbased compression to accelerate or improve rendering algorithms. We brieﬂy survey some of the relevant work here and refer readers to more indepth surveys (e.g., [8, 9]) for more detail. For background on wavelets, the texts by Stollnitz et al. [10] and Mallat [11] oﬀer good starting points. In the area of image rendering, transform compression techniques have been used primarily for accelerating the computation of illumination. For example, the seminal work of Hanrahan et al. [12] uses an elegant hierarchical approach to create a multiresolution model of the radiosity in a scene. While it does not explicitly use wavelets, their approach is equivalent to using a Haar basis. This work has been extended to use diﬀerent kinds of wavelets or to subdivide along shadow boundaries to further increase the eﬃciency of radiosity algorithms, e.g., [13, 14, 15]. Recently, interest in transformdomain techniques for illumination has been renewed through research into eﬃcient precomputed radiance transfer methods using bases such as spherical harmonics [16, 17] or Haar wavelets [18, 19]. Again, these approaches focus on using the sparsity of the illumination or the BRDF reﬂectance functions in a transform domain, not on exploiting the sparsity of the ﬁnal image. In terms of using transformdomain approaches to synthesize the ﬁnal image, the most successful work has been in the ﬁeld of volume rendering. In this area, both the Fourier [20, 21] and wavelet domains [22] have been leveraged to reduce rendering times. However, the problem they are solving is signiﬁcantly diﬀerent than that of image rendering, so their approaches do not map well to
the problem addressed in this work. Finally, perhaps the most similar rendering approach is the frequency-based ray tracing work of Bolin and Meyer [23]. Like our own approach, they take a set of samples and then try to solve for the transform coefficients (in their case, of the Discrete Cosine Transform) that would result in those measurements. However, the key difference is that they solve for these coefficients using least squares, which means that they can only reconstruct the frequencies of the signal that have sufficient measurements as given by the Nyquist-Shannon sampling theorem. Our approach, on the other hand, is based on the more recent work on compressed sensing, which specifies that the sampling rate depends on the sparsity of the signal rather than on its bandlimit. This allows us to reconstruct frequencies higher than those specified by the Nyquist rate. We show an example in Sec. 4.1 that highlights this difference. By posing the problem of determining the value of missing samples within the framework of compressed sensing, we leverage the diverse set of tools that have been recently developed for these kinds of problems.

2.2 Accelerating Ray Tracing and Rendering

Most of the work in accelerating ray tracing has focused on novel data structures for accelerating the scene traversal [24]. These methods are orthogonal to ours since we do not try to accelerate the ray tracing process (which involves point-sampling the multidimensional function) but rather focus on generating a better image with fewer samples. However, there are algorithms to accelerate rendering that take advantage of the spatial correlation of the final image, which in the end is related to sparsity in the wavelet domain. Most common is the process of adaptive sampling [25, 26], in which a fraction of the samples are computed and new samples are added only where the difference between measured samples is large enough, e.g., by a measure of contrast.
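A 1D caricature of contrast-driven adaptive sampling (our own illustration, not any of the cited systems): subdivide an interval only where neighboring samples differ by more than a threshold.

```python
def adaptive_integrate(f, lo, hi, contrast, max_depth):
    # refine only where the endpoint contrast exceeds the threshold
    a, b = f(lo), f(hi)
    if max_depth == 0 or abs(a - b) <= contrast:
        return (a + b) / 2.0 * (hi - lo)   # trapezoid estimate on this span
    mid = (lo + hi) / 2.0
    return (adaptive_integrate(f, lo, mid, contrast, max_depth - 1) +
            adaptive_integrate(f, mid, hi, contrast, max_depth - 1))
```

For clarity this sketch re-evaluates f at shared endpoints (a real implementation would cache them), and, as the text notes, it operates purely in the spatial domain: there is no obvious way to drive the refinement by an arbitrary wavelet or Fourier basis.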
Unlike our approach, however, adaptive sampling still computes the image in the spatial domain, which makes it impossible to apply arbitrary wavelet transforms. For example, in the 2D missing-pixel case of Sec. 5 we use the CDF 9/7 wavelet transform because it has been shown to be very good at compressing imagery. In other sections, we use sparsity in the Fourier domain. It is unclear how existing adaptive methods could be modified to use bases like these. There is also a significant body of work that attempts to reconstruct images from sparse samples by using specialized data structures. First, there are systems that try to improve the interactivity of rendering by interpolating a set of sparse, rendered samples, such as the Render Cache [27] and the Shading Cache [28]. There are also approaches that perform interpolation while explicitly observing boundary edges to prevent blurring across them. Examples include Pighin et al.'s image-plane discontinuity mesh [29], the directional coherence map proposed by Guo [30], the edge-and-point image data structure of Bala et al. [31], and the real-time silhouette map by Sen [32, 33]. Our work is fundamentally different from these approaches because we never explicitly encode edges or use a data structure to improve the interpolation between samples. Rather, we take advantage of the compressibility of the final multidimensional signal in a
transform domain in order to reconstruct it and produce the final image. This allows us to faithfully reconstruct edges in the image, as can be seen in our results.

2.3 Reconstruction of Missing Data
Our approach computes only a fraction of the samples and uses compressed sensing to "guess" the values of the missing samples of the multidimensional scene function. In computer graphics and vision, many techniques have been proposed to fill in missing sample data. In the case of 2D signals such as images, techniques such as inpainting [34] and hole-filling [35] have been explored. Typically, these approaches work by taking a band of measured pixels around the unknown region and minimizing an energy functional that propagates this information smoothly into the unknown region while preserving important features such as edges. Although we could use these algorithms to fill in the missing pixels in our 2D rendering application, the random nature of the rendered pixels makes our application fundamentally different from typical hole-filling, where the missing pixels have localized structure due to specific properties of the scene (such as visibility) which in our case are not available until render time. Furthermore, these methods become much less effective and more complex when trying to fill in the missing data in higher-dimensional cases, especially as we get to 4D scene functions or larger. Nevertheless, we compare our algorithm to inpainting in Sec. 5 to help validate our approach. Perhaps the most successful approaches for reconstructing images from nonuniform samples in the 2D case come from the nonuniform sampling community, where this is known as the "missing data" problem, since one is trying to reconstruct the missing samples of a discrete signal. Readers are referred to Ch. 6 of the principal text on the subject by Marvasti [36] for a complete explanation. One successful algorithm is known as ACT [37], which fits trigonometric polynomials (related to the Fourier series) to the point-sampled measurements in a least-squares sense by solving the system using Toeplitz matrix inversion.
This is related to the frequency-based ray tracing of Bolin and Meyer [23] described earlier. Another approach, known as the Marvasti method [38], solves the missing data problem by iteratively building up the inverse of the system formed by the nonuniform sampling pattern combined with a low-pass filter. However, both the ACT and Marvasti approaches fundamentally assume that the image is bandlimited in order to do the reconstruction, something that is not true in our rendering application. As we show later in this paper, our algorithm relaxes the bandlimited assumption and is able to recover some of the high-frequency components of the image signal. Nevertheless, since ACT and the Marvasti method represent state-of-the-art approaches in the nonuniform sampling community for the reconstruction of missing pixels in a nonuniformly sampled image, we compare our approach against these algorithms in Sec. 5. Unfortunately, neither of these algorithms is suitable for higher-dimensional signals.
P. Sen, S. Darabi, and L. Xiao
2.4 Compressed Sensing and Computer Graphics
In this paper, we use tools developed for compressed sensing to solve the problem of reconstructing rendered images with missing pixel samples. Although compressed sensing has been applied to a wide range of problems in many fields, in computer graphics there are only a few published works that have used CS. Other than the work on compressive rendering on which this paper is based [5, 4], most of the other applications of CS in graphics are in the area of light-transport acquisition [39, 40, 41]. The important difference between that application and our own is that it is not easy to measure arbitrary linear projections of the desired signal in rendering, while it is very simple to do so in light-transport acquisition through structured illumination. In other words, since computing the weighted sum of a set of samples is linearly more expensive than calculating the value of a single sample, our rendering framework based on compressive sensing had to be built around random point sampling. This will become clearer as we give a brief introduction to the theory of compressed sensing in the next section.
3 Compressed Sensing Theory
In this section, we summarize some of the key theoretical results of compressed sensing in order to explain our compressive rendering framework. A summary of the notation we use in this paper is shown in Table 1. Readers are referred to the key papers of Candès et al. [42] and Donoho [43], as well as the extensive CS literature available through the Rice University online repository [44], for a more comprehensive review of the subject.

3.1 Theoretical Background
The theory of compressed sensing allows us to reconstruct a signal from only a few samples if it is sparse in a transform domain. To see how, suppose that we have an n-dimensional signal f ∈ ℝⁿ that we are trying to estimate with k random point samples, where k ≪ n. We can write this sampling process with the linear sampling equation y = Sf, where S is a k × n sampling matrix that contains a single "1" in each row and no more than a single "1" in each column (to represent each point-sampled value), with zeros for all the remaining elements. This maps well to our rendering application, where the n-pixel image we want to render (f) is estimated from only k pixel samples (y). Initially, it seems that perfect estimation of f from y is impossible, given that there are (n − k) pixels we did not observe that could have any value (the problem is ill-posed). This is where we use the key assumption of compressed sensing: we assume that the image f is sparse in some transform domain, f̂ = Ψ⁻¹f. Mathematically, the signal f̂ is m-sparse if it has at most m nonzero coefficients (where m ≪ n), which can be written in terms of the ℓ0 norm (which effectively "counts" the number of nonzero elements): ‖f̂‖₀ ≤ m. This is not an unreasonable assumption for real-world signals such as images, since
Table 1. Notation used in this paper

n   size of final signal
k   number of evaluated samples
m   number of nonzero coefficients in transform domain
f   high-resolution final signal, represented by an n × 1 vector
f̂   transform of the signal, represented by an n × 1, m-sparse vector
y   k × 1 vector of samples of f computed by the ray tracer
S   k × n sampling matrix of the ray tracer, s.t. y = Sf
Ψ   n × n "synthesis" matrix, s.t. f = Ψf̂, and its associated inverse Ψ⁻¹
A   k × n "measurement" matrix, A = SΨ
this fact is exploited in transform-coding compression systems such as JPEG and MPEG. The basic idea of compressed sensing is that through this assumption, we are able to eliminate the many images in the (n − k)-dimensional subspace that do not have sparse representations. To formulate the problem within the compressed sensing framework, we substitute our transform-domain signal f̂ into our sampling equation:

y = Sf = SΨf̂ = Af̂,    (2)

where A = SΨ is a k × n measurement matrix that includes both the sampling and compression bases. If we could solve this linear system correctly for f̂ given y, we could then recover the desired f by taking the inverse transform. Unfortunately, solving for f̂ is difficult to do with traditional techniques such as least-squares because the system is severely underdetermined (k ≪ n). However, one of the key results in compressed sensing demonstrates that if k ≥ 2m and the Restricted Isometry Condition (RIC) is met (Sec. 3.4), then we can solve for f̂ uniquely by searching for the sparsest f̂ that solves the equation. A proof of this remarkable conclusion can be found in the paper by Candès et al. [42]. Therefore, we can pose the problem of computing the transform of the final rendered image from a small set of samples as the solution of the ℓ0 optimization problem:

min ‖f̂‖₀  s.t.  y = Af̂.    (3)
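As a concrete toy illustration of this setup (not from the paper: the sizes, random seed, and the orthonormal DCT stand-in for Ψ are our own choices), the point-sampling model of Eq. 2 can be written in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 64, 16, 3   # signal length, number of point samples, sparsity

# Synthesis matrix Psi: columns are orthonormal DCT-II basis vectors,
# standing in for the compression basis of the paper.
i = np.arange(n)
Psi = np.cos(np.pi * (i[:, None] + 0.5) * i[None, :] / n) * np.sqrt(2.0 / n)
Psi[:, 0] /= np.sqrt(2.0)   # rescale first column so Psi is orthonormal

# An m-sparse transform-domain signal f_hat and its spatial version f = Psi f_hat
f_hat = np.zeros(n)
f_hat[rng.choice(n, size=m, replace=False)] = 10.0 * rng.normal(size=m)
f = Psi @ f_hat

# Point-sampling matrix S: a single "1" per row, at most one per column
rows = rng.choice(n, size=k, replace=False)
S = np.zeros((k, n))
S[np.arange(k), rows] = 1.0

y = S @ f        # the k measured samples
A = S @ Psi      # measurement matrix of Eq. 2; y = S f = S Psi f_hat = A f_hat
```

The system y = Af̂ has k = 16 equations and n = 64 unknowns, so least-squares alone cannot pin down f̂; this is exactly where the sparsity prior of Eq. 3 enters.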
Unfortunately, algorithms to solve Eq. 3 exactly are NP-hard [45] because they involve a combinatorial search over all m-sparse vectors f̂ to find the sparsest one that meets the constraint. Fortunately, the CS research community has developed fast algorithms that find approximate solutions to this problem. In this paper we use solvers such as ROMP and SpaRSA to compute the coefficients of the signal f̂ within the context of compressed sensing. We give an overview of these algorithms in the following two sections.

3.2 Overview of ROMP Algorithm
Since the solution of the ℓ0 problem in Eq. 3 requires a brute-force combinatorial search over all the f̂ vectors with sparsity less than m, the CS research community has been developing fast, greedy algorithms that find approximate solutions
Algorithm 1. ROMP algorithm
Input: measured vector y, matrices A and A†, target sparsity m
Output: the vector f̂, which is an m-sparse solution of y = Af̂
Initialize: I = ∅ and r = y
 1: while r ≠ 0 and sparsity not met do
 2:   u ⇐ A†r                        /* multiply residual by A† to approximate the larger coeffs of f̂ */
 3:   J ⇐ coefficients of u sorted in non-increasing order
 4:   J₀ ⇐ contiguous set of coefficients in J with maximal energy
 5:   I ⇐ I ∪ J₀                     /* add new indices to overall set */
 6:   /* find the vector supported on I that best matches the measurements */
 7:   f̂_new ⇐ argmin_{z : supp(z)=I} ‖y − Az‖₂
 8:   r ⇐ y − Af̂_new                 /* recompute residual */
 9: end while
10: return f̂_new
to the ℓ0 problem. One example is Orthogonal Matching Pursuit (OMP) [46], which iteratively attempts to find the nonzero elements of f̂ one at a time. To do this, OMP is given the measured vector y and measurement matrix A as input, and it finds the coefficient of f̂ with the largest magnitude by projecting y onto each column of A through the inner product ⟨y, aⱼ⟩ (where aⱼ is the j-th column of A) and selecting the largest. After the largest coefficient has been identified, we assume that it is the only nonzero coefficient in f̂ and approximate its value by solving the system y = Af̂ using least-squares. The new estimate for f̂ with a single nonzero coefficient is then used to compute the estimated signal f, which is subtracted from the original measurements to get a residual. The algorithm then iterates, using the residual to solve for the next largest coefficient of f̂, and so on, until an m-sparse approximation of the transform-domain vector is found. Despite its simplicity, OMP has a weaker guarantee of exact recovery than the ℓ1 methods [47]. For this reason, Needell and Vershynin proposed a modification to OMP called Regularized Orthogonal Matching Pursuit (ROMP), which recovers multiple coefficients in each iteration, thereby accelerating the algorithm and making it more robust to meeting the RIC. Essentially, ROMP approximates the largest-magnitude coefficients of f̂ in a similar way to OMP, by projecting y onto each column of A, and sorts them in decreasing order. It then finds all of the contiguous sets of coefficients in this list whose largest coefficient is at most twice as big as the smallest member, and selects the set with the maximal energy. These indices are added to a list maintained by the algorithm that keeps track of the nonzero coefficients of f̂, and the values of those coefficients are computed by solving y = Af̂ through least-squares, assuming that these are the only nonzero coefficients.
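The basic OMP loop just described can be sketched as follows; this is our own minimal implementation for illustration (ROMP replaces the single-index selection with the regularized multi-index selection described above), not the authors' code:

```python
import numpy as np

def omp(A, y, m, tol=1e-10):
    """Greedy OMP: find an (at most) m-sparse f_hat with y ~= A f_hat."""
    n = A.shape[1]
    support = []
    f_hat = np.zeros(n)
    r = y.astype(float).copy()
    for _ in range(m):
        # project the residual onto every column and pick the most correlated one
        j = int(np.argmax(np.abs(A.T @ r)))
        if j not in support:
            support.append(j)
        # least-squares fit restricted to the current support
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        f_hat = np.zeros(n)
        f_hat[support] = coef
        r = y - A @ f_hat          # recompute the residual
        if np.linalg.norm(r) < tol:
            break
    return f_hat
```

Each iteration adds one index to the support and refits all selected coefficients jointly, which is what distinguishes matching-pursuit variants from a simple one-shot projection.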
As in OMP, the new estimate for f̂ is then used to compute the estimated signal f, which is subtracted from y to get a residual. The algorithm continues to iterate, using the residual as the input and solving for the next largest set of coefficients of f̂, until an m-sparse approximation of the transform-domain vector is found, an error criterion is met, or the number of
iterations exceeds a certain limit without convergence. Although the ℓ1 problem requires N = O(m log k) samples to be solved uniquely, in practice we find that the ROMP algorithm requires around N = 5m samples to start locking in to the correct solution and N = 10m to work extremely robustly. Since we use the ROMP algorithm in both of our 2D signal reconstruction applications (see Secs. 5 and 6), we provide a pseudocode description for reference in Alg. 1.

3.3 Overview of SpaRSA Algorithm
Another algorithm we use in this paper is known as SpaRSA. One of the key results of recent compressed sensing theory is that the problem of Eq. 3 can be reframed as an ℓ1 problem instead, where the ℓ1 norm is defined as the sum of the absolute values of the elements of the vector (‖v‖₁ = Σᵢ₌₁ᵏ |vᵢ|):

min ‖f̂‖₁  s.t.  y = Af̂.    (4)
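Practical solvers for ℓ1 problems of this form are often built on iterative shrinkage. As a rough illustration, here is our own sketch of plain ISTA, a simpler relative of the SpaRSA algorithm used in this paper; the step-size rule and the value of τ are illustrative assumptions:

```python
import numpy as np

def ista(A, y, tau=0.01, iters=5000):
    """Minimize 0.5*||y - A f||_2^2 + tau*||f||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz const. of the gradient
    f = np.zeros(A.shape[1])
    for _ in range(iters):
        z = f - step * (A.T @ (A @ f - y))   # gradient step on the quadratic term
        f = np.sign(z) * np.maximum(np.abs(z) - step * tau, 0.0)   # soft-threshold
    return f
```

Each iteration costs two matrix-vector products, and the soft-threshold step is what drives most coefficients exactly to zero, producing the sparse solutions the ℓ1 relaxation asks for.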
Candès et al. [42] showed that this equation has the same solution as Eq. 3 if A satisfies the RIC and the number of samples N = O(m log k), where m is the sparsity of the signal. This fundamental result spurred the flurry of research in compressed sensing, because it demonstrated that these problems could be solved by tractable algorithms such as linear programming. Unfortunately, it is still difficult to solve the ℓ1 problem directly, so researchers in applied mathematics have been working on novel algorithms to provide fast solutions. One successful avenue of research is to reformulate Eq. 4 into what is known as the ℓ2–ℓ1 problem:

min_f̂ ½ ‖y − Af̂‖₂² + τ ‖f̂‖₁.    (5)

In this formulation, the first term enforces the fit of the solution to the measured values y, while the second term looks for the smallest ℓ1 solution (and hence the sparsest solution). The parameter τ balances the optimization between the two objectives. Recently, Wright et al. proposed a novel solution to Eq. 5 that solves a simple iterative subproblem with an algorithm they call Sparse Reconstruction by Separable Approximation (SpaRSA) [48]. In this work, we use SpaRSA to reconstruct scene signals f that are 3D and larger. Unfortunately, even a simple explanation of SpaRSA is beyond the scope of this paper; interested readers are referred to the technical report associated with our EGSR paper [7] for more information.

3.4 Restricted Isometry Condition (RIC)
It is impossible to solve y = Af̂ for an arbitrary A when k ≪ n because the system is severely underdetermined. However, compressed sensing can be used to solve uniquely for f̂ if the matrix A meets the Restricted Isometry Condition (RIC):

(1 − ε)‖v‖₂ ≤ ‖Av‖₂ ≤ (1 + ε)‖v‖₂,    (6)
with parameters (z, ε), where ε ∈ (0, 1), for all z-sparse vectors v [47]. Effectively, the RIC states that in a valid measurement matrix A, every possible set of z columns of A forms an approximately orthogonal set. Another way to say this is that the sampling and compression bases S and Ψ that make up A must be incoherent. Examples of matrices that have been proven to meet the RIC include Gaussian matrices (with entries drawn from a normal distribution), Bernoulli matrices (binary matrices drawn from a Bernoulli distribution), and partial Fourier matrices (randomly selected Fourier basis functions) [49]. In this paper, we can use point-sampled Fourier bases (partial Fourier matrices) for all of the applications with a scene function of 3D or higher, but for the 2D cases where we are reconstructing an image, the Fourier basis does not provide enough compression (as shown in Fig. 1) and cannot be used. In these cases, we would like to use the wavelet transform, which does provide enough compression, but unfortunately a point-sampled wavelet basis does not meet the RIC. We discuss our modifications to the wavelet basis that improve this and allow our framework to be used for the reconstruction of 2D scene functions in Sec. 5. With this theoretical background in place, we can now give an overview of our algorithm and a simple 1D example.
4 Algorithm Overview
The basic idea of the proposed rendering framework is quite simple. We use a distributed ray tracing system that takes a small set of point samples of the multidimensional scene function f(). In traditional Monte Carlo, these samples would be added together to estimate the integral of f(), but in our case we assume that the signal f() is sparse in a transform domain and use the compressed sensing theory described in the previous section to estimate a discrete reconstruction of f(). This reconstructed version is then integrated down to form our final image. Since the CS solvers operate on discrete vectors and matrices, we first approximate the unknown function f() with a discrete vector f of size n by taking uniform samples of f(). This approximation is reasonable as long as n is large enough, since it is equivalent to discretizing f(). For example, if we assume that the signal f() is sparse in the Fourier basis composed of 2π-periodic basis functions, then the signal f() must also be 2π-periodic. Therefore, the samples that form f must cover the 2π interval of f() to ensure periodicity and therefore maintain the sparsity in the Discrete Fourier Transform domain. In this case, for example, the i-th component of f is given by fᵢ = f(2π(i − 1)/n). By sampling the function f() in this manner when discretizing it, we guarantee that f will also be sparse in the transform domain Ψ, where the columns of Ψ are discrete versions of the basis functions ψᵢ from the equations above. Note that we do not explicitly sample f() to create f (since we do not know f() a priori); rather, we assume that an n-length vector f exists which is the discrete version of the unknown f() and which we will solve for through CS. We can now take our k random measurements of f, as given by y = Sf where S is the k × n sampling matrix, by point-sampling the original function f()
Fig. 2. Results of the simple example of Sec. 4.1. (a) Plot of the signal f(x) which we want to integrate over the interval shown. (b) Magnitude of the Fourier Transform of f(x). Since 3 cosines of different frequencies are added together to form f(x), its Fourier Transform has six different spikes because of symmetry (so the sparsity m = 6). (c) Linear and (d) log plots of variance as a function of the number of samples to show the convergence of the four different integration algorithms. The faint dashed lines in the "random" and "stratified" curves (visible in the pdf) show the theoretical variance, which matches the experimental results. The log plot clearly shows the "waterfall" curve characteristic of compressed sensing reconstruction: once an adequate number of samples is taken, the signal is reconstructed perfectly every time. In this case, we need around 70 samples, which is roughly 10× the number of spikes in (b). (e – h) Plots of the estimated value of the integral vs. the number of samples for one run with the different integration algorithms. The correct value of the integral (100) is shown as a gray line. We can see that compressed sensing begins to approximate the correct solution around k = 5m = 30 samples and then "snaps" to the right answer for k > 10m = 60, much more quickly than the other approaches.
at the appropriate discrete locations. Therefore, unlike traditional Monte Carlo approaches, our random samples do not occur arbitrarily along the continuous domain of f(), but rather at a set of discrete locations that represent the samples of f. Once the set of samples that forms the measurement vector y has been taken, we use compressed sensing to solve for the coefficients that correspond to the nonzero basis functions of f̂. We can then take the inverse transform f = Ψf̂ and integrate it down to get our final image. To help explain how our algorithm works, we now look at a simple 1D example.

4.1 Example of 1D Signal Reconstruction
At first glance, it might seem that what we are proposing is merely another kind of parametric fit, somehow fitting a function to our samples to
approximate f(). Although we are fitting a function to the measured data, compressed sensing offers us a fundamentally different way to do this than the traditional methods used for parametric fitting, such as least-squares. We can see the difference with a simple 1D example. Suppose we want to compute the definite integral from −¼ to ¼ of the following function f(x), which is unknown to us a priori but is shown in Fig. 2(a):

f(x) = α[aπ cos(2πax)] + β[bπ cos(2πbx)] + γ[cπ cos(2πcx)].

The signal is made up of three pure frequencies, and in this experiment we set a = 1, b = 33, and c = 101 so that there is a reasonable range of frequencies represented in the signal. This particular function is constructed so that the analytic integral is easy to compute; the integral of each of the terms in square brackets over the specified interval is equal to 1. This means that in this case, the desired definite integral is

I = ∫_{−1/4}^{1/4} f(x) dx = α + β + γ.
In this experiment, we set α = 70, β = 20, and γ = 10, so that the desired integral is I = 100. Within the context of our rendering problem, we assume that we do not know f(x) in analytical form, so our goal is to compute the integral value of 100 simply from a set of random point samples of f(x). The most common way to do this in computer graphics is Monte Carlo integration, which takes k uniformly-distributed, random samples over the entire interval [−¼, ¼] and uses them to estimate the integral. Although the answer fluctuates based on the position of our measurements, as we add more and more samples the estimator slowly converges to the correct answer, as can be seen in the variance curves of Fig. 2(c, d) and the results of a single run while varying the number of samples in (e). We can compute the theoretical variance of the random Monte Carlo approach analytically, shown in Fig. 2(c, d) as a thin, dashed line; the calculated variance matches well with the experimental results. The slow 1/k decay of variance with random Monte Carlo is less than desirable for rendering applications, so a common variance-reduction technique is stratified sampling. Fig. 2(f) shows the result of one run with this method, and indeed we notice that the estimate of the integral gets closer to the correct solution (shown by a thin gray line) more quickly than with the random approach. The theoretical variance of the stratified approach can be computed in software by computing the variance of each of the strata for every size k. The resulting curve also matches the experimental results, even predicting a small dip in variance around k = 101. However, stratified sampling still takes considerable time to converge, so it is worthwhile to examine other techniques that might be better.
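The Monte Carlo baseline of this experiment is straightforward to reproduce; a minimal sketch (the seed and sample count are our own choices):

```python
import numpy as np

# The test signal of Sec. 4.1; a, b, c are chosen so that each bracketed
# term integrates to exactly 1 over [-1/4, 1/4].
a, b, c = 1, 33, 101
alpha, beta, gamma = 70.0, 20.0, 10.0

def f(x):
    return (alpha * a * np.pi * np.cos(2 * np.pi * a * x)
            + beta * b * np.pi * np.cos(2 * np.pi * b * x)
            + gamma * c * np.pi * np.cos(2 * np.pi * c * x))

# Plain Monte Carlo: average of f at uniform random points, times interval length
rng = np.random.default_rng(1)
x = rng.uniform(-0.25, 0.25, 500_000)
I_mc = 0.5 * f(x).mean()   # fluctuates around the true value I = 100
```

With this many samples the estimate lands close to 100, but the slow 1/k variance decay is visible if the sample count is reduced.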
Another way we might consider computing the integral of this function from the random samples is to fit a parametric model to the measurements and then perform the integral on the model itself. Indeed, both our approach and the parametric-fit approach require us to know something about the signal (e.g.,
that it can be compactly represented with sinusoids). However, closer observation reveals that our framework is based on fundamentally different theory than typical methods for fitting parametric models, and so it yields a considerably different result. To see why, let us work through the process of actually fitting a parametric sinusoidal model to our measured samples. Typically, this involves solving a least-squares problem, which in this case means solving y = Af̂ for the coefficients of f̂, where A = SΨ is the sampling matrix multiplied by the Fourier basis. Because we are solving the problem with least-squares, we need a "thin" matrix (or at least a square one) for A, which means that the number of unknown coefficients in f̂ can be at most k, matching the number of observations in y. Since the Fourier transform of f has two sets of complex-conjugate elements in f̂, the highest frequency we can solve for uniquely in this manner is at most k/2. This traditional approach is closely related to the Nyquist-Shannon sampling theorem [50], well-known in computer graphics, which states that to correctly reconstruct a signal we must sample at more than twice its highest frequency. Indeed, trying to fit a parametric model in this traditional manner means that we only get correct convergence of the integral when we have more samples than twice the highest frequency c = 101, or around k = 202. As can be seen in Fig. 2(g), the estimated value of the integral bounces around with fairly high variance until this point and then locks down to I = 100 when we start having enough samples to fit the sinusoids correctly. However, this process is not scalable, since it is dependent on the highest frequency of the signal. If we set c = 1,000, we would need ten times more samples to converge correctly. Our compressed sensing approach, on the other hand, has the useful property that its behavior is independent of the highest frequency.
In another approach, we might consider solving the parametric-fit problem using a "fat" A matrix, so that the number of unknown elements in f̂, which is n, can be much bigger than k. Perhaps this will allow us to solve for higher-frequency sinusoids even though we do not have enough samples. In this case the problem is underdetermined and there can be many solutions with the same squared error. Traditionally, these kinds of problems are solved with a least-norm algorithm, which finds the least-squares solution that also has the least norm in the ℓ2 sense (where the ℓ2 norm is the square root of the sum of the squares of the components). In our case the least-norm solution is given by f̂_ln = Aᵀ(AAᵀ)⁻¹y. Unfortunately, this works even worse than the least-squares fit for our example. It turns out that there are many "junk" signals that have a lower ℓ2 norm than the true answer (because they contain lots of small values in their frequency coefficients instead of a few large ones), yet still match the measured values in the least-squares sense. Although these traditional methods for fitting parametric models are commonly used in computer graphics, they do not work for this simple example because they are dependent on the frequency content of our signal. The theory of compressed sensing, on the other hand, offers us new possibilities, since it
states that the number of samples is dependent on the sparsity of the signal, not the particular frequency content it may have. In this example, we can apply CS to solve y = Af̂ and then integrate the signal as discussed in the previous section. The results shown in Fig. 2(h) show that we "snap" to the answer after about 70 samples, which makes sense since we have observed empirically that ROMP typically requires 10 times more samples than the sparsity m, which is 6 in this case. At this point, the estimated value of the integral is 100.0030, where the 0.0030 error is caused by the discretization of f as compared to f(x). This is the bias of our estimator, but it is quite small, especially considering the value of the integral we are computing. Finally, we note that CS is able to reconstruct the discrete signal f perfectly and consistently with k = 75, even though we have less than one sample per period of the highest frequency (c = 101) in the signal, well below the Nyquist limit. Before we finish this discussion, we should mention some of the implementation details used to acquire the results of Fig. 2. While the random Monte Carlo and stratified experiments sample the analytic version of f(x) directly to compute the integral, the parametric and CS approaches need to solve linear systems of equations and therefore use the discrete version f. For the CS experiments we set the size of f and f̂ to be n = 2¹² + 1, which means that the highest frequency we can solve for is 2048 Hz. This is not a limiting factor, however, since we can increase this value (e.g., our later experiments for antialiasing in Sec. 6 use n = 2²⁰). When measuring the variance curves for random Monte Carlo, stratified sampling, and the parametric fit, we ran 250 trials for each k to reduce their noisiness. The compressed sensing reconstruction was more consistent, so we used only 75 trials for each k to compute its variance.
Note that the variance of the CS approach is down to the 10⁻²⁴ range when it has more than 70 samples, which after 75 independent trials means that the value of the integral is rock solid and stable at this point. This simple example might help motivate our approach in theory, but we need to validate our framework by using it to solve a real-world problem in computer graphics. We spend the rest of the chapter discussing the application of this framework to a set of different problems in rendering.
5 Application to 2D Signals – Image Reconstruction
In our first example, we begin by looking at the problem of image reconstruction from a subset of pixels. The basic idea is to accelerate the rendering process by computing only a subset of the pixels and then reconstructing the missing pixels from the ones we measured. We use compressed sensing to do this by looking for the pixel values that would create the sparsest signal possible in a transform domain. In this case, since we are working with 2D images, the suitable compression basis would be wavelets. Unfortunately, although wavelets are very good at compressing image data, they are incompatible with the point-sampling basis of our rendering system because they are not incoherent with point samples, as required
by the Restricted Isometry Condition (RIC) of compressed sensing. To see why, we note that the coherence between a general sampling basis Ω and a compression basis Ψ can be found by taking the maximum inner product between any two basis elements of the two:

μ(Ω, Ψ) = √n · max_{1≤j,k≤n} |⟨ωⱼ, ψₖ⟩|.    (7)

Because the matrices are orthonormal, the resulting coherence lies in the range μ(Ω, Ψ) ∈ [1, √n] [51], with a fully incoherent pair having a coherence of 1. This is the case for the point-sampled Fourier transform, which is ideal for compressed sensing but unfortunately is not suitable for our application because of its lack of compressibility for 2D images, as shown in Fig. 1. If we use a wavelet as the compression basis (e.g., a 64² × 64² Daubechies-8 wavelet (DB8) matrix for n = 64²), the coherence with a point-sampled basis is 32, which is only half the maximum coherence possible (√n = 64). This large coherence makes wavelets unsuitable to be used as-is in the compressed sensing framework. In order to reduce coherence yet still exploit the wavelet transform, we propose a modification to Eq. 2. Specifically, we assume that there exists a blurred image f_b which can be sharpened to form the original image: f = Φ⁻¹f_b, where Φ⁻¹ is a sharpening filter. We can now write the sampling process as y = Sf = SΦ⁻¹f_b. Since the blurred image f_b is also sparse in the wavelet domain, we can incorporate the wavelet compression basis in the same way as before and get y = SΦ⁻¹Ψf̂_b. We can now solve for the sparsest f̂_b:

min ‖f̂_b‖₀  s.t.  y = Af̂_b,
(8)
where $A = S\Phi^{-1}\Psi$, using greedy algorithms such as OMP or ROMP. Once $\hat{f}_b$ has been found, we can compute our final image by taking the inverse wavelet transform and sharpening the result: $f = \Phi^{-1}\Psi\hat{f}_b$. In this work, our filter $\Phi$ is a Gaussian filter, and since we can represent the filtering process as multiplication in the frequency domain, we write $\Phi = F^H G F$, where $F$ is the Fourier transform matrix and $G$ is a diagonal matrix with the values of a Gaussian function along its diagonal. Substituting this into Eq. 8, we get:

$$\min \|\hat{f}_b\|_0 \quad \text{s.t.} \quad y = S F^H G^{-1} F \Psi \hat{f}_b. \tag{9}$$
We observe that $G^{-1}$ is also a diagonal matrix, with the values $G^{-1}_{i,i}$ along its diagonal. However, we must be careful when inverting the Gaussian function because it is prone to noise amplification. To avoid this problem, we use a linear Wiener filter to invert the Gaussian [52], which means that the diagonal elements of our inverse matrix $G^{-1}$ have the form $G^{-1}_{i,i} = G_{i,i}/(G_{i,i}^2 + \lambda)$. Since the greedy algorithms (such as ROMP) we use to solve Eq. 3 require a "backward" matrix $A^\dagger$ that "undoes" the effect of $A$ (i.e., $A^\dagger A v \approx v$), where $A = S\Phi^{-1}\Psi$, we use a backward matrix of the form $A^\dagger = \Psi^{-1}\Phi S^T = \Psi^{-1} F^H G F S^T$.
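As a sketch of why the Wiener form is preferred over a naive inverse, consider a toy 1D frequency response (the Gaussian shape and parameter values here are our own illustration, not the chapter's):

```python
import math

def gaussian(i, n, sigma2):
    """Toy 1D Gaussian attenuation in the frequency domain (stand-in for G)."""
    return math.exp(-((i / n) ** 2) / (2 * sigma2))

def wiener_inverse(g, lam):
    """Regularized inverse of a filter coefficient: g / (g^2 + lambda).
    For g >> sqrt(lambda) this approaches 1/g; as g -> 0 it stays bounded
    (peaking at 1/(2*sqrt(lambda))), avoiding noise amplification."""
    return g / (g * g + lam)

n, sigma2, lam = 64, 0.05, 0.01
naive = [1.0 / gaussian(i, n, sigma2) for i in range(n)]
wiener = [wiener_inverse(gaussian(i, n, sigma2), lam) for i in range(n)]
print(max(naive), max(wiener))  # the naive inverse blows up; Wiener stays bounded
```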
[Figure 3 plot: coherence (y-axis, 15–35) vs. variance of the Gaussian filter in the frequency domain (x-axis, 0–1200).]
Fig. 3. Coherence vs. variance of the Gaussian matrix G. Since G is in the frequency domain, a larger variance means a smaller spatial filter. As the variance grows, the coherence converges to 32, the coherence of the point-sampled, $64^2 \times 64^2$ DB8 matrix. The coherence should be as small as possible, which suggests a small variance for our Gaussian filter in the frequency domain. However, this results in a blurrier image $f_b$ which is harder to reconstruct accurately. The optimal values for the variance were determined empirically and are shown in Table 2 for different sampling rates.
Note that for real image sizes, the matrix A will be too large to store in memory. For example, to render a 1024 × 1024 image with a 50% sampling rate, our measurement matrix A would have $k \times n = 5.5 \times 10^{11}$ elements. Therefore, our implementation must use a functional representation for A that can compute the required products, such as $A\hat{f}_b$, on the fly as needed. The addition of the sharpening filter means that our measurement matrix is composed of two parts: the point samples S and a "blurred wavelet" matrix $\Phi^{-1}\Psi$ which acts as the compression basis. This new compression basis can be thought of as either blurring the image and then taking the wavelet transform, or applying a "filtered wavelet" transform to the original image. To see how this filter reduces coherence, we plot the result of Eq. 7 as a function of the variance $\sigma^2$ of the Gaussian function in G in Fig. 3 for our $64^2 \times 64^2$ example. Note that the Gaussian G is in the frequency domain, so as the variance gets larger the filter turns into a delta function in the spatial domain and the coherence approaches 32, the value of the unfiltered coherence. As we reduce the variance of G, the filter gets wider in the spatial domain and the coherence is reduced by almost a factor of 2. Although it would seem that the variance of G should be as small as possible (lowering coherence), this increases the amount of blur in $f_b$ and hence the noise in our final result due to the inversion of the filter. We determined the optimal variances empirically on a single test scene and used the same values for all our experiments (see Table 2). In the end, the reduction of coherence by a factor of 2 through the application of the blur filter was enough to yield good results with compressed sensing.
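The functional (matrix-free) representation of A described above can be sketched as a composition of functions, so no matrix is ever stored. This is our own simplified illustration: a one-level Haar transform and a circular sharpening stencil stand in for the chapter's actual wavelet and Fourier-domain Wiener filter:

```python
import math

def inv_haar(c):
    # One-level inverse Haar transform: wavelet coefficients -> signal (Psi).
    h, s = len(c) // 2, 1 / math.sqrt(2)
    out = []
    for a, d in zip(c[:h], c[h:]):
        out += [(a + d) * s, (a - d) * s]
    return out

def sharpen(x):
    # Stand-in for Phi^{-1}: a simple circular unsharp mask.
    n = len(x)
    return [1.5 * x[i] - 0.25 * (x[i - 1] + x[(i + 1) % n]) for i in range(n)]

def apply_A(c, sample_idx):
    # y = S Phi^{-1} Psi c, evaluated on the fly -- no matrix is stored.
    f = sharpen(inv_haar(c))
    return [f[i] for i in sample_idx]

y = apply_A([1.0, 0.5, 0.0, 0.25], [0, 2])  # measure two of the four pixels
```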
Table 2. Parameters for the Gaussian (1/σ²) and Wiener (λ) filters. We iterated over the parameters of the Gaussian filter to find the ones that yielded the best reconstruction of the ROBOTS scene at the given sampling rates (%) for 1024 × 1024 reconstruction.

  %     1/σ²        λ          %     1/σ²        λ
  6%    0.000130    0.089      53%   0.000015    0.259
  13%   0.000065    0.109      60%   0.000014    0.289
  25%   0.000043    0.209      72%   0.000013    0.289
  33%   0.000030    0.209      81%   0.000011    0.299
  43%   0.0000225   0.234      91%   0.000008    0.399
To test our algorithm, we integrated our compressive rendering framework into both an academic ray tracing system (PBRT [24]) and a high-end, open-source ray tracer (LuxRender [53]). The integration of both was straightforward, since we only had to control the pixels being rendered (to compute only a fraction of the pixels) and then add a reconstruction module that performed the ROMP algorithm. In order to select the random pixels to measure, we used the boundary sampling method of the Poisson-disk implementation from Dunbar and Humphreys [54] to space out the samples in image-space. For LuxRender, for example, these positions were provided to the ray tracing system through the PixelSampler class. The "low discrepancy" sampler was used to ensure that samples were only made in the pixels selected. After the ray tracer evaluated the samples, the measurements were recorded into a data structure that was fed into the ROMP solver. The rest of the ray tracing code was left untouched. The ROMP solver was based on the code by Needell et al. [47] available on their website, but rewritten in C++ for higher performance. We leverage the Intel Math Kernel Library 10.1 (MKL) [55] to accelerate linear algebra computations and to perform the Fast Fourier Transform for our Gaussian filter. In addition, we use the Stanford LSQR solver [56] to solve the least-squares step at the end of ROMP. The advantage of LSQR is that it is functional-based, so we do not need to represent the entire A matrix in memory, since it can get quite large as mentioned earlier. To describe the implementation of the functional version of the measurement matrix A, we first recall that $A = S F^H G^{-1} F \Psi$ from Eq. 9. The inverse wavelet transform $\Psi$ was computed using the lifting algorithm [11], and the MKL library was used to compute the Fourier and inverse Fourier transforms of the signal. To apply the filter, we simply weighted the coefficients by the Gaussian function described in the algorithm.
After applying the inverse Fourier transform to the ﬁltered signal, we then simply take the samples from the desired positions. This gives us a way to simulate the eﬀect of matrix A in our ROMP algorithm without explicitly specifying the entire matrix. In addition, we found empirically that ROMP behaved better when the maximum number of coeﬃcients added in each iteration was bounded. For the experiments in this chapter, we used a bound of 2k/i, where k is the number of pixels measured and i is the maximum number of ROMP iterations which we set to 30.
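The flavor of the greedy solver and the per-iteration coefficient cap can be sketched as follows. This is a much-simplified matching pursuit in the spirit of ROMP, not the actual implementation; the atoms and the 1-sparse signal are made up for illustration:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_pursuit(cols, y, iters, cap):
    """Per iteration, add at most `cap` atoms (cf. the 2k/i bound discussed
    above) chosen by their correlation with the residual, and peel off their
    projections from the residual."""
    r, coeffs = list(y), {}
    for _ in range(iters):
        # Rank atoms by |<a_j, r>| and keep the `cap` strongest.
        ranked = sorted(range(len(cols)), key=lambda j: -abs(dot(cols[j], r)))
        for j in ranked[:cap]:
            c = dot(cols[j], r) / dot(cols[j], cols[j])
            coeffs[j] = coeffs.get(j, 0.0) + c
            r = [ri - c * a for ri, a in zip(r, cols[j])]
    return coeffs, r

cols = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # unit-norm atoms
y = [1.2, 1.6]                                 # = 2 * third atom (1-sparse)
coeffs, r = greedy_pursuit(cols, y, iters=3, cap=1)
print(coeffs[2])  # recovers the coefficient 2.0 of the third atom
```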
Table 3. Timing results of our algorithm, in minutes. Preprocess includes loading the models and creating the acceleration data structures. Full Render is the time to sufficiently sample every pixel to generate the ground-truth image. CS Recon is the time it took our reconstruction algorithm to solve for a 1024 × 1024 image with 75% of samples. The last column shows the percentage of extra rays that could be traced in that time instead of using our approach; because our CS reconstruction is fast, this number is fairly small. We ignore post-processing effects because these are on the order of seconds and are negligible. We also do not include the cost of the interpolation algorithm, since it took around 10 seconds to triangulate and interpolate the samples.

  Scene    Preprocess   Full Render   CS Recon   %
  Robots   0.25         611           11.9       1.9%
  Watch    0.28         903           13.5       1.4%
  Sponza   47           634           12.0       1.9%
After the renderer finishes computing the samples, ROMP operates on the input vector y. We set the target sparsity to one fifth of the number of samples k, which has been observed to work well in the CS literature [57]. ROMP uses the Gaussian filter of Eq. 9 as part of the reconstruction process. The parameters of the Gaussian filter were set through iterative experiments on a single scene, but once set they were used for all the scenes in this chapter. Table 2 shows the actual values used in our experiments. If a sampling rate is used that is not in the table, the nearest entries are interpolated. The compression basis used in this work is the Cohen–Daubechies–Feauveau (CDF) 9/7 biorthogonal wavelet, which is particularly well-suited for image compression and is the wavelet used in JPEG 2000 [58]. Since we are dealing with color images, the image signal must be reconstructed in all three channels: R, G, and B. To accelerate reconstruction, we transform the color to YUV space, use the compressive rendering framework for only the Y channel, and use Delaunay interpolation (described below) for the other two. The error introduced by doing this is not noticeable, as we are much more sensitive to the Y channel of an image than to the other two. To compare our results, we test our approach against a variety of other algorithms that might be used to fill in the missing pixel data in the renderer. For example, we compare against the popular inpainting method of Bertalmio et al. [34], using the implementation by Alper and Mavinkurve [59]. To compare against approaches from the non-uniform sampling community, we implemented the Marvasti algorithm [38] and used the MATLAB version of the ACT algorithm provided in the Nonuniform Sampling textbook [36]. Finally, rendering systems in practice typically use interpolation methods to estimate values between computed samples.
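The YUV optimization described above (reconstructing only luma with CS) can be sketched as follows. The chapter does not state which RGB-to-YUV convention is used; BT.601 luma weights are assumed here:

```python
def rgb_to_y(r, g, b):
    # BT.601 luma weights (an assumption; the chapter does not specify them).
    # Only this channel is reconstructed with CS; U and V are filled in by
    # Delaunay interpolation, since the eye is far more sensitive to
    # luminance detail than to chrominance.
    return 0.299 * r + 0.587 * g + 0.114 * b

print(rgb_to_y(1.0, 1.0, 1.0))  # white maps to Y = 1.0
```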
Unfortunately, many of the convolution-based methods that work so well for uniform sampling simply do not work when dealing with non-uniform sample reconstruction (see, e.g., the discussion by Mitchell [26]). For this work we implemented the piecewise-cubic multistage filter described by Mitchell [26], with the modification that we put back the original samples at every stage to improve performance. Finally, we also implemented the most common interpolation algorithm used in practice, which uses Delaunay
[Figure 4 plot: run time in seconds (log scale, 0.1–10,000) vs. image size n (log scale, 10²–10⁷), with curves for ROMP and n log(n).]
Fig. 4. Runtime complexity of our ROMP reconstruction algorithm, where n is the total size of our image (width × height). The curve of n log n is shown for comparison. We tested our algorithm on images of size 32×32 all the way through 2048×2048. Even at the larger sizes, the performance remained true to the expected behavior. Note that the complexity of the reconstruction algorithm is independent of scene complexity.
triangulation to mesh the samples and then evaluates the color of the missing pixels in between by interpolating over each triangle of the mesh, e.g., as described by Painter and Sloan [60]. This simple algorithm provides a piecewise-linear reconstruction of the image, which turned out to be one of the better reconstruction techniques. The different algorithms were all tested on a Dell Precision T3400 with a quad-core, 3.0 GHz Intel Core 2 Extreme QX6850 CPU with 4 GB RAM, capable of running 4 threads. The multithreading is used by LuxRender during pixel sampling and by the Intel MKL library when solving the ROMP algorithm during reconstruction. Since most of the reconstruction algorithms have border artifacts, we render a larger frame and crop out a margin around the edges. For example, the 1000 × 1000 images were rendered at 1024 × 1024 with a 12-pixel border.

5.1 Timing Performance
In order for the proposed framework to be useful, the CS reconstruction of step 2 has to be fast, taking less time than the alternative of simply brute-force rendering more pixels. Table 3 shows the timing results for various scenes rendered with LuxRender. We see that the CS step takes approximately 10 minutes for a 1024 × 1024 image with 75% samples with our unoptimized C++ implementation. Since the full-frame rendering times are on the order of 6 to 15 hours, the CS reconstruction constitutes less than 2% of the total rendering time, which means that in the time it takes to run our reconstruction algorithm, only 2% more pixels could have been computed. On the other hand, the inpainting
Table 4. MSE performance of the various algorithms (×10⁻⁴) for different scenes. All scenes were sampled with 60% of pixel samples.

  Scene    Interp   Cubic   ACT    Marvasti   Inpaint   CS
  Robots   2.00     4.94    1.98   2.11       4.24      1.72
  Watch    6.42     11.00   6.15   6.68       15.00     5.34
  Sponza   2.38     6.77    2.38   2.65       4.70      2.11
implementation we tested took approximately one hour to compute the missing pixels, making our ROMP reconstruction reasonably efficient by comparison. We also examined the runtime complexity of the ROMP reconstruction as a function of image size, to see how processing times would scale, for images from 32 × 32 to 2048 × 2048 (see Fig. 4). We can see that it behaves as O(n log n), as predicted by the model. For the image sizes we are dealing with (≤ 10⁷ pixels) this is certainly acceptable, given the improvement in image quality we get with our technique. Finally, we note that our algorithm runs in image-space, so it is completely independent of scene complexity. Rendering algorithms, on the other hand, scale as O(n), but the constants involved depend on scene complexity and have a significant impact on the rendering time. Over the past few decades, feature-film rendering times have remained fairly constant, as advances in hardware and algorithms are offset by increased scene complexity. Since our algorithm is independent of scene complexity, it will continue to be useful in the foreseeable future.

5.2 Image Quality
Standard measures for image quality are typically ℓ₂ distance measures. In this work, we use the mean squared error (MSE), assuming that the pixels in the image have a range of 0 to 1, and compare the reconstructed images from all the approaches to the ground-truth original. Table 4 shows the MSE for the various algorithms we tested: the first two are interpolation algorithms, followed by the two algorithms from the non-uniform sampling community, then the result of inpainting, and finally the result of the CS-based reconstruction proposed in this work. Our algorithm has the lowest MSE, something we observed in all our experiments. A few additional points are worth mentioning. First of all, we noticed that the inpainting algorithm performed fairly poorly in our experiments. The reason is that in our application the holes are randomly positioned, while these techniques require bands of known pixels around the hole (i.e., spatial locality). Unfortunately, this is not easy to arrange in a rendering system, since we cannot cluster the samples a priori without knowledge of the resulting image. Also disappointing was Mitchell's multistage cubic filter, which tended to overblur the image when we set the kernel large enough to bridge the larger holes in the image. Although the algorithms from the non-uniform sampling community (ACT and Marvasti) perform better, they are merely on par with the Delaunay interpolation used in rendering, which works remarkably well.
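Delaunay-based reconstruction evaluates each missing pixel by barycentric interpolation within its containing triangle; a minimal sketch (our own illustration, with made-up vertex positions and colors):

```python
def barycentric_interp(tri, vals, p):
    """Piecewise-linear (Delaunay-style) interpolation: the value at point p
    inside triangle `tri` is the barycentric blend of the vertex values."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    px, py = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    w2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1 * vals[0] + w2 * vals[1] + w3 * vals[2]

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
c = barycentric_interp(tri, [0.0, 1.0, 1.0], (0.25, 0.25))
print(c)  # 0.5: an even blend of the three vertex colors at this point
```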
[Figure 5 plots for the Robots and Watch scenes: MSE (log scale) vs. fraction of samples k/n from 0.3 to 0.9, comparing CS, Delaunay interpolation (D. int), and ACT.]
Fig. 5. Log error curves as a function of the number of samples for two test scenes, using our technique and the two best competing reconstruction algorithms. Our CS reconstruction beats both Delaunay interpolation (D. int) and ACT, requiring 5% to 10% fewer samples to achieve a given level of quality.
To see how our algorithm performs at different sampling rates, we compare it against the two best competing methods (Delaunay interpolation and ACT) for two of our scenes in Fig. 5. We observe that to achieve a given image quality, Delaunay interpolation and ACT require about 5% to 10% more samples than our approach. When the rendering time is 10 hours, this adds up to an hour of savings to achieve comparable quality. Furthermore, since our algorithm is completely independent of scene complexity, the benefit of our approach over interpolation becomes more significant as the rendering time increases. However, MSE is not the best indicator of visual quality, which is, after all, the most important criterion in high-end rendering. To compare the visual quality of our results, we refer the reader to Fig. 12 at the end of the chapter. We observe that compressive rendering performs much better than interpolation in regions with sharp edges or those that are slightly blurred, a good property for a rendering system. To see this, we direct the reader to the second inset of the robots scene in Fig. 12. Although the fine grooves in the robot's arm cannot be reconstructed faithfully by any of the other algorithms, compressed sensing is able to do so by selecting values for the missing pixel locations that yield a sparse wavelet representation. Another good example can be found in the third row of the watch scene. Here there is a pixel missing between two parts of the letter "E," which our algorithm alone reconstructs correctly. All other techniques simply interpolate between the samples on either side of the missing pixel and fill in this sample incorrectly. Because clean, straight lines are sparser than the jumbled noise estimated by the other approaches, they are selected by our technique.
Although the ACT algorithm performs reasonably well overall, it suffers from ringing when the number of missing samples is high and there is a sharp edge (see, e.g., the last inset of watch), because of the fitting of trigonometric polynomials to the point samples.
Fig. 6. Illustration of our antialiasing algorithm. (a) Original continuous signal f(x) to be antialiased over the 4 × 4 pixel grid shown. (b) In our approach, we first take k samples of the signal aligned on an underlying grid of fixed resolution n. This is equivalent to taking k random samples of the discrete signal f. (c) The measured samples form our vector of measurements y, with the unknown parts of f shown in green. Using ROMP, we solve $y = A\hat{f}$ for $\hat{f}$, where $A = S\Psi$; S is the sampling matrix corresponding to the samples taken, and $\Psi$ is the blurred-wavelet basis described in Sec. 5. (d) Our approximation to f, computed by applying the synthesis basis to $\hat{f}$ (i.e., $f = \Psi\hat{f}$). We integrate this approximation over each pixel to get our antialiased result.
6 Application to 2D Signals – Antialiasing
In this section, we present another example of 2D scene reconstruction by applying our framework to the problem of box-filtered antialiasing. The basic idea is simple (see the overview in Fig. 6). We first take a few random point samples of the scene function f() per pixel. Unlike the previous section, we are no longer dealing with pixels of the image but rather with samples on an underlying grid of higher resolution than the image, which matches the size of the unknown discrete function f and is aligned with its samples. We then use ROMP to approximate a solution to Eq. 3, which can then be used to calculate f. Once we have f, we can integrate it over each pixel to perform our antialiasing. The observation is that if f is sparse in the transform domain, we will need only a small set of samples to evaluate this integral accurately. Fig. 7 shows a visual comparison of our approach against stratified and random supersampling, which are also used for antialiasing images. This is very similar to our previous approach, except that we have now introduced the notion of applying integration to the reconstructed function to produce the final image.
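The integration step of Fig. 6(d) amounts to a box filter over each pixel's footprint on the reconstructed grid; a sketch with an illustrative 4 × 4 grid antialiased down to 2 × 2 pixels:

```python
def box_downsample(f, factor):
    """Integrate the reconstructed high-resolution signal f over each pixel
    (a box filter): average each `factor` x `factor` block."""
    n = len(f)
    out = []
    for py in range(0, n, factor):
        row = []
        for px in range(0, n, factor):
            block = [f[y][x] for y in range(py, py + factor)
                             for x in range(px, px + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# A 4x4 reconstructed grid antialiased down to 2x2 pixels.
f = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 0, 0]]
print(box_downsample(f, 2))  # [[0.0, 1.0], [1.0, 0.0]]
```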
7 Application to 3D Signals – Motion Blur
We now describe the application of our framework to the rendering of motion blur, which involves the reconstruction of a 3D scene. Motion blur occurs in dynamic scenes when the projected image changes as it is integrated over the time the camera aperture is open. Traditionally, Monte Carlo rendering systems emulate motion blur by randomly sampling rays over time and accumulating them together to estimate the integral [61]. Conceptually, our approach is very
[Figure 7 image panels, left to right: random samples, stratified sampling, our approach.]
Fig. 7. Visual comparison for Garden scene. Each row has a diﬀerent number of samples/pixel (from top to bottom: 1, 4).
similar to that of our antialiasing algorithm. We first take a set of samples of the scene y, except that now the measurements are also spaced out in time to sample the discrete spatiotemporal volume f, which represents a set of video frames over the time the aperture was open. We then use compressed sensing to reconstruct $\hat{f}$, the representation of the volume in the 3D Fourier transform domain $\Psi$. After applying the inverse transform to recover an approximation to the original f, we can then integrate it over time to achieve our desired result. An example of the computed motion blur is shown in Fig. 8.
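The final integration over time can be sketched as a per-pixel average of the reconstructed frames (shown here with a toy 1 × 2 image over two frames; the CS reconstruction itself is omitted):

```python
def motion_blur(volume):
    """Box integration over time: average the reconstructed frames of the
    spatiotemporal volume f(x, y, t) to emulate an open shutter."""
    n_frames = len(volume)
    h, w = len(volume[0]), len(volume[0][0])
    return [[sum(volume[t][y][x] for t in range(n_frames)) / n_frames
             for x in range(w)] for y in range(h)]

# A 1x2 "image" over 2 frames: a bright pixel moving one step to the right.
volume = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(motion_blur(volume))  # [[0.5, 0.5]]: the motion smears into a streak
```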
8 Application to 3D Signals – Video
An obvious extension to the motion blur application of the previous section is to view the individual frames of the reconstructed spatiotemporal volume directly, resulting in an algorithm to render animated sequences. This is a signiﬁcant advantage over the more complex adaptive approaches that have been proposed for rendering (e.g., MDAS [62] or AWR [63]) which are diﬃcult to extend to animated scenes because of their complexity. This is the reason that these previous approaches have dealt exclusively with the rendering of static imagery. To generate an animated sequence, these approaches render a set of static frames by evaluating each frame independently and do not take into account their temporal coherence. Our approach, on the other hand, uses compressed sensing to
[Figure 8 image panels, left to right: reference image, random sampling, our approach.]
Fig. 8. Visual comparison of motion blur results for the Train scene. The reference image was rendered with 70 temporal samples/pixel, while the other two were rendered with a single random sample per pixel in time. Our result was reconstructed assuming a spatiotemporal volume of 24 frames. Images were rendered at a resolution of 1000 × 1000.
evaluate a sparse version of the signal in x, y, and t in the 3D Fourier domain, so it computes the entire spatiotemporal volume, which we can view as frames of the video sequence. To demonstrate this, we show individual frames in Fig. 9 from the dynamic train scene of Fig. 8. For comparison, we implemented an optimized linear interpolation using a 3D Gaussian kernel, which is the best that convolution methods can do when the sampling rate is so low. We also compare against our earlier compressive rendering work [5], which reconstructs the set of static images individually using sparsity in the wavelet domain. The second column of Fig. 9 shows the missing samples in green, which results in an image that is almost entirely green when we have a 1% sampling rate (only 1 in 100 pixels in the spatiotemporal volume is calculated). It is remarkable that our algorithm can reconstruct a reasonable image even at this extremely low sampling density, while the other two approaches fail completely. This suggests that our technique could be useful for previsualization, since we get a reasonable image with 100× fewer samples. The reconstruction time is less than a minute per frame using the unoptimized C++ SpaRSA implementation, while the rendering time is 8 minutes per frame on an Intel Xeon 2.93 GHz CPU-based computer with 16 GB RAM. This means that the ground-truth reference 128-frame video would take over 17 hours to compute. On the other hand, using our reconstruction with only 1% of the samples, we get a video with frames like the one shown in Fig. 9 (right image) in less than 2 hours.
9 Application to 4D Signals – Depth of Field
We now show the application of our framework to a 4D scene function to demonstrate the rendering of depth of field. Monte Carlo rendering systems compute
[Figure 9 image panels, left to right: reference frame, sample positions, measured samples, Gaussian, 2D CS rendering, 3D CS rendering.]
Fig. 9. Reconstructing frames in an animated sequence. We can use our approach to render individual frames in an animated sequence and leverage the coherence both spatially and temporally in the Fourier domain. The three rows represent different frames in the train sequence, with the sampling rate varied for each (1% for the top row, 10% for the middle, and 25% for the bottom). The first column shows the reference frame of the fully-rendered sequence, the second column shows the positions of the samples (shown in white), the third column shows the samples available for this particular frame (unknown samples shown in green), the fourth column shows a reconstruction obtained by convolving the samples with an optimized 3D Gaussian kernel with variance adjusted for the sampling rate (the best possible linear filter), the fifth column shows the results of reconstructing each frame separately using the 2D algorithm from Sec. 5, and the last column reconstructs the entire 3D volume using a 3D Fourier transform. These images are rendered at 512 × 512 with 128 frames in the video sequence.
depth of field by estimating the integral of the radiance of incoming rays over the aperture of the lens through a set of random points on the virtual lens [61]. This means that we must choose two additional random parameters for each sample, which tell us the position at which the ray passes through the virtual lens. Therefore, in this application we parameterize each sample by its image-space coordinate (x, y) and these two additional parameters (u, v). As mentioned in the previous sections, since our compressed sensing reconstruction works on discrete positions, we uniformly choose the positions on the virtual lens to lie on a grid. The proposed framework is general and therefore easy to map to this new problem. We take our sample measurements y by sampling this new 4D space, and then reconstruct the whole space f using compressed sensing, assuming sparsity in the 4D Fourier domain. For this example, we use the SpaRSA solver to compute the sparse transform-domain signal $\hat{f}$. The final image is calculated at the end by integrating the reconstructed 4D signal over all u and v for each pixel. Fig. 10 shows an example of the output of the algorithm for depth of field.
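The final step can be sketched as integrating the reconstructed 4D signal over the lens grid (shown here with an illustrative single pixel and a 2 × 2 lens grid instead of the chapter's 16 × 16):

```python
def integrate_lens(f4d):
    """Integrate the reconstructed 4D signal f(x, y, u, v) over the lens
    coordinates (u, v) to obtain the final depth-of-field image."""
    return [[sum(sum(uv_row) for uv_row in f4d[y][x])
             / (len(f4d[y][x]) * len(f4d[y][x][0]))
             for x in range(len(f4d[y]))] for y in range(len(f4d))]

# A single pixel with a 2x2 lens grid: the four lens samples average out.
f4d = [[[[0.0, 1.0], [1.0, 0.0]]]]
print(integrate_lens(f4d))  # [[0.5]]
```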
10 Application to 4D Signals – Area Light Source
We can also apply the proposed framework to reconstruct a 4D scene with an area light source. This extension is fairly similar to that of the depth-of-field effect, the only difference being that here the two random variables represent points on the area light source. In this way, we parameterize each sample by its image-space coordinate (x, y) along with the position on the area light source (p, q). Again, we sample this 4D space and reconstruct it with SpaRSA, assuming sparsity in the 4D Fourier domain. Finally, we integrate over the area light coordinates p and q for each pixel in the reconstructed space to determine the final pixel color. Fig. 11 shows an example of the result of the algorithm for an area light source.
[Figure 10 image panels, left to right: reference image, random sampling, our approach.]
Fig. 10. Visual comparison of depth-of-field results for the Dragon scene. The reference image was rendered with 256 samples per pixel. Our result was generated by reconstructing a signal of size 340 × 256 × 16 × 16 and integrating the 16 × 16 samples over (u, v) for each of the 340 × 256 pixels.
[Figure 11 image panels, left to right: reference image, random sampling, our approach.]
Fig. 11. Visual comparison of area light source results for the Buddha scene. The reference image was rendered with 256 samples per pixel. Our result was generated by reconstructing a signal of size 300×400×16×16 and integrating the 16×16 samples over (p, q) for each of the 300 × 400 pixels.
Fig. 12. Results of the scenes with the reconstruction algorithms, each with a different % of computed samples. From top to bottom: robots (60%), watch (72%), sponza (87%). The large image shows the ground truth (GT) rendering for context. The smaller columns show the ground truth (GT) of the inset region and the ray-traced pixels (unknown pixels shown in green), followed by the results of Delaunay interpolation (D. int), ACT, inpainting, and compressed sensing (CS). It can be seen that our algorithm produces higher-quality images than the others.
11 Discussion
The framework proposed in this chapter is fairly general, and as shown in these last few sections it can reconstruct a wide range of Monte Carlo effects. However, some issues currently affect its practical use for production rendering. One limiting factor for the performance of the system is the speed of the reconstruction by the solver. In this work, we used C++ implementations of ROMP [47] and SpaRSA [48], but these were still relatively slow (requiring tens of minutes for some of the reconstructions), which decreases the performance of the overall system. However, the applied mathematics community is constantly developing new CS solvers, and there are already solvers that appear to be much faster than ROMP or SpaRSA with which we are just starting to experiment. There is also the possibility of implementing the CS solver on the GPU, which would give our algorithm a further speed-up. Another issue is the memory usage of the algorithm. Currently, the solvers need to store the entire signal f (or its transform $\hat{f}$) in memory while calculating its components. As the dimension of our scene function grows, the size of f grows exponentially. For example, if we want to do compressive rendering for a 6D scene with depth of field and an area light source, say with an image resolution of 1024 × 1024 and sample grids of 16 × 16 for the lens and the area light source, we would have to store an f with $n = 2^{36}$ entries, which would require 256 GB of memory. Therefore, our current approach suffers from the "curse of dimensionality" that can plague other approaches for multidimensional signal integration. We are currently working on modifying the solvers to ease the memory requirements of the implementation.
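The memory figure follows directly (assuming 4-byte floats, which the chapter does not state explicitly):

```python
# Storing the full 6D scene function f: 1024 x 1024 image, 16 x 16 lens
# grid, 16 x 16 area-light grid (single-precision floats assumed).
entries = 1024 * 1024 * 16 * 16 * 16 * 16   # = 2**36 samples of f
bytes_needed = entries * 4                   # 4 bytes per float
gigabytes = bytes_needed / 2 ** 30
print(entries, gigabytes)  # 2**36 entries, 256.0 GB
```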
Nevertheless, this chapter presents a novel way to look at Monte Carlo rendering: by treating it as a multidimensional function that we can fully reconstruct, assuming it is sparse in a transform domain, using tools from compressed sensing. This work may encourage other researchers to explore new ways of solving the rendering problem.
12 Conclusion
In this chapter, we have presented a general framework for compressive rendering that shows how we can use a distributed ray tracing system to take a small set of point samples of a multidimensional function f(), which we then approximately reconstruct using compressed sensing algorithms such as ROMP and SpaRSA by assuming sparsity in a transform domain. After reconstruction, we can integrate the signal down to produce the final rendered image. This algorithm works for a general set of Monte Carlo effects, and we demonstrate results with motion blur, depth of field, and area light sources.

Acknowledgments. The authors would like to thank Dr. Yasamin Mostofi for fruitful discussions regarding this work. Maziar Yaesoubi, Nima Khademi Kalantari, and Vahid Noormofidi also helped to acquire some of the results presented. The Dragon (Figs. 1 and 10), Buddha (Fig. 11), Sponza (Fig. 12)
and Garden (Fig. 7) scenes are from the distribution of the PBRT raytracer by Pharr and Humphreys [24]. The Robot (Fig. 12) and Train (Fig. 6, 8 and 9) scenes are from J. Peter Lloyd, and the Watch (Fig. 12) scene is from Luca Cugia. This work was funded by the NSF CAREER Award #0845396 “A Framework for Sparse Signal Reconstruction for Computer Graphics.”
References

1. Veach, E.: Robust Monte Carlo methods for light transport simulation. PhD thesis, Stanford University, Stanford, CA, USA (1998); Adviser: Leonidas J. Guibas
2. Dutré, P., Bala, K., Bekaert, P., Shirley, P.: Advanced Global Illumination. AK Peters Ltd., Wellesley (2006)
3. Jensen, H.W.: Realistic Image Synthesis Using Photon Mapping. A. K. Peters, Ltd., Natick (2001)
4. Sen, P., Darabi, S.: Compressive rendering: A rendering application of compressed sensing. IEEE Transactions on Visualization and Computer Graphics 17, 487–499 (2011)
5. Sen, P., Darabi, S.: Compressive estimation for signal integration in rendering. In: Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering (EGSR) 2010), vol. 29, pp. 1355–1363 (2010)
6. Sen, P., Darabi, S.: Exploiting the sparsity of video sequences to efficiently capture them. In: Magnor, M., Cremers, D., Zelnik-Manor, L. (eds.) Dagstuhl Seminar on Computational Video (2010)
7. Sen, P., Darabi, S.: Details and implementation for compressive estimation for signal integration in rendering. Technical Report EECETR100003, University of New Mexico (2010)
8. Schröder, P.: Wavelets in computer graphics. Proceedings of the IEEE 84, 615–625 (1996)
9. Schröder, P., Sweldens, W.: Wavelets in computer graphics. In: SIGGRAPH 1996 Course Notes (1996)
10. Stollnitz, E.J., DeRose, T.D., Salesin, D.H.: Wavelets for Computer Graphics: Theory and Applications. Morgan Kaufmann Publishers Inc., San Francisco (1996)
11. Mallat, S.: A Wavelet Tour of Signal Processing, 2nd edn. Academic Press, London (1999)
12. Hanrahan, P., Salzman, D., Aupperle, L.: A rapid hierarchical radiosity algorithm. SIGGRAPH Comput. Graph. 25, 197–206 (1991)
13. Gortler, S.J., Schröder, P., Cohen, M.F., Hanrahan, P.: Wavelet radiosity. In: SIGGRAPH 1993, pp. 221–230 (1993)
14. Lischinski, D., Tampieri, F., Greenberg, D.P.: Combining hierarchical radiosity and discontinuity meshing. In: SIGGRAPH, pp. 199–208 (1993)
15. Schröder, P., Gortler, S.J., Cohen, M.F., Hanrahan, P.: Wavelet projections for radiosity. Computer Graphics Forum 13 (1994)
16. Ramamoorthi, R., Hanrahan, P.: An efficient representation for irradiance environment maps. In: SIGGRAPH (2001)
17. Sloan, P.P., Kautz, J., Snyder, J.: Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. In: SIGGRAPH, pp. 527–536 (2002)
18. Ng, R., Ramamoorthi, R., Hanrahan, P.: All-frequency shadows using non-linear wavelet lighting approximation. ACM Trans. Graph. 22, 376–381 (2003)
P. Sen, S. Darabi, and L. Xiao
19. Ng, R., Ramamoorthi, R., Hanrahan, P.: Triple product wavelet integrals for all-frequency relighting. In: SIGGRAPH, pp. 477–487 (2004)
20. Malzbender, T.: Fourier volume rendering. ACM Trans. Graph. 12, 233–250 (1993)
21. Totsuka, T., Levoy, M.: Frequency domain volume rendering. In: SIGGRAPH, pp. 271–278 (1993)
22. Gross, M.H., Lippert, L., Dittrich, R., Häring, S.: Two methods for wavelet-based volume rendering. Computers and Graphics 21, 237–252 (1997)
23. Bolin, M., Meyer, G.: A frequency based ray tracer. In: SIGGRAPH 1995: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pp. 409–418 (1995)
24. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann Publishers Inc., San Francisco (2004)
25. Whitted, T.: An improved illumination model for shaded display. Communications of the ACM 23, 343–349 (1980)
26. Mitchell, D.P.: Generating antialiased images at low sampling densities. In: SIGGRAPH, pp. 65–72 (1987)
27. Walter, B., Drettakis, G., Parker, S.: Interactive rendering using the render cache. In: Lischinski, D., Larson, G. (eds.) Proceedings of the 10th Eurographics Workshop on Rendering, vol. 10, pp. 235–246. Springer-Verlag/Wien, New York, NY (1999)
28. Tole, P., Pellacini, F., Walter, B., Greenberg, D.: Interactive global illumination in dynamic scenes. ACM Trans. Graph. 21, 537–546 (2002)
29. Pighin, F., Lischinski, D., Salesin, D.: Progressive previewing of ray-traced images using image plane discontinuity meshing. In: Proceedings of the Eurographics Workshop on Rendering 1997, pp. 115–125. Springer, London (1997)
30. Guo, B.: Progressive radiance evaluation using directional coherence maps. In: SIGGRAPH 1998: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 255–266. ACM, New York (1998)
31. Bala, K., Walter, B., Greenberg, D.P.: Combining edges and points for interactive high-quality rendering. ACM Trans. Graph. 22, 631–640 (2003)
32. Sen, P., Cammarano, M., Hanrahan, P.: Shadow silhouette maps. ACM Transactions on Graphics 22, 521–526 (2003)
33. Sen, P.: Silhouette maps for improved texture magnification. In: HWWS 2004: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 65–73. ACM, New York (2004)
34. Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: SIGGRAPH 2000: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 417–424 (2000)
35. Masnou, S., Morel, J.M.: Level lines based disocclusion. In: Proceedings of ICIP, pp. 259–263 (1998)
36. Marvasti, F.: Nonuniform Sampling: Theory and Practice. Kluwer Academic Publishers, Dordrecht (2001)
37. Feichtinger, H., Gröchenig, K., Strohmer, T.: Efficient numerical methods in non-uniform sampling theory. Numer. Math. 69, 423–440 (1995)
38. Marvasti, F., Liu, C., Adams, G.: Analysis and recovery of multidimensional signals from irregular samples using nonlinear and iterative techniques. Signal Process. 36, 13–30 (1994)
39. Gu, J., Nayar, S., Grinspun, E., Belhumeur, P., Ramamoorthi, R.: Compressive structured light for recovering inhomogeneous participating media. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 845–858. Springer, Heidelberg (2008)
40. Peers, P., Mahajan, D., Lamond, B., Ghosh, A., Matusik, W., Ramamoorthi, R., Debevec, P.: Compressive light transport sensing. ACM Trans. Graph. 28, 1–18 (2009)
41. Sen, P., Darabi, S.: Compressive dual photography. Computer Graphics Forum 28, 609–618 (2009)
42. Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory 52, 489–509 (2006)
43. Donoho, D.L.: Compressed sensing. IEEE Trans. on Information Theory 52, 1289–1306 (2006)
44. Rice University Compressive Sensing Resources website (2009), http://www.dsp.ece.rice.edu/cs/
45. Candès, E.J., Rudelson, M., Tao, T., Vershynin, R.: Error correction via linear programming. In: IEEE Symposium on Foundations of Computer Science, pp. 295–308 (2005)
46. Tropp, J.A., Gilbert, A.C.: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. on Information Theory 53, 4655–4666 (2007)
47. Needell, D., Vershynin, R.: Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit (2007) (preprint)
48. Wright, S., Nowak, R., Figueiredo, M.: Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57, 2479–2493 (2009)
49. Candès, E.J., Tao, T.: Near optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. on Information Theory 52, 5406–5425 (2006)
50. Shannon, C.E.: Communication in the presence of noise. Proc. Institute of Radio Engineers 37, 10–21 (1949)
51. Donoho, D.L., Huo, X.: Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47, 2845–2862 (2001)
52. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (2001)
53. LuxRender (2009), http://www.luxrender.net/
54. Dunbar, D., Humphreys, G.: A spatial data structure for fast Poisson-disk sample generation. ACM Trans. Graph. 25, 503–508 (2006)
55. Intel Math Kernel Library (2009), http://www.intel.com/
56. Stanford Systems Optimization Laboratory software website (2009), http://www.stanford.edu/group/SOL/software/lsqr.html
57. Tsaig, Y., Donoho, D.L.: Extensions of compressed sensing. Signal Process. 86, 549–571 (2006)
58. Taubman, D.S., Marcellin, M.W.: JPEG 2000: Image Compression Fundamentals, Standards and Practice. Springer, Heidelberg (2001)
59. Alper, E., Mavinkurve, S.: Image inpainting implementation (2002), http://www.eecs.harvard.edu/~sanjay/inpainting/
60. Painter, J., Sloan, K.: Antialiased ray tracing by adaptive progressive refinement. SIGGRAPH Comput. Graph. 23, 281–288 (1989)
61. Cook, R.L., Porter, T., Carpenter, L.: Distributed ray tracing. SIGGRAPH Comput. Graph. 18, 137–145 (1984)
62. Hachisuka, T., Jarosz, W., Weistroffer, R.P., Dale, K., Humphreys, G., Zwicker, M., Jensen, H.W.: Multidimensional adaptive sampling and reconstruction for ray tracing. ACM Trans. Graph. 27, 1–10 (2008)
63. Overbeck, R.S., Donner, C., Ramamoorthi, R.: Adaptive wavelet rendering. ACM Trans. Graph. 28, 1–12 (2009)
Efficient Rendering of Light Field Images

Daniel Jung and Reinhard Koch

Computer Science Department, Christian-Albrechts-University Kiel, Hermann-Rodewald-Str. 3, 24118 Kiel, Germany
Abstract. Recently a new display type has emerged that is able to display 50,000 views, offering a full parallax autostereoscopic view of static scenes. With advances in manufacturing technology, multi-view displays come with more and more views of dynamic content, closing the gap to this high-quality full parallax display. The established method of content generation for synthetic stereo images is to render both views; to ensure high quality, these images are often ray traced. With the increasing number of views, rendering all views is not feasible for multi-view displays. Therefore, methods are required that can efficiently render the large number of different views required by those displays. In the following, a complete solution is presented that describes how all views for a full parallax display can be rendered from a small set of input images and their associated depth images with an image-based rendering algorithm. An acceleration of the rendering by two orders of magnitude is achieved by different parallelization techniques and the use of efficient data structures. Moreover, the problem of finding the best next view for an image-based rendering algorithm is addressed, and a solution is presented that ranks possible viewpoints based on their suitability for an image-based rendering algorithm.

Keywords: Image-based rendering, depth-compensated interpolation, best-next-view selection, viewpoint planning, full parallax display, multi-view display.
1 Introduction
D. Cremers et al. (Eds.): Video Processing and Computational Video, LNCS 7082, pp. 184–211, 2011. © Springer-Verlag Berlin Heidelberg 2011

Three-dimensional displays add an important depth cue to realistic perception, and are becoming ever more popular. Hence there is an increasing demand to produce and display 3D content to the general public. So far, large-scale 3D presentation is mostly confined to 3D cinemas using glasses-based stereoscopy. Glasses-free autostereoscopic displays exist that project multiple views (typically below 10 views), but they still have poor depth resolution and horizontal parallax only. Outdoor 3D advertisement calls for high-resolution, large-scale displays of several square meters that emit a full light field with unrestricted 3D perception to a passing audience without glasses. In this case, one needs to
produce not 10 but several thousands of slightly different views that capture all details of the 3D scene. This can be achieved by a light field display that emits all possible light rays of the scene in an integral image. Each pixel of such an integral image is in fact a display element in itself, holding all possible light rays for this position in space. Such a display may easily need several Gigapixels per square meter, and each display element consists of a lens system with an associated light-ray image. Currently, no digital version of such a display is available, although experiments with multi-projector-based systems show its feasibility [1], [2]. There is, however, a display capable of displaying 12 Gigapixels/m² for a static image by coupling a 2D lens array with a photographic film and backlighting. The display emits a full 2D-parallax light field with 50,000 light rays per lens, the 230,000 lenses arranged in a 2D array of 1 m² with an inter-lens distance of 2 mm. Figure 1 depicts a prototype of the display (left) and a schematic of one of the display elements with a lens system that projects its lens image (right). Each lens system projects a circular light-ray image with an opening angle of 40 degrees. The high-density photographic film contains the colors for the light rays, and an observer in front of the display perceives a light field with both eyes, indistinguishable from the real 3D scene.
Fig. 1. The light field display with 230,000 display elements (prototype by Realeyes GmbH, left) and a sketch of one display element projecting its lens image consisting of 50,000 rays (right)
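The quoted display figures are mutually consistent, as a quick arithmetic check shows (treating 12 Gigapixels as a rounded value):

```python
# Consistency check of the display specification quoted above.
lenses = 230_000        # display elements per square meter
rays_per_lens = 50_000  # light rays (pixels) per lens image
total = lenses * rays_per_lens
print(total / 1e9)      # 11.5 Gigapixels, quoted as ~12 Gigapixels/m^2

# With a 2 mm inter-lens distance, a square 1 m^2 grid would hold
# (1000 mm / 2 mm)^2 = 250,000 lenses; 230,000 is of the same order
# (the lens images are circular, so the packing differs slightly).
grid = (1000 // 2) ** 2
assert grid == 250_000
```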
As this display is targeted towards 3D outdoor advertisement, the light field to be coded into the film will be produced by the advertisement company as a 3D scene model and rendered by a high-end ray tracer for highest quality. However, ray tracing such an enormous amount of data is a challenge for the renderer. Each lens image itself holds only 50,000 pixels, but the render system needs to produce 230,000 images, one for each lens, which might well take a few months on a current multi-core desktop computer. It is possible to speed up
the rendering by using large render farms, but at the expense of very high costs. In [3] we evaluated the render performance for such a display and came to the conclusion that full ray tracing is not economically feasible, but pointed out that the image content is highly redundant, as the images are very similar. This leads to the concept of ray tracing only a small subset of images and interpolating the in-between images using depth-compensated view interpolation. In this case, a speed-up of up to two orders of magnitude is possible while preserving most of the image quality. In the following, a complete solution is presented that describes how all views for a full parallax display can be rendered from a small set of input images and their associated depth images with an image-based rendering algorithm. It is described how the viewpoints of the input images are chosen and how the presented algorithm is accelerated by parallelization on the central processing unit and on the graphics hardware. Afterwards, the results of the viewpoint selection algorithm are presented and the rendering algorithm is evaluated, comparing the different parallelization techniques. This work extends [3] and [4] by giving more detailed insight into the image-based rendering algorithm and the viewpoint selection, as well as the acceleration techniques used.
2 Prior Work
The plenoptic function can be used to describe all light that exists within a scene. It was introduced by Adelson and Bergen [5] and describes, as a function of the time t, the direction (θ, φ) and the wavelength λ of all light rays passing through every position V = (Vx, Vy, Vz) in the scene:

    P = P(θ, φ, λ, t, Vx, Vy, Vz) = P(θ, φ, λ, t, V).    (1)

For static scenes and under constant illumination the plenoptic function can be reduced to the five-dimensional function

    P5D = P5D(θ, φ, Vx, Vy, Vz) = P5D(θ, φ, V),    (2)
which characterizes a full panoramic image at the viewpoint V. McMillan and Bishop [6] linked image-based rendering to the plenoptic function and came to the conclusion that a unit sphere centered around the viewpoint would be the most natural representation. However, they used a cylindrical projection for their representation because it simplified their correspondence search and could easily be unrolled onto a planar map. The light field of a scene can be reduced to a 4D function under the limitation that the radiance of light rays does not change while passing through the scene. Ashdown [7] used this representation to simplify the calculation of illuminance and pointed out its suitability for ray tracing. With this representation it is not possible to describe transparent surfaces or occlusions, but the direct illuminance can be predicted without knowledge about the geometry of the scene. Levoy and Hanrahan [8] introduced the two-plane parameterization of the 4D function and the idea of rendering new viewpoints from this representation.
Their parameterization is well suited for the fast rendering of viewpoints and does not depend on the geometry of the scene. A drawback of their algorithm is that very dense sampling is needed to avoid ghosting due to unmodeled scene geometry. Gortler et al. [9] also used the two-plane parameterization for rendering novel viewpoints but introduced a depth correction by a 3D model to reduce rendering artefacts at object boundaries. For representation of the model they chose an octree, as described by Szeliski [10]. Another approach to incorporate geometric support was made by Lischinski and Rappoport [11]. They introduced the Layered Light Field (LLF), which incorporates parallel layered depth images (LDI), surface normals, diffuse shading, visibility of light sources and the material properties at the surface point. That way they were able to render images of synthetic scenes that correctly handled view-dependent shading and reflections. A detailed summary of image-based rendering techniques and image-based view synthesis can be found in Koch and Evers-Senne [12]. One important issue is the rendering speed for efficient view interpolation. Halle and Kropp [13] used the technically mature rendering of OpenGL for rendering images for full parallax displays. This way they took advantage of the parallelization on the graphics hardware with minimum effort and full compatibility with the graphics toolkits that are based on OpenGL. Dietrich et al. [14], [15] achieved a remarkable acceleration in real-time ray tracing by sample caching and reuse of shading computations, exploiting frame-to-frame coherence. Similar to texture MIP map levels, they cached computation results on different levels of the hierarchical space partition for reuse. An important consideration for image-based interpolation is the selection of proper reference images as key frames for the interpolation. This leads to the problem of view selection and view planning.
Best-next-view problems find applications in many real-world scanning scenarios, ranging from large-scale urban model acquisition [16] to automated object reconstruction [17]. One of the first applications of viewpoint planning was to calculate the optimal position for feature detectability. Tarabanis et al. [18] give an overview of viewpoint planning algorithms and introduce a classification of view planning algorithms, dividing them into algorithms that follow the generate-and-test paradigm and those that follow the synthesis paradigm. The generate-and-test approach is characterized by a discretized viewing space that reduces the number of possible viewpoints to be evaluated. In contrast, the synthesis approach puts constraints in the form of an analytic function on the best next view, which ensure that the derived viewpoints satisfy these constraints. Werner et al. [19] used dynamic programming to select the optimal set of input images from a set of reference views for one degree of freedom. In addition to the visibility criterion, Massios and Fisher [20] introduced a best-next-view selection algorithm that utilizes a quality criterion as well. A common quality criterion for best-next-view selection algorithms is the correlation between the captured surface normal and the viewing angle (Wong et al. [21]). Automatic camera placement has many applications in synthetic [22] and real-world scenarios [23]. A comprehensive review of view planning algorithms for automated object reconstruction can be found in [24].
3 System Overview
Although most of the effects inherently available in ray tracers can be simulated on the graphics pipeline of GPUs, e.g., reflections by environment maps, advertisement companies produce their content almost exclusively with modeling tools using ray tracing. The reason might be that they do not want to limit their design possibilities in advance. Another reason might be that many effects that are readily available in ray tracers are much harder to achieve when rendering on the graphics pipeline of a GPU. Therefore, rendering on the graphics pipeline of a GPU would require more technical training of the artists and would also increase the time needed to design a 3D model. In this contribution we present an approach for time-efficient rendering of dense high-resolution light field images from a modeled 3D scene. As discussed before, full ray tracing of all lens images for the light field display is not feasible, as it would take several months to render. Instead, we propose to ray trace only a small set of reference lens images, and to exploit depth-compensated view interpolation for all others. Hence, we need to determine which of the display elements carry the most information and are to be selected as reference views. These images are ray traced at full resolution. For all other lens images, we merely render a subset of all depth images from the 3D model, which are obtained at marginal cost. The input of our interpolation algorithm is a subset of all depth images and a few selected reference lens images, which constitutes a convenient generic interface to all possible 3D modeling tools. Hence the proposed algorithm is independent of the modeling tool used for content creation, and can even be used with real-world scenes as long as depth and color can be obtained by existing range and color cameras, although the presented algorithm is not intended for that purpose.
In order to minimize the number of input images, a very exact geometry is required, making the algorithm sensitive to errors in the geometry that are introduced by range sensors and camera calibration. Image-based rendering algorithms that are robust against errors in the geometry, or do not use geometry at all (e.g., [8]), compensate for the missing geometry by taking a high number of light field samples, which contradicts our goal of minimizing the number of input images.

Figure 2 sketches the proposed system. Depth images are obtained from the 3D model directly and a visual-geometric octree representation (the Light Field Nodes, LFNs) of the visible scene surfaces is produced. The best-next-view selection is driven by evaluating the visible surfaces of the scene, and a ranking of the most important reference lens images is obtained. These images are fully ray traced and view-dependent surface color is added as a quad tree for each LFN. Finally, all remaining lens images are rendered directly from the LFNs by a simple ray cast and interpolation of the view-dependent color information.

Fig. 2. Overview of the proposed algorithm

The proposed image-based rendering algorithm can be divided into four stages. In the first stage, the geometric model of the scene is built using depth images of the scene. After the depth images are added to the scene, the geometric model is refined. In the second stage, all viewpoints of a discrete viewpoint area are evaluated for their suitability as reference images for the proposed image-based rendering algorithm, incorporating criteria like the size of the display and the distance to the display's zero plane. In the third stage, the color information of the light field is assembled from the previously chosen set of input images. The final stage is the rendering of all views for the full parallax display. In the following, the different stages and the image-based rendering algorithm are described in detail.

3.1 Building the Geometric Model
The first stage of the algorithm is to build a geometric model of the scene. The model is built from a set of depth images of the scene, which can easily be rendered by any modeling tool. This avoids the problem of exporting the geometric model from different modeling tools and provides a convenient and generic interface. For every pixel of each depth image, a ray cast is done in which the depth value defines the length of the ray. The result is a scene description in the form of a 3D point cloud. Figure 3 (left) depicts the 3D point representation, where V is the position of the point relative to the model's coordinate system, according to the five-dimensional plenoptic function P5D (eq. 2). The sphere drawn around the 3D point in figure 3 (left) illustrates the storage capacity that is needed to save the different rays that may pass through the 3D point.

Fig. 3. Representation (left) and bounding volume (right) of a 3D point

The 3D points are stored in an octree. An octree is a hierarchical data structure that partitions 3D space and is able to handle sparse data efficiently. The root node of the octree defines a cube that encloses the complete scene. When a 3D point is inserted into the octree, it is assigned a size that defines at which hierarchical level of the octree the point will be inserted. Figure 3 (right) shows a 3D point placed at the center of its bounding volume. Starting at the root node, it is checked recursively in which subvolume of the current node the point lies, and the 3D point is inserted into this child until the volume of the next hierarchical level is smaller than the volume assigned to the 3D point. If the volume where the 3D point is inserted is empty, a new node is added; otherwise, the point is merged with the existing 3D point. The volume assigned to the 3D point depends on the distance of the point to the display's zero plane. The display's zero plane is a plane in the virtual scene defined by the location of the multi-view display. The camera centers of all views are located on that plane, and an observer of the display will focus on that plane of the virtual scene when focusing on the display. The size of the cube limits the geometric detail that can be described by the geometric representation. Near the display's zero plane the size of the cubes is chosen to be very small, because near the camera centers the viewing rays allow a very high spatial resolution. Far away from the display's zero plane the viewing rays diverge, limiting the spatial resolution and allowing for a much coarser geometric model.
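The distance-dependent octree insertion described above can be sketched as follows. All class and function names, the merge rule, and the size falloff are illustrative assumptions, not taken from the chapter's implementation:

```python
# Sketch of octree insertion with a distance-dependent leaf size.
class OctreeNode:
    def __init__(self, center, half_size):
        self.center = center        # cube center (x, y, z)
        self.half_size = half_size  # half the cube edge length
        self.children = {}          # octant index -> OctreeNode
        self.point = None           # merged 3D point stored at this node

    def octant(self, p):
        # Which of the 8 sub-cubes contains p? (one bit per axis)
        return sum(1 << i for i in range(3) if p[i] >= self.center[i])

    def insert(self, p, point_size):
        # Descend until the next level would be smaller than the volume
        # assigned to the point, then store (or merge) it here.
        if self.half_size <= point_size:
            self.point = p if self.point is None else tuple(
                0.5 * (a + b) for a, b in zip(self.point, p))  # naive merge
            return
        idx = self.octant(p)
        if idx not in self.children:
            child_center = tuple(
                c + (0.5 * self.half_size if p[i] >= c else -0.5 * self.half_size)
                for i, c in enumerate(self.center))
            self.children[idx] = OctreeNode(child_center, 0.5 * self.half_size)
        self.children[idx].insert(p, point_size)

def point_size_for(p, zero_plane_z, base=0.01, growth=0.05):
    # Coarser cubes far from the display's zero plane (illustrative falloff).
    return base + growth * abs(p[2] - zero_plane_z)

root = OctreeNode((0.0, 0.0, 0.0), 8.0)
root.insert((1.0, 2.0, 3.0), point_size_for((1.0, 2.0, 3.0), 0.0))
```

The merge step here simply averages positions; the chapter merges coincident points but does not specify the rule, so that detail is an assumption.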
Fig. 4. Overlapping field of view (left) and 3D point near the display's zero plane (right)
3D points are often merged because of the overlapping fields of view of the input images (fig. 4, left). This also occurs near the display's zero plane, as the different rays are close together near the camera centers (fig. 4, right). After the depth images have been added to the model, the visible surface of the model is extracted. Due to the limited numerical precision of the depth images, a surface of the scene may be thicker than one layer of volumetric 3D points in the octree representation (fig. 5). These additional 3D points lie directly under the surface and are occluded in all views. A 3D scene contains a huge number of 3D points, typically between 250,000 and 550,000. Therefore, it is beneficial to remove the occluded points, which saves memory and computation time due to the reduced complexity of the model.
Fig. 5. Quantization leading to occluded subsurface points
In order to remove the occluded points, the model is intersected with all possible viewing rays of the display. When a 3D point is hit by a viewing ray, it is marked as visible. After all viewing rays have been intersected with the model, all non-marked 3D points are removed from the geometric model of the scene. This ensures that no visible parts of the model are removed, without resorting to a priori knowledge of the model's complete geometry.
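This mark-and-sweep removal can be sketched as follows. The ray-intersection routine is deliberately left abstract (in the chapter it would be an octree traversal), and all names are illustrative:

```python
# Mark-and-sweep removal of occluded (subsurface) points, as described above.
# intersect_first(ray) is assumed to return the first scene point hit by a
# viewing ray, or None; its implementation is omitted here.

def remove_occluded(points, viewing_rays, intersect_first):
    visible = set()
    for ray in viewing_rays:
        hit = intersect_first(ray)  # first 3D point along this ray, or None
        if hit is not None:
            visible.add(hit)
    # Sweep: keep only points that at least one display ray can see.
    return [p for p in points if p in visible]
```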
3.2 Determine the Best Viewpoints
After the model of the scene is complete, a set of color input images has to be selected in order to obtain the color information of the light field. Therefore the next step is the computation of the best possible input viewpoints. In order to determine the viewpoint ranking, the possible viewpoints are restricted to the positions of the display elements, following the generate-and-test paradigm as introduced by Tarabanis et al. [18]. For the presented algorithm, the best-next-view algorithm that we introduced in [4] is utilized. In this algorithm, the importance of a viewpoint is weighted by the number of visible 3D points and its viewing distance to each 3D point. The viewing distance was chosen as a criterion because 3D points near the display's zero plane contain more view-dependent color information that is relevant for the display than points far away from the display's zero plane, and thus need to be sampled more often.
Fig. 6. Opening angle in relation to the distance of the display and its diagonal size
Figure 6 shows the relation between the distance d to the display’s zero plane, the diagonal size s of the display and the opening angle α for a point P located
on a line that runs orthogonally through the center of the display. The relation between the opening angle and the distance to the display's zero plane is described by

    α = 2 · arctan(s / (2·d)).    (3)

It can be seen that the opening angle becomes smaller as the distance to the display's zero plane increases. Therefore, points far away from the display's zero plane can safely be regarded as ambient and need to be sampled less often than points near the display's zero plane, which may contain view-dependent color information as they are sampled under a large opening angle from the display. To account for this, each visible 3D point is weighted by its distance to the display's zero plane. The algorithm to calculate the best next view creates a list in which all possible viewpoints are ordered by their measured importance. This allows for an interactive selection of the number of input images that are used as input for the image-based rendering algorithm. The weighting function

    w(V, P, σ, f, a) := { 0,                              if P is not visible from V
                        { f · e^(−‖V − P‖² / (2σ²)) + a,  else                        (4)

takes as arguments the viewpoint V, the 3D point P, a display fitting value σ, a scaling factor f and a minimum weight a. The importance of an input image I_n out of N input images, with n ∈ {1, ..., N}, is then measured by the sum over the weights of all M 3D points,

    I_n := Σ_{m=1}^{M} w(V_n, P_m, σ, f, a),    (5)

with the single 3D point P_m and m ∈ {1, ..., M}. After all possible viewpoints have been evaluated, a list is created, ordered by the importance of the viewpoints with the most important viewpoint on top. The depth images of the selected viewpoints are used to create a geometric model for evaluation before the color images of these viewpoints are rendered. This allows for an evaluation of the model, because an insufficient number of input images may lead to visible holes in the model, caused by non-overlapping viewpoints (fig. 7, left) or occlusions (fig. 7, right). Therefore the interpolated view will expose artifacts that are caused by this incomplete geometry, as shown by the dashed cameras in the figures. The effects of occlusions or self-occlusion can be very complex even for a simple scene with one object, e.g., a concave object. Hence an evaluation of the geometry, and the possibility to add additional viewpoints, is necessary to ensure that the number of input images is sufficient.
3.3 Assembling of the Light Field
After the viewpoints have been selected the color information is added to the geometric model. Each 3D point of the geometric model has a data structure
Eﬃcient Rendering of Light Field Images
Fig. 7. Hole in the geometry caused by subsampling (left) and by an occlusion (right)
attached that holds the view-dependent color information. In the following, this data structure is called a Light Field Node (LFN). There are two requirements the data structure has to fulfill that are linked to the properties of the objects in the scene. For objects that exhibit Lambertian reflectance or belong to the background of the scene, where the viewing angle does not change significantly between the different views, it is sufficient to save one color value per LFN, whereas LFNs near the display's zero plane may need to store more than one color value due to specular reflections. Therefore, the data structure of the LFN has to handle sparse data efficiently. Another requirement is to encode the direction of the saved viewing ray and to allow the angular closest viewing ray to be found efficiently in the data structure for interpolation. The view-dependent color information is saved in an image with the same properties and orientation as the camera that renders the light field for the display. In this case it is a spherical image, and the direction of the viewing rays is encoded in the spherical coordinates of their positions in the image. Figure 8 shows an LFN and how the ray direction is encoded in its data structure. The LFN can be imagined as a unit sphere with its center at the position of the 3D point. A viewing ray is therefore completely described by the position of the LFN and the two angles (Φ and Θ) of the spherical coordinates. All input images for the display are rendered with a camera of constant orientation and with its viewpoint restricted to a plane. Under this condition it is advantageous to orient the LFN the same way as the camera that rendered the input images. This setup is shown in figure 9. The benefit is that it is not necessary to project the center of the camera onto the spherical image of the LFN to obtain its image index and correctly encode the viewing ray's direction; instead, the image index of the LFN is the same as the image index of the input image.
The interpolation of all views will also beneﬁt from this setup as the orientation of the interpolated views is the same as the orientation of the input views and all viewpoints are located on the display’s zero plane.
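The mapping from a ray direction to an index in the LFN's spherical image can be sketched as follows. This is an illustrative sketch only: the image resolution and the linear angle-to-pixel mapping are assumptions, not details from the text.

```cpp
#include <cassert>
#include <cmath>

struct PixelIndex { int u, v; };

// Map a viewing-ray direction (relative to the LFN's center) to pixel
// coordinates of the LFN's spherical image.  phi and theta are the
// spherical angles described in the text; the linear mapping of the
// angles to pixels is an assumption of this sketch.
PixelIndex rayToImageIndex(double dx, double dy, double dz,
                           int width, int height) {
    const double PI = std::acos(-1.0);
    double len   = std::sqrt(dx * dx + dy * dy + dz * dz);
    double phi   = std::atan2(dy, dx);      // azimuth in [-pi, pi]
    double theta = std::acos(dz / len);     // polar angle in [0, pi]
    PixelIndex p;
    p.u = static_cast<int>((phi + PI) / (2.0 * PI) * (width - 1) + 0.5);
    p.v = static_cast<int>(theta / PI * (height - 1) + 0.5);
    return p;
}
```

Because the LFN shares the orientation of the input cameras, this lookup reduces to reusing the input image's own index, as described above.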
D. Jung and R. Koch
Fig. 8. Light ﬁeld node and encoding of the ray direction in the image’s index
Fig. 9. Relation of the image index of the reference image and the image index of the LFN
Another benefit of this configuration is that the opening angle of the LFN, in which viewing rays can be saved, is the same as the opening angle of the display elements. Therefore, the LFN can only save viewing rays that are in the field of view of the display elements. The second requirement is to handle the sparse view-dependent color information saved by an LFN. If every LFN saved a full spherical image, the memory requirement would limit the presented algorithm to about 5,000 LFNs. The scenes processed typically have 250,000 to 550,000 LFNs, which is about two orders of magnitude more. Therefore, a quad tree [25] was chosen, which has several properties that are beneficial for the proposed algorithm. First, a quad tree can store sparse data efficiently, making it possible to save as many viewing rays as necessary per LFN while allocating memory only for the saved viewing rays. Another advantage is the efficient lookup of the angular closest viewing rays, which is important both for the decision whether a viewing ray carries new information and should be saved in the data structure, and for the interpolation of novel views for the display. In the following it is described how the color information is saved in the LFN.
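A minimal point quad tree over the LFN's angular image coordinates might look as follows. This is an illustrative sketch only (the paper's implementation follows [25]); the `Sample` layout is invented, and the exhaustive nearest lookup omits the pruning a production quad tree would use.

```cpp
#include <cassert>
#include <memory>

// A sparse point quad tree over the LFN's angular image: memory is only
// allocated for saved viewing-ray samples.
struct Sample { double u, v; int color; };

struct QuadNode {
    double cx, cy, half;                 // square cell: center and half size
    std::unique_ptr<Sample> sample;      // at most one sample per leaf
    std::unique_ptr<QuadNode> child[4];

    QuadNode(double cx_, double cy_, double half_)
        : cx(cx_), cy(cy_), half(half_) {}

    int quadrant(double u, double v) const {
        return (u >= cx ? 1 : 0) | (v >= cy ? 2 : 0);
    }

    void insert(Sample s) {
        if (!sample && !child[0] && !child[1] && !child[2] && !child[3]) {
            sample = std::make_unique<Sample>(s);  // empty leaf: store here
            return;
        }
        if (sample) {                    // split: push existing sample down
            Sample old = *sample;
            sample.reset();
            insertIntoChild(old);
        }
        insertIntoChild(s);
    }

    void insertIntoChild(Sample s) {
        int q = quadrant(s.u, s.v);
        double h = half / 2.0;
        if (!child[q])
            child[q] = std::make_unique<QuadNode>(
                cx + (q & 1 ? h : -h), cy + (q & 2 ? h : -h), h);
        child[q]->insert(s);
    }

    // Exhaustive nearest lookup (no pruning, for brevity).
    void nearest(double u, double v, const Sample*& best, double& bestD2) const {
        if (sample) {
            double du = sample->u - u, dv = sample->v - v;
            double d2 = du * du + dv * dv;
            if (d2 < bestD2) { bestD2 = d2; best = sample.get(); }
        }
        for (const auto& c : child)
            if (c) c->nearest(u, v, best, bestD2);
    }
};
```

The nearest lookup stands in for the angular-closest-ray query described in the text.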
Fig. 10. Similarity based selection of samples, saved by a LFN (one dimensional draft)
The color information is added to the geometric model by intersecting the model with the viewing rays. When a viewing ray intersects an LFN, it is decided whether the viewing ray should be saved or discarded. Figure 10.a shows a draft of successive viewing rays that intersect with the LFN, reduced by one dimension for the sake of clarity. The rays are labeled in their initial intersection order with the LFN, and their brightness represents their color. Figure 10 shows which viewing rays are saved after all input images have been processed. When the first viewing ray is processed, the LFN has no color information saved. Therefore, the viewing ray is saved, denoted by a white circle in the figure. When the second ray is intersected with the LFN, it has the same color as the angular closest ray and is discarded. For the same reason all following viewing rays are discarded, until the last viewing ray is processed, which has a different color than the first viewing ray and is therefore saved by the LFN. This approach reduces the number of saved rays, which is important for Lambertian surfaces of the scene or when the surface color does not change within the viewing angle. A drawback is that the interpolation between the angular closest viewing rays during view interpolation will introduce errors for the reconstructed viewing rays between the saved samples, as depicted in figure 10.b. This problem is solved by adding the input images to the geometric model a second time. This time the viewing rays are intersected with the LFN in reversed order. Viewing ray number six is not saved again because it was saved in the first run. When viewing ray number five is processed, the angular closest ray is viewing ray number six. Therefore, the color dissimilarity is detected and the viewing ray is saved by the LFN. The following viewing rays are not saved because of their color similarity to the angular closest rays.
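The two-pass, similarity-based selection described above can be sketched in one dimension as follows. The `Ray` layout, the exact-equality color test, and the linear search for the angular closest saved ray are simplifying assumptions of this sketch, not the paper's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One-dimensional sketch: a ray is saved when nothing is saved yet or
// when the angular closest already-saved ray has a different color.
struct Ray { double angle; int color; bool saved = false; };

static Ray* angularClosestSaved(std::vector<Ray>& rays, double angle) {
    Ray* best = nullptr;
    for (Ray& r : rays)
        if (r.saved && (!best || std::fabs(r.angle - angle) <
                                 std::fabs(best->angle - angle)))
            best = &r;
    return best;
}

static void pass(std::vector<Ray>& rays, bool reversed) {
    int n = static_cast<int>(rays.size());
    for (int i = 0; i < n; ++i) {
        Ray& r = rays[reversed ? n - 1 - i : i];
        Ray* closest = angularClosestSaved(rays, r.angle);
        if (!closest || closest->color != r.color)
            r.saved = true;
    }
}

// Two passes in alternating order, as described in the text.
void selectSamples(std::vector<Ray>& rays) {
    pass(rays, /*reversed=*/false);
    pass(rays, /*reversed=*/true);
}
```

For the six-ray example of figure 10 (five equal colors followed by one different), the first pass keeps rays one and six, and the reversed pass additionally keeps ray five.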
Figure 10.c shows the saved color values after the second run. Practice has shown that it suffices to add the input images three times in alternating order, as very few additional viewing rays are saved during the third run.

3.4 Rendering
After the color information has been added to the model all views are rendered for the display. This is done by ray intersection of all viewing rays with the geometric model. At the ﬁrst intersection of a viewing ray with a LFN the
intersection test is canceled and the color value of the viewing ray is interpolated by that LFN. The color is interpolated from the angular closest views, as described in section 3.3. The intersection of viewing rays with the geometric model is a central part of the presented algorithm and is used by all of its stages. Therefore, it is worthwhile to pay particular attention to the parallelization and implementation of the ray cast. The data structures have been chosen because their hierarchical partition of space allows for efficient ray intersection and lookup of the angular closest viewing rays. Another benefit is their inherent ability to handle sparse data sets. In the following, the different approaches to accelerate the ray cast are described and compared with regard to the data structures used.
4 Efficient Ray Cast Implementation
The ray cast has been implemented using different approaches. The reference implementation runs single-threaded on the central processing unit, utilizing only one of the available cores. Listing 1.1 shows the pseudo code for rendering of an image via ray cast. Every viewing ray of the display element is cast and its intersection with the model is calculated. The first LFN that is intersected by the viewing ray is then used to, e.g., interpolate the color from the angular closest color samples of that LFN. Listing 1.1. Pseudo code for ray casting of an image
for (every row y) {
    for (every column x) {
        ray = MakeRay(y, x)
        LFN = ocTree->Intersect(ray)
        outPixel[y][x] = LFN->InterpolateColor(ray)
    }
}

Ray casters are particularly well suited for parallelization, as all viewing rays are processed independently, although there are implementations that exploit image and object space coherence [26].

4.1 Multi-threaded Implementation on the Central Processing Unit
The parallelization on the central processing unit (CPU) has been implemented using Open Multi-Processing (OpenMP), which offers an easy way to utilize multi-core processors. For rendering of the interpolated images the parallelization is straightforward, because access to the data structure is read-only. Listing 1.2 shows how OpenMP can be used to parallelize the outer rendering loop to take advantage of the multi-core processor.
Listing 1.2. Parallelization of the outer rendering loop with OpenMP
#pragma omp parallel for schedule(dynamic)
for (every row y) {
    ...
}

The other parts of the algorithm that utilize the ray cast modify the data structures and therefore need a locking mechanism when executed in parallel. The operations that need to be made thread safe are the insertion of new points and the modification of existing nodes. The locking mechanism is implemented with OpenMP locks: each node of the octree is given a lock to guarantee exclusive access. An OpenMP lock can only be held by one thread at a time; if more than one thread attempts to acquire the lock, all but the thread that holds it are paused. When the lock is released by the thread that holds it, the next thread can acquire it.
Fig. 11. Insertion (left) and modiﬁcation (right) of an octree’s node
The procedure of inserting a new node into the octree is depicted in figure 11 (left). The node to be inserted is colored white and connected to its parent node by a dashed line. New nodes are inserted either as leaves to hold an LFN or as inner nodes for hierarchical space partitioning. To guarantee that the inserting thread has exclusive access to the data structure, the parent node is locked. Then the new node is created and locked by the thread that holds the lock of the parent node. The new node is attached to its parent node, and both locks held by the thread are released. Figure 11 (right) shows a situation where a node or its content should be modified by a thread. The node to be modified is colored white. The node is locked, its content is modified, and afterwards the lock is released by the thread. The data structure of the LFN is attached to the octree as content and is thread safe because access to the content of the octree is guaranteed to be exclusive by its locking mechanism. Accelerating the rendering with OpenMP on the central processing unit is straightforward and takes full advantage of the hierarchical data structures, but it is limited by the number of available CPU cores. In order to further accelerate the algorithm, an approach has been implemented that accelerates the intersection of the viewing rays of the display with the geometric model on the graphics hardware.
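The per-node locking scheme can be sketched as follows. The paper uses one OpenMP lock (`omp_lock_t`) per octree node; to keep this sketch self-contained, `std::mutex` is used in its place, and the node layout is a simplified assumption.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Octree node with its own lock, mirroring the insertion procedure of
// figure 11 (left): lock the parent, create and lock the new node,
// attach it, then release both locks.
struct OctreeNode {
    std::mutex lock;                          // per-node exclusive lock
    std::unique_ptr<OctreeNode> children[8];

    // Thread-safe insertion of a child node at the given octant.
    OctreeNode* insertChild(int octant) {
        std::lock_guard<std::mutex> parentGuard(lock);   // lock parent
        if (!children[octant]) {
            auto node = std::make_unique<OctreeNode>();
            std::lock_guard<std::mutex> childGuard(node->lock);
            children[octant] = std::move(node);          // attach child
        }                                                // child lock released
        return children[octant].get();
    }                                                    // parent lock released
};
```

Modification of an existing node (figure 11, right) follows the same pattern with only that node's lock held.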
4.2 Implementation on the Graphics Hardware
Halle and Kropp [13] introduced an algorithm for efficient rendering of images for full parallax displays that takes advantage of conventional graphics hardware and OpenGL. To achieve this, the rendering process had to be adapted to the special requirements of image rendering for full parallax displays. We extend their work by combining their efficient rendering on conventional graphics hardware for full parallax displays with our image-based rendering algorithm. In the following, the rendering algorithm is discussed in detail and it is described how it is used to accelerate our algorithm.
Fig. 12. Viewing frustum of a display element of a full parallax display (sketch in 2D)
The viewing frustum for one display element of a full parallax display is depicted in figure 12. The image for a display element is rendered by placing a camera at its position. Then a ray is cast for every viewing ray of the display element, starting in front of the display at a defined distance to the display's zero plane. At the position of the display element, the camera center, the foreground geometry is point-reflected and therefore appears upside down in the image of the display element, with left and right inverted in the resulting image. The ray ends at the specified maximum distance behind the display's zero plane. If we assume that the color along a viewing ray does not change, the intersection of the viewing ray with the model can be canceled at the first intersection point, marked by the white cross in figure 12. In order to render content for the full parallax display with the OpenGL library, the algorithm introduced by Halle and Kropp divides the viewing frustum of the display element into two parts. The first part is behind the display plane and can be rendered directly with OpenGL (see fig. 13, right). In the OpenGL camera model the near clipping plane cannot be set to the camera's center. To work around this limitation, the camera center is slightly shifted backwards along the optical axis in order to place the near clipping plane on the display's zero plane. According to Halle and Kropp this imposes the drawback of small image distortions and a potentially coarse depth resolution due to the small distance of the near clipping plane to the camera center. They concluded that these problems have very little impact on the final image and the coarse
Fig. 13. Viewing frustum in front of (left) and behind (right) the display’s zero plane
depth resolution could be avoided by separating the viewing frustum into smaller pieces. The second part is the rendering of those parts of the scene, which are in front of the display’s zero plane. Therefore the camera is rotated to face the opposite direction and shifted backwards along its optical axis such that the near clipping plane lies at the display’s zero plane. Figure 13 (left) shows the frustum of one display element that lies in front of the display. In order to achieve the same rendering behavior as in ﬁgure 12 the intersection test has to choose the intersection that has the greatest distance to the camera center, marked by a white cross in ﬁgure 13 (left), as opposed to the closest intersection, marked by a black cross. In OpenGL this can be achieved by setting the function that compares the depth values to prefer greater values and by clearing the depth buﬀer to zero (see listing 1.3). Listing 1.3. Adjustment of the depth order for rendering of pseudoscopic images with OpenGL
glDepthFunc(GL_GREATER);
glClearDepth(0.0);

The surfaces that are visible to the camera lie on the back side of the objects. Hence, it is important to turn back-face culling off or to set OpenGL to cull the front faces of objects. Halle and Kropp made several adjustments to correct the lighting for rendering of the pseudoscopic image, which are not discussed here as the presented rendering algorithm does not depend on them. For the remainder of this section, the part of the scene that lies in front of the display's zero plane is called front geometry and the part that lies behind it is called back geometry. The scene is divided by the display's zero plane into the front and back geometry, which are rendered separately. The front geometry is rendered to produce a pseudoscopic image, whereas the back geometry is rendered orthoscopically. Every LFN of the back geometry is now assigned a unique identification number that is encoded as a 24-bit color. This mapping between the LFN and a unique color will be used to render the
geometric model and identify the LFN that is hit by the viewing ray. The color black is assigned to the background of the scene and is used to identify viewing rays that have no intersection with the geometric model. Each LFN belongs either to the front geometry or to the back geometry. Therefore, the mapping of the scene's front geometry to colors is independent of the back geometry's mapping, and color values can be reused. For rendering, each LFN is represented by a quad in the associated color. The front and back geometry are rendered separately to different textures with the previously described modifications of the rendering. In a last rendering pass, both images are combined. This is done by texturing the rendered images onto a viewport-aligned quad. Rendering with an orthographic projection ensures that the pixels of both textures map exactly onto the viewport of the final image. The pseudoscopic front geometry image is side-corrected via separate texture coordinates that are mapped onto the quad. Both images are combined by a fragment shader: those parts of the front geometry's image that have the black background color are filled with the values of the back geometry's image. The result is a pseudo-colored image of the scene in which the intersections of the viewing rays with the geometric model are obtained, encoded in the mapping of the colors to the LFNs. The final image is generated on the CPU in parallel by identifying the intersected LFNs and by interpolating the color value from the angular closest color information saved by the LFN. So far, rendering is limited by the camera model of OpenGL. Some displays may require a special lens model for content generation, which is needed to sample the artificial scene with the viewing rays of the display. In the following it is described how a vertex shader can be used to simulate a fisheye lens that is used to project the different viewing rays of a display element of the full parallax display.
The idea is to use a vertex shader to move the vertices to positions such that the image rendered with OpenGL equals an image taken with a fisheye camera. This modification has to be done in dependence of the viewpoint, because the lens effect is achieved by modifying the geometric model of the scene. First, the modelview matrix is set to the display element's position and the projection matrix is set to identity. Hence, the vertex position can be obtained by the optimized GLSL function ftransform().¹ Afterwards, the vertex position (x, y, z) is transformed into spherical coordinates by

θ = arctan(√(x² + y²) / z)  and  φ = arctan(y / x).    (6)

¹ The complete source code for the vertex shader can be found in the appendix (listing 1.4).
The field of view (fov) of the fisheye camera is accounted for, and the new vertex position (x′, y′, z′, w′)ᵀ is assigned by

(x′, y′, z′, w′)ᵀ = ( (θ / fov) · cos(φ),  (θ / fov) · sin(φ),  (z + zNear) / (zFar − zNear),  1 )ᵀ.    (7)
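A CPU-side sketch of the transform of eqs. (6) and (7) is given below. The paper implements this as a GLSL vertex shader (appendix, listing 1.4); this standalone version is for illustration only, and `atan2` is used as the quadrant-aware form of the arctangents.

```cpp
#include <cassert>
#include <cmath>

struct Vec4 { double x, y, z, w; };

// Fisheye vertex transform: spherical angles from eq. (6), new vertex
// position from eq. (7).  fov, zNear and zFar are the camera parameters
// named in the text.
Vec4 fisheyeTransform(double x, double y, double z,
                      double fov, double zNear, double zFar) {
    double theta = std::atan2(std::sqrt(x * x + y * y), z);  // eq. (6)
    double phi   = std::atan2(y, x);
    Vec4 v;                                                  // eq. (7)
    v.x = theta / fov * std::cos(phi);
    v.y = theta / fov * std::sin(phi);
    v.z = (z + zNear) / (zFar - zNear);
    v.w = 1.0;
    return v;
}
```

A vertex on the optical axis (x = y = 0) maps to the image center, as expected for a fisheye projection.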
5 Results
In the following, the proposed algorithm is evaluated and the results are shown. First, the results obtained by the best-next-view selection algorithm are discussed. This is followed by an evaluation of the interpolated views against ground truth, and finally the runtimes of the different acceleration techniques are compared.

5.1 Best Next View Selection
Figure 14 (left) shows an overview of the synthetic scene that was used for evaluation. For the best-next-view selection algorithm it is sufficient to use a reduced data set in which every fourth row and column of the full data set is used. Figure 14 (left) was created by a simulation of this reduced display where the viewpoint was centered about 60 centimeters in front of the display. In dependence of this viewpoint, the view-dependent color information was computed for each display element, which yielded the final image. The reduced data set consists of 14,400 lens images with a resolution of 512x512 pixels. Figure 15 (left) shows the contribution of each input image to the geometry of the scene, where only newly inserted 3D points are counted. In figure 15 (right) the overall completeness of the model is shown by integration over the contributions of the images. In figure 15 (left) it can be seen that the first viewpoint adds about nine percent of the geometry to the model. Around the
Fig. 14. Simulated display overview with the viewpoint centered about 60 centimeters in front of the display (left) and positions of the top-ranking 393 viewpoints (right). The darker a viewpoint is colored, the higher it is ranked.
100th input image the contribution of new 3D points to the model’s geometry is only marginal. However, in ﬁgure 15 (right) it can be seen that after the ﬁrst 100 images are added, the obtained geometry is incomplete and covers only 70 percent of the scene. This is due to the many marginal contributions of single images that add up to about 30 percent of the complete geometry. For the complete geometry the ﬁrst 4,762 input images are required, which is about 33 percent of the input image set. However, even with only 1,000 reference images, which is about 7 percent of the reduced image set, no visible artifacts can be seen and a high quality reconstruction can be achieved.
Fig. 15. Contribution of the single input images to the geometry of the model (left) and overall completeness of the model in dependence of the number of input images (right). The axes of abscissae are in logarithmic scale.
The final result of the best-next-view algorithm is pictured in figure 14 (right). The perspective is the same as in the left figure, so the selected viewpoints can be directly related to the overview. The importance of a viewpoint is represented by its brightness, with the most important viewpoint set to black. Viewpoints with a minor contribution to the scene's geometry have been colored white for the sake of clarity. The two viewpoints that are rated most important are in the upper left corner of the display, followed by the lower right corner of the display. Both viewpoints cover the background, which has a larger surface than the foreground. Therefore, the majority of the 3D points that are required to describe the geometry of the scene belong to the background, which makes those viewpoints most important for geometric completeness. The second-ranked viewpoint lies in the display corner opposite to the most important viewpoint, where most of the background regions that were occluded in the first viewpoint could be sampled. In general it can be seen that the foreground, the mask, is sampled more often than the background of the scene. This is because the weighting function is set to prefer foreground objects, as they are more relevant for view-dependent color information. Another reason is that near the display's zero plane sampling points must be closer together for an overlapping field of view, or holes in foreground objects may occur.
Fig. 16. Average diﬀerence (AD) and peak signal to noise ratio (PSNR) of the simulated overview (reduced data set) from the reconstructed images compared to the simulated overview from the ground truth images. The axis of abscissae is in logarithmic scale.
The average difference (AD) and the peak signal-to-noise ratio (PSNR) have been measured by comparing a simulated overview of the display from the reconstructed images against a simulated overview from the ground truth images on the reduced data set. First, the mean squared error (MSE) is calculated by
MSE = (1 / (X · Y)) · Σ_{x=1}^{X} Σ_{y=1}^{Y} ( ((iᴿ_{x,y} − i′ᴿ_{x,y}) + (iᴳ_{x,y} − i′ᴳ_{x,y}) + (iᴮ_{x,y} − i′ᴮ_{x,y})) / 3 )²    (8)
taking the sum over the squared differences of all X · Y pixels between the ground truth image i and the interpolated image i′ and averaging the differences of the color channels (RGB). Afterwards, the PSNR is calculated by

PSNR = 10 · log₁₀(255² / MSE).    (9)
The results are depicted in figure 16. The evaluation has been performed on the top-ranking 50, 100, 1,000, 1,500 and 2,000 images from the best-next-view algorithm. As expected, the average difference is drastically reduced when the number of input images is increased. In the reconstructed images the object boundaries are one to two pixels off due to aliasing, which results in a relatively low peak signal-to-noise ratio. Figures 17 and 18 show close-up views of the foreground area, comparing the simulated display overview from reconstructed images against a simulated overview from the ground truth images using 104 input images (fig. 17) and 144 input images (fig. 18). In each case the number of input images was the same, and all images of the reduced data set were interpolated with the same IBR
algorithm. The left result in each case was obtained by using regularly sampled viewpoints for the interpolation, and the right result was obtained by using the same number of top-ranking viewpoints from the best-next-view algorithm.
Fig. 17. Close-up of the foreground area of a simulated display overview (reduced data set) from reconstructed images using 104 input images. Comparison of regularly sampled viewpoints (l) against the same number of top-ranking viewpoints from the best-next-view algorithm (r).
In figures 17 and 18 it can be seen that with regular sampling the reconstructions with 104 and 144 input images lack parts of the geometry near the display's zero plane due to non-overlapping fields of view and occlusions, leading to severe artifacts in the reconstruction. When the top-ranking viewpoints were used for image reconstruction, the geometry of the foreground is much more complete. Although there are still holes in the geometry of the foreground object, the foreground geometry is almost complete when using 144 input images, resulting in fewer artifacts compared to the reconstruction from regularly sampled viewpoints. Good image interpolation results can be achieved by the proposed solution with less than 100 percent complete geometry and without manual user interaction. For high-quality rendering, additional reference views can be progressively added from the ranking obtained by the best-next-view selection algorithm. The result can be interactively evaluated on the geometric model, and further reference views can be added until a satisfying result is obtained. In the following, interpolation results are shown and the interpolation is evaluated against ground truth images.

5.2 Interpolation Results
The interpolation algorithm was evaluated on foreground and background parts of the scene. This evaluation was performed on the full data set with about 0.5 percent of the viewpoints selected as reference views. The reference images, the interpolated images and the difference images between the original and the interpolated images are shown in figures 19-21. The numbers in the lower right
Fig. 18. Close-up of the foreground area of a simulated display overview (reduced data set) from reconstructed images using 144 input images. Comparison of regularly sampled viewpoints (l) against the same number of top-ranking viewpoints from the best-next-view algorithm (r).
of the different views are used to identify the discrete viewpoints of the display in relation to the nearest reference views; e.g., between the reference viewpoints 00 and 89 the 88 viewpoints in between were not used as input images for the interpolation. The nearest neighboring reference images of the interpolation for the background are shown in figure 19, and the nearest neighboring reference images of the foreground are depicted in figure 20. For the foreground area the reference viewpoints are close together to avoid subsampling of the geometry. Another reason is that more light field samples are needed in the foreground area to capture the specular reflections of the golden ornamentation of the mask. Figure 21 (left and right) shows the results of the interpolation. The top row shows the ground truth input images as rendered by the modeling tool. In the center of the figures are the interpolated views, and at the bottom are the difference images between the ground truth images and the interpolated ones. The difference images show that most interpolation errors are located at object boundaries. This is because at object boundaries the depth values belong either to the foreground or to the background, and no anti-aliasing can be applied. In contrast, the color images are rendered with anti-aliasing, which leads to a color transfer from the foreground objects to the background and vice versa. Another reason lies in the quantization of the octree in 3D space, which can lead to color blending during angular view interpolation on textured surfaces.

5.3 Runtime
The proposed algorithm has been evaluated on an Intel Core i7-950 CPU with an Nvidia GeForce GTX 460 graphics card. The different implementations and parallelization approaches have been compared for the initialization and the rendering phase. The initialization phase consists of loading the geometric model and assembling the light field. During the rendering phase, all images of the display are interpolated. The results are listed in table 1.
Fig. 19. Reference images of the background area with sparse sampling. The depth images are scaled by 255.
Fig. 20. Reference images of the foreground area with dense sampling. The depth images are scaled by 8355.
Fig. 21. Ground truth (upper row), interpolated (central row) and diﬀerence image (lower row) of the background (left) and the foreground (right)
Table 1. Comparison of the runtime for the different implementations and the average single image rendering time. The example scene has about 380,000 LFNs, and 1,000 out of 14,400 input images were used for the interpolation.

| Method | Initialization [s] | Acceleration | Rendering [s] | Acceleration | Acc. parallel. |
| Ray tracing: CPU (8 Threads) | – | – | 84.006 | 1.00 | – |
| Interpolation: CPU (1 Thread) | 8,610.00 | 1.00 | 1.722 | 48.78 | 1.00 |
| Interpolation: CPU (8 Threads) | 2,041.00 | 4.22 | 0.403 | 208.45 | 4.27 |
| Interpolation: GPU | 750.00 | 11.48 | 0.210 | 400.03 | 8.20 |
Ray tracing of an image for the display takes on average 84 seconds when rendered with the modeling tool. When only the image rendering times are compared, single-threaded interpolation already achieves an acceleration factor of 48.78 compared to ray tracing of an image. The parallelization on the central processing unit using OpenMP accelerates the rendering by a further factor of 4.27. The implementation on the graphics hardware achieved an acceleration factor of 8.20 compared to rendering with a single thread. This relatively small acceleration factor is due to the fact that the graphics hardware takes no advantage of the hierarchical data structure of the octree.
Fig. 22. Rendering time in dependence of the number of LFNs on the graphics hardware
Figure 22 shows the mean rendering times for an image in dependence of the number of LFNs. These results have been obtained on an Intel Core i7-920 CPU with an Nvidia GeForce GTX 285 graphics card. The straight line in figure 22 was fitted to the data points by a least-squares method. It can be seen that the rendering time correlates linearly with the number of LFNs that represent the scene. This is because all vertices that represent the LFNs are processed by the graphics hardware, and visibility is determined by comparing the depth values. Hence, the graphics hardware does not take advantage of the hierarchical data structure, and an early abort for viewing rays is not possible. The acceleration figures for the reduced data set given in table 1 do not take into account the rendering time of the input images and the viewpoint selection. Table 2 shows the effective acceleration that can be achieved for a full-sized display with 230,400 display elements. The upper part of the table shows the rendering time when all images for the display are ray traced and computed on a single workstation. The center of the table shows the different stages that are required for the interpolation. First, the depth images are ray traced with the modeling tool. For the best-next-view selection algorithm it is sufficient to sample only every fourth row and column of the data set. As long as parts of the scene do not lie inside or extremely close to the display's zero plane, the reduced data set contains enough geometric detail for proper viewpoint selection. The viewpoints are ordered by the best-next-view selection algorithm, followed by ray tracing of the chosen viewpoints with the modeling tool. Finally, the interpolation is initialized and all images of the display are interpolated. For the interpolation, the implementation on the graphics hardware was chosen.
The lower part of Table 2 shows the sum of all steps involved, including the rendering time of the input color and depth images. The acceleration factor is mainly influenced by the ratio of the number of interpolated images to the number of input images required for interpolation. For a full-sized display, the proposed algorithm achieved an effective acceleration factor of almost 100.

Table 2. Effective acceleration of the rendering compared to ray tracing (RT) of all color images for the full data set

                         Rendering method   Number of images   Time [h]   Acceleration
Ray tracing              RT (8 Threads)     230,400 color      5,376.38   1.00
Initialization           RT (8 Threads)      14,400 depth          4.00   –
                         GPU (BNV†)          14,400 depth          5.30   –
                         RT (8 Threads)       1,000 color         23.34   –
View interpolation       GPU (rendering)    230,400 color         22.19   242.29
Effective acceleration   GPU (summation)    230,400 color         54.83   98.06

† BNV: Best Next View selection
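The acceleration factors in Table 2 follow directly from its Time [h] column: the interpolation-only factor divides the full ray-tracing time by the GPU rendering time, while the effective factor divides it by the sum of all pipeline stages. A quick arithmetic check:

```python
# Recompute the acceleration factors of Table 2 from its "Time [h]" column.
rt_full = 5376.38        # ray tracing all 230,400 color images (8 threads)

rt_depth   = 4.00        # initialization: ray trace 14,400 depth images
bnv_select = 5.30        # initialization: best-next-view selection on GPU
rt_input   = 23.34       # initialization: ray trace 1,000 color input images
interp     = 22.19       # GPU view interpolation of all 230,400 images

total = rt_depth + bnv_select + rt_input + interp   # 54.83 hours

print(round(rt_full / interp, 2))   # interpolation alone: 242.29
print(round(rt_full / total, 2))    # effective acceleration: 98.06
```

This confirms that the input-image stages, not the interpolation itself, dominate the effective acceleration, which is the point made in the text above.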
Efficient Rendering of Light Field Images

6 Conclusion
Full parallax displays require a very large number of images; ray tracing them on a single workstation would take several months. The rendering process could be accelerated by a render farm, but only at considerable cost. A complete image-rendering solution for full parallax displays has been presented that depends solely on depth images and a few ray-traced input images, making it independent of the modeling tool. The problem of finding the best viewpoints for the input images has been addressed, and a solution has been integrated into the rendering system that allows a considerable reduction of input images while ensuring completeness of the geometric model. The rendering benefits from efficient data structures that are well suited to processing very large amounts of sparse data. The presented algorithm has been accelerated by parallelization on the central processing unit and the graphics hardware. This reduced the rendering time for all images from several months to a few days on a single workstation, achieving an effective acceleration of two orders of magnitude. The evaluation of the geometric model and the final decision on how many input images are used for the interpolation are currently aided by user interaction. Future work will address the integration of a criterion that measures the impact of missing geometry on the view interpolation. This could further reduce the number of input images needed for interpolation, leading to an even higher effective acceleration of the rendering.

Acknowledgments. This project was funded by the Federal Ministry of Education and Research under support code 16 IN 0655. The sole responsibility for the content of this publication lies with the authors. Funded by the BMWi due to a directive of the German Parliament.
References

1. Yang, R., Huang, X., Li, S., Jaynes, C.: Toward the Light Field Display: Autostereoscopic Rendering via a Cluster of Projectors. IEEE Transactions on Visualization and Computer Graphics 14, 84–96 (2008)
2. Baker, H., Li, Z.: Camera and Projector Arrays for Immersive 3D Video. In: Proceedings of the 2nd International Conference on Immersive Telecommunications, IMMERSCOM 2009, pp. 1–23. Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, Brussels (2009)
3. Jung, D., Koch, R.: Efficient Depth-Compensated Interpolation for Full Parallax Displays. In: 5th International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT) (2010)
4. Jung, D., Koch, R.: A Best-Next-View-Selection Algorithm for Multi-View Rendering. In: 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT) (accepted 2011)
5. Adelson, E.H., Bergen, J.R.: The Plenoptic Function and the Elements of Early Vision. In: Computational Models of Visual Processing, pp. 3–20. MIT Press, Cambridge (1991)
6. McMillan, L.: Plenoptic Modeling: An Image-Based Rendering System. In: Proceedings of SIGGRAPH 1995, pp. 39–46. ACM, New York (1995)
7. Ashdown, I.: Near-Field Photometry: A New Approach. Journal of the Illuminating Engineering Society 22, 163–180 (1993)
8. Levoy, M., Hanrahan, P.: Light Field Rendering. In: Proceedings of SIGGRAPH 1996, pp. 31–42. ACM, New York (1996)
9. Gortler, S.J., Grzeszczuk, R., Szeliski, R., Cohen, M.F.: The Lumigraph. In: Proceedings of SIGGRAPH 1996, pp. 43–54. ACM, New York (1996)
10. Szeliski, R.: Rapid Octree Construction from Image Sequences. CVGIP: Image Understanding 58, 23–32 (1993)
11. Lischinski, D., Rappoport, A.: Image-Based Rendering for Non-Diffuse Synthetic Scenes. In: Rendering Techniques 1998: Proceedings of the Eurographics Workshop in Vienna, pp. 301–314 (1998)
12. Koch, R., Evers-Senne, J.-F.: View Synthesis and Rendering Methods. In: Schreer, O., Kauff, P., Sikora, T. (eds.) 3D Video Communication – Algorithms, Concepts and Real-time Systems in Human-Centered Communication, pp. 151–174. Wiley, Chichester (2005)
13. Halle, M., Kropp, A.: Fast Computer Graphics Rendering for Full Parallax Spatial Displays. In: Proc. SPIE, vol. 3011, pp. 105–112 (1997)
14. Dietrich, A., Schmittler, J., Slusallek, P.: World-Space Sample Caching for Efficient Ray Tracing of Highly Complex Scenes. Technical Report TR-2006-01, Computer Graphics Group, Saarland University (2006)
15. Dietrich, A., Slusallek, P.: Adaptive Spatial Sample Caching. In: Proceedings of the IEEE/EG Symposium on Interactive Ray Tracing 2007, pp. 141–147 (2007)
16. Teller, S.: Toward Urban Model Acquisition from Geo-Located Images. In: Proceedings of the 6th Pacific Conference on Computer Graphics and Applications, PG 1998, p. 45. IEEE Computer Society, Washington, DC (1998)
17. Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., Shade, J., Fulk, D.: The Digital Michelangelo Project: 3D Scanning of Large Statues. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2000, pp. 131–144. ACM Press/Addison-Wesley Publishing Co. (2000)
18. Tarabanis, K., Allen, P., Tsai, R.: A Survey of Sensor Planning in Computer Vision. IEEE Transactions on Robotics and Automation 11, 86–104 (1995)
19. Werner, T., Hlavac, V., Leonardis, A., Pajdla, T.: Selection of Reference Views for Image-Based Representation. In: International Conference on Pattern Recognition, vol. 1, pp. 73–77 (1996)
20. Massios, N., Fisher, R.: A Best Next View Selection Algorithm Incorporating a Quality Criterion. In: BMVC 1998 (1998)
21. Wong, L., Dumont, C., Abidi, M.: Next Best View System in a 3D Object Modeling Task. In: IEEE International Symposium on Computational Intelligence in Robotics and Automation, CIRA 1999, pp. 306–311 (1999)
22. Fleishman, S., Cohen-Or, D., Lischinski, D.: Automatic Camera Placement for Image-Based Modeling. In: Proceedings of the 7th Pacific Conference on Computer Graphics and Applications, PG 1999, pp. 12–20. IEEE Computer Society, Washington, DC (1999)
23. Byers, Z., Dixon, M., Goodier, K., Grimm, C., Smart, W.: An Autonomous Robot Photographer. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), vol. 3, pp. 2636–2641 (2003)
24. Scott, W.R., Roth, G., Rivest, J.-F.: View Planning for Automated Three-Dimensional Object Reconstruction and Inspection. ACM Computing Surveys 35, 64–96 (2003)
25. Finkel, R.A., Bentley, J.L.: Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica 4, 1–9 (1974), doi:10.1007/BF00288933
26. Wald, I., Slusallek, P., Benthin, C., Wagner, M.: Interactive Rendering with Coherent Ray Tracing. Computer Graphics Forum 20, 153–165 (2001)
Appendix

Listing 1.4. GLSL vertex shader for simulating a fisheye lens
uniform float zNear;
uniform float zFar;

void main()
{
    vec4  pos;
    float rxy;
    float p;
    float t;

    gl_FrontColor = gl_Color;
    pos = ftransform();
    rxy = sqrt(pos.x * pos.x + pos.y * pos.y);
    p   = atan(pos.y, pos.x);
    t   = atan(rxy / pos.z);
    t  /= 0.3490658504;                       // 20.0 deg * PI / 180
    pos.x = t * cos(p);
    pos.y = t * sin(p);
    pos.z = (pos.z + zNear) / (zFar - zNear);
    pos.w = 1.;
    gl_Position = pos;
}
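The angular fisheye mapping of Listing 1.4 can also be mirrored outside the shader, which is convenient for testing. The following Python sketch reimplements the same arithmetic; the function name and test point are illustrative and not part of the original code:

```python
import math

FOV_HALF_ANGLE = 0.3490658504   # 20.0 deg * PI / 180, as in Listing 1.4

def fisheye_project(x, y, z, z_near, z_far):
    """Mimic the vertex shader: map a camera-space point to normalized
    fisheye image coordinates plus the shader's depth value."""
    rxy = math.sqrt(x * x + y * y)
    p = math.atan2(y, x)          # azimuth around the optical axis
    t = math.atan(rxy / z)        # angle from the optical axis
    t /= FOV_HALF_ANGLE           # normalize: the FOV border maps to 1
    return (t * math.cos(p),      # new x
            t * math.sin(p),      # new y
            (z + z_near) / (z_far - z_near))  # depth, as in the shader

# A point on the optical axis projects to the image center:
print(fisheye_project(0.0, 0.0, 5.0, 0.1, 100.0)[:2])  # (0.0, 0.0)
```

Note that the depth expression is copied verbatim from the listing, including its unconventional (z + zNear)/(zFar - zNear) form.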
Author Index
Ballan, Luca 77
Brostow, Gabriel J. 77
Cremers, Daniel 104
Darabi, Soheil 152
Eisemann, Martin 1, 25
Favaro, Paolo 124
Heydt, Matthias 52
Jung, Daniel 184
Klose, Felix 1
Koch, Reinhard 184
Leal-Taixé, Laura 52
Magnor, Marcus 1, 25
Martinello, Manuel 124
Oswald, Martin R. 104
Pollefeys, Marc 77
Puwein, Jens 77
Rosenhahn, Axel 52
Rosenhahn, Bodo 52
Rother, Carsten 104
Sellent, Anita 25
Sen, Pradeep 152
Taneja, Aparna 77
Töppe, Eno 104
Xiao, Lei 152