Academic Press is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803
This book is printed on acid-free paper.
Copyright © 2009 by Elsevier Inc. Portions of the text have appeared in “Proceedings of the 8th International Conference on Information Processing in Sensor Networks – 2009”.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Academic Press is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, scanning, or otherwise, without prior written permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail:
[email protected]. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data: Application submitted.
ISBN 13: 978-0-12-374633-7
For information on all Academic Press publications visit our Website at www.books.elsevier.com
Typeset by: diacriTech, India
Printed in the United States
09 10 11 12  10 9 8 7 6 5 4 3 2 1
Foreword

Vision is a wonderful sensing mechanism to use on a very large scale. Most interesting events can be observed with cameras, and cameras are cheap. Current optical motion capture systems are an example of how useful a large collection of well-coordinated cameras can be. In typical versions of this application, performers wear special dark suits with white markers. These markers can be detected by cameras, and their positions can be reconstructed in 3D by matching and triangulation. We need a large number of cameras to get good reconstructions because body parts occlude one another and so each marker might be visible only from a few small sets of views. Furthermore, the performance space might be very big and resolution demands high. Optical motion capture systems now use highly specialized cameras and are an example of a large, coordinated camera network.

This book is a complete introduction to the new field of multi-camera networks, with chapters written by domain experts. The key feature of such networks is that they see a great deal of the world and can obtain a density of sampling in space and time that a single camera cannot. This sampling density is essential for some applications. In surveillance, for example, dense sampling of space and time might allow us to track a single individual—or, better, many distinct single individuals—throughout a complex of buildings. Camera networks might mix cameras of distinct types—fixed-lens, steerable pan-tilt-zoom, thermal, and omnidirectional cameras, and perhaps even acoustic sensors. Such a network could be very reliable for detecting events. The key is to detect events opportunistically: exploiting omnidirectional cameras when events are distant or to produce a warning of events, then passing to higher-resolution sensors as appropriate, or using thermal cameras and acoustic sensors when the weather is poor and visible-light sensors in good weather.

Valuable scientific and engineering data can come from effective use of multi-camera networks. For example, not much is known right now about how people behave in public and how their movements and actions are shaped by the objects around them. This is important in the design of buildings and public places, much of which is based on quite small sets of observations of people’s behavior in public. A very large set of observations of public behavior could allow significant improvement in building design. Such improvement might range from placing nursing supplies at well-chosen locations in hospitals to placing fountains and benches in public squares so that people enjoy them more. Online observations could be used to allow a building to manage lighting or heating or other expensive services supplied to its users.

Multiple views of a geometry can be used to obtain 3D reconstructions and, with appropriate assumptions, camera location and calibration information. Currently, such methods are used to produce reconstructions of quite complex geometries from a moving camera. For example, one might drive a camera through a city and build a geometric model from the resulting video. Camera networks offer the exciting possibility of simply dropping large numbers of cameras on the geometry instead. Once these cameras find one another and establish communications, their observations might be used to recover a geometric model as well as a model of the cameras’ current locations. While
dropping cameras on a city might annoy residents, the strategy is very attractive for places where one cannot safely drive a camera around—perhaps inhospitable terrain or a disaster area. The potential for building applications like these is the reason to read this book.

A network of multiple cameras will generally need to be calibrated. There are two forms of calibration. The first has to do with network structure. In some cases, multiple cameras are attached to a server and there may be several such servers. What the network can do is constrained by the server’s ability to handle incoming video signals. In other cases, each camera has a local processing capacity but the network topology is fixed and known. Here, early processing can occur at the camera, reducing the information bandwidth to servers. In yet other cases, cameras may need to discover one another to produce an informal network, and the choice of network topology may be significant.

The second form of calibration is geometric. Sometimes camera locations are random; for example, cameras might be scattered from the air. Other times it is possible to choose camera locations to obtain good views of the scene. In either case, one needs geometric calibration to determine exactly where in space cameras are with respect to each other. The solution needs to use available network bandwidth efficiently as well. The overlap between fields of view may need to be determined so that one can be sure when the system is seeing one event and when it is seeing two. Solutions to all of these problems draw on the concept of structure from motion. However, cameras may also need to be synchronized, which is a problem with a distributed systems flavor. Calibration issues like these are dealt with in Part I.

Another important group of technical issues involves ideas from artificial intelligence. Tracking an object through a system of distributed cameras requires being able to match objects that have disappeared from one view to those that have appeared in another view. Even inaccurate matches will give estimates of flow. Constraints on the world—for example, that people who enter a corridor must eventually leave it—can improve those estimates. One would like to fuse information across multiple sensors to produce accurate filtered inferences about the world while respecting constraints on communication bandwidth and network topology. Filtered inferences can be used to control and steer the camera system so that some events are observed in greater detail. Problems of distributed tracking, inference, and control are dealt with in Part II.

Another reason to pass inferences around a network is compression. Video tends to contain redundant information, and when the camera views overlap, which is the usual case for a distributed camera network, this redundancy can be quite pronounced. Part III deals with methods to compress the overall video signal produced by the distributed camera network. An important component of this problem is distributing the compression process itself to conserve network bandwidth.

Several sample applications are discussed in Part IV. The technology of building networks and communicating between cameras is complex, but much of this complexity can be hidden from application builders by clever choices of system architecture and middleware. Part V describes available architectures and middleware and their use in applications.
As the important practical issues of architecture, middleware, calibration, filtered inference, and compression are being worked out, multi-camera networks are becoming widely used in many applications. This book should be the guide for those who will build those applications. I look forward to the important improvements in the design and safety of buildings and workplaces that will result.

David A. Forsyth
University of Illinois at Urbana–Champaign
Preface

Technological advances in sensor design, communication, and computing are stimulating the development of new applications that will transform traditional vision systems into pervasive intelligent camera networks. Applications enabled by multi-camera networks include smart homes, office automation through occupancy sensing, security and surveillance, mobile and robotic networks, human–computer interfaces, interactive kiosks, and virtual reality systems. Image sensors extract valuable event and context information from the environment. By acquiring an information-rich data type, they enable vision-based interpretive applications. Such applications range from real-time event interpretation in smart environments to adaptation to user behavior models based on long-term observations in ambient intelligence.

Multi-camera networks represent a multi-disciplinary field that defines rich conceptual and algorithmic opportunities for computer vision, signal processing, and embedded computing, as well as for wired and wireless sensor networks. New algorithm and system design challenges have been identified across the different communities involved in multi-camera networks research. In signal processing, much recent work involves effective methods for multi-layered or hybrid data exchange among cameras for collaborative deduction regarding events of interest and exploitation of the spatial and temporal redundancies in the data. The field of sensor networks finds opportunities for novel research when hybrid types and amounts of data are produced by image sensors. Embedded computing methods that allow cameras to work collaboratively over a network to solve a vision problem have been under study. From a computer vision design standpoint, multi-view methods based on partial processing of video locally provide researchers with new opportunities in considering system-level constraints that may influence algorithm design. In this way, multi-camera networks create opportunities for design paradigm shifts in distributed and collaborative fusion of visual information, enabling the creation of novel methods and applications that are interpretive, context aware, and user-centric.
Distributed Processing in Multi-Camera Networks

With the cost of image sensors, embedded processing, and supporting network infrastructure decreasing, the potential for a dramatic increase in the scale of camera networks can be realized, enabling large-scale applications that can observe, detect, and track events of interest across large areas. The network scalability factor emphasizes a shift in data fusion and interpretation mechanisms from traditionally centralized (a powerful control unit gathering raw information from the sensors) to distributed (Figure P.1). In fact, the design of scalable, network-based applications based on high-bandwidth data, such as video, demands such a paradigm shift in processing methodologies to mitigate communication bottlenecks, alleviate interfacing challenges, and guard against system vulnerability. In the new trend of distributed sensing and processing, the streaming of raw video to a central processing unit is replaced with a pervasive computing model in which each camera (network node) employs local processing to translate the observed
data into features and attributes. This compact symbolic data is then transmitted for further processing, enabling collaborative deduction about events of interest.

FIGURE P.1 Classification of multi-camera networks based on the decision mechanism: centralized processing, distributed vision sensing and processing, and clustered processing.

Large-scale multi-camera networks demand efficient and scalable data fusion mechanisms that can effectively infer events and activities occurring in the observed scene. Data fusion can occur across multiple views, across time, and across feature levels and decision stages. Joint estimation and decision-making techniques need to be developed that take into account the processing capabilities of the nodes as well as available network bandwidth and application latency requirements. A variety of trade-offs can be considered for balancing in-node processing, communication cost and latency issues, levels and types of shared features, and the frequency and forms of data exchange among the network nodes. Spatio-temporal data fusion algorithms need to be developed to meet application requirements in light of the various trade-offs and system constraints. In addition, the results of collaborative processing in the network and of high-level interpretation can be used as feedback to the cameras to enable active vision techniques by instructing each camera as to which features may be important to extract in its view.
Multi-Camera Calibration and Topology

Multi-camera networks are used for interpreting the dynamics of objects moving in wide areas or for observing objects from different viewpoints to achieve 3D interpretation. For wide-area monitoring, cameras with non-overlapping or partially overlapping fields of view are used; cameras with overlapping fields of view are used to disambiguate occlusions. In both cases, determination of the relative location and orientation of the cameras is often essential for effective use of a camera network.
The calibration of a network of cameras with overlapping fields of view requires classical tools and concepts of multi-view geometry, including image formation, epipolar geometry, and projective transformations. In addition, feature detection and matching, as well as estimation algorithms, are important computer vision and signal processing aspects of multi-camera network calibration methods (see Chapter 1, Multi-View Geometry for Camera Networks). Visual content such as observations of a human subject can be used to calibrate large-scale camera networks. For example, automatic calibration algorithms based on human observation can simultaneously compute camera poses and synchronization from observed silhouette information in multiple views. Once the camera network is calibrated, various techniques can be used for 3D dynamic scene reconstruction from similar silhouette cues and other features obtained by each camera (see Chapter 2, Multi-View Calibration, Synchronization, and Dynamic Scene Reconstruction). Because the multiple cameras’ narrow fields of view may result in a lack of adequate observations in calibrating large-scale, non-overlapping camera networks, controlled camera pan-tilt-zoom movement added to a beaconing signal can be used for calibration (see Chapter 3, Actuation-Assisted Localization of Distributed Camera Sensor Networks).

Location information for cameras may be essential in certain applications; however, in many situations exact localization may not be necessary for interpreting observations. Instead, a general description of the surroundings, the network topology, and the target location with respect to some views may be sufficient, allowing alternative methods of learning and describing network geometry (see Chapter 4, Building an Algebraic Topological Model of Wireless Camera Networks).

An important aspect of multi-camera networks is defining procedures that will guide sensor deployment. For example, it is important to know the optimal camera placement for complete or maximal coverage of a predefined area at a certain resolution, depending on the network’s objective. Globally optimal solutions or heuristics for faster execution time can be employed to this end (see Chapter 5, Optimal Placement of Multiple Visual Sensors). The optimization problem can also be solved via binary integer programming (see Chapter 6, Optimal Visual Sensor Network Configuration). Such approaches provide a general visibility model for visual sensor networks that supports arbitrarily shaped 3D environments and incorporates realistic camera models, occupant traffic models, and self- and mutual occlusions.
Active and Heterogeneous Camera Networks

In hardware settings, multi-camera networks are undergoing a transition from pure static rectilinear cameras to hybrid solutions that include dynamic and adaptive networks, other sensor types, and different camera resolutions. Pan-tilt-zoom (PTZ) camera networks are considered for their ability to cover large areas and capture high-resolution information of regions of interest in a dynamic scene. This is achieved through sensor tasking, where a network monitors a wide area and provides the positional information to a PTZ camera. Real-time systems operating under real-world conditions may comprise multiple PTZ and static cameras, which are centrally controlled for optimal close-up views of objects such as human faces for surveillance and biometrics applications (see Chapter 7, Collaborative Control of Active Cameras in Large-Scale Surveillance).
The use of active camera networks leads to several problems, such as online estimation of time-variant geometric transformations to track views and steer PTZ cameras. Cooperative tracking methods have been developed for uncalibrated networks without prior 3D location information and under target uncertainties (see Chapter 8, Pan-Tilt-Zoom Camera Networks).

Integration of other sensor modalities can provide complementary and redundant information that, when merged with visual cues, may allow the system to achieve an enriched and more robust interpretation of an event. By the fusion of data from multiple sensors, the distinct characteristics of the individual sensing modalities can be exploited, resulting in enhanced overall output. Different architectures and algorithms can be used for fusion in camera network applications (see Chapter 9, Multi-Modal Data Fusion Techniques and Applications).

In addition to traditional cameras, the use of omnidirectional cameras with different types of mirrors or lenses has been studied. Signals from these cameras can be uniquely mapped to spherical images and then used for calibration, scene analysis, or distributed processing. When a multi-view geometry framework is combined with omnivision imaging, reformulation of the epipolar geometry constraint, depth and disparity estimation, and geometric representations are necessary (see Chapter 10, Spherical Imaging in Omnidirectional Camera Networks).
Multi-View Coding

Transmission of video data from multiple cameras requires very large bandwidths. If camera nodes could communicate with each other, it would be possible to develop algorithms to compress the data and hence reduce bandwidth requirements. The problem of finding efficient communication techniques to distribute multi-view video content across different devices and users in a network is addressed, for example, by distributed video coding (DVC) (see Chapter 11, Video Compression for Camera Networks: A Distributed Approach). When collaboration between nodes is not possible, compression must be performed at each node independently. Distributed compression of correlated information is known as distributed source coding, and in principle the problem of distributed compression of multi-view video can be addressed in this framework (see Chapter 12, Distributed Compression in Multi-Camera Systems).
Multi-Camera Human Detection, Tracking, Pose, and Behavior Analysis

Detection of humans and their actions is a canonical task in computer vision and is an area of study in multi-camera networks. In recent years, machine learning methods have yielded high recognition performance while maintaining low false detection rates. However, these approaches generally require a large amount of labeled training data, which is often obtained by time-consuming hand-labeling. A solution to this problem is online learning for feature selection to train scene-dependent and adaptive object detectors. In order to reduce human effort and to increase robustness, each camera can hold a separate classifier that is specified for its scene or viewpoint. The training and improvement of these classifiers can incorporate knowledge from other cameras through co-training (see Chapter 13, Visual Online Learning in Multi-Camera Networks). A related application for overlapping cameras is estimation of human poses without the use of
markers. After the 3D visual hull of the object (human body) is reconstructed using voxel carving by multiple cameras, techniques such as 3D Haar-like wavelet features can be used to classify the pose (see Chapter 14, Real-Time 3D Body Pose Estimation).

In multi-person tracking scenarios, robust handling of partial views and occlusions has been investigated. Persons observed by a multi-camera system may appear in different views simultaneously or at different times depending on the overlap between views. The Bayesian framework has been used to formulate decision making in these applications. Tracking can rely on a joint multi-object state space formulation, with individual object states defined in the 3D world, and it can use the multi-camera observation model as well as motion dynamics of the humans observed (see Chapter 15, Multi-Person Bayesian Tracking with Multiple Cameras). Identity management of moving persons across multiple cameras can then enable event analysis and pattern recognition. For example, abnormal events can be detected by analysis of the trajectories followed by subjects over long observation periods (see Chapter 16, Statistical Pattern Recognition for Multi-Camera Detection, Tracking, and Trajectory Analysis).

Because appearances change when objects move from one camera’s view to another’s, tracking must rely on a measure of appearance similarity between views. It can be shown that the brightness transfer function between two cameras lies in a low-dimensional subspace. In addition, geometric constraints between the motions observed by different cameras can be exploited to define several hypotheses for testing. Such hypotheses can be generated even in the absence of calibration data (see Chapter 17, Object Association across Multiple Cameras).

In large-scale commercial deployments, a combination of overlapping and nonoverlapping views often exists, requiring flexible techniques for feature correspondence, tracking, and data fusion to calibrate the camera network and carry out event detection operations. Once calibration is achieved, a map of the area covered by the camera network can be created. State-of-the-art systems allow the user to define rules for event detection on this map, which are translated to the cameras’ fields of view (see Chapter 18, Video Surveillance Using a Multi-Camera Tracking and Fusion System).

In more general settings, where a multitude of events is to be monitored, an exchange of data between the cameras is necessary, which occurs via primitives that each camera, according to its view, extracts from the event. In some real-life surveillance applications, such as point-of-sale monitoring and tailgating detection, nonvideo signals are employed in addition to cameras. For flexible operation dealing with multiple types of events, the different primitives can be fused in a high-level decision module that considers spatio-temporal constraints and logic to interpret the reported data as a composite event. In one example, binary trees can be defined in the reasoning module, where leaf nodes represent primitive events, middle nodes represent rule definitions, and the root node represents the target composite event (see Chapter 19, Composite Event Detection in Multi-Camera and Multi-Sensor Surveillance Networks).
Smart Camera Networks: Architecture, Middleware, and Applications

Years of development in imaging, processing, and networking have provided an opportunity to develop efficient embedded vision-based techniques. This in turn has enabled the creation of scalable solutions based on distributed processing of visual information in
cameras, thus defining the field of smart cameras (Figure P.2). Distributed smart cameras can collaborate in a network setting and transmit partially processed results for data fusion and event interpretation. Such a network operates in real time and employs the confluence of simultaneous advances in four key disciplines: computer vision, image sensors, embedded computing, and sensor networks (see Chapter 20, Toward Pervasive Smart Camera Networks).

FIGURE P.2 Convergence of disciplines for smart camera networks: image sensors, sensor networks, signal processing for embedded computing, and computer vision.

Smart camera platforms and networks can be classified as single smart cameras, distributed smart camera systems, and wireless smart camera networks. Several smart camera architectures that operate as nodes in a wireless sensor network have been proposed and tested in real scenarios. The design issues to consider are how to balance the processing load of the cameras with that of the central processor, transmission bandwidth management, and latency effects (see Chapter 21, Smart Cameras for Wireless Camera Networks: Architecture Overview).

An important aspect of the integration of multiple and potentially heterogeneous smart cameras into a distributed system for computer vision and sensor fusion is the design of the system-level software, also called middleware. A middleware system allows construction of flexible and self-organizing applications and facilitates modular design. In general applications, the major services a middleware system should provide are control and distribution of data. In smart camera networks, the middleware must additionally provide services for distributing signal processing applications (see Chapter 22, Embedded Middleware for Smart Camera Networks and Sensor Fusion).

Tracking in smart camera networks can be performed with a cluster-based Kalman filter. Cluster formation is triggered by the detection of specific features denoting an object of interest. It is possible that more than one cluster may track a single target because cameras that see the same object may be outside of each other’s communication range. A clustering protocol is therefore needed to allow the current state and uncertainty of the target position to be handed off between cluster heads as the object is tracked (see Chapter 23, Cluster-Based Object Tracking by Wireless Camera Networks).
Conclusions

Building on the premise of distributed vision-based sensing and processing, multi-camera networks enable many novel applications in smart environments, such as behavior analysis, occupancy-based services, patient and elderly care, smart classrooms, ambient intelligence, and gaming. User interactivity and context awareness, with the possibility of fusion with other sensing modalities, offer additional richness in multi-camera networks and create opportunities for developing user-centric applications. Moreover, by introducing new design paradigms, multi-camera networks create multi-disciplinary research opportunities within computer vision, networking, signal processing, and embedded computing.

This book aims to treat the fundamental aspects of system, algorithm, and application development entailed in the emerging area of multi-camera networks. Its chapters set forth state-of-the-art geometric characteristics, such as calibration and topology and network-based detection and tracking. Also presented are introductions to a number of interesting applications based on collaborative methods in camera networks. In particular, the book covers architectures for smart cameras, embedded processing, and middleware approaches. A number of conceptual as well as implementation issues involved in the design and operation of multi-camera networks are covered here through examination of various realistic application scenarios, descriptions of the state of the art in available hardware and systems, and discussion of many of the practical components necessary in application development. We hope that the design methodologies and application details discussed throughout this book will provide guidelines for researchers and developers working on vision-based applications to better envision the opportunities and understand the challenges arising from the development of applications based on multi-camera networks.
Acknowledgments

We would like to acknowledge the contribution of a number of people who helped make this book a reality. We are grateful to Tim Pitts of Elsevier for his encouragement and support of this project. Marilyn Rash offered us her precious editorial and project management support. Many thanks to Melanie Benson of Academic Press for helping with organizing the book’s content. We are grateful to the leading multi-camera network researchers who agreed to contribute chapters for this book. Their active and timely cooperation is very much appreciated. Finally, a special thank-you goes to David Forsyth for contributing the Foreword of this book.

Andrea Cavallaro
Hamid Aghajan
CHAPTER 1

Multi-View Geometry for Camera Networks
Richard J. Radke
Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, New York
Abstract
Designing computer vision algorithms for camera networks requires an understanding of how images captured from different viewpoints of the same scene are related. This chapter introduces the basics of multi-view geometry in computer vision, including image formation and camera matrices, epipolar geometry and the fundamental matrix, projective transformations, and N-camera geometry. We also discuss feature detection and matching, and describe basic estimation algorithms for the most common problems that arise in multi-view geometry.

Keywords: image formation, epipolar geometry, projective transformations, structure from motion, feature detection and matching, camera networks
1.1 INTRODUCTION

Multi-camera networks are emerging as valuable tools for safety and security applications in environments as diverse as nursing homes, subway stations, highways, natural disaster sites, and battlefields. While early multi-camera networks were confined to lab environments and were fundamentally under the control of a single processor (e.g., [1]), modern multi-camera networks are composed of many spatially distributed cameras that may have their own processors or even power sources. To design computer vision algorithms that make the best use of the cameras’ data, it is critical to thoroughly understand the imaging process of a single camera and the geometric relationships involved among pairs or collections of cameras.

Our overall goal in this chapter is to introduce the basic terminology of multi-view geometry, as well as to describe best practices for several of the most common and important estimation problems. We begin in Section 1.2 by discussing the perspective projection model of image formation and the representation of image points, scene points, and camera matrices. In Section 1.3, we introduce the important concept of epipolar geometry, which relates a pair of perspective cameras, and its representation by the fundamental matrix. Section 1.4 describes projective transformations, which typically arise in camera networks that observe a common ground plane. Section 1.5 briefly discusses algorithms for detecting and matching feature points between images, a prerequisite
for many of the estimation algorithms we consider. Section 1.6 discusses the general geometry of N cameras, and its estimation using factorization and structure-from-motion techniques. Section 1.7 concludes the chapter with pointers to further print and online resources that go into more detail on the problems introduced here.
1.2 IMAGE FORMATION

In this section, we describe the basic perspective image formation model, which for the most part accurately reflects the phenomena observed in images taken by real cameras. Throughout the chapter, we denote scene points by X = (X, Y, Z), image points by u = (x, y), and camera matrices by P.
1.2.1 Perspective Projection
An idealized “pinhole” camera C is described by:
■ A center of projection C ∈ R3
■ A focal length f ∈ R+
■ An orientation matrix R ∈ SO(3)
The camera’s center and orientation are described with respect to a world coordinate system on R3. A point X expressed in the world coordinate system as X = (X_o, Y_o, Z_o) can be expressed in the camera coordinate system of C as

$$X = \begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix} = R\left(\begin{bmatrix} X_o \\ Y_o \\ Z_o \end{bmatrix} - C\right) \tag{1.1}$$
The purpose of a camera is to capture a two-dimensional image of a three-dimensional scene S—that is, a collection of points in R3. This image is produced by perspective projection as follows. Each camera C has an associated image plane P located in the camera coordinate system at Z_C = f. As illustrated in Figure 1.1, the image plane inherits a natural orientation and two-dimensional coordinate system from the camera coordinate system’s XY-plane. It is important to note that the three-dimensional coordinate systems in this derivation are left-handed. This is a notational convenience, implying that the image plane lies between the center of projection and the scene, and that scene points have positive Z_C coordinates. A scene point X = (X_C, Y_C, Z_C) is projected onto the image plane P at the point u = (x, y) by the perspective projection equations

$$x = f\,\frac{X_C}{Z_C}, \qquad y = f\,\frac{Y_C}{Z_C} \tag{1.2}$$
The image I that is produced is a map from P into some color space. The color of a point is typically a real-valued (gray scale) intensity or a triplet of RGB or YUV values. While the entire ray of scene points {λ(x, y, f) | λ > 0} is projected to the image coordinate (x, y) by (1.2), the point on this ray that gives (x, y) its color in the image I is the one closest to the image plane (i.e., that point with minimal λ). This point is said to be visible; any scene point further along on the same ray is said to be occluded.
FIGURE 1.1 Pinhole camera, which uses perspective projection to represent a scene point X ∈ R3 as an image point u ∈ R2 .
For real cameras, the relationship between the color of image points and the color of scene points is more complicated. To simplify matters, we often assume that scene points have the same color regardless of the viewing angle (this is called the Lambertian assumption) and that the color of an image point is the same as the color of a single corresponding scene point. In practice, the colors of corresponding image and scene points are different because of a host of factors in a real imaging system. These include the point spread function, color space, and dynamic range of the camera, as well as non-Lambertian or semi-transparent objects in the scene. For more detail on the issues involved in image formation, see [2, 3].
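To make the projection model concrete, the following is a minimal sketch (not from the chapter) of equations (1.1) and (1.2) in Python/NumPy; the camera center, orientation, focal length, and test points are illustrative values only.

```python
# Hedged sketch of the pinhole projection of equations (1.1)-(1.2).
import numpy as np

def project_points(X_world, C, R, f):
    """Project Nx3 world points into image coordinates for a pinhole camera.

    C : (3,) camera center, R : (3,3) orientation matrix, f : focal length.
    """
    X_cam = (R @ (X_world - C).T).T          # equation (1.1): world -> camera frame
    x = f * X_cam[:, 0] / X_cam[:, 2]        # equation (1.2)
    y = f * X_cam[:, 1] / X_cam[:, 2]
    return np.stack([x, y], axis=1)

# Example: a camera at the origin looking down +Z with f = 1 (assumed values)
pts = np.array([[0.5, 0.2, 2.0], [1.0, -0.3, 4.0]])
print(project_points(pts, C=np.zeros(3), R=np.eye(3), f=1.0))
```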
1.2.2 Camera Matrices
Frequently, an image coordinate (x, y) is represented by the homogeneous coordinate λ(x, y, 1), where λ ≠ 0. The image coordinate of a homogeneous coordinate (x, y, z) can be recovered as (x/z, y/z) when z ≠ 0. Similarly, any scene point (X, Y, Z) can be represented in homogeneous coordinates as λ(X, Y, Z, 1), where λ ≠ 0. We use the symbol ∼ to denote the equivalence between a homogeneous and a nonhomogeneous coordinate. A camera C with parameters (C, f, R) can be represented by a 3×4 matrix P_C that multiplies a scene point expressed as a homogeneous coordinate in R4 to produce an image point expressed as a homogeneous coordinate in R3. When the scene point is expressed in the world coordinate system, the matrix P is given by
$$P_C = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}\,[R \mid -RC] \tag{1.3}$$
Here, the symbol | denotes the horizontal concatenation of two matrices. Then

$$P_C\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}[R \mid -RC]\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} R\left(\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} - C\right) = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix} = \begin{bmatrix} fX_C \\ fY_C \\ Z_C \end{bmatrix} \sim \begin{bmatrix} f\frac{X_C}{Z_C} \\ f\frac{Y_C}{Z_C} \\ 1 \end{bmatrix}$$

We often state this relationship succinctly as

$$u \sim PX \tag{1.4}$$
Intrinsic and Extrinsic Parameters
We note that the camera matrix can be factored as

$$P = KR\,[I \mid -C] \tag{1.5}$$
The matrix K contains the intrinsic parameters of the camera, while the variables R and C comprise the extrinsic parameters, specifying its position and orientation in the world coordinate system. While the intrinsic parameter matrix K in (1.3) was just a diagonal matrix containing the focal length, a more general camera can be constructed using a K matrix of the form

$$K = \begin{bmatrix} m_x & & \\ & m_y & \\ & & 1 \end{bmatrix}\begin{bmatrix} f & s/m_x & p_x \\ & f & p_y \\ & & 1 \end{bmatrix} \tag{1.6}$$
In addition to the focal length of the camera f , this intrinsic parameter matrix includes mx and my , the number of pixels per x and y unit of image coordinates, respectively; (px , py ), the coordinates of the principal point of the image; and s, the skew (deviation from rectangularity) of the pixels. For high-quality cameras, the pixels are usually square, the skew is typically 0, and the principal point can often be approximated by the image origin. A general camera matrix has 11 degrees of freedom (since it is only defined up to a scale factor). The camera center and the rotation matrix each account for 3 degrees of freedom, leaving 5 degrees of freedom for the intrinsic parameters.
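As a hedged illustration of the factorization P = KR[I | −C] in (1.5), the sketch below assembles a camera matrix from assumed intrinsic and extrinsic parameters and projects a homogeneous scene point as in (1.4); all numeric values are hypothetical.

```python
# Sketch of building P = K R [I | -C] and applying u ~ P X.
import numpy as np

def camera_matrix(K, R, C):
    """Build the 3x4 matrix P = K R [I | -C] = K [R | -RC]."""
    return K @ np.hstack([R, -R @ C.reshape(3, 1)])

# Illustrative parameters (not from the chapter).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
C = np.array([0.0, 0.0, -5.0])
P = camera_matrix(K, R, C)

X = np.array([0.2, 0.1, 3.0, 1.0])   # homogeneous scene point
u = P @ X
print(u[:2] / u[2])                  # inhomogeneous image coordinates
```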
Extracting Camera Parameters from P
Often, we begin with an estimate of a 3×4 camera matrix P and want to extract the intrinsic and extrinsic parameters from it. The homogeneous coordinates of the camera center C can easily be extracted as the right-hand null vector of P (i.e., a vector satisfying PC = 0). If we denote the left 3×3 block of P as M, then we can factor M = KR, where K is upper triangular and R is orthogonal, using a modified version of the QR decomposition [4]. Enforcing that K has positive values on its diagonal should remove any ambiguity about the factorization.
More General Cameras
While the perspective model of projection generally matches the image formation process of a real camera, an affine model of projection is sometimes more computationally appealing. Although less mathematically accurate, such an approximation may be acceptable in cases where the depth of scene points is fairly uniform or the field of view is fairly narrow. Common choices for linear models include orthographic projection, in which the camera matrix has the form

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \tag{1.7}$$
or weak perspective projection, in which the camera matrix has the form

$$P = \begin{bmatrix} \alpha_x & 0 & 0 & 0 \\ 0 & \alpha_y & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \tag{1.8}$$
Finally, we note that real cameras frequently exhibit radial lens distortion, which can be modeled by

$$\begin{bmatrix} \hat{x} \\ \hat{y} \end{bmatrix} = L\!\left(\sqrt{x^2 + y^2}\right)\begin{bmatrix} x \\ y \end{bmatrix} \tag{1.9}$$
where (x, y) is the result of applying the perspective model (1.4), and L(·) is a function of the radial distance from the image center, often modeled as a fourth-degree polynomial. The parameters of the distortion function are typically measured offline using a calibration grid [5].
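As an illustration, the sketch below applies a radial distortion model of the form L(r) = 1 + k1 r² + k2 r⁴, which is one common choice of fourth-degree polynomial; the coefficient values are hypothetical, not calibrated values from the chapter.

```python
# Hedged sketch of the radial distortion model (1.9).
import numpy as np

def apply_radial_distortion(x, y, k1=-0.25, k2=0.05):
    """Map ideal perspective coordinates (x, y) to distorted (x_hat, y_hat)."""
    r2 = x**2 + y**2
    L = 1.0 + k1 * r2 + k2 * r2**2     # L(r) as an even polynomial in the radius r
    return L * x, L * y

print(apply_radial_distortion(0.3, -0.1))
```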
1.2.3 Estimating the Camera Matrix
The P matrix for a given camera is typically estimated based on a set of matched correspondences between image points u_j ∈ R2 and scene points X_j with known coordinates in R3. That is, the goal is to select the matrix P ∈ R3×4 that best matches a given set of point mappings:

$$\{X_j \rightarrow u_j,\ j = 1, \ldots, N\} \tag{1.10}$$

Each correspondence (u_j, X_j) produces two independent linear equations in the elements of P. Thus, the equations for all the data points can be collected into a linear system Ap = 0, where A is a 2N×12 matrix involving the data, and p = (p11, . . . , p34)^T is the vector of unknowns. Since the camera matrix is unique up to scale, we must fix some scaling (say, ||p||_2 = 1) to ensure that Ap cannot become arbitrarily small. The least squares minimization problem is then

$$\min \|Ap\|^2 \quad \text{s.t.} \quad p^Tp = 1 \tag{1.11}$$
The solution to this problem is well known: The minimizer is the eigenvector p ∈ R12 of A^TA corresponding to the minimal eigenvalue, which can be computed via the singular value decomposition. This eigenvector is then reassembled into a 3×4 matrix P̂, which can be factored into intrinsic and extrinsic parameters as described earlier. To maintain numerical stability, it is critical to normalize the data before solving the estimation problem. That is, each of the sets {u_j} and {X_j} should be translated to have zero mean and then isotropically scaled so that the average distance to the origin is √2 for the image points and √3 for the scene points. These translations and scalings can be represented by similarity transform matrices T and U that act on the homogeneous coordinates of u_j and X_j. After the camera matrix for the normalized points P̂ has been estimated, the estimate of the camera matrix in the original coordinates is given by

$$P = T^{-1}\hat{P}U \tag{1.12}$$
Several more advanced algorithms for camera matrix estimation (also called resectioning) are discussed in Hartley and Zisserman [6], typically requiring iterative, nonlinear optimization. When the data sets contain outliers (i.e., incorrect point correspondences inconsistent with the underlying projection model), it is necessary to detect and reject them during the estimation process. Frequently, the robust estimation framework called RANSAC [7] is employed for this purpose. RANSAC is based on randomly selecting a large number of data subsets, each containing the minimal number of correspondences that make the estimation problem solvable. An estimate of outlier probability is used to select a number of subsets to guarantee that at least one of the minimal subsets has a high probability of containing all inliers. The point correspondences required for camera matrix estimation are typically generated using images of a calibration grid resembling a high-contrast checkerboard. Bouguet authored a widely disseminated camera calibration toolbox in MATLAB that only requires the user to print and acquire several images of such a calibration grid [8].
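The normalized DLT/resectioning procedure described above can be sketched as follows; this is an illustrative implementation under the stated assumptions (at least six correspondences, no outliers) and omits the RANSAC step.

```python
# Hedged sketch of normalized linear camera-matrix estimation (resectioning).
import numpy as np

def normalize_points(pts):
    """Similarity transform: zero mean and mean distance sqrt(dim) to the origin."""
    mean = pts.mean(axis=0)
    s = np.sqrt(pts.shape[1]) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.eye(pts.shape[1] + 1)
    T[:-1, :-1] *= s
    T[:-1, -1] = -s * mean
    return T

def estimate_camera_matrix(X, u):
    """Estimate P from Nx3 scene points X and Nx2 image points u (N >= 6)."""
    U, T = normalize_points(X), normalize_points(u)
    Xh = np.hstack([X, np.ones((len(X), 1))]) @ U.T   # normalized homogeneous points
    uh = np.hstack([u, np.ones((len(u), 1))]) @ T.T
    A = []
    for Xi, ui in zip(Xh, uh):
        x, y, w = ui
        A.append(np.hstack([w * Xi, np.zeros(4), -x * Xi]))
        A.append(np.hstack([np.zeros(4), w * Xi, -y * Xi]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P_hat = Vt[-1].reshape(3, 4)                      # smallest singular vector
    return np.linalg.inv(T) @ P_hat @ U               # undo normalization, as in (1.12)
```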
1.3 TWO-CAMERA GEOMETRY

Next, we discuss the image relationships resulting from the same static scene being imaged by two cameras C and C′, as illustrated in Figure 1.2. These could be two physically separate cameras or a single moving camera at different points in time. Let the scene coordinates of a point X in the C coordinate system be (X, Y, Z) and, in the C′ coordinate system, be (X′, Y′, Z′). We denote the corresponding image coordinates of X in P and P′ by u = (x, y) and u′ = (x′, y′), respectively. The points u and u′ are said to be corresponding points, and the pair (u, u′) is called a point correspondence. Assuming standard values for the intrinsic parameters, the scene point X is projected onto the image points u and u′ via the perspective projection equations (1.2):

$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z} \tag{1.13}$$

$$x' = f'\,\frac{X'}{Z'}, \qquad y' = f'\,\frac{Y'}{Z'} \tag{1.14}$$
FIGURE 1.2 Rigid camera motion introduces a change of coordinates, resulting in different image coordinates for the same scene point X.
Here f and f′ are the focal lengths of C and C′, respectively. We assume that the two cameras are related by a rigid motion, which means that the C′ coordinate system can be expressed as a rotation R̂ of the C coordinate system followed by a translation [t_X t_Y t_Z]^T. That is,
$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = \hat{R}\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \begin{bmatrix} t_X \\ t_Y \\ t_Z \end{bmatrix} \tag{1.15}$$
In terms of the parameters of the cameras, R̂ = R′R^{-1} and t = R′(C − C′). Alternately, we can write R̂ as
$$\hat{R} = \begin{pmatrix} \cos\alpha\cos\gamma + \sin\alpha\sin\beta\sin\gamma & \cos\beta\sin\gamma & -\sin\alpha\cos\gamma + \cos\alpha\sin\beta\sin\gamma \\ -\cos\alpha\sin\gamma + \sin\alpha\sin\beta\cos\gamma & \cos\beta\cos\gamma & \sin\alpha\sin\gamma + \cos\alpha\sin\beta\cos\gamma \\ \sin\alpha\cos\beta & -\sin\beta & \cos\alpha\cos\beta \end{pmatrix} \tag{1.16}$$

where α, β, and γ are rotation angles around the X-, Y-, and Z-axes, respectively, of the C coordinate system. By substituting equation (1.15) into the perspective projection equations (1.2), we obtain a relationship between the two sets of image coordinates:
$$x' = f'\,\frac{X'}{Z'} = \frac{r_{11}\frac{f'}{f}x + r_{12}\frac{f'}{f}y + r_{13}f' + \frac{t_Xf'}{Z}}{\frac{r_{31}}{f}x + \frac{r_{32}}{f}y + r_{33} + \frac{t_Z}{Z}} \tag{1.17}$$

$$y' = f'\,\frac{Y'}{Z'} = \frac{r_{21}\frac{f'}{f}x + r_{22}\frac{f'}{f}y + r_{23}f' + \frac{t_Yf'}{Z}}{\frac{r_{31}}{f}x + \frac{r_{32}}{f}y + r_{33} + \frac{t_Z}{Z}} \tag{1.18}$$
Here rij are the elements of the rotation matrix given in (1.16). In Section 1.4 we will consider some special cases of (1.17)–(1.18).
1.3.1 Epipolar Geometry and Its Estimation
We now introduce the fundamental matrix, which encapsulates an important constraint on point correspondences between two images of the same scene. For each generic pair of cameras (C, C′), there exists a matrix F of rank 2 such that for all correspondences (u, u′) = ((x, y), (x′, y′)) ∈ P × P′,

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}^T F \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0 \tag{1.19}$$
The fundamental matrix is unique up to scale provided that there exists no quadric surface Q containing the line CC′ and every point in the scene S [9]. Given the fundamental matrix F for a camera pair (C, C′), we obtain a constraint on the possible locations of point correspondences between the associated image pair (I, I′). The epipolar line corresponding to a point u ∈ P is the set of points:

$$\ell_u = \left\{\, u' = (x', y') \in P' \;\middle|\; \begin{bmatrix} u' \\ 1 \end{bmatrix}^T F \begin{bmatrix} u \\ 1 \end{bmatrix} = 0 \,\right\} \tag{1.20}$$
If u is the image of scene point X in P, the image u′ of X in P′ is constrained to lie on the epipolar line ℓ_u. Epipolar lines for points in P′ can be defined accordingly. Hence, epipolar lines exist in conjugate pairs (ℓ, ℓ′), such that the match to a point u ∈ ℓ must lie on ℓ′ and vice versa. Conjugate epipolar lines are generated by intersecting any plane Π containing the baseline CC′ with the pair of image planes (P, P′) (see Figure 1.3). For a more thorough review of epipolar geometry, the reader is referred to [10]. The epipoles e ∈ P and e′ ∈ P′ are the projections of the camera centers C′ and C onto P and P′, respectively. It can be seen from Figure 1.3 that the epipolar lines in each image all intersect at the epipole. In fact, the homogeneous coordinates of the epipoles e and e′ are the right and left eigenvectors of F, respectively, corresponding to the eigenvalue 0.
FIGURE 1.3 Relationship between two images of the same scene, encapsulated by the epipolar geometry. For any scene point X, we construct the plane Π containing X and the two camera centers, which intersects the image planes at the two epipolar lines ℓ and ℓ′.
FIGURE 1.4 Image pair with sample epipolar lines. Corresponding points must occur along conjugate epipolar lines; see, for example, the goalie’s head and the front corner of the goal area (top).
Since corresponding points must appear on conjugate epipolar lines, they form an important constraint that can be exploited while searching for feature matches, as illustrated in Figure 1.4.
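As a small illustration of how the fundamental matrix is used in such a search, the sketch below computes the epipolar line in the second image for a point in the first and evaluates the epipolar constraint (1.19) for a candidate match; F is assumed to be given, and the function names are ours.

```python
# Hedged sketch of evaluating the epipolar constraint u'^T F u = 0.
import numpy as np

def epipolar_line(F, u):
    """Line coefficients (a, b, c) with a*x' + b*y' + c = 0 in the second image."""
    l = F @ np.array([u[0], u[1], 1.0])
    return l / np.linalg.norm(l[:2])          # scale so (a, b) has unit length

def epipolar_residual(F, u, u_prime):
    """Algebraic residual u'^T F u; zero for an exact correspondence."""
    u1 = np.array([u[0], u[1], 1.0])
    u2 = np.array([u_prime[0], u_prime[1], 1.0])
    return float(u2 @ F @ u1)
```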
1.3.2 Relating the Fundamental Matrix to the Camera Matrices
The fundamental matrix can be easily constructed from the two camera matrices. If we assume that

$$P = K[I \mid 0], \qquad P' = K'[R \mid t] \tag{1.21}$$
then the fundamental matrix is given by

$$F = K'^{-T}[t]_\times RK^{-1} = K'^{-T}R[R^Tt]_\times K^{-1} \tag{1.22}$$

where [t]_× is the skew-symmetric matrix defined by

$$[t]_\times = \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} \tag{1.23}$$
In the form of (1.22), we can see that F is rank 2 since [t]_× is rank 2. Since F is only unique up to scale, the fundamental matrix has seven degrees of freedom. The left and right epipoles can also be expressed in terms of the camera matrices as follows:

$$e' = K't, \qquad e = KR^Tt \tag{1.24}$$
From the preceding, the fundamental matrix for a pair of cameras is clearly unique up to scale. However, there are four degrees of freedom in extracting P and P′ from a given F. This ambiguity arises because the camera matrix pairs (P, P′) and (PH, P′H) have the same fundamental matrix for any 4×4 nonsingular matrix H (e.g., the same rigid motion applied to both cameras). The family of (P, P′) corresponding to a given F is described by

$$P = [I \mid 0], \qquad P' = \left[[e']_\times F + e'v^T \mid \lambda e'\right] \tag{1.25}$$
where v ∈ R3 is an arbitrary vector and λ is a nonzero scalar. In cases where the intrinsic parameter matrix K of the camera is known (e.g., estimated offline using a calibration grid), the homogeneous image coordinates can be transformed using

$$\hat{u} = K^{-1}u, \qquad \hat{u}' = K'^{-1}u' \tag{1.26}$$
The camera matrices become

$$P = [I \mid 0], \qquad P' = [R \mid t] \tag{1.27}$$

and the fundamental matrix is now called the essential matrix. The advantage of the essential matrix is that its factorization in terms of the extrinsic camera parameters

$$E = [t]_\times R = R[R^Tt]_\times \tag{1.28}$$
is unique up to four possible solutions, the correct one of which can be determined by requiring that all projected points lie in front of both cameras [6].
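The sketch below assembles the essential and fundamental matrices from a known calibration and relative pose, following (1.22), (1.23), and (1.28); the inputs K, K′, R, and t are assumed to be available from an offline calibration.

```python
# Hedged sketch of constructing E and F from calibration and relative pose.
import numpy as np

def skew(t):
    """The skew-symmetric matrix [t]_x of equation (1.23)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    return skew(t) @ R                                  # E = [t]_x R, eq. (1.28)

def fundamental_matrix(K, K_prime, R, t):
    # F = K'^{-T} [t]_x R K^{-1}, eq. (1.22)
    return np.linalg.inv(K_prime).T @ skew(t) @ R @ np.linalg.inv(K)
```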
1.3.3 Estimating the Fundamental Matrix
The fundamental matrix is typically estimated based on a set of matched point correspondences (see Section 1.5). That is, the goal is to select the matrix F ∈ R3×3 that best matches a given set of point mappings:

$$\{u_j \rightarrow u'_j \in \mathbb{R}^2,\ j = 1, \ldots, N\} \tag{1.29}$$
It is natural to try to minimize a least squares cost functional such as

$$J(F) = \sum_{j=1}^{N}\left(\begin{bmatrix} u'_j \\ 1 \end{bmatrix}^T F \begin{bmatrix} u_j \\ 1 \end{bmatrix}\right)^2 \tag{1.30}$$
over the class of admissible fundamental matrices. We recall that F must have rank 2 (see Section 1.3.2). Furthermore, the fundamental matrix is unique up to scale, so we must fix some scaling (say, ||F|| = 1 for some appropriate norm) to ensure that J cannot become arbitrarily small. Hence, the class of admissible estimates has only 7 degrees of freedom. Constrained minimizations of this type are problematic because of the difficulty in parameterizing the class of admissible F. Faugeras [11], Faugeras et al. [12], and Luong et al. [13] proposed some solutions in this regard and analyzed various cost functionals for the estimation problem. A standard approach to estimating the fundamental matrix was proposed by Hartley [14]. Ignoring the rank-2 constraint for the moment, we minimize (1.30) over the class {F ∈ R3×3 | ||F||_F = 1}, where ||·||_F is the Frobenius norm. Each correspondence (u_j, u'_j) produces a linear equation in the elements of F:

$$x_jx'_jf_{11} + x_jy'_jf_{21} + x_jf_{31} + y_jx'_jf_{12} + y_jy'_jf_{22} + y_jf_{32} + x'_jf_{13} + y'_jf_{23} + f_{33} = 0$$
The equations in all of the data points can be collected into a linear system Af = 0, where A is an N×9 matrix involving the data, and f = (f11, f21, f31, f12, f22, f32, f13, f23, f33)^T is the vector of unknowns. The least squares minimization problem is then

$$\min \|Af\|^2 \quad \text{s.t.} \quad f^Tf = 1 \tag{1.31}$$
As in the resectioning estimation problem in Section 1.2.3, the minimizer is the eigenvector f ∈ R9 of A^TA corresponding to the minimal eigenvalue, which can be computed via the singular value decomposition. This eigenvector is then reassembled into a 3×3 matrix F̂. To account for the rank-2 constraint, we replace the full-rank estimate F̂ with F̂*, the minimizer of

$$\min \|\hat{F} - \hat{F}^*\|_F \quad \text{s.t.} \quad \operatorname{rank}(\hat{F}^*) = 2 \tag{1.32}$$

Given the singular value decomposition F̂ = UDV^T, where D = diag(r, s, t) with r > s > t, the solution to (1.32) is

$$\hat{F}^* = U\hat{D}V^T \tag{1.33}$$
where D̂ = diag(r, s, 0). As in Section 1.2.3, to maintain numerical stability it is critical to normalize the data before solving the estimation problem. Each of the sets {u_j} and {u'_j} should be translated to have zero mean and then isotropically scaled so that the average distance to the origin is √2. These translations and scalings can be represented by 3×3 matrices T and T′ that act on the homogeneous coordinates of u_j and u'_j. After the fundamental matrix for the
normalized points has been estimated, the estimate of the fundamental matrix in the original coordinates is given by

$$F = T'^T\hat{F}^*T \tag{1.34}$$
The overall estimation process is called the normalized eight-point algorithm. Several more advanced algorithms for fundamental matrix estimation are discussed in Hartley and Zisserman [6] and typically require iterative, nonlinear optimization. As before, outliers (i.e., incorrect point correspondences inconsistent with the underlying epipolar geometry) are often detected and rejected using RANSAC.
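A compact sketch of the normalized eight-point algorithm summarized above is given below; it assumes at least eight inlier correspondences and, for brevity, omits the RANSAC loop.

```python
# Hedged sketch of the normalized eight-point algorithm.
import numpy as np

def _normalize(u):
    """Zero-mean, isotropic scaling so the mean distance to the origin is sqrt(2)."""
    mean = u.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(u - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    return np.column_stack([u, np.ones(len(u))]) @ T.T, T

def eight_point(u, u_prime):
    """Estimate F from Nx2 arrays of corresponding points (u in image 1, N >= 8)."""
    x, T = _normalize(u)
    xp, Tp = _normalize(u_prime)
    # Each row encodes u'^T F u = 0 for one correspondence (row-major F vectorization).
    A = np.column_stack([xp[:, 0] * x[:, 0], xp[:, 0] * x[:, 1], xp[:, 0],
                         xp[:, 1] * x[:, 0], xp[:, 1] * x[:, 1], xp[:, 1],
                         x[:, 0], x[:, 1], np.ones(len(u))])
    _, _, Vt = np.linalg.svd(A)
    F_hat = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value, as in (1.32)-(1.33).
    U, D, Vt2 = np.linalg.svd(F_hat)
    F_hat = U @ np.diag([D[0], D[1], 0.0]) @ Vt2
    return Tp.T @ F_hat @ T                    # undo normalization, as in (1.34)
```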
1.4 PROJECTIVE TRANSFORMATIONS

The fundamental matrix constrains the possible locations of corresponding points: They must occur along conjugate epipolar lines. However, there are two special situations in which the fundamental matrix for an image pair is undefined; instead, point correspondences between the images are related by an explicit one-to-one mapping. Let us reconsider equations (1.17) and (1.18), which relate the image coordinates of a point seen by two cameras C and C′. If this relationship is to define a transformation that globally relates the image coordinates, for every scene point (X, Y, Z) we require that the dependence on the world coordinate Z disappears. That is,

$$\frac{t_X}{Z} = a_{1X}x + a_{2X}y + b_X \tag{1.35}$$

$$\frac{t_Y}{Z} = a_{1Y}x + a_{2Y}y + b_Y \tag{1.36}$$

$$\frac{t_Z}{Z} = a_{1Z}x + a_{2Z}y + b_Z \tag{1.37}$$

for some set of a's and b's. These conditions are satisfied when either

$$t_X = t_Y = t_Z = 0$$

or

$$k_1X + k_2Y + k_3Z = 1$$
In the first case, corresponding to a camera whose optical center undergoes no translation, we obtain

$$x' = \frac{r_{11}\frac{f'}{f}x + r_{12}\frac{f'}{f}y + r_{13}f'}{\frac{r_{31}}{f}x + \frac{r_{32}}{f}y + r_{33}}, \qquad y' = \frac{r_{21}\frac{f'}{f}x + r_{22}\frac{f'}{f}y + r_{23}f'}{\frac{r_{31}}{f}x + \frac{r_{32}}{f}y + r_{33}}$$
An example of three such images composed into the same frame of reference with appropriate projective transformations is illustrated in Figure 1.5.
FIGURE 1.5 Images from a nontranslating camera composed into the same frame of reference by appropriate projective transformations.
In the second case, corresponding to a planar scene, (1.17) and (1.18) become
$$x' = \frac{\left(r_{11}+t_Xk_1\right)\frac{f'}{f}\,x + \left(r_{12}+t_Xk_2\right)\frac{f'}{f}\,y + \left(r_{13}+t_Xk_3\right)f'}{\left(r_{31}+t_Zk_1\right)\frac{x}{f} + \left(r_{32}+t_Zk_2\right)\frac{y}{f} + r_{33}+t_Zk_3} \tag{1.38}$$

$$y' = \frac{\left(r_{21}+t_Yk_1\right)\frac{f'}{f}\,x + \left(r_{22}+t_Yk_2\right)\frac{f'}{f}\,y + \left(r_{23}+t_Yk_3\right)f'}{\left(r_{31}+t_Zk_1\right)\frac{x}{f} + \left(r_{32}+t_Zk_2\right)\frac{y}{f} + r_{33}+t_Zk_3}$$
An example of a pair of images of a planar surface, registered by an appropriate projective transformation, is illustrated in Figure 1.6. In either case, the transformation is of the form

$$x' = \frac{a_{11}x + a_{12}y + b_1}{c_1x + c_2y + d} \tag{1.39}$$

$$y' = \frac{a_{21}x + a_{22}y + b_2}{c_1x + c_2y + d} \tag{1.40}$$

which can be written in homogeneous coordinates as

$$u' = Hu \tag{1.41}$$

for a nonsingular 3×3 matrix H defined up to scale. This relationship is called a projective transformation (and is sometimes also known as a collineation or homography). When c_1 = c_2 = 0, the transformation is known as an affine transformation.
FIGURE 1.6 Images of a planar scene composed into the same frame of reference by appropriate projective transformations.
Projective transformations between images are induced by common configurations such as surveillance cameras that view a common ground plane or a panning and tilting camera mounted on a tripod. In the next section, we discuss how to estimate projective transformations relating an image pair. We note that, technically, affine transformations are induced by the motion of a perspective camera only under quite restrictive conditions. The image planes must both be parallel to the XY-plane. Furthermore, either the translation vector t must be identically 0 or Z must be constant for all points in the scene. That is, the scene is a planar surface parallel to the image planes P and P′. However, the affine assumption is often made when the scene is far from the camera (Z is large) and the rotation angles α and β are very small. This assumption has the advantage that the affine parameters can be easily estimated, usually in closed form.
1.4.1 Estimating Projective Transformations
The easiest approach to estimating a projective transformation from point correspondences is called the direct linear transform (DLT). If the point correspondence {u_j → u'_j} corresponds to a projective transformation, then the homogeneous coordinates u'_j and
1.4 Projective Transformations
17
Huj are vectors in the same direction: u j ⫻ Huj ⫽ 0
(1.42)
Since (1.42) gives two independent linear equations in the nine unknowns of H, N point correspondences give rise to a 2N×9 linear system:

Ah = 0     (1.43)

where A is a 2N×9 matrix involving the data, and h = (a_{11}, a_{12}, a_{21}, a_{22}, b_1, b_2, c_1, c_2, d)^T is the vector of unknowns. We solve the problem in exactly the same way as in (1.31) using the singular value decomposition (i.e., the solution h is the singular vector of A corresponding to the smallest singular value). When the element d is expected to be far from 0, an alternate approach is to normalize d = 1, in which case the N point correspondences induce a system of 2N equations in the remaining eight unknowns, which can be solved as a linear least squares problem. As with the fundamental matrix estimation problem, more accurate estimates of the projective transformation parameters can be obtained by the iterative minimization of a nonlinear cost function, such as the symmetric transfer error given by

\sum_{i=1}^{N} \|u'_i - H u_i\|_2^2 + \|u_i - H^{-1} u'_i\|_2^2     (1.44)
The reader is referred to Hartley and Zisserman [6] for more details. We note that an excellent turnkey approach to estimating projective transformations for real images is given by the Generalized Dual-Bootstrap ICP algorithm proposed by Yang et al. [15].
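A minimal sketch of the DLT just described might look as follows. The function name is ours, and the unknowns are ordered as the stacked rows of H rather than in the (a11, a12, a21, a22, b1, b2, c1, c2, d) ordering used in the text, which only permutes the columns of A.

```python
import numpy as np

def estimate_homography_dlt(u, u_prime):
    """Direct linear transform for a projective transformation (Section 1.4.1).
    u, u_prime: (N, 2) arrays of corresponding points, N >= 4.
    Builds the 2N x 9 system A h = 0 from u'_j x (H u_j) = 0 and returns the
    singular vector of A with the smallest singular value, reshaped to 3x3."""
    rows = []
    for (x, y), (xp, yp) in zip(u, u_prime):
        p = [x, y, 1.0]
        rows.append([0, 0, 0] + [-c for c in p] + [yp * c for c in p])
        rows.append(p + [0, 0, 0] + [-xp * c for c in p])
    A = np.asarray(rows)                 # 2N x 9
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)          # null vector of A, up to scale
```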
1.4.2 Rectifying Projective Transformations
We close this section by mentioning a special class of projective transformations, rectifying projective transformations, that often simplify computer vision problems involving point matching along conjugate epipolar lines. Since epipolar lines are generally neither aligned with one of the coordinate axes of an image nor parallel to each other, the implementation of algorithms that work with epipolar lines can be complicated. To this end, it is common to apply a technique called rectification to an image pair before processing, so that the epipolar lines are parallel and horizontal. An associated image plane pair (P, P′) is said to be rectified when the fundamental matrix for (P, P′) is the skew-symmetric matrix

F^* = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{bmatrix}     (1.45)
In homogeneous coordinates, the epipoles corresponding to F^* are e_0 = e_1 = [1\; 0\; 0]^T, which means the epipolar lines are horizontal and parallel. Furthermore, expanding the fundamental matrix equation for a correspondence ((x, y)^T, (x', y')^T) ∈ P × P′ gives

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}^T F^* \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0     (1.46)
FIGURE 1.7 Rectifying projective transformations G and H transform corresponding epipolar lines to be horizontal with the same y value.
which is equivalent to y' − y = 0. This implies that not only are the epipolar lines in a rectified image pair horizontal, they are aligned. Thus, the lines y = c in P and y' = c in P′ are conjugate epipolar lines. A pair of projective transformations (G, H) rectifies an associated image plane pair (P, P′) with fundamental matrix F if

H^{-T} F G^{-1} = F^*     (1.47)
By the above definition, if the projective transformations G and H are applied to P and P′ to produce warped image planes P̂_0 and P̂_1, respectively, then (P̂_0, P̂_1) is a rectified pair (Figure 1.7). The rectifying condition (1.47) can be expressed as nine equations in the 16 unknowns of the two projective transformations G and H, leading to 7 degrees of freedom in the choice of a rectifying pair. Seitz [16] and Hartley [17] described methods for deriving rectifying projective transformations from an estimate of the fundamental matrix relating an image pair. Isgrò and Trucco [18] observed that rectifying transformations can be estimated without explicitly estimating the fundamental matrix as an intermediate step.
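As a small illustration of condition (1.47), the following sketch (function name ours) checks whether a candidate pair (G, H) rectifies a given fundamental matrix by comparing H^{-T} F G^{-1} to F* up to scale and sign.

```python
import numpy as np

RECTIFIED_F = np.array([[0, 0, 0],
                        [0, 0, 1],
                        [0, -1, 0]], dtype=float)   # F* of equation (1.45)

def is_rectifying_pair(G, H, F, tol=1e-8):
    """Check the rectifying condition (1.47): H^{-T} F G^{-1} should equal F*
    up to an overall scale. G and H act on P and P', respectively."""
    M = np.linalg.inv(H).T @ F @ np.linalg.inv(G)
    M = M / np.linalg.norm(M)
    target = RECTIFIED_F / np.linalg.norm(RECTIFIED_F)
    return min(np.linalg.norm(M - target), np.linalg.norm(M + target)) < tol
```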
1.5 FEATURE DETECTION AND MATCHING
As indicated in the previous sections, many algorithms for multi-view geometry parameter estimation require a set of image correspondences as input—that is, regions of pixels representing scene points that can be reliably, unambiguously matched in other images of the same scene. Many indicators of pixel regions that constitute “good” features have been proposed, including line contours, corners, and junctions. A classical algorithm for detecting and matching “good” point features, proposed by Shi and Tomasi [19], is described as follows:

1. Compute the gradients g_x(x, y) and g_y(x, y) for I. That is,
   g_x(x, y) = I(x, y) − I(x − 1, y)
   g_y(x, y) = I(x, y) − I(x, y − 1)

2. For every N×N block of pixels Γ,
   a. Compute the covariance matrix
      B = \begin{bmatrix} \sum_{(x,y)\in\Gamma} g_x^2(x, y) & \sum_{(x,y)\in\Gamma} g_x g_y(x, y) \\ \sum_{(x,y)\in\Gamma} g_x g_y(x, y) & \sum_{(x,y)\in\Gamma} g_y^2(x, y) \end{bmatrix}
   b. Compute the eigenvalues of B, λ_1 and λ_2.
   c. If λ_1 and λ_2 are both greater than some threshold τ, add Γ to the list of features.

3. For every block of pixels Γ in the list of features, find the N×N block of pixels in I′ that has the highest normalized cross-correlation, and add the point correspondence to the feature list if the correlation is sufficiently high.

A recent focus in the computer vision community has been on different types of “invariant” detectors that select image regions that can be robustly matched even between images where the camera perspectives or zooms are quite different. An early approach was the Harris corner detector [20], which uses the same matrix B as the Shi-Tomasi algorithm; however, instead of computing the eigenvalues of B, the quantity

R = det(B) − κ · trace^2(B)     (1.48)

is computed, with κ in a recommended range of 0.04–0.15. Blocks with local positive maxima of R are selected as features. Mikolajczyk and Schmid [21] later extended Harris corners to a multi-scale setting. An alternate approach is to filter the image at multiple scales with a Laplacian-of-Gaussian (LOG) [22] or difference-of-Gaussian (DOG) [23] filter; scale-space extrema of the filtered image give the locations of the interest points. The popular Scale-Invariant Feature Transform (SIFT) detector proposed by Lowe [23] is based on multi-scale DOG filters. Feature points of this type typically resemble “blobs” at different scales, as opposed to “corners.” Figure 1.8 illustrates example Harris corners and SIFT features detected for the same image. A broad survey of modern feature detectors was given by Mikolajczyk and Schmid [24]. Block-matching correspondence approaches begin to fail when the motion of the camera or of scene objects induces too much change in the images. In this case, the assumption that a rectangular block of pixels in one image roughly matches a block of the same shape and size in the other image breaks down. Normalized cross-correlation between blocks of pixels is likely to yield poor matches. In such cases, it may be more appropriate to apply the SIFT descriptor [23], a histogram of gradient orientations designed to be invariant to scale and rotation of the feature. Typically, the algorithm uses a 16×16 grid of samples from the gradient map at the feature's scale to form a 4×4 aggregate gradient matrix. Each element of the matrix is quantized into eight orientations, producing a descriptor of dimension 128. Mikolajczyk and Schmid [25] showed that the overall SIFT algorithm outperformed most other detector/descriptor combinations in their experiments, accounting for its widespread popularity in the computer vision community.
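A rough implementation of the Harris response of equation (1.48) is sketched below; the block size, the κ value, and the use of a uniform filter to accumulate the entries of B are our choices, not prescriptions from the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def harris_response(I, block=7, kappa=0.06):
    """Harris corner response R = det(B) - kappa * trace(B)^2 (eq. 1.48), using the
    same matrix B of summed gradient products as the Shi-Tomasi test. I: 2D float image."""
    gx = np.zeros_like(I); gy = np.zeros_like(I)
    gx[:, 1:] = I[:, 1:] - I[:, :-1]      # gx(x, y) = I(x, y) - I(x - 1, y)
    gy[1:, :] = I[1:, :] - I[:-1, :]      # gy(x, y) = I(x, y) - I(x, y - 1)
    # entries of B, summed over each block x block neighborhood
    Sxx = uniform_filter(gx * gx, block) * block**2
    Syy = uniform_filter(gy * gy, block) * block**2
    Sxy = uniform_filter(gx * gy, block) * block**2
    return (Sxx * Syy - Sxy**2) - kappa * (Sxx + Syy)**2
```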
FIGURE 1.8 (a) Original image. (b) Example Harris corners. (c) Example SIFT features.
1.6 MULTI-CAMERA GEOMETRY
We now proceed to the relationships between M > 2 cameras that observe the same scene. Although an entity called the trifocal tensor [26] that relates three cameras' views plays an analogous role to the fundamental matrix for two views, here we concentrate on the geometry of M general cameras. We assume a camera network that contains M perspective cameras, each described by a 3×4 matrix P_i. Each camera images some subset of a set of N scene points {X_1, X_2, . . . , X_N} ∈ R^3. We define an indicator function χ_{ij}, where χ_{ij} = 1 if camera i images scene point j. The projection of X_j onto P_i is given by u_{ij} ∈ R^2, denoted

u_{ij} ∼ P_i X_j
(1.49)
The main problem in multi-camera geometry is structure from motion (SFM). That is, given only the observed image projections {uij }, we want to estimate the corresponding camera matrices Pi and scene points Xj . As we discuss in the next sections, this process typically proceeds in stages to find a good initial estimate, which is followed by a nonlinear optimization algorithm called bundle adjustment. We note that SFM is closely related to a problem in the robotics community called simultaneous localization and mapping (SLAM) [27], in which mobile robots must estimate their locations from sensor data as they move through a scene. SFM also forms the fundamental core of commercial software packages such as boujou or SynthEyes for “matchmoving” and camera tracking, which are used to insert digital effects into Hollywood movies.
1.6.1 Affine Reconstruction Many initialization methods for the SFM problem involve a factorization approach. The first such approach was proposed by Tomasi and Kanade [28] and applies only to affine cameras (we generalize it to perspective cameras in the next section).
For an affine camera, the projection equations take the form

u_{ij} = A_i X_j + t_i     (1.50)

for A_i ∈ R^{2×3} and t_i ∈ R^2. If we assume that each camera images all of the scene points, then a natural formulation of the affine SFM problem is

\min_{\{A_i, t_i, X_j\}} \sum_{i=1}^{M} \sum_{j=1}^{N} \|u_{ij} - (A_i X_j + t_i)\|^2     (1.51)

If we assume that the X_j are centered at 0, taking the derivative with respect to t_i reveals that the minimizer is

\hat{t}_i = \frac{1}{N} \sum_{j=1}^{N} u_{ij}     (1.52)

Hence, we recenter the image measurements by

u_{ij} \leftarrow u_{ij} - \frac{1}{N} \sum_{j=1}^{N} u_{ij}     (1.53)

and are faced with solving

\min_{\{A_i, X_j\}} \sum_{i=1}^{M} \sum_{j=1}^{N} \|u_{ij} - A_i X_j\|^2     (1.54)
Tomasi and Kanade's key observation was that if all of the image coordinates are collected into a measurement matrix defined by

W = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1N} \\ u_{21} & u_{22} & \cdots & u_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ u_{M1} & u_{M2} & \cdots & u_{MN} \end{bmatrix}     (1.55)

then in the ideal (noiseless) case, this matrix factors as

W = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_M \end{bmatrix} \begin{bmatrix} X_1 & X_2 & \cdots & X_N \end{bmatrix}     (1.56)

revealing that the 2M×N measurement matrix is ideally rank 3. They showed that solving (1.54) is equivalent to finding the best rank-3 approximation to W in the Frobenius norm, which can easily be accomplished with the singular value decomposition. The solution is unique up to an affine transformation of the world coordinate system, and optimal in the sense of maximum likelihood estimation if the noise in the image projections is isotropic, zero-mean i.i.d. Gaussian.
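The factorization just described reduces to a few lines of linear algebra. The sketch below (assuming a complete measurement matrix W as in equation (1.55), with all points visible in all cameras) recenters the measurements and takes the rank-3 SVD truncation; the function name is ours.

```python
import numpy as np

def affine_factorization(W):
    """Tomasi-Kanade factorization (Section 1.6.1). W is the 2M x N measurement
    matrix of equation (1.55). Returns the stacked 2M x 3 affine cameras, the
    3 x N structure, and the per-camera translations, up to an affine
    transformation of world coordinates."""
    t = W.mean(axis=1, keepdims=True)          # per-row means = stacked t_i (eq. 1.52)
    Wc = W - t                                 # recentered measurements (eq. 1.53)
    U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
    A = U[:, :3] * s[:3]                       # best rank-3 factor: stacked A_i
    X = Vt[:3]                                 # 3 x N scene points
    return A, X, t
```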
1.6.2 Projective Reconstruction
Unfortunately, for the perspective projection model that more accurately reflects the way real cameras image the world, factorization is not as easy. We can form a similar measurement matrix W and factorize it as follows:

W = \begin{bmatrix} \lambda_{11} u_{11} & \lambda_{12} u_{12} & \cdots & \lambda_{1N} u_{1N} \\ \lambda_{21} u_{21} & \lambda_{22} u_{22} & \cdots & \lambda_{2N} u_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{M1} u_{M1} & \lambda_{M2} u_{M2} & \cdots & \lambda_{MN} u_{MN} \end{bmatrix} = \begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_M \end{bmatrix} \begin{bmatrix} X_1 & X_2 & \cdots & X_N \end{bmatrix}     (1.57)

Here, both u_{ij} and X_j are represented in homogeneous coordinates. Therefore, the measurement matrix is of dimension 3M×N and is ideally rank 4. However, since the scale factors (also called projective depths) λ_{ij} multiplying each projection are different, these must also be estimated. Sturm and Triggs [29, 30] suggested a factorization method that recovers the projective depths as well as the structure and motion parameters from the measurement matrix. They used relationships between fundamental matrices and epipolar lines to obtain an initial estimate for the projective depths λ_{ij} (an alternate approach is to simply initialize all λ_{ij} = 1). Once the projective depths are fixed, the rows and columns of the measurement matrix are rescaled, and the camera matrices and scene point positions are recovered from the best rank-4 approximation to W obtained using the singular value decomposition. Given estimates of the camera matrices and scene points, the scene points can be reprojected to obtain new estimates of the projective depths, and the process iterated until the parameter estimates converge.
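The following sketch illustrates the structure of that iteration, using the simple λij = 1 initialization rather than the epipolar-geometry-based estimate of Sturm and Triggs, and omitting the row/column rescaling step; the function name, array layout, and iteration count are our choices.

```python
import numpy as np

def projective_factorization(us, n_iters=10):
    """Sketch of factorization with projective depths (Section 1.6.2).
    us: (M, N, 2) observed image points, all visible in all cameras.
    Returns 3x4 cameras P_i and homogeneous 4 x N points X_j in a projective frame."""
    M, N, _ = us.shape
    uh = np.concatenate([us, np.ones((M, N, 1))], axis=2)   # homogeneous 3-vectors
    lam = np.ones((M, N))                                   # projective depths lambda_ij
    for _ in range(n_iters):
        W = (lam[:, :, None] * uh).transpose(0, 2, 1).reshape(3 * M, N)
        # (row/column balancing of W is omitted in this sketch)
        U, s, Vt = np.linalg.svd(W, full_matrices=False)    # best rank-4 approximation
        P = (U[:, :4] * s[:4]).reshape(M, 3, 4)
        X = Vt[:4]                                          # 4 x N homogeneous points
        proj = np.einsum('mij,jn->min', P, X)               # reproject: (M, 3, N)
        lam = proj[:, 2, :]                                 # update depths from 3rd coordinate
    return P, X
```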
1.6.3 Metric Reconstruction
There is substantial ambiguity in a projective reconstruction as just obtained, because

\begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_M \end{bmatrix} \begin{bmatrix} X_1 & X_2 & \cdots & X_N \end{bmatrix} = \begin{bmatrix} P_1 H \\ P_2 H \\ \vdots \\ P_M H \end{bmatrix} \begin{bmatrix} H^{-1} X_1 & H^{-1} X_2 & \cdots & H^{-1} X_N \end{bmatrix}     (1.58)

for any 4×4 nonsingular matrix H. This means that while some geometric properties of the reconstructed configuration will be correct compared to the truth (e.g., the order of 3D points lying along a straight line), others will not be (e.g., the angles between lines/planes or the relative lengths of line segments). To make the reconstruction useful (i.e., to recover the correct configuration up to an unknown rotation, translation, and scale), we need to estimate the matrix H that turns the projective factorization into a metric factorization, so that

\hat{P}_i H = K_i R_i \left[ I \mid -C_i \right]     (1.59)
and Ki is in the correct form (e.g., we may force it to be diagonal). This process is also called auto-calibration.
The auto-calibration process depends fundamentally on several properties of projective geometry that are subtle and complex. We give only a brief overview here; see references [6, 31, 32] for more details. If we let m_x = m_y = 1 in (1.6), the form of a camera's intrinsic parameter matrix is

K = \begin{bmatrix} \alpha_x & s & P_x \\ 0 & \alpha_y & P_y \\ 0 & 0 & 1 \end{bmatrix}     (1.60)

We define a quantity called the dual image of the absolute conic (DIAC) as

\omega^* = KK^T = \begin{bmatrix} \alpha_x^2 + s^2 + P_x^2 & s\alpha_y + P_x P_y & P_x \\ s\alpha_y + P_x P_y & \alpha_y^2 + P_y^2 & P_y \\ P_x & P_y & 1 \end{bmatrix}     (1.61)

If we put constraints on the camera's internal parameters, these correspond to constraints on the DIAC. For example, if we require that the pixels have zero skew and that the principal point is at the origin, then

\omega^* = \begin{bmatrix} \alpha_x^2 & 0 & 0 \\ 0 & \alpha_y^2 & 0 \\ 0 & 0 & 1 \end{bmatrix}     (1.62)

and we have introduced three constraints on the entries of the DIAC:

\omega^*_{12} = \omega^*_{13} = \omega^*_{23} = 0     (1.63)

If we further constrain that the pixels must be square, we have a fourth constraint, that \omega^*_{11} = \omega^*_{22}. The DIAC is related to a second important quantity called the absolute dual quadric, denoted Q^*_\infty. This is a special 4×4 symmetric rank-3 matrix that represents an imaginary surface in the scene that is fixed under similarity transformations. That is, points on Q^*_\infty map to other points on Q^*_\infty if a rotation, translation, and/or uniform scaling is applied to the scene coordinates. The DIAC (which is different for each camera) and Q^*_\infty (which is independent of the cameras) are connected by the important equation

\omega^*_i = P_i Q^*_\infty P_i^T     (1.64)

That is, the two quantities are related through the camera matrices P_i. Therefore, each constraint we impose on each \omega^*_i (e.g., each in equation 1.63) imposes a linear constraint on the 10 homogeneous parameters of Q^*_\infty, in terms of the camera matrices P_i. Given enough camera matrices, we can thus solve a linear least squares problem for the entries of Q^*_\infty. Once Q^*_\infty has been estimated from the {P_i} resulting from projective factorization, the correct H matrix in (1.59) that brings about a metric reconstruction is extracted using the relationship

Q^*_\infty = H\, \mathrm{diag}(1, 1, 1, 0)\, H^T     (1.65)
The resulting reconstruction is related to the true camera/scene configuration by an unknown similarity transform that cannot be estimated without additional information about the scene. We conclude this section by noting that there exist critical motion sequences for which auto-calibration is not fully possible [33], including pure translation or pure rotation of a camera, orbital motion around a fixed point, and motion confined to a plane.
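To make the linear step concrete, the sketch below (function name ours) assembles the constraints of equation (1.63) plus the square-pixel constraint on Q*∞ via equation (1.64), solves for its 10 parameters with an SVD, and factors out an H as in equation (1.65). The handling of eigenvalue signs and ordering is deliberately simplified, so this should be read as an illustration of the structure of auto-calibration rather than a robust implementation.

```python
import numpy as np

def autocalibrate(Ps):
    """Linear auto-calibration sketch (Section 1.6.3). Ps: list of 3x4 projective cameras.
    Solves omega*_i = P_i Q P_i^T (eq. 1.64) under zero skew, centered principal point,
    and square pixels, then extracts H from Q ~ H diag(1,1,1,0) H^T (eq. 1.65)."""
    idx = [(a, b) for a in range(4) for b in range(a, 4)]   # 10 entries of symmetric Q

    def omega_row(P, i, j):
        # coefficients of omega*_ij = sum_{a,b} P[i,a] P[j,b] Q[a,b] w.r.t. the 10 entries
        C = np.outer(P[i], P[j])
        return [C[a, b] + (C[b, a] if a != b else 0.0) for (a, b) in idx]

    rows = []
    for P in Ps:
        rows.append(omega_row(P, 0, 1))                                  # omega*_12 = 0
        rows.append(omega_row(P, 0, 2))                                  # omega*_13 = 0
        rows.append(omega_row(P, 1, 2))                                  # omega*_23 = 0
        rows.append(list(np.array(omega_row(P, 0, 0)) -
                         np.array(omega_row(P, 1, 1))))                  # omega*_11 = omega*_22
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    q = Vt[-1]
    Q = np.zeros((4, 4))
    for val, (a, b) in zip(q, idx):
        Q[a, b] = Q[b, a] = val
    # rank-3 decomposition Q ~ H diag(1,1,1,0) H^T (eigenvalue signs glossed over here)
    w, V = np.linalg.eigh(Q)
    order = np.argsort(-np.abs(w))
    w, V = w[order], V[:, order]
    H = V @ np.diag(np.sqrt(np.abs(np.append(w[:3], 1.0))))
    return Q, H
```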
1.6.4 Bundle Adjustment
After a good initial metric factorization is obtained, the final step is to perform bundle adjustment—that is, the iterative optimization of the nonlinear cost function

\sum_{i=1}^{M} \sum_{j=1}^{N} \chi_{ij}\, (u_{ij} - P_i X_j)^T\, \Sigma_{ij}^{-1}\, (u_{ij} - P_i X_j)     (1.66)
Here, \Sigma_{ij} is the 2×2 covariance matrix associated with the noise in image point u_{ij}. The quantity inside the sum is called the Mahalanobis distance between the observed image point and its projection based on the estimated camera/scene parameters. The optimization is typically accomplished with the Levenberg-Marquardt algorithm [34]—specifically, an implementation that exploits the sparse block structure of the normal equations characteristic of SFM problems (since each scene point is typically observed by only a few cameras) [35]. Parameterization of the minimization problem, especially of the cameras, is a critical issue. If we assume that the skew, aspect ratio, and principal point of each camera are known prior to deployment, then each camera should be represented by seven parameters: one for the focal length, three for the translation vector, and three for the rotation matrix parameters. Rotation matrices are often minimally parameterized by three parameters (v_1, v_2, v_3) using the axis–angle parameterization. If we think of v as a vector in R^3, then the rotation matrix R corresponding to v is a rotation through the angle \|v\| about the axis v, computed by

R = \cos\|v\|\, I + \mathrm{sinc}\|v\|\, [v]_\times + \frac{1 - \cos\|v\|}{\|v\|^2}\, v v^T     (1.67)

Extracting the v corresponding to a given R is accomplished as follows. The direction of the rotation axis v is the eigenvector of R corresponding to the eigenvalue 1. The angle \theta that defines \|v\| is obtained from the two-argument arctangent function using

2\cos(\theta) = \mathrm{trace}(R) - 1     (1.68)

2\sin(\theta)\, \hat{v}^T = (R_{32} - R_{23},\; R_{13} - R_{31},\; R_{21} - R_{12})     (1.69)

where \hat{v} = v / \|v\| is the unit vector along the rotation axis.
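Equations (1.67)–(1.69) translate directly into code, as in the following sketch (function names ours); the θ = 0 case is handled explicitly and the θ = π corner case is only noted, not resolved.

```python
import numpy as np

def axis_angle_to_R(v):
    """Rotation matrix from the axis-angle vector v (eq. 1.67)."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])  # [v]_x
    return (np.cos(theta) * np.eye(3)
            + (np.sin(theta) / theta) * vx
            + ((1 - np.cos(theta)) / theta**2) * np.outer(v, v))

def R_to_axis_angle(R):
    """Recover v from R using equations (1.68)-(1.69)."""
    a = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])  # 2 sin(theta) v_hat
    theta = np.arctan2(np.linalg.norm(a) / 2.0, (np.trace(R) - 1.0) / 2.0)
    if np.linalg.norm(a) < 1e-12:
        return np.zeros(3)          # theta = 0 (theta = pi is not handled in this sketch)
    return theta * a / np.linalg.norm(a)
```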
Other parameterizations of rotation matrices and rigid motions were discussed by Chirikjian and Kyatkin [36]. Finally, it is important to remember that the set of cameras and scene points can be estimated only up to an unknown rotation, translation, and scale of the world coordinates, so it is common to fix the extrinsic parameters of the first camera as

P_1 = K_1 [I \mid 0]     (1.70)
In general, any over-parameterization of the problem or ambiguity in the reconstructions creates a problem called gauge freedom [37], which can lead to slower convergence and computational problems. Therefore, it is important to strive for minimal parameterizations of SFM problems.
1.7 CONCLUSIONS
We conclude by noting that the concepts discussed in this chapter are now viewed as fairly classical in the computer vision community. However, there is still much interesting research to be done in extending and applying computer vision algorithms in distributed camera networks. Several such cutting-edge algorithms are discussed in this book. The key challenges are as follows:

■ A very large number of widely distributed cameras may be involved in a realistic camera network. The setup is fundamentally different in terms of scale and spatial extent compared to typical multi-camera research undertaken in a lab setting.
■ The information from all cameras is unlikely to be available at a powerful, central processor—an underlying assumption of the multi-camera calibration algorithms discussed in Section 1.6. Instead, each camera may be attached to a power- and computation-constrained local processor that is unable to execute complex algorithms, and to an antenna that is unable to transmit information across long distances.
These considerations argue for distributed algorithms that operate independently at each camera node, exchanging information between local neighbors to obtain solutions that approximate the best performance of a centralized algorithm. For example, we recently presented a distributed algorithm for the calibration of a multi-camera network that was designed with these considerations in mind [41]. We refer the reader to our recent survey on distributed computer vision algorithms [42] for more information.
1.7.1 Resources Multi-view geometry is a rich and complex area of study, and this chapter only introduces the main geometric relationships and estimation algorithms. The best and most comprehensive reference on epipolar geometry, structure from motion, and camera calibration is Multiple View Geometry in Computer Vision by Richard Hartley and Andrew Zisserman [6], which is essential reading for further study of the material discussed here. The collected volume Vision Algorithms: Theory and Practice (Proceedings of the International Workshop on Vision Algorithms), edited by Bill Triggs, Andrew Zisserman, and Richard Szeliski [38], contains an excellent, detailed article on bundle adjustment in addition to important papers on gauges and parameter uncertainty. Finally, An Invitation to 3D Vision: From Images to Geometric Models by Yi Ma, Stefano Soatto, Jana Kosecka, and Shankar Sastry [39], is an excellent reference that includes a chapter that provides a beginning-to-end recipe for structure from motion. The companion Website includes MATLAB code for most of the algorithms described in this chapter [40].
REFERENCES
[1] T. Kanade, P. Rander, P. Narayanan, Virtualized reality: Constructing virtual worlds from real scenes, IEEE Multimedia, Immersive Telepresence 4 (1) (1997) 34–47.
[2] D. Forsyth, J. Ponce, Computer Vision: A Modern Approach, Prentice Hall, 2003.
[3] P. Shirley, Fundamentals of Computer Graphics, A.K. Peters, 2002.
[4] G. Strang, Linear Algebra and Its Applications, Harcourt Brace Jovanovich, 1988.
[5] B. Prescott, G.F. McLean, Line-based correction of radial lens distortion, Graphical Models and Image Processing 59 (1) (1997) 39–47.
[6] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[7] M.A. Fischler, R.C. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 24 (1981) 381–395.
[8] J.-Y. Bouguet, Camera calibration toolbox for MATLAB, in: www.vision.caltech.edu/bouguetj/calib_doc/, 2008; accessed January 2009.
[9] S. Maybank, The angular velocity associated with the optical flow field arising from motion through a rigid environment, in: Proceedings of the Royal Society London A401 (1985) 317–326.
[10] Z. Zhang, Determining the epipolar geometry and its uncertainty—a review, The International Journal of Computer Vision 27 (2) (1998) 161–195.
[11] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
[12] O. Faugeras, Q.-T. Luong, T. Papadopoulo, The Geometry of Multiple Images: The Laws That Govern the Formation of Multiple Images of a Scene and Some of Their Applications, MIT Press, 2001.
[13] Q.-T. Luong, O. Faugeras, The fundamental matrix: Theory, algorithms, and stability analysis, International Journal of Computer Vision 17 (1) (1996) 43–76.
[14] R. Hartley, In defence of the 8-point algorithm, in: Proceedings of the International Conference on Computer Vision, 1995.
[15] G. Yang, C.V. Stewart, M. Sofka, C.-L. Tsai, Registration of challenging image pairs: Initialization, estimation, and decision, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (11) (2007) 1973–1989.
[16] S. Seitz, Image-based transformation of viewpoint and scene appearance, Ph.D. thesis, University of Wisconsin–Madison, 1997.
[17] R. Hartley, Theory and practice of projective rectification, International Journal of Computer Vision 35 (2) (1999) 115–127.
[18] F. Isgrò, E. Trucco, Projective rectification without epipolar geometry, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 1999.
[19] J. Shi, C. Tomasi, Good features to track, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1994.
[20] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of the Fourth Alvey Vision Conference, 1988.
[21] K. Mikolajczyk, C. Schmid, Indexing based on scale invariant interest points, in: Proceedings of the IEEE International Conference on Computer Vision, 2001.
[22] T. Lindeberg, Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention, International Journal of Computer Vision 11 (3) (1994) 283–318.
[23] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[24] K. Mikolajczyk, C. Schmid, Scale and affine invariant interest point detectors, International Journal of Computer Vision 60 (1) (2004) 63–86.
[25] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (10) (2005) 1615–1630.
[26] A. Shashua, M. Werman, On the trilinear tensor of three perspective views and its underlying geometry, in: Proceedings of the International Conference on Computer Vision, 1995.
[27] S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics: Intelligent Robotics and Autonomous Agents, MIT Press, 2005.
[28] C. Tomasi, T. Kanade, Shape and Motion from Image Streams: A Factorization Method—Part 2: Detection and Tracking of Point Features, Technical Report CMU-CS-91-132, Carnegie Mellon University, April 1991.
[29] P. Sturm, B. Triggs, A factorization-based algorithm for multi-image projective structure and motion, in: Proceedings of the Fourth European Conference on Computer Vision, 1996.
[30] B. Triggs, Factorization methods for projective structure and motion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1996.
[31] M. Pollefeys, R. Koch, L.J. Van Gool, Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters, in: Proceedings of the IEEE International Conference on Computer Vision, 1998.
[32] M. Pollefeys, F. Verbiest, L.J. Van Gool, Surviving dominant planes in uncalibrated structure and motion recovery, in: Proceedings of the Seventh European Conference on Computer Vision-Part II, 2002.
[33] P. Sturm, Critical motion sequences for monocular self-calibration and uncalibrated Euclidean reconstruction, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 1997.
[34] J. Dennis, Jr., R. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM Press, 1996.
[35] B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment—A modern synthesis, in: B. Triggs, A. Zisserman, R. Szeliski (Eds.), Vision Algorithms: Theory and Practice, LNCS, Springer-Verlag, 2000.
[36] G. Chirikjian, A. Kyatkin, Engineering Applications of Noncommutative Harmonic Analysis with Emphasis on Rotation and Motion Groups, CRC Press, 2001.
[37] K. Kanatani, D.D. Morris, Gauges and gauge transformations for uncertainty description of geometric structure with indeterminacy, IEEE Transactions on Information Theory 47 (5) (2001) 2017–2028.
[38] B. Triggs, A. Zisserman, R. Szeliski (Eds.), Vision Algorithms: Theory and Practice, in: Proceedings of the International Workshop on Vision Algorithms (1999).
[39] Y. Ma, S. Soatto, J. Kosecka, S.S. Sastry, An Invitation to 3-D Vision, Springer, 2004.
[40] Y. Ma, S. Soatto, J. Kosecka, S.S. Sastry, An invitation to 3-D vision, in: http://vision.ucla.edu/MASKS/; accessed January 2009.
[41] R. Radke, D. Devarajan, Z. Cheng, Calibrating distributed camera networks, Proceedings of the IEEE (Special Issue on Distributed Smart Cameras) 96 (10) (2008) 1625–1639.
[42] R. Radke, A survey of distributed computer vision algorithms, in: H. Nakashima, J. Augusto, H. Aghajan (Eds.), Handbook of Ambient Intelligence and Smart Environments, Springer, 2009.
CHAPTER 2
Multi-View Calibration, Synchronization, and Dynamic Scene Reconstruction
Marc Pollefeys, Sudipta N. Sinha, Li Guan
Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina, and Institute for Visual Computing, ETH Zurich, Switzerland
Jean-Sébastien Franco
LaBRI–INRIA Sud-Ouest, University of Bordeaux, Bordeaux, France

Abstract
In this chapter, we first present a method for automatic geometric calibration of a camera network. The novel RANSAC-based calibration algorithm simultaneously computes camera poses and synchronization using only the silhouettes of the objects of interest (foreground objects) in the videos from the multiple views. Then, using this calibrated system, and again just from the silhouette cues, we introduce a probabilistic sensor fusion framework for 3D dynamic scene reconstruction with potential static obstacles. Several real-world data sets show that despite lighting variation, photometric inconsistency between views, shadows, reflections, and so forth, not only are densely cluttered dynamic objects reconstructed and tracked, but static obstacles in the scene are also recovered.
Keywords: epipolar geometry, frontier points, epipolar tangents, RANSAC, structure from motion, visual hulls, occlusion, dynamic scene reconstruction, shape from silhouette, sensor fusion, probability, Bayes rule
2.1 INTRODUCTION
Recovering 3D models of the real world from images and video is an important research area in computer vision, with applications in computer graphics, virtual reality, and robotics. Manually modeling photo-realistic scenes for such applications is tedious and requires much effort. The goal in computer vision is to generate such models automatically by processing visual imagery from the real world and recovering the 3D shape and structure of the scene. In this chapter, we will focus on the problem of reconstructing a 3D time-varying event involving either rigid or nonrigid moving objects such as human beings, to, for example, digitize a dance performance that was recorded by video cameras from multiple viewpoints. This enables users viewing the event to be completely immersed in a virtual world, allowing them to observe the event from any arbitrary viewpoint. Called free-viewpoint 3D video, this has promising applications in 3D teleimmersion, in digitizing rare cultural performances and important sports events, and in generating content for 3D video–based realistic training and demonstrations for surgery, medicine, and other technical fields.
Since 1997, when Kanade et al. [34] coined the term virtualized reality and reconstructed real-life scenes involving humans using a large cluster of cameras, various systems have been developed that can digitize human subjects [13, 15, 21, 42, 50]. These systems typically deploy a network of about 6 to 15 synchronized cameras in a room-sized scene and record video at 15 to 30 frames per second (fps). With the popularity of digital cameras and the growth of video surveillance systems, both indoor and outdoor camera networks are nowadays becoming commonplace. However, modeling dynamic scenes outdoors is much more challenging, and many of the assumptions made by current systems need to be relaxed.
Currently, in all multi-camera systems [13, 15, 21, 42, 50], calibration and synchronization must be done during an offline calibration phase before the actual video is captured. Someone must be physically present in the scene with a specialized calibration object, which makes the process of camera deployment and acquisition fairly tedious. Multiple calibration sessions are often required, as there is no easy way to maintain the calibration over a long duration. We will present a technique that can recover all the necessary information from the recorded video streams. Our method recovers the calibration and synchronization of cameras observing an event from multiple viewpoints, analyzing only the silhouettes of moving objects in the video streams. This eliminates the need for explicit offline calibration and will ease camera deployment for digitizing time-varying events in the future.
Compared with the new silhouette calibration technique, shape from silhouette (SFS) reconstruction approaches have a much longer history [6, 36] because of their simplicity, speed, and general robustness in providing global shape and topology information about objects. A critical advantage of these methods over other multi-view reconstruction techniques, such as multi-view stereo [11, 35, 52], is that they do not require object appearance to be similar across views, thus bypassing tedious photometric calibration among all cameras in the system. Silhouette-based methods have their own challenges, however. Occlusions with background objects, whose appearance is also recorded in the background images, have a negative impact on silhouette-based modeling because extracted silhouettes can become incomplete; in particular, the inclusive property of visual hulls [36] with respect to the object they model is no longer guaranteed. Occlusion may also occur between the dynamic objects of interest: when several objects in the scene need to be modeled, naïve binary silhouette reasoning is prone to large visual ambiguities due to similar visual occlusion effects, leading to misclassification of significant portions of 3D space.
These limitations come on top of the usual sensitivities to noise, shadows, and lighting conditions. In short, silhouette-based methods usually produce decent results in controlled, man-made environments. Major challenges arise in applying silhouette-based approaches in uncontrolled, natural environments when arbitrary numbers of targets are to be modeled, either dynamically or statically. We show that the shape of static occluders in the interaction space of moving objects can be recovered online by accumulating occlusion cues from dynamic object motion. Also, by using distinct appearance models for dynamic objects, silhouette reasoning can be efficiently conducted. To handle both static occluders and dynamic objects, we propose a Bayesian sensor fusion framework that is shown to be effective and robust in general outdoor environments with densely populated scenes.
2.2 CAMERA NETWORK CALIBRATION AND SYNCHRONIZATION
In traditional camera calibration, images of a calibration target (an object whose geometry is known) are first acquired. Correspondences between 3D points on the target and their imaged pixels are then recovered (the target is built in a way to make this step easy). After this, the camera-resectioning problem is solved. This involves estimating the intrinsic and extrinsic parameters of the camera by minimizing the reprojection error of the 3D points on the calibration object. The Tsai camera calibration technique was popular in the past, but required a nonplanar calibration object with known 3D coordinates [67]. The resulting calibration object often had two or three orthogonal planes. This was difficult to construct and transport and made the overall process of camera calibration quite cumbersome. Zhang [72] proposed a more flexible planar calibration grid method in which either the planar grid or the camera can be freely moved. The calibration object is easily created by pasting a checkerboard pattern on a planar board that can be waved around in the scene. An implementation of this calibration method is provided by [9] and has become popular among computer vision researchers. While this method produces quite accurate calibration in realistic scenarios, the calibration process for large multi-camera systems can still be quite tedious. Often, with cameras placed all around a scene, the checkerboard can only be seen by a small group of cameras at one time. Hence, only a subset of cameras in the network can be calibrated in one session. By ensuring that these subsets overlap, it is possible to merge the results from multiple sessions and obtain the calibration of the full camera network. However, this requires extra work and makes the overall procedure quite error prone.
There is a newer method for multi-camera calibration [63]—one that uses a single-point calibration object in the form of a bright LED that is waved around the scene. The advantage it provides is that the LED can be simultaneously observed in all images irrespective of their configuration. In a fairly dark room, detecting the LED light source and establishing correspondences are easy and reliable. By moving the LED around, one can calibrate a larger volume than would be possible with a checkerboard. Motion capture sensors are often also used for multi-camera synchronization. In controlled scenes, a hardware trigger can be used to synchronize all the video cameras to ensure higher accuracy. A simple alternative is to use a clap sound to manually synchronize the videos, but this can be error prone for videos containing fast-moving subjects or in outdoor environments.
Although these traditional calibration methods are quite accurate, they require physical access to the observed space and involve an offline precalibration stage that precludes reconfiguration of cameras during operation (at least, an additional calibration session is needed). This is often impractical and costly for surveillance applications and can be impossible for remote camera networks or sensors deployed in hazardous environments. Meanwhile, significant progress has been made in the last decade in automatic feature detection and feature matching across images. Also, robust structure-from-motion methods that allow the recovery of 3D structures from uncalibrated image sequences [48] have been developed. In theory, these techniques could be used for multi-camera calibration. However, they require much more overlap between cameras than is available in camera networks. Live video from a camera network lacks such point correspondences because of large baselines between cameras and low overlap of background in the different views (an example is shown in Figure 2.1). However, since these cameras are set up to observe dynamic events, the silhouettes of moving foreground objects are often a prominent feature. We will show how the silhouettes of moving objects observed in two video streams can be used to recover the epipolar geometry of the corresponding camera pair. In fact, when the video streams are unsynchronized, our method can simultaneously recover both the epipolar geometry and the synchronization of the two cameras. In our method, we take advantage of the fact that a camera network observing a dynamic object records many different silhouettes, yielding a large number of epipolar constraints that need to be satisfied by every camera pair. At the core of our approach is a robust RANSAC-based algorithm [7] that computes the epipolar geometry by analyzing the silhouettes of a moving object in a video. In every RANSAC iteration, the epipole positions in the two images are randomly guessed and a hypothesis for the epipolar geometry is formed and efficiently verified using all the silhouettes available from the video. Random sampling is used for exploring the 4D space of possible epipole positions as well as for dealing with outliers in the silhouette data. Our algorithm is based on
FIGURE 2.1 (a) Moving person observed and recorded from four different viewpoints. (b) Four corresponding frames are shown along with the silhouettes that were extracted. Note that the silhouettes are noisy and the sequences are not synchronized.
FIGURE 2.2 (a) Frontier points and epipolar tangents for two views. (b) Several frontier points may exist on a human silhouette, but many will be occluded.
the constraint arising from the correspondence of frontier points and epipolar tangents for silhouettes in two views. This constraint was also used in [24, 46, 49, 69], but for specific camera motion or camera models, or in a situation where a good initialization was available. When a solid object is seen in two views, the only true point correspondences on the apparent contour occur at special locations called frontier points. In Figure 2.2(a), one pair of frontier points is denoted by x1 and x2, respectively. Note that the viewing rays that correspond to a matching pair of frontier points such as x1 and x2 must intersect at a true surface point in the tangent plane of the surface. The contour generators or rims must also intersect at such a surface point. This point, along with the camera baseline, defines an epipolar plane that must be tangent to the surface, giving rise to corresponding epipolar lines such as l1 and l2, which are tangent to the silhouettes at the frontier points. Frontier point correspondence does not extend to more than two views in general. A convex shape, fully visible in two views, can have exactly two pairs of frontier points. For a nonconvex shape such as a human figure, there can be several potential frontier points, but many of them will be occluded (see Figure 2.2(b)). If the location of the epipole in the image plane is known, matching frontier points can be detected by drawing tangents to the silhouettes. However, when the epipole location is unknown, it is difficult to reliably detect the frontier points. In [69] Wong
and Cipolla searched for outermost epipolar tangents for circular motion. In their case, the existence of fixed entities in the images, such as the horizon and the image of the rotation axis, simplified the search for epipoles. We too use only the extremal frontier points and outermost epipolar tangents because, for fully visible silhouettes, these are never occluded. Also, the extremal frontier points must lie on the convex hull of the silhouette in the two views. Furukawa et al. [24] directly searched for frontier points on a pair of silhouettes to recover the epipolar geometry. Their approach assumes an orthographic camera model and requires accurate silhouettes; it does not work unless there are at least four unoccluded frontier point matches in general position. Hernández et al. [32] generalized the idea of epipolar tangencies to the concept of silhouette coherence, which numerically measures how well a solid 3D shape corresponds to a given set of its silhouettes in multiple views. They performed camera calibration from silhouettes by solving an optimization problem where silhouette coherence is maximized. However, they only dealt with circular turntable sequences, which have few unknown parameters, so their optimization technique does not generalize to an arbitrary camera network. Boyer [10] also proposed a criterion that back-projected silhouette cones must satisfy such that the true object is enclosed within all of the cones. They used it for refining the calibration parameters of a camera network, but this requires good initialization.
2.2.1 Epipolar Geometry from Dynamic Silhouettes
Given nontrivial silhouettes of a human (see Figure 2.2(b)), if we can detect matching frontier points, we can use the 7-point algorithm to estimate the epipolar geometry by computing the fundamental matrix. However, it is difficult to directly find matching frontier points without knowing the epipoles or without some form of initialization. Therefore, we need to explore the full space of possible solutions. While a fundamental matrix has seven degrees of freedom (DOFs), our method randomly samples only in a 4D space because once the epipole positions are known, potential frontier point matches can be determined, and from them the remaining degrees of freedom of the epipolar geometry can be computed via an epipolar line homography. We propose a RANSAC-based approach that directly allows us to efficiently explore this 4D space as well as robustly handle noisy silhouettes in the sequence.
2.2.2 Related Work The recovery of camera pose from silhouettes was studied by various authors [33, 46, 68, 69], and recently there has been some renewed interest in the problem [10, 24, 32].
However, most of these techniques can be applied only in specific settings and have requirements that render them impractical for general camera networks observing an unknown dynamic scene. These include the requirement that the observed object be static [24, 33], the use of a specific camera configuration (at least partially circular) [32, 69], the use of an orthographic projection model [24, 68], and the need for a good initialization [10, 70].
Our Approach
For every frame in each sequence, a binary foreground mask of the object is computed using background segmentation techniques. Instead of explicitly storing the complete silhouette S, we compute and store only the convex hull HS and its dual representation, as shown in Figure 2.3, for every video frame. This is a very compact representation, as it allows us to efficiently compute outer tangents to silhouettes in long sequences containing potentially thousands of different silhouettes. The convex hull HS is represented by an ordered list of k 2D points in the image (v1 . . . vk in counter-clockwise order (CCW)). The 2D lines tangent to HS are parameterized by the angle θ = 0 . . . 2π (in radians) that the line subtends with respect to the horizontal direction in the image. For each vertex vk, an angular interval [θ1k, θ2k] is computed. This set represents all lines that are tangent to HS at the vertex vk. These tangent lines are directed—that is, they are consistently oriented with respect to the convex hull. Thus, for a direction θ, there is always a unique directed tangent lθ, which keeps tangency computations in our algorithm quite simple.
The basic idea in our approach is the following. To generate a hypothesis for the epipolar geometry, we randomly guess the position of eij and eji in the two views. This fixes four DOFs of the unknown epipolar geometry, and the remaining three DOFs can be determined by estimating the epipolar line homography Hij for the chosen epipole pair. To compute the homography, we need to obtain three pairs of corresponding epipolar lines (epipolar tangents in our case) in the two views. Every homography Hij satisfying the system of equations [ljk]× Hij lik = 0, where k = 1 . . . 3, is a valid solution. Note that these equations are linear in Hij and allow it to be estimated efficiently. In a RANSAC-like fashion [7], we then evaluate this hypothesis using all the silhouettes present in the sequences.
FIGURE 2.3 (a) Convex hull of the silhouette in a video frame. (b) Tangent table representation. (c) Space of all tangents to the convex hull parameterized by θ.
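The tangent-table lookup described above can be implemented directly from the edge directions of the convex hull. In the sketch below (function name ours), the angular interval of a vertex is bounded by the directions of its incoming and outgoing edges, and the returned line is oriented so that the hull lies on its left.

```python
import numpy as np

def directed_tangent(hull, theta):
    """Return (vertex index, line) for the directed tangent to a CCW convex hull
    with direction theta, oriented so the hull lies to the line's left.
    hull: (k, 2) array of vertices in counter-clockwise order.
    The line is returned in homogeneous form l = (a, b, c) with a*x + b*y + c = 0."""
    k = len(hull)
    edges = np.roll(hull, -1, axis=0) - hull                    # edge v_i -> v_{i+1}
    phi = np.arctan2(edges[:, 1], edges[:, 0]) % (2 * np.pi)    # edge directions
    theta = theta % (2 * np.pi)
    for i in range(k):
        lo, hi = phi[i - 1], phi[i]                             # angular interval of vertex i
        inside = (lo <= theta <= hi) if lo <= hi else (theta >= lo or theta <= hi)
        if inside:
            d = np.array([np.cos(theta), np.sin(theta)])        # line direction
            n = np.array([-d[1], d[0]])                         # left normal
            x0 = hull[i]
            return i, np.array([n[0], n[1], -n @ x0])
    raise RuntimeError("no tangent found (degenerate hull?)")
```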
Hypothesis Generation
At every RANSAC iteration, we randomly choose a pair of corresponding frames from the two sequences. In each of the two frames, we randomly sample two directions each and obtain outer tangents to the silhouettes with these directions. The first direction θ1 is sampled from the uniform distribution U(0, 2π), while the second direction θ2 is chosen as θ2 = θ1 − x, where x is drawn from the normal distribution N(π, σ²). For each of these directions, the convex hull of the silhouette contains a unique directed tangent. The two tangent lines in the first view are denoted li1 and li2, while those in the second view are denoted lj1 and lj2, respectively (these are shown in light gray in Figure 2.4(a)).¹ The intersections of the tangent pairs produce the hypothesized epipoles eij and eji in the two views. We next randomly select another pair of frames and compute outer tangents from the epipoles eij and eji to the silhouettes (actually their convex hulls) in both views. If there are two pairs of outer tangents, we randomly select one. This third pair of lines is denoted li3 and lj3, respectively (these are shown in dark gray in Figure 2.4(b)). Now Hij,
FIGURE 2.4 (a) Two random directions are sampled in each image in corresponding random frames. The intersection of the corresponding epipolar tangents generates the epipole hypothesis. (b) Outermost epipolar tangents to the new silhouette computed in another pair of randomly selected corresponding frames, shown in dark gray. The three pairs of lines can be used to estimate the epipolar line homography.
¹ If silhouettes are clipped, the second pair of tangents is chosen from another frame.
the epipolar line homography, is computed from the three corresponding lines² {lik ↔ ljk}, where k = 1 . . . 3. The quantities (eij, eji, Hij) form the model hypothesis in each iteration of our algorithm.
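Solving the linear system [ljk]× Hij lik = 0 for the epipolar line homography might look as follows (function names ours); any null vector of the stacked system is a valid solution, so the singular vector associated with the smallest singular value is returned.

```python
import numpy as np

def cross_matrix(v):
    """Skew-symmetric matrix [v]_x such that [v]_x w = v x w."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def epipolar_line_homography(lines_i, lines_j):
    """Estimate a 3x3 matrix H with lines_j[k] ~ H @ lines_i[k] for k = 0..2,
    by stacking the linear constraints [l_j]_x H l_i = 0 and taking a null vector.
    lines_i, lines_j: three corresponding homogeneous 2D lines per view."""
    A = []
    for li, lj in zip(lines_i, lines_j):
        # row block acting on the row-major vectorization of H
        A.append(np.kron(cross_matrix(lj), li.reshape(1, 3)))
    A = np.vstack(A)                 # 9 x 9
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```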
Model Verification
Each randomly generated model for the epipolar geometry is evaluated using all the data available. This is done by computing outer tangents from the hypothesized epipoles to the whole sequence of silhouettes in each of the two views. For unclipped silhouettes, we obtain two tangents per frame, whereas for clipped silhouettes there may be one or even zero tangents. Every epipolar tangent in the first view is transferred through Hij to the second view (see Figure 2.5), and the reprojected epipolar transfer error e is computed based on the shortest distance from the original point of tangency to the transferred line:

e = d(xi, lit) + d(xj, ljt)     (2.1)

where d(x, l) represents the shortest distance from a 2D point x to a 2D line l, and xi and xj represent the points of tangency in the two images which, when transferred to the other view, give rise to the epipolar lines ljt and lit, respectively. Figure 2.6 shows the typical error distributions. We use an outlier threshold, denoted by ϵo, to classify a certain hypothesis as good or bad. The value of ϵo is automatically computed (see below) and is typically 0.005 to 0.01 percent of the image width in pixels. The Kth quantile of the error distribution, denoted by eK, is computed (in all our experiments, K = 0.75, or 75 percent). If eK ≤ ϵo, then the epipolar geometry model is considered a promising candidate and is recorded. Frontier point pairs often remain stationary in video and give rise to duplicates. These are removed using a binning approach before computing the error distribution. The RANSAC-based algorithm looks for nS promising candidates. These candidates are then ranked based on the inlier count and the best ones are further refined. A stricter threshold ϵin of 1 pixel is used to determine the tangents that are inliers. While evaluating a
FIGURE 2.5 Hypothesized epipolar geometry model, used to compute the epipolar transfer error for all silhouettes in video. Here only a single frame is shown for clarity. The original outer tangents are shown in light gray while the transferred epipolar lines are shown in dark gray.
² There are two ways to pair {li1, li2} with {lj1, lj2}, and we generate both hypotheses.
FIGURE 2.6 (a) Error distribution for a good hypothesis. Note that the Kth quantile eK is much smaller than ϵo. (b) Error distribution for a bad hypothesis. Note that it is much more spread out and the Kth quantile eK is greater than ϵo.
hypothesis, we maintain a count of the tangents that exceed the outlier threshold ϵo and reject a hypothesis early when a partial outlier count indicates that the total expected outlier count is likely (i.e., with high probability) to be exceeded. This allows us to abort early whenever the model hypothesis is completely inaccurate, avoiding the redundancy of computing outer tangents from epipoles to all the silhouettes for many completely wrong hypotheses. The best 25 percent of the promising candidates are then refined using iterative nonlinear (Levenberg-Marquardt) minimization followed by guided matching. The total symmetric epipolar distance in both images is minimized. During guided matching, more inliers are included and the final solution is obtained when the inlier count stabilizes. In practice, many of the promising candidates from the RANSAC step, when iteratively refined, produce the same solution for the epipolar geometry, so we stop when three promising candidates converge to the same solution. The refined solution with the most inliers is the final one. The Frobenius norm of the difference of two normalized fundamental matrices is not a suitable measure for comparing two fundamental matrices, so we use the statistical measure proposed by Zhang [71].
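For reference, the transfer error of equation (2.1) for a single pair of tangency points can be computed as in the sketch below, assuming the tangents are represented as homogeneous lines and Hij maps lines of the first view to lines of the second; the function names are ours.

```python
import numpy as np

def point_line_distance(x, l):
    """Euclidean distance from 2D point x = (x, y) to the line l = (a, b, c), a*x + b*y + c = 0."""
    a, b, c = l
    return abs(a * x[0] + b * x[1] + c) / np.hypot(a, b)

def epipolar_transfer_error(x_i, x_j, l_i, l_j, H):
    """Transfer error of eq. (2.1) for one pair of tangency points.
    l_i, l_j: outer epipolar tangents (homogeneous lines) in the two views;
    x_i, x_j: the corresponding tangency points;
    H: epipolar line homography mapping lines of view i to lines of view j."""
    l_j_t = H @ l_i                        # tangent of view i transferred to view j
    l_i_t = np.linalg.solve(H, l_j)        # tangent of view j transferred back to view i
    return point_line_distance(x_i, l_i_t) + point_line_distance(x_j, l_j_t)
```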
Automatic Parameter Tuning
Our algorithm has a few critical parameters: the total number of RANSAC iterations N, the number of promising candidates nS, and the outlier threshold ϵo. We automatically determine these parameters from the data, making our approach completely automatic and convenient to use. The number of iterations N depends on the desired promising candidate count nS. N is chosen as min(n, NI), where n is the number of iterations required to find nS candidates (NI = 1 million in all our experiments). nS is determined by the outlier threshold ϵo. A tighter (i.e., lower) outlier threshold can be used to select very promising candidates, but such occurrences are rare. If the threshold is set higher, promising candidates are obtained more frequently, but at the cost of finding a few ambiguous ones as well. When this happens, a larger set of promising candidates must be analyzed. Thus, nS is set to max(2ϵo, 10). We automatically compute ϵo from the data during the preliminary RANSAC iterations. In the beginning, the hypothesis and verification iterations proceed as described earlier,
but these are only used to compute ϵo, and the promising candidates found at this stage are not used later. We start with a large value of ϵo (= 50 pixels in our implementation) and iteratively lower it as follows. We compare eK, the Kth quantile (K = 0.75), with the current value of ϵo. If eK < ϵo, we simply reset ϵo to the smaller value eK. If ϵo ≤ eK ≤ (ϵo + 1), then we increment a counter Co. If eK > (ϵo + 1), then the value of ϵo is not changed. We reset Co to zero whenever the threshold is lowered. If either ϵo falls below 5 pixels or Co becomes equal to 5, we accept the current estimate of ϵo as final.
2.2.3 Camera Network Calibration
We next consider the problem of recovering full camera calibration from pairwise epipolar geometries. Given a sufficient number of links in a graph (similar to the one shown in Figure 2.9 except that the links now represent estimates of fundamental matrices), our goal is to recover the Euclidean camera matrices for all cameras in the network. An overview of our approach is described in Figure 2.7. An important step in this approach is to compute an accurate projective reconstruction of the camera network from epipolar geometries and two-view matches. We start by first recovering a triplet of projective cameras from the fundamental matrices between the three views. Using an incremental approach, we add a new camera to the calibrated network by resolving a different triplet of cameras each time. Each time a new camera is added, all the parameters corresponding to the cameras and 3D points are refined using projective bundle adjustment. Finally, when a full projective reconstruction is available, standard techniques for self-calibration and Euclidean (metric) bundle adjustment are used to compute the final metric camera calibration.
In our silhouette-based calibration work, frontier point correspondences do not generalize to more than two views. In a three-view case, the frontier points in the first two views do not correspond to those in the last two views. Although three-view correspondences, called triple points, do exist on the silhouette as reported by [21, 37], they are hard to extract from uncalibrated images. Thus, we are restricted to only two-view correspondences over different pairs in our camera network and so cannot directly adopt an approach like that of [48]. Instead, we incrementally compute a full projective reconstruction of a camera network from these two-view correspondences and the corresponding fundamental matrices.
Levi and Werman [39] studied the following problem. Given only a subset of all possible fundamental matrices in a camera network, when is it possible to recover all the missing fundamental matrices? They were mainly concerned with theoretical analysis, and their algorithm is not suited for the practical implementation of computing projective reconstructions from sets of two-view matches in the presence of noise.

FIGURE 2.7 Overview of camera network calibration from epipolar geometries: pairwise epipolar geometries (view graph) → solve triplet → projective bundle adjustment → self-calibration → Euclidean bundle adjustment → metric calibration.
CHAPTER 2 Multi-View Calibration, Synchronization Resolving Camera Triplets
Given any two fundamental matrices between three views, it is not possible to compute three consistent projective cameras.The two fundamental matrices can be used to generate canonical projective camera pairs {P1 , P2 } and {P1 , P3 }, respectively. However these do not correspond to the same projective frame. P3 must be chosen in the same projective frame as P2 , and the third fundamental matrix is required to enforce this. These independently estimated fundamental matrices are denoted F12 , F13 , and F23 , while the unknown projective cameras are denoted P1 , P2 , and P3 , respectively (see Figure 2.8). The three fundamental matrices are said to be compatible when they satisfy the following constraint: T T T e23 Fe13 ⫽ e31 Fe21 ⫽ e32 Fe12 ⫽ 0
(2.2)
The three fundamental matrices available in our case are not compatible because they were independently estimated from two-view correspondences. A linear approach for computing P1 , P2 , and P3 from three compatible fundamental matrices is described in [31]. However, it is not suitable when the fundamental matrices are not compatible, as in our case. We now describe our linear approach to compute a consistent triplet of projective cameras. As described in [31], given F12 , canonical projective cameras, P1 and P2 as well as P3 can be chosen as follows: P1 ⫽ [I|0]
P2 ⫽ [[e21 ]⫻ F12 |e21 ]
(2.3)
P3 ⫽ [[e31 ]⫻ F13 |0] ⫹ e31 v T
P3 has been defined up to an unknown 4-vector v (equation 2.3). By expressing F23 as a function of P2 and P3 , we obtain the following: F23 ⫽ [[e32 ]⫻ P3 P2⫹
(2.4)
The expression for F23 is linear in v. Hence, all possible solutions for F23 span a 4D subspace of P8 [39]. We solve for v, which produces the solution closest to the measured F23 in the 4D subspace. P3 can now be computed by substituting this value of v into equation 2.3. The resulting P1 , P2 , and P3 are fully consistent with F12 , F13 , and the matrix F23 computed before. 3
3 F13
F13
F23
F23
q
F23 1
F12 (a)
2
1
F12 (b)
k
p
2
Gk⫺1
Gk (c)
FIGURE 2.8 (a) Three nondegenerate views for which the fundamental matrices have been estimated independently. (b) Family of solutions for the third fundamental matrix (F23 ), compatible with the other two (F12 and F13 ). We look for a compatible solution closest to the measured F23 . (c) New camera k incrementally linked to a calibrated network by resolving a suitable triplet involving two cameras within Gk⫺1 .
2.2 Camera Network Calibration and Synchronization
41
In order to choose F12 , F13 , and F23 for this approach, we must rank the three fundamental matrices based on an accuracy measure; the least accurate one is assigned to be F23 while the choice of the other two does not matter. To rank the fundamental matrices based on the accuracy of their estimates, their inlier spread score sij is computed as follows: sij ⫽
(u,v)∈Pi
|u ⫺ v|2 ⫹
|u ⫺ v|2
(u,v)∈Pj
Here Pi and Pj represent the set of 2D point correspondences in views i and j that forms the set of inliers for the corresponding fundamental matrix Fij . A higher inlier spread score indicates that Fij is stable and accurate. The score is proportional to the inlier count, but also captures the spatial distribution of the 2D inliers. Our method works only when the camera centers for the three cameras are not collinear. This degenerate configuration can be detected by analyzing the location of the six epipoles (when all three camera centers are collinear, eij ⫽ eik for various permutations of the three views). In our method, when a degenerate triplet is detected, we reject it and look for the next best possibility. For most camera networks (all the data sets we tried), cameras were deployed around the subject and collinearity of camera centers was never a problem.
Incremental Construction Our incremental approach to projective reconstruction starts by greedily choosing a set of three views for which the fundamental matrices are, relatively, the most accurate. As described in the previous section, this triplet is resolved, resulting in a partial projective reconstruction of three cameras. Next, cameras are added one at a time using the approach described next. The process stops when either all cameras have been added or no more cameras can be added to the network because of insufficient links (fundamental matrices). Given Gk⫺1 , a camera network with (k ⫺ 1) cameras, we first need to choose the camera that will be added to the calibrated network. For this, we inspect the links (epipolar geometries) between cameras that belong to Gk⫺1 and those that have not been reconstructed yet. The camera chosen for reconstruction is denoted by k, and the two cameras in Gk⫺1 corresponding to the two links are denoted by p and q, respectively. Thus for cameras p and q in Gk⫺1 and k, the new view, we now reconstruct a triplet of consistent projective cameras from Fpk , Fqk , and Fpq (here Pk plays the role of P3 ). Since the fundamental matrix corresponding to any pair within Gk⫺1 can be computed, the choice of p and q is irrelevant because all projective cameras are known. Finally, the computed projective camera Pk is transformed into the projective frame of Gk⫺1 and added. This produces a complete projective reconstruction of Gk , the camera network with the added new view. For a network with N cameras in general position, this method will work if a sufficient number of links are present in the camera network graph. The various solvable cases are discussed in Levi and Werman [39]. In our case, resolving the initial triplet requires three links; every subsequent view that is added requires at least two links. Thus, the minimum number of unique links that must be present in the graph is 3 ⫹ 2(N ⫺ 3) ⫽ 2N ⫺ 3.
42
CHAPTER 2 Multi-View Calibration, Synchronization
When more links are available in the graph, our ranking procedure chooses the best ones and the less accurate links may never be used.
2.2.4 Computing the Metric Reconstruction Every time a new camera is added, a projective bundle adjustment is done to refine the calibration of all cameras in the partial network.This prevents error accumulation during the incremental construction. Camera networks are typically small, containing 8 to 12 cameras; therefore, running the projective bundle adjustment multiple times is not an issue. Once a full projective reconstruction of the camera network has been computed, a linear self-calibration algorithm [48] is used to upgrade from a projective to a metric reconstruction. Finally, an Euclidean bundle adjustment is done by parameterizing the cameras in terms of the intrinsic and extrinsic parameters. In all cases, we constrain the camera skew to be zero but impose no other parameter constraints. Depending on the exact scenario, other constraints could be enforced at this step for higher accuracy—for example, enforcing a fixed aspect ratio of pixels and enforcing the principal point to be at the center of the image. For higher accuracy, radial distortion in the images should also be modeled in the Euclidean bundle adjustment, which typically further reduces the final reprojection error. However, estimation of radial distortion was not done in our current work and will be addressed in future work.
2.2.5 Camera Network Synchronization When the video sequences are unsynchronized, the constraints provided by the epipolar tangents still exist, but up to an extra unknown parameter—the temporal offset ⌬t. We assume that the cameras are operating at a constant and known frame rate, which is often the case with popular video cameras. The algorithm presented earlier is extended as follows. A random hypothesis is generated by sampling an extra dimension—a possible range of temporal offsets in addition to the 4D space of epipoles. The algorithm now requires more hypotheses than in the synchronized case before a stable solution can be found, but a multi-resolution approach for computing the temporal offset speeds it up considerably. See Figure 2.9 (b) for a distribution of candidate solutions for the temporal offsets. The uncertainty of the estimate is also computed from such a distribution. For full network synchronization, every camera can be thought to have an independent timer where the time differences can be measured in frame alignment offsets, as all cameras are assumed to be operating at a constant, known frame rate. We represent the sensor network by a directed graph G(V , E) as shown in Figure 2.9. There are N sensors, and each node vi ∈ V has a timer denoted by xi . A directed edge in this network, eij ∈ E, represents an independent measurement of the time difference xj ⫺ xi between the two timers. Each estimate tij has an associated uncertainty represented by the standard deviation ij , which is inversely proportional to the uncertainty. When G represents a tree—that is, it is fully connected and has N ⫺ 1 edges—it is possible to synchronize the whole network. When additional edges are available, each provides a further constraint, which leads to an overdetermined system of linear equations. Each edge contributes a linear constraint of the form xi ⫺ xj ⫽ tij . Stacking these equations produces a |E| ⫻ N system of linear equations. A maximum likelihood estimate
2.2 Camera Network Calibration and Synchronization Good hypotheses 119 103 92 82 73 59 44 37 23 13
2100 275 250 225 0 25 50 75 100 125 Temporal offset (# frames)
1
2
3
4
5 Number of trials (in millions)
(a) G (V, E)
xi C (tij , ij) xj (b)
FIGURE 2.9 (a) Distribution of temporal offset candidates for a pair in the MIT sequence. (b) Edges in the camera network graph represent pairwise offset measurements.
of the N timers can be obtained by solving this system using weighted least squares (each equation is multiplied by the factor 1ij ). The timer estimates (the first camera is fixed at zero) are optimal provided no outliers are present in the edges being considered. It is fairly easy to detect edges in the network. A consistent network should outlier e ⫽ 0 ᭙ cycles C ∈ G. For every edge e ∈ E, we check the satisfy the constraint e∈C sum of edges for cycles of length 3 that also contain the edge e. An outlier edge will have a significantly large number of nonzero sums and could be easily detected and removed. This method will produce very robust estimates for complete graphs because N22⫺N linear constraints are available for N unknowns. A fully connected graph with at least N ⫺ 1 edges is still sufficient to synchronize the whole network, although the estimates in this case will be less reliable. Experiments on network synchronization were performed on the MIT sequence. The subframe synchronization offsets from the first to the other three sequences were found to be 8.50, 8.98, and 7.89 frames, respectively, while the
43
44
CHAPTER 2 Multi-View Calibration, Synchronization Pair
(a)
tij
ij
tij
True (tij)
e01
28.7
0.80 28.50
28.32
e02
28.1
1.96 28.98
28.60
e03
27.7
1.57 27.89
27.85
e12
20.93 1.65 20.48
20.28
e13
0.54 0.72
0.61
0.47
e23
1.20 1.27
1.09
0.75
(b)
FIGURE 2.10 (a) Camera network graph for the MIT sequence. (b) Table of pairwise offsets and uncertainties (tij 1 and ij ) and final estimates (t ij ). These are within 13 of a frame (i.e., 100 of a second within the ˆ ground truth (tij )).
corresponding ground truth offsets were 8.32, 8.60, and 7.85 frames, respectively (see Figure 2.10(b)).
Silhouette Interpolation Visual hull methods typically treat the temporal offset between multiple video streams as an integer and ignore subframe synchronization. Given a specific frame from one video 1 stream, the closest frame in other 30-Hz video streams could be as far off in time as 60 seconds.While this might seem small at first, it can be significant for a fast-moving person. This problem will be illustrated later, where the visual hull was reconstructed from the closest original frames in the sequence.The gray area in Figure 2.10(a) represents what is inside the visual hull reconstruction, and the white area corresponds to the reprojection error (points inside the silhouette in one view carved away from other views). Subframe offsets need to be considered to perfectly synchronize the motion of the arms and the legs. To deal with this problem, we propose temporal silhouette interpolation. Given two adjacent frames i and i ⫹ 1 from a video stream, we compute the signed distance map in each image such that the boundary of the silhouette represents the zero level set in each case. Let us denote these distance maps by di (x) and di⫹1 (x), respectively. Then, for a subframe temporal offset ⌬ ∈ [0, 1], we compute an interpolated distance map denoted S(x) ⫽ (1 ⫺ ⌬)di (x) ⫺ ⌬di⫹1 (x). Computing the zero level set of S(x) produces the interpolated silhouette. This simple scheme, motivated by the work of Curless [17], robustly implements linear interpolation between two silhouettes without explicit pointto-point correspondence. However, it is approximate and does not preserve shape.Thus, it can be applied only when the inter-frame motion in the video streams is small.
2.2.6 Results We now show the results of our camera network calibration and visual hull reconstruction on a number of multi-view video data sets (Figure 2.11 lists the relevant details).
2.2 Camera Network Calibration and Synchronization Data Set
Views
Frames
Pairs
Residual
MIT [50]
4
7000
5 out of 6 pairs
0.26 pixels
Dancer (INRIA)
8
200
20 out of 28 pairs
0.25 pixels
Man (INRIA)
5
1000
10 out of 10 pairs
0.22 pixels
Kung-fu [13]
25
200
268 out of 300 pairs
0.11 pixels
Ballet [13]
8
468
24 out of 28 pairs
0.19 pixels
Breakdancer [60]
6
250
11 out of 15 pairs
0.23 pixels
Boxer [5]
4
1000
6 out of 6 pairs
0.22 pixels
FIGURE 2.11 Details of various data sets captured by various researchers. We remotely calibrated and reconstructed the human actors from the uncalibrated video footage. The number of accurate epipolar geometry estimates in the camera network graph and the final reprojection error after metric reconstruction are also reported.
45
46
CHAPTER 2 Multi-View Calibration, Synchronization
(a)
(b)
(c)
(d)
FIGURE 2.12 (a) Epipolar geometry recovered using our method for a particular camera pair in the Boxer data set. (b) Checkerboard image pair used for evaluation only. Ground truth epipolar lines are shown in black. Epipolar lines for our fundamental matrix estimate are shown in dark gray and light gray. (c,d) Final metric reconstruction of the camera network and 4953 reconstructed frontier points from the Boxer sequence. The final reprojection error was 0.22 pixels and the image resolution was 1032 ⫻ 788 pixels.
Although all the experiments involved video streams observing human subjects, both our calibration and reconstruction approaches are completely general and can be used for reconstructing time-varying events involving any solid nonrigid shape. We tested our method on a synthetic 25-view data set and 8 real data sets, acquired by various computer vision researchers in their own labs using different camera configurations. We were able to recover the full calibration of the camera network in all these cases without prior knowledge or control over the input data. Figure 2.12 is one example. We thus show that it is possible to remotely calibrate a camera network and reconstruct time-varying events from archived video footage with no prior information about the cameras or the scene. We evaluated our method by comparing the calibration with ground truth data for the Kung-fu sequence (see Figure 2.13). Since the metric reconstruction of the camera network obtained by our method is in an arbitrary coordinate system, it first needs to be scaled and robustly aligned to the ground truth coordinate frame.The final average reprojection error in the 25 images was 0.11 pixels, and the reconstructed visual hull of the Kung-fu character is visually as accurate as that computed from ground truth. Figure 2.14 shows the camera network reconstructions from various real data sets. Corresponding input video frames are shown along with the visual hull computed using the recovered calibration. The 3D geometry of the camera network is also shown. By reconstructing
2.2 Camera Network Calibration and Synchronization
(a)
(b)
(c)
(d)
(e)
FIGURE 2.13 (a) Epipolar geometries between the first and remaining 24 views for the Kung-fu data set. The four views shown with gray borders had incorrect estimates initially but were automatically corrected after projective reconstruction. The dotted epipolar lines show the incorrect epipolar geometry; the correct ones are shown with solid lines. 3D models of the Kung-fu character computed using (b) ground truth calibration and (c) our calibration. (d) 3D registration showing the accuracy of our model and calibration. (e) Camera network registered to the coordinate frame of the ground truth data.
the visual hull, we illustrate the accuracy of the recovered camera calibration and demonstrate that such dynamic scenes can now be reconstructed from uncalibrated footage.
47
48
CHAPTER 2 Multi-View Calibration, Synchronization
(a)
(c)
(e)
(b)
(d)
(f )
FIGURE 2.14 Metric 3D reconstructions from six different data sets: (a) Kung-fu, (b) Boxer, (c) Breakdancer, (d) Dancer, (e) Man, (f) Ballet.
Figure 2.15(a) shows the metric 3D reconstruction for the four-view MIT sequence.To test the accuracy of the calibration that was computed, we projected the visual hull back into the images. Inaccurate calibration, poor segmentation, or lack of perfect synchronization could give rise to empty regions (white pixels) in the silhouettes. In our case, the silhouettes were mostly filled, except for fast-moving parts where the reprojected visual hull was sometimes a few pixels smaller (see Figure 2.15(a)). This arises mostly when subframe synchronization offsets are ignored or is due to motion blur or shadows. For higher accuracy, we computed visual hulls from interpolated silhouettes. The silhouette interpolation was performed using the subframe synchronization offsets computed earlier for this sequence (see Figure 2.10(b)). An example is shown in
2.3 Dynamic Scene Reconstruction from Silhouette Cues
2
4
3 1
(a)
(b)
(c)
FIGURE 2.15 Metric 3D reconstructions of the MIT sequences. (a) Visual hull is reprojected into the images to verify the accuracy. (b,c) Silhouette interpolation using subframe synchronization reduces reprojection errors.
Figure 2.15(b). Given three consecutive frames, we generated the middle one by interpolating between the first and the third and compared it to the actual second frame. Our interpolation approach works reasonably for small motion, as would be expected in video captured at 30 frames per second. Figure 2.15(c) shows the visual hull reprojection error with and without subframe silhouette interpolation. In the two cases, the reprojection error decreased from 10.5 to 3.4 percent and from 2.9 to 1.3 percent of the pixels inside the silhouettes in the four views.
2.3 DYNAMIC SCENE RECONSTRUCTION FROM SILHOUETTE CUES With moving objects in the scene, it is almost impossible for a single-view system to obtain enough spatial coverage of the objects at one time instant. Given a geometrically calibrated camera network using the technique just introduced, 3D dynamic scene reconstruction can be achieved up to a higher quality level. Two well-known categories of multi-view reconstruction algorithms are shape-fromsilhouette (SFS) [21, 36, 37] techniques that compute an object’s coarse shape from its silhouettes, and shape-from-photo consistency [35, 52, 57], which consists of usually
49
50
CHAPTER 2 Multi-View Calibration, Synchronization
volumetric methods that recover the geometry of complete scenes using photo consistency constraint [35]. Multi-view stereo [51] also relies on photo consistency to recover dense correspondence across views and to compute scene depth. However, shape-from-photo consistency methods as well as dense stereo methods rely on camera color consistency to some extent. SFS methods have their advantages. They rely on silhouettes generated from background models from each camera view, so they do not require color correspondence between cameras. No tedious photometric calibration step is required. The previously presented geometric calibration thus provides sufficient information for this type of reconstruction. Classic viewing cone intersection algorithms can be used to reconstruct the scene from perfect silhouettes, and are practical for a number of applications because they are very fast to compute. However, in outdoor environments varying lighting conditions, shadowing, and nonsalient background motion are challenges to robust silhouette segmentation that challenge the assumption of perfect silhouettes. Researchers have addressed the robustness and sensitivity problem [61], and a probabilistic visual hull framework has been proposed [22]. In addition to those issues, SFS methods suffer from occlusion problems: Occluded camera views produce incomplete silhouettes of the object being reconstructed. Because of the intersection rule, such silhouette corruption results in the construction of incomplete visual hulls, which is the case, when the reconstructed object is occluded by the static obstacles that happen to be learned as part of the background in the first place. Occlusions in general also occur when parts of a reconstructed object are visually blocking other parts (self-occlusion) and when one object is blocking another (inter-occlusion). With the increase of such occlusions, the discriminatory power of silhouettes for space occupancy decreases, resulting in silhouette-based volumes much larger than the real objects they are meant to represent. All occlusion types thus negatively affect the quality of the final reconstruction result, yet they are very common and almost unavoidable in natural environments.Therefore, if we are going to explore the feasibility of SFS methods in uncontrolled real-world scenes, we need to overcome this critical problem, as well as those aforementioned.
2.3.1 Related Work Silhouette-based modeling in calibrated multi-view sequences has been quite popular, and has yielded a number of approaches to building volume-based [65] or surface-based [6] representations of an object’s visual hull. The difficulty and focus in modeling objects from silhouettes have gradually shifted from pure 3D reconstruction to the sensitivity of visual hull representations to silhouette noise. In fully automatic modeling systems, silhouettes are usually extracted using background subtraction techniques [14, 58], which are difficult to apply outdoors and often locally fail because of changing light, shadows, color space ambiguities, and background object–induced occlusion, among other conditions. Several solutions to these problems have been proposed, including a discrete optimization scheme [59], silhouette priors over multi-view sets [26], and silhouette cue integration using a sensor fusion paradigm [22]. However, most existing reconstruction methods focus on mono-object situations and fail to address the specific multi-object
2.3 Dynamic Scene Reconstruction from Silhouette Cues
(a)
(b)
FIGURE 2.16 Principle of multi-object silhouette reasoning for shape-modeling disambiguation. (Best viewed in color; see companion Website.)
issues of silhouette methods. While inclusive of the object’s shape [36], visual hulls fail to capture object concavities but are usually very good at hinting toward the overall topology of a single observed object. This property was successfully used in a number of photometric methods to carve an initial silhouette-based volume [23, 56]. This ability to capture topologies breaks with the multiplicity of objects in the scene. In such cases two silhouettes are ambiguous in distinguishing between regions actually occupied by objects and unfortunate silhouette-consistent “ghost”regions. Ghosts occur when the configuration of the scene is such that regions of space occupied by objects of interest cannot be disambiguated from free-space regions that also happen to project inside all silhouettes (see the polygonal gray region in Figure 2.16(a)). Ghosts are increasingly likely as the number of observed objects rises, when it becomes more difficult to find views that visually separate objects in the scene and to carve out unoccupied regions of space. Ghost regions were analyzed in the context of tracking applications to avoid committing to a “ghost” track [47]. The method we propose casts the problem of silhouette modeling at the multi-object level, where ghosts can naturally be eliminated based on per-object silhouette consistency. Multi-object silhouette reasoning was applied in the context of multi-object tracking [20, 44]. The reconstruction and occlusion problem was also studied for the specific case of transparent objects [8]. Besides the inter-occlusion between dynamic object entities mentioned above, a fundamental problem with the silhouette-based modeling is the occlusion problem associated with static objects whose appearance is also recorded in background images/models. Such occlusions have a negative impact because extracted silhouettes can become incomplete. In particular, the inclusive property of visual hulls [36] with respect to the object being modeled is no longer guaranteed. Generally detecting and accounting for this type of occlusion has attracted the attention of researchers for problems such as structure from motion [19] and motion and occlusion boundary detection [2]. However, the scope of these works is limited to extraction of sparse 2D features such as T-junctions or edges to improve the robustness of data estimation. Inter-object occlusions were implicitly modeled in the context of voxel coloring approaches, using an iterative scheme with semitransparent voxels and multiple views of a scene from the same time instant [8].
51
52
CHAPTER 2 Multi-View Calibration, Synchronization In the following sections, we first introduce the fundamental probabilistic sensor fusion framework and the detailed formulations (Section 2.3.2). We then describe additional problems when placing all math expressions together as an automatic system, such as appearance automatic initialization, and tracking the dynamic objects’ motions (Section 2.3.3). Specifically, we discuss how to initialize the appearance models and keep track of the motion and status of each dynamic object. Finally, we show the results of the proposed system and algorithm on completely real-world data sets (Section 2.3.4). Despite the challenges in the data sets, such as lighting variation, shadows, background motion, reflection, dense population, drastic color inconsistency between views, and so forth, our system produces high-quality reconstructions.
2.3.2 Probabilistic Framework Using a volume representation of the 3D scene, we process multi-object sequences by examining each voxel in the scene using a Bayesian formulation that encodes the noisy causal relationship between the voxel and the pixels that observe it in a generative sensor model. In particular, given the knowledge that a voxel is occupied by a certain object among m possible in the scene, the sensor model explains what appearance distributions we are supposed to observe that correspond to that object. It also encodes state information about the viewing line and potential obstructions from other objects, as well as a localization prior used to enforce the compactness of objects, which can be used to refine the estimate for a given instant of the sequence. Voxel sensor model semantics and simplifications are borrowed from the occupancy grid framework explored in the robotics community [18, 43]. We also formulate occlusion inference as a separate Bayesian estimate, for each voxel in a 3D grid sampling the acquisition space, of how likely it is to be occupied by a static occluder object. By modeling the likely responses in images to a known state of the scene through a generative sensor model, the strategy is then to use Bayes’ rule to solve the inverse problem and find the likely state of occluder occupancy given noisy image data. Although theoretically possible, a joint estimation of the foreground and occluder object shapes would turn the problem into a global optimization over the conjunction of both shape spaces, because estimation of a voxel’s state reveals dependencies with all other voxels on its viewing lines with respect to all cameras. To benefit from the locality that makes occupancy grid approaches practical and efficient, we break the estimation into two steps: For every time instant, we first estimate the occupancy of the visual hulls of dynamic objects from silhouette information; then we estimate the occluder occupancy in a second inference, using the result of the first estimation as a prior for dynamic object occupancy. However, we describe the static occluder inference first because it deals with only one problem, the visual occlusions, whereas for the dynamic objects inference, there are many more complicated problems, such as appearance learning and object tracking. Also, it is easier this way to introduce the inter-occlusion between the dynamic objects.
Static Occluder To introduce the static occluder formulation, we assume for simplicity just one dynamic object in the scene for now. Later, in Section 2.3.4, we show real-world data sets in which multiple dynamic objects and static occluder(s) are dealt with in the same scenes.
2.3 Dynamic Scene Reconstruction from Silhouette Cues 3D scene lattice X Ln L1
L2 ... View n
View 1
View 2 (a)
Independent of time 1 O2 t On
B1
Time 1 1fl G2
t
O2
1
t O1 1 O1 On 1 1 O O n t O1 O2t On 1 O1t O2
B2
Gn1 1
G1
Time t G2t
1
G1 G1
tγ Gn
...
1
G1
Bn
G1t
1
In I12
I 11
G
1
Gn
I 1t
G1t
t
G2t
Gnt t
In I 2t
(b)
FIGURE 2.17 Problem overview: (a) Geometric context of voxel X. (b) Main statistical variables to infer the occluder occupancy probability of X. Gt , Gˆit , Gˇit : dynamic object occupancies at relevant voxels at, in front of, and behind X, respectively. O , Oˆ it , Oˇ it : static occluder occupancies at, in front of, and behind X. Iit , Bi : colors and background color models observed where X projects in images.
Consider a scene observed by n calibrated cameras. We focus on the case of one scene voxel with 3D position X among the possible coordinates in the lattice chosen for scene discretization. The two possible states of occluder occupancy at this voxel are expressed using a binary variable O . This state is assumed to be fixed over the entire experiment in this setup, under the assumption that the occluder is static. Clearly, the regions of importance to infer O are the n viewing lines Li , i ∈ {1, . . . , n}, as shown in Figure 2.17(a). Scene states are observed for a finite number of time instants t ∈ {1, . . . , T }. In particular, dynamic visual hull occupancies of voxel X at time t are expressed by a binary statistical variable Gt , treated as an unobserved variable to retain the probabilistic information given by Franco and Boyer [22].
Observed Variables The voxel X projects to n image pixels xi , i ∈ 1, . . . , n, whose color observed at time t in view i is expressed by the variable Iit . We assume that static background images were observed free of dynamic objects and that the appearance and variability of background colors for pixels xi were recorded and modeled using a set of parameters Bi . Such observations can be used to infer the probability of dynamic object occupancy in the absence of background occluders. The problem of recovering occluder occupancy is
53
54
CHAPTER 2 Multi-View Calibration, Synchronization more complex because it requires modeling interactions between voxels on the same viewing lines. Relevant statistical variables are shown in Figure 2.17(b).
Viewing Line Modeling Because of potential mutual occlusions, we must account for other occupancies along the viewing lines of X to infer O . These can be either other static occluder states or dynamic object occupancies that vary across time. Several such occluders or objects can be present along a viewing line, leading to a number of possible occupancy states for voxels on the viewing line of X. Accounting for the combinatorial number of possibilities for voxel states along X’s viewing line is neither necessary nor meaningful—first, because occupancies of neighboring voxels are fundamentally correlated to the presence or absence of a single common object, and, second, because the main useful information we need to make occlusion decisions about X is whether something is in front of it or behind it regardless of where along the viewing line. With this in mind, we model each viewing line using three components, that model (1) the state of X, (2) the state of occlusion of X by anything in front, and (3) the state of what is at the back of X. We model the front and back components by extracting the two most influential modes in front of and behind X, which are given by two voxels Xˆ it and Xˇ it . We select Xˆ it as the voxel at time t that most contributes to the belief that X is obstructed by a dynamic object along Li , and Xˇ it as the voxel most likely to be occupied by a dynamic object behind X on Li at time t.
Viewing Line Unobserved Variables With this three-component modeling come a number of related statistical variables, as illustrated in Figure 2.17(b). The occupancy of voxels Xˆ it and Xˇ it by the visual hull of a dynamic object at time t on Li is expressed by two binary state variables, respectively Gˆit and Gˇit . Two binary state variables Oˆ it and Oˇ it express the presence or absence of an occluder at voxels Xˆ it and Xˇ it , respectively. Note the difference in semantics between the two variable groups Gˆit , Gˇit and Oˆ it , Oˇ it . The former designates dynamic visual hull occupancies of different time instants and chosen positions; the latter expresses static occluder occupancies, whose position only was chosen in relation to t. Both need to be considered because they both influence the occupancy inference and are not independent. For legibility, we occasionally refer to the conjunction of a group of variables by dropping indices and exponents—for example, G ⫽ {G1 , . . . , GT }, B ⫽ {B1 , . . . , Bn }.
Joint Distribution As a further step toward a tractable solution to occlusion occupancy inference, we describe the noisy interactions between the variables considered, through the decomposition of their joint probability distribution p(O, G , Oˆ , Gˆ , Oˇ , Gˇ , I , B ). We propose the following: p(O )
T t⫽1
p(Gt |O )
n
p(Oˆ it )p(Gˆit |Oˆ it )p(Oˇ it )p(Gˇit |Oˇ it )
i⫽1
p(Iit |Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit , Bi ).
(2.5)
2.3 Dynamic Scene Reconstruction from Silhouette Cues p(O ), p(Oˆ it ), and p(Oˇ it ) are priors of occluder occupancy.We set them to a single constant distribution Po , which reflects the expected ratio between occluder and nonoccluder voxels in a scene. No particular region of space is to be favored a priori.
Dynamic Occupancy Priors
p(Gt |O ), p(Gˆit |Oˆ it ), and p(Gˇit |Oˇ it ) are priors of dynamic visual hull occupancy with identical semantics. This choice of terms reflects the following modeling decisions. First, the dynamic visual hull occupancies involved are considered independent of one another, as they synthesize the information of three distinct regions for each viewing line. However, they depend on knowledge of occluder occupancy at the corresponding voxel position because occluder and dynamic object occupancies are mutually exclusive at a given scene location. Importantly, however, we have direct access not to the dynamic object occupancies but to the occupancies of the object’s visual hull. Fortunately, this ambiguity can be adequately modeled in a Bayesian framework by introducing a local hidden variable C expressing the correlation between dynamic and occluder occupancy: p(Gt |O ) ⫽
p(C )p(Gt |C , O )
(2.6)
C
We set p(C ⫽ 1) ⫽ Pc using a constant expressing our prior belief about the correlation between visual hull and occluder occupancy. The prior p(Gt |C , O ) explains what we expect to know about Gt given the state of C and O : p(Gt ⫽ 1|C ⫽ 0, O ⫽ ) ⫽ PGt ᭙
(2.7)
p(Gt ⫽ 1|C ⫽ 1, O ⫽ 0) ⫽ PGt
(2.8)
p(Gt ⫽ 1|C ⫽ 1, O ⫽ 1) ⫽ Pgo
(2.9)
with PGt as the prior dynamic object occupancy probability as computed independently of occlusions [22], and Pgo set close to 0 to express that it is unlikely that the voxel is occupied by dynamic object visual hulls when it is known to be occupied by an occluder and both dynamic and occluder occupancies are known to be strongly correlated (2.9). The probability of visual hull occupancy is given by the previously computed occupancy prior, in case of noncorrelation (2.7) or by the instance when the states are correlated but occluder occupancy is known to be empty (2.8).
Image Sensor Model
The sensor model p(Iit |Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit , Bi ) is governed by a hidden local per-pixel process S .The binary variable S represents the hidden silhouette detection state (0 or 1) at this pixel. This is unobserved information and can be marginalized, given an adequate split into two subterms: p(Iit |Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit , Bi ) ⫽ p(Iit |S , Bi )p(S |Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit ) S
(2.10)
55
56
CHAPTER 2 Multi-View Calibration, Synchronization where p(Iit |S , Bi ) indicates the color distribution we expect to observe given knowledge of silhouette detection and the background color model at this pixel. When S ⫽ 0, the silhouette is undetected and thus the color distribution is dictated by the pre-observed background model Bi (considered Gaussian in our experiments). When S ⫽ 1, a dynamic object’s silhouette is detected, in which case our knowledge of color is limited, thus, we use a uniform distribution in this case, favoring no dynamic object color a priori. The second part of the sensor model, p(S |Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit ), indicates what silhouette state is expected to be observed given the three dominant occupancy state variables of the corresponding viewing line. Since these are encountered in the order of visibility Xˆ it , X, Xˇ it , the following relations hold: p(S |{Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit }⫽{o, g, k, l, m, n}, Bi ) ⫽ p(S |{Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit }⫽{0, 0, o, g, p, q}, Bi ) ⫽ p(S |{Oˆ it , Gˆit , O, Gt , Oˇ it , Gˇit }⫽{0, 0, 0, 0, o, g}, Bi )
(2.11)
⫽ PS (S |o, g) ᭙(o, g) ⫽ (0, 0) ᭙(k, l, m, n, p, q)
These expressions convey two characteristics: first, that the form of this distribution is given by the first non-empty occupancy component in the order of visibility, regardless of what is behind this component on the viewing line; second, that the form of the first non-empty component is given by an identical sensor prior PS (S |o, g). We set the four parametric distributions of PS (S |o, g) as the following: PS (S ⫽ 1|0, 0) ⫽ Pfa PS (S ⫽ 1|1, 0) ⫽ Pfa
(2.12)
PS (S ⫽ 1|0, 1) ⫽ Pd
(2.13)
PS (S ⫽ 1|1, 1) ⫽ 0.5
where Pfa ∈ [0, 1] and Pd ∈ [0, 1] are constants expressing the prior probability of false alarm and the probability of detection, respectively.They can be chosen once for all data sets, as the method is not sensitive to the exact value of these priors. Meaningful values for Pfa are close to 0, while Pd is generally close to 1. Equation 2.12 expresses the case where no silhouette is expected to be detected in images—that is, respectively, when there are no objects on the viewing line or when the first encountered object is a static occluder. Equation 2.13 expresses two distinct cases: first, where a dynamic object’s visual hull is encountered on the viewing line, in which case we expect to detect a silhouette at the matching pixel; second, where both an occluder and dynamic visual hull are present at the first nonfree voxel. This is perfectly possible, because the visual hull is an overestimate of the true dynamic object shape. While the true shapes of objects and occluders are naturally mutually exclusive, the visual hull of a dynamic object can overlap with occluder voxels. In this case we set the distribution to uniform because the silhouette detection state cannot be predicted: It can be caused by shadows cast by dynamic objects on occluders in the scene and by noise.
Inference Estimating the occluder occupancy at a voxel translates to estimating p(O |IB) in Bayesian terms. Applying Bayes’ rule to the modeled joint probability (Equation 2.5) leads to the
2.3 Dynamic Scene Reconstruction from Silhouette Cues following expression, once hidden variable sums are decomposed to factor out terms not required at each level of the sum: ⎛ n
⎞ T 1 ⎝ p(O |IB) ⫽ p(O ) p(Gt |O ) Pit ⎠ z t⫽1 i⫽1
(2.14)
Gt
where Pit ⫽
Oˇ it ,Gˇit
p(Oˇ it )p(Gˇit |Oˇ it )
p(Oˆ it )p(Gˆit |Oˆ it )
Oˆ it ,Gˆit
(2.15)
p(Iit |Oˆ it , Gˆit , O , Gt , Oˇ it , Gˇit , Bi )
where Pit expresses the contribution of view i at time t. The formulation therefore expresses Bayesian fusion over the various observed time instants and available views, with marginalization over unknown viewing line states (2.14). The normalization constant z is easily obtained by ensuring summation to 1 of the distribution.
Multiple Dynamic Objects In this section, we focus on the inference of multiple dynamic objects. Because a dynamic object changes shape and location constantly, our dynamic object reconstruction has to be computed for every frame in time, and there is no way to accumulate the information over time as we did for the static occluder. We will thus focus on a single time instant for this section. Our notations slightly change as follows to best describe the formulations: We consider a scene observed by n calibrated cameras and assume a maximum of m dynamic objects of interest that can be present in it. In this formulation we focus on the state of one voxel at position X chosen among the positions of the 3D lattice used to discretize the scene. We model how knowledge about the occupancy state of voxel X influences image formation, assuming that a static appearance model for the background was previously observed. Because of the occlusion relationships arising between objects, the zones of interest to infer the state of voxel X are its n viewing lines Li and i ∈ {1, . . . , n}, with respect to the different views. In this paragraph we assume that some prior knowledge about scene state is available for each voxel X in the lattice and can be used in the inference. Various uses of this assumption will be demonstrated in Section 2.3.3. A number of statistical variables are used to model the state of the scene and the image generation process and to infer G , as depicted in Figure 2.18.
Statistical Variables Scene voxel state space. The occupancy state of X is represented by a variable G . The particularity of our modeling lies in the multi-labeling characteristic of G ∈ L, where L is a set of labels {∅, 1, . . . , m, U }. A voxel either is empty (∅), is one of m objects the model is keeping track of (numerical labels), or is occupied by an unidentified object (U ). U is intended to act as a default label capturing all objects that are detected as different from background but not explicitly modeled by other labels. This proves useful for automatic detection of new objects. See the subsection Automatic Detection of New Objects in Section 2.3.3.
57
58
CHAPTER 2 Multi-View Calibration, Synchronization
G2vm Gnvm v
Gn 2
G1vm
G2v1
X
Gnv1
v G1 2 v
G1 1 L1
Ln
G2v1
In
L2 I1
View 1
3D scene lattice
I2 View 2
...
View n
FIGURE 2.18 Overview of the main statistical variables and geometry of the problem. G is the occupancy at voxel X and lives in a state space L of object labels. {Ii } are the color states observed at the n pixels where vj X projects. {Gi } are the states in L of the most likely obstructing voxels on the viewing line, for each of m objects, enumerated in their order of visibility {vj }i .
Observed appearance. The voxel X projects to a set of pixels, whose colors Ii , i ∈ 1, . . . , n
we observe in images. We assume these colors are drawn from a set of object- and viewspecific color models whose parameters we denote Cil . More complex appearance models are possible using gradient or texture information, without loss of generality. Latent viewing line variables. To account for inter-object occlusion, we need to model the contents of viewing lines and how they contribute to image formation. We assume some a priori knowledge about where objects lie in the scene. The presence of such objects can have an impact on the inference of G because of the invisibility and how they affect G . Intuitively, conclusive information about G cannot be obtained from a view i if a voxel in front of G with respect to i is occupied by another object. However, G directly influences the color observed if it is unoccluded and occupied by one of the objects. Still, if G is known to be empty, then the color observed at pixel Ii reflects the appearance of objects behind X in image i, if any. These visibility intuitions are modeled in the next subsection. It is not meaningful to account for the combinatorial number of occupancy possibilities along the viewing rays of X. This is because neighboring voxel occupancies on the viewing line usually reflect the presence of the same object and are therefore correlated. In fact, assuming we witness no more than one instance of every one of the m objects along the viewing line, the fundamental information required to reason about X is the presence and ordering of the objects along this line. To represent this knowledge, as depicted in Figure 2.18, assuming prior information about occupancies is already available at each voxel, we extract for each label l ∈ L and each viewing line i ∈ {1, . . . , n} the voxel whose probability of occupancy is dominant for that label on the viewing
2.3 Dynamic Scene Reconstruction from Silhouette Cues
59
line. This corresponds to electing the voxels that best represent the m objects and have the most influence on the inference of G . We then account for this knowledge in the problem of inferring X, by introducing a set of statistical occupancy variables Gil ∈ L that correspond to these extracted voxels.
Dependencies Considered We propose a set of simplifications in the joint probability distribution of the variables set that reflect the prior knowledge we have about the problem. To simplify 1:m ⫽ the writing we often note the conjunction of a set of variables as follows: G1:n l {Gi }i∈{1,...,n}, l∈{1,...,m} . We propose the following decomposition for the joint probability 1:m I 1:m distribution p(G G1:n 1:n C1:n ): p(G )
l∈L
l p(C1:n )
p(Gil |G )
i,l∈L
p(Ii |G Gi1:m Ci1:m )
(2.16)
i
Prior terms. p(G ) carries prior information about the current voxel. This prior can reflect
different types of knowledge and constraints already acquired about G , for example, localization information to guide the inference (Section 2.3.3). l ) is the prior over the view-specific appearance models of a given object l. The p(C1:n prior, as written over the conjunction of these parameters, can express expected relationships between the appearance models of different views, even if not color-calibrated. Since the focus in this chapter is on learning voxel X, we do not use this capability here l ) to be uniform. and assume p(C1:n Viewing line dependency terms. We summarized the prior information along each viewing line using the m voxels most representative of the m objects, so as to model inter-object occlusion phenomena. However, when examining a particular label G ⫽ l, keeping the occupancy information about Gil leads us to account for intra-object occlusion phenomena, which in effect leads the inference to favor mostly voxels from the front visible surface of the object l. Because we wish to model the volume of object l, we discard the influence of Gil when G ⫽ l: p(Gik |{G ⫽ l}) ⫽ P (Gik )
when k ⫽ l
(2.17)
p(Gil |{G ⫽ l}) ⫽ ␦∅ (Gil )
᭙l ∈ L
(2.18)
where P (Gik ) is a distribution reflecting prior knowledge about Gik , and ␦∅ (Gik ) is the distribution giving all the weight to label ∅. In (2.18) p(Gil |{G ⫽ l}) is thus enforced to be empty when G is known to represent label l, which ensures that the same object is represented only once on the viewing line. Image formation terms. The image formation term p(Ii |G Gi1:m Ci1:m ) explains what color
we expect to observe given the knowledge of viewing line states and per-object color models. We decompose each such term into two subterms by introducing a local latent variable S ∈ L representing the hidden silhouette state: p(Ii |G Gi1:m Ci1:m ) ⫽
S
p(Ii |S Ci1:m )p(S |G Gi1:m )
(2.19)
60
CHAPTER 2 Multi-View Calibration, Synchronization The term p(Ii |S Ci1:m ) simply describes what color is likely to be observed in the image given knowledge of the silhouette state and the appearance models corresponding to each object. S acts as a mixture label: If {S ⫽ l}, then Ii is drawn from the color model Cil . For objects (l ∈ {1, . . . , m}) we typically use Gaussian mixture models (GMM) [58] to efficiently summarize the appearance information of dynamic object silhouettes. For background (l ⫽ ∅) we use per-pixel Gaussians as learned from pre-observed sequences, although other models are possible. When l ⫽ U the color is drawn from the uniform distribution, as we make no assumption about the color of previously unobserved objects. Defining the silhouette formation term p(S |G Gi1:m ) requires that the variables be considered in their visibility order to model the occlusion possibilities. Note that this order vj can be different from 1, . . . , m. We note {Gi }j∈{1,...,m} , the variables Gi1:m as enumerated in the permutated order {vj }i reflecting their visibility ordering on Li . If {g}i denotes the particular index after which the voxel X itself appears on Li , then we can rewrite the vg vg⫹1 · · · Givm ). silhouette formation term as p(S |Giv1 · · · Gi G Gi A distribution of the following form can then be assigned to this term: p(S |∅ · · · ∅ l ∗ · · · ∗) ⫽ dl (S )
with l ⫽ ∅
p(S |∅ · · · · · · · · · ∅) ⫽ d∅ (S )
(2.20) (2.21)
where dk (S ), k ∈ L is a family of distributions giving strong weight to label k and lower equal weight to other labels, determined by a constant probability of detection d Pd ∈ [0, 1]: dk (S ⫽ k) ⫽ Pd and dk (S ⫽ k) ⫽ |1⫺P L|⫺1 to ensure summation to 1. Equation 2.20 thus expresses that the silhouette pixel state reflects the state of the first visible non-empty voxel on the viewing line, regardless of the state of voxels behind it (“*”). Equation 2.21 expresses the particular case where no occupied voxel lies on the viewing line, the only case where the state of S should be background: d∅ (S ) ensures that Ii is mostly drawn from the background appearance model.
Inference 1:m ) in Bayesian Estimating the occupancy at voxel X translates into estimating p(G |I1:n C1:n terms. We apply Bayes’ rule using the joint probability distribution, marginalizing out the 1:m : unobserved variables G1:n 1:m p(G |I1:n C1:n )⫽
1 1:m 1:m p(G G1:n I1:n C1:n ) z 1:m
(2.22)
G1:n
n 1 ⫽ p(G ) fi1 z i⫽1
(2.23)
where fik ⫽
v Gi k
p(Givk |G )fik⫹1
for k ⬍ m
(2.24)
2.3 Dynamic Scene Reconstruction from Silhouette Cues and fim ⫽
p(Givm |G )p(Ii |G Gi1:m Ci1:m )
61
(2.25)
v Gi m
The normalization constant z is easily obtained by ensuring summation to 1 of the distri 1:m I 1:m bution: z ⫽ G ,G 1:m p(G G1:n 1:n C1:n ). Equation 2.22 is the direct application of Bayes’ 1:n rule, with the marginalization of latent variables. The sum in this form is intractable; thus we factorize the sum in (2.23). The sequence of m functions fik specifies how to recursively compute the marginalization with the sums of individual Gik variables appropriately subsumed, so as to factor out terms not required at each level of the sum. Because of the particular form of silhouette terms in (2.20), this sum can be efficiently computed by noting that all terms after a first occupied voxel of the same visibility rank k share a term of identical value in p(Ii |∅ . . . ∅ {Givk ⫽ l} ∗ . . . ∗) ⫽ Pl (Ii ). They can be factored out of the remaining sum, which sums to 1 being a sum of terms of a probability distribution, leading to the following simplification of (2.24), ᭙k ∈ {1, . . . , m ⫺ 1}: fik ⫽ p(Givk ⫽ ∅|G )fik⫹1 ⫹
p(Givk ⫽ l|G )Pl (Ii )
(2.26)
l ⫽∅
Dynamic Object and Static Occluder Inference Comparison We showed the mathematical models for static and dynamic objects inference. Although both types of entity are computed only from silhouette cues from camera views and both require consideration of the visual occlusion effect, they actually are fundamentally different. First of all, there is no way to learn an appearance model for a static occluder because its appearance is initially embedded in the background model of a certain view. Only when an occlusion event happens between the dynamic object and the occluder can we detect that a certain appearance should belong to the occluder but not to the background, and occluder probability should increase along that viewing direction. As for dynamic objects, we mentioned (and will show in more detail in the next section) that their appearance models for all camera views could be manually or automatically learned before reconstruction. Second, for occluders, because they are static, places in the 3D scene that have been recovered as highly probable to be occluders will always maintain the high probabilities, not considering noise. This enables the accumulation of the static occluder in our algorithm. But for the inter-occlusion between dynamic objects, it is merely a one-time instant event. This effect is actually reflected in the inference formulae of the static occluder and dynamic objects. However, we show in the next section that these two formulations are still very similar and can be integrated in a single framework, in which the reconstruction of the static occluder and dynamic objects can also benefit from each other.
2.3.3 Automatic Learning and Tracking In Section 2.3.2, we presented a generic framework to infer the occupancy probability of a voxel X and thus to deduce how likely it is for X to belong to one of m objects.
62
CHAPTER 2 Multi-View Calibration, Synchronization
Additional work is required to use this framework to model objects in practice. The formulation explains how to compute the occupancy of X if some occupancy information about viewing lines is already known. Thus the algorithm needs to be initialized with a coarse shape estimate, whose computation is discussed next. Intuitively, object shape estimation and tracking are complementary and mutually helpful. We explain in a following subsection how object localization information is computed and used in the modeling. To be fully automatic, our method uses the inference label U to detect objects not yet assigned to a given label and to learn their appearance models. Finally, static occluder computation can easily be integrated into the system to help the inference be robust to static occluders. The algorithm at every time instance is summarized in Algorithm 2.1. Algorithm 2.1: Dynamic Scene Reconstruction Input: Frames at a new time instant for all views Output: 3D object shapes in the scene Coarse Inference; if a new object enters the scene then add a label for the new object; initialize foreground appearance model; go back to Coarse Inference; end if Refined Inference; static occluder inference; update object location and prior; return
Shape Initialization and Refinement The proposed formulation relies on available prior knowledge about scene occupancies and dynamic object ordering. Thus part of the occupancy problem must be solved to bootstrap the algorithm. Fortunately, multi-label silhouette inference with no prior knowledge about occupancies or consideration for inter-object occlusions provides a decent initial m-occupancy estimate. This inference case can easily be formulated by simplifying occlusion-related variables from n 1 1:m p(G |I1:n C1:n ) ⫽ p(G ) p(Ii |G Ci1:m ) z i⫽1
(2.27)
This initial coarse inference can then be used to infer a second, refined inference, this j time accounting for viewing line obstructions given the voxel priors p(G ) and P (Gi ) of equation 2.17 computed from the coarse inference. The prior over p(G ) is then used to introduce soft constraints to the inference. This is possible by using the coarse inference result as the input of a simple localization scheme and using the localization information in p(G ) to enforce a compactness prior over the m objects, as discussed in the following subsection.
Object Localization
We use a localization prior to enforce the compactness of objects in the inference steps. For the particular case where walking persons are the dynamic objects, we take advantage of the underlying structure of the data set by projecting the maximum probability over each vertical voxel column onto the horizontal reference plane. We then localize the most likely position of each object by sliding a fixed-size window over the resulting 2D probability map. The resulting center is subsequently used to initialize p(G) with a cylindrical spatial prior. This favors objects localized in one and only one portion of the scene and is intended as a soft guide to the inference. Although simple, this tracking scheme is shown to outperform state-of-the-art methods (see the Densely Populated Scene subsection in Section 2.3.4), thanks to the rich shape and occlusion information modeled.
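The localization step itself is simple enough to sketch. The NumPy fragment below assumes the per-object occupancy grid is stored as an (X, Y, Z) array and uses an illustrative window size; neither is specified in the chapter.

```python
import numpy as np

def localize_object(occupancy_xyz, window=15):
    """Project the maximum occupancy over the vertical axis and slide a
    fixed-size window over the resulting 2D map to find the most likely
    ground-plane position. occupancy_xyz: (X, Y, Z) voxel probabilities."""
    ground_map = occupancy_xyz.max(axis=2)          # max over each vertical voxel column

    best_score, best_xy = -np.inf, (0, 0)
    for x in range(ground_map.shape[0] - window):
        for y in range(ground_map.shape[1] - window):
            score = ground_map[x:x + window, y:y + window].sum()
            if score > best_score:
                best_score = score
                best_xy = (x + window // 2, y + window // 2)
    return best_xy                                   # center used to seed the cylindrical prior
```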
Automatic Detection of New Objects
The main information about objects used by the proposed method is their set of appearances in the different views. These sets can be learned offline by segmenting each observed object alone in a clear, uncluttered scene before processing multi-object scenes. More generally, we can initialize object color models in the scene automatically. To detect new objects we compute U's object location and volume size during the coarse inference, and track the unknown volume just as with the other objects, as described in the previous section. A new dynamic object inference label is created (and m incremented) if all of the following criteria are satisfied (a sketch of this check is given after the list):
■ The entrance occurs at the scene boundaries.
■ U's volume size is larger than a threshold.
■ Subsequent updates of U's track are bounded.
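A minimal sketch of this check; all thresholds, margins, and units below are illustrative assumptions, not values from the chapter.

```python
def should_create_label(entry_position, scene_min, scene_max, volume_size,
                        track_steps, border_margin=0.3, min_volume=500,
                        max_step=0.5):
    """Evaluate the three new-object criteria listed above."""
    # Criterion 1: the track entered near a scene boundary.
    near_border = any(
        abs(p - lo) < border_margin or abs(hi - p) < border_margin
        for p, lo, hi in zip(entry_position, scene_min, scene_max)
    )
    large_enough = volume_size > min_volume               # criterion 2: volume threshold
    track_bounded = all(step <= max_step for step in track_steps)  # criterion 3
    return near_border and large_enough and track_bounded
```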
To build the color model of the new object, we project the maximum voxel probability along the viewing ray into each camera view, threshold the image to form a "silhouette mask," and choose pixels within the mask as training samples for a GMM appearance model. Samples are collected only from unoccluded silhouette portions of the object, which can be verified from the inference. Because the cameras may be badly color-calibrated, we propose to train an appearance model for each camera view separately. This approach is fully evaluated in the Appearance Modeling Validation subsection later in this chapter.
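As an illustration, a per-view appearance model of this kind could be fit with scikit-learn's GaussianMixture; the mask handling and the number of mixture components below are assumptions on our part, not settings from the chapter.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_appearance_models(images, masks, n_components=5):
    """Fit one RGB GMM per camera view from pixels inside the (unoccluded)
    silhouette mask of the new object. images and masks are parallel lists,
    one (H, W, 3) image and one (H, W) mask per view."""
    models = []
    for img, mask in zip(images, masks):
        samples = img[mask > 0].astype(float)                    # (N, 3) RGB training pixels
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(samples)
        models.append(gmm)            # kept separate per view: no color calibration assumed
    return models
```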
Occluder Computation
The static occluder computation can easily be integrated with the multiple dynamic object reconstruction described in the Static Occluder subsection of Section 2.3.2. At every time instant the dominant occupancy probabilities of the m objects are already extracted; the two dominant occupancies in front of and behind the current voxel X can be used in the occupancy inference formulation of that earlier section. The multi-label dynamic object inference discussed in this section can be regarded as an extension of the single dynamic object case assumed there. In fact, the occlusion occupancy inference benefits from the disambiguation inherent in multi-silhouette reasoning, as the real-world experiment in Figure 2.25 will show.
2.3.4 Results and Evaluation
In this section, we apply the introduced algorithms for static occluders and multiple dynamic objects to a few data sets. All the data sets are captured using multiple off-the-shelf video camcorders in natural environments.
Occlusion Inference Results
To demonstrate the power of static occluder shape recovery, we mainly use a single person as the dynamic object in the scene. In the next section, we also show that the static occluder shape can be recovered in the presence of multiple dynamic objects. We show three multi-view sequences: PILLARS and SCULPTURE, acquired outdoors, and CHAIR, acquired indoors with combined artificial and natural light from large bay windows. In all sequences nine DV cameras surround the scene of interest; background models are learned in the absence of moving objects. A single person, as our dynamic object, walks around and through the occluder in each scene. His shape is estimated at each considered time step and used as a prior for occlusion inference. The data is used to compute an estimate of the occluder's shape using equation 2.14. Results are presented in Figure 2.19. Nine geometrically calibrated 720×480 resolution cameras all record at 30 Hz. Color calibration is unnecessary because the model uses silhouette information only.
FIGURE 2.19 Occluder shape retrieval results. Sequences. (a) PILLARS, (b) SCULPTURE, (c) CHAIR. (1) Scene overview. Note the harsh light and difficult backgrounds for (a) and (b), and the specularity of the sculpture, causing no significant modeling failure. (2)(3) Occluder inference: Dark gray neutral regions (prior Po ); light gray high-probability regions. Bright/clear regions indicate the inferred absence of occluders. Fine levels of detail are modeled, and sometimes lost—mostly from calibration. In (a) the structure’s steps are also detected. (4) Same inference with additional exclusion of zones with reliability under 0.8. Peripheral noise and marginally observed regions are eliminated. The background protruding shape in (c-3) is due to a single occlusion from view (c-1). (The supplemental video on this book’s companion website shows extensive results with these data sets, including one or more persons in the scene.)
The background model is learned per-view using a single Gaussian color model per pixel and training images. Although simple, the model proves sufficient, even in outdoor sequences subject to background motion, foreground object shadows, and substantial illumination changes. This illustrates the strong robustness of the method to difficult real conditions. The method copes well with background misclassifications that do not lead to large coherent false-positive dynamic object estimates: Pedestrians are routinely seen in the background of the SCULPTURE and PILLARS sequences (e.g., Figure 2.19(a-1)) without any significant corruption of the inference. Adjacent frames in the input videos contain largely redundant information for occluder modeling; thus the videos can be safely subsampled. PILLARS was processed using 50 percent of the frames (1053 frames processed); SCULPTURE and CHAIR, with 10 percent (160 and 168 frames processed, respectively). Processing of both dynamic and occluder occupancy ran on a 2.8-GHz PC at approximately one time step per minute. The very strong locality inherent to the algorithm and preliminary benchmarks suggest that real-time performance can be achieved using a GPU implementation. Because of adjacent-frame redundancy, occluder information does not need to be processed for every frame, opening the possibility of online, asynchronous cooperative computation of occluder and dynamic objects at interactive frame rates.
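For concreteness, a per-pixel single-Gaussian background model of this kind can be sketched as follows; the variance floor and the per-channel independence are our own simplifying assumptions.

```python
import numpy as np

def learn_background(training_frames):
    """Per-pixel, per-channel single-Gaussian background model fit from
    object-free training frames (each frame is an (H, W, 3) array)."""
    stack = np.stack(training_frames).astype(float)         # (T, H, W, 3)
    return stack.mean(axis=0), stack.var(axis=0) + 1e-6     # mean and variance per pixel/channel

def background_likelihood(frame, mean, var):
    """Per-pixel likelihood of the observed color under the background model."""
    d2 = (frame.astype(float) - mean) ** 2 / var
    log_p = -0.5 * (d2 + np.log(2 * np.pi * var)).sum(axis=2)
    return np.exp(log_p)                                    # (H, W) background likelihood map
```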
Multi-Object Shape Inference Results
We use four multi-view sequences to validate multi-object shape inference. Eight 30-Hz 720×480 DV cameras surrounding the scene in a semicircle are used for the CLUSTER and BENCH sequences. The LAB sequence is provided by [30], and SCULPTURE is the sequence used to reconstruct the static sculpture (Figure 2.19(b)) in the previous section. Table 2.1 summarizes the results of persons walking in these scenes, in the SCULPTURE case together with the reconstructed sculpture. Cameras in each data sequence are geometrically calibrated but not color-calibrated. The background model is learned per-view using a single Gaussian color model at every pixel with training images. Although simple, the model proves sufficient, even in outdoor sequences subject to background motion, foreground object shadows, window reflections, and substantial illumination changes, showing the robustness of the method to difficult real conditions. For the dynamic object appearance models of the CLUSTER, LAB, and SCULPTURE data sets, we trained an RGB GMM model for each person in each view with manually segmented foreground images. This was done offline. For the BENCH sequence, however, appearance models were initialized online automatically. The time complexity is O(nmV), with n the number of cameras, m the number of objects in the scene, and V the scene volume resolution.
Table 2.1 Results of Persons Walking in Scenes

Sequence               Cameras    Dynamic Objects    Occluder
CLUSTER (outdoor)      8          5                  No
BENCH (outdoor)        8          0–3                Yes
LAB (indoor)           15         4                  No
SCULPTURE (outdoor)    9          2                  Yes
We processed the data sets on a 2.4-GHz Core Quad PC with computation times of 1 to 4 minutes per time step. Again, the very strong locality inherent to the algorithm and preliminary benchmarks suggest that roughly ten times faster performance could be achieved using a GPU implementation.
Appearance Modeling Validation
It is extremely hard to color-calibrate a large number of cameras in general, let alone when they operate under varying lighting conditions, as in a natural outdoor environment. To show this, we compare different appearance-modeling schemes in Figure 2.20 for a frame of the outdoor BENCH data set.
FIGURE 2.20 Appearance model analysis. A person in eight views is displayed in row 4. A GMM model C_i is trained for each view i ∈ [1, 8]. A global GMM model C_0 over all views is also trained. Rows 1, 2, 3, and 5 show P(S | I, B, C_{i+1}), P(S | I, B, C_{i-1}), P(S | I, B, C_0), and P(S | I, B, C_i) for view i, respectively, with S the foreground label, I the pixel color, and B the uniform background model. The probability is displayed according to the gray-scale scheme at top right. The average probability over all pixels in the silhouette region and the mean color modes of the applied GMM model are shown for each figure. (See color image of this figure on companion website.)
Without loss of generality, we use GMMs. The first two rows compare silhouette extraction probabilities using the color models of spatially neighboring views. These indicate that stereo approaches that depend heavily on color correspondence between neighboring views are very likely to fail in natural scenarios, especially when the cameras exhibit dramatic color variations, as in views 4 and 5. The global appearance model in row 3 performs better than those in rows 1 and 2, but mainly because it averages over the large color variations across camera views, which at the same time decreases the model's discriminability. The last row, where a color appearance model is maintained independently for every camera view, is the clear winner. We thus use the last scheme in our system. Once the model is trained, we do not update it over time; doing so would be an easy extension for added robustness.
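The per-view scheme of row 5 amounts to evaluating a foreground GMM against a uniform background model at each pixel. The sketch below is our own illustration, assuming 8-bit RGB values and a fitted scikit-learn GMM per view; the prior and the uniform background density are assumptions.

```python
import numpy as np

def foreground_probability(pixels_rgb, view_gmm, prior_fg=0.5):
    """P(S | I, B, C) for a batch of pixels: GMM foreground likelihood against
    a uniform background likelihood over the 8-bit RGB cube."""
    p_fg = np.exp(view_gmm.score_samples(pixels_rgb.astype(float)))   # GMM density per pixel
    p_bg = 1.0 / (256.0 ** 3)                                         # uniform background model B
    return (prior_fg * p_fg) / (prior_fg * p_fg + (1.0 - prior_fg) * p_bg)
```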
Densely Populated Scene
The CLUSTER sequence is a particularly challenging configuration: Five persons stand on a circle less than 3 m in diameter, yielding an extremely ambiguous and occluded situation at the circle center. Although none of them is observed in all views, we are still able to recover their labels and shapes. Images and results are shown in Figure 2.21. The naïve 2-label reconstruction (probabilistic visual hull) yields large volumes with little separation between objects, because the entire scene configuration is too ambiguous. Adding the prior tracking information recovers the most probable compact regions and eliminates large errors, at the expense of dilation and lower precision. Accounting for viewing-line occlusions enables the model to recover more detailed information, such as limbs. The LAB sequence [30], with poor image contrast, is also processed. The reconstruction result from all 15 cameras is shown in Figure 2.22.
FIGURE 2.21 Result from the 8-view CLUSTER data set. (a) Two views at frame 0. (b) Respective 2-label reconstruction. (c) More accurate shape estimation using our algorithm.
FIGURE 2.22 LAB data set result from Gupta et al. [30]. (a) 3D reconstruction with 15 views at frame 199. (b) Eight-view tracking result comparison with methods in Gupta et al. [30] and Mittal and Davis [44] and the ground truth data. Mean error in the ground plane estimate in millimeters is plotted.
Moreover, to evaluate our localization prior estimation, we compare our tracking method (Section 2.3.3) against the ground truth data and the results of Gupta et al. [30] and Mittal and Davis [44]. We use exactly the same eight cameras as the latter authors [44] for the comparison, as shown in Figure 2.22(b). Although slower in its current implementation (2 minutes per time step), our method is generally more robust in tracking and also builds 3D shape information. Most existing tracking methods focus only on a tracking envelope and do not compute precise 3D shapes. This shape information is what enables our method to achieve comparable or better precision.
Automatic Appearance Model Initialization
The automatic dynamic object appearance model initialization has been tested on the BENCH sequence. Three persons walk into the empty scene one after another. By examining the unidentified label U, object appearance models are initialized and used for shape estimation in subsequent frames. The volume size evolution of all labels is shown in Figure 2.23, and the reconstructions at two time instants are shown in Figure 2.24. During the sequence, U has three major volume peaks because three new persons enter the scene. Some smaller perturbations are due to shadows on the bench or the ground. Besides automatic object appearance model initialization, the system robustly redetects and tracks a person who leaves and reenters the scene. This is because, once a label is initialized, it is evaluated at every time instant even if the person is out of the scene. The algorithm can easily be extended to handle leaving/reentering labels transparently.
Dynamic Object and Occluder Inference
The BENCH sequence demonstrates the power of our automatic appearance model initialization as well as the integrated occluder inference of the "bench," as shown in Figure 2.24 between frames 329 and 359. See Figure 2.23 for the scene configuration during that period. (The complete sequence is also given in the supplemental video available on this book's companion website.)
FIGURE 2.23 Appearance model automatic initialization with the BENCH sequence. The volume of U increases if a new person enters the scene. When an appearance model is learned, a new label is initialized. During the sequence, L1 and L2 volumes drop to near zero because those persons walk out of the scene on those occasions.
FIGURE 2.24 BENCH result. Person numbers are assigned according to the order in which their appearance models are initialized. At frame 329, P3 is entering the scene. Since it is P3's first time in the scene, he is captured by label U (gray). P1 is out of the scene at that moment. At frame 359, P1 has reentered the scene. P3 has its GMM model already trained and label L3 is assigned. The bench as a static occluder is being recovered.
FIGURE 2.25 SCULPTURE data set comparison. (a) Reconstruction with a single foreground label. (b) Reconstruction with a label for each person. By resolving inter-occlusion ambiguities, both the static occluder and the dynamic objects achieve better quality.
We compute results for the SCULPTURE sequence with two persons walking in the scene, as shown in Figure 2.25. For the dynamic objects, we manage to get much cleaner shapes when the two persons are close to each other, as well as more detailed shapes such as extended arms. For the occluder, thanks to the multiple foreground modes and
the consideration of inter-occlusion between the dynamic objects in the scene, we are able to recover its fine shape, too. Otherwise, the occluder inference would have to rely on ambiguous regions when persons are clustered together.
Limitations of Our Shape Estimation Technique
Although we have demonstrated the robustness of our shape estimation algorithm on many complicated real-world examples, it is important to summarize the weaknesses and limitations of the technique.

First of all, like all silhouette-based methods, our technique requires a background model for every view. This is problematic if the foreground objects of interest have a color appearance similar to the background. By modeling the foreground appearance explicitly, as in the multiple-person case, this problem is alleviated but not fundamentally solved. Other solutions include introducing skeleton models for the objects of interest to constrain the object shape prior, and exploiting temporal cues such as motion in the video rather than treating frames as independent entities.

Second, the occlusion inference introduced here is effective only in partial occlusion cases, meaning that the object should not be occluded in all views simultaneously. In fact, our occlusion inference is essentially based on a "majority" of observation agreements: If most of our observations do not witness the foreground object, the probability of the object's existence remains lower than desired, no matter how long the video sequence is. This affects densely crowded scenes, too. The solution is, again, to introduce more prior information into the problem. A more quantitative analysis of what constitutes a "too densely crowded" scene is one future direction; however, since this problem is camera-position dependent, it is beyond the scope of this chapter.

Finally, our complete inference is based on the assumption that voxels can be observed independently of their neighbors, which introduces local voxel probability perturbations due to image noise and numerical errors. One future direction is to account for spatial and temporal consistencies and refine the shapes that we have computed.
2.4 CONCLUSIONS
We presented a complete approach to reconstructing 3D shapes in a dynamic event from silhouettes extracted from multiple videos recorded by an uncalibrated and unsynchronized camera network. The key elements of our approach are as follows:
1. A robust algorithm that efficiently computes the epipolar geometry and temporal offset between two cameras using silhouettes in the video streams. The proposed method is robust and accurate, and allows calibration of camera networks without the need to acquire specific calibration data. This is very useful for applications where deploying technical personnel with calibration targets for calibration or recalibration is either infeasible or impractical.
2. A probabilistic volumetric framework for automatic 3D dynamic scene reconstruction. The proposed method is robust to occlusion, lighting variation, shadows, and so forth. It does not require photometric calibration among the cameras in the system.
It automatically learns the appearance of dynamic objects, tracks motion, and detects surveillance events such as entering and leaving the scene. It also automatically discovers the static obstacle and recovers its shape by observing the dynamic objects moving in the scene for a sufficient amount of time.
By combining all of the algorithms described in this chapter, it is possible to develop a fully automatic system for dynamic scene analysis.
REFERENCES [1] M. Alexa, D. Cohen-Or, D. Levin, As-rigid-as-possible shape interpolation, in: ACM SIGGRAPH Papers, 2000. [2] N. Apostoloff, A. Fitzgibbon, Learning spatiotemporal T-junctions for occlusion detection, Computer Vision and Pattern Recognition II (2005) 553–559. [3] K. Astrom, R. Cipolla, P. Giblin, Generalised epipolar constraints, International Journal of Computer Vision 33 (1) (1999) 51–72. [4] P.T. Baker, Y. Aloimonos, Calibration of a multicamera network, Computer Vision and Pattern Recognition VII (2003) 7–72. [5] L. Ballan, G.M. Cortelazzo, Multimodal 3D shape recovery from texture, silhouette and shadow information, in: Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, 2006. [6] B.G. Baumgart, Geometric modeling for computer vision. PhD thesis, Stanford University, 1974. [7] R.C. Bolles, M.A. Fischler, A RANSAC-based approach to model fitting and its application to finding cylinders in range data, in: Proceedings of the seventh International Joint Conferences on Artificial Intelligence, 1981. [8] J.S. De Bonet, P. Viola, Roxels: Responsibility weighted 3D volume reconstruction, in: Proceedings of the International Conference on Computer Vision I (1999) 418–425. [9] J. Bouguet, MATLAB camera calibration toolbox, 2000. Available at www.vision.caltech.edu/ bouguetj/calib_doc. [10] E. Boyer, On using silhouettes for camera calibration, in: Proceedings of Asian Conference on Computer Vision I (2006) 1–10. [11] A. Broadhurst, T. Drummond, R. Cipolla, A probabilistic framework for the Space Carving algorithm, in: Proceedings of the International Conference on Computer Vision I (2001) 388–393. [12] G.J. Brostow, I. Essa, D. Steedly, V. Kwatra, Novel skeletal representation for articulated creatures, in: Proceedings of European Conference on Computer Vision’04 III (2004) 66–78. [13] J. Carranza, C. Theobalt, M.A. Magnor, H.P. Seidel, Free-viewpoint video of human actors, in: ACM SIGGRAPH Papers, 2003. [14] G. Cheung, T. Kanade, J.-Y. Bouguet, M. Holler, A real time system for robust 3D voxel reconstruction of human motions, in: Proceedings of Computer Vision and Pattern Recognition II (2000) 714–720. [15] G.K.M. Cheung, S. Baker, T. Kanade, Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture, in: Proceedings of Computer Vision and Pattern Recognition I (2003) 77–84. [16] R. Cipolla, K.E.Astrom, P.J. Giblin, Motion from the frontier of curved surfaces, in: Proceedings of the Fifth International Conference on Computer Vision, 1995.
[17] B. Curless, M. Levoy, A volumetric method for building complex models from range images, in: ACM SIGGRAPH Papers, 1996. [18] A. Elfes, Using occupancy grids for mobile robot perception and navigation. IEEE Computer, Special Issue on Autonomous Intelligent Machines 22 (6) (1989) 46–57. [19] P. Favaro, A. Duci, Y. Ma, S. Soatto, On exploiting occlusions in multiple-view geometry, in: Proceedings of the International Conference on Computer Vision I (2003) 479–486. [20] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multi-camera people tracking with a probabilistic occupancy map, Pattern Analysis and Machine Intelligence 30 (2) (2008) 267–282. [21] J.-S. Franco, E. Boyer, Exact polyhedral visual hulls, in: Proceedings of the Fourteenth British Machine Vision Conference, 2003. [22] J.-S. Franco, E. Boyer, Fusion of multi-view silhouette cues using a space occupancy grid, in: Proceedings of the International Conference on Computer Vision II (2005) 1747–1753. [23] Y. Furukawa, J. Ponce, Carved visual hulls for image-based modeling, in: Proceedings European Conference on Computer Vision (2006) 564–577. [24] Y. Furukawa, A. Sethi, J. Ponce, D. Kriegman, Robust structure and motion from outlines of smooth curved surfaces, Pattern Analysis and Machine Intelligence 28 (2) (2006) 302–315. [25] B. Goldlücke, M.A. Magnor, Space-time isosurface evolution for temporally coherent 3D reconstruction, in: Proceedings Computer Vision and Pattern Recognition’04, I (2004) 350–355. [26] K. Grauman, G. Shakhnarovich, T. Darrell, A Bayesian approach to image-based visual hull reconstruction, in: Proceedings of IEEE Computer Vision and Pattern Recognition I (2003) 187–194. [27] L. Guan, S. Sinha, J.-S. Franco, M. Pollefeys, Visual hull construction in the presence of partial occlusion, in: Proceedings of 3D Data Processing Visualization and Transmission, 2006. [28] L. Guan, J.-S. Franco, M. Pollefeys, 3D occlusion inference from silhouette cues, in: Proceedings of Computer Vision and Pattern Recognition (2007) 1–8. [29] L. Guan, J.-S. Franco, M. Pollefeys, Multi-object shape estimation and tracking from silhouette cues, in: Proceedings of Computer Vision and Pattern Recognition (2008) 1–8. [30] A. Gupta, A. Mittal, L.S. Davis, Cost: An approach for camera selection and multi-object inference ordering in dynamic scenes, in: Proceedings of the International Conference on Computer Vision, 2007. [31] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, volume 23. Cambridge University Press, 2005. [32] C. Hernández, F. Schmitt, R. Cipolla, Silhouette coherence for camera calibration under circular motion, in: Proceedings Pattern Analysis and Machine Intelligence 29 (2) (2007) 343–349. [33] T. Joshi, N. Ahuja, J. Ponce, Structure and motion estimation from dynamic silhouettes under perspective projection, in: Proceedings of the International Conference on Computer Vision (1995) 290–295. [34] T. Kanade, P. Rander, P.J. Narayanan, Virtualized reality: Constructing virtual worlds from real scenes. IEEE MultiMedia 4 (1) (1997) 34–47. [35] K. Kutulakos, S. Seitz, A theory of shape by space carving, in: Proceedings of the International Journal of Computer Vision 38 (2000) 307–314. [36] A. Laurentini, The visual hull concept for silhouette-based image understanding, in: Proceedings Pattern Analysis and Machine Intelligence 16 (2) (1994) 150–162. [37] S. Lazebnik, E. Boyer, J. 
Ponce, On computing exact visual hulls of solids bounded by smooth surfaces, in: Proceedings of Computer Vision and Pattern Recognition I (2001) 156–161. [38] S. Lazebnik, A. Sethi, C. Schmid, D.J. Kriegman, J. Ponce, M. Hebert, On pencils of tangent planes and the recognition of smooth 3D shapes from silhouettes, in: Proceedings of European Conference on Computer Vision (3) (2002) 651–665.
CHAPTER 2 Multi-View Calibration, Synchronization [39] N. Levi, M. Werman, The viewing graph, in: Proceedings of Computer Vision and Pattern Recognition I (2003) 518–522. [40] M.I.A. Lourakis, A.A. Argyros, The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Technical Report 340, Institute of Computer Science–FORTH, Heraklion, Crete, 2004. [41] W. Matusik, C. Buehler, L. McMillan, Polyhedral visual hulls for real-time rendering, in: Proceedings of Eurographics Workshop on Rendering, 2001. [42] W. Matusik, C. Buehler, R. Raskar, S.J. Gortler, L. McMillan, Image-based visual hulls, in: Kurt Akeley, editor, Computer Graphics Proceedings, ACM SIGGRAPH/Addison-Wesley, 2000. [43] D. Margaritis, S. Thrun, Learning to locate an object in 3D space from a sequence of camera images, in: Proceedings of the International Conference on Machine Learning (1998) 332– 340. [44] A. Mittal, L.S. Davis, M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene, International Journal of Computer Vision 51 (3) (2003) 189–203. [45] A. Melkman, On-line construction of the convex hull of a simple polygon. Information Processing Letters 25 (11) (1987). [46] P.R.S. Medonça, K.-Y.K. Wong, R. Cipolla, Epipolar geometry from profiles under circular motion, IEEE Transactions Pattern Analysis Machine Intelligence 23 (6) (2001) 604–616. [47] K. Otsuka, N. Mukawa, Multiview occlusion analysis for tracking densely populated objects based on 2-D visual angles, in: Proceedings of Computer Vision and Pattern Recognition I (2004) 90–97. [48] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, R. Koch, Visual modeling with a hand-held camera, International Journal of Computer Vision 59 (3) (2004) 207–232. [49] J. Porrill, S. Pollard, Curve matching and stereo calibration. Image Vision Computations 9 (1) (1991) 45–50. [50] P. Sand, L. McMillan, J. Popovi´c, Continuous capture of skin deformation, in: ACM SIGGRAPH Papers, 2003. [51] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, International Journal of Computer Vision 47 (2002) 7–42. [52] S. Seitz, B. Curless, J. Diebel, D. Scharstein, R. Szeliski, A comparison and evaluation of multi-view stereo reconstruction algorithms, in: Proceedings of Computer Vision and Pattern Recognition (2006) 519–528. [53] S.N. Sinha, M. Pollefeys, L. McMillan, Camera network calibration from dynamic silhouettes, in: Proceedings of Computer Vision and Pattern Recognition I (2004) 195–202. [54] S.N. Sinha, M. Pollefeys, Synchronization and calibration of camera networks from silhouettes, in: Proceedings of the International Conference on Pattern Recognition I (2004) 115–119. [55] S.N. Sinha, M. Pollefeys, Visual-hull reconstruction from uncalibrated and unsynchronized video streams, in: Proceedings of 3D Data Processing Visualization and Transmission (2004) 349–356. [56] S.N. Sinha, M. Pollefeys, Multi-view reconstruction using photo-consistency and exact silhouette constraints: A maximum-flow formulation, in: Proceedings of the International Conference on Computer Vision I (2005) 349–356. [57] G. Slabaugh, B.W. Culbertson, T. Malzbender, M.R. Stevens, R. Schafer, Methods for volumetric reconstruction of visual scenes, International Journal of Computer Vision 57 (2004) 179–199. [58] C. Stauffer, W.E.L. 
Grimson, Adaptive background mixture models for real-time tracking, in: Proceedings of Computer Vision and Pattern Recognition II (1999) 246–252. [59] D. Snow, P. Viola, R. Zabih, Exact voxel occupancy with graph cuts, in: Proceedings of Computer Vision and Pattern Recognition I (2000) 345–352.
[60] J. Starck, A. Hilton, Surface capture for performance-based animation. IEEE Computer Graphics and Applications 27 (3) (2007) 21–31. [61] C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: Proceedings of Computer Vision and Pattern Recognition II (1999) 246–252. [62] P. Sturm, B. Triggs, A factorization based algorithm for multi-image projective structure and motion, in: Proceedings European Conference on Computer Vision, 1996. [63] T. Svoboda, D. Martinec, T. Pajdla, A convenient multi-camera self-calibration for virtual environments, PRESENCE: Teleoperators and Virtual Environments 14 (4) (2005) 407–422. [64] R. Szeliski, Rapid octree construction from image sequences, CVGIP: Image Understanding, 58 (1) (1993) 23–32. [65] R. Szeliski, D.Tonnesen, D.Terzopoulos, Modeling surfaces of arbitrary topology with dynamic particles, in: Proceedings of Computer Vision and Pattern Recognition (1993) 82–87. [66] B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment—A modern synthesis, In W. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, Lecture Notes in Computer Science, Springer-Verlag, 2000. [67] R.Y. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, Radiometry 3 (4) (1987) 383–344. [68] B. Vijayakumar, D.J. Kriegman, J. Ponce, Structure and motion of curved 3D objects from monocular silhouettes, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, 1996. [69] K.Y.K. Wong, R. Cipolla, Structure and motion from silhouettes, in: Proceedings of the International Conference on Computer Vision II (2001) 217–222. [70] A.J. Yezzi, S. Soatto, Structure from motion for scenes without features, in: Proceedings of Computer Vision and Pattern Recognition I (2003) 525–532. [71] Z. Zhang, Determining the epipolar geometry and its uncertainty: A review. International Journal of Computer Vision 27 (2) (1998) 161–195. [72] Z. Zhang, A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis Machines Intelligence 22 (11) (2000) 1330–1334. [73] Y. Zhou, H. Tao, A background layer model for object tracking through occlusion, in: Proceedings of the International Conference on Computer Vision II (2003) 1079–1085. [74] R. Ziegler, W. Matusik, H. Pfister, L. McMillan, 3D reconstruction using labeled image regions, in: EG Symposium on Geometry Processing (2003) 248–259.
CHAPTER 3
Actuation-Assisted Localization of Distributed Camera Sensor Networks
Mohammad H. Rahimi, Jeff Mascia, Juo-Yu Lee, Mani B. Srivastava
Center for Embedded Networked Sensing, University of California (UCLA), Los Angeles, California
Abstract
External calibration of a camera sensor network to determine the relative location and orientation of the nodes is essential for effective usage. In prior research, cameras, identified by controllable light sources, utilized angular measurements among themselves to determine their relative positions and orientations. However, the typical camera's narrow field of view makes such systems susceptible to failure in the presence of occlusions or nonideal configurations. Actuation assistance helps to overcome such limitations by broadening each camera's view. In this chapter we discuss and implement a prototype system that uses actuation to aid in the external calibration of camera networks. We evaluate our system using simulations and a testbed of Sony PTZ cameras and MicaZ nodes, equipped with Cyclops camera modules mounted on custom pan-tilt platforms. Our results show that actuation assistance can dramatically reduce node density requirements for localization convergence.
Keywords: camera sensor network, actuation, external calibration
3.1 INTRODUCTION
With the continuous fall in the price and complexity of imaging technology, distributed image sensing using networked wireless cameras is becoming increasingly feasible for many applications, such as smart environments, surveillance, and tracking. A key problem when aggregating information across spatially distributed imagers is determination of their poses (i.e., 3D location and 3D orientation), which is essential to relating the physical-world information observed by the different sensors within a common coordinate system. In computer vision the problem of determining the 3D pose of a camera is referred to as the external calibration problem, to contrast it with the internal calibration problem of determining internal parameters of the optical system, which may vary from camera to camera.
While there is significant prior research on the related problem of node localization in sensor networks, particularly using non-vision sensing modalities such as RF, acoustic, and ultrasound, and their combinations, those techniques reveal only the location; external calibration requires estimation of the entire six-dimensional pose for all nodes. Recently, a new family of approaches that exploit vision for localizing the nodes in the network has emerged. These techniques are appealing in camera networks for two reasons. First, they use the same modality that is used for sensing, thereby minimizing the cost and complexity of the sensor nodes. Second, they recover not only the location up to a scale but also the orientation of the nodes in the network. At the sensing level, these approaches are either passive, relying on two or more cameras observing common feature points [4–7], or active, using a number of optical beacons that can be identified by the other nodes in the network [8–10]. In the passive approach, establishing correspondence among common feature points between camera pairs is a significant challenge, particularly as the baseline between cameras increases in realistic deployment scenarios. In the active approach, a main challenge is that the cameras' limited fields of view impose stringent constraints on the sensor node poses to guarantee enough common beacon observations for the external calibration procedure to succeed for the entire network. As we show later in this chapter, successful calibration of the network requires either a very high network density or careful placement of the cameras, which may compete with the sensing requirements of the application.

In this chapter we present an actuation-assisted vision-based external calibration approach for a network of cameras. In our approach, the network consists of actuated camera devices that use pan and tilt capabilities to adapt their field of view (FOV) for sensing and for external calibration of the network. To calibrate, each node is equipped with an optical beacon in addition to an image sensor. It continuously reconfigures its pose to intercept other nodes' optical signals and extract their identities. Furthermore, each node transmits the identity and associated pose of each of its neighbors to a fusion point where the information is processed to recover each node's reference pose (i.e., zero pan and zero tilt).

Recent research (e.g., [1, 2, 12, 13]) has shown that image sensors with actuation capability have advantages during system operation. The sensing quality is significantly enhanced, particularly for environments with poorly modeled occlusions and target objects with frequent motion. We argue that the larger effective coverage provided by actuation offers several advantages over static imagers in deployment and configuration of the network. First, actuation reduces the network density required for the external calibration procedure to converge and establish a single unified coordinate frame for the entire system. This permits network deployment to be driven primarily by sensing considerations and not limited to those static configurations for which external calibration will converge. In any case, finding such static configurations where the external calibration will converge in the presence of environmental occlusions is a nontrivial problem.
A second benefit of actuation is that it can potentially provide more accurate external calibration results compared to a static network with an equivalent number of nodes, because of the additional observations it enables. From the application’s perspective,
actuation combines improvement in sensing performance and in camera localization, providing an indispensable capability to be embedded in vision networks.

Clearly, actuation is not without a price. It is costly in terms of node expense, size, and power consumption. On the other hand, while the cost of individual actuation-capable image sensor nodes is higher than that of static nodes, the significant reduction in node density enabled by actuation translates into a large overall system cost reduction. We also note that since external network calibration and recalibration is infrequent, in power-constrained applications one can design an actuation platform with very low static power (versus active power) to minimize overall power consumption and thus the impact of actuation over the life of the network.

Like our work, the Funiak et al. paper [11] exploits actuation, but it is the beacons that are actuated and not the cameras. From a practical perspective, the two approaches offer different benefits. While [11] does not require actuated cameras, it does require more manual effort, which becomes an issue for recalibration. One might imagine the use of robotic actuated beacons, but the mechanical complexity required for real-world operation would be substantially greater than that of simple pan-tilt actuation in the cameras. The computation involved is also significantly more complex, and the time to calibrate is longer with the actuated-object approach. Moreover, as shown by various authors [1, 2, 12, 13], actuation of cameras offers significant sensing performance advantages, and if actuation is present for sensing then it is better to use it for external calibration as well, instead of introducing the additional complexity of actuated beacons. Overall, our approach and that of Funiak et al. [11] reside at different points in the design space, with different regimes of applicability.

Our experimental evaluation includes two classes of actuation network. First, we investigate the applicability of our approach using commercially available Sony PTZ cameras for applications such as security and wide-area surveillance. The common theme in these applications is the requirement of high-resolution imaging and the availability of permanent power resources. In addition, we investigate the applicability of our scheme in constrained low-power camera networks with limited image resolution and power supply. In this case, to realistically study the impact of actuation in constrained settings, we designed and developed a custom low-power pan-tilt platform for commercially available Cyclops imagers [3].

The primary contribution of this chapter is a technique for external calibration (i.e., finding the location and orientation) of spatially distributed networked image sensors that exploits pan and tilt actuation capability. We present a methodology and a system architecture and, through simulation as well as experimentation, we illustrate that by exploiting actuation our technique can enable external calibration of far sparser networks than is possible with techniques that do not exploit actuation. A secondary contribution of our work is the low-power pan-tilt platform itself, which brings actuation capability, previously considered to be an exclusive realm of higher-end power-hungry cameras, to low-end power- and resource-constrained image sensors such as Cyclops.

This chapter is organized as follows. Section 3.2 describes the fundamental concepts underlying our methodology of actuation.
Section 3.3 describes actuation planning of the nodes for calibration. Section 3.4 describes our platform and system, and Section 3.5 describes the results of our simulation and experimental evaluation. We conclude the chapter in Section 3.6.
3.2 METHODOLOGY
In this section, we introduce calibration of a single triangle that consists of actuated cameras, and we describe our approach to merging such triangles to calibrate the entire network. Furthermore, we describe our approach for refinement of the estimated reference poses of the actuated cameras. Throughout our methodology, we assume that the camera's intrinsic parameters are calibrated [15] and that the camera's projection center is aligned with its actuation center.
3.2.1 Base Triangle
To determine the relative location and orientation of two camera nodes, the basic configuration of our scheme consists of two cameras that can mutually view each other and a common beacon node (Figure 3.1(a)). Hence, a closed-loop formulation of such a triangle can be written in camera A's coordinate system as

l_{ab} W_{ab} + l_{bc} R_{ab} W_{bc} - l_{ca} W_{ac} = 0    (3.1)

where W_ab, W_ac, and W_bc are unit vectors, l_ab, l_bc, and l_ca are the relative lengths of the three segments of the triangle, and R_ab is the rotation matrix that transforms an observation in camera B's coordinate system into camera A's coordinate system. Using mutual epipolar measurements of cameras A and B and common beacon C, equation 3.1 yields the relative poses of A and B and the relative lengths (not the absolute lengths) of the triangle segments [8, 9]. If beacon node C is also a camera, we extract only its relative position and not its pose with respect to cameras A and B; additional observations are needed to determine the relative orientation of camera C. In a practical network deployment, because of the cameras' limited FOV, solving equation 3.1 imposes a strong constraint on the camera poses, which in turn may interfere with the deployment requirements of the application. On the other hand, by exploiting actuation we loosen the constraint on the poses by reconfiguring the cameras to view each other, the beacon nodes, and the application's region of interest (Figure 3.1(b)). In this case, we can solve equation 3.1 for an extended triangle that consists of actuated observations.
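Once the relative rotation R_ab has been estimated from the mutual epipolar observations, the closed-loop constraint of equation 3.1 fixes the segment lengths up to a common scale. The following NumPy sketch is our own illustration of that step, not the authors' solver; it assumes R_ab and the three unit direction vectors are already available.

```python
import numpy as np

def relative_lengths(W_ab, W_bc, W_ac, R_ab):
    """Solve l_ab*W_ab + l_bc*(R_ab @ W_bc) - l_ca*W_ac = 0 for the segment
    lengths up to scale, given the relative rotation R_ab."""
    M = np.column_stack([W_ab, R_ab @ W_bc, -W_ac])   # 3x3 closed-loop constraint matrix
    _, _, Vt = np.linalg.svd(M)
    l = Vt[-1]                                        # null vector ~ (l_ab, l_bc, l_ca)
    if l[0] < 0:                                      # fix the arbitrary sign of the null vector
        l = -l
    return l
```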
FIGURE 3.1 (a) Triangle of two camera nodes A and B and one beacon node C. (b) Exploiting actuation to form a triangle based on actuated observations relaxes the constraint on the nodes’ poses. Dark and light areas represent camera FOVs during pose reconfiguration.
If we assume the reference pose of the actuated camera to be zero pan and zero tilt, then each observation, represented as a unit vector e, can be converted to a unit vector e' in the camera's reference pose:

e' = R_f \, e    (3.2)

R_f(\phi, \psi) = R_y(\phi)\, R_x(\psi)    (3.3)

where the rotation matrices are

R_y(\phi) = \begin{pmatrix} \cos\phi & 0 & -\sin\phi \\ 0 & 1 & 0 \\ \sin\phi & 0 & \cos\phi \end{pmatrix}, \qquad R_x(\psi) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & \sin\psi \\ 0 & -\sin\psi & \cos\psi \end{pmatrix}    (3.4)

In equations 3.3 and 3.4 the parameters \phi and \psi represent the pan and tilt angles, respectively.
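The conversion of an actuated observation into the reference pose is straightforward to implement. The sketch below mirrors equations 3.2 through 3.4 (angles in radians); it is an illustration of the stated formulas rather than the authors' code.

```python
import numpy as np

def reference_rotation(pan, tilt):
    """R_f(phi, psi) = R_y(phi) @ R_x(psi), equations 3.3 and 3.4."""
    c, s = np.cos(pan), np.sin(pan)
    Ry = np.array([[c, 0.0, -s],
                   [0.0, 1.0, 0.0],
                   [s, 0.0, c]])
    c, s = np.cos(tilt), np.sin(tilt)
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, c, s],
                   [0.0, -s, c]])
    return Ry @ Rx

def to_reference_pose(e, pan, tilt):
    """Convert an observation unit vector e taken at (pan, tilt) into the
    camera's zero-pan, zero-tilt reference frame (equation 3.2)."""
    return reference_rotation(pan, tilt) @ e
```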
3.2.2 Large-Scale Networks
In this section, we describe our approach to merging observations among the nodes in the actuated camera network into a common frame. We build a visibility graph (see Figure 3.3), in which a vertex is either an actuated beaconing camera or at least a beacon node. In this graph, a directed edge (A → B) indicates that camera A can view node B in the network. We further introduce a formalism to describe camera network topology and merging rules.

Definition 3.1. A base triangle Δ is the triangle defined in the previous section; it consists of two actuated cameras that, in some arbitrary pose, can view each other and a common beacon node. It is the smallest triangular subgraph in the network, and we merge larger networks by merging such triangles.

Definition 3.2. A base edge in a base triangle is the edge that connects the two cameras that can view each other (e.g., AB in Figure 3.1(a)). In a larger subgraph, a base edge refers to two nodes that can see each other and a third beacon.

Rule 3.1. Two base triangles can merge if and only if they share an edge such that one node of that edge lies on a base edge in both triangles. We call such a node a base node and use an underscore notation to denote it.
Rule 3.1 can be realized as follows. Since the relative poses of the other nodes in both triangles are known in the base node's coordinate system, as described in Section 3.2.1, they can be merged into it. Furthermore, the shared edge enables adjusting the relative lengths of the edges in both triangles. Figure 3.2 illustrates the application of the rule pictorially. The triangles Δ_ABC and Δ_BCD in Figure 3.2(a) and the triangles Δ_ABC and Δ_BCD in Figure 3.2(b) merge through a shared edge and base node C. In Figure 3.2(c), however, the base edge BC is shared between the two triangles, so the rule can be applied through both nodes B and C to merge the two triangles Δ_ABC and Δ_BCD. In practice, we can merge the two triangles using node B or C and average the results to minimize the inaccuracy due to measurement errors. Finally, Figure 3.2(d) shows two triangles that share an edge and yet cannot be merged, since neither node B nor node C lies on a base edge in both triangles.
FIGURE 3.2 Merger of the base triangles. (a)(b) Triangles can be merged through the shared edge and base node C. (c) Triangles can be merged through shared edge BC and both base nodes B and C. (d) Triangles cannot be merged since they do not have a shared base node.
FIGURE 3.3 Directed graph to abstract the visual connectivity between cameras.
We now expand our formalism to larger subgraphs.

Definition 3.3. A cluster Γ is a connected subgraph that consists of base triangles connected as shown in Figure 3.2(a), Figure 3.2(b), or Figure 3.2(c). In a cluster the relative locations of all nodes and the orientations of all base nodes are known. We use the base node with the minimum node ID as the root node for obtaining the relative locations of the other nodes and the relative orientations of the other base nodes in the cluster.

Rule 3.2. Two clusters can merge if and only if they share an edge that has at least one end point that is a base node in both clusters. In the case where two clusters share two or more edges with this property, we choose the one with the minimum sum of end node IDs to merge them.
Figure 3.3 illustrates how nodes are merged through the above rules. Initially, separate base triangles are formed: Δ_ABC, Δ_BCD, Δ_BDE, Δ_EFD. Clusters then merge incrementally so that eventually all nodes can be included in a common coordinate system. Thus, in Figure 3.3, clusters Γ_ABCD, Γ_BCDE, Γ_BDEF are formed at the second stage, followed by clusters Γ_ABCDE, Γ_BCDEF at the third stage, followed by Γ_ABCDEF at the fourth stage. In this case, we extract the relative orientation and location of all nodes in the final cluster Γ_ABCDEF except node D, for which we can extract only the location because additional observations would be required for its orientation. When a final cluster is found, there are different ways of extracting the position or rotation of a node with respect to the root node. For example, in Figure 3.3, suppose camera A serves as the root node. To determine the rotation of camera F, we can trace the path A ←→ C ←→ B ←→ E ←→ F. If another bidirectional edge is added between B and F, we can also trace the path A ←→ C ←→ B ←→ F
to determine the rotation of F. To minimize accumulation of pose estimation errors over multiple hops, we use the shortest path of each node from the root node to determine its pose.
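A sketch of this shortest-path composition, assuming the pairwise rotations recovered during merging are stored in a dictionary keyed by ordered node pairs; this is our own illustration of the chaining idea, not the authors' code.

```python
import numpy as np
from collections import deque

def rotation_via_shortest_path(root, target, pairwise_R):
    """Compose pairwise rotations along the minimum-hop path from root to
    target. pairwise_R[(a, b)] is assumed to map coordinates expressed in
    b's frame into a's frame, with both (a, b) and (b, a) present whenever
    the two nodes observed each other."""
    neighbors = {}
    for a, b in pairwise_R:
        neighbors.setdefault(a, set()).add(b)

    # Breadth-first search yields the shortest path in hops.
    parent, queue = {root: None}, deque([root])
    while queue and target not in parent:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if target not in parent:
        raise ValueError("target is not in the same cluster as root")

    # Recover root -> ... -> target and chain the rotations along it.
    path = [target]
    while path[-1] != root:
        path.append(parent[path[-1]])
    path.reverse()

    R = np.eye(3)
    for a, b in zip(path, path[1:]):
        R = R @ pairwise_R[(a, b)]
    return R
```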
3.2.3 Bundle Adjustment Refinement
After triangles are merged into a cluster, each camera's reference pose is determined in the coordinate frame of the root node. However, we expect an accumulation of pose errors that depends on the initial projection errors. To reduce pose estimation errors, we can jointly refine the pose configurations with a centralized bundle adjustment algorithm [10]. After triangles are merged and the initial pose estimates are computed, we formulate a nonlinear optimization problem by minimizing a cost function f_o that captures the reprojection error [10]:

\min_{R_i, T_i} f_o = \sum_i \sum_{j \in V_i} \left\| e_i^j - \frac{R_i (T_j - T_i)}{\| R_i (T_j - T_i) \|_2} \right\|_2^2    (3.5)

Here the subscript i indexes the camera nodes in the cluster and the subscript j indexes the set of beacons V_i that can be seen by camera node i. Matrix R_i is the rotation matrix of camera i, and vectors T_i and T_j are the positions of camera i and beacon j, respectively. The unit vector e_i^j is the projection of beacon j onto the image plane of camera i. Based on the cost function and an associated optimization scheme (e.g., Newton's method), the pose configuration is refined after a sufficient number of iterations. In our implementation, to avoid numerical instability in cases where T_i ≈ T_j in equation 3.5, we reformulate the optimization problem as [4, 16]:

\min_{R_i, T_i} f_o = \sum_i \sum_{j \in V_i} \left\| \, \| R_i (T_j - T_i) \|_2 \, e_i^j - R_i (T_j - T_i) \right\|_2^2    (3.6)
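For reference, the cost of equation 3.6 can be evaluated as below. Parameterizing the rotations (e.g., with axis-angle vectors) and handing this residual to a generic nonlinear least-squares routine is one way to realize the refinement; the authors' optimizer may differ.

```python
import numpy as np

def reprojection_cost(Rs, Ts, observations):
    """Evaluate equation 3.6. observations[(i, j)] holds the unit vector e_i^j
    under which camera i saw beacon j (converted to camera i's reference pose);
    Rs[i] and Ts[i] are the current rotation and position estimates of node i."""
    cost = 0.0
    for (i, j), e_ij in observations.items():
        v = Rs[i] @ (Ts[j] - Ts[i])                   # beacon j expressed in camera i's frame
        cost += float(np.sum((np.linalg.norm(v) * e_ij - v) ** 2))
    return cost
```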
Algorithm 3.1 shows the pseudocode of our methodology. The pseudocode runs on a merging node that receives information from all nodes, merges the nodes into clusters, and estimates the pose of the nodes relative to the root node of each cluster. The merging node continuously receives information from the actuating cameras until a termination rule, described in Section 3.3, is met. Note that, depending on the network configuration, the network may contain multiple clusters that cannot be merged further; we apply the following algorithm to each cluster independently.

Algorithm 3.1: Network Merge Pseudocode
• Repeat
  – Each camera actuates continuously, applies equation 3.2, and transmits 2D observations.
• Until an actuation termination rule is met.
• Search among the set of observations to build the base triangles.
• Merge the base triangles and resulting clusters based on Rule 3.1 and Rule 3.2.
• For each cluster
  – Choose the node with the minimum ID as the root node.
  – Estimate the reference pose of each node versus the reference pose of the root node.
  – Execute the bundle adjustment.
3.3 ACTUATION PLANNING
In this section we give an overview of our approach to orchestrating actuation among the nodes in the network. We assume that the network is in calibration mode for a time episode and discuss actuation strategies as well as rules that indicate when calibration mode should be aborted. We also discuss the impact on the latency and power consumption of the localization service. In our discussion, we assume that beaconing nodes offer an omnidirectional view, and we constrain only the cameras' FOVs. This can easily be achieved by having multiple directional beacons (i.e., LEDs) pointing in different directions. We also briefly discuss the implications of directional beacons for actuation planning.
3.3.1 Actuation Strategies
The strategy employed by the cameras to move through their range of motion affects the quantity of observations and the system's latency and energy costs. For a network with omnidirectional beacons, a 2D quasi-raster scan pattern (Figure 3.4), consisting of multiple S-shapes (a) or a spiral path (b), guarantees coverage of all visible nodes in the neighborhood of the camera. Equations 3.7 and 3.8 estimate the time needed for a quasi-raster scan:

g(\delta) = \frac{\theta_{max}}{\delta} + 1    (3.7)

T = T_{actuation} + T_{observation} = \frac{1}{\omega}\left( \theta_{max}^{t} + g(\delta_t)\, \theta_{max}^{p} \right) + g(\delta_p)\, g(\delta_t)\, \tau_{observation}    (3.8)

where \theta_{max}^{p} \leq 2\pi and \theta_{max}^{t} \leq \pi are the maximum pan and tilt angles, respectively, \delta_p and \delta_t are the pan and tilt step sizes, respectively, and \omega is the angular velocity of the actuator. To guarantee complete coverage, the maximum angular step size should be less than or equal to the camera's FOV \theta_{fov}. Since neighboring beacon nodes may be observed near the image borders, where pixel quantization error is greatest, applications with stringent accuracy requirements may opt for smaller step sizes, at the cost of higher latency and greater energy consumption.
FIGURE 3.4 (a) S-shape raster scan. (b) Spiral raster scan.
The observation time \tau_{observation} is directly proportional to the bit length of the optical identification packet and inversely proportional to the camera frame rate. Equation 3.9 provides an estimate of the energy required to perform a scan, where P_{actuation} represents the power consumed by the actuator while in motion and P_{observation} represents the power consumed by the camera and beacons while capturing images:

E = \frac{P_{actuation}}{\omega}\left( \theta_{max}^{t} + g(\delta_t)\, \theta_{max}^{p} \right) + g(\delta_p)\, g(\delta_t)\, \tau_{observation}\, P_{observation}    (3.9)
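A small calculator for these estimates follows; it simply transcribes equations 3.7 through 3.9 and makes no assumptions beyond consistent units.

```python
def raster_scan_cost(theta_max_p, theta_max_t, delta_p, delta_t,
                     omega, tau_obs, P_act, P_obs):
    """Latency (equations 3.7 and 3.8) and energy (equation 3.9) of one
    quasi-raster scan. Angles in radians, omega in rad/s, tau_obs in seconds
    per observation, powers in watts."""
    g_t = theta_max_t / delta_t + 1                     # tilt rows, equation 3.7
    g_p = theta_max_p / delta_p + 1                     # pan steps per row
    t_actuation = (theta_max_t + g_t * theta_max_p) / omega
    t_observation = g_p * g_t * tau_obs
    T = t_actuation + t_observation                     # equation 3.8
    E = P_act * t_actuation + P_obs * t_observation     # equation 3.9
    return T, E
```

As an illustrative example (not a measurement from this chapter), 180° ranges on both axes, 30° steps, an angular speed of about 315°/s, and half a second per observation give a scan on the order of half a minute, dominated by the roughly 49 observation stops.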
Note that, for a network with directional beacons, the quasi-raster strategy can miss neighboring nodes: two actuated cameras located within viewing range of each other may never observe one another, regardless of search duration. To avoid this, we can use a random scan search in networks with directional beacons, in which the camera pans and tilts by randomly selected step sizes after each observation, guaranteeing that cameras within viewing range eventually observe each other.
3.3.2 Actuation Termination Rules
Our system uses two types of rules: local rules, which are evaluated and followed at each camera, and global rules, which are evaluated at the microserver and followed throughout the entire network. A network with omnidirectional beacons terminates the localization protocol when any of the following rules is met:
■ Local exhaustive: Each camera aborts at the end of one quasi-raster scan. This guarantees that a camera will collect all the neighboring node observations in a static environment.
■ Global complete network merge: All cameras abort when the microserver has received enough observations to merge the entire network into a cluster.
■ Global marginal error enhancement: All cameras abort when the improvement in the bundle adjustment cost function is below a threshold after a user-defined number of iterations.
Although the local exhaustive rule is useful when power consumption and latency are not at issue, it is costly in applications that require frequent localization services (e.g., dynamic environments) or when energy consumption is important. Additionally, the exhaustive rule does not guarantee complete coverage in a network with directional beacons, so any of the above global rules may be applicable. Finally, note that application of the global rules requires adjustment to the network merge algorithm such that the entire algorithm iterates at each step.
3.4 SYSTEM DESCRIPTION
In this section, we describe the architecture of our system, including the wireless camera nodes, the optical communication beacons, and the network organization. We also describe the deployment of both Cyclops and Sony camera nodes used to evaluate the performance of our system.
3.4.1 Actuated Camera Platform
We used two types of actuated camera for our experimentation: a Sony off-the-shelf, high-resolution pan-tilt-zoom (PTZ) camera, and a Mote/Cyclops camera with our custom-designed actuation platform (Figure 3.5(b), (c)). The former we used for large-space experimentation with applications having no power constraints and large distances between camera modules; environmental monitoring, security, and various defense-related scenarios fit this profile. The latter we used for small-space experimentation with applications having power constraints and small distances between camera modules; intelligent environments, asset monitoring, and indoor security applications fit this profile. The Sony PTZ camera features a high-resolution VGA (640×480) image sensor and a 45° FOV when completely zoomed out. It covers a pan range of 340° and a tilt range of 115°, and its angular speed is 170°/s on the pan axis and 76.67°/s on the tilt axis. The module has both 10/100 Base-T Ethernet and 802.11 interfaces and runs an HTTP server for control and image capture. The Cyclops module, which serves as an image sensor for a Mote-class radio, is a commercially available low-power, low-resolution CIF (320×280) imager with a 42° FOV. To design an actuation platform for the Mote/Cyclops pairs, we took various parameters into account, including actuation step size and actuation delay. We carefully chose the actuation step size to minimize error in overall calibration performance. In addition, to guarantee minimal accumulation of pose errors over time, our actuation platform performs reference pose calibration by periodically revisiting the zero pan and tilt pose at mechanically enforced hard-limit points. The actuation platform is capable of 180° rotation on both pan and tilt axes and has a 1° angular repositioning error. To achieve minimum power consumption with little impact on the life of the camera, we selected an actuation system with a miniature gearbox that holds its pose against gravity (for the current mechanical load of the sensor node) without consuming active power.
FIGURE 3.5 System components. (a) Camera and optical beacon nodes and a computationally capable microserver node for information fusion and network-level external calibration. (b) The Sony PTZ actuation camera for large-space deployment. (c) The Mote/Cyclops camera for small-space deployment with our custom-designed actuation platform.
The price we pay for this is extra power consumption during active actuation due to the additional gearbox mechanical load. This is acceptable, however, as we seek to optimize platform energy consumption for the infrequent pose reconfiguration case. The actuation platform consumes 800 mA and 0 mA during active and inactive mode, respectively, and actuates at an angular speed of 315°/s on both axes. In our experience, the image distortion due to nonlinearity in the lenses significantly influenced the 2D projection error in both camera platforms. To minimize the impact of image distortion, we performed an initial intrinsic calibration with a known-size grid pattern to acquire the camera-intrinsic parameters and compensate for camera distortion. For every extracted pixel we recovered the pixel position by executing a nonlinear optimization, which compensates for the expected distortion [14].
3.4.2 Optical Communication Beaconing
As mentioned before, we used an active beaconing approach to identify commonly observed features between pairs of cameras. There are two reasons for using active beacons: to improve robustness in extracting easily identifiable features and to simplify image processing on the sensor nodes. In our system, a blinking LED sends a Manchester-coded optical message that consists of a preamble, a unique ID, and a cyclic redundancy check code. The neighboring camera nodes collect images at a frame rate higher than the transmission rate of the beacons and perform successive frame differencing to extract the 2D projections of the neighboring active beacons, as well as their IDs, by decoding the associated optical messages. To minimize error due to the offset of an LED from the center of the image of the beaconing camera node, we attached several LEDs to each beaconing camera, and a neighboring camera used the average of the LED projections as the 2D projection of the neighboring beacon.
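A much-simplified sketch of the decoding idea follows (our own illustration, not the deployed implementation): locate a blinking LED by temporal frame differencing, threshold its intensity to one on/off sample per frame, and decode the Manchester symbols. We assume two captured frames per transmitted bit and the (on, off) maps to 1 convention; preamble and CRC handling are omitted.

    import numpy as np

    def locate_beacon(frames):
        """Return the (row, col) pixel whose successive frame differences vary most."""
        diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))
        activity = diffs.sum(axis=0)
        return tuple(int(v) for v in np.unravel_index(np.argmax(activity), activity.shape))

    def sample_led(frames, pixel, threshold=128):
        """Threshold the intensity at `pixel` in every frame to an on/off sample."""
        return (frames[:, pixel[0], pixel[1]] > threshold).astype(int)

    def manchester_decode(samples):
        bits = []
        for i in range(0, len(samples) - 1, 2):
            pair = (samples[i], samples[i + 1])
            if pair == (1, 0):
                bits.append(1)
            elif pair == (0, 1):
                bits.append(0)
            else:
                raise ValueError("invalid symbol; resynchronize on the preamble")
        return bits

    # Synthetic test: encode the ID bits 1, 0, 1, 1 at pixel (3, 4) of 8x8 frames.
    bits = [1, 0, 1, 1]
    samples = sum(([1, 0] if b else [0, 1] for b in bits), [])
    frames = np.zeros((len(samples), 8, 8), dtype=np.uint8)
    frames[:, 3, 4] = 200 * np.array(samples)
    px = locate_beacon(frames)
    print(px, manchester_decode(list(sample_led(frames, px))))   # (3, 4) [1, 0, 1, 1]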
3.4.3 Network Architecture
To minimize the resources required by the localization service, our system maps onto a typical sensor network architecture comprising three types of wireless device: actuated camera nodes, optical beacon nodes, and a microserver (Figure 3.5(a)). Additionally, each camera node carries one or more LEDs for optical beaconing. In our Cyclops network, each time a camera identifies the modulated light pattern of a node in its FOV, it sends an observation message that consists of the ID of the reporting node, the ID of the observed node, the image plane coordinates of the observed node, and the pan and tilt angles of the reporting node. The microserver incorporates each new observation into its database of observations and executes the Network Merge algorithm until the appropriate termination rule is met. It has a wireless radio to receive messages from the camera nodes and to send control messages (e.g., localization termination) over the network, as well as the computation capability to execute the Network Merge algorithm. Alternatively, the Network Merge algorithm can be performed at a backend server to minimize the computation requirements of the microserver. In our Sony camera network, since each camera was accessible through high-speed channels, the microserver was used to actuate the cameras, capture multiple images, identify the modulated light pattern, and execute the Network Merge algorithm.
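To make the message contents concrete, here is a hypothetical container mirroring the fields listed above (the field names and types are ours; the chapter does not specify a wire format):

    from dataclasses import dataclass

    @dataclass
    class Observation:
        reporter_id: int    # ID of the reporting camera node
        observed_id: int    # ID of the observed node
        u: float            # image-plane x coordinate of the observed node
        v: float            # image-plane y coordinate of the observed node
        pan_deg: float      # pan angle of the reporting node at capture time
        tilt_deg: float     # tilt angle of the reporting node at capture time

    database = []
    database.append(Observation(3, 7, 241.0, 118.5, 45.0, 0.0))   # microserver side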
In our experimentation, we only considered a homogeneous network of actuated beaconing camera nodes. However, in our simulations we also considered a heterogeneous network comprising camera nodes with both image-sensing and beaconing capabilities and beacon nodes with only beaconing capabilities. This minimizes the system cost in a sparse network, where the application does not require a large number of cameras.
3.5 EVALUATION
We evaluated the performance of our actuation-assisted system with respect to the following metrics: external calibration accuracy, required node density, and external calibration latency. We used testbed experimentation and simulation environments. Our testbed consisted of both Sony PTZ cameras and actuated Crossbow MicaZ motes with Cyclops image sensors. The MicaZ and the Cyclops run the SOS operating system [17]. The camera nodes captured images that were processed at our backend server for verifying the localization algorithm and evaluating its accuracy. In our backend, we used Python with the numeric extensions package (NumPy). In addition, we developed a simulation environment that allowed us to generate camera networks, acquire observations, execute localization algorithms, and visualize results. We used the simulation environment to analyze larger-scale network scenarios, which are hard to build in our small-scale experimentation environment, and to evaluate the performance of the actuated camera network with respect to density and latency.
3.5.1 Localization Accuracy
We evaluated the accuracy of our system by running our localization algorithm on our experimentation testbed. Figure 3.6 shows the deployment of six Sony PTZ cameras, which covered an area of 80 m²; light gray spots represent these cameras, and arrows indicate the visibility among them. Figure 3.6 also shows six actuated Cyclops cameras, represented by dark gray spots, which we deployed in a single large cubicle. That deployment covered an area of 9 m². Figures 3.7 and 3.8 show the Euclidean localization error and the orientation estimation error for each node, respectively. In our experience, the dominant sources of localization error were misalignment between the center of projection of the camera and the center of the actuator, mechanical imprecision of the actuation platform, and pixel quantization effects introduced by the camera's finite resolution. To investigate the impact of resolution on accuracy, we simulated a camera network consisting of 12 nodes uniformly deployed along the perimeter of a 5-m circle (Figure 3.9(a)). Figure 3.9(b) shows diminishing improvement in localization error as we increased the image resolution. This was expected: at low resolutions, pixel quantization is the major error component, so increasing resolution helps substantially; beyond some point the other sources of error dominate and further increases in image resolution have less impact on localization accuracy. Figure 3.9(b) also shows the improvement in accuracy after we applied the bundle adjustment algorithm. In general, bundle adjustment reduces localization error, but we observed a reduction in the degree of improvement as we increased the resolution of the cameras.
FIGURE 3.6 Experimental deployment of the actuated Sony and Cyclops cameras in the CENS building at UCLA. Light gray spots and arrows indicate six Sony PTZ modules covering an area of 80 m² across multiple cubicles; dark gray spots and arrows represent six actuated Cyclops camera modules covering an area of 9 m² in one large cubicle.
FIGURE 3.7 (a) Sony PTZ camera localization error. The Euclidean estimation error represents the distance between estimated and ground truth positions. We used camera 1 as the root node and measured its distance versus camera 2 to estimate absolute distances. (b) Sony PTZ camera orientation error, representing the angular difference between estimated and ground truth orientations.
FIGURE 3.8 (a) Cyclops camera localization error. (b) Cyclops camera orientation error. The experimentation setup is the same as for the Sony PTZ camera test (see Figure 3.7).
FIGURE 3.9 (a) Simulation of 12 cameras deployed along the boundary of a circle. (b) Distance error accuracy versus resolution in our circularly deployed network (12 cameras, FOV 45°).
3.5.2 Node Density
We examined the relationship between the spatial density of cameras and the likelihood of localizing them. First we simulated a network of Sony PTZ cameras in a large area of 10×10×10 m. Since the sensing range of this camera exceeds 10 m, the entire room was within its viewing range. We first considered a “smart” network topology, where each camera has a random location but its pose faces the center of the room. In the static case, we used the network merge algorithm to localize the network. In the actuated case,
we used the same network topology but gave the cameras a full pan and tilt range and used the quasi-raster actuation policy. In both cases, we simulated networks from 2 to 50 cameras in size. Figure 3.10(a) shows the results. The horizontal axis represents the total number of cameras in the network. The vertical axis represents the number of cameras in the largest localized cluster (a measure of localization success). Clearly, the actuated camera network outperforms the static camera network; it almost always localizes completely. However, the two data sets converge as the number of cameras increases. This is because each
FIGURE 3.10 (a) Successfully localized Sony PTZ cameras versus network size for simulation of smartly deployed cameras in a large room. (b) Successfully localized Cyclops cameras versus network size for simulation of smartly deployed cameras in a large room.
camera is likely to face other cameras that are facing it, an ideal setting for building base triangles. In the next experiment, we simulated a network of Cyclops cameras (Figure 3.10(b)) in the same 10×10×10-m area. Since Cyclops can distinguish our LED beacons only within 3 m, no individual camera can observe everything. Figure 3.10 illustrates that, not surprisingly, the actuated camera network significantly outperforms the static camera network because of the limited coverage of the individual cameras relative to the size of the environment.
3.5.3 Latency
An important cost of actuation assistance is increased latency. In this section, we evaluate the relationship between the duration and the effectiveness of localization. This simulation used a network of 30 actuated Cyclops cameras with random locations and poses in a 6×6×3-m space. Cyclops has a 3-m viewing range, a 45° FOV, and a 180° pan and tilt range. At each actuation step in the process, we performed the localization algorithms on the observations recorded up to that point. Figure 3.11 shows the number of cameras in the largest localized cluster as a function of time using both the quasi-raster and random actuation strategies. Elapsed time represents the time spent monitoring each pose and actuating from pose to pose based on the following assumptions, which are derived from the Cyclops hardware measurements: an optical identification packet is 2 bytes long; the camera capture rate is 1 frame/s; and the actuator has an angular velocity of 315°/s. Quasi-raster search uses an actuation step size equal to the camera's FOV. Random search selects a random step size each time it actuates. According to Figure 3.11, the local exhaustive stop rule is met after 355 s when using quasi-raster search and 480 s when using random search. These delays vary with network topology, but a total duration of a few minutes
FIGURE 3.11 Relationship between the duration of the localization protocol and the number of localized cameras for the quasi-raster and random actuation strategies.
for localization of the network, even with the slow frame rate of the Cyclops camera, is indicative of rapid localization.
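As a rough plausibility check, Equations 3.7 and 3.8 can be evaluated directly with the Cyclops parameters quoted above; the following sketch is ours, and it assumes, for illustration only, one captured frame per transmitted bit of the 2-byte identification packet and a floored step count in Equation 3.7:

    import math

    def g(theta_max_deg, delta_deg):
        return math.floor(theta_max_deg / delta_deg) + 1          # Equation 3.7

    def scan_time(pan_max, tilt_max, delta_p, delta_t, omega, t_obs):
        g_p, g_t = g(pan_max, delta_p), g(tilt_max, delta_t)
        t_actuation = (tilt_max + g_t * pan_max) / omega          # Equation 3.8, first term
        t_observation = g_p * g_t * t_obs                         # Equation 3.8, second term
        return t_actuation + t_observation

    # 180-degree ranges, FOV-sized 45-degree steps, 315 deg/s, 16 bits at 1 frame/s
    print(scan_time(180, 180, 45, 45, 315, t_obs=16.0))           # about 403 s

With these assumptions the estimate is a few hundred seconds, the same order as the 355 s measured for the quasi-raster scan.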
3.6 CONCLUSIONS
We presented a methodology, a system architecture, and a prototype platform for actuation-assisted vision-based external calibration (i.e., determining 3D location and 3D orientation) of a network of spatially distributed cameras. In our system, the network consists of actuated camera devices that are equipped with an optical beacon. During the calibration period, each node continuously sends a unique optical message. It also actuates its camera to intercept the optical messages of its neighbors and determine their poses in its local coordinate system. Furthermore, each node sends its local information to a merging point, where the information is aggregated to build a unified coordinate system. Our evaluation illustrates a significant reduction in network density versus that of a static camera network. In addition, we illustrated that our approach can continuously refine its pose estimate using additional observations that are potentially available in an actuated network. Our future work includes investigation of a distributed implementation of our methodology in larger networks and localization of a mobile network of actuated cameras.
REFERENCES
[1] M. Chu, P. Cheung, J. Reich, Distributed attention, in: Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, 2004.
[2] A. Kansal, W. Kaiser, G. Pottie, M. Srivastava, G. Sukhatme, Virtual high-resolution for sensor networks, in: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems, 2006.
[3] M. Rahimi, R. Baer, O.I. Iroezi, J.C. Garcia, J. Warrior, D. Estrin, M. Srivastava, Cyclops: in situ image sensing and interpretation in wireless sensor networks, in: Proceedings of the 3rd International Conference on Embedded Networked Sensor Systems, 2005.
[4] W.E. Mantzel, H. Choi, R.G. Baraniuk, Distributed camera network localization, in: Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, 2004.
[5] D. Devarajan, R.J. Radke, H. Chung, Distributed metric calibration of ad hoc camera networks, ACM Transactions on Sensor Networks 2 (3) (2006) 380–403.
[6] H. Lee, H. Aghajan, Collaborative self-localization techniques for wireless image sensor networks, in: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, 2005.
[7] H. Lee, H. Aghajan, Vision-enabled node localization in wireless sensor networks, in: Cognitive Systems with Interactive Sensors, 2006.
[8] D. Lymberopoulos, A. Barton-Sweeney, A. Savvides, Sensor Localization and Camera Calibration using Low Power Cameras, Yale ENALAB Technical Report 08050, 2005.
[9] A. Barton-Sweeney, D. Lymberopoulos, A. Savvides, Sensor localization and camera calibration in distributed camera sensor networks, in: Proceedings of IEEE Basenets, 2006.
[10] C.J. Taylor, B. Shirmohammadi, Self localizing smart camera networks and their applications to 3D modeling, in: Proceedings of the Workshop on Distributed Smart Cameras, 2006.
[11] S. Funiak, C. Guestrin, M. Paskin, R. Sukthankar, Distributed localization of networked cameras, in: Proceedings of the Fifth International Conference on Information Processing in Sensor Networks, 2006.
[12] A. Kansal, E. Yuen, W.J. Kaiser, G.J. Pottie, M.B. Srivastava, Sensing uncertainty reduction using low complexity actuation, in: Proceedings of the Third International Symposium on Information Processing in Sensor Networks, 2004.
[13] A. Kansal, J. Carwana, W.J. Kaiser, M.B. Srivastava, Coordinating camera motion for sensing uncertainty reduction, in: Proceedings of the Third International Conference on Embedded Networked Sensor Systems, 2005.
[14] www.vision.caltech.edu/bouguetj/calib_doc/.
[15] R. Hartley, A. Zisserman, Multiple View Geometry, Cambridge University Press, 2000.
[16] B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle Adjustment: A Modern Synthesis, in: Vision Algorithms: Theory and Practice, LNCS 1883, Springer-Verlag, 2000.
[17] C.-C. Han, R. Kumar, R. Shea, E. Kohler, M. Srivastava, A dynamic operating system for sensor nodes, in: Proceedings of the 3rd International Conference on Mobile Systems, Applications, and Services, 2005.
CHAPTER 4
Building an Algebraic Topological Model of Wireless Camera Networks
E. Lobaton, S.S. Sastry, Electrical Engineering and Computer Sciences Department, University of California, Berkeley, California
P. Ahammad, Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, Virginia
Abstract
The problem of wide-area surveillance from a set of cameras without reliance on calibration has been approached by many through the computation of “connectivity/visibility graphs.” However, even after constructing these graphs, it is not possible to recognize features such as holes in the coverage. The difficulty is not due to the techniques used for finding connectivity between camera views, but rather due to the lack of information in visibility graphs. We propose a refined combinatorial representation of a network using simplices instead of edges and provide a mathematical framework (along with simulation and experimental results) to show that this representation contains, at the very least, accurate topological information such as the number of holes in network coverage. We also discuss ways in which this construct can be used for tracking, path identification, and coordinate-free navigation.
Keywords: simplicial homology, camera network coverage, layout recovery, nonmetric reconstruction
4.1 INTRODUCTION
Identification of the exact location of targets and objects in an environment is essential for many sensor network surveillance applications. However, there are situations in which localization of sensors is not known (because of the unavailability of GPS, ad hoc network setup, etc.). A common approach to overcoming this challenge has been to determine the exact localization of sensors and reconstruction of the surrounding environment. Nevertheless, we will provide evidence supporting the hypothesis that many of the tasks at hand may not require exact localization information. For instance, when tracking individuals in an airport, we may want to know whether they are in the vicinity
of a specific gate. In this scenario, it is not absolutely necessary to know their exact location. Another example is navigation through an urban environment. This task can be accomplished by making use of target localization and a set of directions such as where to turn right and when to go straight. In both situations, a general description of our surroundings and the target location is sufficient. The type of information that we desire is a topological description of the environment that captures its appropriate structure. Figure 4.1 serves as a didactic tool for understanding the information required for our approach to the tracking/navigation problem. Observe that the complete floor plan (a) and the corresponding abstract representation (b) serve an equivalent purpose. The abstract representation allows us to track a target and navigate through the environment. Our goal in this context is to use continuous observations from camera nodes to extract the necessary symbols to create this representation. In this chapter, we consider a camera network in which each camera node can perform local computations and extract some symbolic/discrete observations to be transmitted for further processing. This conversion to a symbolic representation alleviates the communication overhead for a wireless network. We then use the discrete observations to build a model of the environment without any prior localization information about the objects or the cameras themselves. Once such a nonmetric reconstruction of the camera network is accomplished, this representation can be used for tasks such as coordinate-free navigation, target tracking, and path identification. We start by introducing, in Section 4.2, the algebraic topological tools used for constructing our model. We then describe our camera and environment models and discuss how topological recovery (or nonmetric reconstruction) of the camera network can be done in 2D and 2.5D, providing simulations and experiments.
FIGURE 4.1 (a) Physical layout of an environment where a camera network is set up to cover the rooms and the circular hallway. (b) Abstraction of the layout where exact measurements are not available but the overall topological structure is maintained. In both cases we observe a target has been tracked and the corresponding path for its motion.
4.2 MATHEMATICAL BACKGROUND
In this section we cover the concepts from algebraic topology that will be used throughout this chapter. The section contains material adapted from Armstrong and from Ghrist and Muhammad [2, 7] and is not intended as a formal introduction to the topic. For a proper introduction, the reader is encouraged to read [8, 9, 13].
4.2.1 Simplicial Homology
Let us start by introducing some definitions that will draw connections between geometric sets and algebraic structures.
Definition 4.1. Given a collection of vertices V, a k-simplex is an unordered set [v_1 v_2 v_3 . . . v_{k+1}], where v_i ∈ V and v_i ≠ v_j for all i ≠ j. Also, if A and B are simplices and the vertices of B form a subset of the vertices of A, then we say that B is a face of A.
Definition 4.2. A finite collection of simplices in R^n is called a simplicial complex if whenever a simplex lies in the collection so does each of its faces.
Definition 4.3. The nerve complex of a collection of sets S = {S_i}_{i=1}^N, for some N > 0, is the simplicial complex where vertex v_i corresponds to the set S_i and its k-simplices correspond to non-empty intersections of k + 1 distinct elements of S.
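To make Definition 4.3 concrete, the following small Python sketch (our own illustration) enumerates the nerve of a collection of sets, here represented simply as Python sets of sampled points:

    from itertools import combinations

    def nerve(sets, max_dim=2):
        """Return the simplices (tuples of set indices) of dimension <= max_dim whose
        corresponding sets share at least one common point (Definition 4.3)."""
        simplices = []
        for k in range(1, max_dim + 2):            # k vertices span a (k-1)-simplex
            for combo in combinations(range(len(sets)), k):
                if set.intersection(*(sets[i] for i in combo)):
                    simplices.append(combo)
        return simplices

    # Three 1D "coverage" sets: 0 overlaps 1, 1 overlaps 2, but 0 and 2 are disjoint.
    cover = [set(range(0, 6)), set(range(4, 10)), set(range(8, 14))]
    print(nerve(cover))   # [(0,), (1,), (2,), (0, 1), (1, 2)]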
The following statements define some algebraic structures using these simplices.
Definition 4.4. Let {s_i}_{i=1}^N (for some N > 0) be the k-simplices of a given complex. Then the group of k-chains C_k is the free abelian group generated by {s_i}. That is, σ ∈ C_k iff

σ = α_1 s_1 + α_2 s_2 + · · · + α_N s_N

for some α_i ∈ Z. If there are no k-simplices, then C_k := 0. Similarly, C_{−1} := 0.
Definition 4.5. Let the boundary operator ∂_k applied to a k-simplex s, where s = [v_1 v_2 . . . v_{k+1}], be defined by

∂_k s = Σ_{i=1}^{k+1} (−1)^{i+1} [v_1 v_2 . . . v_{i−1} v_{i+1} . . . v_k v_{k+1}]

and extended to any σ ∈ C_k by linearity. A k-chain σ ∈ C_k is called a cycle if ∂_k σ = 0. The set of k-cycles, denoted by Z_k, is ker ∂_k and forms a subgroup of C_k. That is,

Z_k := ker ∂_k

A k-chain σ ∈ C_k is called a boundary if there exists τ ∈ C_{k+1} such that ∂_{k+1} τ = σ. The set of k-boundaries, denoted by B_k, is the image of ∂_{k+1} and also a subgroup of C_k. That is,

B_k := im ∂_{k+1}

Further, we can check that ∂_k(∂_{k+1} τ) = 0 for any τ ∈ C_{k+1}, which implies that B_k is a subgroup of Z_k.
Observe that the boundary operator ∂_k maps a k-simplex to its (k − 1)-simplicial faces. Further, the sets of edges that form a closed loop are exactly what we denote by the group of 1-cycles. We are interested in finding holes in our domains, that is, cycles that cannot be obtained from boundaries of simplices in a given complex. This observation motivates the definition of the homology groups.
Definition 4.6. The k-th homology group is the quotient group H_k := Z_k/B_k. The homology of a complex is the collection of all homology groups.
The rank of H_k, denoted the kth betti number β_k, gives us a coarse measure of the number of holes. In particular, β_0 is the number of connected components and β_1 is the number of loops that enclose different “holes” in the complex.
4.2.2 Example
In Figure 4.2 we observe a collection of triangular shaped sets labeled from 1 to 5. The nerve complex is obtained by labeling the 0-simplices (i.e., the vertices) in the same way as the sets. The 1-simplices (i.e., the edges in the pictorial representation) correspond to pairwise intersections between the regions. The 2-simplex corresponds to the intersection between triangles 2, 4, and 5. For the group of 0-chains C_0, we can identify the simplices {[1], [2], [3], [4], [5]} with the column vectors {v_1, v_2, v_3, v_4, v_5}, where v_1 = [1, 0, 0, 0, 0]^T, and so on. For C_1, we identify {[1 2], [2 3], [2 4], [2 5], [3 5], [4 5]} with the column vectors {e_1, e_2, e_3, e_4, e_5, e_6}, where e_1 = [1, 0, 0, 0, 0, 0]^T, and so on. Similarly for C_2, we identify [2 4 5] with f_1 = 1. As mentioned before, ∂_k is the operator that maps a simplex in C_k to its boundary faces. For example, we have

∂_2 [2 4 5] = [4 5] − [2 5] + [2 4]   iff   ∂_2 f_1 = e_6 − e_4 + e_3,
∂_1 [2 4] = [4] − [2]   iff   ∂_1 e_3 = v_4 − v_2.

FIGURE 4.2 (a) Collection of sets. (b) Pictorial representation of the corresponding nerve complex. The complex is formed by the simplices: [1], [2], [3], [4], [5], [1 2], [2 3], [2 4], [2 5], [3 5], [4 5], [2 4 5]. Note that the actual point coordinates are irrelevant for our pictorial representation.
That is, ∂_k can be expressed in matrix form as

    ∂_1 = ⎡ −1   0   0   0   0   0 ⎤        ∂_2 = ⎡  0 ⎤
          ⎢  1  −1  −1  −1   0   0 ⎥              ⎢  0 ⎥
          ⎢  0   1   0   0  −1   0 ⎥              ⎢  1 ⎥
          ⎢  0   0   1   0   0  −1 ⎥              ⎢ −1 ⎥
          ⎣  0   0   0   1   1   1 ⎦              ⎢  0 ⎥
                                                   ⎣  1 ⎦

Since C_{−1} = 0,

H_0 = Z_0/B_0 = ker ∂_0 / im ∂_1 = C_0 / im ∂_1

We can verify that

β_0 = dim(H_0) = 1

Hence, we recover the fact that we have only one connected component in Figure 4.2. Similarly, we can verify that

β_1 = dim(H_1) = dim(Z_1/B_1) = dim(ker ∂_1 / im ∂_2) = 1

which tells us that the number of holes in our complex is 1. Also, H_k = 0 for k > 1 (since C_k = 0).
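For readers who want to reproduce this computation numerically, here is a minimal NumPy sketch (our own; the helper name betti is ours, and ranks are taken over the rationals, which coincides with the integer homology ranks here because there is no torsion):

    import numpy as np

    # Boundary matrices from the example above (columns index k-simplices, rows index
    # (k-1)-simplices).
    d1 = np.array([[-1,  0,  0,  0,  0,  0],
                   [ 1, -1, -1, -1,  0,  0],
                   [ 0,  1,  0,  0, -1,  0],
                   [ 0,  0,  1,  0,  0, -1],
                   [ 0,  0,  0,  1,  1,  1]])
    d2 = np.array([[0], [0], [1], [-1], [0], [1]])

    def betti(d_k, d_kp1):
        """beta_k = dim ker(d_k) - rank(d_{k+1})."""
        n_k = d_k.shape[1]                              # number of k-simplices
        return (n_k - np.linalg.matrix_rank(d_k)) - np.linalg.matrix_rank(d_kp1)

    d0 = np.zeros((1, 5))                               # d_0 : C_0 -> C_{-1} = 0
    print(betti(d0, d1), betti(d1, d2))                 # 1 1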
4.2.3 Čech Theorem
Now we introduce the Čech theorem, which has been used in the context of sensor networks with unit-disk coverage [6, 7] and has been proved in Bott and Tu [3]. Before we proceed, we require the following definitions:
Definition 4.7. A homotopy between two continuous functions f_0 : X → Y and f_1 : X → Y is a continuous 1-parameter family of continuous functions f_t : X → Y for t ∈ [0, 1] connecting f_0 to f_1.
Definition 4.8. Two spaces X and Y are said to be of the same homotopy type if there exist functions f : X → Y and g : Y → X, with g ∘ f homotopic to the identity map on X and f ∘ g homotopic to the identity map on Y.
Definition 4.9. A set X is contractible if the identity map on X is homotopic to a constant map.
In other words, two functions are homotopic if we can continuously deform one into the other. Also, a space is contractible if we can continuously deform it to a single point. It is known that homologies are an invariant of homotopy type; that is, two spaces with the same homotopy type will have the same homology groups.
Theorem 4.1 (Čech theorem). If the sets {S_i}_{i=1}^N (for some N > 0) and all non-empty finite intersections are contractible, then the union ⋃_{i=1}^N S_i has the homotopy type of the nerve complex.
That is, given that the required conditions are satisfied, the topological structure of the sets’ union is captured by the nerve. We observe that in Figure 4.2 all of the intersections are contractible. Therefore, we can conclude that the extracted nerve complex has the same homology as the space formed by the union of the triangular regions.
4.3 THE CAMERA AND THE ENVIRONMENT MODELS
We focus on a planar model and later describe how 3D scenarios can be simplified to planar cases. The following definitions are presented to formalize our discussion.
The Environment. The space under consideration is similar to the one depicted in Figure 4.1(a), where cameras are located in the plane and only sets with piecewise-linear boundaries are allowed (including objects and paths). We assume a finite number of objects in our environment.
Cameras. A camera object α is specified by its position o_α in the plane, a local coordinate frame Ψ_α with origin o_α, and an open convex domain D_α, referred to as the camera domain. As will become clear later, the camera domain D_α can be interpreted as the set of points visible from camera α when no objects occluding the field of view are present. The convexity of this set will be essential for some of the proofs. Some examples of camera domains are shown in Figure 4.3.
Definition 4.10. The subset of the plane occupied by the ith object, which is denoted by O_i, is a connected closed subset of the plane with a non-empty interior and a piecewise linear boundary. The collection {O_i}_{i=1}^{N_o}, where N_o < ∞ is the number of objects in the environment, will be referred to as the objects in the environment.
Definition 4.11. Given the objects {O_i}_{i=1}^{N_o}, a piecewise linear path Γ : [0, 1] → R² is said to be feasible if Γ([0, 1]) ∩ ⋃_i O_i = ∅.
Definition 4.12. Given a camera α, a point p ∈ R² is said to be visible from camera α if p ∈ D_α and o_α p ∩ ⋃_{i=1}^{N_o} O_i = ∅, where o_α p is the line between the camera location o_α and p. The set of visible points is called the coverage C_α of camera α.
FIGURE 4.3 Three examples of camera domains D_α: (a) interior of a convex cone in the plane; (b) camera location outside the domain; (c) camera location inside the domain. Note that these examples span projection model types from perspective cameras to omnidirectional cameras.
FIGURE 4.4 Example decompositions for the sets shown in Figure 4.3.
Definition 4.13. Given camera α with camera domain D_α and corresponding boundary ∂D_α, a line L_α is a bisecting line for the camera if
■ L_α goes through the camera location o_α.
■ There exists a feasible path Γ : [0, 1] → R² such that for any ε > 0 there exists a δ with 0 < δ < ε, Γ(0.5 − δ) ∈ C_α, Γ(0.5 + δ) ∉ C_α, Γ(0.5) ∈ L_α, and Γ(0.5) ∉ ∂D_α.
If we imagine a target traveling through the path Γ, we note that the last condition in the definition of a bisecting line identifies and recognizes occlusion events (i.e., the target transitions from visible to not visible or vice versa). However, we will ignore occlusion events due to the target leaving through the boundary of camera domain D_α.
Definition 4.14. Let {L_{α,i}}_{i=1}^{N_L} be a finite collection of bisecting lines for camera α. Consider the set of adjacent cones in the plane {K_{α,j}}_{j=1}^{N_C} bounded by these lines, where N_C = 2·N_L; then the decomposition of C_α by lines {L_{α,i}} is the collection of sets

C_{α,j} := K_{α,j} ∩ C_α

Note that the decomposition of C_α is not a partition since the sets C_{α,j} are not necessarily disjoint. Some examples of decompositions are shown in Figure 4.4.
4.4 THE CN-COMPLEX
We consider the following problem.
Problem 4.1. Given the camera and environment models, our goal is to obtain a simplicial representation that captures the topological structure of the camera network coverage (i.e., the union of the coverage of the cameras).
Observation 4.1. Note that the camera network coverage has the same homology as the domain (R² − ⋃_i O_i) if these two sets are homotopic (i.e., we can continuously deform one into the other).
Our goal is the construction of a simplicial complex that captures the homology of the union of camera coverages C_α. One possible approach is to obtain the nerve complex using the set of camera coverages {C_α}. However, this will only work for simple
FIGURE 4.5 Examples illustrating nerve complexes obtained using the collection of camera coverages {C_α}. Notice that cases (a) and (b) capture the right topological information, but case (c) (which involves an object in the environment) does not.
configurations without objects in the domain. An example illustrating our claim is shown in Figure 4.5. Figure 4.5(c) does not capture the topological structure of the coverage because the hypothesis of the Čech theorem is not satisfied (in particular, C_1 ∩ C_2 is not contractible). From the physical layout of the cameras and the objects in the environment, it is clear how we can divide C_1 to obtain contractible intersections. We are after a decomposition of the coverage that can be achieved without knowing the exact location of objects in the environment. The construction of the camera network complex (CN-complex) is based on identification of bisecting lines for the coverage of each individual camera. This construct captures the correct topological structure of the coverage of the network. Figure 4.6 displays examples of CN-complexes obtained after decomposing the coverage of each camera using its corresponding bisecting lines. The CN-complex captures the correct topological information, given that we satisfy the assumptions made for the model described in Section 4.3. The following theorem states this fact. Its proof can be found in Lobaton et al. [10].
Theorem 4.2 (Decomposition theorem). Let {C_α}_{α=1}^N be a collection of camera coverages where each C_α is connected and N is the number of cameras in the domain. Let {C_{α,k}}_{(α,k)∈A_D} be the collection of sets resulting from decomposing the coverage by all possible bisecting lines, where A_D is the set of indices in the decomposition. Then any finite intersection ⋂_{(α,k)∈A} C_{α,k}, where A ⊆ A_D is a finite set of indices, is contractible.
The hypothesis of the Čech theorem is thus satisfied if we have connected coverages that are decomposed by all of their bisecting lines. This implies that computing the homology of the CN-complex returns the appropriate topological information about the network coverage. Two steps are required to build the CN-complex:
1. Identify all bisecting lines and decompose each camera coverage.
2. Determine which of the resulting sets have intersections.
FIGURE 4.6 Two examples displaying how the CN-complex captures the correct topological information. (a) Camera 1 is decomposed into three regions, each of which becomes a different vertex in our complex. (b) Cameras 1 and 2 are both decomposed into three regions each, and the resulting complex captures the correct topological information.
The first step makes sure that any intersection will be contractible. The second allows us to find the simplices for our representation. These two steps can be completed in different ways that depend on the scenario under consideration. In the next section, we illustrate the construction of the CN -complex for a specific scenario.
4.5 RECOVERING TOPOLOGY: 2D CASE
First, a more restrictive 2D scenario is considered, for which simulations are provided. The Decomposition theorem from the previous section will guarantee that the correct topological information is captured. We consider a scenario similar to the one shown in Figure 4.1(b), in which a wireless camera network is deployed and no localization information is available. Each camera node is assumed to have certain computational capabilities and communicates wirelessly with the others. The following are the assumptions for this particular simulation:
The Environment in 2D. The objects in the environment have piecewise linear boundaries as described earlier. The locations of the objects are unknown, as are the locations and orientations of the cameras.
Cameras in 2D. A camera α has the following properties:
■ The domain D_α of a camera in 2D is the interior of a convex cone with field of view (FOV) θ_α < 180°. We use this model for simplicity in our simulations.
■ The local camera frame Ψ_α^2D is chosen such that the range of the FOV is [−θ_α/2, θ_α/2] when measured from the y-axis.
■ The camera projection Π_α^2D : D_α → R is given by

Π_α^2D(p) = p_x / p_y

where p is given in coordinate frame Ψ_α^2D. The image of this mapping, Π_α^2D(D_α), will be called the image domain Ω_α^2D.
The Target in 2D. A single-point target is considered in order to focus on the complex's construction without worrying about the target's correspondence/identification. We can certainly extend our work to multiple targets by using statistical methods that exploit feature matching and time correlation between detections. Various approaches include those described by a number of authors [5, 11, 12, 14, 15].
Synchronization. In a real network, network synchronization can be accomplished by message passing between multiple camera nodes.
Given these assumptions, the problem can be formulated as
Problem 4.2 (2D Case). Given the camera and environment models in 2D, our goal is to obtain a simplicial representation that captures the topological structure of the coverage of the camera network by using detections of a single target moving through the environment.
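A small Python sketch of this 2D camera model follows (our own illustration, not the simulation code): a point expressed in the camera frame lies in the domain if it is inside the open cone of half-angle θ_α/2 about the +y axis, and its image coordinate is x/y. Occlusion by environment objects is not modeled here.

    import math

    def in_domain(p, fov_deg):
        x, y = p
        return y > 0 and abs(math.degrees(math.atan2(x, y))) < fov_deg / 2.0

    def project_2d(p):
        x, y = p
        return x / y

    p = (0.5, 2.0)
    if in_domain(p, fov_deg=60.0):
        print(project_2d(p))     # 0.25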
Throughout our simulations we will have the target moving continuously through the environment. At each time step the cameras compute their detections of the target and use their observations to detect bisecting lines. Observations at the regions obtained after decomposition using the bisecting lines are stored. The observations are then combined to determine intersections between the regions, which become simplices in the CN-complex.
4.5.1 Algorithms
The first step in the construction of the CN-complex is the local processing of the images captured at a single camera node. That is, a single camera node needs to find all of its bisecting lines and store a history of the observations made at the sets resulting from coverage decomposition. We note that we can identify the boundary of the camera domain D_α with the boundary of our image domain Ω_α^2D, and we can identify bisecting lines in the plane with points in the image domain where occlusion of a target is observed. Hence, decomposition of the coverage corresponds to decomposition of the image domain. One possible implementation of the local processing in a camera node, written in MATLAB notation, is Algorithm 4.1. We now describe some of the variables found in the algorithm.
regions This variable stores the observations made at the different partitions of the image domain.
im This is an image of the current camera view.
t This is the time at which the image was captured.
curr_det This is a mask indicating the current detection of a target in an image.
prev_det This is a mask indicating the previous detection of a target in an image. Previous and current detections are necessary to identify occlusion events.
Algorithm 4.1:
    function [regions, bisect_lines] = getLocal()
        flag = 0;
        regions = getInitialDomain();
        bisect_lines = [];
        while continueBuilding()
            [im, t] = getImage();
            curr_det = getDetection(im);
            if flag == 0
                flag = 1;
                prev_det = curr_det;
            end
            line_found = getBisectLine(bisect_lines, curr_det, prev_det, im);
            if ~isempty(line_found)
                bisect_lines{end+1} = line_found;
                regions = decomposeDomain(regions, line_found);
            end
            regions = processDetections(regions, curr_det, t);
            prev_det = curr_det;
        end
line_found This holds the parameters of a bisecting line detected at the current step (empty if none was found).
bisect_lines This is a list of the parameters of all bisecting lines found so far.
The description of the functions used is as follows.
regions = getInitialDomain() This function returns an object corresponding to the original image domain (without any partitions).
t = getTime() This function returns the current system time t.
val = continueBuilding() This function stops the construction of the complex. Operations continue while the function return is 1 and stop when the return is 0. This function can depend on the system time or some other events in the system.
[im, t] = getImage() This function returns an image based on the view from the camera. It also returns the time at which that image was taken.
mask = getDetection(im) This function returns a mask specifying where a target was detected in an image. In a real implementation where im is an actual image from a physical camera, this can be accomplished through simple background subtraction.
line_found = getBisectLine(bisect_lines, curr_det, prev_det, im) This function computes and returns the parameters describing any bisecting line found from the current detection mask curr_det, the previous detection mask prev_det, and image im. If an occlusion event is detected (i.e., something that was visible is no longer visible or vice versa), the parameters for a bisecting line are computed. If the computed line does not belong to any of the already existing lines in bisect_lines, we return the parameters.
regions = decomposeDomain(regions0, line_found) This function returns a new image domain decomposition, regions, after the decomposition regions0 has been further refined using the newly found bisecting line, line_found.
regions = processDetections(regions0, curr_det, t) This function uses the current camera coverage decomposition regions0 and stores detections for each of the partitions corresponding to the current detection mask curr_det at time t.
Local observations from each camera node need to be combined to generate the CN-complex. When building the complex, each of the partitioned regions becomes a node of the complex, and simplices between nodes need to be found. A simplex between regions is found if the regions detect a target concurrently. An implementation of this process is shown next.
Algorithm 4.2:
    function CNComplex = combineLocal(t0, tf, regArray)
        CNComplex = {};
        for t = t0 : tf
            CNComplex{end+1} = getSimplex(regArray, t);
        end
Following are some of the variables in the algorithm.
t0 Initial time used to search for concurrent detections between decomposed camera coverages.
tf Final time used to search for concurrent detections between decomposed camera coverages.
regArray An array containing observations from the decomposed coverage of each camera.
The main function used for creating our complex is the following:
s = getSimplex(regArray, t) This function returns a list of the indices corresponding to the decomposed regions that have a concurrent detection at time t.
The end result is a list of simplices that form the CN-complex. The topological structure of this complex can be analyzed using software packages such as PLEX [1].
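A Python sketch of the same combination step (our own simplification of combineLocal/getSimplex, with data structures chosen for brevity): each decomposed region reports the times at which it detected the target, and regions with a concurrent detection form a simplex of the CN-complex.

    def combine_local(t0, tf, detection_times):
        """detection_times maps a region label (e.g., '3b') to a set of timestamps."""
        simplices = set()
        for t in range(t0, tf + 1):
            s = tuple(sorted(r for r, times in detection_times.items() if t in times))
            if s:
                simplices.add(s)
        return simplices

    logs = {"1a": {0, 1}, "1b": {1, 2}, "2a": {2, 3}}
    print(combine_local(0, 3, logs))
    # {('1a',), ('1a', '1b'), ('1b', '2a'), ('2a',)}  (a set, so print order may vary)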
4.5.2 Simulation in 2D
As mentioned before, the topology of the environment can be characterized in terms of its homology. In particular, we will use the betti numbers β_0 and β_1: β_0 tells us the number of connected components in the coverage; β_1 gives the number of holes. The PLEX software package [1] computes the homologies and corresponding betti numbers. Figure 4.7(a) shows a three-camera layout with two objects in the cameras' fields of view. In this case, we observe three bisecting lines for camera 1, two for camera 2, and four for camera 3. The coverage C_3 is decomposed into five regions, namely {C_{3,a}, C_{3,b}, C_{3,c}, C_{3,d}, C_{3,e}}. Table 4.1 lists the simplices obtained by our algorithm.
FIGURE 4.7 (a) Example of a three-camera layout with two occluding objects. Only the coverage of camera 3 is shown. (b) Example of a four-camera layout in a circular hallway configuration. Only the coverage of camera 1 is shown in this case. The dashed lines represent the bisecting lines for the displayed coverage. The dotted curves represent the path followed by the target during the simulation.
The homology computations returned betti numbers β_0 = 1 and β_1 = 2. This agrees with having a single connected component for the network coverage and two objects inside the coverage of the cameras. In Figure 4.7(b) we observe similar results for a configuration that can be interpreted as a hallway in a building. There is a single bisecting line for all cameras. Our algebraic analysis returns β_0 = 1 and β_1 = 1, the latter identifying a single hole corresponding to the loop formed by the hallway structure. Table 4.2 lists simplices recovered by our algorithm.
Table 4.1 Maximal Simplices Generated for the Figure 4.7(a) Example
[1a 1b 1c 1d], [2a 2b 2c], [3a 3b 3c 3d 3e], [1a 1b 2c 3c], [1d 2a 3c], [2a 2b 3a], [1a 2b 2c 3a], [1a 2c 3a 3b], [1a 2c 3b 3c], [1a 2c 3c 3d], [1a 1b 1c 2c 3d 3e], [1c 1d 3e], [1d 2a 3e]

Table 4.2 Maximal Simplices Generated for the Figure 4.7(b) Example
[3b 4a 4b], [2b 3a 3b], [1b 2a 2b], [1a 1b 4b]
4.6 RECOVERING TOPOLOGY: 2.5D CASE
In this section we generalize our discussion to a 3D domain with very specific constraints on the objects in the environment. Our problem is defined as the detection of a target moving through an environment. Let us start by describing our setup.
The Environment in 2.5D: We consider a domain in 3D with the following constraints:
■ All objects and cameras in the environment are within the space defined by the planes z = 0 (the floor) and z = h_max (the ceiling).
■ Objects in the environment consist of walls erected perpendicular to our plane from z = 0 to z = h_max. The perpendicular projection of the objects onto the plane z = 0 must have a piecewise linear boundary.
Cameras in 2.5D: A camera α has the following properties:
■ It is located at position o_α^3D with an arbitrary 3D orientation and a local 3D coordinate frame Ψ_α^3D.
■ Its camera projection in 3D, Π_α^3D : F_α → R², is given by

Π_α^3D(p) = (p_x / p_z, p_y / p_z),

where p is given in coordinate frame Ψ_α^3D, and F_α ⊂ {(x, y, z) | z > 0}, the camera's FOV, is an open convex set such that its closure is a convex cone based at o_α^3D. The image of this mapping, Π_α^3D(F_α), is called the image domain Ω_α^3D.
The Target in 2.5D: A target has the following properties:
■ It is a line segment perpendicular to the bounding planes of our domain which connects the point (x, y, 0) to (x, y, h_target), where x and y are arbitrary and h_target ≤ h_max is the target height. The target is free to move along the domain as long as it does not intersect any of the objects in the environment.
■ It is detected by camera α if there exists a point p := (x, y, z) in the target such that p ∈ F_α and o_α^3D p does not intersect any of the objects in the environment.
Figure 4.8 shows a target and a camera with its corresponding FOV. For the 2D case, we desired recovery of the camera network coverage. This problem is decoupled from the existence of a target. The target is introduced into the setup as a tool for constructing the simplicial representation. We could have set up our problem as recovering topological information about the space in which a target is visible from any of the cameras. This alternative interpretation turns out to be more useful for our 2.5D formulation, particularly if we are interested in detecting a target even if the detection is only partial. The problem can thus be formulated as
Problem 4.3 (2.5D case). Given the camera and environment models in 2.5D, our goal is to obtain a simplicial representation that captures the topological structure of the detectable set for a camera network (i.e., the union of sets in which a target is detectable by a camera) by using detection of a single target moving through the environment.
FIGURE 4.8 Mapping from 2.5D to 2D. (a), (b) A camera and its FOV are shown from multiple perspectives. (c) The corresponding mapping of this configuration to 2D. For the 2.5D configuration, the planes displayed bound the space that can be occupied by the target. The target is a vertical segment in the 2.5D case and a point in the 2D case.
4.6.1 Mapping from 2.5D to 2D
In this section the structure of the detectable set for a camera network will become clear through a conversion of our 2.5D problem into a 2D problem. Since the target is constrained to move along the floor plane, it is possible to do this. In particular:
■ We can map cameras located at locations (x, y, z) to location (x, y) in the plane.
■ We can map objects in our 2.5D domain to objects with piecewise linear boundaries in the plane.
■ We can also map the FOV of a camera and the domain D_α of a camera in 2D. A point (x, y) in the plane is in D_α if the target located at that point intersects the FOV F_α. The set D_α is the orthogonal projection (onto the xy plane) of the intersection between F_α and the space between z ≥ 0 and z ≤ h_target. Since the latter is an intersection of convex sets, and orthogonal projections preserve convexity, D_α is convex. We can also check that D_α is open. Hence, this description of D_α is consistent with our definition in Section 4.3.
■ A point (x, y) is in the coverage C_α of camera α if the target located at (x, y) is detectable by the camera. It is also easy to check that this description of C_α will be consistent with our previous definition.
Once we make these identifications, we can use the same tools as in the 2D case to build a simplicial complex. The Decomposition theorem from Section 4.4 guarantees that we capture the correct topological information.
4.6.2 Building the CN-Complex
As in Section 4.5, we can build the CN-complex by decomposing each camera coverage using its bisecting lines and determining which of the resulting sets intersect. However, a physical camera only has access to observations available in its image domain Ω_α^3D.
Therefore, it is essential to determine how to find bisecting lines using image domain information. We note that occlusion events occur when the target leaves the coverage C_α of camera α along the boundary of camera domain D_α or along a bisecting line. Whenever a target leaves C_α through the boundary of the domain D_α, we observe the target disappear through the boundary of the image domain Ω_α^3D. If the target leaves C_α through one of the bisecting lines, we observe an occlusion event in the interior of Ω_α^3D. Note that bisecting lines in the 2D domain correspond to vertical planes in the 2.5D configuration, whose intersections with the camera's FOV map to lines in Ω_α^3D. Hence, all that is required is to find the line segment in which an occlusion event takes place in the image domain. From an engineering point of view, this can be done with some simple image processing to find the edge along which the target disappears/appears in an image. The result is a decomposition of the image domain Ω_α^3D that corresponds to a decomposition of camera coverage C_α. We also emphasize that this computation can be done locally at a camera node without any need to transmit information. The problem of finding intersections of sets corresponds to having concurrent detections at the corresponding cameras for a single target in the environment. As mentioned before, overlap between these regions can be found for the multiple-target case by using approaches such as the ones outlined by various authors (e.g., [5, 11, 12, 14, 15]).
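As a crude heuristic sketch of this image-processing step (our own illustration, not the authors' implementation): the pixels that belonged to the target detection in the previous mask but not in the current one cluster along the occluding edge at the moment of a partial occlusion, and a line can be fit to them by total least squares.

    import numpy as np

    def occlusion_line(prev_det, curr_det):
        """prev_det, curr_det: boolean masks. Returns (centroid, direction) or None."""
        ys, xs = np.nonzero(prev_det & ~curr_det)
        if len(xs) < 2:
            return None
        pts = np.column_stack([xs, ys]).astype(float)
        center = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - center)
        return center, vt[0]        # vt[0] is the dominant direction of the pixel cloud

    prev_det = np.zeros((8, 8), dtype=bool); prev_det[2:6, 3:5] = True
    curr_det = prev_det.copy(); curr_det[:, 4] = False    # right column becomes occluded
    print(occlusion_line(prev_det, curr_det))             # vertical edge near x = 4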
4.6.3 Experimentation
To demonstrate how the mathematical tools described in the previous sections can be applied to a real wireless sensor network, we devised an experiment that tracks a robot in a simple maze (Figure 4.9 shows the layout used). We placed a sensor network consisting of CITRIC camera motes [4] at several locations in our maze and let a robot navigate through the environment. The CN-complex was built for this particular coverage and used for tracking in this representation. The algorithms for building this complex are precisely the ones used for the simulation in the 2D case in Section 4.5.1. The homologies were computed using the PLEX software package [1]. Time synchronization is required to determine overlaps between the different camera regions. This was accomplished by having all of the camera nodes share time information with one another. The model was constructed so that each camera node performed local computations, first looking for bisecting lines (as shown in Figure 4.10) to decompose its coverage. The detections of the target in the corresponding regions were stored over time. Our implementation detected the occlusion lines by looking for occlusion events over time. If an event did not correspond to an occlusion along the boundary of the image domain, we estimated an occlusion line. Note that the information extracted from each camera node was just a decomposition of the image domain with a list of times at which detections were made. The communication requirements were minimal because of this data reduction. The complex was built by combining all local information from the camera nodes. Each camera node transmitted the history of its detections wirelessly to a central computer, which created the CN-complex by following the steps outlined in Section 4.5.1.
FIGURE 4.9 Layout used for our experiment: (a) Diagram showing the location of the different cameras; (b) Photograph of our simple maze; (c) Photograph of the CITRIC camera motes used for our experiment.
FIGURE 4.10 (a) View of camera 5 from the layout in Figure 4.9 before any bisecting lines are found. (b) Same view after a bisecting black line has been found.
FIGURE 4.11 Recovered CN-complex for the layout in Figure 4.9.
resulting complex contained the maximal simplices: [1a 1b 4b], [1b 2a 2b], [2b 3a 3b], [3b 4a 4b 5b], [3b 5b 6], and [5a 5b 6]. The complex is shown in Figure 4.11. As mentioned earlier, this representation can be used for tracking and navigation without actual metric reconstruction of the environment. Figure 4.12 shows a set of recorded paths for our robot. By determining which simplices were visited by the robot's path we could extract a path in the complex as shown by the dashed path in the figure. The main advantage of this representation is that the path in the complex gives a global view of the trajectory while local information can be extracted from single camera views. It is possible to identify homotopic paths in the simplicial representation (i.e., those that can be continuously deformed into one another). The tools required for these computations are already available to us from Section 4.2. In particular, by taking two paths that start and end at the same locations to form a loop, we can verify that they are homotopic if they are the boundary of some combination of simplices. Equivalently, since a closed loop is just a collection of edges in C_1, we need to check whether the loop is in B_1 (i.e., in the range of ∂_2). This is just a simple algebraic computation. By putting the top and middle paths from Figure 4.12 together we see that the resulting loop is not in the range of ∂_2 (i.e., the paths are not homotopic). On the other hand, the top and bottom paths can be easily determined to be homotopic. Similarly, for coordinate-free navigation purposes, this representation can determine the number of distinct paths from one location to another. It is also possible to find paths in the CN-complex and use local information from each camera to generate a physical path in the environment.
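The path-comparison test described above reduces to linear algebra: form the loop's edge vector in C_1 and check whether it lies in the column space of the boundary map ∂_2. The sketch below is not the authors' PLEX-based implementation; it assumes Z/2 coefficients and that the boundary matrix (edges × 2-simplices) and the loop's edge-incidence vector are already available as 0/1 arrays.

```python
import numpy as np

def rank_gf2(matrix):
    """Rank of a 0/1 matrix over GF(2) via Gaussian elimination."""
    m = matrix.astype(np.int64) % 2
    rank, rows, cols = 0, m.shape[0], m.shape[1]
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if m[r, c]), None)
        if pivot is None:
            continue
        m[[rank, pivot]] = m[[pivot, rank]]      # move the pivot row up
        for r in range(rows):
            if r != rank and m[r, c]:
                m[r] ^= m[rank]                  # clear column c elsewhere
        rank += 1
    return rank

def loop_bounds(boundary_2, loop_edges):
    """True if the closed loop (an edge vector in C_1) lies in B_1, the range
    of the boundary map, i.e., the two paths forming the loop are equivalent."""
    augmented = np.column_stack([boundary_2, loop_edges])
    return rank_gf2(boundary_2) == rank_gf2(augmented)
```

Only integer arithmetic is involved, which is consistent with the remark in Section 4.7 that the algebraic operations need only integer computations.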
FIGURE 4.12 (a) Several paths for our robot in the maze (dark dashed path); (b) Corresponding mapping to the CN-complex (dark dashed path). These paths can be easily compared using the algebraic topological tools covered in Section 4.2. The problem turns out to be one of simple linear algebraic computation.
4.7 CONCLUSIONS In this chapter a combinatorial representation of a camera network was obtained through the use of simplices. This representation, which we call the CN-complex, is guaranteed to capture the appropriate topological information of a camera network's coverage under some basic assumptions. Algebraic tools were introduced in order to understand and prove the validity of our representation. The CN-complex requires discovery of some relationships between camera coverages (i.e., whether or not the coverages of a finite subset of cameras intersect). There are many ways of discovering these relationships, but we opted for observations of a single target moving through the environment. We showed simulations for 2D scenarios and discussed an experimental setup for a 2.5D configuration. Our experiment with distributed wireless camera nodes illustrates how this simplicial representation can be used to track and compare paths in a wireless camera network without metric calibration information. The identification of equivalent paths is converted into an algebraic operation. The results achieved can be extended to coordinate-free navigation, where our representation can give an overall view of how to arrive at a specific location, and the transitions between simplicial regions can be accomplished in the physical space by local visual feedback from single-camera views. This representation allows for local processing at each node and minimal wireless communication. A list of times at which occlusion events are observed is all that needs to be transmitted. Also, only integer operations are required to perform the algebraic operations described in this chapter [9], which opens the door to a potential distributed implementation on platforms with low computational power. Acknowledgments. The project for this chapter was funded by the Army Research Office (ARO) Multidisciplinary Research Initiative (MURI) program, under the title "Heterogeneous Sensor Webs for Automated Target Recognition and Tracking in Urban Terrain" (W911NF-06-1-0076), and the Air Force Office of Scientific Research (AFOSR), grant FA9550-06-1-0267, under a subaward from Vanderbilt University.
REFERENCES
[1] PLEX: A system for computational homology; http://comptop.stanford.edu/programs/plex/index.html, October 2008.
[2] M. Armstrong, Basic Topology, Springer, 1997.
[3] R. Bott, L. Tu, Differential Forms in Algebraic Topology, Springer, 1995.
[4] P.W.-C. Chen, P. Ahammad, C. Boyer, et al., CITRIC: A low-bandwidth wireless camera network platform, in: Proceedings of the Third ACM/IEEE International Conference on Distributed Smart Cameras, 2008.
[5] Z. Cheng, D. Devarajan, R. Radke, Determining vision graphs for distributed camera networks using feature digests, EURASIP Journal on Advances in Signal Processing, 2007.
[6] V. de Silva, R. Ghrist, Homological sensor networks, Notices of the AMS, 54 (1) (2007) 10–17.
[7] R. Ghrist, A. Muhammad, Coverage and hole-detection in sensor networks via homology, in: Proceedings of the Fourth International Symposium on Information Processing in Sensor Networks, 2005.
[8] A. Hatcher, Algebraic Topology, Cambridge University Press, 2002.
[9] T. Kaczynski, K. Mischaikow, M. Mrozek, Computational Homology, Springer, 2003.
[10] E.J. Lobaton, P. Ahammad, S.S. Sastry, Algebraic approach for recovering topology in distributed camera networks, in: Proceedings of the Eighth International Conference on Information Processing in Sensor Networks, 2009.
[11] D. Marinakis, G. Dudek, Topology inference for a vision-based sensor network, in: Proceedings of the Second Canadian Conference on Computer and Robot Vision, 2005.
[12] D. Marinakis, P. Giguere, G. Dudek, Learning network topology from simple sensor data, in: Proceedings of the Twentieth Canadian Conference on Artificial Intelligence, 2007.
[13] J. Munkres, Topology, second edition, Prentice Hall, 2000.
[14] C. Yeo, P. Ahammad, K. Ramchandran, A rate-efficient approach for establishing visual correspondences via distributed source coding, in: Proceedings of Visual Communications and Image Processing, 2008.
[15] X. Zou, B. Bhanu, B. Song, A. Roy-Chowdhury, Determining topology in a distributed camera network, in: IEEE International Conference on Image Processing, 2007.
CHAPTER 5
Optimal Placement of Multiple Visual Sensors
Eva Hörster, Rainer Lienhart
Multimedia Computing Lab, University of Augsburg, Augsburg, Germany
Abstract In this chapter we will focus on the optimal placement of visual sensors such that a predefined area is maximally or completely covered at a certain resolution. We consider different tasks in this context and describe two types of approaches to determine appropriate solutions. First we propose algorithms that give a globally optimal solution. Then we present heuristics that solve the problem within a reasonable time and with moderate memory consumption, but at the cost of not necessarily determining the global optimum. Keywords: sensor placement, pose optimization, multi-camera networks
5.1 INTRODUCTION Many novel multimedia applications, such as video surveillance, sensing rooms, assisted living, and immersive conference rooms, use multi-camera networks consisting of multiple visual sensors. Most require video sensor configurations that ensure coverage of a predefined space inside a specific area with a minimum level of imaging quality such as image resolution. This requirement is crucial for enabling many of the applications just mentioned using multi-camera networks. Thus, an important issue in designing visual sensor arrays is appropriate camera placement. Currently most designers of multi-camera systems place cameras by hand. As the number of sensors in such systems grows, the development of automatic camera placement strategies becomes more and more important. In this chapter we will focus on the optimal placement of visual sensors such that a predefined area or space is completely or maximally covered at a certain resolution. Coverage is defined with respect to a predefined "sampling rate" guaranteeing that an object in the space will be imaged at a minimum resolution (see Section 5.2 for a precise definition). A typical scenario is shown in Figure 5.1. The space to be covered consists of all white regions, with black regions marking background and/or obstacles. Automatically computed camera positions are marked with black circles and their respective fields of view (FOVs) with triangles.
FIGURE 5.1 Optimal placement of four cameras to maximize coverage of a given space. Cameras are marked as black points; triangles indicate their FOV.
We will consider various facets of the camera placement problem given a predefined space. One problem we aim to solve is maximizing coverage of the space given a number of cameras. Often several different camera types are available. They differ in FOV size, intrinsic parameters, image sensor resolution, optics, and cost. Therefore, our second and third tasks are to minimize the cost of a visual sensor array while maintaining a minimally required percentage of coverage of the given space, and to maximize coverage given a maximum price for the camera array. In some situations cameras have already been installed (e.g., at an airport or in a casino). The positions of the multiple cameras in some coordinate system can be determined automatically by camera calibration [13]. Given the fixed initial locations and camera types, the fourth problem we address is to determine the respective optimal poses of already installed cameras with respect to coverage while maintaining the required resolution. We will describe various algorithms to solve these four problems. They can be subdivided into (1) algorithms that determine the global optimum solution, and (2) heuristics that solve the problem within reasonable time and memory limits but do not necessarily determine the global optimum. A careful evaluation and a comparative study of the different approaches show their respective advantages. We developed a user interface to conveniently enter and edit the layout of spaces and the setup parameters of the respective optimization problems. This user interface runs as a Web service (see reference [1]).
5.1.1 Related Work Several works address the optimal placement of multiple cameras in various contexts. Erdem and Sclaroff [9] proposed an algorithm based on a binary optimization technique to solve four instances of the camera placement problem. They considered polygonal spaces, which they represented as occupancy grids. Furthermore, pan-tilt-zoom (PTZ) and omnidirectional cameras were considered in their work, where PTZ cameras can adjust their pose and/or scan the space during runtime.
Mittal and Davis [15] presented a method for sensor planning in dynamic scenes. They analyzed the visibility from static sensors at sampled points statistically and solved the problem by simulated annealing. Chen [5] studied camera placement for robust motion capture. A quality metric was constructed taking into account resolution and occlusion. This metric was combined with an evolutionary algorithm–based optimizer and used to automate the camera placement process. It should be noted that the outcome of the algorithm is not necessarily the global optimum. Chen and Davis [6] presented a quality metric for judging the placement of multiple cameras in the context of tracking which takes dynamic occlusions into account. Nevertheless, the authors did not combine their metric with an optimization technique to perform automatic camera placement. Optimizing the aggregated observability of a distribution of target motions by an appropriate positioning of a predefined number of cameras was considered by Bodor et al. [3]. In the context of multimedia surveillance systems the placement problem was addressed by Ram et al. [17]. They proposed a performance metric for placement of a number of sensors which takes the orientation of the observed object into account. The optimal configuration is found by choosing the placement that gives the best performance. Zhao and Cheung [23] also considered camera placement for a surveillance application, specifically for visual tagging. They proposed a visibility model that captures the orientation of the observed objects. The space and possible camera positions are represented as a grid, and binary linear programming is adopted for configuration optimization, thereby computing the minimum number of cameras necessary to cover the area. Instead of using a fixed number of grid points, the density of the grid points and possible camera configurations is gradually increased during the search. Takahashi et al. considered the task of planning a multi-camera arrangement for object recognition [20]. They proposed a quality score for an arrangement of cameras based on distances between manifolds. The optimal multi-camera arrangement is determined by taking the maximum quality score over all possible placements. The placement of multiple cameras for object localization was studied by Ercan et al. [8]. Given a number of camera sensors to be placed around a room's perimeter, they proposed an approach for computing the optimal camera positions to accurately localize a point object in the room. Olague and Mohr [16] approached the problem of optimal camera placement for accurate reconstruction using a multicellular genetic algorithm. With the aim of accurate 3D object reconstruction, Dunn et al. [7] proposed a camera network design methodology based on the Parisian approach to evolutionary computation. This approach is effective in speeding up computation. Note that the authors restricted their methodology to a viewing sphere model for camera placement. Some work was carried out in the area of (grid) coverage and sensor deployment, with sensors detecting events that occur within a distance r (the sensing range of the sensor)—see Chakrabarty et al. [4], Sahni and Xu [19], Wang and Zhong [21], and Zou and Chakrabarty [24]. The visual sensor model used in the approach presented in this chapter was proposed in work on two-dimensional (Hörster and Lienhart [12]) and three-dimensional (Hörster and Lienhart [13]) cases.
In those works we used a linear programming (LP) approach to determine the minimum cost of a sensor array that covers a given space completely,
and an LP approach that determines, for fixed sensor arrays, their optimal pan and tilt with respect to coverage. In both approaches space was represented as a regular grid. Our linear programming work is partly based on the approach presented in Chakrabarty et al. [4], but differs in the sensor and space model (e.g., cameras do not possess circular sensing ranges) as well as in the cost function and some constraints. The approach in Erdem and Sclaroff [9] is in some parts also similar to the binary programming algorithm considered in this work; however, we allow spaces to have arbitrary shape, and the points representing space are distributed according to a user-defined importance distribution. Another difference from Erdem and Sclaroff [9] is that, except for one problem, different instances of the placement problem are considered. A problem with some of the research just mentioned is that it is not clear how large spaces and camera arrays consisting of many cameras are handled (e.g., [17, 20]). In Chakrabarty et al. [4] the authors proposed a divide-and-conquer scheme to approximate the optimum LP solution to the placement problem for large spaces. Thus, the space, which in their case is a regular grid, is divided into a number of small subgrids, and the optimal solution to each subgrid is merged into an approximate solution for the original grid. While one can apply this scheme easily to rectangular grid-sensing fields, dividing the space properly to yield a good approximate solution to the original problem can be difficult for nonregular control point sets and nonrectangular space boundaries, especially in the presence of static obstacles. Finally, it should be mentioned that the visual sensor placement problem is closely related to the guard placement problem or art gallery problem (AGP)—that is, determining the minimum number of guards required to cover the interior of an art gallery. O'Rourke addressed it with the art gallery theorem [18]. However, some of the assumptions made in the AGP do not hold for the camera placement problem considered here. In the AGP all guards are assumed to have similar capabilities, whereas in this work we also consider cameras with different FOVs at different cost levels. Additionally, in our sensor model the FOV is restricted by resolution and sensor properties.
5.1.2 Organization The chapter is organized as follows. In Section 5.2, the problem is stated and basic definitions are given. Section 5.3 presents different approaches to solving the problem. In Section 5.4 we evaluate the approaches experimentally, and possible extensions are discussed in Section 5.5. Section 5.6 summarizes and concludes the chapter.
5.2 PROBLEM FORMULATION In this section we will first give some basic definitions needed to formulate the considered problems accurately. Then, in Section 5.2.2, we state the considered problems, give a precise definition of a camera's FOV, and explain how space is modeled.
5.2.1 Definitions In the following, the term space denotes a physical two- or three-dimensional room. A point in that space is covered if it is captured with a required minimal resolution. The minimal resolution is specified by a sampling frequency f_s, and it is satisfied if the point in
space is imaged by at least one pixel of a camera that does not aggregate more than x cm² of a surface parallel to the imaging plane through that point. Expressed in terms of f_s, x is converted into the FOV of a camera. A camera's FOV is defined as the area or volume, respectively, in which a pixel aggregates no more than 1/f_s² cm² of a surface parallel to the imaging plane. Thus, an object that appears in the camera's FOV is imaged with at least this resolution, assuming the object has a planar surface orthogonal to the optical axis. Clearly the resolution is smaller if the surface is not orthogonal. How to account for this case is discussed in Section 5.5. This definition results in a description of the FOV of a camera by a triangle, where the triangle's parameters a and d (see Figure 5.2) can be derived by taking the intrinsic camera parameters and the required minimal resolution (given in terms of the sampling frequency f_s) into account. The position (x, y) or (x, y, z) of a camera describes the location in the two- or three-dimensional space, respectively, while the pose specifies only the bearing or the three angular orientations pitch, roll, and yaw, respectively. Note that this definition differs from the more common notion that a pose consists of position and orientation. We also consider static occlusions. To simplify derivation and evaluation, only the 2D problem is discussed. However, the presented approaches can be extended easily to the third dimension (see Section 5.5).
5.2.2 Problem Statements There are many problems that one can consider regarding the placement of multiple visual sensors. Here we focus on only four, but problem variations may be solved in a similar fashion. Given a space and a demanded sampling frequency f_s, we are interested in the following problems:
■ Problem 5.1. Given a certain number of cameras of some type and their specific parameters, determine their positions and poses in space such that coverage is maximized.
■ Problem 5.2. Given several types of cameras, their parameters, and specific costs, as well as the maximum total price of the visual sensor array, determine the camera types and their respective positions/poses that maximize coverage of the given space.
■ Problem 5.3. Given the fixed positions and respective types of a number of cameras, determine their optimal poses with respect to maximizing coverage.
■ Problem 5.4. Given a minimally required percentage of coverage, determine the camera array with minimum cost that satisfies this coverage constraint. In the case that only one camera type is available, the minimum number of cameras of that type meeting this coverage constraint is determined.
5.2.3 Modeling a Camera's Field of View We use a simple model for our cameras: The FOV is described by a triangle as shown in Figure 5.2(a). The parameters of this triangle are calculated using well-known geometric relations given the (intrinsic) camera parameters and the sampling frequency f_s.
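As a concrete illustration, the following sketch derives d and a for an idealized pinhole camera with square pixels; the specific relations (focal length expressed in pixels, f_s expressed in pixels per centimeter) are our own simplifying assumptions rather than the chapter's exact derivation.

```python
import math

def fov_triangle(focal_px, image_width_px, fs_px_per_cm):
    """FOV triangle parameters (d, a) in centimeters for a pinhole camera.

    At depth d a single pixel spans exactly 1/fs_px_per_cm cm, i.e., it
    aggregates 1/fs^2 cm^2, the resolution bound of Section 5.2.1.
    """
    d = focal_px / fs_px_per_cm                      # farthest admissible depth
    half_angle = math.atan2(0.5 * image_width_px, focal_px)
    a = 2.0 * d * math.tan(half_angle)               # triangle base at depth d
    return d, a
```

For example, with focal_px = 800, image_width_px = 640, and f_s = 10 pixels/cm, this sketch yields d = 80 cm and a = 64 cm.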
FIGURE 5.2 Deriving the model of a camera’s FOV: (a) FOV; (b) translated FOV; (c) rotated FOV.
Defining the FOV by a triangle enables us to describe the area covered by a camera at position (c_x, c_y) and with pose φ by three linear constraints. These constraints are derived as follows: A camera's FOV is first translated to the origin of the coordinate system (Figure 5.2(b)):

x′ = x − c_x,   y′ = y − c_y   (5.1)

We then rotate the FOV such that its optical axis becomes parallel to the x-axis (Figure 5.2(c)):

x″ = cos(φ) · x′ + sin(φ) · y′   (5.2)
y″ = −sin(φ) · x′ + cos(φ) · y′   (5.3)

The resulting area covered by the triangle (Figure 5.2(c)) can now be described by three line constraints l1, l2, l3:

l1: x″ ≤ d   (5.4)
l2: y″ ≤ (a / 2d) · x″   (5.5)
l3: y″ ≥ −(a / 2d) · x″   (5.6)

By substitution, the following three linear constraints define the area covered by the FOV:

cos(φ) · (x − c_x) + sin(φ) · (y − c_y) ≤ d   (5.7)

−sin(φ) · (x − c_x) + cos(φ) · (y − c_y) ≤ (a / 2d) · (cos(φ) · (x − c_x) + sin(φ) · (y − c_y))   (5.8)

−sin(φ) · (x − c_x) + cos(φ) · (y − c_y) ≥ −(a / 2d) · (cos(φ) · (x − c_x) + sin(φ) · (y − c_y))   (5.9)
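Constraints (5.7) through (5.9) translate directly into a point-in-FOV test. The helper below is a minimal sketch of that test only; the obstacle-occlusion check that Section 5.2.4 also requires must be performed separately.

```python
import math

def covers(cx, cy, phi, d, a, x, y):
    """Check constraints (5.7)-(5.9): is control point (x, y) inside the FOV
    triangle of a camera at (cx, cy) with pose angle phi (radians)?"""
    u = math.cos(phi) * (x - cx) + math.sin(phi) * (y - cy)    # depth along the optical axis
    v = -math.sin(phi) * (x - cx) + math.cos(phi) * (y - cy)   # lateral offset
    return u <= d and abs(v) <= (a / (2.0 * d)) * u
```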
5.2.4 Modeling Space In the ideal case camera positions and poses are continuous in space; that is, c_x, c_y, and φ are continuous variables. We approximate the ideal continuous case by quantizing the positions and poses, thus deriving a discrete set of positions and poses a camera can adopt. To compute a discrete set of poses Φ_s, we uniformly sample the poses between 0 and 2π. However, to determine a discrete set of camera positions, we need to take into account that cameras usually cannot be installed everywhere in a given space. Thus, we assume that, besides the layout of the space considered (Figure 5.3(a)), the input to our algorithm also consists of user-defined regions in this space where cameras can be set up (Figure 5.3(d)). Those regions are then sampled randomly to identify a number A of camera locations. Similar to camera positions and poses, we discretize the considered space (Figure 5.3(a)) to define coverage. We sample control points from the entire space with respect to a user-defined weighting. If some parts of the room are known to be more important (e.g., the area around doors), a higher weighting can be given to those parts, which in
FIGURE 5.3 Modeling space by sampling. (a) White areas define the open space; black areas, walls. (b) Sampled space: Dark gray dots mark possible camera positions; light gray dots, control points. (c) Importance weighting of space: Darker gray regions are more important than lighter regions. (d) Gray: Positions allowed for camera placement.
turn results in a higher density of samples. Parts that are less interesting might be sampled with a lower frequency. This approach is illustrated in Figure 5.3(c). Darker regions are more important than lighter regions except for black pixels, which mark the border of the space or static obstacles constricting the camera's FOV. Altogether we sample a number P of control points. As the number of control points, camera positions, and camera poses increases, so does the accuracy of our approximation. For P → ∞, A → ∞, and |Φ_s| → ∞, our approximated solution converges to the continuous case solution. An example of a possibly resulting start configuration for our example space is shown in Figure 5.3(b). Control points are marked by light gray dots, possible camera positions by dark gray dots. With that, our problems turn into set coverage problems. We assume a control point to be covered by a certain camera if and only if equations 5.7 through 5.9 are satisfied and no obstacles constrict the FOV of the visual sensor in the direction of the control point.
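The importance-weighted sampling of control points can be sketched as follows; the grid-based importance map and the NumPy helper are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

def sample_control_points(importance_map, num_points, seed=0):
    """Draw control points with density proportional to a per-cell importance map.

    importance_map: 2D array of nonnegative weights (0 for walls/obstacles),
    as in Figure 5.3(c). num_points must not exceed the number of cells with
    nonzero weight. Returns a list of (x, y) grid coordinates.
    """
    rng = np.random.default_rng(seed)
    weights = importance_map.astype(float).ravel()
    probabilities = weights / weights.sum()
    idx = rng.choice(weights.size, size=num_points, replace=False, p=probabilities)
    rows, cols = np.unravel_index(idx, importance_map.shape)
    return list(zip(cols.tolist(), rows.tolist()))
```

Candidate camera positions can be drawn the same way from the mask of allowed mounting regions in Figure 5.3(d).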
5.3 APPROACHES We propose different approaches to solve our visual sensor placement problems. They can be subdivided into algorithms, which give a globally optimal solution but are complex and time/memory consuming, and heuristics, which solve the problem in reasonable time and with reasonable space requirements. Experimental results evaluate the different proposed heuristics by comparing them to the optimal solution as a bottom line. For comparison purposes we also describe random placement.
5.3.1 Exact Algorithms In this section we describe our linear programming approaches to optimally solve the problems listed in Section 5.2.
Linear Programming In the following we first derive a binary integer programming (BIP) model for problem 5.4 from the previous list. Subsequently it is shown how to modify it to derive a model for the other problems. We assume that several types of cameras with different sensor resolutions and optics (i.e., focal lengths) are available. For each type of camera k, the FOV parameters d_k and a_k (see Figure 5.2) and a cost K_k are given. We further assume that our space consists of P control points. Visual sensor locations are restricted to A positions. Similarly we discretize the angle by defining a camera's pose to be one out of |Φ_s| poses. Problem 5.4 then consists of minimizing the total cost of the sensor array while ensuring a given percentage of space coverage. If only one type of camera is available, our binary programming model remains the same, but as the total price is minimized so is the total number of cameras in the array. Thus, to obtain the minimal number of cameras that satisfies the constraints, the objective value has to be divided by the price of one camera.
Our approach was inspired by the algorithms presented in Chakrabarty et al. [4] and Hörster and Lienhart [12]. Let a binary variable c_i be defined by

c_i = 1 if control point i is covered by a minimum of M cameras, and c_i = 0 otherwise   (5.10)
The total number, nbCovered, of covered sample points is then given by

nbCovered = Σ_i c_i   (5.11)
Further we define two binary variables x_kjφ and g(k, j, φ, i):

x_kjφ = 1 if a camera of type k is placed at position j with orientation φ, and 0 otherwise   (5.12)

g(k, j, φ, i) = 1 if a camera of type k at position j with orientation φ covers control point i, and 0 otherwise   (5.13)
g(k, j, φ, i) can be calculated in advance for each camera type and stored in a table. The minimization of the total cost of the sensor array is then given by

min Σ_k ( K_k · Σ_{j,φ} x_kjφ )   (5.14)
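Because g(k, j, φ, i) depends only on geometry, it can be tabulated once before the optimization. The sketch below reuses the covers() helper sketched in Section 5.2.3 and assumes a hypothetical occluded() line-of-sight test against the space's obstacles; both are illustrative placeholders.

```python
def precompute_g(camera_types, positions, poses, control_points, occluded):
    """Tabulate g(k, j, phi, i) of (5.13) as a sparse dictionary of 1-entries.

    camera_types:   {k: (d_k, a_k)} FOV parameters per type.
    positions:      list of candidate camera positions (c_x, c_y), indexed by j.
    poses:          iterable of candidate pose angles phi.
    control_points: list of control point coordinates (x, y), indexed by i.
    occluded(p, q): hypothetical visibility test between two points.
    """
    g = {}
    for k, (d_k, a_k) in camera_types.items():
        for j, (cx, cy) in enumerate(positions):
            for phi in poses:
                for i, (x, y) in enumerate(control_points):
                    # covers() is the FOV test from the Section 5.2.3 sketch
                    if covers(cx, cy, phi, d_k, a_k, x, y) and not occluded((cx, cy), (x, y)):
                        g[(k, j, phi, i)] = 1
    return g
```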
Next we need to express the variables that define coverage in terms of the other just defined variables as follows. Since c_i = 1 if and only if at least M cameras cover the control point i, we introduce the following two inequalities for each grid point:

c_i · ( Σ_{k,j,φ} x_kjφ · g(k, j, φ, i) − M ) ≥ 0   (5.15)

(1 − c_i) · ( (M − 1) − Σ_{k,j,φ} x_kjφ · g(k, j, φ, i) ) ≥ 0   (5.16)
The first two constraints, (5.15) and (5.16), involve products of binary variables; thus they are nonlinear. To linearize the inequalities, we introduce a new binary variable for each nonlinear term as well as two additional constraints [22]. Therefore, we replace each c_i · x_kjφ term by a binary variable v_kjφi and introduce the following constraints:

c_i + x_kjφ ≥ 2 · v_kjφi   (5.17)

c_i + x_kjφ − 1 ≤ v_kjφi   (5.18)
To ensure that at most one camera is assigned to each possible camera position, we add the constraint

Σ_{k,φ} x_kjφ ≤ 1   (5.19)
for each possible camera position j. Further, to ensure that the minimal predefined percentage of points p is covered, the following constraint is needed as well:

Σ_i c_i ≥ p · P   (5.20)
Our sensor deployment problem can now be formulated as a BIP model. The result is shown in Figure 5.4. Here, the variable T denotes the number of available camera types. The proposed model needs only a few modifications to solve problems 5.1 and 5.2 (listed previously). As the objective in those problems is to maximize coverage, we need to replace the objective function of the BIP model by

max Σ_i c_i   (5.21)
As the maximization procedure favors the variable c_i to be 1, (5.16) can be dropped, as it is solely used to force c_i to the value 1 if the coverage constraints are satisfied.
FIGURE 5.4 BIP model to solve problem 5.4.
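The chapter's experiments solve such models with the LINGO package (Section 5.4). Purely as an illustration, the sketch below assembles the problem 5.4 model with the open-source PuLP library instead; for M = 1 the role of the product constraints (5.15) through (5.18) can be played by the simpler bound c_i ≤ Σ_{k,j,φ} g(k, j, φ, i) · x_kjφ, which is the only liberty taken here. The g table and placement list are assumed to be prepared as in the earlier sketches.

```python
import pulp

def build_min_cost_bip(num_points, placements, g, costs, p):
    """Problem 5.4 as a BIP: objective (5.14), constraints (5.19) and (5.20),
    and the simplified M = 1 coverage bound described above.

    placements: list of (k, j, phi) tuples; g: dict with g[(k, j, phi, i)] = 1;
    costs[k]: price of camera type k; p: required coverage fraction.
    """
    prob = pulp.LpProblem("sensor_placement", pulp.LpMinimize)
    x = {kjp: pulp.LpVariable(f"x_{n}", cat="Binary") for n, kjp in enumerate(placements)}
    c = [pulp.LpVariable(f"c_{i}", cat="Binary") for i in range(num_points)]
    prob += pulp.lpSum(costs[k] * x[(k, j, phi)] for (k, j, phi) in placements)   # (5.14)
    for i in range(num_points):                                                   # coverage bound
        prob += c[i] <= pulp.lpSum(g.get((k, j, phi, i), 0) * x[(k, j, phi)]
                                   for (k, j, phi) in placements)
    for j0 in {j for (_, j, _) in placements}:                                    # (5.19)
        prob += pulp.lpSum(x[kjp] for kjp in placements if kjp[1] == j0) <= 1
    prob += pulp.lpSum(c) >= p * num_points                                       # (5.20)
    return prob, x, c
```

Calling prob.solve() with PuLP's default solver then returns the selected placements through the x variables; again, this is a sketch of the formulation, not the authors' LINGO model.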
To derive the BIP model for problem 5.1, we also need to substitute (5.20) by

Σ_{k,j,φ} x_kjφ = N   (5.22)

where N denotes the number of cameras we are allowed to place. If we aim to solve problem 5.2 instead, we need to substitute (5.20) by

Σ_{k,j,φ} K_k · x_kjφ ≤ F   (5.23)
where F denotes the maximally allowed total price of the sensor array. The BIP model that solves problem 5.3 is shown in Figure 5.5. We proposed a similar model for solving this problem in 3D using regular grid points in Hörster and Lienhart [13]. The binary variable x_nφ is defined by

x_nφ = 1 if camera n of type k at position (c_x, c_y) has the orientation φ, and x_nφ = 0 otherwise   (5.24)
otherwise
We assume in this problem that the positions and types of all N cameras are given and fixed. The last constraint in the BIP model (see Figure 5.5) ensures that exactly one pose is assigned to each camera.
FIGURE 5.5 BIP model to solve problem 5.3.
Note that for the rest of this chapter we consider only the case of M = 1. That is, we are interested in maximizing and/or achieving coverage, where we define coverage of a control point as satisfying equations 5.7 through 5.9 with at least one camera. The number of variables and constraints depends on the number of control points and possible camera positions and poses. Thus, if we increase those to achieve a better approximation of the continuous case, the number of variables and constraints in our BIP model increases accordingly. As we are not able to solve the BIP problem with an arbitrarily large number of variables and constraints in a reasonable amount of time and with a reasonable amount of memory, it is essential to keep the number of variables and constraints as low as possible. Based on the observation that possible camera positions are often attached to obstacles or borders of the space (as it is, for instance, easy to mount cameras on walls), and those walls/obstacles restrict the cameras' views in some of the sampled orientations, we simply discard combinations of type, pose, and position that do not cover any control point. Thus, the number of variables x_kjφ or x_nφ, respectively, decreases, and hence so does the number of variables v_kjφi or v_nφi, as well as the number of constraints. Of course, the factor of reduction depends on the possible camera positions and orientations, the space geometry, such as the number of obstacles, and other parameters such as the FOV or the P/A ratio. However, the reduction does not change the optimality of the result.
5.3.2 Heuristics In the following subsections we present various proposed heuristics to solve the camera placement problems within reasonable time and memory consumption.
Greedy Search Our greedy search algorithm corresponds to that proposed by Fleishman et al. [10] but introduces different stop criteria and modifies the definition of "rank." A greedy search algorithm is a constructive heuristic that determines the best placement and orientation of one sensor at a time—that is, an iterative procedure places one camera during each iteration. Therefore, we first compute for each discrete camera position, orientation, and type a list of control points that are adequately covered by that camera. In problem 5.3, for each fixed position and type only the pose is varied. A control point is adequately covered by a camera if it lies in its FOV and the view of the camera in the direction of that control point is not occluded. By doing a greedy selection (i.e., choosing at each step the camera with the highest rank), we find near-optimal solutions to our four problems. For problems 5.1 and 5.3 we define the rank of a camera as the number of adequately covered control points that it adds in addition to those already covered by previously placed cameras. It is denoted NbCov_R. In problems 5.2 and 5.4 we define the rank of a camera as the inverse of the ratio r, where

r = K_k / NbCov_R   (5.25)
and K_k denotes the price of the respective camera type considered. The ratio r measures the price for each control point added by this camera type at the current position and with the current orientation. Thus we choose the camera with the cheapest price per
covered control point. The control points covered by the last placed camera are deleted from the current pool of control points and the iteration repeats. If there are multiple highest-ranked cameras, we choose the one with the highest rank with respect to the original control point set. This is reasonable because it causes some points to be covered multiple times and hence possibly covered from more than one direction (see Section 5.5 for further explanation). We add cameras to the solution set until a stop criterion is reached. For problem 5.1 the stop criterion is the allowed number of visual sensors in the solution set; for problem 5.3 the stop criterion is reached if an orientation is assigned to all cameras; for problem 5.4 the stop criterion is a sufficient percentage of coverage of all control points, and for problem 5.2 it is the upper limit of the total price of the array—that is, no camera of any type can be added without exceeding the limit. The pseudocode of our greedy search algorithm is shown in Algorithm 5.1.
Algorithm 5.1: Greedy Search Algorithm
Input:
- A set of possible camera positions, poses, and types
- A set of control points
Algorithm:
begin
  Compute for every combination of camera position, pose, and type a list of control points that are adequately covered;
  iter=0, stop=0;
  do
    Calculate the rank for every combination of camera position, orientation, and type;
    Search for the position-orientation-type combination with the highest rank;
    totalPrice(iter)=totalPrice(iter-1)+cost of highest ranking camera;
    if (iter<N or totalPrice(iter)≤F or nbCoveredPoints(iter-1)<p·P)
      Place sensor of type t at the position and pose that has the highest rank;
      Update the set of uncovered control points;
      nbCoveredPoints(iter)=nbCoveredPoints(iter-1)+newly covered points;
      Update the set of possible camera positions;
      Update the set of possible camera types (due to cost constraints);
      iter=iter+1;
    else
      stop=1;
  until (stop=1)
end

Dual Sampling The dual sampling algorithm is also an incremental planner and thus a constructive heuristic. It was originally proposed in Gonzalez-Banos and Latombe [11] for the acquisition of range images using a mobile robot. We modify this algorithm for our multiple visual sensor
FIGURE 5.6 Position sampling in version 2 of the dual sampling algorithm.
placement tasks and propose two different variants of it. It should be noted that neither version of the algorithm is suited to solve problem 5.3. The input to the algorithms is a set of control points that should be covered. In each iteration we randomly select one point from the set of uncovered points. We then determine the camera's type, position, and pose that cover the selected control point and have the highest rank. We define "rank" as in the previous section. The first version of our dual sampling algorithm determines the highest ranking combination of camera type, position, and pose from a given set of possible types, positions, and orientations. This set of possible types, positions, and orientations must be computed or selected before calling the placement algorithm. In our second version of the dual sampling algorithm a set of possible camera types and orientations is input to the placement algorithms, but the set of possible positions is computed in every iteration according to the randomly selected control point. We obtain possible positions by sampling only in a local neighborhood around this control point (see Figure 5.6), thus increasing the possibility that the control point can be covered from this location. This procedure enables us to sample possible positions with a locally higher density. The set of uncovered control points is reduced in each iteration. The algorithm stops if the stop criterion associated with the current placement problem is reached. The stop criteria are defined as in the Greedy Search subsection. Algorithm 5.2 summarizes our dual sampling.
Algorithm 5.2: Versions 1 and 2 of the Dual Sampling Algorithm
Input:
- A set of possible camera poses and types
- A set of control points
- If (version=1): a set of possible camera positions
Algorithm:
begin
  if (version=1)
    Compute for every combination of camera position, pose, and type a list of control points that are adequately covered;
  iter=0, stop=0;
  do
    Select randomly one control point p from the set of uncovered control points;
    if (version=2)
      Compute set of possible camera positions by sampling the region around control point p;
      Compute for every combination of camera position, pose, and type a list of control points that are adequately covered;
    Calculate the rank of each position-orientation-type combination;
    Search for the position-orientation-type combination with the highest rank;
    totalPrice(iter)=totalPrice(iter-1)+cost of highest ranking camera;
    if (iter<N or totalPrice(iter)≤F or nbCoveredPoints(iter-1)<p·P)
      Place sensor of type t at the position and pose that has the highest rank and covers control point p;
      Update the set of uncovered control points;
      nbCoveredPoints(iter)=nbCoveredPoints(iter-1)+newly covered points;
      Update the set of possible camera types (due to cost constraints);
      if (version=1)
        Update the set of possible camera positions;
      iter=iter+1;
    else
      stop=1;
  until (stop=1)
end

5.3.3 Random Selection and Placement We also implemented random selection and placement of the cameras. By comparing the other algorithms with the results from random selection and placement, we can evaluate, in Section 5.4, to what extent our placement algorithms improve over the bottom line. Together with the optimal solutions, this will give us a clear picture of the capabilities of the proposed heuristics. To solve one of the four problems, the random placement algorithm randomly selects cameras from a set of possible camera positions, poses, and types until the appropriate stop criterion, as defined in the Greedy Search subsection, is reached. Before selection, position-pose-type combinations that do not cover at least one control point are discarded from the set of possible combinations.
5.4 EXPERIMENTS All presented approaches to solve the different placement problems were implemented in C++. The BIP models were solved using the LINGO package (64-bit version) [2]. We
chose a professional optimization software over the freely available lp_solve package [14], since we experienced serious problems with lp_solve when solving BIPs that exceeded a certain size in terms of variables and constraints.
5.4.1 Comparison of Approaches The BIP approach, which determines the global optimum, is not applicable to large spaces, as there exists an upper limit on the number of variables that is due to memory and computational constraints. In our first set of experiments we aimed to determine the quality of the presented heuristics. That is, we addressed the question of how well their solutions approximate the globally optimal BIP solution. We used four small rooms throughout our experiments (see Figure 5.7). Potential camera positions and poses as well as control points in these rooms were sampled to obtain sets of discrete space points and camera positions as well as discrete poses. We sampled 50 possible camera positions and between 100 and 150 control points. Cameras could adopt four to eight discrete orientations, and either one or two types of cameras were available depending on the problem. The cheaper camera type was assumed to have a smaller FOV. The sampled sets were input to our five algorithms: the BIP approach, which determines the optimal solution; the greedy search approach; both versions of the dual sampling algorithm; and the algorithm of randomly choosing camera positions, poses, and types. The performance of the algorithms was tested separately for all four problem instances,
FIGURE 5.7 Spaces used for comparing the proposed camera placement approaches and evaluating the BIP algorithm.
[Figure 5.8 plots, for each of the four spaces, the coverage (%) achieved for problems 5.1, 5.2, and 5.3 and the total sensor array price ($) for problem 5.4, comparing random placement, dual sampling versions 1 and 2, greedy search, and BIP (dual sampling is omitted for problem 5.3).]
FIGURE 5.8 Results obtained for all four considered problems using the proposed algorithms.
except for the two dual sampling approaches, which could not be applied to problem 5.3. The results are illustrated in Figure 5.8. Given the stochastic aspect in the outcome of the dual sampling algorithms and the random selection and placement algorithm, we plotted the average results for these algorithms over ten runs. The results clearly show the improvement of all algorithms over random selection and placement of the sensors in all cases (i.e., for all spaces and problem instances). All proposed heuristics approximate the BIP solution well. On average, the greedy algorithm is the best-performing heuristic. In most cases, the difference between the result of the greedy algorithm and the result of the BIP approach is very small or even zero—the algorithm determines the same or an identically performing camera configuration. Both dual sampling algorithms perform almost equally well but slightly worse than the greedy algorithm in most experiments. It should be noted that the BIP algorithm obtains only the optimum solution based on the discrete input to the problems (i.e., based on the given point sets). The average performance of the second version of the dual sampling algorithm is identical to or worse than the BIP solution in all experiments. However, in some rare cases, the resulting configuration obtained by this version of the dual sampling algorithm had a better performance than the BIP solution. This can be explained as follows: This algorithm takes as input only the control points. In each iteration it samples the camera positions depending
on the position of the currently chosen control point (see the Dual Sampling subsection and Algorithm 5.2). If only a limited number of camera positions can be chosen because of runtime and memory constraints, this approach has the advantage of being able to sample camera positions with a locally higher density. This implies that if a better solution than the one obtained by solving the BIP model is found, then most probably the density of the discrete camera positions in the space is not sufficient to approximate the continuous case. Thus, the second version of the dual sampling approach is especially suited to situations where we have very large spaces and where even greedy search heuristics cannot sample with a sufficient density because of a lack of the needed computational and/or memory resources. In summary, the results show that the proposed heuristics approximate the optimum solution well and are thus suited to situations where the BIP model cannot be solved at all, or can be solved only for an insufficient number of control points or camera positions and poses, as these algorithms allow for a higher number of samples. In cases of very large floor plans, the second version of the dual sampling algorithm should be used. Our experiments were performed only on small rooms with relatively simple geometries, as we were not able to obtain the BIP results on larger problems because of a lack of computational and memory resources. Hence we are not able to compare the results with the optimum solution on complete floor plans. For these large and/or more complicated floor plans the evaluation might be slightly different.
5.4.2 Complex Space Examples In this section we present further results obtained by our different approaches on large and complex rooms. It must be noted that, although the cameras' FOVs may span obstacles, those cameras can only capture a certain control point if no obstacle constricts the FOV of the visual sensor in the direction of this point. This is considered by all algorithms. Figure 5.9 shows two configurations resulting from the greedy search approach. Figure 5.9(a) illustrates the placement of six identical cameras such that optimal coverage
FIGURE 5.9 Configurations resulting from the greedy search approach: (a) maximizing coverage for six identical cameras; (b) maximizing coverage assuming two camera types and a maximum total cost for the sensor array.
FIGURE 5.10 Configurations resulting from the dual sampling algorithm: (a) version 1—minimizing the total cost of the visual sensor array while covering 70 percent of the space, assuming two types of camera. (b) version 2—covering 50 percent of the space assuming identical cameras.
FIGURE 5.11 Poses resulting from solving the BIP model for maximizing coverage given fixed sensor positions and types.
is achieved; the configuration in Figure 5.9(b) shows the result of maximizing coverage for an array of two camera types with a maximum total cost of $700, where the camera type with the smaller field of view costs $60 and the other $100. Figure 5.10 shows results that were obtained using the two versions of the dual sampling algorithm to solve problem 5.4. That is, given a required minimum percentage of coverage, the array with the minimum total price that satisfies this constraint is determined. Either one (see Figure 5.10(b)) or more camera types (see Figure 5.10(a)) are available. Given fixed positions and camera types for eight cameras in the example space, Figure 5.11 shows the configuration that maximizes the percentage of covered space obtained by the BIP approach. The resulting configuration changes if the underlying importance distribution changes. This is shown in Figure 5.12. The associated importance distribution is depicted in the top images; the resulting placement that maximizes the coverage given the five cameras, in the bottom images. It can be seen that more important regions are covered
FIGURE 5.12 Configurations resulting from the greedy search approach for the same space (bottom) but different underlying importance distributions (top) for maximizing coverage given five identical cameras: (a) some regions are more important; (b) equal weighting of all regions.
first in the left configuration (a), whereas in the right placement (b), because of the equal weighting of all regions, the camera configuration aims to maximize the total area covered.
5.5 POSSIBLE EXTENSIONS The problems and models considered in this chapter can be extended easily. A more complex FOV model that, for instance, includes the focus could be used. Another aspect is that we currently define coverage of a point without accounting for the different directions that a surface through that point could adopt with respect to the covering camera. We could account for this by introducing for each control point a number of directions and checking for each whether coverage is achieved. This requires a slightly different definition of the FOV. The total coverage of the space may then be calculated either by summing over all of those directions and points or by summing the average coverage over the directions per control
point. The extension of the BIP as well as the other algorithms to the third dimension is straightforward, but has been excluded from the discussion because of space limitations. We proposed a 3D camera model in Hörster and Lienhart [13]. In our space definition we only distinguish between open space and obstacles/walls. In practice, situations may occur where regions do not have to be covered but also do not restrict the cameras' FOV (i.e., they are not obstacles). An example of such a region is a table in an office environment in conjunction with a face recognition or personal identification application: Tables do not restrict the FOV of the cameras mounted on walls, but the regions they cover are not of interest for the application. The effect of those regions may be easily modeled by setting their importance weights to zero; thus no control points will be computed at these locations. For some applications it may also be desirable to cover regions of the space at different resolution levels [9]. Different sampling frequencies f_s need to be defined by the user and assigned to the regions. The camera model and the algorithms need to be modified accordingly.
5.6 CONCLUSIONS In this chapter we addressed the issue of appropriate visual sensor placement with respect to coverage. We formulated and considered four different problem settings, and we presented different approaches to solve these problems: BIP models that determine a globally optimal solution and various heuristics that approximate this optimum. Experimental evaluations showed the suitability of the algorithms and the practicality of the approaches.
REFERENCES
[1] http://mmc36.informatik.uni-augsburg.de/mediawiki-1.11.2/index.php/User_Interface_for_Optimal_Camera_Placement.
[2] LINGO modeling language and solver, LINDO Systems Inc., http://lindo.com/products/lingo/lingom.html.
[3] R. Bodor, P. Schrater, N. Papanikolopoulos, Multi-camera positioning to optimize task observability, in: IEEE Conference on Advanced Video and Signal Based Surveillance, 2005.
[4] K. Chakrabarty, S.S. Iyengar, H. Qi, E. Cho, Grid coverage for surveillance and target location in distributed sensor networks, IEEE Transactions on Computers, 51 (12) (2002) 1448–1453.
[5] X. Chen, Design of many-camera tracking systems for scalability and efficient resource allocation, Ph.D. dissertation, Stanford University, 2002.
[6] X. Chen, J. Davis, An occlusion metric for selecting robust camera configurations, Machine Vision and Applications, 19 (4) (2008) 1432–1469.
[7] E. Dunn, G. Olague, E. Lutton, Parisian camera placement for vision metrology, Pattern Recognition Letters, 27 (11) (2006) 1209–1219.
[8] A.O. Ercan, D.B. Yang, A. El Gamal, L. Guibas, Optimal placement and selection of camera network nodes for target localization, in: Proceedings of the International Conference on Distributed Computing in Sensor Systems, 2006.
[9] U. Erdem, S. Sclaroff, Automated camera layout to satisfy task-specific and floor plan-specific coverage requirements, Computer Vision and Image Understanding, 103 (2006) 156–169.
[10] S. Fleishman, D. Cohen-Or, D. Lischinski, Automatic camera placement for image-based modeling, Computer Graphics Forum, 19 (2) (2000) 101–110.
[11] H. Gonzalez-Banos, J. Latombe, Planning robot motions for range-image acquisition and automatic 3D model construction, in: Proceedings of the Association for the Advancement of Artificial Intelligence Fall Symposium, 1998.
[12] E. Hörster, R. Lienhart, Approximating optimal visual sensor placement, in: IEEE International Conference on Multimedia and Expo, 2006.
[13] E. Hörster, R. Lienhart, Calibrating and optimizing poses of visual sensors in distributed platforms, ACM Multimedia Systems, 12 (3) (2006) 195–210.
[14] P.N.M. Berkelaar, K. Eikland, lp_solve: Open source (mixed-integer) linear programming system, Eindhoven University of Technology, http://groups.yahoo.com/group/lp_solve/files/Version5.5/.
[15] A. Mittal, L. Davis, Visibility analysis and sensor planning in dynamic environments, in: Proceedings of ECCV, I (2004) 175–189.
[16] G. Olague, P. Mohr, Optimal camera placement for accurate reconstruction, Pattern Recognition, 35 (4) (2002) 927–944.
[17] S. Ram, K.R. Ramakrishnan, P.K. Atrey, V.K. Singh, M.S. Kankanhalli, A design methodology for selection and placement of sensors in multimedia surveillance systems, in: Proceedings of the Fourth ACM International Workshop on Video Surveillance and Sensor Networks, 2006.
[18] J. O'Rourke, Art Gallery Theorems and Algorithms, Oxford University Press, 1987.
[19] S. Sahni, X. Xu, Algorithms for wireless sensor networks, International Journal of Distributed Sensor Networks, 1 (1) (2005) 35–56.
[20] T. Takahashi, O. Matsugano, I. Ide, Y. Mekada, H. Murase, Planning of multiple camera arrangement for object recognition in parametric eigenspace, in: Proceedings of the Eighteenth International Conference on Pattern Recognition, 2006.
[21] J. Wang, N. Zhong, Efficient point coverage in wireless sensor networks, Journal of Combinatorial Optimization, 11 (2006) 291–304.
[22] H. Williams, Model Building in Mathematical Programming, second edition, Wiley, 1985.
[23] J. Zhao, S.-C.S. Cheung, Multi-camera surveillance with visual tagging and generic camera placement, in: First ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[24] Y. Zou, K. Chakrabarty, Sensor deployment and target localization in distributed sensor networks, Transactions on Embedded Computing Systems, 3 (1) (2004) 61–91.
CHAPTER
6
Optimal Visual Sensor Network Configuration
Jian Zhao, Sen-ching S. Cheung
Center for Visualization and Virtual Environments, University of Kentucky, Lexington, Kentucky
Thinh Nguyen
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon
Abstract Wide-area visual sensor networks are becoming more and more common. They have many commercial and military uses, from video surveillance to smart home systems, from traffic monitoring to anti-terrorism. The design of such networks is a challenging problem because of the complexity of the environment, self-occlusion and mutual occlusion of moving objects, diverse sensor properties, and a myriad of performance metrics for different applications. Thus, there is a need to develop a flexible sensor-planning framework that can incorporate all the aforementioned modeling details and derive the sensor configuration that simultaneously optimizes target performance and minimizes cost. In this chapter, we tackle this optimal sensor problem by developing a general visibility model for visual sensor networks and solving the optimization problem via binary integer programming (BIP). Our proposed visibility model supports arbitrary-shape 3D environments and incorporates realistic camera models, occupant traffic models, self-occlusion, and mutual occlusion. Using this visibility model, we propose two novel BIP algorithms to find the optimal camera placement for tracking visual tags in multiple cameras. Furthermore, we propose a greedy implementation to cope with the complexity of BIP. Extensive performance analysis is performed using Monte Carlo and virtual environment simulations and real-world experiments. Keywords: sensor placement, smart camera network, visual tags, binary integer programming
Note: An early version of this work appeared in IEEE Journal of Selected Topics in Signal Processing, Special Issue on Distributed Processing in Vision Networks, 2(4) (2008) under the title "Optimal Camera Network Configurations for Visual Tagging."
6.1 INTRODUCTION In recent years we have seen widespread deployment of smart camera networks for a variety of applications. Proper placement of cameras in such distributed environments is an important design problem. Not only does it determine the coverage of the surveillance, it also has a direct impact on the appearance of objects in the cameras, which dictates the performance of all subsequent computer vision tasks. For instance, one of the most important tasks in a distributed camera network is to identify and track common objects across disparate camera views. This is difficult to achieve because image features like corners, scale-invariant feature transform (SIFT) features, contours, and color histograms may vary significantly between different camera views because of disparity, occlusions, and variations in illumination. One possible solution is to utilize semantically rich visual features based either on intrinsic characteristics, such as faces or gaits, or on artificial marks, such as jersey numbers or special-colored tags. We call the identification of distinctive visual features on an object "visual tagging." To properly design a camera network that can accurately identify and understand visual tags, one needs a visual sensor planning tool: one that analyzes the physical environment and determines the optimal configuration of the visual sensors to achieve specific objectives under a given set of resource constraints.
Determining the optimal sensor configuration for a large-scale visual sensor network is technically very challenging. First, visual sensors require a clear line of sight and are therefore susceptible to occlusion by both static and dynamic objects. This is particularly problematic as these networks are typically deployed in urban or indoor environments characterized by complicated topologies, stringent placement constraints, and a constant flux of occupant or vehicular traffic. Second, from infrared to range sensing and from static to pan-tilt-zoom or even robotic cameras, there is a myriad of visual sensors, many of which have overlapping capabilities. Given a fixed budget with limited power and network connectivity, the choice and placement of sensors become critical to the continuous operation of the visual sensor network. Third, the performance of the network depends heavily on the nature of specific tasks in the application. For example, biometric and object recognition requires objects to be captured at a specific pose; triangulation requires visibility of the same object from multiple sensors; and object tracking can tolerate a certain degree of occlusion using a probabilistic tracker. Thus, there is a need to develop a flexible sensor-planning framework that can incorporate all of the aforementioned modeling details and derive a sensor configuration that simultaneously optimizes target performance and minimizes cost. Such a tool will allow us to scientifically determine the number of sensors, their positions, their orientations, and the expected outcome before embarking on the actual construction of a costly visual sensor network project.
In this chapter, we propose a novel integer-programming framework for determining the optimal visual sensor configuration for 3D environments. Our primary focus is on optimizing the performance of the network in visual tagging. To allow maximum flexibility, we do not impose a particular method for tag detection and simply model it as a generic visual detector.
Furthermore, our framework allows users the flexibility to determine the number of views in which the tag must be observed so that a wide variety of applications can be simulated.
6.1.1 Organization This chapter is organized as follows. After reviewing state-of-the-art visual sensor placement techniques in Section 6.2, we discuss in Section 6.3 how the performance of a sensor configuration can be measured using a general visibility model. In Section 6.4, we adapt the general model to the "visual tagging" problem using the probability of observing a tag from multiple visual sensors. Using this refined model, in Section 6.5 we formulate our search for optimal sensor placements as two BIP problems. The first formulation, MIN_CAM, minimizes the number of sensors for a target performance level; the second, FIX_CAM, maximizes performance for a fixed number of sensors. Because of the computational complexity of BIP, we also present a greedy approximation algorithm called GREEDY. Experimental results with these algorithms using both simulations and camera network experiments are presented in Section 6.6. We conclude with a discussion of future work in Section 6.7.
6.2 RELATED WORK The problem of finding optimal camera placement has long been studied. The earliest investigation can be traced back to the "art gallery problem" in computational geometry: the theoretical study of how to place cameras in an arbitrary-shape polygon so as to cover the entire area [1–3]. Although Chvátal showed [4] that ⌊n/3⌋ cameras are always sufficient, determining the minimum number of cameras turns out to be an NP-complete problem [5]. The theoretical difficulties of camera placement are well understood, and many approximate solutions have been proposed; however, few of them can be directly applied to realistic computer vision problems. Camera placement has also been studied in the field of photogrammetry for building accurate 3D models. Various metrics such as visual hull [6] and viewpoint entropy [7] have been developed, and optimizations are realized by various types of ad hoc searching and heuristics [8]. These techniques assume very dense placement of cameras and are not applicable to wide-area, wide-baseline camera networks. Recently, Ram et al. [9] proposed a framework to study the performance of sensor coverage in wide-area sensor networks, which, unlike previous techniques, takes into account the orientation of the object. They developed a metric to compute the probability of observing an object of random orientation from one sensor and used it to recursively compute the performance of multiple sensors. While their approach can be used to study the performance of a fixed number of cameras, it is not obvious how to extend their scheme to find the optimal number of cameras, or how to incorporate other constraints such as the visibility from more than one camera. More sophisticated modeling schemes pertinent to visual sensor networks were recently proposed in various studies [10–12]. The sophistication of their visibility models comes at a high computational cost for the optimization. For example, the simulated annealing scheme used in Mittal and Davis [11] takes several hours to find the optimal placement of four cameras in a room. Other optimization schemes such as hill climbing [10], semidefinite programming [12], and an evolutionary approach [13] all prove to be computationally intensive and prone to local minima.
Alternatively, the optimization can be tackled in the discrete domain. Hörster and Lienhart [14] developed a flexible camera placement model by discretizing the space into a grid and denoting the possible placement of a camera as a binary variable over each grid point. The optimal camera configuration is formulated as an integer linear programming problem that can incorporate different constraints and cost functions pertinent to a particular application. Similar ideas were also proposed in a number of studies [15–17]. While our approach follows a similar optimization strategy, we develop a more realistic visibility model to capture the uncertainty of object orientation and mutual occlusion in 3D environments. Unlike [14], in which a camera's field of view (FOV) is modeled as a 2D fixed-size triangle, ours is based on measuring the image size of the object as observed by a pinhole camera with arbitrary 3D location and pose. Our motivation is based on the fact that the object's image size is the key to success in any appearance-based object identification scheme. Although the optimization scheme described in Hörster and Lienhart [14] can theoretically be used for triangulating objects, its results, as well as those of others, are limited to maximizing sensor coverage. Our approach, on the other hand, directly tackles the problem of visual tagging, in which each object needs to be visible by two or more cameras. Furthermore, whereas the BIP formulation can avoid the local minima problem, its complexity remains NP-complete [18]. As a consequence, these schemes again have difficulties in scaling up to large sensor networks.
6.3 GENERAL VISIBILITY MODEL Consider the 3D environment depicted in Figure 6.1. Our goal in this section is to develop a general model to compute the visibility of a single tag centered at P in such an environment. We assume that the environment has vertical walls with piecewise linear contours. Obstacles are modeled as columns of finite height with polyhedral cross-sections. Whether the actual tag is the face of a subject or an artificial object, it is reasonable to model each tag as a small flat surface perpendicular to the ground plane. We further assume that all tags are of the same square shape with a known edge length 2w. Without specific knowledge of the height of individuals, we assume that the centers of all tags lie on the same plane Γ parallel to the ground plane. This assumption does not hold exactly in the real world, as individuals are of different heights; nevertheless, as we will demonstrate in Section 6.6.1, such height variation does not greatly affect overall visibility measurements, and the assumption reduces the complexity of our model. While our model restricts the tags to the same plane, we place no restriction on the 3D positions or on the yaw and pitch angles of the cameras in the visual sensor network. Given the number of cameras and their placement in the environment, we define the visibility V of a tag using an aggregate measure of the projected size of the tag on the image planes of the different cameras. The projected size of the tag is very important, as the tag's image has to be large enough to be automatically identified in each camera view. Because of the camera projection from the 3D world to the image plane, the image of the square tag can be an arbitrary quadrilateral. It is possible to precisely calculate the area of this image, but an approximation is sufficient for our visibility calculation.
FIGURE 6.1 3D visibility model of a tag with pose vector vP. Γ is the plane that contains the tag center P. Mutual occlusion is modeled by a maximum blocking angle β and its starting location φs. K describes the obstacles and wall boundaries. Cameras of arbitrary yaw and pitch angles can be placed anywhere in the 3D environment. (For a better version of this figure, see this book's companion website.)
Thus, we measure the projected length of the line segment l at the intersection between the tag and the horizontal plane Γ. The actual 3D length of l is 2w, and since the center of the tag always lies on l, the projected length of l is representative of the overall projected size of the tag. Next we identify the set of random and fixed parameters that affect V. Our choice to measure the projected length of l instead of the projected area of the tag greatly simplifies the parameterization of V. Given a camera network, the visibility function of a tag can be parameterized as V(P, vP, φs | w, K), where P, vP, and φs are random parameters describing the tag, and K and w are fixed environmental parameters. P defines the 2D coordinates of the center of the tag on the plane Γ; vP is the pose vector of the tag. Because we assume that the tag is perpendicular to the ground plane, the pose vector vP lies on the plane Γ and has a single degree of freedom: the orientation angle with respect to a reference direction. Note that the dependency of V on vP allows us to model self-occlusion, that is, the tag being occluded by the subject who is wearing it; the tag is not visible to a camera if the pose vector points away from that camera. While self-occlusion can be succinctly captured by a single pose vector, precise modeling of mutual occlusion can be very complicated, involving the number of neighboring objects, their distance to the tag, and the positions and orientations of the cameras. For our model, we choose a worst-case approach by considering a fixed-size occlusion angle β at a random position, measured from the center of the tag on the Γ plane. Mutual occlusion is said to occur if the projection of the line of sight on the Γ plane falls within the range of the occlusion angle. In other words, we model the occlusion as a cylindrical wall of infinite height around the tag, partially blocking a fixed visibility angle of β starting at a random position φs. The half edge length w of the tag is a known parameter. The shape of the environment is encapsulated in the
fixed parameter set K, which contains a list of oriented vertical planes that describe the boundary wall and the obstacles of finite height. Using K to compute whether there is a direct line of sight between an arbitrary point in the environment and a camera is straightforward. The specific visibility function suitable for visual tagging will be described in Section 6.4. To correctly identify and track any visual tag, a typical classification algorithm requires the tag size on the image to be larger than a certain minimum, although a larger projected size usually does not make much difference. For example, a color tag detector needs a minimum size to differentiate the tag from noise, and a face detector needs a face image large enough to resolve the facial features. On the other hand, the information gain does not increase as the projected object size grows beyond a certain value. Therefore, a thresholded version of the visibility represents our problem better than the absolute image size. Assuming that this minimum threshold on image size is T pixels, this requirement can be modeled by binarizing the visibility function as follows:

    Vb(P, vP, φs | w, K, T) = 1 if V(P, vP, φs | w, K) > T; 0 otherwise    (6.1)
Finally, we define μ, the mean visibility, to be the metric for measuring the average visibility over the entire parameter space:

    μ = ∫ Vb(P, vP, φs | w, K, T) · f(P, vP, φs) dP dvP dφs    (6.2)
where f(P, vP, φs) is a prior distribution that can incorporate prior knowledge about the environment. For example, if an application is interested in locating faces, the likelihood of particular head positions and poses is affected by furnishings and attractions such as television sets and paintings. Except for the most straightforward environments, such as a single camera in a convex environment as discussed in Zhao and Cheung [19], equation 6.2 does not admit a closed-form solution. Nevertheless, it can be estimated using standard Monte Carlo sampling and its many variants.
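Since equation 6.2 has no closed form, in practice μ is evaluated numerically. The following Python sketch shows a plain Monte Carlo estimate; the helpers visibility_vb (standing for the thresholded Vb of equation 6.1) and sample_prior (a sampler for f) are assumptions rather than code from the chapter.

import numpy as np

def estimate_mean_visibility(visibility_vb, sample_prior, num_samples=100_000, rng=None):
    # Plain Monte Carlo estimate of the mean visibility mu (equation 6.2).
    # visibility_vb(P, v_P, phi_s) -> 0/1 is the thresholded visibility of
    # equation 6.1; sample_prior(rng) draws (P, v_P, phi_s) from f(P, v_P, phi_s).
    rng = np.random.default_rng() if rng is None else rng
    hits = 0
    for _ in range(num_samples):
        P, v_P, phi_s = sample_prior(rng)
        hits += visibility_vb(P, v_P, phi_s)
    return hits / num_samples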
6.4 VISIBILITY MODEL FOR VISUAL TAGGING In this section, we present a visibility model for visual tagging. This model is a specialization of the general model in Section 6.3. The goal is to design a visibility function V(P, vP, φs | w, K) that measures the performance of a camera network in capturing a tag in multiple camera views. We first present the geometry for visibility from one camera and then show a simple extension to create V(P, vP, φs | w, K) for an arbitrary number of cameras. Given a single camera with camera center C, a tag at P is visible at C if and only if the following conditions hold:
■ The tag is not occluded by any obstacle or wall (environmental occlusion).
■ The tag is within the camera's field of view (field of view).
■ The tag is not occluded by the person wearing it (self-occlusion).
■ The tag is not occluded by other moving objects (mutual occlusion).
FIGURE 6.2 Projection of a single tag onto a camera.
Thus, we define the visibility function for one camera as the projected length ||l′|| on the image plane of the line segment l across the tag if these conditions are satisfied, and zero otherwise. Figure 6.2 shows the projection of l, delimited by Pl1 and Pl2, onto the image plane Π. Based on the assumption that all tag centers have the same elevation and all tag planes are vertical, we can analytically derive the formula for the projected endpoints P′l1, P′l2 as

    P′li = C − (⟨vC, O − C⟩ / ⟨vC, Pli − C⟩) · (Pli − C),   i = 1, 2    (6.3)
where ⟨·, ·⟩ denotes the inner product. The projected length ||l′|| is simply ||P′l1 − P′l2||.
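The projection in equations 6.3 and 6.5 is a ray–plane intersection and is straightforward to compute. The Python sketch below illustrates it for the projected segment length ||l′||; taking the world z-axis as the vertical direction and placing the image plane in front of the camera center are assumptions of this sketch, and the resulting length is in world units (dividing by the pixel width converts it to pixels).

import numpy as np

def project_onto_image_plane(X, C, v_C, O):
    # Intersect the ray from camera center C through X with the image plane
    # passing through O with normal v_C (the geometry behind equations 6.3 and 6.5).
    t = np.dot(v_C, O - C) / np.dot(v_C, X - C)
    return C + t * (X - C)

def projected_tag_length(P, v_P, w, C, v_C, O):
    # Length of the projected segment l (the horizontal mid-line of the tag,
    # of half-length w, centered at P and perpendicular to the pose vector v_P).
    up = np.array([0.0, 0.0, 1.0])          # assumed world vertical
    along = np.cross(up, v_P)
    along = along / np.linalg.norm(along)
    P_l1, P_l2 = P + w * along, P - w * along
    q1 = project_onto_image_plane(P_l1, C, v_C, O)
    q2 = project_onto_image_plane(P_l2, C, v_C, O)
    return np.linalg.norm(q1 - q2)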
After computing the projected length of the tag, we check the four visibility conditions as follows:
Environmental occlusion. We assume that environmental occlusion occurs if the line segment connecting camera center C with tag center P intersects with some obstacle. While such an assumption does not take partial occlusion into account, it is adequate for most visual tagging applications, where the tag is much smaller than its distance from the camera. We represent this requirement as the following
binary function:

    chkObstacle(P, C, K) = 1 if no obstacle intersects the line segment PC; 0 otherwise    (6.4)
Specifically, the obstacles are recorded in K as a set of oriented vertical planes that describe the boundary wall and the obstacles of finite height. Intersection between the line of sight PC and each element in K is computed. If there is no intersection within the confined environment, or the points of intersection are higher than the height of the camera, no occlusion occurs because of the environment.
Field of view. Similar to determining environmental occlusion, we declare the tag to be in the FOV if the image P′ of the tag center is within the finite image plane Π. Using a derivation similar to (6.3), the image P′ is computed as follows:

    P′ = C − (⟨vC, O − C⟩ / ⟨vC, P − C⟩) · (P − C)    (6.5)
We then convert P′ to local image coordinates to determine whether it is indeed within Π. We encapsulate this condition in the binary function chkFOV(P, C, vC, Π, O), which takes the camera-intrinsic parameters, tag location, and pose vector as input and returns a binary value indicating whether the center of the tag is within the camera's FOV.
Self-occlusion. As illustrated in Figure 6.2, the tag is self-occluded if the angle α between the line of sight to the camera, C − P, and the tag pose vP exceeds π/2. We can represent this condition as a step function U(π/2 − |α|).
Mutual occlusion. In Section 6.3, we modeled the worst-case occlusion using an angle β. As illustrated in Figure 6.2, mutual occlusion occurs when the tag center or half the line segment l is occluded. The angle β is subtended at P on the Γ plane. Thus, occlusion occurs if the projection of the line of sight C − P on the Γ plane at P falls within the range (φs, φs + β). We represent this condition using the binary function chkOcclusion(P, C, vP, φs), which returns 1 for no occlusion and 0 otherwise.
Combining ||l′|| and the four visibility conditions, we define the projected length of an oriented tag with respect to camera Υ as

    I(P, vP, φs | w, K, Υ) = ||l′|| · chkObstacle(P, C, K) · chkFOV(P, C, vC, Π, O) · U(π/2 − |α|) · chkOcclusion(P, C, vP, φs)    (6.6)

where Υ includes all camera parameters, including Π, O, and C. As stated in Section 6.3, a thresholded version is usually more convenient:

    Ib(P, vP, φs | w, K, Υ, T) = 1 if I(P, vP, φs | w, K, Υ) > T; 0 otherwise    (6.7)
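A sketch of the resulting single-camera test is given below. The helpers chk_obstacle and chk_fov stand for the environment and field-of-view checks described above, and proj_len for the projected length ||l′|| (for example, projected_tag_length from the earlier sketch); the azimuth convention used for the mutual-occlusion test is an assumption.

import numpy as np

def single_camera_visibility(P, v_P, phi_s, beta, w, C, v_C, O, T,
                             chk_obstacle, chk_fov, proj_len):
    # Thresholded single-camera visibility I_b of equation 6.7 (a sketch).
    sight = (C - P) / np.linalg.norm(C - P)
    # Self-occlusion: the angle alpha between the line of sight and the pose
    # vector v_P must be below pi/2.
    alpha = np.arccos(np.clip(np.dot(sight, v_P / np.linalg.norm(v_P)), -1.0, 1.0))
    if alpha >= np.pi / 2:
        return 0
    # Mutual occlusion: the azimuth of the line of sight on the Gamma plane
    # must not fall inside [phi_s, phi_s + beta).
    phi = np.arctan2(sight[1], sight[0]) % (2 * np.pi)
    if (phi - phi_s) % (2 * np.pi) < beta:
        return 0
    # Environmental occlusion and field of view.
    if not chk_obstacle(P, C) or not chk_fov(P, C, v_C, O):
        return 0
    return 1 if proj_len(P, v_P, w, C, v_C, O) > T else 0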
To extend the single-camera case to multiple cameras, we note that the visibility of the tag from one camera does not affect that of the others, and thus each camera can be treated independently. Assume that the specific application requires a tag to be visible by H or more cameras. The tag at a particular location and orientation is visible if the sum of the Ib(·) values from all cameras is at least H at that location. In other words, given N cameras Υ1, Υ2, . . . , ΥN, we define the threshold visibility function Vb(P, vP, φs | w, K, T) as

    Vb(P, vP, φs | w, K, T) = 1 if Σ_{i=1}^{N} Ib(P, vP, φs | w, K, Υi) ≥ H; 0 otherwise    (6.8)
Using this definition, we can then compute the mean visibility μ as defined in (6.2) as a measure of the average likelihood of a random tag being observed by H or more cameras. While the specific value of H depends on the application, in the sequel we use H = 2 for concreteness and without loss of generality.
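The multi-camera test of equation 6.8 is then a simple aggregation over the cameras; in the sketch below, I_b is any callable returning the 0/1 single-camera visibility of equation 6.7 for a given camera.

def tag_visible(tag_state, cameras, I_b, H=2):
    # Threshold visibility V_b of equation 6.8: the tag counts as visible when
    # at least H cameras see it.
    return 1 if sum(I_b(tag_state, cam) for cam in cameras) >= H else 0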
6.5 OPTIMAL CAMERA PLACEMENT The goal of an optimal camera placement is to identify, among all possible camera network configurations, the one that maximizes the visibility function given by equation (6.8). As that equation does not possess an analytic form, it is very difficult to apply conventional continuous optimization strategies such as variational techniques or convex programming. Thus, we follow a similar approach as in Hörster and Lienhart [14] by finding an approximate solution over a discretization of two spaces—the space of possible camera configurations and the space of tag location and orientation. Section 6.5.1 describes the discretization of our parameter spaces. Sections 6.5.2 and 6.5.3 introduce two BIP formulations, targeting different cost functions, for computing optimal configurations over the discrete environment. A computationally efficient algorithm for solving BIP based on a greedy approach is presented in Section 6.5.4.
6.5.1 Discretization of Camera and Tag Spaces The design parameters for a camera network include the number of cameras, their 3D locations, and their yaw and pitch angles. The number of cameras is either an output discrete variable or a constraint in our formulation. As camera elevation is usually constrained by the environment, our optimization does not search for the optimal elevation; rather, the user inputs it as a fixed value. For simplicity, we assume that all cameras have the same elevation, but it is a simple change in our code to allow different elevation constraints in different parts of the environment. The remaining 4D camera space (the 2D location and the yaw and pitch angles) is discretized into a uniform lattice gridC of Nc camera grid points, denoted {Υi : i = 1, 2, . . . , Nc}. The unknown parameters of the tag in computing the visibility function (6.8) include the location of the tag center P, the pose of the tag vP, and the starting position φs of the worst-case occlusion angle. Our assumptions stated in Section 6.3 have the tag center lying on a 2D plane and the pose restricted to a 1D angle with respect to a reference direction. As for occlusion, our goal is to perform a worst-case analysis so that, as long as the occlusion angle is less than a given β as defined in Section 6.3, our solution is guaranteed to work no matter where the occlusion is. For this reason, a straightforward quantization of the starting position φs of the occlusion angle does not work: an occlusion angle of β starting anywhere between grid points occludes additional views.
FIGURE 6.3 Discretization that guarantees that an occlusion of angle less than β = π/4 at any position is covered by one of three cases: (a) [0, π/2), (b) [π/4, 3π/4), and (c) [π/2, π).
To simultaneously discretize the space and maintain the guarantee, we select a larger occlusion angle βm > β and quantize its starting position using a step size of Δφ = βm − β. The occlusion angles considered under this discretization are then {[iΔφ, iΔφ + βm) : i = 0, . . . , Nφ − 1}, where Nφ = (2π − βm)/Δφ. This guarantees that any occlusion angle less than or equal to β is covered by one of the enlarged occlusion angles. Figure 6.3 shows an example with β = Δφ = π/4 and βm = π/2. Combining these three quantities, we discretize the 4D tag space into a uniform lattice gridP with Np tag grid points {Λi : i = 1, 2, . . . , Np}. Given a camera grid point Υi and a tag grid point Λj, we can explicitly evaluate the threshold single-camera visibility function (6.7), which we now rename Ib(Λj | w, T, K, Υi), with Λj representing the grid point for the space of P, vP, and φs; w, the size of the tag; T, the visibility threshold; K, the environmental parameter; and Υi, the camera grid point. The numerical values Ib(Λj | w, T, K, Υi) are then used in formulating the cost functions and constraints in our optimal camera placement algorithms.
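A small sketch of this occlusion discretization is shown below; the range of starting positions covered (a full circle) follows the formula reconstructed above and should be adapted if the environment model restricts occlusion positions differently.

import numpy as np

def occlusion_intervals(beta, beta_m):
    # Enumerate the enlarged occlusion intervals [i*dphi, i*dphi + beta_m).
    # Any occlusion of width <= beta starting anywhere is contained in one of
    # them, because the step size dphi equals beta_m - beta.
    dphi = beta_m - beta
    starts = np.arange(0.0, 2 * np.pi - beta_m + 1e-9, dphi)
    return [(s, s + beta_m) for s in starts]

# Example matching Figure 6.3: beta = pi/4, beta_m = pi/2 yields
# [0, pi/2), [pi/4, 3*pi/4), [pi/2, pi), ...
print(occlusion_intervals(np.pi / 4, np.pi / 2)[:3])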
6.5.2 MIN_CAM: Minimizing the Number of Cameras for Target Visibility MIN_CAM estimates the minimum number of cameras that can provide a mean visibility equal to or higher than a given threshold μt. There are two main characteristics of MIN_CAM. First, μ is computed not on the discrete tag space but on the actual continuous space using Monte Carlo simulation. The measurement is thus independent of the discretization. Furthermore, if the discretization of the tag space is done with enough prior knowledge of the environment, MIN_CAM can achieve the target using very few grid points. This is important, as the complexity of BIP depends greatly on the number of constraints, which is proportional to the number of grid points. Second, the visual tagging requirements are formulated as constraints rather than as the cost function in the BIP formulation of MIN_CAM. Thus, the solution will guarantee the chosen tag grid points to be visible at two or more cameras. While this is useful to applications where the visual tagging requirement in the environment needs to be strictly enforced, MIN_CAM may inflate the number of cameras needed to capture some poorly chosen grid points. Before describing how we handle this problem, we describe the BIP formulation in MIN_CAM. We first associate each camera grid point Υi in gridC with a binary variable bi such that

    bi = 1 if a camera is present at Υi; 0 otherwise    (6.9)
The optimization problem can be described as the minimization of the number of cameras:

    min_{bi} Σ_{i=1}^{Nc} bi    (6.10)
subject to the following two constraints. First, for each tag point Λj in gridP, we have

    Σ_{i=1}^{Nc} bi · Ib(Λj | w, T, K, Υi) ≥ 2    (6.11)
This constraint represents the visual tagging requirement that all tags must be visible at two or more cameras. As defined in equation 6.7, Ib(Λj | w, T, K, Υi) measures the visibility of tag Λj with respect to a camera at Υi. Second, for each camera location (x, y), we have

    Σ_{all Υi at (x, y)} bi ≤ 1    (6.12)
These are a set of inequalities guaranteeing that only one camera is placed at any spatial location. The optimization problem in (6.10), with constraints (6.11) and (6.12), forms a standard BIP problem. The solution to the BIP problem obviously depends on the selection of grid points in gridP and gridC. While gridC is usually predefined according to the environment constraints, there is no guarantee that a tag at a random location can be visible by two cameras even if there is a camera at every camera grid point. Thus, tag grid points must be placed intelligently; those away from obstacles and walls are usually easier to observe. On the other hand, focusing only on areas away from obstacles may produce a subpar result when measured over the entire environment. To balance these two considerations, we solve the BIP repeatedly over a progressively refined gridP over the spatial dimensions until the target μt, measured over the entire continuous environment, is satisfied. One possible refinement strategy is to have gridP start from a single grid point at the middle of the environment and grow uniformly in density within the environment interior, while remaining at least one interval away from the boundary. If the BIP fails to return a solution, the algorithm randomly removes half of the newly added tag grid points. The iteration terminates when the target μt is achieved or all of the newly added grid points are removed. This process is summarized in Algorithm 6.1.
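For illustration, the MIN_CAM program of equations 6.10 through 6.12 can be written with an off-the-shelf modeling library. The sketch below uses the PuLP library (the chapter's own implementation uses lp_solve), and it assumes that the visibilities Ib have been precomputed as a 0/1 matrix over camera and tag grid points and that spatial_groups lists, for each spatial location, the indices of camera grid points sharing that location.

import pulp

def min_cam(Ib, spatial_groups):
    # BIP of equations 6.10-6.12. Ib[i][j] is the 0/1 visibility of tag grid
    # point j from a camera at grid point i.
    Nc, Np = len(Ib), len(Ib[0])
    prob = pulp.LpProblem("MIN_CAM", pulp.LpMinimize)
    b = [pulp.LpVariable(f"b_{i}", cat="Binary") for i in range(Nc)]
    prob += pulp.lpSum(b)                                        # (6.10)
    for j in range(Np):                                          # (6.11)
        prob += pulp.lpSum(Ib[i][j] * b[i] for i in range(Nc)) >= 2
    for group in spatial_groups:                                 # (6.12)
        prob += pulp.lpSum(b[i] for i in group) <= 1
    status = prob.solve(pulp.PULP_CBC_CMD(msg=False))
    if pulp.LpStatus[status] != "Optimal":
        return None                                              # no feasible placement
    return [i for i in range(Nc) if b[i].value() > 0.5]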
6.5.3 FIX_CAM: Maximizing Visibility for a Given Number of Cameras A drawback of MIN_CAM is that it may need a large number of cameras to satisfy the visibility of all tag grid points. If the goal is to maximize average visibility, a sensible way to reduce the number of cameras is to allow a small portion of the tag grid points not to be observed by two or more cameras. The selection of these tag grid points should be dictated by the distribution of the occupant traffic f(P, vP, φs) used in computing the average visibility, as described in equation 6.2. FIX_CAM is the algorithm that does precisely that. We first define a set of binary variables on the tag grid, {xj : j = 1, . . . , Np}, indicating whether a tag on the jth tag point in gridP is visible at two or more cameras.
Algorithm 6.1: MIN_CAM
Input: initial grid points for cameras gridC and tags gridP, target mean visibility μt, maximum grid density maxDensity
Output: Camera placement camPlace
Set μ = 0, newP = ø;
while μ ≤ μt do
    foreach Υi in gridC do
        foreach Λj in gridP ∪ newP do
            Calculate Ib(Λj | w, T, K, Υi);
        end
    end
    Solve newCamPlace = BIP_solver(gridC, gridP, Ib);
    if newCamPlace == ø then
        if |newP| == 1 then break, return failure;
        Randomly remove half of the elements from newP;
    else
        camPlace = newCamPlace;
        gridP = gridP ∪ newP;
        newP = new grid points created by halving the spatial separation;
        newP = newP \ gridP;
        Calculate μ for camPlace by Monte Carlo sampling;
    end
end
We also assume a prior distribution {ρj : j = 1, . . . , Np, Σj ρj = 1} that describes the probability of having a person at each tag grid point. The cost function, defined to be the average visibility over the discrete space, is given as follows:

    max_{bi} Σ_{j=1}^{Np} ρj · xj    (6.13)
The relationship between the camera placement variables bi, as defined in (6.9), and the visibility performance variables xj can be described by the following constraints. For each tag grid point Λj, we have

    Σ_{i=1}^{Nc} bi · Ib(Λj | w, T, K, Υi) − (Nc + 1) · xj ≤ 1    (6.14)

    Σ_{i=1}^{Nc} bi · Ib(Λj | w, T, K, Υi) − 2xj ≥ 0    (6.15)
These two constraints effectively define the binary variable xj. If xj = 1, inequality 6.15 becomes

    Σ_{i=1}^{Nc} bi · Ib(Λj | w, T, K, Υi) ≥ 2

which means that a feasible solution to bi must have the tag visible at two or more cameras. Inequality 6.14 becomes

    Σ_{i=1}^{Nc} bi · Ib(Λj | w, T, K, Υi) ≤ Nc + 2

which is always satisfied; the largest possible value of the left-hand side is Nc, corresponding to the case in which there is a camera at every grid point and every tag point is observable by two or more cameras. If xj = 0, inequality 6.14 becomes

    Σ_{i=1}^{Nc} bi · Ib(Λj | w, T, K, Υi) ≤ 1

which implies that the tag is not visible by two or more cameras. Inequality 6.15 is always satisfied, as it becomes

    Σ_{i=1}^{Nc} bi · Ib(Λj | w, T, K, Υi) ≥ 0
Two additional constraints are needed to complete the formulation. As the cost function focuses only on visibility, we need to constrain the number of cameras to be no more than a maximum m as follows:

    Σ_{j=1}^{Nc} bj ≤ m    (6.16)
We also keep the constraint in (6.12) to ensure that only one camera is used at each spatial location.
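A corresponding sketch of the FIX_CAM program (equations 6.13 through 6.16, plus constraint 6.12) is shown below, with the same assumed inputs as the MIN_CAM sketch above, a weight vector rho for the prior, and a camera budget m.

import pulp

def fix_cam(Ib, spatial_groups, rho, m):
    # BIP of equations 6.13-6.16 (a sketch; same conventions as min_cam above).
    Nc, Np = len(Ib), len(Ib[0])
    prob = pulp.LpProblem("FIX_CAM", pulp.LpMaximize)
    b = [pulp.LpVariable(f"b_{i}", cat="Binary") for i in range(Nc)]
    x = [pulp.LpVariable(f"x_{j}", cat="Binary") for j in range(Np)]
    prob += pulp.lpSum(rho[j] * x[j] for j in range(Np))         # (6.13)
    for j in range(Np):
        cover = pulp.lpSum(Ib[i][j] * b[i] for i in range(Nc))
        prob += cover - (Nc + 1) * x[j] <= 1                     # (6.14)
        prob += cover - 2 * x[j] >= 0                            # (6.15)
    prob += pulp.lpSum(b) <= m                                   # (6.16)
    for group in spatial_groups:                                 # (6.12)
        prob += pulp.lpSum(b[i] for i in group) <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(Nc) if b[i].value() > 0.5]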
6.5.4 GREEDY: An Algorithm to Speed Up BIP BIP is a well-studied NP-hard combinatorial problem with many heuristic schemes, such as branch-and-bound, already implemented in software libraries (e.g., lp_solve [20]). However, even these algorithms can be quite intensive if the search space is large. In this section, we introduce a simple greedy algorithm, GREEDY, that can be used for both MIN_CAM and FIX_CAM. Besides experimentally showing the effectiveness of GREEDY, we believe that the greedy approach is an appropriate approximation strategy because of the similarity of our problem to the set cover problem. In the set cover problem, items can belong to multiple sets; the optimization goal is to minimize the number of sets needed to cover all items. While finding the optimal solution to set covering is an NP-hard problem [21], it has been shown that the greedy approach is essentially the best we can do to obtain an approximate solution [22]. We can draw the parallel between our problem and the set cover problem by considering each of the tag grid points as an item "belonging" to a camera grid point if the tag is visible at that camera. The set cover problem then minimizes the number of cameras needed, which is almost identical to MIN_CAM except for the fact that visual tagging requires each tag to be visible by two or more cameras.
Algorithm 6.2: GREEDY Search Camera Placement Algorithm
Input: initial grid points for cameras gridC and tags gridP, target mean visibility μt, and maximum number of cameras m
Output: Camera placement camPlace
Set U = gridC, V = ø, W = gridP, camPlace = ø;
while |V| < μt · |gridP| and |camPlace| ≤ m do
    c = camera grid point in U that maximizes the number of visible tag grid points in W;
    camPlace = camPlace ∪ {c};
    S = subset of grid points visible to two or more cameras in camPlace;
    V = V ∪ S;
    W = W \ S;
    Remove c and all camera grid points in U that share the same spatial location as c;
    if U == ø then
        camPlace = ø;
        return;
    end
end
Output camPlace
The FIX_CAM algorithm further allows some of the tag points not to be covered at all. It is still an open question whether these properties can be incorporated into the framework of set covering, but our experimental results demonstrate that the greedy approach is a reasonable solution to our problem. GREEDY is described in Algorithm 6.2. In each round of the GREEDY algorithm, the camera grid point that can see the largest number of tag grid points is selected, and all the tag grid points visible to two or more cameras are removed. When using GREEDY to approximate MIN_CAM, we no longer need to refine the tag grid to keep the computation tractable; we can start with a fairly dense tag grid and set the camera bound m to infinity. The algorithm then terminates when the estimated mean visibility reaches the target μt. When GREEDY is used to approximate FIX_CAM, μt is set to 1 and the algorithm terminates when the number of cameras reaches the upper bound m, as required by FIX_CAM.
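A compact Python rendering of Algorithm 6.2 might look as follows; the termination condition follows the textual description above (stop once the target fraction of doubly covered tag points or the camera budget is reached), and Ib and spatial_groups are the same assumed inputs as in the BIP sketches.

def greedy_placement(Ib, spatial_groups, mu_target, m):
    # Greedy approximation of the camera placement BIPs (Algorithm 6.2 sketch).
    Nc, Np = len(Ib), len(Ib[0])
    location_of = {i: loc for loc, group in enumerate(spatial_groups) for i in group}
    available = set(range(Nc))
    uncovered = set(range(Np))          # tag points not yet seen by >= 2 cameras
    counts = [0] * Np                   # cameras chosen so far that see each tag point
    cam_place = []
    while len(uncovered) > (1.0 - mu_target) * Np and len(cam_place) < m:
        if not available:
            return []                   # camera grid exhausted without meeting the target
        best = max(available, key=lambda i: sum(Ib[i][j] for j in uncovered))
        cam_place.append(best)
        for j in range(Np):
            counts[j] += Ib[best][j]
        uncovered = {j for j in uncovered if counts[j] < 2}
        available -= set(spatial_groups[location_of[best]])
    return cam_place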
6.6 EXPERIMENTAL RESULTS In this section, we present both simulated and realistic camera network results to demonstrate the proposed algorithms. In Section 6.6.1, we show various properties of MIN_CAM, FIX_CAM, and GREEDY by varying different model parameters. In Section 6.6.2, we compare the optimal camera configurations computed by our techniques with other camera configurations.
6.6.1 Optimal Camera Placement Simulation Experiments All the simulations in this section assume a room of dimension 10 m by 10 m with a single obstacle and a square tag with edge length w = 20 cm. For the camera and lens
models, we assume a pixel width of 5.6 μm, a focal length of 8 mm, and an FOV of 60°. These parameters closely resemble the cameras that we use in our real-life experiments. The threshold T for visibility is set to 5 pixels, which we find to be an adequate threshold for our color tag detector.
MIN_CAM Performance We first study how MIN_CAM estimates the minimum number of cameras for a target mean visibility μt through tag grid refinement. For simplicity, we keep all cameras at the same elevation as the tags and assume no mutual occlusion. The target mean visibility is set to μt = 0.90, which the algorithm reaches in four iterations. The outputs at each iteration are shown in Figure 6.4. Figures 6.4(a) and 6.4(e) show the first iteration. In Figure 6.4(a), the environment contains one tag grid point (black dot) in the middle. The camera grid points are restricted at regular intervals along the boundary lines of the environment and remain the same for all iterations. The arrows indicate the output position and pose of the cameras from the BIP solver. Figure 6.4(e) shows the Monte Carlo simulation results, indicating coverage of the environment by calculating the local average visibility at different spatial locations. The gray level of each pixel represents the visibility at each local region: The brighter the pixel, the higher the probability that it is visible at two or more cameras. The overall mean visibility over the environment is estimated to be 0.4743. Since that is below the target μt, the tag grid is refined as shown in Figure 6.4(b), with the corresponding Monte Carlo simulation shown
(a) Iteration 1; (b) Iteration 2; (c) Iteration 3; (d) Iteration 4; (e) μ = 0.4743; (f) μ = 0.7776; (g) μ = 0; (h) μ = 0.9107
FIGURE 6.4 Four iterations of MIN_CAM. (a)–(d) show the camera placement at the first four iterations of the MIN_CAM algorithm. (e)–(h) are the corresponding visibility maps obtained by Monte Carlo simulation—the lighter the region the higher the probability it can be observed by two or more cameras.
in Figure 6.4(f). When the number of cameras increases from four to eight, μ increases to 0.7776. The next iteration, shown in Figure 6.4(c), grows the tag grid further. With so many constraints, the BIP solver fails to return a feasible solution. MIN_CAM then randomly discards roughly half of the newly added tag grid points. The discarded grid points are shown as lighter dots in Figure 6.4(d). With fewer grid points and hence fewer constraints, a solution is returned with eleven cameras. The corresponding Monte Carlo simulation shown in Figure 6.4(h) gives μ = 0.9107, which exceeds the target threshold. MIN_CAM then terminates.
FIX_CAM versus MIN_CAM In the second experiment, we demonstrate the difference between FIX_CAM and MIN_CAM. Using the same environment as in Figure 6.4(c), we run FIX_CAM to maximize the performance of eleven cameras. The traffic model ρj is set to be uniform. MIN_CAM fails to return a solution under this dense grid and, after randomly discarding some of the tag grid points, outputs μ = 0.9107 using eleven cameras. On the other hand, without any random tuning of the tag grid, FIX_CAM returns a solution with μ = 0.9205; the results are shown in Figures 6.5(a) and 6.5(b). When we reduce the number of cameras to ten and rerun FIX_CAM, we obtain μ = 0.9170, which still exceeds the result from MIN_CAM. This demonstrates that we can use FIX_CAM to fine-tune the approximate result obtained by MIN_CAM. Figures 6.5(c) and 6.5(d), respectively, show the camera configuration and the visibility distribution when using ten cameras.
(a) FC: 11 cameras; (b) FC: μ = 0.9205; (c) FC: 10 cameras; (d) FC: μ = 0.9170; (e) G: 11 cameras; (f) G: μ = 0.9245; (g) G: 10 cameras; (h) G: μ = 0.9199
FIGURE 6.5 (a)–(d) show results of using FIX_CAM (FC). (e)–(h) show results from the same set of experiments using GREEDY (G) as an approximation of FIX_CAM.
GREEDY FIX_CAM Implementation Using the same setup, we repeat our FIX_CAM experiments using the GREEDY implementation. Our algorithm is implemented in MATLAB version 7.0 on a Xeon 2.1-GHz machine with 4 gigabytes of memory. The BIP solver inside the FIX_CAM algorithm is based on lp_solve [20]. We tested both algorithms with the maximum number of cameras set to eleven, ten, nine, and eight. While changing the number of cameras does not change the number of constraints, it does make the search space more restrictive as the number decreases. Thus, it is progressively more difficult to prune the search space, making the solver resemble an exhaustive search. The search results are summarized in Table 6.1. For each run, three numerical values are reported: the fraction of tag points visible to two or more cameras (the actual optimized cost function), the running time, and the mean visibility estimated by Monte Carlo simulations. At eight cameras, GREEDY is 30,000 times faster than lp_solve but has only 3 percent fewer visible tag points than the exact answer. It is also worthwhile to point out that lp_solve fails to terminate when we refine the tag grid by halving the step size in each dimension, whereas GREEDY uses essentially the same amount of time. The placement and visibility maps of the GREEDY algorithm that mirror those from FIX_CAM are shown in the second row of Figure 6.5.
Elevation of Tags and Cameras Armed with an efficient greedy algorithm, we can explore various modeling parameters in our framework. An assumption we made in the visibility model is that all tag centers are in the same horizontal plane. However, this does not reflect the real world, given the different heights of individuals. In the following experiment, we examine the impact of height variation on camera placement performance. Using the camera placement in Figure 6.5(g), we simulate five different scenarios: the height of each person is 10 or 20 cm taller or shorter than the assumed height, or heights are randomly drawn from a bi-normal distribution based on U.S. census data [23]. The changes in the average visibility, shown in Table 6.2, range from −3.8 percent to −1.3 percent, indicating that our assumption does not have a significant impact on the measured visibility.
Table 6.1  lp_solve–GREEDY Comparison

Number of Cameras | lp_solve: Visible Tags | lp_solve: Time (s) | lp_solve: μ | GREEDY: Visible Tags | GREEDY: Time (s) | GREEDY: μ
11 | 0.99 | 1.20   | 0.9205 | 0.98 | 0.01 | 0.9245
10 | 0.98 | 46.36  | 0.9170 | 0.98 | 0.01 | 0.9199
9  | 0.97 | 113.01 | 0.9029 | 0.97 | 0.01 | 0.8956
8  | 0.96 | 382.72 | 0.8981 | 0.94 | 0.01 | 0.8761
Table 6.2  Effect of Height Variation on μ

Height model | +20 cm | −20 cm | +10 cm | −10 cm | Random
Change in μ  | −3.8%  | −3.3%  | −1.2%  | −1.5%  | −1.3%
(a) 0.4 m; (b) 0.8 m; (c) 1.2 m; (d) μ = 0.9019; (e) μ = 0.8714; (f) μ = 0.8427
FIGURE 6.6 (a)–(c) Camera planning results when cameras are elevated 0.4, 0.8, and 1.2 meters above tags. (d)–(f) The corresponding visibility maps obtained from Monte Carlo simulations.
Next, we consider the elevation of the cameras. In typical camera networks, cameras are usually installed at elevated positions to mitigate occlusion. The drawback of elevation is that it provides a smaller FOV compared with the case in which the camera is at the same elevation as the tags. By adjusting the pitch angle of an elevated camera, we can selectively move the FOV to various parts of the environment. As we now add one additional dimension of pitch angle, the optimization becomes significantly more difficult and the GREEDY algorithm must be used. Figure 6.6 shows the result for m = 10 cameras at three different elevations above the Γ plane, on which the centers of all the tags are located. As expected, mean visibility decreases as we raise the cameras. The visibility maps in Figures 6.6(d), 6.6(e), and 6.6(f) show that as the cameras are elevated, the coverage near the boundary drops but the center remains well covered, as the algorithm adjusts the pitch angles of the cameras.
Mutual Occlusion We present simulation results to show how our framework deals with mutual occlusion. Recall that we model occlusion as an occlusion angle of β at the tag. Similar to the experiments on camera elevation, our occlusion model adds a dimension to the tag grid, and thus we have to resort to the GREEDY algorithm. We want to investigate how occlusion affects the number of cameras and the camera positions of the output configuration, so we use GREEDY to approximate MIN_CAM by identifying the minimum number of cameras to achieve a target level of visibility. We use a denser tag grid than before to minimize the difference between the actual mean visibility and that over the discrete tag grid estimated by GREEDY. The tag grid we use is 16 × 16 spatially with 16 orientations.
(a) 0°, 6 cameras; (b) 22.5°, 8 cameras; (c) 45°, 12 cameras; (d) μ = 0.8006; (e) μ = 0.7877; (f) μ = 0.7526
FIGURE 6.7 As the occlusion angle increases from 0° (a) to 22.5° (b) and 45° (c), the number of cameras required by GREEDY to achieve a target performance of μt = 0.8 increases from 6 to 8 and 12. (d)–(f) Corresponding visibility maps.
We set the target to μt = 0.8 and test occlusion angles β of 0°, 22.5°, and 45°. As explained in Section 6.5.1, our discretization uses a slightly larger occlusion angle to guarantee worst-case analysis: βm = 32.5° for β = 22.5° and βm = 65° for β = 45°. In the Monte Carlo simulation, we put the occlusion angle at a random position for each sample point. The results are shown in Figure 6.7: even with an increase in the number of cameras from six to eight to twelve, the resulting mean visibility still drops slightly as the occlusion angle increases. Another interesting observation from the visibility maps in the figure (6.7(d), 6.7(e), and 6.7(f)) is that the region with perfect visibility, indicated by the white pixels, dwindles as occlusion increases. This is reasonable, because it is difficult for a tag to be visible at all orientations in the presence of occlusion.
Realistic Occupant Traffic Distribution In this last experiment, we show how we can incorporate realistic occupant traffic patterns into the FIX_CAM algorithm. All experiments thus far assumed a uniform traffic distribution over the entire tag space; that is, it is equally likely to find a person at each spatial location and orientation. This model does not reflect many real-life scenarios. For example, consider a hallway inside a shopping mall: While there are people browsing at the window displays, most of the traffic flows from one end of the hallway to the other. By incorporating an appropriate traffic model, performance should improve under the same resource constraint. In the FIX_CAM framework, a traffic model can be incorporated into the optimization by using nonuniform weights ρj in the cost function (6.13).
FIGURE 6.8 (a) Random walk traffic model and (b) the resulting optimized configuration, μ = 0.8395. Using the specific traffic distribution for optimization obtains a higher μ than using a uniform distribution: (c) uniform configuration; (d) μ = 0.7538.
To achieve a reasonable traffic distribution, we employ a simple random walk model to simulate a hallway environment. We imagine that the hallway has openings on either side of its top portion. At each tag grid point, which is characterized by both the orientation and the position of a walker, we impose the following transitional probabilities: A walker has a 50 percent chance of moving to the next spatial grid point following the current orientation, unless she is obstructed by an obstacle, and a 50 percent chance of changing orientation. In the case of a change in orientation, there is a 99 percent chance of choosing the orientation that faces the tag grid point closest to the nearest opening, while the rest of the orientations share the remaining 1 percent. At those tag grid points closest to the openings, we create a virtual grid point to represent a walker exiting the environment. The transitional probabilities from the virtual grid point to the real tag points near the openings are all equal. The stationary distribution ρj is then computed by finding the eigenvector, with eigenvalue equal to 1, of the transition probability matrix of the entire environment [24]. Figure 6.8(a) shows the hallway environment, with the four hollow ovals indicating the tag grid points closest to the openings. The result of the optimization under the constraint of using four cameras is shown in Figure 6.8(b). Clearly, the optimal configuration favors the heavy-traffic area. If the uniform distribution is used instead, we obtain the configuration in Figure 6.8(c) and the visibility map in Figure 6.8(d). The average visibility drops from 0.8395 to 0.7538 because of the mismatch with the actual traffic pattern.
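The stationary distribution of such a chain is obtained directly from its transition matrix. The Python sketch below finds the left eigenvector with eigenvalue 1 and normalizes it; the two-state matrix at the end is only a stand-in example (the actual hallway transition matrix is not reproduced here), and the resulting vector plays the role of the weights ρj in the FIX_CAM objective (6.13).

import numpy as np

def stationary_distribution(T):
    # Stationary distribution rho of a row-stochastic transition matrix T,
    # i.e., rho @ T = rho, taken as the left eigenvector with eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(T.T)
    k = np.argmin(np.abs(eigvals - 1.0))
    rho = np.abs(np.real(eigvecs[:, k]))
    return rho / rho.sum()

# Stand-in example: a two-state chain.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_distribution(T))       # approximately [0.833, 0.167]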
6.6.2 Comparison with Other Camera Placement Strategies In this section, we compare our optimal camera placements with two different placement strategies. The first strategy is uniform placement: assuming that the cameras are restricted to the boundary of the environment, the most intuitive scheme is to place them at regular intervals along the boundary, each pointing toward the center of the room. The second strategy is based on the optimal strategy proposed by Hörster and Lienhart in [14]. It is unfair to use Monte Carlo simulations, which use the same model as the optimization, to test the differences between visibility models. As a result, we resort to simulating a virtual 3D environment that mimics the actual 10 m × 10 m room used in Section 6.6.1, into which we insert a randomly walking person wearing a red tag. The results are based on the visibility of the tag in two or more cameras. The cameras are set at the same height
as the tag, and no mutual occlusion modeling is used. The optimization is performed with respect to a fixed number of cameras. To be fair to the scheme in [14], we run their optimization formulation to maximize the visibility from two cameras. The measurements of μ for the three schemes, with the number of cameras varied from five to eight, are shown in Table 6.3. Our proposed FIX_CAM performs the best, followed by the uniform placement. The scheme in [14] does not perform well, as it does not take into account the orientation of the tag; thus, the cameras do not compensate for each other when the tag is in different orientations. We are, however, surprised by how close uniform placement is to our optimal scheme, so we further test the difference between the two with a real-life experiment that incorporates mutual occlusion. We conduct our real-life experiments in a room 7.6 meters long, 3.7 meters wide, and 2.5 meters high, with two desks and a shelf along three of the four walls. We use Unibrain Fire-i400 cameras with Tokina TVR0614 varifocal lenses, at an elevation of 1.5 meters. Since the lenses have a variable focal length, we set them to a focal length of 8 mm, giving a vertical FOV of 45° and a horizontal FOV of 60°. As the elevation of the cameras is roughly level with the position of the tags, we choose a fairly large occlusion angle of βm = 65° in deriving our optimal placement. Monte Carlo results for the uniform placement and the optimal placement are shown in Figure 6.9.
Table 6.3  μ Measurements among Three Schemes Using Virtual Simulations

Number of Cameras | FIX_CAM | Hörster and Lienhart [14] | Uniform Placement
5 | 0.614 ± 0.011 | 0.352 ± 0.010 | 0.522 ± 0.011
6 | 0.720 ± 0.009 | 0.356 ± 0.010 | 0.612 ± 0.011
7 | 0.726 ± 0.009 | 0.500 ± 0.011 | 0.656 ± 0.010
8 | 0.766 ± 0.008 | 0.508 ± 0.011 | 0.700 ± 0.009
FIGURE 6.9 Camera placement in a real camera network: (a) uniform placement, μ = 0.3801; (b) optimal placement, μ = 0.5325; (c) uniform placement, μ = 0.3801; (d) optimal placement, μ = 0.5325.
FIGURE 6.10 Seven camera views from uniform camera placement. (See this book’s companion website for color images of this and next figure.)
FIGURE 6.11 Seven camera views from optimal camera placement.
Table 6.4  Measurements Between Uniform and Optimal Camera Placements

Method  | MC Simulations | Virtual Simulations | Real-life Experiments
Uniform | 0.3801 | 0.4104 ± 0.0153 | 0.2335 ± 0.0112
Optimal | 0.5325 | 0.5618 ± 0.0156 | 0.5617 ± 0.0121
For the virtual environment simulation, we insert three randomly walking persons and capture 250 frames for measurement. For the real-life experiments, we capture about two minutes of video from the seven cameras, again with three persons walking in the environment. Figures 6.10 and 6.11 show the seven real-life and virtual camera views from the uniform placement and the optimal placement, respectively. As shown in Table 6.4, the optimal camera placement is better than the uniform camera placement under all three evaluation approaches. The three μ values measured for the optimal placement are consistent. The results of the uniform placement show higher variation, most likely because excessive occlusion makes detection of the color tags less reliable.
6.7 CONCLUSIONS AND FUTURE WORK We proposed a framework for modeling, measuring, and optimizing the placement of multiple cameras. By using a camera placement metric that captures both self-occlusion
and mutual occlusion in 3D environments, we developed two optimal camera placement strategies that complement each other using grid-based binary integer programming. To deal with the computational complexity of BIP, we also developed a greedy strategy to approximate both of our optimization algorithms. Experimental results were presented to verify our model and to show the effectiveness of our approaches. There are many interesting issues in our proposed framework, and in visual tagging in general, that deserve further investigation. The incorporation of models for different visual sensors, such as omnidirectional and PTZ cameras, or even nonvisual sensors and other output devices, such as projectors, is certainly an interesting topic. The optimality of our greedy approach can benefit from detailed theoretical studies. Last but not least, the use of visual tagging in other application domains, such as immersive environments and surveillance visualization, should be further explored.
REFERENCES
[1] J. O'Rourke, Art Gallery Theorems and Algorithms, Oxford University Press, 1987.
[2] J. Urrutia, Art Gallery and Illumination Problems, Elsevier Science, 1997.
[3] T. Shermer, Recent results in art galleries, Proceedings of the IEEE 80 (9) (1992) 1384–1399.
[4] V. Chvátal, A combinatorial theorem in plane geometry, Journal of Combinatorial Theory Series B 18 (1975) 39–41.
[5] D. Lee, A. Lin, Computational complexity of art gallery problems, IEEE Transactions on Information Theory 32 (1986) 276–282.
[6] D. Yang, J. Shin, A. Ercan, L. Guibas, Sensor tasking for occupancy reasoning in a camera network, in: IEEE/ICST First Workshop on Broadband Advanced Sensor Networks, 2004.
[7] P.-P. Vazquez, M. Feixas, M. Sbert, W. Heidrich, Viewpoint selection using viewpoint entropy, in: Proceedings of the Vision Modeling and Visualization Conference, 2001.
[8] J. Williams, W.-S. Lee, Interactive virtual simulation for multiple camera placement, in: IEEE International Workshop on Haptic Audio Visual Environments and Their Applications, 2006.
[9] S. Ram, K.R. Ramakrishnan, P.K. Atrey, V.K. Singh, M.S. Kankanhalli, A design methodology for selection and placement of sensors in multimedia surveillance systems, in: Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, 2006.
[10] T. Bodor, A. Dremer, P. Schrater, N. Papanikolopoulos, Optimal camera placement for automated surveillance tasks, Journal of Intelligent and Robotic Systems 50 (2007) 257–295.
[11] A. Mittal, L.S. Davis, A general method for sensor planning in multi-sensor systems: Extension to random occlusion, International Journal of Computer Vision 76 (1) (2008) 31–52.
[12] A. Ercan, D. Yang, A.E. Gamal, L. Guibas, Optimal placement and selection of camera network nodes for target localization, in: IEEE International Conference on Distributed Computing in Sensor Systems, 4026, 2006.
[13] E. Dunn, G. Olague, Pareto optimal camera placement for automated visual inspection, in: International Conference on Intelligent Robots and Systems, 2005.
[14] E. Hörster, R. Lienhart, On the optimal placement of multiple visual sensors, in: Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, 2006.
[15] M.A. Hasan, K.K. Ramachandran, J.E. Mitchell, Optimal placement of stereo sensors, Optimization Letters 2 (2008) 99–111.
[16] U.M. Erdem, S. Sclaroff, Optimal placement of cameras in floorplans to satisfy task-specific and floor plan-specific coverage requirements, Computer Vision and Image Understanding 103 (3) (2006) 156–169.
[17] K. Chakrabarty, S. Iyengar, H. Qi, E. Cho, Grid coverage of surveillance and target location in distributed sensor networks, IEEE Transactions on Computers 51 (12) (2002) 1448–1453.
[18] G. Sierksma, Linear and Integer Programming: Theory and Practice, Chapter 8, Marcel Dekker, 2002.
[19] J. Zhao, S.-C. Cheung, Multi-camera surveillance with visual tagging and generic camera placement, in: ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[20] Introduction to lp_solve 5.5.0.10, http://lpsolve.sourceforge.net/5.5/ .
[21] R.M. Karp, Reducibility among combinatorial problems, Complexity of Computer Computations (1972) 85–103.
[22] U. Feige, A threshold of ln n for approximating set cover, Journal of the ACM 45 (4) (1998) 634–652.
[23] United States Census Bureau, Statistical Abstract of the United States, U.S. Census Bureau, 1999.
[24] C.M. Grinstead, L.J. Snell, Introduction to Probability, Chapter 11, American Mathematical Society, 1997.
CHAPTER 7
Collaborative Control of Active Cameras in Large-Scale Surveillance
Nils Krahnstoever, Ting Yu, Ser-Nam Lim, Kedar Patwardhan
Visualization and Computer Vision Lab, GE Global Research, Niskayuna, New York
Abstract A system that controls a set of pan-tilt-zoom (PTZ) cameras for acquiring closeup views of subjects in a surveillance site is presented. The PTZ control is based on the output of a multi-camera, multi-target tracking system operating on a set of fixed cameras, and the main goal is to acquire views of subjects for biometrics purposes, such as face recognition and nonfacial identification. For this purpose, this chapter introduces an algorithm to address the generic problem of collaboratively controlling a limited number of PTZ cameras to capture an observed number of subjects in an optimal fashion. Optimality is achieved by maximizing the probability of successfully completing the addressed biometrics task, which is determined by an objective function parameterized on expected capture conditions, including distance at which a subject is imaged, angle of capture, and several others. Such an objective function serves to effectively balance the number of captures per subject and their quality. Qualitative and quantitative experimental results are provided to demonstrate the performance of the system, which operates in real time under real-world conditions on four PTZ and four static CCTV cameras, all of which are processed and controlled via a single workstation. Keywords: tracking, surveillance, active cameras, PTZ, collaborative control, biometrics, camera planning
7.1 INTRODUCTION Most commercially available automatic surveillance systems today operate with fixed CCTV cameras, as this allows the use of efficient detection and tracking algorithms.
Note: This project was supported by grant #2007-RG-CX-K015 awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the Department of Justice.
Unfortunately, even the resolution of high-quality CCTV cameras is limited to 720×480, with the resolution of low-end cameras much lower. This makes tasks that are sensitive to resolution and image quality, such as face detection and recognition and forensic examinations, very difficult, especially if subjects are imaged from a distance. Megapixel sensors can potentially overcome the resolution problem, but their real-world deployment is still very limited, mostly because of challenges associated with managing, processing, and recording their nonstandard imagery. This has led to the extensive use of pan-tilt-zoom (PTZ) cameras in operator-based security applications, since they provide an inexpensive way of obtaining close-up imageries of site activities. Given the autonomy of a fixed-camera tracking system and the versatility of PTZ cameras, it is only natural to combine them in a master–slave configuration, where detection and tracking information obtained from the fixed cameras is used to automatically control one or more PTZ cameras. The main challenge faced by such a system arises when the number of subjects and activities exceeds the number of PTZ cameras, in which case camera scheduling and control become nontrivial. Intuitively, a good plan for scheduling and controlling PTZ cameras should meet the following requirements: (1) no subject should be neglected; (2) no subject should receive excessive preference over other subjects; (3) the quality of captures should be optimized; and (4) the assignment of PTZ cameras to each subject should consider their capacity to meet the quality requirement. In this chapter, we present a formalized method that satisfies these criteria, with a particular interest in obtaining high-quality close-up captures of targets for performing biometrics tasks. Toward this goal, we define a novel objective function based on a set of quality measures that characterize the overall probability of successfully completing these tasks. Specifically, the assignments of all PTZ cameras to targets are considered jointly, so that the optimization of the objective function in effect guides a camera set collaboratively to obtain high-quality imageries of all site-wide activities. The chapter is organized as follows. We first discuss related work in Section 7.2, followed by a formal definition of the problem in Section 7.3. We then present the objective function in Section 7.4 and discuss how we can optimize it in Section 7.5. In Section 7.6 the set of quality measures that characterize the success probability is given. Experimental results are given in Section 7.8, after which discussion and conclusions are offered in Section 7.9.
7.2 RELATED WORK A substantial number of papers in the literature have focused on low- and middle-level vision problems in the context of multi-camera surveillance systems. The main problems highlighted in these papers are object detection and tracking [1–5] and site-wide, multitarget, multi-camera tracking [6, 7]. The importance of accurate detection and tracking is obvious, since the extracted tracking information can be directly used for site activity/event detection [8–11]. Furthermore, tracking data is needed as a first step toward controlling a set of PTZ cameras to acquire high-quality imageries [12, 13], and toward, for example, building biometric signatures of the tracked targets automatically [14, 15]. In addition to accurate tracking information, the acquisition of high-quality imageries, particularly for biometrics purposes, also requires (1) accurate calibration between the
fixed and PTZ cameras for establishing correspondences [16] and computing pan, tilt, and zoom settings; and (2) an effective and efficient camera-scheduling algorithm for assigning PTZ cameras to targets [17]. A camera-scheduling algorithm would typically utilize tracking information, provided by one or more fixed cameras performing detection and tracking, for computing a schedule that controls the assignment of PTZ cameras to targets over time. Each PTZ camera would then servo, based on calibration data, to aim itself at different targets in a timely fashion as specified by the schedule. Such a setup, commonly known as master–slave [18], is adopted by our system. Several flavors of master–slave control algorithms have been proposed, including those based on heuristics. For example, in Zhou et al. [19], a PTZ camera is controlled to track and capture one target at a time, with the next target chosen as the nearest one to the current target. These heuristics-based algorithms provide a simple and tractable way of computing schedules. However, they quickly become nonapplicable as the number of targets increases and exceeds the number of PTZ cameras, in which case the scheduling problem becomes increasingly nontrivial. A lack of PTZ resources must be taken into account when designing a scheduling strategy that aims at maximizing the number of captures per target and capturing as many targets as possible [17]. To tackle the problem, researchers such as Hampapur et al. [20], proposed a number of different camera-scheduling algorithms designed for different application goals. They include, for example, a round-robin method that assigns cameras to targets sequentially and periodically to achieve uniform coverage. Rather than assigning equal importance to each target, Qureshi and Terzopoulos [21] instead proposed ordering the targets in a priority queue based on their arrival time and the frequency with which they are captured. Currently available PTZ cameras are also ranked based on the ease of adjusting their PTZ states for an assignment, with the most suitable camera selected. Bimbo and Pernici [22] ranked the targets according to the estimated deadlines by which they leave the surveillance area. An optimal subset of the targets, which satisfies the deadline constraint, is obtained through an exhaustive search. These researchers also simplified the problem by computing the scheduling task for each PTZ camera independently. Recently, Li and Bhanu [23] proposed an interesting dynamic camera assignment method using game theory. Several criteria characterizing the performance of PTZ camera imaging and target tracking are introduced to define some utility functions, which are then optimized through a bargaining mechanism in a game. This algorithm bears a similarity to our algorithm in the sense that optimal camera assignment is driven by the expected imaging quality for biometrics tasks. The criteria used in the study [23], however, are not well justified and are defined in an uncalibrated camera setting, thus lacking a semantical interpretation in the physical world. The global utility introduced there also makes it impossible for the system to capture a single target using multiple PTZ cameras simultaneously. Lastly, the PTZ camera scheduling problem can be further complicated by crowded scenes in the real world due to occlusions caused by target interaction. Lim et al. [24] proposed estimating such occlusion moments based on camera geometry and predicted target motion. 
This is followed by construction of a visibility interval for each capture, which is defined as the complement of the occlusion moment. Based on these visibility intervals, the cameras are scheduled using a greedy graph search method [25].
7.3 SYSTEM OVERVIEW Our system consists of a collection of N_f fixed cameras, which perform detection and tracking collaboratively [3, 4] and are calibrated with respect to a common site-wide metric coordinate system [16, 26] to obtain a set of projective matrices, P_f = {P_f^i | i = 1, ..., N_f}. Tracking data provided by the fixed cameras is used to control a collection of N_p PTZ cameras. Here, the pan-tilt-zoom state of each camera—represented as (θ, φ, r), where θ and φ represent the pan and tilt angles and r represents the zoom factor—is suitably controlled in real time to acquire high-quality target imageries. Each PTZ camera is calibrated in its home position, corresponding to the state (0, 0, 1), based on the same metric coordinate system. Accurate site calibration is important to ensure precise targeting by the PTZ cameras. Errors in calibration of the cameras or the tracking system are magnified, especially when targeting faraway subjects, which makes it difficult to obtain full-frame close-up views of, say, a subject's face. During calibration, the projective matrices are first obtained through a direct linear transformation between image and world correspondences. While the resulting calibration is not sufficiently accurate for our purpose, these initial projective matrices provide good starting points for performing Levenberg-Marquardt bundle adjustment [27], which does yield highly accurate calibration data. Based on the calibration, our system has the ability to instruct a PTZ camera to focus on an object of a certain size at a particular world location. The 3D location of a target is usually its center point; for facial biometrics, this would be the center point of the face.
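As a rough illustration of how such a command could be derived, the sketch below converts a 3D target location into pan, tilt, and zoom set points. It is only a sketch of the idea, not the system's actual control law: the home-position rotation R_home, the base focal length f0, and the simple pinhole zoom model are assumptions introduced here for illustration.

    import numpy as np

    def ptz_setpoint(X_world, O, R_home, f0, desired_px, object_size_m):
        """Aim a PTZ camera (calibrated at its home position) at a 3D point.

        X_world       : 3D target location in the site coordinate system
        O             : camera center in the same coordinate system
        R_home        : assumed home-position rotation (world axes -> camera axes)
        f0            : focal length in pixels at zoom factor r = 1 (assumed)
        desired_px    : desired image size of the object in pixels
        object_size_m : physical size of the object (e.g., a face) in meters
        """
        d = R_home @ (np.asarray(X_world, float) - np.asarray(O, float))  # direction in camera frame
        rng = np.linalg.norm(d)
        pan = np.degrees(np.arctan2(d[0], d[2]))                          # rotation about the vertical axis
        tilt = np.degrees(np.arctan2(-d[1], np.hypot(d[0], d[2])))        # elevation toward the target
        # pinhole model: an object of size s at range rng covers roughly f * s / rng pixels
        zoom = desired_px * rng / (object_size_m * f0)
        return pan, tilt, max(1.0, zoom)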
7.3.1 Planning With the capability to control PTZ cameras, it becomes conceivable for the system to command them to move and follow different targets in a timely fashion, so that high-quality imageries can be captured. We refer to such a move-and-follow strategy as a plan. A plan for the ith PTZ camera is denoted as a set of tasks E_i = ⟨e_i1, e_i2, ..., e_ik⟩, where e_ik is the index of the target to be visited by PTZ i at the kth time step. Every target in the plan is followed for a fixed amount of time Δt_fo, and, if two consecutive entries in the plan belong to different targets, the PTZ camera takes some time Δt_mv to move from one target to another. This time is, in practice, dependent on the PTZ state of the camera when the camera starts moving and its PTZ state when it reaches the next target. A plan for all PTZ cameras is denoted as E = ⟨E_1, ..., E_Np⟩. When a PTZ camera has reached a target, it starts to capture close-up views. That is, every time a PTZ camera is following a target, it obtains one capture (i.e., one or more images) of it.
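The plan notation above can be made concrete with a small sketch. The representation of a plan as a list of target indices follows the text; the constant follow time dt_fo and the move_time function are placeholders, since in practice the move time depends on the start and end PTZ states.

    # A plan for one PTZ camera is an ordered list of target indices,
    # e.g. E_i = [3, 3, 7] (follow target 3 twice, then target 7).
    def plan_duration(E_i, dt_fo, move_time):
        """Total time to execute plan E_i.

        dt_fo     : fixed follow time per task
        move_time : function (a, b) -> seconds to slew from target a to target b
                    (placeholder for the state-dependent move time)
        """
        total = 0.0
        for k, target in enumerate(E_i):
            if k > 0 and E_i[k - 1] != target:      # camera must move between different targets
                total += move_time(E_i[k - 1], target)
            total += dt_fo                          # follow (and capture) the target
        return total

    # Example with a constant 2 s slew time and a 3 s follow time:
    # plan_duration([3, 3, 7], dt_fo=3.0, move_time=lambda a, b: 2.0)  ->  11.0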
7.3.2 Tracking The time at which a scheduled capture occurs is governed by the time it takes the PTZ camera to complete all preceding tasks. Therefore, it is important to compute the expected quality of the capture with respect to this future time. For this reason, the capability of the tracker to reliably predict target locations and velocities becomes critical to the success of a PTZ plan. In particular, the tracker employed by
our system is capable of tracking multiple targets effectively by utilizing multiple fixed camera views and assuming that the targets are moving on the ground plane defined by Z = 0. The system follows a detect-and-track paradigm [5, 26, 28], where target detection and tracking are kept separate. Every camera view is running a separate detector, the approach to which follows the work of [3]. At every step, detections from different camera views are projected onto the ground plane via P_f and supplied to a centralized tracker, reducing the tracking to that for these 2D ground plane locations (which thus makes it very efficient). Tracking is then performed by a global nearest neighbor (GNN), which is essentially an improved nearest-neighbor assignment strategy, or joint probabilistic data association filtering (JPDAF) [29, 30]. The algorithm is efficient even when tracking a large number of targets in many camera views simultaneously, and it has excellent performance in cluttered environments. Through the use of the centralized tracker, which operates in a site-wide coordinate system, identity maintenance across camera views and data fusion are effectively performed. Figure 7.1 shows the camera views of a site surveilled by four fixed CCTV cameras. We can see that, even though the targets are fairly close to each other, the tracker is able to persistently maintain their identities across different camera views.
FIGURE 7.1 Multi-camera surveillance system. The tracker performs multi-camera multi-target tracking in a calibrated camera network.
During operation, the tracker observes targets O = {O_i | i = 1, ...}, such that at time t, the state of O_i is given by X_i^t = (x_i^t, v_i^t), with ground plane location x = (x, y) and ground plane velocity v = (v_x, v_y). The tracker provides the system with the expected state of a target at capture time t_ij given as

X_i^{t_ij} = A(Δt) X_i^t     (7.1)

where A(Δt) models the dynamics of the target, and Δt = (t_ij − t) is the time offset relative to t when the capture is expected to occur. For the commonly used constant velocity model, the prediction is given by

x_i^{t_ij} = x_i^t + v_i^t Δt,     v_i^{t_ij} = v_i^t     (7.2)

For facial biometrics purposes, x_i^{t_ij} is augmented with an assumed height that is typical of where a person's face would be. These predicted 3D face locations, X_face^{t_ij}, together with their velocities, v_face^{t_ij}, are then utilized by the system to evaluate the overall quality of a plan using an objective function discussed in the following section.
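A minimal sketch of the constant-velocity prediction of equations 7.1 and 7.2, including the assumed face height used for facial biometrics, is given below; the numeric face height is an illustrative assumption, not a value taken from the chapter.

    import numpy as np

    FACE_HEIGHT_M = 1.7   # assumed height of a standing person's face (illustrative value)

    def predict_state(x, v, dt):
        """Constant-velocity prediction of a ground plane state (eq. 7.2)."""
        x = np.asarray(x, dtype=float)
        v = np.asarray(v, dtype=float)
        return x + v * dt, v                      # predicted position and (unchanged) velocity

    def predict_face_location(x, v, dt):
        """Predicted 3D face location and velocity used to evaluate capture quality."""
        x_pred, v_pred = predict_state(x, v, dt)
        x_face = np.array([x_pred[0], x_pred[1], FACE_HEIGHT_M])   # lift the ground point to face height
        v_face = np.array([v_pred[0], v_pred[1], 0.0])
        return x_face, v_face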
7.4 OBJECTIVE FUNCTION FOR PTZ SCHEDULING The task of designing a good strategy that is effective in guiding the collaborative control of the PTZ cameras is challenging. A strategy can be aimed at optimizing different objectives, such as maximizing the number of targets captured or minimizing the time a target is not assigned to a PTZ camera. However, such approaches employ objectives that are often only indirectly related to the actual goals of the system. In contrast, we focus our attention on the design of an objective function that has a direct effect on the successful completion of a face recognition task (e.g., finding subjects from a watch list). Here, the goal is to maximize the probability of successfully recognizing whether the subject is in the watch list. Thus, the captured imagery of the individual must be of high quality, suitable for facial biometrics. We quantify such capture quality with a success probability described in the rest of this section. We denote the jth capture of target i by c_ij and associate with c_ij a probability, p_ij = p(S|c_ij), that our task will succeed for target i, denoted as the event S = {the biometrics task will succeed}. This success probability depends on many factors, such as the resolution of the captured imageries, the angle at which a subject is captured, and target reachability (i.e., whether the target is within PTZ mechanical limits and tracking limits). We will describe these factors in detail in Section 7.6. For now, we consider a random variable, q_ij, that represents a combined quantification of these factors. We can then express the relationship between p(S|c_ij) and q_ij as

p_ij = p(S|c_ij) = f(q_ij)     (7.3)

where f denotes the relationship and is dependent on the application. Our choice is designed to be a joint function of these factors.
The global objective function given the success probability can now be formulated as follows. For the set of captures for target O_i, namely C_i = {c_ij | j = 1, ..., N_ci}, the probability of failing is 1 − p(S|c_ij) and the probability of failing every single time for a given target i is ∏_j (1 − p(S|c_ij)). Hence, the probability of succeeding at least once (i.e., not failing every single time) for a given target i is given by

p(S|C_i) =
  0                                 if N_ci = 0
  1 − ∏_j (1 − p(S|c_ij))           otherwise     (7.4)

Finally, the overall probability of success for capturing all targets is

p(S) = ∑_i p(S|C_i) p(C_i) = (1/N) ∑_i [ 1 − ∏_j (1 − p(S|c_ij)) ]     (7.5)
This success probability is appealing: Its value increases as the number and quality of captures increase for every target if there is no PTZ resource constraint. However, with a limited number of PTZ cameras and a potentially large number of targets, the number of captures that each individual target receives is often small. The overall quality of the captures measured over all targets then serves to guide the competition for PTZ resources among the targets, where the goal is to find a camera-scheduling plan that optimizes the objective function (7.5) given the current site and PTZ camera activities (the camera may still be executing a previously generated plan) and past captures.
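The aggregation in equations 7.4 and 7.5 can be written in a few lines. In the sketch below, captures is assumed to map each observed target to the list of per-capture success probabilities p(S|c_ij); this is an illustrative helper, not the system's actual data structure.

    def prob_success_target(p_list):
        """Eq. 7.4: probability of succeeding at least once for one target."""
        if not p_list:
            return 0.0
        fail = 1.0
        for p in p_list:
            fail *= (1.0 - p)          # probability of failing every single capture
        return 1.0 - fail

    def prob_success_all(captures):
        """Eq. 7.5: average success probability over all N observed targets."""
        if not captures:
            return 0.0
        return sum(prob_success_target(p_list) for p_list in captures.values()) / len(captures)

    # Example: two targets, one with two mediocre captures, one not yet captured.
    # prob_success_all({1: [0.4, 0.5], 2: []})  ->  (0.7 + 0.0) / 2 = 0.35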
7.5 OPTIMIZATION Based on the success probability given in equation 7.5, we can now proceed to evaluate PTZ plans to determine the best plan at any given time. In this section, we consider two major issues involved in achieving this. We will first describe how we could merge old and new PTZ plans, which is important for achieving a smooth transition between plans. Then, in the face of the combinatorial nature of the search problem, we will describe a best-first search strategy for finding the best plan, which is shown to work very well in practice in Section 7.8.
7.5.1 Asynchronous Optimization Given the cost of determining PTZ plans, the search for an optimal plan must be performed asynchronously to the real-time tracking and PTZ control system. Our policy is that when we begin to optimize a new plan given the most recent states of the targets, say at time t0, we predict the time, t0 + Δt_c, when the computation will finish and let each PTZ camera finish the task it is executing at that time (see Figure 7.2). This creates a predictable start state for the optimization and provides a simple approach for merging old and new PTZ plans. During plan computation, the future locations of observed targets are predicted based on the expected elapsed time needed for moving the PTZ cameras and following targets. The duration of a plan for all PTZ cameras is given by the longest one. As this duration
increases, state predictions become less reliable as targets start to deviate from their constant velocity paths. For this reason, and to reduce the computational complexity of computing an overall plan, only plans with a certain duration are explored. We call this upper bound on the completion time the planning horizon and define it to be the time, t1, when the first previously scheduled task completes plus an offset, Δt_h, which is chosen such that at least a certain number of tasks can be assigned to every PTZ camera (see Figure 7.2). At time t0 + Δt_c, when the computation of a new plan is expected to complete, the new plan is appended to the task the PTZ camera is executing at that time. So, for
FIGURE 7.2 Replacement of PTZ plans. (a) At time t0, the system begins to compute new PTZ plans, estimated to complete at time t0 + Δt_c. All tasks being executed at that time are allowed to finish and are used by the optimizer as starting states. Tasks that follow are discarded. (b) Discarded tasks are replaced with those in the new plans. The line t1 + Δt_h denotes the plan horizon. Each gray box denotes an interval during which a camera moves from one target to another. Numbered white boxes denote the indices of targets that are scheduled to be captured.
the example in Figure 7.2, at time t0 the plans for three PTZ cameras are [[1̄, 1, 1], [2̄, 0, 1, 0], [2̄, 2, 0, 1]], with the bars representing targets currently being followed. When the new plans become available, the original plans have been partially executed, so then we have [[1̄, 1, 1], [0̄, 1, 0], [2̄, 0, 1]], all of which are discarded except for the tasks currently being executed. The new plans thus become [[1̄, 1, 1, 1, 0], [0̄, 2, 0, 0], [2̄, 1, 1, 2]] (see Figure 7.2).
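The plan replacement illustrated by this example amounts to keeping each camera's currently executing task and appending the freshly optimized tasks behind it. A minimal sketch, assuming the executing task of every camera is known, is given below.

    def splice_plans(executing_tasks, new_plans):
        """Replace partially executed plans with newly optimized ones.

        executing_tasks : list with the task each PTZ camera is executing right now
                          (allowed to finish, cf. Figure 7.2)
        new_plans       : list of freshly optimized task lists, one per camera
        """
        return [[current] + list(new) for current, new in zip(executing_tasks, new_plans)]

    # With executing tasks [1, 0, 2] and new plans [[1,1,1,0], [2,0,0], [1,1,2]],
    # the spliced plans become [[1,1,1,1,0], [0,2,0,0], [2,1,1,2]], as in the example above.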
7.5.2 Combinatorial Search Finding a plan that fits into the plan horizon while maximizing the objective function (7.5) is a combinatorial problem. During optimization, we iteratively add target-to-camera assignments to the plan, which implicitly defines a (directed acyclic) weighted graph, where the weights represent changes in quality caused by the additional assignments and nodes describe the feasible plans obtained so far. It is easy to see from equation 7.5 that the plan quality is nondecreasing in the number of target-to-plan assignments. Any plan that cannot be expanded further without violating the plan horizon is a terminal node and a candidate solution. We can utilize standard graph search algorithms, such as best-first search, to find these candidate solutions. Finding the optimal solution requires exhaustively examining this expanding graph, which is possible if the number of observed targets is small. Solving larger problems through approaches such as A* or branch-and-bound searches requires an admissible optimistic estimator that can hypothesize the quality of a nonterminal node during graph expansion. Such a heuristic, however, has so far not been found. We currently resort to a best-first strategy, which rapidly yields good candidate solutions, followed by coordinate ascent optimization through assignment changes. This rapidly improves candidate solutions by making small changes in the assignments in the currently found plan. The search is continued until a preset computational time is exceeded, and the algorithm returns the best solution found so far. See Algorithm 7.1 for an overview.
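A compact sketch of such a best-first expansion is given below (the local label-switching refinement of Algorithm 7.1 is omitted). The plan_quality and fits_horizon callables are placeholders standing in for the objective (7.5) and the plan-horizon check; the sketch illustrates the search strategy rather than the system's actual implementation.

    import heapq, itertools, time

    def best_first_plan(num_cams, targets, plan_quality, fits_horizon, time_budget):
        """Best-first search over target-to-camera assignments.

        plan_quality : function(plan) -> p(S | plan), nondecreasing in assignments
        fits_horizon : function(cam, plan) -> True if one more task fits on that camera
        """
        counter = itertools.count()                     # tie-breaker so plans are never compared
        start = tuple(() for _ in range(num_cams))      # empty plan for every camera
        best, best_q = start, plan_quality(start)
        open_q = [(-best_q, next(counter), start)]
        closed = set()
        deadline = time.time() + time_budget
        while open_q and time.time() < deadline:
            neg_q, _, plan = heapq.heappop(open_q)
            if plan in closed:
                continue
            closed.add(plan)
            expanded = False
            for cam in range(num_cams):
                if not fits_horizon(cam, plan):
                    continue
                for tgt in targets:
                    child = tuple(p + (tgt,) if c == cam else p
                                  for c, p in enumerate(plan))
                    if child in closed:
                        continue
                    heapq.heappush(open_q, (-plan_quality(child), next(counter), child))
                    expanded = True
            if not expanded and -neg_q > best_q:        # terminal node: no task can be added
                best, best_q = plan, -neg_q
        return best, best_q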
7.6 QUALITY MEASURES The efficacy of our objective function, as given in equation (7.5), is dependent on the success probability of completing a biometrics task given a capture (equation (7.3)), which in turn depends on the imaging quality of that capture. Certainly, many factors can be considered for evaluating capture quality. However, since we are interested in acquiring close-up facial imageries so that face detection and recognition can be subsequently applied, the chosen factors should meet the requirements for capturing frontal and high-resolution facial imageries, among others. Specifically, we consider four factors, each of which quantifies the quality of a capture from a different perspective and is modeled by a probabilistic distribution.
7.6.1 View Angle As we are interested in obtaining high-quality imageries for facial biometrics purposes, a camera looking at a person from behind or from a top-down viewpoint brings little value.
Algorithm 7.1: Estimating the Best PTZ Plan
A best-first graph search is used to obtain the best solution within the computational time window Δt_c.
Data: Current time t_k. Predicted algorithm completion time Δt_c. Current PTZ activities up to time t_k + Δt_c. Plan horizon t_h := t_k + Δt_h. All captures c_ij^o that have been (or will have been) made after all current PTZ activities complete.
Result: Optimal plan E* = ⟨E_1, ..., E_Np⟩ such that objective (7.5) is maximized.
begin
  Set E* = [[], ..., []].
  Compute the partial failure probability with all past captures F_i = ∏_j (1 − p(S|c_ij^o, t_h)).
  Set p(S|E*) = (1/N) ∑_i [1 − F_i].
  Create priority queue Q_open and insert (E*, p(S|E*)). Create empty set L_closed.
  while Q_open not empty and runtime is less than Δt_c do
    Retrieve best node E = ⟨E_1, ..., E_Np⟩ from Q_open and add it to L_closed.
    if E is a terminal node and p(S|E) > p(S|E*) then
      Locally refine plan E by switching task labels until no further increase in quality is possible.
      Set E* := E. Continue.
    for all PTZ cameras i and all targets s_l^k, l = 1, ..., N_t, that are active at time t_k do
      if adding s_l^k to plan E_i exceeds the time limit t_h then Continue.
      Create a new plan E′ from E by adding target s_l^k to E_i.
      if not E′ ∈ L_closed and not (E′, p) ∈ Q_open for any p then
        Predict the parameters for this new capture c_l and compute its quality q_l.
        Assume that this plan leads overall to captures {c_xy} for targets indexed by x and with the number of captures (of target x) indexed by y.
        Compute p(S|E′) = (1/N) ∑_x [1 − F_x ∏_y (1 − p(S|c_xy, t_h))].
        Insert (E′, p(S|E′)) into Q_open.
end
The plan E* is our best solution.
This viewpoint concern is characterized by the angle, α, between the normal direction of the face being captured and the line direction defined by the PTZ camera location and the position of the face, as illustrated in Figure 7.3. We clarify two points here. First, to measure α, instead of using the normal direction of the face directly, which is difficult to estimate, we use the travel direction of the subject, which in general is a valid assumption for a moving target. Second, the face position and travel direction are computed based on the predicted motion state vector at the expected capture time, (x_face^{t_ij}, v_face^{t_ij}), as given in Section 7.3.2. Given camera center O, α can then be computed as

α = arccos[ (x_face^{t_ij} − O) · v_face^{t_ij} / ( ‖x_face^{t_ij} − O‖ ‖v_face^{t_ij}‖ ) ]     (7.6)
FIGURE 7.3 Quantities used to measure capture quality. The view angle α, the target–camera distance d_tc, and the target–zone distance d_tz all influence the imaging quality with which a person's face can be captured by a PTZ camera. The overall quality of a capture is decided by considering these factors together.
FIGURE 7.4 Quality function of the view angle. A Gaussian distribution is defined to model capture quality as a function of view angle.
The closer α is to 0 degrees, the better the chance that the camera is capturing a frontal face. Hence, we model this quantity with a Gaussian distribution, q^vw(α) in equation 7.7, with zero mean and the standard deviation set empirically. Figure 7.4 shows the quality function q^vw(α).

q^vw(α) = exp( −α² / (2σ_vw²) )     (7.7)
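A small sketch of the view-angle computation (eq. 7.6) and its Gaussian score (eq. 7.7) follows; the standard deviation sigma_vw is an assumed, illustrative value, since the chapter only states that it is set empirically.

    import numpy as np

    def view_angle_deg(x_face, v_face, O):
        """Eq. 7.6: angle between the camera-to-face direction and the travel direction."""
        a = np.asarray(x_face, dtype=float) - np.asarray(O, dtype=float)
        b = np.asarray(v_face, dtype=float)
        c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

    def q_view(alpha_deg, sigma_vw=30.0):
        """Eq. 7.7: Gaussian quality of the view angle (sigma_vw is illustrative)."""
        return float(np.exp(-alpha_deg**2 / (2.0 * sigma_vw**2)))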
7.6.2 Target–Camera Distance The quality of images captured by a PTZ camera generally degrades as the distance, d_tc, between the target and camera increases. As before, this distance is evaluated based on x_face^{t_ij} and the camera center O, as shown in equation 7.8 (see Figure 7.3).

d_tc = ‖x_face^{t_ij} − O‖     (7.8)

Since the quality increases as d_tc decreases, our second quality measure is based on d_tc using a two-component probability model:

q^dst(d_tc) = γ + (1 − γ) exp(−λ d_tc)     (7.9)

The second term, depicted by an exponential function, models the trend that the quality of a capture decreases as the target moves away from the camera, while the first term represents a slight baseline quality of capture even if the target is far away. γ is a weighting coefficient to balance the contributions of these two terms. Again, the parameters (γ, λ) are decided empirically. Figure 7.5 illustrates q^dst(d_tc).
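The distance score of equations 7.8 and 7.9 is equally short to sketch; gamma and lam below are assumed example values standing in for the empirically chosen parameters (γ, λ).

    import numpy as np

    def q_distance(x_face, O, gamma=0.1, lam=0.05):
        """Eqs. 7.8-7.9: baseline-plus-exponential quality of the target-camera distance."""
        d_tc = np.linalg.norm(np.asarray(x_face, dtype=float) - np.asarray(O, dtype=float))
        return gamma + (1.0 - gamma) * np.exp(-lam * d_tc)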
7.6.3 Target–Zone Boundary Distance The combined fields of view of a fixed camera network define a surveillance zone that can be monitored by the system, as illustrated by the polygon region in Figure 7.3. Theoretically, any target inside this zone can be detected and tracked. It is also conceivable that a target outside the area may still have correct state estimates as a result of motion prediction. In reality, however, a target near the zone boundary has a higher probability of soon leaving the area and therefore a smaller chance to be well captured. By calculating the distance, d_tz, between the predicted location of the target, x^{t_ij}, and the nearest zone boundary, such that d_tz is positive if the target is inside the zone and negative
FIGURE 7.5 Quality function of target-to-camera distance. A two-component mixture model, which combines an exponential distribution and a uniform distribution, is defined to model capture quality as a function of target-to-camera distance.
FIGURE 7.6 Quality function of the target–zone boundary distance. A two-component mixture model, which combines a logistic function and a uniform term, is defined to model capture quality as a function of target distance to the surveillance zone boundary.
if outside, we obtain a model, q^trck(d_tz), which consists of a logistic function and a uniform term:

q^trck(d_tz) = β + (1 − β) · exp(λ_trck d_tz) / (1 + exp(λ_trck d_tz))     (7.10)

Here, the larger d_tz is, implying that the target is within the zone and near the zone center, the higher the quality of the capture is. Again, the small uniform term represents a small baseline capture quality (see Figure 7.6).
7.6.4 PTZ Limits Each PTZ camera has some mechanical limitation on its PTZ parameter range, defined by (θ_min, θ_max, φ_min, φ_max, r_min, r_max). A capture requiring a PTZ camera to set its parameter state outside this physical range is impractical. Thus, a term that encodes the mechanical limitation of a PTZ camera, q^rch(θ, φ, r), is introduced as

q^rch(θ, φ, r) ∝
  1   if (θ, φ, r) ∈ ([θ_min, θ_max], [φ_min, φ_max], [r_min, r_max])
  0   otherwise     (7.11)

For our purposes, we simply adopt a uniform model for PTZ states within this range and zero quality for those outside it.
7.6.5 Combined Quality Measure Given the application domain, we learn the relationship between the quality measures and the success probability of a capture (equation 7.3). We choose to determine the success probability directly as p(S|c_ij) = q^vw · q^dst · q^trck · q^rch. The rationale behind such a direct relationship is twofold: (1) it is computationally efficient so that the system runs in real time, and (2) despite its simplicity, it effectively ensures that the quality requirements are satisfied.
In addition, targets that have a long lifetime in the site are captured multiple times initially, after which the system spends less and less time on these old targets in favor of new arrivals because of their larger contributions to the objective function (equation 7.5). This is a desirable system behavior, but in practice it leads to problems because human operators may feel that the system starts to neglect targets and, if the tracker accidentally switches between subjects because of tracking errors, it might actually neglect new or recent arrivals. To avoid this, we add a temporal decay factor to the success probability (equation 7.3) of a capture performed previously, at time t_ij, as follows:

p(S|c_ij, t) = p(S|c_ij) Ψ(t − t_ij, λ, Δt_s, Δt_cut)     (7.12)

where t is the time at which the success probability is evaluated and Ψ(·) is defined as

Ψ(t − t_ij, λ, Δt_s, Δt_cut) =
  1                                 if t − t_ij < Δt_s
  exp(−(t − t_ij − Δt_s)/λ)         if Δt_s ≤ t − t_ij < Δt_cut
  0                                 otherwise     (7.13)

Within time Δt_s, captures are rated fully; after Δt_cut, they do not contribute to the objective function. In between, the quality factor decays at a rate determined by λ. Intuitively, the resulting system behavior is desirable whereby the contribution of a past capture decreases as time elapses, prompting the system to recapture the target and thus avoid neglecting any one target in the site.
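Putting the remaining measures together, the sketch below combines the zone-boundary score (eq. 7.10), the reachability term (eq. 7.11), and the temporal decay of equations 7.12 and 7.13 into the per-capture success probability of equation 7.3; all numeric parameters are illustrative assumptions rather than values from the chapter.

    import math

    def q_zone(d_tz, beta=0.05, lam_trck=0.5):
        """Eq. 7.10: logistic-plus-baseline quality of the distance to the zone boundary."""
        # 1 / (1 + exp(-x)) is the same logistic term as exp(x) / (1 + exp(x))
        return beta + (1.0 - beta) / (1.0 + math.exp(-lam_trck * d_tz))

    def q_reach(pan, tilt, zoom, limits):
        """Eq. 7.11: 1 inside the mechanical PTZ range, 0 outside."""
        (p0, p1), (t0, t1), (r0, r1) = limits
        return 1.0 if (p0 <= pan <= p1 and t0 <= tilt <= t1 and r0 <= zoom <= r1) else 0.0

    def decay(age, lam=10.0, dt_s=30.0, dt_cut=120.0):
        """Eq. 7.13: full weight up to dt_s, exponential decay until dt_cut, then zero."""
        if age < dt_s:
            return 1.0
        if age < dt_cut:
            return math.exp(-(age - dt_s) / lam)
        return 0.0

    def capture_success(qv, qd, qz, qr, age=0.0):
        """Eqs. 7.3 and 7.12: combined success probability of one capture, decayed by its age."""
        return qv * qd * qz * qr * decay(age)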
7.7 IDLE MODE At every optimization step and during PTZ control, the system becomes idle if no target is currently observed. We can, however, choose to be proactive by aiming the PTZ cameras at locations in the site where targets frequently appear. These locations are obtained by continuously clustering target arrival locations during operation [31]. In idle mode, the system is provided with virtual target locations corresponding to the center of these clusters. This essentially leads to the system adaptively assuming a ready position in which the PTZ cameras are aimed at typical target arrival locations. Since virtual targets are treated just like other targets, the system automatically optimizes the virtual-target-to-camera assignments.
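One simple way to maintain such arrival clusters online is a nearest-centroid running-mean update, sketched below under an assumed merge radius; the clustering method actually used in [31] may differ.

    import numpy as np

    class ArrivalClusters:
        """Incrementally cluster 2D target arrival locations (a simple sketch)."""
        def __init__(self, radius=5.0):
            self.radius = radius            # merge arrivals closer than this (meters, assumed)
            self.centers, self.counts = [], []

        def add(self, x):
            x = np.asarray(x, dtype=float)
            if self.centers:
                d = [np.linalg.norm(x - c) for c in self.centers]
                i = int(np.argmin(d))
                if d[i] < self.radius:      # update the nearest cluster with a running mean
                    self.counts[i] += 1
                    self.centers[i] += (x - self.centers[i]) / self.counts[i]
                    return
            self.centers.append(x.copy())   # start a new cluster (a new "ready" location)
            self.counts.append(1)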
7.8 EXPERIMENTS Our system is integrated on a dual-CPU (Intel Quad Core) workstation running in real time at around 15 fps under load and tested in a corporate campus environment. It captures a total of eight views—four from fixed cameras and four from PTZ cameras. The fixed views are processed at a resolution of 320×240; the PTZ views are not processed. The layout of the testbed and the cameras can be seen in Figure 7.7.
FIGURE 7.7 Experiment site. A corporate courtyard with multiple fixed and PTZ cameras.
We first present the performance of the quality measures described in Section 7.6. Figure 7.8 shows the quality measures (a) and PTZ parameters (b) for a target walking along a straight line in positive X direction, passing underneath PTZ camera 2 (see the site layout in Figure 7.7). One can see that, as the target approaches the camera, quality improves because of the decreasing camera–target distance, while the quality due to the view angle (increasingly down-looking) decreases. Furthermore, the physical angle limitation of the camera and a degradation in target trackability lead to a steep drop in quality as the target passes under the camera. The optimal capture region is about 10 m in front of the camera. Figure 7.9 shows several PTZ snapshots of a 7-minute video sequence. One can see that the system effectively captures all targets in the field of view and prefers to capture targets that are frontal to the views. The system attempts to capture frontal views of targets whenever it can, but will resolve to capturing from the side or behind when no other option exists. As an example of the behavior of the system, in one instance shown in Figure 7.9 (third row), the system is primarily focusing on the target in the white shirt, being the only one facing any of the cameras. As the target turns and walks away from the camera, the system changes its attention to two other targets that are now facing the cameras (fourth row). For another illustration on how the system schedules PTZ cameras intelligently, we examine a situation in Figure 7.10 where two targets are being tracked by the cameras and a third target enters the site from the left (refer to the site layout in Figure 7.7). The system has very limited time to capture a frontal view and quickly assigns PTZ 4 to
FIGURE 7.8 Quality objectives and PTZ states. (a) The graph shows the quality measures based on view angle, distance, tracking, and PTZ limits, as well as the combined quality of a subject approaching and passing underneath a PTZ camera. One can see that the optimal time of capture is when the target is at position 10 m. (b) PTZ parameters of the camera that follows the target.
follow the new arrival. Figure 7.10(b) shows the success probabilities for all three targets. The third target, captured from a side angle (id = 16), has a success probability of only P(S|id = 16) = 0.18. We now evaluate the numerical performance of the system. Table 7.1 shows one of the four fixed cameras tracking several targets. The targets have just been acquired and each is captured several times. Upon execution of plan optimization, the exhaustive search examines a total of 42,592 plan candidates in 8.40 seconds. The best plan is determined to be E* = [[24, 29], [28, 26, 26], [28, 28], [28, 28]], and is chosen because target id = 28 was just acquired by the tracker. The total plan probability is p(S|E*) = 0.247747. The best-first search terminal node is found after visiting 100 nodes, which yields the plan [[29, 29], [26, 26, 26], [28, 28], [28, 28]] with plan probability p = 0.235926. The optimal plan E* is found after three steps of local plan refinements.
FIGURE 7.9 Control of PTZ cameras. Snapshots of the PTZ views, with four views per snapshot.
Table 7.1   Plan Improvement during Optimization

Step        Plan (PTZ1, PTZ2, PTZ3, PTZ4)               Probability
Reference   [[29,29], [26,26,26], [28,28], [28,28]]     0.235926
1           [[24,29], [26,26,26], [28,28], [28,28]]     0.246558
2           [[24,29], [29,26,26], [28,28], [28,28]]     0.247045
3           [[24,29], [28,26,26], [28,28], [28,28]]     0.247747

Note: The best-first search followed by local label switching yields the optimal solution after 100 steps; the exhaustive search finds it after 42,592 steps.
Table 7.2 shows, for several plan optimization runs, how the best-first refined estimate compares to the global estimate and when each was reached (measured by the number of nodes explored). It shows that
■ The quality of the best-first estimate is very close to the globally optimal solution found through exhaustive search.
■ The best-first node is found early in the optimization process.
■ The global optimum is very often reached after only a fraction of all possible plans.
Moreover, as shown by the extreme case in the first row of Table 7.2, the exhaustive search sometimes has to explore 50 percent or more of all nodes. Thus, for practical
FIGURE 7.10 Limited time to capture target. (a) Three targets moving in the site. (b) Graph showing target success probabilities.
Table 7.2   Best-First versus Best Node

Total Nodes   Best First Found at   Best First Probability   Best Node Found at   Best Node Probability
234256        81                    0.148558                 127866               0.161289
38016         72                    0.191554                 6407                 0.201699
96668         112                   0.267768                 112                  0.267768
36864         81                    0.231526                 8159                 0.234387

Note: The table shows the probabilities of the best-first search result versus the global optimum. It also shows when the best-first node is reached and when the best node is found. The leftmost column indicates the total number of nodes examined.
purposes, if the plan horizon is not too large, the best-first refined node is indeed a very good solution to the control problem outlined in this chapter. Finally, we look at some results of the system during operation. Figure 7.11 illustrates the usefulness of our system in adopting a ready position during idle mode. After
FIGURE 7.11 Clustering of arrival locations. Rows 1 through 4: four fixed cameras detecting several targets where they first appear. The system clusters these arrival locations into regions where future targets are likely to appear first. Row 5: PTZ cameras aiming at one such region during idle mode. The advantage of this is demonstrated in row 6, where a person arriving at the scene is captured with good quality, even before any plan is generated.
detecting several targets where they first appear in the scene, the system clusters the arrival locations into a region in which future targets are likely to first appear. (Note that, even though only one region was generated in this example, multiple such regions can be generated in the general case if targets arrive from different locations in the scene.) The figure shows four PTZ cameras aiming at the region in idle mode, which proved to be very useful when a subject’s face was captured with good quality as he walked into the region before any plan was generated. In general, by capturing targets from a ready
position, a PTZ camera avoids captures whose quality is adversely affected by motion blur and jitter caused by PTZ motion. In Figure 7.12, we show some live captures during actual system operation. The results are very good, demonstrating (1) the accuracy of our calibration and PTZ control
FIGURE 7.12 PTZ control during operations. Face captures during live operation of the system. Row 1: a person’s face is captured by all the PTZ cameras since there is only one tracked target at that time; Rows 2 and 3: two different targets are captured by different PTZ cameras; Rows 4 through 6: captures of three different targets. Overall, these captures demonstrate the accuracy of our calibration and PTZ control even with such a high zoom factor. Note that some of these images might not have the target “perfectly” centralized, because the PTZ camera is still in transition.
FIGURE 7.13 Asynchronous face detection. Capture quality is validated by a face detection engine running asynchronously. Two example sets of images are shown. Notice the overlaid face images detected by the system. Since the system operates asynchronously, the displayed faces are shifted slightly relative to the most recent live video images.
even at the high zoom factor used for acquiring the images, and (2) the effectiveness of our system in capturing high-resolution and frontal facial imageries. For evaluating system performance, captured imageries are input to a face detection engine [32] asynchronously, as shown in Figure 7.13, which demonstrates successful detection based on the image captures, thus validating the utility of the system for facial biometrics purposes.
7.9 CONCLUSIONS We presented a systematic approach to achieving real-time collaborative control of multiple PTZ cameras. The main novelty of our approach lies in optimizing target captures using a probabilistic objective function that encapsulates a set of quality measures, including capture distance, view angle, target reachability, and trackability. This objective function is lightweight, allowing the system to perform captures in real time while directly satisfying the components of a good capture. This is unlike most other systems, where the real-time requirement is often compromised by complex planning or vice versa. The outlook of our research is promising. The rigorous approach we employed at every step has resulted in a robust system that can be utilized in a wide variety of applications. One can imagine, for example, utilizing our system for recognizing persons, monitoring person-to-person interactions, or automatically collecting, for forensic purposes, face images of subjects that appear in a surveillance site.
REFERENCES [1] C. Stauffer, W. Grimson, Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 747–757. [2] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 1208–1221. [3] N. Krahnstoever, P. Tu, T. Sebastian, A. Perera, R. Collins, Multi-view detection and tracking of travelers and luggage in mass transit environments, in: Proceedings Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2006. [4] P. Tu, F. Wheeler, N. Krahnstoever, T. Sebastian, J. Rittscher, X. Liu, A. Perera, G. Doretto, Surveillance video analytics for large camera networks, SPIE Newsletter, 2007. [5] T. Yu, Y. Wu, N.O. Krahnstoever, P.H. Tu, Distributed data association and filtering for multiple target tracking, in: Proceedings IEEE Conference on ComputerVision and Pattern Recognition, 2008. [6] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, S. Shafer, Multi-camera multi-person tracking for easy living, in: IEEE Workshop on Visual Surveillance, 2000. [7] R. Collins, A. Lipton, T. Kanade, A system for video surveillance and monitoring, in: Proceedings of the American Nuclear Society Eighth International Topical Meeting on Robotics and Remote Systems, 1999. [8] S. Gong, J. Ng, J. Sherrah, On the semantics of visual behavior, structured events and trajectories of human action. Image and Vision Computing 20 (12) (2002) 873–888.
[9] S. Hongeng, R. Nevatia, F. Bremond, Video-based event recognition: activity representation and probabilistic recognition methods. Computer Vision and Image Understanding 96 (2004) 129–162. [10] G. Medioni, I. Cohen, F. Bremond, S. Hongeng, R. Nevatia, Event detection and analysis from video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (8) (2001) 873–889. [11] N.M. Oliver, B. Rosario, A.P. Pentland, A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 831–843. [12] A. Senior, A. Hampapur, M. Lu, Acquiring multi-scale images by pan-tilt-zoom control and automatic multi-camera calibration, in: IEEE Workshop on Applications on Computer Vision, 2005. [13] B.J. Tordoff, Active control of zoom for computer vision. PhD thesis, University of Oxford, 2002. [14] J. Denzler, C.M. Brown, Information theoretic sensor data selection for active object recognition and state estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2) (2002) 145–157. [15] S.J.D. Prince, J.H. Elder,Y. Hou, M. Sizinstev, Pre-attentive face detection for foveated wide-field surveillance, in: IEEE Workshop on Applications on Computer Vision, 2005. [16] N. Krahnstoever, P. Mendonça, Bayesian autocalibration for surveillance, in: Proceedings of IEEE International Conference on Computer Vision, 2005. [17] K.A. Tarabanis, P.K. Allen, R.Y. Tsai, A survey of sensor planning in computer vision. IEEE Transactions on Robotics and Automation 11 (1995) 86–104. [18] L. Marchesotti, L. Marcenaro, C. Regazzoni, Dual camera system for face detection in unconstrained environments, in: IEEE Conference Image Processing, 2003. [19] X. Zhou, R. Collins, T. Kanade, P. Metes, A master-slave system to acquire biometric imagery of humans at a distance, in: ACM SIGMM Workshop on Video Surveillance, 2003. [20] A. Hampapur, S. Pankanti, A. Senior, Y.L. Tian, L.B. Brown, Face cataloger: Multi-scale imaging for relating identity to location, in: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, 2003. [21] F. Qureshi, D. Terzopoulos, Surveillance camera scheduling: A virtual vision approach. ACM Multimedia Systems Journal, Special Issue on Multimedia Surveillance Systems 12 (2006) 269–283. [22] A.D. Bimbo, F. Pernici, Towards on-line saccade planning for high-resolution image sensing. Pattern Recognition Letters 27 (2006) 1826–1834. [23] Y. Li, B. Bhanu, Utility-based dynamic camera assignment and hand-off in a video network, in: Second ACM/IEEE International Conference on Distributed Smart Cameras, September (2008) 1–9. [24] S.N. Lim, L.S. Davis, A. Mittal, Constructing task visibility intervals for a surveillance system. ACM Multimedia Systems Journal, Special Issue on Multimedia Surveillance Systems 12 (2006). [25] S.N. Lim, L. Davis, A. Mittal, Task scheduling in large camera network, in: Proceedings of the Asian Conference on Computer Vision (2007) 141–148. [26] N. Krahnstoever, P. Mendonça, Autocalibration from tracks of walking people, in: Proceedings British Machine Vision Conference, 2006. [27] R.I. Hartley, A. Zisserman, MultipleView Geometry in ComputerVision. Cambridge University Press, 2000. [28] P.H. Tu, G. Doretto, N.O. Krahnstoever, A.A.G. Perera, F.W. Wheeler, X. Liu, J. Rittscher, T.B. Sebastian, T. Yu, K.G. Harding, An intelligent video framework for homeland protection SPIE
Defense and Security Symposium, Unattended Ground, Sea, and Air Sensor Technologies and Applications IX, invited paper, 2007.
[29] S. Blackman, R. Popoli, Design and Analysis of Modern Tracking Systems, Artech, 1999.
[30] C. Rasmussen, G. Hager, Joint probabilistic techniques for tracking multi-part objects, in: Proceedings IEEE Conference on Computer Vision and Pattern Recognition (1998) 16–21.
[31] C. Stauffer, Estimating tracking sources and sinks, in: Proceedings of the Second IEEE Workshop on Event Mining, 2003.
[32] H. Schneiderman, T. Kanade, A statistical method for 3D object detection applied to faces and cars, in: IEEE Computer Vision and Pattern Recognition Conference, 2000.
CHAPTER 8
Pan-Tilt-Zoom Camera Networks
A. Del Bimbo, F. Dini, F. Pernici
MICC University of Florence, Florence, Italy
A. Grifoni
Thales Italia, Florence, Italy
Abstract Pan-tilt-zoom (PTZ) camera networks play an important role in surveillance systems. They can direct attention to interesting events in the scene. One method to achieve such behavior is a process known as sensor slaving: One master camera (or more) monitors a wide area and tracks moving targets to provide positional information to one slave camera (or more). The slave camera can thus foveate at the targets in high resolution. In this chapter we consider the problem of estimating online the time-variant transformation that relates a human's foot position in the image of a fixed camera to his head position in the image of a PTZ camera. The transformation achieves high-resolution images by steering the PTZ camera at targets detected in a fixed camera view. Assuming a planar scene and modeling humans as vertical segments, we present the development of an uncalibrated framework that does not require any known 3D location to be specified and that takes into account both zooming camera and target uncertainties. Results show good performance in localizing the target's head in the slave camera view, degrading when the high zoom factor causes a lack of feature points. A cooperative tracking approach exploiting an instance of the proposed framework is presented. Keywords: PTZ camera, rotating and zooming camera, camera networks, distinctive keypoints, tracking
8.1 INTRODUCTION In realistic surveillance scenarios, it is impossible for a single sensor to monitor all areas at once or to capture interesting events at high resolution. Objects become occluded by trees and buildings or by other moving objects, and sensors themselves have limited fields of view. A promising solution to this problem is to use a network of pan-tilt-zoom (PTZ) cameras. A PTZ camera can be conceived of as a reconfigurable fixed camera. In fact, through the pan, tilt, and zoom parameters, we can actually choose from a family of
fixed cameras. The zoom capabilities of those cameras allow placement away from the scene to monitor, while the pan and tilt can be used to cooperatively track all objects within an extended area and seamlessly track, at higher resolution, individual objects that could not be viewed by a single nonzooming sensor. This allows, for example, target identity to be maintained across gaps in observation. Despite this huge potential, the complexity of such systems is considerable. We argue that three main issues are worth exploring since they form the basic building blocks to create high-level functionalities in PTZ camera networks: (1) control laws for the trajectories of the PTZ system with respect to (wrt) a given task; (2) determining the mapping between one camera and the PTZ system (master–slave); and (3) the scaling of a minimal system (i.e., two cameras) to one that includes multiple omnidirectional and PTZ systems. In this chapter the focus is on establishing the mapping between a fixed camera and the PTZ camera. Once known, the mapping between a fixed and an active camera greatly simplifies peripherally guided active vision of events at a site. To this end, cameras are settled in a master–slave configuration [1]: The master camera is set to have a global view of the scene so that it can track objects over extended areas using simple tracking methods with adaptive background subtraction. The slave camera can then follow the trajectory to generate close-up imagery of the object. Despite the possibility of arranging an active and a fixed camera with a short baseline so as to promote feature matching between the two fields of view, as in [1, 2], here the aim is to propose a general framework for an arbitrary camera topology. In our framework, any node in the camera network sharing a common FOV can exploit the mapping. The proposed solution is to be able to estimate online the transformation between a fixed master camera viewing a planar scene and tracking the target, and a PTZ slave camera taking close-up imagery of it. The system computes the target foot position in the fixed camera view and transfers it to the corresponding head position in the PTZ camera view. Both camera and target uncertainties are taken into account in estimating the transformation between the two cameras. Since we do not explicitly use direct camera calibration, no large offline learning stage is needed.
8.2 RELATED WORK A substantial number of papers in the literature concerning multi-camera surveillance systems have focused on object detection (Dalal and Triggs [3]), target tracking (Yilmaz et al. [4]), and data association for multiple target tracking (Vermaak et al. [5]). The importance of accurate detection, tracking, and data association is obvious, since tracking information is needed as the initial stage for controlling one or more PTZ cameras to acquire high-resolution imagery. However, in addition to detection and tracking, the acquisition of high-quality imagery, particularly for biometrics purposes, requires accurate calibration between the fixed and PTZ cameras in order to focus attention on interesting events that occur in the scene. For these cameras, however, precalibration is
almost impossible. In fact, transportation, installation, and changes in temperature and humidity in outdoor environments typically affect the estimated calibration parameters. Moreover, it is impossible to re-create the full range of zoom and focus settings. A tradeoff has to be made for simplicity against strict geometric accuracy and between online dynamic and offline batch methodologies. A step in this direction has been made by combining fixed and PTZ cameras. This configuration is often termed in the literature as master–slave. Many researchers use a master–slave camera configuration with two [1, 2, 6–9] or more [10–15] cameras. In particular, most of the methods strongly rely on a direct camera calibration step [7, 8, 13, 14]. Basically these approaches are not autonomous since they need a human to deal with calibration marks. The few exceptions are discussed in [1, 2, 9]. Kang et al. [9] track targets across a fixed and a PTZ camera. They use an affine transformation between consecutive pairs of frames for stabilizing moving camera sequences, and a homography transformation for registering the moving and stationary cameras with the assumption that the scene is planar. No explicit zoom usage is performed. Zhou et al. [1] and Badri et al. [2] do not require direct calibration; however, viewpoints between the master and slave cameras are assumed to be nearly identical so as to promote feature matching. In particular Kang et al. [1] use a linear mapping computed from a lookup table of manually established pan and tilt correspondences. They actively control a zooming camera using a combination of motion detection algorithms. Zhou et al. [2] use a lookup table that takes the zoom into account but still needs a batch learning phase. More general methods exist to calibrate one or several PTZ cameras. These methods can be classified depending on the particular task performed. Since PTZ cameras, if they are stationary, play the same role as fixed cameras, standard methods for fixed cameras still apply. From the very definition of camera calibration (internal and external), standard camera calibration techniques (sometimes termed hard calibration) can be used. Past work on zooming camera calibration has mostly been done in a laboratory using calibration targets, or at least in controlled environments. One important work using active zoom lens calibration is Willson et al. [16]. However, these methods are not flexible enough to be used in wide areas (especially in outdoor environments), because image measurements have to be well spaced in the image. A more flexible method (Hartley [17]) can be used to self-calibrate (without calibration targets) a single PTZ camera by computing the homographies induced by rotating and zooming. The same approach was analyzed (de Agapito et al. [18]) considering the effect of imposing different constraints on the intrinsic parameters of the camera. It was reported that best results are obtained when the principal point is assumed to be constant throughout the sequence although it is known to be varying in reality. A very thorough evaluation of the same method was performed with more than one hundred images (Sinha and Pollefeys [19]). Then the internal calibration of the two PTZ cameras was used for 3D reconstruction of the scene through essential matrix and triangulation, using the mosaic images as a stereo pair. A class of methods exploiting moving objects in order to achieve self-calibration also appears in the literature.
For example, Davis and Chen [20] and Svoboda et al. [21] use
LEDs. As the LED is moved around and visits several points, these positions make up the projection of a virtual object (a 3D point cloud) with unknown 3D position. However, camera synchronization here limits the flexibility of these approaches, especially if they are applied to IP camera networks. A further class of methods performs a weaker (self-)calibration by establishing a common coordinate frame using walking subjects. Sinha et al. [22] use silhouettes of close-range walkers; Chen et al. [23] and Lee et al. [24] use far-distance moving objects; and Rahimi et al. [25] use planar trajectories of moving targets. The very last category over-simplifies camera self-calibration, maximizing flexibility over accuracy. Krahnstoever et al. [26], Lv et al. [27], and Bose and Grimson [28] especially belong to this category. They basically use the single-view-geometry camera calibration (Liebowitz and Zisserman [29]) of pinhole cameras through a vanishing point computed from inaccurate image features, such as the imaged axes of walkers. A main problem with these approaches is that the parallel lines used to compute vanishing points have to be viewed with a strong perspective effect. Moreover, the measured features computed using target/motion detection methods are in general very noisy for the estimation of the underlying projective geometric models. However, none of the presented methods provides a way to maintain calibration of a PTZ camera while it moves.

PTZ camera network research can also benefit from simultaneous localization and mapping (SLAM) using visual landmarks (Chekhlov et al. [30] and Barfoot [31]). In Se et al. [32] landmarks are detected in images using the scale-invariant feature transform (SIFT) and matched using an efficient best-bin-first K-D tree search (Beis and Lowe [33]). With a similar method, Barfoot [31] is able to process pairs of 1024×768 images at 3 Hz with databases of up to 200,000 landmarks. This result is even more encouraging because of recent efforts in local image descriptors (Zhang et al. [34], Sudipta et al. [35], Grabner et al. [36], and Hua et al. [37]). Zhang et al. [34] propose a parallel SIFT implementation on an 8-core system showing an average processing speed of 45 fps for images with 640×480 pixels, which is much faster than implementations on GPUs [35, 36]. Finally, Hua et al. [37] demonstrate that descriptors with performance equal to or better than state-of-the-art approaches can be obtained with 5 to 10 times fewer dimensions.
8.3 PAN-TILT-ZOOM CAMERA GEOMETRY
In this section we give an overview of the mathematics used, by reviewing the basic geometry of PTZ cameras [18]. The projection of scene points onto an image by a perspective camera may be modeled by the central projection equation x = PX, where x = [x, y, 1]^T are the image points in homogeneous coordinates, X = [X, Y, Z, 1]^T are the world points, and P is the 3×4 camera projection matrix. Note that this equation holds only up to scale. The matrix P can be decomposed as

P = K[R|t]    (8.1)
where the rotation R and the translation t represent the Euclidean transformation between the camera and the world coordinate systems, and K is an upper triangular matrix that encodes the internal parameters of the camera in the form

K = \begin{pmatrix} \gamma f & s & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}    (8.2)
Here f is the focal length and γ is the pixel aspect ratio. The principal point is (u_0, v_0), and s is a skew parameter, which is a function of the angle between the horizontal and vertical axes of the sensor array. Without loss of generality, by choosing the origin of the camera reference frame at the optic center, the projection matrix for each possible view i of the PTZ camera may be written as

P_i = K_i [R_i | 0]    (8.3)
The projection of a scene point X = [X, Y, Z, 1]^T onto an image point x = [x, y, 1]^T may now be expressed as x = K_i [R_i | 0] [X, Y, Z, 1]^T = K_i R_i [X, Y, Z]^T = K_i R_i d, where d = [X, Y, Z]^T is the 3D ray emanating from the optical center to the image point x. Since the entries in the last column of the projection matrix of equation 8.3 are zero, the depth of the world points along the ray is irrelevant and we only consider the projection of 3D rays d. Therefore, in the case of a rotating camera, the mapping of 3D rays to image points is encoded by the 3×3 invertible projective transformation

P_i = K_i R_i    (8.4)
Given a 3D ray d, its projections onto two different images will be x_i = K_i R_i d and x_j = K_j R_j d. Eliminating d from the equations, it is easy to see that in the case of a rotating camera there exists a global 2D projective transformation (homography) H_ij that relates corresponding points in the two views, x_j = H_ij x_i, whose analytic expression is given by

H_ij = K_j R_j R_i^{-1} K_i^{-1} = K_j R_ij K_i^{-1}    (8.5)
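A quick numerical check of equation (8.5) can be written in a few lines. The following sketch is illustrative only (the intrinsic and rotation values are made up, and K_matrix and rot_y are hypothetical helpers, not the chapter's code): it builds two views of a rotating, zooming camera and verifies that the homography of equation (8.5) transfers a projected ray from one view to the other.

```python
import numpy as np

def K_matrix(f, gamma=1.0, s=0.0, u0=320.0, v0=240.0):
    # internal parameter matrix of equation (8.2)
    return np.array([[gamma * f, s, u0],
                     [0.0,       f, v0],
                     [0.0,     0.0, 1.0]])

def rot_y(a):
    # pan-like rotation about the vertical axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])

Ki, Kj = K_matrix(f=500.0), K_matrix(f=800.0)          # two zoom settings
Ri, Rj = rot_y(0.05), rot_y(0.20)                      # two pan positions
Hij = Kj @ Rj @ np.linalg.inv(Ri) @ np.linalg.inv(Ki)  # equation (8.5)

d = np.array([0.3, -0.1, 1.0])                         # an arbitrary 3D viewing ray
xi = Ki @ Ri @ d                                       # projection in view i (up to scale)
xj = Kj @ Rj @ d                                       # projection in view j (up to scale)

xt = Hij @ xi                                          # transfer from view i to view j
print(np.allclose(xt / xt[2], xj / xj[2]))             # True
```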
8.4 PTZ CAMERA NETWORKS WITH MASTER–SLAVE CONFIGURATION
PTZ cameras are particularly effective when in a master–slave configuration [1]: The master camera is set to have a global view of the scene so that it can track objects over extended areas using simple tracking methods with adaptive background subtraction. The slave camera can then follow a target's trajectory to generate close-up imagery of the object. Obviously, the slave and master roles can be exchanged, provided that both have PTZ capabilities. The master–slave configuration can then be extended to the case of multiple PTZ cameras. Figure 8.1 shows the pairwise relationship between two cameras in this configuration. H′ is the homography relating the image plane Π of the master camera C with the reference image plane Π′ of the slave camera C′, and Hk is the homography
FIGURE 8.1 Pairwise relationship between two cameras in a master–slave configuration. Camera C is the master, tracking the target; camera C′ is the slave. H′ is the homography from the stationary camera to a reference position of the PTZ camera in a wide-angle view. Π is the image plane of the master camera; Π′ is the reference plane of the slave camera. The homography Hk relates the reference image plane Π′ with the plane of the current image Ik of the slave camera.
relating the current image plane of the slave camera to the reference image plane Π′. Once Hk and H′ are known, the imaged location a of a moving target A tracked by the stationary camera C can be transferred to the zoomed view of camera C′ by

T_k = H_k · H'    (8.6)

as a′ = T_k · a. With this pairwise relationship between cameras, the number of possible network configurations can be calculated. Given a set of PTZ cameras C_i viewing a planar scene, we define N = {C_i^s}_{i=1}^M as a PTZ camera network, where M denotes the number of cameras in the network and s defines the state of each camera. At any given time each camera can be in one of two states, s_i ∈ {MASTER, SLAVE}. The network N can be in one of 2^M − 2 possible state configurations, since the configurations with all cameras in the master state or all cameras in the slave state are excluded. It is worth noticing that under this definition more than one camera can act as a master and/or slave. In principle, without any loss of generality, if all cameras in a network have an overlapping field of view (i.e., they are in a fully connected topology) they can be set in a master–slave relationship with each other (not only in a one-to-one relationship). For example, in order to cover large areas, several master cameras can be placed with adjacent fields of view; in this case a single slave camera can suffice to observe the whole area. Several master cameras can have overlapping fields of view so as to achieve higher tracking accuracy (multiple observations of the same object from different cameras can be taken into account to obtain a more accurate measurement and to determine a more accurate foveation by the slave camera). Similarly, more than one camera can act as a slave—for example, for capturing high-resolution images of moving objects from several viewpoints.
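As a small illustration of the counting argument above (not code from the chapter, and the function name is hypothetical), the admissible state assignments of a network can be enumerated directly; for M cameras there are 2^M − 2 of them once the all-master and all-slave assignments are excluded.

```python
from itertools import product

def network_configurations(M):
    """All admissible master/slave assignments for M cameras."""
    configs = []
    for assignment in product(("MASTER", "SLAVE"), repeat=M):
        if len(set(assignment)) > 1:        # at least one master and one slave
            configs.append(assignment)
    return configs

configs = network_configurations(3)
print(len(configs))                          # 2**3 - 2 = 6
```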
8.4.1 Minimal PTZ Camera Model Parameterization
In the case of real-time tracking applications, a trade-off can be made between simplicity and strict geometric accuracy because of the different nature of the problem. Whereas in 3D
reconstruction small errors in the internal camera parameters can lead to a nonmetric reconstruction, in tracking only weaker Euclidean properties of the world are needed. Thus, simpler camera models can be assumed. It is also evident that fitting highly parameterized models to a low number of features can lead to fitting noise [38]. For these reasons we adopt a minimal parameterization where only the focal length is allowed to vary. Because of the mechanical nature of PTZ cameras it is possible to assume that there is no rotation around the optical axis. We also assume that the principal point lies at the image center and that the CCD pixels are square (i.e., their aspect ratio equals 1). Under these assumptions we can write for R_ij
R_ij = R^{\psi}_{ij} \cdot R^{\phi}_{ij}    (8.7)

with

R^{\psi}_{ij} = \begin{bmatrix} \cos\psi_{ij} & 0 & -\sin\psi_{ij} \\ 0 & 1 & 0 \\ \sin\psi_{ij} & 0 & \cos\psi_{ij} \end{bmatrix}    (8.8)

and

R^{\phi}_{ij} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi_{ij} & -\sin\phi_{ij} \\ 0 & \sin\phi_{ij} & \cos\phi_{ij} \end{bmatrix}    (8.9)
where ψ_ij and φ_ij are the pan and tilt angles from image i to image j, respectively. Under the same assumptions we can write

K_j = \begin{bmatrix} f_j & 0 & p_x \\ 0 & f_j & p_y \\ 0 & 0 & 1 \end{bmatrix}    (8.10)

and a similar expression for K_i as

K_i = \begin{bmatrix} f_i & 0 & p_x \\ 0 & f_i & p_y \\ 0 & 0 & 1 \end{bmatrix}    (8.11)
where (p_x, p_y) are the coordinates of the image center. The homography H_ij relating the current view and the reference view thus depends on only four parameters: ψ_ij, φ_ij, f_i, and f_j. In our framework, they are reduced to three because f_i can be computed in closed form by using two images with different focal lengths obtained from the PTZ camera [39]. In the next section we illustrate our method to recursively track these parameters while the active camera is moving.
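Under the minimal parameterization, the homography of equation (8.5) can be composed directly from the pan angle, the tilt angle, and the two focal lengths. The following is a minimal sketch combining equations (8.7)–(8.11); the helper name and the default principal point are assumptions, not the chapter's code.

```python
import numpy as np

def homography_from_params(psi_ij, phi_ij, f_i, f_j, px=160.0, py=120.0):
    """Homography relating view i to view j under equations (8.7)-(8.11) and (8.5)."""
    c, s = np.cos(psi_ij), np.sin(psi_ij)
    R_pan = np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])    # equation (8.8)
    c, s = np.cos(phi_ij), np.sin(phi_ij)
    R_tilt = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])   # equation (8.9)
    R_ij = R_pan @ R_tilt                                             # equation (8.7)
    K = lambda f: np.array([[f, 0.0, px], [0.0, f, py], [0.0, 0.0, 1.0]])
    return K(f_j) @ R_ij @ np.linalg.inv(K(f_i))                      # equation (8.5)

H = homography_from_params(np.radians(4.0), np.radians(-2.5), f_i=400.0, f_j=950.0)
print(H / H[2, 2])    # homographies are defined up to scale
```

The default principal point (160, 120) corresponds to the 320×240 slave images used in the experiments of Section 8.8.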
8.5 COOPERATIVE TARGET TRACKING
Real-time tracking applications are required to work properly even when camera motion is discontinuous and erratic because of command operations issued to the PTZ system. If
surveillance applications are to be usable, then the system must recover quickly and in a stable manner following such movements. Unfortunately, methods that use only recursive filtering can fail catastrophically under such conditions, and although the use of more general stochastic filtering methods, such as particle filtering, can provide increased resilience, the approach is difficult to transfer to surveillance PTZ cameras operating with zoom. In this section we show how to compute the time-variant homography Hk of equation 8.6 using a novel combined tracking approach that addresses the difficulties described previously. We adopt a SIFT-based matching approach to detect the relative location of the current image with respect to the reference image. At each time step we extract the SIFT features from the current image and match them with those extracted from the reference frame, obtaining a set of point pairs. The SIFT features extracted in the reference image can be considered as visual landmarks. Once these visual landmarks are matched to the current view, the registration errors between these points are used to drive a particle filter with a state vector that includes the parameters defining Hk. This allows stabilizing the recovered motion, characterizing the uncertainties, and reducing the area where matches are searched. Moreover, because the keypoints are detected in scale space, the scene does not necessarily have to be well textured, which is often the case for planar, human-made urban scenes.
8.5.1 Tracking Using SIFT Visual Landmarks
Let us denote with Hk the homography between the PTZ camera reference view and the frame grabbed at time step k. What we want to do is to track the parameters that define Hk using a Bayesian recursive filter. Under the assumptions we made, the homography of equation 8.6 is completely defined once the parameters ψ_k, φ_k, and f_k are known. We use this model to estimate Hk, relating the reference image plane Π with the current image at time k (see Figure 8.1). The focal length f_i of the reference image can be computed in closed form using a further image I_0 matching with the reference image I through the homography
H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}    (8.12)
This is achieved through equation 8.5, with the assumptions made in equations 8.10 and 8.11:

H K_0 K_0^T H^T = K_i K_i^T

so as to obtain three equations to compute the focal length f_0 of I_0 and the focal length f_i of the reference image I:

f_0^2 (h_{11} h_{21} + h_{12} h_{22}) = -h_{13} h_{23}    (8.13)

f_0^2 (h_{11} h_{31} + h_{12} h_{32}) = -h_{13} h_{33}    (8.14)

f_0^2 (h_{21} h_{31} + h_{22} h_{32}) = -h_{23} h_{33}    (8.15)
and two equations for f_i:

f_i^2 = \frac{f_0^2 (h_{11}^2 + h_{12}^2) + h_{13}^2}{f_0^2 (h_{31}^2 + h_{32}^2) + h_{33}^2}    (8.16)

f_i^2 = \frac{f_0^2 (h_{21}^2 + h_{22}^2) + h_{23}^2}{f_0^2 (h_{31}^2 + h_{32}^2) + h_{33}^2}    (8.17)
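The closed-form recovery of f_0 and f_i can be written compactly. The sketch below is a minimal, illustrative implementation of equations (8.13)–(8.17), not the authors' code; it assumes image coordinates are expressed with respect to the principal point and that H is a rotation homography between I_0 and the reference image I.

```python
import numpy as np

def focal_lengths_from_homography(H):
    """Recover (f0, fi) from a rotation homography via equations (8.13)-(8.17)."""
    h = np.asarray(H, dtype=float)
    # f0 from equations (8.13)-(8.15): up to three redundant estimates, averaged.
    # Degenerate rotations or noise can make a denominator vanish, so guard it.
    f0_sq = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        den = h[a, 0] * h[b, 0] + h[a, 1] * h[b, 1]
        if abs(den) > 1e-9:
            f0_sq.append(-h[a, 2] * h[b, 2] / den)
    f0 = float(np.sqrt(np.mean(f0_sq)))
    # fi from equations (8.16)-(8.17): two redundant estimates, averaged
    den = f0**2 * (h[2, 0]**2 + h[2, 1]**2) + h[2, 2]**2
    fi_sq = [(f0**2 * (h[r, 0]**2 + h[r, 1]**2) + h[r, 2]**2) / den for r in (0, 1)]
    return f0, float(np.sqrt(np.mean(fi_sq)))

# Synthetic check with a pure pan rotation between focal lengths 500 and 750
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0, -s], [0, 1, 0], [s, 0, c]])
H = np.diag([750.0, 750.0, 1.0]) @ R @ np.linalg.inv(np.diag([500.0, 500.0, 1.0]))
print(focal_lengths_from_homography(H))   # approximately (500.0, 750.0)
```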
In this computation, care must be taken to avoid pure rotations about the pan and tilt axes. Thus we adopt the state vector x_k, which defines the camera parameters at time step k:

x_k = (\psi_k, \phi_k, f_k)    (8.18)
We use a particle filter to compute estimates of the camera parameters in the state vector. Given a certain observation z_k of the state vector at time step k, particle filters build an approximate representation of the posterior pdf p(x_k|z_k) through a set of N_p weighted samples {(x_k^i, w_k^i)}_{i=1}^{N_p} (called particles), where the weights sum to 1. Each particle is thus a hypothesis on the state vector value (i.e., a homography in our framework), with an associated probability. The estimated value of the state vector is usually obtained through the weighted sum of all particles. Like any other Bayesian recursive filter, the particle filter algorithm requires a probabilistic model for the state evolution between time steps, from which a prior pdf p(x_k|x_{k−1}) is derived, and an observation model, from which the likelihood p(z_k|x_k) is derived. Basically there is no prior knowledge of the control actions that drive the camera through the world, so we adopt a simple random walk model as a state evolution model. This is equivalent to assuming the actual value of the state vector to be constant in time and relying on a stochastic noise v_{k−1} to compensate for unmodeled variations: x_k = x_{k−1} + v_{k−1}, where v_{k−1} ∼ N(0, Q) is a zero-mean Gaussian process noise, with the covariance matrix Q accounting for camera maneuvers (i.e., discontinuous and erratic motion). The way we obtain observations z_k of the actual state vector value x_k is a little more complex and deserves more explanation. Let us denote with S_0 = {s_0^j}_{j=0}^N the set of SIFT points extracted from the reference view of the PTZ camera (let us assume for the moment a single reference view), and with S_k = {s_k^j}_{j=0}^N the set of SIFT points extracted from the frame grabbed at time step k. From S_0 and S_k we can extract pairs of SIFT points that match (through their SIFT descriptors) in the two views of the PTZ camera. After removing outliers from this initial set of matches through a RANSAC algorithm, what remains can be
used as an observation for the particle filter. In fact, the set of remaining Ñ pairs, P_k = {(s_0^1, s_k^1), ..., (s_0^{Ñ}, s_k^{Ñ})}, implicitly suggests a homography between the reference view and the frame at time step k, one that maps the points s_0^1, ..., s_0^{Ñ} into s_k^1, ..., s_k^{Ñ}. Thus, there exists a triple (ψ̃_k, φ̃_k, f̃_k) that, under the above assumptions, uniquely describes this homography and
that can be used as a measure z_k of the actual state vector value. To define the likelihood p(z_k|x_k^i) of the observation z_k given the hypothesis x_k^i, we take into account the distance between the homography H_k^i corresponding to x_k^i and the homography associated with the observation z_k:

p(z_k | x_k^i) \propto \exp\!\Big(-\frac{1}{\Lambda} \sum_{j=1}^{\tilde{N}} \big\| H_k^i \cdot s_0^j - s_k^j \big\|^2 \Big)    (8.19)

where H_k^i · s_0^j is the projection of s_0^j into the image plane of frame k through the homography H_k^i, and Λ is a normalization constant.
It is worth noting that the SIFT points on frame k do not need to be computed on the whole frame. In fact, after the particle filter prediction step it is possible to reduce the area of the image plane where the SIFT points are computed down to the area where the particles are propagated. This reduces the computational load of the SIFT points computation and of the subsequent matching with the SIFT points of the reference image.
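Putting the pieces together, one iteration of the tracker can be sketched as follows. This is an illustrative outline only, not the chapter's implementation: build_homography stands for a routine such as the minimal-parameterization sketch of Section 8.4.1, the matched pairs are assumed to be already RANSAC-filtered, the value of the normalization constant Λ (lam) is arbitrary, and the resampling step is omitted.

```python
import numpy as np

def particle_filter_step(particles, weights, ref_pts, cur_pts, Q,
                         build_homography, lam=1.0e3, rng=np.random.default_rng()):
    """particles: (Np, 3) rows of [pan, tilt, focal]; ref_pts, cur_pts: (N, 2) matched SIFT points."""
    Np = len(particles)
    # 1. Prediction: random-walk state evolution x_k = x_{k-1} + v_{k-1}, v ~ N(0, Q)
    particles = particles + rng.multivariate_normal(np.zeros(3), Q, size=Np)
    # 2. Update: weight each particle with the likelihood of equation (8.19)
    ref_h = np.c_[ref_pts, np.ones(len(ref_pts))]           # homogeneous reference landmarks
    for i, (pan, tilt, f) in enumerate(particles):
        H = build_homography(pan, tilt, f)                  # hypothesis homography H_k^i
        proj = (H @ ref_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]                   # back to inhomogeneous coordinates
        err = np.sum((proj - cur_pts) ** 2)                 # registration error
        weights[i] *= np.exp(-err / lam)
    weights = weights / weights.sum()
    # 3. Estimate: weighted mean of the particles (resampling omitted for brevity)
    return particles, weights, weights @ particles
```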
8.6 EXTENSION TO WIDER AREAS
We have seen that in order to maintain a consistent estimation of the transformation Hk, the pan and tilt angles and the focal length must be estimated with respect to a reference plane. A problem arises when the slave camera is allowed to move outside the reference view. As shown in Figure 8.2(a), matching with the reference view does not allow exploiting the whole parameter space of the pan and tilt angles, or even the zoom. The current zoomed view of the slave camera can only be matched when it has an overlap with the reference image. Figure 8.2(b) shows this limitation. The rectangles represent views that can be matched to acquire zoomed close-ups of moving targets. Moreover, in regions where features are scarce, detection is limited and/or inaccurate. To overcome these difficulties and to extend the applicability of the recursive tracking described in Section 8.5.1 to wider areas, a database of the scene's feature points is built during a learning stage. The SIFT keypoints extracted to compute the planar mosaic are merged into a large KD-tree together with the estimated mosaic geometry (see Figure 8.3). The mosaic geometry in this case consists of one or more images taken so as to cover the whole field of regard¹ of the PTZ system at several different pan and tilt angles and zoom settings. This can be computed by following the basic building blocks described in equations 8.13 through 8.17. A more general and accurate solution can be obtained using a global refinement step with bundle adjustment [40, 41]. What is obtained is shown in Figure 8.3. The match for a SIFT feature extracted from the current frame is searched according to the Euclidean distance of the descriptor vectors. The search is performed so that
¹ The camera field of regard is defined as the union of all fields of view over the entire range of pan and tilt rotation angles and zoom values.
FIGURE 8.2 (a) Limitation of a single reference image I: when the current view Ik does not overlap I, it is not possible to estimate Hk. (b) The rectangles show some positions of the PTZ camera for which it is possible to estimate Hk.
FIGURE 8.3 Each landmark in the database has a set of descriptors that correspond to location features seen from different vantage points. Once the current view of the PTZ camera matches an image Il in the database, the inter-image homography Hlm transfers the current view into the reference plane Π.
bins are explored in the order of their distance from the query descriptor vector, and the search is stopped after a given number of data points has been considered [42]. Once the image Il closest to the current view Ik is found, the homography G relating Ik to Il is computed at runtime with RANSAC. The homography Hlm that relates Il with the mosaic plane
Π retrieved from the database is used to finally compute the likelihood. Equation 8.19 becomes

p(z_k | x_k^i) \propto \exp\!\Big(-\frac{1}{\Lambda} \sum_{j=1}^{\tilde{N}} \big\| H_k^i \cdot s_0^j - H_{lm} \cdot G \cdot s_k^j \big\|^2 \Big)    (8.20)
As shown in Figure 8.3, the image points of the image Il that is the nearest neighbor of the current view (i.e., of the query to the database) are projected onto Π to compute the likelihood of equation 8.20. In particular, Im in the figure is the reference image used to compute the mosaic.
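A minimal sketch of the landmark database query is given below. It is illustrative only: the chapter uses an approximate best-bin-first K-D tree search [33, 42], which is replaced here by an exact query with SciPy's cKDTree, and the class and parameter names are hypothetical. Each stored descriptor carries the index of the reference image it came from, and the reference image collecting the most good matches is returned; the homography G between the current frame and that image would then be estimated with RANSAC (for instance with cv2.findHomography).

```python
import numpy as np
from scipy.spatial import cKDTree

class LandmarkDatabase:
    def __init__(self, descriptors_per_image):
        # descriptors_per_image: list of (Ni, 128) SIFT descriptor arrays, one per reference image
        self.image_ids = np.concatenate(
            [np.full(len(d), i) for i, d in enumerate(descriptors_per_image)])
        self.tree = cKDTree(np.vstack(descriptors_per_image))

    def closest_image(self, query_descriptors, ratio=0.8):
        """Return the index of the reference image that best matches the current frame."""
        dist, idx = self.tree.query(query_descriptors, k=2)   # two nearest landmarks per query
        good = dist[:, 0] < ratio * dist[:, 1]                # Lowe's ratio test [42]
        votes = np.bincount(self.image_ids[idx[good, 0]],
                            minlength=int(self.image_ids.max()) + 1)
        return int(np.argmax(votes))
```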
8.7 THE VANISHING LINE FOR ZOOMED HEAD LOCALIZATION
The uncertainty characterization of the recursive tracking is further used to localize the target head in the active camera by taking into account both the target position sensed by the master camera and the pdf of the slave camera parameters. Assuming that subjects are approximately vertical with respect to the scene plane, the positions of feet and head can be related by a planar homology [43, 44]. This transformation can be parameterized as

W = I + (\mu - 1) \frac{v_\infty l_\infty^T}{v_\infty^T l_\infty}    (8.21)
where I is the 3×3 identity matrix, v_∞ is the vanishing point of the directions orthogonal to the scene plane where targets are moving, l_∞ is the corresponding vanishing line of the plane, and μ is the characteristic cross-ratio of the transformation. According to this, at each time step k the probability density function of the planar homology W_k can be computed once the probability density functions of the vanishing point v_{∞,k} and of the vanishing line l_{∞,k} in the active camera view at time k are known. In what follows we show how sampling from p(x_k|z_k) allows estimating p(v_{∞,k}|z_k) and p(l_{∞,k}|z_k) once the vanishing line l_∞ in the master camera is known, as shown in Figure 8.4. For each particle i in the set of weighted samples {(x_k^i, w_k^i)}_{i=1}^{N_p} modeling H_k, we calculate

l_{\infty,k}^i = T_k^{-T} \cdot l_\infty = [H_k^i \cdot H']^{-T} \cdot l_\infty    (8.22)

v_{\infty,k}^i = \omega_k^i \cdot l_{\infty,k}^i    (8.23)
where l_∞ in equation (8.22) is the vanishing line in the master camera view (see Figure 8.4) and ω_k^i in equation (8.23) is the dual image of the absolute conic [45]:

\omega_k^i = K_k^i \cdot (K_k^i)^T

The intrinsic camera parameters matrix

K_k^i = \begin{bmatrix} f_k^i & 0 & p_x \\ 0 & f_k^i & p_y \\ 0 & 0 & 1 \end{bmatrix}    (8.24)
FIGURE 8.4 Master vanishing lines are computed once in the master camera and transferred to the current view of the slave camera through Tk .
is computed with reference to the ith particle. The estimated focal length f_k^i is extracted from the state component of equation (8.18). From the samples of equations (8.22), (8.23), and (8.24), the respective pdfs are approximated as

p(l_{\infty,k} | z_k) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(l_{\infty,k} - l_{\infty,k}^i)

p(v_{\infty,k} | z_k) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(v_{\infty,k} - v_{\infty,k}^i)

p(\omega_k | z_k) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(\omega_k - \omega_k^i)
In the same way, at time k the pdf p(W_k | z_k) = \frac{1}{N} \sum_{i=1}^{N} \delta(W_k - W_k^i) is computed as

W_k^i = I + (\mu - 1) \frac{v_{\infty,k}^i (l_{\infty,k}^i)^T}{(v_{\infty,k}^i)^T l_{\infty,k}^i}    (8.25)
using equation (8.22) and equation (8.23). The cross-ratio μ, being a projective invariant, is the same in any image obtained with the active camera, while only the vanishing line l_{∞,k} and the vanishing point v_{∞,k} vary when the camera moves; thus, it must be observed only once. The cross-ratio can be evaluated accurately by selecting the target foot location a and the target head location b in one of the frames of the zooming camera (see Figure 8.5) as

\mu = \mathrm{Cross}(v, a, b, \hat{v}_{\infty,k})    (8.26)

where v is computed as the intersection of the mean vanishing line l̂_{∞,k} with the line passing from the mean vanishing point v̂_{∞,k} to a. Using the homogeneous vector representation and the cross-product operator ×, it can be estimated as v = l̂_{∞,k} × (v̂_{∞,k} × a).
FIGURE 8.5 Foot and head positions in one frame of the slave camera, used to compute the cross-ratio μ. The relationship between the entities involved in the computation of μ is shown.
FIGURE 8.6 Particles indicate uncertainty in head localization due to both active camera tracking and target tracking uncertainty.
The pdf p(M_k | z_k) of the final transformation M_k, which maps the target feet observed in the image of the master camera to the target head in the current image of the slave camera, is computed as

M_k^i = W_k^i \cdot H_k^i \cdot H'
where M_k^i represents an entire family of transformations. Given the estimated p(x_k|z_k) of the slave camera and the imaged position of the target p(a_k|z'_k) as tracked by the master camera (see Figure 8.6), the distribution of possible head locations p(b_k|z_k, z'_k) as viewed from the slave camera is estimated. We sample L homographies from p(x_k|z_k)
and the same number of particles from the set of particles tracking the feet position in the master camera view, p(a_k|z'_k), to obtain

b_k^j = M_k^j \cdot a_k^j, \qquad j = 1, \ldots, L    (8.27)
Figure 8.6 shows the particle distribution of head localization uncertainty in the slave camera computed with equation (8.27). It is worth noting that equation (8.27) jointly takes into account both zooming camera sensor error and target-tracking error.
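For illustration, the core of the head transfer described in this section can be condensed into a few lines. The sketch below is not the authors' code: it builds the planar homology of equation (8.21) from a given vanishing point, vanishing line, and cross-ratio μ and applies it to an imaged foot location; in the full pipeline this homology is composed with H_k and H′ as in the definition of M_k, and the numerical values used here are made up.

```python
import numpy as np

def planar_homology(v_inf, l_inf, mu):
    """Equation (8.21): W = I + (mu - 1) * v_inf l_inf^T / (v_inf^T l_inf)."""
    v = np.asarray(v_inf, dtype=float)
    l = np.asarray(l_inf, dtype=float)
    return np.eye(3) + (mu - 1.0) * np.outer(v, l) / (v @ l)

def foot_to_head(foot_xy, v_inf, l_inf, mu):
    """Map an imaged foot position (x, y) to the corresponding head position."""
    W = planar_homology(v_inf, l_inf, mu)
    a = np.array([foot_xy[0], foot_xy[1], 1.0])
    b = W @ a
    return b[:2] / b[2]

# Hypothetical values for a single slave view (illustrative only)
v_inf = np.array([0.02, -1.0, 0.0015])     # vanishing point of the vertical direction
l_inf = np.array([0.001, 0.004, -1.0])     # vanishing line of the ground plane
print(foot_to_head((150.0, 200.0), v_inf, l_inf, mu=0.8))
```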
8.8 EXPERIMENTAL RESULTS
For initial testing purposes, a simple algorithm has been developed to automatically track a single target. The target is localized with the wide-angle stationary camera: its motion is detected using a single-Gaussian background subtraction algorithm and tracked using a particle filter. The experiments described here were run using two PTZ cameras in a master–slave configuration. The first acted as a master camera (i.e., it was not steered during the sequence), grabbing frames of the whole area at PAL resolution. The second acted as a slave camera, grabbing frames at a resolution of 320×240 pixels. Because of the limited extension of the monitored area, in our experiments SIFT feature points were extracted from only two images, since they sufficed to populate the feature database and to compute the focal length f_i of the reference image (see Section 8.5.1). Figure 8.7 shows some frames extracted from an execution of the proposed system: the left column shows the position of the target observed with the master camera; the right column shows the frames of the slave camera view. The particles show the uncertainty in the position of the target's feet. Since the slave camera does not explicitly detect the target, similarity between background and foreground appearance does not influence the estimated localization of the target. As shown in the last frame of the sequence, even if the target is outside the current FOV of the slave camera, its imaged position is still available in image coordinates. A quantitative result for the estimated camera parameters is depicted in Figure 8.8, which shows how the state density evolves as tracking progresses. In each box are shown, clockwise from the upper left corner, the slave camera view, the pan angle pdf, the focal length pdf, and the tilt angle pdf. Time increases from left to right and top to bottom. During the sequence the slave camera is panning, tilting, and zooming to follow the target. The focal length advances from 400 to about 1000 pixels, which corresponds approximately to a 3× zoom factor. It can be seen that extending the focal length causes a significant increase in the variance of the parameter densities, which means that the estimated homography between the two cameras becomes more and more inaccurate. This is mainly caused by the fact that the number of features at high resolution that match those extracted from the reference image obviously decreases when zooming in, making the SIFT matching less effective. Figure 8.8 also shows how the focal length is much more sensitive to the measurements when the camera zooms in. This is especially evident in the last rows of the figure, where the densities of the pan and tilt remain more peaked than the focal length density. This effect is absent in the first two rows, where the camera has a wide-angle view.
FIGURE 8.7 Twelve frames (shown in columns (a) and (b)) extracted from an execution of the proposed master–slave camera system (for each of the two columns, left is the master view, right is the slave view). The particles show the uncertainty in the foot position of the target as the camera zooms in. Since the slave camera does not explicitly detect the target, background appearance similar to the foreground's does not influence the estimated localization of the target.
Figure 8.9 shows frames extracted from an execution of the proposed system on a different sequence. On the left is the position of the target observed with the master camera. On the right, the corresponding mean head position in the slave camera view is marked with a cross, computed with the time-variant homography estimated by the particle filter.
FIGURE 8.8 Frames showing probability distributions (histograms) of the slave camera parameters. Shown in each frame are (left to right and top to bottom) current view, pan, tilt, and focal length distributions. As expected, uncertainty increases with zoom factor.
FIGURE 8.9 Screenshots from one experiment. (a) Fixed camera view. (b) PTZ camera view. The line is the imaged line orthogonal to the scene plane. The crosses show estimated mean foot and head image locations.
The figure also shows the imaged line orthogonal to the scene plane as time progresses. As can be seen, increasing the zoom factor may bias the estimation of the head location. We measured this error by calculating the Euclidean distance in pixels between the estimated head position {x̂_k, ŷ_k} and the ground truth head position {x̄_k, ȳ_k}:

\epsilon_k = \sqrt{(\hat{x}_k - \bar{x}_k)^2 + (\hat{y}_k - \bar{y}_k)^2}    (8.28)
Figure 8.10(a) shows the estimated advancement in focal length. Figure 8.10(b) shows the corresponding mean error (curve) and the standard deviation of the head localization error.
FIGURE 8.10 Error evaluation in head localization. (a) Estimated focal length advancement as the slave camera zooms in. (b) Related head localization mean error (dark gray curve) and its standard deviation (light gray curves). As can be seen, the uncertainty in the head localization increases with the focal length.
FIGURE 8.11 Effects of the number of particles on head localization error. As expected, the higher the particle number, the better the filter’s performance.
The mean error increases moderately, while a bias is present for high values of the focal length. The standard deviation increases almost linearly, showing graceful performance degradation. We also investigated the effect of the number of particles on the error calculated in equation (8.28), depicted in Figure 8.11. As can be seen, the error decreases as the number of particles increases, but beyond about 1000 particles a further increase does not produce relevant improvements in filter performance. Based on this experiment, we found that 1000 particles represent a good trade-off between accuracy and computational load.
8.9 CONCLUSIONS
We proposed an uncalibrated framework for cooperative camera tracking that is able to estimate the time-variant homography between a fixed wide-angle camera view and a PTZ camera view capable of acquiring high-resolution images of the target. In particular, this method is capable of transferring a person's foot position in the master camera view to the corresponding head position in the slave camera view.
By utilizing the discriminatory power of SIFT-like features in an efficient top-down manner, we introduced new standards of robustness for real-time active zooming cameras cooperating in tracking targets at high resolution. Such robustness is critical to real-time applications such as wide-area video surveillance where human target identification at long distances is involved. Since no direct camera calibration is needed, any camera in a reconfigurable sensor network can exploit the proposed method. Indeed, camera roles can be exchanged: if all cameras in a network have an overlapping field of view, they can be set in a master–slave relationship with each other (not only in a one-to-one relationship) without any loss of generality. The proposed solution is capable of managing both transformation and target position uncertainty. Thus, camera management/control approaches using information gain (i.e., based on entropy) could be developed within the framework.

Acknowledgment. The authors would like to thank Giuseppe Lisanti for the fruitful discussions that led to the extension of this work. This research was partially supported by the IST Program of the European Commission as part of the VIDI-Video project (Contract FP6-045547) and by the FREE SURF project funded by the Italian MIUR Ministry.
REFERENCES
[1] X. Zhou, R. Collins, T. Kanade, P. Metes, A master-slave system to acquire biometric imagery of humans at a distance, in: Proceedings of the ACM SIGMM Workshop on Video Surveillance, 2003.
[2] J. Badri, C. Tilmant, J. Lavest, Q. Pham, P. Sayd, Camera-to-camera mapping for hybrid pan-tilt-zoom sensors calibration, in: Proceedings SCIA, 2007.
[3] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[4] A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, ACM Computing Surveys 38 (4) (2006) 13.
[5] J. Vermaak, S. Godsill, P. Perez, Monte Carlo filtering for multi-target tracking and data association, IEEE Transactions on Aerospace and Electronic Systems 41 (1) (2005) 309–332.
[6] L. Marchesotti, L. Marcenaro, C. Regazzoni, Dual camera system for face detection in unconstrained environments, in: Proceedings of the IEEE International Conference on Image Processing, 2003.
[7] A. Jain, D. Kopell, K. Kakligian, Y.-F. Wang, Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[8] R. Horaud, D. Knossow, M. Michaelis, Camera cooperation for achieving visual attention, Machine Vision and Applications 16 (6) (2006) 1–2.
[9] J. Kang, I. Cohen, G. Medioni, Continuous tracking within and across camera streams, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[10] J. Batista, P. Peixoto, H. Araujo, Real-time active visual surveillance by integrating peripheral motion detection with foveated tracking, in: Proceedings of the IEEE Workshop on Visual Surveillance, 1998.
[11] S.-N. Lim, D. Lis, A. Elgammal, Scalable image-based multi-camera visual surveillance system, in: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, 2003.
[12] A. Senior, A. Hampapur, M. Lu, Acquiring multi-scale images by pan-tilt-zoom control and automatic multi-camera calibration, in: Proceedings of the IEEE Workshop on Applications of Computer Vision, 2005.
[13] A. Hampapur, S. Pankanti, A. Senior, Y.-L. Tian, L. Brown, R. Bolle, Face cataloger: Multi-scale imaging for relating identity to location, in: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, 2003.
[14] S. Stillman, R. Tanawongsuwan, I. Essa, A system for tracking and recognizing multiple people with multiple cameras, Technical report GIT-GVU-98-25, Graphics, Visualization, and Usability Center, Georgia Institute of Technology, 1998.
[15] R. Collins, A. Lipton, H. Fujiyoshi, T. Kanade, Algorithms for cooperative multisensor surveillance, Proceedings of the IEEE 89 (10) (2001) 1456–1477.
[16] R. Willson, S. Shafer, What is the center of the image?, Journal of the Optical Society of America A 11 (11) (1994) 2946–2955.
[17] R. Hartley, Self-calibration from multiple views with a rotating camera, in: Proceedings of the European Conference on Computer Vision, 1994.
[18] L. de Agapito, E. Hayman, I.D. Reid, Self-calibration of rotating and zooming cameras, International Journal of Computer Vision 45 (2) (2001).
[19] S. Sinha, M. Pollefeys, Towards calibrating a pan-tilt-zoom camera network, in: P. Sturm, T. Svoboda, S. Teller (Eds.), OMNIVIS, 2004.
[20] J. Davis, X. Chen, Calibrating pan-tilt cameras in wide-area surveillance networks, in: Proceedings of the IEEE International Conference on Computer Vision (2003) 144–149.
[21] T. Svoboda, H. Hug, L. Van Gool, ViRoom – low cost synchronized multicamera system and its self-calibration, in: Pattern Recognition, 24th DAGM Symposium, 2449 in LNCS (2002) 515–522.
[22] S.N. Sinha, M. Pollefeys, L. McMillan, Camera network calibration from dynamic silhouettes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[23] X. Chen, J. Davis, P. Slusallek, Wide area camera calibration using virtual calibration objects, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2 (2000) 520–527.
[24] L. Lee, R. Romano, G. Stein, Monitoring activities from multiple video streams: Establishing a common coordinate frame, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 758–767.
[25] A. Rahimi, B. Dunagan, T. Darrell, Simultaneous calibration and tracking with a network of non-overlapping sensors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2004) 187–194.
[26] N. Krahnstoever, P.R.S. Mendonca, Bayesian autocalibration for surveillance, in: Proceedings of the IEEE International Conference on Computer Vision, 2005.
[27] F. Lv, T. Zhao, R. Nevatia, Self-calibration of a camera from video of a walking human, in: Proceedings of the IAPR International Conference on Pattern Recognition, 2002.
[28] B. Bose, E. Grimson, Ground plane rectification by tracking moving objects, in: Proceedings of the Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2003.
[29] D. Liebowitz, A. Zisserman, Combining scene and auto-calibration constraints, in: Proceedings of the IEEE International Conference on Computer Vision, 1999.
[30] D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, A. Calway, Real-time and robust monocular SLAM using predictive multi-resolution descriptors, in: International Symposium on Visual Computing, 2006.
[31] T.D. Barfoot, Online visual motion estimation using FastSLAM with SIFT features, in: Proceedings of the IEEE International Conference on Robotics and Intelligent Systems, August (2005) 3076–3082.
[32] S. Se, D.G. Lowe, J.J. Little, Vision-based mobile robot localization and mapping using scale-invariant features, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2001.
[33] J.S. Beis, D.G. Lowe, Indexing without invariants in 3D object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (10) (1999) 1000–1015.
[34] Q. Zhang, Y. Chen, Y. Zhang, Y. Xu, SIFT implementation and optimization for multi-core systems, in: Tenth Workshop on Advances in Parallel and Distributed Computational Models (in conjunction with IPDPS), 2008.
[35] M.P. Sudipta, N. Sinha, J.-M. Frahm, Y. Genc, GPU-based video feature tracking and matching, in: Workshop on Edge Computing Using New Commodity Architectures, 2006.
[36] M. Grabner, H. Bischof, Fast approximated SIFT, in: Proceedings ACCV (2006) I:918–927.
[37] G. Hua, M. Brown, S. Winder, Discriminant embedding for local image descriptors, in: Proceedings of the IEEE International Conference on Computer Vision, 2007.
[38] B. Tordoff, Active control of zoom for computer vision, D.Phil. thesis, Oxford University, 2008.
[39] L. Wang, S.B. Kang, H.-Y. Shum, G. Xu, Error analysis of pure rotation-based self-calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2) (2004) 275–280.
[40] B. Triggs, P.F. McLauchlan, R.I. Hartley, A.W. Fitzgibbon, Bundle adjustment: A modern synthesis, in: Proceedings of the ICCV International Workshop on Vision Algorithms, 2000.
[41] H.-Y. Shum, R. Szeliski, Systems and experiment paper: Construction of panoramic mosaics with global and local alignment, International Journal of Computer Vision 36 (2) (2000) 101–130.
[42] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[43] L. Van Gool, M. Proesmans, A. Zisserman, Grouping and invariants using planar homologies, in: Workshop on Geometrical Modeling and Invariants for Computer Vision, Xidian University Press, 1995.
[44] A. Criminisi, I. Reid, A. Zisserman, Single view metrology, International Journal of Computer Vision 40 (2) (2000) 123–148.
[45] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
CHAPTER 9
Multi-Modal Data Fusion Techniques and Applications
Alessio Dore, Matteo Pinasco, Carlo S. Regazzoni
Department of Biophysical and Electronic Engineering, University of Genoa, Genoa, Italy
Abstract
In recent years, camera networks have been widely employed in several application domains, such as surveillance, ambient intelligence, and video conferencing. The integration of heterogeneous sensors can provide complementary and redundant information that, when fused with visual cues, allows the system to obtain an enriched and more robust scene interpretation. We discuss possible architectures and algorithms, showing, through system examples, the benefits of combining other sensor typologies with camera network-based applications.
Keywords: Heterogeneous sensor networks, architecture design, fusion algorithms, multimedia applications
9.1 INTRODUCTION
The exploitation of camera networks for applications such as surveillance, ambient intelligence, video conferencing, and so forth, is gathering attention from industrial and academic research communities because of its relevant potential and the relatively low cost of the devices employed. In this domain, the integration of other sensor typologies is particularly interesting and promising for coping with issues and limitations emerging in the processing of visual cues (e.g., occlusions, dependency on lighting conditions, low resolution for far objects). As a matter of fact, the complementary and redundant information provided by other sensors, if appropriately integrated, can improve the robustness and performance of camera network systems. To this end, the design of these systems should take several factors into account. According to the functionalities to be provided by the system, it must be decided how logical tasks should be subdivided in the multi-sensor architecture. Related to this analysis, physical architecture requirements and limitations (sensor deployment, device cost, type of transmission links between sensors, and so forth) must be examined for an appropriate system design. Given these considerations, algorithms are to be designed so that they are not only suitable to accomplish the functions required by the application
but are also able to exploit the potential of the architectural composition and cope with possible problems arising from the application's complexity. In order to design a system that uses a large number of sensors, some elements are to be considered. One of them is to determine the type of sensors to be used, considering the specific application and the tasks that the system has to accomplish. This aspect has repercussions on the architecture and the techniques that can be used to fuse the data. Therefore, in this work a discussion of the possible architectural configurations of sensor networks that integrate cameras and other sensor typologies is presented in Section 9.2, to introduce the problems and possibilities encountered in designing systems able to fuse multi-modal cues. In accordance with the logical and physical architecture, some fusion techniques that can be used in this domain are proposed in Section 9.3. Finally, systems exploiting the integration of multi-modal sensors in camera network frameworks are presented in Section 9.4 to give a flavor of the potential and domains of applicability of these systems.
9.2 ARCHITECTURE DESIGN IN MULTI-MODAL SYSTEMS
In the literature, many different approaches can be found to designing architectures for systems using multiple and heterogeneous sensors collecting multi-modal information from the environment. The implementation aspects that can lead to a specific solution are manifold, and they mainly concern the applications and tasks to be accomplished, the characteristics of the environment to be monitored, the cost of hardware devices, the complexity of the algorithms to be implemented, and installation requirements and constraints. The issues that are to be taken into account in the design process can be categorized as follows:

Performance requirements: These are defined with respect to the functionalities provided by the system. For each of them, evaluation metrics have to objectively establish the behavior of the system for a certain system configuration. It would be appropriate to define a model representing the relationships between performance and architectural parameters.

Installation constraints: These govern the physical installation of the sensors and computation units that constitute the system. They include the possibility of wired and/or wireless connection, available bandwidth, power supply, and noise affecting the information gathered by sensors (e.g., lighting conditions, vibrations due to wind, background audio noise). Usually these aspects cannot be controlled; thus the system design is deeply conditioned by the attempt to deal with them.

Economic aspects: Each architectural solution implies the usage of specific sensors (smart instead of traditional cameras, thermal instead of infrared cameras, etc.), network configurations, and computation units whose costs can be extremely variable.

These three issues are interdependent, and usually it is necessary to establish a trade-off between requirements and constraints to design a system suitable to the application with the desired performance. In this sense the combination of heterogeneous sensors in
a camera network infrastructure can provide a larger degree of flexibility and potential in the setup. Therefore, the following sections describe logical and physical architecture design approaches along with their benefits and drawbacks in handling these issues.
9.2.1 Logical Architecture Design
The logical decomposition of data fusion tasks is a fundamental process in the design of systems aiming at combining multiple and heterogeneous cues collected by sensors. In recent years a relevant body of research has focused on formalizing logical models for multi-sensor data fusion in order to propose appropriate and general task decompositions. Among these approaches we can cite Dasarathy's model [12], which categorizes data fusion functions in terms of the types of data/information inputs and outputs (e.g., data, features, objects), and the Omnibus model [7], which is based on the Boyd control loop (i.e., observe, orient, decide, and act) describing human reasoning strategy. Though both of these models present an interesting view of the logical model of a data fusion system, in the following the Joint Directors of Laboratories (JDL) model [24] will be described in more detail. This model is suitable for this presentation because it follows a functional view of the data fusion process—which outlines the primary functions, the databases where information is stored, and the connections between levels. Under this view, it is easier to identify suitable techniques and physical architectures to accomplish the desired application. Several levels of processing are identified in the JDL model; they represent the progressive operations accomplished to provide output at increasing levels of abstraction. In this structure, an association can be outlined between the type of data or processing performed and the type of extracted information. From raw sensor data, the first inference to be defined is the existence of an entity whose basic characteristics to be detected are position and kinematic features. Then the data is used to establish the identity of the entity and, through temporal and spatial analysis, the behavior of the object can be derived. Considering the joint behavior of the entities, the ongoing situation is estimated, and then the context of the evolution of the situation is assessed. In the JDL model these tasks can be distributed into the levels of processing shown in Figure 9.1.
FIGURE 9.1 JDL data fusion process model.
Level 0—source preprocessing: Before the data fusion process can be performed, raw data is preprocessed. Examples of this procedure include signal and image processing, spatial and temporal alignment, and filtering. The data is then presented in an appropriate form to the other modules, synchronized, and coherently associated spatially. An example of the spatial alignment employed in multi-modal data fusion systems is shown in Kühnapfel et al. [37], where a calibration function is computed to map the 2D video coordinates of a stationary camera to the 1D audio angle of arrival in a linear microphone array. In Marchesotti et al. [39] the temporal and spatial alignment procedures are shown for a tracking system that combines video and radio (802.11 WLAN) power cues. (A minimal sketch of such a spatial alignment is given after the last level below.)

Level 1—object refinement: The preprocessed data is fused to perform a more reliable and accurate estimate of the entity's position, velocity, attributes, and characteristics. Tracking and pattern recognition algorithms are the techniques applied in this stage. Techniques that combine video information with data collected by other sensors (e.g., microphones, thermal cameras) are widely exploited in tracking applications to cope with the problems of occlusion (i.e., superimposition of objects of interest in the image plane) and the association of targets between multiple overlapped and non-overlapped cameras. For example, in Cucchiara et al. [11] a multi-modal sensor network is proposed where PIR (passive infrared) sensor nodes, able to detect the presence and direction of movement of people in the scene, are exploited to improve the accuracy of a multi-camera tracking system.

Level 2—situation refinement: To establish the ongoing situation, relationships between entities are considered using automated reasoning and artificial intelligence tools. Ambient intelligence usually exploits multiple heterogeneous data sources to understand user behavior in order to provide services to a user. In Marchesotti et al. [38] the authors proposed a bio-inspired method to predict intentions in a university laboratory by analyzing the position detected by a visual tracker and combining this information with parameters (login status, available network bandwidth, CPU usage, etc.) regarding the workload on PCs.

Level 3—impact assessment: In this stage a prediction of the evolution of the situation is assessed to determine potential threats and future impact. Typically, automated reasoning, artificial intelligence, predictive modeling, and statistical estimation are used to foresee possible risks, damages, and relevant events. In McCall et al. [41] an automotive application is shown where information internal (head pose) and external (lane position) to the vehicle is fused with vehicle parameters (speed, steering angle, etc.) to infer lane change intent in evaluating the impact of driver actions.

Level 4—process refinement: This processing level is necessary to improve the performance of the system by analyzing the data fusion process and optimizing resource usage. Sensor modeling, computation of performance measures, and optimization of resources by feedback-loop adaptation of the parameters of the other levels are accomplished. In Hall [23] examples of automatic parameter regulation for multi-sensor tracking systems are shown. The approach is based on modeling the reference scene used to measure the goodness of the system output in order to apply parameter adaptation strategies when performance decay is excessive.
Level 5—cognitive refinement: Since a system usually interacts with humans by means of human–computer interfaces (HCIs), cognitive aids (see [25]) are to be considered to improve the efficiency of providing information or services to a user. For example, in Rousseau et al. [50] a general framework is proposed that uses multi-modal interfaces to present information to humans.

Level 6—database management: This module is necessary to effectively handle the large amount of data used in the data fusion process. It can be a very complex task, especially in distributed sensor networks, where data management between nodes is subtle and unpredictable network delays can make information queries and transmission challenging. Therefore, many works in the literature deal with this issue (e.g., [40], where the database is (virtually) fully replicated to all nodes of a wireless sensor network, rendering database operation independent of network delays and network partitioning).
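As anticipated under Level 0, the following is a purely illustrative sketch of a spatial alignment between a fixed camera and a linear microphone array; it is not the method of [37], and the calibration pairs and the polynomial form are made-up assumptions. A low-order polynomial maps the horizontal image coordinate of a detected target to the expected audio angle of arrival, so that audio and video detections can be associated in a common reference.

```python
import numpy as np

# Hypothetical calibration pairs: image column (pixels) vs. measured angle of arrival (degrees)
columns = np.array([40.0, 160.0, 320.0, 480.0, 600.0])
doa_deg = np.array([-38.0, -17.0, 1.0, 19.0, 36.0])

coeffs = np.polyfit(columns, doa_deg, deg=2)     # fit the calibration function
column_to_angle = np.poly1d(coeffs)

# A visual detection at column 250 is expected at roughly this audio angle
print(column_to_angle(250.0))
```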
9.2.2 Physical Architecture Design
A multi-modal data fusion system can make use of a large number of sensor typologies. For example, in many works audio and video information is combined for different purposes, such as video conferencing [22, 55] and security applications [60]. Advanced surveillance systems combine heterogeneous cameras—static, pan-tilt-zoom (PTZ), infrared (IR), visible, or thermal—with environmental sensors such as passive infrared (PIR), thermal, fire, seismic, chemical, access control, and many others. Environmental sensors are also used for ambient intelligence applications [38]. In this field, video–radio fusion aiming at a better localization of users in the area of interest [14] can also be performed. Video, radar, and GPS signals can be integrated both to monitor an area [43] and for safety purposes (e.g., driver assistance) [29, 47, 57]. In the following section the physical architectures for combining the cited sensors are discussed.
Distributed versus Centralized Architectures
The architecture of a data fusion system depends on where the logical functions are performed. As described in Hall and Llinas [24], three types of centralized data fusion architecture can be distinguished, as shown in Figure 9.2. In Figure 9.2(a) raw data coming from multiple sensors is directly sent to one center, in which the fusion is performed. This architecture is suitable if the sensors are identical or produce data of the same type. The main advantage of this approach is that the data is not subject to errors introduced by the approximations of feature extraction, but at the same time it presents some problems. Indeed, besides the fact that the data coming from the sensors must be of the same type, it is difficult to associate data related to the same object but collected by different sensors. There are also some problems concerning the communication structure, due to the large amount of data to be sent to the central node. An evolution of the centralized architecture is shown in Figure 9.2(b), in which feature extraction is performed before data transmission to the fusion center. In this case some loss of information can affect the collected data, but at the same time the data flow toward the center can be considerably reduced. This kind of architecture is generally more common than the one in Figure 9.2(a). For example, in Han and Bham [26] data coming from IR and color cameras
218
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications
A
Raw data
B Fusion center
N (a)
A
Preprocessing
B
Preprocessing
Features
Fusion center
N
Preprocessing (b)
A
Processing
B
Processing
High-level data
Fusion center
N
Processing (c)
FIGURE 9.2 (a) Direct centralized architecture. (b) Feature extraction–based architecture. (c) Autonomous architecture.
9.2 Architecture Design in Multi-Modal Systems is processed for human movement detection. From each sensor features are extracted and then combined; in this case the features are the human silhouettes. Another kind of centralized architecture is the one shown in Figure 9.2(c) known as autonomous architecture, in which each sensor individually processes data that is consequently combined in the central node. In Zajdel et al. [60] a system which performs audio and video data fusion to detect anomalous events is presented. In this systems atomic events are extracted by the audio unit and video unit independently, and then sent to a central node where it is combined by means of a probabilistic method based on a dynamic Bayes network. Conversely to centralized architectures, in the distributed architecture the fusion is not performed in only one node, but is spread throughout the network.A first step toward this kind of architecture is the hierarchical approach (see Figure 9.3) in which data moves only in one direction but is combined at different levels.This kind of architecture provides an improvement in the data transmission because the data is refined step by step from the sensor level up to the central unit level. However, some problems remain: A structure so conformed is still not flexible since a failure in the central unit compromises the functioning of the whole system; also, adding or removing components causes relevant changes as well in the central unit.
A1
Raw data or high-level data
B1 Level 1 fusion center
N1
A2
Level 2 fusion center
Raw data or high-level data
B2 Level 1 fusion center
N2
FIGURE 9.3 Hierarchical architecture.
219
220
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications Node A Sensor
Fusion center
Node C Sensor
Node B Sensor
Fusion center
Fusion center Node N Sensor
Fusion center
FIGURE 9.4 Generic distributed architecture.
Foresti and Snidaro [18] present a system for outdoor video surveillance in which data is acquired from different typologies of sensors (optical, IR, radar). At a first level the data is fused to perform the tracking of objects in each zone of the monitored environment; then this information is sent to higher levels in order to obtain the objects’ global trajectories. The distributed architecture (Figure 9.4) is based on three main constraints: (1) a single central fusion node is not present; (2) there are no common communication facilities—broadcast communication is not allowed, but only node-to-node communication; (3) the sensors have no global knowledge of the network—one node sees only the neighborhood. These constraints render the system scalable, survivable to loss or addition of nodes, and modular. For these reasons distributed systems are very interesting from the point of view of both research and industry. However, this architecture has some problems. In fact, data fusion in distributed systems is a complex problem for several reasons. For example, association and decision processing are performed at a local level and local optimization often does not correspond to global optimization. The normal approach to solve this problem consists of maintaining both local and global information and periodically synchronizing the two. Another problem in distributed architectures derives from redundant information—that is, data coming from one node that can be replicated during propagation in the network. In this case, errors due to a sensor can also be replicated, affecting the fusion process in a negative way and the state or identity estimation from the object of interest. This problem is strictly connected to the network topology and can be solved by opportune configurations.
Network Topologies in Distributed Architectures Network topology plays an important role in distributed data fusion systems, and it runs into some problems such as redundancy of information and time alignment. The most conceptually simple topology is fully connected (Figure 9.5(a)), in which each node is
9.3 Fusion Techniques for Heterogeneous Sensor Networks A
B
B
C
D
E (a)
B
C
D
D
A
A
C
221
E
E (b)
(c)
FIGURE 9.5 (a) Fully connected network topology. (b) Tree or singly connected network topology. (c) Multiply connected network topology.
connected with every other node or all nodes communicate via a same communication media (e.g., bus or radio). The major problem with this kind of connection is that it is often unreliable if the number of nodes is high. A second type of topology is the one shown in Figure 9.5(b)—namely, the tree or singly connected network, in which only one path exists between two nodes. This topology, in general, is not robust; in fact the improper functioning of a node prejudices the acquisition of all information coming from the nodes connected to it. The topology shown in Figure 9.5(c) is related to the multiply connected network. In this network one node is connected in an arbitrary way to the other nodes, so any topology configuration is allowed. Generally this kind of network allows dynamic changes, ensuring the system characteristics of scalability, modularity, and survival to loss or addition of nodes. In [43] a decentralized system is described where unmanned air vehicles are equipped with GPS, an inertial sensor, a vision system, and a mm-wave radar or laser sensor. These vehicles are connected by a fully decentralized, scalable and modular architecture, and the task of the system is to perform position and velocity estimation of ground targets.
9.3 FUSION TECHNIQUES FOR HETEROGENEOUS SENSOR NETWORKS The development of techniques for fusing multi-modal cues should be tightly related to the architectural design of the system, as described in Section 9.2. Therefore, in the following sections different algorithms are proposed to be used in such a scenario and their usage in different architectural schemes are outlined.
9.3.1 Data Alignment Data alignment is usually a fundamental step to enable further fusion processing, since it aims at mapping heterogeneous collected data to a common reference system. Temporal
222
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications and spatial alignment are typical examples of the procedures accomplished when information from multiple sensors must be integrated.
Temporal Alignment In multi-modal fusion systems, the problem of maintaining temporal coherence between data coming from sensors has to be carefully handled. In fact, in such potentially complex architectures, sensors can be far from each other though they monitor the same area, the transmission links to the operating units can be heterogeneous (e.g., ADSL, optical fiber, 802.11x), and the quantity of data is variable according to the type of sensor (uncompressed video, MPEG4 video, audio, etc.). These elements imply that the transmission delay is variable and often unpredictable, so synchronization strategies are required. Time synchronization in sensor networks has been studied extensively in the literature and several protocols have been proposed (e.g., reference broadcast synchronization (RBS) [16], time stamp synchronization (TSS) [49] to handle this problem. Two categories of protocol can be identified; sender–receiver, in which one node synchronizes with another, and receiver–receiver, in which multiple nodes synchronize to a common event. Marchesotti et al. [39] presented a general framework for fusing radio (WLAN 802.11x) and video cues, focusing on the problem of spatial and temporal data alignment. To synchronize information video frames and WLAN signal power measurements, the authors considered the time required to perform the spatial alignment preprocessing on the processing units connected to the sensors. A Network Time Protocol (NTP) server installed in one supervising machine periodically synchronizes all processing unit clocks. Booting time for acquisition processes is considered negligible. Amundson et al. [2] pointed out that existing protocols work well on the devices for which they were designed but the synchronization in networks of heterogeneous devices is still an open issue. Therefore, they proposed a new approach to handle time alignment in heterogeneous sensor networks by performing synchronization between sensors (motes) and PCs and, at the same time, compensating the clock skew. The efficiency of this approach was tested on a multi-modal tracking system composed of audio and video sensors.
Spatial Alignment Many works can be found in the literature addressing spatial alignment for multiplecameras systems. In these cases projective geometry (see [27] for a more in-depth discussion) as camera calibration, homography, and epipolar geometry are the basic mathematical tools used.These approaches can also be applied in the multi-modal domain in order to spatially associate data collected by heterogeneous sensors. For example in Khnapfel et al. [33], the focus is on dynamic audio-visual sensor calibration based on a sequence of audio-visual cues. The sensor calibration is performed by evaluating the function that maps the 2D face coordinates of a stationary camera to the 1D voice angle of arrival (AOA) in a linear microphone array.The Levenberg-Marquardt algorithm is used for fitting the calibration function, which interpolates the AOA observations computed using a trained support vector machine, with the image plane face position. In Krotosky and Trivedi [36], a multi-modal image registration approach is proposed that calibrates visual and thermal cameras observing the same scene. Particular care is required in the matching phase, taking into account the large differences in visual
9.3 Fusion Techniques for Heterogeneous Sensor Networks
223
and thermal imagery. To cope with this problem the authors proposed a method where correspondence matching is performed by mutual information maximization and a disparity voting algorithm is used to resolve the registration for occluded or malformed segmentation regions. The spatial alignment of video and radio information proposed in [39] is based on the transformation of image coordinates to map coordinates by Tsai’s camera calibration method. To keep a coherent representation, radio localization is performed with respect to the common map by estimating the distance from the radio device to the base station using the trilateration method. In Funiak et al. [19] alignment (not necessarily spatial) is performed jointly with the state estimation process. In particular, a distributed filtering approach based on a dynamic Bayesian network model jointly estimates positions of moving objects and camera locations in a sensor network architecture. The basic idea is to obtain a consistent posterior distribution from the marginals computed on each node (e.g., camera) that is as close as possible to the real posterior. The distributed approach has shown to converge to the centralized solution in few rounds of communication between network nodes.
9.3.2 Multi-Modal Techniques for State Estimation and Localization The position of objects/humans in a scene is one of the basic pieces of information usually required in many applications, such as as video surveillance, video conferencing, or ambient intelligence. Techniques used for evaluation are associated with level 1 of the JDL data fusion model (see Section 9.2.1) where information on objects is inferred from available data. Because of the importance of this task, and since well-known problems (occlusions, nonstatic background, light changes, etc.) affect the robustness and performances of video-based localization methods, many techniques are proposed to integrate video cues and other information provided by other sensor typologies for object localization and tracking. Track-to-track data association—that is, the comparison of tracks from different processing nodes to combine tracks that are estimating the state of the same real-world object—has been extensively studied. The most popular solutions to this problem are the probabilistic data association filter (PDA) [6] and joint probabilistic data association (JPDA) [4]. Fuzzy logic algorithms [30] and artificial neural networks [59] have also been demonstrated as suitable tools for handling this problem. Multi-sensor state estimation is usually based on Bayesian filtering, and several algorithms based on this mathematical framework have been proposed to perform state estimation in multi-sensor systems. For example, the Interacting Multiple Models algorithm [5] and the Multiple Resolutional Filtering algorthm [31] have been successfully applied in several applications.
Bayesian Filtering Recursive Bayesian state estimation (Bayesian filtering) [15, 48] is one of the mathematical tools most commonly employed in data fusion to perform tracking tasks. In it a general
224
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications discrete-time system is described by the following equations: x k ⫽ fk (x k⫺1 , k⫺1 )
(9.1)
z k ⫽ gk (x k , k )
∈ Rn
(9.2)
∈ Rm
where x k represents the state vector at step k; z k represents observations (or measurements); k ∈ Rn is an i.i.d. random noise with known probability distribution function (called process noise); and k ∈ Rm is the observation noise. The functions fk and gk are, respectively, the process and observation models; usually they are considered time invariant. Thus, fk ⫽ f and gk ⫽ g. To estimate the posterior probability p(x k , z 1:k ), Bayesian filtering operates basically in two steps: prediction and update.The system transition model p(x k |x k⫺1 ) and the set of available observations z 1:k⫺1 ⫽ {z 1 , . . . , z k⫺1 } provide the posterior prediction as p(x k |z 1:k⫺1 ) ⫽ p(x k |x k⫺1 )p(x k⫺1 |z 1:k⫺1 )dx k⫺1
(9.3)
New observations z k at time k and the observation model supply the likelihood probability p(z k |x k ), which is used to correct the prediction by means of the update process: p(x k , z 1:k ) ⫽
p(z k |x k )p(x k |z 1:k⫺1 ) p(z k |z 1:k⫺1 )
(9.4)
It is well known that a closed-form optimal solution to this problem is achieved with the Kalman filter under the hypothesis of linearity of the state, an observation model, and the Gaussianity of the process and measurement noises. However, though these conditions occur very rarely in real-world scenarios, the Kalman filter is successfully employed in many applications where target movements are or can be reasonably considered smooth. Moreover, to cope with more general situations many algorithms are proposed in literature as giving an approximate solution. For example, the extended Kalman filter or the unscented Kalman filter can be used for nonlinear Gaussian problems. The particle filter provides a suboptimal solution to Bayesian filtering in the case of nonlinear non-Gaussian transition and observation models that make use of Monte Carlo techniques for sampling the posterior probability density function to have more samples drawn where the probability is higher (importance sampling). s The set of Ns candidate samples (i.e., particles), {x ik }N i⫽1 , representing the prediction are drawn from the so-called proposal distribution (or importance distribution) qk ⫽ (x k |x 1:k⫺1 , z 1:k ). In many applications the proposal distribution can be reasonably obtained by the transition model so that particles are drawn from p(x k |x ik⫺1 ). The values of the associated weights are obtained by means of the equation wki ⫽
p(z k |x ik )p(x ik |x ik⫺1 ) i w q(x ik |x i0:k⫺1 , z 0:k ) k⫺1
(9.5)
When the proposal distribution is given by the transition model, the weight computation i . However, this choice, whereas computationis simplified such that wki ⫽ p(z k |x ik )wk⫺1 ally efficient, generates a degeneration of performance, when used in the propagation of several particles with very low weight. To overcome this issue, a resampling procedure is required to eliminate particles with low weight and to replicate probable ones.
9.3 Fusion Techniques for Heterogeneous Sensor Networks
225
This realization of the Particle Filter algorithm is called sequential importance resampling (SIR). Though SIR is one of the most popular implementations for particle filtering, many other particle filter schemes can be found in the literature that approximate Bayesian filtering for nonlinear non-Gaussian cases. For example, the optimal choice of the importance function (i.e., the one that leads to the minimum variance of the weights) has been demonstrated [15] to be (m) (m) (m) x k |x 0:k⫺1 , z 0:k ⫽ p x k |x 0:k⫺1 , z 0:k
(9.6)
so the weight updating is (m) (m) (m) wk ⬀wk⫺1 p z k |x k⫺1
(9.7)
However the computation of p(z k |x k⫺1 ) requires integrations and is usually achievable only in particular cases (e.g., for Gaussian noise and linear observation models [3]). Odobez et al. [44] presented a particle filter tracking approach where measurements are considered first order conditionally dependent. That is, (m) p(z k |z 1:k⫺1 , x 0:k ) ⫽ p z k |z k⫺1 , x k , x k⫺1
(9.8)
in order to improve the observation model, taking into account the similarity of observations for consequent frames. With this model the proposal distribution is also dependent on the current observation and its results are equal to (m) (m) q x k |x k⫺1 , z 1:k ⫽ p x k |x k⫺1 , z k , z k⫺1 (m) (m) ⬀p z k |z k⫺1 , x k , x k⫺1 p x k |x k⫺1
(9.9)
For an overview of other possible particle filtering schemes, the interested reader is referred to Doucet et al. [15].
Multi-Modal Tracking Using Bayesian Filtering Multisensor data fusion for target tracking is a very active and highly investigated domain because of its utility in a wide range of applications (for a survey on this topic see Smith and Singh [51]). In particular, Bayesian filtering through particle filters is popular in multimodal tracking because of its capability to combine complex (nonlinear, non-Gaussian) and heterogeneous observation models. Three examples of the use of particle filtering to combine heterogeneous data (video and radio; video and audio; infrared and ultrasound) for tracking are presented in the following paragraphs. In Dore et al. [14] a video–radio fusion framework based on particle filtering is proposed to track users in an ambient intelligence system who are equipped with mobile devices (e.g., palms). This approach exploits complementary benefits provided by the two types of data. In fact, visual tracking commonly outperforms radio localization in precision, but inefficiency arises in cases of occlusion or when the scene is too vast. Conversely, radio measurements, gathered by a user’s radio device, are unambiguously associated with the respective target through the “virtual” identity (i.e., MAC/IP addresses), and they are available in extended areas. The proposed framework
226
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications thus allows localization even where/when either video or radio observations miss or are not satisfactory, and it renders the system setup more flexible and robust. The state estimated by the particle filter is given by the kinematic characteristics of the target, x ⫽ [x, y, vx , vy ], where x and y are the coordinates of target location on the map, and vx and vy are the velocity. The dynamic model of the target’s motion is second order autoregressive, as movement of the target is supposed to be fairly regular: ⎡
1 ⎢0 xk ⫽ ⎢ ⎣0 0
0 1 0 0
⎤ ⎡ 2 0 T /2 ⎢ 0 T⎥ ⎥x ⫹⎢ 0 ⎦ k⫺1 ⎣ T 1 0
T 0 1 0
⎤ 0 T 2 /2⎥ ⎥ 0 ⎦ k⫺1 T
(9.10)
where T is the time interval and k⫺1 is a Gaussian zero-mean 2D vector with diagonal covariance matrix ⌺ . The observation model of visual cues is assumed to be linear, affected by Gaussian noise:
1 zk ⫽ 0
0 1
0 0
x 0 x k ⫹ ky 0 k
(9.11)
where the measurement vector is z k ⫽ [x, y], the first matrixis the observation matrix y (.) 2 . H, and the random noise k ⫽ [kx k ] is the Gaussian k ⫽ N 0, (.) The likelihood is then obtained as:
p(z k |x ik ) ⫽ exp
⫺
z k ⫺ Hx ik|k⫺1
x2 · y2
(9.12)
When no video observation is available, the radio observations are used to define an appropriate observation model. The observation model has to associate the received signal strength (RSS) measurements to the user’s position in the environment. To do that a radio map is exploited that, through an offline training phase, models the RSS in each point of the monitored environment. The measurement vector z k is composed of the RSSs from each access point (AP): z k ⫽ RSS AP1 , RSS AP2 , . . . , RSS APN . Worth noticing is that the method is general in terms of the number of APs used to locate a user. The likelihood probability is computed as follows: 1. A predicted target position is given by Hx ik|k⫺1 ⫽ (x, y). 2. From the radio map the pdfs of the RSSs of each hth access point in the predicted position of the user is extracted. 3. The likelihood p(RSS APh |x ik ) of the observed RSS APh concerning the hth access point is given by its probability in the correspondent pdf coming from the radio map. 4. Given that each access point transmits the signal independently with respect to the others, the likelihood for the ith particle is then calculated as p(z k |x ik ) ⫽
p RSS APh |x ik
(9.13)
h
A pictorial representation of the procedure used to compute the likelihood is presented in Figure 9.6. An example of the possibility of tracking by exploiting radio signal strength measurements when no visual cues are available is provided in Figure 9.7.
9.3 Fusion Techniques for Heterogeneous Sensor Networks 0.5 AP 1
0.4
AP1 x ik
AP2
Radio map
0.5 0.4
0.5
0.3
0.4
0.2
0.3
0.1
0.2
0 270
0.3
AP3
i
p (RSS k , x k)
260
250
240 AP 1
0.1
RSS k
0.2 0 263
0.1
253
AP 2
RSSk
0 273
263
253
243
233
AP 3
RSSk
FIGURE 9.6 Computation of likelihood with radio measurements collected from three access points.
47
52
25
57
62
67(m)
30 35 40 45 50 (m) Ground truth Occluded ground truth
Filtered track Filtered occlusion
FIGURE 9.7 Example of video–radio tracking. When video measurements are not available because of an occlusion, received signal strength cues are used.
227
228
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications A particle filter framework was also used for joint video and radio localization in Miyaki et al. [42]. The radio observation model uses signal strength from WiFi access point, and is integrated in a visual particle filter tracker. In Gatica-Perez et al. [22] particle filtering was used to locate and track the position and speaking activity of multiple meeting participants.The state X t is here representative of a joint multi-object configuration (scale, position, etc.) and the speaking activity. The dynamic model p(X t |X t⫺1 ) considers both the interaction-free single-target dynamics and the interactions that can occur in the scene (e.g., occlusions). The fusion of audio and visual cues takes place in the observation model, where audio cues Z ai,t , shape st information Z sh i,t , and spatial structure Z i,t are considered: p(Z t |X t ) ⫽
st p(Z ai,t |X i,t )p(Z ai,t |X i,t )p(Z sh i,t |X i,t )p(Z i,t |X i,t )
(9.14)
i∈It
where i is the target identifier. Hightower and Borriello [28] presented a particle filter approach to track objects using infrared and ultrasound measurements for ubiquitous computing applications. The employed particle filter scheme is the SIR (see Section 9.3.2), where the likelihood model p(z|x) is given by the product of the likelihood provided by the infrared and ultrasound sensor models. In particular, the infrared sensor is modeled considering its range parameterized by a Gaussian N (, 2 ), where ⫽ 0 and 2 ⫽ 15 ft, where mean and variance are obtained from experiments. Instead, the ultrasound time-of-flight likelihood is modeled by a lookup table built from experiments characterizing the ultrasound system measurement error.
Bayesian Filtering in Distributed Architectures In the literature several works can be found dealing with the problem of Bayesian filtering in multi-sensor architectures in an efficient way and trying to exploit information available to each sensor. In distributed architectures in particular it is important to enable state estimation procedures that consider efficient and consistent communication and fusion of information at a sensor node. In Nettleton et al. [43] a distributed data fusion approach was proposed based on the information form of the Kalman filter. The decentralized node architecture is obtained by computing local estimates using local available observations and local prior information. This information is transmitted to other nodes, where differences with respect to previous information is computed. These differences are fused with the local estimate of the receiver, leading to a local estimate that is exactly the same and can be obtained by a centralized Kalman filter processing all information jointly. In Coates [10], two techniques to perform particle filtering in distributed sensor networks were proposed.The aim is to dynamically fuse information collected by sensors, limiting the exchange data and algorithmic information. The first method relies on the possibility of factorizing the likelihood probability in order to exchange low-dimensional parametric models related to the factors between nodes of the network. It is worth noting that factorization is consistent if noise affecting measurements from different sensors is or can be approximated as uncorrelated. The second algorithm proposed is applicable to more general models and uses the predictive capabilities of the particle filter
9.3 Fusion Techniques for Heterogeneous Sensor Networks
229
in each sensor node to adaptively encode the sensor data.This algorithm was also used in a hierarchical sensor network where two classes of sensors are present; one responsible for collecting information; the other, for computation and communication tasks.
9.3.3 Fusion of Multi-Modal Cues for Event Analysis Methods for integrating multi-modal cues to understand events in a monitored scene have been widely investigated. These methods, unlike the ones described in Section 9.3.2 that operate on simple state descriptions (i.e., position and velocity), infer information at a higher semantic level. As a matter of fact, they are typically carried out at level 2 of the JDL data fusion model (see Section 9.2.1), where situation analysis is performed. Many of these methods rely on techniques such as the Dempster-Shafer rule of combination [32], artificial neural networks [8], and expert systems [34]. In McCall et al. [41] data from a large number of heterogeneous sensors installed on a vehicle is analyzed together to predict driver intentions for preventive safety. In particular, information on the internal and external vehicle situations is collected by a camera network composed of PTZ cameras and omnicameras, a microphone array, radar, and CAN bus data providing the state of the vehicle components (speed, brakes status, steering wheel angle, wheel direction, etc.). In the specific approach a sparse Bayesian learning (SBL) [56] classifier is used to classify lane change intention by jointly considering road lane tracking information, driver gaze, and CAN bus data (see Figure 9.8). Given a parameters vector x(t) containing these cues collected at time t and at previous instants, SBL prunes irrelevant or redundant features to produce highly sparse representations. The class membership probabilities P(C|x(t)) (where C corresponds to lane change or lane keeping) are then modeled and the threshold for deciding driver intent is estimated on a set of real-world experiments by analyzing obtained receiver operating characteristic (ROC) curves. Lane tracking
Head motion
CAN bus data
Lane marking detection
Kalman filter
Sparse Bayesian classifier
FIGURE 9.8 Multi-modal fusion for detection of a driver’s lane change intent (adapted from [41] with permission from author).
230
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications
Multi-modal data fusion techniques for situation awareness applications are widely used in ambient intelligence. In Marchesotti et al. [38] multi-modal data was collected in a laboratory to understand user actions and thus allocate them efficiently to available workstations. The exerted sensors are part hardware device and part software routine. The external set employs ■ ■
■
A software-simulated badge reader (BR). Two video cameras with partially overlapped fields of view to monitor the room and locate users employing a blob tracker and a calibration tool to obtain the position in the map plane of the lab (TLC). Two mouse activity sensors (MOU ) and two keyboard sensors (KEY ) to allow the system to distinguish whether the resource load on a PC is due to automatic simulation or to user activity.
On the internal variables-sensing side we use ■ ■ ■ ■
Two login controllers (LOG). Two CPU computational load sensors (CPU ). Two network adapter activity monitors (LAN ). Two hard disk usage meters (HD).
These sensors collect data at a rate of 1 Hz and send it to the CPU, where it is filtered and put in the following two vectors (the numerical subscripts indicate the PC): X P (t) ⫽ {CPU1 (t), CPU2 (t), LAN1 (t), LAN2 (t) HD1 (t), HD2 (t), LOG1 (t), LOG2 (t)} X C (t) ⫽ {TLC1 (t), TLC2 (t), MOU1 (t), MOU2 (t), KEY1 (t), KEY2 (t), BR(t)}
(9.15)
(9.16)
The data fusion is performed by a self-organizing map (SOM) [35] classifier that accomplishes a spatial organization of the input feature vector, called feature mapping, with an unsupervised learning technique. The multidimensional vectors X P (t) and X C (t) are mapped to a lower dimensional M-D map (layer) (M is the map dimension), where the input vectors are clustered according to their similarities and each cluster is assigned a label. Supervised by a human operator or according to a priori information, labels are associated to an ongoing situation that belongs to a set of conditions to be identified pertaining to the specific application. After the SOM is formed in the training phase, the classification of unknown sensor data is performed by detecting which cluster the related vectors are mapped to. In this scenario the situations detected are WHF (low human work), WHL (high human work), WAF (low machine work), WAL (high machine work), ARRIVE (laboratory income), and INTRUSION (nonauthorized presence).
9.4 APPLICATIONS In the previous section techniques were presented where the integration of information collected by multiple and heterogeneous sensors is particularly useful for increasing
9.4 Applications
231
system performances and robustness. It is possible to categorize the applications where these techniques are employed in four classes: ■ ■ ■ ■
Surveillance Ambient intelligence Video conferencing Automotive
The following paragraphs offer examples of the use of multi-modal fusion systems for these purposes. Table 9.1 summarizes their characteristics.
9.4.1 Surveillance Applications Typical automatic surveillance systems assist human security personnel by detecting and tracking objects in order to recognize anomalous situations and eventually identify who or what caused them. Fusion of images from heterogeneous cameras (e.g., visual, infrared, thermal, etc.) is a widely investigated topic for video-surveillance applications. These sensors are usually employed to cope with issues typical of 24-hour surveillance in an indoor or outdoor environment, such as low luminosity and wide areas (see, for example, [45]). However other security applications can exploit these sensors. For example in Han and Bhanu [26] the human silhouette is estimated by the combination of color and thermal images. Moreover, as shown in Socolinsky et al. [52] face detection and recognition can take advantage of the fusion of these two modalities. Prati et al. [46] proposed the fusing of visual information from a camera network with information provided by a sensor network mounting PIR (passive infrared) sensors in order to detect and track human targets in an outdoor scene. The integration of PIR sensors (detecting the presence and direction of target movement) enables the reduction of false alarms by disambiguating between opening doors and moving people. PIR sensors are also used to improve tracking robustness in case of occlusions with static scene elements (e.g., columns) and target direction changes during the occlusion. In Tseng et al. [58] a wireless network of environmental sensors able to detect light, sounds, fire, and so forth, was combined with mobile sensors—robots equipped with cameras to take snapshots of detected unusual events and communicate them to a server or to other mobile sensors by a WLAN connection. This system integrates multi-modal sensor information with the robots’ motion capability, which can be guided to the location of the detected anomalous event so as to obtain more details about it. The system then can communicate with the user and receive commands by a mobile device. Aggression detection is performed in Zajdel et al. [60] by exploiting audio and video cues in order to improve the confidence of event analysis. Independent processing of the two data sources provides information on possible occurring situations (e.g., screaming, trains passing).Then a dynamic Bayesian network is used to combine the complementary information to disambiguate between normal aggression and events.
9.4.2 Ambient Intelligence Applications Multisensor data fusion is very diffused in ambient intelligence applications where a large variety of sensors are used with the aim of providing the system with context awareness
232
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications
Table 9.1
Examples of Multi-Modal Fusion Systems
Citation
Application Field
Architecture Type
Sensor Type
[45]
Surveillance
Hierarchical
Visual camera IR camera Thermal camera Omnicamera
[26]
Surveillance
Centralized
Color camera Thermal camera
[52]
Surveillance
Centralized
Color camera Thermal camera
[46]
Surveillance
Hierarchical
Visual camera Passive infrared (PIR)
[58]
Surveillance
Distributed
Environmental sensors (light, fire, sound, etc.) Mobile sensors Visual cameras
[60]
Surveillance
Centralized
Visual cameras Audio sensors
[43]
Surveillance
Distributed
GPS Inertial sensor Visual camera mm-wave radar Laser sensor
[38]
Ambient intelligence
Distributed
Visual cameras Mouse activity sensors Keyboard sensors PC login controllers CPU computational load sensors Network adapter activity monitors Hard-disk usage meters
[14]
Ambient intelligence
Hierarchical
Visual cameras Radio sensors
[54]
Ambient intelligence
Centralized
Visual cameras IR cameras Accelerometer
[22]
Video conferencing
Centralized
Visual cameras Audio sensors
[55]
Video conferencing
Hierarchical
Visual cameras Audio sensors
[17]
Automotive
Not specified
Visible cameras Far IR cameras
[53]
Automotive
Hierarchical
Visual cameras LASER scanners radar
[9]
Automotive
Hierarchical
GPS localizer CAN bus Gyroscope
9.4 Applications
233
and offering services to the users. Marchesotti et al. [38] presented an ambient intelligence system, modeled on a bio-inspired architecture, with the purpose of providing efficient workstation allocation in a university laboratory. To do that, information from the following sensors are fused by a SOM classifier (see Section 9.3.3): video cameras, mouse activity sensors, keyboard sensors, PC login controllers, CPU computational load sensors, network adapter activity monitors, and hard disk usage meters. Using contextual information regarding the state of the laboratory (i.e., the resource occupancy) and user position, the system indicates to incoming users where to login to use a workstation with lower ongoing CPU, disk, and network load. To localize and identify users in an extended environment, Dore et al. [14] proposed a video–radio localization system (see Section 9.3.2) that can be exploited by a smart space [13] in which users equipped with mobile terminals connected to the system are guided toward a selected target via multi-modal messages. An example of the use of multi-modal message is shown in Figure 9.9. The radio localization isn’t as precise as in the video but it is useful in order to make the localization more robust in case of occlusions, or to associate targets in wide areas not entirely covered by camera fields of view. In fact, radio identification is univocal since it is related to the MAC/IP address of the mobile device. Tabar et al. [54], described a system based on a wireless sensor network that detects events in a smart home application. The system is intended to automatically assist the elderly by detecting anomalous events such as falls and alerting caretakers. A subject’s posture is monitored with the help of different cameras and an accelerometer worn on users.
Go straight
2D guide
Avatar
Augmented reality
3D guide
FIGURE 9.9 Multi-modal guidance message modalities for immersive communication with a user of a smart space.
234
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications
9.4.3 Video Conferencing In video conferencing the fusion between audio and video sensors is often used to track a speaker. This application meets with some problems related to data acquired from the two types of sensors, occlusions in video analysis, and overlapping speech in audio acquisition. Gatica-Perez et al. [22], described a method for tracking position and speaking activity of multiple subjects in a meeting room. Data, acquired using a little microphone array and multiple uncalibrated cameras, is integrated through a novel observation model. The tracking problem is modeled with a Markov state space in which the hidden states represent the multiobject configuration (e.g., position and scale) and the observations are given by data acquired from the sensors. Another system for active speaker detection and tracking was proposed by Talantzis et al. [55]. In this system audio and video sensors are also used, but the cameras are calibrated. An audio module based on PF is used to estimate the location of the audio sources, while the video subsystem performs 3D tracking using a certain number of 2D head trackers and exploiting 3D to 2D image plane mapping. The association between 2D images is validated using a Kalman filter in 3D space. The fusion between audio and video information is performed by associating the audio state with each possible video state and then choosing the association with the lower Euclidean distance.
9.4.4 Automotive Applications There are two main automotive applications; pedestrian safety and driver assistance. In the former sensors installed in the car and in the infrastructure detect pedestrians and, to avoid collisions, alert the driver or activate the automatic brake. A survey of these applications is provided by Ghandi and Trivedi [21]. Some systems (e.g., [1, 20]) use only visual cameras in monocular or binocular configuration; others integrate this information with data coming from IR cameras [17] or laser scanners and radar [53]. Cheng et al. [9] described a driver assistance system that collects data from internal and external sensors applied to the car. These sensors include a GPS localizer, a CAN bus, a gyroscope, and cameras that take information about the driver and the external environment. The system is able to alert the driver to risk situations such as exceeding the speed limit or loss of attention.
9.5 CONCLUSIONS This chapter provided an overview of multi-modal fusion techniques in camera networks, relating them to the architectural potentialities, limitations, and requirements that can arise in system design. Moreover, the advantages of integrating visual cues with complementary and redundant information collected by heterogeneous sensors were described in detail. Finally, possible applications where these approaches are particularly suitable and effective were presented to outline the promising results that multi-modal fusion systems offer.
9.5 Conclusions
235
REFERENCES [1] Y. Abramson, B. Steux, Hardware-friendly pedestrian detection and impact prediction. Intelligent Vehicles Symposium, 2004. [2] I. Amundson, M. Kushwaha, B. Kusy, P. Volgyesi, G. Simon, X. Koutsoukos, A. Ledeczi, Time synchronization for multi-modal target tracking in heterogeneous sensor networks, in: Workshop on Networked Distributed Systems for Sensing and Control, 2007. [3] E. Arnaud, E. Mémin, Optimal importance sampling for tracking in image sequences: application to point tracking, in: Proceedings of Eighth European Conference on Computer Vision, 2004. [4] Y. Bar-Shalom, Extension of the probabilistic data association filter to multitarget tracking, in: Proceeding of the Fifth Symposium on Nonlinear Estimation, 1974. [5] Y. Bar-Shalom, H. Blom, The interacting multiple model algorithm for systems with markovian switching coefficients. IEEE Transactions on Automatic Control 33 (8) (1988) 780–783. [6] Y. Bar-Shalom, E. Tse, Tracking in a cluttered environment with probabilistic data association. Automatica 11 (1975) 451–460. [7] M. Bedworth, J. Obrien, The omnibus model: A new model of data fusion? IEEE Aerospace and Electronic Systems Magazine 15 (4) (2000) 30–36. [8] S. Chaudhuri, S. Das, Neural networks for data fusion, in: IEEE International Conference on Systems Engineering, 1990. [9] S.Y. Cheng, A. Doshi, M.M. Trivedi, Active heads-up display based speed compliance aid for driver assistance: A novel interface and comparative experimental studies, in: Proceedings of IEEE Intelligent Vehicles Symposium, 2007. [10] M. Coates, Distributed particle filters for sensor networks, in: Proceedings of the Third International Symposium on Information Processing in Sensor Networks, 2004. [11] R. Cucchiara, A. Prati, R. Vezzani, L. Benini, E. Farella, P. Zappi, Using a wireless sensor network to enhance video surveillance. Journal of Ubiquitous Computing and Intelligence 1 (2006) 1–11. [12] B. Dasarathy, Sensor fusion potential exploitation – innovative architectures and illustrative applications, in: Proceeding of the IEEE 85 (1) (1997) 24–38. [13] A. Dore, A. Calbi, L. Marcenaro, C.S. Regazzoni, Multimodal cognitive system for immersive user interaction, in: ACM/ICST First International Conference on Immersive Telecommunications, 2007. [14] A. Dore, A. Cattoni, C.S. Regazzoni, A particle filter-based fusion framework for video-radio tracking in smart-spaces, in: International Conference on Advanced Video and Signal based Surveillance, 2007. [15] A. Doucet, N. de Freitas, N. Gordon (Eds.), Sequential Monte Carlo Methods in Practice. Springer, New York, 2001. [16] J. Elson, L. Girod, D. Estrin, Fine-grained network time synchronization using reference broadcasts, in: Fifth Symposium on Operating Systems Design and Implementation, 2002. [17] Y. Fang, K. Yamada, Y. Ninomiya, B. Horn, I. Masaki, Comparison between infrared-imagebased and visible-image-based approaches for pedestrian detection, in: Proceedings of IEEE Intelligent Vehicles Symposium, 2003. [18] G. Foresti, L. Snidaro, A distributed sensor network for video surveillance of outdoor environments, in: International Conference on Image Processing, 2002. [19] S. Funiak, C.E. Guestrin, M. Paskin, R. Sukthankar, Distributed inference in dynamical systems, in: Advances in Neural Information Processing Systems 19, MIT Press, 2006. [20] T. Gandhi, M. Trivedi, Vehicle mounted wide FOV stereo for traffic and pedestrian detection, in: Proceedings of IEEE International Conference on Image Processing, 2005.
236
CHAPTER 9 Multi-Modal Data Fusion Techniques and Applications [21] T. Gandhi, M. Trivedi, pedestrian protection systems: Issues, survey, and challenges. IEEE Transaction on Intelligence Transportation Systems 8 (3) (2007) 413–430. [22] D. Gatica-Perez, G. Lathoud, J.-M. Odobez, I. McCowan, Audiovisual probabilistic tracking of multiple speakers in meetings. IEEE Transactions on Audio, Speech, and Language Processing 15 (2) (2007) 601–616. [23] D. Hall, Automatic parameter regulation of perceptual systems. Image and Vision Computing 24 (8) (2006) 870–881. [24] D.L. Hall, J.L. Llinas, Handbook of Multisensor Data Fusion. CRC Press, Boca Raton, 2001. [25] M.J. Hall, S.A. Hall, T. Tate, Removing the HCI bottleneck: How the human computer interface (HCI) affects the performance of data fusion systems, in: MSS National Symposium on Sensor and Data Fusion, 2000. [26] J. Han, B. Bhanu, Fusion of color and infrared video for moving human detection. Pattern Recognition 40 (6) (2007) 1771–1784. [27] R.I. Hartley, A. Zisserman, Multiple View Geometry in ComputerVision, Cambridge University Press, Cambridge, UK, 2000. [28] J. Hightower, G. Borriello, Particle filters for location estimation in ubiquitous computing: A case study, in: Proceedings of Sixth International Conference on Ubiquitous Computing, 2004. [29] U. Hofmann, A. Rieder, E. Dickmanns, Radar and vision data fusion for hybrid adaptive cruise control on highways. Machine Vision and Applications 14 (1) (2003) 42–49. [30] H. Hong, H. Chong-zhao, Z. Hong-Yan, W. Rong, Multi-target tracking based on multi-sensor information fusion with fuzzy inference. Control and Decision 19 (3) (2004) 272–276. [31] L. Hong, Multiresolutional multiple-model target tracking. IEEE Transactions on Aerospace and Electronic Systems 30 (2) (1994) 518–524. [32] J. Jiu, Y. Tan, X. Yang, A framework for target identification via multi-sensor data fusion, in: Proceedings of the 2003 IEEE International Symposium Intelligent Control, 2003. [33] T. Khnapfel, T. Tan, S. Venkatesh, E. Lehmann, Calibration of audio-video sensors for multimodal event indexing, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. [34] J. Kittler, Multi-sensor integration and decision level fusion, in: A DERA/IEE Workshop on Intelligent Sensor Processing, 2001. [35] T. Kohonen, The Self-Organizing Map, in: Proceedings of the IEEE 78 (9) (1990) 1464–1480. [36] S.J. Krotosky, M.M.Trivedi, Mutual information based registration of multimodal stereo videos for person tracking. Computer Vision and Image Understanding 106 (2–3) (2007) 270–287. [37] T. Kühnapfel, T. Tan, S. Venkatesh, E. Lehmann, Calibration of audio-video sensors for multimodal event indexing, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007. [38] L. Marchesotti, S. Piva, C.S. Regazzoni, Structured context-analysis techniques in biologically inspired ambient-intelligence systems. IEEETransactions on Systems, Man, and Cybernetics— Part A: Systems and Humans 35 (1) (2005). [39] L. Marchesotti, R. Singh, C.S. Regazzoni, Extraction of aligned video and radio information for identity and location estimation in surveillance systems, in: International Conference of Information Fusion, 2004. [40] G. Mathiason, S. Andler, S. Son, L. Selavo, Virtual full replication for wireless sensor networks, in: Proceedings of the 19th Euromicro Conference on Real-Time Systems, 2007. [41] J.C. McCall, M.M. Trivedi, B. 
Rao, Lane change intent analysis using robust operators and sparse Bayesian learning. IEEE Transactions on Intelligent Transportation Systems 8 (3) (2007) 431–440.
9.5 Conclusions
237
[42] T. Miyaki, T. Yamasaki, K. Aizawa, Tracking persons using particle filter fusing visual and wi-fi localizations for widely distributed cameras, in: IEEE International Conference on Image Processing, 2007. [43] E. Nettleton, M. Ridley, S. Sukkarieh, A. Göktogan, H.F. Durrant-Whyte, Implementation of a decentralised sensing network aboard multiple UAVs. Telecommunication Systems 26 (2–4) (2004) 253–284. [44] J.-M. Odobez, D. Gatica-Perez, S.O. Ba, Embedding motion in model-based stochastic tracking. IEEE Transactions on Image Processing 15 (11) (2006) 3514–3530. [45] G. Pieri, D. Moroni, Active video surveillance based on stereo and infrared imaging. EURASIP Journal on Advances in Signal Processing 2008 (1) (2008) 1–7. [46] A. Prati, R. Vezzani, L. Benini, E. Farella, P. Zappi, An integrated multi-modal sensor network for video surveillance, in: VSSN ’05: Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks, 2005. [47] R. Rasshofer, K. Gresser, Automotive radar and lidar systems for next generation driver assistance functions. Advances in Radio Science 3 (2005) 205–209. [48] B. Ristic, S. Arulapalam, N. Gordon, Beyond the Kalman Filter, Artech, 2004. [49] K. Romer, Time synchronization in ad hoc networks, in: ACM Symposium on Mobile Ad-Hoc Networking and Computing, 2001. [50] C. Rousseau, Y. Bellik, F. Vernier, D. Bazalgette, A framework for the intelligent multimodal presentation of information. Signal Processing 86 (12) (2006) 3696–3713. [51] D. Smith, S. Singh, Approaches to multisensor data fusion in target tracking: A survey. IEEE Transaction on Knowledge and Data Engineering 18 (12) (2006) 1696–1710. [52] D.A. Socolinsky, A. Selinger, J.D. Neuheisel, Face recognition with visible and thermal infrared imagery. Computer Vision and Image Understanding 91 (1–2) (2003) 72–114. [53] A. Steinfeld, D. Duggins, J. Gowdy, J. Kozar, R. MacLachlan, C. Mertz, A. Suppe, C. Thorpe, Development of the side component of the transit integrated collision warning system, in: Seventh International IEEE Conference on Intelligent Transportation Systems, 2004. [54] A. Tabar, A. Keshavarz, H. Aghajan, Smart home care network using sensor fusion and distributed vision-based reasoning, in: Proceedings of the Fourth ACM International Workshop on Video Surveillance and Sensor Networks, 2006. [55] F. Talantzis, A. Pnevmatikakis, A. Constantinides, Audio–visual active speaker tracking in cluttered indoors environments. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics 38 (3) (2008) 799–807. [56] M.E. Tipping, Sparse Bayesian learning and the relevance vector machine. Journal Machine Learning Research 1 (2001) 211–244. [57] M. Trivedi, T. Gandhi, J. McCall, Looking-in and looking-out of a vehicle: Computer-visionbased enhanced vehicle safety. IEEE Transactions on Intelligence Transportations Systems 8 (1) (2007) 108–120. [58] Y.-C. Tseng, Y.-C. Wang, K.-Y. Cheng, Y.-Y. Hsieh, imouse: An integrated mobile surveillance and wireless sensor system. IEEE Computer 40 (6) (2007) 60–66. [59] M. Winter, G. Favier, A neural network for data association, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999. [60] W. Zajdel, J. Krijnders, T. Andringa, D. Gavrila, Cassandra: audio-video sensor fusion for aggression detection, in: IEEE Conference on Advanced Signal and Video Surveillance, 2007.
CHAPTER
Spherical Imaging in Omnidirectional Camera Networks
10
Ivana Tošic, ´ Pascal Frossard Signal Processing Laboratory (LTS4), Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Abstract We propose in this chapter to consider the emerging framework of omnidirectional camera networks. We first describe how omnidirectional images captured by different types of mirrors or lenses can be uniquely mapped to spherical images. Spherical imaging can then be used for calibration, scene analysis, or distributed processing in omnidirectional camera networks. We then present calibration methods that are specific to omnidirectional cameras. We observe the multi-view geometry framework with an omnivision perspective by reformulating the epipolar geometry constraint for spherical projective imaging. In particular, we describe depth and disparity estimation in omnidirectional camera networks. Finally, we discuss the application of sparse approximation methods to spherical images, and we show how geometric representations can be used for distributed coding or scene understanding applications. Keywords: stereographic projection, omnivision, spherical imaging, epipolar geometry, depth estimation
10.1 INTRODUCTION Representation of three-dimensional visual content is, nowadays, mostly limited to planar projections, which give the observer only a windowed view of the world. Since the creation of the first paintings, our minds have been strongly bound to this idea. However, planar projections have important limitations for building accurate models of 3D environments, since light has a naturally radial form. The fundamental object underlying dynamic vision and providing a firm mathematical foundation is the plenoptic function (PF) [2], which simply measures the light intensity at all positions in a scene and for all directions. In static scenes, the function becomes independent of time, and it is convenient to define it as a function on the product manifold R3⫻S 2 , where S 2 is the 2D sphere (we drop the chromaticity components for the sake of simplicity). We can consider the Copyright © 2009 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-0-12-374633-7.00010-0
239
240
CHAPTER 10 Spherical Imaging in Omnidirectional Camera Networks plenoptic function as the model for a perfect vision sensor, and hence processing of visual information on the sphere becomes very interesting. Most state-of-the-art techniques [2] first map the plenoptic information to a set of Euclidean views, replacing the spherical coordinates with Euclidean coordinates: (u, v) (, ).These images are then typically processed by regular image processing algorithms. Such a way of dealing with the plenoptic function limits the scope and functionality of image processing applications. Indeed, the Euclidean approximation is only valid locally or for very small scales. However, it is one of the most important features of the plenoptic function to encompass information globally, at all scales and in all directions.Thus either one has to deal with a large number of regular images approximating the plenoptic function or one faces the problem that large discrepancies occur between the Euclidean approximation and the effective plenoptic function. Processing the visual information on the sphere is thus attractive in order to avoid the limitations due to planar projections.The radial organization of photo-receptors in the human fovea also suggests that we should reconsider the way we acquire and sample visual information, and depart from classical planar imaging with rectangular sampling. This chapter discusses the potentials and benefits of omnidirectional vision sensors, which require a 360-degree field of view (FOV), for image and 3D scene representation. Numerous efficient tools and methods have been developed in omnidirectional imaging for different purpose, including image compression, reconstruction, surveillance, computer vision, and robotics. Our focus is on omnidirectional cameras with a single point of projection, whose output can be uniquely mapped on a surface of a unit sphere. In this context, we describe image processing methods that can be applied to spherical imaging, and we provide a geometrical framework for networks of omnidirectional cameras. In particular, we discuss the calibration issue and disparity estimation problems. Finally, we show how the scene geometry can be efficiently captured by sparse signal approximations with geometrical dictionaries, and we present an application of distributed coding in camera networks.
10.2 OMNIDIRECTIONAL IMAGING Omnidirectional vision sensors are devices that can capture a 360-degree view of the surrounding scene. Various constructions of such devices exist today, each of which is characterized by different projective geometry and thus requires a specific approach to analyze the captured visual information. This section contains an overview of the main types of omnidirectional vision sensors and describes the projective geometry framework that allows a spherical image representation for a subset of existing omnidirectional vision sensors. The spherical representation of omnidirectional images, which are presented at the end of this section, permits their analysis by using the image processing tools on the sphere.
10.2.1 Cameras According to their construction, omnidirectional vision sensors can be classified into three types [3]: systems that use multiple images (i.e., image mosaics), devices that use
10.2 Omnidirectional Imaging
241
special lenses, and catadioptric devices that employ a combination of convex mirrors and lenses. The images acquired by traditional perspective cameras can be used to construct an omnidirectional image, either by rotating a single camera or by construction of a multi-camera system. Obtained images are then aligned and stitched together to form a 360-degree view. However, a rotating camera system is limited to capturing static scenes because of the long acquisition time for all images. It may also suffer from mechanical problems, which lead to maintenance issues. On the other side, multiple-camera systems can be used for real-time applications, but they suffer from difficulties in alignment and camera calibration. In this context, true omnidirectional sensors are interesting since each image provides a wide FOV of the scene of interest. Special lenses, such as fish-eye lenses, probably represent the most popular classes of systems that can capture omnidirectional images. However, such cameras present the important disadvantage that they do not have a single center of projection, which makes omnidirectional image analysis extremely complicated. Alternatively, omnidirectional images can be generated by catadioptric devices. These systems use a convex mirror placed above a perspective camera, where the optical axis of the lens is aligned with the mirror’s axis. A class of catadioptric cameras with quadric mirrors is of particular interest, since it represents omnidirectional cameras with a single center of projection. Moreover, the images obtained with such cameras can be uniquely mapped on a surface of the sphere [4]. This chapter will focus on multi-view omnidirectional geometry and image processing for catadioptric systems with quadric mirrors.
10.2.2 Projective Geometry for Catadioptric Systems Catadioptric cameras achieve an almost hemispherical field of view by combining a perspective camera with a catadioptric system [5], that is, a combination of reflective (catoptric) and refractive (dioptric) elements [4, 6]. Such a system is shown schematically in Figure 10.1(a) for a parabolic mirror. Figure 10.1(b) illustrates an omnidirectional image captured by the described parabolic catadioptric camera. Catadioptric image formation was studied in [4, 7, 8]. Central catadioptric systems are of special interest because they have a single effective viewpoint. This property is important not only for easier analysis of the captured scene but also for performing multi-view 3D reconstruction. The images captured by the perspective camera from the light reflected by the catadioptric system are not straightforward to analyze, because lines in 3D space are projected onto conic sections [4, 9, 10]. A unifying model for the projective geometry of central catadioptric cameras was established by Geyer and Daniilidis in 2001 [4]. Any central catadioptric projection is equivalent to a composition of two mappings on the sphere. The first mapping is a central spherical projection, with the center of the sphere incident to the focal point of the mirror and independent of the mirror shape. The second mapping is a projection from a point on the principal axis of the sphere to the plane perpendicular to that axis; the position of this projection point on the axis depends on the shape of the mirror. The model developed by Geyer and Daniilidis [4] includes perspective, parabolic, hyperbolic, and elliptic projections. The catadioptric projective framework permits us to derive efficient and simple scene analysis from omnidirectional images captured by catadioptric cameras.
FIGURE 10.1 (a) Omnidirectional system with a parabolic mirror: the parabolic mirror has its focus at F1; the other focus, F2, is at infinity (inspired from [4]). (b) Omnidirectional image captured by the parabolic catadioptric camera in (a).
We now describe in more detail the projective model for a parabolic mirror, where the second mapping is the projection from the north pole to the plane that contains the equator. This second mapping is known as the stereographic projection, and it is conformal. Consider a cross-section of the paraboloid in a catadioptric system with a parabolic mirror, as shown in Figure 10.2. All points on the parabola are equidistant from the focus F1 and the directrix d. Let l be the line through F1 perpendicular to the parabolic axis. If a circle is centered at F1 with a radius equal to twice the focal length of the paraboloid, then the circle and the parabola intersect at two points on the line l, and the directrix is tangent to the circle. The north pole N of the circle is the point diametrically opposite to the point where the circle touches the directrix. A scene point P is projected onto the circle from its center, which gives Π1; this is equivalent to a projective representation, where the projective space (the set of rays) is represented here as a circle. The point Π2 is the stereographic projection of Π1 onto the line l from the north pole N, where Π1 is the intersection of the ray F1P and the circle. We can thus conclude that the parabolic projection of a point P yields the point Π2, which is collinear with Π1 and N. Extending this reasoning to three dimensions, the projection by a parabolic mirror is equivalent to a projection onto the sphere (Π1) followed by a stereographic projection (Π2). A formal proof of the equivalence between the parabolic catadioptric projection and the described composite mapping through the sphere can be found in [4].
FIGURE 10.2 Cross-section of the mapping of the omnidirectional image on the sphere (inspired from [4]).
A direct corollary of this result is that the parabolic catadioptric projection is conformal, since it is a composition of two conformal mappings. For the other types of mirrors in catadioptric systems, the position of the projection point in the second mapping is a function of the eccentricity ε of the conic (see Theorem 1 established by Geyer and Daniilidis [4]). For hyperbolic mirrors with ε > 1 and elliptic mirrors with 0 < ε < 1, the projection point lies on the principal axis of the sphere, between the center of the sphere and the north pole. A perspective camera can also be considered as a degenerate case of a catadioptric system with a conic of eccentricity ε = ∞; in this case, the point of projection for the second mapping coincides with the center of the sphere.
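To make the composite mapping concrete, the following minimal Python sketch maps pixels of a calibrated parabolic catadioptric image onto the unit sphere by inverse stereographic projection. It assumes that the image center (cx, cy) and the combined focal length f are already known (calibration is discussed in Section 10.3) and that the fronto-parallel horizon has radius 2f, as discussed later in this chapter; the sign and orientation conventions are choices of this sketch rather than of [4].

```python
import numpy as np

def paracatadioptric_to_sphere(px, py, cx, cy, f):
    """Map pixels of a calibrated parabolic catadioptric image to the unit
    sphere via inverse stereographic projection (projection point at the
    north pole, image plane through the equator).

    Assumptions of this sketch: (cx, cy) is the image center and f the
    combined focal length, so the fronto-parallel horizon (the image of the
    equator) is a circle of radius 2f around (cx, cy)."""
    u = (np.asarray(px, dtype=float) - cx) / (2.0 * f)
    v = (np.asarray(py, dtype=float) - cy) / (2.0 * f)
    r2 = u ** 2 + v ** 2
    # inverse stereographic projection onto the unit sphere
    x = 2.0 * u / (r2 + 1.0)
    y = 2.0 * v / (r2 + 1.0)
    z = (r2 - 1.0) / (r2 + 1.0)
    return np.stack([x, y, z], axis=-1)

# Pixels on the mirror boundary circle of radius 2f map onto the equator (z = 0)
pts = paracatadioptric_to_sphere(np.array([520.0, 320.0]),
                                 np.array([240.0, 440.0]),
                                 cx=320.0, cy=240.0, f=100.0)
print(pts)   # both rows have z close to 0
```

Points near the image center map toward one pole and points near the mirror boundary toward the equator, so the hemispherical field of view is preserved on the sphere.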
10.2.3 Spherical Camera Model We exploit the equivalence between the catadioptric projection and the composite mapping through the sphere in order to map the omnidirectional image, through the inverse stereographic projection, onto the surface of a sphere whose center coincides with the focal point of the mirror. This leads to the definition of the spherical camera model [11], which consists of a camera center and the surface of a unit sphere centered at the camera center. The surface of the unit sphere is usually referred to as the spherical image. The spherical image is formed by a central spherical projection from a point X ∈ R³ to the unit sphere with center O ∈ R³, as shown in Figure 10.3(a). Point X is projected into a point x on the unit sphere S², where the projection is given by the following relation:

x = X / |X|    (10.1)
FIGURE 10.3 Spherical projection model: (a) projection of a point X ∈ R³ to a point x on the spherical image; (b) projection of a line l lying in a plane π in R³ to a great circle on the spherical image (inspired by [11]).
The point x on the unit sphere can be expressed in spherical coordinates:

x = (x, y, z)ᵀ = (sin θ cos φ, sin θ sin φ, cos θ)ᵀ
where θ ∈ [0, π] is the zenith angle and φ ∈ [0, 2π) is the azimuth angle. The spherical image is then represented by a function defined on S², that is, I(θ, φ) with (θ, φ) ∈ S². We now briefly discuss the consequences of mapping 3D information onto the 2D sphere. The spherical projection of a line l in R³ is a great circle on the sphere S², as illustrated in Figure 10.3(b). This great circle, denoted C, is obtained as the intersection of the unit sphere and the plane π that passes through the line l and the camera center O. Since C is completely defined by the normal vector n ∈ S² of the plane π, there is a duality between a great circle and a point on the sphere, as pointed out by Torii et al. [11]. Let us denote by S²₊ the union of the positive unit hemisphere H₊ = {(x, y, z) | z > 0} and the semicircle s₊ = {(x, y, z) | z = 0, y > 0} on the sphere—that is, S²₊ = H₊ ∪ s₊. Without loss of generality, we can consider the normal vector n of the plane π to belong to S²₊. The great circle C is then simply defined as

C = {x | nᵀx = 0, |x| = 1}    (10.2)

The transform from C, as just defined, to the point n ∈ S²₊ is then formulated as follows:

f(C) = λ (x × y) / |x × y|,    x, y ∈ S²,  λ ∈ {−1, 1}    (10.3)

where λ is selected such that n ∈ S²₊. Similarly, we can define the inverse transform, from the point n ∈ S²₊ to the great circle C, as

f⁻¹(n) = {x | nᵀx = 0, |x| = 1, n ∈ S²₊}    (10.4)
The transform defined in equation (10.3) and its inverse given by equation (10.4) represent the duality relations between great circles and points on the sphere. Finally, there exists a duality between a line l on the projective plane P² = {(x, y, z) | z = 1} and a point on S²₊, as given in [11]. This duality is represented by the transform of a line l ∈ P² to a point n ∈ S²₊:

g(l) = (ξ × η) / |ξ × η|,    ξ = (xᵀ, 1)ᵀ,  η = (yᵀ, 1)ᵀ,  x, y ∈ l    (10.5)

and its inverse transform, from a point n ∈ S²₊ to a line l:

g⁻¹(n) = {x | nᵀξ = 0, ξ = (xᵀ, 1)ᵀ, n ∈ S²₊}    (10.6)
The formulated duality relations play an important role in defining the trifocal tensor for spherical cameras [11], which is usually used to express the three-view geometry constraints.
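As a small numerical illustration of the spherical camera model and of the circle–point duality discussed above, the following Python sketch implements the central spherical projection of equation (10.1), the conversion to the zenith/azimuth coordinates (θ, φ), and the dual point of the great circle through two projected points, as in equation (10.3). The hemisphere convention used to fix the sign is the one stated above for S²₊; the point values and function names are illustrative.

```python
import numpy as np

def project_to_sphere(X):
    """Central spherical projection of 3D points (eq. 10.1): x = X / |X|."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def to_spherical(x):
    """Unit vectors -> (theta, phi), zenith in [0, pi], azimuth in [0, 2*pi)."""
    theta = np.arccos(np.clip(x[:, 2], -1.0, 1.0))
    phi = np.mod(np.arctan2(x[:, 1], x[:, 0]), 2.0 * np.pi)
    return theta, phi

def great_circle_normal(x, y):
    """Dual point of the great circle through x and y (eq. 10.3):
    n = lambda * (x x y) / |x x y|, with lambda in {-1, 1} chosen so that
    n lies in S^2_+ (z > 0, or z == 0 and y-component > 0)."""
    n = np.cross(x, y)
    n /= np.linalg.norm(n)
    if n[2] < 0 or (n[2] == 0 and n[1] < 0):
        n = -n
    return n

x = project_to_sphere([[1.0, 2.0, 3.0], [2.0, -1.0, 4.0]])
theta, phi = to_spherical(x)
print(theta, phi)
print(great_circle_normal(x[0], x[1]))   # normal of the common great circle
```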
10.2.4 Image Processing on the Sphere Because images captured by catadioptric systems can be uniquely mapped onto the sphere, it becomes interesting to process the visual information directly in the spherical coordinate system [12]. As in the Euclidean framework, harmonic analysis and multi-resolution decompositions represent efficient tools for processing data on the sphere. When mapped to spherical coordinates, the omnidirectional images are resampled on an equi-angular grid on the sphere:

G_j = {(θ_{jp}, φ_{jq}) ∈ S² : θ_{jp} = (2p + 1)π / (4B_j),  φ_{jq} = qπ / B_j}    (10.7)
where p, q ∈ N_j ≡ {n ∈ N : n < 2B_j}, for some range of bandwidths B = {B_j ∈ 2N, j ∈ Z}. These grids allow us to perfectly sample any band-limited function I ∈ L²(S²) of bandwidth B_j. The omnidirectional images can then be represented as spherical signals, modeled by elements of the Hilbert space of square-integrable functions on the 2D sphere L²(S², dμ), where dμ(θ, φ) = d cos θ dφ is the rotation-invariant Lebesgue measure on the sphere. These functions are characterized by their Fourier coefficients Î(m, n), defined through the spherical harmonics expansion:

Î(m, n) = ∫_{S²} dμ(θ, φ) Y*_{m,n}(θ, φ) I(θ, φ)

where Y*_{m,n} is the complex conjugate of the spherical harmonic of order (m, n) [13]. It can be noted here that this class of sampling grids is associated with a fast spherical Fourier transform [14]. Multi-resolution representations are particularly interesting in applications such as image analysis and image coding. The two most successful embodiments of this paradigm, the various wavelet decompositions [15] and the Laplacian pyramid (LP) [16], can be extended to spherical manifolds. The spherical continuous wavelet transform (SCWT) was introduced by Antoine and Vandergheynst [17]. It is based on affine transformations on the sphere—namely, rotations
defined by the elements of the group SO(3), and dilations D_a parameterized by the scale a ∈ R*₊ [18]. Interestingly, it can be proved that any admissible 2D wavelet in R² yields an admissible spherical wavelet by inverse stereographic projection. The action of rotations and dilations, together with an admissible wavelet ψ ∈ L²(S²), permits us to write the SCWT of a function I ∈ L²(S²) as
W_I(ρ, a) = ⟨ψ_{ρ,a} | I⟩ = ∫_{S²} dμ(θ, φ) I(θ, φ) [R_ρ D_a ψ]*(θ, φ)    (10.8)
This last expression is nothing but a spherical correlation (i.e., W_I(ρ, a) = (I ⋆ ψ_a*)(ρ)). Since the stereographic dilation is radial around the north pole N ∈ S², an axisymmetric wavelet on S² (i.e., one invariant under rotation around N) remains axisymmetric through dilation. So, if any rotation ρ ∈ SO(3) is decomposed into its Euler angles φ, θ, α ∈ S¹ (i.e., ρ = (φ, θ, α)), then R_ρ ψ_a = R_{[ρ]} ψ_a, where [ρ] = (φ, θ, 0) ∈ SO(3) is the result of the two consecutive rotations moving N to ω = (θ, φ) ∈ S². Consequently, the SCWT is redefined on S² × R*₊ by

W_I(ω, a) ≡ (I ⋆ ψ_a*)([ω]) ≡ (I ⋆ ψ_a*)(ω)    (10.9)

with a ∈ R*₊. To process images given on discrete spherical grids, the SCWT can be replaced by frames of spherical wavelets [19, 20]. One of the most appealing features of frames is their ability to expand any spherical map into a finite multi-resolution hierarchy of wavelet coefficients. The scales are discretized in a monotonic way, namely

a ∈ A = {a_j ∈ R*₊ : a_j > a_{j+1}, j ∈ Z}    (10.10)
and the positions are taken on an equi-angular grid G_j, as described previously. Another simple way of dealing with discrete spherical data in a multi-resolution fashion is to extend the Laplacian pyramid [16] to spherical coordinates. This can be done simply by exploiting recent advances in harmonic analysis on S², particularly the work of Driscoll and Healy [13]. Indeed, based on the notation introduced earlier, one can introduce a series of downsampled grids G_j with B_j = 2^{−j} B_0. A series of multi-resolution spherical images can be generated by recursively applying convolution with a low-pass filter h and downsampling. The filter h could, for example, take the form of an axisymmetric low-pass filter defined by its Fourier coefficients:

ĥ_0(m) = e^{−σ_0² m²}    (10.11)

Suppose, then, that the original data I_0 is bandlimited (i.e., Î_0(m, n) = 0, ∀m > B_0) and sampled on G_0. The bandwidth parameter σ_0 is chosen so that the filter is numerically close to a perfect half-band filter, Ĥ_0(m) = 0, ∀m > B_0/2. The low-pass–filtered data is then downsampled on the nested subgrid G_1, which gives the low-pass channel of the pyramid, I_1. The high-pass channel of the pyramid is computed as usual—that is, by first upsampling I_1 on the finer grid G_0, low-pass–filtering it with H_0, and taking the difference with I_0. Coarser resolutions are computed by iterating this algorithm on the low-pass channel I_l and scaling the filter bandwidth accordingly (i.e., σ_l = 2^l σ_0). Because of their multi-resolution and local nature, frames of spherical wavelets or the spherical Laplacian pyramid can efficiently supersede current solutions based on spherical harmonics. This is also thanks to their covariance under rigid rotations, as pointed out by Makadia and Daniilidis [21].
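The equi-angular grid of equation (10.7) and the Gaussian low-pass filter of equation (10.11) are straightforward to set up; the short Python sketch below builds a fine grid G_0, the nested coarser grid G_1 with B_1 = B_0/2, and the filter coefficients ĥ_0(m). The choice of σ_0 (so that the response drops to 10⁻³ at m = B_0/2) is a heuristic of this sketch, and the spherical harmonic transform needed to actually apply the filter (e.g., the fast transform of [14]) is omitted.

```python
import numpy as np

def equiangular_grid(B):
    """Equi-angular grid of eq. (10.7): theta_p = (2p+1)*pi/(4B), phi_q = q*pi/B,
    for p, q = 0, ..., 2B-1 (bandwidth B)."""
    p = np.arange(2 * B)
    q = np.arange(2 * B)
    theta = (2 * p + 1) * np.pi / (4 * B)
    phi = q * np.pi / B
    return np.meshgrid(theta, phi, indexing="ij")

def lowpass_coefficients(B0, sigma0=None):
    """Axisymmetric Gaussian low-pass of eq. (10.11): h_hat(m) = exp(-sigma0^2 m^2).
    If sigma0 is not given, it is chosen (heuristically) so that
    h_hat(B0 / 2) = 1e-3, i.e. the filter is close to a half-band filter."""
    if sigma0 is None:
        sigma0 = 2.0 * np.sqrt(np.log(1e3)) / B0
    m = np.arange(B0 + 1)
    return np.exp(-(sigma0 ** 2) * m ** 2)

theta0, phi0 = equiangular_grid(B=16)   # fine grid G_0
theta1, phi1 = equiangular_grid(B=8)    # nested coarser grid G_1, B_1 = B_0 / 2
h_hat = lowpass_coefficients(B0=16)
print(theta0.shape, theta1.shape, h_hat[8])   # h_hat at m = B_0/2 is ~1e-3
```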
10.3 CALIBRATION OF CATADIOPTRIC CAMERAS We have assumed up to this point that the camera parameters are perfectly known and that the cameras are calibrated. However, camera calibration is generally not given in practical systems. One usually has to estimate the intrinsic and extrinsic parameters in omnidirectional camera networks.
10.3.1 Intrinsic Parameters Intrinsic camera parameters enable mapping the pixels of the image to the corresponding light rays in space. For catadioptric cameras these parameters include the focal length of the catadioptric system (the combined focal lengths of the mirror and the camera), the image center, and the aspect ratio and skew factor of the imaging sensor. For a catadioptric camera with a parabolic mirror, the boundary of the mirror is projected to a circle in the image, which can be exploited for calibration. Namely, one can fit a circle to the image of the mirror's boundary and take the circle's center as the image center. Then, knowing the FOV of the catadioptric camera with respect to the zenith angle, the focal length can be determined by simple calculations. This calibration strategy is very advantageous as it does not require capturing and analyzing a calibration pattern; it can be performed on any image where the mirror boundary is sufficiently visible. Another approach for the calibration of catadioptric cameras uses the image projections of lines to estimate the intrinsic camera parameters [4, 22]. Whereas for perspective cameras calibration from lines is not possible without metric information, Geyer and Daniilidis [4] showed that it is possible to calibrate central catadioptric cameras from a set of line images alone. They first considered the parabolic case and assumed that the aspect ratio is one and the skew is zero, so the total number of unknown intrinsic parameters is three (the focal length and the two coordinates of the image center). In the parabolic case, a line in space projects into a circle on the image plane. Since a circle is defined by three points, each line gives a set of three constraints but also introduces two unknowns that specify the orientation of the plane containing the line. Therefore, each line contributes one additional constraint, leading to the conclusion that three lines are sufficient to perform calibration. In the hyperbolic case, there is one additional intrinsic parameter to be estimated, the eccentricity, so in total there are four unknowns. The line projects into a conic, which is defined by five points and thus gives five constraints. Therefore, for the calibration of a hyperbolic catadioptric camera, only two line images suffice. Based on this reasoning, Geyer and Daniilidis proposed an algorithm for the calibration of parabolic catadioptric cameras from line images. We briefly explain the main steps of their algorithm here, and refer the interested reader to [4, 22] for the details. The only assumption of the algorithm is that images of at least three lines are available. In the first step, the algorithm obtains points from the line images, where the number of points per line is M ≥ 3 if the aspect ratio and skew factor are known, or M ≥ 5 if they are not. In the second step, an ellipse is fitted to each set of points belonging to the same line image. This gives a unique affine transformation that transforms those ellipses, whose axes are parallel and whose aspect ratios are identical, into a set of corresponding circles. When the skew factor is equal to zero, the aspect ratio can be derived from the obtained affine transformation in closed form [22]. However, in the presence of skew, the aspect
ratio and skew factor are solutions of a polynomial equation and have to be evaluated numerically. The estimated affine transformation is applied to the line-image points, which are then fitted to circles in the third step of the algorithm. For each line i, the center c_i and the radius r_i of the corresponding circle are calculated. In the final step, the algorithm finds the image center ξ = (ξ_x, ξ_y) and the focal length f, using the knowledge that the sphere constructed on each image circle passes through the point (ξ_x, ξ_y, 2f)ᵀ. To understand this fact, we first need to define the fronto-parallel horizon, which is the projection of a plane parallel to the image plane and is a circle centered at the image center with radius 2f (see Figure 10.4). Geyer and Daniilidis [22] showed that a line image intersects the fronto-parallel horizon antipodally, i.e., at two points, both at distance 2f from the image center. Therefore, by symmetry, a sphere constructed on each image circle passes through the point (ξ_x, ξ_y, 2f)ᵀ. All three coordinates of this point are unknown, however, and one needs three such spheres to find this point as the intersection of the three spheres, as Figure 10.4 illustrates. Because of the presence of noise, this intersection most probably does not exist, so Geyer and Daniilidis [22] proposed instead finding the point in space that minimizes the distance to all of the spheres. This reduces to minimizing the following objective function:

φ(ξ, f) = Σ_{i=1}^{L} [ (ξ − c_i)ᵀ(ξ − c_i) + 4f² − r_i² ]²    (10.12)
where L is the number of obtained line images. Solving (∂/∂f) φ(ξ, f) = 0 yields the solution for f:

f_0² = (1 / 4L) Σ_{i=1}^{L} [ r_i² − (ξ − c_i)ᵀ(ξ − c_i) ]    (10.13)
FIGURE 10.4 Intersection of three spheres constructed from line images, yielding a point on the mirror’s axis a distance 2f above the image plane (inspired from [22]).
With the obtained solution for f, minimizing φ(ξ, f) over ξ yields

ξ_0 = −(1/2) A⁻¹ b    (10.14)
where A and b are given by

A = Σ_{i,j,k}^{L} (c_k − c_i)ᵀ(c_j − c_i),    b = Σ_{i,j,k}^{L} (c_iᵀc_i − r_i² − c_jᵀc_j + r_j²)(c_k − c_i)    (10.15)
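A direct, if less elegant, way to use these relations is to minimize the objective (10.12) numerically over (ξ, f) once the circles (c_i, r_i) have been fitted to the line images; the closed forms (10.13)–(10.15) of [22] avoid the iterative search. The sketch below uses SciPy's Nelder–Mead optimizer on a synthetic example whose circle radii are generated to be exactly consistent with a chosen center and focal length; all the numerical values are made up.

```python
import numpy as np
from scipy.optimize import minimize

def calibrate_from_line_images(centers, radii):
    """Estimate the image center xi and focal length f of a parabolic
    catadioptric camera from fitted line-image circles (c_i, r_i) by
    minimizing the objective of eq. (10.12)."""
    c = np.asarray(centers, dtype=float)      # shape (L, 2)
    r = np.asarray(radii, dtype=float)        # shape (L,)

    def objective(params):
        xi, f = params[:2], params[2]
        d2 = np.sum((xi - c) ** 2, axis=1)
        return np.sum((d2 + 4.0 * f ** 2 - r ** 2) ** 2)

    x0 = np.r_[c.mean(axis=0), r.mean() / 2.0]          # crude initialization
    res = minimize(objective, x0, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-12, "maxiter": 20000})
    return res.x[:2], abs(res.x[2])

# Synthetic circles consistent with center (320, 240) and f = 100
xi_true, f_true = np.array([320.0, 240.0]), 100.0
centers = xi_true + np.array([[150.0, 0.0], [0.0, 180.0], [-120.0, 90.0]])
radii = np.sqrt(np.sum((centers - xi_true) ** 2, axis=1) + 4.0 * f_true ** 2)
print(calibrate_from_line_images(centers, radii))   # ~((320, 240), 100)
```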
While we have focused on calibration using the projection of lines, camera calibration can also be performed using the projection of spheres [23], which actually offers improved robustness. Besides the calibration of paracatadioptric cameras, researchers have also investigated the calibration of cameras with different types of mirrors, such as hyperbolic and elliptical ones [24]. For example, Scaramuzza et al. [25] proposed a calibration method that uses a generalized parametric model of the single-viewpoint omnidirectional sensor and can be applied to any type of mirror in the catadioptric system. The calibration requires two or more images of a planar pattern at different orientations. Calibration of noncentral catadioptric cameras was proposed by Mičušík and Pajdla [26], based on epipolar correspondence matching from two catadioptric images. Epipolar geometry was also exploited by Kang [27] for the calibration of paracatadioptric cameras.
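The simpler strategy mentioned at the beginning of this subsection—fitting a circle to the visible mirror boundary—can also be sketched in a few lines. The circle fit below is a standard algebraic least-squares fit; the final focal-length formula assumes the stereographic image model ρ = 2f tan(θ/2) for the parabolic case and a known maximum zenith angle θ_max at the mirror boundary, which is this sketch's reading of "knowing the FOV with respect to the zenith angle", not a formula taken directly from [4] or [22].

```python
import numpy as np

def fit_circle(points):
    """Algebraic least-squares circle fit (Kasa fit): returns (center, radius)."""
    pts = np.asarray(points, dtype=float)
    A = np.c_[2.0 * pts, np.ones(len(pts))]
    b = np.sum(pts ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:2]
    radius = np.sqrt(sol[2] + center @ center)
    return center, radius

def focal_from_boundary(radius, theta_max):
    """Focal length assuming the boundary corresponds to zenith angle theta_max
    and image points follow rho = 2 f tan(theta / 2) (assumption of this sketch)."""
    return radius / (2.0 * np.tan(theta_max / 2.0))

# Synthetic boundary samples of a circle centered at (321, 239) with radius 210
t = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
boundary = np.c_[321.0 + 210.0 * np.cos(t), 239.0 + 210.0 * np.sin(t)]
center, radius = fit_circle(boundary)
print(center, radius, focal_from_boundary(radius, theta_max=np.deg2rad(105.0)))
```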
10.3.2 Extrinsic Parameters In multi-camera systems, the calibration process includes the estimation of extrinsic as well as intrinsic parameters. Extrinsic parameters comprise the relative rotation and translation between cameras, which are necessary in applications such as depth estimation and structure from motion. For noncentral catadioptric cameras, Mičušík and Pajdla [26] addressed the problem of intrinsic parameter calibration (as mentioned in Section 10.3.1) together with the estimation of extrinsic parameters. They extracted possible point correspondences from two camera views and validated them based on an approximated central camera model. Relative rotation and translation were evaluated by linear estimation from the epipolar geometry constraint, and the estimation robustness against outliers was improved by the RANSAC algorithm. Antone and Teller [28] considered the calibration of extrinsic parameters for omnidirectional camera networks, assuming the intrinsic parameters to be known. Their approach decouples the rotation and translation estimation in order to obtain a linear-time calibration algorithm. An expectation maximization (EM) algorithm recovers the rotation matrix from vanishing points, followed by position recovery using feature correspondence coupling and the Hough transform, with Monte Carlo EM refinement. The robustness of extrinsic parameter recovery can be improved by avoiding commitment to point correspondences, as presented by Makadia and Daniilidis [29]. For a general spherical image model, they introduced a correspondenceless method for camera rotation and translation recovery based on the Radon transform. They defined the epipolar
delta filter (EDF), which embeds the epipolar geometry constraint for all pairs of features as a series of Diracs on the sphere. Moreover, they defined a similarity function on all feature pairs. The main result of this work is the realization that the Radon transform is actually a correlation, on the rotation group SO(3), of the EDF and the similarity function, and as such it can be efficiently evaluated by the fast Fourier transform on SO(3). The extrinsic parameters are then found at the maximum of the five-dimensional space given by the Radon transform.
10.4 MULTI-CAMERA SYSTEMS With the development of panoramic cameras, epipolar or multi-view geometry has been formalized for general camera models, including noncentral cameras [30]. However, since single-viewpoint cameras have some advantages, such as simple image rectification in any direction, we describe in this section the epipolar geometry only for central catadioptric cameras. Because of the equivalence between catadioptric projection and composite mapping through the sphere, the epipolar geometry constraint can be formulated through the spherical camera model. We consider in particular the case of calibrated paracatadioptric cameras, where the omnidirectional image can be uniquely mapped through inverse stereographic projection to the surface of the sphere whose center coincides with the focal point of the mirror. Two- and three-view geometry for spherical cameras was introduced by Torii et al. [11], and we overview the two-view case in this section. (We refer the interested reader to [11] for the three-view geometry framework.) Epipolar geometry has very important implications in the representation and the understanding of 3D scenes. As an example, we discuss the use of geometry constraints in the estimation of disparity in networks of omnidirectional cameras.
10.4.1 Epipolar Geometry for Paracatadioptric Cameras Epipolar geometry relates the multiple images of the observed environment to the 3D structure of that environment. It represents a geometric relation between 3D points and their image projections, which enables 3D reconstruction of a scene using multiple images taken from different viewpoints. Epipolar geometry was first formulated for the pinhole camera model, leading to the well-known epipolar constraint. Consider a point p in R³, given by its spatial coordinates X, and two images of this point, given by their homogeneous coordinates x₁ = [x₁ y₁ 1]ᵀ and x₂ = [x₂ y₂ 1]ᵀ in the two camera frames. The epipolar constraint gives the geometric relationship between x₁ and x₂, as described in the following theorem [31]:

Theorem 1. Consider two images x₁ and x₂ of the same point p from two camera positions with relative pose (R, T), where R ∈ SO(3) is the relative orientation and T ∈ R³ is the relative position. Then x₁ and x₂ satisfy

⟨x₂, T × R x₁⟩ = 0    or    x₂ᵀ T̂ R x₁ = 0    (10.16)
FIGURE 10.5 Epipolar geometry for the pinhole camera model.
The matrix T̂ is obtained by representing the cross-product of T with Rx₁ as a matrix multiplication (i.e., T̂ R x₁ = T × R x₁). Given T = [t₁ t₂ t₃]ᵀ, T̂ can be expressed as

T̂ = [  0   −t₃   t₂
       t₃    0   −t₁
      −t₂   t₁    0  ]
The matrix E = T̂ R ∈ R³ˣ³ is the essential matrix. The epipolar geometry constraint is derived from the coplanarity of the vectors x₂, T, and R x₁, as shown in Figure 10.5. Epipolar geometry can also be used to describe the geometric constraints in systems with two spherical cameras. Let O₁ and O₂ be the centers of two spherical cameras, and let the world coordinate frame be placed at the center of the first camera (i.e., O₁ = (0, 0, 0)). A point p ∈ R³ is projected onto the unit spheres corresponding to the two cameras, giving projection points x₁, x₂ ∈ S², as illustrated in Figure 10.6. Let X₁ be the coordinates of the point p in the coordinate system of the camera centered at O₁. The spherical projection of the point p to the camera centered at O₁ is given as

λ₁ x₁ = X₁,    λ₁ ∈ R    (10.17)
If we further denote the transform of the coordinate system between the two spherical cameras by R and T, where R denotes the relative rotation and T the relative position, the coordinates of the point p can be expressed in the coordinate system of the camera at O₂ as X₂ = R X₁ + T. The projection of the point p to the camera centered at O₂ is then given by

λ₂ x₂ = X₂ = R X₁ + T,    λ₂ ∈ R    (10.18)
As for the pinhole camera model, the vectors x₂, R x₁, and T are coplanar, and the epipolar geometry constraint is formalized as

x₂ᵀ T̂ R x₁ = x₂ᵀ E x₁ = 0    (10.19)
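A few lines of Python suffice to check this constraint numerically for a synthetic pose; the rotation, translation, and 3D point below are made up, and the snippet is only a sanity check of equations (10.16)–(10.19), not part of any calibration or reconstruction pipeline.

```python
import numpy as np

def skew(t):
    """Matrix T_hat such that skew(t) @ v == np.cross(t, v)."""
    t1, t2, t3 = t
    return np.array([[0.0, -t3, t2],
                     [t3, 0.0, -t1],
                     [-t2, t1, 0.0]])

def sphere_project(X):
    return X / np.linalg.norm(X)

# Synthetic relative pose: rotation about z by 0.1 rad, unit translation along x
a = 0.1
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([1.0, 0.0, 0.0])
E = skew(T) @ R                           # essential matrix E = T_hat R

X1 = np.array([2.0, -1.0, 4.0])           # 3D point in the frame of camera 1
x1 = sphere_project(X1)                   # spherical image in camera 1 (eq. 10.17)
x2 = sphere_project(R @ X1 + T)           # spherical image in camera 2 (eq. 10.18)

print(x2 @ E @ x1)                        # epipolar constraint (10.19): ~0
```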
The epipolar constraint is one of the fundamental relations in multi-view geometry because it allows the estimation of the 3D coordinates of the point p from its images x₁ and x₂, given R and T; that is, it allows scene geometry reconstruction.
FIGURE 10.6 Epipolar geometry for the spherical camera model.
However, when the point p lies on the vector T that connects the camera centers O₁ and O₂, the epipolar constraint degenerates: the vectors x₂, Rx₁, and T are collinear and the coordinates of the point p cannot be determined. The intersection points of the two unit spheres with the vector T are the epipoles, denoted e₁ and e₂ in Figure 10.6. In other words, when the point p is projected onto the epipoles of the two cameras, its reconstruction from these two cameras is not possible.
10.4.2 Disparity Estimation Geometry constraints play a key role in scene representation and understanding. In particular, the problem of depth estimation relies mainly on geometrical constraints as a means of reconstructing depth information from the images of multiple cameras. Dense disparity estimation from omnidirectional images has become a component of localization, navigation, and obstacle avoidance research. When several cameras capture the same scene, the geometry of the scene can be estimated by comparing the images from the different sensors. The differences between the respective positions of 3D points in multiple 2D images represent disparities that can be estimated by stereo-matching methods. In general, the strategies that have been developed for dense disparity estimation from standard camera images are also applicable to omnidirectional images. The algorithms are generally based on the reprojection of omnidirectional images onto simpler manifolds. For example, Takiguchi et al. [32] reprojected omnidirectional images onto cylinders, while Gonzalez-Barbosa et al. [33] and Geyer et al. [34] rectified omnidirectional images on a rectangular grid. Neither cylindrical nor rectangular projections, however, are sufficient to represent the neighborhoods of and correlations among the pixels of an omnidirectional image; these are better represented by an equi-angular grid on the sphere, onto which omnidirectional images can be mapped by inverse stereographic projection, as discussed earlier. We discuss here the problem of disparity estimation in systems with two omnidirectional cameras, referring the reader to the paper by Arican and Frossard [35] for depth estimation in systems with more cameras. To perform disparity estimation directly in a spherical framework, rectification of the omnidirectional images is performed in the spherical domain. Then a global energy minimization algorithm, based on graph cuts,
can be implemented to perform dense disparity estimation on the sphere. Interestingly, disparities can be processed directly on the 2D sphere to better exploit the geometry of omnidirectional images and to improve the estimation accuracy. Rectification is an important step in stereo vision with standard camera images. It aims at reducing the stereo correspondence estimation to a one-dimensional search problem, and basically consists of an image warping computed such that the epipolar lines coincide with the scanlines. It not only eases the implementation of disparity estimation algorithms but also speeds up the computation. In the spherical framework, rectification can be performed directly on the spherical images. The following observations about epipoles on spherical images can be used here: (1) epipoles resemble the coordinate poles; (2) the epipolar great circles intersecting at the epipoles are like longitude circles. Spherical image pairs can thus undergo a rotation in the spherical domain such that the epipoles coincide with the coordinate poles. In this way, the epipolar great circles coincide with the longitudes and disparity estimation becomes a one-dimensional problem. Figure 10.7 illustrates the rectification strategy, and Figure 10.8 shows original and rectified spherical images represented as rectangular images, with latitude and longitude angles as axes, corresponding to an equi-angular grid on the sphere. Rectification permits the extension of disparity estimation algorithms developed for standard images to spherical images with fast computations. In the spherical framework, disparity can be defined as the difference in angle between the representations of the same 3D point in two different omnidirectional images. Since pixel coordinates are defined by angles, we define the disparity γ as the difference between the angles corresponding to the pixel coordinates in the two images (i.e., γ = α − β), as illustrated in Figure 10.9, which shows a 2D representation of the geometry between the cameras and the 3D point. The depth R₁ is defined as the distance between the 3D point and the reference
FIGURE 10.7 Resemblance between longitudes and epipolar great circles and the corresponding rotation.
FIGURE 10.8 (a) Original images; (b) rectified images. Epipolar great circles become straight vertical lines in the rectified images.
FIGURE 10.9 2D representation of the geometry between cameras and the 3D point.
camera center. The relation between the disparity γ, the depth R₁, and the baseline distance d is given as

γ = arcsin( d sin α / R₁ )    (10.20)
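Inverting this relation gives the depth directly from the two rectified zenith angles; the toy values below are made up, and the assignment of the angles α (second camera) and β (reference camera) follows Figure 10.9 as read here, so the snippet should be taken as an illustration of equation (10.20) rather than as a reference implementation.

```python
import numpy as np

def depth_from_disparity(alpha, beta, d):
    """Depth R1 from rectified spherical images: with disparity gamma = alpha - beta,
    eq. (10.20) gives gamma = arcsin(d * sin(alpha) / R1), hence
    R1 = d * sin(alpha) / sin(gamma)."""
    gamma = alpha - beta
    return d * np.sin(alpha) / np.sin(gamma)

# Baseline of 0.3 (same unit as the returned depth), zenith angles of 65 and 60 degrees
print(depth_from_disparity(alpha=np.deg2rad(65.0), beta=np.deg2rad(60.0), d=0.3))
```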
10.4 Multi-Camera Systems This relation holds for all epipolar great circles on rectified stereo images. The disparity estimation problem can now be cast as an energy minimization problem. Let P denote the set of all pixels on two rectified spherical images. Further, let L ⫽ {l1 , l2 , . . . , lmax } represent the finite set of discrete labels corresponding to disparity values. Recall that the disparity estimation problem is monodimensional because of the image rectification step. A single value per pixel is therefore sufficient to represent the disparity, and the disparity values for each pixel together form the disparity map. If f : P → L is a mapping, so that each pixel is assigned a disparity label, our aim is to find the optimum mapping f ∗ such that the disparity map is as accurate and smooth as possible. The computation of the optimum mapping f ∗ can be formulated as an energy minimization problem, where the energy function E( f ) is built on two components Ed and E , that respectively represent the data and smoothness functions. E( f ) ⫽ Ed ( f ) ⫹ E ( f )
(10.21)
The data function first reports the photo-consistency between the omnidirectional images. It can be written as

E_d(f) = Σ_{(p,q) ∈ P²} D(p, q)    (10.22)
where p and q are corresponding pixels in the two images under a mapping function f. D(·, ·) is a nonpositive cost function, which can be expressed as

D(p, q) = min{0, (I(p) − I(q))² − K}    (10.23)
where I(i) represents the intensity or luminance of pixel i and K is a positive constant. The intensity I(i) can be defined as done by Birchfield and Tomasi [36], which has the advantage of being insensitive to image sampling. This is useful since an equi-angular grid on the sphere causes nonuniform sampling. The smoothness function then captures variations in disparity between neighboring pixels. Its goal is to penalize labels that differ from the other labels in their neighborhood, in order to obtain a smooth disparity field. The neighborhood O is generally given by the four surrounding pixels. The smoothness function under a mapping f can, for example, be expressed as

E_s(f) = Σ_{(p,q) ∈ O} V_{p,q}(f(p), f(q))    (10.24)
The term V_{p,q} = min{|l_p − l_q|, K} is a distance metric proposed in [37], where K is a constant. It reports the difference between the labels l_p and l_q attributed to neighboring pixels p and q in the neighborhood O. The dense disparity estimation problem now consists of minimizing the energy function E(f) to obtain an accurate and smooth disparity map; such a global minimization can typically be performed efficiently by graph-cut algorithms [35, 37] or belief propagation methods. To illustrate the performance of the disparity estimation algorithm, we show in Figure 10.10 the dense disparity map computed by the graph-cut algorithm. The results correspond to the images illustrated in Figure 10.8, where a room has been captured from two different positions with a catadioptric system with parabolic mirrors.
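The energy of equations (10.21)–(10.24) is easy to evaluate for a candidate labeling, even though its global minimization requires graph cuts or belief propagation. The sketch below assumes rectified images stored as 2D arrays whose columns follow the epipolar great circles, matches pixel (t, p) in the first image to (t + label, p) in the second, and uses made-up values for the constants; it only evaluates E(f), it does not minimize it.

```python
import numpy as np

def energy(I1, I2, labels, K=0.05, Ks=2.0):
    """E(f) = E_d(f) + E_s(f) of eqs. (10.21)-(10.24) for rectified spherical
    images (rows = zenith samples, columns = epipolar great circles).
    Convention of this sketch: pixel (t, p) of I1 matches (t + labels[t, p], p) of I2."""
    H, W = I1.shape
    t = np.arange(H)[:, None] + labels                 # shifted zenith index
    p = np.broadcast_to(np.arange(W), (H, W))
    valid = (t >= 0) & (t < H)
    diff2 = np.full((H, W), K)                         # out-of-range pixels contribute 0
    diff2[valid] = (I1[valid] - I2[t[valid], p[valid]]) ** 2
    E_data = np.sum(np.minimum(0.0, diff2 - K))        # data term, eq. (10.23)
    dv = np.minimum(np.abs(np.diff(labels, axis=0)), Ks)
    dh = np.minimum(np.abs(np.diff(labels, axis=1)), Ks)
    E_smooth = np.sum(dv) + np.sum(dh)                 # truncated linear smoothness, eq. (10.24)
    return E_data + E_smooth

rng = np.random.default_rng(0)
I1 = rng.random((64, 128))
I2 = np.roll(I1, 3, axis=0)                            # second view shifted by 3 samples
good = np.full((64, 128), 3)                           # the "correct" constant disparity
print(energy(I1, I2, good), energy(I1, I2, np.zeros((64, 128), dtype=int)))
```

On this toy pair, the correct constant labeling yields a lower energy than the all-zero labeling, which is the behavior the minimization exploits.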
FIGURE 10.10 (a) Reference image. (b) Disparity images computed by the graph-cut method on the sphere [35].
10.5 SPARSE APPROXIMATIONS AND GEOMETRIC ESTIMATION This section presents a geometry-based correlation model between multi-view images that relates the image projections of 3D scene features in different views. The model assumes that these features are correlated by local transforms, such as translation, rotation, or scaling [38]. They can be represented by sparse image expansions with geometric atoms taken from a redundant dictionary of functions. This provides a flexible and efficient alternative for autonomous systems, which cannot afford manual feature selection. Moreover, we present a method for the distributed compression of multi-view images captured by an omnidirectional camera network, which exhibit the proposed geometry-based correlation. Distributed compression is very important and beneficial in camera networks because it reduces the bandwidth required for image transmission and removes the need for inter-camera communication.
10.5.1 Correlation Estimation with Sparse Approximations The correlation model between multi-view images introduced in our 2008 paper [38] relates image components that approximate the same 3D object in different views by local transforms that include translation, rotation, and anisotropic scaling. Given a redundant dictionary of atoms D = {φ_k}, k = 1, . . . , N, in a Hilbert space H, we say that an image I has a sparse representation in D if it can be approximated by a linear combination of a small number of vectors from D. Therefore, sparse approximations of two multi-view images (two images are considered here for the sake of clarity, but the correlation model can be generalized to any number of images)
can be expressed as I₁ = Φ_{Ω₁} c₁ + η₁ and I₂ = Φ_{Ω₂} c₂ + η₂, where Ω_{1,2} labels the set of atoms {φ_k}_{k∈Ω_{1,2}} participating in the sparse representation, Φ_{Ω_{1,2}} is a matrix composed of the atoms φ_k as columns, and η_{1,2} represents the approximation error. Since I₁ and I₂ capture the same 3D scene, their sparse approximations over the sets of atoms Ω₁ and Ω₂ are correlated. The geometric correlation model makes two main assumptions in order to relate the atoms in Ω₁ and Ω₂:
■
The most prominent (energetic) features in a 3D scene are, with high probability, present in the sparse approximations of both images. The features' projections in images I₁ and I₂ are represented as subsets of atoms indexed by Q₁ ⊆ Ω₁ and Q₂ ⊆ Ω₂, respectively. These atoms are correlated, possibly under some local geometric transforms. We denote by F(·) the transform of an atom between the two image decompositions that results from the viewpoint change.
Under these assumptions, the correlation between the images is modeled as a set of transforms F_i between corresponding atoms in the sets indexed by Q₁ and Q₂. The approximation of the image I₂ can be rewritten as the sum of the contributions of the transformed atoms, the remaining atoms in Ω₂, and the noise η₂:

I₂ = Σ_{i∈Q₁} c_{2,i} F_i(φ_i) + Σ_{k∈Ω₂\Q₂} c_{2,k} φ_k + η₂    (10.25)
The model in equation (10.25) is applied to atoms from the sparse decompositions of omnidirectional multi-view images mapped onto the sphere [38]. This approach is based on the use of a structured redundant dictionary of atoms that are derived from a single waveform subjected to rotation, translation, and scaling. More formally, given a generating function g defined in H (in the case of spherical images, g is defined on the 2-sphere), the dictionary D = {φ_k} = {g_γ}_{γ∈Γ} is constructed by varying the atom index γ ∈ Γ that defines the rotation, translation, and scaling parameters applied to the generating function g. This is equivalent to applying a unitary operator U(γ) to the generating function g—that is, g_γ = U(γ)g. As an example, Gaussian atoms on the sphere are illustrated in Figure 10.11 for different translation (θ, φ), rotation (ψ), and anisotropic scaling (α, β) parameters.
FIGURE 10.11 Gaussian atoms: (a) at the north pole (θ = 0, φ = 0), with ψ = 0, α = 2, β = 4; (b) θ = π/4, φ = π/4, ψ = π/8, α = 2, β = 4; (c) θ = π/4, φ = π/4, ψ = π/8, α = 1, β = 8.
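An atom of this kind can be generated on the equi-angular grid with a few lines of numpy: the sketch below rotates the grid so that the atom center goes to the north pole, applies the in-plane rotation ψ, and evaluates an anisotropic Gaussian in stereographic coordinates around the pole. This is an illustrative construction using the same parameters (θ, φ, ψ, α, β), not the exact generating function or normalization used in [38].

```python
import numpy as np

def gaussian_atom(B, theta0, phi0, psi, alpha, beta):
    """Anisotropic Gaussian atom on the equi-angular grid of eq. (10.7),
    centered at (theta0, phi0), rotated by psi, with scales (alpha, beta)."""
    theta = (2 * np.arange(2 * B) + 1) * np.pi / (4 * B)
    phi = np.arange(2 * B) * np.pi / B
    th, ph = np.meshgrid(theta, phi, indexing="ij")
    X = np.stack([np.sin(th) * np.cos(ph),
                  np.sin(th) * np.sin(ph),
                  np.cos(th)], axis=-1)                 # unit vectors of the grid

    def Rz(a):
        return np.array([[np.cos(a), -np.sin(a), 0.0],
                         [np.sin(a), np.cos(a), 0.0],
                         [0.0, 0.0, 1.0]])

    def Ry(a):
        return np.array([[np.cos(a), 0.0, np.sin(a)],
                         [0.0, 1.0, 0.0],
                         [-np.sin(a), 0.0, np.cos(a)]])

    # bring the atom center to the north pole, then apply the in-plane rotation psi
    R = Rz(-psi) @ Ry(-theta0) @ Rz(-phi0)
    Xr = X @ R.T
    den = np.maximum(1.0 + Xr[..., 2], 1e-12)           # avoid the antipodal singularity
    u = 2.0 * Xr[..., 0] / den                          # stereographic coordinates
    v = 2.0 * Xr[..., 1] / den
    g = np.exp(-(alpha ** 2) * u ** 2 - (beta ** 2) * v ** 2)
    return g / np.linalg.norm(g)

atom = gaussian_atom(B=32, theta0=np.pi / 4, phi0=np.pi / 4,
                     psi=np.pi / 8, alpha=2.0, beta=4.0)  # parameters similar to Figure 10.11(b)
print(atom.shape)
```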
The main property of this structured dictionary is that it is transform invariant: the transformation of an atom by a combination of translation, rotation, and anisotropic scaling yields another atom in the same dictionary. Let {g_γ}_{γ∈Γ} and {h_γ}_{γ∈Γ} respectively denote the sets of functions used for the expansions of images I₁ and I₂. When the transform-invariant dictionary is used for both images, the transform of an atom g_{γi} in image I₁ to an atom h_{γj} in image I₂ reduces to a transform of its parameters—that is, h_{γj} = F(g_{γi}) = U(γ′)g_{γi} = U(γ′ ∘ γ_i)g. Because of the geometric constraints that exist between multi-view images, only a subset of all local transforms between {g_γ} and {h_γ} is feasible. This subset can be defined by identifying two constraints—epipolar and shape similarity—between corresponding atoms. Given an atom g_{γi} in image I₁, these two constraints give the subset of possible parameters Γ_i ⊆ Γ of the correlated atom h_{γj}. Pairs of atoms that correspond to the same 3D points have to satisfy the epipolar constraints, which represent one of the fundamental relations in multi-view analysis. Two corresponding atoms are said to match when their epipolar atom distance d_EA(g_{γi}, h_{γj}) is smaller than a certain threshold k (for more details on this distance, see [38]). The set of possible candidate atoms in I₂ that respect the epipolar constraints with the atom g_{γi} in I₁, called the epipolar candidate set, is then defined by the set of indices Γ_i^E ⊂ Γ, with

Γ_i^E = {γ_j | h_{γj} = U(γ′)g_{γi},  d_EA(g_{γi}, h_{γj}) < k}    (10.26)
The shape similarity constraint assumes that a change in viewpoint on a 3D object results in a limited difference between the shapes of corresponding atoms, since they represent the same object in the scene. Among the atom parameters γ, the last three parameters (ψ, α, β) describe the atom shape (its rotation and scaling) and are thus the ones taken into account for the shape similarity constraint. We measure the similarity, or coherence, of atoms by the inner product μ(i, j) = |⟨g_{γi}, h_{γj}⟩| between centered atoms (i.e., placed at the same position (θ, φ)), and we impose a minimal coherence between candidate atoms—that is, μ(i, j) > s. This defines a set of possible transforms V_i ⊆ V_i⁰ with respect to the atom shape as

V_i = {γ′ | h_{γj} = U(γ′)g_{γi},  μ(i, j) > s}    (10.27)
Equivalently, the set of atoms h_{γj} in I₂ that are possible transformed versions of the atom g_{γi} is called the shape candidate set. It is defined by the set of atom indices Γ_i^S ⊂ Γ, with

Γ_i^S = {γ_j | h_{γj} = U(γ′)g_{γi},  γ′ ∈ V_i}    (10.28)

Finally, we combine the epipolar and shape similarity constraints to define the set of possible parameters of the transformed atom in I₂ as Γ_i = Γ_i^E ∩ Γ_i^S.
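In code, the construction of Γ_i amounts to two filtering passes over the candidate atoms. The sketch below only captures this structure: the epipolar atom distance d_EA and the coherence measure of [38] are replaced by caller-supplied functions (toy stand-ins are used in the example), and the thresholds k and s are arbitrary.

```python
import numpy as np

def candidate_set(atom_i, atoms_j, epipolar_distance, shape_coherence, k=0.1, s=0.8):
    """Intersection of the epipolar candidate set (eq. 10.26) and the shape
    candidate set (eq. 10.28) for one reference atom. `epipolar_distance` and
    `shape_coherence` stand in for d_EA(., .) and for the coherence of centered
    atoms; both are supplied by the caller."""
    epipolar = {j for j, a in enumerate(atoms_j)
                if epipolar_distance(atom_i, a) < k}      # Gamma_i^E
    shape = {j for j, a in enumerate(atoms_j)
             if shape_coherence(atom_i, a) > s}           # Gamma_i^S
    return epipolar & shape                               # Gamma_i

# Toy stand-ins: an "atom" is a (position unit vector, shape descriptor) pair
def toy_epipolar_distance(a, b):
    return float(abs(a[0] @ b[0] - 1.0))                  # small when positions agree

def toy_shape_coherence(a, b):
    return float(a[1] @ b[1] / (np.linalg.norm(a[1]) * np.linalg.norm(b[1])))

atom_i = (np.array([0.0, 0.0, 1.0]), np.array([1.0, 2.0, 0.5]))
atoms_j = [(np.array([0.0, 0.05, 0.999]), np.array([1.1, 1.9, 0.4])),
           (np.array([1.0, 0.0, 0.0]), np.array([0.1, 0.1, 3.0]))]
print(candidate_set(atom_i, atoms_j, toy_epipolar_distance, toy_shape_coherence))  # {0}
```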
10.5.2 Distributed Coding of 3D Scenes Interpreting and compressing the data acquired by a camera network represents a challenging task, as the amount of data is typically huge. Moreover, in many cases the communication among cameras is limited because of bandwidth constraints or the introduced time delay. Development of distributed processing and compression techniques in
the multi-camera setup thus becomes necessary in a variety of applications. Distributed coding of multi-view images captured by camera networks has recently attracted great interest among researchers. Typically, distributed coding algorithms are based on disparity estimation between views under epipolar constraints [39, 40], in order to capture the correlation between the different images of a scene. Alternatively, the geometrical correlation model described in the previous section can be used in the design of a distributed-coding method with side information. This model is particularly interesting in scenarios with multi-view omnidirectional images mapped to spherical images, as described in the previous sections. Based on the geometric correlation model, we can build a Wyner-Ziv (WZ) coding scheme [41] for multi-view omnidirectional images. The coding scheme illustrated in Figure 10.12 is mostly based on the algorithm we proposed in [38], with the shape similarity metric described earlier. The scheme entails coding with side information, where image I₁ is independently encoded, and the WZ image I₂ is encoded by coset coding of the atom indices and quantization of their coefficients. The approach is based on the observation that when an atom h_{γj} in the WZ image I₂ has a corresponding atom g_{γi} in the reference image I₁, then γ_j belongs to the subset Γ_i = Γ_i^E ∩ Γ_i^S. Since Γ_i is usually much smaller than Γ, the WZ encoder does not need to send the whole index γ_j but only the information necessary to identify the correct atom within the transform candidate set given by Γ_i. This is achieved by coset coding—that is, by partitioning Γ into distinct cosets that contain dissimilar atoms with respect to their position (θ, φ) and shape (ψ, α, β). Two types of cosets are constructed: position and shape cosets. The encoder eventually sends only the indices of the corresponding cosets for each atom (i.e., k_n and l_n in Figure 10.12). The position cosets are designed as VQ cosets [38], which are constructed by two-dimensional interleaved uniform quantization of the atom positions (θ, φ) on a rectangular lattice. The shape cosets are designed by distributing all atoms whose parameters belong to Γ_i into different cosets. The decoder matches corresponding atoms in the reference image and atoms within the cosets of the Wyner-Ziv image decomposition using the correlation model described earlier. This atom pairing is facilitated by the use of the quantized coefficients of the atoms, which are sent directly. Each identified atom pair carries information about the local transform between the reference and the WZ image, which is exploited by the decoder to update the transform field between them. The transformation of the reference image with respect to the transform field provides an approximation of the WZ image that is used as side information for decoding the atoms without a correspondence in the reference image. These atoms are decoded based on the minimal mean square error between the currently decoded image and the side information. Finally, the WZ image reconstruction Î₂ is obtained as a linear combination of the decoded image I_d, reconstructed from the decoded atoms in Φ_{Ω₂}, and the projection of the transformed reference image I_tr onto the orthogonal complement of Φ_{Ω₂} [38]. The decoding procedure does not guarantee that all atoms are correctly decoded. In a general multi-view system, occlusions are quite probable and clearly impair the atom-pairing process at the decoder.
The atoms that approximate these occlusions cannot be decoded based on the proposed correlation model, since they do not have a corresponding feature in the reference image. An occlusion-resilient coding block (the shaded area in Figure 10.12) can be built by performing channel coding on all a_n and b_n,
FIGURE 10.12 Occlusion-resilient Wyner-Ziv coder.
n = 1, . . . , N, together, where a_n, b_n, k_n, and l_n uniquely determine the index γ_n. The encoder thus sends a unique syndrome for all atoms, which is then used by the decoder to correct erroneously decoded atom parameters. Finally, we briefly discuss the performance of the distributed coding scheme with the geometrical correlation model (a deeper analysis is given in [38]). The results are presented for the synthetic room image set, which consists of two 128×128 spherical images I₁ and I₂. The sparse image decomposition is obtained using the matching pursuit (MP) algorithm on the sphere with the dictionary used in [38]. It is based on two generating functions: a 2D Gaussian function, and a 2D function built from a Gaussian and the second derivative of a 2D Gaussian in the orthogonal direction (i.e., edge-like atoms). The position parameters θ and φ can take 128 different values, while the rotation parameter ψ uses 16 orientations. The scales are distributed logarithmically with 3 scales per octave. This parameterization of the dictionary enables the use of fast computation of the correlation on SO(3) for the full atom search within the MP algorithm. In particular, we used the SpharmonicKit library,² which is part of the YAW toolbox.³ The image I₁ is encoded independently at 0.23 bpp with a PSNR of 30.95 dB. The atom parameters for the expansion of image I₂ are coded with the proposed scheme. The coefficients are obtained by projecting the image I₂ on the atoms selected by MP, to improve the atom matching process, and they are quantized uniformly. We see that the performance of the WZ scheme, with and without the occlusion-resilient block, is very competitive with joint-encoding strategies. The rate-distortion (RD)
² www.cs.dartmouth.edu/~geelong/sphere/
³ http://fyma.fyma.ucl.ac.be/projects/yawtb/
FIGURE 10.13 (a), (b) Original room images (128×128). (c) Rate-distortion performance (PSNR [dB] versus rate [bpp]) for image I₂, comparing the Wyner–Ziv MP, joint encoding, and occlusion-resilient WZ schemes.
curve for the occlusion-resilient WZ scheme using low-density parity check (LDPC) coding is given in Figure 10.13. We can clearly see that the occlusion-resilient coding corrects the saturation behavior of the previous scheme and performs very close to the joint encoding scheme. The geometrical model based on sparse approximations of omnidirectional images is therefore able to capture efficiently the correlation between images. This proves to be very beneficial for the distributed representation of 3D scenes.
10.6 CONCLUSIONS This chapter presented a spherical camera model that can be used for processing visual information directly in its natural radial form. Because images from catadioptric camera systems can be mapped directly on the sphere, we presented image processing methods
that are well adapted to the processing of omnidirectional images. We then discussed the framework of multi-camera systems from a spherical imaging perspective. In particular, the epipolar geometry constraints and the disparity estimation problem were discussed for the case of images lying on the 2-sphere. Finally, we showed how sparse approximations of spherical images can be used to build geometrical correlation models, which lead to efficient distributed algorithms for the representation of 3D scenes.

Acknowledgments. This work was partly funded by the Swiss National Science Foundation, under grant No. 200020_120063. The authors would like to thank Zafer Arican for providing the disparity estimation results.
REFERENCES [1] E.H. Adelson, J.R. Bergen, The plenoptic function and the elements of early vision, in: Computational Models of Visual Processing. MIT Press, 1991. [2] C. Zhang, T. Chen, A survey on image-based rendering—representation, sampling and compression, Signal Processing: Image Communications, 19 (2004) 1–28. [3] Y. Yagi, Omnidirectional sensing and its applications, in: IEICE Transactions on Information and Systems, E82D (3) (1999) 568–579. [4] C. Geyer, K. Daniilidis, Catadioptric projective geometry, International Journal of Computer Vision, 45 (3) (2001) 223–243. [5] S. Nayar, Catadioptric omnidirectional camera, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997. [6] E. Hecht, A. Zajac, Optics. Addison-Wesley, 1997. [7] S.K. Nayar, Omnidirectional vision, in: Proceedings of the International Symposium of Robotics Research, 1997. [8] S. Baker, S.K. Nayar, A theory of single-viewpoint catadioptric image formation, International Journal of Computer Vision, 35 (2) (1999) 1–22. [9] T. Svoboda, T. Pajdla, V. Hlaváˇc, Epipolar geometry for panoramic cameras, in: Proceedings of the European Conference on Computer Vision, 1998. [10] S. Nene, S.K. Nayar, Stereo with mirrors, in: Proceedings of the International Conference on Computer Vision, 1998. [11] A. Torii, A. Imiya, N. Ohnishi, Two- and three-view geometry for spherical cameras, in: Proceedings of the Sixth Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras, 2005. [12] P. Schröder, W. Sweldens, Spherical wavelets: efficiently representing functions on the sphere, in: Proceedings of ACM SIGGRAPH, 1995. [13] J.R. Driscoll, D.M. Healy, Computing Fourier transform and convolutions on the 2-sphere, Advances in Applied Mathematics, 15 (2) (1994) 202–250. [14] D. Healy Jr., D. Rockmore, P. Kostelec, S. Moore, FFts for the 2-sphere—improvements and variations, Journal of Fourier Analysis and Applications, 9 (4) (2003) 341–385. [15] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1998 (rev ed. 2009). [16] P.J. Burt, E.H. Adelson, The Laplacian pyramid as a compact image code, IEEE Transactions on Communications, COM-31 (4) (1983) 532–540. [17] J. Antoine, P. Vandergheynst, Wavelets on the n-sphere and related manifolds, Journal of Mathematical Physics, 39 (8) (1998) 3987–4008.
[18] ———, Wavelets on the 2-sphere: a group theoretical approach, Applied and Computational Harmonic Analysis, 7 (3) (1999) 262–291. [19] I. Bogdanova, P.Vandergheynst, J.Antoine, L. Jacques, M. Morvidone, Discrete wavelet frames on the sphere, in: Proceedings of the European Signal Processing Conference, 2004. [20] ———, Stereographic wavelet frames on the sphere, Applied and Computational Harmonic Analysis, 19 (2) (2005) 223–252. [21] A. Makadia, K. Daniilidis, Direct 3D-rotation estimation from spherical images via a generalized shift theorem, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. [22] C. Geyer, K. Daniilidis, Paracatadioptric camera calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (5) (2002) 687–695. [23] X. Ying, Z. Hu, Catadioptric camera calibration using geometric invariants, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26 (10) (2004) 1260–1271. [24] J. Barreto, H. Araujo, Geometric properties of central catadioptric line images and their application in calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (8) (2005) 1327–1333. [25] D. Scaramuzza, A. Martinelli, R. Siegwart, A flexible technique for accurate omnidirectional camera calibration and structure from motion, in: Proceedings of the IEEE International Conference on Computer Vision Systems, 2006. [26] B. Miˇcušík, T. Pajdla, Autocalibration & 3D reconstruction with non-central catadioptric cameras, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. [27] S.B. Kang, Catadioptric self-calibration, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2000. [28] M. Antone, S. Teller, Scalable extrinsic calibration of omni-directional image networks, International Journal of Computer Vision, 49 (2–3) (2002) 143–174. [29] A. Makadia, C. Geyer, K. Daniilidis, Correspondenceless structure from motion, International Journal of Computer Vision, 75 (3) (2007) 311–327. [30] P. Sturm, Multi-view geometry for general camera models, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2005). [31] Y. Ma, S. Soatto, J. Košeckà, S.S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models. Springer, 2004. [32] J.Takiguchi, M.Yoshida, A.Takeya, J. Eino,T. Hashizume, High precision range estimation from an omnidirectional stereo system, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, (1) 2002. [33] J. Gonzalez-Barbosa, S. Lacroix, Fast dense panoramic stereovision, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2005. [34] C. Geyer, K. Daniilidis, Conformal rectification of omnidirectional stereo pairs, in:Proceedings of the Workshop on Omnidirectional Vision and Camera Networks, 2003. [35] Z. Arican, P. Frossard, Dense disparity estimation from omnidirectional images, in: Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2007. [36] S. Birchfield, C. Tomasi, A pixel dissimilarity measure that is insensitive to image sampling, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20 (4) (1998) 401–406. [37] V. Kolmogorov, R. Zabih, Multi-camera scene reconstruction via graph cuts, in: Proceedings of the European Conference on Computer Vision, 2002. [38] I. Toši´c, P. 
Frossard, Geometry-based distributed scene representation with omnidirectional vision sensors, IEEE Transactions on Image Processing, 17 (7) (2008) 1033–1046.
[39] X. Zhu, A. Aaron, B. Girod, Distributed compression for large camera arrays, in: Proceedings of the IEEE Workshop on Statistical Signal Processing, 2003. [40] N. Gehrig, P.L. Dragotti, Distributed compression of multi-view images using a geometrical coding approach, in: Proceedings of the IEEE International Conference on Image Processing, 6 (2007). [41] A.D. Wyner, J. Ziv, The rate-distortion function for source coding with side-information at the decoder, IEEE Transactions on Information Theory, 22 (1) (1976) 1–10.
CHAPTER 11
Video Compression for Camera Networks: A Distributed Approach
Marco Dalai, Riccardo Leonardi
Department of Electronics for Automation, University of Brescia, Brescia, Italy
Abstract
The problem of finding efficient communication techniques to distribute multi-view video content across different devices and users in a network has received much attention in the last few years. One such approach is devoted to so-called distributed video coding (DVC). After briefly reviewing traditional approaches to multi-view coding, this chapter will introduce DVC. The theoretical background of distributed source coding (DSC) is first presented; the problem of applying DSC to video sources is then analyzed. DVC approaches in both single-view and multi-view applications are discussed.
Keywords: MVC, distributed source coding, distributed video coding, multi-camera systems, Wyner-Ziv coding
11.1 INTRODUCTION
Invariably involved in any application of networked cameras is the problem of coding and transmission of the video content that must be shared among users and devices in the network. For this task, video compression techniques are employed to reduce the bandwidth required for communication and possibly to efficiently store the video sequences on archival devices. Video coding has grown over the last decades into a fundamental field of research in multimedia technology, since it enables modern devices and applications that would otherwise need to manage huge amounts of uncompressed data. Historically, video coding was concerned with compressing single video sequences as much as possible. Much progress has been made toward this end, from the H.261 video coding standard [39] through the latest developments of the H.264/Advanced Video Coding (AVC) standard [27, 33] and its extensions. While the compression efficiency of these codecs has been growing continuously, only in recent years has there been an increasing interest in more general video coding problems, largely motivated by the explosion of consumer-level technology. For example, more attention is now being paid to error-resilient video coding
for error-prone channels [13] and to scalable video coding to deal with different display devices in broadcasting scenarios [40]. A similar emerging interest, motivated by appealing applications, concerns multi-view video coding. With the advent of camera networks and camera arrays, new application perspectives such as 3D television and free-viewpoint television now appear as feasible targets in the near future. Given the increasing interest in multi-view coding, an extension of H.264/AVC approaches to multiple camera systems has been proposed in the literature [21, 41] and is being considered by standardization bodies [19]. These methods combine different compression tools in specialized architectures, which essentially try to exploit the redundancy in video sequences to compress the data by means of predictive coding. In a nutshell, whereas single-source video coding concentrates on temporal predictions between frames of the same sequence, multi-view video coding extends the idea to the spatial disparity existing between frames of different sequences. To apply predictive coding between different views, the encoder obviously must have access to the different video sequences. This implies that communication be enabled between the cameras or that all the cameras be connected to a joint encoder that exploits such redundancy. In certain situations, such as large low-power camera arrays, the communication of raw data between cameras may result in excessive power consumption or bandwidth requirements. From this perspective, the emerging field of distributed video coding (DVC) has been proposed as an alternative framework for efficient independent compression of video data from multiple cameras—that is, exploiting the redundancy without inter-camera communication. DVC builds on information-theoretic results of the 1970s that demonstrated the possibility of separately compressing correlated sources at their joint entropy rate, provided a single joint decoder is in charge of the decoding process. The purpose of this chapter is to provide an introduction to DVC in a multi-camera context. The structure of the chapter is as follows. In Section 11.2, a brief description of classic video-coding techniques is presented in order to better appreciate the approach proposed by DVC. In Section 11.3 an introduction to the information theory underlying DSC is provided in order to clarify DVC's underlying concepts. In Section 11.4, early approaches to DVC in a mono-view setting are described; these are then discussed in the more general multi-view case in Section 11.5. Section 11.6 concludes the chapter.
11.2 CLASSIC APPROACH TO VIDEO CODING
This section presents a very concise description of the techniques used in classic video coding to exploit the redundancy of typical video sequences. Our intention is to provide a high-level description that establishes a reference framework against which to compare DVC. The architectural complexity of standard video codecs has evolved from the first H.261 to the more recent H.264/AVC codec. Over the years, many tools were developed to improve performance, but there was no real paradigm shift. The video sequence is usually partitioned into groups of pictures (GOPs) that are processed with a certain predictive structure in order to exploit the temporal dependencies among frames. In Figure 11.1 an example GOP structure of length 4 is shown.
FIGURE 11.1 Predictive coding dependencies of a GOP in a classic video encoder. The frames are encoded in different ways as intra frames (I), predicted frames (P), or bidirectionally predicted frames (B), the last modality possibly being hierarchically implemented on more levels. The dotted arrows represent motion searches used to find predictors in the reference frames. In this example the GOP length is 4 when the last frame is encoded as an intra frame.
FIGURE 11.2 High-level block diagram of the predictive encoding procedure in a classic video codec.
It uses hierarchical B frames—that is, frames whose encoding is based on bidirectional predictions. The hierarchy is obtained by imposing a specific prediction structure. The encoding procedure for GOP frames is essentially based on motion-compensated prediction and subsequent block-based transform coding of the residual. The block diagram of the encoding procedure is shown in Figure 11.2. Every frame to be encoded is
first partitioned into macroblocks; the content of each macroblock can then be searched for in reference frames—that is, frames that have already been encoded—so as to apply predictive encoding to the current macroblock. To avoid drift between encoder and decoder due to quantization of the residual information, it is necessary to implement a closed prediction loop, which means that the encoder replicates the decoder's behavior. For every macroblock, the best predictor found, or an intra-frame prediction if no sufficiently similar block exists in the reference frames, is subtracted from the original so that only the residual information is encoded. To remove any possible remaining spatial redundancy, a spatial transform is applied prior to quantization. Both the indication of the used predictor, which includes the motion information, and the transformed and quantized blocks are entropy-coded in order to compress the information as much as possible. This is a coarse description of the structure of a typical video encoder, where additional tools such as variable macroblock size, subpixel motion search, deblocking, and sophisticated intra-frame prediction can be added and optimized jointly using a rate distortion formulation to further improve performance (see [33]). The encoder architecture described for single-view video coding can be easily extended to multi-view coding. The main innovation needed is the predictive structure. An example of this is shown in Figure 11.3, where the same dependencies used in the temporal direction are applied between cameras so as to achieve disparity compensation. The assumption here, of course, is that the cameras are placed so that contiguous cameras capture similar video sequences. The video-coding community has devoted a great deal of work in recent years to understanding specific multi-view video-coding problems (see for example [21, 41]), and ongoing activities are leading to the definition of a multi-view video-coding standard [19]. Figure 11.4 shows an example, in terms of rate-distortion performance, of the benefits of multi-view coding techniques on a set of video sequences taken from neighboring cameras. The most important point here is that the classic approach to video coding in a multi-view setting involves prediction between the frames of different sequences. This implies
FIGURE 11.3 Example of a predictive coding structure for a multi-view system.
FIGURE 11.4 Rate-distortion operational curves (PSNR in dB versus rate in kbps) for the multi-view extension of H.264/AVC, JMVC version 1.0 [19]: (a) Breakdancers, 256×192, 15 fps; (b) Exit, 192×144, 25 fps. The plots refer to sequences taken from three cameras. Here, the sequence from camera 0 is encoded in a traditional H.264/AVC single-view manner. The sequence from camera 2 is encoded using camera 0 as a reference; the sequence from camera 1 uses both camera 0 and camera 2 as references. The advantage of inter-camera predictions in terms of bit rate savings for a given target quality is clearly visible.
that the video content must be analyzed jointly by the encoder and thus that all cameras can either communicate with each other or send the raw video content to a central encoder, which has to jointly process the received data. In the next sections, a completely different approach, DVC, will be introduced. In this context, as mentioned in Section 11.1, predictive encoding is replaced by an independent encoding of the sources, where the correlation between them is exploited only on the decoder side. Before tackling the problem of DVC, however, it is necessary to understand the theoretical setting on which it is based: distributed source coding.
11.3 DISTRIBUTED SOURCE CODING
This section provides a brief introduction to the information-theoretic setting for distributed source coding, a necessary prerequisite to understanding distributed video coding techniques. In its first and basic version, DSC is the independent encoding of two correlated sources that are to be transmitted to a common receiver. This problem was first studied by Slepian and Wolf [29] in 1973; their famous results, together with the results obtained in a later paper by Wyner and Ziv [35], led to the development of DSC as a separate branch of information theory.
11.3.1 Slepian-Wolf Theorem
Following Slepian and Wolf, consider a situation where two correlated sources X and Y are to be encoded and transmitted to a single receiver. For the sake of simplicity, we will deal only with the case of discrete memoryless sources with a finite alphabet, and we will specify the hypotheses necessary to ensure the validity of the stated results. We are interested in studying the two different scenarios shown in Figure 11.5. Let us focus first on the case depicted in Figure 11.5(a). We want to study the rate required to achieve lossless transmission of X and Y from the encoders to the decoder—that is, the rate required for the decoder to recover without distortion the values of X and Y. Note that in this scheme the two encoders are allowed to communicate (assuming there is no limitation on the amount of information they can share). So, in this case, we can consider that encoders 1 and 2 both know the values of X and Y. It does not make much sense to consider the rates spent individually by each encoder, as all of the information
FIGURE 11.5 Two scenarios for a two-source problem: (a) joint encoding; (b) distributed encoding.
may be sent by one of them. We are thus interested in studying the total rate required. It is well known from information theory that the minimum total rate required for lossless encoding of sources X and Y is their joint entropy H(X, Y) [11]. Consider now the problem of encoding X and Y when the situation is as depicted in Figure 11.5(b). In this case the two encoders cannot communicate with each other, so they separately encode X and Y and send their codes to the common decoder. We ask what the admissible rates are for lossless communication in this case. It is clear that encoders 1 and 2 could send X and Y using rates equal to H(X) and H(Y) bits, respectively. The total rate in that case would be H(X) + H(Y), which is greater than H(X, Y) under the hypothesis that X and Y are not independent. However, the decoder would receive part of the information in a redundant way. Suppose that the decoder decodes first the value of Y; then the value of X, being correlated with Y, would already be "partially known," and the complete description received from encoder 1 would somehow be redundant. We can thus guess that some rate could be saved by proper encoding. The surprising result obtained by Slepian and Wolf [29] was not only that the rates for X and Y can actually be smaller than H(X) and H(Y), but that there is no penalty in this case with respect to the case of Figure 11.5(a) in terms of total required rate. The only additional constraint in this case is that there is a minimum rate equal to the conditional entropy H(X|Y) to be spent for X and a minimum rate equal to H(Y|X) for Y, which represents the intuitive idea that every encoder must send at least the amount of information from its own source that is not contained in the other source. In particular, Slepian and Wolf formulated the following theorem for the case of memoryless sources.
Theorem 11.1 (Slepian-Wolf theorem). Let two sources X and Y be such that (X_1, Y_1), (X_2, Y_2), ... are independent drawings of a pair of correlated random variables (X, Y). Then it is possible to independently encode the source X and the source Y at rates R_X and R_Y, respectively, so that a common receiver will recover X and Y with an arbitrarily small probability of error if and only if R_X ≥ H(X|Y), R_Y ≥ H(Y|X), and R_X + R_Y ≥ H(X, Y).
This theorem holds for memoryless sources as considered by Slepian and Wolf. A few years later, Cover [10] extended it to the more general case of multiple stationary ergodic sources, giving a simple proof based on the asymptotic equipartition property—the Shannon-McMillan-Breiman theorem [20]. In this more general case the theorem is obviously reformulated by substituting entropies with entropy rates in the inequalities. The set of all (R_X, R_Y) rate pairs satisfying the theorem is called the achievable region and is shown in Figure 11.6. The two points labeled A and B in the figure represent an important special case of the theorem. Consider, for example, point A. This point in the region represents a situation where source Y is encoded in a traditional way using a rate R_Y equal to its own entropy H(Y), while source X is encoded using the minimal rate R_X = H(X|Y). This problem is of particular interest, and it is usually referred to as coding X with side information Y at the decoder. It is important to clarify that Theorem 11.1 considers rate pairs (R_X, R_Y) such that the decoder will losslessly recover X and Y with an arbitrarily small probability of error. This means that the encoding is considered to operate on blocks of N symbols and that for a sufficiently large N the probability of error in the decoding phase can be made as small as desired. It is worth noticing that in this sense there is a penalty in distributed
FIGURE 11.6 Slepian-Wolf region.
encoding, not in joint encoding. In the latter case, in fact, by using variable-length codes it is possible to encode the two sources X and Y at a total rate as close as desired to the joint entropy H(X, Y), even with a probability of decoding error exactly zero.
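To make the achievable region concrete, the following minimal sketch (an illustration in Python, not part of the original text) computes the entropies that appear in Theorem 11.1 from a given joint distribution and checks whether a candidate rate pair satisfies the three inequalities.

```python
# Minimal illustration of the Slepian-Wolf achievable region (assumed example).
import math

def entropy(p):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def slepian_wolf_bounds(joint):
    """joint[x][y] = P(X=x, Y=y); returns (H(X|Y), H(Y|X), H(X,Y))."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy(q for row in joint for q in row)
    return h_xy - entropy(py), h_xy - entropy(px), h_xy

def achievable(rx, ry, joint):
    hx_y, hy_x, hxy = slepian_wolf_bounds(joint)
    return rx >= hx_y and ry >= hy_x and rx + ry >= hxy

# Example: binary X and Y agreeing with probability 0.9.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
print(slepian_wolf_bounds(joint))    # H(X|Y) = H(Y|X) ~ 0.47, H(X,Y) ~ 1.47
print(achievable(0.5, 1.0, joint))   # True: just above corner point A
```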
11.3.2 A Simple Example
It is useful to clarify the idea behind the Slepian-Wolf theorem by means of a simple example. Suppose that the two sources X and Y are such that Y is an integer uniformly distributed in [0, 999] and that X = Y + N, with N uniformly distributed on the integers between 0 and 9. Consider the number of decimal digits necessary to describe X and Y in the joint encoding shown in Figure 11.5(a). We easily note that it is possible to encode Y using three decimal digits and then, given the value of Y, encode X with the single digit required to describe the value of N = X − Y. For example, if X = 133 and Y = 125, then Y is simply encoded with its own representation and X is encoded by specifying the value N = 8. Thus, a total of four decimal digits allows encoding of both values of X and Y. Suppose now that the two encoders cannot communicate, as depicted in Figure 11.5(b). Then, supposing that Y is encoded with all of its three decimal digits, we are faced with the encoding of X with side information Y at the decoder, as in point A in Figure 11.6. This time, the encoding of X cannot be based on the value of N, since N cannot be computed by encoder 1, which ignores the value of Y. Still, it is possible to encode X using only one decimal digit if the value of Y is known to the decoder. The trick is to encode X simply by specifying its last digit. In our case, for example, where Y = 125, we have Y ≤ X ≤ Y + 9, so knowing that the last digit of X is 3 suffices to deduce that X = 133. So the fact that encoder 1 does not know Y does not increase the rate required for X, and a total rate of four digits allows describing both X and Y.
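This digit trick is easy to simulate. The following short sketch (illustration only; the function names are ours, not the chapter's) encodes X by its last decimal digit and lets the decoder recover it using the side information Y.

```python
# Coding X with side information Y at the decoder: X = Y + N, 0 <= N <= 9.
def encode_x(x):
    return x % 10                       # one decimal digit

def decode_x(last_digit, y):
    # X is one of y, y+1, ..., y+9; exactly one of these ends in last_digit.
    for candidate in range(y, y + 10):
        if candidate % 10 == last_digit:
            return candidate

y, x = 125, 133                          # the values used in the text
assert decode_x(encode_x(x), y) == x     # decoder recovers X = 133 from digit 3 and Y
```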
Now note that all points on the segment between A and B in Figure 11.6 are achievable. These points can be obtained by properly multiplexing points A and B in time, but it is also possible to construct an actual symmetric encoding of X and Y. In our simple toy example, this can be shown by demonstrating that it is possible to encode X and Y using two decimal digits for each source. Here the trick is to let encoder 2 send the last two digits of Y and encoder 1 send the first and third digits of X. In our example, where X = 133 and Y = 125, encoder 2 sends "_25" and encoder 1 sends "1_3." It is not difficult to see that for the receiver this information, together with the constraint Y ≤ X ≤ Y + 9, is sufficient to determine that X = 133 and Y = 125. This simple example reveals an interesting insight into the essence of the Slepian-Wolf theorem. With real sources, obviously, the encoding techniques usually must be much more complicated. Nevertheless, the main idea is maintained: the principles used both for coding with side information and for coding at symmetric rates are "only" generalizations of the described approach to more general and practical situations. In the next section, it is shown with a meaningful example that channel codes can be used for distributed source coding in the case where X and Y are binary sources correlated in terms of Hamming distance.
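Before moving on, here is a brute-force sketch of the symmetric scheme just described (again only an illustration; as in the text, it implicitly assumes that X stays below 1000). The decoder searches for the unique pair consistent with the received digits and the constraint Y ≤ X ≤ Y + 9.

```python
# Symmetric encoding: two decimal digits per source.
def digits(n):
    return n // 100, (n // 10) % 10, n % 10          # hundreds, tens, units

def decode(x_hundreds, x_units, y_tens, y_units):
    matches = []
    for y in range(1000):
        for n in range(10):                          # X = Y + N, 0 <= N <= 9
            x = y + n
            if x > 999:                              # boundary cases ignored, as in the text
                continue
            if (digits(x)[0], digits(x)[2]) == (x_hundreds, x_units) and \
               (digits(y)[1], digits(y)[2]) == (y_tens, y_units):
                matches.append((x, y))
    return matches

x, y = 133, 125
# Encoder 1 sends "1_3", encoder 2 sends "_25"; the decoder finds a unique pair.
print(decode(digits(x)[0], digits(x)[2], digits(y)[1], digits(y)[2]))   # [(133, 125)]
```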
11.3.3 Channel Codes for Binary Source DSC
The close connection between the Slepian-Wolf problem and channel coding was first noticed by Wyner [34], who used an example of binary sources to present an intuitive proof of the Slepian-Wolf theorem. In this section, an example of distributed encoding of binary sequences is provided, with a description of the use of channel codes. The discussion parallels the example given in Section 11.3.2. We assume the reader is familiar with the basic theory of algebraic channel codes (see [7] for an introduction). A more detailed analysis of channel codes for DSC can be found in Pradhan and Ramchandran [24, 25, 28] and Gehrig and Dragotti [15]. In this example we consider two sources X and Y that are 7-bit words, where the correlation is expressed by the fact that the Hamming distance between X and Y is at most 1—that is, they differ in at most 1 bit. As a reference, note that the joint encoding of X and Y requires 10 bits. One can, for example, raw-encode Y with 7 bits and then encode X by specifying the difference with respect to Y with 3 bits, given that there are eight possible choices. Consider the case of coding X with side information Y at the decoder or, equivalently, when 7 bits are used for the encoding of Y. We show here that by using a proper channel code it is still possible to encode X using only 3 bits. We use the systematic Hamming (7,4) code. The generating matrix G and the parity check matrix H of this code are, respectively,
G = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}
\qquad
H = \begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1
\end{pmatrix}
\qquad (11.1)
Let us first consider how the code is used in channel coding. In the encoding phase, the Hamming code maps a 4-bit word w into a 7-bit codeword c_t = w · G, which is
transmitted on the channel and received as, say, c_r. The decoding phase computes the so-called syndrome s = c_r · H^T. By construction, the matrices H and G satisfy G · H^T = 0. Thus, if the codeword is received without error, one has s = c_r · H^T = c_t · H^T = w · G · H^T = 0. If instead an error word e is added to the codeword during transmission, one has s = c_r · H^T = (c_t + e) · H^T = (w · G + e) · H^T = 0 + e · H^T = e · H^T. It is easy to note that if e has a Hamming weight equal to 1 (i.e., one bit is corrupted in the transmission), then s equals the column of H indexed by the position of the error. Thus, s allows one to identify the position of the error and so correct the codeword c_r to restore c_t and recover w. This is the well-known fact that the (7,4) Hamming code can correct one error. Now let us focus on the use of this code for coding X with side information Y at the decoder. The correlation assumption between X and Y can be modeled by saying that X = Y + e, with the word e having a Hamming weight of at most 1. Suppose now that Y is known at the decoder. We encode X by computing its 3-bit syndrome s_X = X · H^T and sending it to the decoder. There we can compute s_Y = Y · H^T. Using s_X and s_Y, the decoder can compute s = s_X + s_Y = X · H^T + Y · H^T = (X + Y) · H^T = e · H^T. Again, assuming that e has a Hamming weight of at most 1, the decoder can detect the position of the difference between X and Y and then, since Y is given, deduce X. With a smart trick, it is also possible to use the Hamming code to encode the two sources X and Y in a symmetric way using 5 bits for each. We split the generating matrix G into two submatrices G_1 and G_2 by taking respectively the first two rows and the last two rows of G. That is,
G_1 = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 1 & 0 & 1
\end{pmatrix}
\qquad
G_2 = \begin{pmatrix}
0 & 0 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}
\qquad (11.2)
These two submatrices are used as generating matrices for two codes C_1 and C_2, which are subcodes of the Hamming code with parity check matrices

H_1 = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\qquad
H_2 = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 1
\end{pmatrix}
\qquad (11.3)
The encoding of X and Y is done by computing s_X = X · H_1^T and s_Y = Y · H_2^T. It is possible to show that the decoder, given the pair of syndromes s_X and s_Y, can uniquely determine the words X and Y using the constraint that they differ by at most 1 bit. In fact, suppose on the contrary that a different pair of words (X̄, Ȳ) satisfies the same syndromes and the same distance constraint. Then, as X and X̄ have the same syndrome, X + X̄ has a null syndrome and is thus a codeword of C_1; for similar reasons, Y + Ȳ is a codeword of C_2. Thus, as C_1 and C_2 are subcodes of the Hamming code, (X + X̄) + (Y + Ȳ) is a codeword of the Hamming code. But (X + X̄) + (Y + Ȳ) = (X + Y) + (X̄ + Ȳ) has a weight at most equal to 2 and, since the Hamming code has distance 3, the only codeword with weight smaller than 3 is the null word. So (X + X̄) = (Y + Ȳ), but (X + X̄) is in C_1 while (Y + Ȳ) is in C_2. As the rows of G_1 and the rows of G_2 are independent (G_1 and G_2 being submatrices of G), the only intersection of C_1 and C_2 is the null word—that is, X = X̄ and Y = Ȳ. So there is a unique solution, which means that X and Y can be recovered at the decoder.
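The side information case of this construction can be checked directly. The sketch below (an illustration using NumPy, not the chapter's code, with H as in equation (11.1)) sends only the 3-bit syndrome of X; the decoder adds the syndrome of Y to locate the single differing bit.

```python
# Coding X with side information Y at the decoder using the (7,4) Hamming code.
import numpy as np

H = np.array([[0, 1, 1, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [1, 1, 0, 1, 0, 0, 1]])     # parity check matrix from (11.1)

def syndrome(word):
    return word @ H.T % 2                 # 3-bit syndrome s = word * H^T (mod 2)

def decode_with_side_info(s_x, y):
    """Recover X from its syndrome and the side information Y (Hamming distance <= 1)."""
    s = (s_x + syndrome(y)) % 2           # syndrome of the error pattern e = X + Y
    x = y.copy()
    if s.any():                           # nonzero syndrome: flip the bit whose
        for i in range(7):                # column of H matches the syndrome
            if np.array_equal(H[:, i], s):
                x[i] ^= 1
                break
    return x

x = np.array([1, 0, 1, 1, 0, 0, 1])
y = np.array([1, 0, 1, 0, 0, 0, 1])       # differs from x in exactly one position
assert np.array_equal(decode_with_side_info(syndrome(x), y), x)
```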
11.3.4 Wyner-Ziv Theorem
A few years after Slepian and Wolf [29], Wyner and Ziv [35] obtained an important solution to the problem of lossy coding with side information at the decoder—that is, the case when Y is available at the decoder and source X does not have to be recovered perfectly but only within a certain distortion. For lossy source coding, as is known, the theoretical bounds are described by computing the rate distortion function [6, 11, 14]. We do not want to go into the details of rate distortion theory (the interested reader is referred to Berger [6]). Here we just recall that for the single-source problem, supposing that X is an i.i.d. source with a marginal pdf q(x), and that d(x, x̂) is the distortion measure between a reproduction symbol x̂ and the original value x, the rate distortion function is given by

R(D) = \min_{p \in P(D)} I(X; \hat{X}) \qquad (11.4)
where I(·;·) is the mutual information and P(D) is the set of all conditional probability functions p(x̂|x) such that E[d(X, X̂)] ≤ D—that is, such that the expected value of the distortion is at most D. When side information Y is available to both encoder and decoder, the rate distortion function simply changes to [6]

R(D) = \min_{p \in P(D)} I(X; \hat{X} \mid Y) \qquad (11.5)
where P(D) is now the set of all p(x̂|x, y) such that E[d(X, X̂)] ≤ D. Wyner and Ziv obtained a characterization of the rate distortion function when the side information Y is available only at the decoder [35].
Theorem 11.2 (Wyner-Ziv theorem). Let two sources X and Y be as in Theorem 11.1, and let q(x, y) be their joint distribution. The rate distortion function for the encoding of X with side information Y available to the decoder is

R^{WZ}_{X|Y}(D) = \inf_{p \in P(D)} \left[ I(X; Z) - I(Y; Z) \right] \qquad (11.6)

where Z is an auxiliary variable and P(D) is the set of all p(z|x) for which there exists a function f such that E[d(X, f(Y, Z))] ≤ D.
A detailed analysis of the theorem is beyond the scope of the present work. We only add two comments that may be interesting. In addition to proving their theorem, Wyner and Ziv observe the following facts [35]:
■ For positive distortion values D there is a penalty in the rate distortion bound when the side information is not available to the encoder. This means that the result of Slepian and Wolf does not extend to the lossy case. It was shown more recently [37], however, that the rate loss is bounded by a quantity that equals half a bit per sample for the quadratic distortion d(x, x̂) = (x − x̂)².
■ Theorem 11.2 is valid in a broader setting than the limited case of finite-alphabet sources [36]. In particular, it is valid if X is a Gaussian source and X = Y + N, where N is a Gaussian noise with variance σ_N² and independent of Y. In this particular case, under the Euclidean distortion criterion, the rate distortion function can be computed analytically:

R^{WZ}_{X|Y}(d) = \left[ \frac{1}{2} \log \frac{\sigma_N^2}{d} \right]^{+} \qquad (11.7)
where (·)^+ is the positive-part function (i.e., (x)^+ = max{0, x}). In this particular case, the rate distortion function is the same as that obtained when Y is also available to the encoder; thus, here the Slepian-Wolf result does extend to the lossy case (see the short numerical check below).
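As a quick numerical illustration (not from the text), equation (11.7) can be evaluated directly; base-2 logarithms are used here, so the rate is expressed in bits per sample.

```python
# Numerical check of the Wyner-Ziv rate distortion function (11.7) for X = Y + N.
import math

def wz_rate(d, sigma_n2):
    return max(0.0, 0.5 * math.log2(sigma_n2 / d))   # bits per sample

sigma_n2 = 1.0
for d in (2.0, 1.0, 0.25, 0.01):
    print(f"D = {d:5.2f}  ->  R = {wz_rate(d, sigma_n2):.2f} bits/sample")
# Distortions at or above sigma_n2 need no rate; each halving of the
# distortion below sigma_n2 costs an extra half bit per sample.
```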
11.4 FROM DSC TO DVC
The application of DSC principles to video coding was independently proposed by two groups, from Stanford University [1] and from the University of California at Berkeley [25]. Starting from these preliminary works, DVC has now become an active field of research (see for example [17, 22] for an overview). Curiously enough, while DSC lies in the so-called field of multi-terminal or multi-user information theory, DVC was initially concerned with using DSC to encode single video sequences. After the first published works on single-source video coding, the field rapidly grew, and DVC is now seen as the application of DSC to the more general problem of multi-source (or multi-view) video coding. This section presents the DVC approach to single-source coding, which is a meaningful introduction to the topic. Most of the ideas can easily be reused in the multi-camera contexts treated later in the chapter, and the most important differences will be discussed in detail in the next sections.
11.4.1 Applying DSC to Video Coding
The use of DVC in single-source video coding was proposed as an alternative to traditional video-coding techniques, which are mainly centered on the use of motion compensation in a prediction loop inside the encoder. There are different motivations for this alternative proposal. The most important are probably the shift of computational complexity from the encoder to the decoder and an expected higher error robustness in the presence of error-prone communications. In short, as already described in Section 11.2, classic video-coding techniques such as H.264/AVC (see [27, 33] and references therein for details) adopt motion estimation at the encoder for motion-compensated predictive encoding of the information contained in sequence frames. This leads to codecs with very good rate distortion performance, but at the cost of computationally complex encoders and fragility with respect to transmission errors over the channel. The high computational complexity of the encoder is due to the motion search required to properly perform predictive coding from frame to frame. Fragility, in turn, is due to the drift caused by error propagation through the prediction loop, which means that the fragile source coding approach must be followed by powerful channel coding for error resilience. In addition, further processing must often be designed at the receiver to adopt effective error concealment strategies. DSC techniques are intrinsically based on the idea of exploiting redundancy without performing prediction in the encoding phase, leaving to the decoder the problem of deciphering the received codes using the correlation or redundancy between sources. For these reasons the use of DSC in single-source video coding appears to be a possible solution for robust encoding with the possibility of flexibly allocating the computational complexity between encoder and decoder.
Consider a video sequence composed of frames X_1, X_2, ..., X_n; let R and C be the number of rows and columns in every frame X_i; and let X_i(r, c) represent the pixel value at frame location (r, c). It is clear that the frames of a video sequence are redundant—that is, a video is a source with strong spatial and temporal memory. Spatial memory means that if we model the frames as stochastic processes, the random variables representing pixel values that are spatially close in the same frame are correlated. Temporal correlation means that consecutive frames are very similar, the only difference being the usually small object movements, unless a scene change, a flash, or some similar "rare" event occurs. We will refer to spatial correlation as intra-frame and to temporal correlation as inter-frame. Later in this chapter, when referring to multi-camera systems, we will call intra-sequence the correlation within a sequence and inter-sequence the correlation between different sequences, for obvious reasons. From H.261 and MPEG-1 through the most recent developments such as H.264/MPEG-4 AVC, the classic techniques for video coding have exploited the correlation of a video sequence by combining the use of transforms for removing the intra-frame correlation and the use of motion-compensated prediction for dealing with the inter-frame correlation. We are mostly interested in this second aspect—motion-compensated prediction between frames. In the basic situation we can consider the problem of encoding a frame X_i when the previous frame X_{i−1} has already been encoded and is available in an approximated form, say as X̃_{i−1}, at the decoder. In this case, a classic video coding technique estimates the motion field M_i between the reference frame X̃_{i−1} and X_i; then, by "applying" this motion to frame X̃_{i−1}, it obtains an approximation of X_i, namely M_i(X̃_{i−1}). The encoding of X_i is then performed using this prediction: instead of directly encoding X_i, the motion field M_i and the prediction error e_i = X_i − M_i(X̃_{i−1}) are coded. e_i is usually transform-encoded so as to remove the remaining intra-frame correlation. This is only a very coarse description of modern video codecs, as an accurate fine-tuning of tools is necessary to achieve high rate distortion performance as proposed in the standards (MPEG-1/2/4). Nevertheless, the main point is sufficiently described in this form: in classic video-coding standards a frame is encoded by applying motion-compensated prediction from previously encoded frames. At the decoder, the motion field is applied to the available reference frame (or frames) and used to generate the prediction, which is then successively updated with the received prediction error. The use of DSC in video coding is based on the idea that we can consider the frames (or portions of frames) of a video sequence as different correlated sources. So, when a frame X_i has to be encoded based on a previously encoded frame X_{i−1}, by invoking the Slepian-Wolf and Wyner-Ziv results we can consider X̃_{i−1} as side information that is known to the decoder and that need not be known to the encoder. In this way the coding technique for X_i exploits the correlation with X̃_{i−1} in the decoding phase without using prediction in the encoding step. This is the very basic idea behind DVC, which must be further refined in order to produce concrete coding schemes.
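As a concrete, deliberately simplified illustration of the motion-compensated prediction just described (not the code of any particular standard), the sketch below performs an exhaustive block-matching search over a reconstructed reference frame and returns the motion field, the prediction, and the residual e_i; the block size, search range, and SAD criterion are arbitrary illustrative choices.

```python
# Toy block-matching motion compensation: e_i = X_i - M_i(X_ref).
import numpy as np

def motion_compensate(x_i, x_ref, block=8, search=4):
    """Frame dimensions are assumed to be multiples of the block size."""
    h, w = x_i.shape
    prediction = np.zeros_like(x_i)
    motion_field = {}
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            target = x_i[r:r + block, c:c + block].astype(np.int32)
            best_sad, best_v = np.inf, (0, 0)
            for dr in range(-search, search + 1):
                for dc in range(-search, search + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr <= h - block and 0 <= cc <= w - block:
                        cand = x_ref[rr:rr + block, cc:cc + block].astype(np.int32)
                        sad = np.abs(target - cand).sum()
                        if sad < best_sad:
                            best_sad, best_v = sad, (dr, dc)
            dr, dc = best_v
            motion_field[(r, c)] = (dr, dc)
            prediction[r:r + block, c:c + block] = x_ref[r + dr:r + dr + block,
                                                         c + dc:c + dc + block]
    residual = x_i.astype(np.int32) - prediction.astype(np.int32)
    return motion_field, prediction, residual
```

In a classic codec both the motion field and the (transform-coded) residual are sent; the DVC idea discussed next removes this search from the encoder entirely.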
Note that the DSC scenario considered in this case is one of source coding with side information at the decoder and, for video sequences, we are usually interested in lossy compression. For this reason DVC is often also referred to as Wyner-Ziv (WZ) video coding, and more generally we call WZ coding any encoding technique based on the presence of side information at the decoder. By extension, we often refer to the bits associated with a WZ encoding as WZ bits and call side information
(SI) the part of video already available at the decoder, referring in some cases to a whole frame and in others to portions of frames or even groups of frames.
11.4.2 PRISM Codec
In this section we describe the so-called PRISM codec proposed by Puri and Ramchandran [25] in 2002. The encoding approach for the video frame sequence is shown in Figure 11.7 for a single GOP. Again let X_1, X_2, ..., X_n be the frames. The first frame X_1 is encoded in intra-mode using, for example, a block-based approach similar to the ones used in JPEG [32]. For the following frames, a block-based process is considered. The generic frame X_i is divided into 8×8 pixel blocks; let X_i^k be the kth block, and let X_i^k(r, c) be its pixel values. The following chain of operations is then performed (see Figure 11.8).
FIGURE 11.7 Frame encoding and decoding scheme in PRISM.
FIGURE 11.8 Block diagram of the encoding and decoding processes in PRISM.
Encoding
1. Every block is analyzed so as to estimate its correlation with the content of the previous frame: block X_i^k is compared with X_{i−1}^k and the sum of absolute differences is computed: ε_i^k = Σ_{r,c} |X_i^k(r, c) − X_{i−1}^k(r, c)|. The value ε_i^k is an estimate, with low computational cost, of the correlation between the current block and the previous frame.
2. Depending on the value of ε_i^k, every block is classified as follows:
■ If ε_i^k is smaller than a given threshold, say ε_i^k ≤ ε_min, then block X_i^k is classified as a SKIP block.
■ If ε_i^k is larger than a given threshold, say ε_i^k ≥ ε_max, then block X_i^k is classified as an INTRA block.
■ Otherwise, block X_i^k is classified as a WZ block. WZ blocks are further divided into 16 classes, C_1, C_2, ..., C_16, depending on their ε_i^k value, so that the encoder can operate differently on blocks exhibiting different levels of correlation.
3. A flag is transmitted indicating the type of block (SKIP/INTRA/WZ) and the code of the block is then emitted:
■ If X_i^k is a SKIP block, no further information is encoded. SKIP mode means that the decoder replaces the block with the same-positioned block in the previous frame.
■ If X_i^k is an INTRA block, it is encoded in a traditional way through transform coding followed by a run-amplitude (RA) code such as in JPEG. The decoder can thus decode this type of block without any reference to other frames.
■ If X_i^k is a WZ block, the index specifying the associated class is added. The block is then encoded, as indicated in Figure 11.9, in the following way. A DCT transform
FIGURE 11.9 Encoding procedure for the WZ blocks in PRISM.
is applied, followed by a quantization tuned depending on the class. The least significant (LS) bits of the quantized low-pass coefficients are encoded in a distributed fashion using a trellis code. Then refinement bits of the low-frequency coefficients are encoded (to reach a given target quality), whereas high-pass coefficients are encoded with a classic RA procedure. Furthermore, a 16-bit CRC is computed on the quantized low-pass coefficients. (The use of the CRC will be clarified later.) The procedure just explained for the encoding of the WZ frames is not completely specified, since it is not clear how the values of the ε_min and ε_max thresholds are established. The same is true for the definition of the classes C_j, j = 1, ..., 16, which are determined based on the statistical distribution of the ε_i^k values over a training set. Such training is necessary to adequately estimate the correlation model at the base of WZ coding. These details are not relevant to the purpose of this chapter, and we refer the reader to Puri et al. [26] for more information.
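The classification step of the encoder is easy to sketch. In the snippet below the thresholds and class boundaries are hypothetical placeholders (in PRISM they are derived from training data); the classifier simply computes ε_i^k and buckets it.

```python
# Illustrative SKIP / INTRA / WZ classification of an 8x8 block (thresholds are made up).
import numpy as np

EPS_MIN, EPS_MAX = 500, 30000                        # hypothetical thresholds
CLASS_EDGES = np.linspace(EPS_MIN, EPS_MAX, 17)      # 16 WZ correlation classes

def classify_block(block, co_located):
    """Return ('SKIP'|'INTRA'|'WZ', class_index or None) for one block."""
    eps = np.abs(block.astype(np.int32) - co_located.astype(np.int32)).sum()
    if eps <= EPS_MIN:
        return "SKIP", None
    if eps >= EPS_MAX:
        return "INTRA", None
    wz_class = int(np.searchsorted(CLASS_EDGES, eps, side="right")) - 1
    return "WZ", min(max(wz_class, 0), 15)
```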
Decoding (WZ blocks)
The decoding for a block X_i^k is performed by combining a sort of motion estimation and a WZ decoding in the following way (a control-flow sketch of this loop is given after the list):
1. For a WZ block X_i^k in the frame X_i, different blocks in the frame X_{i−1} around the position of X_i^k are tested as side information at the decoder.
2. Every candidate block is used as SI; it is transformed and quantized using the specific quantizer for the class containing X_i^k, and the LS bits of the low-frequency coefficients are extracted and used as side information for a WZ decoding that uses the parity bits of the correct block X_i^k sent by the encoder.
3. The CRC-16 is computed on the obtained "corrected" side information coefficients. If the CRC matches, the decoding is considered correct and the procedure stops; otherwise, another block is selected from the previous frame and the process is iterated from step 2. If no available SI block matches the CRC, then it is not possible to correctly reconstruct the low-pass coefficients and a concealment strategy must be adopted.
4. When the procedure for the low-frequency coefficients is terminated, the high-frequency coefficients are decoded in a traditional mode and inserted to fill the DCT transform of the block. The inverse transform is then applied to obtain the pixel values of the block.
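The decoder loop above can be summarized as the following control-flow sketch. The trellis-based syndrome decoding is abstracted into a caller-supplied wz_correct callable (a hypothetical signature, not PRISM's API), and Python's zlib.crc32 stands in for the 16-bit CRC of the real codec.

```python
# Control-flow sketch of the PRISM decoder loop (illustrative only).
import zlib
import numpy as np

def prism_decode_block(candidates, parity_bits, crc_target, wz_correct):
    """Try each candidate SI block until the WZ-corrected result matches the CRC.

    candidates : iterable of quantized-coefficient arrays taken from the
                 previous frame around the block position
    wz_correct : hypothetical callable(si_coeffs, parity_bits) -> corrected coeffs
    crc_target : CRC computed at the encoder with the same convention used below
    """
    for si in candidates:
        corrected = wz_correct(si, parity_bits)
        if zlib.crc32(np.asarray(corrected, dtype=np.int16).tobytes()) == crc_target:
            return corrected                     # CRC matches: decoding succeeded
    return None                                  # no candidate worked: conceal the block
```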
11.4.3 Stanford Approach
Compared with the PRISM codec, the Stanford architecture adopts different choices for the application of WZ principles to video sequences [1, 16]. The main difference is that the sequence frames are considered as a whole, and the WZ coding is applied to an entire frame and not to single blocks. Thus, we can actually identify some WZ frames that are completely encoded in a WZ way, without differentiating the processing on a block-by-block basis. The key idea, in this case, is to estimate the motion at the decoder and create a complete SI frame to be corrected as a whole by the WZ decoding.
FIGURE 11.10 Frame encoding and decoding scheme in the Stanford approach.
The coarse idea is to split the frames of the sequence at the encoder into two groups, as shown in Figure 11.10. Again let X_1, X_2, ... be the frames; in the simpler version of the codec, odd-indexed frames X_1, X_3, ... are encoded in a conventional intra-mode way—that is, as a sequence of images—while even-indexed frames X_2, X_4, ... are encoded in a WZ fashion. At the decoder, the intra-coded frames are used to create an approximation of the WZ frames by motion-compensated interpolation. Then the parity bits are used to "correct" these approximations and recover the frames. This idea is graphically represented in Figure 11.10. This general idea gave rise to many proposed variations, and the description we give in the following subsections is obtained by combining interpretations of details from different research (see for example [22] for a concise overview). It is necessary to clarify in advance one particular characteristic of this architecture, which is the need for a feedback channel from the decoder to the encoder [8]. This feedback channel is used in WZ decoding to request more parity bits from the encoder if those received are not sufficient to properly decode the source. Even though, theoretically, this feedback channel could be removed by introducing additional functionality at the encoder side, with the drawback of increased complexity [9], it is still not clear what the achievable performance might be in terms of balancing the increase in encoder complexity against the reduction in rate distortion performance.
WZ Frame Encoding
The encoding of a WZ frame, say X_{2n}, is performed in the following way (see Figure 11.11):
1. A block-based DCT transform is applied to the frame, and a quantization mask is applied to the transformed coefficients. These coefficients are then reordered in
FIGURE 11.11 Block diagram of the encoding and decoding processes in the Stanford codec.
frequency bands, and the bit planes of every band are extracted and prepared for WZ encoding (a small sketch of this bit plane extraction is given below).
2. The extracted bit planes are fed into a turbo encoder,¹ and the resulting parity bits are stored in a buffer. These parity bits are ready for transmission to the decoder, which requests them iteratively until it has enough to achieve successful WZ decoding.
The encoding procedure just described is notably simple. We now present the decoding operation for the WZ frames.
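Before turning to decoding, here is a small sketch of the bit plane extraction mentioned in step 1 (an illustration only, not the reference implementation): the quantized indices of one frequency band are split into bit planes, most significant first, ready to be fed to the Slepian-Wolf (turbo or LDPC) coder.

```python
# Illustrative bit plane extraction for one frequency band of quantized DCT indices.
import numpy as np

def extract_bit_planes(quantized_band, num_bits):
    """quantized_band: 1D array of non-negative quantization indices."""
    planes = []
    for b in reversed(range(num_bits)):            # MSB plane first
        planes.append((quantized_band >> b) & 1)
    return planes                                   # list of 0/1 arrays

band = np.array([5, 3, 7, 0, 2])                    # toy indices spanning 3 bits
for i, plane in enumerate(extract_bit_planes(band, 3)):
    print(f"plane {i} (MSB first):", plane)
```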
WZ Frame Decoding
The decoding process for a WZ frame X_{2n} is as follows:
1. Let X_{2n−1} and X_{2n+1} be the two reconstructed key frames adjacent to X_{2n}. By applying a motion-compensated interpolation, X_{2n−1} and X_{2n+1} are used for the construction of an approximation Y_{2n} of X_{2n}, which is the side information for the WZ decoding.
2. The SI is assumed to be a noisy version of the original frame. In particular, it is assumed that every DCT coefficient of Y_{2n} differs from the corresponding coefficient of X_{2n} by an additive noise with an assumed distribution (usually a Laplacian distribution [16]). Given the value of the side information coefficient, it is possible to compute the probability that each bit of the original coefficient is 0 or 1. These probabilities are used for the subsequent WZ decoding.
3. The WZ decoding operates bit plane by bit plane (frequency band by frequency band), starting with the most significant and using, for every bit plane, the previously decoded bit planes to compute the bit probabilities. The probabilities are fed to the turbo decoder as "channel values" of the information bits. The turbo decoder, using the feedback channel, asks for parity bits from the encoder, which sends them by progressively puncturing the parity bits in the buffer. It tries to decode the channel values with these parity bits to recover the original bit plane. If this process fails, more
¹ Other implementations use LDPC codes (see for example [3]), but there is no essential difference for the purpose of this chapter.
parity bits are requested and the process is repeated until the turbo decoder is able to correctly recover the bit plane. Note that, though not discussed in the first Stanford publications, it is necessary to adopt ad hoc tools in order to detect the success or failure of the decoding process (see for example [18]).
4. After all bit planes have been recovered, the best estimate X̂_{2n} of X_{2n} is constructed by taking, for every DCT coefficient, its expected value given its quantized version and the SI, under the assumed Laplacian probabilistic model (a numerical sketch of this reconstruction is given at the end of this subsection). The DCT block transform is then inverted and the sequence of WZ frames is interleaved with the sequence of key frames.
It is worth saying that step 1, the motion-compensated interpolation, plays an essential role in this architecture and hides many details that can greatly impact the performance of the system (see for example [5]). In particular, some variations on the scheme deal with the possibility of extrapolating based on past frames rather than interpolating. Furthermore, it has been noted in the literature that if the encoder sends a coarse description of the original frame, it is possible to greatly improve motion estimation and thus the quality of the generated SI. We will return to this aspect later.
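The reconstruction of step 4 can be illustrated numerically. The sketch below (not the chapter's code; the bin edges and Laplacian parameter are arbitrary) estimates a coefficient as its expected value over the decoded quantization bin under a Laplacian model centered at the side information value.

```python
# Illustrative coefficient reconstruction: E[X | X in its quantization bin, SI = y].
import numpy as np

def reconstruct_coefficient(bin_low, bin_high, y, laplace_b, grid=1000):
    """Expected value of X over [bin_low, bin_high) under a Laplacian centered at y."""
    x = np.linspace(bin_low, bin_high, grid)
    w = np.exp(-np.abs(x - y) / laplace_b)          # unnormalized Laplacian pdf
    return float(np.sum(x * w) / np.sum(w))

# If the side information falls inside the bin, the estimate stays near it;
# otherwise it is pulled toward the bin edge closest to y.
print(reconstruct_coefficient(8.0, 16.0, 10.0, laplace_b=2.0))   # slightly above 10
print(reconstruct_coefficient(8.0, 16.0, 30.0, laplace_b=2.0))   # ~14, pulled toward 16
```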
11.4.4 Remarks
A number of important comments are helpful in understanding the relations and differences between DSC and DVC, and can thus serve as guidelines for the design of a concrete DVC system. The first difference between DSC and DVC is found in the a priori assumptions regarding the correlation between information sources. In the theoretical setting for DSC, the Slepian-Wolf and Wyner-Ziv theorems are based on the assumption that the encoders and the decoder are completely aware of the statistical correlation between the sources. This assumption is critical since it underlies the encoding and decoding operations. In particular, as usually happens with information-theoretic results, there are assumptions of ergodicity and stationarity of the sources, and it is assumed that the length of the blocks to be encoded can be increased as desired. In DVC, as described previously, the sources are interpreted as frames or portions of frames in a video sequence or, in the case of multi-camera systems, possibly in different video sequences. A first comment is that it is difficult to match the characteristics of video sequences, or portions of them, with those of a stationary ergodic source. However, the most important point is that in DVC the correlation between the sources is in general not known and must be estimated. The meaning of the term "correlation" itself is not immediately clear in the case of DVC. In DSC it simply refers to the joint probability density functions of the sources. In DVC, we can reasonably think of a dependency between sources that can be separated into two factors. First are the geometrical displacements and deformations, which differ from frame to frame because of motion or the relative position between cameras. Second is what is usually considered the real "innovation": the uncovered regions and the differences between the chromatic values of the same physical regions in different frames resulting from differences in sampling location, illumination, noise, and so on. That is, there is one correlation in the sense of geometrical deformation that reflects the 3D difference between scenes, and there is another correlation in the sense of differences that cannot be compensated for by means of geometric transformations.
For these reasons we can interpret WZ decoding in DVC as a composition of two basic operations: a compensation to match the WZ data to the SI data, and a correction to recover the original numerical values of the WZ data from the approximation obtained by compensation. These operations are also performed in a classic video codec, but it is important to understand that, although they do not play essentially distinct roles in a classic approach, they do in a distributed setting. The reason is that in a classic codec, where both the original data to be encoded and the reference data are available, it is easy to find the best possible compensation and then to encode the prediction error. In a distributed setting, the encoding is performed using only the WZ data, so compensation must be performed at the decoder. It is relatively easy to encode the original data in a WZ fashion assuming that the compensation is already done, but it is much more difficult to perform the compensation at the decoder, because the original data is not yet available. Compensation is thus the first crucial difficulty of DVC, and it is still not well understood how this problem can be efficiently solved. In the PRISM codec, the compensation process is actually bypassed by means of a looped correction process with a CRC check for detecting successful decoding.² Thus, PRISM is not interested in estimating correct motion or disparity, but relies only on the hypothesis that a good prediction will be available that will allow the WZ decoding. When this does not happen—for example, because the parity bits do not suffice—both the parity bits and the CRC are useless. That is, not only can the information not be corrected successfully, but there is not even an estimate of a possible compensation to apply to the reference to approximate the WZ data. In the basic implementation of the Stanford codec, compensation is performed using only the key frames; thus it is not based on information about the WZ data. As noted, in some extensions of the original codec it was proposed that the encoder could send a hash (i.e., a coarse description of the WZ frame) to help the compensation process [2]. This solution, however, has not been studied in terms of a compression-efficiency trade-off. More precisely, it has never been theoretically studied as a distributed coding strategy, but only as a heuristic to adopt before considering the proper distributed source-coding problem. This can be accepted, but it must be clear that if those coarse descriptions allow the disparity to be found using classic estimation techniques such as block matching, those images are themselves correlated, and encoding them in a classic fashion is surely suboptimal. It is thus reasonable to consider that with this technique a concrete portion of the similarity between the images is not necessarily exploited. The reader is referred to Dalai and Leonardi [12] for a more detailed discussion of this point. Furthermore, every solution proposed for the compensation problem has an indirect impact on possible solutions to the correction problem. Consider the two single-camera architectures described previously. PRISM tries to perform the compensation jointly with the correction, while the Stanford decoder first completely compensates a frame and then corrects it. Both choices have pros and cons. The PRISM codec has the advantage that it allows the use of only one frame as a reference, guessing the motion during WZ
² From a theoretical point of view, it is interesting to note that the CRC is not so different from a channel code. Thus, one may say that, instead of using a hard decision with a CRC, it would be better to increase the channel code correction capability.
FIGURE 11.12 Rate distortion performance (PSNR in dB versus rate in kbps) of the WZ codec developed by the European project DISCOVER, compared with H.264/AVC (no motion), H.264/AVC (intra), and H.263 (intra). The results refer to sequences (a) Hallmonitor and (b) Foreman, in QCIF format at 15 frames per second [3].
decoding; the Stanford solution requires the use of more than one reference frame and estimates the motion separately. The first choice is DSC oriented, since the motion itself is part of the information ideally encoded in a distributed way. However, the motion search embedded in the WZ correction phase prevents PRISM from using WZ codes on large blocks, since an exhaustive search of the motion within the combination of all possible motion fields would not be feasible. The Stanford scheme, on the contrary, allows more powerful channel codes to be used since the correction is applied to the whole frame, and it takes advantage of the large data size to exploit the "average" correlation. PRISM's requirement of performing the correction independently on small blocks is a great penalty, since it is surely more difficult to efficiently estimate the correlation separately for each block. Because the block size is very small, it is not possible to invoke the laws of large numbers to exploit the average correlation. The second related difficulty in DVC is the problem of rate allocation. As stated, even after the compensation is performed, the correlation between the WZ data and the SI, and thus the required rate for the correction operation, is uncertain. This has an important impact on the allocation of rate, since an underestimate of the required rate leads to failures in the correction process. It is not clear how to achieve a gradual degradation of the quality of the decoded data with the reduction of the rate. There is usually a threshold below which the decoding fails and returns useless information and above which the decoding is successful but the quality does not increase further with the rate. This problem is clearly perceived in the codecs presented. The Stanford solution bypasses the problem by means of a return channel. Strictly speaking, this is an unfair solution in the context of DSC. It can of course be a reasonable approach for specific application scenarios, but it changes the theoretical setting of the problem. In the PRISM codec, the problem is that, in simulations, the real performance of the coding of the INTER blocks is low, often lower than that of the INTRA blocks. It should be clear that the two proposed codecs are the first important steps toward the realization of DVC systems and that improvements have been achieved by different research groups [22]. However, many fundamental problems still remain to be solved. In order to provide a comparison between the performance of a distributed codec and that of classic codecs, Figure 11.12 shows the compression performance for sample video sequences obtained with the software developed by DISCOVER, a European project funded by the European Commission IST FP6 Programme.³
11.5 APPLYING DVC TO MULTI-VIEW SYSTEMS
The previous section introduced DVC using two important examples of single-source DVC from the literature. In this section we provide an introduction to DVC techniques
³ The DISCOVER software started with the so-called IST-WZ software developed by the Image Group at the Instituto Superior Técnico (IST), Lisbon, Portugal (http://amalia.img.lx.it.pt), by Catarina Brites, João Ascenso, and Fernando Pereira.
in multi-view video coding. The idea of using DVC in multi-camera systems appeared as an appealing option with respect to the H.264 multi-view (MV) extensions. Compared to single-camera DVC, multi-camera DVC is clearly more representative of the application of DSC to video, since there is indeed a clear advantage in compressing different correlated sources without communication between encoders. The two motivations for single-camera DVC—that is, the possibility of flexibly allocating the complexity between encoder and decoder and the error resilience—are still relevant to multi-camera problems, but the possibility of exploiting inter-camera correlation without communication between cameras is in many cases of broader interest. This can lead to some important differences between the practical implementation of multi-camera DVC systems and that of single-camera DVC systems, as we shall discuss later. What is not different, however, is the general philosophy behind the underlying coding paradigm.
11.5.1 Extending Mono-View Codecs
There is of course a great number of possible scenarios in multi-camera systems, so it is not possible to provide a general treatment of multi-camera DVC without specifying which configurations are to be considered. For example, one may consider many cameras positioned at regularly spaced points, all with the same importance, that must communicate with a single receiver; or one can select a system based on different types of cameras that operate differently. A configuration often considered is one in which some cameras—usually called intra-cameras—encode their video sources in a classic sense; these sources are used at the decoder as SI for other cameras—called WZ cameras—that operate in a WZ fashion. The two architectures described previously for single-source DVC can also be used for multi-camera DVC. The side information for the WZ (block of) frames can comprise in this case both frames from the intra-cameras and intra-coded key frames from WZ cameras. There is no single possible choice for this, and we only provide here an example based on publications by the same research groups that proposed the original single-view codecs (e.g., [30, 31, 38]).

The PRISM codec can be extended, as proposed in Yeo and Ramchandran [30], to deal with multi-camera systems. The extension can be easily defined on a system with an intra-camera and a WZ camera. The intra-camera provides side information for the WZ camera. This means that, for the decoding of a WZ frame at the generic time instant t, the side information available to the decoder comprises not only the previously decoded frame of the same camera but also the frame from the intra-camera at the same instant t. This implies a minimal modification of the codec with respect to the single-camera case, the only difference being that for every WZ block, in addition to the usual PRISM motion search, there is a disparity search to detect estimators from the different views rather than from the previous frame (see Figure 11.13). It is worth noting that if the relative positions of the cameras are known, it is possible to reduce the region of the disparity search in the intra-view frame, using multi-view geometry, to a segment of the epipolar line associated with the position of the WZ block.

The codec proposed at Stanford can also be extended to multi-view scenarios. In Zhu et al. [38], the authors propose the use of a generalized version of the original codec in large camera arrays. The idea is that in a large camera array some cameras can be used
FIGURE 11.13 Correspondence search in the PRISM decoder in a multi-view setup: for each WZ block, a motion search in the previously decoded frame (time instant t−1) is combined with a disparity search, along the epipolar line, in the intra-camera frame at time instant t.

FIGURE 11.14 Stanford's codec in the multi-view setting: the side information for the WZ camera at time instant t combines a motion search between its frames at time instants t−1 and t+1 with a disparity search in the frames of the two intra-cameras.
as intra-cameras and the remaining ones as WZ cameras. The encoding of the WZ frames then proceeds as for the single-camera codec, while the decoding differs in the generation of the side information. Indeed, instead of only using an interpolation between key frames to construct the approximation, it is possible to use the intra-camera views to
generate a rendered view of the WZ frame, which is an additional approximation available for use in the WZ decoding (see Figure 11.14). As already mentioned regarding the motion interpolation used in the single-view codec, many technical details need to be worked out with respect to the rendering method. It is useful to say here that in practical contexts there is usually a higher correlation between frames of the same sequence than between frames of different sequences. In any case, the technique used for the generation of the SI has a dramatic effect on the quality of the obtained approximation and thus on the performance of the codec [4]. Note, however, that these details are not usually specified in papers dealing with DVC, which contributes to the difficulty of properly evaluating the performance of different implementations of any proposed architecture.
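As an illustration of the kind of correspondence search described above—a motion search in the previously decoded frame of the same camera combined with a disparity search in the intra-camera frame—the following sketch performs a brute-force SAD search over a list of candidate reference frames. It is not the PRISM or Stanford code; the block size, the search radius, and the use of a full rectangular window (rather than an epipolar segment) are simplifying assumptions.

```python
import numpy as np

def best_predictor(block, references, center, radius=8):
    """Brute-force SAD search for the best matching block among several
    reference frames (e.g., the previous frame of the same camera and the
    intra-camera frame at the same time instant)."""
    b = block.shape[0]                      # square b-by-b block assumed
    y0, x0 = center
    best_sad, best_patch = np.inf, None
    for ref in references:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = y0 + dy, x0 + dx
                if 0 <= y and 0 <= x and y + b <= ref.shape[0] and x + b <= ref.shape[1]:
                    patch = ref[y:y + b, x:x + b]
                    sad = np.abs(patch.astype(int) - block.astype(int)).sum()
                    if sad < best_sad:
                        best_sad, best_patch = sad, patch
    return best_patch
```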
11.5.2 Remarks on Multi-View Problems
The architectures for single- and multiple-camera systems based on the PRISM and Stanford codecs have been intensively studied by many research groups. There are so many details that can be implemented in different ways, and variations that can easily be incorporated into the same schemes, that a complete discussion of all possible combinations is impossible here. We do think it is necessary to make clear that the problem of finding a satisfactory approach to DVC is still unsolved. Both the PRISM and Stanford codecs suffer from problems of practical usability in real contexts.

The main problems have already been presented for the single-camera systems, but in the multi-view case they assume a greater importance. Recall that we mentioned basically two problems—the difficulty of performing compensation at the decoder and the difficulty of allocating, at the encoder, the rate required for data correction, because of the "unknown" correlation between the WZ data and the side information. In single-camera systems these problems can be mitigated if a trade-off is allowed in the requirements. That is, since there is a unique source to be compressed, the use of DVC is motivated by computational complexity allocation and error resilience. If a certain complexity is allowed at the encoder and more importance is given to error robustness, then it is possible, in the encoding phase, to perform a number of operations that can help the encoder estimate the motion so as to facilitate the decoder's task and also appropriately estimate the rate required in the decoding phase. The encoding–decoding technique for the correction phase can thus still be based on WZ principles, but at least the rate allocation and compensation problems are somewhat mitigated. In a multi-view scenario there is no such possibility unless the relative positions of the various cameras and the scene depth field are known. Given that the different sources are available at different encoders, it is in no way possible to trade computational complexity for disparity estimation or for a better estimate of the required rate.

Consider for example the PRISM codec. Suppose a given block in a WZ frame has no good predictor in the side information frames because of occlusions. In a single-camera system this situation can be detected if certain operations, such as a coarse motion search, can be performed by the encoder. In a multi-camera system, however, it is not possible to determine whether a certain block is present in the intra-camera side information or not. This of course implies that it is impossible to efficiently apply DSC principles at the block level. This problem may be partially solved by the Stanford approach, since the WZ decoding operates at the frame level and thus exploits "average correlation" with the SI
in a frame. The Stanford codec, however, has the problem that the compensation at the decoder is performed entirely before the WZ decoding and can thus be based only on a priori information about the geometrical deformations to be applied to the intra-camera views to estimate the WZ camera view. This implies that the solution cannot be flexible in realistic cases. As for the mono-view case, one may consider encoding a coarse low- or high-pass description of the WZ frame to be sent from encoder to decoder and used for compensation. However, this sidesteps the real challenge of applying DSC to multiple-view video coding, since a great deal of the work consists precisely in exploiting the geometrical similarities between different views in a distributed fashion.
11.6 CONCLUSIONS
In this chapter we introduced DVC, which has been one of the most studied topics in the field of video coding for some time. The main difference between classic coding techniques and DVC is that the predictive coding used in classic approaches is replaced in DVC by a completely different framework, in which it is the decoder's task to find similarities between already encoded portions of data. We showed that channel codes can be used in the case of binary data, and we gave examples of video codecs that use channel codes as basic tools to apply distributed compression to some portions of the video data, after appropriate transform and quantization. As noted, the examples discussed for single-source coding are also meaningful for multi-view systems, but different strategies can be investigated.
REFERENCES
[1] A. Aaron, R. Zhang, B. Girod, Wyner-Ziv coding for motion video. Asilomar Conference on Signals, Systems and Computers, 2002.
[2] A. Aaron, S. Rane, B. Girod, Wyner-Ziv video coding with hash-based motion-compensation at the receiver, in: Proceedings of the IEEE International Conference on Image Processing, 2004.
[3] X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov, M. Ouaret, The DISCOVER codec: architecture, techniques and evaluation. Picture Coding Symposium, 2007.
[4] X. Artigas, F. Tarres, L. Torres, Comparison of different side information generation methods for multiview distributed video coding, in: Proceedings of the International Conference on Signal Processing and Multimedia Applications, 2007.
[5] J. Ascenso, C. Brites, F. Pereira, Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding, in: Proceedings of the fifth EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, 2005.
[6] T. Berger, Rate-Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, 1971.
[7] R.E. Blahut, Theory and Practice of Error-Control Codes. Addison-Wesley, 1983.
[8] C. Brites, J. Ascenso, F. Pereira, Feedback channel in pixel domain Wyner-Ziv video coding: Myths and realities, in: Proceedings of the fourteenth European Signal Processing Conference, 2006.
[9] C. Brites, F. Pereira, Encoder rate control for transform domain Wyner-Ziv video coding, in: Proceedings of the IEEE International Conference on Image Processing, 2007.
[10] T.M. Cover, A proof of the data compression theorem of Slepian and Wolf for ergodic sources, IEEE Transactions on Information Theory 22 (1975) 226–228.
[11] T.M. Cover, J.A. Thomas, Elements of Information Theory. Wiley, 1990.
[12] M. Dalai, R. Leonardi, Minimal information exchange for image registration, in: Proceedings of sixteenth European Signal Processing Conference, 2008.
[13] F. Dufaux, T. Ebrahimi, Error-resilient video coding performance analysis of motion JPEG2000 and MPEG-4, in: Proceedings of SPIE Visual Communications and Image Processing, 2004.
[14] R.G. Gallager, Information Theory and Reliable Communication. Wiley, 1968.
[15] N. Gehrig, P.L. Dragotti, Symmetric and a-symmetric Slepian-Wolf codes with systematic and non-systematic linear codes, IEEE Communications Letters 9 (1) (2005) 61–63.
[16] B. Girod, A. Aaron, S. Rane, D. Rebollo-Monedero, Distributed video coding, in: Proceedings of the IEEE 93 (1) (2005) 71–83.
[17] C. Guillemot, F. Pereira, L. Torres, T. Ebrahimi, R. Leonardi, J. Ostermann, Distributed monoview and multiview video coding, IEEE Signal Processing Magazine 24 (5) (2007) 67–76.
[18] D. Kubasov, K. Lajnef, C. Guillemot, A hybrid encoder/decoder rate control for Wyner-Ziv video coding with a feedback channel. International Workshop on Multimedia Signal Processing, 2007.
[19] A. Vetro, P. Pandit, H. Kimata, A. Smolic, Y.-K. Wang, Joint Multiview Video Model JMVM 8.0. ITU-T and ISO/IEC Joint Video Team, Document JVT-AA207, April 2008.
[20] B. McMillan, The basic theorems of information theory, Annals of Mathematical Statistics 24 (2) (1953) 196–219.
[21] P. Merkle, K. Müller, A. Smolic, T. Wiegand, Efficient compression of multi-view video exploiting inter-view dependencies based on H.264/MPEG4-AVC, in: IEEE International Conference on Multimedia and Exposition, 2006.
[22] F. Pereira, C. Brites, J. Ascenso, M. Tagliasacchi, Wyner-Ziv video coding: a review of the early architectures and relevant developments, in: IEEE International Conference on Multimedia and Exposition, 2008.
[23] S.S. Pradhan, K. Ramchandran, Distributed source coding using syndromes (DISCUS): design and construction, IEEE Transactions on Information Theory 49 (3) (2003) 626–643.
[24] S.S. Pradhan, K. Ramchandran, Generalized coset codes for distributed binning, IEEE Transactions on Information Theory 51 (10) (2005) 3457–3474.
[25] R. Puri, K. Ramchandran, PRISM: A new robust video coding architecture based on distributed compression principles, in: Proceedings of fortieth Allerton Conference on Communication, Control and Computing, 2002.
[26] R. Puri, A. Majumdar, K. Ramchandran, PRISM: A video coding paradigm with motion estimation at the decoder, IEEE Transactions on Image Processing 16 (10) (2007) 2436–2448.
[27] I.E.G. Richardson, H.264 and MPEG-4 Video Compression. Wiley (UK), 2003.
[28] D. Schonberg, S.S. Pradhan, K. Ramchandran, Distributed code constructions for the entire Slepian-Wolf rate region for arbitrarily correlated sources, in: Proceedings of the Data Compression Conference, 2004.
[29] D. Slepian, J.K. Wolf, Noiseless coding of correlated information sources, IEEE Transactions on Information Theory 19 (4) (1973) 471–480.
[30] C. Yeo, K. Ramchandran, Robust distributed multi-view video compression for wireless camera networks, in: Proceedings of SPIE Visual Communications and Image Processing, 2007.
[31] C. Yeo, J. Wang, K. Ramchandran, View synthesis for robust distributed video compression in wireless camera networks, in: Proceedings of the IEEE International Conference on Image Processing, 2007.
[32] G.K. Wallace, The JPEG still picture compression standard, Communications of the ACM 14 (4) (1991) 31–44.
[33] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (7) (2003) 560–576.
[34] A.D. Wyner, Recent results in the Shannon theory, IEEE Transactions on Information Theory 20 (1) (1974) 2–10.
[35] A.D. Wyner, J. Ziv, The rate distortion function for source coding with side information at the receiver, IEEE Transactions on Information Theory 22 (1) (1976) 1–11.
[36] A.D. Wyner, The rate-distortion function for source coding with side information at the decoder-II: General sources, Information and Control 38 (1) (1978) 60–80.
[37] R. Zamir, The rate loss in the Wyner-Ziv problem, IEEE Transactions on Information Theory 42 (6) (1996) 2073–2084.
[38] X. Zhu, A. Aaron, B. Girod, Distributed compression for large camera arrays, in: Proceedings of the IEEE Workshop on Statistical Signal Processing, 2003.
[39] Video codec for audiovisual services at p×64 kbit/s, ITU-T Recommendation H.261, 1990.
[40] T. Wiegand, G.J. Sullivan, J.R. Ohm, A.K. Luthra (eds.), Special issue on scalable video coding—standardization and beyond, IEEE Transactions on Circuits and Systems for Video Technology 17 (9) 2007.
[41] ITU-T Rec. and ISO/IEC 14496-10 AVC, Advanced video coding for generic audiovisual services, 2005.
CHAPTER 12
Distributed Compression in Multi-Camera Systems
Pier Luigi Dragotti
Communications and Signal Processing Group, Electrical and Electronic Engineering Department, Imperial College London
Abstract
Transmission of video data from multiple cameras requires a huge amount of bandwidth. If camera nodes could communicate with each other, it would be possible to develop algorithms that compress the data and so reduce bandwidth requirements. However, such collaboration is usually not feasible, and compression has to be performed independently at each node. The problem of compressing correlated information in a distributed fashion is known as distributed source coding. In this chapter we review recent developments in the distributed compression of multi-view images and multi-view video sequences.
Keywords: distributed source coding, multi-view imaging, image compression, plenoptic function
12.1 INTRODUCTION
Multi-camera systems are becoming more and more popular and enable important applications such as environment browsing and monitoring. One day they will also enable free-viewpoint television. However, transmission of video data from multiple cameras requires a huge amount of bandwidth. The data acquired by the video sensors is clearly highly correlated. Thus, if sensors could communicate among themselves, it would be possible to develop a compression algorithm that fully exploits this correlation, and the bandwidth requirements would be reduced. However, such a collaboration is usually not feasible since it requires an elaborate inter-sensor network. This limitation raises the problem of finding the best way to compress this correlated information without requiring communication between camera nodes. The problem of compressing correlated information in a distributed fashion is known as distributed source coding (DSC) and has its theoretical foundations in two papers that appeared in the 1970s [27, 39]. The topic remained dormant for more than two decades, and it was only toward the end of the nineties that the full potential of this technique was realized. Since then the interest in DSC has increased dramatically and the research community has made striking advances in all aspects, from theory to constructive algorithms
to several practical applications. We refer to Dragotti and Gastpar [9] for a recent overview that encompasses all aspects of DSC and summarizes the main contributions of the current decade. In this chapter, we study in detail the distributed compression of visual data acquired by multi-camera systems. Central to any compression algorithm is understanding the correlation in the data to be compressed. The structure of the data acquired by a multi-camera system is well modeled by the plenoptic function, which was introduced in Adelson and Bergen [3]. We thus review the main properties of this function and indicate how they can be used for distributed compression. We first focus on the distributed compression of still images taken from multiple viewpoints and then discuss the distributed compression of multi-view video sequences. The underlying message of the chapter is that behind any compression algorithm resides an effort to understand and exploit the plenoptic function. The chapter is organized as follows: In Section 12.2 we provide an overview of DSC theory and highlight the main notions and terminology used in this research. Then, in Section 12.3, we introduce the plenoptic function and analyze its properties. In Sections 12.4 and 12.5, we discuss the compression of multi-view images and multi-view video sequences. We conclude the chapter in Section 12.6.
12.2 FOUNDATIONS OF DISTRIBUTED SOURCE CODING
Data compression is an old and well-studied problem whose precise formulation dates back to the seminal work of Shannon [25]. In the classical source-coding problem, a single encoder has access to the entire source and exploits its redundancy in order to perform compression. However, in many modern compression problems that arise in sensor networks, ad hoc networks, multi-camera systems, and the like, the source is observed at multiple locations, often separated in space, and each observed signal needs to be compressed independently. Distributed source coding studies the problem of compressing such sources in a distributed fashion, normally allowing joint reconstruction at the decoder.

To clarify the distributed compression problem, we begin by considering two discrete memoryless sources X and Y that must be encoded at rates R1 and R2, respectively. Clearly, this can be achieved with no loss of information using R1 ≥ H(X) and R2 ≥ H(Y) bits, where H(·) denotes the source entropy. If X and Y are correlated and a single encoder has access to both, lossless compression is achieved when R1 + R2 ≥ H(X, Y), as illustrated in Figure 12.1(a). In a distributed source coding setup, these two sources are separated and two separate encoders must be used; however, both sources are reconstructed at a single decoder, as illustrated in Figure 12.1(b). In 1973 Slepian and Wolf [27] showed that lossless compression can still be achieved, with R1 and R2 satisfying

R1 ≥ H(X|Y)
R2 ≥ H(Y|X)
R1 + R2 ≥ H(X, Y)
FIGURE 12.1 Slepian-Wolf problem: (a) classical joint-coding and joint-decoding scenario; (b) Slepian-Wolf setup: independent encoding and joint decoding.
FIGURE 12.2 Slepian-Wolf rate region.
This means, surprisingly, that there is no loss in terms of overall rate even though the encoders are separated. The Slepian-Wolf rate region is sketched in Figure 12.2. To convince ourselves that the boundaries of this region are achievable, we consider the following illustrative example [22]. Assume that X and Y are uniformly distributed memoryless binary sources of length 3, and assume that they are correlated in such a way that their Hamming distance is at most 1 (i.e., for any realization of X, Y is either equal to X or differs from it in 1 bit only). This means that, given a certain X, there are only four equiprobable realizations of Y that are compatible with the given X. We thus derive the following entropies: H(X) = H(Y) = 3 bits, H(X|Y) = H(Y|X) = 2 bits, and H(X, Y) = H(X) + H(Y|X) = 5 bits, and we conclude that only 5 bits are necessary for joint lossless compression of the two sources. As proved by Slepian and Wolf, the same compression can be achieved by two independent encoders. The solution consists in grouping the different possible codewords of X into sets called bins. Assume that Y is losslessly transmitted to the decoder using 3 bits, and consider the following set of bins containing all possible outcomes of
X: bin0 = {000, 111}, bin1 = {001, 110}, bin2 = {010, 101}, and bin3 = {100, 011}. Note that the codewords have been placed in the bins so that the Hamming distance between the members of a given bin is maximal (3 in this case). Now, instead of transmitting X to the decoder, only the index of the bin that X belongs to is transmitted (this requires 2 bits only). The decoder can then retrieve the correct X from the two possible candidates by using the received Y. In fact, given Y, only one of the two candidates satisfies the correct Hamming distance with Y. Perfect reconstruction of X and Y is thus achieved using only 5 bits.

There are a few aspects of the example that are worth discussing further. First, one of the two sources is transmitted to the receiver with no error (source Y in the example). Since Y is available at the decoder, this is normally known as the source-coding problem with side information at the decoder, and it corresponds to the corner point in the Slepian-Wolf rate region of Figure 12.2. Notice also that the symmetric case is of great practical importance but is often more difficult to achieve. Second, the methodology presented in the example bears many similarities with channel-coding principles. In channel coding, one tries to retrieve the original transmitted codeword X from the received codeword Y. This is achieved by partitioning all possible input codewords into groups (bins). The bins are designed so that the average distance between the codewords that belong to a certain bin is maximized. In this manner, errors introduced by the noisy channel can be corrected. This connection between source coding with side information and channel coding, highlighted in Figure 12.3, was made more evident by Wyner [38] and Berger [4], and has been used recently to design constructive distributed codes. We refer to Guillemot and Roumy [16] for a good recent overview of practical distributed source coders.
FIGURE 12.3 Channel coding (a) and distributed source coding (b). In channel coding, the syndrome of Y is used to retrieve the original symbol X. In distributed source coding, the syndrome of X is transmitted to the decoder. By observing Y and the syndrome, the decoder can reconstruct X.
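The binning example above is small enough to be verified exhaustively. The following Python sketch is an illustration of the idea, not production Slepian-Wolf code: it encodes X by its 2-bit bin index and recovers X at the decoder as the unique bin member within Hamming distance 1 of the side information Y.

```python
from itertools import product

# The four bins of the example: the two members of each bin are at maximal
# Hamming distance (3), so any Y within distance 1 of X identifies X uniquely
# inside its bin.
BINS = [("000", "111"), ("001", "110"), ("010", "101"), ("100", "011")]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def encode(x):
    """Transmit only the 2-bit bin index of X instead of X itself (3 bits)."""
    return next(i for i, members in enumerate(BINS) if x in members)

def decode(bin_index, y):
    """Recover X from its bin index and the side information Y."""
    candidates = [x for x in BINS[bin_index] if hamming(x, y) <= 1]
    assert len(candidates) == 1        # guaranteed by the correlation model
    return candidates[0]

# Exhaustive check over all (X, Y) pairs with Hamming distance at most 1:
# 2 bits (bin index) + 3 bits (Y) = 5 bits always suffice.
for x in map("".join, product("01", repeat=3)):
    for y in map("".join, product("01", repeat=3)):
        if hamming(x, y) <= 1:
            assert decode(encode(x), y) == x
```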
Finally, notice that the Slepian-Wolf problem focuses on the two-source case. Extensions to an arbitrary number of correlated sources and to ergodic sources were presented by Cover [7, 8]. The case of lossy coding of correlated sources, particularly of continuous-valued sources, is more involved. An important case studied by Wyner and Ziv [39] is one where Y is available at the decoder and X has to be reconstructed within a certain distortion D. In particular, Wyner and Ziv showed that there is no rate loss in the case of mean square error (MSE) distortion and jointly Gaussian sources. That is, in this case it is possible to compress X with an error D at the theoretical conditional rate R_X|Y(D), even if Y is only available at the decoder. More recently, Zamir [42] showed that the rate loss in the general Wyner-Ziv problem with memoryless sources can be no larger than 0.5 bit per source symbol, compared to the joint encoding and decoding scenario, when a squared-error distortion metric is used.

A natural way to deal with lossy distributed compression is to perform independent quantization first and then apply a Slepian-Wolf compression strategy to the quantized (discrete) source. This is, in fact, the strategy commonly used in many practical lossy distributed compression algorithms. The achievable rate distortion bounds for the lossy distributed compression problem and their tightness have been studied in several papers [4, 31, 36]. We refer to Eswaran and Gastpar [10] for a clear account of these recent information-theoretic developments.

The implicit assumption in most works on distributed compression is that the joint statistics of the sources are known at the various encoders and at the decoder. This indicates that it is of central importance to understand and estimate such structure in order to devise successful compression strategies. The structure of the data acquired by a multi-camera system is very particular and will be discussed in the next section.
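As a toy illustration of this quantize-then-bin strategy, the sketch below uses a uniform scalar quantizer and a simple modulo coset index in place of a real Slepian-Wolf code. The step size and the number of cosets are arbitrary choices, and decoding is correct only under the assumption that the side information y lies close enough to x for the coset to resolve the ambiguity.

```python
STEP = 0.5         # quantizer step size (illustrative value)
NUM_COSETS = 4     # number of bins: only log2(4) = 2 bits are sent per sample

def encode(x):
    """Quantize x, then transmit only the coset (bin) index of its cell."""
    q = round(x / STEP)            # independent scalar quantization
    return q % NUM_COSETS          # modulo bin stands in for a Slepian-Wolf code

def decode(coset, y):
    """Pick, among the quantizer cells sharing this coset, the one closest to y."""
    q_y = round(y / STEP)
    base = q_y - (q_y - coset) % NUM_COSETS       # largest coset cell index <= q_y
    q_hat = min((base, base + NUM_COSETS), key=lambda q: abs(q * STEP - y))
    return q_hat * STEP                           # reconstruction of x

# Example with a correlated pair: correct as long as |x - y| is small compared
# with the coset spacing STEP * NUM_COSETS.
x, y = 3.12, 3.30
print(decode(encode(x), y))        # 3.0, as a centralized quantizer would give
```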
12.3 STRUCTURE AND PROPERTIES OF THE PLENOPTIC DATA
At the heart of distributed compression of multi-view data is the characterization of visual information. The data acquired by multiple cameras from multiple viewpoints can be parameterized with a single function, the plenoptic function, introduced by Adelson and Bergen [3] in 1991. The plenoptic function corresponds to the function representing the intensity and chromaticity of light observed from every position and direction in 3D space. It can therefore be parameterized as a 7D function:

IPF = P(θ, φ, λ, t, Vx, Vy, Vz).

As illustrated in Figure 12.4, the three coordinates (Vx, Vy, Vz) correspond to the position of the camera; θ and φ represent the viewing direction; t is the time; and λ corresponds to the frequency considered. The measured parameter IPF is simply the intensity of the light observed under these parameters. The high dimensionality of the plenoptic function makes it difficult to handle. Many assumptions, however, can be made to reduce its dimensionality. For example, McMillan and Bishop introduced plenoptic modeling [21], where the wavelength is omitted (i.e., gray-scale images or separate RGB channels are considered) and static scenes are considered. This reduces the parameterization to five dimensions. Moreover, the camera position can be constrained to a plane, a line, or a point to further remove one, two, or
FIGURE 12.4 Plenoptic function.
FIGURE 12.5 (a) 4D lightfield parameterization model where each light ray is characterized by its intersections with the camera and image planes. (b) Epipolar (v-t) plane image of a scene with two objects at different depths (the s and u coordinates are fixed in this case).
three dimensions, respectively. The case in which cameras lie on a plane leads to the 4D lightfield parameterization introduced by Levoy and Hanrahan [19]. This parameterization is obtained by using two parallel planes: the focal plane (or camera plane) and the retinal plane (or image plane). A ray of light is therefore parameterized by its intersection with these two planes, as shown in Figure 12.5. The coordinates (s, t) in the focal plane correspond to the position of the camera, while the coordinates (u, v) in the retinal plane give the point in the corresponding image. By further restricting camera locations to a line, we obtain the epipolar plane image (EPI) volume [5], which is a 3D plenoptic function. The EPI has a structure that is similar to that of a video sequence, but the motion of the objects can be fully characterized by their positions in the scene. A 2D slice of the EPI is shown in Figure 12.6. Notice that in this setup points in the world scene are converted into lines in the plenoptic domain.
FIGURE 12.6 2D plenoptic function of two points of a scene. The t-axis corresponds to the camera position; the v-axis corresponds to the relative positions on the corresponding image. A point of the scene is represented by a line, since the difference between the positions of a given point on two images satisfies the relation (v − v′) = f(t − t′)/z, where z is the depth of the point and f is the focal length of the cameras.
In fact, by using similar triangles and a pinhole camera model, one can easily show that the difference between the positions of a given point on two different images satisfies the following relation:

(v − v′) = f(t − t′)/z          (12.1)
where z is the depth of the point and f is the cameras' focal length. Alternative 3D plenoptic functions can be obtained, for example, by placing cameras on a circle, oriented toward the outside of the circle. Finally, if we constrain the camera position to a single point, we have a 2D function that is in fact a common still image. For the rest of the chapter we will concentrate on the EPI volume parameterization. In the next section we will consider static scenes, leading to the classical 3D parameterization (x, y, Vx). In Section 12.5 we will add the time parameter and focus on the distributed compression of dynamic scenes.
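To make equation 12.1 concrete, the short sketch below evaluates the image coordinate of a single world point as the camera moves along the line; the focal length, the depth, and the reference view are arbitrary illustrative values, not taken from the chapter. The point indeed traces a straight line of slope f/z in the (t, v) plane.

```python
def image_coordinate(t, v_ref, t_ref, f, z):
    """Image position v of a world point in the view at camera position t,
    given its position v_ref in the reference view at t_ref (equation 12.1)."""
    return v_ref + f * (t - t_ref) / z

# Illustrative values: focal length f = 1.0 and a point at depth z = 4.0
# observed at v = 0.2 in the reference view taken at t = 0.
for t in (0.0, 1.0, 2.0, 3.0):
    print(t, image_coordinate(t, v_ref=0.2, t_ref=0.0, f=1.0, z=4.0))
# v grows by f/z = 0.25 per unit of camera displacement: a straight EPI line.
```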
12.4 DISTRIBUTED COMPRESSION OF MULTI-VIEW IMAGES
In this section, we consider the problem of compressing still images acquired by a multi-camera system. Most of the algorithms proposed recently use sophisticated channel codes to perform distributed compression [1, 18, 20, 32–34, 43]. A different way of modeling the dependency among images, based on redundant dictionaries, was proposed in Tosic and Frossard [35]; however, that work focuses on omnidirectional cameras, which are beyond the scope of this chapter. In Gehrig and Dragotti [12, 13], a novel scheme that exploits the properties of the plenoptic function was proposed. The geometrical setup considered is similar to the one sketched in Figure 12.6: N cameras are placed on a horizontal line, all pointing in the same direction (perpendicular to the line of the cameras). The distance between two consecutive cameras is denoted α. Finally, it is assumed that the objects in the scene have a depth bounded between zmin and zmax, as shown in Figure 12.7.
FIGURE 12.7 Multi-camera system from Gehrig and Dragotti [12, 13].
According to the epipolar geometry constraint of equation 12.1, we know that the difference between the positions of a specific point on the images obtained from two consecutive cameras is equal to Δ = αf/z, where z is the depth of the object and f is the focal length of the cameras. Given α, the disparity Δ depends only on the distance z of the point from the focal plane. Therefore, if by hypothesis there is a finite depth of field—that is, z ∈ [zmin, zmax]—then there is a finite range of possible disparities to be encoded, irrespective of how complicated the scene is. This means that, ideally, one image may be compressed using a centralized compression algorithm, while the only information that must be transmitted by the other cameras is the disparity. The key insight, that the disparity is bounded, is used in Gehrig and Dragotti [12, 13] to develop a distributed image compression algorithm. It is also of interest to point out that the assumption of a finite depth of field is not new; it was previously used by Chai et al. to develop new schemes for the sampling of the plenoptic function [6].
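The bounded-disparity argument can be checked with a few lines of arithmetic. In the sketch below the camera spacing, focal length, and depth range are made-up values; it simply evaluates Δ = αf/z at the two depth limits to obtain the interval [Δmin, Δmax] into which every disparity in the scene must fall.

```python
def disparity(alpha, f, z):
    """Disparity between two consecutive cameras for a point at depth z."""
    return alpha * f / z

# Illustrative setup (values are not taken from the chapter).
alpha, f = 0.1, 1.0            # camera spacing and focal length
zmin, zmax = 2.0, 10.0         # assumed finite depth of field

delta_max = disparity(alpha, f, zmin)   # the closest objects move the most
delta_min = disparity(alpha, f, zmax)   # the farthest objects move the least
print(delta_min, delta_max)             # every disparity lies in [0.01, 0.05]
```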
12.4 Distributed Compression of Multi-View Images of the original image. Next, each node of the tree is approximated with a geometrical tile made of two 2D polynomial regions separated by a 1D linear boundary. Then R-D curves for each node of the tree are generated by uniformly quantizing the polynomial coefficients and the coefficients that describe the linear boundary. To optimize bit allocation, an operating slope is chosen and the four children of a node are pruned if their Lagrangian cost is higher than that of the parent node. More precisely, the children are pruned when (Dc1 ⫹ Dc2 ⫹ Dc3 ⫹ Dc4 ) ⫹ (Rc1 ⫹ Rc2 ⫹ Rc3 ⫹ Rc4 ) ⭓ (Dp ⫹ Rp )
(12.2)
where the indices c1 to c4 correspond to the four children. The compression performance is further improved by joining neighboring nodes that do not have a common parent node. Joining is again performed using a Lagrangian minimization. An example of the final decomposition is shown in Figure 12.8; a code sketch of the pruning step is given after the figure caption.

The distributed image-coding extension proposed in Gehrig and Dragotti [12, 13] consists of decomposing each view using the quadtree approach and then transmitting only partial information from each view. More precisely, let us assume that images are exactly piecewise polynomial and that objects in the scene have Lambertian surfaces. This means that the appearance of an object does not change with the viewing position. Now, under these simplifying hypotheses the structure of the quadtree changes from one image to the other according to the epipolar constraint only. Therefore, there is no need to transmit the entire structure from all images, but only a part of it. Moreover, because of the assumption of Lambertian surfaces, the leaf information is common to all images and need be transmitted only once. This is illustrated in Figure 12.9 for the case of 1D signals and a binary tree decomposition. The two signals can be seen as the scanlines of two stereo polynomial images. They differ only in the location of the discontinuities. However, these locations must satisfy the epipolar constraints. Moreover, the amplitude of the polynomial pieces is constant because of the Lambertian hypothesis. Due to the epipolar constraint, the first part of the binary decomposition is common to the two cameras and needs to be transmitted only once. Also, since the polynomial pieces are equal, the leaf
FIGURE 12.8 (a) Geometrical tile model consisting of two 2D polynomial regions, f(x, y) and g(x, y), separated by a 1D linear boundary. (b) Prune-join quadtree decomposition of cameraman with a target bit rate of 0.2 bpp. (c) Reconstructed image and its PSNR = 27.3 dB.
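A minimal sketch of the pruning rule of equation 12.2 follows. It assumes a hypothetical node structure holding the distortion and rate of each node's own tile approximation; all names are illustrative, and only the prune step is shown, not the tile fitting or the later joining stage.

```python
class Node:
    def __init__(self, distortion, rate, children=()):
        self.distortion = distortion    # D of this node's own tile approximation
        self.rate = rate                # R (bits) of that approximation
        self.children = list(children)  # four children, or empty for a leaf

def prune(node, lam):
    """Bottom-up pruning: keep the four children only if their total Lagrangian
    cost D + lam * R is lower than that of the parent tile (equation 12.2).
    Returns the best cost of the subtree rooted at `node`."""
    if not node.children:
        return node.distortion + lam * node.rate
    children_cost = sum(prune(child, lam) for child in node.children)
    parent_cost = node.distortion + lam * node.rate
    if children_cost >= parent_cost:    # children do not pay off: prune them
        node.children = []
        return parent_cost
    return children_cost
```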
FIGURE 12.9 Binary tree decomposition of two piecewise constant signals satisfying the correlation model of equation 12.1. The upper part of the tree is the same.
information must be transmitted from the first encoder only. In the case of polynomial images, the full description of the first view from the first encoder is transmitted. That is, the quadtree structure is transmitted together with the tile approximation used for each leaf. The other cameras send only the subtrees of the quadtree structure with the root node at level JΔ = log2(T/(Δmax − Δmin)) + 1, along with the joining information and the coefficients representing the 1D boundaries. Here Δmin and Δmax are the minimum and maximum possible disparities for the geometrical setup considered. The decoder is then able to retrieve the complete tree structure and the complete tile description at each leaf by using the first image as side information and the epipolar constraint of equation 12.1. This strategy leads to a substantial saving of bits and can be shown to be optimal when the preceding assumptions are satisfied exactly. In practice, natural images satisfy only
approximately the piecewise polynomial model. Moreover, real surfaces are not exactly Lambertian. For this reason, some extra information is transmitted by each encoder to increase the robustness of the proposed quadtree algorithm. More precisely, the entire quadtree structure is transmitted by each encoder and partial information from each leaf is transmitted. In this way the matching of each leaf on different images is achieved with no error and good reconstruction quality is still obtained, with some bit saving. Figures 12.10 and 12.11 show results obtained on a sequence of six consecutive views, where the first and sixth views are fully transmitted and only the quadtree structure,
FIGURE 12.10 Distributed versus independent compression encoding of six views: global distortion in PSNR (dB) as a function of the average bit rate (bpp). Results obtained with a JPEG2000 encoder are also shown.
FIGURE 12.11 Reconstruction of the fifth view. (a) Independent encoding at 0.08 bpp with JPEG2000 (PSNR = 27.18 dB). (b) Independent encoding at 0.08 bpp with the prune-join quadtree coder (PSNR = 28.24 dB). (c) Distributed encoding with the Gehrig and Dragotti approach [12, 13] at an average of 0.08 bpp (PSNR = 30.02 dB).
together with partial information on the leaf description, is transmitted from the other four views. Figure 12.10 reveals that the Gehrig and Dragotti approach [12, 13] outperforms independent encoding of the six views for the entire range of considered bit rates. In Figure 12.11 one of the views considered is shown. One can clearly see an improvement in the visual quality of the image compressed with this new scheme.
12.5 MULTI-TERMINAL DISTRIBUTED VIDEO CODING
Multi-camera systems normally monitor dynamic scenes; therefore, the time dimension is of fundamental importance and cannot be omitted. This means that one has to add one dimension to the plenoptic function studied so far. In other words, if we keep assuming that cameras are along a line, as in the previous section, then the plenoptic function has four dimensions: (x, y, Vx, t). If we assume that cameras are placed on a plane or can be placed anywhere in free space, then the dimensions increase to five and then six. For the sake of clarity, we stay with the EPI model and add only the time dimension. The data acquired in this context is therefore correlated over time (inter-frame correlation) and space (inter-view correlation, or correlation along Vx). The issue now is to find ways to exploit this dependency, possibly in a distributed fashion.

While it is natural to accept that each video should be compressed independently, since cameras are normally far apart and cannot communicate among themselves, it is probably less intuitive to accept the idea that each frame in the same video sequence may be compressed independently as well. In fact, the problem of mono-view distributed video compression has always been considered before that of multi-view compression. In distributed video coding (DVC) each frame of a single video stream is encoded separately, and the encoder treats past and/or future frames as side information when encoding the current frame. The frames encoded using the knowledge that the past or future frame will be available at the decoder are sometimes called Wyner-Ziv frames, since this problem is clearly linked to the problem of compression with side information at the decoder that Wyner and Ziv first introduced. The advantage of compressing each frame independently is that the video sequence is encoded without motion estimation and is therefore much less complex—all of the computational complexity is moved to the decoder. At the same time, the theory of distributed source coding, discussed in Section 12.2, indicates that this approach theoretically incurs a very small rate penalty; indeed, if the error between frames is Gaussian, an approach where the motion compensation is moved to the decoder potentially incurs no rate penalty at all. In battery-powered wireless camera networks, it is of fundamental importance, as a means of reducing power consumption, to compress each video effectively but in the most power-efficient way. In this respect DVC represents in principle the right technology to meet both constraints. The first distributed video-coding schemes were presented by various authors [2, 14, 23, 24], and many other approaches have since been proposed. We refer to Guillemot et al. [15] for a good recent overview of the topic.

The distributed compression of multi-view video was considered in various papers [11, 17, 28, 29, 37, 40, 41], and the strategies proposed are quite different.
For example, in Yeo and Ramchandran [41] it was argued that correlation in time is much higher than correlation in space, and it was therefore suggested that devising a distributed compression strategy for inter-view compression is unnecessary and that it is useful only to have distributed video encoders. In this work inter-view correlation was used only to increase robustness against channel errors. More precisely, if some information in one view is lost during transmission, the other received views are used to synthesize the lost view. In this way, increased robustness against transmission errors is achieved.

The work in Song et al. [28, 29] used epipolar constraints to develop an efficient spatio-temporal distributed compression algorithm. In this respect it recalls in spirit the work discussed in the previous section. The encoding strategy of the authors allows for an exchange of information between sensors, which helps to simplify the reconstruction at the decoder.

Yang et al. [40] proposed a multi-terminal video-coding scheme that outperforms separate H.264 coding of two stereo video sequences [30]. Encoders are not allowed to communicate, but the 3D geometric information on camera location is assumed to be known by the decoder. One of the two video sequences is encoded using standard H.264 and plays the role of side information at the decoder. A low-quality version of the I-frame of the second sequence is transmitted, and the decoder estimates the disparity using the first sequence and the low-resolution I-frame received. The other frames of the second sequence are encoded using distributed compression principles.

Finally, in the work of Flierl and Vandergheynst [11], a motion-compensated spatio-temporal wavelet transform was used to explore the dependencies in each video signal. As with the scheme of Yang et al. [40], one video signal was encoded with conventional source coding and was used as side information at the decoder. Channel coding was then applied to encode the transform coefficients of the other video signals. Interestingly, it was shown that, at high rates and under some assumptions on the spatio-temporal model, the optimal motion-compensated spatio-temporal transform for such a coding structure is the Haar wavelet.
12.6 CONCLUSIONS
The availability of images from multiple viewing positions points to a variety of interesting new applications, such as monitoring, remote education, and free-viewpoint video. However, the amount of data acquired by such systems is huge, and it is therefore of paramount importance to develop efficient compression algorithms. The compression problem is particularly difficult in this context since each camera is normally not allowed to communicate with the other sensors. Compression has to be performed at each node, yet we expect some of the inter-node correlation to be exploited. The only way for a node to perform effective distributed compression is by predicting the structure of the correlation from the configuration of the multi-camera system and the available local information. We highlighted here how the plenoptic function plays a central role in this and how its properties can be used for distributed compression. Many new distributed compression algorithms have been proposed recently and show encouraging results. One day, hopefully, they will be used in commercial systems.
Acknowledgments. The author would like to thank Dr. J. Berent and Dr. N. Gehrig for providing the figures for this chapter.
REFERENCES
[1] A. Aaron, P. Ramanathan, B. Girod, Wyner-Ziv coding of light fields for random access, in: IEEE International Workshop on Multimedia Signal Processing, 2004.
[2] A. Aaron, R. Zhang, B. Girod, Wyner-Ziv coding of motion video, in: Proceedings of the IEEE Asilomar Conference on Signals and Systems, 2002.
[3] E.H. Adelson, J. Bergen, The plenoptic function and the elements of early vision, in: Computational Models of Visual Processing, MIT Press, 1991.
[4] T. Berger, Multiterminal source coding. Lectures presented at CISM summer school on the Information Theory Approach to Communications, July 1977.
[5] R.C. Bolles, H.H. Baker, D.H. Marimont, Epipolar-plane image analysis: An approach to determining structure from motion, International Journal of Computer Vision 1 (1987) 7–55.
[6] J.-X. Chai, S.-C. Chan, H.-Y. Shum, X. Tong, Plenoptic sampling, in: Proceedings of the ACM twenty-seventh Annual Conference on Computer Graphics and Interactive Techniques, 2000.
[7] T. Cover, A proof of the data compression theorem of Slepian and Wolf for ergodic sources, IEEE Transactions Information Theory 21 (1975) 226–228.
[8] T. Cover, J.A. Thomas, Elements of Information Theory, Wiley, 1991.
[9] P.L. Dragotti, M. Gastpar, Distributed Source Coding: Theory, Algorithms and Applications, Academic Press, Elsevier, 2009.
[10] K. Eswaran, M. Gastpar, Foundations of distributed source coding, in: P.L. Dragotti, M. Gastpar (Eds.), Distributed Source Coding: Theory, Algorithms and Applications, Elsevier, 2009.
[11] M. Flierl, P. Vandergheynst, Distributed coding of highly correlated image sequences with motion-compensated temporal wavelets, Eurasip Journal Applied Signal Processing (2006) 1–10.
[12] N. Gehrig, P.L. Dragotti, Distributed compression of multi-view images using a geometric approach, in: Proceedings of the IEEE International Conference on Image Processing, 2007.
[13] N. Gehrig, P.L. Dragotti, Geometry-driven distributed compression of the plenoptic function: Performance bounds and constructive algorithms, IEEE Transactions on Image Processing 18 (3) (2008).
[14] B. Girod, A. Aaron, S. Rane, D. Rebollo-Monedero, Distributed video coding, Proceedings of the IEEE 93 (1) (2005) 71–83.
[15] C. Guillemot, F. Pereira, L. Torres, T. Ebrahimi, R. Leonardi, J. Ostermann, Distributed monoview and multiview coding, IEEE Signal Processing Magazine 24 (5) (2007) 67–76.
[16] C. Guillemot, A. Roumy, Towards constructive Slepian-Wolf coding schemes, in: P.L. Dragotti, M. Gastpar (Eds.), Distributed Source Coding: Theory, Algorithms and Applications, Elsevier, 2009.
[17] X. Guo, Y. Lu, F. Wu, W. Gao, S. Li, Distributed multi-view video coding, in: Proceedings of the SPIE Conference on Visual Communications and Image Processing, 2006.
[18] A. Jagmohan, A. Sehgal, N. Ahuja, Compression of lightfield rendered images using coset codes, in: Proceedings of the IEEE Asilomar Conference on Signals and Systems, Special Session on Distributed Coding, 2003.
[19] M. Levoy, P. Hanrahan, Light field rendering, in: ACM Computer Graphics, SIGGRAPH, 1996.
[20] H. Lin, L. Yunhai, Y. Qingdong, A distributed source coding for dense camera array, in: Proceedings of the IEEE International Conference on Signal Processing (ICSP'04), 2004.
[21] L. McMillan, G. Bishop, Plenoptic modeling: an image-based rendering system, in: ACM Computer Graphics, SIGGRAPH, 1995.
[22] S.S. Pradhan, K. Ramchandran, Distributed source coding using syndromes (DISCUS): Design and construction, in: Proceedings of the Data Compression Conference, 1999.
[23] R. Puri, A. Majumdar, K. Ramchandran, PRISM: A video coding paradigm with motion estimation at the decoder, IEEE Transactions on Image Processing 16 (10) (2007) 2436–2448.
[24] R. Puri, K. Ramchandran, PRISM: A video coding architecture based on distributed compression principles, in: IEEE Proceedings of fortieth Allerton Conference on Communication, Control, and Computing, 2002.
[25] C.E. Shannon, A mathematical theory of communication, Bell Systems Technology Journal 27 (1948) 379–423 (continued 27:623–656, 1948).
[26] R. Shukla, P.L. Dragotti, M.N. Do, M. Vetterli, Rate-distortion optimized tree structured compression algorithms for piecewise polynomial images, IEEE Transactions on Image Processing 14 (3) (2005) 343–359.
[27] D. Slepian, J.K. Wolf, Noiseless coding of correlated information sources, IEEE Transactions Information Theory 19 (1973) 471–480.
[28] B. Song, O. Bursalioglu, A.K. Roy-Chowdhury, E. Tuncel, Towards a multi-terminal video compression algorithm by integrating distributed source coding with geometrical constraints, Journal of Multimedia 2 (3) (2007) 9–16.
[29] B. Song, O. Bursalioglu, A.K. Roy-Chowdhury, E. Tuncel, Towards a multi-terminal video compression algorithm using epipolar geometry, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, 2006.
[30] G.J. Sullivan, T. Wiegand, Video compression—from concepts to the H.264/AVC standard, in: Proceedings of the IEEE 93 (1) (2005) 18–31.
[31] T.S. Han, K. Kobayashi, A unified achievable rate region for a general class of multiterminal source coding systems, IEEE Transactions Information Theory 26 (3) (1980) 277–288.
[32] M.P. Tehrani, T. Fujii, M. Tanimoto, Distributed source coding of multiview images, in: Visual Communications and Image Processing Conference, 2004.
[33] M.P. Tehrani, T. Fujii, M. Tanimoto, The adaptive distributed source coding of multi-view images in camera sensor networks, IEICE Transactions Fundamentals E88-A (10) (2005) 2835–2843.
[34] G. Toffetti, M. Tagliasacchi, M. Marcon, A. Sarti, S. Tubaro, K. Ramchandran, Image compression in a multi-camera system based on a distributed source coding approach, in: European Signal Processing Conference, 2005.
[35] I. Tosic, P. Frossard, Geometry-based distributed coding of multi-view omnidirectional images, in: IEEE International Conference on Image Processing, 2008.
[36] A. Wagner, S. Tavildar, P. Viswanath, Rate region of the quadratic Gaussian two-encoder source-coding problem, IEEE Transactions Information Theory 54 (5) (2008) 1938–1961.
[37] M. Wu, A. Vetro, C.W. Chen, Multiple description image coding with distributed source coding and side information, SPIE Multimedia Systems and Applications VII, 5600 (2004) 120–127.
[38] A.D. Wyner, Recent results in the Shannon theory, IEEE Transactions Information Theory 20 (1974) 2–10.
[39] A.D. Wyner, J. Ziv, The rate-distortion function for source coding with side information at the decoder, IEEE Transactions Information Theory 22 (1976) 1–10.
[40] Y. Yang, V. Stankovic, W. Zhao, Z. Xiong, Multiterminal video coding, in: Proceedings of the IEEE International Conference on Image Processing, 2007.
[41] C. Yeo, K. Ramchandran, Robust distributed multi-view video compression for wireless camera networks, in: Visual Communications and Image Processing, 2007.
[42] R. Zamir, The rate loss in the Wyner-Ziv problem, IEEE Transactions Information Theory 42 (6) (1996) 2073–2084.
[43] X. Zhu, A. Aaron, B. Girod, Distributed compression for large camera arrays, in: Proceedings of the IEEE Workshop on Statistical Signal Processing, 2003.
CHAPTER 13
Online Learning of Person Detectors by Co-Training from Multiple Cameras
P. M. Roth, C. Leistner, H. Grabner, H. Bischof
Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria
Abstract
The detection of persons is an important task in automatic visual surveillance systems. In recent years, both advanced representations and machine learning methods have proved to yield high recognition performance while keeping false detection rates low. However, building general detectors applicable to a wide range of scenes leads to high model complexity and thus to unnecessarily complex detectors. To alleviate this problem, we propose to train scene-specific object detectors by co-training from multiple cameras. The main idea is that each camera holds a separate classifier, which is trained by online learning incorporating (detection) information from other cameras. Thus, incrementally better scene-specific classifiers can be obtained. To demonstrate the power of our approach, we performed experiments on two data sets for two and three cameras, respectively. We show that, with the proposed approach, state-of-the-art detection results can be obtained without requiring a great number of training samples.
Keywords: visual online learning, object detection, multi-camera networks
13.1 INTRODUCTION
Because of the increasing number of cameras mounted for security reasons, automatic visual surveillance systems are required to analyze increasing amounts of data. One important task for such automatic systems is the detection of persons. Hence, there has been considerable interest in this topic and several approaches have been proposed to solve this problem. Early attempts to find (moving) persons used change detection (motion detection). For that purpose, a background model was estimated and pixels that could not be described by the background model were reported as part of the foreground. These pixels were grouped into blobs, and the actual detection was performed based on blob analysis (e.g., [1]). However, these approaches had several limitations (e.g., varying backgrounds,
crowds) and so could not be applied in complex scenarios. Therefore, several approaches to modeling the appearance of the object have been proposed. These can be subdivided into three main groups according to subject representation:
■ Global image features such as edge templates, shape features, or implicit shape models (e.g., [2])
■ Local image features such as Haar wavelets [3], histograms of oriented gradients [4], and local covariance matrices [5]
■ Articulated parts (e.g., [6])
Using these representations, a classifier is built using a learning algorithm (e.g., AdaBoost [7]), which is subsequently applied to the whole image using a sliding window technique. The aim of all of these methods is to train a general person model that should be applicable to different scenarios and tasks. To cope with the variability of subjects and backgrounds, a large training set is required. But even if classifiers are trained on a very large number of samples, they often fail in practice. Thus, the main limitation of such approaches is the lack of a representative training data set. However, not all variability, especially for the negative class (i.e., possible backgrounds), can be captured during training, resulting in low recall and insufficient precision.
To overcome these problems, scene-specific classifiers can be applied that solve only a specific task (e.g., object detection for a specific camera setup). Furthermore, these classifiers should be able to continuously adapt to changing environments (e.g., changing illumination conditions), which can be realized by using incremental or online learning methods1 (e.g., [8–10]). Hence, variations need not be handled by the overall model, which reduces the complexity of the problem and allows training of more specific and thus more efficient classifiers.
All of these methods are limited to a single camera view. However, recently there has been considerable interest in multiple cameras for person detection and tracking [11–16]. Such approaches mainly address the problem of occlusions, which cannot be handled by single-view approaches. For that purpose, they apply change detection (e.g., [12, 14]) or a fixed pretrained classifier (e.g., [11, 16]) to detect persons in each camera's view. Then the obtained information is combined by estimating a score map (e.g., [12, 14]) on a common (top) view.
The goal in this chapter is to explore the information collected from multiple views to adapt classifiers online. However, online updates suffer from the problem that the detector might start drifting, finally ending up in an unreliable state [17]. To avoid drifting, we propose a co-training strategy [18] for updating the classifiers that correspond to single-camera views. In particular, we propose to train a general seed classifier by offline boosting (to ensure an initial classifier of sufficient accuracy), which is later improved and adapted to a specific viewpoint by online boosting [19]. In fact, in Roth et al. [20] we showed that an offline trained classifier can be improved by online updates if the required statistics are stored during offline training. For co-training a view-specific classifier, we incorporate information (i.e., detections) from other views such that one camera teaches the other. The overall principle of the proposed approach is illustrated in Figure 13.1.
1 In contrast to incremental learning methods, during online learning no data is stored.
FIGURE 13.1 A scene is observed by multiple cameras with partly overlapping fields of view (first row). At each camera, a classifier is applied to detect persons (second row). To improve the detection results (lower false positive rate as well as increasing detection rate), the classifier is updated (last row) in a co-training manner by combining the decisions of all n camera views (third row).
To establish geometric correspondences for co-training, we use the homography between two cameras [21], which allows us to transfer coordinates from one local camera coordinate system to another. In addition, to make these projections more robust, we include motion information in this step. Thus, the cameras teach each other by exchanging detection results (coordinates), which makes our approach also suitable for large camera networks. In this way, only the location and not the whole image has to be transferred between cameras, which drastically reduces the amount of required data exchange. Moreover, since we apply an online learning method, a sample can be discarded directly after the update, which reduces the memory requirements and computational costs of updating. The remainder of this chapter is organized as follows. In Section 13.2, we review co-training and online boosting for feature selection. Based on that, in Section 13.3 we introduce our co-training system using camera networks with partly overlapping views. Experiments on two challenging data sets for person detection in Section 13.4 illustrate the advantages of our approach. Finally, we conclude the chapter with a summary and an outlook in Section 13.5.
13.2 CO-TRAINING AND ONLINE LEARNING
In this section we review the main learning concepts used in the proposed approach. First, we discuss the idea of co-training in general and describe online boosting for feature selection in detail (including an overview of the offline variant). Then we give a short overview of knowledge transfer between an offline and an online classifier, followed by a short discussion of the applied image representation.
13.2.1 Co-Training
It is well known (e.g., Nigam et al. [22]) that unlabeled samples contain information about the joint distribution of the data. Since they can be obtained more easily than labeled samples, the main goal is to take advantage of the statistical properties of the labeled and unlabeled data by using a semi-supervised approach. In our approach, a seed classifier trained from a small number of labeled samples can be improved by taking into account a large number of available unlabeled samples. This is realized by co-training, which was proposed by Blum and Mitchell [18]. The main idea of co-training is to split the instance space into two independent observations2 (e.g., color and shape) and to train two separate classifiers. These classifiers are then applied in parallel, where one classifier teaches the other (i.e., unlabeled samples that are confidently labeled by one classifier are added to the training set of the other). It was proven in [18] that co-training converges if two strong conditions are fulfilled. First, the two observations must be conditionally independent; second, each of them should be able to solve the task. Thus, to satisfy these conditions, training samples are required for which one of the classifiers is confident whereas the other one is not. Since it is hard to ensure these conditions in practice—the first one in particular—the requirements were later relaxed [23]. Nevertheless, a fairly strong assumption on the training algorithms remains: they should never provide a hypothesis that is "confident but wrong."
For visual learning, co-training was first applied by Levin et al. [24] to train a car detector. Their approach starts with a small number of hand-labeled samples and generates additional labeled examples by applying co-training to two boosted offline classifiers. One is trained directly from gray-value images, whereas the other is trained from background-subtracted images. The additional labels are generated based on confidence-rated predictions. Using the additionally labeled samples, the training process is restarted from scratch. In general, the approach is not limited to two observations but can be extended to multiple observations (e.g., [25, 26]). Zhou and Li [26] extended the original co-training approach to three classifiers. Moreover, Javed et al. [25] applied an arbitrary number of classifiers and extended the method to online learning. In particular, they first generated a seed model by offline boosting, which was later improved by online boosting. The co-training was then performed on the feature level, where each feature (i.e., a global PCA feature) corresponded to a base classifier. If an unlabeled sample was labeled very confidently by a subset of such base classifiers, it was used both for updating the base classifiers and for updating the boosting parameters.
2 We use the term "observations" to avoid confusion with "camera views," even though in machine learning the term "view" is more common.
13.2.2 Boosting for Feature Selection
Offline Boosting for Feature Selection
Boosting, in general, is a widely used technique in machine learning for improving the accuracy of any given learning algorithm (see [27, 28] for a good introduction and overview). In fact, boosting converts a set of weak classifiers into a strong one. In this work, we focus on the (discrete) AdaBoost algorithm, which was introduced by Freund and Schapire [7]. The algorithm can be summarized as follows: Given is a training set X = {(x_1, y_1), ..., (x_L, y_L)} of L samples, with x_i ∈ R^m and y_i ∈ {-1, +1}, where x_i is a sample and y_i is its corresponding positive or negative label, and a weight distribution p(x) that is initialized uniformly: p(x_i) = 1/L. Then a weak classifier h is trained using X and p(x), which has to perform only slightly better than random guessing (i.e., the error rate of a classifier for a binary decision task must be less than 50 percent). Depending on the error e of the weak classifier, a weight α is calculated and the samples' probability p(x) is updated: for misclassified samples the corresponding weight is increased, while for correctly classified samples the weight is decreased. Thus, the algorithm focuses on the hard samples. The entire process is iteratively repeated, and a new weak classifier is added at each boosting iteration until a certain stopping criterion is met. Finally, a strong classifier H_off(x) is estimated by a linear combination of all N trained weak classifiers:

$$H_{\text{off}}(x) = \sum_{n=1}^{N} \alpha_n h_n(x) \qquad (13.1)$$

The predicted label y is finally estimated by evaluating the sign of H_off(x):

$$y = \begin{cases} +1 & H_{\text{off}}(x) > 0 \\ -1 & H_{\text{off}}(x) < 0 \end{cases} \qquad (13.2)$$

Moreover, as was shown by Friedman et al. [29], boosting provides a confidence measure

$$P(y = 1 \mid x) = \frac{e^{H_{\text{off}}(x)}}{e^{H_{\text{off}}(x)} + e^{-H_{\text{off}}(x)}} \qquad (13.3)$$

where the decision P(y = 1 | x) > 0.5 is equivalent to H_off(x) > 0. Thus, the classifier response H_off(x) can also be considered a confidence measure.
Boosting can also be applied to feature selection [30]. The basic idea is that each feature corresponds to a weak classifier and boosting selects an informative subset from the feature pool. Thus, given a set of k possible features F = {f_1, ..., f_k}, in each iteration n a weak hypothesis is built for each feature from the weighted training samples. The best one forms the weak hypothesis h_n that corresponds to the selected feature f_n. The weights of the training samples are updated with respect to the error of the chosen hypothesis.
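To make the weighting and selection mechanism concrete, the following is a minimal sketch of discrete AdaBoost over simple per-feature threshold stumps, together with the confidence measure of equation 13.3. It is illustrative only: the fixed mean-based thresholds and all function names are our assumptions, not the implementation used in this chapter.

```python
import numpy as np

def train_adaboost(X, y, n_rounds):
    """Discrete AdaBoost over simple per-feature threshold stumps.
    X: (L, k) array of feature responses, y: (L,) labels in {-1, +1}."""
    L, k = X.shape
    p = np.full(L, 1.0 / L)                      # p(x_i) = 1/L, uniform initialization
    thresholds = X.mean(axis=0)                  # one fixed stump threshold per feature (sketch)
    ensemble = []                                # list of (feature index, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for f in range(k):                       # each feature corresponds to a weak classifier
            for s in (1.0, -1.0):
                pred = np.where(X[:, f] > thresholds[f], s, -s)
                err = p[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, s)
        err, f, s = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # voting weight derived from the error
        pred = np.where(X[:, f] > thresholds[f], s, -s)
        p *= np.exp(-alpha * y * pred)           # increase weights of misclassified samples
        p /= p.sum()
        ensemble.append((f, s, alpha))
    return ensemble, thresholds

def strong_response(ensemble, thresholds, x):
    """H_off(x) = sum_n alpha_n h_n(x); its sign gives the label (Eqs. 13.1 and 13.2)."""
    return sum(alpha * (s if x[f] > thresholds[f] else -s) for f, s, alpha in ensemble)

def confidence(H):
    """P(y = 1 | x) from Eq. 13.3, a sigmoid of the strong response."""
    return np.exp(H) / (np.exp(H) + np.exp(-H))
```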
Online Boosting for Feature Selection
Contrary to offline methods, during online learning each training sample is provided to the learner only once. Thus, all steps must be online, and the weak classifiers must be updated whenever a new training sample is available. Online updating of the weak classifiers is not a problem, since various online learning methods exist that may be used for generating hypotheses. The same applies to the voting weights α_n, which can easily be computed if the errors of the weak classifiers are known. The crucial step is the computation of the weight distribution, since the difficulty of a sample is not known a priori. To overcome this problem, Oza and Russell [31, 32] proposed computing the importance λ of a sample by propagating it through the set of weak classifiers. In fact, λ is increased proportionally to the error e of the weak classifier if the sample is misclassified and decreased otherwise. Since the approach of Oza and Russell cannot be directly applied to feature selection, in Grabner and Bischof [19] we introduced selectors and performed online boosting on them, not directly on the weak classifiers. A selector h_n^sel(x) holds a set of M weak classifiers {h_1(x), ..., h_M(x)} that are related to a subset of features F_n = {f_1, ..., f_M} ⊂ F, where F is the full feature pool. At each time the selector h_n^sel(x) selects the best weak hypothesis

$$h^{\text{sel}}(x) = \arg\min_m e(h_m(x)) \qquad (13.4)$$

according to the estimated training error

$$\hat{e} = \frac{\lambda^{\text{wrong}}}{\lambda^{\text{wrong}} + \lambda^{\text{corr}}} \qquad (13.5)$$

where λ^corr and λ^wrong are the importance weights of the samples seen so far that were classified correctly and incorrectly, respectively.
The work flow of online boosting for feature selection can be described as follows: A fixed number of N selectors h_1^sel, ..., h_N^sel is initialized with random features. The selectors are updated whenever a new training sample (x, y) is available, and the weak classifier with the smallest estimated error is selected. Finally, the weight α_n of the nth selector h_n^sel is updated, the importance λ is passed to the next selector h_{n+1}^sel, and a strong classifier is computed by a linear combination of the N selectors:

$$H_{\text{on}}(x) = \sum_{n=1}^{N} \alpha_n h_n^{\text{sel}}(x) \qquad (13.6)$$

Contrary to the offline version, an online classifier is available at any time during the training process.
Including Prior Knowledge
To ensure an initial classifier of sufficient accuracy and to efficiently retrain an existing classifier, it is desirable to incorporate prior knowledge. Let the prior knowledge be given as an offline classifier H_off, which was built on a distribution P'. The goal is to transfer this knowledge to an online classifier H_on, which should be trained from a slightly different distribution P. We showed that a classifier trained by offline boosting can be retrained online [20]. In fact, this is possible if all statistics required for online updating are stored during offline training. This can be done in a straightforward way for all components:
■ To build a weak hypothesis h_n corresponding to an image feature f_n, a learning algorithm is applied. In particular, the distributions P(y = 1 | f_n(x)) and P(y = -1 | f_n(x)) for the positive and negative samples are computed, and the estimate h_n is obtained by applying a Bayesian decision rule. Assuming that positive and negative feature responses follow Gaussian distributions, during offline training the mean and the variance can be computed (and stored) from all training samples. These parameters can then be easily adjusted when the training is continued online [33].
■ To select the best weak classifier within a selector, the error of the weak classifier is used to calculate the voting weight α and to update the importance λ. In the offline case, this error depends on the weights p_i of the training samples that were classified correctly and incorrectly, respectively. If these values are saved as λ^corr and λ^wrong in the offline stage, they can be updated using the importance λ during online learning. Thus, by equation 13.5 the estimated error can be recalculated.
Since all information required for online learning is captured during offline training, these modifications allow direct online retraining of an offline-trained classifier.

FIGURE 13.2 Haar-like features: To estimate the response of a Haar-like feature, the difference in pixel values between white and black regions is computed.
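Returning to the knowledge transfer described above, the sketch below shows what storing and continuing these statistics can look like: the class-conditional Gaussian parameters and the λ^corr/λ^wrong counts of a weak classifier are kept and the Gaussians are adjusted online. The exponential forgetting update and the equal-prior decision rule are our assumptions; the chapter only states that mean, variance, and error statistics are stored offline and adjusted online (cf. [33]).

```python
import numpy as np

class GaussianWeakStats:
    """Per-feature statistics stored during offline training so that online
    retraining can continue from them."""
    def __init__(self, mean_pos, var_pos, mean_neg, var_neg, lam_corr, lam_wrong):
        self.mean = {+1: mean_pos, -1: mean_neg}
        self.var = {+1: var_pos, -1: var_neg}
        self.lam = {"corr": lam_corr, "wrong": lam_wrong}   # carried over into Eq. 13.5

    def update(self, f_value, label, alpha=0.05):
        """Adjust the Gaussian of the given class with a new feature response
        (exponential forgetting; the rate alpha is an illustrative choice)."""
        m = self.mean[label]
        self.mean[label] = (1 - alpha) * m + alpha * f_value
        self.var[label] = (1 - alpha) * self.var[label] + alpha * (f_value - self.mean[label]) ** 2

    def classify(self, f_value):
        """Bayesian decision between the two Gaussians (equal priors assumed)."""
        def logp(label):
            m, v = self.mean[label], max(self.var[label], 1e-6)
            return -0.5 * np.log(2 * np.pi * v) - (f_value - m) ** 2 / (2 * v)
        return +1 if logp(+1) > logp(-1) else -1
```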
Image Representation and Features
Since the seminal work of Viola and Jones [3, 34], who used Haar-like features to build face and person detectors, there has been considerable interest in boosting for feature selection in the field of pattern recognition. The main purpose of using features instead of raw pixel values as input to a learning algorithm is to reduce intra-class variability and to incorporate a priori knowledge. Various approaches have been proposed, mainly differing in the feature type used for training an object representation (e.g., Gabor filters [35]). However, in this work we use classical Haar-like features (see Figure 13.2), which can be computed very efficiently using integral images [34].
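To illustrate why integral images make the evaluation of Haar-like features cheap, the following sketch computes a summed-area table and evaluates a simple two-rectangle feature with four lookups per rectangle sum. It is a generic illustration, not the exact feature set used in this chapter; the example coordinates are placeholders.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column so every rectangle sum needs four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixel values inside the rectangle with top-left corner (x, y)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, x, y, w, h):
    """Response of a two-rectangle Haar-like feature: left (white) half minus right (black) half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

# usage on a random 64x32 patch (the detection window size used in this chapter)
patch = np.random.randint(0, 256, (64, 32))
ii = integral_image(patch)
response = haar_two_rect_vertical(ii, x=4, y=8, w=12, h=20)
```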
13.3 CO-TRAINING SYSTEM
The overall camera network is depicted in Figure 13.3. We have a setup with n partly overlapping cameras, each of them observing the same 3D scene. In general, the objects of interest can move in the world coordinate system {x_w, y_w, z_w}, but since the main goal in this chapter is to train a person detector, we can assume that the objects of interest move on a common ground plane. With overlapping camera views, the local image coordinate systems {x_i, y_i} can be mapped onto each other using a homography based on an identified point in the ground plane. In addition, for each camera an estimate of the ground plane is required. Both the calibration of the ground plane and the estimation of the homography are discussed in more detail in Section 13.3.1. Once we have calibrated the scene, we can start co-training. In fact, in our approach the different observations of the data are realized by the different camera views. The co-training procedure thus defined is discussed in Section 13.3.2.
FIGURE 13.3 Co-training system: Multiple cameras observe a partly overlapping scene and collaborate during the update phase.
FIGURE 13.4 Homography induced by a plane.
13.3.1 Scene Calibration
As illustrated in Figure 13.3, the size of an object (person) in the camera coordinate system depends on its absolute position within the world coordinate system. Since objects are constrained to move on the ground plane, we can estimate the ground plane for all cameras and thus obtain the approximate expected size of the object at a specific ground plane position. In fact, for the current setups we estimate the ground plane manually, but for an autonomous system an unsupervised approach (e.g., [36]) might be applied.
Similar to Khan and Shah [14], we use homography information to map one view onto another. It is well known (see, e.g., [21]) that points on a plane seen from two different views are related by a planar homography, which is depicted in Figure 13.4. Hence, the plane induces a homography H between the two views, where H maps points x_i from the first view to points x_j in the second view:

$$x_j = H x_i \qquad (13.7)$$
To estimate the homography, we manually select the required points in the ground plane. Once we have estimated the homographies between all views, we can map one view onto another. In particular, given a detection in one view, we first estimate the base point (i.e., the lower center point of the bounding box), which is assumed to lie on the ground plane. Then, this base point is mapped onto a different camera coordinate system by the estimated homography. Finally, with the new base point we can superimpose the detections (i.e., the bounding boxes) from the original view onto the projected view. In this way, we can verify whether a detection in one view was also reported in a different view.
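A minimal sketch of this procedure is given below, using OpenCV's homography estimation as one possible implementation (the chapter does not prescribe a library); the point correspondences and the bounding box are placeholders, not values from the text.

```python
import numpy as np
import cv2

# Four (or more) manually clicked ground-plane correspondences between view i and view j.
pts_i = np.array([[102, 240], [310, 255], [295, 410], [90, 395]], dtype=np.float32)
pts_j = np.array([[80, 230], [290, 240], [305, 400], [110, 420]], dtype=np.float32)

H_ij, _ = cv2.findHomography(pts_i, pts_j)     # estimate H such that x_j ~ H x_i (Eq. 13.7)

def project_base_point(H, bbox):
    """Map the base point (lower-center of a detection bounding box, assumed to lie
    on the ground plane) from view i into view j using the homography H."""
    x, y, w, h = bbox                           # bbox = (left, top, width, height) in view i
    base = np.array([x + w / 2.0, y + h, 1.0])  # homogeneous base point
    p = H @ base
    return p[:2] / p[2]                         # inhomogeneous coordinates in view j

base_j = project_base_point(H_ij, bbox=(150, 200, 40, 96))
```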
13.3.2 Online Co-Training
Similar to Levin et al. [24], in our co-training framework we apply boosted classifiers. Since the response of a strong classifier given in equation 13.1 can be interpreted as a confidence measure, we can apply confidence-based learning. In contrast to existing approaches, however, ours differs in two main points: first, we consider the different camera views as independent observations for co-training; second, we apply an online method, which enables more efficient learning (i.e., we do not have to retrain the classifiers from scratch).
To start the training process, we first train a general classifier H^0 = H_off from a fixed set of positive and negative samples by offline boosting [34]. This classifier is trained such that, rather than high precision, it has a high recall rate. The initial classifier is cloned and used for all n camera views: H_1^0, ..., H_n^0. Because of the different camera positions, we get the independent observations required for co-training—even if exactly the same classifier is applied! As was shown in Section 13.2.2, an offline classifier can easily be retrained using online methods. Hence, these cloned classifiers H_1^0, ..., H_n^0 can be retrained online and later adapted to a specific camera view by co-training.
We apply a decentralized approach, where we select one view V_i and use all other views V_j, j ≠ i, for verification (in a round-robin–like procedure each view is verified by all others). By using the homography information, a specific sample x_i ∈ V_i is projected onto all other views V_j: x_j = H_ij x_i, where j ≠ i and H_ij represents the homography between the views V_i and V_j. Based on the response of H_j(x_j), we decide whether the sample x_i should be used for updating the classifier H_i. We refer to these update strategies as verification and falsification. In addition, to increase the stability of the classifiers, we keep a small pool of "correct" patches X^+ to perform an additional conservative verification and falsification step. In all other cases we do not perform an update.
Verification. If the responses of all other classifiers are also positive—that is, H_j(x_j) ≥ θ for j ≠ i, where θ is the minimum required confidence—the sample x_i is verified and used as a positive update for H_i. It is also added to the pool of positive examples: X^+ = X^+ ∪ {x_i}.
Falsification. If the responses of all other classifiers are negative—that is, H_j(x_j) < 0 for j ≠ i—the sample x_i is classified as a false positive, and the classifier H_i is updated using x_i as a negative example. After each negative update, the pool of positive samples X^+ is checked to determine if it is still consistent; otherwise, a positive update with the corresponding sample is performed.
Because of geometric inaccuracies and classification errors, detections and their corresponding projections might be misaligned (label jitter). Especially for positive
updates, misaligned samples might lead to noisy updates, which result in drifting and thus in corrupted classifiers. To overcome this severe shortcoming, we additionally include motion information—that is, a foreground-background segmentation, which is illustrated in Figure 13.5. In particular, we apply background subtraction, where the background is modeled by an approximated temporal median filter [37]. The foreground regions are labeled by pixel-wise thresholding of the difference between the currently processed image and the estimated background image. The main idea is to use the thus obtained binary image of the selected patch to verify whether a patch is aligned properly. In fact, as illustrated in Figure 13.5(c), a positive update is performed only if the confidence of the current classifier is high and the patch is aligned correctly. In contrast, a misaligned patch such as the one illustrated in Figure 13.5(d) is not used for updating, even if the classifier's confidence is high. In this way we can avoid suboptimal positive updates and obtain a robust classifier. With this very conservative update strategy, label noise can be minimized and the detections thus kept stable over time (i.e., drifting can be limited). Since we are continuously learning over time, these few updates are sufficient to adapt to the specific scene.
FIGURE 13.5 Robust alignment by background subtraction: (a) input image with detections; (b) binary thresholded difference image; (c) example of a correctly aligned patch; (d) example of an incorrectly aligned patch.
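A rough sketch of this motion cue is given below: an approximated median background model in the spirit of [37], pixel-wise thresholding, and a simple fill-ratio test for patch alignment. The update step size, the difference threshold, and the fill-ratio criterion are illustrative assumptions; the chapter does not specify the exact alignment test.

```python
import numpy as np

def update_background(background, frame, step=1.0):
    """Approximated temporal median filter: nudge each background pixel by a fixed
    step toward the current frame (background must be a float array)."""
    background += step * np.sign(frame.astype(np.float32) - background)
    return background

def foreground_mask(background, frame, threshold=30):
    """Pixel-wise thresholding of the difference between the current frame and the
    estimated background image."""
    diff = np.abs(frame.astype(np.float32) - background)
    return diff > threshold

def patch_is_aligned(mask, bbox, min_fill=0.4):
    """Crude alignment check (illustrative criterion): a detection patch is accepted
    for a positive update only if enough foreground pixels fall inside its box."""
    x, y, w, h = bbox
    region = mask[y:y + h, x:x + w]
    return region.mean() >= min_fill
```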
FIGURE 13.6 Detections (white), negative updates (light gray), and positive updates (black) performed during co-training for different time steps.
The incremental update rules are illustrated for two cameras in Figure 13.6. The 1st, 498th, 698th, and 903rd frames of a sequence totaling 2500 frames are shown. A light gray bounding box indicates a negative update; a black bounding box, a positive update; and a white bounding box, a detection not used for updating. It can be seen that there are a great number of negative updates in the beginning. As the classifiers become more stable over time, precision increases and the number of required negative updates decreases. In addition, new positive updates are selected, in fact increasing the recall. In this way, finally, an effective classifier is obtained. This illustrates that the learning approach focuses on the most valuable samples and that a representative classifier can be obtained from only a few updates. The entire multi-camera co-training scheme is summarized more formally in Algorithms 13.1 and 13.2.

Algorithm 13.1: Co-Training from Multiple Views
1: Manually collect pos./neg. training samples
2: Train initial classifier: H^0
3: Clone initial classifier: H_1^0, ..., H_n^0
4: for each time step t do
5:   update all classifiers H_i^{t-1} -> H_i^t (Algorithm 13.2)
6: end for

Algorithm 13.2: Co-Training Update Strategy
Input: H_i^{t-1}, K detections x_k
Output: H_i^t
1: for k = 1, ..., K do
2:   if H_i^{t-1}(x_k) > θ then
3:     for j = 1, ..., J do
4:       project x_k onto other views: x_j = H_ij x_k
5:       evaluate H_j^{t-1}(x_j)
6:       check alignment: align(x_j)
7:       if H_j^{t-1}(x_j) > θ and align(x_j) = true then
8:         update(H_i^{t-1}, x_k, +)
9:       else if H_j^{t-1}(x_j) < 0 then
10:        update(H_i^{t-1}, x_k, -)
11:      end if
12:    end for
13:  end if
14: end for
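For illustration, Algorithm 13.2 can be written compactly as follows; the classifier interface (confidence() and update()), the homography callables, and the alignment test are assumed placeholders rather than the authors' API.

```python
def cotrain_update(classifiers, homographies, detections, i, theta, is_aligned):
    """One co-training step for view i (cf. Algorithm 13.2).
    classifiers: dict view -> classifier with .confidence(x) and .update(x, label)
    homographies: dict (i, j) -> callable mapping a detection from view i to view j
    detections: list of detections x_k in view i
    is_aligned: alignment test based on the foreground mask (see Figure 13.5)"""
    H_i = classifiers[i]
    for x_k in detections:
        if H_i.confidence(x_k) <= theta:
            continue                               # only confident detections are considered
        for j, H_j in classifiers.items():
            if j == i:
                continue
            x_j = homographies[(i, j)](x_k)        # project onto the other view (Eq. 13.7)
            c = H_j.confidence(x_j)
            if c > theta and is_aligned(j, x_j):
                H_i.update(x_k, +1)                # verification: positive update
            elif c < 0:
                H_i.update(x_k, -1)                # falsification: negative update
```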
13.4 EXPERIMENTAL RESULTS
To determine the benefits of our proposed approach, we performed several experiments on two different data sets, described in more detail in Section 13.4.1. These experiments show that our approach produces increasingly better classifiers that finally yield state-of-the-art detection results. Even though the approach is quite general and can be applied for an arbitrary number of cameras, limited test data confined the experiments to two and three cameras only. In addition, we evaluated the necessary system resources: memory requirements, computational costs, and necessary data transfer between cameras.
13.4.1 Test Data Description
To show the generality of the approach, we performed experiments on two data sets differing in complexity, viewpoint/angle, and subject geometry. The first one (indoor scenario) we generated in our laboratory; the second (outdoor scenario) was generously provided by Berclaz et al. [11]. For both scenarios a scene was observed by three static cameras with partly overlapping views.
Indoor scenario. The first data set was generated in our laboratory using a setup of three synchronized Axis 210 Ethernet cameras.3 The images were taken at 30 fps at a resolution of 384×288 and were directly stored in JPEG format. We took several sequences showing subjects walking and standing. For the experiments reported here, we used one sequence of 2500 frames for training and a shorter sequence of 250 frames for testing.
Outdoor scenario. The second data set, showing the forecourt of a public building, was originally used by Berclaz et al. [11]. It contains several video sequences at a resolution of 360×288, taken from three outdoor cameras with overlapping views. During the sequence several subjects walk and stand in the field of view, sometimes highly occluding each other. We extracted two sequences, one for training containing 2000 frames and a shorter one for testing containing 200 frames.
For both data sets we estimated the ground plane and computed the homographies for all camera views sharing a viewpoint area, as described in Section 13.3. In addition, the test sequences were annotated to enable an automatic evaluation.
3 Since the proposed approach can easily be extended to include an arbitrary number of views, the setup will be extended in the future; that is, additional cameras will be added.
13.4.2 Indoor Scenario
We carried out the first experiments on the indoor scenario. For that purpose, we trained an initial classifier by offline boosting using a fixed training set of 100 positive and 1000 negative samples, which were randomly chosen from a larger data set. In particular, the classifier consisted of 50 selectors and 250 weak classifiers and was trained for a sample size of 64×32, which is a typical size for pedestrian detection. It was cloned and used to initialize the co-training process for each camera view. Later these initial classifiers were updated by co-training. In fact, we ran two experiments in parallel, one using two camera views (co-training) and one using three camera views (tri-training).
To demonstrate learning progress, after a predefined number of processed training frames (i.e., 50 frames) we saved a classifier, which we then evaluated on the independent
test sequence. (The current classifier was evaluated, but no updates were performed.) To automatically analyze the detection results, we applied an overlap criterion: a detection is counted as a true positive if there is at least a 60 percent overlap between the bounding box of the detection and the corresponding bounding box in the ground truth. From these results we computed, for each classifier, the precision, the recall, and the F-measure. Precision describes the accuracy of the detections, whereas recall measures the fraction of positive samples classified correctly. The F-measure [38] can be considered a trade-off between these two characteristics. The performance curves thus obtained over time for a specific camera view are shown in Figure 13.7 (co-training) and Figure 13.8 (tri-training), respectively.
Figure 13.7 shows the typical behavior of the learning process, which can be subdivided into three phases. In the first phase, precision is dramatically improved by eliminating the false positives, which are identified and used for negative updates of the classifier. This effect can be observed mainly within the first 100 frames processed. In the second phase, from frame 250 to frame 750, recall is increased. As the number of false positives goes down, the reliability of the classifier goes up; thus, co-trained scene-specific positive samples are collected and used as positive updates. In the final phase, after both recall and precision have improved, these values stay at the same level. Note that, even though the classifier is updated continuously, it does not drift. In fact, the proposed co-training update strategy minimizes label noise and thus allows training of a stable classifier. We performed the same experiment using three camera views (tri-training); the corresponding results are shown in Figure 13.8.
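A sketch of this evaluation protocol is shown below. Note that the exact overlap measure is not specified in the chapter; intersection-over-union and the greedy matching used here are assumptions.

```python
def overlap(box_a, box_b):
    """Overlap between two boxes (x, y, w, h), measured here as intersection-over-union."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truth, min_overlap=0.6):
    """Greedy matching of detections to ground-truth boxes; returns precision,
    recall, and the F-measure (harmonic mean of the two)."""
    matched = set()
    tp = 0
    for det in detections:
        best_j, best_o = -1, 0.0
        for j, gt in enumerate(ground_truth):
            o = overlap(det, gt)
            if j not in matched and o > best_o:
                best_j, best_o = j, o
        if best_o >= min_overlap:
            tp += 1
            matched.add(best_j)
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f
```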
FIGURE 13.7 Indoor scenario: performance curves over time for co-training (recall, precision, and F-measure vs. processed frames).
FIGURE 13.8 Indoor scenario: performance curves over time for tri-training (recall, precision, and F-measure vs. processed frames).
Table 13.1 Indoor Scenario: Performance Characteristics for Co-Training—Initial and Final Classifiers

                       Recall   Precision   F-Measure
Initial Classifiers
  View 1                0.57      0.17        0.27
  View 2                0.71      0.23        0.35
Final Classifiers
  View 1                0.75      0.90        0.82
  View 2                0.78      0.95        0.86
Like Figure 13.7, Figure 13.8 shows the three phases. Because of the additional information, the improvement in recall and precision is obtained faster and the progress is much smoother. Since up to now only one specific classifier was evaluated, in the following we show the improvement for all classifiers, summarizing the initial and final detection characteristics for co-training and tri-training in Table 13.1 and Table 13.2, respectively. In Figure 13.9 these results are illustrated for tri-training. From Tables 13.1 and 13.2, and especially from Figure 13.9, it can be clearly seen that the overall performance of even a bad classifier can be drastically improved. In addition, a comparison of Tables 13.1 and 13.2 shows that using one additional camera can further improve the performance of the detectors.
Table 13.2 Indoor Scenario: Performance Characteristics for Tri-Training—Initial and Final Classifiers

                       Recall   Precision   F-Measure
Initial Classifiers
  View 1                0.57      0.17        0.27
  View 2                0.71      0.23        0.35
  View 3                0.77      0.30        0.43
Final Classifiers
  View 1                0.78      0.92        0.84
  View 2                0.75      0.92        0.82
  View 3                0.79      0.92        0.85
FIGURE 13.9 Indoor scenario: examples of improvement obtained by the tri-training detector: (a) initial classifier; (b) final classifier.
Finally, we show some illustrative detection results of the final classifiers obtained by tri-training in Figure 13.10.
13.4.3 Outdoor Scenario
Our experiments were performed in the same way as for the indoor scenario described in Section 13.4.2. Even the same data was used to train the initial classifier.
FIGURE 13.10 Indoor scenario: examples of detection results obtained by the final tri-training detector: (a) view 1, (b) view 2, and (c) view 3.
Since the viewing angles were very different, however, a different geometry was required and the original samples were resized to 90×30. Since the online improvement was already demonstrated for the indoor scenario in Section 13.4.2, here we summarize only the obtained results. We provide the initial and the final detection characteristics for co-training and tri-training in Tables 13.3 and 13.4, respectively. Again it can be seen that with the proposed method a significant improvement of the classifiers can be obtained, and that training using three cameras further improves the overall performance. In contrast to the indoor scenario, the initial classifiers are highly overfitting and thus the recall rates are quite high in the beginning. The significantly worse classification results for view 1 can be explained by the smaller subjects in front of the glass doors (low contrast) that are included in the ground truth but not detected by the detector. We show some illustrative results of the finally obtained detectors in Figure 13.11.
13.4.4 Resources
The proposed approach does not need a central node, so the required bandwidth for data transfer is quite small. Thus, the approach is also applicable in resource-constrained systems such as distributed smart camera networks.
Table 13.3 Outdoor Scenario: Performance Characteristics for Co-Training—Initial and Final Classifiers

                       Recall   Precision   F-Measure
Initial Classifiers
  View 1                0.90      0.21        0.34
  View 2                0.86      0.27        0.41
Final Classifiers
  View 1                0.77      0.89        0.83
  View 2                0.91      0.90        0.91
Table 13.4 Outdoor Scenario: Performance Characteristics for Tri-Training—Initial and Final Classifiers

                       Recall   Precision   F-Measure
Initial Classifiers
  View 1                0.90      0.21        0.34
  View 2                0.86      0.27        0.41
  View 3                0.94      0.26        0.41
Final Classifiers
  View 1                0.80      0.92        0.85
  View 2                0.93      0.92        0.92
  View 3                0.92      0.90        0.91
In the following, we analyze these requirements and show that our approach is applicable even in narrow-bandwidth scenarios.
It has been shown that offline-trained object detectors (those obtained by boosting for feature selection) are highly suitable for embedded systems (e.g., [39]). Their main advantage is that the applied classifier can be trained using a powerful computer, and only the compact representation consisting of a small number of features must be stored on the embedded device. In contrast, when applying an online learning method (online boosting for feature selection), a huge number of features (O(NM), where N and M are the numbers of selectors and of features per selector, respectively) must be stored on the embedded system. Thus, in the following paragraphs we discuss the advantages of the proposed approach with respect to the required resources (especially memory) and show that the method can be applied even if system resources are limited.
As discussed in Section 13.2, we use Haar-like features as image representation; these correspond to weak classifiers. The features can have different sizes, aspect ratios, and locations within a given subwindow, all of which have to be stored.
FIGURE 13.11 Outdoor scenario: examples of detection results obtained by the final tri-training detector: (a) view 1, (b) view 2, and (c) view 3.
For instance, even considering a small window size of 64×32 pixels, we must select from several hundred thousand features to add only a few to our final ensemble classifier. For just the simplest feature type, we have to store at least two rectangles (each consisting of one x and one y coordinate, as well as width and height). Additionally, for each feature, its statistics (mean and variance for both the positive and the negative distribution) and its final decision threshold have to be stored. Again considering a 64×32 patch, such a system generates a maximum of 2,655,680 features, resulting in at least 240 MB of required memory. Note that this number grows dramatically [40] with the training patch size. Fortunately, this set is highly over-complete, so usually only a small subset (i.e., 10 percent) is required to obtain proper results; the required memory can thus be reduced to 24 MB. In our approach, online learning allows for training highly scene-specific classifiers, which can be very compact, further reducing the memory requirements. In a typical scenario, we require only 100 selectors, each typically holding 150 different features, resulting in a total of 15,000 features to be stored. Hence, we only need 500 KB of memory, which is a reasonable amount for most embedded platforms.
Considering the computational complexity, target objects can be detected in O(N). Scanning the window is fast because integral image structures allow evaluating each rectangle sum in constant time. An integral image requires a constant amount of memory: width × height × sizeof(unsigned integer). Moreover, updates can be performed very efficiently in O(NMS), where S is the number of new samples; in our approach, we use only S = 2 per frame.
Furthermore, each camera acts as autonomously as possible, so no visual information (i.e., images) has to be exchanged among the cameras. From each view, only a certain number of confidence responses is required for co-training and must be transferred between the cameras. For that purpose, only the corresponding coordinates (in the overlapping areas) and their confidences have to be transmitted. Depending on the choice of the confidence threshold θ, typically only a few hundred bytes per frame must be exchanged.
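As a rough illustration of this bandwidth argument, the sketch below packs one frame's worth of co-training messages, i.e., bounding-box coordinates and confidences only. The binary layout is an assumption; it merely shows that a few dozen detections fit into a few hundred bytes.

```python
import struct

def pack_detections(detections):
    """Pack per-frame co-training messages: only bounding-box coordinates (in the
    overlapping area) and the classifier confidence are sent, no image data.
    Layout: a uint16 count, then four uint16 plus one float32 per detection."""
    payload = struct.pack("<H", len(detections))
    for (x, y, w, h), conf in detections:
        payload += struct.pack("<4Hf", x, y, w, h, conf)
    return payload

# e.g., 20 detections -> 2 + 20 * 12 = 242 bytes per frame, in line with the
# "few hundred bytes per frame" figure given above
```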
13.5 CONCLUSIONS AND FUTURE WORK
In this chapter, we presented a novel approach for online training of a person detector from multiple cameras. The main idea is to train a general classifier in the laboratory and to adapt this classifier, unsupervised, to a specific camera view by co-training. Thus, the detection results can be iteratively improved without any user interaction. The detection results of one camera are verified by the detections of the others; for that purpose, the homography information between two cameras is applied. In particular, we apply online boosting for feature selection, which has proven to allow efficient learning as well as efficient evaluation. Since, in contrast to existing approaches, we do not have to estimate a global score map, detection can be performed efficiently. Moreover, the use of a noncentralized architecture makes our approach applicable for distributed smart camera networks. We evaluated our proposed approach on two multi-camera data sets for pedestrian detection. The results show that existing classifiers can be improved via co-training and that state-of-the-art detection results can be obtained. Future work will include automatic estimation of the ground plane and of the required homographies. In addition, we will increase the number of camera views and add additional cues such as shape to further increase the stability of the approach.
Acknowledgments. The work for this chapter was supported by the FFG project AUTOVISTA (813395) under the FIT-IT programme, the FFG project EVis (813399) under the FIT-IT programme, and the Austrian Joint Research Project Cognitive Vision under projects S9103-N04 and S9104-N04.
REFERENCES
[1] C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1999.
[2] B. Leibe, A. Leonardis, B. Schiele, Robust object detection with interleaved categorization and segmentation, International Journal of Computer Vision 77 (1–3) (2008) 259–289.
[3] P. Viola, M.J. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, in: Proceedings of the IEEE International Conference on Computer Vision, 2003.
[4] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[5] O. Tuzel, F. Porikli, P. Meer, Human detection via classification on Riemannian manifolds, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[6] B. Wu, R. Nevatia, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, in: Proceedings of the IEEE International Conference on Computer Vision, 2005.
[7] Y. Freund, R. Schapire, A decision-theoretic generalization of online learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139.
[8] P.M. Roth, H. Grabner, D. Skočaj, H. Bischof, A. Leonardis, Online conservative learning for person detection, in: Proceedings of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[9] V. Nair, J.J. Clark, An unsupervised, online learning framework for moving object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[10] B. Wu, R. Nevatia, Improving part based object detection by unsupervised, online boosting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[11] J. Berclaz, F. Fleuret, P. Fua, Principled detection-by-classification from multiple views, in: Proceedings of the International Conference on Computer Vision Theory and Applications, 2008.
[12] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multi-camera people tracking with a probabilistic occupancy map, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2) (2008) 267–282.
[13] R. Eshel, Y. Moses, Homography based multiple camera detection and tracking of people in a dense crowd, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[14] S.M. Khan, M. Shah, A multiview approach to tracking people in crowded scenes using a planar homography constraint, in: Proceedings of the European Conference on Computer Vision, 2006.
[15] M. Hu, J. Lou, W. Hu, T. Tan, Multicamera correspondence based on principal axis of human body, in: Proceedings of the IEEE International Conference on Image Processing, 2004.
[16] H. Kim, E. Murphy-Chutorian, J. Triesch, Semi-autonomous learning of objects, in: Proceedings of the IEEE Workshop on Vision for Human-Computer Interaction, 2006.
[17] H. Grabner, P.M. Roth, H. Bischof, Is pedestrian detection really a hard task?, in: Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2007.
[18] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: COLT: Proceedings of the Workshop on Computational Learning Theory, 1998.
[19] H. Grabner, H. Bischof, On-line boosting and vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[20] P.M. Roth, H. Grabner, C. Leistner, M. Winter, H. Bischof, Interactive learning a person detector: Fewer clicks—less frustration, in: Proceedings of the Workshop of the Austrian Association for Pattern Recognition, 2008.
[21] R. Hartley, A. Zisserman, Multiple View Geometry, Cambridge University Press, 2003.
[22] K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2/3) (2000) 103–134.
[23] M.-F. Balcan, A. Blum, K. Yang, Co-training and expansion: Towards bridging theory and practice, in: Advances in Neural Information Processing Systems, 2004.
[24] A. Levin, P. Viola, Y. Freund, Unsupervised improvement of visual detectors using co-training, in: Proceedings of the International Conference on Computer Vision, 2003.
[25] O. Javed, S. Ali, M. Shah, Online detection and classification of moving objects using progressively improving detectors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[26] Z.-H. Zhou, M. Li, Tri-training: Exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1529–1541.
[27] Y. Freund, R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence 14 (5) (1999) 771–780.
[28] R. Schapire, The boosting approach to machine learning: An overview, in: Proceedings of the MSRI Workshop on Nonlinear Estimation and Classification, 2001.
[29] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Annals of Statistics 28 (2) (2000) 337–407.
[30] K. Tieu, P. Viola, Boosting image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[31] N.C. Oza, S. Russell, Experimental comparisons of online and batch versions of bagging and boosting, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[32] N.C. Oza, S. Russell, Online bagging and boosting, in: Proceedings of Artificial Intelligence and Statistics, 2001.
[33] H. Grabner, C. Leistner, H. Bischof, Time dependent online boosting for robust background modeling, in: Proceedings of the International Conference on Computer Vision Theory and Applications, 2008.
[34] P. Viola, M.J. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[35] L. Shen, L. Bai, Mutual boost learning for selecting Gabor features for face recognition, Pattern Recognition Letters 27 (15) (2006) 1758–1767.
[36] R. Pflugfelder, H. Bischof, Online auto-calibration in man-made worlds, in: Proceedings of Digital Image Computing: Techniques and Applications, 2005.
[37] N.J.B. McFarlane, C.P. Schofield, Segmentation and tracking of piglets, Machine Vision and Applications 8 (3) (1995) 187–193.
[38] S. Agarwal, A. Awan, D. Roth, Learning to detect objects in images via a sparse, part-based representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (11) (2004) 1475–1490.
[39] C. Arth, H. Bischof, C. Leistner, Tricam—an embedded platform for remote traffic surveillance, in: Proceedings of the Computer Vision and Pattern Recognition Workshop on Embedded Computer Vision, 2006.
[40] R. Lienhart, J. Maydt, An extended set of Haar-like features for rapid object detection, in: Proceedings of the International Conference on Image Processing, 2002.
CHAPTER 14
Real-Time 3D Body Pose Estimation
Michael Van den Bergh, Esther Koller-Meier, Roland Kehl
ETH Zurich, Computer Vision Laboratory, Zurich, Switzerland
Luc Van Gool
ESAT-PSI/VISICS, Katholieke Universiteit Leuven, Leuven, Belgium
Abstract
This chapter presents a novel approach to markerless real-time 3D pose estimation in a multi-camera setup. We explain how foreground-background segmentation and 3D reconstruction are used to extract a 3D hull of the user. This is done in real time using voxel carving and a fixed lookup table. The body pose is then retrieved using an example-based classifier that uses 3D Haar-like wavelet features to allow for real-time classification. Average neighborhood margin maximization (ANMM) is introduced as a powerful approach to train these Haar-like features.
Keywords: pose estimation, markerless, real time, visual hull, 3D Haar-like features, example-based classification, linear discriminant analysis, average neighborhood margin maximization
14.1 INTRODUCTION
Posture recognition has received a significant amount of attention given its importance for human–computer interfaces, teleconferencing, surveillance, safety control, animation, and several other applications. The context of this work is the CyberWalk Project [1], a virtual reality system where the user walks on an omnidirectional treadmill, as shown in Figure 14.1, interacting with the virtual world using body pose commands, and the system detects certain events. For this application a markerless pose detection subsystem has to be fast and robust for detecting a predefined selection of poses. We present an example-based technique for real-time markerless rotation-invariant pose recognition using average neighborhood margin maximization (ANMM) [2] and 3D Haar wavelet-like features [3]. (The latter will be called Haarlets for brevity.) In example-based approaches, observations are compared and matched against stored examples of human body poses. In our approach, these observations consist of 3D hulls of the user. The system makes use of a multi-camera setup, in which the cameras are placed around the user.
FIGURE 14.1 User walking on the CyberWalk omnidirectional platform.
First, foreground-background segmentation is used to extract the user from the background. Then the segmentations from the different cameras are combined to make a 3D hull reconstruction. This is done in real time using voxel carving and a fixed lookup table [4]. The camera network is distributed, as each camera is connected to a separate PC that runs the foreground–background segmentation. The segmentations are sent to a central PC that runs the hull reconstruction. The body pose is then determined from this 3D hull using an example-based classifier that employs 3D Haarlets to allow for real-time classification. ANMM, which is based on linear discriminant analysis (LDA), is introduced as a powerful approach to train these Haarlets. Where the classic AdaBoost [5] runs into memory issues when training 3D rather than 2D Haarlets [6], the lower memory requirements of ANMM allow for a straightforward implementation of a 3D pose detector based on 3D Haarlets. The benefit of classifying 3D hulls rather than 2D silhouettes is that the orientation of the hulls can be normalized. Finally, we explain how an overhead tracker is used to estimate the orientation of the user in order to normalize the orientation of the extracted hull, thus making the pose estimation system rotation invariant.
In this chapter we first give an overview of the different real-time pose estimation approaches. We also provide an overview of the different 3D hull reconstruction techniques and explain the one we have chosen, considering the real-time nature of the system. For pose classification based on these 3D hulls, we present ANMM as a powerful new method and evaluate it against LDA. We extend ANMM to 3D and show how it can be used to train 3D Haarlets for real-time classification. The 3D approach benefits from increased robustness and the possibility of making the system rotation invariant. We show these benefits by comparing the system to the 2D case. The result is a pose estimation system with the same or better performance than the state of the art but at faster, interactive speeds.
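To make the hull reconstruction step concrete, the following is a minimal sketch of silhouette-based voxel carving under the assumption of calibrated 3×4 projection matrices. It is not the implementation of [4]; in particular, the per-voxel image coordinates computed here on the fly would, in the actual system, be precomputed once into the fixed lookup table mentioned above.

```python
import numpy as np

def carve_voxels(silhouettes, projections, grid_points):
    """Space carving from binary silhouettes: a voxel is kept in the visual hull only
    if its projection falls on the foreground in every camera view.
    silhouettes: list of HxW boolean masks, one per camera
    projections: list of 3x4 camera projection matrices
    grid_points: (V, 3) array of voxel centers in world coordinates"""
    V = grid_points.shape[0]
    hom = np.hstack([grid_points, np.ones((V, 1))])          # homogeneous voxel centers
    occupied = np.ones(V, dtype=bool)
    for mask, P in zip(silhouettes, projections):
        p = hom @ P.T                                        # project all voxels at once
        u = np.round(p[:, 0] / p[:, 2]).astype(int)
        v = np.round(p[:, 1] / p[:, 2]).astype(int)
        h, w = mask.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(V, dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        occupied &= hit                                      # carve away voxels outside any silhouette
    return occupied
```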
14.2 BACKGROUND
This section provides an overview of methods to estimate body pose; they are divided into two categories: model-based and example-based. The model-based methods can also
be called tracking methods, as they track individual body parts in an articulated body model. Example-based methods do not rely on body models but match the input to a set of predefined poses.
14.2.1 Tracking
Our first choice was in favor of example-based rather than model-based (tracking) techniques. Model-based approaches typically rely on articulated 3D body models [7–11]. In order to be effective, they need a high number of degrees of freedom in combination with nonlinear anatomical constraints. Consequently, they require time-consuming per-frame optimization, and the resulting trackers are too slow for real time (>25 Hz). They are also very sensitive to fast motions and segmentation errors. Most methods exploit 2D image information for tracking. However, these cues only offer weak support to the tracker, which quickly leads to sophisticated, and therefore often rather slow, optimization schemes.
Multiple calibrated cameras allow for the computation of the subject's 3D shape, which provides a strong cue for tracking because the 3D shape only contains information consistent over all individual views with respect to some hypothesis and thus discards, for example, clutter edges or spikes in the silhouettes. The increased computational power offered by cheap consumer PCs made real-time computation of the 3D shape or hull possible and created several interesting approaches to full-body tracking.
Cheung et al. [12] introduced the SPOT algorithm, a rapid voxel-based method for volumetric human reconstruction. Real-time tracking is achieved by assigning the voxels in the new frame to the closest body part of the previous one. Based on this registration, the positions of the body parts are updated over consecutive frames. However, this simple approach does not guarantee that two adjacent body parts will not drift apart, and it can easily lose track of moderately fast motions. Furthermore, to obtain good segmentation, the subject has to wear a dark suit.
Cheung et al. [13] used both color information and a shape-from-silhouette method for full-body tracking, although not in real time. Colored surface points (CSPs) segment the hull into rigidly moving body parts based on the results of the previous frames, and take advantage of the constraint of equal motion of parts at their coupling joints to estimate joint positions. A complex initialization sequence recovers an actor's joint positions, which are used to track the same actor in new video sequences.
Mikic et al. [14] proposed a similar voxel-based method for full-body tracking. After volumetric reconstruction, the different body parts are located using sequential template growing and fitting. The fitting step uses the placement of the torso computed by template growing to obtain a better starting point for the voxel labeling. Furthermore, an extended Kalman filter estimates the parameters of the model given the measurements. To achieve robust tracking, the method uses prior knowledge of average body part shapes and dimensions.
Kehl et al. [4] also proposed a markerless solution for full-body pose tracking. A model built from super-ellipsoids is fitted to a colored volumetric reconstruction using stochastic meta descent (SMD), taking advantage of the color information to overcome ambiguities caused by limbs touching each other. To increase robustness and accuracy, the tracking is refined by matching model contours against image edges.
FIGURE 14.2 Full-body pose tracker (Kehl et al. [4]) in action.
The results of this tracker are shown in Figure 14.2. Similar to the previously mentioned tracking approaches, this system takes approximately 1.3 seconds to track one frame. As the input data for a real-time system is generally recorded at 15 to 30 Hz, the tracking is too slow and as a result too sensitive to fast motions. Tracking-based approaches thus suffer from a trade-off between complex, accurate tracking at roughly 1 Hz and faster but more inaccurate tracking. In both cases it is difficult not to lose track of the subject in an interactive system where the user walks or moves a lot. Therefore, we chose to focus on example-based methods. In example-based approaches, instead of tracking articulated body models, observations are compared and matched against stored examples of human body poses. These stored examples can be 2D silhouettes or reconstructed 3D hulls.
14.2.2 Example-Based Methods
Example-based methods benefit from the fact that the set of typically interesting poses is far smaller than the set of anatomically possible ones, which is good for robustness. Because the pose is estimated on a frame-by-frame basis, it is not possible to lose track of the subject. Also, not needing an explicit parametric body model makes these methods more amenable to real-time implementation and to pose analysis of structures other than human bodies, such as animals. Silhouettes (and the visual hulls derived from them) seem to capture the essence of human body poses well, as illustrated in Figure 14.3. Compared to tracking, relatively few example-based pose estimation methods exist in the literature. Rosales and Sclaroff [15] trained a neural network to map example 2D silhouettes to 2D positions of body joints. Shakhnarovich et al. [16] outlined a framework for fast pose recognition using parameter-sensitive hashing. In their framework, image features such as edge maps, vector responses of filters, and edge direction histograms can be used to match silhouettes against examples in a database. Ren et al. [17] applied this parameter-sensitive hashing framework to 2D Haarlets for pose recognition; the Haarlets are trained using AdaBoost.
FIGURE 14.3 (a) Input image. (b) Silhouettes extracted from input images using foreground–background segmentation. (c) Silhouettes combined to reconstruct a 3D hull.
The primary limitation of silhouette-based approaches is that the stored silhouettes are not invariant to changes in subject orientation. A visual hull, however, can be reconstructed from the silhouettes of several camera views and then rotated to a standard orientation before being used for training or classification. The result is a rotation-invariant system. The example-based approach proposed by Cohen and Li [18] matches 3D hulls with an appearance-based 3D shape descriptor and a support vector machine (SVM). This method is rotation invariant but, running at 1 Hz, it is not real time. Weinland et al. [19] and Gond et al. [20] proposed similar hull-based approaches but provide no figures concerning classification speed. To build a 3D hull system capable of real-time performance, we aim to combine the speed of Haarlets with the strength of ANMM.
14.3 SEGMENTATION
The first step in our system is foreground–background segmentation, which considers the difference between the observed image and a model of the background. Regions where the observed image and the background model differ significantly are defined as foreground, as illustrated in Figure 14.4. The background model is typically calculated from a set of images of the empty working volume. Background subtraction works only for static backgrounds, since the same model is used for subsequent images. For our setup a static background can be assumed, apart from an audience or slightly flickering light tubes. It is essential to have a good similarity measure for two colors. Considering either the difference or the angle between two observed color vectors is not advisable, because both vectors would require normalization: it makes a significant difference whether an angular difference is found for long or short signal vectors. Therefore, we use the illumination-invariant collinearity criterion proposed by Mester et al. [21].
FIGURE 14.4 Result of our illumination-invariant background subtraction method. (a) Input image. (b) Background. (c) Resulting segmentation. Note that the black loudspeaker in the background batters a hole in the foreground region. (d) Segmentation result with darkness compensation; the region in front of the black loudspeaker is now segmented correctly.
Let xf be the vector of all RGB values within a 3×3 neighborhood in the input image, and let xb be the corresponding vector in the background image. Collinearity can be tested by estimating the common signal direction u, forming the difference vectors df = xf − (xf · u)u and db = xb − (xb · u)u, and calculating the sum of their squared norms:

D² = |df|² + |db|²    (14.1)
Minimizing D² estimates u and yields zero if the two vectors xf and xb are collinear—that is, the difference vectors and hence the sum of their norms are zero. If the two vectors are collinear, no change is judged to be present and the background is still visible. If they are not collinear, the pixels are considered to have different colors and a foreground pixel is found. However, as our observed color vectors are noisy, perfect collinearity is unlikely. Griesser et al. [22] showed that applying a static threshold Ts and an adaptive threshold Tadapt to D² makes the segmentation robust against noise:

D² ≷ Ts + Tadapt    (14.2)

with the pixel classified as foreground if D² exceeds the threshold and as background otherwise.
The adaptive threshold is used to incorporate spatio-temporal considerations. Spatial compactness is induced by giving a pixel a higher chance of being foreground if several of its neighbors have this status. A sampled Markov random field (MRF) is used to enforce this spatial compactness in an iterative manner. Temporal smoothness can be achieved by using the results of the previous frame to initialize the MRF. The collinearity test makes the method intensity invariant and thus provides robustness against lighting changes and shadows. However, dark colors are problematic for this method, as they can be seen as a low-intensity version of any color and consequently as a match with any color. To remedy this, an additional component with a constant value Odc is added to both vectors (darkness compensation). This additional component renders the color similarity measure more sensitive to differences, especially when dark pixels are involved. Objects or backgrounds with dark colors can thus be
segmented as illustrated in Figure 14.4, where the region in front of the black loudspeaker in the top left of the image is now correctly segmented. Segmentation is controlled by three user-defined parameters: the static threshold Ts, the darkness offset Odc, and the importance factor B of the spatio-temporal compactness. First, the static threshold Ts is determined with Odc and B set to zero. The darkness offset Odc is then increased until a balance between appearing shadows and vanishing holes is reached. Finally, the compactness value B is increased until the foreground regions are smooth and compact.
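To make the decision rule concrete, the following minimal Python/NumPy sketch implements the collinearity measure with darkness compensation and the threshold test of equation 14.2 for a single pixel neighborhood. The function names are our own, the SVD-based estimation of the signal direction u is one possible implementation choice, and the MRF-based adaptive threshold and the GPU implementation of Griesser et al. [22] are omitted.

import numpy as np

def collinearity_distance(x_f, x_b, o_dc=0.0):
    """Illumination-invariant collinearity measure D^2 (equation 14.1) between a
    foreground patch vector x_f and the corresponding background vector x_b
    (e.g., the stacked RGB values of a 3x3 neighborhood). A constant darkness-
    compensation component o_dc is appended to both vectors so that very dark
    pixels do not match arbitrary colors."""
    xf = np.append(np.asarray(x_f, dtype=float), o_dc)
    xb = np.append(np.asarray(x_b, dtype=float), o_dc)

    # Estimate the common signal direction u that minimizes D^2; for two
    # observations this is the dominant left singular vector of the matrix
    # whose columns are xf and xb.
    u = np.linalg.svd(np.stack([xf, xb], axis=1), full_matrices=False)[0][:, 0]

    d_f = xf - np.dot(xf, u) * u       # residual of xf orthogonal to u
    d_b = xb - np.dot(xb, u) * u       # residual of xb orthogonal to u
    return np.dot(d_f, d_f) + np.dot(d_b, d_b)

def is_foreground(x_f, x_b, t_static, t_adapt=0.0, o_dc=0.0):
    """Pixel-wise decision of equation 14.2 (the spatio-temporal MRF that
    produces the adaptive threshold is omitted in this sketch)."""
    return collinearity_distance(x_f, x_b, o_dc) > t_static + t_adapt

In practice this test is evaluated for every pixel of every camera image, so a vectorized or GPU implementation is required to reach the segmentation times reported later in this chapter.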
14.4 RECONSTRUCTION
Computing the visual hull of an object requires its silhouettes in a number of available images together with the centers of projection of the corresponding cameras. If we want to reconstruct an object, we know that it is included in the generalized cone extruded from the silhouette with its origin at the camera center. The intersection of these cones from multiple calibrated camera views yields a volume that contains the object. This principle is called shape from silhouette and produces a volume that approximates the object reasonably well if a sufficient number of cameras with different lines of sight are used. The approximated volume is called the visual hull of the object and is commonly defined as the largest possible volume that exactly explains a set of consistent silhouette images [23]. Figure 14.5 illustrates the principle for three camera views.
FIGURE 14.5 Visual hull of the object, created by the intersection of the generalized cones extruded from its silhouettes.
Note that the visual hull is never an exact representation of the object, because concave regions cannot be reconstructed from silhouettes and an infinite number of camera views would be needed to compute the exact visual hull [24]. However, our results show that even a coarse approximation of the subject's visual hull from four to five views is sufficient for body pose estimation. Shanmukh and Pujari [25] provide guidelines for choosing an optimal camera setup for object reconstruction. Our definition of the visual hull in this chapter is limited to using a finite number of camera views. Algorithms for shape from silhouette can be roughly divided into three groups:
Volumetric reconstruction using voxels. This technique divides the working volume into a discrete grid of smaller volumes, so-called voxels, and projects them successively onto the image planes of the available camera views. Voxels lying outside of the silhouette in at least one view do not belong to the intersection of the cones and can be discarded. Because of their simplicity, voxel-based procedures have been used for body tracking [26–30]. Their drawback is that they tend to be expensive, as a high number of voxels must be projected into the image planes.
Polyhedral visual hull. This is a surface-based approach that computes the visual hull from a polygonal representation of the silhouettes, applying constructive solid geometry (CSG) to compute the intersection of the corresponding polyhedra. Real-time algorithms were proposed by Matusik et al. [31] and Franco and Boyer [32]. The polyhedral visual hull offers better accuracy than voxel-based procedures, as it does not work on a discretized volume. Moreover, the resulting triangle mesh is perfectly suited for rendering on graphics hardware. Still, because of the complexity of the geometric calculations, these algorithms are fragile and rely on perfect silhouettes: corrupted silhouettes often result in incomplete or corrupted surface models. In the application described in this chapter, silhouettes are often corrupted by reflections and noisy segmentation.
Space carving and photo consistency. Space carving is a volumetric reconstruction technique that uses both color consistency and silhouettes, as proposed by Kutulakos and Seitz [33] and Seitz and Dyer [34]. Voxels that are not photo-consistent across all camera views in which they are visible are carved away. Photo-consistency methods often assume constant illumination and Lambertian reflectance. The reconstructed volume contains only the surface voxels and is often referred to as the photo hull. Visibility of the voxels is critical for this method and is usually solved by making multiple plane-sweep passes, each time using only the cameras in front of the plane and iterating until convergence. Unfortunately, the complexity of this method makes it difficult to achieve real-time computation. Cheung et al. [12, 13] therefore proposed a mixed approach between visual hull and photo consistency that uses the property that each bounding edge of the visual hull touches the real object in at least one point. Photo consistency thus has to be tested only for the bounding edges of the visual hull, which can be done at moderate cost. However, the resulting reconstruction is very sparse and needs a large amount of input data to be practical.
Voxel-based shape-from-silhouette methods are popular but tend to be computationally expensive, as a high number of voxels have to be projected into the camera images. Most implementations speed up this process by using an octree representation to compute the result from coarser to finer resolutions (Szeliski [35]); others exploit hardware acceleration (Hasenfratz et al. [29]).
FIGURE 14.6 Lookup table stored at each pixel in the image with pointers to all voxels that project onto that pixel. Expensive projections of voxels can be avoided and the algorithm can take advantage of small changes in the images by addressing only voxels whose pixel has changed.
Our method addresses the problem the other way around, as proposed by Kehl et al. [4]. Instead of projecting the voxels into the camera views at each frame, we keep a fixed lookup table (LUT) for each camera view and store, at each pixel, a list of pointers to all voxels that project onto that particular pixel (see Figure 14.6). This way, the image coordinates of the voxels have to be neither computed during runtime nor stored in memory; the LUTs are computed once at startup. The proposed reversal of the projection allows for a compact representation of the voxels: each is represented by a bit mask where bit bi is 1 if its projection lies in the foreground of camera i and 0 otherwise. Thus, a voxel belongs to the object (i.e., is labeled as active) if its bit mask contains only 1s, which can be evaluated rapidly by byte comparisons. Another advantage of our method is that the voxel space can be updated instead of computed from scratch for each frame. A voxel only changes its label if one of the pixels it projects to changes from foreground to background or vice versa. Therefore, as we can directly map from image pixels to voxels, we only have to look up the voxels linked to those pixels that have changed their foreground–background status. This leads to far fewer voxel lookups compared to standard methods, where for each frame all voxels have to be visited in order to determine their labels. The reconstruction itself is done pixel by pixel through all segmented (binary) images. If a pixel of the current view i has changed its value compared to the previous frame, the corresponding bit bi of all voxels contained in the reference list of this pixel is set to the new value and these voxels' labels are determined again. Results of our reconstruction algorithm can be seen in Figure 14.7. With this approach, the reconstruction of a hull from six cameras takes about 15 ms.
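The following sketch illustrates the reversed-projection idea: precomputed lookup tables map changed pixels back to the voxels they influence, and voxel labels are updated incrementally. The class and method names and the dictionary-based LUT representation are our own illustrative choices, not the implementation of Kehl et al. [4].

import numpy as np

class HullReconstructor:
    """Incremental visual-hull update using per-pixel lookup tables (LUTs).

    luts[c] maps a pixel (y, x) of camera c to the indices of all voxels that
    project onto it; it is precomputed once from the camera calibration. Each
    voxel keeps one flag per camera, and a voxel is active (inside the hull)
    when all of its flags are set."""

    def __init__(self, luts, num_voxels):
        self.luts = luts
        self.bits = np.zeros((num_voxels, len(luts)), dtype=bool)
        self.prev = [None] * len(luts)            # previous binary silhouettes

    def update(self, cam, silhouette):
        """Update the voxel flags of one camera from its new binary silhouette,
        touching only voxels whose pixel changed since the previous frame."""
        sil = np.asarray(silhouette, dtype=bool)
        if self.prev[cam] is None:
            changed = np.argwhere(np.ones_like(sil))   # first frame: all pixels
        else:
            changed = np.argwhere(sil != self.prev[cam])
        for y, x in changed:
            for v in self.luts[cam].get((y, x), ()):
                self.bits[v, cam] = sil[y, x]
        self.prev[cam] = sil

    def active_voxels(self):
        """Indices of voxels whose projection is foreground in every view."""
        return np.flatnonzero(self.bits.all(axis=1))

Storing the per-camera occupancy as a boolean matrix mirrors the bit-mask representation described above; a production system would pack the bits and evaluate them with byte comparisons.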
FIGURE 14.7 Examples of 3D hull reconstruction.
14.5 CLASSIFIER
Our approach aims to classify poses based on 3D hulls of the subject. In this section we propose an example-based classifier in which the input samples (hulls) are compared to poses stored in a database. Each frame is classified independently of the others.
14.5.1 Classifier Overview
Figure 14.8 shows the basic classifier structure, where T denotes a transformation found using average neighborhood margin maximization (ANMM). This transformation is based on linear discriminant analysis (LDA) and projects the input samples onto a lower-dimensional space where the different pose classes are maximally separated and easier to classify. Using a nearest neighbor (NN) approach, these projected samples are matched to stored poses in a database, and the closest match is the output of the system. Later, to improve the speed of the system, the transformation T can be approximated using Haarlets, as discussed in Section 14.6.
14.5.2 Linear Discriminant Analysis
The goal of the LDA step is to find a transformation that helps to discriminate between the different pose classes. It provides a linear transformation that projects the input hulls onto a lower-dimensional space where they are maximally separated before they are classified. The training examples (hulls) are divided into different pose classes. The voxel values of each hull are stored in an n-dimensional vector, where n is the total number of voxels in the input hulls. The idea is to find a linear transformation such that the classes are maximally separable after the transformation [36].
FIGURE 14.8 Basic classifier structure. The input samples (hulls) are projected with transformation T onto a lower-dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).
Class separability can be measured by the ratio of the determinant of the between-class scatter matrix SB and the within-class scatter matrix SW. The optimal projection Wopt is chosen as the transformation that maximizes this ratio:

Wopt = arg max_W |W SB W^T| / |W SW W^T|    (14.3)

and is determined by calculating the generalized eigenvectors of SB and SW. Therefore,

Wopt = [w1 w2 . . . wm]^T    (14.4)

where wi are the generalized eigenvectors of SB and SW corresponding to the m largest generalized eigenvalues λi. The eigenvalues represent the weight of each eigenvector and are stored in a diagonal matrix D; the eigenvectors wi represent characteristic features of the different pose classes. A solution to the optimization problem in equation 14.3 is to compute the inverse of SW and solve an eigenproblem for the matrix SW^{-1} SB [36]. Unfortunately, SW is singular in most cases because the number of training examples is smaller than the number of dimensions of the sample vector. Thus, inverting SW is impossible. For this reason, it is better to look for an alternative in which a different matrix, one that does not suffer from this dimensionality problem, is used.
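For completeness, when SW happens to be invertible the LDA projection of equations 14.3 and 14.4 can be obtained with a generalized eigenvalue solver, as in the short sketch below (function name ours, SciPy assumed available). As just discussed, SW is usually singular for voxel data, which is precisely what motivates ANMM.

import numpy as np
from scipy.linalg import eigh

def lda_projection(S_B, S_W, m):
    """Generalized eigenvectors of (S_B, S_W) for equation 14.3; only valid
    when S_W is non-singular (rarely the case here, hence ANMM)."""
    vals, vecs = eigh(S_B, S_W)               # solves S_B w = lambda S_W w
    order = np.argsort(vals)[::-1][:m]        # keep the m largest eigenvalues
    return vecs[:, order].T, vals[order]      # rows are w_i, weights lambda_i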
14.5.3 Average Neighborhood Margin Maximization
LDA aims to pull apart the class means while compacting the classes themselves. This introduces the small sample size problem, which renders the within-class scatter matrix singular. Furthermore, LDA can only extract c − 1 features (where c is the number of classes), which is suboptimal for many applications. ANMM, as proposed by Wang and Zhang [2], is a similar approach but one that avoids these limitations.
FIGURE 14.9 How ANMM works. (a) For each sample, within a neighborhood (gray ), samples of the same class are pulled toward the class center, while samples of a different class are pushed away. (b) The data distribution in the projected space.
For each data point, ANMM pulls the neighboring points with the same class label toward it, as near as possible, while simultaneously pushing the neighboring points with different labels as far away as possible. This principle is illustrated in Figure 14.9. Instead of using the between-class scatter matrix SB and the within-class scatter matrix SW, ANMM defines a scatterness matrix

S = Σ_{i,k: xk ∈ Ni^e} (xi − xk)(xi − xk)^T / |Ni^e|    (14.5)

and a compactness matrix

C = Σ_{i,j: xj ∈ Ni^o} (xi − xj)(xi − xj)^T / |Ni^o|    (14.6)
where Ni^o is the set of the n most similar data points in the same class as xi (the n-nearest homogeneous neighborhood) and Ni^e is the set of the n most similar data points in a different class from xi (the n-nearest heterogeneous neighborhood). The ANMM eigenvectors Wopt can then be found by the eigenvalue decomposition of S − C. ANMM introduces three main benefits compared to traditional LDA: (1) it avoids the small sample size problem since it does not need to compute any matrix inverse; (2) it can find the discriminant directions without assuming a particular form of class densities (LDA assumes a Gaussian form); and (3) many more than c − 1 feature dimensions are available. Some examples of resulting ANMM eigenvectors are shown in Figure 14.10. Using ANMM rather than LDA, the classifier achieves roughly 10 percent better performance.
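A didactic sketch of ANMM training is given below. It builds the scatterness and compactness matrices of equations 14.5 and 14.6 explicitly and is therefore only practical for moderate feature dimensions and training set sizes; the function name and the Euclidean neighborhood search are our own choices.

import numpy as np

def anmm_transform(X, y, n_neighbors=5, n_dims=40):
    """ANMM projection from training samples X (one vectorized hull per row)
    and class labels y, following equations 14.5 and 14.6."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    S, C = np.zeros((d, d)), np.zeros((d, d))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    for i in range(n):
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(y != y[i])
        homo = same[np.argsort(dist[i, same])][:n_neighbors]    # N_i^o
        hetero = diff[np.argsort(dist[i, diff])][:n_neighbors]  # N_i^e
        for k in hetero:                                        # scatterness S
            v = X[i] - X[k]
            S += np.outer(v, v) / len(hetero)
        for j in homo:                                          # compactness C
            v = X[i] - X[j]
            C += np.outer(v, v) / len(homo)

    # Eigen-decomposition of S - C; keep the eigenvectors with the largest
    # eigenvalues as the rows of W_opt, with the eigenvalues as their weights.
    vals, vecs = np.linalg.eigh(S - C)
    order = np.argsort(vals)[::-1][:n_dims]
    return vecs[:, order].T, vals[order]

The returned rows and eigenvalues correspond to Wopt and the diagonal of D used in the Haarlet selection of Section 14.6.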
FIGURE 14.10 First 4 eigenvectors for the frontal view only, after training for a 12-pose set using the ANMM algorithm.
14.6 HAARLETS
Computing the transformation T shown in Figure 14.8 can be computationally demanding, especially if there are many ANMM eigenvectors. To improve the speed of the system, the transformation T in the classifier can be approximated using Haarlets, as shown in Figure 14.11. In this case the transformation T is approximated by a linear combination C of Haarlets. An optimal Haarlet set is selected during the training stage. Computing this set on the input image results in a number of coefficients which, when transformed with C, approximate the coefficients that would result from applying T to the same input data. They can be used for subsequent classification in the same manner as in the pure ANMM case. Because of their speed of computation, Haarlets are very popular for real-time object detection and real-time classification. The ANMM approximation approach provides a new and powerful method for selecting or training them, especially in the 3D case, where existing methods fail because of the large number of candidate Haarlets, as noted by Ke et al. [6]. Our approach makes it possible to train 3D Haarlets by selecting from the full set of candidates. Papageorgiou et al. [37] proposed a framework for object detection based on 2D Haarlets, which can be computed with a minimum of memory accesses and CPU operations using the integral image. Viola and Jones [5] used AdaBoost to select suitable 2D Haarlets for object detection, and the same approach was used for pose recognition by Ren et al. [17]. Our approach uses similar Haarlets, although they are three-dimensional, and it introduces a new selection process based on ANMM.
14.6.1 3D Haarlets
The concepts of an integral image and Haarlets can be extended to three dimensions. The 3D integral image, or integral volume, is defined as

ii(x, y, z) = Σ_{x′≤x, y′≤y, z′≤z} i(x′, y′, z′)    (14.7)
Using the integral volume, any rectangular box sum can be computed in eight array references, as shown in Figure 14.12. Accordingly, the integral volume makes it possible to construct volumetric box features similar to the 2D Haarlets. We introduce the 3D Haarlet set as illustrated in Figure 14.13.
FIGURE 14.11 Classifier structure illustrating the Haarlet approximation. The pretrained set of Haarlets is computed on the input sample (silhouette or hull). The approximated coefficients are computed as a linear combination C of the Haarlet coefficients. The contents of the dotted-line box constitute an approximation of T in Figure 14.8.
FIGURE 14.12 Sum of voxels within the gray cuboid computed with eight array references. If A, B, C, D, E, F, G, and H are the integral volume values at the locations shown, the sum can be computed as (B + C + E + H) − (A + D + F + G).
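A minimal sketch of the integral volume of equation 14.7 and the eight-reference box sum of Figure 14.12 (function names ours):

import numpy as np

def integral_volume(vol):
    """3D integral image of equation 14.7, zero-padded so that ii[x, y, z]
    holds the sum of all voxels with indices strictly below (x, y, z)."""
    ii = np.asarray(vol, dtype=np.int64).cumsum(0).cumsum(1).cumsum(2)
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)))

def box_sum(ii, x0, y0, z0, x1, y1, z1):
    """Sum over the half-open box [x0, x1) x [y0, y1) x [z0, z1) using the
    eight array references of Figure 14.12 (inclusion-exclusion)."""
    return (ii[x1, y1, z1] - ii[x0, y1, z1] - ii[x1, y0, z1] - ii[x1, y1, z0]
            + ii[x0, y0, z1] + ii[x0, y1, z0] + ii[x1, y0, z0] - ii[x0, y0, z0])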
14.6.2 Training
Viola and Jones [5] used AdaBoost to select suitable 2D Haarlets for object detection, and the same approach was used for pose recognition by Ren et al. [17].
FIGURE 14.13 Proposed 3D Haarlets. The first 15 features are extruded versions of the original 2D Haarlets in all 3 directions; the final 2 are true 3D center-surround features.
Considering memory and processing time constraints, Ke et al. [6] noted that it is not possible to evaluate the full set of candidate 3D Haarlets using AdaBoost, so only a fraction of the full dictionary can be used at a very limited resolution. This makes it virtually impossible to train a useful 3D Haarlet set using AdaBoost. In our approach we introduce a new selection process based on ANMM: the Haarlets are selected so that a linear combination of them approximates Wopt (Section 14.5.3). The particular Haarlet set used here is shown in Figure 14.13. Along with feature type, Haarlets can vary in width, height, depth, and position inside the voxel space. At a 24×24×24 resolution, this results in hundreds of millions of candidate features. The best Haarlets are obtained from this set by convolving all candidates with the vectors in Wopt and selecting those with the highest coefficients (i.e., the highest response magnitudes). The score of each candidate Haarlet is found by calculating its dot product with each ANMM vector (each row in Wopt) and computing the weighted sum using the weights of those ANMM vectors, as stored in the diagonal matrix D (i.e., the eigenvalues serve as weights). Thus, the entire ANMM eigenspace is approximated as a whole, giving higher priority to dimensions with a higher weight when selecting Haarlets. This dot product can be computed very efficiently using the integral volume. Most selected Haarlets would be redundant unless Wopt is adapted after each new Haarlet is selected, before choosing the next one. Let F be a matrix containing the already selected Haarlets in vector form, where each row of F is a Haarlet. F can be regarded as a basis that spans the feature space that can be represented by the Haarlet vectors selected so far. Basically, we do not want the next selected Haarlet to lie in the space already
represented by F. Let N be a basis of the null space of F:

N = null(F)    (14.8)

N forms a basis that spans everything not yet described by F. To obtain the new optimal transformation we project D · Wopt onto N, where D is the diagonal matrix containing the weights of the eigenvectors wi in Wopt:

D′ · W′opt = D · Wopt · N · N^T    (14.9)

or

W′opt = D′^{-1} · D · Wopt · N · N^T    (14.10)

where D′ is a diagonal matrix containing the new weights λ′i of the new eigenvectors w′i in W′opt,

λ′i = ||λi · wi · N · N^T||    (14.11)
Every time a new Haarlet is selected based on the updated Wopt, F is updated accordingly and the whole process is iterated until the desired number of Haarlets is obtained. Examples of selected Haarlets are shown in Figure 14.14.
FIGURE 14.14 (a) Three example ANMM eigenvectors. (b) Approximation using 10 Haarlets. The first example shows how a feature is selected to inspect the legs; the last example shows a feature that distinguishes between the left and right arm stretched forward.
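The greedy selection procedure of equations 14.8 through 14.11 can be sketched as follows. The candidate Haarlets are assumed to be given as an already vectorized, manageable subset; in the real system their responses are computed from the integral volume rather than by dense dot products, and the full candidate set is far larger than this sketch could handle. All names are illustrative.

import numpy as np

def select_haarlets(W_opt, weights, candidates, num_select):
    """Greedy ANMM-based Haarlet selection.
    W_opt: (m, d) ANMM eigenvectors as rows; weights: their m eigenvalues
    (the diagonal of D); candidates: (n_cand, d) vectorized candidate Haarlets."""
    W = np.asarray(W_opt, dtype=float).copy()
    lam = np.asarray(weights, dtype=float).copy()
    selected = []
    F = np.empty((0, W.shape[1]))

    for _ in range(num_select):
        # Score each candidate: weighted sum of its responses to the eigenvectors.
        scores = (np.abs(candidates @ W.T) * lam).sum(axis=1)
        best = int(np.argmax(scores))
        selected.append(best)
        F = np.vstack([F, candidates[best]])

        # Project the weighted eigenvectors onto the null space of F
        # (equations 14.9-14.11) so that already-covered directions are removed.
        Q, _ = np.linalg.qr(F.T)              # orthonormal basis of the row space of F
        DW = lam[:, None] * W
        DW = DW - (DW @ Q) @ Q.T              # equivalent to DW . N . N^T
        lam = np.linalg.norm(DW, axis=1)      # new weights lambda'_i
        W = DW / np.maximum(lam[:, None], 1e-12)
    return selected

Using a QR factorization of F^T to project onto the null space of F is numerically equivalent to forming the explicit N · N^T product of equations 14.9 and 14.10.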
14.6.3 Classification
After the ANMM vectors have been computed and the Haarlets have been selected to approximate them, the next step is to actually classify new silhouettes. This process uses the Haarlets to extract coefficients from the normalized silhouette image; it then computes a linear combination of these coefficients to approximate the coefficients that would result from the ANMM transformation. An example of such an approximated ANMM feature vector is shown in Figure 14.14. The resulting coefficients can be used to classify the pose of the silhouette. Given the coefficients h extracted with the Haarlets, the approximated ANMM coefficients l can be computed as

l = L · h    (14.12)

where L is an m×n matrix in which m is the number of ANMM eigenvectors and n is the number of Haarlets used for the approximation. L can be obtained as the least-squares solution to the system

Wopt = L · F^T    (14.13)

The least-squares solution to this problem yields

L = Wopt · ((F^T F)^{-1} F^T)^T    (14.14)
L provides a linear transformation of the feature coefficients h to a typically smaller number of ANMM coefficients l. This allows the samples to be classified directly based on these ANMM coefficients, whereas an AdaBoost method needs to be complemented with a detector cascade [5] or a hashing function [16, 17]. Finally, using NN search, the new silhouettes can be matched to the stored examples (i.e., the mean coefficients of each class).
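The classification stage can be summarized by the short sketch below, which obtains L with a generic least-squares solver instead of the closed form of equation 14.14 and then performs the nearest-neighbor match. The function names and the dense computation of the Haarlet coefficients (in practice done with the integral volume) are our own simplifications.

import numpy as np

def fit_coefficient_map(W_opt, F):
    """Least-squares map L from Haarlet coefficients to approximate ANMM
    coefficients. W_opt: (m, d) ANMM eigenvectors as rows; F: (n, d) selected
    Haarlets as rows, so that h = F @ x for a vectorized hull x."""
    # Solve min_L ||W_opt - L @ F||_F, i.e., min_Z ||F.T @ Z - W_opt.T||_F with L = Z.T
    Z, *_ = np.linalg.lstsq(F.T, W_opt.T, rcond=None)
    return Z.T                                        # shape (m, n)

def classify(hull, F, L, class_means):
    """Nearest-neighbor pose classification from approximated ANMM coefficients."""
    h = F @ hull.ravel()                              # Haarlet coefficients
    l = L @ h                                         # approximated ANMM coefficients
    dists = np.linalg.norm(class_means - l, axis=1)   # class_means: (num_classes, m)
    return int(np.argmin(dists))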
14.6.4 Experiments
In this section we evaluate how many Haarlets are needed for a good ANMM approximation, and we measure the speed improvement over a pure ANMM approach. For this experiment a 50-pose classifier was trained using 2000 training samples of a subject in different positions and orientations. The experiment was set in an office scenario with a cluttered background and thus sometimes noisy segmentations. The samples were recorded from six cameras connected to six computers that ran foreground–background segmentation on the recorded images. From these segmented silhouettes, 3D hulls were reconstructed and normalized for size and orientation to 24×24×24 voxels. Validation was done using 4000 test samples. The resulting classifier uses 44 ANMM eigenvectors, which can be approximated almost perfectly with 100 Haarlets. The number of Haarlets used determines how well the original ANMM transformation is approximated, as shown in Figure 14.15. There is thus no overfitting: beyond a certain number of Haarlets the approximation delivers the same classification performance as the pure ANMM classification. With 3D ANMM, the classifier achieves 97.52 percent correct classification on 50 pose classes. Figure 14.15 also shows the performance of a 2D silhouette-based classifier, which will be explained in more detail in Section 14.7.
FIGURE 14.15 Correct classification rates using up to 100 Haarlets for classification.
In this 2D case we show the classification performance of a classifier whose Haarlets are trained with ANMM and of one whose Haarlets are trained with AdaBoost [5]. The ANMM approach performs better, while the AdaBoost approach suffers from overfitting. Due to its memory constraints, it is not possible to apply AdaBoost to 3D Haarlets [6]. As shown in Figure 14.16, the Haarlet-approximated approach is many times faster than pure ANMM. The computation time of the pure ANMM transformation increases almost linearly with the number of pose classes, because increasing the number of pose classes increases the number of ANMM feature vectors. Using the ANMM approximation, the integral volume of the hull has to be computed only once, after which computing the Haarlet coefficients requires virtually no computation time relative to the time of computing the integral volume. Considering the processing time required for segmentation (5 ms, in parallel) and reconstruction (15 ms), the total processing time is less than 25 ms per frame. (The classification was performed on a standard 3-GHz computer.) Note that if we decrease the number of cameras used in the system, the correct classification rate decreases linearly down to three cameras, where it is 91.93 percent (compared to 97.52 percent using six cameras). With fewer than three cameras it is impossible to reconstruct a reasonable 3D hull, and therefore classification is also impossible. The computation time for the reconstruction also decreases linearly, to about 8 ms for reconstructing a hull from three cameras (compared to 15 ms using six cameras).
FIGURE 14.16 Classification times in milliseconds for the pure ANMM classifier and for the classifier using 100 3D Haarlets to approximate the ANMM transformation. The ANMM-approximated version only has to compute the integral volume once (3.5 ms), and the computation time for the 100 Haarlets is negligible. In the pure ANMM case, however, the number of feature vectors increases with the number of pose classes and requires 1.115 ms of computation time per vector.
14.7 ROTATION INVARIANCE
The pose classification problem becomes much more difficult when the subject can freely change not only position but also orientation. A change of position can easily be normalized, but when classifying 2D silhouettes it is impossible to normalize for the rotation of the subject. In a 3D hull approach, however, it is possible to normalize the rotation of the 3D hulls before classifying them. Normalizing hull rotation consists of measuring the angle of the hull's orientation and then rotating the hull to a standard orientation. The goal is that, regardless of the orientation of the subject, the resulting normalized hull looks the same, as shown in Figure 14.17.
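Assuming the orientation angle is available from the overhead tracker described next, the normalization itself can be as simple as the following sketch, which rotates the voxel grid about the vertical axis. The function name, the SciPy-based rotation, and the choice of which axes span the horizontal plane are our assumptions.

import numpy as np
from scipy.ndimage import rotate

def normalize_hull_orientation(hull, angle_deg):
    """Rotate a binary voxel hull in the horizontal plane to a canonical
    orientation, given the orientation angle from the overhead tracker."""
    # axes=(0, 1) assumes the first two voxel axes span the ground plane;
    # order=0 (nearest neighbor) keeps the hull binary.
    rotated = rotate(hull.astype(float), -angle_deg, axes=(0, 1),
                     reshape=False, order=0)
    return rotated > 0.5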
14.7.1 Overhead Tracker
An overhead tracker is used to determine the subject's angle of orientation. Our visual tracker, based on a color-based particle filter [38], uses a set of particles to model the posterior distribution of the likely state of the subject. During each iteration, the tracker generates a set of new hypotheses for the state by propagating the particles using a dynamic model. This generates a prior distribution of the state, which is then tested against the observation of the image. A human is modeled by a circle and an ellipse representing the head and shoulders. The color distributions of these two regions are compared to a stored model histogram to yield the likelihood of the state of each particle. Particle filtering is a multiple-hypothesis approach: several hypotheses exist at the same time and are kept during tracking. Each hypothesis or sample s represents one hypothetical state of the object, with a corresponding discrete sampling probability π.
FIGURE 14.17 Examples of different user orientations resulting in similar hulls.
FIGURE 14.18 Each sample is modeled by an ellipse and a circle.
Each sample (particle) consists of an ellipse with position, orientation, and scale, and a circle with a position relative to the center of the ellipse, as shown in Figure 14.18. The ellipse describes the boundary of the object being tracked—in this case the shoulder region—while the circle represents the head. Each sample is given as

s = {x, y, Hx, Hy, α, cx, cy}    (14.15)
where x and y represent the position of the ellipse; Hx and Hy, the size of the ellipse along the x and y axes; α, the orientation of the ellipse; and cx and cy, the position of the head circle relative to the ellipse center. In this tracker the ratio between Hx and Hy is constant. To test the probability of a sample being a good hypothesis, a color histogram p is computed over the pixels inside the ellipse, and a second histogram is computed over the pixels inside the head circle. Each pixel has three color channels (red, green, and blue) and each channel is divided into eight bins, giving a total of 512 bins.
A pixel is assigned to a bin as follows:

u = n³ · r + n² · g + n · b    (14.16)

where n is the number of bins for each channel, and r, g, b are color values between 0 and 255. With this formula, each pixel is assigned to a bin u, and the histogram is incremented:

p(u) = p(u) + w    (14.17)

where w is the weight of the pixel. To increase the reliability of the color distribution when boundary pixels belong to the background or are occluded, smaller weights are assigned to pixels that are further away from the region center:

w = 1 − r²    (14.18)

where r is the distance between the pixel and the center of the ellipse. The resulting histogram is compared to a stored histogram or target model q using the Bhattacharyya coefficient,

ρ[p, q] = Σ_{u=1}^{m} √( p(u) q(u) )    (14.19)
The larger ρ is, the more similar the histograms are. We define the distance between two histograms as

d = √( 1 − ρ[p, q] )    (14.20)

which is called the Bhattacharyya distance [39]. This similarity measure provides the likelihood of each sample and is used to update the sample set. To speed up the tracker, the number of pixels that must be evaluated to build the histogram is reduced. First, a random sampling is made of the pixels that lie inside the shoulder and head regions. This random sampling is fixed for the tracker's entire run. When calculating the color histogram, only the sampled pixels are evaluated. This not only benefits the speed but also makes the number of evaluated pixels independent of the size of the ellipse; thus, computation time is constant. The evolution of the sample set is described by propagating each sample according to a dynamic model:

st = A · st−1 + wt−1    (14.21)

where A defines the deterministic component of the model and wt−1 is a multivariate Gaussian random variable. Each element of the set is then weighted in terms of the observations (the color histogram), and N samples are drawn with replacement, choosing a particular sample with probability π(n). The tracker state at any given time is computed as the mean state over all current samples at that time, weighted by their Bhattacharyya distance to the target model. This combination of multiple hypotheses, particle filtering, and random sampling results in a fast, robust overhead tracker, as shown in Figure 14.19.
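The color model and similarity measure of equations 14.16 through 14.20 can be sketched as follows. The per-channel quantization used to form the bin index is our reading of equation 14.16, the function names are illustrative, and the particle propagation, weighting, and resampling steps are not shown.

import numpy as np

def weighted_color_histogram(pixels, center, radius, n_bins=8):
    """Color histogram of one region (equations 14.16-14.18).
    pixels: (k, 5) array of sampled rows (x, y, r, g, b) inside the region."""
    hist = np.zeros(n_bins ** 3)
    for x, y, r, g, b in pixels:
        # Quantize each channel into n_bins and combine into one bin index.
        u = (int(r * n_bins / 256) * n_bins ** 2
             + int(g * n_bins / 256) * n_bins
             + int(b * n_bins / 256))
        dist = np.hypot(x - center[0], y - center[1]) / radius  # normalized distance
        hist[u] += max(0.0, 1.0 - dist ** 2)                    # weight w = 1 - r^2
    total = hist.sum()
    return hist / total if total > 0 else hist

def bhattacharyya_distance(p, q):
    """Similarity of equations 14.19-14.20: d = sqrt(1 - sum(sqrt(p * q)))."""
    rho = np.sum(np.sqrt(p * q))
    return np.sqrt(max(0.0, 1.0 - rho))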
FIGURE 14.19 Example of the overhead tracker.
Note that a tracker requires initialization; if initialization is not desirable for the application, it is possible to use an alternative such as an orientation sensor, or to determine the greatest horizontal direction of variation in the hull. The latter works well, but it limits the number of usable poses, as it cannot distinguish a subject's front from back. Therefore, all poses would need to be symmetrical, which is not ideal. This can be avoided by using other cues to determine which side of the subject is in front, such as face detection from the cameras placed sideways. Another option is a direct approach, such as proposed by Cohen and Li [18], Weinland et al. [19], and Gond et al. [20], where a rotation-invariant 3D shape descriptor is used rather than hull normalization. For two reasons, we chose to first normalize the hulls and then classify them:
1. We believe higher performance and lower computation times are possible this way, as both the method of Ren et al. [17] and our method achieve very high classification rates in real time by first determining orientation and then classifying.
2. Disconnecting normalization from classification allows the classification algorithm to be used for other classification problems as well, such as hand gestures, or for classification in a space where the third dimension is time (similar to [6]). In this case a different normalization step is required, but the classification algorithm remains mostly the same.
14.7.2 Experiments
To test our 3D hull-based rotation-invariant classifier we compared it to a 2D silhouette-based approach. For this experiment we used the same setup, training, and test data as described in Section 14.6.4. A 2D silhouette-based classifier cannot classify the pose of a person with changing orientation, so it is impossible to compare the two approaches directly. It is, however, possible to train a 2D silhouette-based classifier for several possible user orientations.
FIGURE 14.20 Correct classification rates comparing classification based on 2D silhouettes and 3D hulls using ANMM approximation and Haarlets.
The training samples are divided into 36 individual bins depending on the angle of orientation, and for each bin a separate 2D classifier is trained. In the classification stage, depending on the measured angle of rotation, the appropriate 2D classifier is used. For this experiment we trained the 2D classifiers on 2D Haarlets using ANMM. This allowed us to quantify how much a 3D hull approach improves performance. Figure 14.20 shows the performance for classification with different numbers of pose classes up to 50. The pose classes were randomly selected and the results averaged over five random samplings. With all 50 pose classes, the 3D system is 97.52 percent correct; the 2D system is 91.34 percent correct.
14.8 RESULTS AND CONCLUSIONS
Using the algorithms described in this chapter, we built a real-time pose detection system using six cameras, hardware-triggered to ensure that the recorded images were synchronized. The six cameras were connected to six computers, each running foreground–background segmentation. As segmentation is one of the computationally more expensive steps in the system, distributing this load benefits system speed significantly. Additionally, the smaller binary silhouettes are easier to send over the network than full-color images. The silhouettes are sent to a host computer, which runs the 3D hull reconstruction and the pose classification. The speed of the reconstruction step is significantly improved by voxel carving with a fixed lookup table, and 3D Haarlets make the pose classification step about four times faster. The overhead tracker is run on a separate computer in parallel and sends the orientation estimate over the network.
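The host-side processing can be summarized by the following sketch, which simply composes the components sketched earlier in this chapter; all argument names are hypothetical glue, not the authors' implementation.

def pose_pipeline_step(silhouettes, reconstructor, to_grid, normalize, classify_fn):
    """One host-side iteration: silhouettes arrive from the segmentation
    machines, the hull is updated incrementally, normalized for orientation,
    and classified. All arguments are objects/callables from the earlier
    sketches or user-supplied glue."""
    for cam_id, sil in enumerate(silhouettes):      # binary silhouettes per camera
        reconstructor.update(cam_id, sil)           # incremental hull update
    hull = to_grid(reconstructor.active_voxels())   # e.g., a 24x24x24 voxel grid
    hull = normalize(hull)                          # rotation normalization
    return classify_fn(hull)                        # nearest-neighbor pose label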
This system is capable of detecting 50 poses with 97.52 percent accuracy in real time. Example reconstruction and classification results are given in Figure 14.21, which shows the input images for one of the six cameras, as well as the 3D hull reconstruction from a top and a side view and the detected pose.
FIGURE 14.21 Example reconstruction and classification results.
The system described in this chapter introduces a number of technical contributions. We introduced a new and powerful approach to training Haarlets based on ANMM and extended it to 3D, which makes it possible to train 3D Haarlets. The 3D approach has new, interesting properties such as increased robustness and rotation invariance. Furthermore, in the 3D approach the trained classifier becomes independent of the camera setup. The result is a pose classification system with the same or better performance compared to the state of the art, but at much faster, interactive speeds. The methods described in this chapter can be ported to other classification problems, such as hand gesture recognition, object detection and recognition, face detection and recognition, and even event detection, where the third dimension of the 3D Haarlets is time. The algorithms described in this chapter for training 3D Haarlets can be exported to any system where 2D or 3D Haarlets require training. There are some limitations to our system. For example, as the system relies on foreground–background segmentation, the background must be static. A busy but static background is not a problem for the system, which can deal with noisy segmentations; however, the foreground–background segmentation fails on a moving background. No experiments have been done with multiple subjects in the scene. This should not be a problem as long as the hulls are not touching, in which case it becomes difficult to determine which voxels belong to which subject. Another limitation is the orientation tracker, which requires initialization. Although the tracker is fast and accurate, in future work it will be important to look for an alternative orientation estimation that does not require initialization and is independent of previous frames. Furthermore, the sparse camera placement limits 3D hull reconstruction quality, and therefore some poses are impossible to detect. At this time the pose classes are limited to visible arm directions. In the future it will be interesting to look at sequences of poses and have the algorithm detect moving gestures based on a sequence of ANMM coefficients. In such a system the impact of a missed subtle pose will be less apparent, as the sequence as a whole is classified.
Acknowledgments. The work for this chapter was carried out in the context of the Sixth Framework Programme of the European Commission: EU Project FP6–511092 CyberWalk, and Swiss NCCR project IM2.
REFERENCES
[1] CyberWalk project, http://www.cyberwalk-project.org.
[2] F. Wang, C. Zhang, Feature extraction by maximizing the average neighborhood margin, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[3] M.V. den Bergh, E. Koller-Meier, L.V. Gool, Fast body posture estimation using volumetric features, in: IEEE Visual Motion Computing, 2008.
[4] R. Kehl, M. Bray, L.V. Gool, Full body tracking from multiple views using stochastic sampling, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[5] P. Viola, M.J. Jones, Robust real-time object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[6] Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using volumetric features, in: IEEE International Conference on Computer Vision, 2005.
[7] C. Bregler, J. Malik, Tracking people with twists and exponential maps, in: IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[8] Q. Delamarre, O. Faugeras, 3D articulated models and multi-view tracking with silhouettes, in: IEEE International Conference on Computer Vision, 1999.
[9] D.M. Gavrila, L. Davis, 3D model-based tracking of humans in action: a multi-view approach, in: IEEE Conference on Computer Vision and Pattern Recognition, 1996.
[10] I. Kakadiaris, D. Metaxas, Model-based estimation of 3D human motion, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1453–1459.
[11] R. Plänkers, P. Fua, Articulated soft objects for video-based body modeling, in: IEEE International Conference on Computer Vision, 2001.
[12] K. Cheung, T. Kanade, J. Bouguet, M. Holler, A real time system for robust 3D voxel reconstruction of human motions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[13] K. Cheung, S. Baker, T. Kanade, Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.
[14] I. Mikic, M. Trivedi, E. Hunter, P. Cosman, Articulated body posture estimation from multicamera voxel data, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2001.
[15] R. Rosales, S. Sclaroff, Specialized mappings and the estimation of body pose from a single image, in: IEEE Human Motion Workshop, 2000.
[16] G. Shakhnarovich, P. Viola, T. Darrell, Estimating articulated human motion with parameter-sensitive hashing, in: IEEE International Conference on Computer Vision, 2003.
[17] L. Ren, G. Shakhnarovich, J.K. Hodgins, H. Pfister, P. Viola, Learning silhouette features for control of human motion, ACM Transactions on Graphics 24 (4) (2005) 1303–1331.
[18] I. Cohen, H. Li, Inference of human postures by classification of 3D human body shape, in: IEEE Workshop on Analysis and Modeling of Faces and Gestures, 2003.
[19] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding 104 (2006) 249–257.
[20] L. Gond, P. Sayd, T. Chateau, M. Dhome, A 3D shape descriptor for human pose recovery, in: V Conference on Articulated Motion and Deformable Objects, 2008.
[21] R. Mester, T. Aach, L. Dümbgen, Illumination-invariant change detection using a statistical colinearity criterion, in: Pattern Recognition: Proceedings of the Twenty-Third DAGM Symposium, 2001.
[22] A. Griesser, S.D. Roeck, A. Neubeck, L.V. Gool, GPU-based foreground-background segmentation using an extended colinearity criterion, in: Proceedings of the Vision, Modeling, and Visualization Conference, 2005.
[23] A. Laurentini, The visual hull concept for silhouette-based image understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (2) (1994) 150–162.
[24] A. Laurentini, How many 2D silhouettes does it take to reconstruct a 3D object?, Computer Vision and Image Understanding 67 (1) (1997) 81–87.
[25] K. Shanmukh, A. Pujari, Volume intersection with optimal set of directions, Pattern Recognition Letters 12 (3) (1991) 165–170.
[26] J. Luck, D. Small, C. Little, Real-time tracking of articulated human models using a 3D shape-from-silhouette method, in: Proceedings of the International Workshop on Robot Vision, 2001.
[27] C. Theobalt, M. Magnor, P. Schüler, H. Seidel, Combining 2D feature tracking and volume reconstruction for online video-based human motion capture, in: Proceedings of the Tenth Pacific Conference on Computer Graphics and Applications, 2002.
[28] I. Mikic, M. Trivedi, E. Hunter, P. Cosman, Human body model acquisition and tracking using voxel data, International Journal of Computer Vision 53 (3) (2003) 199–223.
[29] J.-M. Hasenfratz, M. Lapierre, J.-D. Gascuel, E. Boyer, Real-time capture, reconstruction and insertion into virtual world of human actors, in: Proceedings of Vision, Video and Graphics, 2003.
[30] F. Caillette, T. Howard, Real-time markerless human body tracking using colored voxels and 3-D blobs, in: Proceedings of the International Symposium on Mixed and Augmented Reality, 2004.
[31] W. Matusik, C. Buehler, L. McMillan, Polyhedral visual hulls for real-time rendering, in: Proceedings of the Twelfth Eurographics Workshop on Rendering Techniques, 2001.
[32] J. Franco, E. Boyer, Exact polyhedral visual hulls, in: Proceedings of the British Machine Vision Conference, 2003.
[33] K. Kutulakos, S. Seitz, A theory of shape by space carving, Technical Report TR692, Computer Science Department, University of Rochester, Rochester, New York, 1998.
[34] S. Seitz, C. Dyer, Photorealistic scene reconstruction by voxel coloring, International Journal of Computer Vision 25 (3) (1999) 1067–1073.
[35] R. Szeliski, Rapid octree construction from image sequences, Computer Vision, Graphics and Image Processing 58 (1) (1993) 23–32.
[36] K. Fukunaga, Introduction to Statistical Pattern Recognition, Second Edition, Academic Press, 1990.
[37] C. Papageorgiou, M. Oren, T. Poggio, A general framework for object detection, in: International Conference on Computer Vision, 1998.
[38] K. Nummiaro, E. Koller-Meier, L.V. Gool, An adaptive color-based particle filter, Image and Vision Computing 21 (1) (2003) 99–110.
[39] F. Aherne, N. Thacker, P. Rockett, The Bhattacharyya metric as an absolute similarity measure for frequency coded data, Kybernetika (1997) 1–7.
CHAPTER 15
Multi-Person Bayesian Tracking with Multiple Cameras
Jian Yao, Jean-Marc Odobez
Idiap Research Institute, Martigny, Switzerland
Abstract
Object tracking is an important task in the field of computer vision, driven by the need to detect interesting moving objects in order to analyze and recognize their behaviors and activities. However, tracking multiple objects is a complex task because of a large number of issues, ranging from different sensing setups to the complexity of object appearance and behaviors. In this chapter, we analyze some of the important issues in solving multiple-object tracking, reviewing briefly how they are addressed in the literature. We then present a state-of-the-art algorithm for the tracking of a variable number of persons in 3D in a multi-camera setting with partial field-of-view overlap. The algorithm illustrates how, in a Bayesian framework, these issues can be formulated and handled. More specifically, the tracking problem relies on a joint multi-object state space formulation, with individual object states defined in the 3D world. It involves several key features for efficient and reliable tracking, such as the definition of appropriate multi-object dynamics and a global multi-camera observation model based on color and foreground measurements; the use of the reversible-jump Markov chain Monte Carlo (RJ-MCMC) framework for efficient optimization; and the exploitation of powerful human detector outputs in the MCMC proposal distributions to automatically initialize and update object tracks. Experimental results on challenging real-world tracking sequences and situations demonstrate the efficiency of this approach.
Keywords: tracking, multi-camera, 3D model, multiple objects, surveillance, color histograms, Bayesian, MCMC, reversible-jump MCMC, uncertainty, human detector
15.1 INTRODUCTION
Multiple-object tracking (MOT) in video is one of the fundamental research topics in dynamic scene analysis, as tracking is usually the first step before applying higher-level scene analysis algorithms, such as automated surveillance, video indexing, human–computer interaction, traffic monitoring, and vehicle navigation. While fairly good
solutions to the tracking of isolated objects, or of small numbers of objects with transient occlusions, have been proposed, MOT remains challenging at higher densities of subjects, mainly due to inter-person occlusion, bad observation viewpoints, small-resolution images, persons entering and leaving, and so forth. These situations are often encountered in the visual surveillance domain. In the following sections, we discuss some of the key factors that affect a tracking algorithm and then introduce our algorithm.
15.1.1 Key Factors and Related Work
As just stated, several issues make tracking difficult: background clutter; small object size; complex object shape, appearance, and motion and their changes over time or across camera views; inaccurate or rough scene calibration, or inconsistent camera calibration between views for 3D tracking; large field-of-view (FOV) cameras with small or no overlap; and real-time processing requirements. In what follows, we discuss some of the key components in the design of a tracking algorithm and relate them to these issues.
Setups and Scenarios
In the past decade, an abundance of approaches and techniques for multiple-object tracking have been developed. They can be distinguished according to the physical environment considered, the setup (where and how many sensors are used), and the scenario (under which conditions—for instance, at which crowding level—the tracking is expected to perform). A first set of environments for tracking includes the so-called smart spaces [1]. These are indoor environments—homes, offices, classrooms—equipped with multiple cameras, audio sensing systems, and networked pervasive devices that can perceive ongoing human activities and respond to them. Such settings usually involve the tracking of only a few people. They are usually equipped with multiple cameras providing good image quality, and subject sizes in the images are relatively large and similar across camera views. In this context, robust and accurate tracking results have been demonstrated [2, 3], and current goals are to recover the pose of objects in addition to their localization, exploit other modalities such as audio [4], and characterize human activities. Another set of environments includes open spaces, as encountered in surveillance—for instance, airport or metro indoor spaces, or outdoor areas such as parking lots or school campuses [5–7]. In contrast to the previous case, the monitored space is much larger and usually covered with only a few cameras. In general, robust and accurate tracking across large-FOV cameras is difficult, as objects can be very small or exhibit large variations of image projection size within and across views due to depth effects. Object appearance can be unclear and similar from one object to another because of the small scale. However, when the crowd level is not too high (e.g., when monitoring outdoor corporate parking lots), good tracking can still be achieved. In both smart and open spaces, the viewpoint is an important variable affecting tracking difficulty. When seen from above, individuals in a group can still be distinguished. When seen from floor level or from a low viewpoint, they occlude each other. A tracking algorithm has to explicitly account for this situation in order to take measurements only on unoccluded parts of the persons and to predict the motion of occluded persons.
Object State Representation
The tracking problem depends on what kind of object state representation one wishes to recover. In its simplest form, object tracking can be defined as the problem of estimating the location of an object in the 2D image plane or in 3D space as it moves around a scene. That is, a tracker should assign consistent labels to the tracked objects in one or multiple video streams. Additionally, depending on the scenario and setup, a tracker can also provide object-centric information, such as size, orientation, or pose. The selection of an adequate state space is a compromise between two goals. On one hand, the state space should be precise enough to model the information in the image as well as possible and to provide the richest information to further higher-level analysis. On the other hand, it has to remain simple enough, and adequate to the quality of the data, in order to obtain reliable estimates and keep computation time low. One approach is to define objects in the 2D image plane—for example, using their position, speed, and scale [8]—and possibly to represent them with different object parts, such as head-shoulder, torso, or legs [9]. Whenever possible, defining the object in 3D space using a model-based approach is more appropriate and presents several advantages over a 2D approach. First, parameter settings will in most cases have a physical meaning (e.g., standard height or average walking speed [7]). Similarly, prior information about state values is easier to specify, as it is somewhat built in: for instance, given the 3D position, we automatically know what the size of a person in the image plane should be. Finally, occlusion reasoning when tracking multiple persons is simplified when using the 3D position. To represent humans, generalized 3D cylinders or ellipsoids are often used when enough resolution is available [7, 10]. Alternatively, a simplified 2D version not corresponding to an explicit 3D model can be used. For instance, Zhao et al. [5, 6] parameterize a 3D human through ellipses characterized by head position, height, and 2D inclination. Note that the 2D inclination in the image plane is important because subjects standing on the floor may not appear vertical in the image because of camera distortions.
Object-Tracking Problem Formulation
Several approaches can be used to formulate the tracking problem. In a simple approach, tracking can be done by detecting the position of objects at each frame and then matching these detections across time, using motion heuristics to constrain the correspondence [11]. For instance, when subjects are seen from a far distance with a static camera, background subtraction is usually applied. Blobs or connected components are extracted, possibly classified into different categories (person versus vehicle, person versus group), and matched in time. However, blobs do not always correspond to single objects: a single blob can split into several tracks and, vice versa, several object tracks can merge into a single blob. To handle this, reasoning about object counts and appearance can be used to identify single tracks through Bayesian networks or graph analysis [12, 13]. For instance, Bose et al. [13] proposed a fragmentation and grouping scheme to deal with these situations. However, such approaches cannot be applied when objects are closer to the camera, as the occlusions become too complex to handle. In recent years, the Bayesian state-space formulation has been shown to be very successful in addressing the multi-person tracking problem [5, 7, 8, 10, 14, 15]. Some authors used a
single-object state-space model [16, 17], where the modes of the state are identified as individual objects. Berclaz et al. [18] used a greedy approach, extracting the different tracks one by one from instantaneous object detection features using a hidden Markov model (HMM) and removing the detection features associated with the already extracted tracks. However, only a rigorous formulation of the MOT problem using a multi-object state space allows formalizing, in a principled way, the different components that one may wish for in a tracker: uniquely identifying targets, modeling their interactions, and handling variability in the number of objects through track birth and death mechanisms. As a pioneer, the BraMBLe system [10] was able to track up to three persons from a single camera, using a blob likelihood based on a known background model and appearance models of the tracked persons. While the probabilistic tracking framework is appealing, it does not solve all problems by itself. First, as highlighted in Isard and MacCormick [10], one needs a global observation model with the same number of observations regardless of the multi-object configuration, in order to obtain likelihoods of the same order of magnitude for configurations with different numbers of objects. This makes the use of object-oriented individual likelihood terms somewhat problematic, say if one defines the likelihood as the product of individual object likelihoods. Also, because of the curse of dimensionality in the multi-object state space, solving the inference problem is not straightforward. The use of a plain particle filter [10] quickly fails when more than three or four people need to be tracked. More recently, however, MCMC stochastic optimization with reversible jumps [19] has been shown to be more effective at handling high-dimensional state spaces. The algorithm presented in this chapter belongs to this category of approaches.
Dynamics and Interaction Models
In tracking, it is often important to specify some prior knowledge about the temporal evolution of the object representation to discard wrong matches or, in occlusion cases, to predict the object location until the occluded object reappears. Single-object dynamics usually assumes some continuity model (in position, speed, or acceleration) over the state variables, whose parameters can be learned from training data [20]. Linear models were commonly employed to comply with the Kalman filtering framework [21], but this is no longer a restriction when using particle filters [10]. To obtain more precise and meaningful dynamics, auxiliary variables characterizing the dynamics can be used in switching models. For instance, in human tracking, a discrete variable indicating whether a person is walking or static can be added to the state, allowing consideration of different dynamics according to the subject's activity. In the multi-object case, the state dynamics must be defined for a varying number of objects. Indeed, specifying this allows proper handling of the birth and death of objects, as described later in the chapter. In addition, the modeling allows us to introduce object interaction models [8, 14, 22] by defining priors over the joint state space. Such priors are usually based on object proximity, which prevents two objects from occupying the same state-space region or from explaining the same piece of data twice. Technically, this can be achieved by defining a pairwise Markov random field (MRF) whose graph nodes are defined at each time step by close-by objects. Qualitatively, such models are useful in crowded situations and for handling occlusion cases. More complex group dynamics can be defined. In Antonini et al. [23], a model relying on discrete choice theory was used to handle object interactions by modeling and
learning the behavior of a pedestrian given his assumed destination and the presence of other pedestrians in a nested grid in front of him. Because of the complexity of the model, however, the tracking task was solved independently for each pedestrian at each time step.
Detection and Tracking
Any tracking approach requires an object detection mechanism, either in every frame or when the object first appears in the scene, to create a track. A common approach is to use temporal information such as foreground detection, frame differencing, or optical flow to highlight changing regions in consecutive frames, and to start tracks where such information is not yet accounted for by already existing objects [5, 8, 10]. Indeed, when object detection can be done reliably at each frame, it is possible to perform tracking by detection—that is, to rely only on the localization output to link detections over time. This is the case in the blob-based approaches cited earlier [12, 13] and when multiple cameras are used [3]. When possible, this is a powerful approach that allows integration of long-term trajectory information in a lightweight manner, since only state features are involved and not images. For instance, Wu and Nevatia [9] used a learned detector to find human body parts and combine them, and then initialized trajectory tracking from the detections. However, in many cases obtaining detections at a majority of time steps for each object is difficult. Still, powerful detectors, trained on labeled data with boosting or support vector machines, can be efficiently exploited for track initialization and for better localization of objects during inference [5], as shown later in this chapter. However, they bring some difficulties to real-time tracking applications, as detectors can be time consuming.
Observations and Multi-Camera Tracking
How we measure the evidence of the observed data with respect to a given multi-object state is one of the most important points for a tracker. In multi-object tracking, color information is probably the most commonly used cue [5, 7, 8, 24, 25]. It is often represented using probability distribution functions, modeled parametrically or with histograms, which have the advantage of being relatively invariant under pose and view changes. In addition, to introduce some geometric information and to be more robust to visual clutter, color information is usually computed for different body parts. While color is helpful for maintaining the identity of tracked persons, a key issue is to create and adapt over time the color model of the object to be tracked, as the use of a predefined color model is usually infeasible (except in some specific cases such as sports games). This requires identifying which pixels in the image belong to a person in order to initialize a dedicated color histogram [26], which often relies on foreground detection [5, 7, 8, 24, 25, 27]. Another commonly used observation for localization is foreground detection (probabilities or binary masks), which is useful in assessing the presence of objects in the image [3, 5, 7, 8]. For tracking objects with complex shapes, or to measure pose information, color is usually not sufficient and contour cues often need to be extracted. For instance, Haritaoglu et al. [28] used silhouettes for object tracking in surveillance applications. Alternatively, one can represent people using sets of local templates or patches and geometric information, as proposed by Leibe et al. [29]. Other modalities, such as audio from microphone arrays, can be used for localization, especially in smart rooms [4].
Recent interest has concentrated on tracking objects using multiple cameras to extend the limited viewing angle of a single fixed camera. There are two main reasons to use multiple cameras [3, 24–26, 30]. The first is the use of depth information for tracking and occlusion resolution. The second is to increase the space under view for global tracking, since it is not possible for a single camera to observe large spaces. Kim and Davis [24] proposed a multi-view multi-hypothesis approach, defined in a particle filtering framework, to segment and track multiple persons on a ground plane. Berclaz et al. [3] proposed an algorithm that can reliably track multiple persons in a complex environment and provide metrically accurate position estimates by relying on a probabilistic occupancy map. Du and Piater [25] presented a novel approach for tracking a single target on a ground plane that fuses multi-camera information using sequential belief propagation. This method performs very well and can handle imprecise foot positions and calibration uncertainties—a key issue in multi-camera systems, where it is not always possible to perform a precise Euclidean calibration of all cameras. Most approaches use centralized systems in which information from multiple cameras is jointly fused for tracking. Such tracking systems need an efficient method running on a powerful computer for real-time applications. Other approaches use distributed systems in which tracking is conducted independently at each camera, and the results from the different cameras are then fused and combined at a higher level. For instance, Qu et al. [15] present a probabilistic decentralized approach that makes more efficient use of a group of computers.
15.1.2 Approach and Chapter Organization
In this chapter, we present our approach to the automatic detection and tracking of a variable number of subjects in a multi-camera environment with partial FOV overlap. We believe it includes most of the state-of-the-art components that an MOT tracker can have in a Bayesian tracking framework. More precisely, we adopt a multi-object state-space Bayesian formulation, solved through RJ-MCMC sampling for efficiency [8, 14]. The proposal (i.e., the function sampling new state configurations to be tested) takes advantage of a powerful machine learning human detector, allowing efficient update of existing tracks and initialization of new ones. We adopt a 3D approach where object states are defined in a common 3D space, which allows representing subjects with a body model and facilitates occlusion reasoning. Multi-camera fusion is solved by using global likelihood models over foreground and color observations. Our algorithm combines and integrates efficient algorithmic components that have been shown (often separately) to be essential for accurate and efficient tracking, and presents additional techniques to solve the following specific issues. To efficiently handle the interaction between multiple objects and avoid multiple objects occupying the same state-space region, we propose to refine the priors over the joint state space by exploiting the body orientation in the definition of proximity and by using the prediction of future object states to model the fact that moving people tend to avoid colliding with each other. Multi-camera tracking in surveillance scenarios is usually quite different from tracking in indoor rooms. Larger-FOV cameras are used to cover more physical space, the overlaps between the FOVs are smaller, and people appear with dramatically different image
resolutions because of the camera placements and points of view. As a consequence, a small and seemingly insignificant 2D position change (e.g., one pixel) in one view can correspond to a large position change in another view, as illustrated in Figure 15.2. This is particularly problematic at transitions between camera FOVs, when a person enters a new view with a much higher resolution than the current one. Because of this uncertainty, the projection of the current estimate does not match the person in the new view. As a result, the tracker will assume that the person remains visible only in the first view and will initialize a new track in the new view. To solve this issue, the proposed algorithm integrates into the 3D object state prior a component that models the effects of the image estimation uncertainties according to the views in which the object is visible, and uses a proposal function that takes into account the output of a human detector to draw samples at well-localized places in the new view. One final contribution of this chapter is an image rectification step that reduces the geometric appearance variability of subjects (especially slant) in images caused by the use of large-FOV cameras. This section introduced the object-tracking problem and described the key issues and applications of object tracking. It then presented a brief overview of related work and summarized our approach. The rest of this chapter is organized as follows. Section 15.2 describes the multi-camera multi-person Bayesian tracking framework with the state space and model representation. The main features of our proposed tracking framework are then described in Sections 15.3 through 15.5. Sections 15.3 and 15.4 introduce the dynamic and observation models for multi-object tracking, respectively. Section 15.5 describes the reversible-jump Markov chain Monte Carlo sampling approach used for optimization. Section 15.6 details experiments on rectifying images to remove body slant and shows tracking results on real data. Section 15.7 concludes the chapter.
15.2 BAYESIAN TRACKING PROBLEM FORMULATION
The goal is to track a variable number of persons from multiple overlapping camera views. To achieve this objective, we use a Bayesian approach. In the Bayesian tracking framework, the goal is to estimate the conditional probability p(X̃_t | Z_1:t) of the joint multi-person configuration X̃_t at time t given the sequence of observations Z_1:t = (Z_1, ..., Z_t). This posterior probability p(X̃_t | Z_1:t), known as the filtering distribution, can be expressed recursively using the Bayes filter equation:

p(\tilde{X}_t | Z_{1:t}) = \frac{1}{C}\, p(Z_t | \tilde{X}_t) \times \int_{\tilde{X}_{t-1}} p(\tilde{X}_t | \tilde{X}_{t-1})\, p(\tilde{X}_{t-1} | Z_{1:t-1})\, d\tilde{X}_{t-1}    (15.1)
where the dynamic model p(X̃_t | X̃_{t-1}) governs the temporal evolution of the joint state X̃_t given the previous state X̃_{t-1}, and the observation likelihood model p(Z_t | X̃_t) measures how well the observation data Z_t fit the joint state X̃_t. C is a normalization constant. In non-Gaussian and nonlinear cases, the filter equation can be approximated using Monte Carlo methods, in which the posterior p(X̃_t | Z_1:t) is represented by a set of N samples {X̃_t^{(r)}}_{r=1}^{N}. For efficiency, in this work we use the MCMC method, where the samples have equal weights and form a so-called Markov chain. Using the samples from time t-1, we obtain the following approximation of the filtering distribution:

p(\tilde{X}_t | Z_{1:t}) \approx \frac{1}{C}\, p(Z_t | \tilde{X}_t) \times \sum_{r=1}^{N} p(\tilde{X}_t | \tilde{X}_{t-1}^{(r)})    (15.2)
Thus to define our filter, the main elements to be specified are our multi-object state space, the dynamics, the likelihood model, and an efficient sampling scheme that effectively places the particles at good locations during optimization. We first introduce our state model in the next subsections.
15.2.1 Single-Object 3D State and Model Representation
As stated in the introduction, the selection of the state space is a compromise between what can be reliably estimated from the observations and the richness of the information we want to extract. In the current situation, we use a state space defined in 3D, comprising the subject's location and speed on the ground plane, as well as height and orientation. Given these parameters, we model subjects using generalized cylinders, as illustrated in Figure 15.1. Given the resolution of the images, we decided to use one cylinder for the head, one for the torso, and one for the legs. To account for "flatness" (the width of a person is usually larger than her thickness), we use elliptic cylinders (i.e., the section of each cylinder is an ellipse). Using this 3D human body model, a person standing on the ground plane with different orientations produces different
FIGURE 15.1 (a) 3D human body model, consisting of three elliptic cylinders representing head, torso, and legs, respectively. (b) Projections of the body model in the rectified image for different state values. Notice the change of width due to variation in body orientation.
projected models, in which the main difference is the width of the projected body. The three body parts have fixed physical aspect ratios: height ratios of 2:7:6 for head, torso, and legs. Thus, in summary, the state space is represented by a 6-dimensional column vector:

X_{i,t} = \big(x_{i,t},\, y_{i,t},\, \dot{x}_{i,t},\, \dot{y}_{i,t},\, h_{i,t},\, \alpha_{i,t}\big)^{\top}    (15.3)

where u_{i,t} = (x_{i,t}, y_{i,t})^⊤ denotes the person's 2D ground-plane position in the 3D physical space. The variables u̇_{i,t} = (ẋ_{i,t}, ẏ_{i,t})^⊤, h_{i,t}, and α_{i,t} denote the velocity, the height of the object (in cm), and the orientation with respect to the X-direction on the ground plane, respectively. Figure 15.1 shows the body model along with its projection on one image for different state values. To reduce computation, however, we project each of the body parts into a 2D bounding box, which is used to compute the observation likelihood, as described in Section 15.4. To obtain such a bounding box for a given part, we first find the 3D coordinates of the four tangent points to the top and bottom elliptical sections of the body part. These points are then projected onto the 2D image plane, and the minimum bounding box containing these four points is used to represent the projection of the cylinder. Examples can be seen in Section 15.6.2. Most of the time, such a projection is a good approximation of the full projection. However, when bodies are close to the camera, intersections among the three projected bounding boxes can occur (see for instance Figure 15.9).
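As a rough illustration of how such bounding boxes can be obtained, the sketch below projects sampled points of each body part's top and bottom ellipses and takes their min/max image coordinates. It is not the chapter's implementation: the 3×4 projection matrix P, the helper names, and the ellipse semi-axis values are assumptions, and dense sampling of the ellipses stands in for the exact four-tangent-point construction.

```python
import numpy as np

HEAD, TORSO, LEGS = 2, 7, 6          # height ratios from the chapter (2:7:6)
TOTAL = HEAD + TORSO + LEGS

def part_bbox(P, x, y, alpha, z_lo, z_hi, a, b, n=32):
    """Project one elliptic cylinder slice [z_lo, z_hi] to a 2D bounding box.
    P      : assumed 3x4 camera projection matrix from calibration
    (x, y) : ground-plane position (cm), alpha: body orientation (rad)
    (a, b) : ellipse semi-axes (cm), illustrative values
    """
    t = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    # ellipse boundary in the body frame, rotated by the orientation alpha
    ex = a * np.cos(t) * np.cos(alpha) - b * np.sin(t) * np.sin(alpha) + x
    ey = a * np.cos(t) * np.sin(alpha) + b * np.sin(t) * np.cos(alpha) + y
    pts = []
    for z in (z_lo, z_hi):                       # top and bottom ellipses
        X = np.stack([ex, ey, np.full_like(ex, z), np.ones_like(ex)])
        uvw = P @ X                              # project to the image
        pts.append(uvw[:2] / uvw[2])
    uv = np.concatenate(pts, axis=1)
    return uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()  # x1, y1, x2, y2

def body_bboxes(P, x, y, h, alpha, axes=((10, 10), (25, 15), (20, 12))):
    """Return head/torso/legs boxes for state (x, y, h, alpha); axes are
    (head, torso, legs) semi-axes, illustrative rather than the chapter's."""
    cuts = np.cumsum([0, LEGS, TORSO, HEAD]) / TOTAL * h     # legs, torso, head
    boxes = []
    for (z_lo, z_hi), ab in zip(zip(cuts[:-1], cuts[1:]), axes[::-1]):
        boxes.append(part_bbox(P, x, y, alpha, z_lo, z_hi, *ab))
    return boxes[::-1]                           # head, torso, legs order
```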
15.2.2 The Multi-Object State Space
To track a variable number of people, we define the joint state space of multiple objects as

\tilde{X}_t = (X_t, k_t)    (15.4)

where X_t = {X_{i,t}}_{i=1...M}; M is the maximum number of objects appearing in the scene at any given time instant; and k_t = {k_{i,t}}_{i=1...M} is an M-dimensional binary vector. The Boolean value k_{i,t} signals whether object i is valid/exists in the scene at time t (k_{i,t} = 1) or not (k_{i,t} = 0). The identifier set of existing objects is thus represented as K_t = {i ∈ [1, M] | k_{i,t} = 1}, and K̄_t = {1, 2, 3, ..., M} \ K_t, where the symbol \ denotes set subtraction. In this way, the "full" state vector has the same dimension regardless of the number of objects actually present.
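A minimal sketch of how this joint state can be held in code is shown below, assuming a fixed maximum M and the 6-dimensional single-object state of equation 15.3; the class and field names are illustrative, not the chapter's.

```python
import numpy as np
from dataclasses import dataclass, field

M = 20          # maximum number of simultaneous objects (assumed value)
STATE_DIM = 6   # (x, y, x_dot, y_dot, height, orientation), equation 15.3

@dataclass
class JointState:
    """Joint state (X_t, k_t): M single-object states plus existence flags."""
    X: np.ndarray = field(default_factory=lambda: np.zeros((M, STATE_DIM)))
    k: np.ndarray = field(default_factory=lambda: np.zeros(M, dtype=bool))

    def existing(self):
        """Identifier set K_t of currently valid objects."""
        return np.flatnonzero(self.k)

    def add(self, i, state):
        self.X[i] = state
        self.k[i] = True

    def remove(self, i):
        self.k[i] = False

# Example: one person at (1.2 m, 3.4 m), standing still, 175 cm tall.
s = JointState()
s.add(0, np.array([120.0, 340.0, 0.0, 0.0, 175.0, 0.0]))   # units: cm, rad
print(s.existing())   # -> [0]
```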
15.3 DYNAMIC MODEL
The dynamic model governs the evolution of the state between time steps. It is responsible for predicting the motion of humans as well as for modeling the interactions between them.
15.3.1 Joint Dynamic Model
The joint dynamic model for a variable number of people is defined as follows:

p(\tilde{X}_t | \tilde{X}_{t-1}) = p(X_t, k_t | X_{t-1}, k_{t-1}) = p(X_t | X_{t-1}, k_t, k_{t-1})\, p(k_t | k_{t-1}, X_{t-1})    (15.5)

\propto p_0(X_t | k_t) \left[ \prod_{i=1}^{M} p(X_{i,t} | X_{t-1}, k_t, k_{t-1}) \right] p(k_t | k_{t-1}, X_{t-1})    (15.6)

with

p(X_{i,t} | X_{t-1}, k_t, k_{t-1}) = \begin{cases} p(X_{i,t} | X_{i,t-1}) & \text{if } i \in K_t \text{ and } i \in K_{t-1} \\ p_{\text{birth}}(X_{i,t}) & \text{if } i \in K_t \text{ and } i \notin K_{t-1} \text{ (birth)} \\ p_{\text{death}}(X_{i,t}) & \text{if } i \notin K_t \text{ and } i \in K_{t-1} \text{ (death)} \end{cases}
where we have assumed that targets that are born, die, or stay behave independently of each other. All the components are described next. In the previous equations, the term p_0(X_t | k_t) models the interaction prior between multiple objects given the current joint state, and the term p(X_{i,t} | X_{i,t-1}) denotes the single-person dynamics, as discussed below. The term p_birth(X_{i,t}) denotes a prior distribution over the state space for a newborn object i at time t, while p_death(X_{i,t}) denotes a probability over the state space for a dead object i at time t. Interestingly, these distributions are state dependent, which allows us to specify regions where the probability of creating or deleting objects is higher, typically near entrance and exit points [8]. In our current implementation, we use uniform probabilities for both terms. The last term p(k_t | k_{t-1}, X_{t-1}) in equation 15.6 allows us to define a prior over the number of objects that die and are born at a given time step, thus disfavoring, for instance, the deletion of an object and its replacement by a newly created one. It is defined as

p(k_t | k_{t-1}, X_{t-1}) = p(k_t | k_{t-1}) = p(K_t | K_{t-1}) \propto (p_a)^{|K_t \setminus K_{t-1}|}\, (p_d)^{|K_{t-1} \setminus K_t|}    (15.7)

where |A| denotes the size of the set A. Here we assume that the probabilities of new indices are independent of past state values, and p_a and p_d are prior constants that penalize changes to the set of valid indices. This term acts as a prior not only on the change in the number of objects but also on the index values: A low value avoids the deletion of an object and its replacement by a newly created object.
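The index prior of equation 15.7 is straightforward to evaluate; the sketch below assumes illustrative values for p_a and p_d and returns the unnormalized prior.

```python
def index_prior(K_t, K_prev, p_a=0.1, p_d=0.1):
    """Unnormalized prior of equation 15.7.
    K_t, K_prev: sets of valid object identifiers at times t and t-1."""
    births = len(K_t - K_prev)      # |K_t \ K_{t-1}|
    deaths = len(K_prev - K_t)      # |K_{t-1} \ K_t|
    return (p_a ** births) * (p_d ** deaths)

# Deleting object 2 and creating object 5 in one step is penalized twice:
print(index_prior({1, 5}, {1, 2}))   # p_a * p_d = 0.01
```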
Shape-Oriented and Person Avoidance Interactions Prior
Person interactions are modeled by the term p_0 in equation 15.6 and defined by a pairwise prior over the joint state space:

p_0(X_t | k_t) = \prod_{i,j \in K_t,\, i \neq j} \phi(X_{i,t}, X_{j,t}) \propto \exp\Big\{ -\lambda_g \sum_{i,j \in K_t,\, i \neq j} g(X_{i,t}, X_{j,t}) \Big\}    (15.8)

where g(X_{i,t}, X_{j,t}) is a penalty function. In Smith et al. [8] and Khan et al. [14], which used such a prior, the authors defined this penalty function based on the current 2D overlap between the object projections or on the Euclidean distance between the two object centers—for instance, g(X_{i,t}, X_{j,t}) = ψ(u_{i,t} - u_{j,t}), where ψ(x) denotes some function (e.g., ψ(x) = |x| or ψ(x) = |x|²). In our case, we propose two improvements. First, as humans are not "circular," we replaced the above Euclidean distance with Mahalanobis distances:

g_p(X_{i,t}, X_{j,t}) = d_{m,i}(u_{i,t} - u_{j,t}) + d_{m,j}(u_{i,t} - u_{j,t})    (15.9)
where d_{m,i} (resp. d_{m,j}) is the Mahalanobis distance defined by the ellipsoid shape of person i (resp. j). Qualitatively, this term favors the alignment of the body orientations of two close-by persons; conversely, it disfavors having two close-by persons with perpendicular orientations. People following each other is a typical situation where this term can be useful. Second, when people move, they usually anticipate the motion of others to avoid collision. We thus also introduce a prior on the state X^{pr}_{i,t+1} predicted from the current state value X_{i,t}, by defining the penalty function as g(X_{i,t}, X_{j,t}) = g_p(X_{i,t}, X_{j,t}) + g_p(X^{pr}_{i,t+1}, X^{pr}_{j,t+1}). This term thus prevents collisions not only when people are coming close to each other but also when they are moving together in the same direction.
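The sketch below illustrates the shape-oriented penalty of equation 15.9 and its predicted-state extension, assuming each person's ground-plane ellipse is described by semi-axes (a, b) and the body orientation; the semi-axis values, the prediction horizon, and the function names are illustrative assumptions rather than the chapter's settings. The resulting penalty is the quantity entering the prior of equation 15.8.

```python
import numpy as np

def ellipse_matrix(a, b, alpha):
    """Shape matrix of a person's ground-plane ellipse rotated by alpha."""
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    return R @ np.diag([a ** 2, b ** 2]) @ R.T

def d_mahalanobis(diff, shape):
    """Mahalanobis distance of a 2D offset under the given shape matrix."""
    return float(np.sqrt(diff @ np.linalg.solve(shape, diff)))

def g_p(u_i, alpha_i, u_j, alpha_j, a=25.0, b=15.0):
    """Shape-oriented penalty g_p(X_i, X_j) of equation 15.9 (positions in cm)."""
    diff = np.asarray(u_i, float) - np.asarray(u_j, float)
    return (d_mahalanobis(diff, ellipse_matrix(a, b, alpha_i)) +
            d_mahalanobis(diff, ellipse_matrix(a, b, alpha_j)))

def g_with_prediction(u_i, v_i, alpha_i, u_j, v_j, alpha_j, dt=0.04):
    """g(X_i, X_j): g_p on the current states plus g_p on states predicted
    dt seconds ahead, modeling anticipation of other people's motion."""
    u_i_pred = np.asarray(u_i, float) + dt * np.asarray(v_i, float)
    u_j_pred = np.asarray(u_j, float) + dt * np.asarray(v_j, float)
    return (g_p(u_i, alpha_i, u_j, alpha_j) +
            g_p(u_i_pred, alpha_i, u_j_pred, alpha_j))
```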
15.3.2 Single-Object Dynamic Model
The dynamic model of a single person is defined as

p(X_{i,t} | X_{i,t-1}) = p(u_{i,t}, \dot{u}_{i,t} | u_{i,t-1}, \dot{u}_{i,t-1})\, p(h_{i,t} | h_{i,t-1})\, p(\alpha_{i,t} | \alpha_{i,t-1}, \dot{u}_{i,t})    (15.10)

where we have assumed that the evolution of the state parameters is independent given the previous state values. In this equation, the height dynamics p(h_{i,t} | h_{i,t-1}) assumes a constant height model with a steady-state value, to avoid large deviations toward too high or too small values. The body orientation dynamics p(α_{i,t} | α_{i,t-1}, u̇_{i,t}) is composed of two terms that favor temporal smoothness and the alignment of the orientation with the walking direction, as described in our single-person tracking algorithm [7]. In addition to prior terms, not described here, that prevent invalid floor positions and reduce the likelihood of the state when the walking speed exceeds some predefined limit, the position/speed dynamics is defined by

\dot{u}_t = A \dot{u}_{t-1} + B w_{1,t} \quad \text{and} \quad u_t = u_{t-1} + \Delta t\, \dot{u}_t + C(u_{t-1})\, w_{2,t}    (15.11)

where w_{q,t} = (w^{(x)}_{q,t}, w^{(y)}_{q,t})^⊤ is a Gaussian white noise random variable (q = 1, 2), and Δt is the time step between two frames. First assume that C(u_{t-1}) = 0. In this case, equation 15.11 represents a typical auto-regressive model—a Langevin motion—with A = aI and B = bI (I denotes the 2×2 identity matrix), a = exp(-β Δt), and b = v̄ √(1 - a²), where β accounts for speed damping and v̄ is the steady-state root-mean-square speed.
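A minimal sketch of one step of this position/speed dynamics is given below, assuming illustrative values for the damping β, the steady-state speed v̄, and the frame period; C(u) is passed in as a callable so that the location-dependent noise of the next subsection can be plugged in.

```python
import numpy as np

def langevin_step(u, u_dot, C_of_u, dt=0.04, beta=0.5, v_bar=100.0,
                  rng=np.random.default_rng()):
    """One step of equation 15.11.
    u, u_dot : 2D ground position (cm) and velocity (cm/s)
    C_of_u   : callable returning the 2x2 location-dependent noise matrix C(u)
    """
    a = np.exp(-beta * dt)                 # Langevin discretization
    b = v_bar * np.sqrt(1.0 - a ** 2)
    w1 = rng.standard_normal(2)
    w2 = rng.standard_normal(2)
    u_dot_new = a * u_dot + b * w1                     # velocity update
    u_new = u + dt * u_dot_new + C_of_u(u) @ w2        # position update
    return u_new, u_dot_new

# Example: plain Langevin motion, i.e., C(u) = 0.
u, v = np.array([0.0, 0.0]), np.array([100.0, 0.0])
u, v = langevin_step(u, v, lambda u: np.zeros((2, 2)))
```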
2D-to-3D Localization Uncertainties
In multi-view environments with small overlapping regions between views and strong scene depth effects, with large variations in the image projection size of persons within and across views (see Figure 15.2), the Langevin motion is not enough to represent the state dynamics uncertainty. Figure 15.2 illustrates a typical problem at view transitions: A person appearing at a small scale in a given view enters a second view. Observations from the first view are insufficient to accurately localize the person on the 3D ground plane. Thus, when the person enters the second view, the image projections obtained from the state prediction of the MCMC samples often result in a mismatch with the person's actual localization in the second view. This mismatch might be too high to be
FIGURE 15.2 (a) Depth effects cause very similar positions in the first camera view to correspond to dramatically different image locations in the second view. (b) For the same scene, a floor map shows the localization uncertainties propagated from image localization uncertainties. Solid light gray lines: floor locations visible in both cameras; dotted black: locations visible in only one camera.
covered (in one time step) by the regular noise of the dynamic model. As a result, the algorithm may keep (for some time) the person track as if the person were visible only in the first view, and create a second track to account for the person's presence in the second view. To solve this issue, we added the noise term C(u_{t-1}) w_{2,t} to the location dynamics (see equation 15.11), whose covariance magnitude and shape depend on the person's location. The covariance of this noise, which models the 2D-to-3D localization uncertainties, is obtained as follows. The assumed 2D Gaussian noises on the image localization of a person's feet in the different views are propagated to the 3D floor position using an unscented transform [31], and potentially merged for person positions visible from several cameras, leading to the precomputed noise model illustrated in Figure 15.2(b). Qualitatively, this term guarantees that in the MCMC process, state samples drawn from the dynamics actually cover the known 3D uncertainty regions, and that samples drawn by exploiting the human detector will not be disregarded as being too unlikely according to the dynamics.
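The following sketch shows one way to carry out this propagation for a single camera, assuming the image-to-ground homography H of Section 15.6.1 as the nonlinear mapping and an isotropic pixel noise; the unscented-transform parameters are illustrative and the merging over several views is left out.

```python
import numpy as np

def image_to_ground(H, p):
    """Map image point p = (u, v) to ground-plane coordinates through H."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def unscented_ground_cov(H, foot_uv, pixel_cov, kappa=1.0):
    """Propagate a 2D Gaussian image noise to the ground plane (mean, cov)."""
    n = 2
    L = np.linalg.cholesky((n + kappa) * pixel_cov)     # sigma-point spread
    sigma = [np.asarray(foot_uv, float)]
    for i in range(n):
        sigma.append(foot_uv + L[:, i])
        sigma.append(foot_uv - L[:, i])
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))     # standard UT weights
    w[0] = kappa / (n + kappa)
    pts = np.array([image_to_ground(H, s) for s in sigma])
    mean = w @ pts
    diff = pts - mean
    cov = (w[:, None] * diff).T @ diff
    return mean, cov

# Example with an identity homography and isotropic 2-pixel noise:
mean, cov = unscented_ground_cov(np.eye(3), np.array([320.0, 240.0]),
                                 (2.0 ** 2) * np.eye(2))
```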
15.4 OBSERVATION MODEL
When modeling p(Z_t | X̃_t), which measures the likelihood of the observation Z_t for a given multi-object state configuration X̃_t, it is crucial to be able to compare likelihoods when the number of objects changes. Thus, we took great care to propose a formulation that provides likelihoods of similar orders of magnitude for different numbers of objects. For simplicity, we drop the subscript t in this section. Our observations are defined as Z = (I_v, D_v)_{v=1...N_v}, where I_v and D_v denote the color and the background subtraction observations for each of the N_v camera views. More precisely, D_v is a background distance map with values between 0 and 1, where 0 means a perfect match with the background. Assuming the conditional independence of the camera views, we have

p(Z | \tilde{X}) = \prod_{v=1}^{N_v} p(I_v | D_v, \tilde{X})\, p(D_v | \tilde{X})    (15.12)
These two terms are described below (where we dropped the subscript v for simplicity).
15.4.1 Foreground Likelihood
The robust background subtraction technique described in Yao and Odobez [32] is used in this chapter. In short, its main characteristics are the use of an approach similar to the mixture of Gaussians (MoG) [33], the use of local binary pattern features as well as a perceptual distance in the color space to avoid the detection of shadows, and the use of hysteresis values to model the temporal dynamics of the mixture weights. An example is shown in Figure 15.3. The foreground likelihood of one camera is modeled as

p(D | \tilde{X}) = \prod_{x \in S} \exp\big(-\lambda_{fg}(1 - D(x))\big) \prod_{x \in \bar{S}} \exp\big(-\bar{\lambda}_{fg} D(x)\big)    (15.13)

\propto \prod_{x \in S} \exp\big(c_1 (D(x) - c_2)\big)    (15.14)

where x denotes an image pixel, S denotes the object regions of the image, S̄ denotes its complement, as illustrated in Figure 15.4, and c_1 = λ_fg + λ̄_fg and c_2 = λ_fg / c_1. In equation 15.13, we clearly notice that the number of terms is independent of the number of objects. Equation 15.14, which was obtained by factoring out the constant term ∏_{x∈S∪S̄} exp(-λ̄_fg D(x)), indicates that the placement (for track or birth) of objects is encouraged in regions where D(x) > c_2.
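In log form, equation 15.14 reduces to a sum over the pixels covered by the object regions, as in the sketch below; the λ values are illustrative.

```python
import numpy as np

def foreground_loglik(D, object_mask, lam_fg=2.0, lam_bg=2.0):
    """Log of equation 15.14 (up to a constant).
    D           : background-distance map with values in [0, 1]
    object_mask : boolean union S of the projected body-part boxes
    """
    c1 = lam_fg + lam_bg
    c2 = lam_fg / c1
    return c1 * np.sum(D[object_mask] - c2)

# Placing an object on a region where D(x) > c2 increases the log-likelihood:
D = np.zeros((240, 320)); D[100:180, 140:200] = 0.9
mask = np.zeros_like(D, dtype=bool); mask[100:180, 140:200] = True
print(foreground_loglik(D, mask) > 0)   # True
```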
15.4.2 Color Likelihood
The color likelihood in one camera is modeled as

p(I | D, \tilde{X}) = \prod_{i \in K} \prod_{b=1}^{3} \exp\big(-\lambda_{im} |R_{i,b}|\, D_c(I, D, R_{i,b})\big) \prod_{x \in \bar{S}} \exp(-\lambda_{im} D_{min})

\propto \prod_{i \in K} \prod_{b=1}^{3} \exp\big(-\lambda_{im} |R_{i,b}| \big(D_c(I, D, R_{i,b}) - D_{min}\big)\big)    (15.15)

where R_{i,b} denotes, for an existing object i visible in the camera view, the image part of its body region b not covered by other objects (see Figure 15.4), and |R_{i,b}| denotes the area of R_{i,b}. The preceding expression provides a comparable likelihood
FIGURE 15.3 Example of foreground detection on a real metro image with strong reflections on the ground floor. (a) The original image. (b) The first background layer learned by the background subtraction algorithm. (c) The distance between image (a) and the background model. (d) The identified foreground regions.
FIGURE 15.4 Object and nonobject regions used to compute observation likelihoods.
for different numbers of objects, and favors the placement of tracked objects at positions for which the body-region color distance D_c(I, D, R_{i,b}) is low; it also favors the existence of an object if this distance is (on average) lower than the expected minimum distance D_min.
Object Color Representation and Distance
From the visible part of the body region R_b of an object, we extract two color histograms: h_b, which uses only foreground pixels (i.e., those for which D(x) > c_2), and H_b, which uses all pixels in R_b. While the former should be more accurate, as it avoids pooling pixels from the background, the latter guarantees that we have enough observations. To efficiently account for appearance variability due to pose, lighting, resolution, and camera view changes, we propose to represent each object body region using a set of B automatically learned reference histograms, 𝓗 = {H̄^k}_{k=1}^{B}, learned as detailed in the next subsection.
Color Distance
The color distance is then defined as

D_c(I, D, R_b) = (1 - \lambda_f)\, D_h^2(H_b, \mathcal{H}) + \lambda_f\, D_h^2(h_b, \mathcal{H})    (15.16)

with

D_h(H, \mathcal{H}) = \min_k D_{bh}(H, \bar{H}^k)    (15.17)

where D_bh denotes the standard Bhattacharyya distance between two histograms, and λ_f weights the contribution of each extracted histogram to the overall distance. For a newborn object, we do not have reference histograms with which to compute the color distance defined in equation 15.16. Still, when creating an object we need to be able to evaluate its color likelihood. Thus, at creation time the initial reference histogram for each body part of the newborn object is computed as the average of the two currently extracted histograms H_b and h_b, as described in Algorithm 15.1.
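The sketch below evaluates equations 15.16 and 15.17 for one body part, assuming normalized histograms and an illustrative value for λ_f.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya distance D_bh between two normalized histograms."""
    return float(np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q)))))

def dist_to_reference(H, reference_modes):
    """Equation 15.17: min_k D_bh(H, H_bar_k) over the retained modes."""
    return min(bhattacharyya(H, Hk) for Hk in reference_modes)

def color_distance(H_b, h_b, reference_modes, lam_f=0.5):
    """Equation 15.16: D_c(I, D, R_b) for one body part."""
    return ((1.0 - lam_f) * dist_to_reference(H_b, reference_modes) ** 2 +
            lam_f * dist_to_reference(h_b, reference_modes) ** 2)

# Example with random normalized 8x8x8 RGB histograms:
rng = np.random.default_rng(0)
def rand_hist():
    h = rng.random(512)
    return h / h.sum()
refs = [rand_hist(), rand_hist()]
print(color_distance(rand_hist(), rand_hist(), refs))
```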
Multi-Modal Reference Histogram Modeling
Due to pose changes, lighting changes, nonrigid motions, or multiple resolutions in multiple views, the histogram of each human body part may vary over time. To deal with this variability, we propose a multi-modal learning method that learns statistical information about the reference histograms of each human body part, similar to the background modeling method used for foreground detection [32]. Let 𝓗_t = {K_t, {H̄_{k,t}, w_{k,t}}_{k=1}^{K_t}, B_t} represent the learned statistical reference histogram model at time t for a given human body part, which consists of a list of K_t reference histogram modes {H̄_{k,t}}_{k=1}^{K_t} with weights {w_{k,t}}_{k=1}^{K_t}, of which the first B_t (≤ K_t) modes have been identified as representing the reference histogram observations used in equation 15.17. To keep the complexity bounded, we set a maximal mode list size K_max. The observed histograms (extracted from the object mean state at the end of each time step) are matched against the reference histograms and used to update the best-matched histogram, or to create a new reference histogram if the best match is not close enough. The procedure for multi-modal reference histogram modeling is described in Algorithm 15.1.
Algorithm 15.1: Multi-Modal Reference Histogram Modeling of a Human Body Part

Initialization: For a newborn object selected from camera view v at time t_0, we initialize K_{t_0} = B_{t_0} = 1, H̄_{1,t_0} = (H_{t_0} + h_{t_0})/2, and w_{1,t_0} = w_0, where w_0 denotes a low initial weight. H_{t_0} and h_{t_0} are the normalized histograms computed from all pixels and from all foreground pixels, respectively, in the unoccluded region of the human body part.

Update: While the person object exists in the scene, we repeat, for each camera view v = 1 to N_v: if the person object is fully visible in the current view (i.e., without occlusion) based on the mean state, we update the model as follows (otherwise, we do not update it):
1. Compute the normalized histograms H_t (for all pixels) and h_t (only for foreground pixels) of the human body part observed in the current view.
2. For H = H_t and H = h_t, repeat the following steps:
   a. Compute the Bhattacharyya distances {D_bh(H̄_{k,t}, H)}_{k=1}^{K_t} and find the best-matched mode k̃ with the smallest distance.
   b. If the best-matched mode is close enough to the data (i.e., D_bh(H̄_{k̃,t}, H) < τ_b), we update with K_t = K_{t-1}:
      • The best-matched mode: H̄_{k̃,t} = (1 - α_h) H̄_{k̃,t-1} + α_h H and w_{k̃,t} = (1 - α_w) w_{k̃,t-1} + α_w, where α_h and α_w are the learning rates.
      • The other modes: w_{k,t} = (1 - α_w) w_{k,t-1}.
      Otherwise, we create a new mode {H, w_0} and add it (if K_{t-1} < K_max), resulting in K_t = K_{t-1} + 1, or replace the existing mode with the smallest weight if K_{t-1} = K_max.
   c. Sort all the modes in decreasing order according to their corresponding weights.
   d. Select the first B_t modes as the potential reference histogram modes, satisfying (Σ_{k=1}^{B_t} w_{k,t}) / (Σ_{k=1}^{K_t} w_{k,t}) ≥ T_h, with T_h ∈ [0, 1], typically T_h = 0.6.
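A compact sketch of the per-part update of Algorithm 15.1 is given below; the threshold, learning rates, and maximum number of modes are illustrative values, and the class design is not the chapter's implementation.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya distance between two normalized histograms."""
    return float(np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q)))))

class ReferenceHistograms:
    """Multi-modal reference histograms for one human body part."""

    def __init__(self, H0, h0, w0=0.05, k_max=4):
        self.modes = [0.5 * (H0 + h0)]       # initialization: average of H and h
        self.weights = [w0]
        self.w0, self.k_max = w0, k_max

    def reference_modes(self, T_h=0.6):
        """First B_t modes whose cumulative normalized weight reaches T_h."""
        order = np.argsort(self.weights)[::-1]
        w = np.array(self.weights)[order] / np.sum(self.weights)
        B = int(np.searchsorted(np.cumsum(w), T_h) + 1)
        return [self.modes[i] for i in order[:B]]

    def update(self, H, tau_b=0.3, alpha_h=0.05, alpha_w=0.05):
        """Match histogram H against the modes; adapt, create, or replace."""
        d = [bhattacharyya(H, Hk) for Hk in self.modes]
        k = int(np.argmin(d))
        if d[k] < tau_b:                                  # close enough: adapt mode k
            self.modes[k] = (1 - alpha_h) * self.modes[k] + alpha_h * H
            self.weights = [(1 - alpha_w) * w for w in self.weights]
            self.weights[k] += alpha_w
        elif len(self.modes) < self.k_max:                # otherwise create a mode
            self.modes.append(H.copy()); self.weights.append(self.w0)
        else:                                             # or replace the weakest one
            j = int(np.argmin(self.weights))
            self.modes[j], self.weights[j] = H.copy(), self.w0
```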
15.5 REVERSIBLE-JUMP MCMC
Given the high and variable dimensionality of our state space, the inference of the filtering distribution p(X̃_t | Z_1:t) is conducted using a reversible-jump MCMC (RJ-MCMC) sampling scheme, which has been shown to be very efficient in such cases [5, 8, 14]. In RJ-MCMC, a Markov chain is defined such that its stationary distribution is equal to the target distribution—equation 15.2 in our case. The Markov chain is sampled using the Metropolis-Hastings (MH) algorithm. Starting from an arbitrary configuration, the algorithm proceeds by repetitively selecting a move type m from a set of moves Υ with prior probability p_m, and sampling a new configuration X̃'_t from a proposal distribution q_m(X̃'_t | X̃_t). The move can either change the dimensionality of the state (as in birth or death) or keep it fixed. Then either the proposed configuration is added to the Markov chain with probability (known as the acceptance ratio)

a = \min\left(1,\; \frac{p(\tilde{X}'_t | Z_{1:t})}{p(\tilde{X}_t | Z_{1:t})} \times \frac{p_{m'}\, q_{m'}(\tilde{X}_t ; \tilde{X}'_t)}{p_m\, q_m(\tilde{X}'_t ; \tilde{X}_t)}\right)    (15.18)

where m' is the reverse move of m, or the current configuration is added again. In the following, we describe the moves and proposals we used and highlight the key points.
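In practice the acceptance test of equation 15.18 is conveniently done in the log domain, as in the sketch below; the function names are illustrative and the densities are assumed to be supplied unnormalized (the constant cancels in the ratio).

```python
import numpy as np

def mh_accept(log_post_new, log_post_cur,
              log_p_move, log_q_forward,      # log p_m, log q_m(X_new; X_cur)
              log_p_rev, log_q_backward,      # log p_m', log q_m'(X_cur; X_new)
              rng=np.random.default_rng()):
    """Return True if the proposed configuration is accepted (equation 15.18)."""
    log_a = ((log_post_new - log_post_cur) +
             (log_p_rev + log_q_backward) - (log_p_move + log_q_forward))
    return np.log(rng.random()) < min(0.0, log_a)

# Example: a proposal twice as probable under the target, with symmetric
# proposals and equal move probabilities, is always accepted.
print(mh_accept(np.log(2.0), 0.0, 0.0, 0.0, 0.0, 0.0))   # True
```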
FIGURE 15.5 Detection results at the same time instant in three views (see color images on this book’s companion website).
15.5.1 Human Detection
Good and accurate automatic track initialization is crucial for multi-object tracking, particularly since it is the phase in which the initial object model (the color histograms) is extracted. In addition, being able to propose accurate positions at which to update current tracks is important. To this end, we developed a person detector [34] that builds on the approach of Tuzel et al. [35] and takes full advantage of the correlation existing between the shapes of humans in foreground detection maps and their appearance in an RGB image. In our calibrated multi-view environment, the detector is applied on each view separately, on windows (1) that correspond to plausible people sizes and (2) for which the corresponding windows in the other camera views (obtained by calibration) all contain enough (20 percent) foreground pixels. Note that, apart from this latter constraint, we did not try to merge the detection outputs of the different views. The main reason is that such fusion could reduce the number of detections (e.g., the object might be too small, occluded, or noisy in a given image). Also, it appeared to be better to keep the best localizations in each of the camera views when initializing or updating track states in the MCMC tracking framework. Figure 15.5 provides an example of the obtained detections.
15.5.2 Move Proposals
In total, we defined six move types: add, delete, stay, leave, switch, and update, described as follows. The first four correspond to the typical moves and corresponding acceptance ratios that can be found in Khan et al. [14].
Add
The human detector described previously is used to find a set of detected persons K_t^det = {K_{v,t}^det}_{v=1}^{N_v} in the scene at time t, where the set K_{v,t}^det consists of all persons detected in camera view v at time t. If a detected person i* from one camera view is not yet in the current existing object set K_t, we propose adding it. The add move proposal is defined as

q_{add}(\tilde{X}'_t ; \tilde{X}_t) = \frac{1}{|K_t^{det} \setminus K_t|}    (15.19)

if X̃'_t contains the same objects as X̃_t plus one detected object, and 0 otherwise.¹ The symbol \ denotes set subtraction and |·| denotes set size. Here we define K_t^det \ K_t = {K_{v,t}^det \ K_t}_{v=1}^{N_v}. One of the following two conditions must be satisfied to determine that a detected object i from camera view v is not yet in the current existing object set K_t. The first is that the minimal ground-plane distance between object i and all objects in K_t is larger than some threshold (i.e., the detected object cannot be associated with any existing object). The second is that the percentage of occlusion of the detection in camera view v by the existing objects is lower than some threshold. If all detected objects have already been added, we set the probability p_add of the add move to zero.

¹ Note that the cases where the probability is 0 are implicit in the way the move is defined. For the other moves, we do not mention them.
Delete
As required by the reversible-jump MCMC algorithm, add needs a corresponding reverse jump, in order to potentially move the chain back to a previous hypothesis. We define the delete move type by removing a randomly selected person i* from the identifier set K_t^det ∩ K_t, which is the set of detected objects that have already been added to K_t. Thus,

q_{delete}(\tilde{X}'_t ; \tilde{X}_t) = \frac{1}{|K_t^{det} \cap K_t|}    (15.20)

where the operation ∩ denotes the intersection of two sets. If no objects exist yet (i.e., K_t = ∅) or the intersection is empty, we set the probability p_delete of the delete move to zero.
Stay
The add/delete move types enable new objects to enter or be removed from the field of view at each time step, driven by the human detector. Additionally, we need a mechanism for deciding the fate of objects already represented in the previous sample set {X̃_{t-1}^{(r)}}_{r=1}^{N} at time t-1. We define the valid identifier set existing in the previous sample set as K*_{t-1} = {i ∈ [1, M] | Σ_{r=1}^{N} k_{i,t-1}^{(r)} > 0}. That is, an object identifier is valid if it appears in enough samples. If a given object i* is no longer valid in the current sample state X̃_t (i.e., k_{i*,t} = 0) but exists in K*_{t-1}, we propose to re-add it with uniform probability 1/|K*_{t-1} \ K_t| and to sample a new state from

q(X_{i^*} ; i^*) = \sum_{r=1,\; i^* \in K_{t-1}^{(r)}}^{N} p(X_{i^*,t} | X_{i^*,t-1}^{(r)})

In this way, the stay proposal can be defined as

q_{stay}(\tilde{X}'_t ; \tilde{X}_t) = \frac{1}{|K^*_{t-1} \setminus K_t|}\, q(X_{i^*} ; i^*)    (15.21)

If the set K*_{t-1} \ K_t is empty, we set the probability p_stay of the stay move to zero.
Leave
This is the reverse jump of the stay move. It randomly selects an identifier i* from the set K_t \ K_t^det and removes it from the current sample state. Thus, the leave proposal can be defined as

q_{leave}(\tilde{X}'_t ; \tilde{X}_t) = \frac{1}{|K_t \setminus K_t^{det}|}    (15.22)

If the set K_t \ K_t^det is empty, we set the probability p_leave of the leave move to zero.
Switch
This move randomly selects a pair of close-by objects (i*, j*) ∈ K_t (i.e., k_{i*,t} = k_{j*,t} = 1) and exchanges their states. In practice it allows us to check whether exchanging the color models better fits the data. Based on the Mahalanobis distance between two objects used in our interaction prior (see Section 15.3.1), we define the switch proposal as

q_{switch}(\tilde{X}'_t ; \tilde{X}_t) = 1 / g_p(X_{i^*,t}, X_{j^*,t})    (15.23)

Thus, there is a greater probability of selecting a pair of objects with a smaller distance. If the Mahalanobis distance between every pair of objects (i*, j*) is large enough (i.e., g_p(X_{i*,t}, X_{j*,t}) > τ_md for all pairs), we set the probability p_switch of the switch proposal to zero.
Update
This is an important move that allows us to find good estimates of the object states without changing the dimension. It works by first randomly selecting a valid object i* from the current joint configuration (i.e., one for which k_{i*,t} = 1), and then proposing a new state for it. This new state is drawn in two ways (i.e., the proposal is a mixture). In the first case, the object position, height, and orientation are locally perturbed according to a Gaussian kernel [14]. Importantly, in order to propose interesting state values that may have a visual impact, the noise covariance in position is defined as Σ(u_{i*}) = C(u_{i*}) C(u_{i*})^⊤, where C(u_{i*}) is the noise matrix of equation 15.11 used to define the noise covariance in Figure 15.2. The proposal in this case is defined (for the selected object; the other objects remain unchanged) as

q_{u1}(X'_{i^*,t} ; X_{i^*,t}) = \mathcal{N}(X'_{i^*,t} ; X_{i^*,t}, \Sigma_u)    (15.24)

where Σ_u = diag(Σ(u_{i*}), σ_h², σ_α²), and σ_h² and σ_α² are the noise variances in height and orientation, respectively. The second way to update the object location consists of sampling the new location around one of the positions provided by the human detector that are close enough to the selected object i*. Here again, closeness is defined by exploiting Σ(u_{i*}), and the perturbation covariance around the selected detection is given by Σ(u_{i*}). Thus, the proposal in this case can be defined as

q_{u2}(X'_{i^*,t} ; X_{i^*,t}) = \sum_{k \in K_{i^*,t}} \frac{1}{C_d\, \psi\big(d_{m,i}(u_{i^*,t}, u_{k,t})\big)}\, \mathcal{N}\big(u'_{i^*,t} ; u_{k,t}, \Sigma(u_{i^*})\big)    (15.25)

where C_d is a normalization factor, C_d = Σ_{k∈K_{i*,t}} 1/ψ(d_{m,i}(u_{i*,t}, u_{k,t})), and d_{m,i}(u_{i*,t}, u_{k,t}) denotes the Mahalanobis distance used in equation 15.9 for the interaction prior. The set K_{i*,t} consists of the detected human objects close enough to object i*—that is, K_{i*,t} = {k ∈ K_t^det | d_{m,i}(u_{i*,t}, u_{k,t}) < τ_md}, where τ_md is a predefined distance threshold. Thus, the final update proposal is defined as

q_{update}(\tilde{X}'_t ; \tilde{X}_t) = \frac{1}{|K_t|} \Big[ \epsilon_u\, q_{u1}(X'_{i^*,t} ; X_{i^*,t}) + (1 - \epsilon_u)\, q_{u2}(X'_{i^*,t} ; X_{i^*,t}) \Big]    (15.26)

where ε_u is a real value in [0, 1].
15.5.3 Summary
The steps of the proposed MCMC-based tracking algorithm for a variable number of objects in the scene are summarized in Algorithm 15.2. Note that, while the overall expression of the filtering distribution is quite complex, the expression of the acceptance ratio is usually simple. Indeed, many of the terms in the numerator and the denominator cancel each other, since the likelihood terms as well as the interaction terms only involve local computation, and most of the objects do not change at each move.

Algorithm 15.2: Multi-Person Tracking with Reversible-Jump MCMC

At each time step t, the posterior over the joint object states X̃_{t-1} at time t-1 is represented by a set of unweighted samples {X̃_{t-1}^{(r)}}_{r=1}^{N}. The approximation of the current distribution p(X̃_t | Z_t) is constructed by RJ-MCMC sampling as follows:
1. Initialization: Initialize the Markov chain by randomly selecting a sample X̃_{t-1}^{(r)} and applying the motion model to each object; accept the result as the first sample.
2. RJ-MCMC sampling: Draw (B + N) samples according to the following schedule, where B is the length of the burn-in period:
   a. Randomly select a move type from the set of moves Υ = {add, delete, stay, leave, switch, update} (see Section 15.5.2).
   b. Select an object i* (or two objects i* and j* for switch).
   c. Propose a new state X̃'_t depending on the selected move type:
      add: add a new object i* to the current state.
      delete: delete an existing object i* from the current state.
      stay: re-add an existing object i* to the current state.
      leave: remove an existing object i* from the current state.
      switch: exchange the states of two close-by objects (i*, j*).
      update: update the parameters of object i*.
   d. Compute the acceptance ratio a defined in equation 15.18 for the chosen move type.
   e. If a ≥ 1, accept the proposed state (X̃_t ← X̃'_t) as a new sample. Otherwise, accept it with probability a and reject it otherwise.
3. Approximation: As an approximation of the current posterior p(X̃_t | Z_t), return the new sample set {X̃_t^{(r)}}_{r=1}^{N} obtained after discarding the initial B burn-in samples.
4. Update reference histograms: Compute the mean states of all existing objects from the new sample set and update their corresponding reference histograms (see Section 15.4.2).
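The control flow of Algorithm 15.2 can be sketched as below. The dynamics, the move implementations, their proposal densities, and the posterior are placeholders passed in by the caller, so this only illustrates the burn-in, move-selection, and acceptance structure, not the chapter's actual implementation.

```python
import numpy as np

def rjmcmc_step(prev_samples, apply_dynamics, moves, move_probs,
                log_posterior, n_samples=1000, burn_in=500,
                rng=np.random.default_rng()):
    """One time step of RJ-MCMC tracking (Algorithm 15.2 skeleton).
    prev_samples   : unweighted joint-state samples from time t-1
    apply_dynamics : callable sampling the joint dynamics of Section 15.3
    moves          : dict name -> (propose, log_q_fwd, log_q_bwd, reverse_name)
    move_probs     : dict name -> prior move probability p_m
    log_posterior  : callable, log p(X_t | Z_1:t) up to a constant
    """
    # 1. Initialization: pick a previous sample and apply the motion model.
    current = apply_dynamics(prev_samples[rng.integers(len(prev_samples))])
    chain = []
    names = list(moves)
    probs = np.array([move_probs[m] for m in names], float)
    for it in range(burn_in + n_samples):
        # 2a-2c. Select a move and propose a new joint configuration.
        m = rng.choice(names, p=probs / probs.sum())
        propose, log_qf, log_qb, m_rev = moves[m]
        proposed = propose(current)
        # 2d-2e. Accept or reject using the ratio of equation 15.18.
        log_a = (log_posterior(proposed) - log_posterior(current)
                 + np.log(move_probs[m_rev]) + log_qb(current, proposed)
                 - np.log(move_probs[m]) - log_qf(proposed, current))
        if np.log(rng.random()) < min(0.0, log_a):
            current = proposed
        if it >= burn_in:                 # 3. keep only post-burn-in samples
            chain.append(current)
    return chain
```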
15.6 EXPERIMENTS
We first illustrate the effect of our slant removal algorithm on images, and then we present our tracking results.
15.6.1 Calibration and Slant Removal
Before video processing, the cameras were first calibrated, and a rectification homography was precomputed in order to remove person slant in the images at runtime.
Camera Calibration
Cameras were calibrated using the available information and exploiting geometrical constraints [36], such as requiring that 3D straight lines appear undistorted in the image, or obtaining the vertical direction Z at any point from the image coordinates of the vertical vanishing point v⊥, computed as the intersection of the image projections of a set of parallel vertical 3D world lines. The image-to-ground homography H was estimated using a set of manually marked points in the image plane and their 3D correspondences on the ground plane.
Removing Slant by Mapping the Vertical Vanishing Point to Infinity
In Figure 15.6, we observe that standing humans appear with different slants in the image. This introduces variability in the feature extraction process when using rectangular regions. To handle this issue, we compute an appropriate projective transformation H⊥ of the image plane in order to map its finite vertical vanishing point to a point at infinity, as described in Yao and Odobez [37]. As a result, the 3D vertical direction of persons standing on the ground plane always maps to 2D vertical lines in the new image, as illustrated in Figure 15.6. This transformation should help in obtaining better detection results or extracting more accurate features while maintaining computational efficiency—for example, by using integral images. At runtime, this does not generate an extra cost, since the mapping can be directly integrated with the distortion removal step.
15.6.2 Results
Two data sets captured from two different scenes were used to evaluate our proposed multi-person tracking system. The first consists of three videos, each 2.5 hours long, captured by three wide-baseline cameras in the Torino metro station scene shown in Figure 15.5. These sequences are very challenging, given the camera viewpoints (small average human size and large size variations within a given view, occlusion, partial FOV overlap), crowded scenes in front of the gates, and the presence of many specular reflections on the ground, which in combination with cast shadows generate many background subtraction false alarms. In addition, most persons were dressed in similar colors. The second data set comprises 10 minutes of video footage, also captured by three wide-baseline cameras, in an outdoor scene. In this scene, humans often appear slanted near the left and/or right
FIGURE 15.6 Vertical vanishing point mapping. (a) After distortion removal and before the mapping; person slant varies with position. (b) After the mapping to infinity; bounding boxes fit the silhouettes of persons more closely. (c) Another example.
borders of an image (see Figure 15.6). The camera viewpoint issues mentioned for the metro scene also exist in this outdoor scene. The following experiments were carried out using a total of 1500 moves in the RJ-MCMC sampling, with 500 in the burn-in phase. (Videos corresponding to these sequences are available on the companion website.)
FIGURE 15.7 Tracking results for the metro scene (see color images of this figure and next two on this book’s companion website).
borders of an image (see Figure 15.6). The camera viewpoint issues mentioned for the metro scene also exist in this outdoor scene. The following experiments were carried out using a total of 1500 moves in the RJ-MCMC sampling with 500 in the burn-in phase. (Videos corresponding to these sequences are available on the companion website.) Figure 15.7 shows some tracking results for the first data set. In this example, our tracking system performed very well, successfully adding persons with the human detector–mediated birth move and efficiently handling inter-person occlusion and partial visibility between camera views. The benefit of using 2D-to-3D ground plane noise in our algorithm, and especially in the dynamics, is illustrated in Figure 15.8. In the first two rows, this component was not used (i.e., C(u) ⫽ 0 in equation 15.11. As can be seen, the estimated state from the first view lags behind, resulting in a mismatch when the tracked person enters the second
FIGURE 15.8 Tracking results for the metro scene: (a) Without integrating the ground plane noise model in the dynamic model (results from two views and at four different instants are shown). (b) With integration (results in one view and at the same instants as in (a)).
view. As a consequence, a new object is created. The first track stays for some time and is then removed, resulting in a track break. On the other hand, when the proposed term is used, the transition between cameras is successfully handled by the algorithm, as shown in Figure 15.8(b). On the second data set, our approach performed very well, with almost no tracking errors over the 10-minute sequences. Results for four frames are shown in Figure 15.9. Anecdotally, our human detector was able to successfully detect a person on a bicycle, and our tracking system was able to track him robustly.
15.7 CONCLUSIONS
In this chapter, we discussed general multi-person tracking issues and presented a state-of-the-art multi-camera 3D tracking algorithm. The strength of the algorithm lies in several key factors: a joint multi-object state Bayesian formulation; appropriate interaction models using
FIGURE 15.9 Example of multi-person tracking in the outdoor sequence: (a) Frame 197, (b) Frame 973, (c) Frame 1267, and (d) Frame 1288.
state prediction to model collision avoidance; the RJ-MCMC inference and sampling scheme; and well-balanced observation models. The use of a fast and powerful human detector proved to be essential for good track initialization and state update. In the same way, the use of predefined 2D-to-3D geometric uncertainty measures on the state dynamics improves the results, and removing person slant in the image through a simple rectification scheme allows the use of an efficient human detector and feature extraction based on integral images. There are several ways to improve the current algorithm. The first is to use longer-term constraints on the dynamics. One standard approach is to post-process the extracted trajectories to remove transient tracks or resolve identity switches, and to merge trajectory fragments by observing the data in a longer time window. A second avenue of research for all tracking algorithms is to define more accurate likelihoods, especially in the presence of occlusions. There are two aspects to this issue. One is to use more sophisticated object descriptions to more fully explain the image content and obtain better localization information. This can be done by including shape cues in the model or by representing objects through their parts. The second, related improvement is to find appropriate measurements and likelihood models to infer the presence of an object given its model. This should be robust
enough to the noise inherent in the data. Given the large variety of scenes, appearances, poses, illumination conditions, camera setups, and image resolutions, there cannot be a single solution to this problem.

Acknowledgments. This work was supported by the European Union Sixth FP Information Society Technologies CARETAKER project (Content Analysis and Retrieval Technologies Applied to Knowledge Extraction of Massive Recordings), FP6-027231.
REFERENCES
[1] R. Singh, P. Bhargava, S. Kain, State of the art smart spaces: application models and software infrastructure, Ubiquity 37 (7) (2006) 2–9.
[2] K. Bernardin, T. Gehrig, R. Stiefelhagen, Multi- and single view multiperson tracking for smart room environments, in: Proceedings of the Workshop on Classification of Events, Actions and Relations, 2006.
[3] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multicamera people tracking with a probabilistic occupancy map, IEEE Transactions Pattern Analysis Machine Intelligence 30 (2) (2008) 267–282.
[4] K. Bernardin, R. Stiefelhagen, Audio-visual multi-person tracking and identification for smart environments, in: Proceedings of the fifteenth International Conference on Multimedia, 2007.
[5] T. Zhao, R. Nevatia, Tracking multiple humans in crowded environment, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[6] T. Zhao, R. Nevatia, B. Wu, Segmentation and tracking of multiple humans in crowded environments, IEEE Transactions Pattern Analysis Machine Intelligence 30 (7) (2008) 1198–1211.
[7] J. Yao, J.-M. Odobez, Multi-camera 3D person tracking with particle filter in a surveillance environment, in: The sixteenth European Signal Processing Conference, 2008.
[8] K. Smith, D. Gatica-Perez, J.-M. Odobez, Using particles to track varying numbers of interacting people, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[9] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, International Journal of Computer Vision 72 (2) (2007) 247–266.
[10] M. Isard, J. MacCormick, BRAMBLE: A Bayesian multi-blob tracker, in: Proceedings of the IEEE International Conference on Computer Vision, 2001.
[11] C.J. Veenman, M.J.T. Reinders, E. Backer, Resolving motion correspondence for densely moving points, IEEE Transactions Pattern Analysis Machine Intelligence 23 (1) (2001) 54–72.
[12] J. Sullivan, S. Carlsson, Tracking and labelling of interacting multiple targets, in: European Conference on Computer Vision, 2006.
[13] B. Bose, X. Wang, E. Grimson, Multi-class object tracking algorithm that handles fragmentation and grouping, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[14] Z. Khan, T. Balch, F. Dellaert, MCMC-based particle filtering for tracking a variable number of interacting targets, IEEE Transactions Pattern Analysis Machine Intelligence 27 (2005) 1805–1819.
[15] W. Qu, D. Schonfeld, M. Mohamed, Distributed Bayesian multiple target tracking in crowded environments using multiple collaborative cameras, EURASIP Journal on Applied Signal Processing, Special Issue on Tracking in Video Sequences of Crowded Scenes, 2007 (1) (2007).
[16] D. Tweed, A. Calway, Tracking many objects using subordinate condensation, in: Proceedings of the British Machine Vision Conference, 2002.
388
CHAPTER 15 Multi-Person Bayesian Tracking with Multiple Cameras [17] K. Okuma, A. Taleghani, N. Freitas, J. Little, D. Lowe, A boosted particle filter: multi-target detection and tracking, in: Proceedings of the European Conference on Computer Vision, 2004. [18] J. Berclaz, F. Fleuret, P. Fua, Robust people tracking with global trajectory optimization, in: IEEE Conference Computer Vision and Pattern Recognition, 2006. [19] P.J. Green, Trans-dimensional Markov chain Monte Carlo, in: P.J. Green, N.L. Hjort, S. Richardson (Eds.), Highly Structured Stochastic Systems, Oxford University Press, 2003. [20] M. Isard, A. Blake, Condensation conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1) (1998) 5–28. [21] D. Beymer, K. Konolige, Real-time tracking of multiple people using continuous detection, in: IEEE International Conference on Computer Vision Frame-Rate Workshop, 1999. [22] K. Smith, S.O. Ba, J.-M. Odobez, D. Gatica-Perez, Tracking the visual focus of attention for a varying number of wandering people, IEEE Transactions PatternAnalysis Machine Intelligence 30 (7) (2008) 1212–1229. [23] G.Antonini, S.V. Martinez, M. Bierlaire, J.P.Thiran, Behavioral priors for detection and tracking of pedestrians in video sequences, International Journal of Computer Vision 69 (2) (2006) 159–180. [24] K. Kim, L.S. Davis, Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering, in: Proceedings of the European Conference on Computer Vision, 2006. [25] W. Du, J. Piater, Multi-camera people tracking by collaborative particle filters and principal axis-based integration, in: Asian Conference on Computer Vision, 2007. [26] A. Mittal, L.S. Davis, M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo, in: Proceedings of the European Conference on Computer Vision, 2002. [27] P. Perez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proceedings of the European Conference on Computer Vision, 2002. [28] I. Haritaoglu, D. Harwood, L. Davis, W4: real-time surveillance of people and their activities, IEEE Transactions Pattern Analysis Machine Intelligence 22 (8) (2000) 809–830. [29] B. Leibe, K. Schindler, L.V. Gool, Coupled detection and trajectory estimation for multi-object tracking, in: International Conference on Computer Vision, 2007. [30] N.T. Pham, W. Huang, S.H. Ong, Probability hypothesis density approach for multi-camera multi-object tracking, in: Asian Conference on Computer Vision, 2007. [31] S. Julier, J. Uhlmann, Reduced sigma point filters for the propagation of means and covariances through nonlinear transformations, in: Proceedings of the 2002 American Control Conference, 2002. [32] J. Yao, J.-M. Odobez, Multi-layer background subtraction based on color and texture, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Visual Surveillance, 2007. [33] C. Stauffer, W. Grimson, Adaptive background mixture models for real-time tracking, in: IEEE Conference Computer Vision and Pattern Recognition, 1999. [34] J. Yao, J.-M. Odobez, Fast human detection from videos using covariance features, in: ECCV 2008 Workshop on Visual Surveillance, 2008. [35] O. Tuzel, F. Porikli, P. Meer, Human detection via classification on riemannian manifolds, in: IEEE Conference Computer Vision and Pattern Recognition, 2007. [36] G. Wang, Z. Hu, F. Wu, H.-T. 
Tsui, Single-view metrology from scene constraints, Image & Vision Computing Journal 23 (2005) 831–840. [37] J. Yao, J.-M. Odobez, Multi-camera multi-person 3D space tracking with MCMC in surveillance scenarios, in: ECCV 2008 Workshop on Multi Camera and Multi-modal Sensor Fusion Algorithms and Applications, 2008.
CHAPTER
Statistical Pattern Recognition for Multi-Camera Detection, Tracking, and Trajectory Analysis
16
Simone Calderara, Rita Cucchiara, Roberto Vezzani Dipartimento di Ingegneria dell’Informazione, University of Modena and Reggio Emilia, Modena, Italy Andrea Prati Dipartimento di Scienze e Metodi dell’Ingegneria, University of Modena and Reggio Emilia, Emilia, Italy
Abstract This chapter will address most aspects of modern video surveillance with reference to research conducted at University of Modena and Reggio Emilia. In particular, four blocks of an almost standard surveillance framework will be analyzed: lowlevel foreground segmentation, single-camera person tracking, consistent labeling, and high-level behavior analysis. The foreground segmentation is performed by a background subtraction algorithm enhanced with pixel-based shadow detection; appearance-based tracking with specific occlusion detection is employed to follow moving objects in a single camera view. Thus, multi-camera consistent labeling detects correspondences among different views of the same object. Finally, a trajectory shape analysis for path classification is proposed. Keywords: probabilistic tracking, consistent labeling, shape trajectory analysis, distributed video surveillance
16.1 INTRODUCTION Current video surveillance systems are moving toward new functionalities to become smarter and more accurate. Specifically, path analysis and action recognition in human surveillance are two very active areas of research in the scientific community. Moreover, with the high proliferation of cameras installed in public places has come a surge of algorithms for handling distributed multi-camera (possibly multi-sensor) systems.
389 Copyright © 2009 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-0-12-374633-7.00016-1
390
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection F
F
Fixed camera
SAKBOT
Segmentation
F
Fixed camera
Segmentation
Fixed camera
Tracking
Geometry recovery
S
Moving cameras
Sensors
S
PTZ cameras
Segmentation ROI and model-based tracking
Ad Hoc Tracking
M
Tracking
Mosaiking, segmentation, and tracking
Sensor data acquisition
Consistent labeling and multi-camera tracking
PTZ control
HECOL Homography and epipoles
Posture analysis
Action analysis
Trajectory analysis
Behavior recognition
ViSOR
MOSES
Video surveillance ontology
Face recognition and person identification
Person annotation
Annotated video storage
MPEG streaming
Head tracking face selection
High-resolution detection
Face obscuration
Security control center
Web
Mobile surveillance platforms
FIGURE 16.1 MPEG Imagelab video surveillance solution.
This chapter will summarize the work done at Imagelab,1 University of Modena and Reggio Emilia, in the last ten years of research in video surveillance. Figure 16.1 shows the modular architecture we developed.The architecture exploits several libraries (collected in the Imagelab library) for video surveillance, written in C++, that provide direct interoperability with OpenCV (state-of-the-art computer vision open source tools). From the bottom up, the lowest level is the interface with sensors (the top of Figure 16.1).Traditionally video surveillance employs fixed cameras and provides motion detection by means of background suppression. Accordingly, we defined the Statistical and Knowledge-Based Object detector algorithm (SAKBOT) [1], which is very robust in many different conditions (see Section 16.2 for details). Tracking with a sophisticated Appearance-Driven Tracking Module with Occlusion Classification (Ad Hoc) [2] has been adopted in many applications of vehicle and person tracking by single cameras (see Section 16.3). Recent advances in security call for coverage of large monitored areas, requiring multiple cameras. In cases of cameras with partially overlapped fields of view (FOVs) we propose a new statistical and geometrical approach to solve the consistent labeling problem.That is, humans (or objects) detected by a camera module should maintain their identities if acquired by other camera modules, even in cases of overlaps, partial views, multiple subjects passing the edges of the FOVs, and uncertainty. An automatic learning phase to reconstruct the homography of the plane and the epipolar lines is required to perform this task. The approach, called HECOL [3] (Homography and Epipolar-Based
1
For further information: http://imagelab.ing.unimore.it.
16.2 Background Modeling
391
Consistent Labeling), has been employed for real-time monitoring of public parks in Reggio Emilia. Behavior analysis is carried out starting with the person trajectory. Learning the normal path of subjects, we can infer abnormal behaviors by means of detecting unusual trajectories that do not fit clusters of preanalyzed data.The challenge is a reliable measure of shape trajectory similarity in a large space covered by multiple cameras. In this field we recently proposed a new effective descriptor of trajectory based on circular statistics using a mixture of von Mises distributions [4]. The Imagelab architecture is enriched by a set of user-level modules. First, it includes a Web-based video and annotation repository, ViSOR [5],2 which also provides a video surveillance ontology. Thus, MOSES (Mobile Streaming for Surveillance) is a video streaming server devoted to surveillance systems.
16.2 BACKGROUND MODELING The first important processing step of an automatic surveillance system is the extraction of objects of interest. In particular, when cameras are installed in fixed positions this can be achieved by calculating the difference between the input frame and a model of the static content of the monitored scene—that is, the background model. Background modeling is a complex task in real-world applications; many difficulties arise from environmental and lighting conditions, micro-movement (e.g., waving trees), or illumination changes. The background model must also be constantly updated during the day because of natural intrinsic changes in the scene itself, such as clouds covering the sun, rain, and other natural artifacts. The adopted motion detection algorithm is specifically designed to ensure robust and reliable background estimation even in complex outdoor scenarios. It is a modification of the SAKBOT system [1] that increases robustness in outdoor uncontrolled environments. The SAKBOT background model is a temporal median model with a selective knowledgebased update stage. Suitable modifications to background initialization, motion detection, and object validation have been developed. The initial background model at time t, BGt , is initialized by subdividing the input image I into 16⫻16 pixel-sized blocks. For each block, a single difference over time, with input frame It , is performed and the number of still pixels is counted as the block’s weight.The background is then selectively updated by including all blocks composed of more than 95 percent still pixels, and the initialization process halts when the whole background image BGt is filled with “stable” blocks. After the bootstrapping stage, the background model is updated using a selective temporal median filter. A fixed k-sized circular buffer is used to collect values of each pixel over time. In addition to the k values, the current background model BGt (i, j) is sampled and added to the buffer to account for the last reliable background information available. These n ⫽ k ⫹ 1 values are then ordered according to their gray-level intensity, and the median value is used as an estimate for the current background model.
2
For further information: http://www.openvisor.org.
392
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection The difference between the current image It and the background model BGt is computed and then binarized using two different local and pixel-varying thresholds: a low-threshold Tlow to filter out the noisy pixels extracted because of small intensity variations; and a high-threshold Thigh to identify the pixels where large intensity variations occur. These two thresholds are adapted to the current values in the buffer: Tlow (i, j) ⫽ b k⫹1 ⫹l ⫺ b k⫹1 ⫺l 2 2 Thigh (i, j) ⫽ b k⫹1 ⫹h ⫺ b k⫹1 ⫺h 2
(16.1)
2
where bp is the value at position p inside the ordered circular buffer b of pixel (i, j), and , l, and h are fixed scalar values.We experimentally set ⫽ 7, l ⫽ 2, and h ⫽ 4 for a buffer of n ⫽ 9 values. The final binarized motion mask Mt is obtained as a composition of the two binarized motion masks computed respectively using the low and high thresholds: A pixel is marked as foreground in Mt if it is present in the low-threshold binarized mask and it is spatially connected to at least one pixel present in the high-threshold binarized mask. Finally, the list MVOt of moving objects at time t is extracted from Mt by grouping connected pixels. Objects are then validated, jointly using color, shape, and gradient information to remove artifacts and objects caused by small background variations, and invalid objects are directly injected into the background model (see Calderara et al. [6] for further details). An object-level validation step is performed to remove all moving objects generated by small motion in the background (e.g., waving trees). This validation accounts for joint contributions coming from the objects’ color and gradient information. The gradient is computed with respect to both spatial and temporal coordinates of the image It : ⭸It (i, j) ⫽ It⫺⌬t (i ⫺ 1, j) ⫺ It (i ⫹ 1, j) ⭸(x, t) ⭸It (i, j) ⫽ It⫺⌬t (i, j ⫺ 1) ⫺ It (i, j ⫹ 1) ⭸( y, t)
(16.2)
In the case of stationary points, the past image samples It⫺⌬t can be approximated with the background model BGt ; then the gradient module Gt is computed as the square sum of all components. ⭸It (i, j) ⫽ BGt (i ⫺ 1, j) ⫺ It (i ⫹ 1, j) ⭸(x, t) ⭸It (i, j) ⫽ BGt (i, j ⫺ 1) ⫺ It (i, j ⫹ 1) ⭸( y, t) ⎧ ⎫ ⎨ ⭸It (i, j) 2 ⭸It (i, j) 2 ⎬ Gt ⫽ g(i, j) | g(i, j) ⫽ ⭸(x, t) ⫹ ⭸( y, t) ⎭ ⎩
(16.3)
This joint spatio-temporal gradient module is quite robust against small motions in the background, mainly thanks to the use of temporal partial derivatives. Moreover, the joint spatio-temporal derivative makes the gradient computation more informative, since
16.3 Single-Camera Person Tracking
393
it also detects nonzero gradient modules even in the inner parts of the object as well as on the boundaries as performed by common techniques. Given the list of moving objects MVOt , the gradient Gt , for each pixel (i, j) of a moving object MVOt , and the gradient (in the spatial domain) of the background GBGt are compared in order to evaluate their mutual coherence. This gradient coherence GCt is evaluated over a k ⫻ k neighborhood as the blockwise minimum of absolute differences between the current gradient values Gt and the background gradient values GBGt in the considered block. To ensure a more reliable coherence value even when the gradient module is close to zero, we combine the gradient coherence with a color coherence contribution CCt , computed blockwise as the minimum of the Euclidean norm in the RGB space between the current image pixel color It (i, j) and the background model values in the considered block centered at (i, j). The overall validation score is the normalized sum of the perpixel validation score, obtained by multiplying the two coherence measures. Objects are validated by thresholding the overall coherence, and pixels belonging to discarded objects are labeled as part of the background. Since shadows can negatively affect both background model accuracy and object detection, they are removed based on chromatic properties in the HSV color space [7]. The blobs classified as shadow are not tracked as the validated objects. They are not considered as background either and also they are not used for the background update. One of the problems in selective background updating is the possible creation of ghosts. The approach used to detect and remove ghosts is similar to that used for background initialization, but at a regional rather than a pixel level. All the validated objects are used to build an image called At (i, j) that accounts for the number of times a pixel is detected as unchanged by single difference. A valid object MVOth is classified as a ghost if
(i,j)∈ MVOth
Nth
At (i, j) ⬎ Tghost
(16.4)
where Tghost is the threshold of the percentage of points of the MVOth unchanged for a sufficient time, and Nth is the area in pixels of MVOth .
16.3 SINGLE-CAMERA PERSON TRACKING After being identified, moving objects should be tracked over time. To this end an appearance-based tracking algorithm is used since it is particularly suitable to video surveillance applications. Appearance-based tracking is a well-established paradigm with which to predict, match, and ensure temporal coherence of detected deformable objects in video streams. These techniques are very often adopted as a valid alternative to approaches based on 3D reconstruction and model matching because they compute the visual appearance of the objects in the image plane only, without the need of defining camera, world, and object models. Especially in human motion analysis, the exploitation of appearance models or
394
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection
templates is straightforward. Templates enable the knowledge of not only the location and speed of visible subjects but also their visual aspect, their silhouette, or their body shape at each frame. Appearance-driven tracking is often employed in video surveillance, particularly of humans, and in action analysis and behavior monitoring to obtain precise information about visible and nonvisible body aspects at each instant [8–10]. This section provides a formal definition of our approach, called Appearance-Driven Tracking with Occlusion Classification (Ad Hoc). The tracking problem is formulated in a probabilistic Bayesian model, taking into account both motion and appearance status. The probabilistic estimation is redefined at each frame and optimized as a MAP (maximum a posteriori) problem so that a single solution for each frame is provided in a deterministic way. We do not track each object separately; rather, the whole object set is considered in a two-step process. A first top-down step provides an estimate of the best positions of all objects, predicting their positions and optimizing them in a MAP algorithm according to pixel appearance and a specifically defined probability of nonocclusion. A second bottom-up step is discriminative, since each observation point is associated with the most probable object. Thus, the appearance model of each object point is selectively updated at the pixel level in the visible part, ensuring high reactivity in shape changes. This section also describes a formal model of nonvisible regions that are nonnegligible parts of the appearance model unobservable in the current frame, where the pixel-toobject association is not feasible. Nonvisible regions are classified depending on the possible cause: dynamic occlusions, scene occlusions, and apparent occlusions (i.e., only shape variations).
16.3.1 The Tracking Algorithm Even if Ad Hoc tracking works at a pixel level, the central element in the system is the object O, which is described by its state vector O ⫽ {{o1 , . . . , oN }, c, e, ⌸}, where {oi } is the set of N points that constitute the object O; c and e are respectively the object’s position with respect to the image coordinate system and the velocity of the centroid; and ⌸ is the probability of O being the foremost object—that is, the probability of nonocclusion. Each point oi of the object is characterized by its position (x, y) with respect to the object centroid, by its color (R, G, B), and by its likelihood ␣ of belonging to the object. The scene at each frame t is described by a set of objects Ot ⫽ {O1 , . . . , OM }, which we assume are generating the foreground image F t ⫽ { f1 , . . . , fL }—that is, the points of MOVt extracted by any segmentation technique. Each point fi of the foreground is characterized by its position (x, y) with respect to the image coordinate system and by its color (R, G, B). The tracking aim is to estimate the set of objects Ot⫹1 observed in the scene at frame t ⫹ 1, based on the foregrounds so far extracted. In a probabilistic framework, this is obtained by maximizing the probability P(Ot⫹1 |F 0:t⫹1 ), where the . notation F 0:t⫹1 ⫽ F 0 , . . . , F t⫹1 . To perform this MAP estimation, we assume a first-order Markovian model, meaning that P(Ot⫹1 |F 0:t⫹1 ) ⫽ P(Ot⫹1 |F t⫹1 , Ot ). Moreover, by using Bayes’ theorem, it is possible to write P(Ot⫹1 |F t⫹1 , Ot ) ⬀ P(F t⫹1 |Ot⫹1 )P(Ot⫹1 |Ot )P(Ot )
(16.5)
16.3 Single-Camera Person Tracking
395
Optimizing equation 16.5 in an analytic way is not possible, as this requires testing all possible object sets by changing their positions, appearances, and probabilities of nonocclusion.As this is definitely not feasible, we break the optimization process into two steps: locally optimizing the position and then updating appearance and the probability of nonocclusion.
Position Optimization The first task of the algorithm is the optimization of the centroid position for all objects. In equation 16.5 the term P(Ot ) may be set to 1, since we keep only the best solution from the previous frame. The term P(Ot⫹1 |Ot ), the motion model, is provided by a circular search area of radius r around the estimated position cˆ of every object. P(Ot⫹1 |Ot ) ⫽ r1 2 inside the search area and equals O outside. To measure the likelihood of a foreground being generated by an object, we define a relation among the corresponding points of F and O with a function gO : F → O, and its domain F˜ O , which
is the set of foreground points matching the object’s points. We may then define F˜ ⫽ O∈O F˜ O —that is, the set of foreground points that match at least one ˜ the co-domain of the function gO , which includes the object. In the same way we call O points of O that have a correspondence in F˜ . (See Figure 16.2(a).) Since the objects can be overlapped, a foreground point f can be in correspondence with more than one object O, and thus we can define the set O( f ) as O( f ) ⫽ O ∈ O : f ∈ F˜ O .The term P(F t⫹1 |Ot⫹1 ) is given by the likelihood of observing the foreground image given the objects positioning, which can be written as P(F
t⫹1
|O
t⫹1
)⫽
f ∈F˜
⎡ ⎣
⎤
P f |gO ( f ) · ⌸O ⎦
(16.6)
˜ f) O∈O(
obtained by adding, for each foreground pixel f , the probability of being generated by the corresponding point o ⫽ gO ( f ) of every matching object O ∈ O( f ), multiplied by its nonocclusion probability, ⌸O .
F
goi
~ F 2F
ONVj
Oi
go
Oj (a)
OVj
OVi
j
~ F 5 OVi U OVj U ONVj ( b)
FIGURE 16.2 (a) Domain and co-domain of the function gO , which transforms the coordinates of a foreground pixel x ∈ F into the corresponding object coordinates. (b) Visible and nonvisible parts of an object. F˜ is the foreground part not covered by an object.
396
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection The conditional probability of a foreground pixel f , given an object point o, is modeled by a Gaussian distribution, centered on the RGB value of the object point: P( f |o) ⫽
1 ¯ ¯ T ⌺⫺1 ( f¯ ⫺o) ¯ e⫺1/2( f ⫺o) · ␣(o) (2)3/2 |⌺|1/2
(16.7)
¯ and ␣(·) give the RGB color vector and the ␣ component of the point, respecwhere (·) tively, and ⌺ ⫽ 2 I3 is the covariance matrix in which the three color channels are assumed to be uncorrelated and with fixed variance 2 . The choice of sigma is related to the amount of noise in the camera. For our experiments we chose ⫽ 20. From a computational point of view, the estimation of the best objects’ alignment requires at most (r 2 )M evaluations. It is reasonable to assume that the contribution of the foremost objects in equation 16.6 is predominant, so we locally optimize the function by considering only the foremost object for every point.The algorithm proceeds as follows: 1. A list of objects sorted by probability of nonocclusion (assuming that this is inversely proportional to the depth ordering) is created. 2. The first object O is extracted from the list and its position c is estimated by maximizing the probability: P(F˜ |O)⬀
P( f |gO ( f ))
(16.8)
f ∈F˜ O
3. After finding the best c, the matched foreground points are removed and the foreground set F considered in the next step is updated as F ⫽ F \F˜ O . 4. The object O is removed from the list as well, and the process continues from step 2 until the object list is empty. The algorithm may fail for objects that are nearly totally occluded, since a few pixels can force a strong change in the object center positioning. For this reason we introduce a confidence measure for the center estimation, to account for such a situation:
˜ o∈O
Conf (O) ⫽
␣(o) ␣(o)
(16.9)
o∈O
If during the tracking the confidence drops under a threshold (set to 0.5 in our experiments), the optimized position is not considered reliable and thus only the prediction is used.
Pixel-to-Track Assignment This is the second phase of the optimization of equation 18.5. In this top-down approach, once all tracks have been aligned, we adapt the remaining parts of each object state. Even in this case we adopt a suboptimal optimization. The first assumption is that each foreground pixel belongs to only one object. Thus, we perform a bottom-up discriminative pixel-to-object assignment, finding the maximum of the following probability for each point f ∈ F˜ : P(O → f ) ⬀ P( f |gO ( f )) · P(gO ( f )) ⫽ P( f |gO ( f )) · ␣(gO ( f ))
(16.10)
16.3 Single-Camera Person Tracking
397
where P( f |gO ( f )) is the same as in equation 16.7, and we use the symbol → to indicate that the foreground pixel f is generated by the object O. Directly from the above assignment rule, we can divide the set of object points into visible OV and nonvisible ONV ⫽ O ⫺ OV points:
OV ⫽ o ∈ O | ∃f
⫽ gO⫺1 (o) ∧ arg max Oi ∈O
P(Oi → f ) ⫽ O
(16.11)
In other words, the subset OV is composed of all points of O that correspond to a foreground pixel and that have won the pixel assignment. (See Figure 16.2(b).) The alpha value of each object point is then updated using an exponential formulation: ␣(ot⫹1 ) ⫽ · ␣(ot ) ⫹ (1 ⫺ ) · ␦(o, OV )
(16.12)
where ␦(·, ·) is the membership function. Equation 16.12 includes two terms: one proportional to a parameter ∈ [0, 1] that corresponds to P(Ot⫹1 |Ot ) and reduces the alpha value at each time step; and one proportional to 1 ⫺ that increases the ␣ value for the matching visible points P(F |O). Similarly, we update the RGB color of each object point: o¯ t⫹1 ⫽ · o¯ t ⫹ (1 ⫺ ) · f · ␦(o, OV )
(16.13)
The last step in updating the object state concerns the nonocclusion probability ⌸. We first define the probability Pot⫹1 that on object Oi occludes another object Oj : ⎧ 0 ⎪ ⎨ t Po(Oi , Oj )t⫹1 ⫽ (1 ⫺ ij )Poij aji ⎪ ⎩ (1 ⫺ ij )Potij ⫹ ij e aij
where
ij ⬍ occl aij ⫽ 0
(16.14)
aij ⫽ 0
⫺1 ⫽ gOi (OV ,i ) ∩ gO⫺1 aij ⫽ OV ,i ∩ ONV ,j (O ) NV ,j j g ij ⫽
aij ⫹aji
Oi ∩ Oj g
(16.15)
aij is the number of points shared between Oi and Oj and assigned to Oi ; ik is the percentage of the area shared between Oi and Oj assigned to Oi or Oj , which is less than or equal to 1 since some points can be shared among more than two objects. The value  is used as an update coefficient, allowing a faster update when the number of overlapping pixels is high. Conversely, when the number of those pixels is too low (under a threshold occl ), we reset the probability value to zero. The probability of nonocclusion for each object can be computed as ⌸(Oi )t⫹1 ⫽ 1 ⫺ max Po(Oi , Oj )t⫹1 Oj ∈O
(16.16)
With the probabilistic framework previously described, we can “assign and track” all foreground pixels belonging to at least one object. However, the foreground image contains points f (∈ F ⫺ F˜ ), with no corresponding object because of shape changes or the entrance into the scene of new objects. We assume that a blob of unmatched
398
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection foreground points is due to a shape change if it is connected (or close to) an object, and in such a situation the considered points are added to the nearest object; otherwise, a new object is created. In both cases the ␣ value of each new point is initialized to a predefined constant value (e.g., 0.4). Obviously, we cannot distinguish in this manner a new object entering the scene occluded by or connected to a visible object. In such a situation the entire group of connected objects will be tracked as a single entity.
16.3.2 Occlusion Detection and Classification As a result of occlusions or shape changes, some points of an object may have no correspondence with the foreground F . Unfortunately, shape changes and occlusions require two different and conflicting solutions. To keep a memory of an object’s shape even during an occlusion, the object model must be slowly updated; at the same time fast updating can better handle shape changes. To this end, the adaptive update function is enriched with knowledge of occlusion regions. In particular, if a point is detected as occluded, we freeze its color and ␣ value instead of using equations 16.12 and 16.13. The introduction of higher-level reasoning is necessary to discriminate between occlusions and shape changes.The set of nonvisible points ONV is the candidate set for occluded regions. After a labeling step over ONV , a set of nonvisible regions (of connected points) is created; points or too small regions are pruned, and a final set of nonvisible sparse regions NVRj is created. Occlusions can be classified as follows: ■
■
■
Dynamic occlusions (RDO ): due to overlap by another object closer to the camera; the pixels in this region were assigned to the other object. Scene occlusions (RSO ): due to (still) objects included in the scene and therefore in the background model, and thus not extracted by the foreground segmentation algorithm but actually positioned closer to the camera. Apparent occlusions (RAO ): not visible because of shape changes, silhouette motion, shadows, or self-occlusions.
The presence of an occlusion can be inferred by exploiting the confidence value of equation 16.9, decreasing it below an alerting value since in case of occlusion the object’s shape changes considerably. The occluded points x of the object model (x ∈ RDO orx ∈ RSO ) should not be updated because we do not want to lose its memory. Instead, if the confidence value decreases because of a sudden shape change (apparent occlusion), not updating the object state creates an error. The solution is a selective update according to the region classification. The detection of the first type of occlusion is straightforward, because we always know the position of the objects and can easily detect when two or more of them overlap. RDO regions comprise the points shared between object Ok and object Oi but not assigned to Ok . To distinguish between RSO and RAO , the position and the shape of the objects in the background can be helpful, but are not provided with our segmentation algorithm. To discriminate between RSO and RAO , we exploit the set of background edges. This subset of points in the background model contains all points of high color variation, among which the edges of the objects are usually detected. In the case of RSO we expect to find edge points in correspondence with the boundary between this RSO and the visible part
16.3 Single-Camera Person Tracking
399
t of the track. From the whole set of nonvisible points ONV defined in equation 16.11, we only keep those with a nonnegligible value of the probability mask in order to eliminate noise due to motion. The remaining set of points is segmented into connected regions. Then, for each region, the area weighted with the probability values is calculated and too small regions are pruned.The remaining nonvisible regions (NVRj ) belonging to an object O must be discriminated as background object occlusions and apparent occlusions. We call B(·) the set of border points of a region. At the same time, the edges of the background model are computed by a simple edge detector, reinforced by probability density estimation. We could exploit a more robust segmentation technique (e.g., mean shift [11]), to extract the border of objects in the background image; in our experiments, however, edge detection has given good results and requires much less computation. Given the set of edge points E ⫽ {ei }i⫽1...n in the background image, a probability density estimate for the background edges can be computed using a kernel (x) and a window h:
pn ( x| E) ⫽
n x ⫺ ei 1 1 n i⫽1 h2 h
(16.17)
¯ can be assumed to be uniform over the same The probability density for nonedges p(x|E) region. We can then naively compute the a posteriori probability of a pixel x being an edge point: P ( E| x) ⫽
P ( x| E) P ( x| E) ⫹ P x| E¯
(16.18)
where we assume equal a priori probability. We can now compute the average a posteriori probability of the set of points o ∈ BONV to be generated by the background edges. In particular, we are interested in the subset ˜ NV ) ⫽ B(ONV ) ∩ B(OV ), which is the part of the border of ONV connected to the B(O visible part OV . The probability estimate allows a noisy match between BONV and the edge points. If this average probability is high enough, meaning that the contour of the occluded region has a good match with the edges, we can infer that another object is hiding a part of the current object, and thus label the region RSO ; otherwise, RAO . In other words, if the visible and the nonvisible parts of an object are separated by an edge, then plausibly we are facing an occlusion between a still object in the scene and an observed moving object. Otherwise, the shape change is more reasonable because there are no more visible points. Figure 16.3 shows a person occluded in large part by a stack of boxes that are within the background image. Two parts of his body are not segmented and two candidate occlusion regions are generated (Figure 16.3(e)). One of them is a shadow included in the object model but now gone. In Figure 16.3(g) the borders of the NVRs are shown, with pixels that match well with the edges highlighted. In an actual occlusion due to a background object, the percentage of points that match the set of bounding points is high; thus the region is classified as RSO . Conversely, for the apparent occlusion (the shadow), we have no matching pixels and consequently this region is classified as RAO .
400
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection
(a) Current frame
(b) Background
(c) Foreground
(d) Tracked object state
R1 R1
Overlap with background edges
R2 R2 (e) Non visible regions R1 and R2
(f ) Background edges
(g) R1 and R2 contours
FIGURE 16.3 Example of regions classification; R1 is classified as an RSO region since there are points that match well with the edge pixels of the background. Instead, R2 is classified as an RAO region.
16.4 BAYESIAN-COMPETITIVE CONSISTENT LABELING In large outdoor environments, multi-camera systems are required. Distributed video surveillance systems exploit multiple video streams to enhance observation. Hence, the problem of tracking is extended from a single camera to multiple cameras; thus a subject’s shape and status must be consistent not only in a single view but also in space (i.e., observed by multiple views). This problem is known as consistent labeling, since identification labels must be consistent in time and space. If the cameras’ FOVs overlap, consistent labeling can exploit geometry-based computer vision. This can be done with precise system calibration, using 3D reconstruction to solve any ambiguity. However, this is not often feasible, particularly if the cameras are preinstalled and intrinsic and extrinsic parameters are unavailable. Thus, partial calibration or self-calibration can be adopted to extract only some of the geometrical constraints (e.g., the ground plane homography).
16.4 Bayesian-Competitive Consistent Labeling The consistent labeling problem, on camera networks with partially overlapping FOVs, is solved with a geometric approach that exploits the FOV relations and constraints to impose identity consistency. We call our approach HECOL (homography and epipolar-based consistent labeling). Specifically, when cameras partially overlap, the shared portion of the scene is analyzed and identities are matched geometrically. After an initial unsupervised and automatic training phase the overlapping regions among FOVs, ground plane homographies, and the epipole location for pairwise overlapping cameras are computed. The consistent labeling problem is then solved online whenever a new object appears in the FOV of a given camera C 1 (the superscript indicates the camera ID). The multi-camera system must check whether corresponds to a completely new object or to one already present in the FOV of other cameras. Moreover, it should deal with groups and identify the objects composing them. The approach described here has the advantage of coping with labeling errors and partial occlusions whenever the involved objects are present in at least one overlapped view. Using the vertical objects’ inertial axis as a discriminant feature can also help to disambiguate the group detected as a single blob, exploiting the information in overlapped views. When many objects are present in the scene and many cameras are involved, an exhaustive search may be computationally expensive. Thus, the subset of K potential matching objects satisfying the camera topology constraints is efficiently extracted by means of a graph model (called the camera transition graph). These K objects are combined to form the hypothesis space ⌫ that contains all 2K ⫺ 1 possible matching hypotheses, including both single objects and groups. A MAP estimator is used to find the most probable hypothesis ␥i ∈ ⌫: i ⫽ arg max p(␥k | ) ⫽ arg max p( | ␥k ) p(␥k ) k
k
(16.19)
To evaluate the maximum posterior, the prior of each hypothesis ␥k and the likelihood of the new object given the hypothesis must be computed.The prior of a given hypothesis ␥k is not computed by means of a specific pdf, but is heuristically evaluated by assigning a value proportional to a score k . The score k accounts for the distance between objects calculated after homographic warping. A hypothesis consisting of a single object then gains higher prior if the warped lower support point (i.e., the point of the object that contacts the ground plane) lp is far enough from the other objects’ support points. On the other hand, a hypothesis consisting of two or more objects (i.e., a possible group) gains higher prior if the objects that compose it are close to each other after the warping and, at the same time, the whole group is far from other objects. Let us suppose that a new object appears on camera C 1 .The lp of each of K objects in 2 C is warped to the image plane of C 1 . Likelihood is then computed by testing the fitness of each hypotheses against current evidence. The main goal is to distinguish between single hypotheses, group hypotheses, and possible segmentation errors exploiting only geometrical properties in order to avoid uncertainties due to color variation, and adopting the vertical axis of the object as an invariant feature. The axis of the object can be warped correctly only with the homography matrix and knowledge of the epipolar constraints among cameras. To obtain the correct axis inclination, the vertical vanishing point (computed by a robust technique as described
401
402
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection H C1
e2
a1
up Source object lp H (a)
(b)
(c)
2 Epipolar line C a2
vp2 (d)
FIGURE 16.4 Example of exploiting the vanishing point and epipolar geometry to warp the axis of the object to the image plane of camera C 1 .
in Brauer-Burchardt and Voss [12]) is then used as shown in Figure 16.4. The lower support point lp of is projected on camera C 2 using the homography matrix. The corresponding point on the image plane of camera C 2 is denoted as a1 ⫽ Hlp, where H is the homography matrix from C 1 to C 2 . The warped axis lies on a straight line passing through vp2 and a1 (Figure 16.4(d)). The ending point of the warped axis is computed using the upper support point up—that is, the middle point of the upper side of the object’s bounding box. Since this point does not lie on the ground plane, its projection onto the image of camera C 2 does not correspond to the actual upper support point; however, the projected point lies on the epipolar line. Consequently, the axis’s ending point a2 is obtained as the intersection between the epipolar line e2 , Hup and line
vp2 , Hlp passing through the axis. Based on geometrical constraints, the warped axis a1 , a2 of in the image plane of C 2 is unequivocally identified but its computation is not error free. To improve its robustness to computation errors, we also account for the dual process that can be performed for each of the K potential matching objects: The axis of the object in C 2 is warped on the segment a1 , a2 on camera C 1. The measure of axis correspondence is not merely the distance between axes a1 , a2 and lp, up; rather, it is defined as the number of matching pixels between the warped axis and the foreground blob of the target object—which makes it easier to define a normalized value for quantifying the matching. Accordingly, the fitness measure a →b from object a in generic camera C i to object b in generic camera C j is defined as the number of pixels resulting from the intersection between the warped axis and the foreground blob of b , normalized by the length (in pixels) of the warped axis itself. The reversed fitness measure b →a is computed similarly by reversing the warping order. In the ideal case of correspondence between a and b , a →b ⫽ b →a ⫽ 1. However, in the case of errors in the lp and up computations, the warped axis can fall partially outside the foreground blob, lowering the fitness measure. In the likelihood definition, we refer to forward contribution when fitness is calculated from the image plane in which the new object appears (camera C 1 ) to the image plane of the considered hypothesis (camera C 2 ). Thus, generalizing for hypotheses containing more than one object (group hypotheses), forward axis correspondence can be
16.4 Bayesian-Competitive Consistent Labeling
403
evaluated by computing the fitness of the new object with all objects composing the given hypothesis ␥k for camera C 2 :
fpforward (|␥k ) ⫽
m ∈␥k
→m
K · Sf
(16.20)
Sf measures the maximum range of variability of the forward fitness measure of the objects inside the given hypothesis: Sf ⫽ max (→m ) ⫺ min (→n ) m ∈␥k
n ∈␥k
(16.21)
The use of the normalizing factor K (i.e., the number of potential matching objects on C 2 ) weighs each hypothesis according to the presence or absence of objects in the whole scene. Backward contribution is computed similarly from the hypotheses space to the observed object:
fpbackward (|␥k ) ⫽
m ∈␥k
m →
K · Sb
(16.22)
where Sb is defined as Sb ⫽ max (m → ) ⫺ min (n → ) m ∈␥k
n ∈␥k
(16.23)
Finally, likelihood is defined as the maximum value between forward and backward contribution. The use of the maximum value ensures use of the contribution where the extraction of support points is generally more accurate and suitable for the matching. The effectiveness of the double backward/forward contribution is evident in the full characterization of groups.The forward contribution helps solve situations when a group of objects is already inside the scene while its components appear one at a time in another camera.The backward component is useful when two people appearing in a new camera are detected as a single blob. The group disambiguation can be solved by exploiting the fact that in the other camera the two objects are detected as separate. Backward contribution is also useful for correcting segmentation errors, in which a person has been erroneously extracted by the object detection system as two separate objects, but a full view of the person exists from the past in an overlapped camera. When more than two cameras overlap simultaneously it is possible to take into account more information than in the pairwise case. To account for this situation our approach is suitably modified by an additional step that selects the best assignment from all possible hypotheses coming from each camera. In detail, when a detection event occurs on C 1 , for each camera C j overlapped with C 1 the best local assignment hypothesis is chosen using the MAP framework. A second MAP estimator detects the most probable among these hypotheses. In complex scenes more hypotheses can have similar a posteriori probability but a particular view may exist where the hypothesis assignment is easier. The purpose of the second MAP stage is to choose this view, which can be easily done using the previously computed posteriors and Bayes’ rule: p C j | ⬀ p | C j ⫽ max p(␥k | ) ␥k ∈⌫
(16.24)
404
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection
The camera posterior is evaluated for each camera C j that overlaps with camera C 1 , assuming that all overlapped camera views are equally probable. Eventually, the label is assigned to the new object according to the winning hypothesis on the winning camera. If the chosen hypothesis identifies a group, the labels of all objects composing the group are assigned as identifiers.
16.5 TRAJECTORY SHAPE ANALYSIS FOR ABNORMAL PATH DETECTION The previous steps of our system had the main objective of keeping a person tracked in a wide area, by segmenting and tracking him in each single static camera and then exploiting consistent labeling to keep him tracked among different camera views. This global label assignment is the fundamental step for subsequent, higher-level tasks. For instance, the information provided by the multi-camera tracking system can be further analyzed to detect anomalous person paths in the scene. This task is accomplished by learning “normality” modeling path recurrence statistically as a sequence of angles modeled with a mixture of von Mises probability density functions. In fact, after the label disambiguation process, tracking output can be exploited for further high-level reasoning on behavior. In particular, paths can be analyzed to detect anomalous events in the system.They are extracted directly from the multi-camera tracking system and homographically projected onto the ground plane. Each path is modeled as a sequence of directions computed as the angle between two consecutive points. From this approximation of the direction, a running average filter of fixed size is applied to smooth the segmentation errors and discretization effects on the direction computation. Using a constant frame rate, we model the single trajectory Tj as a sequence of nj directions , defined in [0, 2): Tj ⫽ 1,j , 2,j , . . . , nj ,j
(16.25)
Circular or directional statistics [13] is a useful framework for analysis. We adopt the von Mises distribution, a special case of the von Mises-Fisher distribution [14, 15], also known as the circular normal or the circular Gaussian. It is particularly useful for statistical inference of angular data. When the variable is univariate, the probability density function (pdf) results are V (|0 , m) ⫽
1 em cos(⫺0 ) 2I0 (m)
(16.26)
where I0 is the modified zero-order Bessel function of the first kind, defined as I0 (m) ⫽
1 2
2 em cos d
(16.27)
0
representing the normalization factor. The distribution is periodic so that p( ⫹ M2) ⫽ p() for all and any integer M.
16.5 Trajectory Shape Analysis for Abnormal Path Detection
405
Von Mises distribution is thus an ideal pdf to describe a trajectory Tj . However, in the general case a trajectory is composed of more than a single main direction; having several main directions, it should be represented by a multi-modal pdf. Thus we propose the use of a mixture of von Mises (MovM) distributions: p() ⫽
K
k V |0,k , mk
(16.28)
k⫽1
As is well known, the EM algorithm is a powerful tool for finding maximum likelihood estimates of the mixture parameters, given that the mixture model depends on unobserved latent variables (defining the “responsibilities”of a given sample with respect to a given component of the mixture). The EM algorithm allows the computation of parameters for the K MovM components. A full derivation of this process can be found in Prati et al. [4]. If the trajectory Tj contains less than K main directions, some components have similar parameters. Each direction i,j is encoded with a symbol Si,j using a MAP approach, that, assuming uniform priors, can be written as Si,j ⫽ arg max p 0,r , mr |i,j ⫽ arg max p i,j |0,r , mr r⫽1,...,K
(16.29)
r⫽1,...,K
where 0,r and mr are the parameters of the rth components of the MovM. Each trajectory Tj in the training set is encoded with a sequence of symbols T j ⫽ {S1,j , S2,j , . . . , Snj ,j }. To cluster or classify similar trajectories, a similarity measure ⍀(T i , T j ) is needed. Acquisition noise, uncertainty, and spatial/temporal shifts make exact matching between trajectories unsuitable for computing similarity. From bioinformatics we can borrow a method for comparing sequences in order to find the best inexact matching between them, accounting for gaps. Among the many techniques, we used global alignment [16], which is preferable to local alignment because it preserves both global and local shape characteristics. Global alignment of two sequences S and T is obtained by first inserting spaces either into or at the ends of S and T so that the length of the sequences is the same, and then placing the two resulting sequences one above the other so that every symbol or space in one of the sequences is matched to a unique symbol in the other. Unfortunately, this algorithm is onerous in terms of computational complexity if the sequences are long. For this reason, dynamic programming is used to reduce computational time to O(ni · nj ), where ni and nj are the lengths of the two sequences. This is achieved using a tabular representation of nj rows and ni columns. Each element (a, b) of the table contains the alignment score of the symbol Sa,i of sequence T i with the symbol Sb,j of sequence T j . This inexact matching is very useful for symbolic string recognition but it has not been used for trajectory data since it can be affected by measurement noise. Our proposal overcomes this problem because each symbol corresponds to a von Mises distribution. Thus, the score between symbols can be measured statistically as a function of the distance between the corresponding distributions. If the two distributions are sufficiently similar, the score should be high and positive; if they differ significantly, the score should be negative (a penalty). Assigning zero to the gap penalty, the best alignment can be found by searching for the alignment that maximizes the global score.
406
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection Specifically, we measured the distance between distributions p and q using the Bhattacharyya coefficient:
cB p, q ⫽
⫹⬁
p () q ()d
(16.30)
⫺⬁
It has been demonstrated [17] that if p and q are two von Mises distributions, cB ( p, q) can be computed in closed form as follows: cB Sa,i , Sb,j ⫽ cB V |0,a , ma , V |0,b , mb ⎛" ⎛ ⎞⎞ 2 ⫹ m2 ⫹ 2m m cos m ⫺ a b 0,a 0,b a b 1 ⎟⎟ ⎜ ⎜ I0 ⎝ ⫽⎝ ⎠⎠ (16.31) I0 (ma ) I0 (mb ) 2
where it holds that 0 ⭐ cB (Sa,i , Sb,j ) ⭐ 1. If we assume that two distributions are sufficiently similar if the coefficient is above 0.5, and that the score for a perfect match is +2, whereas the score (penalty) for the perfect mismatch is ⫺1 (these are the typical values used in DNA sequence alignments), then we can write the general score as follows:
Sa,i , Sb,j
⎧ if cB ⭓ 0.5 ⎨2 · (cB ) ⫽ 2 · (cB ⫺ 0.5) if cB ⬍ 0.5 ⎩ 0 if Sa,i or Sb,j are gaps
(16.32)
Once the score of the best global alignment is computed (as the sum of the scores in the best alignment path), it can be converted to a proper similarity measure ⍀(T i , T j ). This measure is used to cluster the trajectories in the training set by using the k-medoids algorithm [18], a suitable modification of the well-known k-means algorithm which has the benefit of computing, as a prototype of the cluster, the element that minimizes the sum of intra-class distances. In other words, at each iteration the prototype of each cluster is the member at the minimum average distance from all other members. However, one of the limitations of k-medoids (as well as k-means) clustering is the choice of k. For this reason, we propose an iterative k-medoids algorithm. Let us set i ⫽ 0 and k(0) ⫽ Nt , where Nt is the cardinality of the training set. At initialization, each trajectory is chosen as the prototype (medoid) of the corresponding cluster. Then the following steps are performed: 1. Run the k-medoids algorithm with k(i) clusters. 2. If there are two medoids with a similarity greater than a threshold T h, merge them and set k(i ⫹ 1) ⫽ k(i) ⫺ 1. Increment i and go back to step 1. If all medoids have a two-by-two similarity lower than T h, stop the algorithm. In other words, the algorithm iteratively merges similar clusters until convergence. In this way, the “optimal” number of medoids & k is obtained.
16.5 Trajectory Shape Analysis for Abnormal Path Detection
407
16.5.1 Trajectory Shape Classification The described approach obtains a robust unsupervised classification of trajectories, grouped in a variable number of similarity clusters. Clusters with fewer trajectories represent the case of abnormal or (better) “infrequent” trajectory shapes. New trajectories can be classified as normal or abnormal depending on the cardinality of the most similar cluster. In this case, we cannot employ a classical learn-then-predict paradigm, in which the “knowledge” learned in the training phase is never updated. However, at the beginning an infrequent class of trajectories can be considered abnormal; if that class is detected often, it should be considered normal, since in our scenario the model of normality is neither a priori known nor fixed. For this reason, we employ a learnand-predict paradigm in which knowledge (i.e., the trajectory clusters) is continuously updated. Therefore, whenever a new trajectory Tnew is collected, its statistical model is computed and compared to the cluster medoids. Based on this comparison, it can be classified as either belonging to an existing cluster or representing a new one (a class of trajectories never seen before). To learn the trajectory model we can use the same EM algorithm described in the previous section. However, this is a very time-consuming task unsuitable for real-time trajectory classification, even though it is acceptable for offline learning. For this reason, we have derived an online EM algorithm for MovMs similar to what was proposed for a mixture of Gaussians [19]. Online EM updating is based on the concept of sufficient statistics. A statistic T () is sufficient for the underlying parameter if the conditional probability distribution of the data , given the statistic T (), is independent of the parameter . Thanks to the FisherNeyman factorization theorem [20], the likelihood function L () of can be factorized in two components, one independent by the parameters and the other dependent by them only through the sufficient statistics T (): L () ⫽ h()g (T ()). It was shown by Bishop [15] that in the case of distributions of the exponential family (such as Gaussian and von Mises) the factorization theorem can be written as ' ( p (|) ⫽ h () g () exp T T ()
(16.33)
Considering a von Mises distribution and a set of i.i.d. angles (composing the trajectory Tj ), we can decompose the expression of the distribution p(|0 , m) as follows: nj 1 1 exp m cos (i ⫺ 0 ) exp {m cos (i ⫺ 0 )} ⫽ (2I0 (m))nj 2I0 (m) i⫽1 i⫽1 nj
nj nj 1 ⫽ exp m cos 0 cos i ⫹ m sin 0 sin i (2I0 (m))nj i⫽1 i⫽1
⫽
⎧ ⎪ ) * ⎪ ⎨ m cos T
1 0 exp ⎪ (2I0 (m))nj m sin 0 ⎪ ⎩
⎤⎫ ⎡ nj ⎪ ⎪ ⎢i⫽1 cos i ⎥⎬ ⎥ ⎢ · ⎣ nj ⎦⎪ ⎭ sin i ⎪ i⫽1
(16.34)
408
CHAPTER 16 Statistical Pattern Recognition for Multi-Camera Detection
Thus, the sufficient statistics for a single von Mises distribution are ⎡ nj ⎤ ⎢ cos i ⎥ ⎢i⫽1 ⎥ T () ⫽ ⎢ n ⎥ ⎣ j ⎦ sin i i⫽1
In the case of a mixture of distributions belonging to the exponential family, the online updating of the mixture parameters can be obtained simply by updating the sufficient statistics (s.s.) of the mixture, computed as TM () ⫽ K k⫽1 ␥k Tk (), where Tk () are the s.s. for the kth single distribution. The updating process (having observed up to the sample (i ⫺ 1)), can be obtained as Tki () ⫽ ␣(i)␥k Tk (i ) ⫹ (1 ⫺ ␣(i)) Tki⫺1 ()
where Tk (i ) ⫽
(16.35)
. cos i sin i
A comprehensive discussion of the value of the updating parameter α(i) can be found in Sato [21]. Once the mixture parameters have been computed, the same MAP approach described previously gives the symbol sequence T^new. Given the set M̃ = {M_1, . . . , M_k̃} of current medoids, T^new is compared with each medoid, using the similarity measure Ω, to find the most similar:

j̃ = arg max_{j=1,...,k̃} Ω(T^{M_j}, T^new)    (16.36)
Defining the maximum similarity as Ω_max = Ω(T^{M_j̃}, T^new), if this value is below a given threshold Th_sim a new cluster should be created with T_new as its medoid, and the priors (proportional to the number of trajectories assigned to each cluster) updated:

M_{k̃+1} = T_new;   p(M_{k̃+1}) = 1/(N + 1)
∀i = 1, . . . , k̃  ⇒  p_new(M_i) = p_old(M_i) · N/(N + 1)
k̃ = k̃ + 1;   N = N + 1
where N is the current number of observed trajectories. Conversely, if the new trajectory is similar enough to one of the current medoids, it is assigned to the corresponding cluster j̃:

T_new ∈ cluster j̃;   p_new(M_j̃) = (p_old(M_j̃) · N + 1)/(N + 1)
∀i = 1, . . . , k̃, i ≠ j̃  ⇒  p_new(M_i) = p_old(M_i) · N/(N + 1)
N = N + 1
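The cluster bookkeeping just described can be summarized in a few lines of Python. The sketch below is illustrative only: omega stands for the similarity measure Ω, th_sim for Th_sim, and the medoid-replacement and cluster-pruning steps discussed next are omitted.

def classify_trajectory(T_new, medoids, priors, N, omega, th_sim):
    # Assign T_new to the most similar medoid's cluster, or create a new
    # cluster, updating the priors exactly as in the equations above.
    sims = [omega(m, T_new) for m in medoids]
    j = max(range(len(medoids)), key=lambda i: sims[i]) if medoids else -1
    if not medoids or sims[j] < th_sim:
        # new cluster: T_new becomes its medoid with prior 1/(N + 1)
        priors = [p * N / (N + 1) for p in priors] + [1.0 / (N + 1)]
        medoids = medoids + [T_new]
        j = len(medoids) - 1
    else:
        # existing cluster j gains one trajectory
        priors = [(p * N + 1) / (N + 1) if i == j else p * N / (N + 1)
                  for i, p in enumerate(priors)]
    return medoids, priors, N + 1, j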
Moreover, if the average similarity of the new trajectory with respect to the other medoids is smaller than that of the current medoid M_j̃, then T_new is a better medoid than M_j̃, since it increases the separability from the other clusters. Consequently, T_new becomes the new medoid of the cluster. Finally, to prevent old and rare trajectories from affecting the model, we drop clusters with small priors and with no new trajectories assigned for a fixed-length time window.
16.6 EXPERIMENTAL RESULTS
Figure 16.5 shows two examples of our system at work in a public park (a) and on our campus (b). Detailed results and comparisons with state-of-the-art techniques can be found in the corresponding papers for background suppression [1, 6], single-camera tracking [2], and consistent labeling [3, 22]. In this chapter we focus mainly on the experimental results of trajectory shape analysis for classification and abnormality detection. The performance evaluation was conducted on both synthetic and real data. Synthetic data is particularly useful because it can be produced in any amount and the ground truth is directly available, with no need for manual annotation. Our synthetic testing data was produced using a MATLAB generator, which allowed us to graphically create a large number of ground truth trajectories with noise added to both the single angles and their occurrences. Figure 16.6 shows examples of the trajectory classes used in our tests. In the case of real data (classes R1 . . . R5), the trajectory points are extracted from the scene using the HECOL system described in Section 16.4. Table 16.1 summarizes the tests performed. For each test, the classes of trajectories used are listed (with reference to Figure 16.6), with an asterisk marking an abnormal (infrequent) class. For testing purposes we evaluated both overall classification accuracy (the ability to assign a new trajectory to the correct cluster) and normal/abnormal accuracy (the ability to correctly classify a trajectory as normal or abnormal depending on the cardinality of the specific cluster).
FIGURE 16.5 Example multi-camera tracking results.
FIGURE 16.6 Example trajectory classes: synthetic classes 1–15 and real classes R1–R5.
Our system achieves very accurate results both in classification and in normal/abnormal detection; the results are equally good on real data. We compared our approach with an HMM-based classification system using the similarity measure proposed in Porikli and Haga [23] (see the last two rows of Table 16.1). The HMM's lower classification performance is mainly due to overfitting: when little data is available, the HMM training stage fails to correctly estimate all of the parameters. The emission distribution and the optimal number of hidden states are crucial elements that must be chosen accurately when using an HMM. This choice can be made in an unsupervised way, but it requires a large amount of data that is not always available in real scenarios. Conversely, our approach is not greatly affected by the amount of data available because the number of parameters to estimate is significantly lower; thus it can be profitably applied in many different situations.

Acknowledgments
This chapter addressed aspects of modern video surveillance that relate to our research at the University of Modena and Reggio Emilia. This research is sponsored by the FREE SURF (Free Surveillance in a Privacy-Respectful Way) and the NATO-funded BE SAFE (Behavioral Learning in Surveilled Areas with Feature Extraction) projects.
Table 16.1 Summary of the Experimental Results

Test 1 (synthetic, periodicity): 250 training / 150 testing trajectories, 65 average points per trajectory; training set C3, C4*; testing set C3, C4*; classification accuracy 100%; normal/abnormal accuracy 100%; HMM classification accuracy 75%; HMM normal/abnormal accuracy 84%.

Test 2 (synthetic, noise): 400 training / 250 testing trajectories, 95 average points; training set C1, C15; testing set C14*, C15; classification accuracy 100%; normal/abnormal accuracy 100%; HMM classification accuracy 83%; HMM normal/abnormal accuracy 94.1%.

Test 3 (synthetic, mono-modal): 450 training / 400 testing trajectories, 66 average points; training set C1, C2, C3; testing set C1, C2, C3*, C4*; classification accuracy 100%; normal/abnormal accuracy 100%; HMM classification accuracy 94.1%; HMM normal/abnormal accuracy 93.02%.

Test 4 (synthetic, multi-modal): 680 training / 430 testing trajectories, 74 average points; training set C1, C2, C5, C6*; testing set C1, C5, C6*, C7*; classification accuracy 97.67%; normal/abnormal accuracy 100%; HMM classification accuracy 92.67%; HMM normal/abnormal accuracy 100%.

Test 5 (synthetic, sequence): 650 training / 400 testing trajectories, 85 average points; training set C1, C8, C9*, C10; testing set C8, C2*, C9*, C11*, C12*, C13*; classification accuracy 94.59%; normal/abnormal accuracy 97.30%; HMM classification accuracy 89.1%; HMM normal/abnormal accuracy 85.1%.

Test 6 (synthetic, learning normality): 150 training / 200 testing trajectories, 95 average points; training set C1, C7*; testing set C13*, C7; classification accuracy 95%; normal/abnormal accuracy 100%; HMM classification accuracy 94.01%; HMM normal/abnormal accuracy 100%.

Test 7 (synthetic, mixed): 2530 training / 1700 testing trajectories, 105 average points; training set: all training classes; testing set: all testing classes; classification accuracy 99.60%; normal/abnormal accuracy 100%; HMM classification accuracy 86.40%; HMM normal/abnormal accuracy 82.4%.

Test 8 (real, mixed): 520 training / 430 testing trajectories, 66 average points; training set CR1, CR2, CR3*, CR4; testing set CR1, CR2, CR4*, CR5*; classification accuracy 96.77%; normal/abnormal accuracy 96.77%; HMM classification accuracy 75.19%; HMM normal/abnormal accuracy 66%.
FREE SURF, funded by the Italian Ministry for University Research (MIUR), focuses on the study of innovative video surveillance solutions, bringing together systems without physical constraints (i.e., those using PTZ, freely moving cameras, and sensors) and completely respectful of privacy (i.e., free from legal constraints). The FREE SURF research activities (http://imagelab.ing.unimore.it/freesurf) are performed in collaboration with the University of Firenze and the University of Palermo. BE SAFE (http://imagelab.ing.unimore.it/besafe) is a NATO Science for Peace project that focuses on extracting visual features that can be used for understanding behaviors, such as potential terrorist attacks. It is carried out in collaboration with the Hebrew University of Jerusalem.
REFERENCES
[1] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, Detecting moving objects, ghosts and shadows in video streams, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (10) (2003) 1337–1342.
[2] R. Cucchiara, C. Grana, G. Tardini, R. Vezzani, Probabilistic people tracking for occlusion handling, in: Proceedings of the International Conference on Pattern Recognition, 2004.
[3] S. Calderara, A. Prati, R. Cucchiara, HECOL: Homography and epipolar-based consistent labeling for outdoor park surveillance, Computer Vision and Image Understanding 111 (1) (2008) 21–42.
[4] A. Prati, S. Calderara, R. Cucchiara, Using circular statistics for trajectory shape analysis, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2008.
[5] R. Vezzani, R. Cucchiara, ViSOR: Video surveillance on-line repository for annotation retrieval, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2008.
[6] S. Calderara, R. Melli, A. Prati, R. Cucchiara, Reliable background suppression for complex scenes, in: Proceedings of the ACM Workshop on Video Surveillance and Sensor Networks, Algorithm Competition, 2006.
[7] A. Prati, I. Mikic, M. Trivedi, R. Cucchiara, Detecting moving shadows: Algorithms and evaluation, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (7) (2003) 918–923.
[8] I. Haritaoglu, D. Harwood, L. Davis, W4: Real-time surveillance of people and their activities, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 809–830.
[9] A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, R. Bolle, Appearance models for occlusion handling, Image and Vision Computing 24 (11) (2006) 1233–1243.
[10] R. Cucchiara, C. Grana, A. Prati, R. Vezzani, Probabilistic posture classification for human behaviour analysis, IEEE Transactions on Systems, Man, and Cybernetics 35 (1) (2005) 42–54.
[11] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (5) (2003) 564–575.
[12] C. Brauer-Burchardt, K. Voss, Robust vanishing point determination in noisy images, in: Proceedings of the International Conference on Pattern Recognition, 2000.
[13] K. Mardia, P. Jupp, Directional Statistics, Wiley, 2000.
[14] R. Fisher, Dispersion on a sphere, Proceedings of the Royal Society of London, Series A 217 (1953) 295–305.
[15] C. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.
[16] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997.
[17] S. Calderara, R. Cucchiara, A. Prati, Detection of abnormal behaviors using a mixture of von Mises distributions, in: Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2007.
[18] A. Reynolds, G. Richards, V. Rayward-Smith, The Application of K-Medoids and PAM to the Clustering of Rules, Springer-Verlag, 2004.
[19] C. Stauffer, W. Grimson, Learning patterns of activity using real-time tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 747–757.
[20] G. Casella, R. Berger, Statistical Inference, 2nd edition, Duxbury Press, 2002.
[21] M. Sato, Fast learning of on-line EM algorithm, Technical Report TR-H-281, ATR Human Information Processing Research Laboratories, 1999.
[22] S. Calderara, R. Cucchiara, A. Prati, Bayesian-competitive consistent labeling for people surveillance, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2) (2008) 354–360.
[23] F. Porikli, T. Haga, Event detection by eigenvector decomposition using object and frame features, in: Proceedings of the Computer Vision and Pattern Recognition Workshop, 2004.
CHAPTER 17
Object Association Across Multiple Cameras
Yaser Sheikh, Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania
Omar Javed, ObjectVideo, Reston, Virginia
Mubarak Shah, School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, Florida
Abstract
Scene monitoring using a cooperative camera network is fast becoming the dominant paradigm in automated surveillance. In this chapter, we describe a framework to cooperatively generate a coherent understanding of a scene from such a network. We consider cases where the camera fields of view are overlapping or nonoverlapping, and we discuss how the appearance and motion of objects can be estimated jointly across cameras. To handle the change in observed appearance of an object as it moves from one camera to another, we show that all brightness transfer functions from a given camera to another lie in a low-dimensional subspace, which we demonstrate can be used to compute appearance similarity. To associate objects based on their motion, we exploit geometric constraints on the relationship between the motions of objects across cameras to test multiple association hypotheses, without assuming any prior calibration information. Finally, given a scene model, we propose a likelihood function for evaluating a hypothesized association between observations in multiple cameras, using both appearance and motion.
Keywords: cooperative sensing, appearance matching, trajectory matching, surveillance
17.1 INTRODUCTION Because of the limited field of view (FOV) of cameras and the abundance of object occlusions in real scenes, it is often not possible for a single camera to completely observe an area of interest. Instead, multiple cameras can be effectively used to widen the active area under surveillance. With the decreasing cost and increasing quality of commercial
cameras, visual monitoring from a network of cameras is fast becoming the dominant paradigm for automated surveillance. By using different cameras synergistically to observe different parts of an area of interest, a more coherent understanding of what is occurring in the observed scene can be constructed. Several approaches with varying constraints have been proposed, highlighting the wide applicability of cooperative sensing in practice. For instance, the problem of associating objects across multiple stationary cameras with overlapping FOVs has been addressed in a number of papers (e.g., [3, 4, 10, 11, 25, 26, 28, 31, 32]). Extending the problem to association across cameras with nonoverlapping FOVs, geometric and appearance-based approaches have also been proposed (e.g., [6, 16, 18, 19, 24, 39, 40, 44]). Camera motion has also been studied, with correspondence estimated across pan-tilt-zoom cameras [7, 22, 30]. In general, when using sensors in such a decentralized but cooperative fashion, knowledge of inter-camera relationships becomes of paramount importance in understanding what happens in the environment. Without such information, it is difficult to tell, for instance, whether an object viewed in each of two cameras is the same object or a new one. Two cues available to infer this are the object's appearance and motion. For the interested reader, some notable papers describing the use of appearance in association include [20, 39, 40, 51, 52].
The principal challenge in using camera networks cooperatively is "registering," or associating, data observed across multiple views. Since each camera typically has a different center of projection, perspective distortions and reflectance variations complicate the association process. Consider the different views in Figure 17.1(a). Although the subject observed is the same, the appearance in each view is clearly distinct. Measures of similarity, such as intensity difference or correlation, are unlikely to work in the general case. Similarly, although the path in each view in Figure 17.1(b) originates from the same world path, it charts out a different trajectory in each video. In this chapter, we discuss these two challenges in associating objects across views. We advocate a method that explicitly learns the inter-camera transformations with respect to a reference camera and compensates for the distortions caused by different viewpoints. Perspective distortion is modeled by making a simplifying ground plane assumption. We define the cost of a hypothesized association of an object trajectory set based on a geometric error function. To handle the change in the observed colors of an object as it moves from one camera to another, we show that all brightness transfer functions from one camera to another lie in a low-dimensional subspace, and we demonstrate that this subspace can be used to compute appearance similarity. In our proposed approach, the system learns the subspace of inter-camera brightness transfer functions in a training phase, during which object correspondences are assumed to be known. Once the training is complete, correspondences are assigned within a maximum a posteriori (MAP) estimation framework using both location and appearance cues.
In Section 17.2, we review the development of ideas in related research. Then, in Section 17.3, we describe the framework used to infer the associations between different objects seen across different cameras.
In Section 17.4, we describe how to evaluate an association between two observations in two cameras based on appearance. We show that all brightness transfer functions (BTFs) from a given camera to another camera lie in a low-dimensional subspace, and we present a method to learn this subspace from the training data. In Section 17.5, we describe how to evaluate an association between two observations based on their trajectories.
FIGURE 17.1 Transformation of appearance and motion between different views. (a) Appearance of a person seen from two cameras is distinctly different, even though his clothing has not changed between frame captures. (b) Transformation in trajectories.
17.2 RELATED WORK
Since the seminal work of Sittler [43] on data association in 1964, multi-target/multi-sensor tracking has been extensively studied. In the data association community it is typically assumed that the sensors are calibrated and that data is available in a common coordinate system (a good summary is available in Bar-Shalom [2]). Optimal multi-target/multi-sensor association is known to be NP-hard [13], and with n sensors and k objects there are (k!)^n possible configurations, which makes exhaustive evaluation computationally prohibitive. Sequential logic techniques include nearest-neighbor filters, strongest-neighbor filters, one-to-few assignments, and one-to-many assignments. These methodologies are computationally efficient, but since decisions are irreversible at each time step they are prone to error, particularly when the number of objects is large. Deferred logic techniques typically use some form of multiple-hypothesis testing, and many variations have been proposed. It was shown in Poore [35] and
Pattipati et al. [33] that the data association problem can be formulated as a multidimensional assignment problem. The analysis in this area is important, and many ideas are relevant. However, these approaches assume a registered setting with overlapping FOVs that cannot be used directly in the context of this work, where the coordinate systems differ up to a homography. Prior work can be broadly classified based on assumptions made about camera setup: (1) multiple stationary cameras with overlapping FOVs; (2) multiple stationary cameras with nonoverlapping FOVs; and (3) multiple pan-tilt-zoom cameras. We review related work on object detection in single cameras and discuss the limitations that our approach addresses.
17.2.1 Multiple Stationary Cameras with Overlapping Fields of View
By far, the largest body of work in associating objects across multiple cameras makes the assumption that the cameras are stationary and have overlapping FOVs. The earliest work involving associating objects across such cameras stemmed from an interest in multiple perspective interactive video in the early 1990s, in which users observing a scene selected particular views from multiple perspectives. In Sato et al. [37], CAD-based environment models were used to extract 3D locations of unknown moving objects; once objects entered the overlapping views of two agents, stereopsis was used to recover their exact 3D positions. Jain and Wakimoto [17] also assumed calibrated cameras to obtain 3D locations of each object in an environment model for multiple perspective interactive video. Although the problem of associating objects across cameras was not explicitly addressed, several innovative ideas were proposed, such as choosing the best view given a number of cameras and the concept of interactive television. Kelly et al. [23] constructed a 3D environment model using the voxel feature. Humans were modeled as a collection of these voxels and the model was used to resolve the camera hand-off problem. These works were characterized by the use of environment models and calibrated cameras. Tracking across multiple views was addressed in its own right in a series of papers from the latter half of the 1990s. Nakazawa et al. [32] constructed a state transition map that linked regions observed by one or more cameras, along with a number of action rules to consolidate information between cameras. Cai and Aggarwal [3] proposed a method to track humans across a distributed system of cameras, employing geometric constraints for tracking between neighboring cameras. Spatial matching was based on the Euclidean distance of a point from its corresponding epipolar line. Bayesian networks were used in several papers. For example, Chang and Gong [4] used Bayesian networks to combine geometry (epipolar geometry, homographies, and landmarks) and recognition (height and appearance) modalities to match objects across multiple sequences. Bayesian networks were also used by Dockstader and Tekalp [11] to track objects and resolve occlusions across multiple calibrated cameras. Integration of stereo pairs was another popular approach, adopted by Mittal and Davis [31], Krumm et al. [26], and Darrell et al. [9]. Several approaches were proposed that did not require prior calibration of cameras but instead learned minimal relative camera information. Azarbayejani and Pentland [1] developed an estimation technique for recovering 3D object tracks and multi-view
geometry from 2D blob features. Lee et al. [28] made an assumption of scene planarity and learned the homography relating the views by robust sampling methods. They then recovered 3D camera and plane configurations to construct a common coordinate system and used it to analyze object motion across cameras. Khan et al. [25] proposed an approach that avoided explicit calibration of cameras and instead used constraints on the FOV lines between cameras, learned during a training phase, to track objects across cameras.
17.2.2 Multiple Stationary Cameras with Nonoverlapping Fields of View
The assumption of overlapping FOVs restricts the area cameras can cover. It was realized that meaningful constraints could be applied to object tracking across cameras with nonoverlapping FOVs as well. This allowed the collective FOV of the camera system to be dispersed over a far wider area. In the research community, this subfield seems initially to have been an offshoot of object recognition, where it was viewed as a problem of recognizing objects previously viewed by other cameras. Representative of this work were Huang and Russell [16], who proposed a probabilistic appearance-based approach for tracking vehicles across consecutive cameras on a highway. Constraints on the motion of objects across cameras were first proposed by Kettnaker and Zabih [24], where positions, object velocities, and transition times across cameras were used in a setup of known path topology and transition probabilities. Collins et al. [6] used a system of calibrated cameras with an environment model to track objects across multiple views. Javed et al. [18, 20] did not assume a site model or explicit calibration of cameras; instead, they learned inter-camera illumination and transition properties during a training phase and then used them to track objects across cameras. Recently, Javed et al. [19] and Stauffer and Tieu [44] proposed methods for tracking across multiple cameras with both overlapping and nonoverlapping FOVs. Both methods assume scene planarity and build a correspondence model for the entire set of cameras. Some work has also been published on recovering the pose and/or tracks between cameras with nonoverlapping fields of view. Fisher [12] showed that, given a set of randomly placed cameras, recovering a pose was possible using distant moving features and nearby linearly moving features. Makris et al. [29] also extracted the topology of a number of cameras based on the co-occurrence of entries and exits. Rahimi et al. [36] presented an approach that reconstructed the trajectory of a target and the external calibration parameters of the cameras, given the location and velocity of each object.
17.2.3 Multiple Pan-Tilt-Zoom Cameras
So far, this discussion has addressed approaches that assume the cameras remain stationary, with overlapping or nonoverlapping FOVs. Clearly, the collective FOV of the sensors can be further increased if the sensors are allowed to move. With the introduction of motion, the camera FOVs can be overlapping or nonoverlapping at different times, and one of the challenges of tracking across moving cameras is that both situations need to be addressed. A limited type of camera motion was examined in previous work: motion of the camera about its center, that is, pan-tilt-zoom (PTZ). One such work is Matsuyama and Ukita [30], where the authors presented a system-based approach using active cameras, developing a fixed-point PTZ camera for wide-area imaging. Kang et al. [21] proposed
a method that involved multiple stationary and PTZ cameras. It was assumed that the scene was planar and that the homographies between cameras were known. Using these transformations, a common coordinate frame was established and objects were tracked across the cameras using color and motion characteristics. A related approach was also proposed in Collins et al. [7], where they presented an active multiple-camera system that maintained a single moving object centered in each view, using PTZ cameras.
17.3 INFERENCE FRAMEWORK
Suppose that we have a system of r cameras C_1, C_2, . . . , C_r with nonoverlapping views. Further assume that there are n objects in the environment (the number of objects is not assumed to be known). Each of these objects is viewed from different cameras at different time instants. Also assume that the task of single-camera tracking is already solved, and let O_j = {O_{j,1}, O_{j,2}, . . . , O_{j,m_j}} be the set of m_j observations recorded by camera C_j. Each observation O_{j,a} is a track of some object from its entry to its exit in the FOV of camera C_j and is based on two features, the appearance of the object O_{j,a}(app) and its trajectory O_{j,a}(traj) (location, velocity, time, etc.). The problem of multi-camera association is to find which of the observations in the system of cameras belong to the same object.
For a formal definition of this problem, we let a correspondence k_{a,b}^{c,d} define the hypothesis that O_{a,b} and O_{c,d} are observations of the same object in the environment, with observation O_{a,b} preceding observation O_{c,d}. The problem of multi-camera tracking is to find a set of correspondences K = {k_{a,b}^{c,d}} such that k_{a,b}^{c,d} ∈ K if and only if O_{a,b} and O_{c,d} correspond to successive observations of the same object in the environment. Let Σ be the solution space of the multi-camera tracking problem. We assume that each observation of an object is preceded or succeeded by a maximum of one observation (of the same object). We define the solution of the multi-camera tracking problem to be a hypothesis K in the solution space Σ that maximizes the a posteriori probability. It is given by

K = arg max_{K∈Σ} ∏_{k_{i,a}^{j,b} ∈ K} P(O_{i,a}(app), O_{j,b}(app) | k_{i,a}^{j,b}) P(O_{i,a}(st), O_{j,b}(st) | k_{i,a}^{j,b}) P(C_i, C_j)
If the trajectory and appearance probability density functions are known, the posterior can be maximized using a graph theoretic approach.
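As an illustration of one such graph-theoretic approach (not necessarily the one used by the authors), the maximization can be cast as a maximum-weight bipartite assignment over the pairwise log posteriors, which SciPy's Hungarian solver handles directly; log_p is an assumed score matrix combining the appearance and trajectory terms.

import numpy as np
from scipy.optimize import linear_sum_assignment

def map_associations(log_p, min_log_p=-50.0):
    # log_p[a, b]: log P(appearance) + log P(trajectory) for linking earlier
    # observation a with later observation b (use large negative finite values,
    # not -inf, for impossible pairs).
    rows, cols = linear_sum_assignment(-log_p)   # Hungarian solves a minimization
    # keep only assignments that are not effectively impossible
    return [(a, b) for a, b in zip(rows, cols) if log_p[a, b] > min_log_p]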
17.4 EVALUATING AN ASSOCIATION USING APPEARANCE INFORMATION A commonly used cue for tracking in a single camera is the appearance of objects. Appearance is a function of scene illumination, object geometry, object surface material properties (e.g., surface albedo), and camera parameters. Among all these, only the object surface material properties remain constant as an object moves across cameras.
Thus, the color distribution of an object can be different when viewed from two different cameras. One way to match appearances in different cameras is to use color measures that are invariant to variations in illumination and geometry [46, 47]. However, these measures are not very discriminative and usually cannot distinguish between different shades of the same color. Another approach for matching object appearance across two camera views is to find a transformation that maps the object's appearance in one camera image to its appearance in the other. Note that, for a given pair of cameras, this transformation is not unique and depends on scene illumination and camera parameters. In this chapter, we make use of the result by Javed et al. [20] that, despite this dependence on a large number of parameters, for a given pair of cameras all such appearance transformations lie in a low-dimensional subspace. Our proposed method learns this subspace of mappings (brightness transfer functions) for each pair of cameras from the training data through probabilistic principal component analysis. Thus, given appearances in two different cameras and given the subspace of brightness transfer functions learned during the training phase, we can estimate the probability that the transformation between the appearances lies in the learned subspace. In the next section, we present a method for estimating the BTFs and their subspace from training data in a multi-camera tracking scenario.
17.4.1 Estimating the Subspace of BTFs Between Cameras
Consider a pair of cameras C_i and C_j. Corresponding observations of an object across this camera pair can be used to compute an inter-camera BTF. One way to determine this BTF is to estimate the pixel-to-pixel correspondence between the object views in the two cameras. However, finding such correspondences from views of the same object in two different cameras is not possible because of self-occlusion and differences in pose. Thus, we employ normalized histograms of object brightness values for the BTF computation. Such histograms are relatively robust to changes in object pose [48]. To compute the BTF, we assume that the percentage of image points in the observed object O_i with brightness less than or equal to B_i is equal to the percentage of image points in observation O_j with brightness less than or equal to B_j. Note that a similar strategy was adopted by Grossberg and Nayar [49] to obtain a BTF between images taken from the same camera of the same view but under different illumination conditions. Now, if H_i and H_j are the normalized cumulative histograms of object observations O_i and O_j, respectively, then H_i(B_i) = H_j(B_j) = H_j(f_ij(B_i)). Therefore, we have

f_ij(B_i) = H_j^{−1}(H_i(B_i))    (17.1)

where H^{−1} is the inverted cumulative histogram. We use equation 17.1 to estimate the brightness transfer function f_ij for every pair of observations in the training set. Let F_ij be the collection of all brightness transfer functions obtained in this manner, i.e., {f_{(ij)1}, f_{(ij)2}, . . . , f_{(ij)N}}. To learn the subspace of this collection we use probabilistic principal component analysis (PPCA). According to this model, a d-dimensional BTF, f_ij, can be written as

f_ij = W y + f̄_ij + ε    (17.2)
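Equation 17.1 translates almost directly into a histogram-matching routine. The following sketch (with illustrative names, assuming 8-bit brightness values) estimates a BTF from two observations of the same object:

import numpy as np

def brightness_transfer_function(obj_i, obj_j, n_bins=256):
    # f_ij of equation 17.1: map each brightness level in camera i to the level
    # in camera j that has the same normalized cumulative histogram value.
    h_i, _ = np.histogram(obj_i, bins=n_bins, range=(0, n_bins))
    h_j, _ = np.histogram(obj_j, bins=n_bins, range=(0, n_bins))
    H_i = np.cumsum(h_i) / h_i.sum()       # normalized cumulative histograms
    H_j = np.cumsum(h_j) / h_j.sum()
    # f_ij(B) = H_j^{-1}(H_i(B)), realized by searching the sorted values of H_j
    return np.searchsorted(H_j, H_i).clip(0, n_bins - 1)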
Here y is a normally distributed q-dimensional latent (subspace) variable, q < d; W is a d × q projection matrix that relates the subspace variables to the observed BTF; f̄_ij is the mean of the collection of BTFs; and ε is isotropic Gaussian noise, i.e., ε ∼ N(0, σ²I). Given that y and ε are normally distributed, the distribution of f_ij is given as

f_ij ∼ N(f̄_ij, Z)    (17.3)

where Z = WW^T + σ²I. Now, as suggested in [50], the projection matrix W is estimated as

W = U_q (E_q − σ²I)^{1/2} R    (17.4)

where the q column vectors in the d × q matrix U_q are the eigenvectors of the sample covariance matrix of F_ij; E_q is the q × q diagonal matrix of the corresponding eigenvalues λ_1, . . . , λ_q; and R is an arbitrary orthogonal rotation matrix, which can be set to the identity. The value of σ², which is the variance of the information "lost" in the projection, is calculated as

σ² = (1 / (d − q)) Σ_{v=q+1}^{d} λ_v    (17.5)

Once the values of σ² and W are known, we can compute the probability of a particular BTF belonging to the learned subspace of BTFs by using the distribution in equation 17.3. Note that until now we have been dealing only with the brightness values of images and with computing the brightness transfer functions. To deal with color images, we treat each channel (i.e., R, G, and B) separately. The transfer function for each color channel (the color transfer function) is computed exactly as discussed previously. The subspace parameters W and σ² are also computed separately for each color channel. Note as well that we do not assume knowledge of any camera parameters or response functions for the computation of these transfer functions and their subspace.
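A compact sketch of the training and evaluation steps for one brightness channel, with illustrative function names, following equations 17.3 through 17.5:

import numpy as np

def fit_btf_subspace(F, q):
    # F: N x d array with one brightness transfer function per row.
    mean = F.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(F, rowvar=False))
    evals, evecs = evals[::-1], evecs[:, ::-1]            # sort descending
    d = F.shape[1]
    sigma2 = evals[q:].mean()                             # equation 17.5
    W = evecs[:, :q] @ np.diag(np.sqrt(np.maximum(evals[:q] - sigma2, 0.0)))
    Z = W @ W.T + sigma2 * np.eye(d)                      # covariance of equation 17.3
    return mean, Z

def btf_log_likelihood(f, mean, Z):
    # log N(f | mean, Z): how well a candidate BTF fits the learned subspace.
    diff = f - mean
    _, logdet = np.linalg.slogdet(Z)
    return -0.5 * (f.size * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Z, diff))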
17.5 EVALUATING AN ASSOCIATION USING MOTION INFORMATION Appearance information contains strong cues for associating objects across frames. The motion of objects, seen across cameras, provides another strong cue for association that is largely independent of object appearance. As a result, applying it in conjunction with appearance information provides a significant boost to association accuracy. In this section, we detail how objects can be associated across different cameras based on their motion.
17.5.1 Data Model
The scene is modeled as a plane Π in 3-space, with K moving objects. The kth object,¹ O_k, moves along a trajectory on Π, represented by a time-ordered set of points x_k(t) = (x_k(t), y_k(t)) ∈ R², where x_k(t) and y_k(t) evolve according to some spatial algebraic curve: a line, a quadratic, or a cubic. The finite temporal support is denoted by Δt. The scene is observed by N perspective cameras, each observing some subset of all motion in the scene, due to a spatially limited field of view and a temporally limited window of observation (due to camera motion). The imaged trajectory observed by the nth camera for O_k is x_k^n(t). We assume that in each sequence the frame-to-frame motion of the camera has been compensated, so x_k^n(t) is expressed in a single reference coordinate system. The measured image positions of objects, x̄_k^n, are described in terms of the canonical image positions, x_k^n, with independent normally distributed measurement noise of zero mean and covariance matrix R_k^n. That is,

x̄_k^n(t) = x_k^n(t) + ε,   ε ∼ N(0, R_k^n)    (17.6)

The imaged trajectory is related to x_k(t) by a projective transformation denoted by an invertible 3 × 3 matrix H_n. The homogeneous representation of a point x_k^n(t) is X_k^n(t) = (x_k^n(t), y_k^n(t), 1) ∈ P². Thus, we have X_k^n(t) = H_n X_k(t). Finally, we introduce the association or correspondence variables C = {c_k^n}_{N×K}, where c_j^i = m represents the hypothesis that O_j^i is the image of O_m, and p(c) is the probability of association c. Since associations of an imaged trajectory with different scene trajectories are mutually exclusive and exhaustive, Σ_{l=1}^{K} p(c_k^n = l) = 1. A term p(c_k^n = 0) may be included to model the probability of spurious trajectories, but we do not consider this in the remainder of this work (i.e., we assume p(c_k^n = 0) = 0).

¹ Each object is abstracted as a point, such as the centroid. It should be noted, however, that since the centroid is not preserved under general perspective, transformations using the centroid will introduce bias.
Kinematic Polynomial Models
The position x_j(t) of an object O_j is modeled as a dth-order polynomial in time:

x_j(t) = Σ_{i=0}^{d} p_i t^i    (17.7)

where the p_i are the coefficients of the polynomial. In matrix form,

x_j(t) = P_j t^{(d)} = [ p_{x,0}  p_{x,1}  · · ·  p_{x,d} ;  p_{y,0}  p_{y,1}  · · ·  p_{y,d} ] (1, t, . . . , t^d)^T
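In code, evaluating the matrix form is a single product of the coefficient matrix with the vector of powers of t; the small helper below is ours and is reused in later sketches.

import numpy as np

def trajectory_position(P, t):
    # Evaluate x_j(t) = P_j t^(d) for a 2 x (d+1) coefficient matrix P at one
    # or more time instants t; returns one (x, y) row per instant.
    t = np.atleast_1d(np.asarray(t, dtype=float))
    d = P.shape[1] - 1
    T = np.vander(t, d + 1, increasing=True)   # rows are [1, t, ..., t^d]
    return (P @ T.T).T

# e.g., a constant-velocity (d = 1) object starting at (3, 4):
P_lin = np.array([[3.0, 1.5],
                  [4.0, -0.5]])
print(trajectory_position(P_lin, [0.0, 1.0, 2.0]))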
We omit the dependence of p on j for notational simplicity. Selecting the appropriate order of polynomials is an important consideration. If the order is too low, the polynomial may not correctly reflect the kinematics of the object. On the other hand, if the order is too high, some of the estimated coefficients may not be statistically significant [2]. This problem is even more important in the situation under study, since often only a segment of the polynomial is observed and over- or underfitting is likely. Thus, numerical considerations while estimating the coefficients of the curve are of paramount importance, especially during the optimization routine. Readers are advised to refer to Hartley and Zisserman [14] for information on numerical conditioning during estimation. For instance, the number of parameters that need to be estimated when a parametric cubic curve is to be fit to the trajectories is at most 8K + 9N, since there are K curves
described by 8 parameters each, and N homographies, each with 9 unknowns. At least four points per object must be observed, and only one curve need be observed between a pair of views. The parameterization for a cubic curve is

x(t) = p_3 t³ + p_2 t² + p_1 t + p_0    (17.8)

In this case,

P = [ p_{x,0}  p_{x,1}  p_{x,2}  p_{x,3} ;  p_{y,0}  p_{y,1}  p_{y,2}  p_{y,3} ;  1  1  1  1 ]
Since the scene is modeled as a plane, a point on Π is related to its image in the nth camera by H_n. Thus, a measured point X_j^i at time t associated with O_m (i.e., c_j^i = m) is

X_j^i = H_i P_m t^{(d)} + ε̃    (17.9)
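Putting the kinematic model and the planar mapping together, the image of a model point is obtained by evaluating the curve, lifting to homogeneous coordinates, applying the homography, and dividing out the scale; the noise term of equation 17.9 is omitted and the helper name is ours.

import numpy as np

def project_model_point(H, P, t):
    # Image of the world point at time t: evaluate the 2 x (d+1) polynomial
    # model P, map through the 3 x 3 homography H, apply the perspective divide.
    d = P.shape[1] - 1
    t_vec = np.array([t ** i for i in range(d + 1)])   # [1, t, ..., t^d]
    x, y = P @ t_vec                                   # point on the world plane
    X = H @ np.array([x, y, 1.0])                      # homogeneous image point
    return X[:2] / X[2]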
17.5.2 Maximum Likelihood Estimation
The problem statement is as follows: Given the trajectory measurements for each camera, {x̄_k^n}_{N×K}, find the associations C of each object across cameras and the maximum likelihood estimate of Θ = ({P_k}_K, {H_n}_N), where {P_k}_K are the motion parameters of the K objects and {H_n}_N are the homographies to Π. For the remainder of this chapter, θ_k^n denotes (P_k, H_n). For each individual observed trajectory x̄_j^i we have

p(x̄_j^i | c_j^i, Θ) = p(x̄_j^i | θ^i_{c_j^i}) = ∏_{t=δ_s(i,j)}^{δ_e(i,j)} p(x̄_j^i(t) | x_j^i(t))    (17.10)
where δ_s(i, j) and δ_e(i, j) are the start and end times of O_j^i, respectively.² Computing x_m^n(t) requires a description of the object kinematic model, which we gave in Section 17.5.1. Using equation 17.10 and assuming conditional independence between trajectories, we then have

p(X̄, C | Θ) = ∏_{i=1}^{N} ∏_{j=1}^{z(i)} p(x̄_j^i | c_j^i, Θ) p(c_j^i) = ∏_{i=1}^{N} ∏_{j=1}^{z(i)} (1/K) p(x̄_j^i | θ^i_{c_j^i})    (17.11)
where z(i) denotes the total number of trajectories observed in camera i. Thus, the complete data log likelihood is

log p(X̄, C | Θ) = Σ_{i=1}^{N} Σ_{j=1}^{z(i)} log [ (1/K) p(x̄_j^i | θ^i_{c_j^i}) ]    (17.12)
² Evaluating p(x̄_j^i(t) | x_j^i(t)) requires a measurement error model to be defined, for example a normal distribution, in which case p(x̄_j^i(t) | x_j^i(t)) = N(x̄_j^i(t); x_j^i(t), R_j^i).

The problem, of course, is that we do not have measurements of C, so we cannot use equation 17.12 directly. Therefore, we need to find the maximum likelihood estimate
(MLE) of Θ given X̄, that is,

Θ* = arg max_Θ p(X̄ | Θ)    (17.13)

To evaluate the MLE we need to (1) describe how to evaluate L(Θ | X̄) = p(X̄ | Θ) and (2) describe a maximization routine. By marginalizing out the association in equation 17.10, p(x̄_j^i | Θ) can be expressed as a mixture model:

p(x̄_j^i | Θ) = (1/K) Σ_{m=1}^{K} p(x̄_j^i | θ_m^i)    (17.14)

Then the incomplete log likelihood of the data is given by

log L(Θ | X̄) = log ∏_{i=1}^{N} ∏_{j=1}^{z(i)} p(x̄_j^i | Θ) = Σ_{i=1}^{N} Σ_{j=1}^{z(i)} log [ (1/K) Σ_{m=1}^{K} p(x̄_j^i | θ_m^i) ]
This function is difficult to maximize since it involves the logarithm of a large summation. The expectation-maximization (EM) algorithm provides a means of maximizing p(X̄ | Θ) by iteratively maximizing a lower bound:

Θ⁺ = arg max_Θ Q(Θ, Θ⁻) = arg max_Θ Σ_{C∈𝒞} p(C | X̄, Θ⁻) log p(X̄, C | Θ)
where Θ⁻ and Θ⁺ are the current and new estimates of Θ, respectively, and 𝒞 is the space of configurations that C can assume. To evaluate this expression, we have

p(C | X̄, Θ⁻) = ∏_{i=1}^{N} ∏_{j=1}^{z(i)} p(c_j^i | x̄_j^i, Θ⁻)    (17.15)
where, by Bayes' theorem and equation 17.14,

p(c_j^i | x̄_j^i, Θ⁻) = p(x̄_j^i | c_j^i, Θ⁻) p(c_j^i) / p(x̄_j^i | Θ⁻) = [ (1/K) p(x̄_j^i | θ^{i−}_{c_j^i}) ] / [ (1/K) Σ_{m=1}^{K} p(x̄_j^i | θ_m^{i−}) ]    (17.16)
After some manipulation, we obtain an expression for Q:

Q(Θ, Θ⁻) = Σ_{C∈𝒞} p(C | X̄, Θ⁻) log p(X̄, C | Θ) = Σ_{i=1}^{N} Σ_{j=1}^{z(i)} Σ_{m=1}^{K} p(c_j^i = m | x̄_j^i, θ_m^{i−}) log [ (1/K) p(x̄_j^i | θ_m^i) ]    (17.17)
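The E-step of this scheme reduces to normalizing the per-object likelihoods of each observed trajectory; a small numerically stable sketch (the uniform 1/K prior cancels in the normalization):

import numpy as np

def e_step_responsibilities(log_lik):
    # log_lik[j, m] = log p(xbar_j | theta_m); returns p(c_j = m | xbar_j, Theta^-)
    # for every observed trajectory j, as in equation 17.16.
    shifted = log_lik - log_lik.max(axis=1, keepdims=True)   # stabilize the exponentials
    w = np.exp(shifted)
    return w / w.sum(axis=1, keepdims=True)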
To derive the update terms for H and P, we need to make explicit the algebraic curve we are using to model the object trajectory and the measurement noise model.
If the noise is normally distributed,

p(x̄_k^n | θ_m^n) = ∏_{t=δ_s(n,k)}^{δ_e(n,k)} (1 / |2π R_m^n|^{1/2}) exp{ −(1/2) d(x̄_k^n(t), x_m^n(t)) }    (17.18)
where d(·) is the Mahalanobis distance. The probability p(X̄ | C, Θ) can be evaluated as follows:

p(X̄ | C, Θ) = ∏_{n=1}^{N} ∏_{k=1}^{z(n)} ∏_{t=δ_s(n,k)}^{δ_e(n,k)} (1 / |2π R^n_{c_k^n}|^{1/2}) exp{ −(1/2) d(x̄_k^n(t), x^n_{c_k^n}(t)) }    (17.19)
where

d(x̄_k^n(t), x^n_{c_k^n}(t)) = (x̄_k^n(t) − x^n_{c_k^n}(t))^T (R^n_{c_k^n})^{−1} (x̄_k^n(t) − x^n_{c_k^n}(t))
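Combining the Mahalanobis distance with the projection helper sketched earlier gives the per-track log likelihood of equation 17.18; the routine below is a simplified illustration assuming a single 2 × 2 covariance R for the whole track.

import numpy as np

def track_log_likelihood(x_obs, times, H, P, R):
    # log p(xbar_k^n | theta_m^n): x_obs is a (T, 2) array of measured positions
    # at the given times; H and P are the hypothesized homography and curve.
    R_inv = np.linalg.inv(R)
    _, logdet = np.linalg.slogdet(2 * np.pi * R)
    ll = 0.0
    for x_bar, t in zip(x_obs, times):
        diff = x_bar - project_model_point(H, P, t)   # illustrative helper from earlier
        ll += -0.5 * (logdet + diff @ R_inv @ diff)   # Gaussian term with Mahalanobis distance
    return ll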
and x^n_{c_k^n}(t) is the corresponding point that lies exactly on the curve described by P_{c_k^n} and is transformed to the coordinate system of camera n using H_n. Explicitly,

(x^n_{c_k^n}(t), y^n_{c_k^n}(t))^T = H_n (x_{c_k^n}(t), y_{c_k^n}(t), 1)^T    (17.20)
It is instructive to note that, unlike the maximum likelihood term for independent point detections defined in terms of the reprojection error in [45], where the parameters of the reprojection error function include "error-free" data points, the curve model fit on the points allows the error function to be written compactly in terms of the parameters of the curve and a scalar value denoting the position along the curve (taken here to be the time index t). This drastically reduces the number of parameters that need to be estimated. We need an analytical expression for log[(1/K) p(x̄_j^i | θ_m^i)], which will then be maximized in the M-step. Taking the partial derivatives with respect to the homography and curve parameters,

( ∂f/∂h_1^i, . . . , ∂f/∂h_9^i, ∂f/∂p_1, . . . , ∂f/∂p_4 )

for each of the cameras (except the reference camera) and all world objects, we arrive at the updating formulae. The Jacobian can then be constructed to guide minimization algorithms (such as Levenberg-Marquardt). We performed quantitative analysis through simulations to test the behavior of the proposed approach with respect to noise. We also obtained qualitative results on a number of real sequences, recovering the true underlying scene geometry and object kinematics. For the real sequences, the video was collected by cameras mounted on aerial vehicles. Frame-to-frame registration was performed using robust direct registration methods, and object detection and tracking were performed partly using an automated tracking system and partly through manual tracking.
17.5.3 Simulations In this set of experiments we generated random trajectories fitting a prescribed model. The variable scene descriptors included number of objects, number of cameras, and number of frames (observations). For each camera there was a separate probability of observation of an object, and for each object a duration of observation was randomly
selected. In this way, spatio-temporal overlap was not guaranteed during data generation. A noise parameter was set for introducing errors into the true parameter values (camera parameters and curve coefficients), which were then treated as initial estimates. The homographies subtended by the camera parameters were calculated and used to project each curve onto the image, depending on its probability and its duration of observation. Zero-mean noise was then added to the projected points. We tested the sensitivity of the algorithm with respect to corruption of the curve coefficients by white noise and with respect to measurement error. For these experiments, five object trajectories were randomly generated according to linear, quadratic, and cubic models, along with two homographies (two cameras). The probability of observation was set to 1 so that both cameras were guaranteed to see every object (but not necessarily at the same time). Only 10 frames were observed, and 10 iterations of the EM algorithm were run. Four measurement noise levels were tested (1, 6, 11, and 21) against five coefficient noise levels (1 × 10⁻¹⁰, 1 × 10⁻⁸, 1 × 10⁻⁶, 1 × 10⁻⁴, and 1 × 10⁻²), and each configuration was repeated 25 times (to generate statistics). This experiment demonstrates that although higher-order models have a larger number of parameters to estimate, they are less susceptible to noise; the result is illustrated in Figure 17.2 for the linear, quadratic, and cubic models. This follows intuition, as more information about the underlying homography is provided by each object.
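A simplified sketch of this data-generation protocol, with arbitrary parameter values and reusing the illustrative project_model_point helper from earlier:

import numpy as np

rng = np.random.default_rng(0)

def simulate_observations(n_objects=5, degree=1, n_frames=10, noise_sigma=6.0):
    # Random polynomial world trajectories, projected through a (near-identity)
    # homography per camera, with zero-mean Gaussian measurement noise added.
    curves = [100.0 * rng.uniform(-1, 1, size=(2, degree + 1)) for _ in range(n_objects)]
    cameras = [np.eye(3) + 0.01 * rng.normal(size=(3, 3)) for _ in range(2)]
    times = np.arange(n_frames, dtype=float)
    observations = {}
    for c, H in enumerate(cameras):
        for k, P in enumerate(curves):
            pts = np.array([project_model_point(H, P, t) for t in times])
            observations[(c, k)] = pts + rng.normal(scale=noise_sigma, size=pts.shape)
    return curves, cameras, observations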
FIGURE 17.2 Performance with respect to noise: (a) linear model; (b) quadratic model; (c) cubic model.
17.5.4 Real Sequences
In this set of experiments, we studied the association of objects across multiple sequences in real videos, testing the proposed approach on three sequences. In the first, several cars were moving in succession along a road, as shown in Figures 17.3(a) and (b). From the space–time plot it is clear that one object was moving faster than the rest (indicated by the angle with the horizontal plane). The linear (constant velocity) model was used for this experiment. Within six iterations, the correct associations were discerned and, as shown in Figures 17.3(c) and (d), the trajectories were correctly aligned. It should be noted that in this case the lines were almost parallel; they constituted the degenerate case. However, the correct association was still found, and the alignment was reasonable. In the second experiment a quadratic kinematic model was used in two sequences. Figure 17.4 shows the relative positions of the first set of sequences before (a) and after (b) running the proposed approach. It can be observed that the initial misalignment was almost 400–500 pixels. It took 27 iterations of the algorithm to converge. For the second set of videos, Figure 17.5 shows the objects (b) before and (c) after running the proposed algorithm.
FIGURE 17.3 Experiment 1—Reacquisition of objects: (a) trajectories overlayed on the first segment mosaic; (b) trajectories overlayed on the second segment mosaic; (c) space–time plot of trajectories showing that object 2 is moving faster than the other objects; (d) space–time plot of trajectories in segment 2.
FIGURE 17.4 Object association across multiple nonoverlapping cameras—quadratic curve: (a) initialization; (b) converged solution.
FIGURE 17.5 Object association across two nonoverlapping cameras for a quadratic model of motion: (a) coregistered mosaics from the two cameras; (b) initialization of trajectories; (c) converged solution.
FIGURE 17.6 Overhead view of persons walking: (a) the trajectories viewed from the first camera are color coded; (b) the same trajectories from the second camera.
In this case, the initial estimate of the homography was good (within 50 pixels), but the initial estimate of the curve parameters was poor. The final alignment of the sequences is shown in Figure 17.5(a). The algorithm took only six iterations to converge. Finally, Figure 17.6 illustrates performance on video taken from two overhead cameras looking at people walking. The color code of each trajectory shows the association across views recovered by the algorithm. The large rotation present between the views caused the algorithm to require a large number of iterations (39) to converge.
17.6 CONCLUSIONS
In this chapter, we discussed methods for associating objects across multiple cameras. We assumed that a planar approximation of the ground is viable and that, for a given pair of cameras, all appearance transformations lie in a low-dimensional subspace. Given these assumptions, and taking as input the time-stamped trajectories of the objects observed in each camera, we estimated the inter-camera transformations, the association of each object across the views, and canonical trajectories, which are the best estimate (in a maximum likelihood sense) of the original object trajectories up to a 2D projective transformation. We thus described an extension of the reprojection error to multiple views, providing a geometrically and statistically sound means of evaluating the likelihood of a candidate correspondence set. For evaluating a hypothesis based on appearance, we showed that, despite depending on a large number of parameters, for a given pair of cameras all such transformations lie in a low-dimensional subspace. The method learns this subspace of mappings (brightness transfer functions) for each pair of cameras from the training data by using probabilistic principal component analysis. Thus, given appearances in two different cameras and the subspace of brightness transfer functions learned during the training phase, we can estimate the probability that the transformation between the appearances lies in the learned subspace.
REFERENCES
[1] A. Azarbayejani, A. Pentland, Real-time self-calibrating stereo person tracking using 3D shape estimation from blob features, in: IAPR Proceedings of the International Conference on Pattern Recognition, 1996.
[2] Y. Bar-Shalom (ed.), Multitarget-Multisensor Tracking: Advanced Applications, Artech House, 1990.
[3] Q. Cai, J.K. Aggarwal, Tracking human motion in structured environments using a distributed camera system, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (11) (1999) 1241–1247.
[4] T.-H. Chang, S. Gong, Tracking multiple people with a multi-camera system, in: Proceedings of the IEEE Workshop on Multi Object Tracking, 2001.
[5] O. Chum, T. Pajdla, P. Sturm, The geometric error for homographies, Computer Vision and Image Understanding, 2005.
[6] R. Collins, A. Lipton, H. Fujiyoshi, T. Kanade, Algorithms for cooperative multisensor surveillance, in: Proceedings of the IEEE, 2001.
[7] R. Collins, O. Amidi, T. Kanade, An active camera system for acquiring multi-view video, in: Proceedings of the IEEE ICIP, 2002.
[8] A. Criminisi, A. Zisserman, A Plane Measuring Device, BMVC, 1997.
[9] T.J. Darrell, I.A. Essa, A.P. Pentland, Task-specific gesture analysis in real-time using interpolated views, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995.
[10] T. Darrell, D. Demirdjian, N. Checka, P. Felzenszwalb, Plan-view trajectory estimation with dense stereo background models, in: Proceedings of the IEEE International Conference on Computer Vision, 2001.
[11] S. Dockstader, A. Tekalp, Multiple camera fusion for multi-object tracking, in: Proceedings of the IEEE International Workshop on Multi-Object Tracking, 2001.
[12] R. Fisher, Self-organization of randomly placed sensors, in: Proceedings of the European Conference on Computer Vision, 2002.
[13] M. Garey, D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, 1979.
[14] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
[15] J. Hopcroft, R. Karp, An n^5/2 algorithm for maximum matchings in bipartite graphs, SIAM Journal on Computing, 1973.
[16] T. Huang, S. Russell, Object identification in a Bayesian context, in: Proceedings of the International Joint Conferences on Artificial Intelligence, 1997.
[17] R. Jain, K. Wakimoto, Multiple perspective interactive video, in: Proceedings of the IEEE International Conference on Multimedia Computing and Systems, 1995.
[18] O. Javed, Z. Rasheed, K. Shafique, M. Shah, Tracking in multiple cameras with disjoint views, in: Proceedings of the IEEE Ninth International Conference on Computer Vision, 2003.
[19] O. Javed, Z. Rasheed, O. Alatas, M. Shah, M-Knight: A real time surveillance system for multiple overlapping and non-overlapping cameras, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2003.
[20] O. Javed, K. Shafique, M. Shah, Appearance modeling for tracking in multiple non-overlapping cameras, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[21] S. Kang, K. Ikeuchi, Toward automatic robot instruction from perception: Mapping human grasps to manipulator grasps, IEEE Transactions on Robotics and Automation 12, 1996.
[22] J. Kang, I. Cohen, G. Medioni, Continuous tracking within and across camera streams, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[23] P. Kelly, A. Katkere, D. Kuramura, S. Moezzi, S. Chatterjee, R. Jain, An architecture for multiple perspective interactive video, in: ACM Proceedings of the Conference on Multimedia, 1995.
[24] V. Kettnaker, R. Zabih, Bayesian multi-camera surveillance, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1999.
[25] S. Khan, M. Shah, Consistent labeling of tracked objects in multiple cameras with overlapping fields of view, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
[26] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, S. Shafer, Multi-camera multi-person tracking for easy living, in: Proceedings of the IEEE Workshop on Visual Surveillance, 2000.
[27] H. Kuhn, The Hungarian method for solving the assignment problem, Naval Research Logistics Quarterly, 1955.
[28] L. Lee, R. Romano, G. Stein, Learning patterns of activity using real-time tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
[29] D. Makris, T. Ellis, J. Black, Bridging the gaps between cameras, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[30] T. Matsuyama, N. Ukita, Real-time multitarget tracking by a cooperative distributed vision system, in: Proceedings of the IEEE, 2002.
[31] A. Mittal, L. Davis, M2 Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene, International Journal of Computer Vision, 2003.
[32] A. Nakazawa, H. Kato, S. Inokuchi, Human tracking using distributed vision systems, in: Proceedings of the International Conference on Pattern Recognition, 1998.
[33] K. Pattipati, S. Deb, Y. Bar-Shalom, Passive multisensor data association using a new relaxation algorithm, in: Y. Bar-Shalom (ed.), Multisensor-Multitarget Tracking: Advanced Applications, Artech House, 1990.
[34] C. Papadimitriou, Computational Complexity, Addison-Wesley, 1994.
[35] A. Poore, Multidimensional assignment formulation of data association problems arising from multitarget and multisensor tracking, Computational Optimization and Applications, Springer, 1994.
[36] A. Rahimi, B. Dunagan, T. Darrell, Simultaneous calibration and tracking with a network of non-overlapping sensors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[37] K. Sato, T. Maeda, H. Kato, S. Inokuchi, CAD-based object tracking with distributed monocular camera for security monitoring, in: Proceedings of the IEEE Workshop on CAD-Based Vision, 1994.
[38] K. Shafique, M. Shah, A noniterative greedy algorithm for multiframe point correspondence, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.
[39] Y. Shan, H. Sawhney, R. Kumar, Unsupervised learning of discriminative edge measures for vehicle matching between non-overlapping cameras, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2005.
[40] Y. Shan, H. Sawhney, R. Kumar, Vehicle identification between non-overlapping cameras without direct feature matching, in: Proceedings of the IEEE International Conference on Computer Vision, 2005.
[41] Y. Sheikh, X. Li, M. Shah, Trajectory association across non-overlapping moving cameras in planar scenes, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2007.
[42] Y. Sheikh, M. Shah, Trajectory association across multiple airborne cameras, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
17.6 Conclusions
433
[43] R. Sittler, An optimal data association problem in surveillance theory, IEEE Transactions on Military Electronics, 1964. [44] C. Stauffer, K. Tieu, Automated multi-camera planar tracking correspondence modelling, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003. [45] P. Sturm, Vision 3D non calibrée – contributions à la reconstruction projective et étude des mouvements critiques pour l’auto-calibrage, PhD Dissertation, INP6, Grenoble France, 1997. [46] J. Geusebroek, R. Boomgaard, A. Smeulders, H. Geerts, Color invariance, in: Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (12) (2001). [47] B. Funt, G.D. Finlayson. Color constant color indexing. IEEE Transactions on Pattern Recognition and Machine Intelligence 17 (5) (1995). [48] M.J. Swain, D.H. Ballard. Indexing via color histograms, in: Proceedings of the IEEE International Conference on Computer Vision, 1990. [49] M.D. Grossberg, S.K. Nayar, Determining the camera response from images: What is knowable?, in: Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (11) (2003). [50] M.E. Tipping, C.M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society, Series B, 61 (3) (1999). [51] F. Porikli, Inter-camera color calibration by correlation model function, in: Proceedings of the IEEE International Conference on Image Processing, 2003. [52] O. Javed, K. Shafique, Z. Rasheed, M. Shah, Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views, Computer Vision and Image Understanding Journal 109 (2) (2008).
CHAPTER 18
Video Surveillance Using a Multi-Camera Tracking and Fusion System
Zhong Zhang, Andrew Scanlon, Weihong Yin, Li Yu, Péter L. Venetianer
ObjectVideo Inc., Reston, Virginia
Abstract
Use of intelligent video surveillance (IVS) systems is spreading rapidly in a wide range of applications. In most cases, even in multi-camera installations, the video is processed independently in each feed. This chapter describes a real-time system that fuses tracking information from multiple cameras, thus vastly expanding the capabilities of IVS by allowing the user to define rules on the map of the whole area, independent of individual cameras. The fusion relies on all cameras being calibrated to a site map while the individual sensors remain largely unchanged. We present a new method to quickly and efficiently calibrate all cameras to the site map, making the system viable for large-scale commercial deployments. The method uses line feature correspondences, which enable easy feature selection and provide a built-in precision metric to improve calibration accuracy.
Keywords: intelligent video surveillance, multi-camera target tracking, sensor data fusion, camera calibration
18.1 INTRODUCTION
Based on user-defined rules or policies, IVS systems can automatically detect potential threats or collect business intelligence information by detecting, tracking, and analyzing targets in a scene [1–3]. These systems are being utilized in a wide range of applications, primarily involving security and business intelligence. Security applications fall into two major categories. In one, the IVS system operates and alerts in real time, providing immediate detection and prevention. In forensic mode, the system is used offline, after the fact, to find out what happened. Security applications, particularly real-time applications, typically require very high detection rates and very low false alarm rates to be useful and commercially viable. They include the detection of perimeter intrusion, suspicious object insertion (e.g., a left bag), suspicious human behavior (e.g., loitering), and illegal or suspicious vehicle parking. Figure 18.1 illustrates some real-world security application scenarios.
FIGURE 18.1 Security applications of a video surveillance system: (a) person entering a metro tunnel; (b) person approaching a fence line; (c) protecting an oil pipeline; (d) illegal street parking.
In business intelligence applications the main purpose is to collect statistical information over time. The output of such systems is typically some form of summary report, with little emphasis on individual events. While the underlying technology and the core events of interest are often similar between business intelligence and security applications, business intelligence requirements are somewhat different. Typically no real-time reporting is required, and the application is less sensitive to the accuracy of detecting individual events as long as the overall statistics are reliable. Typical applications include detection of customer behavior in retail environments (such as counting people in a store or determining shopping habits within a store) and measurement of operational efficiencies (such as using queue length to determine staffing levels). In large, multi-camera installations, a central management console provides unified access to all systems, allowing centralized configuration and quick access to all rules, alerts, and results. The user interface may display all results together on a map, or a counting application may aggregate the counts from different feeds. Processing of the camera feeds, rules, and alerts, however, remains independent for each camera. While this setup is sufficient for some scenarios, its effectiveness is limited by detecting only
local, single-camera events. More complex events spanning multiple cameras cannot be detected and are thus potentially overlooked. For example, a vehicle circling the perimeter of an important location cannot be detected if the cameras covering the perimeter process their view without any data fusion. To enable the system to provide global awareness of what is going on in the whole area, the data from multiple cameras must be fused. Such a multi-camera wide-area IVS system relies on a wide range of research and development issues, such as camera placement, camera alignment, data fusion, and global event detection. In a multi-camera surveillance system, the number and placement of individual cameras have a great impact on system cost, capability, and performance. If the primary purpose of the system is to accurately track the target of interest throughout the scene without losing it because of target-to-target or background-to-target occlusions, then multiple cameras will typically monitor the same area from different directions, as described by several studies [4–6]. In commercial applications, installation cost is often one of the main driving factors. An efficient camera placement algorithm, where the FOV of every camera is ensured to have a 25 to 50 percent overlap with the FOVs of the rest of the cameras in the site, is described in Pavlidis et al. [7]. Camera calibration/alignment is used to establish correspondences between the targets in different video streams taken by different cameras in the system. One general assumption is that each video stream has a common ground plane on which the targets of interest are usually detected. When the cameras have significant portions of overlapping FOVs, the homography between the two corresponding image ground planes from two cameras can be computed using the target footprint trajectories and a least median of squares search [8–11]. For multi-camera surveillance applications with little or no overlap between cameras, a significant amount of current research has focused on automatically learning the camera topology. Javed et al. [12–16] exploited the redundancy in paths that humans and cars tend to follow (e.g., roads, walkways, and corridors) by using motion trends and appearance of objects to establish correspondence. The system does not require any inter-camera calibration; instead, it learns the camera topology and path probabilities of objects during a training phase using Parzen windows. Once the training is complete, correspondences are assigned using the maximum a posteriori (MAP) estimation framework. Similarly, Gilbert and Bowden [17, 18] used an incremental learning method to create the spatio-temporal links between cameras and thus to model their posterior probability distribution. This can then be used with an appearance model of the object to track across cameras. The approach requires no precalibration or batch preprocessing, is completely unsupervised, and becomes more accurate over time as evidence is accumulated. Stauffer [19, 20] and Tieu et al. [21] exploited the unique characteristics of tracking data to establish a planar tracking correspondence model for a large camera network, using it to reliably track objects through multiple cameras. Entry and exit zones in each camera view are learned and used to assist the tracking. Ellis et al. [22], Makris et al. [23], and Black et al.
[24] used these learned entry and exit zones to build the camera topology by temporal correlation of objects transiting between adjacent camera view fields. The camera topology information is extracted in two steps: (1) the principal entry and exit zones associated with each camera view are identified; (2) the correspondences or links between the exit zones of one camera to the entry zones
of an adjacent camera are established by accumulating evidence from many trajectory observations. A significant benefit of the method is that it does not rely on establishing correspondence between trajectories. In addition to generating the topological structure, it results in a measure of inter-camera transition times, which can be used to support predictive tracking across the camera network. Hengel et al. [25] further proposed an approach that begins by assuming all camera views are potentially linked and successively eliminates camera topologies that are contradicted by observed motion. Over time, the true patterns of motion emerge as those not contradicted by the evidence. These patterns may then be used to initialize a finer-level search using other approaches if required. The learning-based approaches just described all require reliable target correspondences among cameras, which can be helped by the large body of research on target appearance matching [26, 27] and statistical data association [28, 29]. Although these multi-camera surveillance approaches using learned camera topology have received wide attention in the research community, they have several drawbacks for commercial applications. First, the training/learning procedure is difficult to manage. With unsupervised training/learning, the accuracy of the camera topology and the performance of the whole system are difficult to guarantee. The installation cost increases dramatically if the training/learning has to be supervised by a computer vision expert. Second, any changes in camera settings, such as moving a camera or adding or removing cameras, require the whole system to be recalibrated. Third, adequate and reliable target correspondences among cameras may not always be available. For example, the system may contain different types of cameras with completely different imaging properties; different cameras may have different view angles on the targets; or the illumination may be quite different from camera to camera. All of these variations may make the same physical target appear different from different camera views, which makes appearance-based target matching or correspondence searching much less reliable. Another major issue is scalability: Previous research described systems with only a few cameras, but a commercial installation may require dozens or hundreds, which can make training very difficult.
In this chapter, we describe a commercially viable site-based wide-area intelligent video surveillance system. The major objectives are to provide a global view of the site in real time and to support global event definition and detection. The basic assumptions on the site under monitoring are as follows:
■ The cameras may have overlapping or nonoverlapping FOVs.
■ The major entry or exit regions are covered by at least one camera.
■ The scenes are mostly noncrowded.
■ The targets of interest are on a single planar surface.
In this approach, the cameras are aligned deterministically through an easy-to-use map-based calibration method; the data fusion module has an interface similar to an individual camera sensor so that the system is more scalable. The system works in real time, has low bandwidth requirements, and can be easily configured and operated by security personnel. The chapter is organized as follows: Section 18.2 describes the architecture of a typical single-camera surveillance system. Section 18.3 explains how this architecture is
expanded into a multi-camera system. Section 18.4 describes some real-life applications of the system. Section 18.5 describes some testing schemes. Section 18.6 lists potential areas for future work, and finally Section 18.7 offers our conclusions.
18.2 SINGLE-CAMERA SURVEILLANCE SYSTEM ARCHITECTURE
A typical IVS system [30, 31] is illustrated in Figure 18.2. A dynamic background model [32, 33] is continuously built and updated from the incoming video frames. In each video frame, pixels that are statistically different from the background are marked as foreground and are spatially grouped into blobs (e.g., using the efficient quasi-connected component detector [34]). These blobs are tracked over time to form spatio-temporal targets (e.g., using a Kalman filter). Next, these targets are classified based on various features. Finally, the events of interest (rules) specified by the user are detected on the targets. For example, the user may want to detect when people enter an area by defining a virtual tripwire. The first part of the processing pipeline, up to and including classification, is generic, largely independent of the details of the application and the user-defined events of interest. These steps, marked as content analysis in Figure 18.2, all deal with the actual video frames and generate a high-level metadata description of what is happening. This metadata contains all target information (location, velocity, classification, color, shape, etc.) and potentially the description of the scene, including static (water, sky, etc.) and dynamic (lighting changes, weather) descriptors. The end of the processing pipeline, event detection, uses this metadata description as its input instead of the video, and compares it with the user-defined rules. The design shown enables the system to efficiently operate offline for forensic applications. The content analysis module extracts all video metadata in real time. Only the metadata has to be stored, instead of high-quality video suitable for automated processing. This significantly reduces storage requirements, and events can be detected very quickly simply by analyzing the metadata instead of the much slower video. By relying on video metadata, this design also enables the multi-camera surveillance system described in more detail in the next section.
FIGURE 18.2 Block diagram of a typical IVS system.
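To make the processing stages concrete, the sketch below strings together a background model, blob extraction, and per-frame metadata generation using OpenCV. It is only an illustration of the kind of pipeline described above, not the ObjectVideo implementation: the choice of background subtractor, the blob-size threshold, and the metadata field names are assumptions of this sketch, and the tracking and classification stages are reduced to a comment.

```python
# Illustrative single-camera content-analysis loop (not the system described
# in the text): background model -> foreground blobs -> per-frame metadata.
import cv2

def content_analysis(video_path):
    cap = cv2.VideoCapture(video_path)
    bg = cv2.createBackgroundSubtractorMOG2()   # dynamic background model
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)                   # statistically different pixels
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        targets = []
        for c in contours:                       # spatially group into blobs
            x, y, w, h = cv2.boundingRect(c)
            if w * h < 200:                      # assumed noise threshold
                continue
            targets.append({"bbox": (x, y, w, h),
                            "footprint": (x + w // 2, y + h)})
        # A full system would also track blobs over time (e.g., Kalman filter)
        # and classify them before emitting the metadata packet.
        yield {"frame": frame_idx, "targets": targets}
        frame_idx += 1

for packet in content_analysis("camera01.mp4"):   # assumed file name
    pass   # hand the packet to event detection (or to the fusion sensor)
```

Event detection would then operate purely on such packets, which is what makes the offline, metadata-only forensic mode possible.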
18.3 MULTI-CAMERA SURVEILLANCE SYSTEM ARCHITECTURE
The IVS system, as just described, provides an adequate solution for many applications. But by analyzing only a single video feed at a time, it offers a somewhat myopic view into the world, with all its associated limitations. For example, the goal of the IVS system may be to detect suspicious activities around a facility, with several cameras daisy-chained on its fence line. A vehicle parking near that fence line can easily be detected by the single camera covering the area where the vehicle parks. A vehicle circling around the facility multiple times cannot be detected by the single-camera system, however. A multi-camera surveillance system tracking targets from one camera to the next can overcome such limitations. This section describes the key challenges of such a system and a solution that has been demonstrated to work well in several real-life deployments.
18.3.1 Data Sharing
One of the key questions when designing the cross-camera surveillance system is at which stage in the pipeline of Figure 18.2 the single camera units should share their information. In principle, fusion can be performed at all blocks of the pipeline, but the decision on which block has a major impact on speed, performance, communication bandwidth, and flexibility. Performing fusion before foreground detection or blob generation requires building a mosaic, which is very expensive in processor, memory, and bandwidth use. In addition, it usually requires the cameras to have overlapping fields of view and similar illumination and image resolution, which may not always be satisfied in real applications. Fusing at the video metadata level requires merging all of the metadata from the cameras into a full representation of the environment. This approach distributes the most time-consuming processing (content analysis) between the different sensors, eliminates the need for a mosaic, and minimizes communication, since only the metadata, no video or imagery, needs to be transmitted. Given these advantages, our system communicates only the video metadata for fusion. For example, the video metadata from a single camera unit for each video frame can include the following information: the camera time stamp, a list of targets visible in that frame, and the targets’ properties, such as bounding box, centroid, footprint, shape, velocity, and classification. The metadata can also include information on the scene, such as camera movement or sudden light change. The exact contents of the metadata also depend on the application. Some of the fields just listed are mandatory to perform fusion; others, to enable the detection of certain event types; and some only to provide additional information with alerts.
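As a concrete illustration of such a packet, the structure below carries the kinds of fields listed above for one frame of one camera. The field names and values are invented for this sketch; they are not a published ObjectVideo metadata schema.

```python
# Hypothetical per-frame metadata packet from one view sensor.
packet = {
    "sensor_id": "cam-07",
    "view_timestamp": "2009-03-14T10:22:31.400Z",
    "targets": [
        {
            "target_id": 42,
            "bbox": [312, 144, 64, 170],     # x, y, width, height in pixels
            "centroid": [344, 229],
            "footprint": [344, 314],         # assumed to lie on the ground plane
            "velocity": [3.1, -0.4],         # pixels per frame in the view
            "classification": "human",
            "shape": {"aspect_ratio": 0.38},
        }
    ],
    "scene": {"camera_moved": False, "sudden_light_change": False},
}
```

Only a packet like this, rather than video or imagery, has to cross the network, which is what keeps the bandwidth requirements of the fusion sensor low.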
18.3.2 System Design
The cross-camera fusion system is illustrated in Figure 18.3. As shown, the video from each camera is initially processed the same as in a single-camera system: The content analysis module translates it into metadata, which is then sent from all sensors to the centralized data fusion module. Fusion combines the metadata from all sensors into a common coordinate system, but still maintains the metadata format so that the fused metadata can be fed to the event detection module. This module is identical to the one
FIGURE 18.3 Block diagram of a cross-camera fusion system.
used in the single sensor system of Figure 18.2. The rules and metadata are all represented in relative coordinates. For a single camera sensor the coordinates are relative to a single frame, while for the fusion sensor they are relative to the global map used. This means that the metadata is the same whether it is generated by a view or a map sensor. This design has many benefits. The time-consuming video processing is distributed among the single camera sensors. Communication bandwidth requirements are low because only the video metadata is transmitted. The sensor architecture is simple: The system running on the individual sensors is almost identical to the single-camera system. Content analysis turns the video into metadata. It is still possible to have single-camera event detection running on the individual sensors, if required. The only difference is that the metadata is streamed out to the fusion sensor to enable multi-camera event detection. The fusion sensor is different, but still has much in common with single-camera sensors. The main difference is the frontend, which ingests multiple video metadata streams instead of video and uses the data fusion module to convert those streams into fused metadata. Once the various metadata streams are fused into one, however, the event detection is the same as in the individual sensors. This similarity between the different modes means that our system has only a single main executable, which can be configured at installation to act as a standalone single camera sensor, as a single camera sensor used for fusion, or as a fusion sensor. This unification is extremely important from a practical perspective, greatly simplifying deployment and maintenance. More and more IVS systems are moving toward performing computations on the edge—embedded in a camera. This architecture works well for that approach as well. The embedded system processes the video and generates the metadata, which is then sent to a centralized fusion sensor. Our approach also seamlessly supports the forensic applications described earlier. The video metadata can be stored in the individual sensors, performing fusion and event detection at the time of forensic processing, or the fused metadata can be stored, in
which case the forensics is the same as single-camera forensics. Moreover, it is possible to later convert a standard installation into a cross-camera system for forensic analysis. If the single-camera video metadata has been stored, even calibration can be performed later and forensics executed on the previously stored data.
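The single-executable design discussed in this section can be pictured as a thin dispatch over shared components. The sketch below is purely illustrative: the mode names, the placeholder front ends, and the rule representation are assumptions, and the real system's configuration mechanism is not shown.

```python
# Sketch of one executable configurable as three sensor roles. Every role
# reuses the same event-detection step; only the front end differs.
from enum import Enum

class Mode(Enum):
    STANDALONE = 1   # content analysis + local event detection
    VIEW = 2         # content analysis, metadata streamed to a fusion sensor
    FUSION = 3       # ingests metadata streams, fuses them, detects events

def content_analysis(source):        # placeholder single-camera front end
    yield {"source": source, "targets": []}

def metadata_ingest(sources):        # placeholder fusion front end
    yield {"source": sources, "targets": []}

def run(mode, config):
    frontend = (metadata_ingest(config["inputs"]) if mode is Mode.FUSION
                else content_analysis(config["inputs"]))
    rules = config.get("rules", [])  # rules are predicates over metadata
    for metadata in frontend:
        if mode is Mode.VIEW:
            pass                     # here the metadata would be streamed out
        alerts = [name for name, rule in rules if rule(metadata)]
        if alerts:
            print("alerts:", alerts)

run(Mode.STANDALONE, {"inputs": "camera01.mp4",
                      "rules": [("any_target", lambda m: bool(m["targets"]))]})
```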
18.3.3 Cross-Camera Calibration
The major challenge of a cross-camera tracking system is how to associate the targets detected and tracked in different individual cameras. The data fusion process illustrated in Figure 18.3 requires the single-camera sensors to be calibrated in such a manner that the targets in different cameras have a common coordinate system. Traditional camera calibration approaches [35] use a 3D reference object with a known Euclidean structure. However, setting up the object with great accuracy is difficult, requires special equipment, and does not scale well. To overcome some of these problems, a simple and practical camera calibration technique using a model plane with a known 2D reference pattern was introduced [36, 37]. In this technique, the user places the model plane or the camera at two or more locations and captures images of the reference points. Camera parameters are derived from the model plane to image plane homographies computed from correspondences between the reference points and their projections. Although this algorithm is simple, it yields good results mainly for indoor and/or close-range applications, where the object is big enough that its features can be easily and accurately detected and measured. To make this approach viable for large-area outdoor applications, the reference object must be very large to provide the necessary accuracy. The proposed system calibrates all camera views to a global site map based on feature correspondences between them. In the current implementation the user manually selects these corresponding features for each view using a simple user interface. The system also supports entering latitude/longitude positions for the map points and selecting the corresponding points on the view. The global site map here may be a fine-resolution satellite image for an outdoor application or a blueprint drawing for an indoor application. In this approach, we assume that in each view the targets are on the same plane, called the image ground plane; the global map also represents a single plane in the world, called the map ground plane. Thus for each camera, the mapping between point x in the view and the corresponding point X in the map is fully defined by a homography H [38, 39]:

X = Hx    (18.1)

where H is a 3×3 homogeneous matrix. The map and view points are represented by homogeneous 3-vectors as X = (X, Y, 1)ᵀ and x = (x, y, 1)ᵀ, respectively. The scale of the matrix does not affect the equation, so only the eight degrees of freedom corresponding to the ratio of the matrix elements are significant. The camera model is completely specified once the matrix is determined. H can be computed from a set of map–view correspondence points. From equation 18.1, each pair of correspondence points provides two equations. Given the 8 unknowns in H, n ≥ 4 point pairs are needed. With H in vector form as h = (h11, h12, h13, h21, h22, h23, h31, h32, h33)ᵀ, equation 18.1 for n points becomes Ah = 0,
where A is a 2n×9 matrix:

\[
A =
\begin{bmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 X_1 & -y_1 X_1 & -X_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 Y_1 & -y_1 Y_1 & -Y_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_n & y_n & 1 & 0 & 0 & 0 & -x_n X_n & -y_n X_n & -X_n \\
0 & 0 & 0 & x_n & y_n & 1 & -x_n Y_n & -y_n Y_n & -Y_n
\end{bmatrix}
\tag{18.2}
\]
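In code, solving this system amounts to taking the singular vector of A associated with its smallest singular value, as discussed next in the text. The NumPy sketch below is a minimal version of that computation; it is not the production calibration code and omits the precision weighting and iterative refinement described later in this section.

```python
# Minimal DLT estimate of the view-to-map homography H from n >= 4
# point correspondences, by building A (equation 18.2) and solving Ah = 0.
import numpy as np

def estimate_homography(view_pts, map_pts):
    """view_pts, map_pts: sequences of matching (x, y) and (X, Y) points."""
    rows = []
    for (x, y), (X, Y) in zip(view_pts, map_pts):
        rows.append([x, y, 1, 0, 0, 0, -x * X, -y * X, -X])
        rows.append([0, 0, 0, x, y, 1, -x * Y, -y * Y, -Y])
    A = np.asarray(rows, dtype=float)
    _, _, vt = np.linalg.svd(A)
    h = vt[-1]                       # unit vector minimizing |Ah|
    return h.reshape(3, 3)

def view_to_map(H, x, y):
    X = H @ np.array([x, y, 1.0])
    return X[0] / X[2], X[1] / X[2]  # back to inhomogeneous map coordinates
```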
The goal is to find the unit vector h that minimizes |Ah|, which is given by the eigenvector of the smallest eigenvalue of AᵀA. This eigenvector can be obtained directly from the singular value decomposition (SVD) of A. The key element in the process becomes finding correspondence points between the map and each camera view. These correspondence points, also called control points in image registration, provide unambiguous matching pairs between the map and the camera view. However, there are two potential problems with using only matching points for calibration. The first is that it may be difficult to find the precise corresponding point locations in some environments because of limited resolution, visibility, or the angle of view. An example is an overhead view of a road, in which the corner points of the broken lane-dividing lines theoretically provide good calibration targets. However, it may be difficult to reliably determine which lane-dividing line segment of the map view corresponds to which line segment in the camera view. The second problem is that the precision of the matching point pairs is usually unknown. The precision of a point measures the accuracy of the map-matching location with respect to the accuracy of the view location. For example, 1 pixel of movement away from a location in the camera image plane may cause 100 pixels of movement away from its original corresponding location on the map. This means the precision of this pair of matching points is low. When we calibrate the camera view onto the map, we minimize the distance between these pairs of matching points. Assigning higher weight to points with higher precision improves calibration performance. To overcome these two problems, we introduce matching line features in conjunction with the matching points. A line feature is typically specified by two points, as a line segment, but for a matching line feature only the corresponding lines have to match, not the segments. For example, when viewing a road, it might be difficult to find point correspondences, but the dividing line and the edges of the road define good line features. Hence the line features help to overcome the first limitation previously stated. Figure 18.4 illustrates matching line segments. Matching line features also help overcome the second problem by providing a good precision metric, which helps to improve calibration accuracy by allowing the system to put increased weight on more precise control points. Line features can be directly specified by the user or computed from pairs of user-defined calibration control points. Additional control points are then computed from the intersection of line features. The precision of such a computed control point is determined as follows. First, use the point of intersection on the map as the reference point. Next, add small random Gaussian noise with zero mean and small standard deviation (e.g., 0.5) to the end points of all
FIGURE 18.4 Selecting matching features in a corresponding map (a) and view (b) pair. It is much easier to find matching lines than points. Matching features are represented by the same index.
the related line segments on the map and recompute the point of intersection. Finally, calculate the distance between the new and the reference point of intersection. Repeat this process many times and compute the mean distance, which reflects the precision of the corresponding point of intersection. The point of intersection is used as a control point only if its corresponding mean distance is less than a threshold determined by the desired accuracy of the calibration. Figure 18.5 illustrates the view-to-map camera calibration process. First, control points are computed as just described using the matching features selected manually by the operator on a GUI provided by the system. Next, the image plane–to–map plane homography is computed using the Direct Linear Transformation algorithm [39]. This least squares method is very sensitive to the location error of the control points, especially if the number of the control points is small or the points are clustered in a small portion of the image. In the proposed approach, matching line features are used to iteratively improve the calibration. In each iteration, control points are added or adjusted, until the change in feature-matching error falls below a threshold. Since a line segment is a more representative feature than a single point and its location is more reliable than that of a single point, this iterative refinement process very effectively reduces calibration errors; in our system it always rapidly converges to the optimal result. In each iteration, the corresponding view and map features are used to estimate the homography, which is used to transform the view features onto the map. Then the calibration error is computed as the average distance on the map between the transformed view features and the corresponding map features. For a point feature, the distance is simply point to point. For a line feature, the distance is the enclosed area between the two line segments, as illustrated by the shaded area in Figure 18.6. To reduce this calibration error, we add new control points based on line features. In Figure 18.6, L1 and l1 represent a pair of matching line segments on the map and view, respectively. Note that their ending points are not required to be matching points; thus they may not be in the control point list initially. Once H is estimated, the view line segment l1 is transformed into the map line segment L1′ with points P1 and P2. The goal is to find an estimate of H that minimizes the distance between L1′ and the original
FIGURE 18.5 Camera calibration block diagram.
FIGURE 18.6 Adjusting control points.
matching line segment L1 on the map. This is obtained by minimizing the shaded area between L1′ and L1. To achieve this, P1 and P2 are projected onto line L1, yielding P1′ and P2′. Thus, point pairs (p1, P1′) and (p2, P2′) become matching pairs of points and are added to the list of control points for the next iteration. In subsequent iterations, these additional control points are further adjusted by projecting them onto line L1 based on the newly estimated homography H, as long as these adjustments further improve calibration accuracy. These calibration processes, except the manual matching feature selection operation using a GUI, are all performed automatically by the system. Although the map–view mapping obtained is not a full camera calibration, it provides the most valuable information for cross-camera target tracking. First, since all of the camera views are calibrated to the same map, the corresponding targets from multiple cameras can be naturally associated based on their map locations. Second, by using actual map scales, the physical size and velocity of the target can be estimated, providing
FIGURE 18.7 Examples of camera EFOVs on the map.
useful new target information. Third, the map–view mapping can be used to estimate the effective field of view (EFOV) of each camera on the map. The EFOV defines the effective monitoring area of each camera, helping with planning camera placement and performing cross-camera target hand-off. It includes the map points where the view size of a human is above a threshold and the mapping error is low. Figure 18.7 shows an example of some EFOVs on a site map.
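As an illustration of the second point, the sketch below projects a target's view footprints onto the map and converts the displacement into a physical speed using the map scale. The 0.25 meters-per-map-pixel scale and the 10 fps rate are assumed example values, and the view_to_map helper simply applies the homography of equation 18.1.

```python
# Sketch: estimate a target's physical speed from its view footprints using
# the map-view homography H and the map scale (assumed values below).
import numpy as np

METERS_PER_MAP_PIXEL = 0.25   # assumed map scale
FRAME_RATE = 10.0             # assumed processing rate, frames per second

def view_to_map(H, x, y):
    X = H @ np.array([x, y, 1.0])
    return X[0] / X[2], X[1] / X[2]

def speed_mps(H, footprints):
    """footprints: consecutive (x, y) view-footprint points of one target."""
    pts = np.asarray([view_to_map(H, x, y) for x, y in footprints])
    steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # map pixels/frame
    return float(steps.mean()) * METERS_PER_MAP_PIXEL * FRAME_RATE
```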
18.3.4 Data Fusion
After processing a video frame, each individual view sensor sends to the fusion sensor a data packet containing the view sensor ID, time stamp, and all available target video metadata for that frame. The time stamp associated with each data packet is called the view time stamp. The incoming data packet rate is the video sensor processing frame rate. In our system, we cap the video processing frame rate at 10 fps, but it might be slightly lower when the video scene is too busy or the CPU load is too high. A data packet will be sent to the fusion sensor at each view time stamp regardless of whether targets are detected in the scene. The high-level goal of the data fusion process is to combine the metadata from the individual view sensors into a fused metadata stream on the global map. For this to work properly, the metadata from the individual sensors has to be synchronized so that the fusion sensor knows which metadata entries correspond to the same time. The various sensors are synchronized using third-party time synchronization software. This means that the metadata has synchronized time stamps, but data can still arrive out of sync because of network delays. To compensate for any delay, the fusion sensor introduces some latency and maintains a data buffer for the input metadata. The data buffer is batch-processed every T seconds, where the configurable latency T depends on the typical network delay and is usually between one half and one second. Thus the fusion sensor is processing at a lower frame rate than the input view sensors—for example, if T is a half second, the fusion process rate is 2 fps. Each output metadata packet from the fusion
FIGURE 18.8 Block diagram of the data fusion process.
sensor also has an associated time stamp, a map time stamp, which is the time of each fusion process. Figure 18.8 illustrates one iteration of the target data fusion process. First, we introduce two types of target representation: view target and map target. A view target corresponds to a single target detected by an input sensor. In addition to the input target features, which are in the view coordinate, it has map location and size data, computed using the map–view mapping. The map target corresponds to a physical target in the site. One map target may contain multiple view targets from different views or from the same view but in different temporal segments. At each map time stamp, the map target has a primary view target that provides the most reliable representation of the physical object at that time. The target data fusion is performed on two levels: view target and map target, and is described in detail in the following paragraphs. On each set of synchronized input metadata, the fusion sensor performs four pipeline processes. First, the update view target module builds its own representation of each view target by converting all location and size information into map coordinates and storing it along with the view-based data. This step basically converts all input metadata into view targets. The system has two types of view targets: fused, which have already been associated with a map target, and new, which are incoming view targets that have not been associated with or converted into a map target yet. Next, the view target fusion module detects if any new view target has become stable, meaning that it has a consistent appearance and is tracked with high confidence. This requirement is used to temporarily ignore nonsalient and partially occluded targets, and targets on the image boundaries, where both location and appearance may be inaccurate. Once a new view target becomes stable, it is compared with all existing map targets to see if there is a good match. If the new view target matches an existing map target, it will be merged into it and will trigger the map target update module to update the corresponding map target. Otherwise, the fusion sensor will produce a new map target based on this stable new view target. Either way, after the view target fusion process, a stable new view target becomes a fused view target. The matching measure between two targets is the combination of three probabilities: location matching, size matching, and appearance matching. The location-matching probability is estimated using the target map location from the map–view mapping. The estimated target map location is modeled by a 2D Gaussian probability density function.
The uncertainty range varies in different directions, determined by factors such as target velocity, target-tracking confidence, map–view mapping precision, and so forth. The size-matching probability is computed from the map size of each target—one way to define this measure is the ratio of the smaller target size over the larger target size. The appearance-matching probability is obtained by comparing the appearance models of the targets under investigation. The appearance model used in our system is a distributed intensity histogram, which includes multiple histograms for different spatial partitions of the target. The idea here is to focus more on the relative intensity distribution of a target, not its intensity or color directly. We use the observation that the intensity and color properties of a target are not stable from camera to camera; the relative intensity distribution is more consistent. For example, a person with a light-color shirt and dark-color pants may have quite different average intensity and color in different cameras, but the property that his upper part appears brighter than his lower part may always hold. The appearance-matching score is the average correlation between the corresponding spatially partitioned and normalized histograms. As we discussed earlier, appearance matching is not very reliable because of large variations in camera type, view angle, illumination, and so forth. In our system, the strategy is to use location matching to find potential correspondence candidates and size and appearance matching to filter out disqualified ones. More target matching approaches can be found in [15, 18, 19, 26, 27]. One map target can represent multiple view targets from different views—for example, during the time a target is in the overlapping area of two cameras. For each time stamp, the map target has a primary view target that provides the most reliable representation of the physical object at that time. The map target update module determines this primary view target and updates the general target properties, such as map location, velocity, and classification type. It may also reclassify targets based on additional map target properties, such as physical size and speed, or if classifications from the individual views contradict each other. In addition, it determines when to terminate tracking a map target. When a map target cannot find any corresponding view targets, it is either occluded by other targets or static background, or it has moved out of the field of view of all cameras. The tracking status of such a map target is marked as “Disappeared.” The map target update module keeps predicting the map location of such targets. The prediction cannot last too long, because it becomes inaccurate and too many predicted targets in the site reduce the reliability of the target fusion process as new view targets emerge. On the other hand, too short a prediction reduces the system’s ability to handle target occlusion and camera-to-camera target hand-off, particularly for nonoverlapping cameras. To determine the optimal prediction duration for a “Disappeared” map target, the EFOVs of the cameras on the site map are used, as illustrated in Figure 18.7. The prediction duration is determined by how long the target stays in the EFOVs based on the predicted location and velocity.
Once a target moves out of the EFOVs on the map and does not come back according to its moving direction, or it disappears for too long a time, the system stops tracking it and removes it from the list of targets on the fusion sensor. The map target update step uses only stable view targets, but even these can have errors, for example, due to occlusions. If, say, a human target enters the camera view
with his legs occluded by a car, the corresponding target may appear to be stable even though its map location might have significant error depending on the camera viewing angle; the target size and appearance will also be inaccurate. These inaccuracies in target appearance and map location estimation may cause a false view target fusion decision. The consequence is either merging multiple physical targets into one map target or breaking one physical target into multiple map targets. The task of the map target fusion module is to correct such mistakes. Similar to the view target, each map target has an associated stability status based on how consistent its shape and size are in a temporal window. The map target fusion module tests whether a map target needs to be split or merged when the stability status of the map target changes from one stable state to another but with different size and/or appearance. For example, when the human target in the earlier example moves out of the occlusion, its size, appearance, and map location suddenly change. If this target was merged into an existing map target at the view target fusion stage, the map target may need to be split. Or if this target was considered as a new map target at the time of view target fusion, it may now have to be merged into an existing map target. When a split happens, one of the view targets in the old map target is taken out and a new map target is created using this split-off view target. When a merge happens, the map target with the shorter history will be merged into the one with the longer history.
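The location, size, and appearance terms described in this section might be combined as in the sketch below, where the three scores are simply multiplied. The Gaussian location model, the two-part (upper/lower) histogram split, the bin count, and the multiplicative combination are assumptions of this sketch, not the system's actual formulas.

```python
# Sketch of a three-part view-target / map-target matching score:
# Gaussian location score, size-ratio score, and the average correlation of
# spatially partitioned, normalized intensity histograms.
import numpy as np

def location_score(p_view, p_map, sigma=5.0):
    d2 = float(np.sum((np.asarray(p_view, float) - np.asarray(p_map, float)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def size_score(size_a, size_b):
    return min(size_a, size_b) / max(size_a, size_b)

def appearance_score(patch_a, patch_b, parts=2, bins=16):
    """patch_a, patch_b: grayscale target images as 2D NumPy arrays."""
    corrs = []
    for a, b in zip(np.array_split(patch_a, parts, axis=0),
                    np.array_split(patch_b, parts, axis=0)):
        ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
        hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
        corrs.append(np.corrcoef(ha, hb)[0, 1])
    return float(np.nanmean(corrs))

def match_score(p_view, p_map, size_view, size_map, patch_view, patch_map):
    # Location proposes candidates; size and appearance filter weak matches.
    return (location_score(p_view, p_map)
            * size_score(size_view, size_map)
            * appearance_score(patch_view, patch_map))
```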
18.4 EXAMPLES
The system described in this chapter was successfully installed in a number of applications. This section describes two very different installations, both using Windows XP on Intel processors. A 2.8 GHz dual-CPU machine can comfortably run up to four sensors, each at around 10 fps.
18.4.1 Critical Infrastructure Protection
The most typical application of the camera fusion system is protecting a large site with several cameras daisy-chained around the perimeter, as illustrated in Figure 18.9. This system was installed around a military airfield, with 48 fixed and 16 PTZ cameras covering a 5-mile perimeter. The 48 fixed cameras were daisy-chained, with overlapping FOVs, providing full fence coverage. The PTZ cameras were installed to provide better resolution imagery in case of intrusion. They were typically colocated with one of the fixed cameras and provided coverage for 2 to 4 leader cameras, depending on field topology and visibility. The first step of the installation is calibration, consisting of two major steps: calibrating the fixed cameras to the map and calibrating the PTZ cameras to the fixed cameras. Calibration of the fixed cameras is performed as described in Section 18.3.3: Each camera is calibrated to a global site map by manually selecting correspondences between the camera views and the map. The system provides visual feedback of the calibration by superimposing the camera views on the map, thus showing whether they line up correctly.
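The visual feedback step can be approximated with a standard perspective warp, as in the sketch below, which blends one camera frame onto the site map using its estimated homography. The file names, the blending weight, and the assumption that H maps view pixels directly to map pixels are all illustrative details of this sketch.

```python
# Sketch of calibration feedback: warp a camera frame into map coordinates
# with its homography H and blend it over the site map image.
import cv2
import numpy as np

def overlay_view_on_map(map_img, view_img, H, alpha=0.5):
    h, w = map_img.shape[:2]
    warped = cv2.warpPerspective(view_img, H, (w, h))
    covered = (warped.sum(axis=2) > 0)[:, :, None]     # where the view projects
    blend = cv2.addWeighted(map_img, 1 - alpha, warped, alpha, 0)
    return np.where(covered, blend, map_img)

site_map = cv2.imread("site_map.png")                  # assumed input files
frame = cv2.imread("cam07_frame.png")
H = np.load("cam07_homography.npy")                    # 3x3 view-to-map matrix
cv2.imwrite("cam07_overlay.png", overlay_view_on_map(site_map, frame, H))
```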
FIGURE 18.9 Wide-area surveillance by daisy-chaining cameras (cones) around the perimeter of the protected facility.
The PTZ cameras have to be calibrated to all fixed cameras they interact with. The process is very similar to calibrating the fixed cameras, requiring the user to manually select correspondences between them and the PTZ cameras. As future work, we are planning to simplify this process by calibrating the PTZ unit to the map as well, instead of calibrating it to multiple fixed cameras, as currently done. The next step of the configuration is defining the rules. The system allows the user to set up rules on individual camera views, on the map, or on both. The most important rule for the user was a multi-segment tripwire following the fence line around the entire perimeter, configured to detect when targets cross it from the outside in. A challenge when defining rules on a map is to accurately localize them. The user interface for defining rules has several features to help with precise rule definition. The rule can be drawn initially at a more zoomed-out setting, showing the full area. Then portions can be viewed and fine-tuned at higher zoom levels. Besides seeing the rule on the map, the interface also allows projecting, visualizing, and editing it in a camera view. This method typically provides the highest resolution, and it allows fixing some inaccuracies resulting from potential calibration errors. In the far view of a camera, being just a pixel off may translate into meters on the map, and that discrepancy in turn can mean the difference between a rule aligned with the fence and one lying on the public road around the fence, generating several false alarms. Such inaccuracies can easily be visualized and corrected by viewing the rules on the map. In addition to the multi-segment tripwire rule, some areas are protected with rules detecting entering and loitering in an area of interest. Once the rules are defined, the system will start to check for violations and alert if one is detected. All such alerts contain the location of the detected event, which by default includes location in the camera with the best view of the target and location on the map. If the map is just an image, location is represented as an image coordinate. If the map
has associated geographical information, such as a world file, location is expressed as latitude/longitude coordinates as well. The system is very flexible and can easily accommodate configuration changes. Additional cameras can be added quickly by calibrating them to the map and the PTZ cameras. The rest of the sensors and rules are completely unaffected by this addition.
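On the map, the multi-segment tripwire reduces to a segment-intersection test with a direction condition on consecutive fused target positions. The sketch below is a simplified, hypothetical version of such a check; in particular, taking the left side of each directed fence segment as "outside" is a convention of this sketch, not of the deployed rule engine.

```python
# Sketch of a map-level, multi-segment tripwire rule: report a crossing when
# a target's consecutive map positions intersect a fence segment and move
# from the "outside" side to the "inside" side.
def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _segments_intersect(p1, p2, q1, q2):
    d1, d2 = _cross(q1, q2, p1), _cross(q1, q2, p2)
    d3, d4 = _cross(p1, p2, q1), _cross(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0

def crossed_outside_in(prev_pos, cur_pos, tripwire):
    """tripwire: list of map points defining the multi-segment fence line."""
    for q1, q2 in zip(tripwire[:-1], tripwire[1:]):
        if _segments_intersect(prev_pos, cur_pos, q1, q2):
            # Convention of this sketch: left of the directed segment = outside.
            if _cross(q1, q2, prev_pos) > 0 and _cross(q1, q2, cur_pos) < 0:
                return True
    return False

fence = [(0, 0), (100, 0), (100, 80)]                   # example map points
print(crossed_outside_in((50, 10), (50, -5), fence))    # True: outside -> inside
```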
18.4.2 Hazardous Lab Safety Verification
The same system was used in a very different application scenario by a major research university. The goal was to detect violations of the so-called two-person rule in some infectious disease laboratories. The rule means that one person should never be alone in the lab—there should always be at least two. The only exception is the few seconds when people enter and exit, so if one person enters the empty lab, there should be no alert as long as another person enters within a few seconds. The most straightforward solution would be to count the number of people entering and exiting the lab and use the difference of the two as the person count. The drawback of this approach is that it has no mechanism for correcting errors. For example, if two people enter together and are counted correctly, but leave so close to each other that they are miscounted as a single person, the system will false-alert even though there is nobody inside. Similarly, the system could miss an event if three people entering are mistakenly counted as two and later two of them leave. In this case the system thinks the area is empty, although there is still a person inside. For these reasons a robust system needs to monitor the whole lab so that it can continuously track the people inside, maintaining an up-to-date count. In such a setup, even if for a short while the system miscounts the targets, it can recover and correct the count, potentially before sounding a false alert. To obtain good coverage, minimizing occlusions, which are the main reason for counting errors, the cameras were mounted on the ceiling. Some labs were small enough to be covered by a single wide-angle camera, but others required more than that. For these larger labs fusion was very important; otherwise, people in the area where two cameras overlap would have been counted by both cameras. System configuration was conceptually similar to the critical infrastructure example, but there were some differences. The map was replaced with a blueprint or sketch of the lab. The requirement was that the blueprint had to be of the correct scale and contain sufficient identifiable features on the floor (ground plane) for the calibration. All cameras were calibrated to the blueprint using manually selected feature correspondences. The definition of the rule included specifying an area of interest (AOI) on the blueprint, covering the whole lab, and specifying a minimum time a person has to be alone to trigger an alert. The goal of this minimum time is to help eliminate false alarms in two cases: (1) Typically people enter and exit one at a time, so for a while a person is alone but no alarm is needed if the second person enters/exits shortly; (2) sometimes people in the lab are occluded temporarily, and the minimum time prevents false alerts for short-lived occlusions. The cameras were running content analysis, reporting all human targets to the fusion sensor. The fusion sensor projected these targets onto the blueprint, fusing targets in
the overlap area into a single human. It counted the number of people and alerted if it saw only a single person for longer than a user-defined minimum time, typically around 30 seconds.
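The rule itself can be expressed very compactly on the fused metadata stream, as in the sketch below, which alerts whenever the fused people count stays at exactly one for longer than the minimum time. The packet format and the 30-second threshold are assumed example details.

```python
# Sketch of the two-person rule on fused metadata: alert when exactly one
# person has been in the lab continuously for longer than min_alone seconds.
def two_person_rule(packets, min_alone=30.0):
    """packets: time-ordered iterable of (timestamp_seconds, people_count)."""
    alone_since = None
    for t, count in packets:
        if count == 1:
            if alone_since is None:
                alone_since = t
            elif t - alone_since >= min_alone:
                yield t                 # alert: one person alone for too long
                alone_since = t         # rearm instead of alerting every packet
        else:
            alone_since = None          # zero or two-plus people resets the timer

stream = [(0, 2), (10, 2), (20, 1), (35, 1), (55, 1), (60, 2)]
print(list(two_person_rule(stream)))    # [55]: alone since t=20, over 30 s later
```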
18.5 TESTING AND RESULTS
Testing an entire cross-camera tracking and fusion system is a very challenging undertaking. In general, testing any video analytics solution is complex because of the practically infinite variety of potential video input. Properly testing such a system must include live video cameras as well as prerecorded video. Long-term testing on live outdoor cameras subjects the system to unexpected conditions. Prerecorded videos allow testing a wide range of scenarios reproducibly, making debugging problems easier. The proposed system has to be tested on different levels:
■ Video analytics and event detection on individual sensors
■ Data fusion and event detection on the fusion sensor
■ Follower camera control
■ System
■ User interface
The video analytics algorithms are the same as in the COTS single-camera ObjectVideo system. They have been tested extensively using a wide range of video clips with corresponding event and target-level ground truth. Target-level ground truth includes all moving targets in the videos, and the tests measure how reliably the system detects, tracks, and classifies them. Event-level ground truth includes the definition of a set of rules for various clips and testing whether the system correctly detects violations, without false alarms. All these tests are executed and evaluated automatically. In addition, the algorithms are constantly tested in a wide range of installations. Testing the data fusion component poses a greater challenge, requiring synchronized video data with relevant events and several targets from several video cameras. Collecting such data is difficult, requiring a complex setup, much hardware, and a large test area. It is easy to create such a setup for a 2-camera lab environment like the one described in Section 18.4.2, but re-creating the setup of Section 18.4.1 is extremely challenging. To enable testing of such complex configurations, ObjectVideo developed the ObjectVideo Virtual Video Tool (OVVV) [40, 41]. OVVV is a publicly available visual surveillance simulation testbed based on the commercial Half-Life 2 game engine from Valve Software. It simulates multiple synchronized video streams from a variety of camera configurations, including static, PTZ, and omnidirectional cameras, in a virtual environment populated with computer- or player-controlled humans and vehicles. Figure 18.10 shows snapshots of some virtual video created using the OVVV tool. To support performance evaluation, OVVV generates detailed automatic ground truth for each frame. We used OVVV to define a large area, populate it with various moving targets, place static and PTZ cameras in different configurations, and generate synchronized videos from all cameras. This setup allowed us to test the system with complex scenarios
FIGURE 18.10 Test scenes created with the OVVV tool.
involving a large number of cameras in a repeatable way, and to evaluate the performance against both target- and event-level ground truth. This capability is crucial for system testing as well, since it allows long-term large-scale testing of the whole system, testing stability and robustness. The OVVV supports PTZ cameras as well, so it allows testing the follower cameras.
18.6 FUTURE WORK
We are looking at improving the overall system in several ways. The current fusion, as described in Section 18.3.4, uses location, in conjunction with other features, as the strongest cue for fusing targets. In crowded environments, however, a more sophisticated method is required. We are investigating additional features, such as color and shape, to properly resolve this challenge. We are also looking into making the system more scalable. The current implementation was tested with over 50 individual sensors communicating with the fusion sensor, but an installation covering very large areas with hundreds of sensors would be problematic, with too much data flooding the fusion sensor and overwhelming its processing. A more scalable solution can use multiple layers of fusion: a fusion sensor handling a smaller cluster of sensors and multiple such fusion sensors communicating to higher-level fusion sensors. Since the metadata from a fusion sensor and from a view sensor has the same format, we can also treat a fusion sensor as an input sensor for a higher-level fusion sensor. Figure 18.11 illustrates such an approach. If the same site map is used, the mapping
FIGURE 18.11 Multi-level camera fusion approach using a cascade of fusion sensors.
between two fusion sensors is simply an identity matrix. This is a natural extension of the system we described earlier. The major disadvantage is that extra system latency will be introduced at each layer of the data fusion process.
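Because fused metadata keeps the same format as view metadata, a higher-level fusion sensor can treat lower-level fusion sensors as ordinary inputs. The sketch below illustrates that composition with invented class and field names; real target association is replaced by simple concatenation.

```python
# Sketch of cascaded fusion: a fusion node consumes packets in the same
# format it emits, so its inputs may be view sensors or other fusion nodes.
class ViewSensor:
    def __init__(self, name):
        self.name = name

    def metadata(self, timestamp):
        return {"sensor_id": self.name, "timestamp": timestamp,
                "targets": [{"id": f"{self.name}-1", "map_pos": (0.0, 0.0)}]}

class FusionNode:
    def __init__(self, name, inputs):
        self.name = name
        self.inputs = inputs          # ViewSensor or FusionNode instances

    def metadata(self, timestamp):
        targets = []
        for src in self.inputs:
            targets.extend(src.metadata(timestamp)["targets"])
        # A real node would associate and merge targets here.
        return {"sensor_id": self.name, "timestamp": timestamp,
                "targets": targets}

cluster1 = FusionNode("fusion-1", [ViewSensor(f"cam-{i}") for i in range(1, 6)])
cluster2 = FusionNode("fusion-2", [ViewSensor(f"cam-{i}") for i in range(6, 8)])
top = FusionNode("fusion-3", [cluster1, cluster2])   # fusion nodes as inputs
print(len(top.metadata(0.0)["targets"]))             # 7 targets, one per camera
```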
18.7 CONCLUSIONS
We presented a real-time cross-camera fusion system that provides powerful new capabilities over single-camera systems, but with very little extra complexity in terms of user interface and implementation. The system was successfully deployed in a variety of commercial applications.
REFERENCES
[1] R.T. Collins, A.J. Lipton, H. Fujiyoshi, T. Kanade, Algorithms for cooperative multi-sensor surveillance, Proceedings of the IEEE 89 (2001) 1456–1477.
[2] P. Kumar, A. Mittal, P. Kumar, Study of robust and intelligent surveillance in visible and multi-modal framework, Informatica 32 (2008) 63–77.
[3] W. Hu, T. Tan, L. Wang, S. Maybank, A survey on visual surveillance of object motion and behaviors, IEEE Transactions on Systems, Man and Cybernetics 34 (2004) 334–352.
[4] A. Mittal, L.S. Davis, M2 tracker: A multi-view approach to segmenting and tracking people in a cluttered scene, in: Proceedings of the European Conference on Computer Vision 1 (2002) 1836.
[5] S.L. Dockstader, A.M. Tekalp, Multiple camera tracking of interacting and occluded human motion, Proceedings of the IEEE 89 (2001) 1441–1455.
[6] H. Tsutsui, J. Miura, Y. Shirai, Optical flow-based person tracking by multiple cameras, in: Proceedings of the IEEE Conference on Multisensor Fusion and Integration in Intelligent Systems, 2001.
[7] I. Pavlidis, V. Morellas, P. Tsiamyrtzis, S. Harp, Urban surveillance system: From the laboratory to the commercial world, Proceedings of the IEEE 89 (2001) 1478–1497.
[8] I. Mikic, S. Santini, R. Jain, Video processing and integration from multiple cameras, in: Proceedings of the DARPA Image Understanding Workshop, 1998.
[9] G.P. Stein, Tracking from multiple view points: Self-calibration of space and time, in: Proceedings of the IEEE Conference CVPR, 1999.
[10] L. Lee, R. Romano, G. Stein, Monitoring activities from multiple video streams: Establishing a common coordinate frame, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 758–767.
[11] J. Black, T. Ellis, Multi camera image tracking, in: Proceedings of the Second IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2001.
[12] O. Javed, S. Khan, Z. Rasheed, M. Shah, Camera handoff: Tracking in multiple uncalibrated stationary cameras, in: Proceedings of the IEEE Workshop on Human Motion, 2000.
[13] O. Javed, M. Shah, Tracking and object classification for automated surveillance, in: Proceedings of the European Conference on Computer Vision, 2002.
[14] O. Javed, Z. Rasheed, K. Shafique, M. Shah, Tracking across multiple cameras with disjoint views, in: Proceedings of the International Conference on Computer Vision, 2003.
[15] O. Javed, K. Shafique, M. Shah, Appearance modeling for tracking in multiple non-overlapping cameras, in: Proceedings of the Computer Vision and Pattern Recognition, 2005.
[16] O. Javed, K. Shafique, Z. Rasheed, Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views, Computer Vision and Image Understanding, 2008.
[17] A. Gilbert, R. Bowden, Incremental modelling of the posterior distribution of objects for inter and intra camera tracking, in: Proceedings of the British Machine Vision Conference, 2005.
[18] A. Gilbert, R. Bowden, Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity, in: Proceedings of the European Conference on Computer Vision, 2006.
[19] C. Stauffer, K. Tieu, Automated multi-camera planar tracking correspondence modelling, in: Proceedings of the IEEE Computer Vision and Pattern Recognition, 2003.
[20] C. Stauffer, Estimating tracking source and sinks, in: Proceedings of the Second IEEE Workshop on Event Mining (in conjunction with CVPR 2003), 2003.
[21] K. Tieu, G. Dalley, W.E.L. Grimson, Inference of non-overlapping camera network topology by measuring statistical dependence, in: Proceedings of the International Conference on Computer Vision, 2005.
[22] T.J. Ellis, D. Makris, J. Black, Learning a multi-camera topology, in: Proceedings of the Joint IEEE International Workshop VS-PETS, 2003.
[23] D. Makris, T.J. Ellis, J. Black, Bridging the gaps between cameras, in: Proceedings of the Computer Vision and Pattern Recognition, 2004.
[24] J. Black, D. Makris, T. Ellis, Validation of blind region learning and tracking, in: Proceedings of the Joint IEEE International Workshop VS-PETS, 2005.
[25] A.V.D. Hengel, A. Dick, R. Hill, Activity topology estimation for large networks of cameras, in: Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, 2006.
[26] Y. Guo, H.S. Sawhney, R. Kumar et al., Robust object matching for persistent tracking with heterogeneous features, in: Proceedings of the Joint IEEE International Workshop VS-PETS, 2005.
456
CHAPTER 18 Video Surveillance Using a Multi-Camera Tracking [27] Y. Shan, H.S. Sawhney, R. Kumar, Unsupervised learning of discriminative edge measures for vehicle matching between non-overlapping cameras, in: Proceedings of the Computer Vision and Pattern Recognition, 2005. [28] T. Huang, S. Russell, Object identification: A Bayesian analysis with application to traffic surveillance, Artificical Intelligence (103) (1998) 1–17. [29] V. Kettnaker, R. Zabih, Bayesian multi-camera surveillance, in: Proceedings of the Computer Vision and Pattern Recognition, 1999. [30] R. Collins, A. Lipton, T. Kanade, A system for video surveillance and monitoring, in: Proceedings of the American Nuclear Society Eighth International Topical Meeting on Robotics and Remote Systems, 1999. [31] V. Pavlidis, P. Morellas, Two examples of indoor and outdoor surveillance systems: motivation, design, and testing, in: Proceedings of the second European Workshop on Advanced VideoBased Surveillance, 2001. [32] C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: Proceedings of the Computer Vision and Pattern Recognition, 1999. [33] K. Toyama, J. Krumm, B. Brummit, B. Meyers, Wallflower: Principles and practice of background maintenance, International Conference on Computer Vision, 1999. [34] T. Boult, Frame-rate multi-body tracking for surveillance, in: Proceedings of the DARPA Image Understanding Workshop, 1998. [35] R.Y.Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Journal of Robotics and Automation 3 (4) (1987) 323–344. [36] P.F. Sturm, S.J. Maybank, On plane-based camera calibration: A general algorithm, Singularities, Applications, in: Proceedings of the Computer Vision and Pattern Recognition, 1999. [37] Z. Zhang, Flexible camera calibration by viewing a plane from unknown orientations, in: Proceedings of the Seventh International Conference on Computer Vision, 1999. [38] J. Semple, G. Kneebone, Algebraic Projective Geometry, Oxford University Press, 1979. [39] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003. [40] G.R.Taylor, A.J. Chosak, P.C. Brewer, OVVV:Using virtual worlds to design and evaluate surveillance systems, in: Proceedings of the Seventh International Workshop on Visual Surveillance, 2007. [41] http://development.objectvideo.com.
CHAPTER
Composite Event Detection in Multi-Camera and Multi-Sensor Surveillance Networks
19
Yun Zhai, Rogerio Feris, Lisa Brown, Russell Bobbitt, Arun Hampapur, Sharath Pankanti, Quanfu Fan, Akira Yanagawa IBM Thomas J. Watson Research Center, Hawthorne, New York Ying-Li Tian City College of City University of New York, New York Senem Velipasalar University of Nebraska–Lincoln, Lincoln, Nebraska
Abstract Given the rapid development of digital video technologies, large-scale multicamera networks are now more prevalent than ever.There is an increasing demand for automated multi-camera/sensor event-modeling technologies that can efficiently and effectively extract events and activities occurring in the surveillance network. In this chapter, we present a composite event detection system for multicamera networks. The proposed framework is capable of handling relationships between primitive events generated from (1) a single camera view, (2) multiple camera views, and (3) nonvideo sensors with spatial and temporal variations. Composite events are represented in the form of full binary trees, where the leaf nodes represent primitive events, the root node represents the target composite event, and the middle nodes represent rule definitions. The multi-layer design of composite events provides great extensibility and flexibility in different applications. A standardized XML-style event language is designed to describe the composite events such that inter-agent communication and event detection module construction can be conveniently achieved. In our system, a set of graphical interfaces is also developed for users to easily define both primitive and high-level composite events.The proposed system is designed in distributed form, where the system components can be deployed on separate processors, communicating with each other over a network. The capabilities and effectiveness of our system have been demonstrated in several real-life applications, including retail loss prevention, indoor tailgating detection, and false positive reduction. Keywords: composite event detection, multi-camera networks Copyright © 2009 Elsevier Inc. All rights reserved. DOI: 10.1016/B978-0-12-374633-7.00019-7
457
458
CHAPTER 19 Composite Event Detection in Multi-Camera
19.1 INTRODUCTION Nowadays surveillance cameras are widely deployed in many aspects of our daily lives. They are found in public parks, government buildings, and commercial districts among other venues. Security providers have started to integrate video into their solutions for private home protection. Video streams generated from surveillance cameras enable a new technology for automatically monitoring neighborhoods, tracking traffic flows, locating potential threats, and performing post-event forensic analysis. As the result of decades of scientific advances in the field of computer vision, many automatic video analytic features, such as object detection, tracking, and primitive event analysis, have become more feasible in real-life scenarios. While relatively mature technologies are now available in the commercial market for analyzing and managing video data of single cameras, systems that handle data from multiple cameras (and possibly other sensors) are only now emerging. Given the rapidly improved performance and decreasing costs of network cameras and digital video management systems (DVM), large-scale surveillance networks, along with primitive video analytic features, have become more practical and affordable. Such networks pose great challenges in intelligent information retrieval, content understanding, data management, task scheduling, information visualization, and sharing on demand. Because of current development in multi-camera analytic technologies and the complex physical settings of the monitored scenes, operational multi-camera event modeling and detection still remain a problem in many real-life applications for large-scale video surveillance networks. Furthermore, a new trend in the field of video analytics is the integration with nonvideo signals—for instance, retailers analyzing the correlated patterns between video streams and point-of-sale (POS) streams to find abnormal activities and so discover potential fraud. Thus, an open system infrastructure for convenient integration of different types of signals in real time is one of the required functionalities in digital video surveillance. In this chapter, we introduce a composite event detection framework for modeling and detecting multi-camera, multi-modality events in surveillance networks. The proposed framework combines primitive events to formulate high-level composite events using logical operators with applicable spatio-temporal constraints. The primitive events are the building blocks of the composite events; they are the visual events that could be detected and generated by mature vision technologies. In addition, they could also be nonvisual signals with standard input/output, such as radio frequency identification (RFID), POS streams, and badge readers. The proposed composite event detection framework is designed in a such way that primitive events can be physically distributed on separate processing workstations. Event messages are transmitted over the network through proxy channels. Logically, composite events are formulated as multi-level fullbinary trees, where the root node represents the target composite event, the leaf nodes represent the primitive events, and the middle nodes represent the rule operators that combine primitive events. Intuitively, rules are composed of binary operators with operands that could be either primitive events and/or lower-level rules. The multilevel, distributive design of composite events enables the extensibility and scalability of the proposed event detection framework.
19.2 Related Work
459
A standardized XML-style event description language is designed to provide a unified definition schema for (1) storage of event definitions (both primitive and composite), (2) simple event parsing for constructing and initiating the event detection module, and (3) consistent communication between processing workstations across different platforms. As an essential component in realistic solutions, a set of graphical user interfaces (GUI) is developed for convenient event definitions.The user interfaces are implemented using Windows ActiveX, enabling the user to define events from any terminal on the network that includes aWeb browser.To facilitate forensic investigation and other post-event operations, we have designed an interactive event search and browsing schema where users can perform a query-based search on composite events and visualize them through textual descriptions, keyframes, streaming-on-demand, and the like. The effectiveness of our proposed multi-camera/sensor event detection framework is demonstrated in various real-life applications. One application is retail loss prevention, where the proposed system is deployed in retail stores to detect fraudulent “fake scans”; a second case is indoor tailgating detection, where the proposed system detects inconsistencies between badge reads and visually captured persons; the last scenario is reducing false positives caused by primitive event detectors. The remainder of this chapter is organized as follows: Section 19.2 reviews existing technologies and related work on event detection in multi-camera and multi-sensor network video streams. Section 19.3 presents the full description of the proposed high-level composite event detection framework. In Section 19.4, the IBM MILS system is introduced for composite event search and browsing. Three real-life applications are described in Section 19.5. Finally, Section 19.6 concludes the chapter.
19.2 RELATED WORK Events are understood as current states of ongoing actions/activities or the outcomes of completed actions. Intuitively, events involve self-actions of and/or interactions between objects.Thus, in visual analytics, event detection tasks are built on fundamental technologies such as object detection, tracking, and classification. Object detection and tracking are long-studied problems in the computer vision community. Decades worth of research in this area have created a vast body of knowledge that would be impossible to exhaustively explore here. Background modeling is a commonly used method for detecting active objects present in the camera field of view. In this method, image pixels are classified into background and foreground areas based on their statistical properties, such as color and gradient distributions over time. Representative work in this area includes Stauffer and Grimson [36], Tian et al. [39], and Cucchiara et al. [9]. Another approach in object detection and localization is template matching, where prelearned object models are compared with image patches with some scale and orientation variations. One of the most cited works in this category is Viola and Jones [44], who applied AdaBoost classifiers to achieve real-time face detection. Object tracking is the task of establishing correspondences of detected objects across video frames. Here, some of the most cited work includes the KLT tracker for point object tracking (Tomasi and Kanade [40]), scale-invariant feature
460
CHAPTER 19 Composite Event Detection in Multi-Camera transform (SIFT) descriptors by Lowe [24], mean-shift tracking by Comaniciu et al. [10], and geometric shape–based matching by Fieguth and Terzopoulos [12]. Event Detection in Single-Camera Views Object detection and tracking are the cornerstones of high-level event modeling because they provide spatio-temporal representations of target objects in the scene. Many event detection methods have been developed for single-camera analytics. A broad category of approaches is based on graphical models. Since events are strongly bounded by temporal patterns, states and transitions between states can be learned to build models—for example, finite state machines (Zhai et al. [51]), hidden Markov models (Phung et al. [28]), context-free grammar (Ryoo and Aggarwal [33]), and dynamic Bayesian networks (Laxton et al. [21]). Other approaches have also been pursued. Rao et al. [31] proposed a rank theory approach for matching object trajectories in different views. Boiman and Irani [5] applied a spatio-temporal interest point descriptor to detect irregular activities. Porikli and Haga [30] detected unusual events using eigenvector analysis and object properties. In the surveillance domain, many event detection methods are specifically designed to handle exactly defined events. For instance, abandoned-object detection was addressed by Stringa and Regazzoni [37]. Haritaoglu et al. [13] proposed a method to recognize events such as depositing/removing objects and exchanging bags. Watanabe et al. [46] introduced a system to detect persons entering/exiting a room. Event Detection Across Multiple Cameras In recent years, much research has been targeted at detecting events or activities across multiple cameras [2, 17, 19, 23, 32, 48, 53]. Ahmedali and Clark [2] laid the groundwork for a distributed network of collaborating and intelligent surveillance cameras. They proposed an unsupervised calibration technique that allows each camera module to indicate its spatial relationship to other cameras in the network. For each camera, a person detector was trained using the Winnow algorithm with automatically extracted training samples. To improve detection performance, multiple cameras with overlapping fields of view (FOVs) collaborated to confirm results. Lim et al. [23] designed a scalable, wide-coverage visual surveillance system by utilizing image-based information for camera control across multiple cameras. They created a scene model using a plane view. With this scene model, they can handle occlusion prediction and schedule video acquisition tasks subject to visibility constraints. Kim and Davis [19] proposed a multi-view multi-hypothesis approach for segmenting and tracking multiple persons on a ground plane. To precisely locate the ground location, the subjects’ vertical axes across different views are mapped to the top view plane and their intersection point on the ground is estimated. Iterative segmentation searching was incorporated into the tracking framework to deal with rapidly expanding state space due to multiple targets and views. Remagnino et al. [32] proposed a multi-agent architecture for understanding scene dynamics, merging the information streamed by multiple cameras. Wu et al. [48] described a framework consisting of detection, representation, and recognition for motion events based on trajectories from multiple cameras. 
Composite Event Detection Recently, detection of composite events (events that are composed of multiple primitive events using spatio-temporal logic) from single
19.3 Spatio-Temporal Composite Event Detection
461
or multiple cameras has attracted much attention from the research community [6, 20, 29, 41, 43]. Pietzuch et al. [29] introduced a generic composite event detection framework that can be added on top of existing middleware architecture, and they provided a decomposable core language for specifying composite events. Urban et al. [41] introduced distributed event processing agents, which provide a conceptual language for the expression of composite events, an underlying XML framework for filtering and processing composite events, and a distributed execution environment for processing events from multiple streams. Bry and Eckert [6] considered a data-log–like rule language for expressing composite event queries. They found that the temporal relationships between events can be utilized to make the evaluation of joins more efficient. Kreibich and Sommer [20] extended the existing event model to fulfill the requirements of scalable policy–controlled distributed event management, including mechanisms for event publication, subscription, processing, propagation, and correlation. The system developed by Velipasalar et al. [43] allows users to specify multiple composite events of high complexity, and then automatically detect their occurrence based on the primary events from a single-camera view or across multiple camera views. The event definitions were written to an XML file, which was then parsed and communicated to the tracking engines operating on the videos of the corresponding cameras.
19.3 SPATIO-TEMPORAL COMPOSITE EVENT DETECTION In this section, we present a high-level spatio-temporal event detection system for multicamera networks, in which the composite events are formulated by combining a group of primitive events using both logical and spatio-temporal conditions. Before going into a detailed description of our system, some terms need to be defined. In our system, primitive events refer to the events that can be directly derived by analyzing video features, such as motion, color, and tracking output. Some example primitive events are tripwire-crossing detection, motion detection, and speed detection. Primitive events can also be nonvisual: the output of a product scanner or badge reader, RFID signals, and so forth. Composite events, on the other hand, are those composed of multiple primitive events using spatio-temporal logic. For instance, the composite event “a person entering the building” is defined as primitive event “a tripwire is crossed in camera A monitoring in front of the building” FOLLOWED BY the primitive event “a face is detected in camera B monitoring the entrance of the building lobby” within 20 seconds. Composite events are often difficult to derive from video features directly because of limited and costly multi-camera vision technologies. In the following subsections, we present the system infrastructure of our proposed event detection framework with event representation and detection.Then an XML-based event description language and a set of convenient user interfaces are introduced.
19.3.1 System Infrastructure Figure 19.1 shows the system infrastructure of our proposed composite event detection framework. This system is designed in such a way that it is able to formulate and detect
462
CHAPTER 19 Composite Event Detection in Multi-Camera Data server
Retrieve detected events
End user/UI
Composite event detector
Define primitive and composite events
Ingest detected events Send primitive events
Primitive event detector
Primitive event detector
Primitive event detector
FIGURE 19.1 Proposed high-level spatio-temporal event detection system. Primitive events are generated by primitive event detectors that can be in distributive form. They are then transmitted to the central composite event detection unit. In this unit, primitive events are combined into composite events using logical operators. The primitive and composite events are defined by the end user, and the results of the event detection modules are stored in the database, which can be retrieved by the user at a later time.
composite events targeted by either a single camera view or a combination of views from a multi-camera network. There are four major components in the system. Each is independent of the others and replaceable. ■
■
■
Agent/engine: This is the standalone processor for detecting primitive events in a camera view or from a nonvideo device, such as tripwire, motion detection, prominent spatio-temporal curvature, or RFID signals. A single agent/engine can handle multiple primitive event detections, and multiple agents can reside on the same physical workstation. Thus, agents are able to process information acquired either from a single-camera view (or device) or from multiple-camera views independently. Central unit: This is the processor for high-level composite event detections. The central unit receives primitive events from agents/engines and determines if they satisfy the conditions of target composite events. It also ingests generated metadata for the data server, including composite event definitions, triggered composite events, and triggering primitive events. End-user terminal: This is where the user defines both primitive and composite events. The user can also perform query-based event search and browsing via Web applications. The end-user terminal can be any machine in the network that has an Internet browser. The definition process is carried out through a set of conveniently designed user interfaces (Section 19.3.4).
19.3 Spatio-Temporal Composite Event Detection ■
463
Data server: Once primitive and composite events are generated, they are transmitted to the centralized data server for future analysis, such as event query search and browsing on the end-user terminal. Event definitions are also stored on the data server.
A single agent can generate primitive events of multiple types, and multiple agents can reside on the same processing workstation. In addition, the central unit can be located on the same workstation as the agents. The end user defines primitive events (e.g., tripwire and motion detection) and formulates target composite events using primitive events through the user interface (UI). The primitive and composite event definitions are then transferred to and updated on the corresponding remote agents and the central unit. Once the primitive events are detected by the agents, they are sent over the network to the central unit for the composite event detection process. If a composite event is detected, summary results (e.g., alerts) are generated and stored in the data server, which is available for future search and retrieval.
19.3.2 Event Representation and Detection High-level composite events are represented in a full-binary tree format with one extra root node (Figure 19.2), where the root node represents the target composite event, the leaf nodes represent the primitive events, and the intermediate nodes represent the rules. Each rule node has a binary operator with two operands and corresponding spatio-temporal conditions. The purpose of a tree structure is to avoid cycles and thus reduce implementation and computational ambiguity. If a primitive event is the operand of multiple rules, duplicate nodes for this event are created such that each rule has an independent copy. In our formulation, each rule contains (but is not limited to) one of the four binary operators in the following list. Each operator is coupled with a temporal condition and a spatial condition.The temporal condition is the temporal window in which the operands
Composite event
Rule 3 WITHOUT
Primitive event 1
Rule 1
Rule 2
AND
FOLLOWED BY
Primitive event 2
Primitive event 3
Primitive event 4
FIGURE 19.2 Tree structure of an example composite event with four primitive events and three rules. The operands of operators can be primitive events, lower-level rules, or both.
464
CHAPTER 19 Composite Event Detection in Multi-Camera must be satisfied according to the operator. The spatial condition enforces the spatial distance between two detected operands; it can be either in the image space or in the global world space if cameras are calibrated. In the current system, we focus only on the spatial condition imposed in the image space.Therefore, the spatial condition is enforced only if two operands are primitive events generated from the same camera view. Each rule could be either a “triggering” or a “nontriggering” rule. The target composite event is detected if any of its “triggering” rules is satisfied. On the other hand, satisfaction of a “nontriggering” rule does not result in composite event detection, which is mainly designed for building high-level detection rules. Intuitively, every composite event should contain at least one “triggering” rule. The four proposed binary operators are ■
■
■
■
AND This operator triggers the rule if both its operands occur within the defined temporal and spatial conditions. OR This operator is satisfied when either one of its operands occurs. Because of its innate property, temporal and spatial conditions are not applied for this operator. FOLLOWED BY This operator enforces the temporal order between two operands.The first operand must occur before the second to trigger the rule. WITHOUT This operator involves the negation of the second operator. That is, the first operand should occur when the second operand DOES NOT occur within the defined temporal and spatial conditions (if applicable).
This is not an exhaustive list. Useful extensions include EXCLUSIVE OR and support for negation (OR NOT, NOT FOLLOWED BY, etc.). Furthermore, since only instantaneous events are considered in this chapter, operators that support events spanning a time interval are also omitted. One notable feature of the proposed system is that composite events can be formulated in a cascade, where previously defined composite events are considered the operand(s) of rules of other composite events. These composite events, referred to as “operand events,” receive primitive events as the input of regular composite events. The detected operand events (themselves being composite events) are then reformulated as the input of another composite event. In this case, there will be multiple “composite event detectors” in the system infrastructure, where some are responsible for handling operand events and act as both primitive event detector and composite event detector. By introducing the “operand events,” events with higher complexity can be formulated by cascading other less complicated composite events. According to the above definition, a composite event can be formulated in the following form: E ⫽ ( P, R )
(19.1)
where P ⫽ {P1 , . . . , Pn } is a set of primitive events, and R ⫽ {R1 , . . . , Rm } is a set of rules. Each rule Ri is defined as Ri ⫽ (⌽1i , ⌽2i , ⊕i , Ti , Si )
⌽1i
⌽2i
(19.2)
where and are the two operands of the rule; ⊕ is the binary operator (⊕ ∈ {AND, OR, FOLLOWED BY, WITHOUT}), and T and S are the temporal and spatial constraints, respectively. To enable the extensibility of the layered rule composition, the
19.3 Spatio-Temporal Composite Event Detection
465
operands (⌽1i , ⌽2i ) can be either primitive events or previously defined rules—that is, ⌽∗i ∈ {P1 , . . . , Pn , R1 , . . . , Ri⫺1 }. This enables multi-layered rule composition and at the same time avoids cycles in the event representation tree. Since the composite event is formulated in full binary tree form, it can be restructured by a breadth-first traversal of the tree. Equation 19.1 is then reformulated as E ⫽ S ⫽ {S1 , S2 , . . . , Sm⫹n }
(19.3)
where S is a vector containing the event tree nodes in the order they are visited during breadth-first traversal. Two conditions initiate the composite event detection process. The first is that the central unit receives a new generated primitive event. The second is that idle waiting time exceed the predefined threshold. The reason for the second condition is that in the operator WITHOUT, negation of the second operand is used. Thus, the nonexistence of the second operand must be periodically examined. Once the composite event detection process is initiated, a conventional breadth-first traversal algorithm is applied to determine if any of the triggering rules is satisfied with the presence of the newly received primitive event, or with the nonexistence of the second operand if the current rule operator is WITHOUT.
19.3.3 Event Description Language Composite events are stored in a standardized and extensible description language based on XML. A composite event is represented using a two-layered description. The top layer defines the event, including information such as agent ID, workstation address, event name, primitive events, and rule definitions. The lower layer is composed of two parts: primitive events and rule definitions. An example of the composite event description language is shown in Figure 19.3, where Figure 19.3(a) shows the top layer composite event definition, Figure 19.3(b) shows the definition of a primitive event, and Figure 19.3(c) shows the definition of a rule. In each primitive event node, the tag <Param> defines the actual parameters of the event (e.g., crossing line coordinates and direction of a tripwire). In each rule definition, the tag stores the binary operator type (e.g., AND or WITHOUT). The tag <Enable> indicates if the current rule is a triggering rule. The tags and indicate if the operands are primitive events or rules. The tags and indicate which primitive events and/or previous rules should be the operands of the current rule. In each primitive event node and rule node, the tag relates to the index tags of the rule operands, and thus must be a number unique to each of the primitive event and rule entities. In addition, if one of the current rule’s operands is another rule, the index of the current rule must be greater than its operand’s index to avoid potential cycles. The proposed event description language is standardized, compact, and nonambiguous. This is important in large-scale camera networks, where low-traffic loads, effective inter-agent communications, and efficient event constructions are required.
19.3.4 Primitive Events and User Interfaces Primitive events are the building blocks of composite events. We have so far developed 10 primitive events, which are described below. Since our system is designed to be
466
CHAPTER 19 Composite Event Detection in Multi-Camera
19999 <EventName type=“text”>Comp Event 1 127.0.0.1 3 3 … … … … … …
(a)
1 14001 10.0.0.1 <EventType type=“text”>Trip wire <EventName type=“text”>Main Entrance <Param type=“text”> …
(b) 3 <Enable type=“bool”>true AND PrimEvent 1 Rule 2 <MinTime type=“int” unit=“sec”>0 <MaxTime type=“int” unit=“sec”>120 <MinDist type=“int” unit=“pixel”>0 <MaxDist type=“int” unit=“pixel”>20
(c)
FIGURE 19.3 Event description language for event storage, agent construction, and inter-processor communications: (a) high-level composite event structure; (b) primitive event definition, where detailed event parameters are defined; (c) rule definition structure.
scalable, its capability is not limited to these types of events. Additional primitive events and nonvision events can be easily integrated into the system (one example is shown in Section 19.5.1). ■
■
■
■ ■
■
■ ■
■ ■
Motion detection: detection of motion within a defined region of interest (ROI) satisfying auxiliary conditions such as minimum motion area, minimum motion duration, and minimum number of objects in the ROI. Directional motion: detection of motion within a defined ROI satisfying directional filtering conditions. Abandoned object: detection of an event where an object is abandoned in the defined ROI, which satisfies size and duration constraints. Object removal: detection of an event where an object is removed in an ROI. Tripwire: detection of an object crossing a predefined tripwire (composed of a series of connected line segments) satisfying both directional and size constraints. Camera moved: detection of an event where the camera is moved to another FOV. This event is widely deployed in pan-tilt-zoom (PTZ) cameras. Camera motion stopped: detection of an event where a moving camera stops. Camera blind: detection of an event where the camera FOV is lost, which can be caused by camera tampering or malfunction and/or network failure. Face capture: detection of an event where faces are present in predefined ROIs. Region event: detection of directional traffic in/out of a defined ROI.
19.3 Spatio-Temporal Composite Event Detection
(a)
( b)
FIGURE 19.4 Screen shots of example primitive event definition editor interfaces: (a) tripwire event; (b) motion detection event.
Primitive events are defined through a set of graphical user interfaces, where users can conveniently specify their corresponding parameters. Screen shots of two example primitive event definition editors are shown in Figure 19.4. Figure 19.4(a) shows the dialog for specifying a tripwire event, where auxiliary conditions (object size, crossing direction, moving speed, etc.) are defined. In Figure 19.4(b), a motion detection event is specified using information such as ROIs and observed motion/time conditions. Defined primitive event parameters are saved in the event description language under node . Once primitive events are defined and established, users can formulate composite events using them through a set of composite event definition interfaces (Figure 19.5). In the first step, the system retrieves all available primitive events for the user to select (Figure 19.5(a)). Next the composite event editor dialog appears (Figure 19.5(b)). The selected primitive events are the initial set of candidates for the rule operands. To define a rule, the user selects an operator and two operands and specifies the spatio-temporal
467
468
CHAPTER 19 Composite Event Detection in Multi-Camera
(a)
( b)
FIGURE 19.5 Composite event definition editor. The user selects primitive events from dialog (a) for composite event formulation; event rules are defined in dialog. (b) Operands and operators of a new rule can be selected from the corresponding drop-down lists; the spatio-temporal constraints are set by edit boxes. Created rules are automatically added to the operand lists.
constraints between the operands. Lastly, the user decides whether or not to enable the triggering flag for this rule. Once the definition is completed, the rule is added to the rule list by clicking Add Rule.This rule is also added to the operand candidate set immediately after it is created. Intuitive descriptions of the created rules are displayed in the rule list window in a natural language. The dynamic nature of the operand candidate set enables the building of multi-level event compositions. Both the primitive and composite event definition editors are implemented in the form of ActiveX components, so they can be deployed on any workstation in the network.
19.4 COMPOSITE EVENT SEARCH One of the essential components of advanced digital video management (DVM) systems is interactive search of archived events (primitive, composite, nonvideo, etc.). Effective event search enables significant forensic features in DVMs, such as event visualizations,
19.4 Composite Event Search
469
video playback, event statistical summaries, and spatial and temporal activity comparison. In this section, we introduce the composite event search and browsing mechanisms in the embodiment of the IBM Smart Surveillance Solution system.
19.4.1 IBM Smart Surveillance Solution The IBM Smart Surveillance Solution (SSS) [35] is a service offered for use in surveillance systems providing video-based behavioral analysis. It offers the capability not only to automatically monitor a scene but also to manage the surveillance data, perform eventbased retrieval, receive real-time event alerts through a standard Web infrastructure, and extract long-term statistical patterns of activity. SSS is an open and extensible framework designed so that it can easily integrate multiple independently developed event analyses. It is composed of two major components: the Smart Surveillance Engine (SSE) and Middleware for Large Scale Surveillance (MILS). SSE processes video steams in real time, extracts object metadata, and evaluates user-defined events. It provides a software framework for hosting a wide range of video analytics such as behavior analysis, face recognition, and license plate recognition. MILS takes event metadata and maps it into a relational database. It also provides event search services, metadata management, system management, user management, and application development services. Figure 19.6 demonstrates the software structure of the SSS system. In this structure, SSEs parse the video streams acquired from cameras and generate generic event metadata in real time.The metadata is then transferred to the backend MILS interfaces and ingested into the database (DB2). MILS provides services for search-by-query, visualization, and statistical retrieval. In this section, we focus on the MILS interfaces for composite event searches.
19.4.2 Query-Based Search and Browsing To retrieve relevant data from the database and further formulate it into a user-friendly representation, convenient composite event search is essential. In this section, we introduce the search-related functionalities of the proposed system. The main composite event search interface is presented in Figure 19.7. In MILS, users search archived composite events using one or more of the following query
S3-MILS (Middleware for largescale surveillance)
S3-SSE (Smart surveillance engine) 1 SSE API/Plug-in (e.g., license plate recognition)
2 IBM middleware stack 1. WRS(WAS, DB2, MO) 2. WSII
FIGURE 19.6 System architecture of IBM, Smart Surveillance Solution.
Application and solutions that are specific to customers/industries on top of the MILS web service and IBM middleware stack Optional: Integrations such as legacy and sensors.
3
4
470
CHAPTER 19 Composite Event Detection in Multi-Camera
FIGURE 19.7 Composite event query page, including as search criteria composite event description, temporal duration of occurrences, and search domain.
modalities: (1) description, (2) temporal interval, and/or (3) physical domain. The description can be as exact as the target composite event subject, or it can match only partially with the event subjects. The advantages of enabling multiple physical search domains are twofold. Users retrieve similar composite events generated by different sets of primitive events and rules. For instance, users may be interested in searching “tailgating”events that occur in two separate buildings. Even though these events are individually formulated with scene-specific primitive events, they will be retrieved simultaneously by searching both buildings. The search results are presented in an interactive browsing fashion. The two modes for the retrieved events are stat view and list view. List view shows the retrieved events in chronological order with event description, time of occurrence, and physical domain (engine) (Figure 19.8(a)). Stat view (Figure 19.8(b)) summarizes the retrieved events in bin-style frequency graphs with respect to the unit-temporal interval (e.g., every hour). If multiple physical domains are provided in the query, event statistics of each are shown in separate graphs. This is particular useful for cross-domain event comparisons. Each bin in the event statistical graph is a hyperlink to a list view page, which contains the retrieved composite events during the time interval represented by the current bin. To further investigate queried composite events, each entry in the event list is enabled as a hyperlink to a event summary page. An example event summary page is shown in Figure 19.9. On this page, detailed contents of the target composite event are retrieved from the database and presented. In particular, individual primitive events that triggered the composite event are shown along with their occurrence information, including event names, event types, occurring time, processing agent ID, and visualizations. If the primitive events are video based (i.e., generated from video streams), the keyframes are the images captured when the primitive events occur. If the primitive events are generated from nonvideo signals (e.g., a barcode reader), the keyframes are textboxes with corresponding event details. In addition to the primitive events, the conditions that triggered the composite event are displayed to provide users with a clear picture of exactly what happened in the scene.
19.4 Composite Event Search
(a)
( b)
FIGURE 19.8 Composite event search results in two presentation modes: (a) list view, where retrieved composite events are shown with their descriptions in chronological order; (b) stat view, a statistical summary of retrieved composite events.
FIGURE 19.9 Detailed browsing page of a target composite event, showing primitive events, triggering condition, and event playback and map.
471
472
CHAPTER 19 Composite Event Detection in Multi-Camera
The example in Figure 19.9 shows a naïve composite event summary page. The target composite event is the capture of persons entering the building. It is the composition of three primitive events, where event P1 is the tripwire crossing in front of the building, event P2 is the badge read at the entrance door, and event P3 is the face capture after the person enters the interior door. Two rules are defined: (1) R1 :⫽ P1 [FOLLOWED BY] P2 with temporal constraint T1 ⫽ [5, 30]; and (2) R2 :⫽ R1 [FOLLOWED BY] P3 with temporal constraint T2 ⫽ [3, 10]. Note that there are more sophisticated algorithms for solving this particular problem. The presented formulation is just to provide an easy-to-follow embodiment for illustration purposes. Keyframes are the compact form of primitive event representations. They depict the scenario at the time the event occurs. In many situations, the cause of the event and its consequence are also interesting to investigators. Thus, it is important to have a playback facility that exhibits a duration of video showing the entire target event as well as its temporal neighborhoods. Utilizing streaming-on-demand technologies, IBM MILS provides two types of event playback. The first is playback of individual primitive events, achieved by clicking on the target primitive event keyframes. The video segment that corresponds to the current primitive event will be played in the playback window on the interface. The second playback method is streaming the videos of all primitive events automatically in the order of their occurrence. This provides an overall view of the entire composite event. If primitive events are nonvideo based, they will be skipped in the overall composite event playback and only textual descriptions will be shown. Another interesting and useful event visualization feature is the map utility. MILS Map is a graphical representation of the floor plan of a scene providing correspondences between primitive event sources and their physical locations on the floor plan (upperleft window in Figure 19.9). When the composite event summary page is retrieved, the primitive event camera FOVs as well as the nonvideo devices are labeled in the map. If a particular primitive event is selected for video playback, its corresponding FOV (or device) is highlighted in the map. During playback of the entire composite event, as the primitive events are played one after another, their corresponding FOVs/devices are highlighted consecutively in the same order to give a temporal depiction of the physical locations of the primitive events.
19.5 CASE STUDIES In this section, we present the proposed composite event detection framework in three real-life applications. First, we propose a solution to a common problem in the retail sector and then an indoor tailgating detection. Lastly, an example of false positive reduction is demonstrated.
19.5.1 Application: Retail Loss Prevention Statistics show that in 2007, retailers in North America lost over $5 billion resulting from various types of fraud. A major portion of this loss was caused by fake scans (sometimes known as nonscans), where the cashier (or the customer if in a self-checkout)
19.5 Case Studies
473
intentionally passes the product in front of the barcode reader without actually scanning it. Since this type of fraud often occurs between friends, it is also referred to as sweethearting in retail loss prevention literature. Here we present a way to detect fake scanning by applying our composite event detection system in retail checkout lanes. This application demonstrates that our system is not only designed for handling visual events; it is also able to integrate other types of sensor signals, such as barcode input. The composite event is formulated as follows. Two motion detection events are defined in the barcode scanner and bagging (belt) regions. The two visual primitive events are detected in two camera views to demonstrate the multi-camera capability of our system. The motion detection in the barcode scanner region is acquired by a camera with a close-up view of the scanner, and the belt region motion detection is acquired by an overhead camera positioned above the checkout lane. These two primitive events represent object presence in the designated regions. To capture a scan action, a condition is defined that comprises the motion event in the barcode scanner region (denoted PS ) followed by a motion event in the belt region (denoted PB ). Temporal constraint is also imposed such that the sequence should occur within a time interval of [1, 2] seconds. Once a scan action is captured by the visual sensor, the system matches it with the input barcode signal (denoted PU ). If the corresponding barcode signal is found within the temporal neighborhood, a fake scan is detected. Thus, the overall composite event for detecting fake scans is defined as (1) nontriggering rule R1 :⫽ PS [FOLLOWED BY] PB with temporal constraint T1 ⫽ [1, 2]; and (2) triggering rule R2 :⫽ R1 [WITHOUT] PU with temporal constraint T2 ⫽ [0, 1]. A graphical illustration is shown in Figure 19.10. Our testing data set consists of 9 transactions performed by different customers. The corresponding barcode signals are also available and are synchronized with the video input. There are a total of 179 scan actions in the data set, of which 111 are legitimate scans and 68 are fake scans.The system detected 96 fake scans with 5 misdetections (false negatives) and 28 overdetections (false positives). (A large portion of the false positives (60%) are the result of double-triggers.)
FIGURE 19.10 Composite event formulation for detecting fake scans in retail checkout scenarios: (a) motion event in barcode scanner region (cam1); (b) motion event in bagging/belt region (cam2); (c) barcode input event (point-of-sale machine). A fake scan alert is triggered if, (1) a motion detection alert in barcode scanner region followed by a motion detection alert in bagging region, and (2) condition (1) happened without any barcode input. Please note that primitive events are acquired from multiple camera views, and the barcode input events are generated by the point-of-sale server which is a nonvision device.
474
CHAPTER 19 Composite Event Detection in Multi-Camera
(a)
( b)
(c)
FIGURE 19.11 Fake-scan action detection results with different checkout lanes and different camera settings. (a) The keyframe captures the number of fake scans. (b) The keyframe captures a particular fake-scan action. (c) The keyframe shows the total number of scan actions captured. (For convenience of visualization, only keyframes from the overhead cameras are shown.)
The accuracy of this system has greatly exceeded industry expectations (“Dropping shrinkage from 2% to 1% of sales has the same effect on profit as 40% increase in sales,” according to a 2004Advanced Manufacturing Research report). Figure 19.11 shows some keyframes of captured fake scans.
19.5.2 Application: Tailgating Detection Tailgating poses high security threats to institutions such as banks, private companies, and high-rise residential buildings. A common solution for the prevention of tailgating is human monitoring; however, studies have shown that the effectiveness of human monitoring drops dramatically after 20 minutes of continuously watching the video stream from surveillance cameras. Given the apparent and inevitable limitations of costly human monitoring, automatic tailgating detection is critical for asset protection in these scenarios. In this section, we demonstrate how to apply the proposed composite event detection framework to detect a common form of tailgating event, where multiple persons pass into a secure area with only one proper badge read. In this application, primitive events generated from three sensors (two video cameras and one magnetic badge reader) are analyzed and composed to formulate the high-level tailgating events. The first primitive event, PB , is defined as the employee badge scan input, containing information such as the employee’s name, number, and time stamp. The detection of multiple persons is handled by two primitive events, PF and PS . Event PF is the capture of multiple faces in a camera looking at the door from the inside; event PS is the capture of multiple persons in a side-view camera (camera positions and orientations are illustrated in Figure 19.12(a)). Figures 19.12(b) and 19.12(c) illustrate the rule definitions of the tailgating event detection. Two rules are defined. The first is a nontriggering rule to capture the presence of multiple persons in cameras—R1 :⫽ PF [OR] PS with a short temporal constraint T1 ⫽ [0, 2]. The second is the triggering rule that combines the visual cue with employee badge reads.Thus, rule R2 :⫽ PB [FOLLOWED BY] R1 , with temporal distance T2 ⫽ [1, 5]. The purpose of applying the [OR] operator on PF and PS is to avoid possible occlusion in single-camera views.
19.5 Case Studies
475
Multiple faces detected
B
F S
Badge John Smith Followed by 11:30:26 09/10/2008
OR
Multiple people detected (a)
(b)
(c)
FIGURE 19.12 (a) Camera and sensor layout for tailgating detection. (b) Badge information. (c) Visual primitive events and rules for detecting multiple persons. Parts (b) and (c) are high-level tailgating event formulation. The detection rules in this case are (1) multiple faces detected in the frontal camera [OR] multiple people detected in the side-view camera, and (2) a badge read signal [Followed by] rule (1).
One face
Multiple persons (a)
Multiple faces
One person (b)
FIGURE 19.13 Tailgating results: (a) one face detected but two persons captured by the side-view camera; (b) one person captured in side view but two faces detected.
Two examples of tailgating detection are shown in Figure 19.13. In these examples, detection of multiple persons entering is demonstrated in the keyframes of two primitive events. In the first example (Figure 19.13(a)), the tailgating person follows another person who enters first. His face is not captured by the frontal camera as a result of occlusion. However, his presence is detected by the side-view camera. In the second
476
CHAPTER 19 Composite Event Detection in Multi-Camera example (Figure 19.13(b)), the situation is reversed. Both tailgating events are successfully detected by applying the [OR] operator on the primitive events.
19.5.3 Application: False Positive Reduction As mentioned in previous sections, primitive event detectors sometimes generate false positives resulting from limitations in technology and/or complex physical settings in the scene. In this section, we discuss the use of composite events to reduce false positives generated by primitive event detectors. Consider the monitoring of traffic on a body of water (Figure 19.14(a)), where statistics on boats traveling in a certain direction must be recorded daily. In this case, tripwires can be defined on the water surface to capture boats moving in the target direction (Figure 19.14(b)). As is commonly known, reliable object detection and tracking on water is difficult resulting from the uncertain pattern of waves and/or boat wakes. In this case, false positives are likely to occur for single tripwires (Figure 19.14(c)). To resolve this problem, and thus reduce false positives, we utilize the fact that the boat wake often appears only for a very short period. Unlike boats, wakes do not travel consistently on the water surface for a long distance. Therefore, if two tripwires are defined in consecutive order, they can capture the boats, since only boats can cross both tripwires. A composite event is thus formulated as a tripwire PX followed by another tripwire PY within a certain time period. Given the nature of this problem, both tripwires are defined within the same camera view. They have the same crossing direction defined to ensure that they are triggered by the same boat. The composite event definition for capturing boats on the
(a)
(b)
(c)
(e)
(f )
Followed by (d)
FIGURE 19.14 (a) Keyframe of an input river sequence. (b) Boat captured by the primitive tripwire detector. (c) False positive caused by wakes. (d),(e) Primitive tripwire detectors for the composite event modeling. (f) Keyframe of the captured boat using composite events.
19.6 Conclusions and Future Work
477
river is illustrated in Figure 19.14. One result keyframe of the captured boat is shown in Figure 19.14(f) without the false positive shown in Figure 19.14(c).
19.6 CONCLUSIONS AND FUTURE WORK Effective and efficient event-modeling technologies in multi-camera and multi-sensor surveillance networks are the core of security monitoring, border control, and asset protection. In this chapter, we introduced and explained in detail a high-level spatio-temporal event detection framework across different sensors, including video cameras and other types of scanner/readers.The proposed event detection system integrates primitive event detectors in composite events using logical and spatio-temporal operators. The composite events are formulated as full-binary trees where the root node is the target composite event, the leaf nodes are the primitive events, and the middle nodes represent the rules. To enable the extensibility and portability of the composite event detection framework, a standardized event description language was designed for transmitting and storing event definitions. Moreover, a set of convenient graphical user interfaces was developed to simplify both the primitive and composite event definition processes. The applicability and effectiveness of the proposed composite event detection have been demonstrated in real-life applications: retail loss prevention, tailgating detection, and false positive reduction. The current framework models the rules in a composite event as binary operators (e.g., AND, OR). Although complex events can be represented as combinations of binary operators, we plan to extend our formulation in future work such that unary- and multi-operand operators can also be integrated. Another interesting improvement will be incorporating high-level auxiliary constraints. One example is to have primitive events correlate with each other based on consistent object associations. This is particularly important in high-level activity-learning tasks in multi-camera networks.
CHAPTER 20
Toward Pervasive Smart Camera Networks
Bernhard Rinner, Institute of Networked and Embedded Computing, Klagenfurt University, Klagenfurt, Austria
Wayne Wolf, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Abstract Smart camera networks are real-time distributed embedded systems that perform computer vision using multiple cameras. This new approach has emerged thanks to a confluence of simultaneous advances in four key disciplines: computer vision, image sensors, embedded computing, and sensor networks. In this chapter, we briefly review and classify smart camera platforms and networks into single smart cameras, distributed smart camera systems, and wireless smart camera networks. We elaborate the vision of pervasive smart camera networks and identify major research challenges. As the technology for these networks advances, we expect to see many new applications open up—transforming traditional multi-camera systems into pervasive smart camera networks. Keywords: distributed smart cameras, multi-camera networks, sensor networks, pervasive computing
20.1 INTRODUCTION Smart cameras have been the subject of study in research and industry for quite some time. While some camera prototypes that integrated sensing with low-level processing were developed in the 1980s, the first commercial “intelligent” cameras appeared in the 1990s. However, the sensing and processing capabilities of these cameras were very limited. In the meantime we have seen dramatic progress in smart camera research and development (e.g., [1–3]). A number of technical factors are converging that cause us to rethink the nature of the camera completely. Distributed smart cameras embody some (but not all) of these trends—specifically, cameras are no longer boxes and no longer take pictures. A smart camera’s fundamental purpose is to analyze a scene and report items and activities of interest to the user. Although the camera may also capture an image to help the user interpret
the data, the fundamental output of a smart camera is not an image. When we combine several smart cameras to cover larger spaces and solve occlusion problems, we create a distributed camera. Furthermore, when we use distributed algorithms to perform smart camera operations, we create a distributed smart camera. Law enforcement and security are the most obvious applications of distributed smart cameras. Large areas can be covered only by large numbers of cameras; analysis generally requires fusing information from several. However, distributed smart cameras have many other uses as well, including machine vision, medicine, and entertainment. All of these applications require imagery from multiple cameras to be fused in order to interpret the scene. Because of the complex geometric relationships between subjects of interest, different sets of cameras may need to cooperate to analyze different subjects. Also, because of subject motion, the sets of cameras that must cooperate may change rapidly. Pulling all of the video from a large number of cameras to a central server is expensive and inherently unscalable. The combination of large numbers of nodes, fast response times, and constantly changing relationships between cameras pushes us away from server-based architectures. Distributed computing algorithms provide a realistic approach to the creation of large distributed camera systems. Distributed computing introduces several complications. However, we believe that the problems it solves are much more important than the challenges of designing and building a distributed video system. As in many other applications, distributed systems scale much more effectively than do centralized architectures. Processing all data centrally poses several problems:
■ Video cameras generate large quantities of data requiring high-performance networks in order to be transmitted in steady state.
■ Moving video over the network consumes large amounts of energy. In many systems, communication is 100 to 1000 times more expensive in energy than computation (see the sketch after this list). We do not expect camera systems to be run from batteries for long intervals, but power consumption is a prime determinant of heat dissipation. Distributing larger amounts of power also requires more substantial power distribution networks, which increase the installation cost of the system.
■ Although data must be compared across several cameras to analyze video, not all pairs of cameras must communicate with each other. If we can manage the data transfer between processing nodes, we can make sure that data goes only to the necessary nodes. A partitioned network can protect physically distributed cameras so that the available bandwidth is used efficiently.
■ Real-time and availability considerations argue in favor of distributed computing. The round-trip delay to a server and back adds to the latency of making a decision, such as whether a given activity is of interest. Having multiple points of computation available enables reconfiguration in case of failure, which increases the availability of the multi-camera system.
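To make the bandwidth and energy argument concrete, the following back-of-envelope sketch compares shipping a raw frame off a camera node with extracting features on board and shipping only those. All constants (radio energy per bit, processing energy per operation, frame and descriptor sizes) are illustrative assumptions, not measurements of any platform discussed in this chapter.

```python
# Back-of-envelope comparison of per-frame communication cost for a camera node.
# All constants are illustrative assumptions, not measured values.

RADIO_ENERGY_PER_BIT_J = 200e-9   # assumed ~200 nJ per transmitted bit
CPU_ENERGY_PER_OP_J = 1e-9        # assumed ~1 nJ per simple operation

def raw_frame_bits(width=640, height=480, bits_per_pixel=12):
    """Bits needed to ship one uncompressed frame."""
    return width * height * bits_per_pixel

def feature_bits(num_objects=5, bytes_per_object=32):
    """Bits needed to ship abstracted object descriptors instead of pixels."""
    return num_objects * bytes_per_object * 8

def transmit_energy(bits):
    return bits * RADIO_ENERGY_PER_BIT_J

def local_processing_energy(ops_per_pixel=50, width=640, height=480):
    """Energy to run an assumed 50-ops-per-pixel feature extraction on the node."""
    return ops_per_pixel * width * height * CPU_ENERGY_PER_OP_J

raw = transmit_energy(raw_frame_bits())
smart = transmit_energy(feature_bits()) + local_processing_energy()
print(f"raw frame transmit  : {raw * 1e3:.1f} mJ/frame")
print(f"on-board + features : {smart * 1e3:.1f} mJ/frame")
print(f"ratio               : {raw / smart:.1f}x")
```

Even with generous assumptions for on-board processing cost, the raw-video option is dominated by radio energy, which is the motivation for pushing computation into the camera.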
In the evolution of smart cameras we can identify three major paths. First, single smart cameras focus on the integration of sensing with embedded on-camera processing. The main goal here is to be able to perform various vision tasks onboard and deliver abstracted data from the observed scene. Second, distributed smart cameras (DSCs) introduce
distribution and collaboration to smart cameras, resulting in a network of cameras with distributed sensing and processing. Distributed smart cameras thus collaboratively solve tasks such as multi-camera surveillance and tracking by exchanging abstracted features. Finally, pervasive smart cameras (PSCs) integrate adaptivity and autonomy in DSCs. The ultimate vision of PSCs is to provide a service-oriented network that is easy to deploy and operate, adapts to changes in the environment, and provides various customized services to users. The goal of this chapter is twofold. First, we briefly review and classify smart camera platforms and networks. Second, we elaborate the vision of pervasive smart camera networks and identify major research challenges toward this vision. The discussion of research challenges is based on an exploration of trends in current smart camera systems. The remainder of this chapter is organized as follows: Section 20.2 starts with a brief overview of the architecture of smart cameras and then focuses on the evolution of smart camera systems. In Section 20.3 we identify current trends and speculate about future developments and applications. Section 20.4 concludes the chapter with a brief discussion.
20.2 THE EVOLUTION OF SMART CAMERA SYSTEMS Smart cameras are enabled by advances in VLSI technology and embedded system architecture. Modern embedded processors offer substantial processing performance. However, smart cameras are not simply cost-reduced versions of arbitrarily selected computer vision systems. Embedded computer vision requires techniques distinct from non-real-time computer vision because of the particular stresses on computer systems that vision algorithms cause. Memory is a principal bottleneck in system performance because memory speed does not increase with Moore’s law [4]. However, computer vision algorithms, much like video compression algorithms, use huge amounts of data, often with less frequent reuse. As a result, caches typically found in general-purpose computing systems may be less effective for vision applications. At a minimum, software must be carefully optimized to make the best use of the cache; at worst, the memory system must be completely redesigned to provide adequate memory bandwidth [5]. Besides memory capacity and memory bandwidth, computing power is a crucial resource for embedded computer vision. The individual stages of the typical image-processing pipeline place different requirements on the processing elements. Low-level image processing, such as color transformation and filtering, operates on individual pixels in regular patterns. These low-level operations process the complete image data at the sensor’s frame rate, but typically offer a high data parallelism. Thus, low-level image processing is often realized on dedicated hardware such as ASICs or FPGAs, or specialized processors [6]. High-level image processing, on the other hand, operates on (few) features or objects, which reduces the required data bandwidth but significantly increases the complexity of the operations. These complex processing tasks typically exhibit a data-dependent and irregular control flow. Thus, programmable processors are the prime choice for them. Depending on the complexity of the image-processing algorithms, even multi-core or multi-processor platforms may be deployed [7, 8].
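The split between the two kinds of processing stages can be illustrated with a small sketch. It runs on a synthetic frame with NumPy; the thresholds and the blob-labeling routine are assumptions chosen for the example rather than components of any platform described below.

```python
import numpy as np

def low_level_stage(frame, background, threshold=25):
    """Low-level, per-pixel stage: regular, data-parallel work on every pixel.
    This kind of operation maps well to SIMD units, FPGAs, or ASICs."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)   # binary foreground mask

def high_level_stage(mask):
    """High-level, object-level stage: irregular, data-dependent control flow.
    Here: a simple flood-fill labelling that returns bounding boxes of blobs."""
    visited = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not visited[y, x]:
                stack, ys, xs = [(y, x)], [], []
                visited[y, x] = True
                while stack:                      # grow one connected blob
                    cy, cx = stack.pop()
                    ys.append(cy)
                    xs.append(cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

# Synthetic 64x64 frame with one bright "object" against a flat background.
background = np.full((64, 64), 100, dtype=np.uint8)
frame = background.copy()
frame[20:30, 40:50] = 200

mask = low_level_stage(frame, background)
print("pixels processed by low-level stage :", frame.size)
print("objects handled by high-level stage :", high_level_stage(mask))
```

The low-level stage touches every pixel in a regular pattern, while the high-level stage works on a handful of objects with data-dependent control flow, which is exactly the division of labor between dedicated hardware and programmable processors discussed above.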
20.2.1 Single Smart Cameras Integration of image sensing and processing on single embedded platforms has been conducted for quite some time. However, research on single smart cameras has intensified over the last decade. Table 20.1 presents an overview of selected single smart camera platforms. Moorhead and Binnie [9] presented one of the first fabricated CMOS implementations. Their SoC smart camera integrated edge detection into the image sensor. VISoc [10] represents another smart camera-on-a-chip implementation featuring a 320×256-pixel CMOS sensor, a 32-bit RISC processor, and a vision/neural coprocessor. Kleihorst et al. [16] developed a specialized processor for image processing with high performance and low power consumption. This processor features 320 processing elements, allowing it to process a single line of a CIF-resolution image in one cycle or of a VGA-resolution image in two cycles. Wolf et al. [11] developed a first-generation smart camera prototype for real-time gesture recognition. For the implementation they equipped a standard PC with additional PCI boards featuring a TriMedia TM-1300 VLIW processor. A Hi8 video camera is connected to each PCI board for image acquisition.
Table 20.1 Examples of Single Smart Camera Systems

System | Sensor | CPU | Communication | Power | Application
SOC Smart Camera (Moorhead & Binnie [9]) | CMOS | Custom logic for on-chip edge detection | n/a | Mains | Low-level edge detection
VISoc (Albani [10]) | CMOS, 320×256 | 32-bit RISC and vision/neural processor | n/a | Battery | Low-level edge detection
First-Generation Prototype (Wolf [11]) | Hi8 Camcorder, NTSC | PC with TriMedia TM-1300 board | n/a | Mains | Gesture recognition
Embedded Single Smart Camera (Bramberger, Rinner [12]) | Color, VGA | DSP | n/a | Mains | Adaptive background subtraction
TRICam [13] | Video in (no sensor) | DSP and FPGA, 128-MB RAM | Ethernet | Mains | Viola-Jones object detection
DSP-Based Smart Camera (Bauer [14]) | Neuromorphic sensor, 64×64 | BlackFin DSP | n/a | Mains | Vehicle detection and speed estimation
FPGA-Based Smart Camera (Dias & Berry [15]) | Gyroscope and accelerometer, 2048×2048 | Altera Stratix FPGA | FireWire (1394) | Mains | Template-based object tracking
A completely embedded version of a smart camera was introduced by Bramberger et al. [12]. Their first prototype was based on a single DSP COTS system (TMS320C64xx processor from Texas Instruments) equipped with 1 MB of on-chip memory and 256 MB of external memory. A CMOS image sensor is directly connected to the DSP via the memory interface. Communication and configuration are realized over a wired Ethernet connection. Arth et al. [13] presented the TRICam—a smart camera prototype based on a single DSP from Texas Instruments. Analog video input (either PAL or NTSC) is captured by dedicated hardware, and an FPGA is used for buffering the scanlines between video input and DSP. The TRICam is equipped with 1 MB of on-chip and 16 MB of external memory. Bauer et al. [14] presented a DSP-based smart camera built around a neuromorphic vision sensor. This smart sensor delivers only information about intensity changes with precise timing information, which is then processed to identify moving objects and estimate their speed. Dias et al. [15] described a generic FPGA-based smart camera. The FPGA is used to implement several standard modules (e.g., image sensor interface, memory interface, FireWire interface) along with a programmable control module and a flexible number of processing elements. The processing elements can be interconnected arbitrarily according to the algorithm’s data flow.
20.2.2 Distributed Smart Cameras Distributed smart cameras distribute not only sensing but also processing. However, the degree of distribution may vary substantially. On the one hand, smart cameras can serve as processing nodes that perform some fixed preprocessing but still deliver data to a central server. On the other hand, processing may be organized in a completely decentralized fashion where the smart cameras organize themselves and collaborate dynamically. Table 20.2 presents an overview of selected distributed smart camera systems. Implementing and deploying distributed smart cameras with decentralized coordination pose several new research challenges. Multiple threads of processing may take place on different processing nodes in parallel. This requires a distribution of data and control in the smart camera network. The required control mechanisms are implemented by means of dedicated protocols. Elsewhere [17, 18] we discussed how substantial system-level software or middleware would greatly enhance application development. Such middleware has to integrate the camera’s image-processing capabilities and provide a transparent inter-camera networking mechanism. We proposed [19] the use of agents as a top-level abstraction for the distribution of control and data. A distributed application comprises several mobile agents, which represent image-processing tasks within the system. Combining agents with a mobility property allows the image-processing tasks to move between cameras as needed. To demonstrate the feasibility of this agent-oriented approach, we implemented an autonomous and fully decentralized multi-camera tracking method [20]. Patricio et al. [21] also used the agent-oriented paradigm, but in their approach the agent manages a single camera and an internal state representing beliefs, desires, and intentions. Collaboration of cameras thus corresponds to collaboration of agents; that is, an agent can inform its neighbor about an object expected to appear or ask other agents whether they are currently tracking the same object.
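The mobile-agent idea can be sketched in a few lines. The classes, the camera topology, and the hand-off rule below are invented for illustration; they are not the middleware of [19, 20] or the BDI agents of [21], only the general pattern of a tracking task whose state migrates with the target.

```python
# Simplified sketch of agent-based tracking hand-off between smart cameras.
# All names and the hand-off rule are illustrative, not an existing middleware API.

class TrackingAgent:
    """A mobile agent carrying the state of one tracking task."""
    def __init__(self, target_id, appearance):
        self.target_id = target_id
        self.appearance = appearance      # e.g., a color histogram or feature vector
        self.trajectory = []              # (camera name, position) history

    def step(self, camera, position):
        self.trajectory.append((camera.name, position))

class Camera:
    def __init__(self, name):
        self.name = name
        self.neighbors = []               # cameras with adjacent fields of view
        self.agents = []                  # agents currently hosted on this camera

    def host(self, agent):
        self.agents.append(agent)

    def hand_off(self, agent, successor):
        """Migrate the agent (its full state) to a neighboring camera."""
        self.agents.remove(agent)
        successor.host(agent)
        print(f"{self.name}: handed target {agent.target_id} to {successor.name}")

# Two cameras with adjacent fields of view.
cam_a, cam_b = Camera("cam_a"), Camera("cam_b")
cam_a.neighbors, cam_b.neighbors = [cam_b], [cam_a]

agent = TrackingAgent(target_id=7, appearance=[0.2, 0.5, 0.3])
cam_a.host(agent)
agent.step(cam_a, position=(120, 80))
agent.step(cam_a, position=(310, 85))      # target approaches the edge of cam_a's view
cam_a.hand_off(agent, cam_a.neighbors[0])  # the task migrates; no raw video is exchanged
agent.step(cam_b, position=(15, 90))
print("trajectory:", agent.trajectory)
```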
Table 20.2 Examples of Distributed Smart Camera Systems

System | Sensor | CPU | Communication | Power | Application
Distributed SmartCam (Bramberger et al. [8]) | VGA | ARM and multiple DSPs | 100-Mbps Ethernet, GPRS | Mains | Local image analysis, cooperative tracking
BlueLYNX (Fleck et al. [22]) | VGA | PowerPC, 64-MB RAM | Fast Ethernet | Mains | Local image preprocessing, central reasoning
GestureCam (Shi & Tsui [24]) | CMOS, 320×240 (max. 1280×1024) | Xilinx Virtex-II FPGA, custom logic plus PowerPC core | Fast Ethernet | Mains | Local image analysis, no collaboration
NICTA Smart Camera (Norouznezhad [23]) | CMOS, 2592×1944 | Xilinx XC3S5000 FPGA; microBlaze core | GigE vision interface | Mains | Local image analysis, no collaboration
Fleck et al. [22] demonstrated a multi-camera tracking implementation where camera coordination and object hand-off between cameras are organized centrally. Each camera uses a particle filter–based tracking algorithm to track the individual objects within a single camera’s field of view. The camera nodes report the tracking results along with the object description to the central server node. Norouznezhad et al. [23] presented an FPGA-based smart camera platform developed for large multi-camera surveillance applications. Its most distinctive features compared to other platforms are the large CMOS image sensor (2592×1944) and the GigE vision interface. The processing unit is partitioned into pixel-based processing and ROI processing, both of which can be executed in parallel.
20.2.3 Smart Cameras in Sensor Networks Wireless sensor networks are receiving much attention in the scientific community [25]. While many networks focus on processing scalar sensor values such as temperature or light measurements, some focus on visual sensors. Because a core feature of sensor networks is that they are designed to run on battery power, one of the main challenges is to find a reasonable trade-off between computing power, memory resources, communication capabilities, system size, and power consumption. Table 20.3 presents an overview of selected wireless smart camera systems. The Meerkats sensor nodes [31] use an Intel Stargate mote equipped with a 400-MHz StrongARM processor, 64-MB SDRAM, and 32-MB Flash. Wireless communication is realized with an 802.11b standard PCMCIA card. A consumer USB webcam serves as an imager (640×480 pixels), and the sensor nodes are operated by an embedded Linux system. A focus of this work is to evaluate the power consumption of different
Table 20.3 Examples of Wireless Smart Camera Systems

System | Sensor | CPU | Communication | Power | Application
Meerkats (Margi et al. [30]) | Webcam, 640×480 | StrongARM at 400 MHz | 802.11b | Battery | Local image analysis, collaborative object tracking, image transmission to central sink
Cyclops (Rahimi et al. [28]) | Color CMOS, 352×288 | ATmega128 at 7.3 MHz | None onboard (802.15.4 via MicaZ mote) | Battery | Collaborative object tracking
MeshEye (Hengstler et al. [32]) | 2× low-resolution sensor, 1× VGA color CMOS sensor | ARM7 at 55 MHz | 802.15.4 | Battery | Unknown
WiCa (Kleihorst et al. [7]) | 2× color CMOS sensor, 640×480 | Xetal 3D (SIMD) | 802.15.4 | Battery | Local processing, collaborative reasoning
CMUcam3 (Rowe et al. [26]) | Color CMOS, 352×288 | ARM7 at 60 MHz | None onboard (802.15.4 via FireFly mote) | Battery | Local image analysis, inter-node collaboration
CITRIC (Chen et al. [34]) | OV9655 color CMOS sensor, 1280×1024 | XScale PXA270 | 802.15.4 | Battery (428–970 mW) | Compression, tracking, localization
tasks such as Flash memory access, image acquisition, wireless communication, and data processing. In Margi et al. [30], further details on deploying Meerkats in a multi-node setup are given. For detection of moving objects, image data is analyzed locally on the cameras. Nodes collaborate for hand-over using a master–slave mechanism. Compressed image data is transmitted to a central sink. Feng et al. [35] presented the Panoptes—a very similar system that is also based on Stargate motes and USB webcams. Another representative smart camera for sensor networks is the Cyclops by Rahimi et al. [28]. This node is equipped with a low-performance ATmega128 8-bit RISC microcontroller operating at 7.3 MHz with 4 kB of on-chip SRAM, and 60 kB of external RAM. The CMOS sensor can deliver 24-bit RGB images at CIF resolution (352×288). The Cyclops platform does not provide onboard networking facilities, but it can be attached to a MicaZ mote. Medeiros et al. [29] used a network of Cyclops cameras to implement a protocol supporting dynamic clustering and cluster head election. They demonstrated their system in an object-tracking application. The MeshEye sensor node by Hengstler et al. [32] combines multiple vision sensors on a single node. The platform is equipped with two low-resolution image sensors and
one VGA color image sensor. One of the low-resolution sensors is used to constantly monitor the camera’s field of view. Once an object has been detected, the second low-resolution sensor is activated and the location of the detected object is estimated using simple stereo vision. The region of interest of the detected object is then captured by the high-resolution sensor. The main advantage of this approach is that power consumption can be kept at a minimum as long as there are no objects in the system’s field of view. The processing is done on an ARM7 microcontroller running at 55 MHz. The MeshEye is equipped with 64 kB of RAM and 256 kB of Flash memory. An 802.15.4 chip provides wireless networking capabilities. The WiCa wireless camera by Kleihorst et al. [7] is equipped with the SIMD processor IC3D operating at 80 MHz. This processor features 320 RISC processing units operating concurrently on the image data stored in line memory. In addition to the line memory, the platform also provides access to external DPRAM. For general-purpose computations and communication tasks, the WiCa is equipped with an 8051 microcontroller. It can be extended with an 802.15.4-based networking interface used for inter-node communication. The WiCa platform was designed for low-power applications and hence could be operated on batteries. Distributed processing between four WiCas has been demonstrated in a gesture recognition system [33]. The CMUcam3, developed by Rowe et al. [26], is the latest version of an embedded computer vision platform. It consists of a color CMOS sensor capable of delivering 50 frames per second at a resolution of 352×288 pixels. Image data is stored in a FIFO and processed by an ARM7 microcontroller operating at 60 MHz. The CMUcam3 is equipped with 64 kB of RAM and 128 kB of Flash memory. It comes with a software layer implementing various vision algorithms such as color tracking, frame differencing, convolution, and image compression. Networking capabilities can be achieved by attaching an external mote via a serial communication channel, for example, by combining it with FireFly motes [27]. This FireFly Mosaic relies on tight time synchronization for multi-camera cooperation. The nodes are statically deployed in home activity monitoring. The CITRIC mote [34] is a wireless camera hardware platform with an SXGA OmniVision CMOS sensor, an XScale processor, 64 MB of RAM and 16 MB of Flash memory. Wireless communication using IEEE 802.15.4 is achieved by connecting the CITRIC board to a Tmote Sky board. The CITRIC platform is similar to the prototype platform used by Teixeira et al. [36], which consists of an iMote2 connected to a custom-built camera sensor board. It has been demonstrated in image compression, single-target tracking via background subtraction, and camera localization using multi-target tracking [34].
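The staged wake-up scheme described above for MeshEye can be summarized by a short control-loop sketch. The sensor-access functions are stubs invented for the example; a real node would talk to the actual imagers and duty-cycle them in hardware.

```python
import random

# Simplified sketch of a MeshEye-style staged sensing loop.
# The sensor-access functions are stubs; a real node would use hardware drivers.

def low_res_detect():
    """Always-on low-resolution sensor: cheap per-frame motion check (stubbed)."""
    return random.random() < 0.2           # pretend an object appears 20% of the time

def stereo_estimate_roi():
    """Second low-resolution sensor, woken only on detection: crude stereo match
    giving a region of interest (x, y, w, h) for the object (stubbed)."""
    return (180, 60, 40, 90)

def high_res_capture(roi):
    """High-resolution sensor captures only the region of interest (stubbed)."""
    x, y, w, h = roi
    return f"high-res patch {w}x{h} at ({x},{y})"

def sensing_loop(frames=10):
    for t in range(frames):
        if not low_res_detect():
            continue                        # stay in the low-power monitoring state
        roi = stereo_estimate_roi()         # wake the second sensor, localize the object
        patch = high_res_capture(roi)       # wake the high-res sensor for the ROI only
        print(f"frame {t}: {patch}")

random.seed(0)
sensing_loop()
```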
20.3 FUTURE AND CHALLENGES Distributed smart camera networks can be used in a variety of applications. The particular constraints imposed by these applications and their associated platforms require somewhat different algorithms:
■ Best results are obtained for tracking when several cameras share overlapping fields of view. In this case, the cameras can compare the tracks they generate to improve the accuracy of the overall track generated by the network.
■ When covering large areas, we may not be able to afford enough cameras to provide overlapping fields of view. Tracking must then estimate the likelihood that a person seen in one view is the same person seen by a different camera at a later time.
■ Not all cameras may be alike. We may, for example, use low-power, low-resolution cameras to monitor a scene and wake up more capable cameras when activity warrants. We may also use cameras in different spectral bands, such as infrared.
■ Some cameras may move, which causes challenges for both calibration and background elimination. A moving camera may be part of a cell phone that captures opportunistic images, or it may be mounted on a vehicle.
20.3.1 Distributed Algorithms Distributed algorithms have a number of advantages and are a practical necessity in many applications. Centralized algorithms create bottlenecks that limit system scalability. Distributed algorithms can also, when properly designed, provide some degree of fault tolerance. Two styles of distributed algorithms have been used in distributed smart cameras: consensus algorithms compare information between nodes to improve estimates; coordination algorithms hand off control between nodes. Consensus and coordination algorithms use different styles of programming and have distinct advantages. Consensus algorithms are typically thought of as message-passing systems. An example consensus algorithm for distributed smart cameras is the calibration algorithm of Radke et al. [37]. This algorithm determines the external calibration parameters (camera position) of a set of cameras with overlapping fields of view by finding correspondences between features extracted from the scenes viewed by each camera. It is formulated as a message-passing system in which each message includes a node’s estimate of the position. Consensus algorithms are well suited to estimation problems, such as the determination of position. Many algorithms can be formulated as message-passing systems. An important characteristic of consensus algorithms is loose termination criteria. Distributed systems may not provide reliable transmission of messages. As a result, termination should not rely on strict coordination of messages into iterations. Coordination algorithms can be viewed as token-passing systems. A token represents the locus of control for processing. Coordination algorithms are well suited to problems like tracking, in which the identity of a subject must be maintained over an extended period. These algorithms can be thought of as protocols—each node maintains its own internal state and exchanges signals with other nodes to affect both its own state and that of other nodes. Coordination algorithms date back to the early days of distributed smart cameras. The VSAM system [38] handed off tracking from camera to camera. More recently, the gesture recognition system of Lin et al. [39] uses a token to represent the identity of the subject whose gestures are being recognized. Some low-level feature extraction is always performed locally, but the final phases of gesture recognition may move from node to node as the subject moves and features from several cameras need to be fused. A protocol manages the transfer of the token between nodes; it must ensure that the token is neither duplicated nor lost. The tracking system of Velipasalar et al. [40] also uses a protocol
to trade information about targets. Each node runs its own tracker for each target in its field of view. A protocol periodically exchanges information between nodes about the position of each target. This system is considered a coordination algorithm rather than a consensus algorithm because only one round of information exchange is performed at each period. Of course, tracking includes both maintenance of identity (coordination) and estimation of position (consensus). More work needs to be done to combine these two approaches into a unified algorithmic framework.
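As an illustration of the consensus style, the following sketch lets cameras on a small ring topology repeatedly average their noisy estimates of a target position with their neighbors. The topology, noise model, step size, and fixed number of rounds are assumptions chosen for the example; in particular, the fixed round count stands in for the loose termination criteria discussed above, and the sketch is not the calibration algorithm of [37].

```python
import random

# Minimal message-passing consensus sketch: neighboring cameras iteratively
# average their noisy estimates of a target's 2D position. Illustrative only.

random.seed(1)
TRUE_POSITION = (4.0, 7.0)

# Ring topology: each camera exchanges messages with two neighbors.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

# Each camera starts from its own noisy local measurement.
estimates = {
    cam: (TRUE_POSITION[0] + random.gauss(0, 1.0),
          TRUE_POSITION[1] + random.gauss(0, 1.0))
    for cam in neighbors
}

def consensus_round(estimates, neighbors, step=0.3):
    """One round: every camera nudges its estimate toward its neighbors' messages."""
    updated = {}
    for cam, (x, y) in estimates.items():
        msgs = [estimates[n] for n in neighbors[cam]]    # estimates received this round
        avg_x = sum(mx for mx, _ in msgs) / len(msgs)
        avg_y = sum(my for _, my in msgs) / len(msgs)
        updated[cam] = (x + step * (avg_x - x), y + step * (avg_y - y))
    return updated

# Run a fixed number of rounds; the nodes converge toward a common estimate
# close to the average of their initial measurements.
for _ in range(20):
    estimates = consensus_round(estimates, neighbors)

for cam, (x, y) in estimates.items():
    print(f"camera {cam}: estimate ({x:.2f}, {y:.2f})")
```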
20.3.2 Dynamic and Heterogeneous Network Architectures Dynamic and heterogeneous camera networks provide some advantages over static architectures. They can be better adapted to application requirements and—more important—are able to react to changes in the environment during operation. Such a heterogeneous architecture can comprise cameras with different capabilities concerning sensing, processing, and communication. We can choose the mode in which the camera operates and hence determine the configuration of the overall camera network. In this way we can set the network into a configuration that best fits current requirements. There are many possible optimization criteria—energy, response time, and communication bandwidth are just a few examples. The optimization goal we want to achieve clearly depends on the application. Dynamic and heterogeneous architectures are not special to camera networks. These principles are well known, for example, in sensor and communication networks. One such example is multi-radio networks that combine low- and high-performance radios to adapt bandwidth, energy consumption, and connectivity over time. The system of Stathopoulos et al. [41] uses dual-radio platforms to implement a protocol that selectively enables high-bandwidth nodes to form end-to-end communication paths. For their work they use a low-bandwidth network that is always on for control and management as well as for transmission of low-bandwidth data. High-bandwidth radios are only enabled when fast response times are needed or large data volumes have to be transferred. Lymberopoulos et al. [42] evaluated the energy efficiency of multi-radio platforms. They compared an 802.15.4 radio providing a data rate of 250 kbps with an 802.11b radio providing a data rate of up to 11 Mbps. They also considered the different startup times of the two radios in their energy evaluation. A freely moving camera not only poses challenges for calibration and background elimination; connected over wireless links, it can also change the topology of the overall network. Communication links to some nodes may drop; new links to other nodes may need to be established. This is closely related to mobile ad hoc networks (MANETs) [43], which mainly deal with self-configuration of the dynamic network.
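The dual-radio idea can be captured in a small selection-policy sketch. The rate, power, and startup figures and the decision rule are illustrative assumptions, not parameters taken from [41, 42].

```python
# Sketch of a dual-radio selection policy for a camera node.
# All radio parameters and thresholds are illustrative assumptions.

LOW_POWER_RADIO = {"name": "802.15.4", "rate_bps": 250_000, "power_w": 0.06,
                   "startup_s": 0.002}
HIGH_POWER_RADIO = {"name": "802.11b", "rate_bps": 11_000_000, "power_w": 0.9,
                    "startup_s": 0.5}

def tx_energy(radio, payload_bits):
    """Energy to wake the radio and push the payload through it."""
    tx_time = payload_bits / radio["rate_bps"]
    return radio["power_w"] * (radio["startup_s"] + tx_time)

def choose_radio(payload_bits, deadline_s):
    """Prefer the low-power radio unless it cannot meet the deadline or the
    high-power radio is actually cheaper for this payload."""
    low_time = LOW_POWER_RADIO["startup_s"] + payload_bits / LOW_POWER_RADIO["rate_bps"]
    if low_time > deadline_s:
        return HIGH_POWER_RADIO
    if tx_energy(HIGH_POWER_RADIO, payload_bits) < tx_energy(LOW_POWER_RADIO, payload_bits):
        return HIGH_POWER_RADIO
    return LOW_POWER_RADIO

for payload_bits, deadline_s in [(2_000, 1.0), (400_000, 1.0), (4_000_000, 2.0)]:
    radio = choose_radio(payload_bits, deadline_s)
    print(f"{payload_bits:>9} bits, deadline {deadline_s}s -> {radio['name']}")
```

For small event reports the low-power radio wins; only bulky or deadline-critical transfers justify waking the high-power radio.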
20.3.3 Privacy and Security In the deployment of camera networks in end-user environments such as private homes or public places, awareness of privacy, confidentiality, and general security issues is rising [44]. By being able to perform onboard image analysis and hence to avoid transferring raw data, smart cameras have great potential for increasing privacy and security. Chattopadhyaya et al. [45] and Fleck et al. [46], among others, explored smart cameras in
privacy-sensitive applications by omitting the transfer of images of some parts of the observed scene. Serpanos et al. [47] identified the most important security issues of smart camera networks and classified the major security requirements at the node and network levels. Although security issues of distributed smart cameras are analogous to those of networked embedded systems and sensor networks, emphasis should be given to the special requirements of smart camera networks, including privacy and continuous real-time operation. To guarantee data authenticity and protect sensitive and private information, a wide range of mechanisms and protocols should be included in the design of smart camera networks.
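One way to realize the idea of keeping sensitive pixels on the camera is sketched below. The person detector is stubbed with a fixed bounding box and the masking policy (blanking the region) is an assumption for illustration; it is not the method of [45] or [46].

```python
import numpy as np

# Sketch of onboard privacy masking: sensitive regions are blanked before any
# pixel data leaves the camera. Detection is stubbed with a fixed box.

def detect_people(frame):
    """Stub for an onboard person detector; returns bounding boxes (x, y, w, h)."""
    return [(24, 8, 16, 40)]

def mask_regions(frame, boxes, fill=0):
    """Blank out the detected regions so raw appearance never leaves the node."""
    masked = frame.copy()
    for x, y, w, h in boxes:
        masked[y:y + h, x:x + w] = fill
    return masked

def outgoing_message(frame):
    """Only abstracted metadata plus the masked image are transmitted."""
    boxes = detect_people(frame)
    return {"person_count": len(boxes),
            "boxes": boxes,
            "masked_frame": mask_regions(frame, boxes)}

frame = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
msg = outgoing_message(frame)
print("persons reported    :", msg["person_count"])
print("masked region pixels:", sum(w * h for _, _, w, h in msg["boxes"]))
```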
20.3.4 Service Orientation and User Interaction With all of the technological considerations and challenges of smart camera systems, one major aspect is easily forgotten: They should be designed for users. This is even more important for future camera networks—some of which are targeted at consumer applications. For these applications, service orientation, robustness, and ease of use are important factors in user acceptance. A main challenge is identifying and demonstrating multi-camera applications that are useful and desirable. Aside from obvious surveillance scenarios, applications frequently mentioned are personal health and elderly care, where the environment is monitored for unusual events such as a falling person [48]. Smart homes are another related scenario where pervasive smart camera networks could be employed to simplify the life of residents. Services for smart homes include adaptation of the environment (e.g., lighting or air conditioning) based on the detection of the presence of persons. The automatic detection of gestures and activities by the smart camera network supports a more active user interaction. Much progress has been achieved in this field in recent years (e.g., [49]); however, further research is required for human gesture and activity recognition to be applied in real-world settings. Regardless of the actual application scenario, a major challenge is to develop multi-camera systems that can be deployed, set up, and operated by customers with little or no technical knowledge.
20.4 CONCLUSIONS Smart camera networks have emerged thanks to simultaneous advances in four key disciplines: computer vision, image sensors, embedded computing, and sensor networks. The convergence of these technical factors has stimulated a revolution in the way we use cameras. Image sensors will become ubiquitous and blend into the everyday environment. Their onboard processing and communication facilities foster collaboration among cameras and distributed data analysis. Considering the recent advances of smart cameras in research and industrial practice, we can identify several trends. First, camera networks are currently undergoing a transition from static to dynamic and adaptive networks. Second, as the costs of single cameras and required network infrastructure drop, we will see an increase in the size of the camera networks. Finally, we expect to see researchers integrating different sensors into
distributed smart sensor networks—audio, seismic, thermal, and so forth. By fusing data from multiple sensors, the smart camera exploits the distinct characteristics of the individual sensors, resulting in enhanced overall output. All of these advances will stimulate the development of many new applications—transforming traditional multi-camera systems into pervasive smart camera networks.
REFERENCES
[1] B. Rinner, W. Wolf, Introduction to distributed smart cameras, Proceedings of the IEEE 96 (10) (2008).
[2] H. Aghajan, R. Kleihorst (Eds.), in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[3] R. Kleihorst, R. Radke (Eds.), in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2008.
[4] D.A. Patterson, J.L. Hennessy, Computer architecture: A quantitative approach, 4th Edition, Morgan Kaufmann, 2006.
[5] W. Wolf, High Performance Embedded Computing, Morgan Kaufmann, 2006.
[6] R. Kleihorst, B. Schueler, A. Danilin, Architecture and applications of wireless smart cameras (networks), in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007.
[7] R. Kleihorst, A. Abbo, B. Schueler, A. Danilin, Camera mote with a high-performance parallel processor for real-time frame-based video processing, in: Proceedings of the First ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[8] M. Bramberger, A. Doblander, A. Maier, B. Rinner, H. Schwabach, Distributed embedded smart cameras for surveillance applications, Computer 39 (2) (2006) 68–75.
[9] T.W.J. Moorhead, T.D. Binnie, Smart CMOS camera for machine vision applications, in: Proceedings of the IEEE Conference on Image Processing and Its Applications, 1999.
[10] L. Albani, P. Chiesa, D. Covi, G. Pedegani, A. Sartori, M. Vatteroni, VISoc: A Smart Camera SoC, in: Proceedings of the Twenty-eighth European Solid-State Circuits Conference, 2002.
[11] W. Wolf, B. Ozer, T. Lv, Smart cameras as embedded systems, Computer 35 (9) (2002) 48–53.
[12] M. Bramberger, J. Brunner, B. Rinner, H. Schwabach, Real-time video analysis on an embedded smart camera for traffic surveillance, in: Proceedings of the Tenth IEEE Real-Time and Embedded Technology and Applications Symposium, 2004.
[13] C. Arth, H. Bischof, C. Leistner, TRICam: An embedded platform for remote traffic surveillance, in: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop, 2006.
[14] D. Bauer, A.N. Belbachir, N. Donath, G. Gritsch, B. Kohn, M. Litzenberger, C. Posch, P. Schön, S. Schraml, Embedded vehicle speed estimation system using an asynchronous temporal contrast vision sensor, EURASIP Journal on Embedded Systems (2007) 12 pages.
[15] F. Dias, F. Berry, J. Serot, F. Marmoiton, Hardware, design and implementation issues on a FPGA-based smart camera, in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[16] R.P. Kleihorst, A.A. Abbo, A. van der Avoird, M.J.R. Op de Beeck, L. Sevat, P. Wielage, R. van Veen, H. van Herten, Xetal: a low-power high-performance smart camera processor, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2001.
[17] B. Rinner, M. Jovanovic, M. Quaritsch, Embedded middleware on distributed smart cameras, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (invited paper), 2007.
[18] C.H. Lin, W. Wolf, A. Dixon, X. Koutsoukos, J. Sztipanovits, Design and implementation of ubiquitous smart cameras, in: Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing, 2006.
[19] M. Quaritsch, B. Rinner, B. Strobl, Improved agent-oriented middleware for distributed smart cameras, in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[20] M. Quaritsch, M. Kreuzthaler, B. Rinner, H. Bischof, B. Strobl, Autonomous multicamera tracking on embedded smart cameras, EURASIP Journal on Embedded Systems (2007) 10 pages.
[21] M.A. Patricio, J. Carbó, O. Pérez, J. García, J.M. Molina, Multi-agent framework in visual sensor networks, EURASIP Journal on Applied Signal Processing 2007 (1) (2007) 226–226.
[22] S. Fleck, F. Busch, W. Strasser, Adaptive probabilistic tracking embedded in smart cameras for distributed surveillance in a 3D model, EURASIP Journal on Embedded Systems 2 (2007) 17.
[23] E. Norouznezhad, A. Bigdeli, A. Postula, B.C. Lovell, A high resolution smart camera with GigE-vision extension for surveillance applications, in: Proceedings of the Second ACM/IEEE International Conference on Distributed Smart Cameras, 2008.
[24] Y. Shi, T. Tsui, An FPGA-based smart camera for gesture recognition in HCI applications, Computer Vision: ACCV (2007) 718–727.
[25] I.F. Akyildiz, T. Melodia, K.R. Chowdhury, Wireless multimedia sensor networks: applications and testbeds, Proceedings of the IEEE 96 (10) (2008).
[26] A. Rowe, A.G. Goode, D. Goel, I. Nourbakhsh, CMUcam3: An Open Programmable Embedded Vision Sensor, Technical report, CMU-RI-TR-07-13, Robotics Institute, Carnegie Mellon University, May 2007.
[27] A. Rowe, D. Goel, R. Rajkumar, FireFly Mosaic: A vision-enabled wireless sensor networking system, in: Proceedings of the Twenty-eighth IEEE International Real-Time Systems Symposium, 2007.
[28] M. Rahimi, R. Baer, O.I. Iroezi, J.C. Garcia, J. Warrior, D. Estrin, M. Srivastava, Cyclops: In situ image sensing and interpretation in wireless sensor networks, in: Proceedings of the Third ACM International Conference on Embedded Networked Sensor Systems, 2005.
[29] H. Medeiros, J. Park, A. Kak, A light-weight event-driven protocol for sensor clustering in wireless camera networks, in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[30] C.B. Margi, X. Lu, G. Zhang, R. Manduchi, K. Obraczka, Meerkats: A power-aware, self-managing wireless camera network for wide area monitoring, in: Proceedings of the International Workshop on Distributed Smart Cameras, 2006.
[31] C. Margi, V. Petkov, K. Obraczka, R. Manduchi, Characterizing energy consumption in a visual sensor network testbed, in: Proceedings of the Second International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities, 2006.
[32] S. Hengstler, D. Prashanth, S. Fong, H. Aghajan, MeshEye: A hybrid-resolution smart camera mote for applications in distributed intelligent surveillance, in: Proceedings of the Sixth ACM/IEEE International Symposium on Information Processing in Sensor Networks, 2007.
[33] C. Wu, H. Aghajan, R. Kleihorst, Mapping vision algorithms on SIMD architecture smart cameras, in: Proceedings of the First ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[34] P. Chen, P. Ahammad, C. Boyer, S.-I. Huang, L. Lin, E. Lobaton, M. Meingast, S. Oh, S. Wang, P. Yan, A.Y. Yang, C. Yeo, L.-C. Chang, J. Tygar, S.S. Sastry, CITRIC: A low-bandwidth wireless camera network platform, in: Proceedings of the Second ACM/IEEE International Conference on Distributed Smart Cameras, 2008.
[35] W.-C. Feng, E. Kaiser, W.-C. Feng, M. Le Baillif, Panoptes: scalable low-power video sensor networking technologies, ACM Transactions on Multimedia Computing, Communications, and Applications 1 (2) (2005) 151–167.
[36] T. Teixeira, D. Lymberopoulos, E. Culurciello, Y. Aloimonos, A. Savvides, A lightweight camera sensor network operating on symbolic information, in: Proceedings of the International Workshop on Distributed Smart Cameras, 2006.
[37] R.J. Radke, D. Devarajan, Z. Cheng, Calibrating distributed camera networks, Proceedings of the IEEE 96 (10) (2008).
[38] R.T. Collins, A.J. Lipton, H. Fujiyoshi, T. Kanade, Algorithms for cooperative multisensor surveillance, Proceedings of the IEEE 89 (10) (2001) 1456–1477.
[39] C.H. Lin, T. Lv, W. Wolf, I.B. Ozer, A peer-to-peer architecture for distributed real-time gesture recognition, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2004.
[40] S. Velipasalar, J. Schlessman, C.-Y. Chen, W. Wolf, J.P. Singh, SCCS: A scalable clustered camera system for multiple object tracking communicating via message passing interface, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2006.
[41] T. Stathopoulos, M. Lukac, D. McIntire, J. Heidemann, D. Estrin, W.J. Kaiser, End-to-end routing for dual-radio sensor networks, in: Proceedings of the Twenty-sixth IEEE International Conference on Computer Communications, 2007.
[42] D. Lymberopoulos, N.B. Priyantha, M. Goraczko, F. Zhao, Towards energy efficient design of multi-radio platforms for wireless sensor networks, in: Proceedings of the Seventh ACM/IEEE International Conference on Information Processing in Sensor Networks, 2008.
[43] N.P. Mahalik (Ed.), Sensor Networks and Configuration: Fundamentals, Standards, Platforms, and Applications, Springer, 2007.
[44] A. Senior, S. Pankanti, A. Hampapur, L. Brown, Y.-L. Tian, A. Ekin, J. Connell, C.F. Shu, M. Lu, Enabling video privacy through computer vision, IEEE Security & Privacy Magazine 3 (3) (2005) 50–57.
[45] A. Chattopadhyaya, T. Boult, PrivacyCam: A privacy preserving camera using uCLinux on the Blackfin DSP, in: Proceedings of the Workshop on Embedded Computer Vision, 2007.
[46] S. Fleck, W. Strasser, Smart camera based monitoring system and its application to assisted living, Proceedings of the IEEE 96 (10) (2008) 1698–1714.
[47] D.N. Serpanos, A. Papalambrou, Security and privacy in distributed smart cameras, Proceedings of the IEEE 96 (10) (2008) 1678–1687.
[48] S. Fleck, R. Loy, C. Vollrath, F. Walter, W. Strasser, SmartClassySurv: A smart camera network for distributed tracking and activity recognition and its application to assisted living, in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2007.
[49] D.A. Forsyth, O. Arikan, L. Ikemoto, J. O’Brien, D. Ramanan, Computational studies of human motion—Part 1: Tracking and motion synthesis, Foundations and Trends in Computer Graphics and Vision 1 (2–3) (2006) 192 pages.
CHAPTER 21
Smart Cameras for Wireless Camera Networks: Architecture Overview
Zoran Zivkovic, Richard Kleihorst, NXP Semiconductors Research, Eindhoven, The Netherlands
Abstract Processing visual information from a large camera network is important for many applications, such as smart environments, human–computer interfaces, and surveillance. Smart camera motes using onboard processing offer an opportunity to create scalable solutions based on distributed processing of visual information. Wireless communication in sensor networks is preferred for many practical reasons. Wireless smart camera node architecture is analyzed in this chapter. Keywords: smart camera architecture, embedded processing, multi-camera systems, wireless sensor networks, distributed processing
21.1 INTRODUCTION Smart environments that are aware of the current situation and can respond in an appropriate way are expected to become a part of our everyday life. Furthermore, it is widely believed that computing in such smart environments will move from desktop computers to a multiplicity of embedded computers present in the smart devices around us [26, 32]. Cameras observing the environment are an important information source, and smart cameras that extract relevant information from images using onboard processing might become an essential building block for future intelligent environments. Large camera networks are needed to cover large areas in surveillance and monitoring applications. Furthermore, multiple cameras are often used to obtain different views of the same scene. Multiple viewpoints help in dealing with ambiguities and occlusions and can lead to more reliable analysis of the scene. An example is human body analysis, which is an important problem for most human-centric applications. Using images to analyze human behavior is a widely studied aspect of computer vision (e.g., [8, 33]), and multiple cameras can improve results (e.g., [13, 14, 31]). Construction of a camera network with many cameras is still hampered by a number of practical constraints. Typically, the cameras are connected to a central processing unit that performs the processing. Many image-processing operations require a great deal of processing power, and current high-end vision algorithms often use the full processing
power of a high-end PC. With multiple cameras, the amount of processing needed keeps growing. Furthermore, sending images from many cameras to a single host might be impossible because of the huge amount of data that must be transmitted. Finally, providing a real-time, low-latency response also becomes increasingly difficult. Recently, smart camera motes using onboard processing and with the ability to communicate with each other have received much attention. Smart cameras are seen as an opportunity to create scalable solutions based on distributed processing of visual information. A sensor network is a network of spatially distributed autonomous devices using sensors to cooperatively monitor physical or environmental conditions. Smart cameras have been used for many years in industrial inspection applications to provide robust and low-cost solutions for specific localized problems. Using smart cameras to realize a visual sensor network is an important research topic and presents a new challenge for smart camera architecture. The networked smart camera should have powerful onboard processing capabilities. There are two other major requirements specific to networked smart cameras:
Low power consumption. Autonomous operation is a highly desirable feature. Ideally the cameras should work for long periods on batteries so that a large camera network can be easily deployed. Power consumption is thus a very important design parameter.
Advanced (wireless) communication capabilities. Networked smart cameras should cooperatively monitor the environment; therefore communication capabilities are essential. The communication should facilitate implementation of various distributed processing algorithms that can provide scalable solutions for the processing of visual information. The cameras should be able to send data to each other and/or to a central processing module. Wireless communication technology is in a mature phase, and wireless communication in sensor networks is preferred for many practical reasons.
In this chapter we analyze the architecture challenges of developing networked wireless smart cameras. We will first analyze common processing of data in a smart camera network. Basic smart camera architecture requirements are derived and analyzed. Furthermore, we will describe in detail some smart camera platforms specifically designed for building wireless smart camera networks.
21.2 PROCESSING IN A SMART CAMERA NETWORK Sensor network data can be processed by sending all of the data to one central processing node. However, this might lead to many practical problems, and distributed processing is often required to achieve a scalable solution. In this section we will analyze distributed and centralized processing in smart camera networks.
21.2.1 Centralized Processing If images from all cameras are gathered at a central processor, it might be possible to make optimal processing algorithms by combining all visual information available. However,
sending real-time video streams from many cameras to a single host may be very difficult in practice. Having smart cameras with onboard processors instead of regular cameras might be beneficial in such cases. Bottom-up vision-processing algorithms [14, 28] are particularly interesting for such settings. In principle, there are two general approaches in vision processing: top-down and bottom-up [28]. The top-down approach typically builds a model that is fitted to the images, for example fitting 3D human models to images [13]. The bottom-up approach has become more popular with the idea of extracting various features from images, which are combined within an image and then across images from different camera views. One example is part-based human posture reconstruction, where candidate human body parts are first detected in each image and then grouped together (e.g., [14, 28]). Many bottom-up algorithms can be naturally mapped to a smart camera network. Each camera can perform some local image processing to reduce raw images/videos to simple descriptions so that they can be efficiently transmitted between cameras. The amount of processing at the central node is thus greatly reduced (see Figure 21.1). This can be essential if real-time and low-delay (latency) performance is required as, for example, in gesture control applications [34, 35]. The type of processing in such smart camera systems is still centralized, but the computation-intensive feature extraction is distributed across cameras.
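A minimal sketch of this bottom-up mapping, per-camera feature extraction followed by fusion of the compact descriptors at a central node, is given below. The synthetic frames, the centroid feature, and the size-weighted fusion rule are assumptions for illustration only; a real system would fuse calibrated measurements rather than raw image coordinates.

```python
import numpy as np

# Sketch of bottom-up processing mapped onto a smart camera network:
# each camera reduces its frame to a few numbers, and only those are fused centrally.

def extract_features(frame, threshold=128):
    """On-camera stage: reduce the frame to a foreground centroid and pixel count."""
    ys, xs = np.nonzero(frame > threshold)
    if len(xs) == 0:
        return None
    return {"centroid": (float(xs.mean()), float(ys.mean())), "size": int(len(xs))}

def fuse(features_per_camera):
    """Central stage: combine the per-camera descriptors (size-weighted average)."""
    reports = [f for f in features_per_camera.values() if f is not None]
    total = sum(f["size"] for f in reports)
    x = sum(f["centroid"][0] * f["size"] for f in reports) / total
    y = sum(f["centroid"][1] * f["size"] for f in reports) / total
    return (x, y), len(reports)

# Synthetic frames from three cameras, each with a bright blob at a different spot.
frames = {}
for name, (bx, by) in {"cam0": (10, 20), "cam1": (12, 22), "cam2": (40, 5)}.items():
    img = np.zeros((48, 64), dtype=np.uint8)
    img[by:by + 6, bx:bx + 6] = 255
    frames[name] = img

features = {name: extract_features(img) for name, img in frames.items()}
(cx, cy), cameras_used = fuse(features)
print(f"fused estimate ({cx:.1f}, {cy:.1f}) from {cameras_used} cameras")
print("raw frame size per camera (bytes):", frames["cam0"].nbytes,
      "vs. a descriptor of a few tens of bytes")
```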
FIGURE 21.1 Typical bottom-up computer vision algorithm using a standard camera network and a smart camera network. In the standard camera system processing usually starts when the whole frame is transferred to the main computer. On a smart camera most of the processing can be performed while reading the camera sensor, which reduces latency. Although smart cameras send much less data, the transmission is typically via a low-bandwidth data path that can extend the data transfer.
21.2.2 Distributed Processing

In surveillance and monitoring applications, the number of cameras needed increases with the area that must be covered. Ideally, we would have a fully decentralized vision algorithm that computes and disseminates aggregates of the data with minimal processing and communication requirements and good fault tolerance. If the processing is centralized, the processing and communication requirements rise as the number of cameras increases; in a distributed version of the algorithm they could remain constant (see Figure 21.2). Decentralized vision systems are receiving more and more attention (e.g., [11, 29]). A new trend in distributed sensor systems is the use of gossip-based models of computation [17, 18]. Roughly, each node in a gossip-based protocol repeatedly contacts some other node at random and the two nodes exchange information. Gossip-based protocols are very simple to implement and enjoy strong performance guarantees as a result of randomization. Their use in smart camera network processing is still to be investigated. It is highly likely that hybrid processing will lead to the best practical solution, where camera groups combine information in a centralized way and further data processing is distributed across the groups.
Centralized processing versus distributed processing:

Centralized processing with standard cameras: smart camera computation power, small; communication bandwidth needed, large and rising with the number of sensors; central processor computation power, large and rising with the number of sensors.

Centralized processing with smart cameras: smart camera computation power, medium; communication bandwidth needed, medium and rising with the number of sensors; central processor computation power, medium and rising with the number of sensors.

Distributed processing with smart cameras: smart camera computation power, large and possibly constant; communication bandwidth needed, small and possibly constant; central processor computation power, none.
FIGURE 21.2 Processing and communication requirements for distributed and centralized camera network processing.
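The gossip-based model of computation mentioned above can be made concrete with a small simulation. The sketch below is an illustration only, not an implementation of [17, 18]: each "camera node" repeatedly picks a random peer and the pair averages its local values, so every node converges to the network-wide average without any central coordinator.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// Minimal gossip averaging: each node holds a local measurement (e.g., the
// number of objects it currently sees). In every round, each node contacts one
// random peer and both replace their values with the pair average.
int main() {
    std::mt19937 rng(42);
    std::vector<double> value = {4, 0, 1, 7, 2, 0, 3, 5};   // local measurements
    std::uniform_int_distribution<std::size_t> pick(0, value.size() - 1);

    for (int round = 0; round < 30; ++round) {
        for (std::size_t i = 0; i < value.size(); ++i) {
            std::size_t j = pick(rng);                 // random peer (may be i itself)
            double avg = 0.5 * (value[i] + value[j]);
            value[i] = value[j] = avg;                 // pairwise information exchange
        }
    }
    for (double v : value) std::printf("%.3f ", v);    // all values approach 2.75
    std::printf("\n");
    return 0;
}
```

After a few tens of rounds every node holds approximately the global average, which illustrates why gossip protocols scale gracefully with network size and need no central point of control.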
21.3 SMART CAMERA ARCHITECTURE

The architecture of a smart camera is similar to the general architecture of a sensor node in a sensor network. The smart camera should be able to process the visual information and communicate with a central node and/or with other cameras. Therefore, the main hardware modules of a smart camera are the sensor, the processor, and the communication module (see Figure 21.3). Low energy consumption is an important requirement, especially in large camera networks. Wireless communication has many practical advantages, and low-power performance is essential for wireless smart cameras. A wireless smart camera mote should be able to work on batteries for a long period of time such that no wires are needed to realize the whole network. Efficient power management is needed for such cameras. Note that many cameras are often equipped with controllable devices. For example, many industrial cameras contain a controllable light source. Furthermore, pan-tilt-zoom (PTZ) cameras are common in many surveillance applications. Because they are highly application driven, such devices will not be considered here as an integral part of the camera architecture.
21.3.1 Sensor Modules

Any image sensor can be used, depending on the application. Most cameras use either a CCD or a CMOS image sensor. Both accomplish the task of converting light into electrical signals. CMOS sensors can potentially be smaller, use less power, and provide data faster. The light is first modified by a lens system and some light filters. Camera lenses providing a wide field of view can be interesting for surveillance applications. The sensor module often performs conversion to digital signals and basic image enhancement tasks such as white balance, contrast, and gamma correction. The image enhancement tasks and other properties of the image sensor module can usually be controlled. High-end image sensor modules can provide high-quality images with low noise. However, they are expensive and often large. Low-end image sensors used, for example, in mobile phones are cheap and can be tiny; they are expected to be more appropriate for smart cameras, especially in large networks. There are also many special and usually more expensive sensors, such as thermal-imaging and multi-spectral sensors. Sensors providing a depth map of the scene (e.g., [23]) may be very interesting for specific applications. However, they are not inherently low power because they sense actively, constantly transmitting signals into the environment.
FIGURE 21.3 General architecture of a wireless smart camera. Image data from the sensor is processed by the processing module. A transceiver module is needed for communicating with other cameras and/or a central processing unit.
Current depth-imaging devices are also still much more expensive than standard camera modules, but cheaper versions are being developed.1 Another interesting special sensor still in development is the neuromorphic, or “silicon-retina,” sensor [24], which is claimed to lead to a highly power-efficient solution since it usually sends data only when there are changes in the environment. Finally, we should mention that a smart camera might be equipped with a number of camera sensor modules. For example, two camera modules next to each other might be used for stereo depth estimation [12]. The MeshEye smart camera mote [15] uses camera modules of different resolutions for different tasks. Combining vision with other sensors is also an option. For example, inertial sensors can be added to detect movement [6].
21.3.2 Processing Module

The image data from the sensor is further processed by the onboard processor, which also controls sensor parameters and communication. Some external devices (e.g., a light, a zoom lens, a pan-tilt unit) might also be controlled. A range of processor types can be used.

General-purpose processor (GPP). A general-purpose processor, such as the Pentium, offers much flexibility. However, its cost and power consumption are too high for most applications.

Digital signal processor (DSP). Many processors intended for embedded applications can be cost-effective solutions. A digital signal processor can provide high-speed data processing. Because image processing can be regarded as an extreme case of signal processing, most general-purpose DSPs can be used for many image-processing operations. However, they are usually not fine-tuned for this specific type of processing.

Media processor. Media processors are used for many multimedia applications and might offer a reasonable mix of cost effectiveness and flexibility. Typically such processors contain a high-end DSP together with some typical multimedia peripherals such as video and other fast data ports. The most popular media processors are the TriMedia (NXP), DM64x (TI), Blackfin (ADI), and BSP (Equator).

Image/vision processor. Low-level image processing is associated with typical kernel operations, like convolutions, and data-dependent operations using a limited neighborhood of the current pixel. Highly parallel processor architectures can provide very efficient solutions for such operations [16]. A set of processors has been designed specifically for them; these use a single-instruction, multiple-data (SIMD) architecture in which a number of processing elements (PEs) work in parallel. The rationale behind this is that the millions of pixels per second that arrive at the vision processor need identical treatment. Besides extreme pixel-processing performance, a benefit of SIMD is that it is very power efficient. For example, the Xetal-II processor from NXP contains 320 processing elements and can achieve a performance of over 100 GOPS with less than 600 mW of power consumption [2]. This chip has been used in the WiCa wireless smart camera [21].
1. 3DV systems: www.3dvsystems.com.
FIGURE 21.4 Common operations in visual information processing: the video data passes through low-level operations (feature extraction, filtering), intermediate-level operations (object detection, object analysis), and high-level operations (decision making, networking). Besides vision processing, the smart camera processor must communicate with the camera network. Low-level and some intermediate-level operations can be efficiently implemented using parallel-architecture (i.e., SIMD) processors. A general-purpose DSP might be better suited for high-level and some intermediate-level operations.
A chip with comparable architecture and performance, the IMAPCAR from NEC, which is based on the IMAP-CE [22], has not yet been used in smart cameras; as the name implies, it is aimed at automotive safety. Sony [4] recently introduced the Fiesta chip, which is aimed at the HDTV camcorder market. Because of its high performance, low power consumption, and programmability, it could play a role in smart cameras.

Hybrid processor/system-on-a-chip. Visual-information processing algorithms often contain various types of processing. The low-level (early) image-processing operations can be efficiently implemented on a SIMD processor. The intermediate- and higher-level processing, such as analysis of the detected objects and decision making, might be better suited for a general-purpose DSP. Smart camera processors must also deal with data communication and possibly with controlling some external devices. Thus, a combination of processing cores for different tasks seems highly suitable for smart cameras (see Figure 21.4). A combination of a vision processor for low-level image-processing operations and a DSP for higher-level processing seems a reasonable choice. For example, a general-purpose microcontroller and a SIMD Xetal (NXP) processor were combined on a printed circuit board (PCB) to realize the WiCa smart camera platform [1]. The complete solution can also be integrated into an application-specific integrated circuit (ASIC). Such an ASIC will likely be fairly complex and can be called a system-on-a-chip (SoC). For example, the EyeQ SoC2 contains a number of DSPs and vision-specific processing engines. Further integration could also include the sensor and the communication module on the same chip. The costs of developing an ASIC are very high and pay off only at high volumes. A solution using a field-programmable gate array (FPGA) can be an alternative; an FPGA is often advisable during the development phase. An ASIC is likely to be faster and to require a smaller chip, but it should be noted that FPGA prices are getting lower and performance is getting better every year. An FPGA vision-processing system involving various processing levels similar to Figure 21.4 was described by Chalimbaud and Berry [6].

Choosing an appropriate processor is difficult since the needed processing power depends on the application [19]. Image-processing tasks are usually computationally
2. Mobileye: www.mobileye.com.
intense, and more computation power is always welcome [15]. Image-processing-specific SIMD processors could become an important part of future wireless smart cameras. Advanced power management systems should also be considered. Power consumption is expected to be reduced further through techniques such as voltage scaling, lazy computation, and the use of low-energy architectures. Reductions in energy consumption can go orders of magnitude further before they reach the intrinsic minimum of the silicon chips [20]. Additional constraints common to embedded systems are the amount of available memory and the choice between fixed-point and floating-point processors. Floating-point processors are around 20 percent more expensive and take up more space on the chip. Images contain mostly 8-bit data, so for many low-level operations fixed-point processors might be sufficient. For higher-level decision making, floating point might be more appropriate. Important additional issues that need to be considered are how mature the development tools for the chip are and whether the chip is going to be supported and further developed in the foreseeable future.
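The case for fixed-point arithmetic and data-parallel hardware can be illustrated with a typical low-level kernel. The sketch below is illustrative only; the frame size and the threshold are assumptions. It performs frame differencing for change detection entirely in small-integer arithmetic, and every pixel receives identical treatment, which is exactly the access pattern that SIMD arrays such as the Xetal exploit.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Frame differencing: mark pixels whose intensity changed by more than a
// threshold. Pure 8-bit integer data and no data-dependent control flow per
// pixel, so the same instruction can be applied to many pixels at once,
// whether by a vectorizing compiler or by a line of SIMD processing elements.
void change_mask(const uint8_t* curr, const uint8_t* prev,
                 uint8_t* mask, std::size_t n, uint8_t threshold) {
    for (std::size_t i = 0; i < n; ++i) {
        int diff = std::abs(static_cast<int>(curr[i]) - static_cast<int>(prev[i]));
        mask[i] = (diff > threshold) ? 255 : 0;
    }
}

int main() {
    const std::size_t w = 640, h = 480;   // assumed VGA gray-scale frames
    std::vector<uint8_t> curr(w * h, 30), prev(w * h, 10), mask(w * h);
    change_mask(curr.data(), prev.data(), mask.data(), w * h, /*threshold=*/15);
    return 0;
}
```

Because the per-pixel work is identical and fits comfortably in 8- and 16-bit integers, routines of this kind map naturally onto fixed-point DSPs and SIMD vision processors, while floating point is reserved for the higher-level decision making discussed above.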
21.3.3 Communication Modules

Standard cameras have some sort of video output. Industrial smart cameras also have some control and other data communication possibilities. For smart cameras intended for smart camera networks, the communication module is much more important. Wireless communication is preferred over wired communication for many practical reasons. Wireless smart cameras need to communicate with a central node and/or with other cameras. There are a great many wireless communication protocols and standards. Because low-power communication is important in a wireless smart camera, low-power communication standards such as ZigBee and Bluetooth, which are common in wireless sensor networks, are also highly relevant for smart camera networks. Low power means low bandwidth and short-range communication, so streaming real-time video is not possible over low-power networks (see Figure 21.5). Besides basic communication, the smart camera communication module should also support easy embedding in a large camera/sensor network. Many network architectures and protocols are available, and smart camera networks often have specific requirements. An overview of networking issues in wireless multimedia networks is given in Akyildiz et al. [3]. One of the main issues in realizing algorithms in a smart camera network is deciding which part of the processing to perform on the camera and what information to transmit to the central node and/or other cameras. This can range from sending real-time video to sending just the events detected by the camera. It is believed that wireless transmission is close to its energy-efficiency limit [20]. This becomes clear from the distribution of several modern short-range transmission systems such as ZigBee, PicoRadio, and Bluetooth: all are scattered slightly above a straight energy-per-bit line in Figure 21.5. Processing, in contrast, is expected to continue to become more and more power efficient. This leads to the conclusion that it is better to invest more in computing at the camera node itself, sending only event detections to the central host and/or to the other cameras in the connected environment.
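A back-of-the-envelope calculation makes the compute-versus-transmit trade-off concrete. All constants below are rough assumptions chosen only to illustrate the reasoning, not measurements from the cited systems: a low-power radio is assumed to cost on the order of 100 nJ per transmitted bit, and an embedded processor on the order of 1 nJ per instruction.

```cpp
#include <cstdio>

// Energy comparison (all constants are illustrative assumptions): sending a
// raw gray-scale VGA frame versus processing it locally and sending only a
// short event message.
int main() {
    const double nj_per_bit = 100.0;       // assumed radio cost per transmitted bit
    const double nj_per_op  = 1.0;         // assumed processor cost per instruction

    const double frame_bits    = 640.0 * 480 * 8;   // one raw frame
    const double event_bits    = 64.0 * 8;          // small event/feature message
    const double ops_per_frame = 640.0 * 480 * 50;  // ~50 instructions per pixel

    const double e_send_frame   = frame_bits * nj_per_bit;
    const double e_process_send = ops_per_frame * nj_per_op + event_bits * nj_per_bit;

    std::printf("transmit raw frame:       %.1f mJ\n", e_send_frame / 1e6);
    std::printf("process + transmit event: %.1f mJ\n", e_process_send / 1e6);
    return 0;
}
```

Under these admittedly crude assumptions, processing on the camera and sending only an event is more than an order of magnitude cheaper than shipping the raw frame, which is the rationale given above for investing in computation at the node.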
FIGURE 21.5 Fixed energy-per-bit line of short-range transmission systems, plotted as power (from mains-powered down to battery-powered and autonomous systems) versus data rate (bit/s); systems shown include DOCSIS, UTP, ADSL, 802.11a, GSM, UMTS, Bluetooth, ZigBee, and PicoRadio. Note that all modern short-range communication standards are scattered just above a straight line: power per bit does not scale with Moore's law.
21.4 EXAMPLE WIRELESS SMART CAMERAS

While smart cameras for industrial inspection have been used for many years, smart camera platforms aimed at wireless smart camera networks are still in the research phase. A number of camera motes have been developed by various institutions. Cyclops [27], one of the early platforms, uses a 7.3 MHz, 8-bit microcontroller with very limited processing power and the MICA2 wireless mote for communication. A number of other system prototypes were developed using standard off-the-shelf components. For example, Panoptes [10] and Meerkats [25] consist of a webcam and an 802.11b PCMCIA wireless card connected to a Stargate board3 that contains a StrongARM processor. SmartCAM [5] is a prototype system for traffic surveillance that uses a development board with several DM64x (TI) DSPs and a PCI wireless network card. Here we analyze four recent hardware platforms designed specifically as low-power wireless smart cameras: MeshEye [15], CMUcam3 [30], WiCa [1], and CITRIC [7] (see Figure 21.6).
21.4.1 MeshEye

The MeshEye camera mote was developed at Stanford University [15]. This mote uses a CMOS VGA (640×480 pixel, gray scale or 24-bit color) sensor module and two kilopixel optical mouse sensors (30×30 pixel, 6-bit gray scale). Up to eight kilopixel sensors can be used. Communication is via a ZigBee transceiver module. The processor is the Atmel AT91SAM7S microcontroller, which incorporates an ARM7 core.
3. Crossbow technology: www.xbow.com.
FIGURE 21.6 Example wireless smart camera motes: MeshEye, CMUcam3, WiCa, and CITRIC.
The processor offers a low-power solution based on a 32-bit RISC architecture and works at 55 MHz. The main design goal was low power consumption, and the vision system is specifically designed for low-power performance. In the MeshEye all three vision sensors are focused to infinity and have approximately the same field of view. They are used in the following manner. One kilopixel sensor continuously detects moving objects. Once an object is detected, the other kilopixel sensor is used to roughly determine the position and size of the object with a basic stereo vision algorithm. This allows calculation of the region of interest (ROI) that the object should occupy in the higher-resolution VGA sensor. The ROI can then be used to get a more detailed view of the object from the VGA sensor. This vision system is inspired by the human visual system: the kilopixel sensors resemble the retina’s rod cells, and the high-resolution color sensor resembles the cone cells. MeshEye’s architecture was optimized for power efficiency. Its developers analyzed the power consumption of the system in a simple application where each object was detected and a high-resolution ROI from the VGA camera was saved. They reported an operating time of five days on two AA batteries (capacity 2850 mAh) when objects appeared every 50 seconds and the kilopixel sensor worked at two frames per second. It was reported that the very limited processing power of the simple ARM7 core is the main bottleneck in the system.
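The staged use of MeshEye's sensors can be summarized in code form. The sketch below is a free interpretation of the flow described above, not code from the MeshEye platform; all types and function names are placeholders, and the stubs simply stand in for the real sensing routines.

```cpp
#include <cstdint>
#include <optional>

// Placeholder types and stubs illustrating the MeshEye-style detection flow:
// the low-resolution sensors run continuously, and the VGA imager is exercised
// only once something interesting has been localized.
struct Roi { uint16_t x, y, w, h; };

std::optional<Roi> detect_motion_kilopixel() { return std::nullopt; } // stub
Roi  refine_with_stereo(const Roi& coarse)   { return coarse; }       // stub
void capture_vga_roi(const Roi&)             {}                       // stub
void transmit_event(const Roi&)              {}                       // stub (ZigBee)

void sensing_loop() {
    for (;;) {
        // Stage 1: one kilopixel imager continuously scans for moving objects.
        std::optional<Roi> candidate = detect_motion_kilopixel();
        if (!candidate) continue;            // nothing detected, stay low power

        // Stage 2: the second kilopixel imager provides a rough stereo estimate
        // of position and size, which defines a region of interest.
        Roi roi = refine_with_stereo(*candidate);

        // Stage 3: only now is the high-resolution VGA sensor used, and only
        // for the region of interest, before the result is reported.
        capture_vga_roi(roi);
        transmit_event(roi);
    }
}

int main() { return 0; /* sensing_loop() would run indefinitely on the camera */ }
```

The design point is that the expensive sensor and the radio are exercised only after the cheap sensors have justified it, which is what yields the multi-day battery life reported above.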
21.4.2 CMUcam3

The CMU camera was developed at Carnegie Mellon University [30]. Three versions have been developed, and the latest version is shown in Figure 21.6. The CMUcam3 is ARM7TDMI based; the main processor is the NXP-LPC2106, connected to a 351×288 RGB color CMOS camera sensor module. The camera communicates using a ZigBee module, which also runs Nano-RK [9], a real-time resource-centric operating system that provides hooks for globally synchronized task processing. The processor used is slightly faster than the MeshEye’s, and a new camera is being developed using a Blackfin (ADI) media processor. Power consumption is low since the ARM7 processor uses just 30 mA at 1.8 V. The total system in active mode uses 130 mA at 5 V. Again, the very limited processing power of the system is the main bottleneck.
21.4.3 WiCa

The WiCa camera was developed by NXP Semiconductors Research [1]. The platform is based on the NXP Xetal IC3D processor, a massively parallel SIMD architecture with 320 parallel processing elements. The peak pixel performance of the Xetal IC3D is around 50 GOPS at 400 mW power consumption. Up to two CMOS VGA (640×480 pixel, 24-bit color) sensor modules can be attached to the camera. Besides the SIMD processor, an 8051 general-purpose processor is used for intermediate- and high-level processing and control. The two processors are coupled using a dual-port RAM that enables them to work on a shared workspace, each at its own processing pace. The camera communicates using a ZigBee module. Peak current consumption is around 750 mA at 5 V, and operation of up to 4 hours on four AA batteries has been reported [35]. Further development plans include using the new Xetal-II processor with 107 GOPS peak performance [2].
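The dual-port RAM coupling can be thought of as a shared workspace between a producer and a consumer: the low-level (SIMD) stage deposits per-frame feature data while the high-level (control) stage picks up the most recent result at its own pace. The sketch below mimics this with two threads and a lock; it is an analogy under that assumption, not WiCa firmware, and on the real platform the arbitration is done by the dual-port memory itself rather than by a mutex.

```cpp
#include <cstdio>
#include <mutex>
#include <thread>

// Shared workspace standing in for the WiCa-style dual-port RAM.
struct FeatureBlock { int frame_id = -1; int num_objects = 0; };

std::mutex   ram_mutex;    // stands in for the hardware arbitration
FeatureBlock shared_block; // the shared workspace itself

void low_level_producer() {    // role of the SIMD pixel processor
    for (int frame = 0; frame < 1000; ++frame) {
        FeatureBlock result{frame, frame % 5};     // pretend feature extraction
        std::lock_guard<std::mutex> lock(ram_mutex);
        shared_block = result;                     // publish into the workspace
    }
}

void high_level_consumer() {   // role of the 8051-class control processor
    int last_seen = -1;
    while (last_seen < 999) {
        std::lock_guard<std::mutex> lock(ram_mutex);
        if (shared_block.frame_id != last_seen) {
            last_seen = shared_block.frame_id;
            // decision making / networking would happen here
        }
    }
    std::printf("consumed up to frame %d\n", last_seen);
}

int main() {
    std::thread producer(low_level_producer), consumer(high_level_consumer);
    producer.join();
    consumer.join();
    return 0;
}
```

The effect matches the description above: neither side blocks the other for long, and the control processor may skip frames if the pixel processor runs ahead, each working at its own processing pace.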
21.4.4 CITRIC

The CITRIC smart camera mote [7] is the result of joint work by the University of California at Berkeley, the University of California at Merced, and the Industrial Technology Research Institute of Taiwan. The platform contains a 1.3-megapixel camera and a ZigBee module for communication. The main processor is an ARM9-based Intel PXA270. The developers report power consumption of around 1 W while running a background subtraction task on the camera. The ARM9-based processor on the CITRIC platform has considerably more processing power than the ARM7 used in the CMUcam3 and the MeshEye. Comparing it to the NXP Xetal processor on the WiCa platform is more difficult. The Xetal is a SIMD processor with 320 processing elements and can be very efficient for parallel operations, for example, most pixel-based operations such as filtering, edge detection, and background subtraction. Canny edge detection can be performed on a VGA image in a few milliseconds on the Xetal processor running at only 80 MHz, while the same operation takes around 350 ms on the fastest 520 MHz Intel PXA270, as reported in Chen et al. [7]. An overview of the characteristics of the selected cameras is given in Table 21.1.
21.5 CONCLUSIONS

Visual information processing in smart camera networks is a developing field with many potential applications. Many wireless smart camera platforms have been proposed, and new ones are being designed. The key modules are the sensor, the processor, and the transceiver, and the key requirements are low power consumption, high processing power, and wireless communication capability. An ideal networked smart camera should work on batteries for a very long period of time, have enough processing power to run state-of-the-art vision algorithms, and be able to easily connect and communicate with other cameras and other devices. Most of the time standard sensor modules are used in smart camera networks. Specialized sensors might be important for certain applications, especially if their price goes down as new developments occur. Wireless communication technology is in a mature phase, so usually standard low-power, low-bandwidth solutions, such as ZigBee, are used (see Figure 21.5).
Table 21.1 Comparison of Selected Wireless Smart Camera Motes

MeshEye [15]. Sensor: CMOS (640×480, color) plus two low-resolution sensors (30×30, 6-bit gray scale). Processor: Atmel AT91SAM7S (ARM7 based), 55 MHz. Transceiver: ZigBee (802.15.4, 250 Kbps). Power consumption: ≈1 W (power-efficient use of the sensors was tested to reduce this).

CMUcam3 [30]. Sensor: CMOS (351×288, color). Processor: NXP-LPC2106 (ARM7 based), 60 MHz. Transceiver: ZigBee (802.15.4, 250 Kbps). Power consumption: ≈650 mW.

WiCa [1]. Sensor: two CMOS (640×480, color). Processor: NXP Xetal (SIMD, 320 processing elements), 80 MHz, plus 8051 microcontroller, 24 MHz. Transceiver: ZigBee (802.15.4, 250 Kbps). Power consumption: ≈1 W (3.75 W peak).

CITRIC [7]. Sensor: CMOS (1280×1024, color). Processor: Intel PXA270 (ARM9 based), up to 520 MHz. Transceiver: ZigBee (802.15.4, 250 Kbps). Power consumption: ≈1 W.
The processor module is the central and most critical part of the smart camera architecture, and it must perform various tasks. One possible way to proceed is the use of multiple processing cores, where each core is specifically designed for a certain type of processing. Further developments in processor technology and architecture are expected to increase processing power and reduce power consumption. Advanced power management of the whole smart camera will likely be needed. Fully integrated solutions, which could lead to tiny devices and might further reduce power consumption, are expected. Modular solutions are an alternative that can provide a choice of sensor, processing, and communication modules. Most distributed processing algorithms developed for sensor networks are based on simple scalar sensors such as temperature, pressure, and acceleration sensors. In comparison to scalar sensors, vision sensors generate much more data because of the two-dimensional nature of their pixel array. The sheer amount of raw data makes analysis particularly difficult in many applications, so new paradigms are needed for distributed processing algorithms in smart camera networks. Continuing development of distributed vision-processing algorithms and new applications will continue to shape the architecture of networked smart cameras.
REFERENCES

[1] A.A. Abbo, R.P. Kleihorst, A programmable smart-camera architecture, in: Proceedings of the Advanced Concepts for Intelligent Vision Systems, 2002.
[2] A.A. Abbo, R.P. Kleihorst, V. Choudhary, et al., Xetal-II: A 107 GOPS, 600 mW massively parallel processor for video scene analysis. IEEE Journal of Solid State Circuits 43 (1) (2008) 192–201.
[3] I.F. Akyildiz, T. Melodia, K.R. Chowdhury, A survey on wireless multimedia sensor networks. Computer Networks 51 (2007) 921–960.
[4] S. Arakawa, Y. Yamaguchi, S. Akui, et al., A 512 GOPS fully programmable digital image processor with full HD 1080p processing capabilities. ISSCC Digest of Technical Papers (2008) 312–315.
[5] M. Bramberger, A. Doblander, A. Maier, B. Rinner, H. Schwabach, Distributed embedded smart cameras for surveillance applications. IEEE Computer 39 (2) (2006) 68–75.
[6] P. Chalimbaud, F. Berry, Embedded active vision system based on an FPGA architecture. EURASIP Journal on Embedded Systems, 2007.
[7] P. Chen, P. Ahammad, C. Boyer, et al., CITRIC: A low-bandwidth wireless camera network platform, in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2008.
[8] D. Hogg, Model-based vision: A program to see a walking person. Image and Vision Computing 1 (1) (1983) 5–20.
[9] A. Eswaran, A. Rowe, R. Rajkumar, Nano-RK: An energy-aware resource-centric RTOS for sensor networks, in: Proceedings of the IEEE Real-Time Systems Symposium, 2005.
[10] W. Feng, E. Kaiser, W.C. Feng, M.L. Baillif, Panoptes: Scalable low-power video sensor networking technologies. ACM Transactions on Multimedia Computing, Communications and Applications 1 (2) (2005) 151–167.
[11] S. Funiak, C.E. Guestrin, M.A. Paskin, R. Sukthankar, Distributed localization of networked cameras, in: Proceedings of the ACM/IEEE International Symposium on Information Processing in Sensor Networks, 2006.
[12] X. Gao, R. Kleihorst, B. Schueler, Stereo vision in a smart camera system, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Embedded Computer Vision, 2008.
[13] D. Gavrila, L. Davis, Tracking of humans in action: A 3D model-based approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1996.
[14] A. Gupta, A. Mittal, L.S. Davis, Constraint integration for efficient multiview pose estimation with self-occlusions. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (3) (2008) 493–506.
[15] S. Hengstler, D. Prashanth, S. Fong, H. Aghajan, MeshEye: A hybrid-resolution smart camera mote for applications in distributed intelligent surveillance, in: Proceedings of the ACM/IEEE Conference on Information Processing in Sensor Networks, 2007.
[16] P. Jonker, Why linear arrays are better image processors, in: Proceedings of the IAPR Conference on Pattern Recognition, 1994.
[17] R. Karp, C. Schindelhauer, S. Shenker, B. Vocking, Randomized rumour spreading, in: Proceedings of the IEEE Symposium on Foundations of Computer Science, 2000.
[18] D. Kempe, A. Dobra, J. Gehrke, Gossip-based computation of aggregate information, in: Proceedings of the IEEE Symposium on Foundations of Computer Science, 2003.
[19] B. Kisacanin, Examples of low-level computer vision on media processors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Embedded Computer Vision, 2005.
[20] R. Kleihorst, B. Schueler, A. Abbo, V. Choudhary, Design challenges for power consumption in mobile smart cameras, in: Proceedings of the Conference on Cognitive Systems with Interactive Sensors, 2006.
[21] R. Kleihorst, B. Schueler, A. Danilin, Architecture and applications of wireless smart cameras (networks), in: Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, 2007.
[22] S. Kyo, T. Koga, S. Okazaki, I. Kuroda, A 51.2-GOPS scalable video recognition processor for intelligent cruise control based on a linear array of 128 four-way VLIW processing elements. IEEE Communications Magazine 38 (11) (2003).
[23] R. Lange, P. Seitz, Solid-state time-of-flight range camera. IEEE Journal of Quantum Electronics 37 (3) (2001) 390–397.
[24] A. Mahowald, C.A. Mead, A silicon model of early visual processing. Neural Networks 1 (1) (1988) 91–97.
[25] C.B. Margi, R. Manduchi, K. Obraczka, Energy consumption tradeoffs in visual sensor networks, in: Proceedings of the Brazilian Symposium on Computer Networks, 2006.
[26] M. Pantic, A. Pentland, A. Nijholt, T.S. Huang, Human computing and machine understanding of human behavior: A survey. Artificial Intelligence for Human Computing, Lecture Notes in A.I., 4451 (2007) 47–71.
[27] M.H. Rahimi, D. Estrin, R. Baer, H. Uyeno, J. Warrior, Cyclops: Image sensing and interpretation in wireless networks, in: Proceedings of the ACM International Conference on Embedded Networked Sensor Systems, 2004.
[28] D. Ramanan, D.A. Forsyth, A. Zisserman, Tracking people by learning their appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1) (2007) 65–81.
[29] A. Rowe, D. Goel, R. Rajkumar, FireFly Mosaic: A vision-enabled wireless sensor networking system, in: Proceedings of the Real-Time Systems Symposium, 2007.
[30] A. Rowe, A. Goode, D. Goel, I. Nourbakhsh, CMUcam3: An open programmable embedded vision sensor. Technical Report RI-TR-07-13, Carnegie Mellon Robotics Institute, 2007.
[31] S. Velipasalar, W. Wolf, Multiple object tracking and occlusion handling by information exchange between uncalibrated cameras, in: Proceedings of the IEEE International Conference on Image Processing, 2005.
[32] M. Weiser, The computer for the twenty-first century. Scientific American 265 (3) (1991) 94–104.
[33] C. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 780–785.
[34] C. Wu, H. Aghajan, Collaborative gesture analysis in multi-camera networks, in: Proceedings of the ACM SenSys Workshop on Distributed Smart Cameras, 2006.
[35] Z. Zivkovic, V. Kliger, A. Danilin, B. Schueler, C. Chang, R. Kleihorst, H. Aghajan, Toward low-latency gesture control using smart camera networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Embedded Computer Vision, 2008.
CHAPTER 22
Embedded Middleware for Smart Camera Networks and Sensor Fusion
Bernhard Rinner, Markus Quaritsch
Institute of Networked and Embedded Computing, Klagenfurt University, Klagenfurt, Austria
Abstract
Smart cameras represent an interesting research field that has evolved over the last decade. In this chapter we focus on the integration of multiple, potentially heterogeneous smart cameras into a distributed system for computer vision and sensor fusion. Because an important aspect of distributed systems is the system-level software, also called middleware, we discuss the middleware requirements of distributed smart cameras and the services the middleware must provide. In our opinion, a middleware following the agent-oriented paradigm allows us to build flexible and self-organizing applications that encourage a modular design.
Keywords: distributed smart cameras, smart camera middleware, agent-oriented middleware, sensor fusion
22.1 INTRODUCTION

Smart cameras have been the subject of study in research laboratories and industry for quite some time. While in the “early days” sensing and processing capabilities were very limited, we have seen dramatic progress in smart camera research and development in the last few years [1–3]. Recently, much effort has been put into the development of smart camera networks. These distributed smart cameras (DSCs) [4–6] are real-time distributed embedded systems that achieve computer vision using multiple cameras. This new approach is emerging thanks to a confluence of demanding applications and the huge computational and communication abilities predicted by Moore’s law. While sensing, processing, and communication technology is progressing at a quick pace, we unfortunately do not experience such rapid development in system-level software. Designing, implementing, and deploying applications for distributed smart cameras is typically complex and challenging. Thus, we would like to obtain as much support as possible from system-level software on the DSC network. Such system-level software, or middleware system, abstracts the network and provides services to the application. As we will see later in this chapter, a DSC network is significantly different from other
well-known network types such as computer [7] or sensor networks [8]. Thus, we cannot directly adopt middleware systems available for these networks. From the application point of view, the major services a middleware system should provide are the distribution of data and control. However, DSC networks are mostly deployed to perform distributed signal processing. Thus, middleware systems for DSCs should also provide dedicated services for these applications. Our focus is on support for distributed image processing and sensor fusion. In this chapter we introduce our approach to a middleware system for distributed smart cameras. We first present a brief overview of smart camera architectures and distributed smart camera networks. In Section 22.4 we focus on embedded middleware systems, starting with the introduction of a generic middleware architecture and middleware systems for general-purpose networks. Middleware systems for embedded platforms as well as the differences for DSC networks are also covered in this section. In Section 22.5 we present our agent-based middleware approach for distributed smart cameras. In Section 22.6 we describe the implementation of our middleware system and present two case studies: a decentralized multi-camera tracking application and a sensor fusion case study. Finally, in Section 22.7, we conclude the chapter with a brief discussion.
22.2 SMART CAMERAS
A generic architecture of a smart camera comprising sensing, processing, and communication units is depicted in Figure 22.1. The image sensor, which is implemented in either CMOS or CCD technology, represents the data source of the processing pipeline. The sensing unit reads the raw data from the image sensor and often performs some preprocessing such as white balance and color transformations. It also controls important parameters of the sensor, such as capture rate, gain, or exposure, via a dedicated interface. The main image-processing tasks take place at the processing unit, which receives the captured images from the sensing unit, performs real-time image analysis, and transfers the abstracted data to the communication unit. The communication unit controls the entire processing pipeline and provides various external interfaces such as USB, Ethernet, or FireWire. These generic units are implemented on various architectures, ranging from system-on-a-chip (SoC) platforms through single-processor platforms to heterogeneous multi-processor systems. Field-programmable gate arrays (FPGAs), digital signal processors, and/or microprocessors are popular computing platforms for smart camera implementations.
FIGURE 22.1 Generic architecture of a smart camera: the image sensor (CMOS or CCD) feeds the sensing unit (sensor control, preprocessing), which passes the captured images to the processing unit (image analysis, video compression); the processing unit delivers abstracted data to the communication unit (external interfaces: USB, Ethernet, WLAN, FireWire).
The main design goals for smart cameras are to provide sufficient processing power and fast memory for processing the images in real time while keeping power consumption low. Smart cameras deliver some abstracted data of the observed scene, and it is natural that the delivered abstraction depends on the camera’s architecture and application. Almost all smart cameras currently deliver a different output. They perform a variety of image-processing algorithms such as motion detection, segmentation, tracking, and object recognition, and typically deliver color and geometric features, segmented objects, or high-level decisions such as wrong-way drivers or suspect objects. The abstracted results may be transferred either within the video stream (e.g., by color coding) or as a separate data stream. Note that the onboard computing infrastructure of smart cameras is often exploited to perform high-level video compression and only transfer the compressed video stream.
22.3 DISTRIBUTED SMART CAMERAS

Visual sensors provide a huge amount of data on a single scene. However, in many cases a single view is not sufficient to cover a region of interest. Parts of a scene may be occluded because of spatial constraints, and the area covered by a single camera may also be very limited. Therefore, multiple camera installations are required to observe a certain region of interest. By providing different views, distributed vision has the potential to realize many more complex and challenging applications than single-camera systems. Most installations today follow a centralized architecture where huge amounts of processing power are provided in the back office for image processing and scene analysis (e.g., [9, 10]). However, processing the images from multiple sensors on a central host has several drawbacks. First, the communication costs are very high. Using analog CCTV cameras requires dedicated high-bandwidth wiring from each camera to the back office and digitization of the analog images. But digital cameras connected via standard Ethernet and communicating via the Internet Protocol (IP) also require plenty of bandwidth to transfer the raw images. Encoding the video data is often not an option because of the loss in quality and added artifacts that render the decoded images useless for further analysis. Another issue of centralized systems is scalability. The main limiting factors are (1) the communication bandwidth that can be handled in the back office, and (2) the processing power required for analyzing the images of dozens of cameras. Smart cameras will be key components in future distributed vision systems, and they promise to overcome the limitations of centralized systems. Distributed computing offers greater flexibility and scalability than centralized systems do. Instead of processing the accumulated data on a dedicated host, scene analysis is distributed within the smart camera network. Individual cameras must therefore collaborate on certain high-level tasks (e.g., scene understanding, behavior analysis). Low-level image processing is done on each camera. Collaboration among cameras is founded on abstract descriptions of the scenes provided by the individual cameras. Fault tolerance is another aspect in favor of a distributed architecture. Because the reliability of a centralized system depends on one or a few components in the back office, the whole system may break down because of a failure in a single component. Distributed smart cameras, in contrast, may degrade gracefully. If a single camera fails, a certain view of the scene will not be available, but the other cameras may compensate.
The distributed architecture also influences the communication infrastructure. Centralized systems demand high-bandwidth links from the cameras to the central processing host. Smart camera networks, in contrast, communicate in a peer-to-peer manner, and the bandwidth requirements are significantly lower because, instead of raw image data, only abstract information is exchanged. This further allows the incorporation of cameras that are connected wirelessly. However, in some application domains (e.g., video surveillance) it is still necessary to archive the acquired video footage, which is typically done on a central storage server. Still, this demands significantly less bandwidth compared to systems based on centralized processing because the archived video is usually of a lower resolution and a lower frame rate. Moreover, archiving is done only in case of certain events and for a short period of time.
22.3.1 Challenges of Distributed Smart Cameras

The development of distributed smart cameras poses several challenges. Although each camera observes and processes the images of a very limited area, high-level computer vision algorithms require information from a larger context. Thus, it is important to partition the algorithms so that intermediate results can be exchanged with other cameras to allow collaboration on certain tasks. Lin et al. [11] describe the partitioning of their gesture recognition algorithm for use in a peer-to-peer camera network. The segmented image, contour points, or ellipse parameters after an ellipse-fitting step can be used to recognize the gesture when a person is observed by two cameras. Collaboration can also take place at lower levels. Dantu and Joglekar [12] investigated collaborative low-level image processing, such as smoothing, edge detection, and histogram generation. Each sensor first processes its local image region and the results are then merged hierarchically. Most computer vision algorithms need to know the parameters of the camera. In a smart camera network each camera also needs to know the position and orientation of the other cameras (at least of those in the immediate vicinity or those observing the same scene) in order to collaborate on image analysis. Calibrating each camera manually is possible (see [13, 14]), but this requires much time and effort. Adding a new camera or changing the position or orientation of a camera necessitates updating the calibration. In an autonomous distributed system it is much more appropriate for smart cameras to obtain their position (at least relative to neighboring cameras) and orientation by observing their environment and the objects moving within the scene (e.g., [15–18]). In a smart camera network consisting of dozens to hundreds of cameras, it is tedious, if not impossible, to manually assign each camera certain tasks. Often it is necessary to adapt a given allocation of tasks if changes occur in the network (e.g., adding or removing cameras) or in the environment. It would be more appropriate to assign tasks to the smart camera network as a whole, possibly with some constraints; the cameras then organize themselves, that is, form groups for collaboration and assign tasks to certain cameras or camera groups. Reconfiguration due to changes in the environment can also be self-organizing. Smart camera networks are basically heterogeneous distributed systems. Each camera comprises different types of processors and, within the network as a whole, various smart cameras may be deployed. This makes the development of applications for distributed smart cameras very challenging. Substantial system-level software, therefore, would
strongly support this implementation [19]. On the one hand, the system-level software has to provide a high-level programming interface for applications executed by the smart cameras. The most important part of the application programming interface is a suitable abstraction of the image-processing unit. An application has to be able to interact uniformly with the image-processing algorithm regardless of the underlying smart camera platform. On the other hand, the system-level software should simplify the development of distributed applications for smart camera networks. Implementing the networking functionality as part of the system-level software is the foundation for the collaboration of various applications on different smart cameras.
22.3.2 Application Development for Distributed Smart Cameras

Developing applications for a network of distributed smart cameras requires profound knowledge in several disciplines. First, algorithms for analyzing image data have to be developed or adapted to specific needs. This requires knowledge of computer vision as well as algorithmic understanding. The next step is to bring these algorithms to the embedded smart camera platform, often with real-time constraints in mind (e.g., operating at 25 fps), which demands deep knowledge of the underlying hardware and its capabilities as well as of the available resources in order to optimize the implementation. A set of algorithms is then selected and encapsulated within the application logic to put together a specific application. From this description it is obvious that at least three different roles are involved in application development: (1) algorithm developer, (2) framework developer (platform expert), and (3) application developer (system integrator). For this reason, a solid middleware with well-defined interfaces between the application developer and the algorithm developer greatly enhances application development for smart camera networks. Reduced time to market and improved software quality are also beneficial consequences.
22.4 EMBEDDED MIDDLEWARE FOR SMART CAMERA NETWORKS

Middleware is system-level software that resides between the applications and the underlying operating systems, network protocol stacks, and hardware. Its primary functional role is to bridge the gap between application programs and the lower-level hardware and software infrastructure in order to make it easier and more cost effective to develop distributed systems [20]. Originally, middleware implementations were targeted at general-purpose applications with the primary goal of simplifying distributed application development; real-time considerations or resource limitations were not an issue. Nowadays, however, embedded systems also make use of middleware, which imposes additional constraints and requirements on middleware design and implementation.
22.4.1 Middleware Architecture

Middleware implementations are usually very comprehensive.
FIGURE 22.2 General-purpose middleware layers, from bottom to top: hardware device, operating system and protocols, host infrastructure layer, distribution layer, common middleware services, domain-specific middleware services, and applications. (Adapted from [21].)
They not only have to run on different hardware platforms and support various communication channels and protocols, but they also must bridge applications running on different platforms, possibly in different programming languages, into a common distributed system. In order to support software flexibility on different levels, a layered architecture is often used; this is the case for middleware implementations. A very general partitioning into different layers of abstraction is given by Schmidt [21] (see Figure 22.2). The operating system, along with its hardware drivers, concurrency mechanisms, and communication channels, is the basis of each middleware. It contains drivers for the underlying hardware platform and provides basic mechanisms for accessing the devices as well as concurrency, process and thread management, and interprocess communication. The host infrastructure layer encapsulates the low-level system calls in reusable modules and enhances communication and concurrency mechanisms. It also hides nonportable aspects of the operating system and is the first step toward a portable and platform-independent middleware. The interface provided to higher layers is usually object oriented. Examples of this layer are the Java Virtual Machine (JVM) and .NET’s Common Language Runtime. The distribution layer integrates multiple hosts in a network into a distributed system and defines higher-level models for distributed programming. It enables developers to program distributed applications much like standalone applications. Examples include Sun’s Remote Method Invocation (RMI) in Java and CORBA. The common middleware services layer augments the underlying distribution layer by defining domain-independent components and services that can be reused in applications and thus simplify development. Such components provide, for example, database connection pooling, threading, and fault tolerance, as well as services common to distributed applications such as logging and global resource management.
The domain-specific layer provides services to applications of a particular domain (e.g., e-commerce or automation). These services are also intended to simplify application development. The highest level of this architecture is the application layer. Individual applications for a distributed system are implemented using services provided by the lower layers, especially the domain-specific layer and the common middleware services layer.
22.4.2 General-Purpose Middleware

In general-purpose computing, different middleware implementations have evolved during the last decades. Probably the most prominent middleware standard is OMG’s Common Object Request Broker Architecture (CORBA) [22], a distributed object system that allows objects on different hosts to interoperate across the network. CORBA is designed to be platform independent and not constrained to a certain programming language. An object’s interface is described in a more general interface description language (IDL), which is then mapped to a programming language’s native data types. While the CORBA specification is very comprehensive and heavyweight, Real-Time CORBA (RT-CORBA) and Minimum CORBA have been specified [23, 24] for resource-constrained real-time systems. Schmidt et al. implemented the RT-CORBA specification with “TAO” [25]. Another middleware for networked systems is Microsoft’s Distributed Component Object Model (DCOM) [26, 27], which allows software components to communicate over a network via remote instantiation and method invocation. Unlike CORBA, which is designed for platform and operating system independence, DCOM is implemented primarily on the Windows platform. Java Remote Method Invocation (RMI) [28], promoted by Sun, follows a similar approach. RMI allows invocation of an object method in a different JVM, possibly on a different host, thus simplifying the development of distributed Java applications. Java’s integrated object serialization and marshaling mechanism allows even complex objects to be used for remote method invocation. Although RMI is limited to the Java programming language, it is more flexible than DCOM because Java is available for several platforms.
22.4.3 Middleware for Embedded Systems

Because embedded systems are becoming more and more distributed, some form of middleware would greatly support application development for networked embedded devices. Wireless sensor networks are inherently distributed systems in which individual sensors have to collaborate; however, the resources and capabilities of the individual sensors are very limited, and typically only a couple of scalar values are sensed. The requirements for middleware on wireless sensor networks are also significantly different from those in general-purpose computing. These middleware systems focus on reliable services for ad hoc networks and on energy awareness [29]. Molla and Ahmed [30] surveyed recent research on middleware for wireless sensor networks. They found that most implementations are based on TinyOS [31], a component-oriented, event-driven operating system for sensor nodes (motes). Several interesting approaches have been implemented and evaluated. The spectrum ranges from
a virtual machine on top of TinyOS, hiding platform and operating system details, to more data-centric approaches for data aggregation (shared tuple space) and data query. Agilla [32] and In-Motes [33], for example, use an agent-oriented approach, in which agents implement the application logic in a modular and extensible way and can migrate from one mote to another. Cougar [34] and TinyDB [35] follow the data-centric approach, integrating all sensor network nodes into a virtual database system where the data is stored distributively among several nodes.
22.4.4 Specific Requirements of Distributed Smart Cameras

Compared to the middleware systems described up to now, a middleware for distributed smart cameras has to fulfill significantly different requirements. This is not merely due to different resource constraints but is also a consequence of the application domain of smart camera networks. In general-purpose computing, platform independence is a major issue. For this reason, several layers of indirection encapsulate the platform dependencies and provide high-level interfaces. General-purpose middleware implementations are, therefore, resource consuming and introduce a noticeable overhead. Wireless sensor networks, on the other hand, have very tight resource limitations in terms of processing power and available memory, and middleware implementations have to cope with these circumstances. Typical embedded smart camera platforms, as presented in Section 22.2, lie between general-purpose computers and wireless sensor nodes in terms of available resources, so middleware for smart camera networks must find a trade-off between platform independence, programming language independence, and the overhead it introduces. Distributed smart cameras are intended for processing captured images close to the sensor, which requires sophisticated image-processing algorithms. Support from the middleware is necessary in order to simplify application development and the integration of image-processing tasks into the application. Moreover, monitoring of the resources used by the individual image-processing algorithms is required. Communication in wireless sensor networks is relatively expensive compared to processing and is thus used sparingly (e.g., in case of certain events or to send aggregated sensor data to a base station). Collaboration of individual nodes is typically inherent to the application. Smart camera networks demand higher communication bandwidth for sending regions of interest, exchanging abstract features extracted from the images, or even streaming the video data. Typical camera system surveillance applications comprise several hundred to thousands of cameras (e.g., in airports or train stations), and many different tasks have to be executed. Because assigning those tasks manually to the cameras is almost impossible, the idea is that a user simply defines a set of tasks that has to be carried out (e.g., motion detection, tracking of certain persons) together with some restrictions and rules for what to do in case of an event. The camera system itself then allocates the tasks to cameras, taking into account the restrictions and rules imposed by the user. In some cases a task cannot be performed by a single camera, so individual cameras have to organize themselves and collaborate on it. Having a single point of control in such a self-organizing system is discouraged in favor of distributed and decentralized control.
Given the requirements for camera systems, it is obvious that a different kind of middleware is necessary. Middleware for wireless sensor networks is not intended to cope with advanced image-processing tasks and sending large amounts of data. Adapting general-purpose middleware, on the other hand, is feasible and is able to fulfill the given requirements, although the introduced overhead does not yield efficient resource utilization.
22.5 THE AGENT-ORIENTED APPROACH

Agent-oriented programming (AOP) has become more and more prominent in software development in the last few years. The AOP paradigm extends the well-known object-oriented programming (OOP) paradigm and introduces active entities called agents. This section provides a short introduction to mobile agent systems, along with their use in embedded devices and especially in distributed smart cameras.
22.5.1 From Objects to Agents

Agent systems are a common technology for developing general-purpose distributed applications. AOP is used in many application domains such as electronic commerce [36, 37], information management [38], and network management [39, 40], to name just a few. Although agents are used in various domains, there exists no common definition for them. Possibly, AOP’s widespread use makes a definition difficult. Agents are ascribed different properties depending on their use. However, for our discussion a rather general definition, adopted from [41], is used:

An agent is a software entity that is situated in some environment and that is capable of autonomous actions in this environment in order to meet its design objectives.
The most important property of an agent, common to all definitions, is autonomy. This is also its fundamental distinction from object-oriented programming. AOP can be seen as an extension of OOP, in which the main entity is an object. Objects are used to represent logical or real-world entities. An object consists of an internal state, stored in member variables, and corresponding methods to manipulate that state. Hence, objects are passive entities because their actions have to be triggered from the outside. Agents extend objects in that they are capable of performing autonomous actions; in other words, they can be described as proactive objects. As stated in the definition given previously, agents are situated in an environment. This environment, usually called an agency, provides the required infrastructure for the agents, including, for example, communication and naming services. Some requirements for an agency are discussed in [42]. Using a well-defined agency guarantees that agents are able to interact with the outside world as well as with other agents in a uniform manner. A minimal set of services is therefore defined in the MASIF standard [43, 44] and the FIPA-ACL standard [45]. When an agency conforms to one of these standards, interoperability with other agencies complying with the same standard is ensured.
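The distinction between a passive object and a proactive agent can be sketched in a few lines. The sketch below illustrates the concept only; it is not the interface of any particular agent platform, and all class and method names are invented for the example. The agent owns its own thread of control and acts on its environment without being called from outside.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// A passive object: it acts only when somebody calls its methods.
class Counter {
public:
    void increment() { ++value_; }
    int  value() const { return value_; }
private:
    int value_ = 0;
};

// A minimal "agent": it carries its own thread of control and pursues its goal
// autonomously until told to stop. Real agencies add naming, communication,
// and migration services on top of this basic idea.
class CountingAgent {
public:
    void start() {
        worker_ = std::thread([this] {
            while (running_.load()) {
                counter_.increment();                       // autonomous action
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }
    void stop() { running_.store(false); worker_.join(); }
    int  observed() const { return counter_.value(); }
private:
    Counter counter_;
    std::atomic<bool> running_{true};
    std::thread worker_;
};

int main() {
    CountingAgent agent;
    agent.start();                                          // agent acts on its own
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    agent.stop();
    std::printf("agent performed %d actions without being asked\n", agent.observed());
    return 0;
}
```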
Practical applications of agent systems consist of anywhere from two to several dozen agents, distributed among several environments in a network. The agents can communicate with each other regardless of the environment they inhabit and may collaborate on a certain task. Agent systems are thus well suited to distributed computing, as they distribute both the data and the computation within the network. Moreover, if it is not desired or feasible to have a central point of control in a system, agents can be used to realize decentralized systems where control is spread among several hosts in a network.
22.5.2 Mobile Agents Agents, as described up to now, reside in the same agency during their lifetime. Enhancing the agent-oriented approach with mobility, however, makes this paradigm much more powerful and more flexible because the agents are then able to move from one agency to another. Making agents mobile allows, in certain situations, a significant reduction in network communication. Furthermore, the execution time of certain tasks can be reduced by exploiting this mobility. If, for example, an agent requires a specific resource that is not available on its current host, it has two options: to use remote communication to access the resource or to migrate to the host on which the resource is available. Which option to choose depends on whether remotely accessing the resource is possible and on the costs of remote versus local interaction. Prominent representatives of mobile agent systems are D'Agents [46], Grasshopper [47], Voyager [48], and dietAgents [49, 50], among others.
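The migrate-or-communicate decision can be viewed as a simple cost trade-off. The toy sketch below makes this explicit; the cost model and all numbers are purely illustrative assumptions, not figures from any of the systems mentioned above.

    #include <cstddef>
    #include <iostream>

    // Illustrative cost model: all fields and figures are assumptions.
    struct ResourceAccessPlan {
        std::size_t agent_size_bytes;    // serialized agent (state + possibly code)
        std::size_t request_bytes;       // size of one remote request/response pair
        std::size_t expected_requests;   // how often the resource will be accessed
    };

    // Returns true if migrating to the resource's host is expected to be cheaper
    // than issuing all requests remotely.
    bool should_migrate(const ResourceAccessPlan& p) {
        const std::size_t remote_cost  = p.request_bytes * p.expected_requests;
        const std::size_t migrate_cost = p.agent_size_bytes;   // one-time transfer
        return migrate_cost < remote_cost;
    }

    int main() {
        ResourceAccessPlan plan{50 * 1024, 2 * 1024, 100};      // hypothetical values
        std::cout << (should_migrate(plan) ? "migrate\n" : "access remotely\n");
    }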
22.5.3 Code Mobility and Programming Languages Concerning the implementation of mobile agent systems, agent mobility places some requirements on the programming language. Agent systems are typically used on a variety of hardware platforms and operating systems; even within a single network different hosts may be present. Mobile agents, however, must be able to migrate to any host within the system, which means that their code has to be executable on all hosts. Therefore, although early mobile agent systems were implemented using scripting languages, in the last couple of years languages based on intermediate code (i.e., Java and .NET) have become preferred. Programming languages that are compiled to native code, such as C or C++, are hardly ever used for implementing mobile agent systems. Scripting languages allow execution of the same code on different platforms without the need to recompile. Platform independence is realized by providing an interpreter or an execution environment that executes the code. As a consequence, the performance of interpreted code is poor. Java and .NET overcome this limitation by exploiting just-in-time compilers that generate native code before execution, which brings a significant performance improvement. Another aspect of mobile agents is their migration from one agency to another. This basically requires the following steps:
1. Suspend the agent and save its current internal state and data.
2. Serialize the agent (data, internal state, and possibly code).
3. Transfer the serialized agent and its data to the new host.
4. Create the agent from its serialized form.
5. Resume agent execution on the new host.
Suspending an agent and resuming it on another host is not trivial without its cooperation. Hence, two types of migration are distinguished: weak and strong [51]. Strong migration, also called transparent migration, denotes the ability to migrate both the code and the current execution state to a different host. It is supported by only a few programming languages, and thus only a few agent platforms (e.g., Telescript [52]); Java and .NET are not among them. Weak, or nontransparent, migration, in contrast, requires the cooperation of the agent. The agent has to make sure to save its execution state before migrating to another host and to resume execution from its previously saved state after the migration. Weak migration is supported by all mobile agent systems.
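Because weak migration relies on the agent's cooperation, agent platforms typically expose hooks that the agent implementer must fill in. The sketch below shows one way steps 1 through 5 might map onto such hooks; the class, the hook names, and the transfer_to_host stub are our own assumptions and do not reproduce the API of any system cited above.

    #include <string>

    // Minimal sketch of weak (nontransparent) migration hooks.
    class MobileAgent {
    public:
        virtual ~MobileAgent() = default;
        // Step 1: the agent saves whatever it needs to continue later.
        virtual std::string save_state() const = 0;
        // Step 5: the agent resumes from the state it saved before migrating.
        virtual void resume_from(const std::string& state) = 0;
    };

    // Stub standing in for the platform's transport; a real system would send
    // the payload over the network to the destination agency.
    void transfer_to_host(const std::string& /*host*/, const std::string& /*payload*/) {}

    // Steps 2-4 are typically carried out by the platform, not the agent itself.
    void migrate(MobileAgent& agent, const std::string& destination_host) {
        const std::string state = agent.save_state();   // steps 1-2: suspend + serialize
        transfer_to_host(destination_host, state);      // step 3: transfer
        // Step 4 (re-create the agent) and step 5 (resume_from) take place on
        // the destination host once the payload arrives.
    }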
22.5.4 Mobile Agents for Embedded Smart Cameras Although AOP is widely used in general-purpose computing for modeling autonomy and goal-oriented behavior, it is rather uncommon in embedded systems. The reasons are the significantly different requirements of software for embedded devices. Resources such as memory and computing power are very limited, and embedded systems often have to fulfill real-time requirements. The essentially nondeterministic behavior of autonomous agents further hinders their use in embedded systems. Nevertheless, research has shown that the AOP paradigm can enhance software development for embedded systems. Applications include process control, real-time control, and robotics (see [53–55]). Stationary agents are used to represent individual tasks within the system. Their communication structure (i.e., which agents communicate with one another) is typically fixed according to the physical circumstances they are used to model and does not change over time. Smart cameras are embedded systems, and mobile agents are well suited to managing entire smart camera networks. The ultimate goal is that these networks operate completely autonomously with no or only minimal human interaction. For example, a smart camera network controls access to a building and identifies all persons entering it. When an unknown person enters, his or her position is tracked and security staff is informed. The agent-oriented paradigm can be used to model individual tasks within the system, such as face recognition or tracking. Via agent communication, tasks on different cameras can collaborate on a certain mission. One argument against using mobile agent systems on embedded smart cameras is the overhead introduced by commonly used programming languages such as Java and .NET (see Section 22.5.3). This does not, however, rule out implementing mobile agent systems more efficiently and in a more resource-aware manner, making them applicable to embedded systems as well. Mobile-C [56], for example, is a mobile agent system implemented in C/C++. Its development was motivated by applications that require direct low-level hardware access. Moreover, this agent system conforms to the FIPA standard and extends it to support agent mobility. Although the chosen programming language is C/C++, the agent code is interpreted using the Ch C/C++ interpreter [57]. An agent's task is divided into subtasks, which are organized in a task list. Upon migration, the next task in the list is executed.
22.6 AN AGENT SYSTEM FOR DISTRIBUTED SMART CAMERAS The feasibility and applicability of the agent-oriented approach in embedded smart cameras are demonstrated in two case studies—autonomous and decentralized multi-camera tracking, and sensor fusion. In both we focus on the middleware services that simplify application development. First, a description of our agent system, DSCAgents, is given.
22.6.1 DSCAgents DSCAgents is designed for smart camera networks and is in principle suited to various hardware architectures. Its design is founded on the general architecture described in Section 22.2, which consists of a communication unit and a processing unit as well as one or more image sensors. The operating system on the smart cameras is assumed to be embedded Linux, but other POSIX-compliant systems are also feasible. For efficiency, the chosen programming language is C++, which also influences the design to some degree. The concrete implementation targets the SmartCAM platform developed by Bramberger et al. [3]. The overall architecture of a smart camera network and its mobile agent system is depicted in Figure 22.3. The smart cameras are connected via wired (or possibly wireless) Ethernet, and each camera hosts an agency—the actual runtime environment for the mobile agents. The agency is situated on the general-purpose processor of the communication unit. Mobile agents represent the image-processing tasks to be executed on the smart camera network.
Software Architecture Because the underlying hardware architecture is in general a multi-processor platform, the software architecture has to reflect this. The host processor handles the communication tasks and thus executes the agent system; the processing unit is devoted to image processing. Figure 22.4 depicts the software architecture of our middleware.
FIGURE 22.3 Architecture of a smart camera network with a mobile agent system.
FIGURE 22.4 Software architecture: (a) Host processor, (b) Processing unit.
The software architecture on the communication unit can be partitioned into layers, as described in Section 22.4.1. The Linux kernel used as the operating system on the host processor already comprises device drivers for a great number of hardware components, most importantly networking and the busses connecting additional components. A custom device driver is used to manage the digital signal processors (DSPs) on the processing unit and to exchange messages between the processors. The host abstraction layer basically consists of the ADAPTIVE Communication Environment (ACE) and the SmartCAM framework. ACE [58] is a reusable C++ framework that provides a portable and lightweight encapsulation of several communication mechanisms, network communication, and inter- and intra-process communication. The DSPLib provides an object-oriented interface to the processing unit, which allows loading and unloading of executables on the processing unit as well as the exchange of messages between algorithms on the processing unit and applications on the host processor. The network layer is based on the ACE framework and basically provides mechanisms to establish network connections between hosts and asynchronous message-oriented communication. Different low-level protocols are supported. DSCAgents finally matches the common middleware layer. It provides a run-time environment for the agents, manages the agents' life cycle, and allows agents in different agencies to communicate with each other. A number of stationary system agents provide additional services required for building an application.
The NodeManagementAgent handles all management tasks on a smart camera. It is available under a well-known name and can also be accessed remotely. The services it provides include, among others, creating agents (local as well as remote), monitoring available resources, and obtaining information about agents in an agency. The SceneInformationAgent manages all information regarding the vision system. Depending on the actual deployment, this may include camera calibration properties, position and orientation of the camera, a list of neighboring cameras, and visual properties (e.g., image resolution and color depth). The ImageProcessingAgent is the central instance for interacting with the camera's processing unit. It is only locally accessible, preventing remote execution of image-processing tasks. Its main functionality includes loading and unloading image-processing algorithms, messaging between agents and algorithms, and providing information on available resources. On the processing unit, the DSPFramework [3, 59] is the foundation for the algorithms running on the DSPs. The operating system used is DSPBIOS, provided by Texas Instruments, which is enhanced with dynamic loading capabilities and inter-processor messaging. Additional modules, such as optional drivers and the resource manager, can be loaded dynamically during runtime. Also, the algorithms executed on the DSPs can be loaded and unloaded dynamically during operation without interrupting other algorithms.
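To illustrate the division of labor among the system agents just described, the following C++ interface sketch is one possible way their services could be expressed; the interface and method names are our own invention and do not reproduce the actual DSCAgents API.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct AgentInfo  { std::string name; std::size_t memory_bytes; };
    struct CameraInfo { std::string id; double pan, tilt; std::vector<std::string> neighbors; };

    // Management services, available under a well-known name (also remotely).
    class NodeManagementAgentInterface {
    public:
        virtual ~NodeManagementAgentInterface() = default;
        virtual std::string create_agent(const std::string& type,
                                         const std::string& agency) = 0;  // local or remote
        virtual std::vector<AgentInfo> list_agents() const = 0;
        virtual std::size_t free_memory_bytes() const = 0;
    };

    // Vision-system information (calibration, neighbors, image properties).
    class SceneInformationAgentInterface {
    public:
        virtual ~SceneInformationAgentInterface() = default;
        virtual CameraInfo camera_info() const = 0;
        virtual std::vector<std::string> neighboring_cameras() const = 0;
    };

    // Gateway to the processing unit; only locally accessible.
    class ImageProcessingAgentInterface {
    public:
        virtual ~ImageProcessingAgentInterface() = default;
        virtual int  load_algorithm(const std::vector<std::uint8_t>& dsp_executable) = 0;
        virtual void unload_algorithm(int handle) = 0;
        virtual void send_to_algorithm(int handle, const std::vector<std::uint8_t>& msg) = 0;
    };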
Agent Mobility DSCAgents supports agent mobility. That is, agents can move from one camera to another. Unfortunately, transparent migration is hard to implement in C++, so we chose weak migration based on remote cloning. Agent migration thus involves the following steps:
1. The originating agent saves its internal state in a serializable form.
2. A new agent is created on the destination agency.
3. The new agent initializes itself with the initial state from the originating agent.
4. The originating agent is destroyed.
Steps 1 and 3 require the agents' cooperation; in other words, they have to be implemented by the application developer. Steps 2 and 4 are handled by the agent system. It is not mandatory that the agent's code be available in the destination agency. If it is not available, the agent code can be loaded as a dynamic library, which is provided in an agent repository on the network. This allows new agent types to be deployed during operation of the camera network. An agent comprises its image-processing algorithms in the form of a dynamic executable that is loaded onto and unloaded from the image-processing unit as needed. Hence, the image-processing algorithms are also flexible and can be modified during runtime; even new image-processing algorithms can be added to the system after deployment.
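Loading a previously unknown agent type as a dynamic library can be realized with the standard POSIX dynamic-loading interface available on embedded Linux. The sketch below is a generic illustration under the assumption that every agent library exports a C-style factory function named create_agent; this naming convention and the simplified MobileAgent base class are ours, not DSCAgents internals.

    #include <dlfcn.h>     // POSIX dynamic loading: dlopen, dlsym, dlerror
    #include <memory>
    #include <stdexcept>
    #include <string>

    // Simplified agent base class standing in for the middleware's own type.
    class MobileAgent {
    public:
        virtual ~MobileAgent() = default;
        virtual void run() = 0;
    };

    // Assumed convention: every agent library exports
    //     extern "C" MobileAgent* create_agent();
    using AgentFactory = MobileAgent* (*)();

    std::unique_ptr<MobileAgent> load_agent_from_library(const std::string& path) {
        void* handle = dlopen(path.c_str(), RTLD_NOW);
        if (!handle)
            throw std::runtime_error(std::string("dlopen failed: ") + dlerror());

        // dlsym returns void*; casting it to a function pointer is the usual idiom.
        void* symbol = dlsym(handle, "create_agent");
        if (!symbol)
            throw std::runtime_error(std::string("dlsym failed: ") + dlerror());

        auto factory = reinterpret_cast<AgentFactory>(symbol);
        return std::unique_ptr<MobileAgent>(factory());
        // The library handle is deliberately kept open for the agent's lifetime;
        // a complete implementation would track it and call dlclose later.
    }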
Evaluation DSCAgents' implementation targets embedded systems and thus has to use scarce resources such as memory and processing power sparingly. Table 22.1 lists the code size of DSCAgents and its modules (stripped, cross-compiled binaries). Its total consumption of permanent memory, including all libraries, is less than 3.5 MB.
Table 22.1 Memory Consumption

Libraries             Consumption
ACE                   2060 kB
Boost                  442 kB
SmartCam Framework     178 kB
Library total         2680 kB
DSCAgents              885 kB
Total                 3565 kB
Table 22.2 Execution Times for Frequent Operations on the Embedded Smart Camera

Operation                          Time
Create agent (local)               4.49 ms
Create agent (remote)             10.11 ms
Load image-processing algorithm   17.86 ms
Compared to other agent systems, such as DietAgents, Grasshopper, or Voyager, which require 20 MB and more (this includes the Java Virtual Machine and the class path), DSCAgents is fairly lightweight. A large portion of the code size is contributed by the ACE library, which not only abstracts networking functionality but contains several other components. When only the networking capabilities are required, an optimized version of this library may be compiled without the unnecessary components, which further decreases the code size. DSCAgents requires approximately 2.5 MB of RAM after startup. When additional agents are created, memory consumption increases: when an agency comprises 50 agents—a fairly large number for a smart camera network—memory consumption can reach up to 3.3 MB. Of course, this heavily depends on how much memory the agent itself allocates, but for the purpose of this evaluation an agent was used that allocates no additional memory. Note that in this case each agent has its own thread of execution, which accounts for the greater part of the allocated memory. Table 22.2 summarizes the execution times for agent creation, agent migration, and the loading of an image-processing algorithm. Creating a new agent on our camera takes about 4.5 ms locally and 10.1 ms on a remote agency. Regarding the mobility of agents, it is interesting to note how long it takes to move an agent from one camera to another. Since DSCAgents uses remote cloning, the time for agent migration is the same as for remote agent creation. Loading an image-processing algorithm from an agent takes approximately 18 ms. This includes the time required to send the executable through the DSPAgent to one of the DSPs. Figure 22.5 plots the average transmission time of agent messages against the message size. Note the logarithmic scale of the x- and y-axes. The presented values are the average of 20 runs.
FIGURE 22.5 Average messaging time.
Messaging between agents on the same camera can be very fast and is independent of the message size because it does not require sending the message over the network. Messaging between agents on different hosts is somewhat slower, and it also depends on the message size. For small messages (about 1000 bytes, the interface's MTU), messaging times are nearly constant. As the message size increases, it takes longer for the sending agent to receive an acknowledgment. The time required for larger messages grows linearly with the message size.
22.6.2 Decentralized Multi-Camera Tracking We have implemented an autonomous decentralized tracking method that follows the so-called tag-and-track approach. This means that not all moving objects within the monitored area are tracked but only a certain object of interest. Furthermore, the tracking task is executed only on the camera that currently sees the target; all other cameras are unaffected. The basic idea is to virtually attach a tracking instance to the object of interest. The tracking instance (i.e., the agent) then follows its target in the camera network from one camera to another. Agent mobility inherently supports this highly dynamic tracking approach and furthermore allows the use of different tracking and hand-off strategies as well as different tracking algorithms, even after deployment of the whole system. Figure 22.6 illustrates the position of the target together with its tracking instance over time. A more elaborate discussion is given in Quaritsch et al. [60].
Tracking Application The target is followed from one camera to the next without a central control instance. Moreover, the camera topology is stored in a distributed manner in the cameras' SceneInformationAgents. Neighborhood relations are represented by so-called migration regions.
FIGURE 22.6 Basic decentralized object tracking: the position of the target and its tracking instance at successive times t0 < t1 < t2.
A migration region is basically a polygon in 2D image coordinates that is assigned a list of neighboring cameras and a motion vector for distinguishing different directions. Hence, the migration regions define a directed graph representing the neighborhood relations among cameras. The tracking instance is made up of a tracking algorithm for following the position of a moving object in a single view and a mobile agent containing the application logic. A strict separation between the tracking algorithm and the application logic allows, on the one hand, an agent to be equipped with different algorithms for object tracking depending on the application or on environmental conditions. On the other hand, different strategies for following an object within the smart camera network can be implemented using the same tracking algorithm.
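One possible representation of migration regions and the resulting neighborhood graph is sketched below. The structure and field names are illustrative assumptions rather than the data structures actually used by the SceneInformationAgent; the point-in-polygon test is the standard even-odd (ray casting) method.

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Point2D { double x, y; };                 // 2D image coordinates

    // A migration region: a polygon in the image plane annotated with the
    // cameras a target is expected to appear on next and a motion vector that
    // distinguishes the direction of travel.
    struct MigrationRegion {
        std::vector<Point2D> polygon;
        Point2D motion_vector;                       // expected direction of exit
        std::vector<std::string> neighbor_cameras;   // edges of the directed graph
    };

    // Per-camera neighborhood information; the union over all cameras forms a
    // directed graph of hand-off relations.
    struct CameraNeighborhood {
        std::string camera_id;
        std::vector<MigrationRegion> regions;
    };

    // Even-odd (ray casting) point-in-polygon test, used to decide whether the
    // tracked target has entered a migration region.
    bool inside(const std::vector<Point2D>& poly, Point2D p) {
        bool in = false;
        for (std::size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
            const bool crosses = (poly[i].y > p.y) != (poly[j].y > p.y);
            if (crosses &&
                p.x < (poly[j].x - poly[i].x) * (p.y - poly[i].y) /
                          (poly[j].y - poly[i].y) + poly[i].x)
                in = !in;
        }
        return in;
    }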
Target Hand-off The most crucial part of our multi-camera tracking approach is the target hand-off from one camera to the next. The hand-off requires the following basic steps:
1. Select the cameras where the target may appear next.
2. Migrate the tracking instance to those cameras.
3. Reinitialize the tracking task.
4. Redetect the object of interest.
5. Continue tracking.
Identification of potential cameras for the hand-off uses the neighborhood relations as discussed previously. The next two steps of the hand-off are managed by the mobile agent system. Object redetection and tracking are then continued on the new camera. The tracking agent may use different strategies for the target hand-off [61]. We use the master–slave approach, which has the major benefit that the target is observed for as long as possible. During hand-off, there exist two or more tracking instances dedicated to one object of interest. The tracking instance that currently has the target in its field of view is called the master. When the target enters a migration region, the master initiates creation of the slaves on all neighboring cameras. The slaves in turn reinitialize the tracking algorithm with the information passed from the master and wait for the target to appear. When the target enters the slave’s field of view, the slave becomes the
new master while the old master and all other slaves terminate. A sophisticated hand-off protocol is used to create the slaves on the neighboring cameras, elect the new master, and terminate all other tracking instances after successful target hand-off. The hand-off protocol minimizes the time needed to create the slaves and also handles undesired situations (e.g., the target does not appear on a neighboring camera, or more than one slave claims to have detected the target). Our tracking approach has been implemented and tested on tracking a human in our laboratory. Figure 22.7 illustrates target hand-off from one camera to another, showing the view of both cameras in screenshots (a) and (b) and the agents residing on the cameras in screenshot (c). Note that the views of both cameras overlap, which is not mandatory. The highlighted square in the camera's view denotes the position of the tracked person, and the rectangles on the left and right illustrate the migration regions of cameras A and B, respectively. Tracking is started on camera A. During hand-off, two tracking instances are present, one on each camera (the highlighted rectangle in screenshot (c) represents the tracking agent). After the hand-off, camera B continues the tracking.
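The control flow of the master–slave hand-off described above can be summarized in a simplified C++ sketch. The primitives it calls (e.g., create_slave_on, slave_detected_target) are assumed to be supplied by the agent system and the tracking algorithm and are only declared here; this is an illustration of the protocol logic, not the actual DSCAgents implementation.

    #include <string>
    #include <vector>

    // Assumed primitives supplied by the agent system and tracking algorithm.
    struct TrackerState { /* appearance model, last position, velocity, ... */ };
    bool   target_in_migration_region(const TrackerState&);
    std::vector<std::string> neighbors_of_current_region(const TrackerState&);
    void   create_slave_on(const std::string& camera, const TrackerState&);
    bool   slave_detected_target();          // true once this instance sees the target
    void   notify_master_and_siblings_to_terminate();

    // Control flow of one tracking instance (strongly simplified).
    enum class Role { Master, Slave };

    void tracking_step(Role& role, TrackerState& state) {
        if (role == Role::Master) {
            // The master owns the target; once the target enters a migration
            // region, it spawns slaves on all neighboring cameras.
            if (target_in_migration_region(state)) {
                for (const auto& cam : neighbors_of_current_region(state))
                    create_slave_on(cam, state);     // slaves reinitialize from state
            }
        } else {
            // A slave waits for the target; on detection it is promoted to master
            // and the old master plus all other slaves terminate.
            if (slave_detected_target()) {
                role = Role::Master;
                notify_master_and_siblings_to_terminate();
            }
        }
    }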
Evaluation When an object is followed in a smart camera network, the migration times of the tracking agent are critical. If the hand-off is too slow, the tracker may not be able to find the target, or the target may have already left the destination camera's field of view. Hence, the duration of the target hand-off is a subject of evaluation. During hand-off, three time intervals can be identified, which are quantified and summarized in Table 22.3. When the tracked object enters the migration region, it takes about 18 ms to create the slave agent on the next camera. Starting the tracking algorithm on the DSP requires 24 ms, which includes loading the dynamic executable to the DSP, starting the tracking algorithm, and reporting to the agent that the tracking algorithm is ready to run. Initialization of the tracking algorithm by the slave agent using the information obtained from the master agent takes 40 ms. The total migration time is thus about 80 ms. In case a migration region points to more than one neighboring camera, a slave has to be created on each. The master creates all of its slaves almost in parallel, exploiting asynchronous communication. Nevertheless, the number of slaves slightly increases the hand-off time (Table 22.4).
Middleware Support For the multi-camera tracking application the following middleware services are helpful and necessary:
Mobility. Because of their mission, tracking agents are inherently highly mobile. They follow their target from one camera to the next, so the middleware has to support this mobility, providing agents with a fast and reliable migration mechanism. DSCAgents supports agent mobility as a general service by means of remote cloning (see the Agent Mobility section).
Dynamic task loading. A consequence of the chosen tag-and-track approach is that only the camera currently observing the object has to execute the tracking algorithm, and the agent decides which algorithm to use.
FIGURE 22.7 Visualization of tracking between two cameras: (a) Tracker on camera A; (b) Hand-over to camera B: Subject is in migration region (highlighted square); (c) Tracker on camera B. Because the acquired image of the camera and the current position of the subject are updated at different rates, the highlighted position in image (b) is correct whereas the background image is inaccurate.
Table 22.3 Hand-off Time to a Single Camera

Create slave on neighboring camera                   18 ms
Loading dynamic executable                           24 ms
Reinitializing tracking algorithm on slave camera    40 ms
Total                                                82 ms
Table 22.4 Hand-off Time to Multiple Neighbor Cameras

Number of Neighbors    Time
1                       82 ms
2                       98 ms
3                      105 ms
4                      126 ms
The tracking algorithm is thus loaded by the agent upon arrival at a camera and unloaded upon departure. Object tracking has tight timing constraints, even during target hand-off, so loading image-processing tasks has to be fast. Dynamic task loading is considered a domain-specific service and is provided by the ImageProcessingAgent. This agent handles the entire communication with the image-processing part and also supports loading dynamic executables and the framework on the image-processing unit (see Figure 22.4).
Neighborhood relations. To follow an object over the camera network, it is crucial to know the position and orientation of the networked cameras. The middleware therefore has to manage and update information on the camera topology, preferably autonomously and in a distributed manner to keep the whole system fault tolerant. Neighborhood relations are also a domain-specific service and are provided by the SceneInformationAgent. For our evaluation, the agent reads the configuration from a file, but in a real-world deployment it should observe the activities in the scene and learn the camera parameters automatically.
22.6.3 Sensor Fusion The next step for distributed smart cameras is not only to use visual sensors but also to integrate other kinds of sensors such as infrared cameras, audio sensors, or induction loops. The intention is to obtain more reliable and more robust data while reducing ambiguity and uncertainty. In order to take advantage of these different sensors, however, it is necessary to correlate the data somehow—that is, to fuse the information from individual sensors. Much research on sensor fusion has been conducted over the last decades. Several data fusion algorithms have been developed and applied, individually and in combination, providing users with various levels of informational details. The key scientific problems,
which are discussed in the literature [62, 63], can also be correlated to the three fusion levels:
Raw-data fusion. The key problems to be solved at this level of data abstraction can be referred to as data association and positional estimation [64]. Data association is a general method of combining multi-sensor data by correlating one sensor observation set with another. Common techniques for positional estimation are focused on Kalman filtering and Bayesian methods.
Feature fusion. These approaches are typically addressed by Bayesian theory and Dempster-Shafer theory. Bayesian theory is used to generate a probabilistic model of uncertain system states by consolidating and interpreting overlapping data provided by several sensors [65]. It is limited in its ability to handle uncertainty in sensor data. Dempster-Shafer theory is a generalization of Bayesian reasoning that offers a way to combine uncertain information from disparate sensor sources. Recent methods from statistical learning theory (e.g., support vector machines [66]) have been successfully applied to feature fusion.
Decision fusion. Fusion at the decision level combines the decisions of independent sensor detection/classification paths by Boolean operators or by a heuristic score (e.g., M-of-N, maximum vote, or weighted sum). The two basic categories of classification decision are hard (a single, optimum choice) and soft (in which decision uncertainty in each sensor chain is maintained and combined into a composite measure of uncertainty). Only a few investigations of this third fusion level appear in the literature.
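As a concrete illustration of decision-level fusion, the short, self-contained example below combines per-sensor decisions with an M-of-N vote (hard fusion) and with a weighted sum over confidences (soft fusion); the thresholds and weights are arbitrary example values.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Hard decision fusion: declare a detection if at least m of the n sensor
    // chains report a detection.
    bool m_of_n(const std::vector<bool>& decisions, std::size_t m) {
        std::size_t hits = 0;
        for (bool d : decisions) hits += d ? 1 : 0;
        return hits >= m;
    }

    // Soft decision fusion: each chain keeps its uncertainty as a confidence in
    // [0,1]; a weighted sum is compared against a threshold.
    bool weighted_sum(const std::vector<double>& confidences,
                      const std::vector<double>& weights, double threshold) {
        double score = 0.0;
        for (std::size_t i = 0; i < confidences.size(); ++i)
            score += weights[i] * confidences[i];
        return score >= threshold;
    }

    int main() {
        std::vector<bool>   hard = {true, false, true};   // e.g., three sensor chains
        std::vector<double> soft = {0.9, 0.4, 0.7};
        std::vector<double> w    = {0.5, 0.2, 0.3};       // example weights
        std::cout << "2-of-3 vote: " << m_of_n(hard, 2) << "\n";
        std::cout << "weighted sum: " << weighted_sum(soft, w, 0.6) << "\n";
    }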
The Fusion Model In the I-SENSE project [67] we investigated distributed sensor fusion performed in a network of embedded sensor nodes. Our fusion model supports fusion at multiple levels (raw data, feature, and decision). It also considers the data flow in the sensor network as well as resource restrictions on the embedded sensor nodes. The fusion model describes the functionality of distributed fusion and consists basically of a set of communicating tasks located on different nodes within the network. A directed acyclic graph G is used to represent the fusion tasks (nodes) as well as the data flow between them (edges). Each node has some properties describing the resource requirements of the task, and the edges indicate the required communication bandwidth between two tasks. Figure 22.8 is a simple example of a fusion model. The sensor tasks (e.g., V1, ..., V4, A1, A2) make up the bottom layer of the fusion tree. These tasks acquire data from the environment independently from other tasks. Fusion tasks (F1 and F2) and filter tasks (e.g., F3, F4, F5, ...) form the higher layers. Fusion tasks fully depend on data from other fusion tasks or from sensor tasks to produce an output. On receiving new input data, they must ensure temporal alignment and calculate the output vector. Filter tasks are similar to fusion tasks except that they have only one input channel. Raw-data fusion takes place in the lower layers of the fusion graph (F1 or F2) because their input is the raw data of the sensor tasks. Feature fusion takes place in the middle layers: Features from the fusion tasks are fused to an overall feature vector.
FIGURE 22.8 A simple fusion model.
Decision fusion is done in the upper layers of the fusion graph. Either features from a previous fusion stage are used to generate a decision (e.g., a classification) or multiple weak decisions are used to calculate a more reliable decision. The border between feature fusion and decision fusion in the fusion graph is not strict but depends on the current application. We evaluated the presented fusion model on vehicle classification, exploiting visual and acoustic data. We collected a database consisting of about 4100 vehicles, which were either cars, small trucks, or large trucks. The vision-only classifier predicts vehicle types very well (95.19 percent), but it has problems distinguishing between small and large trucks. Quite similar results are obtained when using acoustic features only. Fusing data from both sensors, however, increases the classification performance for all three vehicle classes [67].
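The fusion model itself can be captured by a small graph structure. The sketch below is one possible representation of tasks, their resource requirements, and the bandwidth-annotated edges of the directed acyclic graph; the field names and the example numbers are chosen for illustration and are not taken from the I-SENSE implementation.

    #include <cstddef>
    #include <string>
    #include <vector>

    // A node of the fusion graph: sensor task, filter task, or fusion task.
    struct FusionTask {
        enum class Kind { Sensor, Filter, Fusion } kind;
        std::string name;               // e.g., "V1", "A1", "F3"
        std::size_t cpu_load_percent;   // resource requirements of the task
        std::size_t memory_kb;
    };

    // A directed edge: data flows from producer to consumer and needs bandwidth.
    struct DataFlow {
        std::size_t producer;           // index into tasks
        std::size_t consumer;
        std::size_t bandwidth_kbps;     // required communication bandwidth
    };

    struct FusionModel {
        std::vector<FusionTask> tasks;
        std::vector<DataFlow>   flows;  // must form a directed acyclic graph
    };

    // A fragment resembling Figure 22.8: two video sensor tasks feeding a
    // raw-data fusion task F1, whose output is consumed by a filter task F3.
    FusionModel example_model() {
        FusionModel m;
        m.tasks = {
            {FusionTask::Kind::Sensor, "V1", 10, 128},
            {FusionTask::Kind::Sensor, "V2", 10, 128},
            {FusionTask::Kind::Fusion, "F1", 35, 512},
            {FusionTask::Kind::Filter, "F3", 15, 256},
        };
        m.flows = { {0, 2, 800}, {1, 2, 800}, {2, 3, 64} };
        return m;
    }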
Middleware Support In our fusion approach, the individual fusion tasks are distributed among the nodes and can be reallocated from one node to another. Also, the communication structure between individual fusion tasks may change over time. Implementing such a general sensor fusion approach therefore benefits from having a substantial middleware to provide fundamental services.
Sensor interfaces. Each platform is equipped with a different set of sensors. Although the sensing tasks have to know how to interpret the data from the corresponding sensors, the middleware has to provide a general interface for each sensor class (visual and audio). Thus, there is one interface for (still) image data, regardless of spectral range (i.e., gray scale, color, infrared), one interface for acoustic sensors, and so on. An image-processing task simply uses the interface for the type of sensor data it processes. This also makes the integration of new sensors easier because it is not necessary to modify each sensing task but only to include the device driver in the middleware. The sensor interfaces are part of the framework on the processing unit
22.7 Conclusions and are provided by optional drivers.These drivers are loaded by the middleware from the host processor according to the platform’s capabilities. Connecting the sensors and the algorithms follows the publish–subscribe design pattern [68]. Dynamic task loading. A flexible framework for sensor fusion must be able to reallocate a task executed on one node to another. Hence, it is necessary not only to load and start tasks dynamically during runtime on the embedded platform without interrupting other tasks but also to stop certain tasks. This domain-specific service is provided by the ImageProcessingAgent together with the framework on the image-processing unit. Resource monitoring. Resource monitoring has to keep track of all resources consumed by the image-processing tasks, particularly memory, processing power, DMA channels, and communication bandwidth. Fusion tasks can be allocated to a certain node only if there are sufficient resources available. Time synchronization. Distributed sensor data fusion implies a uniform time base for all nodes. Without a system-wide synchronized clock it would be impossible to combine results from different sensors. Therefore, the middleware has to provide a service that keeps all the nodes as well as the individual processors on a single platform synchronized.A dedicated agent provides this service. For our evaluation the network time protocol (NTP) was sufficient, but other synchronization mechanisms may also be implemented (http://www.ntp.org/ ).
22.7 CONCLUSIONS In this chapter we investigated middleware for distributed smart cameras. Smart cameras combine image sensing, considerable processing power, and high-performance communication in a single embedded device. Recent research has focused on the integration of several smart cameras into a network to create a distributed system devoted to image processing. Middleware services for distributed systems are used in many application domains, ranging from general-purpose computing to tiny embedded systems such as wireless sensor networks. As discussed, however, the requirements for middleware in distributed smart cameras are considerably different, so a more specialized middleware that takes into account the special needs of distributed smart cameras is needed. We proposed the agent-oriented approach for managing distributed smart cameras and building applications. Mobile agents are autonomous entities that "live" within the network. Their autonomous and goal-oriented behavior allows the creation of self-organizing distributed smart cameras. In two case studies we demonstrated the agent-oriented approach on decentralized multi-camera tracking and on sensor fusion, respectively. We also discussed the important services the middleware has to provide in the distributed computer vision and sensor fusion domains. Our experience shows that this middleware eases the development of DSC applications in several ways: First, it provides a clear separation between algorithm implementation and algorithm coordination among different cameras. Second, it strongly supports scalability—that is, the development of applications for a variable number of cameras.
Finally, the available resources can be better utilized by dynamic loading and dynamic reconfiguration services, which are usually not available on embedded platforms. There are still many open questions for middleware systems for distributed embedded platforms in general. Topics for further research include support for application development (development, operation, maintenance, etc.), resource awareness, and scalability and interoperability.
REFERENCES [1] R. Kleihorst, B. Schueler, A. Danilin, Architecture and applications of wireless smart cameras (networks), in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007. [2] W. Wolf, B. Ozer, T. Lv, Smart cameras as embedded systems, Computer 35 (9) (2002) 48–53. [3] M. Bramberger, A. Doblander, A. Maier, B. Rinner, H. Schwabach, Distributed embedded smart cameras for surveillance applications, IEEE Computer 39 (2) (2006) 68–75. [4] H. Aghajan, R. Kleihorst (Eds.), in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2007. [5] R. Kleihorst, R. Radke (Eds.), in: Proceedings of the ACM/IEEE International Conference on Distributed Smart Cameras, 2007. [6] B. Rinner, W. Wolf, Introduction to distributed smart cameras, Proceedings of the IEEE 96 (10) (2008) 1565–1575. [7] A.S. Tanenbaum, M. van Steen, Distributed Systems: Principles and Paradigms, Prentice Hall, 2006. [8] I.F. Akyildiz, T. Melodia, K.R. Chowdhury, A survey on wireless multimedia sensor networks, Computer Networks 51 (2007) 921–960. [9] R. Collins, A. Lipton, T. Kanade, A system for video surveillance and monitoring, in: Proceedings of the American Nuclear Society Eighth International Topical Meeting on Robotics and Remote Systems, 1999. [10] C.-F. Shu, A. Hampapur, M. Lu, L. Brown, J. Connell, A. Senior, Y. Tian, IBM smart surveillance system (s3): An open and extensible framework for event based surveillance, in: Proceedings of the IEEE Conference on Advanced Video and Signal-Based Surveillance, 2005. [11] C.H. Lin, T. Lv, W. Wolf, I.B. Ozer, A peer-to-peer architecture for distributed real-time gesture recognition, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2004. [12] R. Dantu, S.P. Joglekar, Collaborative vision using networked sensors, in: Proceedings of the International Conference on Information Technology: Coding and Computing, 2004. [13] Q. Cai, J.K. Aggarwal, Tracking human motion in structured environments using a distributedcamera system, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (11) (1999) 1241–1247. [14] R. Collins, A. Lipton, H. Fujiyoshi, T. Kanade, Algorithms for cooperative multisensor surveillance, in: Proceedings of the IEEE 89 (2001) 1456–1477. [15] J. Košecká, W. Zhang, Video compass, in: Proceedings of the Seventh European conference on Computer Vision, 2002. [16] B. Bose, E. Grimson, Ground plane rectification by tracking moving objects, in: Proceedings of the Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2003. [17] S. Khan, M. Shah, Consistent labeling of tracked objects in multiple cameras with overlapping fields of view, Transactions on Pattern Analysis and Machine Intelligence 25 (10) (2003) 1355–1360.
[18] R. Pflugfelder, H. Bischof, Online auto-calibration in man-made worlds, in: Proceedings of the International Conference on Digital Image Computing: Techniques and Applications, 2005. [19] B. Rinner, M. Jovanovic, M. Quaritsch, Embedded middleware on distributed smart cameras, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007. [20] R.E. Schantz, D.C. Schmidt, Research advances in middleware for distributed systems, in: Proceedings of the IFIPWorld Computer Congress:TC6 Stream on Communication Systems— The State of the Art, 2002. [21] D.C. Schmidt, Middleware for real-time and embedded systems, Communications of the ACM 45 (6) (2002) 43–48. [22] A. Pope, The CORBA Reference Guide: Understanding the Common Object Request Broker Architecture, Addison-Wesley, 1998. [23] D.G. Schmidt, F. Kuhns, An overview of the real-time CORBA specification, Computer 33 (6) (2000) 56–63. [24] Minimum CORBA Specification, http://www.omg.org/technology/documents/formal/ minimum_CORBA.htm, accessed June 2008. [25] D.C. Schmidt, D.L. Levine, S. Mungee, The design of the TAO real-time object request broker, Computer Communications 21 (4) (1998). [26] D. Box, Essential COM,Addison-Wesley, 2007. [27] R. Sessions, COM and DCOM: Microsoft’s Vision for Distributed Objects, John Wiley & Sons, 1997. [28] E. Pitt, K. McNiff, Java.RMI: The Remote Method Invocation Guide, Addison-Wesley, 2001. [29] Y. Yu, B. Krishnamachari, V.K. Prasanna, Issues in designing middleware for wireless sensor networks, IEEE Network 18 (1) (2004) 15–21. [30] M.M. Molla, S.I. Ahamed, A survey of middleware for sensor network and challenges, in: Proceedings of the IEEE International Conference on Parallel Processing, Workshops, 2006. [31] P. Levis, S. Madden, J. Polastre, R. Szewczyk, K. Whitehouse, A. Woo, D. Gay, J. Hill, M. Welsh, E. Brewer, D. Culler, Tinyos: An operating system for sensor networks, in: Ambient Intelligence, pp. 115–148, Springer, 2005. [32] C.-L. Fok, G.C. Roman, C. Lu, Rapid development and flexible deployment of adaptive wireless sensor network applications, in: Proceedings of the IEEE International Conference on Distributed Computing Systems, 2005. [33] D. Georgoulas, K. Blow, Making motes intelligent: An agent-based approach to wireless sensor networks, WSEAS on Communications Journal 5 (3) (2006) 525–522. [34] P. Bonnet, J. Gehrke, P. Seshadri, Towards sensor database systems, in: Mobile Data Management, Lecture Notes in Computer Science, 87 (2001) 3–14. [35] S.R. Madden, M.J. Franklin, J.M. Hellerstein, W. Hong, TinyDB: an acquisitional query processing system for sensor networks, ACM Transactions on Database Systems 30 (1) (2005) 122–173. [36] M. Yokoo, S. Fujita, Trends of Internet auctions and agent-mediated Web commerce, New Generation Computing 19 (4) (2001) 369–388. [37] T. Sandholm, eMediator: A next-generation electronic commerce server, Computational Intelligence 18 (4) (2002) 656–676. [38] K. Stathis, O. de Bruijn, S. Macedo, Living memory: Agent-based information management for connected local communities, Interacting with Computers 14 (6) (2002) 663–688. [39] W.-S.E. Chen, C.-L. Hu, A mobile agent-based active network architecture for intelligent network control, Information Sciences 141 (1–2) (2002) 3–35. [40] L.-D. Chou, K.-C. Shen, K.-C. Tang, C.-C. Kao, Implementation of mobile-agent-based network management systems for national broadband experimental networks in Taiwan, in: Holonic and Multi-Agent Systems for Manufacturing, pp. 
280–289, Springer, 2003.
[41] M. Wooldridge, Intelligent Agents: The Key Concepts, Lecture Notes in Computer Science, 2322, Springer-Verlag, 2002. [42] Y. Aridor, M. Oshima, Infrastructure for mobile agents: Requirements and design, in: Lecture Notes in Computer Science, 1477, pp. 38–49, Springer, 1998. [43] MASIF Standard, ftp://ftp.omg.org/pub/docs/orbos/98-03-09.pdf, 1998 (accessed June 2008). [44] D. Milojicic, M. Breugst, I. Busse, J. Campbell, S. Covaci, B. Friedman, K. Kosaka, D. Lange, K. Ono, M. Oshima, C. Tham, S. Virdhagriswaran, J. White, MASIF: The OMG mobile agent system interoperability facility, in: Lecture Notes in Computer Science, 1477, pp. 50–67, Springer-Verlag, 1998. [45] Foundation for Intelligent Physical Agents, Agent Communication Language, http://www.fipa.org/repository/aclspecs.html, 2007 (accessed June 2008). [46] R.S. Gray, G. Cybenko, D. Kotz, R.A. Peterson, D. Rus, D'Agents: applications and performance of a mobile-agent system, Software: Practice and Experience 32 (6) (2002) 543–573. [47] C. Bäumer, T. Magedanz, Grasshopper—A mobile agent platform for active telecommunication networks, in: Intelligent Agents for Telecommunication Applications, pp. 690–690, Springer, 1999. [48] T. Wheeler, Voyager Architecture Best Practices, Technical report, Recursion Software, Inc, March 2007. [49] P. Marrow, M. Koubarakis, R.-H. van Lengen, F.J. Valverde-Albacete, E. Bonsma, J. Cid-Suerio, A.R. Figueiras-Vidal, A. Gallardo-Antolín, C. Hoile, T. Koutris, H.Y. Molina-Bulla, A. Navia-Vázquez, P. Raftopoulou, N. Skarmeas, C. Tryfonopoulos, F. Wang, C. Xiruhaki, Agents in decentralised information ecosystems: The DIET approach, in: Proceedings of the Artificial Intelligence and Simulation Behaviour Convention, Symposium on Information Agents for Electronic Commerce, 2001. [50] C. Hoile, F. Wang, E. Bonsma, P. Marrow, Core specification and experiments in DIET: A decentralised ecosystem-inspired mobile agent system, in: Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems, 2002. [51] A. Fugetta, G.P. Picco, G. Vigna, Understanding code mobility, IEEE Transactions on Software Engineering, 24 (1998) 324–362. [52] J.E. White, Telescript technology: Mobile agents, Mobility: Processes, Computers, and Agents (1999) 460–493. [53] A.J.N. Van Breemen, T. De Vries, An agent based framework for designing multi-controller systems, in: Proceedings of the Fifth International Conference on the Practical Application of Intelligent Agents and Multi-Agent Technology, The Practical Application Company Ltd., Manchester, UK, 2000. [54] N.R. Jennings, S. Bussmann, Agent-based control systems: Why are they suited to engineering complex systems? IEEE Control Systems Magazine 23 (3) (2003) 61–73. [55] C.E. Pereira, L. Carro, Distributed real-time embedded systems: Recent advances, future trends and their impact on manufacturing plant control, Annual Reviews in Control 31 (1) (2007) 81–92. [56] B. Chen, H.H. Cheng, J. Palen, Mobile-C: a mobile agent platform for mobile C-C++ agents, Software: Practice and Experience 36 (15) (2006) 1711–1733. [57] H.H. Cheng, Ch: A C/C++ interpreter for script computing, C/C++ User's Journal 24 (1) (2006) 6–12. [58] D.C. Schmidt, An Architectural Overview of the ACE Framework, in: USENIX login, 1998; http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4830. [59] A. Doblander, A. Maier, B. Rinner, H.
Schwabach, A novel software framework for power-aware service reconfiguration in distributed embedded smart cameras, in: Proceedings of the IEEE International Conference on Parallel and Distributed Systems, 2006.
[60] M. Quaritsch, M. Kreuzthaler, B. Rinner, H. Bischof, B. Strobl, Autonomous Multi-Camera Tracking on Embedded Smart Cameras, EURASIP Journal on Embedded Systems (2007) 10 pages. [61] M. Bramberger, M. Quaritsch, T. Winkler, B. Rinner, H. Schwabach, Integrating multi-camera tracking into a dynamic task allocation system for smart cameras, in: Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, 2005. [62] J. Llinas, D.L. Hall, An introduction to multi-sensor data fusion, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 1998. [63] D.L. Hall, J. Llinas, Handbook of Multisensor Data Fusion, CRC Press, 2001. [64] B.V. Dasarathy, Information fusion—What, where, why, when, and how? in: Information Fusion (editorial) 2 (2) (2001) 75–76. [65] B. Moshiri, M.R. Asharif, R.H. Nezhad, Pseudo information measure: A new concept for extension of Bayesian fusion in robotic map building, Journal of Information Fusion 3 (2002) 51–68. [66] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998. [67] A. Klausner, A. Tengg, B. Rinner, Distributed multi-level data fusion for networked embedded systems, IEEE Journal on Selected Topics in Signal Processing 2 (4) (2008) 536–555. [68] A. Doblander, A. Zoufal, B. Rinner, A novel software framework for embedded multiprocessor smart cameras, in: ACM Transactions on Embedded Computing Systems, 2008.
CHAPTER 23 Cluster-Based Object Tracking by Wireless Camera Networks
Henry Medeiros, Johnny Park, School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana
Abstract In this chapter, we present a distributed object tracking system that employs a cluster-based Kalman filter in a network of wireless cameras. Cluster formation is triggered by the detection of objects with specific features. More than one cluster may track a single target, since cameras that see the same object may be outside of each other’s communication range. The clustering protocol allows the current state and uncertainty of the target position to be easily handed off among cluster heads as the object is being tracked. Keywords: wireless camera networks, wireless sensor networks, sensor clustering, distributed tracking, Kalman filtering
23.1 INTRODUCTION Recent advances in the design of miniaturized low-power electronic devices have enabled the development of wireless sensor networks (WSN) [1]. These advances have also had a major impact on the construction of image sensors. As a consequence, a new technology now emerging is that of wireless camera networks (WCN). In addition to enhancing the current applications of WSNs, wireless camera networks will enable many new applications in the areas of surveillance, traffic control, health care, home assistance, environmental monitoring, and industrial process control [2]. However, whereas the design challenges of wireless camera networks are, in many aspects, similar to those of wireless sensor networks, most of the systems devised to tackle the challenges of WSNs cannot be directly applied to WCNs. Therefore, new protocols and algorithms specific to WCNs must be designed. Since sensor nodes in a WSN are expected to operate untethered for long periods of time, energy conservation is a major concern. It is well known among researchers that local data aggregation is an effective way to save sensor node energy and prolong the WSN lifespan [3–5]. The basic concept of local data aggregation is illustrated in Figure 23.1.
FIGURE 23.1 Local data aggregation in wireless sensor networks. Rather than (a) transmit all the measurements to the base station individually, (b) nodes send their measurements to a nearby node that aggregates them into a more compact form.
As the sensor nodes acquire information about an event of interest, rather than transmit each
measurement individually to a base station (Figure 23.1(a)), which may be multiple hops away from the sensors, each node first transmits its own measurement to a nearby node that aggregates all the measurements into a more compact form (Figure 23.1(b)). This has motivated many researchers to employ sensor-clustering techniques to enable local data aggregation for environment-monitoring applications [6–10], wherein the events of interest are dispersed throughout the network. In such applications, all network sensors can constantly acquire useful information; thus, clustering consists of partitioning the entire network into groups of nearby nodes so that one node, deemed the cluster head, is able to carry out local data aggregation. However, when sensor networks are used for event-driven applications such as object detection and tracking (as opposed to environment monitoring), not all sensors provide useful information at the same time. Therefore, the goal in event-driven clustering is to dynamically select a subset of sensors that can provide information about the target based on the physical position of the event source and on the characteristics of the sensors. Most of the existing event-driven clustering algorithms assume that the distances between the sensors and the event-generating targets are somehow related to the ability of the sensors to detect the target [11–18]. In WCNs, however, the distance-based criteria for
sensor node clustering are not sufficient because, depending on their viewing directions, physically proximal cameras may view segments of space that are disjointed and even far from each other.That means that, even when a single object is being tracked, a clustering algorithm must allow for the formation of multiple disjointed clusters of cameras for tracking the same object. In this chapter, we present a lightweight event-driven protocol for wireless camera networks that allows formation and propagation of clusters of cameras for the purpose of distributed and collaborative information processing [19]. Within this clustering framework, we also describe how a decentralized Kalman filter can be employed for real-time object tracking in a resource-constrained wireless camera network [20]. The chapter is organized as follows. In Section 23.2, we present an overview of work related to both clustering in wireless camera networks and distributed Kalman filtering. In Section 23.3, we discuss in more detail the challenges of clustering in wireless camera networks and then present a clustering protocol for them. Section 23.4 presents a method for object tracking employing a cluster-based Kalman filter. Section 23.5 presents details of the implementation of our simulation system and of our experimental testbed, as well as experimental results obtained. Finally, in Section 23.6, we conclude the chapter and present future research opportunities.
23.2 RELATED WORK This section presents some of the recent contributions in the areas of event-driven clustering protocols and distributed Kalman filtering.
23.2.1 Event-Driven Clustering Protocols In environment-monitoring applications, the nodes of a sensor network are usually clustered via one of the following strategies:
■ Only once at system initialization [6]
■ Periodically based on some predefined network-wide time interval [7]
■ Aperiodically on the basis of some internal node parameter such as the remaining energy reserve at the nodes [8]
However, in object-tracking applications, clustering must be triggered by the detection of an event of interest external to the network. This section presents some works that take external events into consideration in the cluster formation process. Chen et al. [11] proposed an algorithm for distributed target tracking using acoustic information. Their system is composed of sparsely placed high-capability nodes and densely spaced low-end sensors. The high-capability nodes act as cluster heads; the low-end sensors, as cluster members. Cluster heads close to the detected event become active with higher probability than those farther from the event. Similarly, the probability that a cluster member sends data to the cluster head is proportional to its distance to the event. The probability that a given node is closest to the target is computed with the help of Voronoi diagrams. The previously mentioned probabilities are then used to set up back-off timers used by the cluster heads to decide how long they should wait before volunteering
to become the active cluster head and by the sensors to decide how long they should wait before sending data to the cluster head. After one cluster head volunteers to become active, the remaining candidates leave the election process. Similarly, after a sufficient number of sensors send their data to the active cluster head, the remaining sensors cease to transmit their measurements. Kuang et al. [12] also designed a cluster-based tracking system for acoustic sensors in which the cluster head election is performed based on a random back-off timer proportional to the acoustic energy detected by each sensor. After a cluster head is elected, the number of sensors that should participate in estimating the target position is computed based on the density of sensors that can detect the target. Again, using a random back-off timer proportional to the detected energy, the sensors transmit their measurements to the cluster head in order, from the highest detected energy to the lowest detected energy. The cluster head then uses the measurements to estimate the position of the target using a particle filter. The system in Kuang et al. [12] does not consider information hand-off among cluster heads; instead, whenever the target is lost, the ALL_NBR [21] algorithm is used for recovery. Yang et al. [22] recently proposed the adaptive dynamic cluster-based tracking (ADCT) method, in which, when a target is initially detected, a simple cluster head election protocol takes place, and the node closest to the target becomes the cluster head. After a cluster is formed, each cluster member sends a bid to the cluster head. The value of the bid is based on the sensor's distance to the target and its communication cost with respect to the cluster head. The cluster head then ranks the cluster members based on their bids and chooses the best sensors to collaborate with. The number of sensors that collaborate with the cluster head is a predefined system parameter. As the object moves, the predicted object position is used to assign the role of cluster head to a new node. In the distributed group management protocol proposed by Liu et al. [13], sensors in a region near a detected target create a group to keep track of this target. When a target is detected, an alarm region that contains the nodes with a high probability of detecting it is defined. Nodes within that region interact so that the first node to detect the target (which is also assumed to be the node nearest to the target) becomes the cluster head. After the cluster head is elected, it requests nodes within the alarm region to join its cluster. As the object moves, the cluster head adds new nodes and removes old ones. If multiple objects are being tracked, or if there are communication failures during cluster head election, multiple clusters may coexist in the network. To allow for disambiguation among multiple tracks, each cluster identifies its track by a time stamp. If different clusters overlap, they merge into a single cluster by dropping the newest track. If these tracks resplit later, a new cluster head election takes place and multiple clusters are again formed. Chong et al. [23] proposed to extend this work by employing multiple hypothesis testing (MHT) [24] in the target disambiguation problem. Fang et al. [14] proposed a distributed aggregate management (DAM) protocol in which nodes that detect energy peaks become cluster heads and a tree of cluster members is formed by their neighbors that detect lower energy levels.
When many targets lie within the same cluster, they use their energy-based activity monitoring (EBAM) algorithm to count the number of targets. EBAM assumes that all targets are equally strong emitters of energy, and it counts the number of targets within a cluster based on the total energy detected by it. To drop this assumption, the researchers proposed the
expectation-maximization–like activity monitoring (EMLAM) algorithm, which assumes that the targets are initially well separated and uses a motion prediction model along with message exchanges among cluster leaders to keep track of the total number of objects. Zhang and Cao [15] proposed the dynamic convoy tree-based collaboration (DCTC) algorithm, in which the nodes that can detect an object create a tree rooted at a node near it. As the object moves, nodes are added to and pruned from the tree, and the root moves to nodes closer to the object. They proposed two different schemes for adding nodes to and pruning nodes from the tree. In the conservative scheme, all nodes within a circular region that should enclose the next position of the target based on its current velocity are added. In the prediction-based scheme, some prediction method is used to estimate the future position of the target and only nodes within this region are added. They also suggested two schemes for tree reconfiguration—cluster head reassignment and reconfiguration of the paths from the cluster members to the new cluster head. The sequential reconfiguration scheme is based on broadcasting information about the communication costs between the cluster head and the cluster members. In the localized reconfiguration method, some of this information is suppressed because it can be approximately computed by each node locally. Tseng et al. [18] proposed an agent-based architecture for object tracking in which a master agent migrates through the network to keep track of the target. When the object is initially detected, an election takes place and the sensor closest to the object becomes the master agent. The master agent invites nearby sensors to act as slave agents, which send their measurements of the target to the master agent. The master agent also inhibits the action of farther sensors that can sense the object in order to save network energy. As the object moves, the master agent invites new sensors to become slave agents and inhibits old sensors whose measurements are no longer desired because they are too far from the object. The master migrates to a new sensor when it detects that the object is leaving its field of view. As the master migrates, the state of the target is carried to the new master. Information about the target is periodically delivered to a base station. This is essentially a cluster-based architecture wherein the master agent corresponds to the cluster head and the slave agents are the cluster members. Blum et al. [25] proposed a middleware architecture that allows distributed applications to communicate with groups of sensors assigned to track multiple events in the environment. Their architecture is divided into the entity management module (EMM) and the entity connection module (ECM). The EMM is responsible for creating unique groups of sensors to track each event, maintaining persistent identities for these groups, and storing information about the state of the event. The ECM provides end-to-end communication among different groups of sensors. He et al. [26] presented an implementation and experimental evaluation of this system on a real network of wireless sensors. In the recent contribution by Bouhafs et al. [16], cluster formation is triggered by a user query. When the query finds a node that can detect the target, this node starts the formation of a cluster that consists of a tree rooted at the cluster head containing all the nodes that can detect the target.
The cluster members then send their measurements to the cluster head, which decides which sensor is closest to the target and defines it as the new cluster head. This process entails reconfiguring the entire tree so that each node can find a communication path to the new cluster head. Cluster head
reassignment is repeated periodically to keep track of the target. Sensors in the neighborhood of the cluster are allowed to hear the cluster head reassignment messages so that, when they detect the target energy above a threshold, they are allowed to join the cluster. Demirbas et al. [17] proposed an architecture to solve the pursuer-evader problem in a sensor network. In their system, a tree rooted at the current position of the evader is constructed and updated at the evader’s every movement. This tree is constructed so that each node is always closer to the evader than its children, and to save energy the height of the tree is restricted. The pursuer then queries its closest node, which directs it to its parent in the tree. Since the pursuer is assumed to move faster than the evader, it eventually reaches the evader. The pursuer uses a (depth-first or breadth-first) tree construction algorithm to initially find the tree rooted at the evader. Ji et al. [27] proposed a cluster-based system for tracking the boundaries of large objects spread throughout the network. This work contrasts with those on tracking point targets in that each sensor is able to sense only a portion of the object. The system is composed of dynamic clusters responsible for aggregating data from boundary sensors (i.e., sensors that can sense the object and have at least one neighbor that cannot). As the boundary of the object moves, the cluster head is reselected by the previous cluster head and the cluster members are updated. Several other works employed clustering for object detection, localization, and tracking in sensor networks [28–32]. However, they assume that the cluster infrastructure is already present and, therefore, do not explicitly address the process of cluster formation or propagation. Table 23.1 summarizes the main characteristics of the existing event-driven clustering protocols. All of the aforementioned works have in common the fact that they are designed for omnidirectional sensors. Therefore, they do not account for challenges specific to directional sensors such as cameras. One such challenge is the fact that physical proximity between a sensor and the target does not imply that the sensor is able to acquire information about the target. Hence, distance-based cluster formation protocols are not directly applicable to camera networks. Sensor clustering in wireless camera networks will be addressed in detail in Section 23.3.1.

Table 23.1 Summary of Existing Event-Driven Clustering Protocols
Characteristics compared: Propagation, Dynamic Head Election, Coalescence, Fragmentation, Inter-Cluster Interactions, and Directional Sensors. Protocols compared: Chen et al. [11], Kuang et al. [12], Yang et al. [22], Liu et al. [13], Fang et al. [14], Zhang and Cao [15], Tseng et al. [18], Blum et al. [25], Bouhafs et al. [16], Demirbas et al. [17], Ji et al. [27], and Medeiros et al. [19](a).
(a) This work is explained in detail in Section 23.3.2. (b) Among the predefined cluster heads. (c) In EMLAM. (d) When multiple targets are temporarily seen as a single “virtual” target, but there is no explicit mechanism for cluster fragmentation.
23.2.2 Distributed Kalman Filtering
The distribution of computations involved in estimation problems using Kalman filters in sensor networks has been a subject of research since the late 1970s [33]. This section presents some of the recent contributions in this area. Olfati-Saber [34] presented a distributed Kalman filter approach wherein a system with an np-dimensional measurement vector is first split into n subsystems of p-dimensional measurement vectors. These subsystems are then individually processed by micro Kalman filters in the network nodes. In this system, the sensors compute an average inverse covariance and average measurements using consensus filters. The averaged values are then used by each node to individually compute the estimated state of the system using the information form of the Kalman filter. Even though this approach is effective in an environment-monitoring application in which the state vector is partially known by each network node, it is not valid for an
object-tracking application in which, at a given time, each of a small number of nodes knows the entire state vector (although possibly not accurately). Nettleton et al. [35] proposed a tree-based architecture in which each node computes the update equations of the Kalman filter in its information form and sends the results to its immediate predecessor in the tree. The predecessor then aggregates the received data and computes a new update. Node asynchrony is handled by predicting asynchronously received information to the current time in the receiving node. This approach is scalable since the information transmitted between any pair of nodes is fixed. However, the size of the information matrix is proportional to m², where m is the dimension of the state vector. In a sensor network setting, this information may be too large to be transmitted between nodes; therefore, methods to effectively quantize this information may need to be devised. Regarding quantization, the work by Ribeiro et al. [36] studied a network environment wherein each node transmits a single bit per observation, the sign of innovation (SOI), at every iteration of the filter. The system assumes an underlying sensor-scheduling mechanism so that only one node at a time transmits the information. It also assumes the update information (the signs of innovations) to be available to each node of the network. The researchers showed that the mean squared error of their SOI Kalman filter is closely related to the error of a clairvoyant Kalman filter, which has access to all the data in analog form. There is an interesting trade-off between the work of Nettleton et al. and that of Ribeiro et al. The former presents a high level of locality; that is, each node needs
only information about its immediate neighbors. On the other hand, a reasonably large amount of information must be transmitted by each node. The latter, in turn, requires the transmission of a very small amount of information by each node; however, the algorithm does not present locality since the information must be propagated throughout the network. This kind of trade-off must be carefully considered when designing an algorithm for real wireless sensor network applications. Balasubramanian et al. [37] proposed a cluster-based distributed object tracking system employing a heterogeneous network composed of high-capability nodes and low-end active sensors such as Doppler radars. They assumed the network is partitioned into clusters with each cluster composed of one high-capability node and many low-end sensors. The high-capability nodes act as cluster heads, and the low-end sensors collect data. The cluster heads are responsible for selecting the sensors that expend the least amount of energy to detect the target. The energy cost is computed based on a cost function that considers the energy required to sense the target and the energy required to transmit the data to the cluster head. The data collected by the sensors is then aggregated by the cluster head using a Kalman filter, after linearizing the measurement function. When the target enters the area covered by another cluster, the Kalman filter is initialized by the predicted state obtained from the previous cluster head. To the best of our knowledge, the only work that applies Kalman filtering to a cluster-based architecture for object tracking using camera networks is Goshorn et al. [38]. Their system assumes that the network was previously partitioned into clusters of cameras with similar fields of view. As the target moves, information within a cluster is handed off to a neighboring cluster.
23.3 CAMERA CLUSTERING PROTOCOL
We introduce this section with an overview of the challenges of sensor clustering in wireless camera networks, because a clear understanding of these challenges is necessary to appreciate the characteristics of the clustering protocol that we have devised. In the subsequent sections, we present the details of our clustering protocol.
23.3.1 Object Tracking with Wireless Camera Networks
Most of the current event-driven clustering protocols assume that sensors closest to an event-generating target can best acquire information about it. In wireless camera networks, however, the distance-based criteria for sensor node clustering are not sufficient since, depending on their pointing directions, physically proximal cameras may view segments of space that are disjointed and even far from one another. Thus, even when only a single object is being tracked, a clustering protocol must allow for the formation of multiple disjointed clusters of cameras to track it. An example is illustrated in Figure 23.2(a), where, in spite of the fact that the cameras in cluster A cannot communicate with the cameras in cluster B, both clusters can track the object. Therefore, multiple clusters must be allowed to track the same target. Even if all cameras that can detect a common object are able to communicate with one another in multiple hops, the communication overhead involved in tracking the object
FIGURE 23.2 (a) Multiple clusters tracking the same object in a wireless camera network; dotted circles represent their communication ranges. (b) Two single-hop clusters in a network of cameras that can communicate in multiple hops. Dark circles represent cluster heads; light circles, cluster members. The lines connecting the nodes correspond to communication links among them. (c) and (d) Fragmentation of a single cluster. As the cluster head in (c) leaves the cluster, it is fragmented into two clusters (d).
using a large cluster may be unacceptable, given that collaborative processing generally requires intensive message exchange among cluster members [11]. Therefore, rather than create a single large multi-hop cluster to track an object, it is often desirable that multiple single-hop clusters interact as needed. An example is illustrated in Figure 23.2(b): Whereas all cameras that can see the same object may constitute a connected graph if we allow multi-hop communications, requiring two single-hop clusters may be more efficient in this case. Dynamic cluster formation requires all cluster members to interact to select a cluster head. Many algorithms could be used for electing a leader from among all cameras that can see the same object. However, they would not work for us since we must allow for multiple single-hop clusters (for reasons previously explained) and for the election of a separate leader for each. Therefore, it is necessary to devise a new leader election protocol suitable for single-hop clusters in a wireless camera network setting. After clusters are created to track specific targets, they must be allowed to propagate through the network as the targets move. Cluster propagation refers to the process of accepting new members into the cluster when they identify the same object, removing members that can no longer see the object, and assigning new cluster heads when the current cluster head leaves the cluster. Since cluster propagation in wireless camera networks can be based on distinctive visual features of the target, it is possible for clusters tracking different objects to propagate independently or even overlap if necessary. In other words, cameras that can detect multiple targets may belong simultaneously to multiple clusters. Including a new member in a cluster and removing an existing member are simple operations. However, when a cluster head leaves the cluster, mechanisms
FIGURE 23.3 (a) State transition diagram of a cluster-based object-tracking system using a wireless camera network. (b) Orphan cameras after the first stage of the leader election algorithm.
must be provided to account for the possibility that the cluster will be fragmented into two or more clusters, as illustrated in Figures 23.2(c) and (d). Since multiple clusters are allowed to track the same target, if they overlap they must be able to coalesce into a single cluster. Coalescence of clusters is made possible by permitting intra-cluster communications to be overheard as different clusters come into each other’s communication range. Obviously, overhearing implies inter-cluster communication. It is important to note that inter-cluster communication can play a role in intra-cluster computation of a parameter of the environment, even when cluster merging is not an issue. For example, a cluster composed of overhead cameras may request information about the z coordinate of the target from a neighboring cluster composed of wall-mounted cameras, so mechanisms to allow inter-cluster interactions in wireless camera networks are necessary. To summarize these points, Figure 23.3(a) illustrates the state transition diagram of a cluster-based object-tracking system using a wireless camera network. The network initially monitors the environment. As an object is detected, one or more clusters are formed to track it. To keep track of the object, the clusters must propagate through the network as it moves and, if necessary, fragment themselves into smaller clusters. Finally, if two or more clusters tracking the same object meet each other, they may interact to share information or coalesce into larger clusters.
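For concreteness, the lifecycle just described can be written as a small state machine. The sketch below is in Python rather than the nesC of our implementation, and the state and event names are illustrative only; it simply encodes the transitions of Figure 23.3(a).

from enum import Enum, auto

class NodeState(Enum):
    MONITORING = auto()
    FORMING = auto()      # cluster formation / head election in progress
    TRACKING = auto()     # member or head of an active cluster
    INTERACTING = auto()  # exchanging information with a neighboring cluster

def next_state(state, event):
    """Return the next lifecycle state of a camera node.

    The transitions mirror Figure 23.3(a): detection triggers cluster
    formation, a formed cluster tracks and propagates, clusters that meet
    may interact (share information or coalesce), and losing the object
    returns the node to monitoring.
    """
    transitions = {
        (NodeState.MONITORING, "object detected"): NodeState.FORMING,
        (NodeState.FORMING, "cluster formed"): NodeState.TRACKING,
        (NodeState.TRACKING, "object moves"): NodeState.TRACKING,          # propagation
        (NodeState.TRACKING, "approaches other cluster"): NodeState.INTERACTING,
        (NodeState.INTERACTING, "merged or information shared"): NodeState.TRACKING,
        (NodeState.TRACKING, "object lost"): NodeState.MONITORING,
        (NodeState.FORMING, "object lost"): NodeState.MONITORING,
    }
    return transitions.get((state, event), state)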
23.3.2 Clustering Protocol
In this section we present a clustering protocol for wireless camera networks. We believe that the best way to present the protocol would be to show the state transition diagram at each node. Such a diagram would define all of the states of a node as it transitions from initial object detection to participation in a cluster to, possibly, its role as a leader and, finally, to relinquishing its cluster membership. Unfortunately, such a diagram would be much too complex for a clear presentation. Instead, we present the concept in three pieces, which correspond to cluster formation and head election, cluster propagation, and inter-cluster communication. The state transition diagram for cluster propagation includes the transitions needed for cluster coalescence and fragmentation.
As the reader will note, our state transitions allow for wireless camera networks to dynamically create one or more clusters to track objects based on visual features. Note, too, that the protocol is lightweight in the sense that it creates single-level clusters—composed only of cameras that can communicate in a single hop—rather than multiple-level clusters—which incur large communication overhead and latency during collaborative processing and require complex cluster management strategies. Cameras that can communicate in multiple hops may share information as needed through inter-cluster interactions.
Cluster Head Election
To select cluster heads for single-hop clusters, we employ a two-phase cluster head election algorithm. In the first phase, nodes compete to be the one that minimizes (or maximizes) some criterion, such as the distance from the camera center to the object center in the image plane. By the end of this phase, at most one camera in a single-hop neighborhood elects itself leader and its neighbors join its cluster. During the second phase, cameras that were left without a leader (because their leader candidate joined another cluster) identify the next best leader candidate. As illustrated by the state transition diagram on the left in Figure 23.4, in the first phase of the cluster head election algorithm, each camera that detects an object sends
FIGURE 23.4 Cluster head election state transition diagram.
a message requesting the creation of a cluster and includes itself in a list of cluster head candidates sorted by the cluster head selection criteria. The cluster creation message includes the values of the cluster head selection criteria from the sender. After a camera sends a cluster creation message, it waits for a predefined timeout period for cluster creation messages from other cameras. Whenever a camera receives a cluster creation message from another camera, it updates the list of cluster head candidates. To make sure that cameras that detect the object at later moments do not lose information about the available cluster head candidates, all the cameras that can hear the cluster creation messages update their candidate lists. After the timeout period, if the camera finds itself in the first position on the candidate list, it sends a message informing its neighbors that it is ready to become the cluster head. If the camera does not decide to become a cluster head, it proceeds to the second phase of the algorithm. The first phase of the algorithm guarantees that a single camera becomes a cluster head within its communication range. However, cameras that can communicate with the cluster head in multiple hops may be left without a leader. Figure 23.3(b) shows an example of this situation. Cameras 1 and 2 decide that camera 3 is the best cluster head candidate. However, camera 3 chooses to become a member of the cluster headed by camera 4. Hence, cameras 1 and 2 are left orphans after the first stage of leader election and must proceed to the second phase of the algorithm to choose their cluster heads. During the second phase of cluster head election, cameras that did not receive a cluster ready message after a time interval remove the first element of the cluster head candidate list. If the camera then finds itself in the first position, it sends a cluster ready message and becomes a cluster head. Otherwise, it waits for a timeout period for a cluster ready message from the next candidate in the list. This process is illustrated on the right of the state transition diagram in Figure 23.4. Eventually, the camera either becomes a cluster head or joins a cluster from a neighboring camera. To avoid the possibility of multiple cameras deciding to become cluster heads simultaneously, it is important that the cluster head election criteria impose a strict order on the candidates (if they do not, ties must be broken during the first phase). In the final step of the algorithm, to establish a bidirectional connection between the cluster head and the cluster members, each member sends a message to report to the cluster head that it has joined the cluster. This step is not strictly necessary if the cluster head does not need to know about the cluster members. However, in general, for collaborative processing, the cluster head needs to know its cluster members so that it can assign them tasks and coordinate the distributed processing.
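As an illustration of the election logic, the following Python sketch mimics the candidate-list bookkeeping at a single camera. It is not the protocol implementation (which runs in nesC on the motes); radio transmissions are reduced to method calls, and the selection criterion is assumed to be a scalar for which smaller values are better.

class HeadElection:
    """Sketch of the two-phase cluster head election at one camera."""

    def __init__(self, local_id, criterion):
        self.local_id = local_id
        # Candidate list sorted by (criterion, id); the id breaks ties so
        # that the ordering is strict, as the protocol requires.
        self.candidates = [(criterion, local_id)]
        self.cluster_head = None

    def on_create_cluster(self, sender_id, sender_criterion):
        # Every camera that overhears a create-cluster message records the
        # sender as a head candidate, even if it detected the object late.
        self.candidates.append((sender_criterion, sender_id))
        self.candidates.sort()

    def on_timeout(self):
        # First phase: if we are the best candidate, declare ourselves head
        # (in the protocol this is the "cluster ready" message).
        if self.candidates[0][1] == self.local_id:
            self.cluster_head = self.local_id
        return self.cluster_head

    def on_cluster_ready(self, head_id):
        # Join the first neighbor that declares itself ready.
        self.cluster_head = head_id

    def on_second_phase_timeout(self):
        # Second phase: our preferred candidate joined another cluster,
        # so drop it and check whether we are now the best remaining choice.
        self.candidates.pop(0)
        return self.on_timeout()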
Cluster Propagation
Inclusion of new members in active clusters takes place as follows. When a camera detects a new target, it proceeds normally, as in the cluster formation step, by sending to its neighbors a create cluster message and waiting for the election process to take place. However, if there is an active cluster tracking the same object in the camera’s neighborhood, the cluster head replies with a message requesting that the camera join its cluster. The camera that initiated the formation of a new cluster then halts the election process and replies with a join cluster message.
FIGURE 23.5 Cluster propagation state transition diagram.
Removal of cluster members is trivial; when the target leaves a cluster member’s FOV, all the member has to do is send a message informing the cluster head that it is leaving the cluster. The cluster head then updates its members list. If the cluster member can track multiple targets, it terminates only the connection related to the lost target. Figure 23.5 shows the state transition diagram for cluster propagation, including the transitions for inclusion and removal of cluster members as well as cluster fragmentation and coalescence, which we explain next.
Cluster Fragmentation
When the cluster head leaves the cluster, we must make sure that, if the cluster is fragmented, a new cluster head is assigned to each fragment. Cluster head reassignment works as follows. Assuming that the cluster head has access to the latest information about the position of the target with respect to each cluster member, it is able to keep an updated list of the best cluster head candidates. When the cluster head decides to leave the cluster, it sends a message to its neighbors containing a sorted list of the best cluster head candidates. Each cluster member removes from that list all nodes that are not within its neighborhood, which is known to each node. Leader election then takes place as in the second phase of the regular cluster leader election mechanism.
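A minimal sketch of this reassignment step, under the assumption that each member already knows its one-hop neighborhood, might look as follows (Python; the function and variable names are ours):

def fragment_reassignment(candidate_list, neighbor_ids, local_id):
    """Sketch of a cluster member's reaction to a departing head.

    `candidate_list` is the sorted list of head candidates broadcast by the
    departing cluster head. Each member keeps only the candidates it can
    reach in one hop and then runs the second election phase on the rest.
    """
    reachable = [c for c in candidate_list if c in neighbor_ids or c == local_id]
    if not reachable:
        return None          # no reachable candidate: start a fresh election
    if reachable[0] == local_id:
        return local_id      # this member becomes head of its fragment
    return reachable[0]      # otherwise wait for this candidate's "cluster ready"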
Cluster Coalescence
When two single-hop clusters come within each other’s communication range, there can be two possible scenarios: (1) we have a noncoalescing inter-cluster interaction, or
(2) the clusters may coalesce to form a larger cluster. We will address the noncoalescing inter-cluster interactions in the next section. As far as two clusters coalescing into one is concerned, our cluster head reassignment procedure allows for seamless cluster coalescence. Consider two clusters, A and B, that are propagating toward each other. As the reader will recall, cluster propagation entails establishing a new cluster head as the previous head loses sight of the object. Now consider the situation when a camera is designated to become the new head of cluster A and is in the communication range of the head of cluster B. Under this circumstance, the camera that was meant to be A’s new leader is forced to join cluster B. The members of cluster A that overhear their prospective cluster head joining cluster B also join B. If there are members of cluster A not within the communication range of the head of cluster B, they do not join cluster B. Instead, they proceed to select another cluster head from what remains of cluster A following the second phase of the regular cluster leader election mechanism.
Noncoalescing Inter-Cluster Interaction
It may be the case that members of multiple clusters come into one another’s communication range but that their respective cluster heads cannot communicate in a single hop. In that case, the clusters are not able to coalesce but may need to interact to share information. The same situation prevails when a new cluster comes into existence in the vicinity of an existing cluster but without the head of the former being able to communicate with the head of the latter. In both cases, information can be shared among clusters through border nodes. These are the nodes that can communicate with other nodes in neighboring clusters, as illustrated in Figure 23.6(a). Members of neighboring clusters become border nodes through the border node discovery procedure, which takes place as follows. Consider two clusters, A and B, and suppose a new node that joins cluster B is in the communication range of a member of cluster A. As previously explained, when a node joins a cluster, either at cluster creation time or during cluster propagation, it sends a join cluster message to its corresponding cluster head. This is illustrated by the first arrow on the right side of the space–time diagram in Figure 23.6(b). Since the shown member of cluster A is in communication range of the new member of cluster B, it can overhear that message (first dashed line) and be aware that it has become a border node. This new border node in cluster A then sends a border node message to its own cluster head to inform it that it has become a border node, as is illustrated in the cluster A half of Figure 23.6(b). When the cluster B member overhears that message (second dashed line), it also becomes a border node and informs its own cluster head of that fact by sending it a border node message.

FIGURE 23.6 (a) Border nodes. (b) Messages transmitted to establish inter-cluster connections.

It is not sufficient for a border node to know that it is in communication range of some member of another cluster. As we illustrate in Figure 23.6(a), border nodes may communicate with multiple other border nodes. Therefore, it is necessary for each border node to keep track of how many connections it has to other clusters. This can be achieved simply by incrementing a counter each time a new connection among border nodes is established and decrementing it when a connection is terminated. Figure 23.7 shows the state transition diagram for inter-cluster communication. When a cluster head is informed that one of its members has become a border node, it can, in effect, request information from the neighboring clusters as needed.

FIGURE 23.7 Inter-cluster communication state transition diagram.
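The reference-counting behavior of Figure 23.7 can be sketched as follows; the callback standing in for the radio and the message strings are illustrative only:

class BorderNodeTracker:
    """Sketch of the border-node counter kept by each cluster member.

    Connections to border nodes of other clusters are reference-counted:
    the node is a border node while the counter is positive, and it reports
    changes of that status to its own cluster head.
    """

    def __init__(self, send_to_head):
        self.connections = 0
        self.send_to_head = send_to_head   # callback standing in for the radio

    def on_overhear_join(self, other_cluster_id):
        # A node of another cluster joined within our communication range.
        self.connections += 1
        if self.connections == 1:
            self.send_to_head("border node")      # just became a border node

    def on_overhear_leave(self, other_cluster_id):
        # A previously overheard node left its cluster (or timed out).
        self.connections = max(0, self.connections - 1)
        if self.connections == 0:
            self.send_to_head("not border node")  # no longer a border node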
Cluster Maintenance
Additional robustness to communication failures is achieved by a periodic refresh of the cluster status. Since the protocol is designed to enable clusters to carry out collaborative processing, it is reasonable to assume that cluster members and cluster heads exchange messages periodically. Therefore, we can use a soft-state approach [39] to keep track of cluster membership. That is, if the cluster head does not hear from a member within a certain designated time interval, that membership is considered terminated (by the same token, if a cluster member stops receiving messages from its cluster head, it assumes the cluster no longer exists and starts the creation of its own). If a specific application requires unidirectional communication (i.e., communication only from head to members or only from members to head), refresh messages can be sent by the receiver side periodically to achieve the same soft-state updating of cluster membership. Inter-cluster communication can be maintained in a similar manner. If a border node does not hear from nodes outside its own cluster for a predefined timeout period, it
assumes it is no longer a border node. If communication is unidirectional, border nodes can overhear the explicit refresh messages sent by the neighboring clusters’ border nodes to their respective cluster heads.
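A minimal sketch of this soft-state bookkeeping on the cluster head side is shown below (Python; the timeout value is arbitrary and would be tuned to the application's reporting rate):

import time

class SoftStateMembership:
    """Sketch of soft-state cluster membership tracking at the cluster head.

    Membership is refreshed implicitly by the periodic messages exchanged
    during collaborative processing; a member that stays silent longer than
    `timeout` seconds is silently dropped. The same idea applies in the other
    direction: a member that stops hearing its head assumes the cluster is gone.
    """

    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_heard = {}        # member id -> time of last message

    def on_message(self, member_id):
        self.last_heard[member_id] = time.monotonic()

    def active_members(self):
        now = time.monotonic()
        expired = [m for m, t in self.last_heard.items() if now - t > self.timeout]
        for m in expired:
            del self.last_heard[m]  # membership considered terminated
        return list(self.last_heard)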
23.4 CLUSTER-BASED KALMAN FILTER ALGORITHM
As we showed in the previous section, our clustering protocol provides a simple and effective means of dynamically creating clusters of cameras while tracking objects with specific visual features. After a cluster is created, cameras within it are able to share information about the target with very small overhead. Moreover, the information shared by the cameras is automatically carried by the cluster as it propagates. This protocol provides a natural framework for the design of a decentralized Kalman filter wherein data acquired by the cameras is aggregated by the cluster head and the estimated target position is carried along with the cluster as it propagates. In this section we present one approach to such a decentralized Kalman filter [20]. We assume that the reader is familiar with basic Kalman filter concepts. A good introduction to the subject can be found in Welch and Bishop [40] and Bar-Shalom and Li [41]. When designing any algorithm for wireless sensor networks, two major constraints must be taken into consideration:
■ Because the processing power of each node is very limited, sophisticated algorithms cannot be implemented in real time. Even relatively simple ones must be carefully implemented to use the available hardware resources wisely.
■ Communication is very expensive in terms of its energy overhead and the increased probability of packet collisions should there be too many messages. This implies that the data must be transmitted sparingly.
In this specific case, to deal with the first constraint, we took advantage of the sparsity of the matrices involved in the estimation problem at hand, as explained in detail in Section 23.4.1. To deal with the second constraint, we kept to a minimum the number of messages that need to be exchanged between the nodes for the Kalman filter to do its work. Besides, since most of the traffic in an object-tracking application in a wide-area network consists of object information transmitted to the base station via multi-hop messages, we limit that traffic by transmitting such information at a predefined rate. The result is an algorithm that can estimate the target position in a distributed manner, accurately, and in real time while reducing overall energy consumption in the network compared to a centralized approach.
23.4.1 Kalman Filter Equations
We model the target state as a 5D vector that includes the target position $(x_k, y_k)$ at discrete time instant $k$, its velocity $(\dot{x}_k, \dot{y}_k)$, and the time interval $\delta_k$ between the two latest measurements. That is, the state vector is given by
\[
x_k = [\, x_k \;\; y_k \;\; \delta_k \;\; \dot{x}_k \;\; \dot{y}_k \,]^T
\]
The dynamic equations of the system are described by the nonlinear equations
\[
x_{k+1} =
\begin{bmatrix}
x_k + \delta_k \dot{x}_k + \frac{a_x}{2}\,\delta_k^2 \\
y_k + \delta_k \dot{y}_k + \frac{a_y}{2}\,\delta_k^2 \\
\delta_k + \epsilon_k \\
\dot{x}_k + a_x \delta_k \\
\dot{y}_k + a_y \delta_k
\end{bmatrix}
\]
In our system, following Bar-Shalom and Li [41], the target acceleration $(a_x, a_y)$ is modeled by white Gaussian noise. We also model the time uncertainty $\epsilon_k$ between the latest measurements as white Gaussian noise. It is necessary to consider the time uncertainty since we only want to loosely synchronize the cameras to allow for consistency among their measurements. That is, we do not want to use complex time synchronization algorithms, so there may be a small time offset in the measurements from different cameras. The dynamic equations can be represented more compactly as
\[
x_{k+1} = f(x_k, w_k)
\]
where $w_k = [\, a_x \;\; a_y \;\; \epsilon_k \,]^T$ is the process noise vector, assumed to be white Gaussian with covariance matrix $Q$. The measurements are given by the pixel coordinates of the target and the time elapsed between the two most recent measurements. We assume the target moves on the $xy$ plane of the reference frame and that the cameras were previously calibrated—that is, the homographies between the $xy$ plane of the reference frame and the image plane of each camera are known. Hence, the homographies relate the pixel coordinates to the elements of the state vector corresponding to the target coordinates of the object. The measurement model can now be defined as
\[
z_k = h_i(x_k) + v_k
\]
where $h_i(x_k) = [\, h_{i1}(x_k) \;\; h_{i2}(x_k) \;\; \delta_k \,]^T$. Here $(h_{i1}(x_k), h_{i2}(x_k))$ are the pixel coordinates based on the homography $H_i$ corresponding to camera $i$; $\delta_k$ is the time elapsed between the two most recent measurements; and $v_k$ is the measurement noise, assumed white Gaussian with covariance matrix $R$. Since both the dynamic equations of the system as well as the measurement equations are given by nonlinear functions, it is necessary to employ an extended Kalman filter to estimate the state of the target. The time update equations of the extended Kalman filter are given by
\[
\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, 0) \tag{23.1}
\]
\[
P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + W_k Q W_k^T \tag{23.2}
\]
where $\hat{x}_{k|k-1}$ and $\hat{x}_{k-1|k-1}$ are the predicted and previously estimated state vectors, and $P_{k|k-1}$ and $P_{k-1|k-1}$ are the predicted and previously estimated covariance matrices for
the state vector. $F_k$, the Jacobian matrix of the state transition function $f(\cdot)$ with respect to $x_k$, is given by
\[
F_k = \left. \frac{\partial f}{\partial x_k} \right|_{\hat{x}_{k-1|k-1},\,0} =
\begin{bmatrix}
1 & 0 & \dot{x}_k & \delta_k & 0 \\
0 & 1 & \dot{y}_k & 0 & \delta_k \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{23.3}
\]
Similarly, $W_k$ is the Jacobian matrix of $f(\cdot)$ with respect to $w_k$, given by
\[
W_k = \left. \frac{\partial f}{\partial w_k} \right|_{\hat{x}_{k-1|k-1},\,0} =
\begin{bmatrix}
\frac{\delta_k^2}{2} & 0 & 0 \\
0 & \frac{\delta_k^2}{2} & 0 \\
0 & 0 & 1 \\
\delta_k & 0 & 0 \\
0 & \delta_k & 0
\end{bmatrix} \tag{23.4}
\]
The measurement update equations for the filter are given by
\[
K_k = P_{k|k-1} H_k^T \left( H_k P_{k|k-1} H_k^T + R \right)^{-1} \tag{23.5}
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - h_i(\hat{x}_{k|k-1}) \right) \tag{23.6}
\]
\[
P_{k|k} = (I - K_k H_k) P_{k|k-1} \tag{23.7}
\]
where $H_k$ is the Jacobian matrix of the function $h_i(\cdot)$ with respect to $x_k$ evaluated at $\hat{x}_{k-1|k-1}$, given by
\[
H_k = \left. \frac{\partial h_i}{\partial x_k} \right|_{\hat{x}_{k-1|k-1}} =
\begin{bmatrix}
\frac{\partial h_{i1}}{\partial x_k} & \frac{\partial h_{i1}}{\partial y_k} & 0 & 0 & 0 \\
\frac{\partial h_{i2}}{\partial x_k} & \frac{\partial h_{i2}}{\partial y_k} & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{bmatrix} \tag{23.8}
\]
As we can see, the Jacobian matrices in equations (23.3), (23.4), and (23.8) are relatively sparse. This allows us to rewrite the filter equations in order to achieve an efficient implementation more suitable for wireless sensor nodes with limited processing power. To do so, we first divide the Jacobian matrix $F_k$ of equation (23.3) into four submatrices as follows. Let $F_A = I_2$, $F_B = [\, v \mid \delta I_2 \,]$, $F_C = 0_{3\times 2}$, and $F_D = I_3$, where $I_n$ is an $n$-dimensional identity matrix, $0_{n\times m}$ is an $n\times m$ zero matrix, and $v = (\dot{x}_k, \dot{y}_k)^T$. Then equation (23.3) becomes (note that we have dropped the discrete-time subscript to simplify the notation)
\[
F = \begin{bmatrix} F_{A(2\times 2)} & F_{B(2\times 3)} \\ F_{C(3\times 2)} & F_{D(3\times 3)} \end{bmatrix}
  = \begin{bmatrix} I_2 & [\, v \mid \delta I_2 \,] \\ 0_{3\times 2} & I_3 \end{bmatrix}
\]
Now let the state covariance matrix $P_{k|k-1}$ be represented by
\[
P = \begin{bmatrix} P_{A(2\times 2)} & P_{B(2\times 3)} \\ P_{B(3\times 2)}^T & P_{D(3\times 3)} \end{bmatrix}
\]
Then the first term on the right side of equation (23.2) can be rewritten as
\[
F P F^T = U = \begin{bmatrix} U_A & U_B \\ U_B^T & U_D \end{bmatrix}
= \begin{bmatrix} P_A + F_B P_B^T + (F_B P_B^T)^T + F_B P_D F_B^T & \;\; P_B + F_B P_D \\ (P_B + F_B P_D)^T & P_D \end{bmatrix}
\]
Now, if we further subdivide the matrices $P_B$ and $P_D$ into
\[
P_B = \begin{bmatrix} P_{BA(2\times 1)} & P_{BB(2\times 2)} \end{bmatrix}
\qquad \text{and} \qquad
P_D = \begin{bmatrix} p_{DA(1\times 1)} & P_{DB(1\times 2)} \\ P_{DB(2\times 1)}^T & P_{DD(2\times 2)} \end{bmatrix}
\]
we have
\[
U_A = P_A + U_{BA}\, v^T + \delta\, U_{BB} + v\, P_{BA}^T + \delta\, P_{BB}^T
\]
\[
U_B = \begin{bmatrix} U_{BA} & U_{BB} \end{bmatrix}
    = \begin{bmatrix} P_{BA} + v\, p_{DA} + \delta\, P_{DB}^T & \;\; P_{BB} + v\, P_{DB} + \delta\, P_{DD} \end{bmatrix}
\]
\[
U_D = P_D
\]
Similarly, let the second term on the right-hand side of equation (23.2) be represented by
\[
W Q W^T = V = \begin{bmatrix} V_A & V_B \\ V_B^T & V_D \end{bmatrix}
\]
and let
\[
Q = \begin{bmatrix} Q_{xy(2\times 2)} & 0 \\ 0 & q_t \end{bmatrix}
\]
Then we have
\[
V_A = \frac{\delta^4}{4}\, Q_{xy}
\qquad
V_B = \begin{bmatrix} V_{BA} & V_{BB} \end{bmatrix} = \begin{bmatrix} 0_{(2\times 1)} & \;\; \frac{\delta^3}{2}\, Q_{xy} \end{bmatrix}
\qquad
V_D = \begin{bmatrix} v_{DA} & V_{DB} \\ V_{DB}^T & V_{DD} \end{bmatrix} = \begin{bmatrix} q_t & 0_{(1\times 2)} \\ 0_{(2\times 1)} & \delta^2 Q_{xy} \end{bmatrix}
\]
Finally, the covariance update equation (23.2) becomes
\[
P_{k|k-1} = U_{k-1|k-1} + V_{k-1|k-1} =
\begin{bmatrix}
U_A + V_A & U_{BA} & U_{BB} + V_{BB} \\
U_{BA}^T & p_{DA} + v_{DA} & P_{DB} \\
U_{BB}^T + V_{BB}^T & P_{DB}^T & P_{DD} + V_{DD}
\end{bmatrix}_{k-1|k-1}
\]
Let us now turn our attention to the measurement update equations. Following a similar derivation, the Jacobian matrix $H_k$ of equation (23.8) can be divided into
\[
H = \begin{bmatrix} H_{A(2\times 2)} & H_{B(2\times 3)} \\ H_{C(1\times 2)} & H_{D(1\times 3)} \end{bmatrix}
  = \begin{bmatrix} H_{A(2\times 2)} & 0_{(2\times 3)} \\ 0_{(1\times 2)} & \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \end{bmatrix}
\]
Then, since $H_B = 0_{(2\times 3)}$ and $H_C = 0_{(1\times 2)}$, the first term inside the parentheses on the right-hand side of equation (23.5) becomes
\[
H P H^T = \begin{bmatrix} H_A P_A H_A^T & H_A P_{BA} \\ (H_A P_{BA})^T & p_{DA} \end{bmatrix}
\]
Let the covariance matrix of the measurement noise be represented by
\[
R = \begin{bmatrix} R_{A(2\times 2)} & 0 \\ 0 & r_B \end{bmatrix}
\]
Then it follows from the Schur complement that
\[
(H P H^T + R)^{-1} = M =
\begin{bmatrix}
S & -\dfrac{S H_A P_{BA}}{d} \\[2ex]
-\dfrac{\left( S H_A P_{BA} \right)^T}{d} & \dfrac{P_{BA}^T H_A^T S H_A P_{BA} + d}{d^2}
\end{bmatrix}
= \begin{bmatrix} M_A & M_B \\ M_B^T & M_D \end{bmatrix}
\]
where $d = r_B + p_{DA}$, and
\[
S_{(2\times 2)} = \left( H_A P_A H_A^T + R_A - \frac{H_A P_{BA} P_{BA}^T H_A^T}{d} \right)^{-1}
\]
Finally, the equation for the Kalman gain, equation (23.5), becomes
\[
K = P H^T M =
\begin{bmatrix}
P_A H_A^T & P_{BA} \\
P_{BA}^T H_A^T & p_{DA} \\
P_{BB}^T H_A^T & P_{DB}^T
\end{bmatrix}
\begin{bmatrix} M_A & M_B \\ M_B^T & M_D \end{bmatrix}
\]
It is important to note that in the equations just given all operations are carried out on small matrices. Also, many elements used in each step of the computations are reused in later steps, so it is possible to temporarily store them and reuse them later, saving computation time.
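To make the filter concrete, the following NumPy sketch implements the prediction and update steps for the model above. It follows the equations directly using dense matrices and therefore does not exploit the block-sparse rewriting derived in this section, which is what makes the filter practical on a mote; the measurement Jacobian is computed analytically from the homography and linearized at the predicted state. Function names are ours.

import numpy as np

def predict(x, P, Q):
    """EKF time update for the 5D state [x, y, delta, xdot, ydot] (eqs. 23.1-23.2)."""
    px, py, d, vx, vy = x
    # f(x, 0): noise-free propagation; delta stays in the state and would be
    # refreshed from the measurement time stamps in practice.
    x_pred = np.array([px + d * vx, py + d * vy, d, vx, vy])
    F = np.array([[1, 0, vx, d, 0],
                  [0, 1, vy, 0, d],
                  [0, 0, 1,  0, 0],
                  [0, 0, 0,  1, 0],
                  [0, 0, 0,  0, 1]], dtype=float)         # eq. (23.3)
    W = np.array([[d**2 / 2, 0,        0],
                  [0,        d**2 / 2, 0],
                  [0,        0,        1],
                  [d,        0,        0],
                  [0,        d,        0]], dtype=float)  # eq. (23.4)
    return x_pred, F @ P @ F.T + W @ Q @ W.T

def measure(x, H):
    """h_i(x): project the ground-plane position through homography H (floor to image)."""
    u, v, w = H @ np.array([x[0], x[1], 1.0])
    return np.array([u / w, v / w, x[2]])

def measurement_jacobian(x, H):
    """Analytic Jacobian of h_i with respect to the state (eq. 23.8)."""
    px, py = x[0], x[1]
    num_u = H[0, 0] * px + H[0, 1] * py + H[0, 2]
    num_v = H[1, 0] * px + H[1, 1] * py + H[1, 2]
    den   = H[2, 0] * px + H[2, 1] * py + H[2, 2]
    J = np.zeros((3, 5))
    J[0, 0] = (H[0, 0] * den - num_u * H[2, 0]) / den**2
    J[0, 1] = (H[0, 1] * den - num_u * H[2, 1]) / den**2
    J[1, 0] = (H[1, 0] * den - num_v * H[2, 0]) / den**2
    J[1, 1] = (H[1, 1] * den - num_v * H[2, 1]) / den**2
    J[2, 2] = 1.0
    return J

def update(x_pred, P_pred, z, H, R):
    """EKF measurement update (eqs. 23.5-23.7)."""
    Hk = measurement_jacobian(x_pred, H)
    S = Hk @ P_pred @ Hk.T + R
    K = P_pred @ Hk.T @ np.linalg.inv(S)             # eq. (23.5)
    x_new = x_pred + K @ (z - measure(x_pred, H))    # eq. (23.6)
    P_new = (np.eye(5) - K @ Hk) @ P_pred            # eq. (23.7)
    return x_new, P_new

In the cluster-based system, the cluster head would call predict once per time step and update once per received measurement, selecting the homography of the camera that produced that measurement.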
23.4.2 State Estimation
Algorithm 23.1 summarizes the state estimation algorithm that runs at each node of the wireless camera network. As already mentioned, we use our clustering protocol as the underlying framework for the Kalman filter implementation. Therefore, the algorithm is initialized when a camera joins a cluster (either as a cluster member or as a cluster head) and is terminated when the camera leaves the cluster. The algorithm thus runs only while a camera is actively tracking a target. After the target leaves the camera’s FOV, it may switch to an energy-saving mode that periodically observes the environment to detect the presence of new targets. In that sense, the initialization step presented in Algorithm 23.1 is a local intra-cluster process that prepares the cameras to track a specific target, as opposed to the network-wide system initialization procedure described in Section 23.4.3.
Algorithm 23.1: Cluster-Based Kalman Filter
Initialization
    if cluster_head = local_ID
        initialize target state
        broadcast time stamp
Maintenance
    new local measurement available:
        if cluster_head = local_ID
            estimate target state
        else
            send measurement to cluster head
    measurement received:
        estimate target position
    sample period elapsed:
        if cluster_head = local_ID
            send current estimated state to the user
Termination
    if cluster_head = local_ID
        send current estimated state to new cluster head
The initialization in the state estimation algorithm takes place after cluster formation is concluded, and its main goals are to initialize the Kalman filter and synchronize the cluster members so that they can estimate the target state consistently. To synchronize the cluster members, the newly elected cluster head broadcasts a message to its cluster members informing them of its current time. The cluster members then synchronize their internal clocks to the time received from the cluster head and time-stamp any future measurements based on that time. It is important to note that neither clock drifts between the cluster head and its cluster members, nor potential message delays (caused by the underlying algorithm for medium access control or by message queuing), are taken into consideration in this approach. Therefore, this time synchronization is inherently not very accurate. This is why we modeled the uncertainty in the measurement times in the state estimation equations, as explained in Section 23.4.1. Our synchronization approach is similar to the post-facto synchronization using undisciplined clocks [42], except that the source of the synchronization message is the cluster head rather than a third-party node. Thus, this method provides only a locally consistent time suitable for the Kalman filter to operate. To achieve global time synchronization for the nodes, the time stamp can be propagated along with the clusters using a synchronization protocol such as TPSN [43]. Evidently, this method does not provide a time stamp consistent with any external reference (such as the base station time), so it is only possible to know the position of the target at a given time with respect to the instant it is initially detected. To obtain a global reference consistent with an external clock, a more complex time synchronization algorithm is required. Some of the existing options are surveyed in Sivrikaya and Yener [44].
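The offset-based stamping described above amounts to very little code; the sketch below (Python, with time.monotonic standing in for the mote's local clock) records the head's broadcast time and stamps subsequent measurements in that time base:

import time

class ClusterClock:
    """Sketch of the loose, cluster-local time synchronization described above.

    The cluster head broadcasts its current time once, at cluster formation;
    each member records the offset between that value and its own clock and
    stamps all subsequent measurements in the head's time base. Propagation
    and queuing delays are ignored, which is why the filter models the
    measurement-time uncertainty explicitly.
    """

    def __init__(self):
        self.offset = 0.0

    def on_time_stamp_broadcast(self, head_time):
        # Offset between the head's clock and the local clock at reception.
        self.offset = head_time - time.monotonic()

    def stamp(self):
        # Time of a local measurement expressed in the cluster head's time base.
        return time.monotonic() + self.offset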
In our approach, the cluster head is responsible for estimating the state of the target and reporting it to the base station. Therefore, whenever a cluster member acquires a new measurement, it sends it to the cluster head, which updates the estimated target state based on that measurement. Similarly, when a cluster head acquires a new local measurement, it updates the state of the target based on it. However, since the base station may be multiple hops away from the cluster, transmitting every new estimate to the user can result in increased energy consumption. Therefore, the cluster head sends information to the base station at a predefined rate. The cluster head is responsible for keeping track of the state of the target, so, as the cluster propagates and a new cluster head is chosen, it is necessary to hand off the state information to the new cluster head. As explained in the Cluster Propagation subsection of Section 23.3.2, the clustering protocol allows information about the state of the target to be carried by the cluster as new cluster heads are assigned. Therefore, the Kalman filter algorithm only has to piggy-back a message containing the target state onto the cluster head reassignment message. After the new cluster head is assigned, it continues state estimation. The simplicity of Algorithm 23.1 is only possible because of the underlying clustering protocol that handles all aspects of distributed data collection and information hand-off among different cameras. That is, after a cluster is created, while all the nodes within the cluster are responsible for collecting measurements, only the cluster head is responsible for estimating the target state via Kalman filtering. As the cluster head leaves the cluster, any of the previous cluster members may become the new cluster head and continue the state estimation.
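A sketch of the piggy-backed hand-off message might look as follows; the dictionary merely stands in for the actual packet format used by the protocol:

def head_reassignment_message(candidate_list, x_est, P_est):
    """Sketch of the hand-off performed when the cluster head changes.

    The departing head piggy-backs the current Kalman filter state (estimate
    and covariance) onto the cluster head reassignment message, so the new
    head can continue the estimation without reinitializing the filter.
    """
    return {
        "type": "reassign_head",
        "candidates": list(candidate_list),   # sorted head candidates
        "target_state": [float(v) for v in x_est],
        "target_covariance": [[float(v) for v in row] for row in P_est],
    }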
23.4.3 System Initialization
As we observe in the Kalman filter equations presented in Section 23.4.1, the cluster head needs to know the homographies between the cluster members and the xy plane in order to estimate the position of the target based on the measurements acquired by the cluster members. Therefore, to avoid transmitting large amounts of data while tracking the object, when the system is initialized each node stores the homographies of its one-hop neighbors (i.e., its potential cluster members). The system initialization works as follows. When a new camera joins the network, it broadcasts its own homography to its one-hop neighbors. When a camera receives a homography from a new neighboring camera, it stores it and replies by sending its own homography. Even though this procedure can take O(n²) steps, where n is the number of nodes in the network, it is a local algorithm, meaning that no information need be broadcast beyond a single-hop neighborhood. If we assume, then, that m is the average number of nodes in a single-hop region, the algorithm terminates in expected O(m²) iterations. Since we can assume that m ≪ n for a wide-area camera network, the algorithm is feasible in practice.
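The initialization exchange can be sketched as follows (Python; the broadcast callback stands in for the one-hop radio broadcast):

class HomographyDirectory:
    """Sketch of the one-hop homography exchange run at system start-up.

    Each camera broadcasts its own floor-to-image homography to its one-hop
    neighbors and stores the ones it hears, so a future cluster head already
    holds the homographies of every potential cluster member.
    """

    def __init__(self, local_id, local_H, broadcast):
        self.local_id = local_id
        self.local_H = local_H
        self.known = {local_id: local_H}
        self.broadcast = broadcast            # callback standing in for the radio
        self.broadcast(("homography", local_id, local_H))

    def on_homography(self, sender_id, H):
        if sender_id not in self.known:
            self.known[sender_id] = H
            # Reply with our own homography so the new neighbor learns it too.
            self.broadcast(("homography", self.local_id, self.local_H))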
23.5 EXPERIMENTAL RESULTS
To illustrate some of the trade-offs involved in employing a cluster-based Kalman filter for object tracking, we carried out a number of experiments in a simulated multi-hop network. However, although simulations allow us to analyze the global behavior of the network, in wireless camera networks, it is also necessary to verify that the algorithms are indeed effective under the severe resource constraints imposed by the actual cameras. To illustrate the performance of the system in practice, we therefore implemented the object-tracking system in a real network of wireless cameras.
23.5.1 Simulator Environment
To carry out our simulated experiments, we implemented a camera network simulator that provides the pixel coordinates of a moving target based on each camera’s calibration matrix. Figures 23.8(a) and (b) show two views of the simulator’s graphical user interface. The simulator creates a target that moves randomly or follows a predefined trajectory in the xy plane in the world frame. The simulated cameras operate independently and the data generated by each camera is output via a transmission control protocol (TCP) connection to an individual node in a sensor network simulation using the Avrora [45] simulator for the AVR family of microcontrollers used in the Mica [46] motes. Avrora is capable of performing simulations with high accuracy on code natively generated for AVR microcontrollers. It also provides a free-space radio model that allows for simulation of wireless communications in sensor networks. Our Kalman filter code and the clustering protocol, both implemented in nesC [47] and running under the TinyOS operating system [48], were executed directly in Avrora.
Simulation Results
We used our simulation environment to evaluate the performance of the proposed cluster-based Kalman filter for object tracking. Figure 23.8 shows the configuration of 15 wireless
FIGURE 23.8 GUI showing a configuration of 15 wireless cameras. (a) The tetrahedral volumes represent the cameras’ FOVs. (b) The lines connecting pairs of cameras indicate communication links between camera nodes.
cameras used in the experiments, all of which were randomly placed on the top plane of a cuboid volume with the dimensions of 50×50×5 m. Each camera node was assumed to have a communication range of 18 m in all directions and a view angle of 120°. Each line between a pair of cameras shown in Figure 23.8 indicates that the two cameras can communicate directly (i.e., they are one-hop neighbors), and each pyramid represents the viewing volume of the camera. It was assumed that all cameras had been fully calibrated (both intrinsic and extrinsic parameters of the cameras were known). Additionally, we assumed that all objects move on the floor (the bottom plane of the working environment), which allows each camera to compute the 2D world coordinates of the object given its image coordinates using the homography relating the camera plane and the floor. Based on the results of the experiments in Medeiros et al. [19], we set the clustering timeout to 300 ms. This value provided an adequate balance between the effectiveness of cluster formation and the speed with which a cluster is able to follow a target.
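Since the simulated cameras are calibrated with respect to the floor, mapping a detection back to world coordinates is a single homography inversion. A minimal sketch, assuming H maps floor coordinates to image coordinates:

import numpy as np

def image_to_floor(u, v, H_floor_to_image):
    """Map a pixel (u, v) to 2D world coordinates on the floor plane.

    H_floor_to_image is the 3x3 homography from floor coordinates to the
    image plane; its inverse maps detected pixel coordinates back to
    metric floor coordinates.
    """
    H_inv = np.linalg.inv(H_floor_to_image)
    x, y, w = H_inv @ np.array([u, v, 1.0])
    return x / w, y / w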
Estimation Accuracy
To evaluate the accuracy of the cluster-based Kalman filter, we introduced Gaussian random noise into the camera measurements and computed the root mean squared error of the system’s estimates. We then compared the performance of our cluster-based Kalman filter with the performance of a centralized tracking method. In the centralized approach, we transmitted all of the data collected by the network to the base station, which applied a centralized Kalman filter to the data. Figure 23.9(a) shows plots of the root mean squared error of the unfiltered data, of our distributed Kalman filter, and that of the centralized Kalman filter as a function of the standard deviation of the measurement noise. As can be seen, our algorithm substantially reduces the error in the unfiltered data. Although the centralized Kalman filter provides more accurate results, it requires the transmission of all data to the base station. Note that this centralized tracker is an idealized concept since every message generated by the motes is processed. It does not consider the message drops that would occur in a real centralized tracker as the messages are routed to the base station. These message drops are difficult to quantify, however, since they depend on the network topology and the distance between the base station and the nodes detecting the event. We also compared the accuracy of our cluster-based Kalman filter to that of an alternative decentralized tracking method. In the alternative method, we used linear interpolation for local data aggregation. That is, the target position was periodically estimated by linearly interpolating the two most recent measurements available to the cluster head. The results of the experiment are shown in Figure 23.9(b). The noisy nature of the data causes the performance of linear interpolation to degrade significantly as the standard deviation of the pixel error increases. As the experiment shows, even for small pixel error our algorithm significantly reduces the total error when compared to local data aggregation using linear interpolation. This performance gain becomes larger as the pixel error increases.
Average Number of Messages Transmitted
To evaluate the potential energy savings obtained by restricting the number of multi-hop messages transmitted, we measured the average number of messages transmitted per
FIGURE 23.9 (a) Root mean squared error of the target position as a function of the standard deviation of the pixel noise. (b) Performance of the cluster-based algorithm compared to a decentralized tracker using linear interpolation.
node per minute in our simulator using our Kalman filter algorithm. We also measured the average number of messages needed to transmit all of the information to the base station, which is required by the centralized Kalman filter approach. Figure 23.10(a) shows the number of messages transmitted as a function of the average distance to the base station. To estimate the number of messages required to reach the base station, we multiplied the number of messages by the average distance to the base station. The results of the experiment show that, for networks of small diameter where the average distance to the base station is small, because of the overhead introduced by clustering, the number of messages transmitted by our algorithm is higher than if all the data were transmitted to the base station. However, as the average distance to the base station grows, the number of messages transmitted in the centralized system increases, whereas the clustering overhead remains constant. Eventually, a threshold is reached where sending all messages becomes more expensive than creating the clusters. This threshold depends on the sampling period of the cluster-based Kalman filter. For example, for a sampling period of 750 ms, the cluster-based approach performs better than the centralized approach when the average distance to the base station is larger than two hops.
Model Error
As explained in Section 23.4.1, we modeled the movement of our target as a constant velocity movement with random acceleration. The more closely this movement resembles our model, the better the estimates obtained. This is illustrated in Figure 23.10(b). The two top plots show the x and y positions of the target as a function of time. The bottom plot shows the corresponding error in target position after applying the Kalman filter. As the figure shows, at the time instants when the target undergoes abrupt changes in direction, the error is larger because the constant velocity model is not valid.
FIGURE 23.10 (a) Number of messages transmitted by the system as a function of the average distance to the base station. (b) Estimation error in the target position for varying movement. Solid curves represent the ground truth; superimposed markers, the filtered data.
On the other hand, the algorithm accurately tracks the target as long as its movement is smooth. This is illustrated in Figure 23.11, where we plot the x and y positions of a target moving with constant velocity in the y direction and with varying acceleration in the x direction. The bottom plots show the corresponding root mean squared error with respect to the ground truth. As we can see, although the acceleration in each plot is different, the average error remains approximately constant. Overall, the total error obtained with the decentralized Kalman filter is comparable to the error obtained using the centralized Kalman filter, as qualitatively illustrated in Figure 23.12. Figure 23.12(a) shows the x and y coordinates of the target when tracked
FIGURE 23.11 Position of the target as a function of time for different values of acceleration. Solid curves represent the ground truth; superimposed markers, the filtered data.
FIGURE 23.12 Tracking results for (a) centralized Kalman filter and (b) decentralized Kalman filter. The solid black curves represent the ground truth of the target trajectory. The superimposed curves with markers represent the target’s estimated positions.
by the centralized Kalman filter, and Figure 23.12(b) shows the corresponding coordinates when the target is tracked by the decentralized Kalman filter. As the figure shows, however, delays introduced by the clustering protocol cause our decentralized algorithm to occasionally lose track of the target while the sensor network is engaged in cluster formation. In the example presented in Figure 23.12, if we consider any two measurements more than five seconds apart as defining a region not covered by the tracker, the centralized tracker follows the target over approximately 95 percent of the total distance traveled, whereas the cluster-based version covers approximately 77 percent.
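A minimal sketch of how such a coverage figure can be computed from time-stamped position estimates is given below; the data layout and the dense ground-truth sampling are assumptions, and only the five-second gap rule comes from the evaluation described above.

import math

def tracked_fraction(estimates, ground_truth, max_gap_s=5.0):
    """Fraction of the ground-truth distance covered by the tracker.

    estimates:    list of (t, x, y) produced by the tracker, sorted by time
    ground_truth: list of (t, x, y) sampled densely along the true trajectory
    Any interval between consecutive estimates longer than max_gap_s is
    treated as a region not covered by the tracker.
    """
    def path_length(points):
        return sum(math.hypot(b[1] - a[1], b[2] - a[2])
                   for a, b in zip(points, points[1:]))

    total = path_length(ground_truth)
    covered = 0.0
    for (t0, _, _), (t1, _, _) in zip(estimates, estimates[1:]):
        if t1 - t0 <= max_gap_s:
            # Credit the ground-truth path length traversed between the two estimates.
            segment = [p for p in ground_truth if t0 <= p[0] <= t1]
            covered += path_length(segment)
    return covered / total if total > 0 else 0.0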
Computation Time
Because the Avrora simulator provides instruction-level accuracy at the actual clock frequency of the microcontroller, it was possible to compute precisely the time it takes to perform each step of our algorithm. In our current implementation, the Kalman filter update requires 15 ms, whereas the prediction takes 4.5 ms. The very small variance in these measurements is due only to the time required to service the interrupt routines, which is on the order of microseconds.
23.5.2 Testbed Implementation
We deployed a network consisting of 12 Cyclops cameras [49] attached to MicaZ motes mounted on the ceiling of our laboratory. The cameras were spaced roughly 1 m (39 in) apart to provide partial overlap between neighboring cameras' fields of view. Together, the cameras covered a region of about 5×3.5 m (16.4×11.5 ft). Figure 23.13 is a picture of our wireless camera network. The cameras were calibrated by computing planar homographies between the floor of the laboratory and the camera planes. Thus, as the object to be tracked moves on the floor, each camera that sees it can compute the coordinates of its image centroid with respect to the world coordinate frame. Since the focus of our work is object tracking rather than object detection and identification, we used only simple objects in our tracking experiments. For such objects, because detection is carried out by thresholding the color histogram, our list of object features simply consists of flags that indicate whether an object matches a given histogram. (More robust algorithms, such as that of Lau et al. [50], could be used to achieve similar tracking performance while allowing cameras to dynamically assign identifiers to the objects being tracked.) During collaborative processing, cluster members share information about the target, and this information is carried by the clusters as they propagate. To implement this behavior, the cameras within a cluster share an object identifier defined simply as the numerical ID of the first camera that detects the target. This information, too, is carried by the clusters as they propagate during object tracking.
FIGURE 23.13 Ceiling-mounted wireless cameras for the testbed.
Whenever this information is lost—for instance, if cluster propagation fails and a new cluster is created to track the object—the network loses the previous information about the target, and a new object identifier is created by the next camera that detects it. Note that the numerical ID is only a means for us to visualize the propagation of the actual object parameters estimated by the Kalman filter.
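For reference, mapping a detected image centroid to floor coordinates with the planar homography computed during the calibration step described above amounts to a projective transformation. The sketch below is illustrative only; the matrix H is a placeholder, not an actual calibration result from our testbed.

import numpy as np

def image_to_floor(u, v, H_floor_to_image):
    """Map an image centroid (u, v) to world coordinates on the floor plane.

    H_floor_to_image is the 3x3 homography from floor coordinates to image
    coordinates obtained during calibration; we invert it to go the other way.
    """
    H_inv = np.linalg.inv(H_floor_to_image)
    p = H_inv @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]        # dehomogenize

# Example with a placeholder homography (roughly 100 pixels per metre, offset 50 pixels).
H = np.array([[100.0,   0.0, 50.0],
              [  0.0, 100.0, 50.0],
              [  0.0,   0.0,  1.0]])
print(image_to_floor(250.0, 250.0, H))      # -> (2.0, 2.0), i.e. 2 m x 2 m on the floor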
Experiments on Real Wireless Cameras
To evaluate the performance of the system while tracking an object, we moved the object randomly and, at the same time, computed the target coordinates using both the wireless camera network and a single wired camera operating at 30 frames per second. The data gathered by the wired camera served as the ground truth. Figure 23.14 shows the trajectory of the object for three different runs of the experiment. The ground truth is represented by the solid lines; the dashed lines show the target trajectory as computed by the wireless cameras. The markers placed on the dashed tracks correspond to the actual target positions computed. We show both the dashed trajectory and the markers to give the reader a sense of when the system loses track of the target. Each time that happens, the system creates a new object identifier for the moving target, and the target state is reinitialized from the most recent measurement. This is illustrated by the different markers in Figure 23.14.
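One simple way to score such runs against the wired-camera ground truth—a sketch under the assumption that the wired samples are uniformly spaced at the camera's frame rate—is to pair each wireless estimate with the nearest-in-time ground-truth sample and compute the RMSE:

import math

def rmse_vs_ground_truth(wireless, wired, fps=30.0):
    """RMSE of wireless estimates against the wired-camera ground truth.

    wireless: list of (t, x, y) estimates from the camera network
    wired:    list of (x, y) positions sampled by the wired camera at `fps`,
              with sample i taken at time i / fps
    Each wireless estimate is compared against the nearest-in-time wired sample.
    """
    errors = []
    for t, x, y in wireless:
        i = min(int(round(t * fps)), len(wired) - 1)
        gx, gy = wired[i]
        errors.append((x - gx) ** 2 + (y - gy) ** 2)
    return math.sqrt(sum(errors) / len(errors)) if errors else float("nan")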
FIGURE 23.14 Tracking performance for three runs of the tracking experiment.
23.6 CONCLUSIONS AND FUTURE WORK
We presented a lightweight event-driven clustering protocol for wireless cameras. As is well recognized, clustering is critical to energy-efficient collaborative processing in sensor networks. Any clustering protocol must address the issues of cluster formation, propagation, coalescence, fragmentation, extinction, and interaction among multiple clusters. We believe that, because cameras are directional devices, multiple cluster formation and coalescence are especially important for wireless camera networks. Our protocol addresses all the clustering phases in a single coherent framework.
We also presented an object-tracking algorithm suitable for a network of wireless cameras. Our algorithm has a low message overhead and can be executed in real time even on resource-constrained wireless sensors such as MicaZ motes. It represents an effective approach to local data aggregation in the context of WCN object tracking. The algorithm uses the clustering protocol to establish connections among cameras that detect the target and to enable the propagation of the target state as it moves. Data aggregation is carried out using the decentralized Kalman filter.
Regarding the accuracy of our filter, our experiments show that the algorithm is indeed capable of tracking a target with accuracy comparable to that achieved by a centralized approach in which every measurement is transmitted to the base station. Wan and van der Merwe [51, 52] reported that the extended Kalman filter can introduce large errors in the estimated parameters and that the unscented Kalman filter (UKF) [53] usually provides better results. However, we found that our approach substantially decreases the noise in the target position and that filter instability is not an issue in our application. In addition, a straightforward implementation of the UKF is computationally expensive, although a more efficient UKF implementation may be possible if we exploit, as we did for the EKF, the sparsity of the matrices involved in the problem. Devising an efficient implementation of a cluster-based UKF for object tracking in wireless camera networks and comparing its performance with that of our cluster-based EKF is a subject of further investigation.
Irrespective of the kind of Kalman filter employed—either the EKF or the UKF—it should be possible to further reduce the overall energy consumption of the proposed system by quantizing the uncertainty matrix of the target state before transmitting it to the next cluster head during cluster head reassignment. To investigate this issue, we can quantize each element of the covariance matrix or hand off only a subset of its elements, such as its trace (a minimal sketch of these two options appears at the end of this section). It is important to evaluate how the accuracy of the filter degrades with different quantization approaches and as the amount of quantization increases.
So far we have focused on the problem of a single cluster tracking a single object. The issues of multiple clusters tracking the same object and the inter-cluster interactions involved in that process, as well as estimating the state of multiple objects simultaneously, are subjects of future studies. Specifically, it is necessary to find effective methods to aggregate the information produced by different clusters tracking the same target.
Then it is necessary to evaluate how multiple clusters interact with one another and how these interactions affect the overall number of messages transmitted in the network, as well as the estimation accuracy of the target position. These issues are of great importance in practical applications when camera arrangements other than the simple ceiling-mounted setup are used.
Our protocol assumes that all cameras that can see the target join a cluster. Nonetheless, it is possible to extend it so that, after a cluster is formed, the cluster head may choose which cameras it will collaborate with using certain camera selection criteria based on how well a camera sees a target [54, 55]. Finally, it is necessary to devise effective camera calibration systems for wireless camera networks. It may be possible to employ the clustering protocol in this process so that the cluster head, based on its own measurements and the measurements received from the cluster members, can simultaneously track an object and estimate its own camera parameters as well as those of each of its cluster members. When a node leaves the cluster, it receives the latest estimated camera parameters from the cluster head [56]. It is also important to keep in mind that these systems must be lightweight so that the computations involved can be carried out by resource-constrained wireless cameras.
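As a concrete illustration of the covariance quantization idea raised earlier in this section, the sketch below shows the two options mentioned: element-wise quantization and a trace-only hand-off. It is a sketch only; the step size and the isotropic reinitialization are assumptions, not part of the implemented protocol.

import numpy as np

def quantize_covariance(P, step=0.25):
    """Uniformly quantize each element of the state covariance before hand-off.
    Exploiting symmetry, only the upper triangle would need to be transmitted."""
    return np.round(P / step) * step

def trace_only_handoff(P):
    """Cheapest variant: hand off a single scalar and let the new cluster head
    reinitialize its covariance as an isotropic matrix with the same trace."""
    n = P.shape[0]
    return (np.trace(P) / n) * np.eye(n)

P = np.array([[2.3, 0.4],
              [0.4, 1.1]])
print(quantize_covariance(P))
print(trace_only_handoff(P))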
REFERENCES
[1] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, E. Cayirci, A survey on sensor networks, IEEE Communications Magazine 40 (8) (2002) 102–114.
[2] I.F. Akyildiz, T. Melodia, K.R. Chowdhury, A survey on wireless multimedia sensor networks, Computer Networks 51 (4) (2007) 921–960.
[3] C. Intanagonwiwat, D. Estrin, R. Govindan, J. Heidemann, Impact of network density on data aggregation in wireless sensor networks, in: Proceedings of the International Conference on Distributed Computing Systems, 2002.
[4] B. Krishnamachari, D. Estrin, S. Wicker, The impact of data aggregation in wireless sensor networks, in: Proceedings of the International Workshop on Distributed Event-Based Systems, 2002.
[5] K.-W. Fan, S. Liu, P. Sinha, Scalable data aggregation for dynamic events in sensor networks, in: Proceedings of the Fourth International Conference on Embedded Networked Sensor Systems, 2006.
[6] S. Bandyopadhyay, E. Coyle, An energy efficient hierarchical clustering algorithm for wireless sensor networks, in: Proceedings of IEEE INFOCOM, 2003.
[7] W.B. Heinzelman, A.P. Chandrakasan, H. Balakrishnan, An application-specific protocol architecture for wireless microsensor networks, IEEE Transactions on Wireless Communications 1 (4) (2002) 660–670.
[8] O. Younis, S. Fahmy, HEED: A hybrid, energy-efficient, distributed clustering approach for ad hoc sensor networks, IEEE Transactions on Mobile Computing 3 (4) (2004) 366–379.
[9] I. Gupta, D. Riordan, S. Sampalli, Cluster-head election using fuzzy logic for wireless sensor networks, in: Proceedings of the Third Annual Communication Networks and Services Research Conference, 2005.
[10] V. Mhatre, C. Rosenberg, D. Kofman, R. Mazumdar, N. Shroff, A minimum cost heterogeneous sensor network with a lifetime constraint, IEEE Transactions on Mobile Computing 4 (1) (2005) 4–15.
[11] W.-P. Chen, J. Hou, L. Sha, Dynamic clustering for acoustic target tracking in wireless sensor networks, IEEE Transactions on Mobile Computing 3 (3) (2004).
[12] X.-H. Kuang, R. Feng, H.-H. Shao, A lightweight target-tracking scheme using wireless sensor network, Measurement Science and Technology 19 (2) (2008), Article 025104.
[13] C. Liu, J. Liu, J. Reich, P. Cheung, F. Zhao, Distributed group management for track initiation and maintenance in target localization applications, in: Proceedings of the IEEE International Workshop on Information Processing in Sensor Networks, 2003.
[14] Q. Fang, F. Zhao, L. Guibas, Lightweight sensing and communication protocols for target enumeration and aggregation, in: Proceedings of the ACM Symposium on Mobile Ad Hoc Networking and Computing, 2003.
[15] W. Zhang, G. Cao, DCTC: Dynamic convoy tree-based collaboration for target tracking in sensor networks, IEEE Transactions on Wireless Communications 3 (5) (2004) 1689–1701.
[16] F. Bouhafs, M. Merabti, H. Mokhtar, Mobile event monitoring protocol for wireless sensor networks, in: Proceedings of the Twenty-First International Conference on Advanced Information Networking and Applications Workshops, 2007.
[17] M. Demirbas, A. Arora, M.G. Gouda, A pursuer-evader game for sensor networks, in: Proceedings of the Sixth International Symposium on Self-Stabilizing Systems, 2003.
[18] Y.-C. Tseng, S.-P. Kuo, H.-W. Lee, C.-F. Huang, Location tracking in a wireless sensor network by mobile agents and its data fusion strategies, The Computer Journal 47 (4) (2004) 448–460.
[19] H. Medeiros, J. Park, A. Kak, A light-weight event-driven protocol for sensor clustering in wireless camera networks, in: Proceedings of the First IEEE/ACM International Conference on Distributed Smart Cameras, 2007.
[20] H. Medeiros, J. Park, A. Kak, Distributed object tracking using a cluster-based Kalman filter in wireless camera networks, IEEE Journal of Selected Topics in Signal Processing 2 (4) (2008) 448–463.
[21] Y. Xu, J. Winter, W.-C. Lee, Prediction-based strategies for energy saving in object tracking sensor networks, in: Proceedings of the IEEE International Conference on Mobile Data Management, 2004.
[22] W. Yang, Z. Fu, J.-H. Kim, M.-S. Park, An adaptive dynamic cluster-based protocol for target tracking in wireless sensor networks, in: Proceedings of the Joint Ninth Asia-Pacific Web Conference on Advances in Data and Web Management, and Eighth International Conference on Web-Age Information Management, 2007.
[23] C.-Y. Chong, F. Zhao, S. Mori, S. Kumar, Distributed tracking in wireless ad hoc sensor networks, in: Proceedings of the Sixth International Conference on Information Fusion, 2003.
[24] D. Reid, An algorithm for tracking multiple targets, IEEE Transactions on Automatic Control 24 (6) (1979) 843–854.
[25] B. Blum, P. Nagaraddi, A. Wood, T. Abdelzaher, S. Son, J. Stankovic, An entity maintenance and connection service for sensor networks, in: Proceedings of the First International Conference on Mobile Systems, Applications, and Services, 2003.
[26] T. He, S. Krishnamurthy, J.A. Stankovic, T. Abdelzaher, L. Luo, R. Stoleru, T. Yan, L. Gu, J. Hui, B. Krogh, Energy-efficient surveillance system using wireless sensor networks, in: Proceedings of the Second International Conference on Mobile Systems, Applications, and Services, 2004.
[27] X. Ji, H. Zha, J. Metzner, G. Kesidis, Dynamic cluster structure for object detection and tracking in wireless ad-hoc sensor networks, in: Proceedings of the IEEE International Conference on Communications, 2004.
[28] R.R. Brooks, C. Griffin, D. Friedlander, Self-organized distributed sensor network entity tracking, International Journal of High Performance Computing Applications 16 (3) (2002) 207–219.
[29] H. Yang, B. Sikdar, A protocol for tracking mobile targets using sensor networks, in: Proceedings of the First IEEE International Workshop on Sensor Network Protocols and Applications, 2003.
[30] Q. Wang, W.-P. Chen, R. Zheng, K. Lee, L. Sha, Acoustic target tracking using tiny wireless sensor devices, in: Proceedings of Information Processing in Sensor Networks, 2003.
[31] D. Estrin, R. Govindan, J. Heidemann, S. Kumar, Next century challenges: Scalable coordination in sensor networks, in: Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999.
[32] Y. Zou, K. Chakrabarty, Energy-aware target localization in wireless sensor networks, in: Proceedings of the First IEEE International Conference on Pervasive Computing and Communications, 2003.
[33] J.L. Speyer, Computation and transmission requirements for a decentralized linear-quadratic-Gaussian control problem, IEEE Transactions on Automatic Control 24 (1979) 266–269.
[34] R. Olfati-Saber, Distributed Kalman filter with embedded consensus filter, in: Proceedings of the Forty-Fourth IEEE Conference on Decision and Control, and the European Control Conference, 2005.
[35] E. Nettleton, H. Durrant-Whyte, S. Sukkarieh, A robust architecture for decentralised data fusion, in: Proceedings of the International Conference on Advanced Robotics, 2003.
[36] A. Ribeiro, G. Giannakis, S. Roumeliotis, SOI-KF: Distributed Kalman filtering with low-cost communications using the sign of innovations, IEEE Transactions on Signal Processing 54 (12) (2006) 4782–4795.
[37] S. Balasubramanian, S. Jayaweera, K. Namuduri, Energy-aware, collaborative tracking with ad-hoc wireless sensor networks, in: Proceedings of the IEEE Wireless Communications and Networking Conference, 2005.
[38] R. Goshorn, J. Goshorn, D. Goshorn, H. Aghajan, Architecture for cluster-based automated surveillance network for detecting and tracking multiple persons, in: Proceedings of the First IEEE/ACM International Conference on Distributed Smart Cameras, 2007.
[39] J. Kurose, K. Ross, Computer Networking: A Top-Down Approach Featuring the Internet, 3rd edition, Addison-Wesley, 2005.
[40] G. Welch, G. Bishop, An Introduction to the Kalman Filter, Technical report, University of North Carolina at Chapel Hill, 1995.
[41] Y. Bar-Shalom, X.-R. Li, Estimation and Tracking: Principles, Techniques, and Software, Artech House, 1993.
[42] J. Elson, D. Estrin, Time synchronization for wireless sensor networks, in: Proceedings of the Fifteenth International Parallel and Distributed Processing Symposium, 2001.
[43] S. Ganeriwal, R. Kumar, M.B. Srivastava, Timing-sync protocol for sensor networks, in: Proceedings of the First International Conference on Embedded Networked Sensor Systems, 2003.
[44] F. Sivrikaya, B. Yener, Time synchronization in sensor networks: A survey, IEEE Network 18 (4) (2004) 45–50.
[45] B.L. Titzer, D.K. Lee, J. Palsberg, Avrora: Scalable sensor network simulation with precise timing, in: Proceedings of the Fourth International Symposium on Information Processing in Sensor Networks, 2005.
[46] J.L. Hill, D.E. Culler, Mica: A wireless platform for deeply embedded networks, IEEE Micro 22 (6) (2002) 12–24.
[47] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, D. Culler, The nesC language: A holistic approach to networked embedded systems, in: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2003.
[48] P. Levis, S. Madden, J. Polastre, R. Szewczyk, K. Whitehouse, A. Woo, D. Gay, J. Hill, M. Welsh, E. Brewer, D. Culler, TinyOS: An Operating System for Sensor Networks, Springer, 2005.
[49] M. Rahimi, R. Baer, O.I. Iroezi, J.C. Garcia, J. Warrior, D. Estrin, M. Srivastava, Cyclops: In situ image sensing and interpretation in wireless sensor networks, in: Proceedings of the Third International Conference on Embedded Networked Sensor Systems, 2005.
[50] F. Lau, E. Oto, H. Aghajan, Color-based multiple agent tracking for wireless image sensor networks, in: Advanced Concepts for Intelligent Vision Systems, 2006.
[51] E.A. Wan, R. van der Merwe, The unscented Kalman filter for nonlinear estimation, in: Proceedings of the IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium, 2000.
[52] E.A. Wan, R. van der Merwe, The unscented Kalman filter, in: S. Haykin (Ed.), Kalman Filtering and Neural Networks, Chapter 7, Wiley, 2001.
[53] S.J. Julier, J.K. Uhlmann, A new extension of the Kalman filter to nonlinear systems, in: Proceedings of the Eleventh International Symposium on Aerospace/Defence Sensing, Simulation and Controls, 1997.
[54] J. Park, P.C. Bhat, A.C. Kak, A look-up table based approach for solving the camera selection problem in large camera networks, in: Proceedings of the Workshop on Distributed Smart Cameras, in conjunction with ACM SenSys, 2006.
[55] A.O. Ercan, D.B. Yang, A. El Gamal, Optimal placement and selection of camera network nodes for target localization, in: Proceedings of the International Conference on Distributed Computing in Sensor Systems, 2006.
[56] H. Medeiros, H. Iwaki, J. Park, Online distributed calibration of a large network of wireless cameras using dynamic clustering, in: Proceedings of the Second ACM/IEEE International Conference on Distributed Smart Cameras, 2008.
EPILOGUE
Outlook
Hamid Aghajan, Stanford University, United States of America
Andrea Cavallaro, Queen Mary University of London, United Kingdom

The chapters in this book offered a survey of the design principles, algorithms, and operation of multi-camera networks. The scope of presentation reflects a wide span of techniques in a field of rapid progress toward computer vision applications. However, we believe that much remains to be explored in this rich area, both in conceptual aspects of data fusion and interpretation and in novel applications that can bring the technology into widespread use. Four areas in particular call for further development:
■ System design and algorithm development under realistic constraints
■ Interface and middleware between multi-camera vision and higher-level modules
■ Datasets and performance evaluation metrics
■ Application development based on user and social acceptance criteria
The next sections discuss these areas.
SYSTEMS AND ALGORITHMS IN REALISTIC SCENARIOS
Multi-camera networks offer a number of distinct advantages over single-view computer vision: added coverage area, 3D reconstruction of scenes and events, improved confidence in event detection and description, and heterogeneous task assignment and role selection based on camera views. Although much existing work employs multi-camera networks to achieve any one of these gains, achieving them in combination requires a flexible and efficient data exchange and decision fusion mechanism. This calls for new approaches to system and algorithm design, with flexibility in data exchange and camera role selection based on observed content.
Such a design methodology also enables the use of a multi-camera network in multiple applications. One example is home applications, in which the camera network may serve as a security system when the occupants are out and switch to a communication or HCI system that activates desired services, such as lighting, assisted-living aids, or accident monitoring, when the occupants are in residence. Another area of development is protocols and standards that allow a camera network to cooperate with other sensors or user input devices to switch the application target based on context. Data exchange mechanisms for multimodal and heterogeneous data sources, and the incorporation of confidence levels into the metadata for improved fusion and decision making, deserve further attention.
FIGURE 1 System constraints that influence the design of multi-camera systems and vision algorithms: local versus centralized processing, communication bandwidth, latency of real-time results, resolution in image view and time, temporal alignment (synchronization), camera view overlaps and data redundancies, and data exchange methods.
Despite the various techniques developed for network calibration and localization, much work remains on automated discovery of the deployment environment, definition of sensitive areas in monitoring applications, and adaptation to camera movements and other environmental changes. Today's calibration algorithms are often based on a predefined number of observations, and their results tend to rest on a single, stand-alone short observation cycle. Dynamic learning of the viewed environment based on observations accumulated over time, as well as progressive confidence-building mechanisms, needs further development. Also, the role of user intervention in closing the feedback loop for result validation has been little explored. A related, also little explored, area is engaging the user in interaction with the system and using observed explicit or implicit user input to guide vision processing. Such interaction can, for example, add certainty to an uncertain inference or lead to the discovery of an important element in the observation field (a map of the scene, say), or it may uncover user preferences in a system that provides multimedia and environment control (e.g., lighting).
Much of the operation of a real-time multi-camera network occurs under system constraints such as processing power and data communication levels. Data acquisition and communication hardware is selected based on observation load (frame rate, resolution, number of views) and target performance levels (accuracy, latency, access to source data or metadata). In smart camera networks, for example, a trade-off exists between the local processing power available at the cameras and the total bandwidth available for data exchange with a central processing and data fusion unit. The handling of system and operation constraints often profoundly influences the algorithm design cycle, calling for algorithms different from traditional vision methods. These algorithms may depend on the exchange of low-level features between cameras before a high-level deduction is made from any of the observed video sequences. The delay introduced by the data exchange operation may call for a modification of the data acquisition load, as may the local processing power available for low-level single-view feature extraction. Figure 1 illustrates the interactions between algorithm design in the vision module and system-level parameters.
INTERFACING VISION PROCESSING AND REASONING
The interdisciplinary gap between the design and management of vision systems, on the one hand, and application-level operations such as high-level reasoning, long-term
behavior modeling, and visualization, on the other hand, has limited the extent to which vision-based applications address novel application domains. Interfacing mechanisms that bridge multi-camera networks and application-level modules are needed to facilitate information utilization at the application layer and supervision of vision processing for effective data extraction from the views.
In most computer vision applications, the results of processing are either presented to a viewer or sent to a higher-level reasoning module for further interpretation. These transfers are generally designed as feed-forward data flow paths, with no channeling of information from the other units back to the vision-processing module. However, such feedback paths can often provide the vision module with valuable information that promotes more effective processing. Figure 2 illustrates the traditional data path between the vision-processing and high-level reasoning modules adopted in most applications; Figure 3 demonstrates feedback, with simple examples of the information sent back to the vision module. In more general terms, examples of information that vision can gain through a feedback path from high-level reasoning include:
■ A context based on other modalities and historic data
■ User behavior models and user preferences
■ A knowledge base accumulated from historic data
■ The value of current estimates or decisions based on interpreted output and logic
■ The relative value of low-level features for addressing a high-level task
■ The relative value of observations made by different cameras
■ Assignment of tasks to different cameras based on recent results, context, and current observations
FIGURE 2 Traditional feed-forward data path from vision to high-level reasoning.
FIGURE 3 Role of feedback from high-level reasoning to the vision module: feedback for individual cameras (features, parameters, tasks, etc.) and for the network as a whole (task dispatch, resource allocation, scheduling, etc.), informed by context, event interpretation, and behavior models.
FIGURE 4 Interfacing vision with high-level reasoning and with visualization.
Access to such data often permits active vision-processing methods that enable the vision module to search for and extract the most useful information from the views as the processing operation progresses. Figure 4 illustrates the role of a middleware interface between vision and high-level reasoning for a human activity analysis application.
In applications where the objective is some form of end-user visualization based on multi-camera observations, the format and update rate of the graphical output can influence the processing of the input video streams. For example, ambient communication between users may offer a choice between compressed video and avatar-based visualization, and in each case a different vision-processing task may be needed: in video presentation, a best-view selection algorithm may run on the video streams, whereas in avatar-based communication the processing task may classify user activities into a set of avatar movements. Figure 4 shows the middleware interface validating the vision-processing output for visualization in a communication application, employing the user's stated preferences and availability mode. Based on these states and the selected presentation method, different processing units may be activated in the vision module at different times.
PERFORMANCE EVALUATION
Many video sequence data sets are available to the computer vision community for algorithm and system evaluation. Most reported studies have focused on comparing the accuracy of the final results of different algorithms, assuming synchronized access to all
data at a single processing unit. The existing data sets have thus been produced with accompanying ground truth data or evaluation metrics that mainly compare accuracy, complexity, and confidence in the final results. In many applications of multi-camera networks, especially those based on large-scale deployments in realistic, uncontrolled environments, these assumptions may not be valid. Rather, processing may be distributed among several processors, each accessing a limited number of views, with the possibility of delays in accessing the input streams or the extracted metadata. Moreover, in many real-time applications a trade-off is made between accuracy at the final stage and other factors such as delay and the confidence and reliability of results. Additionally, because many vision applications aim to produce interpretation results, such as detection of an event or recognition of an activity, expression, or intention, frame-by-frame or feature-to-ground-truth accuracy may not be relevant to the success of the algorithm. For these reasons, different metrics need to be employed, depending on the priorities demanded by the application and by the system-level constraints.
Applications with multiple observed objects pose a different challenge for data set and testbed creation, as the possible situations can easily cover a very large span. Yet to be resolved are the issues of defining ground truth data that can serve as the basis for performance evaluation under different operating points, as well as appropriate evaluation protocols that consider not only the accuracy of the results but also delay and communication requirements.
USER AND SOCIAL ACCEPTANCE
The techniques developed for multi-camera networks have for the most part focused on the accuracy of extracted visual features in order to replicate the appearance or activity of the observed objects, or on creating reports describing these extractions for an external user or process. An area less fully explored is user and social acceptance criteria for vision-based applications that provide comfort or safety in personal, work, shopping, and social interaction environments. Although many applications have been discussed, the lack of adequate study of user acceptance has in many cases prevented user-centric methods from properly directing design efforts. Even application areas such as security, urban monitoring, and automotive, which are intrinsically intrusive, can benefit from a social acceptance perspective, offering privacy-managed discernment and treatment of normal and abnormal events. Researchers have been presented with an extraordinary opportunity to develop context-aware and flexible ways to introduce multi-camera networks as socially acceptable and valuable.
User acceptance issues may profoundly affect the adoption of traditional and novel multi-camera applications. Failure to include privacy management in the design of the technology will create adoption hurdles for applications such as security, retail and shopping services, smart homes, occupancy-based services, social networking, communication, gaming, and education. In user-centered applications, given the nature of the visual sensing modality, proper handling of privacy and user control over representation modes is critical. Especially if the technology offers communication options between the observed user and others, such as in a social setting, user preferences and sensitivities must be part of the design effort.
FIGURE 5 Interaction between vision technology design and user acceptance criteria (user preferences, social aspects, behavior models, context, modes of representation, and privacy issues).
Often, systems that are aware of and responsive to user preferences and context may need to employ different kinds of processing algorithms when privacy settings change, either explicitly at the user's request or via detected changes in context. It is thus of paramount importance for system and algorithm designers to consider user acceptance. Also important are user privacy preferences if the system offers different forms of visual communication. Figure 5 illustrates the relationship between technology and user acceptance in developing vision-based applications.
CONCLUSIONS
This book introduced state-of-the-art research on, and application development in, multi-camera networks. Many challenging questions in the design and operation of these systems were addressed in the pioneering work reported here, but many more challenges remain. Although canonical applications in security and surveillance have seen considerable progress in intelligent detection and reporting (a trend that will continue as a growth area), their effective use in realistic large-scale deployments that fuse data from heterogeneous and distributed sensors is still in its infancy. Moreover, many new pervasive vision-based applications in smart environments and ambient intelligence remain at an early exploratory stage.
Vision is an enabling technology for many future applications. Multi-camera networks offer promises and challenges that call for new and inspiring directions in research. The flexibility they offer is not limited to handling views and fusing the acquired data; they also open research opportunities in distributed and pervasive processing, networking, multi-modal data collection, ambient communication, and interactive environments. Incorporating mechanisms that address these domains into the design of multi-camera networks presents exciting research opportunities, and it is through such an approach that many novel applications of multi-camera networks will be introduced in the coming years.
Index A Abandoned object primitive event, 466 Absolute dual quadric, 23 Acceptance ratio, 378 Achievable region, 273 Actuated camera platform, 86–87 Actuation benefits, 78–79 classes, 79 effective coverage, 78 exploitation, 79 planning, 84–85 price of, 79 reference poses, 80, 81 strategies, 84–85 termination rules, 85 Actuation assistance, 77–93 actuated camera platform, 86–87 base triangle, 80–81 bundle adjustment refinement, 83 conclusions, 93 evaluation, 88–93 experimental deployment, 89 introduction, 77–79 large-scale networks, 81–83 latency, 92–93 localization accuracy, 88–90 methodology, 80–83 network architecture, 87–88 network merge, 83 node density, 90–92 optical communication beaconing, 87 resolution impact, 88 system description, 85–88 AdaBoost algorithm, 317, 336 Adaptive Component Environment (ACE), 523 Adaptive dynamic cluster-based tracking (ADCT), 542
Affine cameras, 21 Affine reconstruction, 20–21 Affine transformations, 15 inducing, 16 Agent-based tracking architecture, 543 Agent-oriented programming (AOP), 519–21 application domains, 519 Agents, 519 mobile, 520, 521 from objects to, 519–20 resuming, 521 stationary system, 523–24 suspending, 521 tracking, 527 Agent system, 522–33 decentralized multi-camera tracking, 526–30 DSCAgents, 522–26 sensor fusion, 530–33 Algebraic topographical model, 95–96 building, 95–96 cameras, 100–101 ˇ ech theorem, 99 C CN -complex, 101–3 conclusions, 114 environment, 100 homology, 97–98 mathematical background, 97–99 physical layout of environment, 96 recovering topology (2D case), 103–7 recovering topology (2.5D case), 108–13 of wireless camera networks, 95–114 Ambient intelligence applications, 231–33 Angle of arrival (AOA), 222 Apparent occlusions, 398 Appearance-based tracking, 393–400 Appearance-Driven Tracking Module with Occlusion
Classification (Ad Hoc), 390, 394 Appearance-matching score, 448 Appearance models, 393–94 automatic initialization, 69 distributed density histogram, 448 validation, 66–67 Application layer (middleware), 517 Art gallery problem (AGP), 120, 141 Artificial neural networks, 223 Aspect ratio, 247–48 Association evaluation, 420–30 with appearance information, 420–22 data model, 422–24 maximum likelihood estimation, 424–26 with motion information, 422–30 real sequences, 428–30 simulations, 426–28 subspace of BTFs, estimating, 421–22 See also Object association Asynchronous face detection, 185 Asynchronous optimization, 171–73 Auto-calibration, 22 critical motion sequences, 24 process, 23 Automatic detection, 63 Automatic learning/tracking, 61–63 detection, 63 object localization, 63 occluder computation, 63 shape initialization and refinement, 62 Automatic parameter tuning, 38–39 Automotive applications, 234
Autonomous architecture, 218, 219 Average neighborhood margin maximization (ANMM), 335, 344, 345–47 approximation, 351, 352 benefits, 346 coefficients, 351 eigenspace, 349 eigenvectors, 346, 347, 350 functioning of, 346 LDA basis, 336 scatterness matrix, 346 transformation, 353 vectors, 349, 351, 352 Average visibility, 150 Axis-angle parameterization, 24
B Background modeling, 391–93 bootstrapping stage, 391 complexity, 391 composite event detection and, 459 object-level validation step, 392 Background subtraction algorithm, 389 Badge readers, 458 Base triangles, 80–81 cluster, 82 merger, 82 Bayesian-competitive consistent labeling, 400–404 Bayesian filtering, 223–25 in distributed architectures, 228–29 multi-modal tracking with, 225–28 multi-object state space, 371 single-object 3D state/model representation, 370–71 Bayesian state space formulation, 365 Bayesian tracking, 363–87 approach, 368–69 calibration, 382–83 conclusions, 385–87 detection and, 367 dynamic model, 371–74
dynamics and interaction models, 366–67 experiments, 382–85 framework, 369 introduction, 363–69 key factors/related work, 364–68 for metro scene, 384, 385 multi-camera, 367–68 object state representation, 365 observation model, 375–78 problem formulation, 369–71 results, 383–85 RJ-MCMC, 378–82 setups and scenarios, 364 slant removal, 383 Bayes’ rule, 56 Bayes theorem, 425 Behavior analysis, 391 Best-first refined node, 182 Betti numbers, 98 computing, 106 homology computations returning, 107 Bhattacharya coefficient, 406 Bhattacharya distance, 355 Binary integer programming (BIP) algorithm, 124 complexity, 139 extension, 136–37 global optimum, 132 models, 126, 128 objective function, 127 in optimization problem, 139 optimum solution, 133 poses resulting from, 135 problem, 127 results, 133 speeding up, 151–52 Binary operators, 464 Binary silhouette reasoning, 30 Binary tree decomposition, 304 Bins, 297–98 Bisecting lines, 101 Block-matching correspondence approaches, 19 BlueLYNX, 488 Body pose estimation, 335–59 ANMM, 335, 336
background, 336–39 classifier, 344–47 example-based methods, 338–39 haarlets, 347–52 reconstruction, 341–44 results and conclusions, 357–59 rotation invariance, 352–57 segmentation, 339–41 tracking, 337–38 Boosting, 317–18 offline, 317 online, 317–18 Bottom-up computer vision algorithm, 499 Boundaries, 97 piecewise-linear, 100 Boundary operators, 97, 98 Brightness transfer functions (BTFs), 416 computing, 421 subspace, subspace, 421–22 Bundle adjustment, 24–25, 83
C Calibration accuracy, 31 accuracy, testing, 48 camera network, 39–42 catadioptric cameras, 247–50 cross-camera, 442–46 external problem, 77 between fixed and PTZ cameras, 190 hard, 191 internal, 191 maintaining while moving, 192 marks, 191 multi-camera, 31, 32 precalibration and, 190–91 PTZ camera, 168, 191–92 recovery, 46 scene, 320–21 self, 191–92 Camera blind primitive event, 466 Camera clustering protocol, 548–54 cluster coalescence, 551–52 cluster fragmentation, 551
Index cluster head election, 549–50 cluster maintenance, 553–54 cluster propagation, 550–52 noncoalescing inter-cluster interaction, 552–53 See also Cluster-based tracking Camera control, 165–86 asynchronous optimization, 171–73 camera calibration, 168 combinatorial search, 173 combined quality measure, 177–78 conclusions, 186 experiments, 178–86 idle mode, 178 introduction, 165–66 objective function for scheduling, 170–71 optimization, 171–73 planning, 168 PTZ limits, 177 quality measures, 173–78 related work, 166–67 system overview, 168–70 target-camera distance, 176 target-zone boundary distance, 176–77 tracking, 168–70 view angle, 173–75 Camera domains, 100 Camera matrices, 5–8 estimating, 7–8 extrinsic parameters, 6 fundamental matrix relationship, 11–12 homogeneous coordinates, 5 intrinsic parameters, 6 orthographic projection, 7 parameter extraction, 6–7 weak perspective projection, 7 Camera motion stopped primitive event, 466 Camera moved primitive event, 466 Camera network calibration, 39–42 external, 77 incremental construction, 41–42
triple points resolution, 40–41 Camera networks, 30 complex, 101–3 cooperative, 416 coverage, 101 detectable set for, 108 dynamic and heterogeneous, 492 optimal configuration, 139–61 performance optimization, 140 smart, 481–570 video compression, 267–92 Camera network synchronization, 42–44 results, 45 silhouette interpolation, 44 subframe, 43–44, 48 Camera placement optimal, 117–37, 147–52 theoretical difficulties, 141 uniform, 160 Camera planning horizon, 172 Monte Carlo simulation and, 156 Camera poses, 124 recovery, 34–39 from solving BIP model, 135 Camera positions, 124 Camera projection, 104 Camera-resectioning problem, 31 Cameras algebraic topological model, 100–101 bisecting line, 101 catadioptric, 241, 247–50 CCTV, 165, 169 coverage, 100 daisy-chaining cameras, 450 elevation of, 155–56 FOV, 369 maximizing visibility for, 149–51 minimizing number of, 148–49 omnidirectional, 118, 240–41 pinhole, 4, 5 position, 100
PTZ, 118, 166 purpose, 4 radial lens distortion, 7 skew, 42 in 2D, 103–4 in 2.5D, 108 Camera-scheduling algorithm, 167 Camera transition graphs, 401 Catadioptric cameras calibration, 247–50 extrinsic parameters, 249–50 FOV of, 247 hemispherical field of view, 241 image formation, 241 intrinsic parameters, 247–49 parabolic, 247 projective geometry, 241–43 with quadric mirrors, 241 CCTV cameras, 165, 169 ˇ ech theorem, 99, 102 C Centralized algorithms, 25 Centralized architectures, 217–20 autonomous, 218, 219 direct, 218 distributed architectures versus, 217–20 feature extraction-based, 218 smart cameras, 513 types of, 217 See also Distributed architectures Channel codes for binary source DSC, 275–76 Stanford approach, 288 syndrome, 298 Circular statistics, 404 CITRIC mote, 110, 489, 490, 507 ARM9-based processor, 507 Classification, 351 comparing, 357 example results, 358 problem, 352 rates, 352 regions, 400 times, 353 trajectory shape, 407–9
average number of messages transmitted, 563–65 ceiling-mounted wireless cameras, 567–68 computation time, 565 estimation accuracy, 562–63 model error, 565 real wireless cameras, 568–69 simulation results, 562 simulator environment, 561–62 testbed implementation, 565–69 tracking results, 567 Clustering arrival locations, 183 for object detection and localization, 544 Clusters, 82 border nodes, 552–53 coalescence, 551–52 dynamic formation, 547, 548 fragmentation, 551 head, 560 head election, 549–50 inter-cluster communication, 548, 554 maintenance, 553–54 noncoalescing interaction, 552–53 propagation, 547, 548, 550–52 single-hop, 549 CMUcam3 embedded computer vision platform, 489, 490, 508 development, 506 versions, 506 See also Wireless smart camera networks CN -complex, 101–3 building, 109–10 construction steps, 102–3 examples, 103 relationship discovery, 114 Coarse inference, 62 Collinearity test, 340 Collineation. See Projective transformations Color distance, 377 Colored surface points (CSPs), 337
Color likelihood, 375–77 color distance, 377 multi-modal reference histogram modeling, 377–78 object color representation, 377 Combinatorial search, 173 Common middleware services layer, 516 Common Object Request Broker Architecture (CORBA), 517 Communication modules (smart camera), 504–5 Complete network merge, 85 Composite event detection, 457–77 background modeling and, 459 case studies, 472–76 conclusions, 477 effectiveness, 459 event description language, 465 event representation and, 463–65 false positive reduction application, 476 framework, 458 future work, 477 introduction, 458–59 related work, 459–61 relationships between primitive events, 457 retail loss prevention application, 472–74 spatio-temporal, 461–68 system infrastructure, 461–63 tailgating detection application, 474–76 Composite events, 461 conditions triggering, 470 definition editor, 468 formulation, 477 high-level, 463 input, 464 queried, 470 target, 464 tree-structure, 463 See also Events Composite event search, 468–72 function, 468–69 IBM Smart Surveillance Solution (SSS), 469
Index query-based, 469–72 results, 471 Compression centralized, 302 distributed, 295–308 quadtree-based, 302 video, 267–92 Computer vision algorithm, 499 Consensus algorithms, 491 Consistent labeling, 400–404 hypotheses, 401 problem solution, 401 Constructive solid geometry (CSG), 342 Contractible sets, 99 Control points adjusting, 445 number of, 124 uncovered, 130 Cooperative camera networks, 416 Coordinate frames, 100 Coordination algorithms, 491–92 Correlation estimation, 256–58 Co-training, 314, 316 concept, 316 conclusions, 332 detections, 323 experimental results, 324–32 falsification, 321 future work, 332 geometric correspondences, 315 indoor scenario, 325–28 with multiple views, 324 negative updates, 323 outdoor scenario, 328–29 performance characteristics (indoor scenario), 327 performance characteristics (outdoor scenario), 330 positive updates, 323 process, 321 resources, 329–32 scene calibration, 320–21 system, 319–24 system illustration, 320 test data description, 325 update strategy, 324 verification, 321 for visual learning, 316
Covariance matrix, 19 Coverage, 100 camera network, 101 sampling rate and, 117 Critical infrastructure protection, 449–51 Critical motion sequences, 24 Cross-camera fusion system architecture, 440–49 benefits, 441 block diagram, 441 camera calibration block diagram, 445 challenge, 442 control point adjustment, 445 cross-camera calibration, 442–46 data fusion, 446–49 data sharing, 440 design, 440–42 matching line features, 443 testing, 452 Cross-product operator, 201 Cross-ratio, 201 Cyclops cameras experimental deployment, 89 localization error, 90 network simulation, 92 node density, 92 orientation error, 90 smart camera system, 489 testbed, 88
D Daisy-chaining cameras, 450 Darkness compensation, 340 Darkness offset, 341 Data alignment, 221–23 spatial, 222–23 temporal, 222 Data fusion, 446–49 block diagram, 447 high-level goal, 446 multiple layers of, 453 target process, 447 testing, 452 Data models, 422–24 imaged trajectory, 423 kinematic polynomial, 423–24 Data sharing, 440 Decentralized Kalman filter, 554
Decentralized tracking, 526–30 application, 526–27 dynamic task loading, 528–30 evaluation, 528 middleware support, 528–30 mobility, 528 neighborhood relations, 530 target hand-off, 527–28 Decision fusion, 531–32 Decoding procedure, 259 Decomposition binary tree, 304 joint probability distribution, 59 sets, 101 theorem, 102 Degrees of freedom (DOFs), 34, 35 Depth effects, 374 Detect-and-track paradigm, 169 Detection of persons importance, 313 motion, 313–14 Difference-of-Gaussian (DOG) filter, 19 Digital signal processors (DSPs), 502 Digital video management (DVM) systems, 458 Directional motion primitive event, 466 Directional statistics, 404 Direct linear transform (DLT), 16 Discretization, 147–48 Disparity estimation, 252–56 dense, 255 distributed coding algorithm based on, 258 problem, 252, 255 Distributed aggregate management (DAM) protocol, 542 Distributed algorithms, 25, 490–92 consensus, 491 coordination, 491–92 Distributed architectures, 217–20 Bayesian filtering in, 228–29 constraints, 220
GestureCam, 488 implementing, 487 middleware, 487 middleware requirements, 518–19 NICTA Smart Camera, 488 processing, 500 scalability, 484 SmartCam, 488 trends, 483 See also Smart camera networks Distributed source coding (DSC), 267, 272–78, 295 applying to video coding, 278–80 channel codes, 275–76 correlation, 285 DVC versus, 285–88 example, 274–75 foundations of, 296–99 setup, 296 Slepian-Wolf theorem, 272–74 syndrome, 298 Wyner-Ziv theorem, 277–78 Distributed video coding (DVC), 267, 268, 272 applying to multi-view systems, 288–92 compensation, 286 DSC versus, 285–88 first schemes, 306 frame encoding, 306 idea behind, 279 multi-terminal, 306–7 PRISM codec, 280–82 rate allocation, 288 Stanford approach, 282–85 use in single-source video coding, 278 use motivation, 291 WZ decoding in, 286 Distribution layer (middleware), 516 Domain-specific layer (middleware), 517 DSCAgents, 522–26 agent mobility, 524 architecture, 522 average messaging time, 525–26 evaluation, 524–26 execution times, 525 host abstraction layer, 523 implementation, 524
operating system, 522 RAM requirement, 525 software architecture, 522–24 stationary system agents, 523–24 See also Agent system DSP-based smart camera, 486, 487 Dual image of the absolute conic (DIAC), 23 Dual sampling algorithm, 129–30, 131 configurations resulting from, 135 input, 130 results, 133 uncovered control points, 130 versions, 130, 131 Dynamic and heterogeneous architectures, 492 Dynamic convoy tree-based collaboration (DCTC), 543 Dynamic model, 371–74 joint, 371–73 single-object, 373–74 See also Bayesian tracking Dynamic objects, 57–61 dependencies, 59–60 inference, 60–61, 69–71 occupancies, 55 static occluder inference comparison, 61 statistical variables, 57–59 Dynamic occlusions, 398 Dynamic occupancy priors, 55 Dynamic scene reconstruction, 49–71 algorithm, 62 automatic learning and tracking, 61–63 probabilistic framework, 52–61 related work, 50–52 results and evaluation, 64–71 Dynamic silhouettes, 34 Dynamic task loading, 528–30, 533
E Effective field of view (EFOV ), 446 on site map, 448
Index target movement out of, 448 Eigenvalues, 345 Eigenvectors, 345–47, 350 Embedded middleware. See Middleware Embedded single smart camera, 486, 487 Embedded smart cameras, 521 Energy-based activity monitoring (EBAM), 542 Energy minimization algorithm, 252 Environment algebraic topographical model, 96, 100 smart, 497 in 2D, 103 in 2.5D, 108 Environmental occlusion, 145–46 Epipolar candidate set, 258 Epipolar constraints, 304 identifying, 258 for spatio-temporal distributed compression algorithm, 307 Epipolar delta filter (EDF), 249–50 Epipolar geometry, 10–11 from dynamic silhouettes, 34 epipolar lines, 10, 11 estimating, 34 hypothesized model, 37 illustrated, 47 for paracatadioptric cameras, 250–52 for pinhole camera model, 251 recovery, 34 for spherical camera model, 252 two-image relationship, 10 uses, 251 vanishing point, 402 Epipolar great circles, 253, 254 Epipolar line, 10s, 11 conjugate, 14 Epipolar plane image (EPI) structure, 300 volume, 300 volume parameterization, 301
Epipolar tangents, 34 Epipoles, 10, 252 on spherical images, 253 Error distribution, hypothesis, 38 Essential matrix, 12, 251 Estimating camera matrices, 7–8 epipolar geometry, 34 fundamental matrix, 12–14 projective transformations, 16–17 Euclidean distance, 372 Event detection across multiple cameras, 460 composite, 457–77 in single-camera views, 460 Event-driven clustering protocols, 541–44 Events analysis, 229–30 composite, 461 description language, 465 primitive, 461, 465–68 representation, 463–65 storage, 466 Example-based methods, 338–39 parameter-sensitive hashing, 338 silhouette, 338, 339 See also Body pose estimation Expectation maximization (EM), 249, 425 Expectation-maximization-like activity monitoring (EMLAM), 543 Extended Kalman filter, 224 External calibration problem, 77 technique, 79 Extrinsic parameters, 6, 249
F Face capture primitive event, 466 Face detection asynchronous, 185 during live operation, 184 engine, 185 False positive reduction application, 476 Feature detection/matching, 18–20
Feature extraction-based architecture, 218 Feature fusion, 531 Feature selection image representation and, 319 offline boosting, 317 online boosting, 317–18 Field of view (FOV), 78, 117, 121 cameras, transitions between, 369 catadioptric camera, 247 effective (EFOV), 446 limited, 415 modeling, 121–22, 142 overlapped, 390 rotating, 122 triangle description, 121 visual tags, 146 FIX_CAM, 149–51 approximating with GREEDY algorithm, 152 camera placement variables, 150 constraints, 151 cost function, 150 GREEDY implementation, 155 MIN_CAM versus, 154 performance, 159 traffic model, 157 use results, 154 Fixed cameras. See Stationary cameras Fixed lookup tables (LUTs), 343 Focal length advancement, 203, 207 computing, 196–97 extending, 203 PTZ cameras, 195 reference image, 196 Foreground likelihood, 375 Foreground segmentation, 389 Forward contribution, 402 FPGA-based smart camera, 486, 487 Frames groups of, 268, 269 independent compression, 306 master-slave system execution, 204
probability distribution, 205 of spherical wavelets, 246 Free-viewpoint 3D video, 30 Frontier points, 33 illustrated, 33 matching, 34 pairs, 37 Full-body pose tracker, 337–38 Fundamental matrix, 10 as essential matrix, 12 estimating, 12–14 nondegenerate views, 40 numerical stability, 13 relating to camera matrices, 11–12 undefined, 14 Fusion model, 531–32 fusion tasks, 531 illustrated, 532 sensor tasks, 531 Fuzzy logic algorithms, 223
G Gauge of freedom, 25 Gaussian atoms, 257 Generalized Dual-Bootstrap ICP algorithm, 17 General-purpose processors (GPPs), 502 Geometric correlation model, 257 Geometry constraints, 252 GestureCam, 488 Global nearest neighbor (GNN), 169 GMM models, 65, 66 Gradient coherence, 393 Graph search algorithms, 173 GREEDY algorithm, 151–52 FIX_CAM approximation, 152 FIX_CAM implementation, 155 illustrated, 152 for MIN_CAM approximation, 152 use results, 154 Greedy search algorithm, 128–29 configurations from, 134 illustrated, 129 results, 133 Group of k-chains, 97 Groups of frames (GOP), 268
encoding procedures, 269 predictive coding dependencies (GOP), 269
H H.261, 267, 268 H.264/Advanced Video Coding (AVC) standard, 267, 271 Haarlets, 335, 347–52 best, 349 candidate, 347 classification, 351, 352 examples, 350 experiments, 351–52 optimal set, 347 redundant, 349 training, 348–50 3D, 347, 349, 359 Haar-like features, 319, 335 Hamming distance, 275 Hamming weight, 276 Hard calibration, 191 Harris corners, 19, 20 Head localization error, 207, 208 HECOL, 390 Heuristics, 128–30 dual sampling, 129–30, 131 greedy search, 128–29 quality of, 132 Hidden Markov model (HMM), 366, 410 Homogeneous coordinates, 5 Homography. See Projective transformations Homologies, 97–98 computing, 106 group, 98 invariant, 99 planar, 200 between two continuous factors, 99 type, 99 Host infrastructure layer (middleware), 516 Hough transform, 249 Hybrid processor/system-on-a-chip, 503 Hyperbolic mirrors, 243
I IBM Smart Surveillance Solution (SSS), 469
Idle mode, 178 Illumination-invariant background subtraction method, 340 Image coordinates relationship, 9 two-camera geometry, 8 Image domain, 104 Image formation camera matrices, 5–7 camera matrices estimation, 7–8 perspective projection, 4–5 Imagelab architecture, 391 MPEG video surveillance solution, 390 Image points, 5 Image sensor model, 55–56 Image/vision processors, 502–3 Importance factor, 341 Independent encoding, 305 Indoor scenario (co-training), 325–28 classifier training, 325 detection results, 329 performance curves, 326 performance curves for tri-training, 327 typical learning behavior, 326 Inference coarse, 62 dynamic objects, 60–61, 69–71 framework, 420 multi-object shape, results, 65–71 occlusion, results, 64–65 refined, 62 static occluder, 56–57, 69–71 Intelligent video surveillance (IVS) systems business applications, 436 detection, 435 flow chart, 439 security applications, 435–36 use of, 435 Inter-cluster communication, 548, 554 Interface description language (IDL), 517 Inter-frame, 279
Inter-object occlusions, 51 Interpolated silhouettes, 44, 48 Inter-sequence, 279 Intra-cluster computation, 548 Intra-frame, 279 Intra-mode, 280 Intra-sequence, 279 Intrinsic parameters, 6, 12, 23 I-SENSE project, 531 Iterative k-medoids algorithm, 406
J Java Virtual Machine (JVM), 516 Joint distribution, 54–55 Joint dynamic model, 371–73 shape-oriented interactions, 372–73 See also Dynamic model Joint probabilistic data association filtering (JPDAF), 169 Joint probabilistic data association (JPDA), 223 Joint probability distribution Bayes’ rule application with, 60 decomposition, 59
K Kalman filters, 224, 228 cluster-based, 554–60 decentralized, 554 distributed, 544–46 unscented (UKF), 224, 569 Keyframes, 472, 476 Kinematic polynomial models, 423–24 KLT tracker, 459 K-medoids algorithm, 406 K-simplex, 97
L Lagrangian minimization, 303 Laplacian-of-Gaussian (LOG) filter, 19 Latency, actuation assistance, 92–93 Learn-then-predict paradigm, 407 Least significant (LS) bits, 282
Least squares cost functional, 13 Least squares minimization problem, 7–8, 13 Levenberg-Marquardt algorithm, 24, 426 Levenberg-Marquardt bundle adjustment, 168 Linear discriminant analysis (LDA), 336, 344–45 feature extraction, 345 goal, 344 Linear programming, 124–28 LINGO package, 131 Localization accuracy, 88–90 clustering for, 544 head error, 207 information requirements, 95 multi-modal techniques for, 223–29 optimal camera placement for, 119 unknown, 95 zoomed head, 200–203 Location-matching probability, 447 Logical architecture design, 215–17 cognitive refinement, 217 database management, 217 impact assessment, 216 object refinement, 216 process refinement, 216 situation refinement, 216 source preprocessing, 216 See also Multi-modal data fusion Lossy coding, 299 Low-density parity check (LDPC) coding, 260
M Machine learning algorithms, 314 Mahalanobis distance, 24, 372–73, 426 Map targets, 448, 449 Marginal error enhancement, 85 Markov random field (MRF), 340 Master, 527 Master-slave configuration, 193 execution, frames, 204
minimal PTZ camera model parameterization, 194–95 pairwise relationship, 194 PTZ camera networks with, 193–95 Master vanishing lines, 201 Maximum a posteriori (MAP) estimation framework, 416 Maximum likelihood estimation, 424–26 Mean-shift tracking, 460 Mean visibility, 144 computing, 147 Measurement matrix, 21 Media processors, 502 Meerkats sensor nodes, 488–89 MeshEye, 489–90, 505–6, 508 architecture, 506 development, 505 sensor, 489, 505–6 vision sensors, 505 See also Wireless smart camera networks Metric factorization, 22 Metric reconstruction, 22–24 computing, 42 illustrated, 48 of MIT sequences, 49 See also Reconstruction Metropolis-Hastings (MH) algorithm, 378 Middleware, 511–34 application layer, 517 approach, 512 architecture, 515–17 common services layer, 516 conclusions, 533–34 defined, 515 for distributed smart cameras, 518–19 distribution layer, 516 domain-specific layer, 517 for embedded systems, 517–18 general-purpose, 517 host infrastructure layer, 516 implementations, 515 layers, 516–17 operating system, 516
requirements for distributed smart cameras, 518–19 for smart camera networks, 515–19 tracking architecture, 543 for wireless sensor networks, 519 Migration mobile agents, 520–21 nontransparent, 521 transparent, 521 MILS Map, 472 MIN_CAM, 148–49 approximating with GREEDY algorithm, 152 BIP formulation, 148 BIP problem solution, 149 characteristics, 148 FIX_CAM versus, 154 illustrated, 150 iterations, 153–54 performance, 153–54 tag grid point discard, 154 Minimum CORBA, 517 Mixture of von Mises (MovM) distributions, 405 Mobile agents, 520 DSCAgents, 524 for embedded smart cameras, 521 migration, 520–21 Mobile-C, 521 Modeling space, 123–24 Monte Carlo approximation, 370 Monte Carlo sampling, 144 Monte Carlo simulation, 154, 156 MOSES (Mobile Streaming for Surveillance), 391 Motion compensated spatio-temporal wavelet transform, 307 Motion detection primitive event, 466 Move proposals, 379–81 add, 379–80 delete, 380 leave, 380 stay, 380 switch, 381 update, 381 See also Reversible-jump Markov chain Monte Carlo
Multi-camera calibration, 31, 32 Multi-camera geometry, 20–25 affine reconstruction, 20–21 bundle adjustment, 24–25 metric reconstruction, 22–24 projective reconstruction, 22 structure from motion, 20 See also Multi-view geometry Multi-camera networks composite event detection, 457–61 distributed compression, 295–308 dynamic scene reconstruction, 49–71 early, 3 illustrated, 302 modern, 3 multi-view calibration synchronization, 31–49 multi-view geometry, 3–25 Multi-camera tracking, 166 decentralized, 526–30 drawbacks, 438 number and placement of cameras, 437 observations and, 367–68 Multi-camera tracking/fusion system, 435–54 benefits, 441 block diagram, 441 camera calibration block diagram, 445 with cascade of fusion sensors, 454 challenge, 442 control point adjustment, 445 critical infrastructure protection, 449–51 cross-camera calibration, 442–46 data fusion, 446–49 data sharing, 440 design, 440–42 examples, 449–52 future work, 453–54 hazardous lab safety verification, 451–52 introduction, 435–39 matching line features, 443
objectives, 438 single-camera architecture and, 439 testing and results, 452–53 Multicellular genetic algorithm, 119 Multi-modal data fusion, 213–34 ambient intelligence applications, 231–33 applications, 230–34 automotive applications, 234 Bayesian filtering, 223–29 conclusions, 234 data alignment, 221–23 economic aspects, 214 feature extraction-based architecture, 218 installation constraints, 214 introduction, 213–14 JDL process model, 215 logical architecture design, 215–17 performance requirements, 214 physical architecture design, 217–21 for situation awareness applications, 230 for state estimation and localization, 223–29 surveillance applications, 231 system design, 214–21 system examples, 232 techniques, 221–30 video conferencing, 234 Multi-modal guidance message modalities, 233 Multi-modal reference histogram modeling, 377–78 Multi-modal tracking, 225–28 Multi-object shape inference results, 65–71 appearance modeling validation, 66–67 automatic appearance model initialization, 69 densely populated scene, 67–69 dynamic object and occluder inference, 69–71
Multi-object silhouette reasoning, 51 Multi-object state space, 371 Multi-person Bayesian tracking, 363–87 Multiple cameras PTZ, 419–20 stationary (nonoverlapping FOV), 419 stationary (overlapping FOV), 418–19 Multiple-hypothesis testing, 417 Multiple-object tracking (MOT), 363, 364 Multi-terminal DVC, 306–7 Multi-view calibration synchronization, 31–49 camera network calibration, 39–42 camera network synchronization, 42–44 epipolar geometry from silhouettes, 34 metric reconstruction computation, 42 recovery from silhouettes, 34–39 results, 44–48 Multi-view coding, 268 applying DVC to, 288–92 extending mono-view codecs to, 289–91 PRISM decoder, 290 problems, 291–92 Stanford codec, 290 Multi-view extension (H.264/AVC), 271 Multi-view geometry, 3–4, 25 feature detection and matching, 18–20 image formation and, 4–8 multi-camera, 20–25 projective transformations, 14–18 two-camera, 8–14 Multi-view images, 301–6 Multi-view stereo, 50 Mutual occlusion modeling of, 143 simulation results, 156–57 visibility condition, 146
N Nearest neighbors (NN) approach, 344 Neighborhood relations, 530
Nerve complex, 97 .NET Common Language Runtime, 516 Network Merge algorithm, 87 Network Time Protocol (NTP), 222, 533 Network topologies, 220–21 fully connected, 221 multiply connected, 221 tree connected, 221 Neuromorphic sensors, 502 NICTA Smart Camera, 488 Node localization, 78 Nontransparent migration, 521 Normalization constant, 61 Normalized eight-point algorithm, 14 Numerical stability, 8, 13
O Object association, 415–30 across multiple nonoverlapping cameras, 429 evaluating (appearance information), 420–22 evaluating (motion information), 422–30 inference framework, 420 introduction, 415–17 multiple PTZ cameras, 419–20 multiple stationary cameras (nonoverlapping FOV), 419 multiple stationary cameras (overlapping FOV), 418–19 reacquisition, 428 related work, 417–20 Objective function efficacy, 173 maximizing, 173 for PTZ scheduling, 170–71 Object localization, 63 Object-oriented programming (OOP), 519 Object state representation, 365 multiple, 371 single, 370–71 Object tracking importance, 363
problem formulation, 365–66 ObjectVideo Virtual Video Tool (OVVV), 452–53 Observation model, 375–78 color likelihood, 375–77 foreground likelihood, 375 Observed variables, 53–54 Occluder inference, 56–57 ambiguous regions, 71 integrated, 69 multi-object shape, 69–71 in partial occlusion cases, 71 See also Static occluders Occlusion angle, 147, 157 Occlusion inference results, 64–65 Occlusion-resilient Wyner-Ziv scheme, 259, 260 Occlusions apparent, 398 detection, 398–99 dynamic, 398 inter-object, 51 scene, 398 silhouette-based, 30 Occupant traffic distribution, 157–58 Offline boosting, 317 Offline-trained object detectors, 330 Omnidirectional camera networks catadioptric cameras, 247–50 conclusions, 261 distributed coding, 258–61 imaging, 240–46 introduction, 239–40 multi-camera systems, 250–56 sparse approximations, 256–58 Omnidirectional cameras, 118, 252 Omnidirectional imaging, 240–46 cameras, 240–41 image processing, 245–46 mapping cross-section, 243 projective geometry, 241–43 spherical camera model, 243–45 as spherical signals, 245
Omnidirectional vision sensors, 240 Onboard image analysis, 492 Online boosting, 317–18 Online co-training, 321–24 Online learning, 313–32, 314 boosting for feature selection (offline), 317 boosting for feature selection (online), 317–18 conclusions and future work, 332 co-training system, 319–24 experimental results, 324–32 image representation and features, 319 including prior knowledge, 318–19 Open spaces, 364 Operating systems DSCAgents, 522 as middleware basis, 516 Optical communication beaconing, 87 Optimal camera placement, 147–52 camera placement, 147–52 camera views from, 160 discretization of camera/tag spaces, 147–48 FIX_CAM, 149–51 GREEDY, 151–52 MIN_CAM, 148–49 simulation experiments, 152–58 strategy comparison, 158–60 Optimal configuration (visual sensor network), 139–61 camera placement, 147–52 experimental results, 152–60 introduction, 140–41 related work, 141–42 visibility model, 142–44 visibility model for visual tagging, 144–47 Optimal placement (visual sensors), 117–37 for accurate reconstruction, 119 algorithms, 118 approaches, 124–31
definitions, 120–21 dual sampling algorithm, 129–30, 131 exact algorithms, 124–28 experiments, 131–36 extensions, 136–37 FOV modeling, 121–22 greedy search algorithm, 128–29 heuristics, 128–30 illustrated, 118 introduction, 117–20 linear programming, 124–28 for localization, 119 modeling space, 123–24 problem formulation, 120–24 problem statements, 121 random selection and placement, 130–31 related work, 118–20 Optimistic estimator, 173 Optimization asynchronous, 171–73 computational cost, 141 in discrete domain, 142 performance, 140 suboptimal, 396 for triangulating objects, 142 Orientation trackers, 359 Orthographic projection, 7 Outdoor scenario (co-training), 325, 328–29 classifiers, 329 co-training performance characteristics, 330 detection results, 331 tri-training performance characteristics, 330 Overhead trackers, 353–56 example illustration, 356 hypotheses, 353, 354 initialization, 355–56 run on separate computer, 357
P Pan-tilt-zoom (PTZ) cameras, 118 assignment, 166 calibration, 168, 191–92 conception, 189 control during operations, 184
control illustration, 181 controlling, 165–86 in critical infrastructure protection, 449–51 experiments, 178–86 fixed/active, mapping between, 190 fixed point, 419 focal lengths, 195 geometry, 192–93 head position, 190 home position, 168 idle mode, 178 limits, 177 in master-slave configuration, 193 mechanical nature, 195 multiple, 419–20 parameters, 179 performance, 167 plans, 168 real-time collaborative control, 186 reference view, 196 scheduling, 179 scheduling problem, 167 settings computation, 167 slave, 190 states, quality objectives, 180 use of, 166 views, 181 zoom capabilities, 190 See also PTZ camera networks Parabolic mirror, 242 Parameterizations axis-angle, 24 EPI volume, 301 minimal model, 194–95 Parameters automatic tuning, 38–39 design, 147 extracting, 6–7 extrinsic, 6, 249–50 fixed, 143 intrinsic, 6, 247–49 PTZ, 179 quantitative result for, 203 random, 143 segmentation, 341 skew, 193 slave camera, 205 unknown, 147 Particle filtering, 228 as multiple-hypothesis approach, 353
use examples, 225 Particle filters algorithm, 197 Bayesian filtering through, 225 framework, 228 predictive capabilities, 228–29 Particles distribution, 203 effect on head localization error, 208 in head localization, 202 tracking, 203 uncertainty, 203, 204 Performance FIX_CAM, 159 MIN_CAM, 153–54 multi-modal systems, 214 optimization, 140 PTZ cameras, 167 Performance curves co-training (indoor scenario), 326 tri-training (indoor scenario), 327 Person detectors, 313–14 Perspective projection, 4 equations, 4 pinhole camera, 4, 5 weak, 7 Pervasive smart cameras (PSCs), 485 Phenoptic data, structure and properties, 299–301 Photo consistency, 342 Photo hull, 342 Physical architecture design centralized, 217–20 distributed, 217–20 network topologies, 220–21 See also Multi-modal data fusion Piecewise-linear boundaries, 100 Piecewise polynomial model, 305 Pinhole camera epipolar geometry, 251 perspective projection, 4, 5 vectors, 251 Pixel-to-track assignment, 396–98 Plane-to-map plane homography, 444
Planning horizon, 172 Plans, 168 computation, 171, 172 improvement during optimization, 181 PTZ, estimating best, 174 PTZ, replacement of, 172 Plenoptic function, 240, 299 high dimensionality, 299 illustrated, 300 2D, 301 3D, 301 PLEX software package, 106, 110 Point correspondence, 8 Polyhedral visual hull, 342 Pose estimation errors, 83 Position optimization, 395–96 POS streams, 458 Precalibration, 190–91 Predictive encoding procedure, 269 Primitive events, 461 definition editor interfaces, 467 keyframes, 472 list of, 466 user interfaces and, 465–68 See also Events PRISM codec, 280–82 compensation, 286 decoding, 282 encoding, 281–82 encoding/decoding processes, 280 INTER blocks, 288 INTRA blocks, 288 in multi-view setup, 290 problems, 291–92 Probabilistic data association (PDA), 223 Probabilistic framework, 52–61 dynamic object and static occluder comparison, 61 multiple dynamic objects, 57–61 static occluder, 52–57 See also Dynamic scene reconstruction Probability density function 2D Gaussian, 447 of planar homology, 200
Probability of nonocclusion, 394 Processing module (smart camera), 502–4 Projective factorization, 22, 23 Projective geometry, 241–43 Projective reconstruction, 22 Projective transformations, 14–18 estimating, 16–17 inducing, 16 rectifying, 17–18 Pseudocode, network merge, 83 PTZ camera networks, 189–90 camera geometry, 192–93 conclusions, 208–9 control laws for trajectories, 190 cooperative target tracking, 195–98 experimental results, 203–8 extension to wider areas, 198–200 high-level functionalities, 190 illustrated, 194 with master-slave configuration, 193–95 minimal model parameterization, 194–95 related work, 190–92 SLAM and, 192 vanishing line, 200–203 See also Pan-tilt-zoom (PTZ) cameras Pursuer-evader problem, 544
Q Quadtree-based compression, 302 Query-based search, 469–72 page illustration, 470 results, 470 See also Composite event search
R Radial lens distortion, 7 Radio frequency identification (RFID), 458 Radio observations, 226
Random hypothesis, 42 Random placement algorithm, 130–31 problem solution, 130–31 results, 133 Random sampling, 32 Random walk, 158 RANSAC, 8, 29 algorithm, 32, 34 homography computation, 199 Rate distortion function, 277 Raw-data fusion, 531 Real-time 3D body pose estimation. See Body pose estimation Real-Time CORBA (RT-CORBA), 517 Real-time tracking, 337 Reconstruction affine, 20–21 body pose estimation, 341–44 dynamic scene, 49–71 example results, 358 incremental approach, 41–42 metric, 22–24, 42, 48 projective, 22 shape from silhouette (SFS), 30 Recovering topology (2D case), 103–7 algorithms, 104–6 cameras, 103–4 environment, 103 problem, 104 simulation, 106–7 synchronization, 104 target, 104 Recovering topology (2.5D case), 108–13 cameras, 108 CN-complex, 109–10 environment, 108 experimentation, 110–13 homotopic paths, 112, 113 layout for experiment, 111 mapping to 2D, 109 problem, 108 recovered CN-complex for layout, 112 target, 108 Recovery from silhouettes, 34–39 approach, 35–37
automatic parameter tuning, 38–39 hypothesis generation, 36–37 model verification, 37–39 Rectification homography, 382 projective transformations, 17–18 in stereo vision, 253 Recursive tracking, 198, 200 Reference broadcast synchronization (RBS), 222 Reference poses, 81 determination, 83 refinement, 80 Refined inference, 62 Region event, 466 Regions classification, 400 Remote Method Invocation (RMI), 516, 517 Resampling procedure, 224 sequential importance (SIR), 225 Resectioning, 7–8 Resource monitoring, 533 Retail loss prevention application, 472–74 composite event formulation, 473 fake-scan action detection, 474 testing data, 474 Reversible-jump Markov chain Monte Carlo (RJ-MCMC), 363, 378–82 efficiency, 378 human detection, 379 inference-sampling scheme, 385–86 move proposals, 379–81 multi-person tracking with, 382 sampling, 368 summary, 382 Rotation invariance, 352–57 experiments, 356–57 overhead tracker, 353–56 See also Body pose estimation Rotation matrix, 9, 24
S Sampling frequencies, 137 Sampling rate, 117
Scale-Invariant Feature Transform (SIFT) contours, 140 descriptors, 19, 197, 459–60 detector, 19 feature example, 20 feature extraction, 196, 198 keypoints, 198 matching approach, 196 point extraction, 197 visual landmarks, 192, 196–98 Scatterness matrix, 346 Scene calibration, 320–21 Scene monitoring, 415 Scene occlusions, 398 Scene points color relationship, 5 projection, 4, 193 in world coordinate system, 5 Scene voxel state space, 57 Scheduling algorithm, 167 problem, 167 PTZ, intelligent, 179 PTZ, objective function, 170–71 Schur complement, 558 Scripting languages, 520 Security, smart camera, 492–93 Segmentation, 339–41 adaptive threshold, 340 collinearity test, 340 darkness offset, 341 foreground, 339, 389 importance factor, 341 static threshold, 341 user-defined parameters, 341 See also Body pose estimation Selective update, 398 Self-calibration, 191–92 flexibility over accuracy, 192 LED method, 191–92 weaker, 192 Self-occlusion, 143, 146 Self-organizing map (SOM), 230, 233 Sensor fusion, 530–33 decision fusion, 531 dynamic task loading, 533
feature fusion, 531 fusion model, 531–32 middleware support, 532–33 raw-data fusion, 531 research, 530–31 resource monitoring, 533 sensor interfaces, 532–33 time synchronization, 533 Sensor model, 55–56 generative, 52 per-pixel process, 55 silhouette state, 56 Sensor modules (smart camera), 501–2 Sequential importance resampling (SIR), 225 Shape candidate set, 258 initialization and refinement, 62–63 similarity constraints, 258 Shape-from-photo consistency, 49–50 Shape-from-silhouette (SFS), 30, 341 advantages, 50 algorithms, 342 occlusion problems, 50 voxel-based, 342 Silhouette-based methods occlusions, 30 results, 30–31 Silhouette-based modeling, 50–52 difficulty and focus in, 50 solutions, 50–51 Silhouette coherence, 34 Silhouette cues, 49–71 Silhouettes convex hull from, 35 epipolar geometry from, 34 formation term, 60 interpolation, 44, 48 modeling from, 50–52 multi-object reasoning, 51 outer tangents to, 35, 36 recovery from, 34–39 Silicon-retina sensors, 502 Simplicial complex, 97 Simulation association evaluation, 426–28 Cyclops cameras, 92 Monte Carlo, 154
optimal camera placement, 152–58 Sony PTZ cameras, 90 in 2D, 106–7 Simultaneous localization and mapping (SLAM), 20, 192 Single-camera person tracking, 393–400 occlusion detection and classification, 398–400 pixel-to-track assignment, 396–98 position optimization, 395–96 tracking algorithm, 394–98 Single-camera surveillance system architecture, 439 Single-object dynamic model, 373–74 2D-to-3D localization uncertainties, 373–74 See also Dynamic model Single-object state/model representation, 370–71 Single smart cameras, 483 classification, 486 SoC, 486 VISoc, 486 See also Smart cameras Skew camera, 42 factor, 248 symmetric matrix, 12 Slat removal, 383 Slave cameras parameters, 205 See also Master-slave configuration Slepian-Wolf theorem, 272–74 compression strategy, 299 problem, 297, 299 rate region, 297 SmartCam, 488 Smart camera networks, 481–570 advanced communication capabilities, 498 architecture, 501–5 bottom-up algorithms, 499 centralized processing, 498–99 classifications, 483 communication modules, 504–5 conclusions, 493–94
distributed, 483, 484, 500, 511–34 distributed algorithms, 491–92 dynamic and heterogeneous architectures, 492 embedded middleware, 515–19 evolution of, 485–90 future and challenges, 490–93 introduction, 483–85 low power consumption, 498, 501 privacy and security, 492–93 processing module, 502–4 sensor modules, 501–2 service orientation, 493 single camera, 483, 486–87 toward, 483–94 user interaction, 493 wireless, 483, 488–90, 497–508 Smart cameras abstracted data, 513 architecture, 512 benefits, 485 distributed (DSCs), 485, 487–88, 511 DSP-based, 486, 487 embedded, 486, 487 embedded, mobile agents for, 521 evolution, 484 first generation prototype, 486 FPGA-based, 486, 487 onboard computing infrastructure, 513 with onboard processors, 499 output, 484 pervasive (PSCs), 485 single, 483, 484, 486–87 TRICam, 486, 487 Smart environments, 497 Smart spaces, 364 Smoothness function, 255 SoC smart camera, 486 Sony PTZ cameras experimental deployment, 88, 89 localization error, 89 node density, 90–92
orientation error, 89 simulation, 90 successfully localized, 91 testbed, 88 Space(s), 120 carving, 342 for comparing imposed camera placement approaches, 132 complex examples, 134–36 definition, 137 modeling, 123–24 open, 137, 364 smart, 364 Sparse approximations, 256–58 Sparse Bayesian learning (SBL), 229 Spatial alignment, 222–23 Spatio-temporal composite event detection, 461–68 agent/engine, 462 central unit, 462 data server, 463 end-user terminal, 462 event description language, 465 event representation and detection, 463–65 illustrated, 462 primitive events, 461, 465–68 system infrastructure, 461–63 See also Event detection Spherical camera model, 243–45 epipolar geometry for, 252 illustrated, 244 Spherical continuous wavelet transform (SCWT), 245, 246 Spherical images, 243 epipoles on, 253 Spherical imaging, 239–62 Spherical projection model, 244 Spherical wavelets, 246 SPOT algorithm, 337 Stanford approach, 282–85 channel codes, 288 decoding, 284–85 encoding, 283–84
encoding and decoding processes, 284 frame encoding and decoding scheme, 283 in multi-view setting, 290 problems, 292 State estimation, 558–60 initialization, 559 multi-modal techniques for, 223–29 recursive Bayesian, 223 synchronization, 559 Static occluders, 31, 52–57 computation, 63 dynamic object comparison, 61 dynamic occupancy priors, 55 image sensor model, 55–56 inference, 56–57, 69–71 joint distribution, 54–55 observed variables, 53–54 shape retrieval results, 64 viewing line modeling, 54 Static threshold, 341 Stationary cameras with nonoverlapping FOV, 419 with overlapping FOV, 418–19 Statistical and Knowledge-Based Object Detector Algorithm (SAKBOT), 390 Statistical pattern recognition, 389–412 background modeling, 391–93 Bayesian-competitive consistent labeling, 400–404 experimental results, 409–12 introduction, 389–91 performance evaluation, 409 single-camera person tracking, 393–400 summary of experimental results, 411 trajectory shape analysis, 404–9 Statistical variables, 57–59 latent viewing line, 58–59 observed appearance, 58
scene voxel state space, 57 Stochastic meta descent (SMD), 337 Structure from motion (SFM), 20 Subframe synchronization, 43–44, 48 Sufficient statistics, 407, 408 Support vector machine (SVM), 339 Surveillance applications, 231 Symmetric encoding, 275 Synchronization camera network, 42–44 recovering topology (2D case), 104 reference broadcast (RBS), 222 subframe, 43–44, 48 time, 222
T Tailgating detection application, 474–76 camera and sensor layout, 475 primitive events, 474 results, 475–76 Target-camera distance, 176 Targets capture time, 182 capturing from ready position, 184 cooperative, tracking, 195–98 dynamics, modeling, 170 hand-off, 527–28 localized, 203 map, 448, 449 relative intensity distribution, 448 size and appearance, 449 in 2D, 204 in 2.5D, 108 tracker observation, 170 Target-zone boundary distance, 176–77 Temporal alignment, 222 3D Haarlets, 347, 349, 359 3D scenes, distributed coding, 258–61 Time complexity, 65 Time synchronization, 222, 533 TinyDB, 518 TinyOS, 517–18
Trackers, 168–69, 170 Tracking Ad Hoc, 394 agent-based, 543 agents, 527 appearance-based, 393 body pose estimation, 337–38 in camera control, 168–70 cluster-based, 539–70 cooperative target, 195–98 decentralized, 526–30 instance, 527 mean-shift, 460 methods, 337 multi-camera, 166, 367–68 multi-modal, 225–28 multi-person Bayesian, 363–87 object, 363 particles, 203 real-time, 195, 337 recursive, 198, 200 with SIFT visual landmarks, 196–98 single-camera person, 393–400 video-radio, 225–27 with wireless camera networks (WCNs), 546–48 Trajectory shape analysis, 404–9 circular statistics, 404 classification, 407–9 See also Statistical pattern recognition Transparent migration, 521 TRICam smart camera, 487 Triple points, 39 resolving, 40–41 Tripwire primitive event, 466 Tri-training performance characteristics (indoor scenario), 328 performance characteristics (outdoor scenario), 330 performance curves (indoor training), 327 See also Co-training Two-camera geometry, 8–14 epipolar geometry, 10–11 fundamental matrix, 11–14 image coordinates, 8
U Uniform camera placement, 160 Unscented Kalman filter (UKF), 224, 569 User interfaces (UI), 463, 465–68
V Vanishing lines master, 201 in master camera view, 200 for zoomed head localization, 200–203 Vanishing point exploiting, 402 vertical mapping, 383 Video analytics algorithms, 452 Video coding applying DSC to, 278–80 classic approach, 268–72 research field, 267 Wyner-Ziv (WZ), 279 Video compression, 267–92 DSC, 272–78 DVC, 278–92 introduction, 267–68 Video conferencing, 234 Video-radio tracking, 225, 227 View angle, 173–75 frontal face and, 175 measuring, 174 quality function, 175 Viewing line dependency terms, 59 latent, 58–59 modeling, 54 unobserved variables, 54 Virtual grid, transitional probabilities, 158 Virtualized reality, 30 Visibility average, 150 function, 144, 145 maximizing, 149–51 minimizing number of cameras for, 148–49 tagging, 144–47 Visibility model, 142–47 differences in, 158 general, 142–44 self-occlusion, 143 of tag with orientation, 143 for visual tagging, 144–47
Vision-only classifier, 532 VISoc smart camera, 486 ViSOR, 391 Visual hull, 341 illustrated, 341 polyhedral, 342 reconstruction examples, 344 user orientations, 354 Visual tags, 142 conditions, 144 elevation of, 155–56 environmental occlusion, 145–46 field of view, 146 identifying, 144 mutual occlusion, 146 projected length, 145 projection, 145 requirement, 151 self-occlusion, 146 tracking, 144 visibility conditions, 145–46 visibility model, 144–47 Volumetric reconstruction using voxels, 342 Von Mises distributions, 405, 406
W Weak classifiers, 318, 319 WiCa wireless camera, 90, 489, 507, 508 See also Wireless smart camera networks Wireless camera networks (WCNs), 539 cluster-based object tracking, 539–70 distance-based criteria, 540–41 object tracking with, 546–48 Wireless sensor networks (WSNs), 539 lifespan, prolonging, 539 local data aggregation, 540 sensor nodes in, 539 Wireless smart camera networks, 483, 488–90, 497–98, 507–8 architecture illustration, 501 architecture overview, 497–508 CITRIC, 489, 490, 507
classification of, 489 CMUcam3, 489, 490, 506 communication modules, 504–5 Cyclops, 489 example, 505–7 low energy consumption, 498, 501 Meerkats, 488–89 MeshEye, 489–90, 505–6 middleware for, 519
motes, 505–7 processing in, 498–500 processing module, 502–4 sensor modules, 501–2 WiCa, 489, 490, 507 See also Smart camera networks Wyner-Ziv coding, 259, 260, 279 decoding (PRISM), 282 decoding (Stanford), 284–85 encoding (PRISM), 281–82
encoding (Stanford), 283–84 rate distortion performance, 287 Wyner-Ziv theorem, 277–78
Z Zero-order Bessel function, 404 Zoomed head localization, 200–203 Zoom factor, 207