Paolo Remagnino, Dorothy N. Monekosso, and Lakhmi C. Jain (Eds.) Innovations in Defence Support Systems – 3
Studies in Computational Intelligence, Volume 336

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com
Paolo Remagnino, Dorothy N. Monekosso, and Lakhmi C. Jain (Eds.)
Innovations in Defence Support Systems – 3 Intelligent Paradigms in Security
Dr. Paolo Remagnino
Kingston University
Faculty of Computing, Information Systems and Mathematics
Penrhyn Road Campus
Kingston upon Thames, Surrey KT1 2EE
United Kingdom
E-mail: [email protected]

Dr. Dorothy N. Monekosso
University of Ulster at Jordanstown
Faculty of Computing and Engineering
School of Computing and Mathematics
Shore Road, Newtownabbey BT37 0QB
United Kingdom
E-mail: [email protected]

Prof. Lakhmi C. Jain
SCT-Building
University of South Australia
Mawson Lakes Campus
Adelaide, South Australia
Australia
E-mail: [email protected]

ISBN 978-3-642-18277-8
e-ISBN 978-3-642-18278-5
DOI 10.1007/978-3-642-18278-5
Studies in Computational Intelligence ISSN 1860-949X
© 2011 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper
springer.com
This book collection is dedicated to all researchers in the field of intelligent environments.
Preface
Intelligent Paradigms in Security is a collection of articles introducing the latest advances in the field of intelligent monitoring. The book is intended for readers from a computer science or an engineering background. It describes techniques for the interpretation of sensor data in the context of the automatic understanding of complex scenes captured in large public areas. The aim is to guide the reader through a number of research topics for which the existing video surveillance literature describes partial and incomplete solutions, and to introduce the next challenges in intelligent video surveillance.

Each chapter in the book presents a solution to one aspect of the problem of monitoring a public space. The types of environment of interest are often characterized by clutter and by complex interactions between people and between people and objects, and each chapter proposes a sophisticated solution to a specific problem. Public environments, such as an airport concourse, a shopping mall, a train station and similar public spaces, are large and require numerous sensors for monitoring. The deployment of a large number of sensors produces a large quantity of video data (petabytes or larger) that must be processed; in addition, different scenes may require processing at different levels of granularity. Service robots might soon inhabit public areas, collecting global information but also approaching regions of interest in the scene to collect more detailed information, such as abandoned luggage. The processing of visual data, or of data from any sensor modality, must occur at a speed consistent with the objective. This might be real time, although some data and information can be processed off-line, for instance to generate new knowledge about the scene for the purpose of enhancing the system. The performance of these systems must be evaluated according to criteria that include time latency, accuracy in the detection of an event or of the global dynamics, and response to queries about the monitored events or the expected dynamics of the environment. To achieve this, the concept of normality is used: a model of what constitutes normality permits deviations and/or large anomalies in the behavior of people or in the position of objects in the monitored environment to be detected.

We are confident that our collection will be of great use to practitioners of the covered fields of research, but could also be formative for doctoral students and researchers in
intelligent environments. We wish to express our gratitude to the authors and reviewers for their time and vision, as well as to Springer for the assistance during the publication phase of the book.

London, Belfast, Adelaide
September 2010
Paolo Remagnino Dorothy N. Monekosso Lakhmi Jain
Acknowledgements
The editors wish to thank all the contributors of the collection for their endeavors and patience.
Contents

1 Data Fusion in Modern Surveillance
   Lauro Snidaro, Ingrid Visentini, Gian Luca Foresti
   1.1 Introduction
      1.1.1 Terminology in Data Fusion
      1.1.2 Motivation for Sensor Arrays
      1.1.3 The JDL Fusion Process Model
   1.2 Data Fusion and Surveillance
      1.2.1 JDL Model Contextualized to Surveillance: An Example
   1.3 A Closer Look at Fusion in Level 1
      1.3.1 Data Level
      1.3.2 Feature Level
      1.3.3 Classifier Level
      1.3.4 Combiner Level
      1.3.5 An Example: Target Tracking via Classification
   1.4 Sensor Management: A New Paradigm for Automatic Video Surveillance
   1.5 Conclusions
   References

2 Distributed Intelligent Surveillance Systems Modeling for Performance Evaluation
   Stefano Maludrottu, Alessio Dore, Carlo S. Regazzoni
   2.1 Introduction
      2.1.1 Surveillance Systems as Data Fusion Architectures
      2.1.2 Surveillance Tasks Decomposition
   2.2 Intelligent Surveillance System Modeling
      2.2.1 Introduction
      2.2.2 Smart Sensor Model
      2.2.3 Fusion Node Model
      2.2.4 Control Center Model
      2.2.5 Performance Evaluation of Multi-sensor Architectures
   2.3 A Case Study: Multi-sensor Architectures for Tracking
      2.3.1 Introduction
      2.3.2 Multisensor Tracking System Simulator
      2.3.3 Performance Evaluation of Tracking Architectures
      2.3.4 Experimental Results
   2.4 Conclusions
   References

3 Incremental Learning on Trajectory Clustering
   Luis Patino, François Bremond, Monique Thonnat
   3.1 Introduction
      3.1.1 Related Work
   3.2 General Structure of the Proposed Approach
   3.3 On-Line Processing: Real-Time Object Detection
   3.4 Trajectory Analysis
      3.4.1 Object Representation: Feature Analysis
      3.4.2 Incremental Learning
   3.5 Trajectory Analysis Evaluation
   3.6 Results
   3.7 Conclusions
   References
   Appendix 1
   Appendix 2

4 Highly Accurate Estimation of Pedestrian Speed Profiles from Video Sequences
   Panagiotis Sourtzinos, Dimitrios Makris, Paolo Remagnino
   4.1 Introduction
   4.2 Background
   4.3 Methodology
      4.3.1 Motion Detection and Tracking
      4.3.2 Static Foot Localization
      4.3.3 Speed Estimation
   4.4 Results
   4.5 Conclusions
   References

5 System-Wide Tracking of Individuals
   Christopher Madden, Massimo Piccardi
   5.1 Introduction
   5.2 Features
      5.2.1 Shape Features
      5.2.2 Appearance Features
      5.2.3 Mitigating Illumination Effects
   5.3 Feature Fusion Framework
   5.4 Results
   5.5 Conclusions
   References

6 A Scalable Approach Based on Normality Components for Intelligent Surveillance
   Javier Albusac, José J. Castro-Schez, David Vallejo, Luis Jiménez-Linares, Carlos Glez-Morcillo
   6.1 Introduction
   6.2 Previous Work
   6.3 Formal Model to Build Scalable and Flexible Surveillance Systems
      6.3.1 General Overview
      6.3.2 Normality in Monitored Environments
      6.3.3 Global Normality Analysis by Aggregating Independent Analysis
   6.4 Model Application: Trajectory Analysis
      6.4.1 Normal Trajectory Concept
      6.4.2 Set of Variables for the Trajectory Definition
      6.4.3 Preprocessing Modules
      6.4.4 Constraint Definition
   6.5 Experimental Results
   6.6 Conclusions
   References

7 Distributed Camera Overlap Estimation – Enabling Large Scale Surveillance
   Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, Christopher Madden, Rhys Hill
   7.1 Introduction
   7.2 Previous Work
   7.3 Activity Topology and Camera Overlap
      7.3.1 Formulation
   7.4 Estimating Camera Overlap
      7.4.1 Joint Sampling
      7.4.2 Measurement and Discrimination
      7.4.3 The Original Exclusion Estimator
      7.4.4 Accuracy of Pairwise Occupancy Overlap Estimators
   7.5 Distribution
      7.5.1 The Exclusion Approach
      7.5.2 Partitioning of the Joint Sampling Matrices
      7.5.3 Analysis of Distributed Exclusion
      7.5.4 Evaluation of Distributed Exclusion
   7.6 Enabling Network Tracking
   7.7 Conclusion
   References

8 Multi-robot Teams for Environmental Monitoring
   Maria Valera Espina, Raphael Grech, Deon De Jager, Paolo Remagnino, Luca Iocchi, Luca Marchetti, Daniele Nardi, Dorothy Monekosso, Mircea Nicolescu, Christopher King
   8.1 Introduction
   8.2 Overview of the System
      8.2.1 Video-Surveillance with Static Cameras
      8.2.2 Multi-robot Monitoring of the Environment
      8.2.3 Experimental Scenario
   8.3 Related Work
      8.3.1 Multi-robot Patrolling
      8.3.2 Multi-robot Coverage
      8.3.3 Task Assignment
      8.3.4 Automatic Video Surveillance
   8.4 Representation Formalism
      8.4.1 Problem Formulation
   8.5 Event-Driven Distributed Monitoring
      8.5.1 Layered Map for Event Detection
      8.5.2 From Events to Tasks for Threat Response
      8.5.3 Strategy for Event-Driven Distributed Monitoring
   8.6 Implementation and Results
      8.6.1 A Real-Time Multi-tracking Object System for a Stereo Camera – Scenario 1
      8.6.2 Maximally Stable Segmentation and Tracking for Real-Time Automated Surveillance – Scenario 2
      8.6.3 Multi-robot Environmental Monitoring
      8.6.4 System Execution
   8.7 Conclusion
   References

Index
The Editors
Dr Paolo Remagnino (PhD 1993, University of Surrey) is a Reader in the Faculty of Computing, Information Systems and Mathematics at Kingston University. His research interests include image and video understanding, pattern recognition, machine learning, robotics and artificial intelligence.
Dr Dorothy N. Monekosso (MSc 1992 and PhD 1999, University of Surrey) is a Senior Lecturer in the School of Computing and Mathematics at the University of Ulster at Jordanstown, Northern Ireland. Her research interests include machine learning, intelligent systems and intelligent environments applied to assisted living and robotics.
Professor Lakhmi C. Jain is the Director and Founder of the Knowledge-Based Intelligent Engineering Systems (KES) Centre, located at the University of South Australia. He is a Fellow of the Institution of Engineers Australia. His interests focus on artificial intelligence paradigms and their applications in complex systems, art-science fusion, virtual systems, e-education, e-healthcare, unmanned air vehicles and intelligent agents.
List of Contributors
Chapter 1
L. Snidaro · I. Visentini · G.L. Foresti
Department of Mathematics and Computer Science, University of Udine, 33100 Udine, Italy
E-mail: [email protected]

Chapter 2
S. Maludrottu · C.S. Regazzoni
Department of Biophysical and Electronic Engineering (DIBE), University of Genoa, Genoa, Italy
A. Dore
Institute of Biomedical Engineering, Imperial College London, London, SW7 2AZ, UK
E-mail: [email protected]

Chapter 3
Luis Patino · François Bremond · Monique Thonnat
INRIA Sophia Antipolis - Méditerranée, 2004 route des Lucioles - BP 93 - 06902 Sophia Antipolis
E-mail: [email protected]

Chapter 4
Panagiotis Sourtzinos · Dimitrios Makris · Paolo Remagnino
Digital Imaging Research Centre, Kingston University, UK
E-mail: [email protected]

Chapter 5
Christopher Madden
University of Adelaide, Australian Centre for Visual Technologies, Adelaide, SA 5007, Australia
Massimo Piccardi
University of Technology, Sydney, Department of Computer Science, Ultimo, NSW 2007, Australia
E-mail: [email protected]

Chapter 6
J. Albusac · J.J. Castro-Schez · D. Vallejo · L. Jiménez-Linares · C. Glez-Morcillo
Escuela Superior de Informática, Universidad de Castilla-La Mancha, Paseo de la Universidad, 4, 13071 Ciudad Real, Spain
E-mail: [email protected]

Chapter 7
Anton van den Hengel · Anthony Dick · Henry Detmold · Alex Cichowski · Christopher Madden · Rhys Hill
University of Adelaide, Australian Centre for Visual Technologies, Adelaide, SA 5007, Australia
E-mail: [email protected]

Chapter 8
Maria Valera Espina · Raphael Grech · Deon De Jager · Paolo Remagnino
Digital Imaging Research Centre, Kingston University, London, UK
Luca Iocchi · Luca Marchetti · Daniele Nardi
Department of Computer and System Sciences, University of Rome “La Sapienza”, Italy
Dorothy Monekosso
Computer Science Research Institute, University of Ulster, UK
Mircea Nicolescu · Christopher King
Department of Computer Science and Engineering, University of Nevada, Reno
E-mail: [email protected]

Chapter 1
Data Fusion in Modern Surveillance
Lauro Snidaro, Ingrid Visentini, and Gian Luca Foresti
Abstract. The performance of systems that fuse data coming from different sources is deemed to benefit from the heterogeneity and diversity of the information involved. The rationale behind this theory is the capability of one source to compensate for the errors of another, offering advantages such as increased accuracy and failure resilience. While in the past ambient security systems focused on the extensive usage of arrays of single-type sensors, modern scalable automatic systems can be extended to combine information coming from mixed-type sources. All this data and information can be exploited and fused to enhance situational awareness in modern surveillance systems. From biometrics to ambient security, from robotics to military applications, the blooming of multi-sensor and heterogeneous-based approaches confirms the increasing interest in the data fusion field. In this chapter we highlight the advantages of fusing information coming from multiple sources for video surveillance purposes. We thus present a survey of existing methods to outline how the combination of heterogeneous data can lead to better situation awareness in a surveillance scenario. We also discuss a new paradigm that could be taken into consideration for the design of next-generation surveillance systems.
1.1 Introduction
Automatic surveillance systems have gained significant attention in the past few years. This is due to an increasing need for assisting and extending the capabilities of human operators in remotely monitoring large and complex spaces such as public areas, airports, railway stations, parking lots, bridges, tunnels, etc. The last generation of surveillance systems was designed to cover larger and larger areas, dealing with multiple streams from multiple sensors [61, 12, 20]. They
are meant to automatically assess the ongoing activities in the monitored environment, flagging and presenting suspicious events to the operator as they happen in order to prevent dangerous situations. A key step in carrying out this task is analysing the trajectories of the objects in the scene and comparing them against known patterns. The system can in fact be trained by the operator with models of normal and suspicious trajectories in the domain at hand. As recent research shows, this process can even be carried out semi-automatically [66].

Real-time detection, tracking, recognition and activity understanding of moving objects from multiple sensors are fundamental issues to be solved in order to develop surveillance systems able to autonomously monitor wide and complex environments. Research can be conducted at different levels: from the preprocessing steps, involving mainly image processing algorithms, to the analysis and extraction of salient features representing the objects moving in the scene, to their classification, to the tracking of their movements, and to the analysis of their behaviour. The algorithms needed therefore span from image processing to event detection and behaviour understanding, and each of them requires dedicated study and research. In this context, data fusion plays a pivotal role in managing the information and improving system performance.

Data fusion is a relatively old concept: from the 1970s, when it bloomed in the United States, through the 1990s to the present day, this research field has remained popular due to its many forms and to the tangible benefits it provides. The manifest example of human and animal senses has been a critical motivation for integrating information from multiple sources into a reliable and feature-rich perception. In fact, even in the case of sensor deprivation, biological systems are able to compensate for the lacking information by reusing data obtained from sensors with an overlapping scope. Based on this domain model, the individual interacts with the environment and makes decisions about present and future actions [26].

Two main motivations brought data fusion to its current point of evolution. First, the amount of collected information was growing very fast, coherently with the expansion of commodity hardware. Storing the consequent increasing volume of available data was often considered too expensive in terms of resources such as space and time. Thus, the idea of “condensing” all this data into a single decision set, or into a subset of information with reduced dimensionality (but rich in semantics), rapidly took hold [28]. As a natural evolution, new kinds of sensors have appeared in the last two decades, opening the door to a wide range of possibilities. On the other hand, the concepts of redundancy, uncertainty and sensor feasibility promoted the marriage between sensors and fusion algorithms. The resemblance with human and animal senses encouraged the integration of information from heterogeneous sources to improve knowledge of the observed environment. The natural ability to fuse multi-sensory signals has evolved to a high degree in many animal species and has been in use for millions of years. Today the application of fusion concepts in technical areas constitutes a new discipline that spans many
fields of science, from economics [4, 36] to military applications [28], from commercial software [27] to video surveillance [64] and multimodal applications [62, 33], to cite a few.
1.1.1 Terminology in Data Fusion
At this point, it is necessary to formalize the concepts regarding data fusion. Unfortunately, there is not much agreement on the terminology for fusion systems. The terms sensor fusion, data fusion, information fusion, multi-sensor data fusion, and multi-sensor integration have been widely used in the technical literature to refer to a variety of techniques, technologies, systems, and applications that use data derived from multiple information sources. Fusion applications range from real-time sensor fusion for the navigation of mobile robots to the on-line fusion of human or technical strategic intelligence data [63]. Several attempts have been made to define and categorize fusion terms and techniques. In [73], Wald proposes the term “data fusion” to be used as the overall term for fusion. However, while the concept of data fusion is easy to understand, its exact meaning varies from one scientist to another. Wald uses “data fusion” for “a formal framework in which are expressed the means and tools for the alliance of data originating from different sources. It aims at obtaining information of greater quality; the exact definition of greater quality will depend upon the application”. This is also the meaning given by the Geoscience and Remote Sensing Society, by the U.S. Department of Defense, and in many papers regarding motion tracking, remote sensing, and mobile robots. Unfortunately, the term has not always been used with the same meaning in recent years. In some models, data fusion is used to denote the fusion of raw data [15]. Classic books on fusion, like “Multisensor Data Fusion” [74] by Waltz and Llinas and Hall’s “Mathematical Techniques in Multisensor Data Fusion” [29], propose an extended term, multisensor data fusion. According to Hall, it is defined as the “technology concerned with the combination of data from multiple (and possibly diverse) sensors in order to make inferences about a physical event, activity, or situation”. To avoid confusion on the meaning, Dasarathy decided to use the term information fusion as the overall term for fusion of any kind of data [16]. As a matter of fact, information fusion is an all-encompassing term covering all aspects of the fusion field. A literal definition of information fusion can be found in [16]: “Information fusion encompasses theory, techniques and tools conceived and employed for exploiting the synergy in the information acquired from multiple sources (sensor, databases, information gathered by human, etc.) such that the resulting decision or action is in some sense better (qualitatively or quantitatively, in terms of accuracy, robustness, etc.) than would be possible if any of these sources were used individually without such synergy exploitation”. By defining a subset of information fusion, the term sensor fusion is introduced as:
“Sensor fusion is the combining of sensory data or data derived from sensory data such that the resulting information is in some sense better than would be possible when these sources were used individually”.
1.1.2 Motivation for Sensor Arrays
Systems that employ sensor fusion methods expect a number of benefits over single-sensor systems. A physical sensor measurement generally suffers from the following problems:
• Sensor failure: The deprivation of the sensor means the total loss of the input.
• Limited spatial coverage: Depending on the environment, a single sensor might simply not be enough. As an obvious example, a fixed camera cannot monitor areas wider than its field of view.
• Limited temporal coverage: Some sensors may not be able to provide data at all times. Colour cameras, for instance, are useless at night without proper illumination and are usually replaced by infrared ones.
• Imprecision: The accuracy of the data is the accuracy of the sensor.
• Uncertainty: Uncertainty, in contrast to imprecision, depends on the object being observed rather than on the observing device. Uncertainty arises when features are missing (e.g., occlusions), when the sensor cannot measure all relevant attributes of the percept, or when the observation is ambiguous [51].
The video surveillance example is self-explanatory. A single camera alone cannot provide the needed spatial and temporal coverage. Data can be extremely inaccurate, since precision is seriously affected by the camera-target distance: the closer the target is to the limit of the field of view, the more imprecise is the information on the target's location the camera can provide. Noise and occlusions are the main sources of detection uncertainty in this case, not to mention the total blindness caused by sensor failures. On the contrary, broad areas can be monitored 24/7 by a network of heterogeneous sensors. Robust behaviour against sensor deprivation can be achieved by using cameras with overlapping fields of view. This also greatly reduces uncertainty, since multiple detections of the same target from different views can be available. This helps discriminating true positives (target's presence) from false positives (noise), and false negatives (occlusions) from true negatives (target's absence). The following advantages can be expected from the fusion of sensor data from a set of heterogeneous or homogeneous sensors [8]:
• Robustness and reliability: Multiple sensor suites have an inherent redundancy which enables the system to provide information even in case of partial failure.
• Extended spatial and temporal coverage: One sensor can look where others cannot, or can perform a measurement while others cannot.
• Increased confidence: A measurement of one sensor is confirmed by measurements of other sensors covering the same domain.
• Reduced ambiguity and uncertainty: Joint information reduces the set of ambiguous interpretations of the measured value.
• Robustness against interference: By increasing the dimensionality of the measurement space (e.g., measuring the desired quantity with optical sensors and ultrasonic sensors) the system becomes less vulnerable to interference.
• Improved resolution: When multiple independent measurements of the same property are fused, the resolution of the resulting value is better than that of a single sensor's measurement.
In [59], the performance of sensor measurements obtained from an appropriate fusing process is compared to the measurements of the single sensor. According to this work, an optimal fusing process can be designed if the distribution function describing the measurement errors of each particular sensor is precisely known. Intuitively, if all this information is available, this optimal fusing process performs at least as well as the best single sensor. A further advantage of sensor fusion is the possibility of reducing system complexity. In a traditionally designed system the sensor measurements are fed into the application, which has to cope with a large number of imprecise, ambiguous and incomplete data streams. This is also the case for multisensor integration, as will be described later. In a system where sensor data is preprocessed by fusion methods, the input to the controlling application can be standardized independently of the employed sensor types, thus facilitating application implementation and providing the capability of adapting the number and type of employed sensors without changing the software [19].
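To make the "improved resolution" point concrete, the short Python sketch below fuses independent noisy measurements of the same quantity (for instance a target position reported by several cameras) by inverse-variance weighting; under the stated assumptions of independent Gaussian errors the fused estimate is at least as good as the best single sensor. The numbers are invented for illustration and are not taken from the chapter.

```python
# Minimal sketch: inverse-variance fusion of independent measurements
# of the same quantity (hypothetical numbers, Gaussian noise assumed).

def fuse_measurements(values, variances):
    """Return the fused estimate and its variance.

    Each sensor i reports values[i] with error variance variances[i].
    Weighting each report by 1/variance gives a fused variance of
    1/sum(1/var_i), which is never larger than min(variances).
    """
    inv = [1.0 / v for v in variances]
    fused_var = 1.0 / sum(inv)
    fused_val = fused_var * sum(x * w for x, w in zip(values, inv))
    return fused_val, fused_var

# Three cameras estimate the same target's ground-plane x coordinate (metres).
values = [12.4, 12.9, 11.8]        # hypothetical reports
variances = [0.25, 1.00, 4.00]     # accuracy degrades with camera-target distance

estimate, var = fuse_measurements(values, variances)
print(f"fused x = {estimate:.2f} m, variance = {var:.2f}")
# The fused variance (about 0.19) is lower than the best single sensor's (0.25).
```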
1.1.3 The JDL Fusion Process Model
Several fusion process models have been developed over the years. The first and best known originates from the US Joint Directors of Laboratories (JDL) in 1985, under the guidance of the Department of Defense (DoD). The JDL model [28] comprises five levels of data processing and a database, which are all interconnected by a bus. The five levels are not meant to be processed in a strict order and can also be executed concurrently. Steinberg and Bowman proposed revisions and expansions of the JDL model involving broadening the functional model, relating the taxonomy to fields beyond the original military focus, and integrating a data fusion tree architecture model for system description, design, and development [67]. This updated model, sketched in Figure 1.1, is composed of the following levels:
• Level 0 - Sub-Object Data Assessment: estimation and prediction of signal/object observable states on the basis of pixel/signal level data association and characterization;
• Level 1 - Object Assessment: estimation and prediction of entity states on the basis of observation-to-track association, continuous state estimation (e.g. kinematics) and discrete state estimation (e.g. target type and ID);
Fig. 1.1 The JDL data fusion process model.
• Level 2 - Situation Assessment: estimation and prediction of relations among entities, to include force structure and cross-force relations, communications and perceptual influences, physical context, etc.;
• Level 3 - Impact Assessment: estimation and prediction of effects on situations of planned or estimated/predicted actions by the participants; to include interactions between action plans of multiple players (e.g. assessing susceptibilities and vulnerabilities to estimated/predicted threat actions given one's own planned actions);
• Level 4 - Process Refinement: adaptive data acquisition and processing to support mission objectives.
The model is deliberately very abstract, which sometimes makes it difficult to properly interpret its parts and to appropriately apply it to specific problems. However, as already mentioned, it was originally conceived more as a basis for common understanding and discussion between scientists than as a real guide for developers in identifying the methods that should be used [28]. A recent paper by Llinas et al. [37] suggests revisions and extensions of the model in order to cope with the issues and functions of today's applications. In particular, further extensions of the JDL model are proposed with emphasis on four areas: (1) remarks on issues related to quality control, reliability, and consistency in data fusion (DF) processing, (2) assertions about the need for co-processing of abductive/inductive and deductive inferencing processes, (3) remarks about the need for and exploitation of an ontologically-based approach to DF process design, and (4) discussion on the role of Distributed Data Fusion (DDF).
1.2 Data Fusion and Surveillance
While in the past ambient security systems focused on the extensive usage of arrays of single-type sensors [34, 71, 38, 50], modern surveillance systems aim to combine information coming from different types of sources. Multi-modal systems [62, 33], ever more often used in biometrics, or multi-sensor multi-cue approaches
[45, 21] fuse heterogeneous data in order to provide a more robust response and enhance situational awareness. The JDL model presented in Section 1.1 can be contextualized and fitted to a surveillance setting. In particular, we can imagine a typical surveillance scenario where multiple cameras monitor a wide area. A concrete example of how the levels of the JDL scheme can be reinterpreted can be found in Figure 1.2. In the proposed example, the levels correspond to specific video-surveillance tasks or patterns as follows:
• Level 0 - Sub-Object Data Assessment: the raw data streams coming from the cameras can be individually pre-processed. For example, they can be filtered to reduce noise, processed to increase contrast, or scaled down to reduce the processing time of subsequent elaborations.
• Level 1 - Object Assessment: multiple objects in the scene (typically pedestrians, vehicles, etc.) can be detected, tracked, classified and recognized. The objects are the entities of the process, but no relationships are involved yet at this point. Additional data such as, for instance, the map or the sensitive areas are a priori contextual information.
Fig. 1.2 Example of contextualization of the JDL scheme of Figure 1.1 to a surveillance scenario.
Fig. 1.3 Several fusion levels [41] in JDL Level 1.
• Level 2 - Situation Assessment: spatial or temporal relationships between entities are drawn here: a target moving, for instance, from a sensitive Zone1 to Zone2 can constitute an event. Simple atomic events are built considering brief stand-alone actions, while more complex events are obtained by joining several simple events. Possible alarms are raised to the operator at this point.
• Level 3 - Impact Assessment: a prediction of an event is an example of what, in practice, may happen at this step. The estimation of the trajectory of a potential target, or a prediction of the behaviour of an entity, can be a focus of this level. For instance, knowing that an object crossed Zone1 heading to Zone2, we can presume it will also cross Zone3 according to its current trajectory.
• Level 4 - Process Refinement: after the prediction given by Level 3, several optimizations regarding all the previous levels can be made in this phase. For instance, the sensors can be relocated to better monitor Zone3, new thresholds can be imposed in Level 0 procedures, or different algorithms can be employed in Level 1.
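As a toy illustration of how Level 2 and Level 3 could be realized in this scenario, the sketch below turns a track into zone-transition events and extrapolates the current heading to guess the next zone. Zone geometry, track coordinates and the extrapolation horizon are all invented for illustration.

```python
# Minimal sketch of the Level 2 / Level 3 example above: detect a
# Zone1 -> Zone2 transition and predict a future zone from the current
# heading. Zone layout and track points are hypothetical.

ZONES = {                       # axis-aligned zones: (xmin, ymin, xmax, ymax)
    "Zone1": (0, 0, 10, 10),
    "Zone2": (10, 0, 20, 10),
    "Zone3": (20, 0, 30, 10),
}

def zone_of(point):
    x, y = point
    for name, (x0, y0, x1, y1) in ZONES.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None

def detect_transitions(track):
    """Level 2: turn a track (list of (x, y)) into zone-transition events."""
    events, prev = [], zone_of(track[0])
    for point in track[1:]:
        cur = zone_of(point)
        if cur is not None and prev is not None and cur != prev:
            events.append((prev, cur))
        prev = cur if cur is not None else prev
    return events

def predict_next_zone(track, horizon=3.0):
    """Level 3: extrapolate the last displacement to guess the next zone."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return zone_of((x1 + horizon * (x1 - x0), y1 + horizon * (y1 - y0)))

track = [(2, 5), (6, 5), (9, 5), (12, 5), (15, 5)]   # hypothetical target path
print(detect_transitions(track))      # [('Zone1', 'Zone2')]
print(predict_next_zone(track))       # 'Zone3', given the current heading
```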
1.3 A Closer Look at Fusion in Level 1
Within Level 1 of the JDL model, several fusion approaches can be considered. For instance, we can outline a hierarchy of different fusion steps, as shown in Figure 1.3, that combine data, features, and classifiers according to various criteria [41]. The data level aims to fuse different data sets to provide, for instance, different views of the same object. The feature-based level merges the output of different features, such as colour histograms or shape features, or the response of a heterogeneous feature set. The classifier level focuses on the use of several heterogeneous base classifiers, while the combination level is dedicated to the study of different combiners that follow various ensemble fusion paradigms. Under these assumptions, each level can be considered as a black box, in which different functions and methods can be interchanged transparently.
More generally, this taxonomy reflects a bottom-up processing, from a low-level fusion that involves raw data coming straight from the sensors to a high-level combination of abstract and refined information. The data flow through the levels can be considered as a fan-in tree, as each step provides a response to the higher levels and receives input from the previous ones. A critical point in designing a data fusion application is deciding where the fusion should actually take place.
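The fan-in structure can be mimicked with plain function composition, where every level is a swappable black box whose outputs feed the level above. The concrete functions below are deliberately trivial placeholders, not methods proposed in the chapter.

```python
# Toy sketch of the bottom-up fan-in tree of Figure 1.3: data -> features
# -> classifiers -> combiner. Every stage is a swappable black box.

import statistics

def data_level(streams):                 # e.g. align / denoise raw samples
    return [sorted(s) for s in streams]

def feature_level(view):                 # e.g. extract simple statistics
    return [min(view), max(view), statistics.mean(view)]

def classifier_level(features):          # e.g. one score per feature vector
    return 1.0 if features[2] > 0.5 else 0.0

def combiner_level(scores):              # e.g. average of classifier scores
    return sum(scores) / len(scores)

def fuse(streams):
    views = data_level(streams)
    scores = [classifier_level(feature_level(v)) for v in views]
    return combiner_level(scores)

print(fuse([[0.9, 0.2, 0.8], [0.4, 0.1, 0.3]]))   # 0.5: one view fires, one does not
```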
1.3.1 Data Level
The data sources for a fusion process are not required to originate from identical sensors. McKee distinguishes direct fusion, indirect fusion, and fusion of the outputs of the former two [48]. Direct fusion means the fusion of sensor data from a set of heterogeneous or homogeneous sensors and history values of sensor data, while indirect fusion uses information sources like a priori knowledge about the environment and human input. Therefore, sensor fusion describes direct fusion systems, while information fusion also includes indirect fusion processes. The sensor fusion definition of Section 1.1.1 does not require that inputs are produced by multiple sensors; it only says that sensor data, or data derived from sensor data, have to be combined. For example, the definition also comprises sensor fusion systems with a single sensor that takes multiple measurements subsequently at different instants which are then combined. Another frequently used term is multisensor integration. Multisensor integration means the synergistic use of sensor data for the accomplishment of a task by a system. Sensor fusion is different from multisensor integration in the sense that it includes the actual combination of sensory information into one representational format [46]. The difference is outlined in Figure 1.4: while sensor fusion combines data (possibly applying a conversion) before handing it to the application, an application based on multisensor integration directly processes the data from the sensors. A classical video surveillance sensor fusion application is remote sensing, where imagery data acquired with heterogeneous sensors (e.g. colour and infrared cameras) is fused in order to enhance sensing capabilities before further processing. Sensors do not have to produce commensurate data either. As a matter of fact, sensors can be very diverse and produce different quantities. Sensor fusion reduces the incoming data into a format the application can further process.
Fig. 1.4 Conceptual comparison between (left) Sensor Fusion and (right) Multisensor Integration
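As a minimal instance of the data-level (direct) fusion just described, the sketch below blends a visible-light frame with a co-registered infrared frame by a fixed weighted average before any further processing. The synthetic frames and the 0.6/0.4 weighting are illustrative assumptions; a real system would need image registration and typically an adaptive weighting scheme.

```python
# Minimal sketch of pixel-level fusion of co-registered visible and
# infrared frames (direct fusion): a fixed-weight average per pixel.

import numpy as np

def fuse_frames(visible, infrared, w_visible=0.6):
    """Weighted average of two registered single-channel frames."""
    fused = w_visible * visible.astype(np.float32) \
        + (1.0 - w_visible) * infrared.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)

visible = np.random.randint(0, 256, (480, 640), dtype=np.uint8)   # luminance
infrared = np.random.randint(0, 256, (480, 640), dtype=np.uint8)  # thermal

fused = fuse_frames(visible, infrared)
print(fused.shape, fused.dtype)        # (480, 640) uint8, ready for detection
```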
1.3.2 Feature Level
Feature fusion (or selection) is a process used in machine learning and pattern recognition to obtain a subset of features capable of discriminating different input classes. Object detection and tracking in video sequences are known to benefit from the employment of multiple (heterogeneous) features (e.g. colour, orientation histograms, etc.), as shown in [13, 23, 75]. The combination of different features within a higher-level framework, such as the boosting meta-algorithm, can be a winning strategy in terms of robustness and accuracy [81]. In a surveillance context, a feature selection mechanism such as the one presented in [13] can be used to estimate the performance of a sensor in detecting a given target. The approach selects the most discriminative colour features to separate the target from the background by applying a two-class variance ratio to log-likelihood distributions computed from samples of object and background pixels. The algorithm generates a likelihood map for each feature, ranks the map according to its variance ratio value, and then proceeds with a mean-shift tracking system that adaptively selects the top-ranked discriminative features for tracking. The idea was further developed in [78], where additional heterogeneous features were considered and fused via likelihood maps. Another example of successful feature selection for target tracking can be found in [54], where the fusion focus is on a high-level integration of responses: the confidence maps coming from a feature extraction step are merged in order to find the target position. Heterogeneous weighted data is combined without involving classifiers and without maintaining a model of the object, performing instead a frame-to-frame tracking with salient features extracted at each epoch. In [23] the heterogeneity of the features improves recognition performance, even when the target is occluded.
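The following sketch is written in the spirit of the two-class variance-ratio criterion summarized above: a candidate feature is scored by how well the log-likelihood of its object and background histograms separates the two classes, and higher-scoring features are kept for tracking. The 8-bin histograms are hypothetical, and the exact formulation in [13] may differ in its details.

```python
# Sketch of a two-class variance-ratio score for feature selection:
# how well does the per-bin log-likelihood separate object pixels from
# background pixels? Histograms below are hypothetical 8-bin examples.

import numpy as np

def variance_ratio(p_obj, p_bg, eps=1e-3):
    """Higher is better: total spread of L over the within-class spreads."""
    p_obj = p_obj / p_obj.sum()
    p_bg = p_bg / p_bg.sum()
    L = np.log((p_obj + eps) / (p_bg + eps))      # log-likelihood per bin

    def var(weights):                              # variance of L under weights
        mean = np.sum(weights * L)
        return np.sum(weights * L ** 2) - mean ** 2

    return var(0.5 * (p_obj + p_bg)) / (var(p_obj) + var(p_bg) + eps)

# Two hypothetical features: A separates the classes, B barely does.
obj_a = np.array([40, 30, 10, 5, 5, 4, 3, 3], dtype=float)
bg_a  = np.array([3, 3, 4, 5, 5, 10, 30, 40], dtype=float)
obj_b = np.array([12, 13, 12, 13, 12, 13, 12, 13], dtype=float)
bg_b  = np.array([13, 12, 13, 12, 13, 12, 13, 12], dtype=float)

print(variance_ratio(obj_a, bg_a))   # higher score: keep this feature
print(variance_ratio(obj_b, bg_b))   # lower score: weak separation, discard
```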
1.3.3 Classifier Level
Classification is a crucial step in surveillance systems [79, 35]. Usually, a classifier is considered as a stand-alone entity that formulates a decision over a problem, providing a real-valued or binary output, or simply a label. A classifier aims to translate the feature outputs into meaningful information, producing higher-level decisions. One classifier can work on raw data or be bound to one or more features; these can be independent or conditionally independent because they refer to the same object. Some synonyms of classifier are hypothesis, learner and expert [41]. A classifier can be trained, meaning that it exploits past knowledge to formulate a decision (e.g., neural networks, Bayesian classifiers), or not (e.g., nearest neighbours, clustering algorithms); in the latter case the classifier usually considers a neighbourhood of samples to formulate its response. A wide survey on base classifiers is presented in [7, 41]. In the scheme presented in Figure 1.3, the classifier level refers to the usage of different heterogeneous classifiers to create a fusion scheme that aims to improve performance with respect to a single-type classifier algorithm. Heterogeneity
has, in fact, been linked to diversity [30] and its contribution to performance improvement has been empirically demonstrated in several cases [70, 30, 5].
1.3.4 Combiner Level
The problem of fusing different classifiers to provide a more robust and accurate detection has been studied since the early 1990s, when the first approaches to the problem were detailed [31, 60, 55, 32]. Classifier fusion has been proved to benefit from the diverse decision capabilities of the combined experts, thus improving classification performance with particular respect to accuracy and efficiency [39]. Two desirable conditions are the accuracy (low error rates) and the diversity (committing different errors) of the base learners. A strong motivation for classifier fusion is the idea that the information coming from several classifiers is combined to obtain the final classification response, considering each of their individual opinions and mutually compensating their errors [42]. The literature on classifier fusion techniques, that is, on the design of different combiners, is vast: the MCS workshop series has a noticeable importance in this sense, and classifier ensembles are used in a broad range of application fields, from medical imaging [57] to network security [22], from surveillance [72] to handwritten digit recognition [43, 80], including a large range of real-world domains [53]. Trying to formalize the benefits of aggregating multiple classifiers, Dietterich [18] and Kuncheva [41] gave a few motivations why multiple classifier systems may be better than a single classifier. The first one is statistical: the ensemble may not be better than the best individual classifier (the group's average performance is not guaranteed to improve on the single best classifier [41]), but the risk of picking an "inadequate single classifier" is reduced. The second reason why we should prefer an ensemble over a single classifier is computational: a set of decision makers, each one working on a subsection of the problem, can provide a solution in less time than a single base learner. The last motivation refers to the possibility that the classifiers' space does not contain the optimum. However, the ability of a mosaic of classifiers to approximate a decision boundary has to be considered; in this respect, it may be difficult or even impossible to adapt the parameters of an individual classifier to fit the problem, but a set of tuned classifiers can approach the solution with a good approximation. Under these considerations, the focus is on a fusion criterion that, even if suboptimal, achieves better performance than a single trained classifier. Typically, classifiers can be combined by several fusion rules, from simple fixed ones (e.g., sum, mean, majority vote) to more complex fusion schemes (e.g., belief functions, Dempster-Shafer evidence theory). Some meta-learners, such as the boosting technique, can be employed as well, preceded by an off-line training phase. An exhaustive survey of combination methods can be found in [41].
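A minimal sketch of two of the fixed combination rules mentioned above (mean rule and majority vote) applied to hypothetical per-class scores from three heterogeneous classifiers; the classifier names and scores are invented, and the example also shows that different combiners can disagree on borderline cases.

```python
# Minimal sketch of fixed combination rules for an ensemble: each entry is
# one classifier's score for the classes "person" and "vehicle".
# Classifier names and scores are hypothetical.

from collections import Counter

scores = {
    "colour_svm":   {"person": 0.62, "vehicle": 0.38},
    "shape_tree":   {"person": 0.55, "vehicle": 0.45},
    "motion_bayes": {"person": 0.30, "vehicle": 0.70},
}

def mean_rule(scores):
    classes = next(iter(scores.values())).keys()
    fused = {c: sum(s[c] for s in scores.values()) / len(scores) for c in classes}
    return max(fused, key=fused.get), fused

def majority_vote(scores):
    votes = Counter(max(s, key=s.get) for s in scores.values())
    return votes.most_common(1)[0][0], dict(votes)

print(mean_rule(scores))       # mean rule picks 'vehicle' (0.51 vs 0.49)
print(majority_vote(scores))   # majority vote picks 'person' (2 votes to 1)
```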
1.3.5 An Example: Target Tracking via Classification
Modern surveillance systems operating in complex environments, like an urban setting, have to cope with many objects moving in the scene at the same time and with events of ever increasing difficulty. In order to single out and understand the behaviour of every actor in the scene, the system should be able to track each target as it moves and performs its activities. Position, velocity, and the trajectory followed constitute basic pieces of information from which simple events can be promptly inferred. For this reason, a robust and accurate tracking process is of paramount importance in surveillance systems. Tracking is not an easy task, since sensor noise and occlusions are typical issues that have to be dealt with to keep each target associated with its ID. In a system with multiple cameras this task (multi-sensor multi-target tracking) is even more daunting, since measurements coming from different sensors but generated by the observation of the same target have to be correctly associated. Significant progress on the object tracking problem has been made during the last few years (see [77] for a recent survey). However, no definitive solution has been proposed for challenging situations such as illumination variations, appearance changes or unconstrained video sources. Tracking of an object can be performed at different levels. At sensor level, on the image plane of each sensor, the system executes an association algorithm to match the current detections with those extracted in the previous frame. In this case, available techniques range from template matching, to feature matching [11], to more recent and robust algorithms [14]. A further step is to mix different features to enhance the robustness of the tracker; as seen in Section 1.3.2, heterogeneous features have brought a big improvement in target localization and tracking. Recently the tracking-via-classification concept has received a new boost [3, 9, 25, 56, 69, 52]: a single classifier or a classifier ensemble tracks an object by separating the target from the background. In general, using each of these approaches to tackle the tracking problem we can potentially consider several fusion levels, as shown in Figure 1.5.
Fig. 1.5 Example of application
We can combine more than one source of information, for instance audio and video, as they have been repeatedly considered as a potential improvement over the limitations imposed by single-type sensors [10, 6, 62]. Different techniques can be applied to the data to extract heterogeneous relevant features; the joint usage of different information can significantly improve the quality of the tracking results [77]. Finally, tracking via classification is considered more robust to occlusions and illumination changes [24], and greater system robustness and performance are achievable with an ensemble of classifiers through information fusion techniques [9, 56].
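The skeleton below illustrates the tracking-via-classification idea in its simplest form: candidate windows around the previous target position are scored by a target-versus-background classifier and the best-scoring window becomes the new position estimate. The scoring function here is only a stand-in for a trained classifier or ensemble, and the frame is synthetic.

```python
# Skeleton of tracking-by-classification: score candidate windows around
# the last known position and move to the best one. The "classifier" is a
# placeholder scoring function (mean intensity of the window).

import numpy as np

def classify(frame, x, y, half=8):
    """Stand-in for a trained target-vs-background classifier."""
    patch = frame[max(y - half, 0): y + half, max(x - half, 0): x + half]
    return float(patch.mean()) if patch.size else -np.inf

def track_step(frame, prev_xy, search=10, step=2):
    px, py = prev_xy
    candidates = [(px + dx, py + dy)
                  for dx in range(-search, search + 1, step)
                  for dy in range(-search, search + 1, step)]
    return max(candidates, key=lambda c: classify(frame, c[0], c[1]))

frame = np.zeros((240, 320), dtype=np.uint8)
frame[100:120, 150:170] = 255            # bright hypothetical target
print(track_step(frame, prev_xy=(152, 104)))
# e.g. (158, 108): a window fully inside the bright region wins
```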
1.4 Sensor Management: A New Paradigm for Automatic Video Surveillance Video surveillance systems have been based on multiple sensors since their first generation (CCTV systems) [61]. Video streams from analog cameras were multiplexed on video terminals in control rooms to help human operators monitor entire buildings or wide open areas. The latest generation makes use of digital equipment to capture and transmit images that can be viewed virtually everywhere over the Internet. Initially, multi-sensor systems were employed to extend surveillance coverage over wide areas. The recent advances in sensor and communication technology, in addition to lower costs, allow multiple sensors to be used for monitoring the same area [64, 2]. This has opened new possibilities in the field of surveillance, as multiple and possibly heterogeneous sensors observing the same scene provide redundant data that can be exploited to improve detection accuracy and robustness, enlarge monitoring coverage and reduce uncertainty [64]. While the advantages of using multiple sources of information are well known to the data fusion community [44], the full potential of multi-sensor surveillance is yet to be discovered. In particular, richer sensor assets have made it possible to take advantage of data fusion techniques for solving specific tasks such as target localization and tracking [65] or person identification [62]. This can be formalized as the application of JDL Level 1 and 2 fusion techniques [44] to surveillance, strictly following a processing stream that exploits multi-sensor data to achieve better system perception performance and, in the end, improved situational awareness. A brief exemplification of the techniques that can be employed at Levels 1 and 2 has been presented in Section 1.2. While many technical problems remain to be solved for integrating heterogeneous suites of sensors for wide area surveillance, a principled top-down approach is probably still left unexplored. Given the acknowledged complexity of the architectures that can be developed nowadays, full exploitation of this potential is probably beyond the possibilities of a human operator. Consider, for example, all the possible combinations of configurations made available by modern sensors: Pan-Tilt-Zoom (PTZ) cameras can be controlled to cover different areas, day/night sensors offer different sensing modalities, radars can operate at different frequencies, etc. The larger the system, the more likely it is to be called upon to address many different surveillance needs. A top-down approach would be needed in order to develop surveillance systems that are
able to automatically manage large arrays of sensors in order to enforce surveillance directives provided by the operator, which in turn translate the security policies of the owning organization. Therefore, a new paradigm is needed to guide the design of architectures and algorithms in order to build the next generation of surveillance systems, able to organize themselves to collect data relevant to the objectives specified by the operator. This new paradigm would probably need to take inspiration from the principles behind the Sensor Management policies foreseen by JDL Level 4 [37]. JDL Level 4 is also called the Process Refinement step, as it implies adaptive data acquisition and processing to support mission objectives. Conceptually, this refinement step should be able to manage the system in its entirety: from controlling hardware resources (e.g. sensors, processors, storage, etc.) to adjusting the processing flow in order to optimize the behaviour of the system to best achieve mission goals. It is therefore apparent that the Process Refinement step encompasses a broad spectrum of techniques and algorithms that operate at very different logical levels. In this regard, a full-fledged implementation of Process Refinement would provide the system with a form of awareness of its own capabilities and of how they relate and interact with the observed environment. The part of Process Refinement dedicated to sensors and data sources is often called Sensor Management, and it can be defined as "a process that seeks to manage, or coordinate, the use of a set of sensors in a dynamic, uncertain environment, to improve the performance of the system" [76]. In other words, given the current state of the observed environment, a Sensor Management process should be able to translate mission plans or human directives into sensing actions aimed at acquiring additional or missing information, in order to improve situational awareness and fulfil the objectives. A five-layered procedure has been proposed in [76] and is reproduced in Figure 1.6. The chart schematizes a general sensor management process that can be used to guide the design of a real sensor management module. In the following, the different levels are contextualized in the case of a surveillance system.
Mission Planning This level takes as input the current situation and the requests from the human operator and performs a first breakdown of the objectives by trying to match them with the available services and functionalities of the system. In a surveillance system, the requests from the operator can be events of interest to be detected (e.g. a vehicle being stationary outside a parking slot) and alarm conditions (e.g. a person trespassing into a forbidden area). Each of the events should be given a priority by the operator. The Mission Planning module is in charge of selecting the functions to be used in order to detect the required events (e.g. target tracking, classification, plate reading, face recognition, trajectory analysis, etc.). In practice, this module should work much like a compiler: starting from the description of the events of interest expressed in a high-level language, it parses the description and determines
Fig. 1.6 Five-layered sensor managing process [76].
the relevant services to be employed. The module will also identify the areas to be monitored, the targets to look for, the frequency of measurements and the accuracy level.
Resource Deployment This level identifies the sensors to be used among the available ones. If mobile and/or active sensors are available, their repositioning may be needed [49]. In particular, this level takes into consideration aspects such as coverage and sensing modality. For example, depending on the time of day at which a certain event is to be detected, one sensor may be preferred over another.
Resource Planning This level is in charge of tasking the individual sensors (e.g. movement planning for active sensors [17, 49]) and coordinating them (e.g. sensor hand-overs) in order to carry out a certain task (e.g. tracking). The level also deals with sensor selection
techniques that can choose for every instant and every target the optimal subset of sensors for tracking or classifying it. Several approaches to sensor selection have been proposed in the literature such as, for example, information gain based [40] and detection quality based [65].
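As a toy illustration of this sensor selection step, the snippet below ranks sensors by a per-target quality score and keeps the k best, loosely in the spirit of the quality-based approaches cited above; the scores, sensor names and the value of k are invented for the example.

```python
def select_sensors(quality, k=2):
    """Return the ids of the k sensors with the highest quality score.

    quality: dict mapping sensor id -> detection-quality score in [0, 1]
    (e.g. derived from contrast, target size on the image plane, SNR...).
    """
    ranked = sorted(quality.items(), key=lambda item: item[1], reverse=True)
    return [sensor_id for sensor_id, _ in ranked[:k]]

# Illustrative per-target quality scores reported by four cameras.
quality = {"cam1": 0.42, "cam2": 0.91, "cam3": 0.15, "cam4": 0.66}
print(select_sensors(quality, k=2))  # -> ['cam2', 'cam4']
```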
Sensor Scheduling Depending on the planning and requests coming from Resource Planning, this level is in charge of determining a detailed schedule of commands for each sensor. This is particularly appropriate for active (e.g. PTZ cameras), mobile (e.g. robots) and multimode (day/night cameras, multi-frequency radar) sensors. The problem of sensor scheduling has been addressed in [47], and a recent contribution on the scheduling of visual sensors can be found in [58].
Sensor Control This is the lowest level and possibly also the simplest. The purpose of this level is to optimize sensor parameters given the current commands imposed by Level 1 and 2. For video sensors this may involve regulating iris and focus to optimize image quality. Although this is performed automatically by sensor hardware in most of the cases, it could be beneficial to manage sensor parameters directly according to some figure of merit which is dependent on the content of the image. For example, contrast and focus may be adjusted specifically for a given target. An early treatment of the subject may be found in [68], while a recent survey may be found in [1].
1.5 Conclusions Borrowed from biological systems, where multiple sensors with an overlapping scope are able to compensate for the lack of information from other sources, the data fusion field has gained momentum since its first appearance in the early 70s. Its rapid spread from biometrics to ambient security, from robotics to everyday applications, matches the rise of multi-sensor and heterogeneous approaches. In this regard, modern surveillance systems are moving away from the single-type sensors of the past toward techniques that rely on mixed-type sensors and exploit multiple cues to solve real-time tasks. Data fusion is a necessary tool to combine heterogeneous information, to provide the flexibility to manage unpredictable events and to enhance situational awareness in modern surveillance systems. In this chapter, we discussed the impact of fusing information coming from multiple sources, presenting a survey of existing methods and proposing some examples to outline how the combination of heterogeneous data can lead to better situation awareness in a surveillance scenario. We have also presented a possible new paradigm regarding sensor management that could be taken into account in the design of next-generation surveillance systems.
References [1] Abidi, B.R., Aragam, N.R., Yao, Y., Abidi, M.A.: Survey and analysis of multimodal sensor planning and integration for wide area surveillance. ACM Computing Surveys 41(1), 1–36 (2008), DOI http://doi.acm.org/10.1145/1456650.1456657 [2] Aghajan, H., Cavallaro, A. (eds.): Multi-Camera Networks. Elsevier, Amsterdam (2009) [3] Avidan, S.: Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(2), 261–271 (2007), http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.35 [4] Baker, K., Harris, P., O’Brien, J.: Data fusion: An appraisal and experimental evaluation. Journal of the Market Research Society 31(2), 152–212 (1989) [5] Bian, S., Wang, W.: On diversity and accuracy of homogeneous and heterogeneous ensembles. International Journal of Hybrid Intelligent Systems 4(2), 103–128 (2007) [6] Bigun, J., Chollet, G.: Special issue on audio-based and video-based person authentication. Pattern Recognition Letters 18(9), 823–825 (1997) [7] Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. Springer, Heidelberg (2007) [8] Boss, E., Roy, J., Grenier, D.: Data fusion concepts applied to a suite of dissimilar sensors. In: Proceedings of the Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 692–695 (1996) [9] Chateau, T., Gay-Belille, V., Chausse, F., Laprest, J.T.: Real-time tracking with classifiers. In: European Conference on Computer Vision (2006) [10] Choudhury, T., Clarkson, B., Jebara, T., Pentl, A.: Multimodal person recognition using unconstrained audio and video. In: International Conference on Audio- and VideoBased Person Authentication, pp. 176–181 (1998) [11] Collins, R.T., Lipton, A.J., Fujiyoshi, H., Kanade, T.: Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE 89(10), 1456–1477 (2001) [12] Collins, R.T., Lipton, A.J., Kanade, T.: Introduction to the special section on video surveillance. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 745–746 (2000) [13] Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1631– 1643 (2005) [14] Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis Machine Intelligence 25(5), 564–575 (2003) [15] Dasarathy, B.V.: Sensor fusion potential exploitation-innovative architectures and illustrative applications. Proceedings of the IEEE 85, 24–38 (1997) [16] Dasarathy, B.V.: Information fusion - what, where, why, when, and how? Information Fusion 2(2), 75–76 (2001) [17] Denzler, J., Brown, C.: Information theoretic sensor data selection for active object recognition and state estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(2), 145–157 (2002) [18] Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000) [19] Elmenreich, W., Pitzek, S.: Using sensor fusion in a time-triggered network. In: Proceedings of the 27th Annual Conference of the IEEE Industrial Electronics Society, Denver, CO, USA, vol. 1, pp. 369–374 (2001)
[20] Foresti, G.L., Regazzoni, C.S., Varshney, P.K.: Multisensor Surveillance Systems: The Fusion Perspective. Kluwer Academic Publisher, Dordrecht (2003) [21] Gavrila, D.M., Munder, S.: Multi-cue pedestrian detection and tracking from a moving vehicle. Int. J. Comput. Vision 73(1), 41–59 (2007), DOI http://dx.doi.org/10.1007/s11263-006-9038-7 [22] Giacinto, G., Perdisci, R., Rio, M.D., Roli, F.: Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Information Fusion 9(1), 69–82 (2008); Special Issue on Applications of Ensemble Methods, doi:10.1016/j.inffus.2006.10.002 [23] Gouet-Brunet, V., Lameyre, B.: Object recognition and segmentation in videos by connecting heterogeneous visual features. Computer Vision and Image Understanding 111(1), 86–109 (2008) [24] Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proceedings of the British Machine Vision Conference (BMVC), vol. 1, p. 4756 (2006) [25] Grabner, H., Sochman, J., Bischof, H., Matas, J.: Training sequential on-line boosting classifier for visual tracking. In: International Conference on Pattern Recognition (2008) [26] Grossmann, P.: Multisensor data fusion. The GEC Journal of Technology 15, 27–37 (1998) [27] Hall, D.L., Linn, R.J.: Survey of commercial software for multisensor data fusion. In: Aggarwal, J.K., Nandhakumar, N. (eds.) Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 1956, pp. 98–109 (1993) [28] Hall, D.L., Llinas, J.: An introduction to multisensor data fusion. Proceedings of the IEEE 85(1), 6–23 (1997) [29] Hall, D.L., McMullen, S.A.: Mathematical Techniques in Multisensor Data Fusion. Artech House, Boston (2004) [30] Hsu, K.W., Srivastava, J.: Diversity in combinations of heterogeneous classifiers. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS, vol. 5476, pp. 923–932. Springer, Heidelberg (2009) [31] Hu, W., Hu, W., Maybank, S.: Adaboost-based algorithm for network intrusion detection. IEEE Transactions on Systems, Man, and Cybernetics, Part B 38(2), 577–583 (2008) [32] Islam, M., Yao, X., Nirjon, S., Islam, M., Murase, K.: Bagging and boosting negatively correlated neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B 38(3), 771–784 (2008), doi:10.1109/TSMCB.2008.922055 [33] Jain, A., Hong, L., Kulkarni, Y.: A multimodal biometric system using fingerprints, face and speech. In: 2nd International Conference on Audio- and Video- based Biometric Person Authentication, pp. 182–187 (1999) [34] Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV), vol. 2, pp. 952–957 (2003), doi:10.1109/ICCV.2003.1238451 [35] Javed, O., Shah, M.: Tracking and object classification for automated surveillance. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 343–357. Springer, Heidelberg (2002) [36] Jephcott, J., Bock, T.: The application and validation of data fusion. Journal of the Market Research Society 40(3), 185–205 (1998) [37] Llinas, J., Bowman, C.L., Rogova, G.L., Steinberg, A.N., Waltz, E.L., White, F.E.: Revisiting the JDL data fusion model II. In: Svensson, P., Schubert, J. (eds.) Proceedings of the Seventh International Conference on Information Fusion, vol. II, pp. 1218–1230. International Society of Information Fusion, Stockholm (2004), http://www.fusion2004.foi.se/papers/IF04-1218.pdf
[38] Kang, J., Cohen, I., Medioni, G.: Multi-views tracking within and across uncalibrated camera streams. In: IWVS 2003: First ACM SIGMM International Workshop on Video Surveillance, pp. 21–33. ACM, New York (2003), DOI http://doi.acm.org/10.1145/982452.982456 [39] Kittler, J., Hatef, M., Duin, R.P., Matas, J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) [40] Kreucher, C., Kastella, K., Hero III, A.O.: Sensor management using an active sensing approach. Signal Processing 85(3), 607–624 (2005), http://www.sciencedirect.com/science/article/ B6V18-4F017HY-1/2/a019fa31bf4135dfdcc38dd5dc6fc6c8, doi:10.1016/j.sigpro.2004.11.004 [41] Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley Interscience, Hoboken (2004) [42] Kuncheva, L.I., Whitaker, C.J.: Measures of diversity in classifier ensembles. Machine Learning 51, 181–207 (2003) [43] Lee, S.W., Kim, S.Y.: Integrated segmentation and recognition of handwritten numerals with cascade neural network. IEEE Transactions on Systems, Man, and Cybernetics, Part C 29(2), 285–290 (1999) [44] Liggins, M.E., Hall, D.L., Llinas, J.: Multisensor data fusion: theory and practice, 2nd edn. The Electrical Engineering & Applied Signal Processing Series. CRC Press, Boca Raton (2008) [45] Liu, H., Yu, Z., Zha, H., Zou, Y., Zhang, L.: Robust human tracking based on multi-cue integration and mean-shift. Pattern Recognition Letters 30(9), 827–837 (2009) [46] Luo, R.C., Kay, M.: Multisensor integration and fusion in intelligent systems. IEEE Transactions on Systems, Man, and Cybernetics 19(5), 901–930 (1989) [47] McIntyre, G., Hintz, K.: Sensor measurement scheduling: an enhanced dynamic, preemptive algorithm. Optical Engineering 37, 517 (1998) [48] McKee, G.T.: What can be fused? In: Multisensor Fusion for Computer Vision. Nato Advanced Studies Institute Series F, vol. 99, pp. 71–84 (1993) [49] Mittal, A., Davis, L.: A general method for sensor planning in multi-sensor systems: Extension to random occlusion. International Journal of Computer Vision 76(1), 31–52 (2008) [50] Monekosso, D., Remagnino, P.: Monitoring behavior with an array of sensors. Computational Intelligence 23(4), 420–438 (2007) [51] Murphy, R.R.: Biological and cognitive foundations of intelligent sensor fusion. IEEE Transactions on Systems, Man and Cybernetics 26(1), 42–51 (1996) [52] Nguyen, H.T., Smeulders, A.W.: Robust tracking using foreground-background texture discrimination. International Journal of Computer Vision 69(3), 277–293 (2006) [53] Oza, N.C., Tumer, K.: Classifier ensembles: Select real-world applications. Information Fusion 9(1), 4–20 (2008); Special Issue on Applications of Ensemble Methods, doi:10.1016/j.inffus.2007.07.002 [54] Parag, T., Porikli, F., Elgammal, A.: Boosting adaptive linear weak classifiers for online learning and tracking. In: International Conference on Computer Vision and Pattern Recognition (2008) [55] Parikh, D., Polikar, R.: An ensemble-based incremental learning approach to data fusion. IEEE Transactions on Systems, Man, and Cybernetics, Part B 37(2), 437–450 (2007), doi:10.1109/TSMCB.2006.883873
[56] Petrovi´c, N., Jovanov, L., Piˇzurica, A., Philips, W.: Object tracking using naive bayesian classifiers. In: Blanc-Talon, J., Bourennane, S., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2008. LNCS, vol. 5259, pp. 775–784. Springer, Heidelberg (2008) [57] Polikar, R., Topalis, A., Parikh, D., Green, D., Frymiare, J., Kounios, J., Clark, C.M.: An ensemble based data fusion approach for early diagnosis of alzheimer’s disease. Information Fusion 9(1), 83–95 (2008); Special Issue on Applications of Ensemble Methods, doi:10.1016/j.inffus.2006.09.003 [58] Qureshi, F., Terzopoulos, D.: Surveillance camera scheduling: A virtual vision approach. Multimedia Systems 12(3), 269–283 (2006) [59] Rao, N.S.V.: A fusion method that performs better than best sensor. In: Proceedings of the First International Conference on Multisource-Multisensor Information Fusion, pp. 19–26 (1998) [60] Ratsch, G., Mika, S., Scholkopf, B., Muller, K.: Constructing boosting algorithms from svms: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(9), 1184–1199 (2002) [61] Regazzoni, C.S., Visvanathan, R., Foresti, G.L.: Scanning the issue / technology - Special Issue on Video Communications, processing and understanding for third generation surveillance systems. Proceedings of the IEEE 89(10), 1355–1367 (2001) [62] Ross, A., Jain, A.: Multimodal biometrics: An overview. In: Proc. XII European Signal Processing Conf., pp. 1221–1224 (2004) [63] Rothman, P.L., Denton, R.V.: Fusion or confusion: Knowledge or nonsense? In: SPIE Data Structures and Target Classification, vol. 1470, pp. 2–12 (1991) [64] Snidaro, L., Niu, R., Foresti, G., Varshney, P.: Quality-Based Fusion of Multiple Video Sensors for Video Surveillance. IEEE Transactions on Systems, Man, and Cybernetics 37(4), 1044–1051 (2007) [65] Snidaro, L., Visentini, I., Foresti, G.: Quality Based Multi-Sensor Fusion for Object Detection in Video-Surveillance. In: Intelligent Video Surveillance: Systems and Technology, pp. 363–388. CRC Press, Boca Raton (2009) [66] Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Pattern Analysis and Machine Intelligence 22(8), 747–757 (2000) [67] Steinberg, A.N., Bowman, C.: Revisions to the JDL data fusion process model. In: Proceedings of the 1999 National Symposium on Sensor Data Fusion (1999) [68] Tarabanis, K., Allen, P., Tsai, R.: A survey of sensor planning in computer vision. IEEE Transactions on Robotics and Automation 11(1), 86–104 (1995) [69] Tomasi, C., Petrov, S., Sastry, A.: 3d tracking = classification + interpolation. In: ICCV 2003: Proceedings of the Ninth IEEE International Conference on Computer Vision, p. 1441 (2003) [70] Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective voting of heterogeneous classifiers. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 465–476. Springer, Heidelberg (2004) [71] Valin, J.M., Michaud, F., Rouat, J.: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robot. Auton. Syst. 55(3), 216–228 (2007), http://dx.doi.org/10.1016/j.robot.2006.08.004 [72] Visentini, I., Snidaro, L., Foresti, G.: On-line boosted cascade for object detection. In: Proceedings of the 19th International Conference on Pattern Recognition (ICPR), Tampa, Florida, USA (2008) [73] Wald, L.: A european proposal for terms of reference in data fusion. 
International Archives of Photogrammetry and Remote Sensing 7, 651–654 (1998)
[74] Waltz, E., Llinas, J.: Multisensor Data Fusion. Artech House, Norwood (1990) [75] Wu, B., Nevatia, R.: Optimizing discrimination-efficiency tradeoff in integrating heterogeneous local features for object detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) [76] Xiong, N., Svensson, P.: Multi-sensor management for information fusion: issues and approaches. Information fusion 3(2), 163–186 (2002) [77] Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38(4), 13 (2006), DOI http://doi.acm.org/10.1145/1177352.1177355 [78] Yin, Z., Porikli, F., Collins, R.: Likelihood map fusion for visual object tracking. In: IEEE Workshop on Applications of Computer Vision, pp. 1–7 (2008) [79] Zhang, L., Li, S.Z., Yuan, X., Xiang, S.: Real-time object classification in video surveillance based on appearance learning. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) [80] Zhang, P., Bui, T.D., Suen, C.Y.: A novel cascade ensemble classifier system with a high recognition performance on handwritten digits. Pattern Recognition 40(12), 3415–3429 (2007) [81] Zhang, W., Yu, B., Zelinsky, G.J., Samaras, D.: Object class recognition using multiple layer boosting with heterogeneous features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 323–330. IEEE Computer Society, Washington (2005)
Chapter 2
Distributed Intelligent Surveillance Systems Modeling for Performance Evaluation Stefano Maludrottu, Alessio Dore, and Carlo S. Regazzoni
Stefano Maludrottu · Carlo S. Regazzoni: Department of Biophysical and Electronic Engineering (DIBE), University of Genoa, Genoa, Italy
Alessio Dore: Institute of Biomedical Engineering, Imperial College London, London, SW7 2AZ, UK
Abstract. In the last few years, the decrease of hardware costs and the simultaneous increase of processing capabilities have made possible the development of more and more complex surveillance systems, with hundreds of sensors deployed in large areas. New functionalities are now available, such as scene understanding, context-dependent processing and real-time user-driven functionality selection. The efficient use of these tools in architecturally complex systems for advanced scene analysis requires the development of specific data fusion algorithms able to merge multi-source information provided by a large number of homogeneous or heterogeneous sensors. In this context, the possibility of distributing the intelligence is one of the most innovative and interesting research directions for such systems. Therefore, several studies focus on how the logical tasks can be partitioned between smart sensors, intermediate processing nodes and control centers. Typical tasks of surveillance systems (context analysis, object detection, tracking...) are organized into hierarchical chains, where lower levels in the architecture provide input data for higher-level tasks. Each element of the architecture is capable of autonomous data processing. The complexity of such systems underlines the importance of good design choices for both the logical and the physical architecture. The main objective of this book chapter is to present possible solutions to evaluate the overall performance and technical feasibility, as well as the interactions of the subparts, of distributed multi-sensor surveillance systems. Performance evaluation of a multi-level hierarchical architecture does not pertain only to the accuracy of the involved algorithms: several other aspects must be considered, such as the data communication between smart sensors and higher-level nodes, the computational complexity and the memory used. Then, in order to define
a procedure to assess the performance of this kind of system, a general model of smart sensors, intermediate processing nodes and control centers has been studied, taking into account the elements involved in multi-sensor distributed surveillance tasks. The advantages of the proposed architecture analysis method are: 1) to allow the quantitative comparison of different surveillance structures through the evaluation of performance metrics and 2) to validate the algorithm choice with respect to the available physical structure (communication links, computational load...). Finally, example structures are compared in order to find the best solution to some benchmark problems.
2.1 Introduction In recent years, due to the increasing robustness and accuracy of computer vision algorithms, to the decrease of sensor and processing hardware prices, and to an increasing demand for reliable security systems in different areas (such as transport, industrial plants, public places or crowded events), surveillance systems have become more and more sophisticated. Hundreds, even thousands, of sensors can be deployed in very large areas to perform the different security tasks typical of those systems: people counting, tracking, abandoned object detection, movement detection and so on. Accuracy can be achieved through sensor redundancy or by combining different sensor typologies in order to operate in a number of difficult situations, such as crowded scenes or poorly illuminated areas. The large amount of multi-sensor data produced by heterogeneous and spatially distributed sensors is fused to automatically understand the events in the monitored environment. This data fusion process is instantiated at different levels of the architecture in a distributed and sequential way, in order to provide remote control centers (and the attention of operators) with aggregated data in the form of alarms or collections of salient video sequences about events of interest. A typical description of surveillance systems is a tree-like hierarchical structure composed of smart sensors, intermediate processing nodes and control centers connected by heterogeneous communication links. In this structure, each sensor contributes to the global surveillance task by monitoring a small physical area and gathering local data about the environment. Since in advanced surveillance systems data can be locally analyzed by sensors with processing capabilities (therefore referred to as smart sensors), only salient information needs to be transmitted to fusion nodes devoted to specific data association and data fusion tasks. Since modern surveillance systems can monitor very large areas, the data fusion process itself can be subdivided into a hierarchical task performed by multiple processing units, and more than one layer of intermediate fusion nodes can be dedicated to data fusion. Control centers gather the data pertaining to the whole monitored area and present them, through appropriate interfaces, to the human operators.
This decomposition does not imply a centralized monitoring/control approach, but it reflects a wide range of current systems and is intended as a descriptive model of the general structure of surveillance systems [10] [5]. Moreover, modern surveillance systems can be defined as "intelligent", following a definition in [25], if they are capable of integrating the ambient intelligence paradigm [21] into safety or security applications. Many works have been devoted in the last decade to linking traditional computer vision tasks to high-level context-aware functionalities such as scene understanding, behavior analysis, interaction classification or recognition of possible threats or dangerous situations ([26] [16] [3] [27]). Another notable improvement of modern state-of-the-art surveillance systems (third generation surveillance systems) is the capability of internal logical task decomposition and distributed data processing. The so-called distribution of intelligence in such systems is made possible by the autonomous processing capabilities of the elements of the architecture. It can be described as a dynamic process of mapping the logical architecture (decomposed into a set of sequential sub-tasks) onto the physical structure of the surveillance system itself. In this way, processing and communication resources can be optimized within the system. Modern surveillance systems can be used to provide different services to a number of different end-users; all the communication and processing resources of the system can be shared between different users according to specific strategies. Therefore, a modern third generation intelligent surveillance system whose characteristics comply with the above definitions can be a large, expensive and structurally complex system. The design of such a system is an articulated task: hardware selection, communication link design and sensor placement are some of the constraints that have to be considered. Moreover, existing architectures can be reused for new security applications; therefore it is important to assess in advance how new functionalities will perform on existing structures. Different aspects of surveillance architectures should be taken into account when assessing their performance, not only the accuracy of the algorithms but also the communication and computational complexity issues that can arise in complex multi-sensor systems. An efficient performance evaluation should not be an a posteriori analysis of such systems: modifications or refinements at such a large scale can be difficult, time-consuming and economically disadvantageous. A better solution is to link performance evaluation to the design phase, in order to produce useful feedback regarding specific design choices. Therefore, a correct modeling of intelligent surveillance systems can be defined in order to take into account all the design-critical aspects of surveillance systems.
2.1.1 Surveillance Systems as Data Fusion Architectures Since the main goal of an intelligent surveillance system is to provide human operators with a complete, reliable and context-aware description of the environment based
Fig. 2.1 Scheme of the data fusion process of a surveillance system derived from the JDL model.
upon raw data obtained from the deployed sensors, a data fusion process is distributed between the different elements of the architecture, during which information extraction and refinement are performed at different abstraction levels. The data fusion process can be modeled, according to the JDL model [11], as a hierarchical process composed of four steps (see Fig. 2.1):
0-preprocessing: Raw sensor data are filtered, conditioned and processed before any fusion with data from other sensors; significant information is obtained from raw data using feature extraction algorithms. The preprocessing techniques adopted depend on the specific nature of the sensors.
1-processing: Spatial and temporal alignment of data from different sensors is performed. A common description of the objects present in the environment is provided, for instance in terms of position, velocity, attributes and characteristics. Specific recognition or classification algorithms are instantiated in order to label objects perceived in the scene.
2-situation analysis: Automated reasoning or ambient intelligence methods are adopted to provide a complete description of the events in the environment. Interactions between objects are compared to predefined models and classified. The meaning of events within a scene is determined.
3-threat analysis: The evolution of the current situation in the environment is evaluated to determine potential threats, using predictive models or probabilistic frameworks. Alarms are raised if potentially dangerous situations are recognized.
Aggregated data and alarms resulting from the data fusion process are provided to human operators through an appropriate human-computer interface (HCI). HCIs may contain maps of the environment, icons or message boards in order to facilitate an immediate understanding of potential threats or suspicious situations. Aside from the data fusion process, database management can be considered a key task to effectively handle the large amount of data involved in the data fusion process. Since the database (DB) is a critical component of a surveillance system, many works have been devoted to DB modeling in complex systems. To avoid risks
of data loss or DB failure, a common approach is (virtual) DB replication in every node of the architecture [14]. Therefore, modern third-generation surveillance systems can be defined and identified by the mapping of specific data analysis and data fusion tasks onto an architecture of intelligent hardware devices. This mapping can be an a priori design decision or can be dynamically modified according to the resources available in real time and to specific events in the environment.
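As an illustration of the four-level decomposition described above, the sketch below chains four placeholder functions, one per JDL step, over a toy detection list; the field names, thresholds and the loitering-based threat rule are invented for the example and do not correspond to any specific system.

```python
def preprocess(raw):                       # JDL level 0: filter/condition raw sensor data
    return [d for d in raw if d["confidence"] > 0.5]

def process(detections):                   # JDL level 1: build a common object description
    return [{"id": i, "pos": d["pos"], "vel": d["vel"]} for i, d in enumerate(detections)]

def situation_analysis(objects):           # JDL level 2: classify simple events
    return [{"obj": o["id"], "event": "moving" if o["vel"] > 0.1 else "loitering"}
            for o in objects]

def threat_analysis(events):               # JDL level 3: raise alarms on potential threats
    return [e for e in events if e["event"] == "loitering"]

raw = [{"pos": (3.0, 4.0), "vel": 0.02, "confidence": 0.9},
       {"pos": (1.0, 7.0), "vel": 1.30, "confidence": 0.8},
       {"pos": (9.0, 2.0), "vel": 0.50, "confidence": 0.3}]

alarms = threat_analysis(situation_analysis(process(preprocess(raw))))
print(alarms)   # the stationary, confidently detected object triggers an alarm
```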
2.1.2 Surveillance Tasks Decomposition Distribution of intelligence is a critical feature of surveillance systems, needed to concurrently optimize bandwidth usage and processing speed. For instance, data analysis performed at the lower levels of the architecture can lead to more efficient bandwidth and central processing usage, offering at the same time robustness through redundancy, since distributed processing can better handle component failures. Typical surveillance tasks (e.g. tracking) can be decomposed into a chain of logical modules organized in a hierarchical structure: lower modules provide as output the data needed as input by higher-level modules. In this way, starting from raw sensor data, a higher level of inference can be reached at each level. This decomposition, originally proposed in [13], defines the paradigm of intelligence distribution in terms of module allocation to different physical devices with autonomous processing capabilities. Three kinds of modules have been defined at different abstraction levels: representation modules (information-processing tasks whose output is a higher-level symbolic representation of the input data), recognition modules (algorithms that compare input data with a predefined set of event descriptors) and communication modules (which produce a coded representation of the input data suitable for transmission). In [2] a mobile agent system (MAS) is defined as a dynamic task assignment framework for smart cameras in surveillance applications: high-level tasks are allocated to sensor clusters. The system itself decomposes them into subtasks and maps this set of subtasks to single sensors according to the available processing resources of the cameras and the current state of the system. Moreover, software agents can dynamically migrate between different sensors according to changes in the observed environment or in the available resources. In [19] an optimal task partitioning framework for distributed systems is defined, capable of dynamic adaptation to system or environment changes while taking into consideration Quality of Service (QoS) constraints. In [22] an agent-based architecture is presented, where agents, defined as software modules, are capable of autonomous migration between nodes of the architecture. Within this framework, camera agents are responsible for detection and tracking tasks, while object agents provide updated descriptions of temporal events.
More in general, the mapping of tasks onto the architecture of an intelligent surveillance system can be defined as an association function A:

A : T → N    (2.1)

where T is the complete set of processing tasks t_i that the system is required to perform and N is the set of intelligent devices (or nodes) n_j in the system itself. For every node n_j it must hold that Σ_i p_ij ≤ p_j, where p_ij is the cost value, in terms of processing resources (memory or CPU), of task t_i allocated to node n_j and p_j is the total processing capability of node n_j.
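A minimal sketch of the association function of Eq. (2.1) with its capacity constraint: tasks are assigned greedily to the first node with enough spare capacity. The task costs, node names and the greedy strategy itself are assumptions made for illustration; many other assignment strategies are possible.

```python
def assign_tasks(task_cost, node_capacity):
    """Greedy A: T -> N subject to sum_i p_ij <= p_j for every node n_j."""
    load = {node: 0.0 for node in node_capacity}
    assignment = {}
    for task, cost in task_cost.items():
        for node, cap in node_capacity.items():
            if load[node] + cost <= cap:      # capacity constraint for node n_j
                assignment[task] = node
                load[node] += cost
                break
        else:
            raise RuntimeError(f"no node can host task {task}")
    return assignment

task_cost = {"detection": 3.0, "tracking": 2.0, "classification": 4.0}
node_capacity = {"smart_cam_1": 5.0, "fusion_node_1": 6.0}
print(assign_tasks(task_cost, node_capacity))
```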
2.2 Intelligent Surveillance System Modeling 2.2.1 Introduction In the remainder of this section, a model is proposed to provide a general framework for the performance evaluation of complex multi-sensor intelligent surveillance systems. To this aim, three main components of the system are identified: the smart sensor, the fusion node and the control center. In this work we consider a structure where sensors and control centers can exchange data only with the fusion nodes, although it would be possible to add other communication links (e.g. direct communication between sensors) to model specific architectures. This structure has been chosen since this work focuses on a hierarchical architecture where the fusion node oversees the sensors and, if necessary, modifies their processing tasks or their internal parameters, or passes them information gathered from other sensors. The model can include one or more fusion nodes, according to the extension of the monitored area and the number of sensors. The fusion nodes send the processed data to a common remote control center. If the considered environment is very large or structurally complex, or if a large number of sensors has to be handled, more than one level of intermediate nodes can be used to perform the data fusion process. A hierarchical multi-level partitioning of the environment can be adopted in order to manage the data flow between multiple layers of intermediate fusion nodes. Each zone at the i-th level ("macro-zone") is divided into a set of smaller zones ("micro-zones") at the (i + 1)-th level, according to a specific partitioning strategy such as Adaptive Recursive Tessellation [24]. Higher-level fusion nodes take as input the metadata produced by lower-level fusion nodes. These data (related to contiguous micro-zones) are fused in order to obtain a global representation of a macro-zone of the environment. Control centers are directly connected to all the fusion nodes of the architecture (or to the highest-level fusion nodes, in a multi-level fusion structure as defined above). In this way, the metadata describing the entire environment are collected in order to provide a complete representation of the environment to the final users. Usually a single control center is present, although in multi-user surveillance systems more than one control center can be realized.
Without loss of generality, we will consider the overall fields of view of the fusion nodes to be contiguous and non-overlapping. A global metric used to evaluate the overall performance of a multi-sensor system will be described in Sect. 2.2.5. This global metric will take into account not only specific data processing errors but also problems or failures of structural elements of the architecture. In Section 2.4 it will be shown how, keeping the processing algorithms unaltered, different design choices (in terms of sensor selection and positioning) can significantly affect the overall system performance.
2.2.2 Smart Sensor Model Each smart sensor S_i generates a local representation of specific physical quantities (such as light, motion, sound...). Input data are locally processed by the sensor and appropriate metadata are sent to the higher levels of the architecture. The processing function T_i(t) represents a data analysis task that produces output data D_fi(t + Δt_i) given input data D_i(t) referring to the scene observed in the field of view V_Si(t). The quantity Δt_i (defined as the time response of sensor S_i) accounts for the computational time needed by the smart sensor to process the input data. The value of Δt_i depends on several factors, such as the specific processing task considered, its computational complexity and the available processing resources. The input noise is modeled by E_i(t) and the noise on the processed data sent to the fusion node by E_fi(t + Δt_i). The variables D_i(t), D_fi(t + Δt_i), E_i(t) and E_fi(t + Δt_i) are vectors whose dimension is proportional to the complexity of the scene observed by S_i at time t (e.g. in a tracking application those vectors have dimension equal to the number of targets N_i(t)). A computational complexity C, function of D_i(t), is associated with the processing function. The error vector in this application coincides with the accuracy error that has to be evaluated through a suitable performance metric. To further improve modularity, different performance metrics can be adopted (including the ones mentioned in Section 2.3.3). Since a third-generation intelligent surveillance system is a context-aware and (possibly) multi-user, multi-functional system, the specific processing task T_i(t) associated with each sensor S_i can be modified according to specific user requests and salient changes in the environment. The database D_i represents the memory M of the smart sensor. It has been chosen to model it as a database since many surveillance algorithms require the current data to be compared with known models or previous data. The algorithm T exchanges information with the database, which can also send additional information to the fusion node. The communication unit C_i is responsible for transmitting data to the fusion node. The communication unit has to satisfy the requirement of error-free data transmission, considering possible limitations of the bandwidth B of the link toward the fusion node. In order to comply with real-time constraints, a maximum-delay value Δt_max,s can
be defined. Thus, if Δt_i > Δt_max,s, the output of the processing task can be considered to contain obsolete data and is discarded. Each smart sensor is considered to be placed in a known position of the environment; its field of view (FOV) depends on its angle of view and its orientation. If a smart sensor has a fixed orientation, its FOV V_Si(t) becomes a constant V_Si. Therefore, a smart sensor S_i can be modeled as composed of four fundamental elements:
• a processing function T_i(t),
• a database D_i,
• a communication unit C_i,
• a field of view V_Si(t).
The schematic diagram of the smart sensor model can be seen in Fig. 2.2.
Fig. 2.2 Representation of the smart sensor model
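The four elements listed above translate naturally into a small data structure; the following sketch is a schematic rendering of the smart sensor model under simple assumptions (a rectangular FOV test, a list-based database and a fixed processing delay), none of which are prescribed by the model itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class SmartSensor:
    """Schematic smart sensor: processing function, database, comm unit, FOV."""
    processing: Callable[[list], list]          # T_i(t): input data -> metadata
    database: List[dict] = field(default_factory=list)       # D_i: local storage
    bandwidth: float = 1.0                      # C_i: link budget toward the fusion node
    fov: Tuple[float, float, float, float] = (0, 0, 10, 10)  # x_min, y_min, x_max, y_max
    max_delay: float = 0.1                      # Δt_max,s: real-time constraint

    def observes(self, point):
        x, y = point
        x0, y0, x1, y1 = self.fov
        return x0 <= x <= x1 and y0 <= y <= y1

    def sense(self, scene, delay):
        """Process only what falls in the FOV; drop obsolete results (Δt_i > Δt_max,s)."""
        if delay > self.max_delay:
            return []                           # data discarded, modelled as a loss
        visible = [p for p in scene if self.observes(p)]
        self.database.append({"observations": visible})
        return self.processing(visible)

sensor = SmartSensor(processing=lambda pts: [{"pos": p} for p in pts])
print(sensor.sense(scene=[(2.0, 3.0), (12.0, 1.0)], delay=0.05))
```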
2.2.3 Fusion Node Model In this stage the data coming from each sensor are integrated to build an accurate single model of each object perceived in the environment. The structure of the fusion node N is similar to that of the smart sensor, but some differences are present in the processing function and in the communication unit. The fusion node processing function T_F(t) takes as input M data vectors D_fi(t) (together with the respective error measures E_fi(t), i = 1, ..., M), where M is the number of smart sensors connected to the fusion node F_k. Due to failures of specific
sensors or communication links, or due to data loss caused by real-time constraints or bandwidth limitations, only a subset of the input data may be received by the fusion node. An "information loss" parameter I_F(t) is defined to account for the percentage of the input data not correctly received by the fusion node. Thus, the transfer function is T_F(D_f1(t), ..., D_fM(t), E_f1(t), ..., E_fM(t), I_F(t)). The cardinality of the D_F(t) and E_F(t) vectors is N_F(t) = T_F(card(D_f1(t) ∪ D_f2(t) ∪ ... ∪ D_fM(t))). The output data D_ci(t + Δt_F) of the processing function T_F(t) are sent to the control centers. The time response of the fusion node N (Δt_F) accounts for the computational time needed by the fusion node to process the input data. An efficient data fusion algorithm produces, as output of T_F(E_f1(t), ..., E_fM(t)), an error vector E_F(t + Δt_F) in which each value is lower than the respective error value in E_i(t) detected from the i-th sensor. The robustness of data fusion algorithms can be defined as a function R_F : (E_fi, I_F) → E_F that records how the error on the output data is affected by the errors on the input data and by the information loss. It is worth noting that the computational complexity C of the fusion node processing is related to the dimension of the overlapped fields of view and to the intricacy of the events analyzed in this region. Similarly to the smart sensor model, the processing function T_F(t) depends on the specific task decomposition at time t. The database D_F performs the same tasks described for the smart sensor model. The communication unit C_F analyzes the transmission load of each communication link connecting the fusion node N to the smart sensors S_i. Moreover, it manages the data transmission to the control centers through a link of bandwidth B_F. It can also send messages to the smart sensor communication units in order to perform an adaptive optimization procedure. For example, if one link is overused, the fusion node can ask all the smart sensors connected to that link to reduce their transmission rate through higher compression, or a dynamic reallocation of processing tasks can be performed. Taking into account real-time constraints, a maximum delay value Δt_max,F can be defined for the fusion node. Output data are sent to the control centers only if the time response value Δt_F ≤ Δt_max,F. A fusion node N is thus modeled by three fundamental elements:
• a processing function T_F(t),
• a database D_F,
• a communication unit C_F.
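A matching sketch of the fusion node: it polls the connected sensors, simulates link failures, computes the information loss parameter I_F(t) and applies a placeholder fusion step. The failure probability and the fusion-by-concatenation are illustrative assumptions standing in for real data association and fusion algorithms.

```python
import random

class FusionNode:
    """Schematic fusion node: gathers sensor metadata and tracks information loss I_F(t)."""

    def __init__(self, sensor_readers, link_failure_prob=0.1):
        self.sensor_readers = sensor_readers        # callables standing in for smart sensors
        self.link_failure_prob = link_failure_prob  # models link/bandwidth problems

    def fuse(self):
        batches = []
        for read in self.sensor_readers:
            if random.random() < self.link_failure_prob:   # data not correctly received
                continue
            batches.append(read())
        info_loss = 1.0 - len(batches) / len(self.sensor_readers)   # I_F(t)
        # Placeholder T_F: concatenate received metadata (real systems associate and fuse).
        fused = [item for batch in batches for item in batch]
        return fused, info_loss

random.seed(0)
node = FusionNode([lambda: [{"id": 1, "pos": (2.0, 3.0)}],
                   lambda: [{"id": 7, "pos": (2.1, 3.2)}]])
print(node.fuse())
```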
2.2.4 Control Center Model The control center C receives the data produced by the fusion nodes. The collected data can be further processed in order to produce a global representation of the environment. Aggregated data are shown to the operators by means of an appropriate HCI. At this stage, changes in the allocation of data processing tasks within the system are issued if user requests for specific functionalities are received or as a result of context analysis (e.g. if threats or suspicious situations are detected by the system). The control center processing function T_C(t) can be decomposed as T_P(t) + T_HCI, where T_P(t) is the input data processing task and T_HCI is the processing related to
the user interface (supposed constant over time). If more than one fusion node N_k is connected to the control center C, T_P(t) is a high-level data fusion task: T_P(t) takes as input M data vectors D_ci(t) (together with the respective error measures E_Fi(t), i = 1, ..., M), where M is the number of intermediate fusion nodes connected to the control center C_k. An "information loss" parameter I_C(t) can be defined for the control center as the percentage of the input data not correctly received at time t. Thus, the transfer function is T_P(D_c1(t), ..., D_cM(t), E_F1(t), ..., E_FM(t), I_C(t)). The output data D_C(t + Δt_C), affected by an error E_C(t + Δt_C), are presented to the operators through an appropriate interface. The value Δt_C is the time response value of the control center. The output error is affected by the information loss parameter, similarly to the fusion node model. Analogously, a robustness function can be defined as R_C : (E_Fi, I_C) → E_C. The database D_C performs the same tasks described for the smart sensor model. The communication unit C_C analyzes the transmission load of each communication link that connects the control center C to the intermediate fusion nodes N_i. It sends messages to the fusion node communication units to optimize processing and communication resources or to manage/start/terminate specific surveillance tasks. The control center C is modeled considering:
• a processing function T_C(t),
• a database D_C,
• a communication unit C_C.
2.2.5 Performance Evaluation of Multi-sensor Architectures The performance evaluation of data fusion systems, such as multi-sensor surveillance systems, is usually carried out considering only the information processing and refinement algorithms. Much work has been dedicated in recent years to evaluating the performance of surveillance algorithms, usually by comparing the output of those algorithms with a manually obtained ground truth. In this way a quantitative evaluation and an effective comparison of different approaches can usually be achieved. In visual surveillance, for instance, since typical tasks are the tracking or detection of moving objects, the corresponding ground truth can be obtained by manually labeling or drawing bounding boxes around objects. However, the evaluation of those systems does not pertain only to algorithm accuracy: several other aspects must be considered as well, such as the data communication between sensors, fusion nodes and control centers, the computational complexity and the memory used. A complete model must take into account all the significant elements of a multi-sensor system. Therefore, in order to define a more global procedure to assess the performance of intelligent third-generation surveillance systems, a general model of smart sensors, fusion nodes and control centers has been studied. A global evaluation function P(F(T, R), D_F, I) can be defined in order to assess the overall performance of surveillance architectures. The global result of this function implicitly takes into account both the robustness and reliability of the processing algorithms and the specific structure of the surveillance system itself.
P depends on several parameters. The performance metric F(T, R) is related to the specific surveillance task (tracking, detection, classification,...) and takes into account the processing functions T = {T_i, T_F, T_C} and the robustness functions R = {R_i, R_F, R_C} assigned to the nodes of the architecture. I and D_F are, respectively, the information loss (accounting for bandwidth/CPU limitation problems, real-time constraints...) and the database failure rate (accounting for storage/retrieval errors) of every element of the structure. In this way, different intelligent surveillance systems can be evaluated and compared on the basis of different design choices. The impact on the overall performance of the specific data processing functions assigned to the smart sensors (T_i), to the intermediate fusion nodes (T_F) and to the control centers (T_C) can be assessed. Sensor selection and positioning can be validated for specific surveillance applications. Finally, the cost of data loss due to different causes (bandwidth limitations, real-time requirements,...) can be evaluated.
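Since the text leaves the exact form of P open, the sketch below shows one possible, purely illustrative composition: a task-level metric F(T, R) penalised by the information loss I and by the database failure rate D_F, with invented weights.

```python
def global_performance(task_metric, info_loss, db_failure_rate,
                       w_loss=0.5, w_db=0.3):
    """Illustrative P(F(T, R), D_F, I): task accuracy penalised by structural problems.

    task_metric:     F(T, R) in [0, 1], e.g. tracking accuracy of the deployed algorithms
    info_loss:       I, fraction of data lost (bandwidth/CPU limits, real-time constraints)
    db_failure_rate: D_F, fraction of failed storage/retrieval operations
    """
    penalty = w_loss * info_loss + w_db * db_failure_rate
    return max(0.0, task_metric * (1.0 - penalty))

# Comparing two candidate architectures with the same algorithms (same F) but
# different communication and storage reliability.
print(global_performance(0.85, info_loss=0.05, db_failure_rate=0.01))
print(global_performance(0.85, info_loss=0.30, db_failure_rate=0.10))
```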
2.3 A Case Study: Multi-sensor Architectures for Tracking 2.3.1 Introduction In order to test the proposed model on a real-world surveillance system, a distributed tracking architecture has been chosen as a benchmark problem, as in [1]. The structure consists of a set of sensors S_i that transmit, over a common transmission channel C, the local results of tracking/localization tasks to a single fusion node N. In the following case study video sensors have been considered, since in real-world surveillance systems video cameras are by far the most common sensors. However, this should not be considered a limitation of the proposed model, since heterogeneous sensors (audio, radio, GPS,...) can be used as well in tracking applications to provide reliable information on moving objects in the environment. In the fusion node, data association as well as data fusion are performed. The local object identifiers (IDs) are replaced by common identifiers and point-to-track association is carried out. A correct modeling solution for this problem can be obtained through a hierarchical decomposition of the architecture of the tracking system into its physical and logical subparts, according to the model defined in Section 2.2.1. In the literature many similar approaches have been proposed for distributed fusion architecture modeling: in [15], a hierarchical architecture for video surveillance and tracking is proposed, in which the camera network sends the gathered data to upper-level nodes where the tracking data are fused in a centralized manner. In some related works, specific metrics are introduced in order to measure the reliability of each subpart involved in the data fusion. Typical problems arising in video data flows are due to the superimposition, on the image plane, of targets and scene elements that occupy different 3D positions. This issue, called occlusion, causes a momentary lack of observations, with consequent difficulties in the position estimation procedure. Moreover, background elements
can produce misleading observations (or clutter) that have to be excluded from the measurement-to-target association procedures. A possible and widely exploited approach to overcome these issues and to enhance tracking robustness consists in using several sensors, with completely or partially overlapping fields of view, to monitor the same area. In this way it is possible to obtain multiple observations of the same entity, whose joint processing is likely to improve tracking performance. The fusion module can be considered as a function that properly associates the data produced by the sensors with overlapping fields of view and estimates a unique scene representation. An example of the above-mentioned architecture is shown in Figure 2.3. Sensors are considered to be smart sensors, i.e. equipped with an embedded processing unit and capable of performing autonomous localization and tracking tasks. Fusion nodes gather data from two or more sensors and process them jointly to generate a unique scene representation.
Fig. 2.3 Example of a data fusion system for target tracking
2.3.2 Multisensor Tracking System Simulator

On the basis of the architectural model described in Sect. 2.2.1, a tracking simulator has been implemented to demonstrate how system performance can be evaluated. A context generator has been designed in order to produce a set of realistic target trajectories simulating moving objects with stochastic dynamics. The simulated smart sensor Si acquires data from the context generator only within its field of view VSi(t) and processes them using a Kalman filter. Multiple sensors
with overlapping fields of view send their processed data to a single fusion node Fk, which processes the data using a modified Kalman filter together with a Nearest Neighbor data association technique [28].
2.3.2.1 Context Generator
The context generator produces trajectories with a first-order linear autoregressive model with Gaussian noise. Trajectories lie in a two-dimensional space representing the map of a monitored environment. A Poissonian distribution is used to describe the probability of target birth and death. Occlusions are also simulated when two targets are aligned with respect to the camera line of sight and are sufficiently close to each other. Clutter noise is also introduced at each scan as a Poisson process. The context generator (see Fig. 2.4) provides a realistic but controllable environment in which different situations can be simulated while the corresponding ground truth is generated automatically. Nevertheless, the proposed context generator has to be considered a simplified model of the environment, since some of the typical issues of visual-based surveillance applications (such as sudden or continuous lighting variations, adverse weather conditions, camera vibrations, ...) are not taken into consideration. In order to assess how these problems affect the overall performance of a given multi-sensor architecture, either a more detailed context generator can be implemented (as in [23]) or the simulator can be modified to use real data as input (as in [4]). However, in this work a simple context generator module has been chosen, since this case study intends to show the possibility of using the proposed model for performance evaluation. A more realistic simulator can be implemented according to the specific application of the proposed general model.
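The following Python sketch illustrates a context generator of this kind. It is not the simulator used in this chapter: the dynamics, the map size and all parameter values (birth rate, death probability, clutter rate, noise level) are arbitrary assumptions chosen only to show the structure of such a module.

    import math
    import random

    def poisson(rng, lam):
        """Sample a Poisson-distributed count (Knuth's method, adequate for small lam)."""
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1

    def generate_context(steps=50, birth_rate=0.2, death_prob=0.02,
                         clutter_rate=1.0, noise_std=0.3, seed=0):
        """Toy scene generator: AR(1) target dynamics, Poissonian birth, random death, clutter."""
        rng = random.Random(seed)
        targets, next_id, history = {}, 0, []
        for k in range(steps):
            for _ in range(poisson(rng, birth_rate)):          # new targets appear
                targets[next_id] = (rng.uniform(0, 100), rng.uniform(0, 100))
                next_id += 1
            for tid in [t for t in targets if rng.random() < death_prob]:
                del targets[tid]                               # some targets disappear
            for tid, (x, y) in list(targets.items()):          # first-order AR update + noise
                targets[tid] = (0.98 * x + rng.gauss(0, noise_std),
                                0.98 * y + rng.gauss(0, noise_std))
            clutter = [(rng.uniform(0, 100), rng.uniform(0, 100))
                       for _ in range(poisson(rng, clutter_rate))]
            history.append({"time": k, "targets": dict(targets), "clutter": clutter})
        return history

    print(len(generate_context()))   # 50 simulated scans with ground truth and clutter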
2.3.2.2 Smart Sensor
When a target trajectory enters the field of view VSi(t) of the i-th sensor, a single-view tracking task is initialized to detect the position and follow the movements of the target. For each target, a Kalman filter is used to estimate the position of the tracked object, whose equations are:

x(k + 1) = A x(k) + w(k)    (2.2)
z(k) = H x(k) + v(k)    (2.3)

where w and v are, respectively, the transition (process) model noise and the measurement model noise; they are independent, white, Gaussian and zero mean. A and H are, respectively, the state transition matrix and the measurement matrix. In this work the computational complexity is a linear function of the number of targets present in the FOV. When this function exceeds a threshold that is defined to represent the unit processing capability, a random lack of data to be sent to the fusion node is simulated.
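A minimal, generic implementation of the Kalman filter described by (2.2) and (2.3) is sketched below. It is a textbook constant-velocity filter, not the authors' code: the state layout, the matrices A and H and the noise covariances Q and R are illustrative assumptions.

    import numpy as np

    class KalmanFilter:
        """Textbook Kalman filter: x(k+1) = A x(k) + w(k), z(k) = H x(k) + v(k)."""
        def __init__(self, A, H, Q, R, x0, P0):
            self.A, self.H, self.Q, self.R = A, H, Q, R
            self.x, self.P = x0, P0

        def predict(self):
            self.x = self.A @ self.x
            self.P = self.A @ self.P @ self.A.T + self.Q
            return self.x

        def update(self, z):
            S = self.H @ self.P @ self.H.T + self.R     # innovation covariance
            K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
            self.x = self.x + K @ (z - self.H @ self.x)
            self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
            return self.x

    # Constant-velocity model in 2D: state = [x, y, vx, vy], measurement = [x, y]
    dt = 1.0
    A = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
    kf = KalmanFilter(A, H, Q=0.01 * np.eye(4), R=0.5 * np.eye(2),
                      x0=np.zeros(4), P0=np.eye(4))
    kf.predict()
    print(kf.update(np.array([1.0, 2.0])))   # filtered state after one measurement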
Fig. 2.4 Context generator for multi-camera tracking: white dots are the moving targets and green dots are the sensors placed in the environment
In this example the communication unit of the sensor is responsible for sending to the fusion node data regarding the occupation of the link. A function has been implemented to simulate the optimization of data transmission. In this simulator the database is not modeled, since the performed tasks do not need data storage.
2.3.2.3 Fusion Node
The communication unit handles the data streams coming from the smart sensors. A maximum bandwidth value Bmax has been defined according to the physical constraints of the architecture. If ∑i Bi(k) > Bmax (where Bi(k) is the bandwidth occupation of sensor i at time k), a sensor subset S∗(k) of the set S of all the sensors connected to the fusion node is chosen in order to optimize the bandwidth usage, such that ∑i B∗i(k) ≤ Bmax: a priority-based selection of the data streams is defined in order to obtain a feasible subset S∗(k).

Ei = < V; O; C; O × C >    (6.2)

where

• V is the set of input variables used to perform the required surveillance tasks and to provide information about the object features and the current state of the environment. The values of these variables can be directly gathered by the sensors or generated by preprocessing modules. Furthermore, such values may be precise or, on the contrary, provided with imprecision and vagueness.
• O is the set of classes of monitored objects in the sub-environment whose behaviour must be analysed (e.g. people, groups of people, cars, trucks, bicycles, etc.).
• C refers to the set of monitored aspects in the sub-environment, denoted as concepts from now on.
• O × C determines the concepts that must be used to analyse the normality of each class of object. Depending on the class of each object, different concepts will be used to determine whether its behaviour is normal or not.

Example 6.2. Let Ei be a part of the environment monitored by a security camera, where there exist traditional urban elements such as gardens, pavements, pedestrian crossings, traffic lights, roads, etc. The classes of objects that may appear are O = {pedestrian, vehicle}. The concepts to monitor in Ei are C = {trajectories, speed, pedestrian crossings}; that is, the normality of trajectories, of speed and of the behaviour at pedestrian crossings will be analysed. Some examples of the variables V needed to perform the surveillance are: object location, object size, time, list of key regions of the environment, position of relevant static elements, etc. The concepts proposed to analyse the behaviour of each class of object are given by O × C = {(pedestrian, trajectories), (pedestrian, pedestrian crossing), (vehicle, trajectories), (vehicle, speed), (vehicle, pedestrian crossing)}. According to this example, a pedestrian behaves in a normal way if he/she follows normal trajectories and crosses the road through the pedestrian crossing. In the case of vehicles, their speed at each time is also monitored.
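As an illustration of the structure introduced above, the environment of Example 6.2 can be encoded as plain data. The sketch below is not part of the original system; the dictionary keys and the helper function are hypothetical names.

    E_i = {
        "V": ["object location", "object size", "time",
              "key regions", "static elements"],                 # input variables
        "O": ["pedestrian", "vehicle"],                          # monitored object classes
        "C": ["trajectories", "speed", "pedestrian crossings"],  # monitored concepts
        # O x C: which concepts are used to judge the normality of each class
        "OxC": {
            "pedestrian": ["trajectories", "pedestrian crossings"],
            "vehicle": ["trajectories", "speed", "pedestrian crossings"],
        },
    }

    def concepts_for(env, object_class):
        """Return the concepts used to analyse the normality of a given class."""
        return env["OxC"][object_class]

    print(concepts_for(E_i, "vehicle"))   # ['trajectories', 'speed', 'pedestrian crossings']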
6.3.2.3 Surveillance Based on the Analysis of Concepts
As previously discussed, the surveillance of an environment is carried out by means of concepts, so that there exists one concept for each event of interest to monitor.
Definition 6.3. A concept ci (ci ∈ C) is defined as a 3-tuple composed of the following elements:

ci = < Vi; DDVi; Φi >    (6.3)

where Vi is the set of input variables used to define the concept ci, so that Vi ⊆ V. On the other hand, DDVi is the set of definition domains of the variables that belong to Vi. Therefore, if Vi = {v1i, v2i, ..., vni}, then DDVi is defined as DDVi = {DDV1i, DDV2i, ..., DDVni}, where DDVji is the definition domain of the variable vji. The definition domain of a variable specifies the possible values that it can take. Finally, Φi is the set of constraints used to complete the definition of the concept ci, according to the elements of Vi (Φi = {μ1i, μ2i, ..., μki}). The normality analysis of ci depends on how the constraints associated to ci are met.
6.3.2.4 Normality Constraint
Definition 6.4. A normality constraint associated to a concept ci is defined as a fuzzy set Xi over the domain P(Vi), with an associated membership function μXi:

μXi : P(Vi) → [0, 1]    (6.4)

where 1 represents the maximum degree of satisfaction of the constraint and 0 the minimum. The rest of the values represent intermediate degrees of normality. Sometimes, defining some constraints by means of crisp sets (the object either meets the constraint or not) is more suitable and practical. In such a case, the membership function is as follows:

μXi(x) = 1 if x ∈ Xi; 0 if x ∉ Xi

New constraints can also be defined by combining simpler constraints through a set of operations.

Operations between constraints. Let A and B be two normality constraints defined over the domain P(Vi). Then:

1. The union (A ∪ B) of constraints is a new constraint that is met if and only if A or B is satisfied. The membership function of A ∪ B is defined as μA∪B(x) = max{μA(x), μB(x)}.
2. The intersection (A ∩ B) of constraints is a new constraint that is met if and only if A and B are simultaneously satisfied. The membership function of A ∩ B is defined as μA∩B(x) = min{μA(x), μB(x)}.
3. The complement (Ā) is a constraint that is met if and only if A is not satisfied. The membership function of Ā is defined as μĀ(x) = 1 − μA(x). These three operations are illustrated in the short code sketch given after the list of properties below.

Properties of normality constraints. The properties of normality constraints are exactly the same as the properties of fuzzy sets. Let A, B and C be normality constraints defined over the domain P(Vi):
1. Idempotent: A ∪ A = A; A ∩ A = A
2. Commutative: A ∪ B = B ∪ A; A ∩ B = B ∩ A
3. Associative: A ∪ (B ∪ C) = (A ∪ B) ∪ C; A ∩ (B ∩ C) = (A ∩ B) ∩ C
4. Distributive: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C); A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
5. Double negation: ¬(¬A) = A
6. Transitive: If A ⊂ B ∧ B ⊂ C, then A ⊂ C
7. Limit condition: A ∪ ∅ = A; A ∩ ∅ = ∅; A ∪ P(Vi) = P(Vi); A ∩ P(Vi) = A
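The union, intersection and complement operations defined above translate directly into code when each constraint is represented by its membership function. The following sketch is a generic illustration of these fuzzy-set operations; the two example constraints (mu_slow, mu_near) and their parameters are invented for the example.

    def c_union(mu_a, mu_b):
        """Union of two normality constraints: maximum of their membership degrees."""
        return lambda x: max(mu_a(x), mu_b(x))

    def c_intersection(mu_a, mu_b):
        """Intersection: minimum of the membership degrees."""
        return lambda x: min(mu_a(x), mu_b(x))

    def c_complement(mu_a):
        """Complement: 1 - membership degree."""
        return lambda x: 1.0 - mu_a(x)

    # Two toy constraints defined over the same domain (a numeric speed value)
    mu_slow = lambda speed: max(0.0, min(1.0, (20.0 - speed) / 10.0))   # "slow enough"
    mu_near = lambda speed: 0.7                                          # constant degree, for illustration
    mu_both = c_intersection(mu_slow, mu_near)
    print(mu_both(12.0))   # min(0.8, 0.7) = 0.7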
The concept definition determines the framework that establishes the general rules about how to carry out the associated surveillance task. However, such a concept must then be instantiated for each particular environment. For example, in the case of the trajectory concept, the definition establishes the mechanisms required to analyse normal trajectories, but without instantiating them, because they depend on the particular environment to monitor.
6.3.2.5 Concept Instance
The next step after defining a concept and its constraints in a general way is to make instances of such a concept for particular environments.

Definition 6.5. An instance y of a concept ci in an environment Ej (Ej ∈ P), denoted as ciyj, is defined as follows:

ciyj = < Vi; DDVi; Φ̃i = {μ̃1i, μ̃2i, ..., μ̃zi} >    (6.5)

where Φ̃i is the set of particularised constraints of the set Φi, that is, each μ̃xi ∈ Φ̃i represents the particularisation of μxi ∈ Φi. It is verified that |Φi| ≥ |Φ̃i|.

Example 6.3. If ci is the trajectory concept, an instance of ci represents a normal trajectory within the environment Ej.
6.3.2.6 Normality Constraint Instance
A normality constraint instance is used to adapt the general definition of a kind of analysis based on a concept to a specific environment. Definition 6.6. A normality constraint instance is a fuzzy set defined over P(DDVi ) with an associated membership function ( μ˜ Xi ):
μ̃Xi : P(DDVi) → [0, 1]    (6.6)
so that if vki ∈ P(Vi ) is employed to define μXi , then the values of vki defined over DDVki ∈ P(DDVi ) are used to make the instance μ˜ Xi . Example 6.4. If μXi represents a constraint that checks if a moving object follows the sequence of regions of a trajectory, μ˜ xi is the constraint employed to check if
the moving object follows a particular sequence of regions within the monitored environment.
6.3.2.7 Degree of Normality Associated to an Instance
Each monitored object has an associated degree of normality that establishes how normal its behaviour is according to each concept attached to it, represented by the deployed instances of such concepts within the environment.

Definition 6.7. The degree of normality of an object obj within an environment Ej (Ej ∈ P) regarding an instance y of the concept ci (ciyj), denoted as Ncjiy(obj), is calculated from the values obtained for each μ̃xi:

Ncjiy(obj) = T(μ̃1i, μ̃2i, ..., μ̃zi)    (6.7)

where T is a t-norm, such as the t-norm that calculates the minimum value.
This particular t-norm, that is, the minimum, is suitable for surveillance systems and, particularly, for the model proposed in this work, since in this kind of system the violation of any of the constraints that define normality in the environment needs to be detected. In other words, if some of the constraints are not satisfied, the normality degree Ncjiy(obj) will be low. Using other t-norms implies obtaining values lower than this minimum, since a relevant property of t-norm operators is that any of them always returns a value lower than or equal to the one obtained by the operator that calculates the minimum. Therefore, this would imply a stricter surveillance and, as a result, a higher number of alarm activations. For instance, if the product t-norm (T(a, b) = a · b) is used and ciyj is defined through two constraints that are satisfied with a value of 0.6, the degree of normality Ncjiy(obj) will be 0.6 · 0.6 = 0.36, which does not make sense, since both constraints were satisfied with a higher degree (0.6). On the other hand, with the minimum t-norm this degree would be min(0.6, 0.6) = 0.6, which is a value that represents in a better way the degree of satisfaction of the constraints that define the instance.

Example 6.5. If ci is the normal trajectory concept and ci1j is a particular normal trajectory instantiated within the environment Ej from a set of constraints μXi and their particularisations (μ̃Xi), then Ncji1(obj) determines the degree of normality of the object obj regarding the trajectory ci1j. A high value of Ncji1(obj) means that obj follows the trajectory ci1j in a suitable way. On the other hand, a low value represents the contrary case, or that obj does not follow such a trajectory at all. In other words, a low value of Ncji1(obj) means that one or more constraints of the instance ci1j are not met.
6.3.2.8 Degree of Normality Associated to a Concept
After calculating the degree of normality of an object for each instance of a particular concept, the next step is to calculate the normality of such an object according to all the defined instances of that concept. In this way, it is possible to study the general behaviour of an object according to a concept.

Definition 6.8. The degree of normality of an object obj within an environment Ej (Ej ∈ P) according to a concept ci, denoted as Ncji(obj), is calculated as follows:

Ncji(obj) = S(Ncji1(obj), Ncji2(obj), ..., Ncjiw(obj))    (6.8)

where w is the number of instances of the concept ci and S is a t-conorm operator, for instance the maximum t-conorm.
In the same way that min(a, b) represents the maximum value that can be obtained by applying a t-norm, max(a, b) represents the minimum value that can be obtained by applying a t-conorm operator; that is, the value calculated with any other t-conorm operator will be higher than or equal to the one calculated by max. This fact also justifies its use in the proposed model, so that an object obj behaves in a normal way according to a concept ci within an environment Ej (Ncji(obj)) if it satisfies the constraints of some of the deployed instances for ci. On the other hand, considering the application of the t-norm and t-conorm operators, the normality analysis of an object obj, according to ci in Ej, is the result of applying an AND-OR fuzzy network over a set of constraints:

ci1j = μ̃1i1 ∧ μ̃2i1 ∧ ... ∧ μ̃ni1 = Ncji1(obj)
∨ ci2j = μ̃1i2 ∧ μ̃2i2 ∧ ... ∧ μ̃ni2 = Ncji2(obj)
∨ ...
∨ ciwj = μ̃1iw ∧ μ̃2iw ∧ ... ∧ μ̃niw = Ncjiw(obj)
→ Ncji(obj)

Example 6.6. If ci is the trajectory concept and cikj represents the normal trajectories defined through the instances of the variables and constraints in Ej, then the degree of normality Ncji(obj) of each object according to the concept is calculated by applying a t-conorm operator over the values Ncjiy(obj) obtained for each one of the instances. Let the following values be:
ci1j = μ̃1i1 ∧ μ̃2i1 ∧ ... ∧ μ̃ni1 = Ncji1(obj) = 0.2
∨ ci2j = μ̃1i2 ∧ μ̃2i2 ∧ ... ∧ μ̃ni2 = Ncji2(obj) = 0.8
∨ ...
∨ ciwj = μ̃1iw ∧ μ̃2iw ∧ ... ∧ μ̃niw = Ncjiw(obj) = 0.0
→ Ncji(obj) = 0.8

In this case, the object clearly satisfies the constraints defined for the instance ci2j of the concept ci, since the value calculated for Ncji2(obj) is 0.8. In other words, the object follows a normal trajectory, namely the trajectory instantiated in ci2j. In short, the degree of normality of an object associated to a concept ci, Ncji(obj), is a numerical value that belongs to the interval [0, 1] and is representative of the object behaviour regarding a concept or surveillance aspect. High values of this parameter represent normal situations within the monitored environment, while low ones represent suspicious or abnormal situations.
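With the minimum t-norm and the maximum t-conorm, the evaluation of a concept reduces to a max-of-mins over constraint satisfaction degrees. The sketch below illustrates this; the per-constraint values are invented, but they are chosen so that the resulting instance degrees reproduce the values 0.2, 0.8 and 0.0 of Example 6.6.

    def instance_normality(constraint_degrees):
        """Degree for one instance: minimum t-norm over its constraint satisfaction degrees."""
        return min(constraint_degrees)

    def concept_normality(instances):
        """Degree for the concept: maximum t-conorm over the degrees of all its instances."""
        return max(instance_normality(c) for c in instances)

    # Each inner list holds the satisfaction degrees of the constraints of one instance.
    instances = [
        [0.2, 0.9, 0.6],    # instance ci1: min = 0.2
        [0.8, 0.95, 1.0],   # instance ci2: min = 0.8
        [0.0, 0.4, 0.7],    # instance ciw: min = 0.0
    ]
    print(concept_normality(instances))   # 0.8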
6.3.2.9 Normal Behaviour within an Environment According to a Concept
The final goal after the analysis of a situation is to activate a set of alarms or to draw the attention of the security personnel when such a situation does not stay within the limits of normality. That is, after having calculated the degree of normality Ncji(obj), the model needs a mechanism to decide whether the object behaves normally or not, depending on the calculated degree of normality. This could be addressed by using an alpha threshold defined within the range [0, 1] over the degree of normality; however, this would imply that a situation changes abruptly from normal to abnormal. For this reason, normality is considered as a linguistic variable VN that takes a set of values over the definition domain DDVVN = {AA, PA, SB, PN, AN} (see Figure 6.2). In this way, the object behaviour can be absolutely abnormal (AA), possibly abnormal (PA), suspicious (SB), possibly normal (PN) or absolutely normal (AN), and a behaviour can belong to more than one set at the same time. The definition of each value of the domain DDVVN depends on the features of the environment to monitor, the desired security level and the criterion of the expert in charge of setting up the configuration parameters, which must be adapted to the system behaviour. Figure 6.2 graphically shows that the values assigned to the sets that represent anomalous situations are not high, which implies that the system is not very strict, avoiding in this way frequent (and possibly unnecessary) alarm activations. In this way, every time the degree of normality of an object behaviour according to a concept, Ncji(obj), is calculated, the membership of this value to the fuzzy sets that establish the definition domain DDVVN of the normality variable (VN)
is studied, determining the normality of the analysed situation. The alarm activation will rely on upper layers, which will perform the required actions depending on the normality values received.
Fig. 6.2 Definition domain DDVVN of the normality variable VN .
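The partition of [0, 1] used by VN can be implemented with standard trapezoidal membership functions. The breakpoints below are purely hypothetical, since the exact shapes are those plotted in Figure 6.2 and are not given numerically; the sketch only shows how a degree of normality can be mapped onto the five linguistic labels.

    def trapezoid(a, b, c, d):
        """Trapezoidal membership function with feet a, d and shoulders b, c."""
        def mu(x):
            if x <= a or x >= d:
                return 0.0
            if b <= x <= c:
                return 1.0
            return (x - a) / (b - a) if x < b else (d - x) / (d - c)
        return mu

    # Hypothetical partition of [0, 1] for VN; the real breakpoints are those of Figure 6.2.
    VN = {
        "AA": trapezoid(-1.0, 0.0, 0.10, 0.25),   # absolutely abnormal
        "PA": trapezoid(0.10, 0.25, 0.35, 0.50),  # possibly abnormal
        "SB": trapezoid(0.35, 0.50, 0.55, 0.70),  # suspicious
        "PN": trapezoid(0.55, 0.70, 0.80, 0.90),  # possibly normal
        "AN": trapezoid(0.80, 0.90, 1.0, 2.0),    # absolutely normal
    }

    degree = 0.62
    print({label: round(mu(degree), 2) for label, mu in VN.items()})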
The next section discusses how to aggregate the output of different normality components to get a global analysis and determine if the behaviour of the objects is normal within a particular environment, taking into account multiple surveillance concepts.
6.3.3 Global Normality Analysis by Aggregating Independent Analyses

The normality of an object does not exclusively depend on a single concept. In fact, a global evaluation of the normality of all the monitored concepts within the environment must be carried out. Therefore, a mechanism for combining multiple analyses is needed to get a global value that represents the normality of the object behaviour in a general way. A possible approach consists in combining the output values of the normality components by using t-norms or t-conorms. For instance, the minimum t-norm establishes that the global normality value corresponds to the analysis with the lowest degree of normality. On the other hand, if the maximum t-conorm is used, then the result will be determined by the component that provides the highest degree of normality. Neither technique reflects the real process of combining multiple degrees of normality. For this reason, the OWA (Ordered Weighted Averaging) operators [34] are proposed to address this problem due to their flexibility.
Formally, an OWA operator is represented as a function F : Rn → R associated to a vector of weights W of length n, W = [w1, w2, ..., wn], where each wi ∈ [0, 1] and the weights sum to 1:

OWA(a1, a2, ..., an) = w1 · b1 + w2 · b2 + ... + wn · bn    (6.9)

or, equivalently, in vector form, OWA(a1, a2, ..., an) = [w1, w2, ..., wn] · [b1, b2, ..., bn]T, where (a1, a2, ..., an) represents the set of initial values or criteria used by the operator to make a decision, and (b1, b2, ..., bn) represents the ordered set associated to (a1, a2, ..., an), bj being the j-th highest value of such a set. Furthermore, the weights wi of the vector W are linked to positions and not to particular values of the original set. In the model devised in this work, the OWA operator is used to aggregate the normality values calculated by each component (Ncj1, Ncj2, ..., Ncjn). One of the key characteristics of this family of aggregation operators is the flexibility to vary their behaviour depending on the values assigned to the vector of weights W. Such a vector determines the behaviour of the operator, which may tend to behave as a union or an intersection operator. In fact, this behaviour can be customised to reflect the minimum t-norm or the maximum t-conorm. Figure 6.3 graphically shows the relationships between the values obtained by the t-norm, OWA and t-conorm operators, respectively.
Fig. 6.3 Visual comparison between the range of values obtained by t-norm, t-conorm and OWA operators.
In order to analyse how close the behaviour of an OWA operator is to the minimum t-norm or to the maximum t-conorm, R. Yager proposed the following measure [34, 35]:

orness(W) = (1 / (n − 1)) · ∑i=1..n (n − i) · wi    (6.10)
After having initialised the vector W, values of the orness measure closer to 1 reflect that the OWA operator is close to the maximum, while values closer to 0 reflect the proximity to the minimum. The particular values of the vector of weights W depend on the desired surveillance level. In this way, an orness value close to 1 represents a soft surveillance, so that a situation will be considered normal as long as there is some normal behaviour. On the other hand, values close to 0 reflect a hard surveillance, so that if there is a single anomalous behaviour, then the situation will be considered abnormal. Values close to 0.5 reflect intermediate cases. Table 6.1 shows three configurations of W and how they affect a common monitored situation given by the output of three normality components Ncj1(obj), Ncj2(obj) and Ncj3(obj), reflecting the minimum, the maximum and a weighted average.

Table 6.1 Multiple configurations of the OWA operator.
Criteria | Value | Minimum weight | Minimum evaluation | Maximum weight | Maximum evaluation | Average weight | Average evaluation
Ncj1(obj) | 0.95 | 0 | 0 | 1 | 0.95 | 0.33 | 0.31
Ncj2(obj) | 0.8 | 0 | 0 | 0 | 0 | 0.33 | 0.264
Ncj3(obj) | 0.2 | 1 | 0.2 | 0 | 0 | 0.33 | 0.066
TOTAL | | | 0.2 | | 0.95 | | 0.64
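An OWA operator and Yager's orness measure can be implemented in a few lines. The sketch below reproduces the three configurations of Table 6.1 applied to the outputs 0.95, 0.8 and 0.2 of the three normality components.

    def owa(values, weights):
        """Ordered Weighted Averaging: weights are applied to the sorted (descending) values."""
        ordered = sorted(values, reverse=True)
        return sum(w * b for w, b in zip(weights, ordered))

    def orness(weights):
        """Yager's orness: 1 for the maximum (OR-like) behaviour, 0 for the minimum (AND-like)."""
        n = len(weights)
        return sum((n - i) * w for i, w in enumerate(weights, start=1)) / (n - 1)

    degrees = [0.95, 0.8, 0.2]                 # outputs of three normality components
    for name, w in [("minimum", [0, 0, 1]), ("maximum", [1, 0, 0]),
                    ("average", [0.33, 0.33, 0.33])]:
        print(name, round(owa(degrees, w), 3), "orness =", round(orness(w), 2))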
After having obtained the global normality value by applying the OWA operator (OWA(Ncj1, Ncj2, ..., Ncjn)), the final normality value Nj(objk) associated to an object objk in a particular environment Ej is given by the degree of membership of this value to two fuzzy sets. These sets define the values possibly normal behaviour (PN) and absolutely normal behaviour (AN) of the definition domain DDVVN of the normality variable VN (see Figure 6.2), determining in this way the normality of the analysed situation:

Nj(objk) = μPN(OWA(Ncj1, Ncj2, ..., Ncjn)) + μAN(OWA(Ncj1, Ncj2, ..., Ncjn))    (6.11)

where μPN(OWA(Ncj1, Ncj2, ..., Ncjn)) and μAN(OWA(Ncj1, Ncj2, ..., Ncjn)) establish the membership of the output value of the OWA operator to the sets PN and AN, respectively. The global normality value within an environment Ej considering the activity of all the moving objects, denoted as GNj, is calculated as the minimum of the normality values Nj calculated for each object objk:

GNj = min{Nj(objk) : k = 1, ..., |O|}    (6.12)
where O is the set of monitored objects at a particular time, |O| is the number of monitored objects and N j (ob jk ) is the normality value calculated for each object ob jk ∈ O. The use of the minimum is justified because when an object does not behave normally, the global normality degree GN j in the environment E j must be a low value.
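Equations (6.11) and (6.12) can be sketched as follows. The membership functions mu_PN and mu_AN below are placeholders (the real ones are the PN and AN sets of Figure 6.2), and the per-object OWA outputs are invented values.

    def global_normality(per_object_owa, mu_PN, mu_AN):
        """Nj(obj) = muPN(owa) + muAN(owa); GNj is the minimum over all monitored objects."""
        N = {obj: mu_PN(v) + mu_AN(v) for obj, v in per_object_owa.items()}
        return N, min(N.values())

    # Placeholder memberships standing in for the PN and AN sets of Figure 6.2.
    mu_PN = lambda x: max(0.0, min((x - 0.55) / 0.15, (0.9 - x) / 0.1, 1.0))
    mu_AN = lambda x: max(0.0, min((x - 0.8) / 0.1, 1.0))

    owa_outputs = {"obj1": 0.92, "obj2": 0.64}   # aggregated component outputs per object
    N, GN = global_normality(owa_outputs, mu_PN, mu_AN)
    print(N, GN)   # per-object normality and the global (minimum) value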
6.4 Model Application: Trajectory Analysis

This section discusses the application of the model to monitor the trajectories and speed of moving objects in a particular scenario. These analyses are performed by two independent normality components, but the global normality is evaluated by taking both of them into account. This aggregation is carried out by using the approach described in Section 6.3.3. The analysis of trajectories and speed using the visual information gathered by security cameras represents two problems that have been widely studied by other researchers. The use of a formal model to deal with them provides several advantages over other approaches. In the case of trajectory analysis, most authors focus their work on analysing the spatial information to recognise the paths of moving objects [17, 22, 23, 25]. However, these definitions might not be enough in the surveillance field, since additional constraints very often need to be checked. Most existing proposals make use of algorithms for learning normal trajectories, which usually match those frequently followed by most moving objects. Within this context, it is important to take into account that a trajectory often followed by a vehicle may not be normal for a pedestrian or, likewise, that trajectories that are normal in a particular time interval may be abnormal in a different one. Therefore, the trajectory analysis based on the proposed model provides richer and more scalable definitions, because new kinds of constraints can be included when they are needed (or unnecessary constraints previously added can be removed). On the other hand, the second component analyses the speed of moving objects by using the 2D images provided by a security camera without knowing the camera configuration parameters. The main goal of this component is to classify the real speed of objects as normal or anomalous. Most of the proposed methods estimate the speed and require a previous calibration process that needs relevant information about the camera, such as its height or zoom level. Afterwards, they use transformation and projection operations to get a real 3D representation of the position and the movement of the object [9, 11, 20, 21, 24]. The definition of the speed concept in the proposed model takes into account how people work when they analyse a video, that is, they do not need to know the exact values of the position and speed of moving objects to infer that they move at a fast speed. On the contrary, the security staff only analyse the object displacement using the closest static objects as a reference. The output of this component is therefore obtained from the analysis that determines whether an object moves at a normal speed regarding such a displacement. To do that, horizontal and vertical fuzzy partitions of the 2D images are built. This information is used by an inductive learning algorithm to generate a set of constraints that establish how the horizontal, vertical and global displacements of each kind of object are performed in each region of the fuzzy partition. An example of a generated constraint is as follows: the speed of vehicles in regions far from the camera is normal if the horizontal displacement is small.
A more detailed discussion on the definition of this concept is given in [3]. Next, the application of the proposed formal model for the definition and analysis of trajectories is described in depth. Nevertheless, Section 6.5 will discuss the results obtained by the two normality components previously introduced.
6.4.1 Normal Trajectory Concept

A trajectory is defined by means of three different kinds of constraints: i) role constraints, which specify what kinds of objects are allowed to follow a trajectory; ii) spatial constraints, which determine the sequence of regions where moving objects must move (possibly with an associated order); and, finally, iii) temporal constraints, which refer to the maximum period of time or the interval allowed to follow a trajectory. In this work, the regions used to establish trajectories are represented by means of polygons graphically defined with a knowledge acquisition tool over a 2D image. In this way, a single sequence of regions may represent a trajectory that comprises multiple paths whose points are close to each other (see Figure 6.4). On the other hand, it is uncommon for two similar paths to correspond to a single trajectory.
Fig. 6.4 (a) Group of trajectories followed by vehicles and marked by means of continuous lines over the scene. (b) Group of trajectories without the background image that represents the scene. (c) Example of sequence of zones or regions used to define the previous group of trajectories.
In order to define trajectories, the following information is required:

• The set of relevant regions of the monitored environment.
• The regions where each kind of object can be located at a particular time.
• The sequence of regions covered by an object, temporally ordered, from the moment the object is first identified.
• The kinds of monitored objects, determined by means of physical features such as height, width, horizontal and vertical displacements, etc.
• The identifiers associated to each moving object by the tracking process.
• A temporal reference to the current moment.
• The time spent to carry out each one of the defined trajectories.

This information justifies the choice of variables for the definition of the trajectory concept discussed in the next section.
6.4.2 Set of Variables for the Trajectory Definition

As previously specified in the model formalisation, a concept is defined by means of a 3-tuple ci = < Vi; DDVi; Φi > composed of three elements, where Vi is the set of variables needed to represent the concept ci, DDVi is the definition domain of each variable belonging to Vi and, finally, the concept definition is completed through the association of a set of constraints Φi. Next, the variables used in the trajectory component are described taking this information into account.

(a) The system must be aware of the division of the environment into regions, and of the regions where a moving object may be located. Within this context, the system needs an internal representation of the most relevant regions in order to establish a relation between the object positions and the previously defined regions.

Table 6.2 Set of regions A that represent the regions of activity in a monitored environment.

Variable | DDVA | Description
A | {a1, a2, ..., an} | Set of areas belonging to a monitored environment. Each ai is represented through a polygon, defined by means of a set of ordered points.
(b) Sequence of temporally-ordered areas covered by an object. Table 6.3 completes the set of variables concerning the environment regions and the position of each object. Each μa(obj) represents the intersection of an object obj with a particular area a. This information is not generated by the component but obtained from the low-level modules, which use the data provided by the segmentation and tracking processes to perform this task. The main reason to calculate the regions where an object is located in these low-level modules is that multiple modules or normality components may need this information. On the other hand, the system maintains a register for each object obj, where the set of regions covered by the object is stored to establish similarities with normal trajectories.

Table 6.3 Variables used, together with the set of areas A, to represent the object positions on the scene.

Variable (Vi) | DDVi | Description
μa(obj) | [0, 1] | Imprecise information received from low-level layers about the possibility for an object obj to be located over an area a, where 1 represents that the object is totally located over it and 0 the contrary case.
(obj) | {μa(obj)}+ | List of the μa(obj) such that μa(obj) > 0, updated until the current time.
(c) Object membership to one or multiple classes (see Table 6.4). Each μc (ob j) represents the degree of membership of an object ob j to the class c within the interval [0, 1], where 1 is the maximum membership value and 0 the minimum. The degree of membership of an object to a class is also calculated by low-level layers that use the spatial information of the segmentation and tracking processes. Therefore,
both the regions where an object is located and its classes are the input of the component that analyses trajectories.

Table 6.4 Membership value of an object obj to a class c.

Variable (Vi) | DDVi | Description
μc(obj) | [0, 1] | Imprecise information of the object membership to a class c (∀c ∈ O, ∃ μc(obj)).
(d) Temporal references. The system maintains a temporal reference to the current moment in which the analysis is being performed and, at the same time, it allows the maximum duration and the temporal intervals of the trajectories to be specified (see Table 6.5).

Table 6.5 Set of temporal constraints associated to trajectories.

Variable | DDV | Description
tc | ([1, 31] ∪ {∗}) × ([1, 12] ∪ {∗}) × ([1900, 9999] ∪ {∗}) × ([0, 24] ∪ {∗}) × ([0, 59] ∪ {∗}) × ([0, 59] ∪ {∗}) | Absolute reference to the current moment. The format is DD/MM/YY - hh:mm:ss. The symbol ∗ is used as a wild card.
tmax | tmax ∈ [0, 9999] | Maximum allowed duration for a trajectory, measured in seconds. This parameter is often used to define temporal constraints.
Intj = <ts, te> | DDVtc × DDVtc | Time interval defined through an initial (ts) and a final (te) moment. This variable is also used to define temporal constraints.
Finally, Table 6.6 shows the rest of the variables, which are mainly used to define constraints.

Table 6.6 Variables ϒ, Ψ and Ψ(obj), used to define role and spatial constraints, respectively.

Variable | DDV | Description
ϒ | O | ϒ represents the set of classes/roles that are allowed to follow a certain trajectory. Each trajectory has an associated set ϒ, used to define role constraints.
Ψ | DDVΨ ⊆ DDVA | Ψ represents the sequence of areas that must be covered to perform a certain trajectory. Each trajectory has an associated sequence Ψ, used to define spatial constraints.
Ψ(obj) | DDVΨ | Maximum membership value of the object obj to each of the regions of Ψ covered until the current time. If an object does not satisfy the order imposed by Ψ (the object is not located on a region of the sequence Ψ or the order has been broken), the list of covered regions Ψ(obj) becomes empty and is initialised to a void value: Ψ(obj) = ∅.
6.4.3 Preprocessing Modules

The preprocessing modules are responsible for providing the values of the variables used by the normality components to perform the analysis. Concretely, the normal trajectory component needs three preprocessors whose main functions are: i) tracking of objects, ii) estimation of the object position (the regions where it is located) and iii) object classification. The first one is especially important so that the system is aware of the objects that appear on the scene; in fact, the intermediate layers will not be able to analyse the behaviour of objects that were not previously recognised. The design and development of tracking algorithms is not one of the main goals of this work, but tracking is needed to carry out the proposed analysis. For this reason, OpenCV and the Blob Tracker Facility library [1] have been used to perform this task. The application blobtrack.cpp, implemented in C++, makes use of these libraries to segment and track objects from the video stream gathered by the security cameras. A detailed description of the whole process is discussed in [10]. The original application has been customised to extend its functionality and obtain more spatial data, providing information about the horizontal and vertical displacement of each object obj between consecutive frames, calculated as the difference between the central points of the ellipse that wraps the monitored object. On the other hand, the source code has been modified to generate a persistent log used by the preprocessing modules and the normality components.

The second preprocessor makes use of the scene knowledge and the information generated by the first preprocessor to estimate the regions where an object is located. That is, the preprocessor estimates the object location from the following information:

• The set of areas (A) defined over the scenario to monitor. Figure 6.5.b shows an example of a scene divided into multiple regions.
• The spatial information of each object, provided by the segmentation algorithm: the coordinates of the central point of the ellipse that frames the object, the object height and the object width.

Fig. 6.5 (a) Scene monitored from a security camera. (b) Scene division into areas or regions.

The membership value of an object to a particular area (μa(obj)) is the result of dividing the number of points of the object base that lie over the area a by the total number of points of the object base. The object base over an area corresponds to the lower half of the ellipse that wraps the object. The points that form this base are those that belong to the rectangle whose upper left vertex is (x − width/2, y) and whose lower right vertex is (x + width/2, y + height/2), x and y being the coordinates of the ellipse central point. To determine whether a point (x, y) belongs to a polygon defined by a set of points, W. Randolph Franklin's algorithm [15] is used, which is based on a previous algorithm [27].

Finally, the third preprocessor establishes the class or classes of a detected object. To do that, the output generated by the first preprocessor and a set of IF-THEN fuzzy rules automatically acquired by the inductive learning algorithm [3] are used. These rules, of the form IF v0 is ZD0 ∧ ... ∧ vn is ZDn THEN yj, are characterised by antecedents composed of a set of variables Vi that take a subset of values ZDi ⊂ DDVvi, and by a consequent yj that represents the object class if the rule is fired. Currently, the set of variables V employed to classify objects differs from [3] and is composed of the following variables, whose computation is sketched after the list:

• Horizontal (v1) and vertical (v2) location. The position and the distance with respect to the camera affect the object dimensions.
• Width-height (W/H) ratio, v3. This value is obtained by dividing the ellipse width by the ellipse height. Studying the object proportions instead of its size improves the classification results: segmentation algorithms are not perfect and, therefore, the estimated ellipse does not completely wrap the identified object most of the time; however, the proportions that are characteristic of the class it belongs to are usually maintained.
• Ratio between the horizontal and vertical displacements, v4. This value is obtained by dividing the variables that contain the horizontal and vertical displacement values of the object. In some scenarios, these variables may be a relevant discriminator between object classes. In the scene of Figure 6.5, the camera location and the routes followed by the vehicles imply that their horizontal displacements are normally larger than the vertical ones.
• Global displacement (MOV), v5. This value is calculated as the Euclidean distance between the central points of the ellipse that wraps the object in consecutive frames. This variable is also useful to distinguish object classes, since some displacements are not plausible for certain kinds of objects. The length of the displacements varies depending on the region of the 2D image where they take place, that is, depending on the distance between the object and the camera: displacements in regions that are far from the camera are much smaller in the image than those produced in close regions, which does not imply a shorter distance in the real scene.
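A possible way of computing these variables from two consecutive detections is sketched below; the detection format (centre, width and height of the wrapping ellipse) follows the description above, while the concrete numbers are invented.

    import math

    def classification_features(prev, curr):
        """Compute v1..v5 from two consecutive detections.
        Each detection is (cx, cy, width, height) of the ellipse that wraps the object."""
        cx, cy, w, h = curr
        px, py, _, _ = prev
        dx, dy = abs(cx - px), abs(cy - py)
        return {
            "v1_horizontal_location": cx,
            "v2_vertical_location": cy,
            "v3_width_height_ratio": w / h,
            "v4_displacement_ratio": dx / dy if dy > 0 else float("inf"),
            "v5_global_displacement": math.hypot(cx - px, cy - py),
        }

    print(classification_features(prev=(120, 80, 40, 90), curr=(128, 82, 41, 92)))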
The preprocessor also maintains a persistent register that stores how each object was classified at every moment. Depending on this information and the fired rules, the membership values to each one of the classes are calculated. Algorithm 6.1 specifies the steps to be followed in order to determine the membership to the set of classes O. Figure 6.6 shows an example that describes this classification process.

Algorithm 6.1. Object classification algorithm
Require: Detected object in the current frame. The values of the variables (v1, v2, v3, v4, v5) are (x1, x2, x3, x4, x5).
Ensure: Degrees of membership to the classes pedestrian (y1) and vehicle (y2).
1. Calculate the satisfaction degree of each rule.
2. FOR EACH class oi ∈ O:
   a) The value of the current membership is calculated without taking into account previous classifications. If multiple rules of the same class are fired, the value used is the one with the highest degree of satisfaction among such rules. If no rule is fired, the membership value is 0.
   b) The final membership value to each one of the classes is calculated as the arithmetic mean of the previous membership values and the current one.
   END FOR
Fig. 6.6 Object classification example. The first row represents the time, where ti < ti+1. The second row shows the form and size of the ellipse that wraps the object at each ti. Finally, the last row shows the object classification. If the image that represents the class has a numeric value associated to the right, it means that the object was classified at ti with such a degree of membership. The final membership value is calculated as the arithmetic mean of the previous membership values and the current one.
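A runnable rendering of Algorithm 6.1 could look as follows. The rule base is a toy example and the way the historic information is averaged is one possible reading of step 2.b; it is not the authors' implementation.

    def classify(features, rules, history):
        """Algorithm 6.1 sketch: per-class membership from the fired fuzzy rules plus the
        running mean of previous classifications. `rules` is a list of
        (class_name, satisfaction_function), with satisfaction degrees in [0, 1]."""
        current = {}
        for cls, satisfaction in rules:
            deg = satisfaction(features)
            current[cls] = max(current.get(cls, 0.0), deg)    # best fired rule per class
        final = {}
        for cls, deg in current.items():
            past = history.get(cls, [])
            final[cls] = (sum(past) + deg) / (len(past) + 1)  # arithmetic mean with history
            history.setdefault(cls, []).append(deg)
        return final

    # Toy rule base: wide objects look like vehicles, the rest like pedestrians.
    rules = [
        ("vehicle",    lambda f: min(1.0, f["w_h_ratio"] / 2.0)),
        ("pedestrian", lambda f: max(0.0, 1.0 - f["w_h_ratio"] / 2.0)),
    ]
    history = {}
    print(classify({"w_h_ratio": 1.4}, rules, history))   # first frame
    print(classify({"w_h_ratio": 1.8}, rules, history))   # second frame, averaged with history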
6.4.4 Constraint Definition

The last step to complete the normality definition of the trajectory concept consists in the specification of the set of general constraints Φi included in the tuple < Vi, DDVi, Φi >. According to the formal model, the satisfaction of these constraints allows the surveillance system to infer whether the moving objects follow trajectories in a correct way, together with the degree of normality associated to the concept. A high value of this parameter implies that a monitored object follows in a suitable
way one of the normal trajectories defined for the monitored environment. The defined constraints are as follows:

Role constraint μ11. Each trajectory is associated with a list ϒ that specifies the kinds of objects that are allowed to follow it. Only those objects that belong to a class in ϒ satisfy this constraint. For instance, objects classified as pedestrians will not meet the role constraint of a trajectory defined for vehicles. The membership function μ11 of this constraint is defined in the following way:
μ11(obj) = μck(obj) ⟺ ∀c ≠ ck; c, ck ∈ ϒ; μck(obj) ≥ μc(obj)

so that μc(obj) represents the membership value of the object obj to the class c, and μck(obj) the maximum membership value to one class in ϒ. For example, if ϒ = {Person, Group of People} and the object obj has been classified as Person = 0.3 and Group of People = 0.7, then the degree of satisfaction of the constraint μ11(obj) is 0.7, that is, max(μPerson(obj), μGroupOfPeople(obj)).

Spatial constraint μ21. The spatial constraint included in the trajectory component checks whether an object follows the order of the sequence of regions defined for the trajectory (Ψ). Every time the component receives spatial information from the low-level layers, it calculates the membership value to each one of such regions depending on the object position. Secondly, the register of covered regions (see Table 6.3) and Ψ(obj) are updated only if one of the maximum values stored in such a sequence is exceeded (see Table 6.6). For each one of the trajectories associated to an object, a list Ψ(obj) is managed in order to store the maximum membership values to the regions of Ψ covered by the object. In other words, if the maximum membership value of an object to an area ak is μak(obj) = 0.8, ak being an area belonging to Ψ, and the current membership is μak(obj) = 0.3, then the value of μak(obj) in Ψ(obj) is not updated. In this way, extremely low values in transitions between regions are avoided: when an object moves from one region to another, the membership value to the previous region may be really low, which could cause an unwanted violation of the spatial constraint. The degree of satisfaction of the spatial constraint is calculated as the arithmetic average of all μak(obj) ∈ Ψ(obj). As shown in Table 6.6, if the object violates the order relation defined in Ψ, the value of Ψ(obj) will be ∅ and, therefore, the degree of satisfaction of such a constraint will be 0. The membership function μ21 of this constraint is defined as follows:
μ21(obj) = ( ∑k μak(obj) ) / |Ψ(obj)| if |Ψ(obj)| > 0; μ21(obj) = 0 otherwise

Temporal constraints μ31 and μ41. The temporal constraints make it possible to specify when a particular trajectory is allowed to be followed. The first kind of temporal constraint (μ31) is used to check whether a certain time limit tmax is exceeded, that is, an object follows a certain trajectory in a correct way if the trajectory is completed in a time shorter than tmax. The membership function μ31 of the constraint is defined in the following way:
μ31(tmax, tc, tb) = 1 if (tmax = ∅) ∨ ((tc − tb) ≤ tmax); 0 otherwise
where tc represents the current time and tb the time when the object began the trajectory. The second kind of temporal constraint (μ41) is used to check whether an object follows a certain trajectory within the suitable time interval. To do that, five temporal relationships relating moments and intervals, based on Allen's Interval Algebra [4] and the work developed by Lin [19], have been defined. Table 6.7 shows these relationships.

Table 6.7 Temporal relationships between moments and time intervals.

Relationship (tc, Intj) | Logic definition
Before | tc < start(Intj)
After | end(Intj) < tc
During | start(Intj) < tc < end(Intj)
Starts | start(Intj) = tc
Finish | end(Intj) = tc
Formally, the constraint μ41 is defined as follows:

μ41(Intj, tc) = 1 if (Intj = ∅) ∨ Starts(tc, Intj) ∨ During(tc, Intj) ∨ Finish(tc, Intj); 0 otherwise

An object satisfies a temporal constraint with an associated interval Intj if the current moment tc belongs to such an interval. This takes place when one of the following conditions is met: Starts(tc, Intj), During(tc, Intj) or Finish(tc, Intj). The next step after defining the concept in a general way and developing the normality component is to make instances for particular environments. Tables 6.8 and 6.9 show the set of particular instances for the scene of Figure 6.5.
μ˜ 11 μ˜ 21 ϒ = {vehicle} Ψ = {a1 , a2 , a3 , a4 , a5 , a6 } ϒ = {vehicle} Ψ = {a1 , a2 , a3 , a4 , a5 , a6 , a8 , a9 , a10 , a22 } ϒ = {vehicle} Ψ = {a1 , a2 , a3 , a4 , a5 , a6 , a8 , a9 , a11 , a13 , a14 , a15 } ϒ = {vehicle} Ψ = {a7 , a8 , a9 , a10 , a22 ϒ = {vehicle} Ψ = {a7 , a8 , a9 , a11 , a13 , a14 , a15 } ϒ = {vehicle} Ψ = {a7 , a8 , a9 , a11 , a5 , a6 } ϒ = {vehicle} Ψ = {a23 , a12 , a11 , a5 , a6 } ϒ = {vehicle} Ψ = {a23 , a12 , a11 , a13 , a14 , a15 } ϒ = {vehicle} Ψ = {a23 , a12 , a9 , a11 , a5 , a6 , a8 , a9 , a10 , a22 }
μ˜ 31 150sg 150sg 150sg
μ˜ 41 ∅ ∅ ∅
150sg 150sg 150sg 150sg 150sg 150sg
∅ ∅ ∅ ∅ ∅ ∅
Table 6.9 Set of instances of the normal trajectory concept for pedestrians. HU is the identifier of the time interval defined for the open hours at the University.

c11,y | μ̃11 | μ̃21 | μ̃31 | μ̃41
c11,10 | ϒ = {pedestrian} | Ψ = {a29, a30, a31} | ∅ | [HU]
c11,11 | ϒ = {pedestrian} | Ψ = {a31, a30, a29} | ∅ | [HU]
c11,12 | ϒ = {pedestrian} | Ψ = {a18, a17, a16} | ∅ | ∅
c11,13 | ϒ = {pedestrian} | Ψ = {a16, a17, a18} | ∅ | ∅
c11,14 | ϒ = {pedestrian} | Ψ = {a16, a4, a28, a13, a19} | ∅ | ∅
c11,15 | ϒ = {pedestrian} | Ψ = {a19, a13, a28, a4, a16} | ∅ | ∅
c11,16 | ϒ = {pedestrian} | Ψ = {a19, a20, a21} | ∅ | ∅
c11,17 | ϒ = {pedestrian} | Ψ = {a21, a20, a19} | ∅ | ∅
c11,18 | ϒ = {pedestrian} | Ψ = {a24, a25} | ∅ | ∅
c11,19 | ϒ = {pedestrian} | Ψ = {a25, a24} | ∅ | ∅
c11,20 | ϒ = {pedestrian} | Ψ = {a26, a27} | ∅ | ∅
c11,21 | ϒ = {pedestrian} | Ψ = {a27, a26} | ∅ | ∅
Each one of the rows represents an instance of the normal trajectory concept, while each one of the columns reflects a particularisation of the defined constraints.
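Combining the four constraints of this section with the minimum t-norm, an instance such as c11,1 of Table 6.8 can be evaluated as in the following sketch. The class memberships, region memberships and times are invented inputs, and the interval constraint μ41 is omitted because c11,1 has no associated interval (it would trivially evaluate to 1).

    def role_constraint(class_memberships, allowed):
        """mu11: maximum membership of the object to any class allowed for the trajectory."""
        return max((class_memberships.get(c, 0.0) for c in allowed), default=0.0)

    def spatial_constraint(covered):
        """mu21: arithmetic mean of the best memberships to the regions of Psi covered so far;
        an empty list means the region order was violated."""
        return sum(covered) / len(covered) if covered else 0.0

    def time_limit_constraint(t_max, t_current, t_begin):
        """mu31: 1 if no limit is set or the elapsed time does not exceed t_max."""
        return 1.0 if t_max is None or (t_current - t_begin) <= t_max else 0.0

    def instance_degree(*degrees):
        """Combine the constraint satisfaction degrees with the minimum t-norm."""
        return min(degrees)

    # Instance in the spirit of c11,1: vehicles over a1..a6, at most 150 s, no interval.
    mu11 = role_constraint({"vehicle": 0.9, "pedestrian": 0.1}, allowed=["vehicle"])
    mu21 = spatial_constraint([0.8, 0.95, 0.7])    # best memberships to the regions covered so far
    mu31 = time_limit_constraint(150, t_current=130, t_begin=40)
    print(instance_degree(mu11, mu21, mu31))        # min(0.9, 0.816..., 1.0)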
6.5 Experimental Results

The scenario chosen for validating and evaluating the model and the designed normality components is graphically shown in Figure 6.5. It is an urban traffic environment where both vehicles and pedestrians are subject to traffic rules, such as following correct trajectories or keeping a suitable speed on each stretch. This environment is monitored by a security camera located at the ORETO research group at the University of Castilla-La Mancha. The process of experimental validation consists in the monitoring of ten video scenes with a duration between 2'30" and 3'. One of the premises was to select visual material that covered multiple situations, with different illumination conditions and image resolutions. Figures 6.7 and 6.8 show the degree of illumination of each one of the scenes employed to test the system; scenes 7 to 10 have a lower resolution. The main goal of the conducted experiments is to evaluate the proposed methods to classify objects and to analyse the normality of trajectories and speed. For this reason, moving objects that are not detected by the tracking algorithm, which are global errors of the surveillance system, are not reflected in the results discussed in this work. In other words, the normality components cannot analyse the behaviour of undetected objects and, therefore, these errors are due to the low-level processing of spatio-temporal information. Each one of the objects detected in a single frame is considered a situation to be analysed. For each one of these situations, the system checks whether the classification, the region calculation, the trajectory analysis, the speed analysis and the global analysis are normal.
Fig. 6.7 Screenshots of the scenes employed in the tests.
Fig. 6.8 Set of histograms that represent the brightness and contrast values of each one of the test scenes. The higher the values in the Y axis, the higher the illumination of the scene. The first six scenes correspond to a cloudy day in which the objects lack marked shadows, while the remaining four have better illumination conditions.
The matches and errors of the normality components and the global analysis are classified into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), each of them representing the following situations:

• True positive (TP): normal situation correctly classified as normal.
• True negative (TN): anomalous situation correctly classified as anomalous.
• False positive (FP): anomalous situation incorrectly classified as normal.
• False negative (FN): normal situation incorrectly classified as anomalous.
Furthermore, for each one of the evaluated scenes, the relationships between these parameters have been established by means of the following coefficients [14] (a small computational sketch follows the list):

• Sensitivity or true positive rate (TPR): TPR = TP/(TP + FN), that is, the hit rate.
• False positive rate (FPR): FPR = FP/(FP + TN).
• Accuracy (ACC): ACC = (TP + TN)/(P + N), where P is the number of positive cases and N the number of negative cases.
• Specificity (SPC) or true negative rate: SPC = TN/N = TN/(FP + TN) = 1 − FPR.
• Positive predictive value (PPV): PPV = TP/(TP + FP).
• Negative predictive value (NPV): NPV = TN/(TN + FN).
• False discovery rate (FDR): FDR = FP/(FP + TP).
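These coefficients are straightforward to compute from the four counts; the sketch below uses the values of scene 1 in Table 6.13 as a check.

    def coefficients(tp, tn, fp, fn):
        """Coefficients used in the evaluation; None where the denominator is zero."""
        div = lambda a, b: a / b if b else None
        return {
            "TPR": div(tp, tp + fn),
            "FPR": div(fp, fp + tn),
            "ACC": div(tp + tn, tp + tn + fp + fn),
            "SPC": div(tn, fp + tn),
            "PPV": div(tp, tp + fp),
            "NPV": div(tn, tn + fn),
            "FDR": div(fp, fp + tp),
        }

    # Scene 1 of the trajectory component (Table 6.13): TP=6928, TN=165, FP=71, FN=335
    print({k: (round(v, 3) if v is not None else "-")
           for k, v in coefficients(6928, 165, 71, 335).items()})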
Next, the results obtained taking into account the previous criteria are discussed. Particularly, Table 6.10 shows the success rate of the object classification process for each one of the ten test scenes. The first three causes of error are due to the classification algorithm developed in this work, while the fourth one is due to the computer vision algorithms used in the low-level layers and can hardly be avoided by the classification process. As can be appreciated, the classification method obtains good results; concretely, a 95% success rate on average and 91% in the worst case.

Table 6.10 Results obtained in the object classification process for pedestrians and vehicles. The most common error causes are: a) wrong fired rule; b) no rule fired, the system being unable to classify the object; c) historic, i.e. the system takes into account previous classifications of the object, making errors in current classifications if the previous ones were incorrect; d) ellipse change between two objects that cross at a point (critical when the objects belong to different classes).

Scene | Success | Wrong rule | Rules not fired | Historic | Ellipse change
1 | 7176 (95%) | 287 | 1 | 35 | 0
2 | 6760 (96%) | 188 | 1 | 56 | 0
3 | 6657 (95%) | 227 | 28 | 49 | 30
4 | 11065 (99%) | 84 | 4 | 0 | 19
5 | 1397 (99%) | 0 | 1 | 0 | 0
6 | 2242 (93%) | 153 | 1 | 0 | 0
7 | 7498 (95%) | 345 | 0 | 15 | 8
8 | 2866 (97%) | 19 | 4 | 0 | 64
9 | 6115 (91%) | 359 | 0 | 123 | 92
10 | 1887 (98%) | 3 | 1 | 0 | 27
AVERAGE | 95.8% | | | |
Table 6.11 shows the results obtained in the estimation of the regions where the detected objects are located, taking into account the intersection between their support or base points and the polygons that represent the regions. This process mainly depends on the robustness of the tracking algorithm and on the association of the ellipses that wrap the objects. If the ellipses do not wrap most of the support points of the objects, the error rate in the classification process will be high. This class of errors implies the violation of spatial constraints, generating false positives and false negatives.
Table 6.11 Results obtained in the estimation of the regions where the detected objects are located.

Scene | Success | Wrong ellipse | Object with multiple ellipses
1 | 7069 (94%) | 238 | 192
2 | 7001 (99%) | 4 | 0
3 | 6920 (98%) | 71 | 0
4 | 11130 (99%) | 23 | 16
5 | 1398 (100%) | 0 | 0
6 | 2396 (100%) | 0 | 0
7 | 7732 (98%) | 137 | 0
8 | 2910 (98%) | 26 | 17
9 | 6654 (99%) | 35 | 0
10 | 1906 (99%) | 12 | 0
AVERAGE | 98.4% | |
Table 6.12 shows the results obtained in the normality analysis of the trajectory concept. The average success rate in this case is 96.4%. Some of the errors are due to errors in previous processes, such as object misclassification or wrong region estimation. These errors occasionally cause the violation of the constraints of the trajectory concept, so that the monitored object has no associated normal trajectory, generating false positives or false negatives. The errors made by the preprocessors that provide the information to the normality components do not always imply errors in such components: for instance, a vehicle classified as a pedestrian located over a pedestrian crossing; or a pedestrian located over an allowed area for pedestrians when the system considers that the pedestrian is located over another allowed area for pedestrians. In this last situation, although the region estimation was not correct, the system will still infer a correct result. For this reason, the number of errors when classifying objects or estimating regions may be higher than in the trajectory analysis. The output of the normality components for each analysed frame is a numeric value used to study the membership to the different normality intervals: absolutely normal, possibly normal, suspicious, possibly abnormal and absolutely abnormal. Any normal situation with an exclusive membership to the absolutely normal or possibly normal intervals is considered a success of the trajectory component and, in the same way, any anomalous situation with an exclusive membership to the absolutely abnormal or possibly abnormal intervals is also considered a success. The contrary cases imply false positives and false negatives, which are considered errors. Regarding the suspicious interval, an error is considered in the following cases: i) a normal situation where the membership to such an interval is higher than to the possibly normal interval, and ii) an anomalous situation where the membership to the suspicious interval is higher than to the possibly abnormal interval. Whether suspicious situations are considered critical or not should rely on the layers responsible for the decision
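The chapter does not specify the membership functions that define the five normality intervals, so the following sketch uses assumed trapezoidal breakpoints purely for illustration of how a numeric normality degree could be mapped to them.

```python
# Hedged sketch: mapping a normality degree in [0, 1] to the five fuzzy
# intervals named in the text. The breakpoints below are illustrative
# assumptions; the chapter does not give them.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership with support [a, d] and core [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

INTERVALS = {
    "absolutely_abnormal": (-0.1, 0.0, 0.1, 0.3),
    "possibly_abnormal":   (0.1, 0.3, 0.3, 0.5),
    "suspicious":          (0.3, 0.5, 0.5, 0.7),
    "possibly_normal":     (0.5, 0.7, 0.7, 0.9),
    "absolutely_normal":   (0.7, 0.9, 1.0, 1.1),
}

def interval_memberships(normality):
    return {name: trapezoid(normality, *params) for name, params in INTERVALS.items()}

# A frame evaluated at 0.82 is mostly "absolutely normal", partly "possibly normal".
print(interval_memberships(0.82))
```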
Table 6.12 Results obtained by the trajectory analysis component. Labels C1 (object misclassification), C2 (wrong position of the ellipse) and C3 (ellipse swap between objects) refer to error causes.

Trajectory analysis (Component 1)
Scene   Frames   Normal situations   Abnormal situations   Success        C1    C2    C3
1       3943     7224                275                   7093 (94%)     184   222   0
2       3000     7142                137                   6847 (97%)     154   4     0
3       3000     6751                240                   6731 (96%)     209   21    30
4       3250     11051               121                   11026 (99%)    88    39    19
5       995      1398                0                     1397 (99%)     1     0     0
6       1471     2343                53                    2242 (93%)     154   0     0
7       5233     7831                35                    7590 (96%)     134   134   8
8       1906     2738                215                   2920 (98%)     7     26    0
9       2165     6390                299                   6420 (95%)     234   35    0
10      772      1843                75                    1875 (97%)     4     12    27
TOTAL                                                      96.4%
Table 6.13 shows the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for the trajectory component and the relationships between them. These parameters are used to measure the quality of the methods proposed in this work, in order to appreciate the number of normal situations classified as normal, the number of detected anomalous situations, or the number of wrong alarm activations. In particular, the results shown in this table are satisfactory, since the number of false positives and false negatives is low in relation to the total number of analysed situations. The most critical situations are the anomalous situations classified as normal or, in other words, the false positives (FP). The table also shows that the value of this parameter is really low in the ten test scenes.

Table 6.14 shows the results obtained in the speed classification process from the normality point of view, calculated by the second normality component developed in this work. In this case, the average success rate is 98.7%. In the conducted experiments, both vehicles and pedestrians usually move at a normal speed; when this was not the case, most anomalous behaviours were detected. Since this normality component is based on the displacement of the ellipse that wraps the monitored object, partial occlusions must be taken into account, because they might cause abrupt changes in the ellipse position and, therefore, in the speed estimation.
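The derived ratios reported in Tables 6.13, 6.15 and 6.17 follow directly from the TP, TN, FP and FN counts. The following is only a small sketch of those standard definitions, with a dash returned when a denominator is zero, as in the tables.

```python
# Hedged sketch: standard ratios computed from TP, TN, FP and FN counts.

def ratios(tp, tn, fp, fn):
    div = lambda a, b: round(a / b, 3) if b else "-"
    return {
        "TPR": div(tp, tp + fn),                 # true positive rate (sensitivity)
        "FPR": div(fp, fp + tn),                 # false positive rate
        "ACC": div(tp + tn, tp + tn + fp + fn),  # accuracy
        "SPC": div(tn, tn + fp),                 # specificity
        "PPV": div(tp, tp + fp),                 # positive predictive value (precision)
        "NPV": div(tn, tn + fn),                 # negative predictive value
        "FDR": div(fp, tp + fp),                 # false discovery rate
    }

# Scene 1 of the trajectory component (Table 6.13).
print(ratios(tp=6928, tn=165, fp=71, fn=335))
```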
Table 6.13 Existing relationships between true/false positives and true/false negatives in the trajectory component.

Coefficient calculation for the trajectory component
Scene   TP      TN    FP   FN    TPR     FPR     ACC     SPC     PPV     NPV     FDR
1       6928    165   71   335   0.95    0.30    0.94    0.699   0.98    0.330   0.01
2       6847    0     0    158   0.98    -       0.977   -       1       0       0
3       6684    47    5    255   0.963   0.096   0.963   0.904   0.999   0.156   0.001
4       11026   0     4    142   0.987   1       0.987   0       1       0       0
5       1397    0     0    1     0.99    -       0.99    -       1       0       0
6       2228    14    0    154   0.935   0       0.936   1       1       0.083   0
7       7590    0     0    276   0.965   -       0.965   -       1       0       0
8       2766    154   26   7     0.997   0.144   0.989   0.856   0.991   0.957   0.009
9       6181    239   0    269   0.958   0       0.96    1       1       0.470   0
10      1875    0     0    43    0.978   -       0.978   -       1       0       0
Table 6.14 Results obtained by the normality component of speed analysis. Labels C1 (object misclassification), C2 (abrupt position change or wrong size ellipse), C3 (wrong fired rules), and C4 (ellipse swap between objects of different class) refer to error causes.

Speed analysis (Component 2)
Scene   Frames   Normal situations   Abnormal situations   Success        C1    C2    C3    C4
1       3943     7224                275                   7478 (99%)     0     0     21    0
2       3000     7142                137                   6923 (98%)     40    0     25    17
3       3000     6751                240                   6842 (97%)     0     25    124   0
4       3250     11051               121                   11095 (99%)    0     0     77    0
5       995      1398                0                     1398 (100%)    0     0     0     0
6       1471     2343                53                    2344 (97%)     13    39    0     0
7       5233     7831                35                    7813 (99%)     0     18    35    0
8       1906     2738                215                   2943 (99%)     0     0     10    0
9       2165     6390                299                   6639 (99%)     0     15    10    25
10      772      1843                75                    1918 (100%)    0     0     0     0
TOTAL                                                      98.7%
In other words, while a partial occlusion is taking place, the ellipse wraps only the visible part of the object and may change its position abruptly when the previously hidden part becomes visible again. In the test environment, this phenomenon happens in three places (see Figure 6.5): i) the lower left part, occluded by tree branches; ii) the upper central part, occluded by the bushes of the roundabout; and iii) the right central part, occluded by the fence of the building. The learning algorithm used deals with this problem, since it learns that the displacements in this kind of region may be higher, so the object speed is still considered normal.
Table 6.15 shows the existing relationship between TP, TN, FP, and FN. The good results are a consequence of the reduced number of false positives and false negatives and the high number of true positives and true negatives.

Table 6.15 Existing relationships between true/false positives and true/false negatives in the normality component of speed analysis.

Coefficient calculation for Component 2
Scene   TP      TN    FP   FN    TPR     FPR     ACC     SPC     PPV     NPV     FDR
1       7460    18    21   0     1       0.53    0.99    0.462   0.99    1       0.002
2       6786    137   0    82    0.99    0       0.988   1       1       0.626   0
3       6641    201   39   110   0.984   0.163   0.979   0.838   0.994   0.646   0.005
4       10993   102   15   62    0.994   0.128   0.993   0.872   0.999   0.622   0.001
5       1398    0     0    0     1       -       1       -       1       -       0
6       2344    0     39   13    0.994   1       0.978   0       0.984   0       0.02
7       7788    25    10   43    0.995   0.286   0.995   0.714   0.999   0.368   0.001
8       2918    25    10   0     1       0.286   0.997   0.714   0.997   1       0
9       6589    50    10   40    0.994   0.167   0.99    0.83    0.998   0.556   0.001
10      1843    75    0    0     1       0       1       1       1       1       0
Tables 6.16 and 6.17 show the results of the global analysis process, which makes use of the OWA aggregation operators to combine the output given by the two normality components and obtain a global degree of normality. The errors in this process can be due to the first component, the second one, or both at the same time (this last situation is counted as a single error in the global analysis). Occasionally, the errors caused by one of the components do not produce an error in the global analysis, thanks to the vector of weights used by the aggregation operator. For instance, a situation considered as suspicious by the first component and absolutely normal by the second one might be considered as possibly normal by the global analysis module.

On the other hand, Figure 6.9 graphically shows the relationship between two of the parameters related to anomalous situations, TN and FP. The number of true negatives (TN) represents the number of anomalous situations detected and correctly understood by the system, while the number of false positives (FP) refers to the anomalous situations that go unnoticed by the system. The higher the number of false positives and the lower the number of true negatives, the lower the quality and efficiency of the methods employed to understand events and situations. The diagrams of Figure 6.9 reflect that the methods proposed in this work correctly detect not only the normal situations but also the anomalous situations, which rarely take place.
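As a small illustration of the aggregation step, the sketch below applies an OWA (ordered weighted averaging) operator [34, 35] to the degrees returned by the two components. The weight vector is an illustrative assumption, not the one used in the chapter.

```python
# Hedged sketch of OWA aggregation of the trajectory and speed normality degrees.

def owa(values, weights):
    """OWA operator: weights are applied to the values sorted in descending order."""
    assert abs(sum(weights) - 1.0) < 1e-9 and len(values) == len(weights)
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

# Trajectory component says "suspicious" (0.5), speed component says
# "absolutely normal" (0.95); an optimistic weight vector favours the larger
# degree, so the global result comes out as "possibly normal".
trajectory_degree, speed_degree = 0.5, 0.95
print(owa([trajectory_degree, speed_degree], weights=[0.7, 0.3]))  # 0.815
```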
Table 6.16 Results obtained by the global module combining the analysis of the two normality components. Labels C1 and C2 refer to errors caused by the first and second components, respectively.

Global analysis: trajectories and speed
Scene   Frames   Normal situations   Abnormal situations   Success        C1    C2    Both
1       3943     7224                275                   7072 (94%)     406   21    0
2       3000     7142                137                   6805 (97%)     118   42    40
3       3000     6751                240                   6504 (93%)     260   149   0
4       3250     11051               121                   10949 (98%)    146   77    0
5       995      1398                0                     1397 (99%)     1     0     0
6       1471     2343                53                    2190 (91%)     154   39    13
7       5233     7831                35                    7537 (95%)     276   53    0
8       1906     2738                215                   2910 (98%)     33    10    0
9       2165     6390                299                   6370 (95%)     269   50    0
10      772      1843                75                    1875 (97%)     43    0     0
TOTAL                                                      95.7%
Table 6.17 Existing relationships between true/false positives and true/false negatives in the global normality analysis process.

Coefficient calculation for the global analysis
Sce.    TP      TN    FP   FN    TPR     FPR     ACC     SPC     PPV     NPV     FDR
1       6889    183   92   335   0.95    0.33    0.94    0.67    0.98    0.353   0.01
2       6668    137   0    200   0.97    0       0.971   1       1       0.407   0
3       6334    248   44   365   0.946   0.151   0.941   0.849   0.993   0.405   0.01
4       10847   102   19   204   0.982   0.157   0.98    0.843   0.998   0.333   0.001
5       1397    0     0    1     0.99    -       0.99    -       1       0       0
6       2189    14    39   154   0.934   0.736   0.919   0.264   0.982   0.083   0.02
7       7512    25    10   319   0.959   0.286   0.958   0.714   0.999   0.073   0.001
8       2731    179   36   7     0.997   0.167   0.985   0.833   0.987   0.962   0.01
9       6081    289   10   309   0.952   0.033   0.95    0.967   0.998   0.483   0.001
10      1830    75    0    43    0.977   0       0.978   1       1       0.636   0
Tables 6.18 and 6.19 show the total and average times spent by each of the processes that compose the surveillance system. The estimation of the region where an object is located is the most time-consuming task, because the algorithm checks whether every support point of the object belongs to one of the predefined regions. Even so, the time spent in the worst case is approximately 63 milliseconds, which allows the system to respond quickly to any event. New versions of the algorithms will be proposed in future work to reduce this time.

The results obtained prove that the design of components by means of the formal model discussed in this work is feasible, provides high performance, offers really short response times and, finally, allows knowledge to be represented with high interpretability. Furthermore, the two developed components have been successfully combined, thanks to the use of OWA operators, to obtain a global evaluation of object behaviour, which is normal if the object follows a normal trajectory at a suitable speed.
Fig. 6.9 Relationships between the anomalous situations detected by the normality components and the global analysis. True negatives (TN) represent detected anomalous situations, while false positives (FP) represent anomalous situations understood as normal.

Table 6.18 Time spent by the system processes, measured in seconds. Sce. is scene, elem. are elements to analyse, T is the total time spent, and M is the average time.

Time (seconds)
Scene   Elem.   Object classification (T / M)   Object location (T / M)   Trajectory analysis (T / M)
1       7499    0.256 / 3x10^-5                 290.980 / 0.039           0.113 / 2x10^-5
2       7005    0.199 / 2x10^-5                 154.996 / 0.022           0.142 / 2x10^-5
3       6991    0.215 / 3x10^-5                 441.485 / 0.063           0.235 / 3x10^-5
4       11172   0.376 / 3x10^-5                 371.468 / 0.033           0.272 / 2x10^-5
5       1398    0.055 / 3x10^-5                 67.501 / 0.048            0.074 / 5x10^-5
6       2396    0.069 / 2x10^-5                 88.928 / 0.037            0.125 / 5x10^-5
7       7866    0.288 / 3x10^-5                 263.230 / 0.033           0.183 / 2x10^-5
8       2953    0.080 / 2x10^-5                 178.272 / 0.060           0.134 / 4x10^-5
9       6689    0.221 / 3x10^-5                 283.696 / 0.038           0.176 / 2x10^-5
10      1918    0.070 / 3x10^-5                 58.321 / 0.028            0.064 / 3x10^-5
Table 6.19 Time spent by the learning process of fuzzy rules for object and speed classification.

Time spent by the learning process
Process                                        Examples of the training set   Time spent in learning (seconds)   Number of fired rules
Learning of rules for object classification    4880                           1.921                              56
Learning of rules for speed classification     2680                           1.610                              15
Although the success rates are often high (between 91% and 99%), it is important to take into account that the duration of the test videos ranges from 2'30" to 3'. This implies that future modifications are still needed to reduce the number of alarm activations, above all for longer analysis times. Another issue to be addressed is the scalability of the system when the complexity of the monitored environment increases. The typical case study is the analysis of a
high number of moving objects simultaneously. To deal with this kind of situation, the discussed components can be managed by software agents, which are deployed through the agent platform discussed in [32]. This alternative involves two main advantages regarding scalability: i) the knowledge of the surveillance components can be managed in an autonomous way by the agents, and ii) an agent responsible for a particular surveillance component can be replicated on multiple computers to distribute the workload when monitoring becomes complex. For instance, an agent in charge of analysing trajectories can be replicated n times in m different computers in a way that is transparent to the rest of the agents. To quantitatively prove the scalability of the proposed system in this respect, Table 6.20 shows the time (measured in seconds) spent when monitoring scenario 1 of Figure 6.1 under different replication schemes and varying the number of analysed situations per second (Ev/s). Column PATP involves the preprocessing agent (PA) and represents the processing time related to the object classification and location tasks; TATC and SATC represent the communication times between PA and the agents that monitor trajectories (TA) and speed (SA), respectively; TATP and SATP represent the processing times spent by the agents that analyse such surveillance components (trajectory analysis and speed analysis); ENATC and ENATP represent the communication and processing times, respectively, spent by the agent responsible for the global analysis. The set of experiments of Table 6.20 was executed over a network composed of three different computers, so that the workload of analysing the two surveillance concepts can be distributed. The main conclusion obtained when studying the results shown in this table is that the system scales well when the rate of analysed video events per second increases, that is, when the number of monitored moving objects per second is higher.

Table 6.20 Set of conducted experiments varying the rate of fps (video events analysed per second) and the computing node location (L = localhost; LR = localhost with replication; D = distributed).
Test   T    Ev/s   PATP      TATC     TATP    SATC     SATP     ENATC    ENATP
1      L    10     242.12    4.09     1.53    4.07     0.92     6.89     3.41
2      L    15     234.62    4.84     1.47    4.93     0.95     6.90     3.16
3      L    20     234.04    5.72     1.34    5.70     0.88     6.82     3.20
4      L    25     253.56    7.29     1.22    7.11     0.73     6.77     3.22
5      LR   10     340.90    5.80     1.34    5.83     1.50     7.38     3.89
6      LR   15     371.67    8.05     1.88    6.71     1.69     8.13     4.05
7      LR   20     394.70    13.10    1.79    12.35    14.81    8.37     4.21
8      LR   25     427.88    15.86    2.65    16.81    19.01    8.33     3.92
9      D    10     209.19    16.751   5.939   14.449   5.817    23.623   13.557
10     D    15     192.994   16.429   6.521   13.632   6.177    24.975   14.199
11     D    20     185.819   16.053   6.62    14.154   6.786    17.775   13.657
12     D    25     189.011   20.903   6.731   18.284   6.293    22.730   13.484
increasing number of normality components, since the agents themselves are replicated even though they monitor a common event of interest (e.g. trajectories). On the other hand, the goodness of the obtained results is also due to the robustness of the system against errors made by the low-level layers. To conclude this section, some of these errors and the solutions adopted to reduce the number of false positives and false negatives in event and behaviour understanding are discussed:

• The object classifier maintains a persistent register of the different classifications of each object from the moment it is first detected until it disappears; that is, the classification of an object does not depend only on the last analysed situation. In this way, if an object has been correctly classified over a long period of time, a misclassification at particular moments does not cause classification errors (a minimal sketch of such a register is given after this list).
• The object classification is based on the object dimensions, movements and the regions where it is located, without taking into account their size. Therefore, even if the ellipse that wraps an object in the tracking process is not perfect, the classification algorithm will work correctly.
• The developed surveillance system also manages a persistent register of the regions covered by each object, together with the maximum degree of membership to each one of them. In this way, occasional segmentation errors and wrong ellipses do not affect the degree of satisfaction of the spatial constraints. In other words, the system believes that the object follows the sequence of regions that form the normal trajectory correctly, even though there are certain deviations at particular moments.
• Abrupt errors in the ellipse positioning caused by partial occlusions produced by static environmental elements do not cause errors in the speed estimation. This is because the proposed learning algorithm is able to learn the regions where this phenomenon takes place.
• The regions that compose the environment are classified as entry/exit areas, where objects appear and disappear, and intermediate areas. Occlusions might cause the system to lose the reference of an object and assign a new identifier to a previously detected object. The trajectory definition through the proposed formal model allows trajectories to be defined whose sequence of regions is only composed of intermediate areas. In this way, when an object is detected again in an intermediate area after an occlusion, the system is still able to assign normal trajectories to it.
• The global normality value for a particular object depends on the combination of the analyses carried out by each one of the normality components. In this way, even if the evaluation of one component is not right, the final normality degree might be correct.
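The persistent classification register mentioned in the first item above is not described in detail in the chapter; the following is only a minimal sketch, assuming a simple majority vote over the classification history of each tracked object.

```python
# Hedged sketch of a persistent classification register: the class assigned to a
# tracked object is the majority vote over its history, so occasional
# misclassifications do not change the final label. Names are illustrative.
from collections import Counter, defaultdict

class ClassificationRegister:
    def __init__(self):
        self.history = defaultdict(Counter)     # object id -> class -> votes

    def update(self, obj_id, frame_class):
        self.history[obj_id][frame_class] += 1

    def current_class(self, obj_id):
        votes = self.history[obj_id]
        return votes.most_common(1)[0][0] if votes else None

register = ClassificationRegister()
for label in ["vehicle"] * 40 + ["pedestrian"] * 3:    # 3 noisy frames
    register.update(7, label)
print(register.current_class(7))                       # still "vehicle"
```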
6.6 Conclusions

Since their first developments, traditional video surveillance systems have been designed to monitor environments. However, these systems have several limitations when trying to
satisfy the security demands posed by society. This need, together with the arrival of new technologies and the price reduction of dedicated security hardware, represents some of the relevant reasons why intelligent surveillance is currently one of the hot research topics. One of the main challenges in this field is to provide security expert systems with the autonomy and ability required to automatically understand events and analyse behaviours, in order to improve the productivity and effectiveness of surveillance tasks. In complex environments where multiple situations take place at the same time, human agents cannot deal with all of them. Artificial expert systems, on the other hand, do not have these limitations, due to their processing capabilities. Furthermore, artificial systems are not affected by factors such as fatigue or tiredness, and they can be more effective than people when recognising certain kinds of events, such as the detection of suspicious or unattended objects.

In the last few years, multiple second and third generation surveillance systems have been developed, both in the commercial and academic fields. Most of them have been designed to solve particular problems in specific scenarios. These systems provide several advantages over the traditional ones, but two goals need to be reached in order to maintain this progress: i) first, the design of scalable surveillance systems that allow new modules to be included to increase the analysis capacity, and whose output can be aggregated to obtain a global view of the environment state; ii) secondly, there is a need for higher flexibility, so that the behaviour of the artificial system can be varied depending on the requirements of the analysed environment and the kinds of monitored objects. The way of monitoring the behaviour of a particular object may change in relation to the rest of the objects. Depending on the kind of objects that appear in the scenario and the security level established, the system must provide the flexibility to configure the existing analysis modules, so that a system designed in a global way can be adapted to particular environments.

This work proposes a formal model for the design of scalable and flexible surveillance systems. This model is based on the definition of the normality of events and behaviours by means of a set of independent components, which can be instantiated in particular monitored environments. These modules, denoted normality components, specify how each kind of object should ideally behave depending on the aspect to watch, such as the trajectory it follows or its speed. The model also allows the components employed for each kind of object in each environment to be specified, increasing the flexibility. In addition, the output of these components is combined by means of aggregation operators to obtain a global view of the behaviour of each single object and of the whole environment. The integration of new normality components does not imply modifications to the rest, increasing the analysis capacity of the artificial system.

Two normality components, trajectory and speed analysis, have been defined by means of this model. Both make use of the spatial information gathered by the security cameras, previously processed by modules of the low-level layers. Although the analysis of trajectories and speed has been widely studied by other researchers, most of the existing approaches are based on the analysis of spatial and temporal information. This information may be enough to recognise the route followed by
objects or to estimate their speed, but it is not enough for surveillance systems that need to deal with a higher number of factors. The proposed model defines the normality of a concept using a set of fuzzy constraints, which makes it possible to develop more complete and suitable definitions for surveillance systems. In this way, the system also provides flexibility in the internal analysis performed by the normality components, since multiple constraints can easily be added or removed. The use of fuzzy logic not only deals with the uncertainty and vagueness of low-level information, but also allows us to represent the expert knowledge used by the surveillance system with high interpretability by means of linguistic labels, providing it with the expressiveness required to justify the decision making. Precisely in order to facilitate the knowledge acquisition and the deployment of component instances in particular environments, different knowledge acquisition tools and machine learning algorithms have been associated with the normality components.

Finally, the experimental results prove the feasibility of the proposed model and the designed normality components, both in relation to the understanding of events and behaviours and to the response time needed to provide such results, two key aspects in surveillance systems. However, the system still needs to be improved by reducing the number of false alarms, in order to minimise the dependence on human operators. One of these improvements may involve the use of fusion techniques in the low-level layers, in addition to the combination of the output of the normality components. In this way, a component may deal with the information gathered by multiple sensors about a particular object. The management of redundant information may improve the classification and understanding results when the data provided by a single source is not reliable. Another future research line is to develop new normality components that extend the analysis capabilities of the system. Finally, the development of components that analyse and detect specific anomalous situations will also be addressed, so that their output can be combined with that of the normality modules to improve and complete the analysis of the environment state.

Acknowledgements. This work has been funded by the Regional Government of Castilla-La Mancha under the Research Project PII1C09-0137-6488 and by the Ministry of Science and Innovation under the Research Project TIN2009-14538-C02-02 (FEDER).
References

[1] OpenCV video surveillance, http://opencv.willowgarage.com/wiki/VideoSurveillance
[2] Aguilera, J., Wildernauer, H., Kampel, M., Borg, M., Thirde, D., Ferryman, J.: Evaluation of motion segmentation quality for aircraft activity surveillance. In: IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 293–300 (2005)
[3] Albusac, J., Castro-Schez, J.J., López-López, L.M., Vallejo, D., Jimenez, L.: A Supervised Learning Approach to Automate the Acquisition of Knowledge in Surveillance Systems. Signal Processing, Special Issue on Visual Information Analysis for Security 89(12), 2400–2414 (2009)
[4] Allen, J., Ferguson, G.: Actions and Events in Interval Temporal Logic. Journal of Logic and Computation 4(5), 531–579 (1994) [5] Blauensteiner, P., Kampel, M.: Visual Surveillance of an Airports Apron-An Overview of the AVITRACK Project. In: Annual Workshop of AAPR, Digital Imaging in Media and Education, pp. 1–8 (2004) [6] Bloisi, D., Iocchi, L., Bloisi, D., Iocchi, L., Remagnino, P., Monekosso, D.N.: ARGOS– A Video Surveillance System for Boat Traffic Monitoring in Venice. International Journal of Pattern Recognition and Artificial Intelligence 23(7), 1477–1502 (2009) [7] Buxton, H.: Learning and understanding dynamic scene activity: a review. Image and Vision Computing 21(1), 125–136 (2003) [8] Carter, N., Young, D., Ferryman, J.: A combined Bayesian Markovian approach for behaviour recognition. In: Proceedings of the 18th International Conference on Pattern Recognition, pp. 761 –764 (2006) [9] Cathey, F., Dailey, D.: A novel technique to dynamically measure vehicle speed using uncalibrated roadway cameras. In: IEEE Intelligent Vehicles Symposium, pp. 777–782 (2005) [10] Chen, T., Haussecker, H., Bovyrin, A., Belenov, R., Rodyushkin, K., Kuranov, A., Eruhimov, V.: Computer vision workload analysis: Case study of video surveillance systems. Intel Technology Journal 9(2), 109–118 (2005) [11] Cho, Y., Rice, J.: Estimating velocity fields on a freeway from low-resolution videos. IEEE Transactions on Intelligent Transportation Systems 7(4), 463–469 (2006) [12] Collins, R., Lipton, A., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., et al.: A System for video surveillance and monitoring (Technical report CMU-RI-TR-00-12). Tech. rep., Robotics Institute, Carnegie Mellon University (2000) [13] Dee, H., Velastin, S.: How close are we to solving the problem of automated visual surveillance? Machine Vision and Applications 19(5), 329–343 (2008) [14] Fawcett, T.: Roc graphs: Notes and practical considerations for data mining researchers (Technical report hpl-2003-4). Tech. rep., HP Laboratories, Palo Alto, CA, USA (2003) [15] Franklin, W.: Pnpoly - point inclusion in polygon test (2006), http://www.ecse.rpi.edu/Homepages/wrf/Research/ ShortNotes/pnpoly.html [16] Haritaoglu, I., Harwood, D., Davis, L.: W 4 : Real-Time Surveillance of People and Their Activities. IEEE Transactions on Patter Analysis and Machine Intelligence 22(8), 809– 830 (2000) [17] Johnson, N., Hogg, D.: Learning the distribution of object trajectories for event recognition. Image and Vision Computing 14(8), 609–615 (1996) [18] Khoudour, L., Deparis, J., Bruyelle, J., Cabestaing, F., Aubert, D., Bouchafa, S., Velastin, S., Vincencio-Silva, M., Wherett, M.: Project CROMATICA. In: Del Bimbo, A. (ed.) ICIAP 1997. LNCS, vol. 1311, pp. 757–764. Springer, Heidelberg (1997) [19] Lin, L., Gong, H., Li, L., Wang, L.: Semantic event representation and recognition using syntactic attribute graph grammar. Pattern Recognition Letters 30(2), 180–186 (2009) [20] Maduro, C., Batista, K., Peixoto, P., Batista, J.: Estimation of vehicle velocity and traffic intensity using rectified images. In: Proceedings of the 15th IEEE International Conference on Image Processing (ICIP 2008), pp. 777–780 (2008) [21] Magee, D.: Tracking multiple vehicles using foreground, background and motion models. Image and Vision Computing 22(2), 143–155 (2004)
[22] Makris, D., Ellis, T.: Learning semantic scene models from observing activity in visual surveillance. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35(3), 397–408 (2005) [23] Morris, B., Trivedi, M.: A survey of vision-based trajectory learning and analysis for surveillance. IEEE Transactions on Circuits and Systems for Video Technology 18(8), 1114–1127 (2008) [24] Palaio, H., Maduro, C., Batista, K., Batista, J.: Ground plane velocity estimation embedding rectification on a particle filter multi-target tracking. In: Proceedings of the 2009 IEEE International Conference on Robotics and Automation, pp. 2717–27722 (2009) [25] Piciarelli, C., Foresti, G.: On-line trajectory clustering for anomalous events detection. Pattern Recognition Letters 27(15), 1835–1842 (2006) [26] Remagnino, P., Velastin, S., Foresti, G., Trivedi, M.: Novel concepts and challenges for the next generation of video surveillance systems. Machine Vision and Applications 18(3), 135–137 (2007) [27] Shimrat, M.: Algorithm 112: Position of point relative to polygon. Communications of the ACM 5(8), 434 (1962) [28] Siebel, N., Maybank, S.: Ground plane velocity estimation embedding rectification on a particle filter multi-target tracking. In: Proceedings of the ECCV Workshop Applications of Computer Vision, pp. 103–111 (2004) [29] Smith, G.: Behind the screens: Examining constructions of deviance and informal practices among cctv control room operators in the UK. Communications of the ACM 2(2), 376–395 (2004) [30] Sridhar, M., Cohn, A., Hogg, D.: Unsupervised Learning of Event Classes from Video. In: Proc. AAAI, AAAI Press, Menlo Park (to appear, 2010) [31] Valera, M., Velastin, S.: Intelligent distributed surveillance systems: a review. IEE Proceedings-Vision, Image and Signal Processing 152(2), 192–204 (2005) [32] Vallejo, D., Albusac, J., Mateos, J., Glez-Morcillo, C., Jimenez, L.: A modern approach to multiagent development. Journal of Systems and Software 83(3), 467–484 (2009) [33] Velastin, S., Khoudour, L., Lo, B., Sun, J., Vicencio-Silva, M.: PRISMATICA: a multisensor surveillance system for public transport networks. In: 12th IEE International Conference on Road Transport Information and Control, pp. 19–25 (2004) [34] Yager, R.: On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man and Cybernetics 18(1), 183–190 (1988) [35] Yager, R.: Families of OWA operators. Fuzzy sets and systems 59(2), 125–148 (1993) [36] Zadeh, L.: Fuzzy sets. Information and Control 8(3), 338–353 (1965) [37] Zadeh, L.: From computing with numbers to computing with words - from manipulation of measurements to manipulation of perceptions. Circuits and Systems I: Fundamental Theory and Applications 46(1), 105–119 (1999)
Chapter 7
Distributed Camera Overlap Estimation – Enabling Large Scale Surveillance

Anton van den Hengel, Anthony Dick, Henry Detmold, Alex Cichowski, Christopher Madden, and Rhys Hill
Abstract. A key enabler for the construction of large-scale intelligent surveillance systems is the accurate estimation of activity topology graphs. An activity topology graph describes the relationships between the fields of view of the cameras in a surveillance network. An accurate activity topology estimate allows higher-level processing such as network-wide tracking to be localised within neighbourhoods defined by the topology, and thus to scale. The camera overlap graph is an important special case of the general activity topology, in which edges represent overlap between cameras’ fields of view. We describe a family of pairwise occupancy overlap estimators, which are the only approaches proven to scale to networks with thousands of cameras. A distributed implementation is described, which enables the estimator to scale beyond the limits achievable by centralised implementations, and supports growth of the network whilst it remains online. Formulae are derived to describe the memory and network bandwidth requirements of the distributed implementation, which are verified by empirical results. Finally, the efficacy of the overlap estimators is demonstrated using results from their application in higher-level processing, specifically to network-wide tracking, which becomes feasible within the topology oriented architecture.
Anton van den Hengel · Anthony Dick · Henry Detmold · Alex Cichowski · Christopher Madden · Rhys Hill
University of Adelaide, Australian Centre for Visual Technologies, Adelaide SA 5007, Australia
E-mail: {anton,ard,henry,alexc,cmadden,rhys}@cs.adelaide.edu.au
http://www.acvt.com.au/research/surveillance/

7.1 Introduction

Video surveillance networks are increasing in scale: installations of 50,000-camera surveillance networks are now being reported [6], the Singapore public transport authority operates a network of over 6,000 cameras [13], and networks of more than 100 cameras are commonplace. Even for networks of less than one hundred
cameras, human operators require assistance from software to make sense of the vast amounts of data in video streams, whether to monitor ongoing activity, or to search through archives to analyse a specific event. Computer vision research has made significant progress in automating processing on the very small scale (see [11] for a survey), but there has been less progress in scaling these techniques to the much larger networks now being deployed. In particular, there has been little progress in transforming single system (centralised) approaches into scalable approaches based on distributed processing.

One promising approach to tackling large surveillance networks is to identify a core set of network-wide common services, which are required by many visual processing approaches, and focus research effort on providing these services on large networks. Prominent among such services is estimation of activity topology. The activity topology of a surveillance network is a graph describing the spatial and temporal relationships between the fields of view of the network's cameras. An important special case of activity topology is the camera overlap graph, where edges link cameras having commonality in their fields of view. An accurate and up-to-date activity topology estimate supports reasoning about events that span multiple cameras. In particular, the camera overlap special case of activity topology supports efficient solution of the camera handover problem, which is concerned with the continuation of visual processing (e.g. tracking) when a target leaves one camera's field of view and needs to be resumed using data from other cameras (i.e. those adjacent in the topology). Furthermore, the identification of connected subgraphs within the overall topology provides a means of partitioning high-level visual processing, such that processing relating to a given set of inter-related cameras takes place within a partition dedicated to that set, without dependence on the rest of the network.

Several estimators of camera overlap have been developed based on pairwise occupancy. In fact there is a family of approaches [14], of which the exclusion approach [15] is the first example. These approaches have two desirable properties as implementations of camera overlap estimation. First of all, they produce overlap estimates of sufficient accuracy (precision and recall) to be useful in tracking [5]. Secondly, they have a proven ability to provide on-line estimation for networks of up to 1,000 cameras [7], whereas no other approach has demonstrated the ability to scale beyond 100 cameras. Initial implementations of these approaches, including exclusion [7], require a central server component. The scale of the surveillance systems these implementations can support is limited by the physical memory on this central server. The practical consequence is that systems with more than a few thousand cameras require specialised equipment for the central server, making implementation prohibitively expensive.

This chapter describes a distributed implementation of activity topology estimation by exclusion [8], and shows how it generalises to the entire family of pairwise occupancy estimators of camera overlap. The distributed implementation overcomes previous implementations' dependence on a single central server and thus enables the use of a cluster of commodity servers as a much more cost-effective approach to estimation of camera overlap on large surveillance networks. Results comparing
partitioned and non-partitioned exclusion demonstrate that the advantages of partitioning outweigh the costs. The partitioning scheme used in the distributed implementation enables partitions to execute independently. This both enhances performance (through increased parallelism) and, more importantly, permits new partitions to be added without affecting existing partitions. As a result, the camera overlap estimation sub-system can grow in capacity as the number of cameras increases, whilst remaining on-line 24 × 7. Formulae are derived to model quantitative aspects of the distributed implementation, namely its memory and network bandwidth requirements, and these are verified by empirical results. Finally, to demonstrate the utility of the approach in real applications, this chapter reports precision-recall results for a tracking application built on top of a pairwise occupancy estimator of camera overlap. These results demonstrate that pairwise occupancy estimators produce camera overlap estimates of sufficient accuracy to support efficient camera handover, which is critical to network-wide tracking of targets in general. Furthermore, handover supports localisation of tracking processes around targets' current loci of activity, enabling tracking computations to be partitioned, and thus to scale with the system size. Such tracking of targets across cameras forms the basis of many surveillance tasks, as it allows the system to build up a broader analysis of target behaviour, such that aberrant behaviours or target motions can be identified, or a target can be followed through to its current location within the area under surveillance.
7.2 Previous Work

Activity topology has typically been learned by tracking people as they appear and disappear from camera fields of view (FOVs) over a long period of time. For example, in [18] the delay between the disappearance of each person from one camera and their appearance in another is stored to form a set of histograms describing the transit time between each camera pair. The system is demonstrated on a network of 3 cameras, but does not scale easily as it requires that correspondences between tracks are given during the training phase when topology is learned. Previous work by one of the authors [9] suggests an alternative approach whereby activity topology is represented by a Markov model. This does not require correspondences, but does need to learn O(n^2) transition matrix elements during a training phase and so does not scale well with the number of cameras n, due to the number of observations required for the Markov model. The training phase required in this and similar work is problematic in large networks, chiefly because the camera configuration, and thus activity topology, changes with surprising frequency as cameras are added, removed, moved and fail. Approaches requiring a training phase to complete before operation would have to cease operation each time there is a change, and only resume once re-training has completed. This is an intolerable restriction on the availability of a surveillance network. Instead, on-line automatic approaches, where topology is estimated concurrently with the operation of surveillance, are desirable.
Ellis et al. [10] do not require correspondences or a training phase, instead observing motion over a long period of time and accumulating appearance and disappearance information in a histogram. Instead of recording known correspondences, it records every possible disappearance that could relate to an appearance. Over time, actual transitions are reinforced and extracted from the histogram with a threshold. A variation on this approach is presented in [19], and has been extended by Stauffer [24] and Tieu et al. [26] to include a more rigorous definition of a transition based on statistical significance, and by Gilbert et al. [12] to incorporate a coarse to fine topology estimation. These methods rely on correctly analysing enough data to distinguish true correspondences, and have only been demonstrated on networks of less than 10 cameras. Rahimi et al. [22, 21] perform experiments on configurations of several cameras involving non-overlapping FOVs. One experiment [21] involves calculating the 2D position and 1D orientation of a set of overhead cameras viewing a common ground plane by manually recording the paths followed by people in each camera’s FOV. It is shown that a simple smoothness prior is enough to locate the cameras and reconstruct a path where it was not visible to any camera. In another experiment [22], pre-calibrated cameras are mounted on the walls of a room so that they face horizontally. In this case, the 2D trajectory of a person is recovered as they move around in the room, even when the person is not visible to any camera. On a larger scale, Brand et al. [3] consider the problem of locating hundreds of cameras distributed about an urban landscape. Their method relies on having internal camera calibration and accurate orientation information for each camera, and enough cameras viewing common features to constrain the solution. Given this information the method can localise the cameras accurately due to the constraints imposed by viewing common scene points from different viewpoints. The patented approach of Buehler [4] appears to scale to networks of about 100 cameras. It is based on statistical correlation operators (lift and correlation coefficient) in pairwise occupancy data, and as such is quite close to instantiations of our approach based on an inclusion loss function (i.e. a subset of the possible instantiations). The use of the lift operator within our framework is discussed subsequently. Finally, it should be noted that instantiations of our approach based on an exclusion loss function are complementary to most other approaches. Other approaches (except [19]), accumulate positive evidence of overlap, whereas exclusion (and [19]) accumulates negative evidence ruling out overlap. The two can be composed, for example, exclusion can be used to prune the search space, enabling one or more positive approaches to operate efficiently on that part of the space that is not ruled out.
7.3 Activity Topology and Camera Overlap

An estimate of the activity topology of a surveillance network makes feasible a number of processes critical within on-line video surveillance. The nodes of the activity topology graph are the fields of view of individual cameras, or alternatively regions within those fields of view. Each such region is labelled a cell and denoted ci, with an
example of a 12x9 cell division of a camera view given in Figure 7.1. The edges of the graph represent the connections between cells, hence dividing the camera views into smaller cells can allow for a finer spatial resolution of connections within the activity graph. These connections may be used to represent the overlap of cells between cameras or, by including timing information, to describe the movement of targets through the graph. Overlap is an important special case of the more general notion of topology, as it provides a method to efficiently support camera handover, and thus subsequent tasks such as tracking targets across multiple cameras; hence this work focuses upon this special case.
Fig. 7.1 A sample camera view split into 12x9 rectangular cell regions.
7.3.1 Formulation

The activity topology graph is defined as follows:

1. Edges are directed, such that (ci, cj) represents the flow from ci to cj, whereas (cj, ci) represents the (distinct) flow from cj to ci. Directed edges can be converted to undirected edges if required, but the exclusion algorithm estimates each direction independently and thus we retain this information.
2. Each edge has a set of labels, p_{i,j}^{[a,b]}, for various time delay intervals [a, b], each giving the probability that activity leaving ci arrives at cj after a delay between a and b. In this work, each edge has exactly one such label, that for [−ε, ε], where ε is some small value large enough to account for clock skew between cameras. Thus p_{i,j}^{[−ε,ε]} describes overlap between cameras.

Actual activity topologies are constrained by building layout, camera placement and other factors. Typical topologies contain sub-graphs with many edges between the
nodes within the same sub-graph and few edges between nodes within different subgraphs. For some non-overlapping cameras links in the activity topology may only occur with significant time delay, and are not considered here. Instead this work focuses upon the subset of overlap topologies, where time padding allows for clock skew within the system. These nearly isolated sub-graphs are termed zones within the activity topology. Figure 7.2 shows a recovered activity topology for a network of over a hundred cameras, with zones represented by circles.
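A minimal sketch of how such an edge-labelled topology graph might be represented is given below. The class, field names and the example thresholds are illustrative assumptions; the chapter does not prescribe a data structure.

```python
# Hedged sketch of an activity topology graph: directed edges between cells
# carry probability labels indexed by a delay interval [a, b]; the overlap
# sub-graph is the set of edges whose [-epsilon, epsilon] label is strong.

class ActivityTopology:
    def __init__(self):
        self.edges = {}   # (ci, cj) -> {(a, b): probability}

    def set_label(self, ci, cj, interval, probability):
        self.edges.setdefault((ci, cj), {})[interval] = probability

    def overlap_edges(self, epsilon=0.04, threshold=0.5):
        """Edges whose [-epsilon, epsilon] label exceeds a threshold, i.e. camera overlap."""
        return [(ci, cj) for (ci, cj), labels in self.edges.items()
                if labels.get((-epsilon, epsilon), 0.0) >= threshold]

topology = ActivityTopology()
topology.set_label("cam3/cell17", "cam8/cell02", (-0.04, 0.04), 0.81)
topology.set_label("cam3/cell17", "cam5/cell40", (30.0, 60.0), 0.25)  # delayed link, not overlap
print(topology.overlap_edges())
```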
Fig. 7.2 Estimated activity topology for a real camera network. Edges linking cameras are shown as coloured lines, while zones of highly connected cameras are pictured as circles. Singletons are omitted.
7.4 Estimating Camera Overlap

Consider a set of n cameras that generates n images at time t. By applying foreground detection [25] to all images we obtain a set of foreground blobs, each of which can be summarised by an image position and camera index. Each image is partitioned into cells, and each cell can be labelled "occupied" or "unoccupied" depending on whether it contains a foreground object. Our framework for pairwise occupancy based estimation of camera overlap processes data in a pipeline with three stages, as follows:

1. Joint sampling
2. Measurement
3. Discrimination
Since the processing takes the form of a pipeline, it continues indefinitely as long as new video inputs are available, leading to the revision of results produced in the discrimination stage. In the joint sampling stage, occupancy states are sampled over time, jointly for possible cell pairs, enabling the estimation of a number of joint probabilities of interest, such as Pr(cell i occupied, cell j occupied). The configuration of the joint sampling stage for a given overlap estimator is fully defined by the sample streams used for the left hand side (X) and right hand side (Y) of each joint sample. These sample streams are termed cell systems in our approach, and since there must be two (perhaps the same), we define the configuration of the joint sampling stage as a cell system pair. In the second stage, measurement, various measures are applied based on the sampled information to determine which cell pairs are likely to be overlapping. The measures presented here include:

• I(X; Y) – mutual information [23].
• H(X|Y) – conditional entropy [23].
• The lift operator used in the patented approach of Intellivid [4].
• Pr(X = 1|Y = 1) – conditional probability, used within the exclusion approach [15].
There are, of course, various other possibilities. The configuration of the measurement stage is fully defined by the measure chosen.

In the final stage, discrimination, we define a decision threshold, which in combination with the measure in the previous stage constitutes a discriminant function [2]. That is, each cell pair can be classified as overlapping or not by comparing the value of the measure to the threshold. The value of the threshold is critical. For a given loss function, it is possible to derive an appropriate threshold based on assumptions of the likely occupancy rate of each cell, and the error rate of the occupancy detector. However, in practice we have found these values to be so variable that it is futile to attempt to estimate them a priori. Instead we consider a range of thresholds when evaluating each measure in different scenarios. Specifically, we consider 3 types of loss function and the corresponding range of thresholds that apply:

• minimisation – a loss function that minimises the probability of misclassification.
• inclusion – in which the penalty for concluding that two cells overlap when they do not is higher than the reverse, and we effectively ignore all overlaps until we have sufficient evidence for inclusion in the overlap graph.
• exclusion – where the penalty for deciding on non-overlap in the case where there is overlap is higher, and we effectively assume all overlaps until we have sufficient evidence for exclusion from the overlap graph.

The configuration of the discrimination stage is fully defined by the loss function.
To summarise, any pairwise occupancy overlap estimator can be defined within our framework by a three-tuple:

⟨CellSystemPair, Measure, LossFunction⟩

In this work, the focus is on the performance of techniques to detect camera overlap in large surveillance networks. Note, however, that the technique is not limited to this special case of activity-based camera overlap, but can also be applied to the general case (connections between non-overlapping cameras) through the use of varying time offsets in the operands to the estimation techniques. Future work will evaluate this scenario.
7.4.1 Joint Sampling

The joint sampling stage includes sampling individual occupancies, storage of sufficient data to calculate the probabilities needed for measurement, and the calculation of those probabilities.

7.4.1.1 Sampling Cell Occupancy
A cell is defined as an arbitrary region of image-space within a camera’s field of view. A cell can be in one of three states at any given time: occupied (O), unoccupied (U), or unknown (X). A fourth pseudo-state of being present (P) is defined as the cell being in either of the occupied or unoccupied states, but not the unknown state. A cell is considered occupied when a foreground target is currently within it, and unoccupied if no target is within it. A cell enters the unknown state when no data from the relevant camera is available. Thus, for a given camera all cells are always either unknown or present simultaneously. This limited number of states does not include other object or background feature information to limit the processing required for comparisons. By keeping track of unknown cell states, overlap estimation can be made robust to camera outages, with periods of camera inactivity not adversely affecting the information that can be derived from the active cells. Next we define a cell system to be a set of cells in a surveillance network. A simple cell system could be formed, for example, by dividing the image space of each camera into a regular grid, and taking the cell system to be the collection of all cells of every such grid. Cell systems are an important abstraction within our implementation, as they enable flexibility in specifying what image-space regions are of interest, and the detail to which overlap is to be discerned. Choice of cell systems involves trade-offs between accuracy, performance, and memory requirements of the resulting system. The number of cells affects the memory requirements of the system, and the pattern and density of cells can affect the accuracy of information in the derived overlap topology. Now, joint sampling involves the coordinated consideration of two occupancy signals, the left hand signal and the right hand signal, at the same time point. These
two signals are sampled from separate and possibly different cell systems; thus the joint sampling element of the three-tuple describing the overall estimator is itself a pair:

⟨LHSCellSystem, RHSCellSystem⟩

Further, the class of grid-based cell systems can be specified as five-tuples ⟨rx, ry, px, py, pt⟩ with:

• rx, ry – the number of cells per camera in the x and y dimensions.
• px, py – the spatial padding in cells: a cell is considered occupied if there is an occupied cell within the same camera that lies within this allowed tolerance of cells.
• pt – the temporal padding in seconds: a cell is considered occupied at a given frame if it is occupied within this time allowance of the specified frame.

In most of our work (i.e. the classic exclusion approach), we have used the following cell system pair:

⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 1, 1, 1⟩

There is a myriad of interesting possibilities. For example, setting the x and y resolutions both to 1 in one of the two cell systems can dramatically reduce the memory required for implementation whilst only slightly reducing the resolution of the overlap estimate. At the other extreme, setting both resolutions to the pixel resolutions of the camera increases overlap resolution, but at a cost of greater memory and CPU requirements.
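A minimal sketch of a grid-based cell system with spatial and temporal padding is given below, under assumed names; the chapter does not give an implementation, and the event-list representation is for illustration only (a production system would use dense per-frame bitmaps).

```python
# Hedged sketch of a grid cell system <rx, ry, px, py, pt>: raw per-frame
# occupancy is recorded on an rx-by-ry grid, and a cell is reported occupied
# if any cell within the spatial padding (px, py) was occupied within the
# temporal padding pt.

class GridCellSystem:
    def __init__(self, rx, ry, px=0, py=0, pt=0.0):
        self.rx, self.ry, self.px, self.py, self.pt = rx, ry, px, py, pt
        self.events = []   # (timestamp, cx, cy) of raw occupied cells

    def record(self, t, x, y, width, height):
        """Record a foreground blob centred at pixel (x, y) of a width-by-height image."""
        cx = min(int(x * self.rx / width), self.rx - 1)
        cy = min(int(y * self.ry / height), self.ry - 1)
        self.events.append((t, cx, cy))

    def occupied(self, t, cx, cy):
        return any(abs(t - te) <= self.pt and
                   abs(cx - xe) <= self.px and abs(cy - ye) <= self.py
                   for te, xe, ye in self.events)

# The classic exclusion pair: an unpadded 12x9 grid against a padded one.
lhs, rhs = GridCellSystem(12, 9), GridCellSystem(12, 9, px=1, py=1, pt=1.0)
```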
7.4.1.2 Storage of Sampled Data
As we are developing a system to operate over large, continuously operating camera networks, it is important to be able to calculate cell occupancy probabilities efficiently and in a scalable fashion. We define c, the total number of cells in the system:

c = n × (rx + ry)    (7.1)

where n is the number of cameras, rx is the number of cells per camera in the left hand side cell system, and ry is the number of cells per camera in the right hand side cell system. Also we define cx, the number of cells in the left hand side cell system:

cx = n × rx    (7.2)

and cy, the number of cells in the right hand side cell system:

cy = n × ry    (7.3)

and of course:

c = cx + cy    (7.4)
Then, all the required quantities can be calculated by maintaining the following counters:

• T – the number of frames for which the network has been operating;
• Xi – the number of frames at which the camera of ci is missing: an n-element vector;
• Oi – the number of frames at which cell ci is occupied: a c-element vector;
• Ui – the number of frames at which cell ci is unoccupied: a c-element vector;
• XXij – the number of frames at which the camera of ci is unavailable and the camera of cj is unavailable: an n × n matrix;
• XOij – the number of frames at which the camera of ci is unavailable and cell cj is occupied: an n × c matrix;
• OXij – the number of frames at which cell ci is occupied and the camera of cj is unavailable: a c × n matrix;
• OOij – the number of frames at which cell ci is occupied and cell cj is occupied: a cx × cy matrix.

Of these, OOij requires by far the largest amount of storage, though XOij and XXij are also O(n^2) in space.

7.4.1.3 Calculation of Probabilities
Using this sampled data, probabilities can be estimated as in the following examples:

Pr(Oi = 1, Oj = 1) ≈ OOij / PPij    (7.5)
Pr(Oj = 1 | Oi = 1) ≈ OOij / OPij    (7.6)

where PPij denotes the number of times ci and cj have been simultaneously present, and OPij denotes the number of times ci has been occupied and cj present. Note that although these counters are not explicitly stored, they can be reconstructed as in the following examples:

UPij = Ui − UXij    (7.7)
OPij = Oi − OXij    (7.8)
PPij = UPij + OPij    (7.9)
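The sketch below illustrates how these counters could be accumulated frame by frame and used in an estimate such as equation (7.6). It is a simplified, dictionary-based illustration under assumed names, not the dense-matrix storage described above.

```python
# Hedged sketch of the joint-sampling counters and the estimate of equation (7.6).
from collections import defaultdict

class JointSampler:
    def __init__(self):
        self.T = 0
        self.O, self.U, self.X = defaultdict(int), defaultdict(int), defaultdict(int)
        self.OO, self.OX = defaultdict(int), defaultdict(int)

    def update(self, occupied, unknown_cameras, cell_camera):
        """occupied: set of occupied cells; unknown_cameras: cameras with no data;
        cell_camera: mapping cell id -> camera id."""
        self.T += 1
        for i, cam in cell_camera.items():
            if cam in unknown_cameras:
                self.X[i] += 1
            elif i in occupied:
                self.O[i] += 1
            else:
                self.U[i] += 1
        for i in occupied:
            for j, cam_j in cell_camera.items():
                if cam_j in unknown_cameras:
                    self.OX[i, j] += 1          # i occupied, camera of j unavailable
                elif j in occupied:
                    self.OO[i, j] += 1          # i occupied, j occupied

    def pr_j_given_i(self, i, j):
        # Equation (7.8): OPij = Oi - OXij reconstructs "i occupied, j present".
        op_ij = self.O[i] - self.OX[i, j]
        return self.OO[i, j] / op_ij if op_ij else None
```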
7.4.2 Measurement and Discrimination

A wide variety of functions of two binary random variables can be used for the measurement stage. This subsection describes several of the more useful such functions. The discrimination stage is dependent on the measurement stage (and in particular on the choice of measure), so we describe appropriate discriminant functions with each measure.
7.4.2.1 Mutual Information Measure
If two cells ci and cj are at least partially overlapping, then Oi and Oj are not independent. Conversely, if cells ci and cj are not overlapping or close to overlapping, we assume that Oi and Oj at each timestep are independent. We can test for independence by calculating the mutual information between these variables:

I(Oi; Oj) = Σ_{oi ∈ Oi, oj ∈ Oj} Pr(oi, oj) log [ Pr(oi, oj) / (Pr(oi) Pr(oj)) ]    (7.10)
I(Oi; Oj) ranges between 0, indicating independence, and H(Oi), indicating that Oj is completely determined by Oi, where H(Oi) is the entropy of Oi:

H(Oi) = − Σ_{oi ∈ Oi} Pr(oi) log Pr(oi)    (7.11)
The hypothesis corresponding to no overlap is I(Oi; Oj) = 0, while the alternative hypothesis, indicating some degree of overlap, is I(Oi; Oj) > 0. Using an inclusion discriminant function, the penalty for false positives (labelling a non-overlapping cell pair as overlapping) is high and therefore an appropriate threshold on I(Oi; Oj) is one that is closer to H(Oi) than to 0. Conversely, for exclusion, the cost is greater for a false negative, and thus the appropriate threshold is closer to 0.
7.4.2.2 Conditional Entropy Measure
Dependency between cell pairs is not necessarily symmetric. For example, if a wide angle camera and a zoomed camera are viewing the same area, then a cell ci in the zoomed camera may occupy a fraction of the area covered by a cell cj in the wide angle camera. In this case, Oi = 1 implies Oj = 1, but the converse is not true. Similarly, Oj = 0 implies Oi = 0, but again the converse is not true. To this end we can measure the conditional entropy of each variable:

H(Oi | Oj) = − Σ_{oi ∈ Oi, oj ∈ Oj} Pr(oi, oj) log Pr(oi | oj)    (7.12)

H(Oj | Oi) = − Σ_{oi ∈ Oi, oj ∈ Oj} Pr(oi, oj) log Pr(oj | oi)    (7.13)
H(Oi | Oj) ranges between H(Oi), indicating independence, and 0, indicating that Oj is completely dependent on Oi. For inclusion, the decision threshold in a discriminant function based on H(Oi | Oj) is set close to H(Oi), requiring strong evidence for overlap to decide in its favour. For exclusion, the decision threshold is set closer to 0.
7.4.2.3 Lift Operator Measure
In some scenarios, occupancy data can be very unbalanced; for example in low traffic areas, O = 1 is far less probable than O = 0. This means that the joint
observation (0, 0) is not in fact strong evidence that this cell pair is related. This can make decisions based on calculation of the mutual information difficult, as the entropy of each variable is already low, and I(Oi; Oj) ranges between 0 (independence) and max(H(Oi), H(Oj)) (complete dependence). One solution is to measure the independence of the events Oi = 1, Oj = 1 rather than the independence of the variables Oi, Oj. This leads to a measure known as lift [4]:

lift(Oi, Oj) = Pr(Oi = 1, Oj = 1) / (Pr(Oi = 1) Pr(Oj = 1))    (7.14)

Lift ranges between 1, indicating independence (non-overlap), and 1/Pr(Oj = 1), indicating that cells i and j are identical. For inclusion, the decision threshold in a discriminant function based on lift is set closer to 1/Pr(Oj = 1) than to 1, to reduce the risk of false overlap detection. For exclusion, the decision threshold is set near to 1, indicating that a cell pair must be nearly independent to be considered non-overlapping.
7.4.2.4 Conditional Probability Measure
Combining the ideas of non-symmetry between cell pairs (from the H(Oi | Oj) measure) with that of measuring dependence only of events rather than variables (from the lift measure), we arrive at a conditional probability measure for cell overlap:

Pr(Oj = 1 | Oi = 1) = Pr(Oi = 1, Oj = 1) / Pr(Oi = 1)    (7.15)

which can be seen to be a non-symmetric version of lift, and also analogous to conditional entropy (being based on the same quantities). Pr(Oj = 1 | Oi = 1) ranges between Pr(Oj = 1), indicating independence, and 1, indicating complete dependence. The conditional probability measure is equivalent to the overlap measure used in the exclusion approach presented in [15]. For inclusion, the decision threshold in the discriminant function should be set close to 1, so that strong evidence is required to label a cell pair as overlapping. For exclusion, the decision threshold is set close to Pr(Oj = 1).
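Both the lift and conditional probability measures reduce to simple ratios of the sampled counts. A sketch follows (Python); the count arguments are assumed to come from joint sampling over frames in which both cameras were present, and the variable names are illustrative.

def lift(n_ij, n_i, n_j, n_frames):
    """Equation 7.14 estimated from counts: n_ij frames with both cells occupied,
    n_i and n_j frames with each cell occupied, n_frames frames observed jointly."""
    p_ij = n_ij / n_frames
    p_i = n_i / n_frames
    p_j = n_j / n_frames
    return p_ij / (p_i * p_j)

def conditional_probability(n_ij, n_i):
    """Equation 7.15: Pr(Oj = 1 | Oi = 1) estimated from counts."""
    return n_ij / n_i

An inclusion-style discriminant would then compare lift against a threshold near its upper bound, or the conditional probability against a threshold near 1, exactly as described in the text.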
7.4.2.5 Definition of Measurement and Discrimination Stage Configurations
Recall the overall configuration of an overlap estimator in the framework is defined by a three-tuple: ⟨CellSystemPair, Measure, LossFunction⟩. The measurement and discrimination stages are defined by the second and third elements respectively. For the Measure element, we specify a function of two abstracted binary random variables X and Y, so a configuration with a mutual information measure would take the form: ⟨CellSystemPair, I(X; Y), LossFunction⟩
For the LossFunction element, we simply choose between the three symbols Inclusion, Exclusion and Minimisation; thus an exclusion estimator using a mutual information measure is: ⟨CellSystemPair, I(X; Y), Exclusion⟩
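In code, such three-tuple configurations can be captured by a small record type. The field names and symbolic measure and loss identifiers below are invented for illustration and do not come from the chapter.

from dataclasses import dataclass

@dataclass(frozen=True)
class CellSystem:
    cells_x: int   # e.g. 12
    cells_y: int   # e.g. 9
    pad_x: int     # spatial padding, in cells
    pad_y: int
    pad_t: int     # temporal padding, in frames

@dataclass(frozen=True)
class EstimatorConfig:
    lhs: CellSystem
    rhs: CellSystem
    measure: str   # e.g. "mutual_information", "lift", "conditional_probability"
    loss: str      # "inclusion", "exclusion" or "minimisation"

# For example, an exclusion estimator over a padded/unpadded cell system pair:
example_config = EstimatorConfig(
    lhs=CellSystem(12, 9, 0, 0, 0),
    rhs=CellSystem(12, 9, 1, 1, 1),
    measure="joint_probability",
    loss="exclusion",
)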
7.4.3 The Original Exclusion Estimator The original exclusion approach [15] was implemented [16, 7] prior to the formulation of this framework, but can easily be expressed within it. Specifically, this estimator uses:
• An LHS cell system with 12 × 9 cells per camera, no spatial padding, and one frame of temporal padding.
• An RHS cell system with 12 × 9 cells per camera, one cell of spatial padding in each direction, and no temporal padding.
• The Pr(X = 1, Y = 1) measure.
• The Exclusion loss function.
This is expressed in the framework notation thus:

⟨⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 1, 1, 1⟩⟩ Pr(X = 1, Y = 1) Exclusion    (7.16)
7.4.4 Accuracy of Pairwise Occupancy Overlap Estimators The accuracy of estimators created using our framework is evaluated in terms of precision-recall of the identified overlap edges, when compared to ground truth. The dataset consists of 26 surveillance cameras placed around an office and laboratory environment as shown in Figure 7.3. Each camera has some degree of overlap with at least one other camera's field of view; ground truth has been obtained by manual inspection. The data were obtained at ten frames per second over a period of approximately four hours. Results are reported for estimators generated from four different configurations of the framework, as follows:
1. An estimator based on the lift measure. This is expressed in the framework as: ⟨⟨12, 9, 0, 0, 1⟩, ⟨12, 9, 0, 0, 1⟩⟩ lift(X, Y) *
2. An estimator similar to the original exclusion approach, based on the conditional probability measure and using asymmetric time padding. This is expressed in the framework as: ⟨⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 0, 0, 1⟩⟩ Pr(X = 1, Y = 1) *
Fig. 7.3 A floor plan showing the 26 camera dataset. The camera positions are shown by circles, with coloured triangles designating each camera's field of view. White areas are open spaces, whilst light blue regions show areas that are opaque to the cameras.
3. An estimator based on the mutual information measure. This is expressed in the framework as: ⟨⟨12, 9, 0, 0, 1⟩, ⟨12, 9, 0, 0, 1⟩⟩ I(X; Y) *
4. An estimator based on the conditional entropy measure. This is expressed in the framework as: ⟨⟨12, 9, 0, 0, 1⟩, ⟨12, 9, 0, 0, 1⟩⟩ H(X|Y) *
For the LossFunction component of these configurations (denoted *), we vary the thresholds between their bounds, and hence vary between extreme inclusion (emphasising precision) and extreme exclusion (emphasising recall). Figure 7.4 shows precision-recall results for the four different estimators applied to the 26 camera dataset as P-R curves. These results demonstrate that using some estimators, a reasonable level of precision can be obtained, even for a relatively high level of recall. The lift and conditional entropy estimators provide very poor precision across the range of recall values. The mutual information-based estimator provides a considerable improvement: it provides overlap information of sufficient accuracy to enable subsequent processes. The conditional probability estimator provides even higher precision up to the point of 80% recall of the ground truth information, though it is outperformed by the mutual information estimator for very high levels of recall. Section 7.6 evaluates the utility of the overlap estimates as input to a tracking system.

Fig. 7.4 The accuracy of the overlap topology results
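The sweep over the LossFunction thresholds that produces such P-R curves can be sketched as follows (Python). The pairwise score matrix and the ground-truth overlap matrix are assumed to be available from the measurement stage and from manual annotation respectively; this is an illustration of the evaluation procedure, not the authors' code.

import numpy as np

def precision_recall_sweep(scores, ground_truth, thresholds):
    """scores[i, j]       - measure value for pair (i, j)
    ground_truth[i, j] - True where the pair genuinely overlaps
    thresholds         - candidate discriminant thresholds, low to high
    Returns a list of (threshold, precision, recall) points."""
    points = []
    for t in thresholds:
        predicted = scores > t
        tp = np.logical_and(predicted, ground_truth).sum()
        fp = np.logical_and(predicted, ~ground_truth).sum()
        fn = np.logical_and(~predicted, ground_truth).sum()
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((float(t), float(precision), float(recall)))
    return points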
7.5 Distribution The key to distribution of the framework is in the partitioning of the data. The distributed implementation is illustrated and evaluated in terms of the original exclusion approach, for which extensive CPU, memory and network bandwidth measurements are available, and which is expressed in the framework as follows: ⟨⟨12, 9, 0, 0, 0⟩, ⟨12, 9, 1, 1, 1⟩⟩ Pr(X = 1, Y = 1) Exclusion
Certain optimisations are possible in cases that have similar cell structure in both the LHS and RHS cell systems. Specifically, we can define r, the number of cells per camera, and then redefine c, the total number of cells, to be:

c = n × r    (7.17)
which is smaller than the general value given in Equation 7.1. The role of the overlap estimation component within an overall surveillance system is to derive activity topology from occupancy. We adopt a layered approach, with an overlap estimation layer that consumes occupancy information from a lower layer (detection pipelines), and produces activity topology information to be consumed by higher layers. This system model is shown in Figure 7.5. The operation of the detection pipeline layer is straightforward. Video data is captured from cameras or from archival storage. This data is processed by background subtraction to obtain foreground masks. The foreground masks are converted into blobs by a connected components algorithm. Finally occupancy is detected at the midpoint of the bottom edge of each blob, which is taken to be the lowest visible extent of the foreground object.
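A possible sketch of one such detection pipeline, using OpenCV, is given below. The choice of MOG2 background subtraction, the blob-area filter and the 12×9 grid mapping are illustrative assumptions rather than the authors' configuration.

import cv2
import numpy as np

def occupancy_from_frame(frame, subtractor, grid=(12, 9), min_area=200):
    """Return a 9x12 boolean occupancy grid for one camera frame."""
    h, w = frame.shape[:2]
    fg = subtractor.apply(frame)                             # background subtraction
    fg = cv2.threshold(fg, 127, 255, cv2.THRESH_BINARY)[1]   # drop shadows/noise
    n, _, stats, _ = cv2.connectedComponentsWithStats(fg)    # foreground blobs
    occupancy = np.zeros((grid[1], grid[0]), dtype=bool)
    for k in range(1, n):                                    # label 0 is background
        if stats[k, cv2.CC_STAT_AREA] < min_area:
            continue
        # Midpoint of the bottom edge of the blob's bounding box, taken as the
        # lowest visible extent of the foreground object.
        x = stats[k, cv2.CC_STAT_LEFT] + stats[k, cv2.CC_STAT_WIDTH] // 2
        y = stats[k, cv2.CC_STAT_TOP] + stats[k, cv2.CC_STAT_HEIGHT]
        col = min(int(x * grid[0] / w), grid[0] - 1)
        row = min(int(y * grid[1] / h), grid[1] - 1)
        occupancy[row, col] = True
    return occupancy

# One subtractor instance is kept per camera stream.
subtractor = cv2.createBackgroundSubtractorMOG2()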
Fig. 7.5 Architecture of the partitioned system model. Cameras feed detection servers, each of which runs detection pipelines for several cameras; each detection server sends occupancy data to a subset of the overlap estimation servers; and each overlap estimation server estimates overlap for a region of an adjacency matrix representing the camera overlap graph.
Our aim is to partition the overlap estimation layer across multiple computers. Each such computer is termed an estimation partition. A detection pipeline for a given camera forwards occupancy data to any estimation partitions requiring data for that camera. For a given cell pair, joint sampling, measurement and discrimination are then all performed on a single estimation partition. In designing the estimation layer, we wish to: 1. Distribute the memory required to store data for cell pairs across multiple (affordable) computers. 2. Avoid communication (and in particular, synchronisation) between estimation partitions, in order to permit processing within each partition to proceed in parallel. 3. Permit the system to grow, through addition of estimation partitions, whilst the existing partitions continue processing and hence the extant system remains online. 4. Quantify the volume of communication required between the estimation partitions and the rest of the surveillance system, and ensure that this requirement remains within acceptable bounds. 5. Keep the size of partitions uniform. Below the overlap estimation layer we have detection pipelines, each performing foreground detection and cell occupancy determination for a single camera. The detection pipelines are independent and can be trivially distributed in any desired fashion. Above the estimation layer, the resulting topology estimates must be made available to higher levels of the surveillance system. The possibilities include: 1. As estimation partitions derive topology estimates they forward significant changes in those estimates to a central database. These changes include both increases in likelihood of an edge in the topology and decreases in likelihood. In the extreme, this includes edges disappearing completely, reflecting changes in activity topology over time, and hence those edges being removed from the central topology database. 2. Option 1, but with the central database replaced with a distributed database. 3. Higher layers obtain topology information by querying the estimation layer partition(s) holding it. In effect, the estimation partitions act as a distributed database. However, the experiments reported here concern only the estimation layer, and not the detection pipelines or any further use of the estimated topology within a surveillance system.
7.5.1 The Exclusion Approach The original exclusion approach requires only the following joint sampling data to operate: • OUij – the number of frames at which cell ci is occupied and cj is unoccupied, i.e. the exclusion count: a c × c matrix.
• OPij – the number of frames at which cell ci is occupied and cj is present (i.e. not unknown), i.e. the exclusion opportunity count: a c × c matrix, compressible to a c × n matrix since all cells within a camera share the same present/unknown state at any given time.
Based on this data, exclusion estimates camera overlap through the evaluation of the Pr(Oj = 1|Oi = 1) measure, which can be approximated as follows:

Pr(Oj = 1 | Oi = 1) ≈ OOij / OPij = (OPij − OUij) / OPij    (7.18)
The overlap estimate is further strengthened by exploitation of the bi-directional nature of overlap; we consider cells ci and cj to overlap only when the following Boolean function is true:

Xij = (Pr(Oi = 1 | Oj = 1) > P*) ∧ (Pr(Oj = 1 | Oi = 1) > P*)    (7.19)
with P ∗ a threshold value. The effect of varying this threshold, in terms of the precision and recall achieved by the estimator, is extensively evaluated in [17].
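Given the OUij and OPij matrices, Equations 7.18 and 7.19 translate into a few lines of array code. The sketch below (Python/NumPy) assumes both matrices are held in memory for the cell pairs of interest; it is an illustration of the measure, not the authors' implementation.

import numpy as np

def exclusion_overlap(OU, OP, p_star):
    """Return a boolean matrix X with X[i, j] = True when cells i and j are
    considered overlapping under Equations 7.18 and 7.19.

    OU[i, j] - frames with cell i occupied and cell j unoccupied
    OP[i, j] - frames with cell i occupied and cell j present
    p_star   - the threshold P* on Pr(Oj = 1 | Oi = 1)"""
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(OP > 0, (OP - OU) / OP, 0.0)   # Equation 7.18
    return (p > p_star) & (p.T > p_star)            # Equation 7.19 (both directions)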
7.5.2 Partitioning of the Joint Sampling Matrices In centralised overlap estimation implementations, all joint sampling data and matrices are maintained in the memory of a centralised processing node. Because these matrices are large and dense (at least in the case of OUij), the memory available on this central node places an overall limit on the size of network that can be supported. For example, an instantiation of exclusion with 108 (12 × 9) cells per camera, 1000 cameras, and 16-bit (2 byte) OUij counts will require (108 × 1000)² × 2 = 23,328,000,000 bytes (or approximately 24GB) to represent OUij. Some optimisation is possible; for example, a previously reported implementation [7] used byte-sized counts and a selective reset procedure (division by two of sections of OUij and OPij, so as to maintain approximately correct Pr(Oj = 1|Oi = 1) ratios) to support 1000 cameras within 12GB. Nevertheless, there are two obstacles to further increases in the scale of systems that can be built: 1. The requirement that all joint sampling matrices be stored in a single server means that the memory (and processing) capacity of that server limits the maximum network size that is feasible. For example, the current limit for easily affordable server hardware is less than 100GB, and only incremental improvements can be expected, so centralised implementations are limited to supporting networks of a few thousand cameras.
2. The requirement for O(n²) memory (however distributed) means that even if it is possible to increase system scale by the addition of more hardware, it becomes increasingly expensive to do so, and at some point it ceases to be feasible.
Both of these challenges need to be overcome; in this work we focus on the first.
7.5.2.1 A Partitioning Scheme
Observe from Equation 7.19 that calculation of overlap (Xij ) for given i and j requires both Pr(Oj = 1|Oi = 1) and Pr(Oi = 1|Oj = 1), and hence (from Equation 7.18) each of OUij , OUji , OPij and OPji . Whilst it would be possible to perform the final overlap calculation separately from the calculation (and storage) of OUij and OPij , we assume that it is not practically useful to do so, which implies that for given i and j, each of OUij , OUji , OPij and OPji must reside in the same partition (so as to avoid inter-partition communication). This constraint drives our partitioning scheme, along with the aims identified previously.
Fig. 7.6 The partitioning scheme for 200 partitions
Figure 7.6 shows partitioning across 200 estimation partitions; each partition contains two distinct square regions of the OUij matrix, such that the required symmetry is obtained. These square regions are termed half partitions. Note that with some measures, such as Mutual Information, the values in OUij and OUji are always the same. Hence in these cases only one half partition need be stored (in the case of half partitions on the diagonal of the matrix, only one half of each half partition need be stored). Thus the overall storage requirement in these cases is halved. Note however that this optimisation is not possible with asymmetric measures such as Conditional Probability and so we always store the whole matrix to maintain generality. The OPij matrix can be partitioned in the same way. However, given that the OPij matrix contains significant redundancy (the OPij values for all j in a given camera are the same), some optimisation is possible. Each of the square regions within Figure 7.6 contains sufficient rows and columns for several whole cameras worth of data. Table 7.1 defines the system parameters for partitioned estimation, with Figure 7.7 illustrating the r and R parameters. Note also:

R = n / √(2N)    (7.20)
relates n, N and R, where N ∈ 2ℕ² (that is, N = 2k² for some positive integer k) and N ≥ 2. Now, for a given (cell) co-ordinate pair (i, j) within OUij we can determine the partition co-ordinates (I, J) of the half partition containing the data for (i, j), as follows:

(I, J) = ( ⌊i / (rR)⌋, ⌊j / (rR)⌋ )    (7.21)

From the partition co-ordinates of a given half partition, (I, J), we determine the partition number of the (whole) partition to which that half-partition belongs, using the following recursively defined partition numbering function:

PN(I, J) = PN(J, I)            if J > I
PN(I, J) = PN(I − 1, J − 1)    if I = J ∧ I mod 2 = 1    (7.22)
PN(I, J) = ⌈I²/2⌉ + J          otherwise
This recursive function gives the partition numbers shown in Figure 7.6. More importantly, it is used within the distributed estimation implementation to locate the partition responsible for a given region of OUij. Detection pipelines producing occupancy data use Equation 7.22 to determine the partitions to which they should send that occupancy data, and clients querying the activity topology may use it to locate the partition that holds the information they seek. The inverse relation maps partition numbers to a set of two half partition co-ordinate pairs. This set, P, for a given partition is:

P = { (I, J), (Ī, J̄) }, where J < J̄    (7.23)
Table 7.1 Partitioned Estimator System Parameters

Parameter  Definition
n          the number of cameras.
N          the number of partitions.
r          the number of cells into which each camera's field of view is divided.
R          the length of a half partition, in terms of the number of distinct whole cameras for which that half partition contains data.
Fig. 7.7 Partitioned system parameter detail: r is the number of cells per camera, and R is the number of cameras per half partition.
The co-ordinate pair (I, J) is termed the upper half-partition, and the pair (Ī, J̄) is termed the lower half-partition; they are distinguished based on the y axis co-ordinate, as shown. Now, the x co-ordinate of the upper half-partition, I, is a function of the partition number, P:

I = ⌊√(2P)⌋    (7.24)

and the y co-ordinate of the upper half-partition, J, is a function of the partition number and the x co-ordinate:
J = P − ⌈I²/2⌉    (7.25)

Combining equations 7.24 and 7.25 yields the following upper half-partition address function:

UHPA(P) = ( ⌊√(2P)⌋, P − ⌈⌊√(2P)⌋² / 2⌉ )    (7.26)

Next, the lower half-partition co-ordinate pair, (Ī, J̄), is a function of the upper half-partition co-ordinate pair, (I, J), as follows:

(Ī, J̄) = (I + 1, J + 1)    if I = J
(Ī, J̄) = (J, I)            otherwise    (7.27)

Combining equations 7.24, 7.25 and 7.27 yields the following lower half-partition address function:

LHPA(P) = ( ⌊√(2P)⌋ + 1, P − ⌈⌊√(2P)⌋² / 2⌉ + 1 )    if ⌊√(2P)⌋ = P − ⌈⌊√(2P)⌋² / 2⌉
LHPA(P) = ( P − ⌈⌊√(2P)⌋² / 2⌉, ⌊√(2P)⌋ )            otherwise    (7.28)

Equations 7.26 and 7.28 define the partition co-ordinates of the two half-partitions corresponding to a given partition number. This is exploited in an implementation strategy whereby partition creation is parameterised by partition number, and this mapping is used to determine the two rectangular regions of OUij to be stored in the partition.
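The numbering and addressing functions translate directly into code. The sketch below follows the reconstruction of Equations 7.22 and 7.24–7.28 given above (including the choice of floors and ceilings), so it should be treated as an interpretation to be checked rather than the authors' reference implementation.

import math

def partition_number(I, J):
    """Equation 7.22: partition number of the half partition at (I, J)."""
    if J > I:
        return partition_number(J, I)
    if I == J and I % 2 == 1:
        return partition_number(I - 1, J - 1)
    return math.ceil(I * I / 2) + J

def upper_half_partition(P):
    """Equation 7.26: co-ordinates (I, J) of the upper half partition."""
    I = math.floor(math.sqrt(2 * P))
    J = P - math.ceil(I * I / 2)
    return I, J

def lower_half_partition(P):
    """Equations 7.27/7.28: co-ordinates of the lower half partition."""
    I, J = upper_half_partition(P)
    return (I + 1, J + 1) if I == J else (J, I)

# Sanity check: both half partitions of every partition map back to its number
# (here for the 18-partition grid of Figure 7.8).
for P in range(18):
    assert partition_number(*upper_half_partition(P)) == P
    assert partition_number(*lower_half_partition(P)) == P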
7.5.2.2 Incremental Expansion
A key property of the partitioning scheme is support for incremental expansion of OUij and hence of the system. As shown in Figure 7.8, new partitions, with higher partition numbers, can be added on the right and bottom borders of the matrix, leaving the existing partitions unchanged in both partition number and content. Since the addition of new partitions is entirely independent of the existing partitions, expansion can occur whilst the system (i.e. the existing partitions) remains on-line. Figure 7.8 shows expansion by two (partition) rows and (partition) columns each time. The matrix must remain square, and hence must grow by the same amount
Fig. 7.8 Expansion from 2 to 8 and then to 18 partitions
in each direction. The implication is that when growth is necessary, a large number of new partitions must be added (not just one at a time). Growth by two rows and columns at a time is the smallest increment that ensures all partitions are exactly the same size. Growth by one row and column can lead to the latest partition on the diagonal being half the size of all the rest (it has only one half partition instead of two). At worst, this leads to under-utilisation of one computing node, and full utilisation will be restored at the next growth increment. It is also worth noting that whilst partitions have to be of fixed size, the mapping between partitions and computing nodes can be virtualised, allowing, for example, more recently added nodes, which are likely to have greater capacity, to be assigned more than a single partition. Now, the number of partitions, N, is expressed in terms of the length, L, of the (square) partition grid:

N = L² / 2    (7.29)

Now suppose that at some point in its lifetime, a surveillance system has N partitions. Growth by two partitions in each direction results in partition grid length L + 2, and hence the number of partitions in the system after growth, N′, is:

N′ = (L + 2)² / 2 = (L² + 4L + 4) / 2 = N + 2L + 2    (7.30)

The growth in the number of partitions is then:

N′ − N = 2L + 2 = 2n/R + 2    (7.31)

i.e. it is linear in the number of cameras in the system prior to growth.
7.5.3 Analysis of Distributed Exclusion Here we describe the expected properties for an implementation of distributed exclusion based on our partitioning approach. Section 7.5.4 evaluates the properties measured for a real implementation against the predictions made here. The properties of interest are:
• Network bandwidth – the input bandwidth for each estimation partition and the aggregate bandwidth between the detection and estimation layers.
• Memory – memory required within each partition.
Specifically, our aim is to relate these properties to the system parameters defined in Tables 7.1 and 7.2. Such relationships enable those engineering a surveillance system to provision enough memory and network hardware to prevent degradation of system performance, due to paging (or worse, memory exhaustion) and contention respectively, thus increasing the probability of the system maintaining continuous availability.

Table 7.2 System Implementation Parameters

Parameter  Definition
f          the number of frames per unit time processed by the estimation partitions.
d          the maximum time for which occupancy data may be buffered prior to processing by the estimation partitions.
b          the size of each OUij count, in bytes.
7.5.3.1 Network Bandwidth Requirements
The input bandwidth required by a partition is determined by the occupancy and padded occupancy data needed in the two half partitions constituting the partition. The occupancy data required in a half partition is that in the x co-ordinate range of OUij which the half-partition represents. Similarly, the padded occupancy data required corresponds to the y-axis range. The size of the range in each case is the length (in cells) of a half-partition, that is:

l = Rr = nr / √(2N)    (7.32)
As with R in Equation 7.20, this is defined only where N ∈ 2ℕ² and N ≥ 2. Now, observe from Figure 7.6 that there are only two configurations of partitions: 1. For partitions on the diagonal of the partition grid, each of the two half-partitions occupies the same co-ordinate range in both x and y axes. 2. For all other partitions, the x co-ordinate range of one half partition is the y co-ordinate range of the other half-partition, and vice versa.
In both cases, a (whole) partition occupies a given (possibly non-contiguous) range in the x dimension and the same range in the y dimension; the total size of these ranges is 2l. Given that padded occupancy, pi, is computed from occupancy, ok, for k in the set of cells including i and its immediate neighbours (within the same camera), all the padded occupancy data needed in a partition can be computed from the occupancy data that is also needed. Therefore the amount of data per frame needed as input into a partition is simply the amount of occupancy data, that is 2l bits. The unpartitioned case (N = 1) is handled separately: the number of inputs per frame is simply nr. With f frames per second this gives the partition's input bandwidth, βP, in bits per second:

βP = nrf                      if N = 1
βP = 2lf = 2nrf / √(2N)       if N ∈ 2ℕ² ∧ N ≥ 2    (7.33)

Now, in practice, the information sent over the network will need to be encoded in some structured form, so the actual bandwidth requirement will be some constant multiple of βP. For N partitions, each on a separate host, the aggregate bandwidth per unit time, βT, in bits per second, is:

βT = nrf                      if N = 1
βT = √(2N) nrf                if N ∈ 2ℕ² ∧ N ≥ 2    (7.34)
7.5.3.2 Memory Requirements
Each estimation partition requires memory for:
• Representation of two half partitions of the OUij matrix.
• Representation of two half partitions of the OPij matrix.
• Buffering occupancy data received from detection servers.
• Other miscellaneous purposes, such as parsing the XML data received from the detection servers.
Globally, the OUij matrix requires one integer count for each pair of camera cells, (i, j). The number of cell pairs is r²n², so the storage required for OUij in each partition is:

μOU = r²n²b / N    (7.35)

The OPij matrix is the same size as OUij. However, for given i, OPij for all j identifying cells within a given camera has the same value (as the cells identified by j are either all available or all unavailable at a given point in time), so the storage required for OPij in each partition is:

μOP = μOU / r = rn²b / N    (7.36)
The occupancy data processed within a given partition is produced by several detection servers. Hence it may be the case that different occupancy data pertaining to a given time point arrives at an estimation partition at different times, and in fact some fraction of data (typically very small) may not arrive at all. To cope with this, occupancy data is buffered in estimation partitions prior to processing. Double buffering is required to permit data to continue to arrive in parallel with processing of buffered data. Each data item requires at least two bits (to represent the absence of data as well as the occupied and unoccupied states). Using one byte per item, the storage required for buffering within a partition is:

μB = 2dβP = 2dnrf             if N = 1
μB = 2dβP = 4dnrf / √(2N)     if N ∈ 2ℕ² ∧ N ≥ 2    (7.37)

Finally, estimation partitions require memory for parsing and connection management and for code and other fixed requirements. The storage required for parsing and connection management is proportional to the number of cameras processed by the partition, whereas the remaining memory is constant, so:

μM = nμP + μC                 if N = 1
μM = (2n / √(2N)) μP + μC     if N ∈ 2ℕ² ∧ N ≥ 2    (7.38)

where μP and μC are constants determined empirically from a given implementation.
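For provisioning purposes, Equations 7.32–7.38 can be wrapped in a small helper. The sketch below simply transcribes the formulae as reconstructed above, with default parameter values taken from Table 7.3, and therefore inherits any errors in that reconstruction.

import math

def partition_requirements(n, N, r=108, f=10, d=2, b=2, mu_P=250e3, mu_C=120e6):
    """Per-partition memory (bytes) and input bandwidth (bits/s) from
    Equations 7.32-7.38; defaults follow Table 7.3."""
    if N == 1:
        beta_P = n * r * f                      # Equation 7.33
        mu_B = 2 * d * n * r * f                # Equation 7.37
        mu_M = n * mu_P + mu_C                  # Equation 7.38
    else:
        root = math.sqrt(2 * N)
        beta_P = 2 * n * r * f / root
        mu_B = 4 * d * n * r * f / root
        mu_M = (2 * n / root) * mu_P + mu_C
    mu_OU = r * r * n * n * b / N               # Equation 7.35
    mu_OP = mu_OU / r                           # Equation 7.36
    memory = mu_OU + mu_OP + mu_B + mu_M
    return memory, beta_P

# Example: the largest experiment reported below (32 partitions, 1,400 cameras).
mem_bytes, bw_bits = partition_requirements(n=1400, N=32)

With n = 1400 and N = 32 this predicts roughly 1.6 to 1.7 GB per partition, which is broadly consistent with the up-to-2 GB partitions mentioned in the evaluation that follows.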
7.5.4 Evaluation of Distributed Exclusion Results are reported for running distributed exclusion for surveillance networks of between 100 and 1,400 cameras and between 1 and 32 partitions. The occupancy data is derived by running detection pipelines on 2 hours of footage from a real network of 132 cameras, then duplicating occupancy data as necessary to synthesise 1,400 input files, each with 2 hours of occupancy data. These files are then used as input for the estimation partitions, enabling us to repeat experiments. The use of synthesis to generate a sufficiently large number of inputs for the larger tests results in input that contains an artificially high incidence of overlap, since there is complete overlap within each set of input replicas. The likely consequence of this is that the time performance of estimation computations is slightly worse than it would be in a real network. Our experimental platform is a cluster of 16 servers, each with two 2.0 GHz dual-core Opteron CPUs and 4 gigabytes of memory. We instantiate up to 32 estimation partitions (of size up to 2GB) on this platform.
7.5.4.1 Verification
Results are verified against the previous, non-partitioned implementation of exclusion. It is shown in [17] that this previous implementation exhibits sufficient
precision and recall of the ground truth overlap to support tracking. The partitioned implementation achieves very similar results for the same data. The differences arise because the distributed implementation chooses a simpler approach to dealing with clock skew. Adopting the more sophisticated previous approach would be feasible in the partitioned implementation, and would not affect memory or network requirements, but would increase CPU time requirements.
7.5.4.2 Performance Results
The parameters for our experiments are shown in Table 7.3. Parameters n, N and (by implication) R are variables, whereas the remaining parameters have the constant values shown. The μP and μC constants have been determined empirically. Figure 7.9 shows measurements of the arithmetic mean memory usage within each partition for the various configurations tested. Also shown are curves computed from the memory requirement formulae derived in Section 7.5.3.2. As can be seen, these closely match the measured results. The standard deviation in these results is at most 1.0 × 10⁻³ of the mean, for the 8 partition/200 camera case. Figure 7.10 shows measurements of the arithmetic mean input bandwidth into each partition for the various configurations tested. Also shown are curves computed from the network bandwidth requirement formulae derived in Section 7.5.3.1, scaled by a constant multiple (as discussed earlier), which turns out to be 8. As one would expect, these closely match the measured results. Notice that the bandwidth requirements of the two partition case are the same as for the unpartitioned case: both partitions require input from all cameras. The standard deviation in these results is at most 3.7 × 10⁻² of the mean, for the 32 partition/1200 camera case. The explanation for this variance (in fact any variance at all) is that we use a compressed format (sending only occupied cells) in the occupancy data.

Table 7.3 Experimental Parameters

Parameter       Value
N               either 1, 2, 8 or 32 partitions.
n (for N = 1)   100, 200 or 300 cameras
n (for N = 2)   100, 200 or 400 cameras
n (for N = 8)   200 to 800 cameras in steps of 200
n (for N = 32)  200 to 1400 cameras in steps of 200
r               each camera's field of view is divided into 12 × 9 = 108 cells.
R               determined from N and n.
f               10 frames per second.
d               at most 2 seconds buffering delay.
b               2 bytes per OUij count.
μP              250 KB storage overhead per camera.
μC              120 MB storage overhead per partition.
Fig. 7.9 Memory used within each partition
Fig. 7.10 Bandwidth into each partition
Fig. 7.11 CPU time used by each partition
Figure 7.11 shows measurements of the arithmetic mean CPU time within each partition for the various configurations tested. Recall that the footage used for experimentation is two hours (7,200 seconds), so all configurations shown are significantly faster than real time. The standard deviation in these results is at most 7.4 × 10⁻² of the mean, for the 32 partition/200 camera case. Partitions in this case require a mean of 77 seconds CPU time for 7,200 seconds real time, with the consequence that CPU time sampling effects contribute much of the variance. In contrast, the 32 partition/1400 camera case requires a mean of 1,435 seconds CPU time for 7,200 seconds real time, and has a standard deviation of 1.6 × 10⁻² of the mean. At each time step, the estimation executes O(n²) joint sampling operations (one for each pair of cells). Thus, we fit quadratic curves (least squares) to the measured data to obtain the coefficients of quadratic formulae predicting the time performance for each distinct number of partitions. These formulae are shown as the predicted curves in Figure 7.11.
7.5.4.3 Discussion
Observe from Figures 7.9 and 7.11 that partitioning over just two nodes delivers a significant increase in capacity over the unpartitioned implementation, at the cost of a second, commodity-level server. There is some additional network cost arising from partitioning: for example, the total bandwidth required in the two partition case is twice that for the unpartitioned case with the same number of cameras. However, the total bandwidth required is relatively small in any case and thus is not expected to be
problematic in practice. The total memory required for a given number of cameras is almost independent of the number of partitions, and the cost of this additional memory is more than outweighed by the ability for the memory to be distributed across multiple machines, thus avoiding any requirement for expensive machines capable of supporting unusually large amounts of memory. The experiments validate the memory requirement formulae from Section 7.5.3.2 and (trivially) the network bandwidth formulae from Section 7.5.3.1. Using the memory formulae together with the empirically derived quadratic formulae fitted to the curves in Figure 7.11, it is possible to determine the current scale limit for the estimation to operate in real time on typical commodity server hardware. We take as typical a server with 16 GB memory and 2 CPUs: the current cost of such a server is less than the cost of ten cameras (including camera installation). We instantiate two 8 GB partitions onto each such server. Figure 7.12 shows the predicted memory and CPU time curves for the 32 partition case extended up to 3,500 cameras. As can be seen, the memory curve crosses the 8 GB requirement at about 3,200 cameras, with the CPU time curve crossing 7,200 seconds at about 3,400 cameras, leading to the conclusion that a 16-server system can support over 3,000 cameras, a significantly larger scale than any previously reported results.
Fig. 7.12 Scale limit for 32 partition system
7.6 Enabling Network Tracking We are now interested in investigating the use of pairwise camera overlap estimates for supporting target tracking across large networks of surveillance cameras. This is
achieved by comparing the use of camera overlap topology information with a method based on matching target appearance histograms, and by evaluating the effect of combining both methods. We use standard methods for target segmentation and single camera tracking, and instead focus on the task of implementing hand-off by joining together the tracks that have been extracted from individual views, as illustrated in Figure 7.13.
Fig. 7.13 Example showing a hand-off link (dashed purple line) between two single camera tracks (solid blue lines), as well as the summarisation of individuals to 12x9 grid cells (green shading) as employed in the topology estimation method used
The two main approaches to implementing tracking hand-off considered here are target appearance and camera overlap topology. Target appearance is represented by an RGB colour histogram, which has the advantage that it is already used for single camera tracking, and is straightforward to extend to multiple camera tracking. However, it has the disadvantage that target appearance can change significantly between viewpoints and due to errors in segmenting the target from the background. The camera overlap topology describes the relationships between the cells that form the camera views, and can be obtained automatically using the process defined previously. The appearance of a person is defined by the pixel colours representing that person in an image, and is widely used in video surveillance. Measures based on appearance include correlation of the patch itself, correlating image gradients, and matching Gaussian distributions in colour space. We use an RGB histogram to represent the appearance of each target. Histograms are chosen because they allow for some distortion of appearance: they count only the frequency of each colour rather than its location, and they quantise colours to allow for small variations. They are also compact and easy to store, update and match. Here we use an 8×8×8 RGB histogram that is equally spaced in all three dimensions, totalling 512 bins. Histogram matching is based upon the Bhattacharyya coefficient [1] to determine the similarity of object appearances. If we let i sequentially reference histogram bins, then the similarity of normalised histograms A and B is given by:
Similarity = Σ_{i=1}^{512} √(Ai · Bi)    (7.39)
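A direct implementation of the 8×8×8 histogram and the similarity of Equation 7.39 might look as follows (Python/NumPy); the matching threshold shown is purely illustrative.

import numpy as np

def rgb_histogram(pixels, bins=8):
    """512-bin RGB histogram of an Nx3 uint8 pixel array, normalised to sum to 1."""
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    hist = hist.ravel()
    return hist / hist.sum()

def bhattacharyya_similarity(a, b):
    """Equation 7.39: similarity of two normalised histograms."""
    return float(np.sum(np.sqrt(a * b)))

def same_target(a, b, threshold=0.85):
    """Threshold value chosen for illustration only."""
    return bhattacharyya_similarity(a, b) > threshold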
A decision on whether two targets match can then be reached by thresholding this similarity measure. Other similarity measures, such as Kullback-Leibler divergence [20], produced similar results. For large camera networks, the number of false matches resulting from the use of appearance matching alone will generally increase with the number of cameras in the system. In practice, for large networks this will need to be mitigated by applying at least a limited form of camera topology information, and only searching for appearance matches in the same cluster of cameras, such as within the same building. Hand-off based on appearance matching across the entire network or within clusters may then be further refined by combining it with the use of overlap topology and searching only overlapping regions for targets of matching appearance. To perform the evaluation, we draw a random sample of detections of people at particular times. For each sample, we manually identify all other simultaneous detections of the same person, in order to obtain a set of ground truth links between detections in different cameras. No restriction is placed on the detections that are sampled in this way; they may occur anywhere within tracks that have been found within a single camera. This reflects the fact that hand-off does not just link the end of one track with the start of another, but rather needs to link arbitrary points from track pairs depending on when a person becomes visible or disappears from another camera. This is quite a stringent test because it is based solely on instantaneous "snapshots" of the network: no temporal information is used. In reality, information from temporally adjacent frames may be available to assist with deciding on camera hand-off. A set of 500 object detections was randomly chosen from the many hours of footage across a set of 24 overlapping cameras. These provided 160 usable test cases where segmentation errors were not extreme, and the individual was not significantly occluded. It was found that one camera provided unreliable time stamping, providing an opportunity to compare results to the more reliable 23 camera set. Figure 7.14 presents the tracking hand-off results in terms of precision and recall, for appearance matching and using each of three overlap topology estimators: exclusion, mutual information, and lift. The results show that using the appearance model alone has much lower precision when detecting tracking hand-offs than using any of the topology estimates alone, but it performs better than randomly "guessing". This low precision could be due to a number of factors that influence appearance measures, such as segmentation errors and illumination effects. Additionally, some cameras in the dataset were behind glass windows, which can introduce reflections. This had a minor effect on segmentation, but influenced object appearance. Regardless of these effects, distinguishing between individuals wearing similar clothing is difficult using appearance features alone. Estimates of tracking hand-off links based on the automatically generated camera-to-camera topology are similar to those based on the camera-to-camera ground truth
Fig. 7.14 P-R curves demonstrating tracking hand-off results
topology, when the unreliable camera was removed. The precision was considerably worse with this camera included, as it introduced a number of false links in the topology. Using overlap at a camera level alone is problematic, because objects may be observed in non-overlapping regions of overlapping views and erroneously be considered to be the same object. The increased spatial resolution of overlap using 12x9 cells per camera outperformed the tracking hand-off achieved using even the ground truth camera-to-camera topology. Thus, more cases where different objects are seen in non-overlapping portions of the camera views are correctly excluded in the hand-off search process. Combining appearance and topology did not significantly increase the precision for a given level of recall, suggesting that incorrect links which fit the topology model are not correctly excluded by using appearance. A more complex appearance model or improvements in the accuracy of object segmentations may improve the accuracy of appearance; however unless the appearance model or extraction technique can compensate for illumination changing the perceived object appearance, these are still likely to be very difficult cases in real surveillance environments. The accuracy of determining appearance similarity also depends significantly upon the individual clothing that is worn. In practice, many people wear similar clothing, often with a significant amount of black or dark colours. If the appearance model can only capture large differences in appearance and clothing, then it may be difficult to
accurately discriminate between individuals; however, capturing nuances in appearance can lead to large data structures that can be even more sensitive to illumination and segmentation errors. By contrast, topology information is derived solely from object detection in each cell, and thus does not depend on the appearance of people in the video. It is less sensitive to these issues, and able to obtain a topology that is accurate enough to be useful for tracking even in environments where the camera struggles to accurately capture the appearance of each person. The effect of removing the poor quality wireless camera is also indicated in the graph. It is clear that the precision of the topology is reduced by including such cameras, as the time delay allows for evidence to arise supporting overlaps that do not occur. The precision difference for appearance-only tracking was much smaller and is not shown in the graph. This is because the appearance is not as significantly affected by time delays, so removing the less reliable camera does not have much of an effect.
7.7 Conclusion This chapter reports recent research exploring the automatic determination of activity topologies in large scale surveillance systems. The focus is specifically on the efficient and scalable determination of overlap between the fields of view, so as to facilitate higher level tasks such as the tracking of people through the system. A framework is described that utilises joint sampling of cell occupancy information for pairs of camera views to estimate the camera overlap topology, a subset of the full activity topology. This framework has been developed to implement a range of camera overlap estimators: results are reported for estimators based on mutual information, conditional entropy, lift and conditional probability measures. Partitioning and distributed processing using the framework provides a more cost effective approach to activity topology estimation for large surveillance networks, as this permits the aggregation of the resources of a large number of commodity servers into a much larger system than is possible with a single high-end server/supercomputer. In particular, the distributed implementation can obtain the memory it requires at acceptable cost. Results comparing partitioned and non-partitioned implementations demonstrate that the advantages of partitioning outweigh the costs. The partitioning scheme enables partitions to execute independently, enhancing performance (through increased parallelism) and, just as importantly, permitting partitions to be added without affecting existing partitions. Formulae are derived for the network and memory requirements of the partitioned implementation. These formulae, verified by experimental results, enable engineers seeking to use the distributed topology estimation framework to determine the resources required from the implementation platform. A further detailed investigation is conducted into the accuracy of the topologies estimated by the occupancy-based framework. This assesses support for tracking of people across multiple cameras. These results demonstrate the utility of the topologies produced by the estimation framework.
The camera overlap topology estimation framework, the distributed implementation, and the use of the estimated topology for higher level functions (tracking), together demonstrate the system’s ability to support intelligent video surveillance on large scale systems.
References [1] Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society 35, 99–109 (1943) [2] Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006) [3] Brand, M., Antone, M., Teller, S.: Spectral solution of large-scale extrinsic camera calibration as a graph embedding problem. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 262–273. Springer, Heidelberg (2004) [4] Buehler, C.: Computerized method and apparatus for determining field-of-view relationships among multiple image sensors. United States Patent 7286157 (2007) [5] Cichowski, A., Madden, C., Detmold, H., Dick, A.R., van den Hengel, A., Hill, R.: Tracking hand-off in large surveillance networks. In: Proceedings of Image and Visual Computing, New Zealand (2009) [6] Floreani, D., manufacturer of large surveillance systems: Personal Communication to Anton van den Hengel (November 2007) [7] Detmold, H., van den Hengel, A., Dick, A.R., Cichowski, A., Hill, R., Kocadag, E., Falkner, K., Munro, D.S.: Topology estimation for thousand-camera surveillance networks. In: Proceedings of International Conference on Distributed Smart Cameras, pp. 195–202 (2007) [8] Detmold, H., van den Hengel, A., Dick, A.R., Cichowski, A., Hill, R., Kocadag, E., Yarom, Y., Falkner, K., Munro, D.: Estimating camera overlap in large and growing networks. In: 2nd IEEE/ACM International Conference on Distributed Smart Cameras (2008) [9] Dick, A., Brooks, M.J.: A stochastic approach to tracking objects across multiple cameras. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 160–170. Springer, Heidelberg (2004) [10] Ellis, T.J., Makris, D., Black, J.: Learning a multi-camera topology. In: Proceedings of Joint IEEE Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 165–171 (2003) [11] Espina, M.V., Velastin, S.A.: Intelligent distributed surveillance systems: A review. IEEE Proceedings - Vision, Image and Signal Processing 152, 192–204 (2005) [12] Gilbert, A., Bowden, R.: Tracking objects across cameras by incrementally learning inter-camera colour calibration and patterns of activity. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 125–136. Springer, Heidelberg (2006) [13] Griffin, J.: Singapore deploys march networks vms solution (2009), http://www.ipsecuritywatch.com/web/online/IPSW-News/ Singapore-deploys-March-Networks-VMS-solution/512$4948, IP Security Watch [14] van den Hengel, A., Detmold, H., Madden, C., Dick, A.R., Cichowski, A., Hill, R.: A framework for determining overlap in large scale networks. In: Proceedings of International Conference on Distributed Smart Cameras (2009)
[15] van den Hengel, A., Dick, A., Hill, R.: Activity topology estimation for large networks of cameras. In: AVSS 2006: Proc. IEEE International Conference on Video and Signal Based Surveillance, pp. 44–49 (2006) [16] van den Hengel, A., Dick, A.R., Detmold, H., Cichowski, A., Hill, R.: Finding camera overlap in large surveillance networks. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part I. LNCS, vol. 4843, pp. 375–384. Springer, Heidelberg (2007) [17] Hill, R., van den Hengel, A., Dick, A.R., Cichowski, A., Detmold, H.: Empirical evaluation of the exclusion approach to estimating camera overlap. In: Proceedings of the International Conference on Distributed Smart Cameras (2008) [18] Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking across multiple cameras with disjoint views. In: Proceedings of International Conference on Computer Vision, pp. 952–957 (2003) [19] Ko, T.H., Berry, N.M.: On scaling distributed low-power wireless image sensors. In: Proceedings 39th Annual Hawaii International Conference on System Sciences (2006) [20] Kullback, S.: The Kullback-Leibler distance, vol. 41. The American Statistical Association (1987) [21] Rahimi, A., Dunagan, B., Darrell, T.: Simultaneous calibration and tracking with a network of non-overlapping sensors. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 187–194 (2004) [22] Rahimi, A., Dunagan, B., Darrell, T.: Tracking people with a sparse network of bearing sensors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 507–518. Springer, Heidelberg (2004) [23] Shannon, C.E., Weaver, W.: The mathematical theory of communication. University of Illinois Press, Urbana (1949) [24] Stauffer, C.: Learning to track objects through unobserved regions. In: IEEE Computer Society Workshop on Motion and Video Computing, vol. II, pp. 96–102 (2005) [25] Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 747–757 (2000) [26] Tieu, K., Dalley, G., Grimson, W.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: Tenth IEEE International Conference on Computer Vision, vol. II, pp. 1842–1849 (2005)
Chapter 8
Multi-robot Teams for Environmental Monitoring Maria Valera Espina, Raphael Grech, Deon De Jager, Paolo Remagnino, Luca Iocchi, Luca Marchetti, Daniele Nardi, Dorothy Monekosso, Mircea Nicolescu, and Christopher King
Abstract. In this chapter we target the problem of monitoring an environment with a team of mobile robots carrying on-board video cameras, together with fixed stereo cameras available within the environment. Current research regards homogeneous robots, whereas in this chapter we study highly heterogeneous systems and consider the problem of patrolling an area with a dynamic set of agents. The system presented in the chapter provides enhanced multi-robot coordination and vision-based activity monitoring techniques. The main objective is the integration and development of coordination techniques for multi-robot environment coverage, with the goal of maximizing the quality of information gathered from a given area, thus implementing a heterogeneous, mobile and reconfigurable multi-camera video-surveillance system.
Maria Valera Espina · Raphael Grech · Deon De Jager · Paolo Remagnino: Digital Imaging Research Centre, Kingston University, London, UK
Luca Iocchi · Luca Marchetti · Daniele Nardi: Department of Computer and System Sciences, University of Rome "La Sapienza", Italy
Dorothy Monekosso: Computer Science Research Institute, University of Ulster, UK
Mircea Nicolescu · Christopher King: Department of Computer Science and Engineering, University of Nevada, Reno

8.1 Introduction Monitoring a large area is a challenging task for an autonomous system. During recent years, there has been increasing interest in using robotic technologies for security and defense applications, in order to enhance their performance and reduce the danger for the people involved. Moreover, the use of multi-robot systems allows for better deployment, increased flexibility and reduced costs of the system. A significant amount of research in multi-agent systems has been dedicated to the development of and experimentation on methods, algorithms and evaluation methodologies for multi-robot patrolling in different scenarios. This chapter shows
that using multi-robots in environmental monitoring is both effective and efficient. We provide a distributed, multi-robot solution to environment monitoring, in order to detect or prevent defined, undesired events, such as intrusions, leaving unattended luggage and high temperatures (such as a fire). The problem of detecting and responding to threats through surveillance techniques is particularly well suited to a robotic solution comprising of a team of multiple robots. For large environments, the distributed nature of the multi-robot team provides robustness and increased performance of the surveillance system. Here we develop and test an integrated multirobot system as a mobile, reconfigurable, multi-camera video-surveillance system. The system goal is to monitor an environment by collectively executing the most effective strategies for gathering the best quality information from it. Using a group of mobile robots equipped with cameras has several significant advantages over a fixed surveillance camera system. Firstly, our solution can be used in environments that have previously not been equipped with a camera-based monitoring system: the robot team can be deployed quickly to obtain information about an unknown environment. Secondly, the cameras are attached to the robots, which will be positioning themselves within the environment, in order to best acquire the necessary information. This is in contrast with a static camera, which can only perform observations from a fixed view point. Thirdly, the robots in the team have the power to collaborate on the monitoring task and are able to pre-empt a potential threat. Fourthly, the robots could be equipped with additional, specialized sensors, which could be delivered at the appropriate place in the environment to detect, for example, the presence of high temperatures, such as in the case of a fire. Lastly, the robot team can communicate with a human operator and receive commands about the goals and potential changes in the mission, allowing for a dynamic, adaptive solution. Therefore, these enhanced multi-robot coordination and vision-based activity monitoring techniques, advance the state-of-the-art in surveillance applications. In this chapter, we focus on monitoring a large area by using a system with the following characteristics. 1. The system is composed of a number of agents, some of them having mobile capabilities (mobile robots) whilst others are fixed (video cameras). 2. The system is required to monitor and detect different kinds of predefined events at the same time. 3. Each agent has a set of sensors that are useful to detect some events. Sensors are of a different type within the entire system. 4. The system is required to operate in two modes: a) patrolling mode b) response mode These requirements make the problem significantly different from previous work. First of all, we consider a highly heterogeneous system, where robots and cameras inter-operate. Second, we consider different events and different sensors and we will therefore consider different sensor models for each kind of event. Third, we will study the dynamic evolution of the monitoring problem, where at each time a subset of the agents will be in response mode, while the rest of them will be in patrolling mode.
The main objectives of the developed system are: 1. develop environment monitoring techniques through behavior analysis based on stereo cameras, 2. develop distributed multi-robot coverage techniques for security and surveillance, 3. validate our solution by constructing a technological demonstrator showing the capabilities of a multi-robot system to effectively deploy itself in the environment and monitor it. In our previous work, we already developed and successfully implemented new dynamic distributed task assignment algorithms for teams of mobile robots: applied to robotic soccer [27] and for foraging-like tasks [20]. More specifically, in [27] we proposed a greedy algorithm to effectively solve the multi-agent dynamic and distributed task assignment problem, that is very effective in situations where the different tasks to be achieved have different priorities. In [20] we also proposed a distributed algorithm for dynamic task assignment based on token passing that is applicable when tasks are not known a priori, but are discovered during the mission. The problem considered here requires both finding an optimal allocation of tasks among the robots and taking into account tasks that are discovered at runtime. Therefore it is necessary to integrate the two approaches. As a result, we do not only specialize these solutions to the multi-robot surveillance and monitoring task, but also study and develop extensions to these techniques in order to improve the optimality of the solutions and the adaptivity to an open team of agents, taking into account the physical constraints of the environment and of the task. The use of stereo cameras in video-surveillance opens several research issues, such as the study of segmentation algorithms based on depth information provided by the stereo-vision, tracking algorithms that take into account 3-D information about the moving objects and techniques of behavior analysis that integrate and fuse 3-D information gathered from several stereo sensors. Furthermore, the application of multi-robot coverage to security and surveillance tasks provide new opportunities of studying multi-robot distributed coordination techniques with dynamic perception of the tasks and methods for optimal coverage of the environment in order to maximize the quality of the information gathered from it. These aspects will be considered in more detail in the coming sections. The rest of the chapter is organized as follows: Section 8.2 provides an overview of our proposed system. In Section 8.3 previous related work is presented. The Representation formalism is explained in Section 8.4 and the event-driven multi-sensor monitoring algorithm presented in Section 8.5. In Section 8.6 the system implementation and experimental results are illustrated. Finally Conclusions are drawn in Section 8.7.
8.2 Overview of the System
The overall surveillance system developed is presented below. The system is mainly composed of two sub-systems: a video-surveillance sub-system operating with static cameras and a multi-robot system for environment monitoring and threat response.
8.2.1 Video-Surveillance with Static Cameras
One objective of the visual surveillance system was to identify when people leave, pick up or exchange objects. Two scenarios were used to test the capabilities of the video-surveillance system. In the first scenario (i.e. the unattended baggage event), the system was designed to send a report if a person was observed leaving a bag. In the second scenario (i.e. object manipulation), the system should send a report if a person manipulated an unauthorized object in the environment. Once a report is sent, a patrol robot would be commissioned to go and take a high-resolution picture of the scene. Recognizing these types of actions can be done without sophisticated algorithms, so for this demonstration we use simple rule-sets based only on proximity and trajectories, sketched in the example after these rules.
For the first scenario:
• If a bag appears in the environment, models will be generated for that bag and for the nearest person. If the associated person is observed moving away from the bag, it will be considered a “left bag” event, and a report of the incident will be generated.
• If a bag is associated with one person, and a second person is observed moving away with the bag, it will be considered a “bag taken” event and a report will be generated and sent to the multi-robot system.
For the second scenario:
• If a person is observed manipulating an object that was either present at the beginning of the sequence, or left by another person (i.e. an unauthorized object), the incident will be considered an “alert” and a report will be generated and sent to the multi-robot system.
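As an illustration only, the following minimal Python sketch shows how such proximity-and-trajectory rules could be encoded. The Track structure, the distance thresholds and the displacement test are assumptions made for the example; they are not the values or data structures used by the actual demonstrator.

```python
from dataclasses import dataclass
import math

# Illustrative thresholds (assumptions, not values taken from the demonstrator).
ASSOC_DIST = 1.0   # metres: a person this close to the bag may be carrying it
LEAVE_DIST = 3.0   # metres: owner farther away than this => "left bag"
MOVE_DIST = 0.5    # metres: bag displacement that counts as "being carried"

@dataclass
class Track:
    track_id: int
    kind: str          # "person" or "bag"
    x: float
    y: float

def dist(a, b):
    return math.hypot(a.x - b.x, a.y - b.y)

def check_bag_events(bag, bag_prev, owner, people):
    """Simplified proximity/trajectory rules for the unattended-baggage scenario."""
    moved = math.hypot(bag.x - bag_prev.x, bag.y - bag_prev.y) > MOVE_DIST
    # Rule 1: the associated person walks away while the bag stays put.
    if not moved and dist(bag, owner) > LEAVE_DIST:
        return "left bag"
    # Rule 2: the bag is moving and somebody other than the owner is next to it.
    for p in people:
        if p.track_id != owner.track_id and dist(bag, p) < ASSOC_DIST and moved:
            return "bag taken"
    return None
```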
8.2.2 Multi-robot Monitoring of the Environment
Our approach to multi-robot monitoring is to extend the work done in multi-robot patrolling, adding the capability for the robots to respond, in a coordinated way, to events detected by visual and other sensors. Therefore, two problems are considered and solved:
1. identify global tasks associated with events detected by local sensors on-board the robots or by the vision components of the system;
2. coordinate the response to these events among the multiple robots.
These problems have been solved by developing a general algorithm for event-driven distributed monitoring (see Section 8.5).
8.2.3 Experimental Scenario
The experimental validation was carried out on the campus of the Department of Computer and System Science (DIS) of Sapienza University in Rome, Italy (www.dis.uniroma1.it). The selected scenario, shown in Figure 8.1, was an indoor corridor to
simulate the unattended baggage event and a lab room to simulate the object manipulation. A team of robots carrying video cameras is deployed in the environment and cooperates to optimize the surveillance task by maximizing the amount and quality of information gathered from the environment using the on-board cameras. When the robots reach the desired target poses, the cameras mounted on them can act as a network of surveillance cameras, and video-surveillance algorithms may run on their inputs. Moreover, another system based on fixed stereo cameras, capable of providing depth information, is available within the environment. This could eventually also be integrated on the robot platforms.
Fig. 8.1 Experimental scenario at DIS
8.3 Related Work
In this chapter, we define the problem of Environmental Monitoring by extending the classical problem of Multi-Robot Patrolling to also include Threat Response, i.e. the response to a threat event detected within the environment by an agent (either a robot or a computer vision sub-system) monitoring that environment. Examples of threat responses range from intercepting an intruder to examining a specific area or a specific object left by somebody. The main components to be integrated for the effective monitoring of an environment are: Multi-Robot Patrolling, Multi-Robot Coverage, Dynamic Task Assignment and Automatic Video Surveillance. The current state-of-the-art on these topics is presented in the following sections.
8.3.1 Multi-robot Patrolling
The patrolling task refers to the act of walking around an area, with some regularity, in order to protect or supervise it [39]. Usually, it is carried out by a team of multiple robots to increase the efficiency of the system. Given a map representation of the environment, the first step is to decide whether the map should be partitioned into smaller sections. In order to maximize efficiency, a multi-robot team should assign different areas to be patrolled to each robot. This means that a division of the global map has to be performed, and the sub-maps assigned to the robots. As analyzed in [39], in most cases it is sufficient to
adopt a static strategy, wherein the whole environment is already given as a collection of smaller areas. This means that a partitioning is not really necessary. However, more interesting approaches deal with dynamic and non-deterministic environments, resulting in a more challenging domain that must be partitioned dynamically. This implies that the robots should coordinate among themselves to decide who has to patrol which area. The subsequent step involves how to sweep the assigned area. Basically, this is performed by transforming the environment map into a graph to be traversed. This aspect of patrolling is the one most addressed by the current state-of-the-art. In fact, given a topological graph of the map, most of the algorithms and techniques used for dealing with graphs can be adopted. Major approaches use the Traveling Salesman Problem and its variants to define an optimal (or sub-optimal) path for a given graph; a simple heuristic of this kind is sketched below. In [38] a steering-control-based mechanism is defined that takes into account the constraints given by a real platform to define the path. First, a rectangle partitioning schema is applied to the map. Then, each rectangle is covered by the circle defining the sweep area (which depends on the platform and sensors used). Finally, the path is the result of connecting the covering circles. Another possibility is to use a Hamiltonian cycle to define the path. The work in [37] defines an algorithm to transform an occupancy-grid-based map into a topological graph, to which this kind of partitioning strategy is applied to perform the patrol task. More advanced techniques apply Game Theory [6] and Reinforcement Learning [4] methods to include the behavior of intruders in the sequencing strategies. A comparison of these techniques and preliminary results are presented in [4]. The last aspect involved in Multi-Robot Patrolling is task reallocation. When dynamic domains are used as test beds, the assigned area can change over time. This implies that the patrolling team needs to reshape its strategy to take the modification into account. Usually, this involves rebuilding the topological graph and resetting the current configuration. A more efficient approach, however, requires coordination among the robots to minimize task hopping. A basic approach, involving reallocation over team formation, is presented in [1].
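As a rough illustration of the graph-traversal view of patrolling, the sketch below computes a closed patrol cycle over a handful of hypothetical waypoints with a greedy nearest-neighbour heuristic. The waypoint coordinates are invented for the example, and the heuristic is a generic stand-in rather than any of the algorithms cited above.

```python
import math

# Hypothetical waypoints of a topological graph extracted from the map
# (coordinates in metres; these values are illustrative only).
waypoints = {"A": (0, 0), "B": (5, 0), "C": (5, 4), "D": (0, 4), "E": (2, 2)}

def nearest_neighbour_tour(points, start):
    """Greedy TSP-like heuristic: repeatedly visit the closest unvisited waypoint."""
    tour, current = [start], start
    unvisited = set(points) - {start}
    while unvisited:
        nxt = min(unvisited, key=lambda n: math.dist(points[current], points[n]))
        tour.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    tour.append(start)          # close the cycle so the patrol can be repeated
    return tour

print(nearest_neighbour_tour(waypoints, "A"))
# -> ['A', 'E', 'D', 'C', 'B', 'A'] for the coordinates above: cheap, sub-optimal
```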
8.3.2 Multi-robot Coverage
The goal of the coverage task is to build an efficient path ensuring that the whole area is crossed by the robot. Using a team of robots, the goal requires building efficient paths that jointly ensure the coverage of the area. Therefore, an important issue in mobile multi-robot teams is the application of coordination techniques to the area coverage problem. Multi-robot environment coverage has recently been studied by solving the problem of generating patrol paths. Elmaliach et al. [18] introduced an algorithm that guarantees the maximal uniform frequency for visiting places by the robots. Their algorithm detects circular paths that visit all points in the area, while taking into account terrain directionality and velocity constraints. The approach in [2] also considers the case in which the adversary knows the patrol scheme of the robots. Correll and Martinoli [17] consider the combination of probabilistic and
deterministic algorithms for the multi-robot coverage problem. They apply their method to a swarm-robotic inspection system at different levels of wheel-slip. They conclude that the combination of probabilistic and deterministic methods leads to higher accuracy, particularly when real-world factors become significant. Ziparo et al. [48] considered the problem of deploying large robot teams within Urban Search And Rescue (USAR) like environments for victim search. They used RFIDs for coordinating the robots by local search, and extended the approach with a global planner for synchronizing routes in configuration time-space. These approaches are mainly focused on the problem of computing optimal team trajectories in terms of path length and terrain coverage frequency, while only little attention has been paid to team robustness. Within real-world scenarios, however, dynamic changes and system failures are crucial factors for any performance metric.
8.3.3 Task Assignment
Cooperation based on Task Assignment has been intensively studied and can typically be considered a strongly coordinated/distributed approach [19]. In Reactive Task Assignment (e.g., [36]), each member of the team decides whether to engage itself in a task, without re-organizing the other members' activities, drastically reducing the requirements on communication but limiting the level of cooperation that can be supported. Iterative Task Assignment, such as [27, 46], allocates all tasks present in the system at each time step. In this way, the system can adapt to environmental conditions ensuring a robust allocation, but it generally requires knowing in advance the tasks that have to be allocated. Sequential Task Assignment methods [23, 49] allocate tasks to robots sequentially as they enter the system; therefore the tasks to be allocated do not need to be known before the allocation process begins. Such techniques suffer, in general, from a large bandwidth requirement, due to the large number of messages exchanged in order to assign tasks. Hybrid solutions that merge characteristics of different types of task allocation have been investigated. For example, in [22] the authors provide an emotion-based solution for multi-robot recruitment. Such an approach can be considered intermediate between sequential and reactive task assignment. Like previous approaches, this work does not explicitly take into account conflicts due to dynamic task perception. Conflicts arising in Task Assignment are specifically addressed, for example, by [3, 28]. However, the conflicts described in those approaches are only related to the use of shared resources (i.e. space), while other approaches can address a more general class of conflicts, such as those that arise when task properties change over time due to dynamic on-line perception [20].
8.3.4 Automatic Video Surveillance
Semi-automated visual surveillance systems deal with the real-time monitoring of persistent and transient objects within a specific environment. The primary aim of these systems is to provide an automatic interpretation of scenes in order to understand and
predict the actions and the interactions of the observed objects based on the information acquired by sensors. As mentioned in [45], the main stages of the pipeline in an automatic visual surveillance system are moving object detection and recognition, tracking and behavioral analysis (or activity recognition). One of the most critical and challenging components of a semi-automated video surveillance system is the low-level detection and tracking phase. Even small detection errors can significantly alter the performance of routines further down the pipeline, and subsequent routines are usually unable to correct errors without using cumbersome, ad-hoc techniques. Adding to this challenge, low-level functions must process huge amounts of data, in real time, over extended periods. This data is frequently corrupted by the camera's sensor (e.g. CCD noise, poor resolution and motion blur), the environment (e.g. illumination irregularities, camera movement, shadows and reflections), and the objects of interest (e.g. transformation, deformation or occlusion). Therefore, to cope with the challenges of building accurate detection and tracking systems, researchers are usually forced to simplify the problem. It is common to introduce certain assumptions or constraints that may include: fixing the camera [44], constraining the background [43], constraining object movement or applying prior knowledge regarding object appearance or location [41]. Relaxing any of these constraints often requires the system to be highly application-domain oriented. There are two main approaches to object detection: “temporal difference” and “background subtraction”. The first approach consists of the subtraction of two consecutive frames followed by thresholding. The second approach is based on the subtraction of a background (or reference) model from the current image, followed by a labeling process. The “temporal difference” approach has good throughput in dynamic environments as it is very adaptive; however, its performance in extracting all the relevant object pixels is poor. On the other hand, the background subtraction approach performs well in object extraction, although it is sensitive to dynamic changes in the environment. To overcome this issue, adaptive background techniques are applied, which involve creating a background model and continuously updating it to avoid poor detection in dynamic environments. There are different background modeling techniques, commonly related to the application, such as active contour techniques used to track non-rigid objects against homogeneous backgrounds [7], primitive geometric shapes for certain simple rigid objects [16] or articulated shapes for humans in high-resolution images [35]. Background modeling and background updating techniques are based on pixel-based or region-based approaches. In this chapter, an updating technique based on a pixel-based background model, the Gaussian Mixture Model (GMM) [40, 5], is presented for foreground detection in Scenario 1. In Scenario 2, an updating technique based on a region-based background model, the Maximally Stable Extremal Region (MSER) [31], is applied. Moving down the pipeline of the system, after the foreground extraction comes the tracking. Yilmaz et al. [47] reviewed several algorithms, listing the strengths and weaknesses of each of them and emphasizing that each tracking algorithm inevitably fails under a certain set of conditions.
Therefore, different tracking techniques are used in each scenario, as the environmental conditions differ between them: in Scenario 1 Kalman Filters [26] are implemented, and in Scenario 2 an optimized, multi-phased, kd-tree-based [24] tracking algorithm is
used. Finally, in [10] a survey of activity recognition algorithms is presented, where well-known probabilistic approaches, such as Bayesian Networks [11] or Hidden Markov Models [8], are used. In this video surveillance system, HMMs are used to generalize the object interactions and thereby recognize a predefined activity.
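As an aside, adaptive background subtraction of the kind discussed above is readily available in standard libraries. The sketch below uses OpenCV's MOG2 subtractor as a generic per-pixel Gaussian-mixture stand-in; the file name and parameter values are placeholders, and this is not the chapter's own GMM implementation [40, 5].

```python
import cv2

# Adaptive background subtraction with a per-pixel Gaussian mixture model.
cap = cv2.VideoCapture("corridor.avi")          # hypothetical input sequence
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)           # 0 = background, 255 = foreground
    # Shadows are marked with an intermediate value (127); drop them.
    _, fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(30) & 0xFF == 27:            # ESC to quit
        break
cap.release()
cv2.destroyAllWindows()
```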
8.4 Representation Formalism
One of our main contributions is the study of multi-robot patrolling and threat response with a heterogeneous team of agents including both mobile robots and static cameras. The heterogeneity is given not only by the different mobility capabilities of the agents, but also by their different sensing abilities. This study is motivated by the fact that the integration of many technologies, such as mobile robotics, artificial vision and sensor networks, can significantly increase the effectiveness of surveillance applications. In such a heterogeneous team, one important issue is to devise a common formalism for representing the knowledge about the environment for the entire system. Our approach to solving the problem of multi-robot monitoring is composed of three components:
1. a map representation of the events occurring in the environment;
2. a generated list of tasks, to handle the events;
3. a coordination protocol, to distribute the tasks among the agents.
The most interesting component is the map representation. Inspired by [29], a Gaussian process models the map to be covered in terms of wide areas and hot spots. In fact, the map is partitioned into these two categories. The objective of the single agent is, then, to cover the assigned areas, prioritizing the hot-spot areas while keeping the wide-area coverage. In this approach we introduce two novel concepts:
1. we consider different types of events that can occur in the domain at the same time, each one represented with a probabilistic function;
2. we consider the decay of information certainty over time.
Moreover, our system is highly heterogeneous, since it consists of both mobile robots carrying different sensors and static cameras. The proposed formalism thus allows for a unified representation of heterogeneous multi-source exploration of different types of events or threats.
8.4.1 Problem Formulation
Let X denote a finite set of locations (or cells) into which the environment is divided. This decomposition depends on the actual environment, robot size, sensor capabilities and the events to be detected. For example, in our experimental setup, we monitor an indoor environment looking for events related to people moving around and unattended luggage, and we use a discretization of the ground plane of 20 × 20 cm. Let E denote a finite set of events that the system is required to detect and monitor. Let Z denote a finite set of sensors included in the system: they can be either fixed or
mounted on mobile platforms. For each event e ∈ E there is a probability (or belief) that the event is currently occurring at a location x ∈ X; this probability is denoted by P_e(x). This probability distribution sums to 1 (i.e., Σ_x P_e(x) = 1). This means that we assume that an event e is occurring (even if this is not the case), and that the team of agents performs a continuous monitoring of the environment. In other words, when a portion of the environment is examined and considered to be clear (i.e., low values of P_e(x)), the probability in another, unexamined part of the environment increases, and that part thus becomes the objective of the next search. It should also be noted that this representation is adequate when the sensors cover only a part of the environment at any time, as in our setting, while it is not feasible when the sensors cover the entire environment. The computation of this probability distribution is performed by using the sensors in Z. Given a sensor z ∈ Z and a set of sensor readings z_{0:t} from time 0 to the current time t, the probability that event e occurs in location x at time t can be expressed as

$$p_e(x_t \mid z_{0:t}) = \eta \, p_e(z_t \mid x_t) \int_{x_{t-1}} p_e(x_t \mid x_{t-1}) \, p_e(x_{t-1} \mid z_{0:t-1}) \, dx_{t-1} \qquad (8.1)$$
Equation 8.1 is derived from Bayes' Theorem (see, for example, [42]). The set of probability distributions p_e(x_t | z_{0:t}), one for each event e, represents a common formalism for the heterogeneous team considered in this work and allows both for driving the patrolling task and for evaluating different strategies. This representation has an important feature: it allows for explicitly defining a sensor model for each ⟨sensor, event⟩ pair. In fact, p_e(z_t | x_t) represents the sensor model of sensor z in detecting event e. In this way, it is possible to accurately model heterogeneous sensors within a coherent formalism. Also, the motion model p(x_t | x_{t-1}) can be effectively used to model both static objects (e.g., bags) and dynamic objects (e.g., persons). It is important to notice that the sensor model p_e(z_t | x_t) contributes both to the cells x_t that are actually observed by the sensor and to cells that are not within its field of view. In this latter case, the probability of presence of the event is set to the nominal value λ, and thus P_e(x_t) tends to λ (i.e., no knowledge) if no sensor examines cell x_t for some time. This mechanism implements a form of information decay over time and requires the agents to continuously monitor the environment in order to assess that no threats are present. The idleness [14] is normally used in evaluating multi-agent patrolling approaches. This concept can be extended to our formalization as follows. Given a minimum probability value γ, which can be defined according to the sensor models for all the events, the idleness I_e(x,t) for an event e at a location x at time t is defined as the time elapsed since the location last had a low (i.e. < γ) probability of hosting the event. More formally,

$$I_e(x,t) = t - \hat{t} \quad \text{such that} \quad p_e(x_{\hat{t}}) < \gamma \;\wedge\; \forall \tau > \hat{t},\; p_e(x_\tau) \geq \gamma$$

Then the worst idleness WI_e(t) for an event e at time t is defined as the largest value of the idleness over all the locations. Formally,

$$WI_e(t) = \max_x I_e(x,t)$$
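The following sketch shows one possible discrete-grid implementation of the update in Eq. (8.1), including the decay of unobserved cells towards the nominal value λ. The grid size, the decay rate and the use of a convolution for the motion model are assumptions made for illustration; the chapter does not prescribe these particular values or this particular implementation.

```python
import numpy as np
from scipy.signal import convolve2d

LAMBDA = 1.0 / (50 * 50)   # nominal "no knowledge" value for a 50x50 grid (assumption)
DECAY = 0.05               # rate at which unobserved cells drift back towards LAMBDA

def update_event_belief(belief, likelihood, observed_mask, motion_kernel=None):
    """One discrete step in the spirit of Eq. (8.1) for a single event on a grid map.

    belief        : (H, W) array, current p_e(x_{t-1} | z_{0:t-1}), sums to 1
    likelihood    : (H, W) array, sensor model p_e(z_t | x_t) for observed cells
    observed_mask : (H, W) bool array, True inside the sensor's field of view
    motion_kernel : optional blur kernel implementing p_e(x_t | x_{t-1})
    """
    # Prediction: static objects keep the belief, dynamic ones diffuse it slightly.
    if motion_kernel is not None:
        belief = convolve2d(belief, motion_kernel, mode="same", boundary="symm")

    # Correction: apply the sensor model inside the field of view,
    # and let unobserved cells decay towards the nominal value LAMBDA.
    posterior = np.where(observed_mask,
                         belief * likelihood,
                         belief + DECAY * (LAMBDA - belief))
    return posterior / posterior.sum()        # eta: normalise so it sums to 1
```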
8.5 Event-Driven Distributed Monitoring
As stated before, we consider two different classes of tasks. Patrolling tasks define areas of the environment that should be traveled regularly by the agents. The shape of these areas is not constrained to be any specific one. We assume, however, that a decomposition can be performed in order to apply standard approaches to the coverage problem [15]. Threat response tasks specify restricted portions of the map where potentially dangerous events are currently occurring. The kind of threat is left unspecified, since it depends on the application domain. Examples of considered events are: an intruder detected in a restricted area, a bag or an unknown object left in a clear zone, or a non-authorized access to a controlled room. The appropriate response, then, should be specified per application. We assume that the basic response for all these events requires the agent to reach the corresponding location on the map. In this sense, these locations are the hot-spots specified in Section 8.4.1.
8.5.1 Layered Map for Event Detection
Figure 8.2 shows a diagram of the proposed solution. Data acquired by the sensors of the system are first locally filtered by each agent and then globally fused in order to determine a global response to perceived events. A finite set of event maps models the event space E. For each sensor in the set S, it is possible to define a sensor model. Each sensor model defines a probability distribution function (pdf) that describes how the sensor perceives an event, its uncertainty and how to update the event map. A sensor can update different event maps, and, hence, it becomes important to define how heterogeneous sensors update the map. A Layered Map for Event Detection defines a multi-level Bayesian filter. Each level describes a probability distribution related to a specific sensor: the combination of several levels results in
Fig. 8.2 The data flow in the Layered Map for Event Detection approach.
a probability distribution for the event of interest. However, the importance of an event decays as time goes by: to reflect these temporal constraints in the event handling, we introduce an aging update step. This step acts before updating the filter with the observations from the sensors (like the Predict step of recursive Bayesian filters). The pdf associated with each sensor level has the following meaning:

$$p(x) = \begin{cases} 1 & \text{if the sensor filter has converged on a hot-spot} \\ 0 & \text{if there is no relevant information given by the sensor in } x \end{cases}$$

Thus, p(x) = 0.5 means that, at that point, the sensor has complete ignorance about the environment it can perceive. Given these assumptions, at every time frame the pdf of each sensor level is smoothed towards complete ignorance:

$$p(x) \leftarrow \begin{cases} p(x) + \delta_{increase} & \text{if } p(x) < 0.5 \\ p(x) - \delta_{decrease} & \text{if } p(x) > 0.5 \end{cases}$$

The combination of the sensor levels is delegated to the Event Detection layer. This layer has a bank of filters, each one delegated to detect a specific event. Each filter uses the beliefs from a subset of the sensor levels to build a joint belief, representing the pdf of the associated event. The characterization of the event depends on the behavior an agent can perform in response to it. In this sense, the event is a threat and the response to it depends on the coordination step presented in Section 8.5.2. This process is formalized in Algorithm 8.1.

Algorithm 8.1. MSEventDetect
input:  u   = action performed by the agent
        z_s = set of sensor readings from a specific sensor
        BF  = set of Bayesian sensor filters
output: E   = set of pdfs associated to events of interest

// initialize the event belief Bel
Bel ← ∅
foreach bf in BF do
    // apply aging
    if p_(bf)(x) < 0.5 then p_(bf)(x) ← p_(bf)(x) + δ_increase
    if p_(bf)(x) > 0.5 then p_(bf)(x) ← p_(bf)(x) − δ_decrease
    // perform Bayesian filtering
    Predict_bf(u)
    Update_bf(z_s)
    Bel ← Bel ∪ p_(bf)(x)
end
// build the joint belief in the event detection layer D
E ← ∅
foreach d in D do
    p_(d)(x | Bel) = ∏_i γ_i bel_i
    E ← E ∪ p_(d)(x)
end
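To make the aging step and the joint-belief product concrete, here is a small numerical sketch. The δ values and the array representation of a sensor level are assumptions for the example, not parameters taken from the implementation.

```python
import numpy as np

DELTA_INC = 0.02   # aging step applied below 0.5 (assumed value)
DELTA_DEC = 0.02   # aging step applied above 0.5 (assumed value)

def age_sensor_level(p):
    """Smooth a sensor-level pdf towards complete ignorance (0.5)."""
    return np.where(p < 0.5, np.minimum(p + DELTA_INC, 0.5),
                    np.where(p > 0.5, np.maximum(p - DELTA_DEC, 0.5), p))

def joint_event_belief(sensor_levels, gammas):
    """Combine the beliefs of several sensor levels into one event pdf:
    p_d(x | Bel) = prod_i gamma_i * bel_i, as in Algorithm 8.1."""
    joint = np.ones_like(sensor_levels[0])
    for g, bel in zip(gammas, sensor_levels):
        joint *= g * bel
    return joint
```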
8.5.2 From Events to Tasks for Threat Response
The output of the Event Detection is a distribution over the space of the environment, describing the probability of an event occurring in specific areas. The team of agents needs to translate this information into tasks to perform. First of all, a clustering is performed to extract the high-probability peaks of the distribution. The clustering uses a grid-based decomposition of the map, giving a coarse approximation of the distribution itself. If the distribution is multi-modal, a cluster will be associated with each peak. Each cluster c_e is then defined in terms of its center position and occupancy area: this information is used by the task association. After the list of clusters is generated, a corresponding task list is built. In principle, one task is associated with each cluster. Two categories of tasks are then considered: patrolling and threat response. The super-class Threat-Response could comprise different behaviors: explore the given area with a camera, verify the presence of an intruder or an unexpected object, and so on. However, the basic behavior associated with these tasks requires the agent to reach the location, or its vicinity, and take some kind of action. This means that the Threat-Response category defines a whole class of behaviors, distinguished by their last step. Therefore, in our experimental setup, we consider them as simple behaviors for reaching the location, avoiding the specification of other specific actions.
Algorithm 8.2. Event2Task
input:  E = set of pdfs associated to events of interest
output: T = set of tasks to perform

// clustering of the event set
C ← ∅
foreach e in E do
    c_e = Clusterize(e)
    C ← C ∪ c_e
end
// associate each event cluster to a task
T ← ∅
foreach c in C do
    T ← T ∪ { patrolling  if c ∈ E_p
              threat      if c ∈ E_t }
end
Algorithm 8.2 illustrates the steps performed to transform the pdf of an event into a task list. Here, E_p is the class of events that requires a patrolling task, while E_t is the class of events requiring a threat-response task. Figure 8.3 shows an example where two pdfs (represented as sets of samples) are processed to obtain two tasks.
Fig. 8.3 Event to task transformation.
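The grid-based clustering used by Event2Task can be approximated, for illustration, by thresholding the event grid and labelling its connected components. The threshold, the cell size and the returned task tuple are assumptions of this sketch rather than details of the actual Clusterize routine.

```python
import numpy as np
from scipy import ndimage

def event_to_tasks(event_pdf, cell_size=0.2, threshold=0.7, kind="threat"):
    """Extract high-probability peaks of an event pdf and turn them into tasks.

    event_pdf : (H, W) grid of probabilities for one event
    kind      : 'patrolling' or 'threat', depending on the event class (E_p or E_t)
    Returns a list of (kind, centre_xy, area) tuples.
    """
    # Coarse grid-based clustering: threshold, then label connected components.
    hot = event_pdf > threshold * event_pdf.max()
    labels, n_clusters = ndimage.label(hot)
    tasks = []
    for c in range(1, n_clusters + 1):
        rows, cols = np.nonzero(labels == c)
        centre = (cols.mean() * cell_size, rows.mean() * cell_size)   # metres
        area = rows.size * cell_size ** 2
        tasks.append((kind, centre, area))
    return tasks
```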
8.5.3 Strategy for Event-Driven Distributed Monitoring
We can now describe the strategy developed for Event-Driven Distributed Monitoring. Algorithm 8.3 incorporates the previously illustrated algorithms for Event Detection and for the event-to-task transformation. Each agent starts with uniform knowledge of the map: no events are detected yet. In normal conditions, the default behavior is to patrol the whole area of the map. At time t, the agent a receives information from the sensors. A sensor can model a real device, as well as a virtual one describing other types of information (a priori knowledge, constraints on the environment and so on). Algorithm 8.1 is then used to detect clusters of events on the map. These clusters are then passed to Algorithm 8.2, and a list of tasks is generated. The tasks are spread over the network to wake up the coordination protocol, and the Task Assignment step is performed. Each agent selects the task that is most appropriate to its skills and signals its selection to the other agents. The remaining tasks are relayed to the other agents, which, in the meantime, select their most appropriate task. If the number of tasks is larger than the number of available agents, the non-assigned tasks are put in a queue. When an agent completes its task, the task is removed from the pool of every agent.
Algorithm 8.3. Event Driven Monitoring
input:  BF = set of Bayesian sensor filters
        Z  = set of sensor readings from S
        S  = set of sensors
        M  = map of the environment

// initialize the sensor filters
foreach bf in BF do
    p(x)_(bf) = U(M)
end
// retrieve the agent's actions
u = actions
// retrieve the sensor readings
Z ← ∅
foreach s in S do
    Z ← Z ∪ z_s
end
// detect events
E = MSEventDetect(u, Z, BF)
// generate the task set
T = Event2Task(E)
// assign a task to the agent a
task = TaskAssignment(a, T)
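The TaskAssignment step can be pictured with the following greedy, utility-based sketch, in the spirit of the distributed assignment of [27] but not a reproduction of it. The utility function, the agent positions and the task list are all hypothetical.

```python
import math

def greedy_task_assignment(agents, tasks, utility):
    """Greedy allocation that every agent can run identically: at each round the
    (agent, task) pair with the highest utility is committed; tasks left over
    when no agent is free stay in a queue until an agent becomes available."""
    assignment, queue = {}, list(tasks)
    free = set(agents)
    while free and queue:
        agent, task = max(((a, t) for a in free for t in queue),
                          key=lambda pair: utility(*pair))
        assignment[agent] = task
        free.remove(agent)
        queue.remove(task)
    return assignment, queue      # queue holds the non-assigned tasks

# Example with a hypothetical utility: closer tasks are more attractive.
agents = {"robot1": (0.0, 0.0), "robot2": (8.0, 2.0)}
tasks = [("threat", (1.0, 1.0)), ("patrolling", (7.0, 3.0)), ("threat", (4.0, 4.0))]
u = lambda a, t: -math.dist(agents[a], t[1])
print(greedy_task_assignment(list(agents), tasks, u))
```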
8.6 Implementation and Results
As mentioned in Section 8.2, two scenarios are considered for our system, and for each of them different vision algorithms were implemented. In the first scenario, a bag is left unattended and a robot goes to check the suspected area. In the second scenario, the video surveillance system deals with the manipulation of unauthorized objects in specific positions (a laptop in the top-left corner of Figure 8.1). The implementation of the computer vision and robotic components dealing with these scenarios and the realization of a full demonstrator to validate the approaches are described in the following.
8.6.1 A Real-Time Multi-tracking Object System for a Stereo Camera - Scenario 1
In this scenario, a multi-object tracking algorithm based on a ground-plane projection of real-time 3D data coming from stereo imagery is implemented, giving distinct separation of occluded and closely-interacting objects. Our approach, based on the research activity completed in [25, 26, 33, 34], consists of tracking, using Kalman Filters, fixed templates that are created by combining the height and the statistical pixel occupancy of the objects in the scene. These objects are extracted from the background using a Gaussian Mixture Model [40, 5] with four channels: three
colour channels (YUV colour space) and a depth channel obtained from the stereo devices [25]. The mixture model is adapted over time and is used to create a background model that is also updated using an adaptive learning-rate parameter according to the scene activity level on a per-pixel basis (the value is experimentally obtained). The use of depth information (3D data) can contribute to solving difficult challenges normally faced when detecting and tracking objects, such as: improving the foreground segmentation, thanks to its relative robustness against lighting effects such as shadows, and providing a shape feature that helps to discern between people and other foreground objects such as bags. Moreover, the use of a third-dimension feature can help to resolve the uncertainty of predictions in the tracking process when an occlusion occurs between a foreground and a background object. The 3D foreground cloud data is then rendered as if it were viewed from an overhead, orthographic camera view (see Figure 8.4); this reduces the amount of information and therefore increases computational performance, since tracking is done on plan-view projection data rather than directly on 3D data. The projection of the 3D data onto a ground plane is chosen under the assumption that people usually do not overlap in the direction normal to the ground plane. Therefore, this 3D projection allows separating and resolving some occlusions that are more difficult to handle using the original camera view. The data association implemented in the tracking process is also based on the work presented in [26, 34]. The Gaussian and linear dynamic prediction filters used to track the occupancy and height statistics of the plan-view maps are the well-known Kalman Filters [9]. Figure 8.5 shows tracking of different types of objects (robot, person and bag), including occlusion. Each plan-view map has been synchronized with its raw frame pair and back-projected to the real map of the scene.
Fig. 8.4 Process for creation of a plan-view
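As an illustration of the plan-view idea, the sketch below bins a 3D foreground point cloud into ground-plane occupancy and height maps. The coordinate convention, cell size and map extent are assumptions of the example, not the parameters of the system described here.

```python
import numpy as np

def plan_view_maps(points_3d, cell=0.05, x_range=(-3.0, 3.0), z_range=(0.0, 6.0)):
    """Project a 3D foreground point cloud onto the ground plane.

    points_3d : (N, 3) array of (x, y, z) camera points, with y pointing up
                and the ground plane at y = 0 (an assumption of this sketch).
    Returns an occupancy map (points per cell) and a height map (max height per cell).
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    nz = int((z_range[1] - z_range[0]) / cell)
    occupancy = np.zeros((nz, nx))
    height = np.zeros((nz, nx))
    ix = ((points_3d[:, 0] - x_range[0]) / cell).astype(int)
    iz = ((points_3d[:, 2] - z_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iz >= 0) & (iz < nz)
    for cx, cz, y in zip(ix[valid], iz[valid], points_3d[valid, 1]):
        occupancy[cz, cx] += 1                    # statistical pixel occupancy
        height[cz, cx] = max(height[cz, cx], y)   # tallest point above the cell
    return occupancy, height
```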
8.6.2 Maximally Stable Segmentation and Tracking for Real-Time Automated Surveillance - Scenario 2
In this section we present a novel real-time, color-based MSER detection and tracking algorithm for detecting object manipulation events, based on the work carried out in [21]. Our algorithm synergistically combines MSER evolution with image segmentation to produce maximally-stable segmentation. Our MSER algorithm clusters pixels into a hierarchy of detected regions using an efficient line-constrained evolution process. The resulting regions are used to seed a second clustering
Fig. 8.5 Seven frames of a sequence showing the tracking of different types of objects (robot, person and bag), including an occlusion. Each plan-view map has been synchronized with its raw frame pair and back-projected to the real plan of the scene (right side of each image).
process to achieve image segmentation. The resulting region-set maintains desirable properties from each process and offers several unique advantages, including fast operation, dense coverage, descriptive features, temporal stability, and low-level tracking. Regions that are not automatically tracked during segmentation can be tracked at a higher level using MSER and line features. We supplement low-level tracking with an algorithm that matches features using a multi-phased kd-search algorithm. Regions are modeled and identified using transformation-invariant features that allow identification to be achieved using a constant-time hash-table. To demonstrate the capabilities of our algorithm, we apply it to a variety of real-world activity-recognition scenarios. The MSER algorithm is used to reduce unimportant data, following Mikolajczyk's [32] conclusions on the comparison of the most promising feature-detection techniques. The MSER algorithm was originally developed by Matas et al. [31] to identify stable areas of light-on-dark, or dark-on-light, in greyscale images. The algorithm is implemented by applying a series of binary thresholds to an image. As the threshold value iterates, areas of connected pixels grow and merge, until every pixel in the image has become a single region. During this process, the regions are monitored, and those that display a relatively stable size through a wide range of thresholds are recorded. This process produces a hierarchical tree of nested MSERs. The tree-root contains the MSER node that comprises
every pixel in the image, with incrementally smaller nested sub-regions occurring at every tree-branch. The leaves of the tree contain the first-formed and smallest groups of pixels. Unlike other detection algorithms, the MSER identifies comparatively few regions of interest. However, our algorithm returns either a nested set of regions (traditional MSER-hierarchy formation) or a non-nested, non-overlapping set of regions (typical of image segmentation). Using non-nested regions significantly improves tracking speed and accuracy. To increase the number of detections and improve coverage, Forssén [21] redesigned the algorithm to incorporate color information. Instead of grouping pixels based on a global threshold, Forssén incrementally clustered pixels using the local color gradient (i.e. for every pixel p in the image, the color gradient is measured against the adjacent pixels p[+] and p[-]). This process identifies regions of similar-colored pixels that are surrounded by dissimilar pixels. In our approach we take advantage of the increased detection offered by Forssén's color-based approach, although the region growth is constrained using detected lines, improving segmentation results on objects with high-curvature gradients. To detect lines, the Canny filter is used rather than MSER, as it is more effective at identifying a continuous border between objects since it considers a larger section of the gradient. Therefore, our system processes each frame with the Canny algorithm. Canny edges are converted to line-segments, and the pixels corresponding to each line-segment are used to constrain MSER growth. Simply put, MSER evolution operates as usual, but is not permitted to cross any Canny lines. An example of detected lines is shown in Figure 8.6 (right). Detected lines are displayed in green.
Fig. 8.6 Left: An example of the feed-forward process. Dark-gray pixels are preserved, Light-gray pixels are re-clustered. Center: MSERs are modeled and displayed using ellipses and average color-values. Right: An example of MSER image segmentation. Regions are filled with their average color, detected lines are shown in green, the path of the tracked hand is represented as a red line.
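For readers who want to experiment with the two ingredients, the sketch below combines OpenCV's stock greyscale MSER detector with a Canny edge map and discards regions that straddle strong edges. This is only a crude stand-in for the colour, line-constrained evolution described above, and the file name and thresholds are placeholders.

```python
import cv2

img = cv2.imread("frame.png")                    # hypothetical input frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

edges = cv2.Canny(gray, 100, 200)                # edge map used as a constraint
mser = cv2.MSER_create()                         # default MSER parameters
regions, _ = mser.detectRegions(gray)            # list of (M, 2) pixel arrays

kept = []
for pts in regions:                              # pts columns are (x, y)
    on_edge = edges[pts[:, 1], pts[:, 0]] > 0
    if on_edge.mean() < 0.05:                    # drop regions crossing edges
        kept.append(pts)
print(f"{len(kept)} of {len(regions)} MSERs survive the edge constraint")
```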
To improve performance when tracking large, textureless objects that are slow-moving or stationary, we apply a feed-forward algorithm, which is a relatively simple addition to our MSER algorithm. After every iteration of MSER generation, we identify pixels in the current frame that are nearly identical (RGB values within 1) to the pixel in the same location of the following frame. If the majority of pixels in any given MSER remain unchanged in the following video image, the matching pixels are pre-grouped into a region for the next iteration. This pixel-cluster is then used to seed growth for the next iteration of MSER evolution. Using our feed-forward
approach, any region that cannot continually maintain its boundaries will be assimilated into similarly-colored adjacent regions. After several iterations of region competition, many unstable regions are eliminated automatically without any additional processing (see also Figure 8.6). Once the regions using MSER features and line-corner features are obtained, the tracking algorithm is applied to them. Our tracking algorithm applies four different phases, each handling a specific type of tracking problem: “Feed-Forward Tracking”, “MSER-Tracking”, “Line-Tracking” and “Secondary MSER-Tracking”. If an object can be tracked in an early phase, later tracking phases are not applied to that object. By executing the fastest trackers first, we can further reduce resource requirements. In the “Feed-Forward Tracking” phase, using our pixel feed-forward algorithm, tracking becomes a trivial matter of matching the pixel's donor region with the recipient region. In the “MSER-Tracking” phase, as mentioned before, by eliminating the problem of nesting through the reduction of the hierarchy of MSERs to a non-hierarchical image segmentation, the representation becomes a one-to-one correspondence and matches are identified using a greedy approach. The purpose of this phase of tracking is to match only those regions that have maintained consistent size and color between successive frames. Each image region is represented by the following features: centroid (x, y) image coordinates, height and width (second-order moment of pixel positions) and, finally, color values. Matching is only attempted on regions that remained un-matched after the “Feed-Forward Tracking” phase. Matches are only assigned when regions have similarity measures beyond a predefined similarity threshold. In the “Line-Tracking” phase, line-corners are matched based on their positions, the angles of the associated lines, and the colors of the associated regions. It should be mentioned that, even if a line separates (and is therefore associated with) two regions, that line will have different properties for each region. Specifically, the line angle will be rotated by 180 degrees from one region to the other, and the left and right endpoints will be reversed. Each line-end is represented by the following features: position (x, y) image coordinates, angle of the corresponding line, RGB color values of the corresponding region, and left/right handedness of the endpoint (from the perspective of looking out from the center of the region). Line-corner matching is only attempted on regions that remained un-matched after the “MSER-Tracking” phase. Finally, in the “Secondary MSER-Tracking” phase a greedy approach is used to match established regions (regions that were being tracked but were lost) to unassigned regions in more recent frames. Unlike the first three phases, which only consider matches between successive frames, the fourth phase matches regions within an n-frame window. Although there may be several ways to achieve foreground detection using our tracking algorithms, we feel it is appropriate to simply follow the traditional pipeline. To this effect, the first several frames in a video sequence are committed to building a region-based model of the background. Here, MSERs are identified and tracked until a reasonable estimation of robustness and motion can be obtained. Stable regions are stored in the background model using the same set of features listed in the tracking section.
Bear in mind that since background features are continually tracked, the system is equipped to identify unexpected changes to the background. The remainder of the video is considered the operation phase. Here, similarity measurements are made
between the regions in the background model and the regions found in the current video frame. Regions considered sufficiently dissimilar to the background are tracked as foreground regions; matching regions are tracked as background regions. Once the foreground is segmented from the background, a color- and shape-based model is generated from the set of foreground MSER features. Our technique uses many of the principles presented by Chum and Matas [13], but our feature vectors were selected to provide improved robustness in scenes where deformable or unreliable contours are an issue. Therefore, we propose an algorithm that represents objects using an array of features that can be classified into three types: MSER-pairs (a 4-dimensional feature vector), MSER-individuals (a 3-dimensional feature vector) and, finally, a size-position measure (a 2-dimensional feature vector), a feature-set used only for computing the vote tally. The recognition of activities is done without sophisticated algorithms (a Hidden Markov Model was used to generalize object interactions), so for this surveillance system we use simple rule-sets based only on proximity and trajectories. Figure 8.7 shows an example of activity recognition using MSER. Each object is associated with a color bar at the right of the image. The apparent height of the bar corresponds to the computed probability that the person's hand is interacting with that object. In the scenario shown on the left, a person engaged in typical homework-type behaviors including: typing on a laptop; turning pages in a book; moving a mouse; and drinking from a bottle. In the scenario on the right, a person reached into a bag of chips multiple times, and extinguished a trash-fire with a fire extinguisher.
Fig. 8.7 An example of activity recognition using MSER
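The kd-tree based matching used in the later tracking phases can be illustrated with the following sketch, which matches per-region feature vectors between two frames with a nearest-neighbour query. The 5-dimensional feature layout, the distance threshold and the sample values are assumptions for the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_regions(prev_features, curr_features, max_dist=0.25):
    """Match region feature vectors between frames using a kd-tree
    nearest-neighbour search, as a stand-in for the multi-phased matcher
    described above. Returns a list of (prev_index, curr_index) pairs."""
    tree = cKDTree(curr_features)
    dists, idx = tree.query(prev_features, k=1)
    matches, used = [], set()
    for i, (d, j) in enumerate(zip(dists, idx)):
        if d <= max_dist and j not in used:      # greedy one-to-one assignment
            matches.append((i, int(j)))
            used.add(j)
    return matches

# Hypothetical 5-D features: (x, y, width, height, mean grey value), rescaled to [0, 1].
prev = np.array([[0.10, 0.20, 0.05, 0.08, 0.4], [0.70, 0.60, 0.10, 0.12, 0.9]])
curr = np.array([[0.12, 0.21, 0.05, 0.08, 0.4], [0.71, 0.58, 0.10, 0.12, 0.9]])
print(match_regions(prev, curr))                 # -> [(0, 0), (1, 1)]
```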
8.6.3 Multi-robot Environmental Monitoring
The Multi-Robot Environmental Monitoring approach described in Section 8.5 has been implemented on a robotic framework and tested both on two Erratic robots (www.videre.com) and on many simulated robots in the Player/Stage environment (playerstage.sourceforge.net).
Fig. 8.8 Block diagram of proposed architecture.
Figure 8.8 shows the block diagram of the overall system and the interactions among the developed modules. In particular, the team of robots monitors the environment while waiting to receive event messages from the vision sub-system. As previously described, we use a Bayesian filtering method to achieve the sensor data fusion. In particular, we use a Particle Filter for the sensor filters and the event detection layer. In this way, the pdfs describing the belief of the system about the events to be detected are represented as sets of samples, providing a good compromise between flexibility of representation and computational effort. The implementation of the basic robotic functionalities and of the services needed for multi-robot coordination is realized using the OpenRDK toolkit (openrdk.sf.net) [12]. The mobile robots used in the demonstrator have the following features:
• Navigation and Motion Control based on a two-level approach: a motion planner using a fine representation of the environment and a topological path-planner that operates on a less detailed map and reduces the search space; probabilistic roadmaps and rapidly-exploring random trees are used to implement these two levels [13].
• Localization and Mapping based on a standard particle filter localization method and the well-known GMapping implementation (openslam.org/gmapping.html), which has been successfully used on our robots in other applications as well [30].
• Task Assignment based on a distributed coordination paradigm using utility functions [27], already developed and successfully used in other projects.
Moreover, to test the validity of the approach, we replicated the scenarios in the Player/Stage simulator, defining a map of the real environment used for the experiments and several software agents with the same characteristics as the real robots. The combination of OpenRDK and Player/Stage is very suitable for developing and
testing multi-robot applications, since they provide a powerful yet flexible and easy-to-use robot programming environment.
8.6.4 System Execution
In this section we show the behavior of the overall surveillance system developed in this project. As stated, the recognition of the predefined activities for the scenarios illustrated in this chapter is done without sophisticated algorithms; simple rule-sets based only on proximity and trajectories are applied. The communication between the video surveillance system and the multi-robot system is done using a TCP client-server communication interface. Each static stereo camera is attached to a PC, and the PCs communicate with each other and with the robots via a private wireless network. Each static camera and its PC act as a client, and one of these PCs also acts as a server. The PC video-server is the only PC that communicates directly with the robots; the PC video-server, however, becomes a client when it communicates with the multi-robot system, and then the robots act
Fig. 8.9 This figure illustrates a sequence of what may happen in Scenario 1. Person A walks through the corridor with a bag and leaves it in the middle of the corridor. Person B approaches the bag and takes it, raising an alarm in the system and causing the patrolling robot to go and inspect the area.
Fig. 8.10 This figure illustrates a sequence of what may happen in Scenario 2. Person B places a book (black) and a bottle (green) on the table and manipulates them under the surveillance of the system, until Person B decides to touch an unauthorized object (i.e. the laptop, grey), raising an alarm in the system and causing the patrolling robot to go and inspect the area.
as servers. Once the video surveillance system has recognized an event, e.g. in Scenario 1 the person associated with the bag abandons the object (see Figure 8.9), the PC client camera sends the event name and the 3D coordinates to the PC video-server. The video-server then constructs a string with this information (first transforming the 3D coordinates into a common coordinate system) and sends the message via wireless to the robots. Then, one of the robots is assigned to go and patrol the area and take a high-resolution picture if the event detected is “bag taken”. Figures 8.9 and 8.10 show the results in Scenarios 1 and 2, respectively. Figure 8.9 illustrates a sequence of what may happen in Scenario 1. In the top-left image of the figure, a person with an object (a bag) is walking through the corridor. In the top-right image, the video system detects that the person left the bag, and therefore a “left bag” message is sent. In the bottom-left image another person walks very close to the bag. In the bottom-right image the visual surveillance system detects that a person is taking the bag; a “bag taken” message is sent to the robots and, as can be seen, one of the robots is sent to inspect the raised event. Figure 8.10 illustrates a sequence of what may happen in Scenario 2. In the top-left image, a laptop is placed
on the table and one of the robots can be seen patrolling. In the top-right and bottom-left images of the figure, a person is allowed to manipulate different objects. In the bottom-right image the person touches the only object which is not allowed, and therefore an “alert” alarm is raised. A sketch of the kind of message exchange used between the camera PCs and the robots is given below.
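As an illustration of this client-server exchange, the following sketch sends and receives a simple event message over TCP. The message layout (event name plus three coordinates separated by semicolons), the host address and the port are assumptions made for the example; they are not the format of the actual demonstrator.

```python
import socket

def send_event(event_name, xyz, host="192.168.0.10", port=5005):
    """Camera PC (client) -> video-server: report an event with 3D coordinates."""
    msg = "{};{:.2f};{:.2f};{:.2f}".format(event_name, *xyz)
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(msg.encode("utf-8"))

def serve_events(port=5005):
    """Video-server side: receive one event message and hand it to the robots."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            name, x, y, z = conn.recv(1024).decode("utf-8").split(";")
            print("forwarding", name, "at", (float(x), float(y), float(z)), "to the robots")

# Example: the camera PC reporting a "bag taken" event.
# send_event("bag taken", (1.20, 0.00, 3.45))
```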
8.7 Conclusion
In recent years, there has been increased interest in using robotic technologies for security and defence applications, in order to increase their performance and reduce the danger for the people involved. The research presented in this chapter aims to provide a distributed, multi-robot solution to the problem of environment monitoring, in order to detect or prevent undesired events, such as intrusions or unattended baggage events, or, in future applications, fire detection. The problem of detecting and responding to threats through surveillance techniques is particularly well suited to a robotic solution comprising a team of multiple robots. For large environments, the distributed nature of the multi-robot team provides robustness and increased performance of the surveillance system. In the future, extending the system with a group of mobile robots equipped with on-board processing cameras may have several significant advantages over a fixed surveillance camera system. First, the solution could be used in environments that have not previously been engineered with a camera-based monitoring system: the robot team could be deployed quickly to obtain information about a previously unknown environment. Second, the cameras attached to the robots could move through the environment in order to best acquire the necessary information, in contrast with a static camera, which can only perform observations from a fixed view point. Third, the robots in the team have the power to collaborate on the monitoring task and are also able to take actions that could pre-empt a potential threat. Fourth, the robots could be equipped with additional specialized sensors, which could be delivered to the appropriate place in the environment to detect the presence of chemical or biological agents. Last, the robot team could communicate with a human operator and receive commands about the goals and potential changes in the mission, allowing for a dynamic, adaptive solution.
Acknowledgements This publication was developed under Department of Homeland Security (DHS) Science and Technology Assistance Agreement No. 2009-ST-108-000012 awarded by the U.S. Department of Homeland Security. It has not been formally reviewed by DHS. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication.
References [1] Agmon, N.: Multi-robot patrolling and other multi-robot cooperative tasks: An algorithmic approach. Ph.D. thesis, BarIlan University (2009) [2] Agmon, N., Kraus, S., Kaminka, G.: Multi-robot perimeter patrol in adversarial settings. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 2339–2345 (2008) [3] Alami, R., Fleury, S., Herrb, M., Ingrand, F., Robert, F.: Multi robot cooperation in the martha project. IEEE Robotics and Automation Magazine 5(1), 36–47 (1998) [4] Almeida, A., Ramalho, G., Santana, H., Tedesco, P.A., Menezes, T., Corruble, V., Chevaleyre, Y.: Recent advances on multi-agent patrolling. In: Bazzan, A.L.C., Labidi, S. (eds.) SBIA 2004. LNCS (LNAI), vol. 3171, pp. 474–483. Springer, Heidelberg (2004) [5] Bahadori, S., Iocchi, L., Leone, G.R., Nardi, D., Scozzafava, L.: Real-time people localization and tracking through fixed stereo vision. Applied Intelligence 26, 83–97 (2007) [6] Basilico, N., Gatti, N., Rossi, T., Ceppi, S., Amigoni, F.: Extending algorithms for mobile robot patrolling in the presence of adversaries to more realistic settings. In: WI-IAT 2009: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pp. 557–564. IEEE Computer Society Press, Washington (2009) [7] Baumberg, A., Hogg, D.C.: Learning deformable models for tracking the human body. In: Shah, M., Jain, R. (eds.) Motion-Based Recognition, pp. 39–60 (1996) [8] Brand, M., Oliver, N., Pentland, A.: Coupled hidden markov models for complex action recognition. In: CVPR ’97: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR 1997), p. 994. IEEE Computer Society Press, Washington (1997) [9] Brown, R., Hwang, P.: Introduction to Random Signals and Applied Kalman Filtering. John Wiley & Sons, Chichester (1997) [10] Buxton, H.: Generative models for learning and understanding dynamic scene activity. In: ECCV Workshop on Generative Model Based Vision, pp. 71–81 (2002) [11] Buxton, H., Gong, S.: Advanced visual surveillance using bayesian networks. In: International Conference on Computer Vision, pp. 111–123 (1995) [12] Calisi, D., Censi, A., Iocchi, L., Nardi, D.: OpenRDK: a modular framework for robotic software development. In: Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), pp. 1872–1877 (2008) [13] Calisi, D., Farinelli, A., Iocchi, L., Nardi, D.: Autonomous navigation and exploration in a rescue environment. In: Proceedings of the 2nd European Conference on Mobile Robotics (ECMR), pp. 110–115 (2005) [14] Chevaleyre, Y.: Theoretical analysis of the multi-agent patrolling problem. In: IAT 2004: Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, pp. 302–308. IEEE Computer Society, Washington (2004) [15] Choset, H.: Coverage for robotics - a survey of recent results. Ann. Math. Artif. Intell. 31(1-4), 113–126 (2001) [16] Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–575 (2003) [17] Correll, N., Martinoli, A.: Robust distributed coverage using a swarm of miniature robots. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 379–384 (2007)
[18] Elmaliach, Y., Agmon, N., Kaminka, G.A.: Multi-robot area patrol under frequency constraints. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 385–390 (2007) [19] Farinelli, A., Iocchi, L., Nardi, D.: Multi robot systems: A classification focused on coordination. IEEE Transactions on System Man and Cybernetics, part B 34(5), 2015– 2028 (2004) [20] Farinelli, A., Iocchi, L., Nardi, D., Ziparo, V.A.: Assignment of dynamically perceived tasks by token passing in multi-robot systems. Proceedings of the IEEE 94(7), 1271– 1288 (2006) [21] Forss´en, P.E.: Maximally stable colour regions for recognition and matching. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, IEEE, Minneapolis, USA (2007) [22] Gage, A., Murphy, R.R.: Affective recruitment of distributed heterogeneous agents. In: Proc. of Nineteenth National Conference on Artificial Intelligence, pp. 14–19 (2004) [23] Gerkey, B., Mataric, M.J.: Principled communication for dynamic multi-robot task allocation. In: Proceedings of the Int. Symposium on Experimental Robotics, pp. 353–362 (2000) [24] Goodman, J., O’Rourke, J.: Nearest neighbors in high dimensional spaces. In: Piotr Indy, K. (ed.) Handbook of Discrete and Computational Geometry, 2nd edn. IEE Professional Applications of Computing Series, vol. 5, ch. 39 (2004) [25] Harville, M., Gordon, G., Woodfill, J.: Foreground segmentation using adaptive mixture models in color and depth. In: IEEE Workshop on Detection and Recognition of Events in Video, vol. 0, p. 3 (2001) [26] Harville, M., Li, D.: Fast, integrated person tracking and activity recognition with planview templates from a single stereo camera. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 398–405 (2004) [27] Iocchi, L., Nardi, D., Piaggio, M., Sgorbissa, A.: Distributed coordination in heterogeneous multi-robot systems. Autonomous Robots 15(2), 155–168 (2003) [28] Jung, D., Zelinsky, A.: An architecture for distributed cooperative planning in a behaviour-based multi-robot system. Journal of Robotics and Autonomous Systems 26, 149–174 (1999) [29] Low, K.H., Dolan, J., Khosla, P.: Adaptive multi-robot wide-area exploration and mapping. In: Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), pp. 23–30 (2008) [30] Marchetti, L., Grisetti, G., Iocchi, L.: A comparative analysis of particle filter based localization methods. In: Lakemeyer, G., Sklar, E., Sorrenti, D.G., Takahashi, T. (eds.) RoboCup 2006: Robot Soccer World Cup X. LNCS (LNAI), vol. 4434, pp. 442–449. Springer, Heidelberg (2007) [31] Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proc. of British Machine Vision Conference, vol. 1, pp. 384– 393 (2002) [32] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65 (1-2), 43–72 (2005) [33] Mu noz-Salinas, R., Aguirre, E., Garc´ıa-Silvente, M.: People detection and tracking using stereo vision and color. Image Vision Comput. 25(6), 995–1007 (2007) [34] Mu noz-Salinas, R., Medina-Carnicer, R., Madrid-Cuevas, F.J., Carmona-Poyato, A.: People detection and tracking with multiple stereo cameras using particle filters. J. Vis. Comun. Image Represent. 20(5), 339–350 (2009)
[35] Ning, H.Z., Wang, L., Hu, W.M., Tan, T.N.: Articulated model based people tracking using motion models. In: Proc. Int. Conf. Multi-Model Interfaces, pp. 115–120 (2002) [36] Parker, L.E.: ALLIANCE: An architecture for fault tolerant multirobot cooperation. IEEE Transactions on Robotics and Automation 14(2), 220–240 (1998) [37] Portugal, D., Rocha, R.: Msp algorithm: multi-robot patrolling based on territory allocation using balanced graph partitioning. In: SAC 2010: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1271–1276. ACM, New York (2010) [38] Qu, Y.G.Z.: Coverage control for a mobile robot patrolling a dynamic and uncertain environment. In: WCICA 2004: Proceedings of 5th World Congress on Intelligent Control and Automation, pp. 4899–4903 (2004) [39] Sak, T., Wainer, J., Goldenstein, S.K.: Probabilistic multiagent patrolling. In: Zaverucha, G., da Costa, A.L. (eds.) SBIA 2008. LNCS (LNAI), vol. 5249, pp. 124–133. Springer, Heidelberg (2008) [40] Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999) [41] Tan, T.N., Sullivan, G.D., Baker, K.D.: Model-based localization and recognition of road vehicles. International Journal Computer Vision 29(1), 22–25 (1998) [42] Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. The MIT Press, Cambridge (2005) [43] Tian, T., Tomasi, C.: Comparison of approaches to egomotion computation. In: Computer Vision and Pattern Recognition, pp. 315–320 (1996) [44] Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. In: IEEE International Conference on Computer Vision, vol. 1, p. 255 (1999) [45] Valera, M., Velastin, S.A.: chap. 1. A Review of the State-of-the-Art in Distributed Surveillance Systems. In: Velastin, S.A., Remagnino, P. (eds.) Intelligent Distributed Video Surveillance Systems. IEE Professional Applications of Computing Series, vol. 5, pp. 1–25 (2006) [46] Werger, B.B., Mataric, M.J.: Broadcast of local eligibility for multi-target observation. In: DARS 2000, pp. 347–356 (2000) [47] Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computer Survey 38, 13 (2006) [48] Ziparo, V., Kleiner, A., Nebel, B., Nardi, D.: Rfid-based exploration for large robot teams. In: Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), Rome, Italy (2007) [49] Zlot, R., Stenz, A., Dias, M.B., Thayer, S.: Multi robot exploration controlled by a market economy. In: Proc. of IEEE International Conference on Robotics and Automation (ICRA), pp. 3016–3023 (2002)