Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2665
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Hsinchun Chen, Richard Miranda, Daniel D. Zeng, Chris Demchak, Jenny Schroeder, Therani Madhusudan (Eds.)

Intelligence and Security Informatics
First NSF/NIJ Symposium, ISI 2003
Tucson, AZ, USA, June 2-3, 2003
Proceedings
Volume Editors

Hsinchun Chen, Daniel D. Zeng, Therani Madhusudan
University of Arizona, Department of Management Information Systems, Tucson, AZ 85721, USA
E-mail: {hchen/zeng/madhu}@eller.arizona.edu

Richard Miranda, Jenny Schroeder
Tucson Police Department, 270 S. Stone Ave., Tucson, AZ 85701, USA
E-mail: [email protected]

Chris Demchak
University of Arizona, School of Public Administration and Policy, Tucson, AZ 85721, USA
E-mail: [email protected]

Cataloging-in-Publication Data applied for

A catalog record for this book is available from the Library of Congress.

Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): H.4, H.3, C.2, I.2, H.2, D.4.6, D.2, K.4.1, K.5, K.6.5

ISSN 0302-9743
ISBN 3-540-40189-X Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany

Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH
Printed on acid-free paper
SPIN: 10927359 06/3142 5 4 3 2 1 0
Preface
Since the tragic events of September 11, 2001, academics have been called on for possible contributions to research relating to national (and possibly international) security. Mid- to long-term national security research in the areas of information technologies, organizational studies, and security-related public policy, part of the original founding mandate of the National Science Foundation, is critically needed. Much as medical and biological research has faced significant information overload alongside tremendous opportunities for innovation, the law enforcement, criminal analysis, and intelligence communities are facing the same challenge. We believe, by analogy to “medical informatics” and “bioinformatics,” that there is a pressing need to develop the science of “intelligence and security informatics” – the study of the use and development of advanced information technologies, systems, algorithms and databases for national security related applications, through an integrated technological, organizational, and policy-based approach.

We believe active “intelligence and security informatics” research will help improve knowledge discovery and dissemination and enhance information sharing and collaboration across law enforcement communities and among academics, local, state, and federal agencies, and industry. Many existing computer and information science techniques need to be reexamined and adapted for national security applications. New insights from this unique domain could result in significant breakthroughs in new data mining, visualization, knowledge management, and information security techniques and systems.

This first NSF/NIJ Symposium on Intelligence and Security Informatics (ISI 2003) aims to provide an intellectual forum for discussions among previously disparate communities: academic researchers (in information technologies, computer science, public policy, and social studies); local, state, and federal law enforcement and intelligence experts; and information technology industry consultants and practitioners. Several federal research programs are also seeking new research ideas and projects that can contribute to national security.

Jointly hosted by the University of Arizona and the Tucson Police Department, the NSF/NIJ ISI Symposium had a program committee composed of 44 internationally renowned researchers and practitioners in intelligence and security informatics research. The 2-day program also included 5 keynote speakers, 14 invited speakers, 34 regular papers, and 6 posters. In addition to the main sponsorship from the National Science Foundation and the National Institute of Justice, the meeting was also cosponsored by several units within the University of Arizona, including the Eller College of Business and Public Administration, the Management Information Systems Department, the Internet Technology, Commerce, and Design Institute, the NSF COPLINK Center of Excellence, the Mark and Susan Hoffman E-Commerce Lab, the Center for the Management of
Information, and the Artificial Intelligence Lab, and several other organizations including the Air Force Office of Scientific Research, SAP, and CISCO.

We wish to express our gratitude to all members of the conference Program Committee and the Organizing Committee. Our special thanks go to Mohan Tanniru and Joe Hindman (Publicity Committee Co-chairs), Kurt Fenstermacher, Mark Patton, and Bill Neumann (Sponsorship Committee Co-chairs), Homa Atabakhsh and David Gonzalez (Local Arrangements Co-chairs), Ann Lally and Leon Zhao (Publication Co-chairs), and Kathy Kennedy (Conference Management). Our sincere gratitude goes to all of the sponsors. Last, but not least, we thank Gary Strong, Art Becker, Larry Brandt, Valerie Gregg, and Mike O’Shea for their strong and continuous support of this meeting and other related intelligence and security informatics research.
June 2003
Hsinchun Chen, Richard Miranda, Daniel Zeng, Chris Demchak, Jenny Schroeder, Therani Madhusudan
ISI 2003 Organizing Committee
General Co-chairs:
Hsinchun Chen (University of Arizona)
Richard Miranda (Tucson Police Department)

Program Co-chairs:
Daniel Zeng (University of Arizona)
Chris Demchak (University of Arizona)
Jenny Schroeder (Tucson Police Department)
Therani Madhusudan (University of Arizona)

Publicity Co-chairs:
Mohan Tanniru (University of Arizona)
Joe Hindman (Phoenix Police Department)

Sponsorship Co-chairs:
Kurt Fenstermacher (University of Arizona)
Mark Patton (University of Arizona)
Bill Neumann (University of Arizona)

Local Arrangements Co-chairs:
Homa Atabakhsh (University of Arizona)
David Gonzalez (University of Arizona)

Publication Co-chairs:
Ann Lally (University of Arizona)
Leon Zhao (University of Arizona)
ISI 2003 Program Committee
Yigal Arens (University of Southern California)
Art Becker (Knowledge Discovery and Dissemination Program)
Larry Brandt (National Science Foundation)
Donald Brown (University of Virginia)
Judee Burgoon (University of Arizona)
Robert Chang (Criminal Investigation Bureau, Taiwan Police)
Andy Chen (National Taiwan University)
Lee-Feng Chien (Academia Sinica, Taiwan)
Bill Chu (University of North Carolina, Charlotte)
Christian Collberg (University of Arizona)
Ed Fox (Virginia Tech)
Susan Gauch (University of Kansas)
Johannes Gehrke (Cornell University)
Valerie Gregg (National Science Foundation)
Bob Grossman (University of Illinois, Chicago)
Steve Griffin (National Science Foundation)
Eduard Hovy (University of Southern California)
John Hoyt (South Carolina Research Authority)
David Jensen (University of Massachusetts, Amherst)
Judith Klavans (Columbia University)
Don Kraft (Louisiana State University)
Ee-Peng Lim (Nanyang Technological University, Singapore)
Ralph Martinez (University of Arizona)
Reagan Moore (San Diego Supercomputing Center)
Clifford Neuman (University of Southern California)
David Neri (Tucson Police Department)
Greg Newby (University of North Carolina, Chapel Hill)
Jay Nunamaker (University of Arizona)
Mirek Riedewald (Cornell University)
Kathleen Robinson (Tucson Police Department)
Allen Sears (Corporation for National Research Initiatives)
Elizabeth Shriberg (SRI International)
Mike O’Shea (National Institute of Justice)
Craig Stender (State of Arizona)
Gary Strong (National Science Foundation)
Paul Thompson (Dartmouth College)
Alex Tuzhilin (New York University)
Bhavani Thuraisingham (National Science Foundation)
Howard Wactlar (Carnegie Mellon University)
Andrew Whinston (University of Texas at Austin)
Karen White (University of Arizona)
Jerome Yen (Chinese University of Hong Kong)
Chris Yang (Chinese University of Hong Kong)
Mohammed Zaki (Rensselaer Polytechnic Institute)

Keynote Speakers

Richard Carmona (Surgeon General of the United States)
Gary Strong (National Science Foundation)
Lawrence E. Brandt (National Science Foundation)
Mike O’Shea (National Institute of Justice)
Art Becker (Knowledge Discovery and Dissemination Program)

Invited Speakers

Paul Kantor (Rutgers University)
Lee Strickland (University of Maryland)
Donald Brown (University of Virginia)
Robert Chang (Criminal Investigation Bureau, Taiwan Police)
Pamela Scanlon (Automated Regional Justice Information Systems)
Kelcy Allwein (Defense Intelligence Agency)
Gene Rochlin (University of California, Berkeley)
Jane Fountain (Harvard University)
John Landry (Central Intelligence Agency)
John Hoyt (South Carolina Research Authority)
Bruce Baicar (South Carolina Research Authority and National Institute of Justice)
Matt Begert (National Law Enforcement & Corrections Technology)
John Cunningham (Montgomery County Police Department)
Victor Goldsmith (City University of New York)
Table of Contents
Part I: Full Papers

Data Management and Mining

Using Support Vector Machines for Terrorism Information Extraction .... 1
Aixin Sun, Myo-Myo Naing, Ee-Peng Lim, Wai Lam

Criminal Incident Data Association Using the OLAP Technology .... 13
Song Lin, Donald E. Brown

Names: A New Frontier in Text Mining .... 27
Frankie Patman, Paul Thompson

Web-Based Intelligence Reports System .... 39
Alexander Dolotov, Mary Strickler

Authorship Analysis in Cybercrime Investigation .... 59
Rong Zheng, Yi Qin, Zan Huang, Hsinchun Chen

Deception Detection

Behavior Profiling of Email .... 74
Salvatore J. Stolfo, Shlomo Hershkop, Ke Wang, Olivier Nimeskern, Chia-Wei Hu

Detecting Deception through Linguistic Analysis .... 91
Judee K. Burgoon, J.P. Blair, Tiantian Qin, Jay F. Nunamaker, Jr.
A Longitudinal Analysis of Language Behavior of Deception in E-mail . . . 102 Lina Zhou, Judee K. Burgoon, Douglas P. Twitchell
Analytical Techniques

Evacuation Planning: A Capacity Constrained Routing Approach .... 111
Qingsong Lu, Yan Huang, Shashi Shekhar

Locating Hidden Groups in Communication Networks Using Hidden Markov Models .... 126
Malik Magdon-Ismail, Mark Goldberg, William Wallace, David Siebecker
Automatic Construction of Cross-Lingual Networks of Concepts from the Hong Kong SAR Police Department .... 138
Kar Wing Li, Christopher C. Yang

Decision Based Spatial Analysis of Crime .... 153
Yifei Xue, Donald E. Brown

Visualization

CrimeLink Explorer: Using Domain Knowledge to Facilitate Automated Crime Association Analysis .... 168
Jennifer Schroeder, Jennifer Xu, Hsinchun Chen

A Spatio Temporal Visualizer for Law Enforcement .... 181
Ty Buetow, Luis Chaboya, Christopher O’Toole, Tom Cushna, Damien Daspit, Tim Petersen, Homa Atabakhsh, Hsinchun Chen

Tracking Hidden Groups Using Communications .... 195
Sudarshan S. Chawathe

Knowledge Management and Adoption

Examining Technology Acceptance by Individual Law Enforcement Officers: An Exploratory Study .... 209
Paul Jen-Hwa Hu, Chienting Lin, Hsinchun Chen

“Atrium” – A Knowledge Model for Modern Security Forces in the Information and Terrorism Age .... 223
Chris C. Demchak

Untangling Criminal Networks: A Case Study .... 232
Jennifer Xu, Hsinchun Chen

Collaborative Systems and Methodologies

Addressing the Homeland Security Problem: A Collaborative Decision-Making Framework .... 249
T.S. Raghu, R. Ramesh, Andrew B. Whinston

Collaborative Workflow Management for Interagency Crime Analysis .... 266
J. Leon Zhao, Henry H. Bi, Hsinchun Chen

COPLINK Agent: An Architecture for Information Monitoring and Sharing in Law Enforcement .... 281
Daniel Zeng, Hsinchun Chen, Damien Daspit, Fu Shan, Suresh Nandiraju, Michael Chau, Chienting Lin
Monitoring and Surveillance

Active Database Systems for Monitoring and Surveillance .... 296
Antonio Badia

Integrated “Mixed” Networks Security Monitoring – A Proposed Framework .... 308
William T. Scherer, Leah L. Spradley, Marc H. Evans

Bioterrorism Surveillance with Real-Time Data Warehousing .... 322
Donald J. Berndt, Alan R. Hevner, James Studnicki

Part II: Short Papers

Data Management and Mining

Privacy Sensitive Distributed Data Mining from Multi-party Data .... 336
Hillol Kargupta, Kun Liu, Jessica Ryan

ProGenIE: Biographical Descriptions for Intelligence Analysis .... 343
Pablo A. Duboue, Kathleen R. McKeown, Vasileios Hatzivassiloglou

Scalable Knowledge Extraction from Legacy Sources with SEEK .... 346
Joachim Hammer, William O’Brien, Mark Schmalz

“TalkPrinting”: Improving Speaker Recognition by Modeling Stylistic Features .... 350
Sachin Kajarekar, Kemal Sönmez, Luciana Ferrer, Venkata Gadde, Anand Venkataraman, Elizabeth Shriberg, Andreas Stolcke, Harry Bratt

Emergent Semantics from Users’ Browsing Paths .... 355
D.V. Sreenath, W.I. Grosky, F. Fotouhi

Deception Detection

Designing Agent99 Trainer: A Learner-Centered, Web-Based Training System for Deception Detection .... 358
Jinwei Cao, Janna M. Crews, Ming Lin, Judee Burgoon, Jay F. Nunamaker

Training Professionals to Detect Deception .... 366
Joey F. George, David P. Biros, Judee K. Burgoon, Jay F. Nunamaker, Jr.

An E-mail Monitoring System for Detecting Outflow of Confidential Documents .... 371
Bogju Lee, Youna Park
Methodologies and Applications

Intelligence and Security Informatics: An Information Economics Perspective .... 375
Lihui Lin, Xianjun Geng, Andrew B. Whinston

An International Perspective on Fighting Cybercrime .... 379
Weiping Chang, Wingyan Chung, Hsinchun Chen, Shihchieh Chou
Part III: Extended Abstracts for Posters

Data Management and Mining

Hiding Traversal of Tree Structured Data from Untrusted Data Stores .... 385
Ping Lin, K. Selçuk Candan

Criminal Record Matching Based on the Vector Space Model .... 386
Jau-Hwang Wang, Bill T. Lin, Ching-Chin Shieh, Peter S. Deng

Database Support for Exploring Criminal Networks .... 387
M.N. Smith, P.J.H. King

Hiding Data and Code Security for Application Hosting Infrastructure .... 388
Ping Lin, K. Selçuk Candan, Rida Bazzi, Zhichao Liu
Security Informatics

Secure Information Sharing and Information Retrieval Infrastructure with GridIR .... 389
Gregory B. Newby, Kevin Gamiel

Semantic Hacking and Intelligence and Security Informatics .... 390
Paul Thompson
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
Using Support Vector Machines for Terrorism Information Extraction

Aixin Sun¹, Myo-Myo Naing¹, Ee-Peng Lim¹, and Wai Lam²

¹ Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore
[email protected]
² Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR
[email protected]

* This work is partially supported by the SingAREN 21 research grant M48020004. Dr. Ee-Peng Lim is currently a visiting professor at the Dept. of SEEM, Chinese University of Hong Kong, Hong Kong, China.

Abstract. Information extraction (IE) is of great importance in many applications, including web intelligence, search engines, text understanding, etc. To extract information from text documents, most IE systems rely on a set of extraction patterns. Each extraction pattern is defined based on the syntactic and/or semantic constraints on the positions of desired entities within natural language sentences. The IE systems also provide a set of pattern templates that determines the kind of syntactic and semantic constraints to be considered. In this paper, we argue that such pattern templates restrict the kind of extraction patterns that can be learned by IE systems. To allow a wider range of context information to be considered in learning extraction patterns, we first propose to model the content and context information of a candidate entity to be extracted as a set of features. A classification model is then built for each category of entities using Support Vector Machines (SVM). We have conducted IE experiments to evaluate our proposed method on a text collection in the terrorism domain. From the preliminary experimental results, we conclude that our proposed method can deliver reasonable accuracies.

Keywords: Information extraction, terrorism-related knowledge discovery.
1 Introduction

1.1 Motivation
Information extraction (IE) is a task that extracts relevant information from a set of documents. IE techniques can be applied to many different areas. In the intelligence and security domains, IE can allow one to extract terrorism-related information from email messages, or identify sensitive business information from
news documents. In some cases where perfect extraction accuracy is not essential, automated IE methods can replace the manual extraction efforts completely. In other cases, IE may produce the first-cut results, reducing the manual extraction efforts.

As reported in the survey by Muslea [9], the IE methods for free text documents are largely based on extraction patterns specifying the syntactic and/or semantic constraints on the positions of desired entities within sentences. For example, from the sentence “Guerrillas attacked the 1st infantry brigade garrison”, one can define the extraction pattern <subject> active-attack to extract “Guerrillas” as a perpetrator, and active-attack <direct object> to extract “1st infantry brigade garrison” as a victim¹. The extraction pattern definitions currently used are very much based on some pre-defined pattern templates. For example, in AutoSlog [12], the above <subject> active-attack extraction pattern is an instantiation of the <subject> active-verb template. While pattern templates reduce the combinations of extraction patterns to be considered in rule learning, they may potentially pose as obstacles to deriving other more expressive and accurate extraction patterns. For example, IBM acquired <direct object> is a very pertinent extraction pattern for extracting company information but cannot be instantiated by any of the 13 AutoSlog pattern templates. Since it will be quite difficult to derive one standard set of pattern templates that works well for any given domain, IE methods that do not rely on templates will become necessary.

In this paper, we propose the use of Support Vector Machines (SVMs) for information extraction. SVM was proposed by Vapnik [16] and has been widely used in image processing and classification problems [5]. The SVM technique finds the best surface that can separate the positive examples from the negative ones. Positive and negative examples are separated by the maximum margin measured by a normal vector w. SVM classifiers have been used in various text classification experiments [2,5] and have been shown to deliver good classification accuracy. When SVM classifiers are used to solve an IE problem, two major research challenges must be considered.

– Large number of instances: IE for free text involves extracting from document sentences target entities (or instances) that belong to some pre-defined semantic category(ies). A classification task, on the other hand, is to identify candidate entities from the document sentences, usually in the form of noun phrases or verb phrases, and assign each candidate entity to zero, one, or more pre-defined semantic categories. As a large number of candidate entities can potentially be extracted from document sentences, this could lead to overheads in both the learning and classification steps.
– Choice of features: The success of SVM very much depends on whether a good set of features is given in the learning and classification steps. There should be adequate features that distinguish entities belonging to a semantic category from those outside the category.
¹ Both extraction patterns have been used in the AutoSlog system [12].
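For reference, the maximum-margin idea alluded to above, in its standard textbook (hard-margin) form; this is general SVM background rather than notation specific to this paper:

min_{w,b} (1/2)·‖w‖²   subject to   y_i (w · x_i + b) ≥ 1,  i = 1, ..., n

The separating surface is w · x + b = 0, the margin between the two classes is 2/‖w‖, and a new instance x is assigned to the positive class when the score w · x + b is positive.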
In our approach, we attempt to establish the links between the semantic category of a target entity and its syntactic properties, and to reduce the number of instances to be classified based on their syntactic and semantic properties. A natural language parser is first used to identify the syntactic parts of sentences, and only those parts that are desired are used as candidate instances. We then use both the content and syntax of a candidate instance and its surrounding context as features.

1.2 Research Objectives and Contributions
Our research aims to develop new IE methods that use classification techniques to extract target entities, without using pattern templates and extraction patterns. Among the different types of IE tasks, we have chosen to address the template element extraction (TE) task, which refers to extracting entities or instances in a free text that belong to some semantic categories². We apply our new IE method to free documents in the terrorism domain. In the terrorism domain, the semantic categories that are interesting include victim, perpetrator, witness, etc. In the following, we summarize our main research contributions.

– IE using Support Vector Machines (SVM): We have successfully transformed IE into a classification problem and adopted SVM to extract target entities. We have not come across any previous papers reporting such an IE approach. As an early exploratory research effort, we only try to extract the entities falling under the perpetrator role. Our proposed IE method, nevertheless, can be easily generalized to extract other types of entities.
– Feature selection: We have defined the content and context features that can be derived for the entities to be extracted/classified. The content features refer to words found in the entities. The context features refer to those derived from the sentence constituents surrounding the entities. In particular, we propose a feature weighting scheme to derive context features for a given entity.
– Performance evaluation: We have conducted experiments on the MUC text collection in the terrorism domain. In our preliminary experiments, the SVM approach to IE has been shown to deliver performance comparable to the published results of AutoSlog, a well-known extraction pattern-based IE system.

1.3 Paper Outline
The rest of the paper is structured as follows. Section 2 provides a survey of the related IE work and distinguishes our work from it. Section 3 defines our IE problem and the performance measures. Our proposed method is described in Section 4. The experimental results are given in Section 5. Section 6 concludes the paper.

² The template element extraction (TE) task has been defined in the Message Understanding Conference (MUC) series sponsored by DARPA [8].
2 Related Work
As our research deals with IE for free text collections, we only examine related work in this area. Broadly, the related work can be divided into extraction pattern-based and non-extraction pattern-based approaches. The former refers to approaches that first acquire a set of extraction patterns from the training text collections. The extraction patterns use the syntactic structure of a sentence and semantic knowledge of words to identify the target entities. The extraction process is very much a template matching task between the extraction patterns and the sentences. The non-extraction pattern-based approaches are those that use some machine learning techniques to acquire extraction models. The extraction models identify target entities by examining their feature mix, which includes features based on syntactics, semantics, and others. The extraction process is very much a classification task that involves accepting or rejecting an entity (e.g., a word or phrase) as a target entity.

Many extraction pattern-based IE approaches have been proposed in the Message Understanding Conference (MUC) series. Based on 13 pre-defined pattern templates, Riloff developed the AutoSlog system capable of learning extraction patterns [12]. Each extraction pattern consists of a trigger word (a verb or a noun) to activate its use. AutoSlog also requires a manual filtering step to discard some 74% of the learned extraction patterns as they may not be relevant. PALKA is another representative IE system that learns extraction patterns in the form of frame-phrasal pattern structures [7]. It requires each sentence to be first parsed and grouped into multiple simple clauses before deriving the extraction patterns. Both PALKA and AutoSlog require the training text collections to be tagged; such tagging requires much manual effort. AutoSlog-TS, an improved version of AutoSlog, is able to generate extraction patterns without a tagged training dataset [11]. An overall F1 measure of 0.38 was reported for both AutoSlog and AutoSlog-TS for the entities in the perpetrator category, and around 0.45 for the victim and target object categories in the MUC-4 text collection (terrorism domain). Riloff also demonstrated that the best extraction patterns can be further selected using a bootstrapping technique [13]. WHISK is an IE system that uses extraction patterns in the form of regular expressions. Each regular expression can extract either a single target entity or multiple target entities [15]. WHISK has been experimented with on a text collection in the management succession domain. SRV, another IE system, constructs first-order logical formulas as extraction patterns [3]. The extraction patterns also allow relational structures between target entities to be expressed.

There has been very little IE research on non-extraction pattern-based approaches. Freitag and McCallum developed an IE method based on Hidden Markov Models (HMMs), a kind of probabilistic finite state machine [4]. Their experiments showed that the HMM method outperformed the IE method using SRV for two text collections in the seminar announcements and corporate acquisitions domains.
TST1-MUC3-0002 SAN SALVADOR, 18 FEB 90 (DPA) -- [TEXT] HEAVY FIGHTING WITH AIR SUPPORT RAGED LAST NIGHT IN NORTHWESTERN SAN SALVADOR WHEN MEMBERS OF THE FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN] ATTACKED AN ELECTRIC POWER SUBSTATION. ACCORDING TO PRELIMINARY REPORTS, A SOLDIER GUARDING THE SUBSTATION WAS WOUNDED. THE FIRST EXPLOSIONS BEGAN AT 2330 [0530 GMT] AND CONTINUED UNTIL EARLY THIS MORNING, WHEN GOVERNMENT TROOPS REQUESTED AIR SUPPORT AND THE GUERRILLAS WITHDREW TO THE SLOPES OF THE SAN SALVADOR VOLCANO, WHERE THEY ARE NOW BEING PURSUED. THE NOISE FROM THE ARTILLERY FIRE AND HELICOPTER GUNSHIPS WAS HEARD THROUGHOUT THE CAPITAL AND ITS OUTSKIRTS, ESPECIALLY IN THE CROWDED NEIGHBORHOODS OF NORTHERN AND NORTHWESTERN SAN SALVADOR, SUCH AS MIRALVALLE, SATELITE, MONTEBELLO, AND SAN RAMON. SOME EXPLOSIONS COULD STILL BE HEARD THIS MORNING. MEANWHILE, IT WAS REPORTED THAT THE CITIES OF SAN MIGUEL AND USULUTAN, THE LARGEST CITIES IN EASTERN EL SALVADOR, HAVE NO ELECTRICITY BECAUSE OF GUERRILLA SABOTAGE ACTIVITY.
Fig. 1. Example Newswire Document
Research on applying machine learning techniques to named-entity extraction, a subproblem of information extraction, has been reported in [1]. Baluja et al. proposed the use of four different types of features to represent an entity to be extracted: word-level features, dictionary features, part-of-speech tag features, and punctuation features (surrounding the entity to be extracted). Except for the last feature type, the other three types of features are derived from the entities to be extracted.

To the best of our knowledge, our research is the first that explores the use of classification techniques in extracting terrorism-related information. Unlike [4], we represent each entity to be extracted as a set of features derived from the syntactic structure of the sentence in which the entity is found, as well as the words found in the entity.
3 Problem Definition
Our IE task is similar to the template element (TE) task in the Message Understanding Conference (MUC) series. The TE task was to extract different types of target entities from each document, including perpetrators, victims, physical targets, event locations, etc. In MUC-4, a text collection containing newswire documents related to terrorist events in Latin America was used as the evaluation dataset. An example document is shown in Figure 1.

From the above document, we could extract several interesting entities about the terrorist event, namely the location (“SAN SALVADOR”), the perpetrator (“MEMBERS OF THE FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN]”), and the victim (“SOLDIER”). The MUC-4 text collection consists of a training set (with 1500 documents) and two test sets (each with 100 documents). For each document, MUC-4 specifies for each semantic category the target entity(ies) to be extracted.
In this paper, we choose to focus on extracting target entities in the perpetrator category. The input of our IE method consists of the training set (1500 documents) and the perpetrator(s) of each training document. The training documents are not tagged with the perpetrators. Instead, the perpetrators are stored in a separate file known as the answer key file. Our IE method therefore has to locate the perpetrators within the corresponding documents. Should a perpetrator appear in multiple sentences in a document, his or her role may be obscured by features from these sentences, making it more difficult to perform extraction. Once trained, our IE method has to extract perpetrators from the test collections. As the test collections are not tagged with candidate entities, our IE method has to first identify candidate entities in the documents before classifying them.

The performance of our IE task is measured by three important metrics: precision, recall, and the F1 measure. Let n_tp, n_fp, and n_fn be the number of entities correctly extracted, the number of entities wrongly extracted, and the number of entities missed, respectively. Precision, recall, and the F1 measure are defined as follows:

Precision = n_tp / (n_tp + n_fp)

Recall = n_tp / (n_tp + n_fn)

F1 = (2 · Precision · Recall) / (Precision + Recall)
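The three measures follow directly from the counts. Below is a minimal sketch (our illustration, not the authors' code); the example counts are hypothetical ones chosen to roughly reproduce the Tst3 precision/recall reported in Section 5:

```python
def precision_recall_f1(n_tp, n_fp, n_fn):
    """Precision, recall, and F1 measure from extraction counts:
    n_tp correct, n_fp wrong, and n_fn missed extractions."""
    precision = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 51 correct, 116 wrong, 66 missed (hypothetical counts)
print(precision_recall_f1(51, 116, 66))  # approx. (0.3054, 0.4359, 0.3592)
```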
4 Proposed Method

4.1 Overview
Like other IE methods, we divide our proposed IE method into two steps: the learning step and the extraction step. The former learns the extraction model for the target entities in the desired semantic category using the training documents and their target entities. The latter applies the learnt extraction model to other documents and extracts new target entities. The learning step consists of the following smaller steps.

1. Document parsing: As the target entities are perpetrators, they usually appear as noun-phrases in the documents. We therefore parse all the sentences in the document. To break a document into sentences, we use the SATZ software [10]. As a noun-phrase could be nested within another noun-phrase in the parse tree, we only select the simple noun-phrases as candidate entities. The candidate entities from the training documents are further grouped as positive entities if their corresponding noun-phrases match the perpetrator answer keys. The rest are used as negative entities.
2. Feature acquisition: This step refers to deriving features for the training target entities, i.e., the noun-phrases. We elaborate on this step in Section 4.2.
3. Extraction model construction: This step refers to constructing the extraction model using some machine learning technique. In this paper, we explore the use of SVM to construct the extraction model (or classification model).

The classification step performs extraction using the learnt extraction model, following the steps below:

1. Document parsing: The sentences in every test document are parsed, and the simple noun phrases in the parse trees are used as candidate entities.
2. Feature acquisition: This step is similar to that in the learning step.
3. Classification: This step applies the SVM classifier to extract the candidate entities.

By identifying all the noun-phrases and classifying them into positive entities or negative entities, we transform the IE problem into a classification problem. To keep our method simple, we do not use co-referencing to identify pronouns that refer to the positive or negative entities.

4.2 Feature Acquisition
We acquire for each candidate entity the features required for constructing the extraction model and for classification. To ensure that the extraction model will be able to distinguish whether entities belong to a semantic category, it is necessary to acquire a wide spectrum of features. Unlike the earlier work that focuses on features mainly derived from within the entities [1] or from the linear sequence of words surrounding the entities [4], our method derives features from the syntactic structures of the sentences in which the candidate entities are found. We divide the entity features into two categories:

– Content features: These refer to the features derived from the candidate entities themselves. At present, we only consider terms appearing in the candidate entities. Given an entity e = w_1 w_2 ... w_n, we assign the content feature f_e(w) = 1 if word w is found in e.
– Context features: These features are obtained by first parsing the sentences containing a candidate entity. Each context feature is defined by a fragment of the syntactic structure in which the entity is found and the words associated with the fragment.

In the following, we elaborate on the way our context features are obtained. We first use the CMU Link Grammar Parser to parse a sentence [14]. The parser generates a parse tree such as the one shown in Figure 2. A parse tree represents the syntactic structure of a given sentence. Its leaf nodes are the word tokens of the sentence, and its internal nodes represent the syntactic constituents of the sentence. The possible syntactic constituents are S (clause), VP (verb phrase), NP (noun phrase), PP (prepositional phrase), etc.
(S (NP Two terrorists)
   (VP (VP destroyed
           (NP several power poles)
           (PP on (NP 29th street)))
       and
       (VP machinegunned (NP several transformers)))
   .)

Fig. 2. Parse Tree Example
For each candidate entity, we can derive its context features as a vector of term weights for the terms that appear in the sentences containing the noun-phrase. Given a sentence parse tree, the weight of a term is assigned as follows. Terms appearing in the sibling nodes are assigned weights of 1.0. Terms appearing at a higher or lower level of the parse tree are assigned smaller weights, as they are further away from the candidate entity. In our experiments, the feature weights are reduced by half for every level further away from the candidate entity. This 50% reduction factor has been chosen arbitrarily; a careful study needs to be conducted to determine the optimal reduction factor. For example, the context features of the candidate entity “several power poles” are derived as follows³.

Table 1. Context features and feature weights for “several power poles”

Label  Terms           Weight
PP     on              1.00
NP     29th street     0.50
VP     destroyed       0.50
NP     Two terrorists  0.25
To ensure that the included context features are closely related to the candidate entity, we do not consider terms found in the sibling nodes (and their subtrees) of the ancestor(s) of the entity. Intuitively, these terms are not syntactically very related to the candidate entity and are therefore excluded. For example, for the candidate entity “several power poles”, the terms in the subtree “and machinegunned several transformers” are excluded from the context feature set.

³ More precisely, stopword removal and stemming are performed on the terms. Some of the terms will be discarded during this process.
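The level-by-level halving can be sketched in code. The nested-tuple tree encoding below is our own, and the weights follow a simplified depth-difference reading of the scheme (the exclusion of ancestor-sibling subtrees and the grouping by constituent used in Table 1 are omitted for brevity), so this is an illustration rather than an exact reproduction of the authors' procedure:

```python
def context_weights(tree, target, decay=0.5):
    """Weight each context term by decay ** (levels away from the candidate
    entity). `tree` is a nested tuple whose first element is the constituent
    label (S, VP, NP, PP, ...); leaves are word strings; `target` is the
    subtree of the candidate entity."""
    leaves, target_depth = [], [None]

    def walk(node, depth):
        if node is target:
            target_depth[0] = depth        # skip the entity's own words
        elif isinstance(node, str):
            leaves.append((node, depth))
        else:
            for child in node[1:]:         # node[0] is the constituent label
                walk(child, depth + 1)

    walk(tree, 0)
    return {term: decay ** abs(depth - target_depth[0])
            for term, depth in leaves}

# The parse tree of Fig. 2, encoded as nested tuples
sentence = ('S',
            ('NP', 'Two', 'terrorists'),
            ('VP',
             ('VP', 'destroyed',
              ('NP', 'several', 'power', 'poles'),
              ('PP', 'on', ('NP', '29th', 'street'))),
             'and',
             ('VP', 'machinegunned', ('NP', 'several', 'transformers'))))
entity = sentence[2][1][2]                 # the NP "several power poles"
print(context_weights(sentence, entity))   # e.g. 'destroyed' -> 1.0, '29th' -> 0.25
```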
If an entity appears in multiple sentences in the same document, and the same term is included as a context feature from different parse trees, we combine the context features into one and assign it the highest weight among the original weights. This is necessary to keep one unique weight for each term.

4.3 Extraction Model Construction
To construct an extraction model, we require both positive training data and negative training data. While the positive training entities are available from the answer key file, the negative training entities can be obtained from the noun phrases that do not contain any target entities. Since pronouns such as “he”, “she”, “they”, etc. may possibly be co-referenced with some target entities, we do not use them as either positive or negative training entities. From the training set, we also obtain an entity filter dictionary that consists of noun-phrases that cannot be perpetrators. These are non-target noun-phrases that appear more than five times in the training set, e.g., “dictionary”, “desk” and “tree”. With this filter, the number of negative entities is reduced dramatically. If a larger threshold were used, fewer noun-phrases would be filtered, causing a degradation of precision. On the other hand, a smaller threshold may increase the risk of a lower recall.

Once an extraction model is constructed, it can perform extraction on a given document by classifying candidate entities in the document into the perpetrator or non-perpetrator category. In the extraction step, a candidate entity is classified as a perpetrator when the SVM classifier returns a positive score value.
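The preparation of training entities and the filter dictionary can be summarized as follows. This is a hedged sketch with hypothetical inputs (lower-cased noun-phrase strings and a set of answer-key phrases); the pronoun list and function names are ours, not the paper's:

```python
from collections import Counter

PRONOUNS = {'he', 'she', 'they', 'him', 'her', 'them'}   # illustrative list

def build_filter_dictionary(noun_phrases, answer_keys, min_count=5):
    """Non-target noun-phrases seen more than `min_count` times in the
    training set can never be perpetrators and are filtered out."""
    counts = Counter(np for np in noun_phrases if np not in answer_keys)
    return {np for np, n in counts.items() if n > min_count}

def training_entities(noun_phrases, answer_keys, filter_dict):
    """Split candidates into positive/negative training entities, dropping
    pronouns (possible co-references) and filtered noun-phrases."""
    pos, neg = [], []
    for np in noun_phrases:
        if np in PRONOUNS or np in filter_dict:
            continue
        (pos if np in answer_keys else neg).append(np)
    return pos, neg
```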
5 Experiments and Results

5.1 Datasets
We used the MUC-4 dataset in our experiments. Three files (muc34dev, muc34tst1 and muc34tst2) were used as the training set, and the remaining two files (muc34tst3 and muc34tst4) were used as the test set. There are in total 1500 news documents in the training set and 100 documents in each of the two test files. For each news document, there are zero, one or two perpetrators defined in the answer key file. Therefore, most of the noun phrases are negative candidate entities. To avoid severely unbalanced training examples, we only considered the training documents that have at least one perpetrator defined in the answer key files. There are 466 training documents containing some perpetrators. We used all the 100 news documents in each test set, since the classifier should not know whether a test document contains a perpetrator. The number of documents used and the numbers of positive and negative entities for the training and test sets are listed in Table 2. From the table, we observe that negative entities contribute about 90% of the entities of the training set, and around 95% of the test set.

Table 2. Documents and positive/negative entities in the training/test datasets

Dataset  Documents  Positive Entities  Negative Entities
Train    466        1003               9435
Tst3     100        117                2336
Tst4     100        77                 1943

5.2 Results
We used SVMlight as our classifier in our experiments [6]. SVMlight is an implementation of Support Vector Machines (SVMs) in C and has been widely used in text classification and web classification research. Due to the unbalanced training examples, we set the cost-factor (parameter j) of SVMlight to be the ratio of the number of negative entities to the number of positive ones. The cost-factor denotes the proportion of cost allocated to training errors on positive entities against errors on negative entities. We used the polynomial kernel function instead of the default linear kernel function. We also set our threshold to be 0.0, as suggested. The results are reported in Table 3.

Table 3. Results on the training and test datasets

Dataset  Precision  Recall  F1 measure
Train    0.7752     0.9661  0.8602
Tst3     0.3054     0.4359  0.3592
Tst4     0.2360     0.5455  0.3295
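The authors ran SVMlight; the same setup can be approximated with scikit-learn, which is our substitution here and not the paper's code. The cost-factor j maps onto a per-class weight, and the counts come from Table 2:

```python
from sklearn.svm import SVC

n_pos, n_neg = 1003, 9435                 # training counts from Table 2
clf = SVC(kernel='poly',                  # polynomial instead of linear kernel
          class_weight={1: n_neg / n_pos, -1: 1.0})  # cost-factor j analogue

# After clf.fit(X_train, y_train), an entity is extracted when the decision
# score is positive, matching the paper's 0.0 threshold:
#     clf.decision_function(x_candidate) > 0.0
```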
As shown in the table, the SVM classifier performed very well on the training data, achieving both high precision and recall values. Nevertheless, the classifier did not perform equally well on the two test datasets. About 43% and 54% of the target entities were extracted for Tst3 and Tst4, respectively. The results also indicate that many non-target entities were extracted as well, causing the low precision values. The overall F1 measures are 0.36 and 0.33 for Tst3 and Tst4, respectively.

The above results are reasonable compared to the known results given in [11], as the latter also showed no more than 30% precision values for both AutoSlog and AutoSlog-TS⁴. [11] reported F1 measures of 0.38, which is not very different from ours. The rather low F1 measures suggest that this IE problem is quite a difficult one. We are, nevertheless, quite optimistic about our preliminary results, as they clearly show that the IE problem can be handled as a classification problem.
⁴ The comparison cannot be taken in absolute terms since [11] used a slightly different experimental setup for the MUC-4 dataset.
6 Conclusions
In this paper, we attempt to extract perpetrator entities from a collection of untagged news documents in the terrorism domain. We propose a classification-based method to handle the IE problem. The method segments each document into sentences, parses the sentences into parse trees, and derives features for the entities within the documents. The features of each entity are derived from both its content and its context. Based on SVM classifiers, our method was applied to the MUC-4 dataset. Our experimental results showed that the method performs at a level comparable to some well-known published results.

As part of our future work, we would like to continue our preliminary work and explore additional features in training the SVM classifiers. Since the number of training entities is usually small in real applications, we will also try to extend our classification-based method to handle IE problems with a small number of seed training entities.
References

1. S. Baluja, V. Mittal, and R. Sukthankar. Applying machine learning for high performance named-entity extraction. Computational Intelligence, 16(4):586–595, November 2000.
2. S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148–155, Bethesda, Maryland, November 1998.
3. D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the 15th Conference on Artificial Intelligence (AAAI-98) / 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pages 517–523, Madison, Wisconsin, July 1998.
4. D. Freitag and A. K. McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, Orlando, Florida, July 1999.
5. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142, Chemnitz, Germany, 1998.
6. T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning. MIT Press, 1999.
7. J.-T. Kim and D. I. Moldovan. Acquisition of linguistic patterns for knowledge-based information extraction. IEEE Transactions on Knowledge and Data Engineering, 7(5):713–724, 1995.
8. MUC. Proceedings of the 4th Message Understanding Conference (MUC-4), 1992.
9. I. Muslea. Extraction patterns for information extraction tasks: A survey. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 1–6, Orlando, Florida, July 1999.
10. D. D. Palmer and M. A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the 4th Conference on Applied Natural Language Processing, pages 78–83, Stuttgart, Germany, October 1994.
11. E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, Portland, Oregon, 1996.
12. E. Riloff. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85(1-2):101–134, 1996.
13. E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence, pages 1044–1049, 1999.
14. D. Sleator and D. Temperley. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Computer Science, Carnegie Mellon University, October 1991.
15. S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
16. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, Heidelberg, Germany, 1995.
Criminal Incident Data Association Using the OLAP Technology

Song Lin and Donald E. Brown

Department of Systems and Information Engineering, University of Virginia, VA 22904, USA
{sl7h, brown}@virginia.edu
Abstract. Associating criminal incidents committed by the same person is important in crime analysis. In this paper, we introduce concepts from OLAP (online analytical processing) and data mining to resolve this issue. The criminal incidents are modeled into an OLAP data cube; a measurement function, called the outlier score function, is defined on the cube cells. When the score is significant enough, we say that the incidents contained in the cell are associated with each other. The method can be used with a variety of criminal incident features, including the locations of the crimes for spatial analysis. We applied this association method to the robbery dataset of Richmond, Virginia. Results show that this method can effectively solve the problem of criminal incident association.

Keywords. Criminal incident association, OLAP, outlier
1 Introduction

Over the last two decades, computer technologies have developed at an exceptional rate and become an important part of our life. Consequently, information technology now plays an important role in the law enforcement community. Police officers and crime analysts can access much larger amounts of data than ever before. In addition, various statistical methods and data mining approaches have been introduced into the crime analysis field. Crime analysis personnel are capable of performing complicated analyses more efficiently.

People committing multiple crimes, known as serial criminals or career criminals, are a major threat in modern society. Understanding the behavioral patterns of these career criminals and apprehending them is an important task for law enforcement officers. As a first step, identifying criminal incidents committed by the same person and linking them together is of major importance for crime analysts.

According to the rational choice theory [5] in criminology, a criminal evaluates the benefit and the risk of committing an incident and makes a “rational” choice to maximize the “profit”. In the routine activity theory [9], a criminal incident is considered the product of an interactive process of three key elements: a ready criminal, a suitable target, and the lack of effective guardians. Brantingham and Brantingham [2] claim that the environment sends out signals, or cues (physical, spatial, cultural, etc.), about its characteristics, and the criminal uses these cues to evaluate the target and make the decision. A criminal incident is usually an outcome
of a decision process involving a multi-staged search in the awareness space. During the search phase, the criminal associates these cues, clusters of cues, or cue sequences with a “good” target. These cues form a template of the criminal, and once the template is built, it is self-reinforcing and relatively enduring. Due to the limited searching ability of a human being, a criminal normally does not have many decision templates. Therefore, we can observe criminal incidents with similar temporal, spatial, and modus operandi (MO) features, which possibly come from the same template of the same criminal. It is possible to identify the serial criminal by associating these similar incidents.

Different approaches have been proposed, and several software programs developed, to resolve the crime association problem. They can be classified into two major categories: suspect association and incident association. The Integrated Criminal Apprehension Program (ICAP) developed by Heck [12] enables police officers to match suspects with arrested criminals using MO features; the Armed Robbery Eidetic Suspect Typing (AREST) program [1] employs an expert approach to perform suspect association and classify a potential offender into three categories: probable, possible, or non-suspect. The Violent Criminal Apprehension Program (ViCAP) developed by the Federal Bureau of Investigation (FBI) [13] is an incident association system; MO features are primarily considered in ViCAP. In the COPLINK project [10] undertaken by researchers at the University of Arizona, a novel concept space model is built that can be used to associate search terms with suspects in the database. A total similarity method was proposed by Brown and Hagen [3], which can solve problems of both incident association and suspect association. Besides these theoretical methods, crime analysts normally use SQL (Structured Query Language) in practice: they build a SQL query string and have the system return all records that match their search criteria.

In this paper, we describe a crime association method that combines both OLAP concepts from the data warehousing area and outlier detection ideas from the data mining field. Before presenting our method, let us briefly review some concepts in OLAP and data mining.
2 Brief Review of OLAP and OLAP-Based Data Mining

OLAP is a key aspect of many data warehousing systems [6]. Unlike its ancestor, OLTP (online transaction processing), OLAP focuses on providing summary information to the decision-makers of an organization. Aggregated data, such as sums, averages, maxima, or minima, are pre-calculated and stored in a multi-dimensional database called a data cube. Each dimension of the data cube consists of one or more categorical attributes. Hierarchical structures generally exist in the dimensions. Most existing OLAP systems concentrate on the efficiency of retrieving the summary data in the cube. In many cases, the decision-maker still needs to apply his or her domain knowledge, and sometimes common sense, to make the final decision.

Data mining is a collection of techniques that detect patterns in large amounts of data. Quantitative approaches, including statistical methods, are generally used in data mining. Traditionally, data mining algorithms are developed for two-way datasets. More recently, researchers have generalized some data mining methods for multi-dimensional OLAP data structures. Imielinski et al. proposed the “cubegrade” problem [14]. The cubegrade problem can be treated as a generalized version of the association rule. Imielinski et al. claim that the association rule can be viewed as the change of count aggregates when imposing another constraint or, in OLAP terminology, making a drill-down operation on an existing cube cell. They think that other aggregates like sum, average, max, or min can also be incorporated, and that the cubegrade could better support “what if” analysis. Similar to the cubegrade problem, the constrained gradient analysis was proposed by Dong et al. [7]. The constrained gradient analysis focuses on retrieving pairs of OLAP cubes that are quite different in aggregates and similar in dimensions (usually one cell is the ascendant, descendant, or sibling of the other cell). More than one aggregate can be considered simultaneously in the constrained gradient analysis. The discovery-driven exploration problem was proposed by Sarawagi et al. [18]. It aims at finding exceptions in the cube cells. They build a formula to estimate the anticipated value and the standard deviation (σ) of a cell. When the difference between the actual value of the cell and the anticipated value is greater than 2.5σ, the cell is selected as an exception.

Similar to the above approaches, our crime association method also focuses on the cells of the OLAP data cube. We define an outlier score function to measure the distinctiveness of a cell. Incidents contained in the same cell are determined to be associated with each other when the score is significant. The definition of the outlier score function and the association method is given in Section 3.
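To make the 2.5σ exception criterion concrete, here is a toy sketch; the anticipated value below is simply the grand mean of the cube, a stand-in of ours for the more elaborate model fitted by Sarawagi et al. [18]:

```python
import numpy as np

def exceptions(cube, threshold=2.5):
    """Flag cube cells whose actual value deviates from an anticipated
    value by more than `threshold` standard deviations."""
    residual = cube - cube.mean()          # grand mean as anticipated value
    return np.argwhere(np.abs(residual) > threshold * residual.std())

counts = np.array([[40, 42, 41],
                   [39, 43, 40],
                   [41, 40, 90]])
print(exceptions(counts))                  # -> [[2 2]]: the cell holding 90
```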
3 Method

3.1 Rationale

The rationale of this method is explained as follows: although theoretically the template (see Section 1) is unique for each serial criminal, the data collected by the police department does not contain every aspect of the template. Some observed parts of the templates are “common”, so we may see a large overlap in these common templates. The creators (criminals) of those “common” templates are not separable. Some templates are “special”. For these “special” templates, we are more confident in saying that the incidents come from the same criminal. For example, consider the weapon used in a robbery incident. We may observe many incidents with the value “gun” for the weapon used. However, no crime analyst would say that the same person committed all these robberies, because “gun” is a common template shared by many criminals. If we observe several robberies with a “Japanese sword” – an uncommon template – we are more confident in asserting that these incidents result from the same criminal. (This “Japanese sword” claim was first proposed by Brown and Hagen [4].) In this paper, we describe an outlier score function to measure this distinctiveness of the template.
3.2 Definitions

In this section, we give the mathematical definitions used to build the outlier score function. Readers familiar with OLAP concepts will see that our notation derives from terms used in the OLAP field. Let $A_1, A_2, \ldots, A_m$ be the $m$ attributes we consider relevant to our study, and $D_1, D_2, \ldots, D_m$ their respective domains. Currently, these attributes are confined to be categorical (categorical attributes such as MO are important in crime association analysis). Let $z^{(i)}$ be the $i$-th incident, and $z^{(i)}.A_j$ the value of the $j$-th attribute of incident $i$; $z^{(i)}$ can be represented as $z^{(i)} = (z_1^{(i)}, z_2^{(i)}, \ldots, z_m^{(i)})$, where $z_k^{(i)} = z^{(i)}.A_k \in D_k$, $k \in \{1, \ldots, m\}$. $Z$ is the set of all incidents.

Definition 1. Cell
A cell $c$ is a vector of attribute values with dimension $t$, where $t \le m$; it can be represented as $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$. In order to standardize the definition of a cell, for each $D_i$ we add a "wildcard" element $*$ and allow $D'_i = D_i \cup \{*\}$. A cell $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$ can then be represented as $c = (c_1, c_2, \ldots, c_m)$, where $c_j \in D'_j$ and $c_j = *$ if and only if $j \notin \{i_1, i_2, \ldots, i_t\}$. $C$ denotes the set of all cells. Since each incident can also be treated as a cell, we define a function $\mathrm{Cell}: Z \to C$ with $\mathrm{Cell}(z) = (z_1, z_2, \ldots, z_m)$ for $z = (z_1, z_2, \ldots, z_m)$.

Definition 2. Contains relation
We say that cell $c = (c_1, c_2, \ldots, c_m)$ contains incident $z$ if and only if $z.A_j = c_j$ or $c_j = *$, $j = 1, 2, \ldots, m$. For two cells, we say that cell $c' = (c'_1, c'_2, \ldots, c'_m)$ contains cell $c = (c_1, c_2, \ldots, c_m)$ if and only if $c'_j = c_j$ or $c'_j = *$, $j = 1, 2, \ldots, m$.

Definition 3. Count of a cell
The function count is defined on a cell; it returns the number of incidents that cell $c$ contains.

Definition 4. Parent cell
Cell $c' = (c'_1, c'_2, \ldots, c'_m)$ is the parent cell of cell $c$ on the $k$-th attribute when $c'_k = *$ and $c'_j = c_j$ for $j \ne k$. The function $\mathrm{parent}(c, k)$ returns the parent cell of cell $c$ on the $k$-th attribute.
Definition 5. Neighborhood
$P$ is called the neighborhood of cell $c$ on the $k$-th attribute when $P$ is the set of cells that take the same values as cell $c$ on all attributes but the $k$-th, and do not take the wildcard value $*$ on the $k$-th attribute; i.e., $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$ where $c_l^{(i)} = c_l^{(j)}$ for all $l \ne k$, and $c_k^{(i)} \ne *$ for all $i = 1, 2, \ldots, |P|$. The function $\mathrm{neighbor}(c, k)$ returns the neighborhood of cell $c$ on attribute $k$. (In the OLAP field, the neighborhood is sometimes called the siblings.)

Definition 6. Relative frequency
We call $\mathrm{freq}(c, k) = \mathrm{count}(c) / \mathrm{count}(\mathrm{parent}(c, k))$ the relative frequency of cell $c$ with respect to attribute $k$.

Definition 7. Uncertainty function
We use a function $U$ to measure the uncertainty of a neighborhood. This uncertainty measure is defined on the relative frequencies. If we use $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$ to denote the neighborhood of cell $c$ on attribute $k$, then

$U(c, k) = U(\mathrm{freq}(c^{(1)}, k), \mathrm{freq}(c^{(2)}, k), \ldots, \mathrm{freq}(c^{(|P|)}, k))$

Obviously, $U$ should be symmetric in $c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}$. $U$ takes a smaller value if the "uncertainty" in the neighborhood is low. One candidate uncertainty function is entropy, which comes from information theory:
$U(c, k) = H(c, k) = -\sum_{c' \in \mathrm{neighbor}(c, k)} \mathrm{freq}(c', k) \log(\mathrm{freq}(c', k))$

For freq = 0 we define $0 \cdot \log(0) = 0$, as is common in information theory.

3.3 Outlier Score Function (OSF) and the Crime Association Method

Our goal is to build a function that measures the confidence, or significance level, of associating crimes. This function is built over OLAP cube cells. We start building this function by analyzing the requirements that it needs to satisfy. Consider the following three scenarios:

I. We have 100 robberies; 5 take the value "Japanese sword" for the weapon used attribute, and 95 take "gun". Obviously, the 5 "Japanese sword" incidents are of more interest than the 95 "gun" incidents.

II. Now we add another attribute, method of escape, with 20 different values ("by car", "by foot", etc.), each of which covers 5 incidents. Although both "Japanese sword" and "by car" cover 5 incidents, they should not be treated equally: "Japanese sword" highlights itself because all the other incidents are "gun", or in other words, the uncertainty level of the weapon used attribute is lower.

III. If some incidents take "Japanese sword" on the weapon used attribute and "by car" on the method of escape attribute, then the combination of "Japanese sword" and "by car" is more significant than either "Japanese sword" alone or "by car" alone, because we have more "evidence".
Now we define the function $f$ as follows:

$f(c) = \begin{cases} \max\limits_{k \text{ over all non-}*\text{ dimensions of } c} \left( f(\mathrm{parent}(c, k)) + \dfrac{-\log(\mathrm{freq}(c, k))}{H(c, k)} \right), & c \ne (*, *, \ldots, *) \\ 0, & c = (*, *, \ldots, *) \end{cases} \qquad (1)$

When $H(c, k) = 0$, we define $\dfrac{-\log(\mathrm{freq}(c, k))}{H(c, k)} = 0$.

It is simple to verify that $f$ satisfies the above three requirements. We call $f$ the outlier score function. (The term "outlier" is commonly used in statistics: outliers are observations significantly different from other observations, possibly generated by a unique mechanism [11].) Based on the outlier score function, we give the following rule to associate criminal incidents: given a pair of incidents, if there exists a cell containing both incidents whose outlier score is greater than some threshold value τ, we say that the two incidents are associated with each other. This association method is called the OLAP-outlier-based association method, or the outlier-based method for short.
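To make the definitions concrete, the following Python sketch computes $f$ over a small incident list. It is illustrative only: the function names follow the definitions above, but the brute-force counting and exhaustive recursion stand in for the pre-aggregated OLAP cube the paper assumes.

```python
from math import log
from collections import Counter

def count(cell, incidents):
    """Number of incidents contained in `cell` (Definition 3)."""
    return sum(all(c == '*' or c == z[j] for j, c in enumerate(cell))
               for z in incidents)

def parent(cell, k):
    """Parent cell of `cell` on attribute k (Definition 4)."""
    return cell[:k] + ('*',) + cell[k + 1:]

def freq(cell, k, incidents):
    """Relative frequency of `cell` w.r.t. attribute k (Definition 6)."""
    return count(cell, incidents) / count(parent(cell, k), incidents)

def entropy(cell, k, incidents):
    """Entropy of the neighborhood of `cell` on attribute k (Definition 7)."""
    p = parent(cell, k)
    values = Counter(z[k] for z in incidents
                     if all(c == '*' or c == z[j] for j, c in enumerate(p)))
    total = sum(values.values())
    return -sum((n / total) * log(n / total) for n in values.values())

def osf(cell, incidents):
    """Outlier score function f (Equation 1), computed recursively."""
    if all(c == '*' for c in cell):
        return 0.0
    scores = []
    for k, c in enumerate(cell):
        if c == '*':
            continue
        h = entropy(cell, k, incidents)
        term = 0.0 if h == 0 else -log(freq(cell, k, incidents)) / h
        scores.append(osf(parent(cell, k), incidents) + term)
    return max(scores)

# Toy data echoing scenario I: 5 "Japanese sword" robberies vs. 95 "gun".
incidents = [('gun',)] * 95 + [('japanese_sword',)] * 5
print(osf(('japanese_sword',), incidents))  # high score: rare template
print(osf(('gun',), incidents))             # low score: common template
```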
4 Application

We applied this criminal incident association method to a real-world dataset containing information on robbery incidents that occurred in Richmond, Virginia in 1998. The dataset consisted of two parts: the incident dataset and the suspect dataset. The incident dataset had 1198 records storing the temporal, spatial, and MO information; the name (if known), height, and weight of the suspect were recorded in the suspect dataset. We applied our method to the incident dataset and used the suspect dataset for verification. Robbery was selected for two reasons: first, compared with violent crimes such as murder or sexual attack, serial robberies are more common; second, compared with breaking-and-entering crimes, more robbery incidents were "solved" (criminal arrested) or "partially solved" (the suspect's name is known). These two points made robbery favorable for evaluation purposes.
4.1 Attribute Selection

We used three types of attributes in our analysis. The first set consisted of MO features; MO is the primary consideration in crime association analysis, and 6 MO attributes were picked. The second set consisted of census attributes (the census data was obtained directly from the census CD held in the library of the University of Virginia). Census data represents the spatial characteristics of the location where the criminal incident occurred, and it might help to reveal the spatial aspect of the criminals' templates; for example, some criminals prefer to attack "high-income" areas. Lastly, we chose distance attributes: distances from the incident location to spatial landmarks such as a major highway or a church. Distance features are also important in analyzing criminals' behaviors. For example, a criminal might prefer to initiate an attack within a certain distance range of a major highway, so that the offense cannot be observed during the attack and he or she can leave the crime scene as soon as possible afterwards. There were a total of 5 distances. The names of all attributes and their descriptions are given in Appendix I. They have also been used in a previous study on predicting breaking-and-entering crimes by Brown et al. [4].

An attribute selection was performed on all numerical attributes (census and distance attributes) before using the association method, because some attributes were redundant, and redundant attributes were unfavorable to the association algorithm in terms of both accuracy and efficiency. We adopted a feature-selection-by-clustering methodology: we used the correlation coefficient to measure how similar or close two attributes were, and then clustered the attributes into a number of groups according to this similarity measure. The attributes in the same group were similar to each other and quite different from attributes in other groups. For each group, we picked a representative; the final set of representative attributes was considered to capture the major characteristics of the dataset. A similar methodology was used by Mitra et al. [16]. We picked the k-medoid clustering algorithm (for more details about the k-medoid algorithm and other clustering algorithms, see [8]) because the k-medoid method works on a similarity/distance matrix (some other methods only work on coordinate data), it tends to return spherical clusters, and it returns a medoid for each cluster, based upon which we could select the representative attributes. After making a few slight adjustments and checking the silhouette plot [15], we finally got three clusters, as given in Fig. 1. The algorithm returned three medoids: HUNT_DST (housing unit density), ENRL3_DST (public school enrollment density), and TRAN_PC (expenses on transportation: per capita). We made some adjustments here: we replaced ENRL3_DST with the related attribute POP3_DST (population density: age 12-17), an age group more directly related to attackers and victims, and for similar reasons we replaced TRAN_PC with MHINC (median household income).
Fig. 1. Result of k-medoid clustering
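A sketch of this feature-selection-by-clustering step, assuming the numeric census/distance attributes are collected in an array `X` (incidents × attributes); the greedy medoid update below is a simplified stand-in for the full k-medoid (PAM) algorithm:

```python
import numpy as np

def correlation_distance(X):
    # distance = 1 - |corr|: highly correlated attributes are "close"
    corr = np.corrcoef(X, rowvar=False)
    return 1.0 - np.abs(corr)

def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)      # assignment step
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]         # assumes non-empty
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]  # update step
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

D = correlation_distance(X)           # X: numeric census/distance attributes
medoids, labels = k_medoids(D, k=3)   # three clusters, as in Fig. 1
print("representative attributes:", medoids)
```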
There were a total of 9 attributes used in our analysis: the 6 MO attributes (categorical) and the 3 numerical attributes picked by the attribute selection procedure. Since our method was developed for categorical attributes, we converted the numerical attributes to categorical ones by dividing them into 11 equally sized bins, the number being determined by Sturges' rule for the number of bins [19], [20].
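As a quick sanity check (the paper does not show this calculation, and applying Sturges' rule to the 1198 incident records is an assumption), Sturges' rule k = 1 + log₂ n gives:

```python
import math

n = 1198                  # robbery incident records in the dataset
k = 1 + math.log2(n)      # Sturges' rule
print(round(k))           # -> 11 equally sized bins
```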
4.2 Evaluation Criteria

We wanted to evaluate whether the associations determined by our method corresponded to the true result, where the information in the suspect database was considered the "true result." 170 incidents with the names of the suspects were used for evaluation. We generated all incident pairs; if the two incidents in a pair had suspects with the same name and date of birth, we said that the "true result" for this incident pair was a "true association." There were 33 true associations. We used two measures to evaluate our method. The first measure is called "detected true associations": we expect the association method to detect a large portion of the "true associations." The second measure is called the "average number of relevant records." This measure is built on an analogy to a search engine such as Google: for each search string we submit, it returns a list of documents considered "relevant" to the search criterion. Similarly, for the crime association problem, given an incident, the algorithm returns a list of records considered "associated" with the given incident. A shorter list is preferred in both cases. The average "length" of these lists provides the second measure, which we call the "average number of relevant records." The algorithm is more accurate when this measure has a smaller value.
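A sketch of the two measures, assuming hypothetical helpers `truth` (the suspect identity per incident, or None when unknown) and `associated(i, j)` (the symmetric association method under test):

```python
from itertools import combinations

def evaluate(n_incidents, truth, associated):
    """Return (detected true associations, avg. number of relevant records)."""
    ids = range(n_incidents)
    detected = sum(
        1 for i, j in combinations(ids, 2)
        if truth[i] is not None and truth[i] == truth[j] and associated(i, j)
    )
    avg_relevant = sum(
        sum(associated(i, j) for j in ids if j != i) for i in ids
    ) / n_incidents
    return detected, avg_relevant

# With a trivially permissive method every pair is "associated", so the
# average list length is 169, matching the threshold-0 row of Table 1.
print(evaluate(170, [None] * 170, lambda i, j: True))  # (0, 169.0)
```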
In the information retrieval area [17], two commonly used criteria for evaluating a retrieval system are recall and precision: the former is the ability of a system to present relevant items, and the latter is the ability to present only the relevant items. Our first measure is a recall measure, and our second measure is equivalent to a precision measure. These two measures are not specific to our approach; they can be used to evaluate any association algorithm, and therefore allow us to compare the performance of different association methods.

4.3 Result and Comparison

Different threshold values were set to test our method. Obviously, if we set the threshold to 0, the method detects all "true associations" and the average number of relevant records is 169 (given 170 incidents for evaluation). If we set the threshold, τ, to infinity, the method returns 0 for both "detected true associations" and "average number of relevant records." As the threshold increases, we expect a decrease in both the number of detected true associations and the average number of relevant records. The results are given in Table 1.

Table 1. Result of outlier-based method
Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
1           32                           121.04
2           30                            62.54
3           23                            28.38
4           18                            13.96
5           16                             7.51
6            8                             4.25
7            2                             2.29
∞            0                             0.00
We compared this outlier-based method with a similarity-based crime association method proposed by Brown and Hagen [3]. Given a pair of incidents, the similarity-based method first calculates a similarity score for each attribute and then computes a total similarity score as the weighted average of the individual similarity scores; the total similarity score is used to determine whether the incidents are associated. Using the same evaluation criteria, the results of the similarity-based method are given in Table 2. Setting the average number of relevant records as the X-axis and the detected true associations as the Y-axis, the comparison is illustrated in Fig. 2. In Fig. 2, the outlier-based method lies above the similarity-based method in most cases: given the same "accuracy" (detected true associations) level, the outlier-based method returns fewer relevant records. Also, if we keep the number
Table 2. Result of similarity-based method

Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
0.5         33                           112.98
0.6         25                            80.05
0.7         15                            45.52
0.8          7                            19.38
0.9          0                             3.97
∞            0                             0.00
of relevant records (the average length of the returned list) the same for both methods, the outlier-based method is more accurate. The curve of the similarity-based method sits slightly above that of the outlier-based method only when the average number of relevant records is above 100; since the evaluation set contains 170 incidents, no crime analyst would consider investigating a returned set of over 100 incidents further. The outlier-based method is therefore generally more effective.
Fig. 2. Comparison: the outlier-based method vs. the similarity-based method (X-axis: avg. relevant records, 0-180; Y-axis: detected associations, 0-35; one curve per method: Similarity, Outlier)
5 Conclusion

In this paper, an OLAP-outlier-based method is introduced to solve the crime association problem. The criminal incidents are modeled as an OLAP cube, and an outlier score function is defined over the cube cells. The incidents contained in the
cell are determined to be associated with each other when the outlier score is large enough. The method was applied to a robbery dataset, and the results show that it can provide significant improvements for crime analysts who need to link incidents in large databases.
References

1. Badiru, A.B., Karasz, J.M., Holloway, B.T.: AREST: Armed Robbery Eidetic Suspect Typing Expert System. Journal of Police Science and Administration, 16, 210–216 (1988)
2. Brantingham, P.J., Brantingham, P.L.: Patterns in Crime. New York: Macmillan (1984)
3. Brown, D.E., Hagen, S.C.: Data Association Methods with Applications to Law Enforcement. Decision Support Systems, 34, 369–378 (2003)
4. Brown, D.E., Liu, H., Xue, Y.: Mining Preference from Spatial-temporal Data. Proc. of the First SIAM International Conference on Data Mining (2001)
5. Clarke, R.V., Cornish, D.B.: Modeling Offenders' Decisions: A Framework for Research and Policy. In: Tonry, M., Morris, N. (eds.): Crime and Justice: An Annual Review of Research, Vol. 6. University of Chicago Press (1985)
6. Chaudhuri, S., Dayal, U.: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26 (1997)
7. Dong, G., Han, J., Lam, J., Pei, J., Wang, K.: Mining Multi-Dimensional Constrained Gradients in Data Cubes. Proc. of the 27th VLDB Conference, Roma, Italy (2001)
8. Everitt, B.: Cluster Analysis. John Wiley & Sons, Inc. (1993)
9. Felson, M.: Routine Activities and Crime Prevention in the Developing Metropolis. Criminology, 25, 911–931 (1987)
10. Hauck, R., Atabakhsh, H., Onguasith, P., Gupta, H., Chen, H.: Using Coplink to Analyze Criminal-Justice Data. IEEE Computer, 35, 30–37 (2002)
11. Hawkins, D.: Identification of Outliers. Chapman and Hall, London (1980)
12. Heck, R.O.: Career Criminal Apprehension Program: Annual Report. Office of Criminal Justice Planning, Sacramento, CA (1991)
13. Icove, D.J.: Automated Crime Profiling. Law Enforcement Bulletin, 55, 27–30 (1986)
14. Imielinski, T., Khachiyan, L., Abdul-ghani, A.: Cubegrades: Generalizing Association Rules. Technical report, Dept. of Computer Science, Rutgers Univ., Aug. (2000)
15. Kaufman, L., Rousseeuw, P.: Finding Groups in Data. Wiley (1990)
16. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised Feature Selection Using Feature Similarity. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 301–312 (2002)
17. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill Book Company, New York (1983)
18. Sarawagi, S., Agrawal, R., Megiddo, N.: Discovery-Driven Exploration of OLAP Data Cubes. Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain (1998)
19. Scott, D.: Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York (1992)
20. Sturges, H.A.: The Choice of a Class Interval. Journal of the American Statistical Association, 21, 65–66 (1926)
Appendix I. Attributes used in the analysis

(a) MO attributes

Rsus_Acts    Actions taken by the suspects
R_Threats    Method used by the suspects to threaten the victim
R_Force      Actions that the suspects force the victim to do
Rvic_Loc     Location type of the victim when the robbery was committed
Method_Esc   Method of escaping the scene
Premise      Premise where the crime was committed

(b) Census attributes

General
POP_DST      Population density ("density" means the statistic is divided by the area)
HH_DST       Household density
FAM_DST      Family density
MALE_DST     Male population density
FEM_DST      Female population density

Race
RACE1_DST    White population density
RACE2_DST    Black population density
RACE3_DST    American Indian population density
RACE4_DST    Asian population density
RACE5_DST    Other population density
HISP_DST     Hispanic origin population density

Population Age
POP1_DST     Population density (0-5 years)
POP2_DST     Population density (6-11 years)
POP3_DST     Population density (12-17 years)
POP4_DST     Population density (18-24 years)
POP5_DST     Population density (25-34 years)
POP6_DST     Population density (35-44 years)
POP7_DST     Population density (45-54 years)
POP8_DST     Population density (55-64 years)
POP9_DST     Population density (65-74 years)
POP10_DST    Population density (over 75 years)

Householder Age
AGEH1_DST    Density: age of householder under 25 years
AGEH2_DST    Density: age of householder 25-34 years
AGEH3_DST    Density: age of householder 35-44 years
AGEH4_DST    Density: age of householder 45-54 years
AGEH5_DST    Density: age of householder 55-64 years
AGEH6_DST    Density: age of householder over 65 years

Household Size
PPH1_DST     Density: 1-person households
PPH2_DST     Density: 2-person households
PPH3_DST     Density: 3-5 person households
PPH6_DST     Density: 6 or more person households

Housing, misc.
HUNT_DST     Housing unit density
OCCHU_DST    Occupied housing unit density
VACHU_DST    Vacant housing unit density
MORT1_DST    Density: owner-occupied housing units with a mortgage
MORT2_DST    Density: owner-occupied housing units without a mortgage
COND1_DST    Density: owner-occupied condominiums
OWN_DST      Density: housing units occupied by the owner
RENT_DST     Density: housing units occupied by a renter

Housing Structure
HSTR1_DST    Density: occupied structures with 1 unit, detached
HSTR2_DST    Density: occupied structures with 1 unit, attached
HSTR3_DST    Density: occupied structures with 2 units
HSTR4_DST    Density: occupied structures with 3-9 units
HSTR6_DST    Density: occupied structures with 10+ units
HSTR9_DST    Density: occupied structures, trailer
HSTR10_DST   Density: occupied structures, other

Income
PCINC_97     Per capita income
MHINC_97     Median household income
AHINC_97     Average household income

School Enrollment
ENRL1_DST    School enrollment density: public preprimary
ENRL2_DST    School enrollment density: private preprimary
ENRL3_DST    School enrollment density: public school
ENRL4_DST    School enrollment density: private school
ENRL5_DST    School enrollment density: public college
ENRL6_DST    School enrollment density: private college
ENRL7_DST    School enrollment density: not enrolled in school

Work Force
CLS1_DST     Density: private for-profit wage and salary workers
CLS2_DST     Density: private non-profit wage and salary workers
CLS3_DST     Density: local government workers
CLS4_DST     Density: state government workers
CLS5_DST     Density: federal government workers
CLS6_DST     Density: self-employed workers
CLS7_DST     Density: unpaid family workers

Consumer Expenditures
ALC_TOB_PH   Expenses on alcohol and tobacco: per household
APPAREL_PH   Expenses on apparel: per household
EDU_PH       Expenses on education: per household
ET_PH        Expenses on entertainment: per household
FOOD_PH      Expenses on food: per household
MED_PH       Expenses on medicine and health: per household
HOUSING_PH   Expenses on housing: per household
PCARE_PH     Expenses on personal care: per household
REA_PH       Expenses on reading: per household
TRANS_PH     Expenses on transportation: per household
ALC_TOB_PC   Expenses on alcohol and tobacco: per capita
APPAREL_PC   Expenses on apparel: per capita
EDU_PC       Expenses on education: per capita
ET_PC        Expenses on entertainment: per capita
FOOD_PC      Expenses on food: per capita
MED_PC       Expenses on medicine and health: per capita
HOUSING_PC   Expenses on housing: per capita
PCARE_PC     Expenses on personal care: per capita
REA_PC       Expenses on reading: per capita
TRANS_PC     Expenses on transportation: per capita

(c) Distance attributes

D_Church     Distance to the nearest church
D_Hospital   Distance to the nearest hospital
D_Highway    Distance to the nearest highway
D_Park       Distance to the nearest park
D_School     Distance to the nearest school
Names: A New Frontier in Text Mining

Frankie Patman (1) and Paul Thompson (2)

(1) Language Analysis Systems, Inc., 2214 Rock Hill Rd., Herndon, VA 20170
[email protected]
(2) Institute for Security Technology Studies, Dartmouth College, Hanover, NH 03755
[email protected]

Abstract. Over the past 15 years the government has funded research in information extraction, with the goal of developing the technology to extract entities, events, and their interrelationships from free text for further analysis. A crucial component of linking entities across documents is the ability to recognize when different name strings are potential references to the same entity. Given the extraordinary range of variation international names can take when rendered in the Roman alphabet, this is a daunting task. This paper surveys existing technologies for name matching and for accomplishing pieces of the cross-document extraction and linking task. It proposes a direction for future work in which existing entity extraction, coreference, and database name matching technologies would be harnessed for cross-document coreference and linking capabilities. The extension of name variant matching to free text will add important text mining functionality for intelligence and security informatics toolkits.
1 Introduction

Database name matching technology has long been used in criminal investigations [1], counter-terrorism efforts [2], and in a wide variety of government processes, e.g., the processing of applications for visas. With this technology a name is compared to names contained in one or more databases to determine whether there is a match. Sometimes this matching operation may be a straightforward exact match, but often the process is more complicated. Two names may not match exactly for a wide variety of reasons and yet still refer to the same individual [3]. Often a name in a database comes from one field of a more complete database record. The values in other fields, e.g., social security number or address, can be used to help match names which are not exact matches: the context from the complete record helps the matching process. In this paper we propose the design of a system that would extend database name matching technology to the unstructured realm of free text. Over the past 15 or so years the federal government has funded research in information extraction, e.g., the Message Understanding Conferences [4], Tipster [5], and Automatic Content
Extraction [6]. The goal of this research has been to develop the technology to extract entities, events, and their interrelationships from free text so that the extracted entities and relationships can be stored in a relational database, or knowledge base, to be more readily analyzed. One subtask during the last few years of the Message Understanding Conference was the Named Entity Task, in which personal and company names, as well as other formatted information, were extracted from free text. The system proposed in this paper would extract personal and company names from free text for inclusion in a database, an information extraction template, or automatically marked-up XML text [7]. It would expand link analysis capabilities by taking into account a broader and more realistic view of the types of name variation found in texts from diverse sources. The sophisticated name matching algorithms currently available for matching names in databases are equally suited to matching name strings drawn from text. Analogous to the way in which the context of a full database record can assist in the name matching process, in the free text application the context of the full text of the document can be used not only to help identify and extract names, but also to match names, both within a single document and across multiple documents.
2 Database Name Matching

Name matching can be defined as the process of determining whether two name strings are instances of the same name. It is a component of entity matching but is distinct from that larger task, which in many cases requires more information than a name alone. Name matching serves to create a set of candidate names for further consideration: those that are variants of the query name. 'Al Jones', for example, is a legitimate variant of 'Alfred Jones,' 'Alan Jones,' and 'Albert Jones.' Different processes from those involved in name matching will often be required to equate entities, perhaps requiring relation to a particular place, organization, event, or numeric identifier. However, without a sufficient representation of a name (the set of variants of the name likely to occur in the data), different mentions of the same entity may not be recognized.

Matching names in databases has been a persistent and well-known problem for years [8]. In the context of the English-speaking world alone, where the predominant model for names is a given name, an optional middle name, and a surname of Anglo-Saxon or Western European origin, a name can have any number of variant forms, and any or all of these forms may turn up in database entries. For example, Alfred James Martin can also be A. J. Martin; Mary Douglas McConnell may also be Mary Douglas or Mary McConnell or Mary Douglas-McConnell; Jack Crowley and John Crowley may both refer to the same person; the surnames Laury and Lowrie can have the same pronunciation and may be confused when names are taken orally; jSmith is a common typographical error entered for the name Smith. These familiar types of name variation pose non-trivial difficulties for automatic name matching, and numerous systems have been devised to deal with them (see [3]).
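The nickname facet alone already defeats exact matching; a toy lookup illustrates the idea (the table entries are invented for illustration, not a real nickname authority file):

```python
# Toy nickname table: maps a short form to the given names it can stand for.
NICKNAMES = {"al": {"alfred", "alan", "albert"}, "jack": {"john"}}

def given_names_match(a: str, b: str) -> bool:
    a, b = a.lower(), b.lower()
    return a == b or b in NICKNAMES.get(a, set()) or a in NICKNAMES.get(b, set())

print(given_names_match("Al", "Alfred"))   # True
print(given_names_match("Jack", "John"))   # True
```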
The challenges to name matching are greatly increased when databases contain names from outside the Anglo-American context. Consider some common issues that arise with names from around the world.

In China or Korea, the surname comes first, before the given name. Some people may maintain this format in Western contexts, others may reverse the name order to fit the Western model, and still others may use either. The problem is compounded further if a Western given name is added, since there is no one place in the string of names where the additional name is required to appear. Ex: Yi Kyung Hee ~ Kyung Hee Yi ~ Kathy Yi Kyung Hee ~ Yi Kathy Kyung Hee ~ Kathy Kyung Hee Yi

In some Asian countries, such as Indonesia, many people have only one name; what appears to be a surname is actually the name of the father. Names are normally indexed by the given name. Ex: former Indonesian president Abdurrahman Wahid is Mr. Abdurrahman (Wahid being the name of his father).

A name from some places in the Arab world may have many components showing the bearer's lineage, and none of these is a family name. Any one of the name elements other than the given name can be dropped. Ex: Aziz Hamid Salim Sabah ~ Aziz Hamid ~ Aziz Sabah ~ Aziz

Hispanic names commonly have two surnames, but it is the first of these rather than the last that is the family name. The final surname (which is the mother's family name) may or may not be used. Ex: Jose Felipe Ortega Ballesteros ~ Jose Felipe Ortega, but is less likely to refer to the same person as Jose Felipe Ballesteros

There may be multiple standard systems for transliterating a name from a native script (e.g., Arabic, Chinese, Hangul, Cyrillic) into the Roman alphabet; individuals may make up their own Roman spelling on the fly, or database entry operators may spell an unfamiliar name according to their own understanding of how it sounds. Ex: Yi ~ Lee ~ I ~ Lie ~ Ee ~ Rhee

Names may contain various kinds of affixes, which may be conjoined to the rest of the name, separated from it by white space or hyphens, or dropped altogether. Ex: Abdalsharif ~ Abd al-Sharif ~ Abd-Al-Sharif ~ Abdal Sharif; al-Qaddafi ~ Qaddafi

Systems for overcoming name variation search problems typically incorporate one or more of (1) a non-culture-specific phonetic algorithm (like Soundex¹ or one of its refinements, e.g., [9]); (2) allowances for transposed, additional, or missing characters; (3) allowances for transposed, additional, or missing name elements and for initials and abbreviations; and (4) nickname recognition. See [10] for a recent example. Less commonly, culture-specific phonetic rules may be used. The most serious problem for name-matching software is the wide variety of naming conventions represented in modern databases, which reflects the multicultural composition of many societies. Name-matching algorithms tend to take a one-size-fits-all approach, either by underestimating the effects of cultural variation,
¹ Soundex, the most well-known algorithm for variant name searching in databases, is a phonetics-based system patented in 1918. It was devised for use in indexing the 1910 U.S. census data. The system groups consonants into sets of similar sounds (based on American names reported at the time) and assigns a common code to all names beginning with the same letter and sharing the same sequence of consonant groups. Soundex does not accommodate certain errors very well, and groups many highly dissimilar names under the same code. See [11].
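For reference, a minimal sketch of the standard American Soundex algorithm described in the footnote above (the census-bureau variant rules discussed in [11] differ in details):

```python
# Consonant groups -> digits 1..6; vowels separate runs, H and W do not.
CODES = {c: d for d, letters in enumerate(
    ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1) for c in letters}

def soundex(name: str) -> str:
    name = name.upper()
    digits = [CODES.get(name[0], 0)]        # first letter's code, for adjacency
    for ch in name[1:]:
        d = CODES.get(ch)
        if d is None:
            if ch not in "HW":
                digits.append(0)            # vowel: breaks a run of like codes
        elif d != digits[-1]:
            digits.append(d)
    code = "".join(str(d) for d in digits[1:] if d)
    return (name[0] + code + "000")[:4]

# The same consonant codes, but different first letters: Soundex misses
# the Arabic K/Q interchangeability discussed in the text.
print(soundex("Qaddafi"), soundex("Kadafi"))  # Q310 K310
```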
or by assuming that names in any particular data source will be homogeneous. This may give reasonable results for names that fit one model, but may perform very poorly with names that follow different conventions. In the area of spelling variation alone, which letters are considered variants of which others differs from one culture to the next. In transcribed Arabic names, for example, the letters "K" and "Q" can be used interchangeably: "Qadafi" and "Kadafi" are variants of the same name. This is not the case in Chinese transcriptions, however, where "Kuan" and "Quan" are most likely to be entirely different names. What constitutes similarity between two name strings depends on the culture of origin of the names, and typically this must be determined on a case-by-case basis rather than across an entire data set.

Language Analysis Systems, Inc. (LAS) has implemented a number of approaches to coping with the wide array of multi-cultural name forms found in databases. Names are first submitted to an automatic analysis process, which determines the most likely cultural/linguistic origin of the name (or, at the discretion of the user, the culture of origin can be manually chosen). Based on this determination, an appropriate algorithm or set of rules is applied to the matching process. LAS technologies include culturally sensitive search systems and processes for generating variants of names, among others. Some of the LAS technologies are briefly discussed below.

Automatic Name Analysis: The name analysis system (NameClassifier™) contains a knowledge base of information about name strings from various cultures. An input name is compared to what is known about name strings from each of the included cultures, and the probability of the name's being derived from each of the cultures is computed. The culture with the highest score is assigned to the input name. The culture assignment is then used by other technologies to determine the most appropriate name-matching strategy.

NameVariantGenerator™: Name variant generation produces orthographic and syntactic variants of an input string. The string is first assigned a culture of origin through automatic name analysis. Culture-specific rules are then applied to the string to produce a regular expression. The regular expression is compared to a knowledge base of frequency information about names drawn from a database of over 750,000,000 names. Variant strings with a high enough frequency score are returned in frequency-ranked order. This process creates a set of likely variants of a name, which can then be used for further querying and matching.

NameHunter™: NameHunter™ is a search engine that computes the similarity of two name strings based on orthography, word order, and number of elements in the string. The thresholds and parameters for comparison differ depending on the culture assignment of the input string. If a string from the database has a score that exceeds the thresholds for the input name culture, the name is returned. Returns are ranked relative to each other, so that the highest-scoring strings are presented first. NameHunter allows for noisy data; thresholds can be tweaked by the user to control the degree of noise in returns.

MetaMatch™: MetaMatch™ is a phonetic-based name retrieval system. Entry strings are first submitted to automatic name analysis for a culture assignment. Strings are then transformed to phonetic representations based on culture-specific rules, which are then stored in the database along with the original entry. Query strings are similarly processed, and the culture assignment is retained to determine the particular
parameters and thresholds for comparison. A similarity algorithm based on linguistic principles is used to determine the degree of similarity between query and entry strings [12]. Returns are presented in ranked order. This approach is particularly effective when name entries have been drawn from oral sources, such as telephone conversations.

NameGenderizer™: This module returns the most likely gender for a given name based on frequency of assignment of the name to males or females.

A major advantage of the technologies developed by LAS is that a measure of similarity between name forms is computed and used to return names in order of their degree of similarity to the query term. An example of the effectiveness of this approach over a Soundex search is provided in Fig. 1 in the Appendix.
3 Named Entity Extraction

The task of named entity recognition and extraction is to identify strings in text that represent names of people, organizations, and places. Work in this area began in earnest in the mid-eighties, with the initiation of the Message Understanding Conferences (MUC). MUC is largely responsible for the definition of and specifications for the named entity extraction task as it is understood today [4]. Through MUC-6 in 1995, most systems performing named entity extraction were based on hand-built patterns that recognized various features and structures in the text. These were found to be highly successful, with precision and recall figures reaching 97% and 96%, respectively [4]. However, the systems were trained exclusively on English-language newspaper articles with a fixed set of domains, leaving open the question of how they would perform on other text sources. Bikel et al. [13] found that rules developed for one newswire source had to be adapted for application to a different newswire service, and that English-language rules were of little use as a starting point for developing rules for an unrelated language like Chinese. These systems are labor-intensive and require people trained in text analysis and pattern writing to develop and maintain rule sets.

Much recent work in named entity extraction has focused on statistical/probabilistic approaches (e.g., [14], [15], [13], [16]). Results in some cases have been very good, with F-measure scores exceeding 94%, even for systems gathering information from the least computationally expensive sources, such as punctuation, dictionary look-up, and part-of-speech taggers [15]. Borthwick et al. [14] found that by training their system on outputs tagged by hand-built systems (such as SRA's NameTag extractor), scores improved to better than 97%, exceeding the F-measure scores of hand-built systems alone, and rivaling scores of human annotators. These results are very promising and suggest that named entity extraction can be usefully applied to larger tasks such as relation detection and link analysis (see, for example, [17]).
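A toy illustration of the hand-built pattern style of extraction (the cue lists here are invented for illustration and are nowhere near the coverage of a real MUC-era system):

```python
import re

HONORIFICS = r"(?:Mr\.|Ms\.|Mrs\.|Dr\.)"          # person cues (illustrative)
ORG_SUFFIX = r"(?:Inc\.|Corp\.|Ltd\.)"            # organization cues
CAP_SEQ = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"       # run of capitalized tokens

person_pat = re.compile(HONORIFICS + r"\s+(" + CAP_SEQ + ")")
org_pat = re.compile("(" + CAP_SEQ + r"),?\s+" + ORG_SUFFIX)

text = "Dr. Paul Thompson met analysts from Language Analysis Systems, Inc."
print(person_pat.findall(text))   # ['Paul Thompson']
print(org_pat.findall(text))      # ['Language Analysis Systems']
```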
4 Intra- and Inter-document Coreference

The task of determining coreference can be defined as "the process of determining whether two expressions in natural language refer to the same entity in the world" [18]. Expressions handled by coreference systems are typically limited to noun phrases of various types (including proper names) and pronouns. This paper will consider only coreference between proper names. For a human reader, coreference processes take place within a single document as well as across multiple documents when more than one text is read. Most coreference systems deal only with coreference within a document (see [19], [20], [21], [18], [22]). Recently, researchers have also begun work on the more difficult task of cross-document coreference ([23], [24], [25]).

Bagga [26] offers a classification scheme for evaluating coreference types and systems for performing coreference resolution, based in part on the amount of processing required. Establishing coreference between proper names was determined to require named entity recognition and generation of syntactic variants of names. Indeed, the coreference systems surveyed for this paper treat proper name variation (apart from synonyms, acronyms, and abbreviations) largely as a syntactic problem. Bontcheva et al., for example, allow name variants to be an exact match, a word token match that ignores punctuation and word order (e.g., "John Smith" and "Smith, John"), a first token match for cases like "Peter Smith" and "Peter," a last token match for, e.g., "John Smith" and "Smith," a possessive form like "John's," or a substring in which all word tokens in the shorter name are included in the longer one (e.g., "John J. Smith" and "John Smith").

Depending on the text source, name variants within a single document are likely to be consistent and limited to syntactic variants, shortened forms, and synonyms, such as nicknames.² One would expect intra-document coreference results for proper names under these circumstances to be fairly good. Bontcheva et al. [19] obtained precision and recall figures ranging from 94%-98% and 92%-95%, respectively, for proper name coreference in texts drawn from broadcast news, newswire, and newspaper sources.³

² Note, however, that even within a document inconsistencies are not uncommon, especially when dealing with names of non-European origin. A Wall Street Journal article appearing in January 2003 referred to Mohammed Mansour Jabarah as Mr. Jabarah, while Khalid Sheikh Mohammed was called Mr. Khalid.

³ When items other than proper names are considered for coreference, scores are much lower than those reported by Bontcheva et al. for proper names. The highest F-measure score for coreference at the MUC-7 competition was 61.8%. This figure includes coreference between proper names, various types of noun phrases, and pronouns.

Bagga and Baldwin [23] also report very good results (F-measures up to 84.6%) for tests of their cross-document coreference system, which compares summaries created for extracted coreference chains. Note, however, that their reported research looked only for references to entities named "John Smith," and that the focus of the cross-document coreference task was maintaining distinctions between different entities with the same name. Research was conducted exclusively on texts from the New York Times. Nevertheless, their work demonstrates that context can be effectively used for disambiguation across documents. Ravin and Kazi [24] focus on both distinguishing different entities with the same name and merging variant names
referring to a single entity. They use the IBM Context Thesaurus to compare the contexts in which similar names from different documents are found; if there is enough overlap in the contextual information, the names are assumed to refer to the same entity. Their work was also limited to articles from the New York Times and the Wall Street Journal, both of which are edited publications with a high degree of internal consistency. Across documents from a wide variety of sources, consistent name variants cannot be counted on, especially for names originating outside the Anglo/Western European tradition. In fact, the many types of name variation commonly found in databases can be expected. A recent web search on Google for texts about Muammar Qaddafi, for example, turned up thousands of relevant pages under the spellings Qathafi, Kaddafi, Qadafi, Gadafi, Gaddafi, Kathafi, Kadhafi, Qadhafi, Qazzafi, Kazafi, Qaddafy, Qadafy, Quadhaffi, Gadhdhafi, al-Qaddafi, Al-Qaddafi, and Al Qaddafi (and these are only a few of the variants of this name known to occur). A coreference system that is to be of use to agencies dealing with international names must be able to recognize name strings with this degree of variation as potential instances of a single name. Cross-document coreference systems currently suffer from the same weakness as most database name search systems: they assume a much higher degree of source homogeneity than can be expected in the world outside the laboratory, and their analysis of name variation is based on an Anglo/Western European model. For the coreference systems surveyed here, recall would be a considerable problem within a multi-source document collection containing non-Western names. However, with an expanded definition of name variation, constrained and supplemented by contextual information, these coreference technologies can serve as a starting point for linking and disambiguating entities across documents from widely varying sources.
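The token-level rules surveyed above can be sketched compactly; this is a toy approximation of the Bontcheva et al. match types (possessives and nicknames are not handled, and a real system would add the phonetic and cultural matching of section 2):

```python
import re

def tokens(name):
    return [t for t in re.split(r"[\s,.]+", name) if t]

def syntactic_variants(a: str, b: str) -> bool:
    ta, tb = tokens(a), tokens(b)
    if ta == tb or sorted(ta) == sorted(tb):   # exact / order-insensitive match
        return True
    short, long_ = sorted((ta, tb), key=len)
    if set(short) <= set(long_):               # "John Smith" ~ "John J. Smith"
        return True
    return short[0] in (long_[0], long_[-1])   # first- or last-token match

print(syntactic_variants("Smith, John", "John Smith"))  # True
print(syntactic_variants("Peter", "Peter Smith"))       # True
print(syntactic_variants("John Smith", "Jack Smith"))   # False (no nicknames)
```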
5 Name Text Mining Support for Visualization, Link Analysis, and Deception Detection

Commercial and research products for visualization and link analysis have become widely available in recent years, e.g., Hyperbolic Tree, or Star Tree [27], SPIRE [28], COPLINK [29], and InfoGlide [30], and visualization and link analysis continue to be an active area of ongoing research [31]. Some current tools have been incorporated into systems supporting intelligence and security informatics; for example, COPLINK [29] makes use of several visualization and link analysis packages, including i2's Analyst's Notebook [32]. Products such as COPLINK and InfoGlide also support name matching and deception detection. These tools make use of sophisticated statistical record linkage, e.g., [33], and have well-developed interfaces to support analysts [32, 29]. Chen et al. [29] note that COPLINK Connect has a built-in capability for partial and phonetic-based name searches; it is not clear from the paper, however, what the scope of coverage is for phonetically spelled names, or how this is implemented. Research software and commercial products have been developed, such as those presented in [34, 30], which include modules that detect fraud in database records. These applications focus on modeling the ways that criminals, or terrorists, typically alter records to disguise their identity. The algorithms used by these systems could be
augmented by taking into account a deeper multi-cultural analysis of names, as discussed in section 2.
6 Procedure for a Name Extraction and Matching Text Mining Module

In this section a procedure is presented for name extraction and matching within and across documents. This algorithm could be incorporated in a module that would work with an environment such as COPLINK. The basic algorithm is as follows; a skeleton implementation is sketched after the steps.

Within a document:
1. Perform named entity extraction.
2. Establish coreference between name mentions within a single document, creating an equivalence class for each named entity.
3. Discover relations between equivalence classes within each document.
4. Find the longest canonical name string in each equivalence class.
5. Perform automatic name analysis on canonical names using NameClassifier; retain the culture assignment.
6. Generate variant forms of canonical names according to culture-specific criteria using NameVariantGenerator.

Across documents:
7. For each culture identified during name analysis, match sets of canonical name variants belonging to that culture against each other; for each pair of variant sets considered, if there are no incompatible (non-matching) members in the sets, mark them as potential matches (e.g., Khalid bin (son of) Jamal and Khalid abu (father of) Jamal would be incompatible).
8. For potential name set matches, use a context thesaurus like that described in [24] to compare the contexts where the names in the equivalence classes are found; if there are enough overlapping descriptions, merge the equivalence classes for the name sets (which will also expand the set of relations for the class to include those found in both documents); combine the variant sets for the two canonical name strings into a single set, pruning redundancies.
9. For potential name set matches where the overlapping contextual descriptions do not meet the minimum threshold, mark a potential link, but do not merge.
10. Repeat the process from step 7 for each pair of variant sets, until no further comparisons are possible.

This algorithm could be implemented within a software module of a larger text mining application. The simplest integration of this algorithm would be as a module that extracted personal names from free text and stored the extracted names and relationships in a database. As discussed by [7], it would also be possible to use this algorithm to annotate the free text, in addition to creating database entries. This automatic markup would provide an interface for an analyst that would show not only the entities and their relationships, but also preserve the context of the surrounding text.
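A high-level skeleton of the ten steps, written as a higher-order function so that it is self-contained: every component (extract, corefer, classify, generate_variants, and so on) is assumed to be supplied by the technologies of sections 2-5, and the dict-based class representation and the link bookkeeping are simplified for illustration; nothing here is a LAS or COPLINK API.

```python
def process_corpus(documents, extract, corefer, relations, classify,
                   generate_variants, incompatible, context_overlap,
                   merge, tau):
    classes = []
    for doc in documents:
        for eq in corefer(extract(doc)):                       # steps 1-2
            eq["relations"] = relations(eq, doc)               # step 3
            name = max(eq["mentions"], key=len)                # step 4
            eq["culture"] = classify(name)                     # step 5
            eq["variants"] = generate_variants(name, eq["culture"])  # step 6
            classes.append(eq)
    merged = True
    while merged:                                              # step 10
        merged = False
        for a in classes:
            for b in classes:
                if a is b or a["culture"] != b["culture"]:     # step 7
                    continue
                if incompatible(a["variants"], b["variants"]):
                    continue
                if context_overlap(a, b) >= tau:               # step 8
                    merge(a, b)
                    classes.remove(b)
                    merged = True
                    break
                a.setdefault("links", []).append(b)            # step 9
            if merged:
                break
    return classes
```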
7 Research Issues

This paper proposes an extension of linguistically based, multi-cultural database name matching functionality to the extraction and matching of names from full-text documents. Accomplishing such an extension implies an effective integration of database and document retrieval technology. While this has been an ongoing topic in academic research [35, 36] and has received attention from major relational database vendors such as Oracle, Sybase, and IBM, effective integration has not yet been achieved, in particular in the area of intelligence and security informatics [37]. Achieving the sophistication of database record matching for names extracted from free text implies advances in text mining [38, 39, 40, 41].

One useful structure for supporting cross-document name matching would be an authority file for named entities. Library catalogs maintain authority files which have a record for each author, showing variant names, pseudonyms, and so on. An authority file for named entity extraction could be built that would maintain a record for each entity. The record could start with information about the entity extracted from database records. When the named entity was found in free text, contextual information about the entity could be extracted and stored in the authority file, with an appropriate degree of probability in the accuracy of the information included. For example, a name followed by a comma-delimited parenthetical expression is a reasonably accurate source of contextual information about an entity, e.g., "X, president of Y, resigned yesterday."

A further application of linguistic/cultural classification of names could be tracking interactions between groups of people where there is a strong association between group membership and language. For example, an increasing number of police reports in which both Korean and Cambodian names are found in the same documents might indicate a pattern in Asian crime ring interactions. Finally, automatic recognition of name gender could be used to support the process of pronominal coreference.

Work is underway to provide a quantitative comparison of key-based name matching systems (such as Soundex) with other approaches to name matching. One of the hindrances to effective name matching system comparisons is the lack of generally accepted standards for what constitutes similarity between names. Such standards are difficult to establish in part because the definition of similarity changes from one user community to the next. A standardized metric for the evaluation of degrees of correlation of name search results, and a means for using this metric to measure the usefulness of different name search technologies, is sorely needed.

This paper has focused on personal name matching. Matching of other named entities, such as organizations, is also of interest for intelligence and security informatics. While different matching algorithms are needed, extending company name matching, or other entity matching, to free text will also be useful. One promising research direction integrating database, information extraction, and document retrieval that could support effective text mining of names is provided by work on XIRQL [7].
8 Conclusion

Effective tools exist for multi-cultural database name matching, and this technology is becoming available in analytic toolkits supporting intelligence and security informatics. The proportion of data of interest to intelligence and security analysts that is contained in databases, however, is very small compared to the amount of data available in free text and audio formats. The extension of name extraction and matching to free text and audio will add important text mining functionality for intelligence and security informatics toolkits.
References

1. Taft, R.L.: Name Search Techniques. Special Rep. No. 1. Bureau of Systems Development, New York State Identification and Intelligence System, Albany (1970)
2. Verton, D.: Technology Aids Hunt for Terrorists. Computer World, 9 September (2002)
3. Borgman, C.L., Siegfried, S.L.: Getty's Synoname and Its Cousins: A Survey of Applications of Personal Name-Matching Algorithms. Journal of the American Society for Information Science, Vol. 43 No. 7 (1992) 459–476
4. Grishman, R., Sundheim, B.: Message Understanding Conference – 6: A Brief History. In: Proceedings of the 16th International Conference on Computational Linguistics. Copenhagen (1999)
5. DARPA: Tipster Text Program Phase III Proceedings. Morgan Kaufmann, San Francisco (1999)
6. National Institute of Standards and Technology: ACE – Automatic Content Extraction. Information Technology Laboratories. http://www.itl.nist.gov/iad/894.01/tests/ace/index.htm (2000)
7. Fuhr, N.: XML Information Retrieval and Extraction [to appear]
8. Hermansen, J.C.: Automatic Name Searching in Large Databases of International Names. Georgetown University Dissertation, Washington, DC (1985)
9. Holmes, D., McCabe, M.C.: Improving Precision and Recall for Soundex Retrieval. In: Proceedings of the 2002 IEEE International Conference on Information Technology – Coding and Computing. Las Vegas (2002)
10. Navarro, G., Baeza-Yates, R., Azevedo Arcoverde, J.M.: Matchsimile: A Flexible Approximate Matching Tool for Searching Proper Names. Journal of the American Society for Information Science and Technology, Vol. 54 No. 1 (2003) 3–15
11. Patman, F., Shaefer, L.: Is Soundex Good Enough for You? On the Hidden Risks of Soundex-Based Name Searching. Language Analysis Systems, Inc., Herndon (2001)
12. Lutz, R., Greene, S.: Measuring Phonological Similarity: The Case of Personal Names. Language Analysis Systems, Inc., Herndon (2002)
13. Bikel, D.M., Schwartz, R., Weischedel, R.M.: An Algorithm that Learns What's in a Name. Machine Learning, Vol. 34 No. 1-3 (1999) 211–231
14. Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: NYU: Description of the MENE Named Entity System as Used in MUC-7. In: Proceedings of the Seventh Message Understanding Conference. Fairfax (1998)
15. Baluja, S., Mittal, V.O., Sukthankar, R.: Applying Machine Learning for High Performance Named-Entity Extraction. Pacific Association for Computational Linguistics (1999)
16. Collins, M.: Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia (2002) 489–496
17. Zelenko, D., Aone, C., Richardella, A.: Kernel Methods for Relation Detection Extraction. Journal of Machine Learning Research [to appear]
18. Soon, W.M., Ng, H.T., Lim, D.C.Y.: A Machine Learning Approach to Coreference Resolution of Noun Phrases. Association for Computational Linguistics (2001)
19. Bontcheva, K., Dimitrov, M., Maynard, D., Tablin, V., Cunningham, H.: Shallow Methods for Named Entity Coreference Resolution. TALN (2002)
20. Hartrumpf, S.: Coreference Resolution with Syntactico-Semantic Rules and Corpus Statistics. In: Proceedings of CoNLL-2001. Toulouse (2001) 137–144
21. Ng, V., Cardie, C.: Improving Machine Learning Approaches to Coreference Resolution. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia (2002) 104–111
22. McCarthy, J.F., Lehnert, W.G.: Using Decision Trees for Coreference Resolution. In: Mellish, C. (ed.): Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (1995) 1050–1055
23. Bagga, A., Baldwin, B.: Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (1998) 79–85
24. Ravin, Y., Kazi, Z.: Is Hillary Rodham Clinton the President? Disambiguating Names Across Documents. In: Proceedings of the ACL'99 Workshop on Coreference and Its Applications (1999)
25. Schiffman, B., Mani, I., Concepcion, K.J.: Producing Biographical Summaries: Combining Linguistic Knowledge with Corpus Statistics. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (2001) 450–457
26. Bagga, A.: Evaluation of Coreferences and Coreference Resolution Systems. In: Proceedings of the First International Conference on Language Resources and Evaluation (1998) 563–566
27. Inxight: A Research Engine for the Pharmaceutical Industry. http://www.inxight.com
28. Hetzler, B., Harris, W.M., Havre, S., Whitney, P.: Visualizing the Full Spectrum of Document Relationships. In: Structures and Relations in Knowledge Organization. Proceedings of the 5th International ISKO Conference. ERGON Verlag, Wurzburg (1998) 168–175
29. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., Schroeder, J.: COPLINK: Managing Law Enforcement Data and Knowledge. Communications of the ACM, Vol. 46 No. 1 (2003)
30. InfoGlide Software: Similarity Search Engine: The Power of Similarity Searching. http://www.infoglide.com/content/images/whitepapers.pdf (2002)
31. American Association for Artificial Intelligence Fall Symposium on Artificial Intelligence and Link Analysis (1998)
32. i2: Analyst's Notebook. http://www.i2.co.uk/Products/Analysts_Notebook (2002)
33. Winkler, W.E.: The State of Record Linkage and Current Research Problems. Technical Report RR99/04. U.S. Census Bureau. http://www.census.gov/srd/papers/pdf/rr99-04.pdf
34. Wang, G., Chen, H., Atabakhsh, H.: Automatically Detecting Deceptive Criminal Identities [to appear]
35. Fuhr, N.: Probabilistic Datalog – A Logic for Powerful Retrieval Methods. In: Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (1995) 282–290
36. Fuhr, N.: Models for Integrated Information Retrieval and Database Systems. IEEE Data Engineering Bulletin, Vol. 19 No. 1 (1996)
37. Hoogeveen, M., van der Meer, K.: Integration of Information Retrieval and Database Management in Support of Multimedia Police Work. Journal of Information Science, Vol. 20 No. 2 (1994)
38. Institute for Mathematics and Its Applications: IMA Hot Topics Workshop: Text Mining. http://www.ima.umn.edu/reactive/spring/tm.html (2000)
39. KDD-2000 Workshop on Text Mining. The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston (2000) http://www2.cs.cmu.edu/~dunja/WshKDD2000.html
40. SIAM Text Mining Workshop. http://www.cs.utk.edu/tmw02 (2002)
41. Text-ML 2002 Workshop on Text Learning. The Nineteenth International Conference on Machine Learning ICML-2002. Sydney (2002)
Appendix: Comparison of LAS MetaMatch¹ Search Engine Returns with SQL-Soundex Returns
Fig. 1. These searches were conducted in databases containing common surnames found in the 1990 U.S. Census data. The surnames in the databases are identical. The MetaMatch database differs only in that the phonetic form of each surname is also stored. The exact match "Sadiq" was 54th in the list of Soundex returns. "Siddiqui" was returned by Soundex in 26th place. "Sadik" was 109th.
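For readers unfamiliar with the Soundex coding used in the comparison above, a compact Python sketch of standard American Soundex (our own illustration, not the LAS MetaMatch algorithm) is shown below. Note that all three surname variants collapse to the same four-character key, which is why Soundex retrieves them but cannot rank them well:

```python
def soundex(name: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits, prev = [], codes.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":            # h and w do not reset the previous code
            continue
        d = codes.get(c, "")     # vowels map to "" and reset prev
        if d and d != prev:
            digits.append(d)
        prev = d
    return (letters[0].upper() + "".join(digits) + "000")[:4]

# All three variants from the figure collapse to the same key:
assert soundex("Sadiq") == soundex("Sadik") == soundex("Siddiqui") == "S320"
```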
Web-Based Intelligence Reports System

Alexander Dolotov and Mary Strickler

Phoenix Police Department, 620 W. Washington Street, Phoenix, Arizona 85003
{alex.dolotov, mary.strickler}@phoenix.gov
Abstract. This paper discusses two areas. The first targets a conceptual design of a Group Detection and Activity Prediction System (GDAPS). The second describes the implementation of the Web-based intelligence and monitoring reports system called the Phoenix Police Department Reports (PPDR), which can be considered the first phase of a GDAPS system. The goal of the already operational PPDR system is to support data access to heterogeneous databases, provide a means to mine data using search engines, and provide statistical data analysis with reporting capabilities. A variety of static and ad hoc statistical reports are produced with this system for interdepartmental and public use. The system is scalable, reliable, portable and secure. Performance is supported at all system levels using a variety of effective software designs, statistical processing, and heterogeneous database/data storage access.
1 System Concept

The key to the effectiveness of a law enforcement agency and all of its divisions is directly related to its ability to make informed decisions for crime prevention. In order to prevent criminal activity, a powerful multilevel analysis tool based on a mathematical model could be used to develop a system that would target criminals and/or criminal groups and their activities. Alexander Dolotov and Ziny Flikop have proposed an innovative conceptual design of such a system. The fundamental idea of this project involves information collection and access to the information blocks reflecting the activities of an individual or groups of individuals. This data could then be used to create a system capable of maintaining links and relationships for both groups and individuals in order to predict any abnormal activity. Such a system might be referred to as the "Group Detection and Activity Prediction System" (GDAPS). The design of the GDAPS would include maintaining all individuals' and groups' activity trends simultaneously in real-time mode. This system would be "self-educating," meaning it would become more precise over a longer time period and as more information is gathered. The design would be based on the principles of statistical modeling, fuzzy logic, open-loop control and many-dimensional optimization. In addition, the latest software design technologies would be applied. The ultimate goal of this system would be to produce notifications alerting users of any predicted abnormal behavior. The initial plan for the design of GDAPS would be to break the system into three subsystems. The subsystems would consist of the PPDR system,
renamed the "Data Maintaining and Reporting Subsystem" (DmRs); the "Group Detection Subsystem" (GTeS); and the "Activity Prediction Subsystem" (APreS). The first phase of the GDAPS system would be the PPDR system, which is currently operational within the Phoenix Police Department. This system is explained in detail in the remaining sections of this document. The DmRs subsystem (renamed from PPDR) supports access to heterogeneous databases, using data mining search engines to perform statistical data analysis; ultimately, the results are generated in report form. The GTeS subsystem would be designed to detect members of the targeted group or groups. In order to accomplish this, it would require monitoring communications between individuals using all available means. The intensity and duration of these communications can define relationships inside the group and possibly the hierarchy of the group members. The GTeS subsystem would have to be adaptive enough to constantly update information related to each controlled group, since every group has a life of its own. GTeS would provide the basic foundation for GDAPS. The purpose of the APreS subsystem is to monitor, over time, the intensity and modes of multiple groups' communications by maintaining a database of all types of communications. The value of this subsystem would be the ability to predict groups' activities based upon the historical correlation between abnormalities in the groups' communication modes and intensities, along with any previous activities. APreS is the dynamic subsystem of GDAPS. To accelerate the GDAPS development, methodologies already created for other industries can be modified for use [1], [2], [3], [7]. Because of the complexity of the GDAPS system, a multi-phase approach to system development should be considered. Taking into account time and resources, this project can be broken down into manageable sub-projects with realistic development and implementation goals. The use of a multi-dimensional mathematical model will enable developers to assign values to different components and to determine the relationships between them. By using specific criteria, these values can be manipulated to determine the outcome under varying circumstances. The mathematical model, when optimized, will produce results that could be interpreted as "a high potential for criminal activity". The multi-dimensional mathematical model is a powerful "forecasting" tool. It provides the ability to make decisions before a critical situation or uncertain conditions arise [4], [5], [6], [8]. Lastly, accumulated information must be stored in a database that is supported/serviced by a specific set of business applications. The following is a description of the PPDR system, the first phase of the Group Detection and Activity Prediction System (GDAPS).
2 Objectives

A Web-based intelligence and monitoring reports system called Phoenix Police Department Reports (PPDR) was designed in-house for use by the Phoenix Police Department (PPD). Even though this system was designed specifically for the Phoenix Police Department, it could easily be ported for use by other law enforcement agencies. Within seconds, this system provides detailed, comprehensive, and informative statistical reports reflecting the effectiveness and responsiveness of any division, for any date/time period, within the Phoenix Police Department. These reports are designed for use by all levels of management, both sworn and civilian, from police
chiefs' requests to public record requests. The statistical data from these reports provides information for use in making departmental decisions concerning such issues as manpower allocation, restructuring and measurement of work. Additionally, PPDR uses a powerful database mining mechanism, which would be valuable in the future development of the GDAPS system. In order to satisfy the needs of all users, the PPDR system is designed to meet the following requirements:
- maintain accurate and precise up-to-date information;
- use a specific mathematical model for statistical analysis and optimization [5], [6];
- perform at a high level with quick response times;
- support different security levels for different categories of users;
- be scalable and expandable;
- have a user-friendly presentation; and
- be able to easily maintain reliable and optimized databases and other information storage.

The PPDR system went into production in February 2002. This system contains original and effective solutions. It provides the capability to make decisions which will ultimately have an impact on the short- and long-term plans for the department, the level of customer service provided to the public, overall employee satisfaction, and organizational changes needed to achieve future goals. The PPDR system could be considered the first phase of a complex intelligence Group Detection and Activity Prediction System.
3 Relationships to Other Systems and Sources of Information

3.1 Calls for Service

There are two categories of information that are used for the PPDR: calls for service data, and text messages sent by Mobile Data Terminal (MDT) and Computer Aided Dispatch (CAD) users. Both sources of information are obtained from the Department's Computer Aided Dispatch and Mobile Data Terminal (CAD/MDT) system. The CAD/MDT system operates on three redundant Hewlett Packard (HP) 3000 N-Series computers. The data is stored in HP's proprietary Image database for six months. The Phoenix Police Department's CAD/MDT system handles over 7,000 calls for service daily from citizens of Phoenix. Approximately half of these calls require an officer to respond. The other half are either duplicates or ones where the caller is just asking for general information or wishing to report a non-emergency incident. Calls for service data is collected when a citizen calls the emergency 911 number or the Department's crime stop number for service. A call entry clerk enters the initial call information into CAD. The address is validated against a street geobase, which provides information required for dispatching, such as the grid, the beat and the responsible precinct where the call originated. After all information is collected, the call is automatically forwarded to a dispatcher for distribution to an officer or officers in the field. Officers receive the call information on their Mobile Data Terminals (MDT). They enter the time they start on the call, the time they arrive at the scene, and the time they
complete the call. Each call for service incident is given a disposition code that relates to how an officer or officers handled the incident. Calls for service data for completed incidents are transferred to a SQL database on a daily basis for use in the PPDR system. Police officers and detectives use calls for service information for investigative purposes. It is often requested by outside agencies for court purposes or by the general public for their personal information. It is also used internally for statistical analysis.

3.2 Messages

The messages are text sent between MDT users, between MDT users and CAD users, and between CAD users. The MDT system uses a Motorola radio system for communications, which interfaces to the CAD system through a programmable interface computer. The CAD system resides on a local area network within the Police Department. The message database also contains the results of inquiries on persons, vehicles, or articles requested by officers in the field from their MDTs or by CAD users from any CAD workstation within the Department. Each message stored by the CAD system contains structured data, such as the identification of the message sender and the date and time sent, along with the free-form body of the message. Every twenty-four hours, more than 15,000 messages pass through the CAD system. Copies of messages are requested by detectives, police officers, the general public and court systems, as well as outside law enforcement agencies.
4 PPDR System Architecture

The system architecture of the PPDR system is shown in Figure 1.
5 PPDR Structural WEB Design

PPDR has been designed with seven distinctive subsystems incorporated within one easy-to-access location. The subsystems are as follows: Interdepartmental Reports; Ad Hoc Reports; Public Reports; Messages Presentation; Update Functionality; Administrative Functionality; and System Security. Each subsystem is designed to be flexible as well as scalable, and each has the capability of being easily expanded or modified to satisfy user enhancement requests.
Fig. 1. PPDR Architecture
5.1 System Security

Security begins when a user logs into the system and is continuously monitored until the user logs off. The PPDR security system is based on the assignment of roles to each user through the Administrative function. Role assignment is maintained across multiple databases; each database maintains a set of roles for the PPDR system. Each role can be assigned to both a database object and a Web functionality. This results in a user being able to perform only those Web and database functions that are available to his/her assigned role. When a user logs onto the system, the userid and password are validated against the database security information. Database security does not use custom tables but rather database tables that contain encrypted roles, passwords, userids and logins. After a successful login, the role assignment is maintained at the Web level in a secure state and remains intact during the user's session.
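A minimal sketch of this login flow is shown below. The table and column names are hypothetical (the paper does not publish its schema), and a real deployment would rely on the database's own encrypted role storage as described above:

```python
import hashlib
import sqlite3

def login(conn: sqlite3.Connection, userid: str, password: str) -> dict:
    """Validate credentials against the security tables and return the
    session state that keeps the resolved role at the web level."""
    digest = hashlib.sha256(password.encode()).hexdigest()
    row = conn.execute(
        "SELECT role FROM security_logins WHERE userid = ? AND password_hash = ?",
        (userid, digest),
    ).fetchone()
    if row is None:
        raise PermissionError("invalid userid or password")
    return {"userid": userid, "role": row[0]}  # held for the whole session
```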
The PPDR system has two groups of users: those that use the Computer Aided Dispatch (CAD) system and those that do not. Since most of the PPDR users are CAD users, it makes sense to keep the same userids and passwords for both CAD and PPDR. Using a scheduled Data Transfer System (DTS) process, CAD userids and passwords are relayed to the PPDR system on a daily basis, automatically initiating a database upgrade process in PPDR. The non-CAD users are entered into the PPDR system through the Administrative (ADMIN) function. This process does not involve the DTS transfer, but is performed in real time by a designated person or persons with ADMIN privileges. Security for non-CAD users is identical to that of CAD users, including transaction logging that captures each Web page access. In addition to transaction logging, another useful security feature is the storage of user history information at the database level. Anyone with ADMIN privileges can produce user statistics and historical reports upon request.

5.2 Regular Reports

In general, Regular Reports are reports that have a predefined structure, based on input parameters entered by the user. In order to obtain the best performance and accuracy for these reports, the following technology has been applied: a special design of multiple databases which includes "summary" tables (see Section V, Database Solutions); the use of cross-table reporting functionality, which allows for creating a cross-table recordset at the database level; and the use of a generic XML stream with XSLT processing on the client side instead of ActiveX controls for the creation of reports. Three groups of Regular Reports are available within the PPDR system: Response Time Reports, Calls for Service Reports and Details Reports.

Response Time Reports. Response Time Reports present statistical information regarding the average response time for calls for service, using data obtained from the CAD system. Response time is the period between the time an officer was dispatched on a call for service and the time the officer actually arrived on the scene. Response time reports can be produced at several levels, including but not limited to the beat, squad, precinct and even citywide level. Using input parameters such as date, time, shift, and squad area, a semi-custom report is produced within seconds. Below is an example of the "Average Quarterly Response Time By Precinct" report for the first quarter of 2002. This report calculates the average quarterly response time for each police precinct based on the priorities assigned to the calls for service. The rightmost column (PPD) is the citywide average, again broken down by priority.

Calls for Service Reports. Calls for Service Reports are used to document the number of calls for service in a particular beat, squad or precinct area, or citywide. These reports have many of the same parameters as the Response Time Reports. Some reports in this group are combination reports, displaying both the counts for calls for service
Fig. 2. Response Time Reports
and the average response time. Below is an example of a "Monthly Calls for Service by Squad" report for the month of January 2002. This report shows a count of the calls for service for each squad area in the South Mountain precinct, broken down by calls that are dispatched and calls that are handled by a phone call made by the Callback Unit.
Fig. 3. Calls For Service Report
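As a concrete illustration of the response-time statistic defined above (dispatch time to arrival time), the aggregation behind a report like Figure 2 can be sketched as follows; the field names here are hypothetical:

```python
from statistics import mean

def avg_response_time(calls):
    """Average response time (dispatch to arrival), grouped by
    (precinct, priority). `calls` is an iterable of dicts with
    hypothetical keys 'precinct', 'priority', 'dispatched', 'arrived'."""
    groups = {}
    for c in calls:
        key = (c["precinct"], c["priority"])
        groups.setdefault(key, []).append(c["arrived"] - c["dispatched"])
    return {key: mean(times) for key, times in groups.items()}
```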
Details Reports. These reports are designed to present important details for a particular call for service. Details for a call for service include such information as the call location, disposition code (action taken by the officer), radio code (type of call for service - burglary, theft, etc.), received time and responding officer(s).
From a Details Report, other pertinent information related to a call for service is obtained quickly with a click of the mouse. Other available information includes unit history information. Unit history information is a collection of data for all the units that responded to a particular call for service, such as the time the unit was dispatched, the time the unit arrived, and what people or plates were checked.

5.3 AD HOC Reports

The AD HOC Reports subsystem provides the ability to produce "custom" reports from the calls for service data. To generate an AD HOC report, a user should have basic knowledge of SQL queries using search criteria as well as basic knowledge of the calls for service data. There are three major steps involved in producing an AD HOC report:
- Selecting report columns
- Selecting search criteria
- Report generation
Selecting report columns and selecting search criteria use an active dialog; the report generation uses XML/XSLT processing.

Selecting Report Columns. The first page that is presented when entering the AD HOC Reports subsystem allows the user to choose, from a list of tables, the fields that are to be displayed in the desired report. OLAP functionality is used for accessing the database's schema, such as available tables and their characteristics, column names, formats, aliases and data types. The first page of the AD HOC Reports is displayed below. A selection can be made for any required field by checking the left check box. Other options such as original value (Orig), count, average (Averg), minimum (Minim), and maximum (Maxim) are also available to the user. Count, average, minimum and maximum are only available for numeric fields. As an example, if a user requires a count of the number of calls for service, a check is required in the Count field. When the boxes are checked, the SELECT clause is generated as a DHTML script. For instance, if the selected fields for an AD HOC report are 'Incident Number', 'Date', 'Address' and 'Average of the Response Time' (all members of the Incidents table), the following SELECT clause will be generated:
SELECT ... AS 'Date', Incidents.Inc_Location AS 'Address', Incidents.Inc_Time_Rec AS 'Received time', Avg(Incidents.Inc_Time_Rec) AS 'Avg Of Received time' FROM Incidents

Syntax is maintained in the SELECT clause generation at the business logic level using COM+ objects.

Selecting Search Criteria. When all desired fields have been selected, the user clicks on "Submit" and the following search criteria page is presented:
Fig. 4. Selecting Report Columns
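The column-selection step described above amounts to translating the checked boxes into a SELECT clause. A small illustrative sketch follows (names are our own; the PPDR implementation generates the clause in DHTML with COM+ objects, as noted above):

```python
def build_select(table, picks):
    """Turn the checked boxes into a SELECT clause. `picks` maps a column
    to (mode, alias), where mode is 'orig', 'count', 'avg', 'min' or 'max'."""
    fns = {"count": "Count", "avg": "Avg", "min": "Min", "max": "Max"}
    cols = []
    for col, (mode, alias) in picks.items():
        expr = f"{table}.{col}" if mode == "orig" else f"{fns[mode]}({table}.{col})"
        cols.append(f"{expr} AS '{alias}'")
    return f"SELECT {', '.join(cols)} FROM {table}"

print(build_select("Incidents", {
    "Inc_Location": ("orig", "Address"),
    "Inc_Time_Rec": ("avg", "Avg Of Received time"),
}))
```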
This page will allow the user to build the search criteria necessary for the generation of the desired report. Most available criteria and their combinations are available to the user.

Evacuation Planning: A Capacity Constrained Routing Approach

Q. Lu, Y. Huang, and S. Shekhar

The Single-Route Capacity Constrained Planner (SRCCP) produces one route for each source node and reserves node and edge capacities at certain time points along the route. The detailed pseudocode and algorithm description are as follows.

Algorithm 1 Single-Route Capacity Constrained Planner (SRCCP)
Method:
(1-2) find route Rs = <n0, n1, ..., nk> with shortest total travel time among routes from s to all destinations d ∈ D (where n0 = s and nk = d), for each source node s ∈ S;
(3)   sort routes Rs by total travel time, increasing order;
(4)   for each route Rs in sorted order do {
(5)     initialize next start node on route Rs to move: st = 0;
(6)     while not all evacuees from n0 have reached nk do {
(7)       t = next available time to start a move from node n_st;
(8)       n_end = furthest node that can be reached from n_st without stopping;
(9)       flow = min( number of evacuees at node n_st,
                      Available_Edge_Capacity(all edges between n_st and n_end on Rs),
                      Available_Node_Capacity(all nodes from n_st+1 to n_end on Rs) );
(10)      for i = st to end - 1 do {
(11)        t' = t + Travel_time(e(n_i, n_i+1));
(12)        Available_Edge_Capacity(e(n_i, n_i+1), t) reduced by flow;
(13)        Available_Node_Capacity(n_i+1, t') reduced by flow;
(14)        t = t';
(15)      }
(16)      st = closest node to destination on route Rs with evacuees;
(17)    }
(18)  }
(19)  Postprocess results and output evacuation plan;

Table 2. Result Evacuation Plan of the Single-Route Capacity Constrained Planner

  Group ID | Origin | No. of People | Start Time | Route                   | Exit Time
  A        | N8     | 6             | 0          | N8-N10-N13              | 4
  B        | N8     | 6             | 1          | N8-N10-N13              | 5
  C        | N8     | 3             | 2          | N8-N10-N13              | 6
  D        | N1     | 3             | 0          | N1-N3-N4-N6-N10-N13     | 14
  E        | N1     | 3             | 0          | N1-N3(W1)-N4-N6-N10-N13 | 15
  F        | N1     | 1             | 0          | N1-N3(W2)-N4-N6-N10-N13 | 16
  G        | N1     | 2             | 1          | N1-N3(W1)-N4-N6-N10-N13 | 16
  H        | N1     | 1             | 1          | N1-N3(W2)-N4-N6-N10-N13 | 17
  I        | N2     | 2             | 0          | N2-N3(W3)-N4-N6-N10-N13 | 17
  J        | N2     | 3             | 0          | N2-N3(W4)-N4-N6-N10-N13 | 18

In the first step (lines 1-2), for each source node s, we find the route Rs with the shortest total travel time among routes between s and all the destination nodes. The total travel time of route Rs is the sum of the travel times of all edges on Rs. For example, in Figure 2, R_N1 is N1-N3-N4-N6-N10-N13 with a total travel time of 14 time units, R_N2 is N2-N3-N4-N6-N10-N13 with a total travel time of 14 time units, and R_N8 is N8-N10-N13 with a total travel time of 4 time units. This step is done by a variation of Dijkstra's algorithm [7] in which edge travel time
is treated as the edge weight, and the algorithm terminates when the shortest route from s to one destination node is determined. The second step (line 3) is to sort the routes obtained in step 1 in increasing order of total travel time. Thus, in our example, the order of routes will be R_N8, R_N1, R_N2. The third step (lines 4-18) is to reserve capacities for each route in the sorted order. The reservation for route Rs is done by sending all the people initially at node s to the exit along the route in the least amount of time. The people may need to be divided into groups and sent in waves due to the constraints of the capacities of the nodes and edges on Rs. For example, for R_N8, the first group of people that starts from N8 at time 0 is at most 6 people, because the available edge capacity of N8-N10 at time 0 is 6. The algorithm makes reservations for the 6 people by reducing the available capacity of each node and edge at the time point at which they are at that node or edge. This means that available capacities are reduced by 6 for edge N8-N10 at time 0, because the 6 people travel through this edge starting from time 0; for node N10 at time 3, because they arrive at N10 at time 3; and for edge N10-N13 at time 3, because they travel through this edge starting from time 3. They finally arrive at N13 (EXIT1) at time 4. The second group of people leaving N8 has to wait until time 1, since the first group has reserved all the capacity of edge N8-N10 at time 0. Therefore, the second group leaves N8 at time 1 and reaches N13 at time 5. Similarly, the last group of 3 people leaves N8 at time 2 and reaches N13 at time 6. Thus all people from N8 are sent to exit N13. The next two routes, R_N1 and R_N2, make their reservations based on the available capacities left by the previous routes. The final step of the algorithm is to output the entire evacuation plan, as shown in Table 2, which takes 18 time units.
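The reservation loop can be made concrete with a short sketch. The Python fragment below is illustrative only (data structures and names are our own), and it simplifies the paper's SRCCP by sending each wave down the whole route without intermediate stops:

```python
def reserve_route(route, travel_time, num_people, edge_cap, node_cap,
                  avail_edge, avail_node):
    """Send all evacuees at route[0] down one fixed route in waves,
    reserving time-indexed capacities (cf. lines 4-18 of Algorithm 1).
    avail_edge[(u, v)] and avail_node[n] map a time step to the remaining
    capacity; missing entries default to the static maxima."""
    plan, left, start = [], num_people, 0
    while left > 0:
        flow, t = left, start            # largest wave that fits at `start`
        for u, v in zip(route, route[1:]):
            flow = min(flow, avail_edge[(u, v)].get(t, edge_cap[(u, v)]))
            t += travel_time[(u, v)]
            flow = min(flow, avail_node[v].get(t, node_cap[v]))
        if flow > 0:                     # reserve this wave's capacities
            t = start
            for u, v in zip(route, route[1:]):
                avail_edge[(u, v)][t] = avail_edge[(u, v)].get(t, edge_cap[(u, v)]) - flow
                t += travel_time[(u, v)]
                avail_node[v][t] = avail_node[v].get(t, node_cap[v]) - flow
            plan.append((start, flow))
            left -= flow
        start += 1                       # next wave departs one unit later
    return plan                          # list of (start_time, group_size)
```

With the capacities from the example (edge N8-N10 capacity 6, travel times 3 and 1, 15 evacuees at N8), this sketch emits the waves (0, 6), (1, 6), (2, 3), matching groups A-C of Table 2.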
3.2 Multiple-Route Capacity Constrained Routing Approach
The Multiple-Route Capacity Constrained Planner (MRCCP) is an iterative approach. In each iteration, the algorithm re-computes the earliest-time route from any source to any destination, taking the previous reservations and possible on-route waiting time into consideration. Then it reserves the capacity for this route in the current iteration. The detailed pseudocode and algorithm description are as follows.

Algorithm 2 Multiple-Route Capacity Constrained Planner (MRCCP)
Input:
1) G(N, E): a graph G with a set of nodes N and a set of edges E;
   each node n ∈ N has two properties:
     Maximum_Node_Capacity(n): non-negative integer
     Initial_Node_Occupancy(n): non-negative integer
   each edge e ∈ E has two properties:
     Maximum_Edge_Capacity(e): non-negative integer
     Travel_time(e): non-negative integer
2) S: set of source nodes, S ⊆ N;
3) D: set of destination nodes, D ⊆ N;
Output: Evacuation plan
Method:
(1) while any source node s ∈ S has evacuees do {
(2)   find route R = <n0, n1, ..., nk> with earliest destination arrival time among routes between all (s, d) pairs, where s ∈ S, d ∈ D, n0 = s, nk = d;
(3)   flow = min( number of evacuees still at source node s,
                  Available_Edge_Capacity(all edges on route R),
                  Available_Node_Capacity(all nodes from n1 to nk on route R) );
(4)   for i = 0 to k - 1 do {
(5)     t' = t + Travel_time(e(n_i, n_i+1));
(6)     Available_Edge_Capacity(e(n_i, n_i+1), t) reduced by flow;
(7)     Available_Node_Capacity(n_i+1, t') reduced by flow;
(8)     t = t';
(9)   }
(10) }
(11) Postprocess results and output evacuation plan;
The MRCCP algorithm keeps iterating as long as there are still evacuees at any source node (line 1). Each iteration starts with finding the route R with the earliest destination arrival time from any source node to any exit node based on the current available capacities (line 2). This is done by generalizing Dijkstra's shortest path algorithm [7] to work with the time-series capacities and edge travel times. Route R is the route that reaches an exit in the least
Table 3. Result Evacuation Plan of the Multiple-Route Capacity Constrained Planner

  Group ID | Origin | No. of People | Start Time | Route               | Exit Time
  A        | N8     | 6             | 0          | N8-N10-N13          | 4
  B        | N8     | 6             | 1          | N8-N10-N13          | 5
  C        | N8     | 3             | 0          | N8-N10-N14          | 5
  D        | N1     | 3             | 0          | N1-N3-N4-N6-N10-N13 | 14
  E        | N1     | 3             | 1          | N1-N3-N4-N6-N10-N13 | 15
  F        | N1     | 3             | 0          | N1-N3-N5-N7-N11-N14 | 15
  G        | N1     | 1             | 2          | N1-N3-N4-N6-N10-N13 | 16
  H        | N1     | 3             | 1          | N2-N3-N5-N7-N11-N14 | 16
  I        | N2     | 2             | 2          | N2-N3-N5-N7-N11-N14 | 17
amount of time such that at least one person can be sent to the exit through route R. For example, at the very first iteration, R will be N8-N10-N13, which can reach N13 at time 4. The actual number of people that will travel through R is the smallest number among the number of evacuees at the source node and the available capacities of each of the nodes and edges on route R (line 3). Thus, in the example, this amount will be 6, which is the available edge capacity of N8-N10 at time 0. The next step is to reserve capacities for the people on each node and edge of route R (lines 4-9). The algorithm makes reservations for the 6 people by reducing the available capacity of each node and edge at the time point at which they are at that node or edge. This means that available capacities are reduced by 6 for edge N8-N10 at time 0, for node N10 at time 3, and for edge N10-N13 at time 3. They finally arrive at N13 (EXIT1) at time 4. Then, the algorithm goes back to line 2 for the next iteration. The iteration terminates when the occupancy of all source nodes is reduced to zero, which means all evacuees have been sent to exits. Line 11 outputs the evacuation plan, as shown in Table 3.
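The earliest-arrival search at line 2 can likewise be sketched in a few lines. The fragment below is our own illustrative code, not the authors' implementation; for brevity it checks only edge capacities and lets evacuees wait at a node until an edge frees up:

```python
import heapq

def earliest_arrival_route(sources, exits, adj, travel_time,
                           avail_edge, edge_cap, horizon):
    """Generalized Dijkstra over time-indexed capacities: find the route
    from any source to any exit with the earliest arrival time, where an
    edge (u, v) may only be entered at a time with remaining capacity,
    possibly after waiting at u. Returns (arrival_time, route) or None."""
    heap = [(0, s, (s,)) for s in sources]   # (time, node, path so far)
    best = {s: 0 for s in sources}
    heapq.heapify(heap)
    while heap:
        t, u, path = heapq.heappop(heap)
        if u in exits:
            return t, path
        if t > best.get(u, horizon):         # stale heap entry
            continue
        for v in adj[u]:
            depart = t                       # wait until capacity frees up
            while depart <= horizon and avail_edge[(u, v)].get(depart, edge_cap[(u, v)]) <= 0:
                depart += 1
            arrive = depart + travel_time[(u, v)]
            if arrive <= horizon and arrive < best.get(v, horizon + 1):
                best[v] = arrive
                heapq.heappush(heap, (arrive, v, path + (v,)))
    return None
```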
4 Comparison and Cost Models of the Two Algorithms
It can be seen that the key difference between the two algorithms is that the SRCCP algorithm produces only a single route for each source node, while MRCCP can produce multiple routes for groups of people at each source node. MRCCP can produce evacuation plans with shorter evacuation times than SRCCP because of its flexibility in adapting to the available capacities left by previous reservations. However, MRCCP needs to re-compute the earliest-time route in each iteration, which incurs more computational cost than SRCCP. We now provide simple algebraic cost models for the computational cost of the two proposed heuristic algorithms. We assume the total number of nodes in the graph is n, the number of source nodes is n_s, and the number of groups generated in the result evacuation plan is n_g.
The cost of the SRCCP algorithm consists of three parts: the cost of computing the shortest-time route from each source node to any exit node, denoted by C_sp; the cost of sorting all the pre-computed routes by their total travel time, denoted by C_ss; and the cost of reserving capacities along each route for each group of people, denoted by C_sr. The cost model of the SRCCP algorithm is given as follows:

$$\mathrm{Cost}_{SRCCP} = C_{sp} + C_{ss} + C_{sr} = O(n_s \cdot n\log n) + O(n_s \log n_s) + O(n \cdot n_g) \qquad (1)$$

The MRCCP algorithm is an iterative approach. In each iteration, the route for one group of people is chosen and the capacities along the route are reserved. The total number of iterations is determined by the number of groups generated. In each iteration, the route with the earliest destination arrival time from each source node to any exit node is re-computed at a cost of O(n_s · n log n), and reservations are made for the node and edge capacities along the chosen route at a cost of O(n). The cost model of the MRCCP algorithm is given as follows:

$$\mathrm{Cost}_{MRCCP} = O((n_s \cdot n\log n + n) \cdot n_g) \qquad (2)$$
In both cost models, the number of groups generated for the evacuation plan depends on the network configuration, which includes the maximum capacities of nodes and edges and the number of people to be evacuated at each source node.
5 Solution Quality and Performance Evaluation
In this section, we present the experiment design, our experiment setup, and the results of our experiments on a building dataset.
5.1 Experiment Design
Figure 3 describes the experimental design used to evaluate the impact of parameters on the algorithms. The purpose is to compare the quality of solution and the computational cost of the two proposed algorithms with those of EVACNET, which produces the optimal solution. First, a test dataset which represents a building layout or road network is chosen or generated. The dataset is an evacuation network characterized by its route capacities and its size (number of nodes and edges). Next, a generator is used to produce the initial state of the evacuation by populating the network with a distribution model that assigns people to source nodes. The initial state is converted to EVACNET input format to produce the optimal solution via EVACNET, and converted to node-edge graph format to evaluate the two proposed heuristic algorithms. The solution qualities and algorithm performance are then analyzed in the analysis module.
Fig. 3. Experiment Design
5.2 Experiment Setup and Results
The test dataset we used in the following experiments is the floor map of Elliott Hall, a 6-story building on the University of Minnesota campus. The dataset network consists of 444 nodes (including 5 exit nodes) and 475 edges, with a total node capacity of 3783 people. The generator produces initial states by varying the source node ratio and occupancy ratio from 10% to 100%. The experiments were conducted on a workstation with an Intel Pentium III 1.2 GHz CPU, 256 MB RAM and the Windows 2000 Professional operating system. The initial state generator distributes P_n people to S_n randomly chosen source nodes. The source node ratio is defined as S_n divided by the total number of nodes, and the occupancy ratio is defined as P_n divided by the total capacity of all nodes.

We want to answer two questions: (1) How does the people distribution affect the performance and solution quality of the algorithms? (2) Are the algorithms scalable with respect to the number of people to be evacuated?

Experiment 1: Effect of People Distribution. The purpose of the first experiment is to evaluate how the people distribution affects the quality of the solution and the performance of the algorithms. We fixed the occupancy ratio and varied the source node ratio to observe the quality of the solution and the running time of the two proposed algorithms and EVACNET. The experiment was done with the occupancy ratio fixed at each value from 10% to 100% of total capacity. Here we present the results with the occupancy ratio fixed at 30% and the source node ratio varying from 30% to 100%, which shows a typical result across all test cases. Figure 4 shows the total evacuation time given by the three algorithms, and Figure 5 shows their running time. As seen in Figure 4, at each source node ratio MRCCP produces a solution whose total evacuation time is within 10% of the optimal solution produced by EVACNET. The quality of solution of MRCCP is not affected by the distribution of people when the total number of people is fixed. For SRCCP, the solution is 59% longer than the EVACNET optimal solution when the source node ratio is 30%, and drops to 29% longer when the source node ratio increases to 100%. This shows that the solution quality of SRCCP improves as the source node ratio increases. In Figure 5, we can see that the running time of EVACNET grows much faster than the running time of SRCCP and MRCCP as the source node ratio increases.
Fig. 4. Quality of Solution With Respect to Source Node Ratio

Fig. 5. Running Time With Respect to Source Node Ratio
This experiment shows: (1) SRCCP produces solutions closer to the optimal solution when the source node ratio is higher. (2) MRCCP produces close-to-optimal solutions (less than 10% longer than optimal) in less than half the running time of EVACNET. (3) The distribution of people does not affect the performance of the two proposed algorithms when the total number of people is fixed.

Experiment 2: Scalability with Respect to Occupancy Ratio. In this experiment, we evaluated the performance of the algorithms when the source node ratio is fixed and the occupancy ratio is increasing. Figure 6 and Figure 7 show the total evacuation time and the running time of the three algorithms when the source node ratio is fixed at 70% and the occupancy ratio varies from 10% to 70%, which is a typical case among all test cases. As seen in Figure 6, compared with the optimal solution by EVACNET, the solution quality of SRCCP decreases as the occupancy ratio increases, while the solution quality of MRCCP remains within 10% of the optimal solution. In Figure 7, the running time of EVACNET grows significantly as the occupancy ratio grows, while the running time of MRCCP remains less than half that of EVACNET and grows only linearly.
Fig. 6. Quality of Solution With Respect to Occupancy Ratio

Fig. 7. Running Time With Respect to Occupancy Ratio
This experiment shows: (1) The solution quality of SRCCP goes down when the total number of people increases. (2) MRCCP is scalable with respect to the number of people.
6 Conclusion and Future Work
In this paper, we proposed and evaluated two heuristic algorithms based on a capacity constrained routing approach. Cost models and experimental evaluations using a real building dataset are presented. The proposed SRCCP algorithm produces a plan almost instantly, but the quality of the solution suffers as the number of evacuees grows. The MRCCP algorithm produces solutions within 10% of the optimal solution, while its running time scales with the number of evacuees and is less than half that of the optimal algorithm. Both algorithms are scalable with respect to the number of evacuees. Currently, we choose the shortest-travel-time route without considering the available capacity of the route. In many cases, a longer route with larger available capacity may be a better choice. In our future work, we
would like to explore heuristics with a route-ranking method based on weighted available capacity and travel time when choosing the best routes. We also want to extend and apply our approach to vehicle evacuation in transportation road networks. Modelling vehicle traffic during an evacuation is more complicated than modelling pedestrian movement in a building evacuation, because modelling vehicle traffic at intersections and the cost of taking turns are challenging tasks. Current vehicle traffic simulation tools, such as DYNASMART [14] and DYNAMIT [2], use an assignment-simulation method to simulate the traffic based on origin-destination routes. We plan to extend our approach to work with such traffic simulation tools to address vehicle evacuation problems.

Acknowledgment. We are particularly grateful to the Spatial Database Group members for their helpful comments and valuable discussions. We would also like to express our thanks to Kim Koffolt for improving the readability of this paper. This work is supported by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory, under contract number DAAD19-01-2-0014. The content does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. AHPCRC and the Minnesota Supercomputer Institute provided access to computing facilities.
References

1. Hurricane Evacuation web page. http://i49south.com/hurricane.htm, 2002.
2. M. Ben-Akiva et al. Development of a Dynamic Traffic Assignment System for Planning Purposes: DynaMIT User's Guide. ITS Program, MIT, 2002.
3. S. Brown. Building America's Anti-Terror Machine: How Infotech Can Combat Homeland Insecurity. Fortune, pages 99–104, July 2002.
4. The Volpe National Transportation Systems Center. Improving Regional Transportation Planning for Catastrophic Events (FHWA). Volpe Center Highlights, pages 1–3, July/August 2002.
5. L. Chalmet, R. Francis, and P. Saunders. Network Models for Building Evacuation. Management Science, 28:86–105, 1982.
6. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
7. E.W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1:269–271, 1959.
8. ESRI. GIS for Homeland Security, an ESRI white paper. http://www.esri.com/library/whitepapers/pdfs/homeland_security_wp.pdf, November 2001.
9. R. Francis and L. Chalmet. A Negative Exponential Solution to an Evacuation Problem. Research Report No. 84-86, National Bureau of Standards, Center for Fire Research, October 1984.
10. B. Hoppe and E. Tardos. Polynomial Time Algorithms for Some Evacuation Problems. Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 433–441, 1994.
11. B. Hoppe and E. Tardos. The Quickest Transshipment Problem. Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 512–521, January 1995.
12. T. Kisko and R. Francis. EVACNET+: A Computer Program to Determine Optimal Building Evacuation Plans. Fire Safety Journal, 9:211–222, 1985.
13. T. Kisko, R. Francis, and C. Nobel. EVACNET4 User's Guide. University of Florida, http://www.ise.ufl.edu/kisko/files/evacnet/, 1998.
14. H.S. Mahmassani et al. Development and Testing of Dynamic Traffic Assignment and Simulation Procedures for ATIS/ATMS Applications. Technical Report DTFH61-90-R-00074-FG, CTR, University of Texas at Austin, 1994.
Locating Hidden Groups in Communication Networks Using Hidden Markov Models

Malik Magdon-Ismail¹, Mark Goldberg¹, William Wallace², and David Siebecker¹

¹ CS Department, RPI, Rm 207 Lally, 110 8th Street, Troy, NY 12180, USA. {magdon,goldberg,siebed}@cs.rpi.edu
² DSES Department, RPI, 110 8th Street, Troy, NY 12180, USA. [email protected]
Abstract. A communication network is a collection of social groups that communicate via an underlying communication medium (for example, newsgroups over the Internet). In such a network, a hidden group may try to camouflage its communications amongst the typical communications of the network. We study the task of detecting such hidden groups given only the history of the communications for the entire communication network. We develop a probabilistic approach using a Hidden Markov model of the communication network. Our approach does not require the use of any semantic information regarding the communications. We present the general probabilistic model, and show the results of applying this framework to a simplified society. For 50 time steps of communication data, we can obtain greater than 90% accuracy in detecting both whether or not there is a hidden group, and who the hidden group members are.
1 Introduction
The tragic events of September 11, 2001 underline the need for a tool capable of detecting groups that hide their existence and functionality within a large and complicated communication network such as the Internet. In this paper, we present an approach to identifying such groups. Our approach does not require the use of any semantic information pertaining to the communications. This is preferable because communication within a hidden group is usually encrypted in some way, hence the semantic information will be misleading or unavailable. The social science literature has developed a number of theories regarding how social groups evolve and communicate [1,2,3]. For example, individuals have a higher tendency to communicate if they are members of the same group, in accordance with homophily theory. Given some of the basic laws of how social groups evolve and communicate, one can construct a model of how the communications within the society should evolve, given the (assumed) group structure. If the group structure does not adequately explain the observed communications, but the addition of an extra, hidden, group does explain them, then we
have grounds to believe that there is a hidden group attempting to camouflage its communications within the existing communication network. The task is to determine whether such a group exists, and to identify its members. We use a maximum likelihood approach to solving this task. Our approach is to model the evolution of a communication network using a Hidden Markov Model. A Hidden Markov model is appropriate when an observed process (in our case the macroscopic communication structure) is naturally driven by an unobserved, or hidden, Markov process (in our case the microscopic group evolution). Hidden Markov models have been used extensively in such diverse areas as speech recognition [4,5]; inferring the language of simple grammars [6]; computer vision [7]; time series analysis [8]; and biological sequence analysis and protein structure prediction [9,10,11,12,13]. Our interpretation of the group evolution giving rise to the observed macroscopic communications evolution makes it natural to model the evolution of communication networks using a Hidden Markov model as well. Details about the general theory of Hidden Markov models can be found in [4,14,15]. In social network analysis there are many static models of, and static metrics for, the measurement and evaluation of social networks [16]. These models range from graph structures to large simulations of agent behavior. The models have been used to discover a wide array of important communication and sociological phenomena, from the small world principle [17] to communication theories such as homophily and contagion [1]. These models, as good as they are, are not sufficient to study the evolution of social groups and the communication networks that they use; most focus on the study of the evolution of the network itself. Few attempt to explain how the use of the network shapes its evolution [18]. Few can be used to predict the future of the network and communication behavior over that network. Though there is an abundance of simulation work in the field of computational analysis of social and organizational systems [2,19,3] that attempts to develop dynamic models for social networks, none has employed the proposed approach, and few incorporate sound probability theory or statistics [20] as the underlying model. The outline of the paper is as follows. First we consider a simplified example, followed by a description of the general framework. We also present some results to illustrate proof of concept on an example, and we end with some concluding remarks.

1.1 Example
A simple, concrete example will help to convey the details of our method; a more detailed formulation will follow. Consider newsgroups, for example alt.revisionism or alt.movies. A posting to a newsgroup in reply to a previous posting is a communication between two parties. Now imagine the existence of a hidden group that attempts to hide its communications, as illustrated in the figures below. Figure 1(a) shows the group structure: there are 4 observed groups, and a fifth, hidden group also exists, whose members are unshaded. We do not observe the actual group composition, but rather the communications (who is posting and
Fig. 1. Illustration of a society.

Fig. 2. Communication time series of two societies.
replying to posts in a given newsgroup). This is illustrated in Figure 1(b), where all the communications are between members of the same group. Figure 1(c) illustrates the situation when the hidden group members need to broadcast some information among themselves. The hidden group member who initiates the broadcast (say X) communicates with all the other hidden group members who are in the same visible groups as X. The message is then passed on in a similar manner until all the hidden members have received the broadcast. Notice that no communication needs to occur between members who are not in the same group, yet a message can be broadcast across the whole group. In order to maintain the appearance of being a bona-fide member of a particular newsgroup, a hidden node will participate in the "normal" communications of that group as well. Only occasionally will a message need to be broadcast through the hidden group, resulting in a communication graph as in Figure 1(c). The matter is complicated by the fact that the communications in Figure 1(c) will be overlayed onto the normal group communications, Figure 1(b). What we observe is a time
series of node-to-node communications, as illustrated in Figure 2, which shows the evolving communications of two hypothetical communities. The individuals are represented by nodes in the graph. An edge between two nodes represents communication during that time period. The thickness of the edge indicates the intensity of the communications. The dotted lines indicate communications between the hidden group members. The task is to take the communication history of the community (for example, the one above) and to determine whether or not there exists a hidden group functioning within this community, and to identify its members. It would also be useful to identify which members belong to which groups. The hidden community may or may not be functioning as an aberrant group trying to camouflage its communications. In the above example, the hidden community is trying to camouflage its broadcasts. However, the hidden group could just as well be a new group that has suddenly arisen, and we would like to discover its existence. We assume that we know the number of observed groups (for example, the newsgroups in the society are known), and we have a model of how the society evolves. We do not know who belongs to which newsgroup, and all communications are aggregated into the communications graph for a given time period. We will develop a framework to determine the presence of a hidden group that does not rely on any semantic information regarding the communications. The motivation for this approach is that even if the semantics are available (which is not likely), the hidden communications will usually be encrypted and designed so as to mimic the regular communications anyway.
2 Probabilistic Setup
We will illustrate our general methodology by first developing the solution of the simplified example discussed above. The general case is similar, with only minor technical differences. The first step is to build a model for how individuals move from group to group. More specifically, let Ng be the number of observed groups in the society, and denote the groups by F1, ..., FNg. Let n be the number of individuals in the society, and denote the individuals by x1, ..., xn. We denote by F(t) the micro-state of the society at time t. In our case, F(t) is the membership matrix at time t, a binary n × Ng matrix that specifies who is in which group,

$$F_{ij}(t) = \begin{cases} 1 & \text{if node } x_i \text{ is in group } F_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

The group membership may change with time. We assume that F(t) is a Markov chain; in other words, the members decide which groups to belong to at time t + 1 based solely on the group structure at time t. In determining which groups to join in the next period, the individuals may have their own preferences, thus there is some transition probability distribution

$$P[F(t+1) \mid F(t), \theta], \qquad (2)$$
where θ is a set of (fixed) parameters that determine, for example, the individual preferences. This transition matrix represents what we define as the micro-laws of the society, which determine how its group structure evolves. A particular setting of the parameters θ is a particular realization of the micro-laws. We will assume that the group membership is static, which is a trivial special case of a Markov chain where the transition matrix is the identity matrix. In the general case, this need not be so; we pick this simplified case to illustrate the mechanics of determining the hidden group without complicating it with the group dynamics. Thus, the group structure F(t) is fixed, so we will drop the t dependence. We do not observe the group structure, but rather the communications that are a result of this structure. We thus need a model for how the communications arise out of the groups. Let C(t) denote the communications graph at time t. C_ij(t) is the intensity of the communication between node x_i and node x_j at time t. C(t) is the "expression" of the micro-state F. Thus, there is some probability distribution

$$P[C(t) \mid F(t), \lambda], \qquad (3)$$
where λ is a set of parameters governing how the group structure gets expressed in the communications. Since F(t) is a Markov chain, C(t) follows a Hidden Markov process governed by the two probability distributions P[F(t+1)|F(t), θ] and P[C(t)|F(t), λ]. In particular, we will assume that there is some parameter 0 < λ < 1 that governs how nodes in the same group communicate. We assume that the communication intensity C_ij(t) has a Poisson distribution with parameter Kλ, where K is the number of groups that both nodes are members of. If K = 0, we will set the Poisson parameter to λ²; thus, nodes that are not in any common group will tend not to communicate. The Poisson distribution is often used to model such "arrival" processes. Thus,

$$P[C_{ij} = k] = \begin{cases} \mathcal{P}(k; K\lambda) & x_i \text{ and } x_j \text{ are in } K > 0 \text{ groups together}, \\ \mathcal{P}(k; \lambda^2) & x_i \text{ and } x_j \text{ are in no groups together}, \end{cases} \qquad (4)$$

where $\mathcal{P}(k; \lambda)$ is the Poisson probability distribution function,

$$\mathcal{P}(k; \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}. \qquad (5)$$
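As a sanity check of Eqs. (4)-(5), one can sample synthetic communication graphs from this model. A minimal sketch (our own code, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_communications(F, lam):
    """Sample one communication graph C(t) from the model of Eqs. (4)-(5):
    C_ij ~ Poisson(K * lam) if x_i and x_j share K > 0 groups, and
    C_ij ~ Poisson(lam**2) otherwise. F is the (n x Ng) 0/1 membership matrix."""
    K = F @ F.T                        # K[i, j] = number of shared groups
    rate = np.where(K > 0, K * lam, lam ** 2)
    C = rng.poisson(rate)
    C = np.triu(C, k=1)                # keep one draw per unordered pair
    return C + C.T                     # symmetric intensity matrix
```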
We will assume that the communications between different pairs of nodes are independent of each other, as are communications at different time steps. Suppose we have a broadcast hidden group in the society as well, as illustrated in Figure 1(c). We assume a particular model for the communications within the hidden group, namely that every pair of nodes that are in the same visible group communicate. The intensity of the communications, B, is assumed to follow a Poisson distribution with parameter β, thus

$$P[B = k] = \mathcal{P}(k; \beta). \qquad (6)$$
We have thus fully specified the model for the society, and how the communications will evolve. The task is to use this model to determine, from the communication history (as in Figure 2), whether or not there exists a hidden group, and if so, who the hidden group members are.

2.1 The Maximum Likelihood Approach
For simplicity we will assume that the only unknown is F, the group structure. Thus, F is static and unknown, and λ and β are known. Let H be a binary indicator variable that is 1 if a hidden group is present, and 0 if not. Our approach is to determine how likely the observed communications would be if there is a hidden group, l1, and compare this with how likely the observed communications would be if there were no hidden group, l0. To do this, we use the model describing the communications evolution with a hidden group (resp. without a hidden group) to find what the best group structure F would be if this model were true, and compute the likelihood of the communications given this group structure and the model. Thus, we have two optimization problems,

$$l_1 = \max_{F,v} P[\text{Data} \mid F, v, \lambda, \beta, H = 1], \qquad (7)$$

$$l_0 = \max_{F} P[\text{Data} \mid F, \lambda, H = 0], \qquad (8)$$
where Data represents the communication history of the society, namely $\{C(t)\}_{t=1}^{T}$, and v is a binary indicator variable that indicates who the hidden and visible members of the society are. If l1 > l0, then the communications are more likely if there is a hidden group, and we declare that there is a hidden group. As a by-product of the optimization, we obtain F and v; hence we identify not only who the hidden group members are, but also the remaining group structure of the society. In what follows, we derive the likelihood function that needs to be optimized for our example society. What remains is then to solve the two optimization problems to obtain l1, l0. The simpler case is when there is no hidden group, which we analyze first. Suppose that F is given. Let f_ij be the number of groups that nodes x_i and x_j are both members of,

$$f_{ij} = \sum_{k} F_{ik} F_{jk}. \qquad (9)$$
Let λ_ij be the Poisson parameter for the intensity of the communication between nodes x_i and x_j,

$$\lambda_{ij} = \begin{cases} \lambda^2 & f_{ij} = 0, \\ \lambda f_{ij} & f_{ij} > 0. \end{cases} \qquad (10)$$

Let P(t) be the probability of obtaining the observed communications C(t) at time t. Since the communications between nodes are assumed independent, and
each is distributed according to a Poisson process with parameter λ_ij, we have that

$$P(t) = P[C(t) \mid F, \lambda, H = 0] \qquad (11)$$

$$\phantom{P(t)} = \prod_{i<j}^{n} \mathcal{P}(C_{ij}(t); \lambda_{ij}). \qquad (12)$$

Since the communications at different times are independent (given the group structure at that time), we have that

$$P[\text{Data} \mid F, \lambda, H = 0] = \prod_{t=1}^{T} \prod_{i<j}^{n} \mathcal{P}(C_{ij}(t); \lambda_{ij}). \qquad (13)$$
Since l0 is given by the maximum value of this function, we can equivalently maximize the logarithm. Further, the value of F that attains this maximum is the estimate of the group structure, assuming that there is no hidden group,

$$\log l_0 = \max_{\mathbf{F}} \sum_{t=1}^{T} \sum_{i<j}^{n} \log \mathcal{P}(C_{ij}(t); \lambda_{ij}), \qquad (14)$$

$$\mathbf{F}_0 = \operatorname*{argmax}_{\mathbf{F}} \sum_{t=1}^{T} \sum_{i<j}^{n} \log \mathcal{P}(C_{ij}(t); \lambda_{ij}). \qquad (15)$$
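The no-hidden-group objective of Eq. (14) is straightforward to evaluate for a candidate F. A minimal sketch, assuming the same data layout as in the earlier simulation sketch (a list of T symmetric integer matrices for {C(t)}):

```python
import math

def log_likelihood_no_hidden(Cs, F, lam):
    """log P[Data | F, lam, H=0] of Eq. (14) for a fixed group structure.
    Cs is a list of T symmetric communication matrices, F an n x Ng
    0/1 membership matrix."""
    n = len(F)
    rates = {}
    for i in range(n):
        for j in range(i + 1, n):
            f_ij = sum(a * b for a, b in zip(F[i], F[j]))      # Eq. (9)
            rates[(i, j)] = lam * f_ij if f_ij > 0 else lam**2  # Eq. (10)
    total = 0.0
    for C in Cs:
        for (i, j), rate in rates.items():
            k = C[i][j]
            # log P(k; rate) = -rate + k*log(rate) - log(k!)
            total += -rate + k * math.log(rate) - math.lgamma(k + 1)
    return total
```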
When a hidden group is present, we use an indicator variable v to identify who the hidden group members are. For illustration, we assume that there are as many hidden members as there are groups, as in Figure 1. At time t, the observed communications can be broken down into the regular communications plus a hidden group broadcast. The intensity B of the broadcast can be B = 0, 1, ..., with B distributed according to a Poisson distribution, P[B|β] = P(B; β). This is our model of the hidden group communications. We can thus decompose the communication graph at time t into two components, a residual graph R(t) and the hidden graph H(t),

$$C(t) = R(t) + H(t). \qquad (16)$$
The hidden graph will be of the form in Figure 1(c) and is completely specified by the broadcast intensity B. This in turn specifies R(t) by R(t) = C(t) − H(t). Suppose that we know F, v. In this case, P(t) is given by

$$P(t) = P[C(t)|\mathbf{F}, v, \lambda, \beta, H = 1] \qquad (17)$$
$$= \sum_{B=0}^{\infty} P[R(t)|B]\, P[B], \qquad (18)$$
where P[R(t)|B] is given by an expression exactly analogous to (12),

$$P[R(t)|B] = \prod_{i<j}^{n} \mathcal{P}(R_{ij}(t; B); \lambda_{ij}), \qquad (19)$$
where R(t; B) is the residual graph depending on B, and λij is defined exactly analogously to (10) with fij = Σk Fik Fjk. v places a constraint on what F can be, and serves to determine what the hidden group broadcast graph can be. Note that the sum in (18) gets truncated when B gets large enough that the residual graph would have negative edges, which is impossible, since it must be a communications graph. We will denote this maximum possible value of B by B^t_max. Then, using the fact that P[B] = P(B; β), we get that

$$P(t) = \sum_{B=0}^{B^t_{\max}} \mathcal{P}(B; \beta) \prod_{i<j}^{n} \mathcal{P}(R_{ij}(t; B); \lambda_{ij}). \qquad (20)$$
Taking the logarithm and summing over t, we get that

$$\log l_1 = \max_{\mathbf{F}, v} \sum_{t=1}^{T} \log \sum_{B=0}^{B^t_{\max}} \mathcal{P}(B; \beta) \prod_{i<j}^{n} \mathcal{P}(R_{ij}(t; B); \lambda_{ij}), \qquad (21)$$

$$\{\mathbf{F}_1, v_1\} = \operatorname*{argmax}_{\mathbf{F}, v} \sum_{t=1}^{T} \log \sum_{B=0}^{B^t_{\max}} \mathcal{P}(B; \beta) \prod_{i<j}^{n} \mathcal{P}(R_{ij}(t; B); \lambda_{ij}). \qquad (22)$$
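One t-summand of Eq. (21) can be computed by direct enumeration of B up to B^t_max. The sketch below assumes that the broadcast graph H(t) places intensity B on every pair of hypothesized hidden members (our reading of Figure 1(c), not stated explicitly in the text); for larger societies the products would need log-space arithmetic to avoid underflow.

```python
import math
from itertools import combinations

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def log_term_with_hidden(C, rates, hidden, beta):
    """One t-summand of Eq. (21): log of sum_{B=0}^{B_max^t} P(B; beta) *
    prod_{i<j} P(R_ij(t; B); lambda_ij).  `rates` maps node pairs (i, j),
    i < j, to lambda_ij; `hidden` is the hypothesized hidden member set."""
    hidden_pairs = set(combinations(sorted(hidden), 2))
    # B_max^t: the largest B leaving all residual edges non-negative
    b_max = min(C[i][j] for i, j in hidden_pairs)
    total = 0.0
    for B in range(b_max + 1):
        prob = poisson_pmf(B, beta)
        for (i, j), rate in rates.items():
            r = C[i][j] - B if (i, j) in hidden_pairs else C[i][j]
            prob *= poisson_pmf(r, rate)
        total += prob
    return math.log(total)
```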
Thus, in order to obtain l0, l1, F0, F1, v1, we need to solve two combinatorial optimization problems. Notice that the size of the search space is huge. When there is no hidden group, the size of the search space is 2^(nNg), and the evaluation of the objective function is O(Tn²). When there is a hidden group, the size of the search space is 2^((n−Ng)Ng) · n!/(n − Ng)!, and the evaluation of the objective function is O(Cmax Tn²), where Cmax is the maximum communication intensity between any two nodes. If, in addition, the parameters of the model, namely λ and β, are also not known, then we have to optimize with respect to these parameters as well, in which case we have a mixed continuous/discrete optimization problem. Some algorithms for discrete/combinatorial optimization problems are reactive search [21,22] and randomized approaches, see for example [23]. Continuous problems are often approached using derivative-based methods such as gradient descent, conjugate gradients, Levenberg-Marquardt, etc. [24]. Mixed discrete/continuous problems have not been studied as intensely, and most methods are based upon simulated annealing [25] or genetic algorithms [26]. For illustration, we assume that the parameters are known; the purpose here is to set up the framework for the problem. To illustrate, we have implemented a simulated annealing approach to the combinatorial optimization (see the sketch at the end of this subsection). We used 10,000 steps of Monte Carlo, where at each step the current group structure F was randomly perturbed. The probability of perturbation decreased as a function of the step number.

Results. We show results on a small society (9 nodes) with 3 groups. We picked this society so that it would be computationally efficient to run many simulations. We ran simulations to test both the false positive (declaring a hidden group when there isn't one) and false negative (declaring no hidden group when there
is one) errors. For each, we generated a society group structure randomly, and then generated the communication time series. These communication time series were fed into the optimization algorithm to obtain l0, l1, F0, F1, v1. If l1 > l0, we declare a hidden group to be present and identify its members in v1 and the group structure in F1. If not, we declare no hidden group and identify the group structure in F0. The results are summarized in Table 1.

Table 1. Error matrices for different time periods. % correct is the percentage of nodes identified correctly (hidden or not) when a hidden group is present and is predicted correctly.

10 time steps:
            Predicted H
True H       1      0
  1        0.73   0.27
  0        0.19   0.81
% correct = 84%

20 time steps:
            Predicted H
True H       1      0
  1        0.78   0.28
  0        0.04   0.96
% correct = 89%

50 time steps:
            Predicted H
True H       1      0
  1        0.88   0.12
  0        0.03   0.97
% correct = 94%
As can be seen, with just 50 time steps of data, the error rate in predicting the presence of a hidden group is below 0.1.
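The simulated annealing search described above could look like the following sketch; `log_likelihood_no_hidden` is the function from the earlier sketch, and the single-bit perturbation and the 1/step acceptance schedule are our own illustrative choices, since the paper specifies only the step count and a decreasing perturbation probability.

```python
import random

def anneal_group_structure(Cs, n, n_groups, lam, steps=10_000):
    """Monte Carlo search for F0 = argmax_F log P[Data|F, lam, H=0]."""
    F = [[random.randint(0, 1) for _ in range(n_groups)] for _ in range(n)]
    cur = log_likelihood_no_hidden(Cs, F, lam)
    for step in range(1, steps + 1):
        i, g = random.randrange(n), random.randrange(n_groups)
        F[i][g] ^= 1                      # flip one membership bit
        cand = log_likelihood_no_hidden(Cs, F, lam)
        # accept improvements always; worse moves with decaying probability
        if cand >= cur or random.random() < 1.0 / step:
            cur = cand
        else:
            F[i][g] ^= 1                  # undo the flip
    return F, cur
```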
2.2 General Maximum Likelihood Formulation
In general, the group structure evolves according to the micro-law transition matrix for the Markov chain, P[F(t+1)|F(t), θ], and the group structure gets expressed as a communication graph according to P[C(t)|F(t), λ]. In our example, P[F(t+1)|F(t), θ] was the identity matrix, and P[C(t)|F(t), λ] was based on modeling the communications using Poisson processes. A detailed description of a general model that describes an evolving society over a communication network is given in [27]. Let N = {x1, ..., xn} be the set of nodes and let H ⊂ N be the subset of nodes that forms the hidden group. We assume that H does not change with time. The hidden group may have a communication pattern governed by a different probability distribution, P[H(t)|H, β], where β is a set of parameters that governs this distribution. The group structure of the society from t = 1, ..., T is given by the time series of matrices {F(t)}. In our example, this time series was specified by the constant matrix F. If there is no hidden group, we can compute the likelihood of observing the communication data {C(t)} as follows. The probability of obtaining the evolution F(1), F(2), ..., F(T) is given by

$$P[\{\mathbf{F}(t)\}|\theta] = P[\mathbf{F}(1)] \prod_{t=2}^{T} P[\mathbf{F}(t)|\mathbf{F}(t-1), \theta]. \qquad (23)$$
The likelihood of obtaining the observed communications given this evolution is then given by

$$P[\{C(t)\}|\{\mathbf{F}(t)\}, \theta, \lambda] = \prod_{t=1}^{T} P[C(t)|\mathbf{F}(t), \lambda]. \qquad (24)$$
Ideally, we would like to compute

$$l_0 = P[\{C(t)\}|\theta, \lambda] = \sum_{\{\mathbf{F}(t)\}} P[\{C(t)\}, \{\mathbf{F}(t)\}|\theta, \lambda] \qquad (25)$$
$$= \sum_{\{\mathbf{F}(t)\}} P[\{\mathbf{F}(t)\}|\theta]\; P[\{C(t)\}|\{\mathbf{F}(t)\}, \theta, \lambda] \qquad (26)$$
$$= \sum_{\{\mathbf{F}(t)\}} P[\mathbf{F}(1)]\, P[C(1)|\mathbf{F}(1), \lambda] \prod_{t=2}^{T} P[\mathbf{F}(t)|\mathbf{F}(t-1), \theta]\, P[C(t)|\mathbf{F}(t), \lambda]. \qquad (27)$$
If θ and λ are known, then this summation can be computed using a Monte Carlo simulation. If not, then we find the values of θ, λ that maximize l0. In this case, the optimization is computationally costly, and an alternative is to simultaneously optimize with respect to {F(t)}, θ, λ, which is itself a non-trivial mixed discrete/continuous optimization problem. When a hidden group H is present, we decompose the communications at time t into the hidden communications H(t) and the residual communications R(t), with C(t) = R(t) + H(t). Then,

$$P[C(t)|\{\mathbf{F}(t)\}, H, \theta, \lambda, \beta] = \sum_{H(t)} P[R(t)|\{\mathbf{F}(t)\}, \theta, \lambda]\, P[H(t)|H, \beta], \qquad (28)$$
where this summation is finite because both R(t) and H(t) must have non-negative edges. Taking the product over t gives us

$$P[\{C(t)\}|\{\mathbf{F}(t)\}, H, \theta, \lambda, \beta] = \prod_{t=1}^{T} \sum_{H(t)} P[R(t)|\{\mathbf{F}(t)\}, \theta, \lambda]\, P[H(t)|H, \beta], \qquad (29)$$
and finally, multiplying by P[{F(t)}|θ] and summing over {F(t)}, we get that

$$l_1 = \max_{H} \sum_{\{\mathbf{F}(t)\}} P[\mathbf{F}(1)] \prod_{t=1}^{T-1} P[\mathbf{F}(t+1)|\mathbf{F}(t), \theta]\; P[\{C(t)\}|\{\mathbf{F}(t)\}, H, \theta, \lambda, \beta], \qquad (30)$$

where P[{C(t)}|{F(t)}, H, θ, λ, β] is given in (29). The hidden group H at which the maximum is attained identifies who the hidden group members are. We assume that the Hidden Markov model and its parameters (θ, λ, β) are known. If the parameters are not known, then they have to be optimized as well. For a relatively simple hidden group communication structure, for example the broadcast hidden group in our example, the computation of the likelihood is tractable. For more complicated examples, one may need to use heuristic approaches to these combinatorial optimization problems.
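As noted after Eq. (27), when θ and λ are known, l0 can be estimated by Monte Carlo: sample group-structure trajectories from the Markov-chain prior and average the conditional likelihood of the observed communications. A minimal sketch with user-supplied samplers; all function names here are hypothetical, not from the paper.

```python
def monte_carlo_l0(Cs, sample_initial_F, sample_next_F, likelihood,
                   n_samples=1000):
    """Estimate Eq. (27): draw {F(t)} from P[F(1)] * prod P[F(t)|F(t-1), theta]
    and average the conditional likelihood of the observed communications.
    `likelihood(C, F)` should return P[C(t)|F(t), lambda]."""
    total = 0.0
    for _ in range(n_samples):
        F = sample_initial_F()
        weight = likelihood(Cs[0], F)      # P[C(1)|F(1), lambda]
        for C in Cs[1:]:
            F = sample_next_F(F)           # one Markov-chain step
            weight *= likelihood(C, F)
        total += weight
    return total / n_samples
```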
3 Concluding Remarks
We have presented a framework for determining the members of a hidden group that attempts to camouflage its broadcasts within a functioning communication network. The basic idea is to first have a model for the society's evolution. Then, by examining the discrepancy between the observed and expected communications, one can draw conclusions regarding the presence or absence of a hidden group. We focused on a specific example, where we made a number of assumptions: static group structure; Poisson communication model; independence between communications at different times; hidden group communications consisting only of broadcasts; and a maximum likelihood formulation. These restrictions were made primarily for expository and computational reasons, and are dropped in the general framework (resulting in more computationally intensive and complex optimization problems). Ongoing research involves developing efficient heuristic algorithms that solve the combinatorial optimization problems faced in the more general framework, as well as applying our methodology toward finding hidden groups in real societies.
References

1. Monge, P., Contractor, N.: Theories of Communication Networks. Oxford University Press (2002)
2. Carley, K., Prietula, M., eds.: Computational Organization Theory. Lawrence Erlbaum Associates, Hillsdale, NJ (2001)
3. Sanil, A., Banks, D., Carley, K.: Models for evolving fixed node networks: Model fitting and model testing. Journal of Mathematical Sociology 21 (1996) 173–196
4. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (1989) 257–286
5. Rabiner, L.R., Juang, B.H.: An introduction to hidden Markov models. IEEE ASSP Magazine (1986) 4–15
6. Georgeff, M.P., Wallace, C.S.: A general selection criterion for inductive inference. European Conference on Artificial Intelligence (ECAI 84) (1984) 473–482
7. Bunke, H., Caelli, T., eds.: Hidden Markov Models. Series in Machine Perception and Artificial Intelligence – Vol. 45. World Scientific (2001)
8. Edgoose, T., Allison, L.: MML Markov classification of sequential data. Stats. and Comp. 9 (1999) 269–278
9. Allison, L., Wallace, C.S., Yee, C.N.: Finite-state models in the alignment of macro-molecules. J. Molec. Evol. 35 (1992) 77–89
10. Allison, L., Wallace, C.S., Yee, C.N.: Normalization of affine gap costs used in optimal sequence alignment. J. Theor. Biol. 161 (1993) 263–269
11. Bystroff, C., Thorsson, V., Baker, D.: HMMSTR: A hidden Markov model for local sequence-structure correlations in proteins. Journal of Molecular Biology 301 (2000) 173–190
12. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of sequence-structure motifs. J. Mol. Biol. 281 (1998) 565–577
13. Bystroff, C., Shao, Y.: Fully automated ab initio protein structure prediction using I-sites, HMMSTR and ROSETTA. Bioinformatics 18 (2002) S54–S61
14. Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA (1998)
15. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, New York (2001)
16. Wasserman, S., Faust, K.: Social Network Analysis. Cambridge University Press (1994)
17. Watts, D.J.: Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press, Princeton, NJ (1999)
18. Butler, B.: The dynamics of cyberspace: Examining and modelling online social structure. Technical report, Carnegie Mellon University, Pittsburgh, PA (1999)
19. Carley, K., Wallace, A.: Computational organization theory: A new perspective. In Gass, S., Harris, C., eds.: Encyclopedia of Operations Research and Management Science. Kluwer Academic Publishers, Norwell, MA (2001)
20. Snijders, T.: The statistical evaluation of social network dynamics. In Sobel, M., Becker, M., eds.: Sociological Methodology. Basil Blackwell, Boston & London (2001) 361–395
21. Battiti, R.: Reactive search: Toward self-tuning heuristics. Modern Heuristic Search Methods, Chapter 4 (1996) 61–83
22. Battiti, R., Protasi, M.: Reactive local search for the maximum clique problem. Technical Report TR-95-052, Berkeley, ICSI, 1947 Center St., Suite 600 (1995)
23. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge, UK (2000)
24. Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995)
25. Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley & Sons Ltd., New York (1989)
26. Stelmack, M., N., N., Batill, S.: Genetic algorithms for mixed discrete/continuous optimization in multidisciplinary design. In: AIAA Paper 98-4771, AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St. Louis, Missouri (1998)
27. Siebeker, D., Goldberg, M., Magdon-Ismail, M., Wallace, W.: A Hidden Markov Model for describing the statistical evolution of social groups over communication networks. Technical report, Rensselaer Polytechnic Institute (2003) Forthcoming.
Automatic Construction of Cross-Lingual Networks of Concepts from the Hong Kong SAR Police Department

Kar Wing Li and Christopher C. Yang
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
{kwli, yang}@se.cuhk.edu.hk
Abstract. The tragic event of September 11 has prompted rapidly growing attention to national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information, written in different languages and stored in different locations, may be seemingly unconnected. Therefore, cross-lingual semantic interoperability is a major challenge in generating an overview of this disparate data and information so that it can be analyzed and searched. Traditional information retrieval (IR) approaches normally require a document to share some keywords with the query. In reality, users may use keywords that are different from those used in the documents. There are then two different term spaces, one for the users and another for the documents. The problem can be viewed as the creation of a thesaurus. The creation of such relationships would allow the system to match queries with relevant documents even though they contain different terms. Apart from this, terrorists and criminals may communicate through letters, e-mails and faxes in languages other than English. Translation ambiguity significantly exacerbates the retrieval problem. To facilitate cross-lingual information retrieval, a corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model to cross the language boundary. However, collecting parallel corpora between European and Oriental languages is not an easy task, due to the unique linguistics and grammar structures of Oriental languages. In this paper, a text-based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. This article then reports an algorithmic approach to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consisted of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based cross-lingual information management and retrieval.
1 Introduction

In a string of fatal attacks that includes the tragic event of September 11, a car bombing in Bali, and an explosion on a French oil tanker off the coast of Yemen, casualties of terrorism have become an increasingly regular feature of daily news all over the globe. These events have prompted rapidly growing attention to national security
and criminal analysis. However, Osama bin Laden's al Qaeda terrorists are not the only threat. We also need to effectively predict and prevent other criminal activities. These include religious, racist and fascist terrorists, opportunistic crime, organized crime (narcocriminal, Mafia, Russian mob, Triads, etc.), political espionage and sabotage, and anarchists and vandals. An intelligent system is required to retrieve relevant information from criminal records and suspect communications. The system should continuously collect information from relevant data streams and compare incoming data to known patterns to detect important anomalies. For example, historical cases of tax fraud can disclose patterns of taxpayers' behaviors and provide indicators of potential fraud. Customers' credit card data can reveal patterns of transactions and help to detect credit card theft. The system should also allow the user to retrieve what persons, organizations, projects, and topics are relevant to a particular event of interest, e.g., the car bombing in Bali. However, information stored in the repositories is often fragmented and unstructured, especially in on-line catalogs. Also, the man-made fog of deliberate deception, which militates against normal pattern learning from databases, causes much crucial information and the underlying knowledge to be buried. This information has therefore become inaccessible. Developing systems that can retrieve relevant information has long been the goal of many researchers, since important domain knowledge or information resides in the databases. Many information retrieval systems have been created in the past for medical diagnosis and business applications. The major difficulties in retrieving relevant information are the lack of explicit semantic clustering of relevant information and the limits of conventional keyword-driven search techniques (either full-text or index-based) [2]. Traditional approaches normally require a document to share some keywords with the query. In reality, it is known that users may use keywords that are different from those used in the documents. There are then two different term spaces, one for the users and another for the documents. How to create relationships for the related terms between the two spaces is an important issue. The problem can be viewed as the creation of a thesaurus. The creation of such relationships would allow the system to match queries with relevant documents even though they contain different terms. Language boundaries are another problem for criminal analysis. In criminal analysis, we need to find out how to frame questions, or create search patterns, that would help an analyst. If the right questions are not posed, the analyst may head down a path with no conclusions. In addition, terrorists and criminals may communicate openly and less openly through letters, e-mails, faxes, bulletin boards, etc., in languages other than English. Translation ambiguity significantly exacerbates the retrieval problem. Use of every possible translation for a single term can greatly expand the set of possible meanings, because some of those translations are likely to introduce additional homonymous or polysemous word senses in the second language. Also, users can have different abilities in different languages, affecting their ability to form queries and refine results. The human expertise to decompose an information need into queries may take an analyst several years to acquire.
However, knowledge-based systems aim to capture human expertise or knowledge by means of computational models. Knowledge acquisition was defined by Buchanan [10] as "the transfer and transformation of potential problem-solving expertise from some knowledge source to a program". This approach to knowledge elicitation is referred to as "knowledge mining" or
"knowledge discovery in databases" [2]. The "knowledge discovery" approach is believed by many Artificial Intelligence experts and database researchers to be useful for resolving the information overload and knowledge acquisition bottleneck problems. In this research, our aim is to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the English/Chinese daily press release documents issued by the Hong Kong Police Department. The research output consists of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based cross-lingual information management and retrieval. Before generating this knowledge base, we first propose a text-based approach to collect the parallel press release documents from the Web.
2 Automatic Construction of Parallel Corpus

Cross-lingual semantic interoperability has drawn significant attention in recent criminal analysis, as the information on criminal activities written in languages other than English has grown exponentially. Since it is impractical to construct bilingual dictionaries or sophisticated multilingual thesauri manually for large applications, the corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model for cross-lingual information retrieval. Many corpora are domain-specific. To deal with criminal analysis, we use the English/Chinese daily press release articles issued by the Hong Kong SAR Police Department. Bates [1] stressed the importance of building domain-specific lexicons for retrieval purposes, since a domain-specific, controlled list of keywords can help identify legitimate search vocabularies and help searchers "dock" onto the retrieval system. For most domain-specific databases, there appear to be some lists of subject descriptors (e.g., the subject indexes at the back of a textbook), people's names (e.g., author indexes), and other domain-specific objects (e.g., organizational names, procedures, location names, etc.). These domain-specific keywords can be used to identify important concepts in documents. In the criminal analysis world, this information can help the analyst identify which group or organization a person belongs to, what methods are used to conduct criminal activities, and where. In addition, the online bilingual newswire articles used in this experiment are dynamic. They provide a continuous large amount of information for relieving the lag between new information and the information incorporated into a reference work. To continuously collect the English/Chinese daily Police press release articles from the data stream, we investigate a text-based approach to align English/Chinese parallel documents from the Web. A parallel corpus can be generated using overt translation or covert translation. Overt translation [20] possesses a directional relationship between the pair of texts in two languages, which means text in language A (the source text) is translated into text in language B (the translated text) [25]. Covert translation [13] is non-directional, e.g., press releases from the government, or commentaries on a sports event broadcast live in several languages by a broadcasting organization. There are two major approaches to document alignment, namely length-based and text-based alignment. The length-based approach makes use of the total number of characters or
words in a sentence, and the text-based approaches use linguistic information in the sentence alignment [9]. Many parallel text alignment techniques have been developed in the past. These techniques attempt to map various textual units to their translations, and they have been proven useful for a wide range of applications and tools, e.g., cross-lingual information retrieval [18], bilingual lexicography, automatic translation verification, and the automatic acquisition of knowledge about translation [22]. Translation alignment techniques have been used in automatic corpus construction to align two documents [16]. There are three major structures of parallel documents on the World Wide Web: the parent page structure, the sibling page structure, and the monolingual sub-tree structure [24]. Resnik [19] noticed that the parent page of a Web page may contain the links to different versions of the Web page. The sibling page structure refers to the cases where the page in one language contains a link directly to the translated page in the other language. The third structure contains a completely separate monolingual sub-tree for each language, with only the single top-level Web page pointing off to the root page of the single-language version of the site. Parallel corpora generated by overt translation usually use the parent page structure and the sibling page structure. However, parallel corpora generated by covert translation use the monolingual sub-tree structure. Each sub-tree is generated independently [24]. The press release issued by the HKSAR Police Department is an example.
Fig. 1. Organization of Hong Kong SAR Police Department's press release articles in the Hong Kong SAR Police Department Web site: separate Chinese and English monolingual sub-trees, each leading through its own press news archive and date pages to the parallel articles.
2.1 Title Alignment

The titles of two texts can be treated as representations of those texts. Referring to He [11], titles present "micro-summaries of texts" that contain "the most important focal information in the whole representation" and serve as "the most concise statement of the content of a document". In other words, titles function as condensed summaries of the information and content of the articles. In our proposed text-based approach, the longest common subsequence is utilized to optimize the alignment of English and Chinese titles [24]. Our alignment algorithm has three major steps: 1) alignment at word level and character level, 2) reducing redundancy, and 3) a score function. An English title, E, is formed by a sequence of English simple words, i.e., E = e1 e2 e3 ... ei ..., where ei is the ith English word in E. A Chinese title, C, is formed by a sequence of Chinese characters, i.e., C = char1 char2 char3 ... charq ..., where charq is a Chinese character in C. An English word in E, ei, can be translated to a set of possible Chinese translations, Translated(ei), by dictionary lookup:

$$\mathrm{Translated}(e_i) = \{T_{e_i}^1, T_{e_i}^2, T_{e_i}^3, \ldots, T_{e_i}^j, \ldots\},$$

where T^j_ei is the jth Chinese translation of ei. Each Chinese translation is formed by a sequence of Chinese characters. The set of longest common subsequences (LCS) of a Chinese translation T^j_ei and C is LCS(T^j_ei, C). MatchList(ei) is a set that holds all the unique longest common subsequences of T^j_ei and C over all Chinese translations of ei. We adopt the hypothesis that if the characters of a Chinese translation of an English word appear adjacently in a Chinese sentence, that translation is more reliable than translations whose characters do not appear adjacently in the Chinese sentence. Contiguous(ei) is used to determine the most reliable translations based on adjacency. The second criterion for the most reliable Chinese translations is the length of the translations. Reliable(ei) is used to identify the longest sequence in Contiguous(ei). Due to redundancy, the translations of an English word may be repeated completely or partially in Chinese. To deal with redundancy, Dele(x, y) is an edit operation that removes LCS(x, y) from x. WaitList is a list that saves all the sequences obtained by removing the overlapping of the elements of MatchList(ei) and Reliable(ei). MatchList(ei) is initialized to ∅ and Reliable(ei) is initialized to ε. Remain is a sequence that is initialized to C, and each Reliable(ei) is removed from Remain, starting from e1 until the last English word. WaitList is also updated for each ei. When all Reliable(ei) have been removed from Remain, the elements in WaitList are also removed from Remain in order to remove the redundancy. Given E and C, the ratio of matching is determined by the portion of C that matches with the reliable translations of English words in E. Given an English title, the Chinese title that has the highest Matching_Ratio among all the Chinese titles is considered the counterpart of the English title. However, it is possible that more than one Chinese title has the highest Matching_Ratio. In such cases, we also consider the ratio of matching determined by the portion of the English title that is able to identify a reliable translation in the Chinese title.
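The character-level machinery of the algorithm — LCS computation and the contiguity test behind Contiguous(ei) and Reliable(ei) — can be sketched as follows. The function names mirror the text, but the implementation details (e.g., using substring membership as the adjacency test) are our own illustrative choices.

```python
def lcs(a, b):
    """Longest common subsequence of two character sequences."""
    m, n = len(a), len(b)
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + a[i]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

def reliable_translation(translations, chinese_title):
    """Pick the most reliable Chinese translation of one English word:
    prefer matches whose characters appear contiguously in the title
    (Contiguous), then take the longest such match (Reliable)."""
    matches = {lcs(t, chinese_title) for t in translations}
    contiguous = [m for m in matches if m and m in chinese_title]
    candidates = contiguous or [m for m in matches if m]
    return max(candidates, key=len) if candidates else ""
```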
2.2 Experiment

An experiment was conducted to measure the precision and recall of the aligned parallel Chinese/English documents from the HKSAR Police press releases using the text-based approach described in Section 2.1. Results are shown in Table 1. The Hong Kong SAR Police press releases are developed based on covert translation. From 1 January 2001 to 31 October 2002, there were 2,698 press articles in Chinese and 2,695 press articles in English, of which only 2,664 are Chinese/English parallel article pairs. The experimental result shows that the proposed text-based title alignment approach can effectively align the Chinese and English titles.

Table 1. Experimental results

                                Precision   Recall
Proposed text-based approach      1.00       1.00
3 A Corpus-Based Approach: Automatic Cross-Lingual Concept Space Generation

The semantic network knowledge base approach to automatic thesaurus generation is also referred to as a concept space approach [4], because a meaningful and understandable concept space (a network of terms and weighted associations) can represent the concepts (terms) and their associations for the underlying information space (i.e., the documents in the database). In terms of criminal analysis, recent terrorist events have demonstrated that terrorist and other criminal activities are connected, in particular terrorism, money laundering, drug smuggling, illegal arms trading, and illegal biological and chemical weapons smuggling. In addition, hacker activities may be connected to these other criminal activities. Information in the concept space can be split into concepts and links. Concepts include real people, aliases, groups, organizations, companies (including banks and shells), countries, towns, regions, religious groups, families, attacks (hacker, terrorist), etc. The associated concepts in the concept space can provide links about the persons who generally remain hidden, unknown, and use aliases, who, in turn, belong to various groups and organizations, use banks, vehicles, and phones, meet in various locations, conduct both criminal and non-criminal activities, and communicate openly and less openly through bulletin boards, e-mail, phone calls, letters, word-of-mouth, etc. – encrypted or not. This helps the analyst to detect important anomalies. The cross-lingual concept space clustering model was originally suggested by Lin and Chen [15] and is based on the Hopfield network. The cross-lingual concept space includes the concepts themselves, their translations, as well as their associated concepts. The automatic Chinese-English concept space generation system consists of four components: 1) English phrase extraction; 2) Chinese phrase extraction; 3) the Hopfield network; and 4) the parallel Chinese/English Police press release corpus. The Chinese and English phrase extraction identifies important conceptual phrases in the corpora. The Hopfield network generates the cross-lingual concept space with the important Chinese and English conceptual phrases as input. A press release parallel corpus was dynamically collected from the Hong Kong Police website in order to obtain the relationship between Chinese terms and English terms.
3.1 Automatic English Phrase Extraction

Automatic phrase extraction is a fundamental and important phase in concept space clustering. The clustering result will be downgraded significantly if the quality of term extraction is low. Salton [21] presents a blueprint for automatic indexing, which typically includes stop-wording and term-phrase formation. A stop-word list is used to remove non-semantic-bearing words such as the, a, on, in, etc. After removing the stop words, term-phrase formation, which formulates phrases by combining only adjacent words, is performed [4].

3.2 Chinese Phrase Extraction

Unlike English, the Chinese language has no natural delimiters to mark word boundaries. In our previous work, we developed boundary detection [23] and heuristic techniques to segment Chinese sentences based on mutual information and significance estimation [5]. The accuracy is over 90%.

3.2.1 Automatic Phrase Selection
To generate the concept space, the relevance weights between the English and Chinese term phrases are first computed in order to select significant concepts from the collection:

$$d_{ij} = tf_{ij} \times \log\!\left(\frac{N}{df_j} \times w_j\right). \qquad (1)$$
Equation 1 shows how the combined weight of term j in document i is calculated. tfij is the occurrence frequency of term j in document i, N is the total number of documents in the collection, and dfj is the number of documents containing term j. wj is the length of term j: for an English term, the length is the number of words in it; for a Chinese term, the number of characters in it. The weight is directly proportional to the occurrence frequency of the term, because a term that appears many times in a document carries an important idea. On the other hand, it is inversely proportional to the number of documents containing the term, because the meaning carried by a term that appears in many documents may be too general. For example, "Hong Kong" frequently appears in the collection of documents from the HKSAR Police. It is a common term in the collection and does not carry specific meaning in any document of the collection. The length of a term also plays an important role in the weight, since a longer term carries more specific meaning. For example, names of places and organizations often consist of multiple words (for English) or characters (for Chinese). Terms that significantly represent a document are selected for clustering. Based on the combined weights of terms calculated using Equation 1, a number of terms with the largest combined weights in each document are selected for clustering. The number is based on the average length of documents in the collection: for a longer average length, more terms are selected. Terms with common meanings that are not representative are filtered out.
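A direct transcription of Eq. (1), together with the per-document term selection it drives, might look like the following sketch (the data layout is assumed, not specified in the paper):

```python
import math

def combined_weight(tf_ij, N, df_j, w_j):
    # d_ij = tf_ij * log((N / df_j) * w_j)   (Eq. 1)
    return tf_ij * math.log(N / df_j * w_j)

def select_terms(term_stats, N, top_k):
    """term_stats: {term: (tf_in_doc, df, length)}. Returns the top_k
    terms of one document, ranked by combined weight."""
    ranked = sorted(term_stats,
                    key=lambda t: combined_weight(term_stats[t][0], N,
                                                  term_stats[t][1],
                                                  term_stats[t][2]),
                    reverse=True)
    return ranked[:top_k]
```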
3.2.2 Co-occurrence Weight
After the calculation of dij, an asymmetric co-occurrence function [2] is used to evaluate the relevance weights among concepts. For a pair of relevant terms A and B, the weight of the link from term A to term B and that of the link from term B to term A are different. This function gives a good description of the natural way humans associate terms. For example, "Ford" and "car" are relevant. When a person comes up with "Ford", he can think of "car". However, when a person comes up with "car", he may not think of "Ford". This example shows that the associations between two terms are not symmetric. Therefore, we adopt the co-occurrence weight to calculate the relevance weights:

$$d_{ijk} = tf_{ijk} \times \log\!\left(\frac{N}{df_{jk}} \times w_j\right). \qquad (2)$$
The co-occurrence weight, dijk, in Equation 2 is the weight between term j and term k when both exist in document i. tfijk is the minimum of the occurrence frequency of term j and that of term k in document i. The weight is zero if either term j or term k does not exist in the document. The calculation is similar to that of Equation 1; the co-occurrence weight is thus a measure of the combined weight between term j and term k.
$$\mathrm{Weight}(T_j, T_k) = \frac{\sum_{i=1}^{n} d_{ijk}}{\sum_{i=1}^{n} d_{ij}} \times \mathrm{WeightingFactor}(T_k), \qquad (3)$$

$$\mathrm{Weight}(T_k, T_j) = \frac{\sum_{i=1}^{n} d_{ikj}}{\sum_{i=1}^{n} d_{ik}} \times \mathrm{WeightingFactor}(T_j). \qquad (4)$$
Equation 3 shows the relevance weight from term j to term k. Equation 4 shows the relevance weight from term k to term j. The relevance weight measures the association between two terms in the collection. The combined weights and co-occurrence weights of terms in all documents are summed to derive the global association between terms in the collection:

$$\mathrm{WeightingFactor}(T_j) = \frac{\log\frac{N}{df_j}}{\log N}, \qquad (5)$$

$$\mathrm{WeightingFactor}(T_k) = \frac{\log\frac{N}{df_k}}{\log N}. \qquad (6)$$
Equation 5 shows the weighting factor of term j, and Equation 6 the weighting factor of term k. The weighting factor is used to penalize general terms, which always affect the result of clustering. Many terms associate with general terms; during clustering, if a general term is activated, other terms associated with that general term will also be activated. The size of that concept space will then be large, and the precision will unavoidably be low. The weighting factor is a value between 0 and 1. It carries the idea of inverse document frequency: the more documents contain the concept, the smaller the weighting factor.
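Equations (2)–(6) combine into an asymmetric relevance weight. A minimal sketch, assuming the per-document co-occurrence weights dijk and combined weights dij have already been computed:

```python
import math

def weighting_factor(df, N):
    # Eqs. (5)/(6): penalizes terms that occur in many documents
    return math.log(N / df) / math.log(N)

def relevance_weight(d_jk_per_doc, d_j_per_doc, df_k, N):
    """Weight(T_j, T_k) of Eq. (3): d_jk_per_doc holds d_ijk for every
    document i, d_j_per_doc holds d_ij.  The measure is asymmetric:
    swap the roles of j and k (and use df_j) to get Weight(T_k, T_j)
    of Eq. (4)."""
    return (sum(d_jk_per_doc) / sum(d_j_per_doc)) * weighting_factor(df_k, N)
```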
3.2.3 The Hopfield Network Algorithm
Given the relevance weights between the extracted Chinese and English term phrases in the parallel corpus, we employ the Hopfield network to generate the concept space. The Hopfield network models the associative network and transforms a noisy pattern into a stable state representation. When a searcher starts with an English term phrase, the Hopfield network spreading activation process will identify other relevant English term phrases and gradually converge towards heavily linked Chinese term phrases through association (or vice versa). Each term is represented by a node in the network. The algorithm is shown below:

$$u_j(t+1) = f_s\!\left[\sum_{i=0}^{n-1} t_{ij}\, u_i(t)\right], \quad 0 \le j \le n-1, \qquad (7)$$

where uj(t+1) denotes the value of node j in iteration t+1, n is the total number of nodes in the network, and tij denotes the relevance weight from node i to node j.

$$f_s(x) = \frac{1}{1 + \exp\!\left(-\dfrac{x - \theta_j}{\theta_o}\right)} \qquad (8)$$

Equation 8 is the continuous SIGMOID transformation function, which normalizes any given value to a value between 0 and 1 [4].

$$\sum_{j=0}^{n-1} \left[u_j(t+1) - u_j(t)\right]^2 \le \varepsilon, \qquad (9)$$

where ε is the maximal allowable difference between two iterations; it measures the total change of the values of the nodes from iteration t to t+1. After several iterations, more nodes are activated, and nodes with strong connections to the target node are those with high values. The total change of node values is evaluated at the end of each iteration. When the change is smaller than the threshold ε, the Hopfield network has converged and the iteration process stops. Once the network has converged, the final output represents the set of terms relevant to the starting term. In our system the following values were used: θj = 0.1, θo = 0.01, ε = 1.
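A sketch of the spreading-activation loop of Eqs. (7)–(9), using the paper's parameter values. Whether the seed node stays clamped during iteration is not specified in the text, so this version simply initializes it to 1 and lets it evolve:

```python
import math

def hopfield_activate(weights, seed, theta_j=0.1, theta_o=0.01, eps=1.0):
    """Spreading activation of Eqs. (7)-(9).  `weights[i][j]` is the
    relevance weight t_ij from node i to node j; `seed` is the index of
    the starting term.  Returns the converged activation vector."""
    n = len(weights)
    u = [0.0] * n
    u[seed] = 1.0
    while True:
        # Eq. (7) with the sigmoid of Eq. (8)
        new_u = [1.0 / (1.0 + math.exp(-(sum(weights[i][j] * u[i]
                                             for i in range(n)) - theta_j)
                                       / theta_o))
                 for j in range(n)]
        # convergence test of Eq. (9)
        if sum((a - b) ** 2 for a, b in zip(new_u, u)) <= eps:
            return new_u
        u = new_u
```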
4 Concept Space Evaluation

Ten students of the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, were invited to examine the performance of the concept space. The concept space is a robust and domain-specific Hong Kong Police press release thesaurus containing 9,222 Chinese/English concepts. The thesaurus includes many social, political, and legislative terms, abbreviations, and names of government departments and agencies. Each concept in the thesaurus may be associated with up to 46 concepts. It was generated from 2,548 parallel Hong Kong Police press release article pairs. The goal of this experiment was to capture meaningful conceptual associations between concepts. The associations form the basis for the decisions and inferences users make when searching the criminal information of Hong Kong.
4.1 Experimental Design

Among these 10 graduate students, 5 subjects are Hong Kong students and the other 5 came from Mainland China. All of them have been living in Hong Kong for more than one year. They used their knowledge and experience of both the Hong Kong SAR Police system and the living environment in Hong Kong to evaluate the concept space. 50 of the 9,222 concepts were randomly selected as test descriptors: 25 of these 50 test descriptors are English concepts, and the other 25 are Chinese concepts. Each test descriptor, together with its associated concepts, was presented to the 10 subjects. A small portion (about 10% of the total number of associated concepts for each test descriptor) of noise terms was added to reduce the bias of the subjects toward the concept space. The experiment was divided into two phases: a recall phase and a recognition phase. In the recall phase, each subject (Hong Kong graduate students and graduate students from Mainland China) was asked to generate as many related terms as possible in response to each test descriptor presented. In the recognition phase, the subjects needed to judge each associated concept as either "irrelevant" or "relevant" to the test descriptor. Terms considered too general were to be ranked as "irrelevant". This phase tested the ability of subjects to recognize relevant terms. If the subjects felt the definition of a concept needed clarification, or they wished to add comments on the concept, they were asked to write them on a piece of paper. After the experiment, we found that the subjects spent more time on the recognition phase than on the recall phase. This confirms the statement made by Chen et al. [3] that human beings are more likely to recognize than to recall. Apart from the 10 students, the 50 concepts in the concept space were also carefully evaluated by two experimenters, with no noise terms added in this case. One of them is a graduate student of the Department of Systems Engineering and Engineering Management; the other is a graduate student of the Department of Translation. Both have been living in Hong Kong for more than 10 years, and both have done research on Chinese-to-English and English-to-Chinese translation for more than two years. Since there is no tailored bilingual thesaurus for Hong Kong government press release articles, the experimental result provided by these two senior subjects is treated as a benchmark, or human-verified thesaurus, for comparison with the result provided by the 10 subjects. The additional associated concepts provided by the 10 subjects in the recall phase were examined by the two senior judges before being treated as relevant terms.

4.2 Experimental Result

We adopted concept recall and concept precision for evaluation, based on the following equations:

$$\mathrm{Concept\ Recall} = \frac{\mathrm{Number\ of\ Retrieved\ Relevant\ Concepts}}{\mathrm{Number\ of\ Total\ Relevant\ Concepts}}, \qquad (10)$$
$$\mathrm{Concept\ Precision} = \frac{\mathrm{Number\ of\ Retrieved\ Relevant\ Concepts}}{\mathrm{Number\ of\ Total\ Retrieved\ Concepts}}. \qquad (11)$$
The number of Retrieved Relevant Concepts represents the number of concepts in the concept space judged as "Relevant". The number of total relevant concepts
includes the concepts in the concept space judged as "Relevant" plus the additional relevant concepts provided by the subjects. The number of total retrieved concepts represents the number of concepts suggested by the concept space and the human-verified thesaurus.

4.3 Evaluation Provided by the 10 Graduate Subjects

The 10 graduate students provided 12 to 73 new associated concepts during the experiment. The analysis is listed in Table 2. It is interesting to note that all the Hong Kong graduate subjects have been living in Hong Kong for at least six years, whereas the graduate subjects from Mainland China have been living in Hong Kong for around one year. The Hong Kong graduate subjects are therefore more familiar with the Hong Kong Police system, and they added more new concepts to the concept space. In addition, the Hong Kong graduate students added more English concepts to the concept space than the graduate students from Mainland China did. This confirms that even though the first language of all these graduate students is Chinese, the working language for the Hong Kong graduate students is English.

Table 2. The statistics of new associated concepts added by the 10 graduate students
Table 3. Precision and recall

                       Precision   Recall
10 graduate students     0.835     0.795
2 experimenters          0.86      0.83
Table 4. The new concepts added by the 10 graduate students

                       Chinese concepts added   English concepts added
10 graduate students            222                      220
Hong Kong is a bilingual community. Even though the Police concept space contains many technical, political, and geographical English vocabularies, the Hong Kong graduate students frequently encounter these terms in their daily life. As a result, the Hong Kong graduate students naturally added more English terms into the concept space. This observation also appears in the Welsh and English community [7]. Also, even though Chinese technical terms do exist, they may not be in common use. Therefore, the Hong Kong graduates may have a limited Chinese technical vocabulary
even where Chinese is their first language, and use English terms when necessary. As a result, the Hong Kong graduate subjects judged more English concepts to be relevant and added more English terms into the concept space. On the other hand, the graduate students from Mainland China have a higher degree of Chinese fluency than the Hong Kong graduate students. Also, they know more of the Chinese translations of those English technical vocabularies used in Mainland China. These factors caused them to add more Chinese concepts. We also observed that some associated concepts were judged as irrelevant because they do not show a clear association with their test descriptor. For example, one of the associated concepts for the test descriptor "…" (smuggling) is "Mr Mark Steeple" (…), because the Chief Inspector of the Anti-smuggling Task Force in Hong Kong is Mr Mark Steeple. Another associated concept is "Mirs Bay" (…), because of the recent trend of smuggling by small craft in the Mirs Bay area. However, the graduate students did not have prior knowledge of these facts and judged the concepts as irrelevant. Since the corpus is a dynamic resource, it is not surprising that the students lack such prior knowledge. For a criminal analyst, this information is important for identifying the recent trend of smuggling by small craft in the Mirs Bay area. In addition, one of the associated concepts for "Golden Bauhinia Square" (…) is "…" (Police). We know that the flag-raising ceremony began promptly at 8 a.m. with the Flag Raising Parade at the twin flagpoles at Golden Bauhinia Square; the flag party, provided by the Hong Kong Police Force, comprised a Senior Inspector of Police and four flag raisers. Without knowing this, the subjects only read the concept space and judged that there is no clear association between "Police" and "Golden Bauhinia Square". This phenomenon shows that the clustering process using the Hopfield network induces the relevant concepts based on the contents of the documents. Apart from this, as we know, a lexical item (word) in a sentence may be a concept in one language [12], where a concept is a recognizable unit of meaning in any given language [11]. A concept represented by a word in one language may be translated into a word, two words, a phrase, or even a sentence in another language [11]. A concept in one language can be a broader concept encompassing some narrower concepts, and the translation of such a concept may result in an altered concept in another language. In contrast, a narrower concept in one language may be translated as a broader concept in another language. Such a relationship is known as a generic-specific relationship [12]. For example, the word "China" may be modified to the specific word "…" (Beijing), a city of China. Omission, addition, and deviation are also common phenomena. For example, "Closure" corresponds to "…" in some cases. "Closure" is translated to "…" by dictionary, but it refers to "… (stop service)" in some cases (deviation). Therefore, conceptual alteration may occur in translation. This also caused the judges to judge some associated concepts as irrelevant. Nida [11] explains that conceptual alteration is caused by three major reasons: 1) no two languages are completely isomorphic, 2) different languages may have different domain vocabularies, and 3) some languages are more rhetorical than others.
Courtial and Pomian [6] argued that searches performed in the realms of science and technology frequently involve associations of concepts that lie outside the traditional associations represented in thesauri. Associative networks gleaned through textual analysis, they argued, facilitate innovation by making obvious associations that would otherwise be impossible for humans to find on their own. In early research,
Lesk [14] found little overlap between term relationships generated through term associations and those presented in existing thesauri. This kind of term relationship is especially important for criminal analysis. The associated concepts in the concept space can provide links about the persons who generally remain hidden, unknown, and use aliases, who, in turn, belong to various groups and organizations, use banks, vehicles, and phones, meet in various locations, conduct both criminal and non-criminal activities, and communicate through bulletin boards, e-mail, phone calls, letters, word-of-mouth, etc. – encrypted or not. Ekmekcioglu, Robertson and Willett [8] tested retrieval performances for 110 queries on a database of 26,280 bibliographic records using four approaches. Their result suggested that performance may be greatly improved if a searcher can select and use the terms suggested by a co-occurrence thesaurus in addition to the terms he has generated [4].

4.4 Translation Ability of the Concept Space

The 46,683 associated concepts were also examined. For those test descriptors associated with two relevant associated concepts, 47.64% of these associated concepts are Chinese concepts and 52.36% are English concepts. Among the 9,222 test descriptors, 87.7% obtain their translations from the associated concepts. This shows that the concept space generated through the Hopfield network can effectively recognize the translations of a concept in a parallel corpus.
5 Conclusion

The tragic event of September 11 has prompted rapidly growing attention to national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information, written in different languages and stored in different locations, may be seemingly unconnected. Therefore, cross-lingual semantic interoperability is a major challenge in generating an overview of this disparate data and information so that it can be analyzed, shared, and searched. To effectively predict and prevent criminal activities, an intelligent system is required to retrieve relevant information from criminal records and suspect communications. The system should continuously collect information from relevant data streams and compare incoming data to known patterns to detect important anomalies. However, information retrieval (IR) systems present two main interface challenges: first, how to permit a user to input a query in a natural and intuitive way, and second, how to enable the user to interpret the returned results. A component of the latter encompasses ways to permit a user to comment and provide feedback on results and to iteratively improve and refine them. As we know, the vocabulary difference problem has been widely recognized: users tend to use different terms for the same information sought. Also, in terms of criminal analysis, the man-made fog of deliberate deception, which militates against normal pattern learning from databases, causes much crucial information and the underlying knowledge to be buried. As a result, an exact match between the user's terms and those of the indexer is unlikely. An advanced tool is required to understand the user's needs. Cross-lingual information retrieval brings an added complexity to the standard
IR task. Users can have different abilities in different languages, affecting their ability to form queries and interpret results. This highlights the importance of automated assistance in refining a query in cross-lingual information retrieval. This article has presented a bilingual concept space approach using the Hopfield network to relieve the vocabulary problem in national security information sharing, using the Hong Kong Police press release bilingual pairs as an example. The concept space allows the user to interactively refine a search by selecting concepts that have been automatically generated and presented to the user. This allows the user to descend to the level of actual objects in a collection at any time. Some information may be seemingly unconnected, yet it can help the analyst to identify important anomalies, such as traffic accidents that frequently happen at a particular location. Since the press release collection is dynamically generated, the subjects may not have full prior knowledge of it. Nevertheless, the experimental results show that the precision and recall for the bilingual concept space are over 78% in all cases. Among the 9,222 test descriptors, 87.7% obtain their translations from the associated concepts. This shows that the concept space generated through the Hopfield network can effectively recognize the translations of a concept in a parallel corpus.
References

1. Bates, M.J.: Subject access in online catalogs: A design model. Journal of the American Society for Information Science 37 (1986) 357–376
2. Chen, H., Lynch, K.J.: Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics 22(5) (1992) 885–902
3. Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A., Lin, C.: A parallel computing approach to creating engineering concept spaces for semantic retrieval: The Illinois Digital Library Initiative project. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8) (1996) 771–782
4. Chen, H., Ng, T., Martinez, J., Schatz, B.: A concept space approach to addressing the vocabulary problem in scientific information retrieval: An experiment on the Worm Community System. Journal of the American Society for Information Science 48(1) (1997) 17–31
5. Chien, L.F.: PAT-tree-based keyword extraction for Chinese information retrieval. In: Proceedings of ACM SIGIR, Philadelphia, PA (1997) 50–58
6. Courtial, J.P., Pomian, J.: A system based on associational logic for the interrogation of databases. Journal of Information Science 13 (1987) 91–97
7. Cunliffe, D., Jones, H., Jarvis, M., Egan, K., Huws, R., Munro, S.: Information architecture for bilingual Web sites. Journal of the American Society for Information Science 53(10) (2002) 866–873
8. Ekmekcioglu, F.C., Robertson, A.M., Willett, P.: Effectiveness of query expansion in ranked-output document retrieval systems. Journal of Information Science 18 (1992) 139–147
9. Fung, P., McKeown, K.: A technical word- and term-translation aid using noisy parallel corpora across language groups. Machine Translation 12 (1997) 53–87
10. Hayes-Roth, F., Waterman, D.A., Lenat, D.: Building Expert Systems. Addison-Wesley, Reading, MA (1983)
11. He, S.: Translingual alteration of conceptual information in medical translation: A cross-language analysis between English and Chinese. Journal of the American Society for Information Science 51(11) (2000) 1047–1060
12. Larson, M.L.: Meaning-Based Translation: A Guide to Cross-Language Equivalence. University Press of America, Lanham, MD
13. Leonardi, V.: Equivalence in translation: Between myth and reality. Translation Journal 4(4) (2000)
14. Lesk, M.E.: Word-word associations in document retrieval systems. American Documentation 20(1) (1969) 27–38
15. Lin, C.H., Chen, H.: An automatic indexing and neural network approach to concept retrieval and classification of multilingual (Chinese-English) documents. IEEE Transactions on Systems, Man and Cybernetics 26(1) (1996) 75–88
16. Ma, X., Liberman, M.: BITS: A method for bilingual text search over the Web. In: Machine Translation Summit VII, Kent Ridge Digital Labs, National University of Singapore (1999)
17. Oard, D.W., Dorr, B.J.: A Survey of Multilingual Text Retrieval. UMIACS-TR-96-19, CS-TR-3815 (1996)
18. Oard, D.W.: Alternative approaches for cross-language text retrieval. In Hull, D., Oard, D., eds.: 1997 AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence (1997)
19. Resnik, P.: Mining the Web for bilingual text. In: 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland (1999)
20. Rose, M.G.: Translation types and conventions. In Rose, M.G., ed.: Translation Spectrum: Essays in Theory and Practice. State University of New York Press (1981) 31–33
21. Salton, G.: Automatic Text Processing. Addison-Wesley, Reading, MA (1989)
22. Simard, M.: Text-translation alignment: Three languages are better than two. In: Proceedings of EMNLP/VLC-99, College Park, MD (1999)
23. Yang, C.C., Luk, J., Yung, S., Yen, J.: Combination and boundary detection approach for Chinese indexing. Journal of the American Society for Information Science, Special Topic Issue on Digital Libraries 51(4) (2000) 340–351
24. Yang, C.C., Li, K.W.: Automatic construction of English/Chinese parallel corpora. Journal of the American Society for Information Science and Technology 54(7) (2003)
25. Zanettin, F.: Bilingual comparable corpora and the training of translators. In Laviosa, S., ed.: META 43(4), Special Issue: The corpus-based approach: A new paradigm in translation studies (1998) 616–630
Decision Based Spatial Analysis of Crime Yifei Xue and Donald E. Brown Department of Systems and Information Engineering University of Virginia, Charlottesville, VA 22904, USA. {yx8d,brown}@virginia.edu
Abstract. Spatial analysis of criminal incidents is an old and important technique used by crime analysts. However, most of this analysis considers the aggregate behavior of criminals rather than individual spatial behavior. Recent advances in the modeling of spatial choice and data mining now enable us to better understand and predict individual criminal behavior in the context of their environment. In this paper, we provide a methodology to analyze and predict the spatial behavior of criminals by combining data mining techniques and the theory of discrete choice. The models based on this approach are shown to improve the prediction of future crime locations when compared to traditional hot spot analysis. Keywords. Spatial choice, feature selection, preference specification, modelbased clustering
1 Introduction Crime analysts are interested in understanding the relationship between location and crime and in using this understanding to predict where future crimes will occur. While no one would suppose that these predictive models could identify point targets for criminals, nonetheless, they do provide insight into areas that are expected to have increased likelihoods of criminal incidents compared with other areas. Predictions of this sort enable better use of scare resources to police those areas most threatened and to launch programs in those areas to address problems that may be feeding criminal activity. The recent introduction and widespread use of geographic information systems (GIS) have sped improvements in understanding of the role of space in criminal activity. Many methods for using GIS for the visualization of spatial data have been developed and these help to identify unusual concentrations of crimes or hot spots. These concentrations can be formally modeled as has been done in other fields, such as epidemiology and public health. Examples of formal statistical models include point pattern analysis [20], [21], distance statistics [2] and area analysis [9], [26]. Among all these studies of place-based crime data, regression analysis plays a crucial role in the attempts to explain the causes of criminal activities [17], [22]. However, most spatial analyses of crime data do not attempt to model the individual decision making process of the criminal but look instead at the aggregated behavior of many criminals. According to the rational choice perspective in criminology, criminal H. Chen et al. (Eds.): ISI 2003, LNCS 2665, pp. 153–167, 2003. © Springer-Verlag Berlin Heidelberg 2003
154
Y. Xue and D.E. Brown
incidents, like many other human initiated events involve a decision making and choice process. Much criminological work has taken the criminals or offenders as decision makers who want to benefit from their criminal behaviors and avoid the risk exposure to law enforcement [8]. We take advantage of the fact that the selection of crime targets indicates the criminals’ preferences for specific sites in terms of spatial attributes. While the interest in these criminal preferences is unique to law enforcement, we can exploit work in economics that has looked at spatial choice among consumers to aid us in better understanding criminal preferences and then use this understanding to predict criminal behavior. This paper develops a spatial choice methodology based on these ideas to analyze the location-based crime data. Spatial choice theory describes human’s behaviors in space as rational decisions among the available spatial alternatives. The choices indicate certain spatial patterns and represent the decision makers’ preferences. At the heart of recent work in this area is the pioneering development by McFadden which lead to the formal modeling of discrete spatial choice [1], [18]. Discrete choice models are used for analysis and prediction of spatial decision making under uncertainty with multiple alternatives. It has been extended to a number of areas, such as consumer destination selection [12], [24], travel mode analysis [3], [19], and recreational demand models [25]. These analyses indicate the spatial decisions of a large number of individuals. In general these decisions have been studied through surveys that address a rather limited set of spatial alternatives for each decision maker. Clearly, spatial choice analysis for crime data breaks new ground. The alternatives are commercial properties, buildings, and houses in the study area and while the number of spatial alternatives is finite it is, nonetheless, very large compared to other spatial choice problems. Also, the preferences of the criminal decision makers cannot be directly or accurately assessed through interviews, surveys, or questionnaires. In the rest of this paper, we first formally define the criminal spatial choice in Section 2. Section 3 presents the spatial choice models derived from these formal definitions. In Section 4, the models are applied to actual crime data and the locations of future criminal incidents are predicted. Comparison results with these new models are reported and summarized. Section 5 contains the conclusions.
2 Problem Statement Data items for spatial or crime analysis have two components: a location component and an attribute component. They can be represented by a vector {Q, S, k}. Q is the universe of the location component, which is discrete and indexes all spatial alternatives by an ordered pair of coordinates {x, y}. S is the attribute component associated with given spatial alternatives, which indicate S different attributes S = {s1 , s 2 ,..., s S } . k : Q → S is a mapping function specifying the observed attributes of the alternatives.
Decision Based Spatial Analysis of Crime
155
The spatial decision process can be represented by a vector {Q, S, k, A, D, u, P} . The set A is a subset of Q indicating finite choices available to all individuals D. A = { a1 , a 2 ,..., a N } represents N available alternatives for decision makers to choose. For spatial analysis of crimes, N is a very large number. D is the universe of individuals who make choices over the available alternative set A. Each individual makes choices based on a decision process. u is the utility function mapping the preferences from individuals D over the alternative set A to a utility value U. For a individual d, if choice set Ad = { a1 , a 2 ,..., a N } and Ad′ = { a1′ , a ′2 ,..., a ′N } have same attribute values, then the choice sets will have same utility U= u( Ad ) = u( Ad′ ) . According to the rational decision making assumptions, individuals make choices that maximize their utility. The probability that an individual d from D will choose alternative a i from an available choice set Ad can be specified as P{ ai
| Ad , d } , which is produced from the choice process {Q, S, k, A, D,u, P} . The probability P{ ai | Ad , d } is a mapping based on the
preferences of individual d and the attributes of all alternatives in set Ad. The mapping can be stated as P : A × S × D → ( 0 ,1 ) , or indicated by a utility-based
| Ad , d } = P{ u( a i ) ≥ u( a j ) | d , a j ∈ Ad } . The utility of alter-
function P{ ai
native ai to individual d can be divided into two parts U id
= V ( d , si ) + ε( d , s i ) .
V ( d , si ) = ∑ β x is the deterministic part of the utility value and expressed as a i l
i l
l
linear additive function of all attributes.
xli ∈ X = ( S , D ) represents the lth
component of the combination of attribute values si and characteristics of individual d. ε( d , s i ) is the error term of utility function indicating unobservable components of the utility function.
3 Model Development 3.1 Spatial Choice Patterns Spatial choice theory describes how individuals choose a specific site in space as their target. Their choices show certain patterns in space. The geographical sites form a spatial alternative set A. Individuals make selections from this choice set. Since the number of alternatives for a spatial choice process is very large, individuals are unable to evaluate all spatial alternatives before they make their selections. They can only compare part of the choice set and pick one spatial alternative with highest utility value. This can be stated as a sub-optimal or locally optimal problem. According to Fotheringham’s framework of individuals’ hierarchical information processing [13], individuals make spatial choices from the alternatives they have evaluated. For
156
Y. Xue and D.E. Brown
individual d, the choice set will be Ad
⊆ A , which indicates all spatial alternatives
that individual d really considered. The choice that individual d makes will probably have the highest utility among all alternatives in choice set Ad . What is different from previous work in discrete choice theory, the real choice set
Ad in crime analysis is not
clear to the analysts. Some methods are proposed to identify or estimate the probability P( ai ∈ Ad ) that an alternative ai is considered by individual d. After the identification of the individuals’ choice set, two factors are considered in people’s spatial choice process: i) the utility of alternative ai to individual d and ii) the probability that alternative ai is available or considered by individual d. Since the number of spatial alternatives is very large, it is possible that some alternatives can give higher utility values but they are never considered. In order to reveal the individuals’ preferences, we make an assumption here. Assumption 1: The two factors (i and ii) mentioned above are equally important to the individuals’ choice decisions. The combination of P( a i ∈ Ad ) and the utility of alternative ai to individual d, U id can give a better estimation of the possibility of choices. With the assumption 1, the probability that individual d chooses alternative ai from Ad can be stated as P( U id > U jd + ln P( a j ∈ Ad ), all a j ∈ Ad )P( a i ∈ Ad ) [13]. In order to get the spatial choice model, we make another assumption. Assumption 2: The error term of individuals’ utility function
ε( d , s i ) is
independently and identically distributed with Weibul distribution [18]. The spatial choice model is derived with same method as McFadden has used [13], [18].
P( ai | Ad , d ) = exp(V ( d , si )) ⋅ P( ai ∈ Ad ) / ∑ exp(V ( d , s j )) ⋅ P( a j ∈ Ad j∈A
) (1)
This model is a multinomial logit model where each alternative’s observable utility is weighted by the probability that the alternative is evaluated. 3.2 Specification of Prior Probability We assume that the hierarchical information process takes place before the individuals’ spatial choices. Individuals will first evaluate sets of alternatives and only alternatives within the sets can be selected. We can either define the choice set Ad or give the probability that an individual will evaluate certain alternative P( a i
∈ Ad ) .
Decision Based Spatial Analysis of Crime
157
For the spatial analysis of crime data, it is not easy to know individuals’ preferences. Then we have to make an assumption to simplify the model derivation. Assumption 3: During the process of individuals’ spatial choices, the preferences of all individuals d ∈ D are same. The pre-evaluated spatial alternative set Ad for different individuals is also same. We use M to represent the set of pre-evaluated spatial alternatives for all individuals. Under assumption 3, the spatial site selection model changes to P( ai | Ad , d ) = exp( V ( d , si )) ⋅ P( ai ∈ M ) / ∑ exp( V ( d , s j )) ⋅ P( a j ∈ M ) (2) j∈A The definition of P( ai ∈ M ) is important here. We use kernel density estimation method to get the probability that spatial alternative ai is evaluated by criminals. From the study of Brown et al. [7], we know that location components of spatial alternatives alone do not provide enough information about the criminals’ preferences. There are many feature values attached with the spatial alternatives. A part of these values is believed to be relevant to the occurrence of criminal incidents. Unfortunately, we do not know which part. However, we can mine the criminal’s preferences from all feature values of past crime incidents. We use a feature selection process to find the smallest feature subset from the universe feature space. It is called the key feature set or key feature space and shows that the past criminal incidents can indicate clear patterns in this key feature space. These are possible preferences of criminals’ preevaluation. Using the selected key features, we get the prior evaluation probability P( ai ∈ M ) as follow.
P( ai ∈ M ) = 1
2
1 K
K
∑ L( k =1
s i − s k si − s k s i − s k , , ,...) h1 h2 h3 1
1
2
2
3
3
(3)
3
where, si , si , si … are the key features of spatial alternative ai. K is the total number of observations; L is a function to specify the kernel estimator. We use a Gaussian function here. h’s are bandwidths used in the kernel estimation. The change of bandwidths will influence the effect of density estimation. The choice of bandwidths is important and literature in this area offers a great deal of discussions. We use a recommended bandwidth selection method from Bowman and Azzalini [5],
4 hi = ( p + 2 )⋅ K
1 /( p + 2 )
× σ i for ith dimension. p is the number of dimensions for
density estimation. The model adjusted with the estimated prior probability P( ai ∈ M ) is called the key feature adjusted spatial choice model. 3.3 Spatial Misspecification Both spatial choice model and other discrete choice models try to include all related predictor variables to estimate decision makers’ preferences and predict their future
158
Y. Xue and D.E. Brown
choices. However, it is practically impossible to include all relevant variables that affect people’s decisions into spatial choice models. First, some variables may be very difficult to measure. Second, some variables that affect choices may not have been conceptualized or identified by analysts. Third, even it is possible to identify and include all relevant variables. Some variables will be redundant and correlated with each other. Also, too many predictor variables will make the estimated parameters unstable and reduce the models’ predictive accuracy. It is necessary and inevitable to omit many predictor variables. This leads to the misspecification of choice models. During the development of our spatial choice model, assumption 3 indicates that the preferences of all individuals d ∈ D are same. The pre-evaluated choice set Ad is also the same for all individuals. This makes it easy to estimate the pre-evaluated choice set Ad . However, it also makes the estimated individuals’ preferences biased due to the lack of related information about decision makers’ preferences. For crime analysis, it is impossible to include all preference information into the spatial choice model. But the preferences of decision makers can be specified from their past choices. To avoid the bias and increase the accuracy of spatial choice models’ prediction, it is necessary to specify the bias introduced by the absence of important factors and discover the preferences of individuals. In our spatial choice model, we will specify the pre-evaluated choice set for individuals with different preferences. One solution is to classify all decision makers by their preferences from the past incidents of choices. With well selected key features, the past choices can indicate certain patterns in the key feature space. We will use clustering methods to identify the different classes of decision makers and identify their preferences by defining the pre-evaluated choice sets. The adjusted spatial choice model is called the Preference Specified Spatial Choice Model (PSSCM). 3.4 Clustering Methods Clustering is one of the most useful tasks in data mining for discovering groups and identifying interesting distributions and patterns of an underlying data set. Clustering involves partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than points in different clusters. Researchers have extensively studied clustering since it occurs in many applications in engineering and science. Clustering may result in a different partitioning of a data set, depending on the specific criterion used for clustering. The basic steps to develop a clustering can be summarized as feature selection, clustering algorithm, validation of the results, and interpretation of the results. Feature selection chooses the features on which clustering is to be performed so as to encode as much information as possible. We have used the feature selection step for finding key features. By removing all features that are irrelevant to classification, the small feature space subset provides enough
Decision Based Spatial Analysis of Crime
159
information for pattern recognition therefore reducing the cost and improving the quality of classification [23]. The clustering algorithm is the most important part of the clustering process, and it includes similarity measures, partitioning methods, and stopping criteria. Each of these is described by a variety of sources [11], [16]. No matter what clustering algorithms are used, it is important to find a way to define a stopping criterion or define how many clusters are in the data set. Various strategies for simultaneous determination of the number of clusters and cluster membership have been proposed, like Engelman and Hartigan [10], Bock [4], Bozdogan [6], and Fraley and Raftery [14]. Fraley and Raftery use a model based strategy and Bayesian Information Criterion (BIC) to do clustering and determine the number of clusters. In this approach, the data are viewed as coming from a mixture of probability distributions, each representing a different cluster. Methods of this type have been applied in a number of practical applications, In model based clustering, it is assumed that the data are generated by a mixture of underlying probability distribution in which each component represents a different group or clusters. Let f k ( a i | θ k ) be the density of an observation ai from the kth component.
θ k are the corresponding parameters. The density function
f k ( a i | θ k ) is generally assumed to be a multivariate normal distribution. The function has the form as
1 exp{ ( ai − u k )T ∑ k−1 ( ai − u k )} 2 (4) f k ( ai µ k , ∑ k ) = 1/ 2 ( 2π ) p / 2 ∑ k where µk is the mean vector, ∑k is covariance matrix of observations. These are the parameters of the density distribution. The parameterization of covariance matrix ∑k decides the characteristics (orientation, volume and shape) of the distributions of clusters. These characteristics can be allowed to vary between clusters or constrained to be same for all clusters. Then expectation maximization (EM) is used to find the clusters and the Bayesian Information Criterion (BIC) is used as a criterion to compare different models.
4 Application of Spatial Choice Model for Real Crime Analysis 4.1 Crime Data Set The data for model estimation came from ReCAP (Regional Crime Analysis Program) system. The ReCAP system is an interactive shared information and decision support system that uses databases, geographic information system (GIS), and statistical tools to analyze, predict, and display future crime patterns.
160
Y. Xue and D.E. Brown
Our crime analysis was based on crime incidents between July 01 1997 and September 30 1997 in the city of Richmond Virginia. We used residential “Breaking and Entering” (B & E) crime incidents for model estimation and validation. Using the crime incidents in the training dataset, we got locations of all incidents on a geographic map. The sub regions shown in Fig. 1 are block groups, which are the smallest areas for which census counts are recorded.
Fig. 1. Breaking and Entering criminal incidents between July 01, 1997 and September 30, 1997 in Richmond, Virginia.
The analysis of B & E is related to locations of households in a city. However, it is difficult to represent all locations of individual houses in even a modest sized city, such as Richmond. Therefore, we aggregated alternatives using 2517 regular grids, which were assumed to be fine enough to represent all spatial alternatives within this area. The features of each spatial alternative came from the combination of census data (from the “censusCD + maps” compact disk held at university of Virginia’s geospatial and statistical data center) and calculated distance values. All features were possibly related to the decision process of criminals. 4.2 Feature Selection by Similarities Since the attributes of spatial alternatives came from census data and calculated distance values, it is possible that some values of these attributes are correlated. Using the calculated correlation values as similarities, we made hierarchical clustering on all features of observed spatial incidents. The clustering of features of observed spatial alternatives is shown as Fig. 2.
161
MED.PH PCINC.97 MHINC.97 AHINC.97 ALC.TOB.PH HOUSING.PH APPAREL.PH FOOD.PH ET.PH P.CARE.PH TRANS.PH
POP8910.DS AGEH56.DST CLS67.DST POP67.DST AGEH34.DST MALE.DST POP.DST FEM.DST RENT.DST HUNT.DST HH.DST OCCHU.DST POP45.DST AGEH12.DST CLS12.DST CLS345.DST FAM.DST POP123.DST MORT2.DST MORT1.DST OWN.DST
D.CHURCH D.PARK D.SCHOOL D.HOSPITAL VACHU.DST
D.HIGHWAY COND1.DST
Decision Based Spatial Analysis of Crime
Fig. 2. Clusters of features of observed spatial alternatives
From the clustering tree, we divided the features into five clusters. Each cluster included correlated features. After checking the distribution of the feature values, we found that COND1.DST is almost uniformly distributed. It is not a good feature for our analysis. Then there are two choices for feature selection of the rest features, random picking from each cluster or combining the features in same clusters. We picked the features D.HIGHWAY (distance to highway), FAM.DENSITY (Family density per unit area), P.CARE.PH (personal care expenditure per household) and D.HOSPITAL (distance to hospital). The first three were used by Brown et al. [7]. These are the key features and supposed to be good enough to represent all other features in same clusters. Based on the selected features, we applied Fraley and Raftery’s clustering methods [14] to the crime data for analysis. The number of clusters was decided by the calculated BIC values. The trends of BIC value are indicated by Fig. 3. According to the Fig. 3, we decided there are 6 clusters among the crime dataset. Each cluster corresponds to certain group of criminals that have similar preferences on their choices of spatial alternatives. The distribution of crime incidents within different clusters is listed in Table 1. 4.3 Model Estimation and Prediction The number of spatial alternatives for crime spatial analysis is very large, which makes the data preparation and computation time prohibitively expensive. To handle this problem, we adopted an importance sampling technique suggested by Ben-Akiva
162
Y. Xue and D.E. Brown
-1700
[1]. Sampling alternatives is an commonly applied technique for reducing the computational burden involved in the model estimation process.
4
4
4
-1750
4 4 3
-1800
3
2 3
-1850
4 3
1
1 2
2 3 1
3 4 2 1
1
2
-1900
BIC
2 3
1
-2000
-1950
2
1 4 3 1 2 2
4
6
8
number of clusters
Fig 3. The trends of BIC values of different parameterized model-based clustering algorithms 1: equal volume, equal shape and no orientation 2: variable volume, equal shape and no orientation 3: equal volume, equal shape and equal orientation 4: variable volume, variable shape and equal orientation
Table 1. Distribution of crime incidents in clusters
Cluster 1 Cluster 2 Cluster 3
Crime incidents 109 180 200
Cluster 4 Cluster 5 Cluster 6
Crime incidents 202 133 55
Next we considered the model estimation and prediction step. The prior probability P( ai ∈ M ) of the adjusted spatial choice models were calculated as in section 3.2 for each cluster. The key features are the features coming out from the feature selection process. Using the training data set of B & E incidents of each cluster, we obtained the estimation of the preference specified spatial choice model for each cluster P( ai | Ad , d ∈ M l ) . M l indicates the presence of criminals with preferences in lth cluster. The final prediction of future crime’s spatial distribution is
Decision Based Spatial Analysis of Crime
163
the combination of the predicted probabilities of all clusters. The combination method is also very important. Given the conditional probability that spatial alternative ai will be picked by criminals within cluster M l , P( ai | Ad , d ∈ M l ) and the chance that criminals
d ∈ M l will commit next crime within the study region P( M l ) , the
probability that spatial alternative ai is picked by any criminal will be L
P( ai | Ad , d ∈ M ) = ∑ P( a i Ad , d ∈ M l )P( M l ) . L is the total number of l =1
clusters within the crime data set. The probability methods. Here we used a ratio as P( M l ) =
P( M l ) can be defined by many
P( ai ∈ M l )
(5)
L
∑ P( a ∈ M i
j
)
j =1
P( ai ∈ M l ) is the probability that an individual d ∈ M l pre-evaluate spatial alternative ai. With the preference specified spatial choice model described above, we made our predictions. Also, we use hot spot model as the comparison model to test the two models provided by this paper, the key feature adjusted spatial choice model and preference specified spatial choice model. The residential B & E incidents between October 1, 1997 and October 31, 1997 were used as testing data set. The predictions of future crimes’ spatial distribution and the testing incidents are shown as Fig. 4-6.
Fig. 4. Prediction of hot spot model with crime incidents from 10/01/97 to 10/31/97
164
Y. Xue and D.E. Brown
Fig. 5. Prediction of key feature adjusted spatial choice model with crime incidents from 10/01/97 to 10/31/97
Fig. 6. Prediction of preference specified spatial choice model with crime incidents from 10/01/97 to 10/31/97
4.4 Model Comparisons To compare different models, we standardized all predictions of the adjusted models and the comparison model. The hypothesis is that for the population of all future crime incidents, the proposed model will outperform the comparison model.
Decision Based Spatial Analysis of Crime
165
We assumed that the testing data set contains m incidents that occurred at the locations a1′ , a ′2 ,..., a m′ , respectively. For incident a i′ , let the predicted probability
pspi and that given by the comparison model be p sci . The hypothesis test was built around µ which denoted the mean of the difference
given by the proposed model be
between the predicted probability given by the proposed model and that given by the comparison model. Assumed that the proposed model have a better prediction than the comparison model for future crimes. Then the null hypothesis is that the predicted probability difference µ between the two models for future crime incident locations is less than or equal to 0. The alternative hypothesis is the predicted probability for proposed model will be significantly better than the comparison model. We performed the hypothesis test as H0: µ
≤0, H a: µ > 0 .
(6) Using the testing data set with m crime incidents, we obtained the estimated probability difference µˆ . m
(
)
ˆ = (1 m )∑ pspi − p sci µ i =1
The standard deviation of the difference,
ˆ = σ
(7)
qsi = pspi − psci was estimated by
(1 (m − 1))∑ (qs m
i =1
i
ˆ −µ
)
2
.
(8)
The results of these tests are shown in table 1. In the testing results, “Mean” and “Std. Dev.” stand for µˆ and σˆ , respectively. p-value indicates the probability that the null hypothesis will be accepted. Table 2. The comparison results Testing data set (10/01/97 - 10/31/97)
Preference Specified vs. Hot Spot Key feature adjusted vs. Hot Spot Preference Specified vs. Key feature adjusted
Mean
Std. Dev
z-Statistic
p-Value
7.757×10-4
5.051×10-3
2.624
0.004
2.861×10-5
2.603×10-4
1.878
0.030
7.471×10-4
4.922×10-3
2.593
0.005
The comparison results indicate that the two spatial choice models significantly outperform the comparison hot spot model. The preference specified spatial choice model also outperforms the key feature adjusted spatial choice model significantly. The results prove that the analysis of feature values attached to all spatial alternatives and the analysis of specified preferences of decision makers lead to the improvement
166
Y. Xue and D.E. Brown
of the prediction of future crimes’ locations. Based on the estimation of criminals’ preferences over the feature space, we provide a more efficient and accurate prediction method for the analysis of crimes’ spatial information.
5 Conclusion Spatial analysis is of critical importance to law enforcement. It enables better planning and use of scarce resources and is particularly useful when addressing the variety of threats facing modern communities. Past work in this area has concentrated on aggregated approaches to understanding criminal behavior and displayed results of this analysis as hot spots. In this paper, a new preference specified spatial choice model is provided that shows how the preferences of criminals can be modeled to better understand the spatial patterns of crime. When used with actual breaking and entering data this method increased the accuracy of the prediction of future criminal locations by a statistically significant amount. In addition, the method also provides a way to interpret the relationship between criminal decision making and spatial attributes.
References 1.
Ben-Akiva, M. and Lerman, S. (1985). Discrete choice analysis, theory and application to travel demand. the MIT press. 2. Besage, J. and Newell, J. (1991). The detection of clusters in rare diseases. Journal of the Royal Statistical Society A 154: 143–155. 3. Bhat C. (1998) Incorporating observed and unobserved heterogeneity in urban work travel mode choice modeling. Transportation science, Vol. 34, No. 2, pp. 228–238, May. 4. Bock, H. H. (1996). Probability models and hypothesis testing in partitioning cluster analysis. Clustering and Classification, Ed. By Arabie, P., Hubert, L., and DeSorte, G. World Science Publishers. 5. Bowman, A. and Azzalini, A. (1997). Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations. Oxford Statistical science series. 6. Bozdogan, H. (1993) Choosing the number of component clusters in the mixture model using a new informational complexity criterion of the inverse Fisher information matrix. Information and Classification, Ed. By Opitz, O., Lausen, B., and Klar, R. 40–54. Springer-Verlag. 7. Brown, D., Liu, H. and Xue, Y. (2001). Mining preferences from spatial-temporal data. Proceedings of first SIAM conference, 2001. 8. Clarke, R. and Cornish, D. (1985). Modeling offenders’ decisions: a framework for research and policy. Crime Justice: An Annual review of research, Vol. 6, Ed. By Tonry, M. and Morris, N. University of Chicago Press. 9. Cliff, A.D. and Ord, J.K. (1981). Spatial processes, models, and applications. London: Pion. 10. Engelman, L. and Hartigan, J.A. (1969). Percentage points of a test for clusters. Journal of the American Statistical Association, 64: 1674. 11. Everitt, B. (1993). Cluster analysis. John Wiley & Sons. New York. 12. Fotheringham, S. (1988) Consumer store choice and choice set definition. Marketing Science, Summer, 299–310.
Decision Based Spatial Analysis of Crime
167
13. Fotheringham, S., Brunsdon, C. and Charlton, M. (2000). Quantitiative Geography.SAGE Publications Ltd. 14. Fraley, C. and Raftery, A.E. (1998). How many clusters? Which clustering method? – Answers via model-based cluster analysis. The Computer Journal, 41(8): 578–588. 15. Graham, U. and Fingleton, B. (1985). Spatial data analysis by example. New York: John Wiley & Sons. 16. Jain, A.K., Murty, M.N. and Flynn, P.J. (1999). ACM Computing Surveys, Vol. 31, No. 3, September, 264–323. 17. Kposowa, A., and Breault, K.D. (1993) Reassessing the structural covariates for U.S. homicide rates: A county level study. Sociological Forces 26:27–46. 18. McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics, 105–142. New York, 19. McFadden, D. and Train, K. (1978). The goods/leisure tradeoff and disaggregate work trip mode choice models. Transportation research, 12,349–353. 20. Openshaw, S., Charlton, M., Wymer, C. and Craft, A. (1987). A mark 1 geographical analysis machine for the automated analysis of point datasets. International Journal of Geographical Information Systems 1: 335–358. 21. Openshaw, S., A. Craft, A., Carlton, M. and Birch, J. (1988). Investigation of leukemia clusters by use of a geographical analysis machine. Lancet 1:272–273. 22. Osgood, W. (2000). Poisson-based regression analysis of aggregate crime rates. Journal of quantitative criminology. Vol. 16. No. 1. 23. Ripley, B. D. (1981). Spatial statistics, John Wiley and Sons: New York. 24. Rust, R. and Donthu, N. (1995). Capturing geographically localized misspecification error in retail store choice models. Journal of Marketing research, Vol. XXXII, 103–110. 25. Train, K. (1998). Recreational demand models with taste differences over people. Land economics, 74, 230–239. 26. Upton, G., and Fingleton, B. (1985). Spatial data analysis by example. New York: John Wiley & Sons.
CrimeLink Explorer: Using Domain Knowledge to Facilitate Automated Crime Association Analysis 1
2
2
Jennifer Schroeder , Jennifer Xu , and Hsinchun Chen 1
Tucson Police Department, 270 S. Stone Avenue, Tucson, AZ 85701
[email protected] 2 Department of MIS, University of Arizona, Tucson, Arizona 85721 {jxu, hchen}@eller.arizona.edu
Abstract. Link (association) analysis has been used in law enforcement and intelligence domains to extract and search associations between people from large datasets. Nonetheless, link analysis still faces many challenging problems, such as information overload, high search complexity, and heavy reliance on domain knowledge. To address these challenges and enable crime investigators to conduct automated, effective, and efficient link analysis, we proposed three techniques which include: the concept space approach, a shortest-path algorithm, and a heuristic approach that captures domain knowledge for determining importance of associations. We implemented a system called CrimeLink Explorer based on the proposed techniques. Results from our user study involving ten crime investigators from the Tucson Police Department showed that our system could help subjects conduct link analysis more efficiently. Additionally, subjects concluded that association paths found based on the heuristic approach were more accurate than those found based on the concept space approach.
1 Introduction Link analysis in the criminal justice domain, as opposed to link analysis in other disciplines, refers to the identification, analysis, and visualization of associations between entities such as persons, locations, and criminal incidents. Hereafter in this document, “link analysis” will refer to this process used by criminal justice investigators and crime intelligence analysts. Law enforcement officers and crime investigators throughout the world have long utilized link analysis to find and analyze relationships and associations between people. For example, the FBI used link analysis in the investigation of the Oklahoma City Bombing case and the Unabomber case to look for criminal associations and investigative leads. The Department of Treasury of the United States used link analysis to detect money laundering activities [12]. Link analysis often provides information about motives in a variety of crimes and helps uncover investigative leads [13]. However, link analysis remains a challenging problem, which consumes much time and human effort. First, it is complicated by the “information overload” problem [3, H. Chen et al. (Eds.): ISI 2003, LNCS 2665, pp. 168–180, 2003. © Springer-Verlag Berlin Heidelberg 2003
CrimeLink Explorer: Using Domain Knowledge
169
15]. Information about associations between crime entities (person, location, organization, property, etc.) is often buried in large volumes of raw data collected from multiple sources (e.g., crime incident reports, surveillance logs, telephone records, financial transaction records, etc.). Usually, link analysis entails an investigator manually expanding known entities by reading each document where the entities in question appear. If two entities appear in the same document, this indicates the two may have some association with each other. If no association is found, the investigator has to iteratively expand more documents until a significant path of associations between the entities is found. This process can be tremendously time-consuming. Second, high branching factors (the number of direct links an entity has) increase the search complexity of link analysis dramatically. A high branching factor can lead to a large number of associations that need to be evaluated when two crime entities are not directly associated. In a breadth-first-search of depth 4, for instance, an average branching factor of 7 can result in 2,401 associations that need to be evaluated. In reality, criminals who have repeated police contacts and arrests tend to commit many crimes with many people, causing high branching factors. The branching factor of an association search can be further inflated if associations with many other entity types (e.g., addresses, organizations, property, or vehicles) are considered. Third, determining the importance of associations for uncovering investigative leads relies heavily on domain knowledge. Crime investigators often focus only on those strong and important associations and paths because different types of crimes usually have different characteristic. Associations between crime entities may carry different weights in investigation of different types of crimes. For example, the relationship between a suspect and a victim may not be as important to uncover investigative leads in a burglary case as in a homicide case. Link analysis may distract or mislead an investigation if not guided by domain knowledge. There have been some link analysis software packages available for use in crime investigation. However, most of these packages do not help extract, search, and analyze associations beyond mere visualization of analysis results. Some tools facilitate only single-level association searches—finding only directly related entities. Automated, effective, efficient link analysis techniques are needed to assist law enforcement and intelligence investigators in carrying out crime investigation [21, 24]. To address the challenges of link analysis, we proposed and implemented several techniques for automated link analysis. These techniques include the concept space approach [4] to extracting associations from crime data, a heuristic-based approach to incorporating domain knowledge, and a shortest-path algorithm [7] to search association paths and reduce search complexity imposed by high branching factors. The rest of the paper is organized as follows. We review prior literature in section 2 and discuss system design in section 3. In section 4 we present results of a system evaluation study conducted at the Tucson Police Department (TPD). Section 5 concludes the paper and suggests future directions.
170
J. Schroeder, J. Xu, and H. Chen
2 Literature Review In this section, we review related work in link analysis, domain knowledge incorporation approaches, and shortest-path algorithms. 2.1 Link Analysis The earliest approach for link analysis is the Anacapa charting system [16]. In this approach, an investigator first constructs an association matrix by examining documents to identify associations between crime entities. Based on this association matrix, a link chart can be drawn for visualization purposes. In a link chart, different symbols represent different types of entities, such as individuals, organizations, vehicles, or locations. Based on this chart, an investigator may discover new investigative directions or confirm initial suspicions about specific suspects [24]. However, this approach is primarily a manual approach and depends on human investigators to extract, search, and analyze association data. It offers little help to address the information overload and high search complexity problems. Some automated approaches have been proposed for link analysis. Lee [20] developed a technique to extract association information from free text. Relying heavily on Natural Language Processing (NLP) techniques, this approach can extract entities and events from textual documents by applying large collections of predefined patterns. Associations among extracted entities and events are formed using relation-specifying words and phrases such as “member of” and “owned by”. The heavy dependence of this approach on hand-crafted language rules and patterns limits its application to crime data in diverse formats. There have been some link analysis tools that allow for “single-level” or direct association searches. Watson [2] can identify possible links and associations between entities by querying databases. Given a specific entity such as a person’s name, Watson can automatically form a database query to search for other related records. The related records found are linked to the given entity and the result is presented in a link chart. The COPLINK Detect system [5] applied a concept space approach developed by Chen and Lynch [4] for exploring associations. This approach was originally designed for generating thesauri from textual documents automatically by measuring cooccurrence weight, the frequency that two phrases appear in the same document. When applied to crime incident reports, this approach can automatically extract association information between crime entities and has been found to be efficient and useful for crime investigation [17]. However, both Watson and COPLINK Detect system allow users to search for only direct associations (“single-level”) and do not facilitate the search for association paths consisting of multiple intermediate links. Moreover, association strengths obtained using the concept space approach are merely based on co-occurrence weights. No domain knowledge is utilized to determine the importance of associations and to consider other information that can potentially suggest associations between crime enti-
CrimeLink Explorer: Using Domain Knowledge
171
ties. In next section, we review prior research on domain knowledge incorporation approaches. 2.2 Domain Knowledge Incorporation Approaches Domain knowledge often is important to solving domain-specific problems. In broader fields of artificial intelligence and data mining research, expert systems and Bayesian networks are typical techniques for incorporating domain knowledge. During the knowledge acquisition phase of expert system construction, domain experts’ knowledge and experience in addition to some common sense rules are collected and recorded. Knowledge generated usually is represented in a set of rules and stored in a knowledge base [26]. Expert systems have been employed in some domains such as factory scheduling [11], telephone switch maintenance [14], and disease diagnosis [23]. Because of the high expense of building knowledge bases and other issues such as low scalability and accuracy, expert systems have not been widely used. Bayesian network is another approach to incorporating knowledge of domain experts [18]. It encodes existing knowledge in a probability network with each node representing a variable and a link representing a dependency relationship between two variables. Some variables in a Bayesian network representing auditors’ knowledge of bank performance, for instance, can be the financial ratios indicating banks’ financial health. Other variables can be indicators of bank failure or other risks. Links between these variables specify the dependency relationships [25]. In addition to incorporating existing knowledge, Bayesian networks can learn new knowledge from data [18] and have been shown to be effective in some domains such as gene regulation function prediction [6,10]. In the domain of law enforcement and intelligence, the approaches for incorporating expert knowledge have been primarily ad-hoc. Goldberg and Senator [12] used a heuristic-based approach in the FinCEN system to forming associations between individuals who had a shared address, a shared bank account, or related transactions. Money laundering and other illegal financial activities could be detected based on associations discovered. However, these heuristics were used by investigators to manually uncover associations and have not been really incorporated into the system for automated link analysis. In case of large datasets, investigators still face the problems of information overload and high search complexity. The next section reviews shortest-path algorithms, which can help reduce search complexity for human investigators. Although they have been studied and employed widely in other domains, shortest-path algorithms have not yet been adopted widely in the law enforcement domain. 2.3 Shortest-Path Algorithms Shortest-path algorithms can find optimal paths between given nodes by evaluating link weights in a graph. One can focus on only the optimal path without being distracted by a large number of other possible paths. The Dijkstra algorithm [7] is the
172
J. Schroeder, J. Xu, and H. Chen
classical method for computing the shortest paths from a single source node to every other node in a weighted graph. Most other algorithms for solving shortest-path problems are based on the Dijkstra algorithm but have improved data structures for implementation [8]. Some researchers have proposed using neural network approaches to solving the shortest-path problem [1]. The shortest-path algorithm has been used to find the strongest association paths between two or more crime entities [27]. Another tool that employs the shortest-path algorithm is Link Discovery Tool [19]. It is able to search for association paths between two individuals that on the surface appear to be unrelated. In summary, prior work related to link analysis has proposed some approaches to addressing the challenges. However, link analysis remains to be a difficult problem for crime investigators when facing large volumes of data. In next section we present the system design of our CrimeLink Explorer to address the three challenges of link analysis.
3 System Design We designed and implemented CrimeLink Explorer for automated link analysis. The system contained a set of crime incident data originating from the Tucson Police Department (TPD) Records Management System. The concept space approach was used to identify and extract associations between all criminals in the dataset based on cooccurrence weights. Alternatively, a number of heuristics captured expert knowledge for identifying criminal associations and determining the importance of associations for investigation. To facilitate the search for the strongest association paths between individuals of interest, we implemented Dijkstra’s shortest-path algorithm with logarithmic transformations on association weights (co-occurrence weights or heuristic weights). A graphical user interface was provided to allow users to input names of interest and visualize association paths found based either on the concept space approach or on the heuristic approach. 3.1 Crime Incident Reports Law enforcement databases usually store crime incident reports, which are a rich source of data about both criminal and non-criminal incidents over extended time periods. Incident reports may document serious crimes such as homicides or trivial incidents such as suspicious activity calls or neighbor disputes. The trivial incident may later provide important information about associations that can later be used to solve serious crimes. Individuals involved in criminal activities may have repeated contacts with police, resulting in their presence in multiple incident reports. All crime incidents are classified into different types (e.g., Homicide, Aggravated Assault, Robbery, Fraud, Auto Theft, Sexual Assault, etc.) usually based on the Uniform Crime
CrimeLink Explorer: Using Domain Knowledge
173
Graphical User Interface
Association Path Search (shortest-path algorithm)
Co-occurrence Weights
Heuristic Weights
Heuristics
Concept Space
Crime Incident Reports
(crime types, shared address, shared phone)
Fig. 1. CrimeLink Explorer system architecture
Reporting (UCR) standard that has been the national standard for case classification and crime reporting since 1930 [22]. The successor to UCR, the National Incident Based Reporting System [9], has not been universally adopted by many U.S. law enforcement agencies. Thus, crime incident reports in this research are UCR based. These incident report records formed the source for automating link analysis in this research. 3.2 Concept Space Approach We used the concept space approach to automatically identifying and extracting associations from crime incident reports. We treated each incident report as a document and each crime entity as a phrase. To reduce complexity, we focused on associations only between persons and did not consider possible associations between other types of entities such as location and property. We then calculated the co-occurrence weights based on the frequency that two persons appeared together in the same crime incidents. Ideally, the value of a co-occurrence weight not only implies the presence of an association between two persons but also indicates the importance of the association for uncovering investigative leads [17]. However, this approach has its limitations when used in link analysis. An example is a burglary investigation where the victim and the suspect appear together in the incident report but have never met and are not even casual acquaintances. Moreover, co-occurrence weights obtained by the concept space algorithm had been found to be of only minor assistance when subjected to user evaluation. In previous user studies, investigators tended to made judgments about the associations independent of the cooccurrence weights provided by the system. Crime investigators were still facing the information overload problem because they had to make the final determination as to
174
J. Schroeder, J. Xu, and H. Chen
the importance of associations. In next section we discuss the heuristic approach as an alternative to the concept space approach. 3.3 Heuristic Approach We collected heuristics that domain experts often use when analyzing crime data to make judgments about the strength of associations between people. We interviewed several crime analysts and detective sergeants at the TPD. Three criteria were identified as the most important heuristics: (a) the relationship between crime type and person roles, (b) shared addresses or telephone numbers, and (c) repeated cooccurrence in incident reports. Rather than employing expert systems or Bayesian networks approaches to incorporating expert knowledge we represented heuristics collected using a 1-100 percentage scale to indicate the strength of associations ranging from weak to strong. A weak association, such as the relationship between a victim and a suspect in a burglary incident, was assigned a value of 1, and a strong association, such as a person and his close friend and criminal associate who have been arrested together repeatedly, were assigned a value near 100. Crime-Type and Person-Role Relationships. The crime investigators we interviewed specialized in investigation of one or more types of crime: Homicide, Aggravated Assault, Robbery, Fraud, Auto Theft, Sexual Assault, Child Sexual Abuse, Domestic Violence, and many others. Person roles used in the TPD dataset included: Victim, Witness, Suspect, Arrestee, and Other. We constructed a matrix and assigned scores to role combinations in each of the crime types. All of the crime investigators agreed that most co-arrestees or suspects in an incident had a strong association. Other role combinations, however, varied considerably depending on the type of crime. The score for a specific role combination was based on the estimation of the strength of the association occurring for that role combination and crime type out of every 100 incidents. For instance, the homicide detective sergeant estimated that at least 98 out of 100 homicide incidents included a victim and a suspect who were acquaintances. Thus, the corresponding score for victimsuspect combination for homicide crimes in the heuristic matrix was set to be 98. This method of assigning heuristic scores was somewhat arbitrary and could be enhanced by including a statistical analysis of the crime-type/person-role relationship. However, to capture such statistics by manually reading a large number of incident reports from each crime type to assess relationship information would be time prohibitive. We therefore relied on domain experts’ estimation based on their past experience rather than statistical analysis. Although informative, heuristics based on the relationship between crime type and person role could not necessarily provide complete information about criminal associations. For instance, the association score between two arrestees in narcotics sale incidents was assigned 95. This accurately reflected the high likelihood that the two arrestees knew each other, but did not capture the fine gradient from acquaintances to
CrimeLink Explorer: Using Domain Knowledge
175
close friends. Shared telephone and address associations and repeated appearances together in incidents could provide additional information to distinguish links from weak to strong. To allow a point spread to include this additional information, the heuristic scores based on crime type and person role were reduced to account for 85% of the final heuristic weight. Shared Address/Phone. Our domain experts stated that shared phone numbers and addresses were often important indicators for associations. We therefore included assignment of additional score to an association when two persons who shared a common phone number or address. Since phone number data were often subject to various errors in the TPD databases, they added only 5% of the final heuristic weight. Shared addresses added an additional 10% to the final heuristic weight since they were often more significant and less erroneous than phone number data. Co-occurrence. In the absence of other information suggesting an association, that two persons appeared together in multiple incidents might imply a strong relationship. This was the same rationale behind the concept space approach. However, rather than using co-occurrence weight, we estimated the strength of an association resulted from multiple co-occurrences in incidents based on an empirically derived probability distribution. We obtained the empirical distribution by analyzing a random sample of 40 incident reports of various crime types and counting the number of times each pair of persons co-occurred. We read supporting narrative reports for each incident to determine whether an association was important. We found that the more times two persons appeared together, the more likely they were involved in family related crimes. That is, a large number of co-occurrences between two persons implied a high likelihood for them to have a close relationship. For example, in 21 out of 40 incidents containing persons who appeared together four times, 15 were domestic violence incidents, custodial interferences, or family fights, six were court order enforcements or civil matters that were often related to domestic situations. Court orders and civil matters that were not family related overwhelmingly concerned persons who had some prior association. Based on our analysis, we constructed the probability distribution by assigning 1 to a single co-occurrence, indicating that it could be completely random with no other facts to support a stronger association. From two to three co-occurrences the probability increased rapidly. The probability distribution above 4 exceeded 99%, so all pairs of subjects who co-occur four or more times were given a probability of 100. Table 1. Empirically derived probability distribution Co-occurrence count 1 2 3 ≥4
Association probability (%) 1 45 98 100
176
J. Schroeder, J. Xu, and H. Chen
The final heuristic weight for a specific association was calculated by the maximum of the scores between the sum of crime-type/person-role relationship, shared address, and shared phone, and association probability based on co-occurrence counts: MAX(0.85 (crime-type/person-role score) + 0.05 (shared phone score) + 0.10 (shared address score)), 1.00 (association probability based on co-occurrence counts)). 3.4 Association Path Search For this system, we used the Dijkstra’s shortest-path algorithm [7] to address the search complexity problem. A logarithmic transformation was made on association weights because the conventional shortest-path algorithms could not be used directly to solve the problem of identifying the strongest association between a pair of persons [27]. With this transformation, a user could find the strongest association paths among two or more persons of interest. 3.5 User Interface A graphical user interface was implemented to allow a user to interact with the system. Figure 2 shows the user interface after the user has conducted a search for a path between three persons. Names are scrubbed for data confidentiality. The user entered the names of interest in the text field and then pressed the “Show Associations” button. The system conducted the shortest-path search based on either the co-occurrence weights or heuristic weights depending on the user’s choice. The user could then double-click on any node to see additional information (sex, date of birth, and Social Security number) about the person represented. The user could also double-click on a link and see information about the origin of the link, shared phone numbers or addresses, the weights from the concept space approach or from the heuristics, and the descriptions of incidents in which the two persons were involved.
4 System Evaluation

We conducted a user study at the TPD to evaluate our system's performance. We wanted to find out whether the automated link analysis approaches we proposed (the concept space approach, the heuristic approach, and the shortest-path algorithm) helped address the information overload and search complexity problems, and whether domain knowledge helped identify associations between crime entities more accurately than the concept space approach. We extracted approximately 20 months of incident reports from the TPD database. The resulting dataset contained 239,780 incident reports in which 229,938 persons were involved. Information about those persons, such as age, gender, race, address, and phone number, was also extracted.
Fig. 2. CrimeLink Explorer user interface
Ten crime analysts and criminal intelligence officers at the TPD participated in the study. Several subjects were very experienced specifically in link analysis. Each subject was asked to perform three tasks using CrimeLink Explorer and COPLINK Detect (a "single-level" link analysis tool that finds crime entities directly associated with a given entity): (a) use COPLINK Detect to find the strongest association paths among three given person names, (b) use the concept space approach provided by CrimeLink Explorer to find the strongest association paths among three given persons, and (c) use the heuristic approach provided by CrimeLink Explorer to find the strongest association paths among three given persons. Name sets used in the tasks were different but equally difficult. We summarize the results as follows: Subjects could conduct a link analysis more efficiently using CrimeLink Explorer than using COPLINK Detect. Because COPLINK Detect did not facilitate the search for association paths between crime entities that were indirectly connected, subjects had to expand links manually to find possible criminal associations. CrimeLink Explorer, in contrast, provided the functionality of searching for the strongest association paths between crime entities across multiple levels. Most subjects were able to find direct associations of the three given names using COPLINK Detect, but could not keep track of all the associations that could be generated as they traversed into the second and third levels of the search. They said it would take them hours, or possibly more than a day, to find the paths between the names. However, all subjects could quickly find association paths for tasks (b) and (c) using CrimeLink Explorer. This
result showed that the automated path search functionality, based on the shortest-path algorithm, significantly increased the efficiency of link analysis. Subjects believed that association paths found using the heuristic approach were more accurate than those found using the concept space approach. This was because the heuristics captured the domain knowledge crime investigators relied on to determine the importance of associations between crime entities. The heuristic weights included not only co-occurrence information but also person roles in different types of crimes, shared phones, and shared addresses. As some subjects commented, "That makes more sense, since it takes into account the kind of case." Subjects were also asked to indicate how useful the system was as an investigative tool. All subjects gave positive feedback and expressed enthusiasm about the tool. Several subjects asked when they would be able to use the system in their daily work. The results of the user study were quite encouraging. The automated link analysis approaches we proposed in this research could greatly reduce crime investigators' time and effort when conducting link analysis. Moreover, the domain knowledge incorporated in the system could more accurately reflect human judgment about the strength of associations between criminals.
5 Conclusions and Future Work

Link analysis has faced challenges such as information overload, search complexity, and reliance on domain knowledge. Several techniques were proposed in this paper for automated link analysis: the concept space approach, the shortest-path algorithm, and a heuristic approach that captured domain knowledge for determining the importance of associations. We implemented the proposed techniques in a system called CrimeLink Explorer. The system evaluation focused on the approaches' efficiency and accuracy, both of which are desirable features of a sophisticated link analysis system. The user study results demonstrated the potential of our approaches to achieve these features using domain-specific heuristics. Rather than using estimates of heuristic weights, we plan in the future to apply statistical analysis to NIBRS (National Incident-Based Reporting System) data [9], which captures specific information about the nature of associations between individuals involved in an incident, to validate the weights in the heuristic table. The heuristics can also be extended to include common-vehicle and common-organization associations. We also plan to encode expert knowledge in Bayesian networks and incrementally learn new knowledge from crime data. Variables in such a Bayesian network may specify whether two persons were family members, were good friends, or went to the same school. Other variables may capture the likelihood of these pieces of information being important to uncovering investigative leads. Links between these variables can indicate the dependency relationships.
Acknowledgement. This project has primarily been funded by the National Science Foundation (NSF), Digital Government Program, "COPLINK Center: Information and Knowledge Management for Law Enforcement," #9983304, July 2000-June 2003, and the NSF Knowledge Discovery and Dissemination (KDD) Initiative. We appreciate the critical and important comments, suggestions, and assistance from Detective Tim Petersen and other personnel of the Tucson Police Department.
References

1. Ali, M., Kamoun, F.: Neural networks for shortest path computation and routing in computer networks. IEEE Transactions on Neural Networks, Vol. 4, No. 5. (1993) 941–953.
2. Anderson, T., Arbetter, L., Benawides, A., Longmore-Etheridge, A.: Security works. Security Management, Vol. 38, No. 17. (1994) 17–20.
3. Blair, D. C., Maron, M. E.: An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, Vol. 28, No. 3. (1985) 289–299.
4. Chen, H., Lynch, K. J.: Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, Vol. 22, No. 5. (1992) 885–902.
5. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., Schroeder, J.: COPLINK: Managing law enforcement data and knowledge. Communications of the ACM, Vol. 46, No. 1. (2003) 28–34.
6. Cooper, G., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning, Vol. 9. (1992) 309–347.
7. Dijkstra, E.: A note on two problems in connexion with graphs. Numerische Mathematik, Vol. 1. (1959) 269–271.
8. Evans, J., Minieka, E.: Optimization Algorithms for Networks and Graphs, 2nd edn. Marcel Dekker, New York (1992).
9. Federal Bureau of Investigation: Uniform Crime Reporting Handbook: National Incident-Based Reporting System (NIBRS). Edition NCJ 152368. U.S. Department of Justice, Federal Bureau of Investigation (1992).
10. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB00) (2000).
11. Fox, M. S., Smith, S. F.: ISIS: A knowledge-based system for factory scheduling. Expert Systems, Vol. 1, No. 1. (1984).
12. Goldberg, H. G., Senator, T. E.: Restructuring databases for knowledge discovery by consolidation and link formation. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
13. Goldberg, H. G., Wong, R. W. H.: Restructuring transactional data for link analysis in the FinCEN AI system. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
14. Goyal, S. K., et al.: COMPASS: An expert system for telephone switch maintenance. Expert Systems, July 1985.
15. Grady, N. W., Tufano, D. R., Flanery, R. E., Jr.: Immersive visualization for link analysis. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
16. Harper, W. R., Harris, D. H.: The application of link analysis to police intelligence. Human Factors, Vol. 17, No. 2. (1975) 157–164.
17. Hauck, R., Atabakhsh, H., Ongvasith, P., Gupta, H., Chen, H.: Using Coplink to analyze criminal-justice data. IEEE Computer, Vol. 35. (2002) 30–37.
18. Heckerman, D.: A tutorial on learning with Bayesian networks. Microsoft Research Report MSR-TR-95-06 (1995).
19. Horn, R. D., Birdwell, J. D., Leedy, L. W.: Link discovery tool. In: Proceedings of the Counterdrug Technology Assessment Center's ONDCP/CTAC International Symposium, Chicago, IL (1997).
20. Lee, R.: Automatic information extraction from documents: A tool for intelligence and law enforcement analysts. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
21. McAndrew, D.: The structural analysis of criminal networks. In: Canter, D., Alison, L. (eds.), The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III. Aldershot, Dartmouth (1999).
22. National Archive of Criminal Justice Data: Uniform Crime Reporting Program Data [United States] Series. http://www.icpsr.umich.edu:8080/NACJD-SERIES/00057.xml
23. Shortliffe, E. H.: Computer-Based Medical Consultations: MYCIN. Elsevier, North-Holland (1976).
24. Sparrow, M. K.: The application of network analysis to criminal intelligence: An assessment of the prospects. Social Networks, Vol. 13. (1991) 251–274.
25. Sarkar, S., Sriram, R. S.: Bayesian models for early warning of bank failures. Management Science, Vol. 47, No. 11. (2001) 1457–1475.
26. Turban, E.: Review of expert systems technology. IEEE Transactions on Engineering Management, Vol. 35, No. 2. (1988) 71–81.
27. Xu, J., Chen, H.: Using shortest-path algorithms to identify criminal associations. In: Proceedings of the National Conference for Digital Government Research (dg.o 2002), Los Angeles, CA (2002).
A Spatio Temporal Visualizer for Law Enforcement

Ty Buetow¹, Luis Chaboya¹, Christopher O'Toole¹, Tom Cushna¹, Damien Daspit¹, Tim Petersen², Homa Atabakhsh¹, and Hsinchun Chen¹

¹ University of Arizona, MIS Department, AI Lab
{tbuetow, chaboyal, otoolec}@cs.arizona.edu
{tcushna, damien, homa, hchen}@bpa.arizona.edu
² Tucson Police Department, Tucson, Arizona 85701
[email protected]

Abstract. Analysis of crime data has long been a labor-intensive effort. Crime analysts are required to query numerous databases and sort through results manually. To alleviate this, we have integrated three different visualization techniques into one application called the Spatio Temporal Visualizer (STV). STV includes three views: a timeline, a periodic display, and a Geographic Information System (GIS) view. This allows for the dynamic exploration of criminal data and provides a visualization tool for our ongoing COPLINK project. This paper describes STV, its various components, and some of the lessons learned through interviews with target users at the Tucson Police Department.
1 Introduction

Information visualization techniques have proven useful for presenting large amounts of data. Specifically, in the law enforcement domain, visualization techniques can be very helpful for tasks such as crime investigation, as well as for presenting findings to supervisors and even in court. Law enforcement agencies currently use a combination of technological and manual techniques for crime analysis. However, these methods are very time consuming. We have developed the Spatio Temporal Visualizer (STV) tool to assist crime analysts in their search for information and in presenting their results. In order to visualize the data needed by crime analysts, we use three types of visualization techniques: a periodic view, a timeline view, and a GIS view. Each technique has its own strength: periodic visualization displays patterns with respect to time; timeline visualization displays characteristics of temporal data in a linear manner; GIS visualization displays information on a map and allows for spatial analysis of data. We combine these techniques into one tool to allow the same data to be examined from three different views simultaneously. In this paper we present the motivation behind STV, followed by a literature review of relevant visualization techniques. Next, we demonstrate how STV provides dynamic access to data and presents three different views. We illustrate STV's functionality with an example of how it would be used by a crime analyst or police officer. Finally, we discuss some of the lessons learned after interviews and discussions with
potential users from the Tucson Police Department (TPD) and conclude with some future directions.
2 Background and Motivation

Historically, law enforcement agencies have attempted to maintain records of criminal events to solve crimes, aid prosecution, document response, detect serial crimes, and identify trends. Solving a crime often depends on identifying characteristics of the incident and then matching those characteristics to a known criminal or to suspects whose past actions, motives, or opportunities most closely correlate to the incident at hand. This matching process can occur in an individual officer's memory or in a multimillion-record database. The efficiency of the individual officer is adversely affected when the amount of information exceeds his memory capacity or his ability to process it. In addition, the usefulness of a large database depends on the ability to display appropriate and adequate information in a manner that can be efficiently utilized by an investigator. In the past, crime analysts have dealt with some of these issues through the use of pin maps, graphs, timelines, and summarizations. All of these tend to be somewhat subjective and dependent upon the ability and understanding of the analyst doing the preparation and the quality of the data being analyzed. For a better understanding of the problem, imagine a situation in which an analyst is tasked with enlightening a group of police managers on the state of burglaries in a city. He would first need to decide how to approach this task, whether by comparing the number of incidents over several years, from year to year, or month to month. He would also need to decide whether to analyze the occurrence of these crimes across areas of the city, time of day, day of week, type of victim, or any other factors or combination of factors. He would extract data for the period (or periods) he considers appropriate and then, through the use of various tools, construct graphs, charts, and maps to depict the information in the manner he chooses. An undertaking of this nature often takes several days for an experienced analyst to complete. The problem is that the information the analyst chooses to survey is quite dependent upon the training and experience of the analyst, or perhaps the input of his immediate supervisor. Considering the time and effort needed to compile the project, if the group of police managers had concerns or questions different from those which the analyst chose to address, a second or third separate project would be required. The current state of the art of crime analysis is hindered by limited objectivity and a lack of tools that allow for dynamic review. STV aims to remedy this analysis deficiency by providing an easy, dynamic workspace.
3 Literature Review

Extensive research has been conducted in the areas of the three views implemented in the STV project, and it has been applied to various application domains. In the area of crime analysis, GIS software allowing users to view crimes on a map is quite common. There are few tools, however, that allow users to view law enforcement related data in a
temporal context or in a periodic pattern. Analysis of these techniques shows that they are largely segregated and miss the synergy that is created when multiple views of the same data can be seen simultaneously. To the best of our knowledge, there are currently few tools that harness the power to examine a single data set from multiple perspectives. As will be described in Section 4, STV incorporates three different views of the same data set in one tool.

3.1 Periodic Data Visualization Tools

Common methods for viewing periodic data include sequence charts, point charts, bar charts, line graphs, and spiral graphs, all of which can be displayed in 2D or 3D [7, 16]. We use the spiral graph method in STV due to its ability to visualize periodic patterns better than the other methods. The Spiral Graph [17] developed at the Technical University of Darmstadt is an excellent example in which the spiral method of visualization was used. Using the Spiral Graph, different kinds of periodic information can be visualized [17]. The main method of mapping data to the Spiral Graph relies upon the thickness of lines to represent the amount of data and different colors to represent different types of data. The University of Minnesota has also developed different implementations of the spiral method of visualization using the spiral of Archimedes [1]. These provide good examples of how data can be mapped in different ways. For instance, there are both 2D and 3D spiral graphs in which the thickness of dots along the spokes of a spiral represents the amount of data [3]. The advantage of using a 3D representation is that several data sets can be shown simultaneously. However, a 3D representation can become confusing and make it difficult to see a developing pattern. The spiral implementation that STV most resembles is the ReCAP implementation known as a Time Chart [2]. The disadvantage of Time Chart is that it only plots data in monthly, 24-hour, or 7-day time periods. Therefore, the user does not have the ability to see yearly patterns. In addition, using Time Chart the user is unable to see how many incidents took place in a certain time period. As will be discussed in Section 4.2.2, STV's periodic pattern view overcomes these shortcomings.

3.2 Timeline Tools

A timeline is a linear or graphical representation of a sequence of events. In general, timelines are a temporal ordering of a subject of interest: events, entities, or topics of interest are displayed along an axis. As such, many projects have explored visualization through timeline techniques. The desire to visualize time relationships and patterns in data has been an ongoing area of research. One issue addressed in visualization is the desire to see the big picture while being able to drill down to examine events in detail. In this regard, Snap [6] attempts to increase the total amount of data that can be displayed by placing a large number of entities into a single "aggregate". This new collection can then be displayed for summary information or drilled down into for closer inspection. Lifelines [14] displays legal or medical data to professionals in those fields. Here, the goal is the visualization of a patient or case history, allowing users access to data
from one screen. In addition, this project aims to enhance anomaly and trend spotting and to streamline access to data. In an attempt to create more relational querying power in a timeline, Hibino [8] developed Multimedia Visual Information Seeking to allow users to interactively select two subsets of events and dynamically query for temporal relationships. In short, this allows a user to ask, "How often is event type A followed by event type B?" Others, such as Kullberg [10], attempted to reinvent the 2D timeline in three dimensions. Holly [9] has proposed timelines to view program hot spots during execution. In a more general approach, Kumar [11] developed the ITER model as a basis for developing timeline applications. All of these applications offer different temporal views of their respective data sets. In addition, with the proliferation of the Internet, many forms of informal timelines are present, many of which communicate personal histories and the like. Several private companies also offer timeline tools for analysis in various professional fields. Although there are many existing timeline tools, to the best of our knowledge there are very few that incorporate a timeline view simultaneously with other views of the same data set, as we have implemented in STV.

3.3 Crime Mapping Tools

The use of Geographic Information Systems (GIS) in law enforcement applications is becoming increasingly important in supporting crime analysts' capabilities. This field is split between two main areas: finding better ways to display the data available and finding better ways to mine the data to help crime analysts save time. One tool might be used to mine data and another tool to display the information gleaned from the data mining; the crime analyst would still have to manually run the data mining program and then manually move the data into GIS software for display. This process can be painfully slow. One tool that combines these two areas is the Regional Crime Analysis Program (ReCAP) developed by Dr. Brown at the University of Virginia [2]. Brown observed that current systems had three main shortcomings: they did not allow the user to run a spatial query to obtain the data set in which the user is interested, they did not automate the process of analysis through data mining, and they required users to be proficient in GIS and mapping technologies. ReCAP was developed to address these three shortcomings. A tool that deals mainly with data mining for GIS is the CrimeStat Spatial Statistics Program developed by Ned Levine & Associates [12]. This tool has an impressive number of data mining options available, including spatial distribution analysis, distance analysis, hot spot analysis, interpolation (kernel density estimation), and space-time analysis (Knox and Mantel) tools. The user must manually import the data into these tools; the analyzed data can then be saved for later use. There are many examples of how other organizations have created tools to display the data they have mined using CrimeStat [12]. Two commercial tools that are popular in law enforcement for viewing crime data on a map are ArcView, developed by ESRI, and MapInfo, developed by the MapInfo Corporation [5] [13]. These tools allow the user to import data from various file types and even perform sophisticated database operations on the imported data. Their
popularity has the advantage that many people in the industry are already familiar with them.
4 Features of STV

STV is a data visualization tool built on top of our ongoing COPLINK project [4]. COPLINK provides one-stop data access and search capabilities, through an easy-to-use interface, for local law enforcement agencies such as the Tucson Police Department (TPD). STV is intended to take COPLINK one step further by providing an interactive environment where analysts can load, save, and print police data in a dynamic fashion for exploration and dissemination. For instance, an analyst can search all robberies that have taken place over the past two years and visualize them. In addition, the analyst may wish to visualize all drug arrests simultaneously with the robberies and see if there is any correlation between the two.

4.1 Technologies Used

STV is built as a Java applet in a modular fashion. This was done with the intent that other types of views could be added in the future with relatively little work by taking advantage of object-oriented inheritance. One key advantage of an applet is that no software needs to be installed or maintained on analysts' machines. Queries are performed using applet-to-servlet communication to connect to an Oracle database. Results are stored by a controller class and accessed by each STV view. On the backend, JDBC is used to connect to the COPLINK database. One addition, specifically required by the STV project, was an area to save user preferences and past queries specific to each of the views. Although this information is saved in the same database, it is independent of the COPLINK schema. This addition allows police officers to save valuable time by storing search information in the application's database.

4.2 Components

STV overcomes some of the disadvantages of other existing crime visualization tools by presenting three perspectives on the same data. The details of each view are described in the following sections. In addition, two screenshots of STV in Figures 1 and 2 illustrate its functionality by displaying an example of bank robbery data from 1996–2002, described in Section 5.

Control Panel. The control panel (Figure 1.c) maintains central control over temporal aspects of the data.
• The time-slider controls the range of time viewed. Thus, the data may span six years, but the time-slider may be narrowed to focus on one year or one month. This time window into the data may then be moved like a typical slider to incorporate
new data points and exclude others. This slider was inspired by Lifelines [14] and by Richter [15].
• Granularity, referring to the unit of time, is controlled through a drop-down menu. Currently, years, months, weeks, and days are implemented. Changing this option has the effect of re-labeling the timeline and altering the periodic patterns being examined.
• The overall time bounds are controlled through a series of drop-down menus. Thus, while all data points may lie in a particular time span, a user can narrow focus to a subset of data based on time bounds.

Periodic View. The main purpose of the periodic view (Figure 1.d) is to give the crime analyst a quick and easy way to search for crime patterns.
• The circle represents time at the granularity the user chooses. For instance, it may represent a year, month, week, or day.
• Within the circle there are sectors that divide it into different time periods within the granularity selected. The analyst also has the ability to change the granularity of the sectors. For example, the circle could be set to year granularity and the sectors could be set to represent months, weeks, or even days. The advantage of this is that the analyst may see different patterns developing over the different time periods.
• Sectors are labeled to indicate their specific time interval.
• Data is represented by spikes within each time period.
• Rings with labels inside the circle represent the quantity of data.
• Using the box plot method, a crime analyst can easily determine if any spikes are outliers.

Timeline View. The timeline view (Figure 1.a) is a 2D timeline with a hierarchical display of the data in the form of a tree.
• A specific time instant may be highlighted. When combined with the current granularity, all points in that time period are highlighted. For example, if the granularity is month and a point in June 1999 is selected, all data in June 1999 are highlighted.
• The tree view and timeline view of the data are coordinated such that expanding a node in the tree expands the data points viewed on the timeline. At the same time, data under a particular node in the tree is summarized in the timeline at that node's corresponding y-coordinate location.
• The time-slider controls the current timeframe viewed. This has the effect of allowing the user to slide across the timeline at various levels of detail.
• The tree view allows the user to see the data in a traditional and organized way.

GIS View. The GIS view (Figure 1.b) displays a map of the city of Tucson on which incidents can be represented as points of a specific color.
• The user can zoom in and out of the map. Zooming in allows for more streets to be displayed.
• Incidents may be selected by dragging a box around points on the map. This narrows the information being displayed by all views, focusing on the selected incidents.
• The user can move backward and forward in the zoom history, similar to an Internet browser.
• The GIS view emphasizes data points within the time period specified by the time-slider. Data points outside this period are faded.
• Data points highlighted in the timeline view are highlighted in the GIS view.
Fig. 1. STV. In this case, bank robberies for the last six years are displayed in the timeline, GIS and periodic views. From here, users may narrow focus through granularities and time bounds as well as geographic parameters
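As a rough illustration of how the periodic view populates its sectors, the sketch below buckets incident timestamps by the selected granularity (e.g., day-of-week sectors within a week). This is only an assumed, simplified rendering of the behavior described above; the function name and granularity keys are ours, not part of STV.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable

def periodic_sectors(timestamps: Iterable[datetime],
                     granularity: str = "week") -> Counter:
    """Count incidents per periodic sector, as the periodic view's spikes do.
    Changing the granularity re-labels the sectors and the patterns shown."""
    sector_key = {
        "year": lambda t: t.month,            # month-of-year sectors
        "month": lambda t: t.day,             # day-of-month sectors
        "week": lambda t: t.strftime("%A"),   # day-of-week sectors
        "day": lambda t: t.hour,              # hour-of-day sectors
    }[granularity]
    return Counter(sector_key(t) for t in timestamps)
```

For example, with week granularity, a Friday peak like the one described in the next section would surface directly as the largest counter.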
5 A Crime Analysis Example

To illustrate STV functionality, we explore a hypothetical scenario in which a police officer has been assigned the task of examining bank robbery data. The officer begins by logging into COPLINK as described in Figures 1 and 2. He performs a search for bank robberies in Tucson and selects the results he is interested in. STV starts by visualizing the 280 bank robberies selected. The officer looks for trends using the three views. Upon expanding the spiral view, he notices that the months from October to December are peak months for bank robberies in Tucson. Deciding to compare this trend with the previous year, he narrows the data being viewed by inputting September 1, 2001 as a start date and December 31, 2001 as an end date (Figure 3).
Fig. 2. Functionality. Views may be moved to provide better focus or because of user preference. Here, GIS view is centered and a geographic query is performed. The data set is narrowed to those selected by the user with corresponding updates in other tools. In the timeline view, points within the geo-search are emphasized, while other points are faded. The periodic view displays summary data on the selected points indicating June, April, November and December have higher incidence of bank robberies. The control panel allows for focus onto a specific period of time within the global time frame selected. Granularity (viewing in terms of days, weeks, months, years) and global time bounds may also be altered
At this point, the data has been narrowed to 31 bank robberies. By looking at the timeline view, the officer sees three gaps in bank robbery occurrences (Figure 4). He notices that at the beginning of September and October, no bank robberies occurred. More striking is the fact that after approximately Thanksgiving, only two robberies occurred. The officer decides to examine geographic aspects of the data to see if further trends are apparent (Figure 5). He notices a cluster of robberies on the northwest side of town. Zooming in, he sees that the area north of Broadway Avenue is where the vast majority of bank robberies occurred during the selected time interval, with some locations being robbed multiple times in four months. Additionally, an area around the intersection of Euclid Avenue and Grant Road appears to be the center of a concentration of activity. The officer selects points on the northwest side of town by dragging a box around them to see if other trends become apparent. He then moves the periodic view to the center, bringing several trends to light. None of the 17 robberies occurring in this geographic region during the four-month period took place within the first week of a month, while the third week of the month saw the most robberies. In addition, the periodic tool reveals that more robberies occur on Fridays than on other days of the week (Figure 6).
Returning to the timeline view, he notices that several robberies occurred on the same day. The officer highlights November 15. This automatically highlights the robberies in the geographic view as well. In addition, this helps the officer realize that two days earlier, two other banks were robbed in this same area. For a police officer or crime analyst, many questions arise. Why the sudden disappearance of robberies after Thanksgiving? Why was the first week of each month devoid of robberies? Why were so many banks hit in the same area at the same time? A crime analyst could use STV for further queries, for example concerning arrests that occurred immediately after these robberies. Although further queries and exploration may be necessary, points of interest were discovered. It may now be advisable to increase patrols in the areas with increased incidence of bank robbery, particularly within the time periods that became apparent. By manipulating the data, cutting and slicing, and zooming in and out, several trends were revealed in less than 20 minutes.
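The coordinated narrowing used throughout this scenario (time bounds from the control panel plus a dragged geographic box in the GIS view) can be sketched as a simple filter. The Incident fields and coordinate convention below are hypothetical stand-ins for the COPLINK schema, which we do not reproduce here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class Incident:
    when: datetime
    x: float  # map coordinate (hypothetical projection)
    y: float

Box = Tuple[Tuple[float, float], Tuple[float, float]]  # two opposite corners

def narrow(incidents: List[Incident], start: datetime, end: datetime,
           box: Optional[Box] = None) -> List[Incident]:
    """Keep incidents inside the time bounds and, optionally, inside the
    dragged selection box; all three views would redraw from this subset."""
    kept = [i for i in incidents if start <= i.when <= end]
    if box is not None:
        (x0, y0), (x1, y1) = box
        lo_x, hi_x = min(x0, x1), max(x0, x1)
        lo_y, hi_y = min(y0, y1), max(y0, y1)
        kept = [i for i in kept
                if lo_x <= i.x <= hi_x and lo_y <= i.y <= hi_y]
    return kept
```

In the scenario above, the first call would use start = September 1, 2001 and end = December 31, 2001, narrowing 280 robberies to 31; a second call with the northwest-side box would narrow those to the 17 examined in the periodic view.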
Fig. 3. The periodic view displaying bank robberies for each month from 1996-2002. The period from October to December has more events than other months
6 Lessons Learned

Although the STV tool has not yet been deployed at TPD, we have received feedback on the tool from ten TPD crime analysts and from a seasoned detective. It is important for STV to be assessed by these sources because detectives and crime analysts will be its primary users. Comments made by the detective and the analysts throughout the initial development are summarized below.
Fig. 4. Robberies from September 1, 2001 to December 31, 2001
Fig. 5. Selecting points in the GIS view narrows focus
Fig. 6. The periodic view reveals week-per-month and day-per-week trends
Fig. 7. Highlights in the timeline view appear automatically in the GIS view
6.1 Current Strengths of STV

From our first meeting with analysts, the options to load, save, and print projects were expressed as high priorities. Once implemented, projects no longer needed to be recreated each time a user logged onto COPLINK. Similarly, the ability to produce a hard copy of information is often very desirable. These functions enable users to more easily incorporate STV into their analysis. Potential users of STV at the TPD have indicated that the ability to expand and constrict the data being displayed is important in searching for different crime patterns. For instance, an analyst may begin with a large number of incidents being displayed and then narrow them down to relevant incidents, or vice versa. They feel that the STV tool does this quickly and efficiently by means of the control panel and the GIS view. The STV tool will also allow police managers, with the help of analysts, to discuss ongoing problems and trends. For example, TPD has a meeting known as the Targeted Operational Planning Meeting (TOP) in which police chiefs and other managers analyze problems and address them. Having STV available during this brainstorming session would allow these TPD officials to view additional crime trends that may not have been considered. The analysts indicated this as an important strength because quite often the police chiefs and managers will want to see different aspects of crime trends "on the fly". A final strength that cannot be overstated is STV's ability to abstract away the tedious details of database searches and displays. Computers are excellent at these types of processes. By shifting an analyst's focus from a low level of computer interaction to a much higher level of patterns, causes, and effects of crime, STV increases the efficiency of analysis.

6.2 Areas of Improvement for STV

While most of the feedback we received from TPD was favorable, users have indicated certain areas of potential improvement for STV. The biggest concern is the limited customization that the tool currently supports. For instance, crime analysts may wish to add a note and attach it to an incident that is being visualized. They may also wish to add events to the data set that are not present in the databases. A second area of concern is the customization of colors and shapes. For example, officers may want to have all robberies displayed as a green triangle and all homicides displayed as a red circle. The size of data points was also expressed as a concern. A problem common to virtually all visualization techniques is that of labeling. Analysts recommended a variety of labels for data points, from standard text labels to balloon labels that appear on mouse hover. The size and content of labels were also of interest. Crime analysts have also expressed interest in the ability to have STV communicate with COPLINK Connect/Detect [4], which has already been deployed at TPD. For instance, if a group of incidents such as robberies is visualized, an analyst may wish to select a particular incident and see the corresponding information from COPLINK Connect/Detect displayed.
Finally, STV lacks automatic analysis functionality. This means that users cannot click a button and have an algorithm applied to their data set to solve a problem. Features such as hot spot algorithms, which determine clusters of activity, or algorithms that detect anomalies in data sets are currently not present but are desirable.
7 Conclusions and Future Directions

The STV tool is scheduled to begin user studies at the TPD in March 2003. The plan is to have crime analysts use the STV tool in their daily activities in order to discover other strengths and areas for improvement. The experiences of crime analysts will provide valuable insights into future directions for the STV project. The ability STV provides to synchronize three different views for visualizing crime-related data gives law enforcement an advantage in crime analysis. This, combined with dynamic access to data and STV's user-friendly interface, presents advantages over traditional methods. As the veteran detective said, "This application has the potential to revolutionize the manner in which we examine crime trends and pursue criminals."

Acknowledgements. This project has primarily been funded by the following grants:
• NSF, Digital Government Program, “COPLINK Center: Information and Knowledge Management for Law Enforcement,” #9983304, July 2000-June 2003. • National Institute of Justice, “COPLINK: Database Integration and Access for A Law Enforcement Intranet,” # 97-LB-VX-K023, July 1997-Jan. 2000. • National Institute of Justice, “Distributed COPLINK Database and Concept Space Development,” #0308-01, Jan. 2001-Dec. 2001. • NSF, Information Technology Research, “Developing A Collaborative Information and Knowledge Management Infrastructure,” NSF/IIS #0114011, Sept. 2001-Aug. 2004. We would like to thank the following people for their support and assistance during the entire project development and evaluation process:
• All members of the University of Arizona Artificial Intelligence Lab staff and Coplink Staff, • Lt. Jenny Schroeder, Dan Casey and other contributing personnel from the Tucson Police Department, • The Phoenix Police Department.
References

1. Archimedean spiral, http://www.2dcurves.com/spiral/spirala.html
2. Brown, Donald E. (1998). "The Regional Crime Analysis Program (RECAP): A Framework for Mining Data to Catch Criminals." In Proceedings of the 1998 International Conference on Systems, Man, and Cybernetics (San Diego, CA, USA, Oct. 11–14). IEEE, Piscataway, N.J., 2848–2853. University of Virginia, June 1998.
3. Carlis, J. (1998). "Interactive Visualization of Serial Periodic Data," Proceedings of User Interface Software and Technology.
4. Chen, H., D. Zeng, H. Atabakhsh, W. Wyzga & J. Schroeder (2003). "COPLINK: Managing Law Enforcement Data and Knowledge," Communications of the ACM, pp 28–34.
5. Environmental Systems Research Institute (ESRI), http://www.esri.com
6. Fredrikson, A., C. North, C. Plaisant & B. Shneiderman (1999). "Temporal, Geographical and Categorical Aggregations Viewed Through Coordinated Displays: a Case Study with Highway Incident Data," Human-Computer Interaction Laboratory Technical Report No. 99-31, December 1999, NPIVM, pp 26–34.
7. Harris, R. (1996). "Information Graphics – A Comprehensive Illustrated Reference," Management Graphics.
8. Hibino, S. & E. A. Rundensteiner (1998). "Comparing MMVIS to a Timeline for Temporal Trend Analysis of Video Data," Proceedings of Advanced Visual Interfaces.
9. Holly, M. (2001). "Temporal and Spatial Program Hot Spot Visualization," Technical Report SOCS-01.6.
10. Kullberg, R. L. (1996). "Dynamic Timelines: Visualizing Historical Information in Three Dimensions," Proceedings of CHI '96, pp 386–387.
11. Kumar, V. & R. Furuta (1998). "Metadata Visualization for Digital Libraries: Interactive Timeline Editing and Review," Proceedings of the Third ACM Conference on Digital Libraries, pp 126–133.
12. Levine, Ned (2000). "CrimeStat: A Spatial Statistics Program for the Analysis of Crime Incident Locations (v 1.1)," http://www.icpsr.umich.edu/NACJD/crimestat.html
13. MapInfo, http://www.mapinfo.com
14. Plaisant, C., B. Milash, A. Rose, S. Widoff & B. Shneiderman (1996). "Lifelines: Visualizing Personal Histories," ACM CHI '96 Conference Proceedings, pp 221–227.
15. Richter, H., J. Brotherton, G. D. Abowd & K. Truong (1999). "A Multi-Scale Timeline Slider for Stream Visualization and Control," GVU Technical Report GIT-GVU-99-30.
16. Tufte, E. (1983). "The Visual Display of Quantitative Information," Graphics Press.
17. Weber, M., M. Alexa & W. Müller (2000). "Visualizing Time-Series on Spirals," Technical University of Darmstadt.
Tracking Hidden Groups Using Communications

Sudarshan S. Chawathe
Computer Science Department
University of Maryland
College Park, Maryland 20742, USA
[email protected]

Abstract. We address the problem of tracking a group of agents based on their communications over a network when the network devices used for communication (e.g., phones for telephony, IP addresses for the Internet) change continually. We present a system design and describe our work on its key modules. Our methods are based on detecting frequent patterns in graphs and on visual exploration of large amounts of raw and processed data using a zooming interface.
1 Introduction
Suppose a group of suspicious agents (henceforth, suspects) has been identified based on some a priori knowledge. Instead of taking immediate action to stop the suspicious activities, it is often prudent to carefully monitor the suspects and their communications in order to maximize the detection of suspects (expand the group) and uncover the nexus of activity (locate the key or controlling agents). Unfortunately, the suspects typically do not communicate using easily identifiable sources. For example, a ring of car thieves may continually change phone numbers (using prepaid cellular phones, short-term pager numbers, etc.). Similarly, globally dispersed agents planning a distributed denial-of-service attack on the cyber-infrastructure typically do not use the same IP address for very long. Such behavior makes it very difficult to accurately and efficiently track groups of suspects over extended periods of time. In this paper, we describe a strategy to solve this problem using a combination of automated and human-directed techniques. We begin by describing the problem more precisely.

Problem Development. We will use the term agents to denote real-world entities (typically, humans) that we are interested in monitoring. However, these agents are not directly observable and their real-world identities are, in general, unknown. That is, we do not have any method to directly track the actions of the agents. Instead, all we can observe is the communications between such agents. The medium used for such communication may be a phone network, the Internet, physical mail, etc. We refer to it as the network in general. We will use the term nodes to denote the devices used to communicate using this network (e.g., phone numbers in a telephone network, IP addresses on the Internet). A key feature of nodes is that they are, by virtue of their connections to the network, easily identifiable and observable.
Fig. 1. The tracking problem
Agents use nodes to communicate on the network. (For example, people use phone numbers to communicate using the phone network, and IP addresses to communicate using the Internet.) A group of communicating suspects is called an s-group. Note that since suspects are, in general, not directly observable, neither are s-groups. At a given point in time, there is a group of nodes (in the communication network) corresponding to the agents in an s-group; we refer to this group of nodes as an n-group. In contrast with s-groups, n-groups are easily observable. For example, the group of phone numbers used by a ring of car thieves in the past few days forms an n-group. Over time, the n-group corresponding to a given s-group changes. For example, the ring of thieves is likely to be using a completely different set of phone numbers two months from now. The problem at hand is then the problem of tracking s-groups by observing only the n-groups. By observing an n-group, we mean tracking the communications between the nodes in the group. In this paper, we assume that the only information we can obtain from the communication network is a timestamped list of inter-node messages. We use the term messages in a general sense. In a phone network, a message is a phone call; on the Internet, a message may be a TCP connection. More precisely, monitoring the network yields a list of tuples of the form (n1, n2, t, A) indicating a message from n1 to n2 at time t. We use A to denote a list of additional attributes, which depend on the particulars of the communication network and the monitoring methods. In a phone network, A includes attributes such as the length of the call. On the Internet, A includes the source and destination ports associated with a TCP connection and other connection parameters. It is convenient to regard this stream of tuples as the edges of a connection multigraph whose nodes represent communication network nodes (e.g., phone numbers) and whose edges represent messages annotated with additional attributes (e.g., phone calls with durations). In most networks, such a list is never-ending and is therefore better modeled as a stream of tuples. Another characteristic of the data from network monitoring is that it is typically produced at a very high rate.
For example, call records on a phone network and TCP connection build-ups and break-downs occur at very high rates. It is important to analyze such stream data using online methods that detect important patterns as early as possible. (For example, detecting that a ring of thieves is about to move to another state or country may prompt immediate action if it is detected in a timely manner.) Further, indiscriminately storing such stream data can exhaust even the large amounts of inexpensive storage currently available. Storing the data indiscriminately also makes it more difficult to operate on the data, as less interesting data is likely to slow access to interesting data. On the other hand, many of the kinds of operations required by this application are not likely to yield to purely online methods. For example, many data mining algorithms require random access to data on disk and cannot be easily modified for the restrictions of stream data. Thus, a practical solution is likely to require both online and offline analysis methods that operate cooperatively. So far, we have not indicated how the results of the automated or semi-automated methods suggested above are presented to the analyst responsible for decisions, nor have we indicated how such analysts may use their knowledge to direct and guide the tracking process. A simple solution here is to process data in batches, and provide input in batches. For example, a detective may analyze the output of the tracking method from yesterday and adjust the input parameters for guiding the method when it is run on today's data. This solution has problems analogous to those encountered by batch-based solutions to the tracking problem. Again, it is desirable to provide methods that permit online viewing of the results of tracking and immediate fine-tuning of the tracking process. Assuming we have at hand streaming methods for tracking s-groups, we need methods for visualizing, searching, and manipulating the streaming and dynamic data generated by these methods.

Fig. 2. System architecture

System Architecture. Figure 2 depicts the high-level architecture of our system for tracking s-groups. The monitoring devices on the network (e.g., instrumented routers on the Internet) produce a stream of tuples, each of which describes a
message between nodes. This stream of tuples is sent to both the online analysis module and the storage module. The storage module is responsible for recording the stream and merging it with the archived data at suitable intervals (say, every 24 hours). The online analysis module uses the stream to trigger detection features based on the archived data and input from the analyst. The offline analysis module is where methods that are not suited to stream processing are implemented. These methods can be classified as data mining or pattern detection methods that require random access to data. The exploration module includes a graphical user interface and, more important, implementations of methods for quickly assimilating vast amounts of data at varying levels of detail. The data includes the stream data processed to varying degrees, the results of the online and offline analysis modules, and an integration with external data sources that are relevant to an analyst's decision-making process (e.g., newswire articles, police reports, memos). In Section 2, we describe methods for detecting frequent patterns in the connection graph. These methods form the building blocks of the offline analysis module. Section 3 describes methods for exploring large volumes of graph data using a zooming interface; these form the basis of the exploration module. We discuss related work briefly in Section 4 and conclude in Section 5. Due to space constraints, we do not discuss the online analysis module here, and refer the interested reader to [5] for details.
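To fix the notation before describing pattern detection, here is a minimal sketch of the (n1, n2, t, A) tuple stream and the connection multigraph built from it. The class and method names are illustrative only; the paper does not prescribe an implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Message:
    """One monitored tuple (n1, n2, t, A): a message from node n1 to node n2
    at time t, with network-specific attributes A (e.g., call duration or
    TCP ports)."""
    n1: str
    n2: str
    t: float
    attrs: Dict[str, object] = field(default_factory=dict)

class ConnectionMultigraph:
    """Nodes are observable network identifiers (phone numbers, IP
    addresses); every streamed message becomes one annotated edge."""
    def __init__(self) -> None:
        self.edges: Dict[Tuple[str, str], List[Message]] = defaultdict(list)

    def observe(self, msg: Message) -> None:
        # Online side: append the tuple as it streams in; a storage module
        # could merge these into the archive at suitable intervals.
        self.edges[(msg.n1, msg.n2)].append(msg)

    def n_group_activity(self, nodes: Set[str], since: float) -> List[Message]:
        """All recent messages whose endpoints both lie in a candidate
        n-group: the observable trace of a hidden s-group."""
        return [m for msgs in self.edges.values() for m in msgs
                if m.t >= since and m.n1 in nodes and m.n2 in nodes]
```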
2 Detecting Frequent Patterns
In this section, we describe our method for detecting hidden groups by analyzing large volumes of historical connection data obtained by network monitoring. This method is part of the offline analysis module of Figure 2. Recall that in this module, we are given a database consisting of a communication graph that forms a historical record of messages between nodes, and we wish to detect potential s-groups for further investigation (and to serve as inputs for the online analysis module). The goal is to help an analyst detect s-groups by highlighting patterns in the data. The kinds of patterns of interest to analysts are likely to be varied and complex, and we do not attempt to completely automate the task of detecting them. Instead, our approach is to provide efficient implementations of a few key operations that the analyst may use to investigate the data based on real-world knowledge. In particular, we focus on the efficient implementation of an operation that is not only useful on its own but also forms the building block for more sophisticated analysis methods (both automated and human-directed). This operation is the detection and enumeration of frequently occurring patterns, which are, informally, patterns of communicating nodes that occur frequently enough to be of potential interest for a detailed data analysis. (Such frequently occurring patterns are to our problem what frequent itemsets are to the problem of mining market basket data [1].) The main idea behind our method, which is called SEuS (Structure Extraction using Summaries), is the following three-phase process: In the first phase
(summarization), we preprocess the given dataset to produce a concise summary. This summary is an abstraction of the underlying graph data. Our summary is similar to data guides and other (approximate) typing mechanisms for semistructured data [12,15,4]. In the second phase (candidate generation), our method interacts with a human analyst to iteratively search for frequent structures and refine the support threshold parameter. Since the search uses only the summary, which typically fits in main memory, it can be performed very rapidly (interactive response times) without any additional disk accesses. Although the results in this phase are approximate (a superset of final results), they are accurate enough to permit uninteresting structures to be conservatively filtered out. When the analyst has filtered potential structures using the approximate results of the search phase, an accurate count of the number of occurrences of each potential structure is produced by the third phase (counting).
Fig. 3. Example input graph
Users are often willing to sacrifice quality for a faster response. For example, during the preliminary exploration of a dataset, one might prefer to get a quick and approximate insight into the data and base further exploration decisions on this insight. In order to address this need, we introduce an approximate version of our method, called L-SEuS, which returns only the top-n frequent structures rather than all frequent structures. We present only a brief discussion of SEuS below, and refer the reader to [11] for a detailed discussion of both SEuS and L-SEuS.

Summarization. We use a data summary to estimate the support of a structure (i.e., the number of subgraphs in the database that are isomorphic to the structure). The summary is a graph with the following characteristics. For each
Fig. 4. A structure and its three instances
distinct vertex label l in the original graph G, the summary graph X has an l-labeled vertex. For each m-labeled edge (v1, v2) in the original graph there is an m-labeled edge (l1, l2) in X, where l1 and l2 are the labels of v1 and v2, respectively. The summary X also associates a counter with each vertex (and edge) indicating the number of vertices (respectively, edges) in the original graph that it represents. For example, Figure 5 depicts the summary generated for the input graph of Figure 3.
Fig. 5. Summary graph
We use the summary X to estimate the support of a structure S as follows: By construction, there is at most one subgraph of X (say, S′) that is isomorphic to S. If no such subgraph exists, then the estimated (and actual) support of S is 0. Otherwise, let C be the set of counters on S′ (i.e., C consists of counters
on the nodes and edges of S′). The support of S is estimated by the minimum value in C. Given our construction of the summary, this estimate is an upper bound on the true support of S.

Candidate Generation. The candidate generation phase is a simple search in the space of structures isomorphic to at least one subgraph of the database. We maintain two lists of structures: open and candidate. In the open list we store structures that have not been processed yet (and that will be checked later). The algorithm begins by adding to the open list all structures that consist of only one vertex and pass the support threshold test. The rest of the algorithm is a loop that repeats until there are no more structures to consider (i.e., the open list is empty). In each iteration, we select a structure S from the open list and use it to generate larger structures (called S's children) by calling the expand subroutine, described below. New child structures that have an estimated support greater than the threshold are added to the open list. The qualifying structures are accumulated in the candidate list, which is returned as the output when the algorithm terminates. Given a structure S, the expand subroutine produces the set of structures generated by adding a single edge to S (termed the children of S). In the following description of the expand(S) subroutine, we use S(v) to denote the set of vertices in S that have the same label as vertex v in the data graph, and V(s) to denote the set of data vertices that have the same label as a vertex s in S. For each vertex s in S, we create the set addable(S, s) of edges leaving some vertex in V(s). This set is easily determined from the data summary: it is the set of out-edges for the summary vertex representing s. Each edge e = (s, v, l) in addable(S, s) that is not already in S is a candidate for expanding S. If S(v) (the set of vertices with the same label as e's destination vertex) is empty, we add a new vertex x with the same label as v and a new edge (s, x, l) to S. Otherwise, for each x ∈ S(v), if (s, x, l) is not in S, a new structure is created from S and e by adding the edge (s, x, l) (an edge between vertices already in S). If s does not have an l-labeled edge to any of the vertices in S(v), we also add a new structure obtained from S by adding a vertex x′ with the same label as v and an edge (s, x′, l).

Support Counting. Once the analyst is satisfied with the structures discovered in the candidate generation phase, she may be interested in finalizing the frequent structure list and getting the exact support of the structures. This task is performed in the support counting phase. Let us define the size of a structure to be the number of nodes and edges it contains; we refer to a structure of size k as a k-structure. From the method used for generating candidates above, it follows that for every k-structure S in the candidate list there exists a structure Sp of size k−1 or k−2 in the candidate list such that Sp is a subgraph of S. We refer to Sp as the parent of S in this context. Clearly, every instance I of S has a subgraph I′ that is an instance of Sp. Further, I′ differs from I only in having one fewer edge and, optionally, one fewer vertex. We use these properties in the support counting process.
Support Counting. Once the analyst is satisfied with the structures discovered in the candidate generation phase, she may be interested in finalizing the frequent structure list and obtaining the exact support of the structures. This task is performed in the support counting phase. Let us define the size of a structure to be the number of nodes and edges it contains; we refer to a structure of size k as a k-structure. From the method used for generating candidates (Section 2), it follows that for every k-structure S in the candidate list there exists a structure Sp of size k−1 or k−2 in the candidate list such that Sp is a subgraph of S. We refer to Sp as the parent of S in this context. Clearly, every instance I of S has a subgraph I′ that is an instance of Sp. Further, I′ differs from I only in having one fewer edge and, optionally, one fewer vertex. We use these properties in the support counting process.

Determining the support of a 1-structure (single vertex) consists simply of counting the number of instances of a like-labeled vertex in the database. During the counting phase, we store not only the support of each structure (as it is determined), but also a set of pointers to that structure's instances on disk. To determine the support of a k-structure S for k > 1, we revisit the instances of its parent Sp using the saved pointers. For each such instance I′, we check whether there is a neighboring edge and, optionally, a node that, when added to I′, generates an instance I of S. If so, I is recorded as an instance of S.
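The counting step can likewise be sketched in outline, again with hypothetical helpers (extensions_of, is_instance_of) standing in for the disk-based instance machinery just described:

```python
# Illustrative sketch of exact support counting for a k-structure (k > 1).
# `parent_instances` are the saved pointers to instances of the parent Sp.
def count_support(structure, parent_instances, data_graph,
                  extensions_of, is_instance_of):
    instances = []
    for parent_inst in parent_instances:
        # Try adding one neighboring edge (and, optionally, one vertex) to the
        # parent instance; keep the result if it is an instance of `structure`.
        for candidate in extensions_of(parent_inst, data_graph):
            if is_instance_of(candidate, structure):
                instances.append(candidate)   # saved for counting its children later
    return len(instances), instances
```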
Fig. 6. A screenshot of the SEuS system
3 Visual Exploration
In this section, we describe methods for implementing the interface module of Figure 2. Recall that the task of this module is to help the analyst assimilate the output of the automated analysis modules (offline and online) as well as the external data feed (newswire articles, intelligence reports, etc.). The interconnections between data items from different sources are of particular interest. In this module, we model data as a multiscale graph in which nodes represent data items and edges represent the relationships among them. At a high level, this graph aggregates many data items into one node; at the lowest level, each node
Fig. 7. Two kinds of logical zooming: zooming in on details and zooming in on numbers (the panels show forward/cross links and back links among entity identifiers)
represents a single data item or concept (e.g., a phone number). This representation allows the analyst to work at the level of abstraction best suited to the task at hand. We have implemented methods for exploring such graphical data at varying levels of detail as part of our VQBD system [6], and we describe the key ideas below. Although VQBD is extensible and incorporates many features for the power user, it is designed to be accessible to a casual user. To this end, the basic modes of interacting with the system are very simple. At all times, the VQBD display consists of a single window with a graphical representation of the XML data. Although, as we shall see below, this representation may be the result of some complex operations, the user interface is always the same: There are nodes (boxes) representing data elements (often summarized) and arcs (lines) representing relationships among them. There are no tool-bars, scroll-bars, sliders, or other widgets. We believe this simplicity is key to usability by a casual user. The basic modes of controlling VQBD, described below, are also simple and unchanging. The first three are meant for the casual user, while the next two are for users who have gained more experience with the system.

Panning. The displayed objects can be moved in any direction relative to the canvas by a dragging motion with the left button of the mouse.

Zooming. The display may be zoomed in (or out) by a right- (respectively, left-) dragging motion with the right mouse button. VQBD uses the position of the
pointer to determine the type of zooming. If the pointer is outside all graphical objects, then the result is simple graphical zooming (e.g., larger objects, bigger fonts). If the pointer is inside a graphical object, then the data resolution of that object, and any others of a similar type, is increased. For example, consider the screenshot in Figure 8(b). The lower part represents speech and line objects and includes sample values from the input document. Zooming in with the pointer inside the larger box (representing the collection of line objects) results in the display of a larger number of sample speech objects. Zooming in with the pointer inside one of the smaller boxes representing an individual line object displays that object in more detail (more text). Figure 7 illustrates these two modes of zooming. In the case of other visualization modules (e.g., histograms), zooming results in actions appropriate to that module (e.g., histogram refinement).

Link Navigation. Clicking on a link causes the display to recenter itself around the target of the link at an appropriate zoom level. Following the design method of the Jazz toolkit, such link navigation is not instantaneous; instead, it occurs at a speed that allows the viewer to discern the relative positions of the referencing and referenced objects. In addition to selecting an appropriate graphical zoom level, VQBD automatically picks a suitable logical zoom level. For example, a collection of numbers that is too large to display in its entirety is often presented as a histogram.

View Change. While VQBD automatically selects an appropriate method for visualizing data at the available resolution, the user may override this selection using a pop-up menu bound to the middle mouse button. For example, a user interested in the highest values in a collection of numbers may force VQBD to change the view from a histogram to a sorted list.

Querying. The XML document may be queried using a query-by-example interface. This interface permits users to specify selection conditions as annotations on displayed objects. In addition, the user may mark objects as distinguished objects for use in queries. Intuitively, these objects can be used as the starting points for query-based exploration. More precisely, these objects are logically inserted into a table that can be used in the from clause of OQL-like queries. VQBD has built-in query modules for regular expressions and XPath. Additional query modules can be easily added using the plug-in interface.

Since we do not have access to realistic monitoring data, we illustrate the key features of VQBD using a sample user session based on Jon Bosak's XML rendition of Shakespeare's A Midsummer Night's Dream, available at http://www.ibiblio.org/xml/examples/shakespeare/. The system parses the data and graphically presents a summary of its implicit structure with objects representing the play, acts, scenes, and lines. This structural summary is the default view presented by VQBD. A screenshot appears as Figure 8(a). Note that the screenshots in Figure 8 are based on a rather small VQBD display (approximately 350×350 pixels). While we picked this size primarily to fit the space
Fig. 8. Two screenshots of VQBD in action: (a) zoomed out—structural summary; (b) zoomed in—instances
constraints of this report, it also illustrates how VQBD's zooming interface allows it to function effectively at this size. In this example, the summary is small enough to be displayed in its entirety. However, when the summary is larger (or the screen smaller), the panning and graphical zooming features of VQBD are used to view the summary. Now suppose the analyst zooms in on the speech object using a dragging motion with the right mouse button. Initially, the zooming results in standard graphical effects (larger objects, higher-resolution text, etc.). However, as soon as the object becomes large enough to display graphical elements within it, the graphical zooming is accompanied by a logical zooming: a few sample elements are displayed. VQBD displays randomly sampled elements, with the number of displayed elements increasing as the available space increases as a result of zooming in. Figure 8(b) is a screenshot at this stage of exploration. In addition to details of the speech and line elements, details of scene elements (appearing above the speech elements in Figure 8(b)) are partially visible, providing useful context. These figures do not convey the colors used by VQBD for indicating many relationships, including grouping elements based on parents (enclosing elements). When a sample element is displayed in this manner, VQBD reads its attributes and sub-elements to pick a short string that distinguishes the element from others with the same tag. This string is displayed within the object representing the element on screen. In our example,
VQBD uses the scene titles to identify scene elements on screen. At this stage, the analyst also has the option of single-clicking on any of the displayed objects, causing VQBD to display all details of the selected object. For example, clicking on the scene object labeled "A hall in the castle" results in displaying the scene in greater detail (as much as will fit in the VQBD window). Note that this clicking action is simply an accelerated form of zooming; the same result could be achieved by zooming in to the scene object. Subelements of the scene element are displayed as active links that can be activated in order to smoothly transport the display to the referenced object. This link-based navigation can be freely interleaved with zooming. Zooming out at this point results in VQBD retracing its steps, displaying data in progressively less detail until we are back at the original structural summary view. In addition to browsing data in this manner, an analyst may also query data using the VQBD interface. For example, if a scene object is selected as the origin of a search for the string "Lysander", VQBD executes the query and highlights objects in the query result. In our sample data, the query string matches elements of different types (two persona elements, one stagedir element, and several speaker and line elements). If the current resolution is insufficient to display individual objects, only the structural summary objects corresponding to the individual objects are highlighted. To view the query results in detail, one may zoom in as before. Unlike the earlier zooming action, which displayed a random sample of all elements corresponding to the summary object, VQBD now displays a sample chosen only from the elements in the query result. When all elements in the query result have been displayed, further zooming results in a random selection from the remaining elements (as before). (Colors are used to distinguish the elements in the query result from the rest of the elements.) This exploration of query results may be interleaved with zooming, panning, query refinement, and other VQBD operations.
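VQBD's query modules are not described here in further detail, but the kind of string search just illustrated is easy to reproduce over the same data with standard XML tooling. A minimal sketch, assuming Python's lxml library and a local copy of Bosak's file (the filename is our assumption):

```python
from lxml import etree

# Parse Bosak's XML rendition of the play (local filename assumed).
tree = etree.parse("a_midsummer_nights_dream.xml")

# Find every element whose own text mentions "Lysander", loosely mirroring the
# search described above (VQBD instead highlights the matching summary objects).
hits = tree.xpath('//*[contains(text(), "Lysander")]')
for element in hits[:10]:
    print(element.tag, (element.text or "").strip())
```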
4 Related Work
There is a long history of work on network and graph analysis. However, many of the methods do not scale to the amount of data generated by the network monitoring scenarios that interest us. For high-volume data, work on Communities of Interest [10,9] is perhaps the closest to ours. A method for managing high-volume call-graph data from a phone network, based on daily merging of records, is described in [10]. There is also work on structure discovery in specific domains; a detailed comparison of several such methods appears in [7]. We are more interested in domain-independent methods such as CLIP and Subdue [16,8]. The method of Section 2 differs from these in its use of a summary structure to yield an interactive system with high throughput. A detailed discussion and performance study appears in [11]. AGM [13] finds frequent structures using an approach similar to the apriori algorithm for market basket data [2]. FSG [14] is similar to AGM but uses a sparse graph representation that minimizes storage
and computation costs. The FREQT algorithm is based on the idea of discovering tree structures by attaching nodes only to the rightmost branches of trees [3]. The general idea of using a succinct summary of a graph for various purposes has a large body of work associated with it. For example, this idea is developed in semistructured databases as graph schemas, representative objects, and data guides, which are used for constraint enforcement, query optimization, and query-by-example interfaces [4,15,12].
5 Conclusion
We described and formalized the problem of tracking hidden groups of entities using only their communications, without a priori knowledge of the communication device identifiers (e.g., phone numbers) used by the entities. We discussed the practical constraints on the environment in which this problem must be solved and presented a system architecture that combines offline analysis, online analysis, and interactive exploration of both raw and processed data. We described our work on methods that form the basis of some of the system modules. We have conducted detailed evaluations of these methods by themselves and are now working on assembling and evaluating the system as a whole.

Acknowledgments. Shayan Ghazizadeh helped design and implement the SEuS system. Jihwang Yeo and Thomas Baby implemented parts of the VQBD system. This work was supported by National Science Foundation grants in the CAREER (IIS-9984296) and ITR (IIS-0081860) programs.
References

1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Record, 22(2):207–216, June 1993.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases, pages 487–499. Morgan Kaufmann, 1994.
3. Tatsuya Asai, Kenji Abe, Shinji Kawasoe, et al. Efficient substructure discovery from large semi-structured data. In Proc. of the Second SIAM International Conference on Data Mining, 2002.
4. P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory, 1997.
5. Sudarshan S. Chawathe. Tracking moving clutches in streaming graphs. Technical Report CS-TR-4376 (UMIACS-TR-2002-56), Computer Science Department, University of Maryland, College Park, Maryland 20742, May 2002.
6. Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo. VQBD: Exploring semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001. Demonstration description.
7. D. Conklin. Structured concept discovery: Theory and methods. Technical Report 94-366, Queen's University, 1994.
8. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems and Their Applications, 15, 2000.
9. Corinna Cortes and Daryl Pregibon. Signature-based methods for data streams. Data Mining and Knowledge Discovery, 5:167–182, 2001.
10. Corinna Cortes, Daryl Pregibon, and Chris Volinsky. Communities of interest. In Fourth International Symposium on Intelligent Data Analysis (IDA 2001), Lisbon, Portugal, 2001.
11. Shayan Ghazizadeh and Sudarshan S. Chawathe. SEuS: Structure extraction using summaries. In Steffen Lange, Ken Satoh, and Carl H. Smith, editors, Proceedings of the 5th International Conference on Discovery Science, volume 2534 of Lecture Notes in Computer Science (LNCS), pages 71–85, Lübeck, Germany, November 2002. Springer-Verlag.
12. R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the Twenty-third International Conference on Very Large Data Bases, Athens, Greece, 1997.
13. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 13–23, 2000.
14. M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of the 1st IEEE Conference on Data Mining, 2001.
15. S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative objects: Concise representations of semistructured, hierarchical data. In Proceedings of the International Conference on Data Engineering, pages 79–90, 1997.
16. K. Yoshida, H. Motoda, and N. Indurkhya. Unifying learning methods by colored digraphs. In Proc. of the International Workshop on Algorithmic Learning Theory, volume 744, pages 342–355, 1993.
Examining Technology Acceptance by Individual Law Enforcement Officers: An Exploratory Study

Paul Jen-Hwa Hu¹, Chienting Lin², and Hsinchun Chen²

¹ Accounting and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, Utah 84112
[email protected]
² Management Information Systems, Eller College of Management, University of Arizona, Tucson, Arizona 85721
{linc,hchen}@eller.arizona.edu
Abstract. Management of technology implementation has been a critical challenge to organizations, public or private. In particular, user acceptance is paramount to the ultimate success of a newly implemented technology in adopting organizations. This study examined acceptance of COPLINK, a suite of IT applications designed to support law enforcement officers’ analyses of criminal activities. We developed a factor model that explains or predicts individual officers’ acceptance decision-making and empirically tested this model using a survey study that involved more than 280 police officers. Overall, our model shows a reasonably good fit to officers’ acceptance assessments and exhibits satisfactory explanatory power. Our analysis suggests a prominent core influence path from efficiency gain to perceived usefulness and then to intention to accept. Subjective norm also appears to have a significant effect on user acceptance through the mediation of perceived usefulness. Several managerial implications derived from our study findings are also discussed.
1 Introduction

Technology implementation management [9] has been a critical challenge to organizations, public and private. In this regard, technology acceptance by individual users in an adopting organization is indispensable to the success of a newly implemented technology [16]. During the past decade, user technology acceptance has received, and is likely to continue receiving, considerable research attention; e.g., [11], [15], [17], [18], and [25]-[27]. Central to our continuing quest for successful information technology (IT) implementation is increased understanding of the key determinants of user acceptance, together with their causal relationships. Equipped with such insights, adopting organizations are more likely to create favorable conditions for IT adoptions and to design and implement effective management interventions for fostering technology acceptance among target users. Although most prior research has been concentrated on user acceptance in business settings, the deployment and use of IT also have been vigorously pursued in nonbusiness sectors that include government agencies. Of particular importance is user technology acceptance in various professional contexts where individuals perform
highly specialized tasks and often have considerable autonomy. As Chau and Hu commented [5], the fast-growing investment and deployment of innovative technologies that support individual professionals demand additional investigations of their technology acceptance decision-making. Law enforcement is a fundamental and critical aspect of government services, as measured by its profound impacts on homeland security. By and large, law enforcement agencies are in the intelligence business, and their crime fighting/prevention capability depends on individual officers' timely access to relevant and accurate information presented in an effective and easily assimilated manner. When investigating a criminal case or monitoring an organized gang ring, a police detective usually has to access, scrutinize, and integrate relevant information from various sources, internal and external. Because of its stringent information/knowledge support requirements, law enforcement indeed represents a service sector in which applications of information systems (IS) research and practice are inherently appealing and increasingly important. Our observations also suggest that individual officers usually have considerable autonomy in their case analysis and investigative tasks, thus manifesting or resembling a professional work arrangement. Together, the specialized and critical services in law enforcement settings, extensive information/knowledge management support requirements, and individual autonomy demand further examinations of user technology acceptance in law enforcement settings. Investigations of technology acceptance by individual law enforcement officers, nonetheless, have received limited attention from IS researchers. In response, this study aims at examining user acceptance of COPLINK [6]-[7], a suite of applications designed to provide enhanced information sharing and knowledge management support to officers within and across law enforcement agencies. Specifically, we developed a factor model that explains or predicts individual officers' acceptance decision-making and then empirically tested the model using a survey study that involved more than 280 police officers. The current research purports to identify key technology acceptance drivers in law enforcement settings and to investigate how these drivers and their effects might differ from those commonly observed in business contexts. The following section reviews relevant prior research and highlights our motivation.
2 Literature Review and Motivation

In this study, technology acceptance broadly refers to an individual's psychological state with regard to his or her voluntary and intentional use of a technology [13]. User technology acceptance has been examined extensively in IS research. A review of relevant previous studies suggests the dominance of a cognitive/behavioral anchor in conceptualizing and analyzing individual technology acceptance. According to this approach, an individual is conscious of his or her acceptance of a technology, which can be sufficiently explained or mediated by the underlying behavioral intention. Substantial empirical support for the explanatory/mediating power of behavioral intention for actual technology use has also been established. As Mathieson [17] concluded, "given the strong causal link between intention and actual behavior, the fact that behavior was not directly assessed is not a serious limitation." Several theories anchored in behavioral intention have prevailed, including the Theory of Reasoned Action
[12], the Theory of Planned Behavior [1]-[2], the Diffusion of Innovations Theory [20], and the Technology Acceptance Model [11]. Rooted in social psychology, the Theory of Reasoned Action (TRA) suggests that an individual's acceptance of a technology can be explained by his or her intention, which is jointly determined by attitudinal beliefs and (perceived) subjective norm. The Theory of Planned Behavior (TPB) extends TRA by incorporating an additional construct (i.e., perceived behavioral control) to account for situations where an individual lacks the capability or resources necessary for performing the behavior under discussion. The Diffusion of Innovations (DOI) theory also has premises established in social psychology, positing that the diffusion of an innovation in a social system is jointly affected by the communication of key innovation attributes that include relative advantage, complexity, compatibility, demonstrability, and trialability. Overall, these theories are generic and have been applied to explain a wide array of individual behaviors, including technology acceptance. Previous individual technology acceptance studies that used TRA, TPB, or DOI as a theoretical foundation have garnered considerable empirical support for the respective theories. The Technology Acceptance Model (TAM) adapts from TRA and is developed specifically for explaining individual technology acceptance across different technologies, user groups, and contexts. Judged by its frequent use in prior studies, TAM has emerged as a predominant model for individual technology acceptance. According to TAM, an individual's decision on whether or not to accept a technology can be sufficiently explained by behavioral intention which, in turn, is determined by his or her perception of the technology's usefulness and ease of use. This model, however, has been criticized for its parsimonious structure, which subsequently limits its use for designing effective organizational interventions that foster technology acceptance. As Mathieson commented [17], "TAM is predictive, but its generality does not offer sufficient understanding to provide system designers with information needed for creating and promoting user acceptance of new systems." Nevertheless, TAM offers a valid and generic framework upon which extended or detailed models can be developed for specific user acceptance scenarios. Collectively, findings from previous research suggest that analysis of user technology acceptance in an organizational setting should consider key characteristics pertaining to multiple fundamental contexts. For instance, Tornatzky and Klein [24] suggested that an individual's acceptance decision in an organizational setting is jointly affected by factors pertaining to the technological context, the organizational context, and the external environment. Similarly, Chau and Hu [5] examined individual technology acceptance in a professional setting and singled out the importance of the technological, individual, and (organizational) implementation contexts. Igbaria et al. [15] highlighted the importance of the management context. Goodhue and Thompson [14] discussed the importance of the technology and task contexts, advocating a contingency fit between them. A review of the literature suggests that conceptualization of user technology acceptance needs to include multiple fundamental contexts, and that model development should proceed from identifying important characteristics of these contexts, based on the user acceptance phenomenon examined.
In addition, our literature review also points to the development and empirical evaluation of specific models that extend from generic theories or models; e.g., [4], [23], [26], and [27]. According to this approach, a generic theory or model is used as a grounded framework upon which a detailed model is developed for a targeted user acceptance scenario; e.g., via inclusion of additional constructs or antecedents of key
acceptance drivers. The current research used both TAM and TPB as a theoretical framework for anchoring our analysis of key determinants of individual officers’ acceptance of COPLINK. Our model contained major TAM constructs (e.g. perceived usefulness and ease of use), as well as their key antecedents and other constructs from TPB. During our model development, we also took into consideration important characteristics pertinent to our targeted technology, user group, and organizational (implementation) context.
3 Overview of COPLINK Technology The COPLINK project was initiated and undertaken by the Artificial Intelligence Lab at the University of Arizona, in collaboration with the Tucson Police Department (TPD). An important project objective was to design, develop, and deploy innovative technology solutions to support and enhance information sharing and collaborative investigation within and across regional law enforcement agencies. Funded by the National Institute of Justice (NIJ) and the Digital Government Initiative of the National Science Foundation (NSF), the project has delivered COPLINK [6]-[7] which currently consists of two distinct but complementary applications: COPLINK Connect and COPLINK Detect. COPLINK Connect allows detectives and field officers to access data in other jurisdictions or government agencies, beyond the constraints of system or platform heterogeneity. COPLINK Detect extends the capabilities of Connect by supporting individual officers’ analysis of sophisticated criminal links and networks, using integrated and shared data. At the time of the study, a large-scale deployment of COPLINK had just been completed at TPD and the implementation planning was underway in other jurisdictions in the states of Arizona and Texas. In parallel, technology development in COPLINK also continued, aiming at further enhanced information/knowledge management support and extended functionality through the use of agent and wireless technologies.
4 Research Model and Hypotheses As shown in Figure 1, our research model suggests that an individual officer’s decision to accept or not to accept a technology can be explained by important characteristics pertaining to the technological, individual, and organizational contexts. Specifically, perceived usefulness, perceived ease of use, and efficiency gain are fundamental determinants of the technological context. Consistent with the propositions of TAM, our model states that perceived usefulness and perceived ease of use jointly determine attitude, and that perceived ease of use has a direct positive effect on perceived usefulness. All other factors being equal, an officer is more likely to consider COPLINK to be useful when it is easy to use. Efficiency gain refers to the degree to which an officer perceives his or her task performance efficiency would be improved through the use of COPLINK. Agility is critical in law enforcement, where individual officers are in a constant competition against time. In most cases, officers must respond to crime fighting/prevention challenges in a timely manner. Results from our preliminary evaluation of COPLINK showed that individual officers had
Examining Technology Acceptance by Individual Law Enforcement Officers
213
Organizational Context
Subjective Norm
Technological Context
- 0.18*
0.25**
Availability - 0.01
Efficiency Gain
0.67***
Perceived Usefulness
Intention to Accept
0.96***
(R2 = 0.60)
(R2 = 0.58)
0.68***
0.15
0.11
Perceived Ease of Use
0.28***
Attitude
(R2 = 0.66)
Individual Context *: P-value < 0.05 **: P-value < 0.01 ***: P-value < 0.001
Fig. 1. Research model and model testing results
placed great importance on task performance efficiency resulting from their use of the technology. Accordingly, we tested the following hypotheses.

H1: The usefulness of COPLINK as perceived by an officer has a positive effect on his or her attitude towards the technology.
H2: The usefulness of COPLINK as perceived by an officer has a positive effect on his or her intention to accept the technology.
H3: The ease of use of COPLINK as perceived by an officer has a positive effect on his or her attitude towards the technology.
H4: The ease of use of COPLINK as perceived by an officer has a positive effect on his or her perception of the technology's usefulness.
H5: An officer's perceived efficiency gain through the use of COPLINK has a positive effect on his or her perception of the technology's usefulness.
Within a law enforcement setting, attitude is critical to the individual context and refers to an individual officer's positive or negative attitudinal beliefs about the use of COPLINK. Through previous technology demonstrations and recently completed user training, officers at TPD were expected or likely to have developed personal assessments of and attitudinal beliefs about COPLINK. According to TAM and TPB, an individual who has a positive attitude towards a technology is likely to exhibit a strong intention to accept the technology. Venkatesh and Davis [25] and others (e.g., [11]) have questioned the effectiveness of attitude in mediating the impact of perceived usefulness and perceived ease of use on behavioral intention, thus suggesting its removal
from TAM and its extensions. In this study, we retained attitude in our model as a key intention determinant, partially because of the described autonomy of individual law enforcement officers, including in their technology choice and use. Thus, we tested the following hypothesis.

H6: An officer is likely to have a strong intention to accept COPLINK when he or she has a positive attitude towards the technology.
Subjective norm and availability are key characteristics of the organizational (implementation) context. Consistent with TPB, subjective norm refers to an officer's assessment or perception of significant referents' desire or opinion on whether or not he or she should accept COPLINK [1]-[2]. In this study, the organizational context includes the communication of COPLINK assessments by administrators and individual officers in an adopting agency and therefore encompasses the management context discussed by Igbaria et al. [15]. Specifically, we posit that subjective norm has a direct positive effect on both perceived usefulness and behavioral intention. Within the social system common to law enforcement agencies, an officer's behavior might be somewhat affected by significant referents' opinions or suggestions. Consequently, an officer is likely to consider COPLINK to be useful, and thus to develop a strong intention for its acceptance, when his or her significant referents are in favor of the technology. By and large, officers appear to have a relatively strong psychological attachment to their agency and the social system within it; therefore, they are likely to develop and exhibit a close bond with colleagues and administrative commanders. Such psychological attachment and personal bonds might be partially attributed to several factors that include an agency's non-profit nature, less direct peer competition for resources or promotion (as compared with business organizations), personal commitment to public services, relatively long-term career pursuit, and the closed community common to most agencies. Therefore, we tested the following hypotheses.

H7: An officer is likely to perceive COPLINK to be useful when his or her significant referents are in favor of the technology.
H8: An officer is likely to have a strong intention to accept COPLINK when his or her significant referents are in favor of the technology.
Availability is also essential to the organizational context. In this study, availability refers to an officer's perception of the availability of the computing equipment necessary for using COPLINK. Availability is a fundamental aspect of perceived behavioral control (from TPB). As noted by Ajzen [1]-[2], perceived behavioral control embraces internal (e.g., self-efficacy [3], [8]) and external conditions (e.g., facilitating conditions [23]). In their comparative examination of competing models, Taylor and Todd [23] explicitly separated the internal and external aspects of control beliefs. Similarly, Venkatesh [27] also argued that the availability of resources and opportunities required to perform a target behavior is an important perspective of perceived ease of use. Availability of the computing equipment necessary for using COPLINK has been singled out as a potential concern to many officers, particularly those routinely working on criminal case analysis or away from the department offices. Results from multiple focus group discussions and interviews with individual officers consistently suggested the importance of making available the necessary computing equipment. All other factors being equal, the greater the availability as perceived by an
officer, the stronger his or her intention to accept COPLINK technology. Hence, we tested the following hypothesis.

H9: Availability of the computing equipment necessary for using COPLINK technology has a positive effect on an officer's intention to accept the technology.
5 Instrument Development and Validation

We empirically tested our model using a self-administered survey that involved more than 280 police officers who volunteered their technology acceptance assessments. This research method was chosen primarily because of its broad coverage (e.g., number of respondents) and support for different quantitative analyses. All participating officers were from the Tucson Police Department. Our investigation proceeded immediately after the department had completed technology implementation (including testing) and mandatory user training. Multiple methods were used in developing our survey instrument. Candidate question items were first identified from relevant previous empirical studies. In parallel, we also conducted focus group discussions, as well as unstructured and semistructured interviews with individual officers from the participating police department and other similar agencies. Preliminary measurements for each included construct were obtained by combining our interview/discussion findings and the candidate items extracted from previously validated inventories. Three police officers then assessed the face validity of the resultant question items. Based on their comments and suggestions, several minor wording changes were made to tailor the items to the law enforcement context. All questionnaire items used a seven-point Likert scale, with anchors from "strongly agree" to "strongly disagree." To ensure the desired balance and randomness of the questionnaire, half of the question items were worded with proper negation and all items were randomly sequenced. A pretest was then conducted to validate the instrument in terms of reliability and construct validity. Although the items were mostly drawn from previously validated measurements, we re-examined them to ensure the necessary validity in the law enforcement setting [21]. Our pretest included a total of 42 police officers who varied in rank and division. Using their responses, we examined the instrument's reliability by evaluating the Cronbach's alpha value for the respective constructs. As summarized in Table 1, all the constructs showed an alpha value greater than 0.70, a commonly suggested threshold for exploratory research [19]. In addition, we also used the pretest responses to assess the instrument's construct validity in terms of convergent and discriminant validity [21]. Specifically, we performed a principal component factor analysis, which yielded a total of seven components, matching exactly the number of constructs specified in our model. As shown in Table 2, items intended to measure a particular construct exhibited a distinctly higher factor loading on a single component than on the other components, suggesting that the measurements had adequate convergent and discriminant validity. The validated measurements were subsequently used in the survey study, from which the individuals who had participated in the instrument development or pretest study were excluded. The question items used in the study are listed in the Appendix.
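For reference, the Cronbach's alpha values reported in Table 1 can be computed directly from raw item responses. The sketch below is our own illustration (with made-up responses), not the authors' analysis script:

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: (respondents x items) matrix of Likert responses for one
    construct, with reverse-coded items already recoded."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    return (k / (k - 1)) * (1.0 - item_variances / total_variance)

# Example with hypothetical responses from five officers on a 4-item construct:
scores = np.array([[2, 3, 2, 3], [1, 2, 2, 2], [4, 4, 3, 4], [3, 3, 3, 3], [2, 2, 1, 2]])
print(round(cronbach_alpha(scores), 2))
```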
Table 1. Reliability analysis – Cronbach's alpha

Construct                        Item    Mean  STD   Cronbach's α
Perceived Usefulness (PU)        PU-1    2.33  1.41  0.91
                                 PU-2    3.12  1.55
                                 PU-3    2.83  1.30
                                 PU-4    2.71  1.35
Perceived Ease of Use (PEOU)     PEOU-1  3.10  1.39  0.84
                                 PEOU-2  3.21  1.18
                                 PEOU-3  2.81  1.40
                                 PEOU-4  3.05  1.31
Subjective Norm (SN)             SN-1    2.62  1.34  0.78
                                 SN-2    2.12  1.35
Attitude (ATT)                   ATT-1   4.05  1.34  0.89
                                 ATT-2   3.76  1.50
                                 ATT-3   3.88  1.29
Behavioral Intention (BI)        BI-1    2.55  1.21  0.73
                                 BI-2    2.67  1.18
                                 BI-3    3.40  1.43
Availability (AV)                AV-1    4.10  1.80  0.89
                                 AV-2    2.98  1.68
                                 AV-3    3.74  2.02
                                 AV-4    3.71  1.71
Efficiency Gain (EG)             EG-1    3.81  1.20  0.87
                                 EG-2    3.43  1.17
                                 EG-3    3.31  1.32
6 Data Analysis Results

A self-administered survey study was conducted to test our research model and hypotheses. With the assistance of multiple assistant chiefs and captains, questionnaires were distributed through the line of command as an email attachment. Our subjects were individual officers who had been identified as target users of COPLINK and had completed the mandatory user training. The participating officers were from investigative and field operations divisions, and each of them was given two weeks to complete and return the questionnaire. Officers who failed to complete and return the survey within the initial time window were reminded and given another two weeks to do so. A final one-week time window was then offered to those who still had not responded. Of the 411 questionnaires distributed, a total of 283 complete and effective responses were received, a 68.9% response rate. Analysis of the respondents' gender distribution showed an approximately 4:1 ratio in favor of males. Most respondents were from the field operations divisions (60%), followed by the Criminal Investigative Division and Special Investigative Division (35%). Most of the respondents had a two-year college or associate's degree (41%), followed by those with a high school diploma (30%) and those holding a four-year college degree (29%). On average, the responding officers were 38.4 years of age and had 12.1 years of experience in law enforcement services. Comparative analysis of the
officers who completed and returned the survey within the initial response period versus those who needed the extended response time window(s) showed no significant differences in gender or home division distribution, educational background, age, or experience in law enforcement. Table 3 summarizes the demographic profile of the 283 respondents in our survey.

Table 2. Examination of convergent and discriminant validity – factor analysis results (each item's loading on the component associated with its construct)

PU-1 0.82, PU-2 0.84, PU-3 0.76, PU-4 0.77
PEOU-1 0.79, PEOU-2 0.70, PEOU-3 0.84, PEOU-4 0.78
BI-1 0.58, BI-2 0.37, BI-3 0.82
ATT-1 0.85, ATT-2 0.85, ATT-3 0.74
SN-1 0.83, SN-2 0.86
AV-1 0.91, AV-2 0.76, AV-3 0.86, AV-4 0.90
EG-1 0.81, EG-2 0.84, EG-3 0.73

Eigenvalues (Factors 1–7): 8.06, 3.34, 2.30, 1.58, 1.51, 1.06, 1.02
% of variance: 35.03, 14.21, 9.99, 6.88, 6.57, 4.61, 4.42
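A factor analysis of this kind can be approximated with standard tooling. The sketch below uses scikit-learn's FactorAnalysis with varimax rotation as a stand-in for the authors' principal component procedure (an assumption, not their actual analysis); the response matrix here is randomly generated purely for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis  # rotation needs a recent scikit-learn

# Random stand-in for a (283 respondents x 23 items) response matrix, ordered
# PU-1..4, PEOU-1..4, BI-1..3, ATT-1..3, SN-1..2, AV-1..4, EG-1..3.
responses = np.random.default_rng(0).normal(size=(283, 23))

fa = FactorAnalysis(n_components=7, rotation="varimax")
loadings = fa.fit(responses).components_.T        # (23 items x 7 factors)

# Convergent/discriminant validity check: each item should load distinctly
# highest on the single factor associated with its construct.
dominant_factor = np.abs(loadings).argmax(axis=1)
print(dominant_factor)
```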
Model Testing Results. We tested our research model using LISREL. Analysis results showed that our model exhibited a reasonable fit to the data; e.g., a Comparative Fit Index (CFI) of 0.91, a Non-Normed Fit Index (NNFI) of 0.89, and a Standardized Root Mean Square Residual (SRMSR) of 0.06. We also assessed the model's explanatory power. As shown in Figure 1, our model exhibited satisfactory explanatory utility, accounting for 58% of the variance in intention, 66% of the variance in attitude, and 60% of the variance in perceived usefulness.

Individual Causal Paths. Six of the nine hypothesized causal paths were statistically significant; i.e., p-value of 0.05 or lower. As suggested by our analysis results, efficiency gain and subjective norm appeared to be significant determinants of perceived
usefulness, which, in turn, showed a significant effect on both attitude and behavioral intention. Perceived ease of use significantly affected attitude, which, however, was not a significant intention determinant. In addition, subjective norm appeared to have a significant effect on intention, but in direct opposition to our hypothesis. The remaining hypotheses were not supported by our data; i.e., perceived ease of use on perceived usefulness, availability on intention, and attitude on intention (which might have been somewhat significant).

Table 3. Summary of respondents' demographic profile

Demographic Dimension                       Descriptive Statistics
Average Age                                 38.4 years
Average Experience in Law Enforcement       12.1 years
Gender                                      Male: 81%; Female: 19%
Home Division                               Criminal/Special Investigative: 35%; Field Operations: 60%; Other: 5%
Education Background                        4-Year College or University: 29%; 2-Year College: 41%; High School: 30%
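The hypothesized paths could also be re-estimated with open-source SEM tooling. A minimal sketch, assuming the semopy package is available (the paper's own analysis used LISREL); the data frame here is a random stand-in for per-respondent construct scores:

```python
import numpy as np
import pandas as pd
import semopy  # assumed available; the paper's own analysis used LISREL

# Random stand-in for per-respondent construct scores (replace with real data).
rng = np.random.default_rng(0)
survey_df = pd.DataFrame(rng.normal(size=(283, 7)),
                         columns=["PU", "PEOU", "ATT", "BI", "SN", "AV", "EG"])

# Structural paths corresponding to hypotheses H1-H9.
desc = """
PU ~ EG + SN + PEOU
ATT ~ PU + PEOU
BI ~ PU + ATT + SN + AV
"""
model = semopy.Model(desc)
model.fit(survey_df)
print(semopy.calc_stats(model).T)  # fit indices such as CFI and RMSEA
```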
7 Discussion

Overall, our model showed a reasonably good fit to the responding officers' technology acceptance assessments and exhibited an explanatory power comparable to, if not higher than, that of representative previous studies; e.g., [17], [23]. Several research and management implications can be derived from our findings. First, our study suggests a prominent core influence path from efficiency gain to perceived usefulness and then to intention to accept. Perceived usefulness may be the single most important driver in individual officers' technology acceptance decision-making. Based on our model testing results, perceived usefulness appears to be the only construct that has a significant direct effect on intention. The observed significance may suggest a tendency or likelihood of an officer's anchoring his or her technology acceptance decision in a utility perspective. This utility-centric view of the technology is supported by the insignificant influence of perceived ease of use on perceived usefulness. Together, our findings suggest that a law enforcement officer is not likely to consider a technology to be useful simply because it is easy to use. Efficiency gain is a critical aspect or source of utility. According to our analysis, many officers feel that the use of COPLINK would improve their task performance, and that COPLINK is useful for their work. Second, subjective norm appears to be an important technology acceptance determinant, judged by its total effect on behavioral intention. According to our analysis,
subjective norm has a significant positive effect on individual acceptance decision-making, but this effect may be mediated by other factors; e.g., perceived usefulness. Individual officers are likely to take significant referents' opinions into consideration when assessing a technology's usefulness. However, such normative beliefs alone may not foster positive acceptance decisions directly. In effect, our analysis shows a negative effect of subjective norm on behavioral intention, significant at the 0.05 level. One possible interpretation is that an officer exhibiting a strong intention to use COPLINK may have developed a negative response to others' desire that he or she should accept the technology, and vice versa. The observed negative effect might be partially attributed to individual autonomy in law enforcement, which resembles a professional setting to some degree. Third, the influence of attitude on intention may be somewhat significant, as suggested by a p-value between 0.05 and 0.10. Perceived usefulness and perceived ease of use appear to be important determinants of an individual officer's attitude toward COPLINK and together explain a significant portion of the variance in attitude; i.e., 66%. Our findings suggest that the importance of individual attitudes should not be underestimated. In this connection, administrators and technology providers need to proactively facilitate the cultivation and development of favorable attitudes by individual officers, particularly by means of convincing demonstrations and unambiguous communication of a technology's utility and ease of operation. Management of individual attitude is essential in situations where law enforcement officers are relatively autonomous in task performance and technology use. With increased understanding of key acceptance drivers and their probable causal relationships, administrators and technology providers can identify specific areas where user acceptance is likely to be hindered and tackle these barriers accordingly. In light of the prominent influence path from efficiency gain to perceived usefulness and then to intention to accept, initial demonstrations and user training should concentrate on communicating a technology's utility for improving officers' performance and emphasize the technology's relevance to their routine tasks. Cultivating and promoting a favorable community assessment or view of the technology under discussion is also important and can create normative or even conformance pressure on individual acceptance decision-making. Such normative or compliant forces may not contribute directly to positive acceptance decisions, but can be so prevalent as to practically reinforce individual officers' technology assessments. In addition, management of individual attitudes towards a newly implemented technology is also relevant and deserves administrative or managerial attention in situations where individual officers have considerable autonomy in their task performance and technology choice/use.
Acknowledgement. We would like to thank the following TPD officers for their input and support: Chief Richard Miranda, Asst. Chief Kathleen Robinson, Asst. Chief Kermit Miller, Cap. David Neri, Lt. Jenny Schroeder, Det. Tim Petersen, and Daniel Casey. We also would like to thank Andy Moosmann for his invaluable assistance in data collection. The work reported in this paper was substantially supported by the Digital Government Program, National Science Foundation (NSF Grant # 9983304: “COPLINK Center: Information and Knowledge Management for Law Enforcement”).
References

1. Ajzen, I., "From Intention to Actions: A Theory of Planned Behavior," in: Kuhl, J. and Beckmann, J. (eds.): Action Control: From Cognition to Behavior, Springer Verlag, New York, 1985, pp. 11–39.
2. Ajzen, I., "The Theory of Planned Behavior," Organizational Behavior and Human Decision Processes, Vol. 50, 1991, pp. 179–211.
3. Bandura, A., "Self-efficacy: Toward a Unifying Theory of Behavioral Change," Psychological Review, Vol. 84, 1977, pp. 191–215.
4. Chau, P.Y.K., "An Empirical Assessment of a Modified Technology Acceptance Model," Journal of Management Information Systems, Vol. 13, No. 2, 1996, pp. 185–204.
5. Chau, P.Y.K. and Hu, P.J., "Examining a Model for Information Technology Acceptance by Individual Professionals: An Exploratory Study," Journal of Management Information Systems, Vol. 18, No. 4, 2002, pp. 191–229.
6. Chen, H., Schroeder, J., Hauck, R.V., Ridgeway, L., Atabakhsh, H., Gupta, H., Boarman, C., Rasmussen, K., and Clements, A.W., "COPLINK Connect: Information and Knowledge Management for Law Enforcement," Decision Support Systems, Vol. 34, No. 3, 2003, pp. 271–285.
7. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., and Schroeder, J., "COPLINK: Managing Law Enforcement Data and Knowledge," Communications of the ACM, Vol. 46, No. 1, 2003, pp. 28–34.
8. Compeau, D.R. and Higgins, C.A., "Computer Self-Efficacy: Development of a Measure and Initial Test," MIS Quarterly, Vol. 19, 1995, pp. 189–211.
9. Cooper, R.B. and Zmud, R.W., "Information Technology Implementation Research: A Technology Diffusion Approach," Management Science, Vol. 34, No. 2, 1990, pp. 123–139.
10. Davis, F.D., "Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology," MIS Quarterly, Vol. 13, No. 3, September 1989, pp. 319–339.
11. Davis, F.D., Bagozzi, R.P., and Warshaw, P.R., "User Acceptance of Computer Technology: A Comparison of Two Theoretical Models," Management Science, Vol. 35, No. 8, 1989, pp. 982–1003.
12. Fishbein, M. and Ajzen, I., Belief, Attitude, Intention and Behavior: An Introduction to Theory and Research, Addison-Wesley, Reading, MA, 1975.
13. Gattiker, U.E., "Managing Computer-based Office Information Technology: A Process Model for Management," in H. Hendrick and O. Brown (eds.), Human Factors in Organizational Design, Elsevier Science, Amsterdam, The Netherlands, 1984, pp. 395–403.
14. Goodhue, D.L. and Thompson, R.L., "Task-Technology Fit and Individual Performance," MIS Quarterly, Vol. 19, No. 2, June 1995, pp. 213–236.
15. Igbaria, M., Guimaraes, T., and Davis, G.B., "Testing the Determinants of Microcomputer Usage via a Structural Equation Model," Journal of Management Information Systems, Vol. 11, No. 4, 1995, pp. 87–114.
16. Keen, P., Shaping the Future: Business Design through Information Technology, Harvard Business School Press, Boston, MA, 1991.
17. Mathieson, K., "Predicting User Intention: Comparing the Technology Acceptance Model with the Theory of Planned Behavior," Information Systems Research, Vol. 2, No. 3, 1991, pp. 173–191.
18. Moore, G.C. and Benbasat, I., "Development of an Instrument to Measure the Perception of Adopting an Information Technology Innovation," Information Systems Research, Vol. 2, No. 3, 1991, pp. 192–223.
19. Nunnally, J.C., Psychometric Theory, 2nd edn, McGraw-Hill, New York, 1978.
20. Rogers, E.M., Diffusion of Innovations, 4th edn, Free Press, New York, NY, 1995.
21. Straub, D.W., "Validating Instruments in MIS Research," MIS Quarterly, Vol. 13, No. 2, 1989, pp. 147–169.
22. Szajna, B., "Empirical Evaluation of the Revised TAM," Management Science, Vol. 42, No. 1, 1996, pp. 85–92.
23. Taylor, S. and Todd, P.A., "Understanding Information Technology Usage: A Test of Competing Models," Information Systems Research, Vol. 6, No. 1, 1995, pp. 144–176.
24. Tornatzky, L.G. and Klein, K.J., "Innovation Characteristics and Innovation Adoption-Implementation: A Meta-Analysis of Findings," IEEE Transactions on Engineering Management, Vol. 29, No. 1, 1982, pp. 28–45.
25. Venkatesh, V. and Davis, F.D., "A Model of the Antecedents of Perceived Ease of Use: Development and Test," Decision Sciences, Vol. 27, No. 3, 1996, pp. 451–482.
26. Venkatesh, V. and Davis, F.D., "A Theoretical Extension of the Technology Acceptance Model: Four Longitudinal Field Studies," Management Science, Vol. 46, No. 2, 2000, pp. 186–204.
27. Venkatesh, V., "Determinants of Perceived Ease of Use: Integrating Control, Intrinsic Motivation, and Emotion into the Technology Acceptance Model," Information Systems Research, Vol. 11, No. 4, 2000, pp. 342–365.
Appendix: Listing of Question Items

Perceived Usefulness (PU); source: Venkatesh & Davis (1996)
PU-1: Using COPLINK would improve my job performance.
PU-2: Using COPLINK in my job would increase my productivity.
PU-3: Using COPLINK would enhance my effectiveness at work.
PU-4: Overall, I find COPLINK to be useful in my job.

Perceived Ease of Use (PEOU); source: Venkatesh & Davis (1996)
PEOU-1: My interaction with COPLINK is clear and understandable.
PEOU-2: Interacting with COPLINK does not require a lot of mental effort.
PEOU-3: Overall, I find COPLINK easy to use.
PEOU-4: I find it easy to get COPLINK to do what I want it to do.

Attitude (ATT); source: Taylor & Todd (1995)
ATT-1: Overall, it is a good idea to use COPLINK in my job.
ATT-2: Using COPLINK would be pleasant.
ATT-3: Using COPLINK would be beneficial to my work.

Subjective Norm (SN); source: Taylor & Todd (1995)
SN-1: My colleagues in the department think that I should use COPLINK.
SN-2: I would use COPLINK more if I knew my boss wanted me to.

Efficiency Gain (EG); source: Davis (1989)
EG-1: Using COPLINK reduces the time I spend completing my job-related tasks.
EG-2: COPLINK allows me to accomplish tasks more quickly.
EG-3: Using COPLINK saves me time.

Availability (AV); source: Taylor & Todd (1995)
AV-1: There are enough computers for everyone to use COPLINK.
AV-2: I have no difficulty finding a computer to use COPLINK when I need it.
AV-3: Availability of computers for accessing COPLINK is not going to be a problem.
AV-4: There are enough computers for me to use COPLINK in the department.

Behavioral Intention (BI); source: Venkatesh & Davis (1996)
BI-1: When I have access to COPLINK, I would use it as often as needed.
BI-2: To the extent possible, I intend to use COPLINK in my job.
BI-3: Whenever possible, I would use COPLINK for my tasks.
“Atrium” – A Knowledge Model for Modern Security Forces in the Information and Terrorism Age

Chris C. Demchak

Cyberspace Policy Research Group, School of Public Administration and Policy, University of Arizona, Tucson, Arizona 85721
[email protected]

Abstract. Eighty percent of business process reengineering efforts have failed. This piece argues that the missing element is an ability to see the newer technical systems as conceptually integrated into an organization as well as functionally embedded. Similarly, a model of a modern military or security institution facing asymmetries in active security threats and dealing with extremely limited strategic depth needs to focus less on precision strikes and more on knowing what can be known in advance. Finding that most published designs and existing relations were based on rather static notions of accessing only explicitly collected knowledge, I turned to the development of an alternative sociotechnical organizational design labeled the “Atrium” model, based on the corporate hyperlinked model of Nonaka and Takeuchi. The rest of this work presents the basics of this model as applied to a military organization, though it could conceivably apply to any large-scale security force.
1 Introduction
Not all existing organizations can be equally responsive to the sociotechnical demands of a surprisingly insecure information age. It is a common failing among designers of technical networks and computer programs to assume they are designing a flexible but encapsulated tool that anyone can use, much as networked systems and their basic hardware are viewed as a car that anyone can drive anywhere. In reality, these systems are more like roads or railways: it matters where the road is placed, what it accesses, who can or will travel on it, and what surrounding sociotechnical arrangements are changed by its creation and use. Not all organizational forms are receptive to newer knowledge systems. Much critical knowledge can be lost, distorted, or never recognized when the instantiation of new systems is assumed to be positive or, even after a tough transition, at worst merely neutral. A rule of thumb is that 80 percent of business process reengineering efforts have failed, despite being largely spurred on, and instituted with and through, modern enterprise-wide networked systems. This piece argues that the missing element is the ability to see the newer systems as conceptually integrated into an organization, as opposed to merely improving value-added activities functionally. Successful information operations (IO) rely on the accurate sociotechnical organization of knowledge, i.e., the right information with the right amount of precision in the right modality or format for absorption with
the right amount of time to apply the correct electronic or other response. In short, IO, like the effective application of all other advanced technologies, depends as much on the organization of the people around the artifacts as on the quality of the artifacts themselves [3][8]. A model of a modern military or security institution facing asymmetries in active security threats and dealing with extremely limited strategic depth needs to focus less on precision strikes and more on knowing what can be known in advance.1 To achieve that future focus, supportive organizational designs need to be engaged in the transition process. In this work, I present my conclusions after several years of research taking a knowledge-centric approach in developing an alternative model to the dominant organizational models of modern security forces (in these cases, militaries) seen in several nations, including the US.2 In the research, I focused on how the current and loosely planned future organizational designs could or could not assure that explicit and implicit knowledge in a complex system could be discovered, winnowed, connected, weighted, and applied using advancing technologies when the threats were multi-layered and present in peace as well as war. Finding that most published designs and existing relations were based on rather static notions of accessing only explicitly collected knowledge, I turned to the development of an alternative sociotechnical organizational design labeled the "Atrium" model (see Figure 1 below), based on the corporate hypertext model of Nonaka and Takeuchi. The rest of this work presents the basics of this model as applied to a military organization, though it could conceivably apply to any large-scale security force. Before introducing the model itself, it is important to note that, like their civilian counterparts in a rapidly globalizing environment, modern military technologies across both machine and human systems need information sharing, not hoarding, both to act quickly and to counter surprises. Designed by engineers, not social scientists, however, the newer systems tend to assume knowledge will come with the automatic and comprehensive provision of data. However, knowledge is not an automatic byproduct of networks and grids unless the surrounding social system deliberately seeks to capture that knowledge. In military or commercial endeavors, it is the organization, not the computer network, that is ultimately the knowledge-producing entity. And it costs a great deal to develop everything one needs on one's own. The more distinct the organization is from a supporting and surrounding knowledge base, the more expensive the internal development of knowledge for that group of people [5]. Hence, it is preferable for the organization to share and to benefit from the sharing of other organizations. Furthermore, complex systems, including organizations, are path-dependent on initial conditions. The more the initial organizational design facilitates absorbing and accumulating knowledge from the beginning, including more slack, redundancy, and trial and error, the more likely the design is to be robust and successful in the face of surprise. Since surprise is the endemic characteristic of the systems and requirements faced by militaries, especially smaller forces, any modernizing design needs to consider these complex system realities from the outset.
1. The description of this model is drawn heavily from [2].
2. Much of the model discussion was originally presented in an earlier work developing the model for a small state and using the case of Israel [2].
Fig. 1. The Atrium
The uncertainties of the new global circumstances require a different kind of modernization of the military organization – one less tied to legacy forces and more designed to support a new social construction of the role of knowledge as a player in organizational operations. To meet these aims, I propose a military or security adaptation of the commercial "hypertext" organization described by Nonaka and Takeuchi [6:99-133]. This refinement, which I have labeled the "Atrium" form of information-based organization, is a design that treats knowledge as a third and equal partner in the military organization's peacetime and wartime operations. In the original model and in my refinement, the knowledge base is not merely an overlain tool or a set of connecting pipelines. Rather, the knowledge base of the organization is actively nurtured both in the humans and in the digitized, integrated institutional structure. Writing for the commercial world, Nonaka and Takeuchi attempted to reconcile the competing demands and benefits of both matrix and hierarchical organizational forms. Their "hypertext" organization interweaves three intermingling structures: a matrix structure of smaller task forces specifically focused on innovative problems at hand and answering to senior managers; a hierarchical structure that supports the general operational systems and also contributes, and then reabsorbs, the members of task forces; and finally a large knowledge base that is intricately interwoven through the activities of both matrix and hierarchical units.3
3. As Nonaka and Takeuchi [6:106-107] aptly phrased it, "The goal is an organizational structure that views bureaucracy and the task force as complementary rather than mutually exclusive... Like an actual hypertext document, hypertext organization is made up of interconnected layers or contexts..."
In both their and my models, the knowledge base is more than a library or a database on a server; it is a structure in and of itself, integrating applications and data. It reaches into the task forces, which use it for data mining, while also sustaining the general operations, sharing information broadly. But it is also socially constructed as a key player in the organization, such that task force members are required to download their task force experiences into the knowledge base before they are permitted to return to their positions in the hierarchical portion of the organization. Similarly, operations in the general hierarchy are required to interact through the knowledge base systems so that patterns in operations and actions are automatically captured for analysis [6:99-133]. The major contribution here is that the knowledge base is not a separate addition to the organization, irrelevant to the architecture of the human-machine processes, as it is in the emergent US and other Western models of modernizing militaries or security forces.4 Rather, it is integral to the success of processes and the survival of the institution. Several Japanese corporations seem to operate along these lines productively, and one is struck by an interesting distinction: implicit knowledge developed by human interactions related to the job is viewed by the corporation not only as a source of value but also as key to long-term survival.5 It is this view of knowledge that distinguishes these corporations and makes them more prepared for surprise in the marketplace. In adapting this design and social construction to a military or security setting, I have given this concept of a knowledge base a name, the "Atrium."6 The term captures the sense of being a place to which a member of the organization can go, virtually or otherwise, to contribute and acquire essential knowledge, and that is also a place of refuge in which to think out solutions. The mental image is that it is overarching, not beneath the human actors, but something that protects as well as demands inputs. Entering into and interacting with the Atrium is essentially acting with a major player in the institution. Such a conception rationalizes the efforts to ensure implicit knowledge is integrated into the long-term analyses of the organization, such as the time spent in downloads of experiences and information from the task force members before they return to the more hierarchical stem. The "Atrium" form requires an explicit embrace of what has been called the "new knowledge management."7 In particular, the new knowledge management means using network/web technologies to move from controlling information inventories as human relationship-based "controlled hoards" to web-based "trusted source" structures.8
4. A close reading of JV2010 and related US transformational documents shows a broad assumption that, as fast as the new equipment becomes, the knowledge needed to make that speed, lethality, and deployability successful will automatically be there as long as raw information is moved in real time. It is a rather naïve understanding of knowledge and complex systems, but not unexpected if the decision-makers have focused on target acquisition and firing weapons at single points all their professional lives.
5. "The goal is an organizational structure that views bureaucracy and the task force as complementary rather than mutually exclusive... Like an actual hypertext document, hypertext organization is made up of interconnected layers or contexts..."; see [6:106-107].
6. In a manuscript now under construction, "The Atrium – Refining the HyperText Organizational Form," I more fully explain the mechanisms of integrating an Atrium into an organization.
7. For a more modern use of this term, see [4] and [7].
8. The evolution of the internet or the web is in essence a social history of information sharing among individuals embedded in organizations. There are a number of versions of the history of the internet. For one discussion, see [1]. See also the Internet Society web site.
With networks, everything is dual use, and sufficient technical familiarity can be found in foreign ministries as well as in basements inhabited by teenage geeks with a sociopathic attitude. Knowledge development will inevitably come through surprises that are encountered all along the spectrum of formal declaration of operations, from peace-building, through peace-making, peace-keeping, posturing, and prevailing in actual hostilities. The design of a modern knowledge-centric military must, in effect, accept 24/7 operations with all the ethical, legal, budgetary, socio-economic, and geostrategic constraints implied.
2 The "Atrium" as Colleague and Institutional Memory
Key to this model is stabilizing the locus of institutional memory and creativity in the human-Atrium processes. In principle, according to their rank, each member of the organization will have the chance to cycle in and out of task forces, core operations, or Atrium maintenance and refinement. As they cycle into a new position, gear up, operate, and then cycle out, each player does a data dump, including frustrations about process, data, and ideas, into the Atrium. Organizational members elsewhere can then apply data mining or other applications to this expanding pool of knowledge elements to guide their future processes. Explicit and implicit institutional knowledge thus becomes instinctively valued and actively retained and maintained for use in ongoing or future operations.
3 The Core – Main Operational Knowledge Creation and Application Hierarchies
With this new social construction of what one does with information in the military or security force (one creates, stores, refines, connects, weights, shares, and nurtures it), the Core then embraces the new knowledge potential of conscripts and reservists by reinforcing the trends in national digital education. That is, service involving computers is not only promoted as a benefit of conscription, but training in computers is pursued irrespective of the actual military or security function. For example, the maintainer will expect to find knowledge about diagnostic workarounds in other maintenance units in a foray into the Atrium, as well as being expected to give back his or her own experiences to the system. That maintainer – who could easily be a youngster of 19 years – will have been taught not just how to do that diagnostic task but also how to manipulate digital applications in general. This education in the military or security force will not only enhance the surprise-reducing potential of operations but also improve the soldiers' future marketability in the economy and their long-term contribution to Atrium nurturing as reservists. As a side benefit, the growing unwillingness to serve in the military or in security services may be mitigated when all full-time
members (and associated part-timers or reservists) receive what is considered a valuable education in networked technology. Furthermore, the Core also embraces the potential of part-timers or reservists for security forces by assigning tasks that further the knowledge development of the Atrium. In the United States, the role for reservists in future conflicts involving terrorism is under debate. Under this model, Core tasks can be accomplished on weekends without requiring the reservist to show up during working hours in uniform. The implicit knowledge of these experienced individuals is not lost, as they are able to spend their reserve years solving puzzles or refining data, keeping their skills at usable levels while holding other employment. Reservists can then still serve physically in uniform in the Core when called up, but that period can be limited and infrequent since the reservist is not expected to perform many basic security tasks in the field. Naturally, this approach sustains all the advantages of a close connection between the wider society and its part-time or reservist security forces without the disruption of a civilian job. As described, the Core will have plenty of tasks associated with the Atrium, both in the initial creation of applications, elements, processes, and uses and in the coordination and integration of these evolutions. Its use of part-time or reservist security forces provides an essential, constant intellectual recharge from the wider community, permitting the Atrium to avoid iterating into a brittle bureaucratic equilibrium. Through the problem solving of the task forces as well as the intense attention of actively serving security force members, the members of the forces serving in the Core will come to understand the Atrium as an intelligent agent rather than a mindless amalgamation of individual databases. In short, the vibrancy of the Atrium in providing knowledge to accommodate surprise is due not to the professionalism of the small permanent Core party but to the newness of perspective and rising familiarity of both the active and part-time participants. However, this organization will be surrounded by complex systems as well as being a complex system itself. Problems beyond the normal Core operations and Atrium knowledge analysis will emerge constantly. Some of these will be physically dangerous and immediate. Some will be prospective, such as determining why certain neighboring political leaders have allocated budget amounts to shadowy organizations. Some will be long term, such as rechanneling the design goals of key data chunk allocations within the Atrium or retargeting some of its uses in the light of wider global trends. For these kinds of problems, a matrix organization is eminently preferable, and hence we come to the final element, the task forces.
4 The Task Forces – Responses in Knowledge Creation and Security Applications
Security forces, in and out of militaries, tend to fragment into many small units with specialized missions. Each of these develops a broad and deep array of implicit knowledge that this model would be able to capture and put to good use. Many of these existing units can be altered to function as task force structures answering to the senior military or security force officers in a knowledge-centric organization. First, to capture the implicit information currently lost or buried, members of all field units
will rotate in from their operations to download implicit knowledge, update their understanding of the Atrium's holdings and possible insights, and contribute to the Core. Second, some of the more elite units will be retargeted along different modalities of knowledge acquisition and use, applying such data in knowledge mining combined with other information present in the Atrium. Some units will be left with the more physically challenging missions, such as border incursion controls and basic training, but their members will also be rotated in and out on longer cycles, perhaps a year, to accommodate exceptional physical requirements. Other units will be gradually altered into problem analysis units – moving from simply gathering data on all suspicious activity to meta-analyses of such activities over time and locations, with an eye to proactively disrupting the initiating efforts of the infiltrating threat rather than sending squads after the cell is well established. For this, the members will have to be digitally creative as much as physically hardy. The deployed or physically demanding units will be smaller and directly answerable to senior members of the headquarters staff. However, since rotating organization members among the three elements – task forces, the Core, and the Atrium – is a basic tenet, even senior leaders must rotate. For example, senior leaders could spend most of their time leading each of the field divisions or commands, but they must rotate in for Atrium service, as well as heading task forces occasionally. While on rotation to the Atrium, the senior leader must be completely free from leadership duties; thus attention must be paid to a functioning deputy-leader culture. Finally, the explicit assumption is that each task force is solving a problem or exploring an opportunity while also developing important nonobvious information that must be input into the Atrium's processes. Senior leaders, just like lowly field members, have implicit knowledge to contribute to, and skills to refine in extracting and manipulating data from, the Atrium resources. Not all of the existing military or security force units will change in their mission; rather, the units are more likely to be scaled back in size and attached higher up in the hierarchy. The ones that retain the more physically dangerous missions will alter only in that their members will rotate out of Core positions for a position in the elite force and then back through an Atrium tour before returning to the Core. Fortunately, the value placed on computer skills, and possibly a civilian career to follow military or security force experience, offers a way to socially construct this change for easier acceptance, as well as continuing service on a part-time basis. Placing these units directly beneath the senior leaders also mollifies grievances over a loss of prestige. Personnel rotating in and out of these units are assured not only of interesting current problems for six months to a year but also of greater visibility at senior levels. The units will benefit from the strong creative advantages of a matrix structure and can produce more innovative problem solutions than can be produced today.
5 Advantages – Surprise-Oriented, Scalable Knowledge-Enabled Institutions This design has advantages in using advance knowledge to extend the limited strategic depth of a nation or community under the unknowable unknowns of the emerging information and terrorism age. Deleterious surprises by actively hostile opponents can
be countered by integrating early warning and response across different kinds of forces and by innovative combination of accumulated information. The existing widely held model of a modern security force tends towards centralization of control by reducing slack in the organization's time and/or redundancy in its resources. It has become an act of faith that this centralization explicitly promotes synchronicity of operations, and in due course centralization across networks is also encouraged. But a fixation on central decision-making and synchronized actions can encourage devastating ripple effects in an increasingly tightly coupled organization. In contrast, the Atrium model is based on an understanding of complexity across large-scale systems – the environment faced by security forces today under active threats. If only trends – not specifics – can be seen in advance, then the best preparation is to have the knowledge base and the skills in creative combinations ready and waiting for the elements of the trend to take concrete shape. The model encourages a dampening of rippling rogue outcomes through the rotation of members and the inclusion of skilled part-timers. Its design presumes that surprise during operations is normal in complex systems and that only slack built through knowledge mechanisms can accommodate, mitigate, or dampen the effects on a large-scale organization. Hence, the Atrium concept encourages independent thinking while permitting widespread coordination and integration across the organization, time, and operations, and this response can occur at any scale. Socializing some key central themes of operations into unit members is as close as the Atrium comes to endorsing expensive centralization such as the Total Information Awareness program currently being pursued by the US Department of Defense. Furthermore, this proposal does not assume wisdom comes automatically with 100 percent visibility of any conflict arena, or that this kind of visibility of an operation is the goal of modernization. On the contrary, this Atrium organizational model presumes that the 24/7 accumulation of information, much of it implicit and never before digitized, combined with data mining techniques and a constant inflow of new pairs of eyes (in rotations through the Atrium), will construct new visions of operations. Innovative operations at any scale are enhanced when integration of a wide variety of information is possible. While a nation or a security service under threat still needs physically demanding forces and standoff weapons, other electronic options emerge, such as targeted disruption efforts that may overtly or covertly derail threatening postures by hostile opponents, or even a long-term, slow-roll deception goal that diverts potential hostile actors from other more dangerous choices. Furthermore, when work is digitized, internal security can increase nonobviously: it is easier and less intrusive to scan across employee actions. Also, when part-timers are rotating in and out of all functions and their implicit knowledge is being accumulated in the Atrium, then individual elements of knowledge are potentially spread all over the society. With so many knowing in general the overall structure and uses of the Atrium and the military or security force's capabilities, the competition is less for secret information than for positive social assessments by chief acquisition officers.
This kind of institutional knowledge helps both in curbing corruption through database transparency and in permitting those secrets that absolutely must be kept to be buried in the data noise.
References
1. Benedikt, Michael. (1991). Cyberspace: First Steps. Boston, MA: The MIT Press.
2. Demchak, Chris C. (2001). "Knowledge Burden Management and a Networked Israeli Defense Force: Partial RMA in 'Hyper Text Organization'?" Journal of Strategic Studies 24:2 (June).
3. Drucker, Peter F. (1959). Technology Management and Society. San Francisco, CA: Harper and Row.
4. Gleick, James. (1987). Chaos: Making a New Science. New York: Viking.
5. Landau, Martin. (1973). "On the Concept of a Self-Correcting Organization." Public Administration Review (November–December 1973).
6. Nonaka, Ikujiro and Takeuchi, Hirotaka. (1997). "A New Organizational Structure (HyperText Organization)." In Prusak, Laurence, ed. Knowledge in Organizations. Boston: Butterworth-Heinemann. 99–133.
7. Wheatley, Margaret J. (1992). Leadership and the New Science. San Francisco: Berrett-Koehler Publishers.
8. Wilson, James Q. (1989). Bureaucracy: What Government Agencies Do and Why They Do It. New York: Basic Books.
Untangling Criminal Networks: A Case Study Jennifer Xu and Hsinchun Chen Department of Management Information Systems, University of Arizona Tucson, AZ 85721, U. S. A. {jxu, hchen}@eller.arizona.edu
Abstract. Knowledge about criminal networks has important implications for crime investigation and the anti-terrorism campaign. However, lack of advanced, automated techniques has limited law enforcement and intelligence agencies’ ability to combat crime by discovering structural patterns in criminal networks. In this research we used the concept space approach, clustering technology, social network analysis measures and approaches, and multidimensional scaling methods for automatic extraction, analysis, and visualization of criminal networks and their structural patterns. We conducted a case study with crime investigators from the Tucson Police Department. They validated the structural patterns discovered from gang and narcotics criminal enterprises. The results showed that the approaches we proposed could detect subgroups, central members, and between-group interaction patterns correctly most of the time. Moreover, our system could extract the overall structure for a network that might be useful in the development of effective disruptive strategies for criminal networks.
1 Introduction
Criminals seldom operate in a vacuum but interact with one another to carry out various illegal activities. In particular, organized crimes such as terrorism, drug trafficking, gang-related offenses, frauds, and armed robberies require collaboration among offenders. Relationships between individual offenders form the basis for organized crimes [18] and are essential for smooth operation of a criminal enterprise, which can be viewed as a network consisting of nodes (individual offenders) and links (relationships). In criminal networks, there may exist groups or teams, within which members have close relationships. One group also may interact with other groups to obtain or transfer illicit goods. Moreover, individuals play different roles in their groups. For example, some key members may act as leaders to control activities of a group. Some others may serve as gatekeepers to ensure smooth flow of information or illicit goods. Structural network patterns in terms of subgroups, between-group interactions, and individual roles thus are important to understanding the organization, structure, and operation of criminal enterprises. Such knowledge can help law enforcement and intelligence agencies disrupt criminal networks and develop effective control strategies to combat organized crimes such as narcotic trafficking and terrorism. For exam-
ple, removal of central members in a network may effectively upset the operational network and put a criminal enterprise out of action [3, 17, 21]. Subgroups and interaction patterns between groups are helpful for finding a network’s overall structure, which often reveals points of vulnerability [9, 19]. For a centralized structure such as a star or a wheel, the point of vulnerability lies in its central members. A decentralized network such as a chain or clique, however, does not have a single point of vulnerability and thus may be more difficult to disrupt. To analyze structural patterns of criminal networks, investigators must process large volumes of crime data gathered from multiple sources. This is a nontrivial process that consumes much human time and effort. Current practice of criminal network analysis is primarily a manual process because of the lack of advanced, automated techniques. When there is a pressing need to untangle criminal networks, manual approaches may fail to generate valuable knowledge in a timely manner. To help law enforcement and intelligence agencies analyze criminal networks, we propose applying the concept space and social network analysis approaches to extract structural patterns automatically from large volumes of data. We have implemented these techniques in a prototype system, which is able to generate network representations from crime data, detect subgroups in a network, extract between-group interaction patterns, and identify central members. Multi-dimensional scaling has also been employed to visualize criminal networks and structural patterns found in them. The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 describes the system architecture; Section 4 presents the case study in detail; Section 5 concludes the paper and suggests future research directions.
2 Related Work The process of extracting structural network patterns from crime data usually includes three phases: network creation, structural analysis, and network visualization. We review related work for each phase. 2.1 Network Creation To create network representations of criminal enterprises, investigators have to wade through floods of database records to search for clues of relationships between offenders. Such a task can be time-consuming and labor-intensive. A technique called link analysis has been used to detect relationships between crime entities and create network representations. Traditional link analysis is based on the Anacapa charting approach [12] in which data have to be examined manually to identify possible relationships. For visualization purposes, an association matrix is then constructed and a link chart based upon it is drawn. An investigator can study the structure of the link chart (a network representation) to discover patterns of interest. Krebs [15], for example, mapped a terrorist network comprised of the 19 hijackers in the September 11 attacks on the World Trade Center, using such an approach. How-
ever, the manual link analysis approach will become extremely ineffective and inefficient for large datasets. Some automated approaches to creating representations of criminal networks based on crime data have been proposed. Goldberg and Senator [11] used a heuristic-based approach to forming links and associations between individuals who had shared addresses, bank accounts, or related transactions. The networks created were analyzed to detect money laundering and other illegal financial activities. Dombroski and Carley [8] combined multi-agent technology, a hierarchical Bayesian inference model, and biased network models to create representations of a criminal network based on prior network data and informant perceptions of the network. A different network creation method used in the COPLINK system [13] is based on the concept space approach developed by Chen and Lynch [5]. Such an approach can generate a thesaurus from documents based on co-occurrence weights that measure the frequency with which two words or phrases appear in the same document. Applying this approach to crime incident data results in a network representation in which a link between a pair of entities exists if they ever appear together in the same criminal incident report. The more frequently they appear together, the stronger the association. After a network representation has been created, the next phase is to extract structural patterns from the networks. 2.2 Structural Analysis Social Network Analysis (SNA) provides a set of measures and approaches for structural network analysis. These techniques were originally designed to discover social structures in social networks [23] and are especially appropriate for studying criminal networks [17, 18, 21]. Specifically, SNA is capable of detecting subgroups, identifying central individuals, discovering between-group interaction patterns, and uncovering a network’s organization and structure [23]. Studies involving evidence mapping in fraud and conspiracy cases have recently employed SNA measures to identify central members in criminal networks [3, 20]. Subgroup Detection. With networks represented in a matrix format, the matrix permutation approach and cluster analysis have been employed to detect underlying groupings that are not otherwise apparent in data [23]. Burt [4] proposed to apply hierarchical clustering methods based on a structural equivalence measure [16] to partition a social network into positions in which members have similar structural roles. Centrality. Centrality deals with the roles of network members. Several measures, such as degree, betweenness, and closeness, are related to centrality [10]. The degree of a particular node is its number of direct links; its betweenness is the number of geodesics (shortest paths between any two nodes) passing through it; and its closeness is the sum of all the geodesics between the particular node and every other node in the
network. Although these three measures are all intended to illustrate the importance or centrality of a node, they interpret the roles of network members differently. A high degree measurement, for instance, may suggest a leadership role, whereas a high level of betweenness may mark an individual as a gatekeeper in the network. Baker and Faulkner [3] employed these three measures, especially degree, to find the key individuals in a price-fixing conspiracy network in the electrical equipment industry. Krebs [15] found that, in the network consisting of the 19 hijackers, Mohamed Atta scored the highest on degree.
Discovery of Patterns of Interaction. Patterns of interaction between subgroups can be discovered using an SNA approach called blockmodel analysis [2]. Given a partitioned network, blockmodel analysis determines the presence or absence of an association between a pair of subgroups by comparing the density of the links between them against a predefined threshold value. In this way, blockmodeling summarizes individual interaction details into between-group interactions so that the overall structure of the network becomes more apparent.
2.3 Network Visualization
SNA includes visualization methods that present networks graphically. The Smallest Space Analysis (SSA) approach, a branch of Multi-Dimensional Scaling (MDS), is used extensively in SNA to produce two-dimensional representations of social networks. In a graphical portrayal of a network produced by SSA, the stronger the association between two nodes or two groups, the closer they appear on the graph; the weaker the association, the farther apart [17]. Several network analysis tools, such as Analyst's Notebook [14], Netmap [11], and Watson [1], can automatically draw a graphical representation of a criminal network. However, they do not provide much structural analysis functionality and continue to rely on investigators' manual examinations to extract structural patterns.
Based on our review of related work, we proposed to employ the concept space approach, SNA measures and approaches, and MDS for extracting and visualizing structural patterns of criminal networks. We have developed a prototype system in which the proposed techniques have been implemented. The architecture of the system and its individual components are presented in the next section.
3 System Architecture The prototype system contains three major components: network creation, structural analysis, and network visualization. Figure 1 illustrates the system architecture.
Fig. 1. System architecture
3.1 Network Creation Component We employed the concept space approach to create networks automatically, based on crime data. We assumed that criminals who committed crimes together might be related and that the more often they appeared together the more likely it would be that they were related. We treated each incident summary (database records specifying the date, location, persons involved, and other information about a specific crime) as a document and each person’s name as a phrase. We then calculated co-occurrence weights based on the frequency with which two individuals appeared together in the same crime incident. As a result, the value of a co-occurrence weight not only implied a relationship between two criminals but also indicated the strength of the relationship.
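As a rough illustration of this step (a sketch with hypothetical incident data, not the authors' implementation), the following Python fragment counts pairwise co-occurrences across incident records; the concept space approach of Chen and Lynch [5] further normalizes such frequencies into association weights, which is omitted here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical incident summaries, reduced to the set of person
# names appearing in each crime incident report.
incidents = [
    {"Person A", "Person B", "Person C"},
    {"Person A", "Person B"},
    {"Person C", "Person D"},
]

def cooccurrence_weights(incidents):
    """Count how often each pair of persons appears in the same incident.

    A nonzero count implies a link; a higher count indicates a
    stronger relationship, mirroring the network creation step above.
    """
    weights = defaultdict(int)
    for persons in incidents:
        for a, b in combinations(sorted(persons), 2):
            weights[(a, b)] += 1
    return dict(weights)

print(cooccurrence_weights(incidents))
# {('Person A', 'Person B'): 2, ('Person A', 'Person C'): 1, ...}
```

Each nonzero count then becomes a link in the network representation, with the count serving as the strength of the relationship.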
3.2 Structural Analysis Component
The structural analysis component includes three functions: network partition for detecting subgroups, centrality measures for identifying central members, and blockmodeling for extracting interaction patterns between subgroups.
Network Partition. We employed hierarchical clustering, namely the complete-link algorithm [6], to partition a network into subgroups based on relational strength. Clusters obtained represent subgroups. To employ the algorithm, we first transformed co-occurrence weights generated in the previous phase into distances/dissimilarities. The
distance between two clusters was defined as the distance between the pair of nodes drawn from each cluster that were farthest apart. The algorithm worked by merging the two nearest clusters into one cluster at each step, eventually forming a cluster hierarchy. The resulting cluster hierarchy specified groupings of network members at different granularity levels. At lower levels of the hierarchy, clusters (subgroups) tended to be smaller and group members more closely related; at higher levels, subgroups were larger and group members might be only loosely related.
Centrality Measures. We used all three centrality measures to identify central members in a given subgroup. The degree of a node could be obtained by counting the total number of links it had to all the other group members. A node's betweenness and closeness scores required the computation of shortest paths (geodesics) using Dijkstra's algorithm [7].
Blockmodeling. At a given level of a cluster hierarchy, we compared between-group link densities with the network's overall link density to determine the presence or absence of between-group relationships. SNA was the key technique in our prototype system for extraction of criminal network knowledge.
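To make the three functions concrete, the sketch below chains them on a toy co-occurrence matrix, using off-the-shelf SciPy and NetworkX routines in place of the prototype's own implementation. The weight-to-distance transformation shown is an assumption (the paper does not specify its exact formula), and the six names and all matrix entries are invented for illustration.

```python
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric co-occurrence matrix (zero diagonal);
# entry [i, j] counts incidents shared by persons i and j.
names = ["P1", "P2", "P3", "P4", "P5", "P6"]
W = np.array([
    [0, 5, 4, 1, 0, 0],
    [5, 0, 3, 0, 0, 0],
    [4, 3, 0, 0, 0, 1],
    [1, 0, 0, 0, 4, 3],
    [0, 0, 0, 4, 0, 5],
    [0, 0, 1, 3, 5, 0],
], dtype=float)

# Assumed strength-to-distance transformation: strongly related
# pairs end up close together.
D = 1.0 / (1.0 + W)
np.fill_diagonal(D, 0.0)

# Complete-link hierarchical clustering; cutting the hierarchy at
# different levels yields subgroups at different granularities
# (here, two subgroups).
Z = linkage(squareform(D, checks=False), method="complete")
subgroups = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(names, subgroups)))  # e.g. {'P1': 1, ..., 'P4': 2, ...}

# Centrality on the link structure: high degree suggests a leader,
# high betweenness a gatekeeper, low closeness an outlier.
G = nx.Graph()
G.add_nodes_from(names)
G.add_edges_from((names[i], names[j])
                 for i in range(len(names))
                 for j in range(i + 1, len(names)) if W[i, j] > 0)
print(nx.degree_centrality(G))
print(nx.betweenness_centrality(G))
print(nx.closeness_centrality(G))

# Blockmodeling: declare a between-group link present when the
# density of ties between two subgroups reaches the network's
# overall link density (used here as the threshold).
A = W > 0
overall = A.sum() / (len(names) * (len(names) - 1))
for gi in (1, 2):
    for gj in (1, 2):
        if gi >= gj:
            continue
        rows = np.where(subgroups == gi)[0]
        cols = np.where(subgroups == gj)[0]
        density = A[np.ix_(rows, cols)].mean()
        print(gi, gj, "linked" if density >= overall else "not linked")
```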
3.3 Network Visualization Component To map a criminal network onto a two-dimensional display, we employed MDS to generate x-y coordinates for each member in a network. We chose Torgerson’s classical metric MDS algorithm [22] since distances transformed from co-occurrence weights were quantitative data. A graphical user interface was provided to visualize criminal networks. Figure 2 shows the screenshot of our prototype system. In this example, each node was labeled with the name of the criminal it represented. Criminal names were scrubbed for data confidentiality. A straight line connecting two nodes indicated that two corresponding criminals committed crimes together and thus were related. To find subgroups and interaction patterns between groups, a user could adjust the “level of abstraction” slider at the bottom of the panel. A high level of abstraction corresponded with a high distance level in the cluster hierarchy. Group members’ rankings in centrality are listed in a table.
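Torgerson's classical metric MDS has a compact closed form: double-center the squared distance matrix and read coordinates off the leading eigenvectors. A minimal sketch follows (assuming a symmetric distance matrix like the one derived above; this is not the prototype's code):

```python
import numpy as np

def classical_mds(D, dims=2):
    """Torgerson's classical metric MDS.

    Given an n x n distance matrix D, return n x dims coordinates
    whose Euclidean distances approximate D.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale           # x-y coordinates when dims=2
```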
Fig. 2. A prototype system for criminal network analysis and visualization. (a) A 57-member criminal network; each node is labeled using the name of the criminal it represents, and lines represent the relationships between criminals. (b) The reduced structure of the network; each circle represents one subgroup, labeled by its leader's name, with the size of the circle proportional to the number of criminals in the group; a line represents a relationship between two groups, and its thickness represents the strength of the relationship; centrality rankings of members in the biggest group are listed in a table at the right-hand side. (c) The inner structure of the biggest group (the relationships between group members).
4 Case Study
In order to examine our system's ability to reveal structural patterns from criminal networks, we conducted a case study at the Tucson Police Department (TPD). The study was intended to answer the following research questions:
Can structural analysis approaches correctly detect subgroups in criminal networks?
Can structural analysis approaches correctly identify central members in criminal networks?
Can structural analysis approaches correctly identify interaction patterns between subgroups in criminal networks?
Can structural analysis approaches help extract the overall structure of a criminal network?
In this study, we focused on two types of networks, gang and narcotics, both of which involved organized crime. For each network, the Gang Unit at the TPD provided a list of names of active criminals. We extracted from the TPD database all crime incidents in which these criminals had been involved, and we created two networks.
4.1 Data Preparation
The gang network. The list of gang members consisted of 16 offenders who had been under investigation in the first quarter of 2002. These gang members had been involved in 72 crime incidents of various types (e.g., theft, burglary, aggravated assault, drug offense, etc.) since 1985. We used the concept space approach and generated links between criminals who had committed crimes together, resulting in a network of 164 members (Figure 3a).
The narcotics network (the "Meth World"). The list for the narcotics network consisted of 71 criminal names. A sergeant from the Gang Unit had been studying the activities of these criminals since 1995. Because most of them had committed crimes related to methamphetamines, the sergeant called this network the "Meth World." These offenders had been involved in 1,206 incidents since 1983. A network of 744 members was generated (Figure 3b).
Fig. 3. The gang and narcotics networks: (a) the 164-member gang network; (b) the 744-member narcotics network.
These two networks were analyzed using our prototype system. Several crime investigators including the sergeant and one detective from the Gang Unit and two detectives from the Information Section validated our results. 4.2 Result Validation The study was divided into two sessions. During each session, the crime investigators examined one network and evaluated the structural patterns discovered from it. Both sessions were tape-recorded and the results were summarized as follows.
Detection of Subgroups. Since our system could partition a network into subgroups at different levels of granularity, we selected the partition that the crime investigators considered to be closest to their knowledge of the network organizations. The result showed that our system could detect subgroups from a network correctly:
Subgroups could be detected correctly using cluster analysis. Two major subgroups together with several small subgroups were found in the 164-member gang network based on the clustering results (Figure 4a). The bigger subgroup (solid circle) consisted of 99 members and the smaller subgroup (dashed circle) consisted of 24 members. In the narcotics network, because of the large network size, no obvious subgroups except for four cliques could originally be seen (Figure 3b). After clustering, however, two subgroups became very obvious, with the bigger one (solid circle) consisting of 397 members and the smaller one (dashed circle) consisting of 331 members (Figure 4b). Moreover, the crime investigators verified that partitions within each of the subgroups were also correct.
Fig. 4. Subgroups detected from the networks: (a) subgroups in the gang network; (b) subgroups in the narcotics network.
Subgroups detected had different characteristics. It turned out that the subgroups found were consistent with their members' characteristics, specializations, or responsibilities in the networks. In the gang network (Figure 4a), the subgroup represented by a solid circle was identified as a set of white gang members who often were involved in murders, shootings, and aggravated assaults. "These are people who always create a lot of trouble," the sergeant said. The subgroup represented by a dashed circle, on the other hand, consisted of many white gang members who specialized in sale of crack cocaine. The subgroup represented by a small dotted circle was a set of black gang members who were quite separate from the whole network. The two subgroups
(solid and dashed) in Figure 4b, similarly, corresponded with two criminal enterprises led by different leaders. Moreover, each subgroup could be further broken down into smaller subgroups that might be responsible for different tasks. For example, Figure 5a presents the subgroups within one of the criminal enterprises in the narcotics network. The group in the solid circle was responsible for stealing, counterfeiting, and cashing checks and providing money to other groups to carry out drug transactions. The group in the dashed circle, on the other hand, consisted of many drug dealers.
Fig. 5. Subgroup characteristics and relationships: (a) subgroups with different responsibilities; (b) relationships between group members.
Incident-based relationships reflected other types of associations between group members. Two group members might have been related because they came from the same family, went to the same school, spent time together in prison, etc. Figure 5b, for example, presents connections among 24 members of the crack cocaine group in the gang network. Member 87 was member 173’s girlfriend (connected by a solid line) who often brought female dancers to purchase crack cocaine. In the narcotics network in Figure 4b, members of the dashed circle were former schoolmates. As the sergeant commented, “They knew each other in high school and at that time they were juvenile gang members. Then they got involved in methamphetamines.” Long-time relationships between group members showed a high frequency of committing crimes together, and high relational strength was captured by high co-occurrence weight. Identification of Central Members. We interpreted the highest degree score as an indicator for a leader, the highest betweenness score as an indicator for a gatekeeper, and the one with the lowest closeness (the least likely to be a central member) as an outlier. The crime investigators evaluated central members identified from six subgroups at different granularity levels in both gang network and narcotics network. The
results showed that although the system could identify important members in a subgroup, it could not necessarily identify a true leader.
A member who scored the highest in degree might not necessarily be a leader. On one hand, offenders with high degree often were those who had had frequent police contacts. Such offenders may play active roles in leading a group. Three out of six leaders were identified as true leaders in their subgroups. For example, in the crack cocaine subgroup shown in Figure 5b, member 173 had the largest number of connections with other group members. This person had a lot of money, was able to buy and sell drugs frequently, and provided his house for drug transactions. As mentioned in the previous section, his girlfriend also helped bring in more people to purchase drugs. Similarly, the member with the highest degree in the murderers group (solid circle in Figure 4a) was also identified as the leader in the group. On the other hand, a high degree could not always be interpreted as an indicator of leadership for two reasons. First, in a criminal enterprise, the leader may hide behind other offenders and keep frequency of activities low by using other people to do tasks. "Especially, when they got out of prison they tended to be smarter and more educated and thus were more careful to avoid police contacts," the sergeant commented. In Figure 6, for example, member 501 (labeled with a star) was the true leader of one subgroup from the narcotics network. However, he did not score the highest in degree in this group because he actually used other group members (along the dashed path) to sell methamphetamines for him. Second, current police databases did not capture leadership data about criminal enterprises. A crime investigator had no way to tell which group member was the leader unless he/she obtained such information from interrogation or other sources. Three out of six leaders evaluated were not the true leaders of their groups. Therefore, the degree measure should be interpreted carefully.
A member who scored highest in betweenness was a gatekeeper. Our crime investigators verified that all of the six gatekeepers were correctly identified from their subgroups. These gatekeepers played important roles in maintaining the flow of money, drugs, or other illicit goods in their networks. Although not identified as a leader based on degree measure, member 501 (labeled with a star) in Figure 6a was correctly identified as a gatekeeper because he controlled and managed the flow of money and drugs in his group. The star in Figure 6b represented a gatekeeper in that group because she was responsible for cashing stolen or counterfeit checks and redistributing money to other group members. The other four gatekeepers evaluated were offenders who often rode bicycles to sell drugs on the street. "Such gatekeepers were quite important to the operation of their criminal enterprises," a detective from the Gang Unit said.
Fig. 6. Central members in subgroups: (a) a group leader without the highest degree; (b) a gatekeeper.
An outlier who scored the lowest in closeness might play an important role in a network. No detailed evaluation was conducted on outliers because of the long time spent on the discussion of leader and gatekeeper roles in both validation sessions. Our crime
investigators only mentioned that it was possible that an outlier might be a true leader who stayed away from the rest of his group but actually controlled the whole group. No specific example was given, however. Identification of Interaction Patterns between Subgroups. Our crime investigators evaluated a set of between-group interaction patterns including interactions among three groups (solid, dashed, and dotted) in the gang network (Figure 4a), interactions between two major groups (solid circle and dashed circle) in the narcotics network (Figure 4b), and those between the solid and dashed groups in Figure 5a. The results showed that patterns identified using blockmodel analysis reflected the truth about interactions between criminal groups correctly. Frequency of interaction (represented by thickness of lines) between subgroups was a correct indicator of the strength of between-group relationship. In Figure 4a, for example, the blockmodeling result revealed a strong link between the murderers’ group (solid circle) and the crack cocaine group (dashed circle). When asked whether this interaction pattern was accurate, the sergeant answered: “Sure. These guys often hang together. The leaders of these two groups are best friends.” Moreover, interaction patterns might also represent flows of money and goods between groups. In Figure 5a, money and drugs flowed frequently between the dashed group (for drug sales) and the solid group (for check washing and cashing). Interaction patterns between groups might also represent problems or hatred. Frequent interactions between the two major groups in the narcotics network (Figure 4b) resulted not only from their group members’ switching back and forth but also from
problems between the two groups, whose leaders had been at odds for a long time. Their subordinates often ran into shootings and fights. Interaction patterns identified could help reveal relationships that previously had been overlooked. During the evaluation of the gang network (Figure 4a), the sergeant noticed that there was a line (dotted) connecting the murderers’ group (solid circle) and the black gang group (dotted circle): “I have never seen these black gang members having any connection with those white gang members”. When referring back to the original network in Figure 3a, we found a link (dotted line) between one member from the black group and a member from the murderers’ group. According to the sergeant, identifying such a connection would be very helpful for developing investigative leads. Extraction of Overall Network Structures. According to our crime investigators, gang and narcotics enterprises usually differed in structure: gang enterprises tended to be more centralized and narcotics organizations tended to be more decentralized. In order to assess our system’s abilities to reveal such structural differences, we extracted two datasets from the TPD database: (a) incident summaries of narcotics crimes from January 2000 to May 2002, and (b) incident summaries of gang-related crimes from January 1995 to May 2002. We selected four gang networks and nine narcotics networks from our datasets. Sizes of these networks ranged from 21 to 100. Other networks generated from our datasets were either too small or too large and were not analyzed. We found that the blockmodeling function in our system did reveal distinguishing structural patterns of the two types of criminal enterprises: Two out of four gang networks under study had a star structure similar to that presented in Figure 2. The third network was a chain of stars and the fourth had a star structure with some of its branches being a smaller star or a clique (Figure 7a-b). All nine narcotics networks had a chain structure (Figure 7c-d). Three of these networks were chains of stars. One network had a circle in the middle of the chain. 4.3 Usefulness of System All our crime investigators provided very positive comments on our system. They believed that the system could be very useful for extracting structural network patterns and discovering knowledge about criminal enterprises. In particular, our system could help them in the following ways: Saving investigation time. The sergeant and his assistants had obtained knowledge about the gang and narcotics organizations during several years of work. Using information gathered from a large number of arrests and interviews, he had built the networks incrementally by linking new criminals to known gangs in the network and then studied the organization of these networks. Because there was no structural analysis tool available, he did all this work by hand. With the help of our system, he expected substantial time could be saved in network creation and structural analysis.
Fig. 7. Overall structures of criminal networks: (a) a 51-member gang network; (b) the star structure found in the gang network; (c) a 60-member narcotics network; (d) the chain structure found in the narcotics network.
Saving training time for new investigators. New investigators who did not have sufficient knowledge of criminal organizations could use the system to grasp the essence of a network and its crime history quickly. They would not have to spend a significant amount of time studying hundreds of incident reports. Suggesting investigative leads that might otherwise be overlooked. For example, the link between the black gang group and the white murderers' group in the gang network had been overlooked and could have suggested useful investigative leads. Helping prove the guilt of criminals in court. The relationships discovered between individual criminals and criminal groups would be helpful for proving guilt when presented in court for prosecution.
In summary, the structural analysis approaches we proposed showed promise for extracting important patterns from criminal networks. Specifically, subgroups, central members, and interaction patterns among subgroups usually could be identified correctly using hierarchical clustering, centrality measures, and blockmodeling functionality.
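To make the blockmodeling step concrete, the following is a minimal sketch of how the between-group interaction strengths validated above could be computed, assuming the network is given as a weighted adjacency matrix with a known subgroup partition. The function name and the density convention are illustrative choices, not details of the authors' system.

```python
import numpy as np

def block_density(W, labels):
    """Aggregate a weighted adjacency matrix into between-group densities.

    W      : (n, n) symmetric matrix of relational weights (e.g., co-occurrence).
    labels : length-n array assigning each member to a subgroup.
    Returns a (k, k) matrix whose (i, j) entry is the mean link weight
    between subgroup i and subgroup j.
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    D = np.zeros((len(groups), len(groups)))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            block = W[np.ix_(labels == gi, labels == gj)]
            if gi == gj:
                n = block.shape[0]
                # mean link weight within a group, excluding self-links
                D[i, j] = (block.sum() - np.trace(block)) / max(n * (n - 1), 1)
            else:
                D[i, j] = block.mean()  # mean link weight between the groups
    return D

# Toy example: two 3-member subgroups joined by one moderately strong tie.
W = np.zeros((6, 6))
W[0, 1] = W[1, 0] = W[3, 4] = W[4, 3] = 1.0   # within-group co-occurrence
W[2, 5] = W[5, 2] = 0.8                        # between-group link
print(block_density(W, np.array([0, 0, 0, 1, 1, 1])))
```

The resulting block-density matrix corresponds to the between-group links whose thickness the investigators evaluated: larger entries indicate stronger between-group relationships.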
5 Conclusions and Future Work

Criminal network knowledge has important implications for crime investigation and national security. In this paper we have proposed a set of approaches that help extract structural network patterns automatically from large volumes of data. These techniques include the concept space approach for network creation, hierarchical clustering methods for network partitioning, and social network analysis methods for structural analysis; MDS was used to visualize a criminal network and its structural patterns. We conducted a case study with crime investigators from the TPD to validate the structural patterns of gang and narcotics criminal enterprises. The results were quite encouraging: the approaches we proposed could detect subgroups, central members, and between-group interaction patterns correctly most of the time. Moreover, our system could extract the overall structure of a network, which might help in the development of effective disruptive strategies for criminal networks. We plan to continue our criminal network analysis research in the following directions:

Allowing investigators to edit a network by adding, deleting, and modifying nodes and links. Networks created using our system were based entirely on incident data. Other important information collected from multiple sources about network members, relationships between members, and member roles would help provide a more complete picture of a criminal enterprise. In particular, knowledge about group leaders that cannot be obtained from the incident data in typical police databases should be added to the network representation to avoid misleading interpretations of the degree measure.

Including entity types other than persons. Criminal networks in our current studies were limited to the person entity type. Criminals' connections with other types of entities, such as locations, weapons, and property, could also be useful. In the "Meth World", for example, drug offenders often used a specific hotel to carry out transactions. Examining the frequencies of hotel addresses associated with a set of narcotics crimes could help in understanding the operation of a narcotics organization and in predicting future crimes.

Studying temporal and cross-regional patterns of criminal networks. Over time, criminal networks can change in size, organization, structure, member roles, and many other characteristics. The "Meth World" in Tucson had expanded from a network of no more than 150 members in 1995 to one with more than 700 members in 2002. Members and their roles in the network had also changed considerably over those eight years: some old members left the network because of arrest or death; new members were attracted into the network in search of profit; more powerful
new leaders might have replaced old leaders; and so on. It would be interesting to study how a criminal network evolves over time. Should a certain temporal pattern be discovered, it would be helpful in predicting the trend and operation of a criminal enterprise. Moreover, a criminal enterprise can expand across several regions or nations; the "Meth World" was initially confined to Tucson and was later connected with criminals from Phoenix, California, and Mexico. Cross-regional analysis could therefore be used to study criminal enterprises on a large scale and could have significant value for combating terrorism. At the same time, we will continue to develop more techniques to further advance research on criminal networks.
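The conclusions mention MDS as the step that turns network distances into a two-dimensional layout. The snippet below is a minimal sketch of that idea using scikit-learn's MDS on a precomputed dissimilarity matrix; converting co-occurrence weights to distances as 1 − w is an assumed convention, not the authors' exact procedure.

```python
import numpy as np
from sklearn.manifold import MDS

# Illustrative co-occurrence weights among three criminals (symmetric).
weights = np.array([[0.0, 0.9, 0.1],
                    [0.9, 0.0, 0.4],
                    [0.1, 0.4, 0.0]])
dissim = 1.0 - weights              # strong association -> short distance
np.fill_diagonal(dissim, 0.0)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)  # 2-D positions for drawing the network
print(coords)
```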
Acknowledgement. This project has been funded primarily by the National Science Foundation (NSF), Digital Government Program, "COPLINK Center: Information and Knowledge Management for Law Enforcement," #9983304, July 2000 – June 2003, and by the NSF Knowledge Discovery and Dissemination (KDD) Initiative. Special thanks go to Dr. Ronald Breiger from the Department of Sociology at the University of Arizona for his kind help with the initial design of the research framework. We would also like to thank the following people for their support and assistance during the entire project development and evaluation processes: Dr. Daniel Zeng, Michael Chau, and other members of the University of Arizona Artificial Intelligence Lab. We also appreciate the important analytical comments and suggestions from personnel of the Tucson Police Department: Lieutenant Jennifer Schroeder, Sergeant Mark Nizbet of the Gang Unit, Detective Tim Petersen, and others.
Addressing the Homeland Security Problem: A Collaborative Decision-Making Framework

T.S. Raghu¹, R. Ramesh², and Andrew B. Whinston³

¹ W. P. Carey School of Business, Arizona State University, Tempe, AZ 85287
[email protected]
² Department of Management Science & Systems, School of Management, State University of New York at Buffalo, Buffalo, NY 14260
[email protected]
³ Department of Management Science & Information Systems, University of Texas at Austin, Austin, TX 78712
[email protected]

Abstract. A key underlying problem intelligence agencies face in effectively combating threats to homeland security is the diversity and volume of information that needs to be disseminated, analyzed and acted upon. This problem is further exacerbated by the multitude of agencies involved in the decision-making process. Thus the decision-making processes faced by the intelligence agencies are characterized by group deliberations that are highly ill-structured and yield limited analytical tractability. In this context, a collaborative approach to providing cognitive support to decision makers using a connectionist modeling approach is proposed. The connectionist modeling of such decision scenarios offers several unique and significant advantages in developing systems to support collaborative discussions. Several inference rules are proposed for augmenting the argument network and for capturing implicit notions in arguments. We further explore the effects of incorporating notions of information-source reliability within arguments.
1 Introduction A key underlying problem intelligence agencies face in effectively combating threats to homeland security is the diversity and volume of information that needs to be disseminated, analyzed and acted upon. The Office of Management and Budget (OMB) lists about 100 different federal government categories that are funded specifically to carry out anti-terrorism tasks.¹ This obviously excludes state and local government agencies that are often involved in anti-terrorism operations. Given the diversity of agencies and the diversity of information sources, it is quite clear that decision-making tasks related to homeland security are highly decentralized. Effective sharing, dissemination and assimilation of information is key to a successful homeland security
¹ Office of Management and Budget, Annual Report to Congress on Combating Terrorism, available at www.whitehouse.gov/omb/legislative/nsd_annual_report2001.pdf (Oct. 2001), pages 89–100.
strategy. In this paper, a collaborative decision-making framework is proposed as a key enabler of a distributed information and decision-making backbone for homeland security. When acting upon and integrating intelligence information from several sources, decision makers have to consider and debate various possible decision and security alternatives. In general, most such decision problems are highly ill-structured and yield limited mathematical tractability. Consequently, such decision issues have to be resolved through discussions, where argumentative logic and persuasive presentation are critical. Conventional decision modeling tools may not be able to solve the decision issues as a whole, although they may be used to generate argumentative logic while discussing some of them. In a collaborative decision-making process, the group members assume positions, which could be claims, endorsements, or oppositions to other claims. These positions could be assumed with or without supporting arguments or evidential data gathered from various intelligence sources (summarized at various levels). The sequence of challenges and responses typically follows an evolutionary path until the decision issues are resolved. In this process, the argument logic and the supporting/contradicting evidential data could grow significantly in both size and complexity, placing a substantial cognitive load on the decision makers. The primary objective of this research is therefore to develop pragmatic and efficient support tools to ease this cognitive burden, focus the group on critical security-related issues, and guide creative positional and argument strategy development throughout the discussion.

The Collaborative Decision-Making (CDM) framework in Figure 1 presents a broad architectural view of a CDM system for homeland security. The CDM system comprises four broad components: knowledge repositories, group facilitation and coordination, discussion strategy support, and dialectic decision support. These perspectives share some requirements on basic system components as backbone services; we have identified many of these components in our framework. A brief description of these perspectives is given below.

Knowledge Repositories. Given the diversity of federal, state and local agencies involved in intelligence gathering and decision-making processes, a unifying, semantically developed structure to represent intelligence knowledge and information is the first key requirement. The volume of data and the diversity of cultures, languages, and vocabularies exacerbate the complexity of knowledge storage and retrieval. To facilitate communication among geographically, culturally, and/or technically diverse populations of people and systems, it is imperative to develop unified knowledge and data repositories. In this context, it is important to build domain ontologies and taxonomies, which will play a key role in shaping collaborative decision-support systems for homeland security.

Group Facilitation and Coordination. Providing system support for enabling distributed teams to coordinate has been studied extensively in the literature. Under this perspective, recent trends in the areas of group support systems, collaborative filtering, and Computer Supported Cooperative Work (CSCW) are the key technological components of a CDM system.
Fig. 1. A Collaborative Decision Support System Framework for Homeland Security
Discussion Strategy Support and Dialectic Support. These two aspects of CDM are perhaps the least understood. Considerable research in the areas of argumentation analysis, natural language processing, and structured knowledge interchange has taken place over the past few years; however, the application of these fundamental areas to collaborative decision-making has been scarce if not non-existent. A substantial portion of this paper will delve into how semi-structured information from several sources can be meaningfully analyzed. Arguments and positions enunciated by decision-makers are enhanced through simple inference procedures, and argument coherence and dialectical assessments are carried out through connectionist procedures.
The organization of the paper is as follows. Section 2 discusses the foundations of this research. Section 3 presents the argument structure and the connectionist network formalism, Section 4 develops the inference rules, Section 5 presents a computational analysis and an illustrative example, Section 6 examines the effects of information and argument reliability, and Section 7 presents our concluding remarks.
2 Research Foundations

Intelligence communities involved in homeland security tasks represent a very complex global virtual organization. The underlying context of this domain is the geographical distribution of the strategic, tactical and operational communities and their activities over the globe. The key to achieving success and breakthroughs in homeland security lies in effective team communication, creative conflict management, sustained coordination of team efforts, and continuity in collaboration, all ensured within a structured collaborative decision environment. Although the road to achieving the full potential of such teamwork is filled with challenges, both organizational and technical, advanced information technology can be used in novel ways, not even conceivable until recently, to facilitate effective collaboration. The current research is envisioned as an important milestone in this direction. We identify information filtering as the first key challenge that needs to be overcome for effective decision-making. The objective here should be to filter the vast information base so that relevant and important intelligence information is accessible quickly to key decision makers. Most current filtering systems provide minimal means to classify documents and data. A common criticism of these systems is their extreme focus on information storage and their failure to capture the underlying meta-information. As a consequence, the concept of knowledge ontology has emerged, with a view to creating domain-level context that enables users to attach rich domain-specific semantic information and additional annotations to intelligence information and documents, and to employ the meta-information for information retrieval. Once information storage is augmented with a knowledge ontology, it becomes easier to provide structured mechanisms for communication wherein decision makers can communicate over distributed systems. Structured communication makes it possible to capture the knowledge of the intelligence community in easily accessible discussion archives. The underlying structure of the discussion archives would enable the provision of additional collaborative decision support to intelligence personnel. Thus, we draw upon the literature on knowledge ontology and the theory of argumentation as the theoretical bases for this research. A formal ontology characterizes knowledge by providing a framework that binds contextual elements with the relationships that link them within the ontology, as well as the relationships with other units of knowledge [5]. The knowledge ontology consists of a conceptual model, a thesaurus, and a set of expanded attributes and axioms. Its concern is the appropriate representation of content, which may later be augmented with a mechanistic formalism such as UML (Unified Modeling Language), RDF (Resource Description Framework), BNF (Backus Naur Form), or formal logic [19]. The main challenge that agencies involved in homeland security face is the volume and number of different information sources that would potentially feed useful and useable information into the CDM system. For instance, the key targets that need protection include large buildings, sports arenas, nuclear facilities, airports, trains and sub-
ways, and national symbols in over 200 cities [1]. Clearly, operational and intelligence information pertaining to these key targets will vary in format, content and context. It is therefore imperative to impose uniform semantic structures where possible and to define contextual meta-data on other sources of information to enable the dissemination of information across federal, state and local agencies. The main contribution of this research is to demonstrate that further decision support functionalities can be embedded in a CDM system that leverages the meta-information framework of a domain knowledge ontology. This would help decision makers better utilize the volumes of information collected through various sources. The basis for collaborative decision support in our system comes from argumentation theory. The logic of argumentation can be studied in terms of its two, rather classical, elements: structure and content. The two components have a symbiotic relationship in the sense that the informational content of an argument needs a logical structure for its coherence and significance. Connectionist modeling provides a way to capture both elements in a single framework [3,4]. Several works deal primarily with representation formalisms and heuristics for argument analysis, interpretation and outcome prediction [7,9,12,13,15,18]. Given the diffuse nature of intelligence information and the uncertainties associated with the information sources, it would be difficult for any system to provide discrete decisions on security issues. Our approach is to move towards a system of argument analysis in which one is not necessarily constrained to resolving argumentation into discrete categories [16,17]. Using binary categories as a basis for rejecting or accepting arguments prevents one from assessing the relative strengths of the arguments. While connectionist models do not have the strong theoretical underpinnings of logic-based defeasible graphs [6,8], using connectionist models for this purpose has many advantages over methods that utilize simple binary categories of acceptance and rejection [11]. Connectionist modeling achieves better sensitivity in argument assessment by indicating the degree of acceptance or rejection of arguments [14,16]. In addition, one can assign different weights to the arcs connecting the different units in the model. This enables one to capture not only the relations between units but also the strengths of those relations. The basic computational details of the connectionist architecture are described in [14]. Briefly, arguments in a discussion are structured into basic, atomic-level information units along with their logical and other human-intended relationships. The basic informational units are represented as units (the term used for network nodes in the connectionist literature) and their relationships as the arcs in a network formalism for argument logic. The dialectical power of an argument is an indicator of its strength or validity, and is measured by the activation level of the unit representing the final thesis of the argument at asymptotic convergence. For example, the final thesis of an argument can be that there is an imminent threat to a key national monument in the near future. An argument derives its dialectical power from the logical coherence inherent in its structure and from the support it derives from its evidence.
The evidence could be observed facts, intelligence information, and previous incidents, or conclusions derived from other claims and arguments. The structure and content of the supporting as well as opposing logic behind an argument together determine its dialectical power. The dialectical power of the various positions in a collaborative discussion is a very useful evaluative feedback to the decision makers. This measure identifies the relative strengths and weaknesses of the positions, and points to whether a discussion is
moving towards a resolution or not. Consequently, it can be used to focus a group on critical security flaws, to reexamine security measures if necessary, and to develop strategies to address future threats. Further, the connectionist paradigm can also be used to derive assessments selectively on subsets of a large argument network, or on higher-level meta-networks derived by aggregating argument sets from a basic network into meta-units and meta-arcs. Thus the proposed model can provide selective local views of a comprehensive discussion as well as condensed global perspectives on an entire discussion. The dialectical support functionality can provide comprehensive and dynamic monitoring/guidance systems for collaborative discussions on intranets.
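As a rough illustration of the meta-network idea, the sketch below collapses a base-level argument network into meta-units under a given grouping. Summing the arc weights across groups is an assumed convention for this sketch, not a rule specified in the paper.

```python
from collections import defaultdict

def aggregate(arcs, grouping):
    """Collapse an argument network into a meta-network.

    arcs     : iterable of (src, dst, weight) at the base level.
    grouping : dict mapping each base unit to its meta-unit label.
    Arc weights between meta-units are summed; within-group arcs drop out.
    """
    meta = defaultdict(float)
    for src, dst, w in arcs:
        gs, gd = grouping[src], grouping[dst]
        if gs != gd:
            meta[(gs, gd)] += w
    return dict(meta)

# Toy use: two claims and their evidence collapsed into per-team meta-units.
arcs = [('d1', 'thesis', 0.3), ('f1', 'd1', -0.4), ('f2', 'thesis', -0.4)]
grouping = {'thesis': 'Dteam', 'd1': 'Dteam', 'f1': 'Fteam', 'f2': 'Fteam'}
print(aggregate(arcs, grouping))   # {('Fteam', 'Dteam'): -0.8}
```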
3 Argument Structure and Connectionism

3.1 Argument Structure

The basic formalism for our connectionist approach is available in [14]; we briefly describe the argumentation formalism here and refer the reader to [14] for a detailed discussion. The treatment of inference rules and the incorporation of information-source reliability are additional contributions of this paper. Let Γ denote the group of individuals in a collaborative discussion, and let Δ denote the argument structure representing the various positions, facts, and their interrelationships generated in the discussion. Clearly, Δ is a temporal entity, evolving and changing over time as the discussion proceeds. The structure Δ is basically a collection of assertions made by the individuals in the group, indicated as Δ = {A | A is an assertion}. An assertion A is of two types: positions and inferences. A statement of position is a claim, and is assumed to be a well-formed sentence. A statement of inference is a structural relationship among a set of positions and facts. We formalize the structure of these assertion types as follows. Let L denote a language from which the structure Δ is constructed. The language L is a triple ⟨Σ, R, Q⟩, where Σ constitutes the sentences, R is a set of assertion constructs built over sentences, and Q is a set of assertion qualifications. Σ provides the basis for the construction of positions and statements of fact and is composed of defeasible sentences (σ_d) and factual sentences (σ_F). A factual statement is any evidential data that is commonly accepted by the group, while the positions are the subject of discussion. R provides the basis for the construction of positional and inferential assertions: it enables the construction of positional assertions from sentences obtained from Σ, as well as inferential structures over other assertions. Q provides the basis for qualifying an argument as strict or defeasible. While a defeasible argument is subject to debate and possibly defeat, a strict argument is a logical inference that will not be questioned by anyone in the group. R provides two constructs, <support> and <oppose>, to build inferential structures among positions and facts. Q provides two constructs, <strict> and <defeasible>, to qualify assertions. As a result, a combination of these constructs yields the following qualified inferences: <strict support>, <defeasible support>, <strict opposition>, and <defeasible opposition>.
We indicate these qualified inferences as ⇒+, →+, ⇒−, and →−, respectively, in our structural formalism. Finally, we associate each assertion A with its proponents through a signature set s(A), which consists of all the individuals who subscribe to A.

3.2 The Connectionist Formalism

We define a connectionist network in graph-theoretic terms as follows.

Definition 1 (Connectionist Network). A connectionist network is a 4-tuple S = ⟨N, a_N, A, W_A⟩, where N denotes the set of connectionist units; a_N: N → ℝ is a real-valued activation function that maps each element of N to a real number; A is the set of directed arcs; and W_A: A → ℝ is a real-valued arc-weight function that maps each arc in A to a real number.

Given the above definition of a connectionist network, we can define an argument network as a mapping of the argumentation system Δ to a connectionist network S as follows.

Definition 2 (Argument Network). An argument network is a mapping of an argumentation system Δ to a connectionist network S: the sentences of Σ are mapped to units in N, and the qualified inferences built from R and Q are mapped to arcs in A with weights W_A.

We represent the sentences (σ) of the sentence structure (Σ) of an argument as units in the argument network. Two types of units are defined, corresponding to the type of sentence they represent: defeasible sentences (σ_d) are represented as defeasible units (N_d) and factual sentences (σ_F) as factual units (N_F) in the network. In addition, logical connectives (such as ∧, ∨) are represented using logical units (N_L). Factual sentences and defeasible sentences differ in their activation-level assignments in the argument network; similarly, logical units express their behavior through their specific activation-level assignments. We formalize this as follows. Define a_N = {a_N^d, a_N^F, a_N^L}, where a_N^d is the activation function for N_d, a_N^F is the activation function for N_F, and a_N^L is the activation function for N_L. Each qualification type from the qualification structure Q, namely defeasible support, defeasible opposition, strict support, and strict opposition, is mapped to a corresponding arc type in the connectionist network. The set of directed arcs is defined as A = {A^{d+}, A^{d−}, A^{s+}, A^{s−}}, where A^{d+} represents defeasible support, A^{d−} defeasible opposition, A^{s+} strict support, and A^{s−} strict opposition. Each arc type may have a different weight-assignment function; we define W_A = {W_A^{d+}, W_A^{d−}, W_A^{s+}, W_A^{s−}}, where W_A^{d+} is the arc-weight function for A^{d+}, W_A^{d−} for A^{d−}, W_A^{s+} for A^{s+}, and W_A^{s−} for A^{s−}. The following section presents extensions to the basic argumentation structure for enhancing inference capabilities.
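To make the formalism concrete, the sketch below encodes a small argument network and iterates activations to asymptotic convergence. The update rule is a Thagard-style scheme of the kind this formalism builds on [14,16]; the decay rate, the clamping of factual units near +1, and the omission of logical units are simplifying assumptions for illustration, not specifics given in this paper. The same run_network sketch is reused in later examples.

```python
import numpy as np

# Unit types: 'd' (defeasible), 'F' (factual). Arcs are (src, dst, weight);
# positive weight = support, negative = opposition (strict arcs would simply
# carry larger magnitudes than defeasible ones).

def run_network(units, arcs, decay=0.05, clamp=0.9, tol=1e-4, max_iter=500):
    """Iterate unit activations to convergence (Thagard-style update)."""
    names = list(units)
    idx = {u: i for i, u in enumerate(names)}
    W = np.zeros((len(names), len(names)))
    for src, dst, w in arcs:
        W[idx[dst], idx[src]] = w            # row = receiving unit
    a = np.zeros(len(names))
    for u, utype in units.items():
        if utype == 'F':
            a[idx[u]] = clamp                # facts start (and stay) near +1
    for _ in range(max_iter):
        net = W @ a
        # push toward +1 on net support, toward -1 on net opposition
        new = a * (1 - decay) + np.where(net > 0, net * (1 - a), net * (a + 1))
        for u, utype in units.items():       # re-clamp factual units
            if utype == 'F':
                new[idx[u]] = clamp
        if np.max(np.abs(new - a)) < tol:
            break
        a = new
    return dict(zip(names, a))

# Tiny example: a claim supported by one fact and opposed by another.
units = {'claim': 'd', 'f1': 'F', 'f2': 'F'}
arcs = [('f1', 'claim', 0.3), ('f2', 'claim', -0.4)]
print(run_network(units, arcs))   # the claim settles below zero
```

The dialectical power of a thesis is then read off as its activation level at convergence.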
4 Inference Rules

In first-order logic, inference is accomplished by using a set of axioms and modus ponens. Given that first-order inference is limited to two discrete truth values (true, false), it is not easily extensible to other forms of logic, and it allows only a very restrictive set of inference rules. For instance, if A implies B and A is false, no inference can be made about the value of B. In the connectionist framework, by contrast, "A supports B" is equivalent to "A opposes ¬B" and "¬A opposes B", and inference on B is permitted even when the value of A is close to −1 (false). Other systems of logic, such as fuzzy logic, have attempted to extend the concept of inference to systems where continuous values have to be inferred [17]. Using a similar approach, we extend the concept of inference in first-order logic to connectionist argument networks. The inference rules are heuristic in nature and exploit the multi-valued nature of the connectionist framework.

The connectionist paradigm developed so far is built on the assertions, facts, and the logical and human-intended relationships that have been explicitly stated by the members of a group in an argumentation process. However, when individuals make statements, several implicit ideas are usually intended. These ideas mostly take the form of implied relationships that lie beneath the labyrinth of explicit assertions, and they are very important to a mutual understanding among the group members. In fact, these unarticulated intentions play a critical role in collaborative decision-making by directing argument strategies and explicit statements from behind the scenes. While the argument network captures all the explicit statements, some, if not all, of the implicit notions can be captured from the structure of the stated arguments. This significantly enriches the ability of the connectionist model to capture the content and structure of a discussion within a computationally supporting framework. Furthermore, the connectionist paradigm can provide a series of views of a discussion by successively incorporating different kinds of implied notions. These views can be extremely useful to a decision maker by facilitating what-if analyses on available intelligence information and the discussions thereof. We propose a set of inference rules to derive implied notions from explicit statements. These inference rules are regarded as the guiding principles of an assessment of human behavior in this model. The inference rules add arcs to an argument network, indicating relationships among the units derived from the structure of the explicitly stated arguments. The process of applying the inference rules has been automated in our implementation, and it is also possible to apply individual inference rules selectively. Although the use of the inferred arcs is not a requirement for a connectionist assessment of a discussion, an inference-based assessment yields several key insights. The inference rules can be used as a means to test the logical consistency among arguments: the activation levels of a unit with and without inferred arcs should be consistent, and any discrepancy would indicate inconsistencies in the argument network and direct a decision maker to examine the arguments further. There is a behavioral rationale for extending the concept of inference to connectionist networks. Human behavior reveals several principles for determining implicit notions in a connectionist framework.
These intentions are captured as arcs among the connectionist units, derived via the inference rules. These arcs represent the synergies and counteractions that arise among the units due to the argument
structure. We organize the inference rules into three categories based on argument structure: transitive sequence, common successor, and common predecessor. We develop the rationale underlying these inference rules and their structural properties in the following discussion.

4.1 Transitive Sequence

The transitive sequence structure involves three defeasible units as follows. Let p, q and r be defeasible units such that arcs (p, q) and (q, r) exist in the argument network. The units p, q and r are said to form a sequence. A sequence is said to possess transitive synergy if: (i) p supports q and q supports r, or (ii) p opposes q and q opposes r. On the other hand, a sequence is said to possess transitive counteraction if: (i) p supports q and q opposes r, or (ii) p opposes q and q supports r. Transitive synergy is captured by introducing a supporting arc from p to r, and transitive counteraction by an opposing arc. The rules based on these ideas are stated as follows.

Inference Rule 1 (Transitive Synergy). Consider a synergy-producing sequence of defeasible units p, q and r. This synergy generates an implied arc (p, r) of type support.

Inference Rule 2 (Transitive Counteraction). Consider a counteraction-producing sequence of defeasible units p, q and r. This counteraction generates an implied arc (p, r) of type oppose.

4.2 Common Successor

The common successor argument structure involves the units structured as follows. Let p, q, and r be defeasible units such that arcs (p, r) and (q, r) exist in the argument network. The unit r is said to be the common successor of p and q. Synergy between propositions p and q arises if: (i) both jointly support r, or (ii) both jointly oppose r. On the other hand, counteraction between p and q arises if one of them supports r and the other opposes r. Synergy is captured in the connectionist model by enhancing the activation levels of p and q through mutually supporting arcs; counteraction is captured by inhibiting the activation levels of p and q through mutually opposing arcs. The inference rules based on these ideas are stated as follows.

Inference Rule 3 (Common Successor Synergy). Consider two defeasible units p and q with a synergy-generating common successor. This synergy generates two implied arcs (p, q) and (q, p) of type support.

Inference Rule 4 (Common Successor Counteraction). Consider two defeasible units p and q with a counteraction-generating common successor. This counteraction generates two implied arcs (p, q) and (q, p) of type oppose.

4.3 Common Predecessor

The common predecessor argument structure is the reverse of the common successor structure. In this case, arcs (r, p) and (r, q) exist in the argument network among the
defeasible units p, q and r. The unit r is the common predecessor of p and q in this structure. Synergy between p and q arises if: (i) both are supported by r, or (ii) both are opposed by r. Similarly, counteraction between p and q arises if one of them is supported and the other opposed by r. Synergy is captured by introducing mutually supporting arcs between p and q, and counteraction by mutually opposing arcs. The inference rules based on these ideas are stated as follows.

Inference Rule 5 (Common Predecessor Synergy). Consider two defeasible units p and q with a synergy-generating common predecessor. This synergy generates two implied arcs (p, q) and (q, p) of type support.

Inference Rule 6 (Common Predecessor Counteraction). Consider two defeasible units p and q with a counteraction-generating common predecessor. This counteraction generates two implied arcs (p, q) and (q, p) of type oppose.

4.4 Argument Consistency

In this section we demonstrate that the application of the inference rules to an argument network can be used to determine argument consistency. We use the definition of argument consistency from [2]: given two sets of arguments P and Q (whose members form the nodes in the argument network), where argument P opposes argument Q, the argument P is said to be inconsistent if a node in P opposes one or more nodes in P.

Example: Suppose there are two facts F1 and F2 and nodes a, b and c in an argument, where {a, b, c} ∈ P. Writing x →+ y for a supporting arc and x →− y for an opposing arc, the argument a →+ c; b →+ c; b →− a is inconsistent.

In the case of a large argument network it becomes difficult to determine argument inconsistency; we now show that the application of the inference rules can aid in this determination.

Definition 3 (Complete Argument Network). An argument network is said to be complete if each node in the network connects to every other node in the network in both directions. Thus, given an argument network with n nodes, the total number of arcs in that network is ∑_{i=1}^{n} 2(i − 1) = n(n − 1).
Proposition: An inference-rule-enhanced argument network is inconsistent if it is not a complete argument network.

Proof: We ignore disjoint networks and consider argument networks with more than two nodes. Assume a consistent argument network in which two nodes p and q are not completely connected (i.e., they do not have bi-directional arcs between them). Then there must exist some node r that, together with p and q, satisfies one of the six inference rules. If the connections among p, q and r satisfy inference rules 3 to 6, bi-directional arcs between p and q result directly. If the connections satisfy the transitive inference rules 1 or 2, we obtain the following. If inference rule 1 applies: consider p →+ q and q →+ r, from which we obtain p →+ r. Application of inference rule 3 to q →+ r and p →+ r then yields bi-directional arcs between p and q.
If inference rule 2 applies: consider p →+ q and q →− r, from which we obtain p →− r. Application of inference rule 3 to q →− r and p →− r (both oppose the common successor r, a synergy) then yields bi-directional arcs between p and q. Computationally, successive application of the inference rules should yield a consistent set of arcs between any two nodes; when an inferred arc conflicts with an existing arc, this indicates an inconsistency in the arguments.

Example: Consider again nodes a, b and c with {a, b, c} ∈ P and the inconsistent argument a →+ c; b →+ c; b →− a. Application of the inference rules yields the following result. Inference Rule 3: from a →+ c and b →+ c, we obtain a →+ b and b →+ a. This conflicts with b →− a in the original argument.
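A compact sketch of how the six inference rules and the consistency check could be mechanized is shown below. The signed-arc representation and the single-pass application are illustrative simplifications, not the authors' implementation.

```python
from itertools import permutations

def infer_arcs(arcs):
    """One pass of the six inference rules over signed arcs.

    An arc is a triple (src, dst, sign) with sign +1 for support and
    -1 for opposition.
    """
    sign = {(s, d): g for s, d, g in arcs}
    nodes = {n for s, d, _ in arcs for n in (s, d)}
    inferred = set()
    for p, q in permutations(nodes, 2):
        for r in nodes - {p, q}:
            if (p, q) in sign and (q, r) in sign:        # rules 1-2 (transitive)
                inferred.add((p, r, sign[p, q] * sign[q, r]))
            if (p, r) in sign and (q, r) in sign:        # rules 3-4 (common successor)
                g = sign[p, r] * sign[q, r]
                inferred.update({(p, q, g), (q, p, g)})
            if (r, p) in sign and (r, q) in sign:        # rules 5-6 (common predecessor)
                g = sign[r, p] * sign[r, q]
                inferred.update({(p, q, g), (q, p, g)})
    return inferred

def inconsistencies(arcs):
    """Node pairs whose inferred sign conflicts with a stated or inferred sign."""
    all_arcs = set(arcs) | infer_arcs(arcs)
    return {(s, d) for s, d, g in all_arcs if (s, d, -g) in all_arcs}

# The paper's example: a supports c, b supports c, b opposes a.
print(inconsistencies([('a', 'c', +1), ('b', 'c', +1), ('b', 'a', -1)]))
```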
5 Computational Analysis

We present extensive computational results from connectionist argument networks on basic argument structures, with and without application of the inference rules. The intent of these computational experiments is to demonstrate that inference in basic connectionist argument structures improves as a result of applying the inference rules. To summarize, the results indicate that the application of inference rules amplifies the bias in the original network, indicating the direction in which the argument is heading. It is apparent from the results that the inference rules can assist in performing sensitivity analysis on the arguments proposed; however, inference rules cannot be used as a means to assess the dialectic support for the propositions in the arguments. We now present the application of the connectionist mechanism and the inference rules to a hypothetical argumentation scenario debating a security threat. The example is intended to illustrate how the model proposed in this paper can be utilized as the underlying mechanism of support for collaborative decisions made by intelligence agencies.

5.1 An Example

The objective of intelligence discussions is to ensure that intelligence personnel use all available information in evaluating threats to homeland security. It is inevitable that in such discussions conflicting positions and claims may emerge that cannot be easily resolved through simple quantitative analyses; they require an elaborate discussion process among the personnel involved in the decision. In the discussion process, intelligence personnel will have to utilize the information sources available to them through the CDM system. In this section, we present a hypothetical scenario in which domestic and foreign intelligence teams discuss the possibility of an imminent threat of a terrorist attack. The factual information used by the discussants is assumed to be retrieved through an ontology-enabled information repository. The domestic intelligence team is assumed to have better knowledge of developments within the country, whereas the foreign intelligence team is assumed to be in a better position to assess the terrorist organization's capabilities in the region where it originates. In the following, the defeasible assertions of the domestic intelligence team are denoted as D^d and those of the foreign intelligence team
are denoted as E^d. The evidential information presented in the debate process is denoted as F. In the interest of space, we summarize the debate process by presenting the claims and arguments made by the two teams in Table 1 and Figure 2. The arrows marked with thick dots are arguments made by the foreign intelligence team; the remaining arrows are arguments made by the domestic intelligence team. We have plotted the activation level of the domestic team's (Dteam) main thesis after each stage of the argumentation process in Figure 3. At each stage, we compute the activation levels for the nodes in the argument network built from the assertions made up to that stage. For instance, after the foreign intelligence team's (Fteam) attack against D1^d and D2^d using facts F1 and F2, the activation level of the domestic team's main thesis drops to −0.8. The domestic team's main thesis is overwhelmingly defeated well before the end of the discussion.

Table 1. Meanings of the network units in the example
D1^d: The terrorist organization has the financial resources necessary to carry out the attack
D2^d: There is an imminent threat of terrorist attack on National Monument A
D3^d: An enemy government may be involved in planning this attack
D4^d: The remaining active cells of the terrorist organization are capable of carrying out the threat
D5^d: The terrorist organization is capable of quickly recruiting new members
E1^d: Improving economic conditions in the countries where it recruits are making it harder for the terrorist organization to recover quickly
F1: Field intelligence data (No. xxx) indicates preparations for a major attack
F2: The main local cell of the terrorist organization was broken up last month
F3: An intelligence report from a friendly country indicates a fallout between the enemy government and the terrorist organization
F4: The main funding source for the terrorist organization was closed last month and all the money in its accounts has been frozen
F5: Field agents reported panic in the rank and file of the terrorist organization as a result of the raids in the U.S. over the last few months
F6: A report by a special task force has documented that previous raids had little impact on the overall effectiveness of the terrorist organization
All the defeasible units of the Dteam's arguments, except D5^d, settle to activation levels between −0.4 and −0.85 (see Table 2). Since a majority of the defeasible assertions are below zero, we conclude that the Dteam's argument is very weak and does not have an adequate factual or other basis for support. Furthermore, the Fteam challenges these assertions with substantial factual support. Note that while only fact F6 lends direct support to the Dteam's position, all other facts are in opposition (except F1, which forms part of a logical assertion and does not provide any direct support). The connectionist model clearly reflects this situation. In the role of an impartial assessor, the model highlights the weaknesses in the Dteam's argument.
Fig. 2. Argument network for the example (units D1^d–D5^d, E1^d, and F1–F6; arc types: strict support, defeasible support, strict opposition, defeasible opposition)
Fig. 3. Activation level of D2^d at each stage of the debate process (alternating Dteam and Fteam stages; the activation falls from near 0.1 to about −0.8)
Based on the above analysis, we can conclude that the Dteam's argument supporting the assertion that there is an imminent threat of a terrorist attack on National Monument A is weak. The Fteam, armed with an array of facts, seems to have a strong argument against the Dteam's claim. We further tested this argument network with the addition of the inference rules. The resulting activation level on the main thesis unit D2^d is −0.9351. Clearly, utilization of the inference rules further reinforces the initial assessment that the threat of a terrorist attack on the national monument is minimal.
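For illustration only, the fragment below replays the flavor of this staged analysis using the run_network sketch from Section 3.2. The units, arcs, and weights are invented stand-ins, not the network of Figure 2.

```python
# Stage-by-stage replay on an invented fragment of the debate
# (assumes run_network from the sketch in Sect. 3.2 is in scope).
units = {'D2d': 'd', 'D1d': 'd', 'F4': 'F', 'F6': 'F'}
stages = [
    [('D1d', 'D2d', 0.3)],    # Dteam: D1d supports the main thesis
    [('F4', 'D1d', -0.4)],    # Fteam: frozen funding opposes D1d
    [('F6', 'D2d', 0.2)],     # Dteam: task-force report supports the thesis
]
arcs = []
for stage in stages:
    arcs += stage
    act = run_network(units, arcs)
    print(f"after {len(arcs)} assertions: D2d activation = {act['D2d']:+.3f}")
```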
Table 2. Final activation levels of the defeasible units in the example

Unit name:        E1^d    D5^d   D3^d   D4^d    D1^d    D1^d ∧ F1   D2^d
Activation level: 0.0001  0.714  −0.83  −0.417  −0.825  −0.825      −0.789
6 Effects of Information and Argument Reliability

While inference rules can aid in assessing the sensitivity of the dialectic power computed using the connectionist mechanism, in several instances one has to contend with other dynamics during argumentation. The variability in the sensitivity and reliability of information and its sources can be quite large. Moreover, it is possible that decision-makers in some agencies may have greater power (and access to classified information). In this section, we illustrate how the connectionist approach can accommodate such variances quite easily, using the example from Section 5. To capture the varying information and expertise levels of the two teams involved, we vary the weights on the arcs proposed by each party as follows. Let w_S^D and w_C^D denote the weights on the support and opposition arcs proposed by the Dteam, respectively. Similarly, let w_S^F and w_C^F denote the arc weights of the Fteam's proposals. In all the experimental runs, we set w_C^D = −w_S^D and w_C^F = −w_S^F. The varying reliance on the two teams' arguments is described using these weights as follows. The parameters w_S^D and w_S^F are each tested at 10 levels, starting from 0.1 and increasing in steps of 0.1 up to 1. The combination of weights (w_S^D, w_S^F) describes the relative reliance placed on the arguments of the two teams in an experimental run. The levels of w_S^D and w_S^F result in (10 × 10) = 100 experimental runs. We now summarize the results of these experiments.

The net effects of argument reliance in two-person argument games on the activation levels of the final thesis units at convergence in the example are summarized in Figures 4 and 5. We denote (w_S^F / w_S^D) as the Fteam/Dteam reliance ratio. The figures show a behavior that is consistent with the earlier analysis of the example. Figure 4 shows that as the Fteam reliance increases for a given Dteam reliance, the activation level of the thesis unit continuously decreases to strongly negative values. However, a revealing feature of the analysis is the following: in order to bring the argument to a neutral outcome, it is necessary to reduce the Fteam's argument reliance to almost zero while holding the Dteam's argument reliance at a significantly high level. This implies that the Fteam's position would have to be almost entirely disregarded for the Dteam's position to be upheld, which clearly highlights the intrinsic weaknesses of the Dteam's arguments. It also implies that argument reliance has minimal effect on the final outcome if an argument is well-knit and well supported. Figure 5 provides a scatter plot of the activation levels on the main thesis under different combinations of the Dteam's and Fteam's argument reliance values.
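The weight sweep itself is straightforward to reproduce in outline. The sketch below grids the two reliance levels and rescales each team's arcs accordingly, again reusing the run_network sketch from Section 3.2 on an invented network fragment.

```python
import numpy as np

# Grid over the two reliance levels; arcs are tagged with the proposing team,
# and each team's arc weights are rescaled by its reliance (w_C = -w_S).
units = {'D2d': 'd', 'D1d': 'd', 'F4': 'F', 'F6': 'F'}
base_arcs = [('D1d', 'D2d', +1.0, 'D'), ('F6', 'D2d', +1.0, 'D'),
             ('F4', 'D1d', -1.0, 'F')]

results = {}
for wD in np.arange(0.1, 1.01, 0.1):
    for wF in np.arange(0.1, 1.01, 0.1):
        arcs = [(s, t, w * (wD if team == 'D' else wF))
                for s, t, w, team in base_arcs]
        results[round(wD, 1), round(wF, 1)] = run_network(units, arcs)['D2d']

print(results[1.0, 0.1], results[0.1, 1.0])  # high vs. low Dteam reliance
```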
Fig. 4. Activation levels of D2^d in the example: effect of the Fteam's argument reliance (0.1 to 1.0) on the Dteam's main thesis, for Dteam reliance levels 0.1, 0.5, and 1.0
Fig. 5. Scatter plot of activation levels for different argument reliance ratios
7 Conclusion In this work, we have presented the conceptual foundations of a collaborative decision-making system as a key information system for decision makers involved in homeland security. The underlying decision issues associated with homeland security are highly ill-structured and yield limited analytical tractability. Human cognition plays a pivotal role in such problem-solving instances, where one critically needs to assimilate the intelligence information presented, understand the implications, and arrive at sound judgments on the positions. We model the anti-terrorism decision tasks as an argumentation process among a group of geographically diverse decision-
makers. The individuals assume positions in proposing claims and solutions to a security issue, and support them with argumentative logic drawn from evidential intelligence information and other propositions. In the argumentation process, decision-makers challenge each other when positions conflict and, through a process of argument exchanges, attempt to arrive at a resolution. Connectionist modeling presents an efficient and elegant approach, close to human cognition, for supporting decision makers in a deliberation. Furthermore, in addition to closely capturing human cognitive processes and providing an analytical approach to typically ill-structured decision problems, connectionist modeling exploits the tremendous computational power of computers in analyzing large, complex networks of arguments, while also serving as a repository of discussions related to intelligence information gathering, assimilation, dissemination, and analysis.
References

1. Brookings Institution Project on Homeland Security, available at http://www.brook.edu/dybdocroot/fp/projects/homeland/report.htm, 2002 (pages 51–66).
2. Dung, P. M.: On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77(2), 321–358, 1995.
3. Feldman, J. A.: A connectionist model of visual memory. In: Hinton, G. E., Anderson, J. A. (eds.): Parallel Models of Associative Memory. Lawrence Erlbaum Associates, Hillsdale, NJ, 1981.
4. Feldman, J., Ballard, D.: Connectionist models and their properties. Cognitive Science, 6, 205–254, 1982.
5. Guarino, N.: Formal ontology and information systems. In: Guarino, N. (ed.): Formal Ontology in Information Systems. IOS Press, Amsterdam, 3–15, 1998.
6. Hua, G., Kimbrough, S.: On hypermedia-based argumentation decision support systems. Decision Support Systems, 22, 259–275, 1998.
7. Loui, R. P.: Argument and arbitration games. Workshop on Computational Dialectics, AAAI Conference, 72–83, 1994.
8. Kimbrough, S. O.: A graph representation for management of logic models. Decision Support Systems, 2, 27–37, 1986.
9. Lin, F., Shoham, Y.: Provably correct theories of action. Journal of the Association for Computing Machinery, 42(2), 293–320, 1995.
10. Locks, M. O.: The logic of policy as argument. Management Science, 31(1), 109–114, 1985.
11. Lorenzen, P.: Formal Logic. Reidel Publishing Company, Dordrecht, The Netherlands, 1965.
12. Loui, R. P.: Argument and arbitration games. Workshop on Computational Dialectics, AAAI Conference, 72–83, 1994.
13. Mitroff, I. I., Mason, R. O., Barabba, V. P.: Policy as argument: a logic for ill-structured decision problems. Management Science, 28(12), 1391–1404, 1982.
14. Raghu, T. S., Ramesh, R., Whinston, A. B., Chang, A.-M.: Collaborative decision making: a connectionist paradigm for dialectical support. Information Systems Research, 12(4), 363–383, 2001.
15. Ramesh, R., Whinston, A. B.: Claims, arguments, and decisions: formalisms for representation, gaming and coordination. Information Systems Research, 5(3), 294–324, 1994.
16. Thagard, P.: Explanatory coherence. Behavioral and Brain Sciences, 12, 435–502, 1989.
17. Thornber, K. K.: The fidelity of fuzzy-logic inference. IEEE Transactions on Fuzzy Systems, 1(4), 288–297, 1993.
18. Toulmin, S.: The Uses of Argument. Cambridge University Press, Cambridge, 1958.
19. Wand, Y., Storey, V. C., Weber, R.: An ontological analysis of the relationship construct in conceptual modeling. ACM Transactions on Database Systems, 24(4), 494–528, 2000.
Collaborative Workflow Management for Interagency Crime Analysis

J. Leon Zhao, Henry H. Bi, and Hsinchun Chen

Department of Management Information Systems, University of Arizona, Tucson, AZ 85721
{lzhao, hbi, hchen}@bpa.arizona.edu
Abstract. To strengthen homeland security, there is a critical need for new tools that can facilitate real-time collaboration among various law enforcement agencies. Through a field study, we find that law enforcement work is knowledge-intensive and involves complex collaborative processes interrelating a large number of disparate units in a loosely defined virtual organization. To support knowledge-intensive collaboration, we propose a new workflow-centric framework that seamlessly integrates previously separate techniques from the fields of information retrieval and workflow management. Specifically, we develop a collaborative workflow management framework for interagency crime analysis. The key contribution of our research is that, by innovatively integrating various state-of-the-art techniques, the proposed system can support real-time collaboration processes in a dynamically evolving virtual organization.
1 Introduction The recent creation of the Department of Homeland Security is the most significant transformation of the U.S. government in over a half-century. On the one hand, this signifies the importance of integrating law enforcement forces at the federal level to strengthen public safety and security; on the other hand, it offers a new opportunity for nationwide collaboration in fighting crime. In this paper, we propose a workflow-centric collaboration framework that is capable of supporting real-time exchange of knowledge and facilitating event-based interactions among disparate law enforcement agencies. Currently, there is a trend toward migrating traditional police information systems to the Internet/Intranet [5]. Internet technologies, including the most recent development known as web services, can serve as the integration component to vertically and horizontally promote information flow among various agencies at the same and/or different levels. In this study, we will show how web services could be used as a universal middleware to integrate data and processes among different agencies. Through a field study, we find that law enforcement workflow is knowledge-intensive, and as a result, much of the existing literature treats police organizations as information processing systems [10]. With police officers estimated to spend up to 40% of their time handling information, processing information is one of the
most expensive police activities [8]. As such, the primary emphasis of information technology in the law enforcement context is on “increasing applied, useful, and succinct information” [7]. Our investigation indicated that an interagency workflow system would help tremendously, for the following reasons:
• The focus of most existing crime recording, analysis, and investigation systems is on the data perspective, such as the use of data mining techniques for automatically detecting crime patterns [1]. However, little research has been reported on how to manage the collaborative processes among disparate law enforcement agencies.
• Many detection failures in police work are due to the fact that the necessary information is either not received by the police or is lost or distorted within the police system, especially as the mobility of criminals increases [13].
• Most existing police search systems are developed and deployed by individual agencies operating at the regional, state, and national levels [9]. In order to reduce or eliminate duplicated effort, wasted time, and opportunities for error, a force-wide and fully integrated system is required [2].
• Two major objectives of developing office automation systems in police forces are “to improve information flows and access” and “to improve the quality of available information within the force” [12]. Workflow technology has been developed to streamline and automate business processes [11], and an interagency workflow system would help fulfill these two objectives.
We choose crime analysis as the context for interagency collaboration. Interagency collaboration is a complex process that involves the search for collaborators, the establishment of collaborative relationships, the initiation of rules of engagement, and the execution of collaboration activities. However, existing workflow solutions are too rigid: they are mainly suited to well-structured organizations and are insufficient for supporting loosely structured applications in interagency collaboration. Therefore, major innovations to existing workflow concepts and techniques are needed to support interagency crime analysis workflow. As such, we propose a collaborative workflow management framework that integrates several previously separate techniques from the fields of information retrieval and workflow management. Our core contributions in this study include the development of a business case on crime analysis processes, an interagency crime analysis workflow model, the integration of information retrieval and workflow techniques in crime analysis workflow, and a web services enabled system architecture for a crime analysis workflow system.
2 Characteristics of Crime Analysis Process: A Field Study
In this section, we describe our findings about the characteristics of crime analysis processes in a major metropolitan police department, based on a field study. Our main observation is that the crime analysis process is dynamic and complex and involves a large number of formal documents. As presented in a later section, this observation leads to an innovative workflow management framework that helps automate interagency crime analysis processes.
2.1 Crime Analysis Processes in a Major Police Department
We conducted a field study with a major police department, during which we interviewed six police officers, including patrol officers, crime analysts, and detectives, over a one-year period. We also collected sample documents and observed how knowledge-intensive the work of law enforcement personnel is. We summarize our main findings below.
Typically, patrol officers, crime analysts, and detectives are the three main roles involved in fighting crime in a police department. Corresponding to these three roles, there are usually three functional departments in police agencies. Crime analysts and detectives are further organized into groups within their respective departments according to crime types, so that an agent can specialize in one or more crime types. In terms of workflow, crime analysis and investigation can be organized into the following five stages:
1. Collecting crime data. When a crime occurs, patrol officers and detectives usually work at the scene to collect and record crime information in various documents. Information collection also includes detectives interviewing victims, witnesses, and suspects throughout the investigation process.
2. Processing and storing crime data and documents. Documents are then collected and managed by a special unit inside the police department, and part of the recorded information is entered into one or more computer-based systems.
3. Searching, retrieving, and collecting additional information for crime analysis. When investigating a case, more information about the suspects and crimes is needed and retrieved from various data resources and databases. For instance, after a patrol officer or a detective is assigned a case, he or she searches for the needed information, or may ask a crime analyst for help. There are many possible sources for collecting information, including records from other police departments, utility companies for water, gas, and electricity, phone companies, and various city, state, and federal government branches.
4. Analyzing information to find clues. In order to find more complete evidence to charge a criminal or to find clues for an investigation, sophisticated and logical crime analyses need to be conducted to find linkages among criminals and/or among crimes. Such analyses include crime pattern analysis, data analysis, etc. A crime analyst must type and print out a report after the analysis stage. The report is delivered to the requester as a hardcopy, not by email. Although crime analysts are encouraged to enter their findings into a database to share with others, they usually do not do so because they are too busy.
5. Using information to prosecute criminals. Detectives use information collected from the crime scene, from the suspects, victims, and witnesses of the crime, and from crime analysis reports to generate formal documents in order to prosecute criminals. In addition, detectives need to perform many data-intensive tasks such as completing supplement reports, requesting lab reports from crime lab technicians, and obtaining transcripts of all interviews from transcribers. Furthermore, this stage often involves collaboration among detectives from several departmental units, the county attorney’s office, and the courts to combine all the information into a complete file that becomes the basis for the prosecution.
2.2 The Need for Greater Support in Collaborative Workflow
The description of a crime analysis process in a major police department reveals several deficiencies in current crime analysis and investigation practices:
1. Crime analysts and detectives need to retrieve information from many databases that are not integrated. When a crime analyst or detective needs information about a case, they have to search many data resources and then piece together the scattered information. It is common to search four or five different systems to establish the history and intelligence about a single person or address [1].
2. A computer-based recording system is strongly preferred to a paper-based recording system. Currently, the process of crime analysis and investigation in a typical police department is mainly non-automated, because all of the investigation steps must be thoroughly documented to secure an effective outcome. Paper-based processes produce a large amount of duplicated information, as victim, witness, and suspect information, addresses, phone numbers, and so on may need to be entered into many different documents.
3. Too many forms must be completed, because different units and organizations (records unit, county attorney’s office (CAO), courts, etc.) require different document formats. Various documents require much of the same data, such as information on victims, witnesses, and suspects. If detectives complete these documents on paper, writing the same thing ten times, the records unit must then enter these documents into the computer, typing the same thing another ten times. Numerous forms duplicate much of the same information, and it would be desirable to consolidate that information into fewer forms. Besides wasting money and time, paper forms also take up much space; for example, many agencies are required to keep homicide cases forever and other cases for at least seven years.
4. Although many cases may be implicitly related to each other (for instance, a narcotics case may be related to a homicide case), they may be assigned to different crime analysts and detectives according to the characteristics of the cases. As a result, while many criminals commit crimes in multiple places (counties, cities, or states), crime analysis and investigation usually have a local focus that is not effective against such criminals. Further, the police department must collaborate with other law enforcement agencies to solve crimes that are related to one another, particularly when a crime is of a federal nature.
These four types of deficiencies in crime analysis point to a great need for process automation. A workflow system can improve police work efficiency by linking various databases flexibly, automating the data collection process, standardizing and/or automatically creating various forms, and revealing relationships among crime cases. However, existing workflow systems do not possess all the functions needed for police work. In this paper, we focus on investigating a workflow solution for interagency crime analysis, which is probably the most challenging problem for which to find a solution. We hope that this focus will allow us to later develop solutions that support the whole crime analysis and investigation process.
3 A Conceptual Model of Collaborative Workflow for Interagency Crime Analysis
In this study, we treat interagency crime analysis as a workflow management problem. There are several issues in interagency crime analysis workflow. First, one needs to find the right agent with whom to share a particular piece of information, such as a crime case. Second, once a relevant agent or agency is identified, a workflow instance needs to be initiated. Since we assume loosely coupled law enforcement agencies, there is no standard way of handling workflows among different agencies. For instance, some scheduling needs to be done before the workflow can be initiated, because one agency cannot order another agency to do anything. The workflow process therefore must be extremely flexible and needs to be designed on the fly on a case-by-case basis. Furthermore, since it is not realistic to ask each agency to invest a lot of time and money in an interagency crime analysis workflow system, we need to design a workflow mechanism that is inexpensive to implement and that can be used on the basis of common information system standards.
We propose a six-phase meta-process for collaborative workflow management for the given context. We use the activity-based workflow modeling notations proposed in [4] to represent these meta-processes; the notations are shown in Fig. 1. In activity-based workflow modeling, there are two types of nodes: activity nodes, which represent activities, and routing nodes, which route the execution flows of activities. There are two special activity nodes: the start activity node represents the start point of the workflow, and the end activity node stands for the end point. There are three kinds of routing nodes: AND nodes, OR nodes, and XOR nodes. Activities linked to an AND node must all be executed. Of the activities linked to an XOR node, exactly one is executed. Of the activities linked to an OR node, one or more may be executed, depending on the scenario. Finally, directed arcs are used to link nodes.
Fig. 1. Notations of activity-based workflow modeling (activity node; AND, OR, and exclusive-OR connectors; control flow arcs; start node s and end node e)
1. Workflow identification (Fig. 2). Given a new crime case, a crime analyst can submit the case to the workflow system for identification of relevant cases and relevant crime analysts. The workflow system then searches the National Crime Case Repository, based on the Unified Case Language System, for similar cases (both are discussed in more detail in a later section). The relevant cases can be closed or open cases, and the relevant agents can be active agents who can perform collaborative analysis or agents who may have relocated or retired.
Fig. 2. Process model of workflow identification
2. Workflow negotiation (Fig. 3). Given the constraints of the matching results for the case, various optional workflows could result. For instance, if an identified agent is no longer with the agency, an alternative agent needs to be identified. An agent must get approval from his/her agency in order to start a formal workflow.
Fig. 3. Process model of workflow negotiation
3. Event-based workflow design (Fig. 4). Once the agent and the associated agency agree to participate in the interagency workflow, the workflow process needs to be designed. The nature of the workflow design may require the selection and modification of existing workflow templates in order to design the workflow efficiently and correctly within a reasonable amount of time, such as a matter of minutes. Overly costly workflow design would render the workflow system difficult to adopt.
Our proposal cannot be supported directly by existing workflow management systems due to the aforementioned constraints. In this study, we propose an event-based workflow management environment. In this environment, collaborators determine through discussion a list of action items, the events associated with each action item, the dependencies among those events and action items, and the deadlines for those events. Collaborative workflow management requires system support for these user activities in the following respects:
• Event definition. Upon workflow initiation, registered users will be given a template to collectively define their events. Events found in similar collaborative workflow instances in the past will be given as suggestions. Some events can contain sub-events if needed.
• Event dependency specification. Users can specify the dependency relationships among the events. These relationships provide the basis for workflow scheduling and rescheduling by sequencing the events properly.
• Deadline specification and integrity verification. Users can specify event deadlines collectively, while the system helps verify the integrity of the deadlines based on the dependency relationships among events. For instance, if event X depends on event Y, event X cannot occur before event Y; if a user tries to reschedule event X before event Y, the workflow system will not allow it (a minimal sketch of such a check follows this list).
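The following is a minimal sketch of the deadline integrity check, assuming a simple representation in which each event carries a deadline and a set of prerequisite events; the function and variable names are illustrative only.

```python
from datetime import datetime
from typing import Dict, Set

def verify_deadlines(deadlines: Dict[str, datetime],
                     prereqs: Dict[str, Set[str]]) -> list:
    """Return the (event, prerequisite) pairs that violate deadline integrity.

    If event x depends on event y, x's deadline must not be earlier
    than y's deadline, since x cannot occur before y completes.
    """
    violations = []
    for event, deps in prereqs.items():
        for dep in deps:
            if deadlines[event] < deadlines[dep]:
                violations.append((event, dep))
    return violations

# Example: rescheduling E5 before its prerequisite E4 is rejected.
deadlines = {"E4": datetime(2003, 1, 11, 14), "E5": datetime(2003, 1, 11, 19)}
prereqs = {"E5": {"E4"}}
assert verify_deadlines(deadlines, prereqs) == []
deadlines["E5"] = datetime(2003, 1, 11, 9)   # attempt to move E5 earlier
assert verify_deadlines(deadlines, prereqs) == [("E5", "E4")]
```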
Fig. 4. Process model of event-based workflow design
4. Workflow initialization (Fig. 5). The designed workflow instance needs to be initialized. This requires verification of the workflow template to remove any conflicts or other errors. Once the workflow template is deemed correct, the users will be informed of the start of the workflow. Initial confirmations with all users are necessary to make sure that the users and their email addresses are correct and that all are committed to seeing the workflow through. (A sketch of one such verification check, cycle detection, follows Fig. 5.)
Fig. 5. Process model of workflow initialization
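One concrete error that verification must catch is a circular dependency among events, which would make the workflow unschedulable. Below is a minimal sketch using Kahn's topological-sort algorithm; the representation (event name mapped to its prerequisite set) is our own assumption, not a prescribed format.

```python
from typing import Dict, Set

def has_cycle(prereqs: Dict[str, Set[str]]) -> bool:
    """Detect circular dependencies via Kahn's topological sort.

    prereqs maps each event to the set of events that must complete first.
    If not every event can be ordered, a cycle exists.
    """
    indegree = {e: len(deps) for e, deps in prereqs.items()}
    successors: Dict[str, Set[str]] = {e: set() for e in prereqs}
    for e, deps in prereqs.items():
        for d in deps:
            successors.setdefault(d, set()).add(e)
            indegree.setdefault(d, 0)
    ready = [e for e, deg in indegree.items() if deg == 0]
    ordered = 0
    while ready:
        e = ready.pop()
        ordered += 1
        for succ in successors.get(e, set()):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return ordered != len(indegree)

assert not has_cycle({"E4": {"E1", "E2", "E3"}, "E5": {"E4"}})
assert has_cycle({"A": {"B"}, "B": {"A"}})   # A and B depend on each other
```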
5. Event-based workflow execution (Fig. 6).
• Event execution. Once the action items and events are specified, the collaborative workflow engine will monitor the progress of events and interact with the users in several ways. First, if an event's completion passes the specified deadline, the workflow system will send messages to all users about the delay. Second, once an event is completed, the workflow engine will check whether any subsequent events need to be started; necessary data will flow from prior events to subsequent events. Furthermore, the workflow engine can inform the users whether the workflow is on schedule and indicate any positive or negative discrepancies from the schedule.
• Event modification. When a particular event is overdue, other events depending on the delayed event might need to be rescheduled. Further, changes in circumstances might prompt users to ask for a change of event deadlines. In any case,
our workflow system supports deadline changes and/or event changes such as adding, deleting, or modifying an event. (A sketch of the engine's monitoring pass follows Fig. 6.)
Fig. 6. Process model of workflow execution
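To illustrate the execution semantics just described, here is a minimal sketch of one pass of an event-monitoring loop; the data model and function names are our own illustration, and a real engine would run such a pass on a timer and also route data from completed events to subsequent ones.

```python
from datetime import datetime
from typing import Dict, Set

def tick(now: datetime,
         deadlines: Dict[str, datetime],
         prereqs: Dict[str, Set[str]],
         completed: Set[str],
         started: Set[str]) -> None:
    """One pass of the workflow engine's monitoring cycle."""
    for event in deadlines:
        if event in completed:
            continue
        # Alert all collaborators when an event passes its deadline.
        if now > deadlines[event]:
            print(f"ALERT: {event} is overdue (deadline {deadlines[event]})")
        # Start an event once all of its prerequisites have completed.
        if event not in started and prereqs.get(event, set()) <= completed:
            started.add(event)
            print(f"START: {event} is ready to begin")
```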
6. Workflow termination (Fig. 7). Once the workflow terminates, we need to update the relevant systems to record the results of the crime cases involved. This requires the distribution of relevant documents to various legacy systems.
Fig. 7. Process model of workflow termination
As discussed above, existing workflow management systems do not have the functions to support all the needed features of interagency crime analysis workflow. The main challenge is the integration of techniques from several separate research domains, namely information retrieval, distributed and autonomous information systems, and agile workflow management. These six phases of workflow are generic. As shown in a later section, we also include a case-based workflow modeling paradigm at the instance level. To distinguish the two modeling levels, we refer to the generic workflow models as meta-models.
4 A Collaborative Workflow Management System for Interagency Crime Analysis
4.1 A Three-Layer Framework
In order to support a system architecture for managing interagency crime analysis workflow, we make several innovative proposals. First, we propose a National Crime Case Repository (NCCR) that will store crime cases submitted by law enforcement agencies interested in seeking collaborators. These cases typically have the features of an overdue case requiring multi-agency investigation.
This crime case repository should also contain functions such as crime case matching and workflow routing, which are needed to automate interagency crime analysis workflow.
Second, we propose a Unified Case Language System (UCLS) based on XML. The purpose of the UCLS is to aid the development of ontologies and systems that help law enforcement professionals and researchers retrieve and integrate electronic crime-fighting information from a variety of sources. Similar to the Unified Medical Language System (UMLS), which was designed to support information integration in the medical domain, the UCLS can help create machine intelligence by enabling computerized parsing of crime cases and their features and the discovery of similarities among them with greater precision. For instance, if we can create a gun ontology and a vehicle ontology, then the types of guns and automobiles found at a crime scene can be more easily identified and matched.
Third, we design a Collaborative Workflow Management System (CWMS) based on recent developments in web services. This agile workflow management framework will be able to support all the features needed by the interagency crime analysis workflow. As discussed above, the dynamic nature of the crime analysis process, the loosely coupled law enforcement agencies, and the heterogeneous data and processes in crime analysis collaborations render existing workflow standards difficult to apply.
Fig. 8. A three-layer framework for the interagency crime analysis workflow system (case language layer: Unified Case Language System; workflow management layer: Collaborative Workflow Management System; case repository layer: National Crime Case Repository)
These three components, i.e., the NCCR, the UCLS, and the CWMS, are essential to an interagency crime analysis system. The NCCR is needed because a single source of cases will make case exchange and matching much easier and less costly. The UCLS is needed to enable the creation of machine-understandable case descriptions based on XML. The CWMS will integrate the NCCR with existing law enforcement information resources in an efficient manner. The web services standard will make it possible to develop solutions that are universally accessible with little change to existing police information systems. Fig. 8 illustrates the relationship among the three layers, namely the case language layer, the workflow management layer, and the case repository layer. (A hypothetical sketch of a UCLS case description follows.)
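Since the paper does not define a concrete UCLS schema, the following is a hypothetical sketch of what an XML case description under such a language might look like and how case features could be extracted and compared; every element name and the naive matching rule are our own inventions for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical UCLS case description; all element names are illustrative.
case_xml = """<case id="TPD-2003-0412" status="open">
  <crime-type>auto theft</crime-type>
  <vehicle make="Ford" model="Mustang" year="1998" color="red"/>
  <weapon type="handgun" caliber="9mm"/>
  <location city="Tucson" state="AZ"/>
</case>"""

def case_features(xml_text: str) -> dict:
    """Parse a UCLS-style case document into a flat feature dictionary."""
    root = ET.fromstring(xml_text)
    features = {"crime-type": root.findtext("crime-type")}
    for child in root:
        for attr, value in child.attrib.items():
            features[f"{child.tag}.{attr}"] = value
    return features

def similarity(a: dict, b: dict) -> float:
    """Fraction of shared features whose values match (naive matching)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)
```

With ontology support as described above, the value comparison in similarity() could be replaced by ontology-aware matching (e.g., two handgun models recognized as the same weapon type).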
4.2 Web Services Enabled System Architecture for Interagency Crime Analysis Workflow
As indicated by the crime analysis process, the core challenge for information technology has always been, and will continue to be, the integration of inter- and intra-enterprise applications. Today, web services provide a new standard for enterprises to build successful, cost-effective, tractable application integration. The International Data Group (IDG) estimated that the total software, hardware, and services opportunity
derived from Web services will rise from USD 1.6 billion in 2004 to USD 34 billion by 2007. Web services help create a universal computing environment where all computer programs can communicate with almost any other program anywhere, at any time. The web services technology can therefore streamline inter-business processes by creating an open, distributed systems environment, and it opens the possibility of significantly reducing the cost of integrating disparate business systems.
Workflow technologies and web services can be implemented at three levels for police agencies to improve homeland security:
1. A workflow system is implemented inside each agency to automate crime analysis and investigation and to increase information sharing inside the agency.
2. Web services are implemented as the middleware linking all agency-based workflow systems together to increase information sharing among agencies.
3. Web services are used to vertically integrate the workflow systems of police agencies with the CAO, courts, jails, and sheriffs to automate crime investigation and reduce paperwork.
At the business level, our research goal is to apply workflow technologies and web services to interagency crime analysis and investigation; at the technical level, we strive to develop new information processing and process management techniques towards an agile workflow management framework. Fig. 9 illustrates how workflow techniques and web services are implemented in an interagency crime analysis workflow system. As shown, each law enforcement agency might have its own database used by various agents. We assume that various agencies will sign up to use the collaborative workflow services in crime analysis, and the interagency crime analysis workflow system (implementing the three-layer framework shown in Fig. 8) will enable direct communication among agents in different agencies from their existing information systems over SOAP connections. In this system architecture, SOAP (Simple Object Access Protocol) connections provide the web service channels between the interagency crime analysis workflow system and the agency information systems.
Fig. 9. Web service-based system architecture for interagency workflow interoperability (agency information systems IS1 ... ISn connected to the interagency crime analysis workflow system via SOAP connections)
This particular system architecture provides several important features. First, the SOAP protocols act as universal connectors to the information systems already implemented in law enforcement agencies. Because most information system vendors are adopting the SOAP standard for system interoperability, the interagency workflow system can interoperate with existing information systems without major new investment. Second, the collaborative workflow management system has very high fluidity because it can enable spontaneous collaborative interactions among agents who have never collaborated before, when they find one another through a brand-new case. Third, since the information systems currently used by agents include a variety of types, such as databases, email, and decision support systems, this spontaneous interoperability makes collaborative workflow management very flexible: agents can choose whatever platforms they like, so long as their preferred systems understand the web services standard. With the rapid pace at which software vendors are adopting web services, this web service-based system architecture makes it possible to design and implement the proposed nationwide interagency crime analysis workflow. (A sketch of such a SOAP exchange is given below.)
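As an illustration of the SOAP channel, the sketch below posts a hand-built SOAP 1.1 envelope for a hypothetical case-search operation using only Python's standard library; the service URL, namespace, and operation name are all assumptions made for illustration, not part of any deployed system.

```python
import urllib.request

# Hypothetical endpoint and operation; not a real deployed service.
SERVICE_URL = "http://example.org/nccr/soap"

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <searchCase xmlns="urn:example:ucls">
      <crimeType>auto theft</crimeType>
      <city>Tucson</city>
    </searchCase>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    SERVICE_URL,
    data=envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "urn:example:ucls#searchCase",  # SOAP 1.1 action header
    },
)
# response = urllib.request.urlopen(request)  # would return the SOAP reply XML
```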
5 Event-Based Workflow and Event Management Language
5.1 Meta-level and Instance-Level Workflow Models
The six-phase conceptual model for collaborative workflow management presented in Section 3 contains two levels of workflow modeling techniques: a meta-level workflow management model and an instance-level workflow event model. The meta-level workflow management model is essentially the process management protocol used to set up a collaborative workflow spontaneously at the time of a user request. The collaborative workflow engine uses this meta-model to interact with users to design and execute all collaborative workflow instances.
The instance-level workflow event model is discussed in Section 3 under the event-based workflow design and event-based workflow execution phases. Workflow event management is our innovation for managing workflows flexibly on a case-by-case basis. To the best of our knowledge, no prior work in the literature deals with workflow flexibility using an event-based workflow design and execution technique. Although the concept of event management is not new, applying the event management technique in workflow modeling is a novel concept.
5.2 A Workflow Event Language and Associated Operators
To manage event-based workflow, we propose an event constraint language similar to the sequence constraint language proposed in [6]. Below, we illustrate the basic principles of the event constraint language and the associated event manipulation operators.
A workflow event is an action item that must be completed by one or more agents on or before a specified time. We denote event e to be completed by agent a by time t as e(a, t). When an event must be completed by more than one agent, we use A = {a1,
a2, ..., an} to indicate a set of agents. In this paper, we assume that an event's start time is determined by its prerequisite event(s); only the deadline t of event e is explicitly specified. Note that the concept of an event in our framework is very general: an event could entail a decision-making task, a telephone conference, a data collection task, or an emailing task.
The relationship between two events can be prerequisite, subsequent, or irrelevant. For instance, if event e1 must be completed before event e2, that is, e1 is a prerequisite of e2, we write e1 → e2. Conversely, the same notation indicates that e2 is a subsequent event of e1. It is possible that multiple events, say e1, e2, and e3, are prerequisites of event e4; in that case, we denote the relationship as {e1, e2, e3} → e4. Similarly, multiple events can be subsequent events of other events.
The following collaborative workflow example can be modeled with this simple event specification language. Consider five agents: John and Sarah from the Tucson Police Department, Jen from the Arizona Homeland Security Council, Tom from the CIA, and Mike from the FBI. Assume that they have agreed to collaborate, supported by the interagency crime analysis workflow system, on a case that involves a group of potential criminals and a specific crime, and that they have agreed to accomplish the tasks shown in Table 1. This simple example is for the purpose of illustrating the use of the event language. The list of events can be written formally as E1({John, Sarah}, ‘5 pm 1/10/03’), E2(Tom, ‘5 pm 1/10/03’), E3(Mike, ‘5 pm 1/10/03’), E4({John, Sarah, Tom, Mike, Jen}, ‘2 pm 1/11/03’), E5(Jen, ‘7 pm 1/11/03’), E6({John, Mike}, ‘5 pm 1/15/03’), E7(Sarah, ‘11 am 1/17/03’), E8({John, Sarah, Tom, Mike, Jen}, ‘5 pm 1/18/03’). The dependency constraints are specified as {E1, E2, E3} → E4, E4 → E5, E5 → E6, E6 → E7, E7 → E8. A summary of the events is listed in Table 1.
Table 1. Listing of Events
Events | Description | Agents | Deadline | Prerequisites
E1 | Collect data from Tucson agencies | John, Sarah | 5 pm 1/10/03 | None
E2 | Collect data from CIA | Tom | 5 pm 1/10/03 | None
E3 | Collect data from FBI | Mike | 5 pm 1/10/03 | None
E4 | Teleconference meeting | All agents | 2 pm 1/11/03 | E1, E2, E3
E5 | Compile meeting minutes & email to all agents | Jen | 7 pm 1/11/03 | E4
E6 | Review findings & propose prosecution actions | John, Mike | 5 pm 1/15/03 | E5
E7 | Prepare prosecution forms & email to all | Sarah | 11 am 1/17/03 | E6
E8 | Final teleconference | All agents | 5 pm 1/18/03 | E7
In practice, the list of events might not be specified completely at the start and is subject to change by adding new events or modifying existing events. This requires an event manipulation algebra that can insert a new event, remove an event, and modify an event. Further, the dependency relationships among events also need to change when new events are inserted or old events are removed.
There are a number of event manipulation operators. Insert(e1, e2, e3) means that event e1 is to be inserted before e3 and after e2; that is, e1 becomes the prerequisite event of e3 and the subsequent event of e2. Remove(e4) means that event e4 is to be removed, in which case any subsequent events of e4 become the subsequent events of the prerequisite events of e4. For instance, suppose we have {e2, e3} → e4 and e4 → e8; after event e4 is removed, the event management system should automatically derive the relationship {e2, e3} → e8. Other event manipulation operators will be needed as well to change the contents of any of the columns event ID, event description, agents, deadline, and prerequisites. For instance, Modify(E1, Description = ‘Collect data from Arizona police departments’) would change the description of the first row in Table 1. Additional details of the event manipulation operators are omitted. (A sketch of the Insert and Remove operators follows.)
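As an illustration of these semantics, the following is a minimal sketch of the Insert and Remove operators over the prerequisite relation; the dictionary representation, and the assumption that Insert replaces the direct e2 → e3 link, are our own interpretation.

```python
from typing import Dict, Set

def insert(prereqs: Dict[str, Set[str]], e1: str, e2: str, e3: str) -> None:
    """Insert(e1, e2, e3): place e1 after e2 and before e3."""
    prereqs.setdefault(e1, set()).add(e2)       # e1 is subsequent to e2
    prereqs.setdefault(e3, set()).discard(e2)   # interpretation: e3 now
    prereqs[e3].add(e1)                         # follows e1 instead of e2

def remove(prereqs: Dict[str, Set[str]], e: str) -> None:
    """Remove(e): subsequent events of e inherit e's prerequisites."""
    inherited = prereqs.pop(e, set())
    for deps in prereqs.values():
        if e in deps:
            deps.discard(e)
            deps.update(inherited)

# Remove(e4) with {e2, e3} -> e4 and e4 -> e8 derives {e2, e3} -> e8.
prereqs = {"e4": {"e2", "e3"}, "e8": {"e4"}}
remove(prereqs, "e4")
assert prereqs["e8"] == {"e2", "e3"}
```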
5.3 Uniqueness of Event-Based Workflows
Note that the event modeling method does not require complex routing rules, as found in conventional workflow management systems, because we assume that the collaboration will be handled by the involved agents. In a way, our workflow automation falls between a simple email-based collaboration system and a full-fledged workflow system. While the former provides no support for systematic event specification and modeling, and the latter is too complex and too costly for spontaneous and dynamic collaborations, our model offers moderate complexity and significant functionality with a great deal of flexibility.
The basic premise of event-based workflow design and execution is that conventional workflow design and execution techniques are complex to learn and expensive to use. Typically, conventional workflow models are based on a process graph and logic-based triggering constraints. Because of the complexity involved in this paradigm, existing workflow systems typically provide a graphical user interface for designing a workflow. However, this conventional workflow modeling paradigm is not suitable for interagency crime analysis workflow for several reasons.
• First, interagency crime analysis workflow models are difficult to standardize, since the types of processes and their levels of complexity are numerous. A workflow model used by one collaborative team of agents might not be useful for another team because of the differences in their crime cases.
• Second, our proposal of workflow templates and event-based workflow design is superior to conventional activity-based workflow design because changing the events during workflow execution is much more convenient than changing a workflow graph. The main reason is that the event-based workflow model only requires the verification of event sequence constraints based on event dependencies.
• Third, we abandon the concept of roles in the workflow model, since we assume that the collaborative team usually consists of a small number of agents. New agents can be added to an existing workflow, and existing agents may be replaced under the event-based modeling paradigm, while a log of all events and the related agents is maintained.
6 Conclusions In this paper, we outlined an interagency crime analysis workflow system based on web services and event management. The main objective is to facilitate interagency collaboration on crime analysis through a national crime case repository and a unified
case language system. The system can help collaborators work together seamlessly and efficiently. We argued that existing workflow solutions are too rigid: they are largely confined to well-structured organizations and insufficient for supporting loosely structured applications in a virtual organization that is constantly evolving. Although collaborative workflow concepts have been used in various contexts, our notion of collaborative workflow is unique. For instance, the collaboration management infrastructure developed at the Microelectronics and Computer Technology Corporation [3] consists of core workflow management functions and extended modules that support the process and situation awareness needed for process coordination. However, that infrastructure does not provide the kind of flexibility needed by interagency crime analysis workflow.
We claim a number of contributions in this study: (1) Our field study in a major police department revealed the knowledge-intensive and complex processes in law enforcement work and a great need for collaborative workflow management. (2) We proposed a national crime case repository and an associated Unified Case Language System to enable the discovery of collaborators related to a specific crime case, so that spontaneous collaboration can occur in real time when needed. (3) We designed a unique workflow management framework, called collaborative workflow management, and the associated workflow design and execution model. (4) We proposed an event-based workflow management method and the associated event manipulation language. Currently, we are working on the details of the system design and related theoretical development to validate the system described in this paper.
References
1. Adderley, R.W., P. Musgrove. 2001. Police crime recording and investigation systems – A user’s view. Policing: An International Journal of Police Strategies & Management. 24(1) 100–114.
2. Anonymous. 1986. Stepping up the beat (Sussex Police Force computer system). Computer Systems. 6(5).
3. Baker, D., D. Georgakopoulos, H. Schuster, A. Cassandra, A. Cichocki. 1999. Providing customized process and situation awareness in the collaboration management infrastructure. Proceedings of the 4th IFCIS International Conference on Cooperative Information Systems.
4. Bi, H.H., J.L. Zhao. 2003. Mending the lag between commercial needs and research prototypes: A logic-based workflow verification approach. Proceedings of the 8th INFORMS Computing Society Conference.
5. Kou, C.Y. 1998. An evaluation for nationwide police information system (NWPIS) based on Internet. Proceedings of the 32nd Annual International Carnahan Conference on Security Technology. 117–120.
6. Kumar, A., J.L. Zhao. 1999. Dynamic routing and operational controls in workflow management systems. Management Science. 45(2) 253–272.
7. Manning, P.K. 1996. Information technology in the police context: The “sailor” phone. Information Systems Research. 7(1) 52–62.
8. McKay-Smith, M., A.C. Gustin. 1982. Forms control and design in a small police department. ARMA Records Management Quarterly. 16(1) 7–8.
9. Northrop, A., D. Dunkle, K.L. Kraemer, J.L. King. 1994. Computers, police, and the fight against crime: An ecology of technology, training and use. Informatization and the Public Sector. 3(1) 21–45.
10. Simms, B.W., E.R. Petersen. 1991. An information processing model of a police organization. Management Science. 37(2) 216–232.
11. Stohr, E.A., J.L. Zhao. 2001. Workflow automation: Overview and research issues. Information Systems Frontiers. 3(3) 281–296.
12. Taylor, J.A., H. Williams. 1992. Police management, office automation and organizational change. New Technology, Work, and Employment. 7(1) 44–53.
13. Willmer, M.A.P. 1978. An information-theoretic approach to the organization of police forces. Proceedings of the 4th International Congress of Cybernetics & Systems.
COPLINK Agent: An Architecture for Information Monitoring and Sharing in Law Enforcement Daniel Zeng, Hsinchun Chen, Damien Daspit, Fu Shan, Suresh Nandiraju, Michael Chau, and Chienting Lin Department of Management Information Systems University of Arizona, Tucson, Arizona 85721 {zeng, hchen, damien, shan, mchau, linc}@eller.arizona.edu
Abstract. In this paper, we report our work on developing and evaluating a prototype system aimed at addressing the information monitoring and sharing challenges in the law enforcement domain. Our system, called COPLINK Agent, is designed to provide automatic information filtering and monitoring functionalities. The system also supports knowledge sharing by proactively identifying, in real time, officers who are working on the same or similar cases. To accommodate the mobile needs of law enforcement officers, who are constantly in the field, COPLINK Agent can deliver messages through a variety of communications channels, including e-mail, pager, and mobile phone. In order to assess the effectiveness of COPLINK Agent, we conducted a pilot user study at the Tucson Police Department. Overall, COPLINK Agent was shown to be an effective tool for improving the effectiveness and efficiency of criminal investigations in crime categories such as gang-related crime, theft, and fraud.
1 Introduction
The rapid advancement of information technologies and the Internet provides great opportunities as well as challenges for government agencies. These technologies not only allow easier and faster access to information but also facilitate the sharing and reuse of information. In the law enforcement domain, information access is especially critical for crime analysis and investigation. Consequently, police officers and investigation personnel are increasingly becoming knowledge workers whose daily activities include considerably complex and extensive interactions with diverse information and knowledge sources. These interactions include selecting, collecting, preserving, organizing, using, accessing, analyzing, and producing data. In addition, timely access to information is often critical. Consequently, crime and police report data are rapidly migrating from paper-based records to automated law enforcement records management systems (RMS).
Despite the availability of such viable RMS solutions, there are still several major issues and challenges not yet addressed by existing systems. First, crime analysts often need to query different distributed data sources, including both internal databases and external ones managed by other law enforcement agencies. These data sources, funded and maintained by multiple agencies, often employ a wide range of different hardware platforms, database systems, network protocols, data schemas, and user interfaces. To find the desired information,
law enforcement personnel have to know where such data sources are located and how to access them. As such, a large amount of manual and cognitive effort is required to query all the relevant data sources, each with a different search interface.
Another major problem concerns the dynamic nature of law enforcement data sources. Since many cases involve long periods of investigation, law enforcement personnel often have to track the activities of a particular suspect or the whereabouts of a vehicle over a long period of time. As the data are updated frequently, the databases have to be repeatedly queried for changes, often on a daily basis. Since such automatic monitoring functions are not available in most current systems, the data sources have to be checked manually, requiring a lot of time and effort from the user. Conceivably, it is not uncommon that cases requiring constant monitoring have to be dropped because of a lack of resources.
Last but not least, the lack of support for knowledge sharing is yet another problem confronting current law enforcement information systems. Through their daily work, law enforcement personnel with different job functions and working at different locations can easily acquire a vast amount of knowledge about a particular suspect or case. Such knowledge, nonetheless, is tacit and often not efficiently shared. When a police officer needs some particular piece of information, it is often not clear whom to contact. It is common that two different law enforcement units are working on two closely related cases (e.g., related to the same person) without knowing of each other's existence or progress. Naturally, the two units are not able to collaborate and share their collective knowledge. Thus, the ability to share knowledge in a collaborative environment by linking together people who are working on the same or similar cases can significantly improve law enforcement agencies' crime-fighting capabilities.
In this paper, we report on our experience in designing and evaluating a collaborative information monitoring system for law enforcement. This approach was implemented in a prototype system called COPLINK Agent, which was designed to address the aforementioned challenges by providing automatic information filtering and knowledge sharing functionalities.
The rest of the article is organized as follows. Section 2 reviews related research and existing systems in the law enforcement domain. Section 3 discusses the research questions addressed in this study. Section 4 describes the system architecture and main components of COPLINK Agent. In Section 5, a sample user session with the system is described in detail to illustrate its use in a law enforcement environment. Section 6 focuses on a user study designed to evaluate COPLINK Agent and answer the research questions raised above. We conclude the paper in Section 7 by summarizing our research contributions and suggesting future research directions.
2 Research Background
2.1 Information Systems in the Law Enforcement Domain
Quick and easy access to information is critical to the success of law enforcement agencies. Database technologies have been widely used to manage crime and police reports to provide faster and easier access for law enforcement personnel [15,18]. One such example is the COPLINK Connect system [7,14]. COPLINK Connect aims to enable law enforcement agencies to search for information more effectively by
providing a user-friendly interface that integrates data from various sources such as incident records, mug shots, and gang information. COPLINK Connect was deployed in 2001 at the Tucson Police Department (TPD), which has more than 1,000 employees and serves a population of over 475,000 citizens. The system has been shown in a field study to have improved the efficiency of detectives and crime analysts [14].
Other information technologies have also been used in law enforcement. For example, the Comstat system introduced by the New York Police Department uses computer statistics and crime mapping techniques to identify the types of crimes happening in different districts [10]. Another example, the COPLINK Detect system, uses co-occurrence analysis to identify the relationships among different entities (e.g., persons, vehicles, locations, and organizations) in criminal justice databases [14]. Data mining techniques have also been applied to identifying interesting patterns in criminal data; for example, a self-organizing map has been used to cluster similar sexual offense cases into groups in order to identify serial offenders [1].
2.2 Information Monitoring and Sharing
The law enforcement information systems discussed above have mainly focused on two aspects: providing easy and efficient access to data, and performing analysis on existing data. However, none of these systems supports automatic information monitoring or information sharing among users, both of which have been widely studied for Web applications.
There are many monitoring and notification systems for Web information sources. One example is the NorthernLight Web search engine (www.northernlight.com), which alerts users when new Web pages are added to its database. Some client-side search tools, such as Copernic Agent (www.copernic.com) and WebSeeker (www.bluesquirrel.com), also provide functionality for scheduling automatic searches. Stock prices are also frequently monitored; for example, E*trade (www.etrade.com) allows users to choose which stocks they want to monitor and alerts them when a stock price reaches the level they specified. In the financial application arena, more advanced monitoring (e.g., monitoring based on the results of complex financial analysis) has also been proposed [24]. In these systems, users can often opt to be alerted in different ways, such as Web messages, emails, pagers, voice messages, or short messages for mobile devices.
In the area of monitoring and alerting support for law enforcement applications, the FALCON system [4], developed at the Charlotte-Mecklenburg Police Department (CMPD) in Charlotte, North Carolina, offers the functionality of monitoring all incoming police records and sending alert messages to police officers by email and pager. FALCON, nonetheless, does not offer collaborative filtering capabilities or advanced collaboration functions.
To facilitate user collaboration and information sharing, collaborative filtering, also referred to as recommender systems, has been widely studied in Web applications. Goldberg et al. [12] define collaborative filtering as a kind of collaboration in which people help one another perform filtering by recording their reactions to the documents they read. Examples of collaborative filtering and recommender systems include Amazon.com, GroupLens [17], Fab [3], Ringo [22], Do-I-Care [23], and Collaborative Spider [6].
When a user performs a search, these systems recommend a set of documents or items that may be of interest based on the user's profile and other users' interests and past actions. Collaborative filtering systems typically operate in a
“push” mode. When certain users come across some interesting pieces of information, they recommend, or “push,” these pieces to other users who share similar interests. For instance, Amazon.com uses collaborative filtering to recommend books to potential customers based on the preferences and recommendations of other customers who have similar interests or purchasing histories. Annotations in free-text or predefined formats are also incorporated in systems such as AntWorld [16], Annotate! [11], and CIRE [21] to facilitate collaboration among users.
In collaborative filtering systems, one major issue is users' willingness to share information. It has been suggested that Lotus Notes was not well utilized because users had little or no incentive to share information [20]. The situation, however, is less problematic for Web searching, which consists mostly of voluntary contributions [21], as users are more willing to contribute in exchange for pride and popularity. Lastly, many systems attempt to minimize user effort by capturing user profiles and patterns automatically [2,23].
3 Research Questions
As discussed earlier, information monitoring and information sharing are two areas that are critical to criminal investigation and analysis. Although these topics have been widely studied in other areas such as Web applications and Web computing, they have not been adequately applied to address the unique problems in law enforcement. In our research, we aim to answer the following research questions:
• Can information monitoring and sharing techniques be effectively applied to existing law enforcement information systems?
• Can such a system alleviate the existing information monitoring and sharing problems in the law enforcement domain?
• How will such a system affect law enforcement personnel in their crime analysis and investigation work?
• Can the system improve the effectiveness and efficiency of current criminal investigation practices?
4 COPLINK Agent System Architecture
4.1 System Architecture Overview
To answer the above questions, we propose a modular, personal-agent-oriented architecture to support information monitoring and collaboration in law enforcement. Based on this architecture, we implemented a prototype system called COPLINK Agent, which is built on top of the COPLINK Connect system discussed in Section 2.1. The COPLINK Connect system, which supports data access and basic searching/sorting functions, serves as an ideal test environment for the proposed architecture, as we can readily add advanced monitoring, collaboration, and alerting functions to it.
Fig. 1. System architecture for COPLINK Agent (a Web-based user interface connected to the Searching and Monitoring Module, the Collaboration Module, and the Alerting Module, backed by a user profile database and the CAD, TPD, MVD, and TCC data sources; alert messages are delivered via Web messages, email, pager, and mobile phone)
Our proposed architecture is shown in Figure 1. It consists of a Web-based user interface and several functional modules: the Searching and Monitoring Module, the Collaboration Module, the Alerting Module, and the Personalization Module. The Searching and Monitoring Module is responsible for retrieving records from the database, keeping a list of monitoring tasks for each user, and performing these tasks periodically based on the user's preferences. The Collaboration Module facilitates the sharing of information among different users. The Alerting Module is responsible for keeping track of the messages for each user and delivering these messages through different communications channels. The Personalization Module keeps track of each user's search history and allows the user to customize various system settings. The functionalities of each module are described in detail below.
4.2 Searching and Monitoring Module
The Searching and Monitoring Module accepts search queries from users and forwards them to the corresponding data sources. In addition to the COPLINK database for TPD data used in COPLINK Connect, the Searching and Monitoring Module connects to three additional data sources: the Computer Aided Dispatch (CAD) database used at TPD, the Motor Vehicle Division (MVD) database of the state of Arizona, and the Tucson City Court (TCC) Web-based search engine. These databases provide
additional person, location, and vehicle information that is not available in the COPLINK database.
In addition to the search functionalities, this module also allows users to set up monitoring tasks for the available data sources. For instance, if a user wants to monitor all four data sources for a particular query, the monitoring task will be stored in the user profile database and the data sources will be automatically monitored for changes. Different mechanisms are used to monitor the data sources due to differences in their nature. The COPLINK database, to which our system has full access, is monitored by adding triggers to the database directly. For external databases, such as the CAD and MVD databases, the Searching and Monitoring Module sends periodic queries. For TCC, which is accessed through a Web-based search form, the system sends HTTP requests to periodically query the search engine. When relevant records are updated or inserted into the databases, the system sends an alert message to the user through the Alerting Module. Readers are referred to Zeng et al. [25] for a detailed technical discussion of how monitoring requests are handled by COPLINK Agent. A sketch of the periodic-polling mechanism is given below.
To facilitate the management of user requests, the searching and monitoring sessions are also stored in the user profile database. Every time a user logs on to the system, he/she can retrieve previous searches and monitoring histories from the user profile database. The user can then review previous search sessions as well as edit the settings of existing monitoring tasks.
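The following is a minimal sketch of the periodic-polling approach used for the external sources, assuming a generic query function; the names and the change-detection strategy (comparing result snapshots) are our own illustration rather than COPLINK Agent's actual implementation.

```python
import time
from typing import Callable, FrozenSet

def poll_source(run_query: Callable[[], FrozenSet[str]],
                alert: Callable[[str], None],
                interval_seconds: int = 3600) -> None:
    """Re-run a stored query periodically and alert on new records.

    run_query returns the current set of matching record IDs;
    alert forwards a message to the Alerting Module.
    """
    seen = run_query()
    while True:                          # a real system would run as a service
        time.sleep(interval_seconds)
        current = run_query()
        for record_id in current - seen:  # records inserted since last poll
            alert(f"New record {record_id} matches your monitoring task")
        seen = current
```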
4.3 Collaboration Module
To facilitate collaboration among law enforcement personnel, we developed a collaborative filtering module in COPLINK Agent. While traditional collaborative filtering relies on documents read (e.g., [17]) or items purchased by users (e.g., Amazon.com), we make use of users' search actions and search histories. The rationale behind this design is that when two users search for the same information in criminal databases, it is likely that they have similar information needs and may be working on two related cases. By storing and analyzing user search histories, the Collaboration Module facilitates such collaboration in two ways. First, when a user performs a search, the Collaboration Module can instantly identify other users who have performed a similar search in the past. For example, if a detective runs a search on a particular suspect, he/she can also view all the other users who have searched for information about this suspect. Second, the user can specify whether he/she wants to be notified when other users perform a similar search in the future. When this happens, the Collaboration Module notifies both users through the Alerting Module. The users can then contact each other to determine whether they have any information to share. Currently, we consider two searches to be similar only if the search query terms match each other exactly; other matching algorithms can easily be added and will be pursued in our future research.
In our initial user requirement study with TPD, detectives and crime analysts stated that it is very important to protect the confidentiality of police personnel and the cases on which they are working [5]. While it is safe and possibly beneficial to share a user's search history in some cases, in other cases users have to keep their search histories confidential and inaccessible to other users (e.g., cases that involve undercover or internal investigations). It is also important that the system does not send an overwhelming number of alert messages to users, which would otherwise create
yet another information overload problem. To cater to the different levels of confidentiality requirements and information needs, the Collaboration Module allows users to specify the confidentiality level and the alerting level of each search. More details about these levels are discussed in Section 5.2.

4.4 Alerting Module

The Alerting Module manages all the alert messages that should be sent to a user. Whenever a user sets up a task in the Searching and Monitoring Module or the Collaboration Module that may result in future alert messages, the user can specify how he/she wants to be notified. When an alerting condition is satisfied, the Alerting Module receives the alert messages from the collaboration and search modules. The messages are then saved in the database and delivered to the user via the specified communication channel. Currently, messages can be sent to a user instantly through e-mail, pager, or mobile phone. If the user is currently logged on to the system, the message can also be presented through the COPLINK Agent user interface. Otherwise, the user sees the message the next time he/she logs on to the system.
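To make the monitoring-to-alerting flow concrete, the sketch below shows how one periodic check against an external source (Section 4.2) could enqueue alert rows for the Alerting Module to deliver. It is a minimal illustration under assumed names: monitor_task, ext_cad_incident, alert_queue, and their columns are hypothetical, not the deployed COPLINK schema.

-- Hypothetical sketch: re-run each stored monitoring task against an external
-- source (here, a mirrored CAD table) and enqueue an alert for every record
-- inserted since the task was last checked.
INSERT INTO alert_queue (task_id, user_id, record_id, created_at)
SELECT t.task_id, t.user_id, c.incident_id, CURRENT_TIMESTAMP
FROM monitor_task t, ext_cad_incident c
WHERE c.last_name = t.last_name
  AND c.first_name = t.first_name
  AND c.inserted_at > t.last_checked;

-- Advance each task's watermark so the same records are not reported twice.
UPDATE monitor_task SET last_checked = CURRENT_TIMESTAMP;

A delivery process would then drain alert_queue through each user's chosen channels (e-mail, pager, or mobile phone), as described above.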
5 Sample User Sessions with COPLINK Agent

In this section, we present a usage scenario to highlight the various monitoring and collaboration functions of COPLINK Agent.

5.1 Searching and Collaborating

Suppose the user wants to search for a person; he/she can click on the tab “Perform New Search.” Currently, four types of searches are implemented, namely “Person/Organization Search,” “Vehicle Search,” “Location Search,” and “Incident Search.” All the search forms have a similar layout, while each form has its own specific search fields. A person search is used in our example. After the user clicks on “Person Search,” the corresponding search form is shown (see Figure 2). The search form shown in Figure 2 is divided into five input areas. The following shows how the user goes through each area and inputs the necessary information.

1) Database Selection: This allows the user to select which data sources are to be searched. In the example shown in Figure 2, the TPD database is chosen by the user as the information source.

2) Search Fields: The user can then enter the search criteria. The user wants to search for the records of a person named “Jason Sejkora,” so the user enters “Jason” in the first-name field and “Sejkora” in the last-name field.
Fig. 2. Person search and monitoring screen for COPLINK Agent
3) Collaboration Settings: The user can set the desired level of collaboration. In the upper portion, the user chooses the “Notification Level” of the search: to be notified when anyone performs the same search, when anyone in a specified unit performs the same search, or not to be notified at all. In the lower part of the same interface, the user chooses the “Confidentiality Level” of the search: to make the search visible to all other users, only to users in the same unit, or to nobody at all. In this example, the user chooses to be notified if any other user performs the same search, and to make this particular search visible to all other users.

4) Alerting Methods: This allows the user to specify how he/she wants to be notified if other users perform the same search. Multiple methods can be specified; the user can choose to receive notifications through e-mail, cellular phone messages, or Web messages.
Fig. 3. Managing searching and monitoring tasks
5) Notes: The user can enter personal notes relevant to the search in this area. The notes will be displayed in the alert messages.

After specifying all the information, the user can perform the search by clicking the “Search” button at the bottom of the search form. The search query is then forwarded to the specified database(s) and the search results are displayed to the user. When the number of search results is large, the user can click on the heading of any column to sort the records by the values of that column. Alternatively, before performing the search, the user can also click on the “List” button to see whether any other user has performed the same search before. This is called the “Instant Collaboration” function, which allows the user to consult other officers instantly to see whether there is any further information about the person being searched. When the user clicks on the “List” button, a screen is displayed showing all the users who have performed the same search in the past. The user can then click on the name of any of these users to retrieve their contact information and contact them directly.
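Because similarity is currently defined as an exact match on query terms, the “List” lookup reduces to a simple query over stored search sessions. The following is a hedged sketch: search_history, users, and their columns are assumed names rather than the deployed schema, and :current_user stands for a bound parameter.

-- Hypothetical "Instant Collaboration" lookup: other users who previously
-- ran the same person search, honoring each search's confidentiality level.
SELECT DISTINCT h.user_id, h.searched_at
FROM search_history h
WHERE h.search_type = 'PERSON'
  AND h.first_name = 'Jason'
  AND h.last_name = 'Sejkora'
  AND h.user_id <> :current_user
  AND (h.confidentiality = 'ALL'
       OR (h.confidentiality = 'UNIT'
           AND h.unit = (SELECT unit FROM users
                         WHERE user_id = :current_user)));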
5.2 Information Monitoring

At the search result screen, the user can choose to add a monitor to the search results. There are two types of monitoring: the user can choose to monitor changes to the existing records shown in the search results, or to monitor any future additions to the database that match the original search criteria. In this example, the user chooses to perform both types of monitoring. As with the collaboration function, the user can also specify how he/she wants to be notified when there is a change in the database that matches the monitoring task. After the alerting method is set, the user can select the “Add Monitor Tasks” option and a confirmation message is displayed.

5.3 Managing Search Sessions

When the user logs on to the system at a later date, he/she can retrieve all of his/her previous search sessions by choosing the “Manage Prior Searches” tab in the left-hand side menu. All of the user's previous search sessions are displayed (see Figure 3). The user can choose to review these searches or modify the monitoring settings of each of these sessions. For cases that have been solved, the user can also delete the corresponding search sessions and monitoring tasks from his/her profile so that he/she will not be alerted by future changes in the database.
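The two monitoring types map naturally onto two periodic checks over the source tables. The following is a rough sketch with assumed table and column names (person_record, monitored_record, inserted_at, updated_at) and :last_check / :task_id as bound parameters:

-- Monitoring future additions: new records matching the original criteria.
SELECT r.*
FROM person_record r
WHERE r.first_name = 'Jason'
  AND r.last_name = 'Sejkora'
  AND r.inserted_at > :last_check;

-- Monitoring existing results: previously returned records whose contents
-- were updated after the monitor was created.
SELECT r.*
FROM person_record r, monitored_record m
WHERE m.task_id = :task_id
  AND m.record_id = r.record_id
  AND r.updated_at > :last_check;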
6 Evaluation

In conjunction with the Tucson Police Department (TPD), we evaluated the effectiveness of COPLINK Agent by conducting a usability study in the summer of 2002. Details of our evaluation design and analysis follow.

6.1 Methodology

Our methodology for evaluating the COPLINK Agent system is a case study method incorporating structured interviews, usability surveys, and archival records analysis (e.g., a summary of user-added monitoring tasks and the alerts produced by the system). To select the required usability evaluation techniques, we first identified two usability goals and three dimensions of usability. The resulting usability metrics [9], encompassing the specific measures and techniques used, are shown in Table 1. The structured interviews with the pilot users were guided by the COPLINK Agent system log files, which include lists of the monitoring profiles that the pilot users added to the system as well as the alerts that the users received after matches were found. The subjective measure of user satisfaction was evaluated using a standard usability survey instrument. User comments on the database monitoring and collaboration functions were also collected, along with suggestions for interface and functionality improvements. Lastly, the qualitative data obtained from the interview sessions was triangulated with the quantitative results from the alert log ratings and usability surveys.
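As an aside, the archival-records portion of this analysis amounts to simple aggregates over the alert logs. A hypothetical sketch (alert_log and its rating encoding are assumed, not the actual study database):

-- Alerts per user per month over a three-month study, and the share rated
-- at or above "Somewhat Useful" (assuming rating >= 4 on a 7-point scale).
SELECT user_id,
       COUNT(*) / 3.0 AS alerts_per_month,
       AVG(CASE WHEN rating >= 4 THEN 1.0 ELSE 0.0 END) AS useful_share
FROM alert_log
GROUP BY user_id;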
Table 1. Usability metrics of the COPLINK Agent usability evaluation

Usability Objective    | Suitability for Investigative Tasks                            | Learnability
Effectiveness Measures | Percentage of alerts deemed useful (archival data + interview) | Percentage of functions learned (survey)
Efficiency Measures    | Time required to create a new monitoring profile (interview)   | Time to learn criteria (interview)
Satisfaction Measures  | Rating scales for overall usability (survey)                   | Rating scales for ease of learning (survey)
6.2 Participants

Fifteen detectives from TPD's Criminal Investigation Division (CID) were recruited to evaluate the COPLINK Agent prototype. The target participants were users with extensive prior experience using the COPLINK Connect system. The participant profile is shown in Table 2.

Table 2. Participant demographics

Job Classification | Sergeants 7%, Detectives 80%, Crime Analysts 13%
Police Units       | Gang 34%, Fraud 20%, Theft 13%, Robbery 13%, Sex Offense 20%
Gender             | Female 27%, Male 73%
6.3 Data Collection Procedures

Participants who received alerts were given listings of those alerts and were asked to rate the usefulness of each one. Based on the alert ratings and the list of monitoring tasks, each user was asked to provide his or her subjective rating of the alerts received, along with other relevant contextual information, including: the nature/type of the case; the search parameters in which the user was interested (last name, first name, date of birth, race, sex, etc.); the reasons behind adding a monitoring profile; the usefulness of any alert messages received; and any follow-up performed by the user for a particular alert. Participants were also asked to rate the effectiveness and efficiency of the database monitoring and collaboration functions, as well as to suggest desired new functionality. Suggestions for improving the current functions and interface were also collected. To gauge subjective user satisfaction, we modified the QUIS instrument reported by Chin et al. [8] to fit our application characteristics and added sections to gauge the effectiveness and efficiency of the monitoring and collaboration functions. QUIS was chosen because it provides specific ratings on the two types of system quality information in which we were interested: (1) overall reactions; and (2) four specific interface factors, including screen layout and sequence, terminology and system information, learning factors, and system capabilities.
6.4 Summary of Evaluation Study

Effectiveness and efficiency of COPLINK Agent. During our three-month testing period, a user received, on average, 5.5 alerts per month. Of the alerts received, approximately 32% were rated at or above “Somewhat Useful” on our scale. The users' subjective ratings of the alerts also averaged 5.5 on a 7-point scale (with 7 being the most useful), suggesting relatively high user satisfaction. The most typical reasons users added monitoring tasks include: 1) person monitoring: monitoring a suspect, a witness, an informant, or someone who is on parole; 2) address monitoring: monitoring the exact address, or the address of the apartment complex, of a suspect; and 3) license plate monitoring: monitoring a specific car whose license plate number is of interest to the detectives. As to the monitoring and collaboration functionalities, users were generally pleased with the system's capabilities for assisting criminal investigations. One user commented, “Although I only have it ‘watching’ for 2 names, the information I have received back was instrumental in making at least 2 felony cases that will be prosecuted on the federal level.” Another user commented, “Investigating gangs requires extensive use of networking and sharing of information. I find this option valuable.” In terms of overall effectiveness, COPLINK Agent also garnered positive feedback and one recent success story: a crime analyst involved in our evaluation study had previously added a monitoring task for a particular fraud suspect in our system. One day she received an alert about her suspect using counterfeit money in a local convenience store, was able to follow up on the case, and obtained the videotape from the store's surveillance camera. The alert led to two felony charges against the criminal at the federal level. Had the crime analyst not received the COPLINK Agent alert in a timely manner, she would have had to wait several weeks to see the case report. By that time, the critical videotape might have been destroyed, as the convenience store keeps videotapes for only 30 days. As to the efficiency of operating the COPLINK Agent system, most users were able to finish adding a new monitoring profile within 2-5 minutes.

User satisfaction with COPLINK Agent. The short form of the QUIS instrument averaged 5.5 across 27 items on a 7-point Likert scale (7: most useful). A profile analysis identified the weaknesses of COPLINK Agent as: lack of help messages, difficulty for inexperienced users, and obscure user preference settings. Its strengths include: good investigative power, an easy-to-read layout, potential for collaborative information sharing, CAD integration, and high intention to use. We used the user-satisfaction feedback to create a list of system enhancements that we plan to implement in the next phase of COPLINK Agent development.

6.5 Discussion

In this section, we discuss the lessons learned from our field study of COPLINK Agent in TPD's criminal investigative units. First, in order to harness the full potential of COPLINK Agent's advanced information monitoring/filtering functionalities, the databases monitored by COPLINK Agent need to be checked on a near real-time basis. Although the TPD database records are examined once per day by our system, most
pilot users expressed interest in significantly increasing the frequency of database monitoring, e.g., allowing the system to check for new updates every 10 minutes or at least every hour. Some user comments in this area include: “The only other improvement I could ask for would be it query a couple times a day as opposed to once every 24 hours.” Another user states, “Detective could have been dispatched immediately, if notification had been in real time.” Second, the system under evaluation appeared to be quite effective in connecting people together, an essential functional aspect of the knowledge management framework proposed by O’Leary [19]. In our case, the alerts provided by COPLINK Agent facilitate inter-unit information and knowledge sharing [13] by creating a short and direct network path from detectives to field officers who have direct access to important knowledge concerning the cases under investigation. A TPD detective commented on the alert messages’ usefulness, “COPLINK Agent is allowing us to respond to incidents we know are important that the field units perhaps don’t realize in a timely manner.” Most subjects shared this assessment. As noted by another officer, “COPLINK Agent is good because so many times we complain that we don’t get information from the field. This way I know who ran (a query in our database on) someone and can inquire as to why.”
7 Conclusions and Future Directions

In this paper, we report our work on designing and evaluating a prototype system for information monitoring and sharing in the law enforcement domain. The system, called COPLINK Agent, is based on an agent system architecture designed to provide automatic information filtering and to facilitate knowledge sharing among police personnel. Through the advanced information monitoring and sharing functions provided by the searching, collaboration, and alerting modules, COPLINK Agent can help alleviate the information overload problem faced by many police officers and detectives. COPLINK Agent has also been shown to improve the effectiveness and efficiency of criminal investigations in areas such as gang, theft, and fraud cases. Given the encouraging results from the user study, we plan to incorporate the functionalities of the COPLINK Agent system into the latest COPLINK Connect system deployed at TPD. Afterwards, larger-scale testing will be performed to study how the number of users affects the usability of the system. Additionally, we continue to work on improving each individual component of the COPLINK Agent system. Specifically, we are working on supporting more alerting methods, as well as incorporating several load-balancing and task-scheduling algorithms in the monitoring module so that monitoring tasks can be scheduled effectively to avoid overloading particular databases. Load-balancing techniques will gradually become more important as the number of users increases. We are also enhancing the collaboration module by adding fuzzy matching techniques that can be used to identify similar searches. Lastly, we plan to apply data mining techniques to the user profile database in order to group similar users together and analyze their search patterns.
Acknowledgement. The work described in this report was substantially supported by the following grants: (1) NSF Digital Government Program, “COPLINK Center: Information and Knowledge Management for Law Enforcement,” #9983304, July 2000–June 2003; (2) National Institute of Justice, “COPLINK: Database Integration and Access for a Law Enforcement Intranet,” July 1997–January 2000; (3) NSF/CISE/CSS, “An Intelligent CSCW Workbench: Personalization, Visualization, and Agents,” #9800696, June 1998–June 2001. We also would like to thank all of the personnel from TPD who participated in this study. In particular, we would like to thank Lt. Jenny Schroeder, Det. Tim Petersen, and Dan Casey. Lastly, we also would like to thank members of the COPLINK Team at the University of Arizona and Knowledge Computing Corporation (KCC) for their support.
References

1. R. Adderley and P.B. Musgrove (2001). Data Mining Case Study: Modeling the Behavior of Offenders Who Commit Serious Sexual Assaults. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 26–29, 2001.
2. R. Armstrong, D. Freitag, T. Joachims and T. Mitchell (1995). WebWatcher: A Learning Apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, March 1995.
3. M. Balabanovic and Y. Shoham (1997). Fab: Content-Based, Collaborative Recommendation. Communications of the ACM, 40(3), 66–72.
4. M. Brown (1998). Future Alert Contact Network: Reducing Crime Via Early Notification. http://pti.nw.dc.us/solutions/solutions98/public_safety/charlotte.html.
5. M. Chau, H. Atabakhsh, D. Zeng, and H. Chen (2001). Building an Infrastructure for Law Enforcement Information Sharing and Collaboration: Design Issues and Challenges. In Proceedings of the National Conference for Digital Government Research, Los Angeles, California, May 21–23, 2001.
6. M. Chau, D. Zeng, H. Chen, M. Huang, and D. Hendriawan (2003). Design and Evaluation of a Multi-agent Collaborative Web Mining System. Decision Support Systems, Special Issue on Web Retrieval and Mining, 35(1), 167–183.
7. H. Chen, J. Schroeder, R.V. Hauck, L. Ridgeway, H. Atabakhsh, H. Gupta, C. Boarman, K. Rasmussen, A.W. Clements (2003). COPLINK Connect: Information and Knowledge Management for Law Enforcement. Decision Support Systems, 34(3), 271–285.
8. J.P. Chin, V.A. Diehl, and K.L. Norman (1988). Development of an Instrument Measuring User Satisfaction of the Human-Computer Interface. In Proceedings of ACM CHI ’88, Washington, DC, pp. 213–218.
9. A. Dix, J. Finlay, G. Abowd, and R. Beale (1998). Human Computer Interaction. Englewood Cliffs, NJ: Prentice Hall.
10. R. Dussault (2000). Maps and Management: Comstat Evolves. Government Technology, Crime and the Tech Effect, April 2000.
11. M. Ginsburg (1998). Annotate! A Tool for Collaborative Information Retrieval. In Proceedings of the 7th IEEE International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE ’98), IEEE CS, Los Alamitos, California, 1998, 75–80.
12. D. Goldberg, D. Nichols, B. Oki and D. Terry (1992). Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, 35(12), 61–69.
13. M. Hansen (2002). Knowledge Networks: Explaining Effective Knowledge Sharing in Multiunit Companies. Organization Science, 13(3), 232–248.
14. R.V. Hauck and H. Chen (1999). COPLINK: A Case of Intelligent Analysis and Knowledge Management. In Proceedings of the 20th Annual International Conference on Information Systems, Charlotte, December 13–15, 1999.
15. M.J. Hoogeveen and K. van der Meer (1994). Integration of Information Retrieval and Database Management in Support of Multimedia Police Work. Journal of Information Science, 20(2), 79–87.
16. P.B. Kantor, E. Boros, B. Melamed, V. Menkov, B. Shapira, and D.J. Neu (2000). Capturing Human Intelligence in the Net. Communications of the ACM, 43(8), 112–115.
17. J.A. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon and J. Riedl (1997). GroupLens: Applying Collaborative Filtering to Usenet News. Communications of the ACM, 40(3), 77–87.
18. B. Miller (1996). Searchable Databases Help Missouri Solve Crime. Government Technology, 9(8), 18–19.
19. D.E. O'Leary (1998). Knowledge Management Systems: Converting and Connecting. IEEE Intelligent Systems, 13(2), 30–33.
20. W. Orlikowski (1992). Learning from Notes: Organizational Issues in Groupware Implementation. In Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW ’92), 1992, 362–369.
21. N. Romano, D. Roussinov, J.F. Nunamaker, and H. Chen (1999). Collaborative Information Retrieval Environment: Integration of Information Retrieval with Group Support Systems. In Proceedings of the 32nd Hawaii International Conference on System Sciences (HICSS-32), 1999.
22. U. Shardanand and P. Maes (1995). Social Information Filtering: Algorithms for Automating “Word of Mouth.” In Proceedings of the ACM Conference on Human Factors in Computing Systems, Denver, Colorado, May 1995.
23. B. Starr, M. Ackerman, and M. Pazzani (1996). Do-I-Care: A Collaborative Web Agent. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’96), 273–274.
24. J. Yen, A. Chung, H. Ho, B. Tam, R. Lau, M. Chau, K. Hwang (1999). Collaborative and Scalable Financial Analysis with Multi-agent Technology. In Proceedings of the 32nd Hawaii International Conference on System Sciences, Maui, Hawaii, January 5–8, 1999.
25. D. Zeng, M. Dror, and H. Chen (2002). Efficient Scheduling of Periodic Information Monitoring Requests. Submitted to INFORMS Journal on Computing.
Active Database Systems for Monitoring and Surveillance

Antonio Badia
Computer Engineering and Computer Science Department
University of Louisville, Louisville, KY 40292
[email protected]

Abstract. In many intelligence and security tasks it is necessary to monitor data in databases in order to detect certain events or changes. Currently, database systems offer triggers to provide active capabilities. Most triggers, however, are based on the Event-Condition-Action paradigm, which can express only very primitive events. In this paper we propose an extension of traditional triggers in which the event is a complex situation expressed by a Select-Project-Join-GroupBy SQL query, and the trigger can be programmed to look for changes in the situation so defined. Moreover, the trigger can be directed to check for changes on a periodic basis. After proposing a language in which to define changes, we sketch an implementation, based on the idea of incremental view maintenance, to efficiently support our extended triggers.
1 Introduction
In the past, databases were passive, low-level repositories of data on top of which smarter, domain-focused applications were built. Lately, databases have taken a more active role, offering more advanced services and higher functionality to other applications. In this framework, the database assumes responsibility for executing some tasks previously left to the application, which offers several advantages: the possibility of better performance (since the database has direct access to the data and knows how the data is stored and distributed), better data quality (since the database is already in charge of basic data consistency), and better overall control. However, this trend has resulted in the database taking on more ambitious roles, and having to provide more advanced functionality than in the past. One of the areas where this trend is clear is active databases. In the past, the database could monitor data and respond to certain changes via triggers, also called rules (in this paper, we use the terms rule and trigger as equivalent). However, commercial systems offer very limited capabilities in this sense. In addition to problems of performance (triggers add quite a bit of overhead) and control (because of the problems of non-terminating, non-confluent trigger sets), trigger systems are very low-level: while the events that may activate a trigger are basic database actions (insertions, deletions, and updates), users are interested in complex conditions that
may depend on several database objects and their interactions. It is difficult to express these high-level, application-dependent events in triggers. In this paper, we describe an ongoing project whose goal is to add advanced monitoring and control functionality to database systems through the design and development of extended rule systems. In a nutshell, we develop triggers in which more complex events can be stated, thus letting system users specify, in a high-level language, the patterns they need to monitor. Since performance is still an issue, we also develop efficient algorithms based on the idea of incremental recomputation, already used in the evaluation of materialized views in data warehouses ([9]). As a result of the added functionality, a database system will be able to monitor the appearance of complex patterns and to detect changes in those patterns. Other research in active databases has not dealt with this issue. Our approach is focused on concepts that may have a practical impact; in particular, we aim at expressing more complex events, making it easier for database users to specify the conditions they are interested in monitoring, but we also propose an efficient implementation, something which is absent from most research in the area.
2 Background and Related Research
In most database systems (certainly in all commercial systems), active capabilities are incorporated through the ability to define triggers. A trigger has the form Event-Condition-Action (ECA). The typical events considered by active rules are primitives for database state changes, like insertions, deletions, and updates from/to database tables. The condition is either a database predicate or a query (the query is implicitly considered true if it returns a non-empty answer, and false otherwise). The action may include transactional commands, rollback or rule manipulation commands, or sometimes may activate externally defined procedures, including arbitrary data manipulation programs. Rules are fired when a particular event occurs; the condition is then evaluated, and if found true, the action is executed. This simple scheme has been found lacking for several reasons ([4, 26]). Mainly, the events used in triggers are considered too low-level to be useful for many applications; a great deal of research in active databases has focused on defining more complex events ([16, 17, 15, 12, 10, 14]). In basically all the previous research, complex events are obtained by combining primitive events in some event language, which usually includes conjunction, disjunction, negation, and sequencing of events ([27]). Some approaches include time primitives ([23, 21, 18]), sometimes based on some temporal logic ([22, 6]). Although none of these projects addresses the issue we are dealing with here (active monitoring of complex conditions), we note that [24] also proposes using incremental recomputation to compute complex events (described as queries), as we do, and that [2] proposes incremental computation of temporal queries. However,
these works have no concept of active monitoring. Finally, [20] also proposes a system for monitoring. (It is worth noting that such research, while containing many worthwhile ideas, has seen little practical use, possibly due to two concerns. First, even though some research has been implemented in systems ([14, 21, 25, 17]), efficiency is not addressed in most approaches ([12, 24] are some exceptions); second, sophisticated logic-based languages, as proposed in the research literature ([8, 6, 23, 27]), are highly expressive, but probably outside the comfort zone of most programmers, and certainly most users.)

We take an approach different from previous research, based on the observation that analysts are usually interested in much higher-level events, which are application- and goal-oriented: in particular, they screen for conditions which deviate from the normal or standard, or for complex conditions which may involve several objects and their relationships. As a simplified example, assume a database with two relations, PEOPLE(name, country) and CALLS(called, caller, date), where we keep a list of suspicious people and their country of residence, and also a list of telephone calls among them as intercepted by signals intelligence. Both called and caller are foreign keys for name. At some point, an analyst is following a suspected terrorist (let's call him 'X') and wants to know from which country he receives the most calls. The information can easily be obtained from the database (see below), but once it is obtained, the analyst would like to follow up on this query by monitoring changes: in particular, the analyst may be interested in being alerted when the country from which 'X' receives the highest number of calls changes. Since sending an alert is an action that must be taken only under certain circumstances, a trigger is the obvious way to implement this functionality. However, the event of interest to the analyst (when the country from which 'X' receives the most calls changes from the current one) cannot be expressed with trigger events, which are limited to checking for insertions, deletions, and updates in relations. Note that insertions in CALLS are the only way in which the current top-calling country could change. Thus, one could simulate the desired trigger by using insertions in CALLS as events, and then computing the desired information. A simple SQL query can provide a list of the countries from which 'X' is called, in descending order of the number of calls, so that the top-calling country is in the first row of the answer:

SELECT country, COUNT(*) AS numcalls
FROM PEOPLE, CALLS
WHERE caller = name AND called = 'X'
GROUP BY country
ORDER BY numcalls DESC

There are, however, two problems with this approach: it is both conceptually hard and computationally inefficient. It is hard because the above still does not give us the answer: one should keep the name of the top-calling country in some table or variable and compare it with the name in the first row of the above query every time it is recomputed. Thus, quite a bit of programming is needed
to implement a relatively simple request. It is inefficient because the trigger is still fired for every table insertion, and therefore its complex condition (the query above) must be evaluated every time. A possible approach would be to use the above SQL query to define a view or table T, and declare the trigger over T. In some systems, views cannot have triggers, and hence T needs to be a table. This is clearly undesirable, since T is, conceptually, a view (i.e., it needs to be updated whenever the tables it is based upon are updated). Even if the system allows triggers on views, there are several monitoring behaviors the analyst may be interested in, only some of which are expressible with regular triggers:

– Continuous monitoring, immediate reaction: this is what a trigger does. Every single change in T (insertion, deletion, update) fires the trigger; as soon as a change is detected, an action takes place. This gives us real-time monitoring and is certainly useful in certain situations. Note that some programming would still be necessary: because we are looking for changes to a situation, we need to store the current situation (which country is the top producer of calls) and compare it after every event with the new result.

– Continuous monitoring, delayed action: recheck the situation after every single change in T, but if the condition is found to be true, take action only at certain specified points in time. Delayed action is adequate for periodic reporting. This could be simulated with a trigger (storing changes in a temporary relation, for instance) at the cost of more programming.

– Periodic monitoring, immediate action: recheck the situation at certain specified intervals (for instance, every month), and execute an action whenever a check detects a change. Note that this does not give us real time, since by the time the change is detected, the change itself may have taken place some time ago. This is good, though, when we need regular and constant monitoring of a situation but do not need to be immediately aware of every single change. Again, this could be simulated in some trigger systems, depending on what is allowed in the condition part, at the cost of quite a bit of programming.

– Periodic monitoring, delayed action: recheck the situation periodically as in the previous case, and execute the action, if changes are detected, at certain specified points in time. This could also be simulated in some trigger systems, depending on what exactly is allowed in the condition and action parts.

Note that all cases can be simulated with some trigger systems. Most systems allow arbitrary programs in the condition and action parts of a trigger; therefore, this is equivalent to writing a little program for each condition we want to monitor. As stated above, this is clearly inefficient because of the human effort (programming) and machine effort (trigger execution) involved. Clearly, a more flexible approach is needed.
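For illustration, the "little program" workaround might be wired up roughly as follows. This is our own schematic sketch, not a construct from the paper: the helper tables TOP_COUNTRY and ALERTS are assumed, the trigger body is SQL/PSM-flavored and varies by vendor, and some engines would even reject reading the subject table inside its own trigger, which is part of the clumsiness being criticized.

-- Assumed helper table holding the last observed top-calling country:
-- CREATE TABLE TOP_COUNTRY (suspect VARCHAR(40), country VARCHAR(40));

CREATE TRIGGER check_top_country
AFTER INSERT ON CALLS
FOR EACH ROW
BEGIN
  DECLARE new_top VARCHAR(40);
  -- Recompute the whole aggregate on every single insertion: the
  -- inefficiency discussed above.
  SET new_top = (SELECT country
                 FROM PEOPLE, CALLS
                 WHERE caller = name AND called = 'X'
                 GROUP BY country
                 ORDER BY COUNT(*) DESC
                 FETCH FIRST 1 ROW ONLY);
  IF new_top <> (SELECT country FROM TOP_COUNTRY WHERE suspect = 'X') THEN
    UPDATE TOP_COUNTRY SET country = new_top WHERE suspect = 'X';
    INSERT INTO ALERTS (msg) VALUES ('top-calling country changed');
  END IF;
END;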
3 The Proposal
The aim of our approach is to overcome the limitations described in the previous section. We would like to develop an approach that provides analysts with the tools needed to monitor real-life, complex events in a conceptually simple and efficient manner. Consequently, the project has two parts: the development of languages and interpreters for extended triggers, and the design of algorithms to support efficient computation of the extension. Each part is discussed next.
3.1 Extended Triggers
In our proposal, we develop extended triggers, or triggers with extended events, which correspond to high-level, semantic properties. The events are able to monitor evolution and change in the data, by means of a language in which to represent changes and complex conditions. We call this active monitoring. Using extended triggers, an analyst is able to state naturally and simply, in a declarative language, which activities, changes, or states are noteworthy from the analyst's point of view. Our extensions are based on several intuitions. First, the mismatch between the events currently allowed in triggers (called database events) and the events we want to monitor (called semantic events) is due to a difference in levels: semantic events are high-level, related to the application; database events are low-level, related to the database (in our example, the top-calling country vs. insertions on CALLS). Therefore, a mechanism is needed to bridge the gap, one that will express the semantic event in terms of database events. However, expressing the semantic event is not enough, since we are interested in monitoring changes in that event (in our example, changes in the top-calling country). Hence, a language in which to express changes is also needed. Second, even if the previous mismatch did not exist, triggers are not adequate for the task of active monitoring described above, since this task requires knowing when to start, when to stop, and how often to check. This information cannot be expressed in current triggers, which are more of a one-time action: although the trigger is fired repeatedly as the event repeats, each firing is an isolated event, unrelated to the others, unless a link or history is established by adequate programming of the trigger. Finally, and as a result of the mismatch, many database events must happen before effecting a significant change in a semantic event (in our example, many calls may have to be inserted into CALLS before the top-calling country changes). This accumulation naturally happens over time and over the size of the database. Thus, it is inefficient to check for a condition after every database event; it is more efficient to do so periodically. We propose a language which establishes: a) a certain environment or baseline in which to express semantic events; b) the changes to the baseline that the system can monitor; and c) an interval that determines how long and how often to monitor those changes. The baseline is established by an SQL query that specifies the context in which changes must be examined. To establish the interval, a starting point, an end point, and a frequency must be defined. Our
language supports interval definitions in two dimensions: time and size (of the database), as discussed above. The following specification is proposed (keywords are all in uppercase):

<rule>         := BASELINE <query> <modification> <interval> <action>
<modification> := <change> IN <option>
<change>       := CHANGE | INCREASE | DECREASE
<option>       := ORDER | NUMBER-GROUPS | TOP n | BOTTOM n [attribute]
<interval>     := START <start-point> END <end-point> EVERY <frequency>
<start-point>  := SIZE n | NOW | <date-expr> FROM NOW
<end-point>    := SIZE n | <date-expr> | <date-expr> FROM NOW
<date-expr>    := n <time-unit>
<time-unit>    := MINUTE | HOUR | DAY | WEEK | MONTH | YEAR
<frequency>    := SIZE n | SIZE n% RELATIVE | SIZE n% ABSOLUTE | <date-expr>
<action>       := WARN IMMEDIATE | WARN ON <end-point>

This language corresponds to the intuition described above. The baseline is specified by an SQL query. We allow several classes of queries: SPJ (Select-Project-Join) queries, SPJG (Select-Project-Join-Group) queries, and SPJGO (Select-Project-Join-Group-Order) queries; our example above was of type SPJGO (the reason for distinguishing between these types of queries is made clear next). The modification refers to the event to be monitored. Each baseline is a relation; we propose to monitor changes in that relation, and our language is designed for that task. The modifications that make sense for a relation depend on the type of query that created it. Clearly, modifications in order do not seem meaningful if the query was not created with an ORDER BY clause. Likewise, computations that make sense on values obtained by aggregation may not be as meaningful on raw values. Thus, the modifications that we propose to monitor are tied to the type of query that created the baseline. For SPJ queries, it makes sense to look for changes in the number of tuples in the relation (insertions, deletions), or changes in the data itself (updates), as specified by a simple, declarative condition: the existence (or not) of a particular value, or a simple constraint on the existing values. For numeric values, simple arithmetic constraints are useful: for instance, that the value under monitoring is less than a given constant. For SPJG queries, all the changes mentioned for SPJ can be meaningful; besides those, changes in the value of particular aggregates may also make sense, as well as simple relationships among aggregates (for instance, ratios among them). For SPJGO queries, besides the changes for SPJG queries, changes in the order, or in the top N or bottom N elements, make sense.

The interval specifies when we will look at the data to evaluate the modifications. It is specified, for both size and time, via three components. The first one is a starting point, which specifies when monitoring begins. For size, a database size or a table size (or set of table sizes) can be used. For now, the size given has to be equal to or greater than the current sizes; this means that monitoring starts right away or at some point in the future. Allowing sizes smaller than currently exist could be given the meaning of starting the monitoring at some point in the past. However, this possibility is left as further research down the road, since it is unclear how to
support it efficiently. (We assume that only insertions, and not deletions, are relevant, and therefore that relations grow over time. This is a simplifying assumption; there is no reason the framework cannot be extended to deal equally well with insertions and deletions, allowing negative rates of growth or other measures of change.) For time, a constant NOW, a date later than now (that is, an absolute date), or a date expression relative to now is used. Clearly, calendar sensitivity is needed in order to support this functionality. Absolute dates allow the user to specify dates like September 11, 2003, while relative dates give the user the ability to specify dates based on the current time: 10 days from now and 1 month from now are both examples of relative dates. Again, dates are restricted to be equal to or later than the current date, as it is assumed that all monitoring takes place in the present and/or the future. Extending the techniques given here to arbitrary monitoring will be considered at later stages.

Second, an ending point specifies when monitoring ends, and is again based on size or time. For size, a database size, expressed as an absolute number or as a percentage of growth (e.g., 10%), can be given; the interval ends when the absolute size is reached, or when the size reaches the starting size times the percentage of growth allowed. For time, again an absolute or relative date is used, or a duration relative to now. Once again, calendar sensitivity is needed.

Third, a frequency specifies how often the monitoring must occur, and is also expressed based on size or time. (For now, the system is restricted to specifying the starting point, ending point, and frequency in the same dimension, i.e., all three in size or all three in time; combinations will be considered as further work.) For size, again an absolute number or a percentage of growth must be given. However, now the percentage can be relative or absolute. An absolute percentage is always calculated with respect to the initial size, while a relative percentage is calculated with respect to the size of the database at the time of the last monitoring. For instance, a table with 1,000 rows on which a rule was specified with a frequency of 20% absolute would be checked every 200 insertions: at sizes 1,200, 1,400, ... A frequency of 20% relative, however, would be checked whenever there is 20% growth with respect to the last check: at sizes 1,200, 1,440 (20% of 1,200 is 240), 1,728 (20% of 1,440 is 288), ...
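In our notation (not the paper's), with initial size $s_0$ and growth fraction $p$, the size-based check points can be stated compactly:

$s_k^{\mathrm{abs}} = s_0(1 + kp), \qquad s_k^{\mathrm{rel}} = s_0(1 + p)^k, \qquad k = 1, 2, \ldots$

which reproduces the sequences above for $s_0 = 1000$ and $p = 0.2$.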
For time, a number of time units should be specified; time units refers to minutes, hours, days, weeks, months, or years. Note that these values form a hierarchy, and this can be exploited in the specification (every 24 hours is the same as every 1 day; a combination of units could be used, as in every week and every month). Again, calendar sensitivity is needed to support this feature. Finally, the action refers to when the action takes place: immediately (keyword IMMEDIATE) or delayed to a certain point (specified just like an end point, over time or size). Note that actions will not really be immediate with respect to when the condition obtains, but with respect to the moment it is detected, which depends on the frequency of the monitoring. Note also that this specifies only when the action takes place, not which actions are allowed. For now, we consider only a warning, an external action (that is, an action not affecting the database state). This is enough for some purposes, but clearly not sufficient for a general approach.
Developing types of actions (and studying the consequences of allowing them) is a central part of this project. While the above language is clearly limited, it is able to express conditions that have practical use: we are able to fire a trigger every month for one year, 20 times a week, or whenever a table grows by 10%. As an example, assume the intelligence officer in our first example wants to monitor the condition every month for a year. Then the rule is given as:

BASELINE <same query as above>
CHANGE IN TOP 1 country
START NOW
END 1 YEAR FROM NOW
EVERY 1 MONTH
WARN IMMEDIATE

This rule will examine the baseline, looking for changes in the country with the most calls (the top row in the ordering), every month for the next 12 months, and will issue an alert to the user whenever one of the checks detects a change.
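For comparison, the closest a conventional system comes to this rule is an externally scheduled job. The sketch below uses MySQL's event scheduler to emulate the monthly check for a year; it is a rough, assumed emulation (the helper tables TOP_COUNTRY and ALERTS are hypothetical, and updating the stored top country is omitted for brevity), and it still lacks the incremental evaluation developed next.

CREATE EVENT monitor_top_country
ON SCHEDULE EVERY 1 MONTH
STARTS CURRENT_TIMESTAMP
ENDS CURRENT_TIMESTAMP + INTERVAL 12 MONTH
DO
  -- Raise a warning when the current top-calling country for 'X' differs
  -- from the stored one.
  INSERT INTO ALERTS (msg)
  SELECT CONCAT('top-calling country for X is now ', t.country)
  FROM (SELECT country
        FROM PEOPLE, CALLS
        WHERE caller = name AND called = 'X'
        GROUP BY country
        ORDER BY COUNT(*) DESC
        LIMIT 1) AS t
  WHERE t.country <> (SELECT country FROM TOP_COUNTRY WHERE suspect = 'X');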
3.2 Implementation
Providing a language in which to naturally represent complex events is only half the story; one important aspect of this problem is efficiency. Monitoring multiple aspects over large amounts of data may lead to a slowdown of the system, which creates a problem since there may be strict time requirements on the system's answers. Therefore, it is fundamental that the system be designed to support active monitoring with a minimum of overhead. Our approach to this problem is to use tools developed for query optimization in data warehousing: incremental recomputation techniques developed for materialized view maintenance ([19, 11, 7, 9]). The basic idea is simple: when monitoring for changes, we are recomputing an expression over and over in order to detect change. (We note that [24] also proposes using incremental recomputation for complex conditions in triggers; however, that work does not include a monitoring component, so there is no control over when to start or finish checking, or how often to check.) Instead of recomputing the expression from scratch each time, the temporary result of each evaluation is stored, and a formula is derived from the original expression which indicates how to update that temporary result. Thus, the baseline query is the expression for which we derive an incremental recomputation expression (IRE). The IRE includes an equation showing how to recompute the query, and also any auxiliary counters, variables, or relations that need to be created in order to monitor the modification specified in the rule. In our first example above, instead of recomputing the SQL query every time there is an insertion into the CALLS table, we simply create a table storing the result of executing the query, and (after the first execution of the query) a variable with the name of the country with the most calls. With incremental recomputation techniques, we can automatically derive an expression for the SQL query which allows us to update the table, and then check whether the top country has changed by comparing the country in the first row of the updated table with the one stored. Note that the temporary
table will only care about calls to 'X'. Thus, when 'X' receives another call or set of calls (and only then), the temporary result is updated, a simple task which consists of locating a group and changing one number. When the situation that the analyst is looking for happens, the system can detect it easily, and with little overhead. The architecture of the system involves a parser and compiler for the language given in the previous section, plus a rule processor which automatically deduces the IRE for a given rule, creates any needed auxiliary counters and tables, and executes the rule update. Note that this may involve quite a bit of reasoning, but it is done only once, at rule creation time. The algorithm followed by the rule processor, in broad strokes, is as follows:

1. At definition time, create a materialized view for the baseline and derive an IRE for it. Also create auxiliary counters, variables, or relations as needed by the change to be monitored.
2. Also at definition time, compute from the interval information when to start monitoring, when to stop, and how often to check. Start a daemon to keep track of the checks.
3. At every insertion or set of insertions, use the IRE to update the materialized view, auxiliary counters, and relations.
4. Whenever the daemon signals that it is check time, test for the change being monitored. If the change has occurred, execute the action.

While the techniques of incremental recomputation are well known for relational algebra expressions involving the selection, projection, and join operators, our research extends these rules to queries with grouping and ordering. Let $R$ be a relation, let $\Delta R$ denote a set of insertions into $R$, and let $R^{new} = R \cup \Delta R$ denote the result of updating $R$ with the insertions. A differential equation expresses the result of applying a relational operator to $R^{new}$ in terms of applying it to $R$. It is well known that

$\sigma_c(R^{new}) = \sigma_c(R) \cup \sigma_c(\Delta R)$

$R^{new} \bowtie S^{new} = (R \bowtie S) \cup (\Delta R \bowtie S) \cup (R \bowtie \Delta S) \cup (\Delta R \bowtie \Delta S)$

In our first example, the first step calls for the creation of a differential equation for the baseline query. This query can be expressed, in relational algebra, as

$ORDER_{numcalls}(GB_{country,count(*)}(\sigma_{called='X'}(PEOPLE \bowtie CALLS)))$

In order to create a differential expression for this expression, we need rules for the GROUP BY node and the ORDER node. Research on the incremental maintenance of materialized views ([9]) suggests the following rule for grouping:

$GB_{A,F(B)}(R^{new}) = GB_{A,F(B)}(R) \stackrel{+}{\cup}_{A,F,B} GB_{A,F(B)}(\Delta R)$

where the operation $R \stackrel{+}{\cup}_{A,F,B} S$ is defined for arbitrary relations $R$, $S$, whenever $sch(R) = sch(S)$, $A, B \in sch(R)$, and $F$ is an aggregate function, as

$(\pi_{A,F'(R.B,S.B)}(R \bowtie S)) \cup \pi_{A,R.B}(R \,\mathrm{LAJ}\, S) \cup \pi_{A,S.B}(R \,\mathrm{RAJ}\, S)$

where LAJ is the left antijoin of $R$ and $S$ (that is, all tuples in $R$ without a match in $S$), RAJ is the right antijoin of $R$ and $S$ (i.e., all tuples in $S$ without a match in $R$),
and $F'$ is an aggregation function derived from $F$ (for $F = count$ or $sum$, $F' = sum$; for $F = min$ or $max$, $F' = F$; for $F = avg$, $F'$ is computed from the count and sum of the tuples in each group, which can be maintained as indicated). In its use in the incremental computation of the grouping, the operation allows us to combine tuples with the same grouping values into one tuple, achieved by the computation of $F'$ in the result of the join (since the relations input to $\stackrel{+}{\cup}$ are first grouped by $A$, there is only one tuple per group in each relation, and the join brings the matching tuples together), and to incorporate directly into the result those tuples that have no counterpart in the other relation, achieved through the left and right antijoins. Note that this is a definition; an implementation of this operator can be carried out in one pass over the data using variants of the full outerjoin algorithm. As for the ordering node, it is easy to see that

$ORDER(R^{new}) = ORDER(ORDER(R) \cup ORDER(\Delta R))$

Again, while the equation is highly redundant, an efficient implementation can be achieved given that $R$ is assumed to be already ordered. Hence, $\Delta R$ can be sorted by whichever method is most efficient, and the last ordering can be done by merge sort, since the inputs will already be sorted. Thus, the final step is quite efficient. We can now calculate the differential equation for insertions into PEOPLE and CALLS as follows:

$ORDER(\,ORDER(GB_{country,count(*)}(\sigma_{called='X'}(PEOPLE \bowtie CALLS))) \stackrel{+}{\cup}$
$\quad ORDER(GB_{country,count(*)}(\sigma_{called='X'}(\Delta PEOPLE \bowtie CALLS))) \stackrel{+}{\cup}$
$\quad ORDER(GB_{country,count(*)}(\sigma_{called='X'}(PEOPLE \bowtie \Delta CALLS))) \stackrel{+}{\cup}$
$\quad ORDER(GB_{country,count(*)}(\sigma_{called='X'}(\Delta PEOPLE \bowtie \Delta CALLS)))\,)$

The first part of the expression is simply the old expression, stored as a materialized view; the other three parts have to be computed. Again, note that the final implementation can be highly optimized. For instance, we can reason that $\Delta PEOPLE \bowtie CALLS$ is always empty, since PEOPLE contains a primary key and CALLS a foreign key on which the join is based, and it is impossible for a brand-new primary key to already have foreign-key matches. Optimizing IREs is part of our ongoing research. In our example, the system would, at definition time, create the materialized view based on the current contents of PEOPLE and CALLS. Next, an auxiliary variable would be used to hold the top row of the view. Then the IRE above would be deduced. Finally, a calendar-sensitive program would generate 12 check times, starting from the current date, at one-month intervals. Every time an insertion or set of insertions into PEOPLE or CALLS entered the system, the IRE would be used to update the materialized view. At every check point, the country in the top row of the view would be compared with the stored country. If a change had occurred, a warning would be produced.
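In SQL terms, the grouping rule above amounts to folding the delta's partial aggregates into the stored view. The following is a schematic sketch with assumed names: x_calls_by_country is the materialized baseline and calls_delta holds the newly inserted CALLS rows.

-- Matched groups realize the join part of the operator (F' = sum of counts);
-- unmatched delta groups realize the right antijoin; groups appearing only
-- in the old view (the left antijoin) are simply left untouched.
MERGE INTO x_calls_by_country v
USING (SELECT country, COUNT(*) AS d
       FROM PEOPLE, calls_delta
       WHERE caller = name AND called = 'X'
       GROUP BY country) delta
ON (v.country = delta.country)
WHEN MATCHED THEN
  UPDATE SET numcalls = v.numcalls + delta.d
WHEN NOT MATCHED THEN
  INSERT (country, numcalls) VALUES (delta.country, delta.d);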
+
Note that since we first group by A the relations that are input to ∪, there is only one tuple per group on each relation, and the join brings such matching tuple together
306
4
A. Badia
Conclusion and Further Research
We have described a project to provide databases with an active monitoring system. This is an ongoing project in its initial phase; obviously, considerable work is still needed to implement and refine the concept. We are currently considering extensions to the baseline. Right now, the most complex baseline is an SPJGO SQL query. Our final goal is to admit arbitrary SQL queries to define a baseline. The important issues here are: to determine what types of conditions make sense for each type of queries, and to support monitoring efficiently. We are also looking at extending the type of conditions that can be monitored. As stated before, certain changes make sense for certain types of queries. However, even in simple cases there may be many possible conditions to monitor. We do not claim to have captured all possibly interesting conditions in our language; further extensions are needed. Finally, we consider adding actions very important. For now, we have restricted ourselves to external actions, actions that do not have any effect on the database, like sending warnings. However, this is clearly not satisfactory; more actions must be considered. But when internal actions (actions that may provoke a database change of state) are considered, one must deal with issues like non-termination. In our case, a rule action triggering another rule’s condition may be detected in the incremental computation phase, and therefore it may be possible to control such problems, perhaps using techniques from [3]. We plan to start with basic database actions (the equivalent of database conditions: insertions, deletions, updates) and then will allow semantic actions (the equivalent of semantic conditions), and study the consequences of each for confluency and non-termination. Clearly, the amount of data gathered in intelligence work is increasing fast; database techniques must be used to deal with such large amounts of data. However, high-level, application-oriented analysis must also be provided so that valuable information can be distilled out of the data. Databases are in an ideal position to provide support for such intelligent applications.
References 1. Abiteboul, S., Hull, R. and Vianu, V. Foundations of Databases, AddisonWesley, 1995. 2. Lars Bkgaard, Leo Mark, Incremental Computation of Time-Varying Query Expressions, TKDE 7(4), 1995. 3. E. Baralis and J. Widom, An algebraic approach to static analysis of active database rules, TODS 25(3), September 2000. 4. Stefano Ceri, Roberta J. Cochrane, Jennifer Widom, Practical Applications of Triggers and Constraints: Success and Lingering Issues, In Proc of 26th VLDB Conference 2000,pp 254-262, Cairo, Egypt, September 10-14, 2000. 5. Cheung, D. W., Han, J., Ng, V. and Wong, C. Y. Maintenance of discovered Association Rules in large databases: An Incremental Updating Technique, in Proceedings of ICDE, 1996. 6. Chomicki, J., Toman, D. and Bohlen, M. H. Querying ARSQL Databases with Temporal Logic, TODS 26(2), June 2001.
7. Colby, L., Griffin, T., Libkin, L., Mumick, I.S. and Trickey, H. Algorithms for Deferred View Maintenance, in Proceedings of ACM SIGMOD, 1996.
8. Gal, A. and Etzion, O. Maintaining Data-Driven Rules in Databases Using an Invariant-Based Language, IEEE Computer 28(1), 1995.
9. Gupta, A. and Mumick, I.S. (eds.) Materialized Views: Techniques, Implementations and Applications, MIT Press, 1999.
10. Gehani, N.H., Jagadish, H.V. and Shmueli, O. Event Specification in an Active Object-Oriented Database, in Proceedings of ACM SIGMOD, 1992.
11. Gupta, A., Mumick, I.S. and Subrahmanian, V.S. Maintaining Views Incrementally, in Proceedings of ACM SIGMOD, 1993. Reprinted in [9].
12. Hanson, E.N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J.B. and Vernon, A. Scalable Trigger Processing, in Proceedings of ICDE, 1999.
13. Jarke, M., Lenzerini, M., Vassiliou, Y. and Vassiliadis, P. Fundamentals of Data Warehouses, Springer, 2000.
14. Kappel, G., Rausch-Schott, S. and Retschitzegger, W. A Tour on the TriGS Active Database System: Architecture and Implementation, Communications of the ACM, June 1998.
15. Lang, P., Obermair, W. and Schrefl, M. Modeling Business Rules with Situation/Activation Diagrams, in Proceedings of ICDE, 1997.
16. Li, L. and Chakravarthy, S. An Agent-Based Approach to Extending the Native Active Capability of Relational Database Systems, in Proceedings of ICDE, 1999.
17. Lieuwen, D.F., Gehani, N. and Arlein, R. The Ode Active Database: Trigger Semantics and Implementation, in Proceedings of ICDE, 1996.
18. Motakis, I. and Zaniolo, C. Temporal Aggregation in Active Database Rules, in Proceedings of SIGMOD, 1997.
19. Qian, X. and Wiederhold, G. Incremental Recomputation of Active Relational Expressions, TKDE 3(3), September 1991.
20. Rosenthal, A., Chakravarthy, S., Blaustein, B.T. and Blakeley, J.A. Situation Monitoring for Active Databases, in Proceedings of VLDB, 1989.
21. Chandra, R., Segev, A. and Stonebraker, M. Implementing Calendars and Temporal Rules in Next Generation Databases, in Proceedings of ICDE, 1994.
22. Sistla, A.P. and Wolfson, O. Temporal Triggers in Active Databases, Technical Report, Univ. of Illinois at Chicago, EECS Dept., 1994.
23. Sistla, A.P. and Wolfson, O. Temporal Conditions and Integrity Constraints in Active Database Systems, in Proceedings of ACM SIGMOD, 1995.
24. Skold, M. and Risch, T. Using Partial Differencing for Efficient Monitoring of Deferred Complex Rule Conditions, in Proceedings of ICDE, 1996.
25. Widom, J. The Starburst Active Database Rule System, TKDE 8(4), August 1996.
26. Zaniolo, C., Ceri, S., Faloutsos, C., Snodgrass, R.T., Subrahmanian, V.S. and Zicari, R. Advanced Database Systems, Morgan Kaufmann, 1997.
27. Zimmer, D. and Unland, R. On the Semantics of Complex Events in Active Database Management Systems, in Proceedings of ICDE, 1999.
Integrated “Mixed” Networks Security Monitoring – A Proposed Framework

William T. Scherer 1, Leah L. Spradley 2, and Marc H. Evans 3

1 Associate Professor, Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904. [email protected]
2 Research Assistant, Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904. [email protected]
3 Research Engineer, Smart Travel Lab, Department of Civil Engineering, University of Virginia, Charlottesville, VA 22904. [email protected]

Abstract. Our primary concept is to develop a systemic security view of integrated, independent systems. We present a design for monitoring the security of an integrated public safety “mixed” network. The Capital Wireless Integrated Network (CapWIN) is such a system, consisting of a diverse and disparate mixture of public and private networks that share information in order to provide public safety services for metropolitan Washington, DC [1]. It is imperative that the members of CapWIN be aware of the status of the other participants in order to anticipate security events. At present, there are few means by which the members can obtain real-time information about the security status of all other parties. Our system is designed to use data fusion to provide system monitoring and feedback to all members. The system is state-based and is designed to 1) be easy to implement, 2) require minimal bandwidth, and 3) be customizable according to the preferences of each member of the network. We present an overview of the system, an analytical model for a state-based description, and system mock-ups.
1 Introduction

We propose a state-based approach to modeling the security of a complex public safety and Intelligent Transportation System (ITS) network called the Capital Wireless Integrated Network, or CapWIN. CapWIN is a partnership among Maryland, Virginia, and the District of Columbia to develop and deploy a wireless network that integrates transportation and public safety databases and communications. Because of the diversity of the networks involved in CapWIN (i.e., a wide-ranging collection of disparate participating networks of various capabilities and functionalities), the nature and scale of any security issue must be assessed with regard to the entire system. Our primary concept is to develop a systemic view of the integrated system, a view that may not be visible from any one system or group of sub-systems. “Mixed use” network models involve combining different networks into a single paradigm and describing the nature of such integrated systems [2], [3]. This follows CapWIN's structure of disparate public/private, public safety/transportation networks. Security administrators of networks, especially of a mixed-use network, must consider the potential threats (or risks) against a network, assess the likelihood of such events, and mitigate any risks. Our system goal is to perform a comparative security evaluation across multiple networks, with the initial focus on detection of events in the system based on the status of individual networks. Such a system is essential given that “these changing threats have caused a shift in the network security paradigm from one of certification to one of risk assessment” [2].
Fig. 1. Integrated Security System Architecture
Such a security monitoring system requires the following fundamental components:
• A communication system that allows each participating agency to provide security information. The system must allow for a diverse set of communication strategies, given the nature of CapWIN.
• A data management system that maintains the current and historical information provided by the agencies. This system must be able to interpret the various data being pulled and pushed from the agencies.
• An analytical engine that can determine the overall system state by integrating the individual systems' security states. This component must also perform data analysis, such as clustering and time series analysis, to provide information about the nature of an attack.
• A web-based interface that can present the analysis to an overall system monitor and to member agencies.
In order to demonstrate this concept, we have developed a specific approach to this problem, which is described after a brief overview of security issues. We note that
this framework can also be applied to numerous other domains, such as infrastructure security modeling.
2 Security Overview

Currently, many organizations secure their networks with a firewall used in conjunction with an intrusion detection system. Although software and programming vary greatly among firewalls, the methods by which they perform their tasks are the same. A firewall separates a protected network from an unprotected one by implementing a set of access control rules that restricts or allows traffic entering the network [4]. Intrusion detection systems function as a secondary line of defense designed to back up primary security mechanisms such as firewalls, encryption, and authentication. The backup is required in the event that a firewall's rule set has not been updated. The goal of intrusion detection systems is to monitor the activities of users on a network to identify evidence of intrusions. Demand for such systems arose with the emergence of time-sharing systems that needed to control access to computer resources in the 1960s [5]. Originally, they consisted of administrators analyzing user activities by physically monitoring a number of screens while sitting at a console. The administrator would look for suspicious activities such as an “away” user that was logged in locally, or unusual activity on a printer rarely in operation [6]. The next step involved administrators reviewing audit logs instead of monitoring screens. This method was very time consuming because it required searching through large stacks of paper containing the audit information. Since then, applications have been developed that can analyze audit data in real time as it is produced [7].

Studies show that current intrusion detection systems work well across a network of clients on the same server. However, there is not yet an effective method for simultaneously monitoring the security of disparate networks that is customized to the desires of each participant. The difficulty in achieving this task is that security systems are usually tailored to the network they monitor. Modern computer networks comprise many different operating systems, a variety of server-client relationships, and hardware from numerous vendors. For example, the Virginia Department of Transportation (VDOT) intranet is a star architecture in which the local area networks of the central office, district offices, and clients are connected through a central switch using standard 155 Mbps, 45 Mbps, or 1.55 Mbps ATM services. Each client or machine may also be running a different operating system, such as an MS Windows variant, Linux, or Unix [8]. These heterogeneous networks introduce a high level of complexity when it comes to management and security issues, and they make it unreasonable to apply a common security concept to each individual component without considering the others in the same intranet.

Intrusion detection systems deduce behavior based on the content and format of data packets on the network [9]. This means that the exact same attack may be recorded differently by different analyzers. Because of the heterogeneous networks associated with a network such as CapWIN, the proposed system must analyze individual security components in comparison to their relative importance to the whole enterprise. In order to overcome this ambiguity among separate agencies, a standard data model for describing attacks must be incorporated into
the monitoring system. Utilizing such a standard protocol enables the system to unequivocally interpret alerts from a variety of analyzers [10]. In determining the security status of each network, risks and threats must be considered together. The risk, or vulnerability, is the relative weakness of the network [11]. The first characteristic of risk is severity, defined as the amount of resources required to exploit an existing vulnerability. Exposure, a second characteristic of risk, is the level at which the effects of a vulnerability are contained, or the probability of additional vulnerabilities being exploited [11]. Threat is defined by incidents that may compromise network security, such as disrupting normal functioning, information integrity, or the availability of services. Factors used to evaluate a threat include its extent, the potential damage incurred, and the rate at which the threat spreads [12].
3 Conceptual Model

The following are the four main aspects of the proposed system:
1) Obtainment of Security Data
2) Standardization of the Data
3) Calculation of Overall System Score
4) Structure of Web-Based Interface
3.1 Obtainment of Security Data

Consider a typical integrated system architecture where each component (or sub-system or agency) in the system, e.g., the VDOT intranet, is assumed to have its own “security” systems and to be self-aware of its current status. Individual components are likely to differ in 1) their method of obtaining a current security status, and 2) the amount of information they are willing to share with the integrated security monitor. It is unlikely that all individual components will use the same network security tools to obtain their current status, resulting in ambiguity among security definitions. For example, a large government-run entity such as VDOT may use highly sophisticated security tools that produce detailed log files about activity in its network, whereas a small, privately owned towing company may, at most, use a commercial security software package that provides a single number representing the current security status of an individual PC. Our proposed system does not require a specific type of security system software or security system policies – just self-awareness. In other words, the proposed monitoring system requires only that each agency have knowledge of its own security status. The amount of information each agency desires to share with the monitoring system will also vary. For example, the Virginia State Police may have to abide by certain governmental regulations that do not allow the organization to provide as much information regarding its security as, say, the University of Virginia Smart Travel Lab. One agency may prefer to e-mail its individual security state, while another may prefer to automate a transfer of its entire log file through an encrypted FTP connection. Furthermore, an agency may desire to provide an interface
where the monitoring system can automatically pull the data, instead of the agency having to push the data itself.
Fig. 2. Detailed Security System Architecture illustrates how the four components integrate to form the complete system. Following the diagram are sections explaining each component in detail.
To account for the discrepancies mentioned above, each agency will choose from a number of communication options. A ping, the simplest level of communication, does not require a transfer medium, while all higher levels of information can be communicated through SMTP or FTP. A ping is a basic Internet action that verifies a particular IP address and can be used to ensure that the target computer is actually operating. Because a ping does not require action on the part of the individual agencies, all agencies will be pinged at a given time interval. If the ping fails, the agency is said to be under a security threat. It is important to note that a ping is a very simplistic form of communication and does not provide the monitoring system with detailed information. A failed ping can be caused by network congestion or various other conditions that do not necessarily imply a security threat to an agency. (Please refer to the standardization of data section below for more information about how ping latency contributes to an agency's calculated security state.) Therefore, if an agency provides any higher level of communication, that information will always override data received from pinging that agency.
A ping is also convenient for agencies that wish to have minimal participation with the monitoring system because it requires no action on the agency's part. An agency may also opt to send additional information regarding its security, such as an integer or a log file. Current Status is defined as follows: 1 means “no detected events,” 2 means “suspicious activities,” and 3 means “known to be under attack.” The proposed system allows each agency to “push” a variety of information, from a simple number indicating its system status all the way up to a complete log entry. In order to handle such a variety of response levels, the system will use an XML schema that can hold as little or as much information as each agency chooses to provide. Figure 3 illustrates a possible XML data model. The use of customized tags is important for allowing this system to relay alert descriptions that vary in specificity. XML schemas can also be included with the file to validate the information contained in each tag, in order to make sure that agencies enter the correct type of information. The following is an example of an XML schema that could be used with the proposed system.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="securitydata">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="agencyid" type="xs:integer"/>
        <xs:element name="currentstatus" type="xs:integer"/>
        <xs:element name="alert">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="analyzer" type="xs:string"/>
              <xs:element name="createtime" type="xs:string"/>
              <xs:element name="detecttime" type="xs:string"/>
              <xs:element name="analyzertime" type="xs:string"/>
              <xs:element name="source" type="xs:string"/>
              <xs:element name="target" type="xs:string"/>
              <xs:element name="classification" type="xs:string"/>
              <xs:element name="assessment" type="xs:string"/>
              <xs:element name="additionaldata" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="securitydataid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
Fig. 3. Example of XML standard data model for security data.
As seen above, each XML document received has the root element name “securitydata” with the attribute “securitydataid.” The lowest level of information an agency can provide is its “agencyid” and its “currentstatus” (an integer from 1 to 3). Each agency can then choose how many additional details it wishes to provide by filling in only those tags and leaving the remaining elements empty. As long as the data is entered into the right tags in the correct format, the document can be parsed without difficulty. Therefore, a single XML schema can be provided to all the agencies. If problems occur during parsing for some reason, the ping data will be used to calculate the individual score for that agency instead of the XML document.
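As an illustration of how the monitoring system might consume these reports, the following is a minimal Python sketch that parses a hypothetical report using only the standard library. The element names follow the schema above, but the sample values, function name, and the fallback-to-ping behavior are illustrative assumptions rather than a specification of the deployed system.

import xml.etree.ElementTree as ET

# A minimal hypothetical report: only the required fields are filled in,
# and the optional <alert> details are omitted entirely.
SAMPLE_REPORT = """
<securitydata securitydataid="sd-0001">
  <agencyid>42</agencyid>
  <currentstatus>2</currentstatus>
</securitydata>
"""

def parse_report(xml_text):
    """Return (agency_id, current_status), or None if the report is unparsable.

    A None result signals the caller to fall back on ping data, as
    described in the text above.
    """
    try:
        root = ET.fromstring(xml_text)
        agency_id = int(root.findtext("agencyid"))
        status = int(root.findtext("currentstatus"))
        if status not in (1, 2, 3):
            return None  # out-of-range status: treat as unparsable
        return agency_id, status
    except (ET.ParseError, TypeError, ValueError):
        return None

print(parse_report(SAMPLE_REPORT))  # -> (42, 2)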
The additional information under the element name “alert” is modeled after the Intrusion Detection Message Exchange Format developed by the Internet Engineering Task Force [13]. The data model used to describe alerts is content driven, meaning that new objects are introduced to accommodate additional pieces of information. The data model is thus based precisely on the information provided by the intrusion detection analyzer and is not ambiguous; it will not produce contradictory information on two alerts describing the same event. If an agency opts to send an XML document, it will organize the document following the XML schema that we have predefined. This option is preferred because it increases the ease of parsing the document and lessens the chance of ambiguity. An agency can also choose to send its information in a plain text file. However, the agency must format the text file in a predefined manner, or it must provide the format it wishes to use during the development of the system. It will be necessary to write specific code for parsing text documents, and the format of the text document must be known.

3.2 Standardization of Data

The proposed concept does not require access to the individual systems and will not further increase network security risks. However, if agencies wish to allow access, the system has the capability to pull security data automatically. Moreover, the system can receive any level of security information and standardize the data into an overall system state. Given that each system component can provide security status information via its own internal systems, our “wrappers,” or tables, will be able to convert that information into one of the proposed system's defined states. For this simple prototype system, the individual system security will be converted into a vector of three attribute values:
1. Severity: level of potential risk involved;
2. Exposure: level of security practiced;
3. Current Status: 1 means “no detected events,” 2 means “suspicious activities,” and 3 means “known to be under attack.”
Each of the three values will be 1, 2, or 3, where, for example, a “1” means low severity, a “2” mild severity, and a “3” extreme severity. The attributes “severity” and “exposure” are defined via the Capability Maturity Model (CMM) for each individual system component (e.g., VDOT's disparate sub-nodes) and are expected to change infrequently, only when there is a change in the nature of the individual system or in its security policies. The current status will be derived from the communication received from the individual system. Typical system information at a specific time could look like this:

UVA STL (VDOT sub-node): [Severity=3, Exposure=1, Current Status=1]

Thus STL's severity is high (significant potential damage), its exposure is low (tight security and limited access), and its current status is “no detected events.” Given the vector definition of state above, there are 27 (3^3) possible states for any system. Our next task is to convert the vector into a single score. The initial approach is to use a simple weighted score where each of the three components is given a weight,
where the weights sum to one. The individual system scores are then normalized to lie in the range 0 to 1. For example, the three components could be equally weighted (.33, .33, .33), giving illustrative scores such as:

(1,1,1) => 0 (under any weights)
(3,3,3) => 1 (under any weights)
(1,1,3) => .33
(2,2,3) => .67

The weights would be user selected and would depend on the specific integrated system. One additional feature is a penalty for reporting latency. Given the reports from each agency, it is possible to calculate, for each individual agency, the mean reporting time and the associated standard deviation. Reporting latency here could refer to ping latency, XML file latency, or the delay of any other communication type. Using this latency data, a score is penalized, increasing as reporting latency increases. The concept is that abnormal reporting latency, defined as the number of standard deviations beyond the mean, indicates a higher likelihood of security problems. The individual state will be penalized according to the number of standard deviations the reporting latency is away from the historical mean. The historical means can be stored for any appropriate time period. For example, if the historical means are stored every 15 minutes, then a reporting latency for Agency A determined at time t will be compared to the historical mean for the time period t to t+15. A z-value is then calculated by subtracting the historical mean latency from the current latency and dividing by the standard deviation. If the z-value is negative, the agency is said to be in state 1, “no detected events.” The agency is said to be in state 2 or 3 depending on how large the positive z-value is.

Any function, or set of rules, could be used to combine the vector into a single system-wide score; this simple weighted score is a starting point that is being evaluated in the prototype. A different functional form would be a multiplicative function, which would have very different properties from the additive function in that it would not exhibit constant marginal returns (i.e., the difference between (1,1,2) and (1,1,3) would not be the same as between (1,1,1) and (1,1,2)). The test bed system will evaluate different functional forms. To formalize and generalize, assume:

N = number of agencies (or subsystems)
M = number of components of the security vector
K = number of possible (non-zero) integer values for each of the vector elements; e.g., if K = 3 then the set of values is (1,2,3), assumed ordered from best to worst.
V_{it} = security vector of length M for agency i at time t.
w = vector of weights, where w_j is the weight of component j of the security vector and \sum_{j=1}^{M} w_j = 1.

Then the individual agency score IA_{it} for agency i at time t is:

IA_{it} = \left( \sum_{j=1}^{M} V_{it}(j) \, w_j - 1 \right) \Big/ (K - 1)
Note that this assumes the additive weighted functional form for combining the elements of the vector. In the above example, w = (.33, .33, .33), K = 3, and M = 3; for V_{it} = (1,2,3), IA_{it} = 0.5.
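The score computation is simple enough to verify directly. The following Python sketch is a minimal rendering of the IA_{it} formula under the equal-weight assumption above; the function and variable names are our own illustrative choices.

def individual_score(v, w, k=3):
    """Individual agency score IA: weighted vector score normalized to [0, 1].

    v: security vector (severity, exposure, current status), values in 1..k
    w: component weights summing to one
    """
    weighted = sum(vj * wj for vj, wj in zip(v, w))
    return (weighted - 1) / (k - 1)

w = (1/3, 1/3, 1/3)
print(round(individual_score((1, 2, 3), w), 2))  # 0.5, matching the worked example
print(round(individual_score((2, 2, 3), w), 2))  # 0.67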
Next, IA_{it} is adjusted based on the time delay in reports from the agency. This adjustment could be measured from the time statistics of the reports received (or pulled), or from ping statistics. Assume that it has been L_i time units (seconds, minutes, etc.) since the last report from agency i. Also assume the mean reporting interval for agency i is \mu_i and the standard deviation is \sigma_i. Then the individual score is adjusted to AIA_{it}, where

AIA_{it} = IA_{it},                              if (L_i - \mu_i) \le 0,
AIA_{it} = IA_{it} \cdot e^{c (L_i - \mu_i)},    if 0 < (L_i - \mu_i) \le g \sigma_i, and
AIA_{it} = 1.00,                                 if (L_i - \mu_i) > g \sigma_i,

where g is a constant and c = \ln(1 / IA_{it}) / (g \sigma_i). As an example, assume IA_{it} = 0.5 for an agency where \mu_i = 10 and \sigma_i = 4, and let the multiplier g be 2. Thus, if a report has been received within the last 10 minutes, the score is not penalized and AIA_{it} = IA_{it} = 0.5. Alternatively, if the report is more than two standard deviations beyond the mean, i.e., more than 18 minutes since the last report (since g = 2, 10 + 8 = 18 minutes), then AIA_{it} = 1.00. If the report is 6 minutes later than the mean, then AIA_{it} = IA_{it} \cdot e^{c (L_i - \mu_i)} = 0.84, the security score having been increased from 0.50 to 0.84 for being a “late” report. If the report is one minute beyond the mean, then AIA_{it} = 0.55. Note that the factor g determines the penalty strength, i.e., the larger g is, the smaller the penalty for being late.
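For concreteness, the following Python sketch transcribes this piecewise adjustment directly, with the worked example above used as a check; the function name and sample calls are illustrative choices.

import math

def adjusted_score(ia, elapsed, mu, sigma, g=2.0):
    """Latency-adjusted agency score AIA under the piecewise rule above."""
    delay = elapsed - mu
    if delay <= 0:
        return ia                      # on-time report: no penalty
    if delay > g * sigma:
        return 1.00                    # very late: worst-case score
    c = math.log(1.0 / ia) / (g * sigma)
    return ia * math.exp(c * delay)    # smooth penalty between the extremes

# Worked example: ia = 0.5, mu = 10, sigma = 4, g = 2
print(round(adjusted_score(0.5, 16, 10, 4), 2))  # 0.84 (6 minutes late)
print(round(adjusted_score(0.5, 11, 10, 4), 2))  # 0.55 (1 minute late)
print(adjusted_score(0.5, 19, 10, 4))            # 1.0 (beyond 2 sigma)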
3.3 Calculation of the Overall System Score

The next question, then, is how to define the state of the entire integrated system. In a simple example prototype, there are six components representing such an integrated network: VDOT, VTRC, STL, Private #1, Private #2, and an integrated data warehouse (assumed separate from the other systems for this example). Each of these, following the simple example defined above, would have an adjusted individual system score (AIA). For example:

Agency           Individual System Score (AIA)
VDOT             0.10
VTRC             0.01
STL              0.40
Private #1       0.10
Private #2       0.33
Data warehouse   0.88
Given the above situation, what is the condition (or state) of the entire combined network? Since the data warehouse, which is integrated with all of the systems, has a high score (0.88), this might be considered a serious threat warranting attention across the different components.
Alternatively, consider the following situation:

Agency           Individual System Score (AIA)
VDOT             0.01
VTRC             0.10
STL              0.10
Private #1       0.10
Private #2       0.01
Data warehouse   0.01
In this case it might be considered a non-critical system-wide incident warranting no particular action. The preliminary approach is to combine the individual system scores into a single integrated system score by weighting each individual sub-system, normalizing the scores to lie between 0.01 and 1. The weight of an individual sub-system would depend on the importance of that sub-system to the integrated system. In the above example, the data warehouse might receive a high weight since it is critical to all sub-systems. Assume each sub-system gets a weight of 1, 2, or 3; then, for example:

Agency           Individual System Score   Sub-System Weight
VDOT             0.01                      2
VTRC             0.10                      1
STL              0.10                      2
Private #1       0.10                      1
Private #2       0.01                      1
Data warehouse   0.01                      3
The normalized weighted score for the above integrated system would be 0.046, and for the former example (under the same weights) 0.41. Once again, any function or set of rules could be used to combine the vector of sub-system scores into a single score; this simple weighted score is a starting point that is being evaluated in the prototype. Formalizing again, let:

aw_i = weight for agency i, where the weights are assumed to be integers between 1 and Q.
IS_t = integrated security score at time t.

Then:

IS_t = \sum_{i=1}^{N} AIA_{it} \, aw_i \Big/ \sum_{i=1}^{N} aw_i
Once again, this assumes the additive form of combining individual scores. In the last example, AIA_{it} = (0.01, 0.1, 0.1, 0.1, 0.01, 0.01) and aw = (2, 1, 2, 1, 1, 3), such that IS_t = 0.046. Clearly, the higher the system score, the greater the threat.
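As a check on this aggregation step, a minimal Python sketch using the example values above (the function name and layout are illustrative):

def integrated_score(agency_scores, agency_weights):
    """Integrated security score IS: weighted average of adjusted agency scores."""
    total = sum(s * w for s, w in zip(agency_scores, agency_weights))
    return total / sum(agency_weights)

aia = (0.01, 0.10, 0.10, 0.10, 0.01, 0.01)  # VDOT, VTRC, STL, Private #1, Private #2, DW
aw = (2, 1, 2, 1, 1, 3)
print(round(integrated_score(aia, aw), 3))  # 0.046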
System-wide state definitions are therefore needed to determine what controls, policies, etc. should be undertaken across the entire integrated network for a given security score. For example:

Example ITS System-Wide States
Level 1 – 0.01 < IS_t < 0.05: No known system security problems.
Level 2 – 0.05 < IS_t < 0.10: Minor security problems, considered non-threatening; no action.
Level 3 – 0.10 < IS_t < 0.20: Security problem, nuisance threat, managed locally; no action.
Level 4 – 0.20 < IS_t < 0.45: Serious major threat being assessed; some systems partially disconnected from the system.
Level 5 – 0.45 < IS_t < 0.65: Major threat; system under administrator control; numerous ITS systems isolated.
Level 6 – 0.65 < IS_t < 1.00: Complete ITS isolation of all sub-systems until a state change.

An alternative approach would be to set the level according to the number of standard deviations from the mean system score; i.e., a severe warning might be issued if the current system score is more than two standard deviations from the mean security score over the past 24 hours. For any given system-wide state or level, there would need to be policies, controls, etc. that are implemented or recommended. For example, Level 2 (above) could trigger alerts to all sub-systems that threats have been detected and that they should be on the lookout. Level 3, however, could suggest (or require) that certain sub-systems be disconnected from the ITS system. Note that the IS_t values would form a time series, and numerous techniques could be used to analyze the series. Techniques from the quality control literature (six sigma) could also be used to determine control limit policies, i.e., when the system is deemed to be beyond the limits of standard operations (e.g., two sigma from the mean). Numerous other techniques, such as data mining and cluster analysis, could be used to analyze the time series data to determine which sub-systems were under attack.

Specifically, given that a state signal IS_t has been generated, analysis can be performed on the signal to detect incidents or security events. At the most basic level would be notice of a change in the basic statistics of the state signal. If, for example, the signal's statistics are described simply by a mean value and a standard deviation, then tests could be used to determine 1) whether the signal is non-stationary, i.e., whether the mean and/or variance is changing over time, and 2) whether a signal that has been stationary is now “out of control,” in that it is more than K standard deviations beyond the mean for a certain period of time. Either of these events could indicate potential incidents or a changing environment in which, for example, one of the agencies has changed the manner in which it is scoring or reporting its data. Significantly more advanced approaches could be used to analyze the signal. Techniques such as wavelet analysis and CUSUM are currently being used in various transportation applications for traffic incident detection and monitoring [14]. Wavelet analysis has been used extensively in electrical engineering for signal analysis and is related to Fourier analysis of the signal. The CUSUM algorithm characterizes break points in the time series to identify when there has been a change in the signal [15]. Classification algorithms have also been used for statistical anomaly detection [16].
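One minimal rendering of the basic “out of control” test described above follows; the window contents, the K = 2 multiplier, and all names are assumptions for illustration, not parameters of the prototype.

from statistics import mean, stdev

def out_of_control(history, current, k=2.0):
    """Flag the current integrated score if it exceeds the historical
    mean by more than k standard deviations (a simple control-limit test)."""
    mu, sigma = mean(history), stdev(history)
    return current > mu + k * sigma

# e.g., recent IS_t values sampled over the monitoring window
history = [0.04, 0.05, 0.03, 0.05, 0.04, 0.06, 0.05, 0.04]
print(out_of_control(history, 0.05))  # False: within the normal range
print(out_of_control(history, 0.20))  # True: potential incident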
It is anticipated that our approach will use an integration of several of the techniques mentioned above to analyze the system state signal and identify any significant changes. Warning indicators will provide users and the main security system operator with information for investigating whether an incident has occurred. The techniques employed will be data-driven in that they continually use the data received to update the statistics of the signal process and the parameters of the algorithms. As actual data are received, the algorithms can be calibrated on the tradeoff between false positives and missed incidents. As mentioned earlier, the V_{it} values would change in an asynchronous fashion. At any time, the IS_t values would be calculated with the latest V_{it} scores. As described earlier, statistics could be kept on the time between receiving values from agencies, and limits could be established such that a signal not received from an agency for a significant time (e.g., 2 sigma beyond the mean) could be interpreted in the worst case, i.e., as the highest state of attack.

3.4 Structure of Web-Based Interface

To access the system status, network clients will log in to password-protected security monitor web pages. The main page view of the site comprises a navigation bar across the top, a list of agencies in the column to the left, and a main display area in the middle of the page.
Fig. 4. Overall System Time Plot
Fig. 5. Map Display
The main display area will contain different views depending on which tab the user has highlighted in the navigation bar. The Home view is a time series graph of the overall system score, along with the current system status and current system score. The x-axis will cover a 24-hour time period so that the system score can be seen “crawling” as the page refreshes every 2 minutes. The red line indicates the upper control limit, showing when the system state has deviated from the normal range. This area will also have a link labeled “network status” that opens a pop-up window. This window will show a time plot of ping latency across the network. The latency used for the time plot is calculated by taking the mean over several different predetermined “stable” sites. This visualization will allow the user to
compare the overall security of CapWIN with the relative speed of the network at any given time to determine whether any correlation exists. The Map view, illustrated in Figure 5, will contain a geographical map of the metropolitan DC area, representing the DC, Maryland, and Virginia areas with their CapWIN agencies. When the user clicks on one of the three areas, a map of that area appears containing the name of each agency. The color of the text will indicate the agency's security state: Grey (1), Yellow (2), and Red (3). This text is also dynamically read from the database using ActiveX objects. Displaying the relative geographical locations of agencies allows users to easily identify whether an attack is concentrated in a particular area. If the attack is recognized as corresponding to a specific section of the map, then other agencies in close proximity could react accordingly.

In the column on the left is a list of all participating agencies in alphabetical order. The agency list will be dynamically read from the database using ActiveX objects so that no agency is hard coded in the HTML. Each agency will appear in one of three colors representing its current status: Black (1), Yellow (2), and Red (3). When clicking on an agency name, the user will see a pop-up window containing a time series graph of that agency's score, as well as a link to the Admin page for that user. The individual time series plot looks similar to that of the overall system score in that it covers a 24-hour time period and will crawl as the day goes by. However, there will be two y-axes for an individual plot: one will show the individual state (1, 2, 3) and the other will show the z-value for reporting latency. The two lines will be layered on the same plot so that the user can see whether a security threat is correlated with reporting latency. The Admin page will allow users to customize their viewing options, such as a favorite agency list and notification options. The column will also contain a listing of the agency categories, divided according to function. For example, all agencies associated with police enforcement (Maryland, Virginia, and DC police) will belong to the category Police. The category name will appear in the color representing a weighted average of all the individual agency scores within the category.
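The category coloring rule lends itself to a small sketch. The rounding thresholds and weights below are illustrative assumptions, since the text does not specify how the weighted average maps back to a color.

def category_color(statuses, weights):
    """Color for an agency category from a weighted average of member statuses.

    statuses: individual agency states (1, 2, or 3)
    weights: relative importance of each agency (assumed; not specified in the text)
    """
    avg = sum(s * w for s, w in zip(statuses, weights)) / sum(weights)
    if avg < 1.5:
        return "Black"   # predominantly "no detected events"
    if avg < 2.5:
        return "Yellow"  # predominantly "suspicious activities"
    return "Red"         # predominantly "known to be under attack"

# e.g., a Police category with three member agencies
print(category_color([1, 2, 1], [2, 1, 1]))  # Black (avg = 1.25)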
4 Conclusions

This paper has proposed the design of a security monitoring system for an integrated network comprised of disparate public safety and transportation systems. Such systems are complicated and relatively new; the proposed system therefore provides each partner with an overview of the status of the entire system. This paper described the architecture of the system and provided mock-up examples of its interface. Future efforts will involve, as described in an earlier section, building a working prototype for CapWIN to review. Actual implementation of the proposed system will allow for a comprehensive evaluation of the concept, along with identification of improvements to the system. Future efforts must also consider the security of the monitoring system itself; information assurance may be accomplished through the use of encryption. We believe that the conceptual framework described in this paper – a state-based security monitoring system – is directly applicable to a wide range of transportation security issues, including the monitoring of the transportation infrastructure.
Acknowledgments. We would like to acknowledge the Capital Wireless Integrated Network (CapWIN) program and the National Institute of Justice's Office of Science and Technology for research support, and K.P. White, Yiyi Zhang, Adam Shartzer, Lindsey Lane, and Loren Bushkar, who assisted with an earlier version of this paper and with related research efforts.
References
1. CapWIN. Roles, 2002. http://www.capwinproject.com/. Accessed July 23, 2002.
2. Ghosh, S. Principles of Secure Network Systems Design. Springer, New York, 2002. ITS America, What is ITS?, 2002. http://www.itsa.org/whatits.html. Accessed July 23, 2002.
3. Schumacher, H., and Ghosh, S. A Fundamental Framework for Network Security. Journal of Network and Computer Applications, Vol. 20, 1997, pp. 305–322.
4. Curtin, M., and Ranum, M. Internet Firewalls: FAQ. www.interhack.net/pubs/fwfaq, 2000.
5. Sherif, J., and Dearmond, T.G. Intrusion Detection: Systems and Models. Proc. of the Eleventh IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, 2002.
6. Kemmerer, R.A., and Vigna, G. Intrusion Detection: A Brief History and Overview. IEEE Symp. Security and Privacy, IEEE CS Press, Los Alamitos, Calif., 2002, pp. 27–29.
7. Packer, R. A Basic Guide to Intrusion Detection. White Papers, August 2001, pp. 1–8.
8. VDOT. Planned Changes FY2002 Environment, Enterprise Computing Environment. Virginia Department of Transportation (VDOT), May 30, 2001.
9. Durst, R., et al. Testing and Evaluating Computer Intrusion Detection Systems. Communications of the ACM, July 1999, pp. 53–61.
10. Cuppens, F., and Ortalo, R. Lambda: A Language to Model a Database for Detection of Attacks. In Proceedings of the Third International Workshop on the Recent Advances in Intrusion Detection (RAID 2000), October 2000.
11. Bayne, J. An Overview of Threat and Risk Assessment. SANS Institute Information Security Reading Room, Jan. 22, 2002. http://rr.sans.org/audit/overview.php. Accessed June 15, 2002.
12. Symantec Inc. Threat Severity Assessment, 2002. http://securityresponse.symantec.com/avcenter/threat.severity.html. Accessed July 23, 2002.
13. Curry, D., and Debar, H. Intrusion Detection Message Exchange Format: Extensible Markup Language (XML) Document Type Definition. http://www.ietf.org/internet-drafts/draft-ietf-idwg-idmef-xml-06.txt, Dec. 2001.
14. Ogden, R.T. Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, 1997.
15. Basseville, M., and Nikiforov, I. Detection of Abrupt Changes: Theory and Applications. Prentice Hall, Englewood Cliffs, NJ, 1993.
16. Manikopoulos, C., and Papavassiliou, S. Network Intrusion and Fault Detection: A Statistical Anomaly Approach. IEEE Communications Magazine, October 2002.
Bioterrorism Surveillance with Real-Time Data Warehousing

Donald J. Berndt, Alan R. Hevner, and James Studnicki
University of South Florida, Tampa, FL 33620
{dberndt, ahevner}@coba.usf.edu, [email protected]

Abstract. This paper discusses several technical challenges in the development of an effective bioterrorism surveillance system. Three factors are critical:
1. It must be multidimensional.
2. It must accelerate the transmission of findings and data to most closely approximate real-time surveillance, so as to provide sufficient warning.
3. It must have the capability for pattern recognition that will quickly identify an alarm or alert threshold value.
We build on our on-going health care data warehousing research to provide solutions to these challenges. The innovative use of flash data warehousing provides the essential ability to compare real-time health care data with historical patterns of key surveillance indicators. A comprehensive architecture for a bioterrorism surveillance system is presented. A demonstration project in Florida showcases these ideas.

1 The Threat of Bioterrorism

The threat of a premeditated biological attack on civilian populations is of real concern to all nations. Recent events, such as the sarin gas attack in Japan and the anthrax contamination of letters in the U.S. postal service, demonstrate the devastating consequences of bioterrorism, both physical and psychological [1]. Biological weapons can be based on a number of different biological agents and can take many forms of distribution, making the detection of and response to biological attacks very difficult to prepare for [12]. Public anxiety is at an all-time high, and national, state, and local governments are being tasked to protect citizens from these threats to community health – now, not later. Surveillance systems to assist in the detection and prevention of such biothreats must be rapidly developed. Unfortunately, the data necessary to fuel such systems reside in a multitude of disparate, distributed data sources among hospitals, clinics, pharmacies, water treatment facilities, labs, emergency rooms, etc., and we cannot wait for solutions with long development and implementation times. This paper presents some of the technical challenges of developing an effective bioterrorism surveillance system. The innovative use of flash data warehousing provides the essential ability to compare real-time health care data with historical patterns of key surveillance indicators. A demonstration project in Florida showcases these ideas.
2 Bioterrorism Surveillance Systems

How can a community, a county, a state, or a nation determine that it is being attacked by terrorists using biological or chemical agents as weapons, within a time period short enough to prevent or at least ameliorate major negative health consequences? In the U.S., the Centers for Disease Control and Prevention (CDC), in its public health response performance plan, emphasizes the ability of local and state health departments to respond to terrorist attacks. Central to this initiative is the establishment of sentinel networks that not only have the capacity to respond to an act of terrorism, but also have the infrastructure to anticipate and potentially prevent threats from being realized, or at least to minimize their epidemiological impact through early detection [9]. The most important component of these networks is an information system and dedicated data warehouse, which would have many associated functions but with the primary objectives of detecting and analyzing ongoing biochemical exposures and consequences affecting population health status. Fundamentally, there are three major requirements for such a system to be effective:
1. It must be multidimensional. In other words, it must include a range of appropriate indicators and sources of information in order to monitor as many types of threats and health effects as possible.
2. It must accelerate the transmission of findings and data to most closely approximate real-time surveillance, so as to provide sufficient warning.
3. It must have the capability for pattern recognition that will quickly identify an alarm or alert threshold value, raising the issue for further investigation or possible intervention.

2.1 Multidimensional Indicators

The multiple health effects of biological, chemical, or other types of agents or hazards require the specification and monitoring of many different information sources. Among the more obvious information sources are hospital emergency rooms, physician offices and primary care clinics, pharmacies, and clinical laboratories. The data taken from these sources can provide timely information about the nature of a threat, its health consequences, and the areas and populations affected. For example, there have been significant developments in the electronic reporting of laboratory results for notifiable conditions [2, 11]. Recognition of the need for more timely surveillance has been demonstrated for diarrheal disease, emergency department-based emerging infections, influenza epidemics, nosocomial infections, salmonella, meningitis, tuberculosis, and various clinical event monitors [8]. Among the many types of information sources that have received much less attention as part of an early detection system are physical monitors capable of detecting hazards affecting the water supply, the air supply, or the food supply. Each of these groups of indicators, or domains, would be specified as part of an integrated, comprehensive early warning system.
2.2 Real-Time Information

Timely detection is a key requirement for avoiding the most serious negative consequences of any future terrorist attack involving biological or chemical agents. Timeliness, as measured by the time interval separating an event from its detection, has been infrequently studied as a performance requirement for an early warning system. An assessment of timeliness requirements for inhalation anthrax, with scenarios ranging from treatment starting on the first of 6 days to no treatment at all, suggests that detection after 3 days is nearly useless and that cost accumulation during the steepest part of the curve (between days 2 and 3) is $200 million per hour [5]. A CDC study of tularemia suggests that starting treatment on the day after an attack reduces the mortality rate by two-thirds, but treatment has no effect if started as late as the fifth day. Current systems of reporting and notification involve the reporting of data from the local level, to the state, and on to the federal level. This highly centralized process batches the data at several levels, and “timely” reporting can actually take from a few weeks to more than a year to traverse these levels. An additional barrier to the development of such a system is the fact that the necessary data elements reside in a multitude of disparate, distributed data sources, complicating a timely implementation cycle. Many of the indicators that could be used in the system are not even identified, let alone collected, and the logistical problems of capturing this information can be considerable for source organizations. Finally, the data discovery and categorization methods necessary for securing access to disparate, distributed data sources in real time require unique and powerful technologies for identification, collection, integration, quality control, querying, reporting, and dissemination.

2.3 Pattern Recognition and Alarm Thresholds

While the transmission of data elements in real time is a necessary capability of an effective system, it is in itself insufficient without the means to determine when a received signal actually represents the existence of an event. These alarm values, or alert thresholds, must be determined on the basis of historical pattern recognition that will enable researchers to determine both the existence and the nature of the damaging event. Hospital admissions initiated in the emergency room, for example, vary from hospital to hospital, seasonally, and by diagnostic composition. These existing patterns must be analyzed before an alert threshold can be determined. For example, Figure 1 shows a high number of hospitalizations resulting from emergency room admissions for a hospital near many of the Florida theme parks and tourist destinations. (An online analytic processing (OLAP) tool produces the information presentation in Figure 1 from the hospital admissions data in the underlying data warehouse.) Using quarterly aggregations, the data show that the ER admission rate holds steady around 70%. However, Figure 2 depicts the much lower percentages that characterize a large, urban hospital located in an affluent area. For this hospital, the quarterly data show a steady ER admission rate of approximately 20%. This demonstrates the importance of having a historical baseline of data for comparison before sounding a bioterrorism alert, as well as the wide variation in even the simplest indicators.
Fig. 1. High Rate of Hospitalizations from Emergency Room Admissions
Fig. 2. Low Rate of Hospitalizations from Emergency Room Admissions
Various marker admissions can also be historically monitored, such as infectious and parasitic diseases, diseases of the respiratory system (e.g., those due to external agents such as chemical fumes or vapors), non-specific abnormal findings, poisonings by antibiotics, or the toxic effects of substances such as carbon monoxide or chlorine. The nature of the alarm threshold itself may vary.
The actual level of the indicator value at a single hospital may serve as the alert; for example, any hospital that exceeds its own expected number of respiratory disease admissions by one standard deviation or more in any two-hour period. Similarly, an alarm threshold might be reached when a smaller increase in specified admissions (e.g., 20%) occurs at some number (e.g., 3 or more) of hospitals. These patterns and the determination of valid alert levels can best be generated with the creation of a health care data warehouse that integrates the many disparate data elements. This warehouse can then support the use of sophisticated browsing tools for either explanatory or confirmatory purposes.
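The two example alarm rules above can be stated compactly in code. The following Python sketch is illustrative only, with made-up counts and function names; real thresholds would be calibrated against the warehouse's historical baselines.

def single_hospital_alarm(observed, expected, std_dev, k=1.0):
    """Alarm if one hospital exceeds its expected admissions by k std devs."""
    return observed >= expected + k * std_dev

def multi_hospital_alarm(observed_by_hospital, expected_by_hospital,
                         pct_increase=0.20, min_hospitals=3):
    """Alarm if at least min_hospitals each show a pct_increase jump."""
    elevated = sum(
        1 for obs, exp in zip(observed_by_hospital, expected_by_hospital)
        if obs >= exp * (1 + pct_increase)
    )
    return elevated >= min_hospitals

# Illustrative two-hour respiratory admission counts for five hospitals
print(single_hospital_alarm(observed=14, expected=8, std_dev=3))  # True
print(multi_hospital_alarm([12, 9, 13, 5, 11], [8, 8, 9, 6, 8]))  # True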
3 The Decision-Making Context

The decision as to whether an unfolding epidemiological situation is in fact an act of bioterrorism is clearly a daunting task full of uncertainty. There has been much research regarding decision making under ambiguous and confusing circumstances. For instance, signal detection theory has been widely applied in practice and research. Signal detection theory assumes that most decision-making tasks occur under conditions of uncertainty. The basic framework proposes four outcomes for such tasks: a “hit” corresponds to a correctly identified event (such as a bioterrorism attack) based on the signals or available information; a “miss” occurs when a decision maker fails to identify such an event; a “false alarm” signals an event when none occurred; and a “correct rejection” correctly dismisses a non-event. This reasoning echoes the concepts of Type I and Type II errors in statistics, where the convention is to structure the hypotheses to minimize the risk of costly Type I errors, while using sample size and other design factors to control Type II errors. The related concepts of sensitivity, specificity, and timeliness also describe important characteristics of the decision-making model, especially in the context of bioterrorism and disease surveillance [15]. Sensitivity relates to the level required to trigger an alarm or threshold, while specificity characterizes the accuracy or ability to correctly discriminate between outcomes. Typically, these parameters form the basis for a tradeoff, where increased sensitivity comes at the cost of reduced specificity. Lastly, timeliness is another critical criterion that can often be improved by increasing sensitivity, again at the cost of other criteria. Timeliness is especially critical in the domain of disease outbreaks and bioterrorism early warning systems: excessive delays can render effective interventions useless and dramatically reduce alternative courses of action. Since many disease or biochemical agents have unique temporal trajectories, research into the profiles of these threats is a high priority.

Many investigators have theorized about and experimented with decision making under uncertain and risky conditions. Shapira [14] offers a model of risk in managerial decision making that further refines the relationship between possible outcomes. This model has been recast for the study of “strategic surprises” [7]. While the model is more often used to characterize strategic surprises between business partners, several wartime events are used to illustrate it, making the model very relevant for biochemical attacks. In particular, both the attack on Pearl Harbor and the Yom Kippur War provide examples where costly false alarms resulted in the upward adjustment of alarm thresholds and subsequent “surprise” attacks, despite very good intelligence. The authors suggest that it is tempting to cite an “information gap” due to less-than-ideal intelligence gathering activities. However, there is always
incomplete information and uncertainty in such circumstances. Perfect information is usually too expensive and often simply unattainable. These decision-making frameworks serve to highlight some important aspects of the biochemical threat detection challenge. The cost of false alarms in early warning systems for biochemical threats is extreme in terms of monetary expenditures and psychological burdens. Therefore, subsequent upward adjustments of alarm thresholds would be expected, reducing the sensitivity, timeliness, and ultimate usefulness of the system. In addition, biochemical attacks are (thankfully) exceedingly rare, providing few examples on which to refine and calibrate predictive models. Lastly, it may never be possible to definitively answer some questions regarding the origins of a particular attack, or even whether it was an intentional act or a natural outbreak. In the face of such challenges, it is unlikely that a highly automated early warning system can be constructed, at least in the short term. A more appropriate goal may be to provide thorough, easily accessible, and sophisticated analytic capabilities for accelerating further investigations once early indications of a threat are identified. These types of human-in-the-loop (or, more appropriately, epidemiologist-in-the-loop) systems can be supported in part by available data warehousing, data mining, and information visualization technologies. The hope would be to accelerate and enhance the epidemiological investigative processes to improve timeliness, while still controlling the specificity and associated risks of false alarms.
4 Flash Data Warehousing

Data warehousing technologies are a natural fit for many of the surveillance system requirements. In particular, the requirement for archiving both historical and real-time data is best accomplished using a data warehouse. In addition, any pattern recognition or data mining approach to threat detection will require a data warehouse infrastructure. Our interdisciplinary team has amassed considerable experience in using data warehousing and data mining technologies for community health status assessment [4]. This experience has centered on supporting the Comprehensive Assessment for Tracking Community Health (CATCH) methodology with advanced data warehousing components and procedures. The current CATCH data warehouse supports population-based health status assessments. The data warehouse serves as a historical repository of fine-grained data, such as individual births and deaths, which are used to form aggregate indicators of health status. We use a number of effective techniques to provide a thorough level of quality assurance in our health care data warehouse [3]. Reconstructing and analyzing historical patterns are tasks well suited to data warehousing technologies. This retrospective view is appropriate for the community-level health status reports that drove the early data warehouse development work. However, new challenges such as surveillance systems for bioterrorism require more timely data and real-time data warehousing approaches.

Corporate data warehousing efforts have followed a similar technological evolution. Data warehouses first supported the analysis of historical patterns, using the power of online analytic processing for queries and visualization of data extracted periodically from operational systems. Following these successful efforts, more
emphasis was placed on real-time decision support activities. There is tremendous interest in moving from periodic refreshment of data warehouses toward the real-time, more incremental data loading tasks that can support up-to-the-minute decision making [6]. Of course, moving from a monthly or even weekly perspective to a minute-by-minute timeframe usually means that the data being extracted may be incomplete and subject to change. There are far fewer opportunities to cleanse and transform the data in such real-time environments. In a sense, a real-time data warehouse is like a Polaroid photograph that begins to develop. At first the image is barely recognizable, but as more of the colors darken, an image begins to clarify, eventually becoming a stable snapshot of history. Early glimpses of the picture can be refined by cleverly estimating incomplete data, but the archival snapshot cannot be rushed. This process is much different from the careful extraction, transformation, and loading tasks that characterize many data warehouse staging activities. Such planned staging activities are much more akin to portrait photography than to Polaroid snapshots.

The challenge in bioterrorism surveillance lies in coupling a historical perspective with real-time data warehousing approaches. The overall architecture of the bioterrorism surveillance system is shown in Figure 3. The existing CATCH data warehouse provides the historical information against which any new data can be compared. The new components in our prototype bioterrorism surveillance system are the real-time data feeds and associated data warehouse components. These new flash data warehouse components are used to store partially available real-time data. The components act as persistent memory for incomplete real-time data that are preprocessed for comparative queries against the archival data warehouse components, and possibly overwritten as new data become available. The flash components share common metadata with the archival data warehouse, making the important data items useful for cross queries. It is these common data items that serve as input to any pattern recognition algorithms, and they may ultimately be identified as potentially abnormal events.

The success of any bioterrorism surveillance system also requires real-time data collection from a wide variety of sources. While this is a difficult task, it can be approached in an incremental fashion. In addition, the Internet and the rapidly developing wireless networking infrastructure provide expanding opportunities to use off-the-shelf technologies. As illustrated in Figure 4, data from many organizations can contribute to an effective surveillance system. Most database systems now incorporate many features for constructing distributed systems, thereby linking physically remote data sources. Many interoperability concerns have been reduced since most organizations now rely on one of the few dominant relational database engines. In order to support real-time data collection, our architecture is compatible with several existing projects in Florida. The Florida Department of Health has a number of important real-time data collection systems in place. Merlin is a web-based system for mandatory communicable disease reporting. These are multi-stakeholder efforts involving the standardization and reporting of laboratory tests, based on medical industry standards (e.g., NEDSS, HL7, LOINC, and SNOMED) [10].
The EpiCom system also includes information exchange, as well as communication between key public health decision makers. These ongoing projects are shown in Figure 4, along with other potential real-time data sources.
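The defining behavior of a flash component — keyed records that are overwritten as more complete data arrive, next to an append-only archival store — can be illustrated in a few lines of Python. This is our own sketch, not the project's implementation; all class and field names are hypothetical.

```python
# Toy contrast between flash (overwrite-in-place) and archival (append-only)
# storage semantics.
from dataclasses import dataclass, field

@dataclass
class FlashStore:
    """Persistent memory for incomplete real-time records."""
    records: dict = field(default_factory=dict)

    def upsert(self, key, record):
        # Later, more complete versions replace earlier glimpses, like a
        # Polaroid photograph gradually developing.
        self.records[key] = record

@dataclass
class ArchivalStore:
    """Stable historical snapshots; rows are never revised."""
    rows: list = field(default_factory=list)

    def load(self, record):
        self.rows.append(record)   # cleansed during planned staging, then frozen

flash = FlashStore()
flash.upsert("ER-4711", {"hospital": "H1", "symptoms": "resp", "complete": False})
flash.upsert("ER-4711", {"hospital": "H1", "symptoms": "resp", "complete": True})

archive = ArchivalStore()
archive.load(flash.records["ER-4711"])  # promoted once the snapshot stabilizes
```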
[Figure 3 layout: threat assessment dashboards and pattern recognition engines sit above two stores that share common metadata — the archival CATCH health care data warehouse, loaded from statewide health care databases through conventional DW staging, and the new flash data warehouse, fed real-time data through flash staging.]
Fig. 3. Flash Data Warehouse Architecture for Bioterrorism Surveillance
5 Demonstration Bioterrorism Surveillance System in Florida

The flash data warehouse architecture is employed in a demonstration surveillance system being developed in the State of Florida. The following sections provide a brief description of this ongoing project.

5.1 Bioterrorism Threat Indicators

Recent research on biological agents and our expertise with the CATCH healthcare data sets will be used in reviewing and identifying key bioterrorism threat indicators [13].
[Figure 4 layout: real-time feeds reach the surveillance system over the Internet or a private WAN from sources including air quality monitors, water quality monitors, ER signs and symptoms, hospital admissions, state laboratories, pharmacy data, practitioner offices, and the DOH Merlin and EpiCom systems.]
Fig. 4. Real-Time Data Feeds for Bioterrorism Surveillance
For example, large volumes of hospital discharge data are already used for community assessment purposes. This same data can be coupled with real-time admissions data from hospitals for the demonstration system. Table 1 summarizes potential threat indicators that can be defined based on International Classification of Diseases (ICD) codes and derived from hospital admission/discharge data; the sketch below illustrates how such code-based indicators might be screened.
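As a hedged illustration of how ICD-code-based indicators could be screened, the Python sketch below groups code prefixes into indicators and tallies matching admissions. The prefixes, indicator names, and record layout are invented for the example; they are not the contents of Table 1.

```python
# Toy ICD-prefix screening of admission records against threat indicators.
THREAT_INDICATORS = {
    "anthrax_like": ["022"],        # illustrative prefixes only
    "botulism_like": ["005.1"],
    "hemorrhagic_fever_like": ["065"],
}

def indicator_counts(admissions):
    """Count admissions whose ICD code matches any indicator prefix."""
    counts = {name: 0 for name in THREAT_INDICATORS}
    for record in admissions:
        for name, prefixes in THREAT_INDICATORS.items():
            if any(record["icd_code"].startswith(p) for p in prefixes):
                counts[name] += 1
    return counts

print(indicator_counts([{"icd_code": "022.1"}, {"icd_code": "486"}]))
# -> {'anthrax_like': 1, 'botulism_like': 0, 'hemorrhagic_fever_like': 0}
```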
5.2 Real-Time Data Feeds

The initial foci of the demonstration surveillance system will be hospital admission/discharge data, clinical electronic laboratory reporting, and existing efforts such as the Merlin and EpiCom systems. The ongoing community assessment work supported by the CATCH data warehouse already incorporates hospital discharge data for a variety of health status indicators. In addition, some preliminary experiments with real-time reporting from local hospitals have been conducted with great success. Therefore, the existing hospital admission/discharge data warehouse components provide a sound foundation for experimentation. The second area of data collection will focus on electronic laboratory reporting (ELR) [11]. This area of research has received a good deal of attention, with several successful efforts around the country. The successful application of the HL7 messaging standards, laboratory test standards such as Logical Observation Identifiers Names and Codes (LOINC), and vocabularies like the Systematized Nomenclature of Human and Veterinary Medicine (SNOMED) make this an appropriate area for demonstrating surveillance systems.

5.3 Pattern Recognition and Alarm Thresholds

One of the most challenging aspects of the demonstration system will be developing pattern recognition algorithms and defining the nature of “abnormal” patterns in the selected threat indicators (a simple thresholding sketch appears at the end of this section). The early detection of disease outbreaks is an emerging area of research that demands “extreme timeliness of detection” for identifying and responding to public health threats [15]. There has been a surge of interest in such early warning systems and corresponding attention to some fundamental questions:

• “Which data are useful for early detection?”
• “What are the timeliness requirements for outbreaks caused by different agents?”
• “How do we measure timeliness of a detection system for a specific type of outbreak, and especially for outbreaks such as large-scale inhalation anthrax that have not occurred in areas monitored by the new systems?”

Solutions to these complex pattern recognition and early detection problems will only come from sustained research and development. There is no silver bullet here.

5.4 Surveillance Dashboards

Once the real-time data feeds, flash data warehousing components, and initial pattern recognition algorithms are developed, the results of the system can be presented using Web-deployed surveillance dashboards. Figure 5 shows a high-level dashboard for the Florida demonstration system. A geographic information system (GIS) is used to present a map-based visualization of the area being monitored. Along the left side is a scrolling list of possible events for further investigation. Drill-down capabilities supported by the underlying data warehouse can be used to rapidly explore more detailed information. Zooming in on Miami-Dade County would lead to the dashboard depicted in Figure 6. In this view, a more detailed map is presented with the locations of hospitals marked. A potentially abnormal pattern exists at the highlighted hospital, and the scrolling event list could also be pared down by geographic location.
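As promised in Sect. 5.3, here is a deliberately naive thresholding baseline, under the assumption that daily indicator counts from the archival warehouse are available: flag an indicator when its real-time count exceeds the historical mean by several standard deviations. A production detector would also need to model seasonality, day-of-week effects, and reporting lags.

```python
# Naive alarm threshold: k standard deviations above the historical mean.
import statistics

def is_abnormal(history, current_count, k=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return current_count > mean + k * stdev

daily_counts = [4, 6, 5, 7, 5, 6, 4, 5]             # from the archival warehouse
print(is_abnormal(daily_counts, current_count=14))  # True -> dashboard alert
```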
Fig. 5. A Florida State-Wide Surveillance Dashboard
Fig. 6. Hospital Sites in Miami-Dade County with Alert Status
We are experimenting with different user-friendly presentation alternatives to best highlight bio-threats throughout the State of Florida. County health officials will be involved with the design and evaluation of the user interfaces on the surveillance dashboards.
6 Conclusions

The urgent public need for an effective bioterrorism surveillance system has led us to extend the CATCH data warehouse with real-time flash components. The incorporation of the flash data warehouse architecture into a prototype bioterrorism surveillance system in Florida, as shown in Figure 3, is ongoing. The following items review the research challenges discussed in this paper:

1. Identify the key bioterrorism threat indicators that could be monitored as part of the prototype system. A preliminary list of these indicators is presented in Table 1. The growing literature on biological and chemical threats is surveyed for appropriate indicators based on hospital and clinical laboratory data. A group of subject-area experts will be used to review and refine the selected indicators.
2. Arrange for real-time data feeds from selected organizations (Figure 4), such as local hospitals and/or clinical laboratories.
3. Design and implement flash data warehouse components and staging techniques for real-time data. These new components will draw on the existing CATCH data warehouse for design details, metadata, and the historical data for comparative purposes.
4. Develop pattern recognition algorithms that signal potentially abnormal events based on the selected threat indicators. This area will require extensive ongoing study as new indicators are added and the definition of “abnormal” becomes more refined.
5. Design and implement effective user interfaces, or threat assessment dashboards, for presenting the information as demonstrated in Figures 5 and 6.
6. Evaluate the prototype system using a panel of experts, soliciting feedback for further development activities. Selected state and county health officials will provide detailed feedback on their use of the prototype system so we can evaluate how the bioterrorism surveillance system will fit into state and county operational procedures.
References
1. Ackelsberg, J., Balter, S., et al. Syndromic Surveillance for Bioterrorism Following the Attacks on the WTC-NYC, 2001. Morbidity & Mortality Weekly Report (September 19, 2002), Centers for Disease Control and Prevention.
2. Barthell, E., Cordell, W., et al. The Frontlines of Medicine Project: A Proposal for the Standardized Communication of Emergency Department Data for Public Health Uses Including Syndromic Surveillance for Biological and Chemical Terrorism. Annals of Emergency Medicine 39, 4 (April 2002), 422–429.
3. Berndt, D., Fisher, J., Hevner, A., and Studnicki, J. Healthcare Data Warehousing and Quality Assurance. IEEE Computer 34, 12 (December 2001), 33–42.
4. Berndt, D., Hevner, A., and Studnicki, J. The CATCH Data Warehouse: Support for Community Health Care Decision Making. To appear in Decision Support Systems (2003).
5. Kaufmann et al. The Economic Impact of a Bioterrorist Attack: Are Prevention and Postattack Intervention Programs Justifiable? Emerging Infectious Diseases 3 (1997), 83–94.
6. Kimball, R. Realtime Partitions. Intelligent Enterprise (February 1, 2002).
7. Lampel, J. and Shapira, Z. Judgmental Errors, Interactive Norms, and the Difficulty of Detecting Strategic Surprises. Organization Science (2001).
8. Lazarus, R., Kleinman, K., et al. Use of Automated Ambulatory-Care Encounter Records for Detection of Acute Illness Clusters, Including Potential Bioterrorism Events. Emerging Infectious Diseases 8, 8 (August 2002), 753–760.
9. Lober, W., Karras, B., et al. Roundtable on Bioterrorism Detection: Information System-Based Surveillance. Journal of the American Medical Informatics Association 9, 2 (March–April 2002), 105–115.
10. NEDSS Working Group. National Electronic Disease Surveillance System (NEDSS): A Standards-Based Approach to Connect Public Health and Clinical Medicine. Journal of Public Health Management and Practice 7, 6 (November 2001), 43–50.
11. Overhage et al. Electronic Laboratory Reporting: Barriers, Solutions and Findings. Journal of Public Health Management and Practice 7, 6 (November 2001), 60–66.
12. Relman, D. and Olson, J. Bioterrorism Preparedness: What Practitioners Need to Know. Infectious Medicine 18, 11 (November 2001), 497–515.
13. Rotz, L., Khan, A., et al. Public Health Assessment of Potential Biological Terrorism Agents. Emerging Infectious Diseases 8, 2 (February 2002).
14. Shapira, Z. Risk Taking: A Managerial Perspective. Russell Sage Foundation, New York (1995).
15. Wagner et al. The Emerging Science of Very Early Detection of Disease Outbreaks. Journal of Public Health Management and Practice 7, 6 (November 2001), 51–59.
Privacy Sensitive Distributed Data Mining from Multi-party Data

Hillol Kargupta, Kun Liu, and Jessica Ryan

Computer Science and Electrical Engineering Department, University of Maryland Baltimore County, Baltimore, MD 21250, USA
{hillol, kunliu1, jyan4}@cs.umbc.edu
Abstract. Privacy is becoming an increasingly important issue in data mining, particularly in security and counter-terrorism-related applications where the data is often sensitive. This paper considers the problem of mining privacy sensitive distributed multi-party data. It specifically considers the problem of computing statistical aggregates like the correlation matrix from privacy sensitive data where the program for computing the aggregates is not trusted by the owner(s) of the data. It presents a brief overview of a random projection-based technique to compute the correlation matrix from a single third-party data site and also multiple homogeneous sites.
1 Introduction
Many homeland defense applications require mining heterogeneous data for creating profiles, constructing social network models, and detecting terrorist communications, among other tasks. Usually the data is very privacy-sensitive; financial transactions, health-care records, and network communication traffic are a few examples. Data mining in such privacy-sensitive domains is facing growing concerns. Therefore, we need to develop data mining techniques that are sensitive to the privacy issue [1,3,5,4]. This paper considers the problem of computing the correlation matrix from distributed data set(s) where the owner of the data does not trust the third party who developed the data mining program. Although correlation computation is a relatively simple statistical operation, its frequent use in data mining applications calls for the development of a privacy-sensitive counterpart. This paper offers a novel way to compute the correlation matrix from privacy-sensitive data by using a random projection-based approach. The paper also briefly reviews an extension of this framework for handling multiple distributed data sources introduced elsewhere [5]. The remainder of this paper is organized as follows. Section 2 describes the foundation of the technical approach adopted in this paper using orthogonal matrices. Section 3 offers a “double-sided” random projection-based technique that can be used to compute the correlation matrix in a privacy-sensitive manner. It also summarizes several results regarding the theoretical properties of the
proposed technique. Section 4 extends the proposed technique to compute the correlation matrix from distributed data. Section 5 presents the experimental results. Finally, Section 6 concludes the paper and outlines future research.
2 Privacy Preserving Correlation Computation and Orthogonal Matrices
The Pearson product-moment correlation coefficient, or correlation coefficient for short, is a measure of the degree of linear relationship between two random variables $X$ and $Y$. It is usually estimated from the given data set, comprised of $m$ tuples $(x_i, y_i)$, using the following expression:

$$\mathrm{Corr}(X, Y) = \sum_{i=1}^{m} x_i y_i . \qquad (1)$$
For the sake of simplicity, we assume that the data columns are normalized so that they have zero mean and unit length ($\ell_2$ norm). Computing correlations using the above expression in a straightforward fashion requires direct access to the data: we cannot compute the correlation coefficient using Equation 1 unless we know the values of the tuples $(x_i, y_i)$. However, in a privacy-sensitive application we cannot allow that. In this scenario, the data matrix $U$ belongs to someone else. We can get the meta-data information that tells us about the underlying schema, the number of observed attributes, and the number of observed data points. Our goal is to compute the correlation matrix by observing some representation of $U$ that does not allow reconstruction of the original data matrix $U$.

It is natural to wonder why we cannot provide the correlation matrix to a third party directly, since normally the raw data set cannot be reconstructed from the correlation matrix alone. That is certainly possible if the client party is just interested in the correlation matrix, the data is owned by a single party, and the owner of the data has the resources to compute it. However, the objective of privacy-sensitive data mining technology is somewhat different. In this domain, it is normal to assume that the owner of the data is not necessarily the data miner itself, and the owner may not have any resources for mining the data. In our opinion, the final objective is to allow the data miner only limited access to some representation of the data in such a way that a class of mining objectives is fulfilled. For example, detecting bio-terrorism may require mining clinical records and pharmacy transaction data that belong to different parties such as hospitals and drugstores. These parties may not have the capabilities to write their own trusted data mining software; it also may not be a part of their mission. A third party, e.g., a government agency, may however seek access to the data for detecting emerging patterns. Its objective may be to apply clustering techniques for detecting outliers. This may require clustering, principal component analysis, and correlation analysis. These techniques are often packaged together in a single commercial data mining product,
and just providing the correlation matrix may not serve the purpose. What we really need is a secured representation of the data that allows performing each of those data mining operations without allowing the possibility of reconstructing the original data. Moreover, the idea of providing just the correlation matrix to the client party simply does not work for heterogeneous distributed data sets [5].

The remainder of this section offers the main idea behind the proposed approach. Let $U$ be an $m \times n$ matrix, $R_1$ an $n \times n$ random orthogonal matrix, and $R_2$ an $m \times m$ random orthogonal matrix. Now consider the following sequence of linear transformations of the data matrix $U$:

$$U_1 = U R_1, \qquad U_2 = U_1^T R_2 = R_1^T U^T R_2 , \qquad (2)$$

so that

$$U_2 U_2^T = (R_1^T U^T R_2)(R_1^T U^T R_2)^T = R_1^T U^T R_2 R_2^T U R_1 .$$

Since both $R_1$ and $R_2$ are orthogonal matrices, we can write

$$R_1 U_2 U_2^T R_1^T = R_1 R_1^T U^T R_2 R_2^T U R_1 R_1^T = U^T U . \qquad (3)$$

Now recall that $U^T U$ is nothing but the correlation matrix of $U$. So if the owner of the data set $U$ computes $U_2$ and hands over both $U_2$ and the matrix $R_1$ to a third party, the correlation matrix can still be computed by that party. However, since the matrix $R_2$ is hidden, there is no way to reconstruct the matrix $U$ from $U_2$ and $R_1$. The following lemma formally states this claim.

Lemma 1. [5] Given an $m \times n$ real-valued data matrix $U$, two random orthogonal matrices $R_1$ and $R_2$ such that $R_1 \neq I$ and $R_2 \neq I$, and $U_2$ as defined above by Equation 2, the matrix $U$ is not uniquely defined by $U_2$ and $R_1$.

Since the matrix $U_2$ is generated by two linear transformations, in the rest of this paper we shall call it a “double-sided” transformation of the matrix $U$. The following section explores a randomized approach to this problem. It shows that randomly generated projection matrices can be used to compress the data while preserving the correlation information without exposing the raw data.
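To make the construction concrete, here is a minimal NumPy sketch of Equations 2 and 3 — our own illustration, not code from the paper: the data owner draws two hidden orthogonal matrices, releases only $U_2$ and $R_1$, and the third party recovers $U^T U$ exactly while $R_2$ (and hence $U$) stays hidden.

```python
# Minimal sketch of the double-sided transformation (Eqs. 2-3); dimensions
# and the random seed are arbitrary choices for the example.
import numpy as np

m, n = 100, 25
rng = np.random.default_rng(0)
U = rng.standard_normal((m, n))
U -= U.mean(axis=0)                     # zero-mean columns
U /= np.linalg.norm(U, axis=0)          # unit-length columns

R1, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal n x n
R2, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal m x m

U2 = (U @ R1).T @ R2                    # U2 = R1^T U^T R2          (Eq. 2)
recovered = R1 @ U2 @ U2.T @ R1.T       # = U^T U, since R2 R2^T = I (Eq. 3)
print(np.allclose(recovered, U.T @ U))  # True
```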
3 Random Projection Matrices for Correlation Computation
The previous section described a privacy preserving correlation computation algorithm using orthogonal matrices. This section points out that the same thing can be accomplished by randomly generated matrices.
A random matrix is a matrix whose elements are random variables with given probability laws. In this section, we are particularly interested in random projection matrices. From Equation 3 we note that the matrices $R_1$ and $R_2$ should satisfy the constraints $R_1 R_1^T = R_2 R_2^T = I$. So far we treated $R_1$ and $R_2$ as square, orthogonal matrices. Now let us change gears and redefine them. Let $R_1$ be an $n \times k_1$ dimensional random matrix whose entries are independent, identically distributed (i.i.d.) according to some unknown distribution with zero mean and unit variance. Similarly, let $R_2$ be an $m \times k_2$ dimensional random matrix with i.i.d. entries with zero mean and unit variance. The randomized approach presented in this section exploits the fact that $R^T R$ approximates a (scaled) identity matrix on average. Intuitively, this result echoes the observation made elsewhere [7] that in a high-dimensional space vectors with random directions are almost orthogonal. A similar result was proved elsewhere [2].

Lemma 2. Let $R$ be a $p \times q$ dimensional random matrix such that each entry $r_{i,j}$ of $R$ is independently chosen according to some unknown distribution with mean 0 and variance 1. Then $E[R R^T] = qI$.

Lemma 2 can be used to prove [5] the following result.

Lemma 3. Given an $n \times k_1$ dimensional random matrix $R_1$ and an $m \times k_2$ dimensional random matrix $R_2$ as defined by Lemma 2, and the “doubly projected” matrix $U_2$ defined by Equation 2, $E[R_1 U_2 U_2^T R_1^T] = k_1^2 k_2\, U^T U$.

This result points out that one can estimate the correlation matrix $U^T U$ by computing the average of $R_1 U_2 U_2^T R_1^T$ (scaled by $1/(k_1^2 k_2)$). The average is computed over multiple trials.
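A small Monte Carlo experiment illustrates Lemma 3; this is our own sketch with arbitrary dimensions, not the authors' test harness. Averaging $R_1 U_2 U_2^T R_1^T / (k_1^2 k_2)$ over repeated draws of the rectangular random matrices approximates the correlation matrix, with an error of the kind the experiments of Section 5 quantify.

```python
# Monte Carlo estimate of the correlation matrix from doubly projected data.
import numpy as np

m, n, k1, k2, trials = 100, 25, 20, 80, 25
rng = np.random.default_rng(1)
U = rng.standard_normal((m, n))
U -= U.mean(axis=0)
U /= np.linalg.norm(U, axis=0)

estimate = np.zeros((n, n))
for _ in range(trials):
    R1 = rng.standard_normal((n, k1))   # i.i.d. entries, mean 0, variance 1
    R2 = rng.standard_normal((m, k2))
    U2 = R1.T @ U.T @ R2                # the "doubly projected" matrix
    estimate += (R1 @ U2 @ U2.T @ R1.T) / (k1**2 * k2)
estimate /= trials

rmse = np.sqrt(np.mean((estimate - U.T @ U) ** 2))
print(f"RMSE against the true correlation matrix: {rmse:.3f}")
```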
4 Computing Correlation from Distributed Data
This section points out that the proposed technique can also be directly applied to compute correlation matrices from multiple distributed data sites. Let us consider homogeneous sites [6], where each site observes the same set of attributes but the observations are different. This scenario is also sometimes called the horizontally partitioned distributed data mining scenario. Let $U$ and $V$ be the two data sets owned by two different parties. Both data sets observe the same set of attributes. Each column vector of the data sets has been normalized to have zero mean and unit length. Let $x$ and $y$ be two such attributes.
Fig. 1. Performance of the random projection-based algorithm with respect to varying $k_1$ and $k_2$ on synthetic data sets (root mean square error plotted against the size of $k_1$ as a percentage of $n$ and the size of $k_2$ as a percentage of $m$). The estimated correlation matrix is an average of 25 independent trials.

Fig. 2. Comparison of the random projection-based privacy preserving method on centralized data sets and homogeneous distributed data sets (root mean square error vs. number of trials of double projection); $k_1 = 80\%$ of $n$, $k_2 = 80\%$ of $m$.
Also let $\mathrm{Corr}_U(x, y)$, $\mathrm{Corr}_V(x, y)$, and $\mathrm{Corr}_{U \cup V}(x, y)$ be the correlation matrices estimated from data sets $U$, $V$, and $U \cup V$, respectively. Then we can write

$$\mathrm{Corr}_{U \cup V}(x, y) = \frac{\mathrm{Corr}_U(x, y) + \mathrm{Corr}_V(x, y)}{2} . \qquad (4)$$
The proposed approach exploits this useful decomposability property. After obtaining the estimated correlation matrix from both sites, we can combine them by averaging to produce the overall estimated correlation matrix for the entire data set, without ever seeing the raw data from either site. An extension of this approach for handling distributed heterogeneous data sites is introduced elsewhere [5]. The following section presents experimental results to back up the theoretical claims.
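A brief sketch of the horizontally partitioned setting of Equation 4, again our own illustration: each site computes its correlation estimate locally — in practice via the double projection above; here the exact local matrix stands in for brevity — and only the estimates, never the raw rows, cross site boundaries.

```python
# Combining per-site correlation estimates for homogeneous (horizontally
# partitioned) data, following Eq. 4.
import numpy as np

def normalize(A):
    A = A - A.mean(axis=0)
    return A / np.linalg.norm(A, axis=0)

def site_estimate(U_site):
    # Stand-in for a site's privacy-preserving estimate of U^T U.
    return U_site.T @ U_site

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 25))
U, V = X[:100], X[100:]            # two sites observing the same attributes

estimates = [site_estimate(normalize(U)), site_estimate(normalize(V))]
overall = sum(estimates) / len(estimates)   # Eq. 4, entrywise average
```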
5 Experimental Results
We performed two sets of experiments. The first set measures the accuracy of the random projection mechanism for a single-party data site, and the second set measures the same using distributed, homogeneous data sites. The results reported here use several data sets, randomly generated using Matlab, each containing 100 observations and 25 features. The accuracy is measured in terms of the root mean square error (RMSE) between the correlation matrix generated by the proposed approach and the one computed from the original raw data using Equation 1. We first consider the scenario with a single party hosting a privacy-sensitive data set and a client interested in the correlation matrix of the data. Two random matrices $R_1$ and $R_2$ are generated with i.i.d. N(0, 1) entries. Figure 1 shows
the variation of the difference between the estimated and the original correlation matrices. For the experiments with distributed and homogeneous data sites, we partitioned the data sets (200 observations and 25 features) horizontally into two subsets, one containing the first 100 observations and the other containing the rest. The number of trials varied from 10 to 100. We compared the performance of the proposed random projection-based method on the complete data set with the same method on the distributed data sets. Figure 2 shows the results. More experimental results regarding the performance of the double-projection algorithm can be found elsewhere [5]. The following section concludes this paper.
6 Conclusions and Future Work
This paper presented a novel approach to computing the correlation matrix from multi-party privacy-sensitive data. The technique can also be easily extended to computing inner product matrices and other related statistics. The approach works using a sequence of randomized linear transformations of the data and guarantees that the original data cannot be reconstructed from the available information. It first shows that orthogonal matrices can serve this purpose. Next it shows that randomly generated projection matrices can also be used to do the same in a probabilistic sense. The proposed approach is simple and practical. The random projection-based technique may be even more powerful when used with other well-known techniques, like the Fourier transformation or PCA, that are frequently used for extracting the salient features of data. Such a combined approach may further reduce the communication cost, and we are exploring this possibility.

Acknowledgments. The authors acknowledge support from NASA (NRA) NAS2-37143 and the United States National Science Foundation CAREER award IIS-0093353.
References
1. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the SIGMOD Conference, pages 439–450, 2000.
2. R. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proc. of the 40th Foundations of Computer Science, New York, New York, 1999.
3. M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data, 2002.
4. H. Kargupta, S. Datta, and K. Sivakumar. Random value perturbation: Does it really preserve privacy? Technical Report TR-CS-03-25, Computer Science and Electrical Engineering Department, University of Maryland, Baltimore County, 2003.
5. H. Kargupta, K. Liu, and J. Ryan. Random projection and privacy preserving correlation computation from distributed data. Technical Report TR-CS-03-24, Computer Science and Electrical Engineering Department, University of Maryland, Baltimore County, 2003.
6. H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective towards distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery, Eds.: Kargupta, Hillol and Chan, Philip. AAAI/MIT Press, 2000.
7. R. Hecht-Nielsen. Context vectors: General purpose approximate meaning representations self-organized from raw data. Computational Intelligence: Imitating Life, pages 43–56, 1994.
ProGenIE: Biographical Descriptions for Intelligence Analysis

Pablo A. Duboue, Kathleen R. McKeown, and Vasileios Hatzivassiloglou

Columbia University, Dept. of Computer Science, New York, NY 10025, USA
{pablo,kathy,vh}@cs.columbia.edu
http://www.cs.columbia.edu/nlp/
Abstract. Intelligence analysts face the need for immediate, up-to-date information about individuals of interest. While biographies can be written and stored in text databases, we argue that they can quickly become obsolete for living persons. We present here the architecture of ProGenIE, a biographical description generator currently under construction, focusing on the requirements of the task and its impact on potential users.
1 Introduction
Intelligence and law enforcement personnel face the need for immediate, up-to-date information about individuals of interest. Biographies or profiles can be used to present such information. Obviously, it is not feasible to have human writers produce biographies for each person at hand. Even if that were the case, such biographies would need to be stored in textual databases, where they would lose track of the most recent, and arguably most important, activities of the person being described. As part of the joint Columbia University–University of Colorado Open Question Answering project (AQUAINT), we present here our proposed architecture for a biographical description generator that works from knowledge sources and information extracted from the Internet. Our goal is to provide intelligence and law enforcement personnel with the means to quickly and concisely communicate information about military and political personnel from foreign countries, as well as terrorists and criminals. Working in different scenarios, different users of the system will require different presentations of the available data about a given individual. For example, one analyst might want to see an overview of all data for a particular person, while another analyst may be looking for ties between a well-known terrorist and a particular country. We intend to fulfill these requirements via on-the-fly generation of such person descriptions. This paper is organized as follows: we briefly describe the motivation and relevance of our system. In Section 3, we describe ProGenIE’s three major components. Some final remarks conclude this paper.
2 Motivation and Relevance
The generation of person descriptions has been addressed in the past by IR, summarization, and NLG techniques. IR-based systems [1] look for existing biographies in a large textual database such as the Internet. Summarization techniques [2] produce a new biography by integrating pieces of text from various textual sources. Natural language generation systems for biography generation [3] create text from structured information sources. Ours is a novel approach that builds on the NLG tradition. We will combine a generator with an agent-based infrastructure, expecting ultimately to mix textual sources (such as existing biographies and news articles) as well as non-textual ones (such as airline passenger lists and bank records). ProGenIE will offer significant advantages, as pure knowledge sources can be mixed directly with text sources and numeric databases. It diverges from the NLG tradition in that we use examples from the domain to automatically construct content plans. Such plans will guide the generation of biographies of unseen people. Moreover, the output of the system can be personalized; and because the system learns from examples, it can be dynamically personalized.
3 System Description
Three components make up our system: a knowledge component, a learning component (our research focus), and a generation component.

Learning Component. The key to greater flexibility in biography generation lies in a particular piece of the generation pipeline, the Content Planner. A content planner is responsible for the distribution of the information among the different paragraphs, bulleted lists, and other textual elements. Information-rich inputs require thorough filtering, resulting in only a small amount of the available data being conveyed in the output. The selection and structuring of the text, performed by the content planner, is thus responsible for the flexibility we seek. Our research objectives focus on the automatic acquisition of schemas, data structures that guide the content planning process [4], by means of machine learning techniques. We employed an aligned corpus of input data and output text to induce schemas using stochastic search [5]. Such schemas are then used to generate biographies of new people, different from those used to learn them. The final system can then be easily customized for new needs or scenarios by the final users, expanding the current work to possibilities unforeseen by us.

Knowledge Component. While the data employed to generate the biographies can be supplied by internal databases and networks such as Intelink or IAFIS [6], we plan to provide input to the generator by using information extraction agents on the Internet. Publicly available data can be of great use for mining information about well-known personalities and serves as a test bed for the final system running on private intranets. To represent the input to the generator, we chose a variation of RDF. This selection strives for generality, reuse, and portability.
Generation Component. Seven modules in a pipeline will compose ProGenIE: an Inference Module, a Content Planner, a Text Planner, a Referring Expression Generator, an Aggregation Module, a Lexical Chooser, and a Surface Realizer. In this setting, the Content Planner module executes the learned schemas. We use a variation of the Lexical Chooser (which selects words for concepts) from the MAGIC generator and the FUF/SURGE unification-based package for the Surface Realizer. The other modules behave as follows: the Inference Module performs some limited world-knowledge inferencing; the Text Planner splits a rhetorical tree into paragraphs; the Referring Expression Generator handles mostly pronominalization, although it can scan the input to generate descriptions like “his father”; and the Aggregation Module is responsible for merging clauses with similar structure, in order to avoid repetition. A skeletal sketch of this pipeline appears below.
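The following Python sketch shows the seven-module pipeline as simple function composition; the module names come from the text, but the placeholder bodies and the data format are ours and purely illustrative.

```python
# Hypothetical pipeline skeleton; each stage would be a substantial module
# in the real system, and here simply passes its input through.
from functools import reduce

def inference(facts):            return facts  # limited world-knowledge inferencing
def content_planning(facts):     return facts  # executes the learned schemas
def text_planning(plan):         return plan   # splits a rhetorical tree into paragraphs
def referring_expressions(plan): return plan   # pronominalization, "his father", ...
def aggregation(plan):           return plan   # merges clauses with similar structure
def lexical_choice(plan):        return plan   # selects words for concepts
def surface_realization(plan):   return "<biography text>"  # e.g., via FUF/SURGE

PIPELINE = [inference, content_planning, text_planning, referring_expressions,
            aggregation, lexical_choice, surface_realization]

def generate(rdf_like_facts):
    return reduce(lambda data, stage: stage(data), PIPELINE, rdf_like_facts)

print(generate({"person": "Subject X", "affiliation": "..."}))
```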
4 Final Remarks
We have presented here a biography generation system with three components. A prototype of the learning component inferred plans for an earlier domain [5]. A new version of it, focusing on selecting appropriate pieces of content for a given biographical task, succeeded in halving the available data while keeping the correct data for further verbalization [7]. The generation component currently has five operational modules, at different levels of completion. The remaining two modules are the Lexical Chooser, which is undergoing knowledge acquisition, and the Aggregation Module, which is in the design phase. ProGenIE addresses an existing requirement of intelligence and law enforcement personnel. Its design has benefited greatly from interviews with potential users, a fact reflected in the architecture presented here. We plan to have an integrated biography generator by the end of 2003, operating over publicly available Internet sources.
References
1. Müller, A., Kutschekmanesch, S.: Using abductive inference and dynamic indexing to retrieve multimedia SGML documents. In: MIRO ’95 (1995)
2. Schiffman, B., Mani, I., Conception, K.J.: Producing biographical summaries: Combining linguistic knowledge with corpus statistics. In: ACL-EACL 2001 (2001)
3. Teich, E., Bateman, J.A.: Towards an application of text generation in an integrated publication system. In: Proc. of 7th IWNLG (1994)
4. McKeown, K.R.: Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge University Press (1985)
5. Duboue, P.A., McKeown, K.R.: Content planner construction via evolutionary algorithms and a corpus-based fitness function. In: INLG-2002 (2002)
6. U.S. Dept. of Justice, F.B.I.: Inauguration of the integrated automated fingerprint identification system (IAFIS). Press Release (1999)
7. Duboue, P.A., McKeown, K.R.: Statistical acquisition of NLG content selection rules. Submitted (2003)
Scalable Knowledge Extraction from Legacy Sources with SEEK

Joachim Hammer¹, William O’Brien², and Mark Schmalz¹

¹ Department of Computer & Information Science & Engineering, 301 CSE Building, Box 116120, University of Florida, Gainesville, FL 32611-6120, U.S.A. {jhammer,mssz}@cise.ufl.edu
² M.E. Rinker, Sr. School of Building Construction, 304 Rinker Hall, Box 115703, University of Florida, Gainesville, FL 32611-5703, U.S.A. [email protected]

Abstract. The SEEK project (Scalable Extraction of Enterprise Knowledge) at the University of Florida is directed toward developing scalable data access and extraction technology for overcoming problems of assembling and integrating knowledge resident in numerous legacy information systems. Additionally, this integrated information is to be made available for analysis and decision support. Development of theory and knowledge in this area is relevant to many applications that depend on integrated access to heterogeneous information, including detection and prevention of terrorist attacks, tactical situation analysis in battlefields, etc. SEEK is a modular toolkit that provides the ability to extract and compose knowledge resident in sources, enabling the rapid instantiation and configuration of value-added wrappers and mediators.
1 Introduction

Events such as the 1995 nerve gas attack on a crowded Tokyo subway station by the Japanese millennial cult Aum Shinrikyo or the tragic attacks on the World Trade Center and the resulting loss of lives on 9/11/2001 have reminded us of the threat and horror of terrorist attacks. Equally alarming are the disclosure of information about the former Soviet Union’s massive biowarfare program and discoveries about the disturbing extent of Iraqi president Saddam Hussein’s hidden chemical and biological arsenals. Successful detection of, and response to, biowarfare and other terrorist attacks requires access to intelligence information from a wide variety of sources. However, the number of sources available in electronic form, as well as their diversity (heterogeneity) in terms of access mechanisms, operation, and representation of contents, is staggering (see, for example, [6]). As a result, the time and investment needed to establish integrated access to multiple sources, especially legacy sources, has imposed severe limitations on the scalability and maintainability of current integration technologies. For example, current wrapper development tools [2, 3] require significant programmatic set-up with limited reusability of code. Efforts are
now under way to develop languages and tools for describing resources in a way that can be processed by computers (e.g., Web services such as Microsoft’s .NET, the Semantic Web [1]). However, they do not address the problem of how to discover and collect the available information, or how to maintain it efficiently for the continuously increasing number of legacy sources. The SEEK project¹ (Scalable Extraction of Enterprise Knowledge) at the University of Florida is directed toward developing scalable data access and extraction technology for overcoming selected problems in (1) assembling and integrating knowledge resident in numerous legacy information systems and (2) making this knowledge available for analysis and decision support. For a detailed overview of the SEEK project, the reader is referred to [7].

¹ Supported by the National Science Foundation under grant numbers CMS-0075407 and CMS-0122193.
[Figure 1 layout: the SEEK components form a middleware layer between legacy data and systems (sources) and end users and decision support applications. At run time, data flows from the legacy sources through the wrapper and the analysis module to decision makers such as a commander or situation analyst; at build time, the knowledge extraction module and connection tools, guided by a source expert, set up and tune the wrapper and analysis module.]
Fig. 1. Schematic diagram of SEEK logical architecture.
2 Overview of the SEEK Approach

A high-level view of the SEEK architecture is shown in Fig. 1. SEEK follows established integration methodologies and provides a modular middleware layer (labeled SEEK Components in Fig. 1), which bridges the gap between legacy information sources and decision makers or support tools. At runtime, the analysis module (which is similar to a mediator [9] in the generic integration architecture) processes queries from end users (e.g., decision support tools and analysts) and performs knowledge composition, including basic mediation tasks and postprocessing of the extracted data. Data communication between the analysis module and the legacy source is provided by the wrapper component, which translates SEEK queries into access commands understood by the source and converts native source results into SEEK’s internal language.

However, unlike existing integration approaches, we have emphasized providing support for configuration of SEEK’s analysis module and wrapper(s). For example, the analysis module must be configured with information about source capabilities, available knowledge, and its representation. The wrapper must be configured with information regarding communication protocols between SEEK and legacy sources, access mechanisms, and underlying source schemas. Hence our contributions to the state of the art are tools and algorithms for extracting structure and semantics from the underlying legacy source(s). To this end, we have developed a three-step extraction approach, which is implemented inside the knowledge extraction module in Fig. 1. Note that our approach is based on the assumption that information is stored in some form of data repository, which may be accessed by application code.

1. Schema reverse engineering and semantic analysis: SEEK generates a detailed description of the data repository using Data Reverse Engineering (DRE). The DRE algorithm is capable of producing schema and constraints for relational databases, including entities, relationships, and encoded business rules, with significantly less human input than current approaches. This information is augmented with semantics, which are extracted by the Semantic Analyzer (SA) from the application code accessing the data repository. The SA algorithm is capable of examining code written in either C or Java using a combination of different program comprehension techniques, and extracts the application-specific meaning of those program elements that are shared between the code and the data repository. For details on DRE and SA, the reader is referred to [5] and [8].
2. Domain model mapping: The semantically enhanced legacy source schema is mapped onto the domain model used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model.
3. Wrapper generation: The extracted legacy schema and the mapping rules provide the input to a wrapper generation toolkit (see, for example, [4]) for fast, scalable, and efficient implementation of the source wrapper. At runtime, the source wrapper translates queries from the application domain model to the legacy source schema (a sketch of such a translation follows this list).

The preceding three steps are carried out under the guidance of a source expert who can also extend the capabilities of the initial, automatic configuration directed by the knowledge extraction module. Use of domain experts in knowledge extraction and mapping rule generation is especially necessary for the poorly formed database specifications often found in older legacy systems. Furthermore, the knowledge extraction module also enables step-wise refinement of the wrapper configuration to improve extraction capabilities.

It is important to note that SEEK is not a general-purpose toolkit. Rather, it allows extraction of the knowledge required by specific types of decision support applications. Thus, SEEK enables scalable implementation of computerized decision and negotiation support across a network of sources.
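The following hypothetical Python sketch illustrates the role of the mapping rules from step 2 in the generated wrapper of step 3: a domain-model selection is rewritten into the legacy schema. Every table and column name here is invented for the example.

```python
# Hypothetical mapping rules (step 2 output) driving a toy wrapper (step 3):
# a domain-model selection is rewritten into legacy-schema SQL.
MAPPING_RULES = {
    ("Project", "name"):   ("PRJ_MASTER", "PRJ_NM"),
    ("Project", "budget"): ("PRJ_MASTER", "BUDG_AMT"),
}

def translate(entity, attributes):
    """Rewrite a domain-model attribute selection into legacy SQL."""
    table, columns = None, []
    for attr in attributes:
        table, column = MAPPING_RULES[(entity, attr)]  # assumes one table per entity
        columns.append(column)
    return f"SELECT {', '.join(columns)} FROM {table}"

print(translate("Project", ["name", "budget"]))
# -> SELECT PRJ_NM, BUDG_AMT FROM PRJ_MASTER
```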
SEEK represents a departure from research and development in shared data standards. Instead, SEEK embraces heterogeneity in information systems, providing the ability to extract and compose knowledge resident in sources that vary in the way data is represented and in how it can be queried and accessed.
3 Conclusion

To date, we have built an initial prototype to demonstrate the feasibility of our approach. Future plans for the SEEK project are to develop a matching tool capable of producing mappings between two semantically related yet structurally different schemas. Currently, schema matching is performed manually, which is a tedious, error-prone, and expensive process. We are also in the process of integrating SEEK with a wrapper development toolkit to determine if the extracted knowledge is sufficiently rich semantically to support compilation of legacy source wrappers for various domains. The eventual system concept is that of a large, nearly automatic system that can (1) acquire large amounts of knowledge from multiple legacy systems, (2) extend and enhance its on-board knowledge representation and characterization capabilities through ontology-based learning, and (3) thus make each successive acquisition of knowledge from a legacy system easier and more accessible to the SEEK user community.
References
1. T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific American, 2001.
2. J.-R. Gruser, L. Raschid, M. E. Vidal, and L. Bright, "Wrapper Generation for Web Accessible Data Sources," 3rd IFCIS International Conference on Cooperative Information Systems, New York City, New York, USA, 1998.
3. J. Hammer, M. Breunig, H. Garcia-Molina, S. Nestorov, V. Vassalos, and R. Yerneni, "Template-Based Wrappers in the TSIMMIS System," Twenty-Third ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, 1997.
4. J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos, "Template-Based Wrappers in the TSIMMIS System," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 26, pp. 532–535, 1997.
5. J. Hammer, M. Schmalz, W. O'Brien, S. Shekar, and N. Haldavnekar, "Knowledge Extraction in the SEEK Project Part I: Data Reverse Engineering," University of Florida, Gainesville, FL, Technical Report TR02-008, September 2002.
6. W. Kent, "The Many Forms of a Single Fact," IEEE Spring Compcon, San Francisco, CA, 1989.
7. W. O'Brien, R. R. Issa, J. Hammer, M. S. Schmalz, J. Geunes, and S. X. Bai, "SEEK: Accomplishing Enterprise Information Integration Across Heterogeneous Sources," ITCON – Journal of Information Technology in Construction, vol. 7, pp. 101–124, 2002.
8. S. Shekar, J. Hammer, and M. Schmalz, "Extracting Meaning from Legacy Code through Pattern Matching," Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611-6120, Technical Report TR03-003, January 2003.
9. G. Wiederhold, "Mediators in the Architecture of Future Information Systems," IEEE Computer, vol. 25, pp. 38–49, 1992.
“TalkPrinting”: Improving Speaker Recognition by Modeling Stylistic Features

Sachin Kajarekar, Kemal Sönmez, Luciana Ferrer, Venkata Gadde, Anand Venkataraman, Elizabeth Shriberg, Andreas Stolcke, and Harry Bratt

Speech Technology and Research Laboratory, SRI International, Menlo Park, CA
Abstract. Automatic speaker recognition is an important technology for intelligence gathering, law enforcement, and audio mining. Conventional speaker recognition systems, which are based on independent short-term spectral samples, suffer from a lack of noise robustness and are unable to model a speaker’s idiosyncratic stylistic features. This paper describes “TalkPrinting”, a program of research aimed at adding such stylistic features to conventional systems. Results on three preliminary systems based on stylistic features demonstrate that (1) the new features alone carry significant speaker information; (2) they also carry significant complementary information compared to the conventional features; and (3) they provide increasing improvements in performance with increasing test durations.
1 Introduction

Automatic speaker recognition is the task of determining a speaker’s identity from his or her speech. It is a crucial technology for finding and tracking conversations involving particular target speakers in the unwieldy amount of audio data captured each day by intelligence sources. It can be used to automatically prioritize conversations for further analysis by human listeners. Conventional features for speaker recognition are based on the short-term spectrum¹ (e.g., cepstrum). However, they lack robustness to mismatched training and testing conditions, e.g., due to different telephone handsets [6]. To improve robustness, researchers are investigating features that reflect speaking style, such as idiosyncratic word usage, intonation, and timing patterns. Some of these features were recently investigated at a workshop at Johns Hopkins University (JHU) [7] using an acoustic-prosodic feature database developed at SRI for other work [8]. The workshop results showed significant improvements in the performance of a conventional speaker recognition system from adding features that reflect higher-level and longer-term properties.
¹ These features indirectly capture the speaker’s vocal tract shape and its movements.
This paper describes further efforts to exploit speaking-style features for speaker recognition, using an approach we call TalkPrinting². We describe three types of features beyond those studied at the workshop: (1) higher-order cepstra modeling speaker fundamental frequency (F0), (2) language models capturing word usage, and (3) word duration models. Section 4 discusses the results of the individual systems and their combination, Section 5 describes the effect of different test sample durations, and Section 6 presents conclusions from this work.
2 Overview of Task and Baseline System

We use data from the NIST 2001 speaker recognition evaluation extended-data speaker recognition task [5]. This is the entire Switchboard-I corpus [4] divided into six “splits”, with 1 to 16 training conversations per split. Development on the whole task is computationally expensive, and we observed that meaningful results can be obtained by defining a “short” task, i.e., only two splits of 8-conversation training data. This report provides results on the short task³. We report the performance of systems in terms of equal error rate (EER). EER is the point on the receiver operating characteristic curve at which false-acceptance and false-rejection errors are equal. To provide a baseline system for combination with our proposed new features, we extended SRI’s conventional speaker recognition system [9]. This Gaussian mixture model (GMM) [6] based system uses standard Mel-frequency cepstral coefficients (MFCCs) as features. The background GMM is trained with data from many speakers, and the speaker GMM is adapted from the background GMM using training data for the respective speaker.
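The EER computation just described can be sketched in a few lines of Python (our own illustration, not SRI's evaluation tooling): sweep a decision threshold over target and impostor scores and take the operating point where the false-acceptance and false-rejection rates meet.

```python
# Toy EER computation over synthetic score distributions.
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(target_scores < t)      # targets falsely rejected
        far = np.mean(impostor_scores >= t)   # impostors falsely accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

rng = np.random.default_rng(3)
print(equal_error_rate(rng.normal(2, 1, 1000), rng.normal(0, 1, 1000)))
```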
3 High-Level Speaker Features

We investigated three types of TalkPrinting features: F0-based, language-based, and duration-based. For the latter two feature types, we report results on true words.

F0-based features. We used higher-order linear frequency cepstra to represent F0, which allowed us to use the same modeling approach as for the baseline GMM system using MFCC features. This allowed us to model pitch information more explicitly than in the baseline system.

Language-based features. We examined idiosyncratic word patterns, modeled using individual words and bigrams, following on previous work [1]. We characterized the stream of words by a single statistical language model (LM). This allowed us to optimize the vocabulary size of the LM, as well as the length of the N-grams to be used.
3
This captures the notion that we are adding features related to voluntary stylistic patterns in how a person talks, rather than relying solely on features related to pre-determined vocal physiology Note that comparison experiments on the full task were similar or improved; results reported here thus underestimate our true performance.
352
S. Kajarekar et al.
Duration-based features. We adapted a duration model used for speech recognition [3] to capture individual differences in speaking rate, constrained by lexical information. We represented each word by a vector of the durations of the individual phones in the word. Using these vectors, background and speaker models (GMMs) were estimated using an approach similar to that used for the baseline system. During speaker adaptation, if a word had insufficient training data then we backed off to phone duration models.
4 Results System scores from the high-level features and the baseline features were combined using linear discriminant analysis (LDA) [2]. Results on different system combinations on the short-task are shown in Table 1. Table 1. Performance of different systems (and combinations) on the short task4
System Baseline (S1) F0-based (S2) Language-model-based (S3) Duration-based (S4) S1+S2 S1+S3 S1+S4 S1+S2+S3 S1+S2+S3+S4
EER (%) 2.6 10.9 12.6 10.6 2.5 2.1 1.5 1.8 1.2
The EER of our baseline system (S1) is 2.6%. Using score normalization, performance improves to 1.3%. Due to practical limitations, we present system combinations without score normalization. Since our TalkPrinting features are not likely to be severely affected by channel variation, we believe that the gain in performance from score normalization will generalize to system combinations. The F0-only system (S2) yields 10.9% EER and is the second-best performing system. However, it provides only a small improvement after combination with the baseline. This can be attributed to a correlation between the F0 features and baseline cepstrum features. For the LM (S3) system, best results were obtained with unigrams over words occurring at least 300 times in the training data, for an EER of 12.6 on the full training set. The highest relative improvement is obtained using the durationbased system (S4), which provides the most complementary information. As mentioned earlier, these results significantly underestimate the absolute performance level of our systems on the complete 8-conversation task. After various 4
Note that they are from the suboptimal systems, please refer to end of this section for optimized results
“TalkPrinting”: Improving Speaker Recognition by Modeling Stylistic Features
353
system enhancements, the baseline EER is 2.0% without score normalization and 0.9% with score normalization. Further, the EER for the duration-based system is now 4.5%, and the EER for the LM-based system is now 8.5%. With a neural-network-based combiner, the overall EER reduces to 0.20%. In summary, although individual systems based on the high-level features never outperform the baseline system, they provide a significant reduction in error when combined with that system. Duration modeling provides particularly useful complementary information, dramatically improving performance.
Fig. 1. EER (%) for different test durations using the baseline system and its combination with the duration-based system.
5 Effect of Test Segment Duration

We also investigated the effect of test segment duration on performance, since for many applications more than a minute of data is likely to be available. We split our data into conditions using 30-, 60-, 120-, and 180-second test segments. Figure 1 shows the performance of the baseline system and various combined systems on these test sets. As can be seen, the baseline system does not take advantage of increasing test segment duration beyond approximately 120 seconds of speech⁵, whereas the TalkPrint system continues to improve as test segment duration increases. These results suggest that TalkPrinting can be particularly beneficial to applications that have access to longer speech samples.
⁵ Note: the average duration of the original tests is about 180 seconds.
6 Conclusions

We investigated three high-level features for speaker recognition based on F0, word usage patterns, and timing patterns. Results show that the new features provide significant complementary information to standard speaker recognition systems, resulting in significant performance improvements after score combination. Furthermore, results show that unlike the baseline system, TalkPrint systems continue to improve as more data is available in testing, making their relative contribution even greater at longer test durations. TalkPrint features could therefore be of significant value to future intelligence applications.
Acknowledgments. This work was funded by a KDD supplement to NSF IRI9619921. We thank Gary Kuhn for helpful discussion and technical suggestions.
References
1. Doddington, G.: “Some Experiments on Ideolectal Differences Among Speakers,” http://www.nist.gov/speech/tests/spk/2001/doc/ (2001).
2. Fukunaga, K.: “Statistical Pattern Recognition,” Academic Press, Indiana.
3. Gadde, V. R. R.: “Modeling Word Durations,” Proc. Intl. Conf. on Spoken Language Processing, Beijing (2000) 601–604.
4. Godfrey, J., Holliman, E., and McDaniel, J.: “SWITCHBOARD: Telephone speech corpus for research and development,” Proc. ICASSP (1992) 517–520.
5. NIST 2001, http://www.nist.gov/speech/tests/spk/2001/doc/2001-spkrec-evalplan-v05.9.ps
6. Reynolds, D.: “Speaker Identification and Verification Using Gaussian Mixture Speaker Models,” Speech Communication, Vol. 17, No. 1–2 (August 1995) 91–108.
7. Reynolds, D., et al.: “The SuperSID Project: Exploiting high-level Information for high-accuracy speaker recognition,” to appear in Proc. ICASSP, Hong Kong (2003).
8. Shriberg, E., Stolcke, A., Hakkani-Tur, D., and Tur, G.: “Prosody-Based Automatic Segmentation of Speech into Sentences and Topics,” Speech Communication, Vol. 32, No. 1–2 (2000) 127–154.
9. Sönmez, M. K., Heck, L., and Weintraub, M.: “Speaker Tracking and Detection with Multiple Speakers,” Proc. EUROSPEECH, Vol. 5, Budapest, Hungary (1999) 2219–2222.
Emergent Semantics from Users’ Browsing Paths

D.V. Sreenath¹, W.I. Grosky², and F. Fotouhi¹

¹ Department of Computer Science, Wayne State University, Detroit, MI 48202
² Department of Computer and Information Science, UofM Dearborn, MI 48128

Abstract. The authors of web pages are not aware of the ways in which their content is used, and innocent information published on the web can be used for malicious purposes. We argue that the authors of web pages cannot completely define a page’s semantics, and that semantics emerge through use. Our goal is to derive the emergent semantics of users’ browsing paths. Our research can be considered the reciprocal of a search engine: the problem is to derive semantics from the sequence of web pages traversed by a user. Using an iterative process, we derive the semantic breakpoints of long browsing paths. This identifies short sub-paths with coherent, uniform semantics. Using a variation of latent semantic analysis, we attempt to derive the high-level semantics of a user’s browsing pattern. With additional training data, an application of this research leads to terrorist trend detection.
1 Introduction
In a search engine, a user enters a query string and the engine retrieves a list of URLs that match the query, ordered by relevance. If it were a perfect search engine and we analyzed the returned list of URLs to derive its semantics, we should recover the user's query as that semantics. Our research can be considered the reciprocal of a search engine: to derive the semantics from the sequence of web pages traversed by a user. As with text search engines, our earlier research on feature-based techniques for the retrieval of multimedia information emphasized the notion of similarity with respect to low-level features [1]. The reciprocal of content-based image retrieval would be to derive high-level semantics from a collection of images. The derived semantics could be as simple as "nature lover", "archaeologist", "terrorist", or "pornographer". In [2], semantics is not an intrinsic property captured during the image filtering process, but an emergent property of the interaction of the user and the database: the semantics of an image can be extracted by interpreting the sequence of queries posed by the user. In our research, the initial assumption is that a web page does not have the fixed semantics intended by its author, but multiple semantics that vary over time. The browsing path of a user is broken into contiguous sub-paths. Semantic break points are identified along the user's browsing path where the semantics change appreciably, and the sequence of pages that exhibit similar semantics is clustered as a sub-path. Each element of a page's multiple semantics corresponds to the group of users who visit this particular page through similar
browsing paths. As different users visit a given page, its semantics changes. We attempt to derive the emergent semantics of the users’ browsing paths.
2 Our Approach
This research effort extends our earlier work on deriving emergent semantics, detailed in [3]. We use the technique of latent semantic analysis [4] to determine the semantics of user browsing paths. Using this technique, we represent each browsing path in a reduced-dimensional space, where each dimension corresponds to a concept, each concept representing a set of co-occurring keywords. We extract keywords from each web page in a browsing path to build a term-path matrix; the value stored in the matrix represents the number of times a term appears in a path. An element $x_{i,j}$, the $(i,j)$th element of the term-path matrix $M$, is determined by the strength of the presence of the $i$th keyword, $t_i$, along the $j$th browsing path, as well as how many times this path occurs in $M$. By singular value decomposition, any rectangular matrix $M$ can be decomposed into a product of three matrices: $M = U \Sigma V^T$. Defining $M_k = U_k \Sigma_k V_k^T$, $M_k$ provides the best rank-$k$ approximation of $M$. $M_k$ reveals the latent structure of $M$, combining linear combinations of keywords into new concepts that can be used to derive semantics. Each path, represented as a vector in the term-path matrix, can be visualized as a point in the reduced-dimensional space. The semantics of a web page $w$ can then be defined as the subset of the points in the reduced-dimensional space corresponding to the sub-paths that end at page $w$; the semantics of a user's browsing path is the collection of concepts represented by the semantics of the pages the user traverses. The application of our research to "terrorist trend detection" is to first choose or formulate a query page comprising terms that best represent a terrorist activity. Determining whether there is terrorist activity in users' browsing histories is then as simple as determining whether there are any matches to the query in the browsing history. We accomplish this by placing the query (also represented as a point) in the vector space and computing the distance between the query point and the set of all points in the vector space. We use the cosine distance measure in the reduced-rank vector space model of [5]: we compare a query vector $q$ to the columns of the approximation $M_k$ of the term-by-path matrix $M$. If we define $e_j$ to be the $j$th canonical vector of dimension $d$ (the $j$th column of the $d \times d$ identity matrix), the $j$th column of $M_k$ is given by $M_k e_j$. The cosines of the angles between the query vector $q$ and the approximate path vectors can then be computed as
$$\cos \theta_j = \frac{s_j^T \left(U_k^T q\right)}{\|s_j\|_2 \, \|q\|_2}, \qquad s_j = \Sigma_k V_k^T e_j, \qquad j = 1, 2, \ldots, d.$$
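A minimal numeric sketch of this machinery, written by us for illustration (the toy matrix, the choice of k, and the query vector are all invented; the real term-path matrix is 6959 x 68, as reported in the next section):

```python
import numpy as np

# Toy term-path matrix M: 6 terms x 4 browsing paths.
M = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.],
              [1., 1., 1., 1.],
              [0., 0., 0., 3.]])
k = 2  # reduced dimensionality (the paper does not state the k it used)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

def query_path_cosines(q):
    # Column j of S is s_j = Sigma_k V_k^T e_j.
    S = Sk @ Vtk
    numerators = S.T @ (Uk.T @ q)                       # s_j^T (U_k^T q)
    denominators = np.linalg.norm(S, axis=0) * np.linalg.norm(q)
    return numerators / denominators                    # cos(theta_j) per path

q = np.array([1., 1., 0., 0., 0., 0.])  # query built from a page's keywords
print(query_path_cosines(q))            # one cosine per browsing path
```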
3 Preliminary Results
The test data consisted of several paths for each of a variety of concepts, such as Java as a programming language (java.sun.com, ibm.com/java), a Java island tour (eastjava.com, baliawesome.com), java as a beverage (joejavacoffee.com,
e-barista.com), research in ammonium nitrate (cmu.edu, chem.ucalgary.ca), counter-terrorism incident response (llnl.gov), chemical weapons (chemistry.about.com), defense information on chemical weapons (cdi.org), etc. The data (matrix) consisted of 68 paths with 6959 unique terms; path lengths varied from 1 to 12. Using an authority web page as a query, we tried to find the matching paths in the vector space for 6 different concepts. Since these authority pages were popular pages with well-defined semantics, the matching paths should also exhibit similar semantics. Using http://ibm.com/developerworks/java/ as the page that best defines the Java programming language, the best cosine value of 0.57 (first match) detected the correct match for the path traversed by a Java programmer. Similarly, we were also able to distinguish the other two Java contexts (island, beverage) from the browsing paths; the query pages used were central-java-tourism.com and nettizen.com/franchise/aa-javadaves.htm, respectively. The best cosine values of 0.37 and 0.45, respectively, correctly identified the paths that matched those semantics. Although these values are not very high, they are clearly higher than those of the non-matching (bad) paths, which were below 0.20. We also observed that paths consisting of individual pages or extremely short sub-paths had lower cosine values than longer paths with uniform semantics. This is a promising result in support of our hypothesis. We were also able to distinguish a user browsing through a series of pages related to university chemistry research from another user browsing pages related to building chemical and biological weapons. Using www-chem.ucsd.edu/Faculty/bios/prather.html as the query clearly identified the browsing path of the user interested in chemicals from a research perspective, with a cosine value of 0.47. Using pages from chemistry.about.com and cdi.org, we were able to correctly identify the intent of the browser - to build a chemical weapon or to prepare for such a disaster - with cosine values of 0.82 and 0.54, respectively. In spite of the occurrence of common keywords across these URLs, we were able to distinguish the intents based on the emergent semantics of the browsing paths.
References
1. R. Zhao and W. I. Grosky, "Narrowing the semantic gap - improved text-based web document retrieval using visual features," IEEE Transactions on Multimedia, Vol. 4, pp. 189–200, 2002.
2. S. Santini, A. Gupta, and R. Jain, "Emergent semantics through interaction in image databases," IEEE Transactions on Knowledge and Data Engineering, Vol. 13, pp. 337–351, 2001.
3. W. I. Grosky, D. V. Sreenath, and F. Fotouhi, "Emergent semantics and the multimedia semantic web," ACM SIGMOD, 2002.
4. S. Deerwester, S. T. Dumais, G. W. Furnas, and T. K. Landauer, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Vol. 41, pp. 391–407, 1990.
5. M. W. Berry, Z. Drmac, and E. R. Jessup, "Matrices, vector spaces, and information retrieval," SIAM Review, Vol. 41, pp. 335–362, 1999.
Designing Agent99 Trainer: A Learner-Centered, Web-Based Training System for Deception Detection

Jinwei Cao, Janna M. Crews, Ming Lin, Judee Burgoon, and Jay F. Nunamaker

Center for the Management of Information, University of Arizona, Tucson, AZ 85721, USA
{jcao, jcrews, mlin, jburgoon, nunamaker}@cmi.arizona.edu
Abstract. Research has long recognized that humans have many biases and shortcomings that severely limit our ability to accurately detect deception. How can we improve our deception detection ability? One possible method is to train individuals to recognize cues of deception. To do this, we need to create effective training curricula and educational tools. This paper describes how we used existing research to guide the design and development of a Web-based, multimedia training system called Agent99 Trainer to provide effective deception detection training. The Agent99 Trainer system integrates explicit instruction on the cues of deception, detection experience through practice, and immediate feedback with anytime, anywhere Web access. Our initial experiments show that our training improves human deception detection accuracy and that the Agent99 Trainer system provides training as effective as instructor-led, lecture-based training.
1 Introduction
We are perpetually faced with assessing whether information is truthful or deceptive. Although some deception, such as "little white lies", may be relatively harmless if undetected, deception executed with malicious intent can cause great harm to individuals, organizations, or even nations if it goes undetected. Thus it is important that we are able to accurately detect deception. However, communication research has long recognized that humans have many biases and shortcomings that severely limit our ability to accurately detect deception, especially in novel circumstances. In general, although we think we are good at it, humans are poor at detecting deception and are correct just slightly more than 50 percent of the time [1], [2]. So, what is deception? And how do we become better detectors? Adopting Buller & Burgoon's definition [3], deception is "a message knowingly transmitted with the intent to foster false belief or conclusions" (p. 205). Although computer systems may provide assistance in identifying deception [4], humans must make the final judgment. Therefore, it is necessary and critical to train individuals to understand the nature of deception and how to detect it. To do this, we need to develop effective training curricula and training tools. With proper design, computer systems can be excellent training tools [5]. In response, we developed the Agent99 Trainer system to deliver deception detection training including instruction, practice, and feedback.
In this paper, we review previous research on deception detection training, describe our design and development of Agent99 Trainer, present our initial research results, and discuss future research plans.
2 Background
Detecting deception is hard. Most people do not do it well, yet believe they are proficient, unaware of their personal shortcomings and biases. A lack of real knowledge of unfailingly reliable cues and a strong reliance on a few "unauthenticated" cues contribute to the problem [6]. Research studies on deceptive communication continue to reveal more cues with greater reliability as indicators of deception, as well as indicating that humans can use these cues to successfully detect deception. Two meta-analyses of deception detection research summarize the findings on the known cues of deception and report the statistical reliability of these cues as indicators [7], [8]. In conjunction, these studies provide a good knowledge base for training people to improve their deception detection accuracy. Research shows that it is possible to train people to better detect deception; however, training does not always improve the human ability to detect deception [9]. Training methods are critical. Early research in this area indicates that observers can improve their ability to detect deception with a "practice" and "self-taught" strategy, in which (1) the observer evaluates the veracity of a communication after viewing a "normal", truthful communication of the subject/potential deceiver [10], [11], or (2) the observer evaluates the veracity of a communication and immediately receives outcome feedback (true or false) on the correctness of the judgment [12]. However, the improvement gained from this type of training is limited and is not generalizable, because observers only become better at detecting deception in the subjects they study. Furthermore, observers can "learn" erroneous cues during the self-taught learning process. If not corrected, this erroneous knowledge can be maintained or even strengthened, resulting in a negative training effect [13]; i.e., observers can internalize incorrect "knowledge", leading to lower detection accuracy after receiving training. To minimize erroneous "learning", researchers began adding explicit training on generalizable and reliable cues of deception to their training programs [14], [15], [16], [17]. A common training format included: (1) explicit instruction on some set of cues of deception, (2) practice judging the veracity of real communications, and (3) immediate outcome feedback on the judgments. Research indicates that although explicit instruction on cues can improve detection accuracy rates, the best performance is obtained when instruction is combined with practice followed by outcome feedback [17]. Therefore, explicit instruction, practice, and feedback are three critical components of effective deception detection training, and one of the primary challenges in designing such training is effectively incorporating all three. In addition, previous research on deception detection training has investigated only one type of training: instructor-led, lecture-based training in a classroom setting. The use of multimedia instructional technology in these training programs has been limited to showing communication videos. We believe that a Web-based, multimedia, learner-centered training system can be an invaluable instructional tool for deception detection training, providing many advantages over the
instructor-led, lecture-based format. The next section describes our design and development of such a training system; Agent99 Trainer provides well-structured instruction with supporting materials, practice by viewing real-life examples and scenarios, expert analysis as feedback, and self-paced, anytime-anywhere learning.
3 System Design and Development
The Agent99 Trainer is an adaptation and extension of a previously developed Web-based multimedia training system called LBA (Learning by Asking) [18]. LBA was designed as a general training tool that provides the time and space independence of Web-based technologies and the richer information channel of multimedia technologies. Most importantly for our purposes, however, LBA is a learner-centered training system. In contrast to traditional instructor-led, lecture-based, instructor-centered training, learner-centered training is inspired by constructivist learning theory, which emphasizes that knowledge construction does not result from the learner passively receiving information or instruction, but from the learner's active participation [19]. Learner-centered training is especially well suited to ill-defined problems [20]. Deception detection is considered an ill-defined problem because (1) a well-defined set of unfailing cues does not exist [6], and (2) a deep understanding of cues requires extensive experience and high levels of cognitive processing [12]. Therefore, as a learner-centered training system, LBA provided a good starting point for our design and development efforts. However, deception detection training has special requirements. The Frank and Feeley meta-analysis [21] on deception detection training identifies three critical components of such training: explicit instruction, practice, and feedback. LBA supports only one of these components, explicit instruction, with multimedia online lectures in the Watch Lecture module. Consequently, we designed and developed the View Example with Analysis module to provide practice judging the veracity of communications, as well as immediate feedback on those judgments. In addition, we adapted the Watch Lecture interface to integrate the two modules. These modules are implemented in Agent99 Trainer as follows. The Watch Lecture module provides explicit instruction on deception cues by capturing expert lectures on digital media. Multiple media sources (video of the expert lecture, presentation slides, and lecture notes) are synchronized in an integrated interface. Navigation buttons and pull-down menus allow users to switch to any topic, or any communication example associated with it, on demand. For this study, we videotaped a 34-minute lecture on the cues for detecting deception. The cues in the lecture were based on the authenticated cues reported in the two aforementioned meta-analyses of previous deception research [7], [8]. We focused the explicit instruction in the lecture on 5 behavioral categories of cues: arousal, emotion, cognitive effort, memory process, and communication tactics. The View Example with Analysis module provides users with the opportunity to practice analyzing real-life examples and scenarios. We extracted 21 examples from a series of studies on deceptive communication [22], [23], [24], and linked them to the relevant cues in the lecture. The examples are segments of interviews between pairs of participants. Although the interviews were conducted under experimental
conditions, they were real communications between the participants, in which the interviewee was instructed to answer particular questions truthfully and others deceptively. The interviewees also reported the veracity of their answers; this is how we determined which segments were deceptive and which were truthful. To help users develop a broader and deeper understanding of deception under varying task and communication conditions, different types of examples are used, including three media types (video, audio, and text) and two modalities (face-to-face conversation and NetMeeting™ chat). Furthermore, an expert analysis of the veracity of the communication and the related cues in each example provides the learner with immediate outcome feedback and helps users learn deception detection concepts, theories, and cues. The expert analyses were written by deception researchers and are provided in text format. Thus, Agent99 Trainer is a learner-centered training system that provides explicit instruction, practice, and feedback - all critical components of deception detection training. In addition, the Web-based system provides self-paced, anytime-anywhere instruction. Screenshots of the user interfaces of the two modules are provided below (Fig. 1 and Fig. 2).
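To make the module's structure concrete, the following sketch shows one way a practice example and its feedback loop could be represented. The field names and the feedback function are our own illustration; the paper does not describe Agent99 Trainer's internal data model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PracticeExample:
    media_type: str         # "video", "audio", or "text"
    modality: str           # "face-to-face" or "NetMeeting chat"
    deceptive: bool         # ground-truth veracity, as self-reported
    linked_cues: List[str]  # e.g., ["arousal", "cognitive effort"]
    expert_analysis: str    # written feedback shown after a judgment

def give_feedback(example: PracticeExample, judged_deceptive: bool) -> str:
    # Immediate outcome feedback plus the expert analysis, mirroring the
    # instruction-practice-feedback loop described above.
    outcome = "Correct" if judged_deceptive == example.deceptive else "Incorrect"
    return f"{outcome}. Expert analysis: {example.expert_analysis}"

ex = PracticeExample("video", "face-to-face", True,
                     ["arousal"], "Note the elevated pitch and long pauses.")
print(give_feedback(ex, judged_deceptive=False))
```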
4 Experiment
By addressing the special requirements of deception detection training, we believe that our design of the Agent99 Trainer should offer an effective training method. In addition, since the content of our lecture is guided by previous research findings, we expect the training content to improve the detection accuracy of learners. Therefore, we hypothesize that:
Fig. 1. The user interface of the Watch Lecture module
Fig. 2. The user interface of the View Examples module
H1: Learners receiving the training via an instructor-led, lecture-based training method will improve their deception detection accuracy more than if they do not receive training; i.e., our training content will provide a positive training effect; and
H2: Learners receiving training via Agent99 Trainer will improve their deception detection accuracy as much as, or more than, they would if they received training via an instructor-led, lecture-based training method.
To test our hypotheses, an experiment was conducted at a Research 1 university in the Southwest. The 28 participants were undergraduate students registered in summer classes offered by the Management Information Systems department. The experiment was a pretest-posttest comparison between two treatment groups: one group received Web-based training via Agent99 Trainer, while the other received traditional instructor-led, lecture-based training. Participants were randomly assigned to the two groups. During the two-hour experiment, each group first received a pretest, then participated in a one-hour training session, and finally received a posttest. In both the pretest and the posttest, subjects judged the veracity of six segments of human conversations. All test segments were extracted in the same manner, and from the same deception communication studies, as the examples provided in the training. The test and example segments were extracted from different interviewer-interviewee pairs to prevent bias and learning effects. The tests were constructed so that each test contains three deceptive and three truthful segments, as suggested by Frank and Feeley [21]; two segments of each media type (audio, video, and text); and two segments of each difficulty level (easy, medium, and difficult). The six segments are randomly ordered in each test. This method of construction controls for guessing [21] and media bias, and facilitates the equivalency of the two test forms. The results of an earlier pilot study demonstrated that the pretest and posttest
were statistically equivalent and that no practice effect between the pretest and posttest resulted. In the one-hour training session, the students in the Agent99 Trainer group were able to control the pace and sequence of their training, i.e. they were able to watch the 34-minute lecture in any sequence, and/or view the 21 examples as they preferred. To avoid instructor bias, the instructor of the instructor-led lecture-based training was the same instructor who appeared in the lecture video in the Agent99 Trainer. In addition, the same lecture and examples were presented in the classroom lecture and the Agent99 Trainer video lecture. Thus, in the classroom treatment, the instructor controlled the sequence of the lecture and examples for every student, while the Agent99 treatment gave each individual student that control. The results of this experiment are reported and discussed in the next section.
5 Results and Discussion
In the analysis, subjects' judgments were compared with the actual veracity, and the deception detection accuracy (percentage of correct judgments on the test) was calculated for both the pretest and the posttest. A 2 (TREATMENT: Agent99 or Lecture) x 2 (TIME: pretest or posttest) ANOVA with repeated measures on the TIME factor was conducted. Results revealed a significant main effect for TIME, F(1, 27) = 32.29, p < 0.001, eta square = 0.545. However, no significant main effect for TREATMENT and no significant interactions were found in this experiment. Therefore, statistically we can conclude that the training did improve detection accuracy, but we cannot conclude that the Agent99 group performed better than the lecture group, even though the detection accuracy of the Agent99 group improved slightly more than that of the lecture group (see Table 1).

Table 1. Detection accuracy means as a function of TREATMENT and TIME

TREATMENT    Pretest    Posttest    p-Value
Agent99      42.22 %    68.89 %     < .001*
Lecture      44.05 %    64.29 %     < .001*
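An analysis of this shape can be reproduced with standard statistical tooling. The sketch below assumes the pingouin package and a hypothetical CSV file with one row per subject per test; it is not the authors' analysis script.

```python
import pandas as pd
import pingouin as pg  # assumed available; any mixed-ANOVA routine would do

# Expected columns: subject, treatment ("Agent99"/"Lecture"),
# time ("pretest"/"posttest"), accuracy (proportion correct).
df = pd.read_csv("detection_scores.csv")  # hypothetical data file

aov = pg.mixed_anova(data=df, dv="accuracy", within="time",
                     subject="subject", between="treatment")
print(aov)  # inspect the TIME main effect, the TREATMENT effect,
            # and the interaction, as reported above
```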
As part of the first phase of evaluation of the Agent99 Trainer system, our results are encouraging and our hypotheses are supported.1 The good news is that the Agent99 Trainer provided learning effects as good as traditional instructor-led, lecture-based training, and it provides additional learning advantages. As a Web-based system, Agent99 Trainer training can occur anytime and anyplace, without requiring a human instructor. The not-so-good news is that Agent99 Trainer did not perform better than traditional instructor-led, lecture-based training as we anticipated, but we believe several factors may have contributed to this result. First, Agent99 Trainer was only partially implemented in this experiment, with limited practice examples. Second, we evaluated Agent99 Trainer in a controlled laboratory environment. To control for instructional time effects, access was limited to a one-hour session equivalent to the instruction time in the instructor-led, lecture-based training, yet one strong advantage of a Web-based training system is its capability of providing self-paced, repeatable training with unlimited access time. This benefit could not be realized in our experimental design. In the future we intend to implement the full functionality of Agent99 Trainer. In the second phase of evaluation, we will conduct a field study that allows students to use Agent99 Trainer with unlimited access time. We would also like to investigate the effectiveness of combining instructor-led, lecture-based training and Agent99 training, thereby achieving the advantages of both. Furthermore, we would like to explore how the different system functions affect the training effectiveness of the system, leading to a better understanding of design effects on deception detection training.

1 See [25] for another test of Agent99 Trainer.
References
1. Kraut, R.: Humans as Lie Detectors. Journal of Communication, Vol. 30. (1980) 209–216
2. Miller, G. R., Stiff, J. B.: Deceptive Communication. Sage Publications, Inc. (1993)
3. Buller, D. B., Burgoon, J. K.: Interpersonal Deception Theory. Communication Theory, Vol. 6. (1996) 203–242
4. Wang, G., Chen, H., Atabakhsh, H.: Automatically Detecting Deceptive Criminal Identities. Communications of the ACM (forthcoming)
5. Rosenberg, M. J.: E-Learning: Strategies for Delivering Knowledge in the Digital Age. McGraw-Hill (2000)
6. Levine, T. R., Park, H. S., McCornack, S. A.: Accuracy in Detecting Truths and Lies: Documenting the "Veracity Effect". Communication Monographs, Vol. 66. (1999) 125–144
7. Zuckerman, M., DePaulo, B. M., Rosenthal, R.: Verbal and Nonverbal Communication of Deception. In: Berkowitz, L. (ed.): Advances in Experimental Social Psychology, Vol. 14. Academic Press (1981) 1–59
8. DePaulo, B. M., Lindsay, J. J., Malone, B. E.: Cues to Deception. Psychological Bulletin, under review (2001) 1–138
9. Kassin, S. M., Fong, C. T.: "I'm Innocent!": Effects of Training on Judgments of Truth and Deception in the Interrogation Room. Law & Human Behavior, Vol. 23. (1999) 499–516
10. Brandt, D. R., Hocking, J. E., Miller, G. R.: Effects of Self-Monitoring and Familiarity on the Ability of Observers to Detect Deception. Communication Quarterly, Vol. 28. (1980) 3–10
11. Brandt, D. R., Hocking, J. E., Miller, G. R.: The Truth-Deception Attribution: Effects of Familiarity on the Ability of Observers to Detect Deception. Human Communication Research, Vol. 6. (1980) 99–110
12. Zuckerman, M., Koestner, R., Alton, O. A.: Learning to Detect Deception. Journal of Personality and Social Psychology, Vol. 46. (1984) 519–528
13. DePaulo, B. M., Pfeifer, R. L.: On-the-Job Experience and Skill at Detecting Deception. Journal of Applied Social Psychology, Vol. 16. (1986) 249–267
14. DeTurck, M. A., Harszlak, J. J., Bodhorn, D., Texter, L.: The Effects of Training Social Perceivers to Detect Deception from Behavioral Cues. Communication Quarterly, Vol. 38. (1990) 1–11
15. DeTurck, M. A.: Training Observers to Detect Spontaneous Deception: The Effects of Gender. Communication Reports, Vol. 4. (1991) 79–89
16. Fiedler, K., Walka, I.: Training Lie Detectors to Use Nonverbal Cues Instead of Global Heuristics. Human Communication Research, Vol. 20. (1993) 199–223
17. Vrij, A.: The Impact of Information and Setting on Detection of Deception by Police Detectives. Journal of Nonverbal Behavior, Vol. 18. (1994) 117–136
18. Zhang, D. S.: Virtual Mentor and Media Structuralization Theory. PhD dissertation, MIS Department, University of Arizona, Tucson, AZ (2002)
19. Phillips, D. C.: The Good, the Bad and the Ugly: The Many Faces of Constructivism. Educational Researcher, Vol. 24. (1995) 5–12
20. Ertmer, P. A., Newby, T. J.: Behaviorism, Cognitivism, Constructivism: Comparing Critical Features from an Instructional Design Perspective. Performance Improvement Quarterly, Vol. 6. (1993) 50–70
21. Frank, M. G., Feeley, T. H.: To Catch a Liar: Challenges for Research in Lie Detection Training. Journal of Applied Communication Research (2002)
22. Burgoon, J. K., Buller, D. B.: Interpersonal Deception: III. Effects of Deceit on Perceived Communication and Nonverbal Behavior Dynamics. Journal of Nonverbal Behavior, Vol. 18. (1994) 155–184
23. Burgoon, J. K., Buller, D. B., Ebesu, A., Rockwell, P., White, C.: Testing Interpersonal Deception Theory: Effects of Suspicion on Nonverbal Behavior and Relational Messages. Communication Theory, Vol. 6. (1996) 243–267
24. Burgoon, J. K., Buller, D. B., White, C. H., Afifi, W. A., Buslig, A. L. S.: The Role of Conversational Involvement in Deceptive Interpersonal Communication. Personality and Social Psychology Bulletin, Vol. 25. (1999) 669–685
25. George, J. F., Biros, D. P., Burgoon, J., Nunamaker, J. F., Jr.: Training Professionals to Detect Deception. NSF/NIJ Symposium on Intelligence and Security Informatics, Tucson, AZ (2003)
Training Professionals to Detect Deception

Joey F. George (1), David P. Biros (2), Judee K. Burgoon (3), and Jay F. Nunamaker, Jr. (3)

1 College of Business, Florida State University, Tallahassee, FL, USA
[email protected]
2 USAF Chief Information Office, Washington, DC, USA
[email protected]
3 Eller College of Management, University of Arizona, Tucson, AZ, USA
{jburgoon, nunamaker}@cmi.arizona.edu
Abstract. Humans are not very good at detecting deception in normal communication. One possible remedy for improving detection accuracy is to educate people about various indicators of deception and then train them to spot these indicators when they are used in normal communication. This paper reports on one such training effort involving over 100 military officers. Participants received training on deception detection generally, on specific indicators, and on heuristics. They completed pre- and post-tests on their knowledge in these areas and on their ability to detect deception. Detection accuracy was measured by asking participants to judge if behavior in a video, on an audiotape, or in a text passage was deceptive or honest. Trained individuals outperformed those who did not receive training on the knowledge tests, but there were no differences between the groups in detection accuracy.
1 Introduction
Although dramatic threats to security, such as viruses, get most of the press, simple deception can also be a serious threat. A common use of deception for breaching security is social engineering, a technique often used by infamous hacker Kevin Mitnick [7]. Social engineering occurs when one individual poses as someone else in order to get information merely by asking for it. This practice can also be used for other purposes, such as getting unsuspecting users to download malicious software, which, if installed, can be used for tasks like launching denial-of-service attacks from the user's machine (see, for example, http://www.cert.org/incident_notes/IN-2002-03.html). One reason that social engineering works as well as it does is that people are not very good at detecting deception. Deception detection is generally little more successful than chance; i.e., people are right only about half the time [3][8]. To try to improve these odds, researchers have worked to uncover reliable indicators of deception [1][11]. These include increased blinking, higher voice pitch, increased self-grooming, more passive statements, more negative statements, and more distancing of the storyteller from the story told. Other cues that have emerged more recently include pauses, response latencies, eye contact, and message duration [2][9]. The research question is whether training people to identify these cues helps improve their accuracy in detecting deception.
While training seems like the ideal way to improve detection accuracy, the findings from past research on the effectiveness of training are mixed [5]. The study described here was an attempt to develop and test a training program for deception detection for rank-and-file military members. The training provided a basic understanding of deception and specific information about cues for detecting deception. We present the study design and procedures, followed by a brief summary of findings related to accuracy in detecting deception, to illustrate how tests of such training systems can be conducted.
2 Study Design and Procedures
The study was conducted in fall 2002 at a large USAF facility in the US. A total of 125 officers participated as subjects, although the number participating per session varied. Subjects were randomly assigned to either the control group or to one of three treatment groups. For the first treatment, traditional lectures were used for all three training sessions. For the second treatment, lectures were used for the first and third sessions, but in the second session a specially created system called Agent99 was used exclusively. For the third treatment, lectures were used for the first and third sessions, but in the second session a combination of lecture and Agent99 was used. All lectures in all treatments were supported with Powerpoint presentations. The control group received no training, but control subjects completed the same measurement instruments as the experimental subjects. A pre-session was used to collect baseline data on all subjects in all four groups. The instructors were one doctoral student from a US business school and three USAF officers completing their master's degrees at the Air Force Institute of Technology. The training curriculum was developed jointly by the authors and their respective research teams. The basis for the curriculum was a set of three Powerpoint presentations, each on a different topic: deception detection generally, cues used to detect deception, and decision-making heuristics that are susceptible to deception. Each presentation was designed to last for one hour. The second lecture, on cues, also included deceptive communication examples used by the instructors to illustrate the cues. These examples were either text only, audio only, or video with audio. Most examples came from past studies of deception detection, consisting of experimental subjects trying to deceive their interviewers. Other examples were specifically created and recorded for this study. The lectures, delivered by the training instructors, were also videotaped. The instructors pilot tested all training materials, including Agent99, weeks before the study began. Agent99 is a multifaceted information system, but the only parts we used were the ones designed to support training. (For details on the system and its development, see [6].) Agent99 allows users to access training materials in any format, in whatever order they choose, through a web browser interface. The version tested had only some functionalities operational and was used for the cues session only. It was populated with the same Powerpoint presentation and examples used in the traditional lecture format. In addition, the system included the videotaped delivery of the cues lecture. For the treatment that used a combination of lecture and Agent99, subjects began the session with a live lecture and then were allowed to view examples
using Agent99. Great care was taken to ensure that subjects in all three treatments had access to exactly the same content for the cues material. The basic procedures for the training sessions were as follows. Subjects reported to a classroom at the USAF facility. They began by completing a battery of instruments, including a knowledge pre-test and a deception detection accuracy pre-test. All data were collected on-line. The knowledge pre-test consisted of 12 multiple-choice questions derived from the day's training content. The deception accuracy pre-test consisted of 6 examples, two each in text, audio, and audio-video formats. Half were deceptive and half were not. Subjects were asked to judge whether each example was deceptive. Subjects were then trained, except for control subjects, who were given a one-hour break. Afterwards, all subjects completed a knowledge post-test, made up of the same questions as the pre-test but in a different order, and a deception detection accuracy post-test, similar to the pre-test but consisting of different examples. Subjects then completed additional instruments and were dismissed.
3 Findings
Table 1 provides the results of the knowledge tests for the control group and the combined treatment groups for all three sessions. Each knowledge test had 12 questions, and results reported indicate the number of questions answered correctly. Table 2 provides the results of the deception detection accuracy tests for the control group and the combined treatment groups for all three sessions. Each accuracy test had six examples, and the results reflect the number of examples evaluated correctly.
Table 1. Means and standard deviations (in parentheses) for knowledge pre-tests and post-tests

             Control (N = 29)            Treatments (N = 86)
             Pre-test      Post-test     Pre-test      Post-test
General      5.07 (1.60)   5.14 (1.64)   5.55 (1.69)   8.81 (1.64)
Cues         4.07 (1.49)   4.38 (1.59)   5.57 (1.63)   7.73 (2.30)
Heuristics   5.41 (2.23)   4.93 (2.30)   6.01 (1.97)   8.74 (2.07)

Table 2. Means and standard deviations (in parentheses) for accuracy pre-tests and post-tests

             Control (N = 29)            Treatments (N = 85)
             Pre-test      Post-test     Pre-test      Post-test
General      2.89 (1.11)   3.72 (1.25)   3.11 (1.29)   3.67 (1.01)
Cues         4.10 (1.23)   2.97 (0.78)   4.39 (1.11)   3.65 (0.86)
Heuristics   3.86 (1.22)   3.38 (1.37)   3.72 (1.14)   3.39 (1.02)
For the knowledge tests, performance was measured by taking the difference between pre-test and post-test scores within each session. Independent t-tests showed that the
treatment groups differed from the control group in all three sessions (general: t(113) = -8.921, p < .001; cues: t(113) = -4.54, p < .001; heuristics: t(113) = -7.536, p < .001). For each session, the control group did not improve, while the training groups did. It is worth noting that, for the second session, there were no differences among the treatment groups, indicating that Agent99 by itself, or in combination with a lecture, was capable of delivering the same material as a traditional lecture alone. For the detection accuracy tests, performance was also measured by taking the difference between pre-test and post-test scores within each session. There were no statistically significant differences between the treatment groups and the control group on deception detection accuracy. There were also no differences among the treatment groups, indicating that those using Agent99 alone or with a lecture did just as well in deception detection as subjects receiving only the lecture. The general trend, although not statistically significant, is toward an improvement in deception detection accuracy for the control group (an improvement of 0.48 between the pre-test of the first session and the post-test of the last session) and for the combined treatment groups (an improvement of 0.28 for the same comparison). If all subjects are combined, the average improvement in deception detection performance between the pre-test of the first session and the post-test of the last session is 0.333, which is statistically significant (t(113) = 2.048, p = .043) (see Fig. 1).
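The per-session comparison amounts to an independent t-test on pre-to-post difference scores. The sketch below, on synthetic stand-in data, shows the computation; note that the degrees of freedom, 29 + 86 - 2 = 113, match those reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic 12-question knowledge scores (not the study's data).
control_pre = rng.integers(2, 9, 29)
control_post = control_pre + rng.integers(-1, 2, 29)  # little change
treat_pre = rng.integers(2, 9, 86)
treat_post = treat_pre + rng.integers(2, 5, 86)       # clear gains

t, p = stats.ttest_ind(control_post - control_pre, treat_post - treat_pre)
df = 29 + 86 - 2
print(f"t({df}) = {t:.3f}, p = {p:.3g}")
```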
4 Discussion
Treatment groups improved their understanding of deception, vis-à-vis the control group, as shown by their knowledge test performance, but there were no differences between the treatment groups and the control group in deception detection accuracy. Overall, subjects participating in the study improved their detection accuracy. It may be that mere exposure to the accuracy tests improved performance, possibly by heightening subjects' lie bias [4]. The lack of differences among treatment conditions at least suggests that computer-based tools can deliver relevant training material without a human instructor delivering all of the lecture content. The subtleties exposed by comparisons made within and between groups, and on both knowledge and judgment tests, illustrate the value of the training design employed in this investigation. Tests that lack pre- to post-test comparisons, control groups, or both types of knowledge gains (cognitive and judgmental) may fail to adequately discern what a given training curriculum and tool provide.
Fig. 1. Deception detection accuracy scores for control vs. combined treatment groups and for the combined sample. (The original chart plotted the number of correct judgment-test answers at the pre- and post-tests of the 9/17, 10/1, and 10/15 sessions.)
References
1. Buller, D. B., & Burgoon, J. K. (1994). Deception: Strategic and nonstrategic communication. In J. A. Daly & J. M. Wiemann (Eds.), Strategic interpersonal communication (pp. 191–223). Hillsdale, NJ: Erlbaum.
2. deTurck, M. A., & Miller, G. R. (1985). Deception and arousal: Isolating the behavioral correlates of deception. Human Communication Research, 12, 181–201.
3. Feeley, T. H., & deTurck, M. A. (1995). Global cue usage in behavioral lie detection. Communication Quarterly, 43, 420–430.
4. Feeley, T. H., & Young, M. J. (1998). Humans as lie detectors: Some more second thoughts. Communication Quarterly, 46(2).
5. Frank, M., & Feeley, T. H. (in press). To catch a liar: Challenges for research in lie detection training. Journal of Applied Communication Research.
6. Lin, M., Cao, J., Crews, J., Burgoon, J., & Nunamaker, J. F., Jr. (2003). AGENT99 trainer: A web-based multimedia training system for deception detection. Working paper, CMI, University of Arizona, Tucson, AZ.
7. Littman, J. (1996). The Fugitive Game. Boston: Little, Brown & Co.
8. Miller, G. R., & Stiff, J. B. (1993). Deceptive communication. Newbury Park, CA: Sage Publications, Inc.
9. Vrij, A. (2000). Detecting lies and deceit: The psychology of lying and its implications for professional practice. Chichester: John Wiley and Sons.
10. Zhang, D., Zhao, J. L., Zhou, L., & Nunamaker, J. F. Can e-learning replace traditional classroom learning? Evidence and implication of the evolving e-learning technology. Communications of the ACM.
11. Zuckerman, M., & Driver, R. E. (1985). Telling lies: Verbal and nonverbal correlates of deception. In A. W. Siegman & S. Feldstein (Eds.), Nonverbal communication: An integrated perspective (pp. 129–147). Hillsdale, NJ: L. Erlbaum.
An E-mail Monitoring System for Detecting Outflow of Confidential Documents

Bogju Lee and Youna Park

Department of Computer Engineering, Dankook University, Korea
{blee, ypark}@dankook.ac.kr
Abstract. E-mail is a widely used communication tool, valued for its convenience and efficiency. In spite of its usefulness, it is difficult to control, in that e-mail is easily used as an outflow path for confidential documents in an organization. In order to detect and prevent such outflows, e-mail monitoring is widely used. We propose a system that detects, in real time, the outflow of documents to be protected. It is based on automatic text categorization and machine learning techniques. Experimental results show the high accuracy and efficiency of the method.
1 Introduction
As the internet grows rapidly and spreads widely, the web and e-mail make information exchange faster than ever. E-mail in particular is one of the most important information exchange tools because of its quickness and efficiency. But e-mail is difficult to control. As information systems have expanded, most corporate documents have been digitized, and digitized confidential documents can easily flow out of an organization through e-mail. According to a computer security-related survey, the loss caused by internal outflows in a company is nine times greater than that caused by hacking. Once confidential information flows out of a company, years of work and huge investments can become worthless. More than fifty percent of the leading companies in the United States monitor e-mail to prevent this problem; the number of such companies in Korea is also increasing, and e-mail monitoring systems are becoming common. The simplest type of e-mail monitoring system is one in which humans inspect the e-mails manually: "monitoring supervisors" look through the e-mails and decide whether they contain any confidential information. Some systems add a filtering function that passes only large e-mails to the supervisors. Obviously, this wastes considerable human resources and time. Moreover, determining the "confidentiality" of a document usually involves relatively senior managers, which makes manual e-mail monitoring even more costly. Another type of filtering function is based on a user-provided keyword list; such a system is better than the simplest type, but its accuracy is usually poor. Many of today's e-mail monitoring systems employ a technique called "sniffing", that is, reconstructing
e-mail from multiple packets at the network level. This technique enables detection even of e-mails that do not go through the company's mail server (e.g., web mail). In this paper, we propose an e-mail monitoring system that detects the outflow of confidential documents with high accuracy. The system employs the sniffing technique for a wider range of use. It needs little human intervention, owing to its ability to decide confidentiality automatically. We used one of the automatic text classification (ATC) techniques from the machine learning and information retrieval communities. Existing ATC algorithms include latent semantic indexing [1], naive Bayes [2], neural networks [3], the k-nearest neighbor method [2], decision trees [4], inductive rule learning [5], the Rocchio method [1, 6], and support vector machines (SVMs) [2]. Given existing confidential documents and general documents together, the system learns what constitutes confidentiality for the company, producing a model that represents it. During the classifying phase, the model enables the system to distinguish e-mails containing confidential information.
2 Our Approach
Among the ATC methods mentioned above, we used the SVM for its proven accuracy and efficiency. The SVM learns how to classify objects into two classes given positive and negative examples. Confidential documents collected from inside the company obviously become the positive examples. One difficulty in this problem is deciding which documents to use as negative examples. The approach taken in this research is to use non-confidential general documents from inside the company as negative examples. In order for the classifier to pass regular e-mails that do not contain any confidential information, we also include arbitrary documents randomly collected from the web as negative examples. We follow the general ATC procedure to process the documents: term extraction, feature selection/reduction, and classifier production. Stemming is not performed during feature selection. For term weighting, term frequencies (TF) normalized by document length are used. The whole process of our system includes training and learning, sniffing, and classifying. When classified, the body and attached files of an e-mail are each treated as a separate document; if confidential information is detected in even one of these documents, the whole e-mail is regarded as containing confidential information. Words in the subject line are simply merged into the body. The file types that can be processed by the system include Korean Hangul, Microsoft Word, Excel, and PowerPoint.
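A minimal sketch of this pipeline, using scikit-learn as a stand-in for the authors' implementation (the documents below are invented placeholders; setting use_idf=False and norm="l1" in TfidfVectorizer yields term frequencies normalized by document length, approximating the weighting described above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training documents; a real deployment would use the
# company's confidential and general documents plus crawled web pages.
confidential = ["quarterly merger plan draft", "prototype circuit schematics"]
general = ["cafeteria menu for friday", "parking lot repaving notice"]
web = ["weather forecast sunny skies", "sports scores from last night"]

X = confidential + general + web
y = [1] * len(confidential) + [0] * (len(general) + len(web))

model = make_pipeline(
    TfidfVectorizer(use_idf=False, norm="l1"),  # length-normalized TF, no stemming
    LinearSVC(),
)
model.fit(X, y)

def email_is_confidential(parts):
    # The body and each attachment are classified separately; a single
    # positive detection flags the whole e-mail as confidential.
    return any(model.predict([p])[0] == 1 for p in parts)

print(email_is_confidential(["please review the merger plan draft", "see you"]))
```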
3 Experimental Results
The companies chosen for the experiment include a telecommunication company, a financial company, a medicine company, and an electronics company. Table 1 shows the experimental results for the telecommunication company for various sizes
of the training set. As the training set size increases, performance tends to improve gradually. This is expected: with a larger training set, the learner can learn the model better.
Table 1. Error rate and processing speed in the case of the telecommunication company
# non-confidential docs.   # confidential docs.   Sum    CPU seconds   Error rate   Recall    Precision
98                         78                     176    0.19          1.70%        97.96%    98.97%
388                        279                    667    0.98          1.35%        98.45%    99.22%
599                        579                    1178   2.01          1.19%        98.50%    99.16%
792                        879                    1671   2.94          0.82%        98.85%    99.62%
1002                       1068                   2070   3.24          0.63%        99.20%    99.50%
(The first three columns describe the training set; error rate, recall, and precision are leave-one-out estimates.)
Table 2 shows the performance for the various companies. As indicated, performance does not vary much across domains and contexts.
Table 2. Error rate and processing speed for the various companies

Companies           # non-confidential docs.   # confidential docs.   Sum   CPU seconds   Error rate   Recall    Precision
Telecommunication   98                         78                     176   0.19          1.70%        97.96%    98.97%
Finance             122                        104                    226   0.36          1.77%        97.54%    99.17%
Medicine            53                         47                     100   0.25          4.00%        96.23%    96.23%
Electronics         30                         28                     58    0.12          3.33%        96.43%    96.43%
(Error rate, recall, and precision are leave-one-out estimates.)
4 Conclusions
In this paper, an efficient e-mail monitoring system that detects the outflow of confidential documents is proposed. The system judges the confidentiality of an e-mail in real time without human involvement. It is also possible to monitor all e-mail data at the network level via the sniffing technique. The entire process is automatically controlled, and maintenance is easy. Monitored e-mail data are analyzed and reported to the administrator in a useful visual form. This strengthens the security of the company.
Acknowledgements. The present research was supported by the research fund of Dankook University in 2003.
References
1. Hull, D.: Improving Text Retrieval for the Routing Problem Using Latent Semantic Indexing. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1994)
2. Joachims, T.: Text Categorization with Support Vector Machines – Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning, Springer (1998)
3. Weiner, E., Pedersen, J., and Weigend, A.: A Neural Network Approach to Topic Spotting. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (1995)
4. Cohen, W. and Singer, Y.: Context-Sensitive Learning Methods for Text Categorization. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1996)
5. Apte, C., Damerau, F., and Weiss, S.: Towards Language Independent Automated Learning of Text Categorization Models. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1994)
6. Ittner, D., Lewis, D., and Ahn, D.: Text Categorization of Low Quality Images. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (1995)
Intelligence and Security Informatics: An Information Economics Perspective

Lihui Lin (1), Xianjun Geng (2), and Andrew B. Whinston (3)

1 School of Management, Boston University, Boston, MA 02215
[email protected]
2 Center for Research in Electronic Commerce, University of Texas, Austin, TX 78712
[email protected]
3 Center for Research in Electronic Commerce, University of Texas, Austin, TX 78712
[email protected]

Abstract. An intelligence and security information system must meet the challenge of processing vast volumes of information that come from diverse sources with distinctive incentives and credibility, and of providing actionable intelligence based on the processed information. We focus on the problem of analyzing the incentives and credibility of information sources and propose an information economics perspective for investigating the incentives of information providers in the intelligence and security domain.
1 Introduction
Timely and accurate analysis of information about terrorists and their activities is the basis for preventive and protective actions to ensure national security. A new science, "intelligence and security informatics", is emerging through efforts from academia, government agencies, law enforcement communities, and industry. The challenge of analyzing information on possible threats to homeland security lies not only in the sheer information overload, but also in the convolutions in the information. In this early phase of the development of the science of intelligence and security informatics, it is important to understand the problems that the science needs to solve and that define the agenda of this new discipline. We identify three functions that an intelligence and security information system should have:
1. Process vast volumes of information;
2. Process and analyze information from diverse sources with distinctive incentives and credibility;
3. Provide actionable intelligence, i.e., suggest preemptive actions based on the analysis of information.
In this research, we focus on the second function and propose an information economics perspective for analyzing the incentives of information senders.
2 Incentives and Credibility of Information Sources
Processing tremendous amounts of information is not unique to security information systems. In the natural sciences, a new family of disciplines, including bioinformatics, cheminformatics, medical informatics, etc., employs mathematical algorithms and computing methods to solve problems that entail enormous data. However, data in the natural sciences are fundamentally different from data in the social sciences. Although the data may contain noise (and they always do), the credibility of the source is not an issue in the natural sciences. One feature that sets intelligence and security informatics apart from other fields of study is the credibility of the information to be processed. To determine the credibility of information, we need to examine the incentives of the information provider. In the social sciences, unlike in the natural sciences, the incentive of the provider of the information is central to the process of drawing inferences from it. In the knowledge management literature, by contrast, which mainly addresses knowledge sharing within an organization, the incentives of participants can be assumed to be perfectly aligned. While this is also the case when government agencies, the private sector, and academic researchers coordinate efforts to enhance homeland security, one issue remains unsolved: the initial screening of a tremendous amount of information, which may come from a credible source such as an intelligence agent, or from terrorists themselves trying to mislead by distracting our attention or causing actions to be taken that benefit the terrorists. Information economics, one of the most active research areas of economics, with significant impacts on industries and policies over the past thirty years, studies people's incentives and incentive compatibility in providing information. It started with Akerlof's seminal paper [1] on the used car market, and later developed models of signaling [7] and screening [6] that found wide applications in public finance, labor markets, the finance sector, industrial organization, etc. [5]. It can be generalized into a sender-receiver model [2][4]. Geng, Lin, and Whinston [3] develop a sender-receiver framework for knowledge transfer. However, the techniques of information economics cannot be readily applied to intelligence and security; they must be re-examined and adapted to fit the unique needs of intelligence and security informatics. In an information economics model, the basic incentive of the sender of information is homogeneous. For example, in Spence's labor market model [7], workers always have an incentive to overstate their abilities, although each individual worker has his or her own level of ability and thus chooses a specific education level that serves as a signal to the employer. This is a typical signaling model. Central to the signaling framework is the assumption that the sender possesses some knowledge and has private information about that knowledge. A receiver can only learn about the knowledge by observing a signal from the sender. Whenever there are discrepancies in incentives, the sender may strategically choose a signal that misleads the receiver and in turn serves the sender's own interest. The signaling framework is illustrated in the following information collection example. Assume that a terrorist group plans to initiate an attack at a certain location.
We want to decipher the location by analyzing the collection of intelligence reports on recent suspicious activities; this is effective if the terrorists are indeed organizing and preparing for the attack at a certain location. However, once the terrorists are aware that their activities may reveal their targeted location, they may deliberately instigate
other activities in irrelevant locations to attract the attention of government authorities. As a result, we cannot simply assume that observed, concentrated suspicious activities in a certain location necessarily imply that this location is more likely to be attacked.

There is another family of models in which a sender does not possess private information about the knowledge that a receiver wants. This does not imply, however, that the sender can never mislead the receiver's perception of the knowledge. In this case the sender's behavior is called "signal-jamming": a sender who does not possess any private information can nevertheless try to distort, or "jam," the signal the receiver observes. Recall that in the terrorist attack example above, the terrorists know what information we want to collect. Now consider another example in which we want to rescue a stranded pilot whose location is unknown but can be traced by intercepting the radio signals he sends out. Even if the terrorists do not know his location either, they may send out similar radio signals from various locations to distract the rescuers' attention. In other words, fake radio signals jam the search process, and may slow down the search and lower the survival chances of the stranded pilot.

The complexity in analyzing information on security issues is that no single type of model can be used to process all the information. In economic models, one group of agents, such as the senders of information, share the same profile: their incentives and whether or not they possess private information. While such assumptions are valid in examining an economic problem, a model that analyzes security information cannot assume the information sources are homogeneous in nature. In particular, information sources may have the following distinct profiles:

1. Senders with private information and interests fully aligned with the receiver's;
2. Senders with private information but interests opposed to the receiver's;
3. Senders with private information and incentives that are neither fully aligned nor fully misaligned with the receiver's (somewhere between the two extreme cases);
4. Senders without private information who try to jam the receiver's signal in order to achieve their own objectives.

An adequate model of information on potential threats must incorporate the heterogeneity of the senders, and a solution should allow the receiver to discern among senders of different natures as well as to evaluate the information itself. No model of such a situation is yet in place, perhaps because there has not been such a need before. A theory to describe and solve such complex and convoluted issues should be developed as a cornerstone of intelligence and security informatics.

Last but certainly not least, intelligence and security informatics must meet the challenge of actionable intelligence. A security informatics model must be able to suggest preemptive actions based purely on the analysis of information.
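To make the role of sender heterogeneity concrete, the following minimal sketch, which is our illustration rather than part of the model developed in this paper, shows a receiver applying Bayes' rule across the four sender profiles; all priors and likelihoods are invented for the example.

```python
# Illustrative only: a receiver's Bayesian update over the four sender
# profiles after observing a possibly misleading signal. All numbers
# below are assumptions made up for this sketch.

PROFILES = ["aligned", "opposed", "mixed", "jammer"]

# Assumed prior probability that a report comes from each profile.
prior = {"aligned": 0.40, "opposed": 0.25, "mixed": 0.20, "jammer": 0.15}

# Assumed likelihood that each profile reports "concentrated activity at
# location X" when X is NOT the true target (i.e., a misleading signal).
misleading_rate = {"aligned": 0.05, "opposed": 0.70,
                   "mixed": 0.35, "jammer": 0.80}

def posterior(prior, likelihood):
    """Bayes' rule: P(profile | misleading signal) for each profile."""
    evidence = sum(prior[p] * likelihood[p] for p in prior)
    return {p: prior[p] * likelihood[p] / evidence for p in prior}

post = posterior(prior, misleading_rate)
for p in PROFILES:
    print(f"{p:8s} prior={prior[p]:.2f} -> posterior={post[p]:.2f}")
```

Under these assumed numbers, observing a misleading signal shifts belief sharply toward the opposed and jamming profiles, which is precisely the kind of discernment a receiver-side model must support.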
3 Conclusion

In the current knowledge management literature, knowledge is observable and verifiable. In identifying security threats and potential attacks, however, the sole purpose is to prevent the identified event from happening; thus actions must be taken before the event is observed or verified. Research on security information systems should therefore
be built on the assumption that the usefulness of knowledge is non-verifiable. Geng, Lin and Whinston [3] develop a knowledge transfer model in which the usefulness of knowledge is unobservable and the inference process depends solely on signals, but future research is needed to solve the unique problems in security informatics. In sum, we believe that a theoretical foundation needs to be developed to analyze large amounts of information of varying quality from diverse sources with fundamentally different incentives, and ultimately to indicate the actions that must be taken to protect our society.
References

1. Akerlof, G., 1970, "The Market for Lemons: Quality Uncertainty and the Market Mechanism", Quarterly Journal of Economics 84: 488–500
2. Crawford, V.P., Sobel, J., 1982, "Strategic Information Transmission", Econometrica, Vol. 50, No. 6, pp. 1431–1451
3. Geng, X., Lin, L., Whinston, A.B., 2003, "A Sender-Receiver Framework for Knowledge Transfer", presented at the Minnesota Symposium on Knowledge Management, March 2003
4. Green, J., Stokey, N., 1980, "A Two-Person Game of Information Transmission", Harvard Institute of Economic Research Discussion Paper No. 751, Harvard University
5. Riley, J.G., 2001, "Silver Signals: Twenty-Five Years of Screening and Signaling", Journal of Economic Literature, Vol. 39, pp. 432–478
6. Rothschild, M., Stiglitz, J.E., 1976, "Equilibrium in Competitive Insurance Markets: An Essay in the Economics of Imperfect Information", Quarterly Journal of Economics, 90, 629–649
7. Spence, A.M., 1973, "Job Market Signaling", Quarterly Journal of Economics, 87, 355–374
An International Perspective on Fighting Cybercrime

Weiping Chang1, Wingyan Chung2, Hsinchun Chen2, and Shihchieh Chou1

1 Department of Information Management, National Central University, Chung-Li, Taiwan 32054
{wpchang, scchou}@mgt.ncu.edu.tw
2 Artificial Intelligence Lab, Department of Management Information Systems, University of Arizona, Tucson, AZ 85721, USA
{wchung, hchen}@eller.arizona.edu
Abstract. Cybercrime is becoming ever more serious. Findings from the 2002 Computer Crime and Security Survey show an upward trend that demonstrates a need for a timely review of existing approaches to fighting this new phenomenon of the information age. In this paper, we provide an overview of cybercrime and present an international perspective on fighting cybercrime. We review the current status of fighting cybercrime in different countries, which rely on legal, organizational, and technological approaches, and we recommend four directions for governments, lawmakers, intelligence and law enforcement agencies, and researchers to combat cybercrime.
1 Introduction

As the Internet becomes a part of our daily lives, criminals are increasingly using it to conduct cybercrime. According to the 2002 Computer Crime and Security Survey conducted by the Computer Security Institute and the U.S. Federal Bureau of Investigation, the threat of cybercrime and other information security breaches continues unabated and the financial toll is mounting [7]. The findings show that 90% of respondents detected computer security breaches within the past twelve months, and total annual financial losses in 2002 reached a record high of $455,848,000. The problem of cybercrime is so serious that many countries are actively fighting it. In this paper, we provide an overview of cybercrime and present an international perspective on fighting it.
2 An Overview of Cybercrime

What is cybercrime? Most researchers agree that it is any illegal activity conducted through computers, but some disagree on where cybercrime takes place. Parker considers an information system (which may not be computerized) to be the channel through which cybercrime is committed [4]. In contrast, Philippsohn views cybercrime as appearing mainly on
the Internet [5]. In this paper, we define "cybercrime" as illegal computer-mediated activities that often take place in global electronic networks. Because existing laws in many countries are not tailored to deal with cybercrime, criminals increasingly conduct crimes on the Internet to take advantage of less severe punishments or the difficulty of being traced. Table 1 shows different definitions and types of cybercrime.

Table 1. An overview of cybercrime
Definitions of cybercrime:
- Thomas & Loader, 2000 [11]: Illegal computer-mediated activities which can be conducted through global electronic networks.
- Richards, 1999 [9]: The illegitimate use of computers to conduct criminal activities.
- Parker, 1998 [4]: Encompasses any abuse and misuse of information that entails using knowledge of information systems.
- Philippsohn, 2001 [5]: Criminal activities conducted through the Internet.
- Power, 2000 [8]: The intentional access of a computer without authorization, or by exceeding authorization, thereby obtaining information to which the person is not entitled.

Categories of cybercrime, with examples:
- Internet fraud: advance fee fraud, credit card fraud, fraudulent Internet banking
- Computer hacking/network intrusion: hacking for political reasons, hacking for personal reasons, email spamming or sending junk/hostile emails
- Cyber piracy: software piracy, piracy of music or movies
- Spreading of malicious code: spreading of viruses or Trojan horses, spreading of other malicious code
- Others: identity theft, electronic property theft, money laundering, cyber-pornography, etc.
3 Fighting Cybercrime in Different Countries

Owing to the worldwide impact of cybercrime, various countries are using legal, organizational, and technological approaches to fight it. The legal approach aims to restrict cybercrime through legislation. The organizational approach aims to enforce laws, promote cooperation, and educate the public through the establishment of dedicated organizations. The technological approach aims to increase the
effectiveness and efficiency of cybercrime analysis and investigation with the help of new technologies. In the following, we review the current status of fighting cybercrime in countries that use within-country and across-country (collaborative) strategies.

3.1 Within-Country Strategy

The United States. To protect the interests of Internet businesses, the U.S. Congress has created new laws to regulate activities on the Internet. With the first digital signature law in the world, the U.S. has established a number of regulations on cybercrime, such as the "National Infrastructure Protection Act of 1996", the "Cyberspace Electronic Security Act of 1999", and the "Patriot Act of 2001" [2, 3]. In addition, a number of agencies have been set up in the U.S. to fight cybercrime, including the FBI, the National Infrastructure Protection Center, the National White Collar Crime Center, the Internet Fraud Complaint Center, the Computer Crime and Intellectual Property Section of the Department of Justice (DoJ), the Computer Hacking and Intellectual Property Unit of the DoJ, and so on. Furthermore, the U.S. government invests more resources in fighting cybercrime than any other country. The FBI has set up special technical units and developed Carnivore, a computer surveillance system that can intercept all packets sent to and from the ISP where it is installed, to assist in the investigation of cybercrime [10].

England. Two cybercrime-related acts have been passed by the British parliament: the Data Protection Act of 1984 and the Computer Misuse Act of 1990. The former deals with the actual procurement and use of personal data, while the latter defines the laws, procedures, and penalties surrounding unauthorized entry into computers. In addition, the British parliament passed the Regulation of Investigatory Powers (RIP) Bill in July 2000 and set up the Government Technical Assistance Center (GTAC). The bill gives law enforcement agencies the authority to obtain passwords to encrypted messages and to access computer files and email for investigative purposes. Internet service providers (ISPs) in England have to track all data traffic passing through their computers and route it to GTAC. In terms of technological initiatives, the National Criminal Intelligence Service has established a unit dedicated to decrypting criminal material. In addition, the British government has applied filtering and rating technologies to protect minors from inappropriate material on the Web [6].

Canada. In 2001, the Canadian parliament passed the Criminal Law Amendment Act, which has two sections. The first section defines unlawful entry into a computer system and interception of transmissions. The second section criminalizes the actual destruction, alteration, or interruption of data. A major organization dedicated to fighting cybercrime is the Canadian Police's Information Technology Security Branch, which consists of the Security Evaluation and Inspection Team, the Computer Investigative Support Unit, and the Counter Technical Intrusion Unit. The Canadian police have also introduced Internet training specifically designed for police officers.
Australia. In 2001, the Australian parliament passed the Australian Cybercrime Act, which defines a number of behaviors, such as unauthorized access to computers and unauthorized modification of data, as illegal. Also, Australia's Internet Industry Association released a draft Cybercrime Code mandating ISPs to keep data for six to twelve months to facilitate cybercrime investigation. In terms of technological initiatives, the Australian National Police Research Unit approved the establishment of a Computer Investigation Techniques (CIT) program with the objectives of developing investigative tools, disseminating information, and training police investigators.

Taiwan. The Taiwan government amended ten articles of the Criminal Law in 1997 to tackle cybercrime and has recently been adding a specific provision to the Criminal Law. The law enforcement agencies established to carry out cybercrime-related laws include the Cybercrime Prevention and Fighting Center of the Investigation Bureau in the Ministry of Justice, the Telecommunication Police Squad of the Directorate General of Telecommunication in the Ministry of Transportation and Communication, and the Computer Crime Squad of the Criminal Investigation Bureau in the Ministry of Interior. In addition, the Criminal Investigation Bureau has developed proprietary software tools and hardware equipment for investigating cybercrime cases, including Internet Patrol Agent, Globe IP Tracer, Evidence Collector, Packet Analyzer, and Remote Monitor.

3.2 Across-Country Strategy: Collaborative Fighting of Cybercrime

Group of Eight (G8). The G8 is made up of the heads of eight industrialized countries: the U.S., the United Kingdom, Russia, France, Italy, Japan, Germany, and Canada. In 1997, the G8 released a Ministers' Communiqué that includes an action plan and principles to combat cybercrime. The Communiqué requires member countries to ensure that appropriate measures are taken to criminalize cybercrime, to protect the confidentiality, integrity, and availability of data and systems from unauthorized impairment, and to preserve quick access to electronic data. The G8 also mandates that all law enforcement personnel be trained and equipped to address cybercrime, and designates that all member countries have a point of contact available on a 24 × 7 basis.

The Council of Europe (C.E.). The C.E. was established in 1949 by West European countries and now has over 40 member countries. In 2001, the C.E., the U.S., Canada, South Africa, and Japan co-drafted and passed the Cybercrime Convention, the first international convention aimed at Internet criminal behaviors. The convention mandates member countries to enact unified legislation on cybercrime, maintain a connection network on a 24 × 7 basis, and provide essential training and equipment to related personnel.
4 Recommendations

Based on our review and experience in fighting cybercrime1, we propose four recommendations to governments, lawmakers, international organizations, intelligence and law enforcement agencies, and researchers.

Updating existing laws. Lawmakers should regularly update existing laws as new technologies are developed. For example, laws regulating Internet Service Providers' operations need to be updated regularly. Because criminals like to commit cybercrime in countries with less stringent laws, countries should work together to adopt a unified standard for fighting cybercrime that prevents criminals from taking advantage of countries with less stringent rules. This will create a cooperative foundation for countries to fight cybercrime.

Enhancing specialized task forces. Law enforcement agencies should recruit qualified investigators with technical and legal knowledge so as to keep up with the development and social impacts of cybercrime. Computer forensics labs should be established to collect digital evidence from computer equipment and to provide training for investigators.

Utilizing civic resources. In addition to conducting their own investigations, law enforcement agencies should utilize civic resources to enhance the effectiveness and efficiency of their work. They should look to universities and research organizations for technical support, and should also cooperate with private businesses (e.g., ISPs) to obtain first-hand information about cybercrime. These measures can speed up the processing of cybercrime cases and prevent more Internet users from becoming victims.

Promoting cybercrime research. We should promote research to enhance understanding of cybercrime's causes and to discover ways to prevent it. For example, research in intelligence analysis and knowledge management systems has enabled law enforcement agencies to manage different kinds of criminal and intelligence data, information, and knowledge more effectively and to sift through tons of crime-related information efficiently [1].
1 Mr. Weiping Chang is the director of the Information System Office of the Criminal Investigation Bureau (National Police Administration) and has over five years of cybercrime investigation experience spanning over 300 cases in Taiwan.
5 Conclusions and the Future

Because of the pervasiveness of the Internet, cybercrime greatly affects individuals, businesses, and national security, and it is emerging as a major crime type of the 21st century. In this paper, we have provided an overview of cybercrime and reviewed the current status of fighting cybercrime in different countries. We believe that countries should work together, using legal, organizational, and technological approaches, to combat cybercrime so as to reduce the damage to critical infrastructures and protect the Internet from abuse.
References

1. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., Schroeder, J.: COPLINK: Managing Law Enforcement Data and Knowledge. Communications of the ACM 46, 1 (2003), 28–34
2. Freeh, L.J.: Statement for the Record of Louis J. Freeh, Director, Federal Bureau of Investigation, on Cybercrime, Before the Senate Committee on the Judiciary, Subcommittee for Technology, Terrorism, and Government Information. U.S. Department of Justice, http://www.usdoj.gov/criminal/cybercrime/freeh328.htm, 2000
3. Garrison, L., Grand, M.: Network Defense: The Legal Aspects of Retaliation. National Infrastructure Protection Center Highlights (Issue 7-01), 2001, 2
4. Parker, D.B.: Fighting Computer Crime: A New Framework for Protecting Information. Wiley Computer Publishing, 1998
5. Philippsohn, S.: Trends in Cybercrime – An Overview of Current Financial Crimes on the Internet. Computers & Security 20, 1 (2001), 53–69
6. POST: Regulating Internet Content. Parliamentary Office of Science and Technology 159 (2001), 1–4
7. Power, R.: 2002 CSI/FBI Computer Crime and Security Survey. Computer Security Issues & Trends 8, 1 (2002), 1–22
8. Power, R.: Tangled Web: Tales of Digital Crime from the Shadows of Cyberspace. Que Corporation, 2000
9. Richards, J.R.: Transnational Criminal Organizations, Cybercrime, and Money Laundering: A Handbook for Law Enforcement Officers, Auditors, and Financial Investigators. CRC Press, 1999
10. Smith, S.P., Perrit, H., Krent, H., Mencik, S.: Independent Technical Review of the Carnivore System. http://www.usdoj.gov/jmd/publications/carniv_final.pdf, 2000
11. Thomas, D., Loader, B.D.: Introduction – Cybercrime: Law Enforcement, Security and Surveillance in the Information Age. In: Cybercrime: Law Enforcement, Security and Surveillance in the Information Age. Taylor & Francis Group, New York, NY, 2000
Hiding Traversal of Tree Structured Data from Untrusted Data Stores

Ping Lin and K. Selçuk Candan

Department of Computer Sciences and Engineering, Arizona State University, Tempe, AZ 85287
{ping.lin, candan}@asu.edu
With the increasing use of web services, many new challenges concerning data security are becoming critical. Especially in mobile services, where clients are generally thin in terms of computation power and storage space, a remote server can be outsourced for computation or can act as a data store. Unfortunately, such a data store may not always be trustworthy, and clients with sensitive data and queries may want to be protected from malicious attacks. We present a client-server protocol that hides tree structured data from potentially malicious data stores while allowing clients to traverse the data to locate an object of interest without leaking information to the data store. The two motivating applications for this approach are hiding (1) tree-like XML data, as well as XML queries in the form of tree paths, and (2) tree-structured indexes and the queries executed on them.

Redundancy is a technique used in the private information retrieval literature to achieve hiding. We develop efficient redundancy techniques suitable for hiding the traversal of tree structured data, and we complement these with a novel swapping technique that enables clients to traverse tree structures obliviously. Redundancy requires a client, each time it retrieves a node, to ask for a set of nodes that includes the target node and at least one empty node; in this way, the target node is hidden in the set. Swapping requires the client to swap the target node with an empty node before writing the set back (to hide the difference between update and non-update operations from the data store, each retrieval of a set is followed by writing the set back). If the size of the set is m and the tree depth is l, then with redundancy the probability that the data store identifies the traversal path is 1/m^l, which is sufficiently slim even for small m. With swapping, the relationships among nodes become dynamic, and any static guess about the data structure at a given moment is of little use.

We devised proper concurrency control to allow simultaneous access to tree structures: clients are required to retrieve nodes in a predefined order so that circular waits for locks on nodes are prevented and no deadlock can occur. Compared with general private information retrieval techniques, the proposed protocol has desirable communication and concurrency performance, as demonstrated by the experiments we have conducted.
This work is supported by the AFOSR grant #F49620-00-1-0063 P0003.
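The following is a minimal sketch of one retrieve-swap-write-back step as described above; it is our illustration, not the authors' implementation, the dictionary-backed store and all identifiers are placeholders, and a real deployment would keep node contents encrypted.

```python
# Minimal sketch (not the authors' code) of one oblivious node access:
# fetch m node ids with the target hidden among decoys and an empty node,
# swap target and empty contents, and always write the set back.
import random

class DictStore:
    """Stand-in for the untrusted data store (would hold encrypted nodes)."""
    def __init__(self):
        self.cells = {}
    def read(self, node_id):
        return self.cells.get(node_id)
    def write(self, node_id, value):
        self.cells[node_id] = value

def oblivious_access(store, target_id, empty_id, decoy_ids):
    """Return the target's content; afterwards it lives under empty_id."""
    ids = [target_id, empty_id] + list(decoy_ids)
    random.shuffle(ids)                      # request order leaks nothing
    nodes = {i: store.read(i) for i in ids}  # redundancy: m-node request
    target = nodes[target_id]
    # Swap so repeated traversals do not reveal static parent-child links.
    nodes[target_id], nodes[empty_id] = nodes[empty_id], nodes[target_id]
    for i in ids:                            # write-back also hides updates
        store.write(i, nodes[i])
    return target

store = DictStore()
store.write("n7", "secret-subtree-root")
store.write("n9", None)                      # an empty node
store.write("n2", "decoy"); store.write("n5", "decoy")
print(oblivious_access(store, "n7", "n9", ["n2", "n5"]))  # m = 4
```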
Criminal Record Matching Based on the Vector Space Model*

Jau-Hwang Wang, Bill T. Lin**, Ching-Chin Shieh***, and Peter S. Deng

Department of Information Management, Central Police University, 56 Shu-Ren Road, Ta-Kang, Kwei-Shan, Tao-Yuan, Taiwan 333
The process of establishing an association between a crime scene and a criminal has long been an important task in crime investigation. For example, forensic experts usually search a crime scene for physical evidence, such as fingerprints, to identify links between the crime scene and criminals. The major aims of crime investigation are to discover and reconstruct the crime by analyzing the evidence left at the crime scene, and to provide links between the crime scene and criminals. Crime investigation is an information intensive process, and the effectiveness and efficiency of an investigation depend heavily on the knowledge and information the investigator has gathered over his or her career as a detective. Owing to the innovation and prevalence of information technology (IT), in recent years more and more research has been devoted to developing IT techniques for crime investigation.

It has been found that most criminals are occasional offenders and only a small fraction (20%) are chronic offenders. However, this small portion of criminals commits most (80%) of all crime cases. Thus, the investigation scope of a committed crime can be significantly narrowed if an information system can help determine whether the crime was committed by someone who has committed crimes before or is just an occasional offense. The crime characteristics found at a crime scene may reveal the criminal's behavioral characteristics and can be used to derive the criminal's modus operandi. The modus operandi information can then be used to search the criminal database to determine whether the case is associated with a chronic offender.

The process of linking a crime scene to criminals is very similar to the process of linking a user's query to documents in information retrieval (IR). The vector space model (VSM) is one of the best known IR models and has been widely used in document retrieval. This research proposes a criminal record matching scheme based on the VSM, and an experiment was implemented to evaluate the devised scheme. Experimental results show that our scheme is very promising.

Keywords. Crime Linking, Chronic Offender, Modus Operandi, Vector Space Model, Criminal Record Matching, Information Retrieval.
* This work is partially supported by a grant from NSC, Taiwan, ROC, under grant number NSC 84-2213-E-015-001.
** Affiliated with the Department of Crime Investigation.
*** Affiliated with the Bureau of Immigration, Ministry of Interior.
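As a minimal sketch of the matching idea, assuming invented crime-scene features and standard TF-IDF weighting (the paper's actual feature set and weighting scheme are not reproduced here), records can be ranked by cosine similarity to a scene description:

```python
# Illustrative sketch (our addition): matching a crime-scene description
# against stored criminal records with the vector space model, using
# TF-IDF weights and cosine similarity. All feature names are invented.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF-IDF vector (term -> weight) for each token list."""
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

records = [["burglary", "night", "rear-window", "gloves"],
           ["robbery", "knife", "daytime", "mask"],
           ["burglary", "night", "lockpick", "alarm-bypass"]]
scene = ["burglary", "night", "rear-window"]

vecs = tfidf_vectors(records + [scene])
query = vecs[-1]
ranked = sorted(((cosine(query, v), i) for i, v in enumerate(vecs[:-1])),
                reverse=True)
print(ranked)   # record indexes ranked by similarity to the crime scene
```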
Database Support for Exploring Criminal Networks

M.N. Smith and P.J.H. King

School of Computer Science & Information Systems, Birkbeck College, University of London
{mat, pjhk}@dcs.bbk.ac.uk
http://www.dcs.bbk.ac.uk/TriStarp
An established method of comprehending the data gathered thus far in a criminal investigation is to incrementally visualize it as a network, referred to as a link chart, to ascertain if, and how, objects of interest are connected.
Link charts are typically constructed incrementally and are often altered owing to the naturally exploratory nature of the knowledge discovery process. Thus, if the underlying data is stored in a database, one can consider a link chart to represent the results, or a subset of the results, returned by a number of queries, such as "display the paths that connect David and Fiona" and "display all the people directly linked to Brian", each of which may add one or more objects or links to the chart.

Current database support for link chart analysis is unsatisfactory. Query languages such as SQL are not expressive enough to form queries such as those given above, which retrieve paths of undefined length and structure through a database. Additionally, the record-based data models typically used in database products, such as the relational, object-oriented, and network data models, do not explicitly model the links between objects. This results in a mismatch between the model of data in the database and that used in the visualization system. To manage this mismatch, an additional software component has to be introduced to convert between the two representations. This increases maintenance costs, as the component has to be reconfigured whenever the structure of the database schema is modified; it can also adversely affect performance. This is not true of associative data models, which model data in a manner analogous to the way it is displayed in a link chart.

We describe an experimental query interface, EDVC, for databases using an associative data model that supports link chart analysis through an incremental style of interaction and a range of query facilities for analysing the connections between the stored objects. For a further description of the system, see the research report available from the website listed above.
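As a sketch of the kind of query such an interface must answer, the fragment below, our illustration with invented data rather than the EDVC implementation, runs "display the paths that connect David and Fiona" as a breadth-first search over an associative store of (subject, link, object) triples:

```python
# Illustrative sketch (our addition): a path query over associative triples,
# the kind of retrieval SQL-style languages express poorly.
from collections import deque

triples = [("David", "calls", "Brian"),
           ("Brian", "owns", "Car-X"),
           ("Fiona", "drives", "Car-X"),
           ("Brian", "meets", "Fiona")]

def neighbours(node):
    """Treat links as undirected for exploratory link-chart purposes."""
    for s, link, o in triples:
        if s == node:
            yield o
        elif o == node:
            yield s

def paths(src, dst, max_len=4):
    """Yield simple paths from src to dst up to max_len hops."""
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst and len(path) > 1:
            yield path
            continue
        if len(path) > max_len:
            continue
        for nxt in neighbours(path[-1]):
            if nxt not in path:
                queue.append(path + [nxt])

for p in paths("David", "Fiona"):
    print(" -> ".join(p))   # each path becomes objects and links on the chart
```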
Hiding Data and Code Security for Application Hosting Infrastructure

Ping Lin, K. Selçuk Candan, Rida Bazzi, and Zhichao Liu

Department of Computer Sciences and Engineering, Arizona State University, Tempe, AZ 85287
{ping.lin, candan, bazzi, zhichao.liu}@asu.edu
In mobile computing environments, where clients have limited computing power and memory, it is common for mobile clients to use a fixed host server to execute their applications. Application hosting services, which rent out storage, (Internet) presence, and computation power to clients with IT needs (but without appropriate infrastructures), have been widely adopted. A common way these hosting services are used is as follows: (1) the customer (or application owner) A with an application P publishes this application, along with the relevant data, onto the servers of the host H; (2) whenever they need, the customer A or its clients access the application remotely by passing appropriate parameter values v to the host; (3) the host then runs P(v) with the local data and sends the result back to the requesting party; (4) the host charges the customer based on the resources (bandwidth, CPU, storage, etc.) required to process the request.

How to protect mobile code and input data from malicious execution environments is a challenge that has not yet been completely solved. We study the problem of hiding input data and polynomial functions with loops and conditional statements. We find that, whatever encryption scheme is adopted, once the function is broken (i.e., the assignment matrix describing linear computations and the condition matrix describing linear conditions are leaked), repeated evaluations of conditional statements in loops will reveal the feasible region of the input data. To deal with this problem, we describe an algorithm based on the methodical addition of dummy variables into functions to maximally increase the dimension of the feasible region of the input. The algorithm is based on the observation that the fewer distinct eigenvalues the assignment matrices have, and the more alike the eigenvalue distributions of the assignment matrices are, the larger the dimension of the feasible region of the input and the more difficult it is for a malicious host to find the input data. We therefore carefully choose and add dummy variables into the application, increasing the size of the assignment matrices so that the eigenvalues of the enlarged matrices are the same as the original eigenvalues. We show that if this algorithm is applied to a function before the function is encrypted and transferred to a host for execution, the feasible region of the input data that the host can infer is enlarged, and the input is protected even in situations where the function might be leaked.
This work is supported by the AFOSR grant #F49620-00-1-0063 P0003.
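The following toy sketch, our illustration rather than the paper's algorithm, shows the flavor of the enlargement: dummy dimensions are added so that the bigger matrix introduces no eigenvalues beyond those the original assignment matrix already has.

```python
# Illustrative sketch (our addition, not the authors' algorithm): padding an
# assignment matrix with dummy variables so the enlarged matrix introduces
# no new eigenvalues -- here by reusing an existing eigenvalue on the
# diagonal of the added block. The example matrix is invented.
import numpy as np

def pad_with_dummies(A, k):
    """Embed the n x n assignment matrix A into an (n+k) x (n+k) matrix
    whose extra diagonal entries reuse an eigenvalue A already has."""
    n = A.shape[0]
    lam = np.linalg.eigvals(A)[0].real       # an eigenvalue already present
    B = np.zeros((n + k, n + k))
    B[:n, :n] = A                            # original computation unchanged
    B[n:, n:] = lam * np.eye(k)              # dummy variables, old eigenvalue
    return B

A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = pad_with_dummies(A, 2)
print(sorted(np.linalg.eigvals(A).real))                     # [2.0, 3.0]
print(sorted(set(np.round(np.linalg.eigvals(B).real, 6))))   # still {2.0, 3.0}
```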
Secure Information Sharing and Information Retrieval Infrastructure with GridIR

Gregory B. Newby1 and Kevin Gamiel2

1 Arctic Region Supercomputing Center, University of Alaska, Fairbanks
[email protected]
2 Nassib Nassar, Etymon Systems, Inc.
This poster describes GridIR, the emerging standard for information retrieval on computational grids. GridIR is based on the work of the Global Grid Forum (GGF). GridIR implements a multi-tiered security model at the collection, query, and datum levels. Unlike large monolithic search engines and customized small-scale systems, GridIR provides a standard method for federating data sets with multiple data types.

Three main components make up GridIR; each can exist on any computer on the computational grid with sufficient computational resources, software, and access permissions to join a Virtual Organization (VO):

Collection managers (CMs). Systems that know how to access, stage/store, transform, and deliver data items from collections. CMs may have privileged access to data and may be able to perform complex or computationally intensive transformations. Alternatively, they may be simple Web harvesters or processes that access and deliver local files.

Indexers. These make up the core traditional IR system. They build searchable collections and process queries to deliver results. Results may be documents, URLs, document surrogates, etc.

Query Processors (QPs). These can interact with multiple indexers, over time, to gather query responses, merge them, and present them to their human users.

GridIR offers several important components to address information sharing among Intelligence Community (IC) members. Several are derived from the use of Grid-standard technologies: (1) end-to-end encryption, using the OGSA (Open Grid Services Architecture) and other standards; (2) system-level authentication, with each member of the VO cryptographically authenticated; (3) fundamental infrastructure for data exchange, service discovery, and logging. Other security considerations are enabled through GridIR itself and are more directly geared towards the needs of the intelligence community: (4) access control lists for collections; (5) query processors that must authenticate in order to retrieve data or transmit queries; (6) collection managers that can implement authentication before delivering a datum, or transform the datum as needed (these collection managers may also seek asynchronous policy decisions from humans or external review processes); (7) indexers with customized capabilities for particular sets of documents, particular searchers, or particular information needs, as desired; (8) sophisticated methodologies for data fusion from multiple sources; (9) capability for standing queries and information filtering; (10) capability for cross-language information retrieval, multiple data types, and multiple formats.

The full paper is available online at http://www.GridIR.org under the "Publications" area.
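As a rough sketch of how the components fit together, assuming invented method names and a toy in-memory indexer (GridIR's actual interfaces are defined by the emerging standard and are not shown here), a query processor authenticates to several indexers and fuses their results:

```python
# Illustrative sketch (our addition; all signatures invented): a query
# processor authenticating to indexers and merging results by score.
from dataclasses import dataclass

@dataclass
class Result:
    doc_id: str
    score: float

class Indexer:
    """Toy indexer holding (doc_id, score) postings per term."""
    def __init__(self, postings, allowed_tokens):
        self.postings = postings
        self.allowed = allowed_tokens   # stand-in for collection ACLs

    def search(self, term, token):
        if token not in self.allowed:   # QPs must authenticate (feature 5)
            raise PermissionError("unauthenticated query processor")
        return [Result(d, s) for d, s in self.postings.get(term, [])]

def federated_query(term, indexers, token):
    """Gather results from all indexers and merge by score (data fusion)."""
    merged = []
    for ix in indexers:
        merged.extend(ix.search(term, token))
    return sorted(merged, key=lambda r: r.score, reverse=True)

ix1 = Indexer({"anthrax": [("doc-a", 0.9), ("doc-b", 0.4)]}, {"vo-token"})
ix2 = Indexer({"anthrax": [("doc-c", 0.7)]}, {"vo-token"})
print(federated_query("anthrax", [ix1, ix2], "vo-token"))
```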
Semantic Hacking and Intelligence and Security Informatics (Extended Abstract)

Paul Thompson

Institute for Security Technology Studies, Dartmouth College, Hanover, NH 03755
[email protected]

Abstract. In the context of information warfare, Libicki first characterized attacks on computer systems as physical, syntactic, or semantic, the last being attacks in which software agents are misled by an adversary's misinformation [1]. Recently, cognitive hacking was defined as an attack directed at the mind of the user of a computer system [2]. Countermeasures against cognitive and semantic attacks are expected to play an important role in a new science of intelligence and security informatics. Information retrieval, or document retrieval, developed historically to serve the needs of scientists and legal researchers, among others. In these domains, documents are expected to be honest representations of attempts to discover scientific truths or to make sound legal arguments. This assumption does not hold for intelligence and security informatics. Intelligence and security informatics will be supported by data mining, visualization, and link analysis technology, but intelligence and security analysts should also be provided with an analysis environment supporting mixed-initiative, utility-theoretic interaction with both raw and aggregated data. This environment should include toolkits of semantic hacking countermeasures. For example, faced with a potentially deceptive news item, an automated countermeasure might provide an alert using adaptive fraud detection algorithms [3], or, through a retrieval mechanism, allow the analyst to quickly assemble and analyze related documents bearing on the potential misinformation. The author is currently developing such countermeasures.
References

1. Libicki, M.: The Mesh and the Net: Speculations on Armed Conflict in an Age of Free Silicon. National Defense University McNair Paper 28 (1994)
2. Cybenko, G., Giani, A., Thompson, P.: Cognitive Hacking: A Battle for the Mind. IEEE Computer, Vol. 35, No. 8 (2002) 50–56
3. Fawcett, T., Provost, F.: Fraud Detection. In: Kloesgen, W., Zytkow, J. (eds.): Handbook of Data Mining and Knowledge Discovery. Oxford University Press (2002)
Author Index
Atabakhsh, Homa 181
Badia, Antonio 296
Bazzi, Rida 388
Berndt, Donald J. 322
Bi, Henry H. 266
Biros, David P. 366
Blair, J.P. 91
Bratt, Harry 350
Brown, Donald E. 13, 153
Buetow, Ty 181
Burgoon, Judee K. 91, 102, 358, 366
Candan, K. Selçuk 385, 388
Cao, Jinwei 358
Chaboya, Luis 181
Chang, Weiping 379
Chau, Michael 281
Chawathe, Sudarshan S. 195
Chen, Hsinchun 59, 168, 181, 209, 232, 266, 281, 379
Chou, Shihchieh 379
Chung, Wingyan 379
Crews, Janna M. 358
Cushna, Tom 181
Daspit, Damien 181, 281
Demchak, Chris C. 223
Deng, Peter S. 386
Dolotov, Alexander 39
Duboue, Pablo A. 343
Evans, Marc H. 308
Ferrer, Luciana 350
Fotouhi, F. 355
Gadde, Venkata 350
Gamiel, Kevin 389
Geng, Xianjun 375
George, Joey F. 366
Goldberg, Mark 126
Grosky, W.I. 355
Hammer, Joachim 346
Hatzivassiloglou, Vasileios 343
Hershkop, Shlomo 74
Hevner, Alan R. 322
Hu, Chia-Wei 74
Hu, Paul Jen-Hwa 209
Huang, Yan 111
Huang, Zan 59
Kajarekar, Sachin 350
Kargupta, Hillol 336
King, P.J.H. 387
Lam, Wai 1
Lee, Bogju 371
Li, Kar Wing 138
Lim, Ee-Peng 1
Lin, Bill T. 386
Lin, Chienting 209, 281
Lin, Lihui 375
Lin, Ming 358
Lin, Ping 385, 388
Lin, Song 13
Liu, Kun 336
Liu, Zhichao 388
Lu, Qingsong 111
Magdon-Ismail, Malik 126
McKeown, Kathleen R. 343
Naing, Myo-Myo 1
Nandiraju, Suresh 281
Newby, Gregory B. 389
Nimeskern, Olivier 74
Nunamaker, Jay F. 91, 358, 366
O'Brien, William 346
O'Toole, Christopher 181
Park, Youna 371
Patman, Frankie 27
Petersen, Tim 181
Qin, Tiantian 91
Qin, Yi 59
Raghu, T.S. 249
Ramesh, R. 249
Ryan, Jessica 336
Scherer, William T. 308
Schmalz, Mark 346
Schroeder, Jennifer 168
Shan, Fu 281
Shekhar, Shashi 111
Shieh, Ching-Chin 386
Shriberg, Elizabeth 350
Siebecker, David 126
Smith, M.N. 387
Sönmez, Kemal 350
Spradley, Leah L. 308
Sreenath, D.V. 355
Stolcke, Andreas 350
Stolfo, Salvatore J. 74
Strickler, Mary 39
Studnicki, James 322
Sun, Aixin 1
Thompson, Paul 27, 390
Twitchell, Douglas P. 102
Venkataraman, Anand 350
Wallace, William 126
Wang, Jau-Hwang 386
Wang, Ke 74
Whinston, Andrew B. 249, 375
Xu, Jennifer 168, 232
Xue, Yifei 153
Yang, Christopher C. 138
Zeng, Daniel 281
Zhao, J. Leon 266
Zheng, Rong 59
Zhou, Lina 102