Modeling and Simulation in Science, Engineering and Technology

Series Editor
Nicola Bellomo
Politecnico di Torino
Italy

Advisory Editorial Board

M. Avellaneda (Modeling in Economics)
Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012, USA
[email protected]

H.G. Othmer (Mathematical Biology)
Department of Mathematics, University of Minnesota, 270A Vincent Hall, Minneapolis, MN 55455, USA
[email protected]

K.J. Bathe (Solid Mechanics)
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
[email protected]

L. Preziosi (Industrial Mathematics)
Dipartimento di Matematica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
[email protected]

P. Degond (Semiconductor and Transport Modeling)
Mathématiques pour l’Industrie et la Physique, Université P. Sabatier Toulouse 3, 118 Route de Narbonne, 31062 Toulouse Cedex, France
[email protected]

A. Deutsch (Complex Systems in the Life Sciences)
Center for Information Services and High Performance Computing, Technische Universität Dresden, 01062 Dresden, Germany
[email protected]

M.A. Herrero Garcia (Mathematical Methods)
Departamento de Matematica Aplicada, Universidad Complutense de Madrid, Avenida Complutense s/n, 28040 Madrid, Spain
[email protected]

W. Kliemann (Stochastic Modeling)
Department of Mathematics, Iowa State University, 400 Carver Hall, Ames, IA 50011, USA
[email protected]

V. Protopopescu (Competitive Systems, Epidemiology)
CSMD, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6363, USA
[email protected]

K.R. Rajagopal (Multiphase Flows)
Department of Mechanical Engineering, Texas A&M University, College Station, TX 77843, USA
[email protected]

Y. Sone (Fluid Dynamics in Engineering Sciences)
Professor Emeritus, Kyoto University, 230-133 Iwakura-Nagatani-cho, Sakyo-ku, Kyoto 606-0026, Japan
[email protected]

Dynamics On and Of Complex Networks
Applications to Biology, Computer Science, and the Social Sciences
Niloy Ganguly Andreas Deutsch Animesh Mukherjee Editors
Birkhäuser Boston • Basel • Berlin
Editors

Niloy Ganguly
Indian Institute of Technology, Department of Computer Science and Engineering, Kharagpur 721302, India
[email protected]

Andreas Deutsch
Center for Information Services and High Performance Computing, Technische Universität Dresden, 01062 Dresden, Germany
[email protected]

Animesh Mukherjee
Indian Institute of Technology, Department of Computer Science and Engineering, Kharagpur 721302, India
[email protected]

ISBN: 978-0-8176-4750-6
e-ISBN: 978-0-8176-4751-3
DOI: 10.1007/978-0-8176-4751-3
Library of Congress Control Number: 2009921285
Mathematics Subject Classification (2000): 05C85, 68M10, 82B43, 90B15, 90B18, 90B40, 90C35, 91D30, 92D30, 94C15
© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Birkhäuser Boston, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Birkhäuser Boston is part of Springer Science+Business Media (www.birkhauser.com)
Preface
In the context of network theory, a complex network can be defined as a collection of nodes connected by edges that represent various complex interactions among the nodes. Almost any large-scale system, be it natural or man-made, can be viewed as a complex network of interacting entities, which is dynamically evolving over time. Naturally occurring networks include biological, ecological and social networks (e.g., metabolic networks, gene regulatory networks, protein interaction networks, signaling networks, epidemic networks, food webs, scientific collaboration networks and acquaintance networks), whereas man-made networks include communication networks and transportation infrastructures (e.g., the Internet, the World Wide Web, peer-to-peer networks, power grids and airline networks).

This edited volume is a sequel to the workshop Dynamics on and of Complex Networks (http://www.cel.iitkgp.ernet.in/~eccs07/) held as a satellite event of the fourth European Conference on Complex Systems in Dresden, Germany from October 1–5, 2007. The primary aim of this workshop was to systematically explore the statistical dynamics “on” and “of” complex networks that prevail across a large number of scientific disciplines. Dynamics on networks refers to the different types of processes, for instance, proliferation and diffusion, that take place on networks. The functionality/efficiency of these processes is strongly tied to the underlying topology as well as the dynamic behavior of the network. On the other hand, dynamics of networks mainly refers to the phenomena of self-organization, which in turn lead to the emergence of the complex structure of the network. Another important motivation of the workshop was to create a forum for researchers applying the theories of complex networks to various domains as well as across several disciplines such as computer science, statistical physics, nonlinear dynamics, econometrics, biology, sociology and linguistics.
The workshop received a large number of quality submissions from authors pursuing research in multiple disciplines, thus making the forum truly interdisciplinary. The total number of participants who attended the workshop
was approximately 40. There were around 20 speakers, including both senior researchers and young scientists, who spoke about the dynamics on and of different systems exhibiting a complex network structure. The theme of this edited volume is identical to that of the workshop. Its primary aim is to show how the theories of complex networks are being successfully used by researchers to tackle numerous difficult problems in various domains. Towards this aim, it presents an extended version of some of the very high quality submissions received at the workshop together with new invited contributions, which can play an extremely important role in the understanding as well as advancement of the field. Since the target audience of this book is expected to be largely cross-disciplinary, the chapters have been made as readable as possible, explaining all the intricate technicalities wherever necessary in sufficient detail. The uniqueness of this volume lies in the fact that it presents an equal mix of (a) very relevant reviews (eight chapters) of important works in the field, which gives the reader an up-to-date picture of the state of the art, and (b) independent research reports (eight chapters) providing a clear conception about how complex networks can be extremely useful in harnessing even the hardest problems of a particular discipline. The editors feel that research in this area has reached a stage where there is an urgent need to have a comprehensive knowledge of the past and the present before the future can be planned. The blend of reviews and the contributory chapters presented in this volume strive to achieve this objective and, thereby, set the platform for a “Phase II” research in complex networks. The volume consists of three parts. The contributions in Part I center around the application of complex networks in the understanding of biological problems. This part consists of five chapters. 
The first chapter is From Network Structure to Dynamics and Back Again: Relating Dynamical Stability and Connection Topology in Biological Complex Systems, in which Sitabhra Sinha presents a study of how the topology of a biological network influences the nature of its dynamics, and conversely, how dynamical considerations put constraints on the network structure. The next chapter deals with Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis, in which Madalena Chaves et al. model and analyze, in the framework of complex networks, the interaction of the nuclear factor κB with the apoptosis signaling pathway. In the third chapter, Network-Based Models in Molecular Biology, Andreas Beyer presents a survey on the extensive literature that employs complex networks to understand numerous intricate phenomena in biology. The fourth chapter, Ecological Networks: Structure, Interaction Strength, and Stability, by Samit Bhattacharyya and Somdatta Sinha, presents a detailed survey of the various studies conducted on ecological networks and especially on food webs. In the last chapter, Signaling and Feedback in Biological Networks, Sandeep Krishna et al. review some important studies on the signaling and feedback mechanisms that are observed in different biological networks.
Part II is also spread over five chapters and focuses on social networks. This part begins with a chapter on Topographic Spreading Analysis of an Empirical Sex Workers’ Network, by Johannes Bjelland et al., where the authors present a “topographic” analysis of spreading (of HIV) on an empirical network of female sex workers. The authors find that, if perfect condom protection is made possible, the HIV graph breaks into small components, thereby reducing the spreading. The next chapter, Spectral Characterization of Network Structures and Dynamics, by Anirban Banerjee and Jürgen Jost, centers around the investigation of the spectral properties of complex networks with a special thrust on social networks. The third chapter, Dynamics of Social Complex Networks: Some Insights into Recent Research, is authored by Sergi Lozano and presents a comprehensive review of how complex network theory has been instrumental in explaining the structure and the dynamics of a society. The last two chapters show how complex networks can be applied to explain the dynamics of human languages. The first one, titled The Structure and Dynamics of Linguistic Networks, by Monojit Choudhury and Animesh Mukherjee, is a review of the current literature on linguistic networks. The second one, Networks Generated from Natural Language Text, by Chris Biemann and Uwe Quasthoff, presents a survey focusing on how corpus linguistics (i.e., the study of language as expressed in corpora) can be studied within the framework of complex networks.

Part III presents a comprehensive overview of the networks that are prevalent in information sciences. This part is laid out in six chapters. The first chapter in this part, Efficiency of Navigation in Indexed Networks, by Petter Holme, explores the efficiency of navigation of data packets on “indexed” graphs.
The second chapter, Evolution of Apache Open Source Software, by Haoran Wen et al., attempts to explain the evolution of the Apache open source software through the analysis of its call graphs. The next chapter, Some New Applications of Network Growth Models, by Gourab Ghoshal, presents new models of growth for peer-to-peer file-sharing networks. The fourth chapter, The Big Friendly Giant: The Giant Component in Clustered Random Graphs, by Yakir Berchenko et al., is a theoretical study of the properties of the giant component in a special kind of random graph, which is relevant for various information networks. The fifth chapter, Technological Networks, by Bivas Mitra, presents a detailed review of the large number of studies that have been conducted on information networks, especially the World Wide Web and peer-to-peer networks. The last chapter, Advances in the Theory of Complex Networks, by Fernando Peruani, presents a survey of some of the theoretical advancements that have taken place and helps in providing a better understanding of the structure and dynamics of information networks.

These contributions collectively demonstrate that complex networks indeed provide an elegant research framework relevant to a variety of scientific disciplines. The chapters are designed to serve as the state of the art not only for students and newcomers who intend to pursue research in this field but
also for the experts. All the chapters have been carefully peer reviewed for their scientific content as well as readability and self-consistency. We would like to thank the authors for their contributions, constructive co-operation and gracious acceptance of the editorial comments. We are also indebted to Ranjita Bhagwan, Chris Biemann, Lutz Brusch, Geoffrey Canright, Michael Gamon, Gourab Ghoshal, Petter Holme, A. Kumaran, Abyayananda Maiti, Pabitra Mitra, Luis Morelli, Gautam Mukherjee, Romit Roy Choudhury, Gustavo Sibona and Biplab K. Sikdar for their constructive criticisms, comments and suggestions, which have significantly improved the quality of the chapters. In addition, we would also like to extend our gratitude to Rishabh Singh for his painstaking effort in helping to prepare the Glossary of Essential Terms. Finally, we are also grateful to Tom Grasso and the Birkhäuser team for all their help and support towards the publication of this volume.

Niloy Ganguly, Kharagpur, India
Andreas Deutsch, Dresden, Germany
Animesh Mukherjee, Kharagpur, India
Contents

Preface
List of Contributors

Part I Biological Sciences

From Network Structure to Dynamics and Back Again: Relating Dynamical Stability and Connection Topology in Biological Complex Systems
Sitabhra Sinha

Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis
Madalena Chaves, Thomas Eissing, and Frank Allgöwer

Network-Based Models in Molecular Biology
Andreas Beyer

Ecological Networks: Structure, Interaction Strength, and Stability
Samit Bhattacharyya and Somdatta Sinha

Signaling and Feedback in Biological Networks
Sandeep Krishna, Mogens H. Jensen, and Kim Sneppen

Part II Social Sciences

Topographic Spreading Analysis of an Empirical Sex Workers’ Network
Johannes Bjelland, Geoffrey Canright, Kenth Engø-Monsen, and Valencia P. Remple

Spectral Characterization of Network Structures and Dynamics
Anirban Banerjee and Jürgen Jost

Dynamics of Social Complex Networks: Some Insights into Recent Research
Sergi Lozano

The Structure and Dynamics of Linguistic Networks
Monojit Choudhury and Animesh Mukherjee

Networks Generated from Natural Language Text
Chris Biemann and Uwe Quasthoff

Part III Information Sciences

Efficiency of Navigation in Indexed Networks
Petter Holme

Evolution of Apache Open Source Software
Haoran Wen, Raissa M. D’Souza, Zachary M. Saul, and Vladimir Filkov

Some New Applications of Network Growth Models
Gourab Ghoshal

The Big Friendly Giant: The Giant Component in Clustered Random Graphs
Yakir Berchenko, Yael Artzy-Randrup, Mina Teicher, and Lewi Stone

Technological Networks
Bivas Mitra

Advances in the Theory of Complex Networks
Fernando Peruani

Glossary of Essential Terms
Index
List of Contributors
Frank Allgöwer Institute for Systems Theory and Automatic Control University of Stuttgart Pfaffenwaldring 9 70550 Stuttgart Germany
[email protected] Yael Artzy-Randrup Biomathematics Unit Faculty of Life Sciences Tel Aviv University Ramat Aviv 69978 Israel
[email protected] Anirban Banerjee Max Planck Institute for Molecular Genetics Ihnestr. 63–73 14195 Berlin Germany
[email protected] Yakir Berchenko Interdisciplinary Brain Research Center Bar Ilan University Ramat Gan 52900 Israel
[email protected] Andreas Beyer Biotechnology Center Technische Universität Dresden 01062 Dresden Germany [email protected] Samit Bhattacharyya Mathematical Modelling and Computational Biology Group Centre for Cellular and Molecular Biology, CSIR Hyderabad 500007 India
[email protected] Chris Biemann Institute for Computer Science NLP Department University of Leipzig Johannisgasse 26 04103 Leipzig Germany
[email protected] Johannes Bjelland Telenor R&I 1331 Fornebu Norway
[email protected]
Geoffrey Canright Telenor R&I 1331 Fornebu Norway
[email protected] Madalena Chaves COMORE, INRIA 2004 Route des Lucioles, BP 93 06902 Sophia-Antipolis France
[email protected] Monojit Choudhury Microsoft Research India Sadashivnagar Bangalore 560080 India
[email protected] Raissa M. D’Souza Department of Mechanical and Aeronautical Engineering Center for Computational Science and Engineering University of California Davis, CA 95616 USA
[email protected] Thomas Eissing Bayer Technologies Services GmbH PT-AS Systems Biology 51368 Leverkusen Germany [email protected] Kenth Engø-Monsen Telenor R&I 1331 Fornebu Norway
[email protected] Vladimir Filkov Department of Computer Science University of California Davis, CA 95616 USA
[email protected] Gourab Ghoshal Department of Physics, and Michigan Center for Theoretical Physics University of Michigan Ann Arbor MI, 48109 USA
[email protected] Petter Holme Department of Physics Umeå University 90187 Umeå Sweden
[email protected] Mogens H. Jensen Center for Models of Life Niels Bohr Institute Blegdamsvej 17 2100 Copenhagen Denmark
[email protected] Jürgen Jost Max Planck Institute for Mathematics in the Sciences Inselstr. 22 04103 Leipzig Germany Santa Fe Institute Santa Fe, NM 87501 USA
[email protected]
Sandeep Krishna Center for Models of Life Niels Bohr Institute Blegdamsvej 17 2100 Copenhagen Denmark
[email protected] Sergi Lozano ETH Zürich Swiss Federal Institute of Technology UNO D11 Universitätstr. 41 8092 Zürich Switzerland
[email protected] Bivas Mitra Department of Computer Science and Engineering Indian Institute of Technology Kharagpur 721302 India
[email protected]
Uwe Quasthoff Institute for Computer Science NLP Department University of Leipzig Johannisgasse 26 04103 Leipzig Germany [email protected] Valencia P. Remple BC Centre for Disease Control Epidemiology University of British Columbia Vancouver, BC V5Z 4R4 Canada
[email protected] Zachary M. Saul Department of Computer Science University of California Davis, CA 95616 USA
[email protected] Animesh Mukherjee Department of Computer Science and Engineering Indian Institute of Technology Kharagpur 721302 India
[email protected] Sitabhra Sinha The Institute of Mathematical Sciences CIT Campus Taramani Chennai 600113 India
[email protected] Fernando Peruani Service de Physique de l’Etat Condensé (SPEC/CEA) and Complex System Institute Paris Ile-de-France (ISC-PIF) F-75005, Paris France
[email protected] Somdatta Sinha Mathematical Modelling and Computational Biology Group Centre for Cellular and Molecular Biology, CSIR Hyderabad 500007 India
[email protected]
Kim Sneppen Center for Models of Life Niels Bohr Institute Blegdamsvej 17 2100 Copenhagen Denmark
[email protected] Mina Teicher Interdisciplinary Brain Research Center Bar Ilan University Ramat Gan 52900 Israel
[email protected] Lewi Stone Biomathematics Unit Faculty of Life Sciences Tel Aviv University Ramat Aviv 69978 Israel
[email protected] Haoran Wen Department of Mechanical and Aeronautical Engineering Center for Computational Science and Engineering University of California Davis, CA 95616, USA
[email protected]

From Network Structure to Dynamics and Back Again: Relating Dynamical Stability and Connection Topology in Biological Complex Systems

Sitabhra Sinha
The Institute of Mathematical Sciences, CIT Campus, Taramani, Chennai 600113, India; [email protected]

1 Introduction

To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.
– William Blake, Auguries of Innocence

Like Blake, physicists look for universal principles that are valid across many different systems, often spanning several length or time scales. While the domain of physical systems has often offered examples of such widely applicable “laws,” biological phenomena tended to be, until quite recently, less fertile in terms of generating similar universalities, with the notable exception of allometric scaling relations [20]. However, this situation has changed with the study of complex networks emerging into prominence. Such systems comprise a large number of nodes (or elements) linked with each other according to specific connection topologies, and are seen to occur widely across the biological, social and technological worlds [4, 9, 16]. Examples range from the intra-cellular signaling system which consists of different kinds of molecules affecting each other via enzymatic reactions, to the internet composed of servers around the world which exchange enormous quantities of information packets regularly, and food webs which link, via trophic relations, large numbers of inter-dependent species. While the existence of complex networks in various domains had been known for some time, the recent excitement among physicists working on such systems has to do with the discovery of certain universal principles among systems which had hitherto been considered very different from each other.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_1, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
Reflecting the development of the modern theory of critical phenomena, the rise of physics of complex networks has been driven by the simultaneous occurrence of detailed empirical studies of extremely large networks that were made possible by the advent of affordable high-power computing and the development of statistical mechanics tools to analyze the new network models. Prior to these developments, the networks that were studied by physicists belonged to either the class of (i) regular networks, defined on geometrical lattices, where each node interacted with all the neighboring nodes belonging to a specified neighborhood, or (ii) random networks, where any pair of nodes had a fixed probability of being linked, i.e., interacting with each other. The first work that focused public attention on the new network approach presented a class of network models that were neither regular nor random, but exhibited properties of both [28]. Such small world networks, as they were referred to, exhibited high clustering (with nodes sharing a common neighbor having a higher probability of being connected to each other than to other nodes) and a very low average path length (where the path length between any two nodes is defined as the shortest number of connected nodes one has to go through in order to reach one node starting from the other). As the former property characterized a regular network, while the latter was typical for a random network, this new class of networks was somehow intermediate between the extremes of the two well-known network models, which was manifest in their construction procedure (Fig. 1). Several networks occurring in reality, in particular, the power grid, the actor collaboration network and the neural connection patterns of the C. elegans worm, were shown to have the small-world property. Later, other examples were added to this list, including the network of co-active functional brain areas [1] and the Indian railway system [21]. 
Very soon afterwards, it was discovered that the frequency distribution of node degree (i.e., the number of links a node has) exhibits a power-law scaling form for a large variety of systems, including the world wide web [3].
Fig. 1. Constructing a small-world network on a 2-dimensional square lattice substrate. Starting from a regular network (left) where each node is connected to its nearest and next-nearest neighbors, a fraction p of the links are rewired among randomly chosen pairs of nodes. When all the links are rewired, i.e., p = 1, the system is identical to a random network (right). For small p, the resulting network (center) still retains the local properties of the regular network (e.g., high clustering), while exhibiting global properties of a random network (e.g., short average path length).
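The construction in Fig. 1 can be illustrated with a short sketch. It is not the procedure of Ref. [28] verbatim: it assumes a one-dimensional ring lattice instead of the 2-dimensional square lattice of the figure, and a rewiring rule that keeps one end of each link fixed while choosing a new random partner. All function names are illustrative. The average path length is taken over reachable pairs only, since heavy rewiring can disconnect a small network.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Regular network: each node is linked to its k nearest neighbours on either side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def rewire(adj, p, rng):
    """Rewire each link with probability p (in place): keep end i, pick a new random partner."""
    nodes = list(adj)
    links = [(i, j) for i in nodes for j in adj[i] if i < j]
    for i, j in links:
        if rng.random() < p:
            # exclude self-links and links that already exist
            candidates = [m for m in nodes if m != i and m not in adj[i]]
            if candidates:
                new = rng.choice(candidates)
                adj[i].discard(j)
                adj[j].discard(i)
                adj[i].add(new)
                adj[new].add(i)
    return adj


def clustering(adj):
    """Average over nodes of the fraction of neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute zero
        linked = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (breadth-first search)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.0, 0.1, 1.0):
        net = rewire(ring_lattice(200, 3), p, rng)
        print(p, round(clustering(net), 3), round(average_path_length(net), 3))
```

Running the loop for a few values of p between 0 and 1 is one way to see the intermediate regime described in the caption, where clustering is still close to that of the lattice while the path length has already dropped.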
This further underlined the fact that most networks occurring in reality are neither regular (in which case the degree distribution would be close to a delta function) nor random (which has a Poisson degree distribution), as for both cases the probability of having a node with large degree (i.e., a hub) would be significantly smaller than that indicated by the power-law tail of empirically obtained degree distributions. In addition, it was observed that there exist non-trivial degree correlations among linked pairs of nodes. For example, a network where nodes with high degree tend to preferentially connect with other high degree nodes is said to show assortative mixing [15]. On the other hand, in a disassortative network, nodes with a large number of links prefer to connect with nodes having low degree. Empirical studies indicate that most biological and technological networks are disassortative, while social networks tend to be assortative [16]. As assortative mixing promotes percolation and makes a network more robust to vertex removal, it may be hard to understand why natural evolution in the biological world has favored disassortativity. However, in a recent study, we have shown that when one considers the stability of dynamical states of a network, disassortative networks would tend to be more robust, and this may be one of the reasons why they are preferred [6]. This brings us to the thrust of recent work in the area of complex networks which has shifted from the initial focus on purely structural aspects of the connection topology to the role such features play in determining the dynamical processes defined on a network [27]. Over the past few years, much effort has been made to understand not only how structure affects dynamics, and hence function, in a network, but also the reverse problem of how functional criteria, such as the need for dynamical stability, can constrain the topological properties of a network. 
In this chapter, some of the principal results obtained by our group will be briefly described. The goal of our research program is to understand the evolution of robust yet complex biological structures, viz., networks occurring in reality that are stable against perturbations and, yet, which can adapt to a changing environment.
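The notion of assortative and disassortative mixing introduced above can be made concrete with a small sketch. It assumes that mixing is quantified by the Pearson correlation between the degrees found at the two ends of each link, which is one common way to define an assortativity coefficient; the function name and the toy networks are illustrative and not taken from the chapter.

```python
def degree_assortativity(links):
    """Pearson correlation between the degrees at the two ends of each undirected link."""
    degree = {}
    for u, v in links:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # count every link in both directions so that the measure is symmetric
    xs, ys = [], []
    for u, v in links:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return float("nan")  # undefined when all link ends have the same degree
    return cov / var


if __name__ == "__main__":
    star = [(0, 1), (0, 2), (0, 3)]  # a hub linked only to low-degree nodes
    print(degree_assortativity(star))
```

A positive value indicates that high-degree nodes tend to link to other high-degree nodes (assortative), while a negative value indicates that hubs preferentially link to low-degree nodes (disassortative), as in the star example.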
2 Biological Networks: Some Examples Across Length Scales

Before describing our results, which are applicable to a wide range of networks, we provide motivation for our general approach by briefly discussing in this section a few examples of biological networks. Although they span an enormous range of length scales, from ∼10^-8 m in the case of protein contact networks to ∼10^5 m in the case of ecological interaction networks, they are often subject to similar constraints and may share common structural and dynamical properties. Questions about networks in one domain may often have answers and ramifications in another domain.
Molecular scale: protein contact network. Protein structure, viewed as a network of non-covalent connections between the constituent amino acids, is one of the smallest length scale networks in the natural world. Its nodes are the Cα atoms of each amino acid, and their interaction strength is determined by their proximity to each other. Two nodes are considered to be linked if the Euclidean distance between them (in 3-dimensional space) is less than a cutoff value d_c, usually between 8–14 Å, which is the relevant distance for non-covalent interactions. Figure 2 shows the KirBac1.1 protein, which belongs to the family of potassium ion channels involved in transmission of inward rectifying current across a cellular membrane [13]. The protein consists of four identical subunits spanning the membrane and intra-cellular regions. The corresponding protein contact network (PCN) manifests the existence of the identical subunits in the approximately block diagonal structure of the adjacency matrix. In addition, each of these four blocks can be divided into two modules, corresponding approximately to the membrane and intra-cellular regions. It is easy to see that the PCN shares the features of a small-world network, with the majority of connections between spatially neighboring nodes, although there are a few long-range connections. This small-world property of PCNs for different protein molecules has indeed been noted several times in the literature (see, e.g., Ref. [2]). This is probably not very surprising, given that it is also true for a randomly folded polymer. However, in addition, the PCN adjacency matrix shows a modular structure, with a majority of connections occurring between nodes belonging to the same module. This is a feature not seen in conventional models of small-world networks (e.g., the Watts–Strogatz model [28]). It is all the more intriguing, as we have recently shown that modular networks (whatever the connection topology of individual modules) exhibit the small-world properties of high clustering and low average path length [18]. To identify whether the existence of modules indeed has a significant effect on protein dynamics (e.g., during folding), we look at the spectral properties of the Laplacian matrix L (footnote 1), defined as L_ii = k_i, where k_i is the degree of node i, and L_ij = −1 if nodes i and j are connected and 0 otherwise. The eigenvector for the smallest eigenvalue (= 0), c^(1), corresponds to the time-invariant properties of the system and has uniform contribution from all components. The next few smallest eigenvalues dominate the time-dependent behavior of the protein and show a relatively large spectral gap with the bulk of the eigenvalue spectrum. This indicates the existence of very distinct time scales in the protein dynamics which approximately correspond to the inter- and intra-modular modes of motion. As we shall see, the occurrence of modular structures in complex networks and their effect on dynamics is not just confined to PCNs but appears in many other biological networks.

Fig. 2. Structure of the KirBac1.1 protein (left), which comprises four identical subunits spanning the membrane and intra-cellular regions [13]. The PCN is constructed by considering a cutoff distance of d_c = 12 Å, and its adjacency matrix is shown for the entire network (right). Each of the four blocks corresponding to a subunit shows a clear partition into membrane and intra-cellular compartments, indicating a modular structure.

Intra-cellular scale: signaling network. Signal transduction pathways, through which a cell responds appropriately to a signal or stimulus, involve ordered sequences of biochemical reactions carried out by enzymes inside the cell. One of the most commonly observed classes of enzymes in intra-cellular signaling is that of kinases, which activate target molecules (usually proteins) by transferring phosphate groups from energy donor molecules such as ATP to the targets. This process of phosphorylation is mirrored by the reverse process of deactivation by phosphatases through dephosphorylation.
Such reaction cascades are activated by second messengers (e.g., cyclic AMP or calcium ions) and may last for a few minutes, with the number of kinase proteins and other molecules involved in the process increasing with every reaction step away from the initial stimulation. Thus, such a signaling cascade can result in a large response for a relatively low-amplitude signal. Research over the past decade has, however, shown that the classical picture of almost isolated cascades linking a unique signal to a specific response does not explain many experimental results. The adaptability of intra-cellular signaling is now thought to be a result of multiple signaling pathways interacting with one another to form complex networks. In this picture, complexity arises from the large number of components, many of which have partially overlapping functions, from the large number of links (through enzymatic reactions) among components and from the spatial relationship between the components [29]. Figure 3 shows a small fraction of the signaling network downstream of the B-cell antigen receptor (BCR) involved in immune response. As the breakdown of communication in this network can lead to disease (a fact that may be utilized by infectious agents for proliferation), it is of obvious importance to understand the mechanisms by which the network allows the cell response to be sensitive to different stimuli and yet to be robust in the presence of intra-cellular noise. With this in mind, the time evolution
¹ The Laplacian matrix is also referred to as the Kirchhoff matrix (e.g., see Ref. [10]).
S. Sinha
[Fig. 3 image. Node labels: IgG receptor, Igα, Igβ, Syk, Pyk2, PI3K, Lyn, PIP2, PIP3, Shc, Btk, BLNK, Grb2, SOS, Rac, DAG, MEKK, Raf-1, MEK 1/2, Erk 1/2, PDK1, PLCg2, Vav, MKK 4/7, MKK 3/4/6, Jnk 1/2, p38, IP3, PKC, IKK, IκB, Akt, Ca2+, CaMK2, NFAT, Bad, Bcl2.]
Fig. 3. A subset of the signal transduction network of the BCR [12]. The kinases are represented by squares, while other molecules (such as second messengers and adapters) are depicted as circles.
of the activity (i.e., phosphorylation) of about 20 signaling molecules in this network was recorded in a recent experiment by Kumar et al. [12]. Apart from observing the activation profiles under normal conditions, the network was also subjected to a series of perturbations by serially blocking each of these molecules from activating any of the other molecules in the network. The resulting experimental data, capturing the behavior of these molecules under 21 different conditions, enabled the detection of correlations between the activity of these molecules. This showed that the existing picture of interactions (Fig. 3) is grossly inadequate in explaining these correlations, e.g., the fact that p38 kinase seems to influence the activation of a majority of the other molecules, although it occurs at the end of a particular pathway. The results suggest that the signaling network is, in fact, a far more densely connected system than had been previously suspected. It also raises the question of how certain signals can elicit very specific responses, without significant risk of cross-talk between interacting pathways. This brings us to the issue of whether functional modules can exist in networks, such that by using positive and negative interactions one can channel information from the stimulus to the response along specific subnetworks only. Inter-cellular scale: neuronal network. The previous question is of importance not only for information processing within a cell, but also between cells. The most important example of the latter process is, of course, the networks of neurons occurring in the brain. As the nervous system of the nematode C. elegans comprising 302 neurons has been completely mapped out (in terms of the positions of the neurons, as well as all their interconnections), it provides a model system for studying these issues. 
We have recently analyzed the connection topology of the non-pharyngeal portion of the nervous system to which the majority of the neurons (280) belong [7]. One of the striking
observations is that many of the sensory neurons belonging to different modalities, viz., chemosensation, mechanosensation, etc., send signals to the same set of densely connected interneurons which forms the innermost core of the nervous system. Subsequently, signals are sent from these interneurons to specific motor neurons which generate appropriate muscle response, e.g., moving along a chemical gradient, egg laying, etc. It is vital that the signals coming from different sensory neurons to the same interneurons should not interfere with each other, as this may result in activation of the incorrect motor response. A preliminary investigation of a dynamical model for the neuronal network shows that a complex set of excitatory and inhibitory links between the interneurons manages to achieve segregation of the different functional circuits. This means that, e.g., a mechanical tap signal will not elicit egg laying, even if the tap withdrawal circuit shares many common interneurons with the egg-laying circuit. Even more interesting is the fact that such functional modules do not need the existence of structural modules in the underlying networks. It underscores the importance of looking at the nature of the interactions, which can create complicated control mechanisms to prevent cross-talk and enable robust response in the presence of environmental noise.

Inter-organism scale: epidemic propagation network. At the scale of individual organisms, such as human beings, one of the most widely studied networks is that which leads to propagation of epidemics. The ubiquity of small-world networks in nature implies that some of the classic theories of epidemiological transmission, based on assumptions of random connections, may need to be reviewed. In particular, the global spread of diseases like SARS shows that even a few long-range links can drastically enhance the propagation of epidemics [8].
This has led to a series of studies of different disease propagation models on Watts–Strogatz or related network models (e.g., see Ref. [19]). However, as mentioned above, all the structural features of such networks are also shared by modular networks, although modular networks have very different dynamical properties. We have recently shown that while Watts–Strogatz networks have a continuous range of time scales, modular networks exhibit very distinct time scales that are related to intra- and inter-modular events [18]. Thus, an effective strategy to counter the spread of epidemics must take into account a detailed knowledge of such structures in the social network of contagious and susceptible individuals.

Inter-species scale: food webs. Possibly the largest (in terms of length scale) biological networks on earth are those of interactions between different species in an ecosystem. While general ecological networks consist of all possible links, such as cooperation and competition, food webs describe the trophic relations, i.e., between predator and prey. A food web is a directed network where the nodes are the various species, with prey connected by arrows to predators, the direction of the arrow indicating the flow of biomass. The links are usually weighted to represent the amount of energy that is transferred.
It is in the context of these networks that questions first arose on the connection between the structural properties of a network and the stability of its dynamical behavior (see Section 4). Indeed, one not only asks what kind of structures allow complex networks to be stable against ever-present perturbations, but also how the requirement to be robust constrains the kind of structures such networks can evolve. To stress the universality of the questions asked by physicists about networks, we note that, like many other networks, food webs also have been shown to have a modular structure, with species in each module interacting between themselves strongly and only weakly with other species [11]. As in the other systems discussed earlier, the role that modularity plays in stabilizing the dynamics of ecosystems can be seen as a specific instance of a much more general question. Having discussed a few instances of how universal principles about networks can appear by investigating very different systems in the biological world, we now describe certain results of our studies on general network models. However, we stress that each of these results has relevance to problems appearing in the context of specific biological systems.
3 From Structure to Dynamics
The role that the connection topology of a network plays in the nature of its dynamics has been extensively investigated for spin models occurring in physics. In fact, such systems had been explored for a long time prior to the recent interest in complex networks, and many results are known regarding ordering transitions in both regular and random structures. More recently, it has been shown that, for partial random rewiring in a system of sufficiently large size, any finite value of p (the rewiring probability) causes a transition to the small-world regime, with the Ising model defined on such a network exhibiting a finite temperature ferromagnetic phase transition [5]. However, spin models are extremely restricted in their dynamical repertoire; therefore, researchers have looked at the effect of introducing other kinds of node dynamics in such network structures, e.g., oscillators. Motivated by recent observations that the brain may have a connection structure with small-world properties (see e.g., Ref. [1]), we have examined the effect of long-range connections (i.e., non-local diffusion) over an otherwise regular network of nodes with links between nearest neighbors on a square lattice [25]. The dynamics considered is that of the excitable type, with the variable having a single stable state and a threshold. If a perturbation causes the system variable to exceed the threshold, we see a rapid transition to a metastable excited state followed by a slow recovery phase when the system gradually converges to the stable state. As a result of coupling the dynamics of individual nodes through diffusive coupling, various spatial patterns (which may be temporally varying) are observed. Such dynamics are commonly observed in a large variety of biological
[Fig. 4 image. Panel titles: Spatial Patterns, Temporal Patterns, Burn-out. Axes: time, activity, and rewiring probability p from 0 to 1, with $p_c^l$ and $p_c^u$ marked on the p axis.]
Fig. 4. Schematic diagram indicating the different dynamical regimes in a 2-dimensional small-world excitable medium as a function of the rewiring probability, p. For low p, the system exhibits spatial patterns characterized by single or multiple spirals. At $p = p_c^l$, there is a transition to a state dominated by temporally periodic patterns that are spatially relatively homogeneous. Above $p = p_c^u$, all activity ceases after a brief transient.
cells such as neurons and cardiac myocytes, as well as in non-linear chemical systems such as the Belousov–Zhabotinsky reaction. In our simulations, by varying the probability of long-range connections, p, we have observed three categories of patterns. For $0 < p < p_c^l$, after an initial transient period where multiple coexisting circular waves are observed, the system is eventually spanned by a single or multiple rotating spiral waves whose temporal behavior is characterized by a flat power spectral density. At $p = p_c^l$, the system undergoes a transition from a regime with temporally irregular, spatial patterns to one with spatially homogeneous, temporally periodic patterns (Fig. 4). The latter behavior occurs over the range $p_c^l < p < p_c^u$ as a result of the increased number of long-range connections, whereby a large fraction of the system is synchronously active and subsequently goes into the recovery phase. Beyond the upper critical value $p_c^u$, there is no longer any self-sustained activity in the system, as all nodes converge to the stable state. The patterns in each regime were found to be extremely robust against even large perturbations or disorder in the system. Our model explains several hitherto unexplained observations in experimental systems where non-local diffusion had been implemented [26]. In addition, by identifying the long-range connections with those made by neurons and the regular network with that formed by the glial cells in the brain, our results provide a possible explanation of why evolution may have preferred to increase the number of glial cells over neurons (with a ratio of more than 10:1 for certain parts of the human brain) in order to maintain robust dynamical patterns as brain size increased. It also points towards a possible functional role of the small-world brain topology in the occurrence of dynamical diseases such as epileptic seizures and bursts. More generally, our work shows
how non-standard network topologies can influence system dynamics by generating different kinds of spatio-temporal patterns depending on the extent of non-local diffusion.
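The setting of this section (a square lattice with nearest-neighbor links, a small number of long-range links, and excitable node dynamics) can be mimicked with a much simpler stand-in. The sketch below swaps the continuous excitable model of Ref. [25] for a discrete Greenberg–Hastings-type cellular automaton, and adds shortcuts rather than rewiring existing links; both choices, along with every name and parameter value, are assumptions made for illustration. It is not expected to reproduce the three regimes of Fig. 4.

```python
import numpy as np

def small_world_lattice(size, p, rng):
    """size x size lattice with nearest-neighbor links plus random long-range links.
    Each node gains one shortcut with probability p (an assumed construction)."""
    n = size * size
    adj = np.zeros((n, n), dtype=bool)
    idx = np.arange(n).reshape(size, size)
    # Nearest-neighbor links, vertical then horizontal, with non-periodic boundaries.
    for a, b in ((idx[:-1, :], idx[1:, :]), (idx[:, :-1], idx[:, 1:])):
        adj[a.ravel(), b.ravel()] = True
        adj[b.ravel(), a.ravel()] = True
    for i in np.flatnonzero(rng.random(n) < p):
        j = int(rng.integers(n))
        if j != i:
            adj[i, j] = adj[j, i] = True
    return adj

def simulate(size=20, p=0.05, refractory=4, steps=100, seed=0):
    """Return the fraction of excited nodes at each time step."""
    rng = np.random.default_rng(seed)
    adj = small_world_lattice(size, p, rng).astype(int)
    n_states = refractory + 2  # 0 = rest, 1 = excited, 2..refractory+1 = recovering
    state = rng.integers(n_states, size=size * size)  # random initial condition
    activity = np.empty(steps)
    for t in range(steps):
        excited = (state == 1).astype(int)
        activity[t] = excited.mean()
        has_excited_neighbor = adj @ excited > 0
        advance = (state + 1) % n_states  # excited and recovering nodes step forward
        state = np.where(state > 0, advance,
                         np.where(has_excited_neighbor, 1, 0))
    return activity
```

Calling `simulate` for a range of `p` values and plotting the returned activity traces gives a rough feel for how added long-range links change the temporal pattern in this toy automaton; any quantitative comparison with the text would require the original continuous model.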
4 From Dynamics to Structure
An important functional criterion for most networks occurring in nature and society is the stability of their dynamical states. While earlier studies have concentrated on the robustness of the network when subjected to structural perturbations (e.g., removal of nodes or links), we have looked at the effect of perturbations on the steady states of network dynamics. In particular, the question we ask is whether networks become more susceptible to small perturbations as their size (i.e., number of nodes N) increases, the connections between the nodes become denser (i.e., increased connection probability C) and the average strength of interaction (s) increases. This is related to a decades-old controversy, often referred to as the stability-complexity debate. In the early 1970s, May [14] showed that for a model ecological network, where species are assumed to interact with a randomly chosen subset of all other species, an arbitrarily chosen equilibrium state of the system becomes unstable if any of the parameters determining the network's complexity (e.g., N, C or s) is increased. In fact, by using certain results of random matrix theory, the critical condition for the stability of the network was shown to be $NCs^2 < 1$ (May–Wigner theorem) [14]. This flew against common wisdom, gleaned from a large number of empirical studies as well as naive reasoning, which dictated that increased diversity and/or stronger interactions between species result in more robust ecosystems. Thus, ever since the publication of these results, there have been attempts to understand the reason behind the apparent paradox, especially as this result relates not only to ecological systems but extends to all dynamical networks for which the stability of equilibria has functional significance, e.g., in intra-cellular biochemical networks where the concentrations of different molecules need to be maintained within physiological levels.
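The stability condition $NCs^2 < 1$ can be checked numerically with a short sketch. The setup below follows the usual textbook form of May's model as an assumption: off-diagonal interaction strengths are drawn from a zero-mean Gaussian with standard deviation s, each present with probability C, and every diagonal entry is set to -1. The function names and the number of trials are hypothetical.

```python
import numpy as np

def random_community_matrix(n, c, s, rng):
    """Random interaction matrix: each off-diagonal entry is nonzero with
    probability c and drawn from N(0, s^2); the diagonal is -1 (self-regulation)."""
    present = rng.random((n, n)) < c
    j = np.where(present, rng.normal(0.0, s, size=(n, n)), 0.0)
    np.fill_diagonal(j, -1.0)
    return j

def fraction_stable(n, c, s, trials=20, seed=0):
    """Fraction of random matrices whose eigenvalues all have negative real part."""
    rng = np.random.default_rng(seed)
    stable = 0
    for _ in range(trials):
        eig = np.linalg.eigvals(random_community_matrix(n, c, s, rng))
        stable += int(eig.real.max() < 0.0)
    return stable / trials

# With n = 100 and c = 0.2 the predicted threshold is s = 1 / sqrt(n * c), about 0.22.
below = fraction_stable(100, 0.2, 0.10)  # n * c * s**2 = 0.2
above = fraction_stable(100, 0.2, 0.50)  # n * c * s**2 = 5.0
```

Sweeping `s` across the predicted threshold and plotting `fraction_stable` shows how sharp the transition is at a given size; the sharpening with increasing `n` is what makes the criterion useful as an asymptotic statement.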
Two of the common charges leveled against the theoretical model of May are that (i) it assumes the interaction network to be random, whereas naturally occurring networks may have certain kinds of structures, and (ii) the linear stability analysis assumes the existence of simple steady states (viz., fixed point attractors), which may not be the case for real systems that may either be oscillating or in a chaotic state. In our work on dynamical systems defined on networks, we have tried to address both of these lines of criticism (see Ref. [31] for a recent discussion of our results from the perspective of ecosystem robustness). For example, focusing on the question of the inadequacy of linear stability analysis, we have considered networks with non-trivial dynamics at the nodes, spanning the range from simple steady states to periodic oscillation and fully developed chaos, and measured the robustness of the dynamics with respect to variations in N, C and s [23, 24].
Each node in our model network has a dynamical variable associated with it, which evolves according to a well-known class of difference equations commonly used for modeling population dynamics. By varying a non-linear parameter, the nature of the dynamics (i.e., whether it converges to a steady state or undergoes chaotic fluctuations) at each node can be controlled. However, in the absence of coupling, each node will always have a finite, positive value for its dynamical variable. When coupled in a network (initially in a random fashion) with links that can have either positive or negative weights, it is possible that as a result of dynamical fluctuations, the variable for some nodes can become negative or zero. As this implies the absence of any activity, the corresponding node is considered to be "extinct" and thus isolated from the network. This procedure may create further fluctuations and cause more nodes to become "extinct," resulting in gradual reduction of the size of the network (Fig. 5). The final asymptotic size of the network, relative to its initial size, is a measure of its robustness: the more robust network is one with a higher fraction of nodes having persistent activity. Analysis showed that the network robustness (as measured by the above global criterion) not only decreased with N, C and s, as expected from local stability analysis, but actually matched the May–Wigner theorem quantitatively [23]. In addition, the asymptotic network exhibited robust macroscopic features: (a) the number of persistently active nodes was independent of the initial network size, and (b) the asymptotic number of links between these persistently active nodes was independent of both the initial size and connectivity [24]. This is all the more surprising, as the removal of nodes (and hence, links) is not guided by any explicit fitness criterion, but rather emerges naturally from the nodal dynamics through fluctuations of individual node properties.
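The extinction procedure just described can be sketched as follows. The text does not spell out the equations here, so the sketch makes two explicit assumptions: the local map is taken to be the exponential logistic (Ricker) map $f(x) = x\,e^{r(1-x)}$, and the coupling multiplies each node's variable by $1 + \sum_j J_{ij} x_j$ before the map is applied. All names and parameter values are hypothetical.

```python
import numpy as np

def ricker(x, r):
    """Exponential logistic map, assumed here as the local population dynamics."""
    return x * np.exp(r * (1.0 - x))

def surviving_fraction(n=100, c=0.1, s=0.1, r=4.0, steps=2000, seed=0):
    """Fraction of nodes that still have positive activity after `steps` iterations."""
    rng = np.random.default_rng(seed)
    # Random signed couplings: present with probability c, strength drawn from N(0, s^2).
    j = np.where(rng.random((n, n)) < c, rng.normal(0.0, s, (n, n)), 0.0)
    np.fill_diagonal(j, 0.0)
    x = rng.uniform(0.1, 1.0, n)
    for _ in range(steps):
        arg = x * (1.0 + j @ x)
        # A node whose variable becomes zero or negative is "extinct": it is set to
        # zero, stays at zero, and no longer contributes to its neighbors' coupling.
        x = np.where(arg > 0.0, ricker(np.maximum(arg, 0.0), r), 0.0)
    return np.count_nonzero(x > 0.0) / n

uncoupled = surviving_fraction(s=0.0)  # no coupling, so no node goes extinct
coupled = surviving_fraction(s=0.1)
```

Repeating the run over a grid of `n`, `c` and `s` values and recording `surviving_fraction` is one way to explore how this global robustness measure varies with network complexity under the stated assumptions.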
Our results imply that asymptotically
Fig. 5. Evolution of a network with non-trivial dynamics at the nodes. The initial (left) and final asymptotic (right) networks are shown. Only nodes having persistent activity are connected to the network. The figures were drawn using Pajek software.
active networks are non-extensive: when two networks of size N are coupled to each other (with the same connectance as the individual networks), although the resulting network initially has a size 2N, the ensuing dynamical fluctuations will reduce its size to N. This implies that simply increasing the number of redundant elements is not a good strategy for designing robust systems. We have also looked at the effect of empirically reported structures, such as small-world connection topology and scale-free degree distribution, on the dynamical stability of networks. Our results indicate that, in general, introducing such structural features does not alter the outcome expected from the May–Wigner theorem [6, 22]. However, these details can indeed affect the nature of the stability-instability transition; for example, the transition exhibits a cross-over from being very sharp (resembling a first-order phase transition) for a random network to a more gradual change as the network becomes more regular in the small-world regime [22].
5 Evolution of Robust Networks
This brings us to the issue of how complex networks can be stable at all, given that the May–Wigner theorem seems to hold even for networks that have structures similar to those seen in reality and where non-trivial dynamical situations have also been considered. The solution to this apparent paradox lies in the observation that most networks that we see around us did not occur fully formed but emerged through a process of gradual evolution, where stability with respect to dynamical fluctuations is likely to be one of the key criteria for survival. In earlier work, we have shown that a simple model, where nodes are gradually added to or removed from a network according to whether this results in a dynamically stable network or not, leads to a non-equilibrium steady state in which the network is extremely robust [30]. The robustness is manifested by increased resistance and resilience, as well as decreased probability of large extinction cascades, when the network size (i.e., the system diversity) is increased. Thus, our results reconcile the apparently contradictory conclusions of the May–Wigner theorem and a large number of empirical studies. More recently, we have shown that model networks can evolve many of the observed structural features seen among networks in the natural world, by taking into account the fact that the majority of such systems must optimize between several (often conflicting) constraints, which may be structural as well as dynamical in nature. In particular, most networks need to have high communication efficiency (i.e., low average path length) and low connectivity (to reduce the resource cost involved in maintaining many links) while being stable with respect to dynamical perturbations. If a network satisfied only the first two constraints, the optimal structure would have been that of a star (Fig. 6).
Even if the resource cost constraint is somewhat relaxed, so that the network can have more links than the minimum necessary to make it
Fig. 6. Networks with (I) star and (II) clustered star connection topologies can form the fundamental building blocks of different types of modular networks. Network configurations with clustered star modules can be constructed by (A) connecting different modules by single undirected links among the hub nodes, or (B) connecting nodes of a module to another module only through the hub node of the latter, or (C) connecting nodes of a module randomly to any node of another module.
connected, the resulting optimal configuration is slightly modified to that of a “clustered” star. However, we note that the dynamical equilibria in such systems would be extremely unstable with respect to small perturbations. This happens because the rate of growth of small perturbations is related to the maximum degree of the network, which, in the case of a star or a clustered star, is almost identical to the system size. It is easy to see that dividing the network into multiple stars, connected to each other, will reduce the maximum degree and hence increase the stability. Indeed, our results show that simultaneous optimization of all three constraints results in networks with modular structure, i.e., subnetworks with a high density of connections within themselves compared to between distinct subnetworks, where each module possesses a prominent hub [17] (see Fig. 6 for possible configurations of such modular networks). As these evolved systems also exhibit heterogeneous degree distribution, our findings have implications for a wide range of systems in the biological and technological worlds where such features have been observed.
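The effect of splitting one star into several linked stars can be illustrated with a small comparison. The sketch below reports the maximum degree and, as an assumed proxy for the growth rate of small perturbations, the largest eigenvalue of the adjacency matrix. The chain of hub-to-hub links is only one of the configurations of Fig. 6, chosen for simplicity, and the function names are hypothetical.

```python
import numpy as np

def star(n):
    """Single star: node 0 is the hub, linked to all n - 1 other nodes."""
    a = np.zeros((n, n))
    a[0, 1:] = a[1:, 0] = 1.0
    return a

def linked_stars(n_modules, module_size):
    """Several stars whose hubs are joined in a chain by single undirected links."""
    n = n_modules * module_size
    a = np.zeros((n, n))
    hubs = [m * module_size for m in range(n_modules)]
    for h in hubs:
        a[h, h + 1:h + module_size] = 1.0   # hub to its own leaves
        a[h + 1:h + module_size, h] = 1.0
    for h1, h2 in zip(hubs, hubs[1:]):      # consecutive hubs
        a[h1, h2] = a[h2, h1] = 1.0
    return a

def largest_eigenvalue(a):
    return np.linalg.eigvalsh(a)[-1]        # eigvalsh returns ascending order

single = star(64)
modular = linked_stars(n_modules=4, module_size=16)
max_degree = (single.sum(axis=1).max(), modular.sum(axis=1).max())
top_eigenvalue = (largest_eigenvalue(single), largest_eigenvalue(modular))
```

For the same number of nodes, the modular arrangement has both a smaller maximum degree and a smaller leading eigenvalue than the single star, which is consistent with the stability argument made above.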
Acknowledgments
I would like to thank my collaborators with whom the work described here has been carried out, in particular, R. K. Pan, S. Sinha, N. Chatterjee, M. Brede, C. C. Wilmers, J. Saramäki and K. Kaski, as well as S. Vemparala, D. Kumar, K. V. S. Rao and B. Saha for helpful discussions.
References
1. Achard, S., Salvador, R., Whitcher, B., Suckling, J., Bullmore, E.: A resilient, low-frequency, small-world human brain functional network with highly connected association cortical hubs. J. Neurosci., 26, 63–72 (2006)
2. Aftabuddin, M., Kundu, S.: Hydrophobic, hydrophilic and charged amino acid networks within protein. Biophys. J., 93, 225–231 (2007)
3. Albert, R., Barabási, A.L.: Emergence of scaling in random networks. Science, 286, 509–512 (1999)
4. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys., 74, 47–97 (2002)
5. Barrat, A., Weigt, M.: On the properties of small-world network models. Eur. Phys. J. B, 13, 547–560 (2000)
6. Brede, M., Sinha, S.: Assortative mixing by degree makes a network more unstable. Arxiv preprint, cond-mat/0507710 (2005)
7. Chatterjee, N., Sinha, S.: Understanding the mind of a worm: Hierarchical network structure underlying nervous system function in C. elegans. Prog. Brain Res., 168, 145–153 (2007)
8. Deem, M.W.: Mathematical adventures in biology. Physics Today, 60(1), 42–47 (2007)
9. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford Univ. Press, Oxford (2003)
10. Haliloglu, T., Bahar, I., Erman, B.: Gaussian dynamics of folded proteins. Phys. Rev. Lett., 79, 3090–3093 (1997)
11. Krause, A.E., Frank, K.A., Mason, D.M., Ulanowicz, R.U., Taylor, W.W.: Compartments revealed in food-web structure. Nature, 426, 282–284 (2003)
12. Kumar, D., Srikanth, R., Ahlfors, H., Lahesmaa, R., Rao, K.V.S.: Capturing cell-fate decisions from the molecular signatures of a receptor-dependent signaling response. Molecular Systems Biology, 3, 150 (2007)
13. Kuo, A., Gulbis, J.M., Antcliff, J.F., Rahman, T., Lowe, E.D., Zimmer, J., Cuthbertson, J., Ashcroft, F.M., Ezaki, T., Doyle, D.A.: Crystal structure of the potassium channel KirBac1.1 in the closed state. Science, 300, 1922–1926 (2003)
14. May, R.M.: Stability and Complexity in Model Ecosystems. Princeton Univ. Press, Princeton (1973)
15. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett., 89, 208701 (2002)
16. Newman, M.E.J.: The structure and function of complex networks. SIAM Review, 45, 167–256 (2003)
17. Pan, R.K., Sinha, S.: Modular networks emerge from multiconstraint optimization. Phys. Rev. E, 76, 045103(R) (2007)
18. Pan, R.K., Sinha, S.: The small world of modular networks. Arxiv preprint, arXiv:0802.3671 (2008)
19. Saramäki, J., Kaski, K.: Modelling development of epidemics with dynamic small-world networks. J. Theor. Biol., 234, 413–421 (2005)
20. Schmidt-Nielsen, K.: Scaling: Why is Animal Size So Important? Cambridge Univ. Press, Cambridge (1984)
21. Sen, P., Dasgupta, S., Chatterjee, A., Sreeram, P.A., Mukherjee, G., Manna, S.S.: Small-world properties of the Indian railway network. Phys. Rev. E, 67, 036106 (2003)
22. Sinha, S.: Complexity vs. stability in small-world networks. Physica A, 346, 147–153 (2005)
23. Sinha, S., Sinha, S.: Evidence of universality for the May-Wigner stability theorem for random networks with local dynamics. Phys. Rev. E, 71, 020902(R) (2005)
24. Sinha, S., Sinha, S.: Robust emergent activity in dynamical networks. Phys. Rev. E, 74, 066117 (2006)
25. Sinha, S., Saramäki, J., Kaski, K.: Emergence of self-sustained patterns in small-world excitable media. Phys. Rev. E, 76, 015101(R) (2007)
26. Steele, A.J., Tinsley, M., Showalter, K.: Spatiotemporal dynamics of networks of excitable nodes. Chaos, 16, 015110 (2006)
27. Strogatz, S.H.: Exploring complex networks. Nature, 410, 268–276 (2001)
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature, 393, 440–442 (1998)
29. Weng, G., Bhalla, U.S., Iyengar, R.: Complexity in biological signaling systems. Science, 284, 92–96 (1999)
30. Wilmers, C.C., Sinha, S., Brede, M.: Examining the effects of species richness on community stability: An assembly model approach. Oikos, 99, 363–367 (2002)
31. Wilmers, C.C.: Understanding ecosystem robustness. Trends Ecol. Evol., 22, 504–506 (2007)
Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis
Madalena Chaves,1 Thomas Eissing,2 and Frank Allgöwer3
1 COMORE, INRIA, 2004 Route des Lucioles, BP 93, 06902 Sophia-Antipolis, France
2 Bayer Technologies Services GmbH, PT-AS Systems Biology, Germany
3 Institute for Systems Theory and Automatic Control, University of Stuttgart, Pfaffenwaldring 9, 70550 Stuttgart, Germany

1 Introduction
Programmed cell death (or apoptosis) has an essential biological function, enabling successful embryonic development, as well as maintenance of a healthy living organism [6]. Apoptosis is a physiological process which enables an organism to remove unwanted or damaged cells. Malfunctioning apoptotic pathways can lead to many diseases, including cancer and inflammatory or immune system related problems. A family of proteins called caspases are primarily responsible for execution of the apoptotic process: basically, in response to appropriate stimuli, initiator caspases (for instance, caspases 8, 9) activate effector caspases (for instance, caspases 3, 7), which will then cleave various cellular substrates to accomplish the cell death process [22]. Nuclear factor κB (NFκB) is a transcription factor for a large group of genes which are involved in several different pathways. For instance, NFκB activates its own inhibitor (IκB) [14] as well as groups of pro-apoptotic and anti-apoptotic genes [21]. Among the latter, NFκB activates transcription of a gene encoding for inhibitor of apoptosis protein (IAP). This protein in turn contributes to downregulate the activity of the caspase cascade which forms the core of the apoptotic pathway [6, 8]. The canonical NFκB pathway is induced, among other stimuli, by the cytokine tumor necrosis factor α (TNFα) [21]. Binding of TNFα to death receptor TNFR1 forms a first complex which eventually activates NFκB. A second complex is later formed, which will activate the initiator caspase 8 [6], and hence activate the apoptotic process. The same signal (TNFα stimulation) thus triggers two parallel but contrary pathways: the pro-apoptotic caspase cascade and the anti-apoptotic NFκB-IκB-IAP pathway. These two pathways, together with the interactions among their components, form a
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_2, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
complex network which shapes the decision on cell survival or initiation of programmed cell death. To contribute to a better understanding of the role of NFκB in the regulation of apoptosis, we propose a qualitative study of this system and its dynamics, based on a discrete (Boolean) model of the complex network. This discrete model closely follows a continuous one, recently developed and studied in [23, 24]. The model integrates the well-known model for the NFκB pathway [17] and the caspase cascade [8]. Boolean models provide a convenient formalism to describe protein and gene networks [25]. The states of the network components (e.g., proteins or messenger RNAs) are characterized as “expressed” or “not expressed” and are represented by logical variables (with values 0 or 1). The interactions among the various components are classified as “inhibition” or “activation” links (these can generally be deduced from gene/protein expression data). Boolean models thus describe the network structure of a system without involving any kinetic details. The qualitative behaviour of a system can be seen as an emergent property of this structure. Boolean models are especially useful in the case of large networks [1, 9], for which kinetic parameters are often unknown, but qualitative properties such as generation of specific gene expression patterns, stability or multistability, and oscillatory modes can be studied. Several methods have been developed for analysis of discrete and qualitative models [2, 5, 7, 13, 26]. Using an approach which combines discrete rules with continuous degradation rates, our model reproduces many of the known properties of the system, notably the oscillatory dynamics that can be induced by the NFκB-IκB negative feedback loop [14, 15, 19]. We explore different configurations for the network structure and predict its effects on the decision between cell survival or apoptosis.
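To make the Boolean formalism concrete, the following toy sketch encodes only the NFκB-IκB negative feedback loop mentioned above, with one activation link and one inhibition link. It is not the model studied in this chapter: the two rules, the synchronous update scheme, and all names are assumptions chosen to show how a logical network can produce oscillations.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]

# Hypothetical two-node rule set: each rule gives a node's next value
# from the current state of the network.
RULES: Dict[str, Callable[[State], bool]] = {
    "NFkB": lambda s: not s["IkB"],  # inhibition link: IkB switches NFkB off
    "IkB": lambda s: s["NFkB"],      # activation link: NFkB induces its inhibitor
}

def step(state: State) -> State:
    """Synchronous update: every node is recomputed from the same current state."""
    return {node: rule(state) for node, rule in RULES.items()}

def trajectory(state: State, n_steps: int) -> List[State]:
    """Return the list of states visited, starting with the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states

traj = trajectory({"NFkB": True, "IkB": False}, 4)
```

Under these rules the network cycles through four states and returns to its starting point, a discrete counterpart of the oscillatory dynamics attributed to the negative feedback loop. Other update schemes (for example asynchronous updates) can give different trajectories for the same rules.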
2 The Model
The network of interactions between the NFκB pathway and the apoptosis signaling cascade to be studied here is shown in Fig. 1. The various components of the network (here messenger RNAs, proteins, or protein complexes) form the set of variables or nodes (Xi, i = 1, . . . , n) of the Boolean model. The system will evolve according to a set of logical rules which are deduced from the interactions or links depicted in the schematic diagram of Fig. 1. The interactions among nodes can be classified as “activation” or “inhibition” links: a directed arrow Xi → Xj means that a high concentration of component Xi activates component Xj, while the symbol Xi ⊣ Xj means that a high concentration of component Xi inhibits Xj. The components in our model and the activation or inhibition links among them are based on existing literature data. For general aspects, the reviews [6, 21] were used. However, some pathways of regulation between the NFκB pathway and the caspase cascade are not yet clear, and more work is needed to understand how these two signaling pathways are interconnected. In this chapter, we aim to investigate and test several possible hypotheses for
Regulation of Apoptosis via the NFκB Pathway: Modeling and Analysis
Fig. 1. Schematic diagram of the NFκB pathway and the caspase cascade (light shaded regions). The oval dark grey shaded region represents the cellular nucleus. Both pathways are activated by binding of TNFα to death receptor TNFR1 (the resulting complex is represented simply by the rectangle TNF). Messenger RNAs are represented by ellipses, while transcription factors, caspases, and other proteins are represented by squares. To study the interconnections between the two pathways, four network variants, based on different combinations of the links A, L, and C, will be analysed and compared (see Table 2).
the combined network structure. We will consider four model variants and try to discriminate between them by comparing our numerical analysis with experimental data from the literature. The four network variants (see Table 2) are based on different combinations of three links (A, L, C in Fig. 1) which have been suggested but are not fully established in the apoptosis literature. The NFκB pathway follows very closely the model presented in [17]. Stimulation of death receptors with TNFα leads first (see, for instance, [6]) to the formation of a complex I (T1 in Fig. 1) which will recruit and activate IκB kinases (IKK). Inhibitor of NFκB, or IκB, acts by binding to NFκB molecules and preventing their transcriptional function. Active IKK (IKKa) phosphorylates IκB, which releases NFκB, thus enabling its translocation to the nucleus and transcription of NFκB-dependent genes, including genes for inhibitor of apoptosis protein (iap), inhibitor of NFκB (iκB), a protein associated with inhibition of complex T2 (flip), and a protein regulating IKK activity (a20) [21]. Transcription of IκB mRNA generates a negative feedback
loop in the NFκB pathway [14, 20], which may lead to oscillatory behaviour in NFκB and IκB concentrations [19]. In a second step, after dissociation of components of complex I from the death receptor, a second complex is formed (T2 in Fig. 1) which will recruit and activate initiator caspase 8 (C8a). As a result of the signaling cascade [8, 22], effector caspase 3 is also activated (C3a). Thus, complex T1 activates the anti-apoptotic pathway and, after a certain delay, complex T2 activates the pro-apoptotic pathway. Two well-documented points of regulation of the apoptotic pathway by NFκB are inhibition of C3a by IAP and regulation of complex T2 by FLIP [6]. Active caspase 8 was found to be negatively regulated by caspase-8 and caspase-10-associated RING proteins (CARPs) [18], which seem to play a role analogous to that of the IAPs, but are less well studied. It was found that CARPs are overexpressed in tumors, and that their suppression leads to restoration of the apoptotic pathway, with the CARP being rapidly cleaved. In addition, it was observed that inhibitors of caspase 3 block CARP cleavage. In our model, we introduced CARP and a pre-complex CARP0, which is inhibited by C3a. Inhibition by C3a is, however, not sufficient to control CARP, and there are probably other regulators. Since CARP plays a role with respect to caspases 8 and 10 similar to that of IAP with respect to caspases 3 and 9 (and in the absence of further details), we assume that the pre-complex CARP0 is also regulated by a product of the NFκB pathway. The points where the caspase cascade influences the NFκB pathway are less well documented. We will use our model to test different hypotheses by studying and comparing the network dynamics for the following cases (see also Table 2): inhibition of IKKa (link L) and/or NFκB (link A) by C3a, or neither of these links present. To obtain the logical rules shown in Table 1, some simplifications of the biological processes were inevitably introduced.
For instance, the bound complex NFκB−IκB (either in the cytoplasm or in the nucleus) was not explicitly considered in the system, but was simply treated as an inhibition effect: the rule for NFκB says that it vanishes whenever IκB is expressed. Thus, any state with NFκB = 0 and IκB = 1 represents in fact a high concentration of bound complex NFκB − IκB, while any state with NFκB = 1 and IκB = 0 represents a high concentration of free NFκB and low concentration of free IκB. To translate our diagram into a set of logical rules, the convergence of two or more arrows (either activation or inhibition) at the same node was always treated as a logical AND, except in three cases: IκB, IAP, and CARP0 . For these proteins, the overall effect was treated as an AND in the presence of TNF stimulation, but treated as an OR in the absence of TNF. These three proteins represent inhibitors whose levels should be stable in the absence of any stimulus [8]: IAP and CARP0 (or CARP) should be effective inhibitors of the caspases, and IκB should be at approximately constant levels to control NFκB transcriptional activity. In contrast, with TNF stimulation, the degradation rates of these proteins can vary and lead to rapid changes in their concentrations (different degradation rates in the presence or absence of TNF
have been observed, notably for bound IκB [20]). For instance, under TNF treatment, the rule for inhibition of NFκB is simplified to IκB+ = [iκB and not IKKa]. Suppose that IKK becomes activated at time t1 , that is IKKa(t1 ) = 1. Then, in the next iteration of the model, the IκB rule implies that IκB will degrade very fast, with IκB(t1 +Δ) = 0. In contrast, in the absence of the TNF stimulus, the rule is IκB+ = [iκB or not IKKa]. If IKK becomes active at time t1 , one has IκB(t1 + Δ) = iκB(t1 ), meaning that IκB is only rapidly degraded if no more of its messenger RNA is available. A similar reasoning justifies the rules for IAP and CARP0 . The rules for these three proteins with inhibiting roles reflect the fact that their degradation rates, and hence turnover, can be much faster in response to TNF stimulation.
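The switch between AND and OR in the IκB rule can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name ikb_next and its argument names are not from the chapter, and the argument stimulated stands for the T1 term that gates the rule in Table 1.

```python
def ikb_next(stimulated, ikb_mrna, ikka):
    """Next discrete value of IkB under the rule quoted in the text.

    stimulated : True while the TNF signal (T1 in Table 1) is on
    ikb_mrna   : current value of the iκB messenger RNA node
    ikka       : current value of active IKK
    """
    if stimulated:
        # AND form: active IKK removes IkB even if mRNA is present.
        return ikb_mrna and not ikka
    # OR form: IkB disappears only if IKK is active AND no mRNA is left.
    return ikb_mrna or not ikka


# With IKK active, compare the two regimes for both mRNA values.
for stimulated in (True, False):
    for ikb_mrna in (False, True):
        print(stimulated, ikb_mrna, ikb_next(stimulated, ikb_mrna, ikka=True))
```

With IKKa = 1 the stimulated rule returns 0 for either mRNA value, while the unstimulated rule simply copies the mRNA value, which is the behaviour described for IκB(t1 + Δ) above.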
3 Analysis of Boolean Models
Boolean networks are a representation of a system, consisting of a set of n variables or nodes X = (X1, . . . , Xn), together with a set of logical rules (Fi(X), i = 1, . . . , n) describing the evolution of the system from the current state (Xi at time t) to the next state (Xi at time t + Δ). The variables or nodes take values in the discrete set {0, 1}, where 1 (resp., 0) denotes the “expressed” (resp., “not expressed”) state of the node. The associated rules are typically a composition of logical OR and AND functions, which can be determined from gene/protein expression patterns (from Western blots or microarray data, for instance). The set of rules Fi given in Table 1 for the NFκB pathway and the caspase cascade is a translation of the diagram shown in Fig. 1. The temporal evolution of the system, X(t), t ∈ (0, ∞), is determined by successively iterating the logical rules Fi, for which several algorithms are available. Synchronous algorithms assume that all nodes are simultaneously updated:
$$X_i^+ = F_i(X_1, \dots, X_n), \quad i = 1, \dots, n, \tag{1}$$
where Xi ∈ {0, 1}, X = (X1, . . . , Xn) denotes the state of the system at time t, and X+ = (X1+, . . . , Xn+) denotes the next state (at t + Δ). Alternatively, with asynchronous algorithms, at each iteration the nodes are sequentially updated, according to a given order (which can be prespecified or randomly chosen). Discrete models focus on the structure of the network (links), thus offering a more qualitative description of the system’s dynamics. Continuous models may offer more detailed descriptions of a system, but they also have the disadvantage of involving a large set of kinetic parameters, many of which are unknown. A method for analysis of Boolean models was introduced in [12, 13], which provides a bridge between discrete and continuous approaches. In this method, each node Xi of the network is represented by one continuous variable (xi) and one discrete variable (Xi, as before). The continuous variables are
Table 1. Boolean rules for the model of regulation of apoptosis via the NFκB pathway. TNF is a constant input. Identification of the nodes is given in the text. The letter “a” juxtaposed to a variable name denotes the active form of a molecule. The subscript “nuc” denotes the given component in the cellular nucleus. Alternative rules are given for the presence/absence of links A, C, L.

Node        Boolean rule
T1+         TNF
T2+         T1 and not FLIP
IKKa+       {L} T1 and not A20a and not C3a
            {no L} T1 and not A20a
NFκB+       {A} not IκB and not C3a
            {no A} not IκB
NFκBnuc+    NFκB and not IκBnuc
iκB+        NFκBnuc
IκB+        [T1 and (iκB and not IKKa)] or [not T1 and (iκB or not IKKa)]
IκBnuc+     IκB
a20+        NFκBnuc
A20+        a20
A20a+       T1 and A20
iap+        NFκBnuc
IAP+        [T1 and (iap and not C3a)] or [not T1 and (iap or not C3a)]
flip+       NFκBnuc
FLIP+       flip
C3a+        not IAP and C8a
C8a+        {C} not CARP and (C3a or T2)
            {no C} C3a or T2
CARP0+      [T1 and (NFκBnuc and not C3a)] or [not T1 and (NFκBnuc or not C3a)]
CARP+       CARP0
governed by ordinary differential equations, which combine a synthesis rate (based on its Boolean rule) and a linear degradation rate:
$$\frac{dx_i}{dt} = -a_i x_i + b_i F_i(X_1, X_2, \dots, X_n), \quad i = 1, \dots, n. \tag{2}$$
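As a worked step that is not spelled out in the chapter: on any time interval where the discrete state X does not change, F_i is a constant (0 or 1) and equation (2) is linear, so it can be solved in closed form. The symbol t_0, denoting the start of such an interval, is introduced here only for illustration.

```latex
x_i(t) = \frac{b_i}{a_i} F_i
       + \left( x_i(t_0) - \frac{b_i}{a_i} F_i \right) e^{-a_i (t - t_0)},
\qquad t \ge t_0 .
```

Hence x_i relaxes towards its maximal value b_i/a_i while F_i = 1 and towards 0 while F_i = 0, in both cases at rate a_i. For example, starting from x_i(t_0) = 0 with F_i = 1, the variable reaches half of its maximal value after a time (ln 2)/a_i.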
At each instant t, the discrete variable Xi is defined as a function of the continuous variable according to a threshold value of its maximal concentration:
$$X_i(t) = \begin{cases} 0, & x_i(t) \le \theta_i \dfrac{b_i}{a_i}, \\[4pt] 1, & x_i(t) > \theta_i \dfrac{b_i}{a_i}, \end{cases} \tag{3}$$
where θi ∈ (0, 1) represents the fraction of maximal concentration which is necessary for component Xi to become “active” and perform its biological functions. Initial conditions are equal for discrete and continuous variables: Xi(0) = xi(0). It is easy to see that the hypercube [0, b1/a1] × · · · × [0, bn/an] is an invariant set for system (2). The continuous variables denote concentrations of molecules; they are translated into a Boolean 0/1 response according to θi. The discrete variables Xi represent expression (1) or non-expression (0)
of species i, according to whether its continuous concentration xi is above or below the threshold θi bi/ai. Letting the parameters ai, bi, and θi be specific for each node i allows us to study different time scales for different biological processes (for instance, transcription, translation, or post-translational processes, as in [5]), or investigate the relative turnover rates of two molecules. Similar piecewise linear systems have also been studied in [7, 26].
3.1 Steady States
The steady states of a Boolean model are given by all the possible solutions X∗ of the equations:
$$X_i^* = F_i(X_1^*, \dots, X_n^*), \quad i = 1, \dots, n.$$
It is easy to see that any steady state of the Boolean model yields a steady state of the piecewise linear equations (2), since
$$\frac{dx_i}{dt} = 0 \;\Leftrightarrow\; x_i = \frac{b_i}{a_i} F_i(X_1, X_2, \dots, X_n), \quad i = 1, \dots, n,$$
independently of θi. Because the right-hand side of this equation is discontinuous, it is difficult to provide general results on the existence and uniqueness of solutions for system (2) (see for instance [3] and [11]). In view of this difficulty, in the present study we will assume that trajectories are well defined and analyze their dynamical behavior. For the model of Table 1, the steady states depend on the value of TNF (see Table 2). It is not difficult to check that (both with and without link A) there are exactly two distinct steady states when TNF = 0, characterized by the presence or absence of caspases 3 and 8, and hence corresponding to the survival or apoptotic responses (nodes not indicated below are zero):
$$\begin{aligned} (\mathrm{Ap}_0)\quad & T_1 = T_2 = 0,\ \mathrm{C3a} = \mathrm{C8a} = 1,\ \mathrm{I\kappa B} = \mathrm{I\kappa B_{nuc}} = 1, \\ (\mathrm{Lf}_0)\quad & T_1 = T_2 = 0,\ \mathrm{I\kappa B} = \mathrm{I\kappa B_{nuc}} = 1,\ \mathrm{CARP_0} = \mathrm{CARP} = \mathrm{IAP} = 1. \end{aligned} \tag{4}$$
This is in agreement with the idea that, under typical conditions, the cell should be capable of stably maintaining either an apoptotic or a survival state [8, 4].

Table 2. Steady states of the Boolean model, for each model variant, in the presence and absence of TNF.

Model   Links             TNF = 0     TNF = 1   Oscillations?
I       A, C, no L        Ap0, Lf0    Ap1       Yes
II      L, C, no A        Ap0, Lf0    none      Yes
III     C, no A, no L     Ap0, Lf0    none      Yes
IV      L, no A, no C     Ap0, Lf0    none      Yes

If TNF = 1, there is only one possible steady state for models with link A:
$$(\mathrm{Ap}_1)\quad T_1 = T_2 = 1,\ \mathrm{C3a} = \mathrm{C8a} = 1. \tag{5}$$
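The steady-state statements for Table 1 can be checked by brute force, since the model has only 19 nodes. The sketch below is an illustration rather than the authors' code: it transcribes the rules of Table 1 under ASCII node names chosen here (for example ikb for the iκB mRNA and IkB for the protein), encodes a model variant as a string of link letters, and keeps every state that each rule maps to itself.

```python
from itertools import product

# ASCII names for the 19 nodes of Table 1 (naming is an assumption of this sketch).
NODES = ("T1", "T2", "IKKa", "NFkB", "NFkBnuc", "ikb", "IkB", "IkBnuc",
         "a20", "A20", "A20a", "iap", "IAP", "flip", "FLIP",
         "C3a", "C8a", "CARP0", "CARP")


def gated(t1, activator, inhibitor):
    """AND of the two inputs while T1 = 1, OR while T1 = 0 (IkB, IAP, CARP0)."""
    if t1:
        return activator and not inhibitor
    return activator or not inhibitor


def make_rules(tnf, links):
    """Return {node: rule}; each rule maps a state dict to the node's next value."""
    has = {name: name in links for name in "ALC"}
    return {
        "T1": lambda s: tnf,
        "T2": lambda s: s["T1"] and not s["FLIP"],
        "IKKa": lambda s: s["T1"] and not s["A20a"] and not (has["L"] and s["C3a"]),
        "NFkB": lambda s: not s["IkB"] and not (has["A"] and s["C3a"]),
        "NFkBnuc": lambda s: s["NFkB"] and not s["IkBnuc"],
        "ikb": lambda s: s["NFkBnuc"],
        "IkB": lambda s: gated(s["T1"], s["ikb"], s["IKKa"]),
        "IkBnuc": lambda s: s["IkB"],
        "a20": lambda s: s["NFkBnuc"],
        "A20": lambda s: s["a20"],
        "A20a": lambda s: s["T1"] and s["A20"],
        "iap": lambda s: s["NFkBnuc"],
        "IAP": lambda s: gated(s["T1"], s["iap"], s["C3a"]),
        "flip": lambda s: s["NFkBnuc"],
        "FLIP": lambda s: s["flip"],
        "C3a": lambda s: not s["IAP"] and s["C8a"],
        "C8a": lambda s: (s["C3a"] or s["T2"]) and not (has["C"] and s["CARP"]),
        "CARP0": lambda s: gated(s["T1"], s["NFkBnuc"], s["C3a"]),
        "CARP": lambda s: s["CARP0"],
    }


def steady_states(tnf, links):
    """Enumerate all 2**19 states; return the sets of active nodes of fixed points."""
    rules = make_rules(tnf, links)
    found = []
    for bits in product((False, True), repeat=len(NODES)):
        s = dict(zip(NODES, bits))
        # all() stops at the first violated rule, so most states are rejected early.
        if all(rule(s) == s[name] for name, rule in rules.items()):
            found.append({name for name in NODES if s[name]})
    return found


if __name__ == "__main__":
    # Variant I of Table 2 (links A and C), without and with TNF.
    for tnf in (False, True):
        print("variant I, TNF =", int(tnf), steady_states(tnf, "AC"))
```

Under this transcription, the enumeration returns the two states of (4) when TNF = 0, a single state for variant I when TNF = 1, and no steady state for a variant without link A when TNF = 1. For variant I the single state also has IKKa = 1, because without link L nothing switches IKKa off while T1 = 1 and A20a = 0; equation (5) lists only the T and caspase nodes.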
For models with no link A, there is no possible steady state when TNF = 1, and there are only periodic orbits of period higher than 1. Therefore, during TNF treatment, models with link A may at any time make a decision towards the apoptotic pathway, while models with no link A will exhibit oscillatory behaviour and can only make a decision when TNF treatment ceases. Upon removal of TNF stimulation, trajectories of system (2) may be expected to converge to either the apoptotic or survival state. The choice of one or the other state will depend on the initial condition and the set of parameters ai, bi, and θi. Since these parameters are very likely to vary from cell to cell, it is reasonable to consider several (randomly chosen) sets of parameters and then compute the probability of convergence to each steady state. To examine the dynamics of system (2), and its dependence on parameters and the structure of the network of interactions, several numerical studies were performed, as described next.
3.2 Numerical Experiments
To test the model and analyse the effects of links A and L (Fig. 1), system (2) was simulated several times, with randomly chosen sets of parameters. For simplicity, the synthesis rates and threshold constants were fixed (bi = 1 and θi = 0.5 for all i), and only parameters ai were allowed to vary, chosen from a uniform distribution in the interval [1/3, 3] (h⁻¹). This seems reasonable, as the degradation rates used in [17] are roughly between 0.5 and 4 h⁻¹. Observe that ai plays a double role: it represents a degradation rate, but also defines the 0/1 threshold concentration (0.5/ai). Hence, high degradation rates also imply that a lower concentration is needed to achieve the 0/1 transition. Different durations of TNF stimulation were considered, namely: 2, 6, 11, 16, and 21 hours. For these simulations, one initial condition was chosen: IκB(0) = 1 and all other nodes set to zero.
This is based on a natural physiological starting point of the system: previous to stimulation, IKK is in its inactive form, while IκB is bound to NFκB, preventing transcriptional activity. Caspases reside in the cytosol in dormant forms [22]. To understand the importance of the links A, C, and L (the least well documented), four variants of the model depicted in Fig. 1 are compared: (I) links A and C present, (II) links L and C present, (III) only link C present, and (IV) only link L present (as listed in Table 2). The first three variants aim at comparing the effects of links A and L, and the last aims at evaluating the effect of link C. Other alternatives gave similar results (for example, a model with all three links gave results very similar to I) and thus are not detailed here. For each variant, the response of the system to each of the five TNF
durations was simulated 500 times. Since different sets of parameters {ai} introduce different time scales, variations in the dynamics from one simulation to another are expected. These variations may also be interpreted as a result of natural variability in biological systems. The average response over the 500 simulations will then yield the probability of the system converging towards each of the steady states. Other open questions that may be studied with our model include competition between the pro- and anti-apoptotic pathways and the point of irreversibility of the apoptotic decision. For instance, how long after caspase activation is recovery from the apoptotic pathway still possible [22]? To address these questions, numerical experiments were conducted by letting NFκB(0) = 1, setting all others to zero, and maintaining C3a(t) = 1 for durations of 10, 30, 60, and 360 minutes. For analysis of the numerical results, a “peak” in the trajectory of node Xj will be defined as a time interval [T0, T1], during which Xj(t) = 1, and such that Xj(T0 − Δ) = Xj(T1 + Δ) = 0. The period of oscillations is calculated as the average time interval between the onset of two consecutive peaks, i.e.,
$$\mathrm{Period} = \frac{1}{N_p - 1} \sum_{i=2}^{N_p} \left( T_{0,i} - T_{0,i-1} \right),$$
where Np is the number of peaks observed during the simulation time.
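The protocol of this section can be sketched in Python as follows. Several choices here are assumptions of the sketch, not statements about the authors' implementation: the forward-Euler integrator and its step dt, the ASCII node names, the encoding of a variant as a string of link letters, the use of Python's random module for the rates, and the choice to start IκB at its maximal level b/a so that its discrete value is 1 at t = 0. Peaks of nuclear IκB are recorded only once they are complete, following the definition above.

```python
import random

NODES = ("T1", "T2", "IKKa", "NFkB", "NFkBnuc", "ikb", "IkB", "IkBnuc",
         "a20", "A20", "A20a", "iap", "IAP", "flip", "FLIP",
         "C3a", "C8a", "CARP0", "CARP")


def rules(X, tnf, links):
    """Evaluate every Boolean rule F_i of Table 1 on the discrete state X."""
    def gated(act, inh):  # AND while T1 = 1, OR while T1 = 0
        return (act and not inh) if X["T1"] else (act or not inh)
    nuc = X["NFkBnuc"]
    return {
        "T1": tnf,
        "T2": X["T1"] and not X["FLIP"],
        "IKKa": X["T1"] and not X["A20a"] and not ("L" in links and X["C3a"]),
        "NFkB": not X["IkB"] and not ("A" in links and X["C3a"]),
        "NFkBnuc": X["NFkB"] and not X["IkBnuc"],
        "ikb": nuc, "a20": nuc, "iap": nuc, "flip": nuc,
        "IkB": gated(X["ikb"], X["IKKa"]),
        "IkBnuc": X["IkB"],
        "A20": X["a20"],
        "A20a": X["T1"] and X["A20"],
        "IAP": gated(X["iap"], X["C3a"]),
        "FLIP": X["flip"],
        "C3a": not X["IAP"] and X["C8a"],
        "C8a": (X["C3a"] or X["T2"]) and not ("C" in links and X["CARP"]),
        "CARP0": gated(nuc, X["C3a"]),
        "CARP": X["CARP0"],
    }


def simulate(a, tnf_hours, links, t_end=40.0, dt=0.001, b=1.0, theta=0.5):
    """Forward-Euler integration of (2) combined with the threshold rule (3).

    Returns the final discrete state and the onset times (hours) of the
    completed nuclear-IkB peaks.
    """
    x = {k: 0.0 for k in NODES}
    x["IkB"] = b / a["IkB"]  # assumption: IkB starts at its maximal level
    onsets, start, prev = [], None, False
    for step in range(int(round(t_end / dt))):
        t = step * dt
        X = {k: x[k] > theta * b / a[k] for k in NODES}
        if X["IkBnuc"] and not prev:
            start = t                 # rising edge: candidate onset T0
        elif prev and not X["IkBnuc"] and start is not None:
            onsets.append(start)      # falling edge: the peak is complete
            start = None
        prev = X["IkBnuc"]
        F = rules(X, t < tnf_hours, links)
        for k in NODES:
            x[k] += dt * (-a[k] * x[k] + b * F[k])
    X = {k: x[k] > theta * b / a[k] for k in NODES}
    return X, onsets


def period(onsets):
    """Average spacing between consecutive peak onsets (None if fewer than 2)."""
    if len(onsets) < 2:
        return None
    return (onsets[-1] - onsets[0]) / (len(onsets) - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    a = {k: rng.uniform(1 / 3, 3) for k in NODES}  # degradation rates in 1/h
    state, onsets = simulate(a, tnf_hours=11, links="LC")  # variant II
    fate = "apoptosis" if state["C3a"] else "survival"
    print(fate, len(onsets), period(onsets))
```

Because the sum in the period formula telescopes, period() only needs the first and last onset. Repeating the call with many independently sampled rate sets and counting the fraction of runs that end with C3a = 1 would give an estimate of the probability of apoptosis for a given variant and TNF duration, in the spirit of the 500-run averages described above.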
4 Results and Discussion
In the numerical simulations, it is observed that, once TNF stimulation ceases, a steady state pattern is always achieved, corresponding to either the apoptosis or survival states (4), (5). In the former case IκB is bound to NFκB, so that mRNAs and proteins downstream of NFκB are not expressed, and the cell has chosen the apoptotic pathway. The latter case represents survival of the cell, with IAP stably expressed preventing C3a activation, and CARP preventing C8a activation (see Fig. 2). In the presence of TNF stimulation, IκB, NFκB, and its dependent mRNAs/proteins may exhibit oscillatory dynamics, as observed experimentally in [14, 19]. In fact, computation of steady states shows that the models with no link A have no alternative but to exhibit oscillatory behaviour in the presence of TNF, since no possible steady states exist (except possible special solutions of the associated differential inclusion). The oscillatory behaviour (see analysis below) is in very good agreement with the experimental data reported in [19]. Qualitatively, all model variants respond in a similar fashion to TNF stimulation. As the stimulus duration increases, more cells choose the apoptotic pathway. Testing the four model variants shows that link A is very strong: not surprisingly, models with link A favour the apoptotic pathway, with 80% of cells reaching the apoptotic state, as opposed to around 50% or 40% in
Fig. 2. Example of network dynamics with the hybrid model (variant II), corresponding to cell survival (left) or apoptosis (right) solution. Panels show TNF, C3a, C8a, IAP, NFκBn, IκBn, and IKK versus time (hours). Numbers indicate the degradation rates for these numerical experiments. Solid lines represent normalized continuous variables (xi) and dashed lines represent discrete variables (Xi).
Fig. 3. Percentage of surviving cells for the four model variants (survival rate in % versus TNF duration in hours).
models II and IV, or 30% in the model with only link C (which favours the anti-apoptotic pathway) (Fig. 3). These values appear to be in agreement with experimental data: Rehm et al. [22] report that, for 8 hour treatments with
Fig. 4. Top row: Average period of nuclear IκB oscillations for apoptotic or surviving cells, as a function of TNF stimulus duration. Vertical lines represent standard deviation over the 500 numerical experiments. Bottom row: Relative timing of successive peaks in IκB oscillations, for apoptotic (grey) or surviving (black) cells. The “+” signs mark the experimental peak timing in [19].
high and low concentrations of TNFα, the percentage of cells undergoing activation of effector caspases was, respectively, 86% and 24%. The numerical experiments with our model capture the response to high (or significant) concentrations of TNFα, so variant I, followed by II and IV, is closest to the real system. Quantitative analysis of the oscillatory behaviour reveals some interesting facts (Fig. 4). To characterize the oscillatory dynamics, the following quantities were computed for nuclear IκB: period of oscillations (approximated), number of peaks, and relative timing between peaks. First, in all cells oscillations cease when TNF stimulation ceases, in agreement with observations. Second, the timing of successive peaks is also in remarkable quantitative agreement with experimental data [19], see Fig. 4 (bottom row). The first peak in nuclear IκB concentration was observed about 72 minutes from the start of TNF stimulation, and the second peak appears about 4 hours later, very close to the 75 minutes and 4.5 hours reported in [19]. It is striking that the time span of the first peak is typically longer than that of the following peaks, and that the time lapse between consecutive peaks decreases (see Figs. 2, 4). Third, the average period of oscillations is fairly constant, but “depends” on the apoptosis/survival decision. Statistical analysis of the period of oscillations
(calculated as indicated in Section 3.2) in nuclear IκB indicates that there is a natural period (for TNF treatment longer than 3 hours) for cells that eventually survived. This period is about 3.5 ± 1 hours for models I, II, and IV, and slightly higher at 4 ± 1 hours for model III. In contrast, for cells that chose the apoptotic pathway, the period of oscillations can be much smaller. For models with link A, essentially no oscillations are observed in apoptotic cells (Fig. 4, top, left): this is because cell death is decided very early on, with link A immediately preventing any further NFκB activity. For model II (links C and L only), oscillations are observed in apoptotic cells with a natural period which is lower (about 3 ± 1 hours) than that for surviving cells (Fig. 4, top, middle). Results for model IV (not shown) are quite similar to those of model II. For model variant III, there is no difference between observed periods (Fig. 4, top, right). These results provide indications for discriminating between the four model variants and also suggest that the period of oscillations may play a role in the survival/apoptosis decision: lower periods/higher frequencies would lead towards the apoptotic pathway. A similar result has been reported, for instance, in the p53-Mdm2 system [16], where more peaks (higher frequency) were detected in response to higher (and more damaging) γ-irradiation doses. The p53-Mdm2 system also contains a negative feedback loop similar to the NFκB-IκB loop. To address the question of irreversibility of the apoptotic decision, we checked the capacity of the network to recover from overexpression of active caspase 3. Fixing node C3a at its maximal value for intervals of 10, 30, 60, and 360 minutes (that is setting discrete C3a(t) = 1, for t 5.6 (Fig. 4). 
We now show the effect of modifying the structure of this simple two-species network through the addition of a new link via the virus, which not only separates the prey species into two compartments but also modifies the predation strength (Model II). For simulation of the Model II network, the new parameter values are chosen as λ = 0.002, η = 0.7, μ = 0.05, and κ = 13. The introduction of the new node (V) and links to the existing module (Model I) has interesting effects on the population dynamics of the species that
S. Bhattacharyya and S. Sinha
Fig. 4. Bifurcation diagram of the prey and predator in Model I with increasing predation strength α. At α = 5.6 (approx.), the system undergoes a period-doubling bifurcation.
Fig. 5. Bifurcation diagram of all four populations (susceptible prey, infected prey, predator, and virus) in Model II as a function of the interaction strength parameter α, for different prey preference: (A) ξ = 0.5, (B) ξ = 0.99.
depends on the interaction strength. As ξ regulates the predation strength by changing the prey preference of the predator, we analyzed Model II for two different values of this interaction strength: ξ = 0.99, indicating high preference for the susceptible prey and very low preference for the infected prey; and ξ = 0.5, where the predator has no preference for one over the other. Figure 5 shows the bifurcation diagrams of Model II for the two cases, ξ = 0.5 and 0.99. Figure 5(A) shows that, at ξ = 0.5, there are two important changes that occur in the same range of predation strength, i.e., 5.5 < α < 6.5. First, the network reduces to only a “prey (S and I) and virus (V)” system
Ecological Networks: Structure, Interaction Strength, and Stability
with the predator population going to zero. This happens because, in the absence of predation, all of S is available for inducing strong viral infection, which converts the susceptible prey class to the infected one, and the predator does not have enough prey to survive through predation. Second, the dynamics of this prey-virus system remains stable with a large virus population and low prey populations. When the predator has a strong preference for the S population, i.e., at ξ = 0.99, this situation continues for low predation strength (until α = 6.2), and the reduced prey-virus system remains stable (Fig. 5(B)). However, at higher predation strength (α > 6.2), the predator succeeds in surviving on predation and reduces the population of I strongly enough to reduce the production of V, which in turn reduces infection, thereby increasing S, which is then available for predation. This kind of delayed feedback on S eventually induces oscillations in all four populations, albeit at higher α compared to Model I. This interesting phenomenon essentially underscores the fact that the distribution of the type (+ or −) and the strength of interactions can play a significant role in food web structure and dynamics. It can change the structure of the network by inducing a species to go extinct, and also promote stability in an otherwise oscillatory system.
4 Discussion and Conclusion
Community stability in ecology is primarily decided by the topological and functional architecture of the entire organization. Some studies have indicated that weak interactions are one of the most dominant threads in weaving natural communities in tune [50, 77], which is also reasserted by our simple models. Weak interactions have been proposed as the “glue” that binds large networks together [43], with ramifications for biodiversity. In particular, this has important implications for those species whose low abundance and weak per capita consumption rates might otherwise be taken as evidence of a negligible role [42]. Large network simulations have shown that the distribution of interaction strengths is strongly skewed towards weak interactions [29, 61]. Although the experimental quantification of interaction strength in field studies is difficult, preliminary contributions on the nature of distributions of interaction strengths within real food webs are slowly emerging [16, 76]. Similarly, the importance of weak interactions for dynamic stability and species coexistence has been suggested from matrix analyses of soil food webs, numerical simulations of small and large webs, and experimental manipulations [6, 32, 48, 59]. Our study, with two very simple yet realistic ecological networks, points towards some intriguing features. One point of interest is that the introduction of another species in a two-species prey-predator interaction network compartmentalizes the single prey species into two subgroups leading to additional diversity in the network. Such a node can modify the network structure by pushing the predator species to extinction simply based on the interaction strength and its preference level. At higher values of both these interaction
parameters, the full network structure persists. These features, i.e., the interaction strength and network structure, also regulate the population dynamics of the species. A combination of type and strength of interactions determines the dynamical stability of the species in the network. One natural extension of our study would be to introduce yet another class of prey species, Recovered, which represents the population of individuals that recover from the infection after a time, and either return to the susceptible class, or may be immune to further infections. This would, obviously, increase the complexity of the network by adding new nodes and interactions among them. However, this would contribute towards understanding the concept “diversity leads to stability” on large-scale food web processes. Most of the recent research on food web theory in ecology centers around the local dynamics of a community, but the evolution of food web dynamics across different spatial scales has also received considerable attention [26, 27, 45, 57, 73]. “Habitat fragmentation and its impact on life” is one of the most important issues of present research [66]. The destruction of habitat occurs due to a variety of environmental threats, such as habitat removal, invading alien species, or hunting, each of which may have different effects on food web structure. Given that they often act concomitantly, these may also interact with each other in unpredictable ways. Introduction of alien species poses a significant threat to global biodiversity by altering ecosystem processes, such as nutrient cycling, or disturbance regimes in a community [65], which, in turn, also affect the strength of the links. If the performance of interacting species is habitat dependent, then interaction strength may change with scale. 
Certain approaches such as hierarchical communities of competitors [69, 70] and neutral and quasi-neutral communities [68] have been adapted to show that community organization is relevant in determining the effects of habitat loss and spatial patterning. Such research on ecological network theory in the future would involve rigorous modeling approaches, both analytical and through simulations, in combination with field and laboratory experimental studies, to resolve the crucial questions in conservation and restoration ecology.
Acknowledgments
The authors are thankful to the anonymous referees for constructive, critical comments, and to the Department of Science and Technology, India, for financial support.
References

1. Abrams, P. et al. The role of indirect effects in food webs. In Food Webs: Integration of Patterns and Dynamics (eds G.A. Polis & K.O. Winemiller), 371–395, Chapman & Hall, New York (1996)
2. Allesina, S. and Bodini, A. Who dominates whom in the ecosystem? Energy flow and bottlenecks and cascading extinctions. J. Theor. Biol., 230, 351–358 (2004)
Ecological Networks: Structure, Interaction Strength, and Stability
3. Bascompte, J. and Melian, C. J. Simple trophic modules for complex food webs. Ecology, 86, 2868–2873 (2005)
4. Bascompte, J. et al. Interaction strength combinations and the overfishing of a marine food web. Proc. Natl. Acad. Sci. USA, 102, 5443–5447 (2005)
5. Bastolla, U., Lassig, M., Manrubia, S. C. and Valleriani, A. Diversity patterns from ecological models at dynamical equilibrium. J. Theor. Biol., 212, 11–34 (2001)
6. Berlow, E. L. et al. Interaction strengths in food webs: issues and opportunities. J. Anim. Ecol., 73, 585–598 (2004)
7. Berlow, E. L., Brose, U. and Martinez, N. D. The “Goldilocks factor” in food webs. Proc. Natl. Acad. Sci. USA, 105, 4079–4080 (2008)
8. Bhattacharyya, S. and Bhattacharya, D. K. Pest control through viral diseases: mathematical modeling and analysis. J. Theor. Biol., 238, 177–197 (2006)
9. Caldarelli, G., Higgs, P. G. and McKane, A. J. Modelling coevolution in multispecies communities. J. Theor. Biol., 193, 345–358 (1998)
10. Camacho, J. et al. Quantitative analysis of the local structure of food webs. J. Theor. Biol., 246, 260–268 (2007)
11. Case, T. J. Invasion resistance arises in strongly interacting species-rich model competition communities. Proc. Natl. Acad. Sci. USA, 87, 9610–9614 (1990)
12. Chen, X. and Cohen, J. E. Global stability, local stability and permanence in model food webs. J. Theor. Biol., 212, 223–305 (2001)
13. Cohen, J. E., Briand, F. and Newman, C. M. Community food webs. Biomathematics, 20, Springer-Verlag, Berlin (1990)
14. Dambacher, J. M. et al. Relevance of community structure in assessing indeterminacy of ecological predictions. Ecology, 83, 1372–1385 (2002)
15. Dambacher, J. M. et al. Qualitative stability and ambiguity in model ecosystems. Am. Nat., 161, 876–888 (2003)
16. De Ruiter, P., Neutel, A. M. and Moore, J. C. Energetics, patterns of interaction strengths, and stability in real ecosystems. Science, 269, 1257–1260 (1995)
17. Drossel, B. and McKane, A. J. Modelling food webs. In Handbook of Graphs and Networks (eds S. Bornholdt & H. G. Schuster), 218–247, Wiley-VCH, Berlin (2003)
18. Dunne, J. A. et al. Network structure and biodiversity loss in food webs: robustness increases with connectance. Ecol. Lett., 5, 558–567 (2002)
19. Emmerson, M. C. and Raffaelli, D. Predator-prey body size, interaction strength and the stability of a real food web. J. Anim. Ecol., 73, 399–409 (2004)
20. Garcia-Domingo, J. L. and Saldana, J. Food-web complexity emerging from ecological dynamics on adaptive networks. J. Theor. Biol., 247, 819–826 (2007)
21. Garcia-Domingo, J. L. and Saldana, J. Effects of heterogeneous interaction strengths on food web complexity. Oikos, 117, 336–343 (2008)
22. Ghosh, S., Bhattacharyya, S. and Bhattacharya, D. K. Role of viral infection in pest control: a mathematical study. Bull. Math. Biol., 69, 2649–2691 (2007)
23. Gross, T. et al. Long food chains are in general chaotic. Oikos, 109, 135–144 (2005)
24. Hastings, A. and Powell, T. Chaos in a 3-species food-chain. Ecology, 72, 896–903 (1991)
25. Jansen, V. A. A. and Kokkoris, G. D. Complexity and stability revisited. Ecol. Lett., 6, 498–502 (2003)
26. Keitt, T. H. Network theory: an evolving approach to landscape conservation. Ecological and Modeling for Resource Managers, Springer Berlin, 125–134 (2003)
S. Bhattacharyya and S. Sinha
27. Keitt, T. H. and Economo, E. P. Species diversity in neutral metacommunities: a network approach. Ecol. Lett., 11(1), 52–62 (2008)
28. Kokkoris, G. D. et al. Variability in interaction strength and implications for biodiversity. J. Anim. Ecol., 71, 362–371 (2002)
29. Kokkoris, G. D., Jansen, V. A. A., Loreau, M. and Troumbis, A. Y. Variability in interaction strength and implications for biodiversity. J. Anim. Ecol., 71, 362–371 (2002)
30. Kondoh, M. Does foraging adaptation create the positive complexity-stability relationship in realistic food-web structure? J. Theor. Biol., 238, 646–651 (2006)
31. Krause, A. E. et al. Compartments revealed in food-web structure. Nature, 426, 282–285 (2003)
32. Laska, M. S. and Wootton, J. T. Theoretical concepts and empirical approaches for measuring interaction strength. Ecology, 79, 461–476 (1998)
33. Law, R. and Morton, R. D. Permanence and the assembly of ecological communities. Ecology, 77, 762–775 (1996)
34. Lawton, J. H. Food webs. In Ecological Concepts: the Contribution of Ecology to an Understanding of the Natural World (ed. J. Cherret), 43–78, Blackwell, Boston (1990)
35. Levins, R. Evolution in Changing Environments: Some Theoretical Explorations. Princeton University Press, Princeton, NJ, USA (1968)
36. Logofet, D. O. Stronger-than-Lyapunov notions of matrix stability, or how ‘flowers’ help solving problems in mathematical ecology. Linear Algebra and Its Applications, 398, 75–100 (2005)
37. Loreau, M. et al. A new look at the relationship between diversity and stability. In Biodiversity and Ecosystem Functioning: Synthesis and Perspectives (eds M. Loreau, S. Naeem and P. Inchausti), 79–91, Oxford University Press, Oxford (2002)
38. MacArthur, R. H. and Levins, R. Strong, or weak interactions? Transactions of the Connecticut Academy of Arts and Sciences, 44, 177–188 (1972)
39. Martinez, N. D. et al. Diversity, complexity, and persistence in large model ecosystems. In Ecological Networks, Linking Structure to Dynamics in Food Webs (eds Pascual, M. and Dunne, J. A.) Santa Fe Inst., Studies in the sciences of complexity. Oxford Univ. Press, 163–185 (2006)
40. May, R. M. Will a large complex system be stable? Nature, 238, 413–414 (1972)
41. May, R. M. Stability and Complexity in Model Ecosystems, Princeton University Press, Princeton, NJ, USA (1973)
42. McCann, K. S. The diversity–stability debate. Nature, 405, 228–233 (2000)
43. McCann, K. et al. Weak trophic interactions and the balance of nature. Nature, 395, 794–798 (1998)
44. McCann, K. and Hastings, A. Re-evaluating the omnivory–stability relationship in food-webs. Proc. Roy. Soc. of London, Series B, 264, 1249–1254 (1998)
45. Memmott, J. et al. Biodiversity loss and ecological network structure. In Ecological Networks: Linking Structure to Dynamics in Food Webs (eds. M. Pascual and J.A. Dunne), Oxford University Press, Oxford (2006)
46. Milo, R. et al. Network motifs: simple building blocks of complex networks. Science, 298, 824–827 (2002)
47. Montoya, J. M., Pimm, S. L. and Sole, R. V. Ecological networks and their fragility. Nature, 442, 259–264 (2006)
48. Montoya, J. M. and Sole, R. V. Topological properties of food webs: from real data to community assembly models. Oikos, 102, 614–622 (2003)
49. Navarrete, S. A. and Berlow, E. L. Variable interaction strengths stabilize marine community patterns. Ecol. Lett., 9, 526–536 (2006)
50. Navarrete, S. A. and Castilla, J. C. Experimental determination of predation intensity in an intertidal predator guild: dominant versus subordinate prey. Oikos, 100, 251–262 (2003)
51. Otto, S. B., Berlow, E. L., Rand, N. E., Smiley, J. and Brose, U. Predator diversity and identity drive interaction strength and trophic cascades in a food web. Ecology, 89, 134–144 (2008)
52. Paine, R. T. Food web complexity and species diversity. Am. Nat., 100, 65–75 (1966)
53. Paine, R. T. A note on trophic complexity and community stability. Am. Nat., 103(929), 91–93 (1969)
54. Paine, R. T. Food webs - road maps of interactions or grist for theoretical development. Ecology, 69, 1648–1654 (1988)
55. Paine, R. T. A conversation on refining the concept of keystone species. Conservation Biology, 9(4), 962–964 (1995)
56. Petchey, O. L., Beckerman, A. P., Riede, J. O. and Warren, P. H. Size, foraging, and food web structure. Proc. Natl. Acad. Sci. USA, 105, 4191–4196 (2008)
57. Peterson, E. E., Theobald, D. M. and Ver Hoef, J. M. Geostatistical modeling on stream networks: developing valid covariance matrices based on hydrologic distance and stream flow. Freshwater Biology, 52, 267–279 (2007)
58. Pimm, S. L. The complexity and stability of ecosystems. Nature, 307, 321–326 (1984)
59. Polis, G. A. Stability is woven by complex webs. Nature, 395, 744–745 (1998)
60. Post, W. M. and Pimm, S. L. Community assembly and food web stability. Math. Biosci., 64, 169–192 (1983)
61. Quince, C. et al. Topological structure and interaction strengths in model food webs. Ecol. Model., 187, 389–412 (2005)
62. Raffaelli, D. G. Trends in research on shallow water food webs. Journal of Experimental Marine Biology and Ecology, 250, 223–232 (2000)
63. Rooney, N. et al. Structural asymmetry and the stability of diverse food webs. Nature, 442, 265–269 (2006)
64. Sabo, J. L. et al. Population dynamics and food web structure - predicting measurable food web properties with minimal detail and resolution. In Dynamic Food Webs, Multispecies Assemblages, Ecosystem Development and Environmental Change (eds. de Ruiter, P. C. et al.) Theor. Ecol. Ser., Academic Press, 437–452 (2005)
65. Schmitz, D. C. and Simberloff, D. Biological invasions: a growing threat. Issues in Sci. & Tech. 13, 33–40 (1997)
66. Singh, B. K., Subba Rao, J., Ramaswamy, R. and Sinha, S. The role of heterogeneity on the spatiotemporal dynamics of host-parasite metapopulation. Ecol. Model., 180, 435–443 (2004)
67. Singh, B. K., Chattopadhyay, J. and Sinha, S. The role of virus infection in a simple phytoplankton zooplankton system. J. Theor. Biol., 231, 153–166 (2004)
68. Sole, R. V., Alonso, D. and McKane, A. Self-organized instability in complex ecosystems. Phil. Trans. Roy. Soc. Lond. Ser., B-Biol. Sci. 357, 667–681 (2002)
69. Stone, L. Biodiversity and habitat destruction - a comparative study of model forest and coral-reef ecosystems. Proc. Natl. Acad. Sci. USA, 261, 381–388 (1995)
70. Tilman, D. et al. Habitat destruction and the extinction debt. Nature, 371, 65–66 (1994)
71. Uchida, S. and Drossel, B. Relation between complexity and stability in food webs with adaptive behavior. J. Theor. Biol., 247, 713–722 (2007)
72. Uchida, S., Drossel, B. and Brose, U. The structure of food webs with adaptive behaviour. Ecol. Model., 206, 263–276 (2007)
73. Urban, D. L., Goslee, S., Pierce, K. B. and Lookingbill, T. R. Extending community ecology to landscapes. Ecoscience, 9, 200–212 (2002)
74. Williams, R. J. and Martinez, N. D. Simple rules yield complex food webs. Nature, 404, 180–183 (2000)
75. Woodward, G. and Hildrew, A. G. Body-size constraints on niche overlap and intraguild predation in a complex food web. J. Anim. Ecol., 71, 1063–1074 (2002)
76. Wootton, J. T. Estimates and tests of per-capita interaction strength: diet, abundance, and impact of intertidally-foraging birds. Ecological Monographs, 67, 45–64 (1997)
77. Wootton, J. T. and Emmerson, M. Measurement of interaction strength in nature. Annu. Rev. Ecol. Evol. Syst., 36, 419–444 (2005)
78. Yodzis, P. The indeterminacy of ecological interactions as perceived through perturbation experiments. Ecology, 69, 508–515 (1988)
79. Yodzis, P. and Innes, S. Body-size and consumer-resource dynamics. Am. Nat., 139, 1151–1175 (1992)
Signaling and Feedback in Biological Networks

Sandeep Krishna, Mogens H. Jensen, and Kim Sneppen

Center for Models of Life, Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen, Denmark;
[email protected],
[email protected],
[email protected]

1 Introduction

Cellular processes operate on a wide range of time and length scales to produce complex and intricate dynamics. It is a great challenge to understand both how these dynamical patterns are produced and why they are produced; that is, what functional or evolutionary role do they play? This is one of the most fruitful areas in which to apply the ideas of complex networks. Living cells have all the prerequisites for a useful representation as networks. First, cellular systems contain numerous non-identical active components: genes, proteins, RNA, etc. These are the nodes of the network. Second, there are many interactions between these components, which form the links between the nodes. Not every pair of components interacts, so the resulting network is not fully connected, nor is it a tree or other simple topology. Thus, cellular networks provide plenty of scope for analysing their structure and graph-theoretic properties, and numerous studies have taken advantage of this (see [1] for reviews and [2–9] for some examples).

Network representations of cellular systems can easily be augmented to address dynamical issues. Each node can be associated with a dynamical variable which could represent, for example, the concentration of that protein or the level of expression of that gene. Equations or rules governing the temporal dynamics of these variables can then be written, where the network structure determines which variables interact with each other. This usually requires encoding more information about the interactions into the network representation. For instance, apart from knowing that one node links to another, one needs to know the sign and strength of the interaction.
However, in a network picture it is sometimes difficult to encode more detailed molecular information, such as whether the binding of a protein to DNA is accompanied by DNA looping, or whether a small molecule that binds to a protein can also bind equally well when that protein is bound to DNA.

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_5, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
The question then is: What kind of physiologically useful processes can be illuminated by the kind of information that is easily represented in a network picture of a cell? One broad class of such processes is signal propagation. Signals need to be sent in response to environmental conditions in order to trigger the appropriate functional proteins, and need to be sent between proteins in order to perform necessary computations. For example, the presence of food metabolites in the surroundings triggers signals to proteins involved in transport and metabolism of those molecules; or a sudden change in the temperature triggers signals to proteins which buffer the cell against the shock. Network representations of cellular systems are particularly suited to study signal propagation because they precisely delineate the paths along which signals could travel. The next level of complication occurs when a signal loops back onto itself. Such feedback loops are at the core of every non-trivial computation performed by a cell [10–16]. Feedback loops are necessary for much non-trivial dynamical behaviour, in particular, oscillations and multistability, both of which are important for proper cellular function in different organisms. Our review will therefore introduce biological networks specifically with the intention of investigating signal propagation and feedback. We will describe simple measures for examining signal propagation on networks. We will use the organism-wide cellular network of E. coli to discuss whether the network structure has any particular properties which would affect the cost and specificity of signal propagation. The review will then continue by discussing feedback in sub-networks of mammalian and yeast cells. We will take one example each from a biological setting where, respectively, negative and positive feedback in the network structure play a crucial role in the dynamical behaviour of the system. 
Finally, we will conclude by looking at combinations of feedback loops. We show that two entangled feedback loops, which are common in bacterial cells, have dynamical properties that are quite different from those of their individual loops.
2 Signaling

An organism-wide protein network of the bacterium E. coli can be extracted from the database EcoCyc [17] and represented as a directed, bipartite graph with 2846 protein nodes and 2774 reaction nodes [18]. The reaction nodes include all kinds of cellular reactions between proteins: transcription reactions, complex formations, protein modifications and metabolic reactions. Figure 1A shows the giant weakly connected component of this graph, consisting of 1938 reactions (of which 812 are transcription reactions, squares) and 1897 proteins (circles). Figure 1A also illustrates that the E. coli graph is composed of a large number of relatively small strong components (a strong component is a sub-graph where there is a directed path between every pair of nodes). Figure 1B compares this with the strong component structure of a randomised network with exactly the same number of nodes and links, as well as the same in- and out-degree (number of in- and out-links) of each node. The E. coli
Fig. 1. E. coli protein reaction network. (A, Left) The graph is the largest weak component of a bipartite network, consisting of proteins (circles) and reaction nodes (promoters (squares), complex formations and modifications (black squares)). The two largest hubs, σ70 and CRP, and their links, have been removed for ease of visualisation. (A, bottom left) Illustration of the procedure of making the strong component graph. (A, Right) The resulting strong component graph of the E. coli network. An arrow in the strong component graph indicates that there is a path connecting the two strong components in the original graph; nodes correspond to strong components of minimum size two. (B) The strong component graph for a randomised version of the E. coli network. The randomisation preserves the total number of nodes, total number of links and the number of in- and out-links of each node [18].
protein network is much more modular than the randomized network, an overall feature of regulation/signaling that was first suggested in [19]. In such a network, what we call “signals” are perturbations in the dynamical variables associated with the nodes. For instance, if they were all proteins, then a perturbation in the concentration of one protein would alter the concentration of all the proteins downstream from the original one. The simplest aspect of the structure of the network that influences signaling is the number of nodes that are downstream of any given starting node (note that this is a quantity that can be sensibly studied only with a directed graph representation of the network; in any connected undirected graph all nodes are downstream of each other). The possible signals emanating from the starting node are
Fig. 2. The cumulative distribution of number of downstream targets s for nodes of the E. coli network (lower curve) and the randomised network (upper curve) [18].
obviously limited to reaching only these nodes. The strong component graphs in Fig. 1 show particularly clearly how the network structure affects signaling possibilities. Within each strong component, every node can, in principle, send a signal to every other node. But between strong components the possibilities are hugely reduced. Thus, the E. coli network structure already seems to be set up to allow plentiful signaling on short length scales, but to allow only very specific paths on longer length scales. In the random network, however, most nodes can send signals to almost the entire network (because most of the nodes are part of one giant strong component). A percolating structure like this is not conducive to specific signaling because every node has almost the entire network downstream of it. Figure 2 bolsters this conclusion, showing that in the E. coli network proteins have a much smaller number of downstream targets than in the randomised network.

2.1 Cost of Signaling

Signaling is not just about reaching a downstream target. As a signal propagates, it needs other molecules to help it pass the message across consecutive reactions. Consider, for example, a signal initiated by an increase in the concentration of a given transcription factor. The promoter it influences may depend on other transcription factors, for example, in an or-gate construction. If that is the case, and the other transcription factor is already abundant, the promoter activity will not be influenced and thus the signal will not be transmitted. More generally, for each additional reactant along a reaction pathway, signal propagation becomes increasingly coupled to the overall state of the
Fig. 3. (A) Schematic showing how the “cost” of a signaling path, A → F , is measured. In this case proteins B and D are necessary, giving a cost C = 2. (B) Cost of a signaling path as a function of its length for the real (solid) and randomised (dashed) E. coli networks [18].
molecules in the cell. The more reactions in the path, and the more reactants in each reaction, the more conditions that must be met for propagation of the signal. We quantify this cost C = C(path) for an arbitrary path from a starting protein to a target protein by simply counting the number of reactants along the entire path (not counting the protein nodes which are part of the path), as described schematically in Fig. 3A. If the same reactant is used several times, it is only counted once. Notice that the propagation of a signal does not necessarily mean an increased level of the proteins involved. The key point is that a change in input state should be transmitted to a changed output state of the end product. Our cost function is a simple measure of the complexity of handling such a signal and it could, in principle, be calculated between any pair of proteins where a path exists in the directed network. Figure 3B shows the average cost of signals propagating from one protein to another along the shortest path connecting them, as a function of the length l of that path. Each data point is the average over all pairs which are at the given distance. Except for paths of length two, the average cost for signals
Fig. 4. The six largest strong components of the E. coli network (A–F), along with plots of the average cost, C(l) as a function of signaling distance. The grey areas show the range spanned by C(l) for 100 randomised versions of the subgraphs [18].
is significantly smaller for the real E. coli network than for the randomised network (error bars are smaller than the symbol size). Figure 4 repeats this analysis for each of the six largest strong components in the network. These strong components capture distinct functional units associated, respectively, with (A) predominantly fatty acid metabolism, (B) the transcription network around σ factors, (C) PTS-sugar transport, (D) ABC transporters, (E) the FeII and FeIII transport system and (F) the chemotaxis module. Overall, we see that the cost within each module is fairly similar to the random expectation.

2.2 Conclusions About Signaling

We have shown that the molecular network of E. coli is designed in a way which facilitates local signaling. On longer distances, signal transmission is a priori nearly impossible, but we find statistical evidence for signal pathways in terms of a lower signaling “cost” when we measure this by the number of co-factors needed to transmit a given signal. The fact that the E. coli network has a lower than randomly expected cost of signaling for paths longer than two steps shows that it contains many linear chains which have few incoming branches. That is, the real network is “stringy,” while the randomised network is more “bushy,” having relatively many more branched pathways. Topologically, a low cost is equivalent to less cross talk, which is indeed desirable [3, 19]. This picture of a stringy network of long linear chains applies to the large scale: the place where the real network optimizes specific signaling is between
strong component modules, rather than within them. A final intriguing point is that at small scales, within modules, the network has widely different design features, as seen from Fig. 4. Some modules (C,F) are dominated by complex formation reactions, and others (D,E) by linear pathways, while the remaining (A,B) are densely interconnected. Obviously, signaling is not only limited by the topology of the network, but also by the type of chemical reactions that facilitate the signals. For example, in pure protein-protein interaction networks, Refs. [20, 21] show that proteins with high concentrations propagate signals to proteins at low concentrations, but not vice versa. Further, when most of a protein is present in an unbound form, rather than in a complex with other proteins, it inhibits propagation of signals through that node of the network. Thus, the overall picture of signaling in biological networks is that one needs careful engineering of both topology and protein binding chemistry in order to facilitate signal propagation over more than one or two reactions.
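The two structural measures used in this section, the set of downstream targets of a node and the cost C of a signaling path, can be sketched on a toy network. The code below is only an illustration under stated assumptions: the three reactions in `REACTIONS` are hypothetical (they are not taken from EcoCyc and are merely chosen so that a path A → F needs proteins B and D, as in the schematic of Fig. 3A), and the function names are ours.

```python
from collections import deque

# Hypothetical toy network (NOT the EcoCyc data): each reaction node maps to
# (set of reactant proteins, set of product proteins).
REACTIONS = {
    "r1": ({"A", "B"}, {"C"}),
    "r2": ({"C", "D"}, {"E"}),
    "r3": ({"E"}, {"F"}),
}


def successors(node, reactions):
    """Out-neighbours of a node in the directed bipartite graph."""
    if node in reactions:  # reaction node -> its products
        return reactions[node][1]
    # protein node -> every reaction that uses it as a reactant
    return {r for r, (inputs, _) in reactions.items() if node in inputs}


def downstream_proteins(start, reactions):
    """All proteins reachable from `start` along directed paths (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors(queue.popleft(), reactions):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {n for n in seen if n not in reactions and n != start}


def shortest_path(start, target, reactions):
    """Shortest directed path [protein, reaction, protein, ...], or None."""
    parent, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in successors(node, reactions):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def path_cost(path, reactions):
    """Number of distinct reactants along the path, excluding path proteins."""
    on_path = set(path)
    needed = set()  # a set, so a reactant used several times counts once
    for node in path:
        if node in reactions:
            needed |= reactions[node][0] - on_path
    return len(needed)


path = shortest_path("A", "F", REACTIONS)
print(path, "cost =", path_cost(path, REACTIONS))  # B and D are needed: cost 2
print(sorted(downstream_proteins("A", REACTIONS)))
```

Averaging `path_cost` over all pairs of proteins at a given shortest-path distance would give a curve analogous to C(l) in Fig. 3B; doing so for the real network requires the EcoCyc-derived graph of Ref. [18], which is not reproduced here.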
3 Feedback

Figure 5 shows a number of feedback loops. Each node in each loop receives signals (perturbations) from the previous node and sends them on to the next node in the cycle. When the signal travels all the way around the loop, it will
[Figure 5 panels. Negative feedback loops: (a) Hes1; (b) p53, Mdm2; (c) lactose, β-galactosidase, LacI; (d) NF-κB, IκBα mRNA, IκBα. Positive feedback loops: (e) cI; (f) cI, Cro; (g) lactose, lactose transporter, LacI.]
Fig. 5. Examples of positive and negative feedback loops. An ordinary arrow indicates activation, a barred arrow indicates inhibition. (a)–(d) Negative feedback loops found involving proteins important for, respectively, development [29], apoptosis [30], lactose consumption [31, 32] and the immune system [33]. (e)–(g) Positive feedback loops involving proteins important for, respectively, λ phage lysis-lysogeny decision and induction [34], and import of extracellular lactose [31, 32].
act to either dampen (negative feedback) or enhance (positive feedback) the original perturbation. Whether the feedback is positive or negative depends on how the nodes interact. In Fig. 5 we use an ordinary arrow to indicate that a node activates the next node, and a barred arrow if a node inhibits the next node. Then, clearly, all loops with an odd number of repressors are negative feedback loops, while those with an even number of repressors are positive feedback loops.

In cellular networks such feedback loops are quite common. Previous studies which searched for small “motifs” in cellular networks found very few feedback loops and an overabundance of feedforward loops [7]. However, these studies looked only at transcription factor networks. As soon as one includes metabolism, it quickly becomes apparent that feedback loops are by far the most common motif, especially at the interface between the metabolic and regulatory networks of the cell [22]. This interface is quite extensive, as evidenced by the fact that around half of all transcription factors in E. coli have a binding site for small metabolic molecules. The sugar lactose is one such example, being involved in both a negative (Fig. 5c) and a positive feedback loop (Fig. 5g). Figure 5 also shows some other examples of negative and positive feedback loops without small molecules.

Positive feedback loops are closely related to the existence of multiple stable states of the system, while negative feedback loops are associated with oscillations. In fact, for a very general class of systems, it has been shown that the existence of at least one negative feedback loop is necessary (but not sufficient) for oscillations, and a similar result holds for positive feedback and multistability [23–25]. References [26–28] study, both theoretically and through the construction of synthetic gene circuits, multistability in positive feedback networks. Ref.
[35] further explores the connection between oscillations and negative feedback, showing how the structure of the underlying loop can be extracted from oscillating time series.

3.1 Negative Feedback and Oscillations in Mammalian Immune Response

The simplest negative feedback loop is, of course, a protein which represses itself (Fig. 5a). There are many examples of such proteins: the main regulator of the E. coli response to UV damage, LexA, represses its own production [36]; Hes1, involved in development in mammalian cells, also represses transcription of its own gene [29]. A well-known synthetic negative feedback loop is the repressilator, which consists of three proteins repressing one another in a ring [37] (the same structure as Fig. 5c). Here we will concentrate on the negative feedback loop shown in Fig. 5d containing the transcription factor, NF-κB, which is one of the central regulators of the immune system in mammalian cells. The NF-κB family of proteins is one of the most studied, being involved in a variety of cellular processes including immune response, inflammation and development. NF-κB can be activated by a number of external stimuli
including bacteria, viruses and various stresses and proteins. In response to these signals it controls, directly and indirectly, over 150 genes including many chemokines, immunoreceptors, stress response genes and acute phase inflammation response proteins [33]. Nuclear NF-κB is known to activate production of IκBα, an inhibitor protein which inhibits nuclear import of NF-κB by sequestering it in the cytoplasm, thus forming a negative feedback loop. Experimentally, when the NF-κB system is suitably excited, the concentration of NF-κB in the nucleus begins to oscillate [10, 38].

How does the negative feedback loop of NF-κB produce oscillations? Physically, what is required for instability of the fixed point, and hence oscillations, is a time delay, i.e., a sufficient slowing down of the signal going around the loop. (If a perturbation in the concentration of one variable instantaneously affects the concentration of the next one, and so on, then for a negative feedback loop, any perturbation will be immediately cancelled and the steady state will be stable.) In cellular systems many processes could produce time delays: (i) a process that takes a finite minimum time, (ii) many intermediate steps, (iii) a sharp response by some of the variables, (iv) saturated degradation, or (v) autocatalysis (see Ref. [39] for more details). In the NF-κB system it is, in fact, saturated degradation of IκB that is behind the oscillations. NF-κB forms a complex with its inhibitor protein IκBα. This complex has the curious property that the external stimulus (a protein kinase called IKK) leads to a degradation of IκBα only when it is bound in the complex, and not when it is unbound. As a result, the degradation rate of IκBα has an upper limit, i.e., is saturated, due to the limited amount of NF-κB present and hence the limited amount of complex that can form.
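The parity rule stated at the start of Section 3 (an odd number of inhibitory links gives negative feedback, an even number gives positive feedback) can be checked mechanically for the NF-κB loop just described. The following is a minimal sketch: the function name `loop_type` and the encoding of activation as +1 and inhibition as -1 are assumptions of this example, and the last entry is a generic illustrative loop rather than a specific system from the text.

```python
# +1 denotes activation, -1 denotes inhibition (repression).
def loop_type(signs):
    """Classify a feedback loop from the signs of its links."""
    inhibitions = sum(1 for s in signs if s < 0)
    return "negative" if inhibitions % 2 == 1 else "positive"


loops = {
    "Hes1 self-repression (Fig. 5a)": [-1],
    "NF-kB -> IkBa mRNA -> IkBa -| NF-kB (Fig. 5d)": [+1, +1, -1],
    "repressilator, three repressors in a ring": [-1, -1, -1],
    "two mutual repressors (illustrative)": [-1, -1],
}

for name, signs in loops.items():
    print(f"{name}: {loop_type(signs)} feedback")
```

Equivalently, the loop type is given by the sign of the product of the link signs; counting inhibitions simply makes the parity argument explicit.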
Mathematically, it is possible to describe all the essential features of the NF-κB system using a very simple model consisting of only three variables [11], nuclear NF-κB ($N_n$), cytoplasmic IκB ($I$) and IκB mRNA ($I_m$):

\begin{align}
\frac{dN_n}{dt} &= A\,\frac{1 - N_n}{\epsilon + I} - B\,\frac{I N_n}{\delta + N_n}, \tag{1}\\
\frac{dI_m}{dt} &= N_n^2 - I_m, \tag{2}\\
\frac{dI}{dt} &= I_m - C\,\frac{(1 - N_n)\,I}{\epsilon + I}. \tag{3}
\end{align}
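To make the saturation explicit, the right-hand sides of Eqs. (1)–(3) can be written down directly and the degradation term of Eq. (3) evaluated over a range of IκB levels. This is a sketch under stated assumptions: the parameter values are those quoted in the caption of Fig. 6, `EPS` denotes the small constant added to I in the denominators of Eqs. (1) and (3), and the value Nn = 0.1 is an arbitrary illustrative choice, not a value from the text.

```python
# Parameter values as quoted in the caption of Fig. 6.
A, B, C = 0.007, 954.5, 0.035
DELTA, EPS = 0.029, 2e-5


def rhs(nn, im, i):
    """Right-hand sides of Eqs. (1)-(3): (dNn/dt, dIm/dt, dI/dt)."""
    d_nn = A * (1 - nn) / (EPS + i) - B * i * nn / (DELTA + nn)
    d_im = nn**2 - im
    d_i = im - C * (1 - nn) * i / (EPS + i)
    return d_nn, d_im, d_i


def ikb_degradation(nn, i):
    """Second term of Eq. (3): the degradation rate of IkB."""
    return C * (1 - nn) * i / (EPS + i)


nn = 0.1  # hypothetical nuclear NF-kB level, for illustration only
for i in (1e-6, 1e-5, 1e-4, 1e-2, 1.0, 100.0):
    print(f"I = {i:g}: degradation = {ikb_degradation(nn, i):.6f}")
print("ceiling C*(1 - Nn) =", C * (1 - nn))
```

However large I becomes, the degradation term approaches but never exceeds C(1 − Nn), and it reaches half of that ceiling at I = ε. Integrating the full system is not attempted here; because B is large and ε is small, a solver designed for stiff equations would probably be the safer choice.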
The saturated degradation is the second term in the last equation. Other terms in the equations model processes like nuclear import and export of NF-κB, production of IκB, etc. (see [11] for more details). An obvious question is why the cell requires oscillations in NF-κB in response to inflammation. This is a subject of much debate currently, and there is no clear answer. However, our model of NF-κB provides a possible clue: One property of the oscillations of nuclear NF-κB (in Fig. 6) that stands out is that they are extremely spiky. The spikiness is extremely robust to changes
S. Krishna, M.H. Jensen, and K. Sneppen
Fig. 6. (Left) Oscillations of nuclear NF-κB (Nn) (black curve) and cytoplasmic IκB (grey curve) for simulations of the model with A = 0.007, B = 954.5, C = 0.035, δ = 0.029 and ε = 2 × 10⁻⁵ (these parameter values are derived from the ones used in Ref. [10], see [11]). In order to facilitate comparison with the experimental plot (right, obtained from Ref. [38]), the x-axis has been limited to 600 minutes, but the oscillations are sustained.
Fig. 7. Sensitivity to IKK. (Left) Spike duration, the fraction of time Nn spends above its mean value, as a function of IKK concentration. (Right) Spike peak, the maximum concentration of nuclear NF-κB, as a function of IKK concentration. In both plots, the black dot shows the IKK value used in Fig. 6, which separates regions of spiky and soft oscillations [11].
in parameter values. In general, the existence and spikiness of the oscillations are very robust to changes in most of the parameters of the model [11]. However, the system shows a very sensitive response to change in one parameter: the external stimulus, IKK. Figure 7 shows that both the spike height (or peak level) and the spike duration can change by large amounts in response to small changes in the IKK level. Notice that this sensitivity is particularly high in IKK ranges which are near the transition from spiky to soft oscillations. It can be shown that this sensitivity can be transmitted to genes that are affected by NF-κB, producing a gene response sensitivity
Signaling and Feedback in Biological Networks
that is much larger than that obtained by other typical mechanisms which do not involve oscillations [40, 41]. Thus, oscillations could be a by-product of designing the system to have a very high sensitivity to small changes in the external stimulus.

3.2 Positive Feedback and Bistability in Yeast Epigenetics

Cells carry information handed down from their ancestors and are able to pass on information to their descendants. In many cases this “memory” is epigenetic (not stored in the DNA sequence), allowing cells with identical DNA to maintain distinct properties. Epigenetic cell memory implies alternative states that are stable over time and are inherited through cell division. One proposed mechanism for epigenetic cell memory invokes positive feedback loops in nucleosome modification [42].

Nucleosomes are protein complexes that package eukaryotic DNA, with a density of about one nucleosome per 200 base pairs (bp). The core nucleosome is composed of two molecules each of four core histone proteins. Nucleosomes may carry various chemical modifications (e.g. acetylation and methylation) at different amino acid positions on the different histones, conferring a large potential information capacity on each nucleosome. Specific additions and removals of these nucleosome modifications are carried out by classes of enzymes, including histone acetyltransferases (HATs), histone methylases (HMTs), histone deacetylases (HDACs) and histone demethylases (HDMs). At least some of these modifications affect the activity of nearby genes, in part because the modifications can alter the binding of regulatory proteins to the DNA. Positive feedbacks are present in this system because nucleosomes that carry a particular modification may recruit (directly or indirectly) the enzymes that catalyse similar modification of neighbouring nucleosomes. Thus, a cluster of nucleosomes may be able to maintain itself stably in a particular modification state.
These states can be inherited through DNA replication because nucleosomes on the parental DNA strand are distributed to both daughter strands [43], and the enzymes recruited by these parental nucleosomes may then establish the parental modification pattern on the newly deposited nucleosomes.

A specific case in which positive feedbacks in nucleosome modification result in multiple stable states occurs in the mating-type system of the eukaryote S. pombe (fission yeast) [44]. A ∼20 kbp region of S. pombe DNA containing two mating-type cassettes is normally in a stable “silenced” state, with the mating-type genes not expressed. In certain mutants where part of the silenced region is modified, the system is bistable, flipping between states where the ura4 gene is either expressed (active) or not (silenced). Each state is stable and heritable, with transitions occurring at roughly equal frequencies of ≈ 5 × 10⁻⁴ per cell division [44]. Switching appears to be stochastic and is determined by factors associated with the region itself. In the silenced state, but not the active state, the region is dominated by nucleosomes that are
Fig. 8. Illustration of basic ingredients of the model: Each oval represents a nucleosome that can be methylated (M), unmodified (U) or acetylated (A). Enzymatic transitions (solid arrows) between the three states are in part random (controlled by a noise level 1 − α), and in part autoregulated by recruitment (dotted lines) of enzymes (open symbols) by nucleosomes in the M or A state [45].
methylated at a particular site. An HMT that can catalyse this modification and certain HDAC proteins are known to be important for silencing. One can construct a simple network model [45] (schematically shown in Fig. 8) of the nucleosome modification system that exhibits all this behaviour, based on three simplifying assumptions. (1) There are only three relevant kinds of nucleosomes: unmodified, methylated and acetylated; methylation and acetylation are mutually exclusive. (2) The nucleosomes are enzymatically interconverted as shown in Fig. 8, by HMT, HDAC, HDM and HAT enzyme(s). (3) The HDAC and HMT enzyme(s) are recruited by methylated nucleosomes; the HDM and HAT enzymes are recruited by acetylated nucleosomes. This is what makes the feedback positive. To model S. pombe we take a system consisting of a fixed number of N = 60 nucleosomes, arranged on a 1-dimensional (1D) string. The region is isolated from neighbouring DNA by boundary elements [46], which we assume to be inert. Each nucleosome may be methylated (M), unmodified (U) or acetylated (A). At each time step one selects a random nucleosome n1 and attempts one of two changes: (a) With probability α one attempts a change associated to enzymatic activity of an enzyme recruited by another nucleosome in the modeled region. That is, one selects another random nucleosome n2 and if this is in either an
M or A state, the nucleosome n1 is changed one step toward this state. For example, when nucleosome n2 is an M: if n1 is an A, then it is changed to U and if n1 is a U it is changed to M. If nucleosome n1 and n2 are in the same state, or if n2 is a U, then no changes are made. (b) With probability 1 − α one attempts a change of the selected nucleosome n1: a U is changed to an M with probability 1/3, or to an A with probability 1/3, whereas an A or an M is changed to U with probability 1/3.

One may view process (a) as occurring due to the action of enzymes recruited by nucleosomes in the region within the isolating boundaries, whereas (b) reflects extrinsic noise caused by unrecruited enzymes. Thus, a lower α value indicates a higher noise level. In Fig. 9 we illustrate the dynamics of the model. One observes a fluctuating number of the three kinds of nucleosomes. In the upper panel α is small (noise is high) and the system has only one stable state, in which the nucleosome modifications are distributed randomly along the chain. In the lower panel, with a higher α, the system exists either in a state dominated by methylated nucleosomes or a state dominated by acetylated nucleosomes, with occasional switches between the two states. As α is increased further (i.e., noise is reduced) the states become more stable, and the switching occurs less often. However, the fact that the epigenetic states in the mutant S. pombe have a finite stability demonstrates that noise in the form of disordered methylation-acetylation events plays a crucial role.
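The update rules (a) and (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in Ref. [45]: the integer encoding of the states, the function names, the random initial condition and the choice to draw n2 from the nucleosomes other than n1 are all assumptions made here.

```python
import random

M, U, A = -1, 0, 1  # methylated, unmodified, acetylated (encoding assumed here)


def recruited(s1, s2):
    """Process (a): new state of n1 when the recruiting nucleosome n2 is in state s2."""
    if s2 == U or s1 == s2:
        return s1        # n2 unmodified, or both already in the same state
    return s1 + s2       # one step toward s2: A -> U -> M, or M -> U -> A


def noisy(s1, r):
    """Process (b): random conversion of n1, with r uniform in [0, 1)."""
    if s1 == U:
        if r < 1 / 3:
            return M
        if r < 2 / 3:
            return A
        return U
    return U if r < 1 / 3 else s1


def sweep(states, alpha, rng):
    """One time unit: one attempted update per nucleosome, applied in place."""
    n = len(states)
    for _ in range(n):
        n1 = rng.randrange(n)
        if rng.random() < alpha:
            n2 = rng.randrange(n - 1)   # pick a nucleosome other than n1
            if n2 >= n1:
                n2 += 1
            states[n1] = recruited(states[n1], states[n2])
        else:
            states[n1] = noisy(states[n1], rng.random())


def run(n=60, alpha=0.64, sweeps=200, seed=1):
    """Return the numbers (M, U, A) after each sweep, from a random start."""
    rng = random.Random(seed)
    states = [rng.choice((M, U, A)) for _ in range(n)]
    history = []
    for _ in range(sweeps):
        sweep(states, alpha, rng)
        history.append((states.count(M), states.count(U), states.count(A)))
    return history
```

Calling `run(alpha=0.40)` and `run(alpha=0.64)` gives time series of the kind plotted in Fig. 9; how often the sketch switches between states depends on the seed and on the assumptions listed above.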
Fig. 9. Time development of the standard model [45] for a system consisting of N = 60 nucleosomes with respectively α = 0.40 (upper figure) and α = 0.64 (lower figure). The light grey curve shows the number of methylated, dark grey the number of acetylated and black the number of unmodified nucleosomes. Time t is measured in number of attempted nucleosome updates per nucleosome.
This simplified model of epigenetic inheritance in eukaryotes provides some unexpected insights. First, it is very important that nucleosomes are modified by enzymes recruited by non-neighbouring nucleosomes. A “1D” variant of the model where nucleosomes can recruit enzymes to modify only one of their neighbours along the string does not produce bistability [45]. The difficulty of obtaining a clear two-state behavior in 1D arises for reasons similar to those preventing spontaneous magnetization in the 1D Ising model, or the helix-coil transition in polymer models [47, 48]. Second, it is also very important that the transition from, say, an M state to an A state requires two consecutive acetylation recruitments by nucleosomes in the A state, and therefore effectively has a rate ∝ A². Bistability is lost in variants where this two-step process is replaced by a single step [45]. The non-linearity produced by this kind of “cooperative” two-step modification appears to be essential for bistability. Most importantly, however, at low α, where the modification-demodification events are completely random (and hence there is no feedback), there is only one state where the nucleosome modifications are distributed completely randomly along the string. Thus, we can conclude that positive feedback is essential for bistability.
4 Combining Multiple Feedback Loops

In the previous sections we investigated the basic properties of single negative and positive feedback loops. In cellular networks, however, there are multiple entangled feedback loops. This can already be seen in Fig. 5, where some of the proteins are present in more than one example (LacI in Fig. 5c and g; cI in Fig. 5e and f). In an effort to understand how feedback loops interact and the range of dynamical behaviour possible, we begin by examining two interacting feedback loops.

Such two-loop network motifs are seen in a large class of cellular response systems designed to regulate the flux and concentration of small molecules. These systems control, via two feedback loops, the transport and metabolism pathways. Typically, these two loops are connected by a common transcriptional regulator that senses the concentration of the small molecule. For instance, in the arabinose utilization system in E. coli, when intracellular arabinose binds to the regulator AraC it alters its binding to DNA such that RNA polymerase and the protein CRP can bind and initiate expression of genes that increase import of extracellular arabinose as well as its metabolic consumption [49]. This is schematically shown in Fig. 10. Here, the transport is controlled by a positive feedback loop, while the metabolism is a negative feedback loop.

This is, of course, not the only logical combination of feedback loops possible. Figure 11 (left column) shows four logically distinct combinations of entangled transport and metabolism feedback loops. In each case, the two feedback loops are connected by a transcriptional regulator (R) that senses the concentration of a particular small molecule (s). One loop regulates transcription of
Fig. 10. Schematic illustration of molecular processes in a two-loop motif. This motif is found in the regulation of uptake and metabolism of, for example, maltose and arabinose [50, 49]. σ, s denote, respectively, extracellular and intracellular concentrations of the small molecule. The molecule binds to the regulator, R, forming the complex {Rs} which activates production of transport proteins, T , and metabolic enzymes, E. γ is a parameter controlling the metabolic rate per enzyme [13].
the transport proteins (T) facilitating the influx of the small molecule, while the other controls transcription of enzymes (E) responsible for the metabolism of s. The signs show the logic of each feedback loop: positive (+) or negative (–). Each motif can then be described by a notation of two signs, e.g. (+ –), which means that the transport loop is positive and the metabolism loop negative. Thus, there are four logical structures: the socialist (– –), the consumer (+ –), the fashion (– +) and the collector (+ +) [13]. Each can, in turn, be implemented in two distinct but logically equivalent ways, depending on whether s inhibits or activates R. This we denote using the notation (+ – i) or (+ – a), where the i (respectively, a) indicates inhibition (activation) of R by s. The i- and a-motifs with the same logic behave very similarly, so here we will concentrate on only the a-motifs.

The socialist motif. We call the (– –) motif the socialist because at low levels of extracellular s (low σ) it increases transport and reduces the metabolism, while at high levels of extracellular s, it does the opposite. Thus, the two negative feedback loops help maintain s robustly within a small concentration range. Such behaviour would be ideal for a system responsible for maintaining homeostasis. And indeed, a regulatory system with this logic is found in the iron homeostasis system in bacteria [51]: iron activates the ferric uptake regulator (Fur), which represses transcription initiation of iron uptake genes, and enhances production of iron-using proteins. For most organisms iron is essential for several proteins, but is poisonous at high concentrations. There,
Fig. 11. Behaviour of four entangled feedback loop motifs. Plots show the steady state values of s (middle column) and influx (σT = γEs + s, right column) as a function of σ. In all plots, the black curve shows the behaviour for the two-loop motif. The two other curves show the behaviour when only the transport loop is active (E = 1) and when only the metabolism loop is active (T = 1) [13].
the (– –) motif maintains the loosely bound iron within a narrow concentration range, and at the same time allows a high consumption of iron molecules by certain proteins that bind iron strongly.

The consumer motif. The (+ –) motif we term the consumer, because any amount of extracellular small molecule results in the increase of both transport and metabolism. Thus, it is ideal for food molecules. This logic is in fact typical for sugar transport and metabolism in prokaryotes. The gal [52] and lac [31, 32] operons in E. coli are the best studied of such systems. They both use the sugar molecule to inhibit the transcription factor
regulating transport and metabolism, the (+ – i) motif. In contrast, maltose [50] and arabinose [49] work by activating the regulation of transport and metabolism, the (+ – a) motif. In natural systems, transport and metabolic genes can be part of a single operon, as in lac [31], or separate operons, as in gal [52]. The latter arrangement allows non-coordinated regulation of transport and metabolism and therefore can be engineered to become bistable. This was also demonstrated by experiments on modified lactose and arabinose systems [53, 54], where the accompanying negative feedback loop was eliminated by inactivating E or using a non-metabolisable analogue of s, in agreement with our predictions from a similar cutting of the metabolic loop in Fig. 11.

The fashion motif. As the fashion motif (– +) is indeed the opposite of the consumer motif, both logically and functionally, it is not surprising that we have not found any simple example of it in the regulation of small molecules in living cells. However, its behaviour (and the reason we call it the fashion motif) can be illustrated in terms of a market model for a product which is desirable in small amounts. In such a scenario, the resource, s, is analogous to a fashion product, E to the consumers, and T to the producers. R can be considered the value of the product, measured in terms of how much people desire it. When there is plenty of the product s in the market, its value R decreases, which in turn decreases its consumption (a positive metabolism feedback loop) as well as the desire amongst producers to make more of it (a negative transport feedback loop), making it a (– +) motif. The non-monotonicity of the flux of the fashion motif translates in this analogy to a saturation of the market when a fashion product becomes too abundant: fashion products are most profitable when their availability is below a certain threshold.
When the fashion motif is supplemented with a positive feedback of R to itself, the collapse of fashion goods can occur with a remarkably small change in external supply, which is reminiscent of fashion “bubbles” in society [55]. Although the fashion motif does not make much rational sense for small molecule response systems, it may be seen as a mechanism for coherent behaviour in social organization.

The collector motif. The collector motif (+ +) is the logical opposite of the (– –) motif. Functionally it allows accumulation of a large amount of s, and is thus also functionally opposite to the socialist motif. Accumulation could be important for short periods of time, for instance, when an animal is preparing for hibernation. However, in such cases the (+ +) motif should eventually be overridden by another system which starts the consumption of the molecule. Such double positive feedback loops may be found in transcription regulatory networks and circuits involved in development and cell differentiation, but we failed to find any examples of them in small molecule regulation. Turning to a human analogy, the collector motif can be illustrated by making an analogy between s and the weight of a person. Then this weight increases with the intake of food (the analogue of transport), and is consumed by exercise (the analogue of metabolism). In this analogy R represents the internal “state” of the person, his or her mindset. An increase in a person’s weight, s, increases, via this internal state, their likelihood to eat more (positive transport feedback
loop) and also decreases their chance to exercise (positive metabolism feedback loop), thus forming a collector motif. The bistable behaviour of the collector motif would then contribute to a broadening of the weight distribution in human populations [60].

4.1 Two-Loop Motifs are More Than the Sum of Their Single Loops

Figure 11 also shows the behaviour of individual loops in these motifs, obtained by keeping either E or T fixed, thereby cutting feedback in one of the loops. The near constant value of s in (– –) comes from the metabolic loop’s ability to constrain s for low σ, and the transport loop’s ability to constrain s at high σ. Thus, the functionality of (– –) is dominated by the sub-motif that best prevents large variation of s and flux. The (+ –) obtains a steady increase in s and a step-like increase in flux with σ by using the negative metabolic loop’s ability to “smooth out” the bistability associated with the positive transport loop. The (– +) motif exhibits a remarkable non-monotonic behaviour of flux, which cannot be obtained from any of the sub-motifs. The (+ +) motif maximizes bistability, by extending it to the extreme of the two bistable regions of its sub-motifs. Overall, we can conclude that whole two-loop motifs are more than a simple sum of their parts.

4.2 Going Beyond Two Loops

Our analysis of two entangled feedback loops creates a framework for analysing small molecule regulatory circuits composed of multiple entangled feedback loops. For instance, the regulation of iron in E. coli, while being dominated by interactions that form a socialist motif [56, 57], also contains a positive feedback on the metabolism side involving usage of iron in FeS clusters [58]. An investigation of this three-loop motif suggests that two metabolism loops, connected like this in “parallel” (as opposed to the “series” connection between a transport and metabolism loop), are additive in behaviour [13, 59]. Due to this additiveness, iron regulation in E. coli is able to minimise variation of both the concentration of iron (a property of the socialist part) and the flux (a property of the fashion part) [56]. This indicates that an interesting direction to extend these ideas might be to try to formulate “design principles” for combinations of parallel and serially connected feedback loops.
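The flux relation quoted in the caption of Fig. 11 can be read as a steady-state balance for s. As a sketch only, assuming a rate equation of the form below (it is not written out in this section, and is inferred from that caption), influx through the transporters is balanced by enzymatic consumption plus a first-order loss term:

```latex
\frac{ds}{dt} = \sigma T - \gamma E s - s = 0
\quad\Longrightarrow\quad
\sigma T = \gamma E s + s ,
\qquad
s = \frac{\sigma T}{1 + \gamma E} .
```

Under the same assumption, cutting the transport feedback (T = 1) leaves s = σ/(1 + γE), and cutting the metabolism feedback (E = 1) leaves s = σT/(1 + γ). Because T and E themselves depend on s through the regulator R, these are implicit relations for s rather than closed-form solutions.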
5 Concluding Remarks

To extract a useful network representation to describe a particular cellular system, it is necessary to ascertain the sensible level of coarse-graining for that system: is it the whole-cell network, individual proteins/genes or something in between? There is, of course, no one answer to this question. In the examples above we have looked at a wide range of scales, from the entire E. coli network, to three or four component sub-networks, down to nucleosomes on DNA. On all these scales the dynamical behaviour is, however, constrained first by the available communication channels, and second by the logical properties of feedback loops in the network. To summarise, we extract the following main “lessons” from our case studies:

• The E. coli protein network is highly modular.
• The real E. coli network is more “stringy” than the randomised version, and this reduces constraints on signal propagation.
• Most feedback loops go through small molecules; there are very few in the transcription network.
• Biological function is coupled to the logic (positive/negative) of the feedback.
• Entangled feedback loops are “more” than a simple sum of their parts.
Acknowledgments

We thank our collaborators, with whom much of the work described here was done: J. Axelsen, I. Dodd, M. Micheelsen, S. Pigolotti, S. Semsey, G. Thon and G. Tiana. We acknowledge support from The Danish National Research Foundation and the Villum Kann Rasmussen Foundation.
References

1. S. Bornholdt and H.G. Schuster, eds., Handbook of Graphs and Networks: From the Genome to the Internet, Wiley-VCH, Weinheim (2002).
2. E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai and A.-L. Barabasi, Science, 297, 1551–1555 (2002).
3. S. Maslov and K. Sneppen, Science, 296, 910–913 (2002).
4. K. Sneppen, A. Trusina and M. Rosvall, Europhys. Lett., 69, 853 (2005).
5. A. Trusina, S. Maslov, P. Minnhagen and K. Sneppen, Phys. Rev. Lett., 92, 178702 (2004).
6. J.B. Axelsen, S. Bernhardsson and K. Sneppen, BMC Systems Biology, 2, 25 (2008).
7. S.S. Shen-Orr, R. Milo, S. Mangan and U. Alon, Nat. Genetics, 31, 64–68 (2002).
8. A. Samal, S. Singh, V. Giri, S. Krishna, N. Raghuram and S. Jain, BMC Bioinformatics, 7, 118 (2006).
9. S. Singh, A. Samal, V. Giri, S. Krishna, N. Raghuram and S. Jain, Eur. Phys. J. B, 57, 75–80 (2007).
10. A. Hoffmann, A. Levchenko, M.L. Scott and D. Baltimore, Science, 298, 1241–1245 (2002).
11. S. Krishna, M.H. Jensen and K. Sneppen, Proc. Natl. Acad. Sci. USA, 103, 10840–10845 (2006).
12. E. Aurell, S. Brown, J. Johansen and K. Sneppen, Phys. Rev. E, 65, 51914 (2002).
13. S. Krishna, S. Semsey and K. Sneppen, Proc. Natl. Acad. Sci. USA, 104, 20815–20819 (2007).
14. K.B. Arnvig, S. Pedersen and K. Sneppen, Phys. Rev. Lett., 84, 3005 (2000).
15. G. Tiana, M.H. Jensen and K. Sneppen, Eur. Phys. J. B, 29, 135 (2002).
16. M.H. Jensen, G. Tiana and K. Sneppen, FEBS Letters, 541, 176 (2003).
17. P.D. Karp et al., Nucl. Acids Res., 35, 7577–7590 (2007).
18. J.B. Axelsen, S. Krishna and K. Sneppen, J. Stat. Mech., P01018 (2008).
19. L.H. Hartwell, J.J. Hopfield, S. Leibler and A.W. Murray, Nature, 402(6761), C47–52 (1999).
20. S. Maslov, K. Sneppen and I. Ispolatov, New J. Phys., 9, 273 (2007).
21. S. Maslov and I. Ispolatov, Proc. Natl. Acad. Sci. USA, 104, 13655–13660 (2007).
22. S. Krishna, A.M.C. Andersson, S. Semsey and K. Sneppen, Nucl. Acids Res., 34, 2455 (2006).
23. R. Thomas, Quantum noise, Springer Series in Synergetics 9, Ed. Gardiner, Springer, Berlin, pp. 180–193 (1981).
24. E.H. Snoussi, J. Biol. Syst., 6, 3–9 (1998).
25. J.L. Gouzé, J. Biol. Syst., 6, 11–15 (1998).
26. J.E. Ferrell Jr., Curr. Opin. Cell Biol., 14, 140–148 (2002).
27. D. Angeli, J.E. Ferrell and E.D. Sontag, Proc. Natl. Acad. Sci. USA, 101, 1822–1827 (2004).
28. F.J. Isaacs, J. Hasty, C.R. Cantor and J.J. Collins, Proc. Natl. Acad. Sci. USA, 100, 7714–7719 (2003).
29. H. Hirata, S. Yoshiura, T. Ohtsuka, Y. Bessho, T. Harada, K. Yoshikawa and R. Kageyama, Science, 298, 840–843 (2002).
30. S.L. Harris and A.J. Levine, Oncogene, 24, 2899–2908 (2005).
31. F. Jacob and J. Monod, J. Mol. Biol., 3, 318–356 (1961).
32. P. Wong, S. Gladney and J.D. Keasling, Biotechnol. Prog., 13, 132–143 (1997).
33. H.L. Pahl, Oncogene, 18, 6853–6866 (1999).
34. M. Ptashne, A Genetic Switch: Phage Lambda Revisited, Cold Spring Harbor Laboratory Press, Cold Spring Harbor (2004).
35. S. Pigolotti, S. Krishna and M.H. Jensen, Proc. Natl. Acad. Sci. USA, 104, 6533–6537 (2007).
36. M. Schnarr et al., Biochimie, 73, 423–431 (1991).
37. M.B. Elowitz and S. Leibler, Nature, 403, 335–338 (2000).
38. D.E. Nelson, A.E.C. Ihekwaba, M. Elliott, J.R. Johnson, C.A. Gibney, B.E. Foreman, G. Nelson, V. See, C.A. Horton, D.G. Spiller et al., Science, 306, 704–708 (2004).
39. G. Tiana, S. Krishna, S. Pigolotti, M.H. Jensen and K. Sneppen, Phys. Biol., 4, R1 (2007).
40. C.Y. Huang and J.E. Ferrell Jr., Proc. Natl. Acad. Sci. USA, 93, 10078–10083 (1996).
41. A. Goldbeter and D.E. Koshland, Proc. Natl. Acad. Sci. USA, 78, 6840–6844 (1981).
42. G. Felsenfeld and M. Groudine, Nature, 421, 448 (2003).
43. A.T. Annunziato, J. Biol. Chem., 280, 12065 (2005).
44. G. Thon and T. Friis, Genetics, 145, 685 (1997).
45. I.B. Dodd, M.A. Micheelsen, K. Sneppen and G. Thon, Cell, 129, 813–822 (2007).
46. G. Thon, P. Bjerling, C.M. Brunner and J. Verhein-Hansen, Genetics, 161, 611 (2002).
47. B.H. Zimm, Proc. Natl. Acad. Sci. USA, 45, 1601 (1959).
48. H.A. Scheraga, Pure and Applied Chemistry, 36, 1 (1972).
49. R. Schleif, Trends Genet., 16, 559–565 (2000).
50. E. Richet and O. Raibaud, EMBO J., 8, 981–987 (1989).
51. E. Massé and M. Arguin, Trends Biochem. Sci., 30, 462–468 (2005).
52. M.J. Weickert and S. Adhya, Mol. Microbiol., 10, 245–251 (1993).
53. E.M. Ozbudak, M. Thattai, H.N. Lim, B.I. Shraiman and A. van Oudenaarden, Nature, 427, 737–740 (2004).
54. W.P. Smits, O.P. Kuipers and J.W. Veening, Nat. Rev. Microbiol., 4, 259–271 (2006).
55. R. Donangelo and K. Sneppen, Physica A, 316, 581–591 (2002).
56. S. Semsey, A.M.C. Andersson, S. Krishna, M.H. Jensen, E. Massé and K. Sneppen, Nucl. Acids Res., 34, 4960–4967 (2006).
57. N. Mitarai, A.M.C. Andersson, S. Krishna, S. Semsey and K. Sneppen, Phys. Biol., 4, 164–171 (2007).
58. F.W. Outten, O. Djaman and G. Storz, Mol. Microbiol., 52, 861–872 (2004).
59. M. Werner, S. Semsey, K. Sneppen and S. Krishna, preprint (2008).
60. U.S. EPA Exposure Factors Handbook, 1997, http://www.epa.gov/ncea/efh/
Topographic Spreading Analysis of an Empirical Sex Workers’ Network

Johannes Bjelland¹, Geoffrey Canright¹, Kenth Engø-Monsen¹, and Valencia P. Remple²

¹ Telenor R&I, 1331 Fornebu, Norway
² BC Centre for Disease Control Epidemiology, University of British Columbia, Vancouver, BC, Canada
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_6, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

1 Introduction

The problem of epidemic spreading over networks has received considerable attention in recent years, due both to its intrinsic intellectual challenge and to its practical importance. A good recent summary of such work may be found in Newman [8], while [9] gives an outstanding example of a non-trivial prediction which is obtained from explicitly modeling the network in the epidemic spreading. In the language of mathematicians and computer scientists, a network of nodes connected by edges is called a graph. Most work on epidemic spreading over networks focuses on whole-graph properties, such as the percentage of infected nodes at long time. Two of us have, in contrast, focused on understanding the spread of an infection over time and space (the network) [1, 2, 3]. This work involves decomposing any given network into subgraphs called regions [1]. Regions are precisely defined as disjoint subgraphs which may be viewed as coarse-grained units of infection, in that, once one node in a region is infected, the progress of the infection over the remainder of the region is relatively fast and predictable [3]. We note that this approach is based on the ‘Susceptible-Infected’ (SI) model of infection, in which nodes, once infected, are never cured. This model is reasonable for some infections, such as HIV, which is one of the diseases studied here. We also study gonorrhea and chlamydia, for which a more appropriate model is Susceptible-Infected-Susceptible (SIS) [7] (since nodes can be cured); we discuss the limitations of our approach for these cases below.

In this paper we apply the “topographic” regions-analysis approach to an empirical sex network, built from interviews with female sex workers (FSWs) in Vancouver, Canada. (See [3] for a detailed discussion of the “topographic”
approach.) The network consists of the FSWs themselves, plus their sex partners (paid and unpaid), as well as any partners of these partners who were known to the FSW. This method, beginning with 49 interviewed FSWs, gave a highly connected network of 553 nodes [10]. Furthermore, STI (sexually transmitted infection) status was obtained for many of these nodes. In particular, two of the nodes were identified as being HIV-positive, while 11 other nodes have either gonorrhea, chlamydia, or both.

From the collected network data we build an adjacency matrix, where element a_ij = 1 if node i has a link to node j, and zero otherwise. (In the case of a weighted graph, element a_ij equals the strength of the link from node i to node j.) The components of the principal eigenvector of the adjacency matrix measure each node’s centrality in the graph; this measure is called the eigenvector centrality or EVC. The EVC scores for the nodes in the (weighted or unweighted) network give the starting point for our approach: they are used for assigning the nodes to regions, and for predicting the spreading of disease within and between regions.

The aims of this work are several. One goal is to extend our earlier topographic approach to a graph with weighted links. As we will see, this seemingly small change can have very large effects; but we will also see that the validity of our approach is confirmed, in spite of these large effects. This is because the modified approach (presented here for the first time) is consistent: we use the link weights to modify the graph’s adjacency matrix, and hence the nodes’ EVC values; and we use them again when we define the regions via the steepest-ascent graph (SAG). A second aim of this work is to try to exploit the insights gained from the topographic analysis, in order to find novel suggestions for preventive actions to hinder the spread of the disease in question. We find that our progress towards this second goal is considerably more modest than that towards the first goal.
We will show “thought experiments,” based on the empirical graph topology and link strengths, for which our analysis is extremely useful. However, we will not find practical suggestions which are immediately promising for the given Vancouver FSW graph. There are several reasons for this. First, the HIV graph is so thoroughly protected by condom use that we find little to add in terms of ideas for preventive measures. Second, the graphs for gonorrhea and for chlamydia are so thoroughly well connected, and also so well infected, that we do not find small topological changes which can make a large difference. We note that our approach treats the network as static; hence any effects of network dynamics are not taken into account. We believe, however, that our qualitative results are fairly robust to the likely dynamics of this network, since its overall structure is thought to be fairly stable over time. Also, our analysis (once the network is mapped out—which can be time consuming!) is not computationally demanding, and so may be performed in essentially zero time compared to the time scale of epidemic spreading. Hence any suggestions resulting from the analysis may be implemented in something approaching real time.
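The eigenvector centrality introduced above can be computed by power iteration. The following Python sketch is a minimal illustration and not the authors’ implementation: it assumes a symmetric adjacency matrix given as a list of lists for a connected graph, and it iterates with A + I rather than A (a shift chosen here so that the iteration also converges on bipartite graphs, where plain power iteration can oscillate). The shift leaves the eigenvectors unchanged. The function name and the three-node example are hypothetical.

```python
import math


def eigenvector_centrality(adj, iterations=200):
    """Principal eigenvector of a symmetric adjacency matrix (unit L2 norm).

    adj[i][j] is 1 (or a link weight) if i is linked to j, else 0.
    Iterates x <- (A + I) x, which has the same eigenvectors as A.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


if __name__ == "__main__":
    # Hypothetical three-node chain 0 - 1 - 2: the middle node is most central.
    chain = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
    print(eigenvector_centrality(chain))
```

For a weighted graph the same function applies unchanged with link strengths in place of the 0/1 entries, which is how link weights enter the EVC values in the approach described above.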
Topographic Spreading Analysis
2 Uniform Transmission Model

First we study the FSW graph without taking into account the link weights. That is, each sexual contact is given strength “1” in the adjacency matrix. This is logically equivalent to giving each link the same probability of transmission per unit time. Our purpose in doing this analysis is to be able to compare with the analysis done using non-uniform link strengths (transmission probabilities). As we will see, the differences are large and important.

2.1 Visualization and Bipartiteness

Our topographic analysis includes a novel approach to graph visualization: we group the nodes into their respective regions, and lay out the whole graph according to the SAG [4]. We present the basic ideas here, and refer the reader to earlier papers [1, 3, 4] for details. We view the EVC of a node as a measure of its “well-connectedness” and hence of its “spreading power.” Then we single out local maxima of the EVC as being particularly important in spreading; we call these nodes Centers. Also, since EVC (being recursively defined) is “smooth,” we can speak of “neighborhoods” in the graph as having a typical EVC; and we conclude that spreading is fast in neighborhoods of high EVC, and slow in “lower” neighborhoods. We then define regions of the graph, one for each Center. Each node finds its region (mountain) by following a steepest-ascent path until it terminates at a local maximum (mountaintop, or Center). The set of steepest-ascent paths then forms a directed hierarchical tree graph (the SAG), which is useful both for visualizing the graph and for predicting the likely paths of fastest epidemic spreading. In a tree graph, any two nodes are connected by exactly one path, and there are no cycles (closed loops of links).

The SAG for the unweighted FSW graph is shown in Fig. 1. We note several interesting points from this visualization. (i) There are many regions (17). (ii) All the Centers (most central nodes in each region) are men.
(iii) Many regions are small, i.e., 1–3 nodes, while (iv) the bulk of the nodes (517/553) lie in one of the three largest regions (red—marked R in the figure, blue (B), dark grey (G)). (v) Every region is well connected to the largest, red, region. Hence the red region is expected to play a dominant role in any epidemic spreading. (vi) One HIV-positive node is in the red region, and the other is (while in its own region) well connected to the central part of the red region. Now we comment on these points. We believe that points (i)–(iii) derive from the fact that the graph is nearly bipartite. A bipartite graph consists of two sets of nodes, such that all links are made between the two sets, and there are no links between nodes in the same set. Now we suppose (which is almost true) that the FSW graph is a strictly bipartite graph composed of M and F nodes. If we further assume that an M node is a Center (local maximum of centrality), then all of its neighbors are (a) female, (b) highly central,
J. Bjelland et al.
Fig. 1. Regions visualization of the FSW network, with all links set to equal strength. Only the links in the SAG are shown here for visual clarity. The most central node in each region is enlarged. The three largest regions, which will be discussed further in the text, are labeled R (red), G (grey), and B (blue). - Male, - Female.
and (c) automatically excluded from being a Center. Thus bipartiteness will tend to favor one gender over another. By the same token, highly central M nodes are never neighbors of other M Centers, and so are candidate Centers themselves. Hence there may be a tendency for more, and smaller, regions. Points (iv)–(vi) tell us that this network is highly prone to infection: the many regions are not well isolated from one another, because of their common connection to the dense, infectious red region. Also, the two start nodes are in or near the central part of the red region, where spreading is fast.

2.2 Infectious Spreading on the Unweighted Graph

We have simulated spreading on the uniform FSW network, by giving each link the same probability per unit time for spreading. The value used is thus arbitrary, as is the unit of time. We typically use a value of a few percent, since much larger values give a very unsmooth time evolution (equivalent to a poor time resolution). We report the results here because they are illustrative of the strengths and weaknesses of our method, for the case of multiple regions. (For reasons given below, these are the only multi-region simulations that we can perform with this graph.)
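The region assignment of Section 2.1 (each node follows a steepest-ascent path in EVC until it reaches a Center) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the five-node chain, and the EVC scores are made up for the example and are not taken from the FSW data. On an unweighted graph, steepest ascent simply means stepping to the highest-EVC neighbor.

```python
def steepest_ascent_regions(neighbors, evc):
    """Return (parent, region): SAG parent links and each node's Center."""
    parent = {}
    for node, nbrs in neighbors.items():
        # Highest-EVC neighbor; a node with no neighbors points to itself.
        best = max(nbrs, key=lambda n: evc[n], default=node)
        # A node with no strictly higher neighbor is a local maximum (a Center).
        parent[node] = best if evc[best] > evc[node] else node

    def center_of(node):
        # EVC strictly increases along parent links, so this loop terminates.
        while parent[node] != node:
            node = parent[node]
        return node

    return parent, {node: center_of(node) for node in neighbors}

# Illustrative chain a - b - c - d - e with made-up EVC scores.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
evc = {"a": 0.10, "b": 0.30, "c": 0.20, "d": 0.35, "e": 0.05}
parent, region = steepest_ascent_regions(neighbors, evc)
```

In this toy example nodes b and d are the two Centers, and node c, which lies between them, joins the region of d because d is its highest neighbor.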
Fig. 2. Same visualization as Fig. 1, except that all links are shown. The arrows mark the known HIV-positive nodes.
Taking the start (infected at t = 0) nodes as shown in Fig. 2 above, we find, as expected, that the regions as we define them here are again valid coarse units of infection. We also find that it is difficult to stop or even retard the infection, because of the topology of the graph. The upper part of Fig. 3 shows a typical epidemic progression, with the growth in the red, blue, and grey regions resolved. All three “take off” at about the same time, and the infection spreads rapidly. Measures to retard spreading in the red region— without resorting to large topological change—are not found to be effective. We find however that protecting one node—the Center of the grey region— drastically weakens the red/grey connection. We see in the bottom part of Fig. 3 the results when this is done: the red and blue regions take off as before, but the grey region’s takeoff is greatly retarded. This is an example of the kind of benefit that we believe can be obtained from our analysis. We also considered the more promising problem of an infection starting in the grey region—again motivated by the observed red ⇐⇒ grey bottleneck in the topology. The top of Fig. 4 shows that takeoff is retarded by a factor of about 3, compared to the former case (top of Fig. 3). It is retarded even further (about 7 times as slow) if we in addition protect the grey Center (bottom of Fig. 4).
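The kind of simulation described in this section can be sketched as a discrete-time SI process in which every link between an infected and an uninfected node transmits with a given probability per time step. The code is a minimal sketch, not the authors' simulator: the function name, the `immune` argument (standing in for a “protected” node), the choice to drop immune nodes from the start set, and the four-node toy graph are all assumptions.

```python
import random

def simulate_si(weights, start, immune=(), steps=400, seed=0):
    """weights: dict (i, j) -> per-step transmission probability (undirected link)."""
    rng = random.Random(seed)
    blocked = set(immune)
    # Assumption: an immunized start node is treated as cured.
    infected = set(start) - blocked
    history = [len(infected)]
    for _ in range(steps):
        newly = set()
        for (i, j), p in weights.items():
            # Only links with exactly one infected endpoint can transmit.
            if (i in infected) == (j in infected):
                continue
            target = j if i in infected else i
            if target not in blocked and rng.random() < p:
                newly.add(target)
        infected |= newly  # synchronous update at the end of the time step
        history.append(len(infected))
    return history

# Toy chain a - b - c - d with certain transmission (illustrative only).
links = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
history = simulate_si(links, start=["a"], steps=5)
```

For the uniform model of this section one would set every link to the same small value (for example 0.03), and compare runs with and without a protected node passed in `immune`.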
Fig. 3. HIV spreading simulation without (top) and with (bottom) measures to isolate the grey region from the red region. In each plot, there are four growth curves, showing the total growth of the infection (‘Sum’), and the growth for the red (R), grey (G), and blue (B) regions (the largest regions in the network).
Fig. 4. Same simulation as Fig. 3, except that the infection starts from a peripheral node in the grey region.
3 Links Weighted with Transmission Probabilities

In this section we add an important further element of realism by weighting the links of our FSW graph with transmission probabilities. We are forced in many cases to use rather crude approximations. Nevertheless, we feel that the resulting model is considerably closer to reality than the uniform model. Also (as we will see) it is strikingly different: in particular, each disease will have its own graph. That is, while the basic topology is the same as that in Fig. 2, the set of link weights depends on the disease, because these weights represent transmission rates (probability/time). In fact, for the HIV case, the topology itself is changed, since we set some link strengths to exactly zero.

In practice, incorporating the link strengths into the analysis involves (1) building a weighted adjacency matrix W using the link strengths, (2) finding the corrected EVC as the dominant eigenvector of this matrix W, and (3) redefining “steepest ascent” to take account of the varying link strengths. The first two steps are clear; and we describe step (3) in Section 3.2. Of course, before doing any of this, we must find the link strengths. We describe our procedure for doing so in the next section.

3.1 Estimating the Probabilities

For each link we want a single weight (number) which gives the probability per unit time of transmission from an infected node to an uninfected node. This probability is based on a number of factors which must be estimated from limited data. We list these factors schematically as follows:

Transmission probability/unit time =
    [(unprotected probability/contact) × (non-condom use prevalence) × (contacts/time)]
  + [(protected probability/contact) × (condom use prevalence) × (contacts/time)]

Now we discuss each factor in turn. For each disease (HIV, gonorrhea or ‘NG’, and chlamydia or ‘CT’) we estimate (unprotected probability/contact) from Ref. [6]. See Table 1.
To correct for condom use, we must know the frequency of condom use for each link (condom use prevalence). For 256 links (about 17% of them) we have an estimate for (condom use prevalence) from survey data [10]. We know very little about the remaining links, except for whether they are a “client” relationship or a “non-client” relationship. We explain below how we generate link weights for the links for which we have no survey data.

Table 1. Transmission probabilities/contact for NG (gonorrhea), CT (chlamydia), and HIV.

              NG      CT      HIV
Unprotected   0.43    0.10    0.05
Protected     0.16    0.074   0

Estimates for (contacts/time) were available (again) for those links for which we obtained survey information; however, here we have yet another source of uncertainty. That is, each interviewed FSW reported contacts with “regulars” and also contacts with new or “non-regular” customers. We take the reported estimates of (contacts/time) for regulars as given. For the non-regulars, we assume that either (i) they will become regular in the future, or (ii) they will be replaced by other non-regular customers who play essentially the same role in the network. In short: we ignore the distinction between cases (i) and (ii).

We still need a reasonable estimate of contacts/time for non-regulars. We proceed as follows: for each FSW, we define T to be the total number of contacts per unit time (summed over all neighbors). Also we let P be the fraction (percentage) of contacts from regulars, and let C be the number of contacts/time from regulars. Then clearly C = PT; and since we can estimate both P and C from the survey data, we get an estimate of T (= C/P). We then estimate the total contacts/time N for non-regulars to be N = T − C. Finally, we take, from the survey data, the expected number of non-regular neighbors (still for each FSW), and call this number K. We then (finally) get the expected contacts/time for each non-regular as N/K. Our model is clearly very crude, treating each non-regular in a very average way; but it enables us to move (as we will see) well beyond the equal-transmission-probability model, and so, we believe, much closer to reality.

Now we come to the term due to protected sex. We estimate (protected probability/contact) by correcting the (unprotected probability/contact) data, using data for the correction due to condom use from [5].
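The arithmetic of this section can be sketched as two small functions: one for the contact rate of a non-regular partner (T = C/P, N = T − C, rate = N/K) and one for the link weight itself. This is a hedged sketch: the function names and all survey numbers in the example (C, P, K, condom use, contacts per day) are hypothetical; only the gonorrhea probabilities 0.43 and 0.16 come from Table 1. P is assumed to be expressed as a fraction between 0 and 1.

```python
def nonregular_contact_rate(c_regular, p_regular, k_nonregular):
    """Expected contacts/time for each non-regular partner of one FSW."""
    total = c_regular / p_regular          # T = C / P
    nonregular_total = total - c_regular   # N = T - C
    return nonregular_total / k_nonregular # N / K

def link_weight(p_unprotected, p_protected, condom_use, contacts_per_time):
    """Transmission probability per unit time for one link."""
    per_contact = (p_unprotected * (1.0 - condom_use)
                   + p_protected * condom_use)
    return per_contact * contacts_per_time

# Hypothetical survey answers: 6 contacts/week from regulars, who account
# for 60% of all contacts, and 4 non-regular partners.
rate = nonregular_contact_rate(c_regular=6.0, p_regular=0.6, k_nonregular=4)

# Gonorrhea values from Table 1; 90% condom use and 0.1 contacts/day are
# made-up illustration values.
w_ng = link_weight(0.43, 0.16, condom_use=0.9, contacts_per_time=0.1)
```

With these made-up inputs each non-regular partner is assigned one contact per week, and the gonorrhea link weight is about 0.019 per day. The expression is a rate that can be read as a probability only while it stays small, which holds for the per-day values reported later in the chapter.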
We note here that we set (protected probability/contact) for HIV to be exactly zero. Not surprisingly, this will have dramatic effects on the spreading behavior—as we will see in Section 3.3. This completes our prescription for estimating link weights for those links for which we have survey data. We then used a very simple approach—which we find appropriate to the high degree of uncertainty in our data—to estimate the remaining link weights (transmission probabilities/time). Our solution here is to first divide all links (surveyed and not surveyed) into two groups: client and non-client. Then, for each group, we simply reproduced the distribution over the “surveyed” links so as to also assign transmission probabilities to all of the “non-surveyed” links. Since the survey data is discrete, the link-weight distribution obtained is never smooth. Hence we reproduced these discrete distributions by simply repeating (sampling) each value in the discrete distribution with a probability equal to its frequency in the distribution. That is: we do not attempt to create distributions for each parameter in the link-weight estimate; instead we simply copy the discrete link weight values obtained from the survey data onto the unknown links, with appropriate probabilities.
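The copying of surveyed link weights onto unsurveyed links can be sketched as sampling with replacement from the list of surveyed values, done separately for the client and non-client groups. The function name, the seed, and the example weights and link labels below are hypothetical.

```python
import random

def fill_unsurveyed(surveyed_weights, unsurveyed_links, seed=0):
    """Assign each unsurveyed link a weight drawn from the surveyed values."""
    rng = random.Random(seed)
    # Choosing uniformly from the raw list (duplicates kept) draws each
    # distinct value with probability equal to its frequency in the survey.
    return {link: rng.choice(surveyed_weights) for link in unsurveyed_links}

# Made-up weights for the "client" group; the same call would be repeated
# for the "non-client" group with its own surveyed values.
client_surveyed = [0.002, 0.002, 0.002, 0.010]
client_unsurveyed = [("f1", "m7"), ("f2", "m9"), ("f3", "m4")]
assigned = fill_unsurveyed(client_surveyed, client_unsurveyed)
```

Because whole link weights are copied, no separate distribution has to be fitted for condom use, contact rate, or any other individual factor.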
3.2 SAG∗

Now we address another complication arising from the use of weighted links: we must reconsider the definition of the steepest-ascent graph (SAG), which is used both for assigning region membership and for visualization purposes. Our point here is simple, namely that the definition of steepest ascent should take account of the link strength. This rather obvious point has not been addressed in our earlier use of the SAG [1, 2, 3], because these earlier studies were applied to unweighted graphs. Hence we offer a brief account here of the modification used for weighted links.

We recall that region membership is assigned by in essence asking each node to find the steepest path to the “top,” i.e., to the “nearest” local maximum of the EVC. The notion of local maximum is independent of link strength. Suppose, however, that a node N has two local maxima (Centers, C1 and C2) as neighbors: which region do we place N in? Since we want steepest-ascent paths to represent most likely spreading, it seems reasonable that a neighbor C1 with a very weak link to N should not be assigned the steepest-ascent path, even if it is somewhat higher (in EVC) than C2. In other words, if we retain the notion that steepest ascent gives the right answer, then we clearly want to define the slope as being

    slope = Δy/Δx,    (1)
with Δx (‘distance’) decreasing with increasing link strength. Clearly, Δy is the EVC difference, as in earlier (unweighted) work; hence we simply need some reasonable definition for the “distance” Δx. We take here the simple heuristic Δx(i, j) = 1/W(i, j), with W(i, j) the link strength (transmission probability) between nodes i and j. Our point here is then that node N may find that it is not simply in the region of its highest neighbor: instead, it will be placed in the same region as the neighbor N∗ with the highest slope Δy/Δx = [EVC(N∗) − EVC(N)] W(N, N∗). In short, if its link to the highest neighbor is very weak, then (reasonably) it will be placed instead in the region of a neighbor with a stronger link. We believe this is consistent with our aim for defining regions, namely that a region is a coarse-grained unit of infection, such that infection within a region is relatively fast and predictable. We call the resulting steepest-ascent graph SAG∗ (to distinguish it from the SAG, which does not take link strengths into account).

We will see below that our spreading simulations can only give a limited test of our SAG∗ definition, since in one case (HIV) the weighted network breaks down, while in the other two (NG and CT) we only obtain a single region. Hence, while we retain a belief that our definition is promising, a thorough test will have to await application to a weighted graph which (i) has several regions, but yet (ii) is better connected than our HIV graph of the following section.
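The weighted steepest-ascent rule can be sketched directly from Eq. (1) with Δx = 1/W: each node points to the neighbor that maximizes [EVC(neighbor) − EVC(node)] × W. The sketch below reproduces the situation discussed in the text, a node N with a weak link to the higher Center C1 and a stronger link to C2. The function name, the EVC scores, and the link strengths are assumptions chosen for illustration; the scores are not the true eigenvector of this toy graph.

```python
def weighted_steepest_ascent(weights, evc):
    """weights: dict node -> {neighbor: link strength W}. Returns SAG* parents."""
    parent = {}
    for node, nbrs in weights.items():
        best, best_slope = node, 0.0
        for nbr, w in nbrs.items():
            # slope = delta_y / delta_x with delta_x = 1 / W
            slope = (evc[nbr] - evc[node]) * w
            if slope > best_slope:
                best, best_slope = nbr, slope
        # A node with no positive slope is a Center and points to itself.
        parent[node] = best
    return parent

weights = {"N": {"C1": 0.001, "C2": 0.02},
           "C1": {"N": 0.001},
           "C2": {"N": 0.02}}
evc = {"N": 0.10, "C1": 0.50, "C2": 0.40}
parent = weighted_steepest_ascent(weights, evc)
```

Here the slope towards C1 is 0.4 × 0.001 = 0.0004 and the slope towards C2 is 0.3 × 0.02 = 0.006, so N joins the region of C2 although C1 has the higher EVC. A zero-weight link always has slope zero and is therefore never followed.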
3.3 HIV Graph

The SAG∗ for our weighted HIV graph is shown in Fig. 5. We see immediately that the contrast with Fig. 1 is enormous. In particular, the 17 regions of Fig. 1 have multiplied many times. In addition (which is not so easily seen in the figure) some nodes are completely disconnected due to the zero-weight links, and hence do not appear in the figure at all. The apparently isolated nodes in the corner of the figure are one-node regions; such regions occur typically on the periphery of a graph, where all EVC values are small. What is even more striking is that adding all non-zero links to the SAG∗ picture of Fig. 5 makes very little change; that is, there are only six non-zero links which are not shown in the figure (four connecting the one-node regions to one other node each, and two other inter-region links). Hence we do not show the full graph: it is essentially that of Fig. 5. This means in turn that HIV spreading, while seemingly unstoppable in the picture obtained from Fig. 2, is in fact not a problem for this FSW network. In particular, the
Fig. 5. Regions analysis for the HIV graph, corrected with the transmission probability on each link. Note that the graph breaks into very many small regions, due to the (assumed) zero transmission probability for reliable condom use. The two enlarged nodes are known to be HIV-infected; the four nodes in the upper left corner are single-node regions in the weighted graph.
two HIV-positive (male) nodes (marked with large squares in Fig. 5) are each confined to an effective two-node network, consisting of themselves and their non-client partner. Hence our expected picture of condom use for this empirical network implies that HIV spreading will be limited to the non-client partner relationships of the two infected nodes, and so has effectively zero probability of reaching the rest of this dense sexual network. Because the effective graph is so fragmented, and also because the HIV-infected nodes are effectively isolated, we have not performed spreading simulations on the weighted HIV graph. We note that the largest region in Fig. 5 has 24 nodes, with an FSW as the most central node in the region. In fact the strongly bipartite picture obtained from the unweighted graph (Fig. 1) has also broken down here: both male and female Centers of the many regions are found. This is however not so surprising, given the fragmented nature of the effective graph.

3.4 Gonorrhea

Figure 6 shows the steepest-ascent (SAG∗) graph when we use link strengths appropriate to gonorrhea. Since 100% condom use does not give 100% protection [5], the effective gonorrhea graph has all the same links as were present
Fig. 6. Region (SAG∗ ) visualization for the gonorrhea network NG. The enlarged nodes are known to be STI-infected.
in Fig. 2; but they are reweighted. We see that the reweighting has still had a dramatic effect. In particular, the 17 regions found for the unweighted graph are now a single region for the weighted graph. Also, the Center of this one region (and so of the entire graph) is an FSW. An interesting aspect of the gonorrhea SAG∗ is that one of the few existing homosexual (FSW ⇐⇒ FSW) links plays a very central role in the graph: the link between the Center and the head of the large red subregion is homosexual. This means that the two women involved are highly central in the weighted graph, and also that the link strength between them (transmission probability for gonorrhea) is not too small. One might then propose to remove this link— which (as it is certainly requested and paid for by a male customer) should be possible. However as we will see below, removal of this link—or any single link—has little or no beneficial effect. (This conclusion is perhaps intuitively grasped from the fully linked visualization of Fig. 7 below.) SAGs of either type are strict hierarchical structures—that is, they are directed trees, with links pointing strictly towards the root (Center). This means that, for any given region, one can readily define subregions in terms of branches of the tree. We have picked out the five largest branches of the gonorrhea SAG∗ and color coded them. We see that it is visually meaningful to think in terms of subregions for this region.
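Since a SAG is a directed tree pointing towards the Center, the subregions mentioned above can be read off as the branches hanging from the Center. The sketch below counts the nodes in each branch; the function name and the small parent map are hypothetical, and the code assumes the parent-map convention in which a Center points to itself.

```python
from collections import Counter

def subregion_sizes(parent):
    """Count nodes per branch of a SAG tree; keys are the subregion heads."""
    def head(node):
        # Climb until the next step up would be the Center (the root).
        while parent[parent[node]] != parent[node]:
            node = parent[node]
        return node
    # The Center itself belongs to no branch and is skipped.
    return Counter(head(n) for n in parent if parent[n] != n)

# Toy tree: Center C with two branches headed by h1 and h2.
parent = {"C": "C", "h1": "C", "h2": "C", "x": "h1", "y": "h1", "z": "x"}
sizes = subregion_sizes(parent)
```

Sorting the resulting counts would give the largest branches, which is one way to pick out the major subregions for color coding.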
Fig. 7. Same layout as in Fig. 6, but with all non-zero links displayed.
Figure 7 shows the NG graph again, but with all links displayed. We note that presently infected nodes are enlarged and marked yellow (lighter grey in printed version) in Fig. 6 and in Fig. 7. From Fig. 6 we see two infected nodes lying at the heads of their (large) respective subregions, and hence only one hop from the Center. Also we see that every major subregion is already infected. This immediately suggests that preventing the further spreading of gonorrhea on this graph will be quite difficult. This pessimistic prognosis is also supported by the visualization of Fig. 7. Here we see that all the major subregions are well connected to one another, with infected nodes lying in the heart of a dense cloud of links. We will test (and confirm) this pessimistic prediction via stochastic simulations; see Section 4.

3.5 Chlamydia

In Fig. 8 we show the SAG∗ visualization of the chlamydia graph. Qualitatively we see much the same picture as for the NG graph: a single region, with an FSW at the Center of the region. In fact, the homosexual dyad that we found lying centrally in the NG graph is also central here, with the one difference that here the two FSWs have exchanged roles (Center and subregion head). Our SAG∗ visualizations suggest that the CT graph is perhaps even more well connected than the NG graph, in that there are very few subregions,
Fig. 8. Region (SAG∗ ) visualization for the chlamydia network CT. Enlarged nodes are known STI-infected nodes.
and they are very large. And since (again) every major subregion is infected, we arrive at the same qualitative prognosis for this graph: it will be difficult to hinder the further spreading of the disease. We have also plotted the analog of Fig. 7 for chlamydia—that is, the full graph with all non-zero links. The result is again qualitatively like that of Fig. 7; hence we do not show it here.
4 Spreading on the Gonorrhea Graph

For reasons already given, we have not run spreading simulations on all three disease graphs. The HIV graph is so heavily disconnected by the many condom-use-induced zero links that we see no point in running simulations on it. Of course, these links, involving as they do real sexual contact, do not have exactly zero probability for infectious spreading, even with 100% condom use. Also the reported rates of 100% condom use are most likely overstated in many cases. Hence it would be of interest to set the strength of these “zero HIV links” to some small but positive value, and to examine the resulting graph. We reserve this idea for future work.

The remaining two graphs (NG and CT) are qualitatively very similar. Hence we have chosen to focus on one of them: the NG (gonorrhea) graph. We must emphasize immediately however that our simulations, being based on SI dynamics [8], do not accurately model the long-time dynamics of diseases such as gonorrhea and chlamydia. A more appropriate model would be the SIS model [7] in which Infected nodes become again Susceptible after a variable time period. We expect the SI model to give qualitatively correct results in the early stage of any infectious process, when few nodes are infected and they have not had time to recover. Beyond this early stage the SI model can only overestimate the degree of spreading. Hence we present simulation results in this section, based on the SI model, with two principal caveats:

• Takeoff of the disease will likely occur later for the more realistic SIS model than what we show here.
• The long-time infected fraction will not approach 100%, but rather a lower value.
With these caveats clearly in mind, we present some simulations on the gonorrhea graph. Our aim is to see what insights we can gain from our SAG∗ picture. We will focus principally on when the infection takes off. Because we simply compare different scenarios (and their takeoff times) with one another, we feel that our (comparative) conclusions are not greatly weakened by the caveats given above. Our procedure for simulation is the same as before: at each time step, each link ij has a probability pij = W (i, j) of transmitting the infection if exactly one of the pair ij is already infected. Our link strength data, when the unit
of time is one day, have values which vary from a few percent down to about 10^−4. With these small values we can increment the simulator with a time step of one day, and get smooth results. Our simulations differ from one another in three ways: (i) the choice of “start” nodes which are infected at t = 0; (ii) the choice of a set of “immune” nodes which cannot be infected; and (iii) sometimes, the choice of links which are to be blocked from transmission (removed). Choices (ii) and (iii) allow us to test various strategies for hindering spreading. In the real world of human sexual behavior, accomplishing either of these effects may be quite difficult; but we test them here simply to see what can be achieved.

First we simulate the reference case, in which those nodes which are known to be infected are the start nodes (see again Figs. 6 and 7), and we immunize no nodes or links. We find (Fig. 9) that the infection takes off very fast, as anticipated in Section 3.4. Specifically, we see that the takeoff time is very short: just a few days. This is consistent with the fact that the infection has already reached three very central (as defined by EVC) nodes. This latter fact is consistent with two interpretations: either (i) the infection has recently come to this dense network, and it is on the verge of taking off, or (ii) the infection has been present for a long time, and has reached an equilibrium (and rather low) level.

[Plot for Fig. 9: number of infected nodes (0 to 600) versus time (0 to 400 days). Curves: As-is; Center; Within 1 hop from center; STI red region + head of region; 50 random.]
Fig. 9. Spreading simulations for gonorrhea, based on the SI model, and using various prevention strategies. “As-is” = known infected start nodes and no strategy; the other scenarios involve immunizing various nodes, as described in the text. The unit of time is one day.
We do not have sufficient empirical information to favor one of these interpretations over the other. If the first one is correct, it implies that one can expect a strong growth of infection rate in a relatively short time. If the second is correct, then our model is likely inadequate, not only in the SI aspect but probably in other aspects as well. We remind the reader that our topographic analysis is most useful in understanding the spreading of new infections over fairly static networks; hence it may be useful in case (i), but has little to say about case (ii). Now, in order to test our ideas further, we assume case (i). Based on our SAG∗ picture, we formulate various immunization strategies and test them via simulation. We have tried (a) immunizing the Center node; (b) immunizing the Center and all nodes within one hop of the Center (subregion heads); (c) immunizing the two infected nodes in the large red subregion, plus that subregion’s head node; and (d) immunizing 50 nodes chosen at random. Results for all of these cases are shown in Fig. 9. A simple conclusion is starkly obvious: none of these immunization strategies is able to retard the takeoff. In fact, the only clear difference is the trivial and useless one: that the long-time infected fraction is reduced by the number of immunized nodes [for example, by 14 for scenario (b), and by 50 for scenario (d)]. In short: as strongly suggested by Fig. 7, the NG network is sufficiently well connected, and sufficiently well infected, so that we find no simple strategy which is at all effective in retarding the takeoff. In order to investigate a different kind of test of the utility of our method of analysis, we next “cure” all infected nodes, and explore scenarios in which we can choose the start nodes freely. Our principal aim is to test the following hypothesis: that time to takeoff is strongly determined by distance from the Center of the SAG∗ . Some simple tests of this hypothesis are shown in Fig. 10. 
Here we show the progression of infection for three scenarios: (e) the Center is the only infected start node; (f) a node roughly halfway between the Center and the periphery is the start node; and (g) a very peripheral node is the start node. The results of Fig. 10 strongly support our hypothesis. Takeoff times vary from a few days to about 50 days to almost 150 days, as we move the start node outward in the SAG∗ . We also see, in the bottom half of the figure, that our earlier picture [3, 2] of the movement of the infection “front” over the topography is confirmed here: the infection [assuming it doesn’t start at the top as in (e)] moves slowly at first, until it begins to reach more central nodes, at which point it speeds up, while moving “uphill” (towards the Center); subsequently it moves “downhill,” slowing down all the while. While we have seen this dynamic pattern many times before, this is the first time we have tested it on a graph with weighted links (and with the EVC appropriately corrected via the weighted adjacency matrix). While Fig. 10 offers anecdotal evidence for our hypothesis, we also have statistical data. We have in fact run one-start-node simulations for each node on the graph, 10 times for each node, and recorded the average time needed to
[Plot for Fig. 10: upper panel, number of infected nodes (0 to 600) versus time (0 to 400); lower panel, mean EVC of infected nodes (0 to 0.2) versus time (0 to 400). Curves: Central node; Medium Central; Not central.]
Fig. 10. Three spreading simulations, based on three chosen scenarios, each with a single start node. We see that distance from the Center node (in a metric defined by the SAG∗ ) correlates strongly with time to takeoff. The lower part of the figure shows the average EVC of the newly infected nodes.
reach an infection number of 300 nodes (about 60%). To measure “distance” from the Center, we define the dual notion of “closeness”: a node’s closeness to the Center is simply the product of the link strengths over the (unique) path to the Center in SAG∗ . Thus many weak links give low closeness, while few strong links give high closeness; and both the number of hops and the link strengths of the hops affect the result. Figure 11 gives a scatter plot for average infection time vs closeness, for all nodes in the graph except the Center node. We see a strong decreasing relationship: closer nodes need less time to infect the graph. Thus we find from these results further strong support for our hypothesis.
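The closeness measure defined above, the product of link strengths along the unique SAG∗ path from a node to the Center, can be sketched as follows. The function name, the three-node tree, and the link strengths are hypothetical; the parent map uses the convention that the Center points to itself.

```python
def closeness_to_center(node, parent, weights):
    """Product of link strengths along the SAG* path from node to its Center."""
    closeness = 1.0
    while parent[node] != node:
        nxt = parent[node]
        closeness *= weights[node][nxt]  # strength of the link on this hop
        node = nxt
    return closeness

# Toy SAG*: x -> h -> C, with made-up link strengths.
parent = {"C": "C", "h": "C", "x": "h"}
weights = {"x": {"h": 0.01},
           "h": {"x": 0.01, "C": 0.02},
           "C": {"h": 0.02}}
closeness_x = closeness_to_center("x", parent, weights)
```

In this example node x, two weak hops from the Center, has closeness 0.01 × 0.02 = 0.0002, while the subregion head h has closeness 0.02. Values spanning many orders of magnitude, as on the horizontal axis of Fig. 11, arise naturally from such products.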
5 Summary and Discussion

In this chapter we have extended the topographic approach to the problem of epidemic spreading over networks to a problem involving two new features. First, the network is real: it is an empirical sex network, with some nodes known to be infected with the STIs HIV, gonorrhea, and chlamydia. Second, we have data which allow us to assign non-uniform link strengths (transmission probabilities), and we have generalized the topographic approach to incorporate these link strengths.
[Plot for Fig. 11: time (10^1 to 10^3, logarithmic axis) versus closeness (10^−12 to 10^0, logarithmic axis).]
Fig. 11. Time needed for a single start node to infect 300 nodes, as a function of that start node’s “closeness” to the graph’s Center (averaged over 10 experiments for each start node). Closeness is measured entirely in terms of the modified steepest-ascent graph SAG∗ . We see a thorough statistical corroboration of the results of Fig. 10.
To help in illuminating the effects of incorporating link strengths, we first performed the analysis by ignoring these weights. We visualized the resulting unweighted FSW network, and simulated the progress of HIV on this network (using uniform transmission probabilities). We found some interesting effects from the almost-bipartite nature of the unweighted network. We also found that the network is very highly connected—with the two HIV-infected nodes very close to the network’s Center—so that retarding the spread of HIV was difficult. Nevertheless we were able to show significant benefits to be obtained from our analysis, for some hypothetical cases involving start nodes placed elsewhere. Incorporation of empirically obtained link strengths had large consequences. Each disease yielded a distinct weighted graph, by affecting the transmission probabilities. We found (using our assumption that perfect condom protection was possible) that the HIV graph broke down into many small components. While our visualization may still have some value, we saw no value in running simulations on these small components.
Simulations on the gonorrhea graph gave results much like those on the unweighted FSW graph: the graph was very well connected, and the already-infected nodes had rather central positions. The result was that we were unable to find simple topological fixes, inspired by our analysis, which could significantly retard spreading. However, we were able to find strong evidence confirming the basic applicability of our analysis to spreading. Specifically, we showed that our own notion of a node’s distance from the Center of the graph correlated strongly with the time needed for that node to infect the graph. We emphasize that this is the first application of the topographic approach to a weighted graph. Performing this analysis has required generalizing our earlier definition [3] of steepest ascent. The results we obtain here, based on this new, generalized definition, are very promising. Hence, even as we fail to come up with promising, concrete suggestions for hindering the spread of STIs in the Vancouver sex network, we feel that our results confirm the applicability of our approach to understanding spreading in the real-world case of a network with non-uniformly weighted links. We see a clear need for two obvious extensions of this work. First, it would be useful to reconnect the HIV graph, by assigning small but non-zero probabilities to the 100%-condom-use links. This would allow for a more meaningful regions analysis and the accompanying testing by simulations (perhaps over a long time scale). Second, our approach is most simply understood and applied for diseases for which SI spreading is appropriate (such as HIV). The application to gonorrhea or chlamydia would be greatly strengthened if one could generalize the method to the SIS and/or SIR case. This is an interesting challenge for future work. The data used derive from self-reported infection status [10].
To validate our model, empirically collected retrospective data on actual prevalence and incidence of the infections could be obtained. This is also recommended for future work. Finally, we remind the reader of the motivation for this work. We believe that the topographic analysis, based on EVC, is extremely useful for understanding epidemic spreading on a coarse scale. The analysis itself is not computationally demanding; hence it can be performed in essentially real time. Thus, we hope that our approach can be useful for disease prevention, in those cases for which the network can be mapped in reasonably short time—that is, short compared to both the time scale for infectious spreading, and the time scale for significant topology changes. The results presented here do not offer any immediate solution to the problem of STIs in the Vancouver FSW network, but they do add further support to our belief that this approach may be useful for this problem, and for others.
Acknowledgments GC and KEM acknowledge partial support from the Future and Emerging Technologies unit of the European Commission through Project DELIS (IST-2002-001907). VPR acknowledges the financial and in-kind support, respectively, of the BC Medical Services Fdn and HIV/STI Prevention and Control, BC Centre for Disease Control.
References

1. G. Canright and K. Engø-Monsen. Roles in networks. Science of Computer Programming, pages 195–214, 2004.
2. G. Canright and K. Engø-Monsen. Epidemic spreading over networks: a view from neighbourhoods. Telektronikk, 101:65–85, 2005.
3. G. Canright and K. Engø-Monsen. Spreading on networks: a topographic view. In Proceedings, European Conference on Complex Systems, 2005.
4. G. S. Canright and K. Engø-Monsen. Some relevant aspects of network analysis and graph theory. In J. Bergstra and M. Burgess, editors, Handbook of Network and Systems Administration. Elsevier, Amsterdam, 2007.
5. K. Holmes, R. Levine, and M. Weaver. Effectiveness of condoms in preventing sexually transmitted infections. Bull World Health Organ, 82:454–461, 2004.
6. A. M. Jolly, M. E. Moffatt, M. V. Fast, and R. C. Brunham. Sexually transmitted disease thresholds in Manitoba, Canada. Ann Epidemiol, 15:781–788, 2005.
7. M. Kretzschmar, Y. T. P. H. van Duynhoven, and A. J. Severijnen. Modeling prevention strategies for gonorrhea and chlamydia using stochastic network simulations. American Journal of Epidemiology, 144:306–317, 1996.
8. M. Newman. The structure and function of complex networks. SIAM Review, 45:167–256, 2003.
9. R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks. Phys Rev Lett, 86:3200–3203, 2001.
10. V. P. Remple, D. M. Patrick, C. Johnston, M. W. Tyndall, and A. Jolly. Clients of indoor commercial sex workers: Heterogeneity in patronage patterns and implications for HIV and STI propagation through sexual networks. Sexually Transmitted Diseases, May 2007.
Spectral Characterization of Network Structures and Dynamics

Anirban Banerjee1 and Jürgen Jost2

1 Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany; [email protected]
2 Max Planck Institute for Mathematics in the Sciences, Inselstr. 22, 04103 Leipzig, Germany, and Santa Fe Institute, Santa Fe, NM 87501, USA; [email protected]

1 Introduction

Mathematically, graphs defy a systematic and complete classification, and empirically, the graphs representing networks come in a bewildering multitude. We have developed some tools [8, 9, 10] that at least allow for a rough classification of graphs that reflects the difference in the empirical domains from which network data are produced and that does not depend on sophisticated visualization tools. As such, a graph is a rather simple formal structure. It consists of nodes or vertices that are connected by edges or links. These nodes then represent the elements of a network (and we shall often not distinguish between the network and its underlying graph), and the edges represent relations between them. These could be chemical interactions as in intracellular networks of genes, proteins, or metabolites, synaptic connections between neurons, physical links in infrastructural networks, links between Internet pages, co-occurrences between words in sentences or on text pages, email contacts between people, co-authorships between scientists, and so on. This structure then can be expected to be somehow adapted to the function of the network, by evolution, self-organization, or design. In turn, any dynamics supported by the network will be constrained by this underlying structure. Our approach is based on associating certain mathematical objects (which ultimately just yield some numbers) to a graph which reflect its structural properties and which in particular encode the constraints on the dynamics that it can support. The mathematical objects will be an operator, the graph Laplacian (a discrete analogue of the Laplace operator in real analysis), and its eigenfunctions, and the numbers alluded to will be the eigenvalues of that operator.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_7, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
2 Growing Networks

Empirical networks usually do not spring into existence, but rather grow to their present or final state from smaller beginnings. Naturally, such a growth process involves the sequential addition of nodes and links (connections). Usually, nodes are added at random, but their link formation with other nodes (already present in the network) is often not entirely random. This link formation will follow some rule that typically is still stochastic but also involves properties of those nodes that are candidates for receiving a link. When that rule is such that nodes that already have many connections have a higher chance of receiving links than those with fewer connections, we have some form of preferential attachment. Such a rule is known to lead to a scale-free degree distribution of the nodes in the network; that is, the number of nodes in the final network that have k links behaves like some power $k^{-\alpha}$, for some positive exponent α. The first such rule was proposed by Simon [44], and it directly stipulated that those nodes that have more connections also have a higher chance of receiving additional ones (“the-rich-get-richer” principle). This rule and the effects resulting from it were then systematically investigated by Barabási–Albert [2, 11], and subsequently, many empirical networks were found to exhibit such a power-law degree distribution. It would be, however, premature to draw systematic consequences about other network properties from such a power-law degree distribution. In fact, there are many rules for network growth that are plausible in many areas of application that indirectly lead to such a kind of preferential attachment, but can lead to networks with properties that are otherwise rather different from those of the schemes of Simon and Barabási–Albert.
For instance, Jost–Joy [28] investigated the “make-friends-with-the-friends-of-your-friends” rule where a new node first forms one link with a randomly selected node in the network and then preferentially makes further links with neighbors of that node. Since the chance of a node being a neighbor of some randomly chosen node depends on its degree, these subsequent links then also constitute some preferential attachment, and the resulting degree distribution will follow a power law. However, other properties of that network are rather different from those obtained by the direct preferential attachment scheme. In particular, because of the preference for local connections, the network diameter will be typically much larger. Even the opposite scheme, where a node preferentially forms additional links with nodes from which it has a large distance, does not lead to a network with a very small diameter. For creating a network with a small diameter, it is rather more efficient that nodes directly use preferential attachment, that is, preferentially form links with other nodes that have a high degree and are therefore well connected in the network. Of course, the most efficient way to achieve a small diameter in a sparse network is to connect every node to one single central node. Another crucial difference between a “make-friends-with-the-friends-of-your-friends” network and a “the-rich-get-richer” network is that the first
eigenvalue of the “make-friends-with-the-friends-of-your-friends” network will be much smaller, implying for instance that dynamics on such a network are much more difficult to synchronize, as will be explained below. In fact, spectral properties like the behavior of the first eigenvalue of scale-free networks were analyzed in [3, 4], and it was pointed out that the scaling exponent and the first eigenvalue are essentially independent parameters for a network. Of course, when networks are produced by a certain stochastic scheme or drawn from some probability distribution on the space of networks, then that scheme or distribution will also lead to some typical spectral behavior, as systematically investigated in [29]. However, when we only know whether a network is scale free, we should be careful about inferring other network properties. It might be a wiser strategy to find out more about the underlying network evolution rule, like the above “make-friends-with-the-friends-of-your-friends” principle, the Cameo principle of Blanchard–Krüger [12], or whatever is plausible in the given empirical domain. One important class of rules for which there is much evidence in various domains is the one of node duplications. That means that instead of randomly attaching a new external node, we take some node i already present in the network and double it in the sense that we create a new node i′ that forms links with all or some of the neighbors of i. It may or may not also form a link with i itself. Again, since the chance of another node j of being a neighbor of the randomly chosen node i and therefore receiving new connections from i′ depends on the degree of j, we do get a preferential attachment scheme. Again, however, as we shall see below, such a node duplication leads to some specific spectral properties that are not shared by networks arising from different schemes. There also exist other distinctions within the class of scale-free networks.
An important one is whether the nodes of high degree are assortative, i.e., prefer connections with other high degree nodes, or disassortative, i.e., avoid connections with high degree nodes and rather form links with low degree nodes.
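The node-duplication rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any cited work: the function name `grow_by_duplication`, the triangle used as the seed graph, and the parameter `keep_prob` (the probability with which the copy inherits each neighbor) are all hypothetical choices, and the copy is never linked to its original.

```python
import random


def grow_by_duplication(n_final, keep_prob=1.0, seed=0):
    """Grow an undirected graph by repeated node duplication.

    Returns a dict mapping each node to the set of its neighbours.
    Assumptions (not from the text): the seed graph is a triangle, the
    copy is not linked to the original node, and the copy keeps at
    least one of the original's neighbours so the graph stays connected.
    """
    rng = random.Random(seed)
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    while len(adj) < n_final:
        i = rng.randrange(len(adj))            # node to be duplicated
        nbrs = sorted(adj[i])
        kept = [j for j in nbrs if rng.random() < keep_prob]
        if not kept:                           # keep the copy attached
            kept = [rng.choice(nbrs)]
        new = len(adj)                         # label of the copy i'
        adj[new] = set(kept)
        for j in kept:                         # neighbours of i gain a link
            adj[j].add(new)
    return adj


# Example: a 200-node network in which each link is inherited with probability 0.7.
network = grow_by_duplication(200, keep_prob=0.7, seed=42)
```

Because a neighbor j of the duplicated node gains a link whenever one of its neighbors is chosen, high-degree nodes gain links more often, which is the indirect preferential attachment mentioned above.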
3 Graph Operators and their Spectral Properties

We have already seen several important network parameters or properties, like the diameter, the synchronizability, the degree sequence (counting the number of nodes of degree k in the network as a function of k ∈ N), and the assortativity. Of course, there are many others, like the clustering coefficient, which expresses the relative frequency of triangles, that is, triples of nodes that are pairwise connected. The clustering coefficient is defined as

\[
C := \frac{3 \times \text{number of triangles}}{\text{number of connected triples of nodes}}. \tag{1}
\]

The normalization is that C becomes one for a fully connected graph.
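As a small illustration of definition (1), the following Python sketch counts triangles and connected triples directly. The representation of the graph as a dictionary of neighbor sets and the function name `clustering_coefficient` are assumptions made for this example only.

```python
from itertools import combinations


def clustering_coefficient(adj):
    """Clustering coefficient C of (1) for a simple undirected graph.

    `adj` maps each node to the set of its neighbours.
    """
    closed = 0   # connected triples whose two end nodes are also linked
    triples = 0  # all connected triples, counted at their centre node
    for node, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Every triangle is seen from each of its three corners, so `closed`
    # already equals 3 x (number of triangles), the numerator of (1).
    return closed / triples if triples else 0.0


# A triangle 0-1-2 with one extra node 3 attached to node 0:
# one triangle and five connected triples.
example = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
c_example = clustering_coefficient(example)
```

For this example graph the formula gives 3 × 1 / 5, while a complete graph gives one and any graph without triangles gives zero.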
Certain properties characterize specific classes of graphs. Complete graphs are those where every vertex is connected with all others. Of course, for large graphs, this is an unrealistic situation, as they are typically sparse, in the sense that the average vertex has connections to only a small fraction of the vertices present in the graph. A graph is bipartite when its vertices fall into two classes inside each of which there are no connections. A graph is bipartite iff it has no closed paths of odd length. In particular, for a bipartite graph, the clustering coefficient C vanishes. A complete bipartite graph is one where each member of one class is connected with all members of the other class. Trees are special bipartite graphs. They have the minimal number of edges, N − 1, that is needed to make a graph of N vertices connected. One may also consider more general structural properties, like cohesion, or functional aspects, like robustness against the destruction of links or the elimination of nodes. Clearly, no such list of parameters and properties can be exhaustive. Also, it may not be easy to understand the relations, if any, between those parameters and properties. In this situation, we have developed the spectral approach to the description of networks. As we shall explain, this means the analysis of the density of eigenvalues of a natural operator associated to a network, the graph Laplacian. While these eigenvalues do not always fully determine a graph, they nevertheless capture all important geometric properties, in a more or less explicit form. Plotting the density of eigenvalues also yields a representation of a graph that can be readily visually inspected. (In contrast, explicit presentation of the nodes and links becomes rather opaque once the graph exceeds some moderate size of, say 1–200 nodes. Moreover, such a presentation can easily be manipulated by moving the nodes around in a plane.) We now formally introduce the graph Laplacian and its spectrum.
We represent our network structurally as a graph Γ which we assume to be finite and connected; let it have N vertices. Vertices i, j ∈ Γ connected by an edge of Γ are called neighbors, i ∼ j. The number of neighbors of a vertex i ∈ Γ is called its degree $n_i$. For functions v from the vertices of Γ to R, we define the normalized Laplacian (henceforth simply called the Laplacian) as

\[
\Delta v(i) := \frac{1}{n_i} \sum_{j,\, j \sim i} v(j) - v(i). \tag{2}
\]
This operator is different from the algebraic graph Laplacian $Lv(i) := n_i v(i) - \sum_{j,\, j \sim i} v(j)$; see, e.g., [13, 14, 20, 32, 35]. In particular, the spectrum of Δ is different from that of L; Δ, however, has the same spectrum as the Laplacian investigated in [15] (in fact, the two operators are equivalent, differing only by a multiplier). The normalized Laplacian is the operator underlying random walks and conservative diffusion processes on graphs. Therefore, it seems to be the more natural operator from a geometric or physical perspective. However, the algebraic Laplacian does possess certain nice algebraic properties that are not shared by the normalized Laplacian, like a trace formula, see [22].
Nevertheless, in our empirical studies, we have found that the Laplacian considered here seems to be a better tool for distinguishing different classes of graphs by spectral properties. We now recall some elementary properties, see, e.g., [15, 26]. The Laplacian is symmetric for the product

\[
(u, v) := \sum_{i \in V} n_i u(i) v(i) \tag{3}
\]

for real-valued functions u, v on the vertices of Γ (and because of this symmetry, we need not consider complex-valued functions). The eigenvalues of Δ therefore are real. Δ is nonpositive in the sense that (Δu, u) ≤ 0 for all u. With the following convention, the eigenvalues λ then are nonnegative:

\[
\Delta u + \lambda u = 0. \tag{4}
\]
A nonzero solution u is called an eigenfunction for the eigenvalue λ. Since Γ has N vertices, Δ has N eigenvalues, not necessarily distinct, as some of them might occur with higher multiplicity. The smallest eigenvalue is $\lambda_0 = 0$, with a constant eigenfunction. This eigenvalue is simple because we assume that Γ is connected; in general, the multiplicity of the eigenvalue 0 equals the number of connected components, with the corresponding eigenfunctions being ≡ 1 on one and ≡ 0 on all other components. Returning to our case of a connected graph Γ, we then have

\[
\lambda_k > 0 \tag{5}
\]

for k > 0, where we order the eigenvalues as $\lambda_0 = 0 < \lambda_1 \le \dots \le \lambda_{N-1}$. For the largest eigenvalue, we have

\[
\lambda_{N-1} \le 2. \tag{6}
\]

In particular, the spectrum of Δ is always confined to the interval [0, 2], regardless of the size of the graph. This is not true for the algebraic graph Laplacian L, and this property of Δ allows for an easy comparison of the spectra of graphs irrespective of their sizes. We have equality in (6) iff the graph is bipartite. Thus, a single eigenvalue determines the global property of bipartiteness. More generally, a graph is bipartite iff whenever λ is an eigenvalue, then so is 2 − λ. Thus, the characteristic spectral property of a bipartite graph is that its spectrum is symmetric about 1. For instance, for a complete graph of N vertices,

\[
\lambda_1 = \dots = \lambda_{N-1} = \frac{N}{N-1}, \tag{7}
\]
that is, there is only one nontrivial eigenvalue, $N/(N-1)$, occurring with multiplicity N − 1. Among all graphs with N vertices, this is the largest possible value for $\lambda_1$ and the smallest possible value for $\lambda_{N-1}$. Thus, the characteristic spectral property of complete graphs is that there is this eigenvalue with the highest possible multiplicity. Many qualitative properties of graphs can be characterized by inequalities or other relationships between their eigenvalues. For instance, Monasson [36] carried out a systematic investigation of the spectrum of a small-world graph as the superposition of a regular ring and a random graph. Also, [23] develops a method for (re)constructing a graph from its spectrum. We should point out, however, that in general it is not possible to uniquely determine a graph from its spectrum. In fact, there exist isospectral graphs, that is, different graphs with the same eigenvalues. For instance, all complete bipartite graphs with the same number N of vertices have the same eigenvalues. Actually, they possess the eigenvalues 0 and 2 with multiplicity 1 and the eigenvalue 1 with multiplicity N − 2. Any graph with that spectrum is a complete bipartite graph, but among complete bipartite graphs of N vertices, the two classes may have different sizes $N_1$, $N_2$, as long as $N_1 + N_2 = N$, of course. We now rewrite the eigenvalue equation (4) as

\[
\frac{1}{n_i} \sum_{j \sim i} u(j) = (1 - \lambda)\, u(i) \quad \text{for all } i. \tag{8}
\]

We observe that when the eigenfunction u vanishes at i, then also

\[
\sum_{j \sim i} u(j) = 0. \tag{9}
\]

The converse also holds, except for the case λ = 1 when (9) holds at all points regardless of whether the eigenfunction vanishes there or not. We now consider motifs, that is, small subgraphs of Γ of a particular type, and analyze what happens to the spectrum when performing some natural operations with motifs. As our motif, we take some graph Λ. We start with motif joining: Here, the motif Λ is a graph that is independent of Γ. Let $j_0$ be a vertex of Λ. We assume that Λ has eigenvalue λ and an eigenfunction $u_\lambda$ that vanishes at $j_0$, i.e., $u_\lambda(j_0) = 0$. We then form a graph $\bar\Gamma$ by identifying the vertex $j_0$ with an arbitrary vertex i of Γ. The new graph then also possesses the eigenvalue λ, with an eigenfunction that agrees with $u_\lambda$ on Λ and vanishes at the other vertices, that is, those coming from Γ. Thus, a motif Λ can be joined to an existing graph with a preserved eigenvalue and a localized eigenfunction when the joining occurs at one (or several) vertices where that eigenfunction vanishes. We next consider motif duplication: Here, the motif Λ is a subgraph of Γ, with vertices $j_1, \dots, j_m$. Let the function u on the vertex set of Λ satisfy

\[
\frac{1}{n_i} \sum_{j \in \Lambda,\, j \sim i} u(j) = (1 - \lambda)\, u(i) \quad \text{for all } i \in \Lambda \text{ and some } \lambda, \tag{10}
\]
where $n_i$ is the degree of the vertex i in Γ. Let $\bar\Gamma$ be obtained from Γ by doubling the motif Λ, that is, by adding vertices $i_1, \dots, i_m$ and their connections as in Λ and connecting each $i_\alpha$ with all $i \notin \Lambda$ that are neighbors of $j_\alpha$. Then the graph $\bar\Gamma$ possesses the eigenvalue λ with an eigenfunction $u_\lambda$ that vanishes outside Λ and its double; it agrees with u on Λ and with −u on the double of Λ. Thus, the eigenvalue λ is produced by motif duplication with symmetric eigenfunction balancing. We point out that for this effect it is essential that there be no connections between a node $j_\alpha$ and its double $i_\alpha$. The simplest motif is a single vertex, and the corresponding motif duplication is the doubling of a single vertex $j_0 \in \Gamma$. According to the general scheme, we add a new vertex $i_0$ and connect $i_0$ with all neighbors of $j_0$. This generates an eigenvalue 1, with an eigenfunction $u_1$ that is nonzero only at $j_0$ and $i_0$, with $u_1(j_0) = 1$, $u_1(i_0) = -1$. In the analysis of empirical networks, we often find that the spectral plot has a high peak at the eigenvalue 1. In such a situation, a natural hypothesis is that this network evolved via a sequence of vertex doublings. In fact, vertex duplication with subsequent random edge deletion has been proposed in different application fields as a mechanism for network growth that can reproduce qualitative properties of empirical networks, e.g., for the Internet [30], for protein-interaction networks [6, 45, 46, 47], or for citation networks [31], although the precise rules can differ between those investigations, for instance, whether the duplicated node and its copy are connected or not. The next simplest motif consists of two connected vertices. Thus, we consider an edge in Γ connecting two vertices $j_1$, $j_2$. Equation (10) then becomes

\[
\frac{1}{n_{j_1}} u(j_2) = (1 - \lambda)\, u(j_1), \qquad \frac{1}{n_{j_2}} u(j_1) = (1 - \lambda)\, u(j_2), \tag{11}
\]

with the solutions

\[
\lambda_\pm = 1 \pm \frac{1}{\sqrt{n_{j_1} n_{j_2}}}. \tag{12}
\]

The duplication of an edge thus yields the eigenvalues $\lambda_\pm$ which are symmetric about 1. Also, when the degree of $j_1$ or $j_2$ is large, $\lambda_\pm$ are close to 1. The next motifs consist of three vertices. When we have a chain of vertices $j_1, j_2, j_3$ for which $j_2$ is connected to both $j_1$ and $j_3$, but without a connection between $j_1$ and $j_3$ (that is, the motif is not a triangle), we obtain the eigenvalues

\[
\lambda = 1, \quad 1 \pm \sqrt{\frac{1}{n_{j_2}} \left( \frac{1}{n_{j_1}} + \frac{1}{n_{j_3}} \right)}. \tag{13}
\]

The other motif with three vertices is a triangle, with vertices $j_1, j_2, j_3$. In this case, from (10), we obtain the cubic equation

\[
(1 - \lambda)^3\, n_{j_1} n_{j_2} n_{j_3} - (1 - \lambda)(n_{j_1} + n_{j_2} + n_{j_3}) - 2 = 0 \tag{14}
\]

for λ.
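The effect of vertex and edge duplication on the spectrum can be checked numerically on a small example. The sketch below assumes NumPy; the helper names `laplacian_spectrum`, `cycle_graph`, and `duplicate_motif`, and the choice of a 5-cycle as the base graph Γ, are illustrative assumptions and do not come from the text. The eigenvalues of Δ in the convention (4) are computed from the symmetric matrix I − D^{-1/2} A D^{-1/2} (A the adjacency matrix, D the diagonal matrix of degrees), which is similar to I − D^{-1} A and hence has the same spectrum.

```python
import numpy as np


def laplacian_spectrum(adj):
    """Eigenvalues (ascending) of the normalized Laplacian in convention (4)."""
    A = np.asarray(adj, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    S = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(S)


def cycle_graph(n):
    """Adjacency matrix of a ring of n vertices."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A


def duplicate_motif(adj, motif):
    """Double the vertices listed in `motif`.

    Copies are linked among themselves as in the motif and to the outside
    neighbours of their originals; originals and copies are not linked.
    """
    A = np.asarray(adj, dtype=float)
    n, m = len(A), len(motif)
    B = np.zeros((n + m, n + m))
    B[:n, :n] = A
    copy = {j: n + k for k, j in enumerate(motif)}
    for j in motif:
        for v in range(n):
            if A[j, v]:
                target = copy[v] if v in copy else v
                B[copy[j], target] = B[target, copy[j]] = 1.0
    return B


base = cycle_graph(5)
deg = base.sum(axis=1)

# Doubling the single vertex 0 should produce the eigenvalue 1.
vertex_doubled = laplacian_spectrum(duplicate_motif(base, [0]))

# Doubling the edge (0, 1) should produce the two eigenvalues of (12).
edge_doubled = laplacian_spectrum(duplicate_motif(base, [0, 1]))
lam_plus = 1 + 1 / np.sqrt(deg[0] * deg[1])
lam_minus = 1 - 1 / np.sqrt(deg[0] * deg[1])
```

Comparing `vertex_doubled` and `edge_doubled` with `laplacian_spectrum(base)` shows which eigenvalues are new; the 5-cycle itself has none of the values 1, `lam_plus`, or `lam_minus` in its spectrum.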
4 Functional and Dynamical Aspects Determined by the First Eigenvalue

In this section, we shall argue that the first nontrivial eigenvalue $\lambda_1$ plays a special role for understanding important network properties. $\lambda_1$ is also called the spectral gap, because it is equal to the difference $\lambda_1 - \lambda_0$ as $\lambda_0 = 0$. $\lambda_1$ admits the variational characterization

\[
\lambda_1 = \min_v \left\{ \frac{\sum_{j \sim i} (v(i) - v(j))^2}{\sum_i n_i v(i)^2} : \sum_i n_i v(i) = 0 \right\}. \tag{15}
\]

A function v attaining this minimum then is an eigenfunction for $\lambda_1$. Since the numerator in (15) only takes pairs of neighboring vertices into account, $\lambda_1$ can become quite small when the graph consists of two large subgraphs that are connected by few edges. In (15), we can then achieve a small value by taking some function that equals a positive constant on one of those subgraphs and a negative constant on the other, where the two constants are adjusted so that the normalization $\sum_i n_i v(i) = 0$ is satisfied. Therefore, it is intuitively clear that $\lambda_1$ can be estimated against the Polya–Cheeger constant h(Γ) of our graph Γ, which is defined as follows. Letting |E| denote the number of edges contained in an edge set E, we define

\[
h(\Gamma) := \inf \left\{ \frac{|E_0|}{\min\left( \sum_{i \in V_1} n_i,\ \sum_{i \in V_2} n_i \right)} \right\}, \tag{16}
\]

where removing $E_0$ disconnects Γ into the components $V_1$, $V_2$. We then have the estimates (see [15] for proofs)

\[
\frac{1}{2} h(\Gamma)^2 \le \lambda_1 \le 2 h(\Gamma). \tag{17}
\]

Incidentally, this implies the inequality

\[
h(\Gamma) \le 4 \tag{18}
\]

for any connected graph.

Turning to dynamical aspects, we consider a dynamical system with coupling structure given by Γ. More specifically, we consider the coupled equation for a function u depending on the nodes i ∈ Γ and evolving in discrete time $n \in \mathbb{N}$,

\[
u(i, n+1) = f(u(i, n)) + \frac{\epsilon}{n_i} \sum_{j,\, j \sim i} \big( f(u(j, n)) - f(u(i, n)) \big). \tag{19}
\]

Here, f : [0, 1] → [0, 1] is some function; the functions we have in mind are those whose iteration generates some chaotic dynamics, like the logistic map

\[
f(x) = 4x(1 - x). \tag{20}
\]
What is important about f is its Lyapunov exponent,

\[
\mu_0 = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \log |f'(\bar u(n))|.
\]

The Lyapunov exponent $\mu_0$ is positive for chaotic dynamics f. The quantity ε in (19) is a coupling parameter, usually in the range 0 ≤ ε ≤ 1. The specific question we wish to ask is whether, or better, under what circumstances, the solution u of (19) synchronizes, that is, asymptotically,

\[
\lim_{n \to \infty} \big( u(i, n) - u(j, n) \big) = 0 \quad \text{for all nodes } i, j. \tag{21}
\]

This question can be understood as asking about the stability of a synchronized solution

\[
u(i, n) = \bar u(n) \tag{22}
\]

that solves

\[
\bar u(n + 1) = f(\bar u(n)). \tag{23}
\]

Systematic studies of synchronization can be found in [41, 42]. It was then found in [27, 43] that a sufficient condition for such stability is

\[
\frac{1 - e^{-\mu_0}}{\lambda_1} < \epsilon < \frac{1 + e^{-\mu_0}}{\lambda_{N-1}}. \tag{24}
\]

In practice, the left inequality, the one involving $\lambda_1$, is the crucial one here. In particular, when the eigenvalues satisfy appropriate conditions, we can have a stable synchronized solution that is chaotic ($\mu_0 > 0$). Note that the first eigenvalue even determines the synchronization of dynamics with transmission delays between the nodes, see [5].
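To make condition (24) concrete, the following sketch computes the interval of coupling strengths for a given graph and iterates the coupled map (19) with the logistic map (20). It assumes NumPy and uses the standard value μ0 = log 2 for the Lyapunov exponent of f(x) = 4x(1 − x); the function names, the complete graph on 10 nodes, the coupling ε = 0.8, and the number of iterations are illustrative assumptions rather than settings taken from the text.

```python
import numpy as np


def laplacian_spectrum(adj):
    """Eigenvalues (ascending) of the normalized Laplacian in convention (4)."""
    A = np.asarray(adj, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    S = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(S)


def logistic(x):
    """The logistic map (20)."""
    return 4.0 * x * (1.0 - x)


def sync_window(adj, mu0=np.log(2.0)):
    """Lower and upper bounds on the coupling strength from (24)."""
    spec = laplacian_spectrum(adj)
    return (1 - np.exp(-mu0)) / spec[1], (1 + np.exp(-mu0)) / spec[-1]


def simulate(adj, eps, steps=200, seed=0):
    """Iterate the coupled system (19) from random initial values in [0, 1)."""
    A = np.asarray(adj, dtype=float)
    deg = A.sum(axis=1)
    u = np.random.default_rng(seed).random(len(A))
    for _ in range(steps):
        fu = logistic(u)
        # f(u_i) + (eps / n_i) * sum over neighbours j of (f(u_j) - f(u_i))
        u = fu + eps * (A @ fu / deg - fu)
    return u


A = np.ones((10, 10)) - np.eye(10)   # complete graph on 10 nodes
lo, hi = sync_window(A)              # here (0.45, 1.35), since all nonzero eigenvalues are 10/9
eps = 0.8
u_final = simulate(A, eps)
spread = u_final.max() - u_final.min()   # close to zero when the nodes have synchronized
```

On sparser or more modular graphs $\lambda_1$ is smaller, so the lower bound of the window moves up and may exceed the upper bound, in which case (24) gives no admissible coupling strength.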
5 Spectral Plots and What They May Tell Us

In this final section, we describe how (a smoothed version of) the density plot for the eigenvalues of the Laplacian of a network yields a good heuristic clustering scheme for networks from different empirical domains. More precisely, we shall see that the spectral plots of different networks from the same domain typically look rather similar to each other, but different from those for networks from different domains. Also, these spectral plots often suggest suitable hypotheses about the dominant evolution mechanisms of the underlying networks. Let us give some examples that summarize some of the discussion in the preceding sections.

• A high peak at the eigenvalue 1 may indicate many successive node duplications. This is readily visible in many of our spectral plots.
• Likewise, as analyzed above, see (12), (13), (14), duplications of small motifs leave characteristic traces in the spectrum.
[Figure 1: four spectral-plot panels (a)–(d); horizontal axis from 0 to 2.]

Fig. 1. (a) Protein-protein interaction network of Helicobacter pylori. Network size = 710. Data collected from http://www.cosinproject.org [Download date: 25 Sept. 2005]. (b) Metabolic network of Helicobacter pylori. Size of the network = 940. Nodes represent substrates, enzymes, and intermediate complexes. Data used in [24]. Data source: http://www.nd.edu/∼networks/resources.htm. [Download date: 22 Nov. 2004]. (c) Autonomous Systems (ASs) topology of the Internet. Every vertex represents an AS, and two vertices are connected if there is at least one physical link between the two corresponding ASs. AS graph of 1998/04/02. Network size = 3522. Data collected from http://www.cosinproject.org and data used in [18] [Download date: 23 September 2005]. Main source: BGP routing data collected by University of Oregon Route Views Project, then processed and made available in various formats at the Global ISP interconnectivity by AS number page of NLANR (National Laboratory of Applied Network Research). (d) Word-adjacency networks of a text in Spanish language. Size of the network = 11558. Data downloaded from http://www.weizmann.ac.il/mcb/UriAlon [Download date 3rd Feb. 2005]. Data used in [34].
[Figure 2: five spectral-plot panels (a)–(e); horizontal axis from 0 to 2.]

Fig. 2. (a) Foodweb network from “Florida bay in wet season”. Data downloaded from http://vlado.fmf.uni-lj.si/pub/networks/data (main data resource: Chesapeake Biological Laboratory. Web link: http://www.cbl.umces.edu/). [Download date 21 Dec. 2006]. Network size 128. (b) Foodweb network from “Ythan estuary”. Data downloaded from http://www.cosinproject.org. [Download Date 21 Dec. 2006]. Network size 135. (c) The network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance [1]. Network size 1222. Data downloaded from http://www-personal.umich.edu/∼mejn/netdata [Download date: 23 April 2007]. (d) Neuronal connectivity of Caenorhabditis elegans. Network size 297. Data used in [49, 50]. Data Source: http://cdg.columbia.edu/cdg/datasets [Download date: 18 Dec. 2006]. (e) E-mail interchanges between members of the University Rovira i Virgili (Tarragona) [21]. Network size 1133. Data downloaded from http://deim.urv.cat/∼aarenas/data/welcome.htm [Download date: 21 March, 2007].
• As follows from Section 4, the presence of many small eigenvalues indicates that the graph consists of many components that, while possibly connected densely inside, are only very loosely connected to each other. That is, the graph consists of many different “communities.” As indicated, this has important dynamical implications for the synchronizability of the graph.
• When the highest eigenvalue equals 2, or, more generally, when the spectrum is symmetric about 1, the graph is bipartite; see the discussion after (6). Thus, an approximate such symmetry, or an eigenvalue very close to 2, will indicate that the graph is close to being bipartite (we hope to present more precise estimates elsewhere). Also, a bipartite graph can readily support period 2 oscillations of coupled dynamics, so again, there are direct dynamical implications here. Also, when a graph is bipartite, a random walk on it need not converge to a stationary distribution. More generally, such convergence properties are related to the small and large (close to 2) eigenvalues. Thus, these eigenvalues will affect the properties of random search schemes on the underlying graph.
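The link between small eigenvalues and loosely connected communities rests on the estimate (17). A brute-force check on a tiny graph can be sketched as follows, assuming NumPy. Here the constant h(Γ) of (16) is evaluated by enumerating all splits of the vertex set into two nonempty parts, which is an assumption about how to read (16) and is feasible only for very small graphs; the function names and the example of two triangles joined by one edge are illustrative.

```python
from itertools import combinations

import numpy as np


def laplacian_spectrum(adj):
    """Eigenvalues (ascending) of the normalized Laplacian in convention (4)."""
    A = np.asarray(adj, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    S = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(S)


def cheeger_constant(adj):
    """h(Gamma) as in (16), by exhaustive search over vertex bipartitions."""
    A = np.asarray(adj, dtype=float)
    n = len(A)
    deg = A.sum(axis=1)
    best = np.inf
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            inside = np.zeros(n, dtype=bool)
            inside[list(subset)] = True
            cut = A[inside][:, ~inside].sum()          # edges removed, |E_0|
            volume = min(deg[inside].sum(), deg[~inside].sum())
            best = min(best, cut / volume)
    return best


# Two "communities": triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for a, b in edges:
    A[a, b] = A[b, a] = 1.0

h = cheeger_constant(A)
lam1 = laplacian_spectrum(A)[1]
bounds_hold = 0.5 * h ** 2 <= lam1 <= 2 * h
```

In this example the best cut removes the single bridging edge, and $\lambda_1$ is correspondingly small; adding more edges between the two triangles raises both h(Γ) and $\lambda_1$.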
In Figs. 1 through 4, we can clearly see that networks from the same empirical domain yield similar spectral plots. We can also distinguish different classes of spectral plots with specific characteristic features. A more detailed analysis of those classes can be found in [10]. The investigation of the graph properties that can be detected from spectral plots has just begun, and we expect significant advances in the detailed understanding of classes of empirical graphs from systematic investigations of their spectra.
Fig. 3. (a) Topology of the Western states power grid of the United States [49]. Network size 4941. Data downloaded from http://cdg.columbia.edu/cdg/datasets [Download date: 1 March 2007]. (b) Jazz band network. Nodes represent jazz bands. Two bands are connected if the same musician played in both. Network size 198. Data downloaded from http://deim.urv.cat/~aarenas/data/welcome.htm [Download date: 17 March 2008]. Data used in [19]. (c) Co-authorships between scientists posting preprints on the High-Energy Theory E-Print Archive, http://arxiv.org/archive/hep-th, between 1 Jan. 1995 and 31 Dec. 1999 [37]. Network size 5835. (d) Co-authorships of scientists working on network theory and experiment [38]. Network size 379. (c,d) Data downloaded from http://www-personal.umich.edu/~mejn/netdata [Download date: 23 April 2007].
Fig. 4. Electronic circuits. (a) With size = 122. (b) With size = 252. (c) With size = 512. Data downloaded from http://www.weizmann.ac.il/mcb/UriAlon [Download date: 15 March 2005]. Data used in [33].
References
1. L.A. Adamic, N. Glance, The political blogosphere and the 2004 US election: Divided they blog, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem, 2005
2. R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Reviews of Modern Physics 74, 2002, 47–97
3. F.M. Atay, T. Bıyıkoğlu, J. Jost, Synchronization of networks with prescribed degree distributions, IEEE Trans. Circuits and Systems I 53(1), 2006, 92–98
4. F.M. Atay, T. Bıyıkoğlu, J. Jost, Network synchronization: Spectral versus statistical properties, Phys. D 224, 2006, 35–41
5. F.M. Atay, J. Jost, A. Wende, Delays, connection topology, and synchronization of coupled chaotic maps, Phys. Rev. Lett. 92(14), 2004, 144101
6. A. Banerjee, J. Jost, Laplacian spectrum and protein-protein interaction networks, preprint
7. A. Banerjee, J. Jost, On the spectrum of the normalized graph Laplacian, Lin. Alg. Appl. 428, 2008, 3015–3022
8. A. Banerjee, J. Jost, Graph spectra as a systematic tool in computational biology, Discr. Appl. Math., to appear
9. A. Banerjee, J. Jost, Spectral plots and the representation and interpretation of biological data, Theory Biosc. 126, 2007, 15–21
10. A. Banerjee, J. Jost, Spectral plot properties: Towards a qualitative classification of networks, NHM 3, 2008, 395–411
11. A.-L. Barabási, R.A. Albert, Emergence of scaling in random networks, Science 286, 1999, 509–512
12. P. Blanchard, T. Krüger, The “Cameo” principle and the origin of scale-free graphs in social networks, J. Stat. Phys. 114, 2004, 1399–1416
13. T. Bıyıkoğlu, J. Leydold, P. Stadler, Laplacian Eigenvectors of Graphs, Springer, Berlin, 2007
14. B. Bollobás, Modern Graph Theory, Springer, Berlin, 1998
15. F. Chung, Spectral Graph Theory, AMS, Providence, RI, 1997
16. F. Chung, L.Y. Lu, Complex Graphs and Networks, AMS, Providence, RI, 2006
17. S.N. Dorogovtsev, J.F.F. Mendes, Evolution of Networks, Oxford University Press, Oxford, 2003
18. M. Faloutsos et al., On power-law relationships of the Internet topology, SIGCOMM, 1999
19. P.M. Gleiser, L. Danon, Community structure in Jazz, Advances in Complex Systems (ACS) 6(4), 2003, 565–573
20. C. Godsil, G. Royle, Algebraic Graph Theory, Springer, Berlin, 2001
21. R. Guimera et al., Self-similar community structure in a network of human interactions, Physical Review E 68, 2003, 065103(R)
22. M. Horton, H. Stark, A. Terras, What are zeta functions of graphs and what are they good for? In Quantum graphs and their applications, Contemp. Math., Amer. Math. Soc., Providence, RI, 415, 2006, 173–189
23. M. Ipsen, A.S. Mikhailov, Evolutionary reconstruction of networks, Phys. Rev. E 66(4), 2002, 046109
24. H. Jeong et al., The large-scale organization of metabolic networks, Nature 407, 2000, 651–654
25. J. Jost, Mathematical methods in biology and neurobiology, monograph, to appear
26. J. Jost, in: J.F. Feng, J. Jost, M.P. Qian (eds.), Networks: From Biology to Theory, 35–62, Springer, Berlin, 2007
27. J. Jost, M.P. Joy, Spectral properties and synchronization in coupled map lattices, Phys. Rev. E 65(1), 2002, 016201
28. J. Jost, M.P. Joy, Evolving networks with distance preferences, Phys. Rev. E 66, 2002, 36126–36132
29. D.H. Kim, A. Motter, Ensemble averageability in network spectra, Phys. Rev. Lett. 98, 2007, 248701
30. J. Kleinberg et al., The Web as a Graph: Measurements, Models, and Methods, LNCS 1627, 1999, 1–17
31. P. Krapivsky, S. Redner, Network growth by copying, Phys. Rev. E 71, 2005, 036118
32. R. Merris, Laplacian matrices of graphs – A survey, Lin. Alg. Appl. 198, 1994, 143–176
33. R. Milo et al., Network motifs: Simple building blocks of complex networks, Science 298, 2002, 824–827
34. R. Milo et al., Superfamilies of evolved and designed networks, Science 303, 2004, 1538–1542
35. B. Mohar, Some applications of Laplace eigenvalues of graphs, in: G. Hahn, G. Sabidussi (eds.), Graph Symmetry: Algebraic Methods and Applications, 227–277, Springer, Berlin, 1997
36. R. Monasson, Diffusion, localization and dispersion relations on “small-world” lattices, Europ. Phys. J. B 12, 1999, 555–567
37. M.E.J. Newman, The structure of scientific collaboration networks, Proc. Natl. Acad. Sci. USA 98, 2001, 404–409
38. M.E.J. Newman, Finding community structure in networks using the eigenvectors of matrices, Phys. Rev. E 74, 2006, 036104
39. M. Newman, The structure and function of complex networks, SIAM Review 45, 2003, 167–256
40. S. Ohno, Evolution by Gene Duplication, Springer, Berlin, 1970
41. L.M. Pecora, T.L. Carroll, Synchronization in chaotic systems, Phys. Rev. Lett. 64, 1990, 821–824
42. A. Pikovsky, M. Rosenblum, J. Kurths, Synchronization – A Universal Concept in Nonlinear Science, Cambridge University Press, Cambridge, 2001
43. G. Rangarajan, M.Z. Ding, Stability of synchronized chaos in coupled dynamical systems, Phys. Lett. A 296, 2002, 204–212
44. H. Simon, On a class of skew distribution functions, Biometrika 42, 1955, 425–440
45. R. Solé et al., A model of large scale proteome evolution, Adv. Compl. Syst. 5, 2002, 43–54
46. A. Vazquez et al., Modelling of protein interaction networks, ComPlexUs 1, 2003, 38–44
47. A. Wagner, How the global structure of protein interaction networks evolves, Proc. Roy. Soc. B 270, 2003, 457–466
48. A. Wagner, Evolution of gene networks by gene duplications: A mathematical model and its implications on genome organization, Proc. Nat. Acad. Sciences USA 91(10), 1994, 4387–4391
49. D.J. Watts, S.H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393, 1998, 440–442
50. J.G. White et al., The structure of the nervous system of the nematode Caenorhabditis elegans, Phil. Trans. Royal Soc. of London Series B-Bio. Sc. 314, 1986, 1–340
51. P. Zhu, R.C. Wilson, A study of graph spectra for comparing graphs, in Proc. of British Machine Vision Conf. (BMVC), Sep 2005
52. K.H. Wolfe, D.C. Shields, Molecular evidence for an ancient duplication of the entire yeast genome, Nature 387(6634), 1997, 708–713
Dynamics of Social Complex Networks: Some Insights into Recent Research

Sergi Lozano
ETH Zurich, Swiss Federal Institute of Technology, UNO D11, Universitätstr. 41, 8092 Zurich, Switzerland; [email protected]

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_8, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

1 Introduction: Social Networks as Complex Networks

Social network analysis (that is, the study of interactions among social actors from a structural viewpoint) has a long tradition covering several decades [1, 2, 3]. Such studies have usually been performed on small social networks, and this limitation of size has restricted how much of their complexity is visible [4, 5]. However, the situation has changed significantly in recent times, for two main reasons. First, larger social datasets are increasingly available (obtained in most cases from information and communication technologies). Second, a large number of physicists and other scholars from complexity science have started to take an active interest in the field. These ‘newcomers’ have provided new perspectives and tools which, in combination with the expertise and knowledge accumulated by ‘classical’ social network analysts, have formed the basis of a multidisciplinary field suitably termed the science of networks [6, 7].

This research has led to a formal characterization of the complexity exhibited by social networks in terms of the following simple ‘check list’ [5].

1. The network must consist of a large number of nodes showing substantial heterogeneity. Here we understand heterogeneity to mean diversity of degree.
2. Its structure has to present an ‘intricate architecture’, that is, a topology that cannot be expressed in terms of simple patterns (like ‘regular’ or ‘completely random’) but must include several degrees of freedom.
3. This topological complexity is translated into the global system behavior in the form of ‘emergent phenomena’, i.e. even simple local interaction rules lead to a performance of the whole system that is richer than the sum of local effects.
4. This influence of local feedbacks over the macroscopic behavior can be manifested, in particular, as nonlinearities in the operation of the processes that shape the network itself (i.e. particular structural features emerge suddenly when an external parameter exceeds a certain threshold value).

Regarding the fulfillment of this list of requirements by social networks, Vega-Redondo [5] refers to the results of previous studies of social structure to confirm that social networks satisfy the first two. Following the same reasoning, we notice that the other two requirements (covering dynamic aspects) are repeatedly recognized in social phenomena, for instance in collective behavior and social mobilization [8, 9] (third point), or in the emergence of hierarchical social structures from interactions at an individual level [10, 11] (fourth point).

Having confirmed that social networks are indeed complex networks, we focus in this chapter on the dynamic aspects of this complexity (the latter two points in the check list above). More concretely, we give an overview of some of the recent research that addresses dynamics on and of social networks from the perspective of complex systems.

The rest of the chapter is structured as follows. The second section is devoted to works dealing, as separate topics, with the analysis of social phenomena over static social networks and with the time evolution of the social structure. The third section focuses on the coevolution of social structure and phenomena, stressing the importance of this interplay from the complexity viewpoint. Finally, the last section summarizes the whole chapter and points out some ideas about the future evolution of the field.
2 Approaching the Dynamics on and of Social Networks Separately

The majority of recent studies on social networks from a complexity perspective treat dynamics on and of social networks as different lines of research. In the first case, each node (social actor) is considered to be a dynamical system whose state evolves, in part, as a function of the topological features of the underlying static social substrate. Given the intricate patterns (to use the expression from the Introduction) that characterize social networks, this scenario results in the nonlinear global behaviors already mentioned. In the second case, the whole network is considered to be a dynamical system whose topological state evolves according to local rules. Investigations along this line have discovered that certain social rules at the local (individual) interaction level can forge some of the ‘intricate structural patterns’ referred to above. In accordance with this scheme, we address these two research lines separately.

2.1 Dynamics on Social Networks

Topology is an important aspect that is always present in social dynamics [12]. Accordingly, social network analysis has placed great importance on studying the influence of social networks and the individual’s role in the evolution of different social phenomena. A good example of this can be found in the research devoted to diffusion of innovations [13, 14]. This perspective has resulted in an in-depth knowledge of the most important structural characteristics of social networks and their influence on the behavior of the social actors, as has been recognized by scholars recently entering the field from complexity science [6, 7] (although some ‘traditional’ social network analysts claim that this effort by the ‘newcomers’ is barely noticeable [1]). The incorporation of these ‘newcomers’ has not changed this orientation, but has reinforced it by contributing new analyses and modeling methodologies.

The works ensuing from this combination of tools and perspectives have produced very relevant results. Some of them, for example, have related the emergence and resilience of cooperation in social groups to certain structural features of the underlying social network, such as degree heterogeneity [15, 16, 17] or community structure [18]. Others have shown that scale-freeness and the small-world phenomenon can influence the consensus time of opinions in a population [19, 20] and even force scenarios with coexisting domains of opposite opinions [21]. For further detail on the various tools and perspectives developed for explaining and modeling social networks, the reader may consult the exhaustive recent reviews on game theory [22], opinion dynamics [12, 23], language dynamics [12] or spreading phenomena [5].

Finally, as a sample of work addressing social dynamics on networks, the first chapter of Part in this book presents a work centered on the study of epidemic spreading [24]. In this work, the authors apply a mesoscopic (neither individual nor global, but intermediate) structural approach to predict and understand the spreading of an incurable disease (like HIV) over an empirical static network.
First, they study the division into subnetworks or regions of a real social network of sexual contacts obtained by means of interviews, and deduce qualitative predictions about infection spreading from the observed topological features. Second, they use a computational model to test these predictions numerically and to design possible protection strategies suitable for this particular case. This work represents an important contribution to the literature on disease spreading, since it highlights the analysis and visualization possibilities of mesoscopic approximations.

2.2 Dynamics of Social Networks

The second of the separate approaches considered in this section is based on the study of network processes, that is, “series of events that create, sustain and dissolve social structures” [25]. Logically, this sort of study requires the use of time in addition to the structural description. However, in the past social network analysis mainly focused on the study of static social networks and their influence on individual and collective behavior [6]. Borgatti [26] argues that one of the reasons behind such an orientation was the difficulty of obtaining longitudinal empirical data. As has been pointed out in the Introduction, this scenario has changed lately with the increasing availability of large social datasets obtained from different information and communication technologies (email traffic, mobile phone calls, activities within peer-to-peer systems, social media and social networking websites, etc.).

Taking advantage of this new availability, scholars have developed different methodologies to understand the evolution of social networks using these data as input [25, 27, 28]. The (generally) large size of these datasets has given rise to especially interesting applications from a complex network perspective. On one side, we find works that try to deduce the basic mechanisms ruling social network processes. To do that, the authors analyze the evolution of these social datasets from a statistical point of view (macroscopic level) [4], focusing on their modular structure (mesoscopic or intermediate level) [29, 30, 31], or addressing key individual properties such as centrality (microscopic level) [32].

On the other side, following the example of the seminal works by Watts and Strogatz [33] and Barabási and Albert [34], datasets are also used to validate simple models based on single mechanisms that forge complex social-like features. In these works, empirical data are compared with the models’ simulations in terms of structural parameters at different topological scales. For example, some of these works present extensions of Barabási’s preferential attachment models and are focused on the degree distribution [35, 36]. Others present variants of the seceder model (where the mechanism conditioning topological evolution is based on each agent’s efforts to differentiate from the crowd) [37]. Finally, in Ref. [38] the authors propose a model where each agent is assigned a set of social values (representing different social attributes), and ties are established as a function of the social distances among agents (differences between their social attributes) and of α, a parameter quantifying the homophily in the system (the individuals’ preference to establish and maintain links with other individuals they feel similar to). Interestingly, for different values of α the resulting social network presents different modular structures, while preserving general topological features of social networks (such as assortativity or high clustering).
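To make the idea of a single growth mechanism concrete, the following is a minimal sketch of the basic preferential attachment rule mentioned above, in which each new node links to existing nodes with probability proportional to their degree. It is not the extended models of [35, 36]; the function names, the seed graph (a small complete graph) and the parameter values are assumptions chosen for illustration.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing nodes,
    drawn with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed seed graph: complete graph on m + 1 nodes, so every node has positive degree.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


def degrees(edges, n):
    """Degree of every node, computed from the edge list."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


edges = preferential_attachment(n=2000, m=2, seed=1)
deg = degrees(edges, 2000)
```

Comparing a histogram of `deg` with the degree distribution of an empirical dataset is the kind of model validation described in the text; the few very high-degree nodes produced by the rule are what gives the distribution its heavy tail.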
3 Coevolution: Social Networks and Phenomena

The separation into two different lines of research presented in the previous section has been the common approach to social complex networks until recently. However, real-life observations suggest that there is normally a certain interdependency between the evolution of the social structure and the behavior of each of the social actors [39]. Consider a friendship network as an example. On one side, friendship relationships (network links) are the paths used, for instance, to cooperate, inform or imitate behaviors. Thus, the structure conditions different social processes related to these actions (like cooperation and diffusion of habits, for example). On the other side, the stronger the friendship between two people, the more probable it is that they introduce each other to new friends, modifying their mutual ‘friendship local neighborhood’ and, consequently, the whole structure of the network. In general, networks exhibiting such a feedback loop are called coevolutionary or adaptive networks [40].

This interdependency has clear implications from a complexity point of view. If structural patterns of social networks can induce nonlinearity in the social phenomena evolving over them and, likewise, social network processes forge the emergence of complex structural features, a coevolutive scheme must lead to scenarios exhibiting extremely rich behaviors. In their recent review on adaptive networks, Gross and Blasius [40] support this assertion by reporting a list of four ‘hallmarks’ typically presented by adaptive networks in general (and social networks in particular):

• Self-organization towards a dynamical critical state.
• Emergence of ‘specialized’ roles from an initially homogeneous population.
• Formation of complex global topologies (even from very simple local rules).
• Highly complex macroscopic dynamics due to the interaction of local states and topological complexity.
In the following, we review some recent works that have addressed interesting sociological topics from an adaptive network perspective. We also identify some of the preceding hallmarks in the examples discussed.

3.1 Cooperation in Coevolutive Models

In Ref. [41], Skyrms and Pemantle claim to “(..) create models that are more true to life (..)” by incorporating coevolution of structure and strategies in evolutionary game theory models. Since then, several authors have proposed models where players’ strategies depend on the structure but, at the same time, players can modify the connectivity in their local neighborhood in order to maximize the payoff of a certain strategy (modifying, as an aggregate effect, the whole topology at the macroscopic level).

Cooperation among individuals and, more concretely, the evolution of one-shot versions of the Prisoner’s Dilemma played over adaptive networks have been intensively studied. In the results of these works, we can find some of the four ‘complexity hallmarks’ of coevolving networks listed previously. For example, in some cases the authors identify the formation of scale-free topologies (which present a power-law degree distribution) [42, 43] and the emergence of differentiated roles and hierarchies [42, 44, 45]. Moreover, regarding system dynamics, Ebel and Bornholdt [46], Eguiluz and co-workers [42] and Zimmermann and Eguiluz [44] report large avalanches of strategy changes when the system approaches the final state, identifying a sort of self-organized critical behavior.

As a particular case, Ref. [47] analyzes a scenario where topological changes occur much faster than changes of individuals’ strategies. The authors find that the evolution of individual strategies in this situation no longer corresponds to the Prisoner’s Dilemma, but to a sort of coordination game, leading to a situation more favorable to cooperation. This result highlights the effect of separating the time scales of structural and individual dynamics. Notice that the scenarios considered in the previous section (with static networks or nonevolving nodes) can be seen as extreme cases of coevolving networks with completely separated time scales (i.e. one of the two time scales is so large compared with the other that its dynamics can be neglected). In accordance with the importance of the relation between the two time scales in coevolutive scenarios, we find (as we will see in the following subsections) several works that analyze this influence and that treat the cases with one aspect (network or individuals) static as bounding cases.

3.2 Communication and Diffusion of Information in Social Networks

The interplay between communication within a population of socioeconomic agents and its underlying social structure is an interesting social topic that deserves further study [48]. Taking business relationships as an example, an agent would presumably like to occupy a network position that is as strategic as possible in terms of information reception and processing (close to the other agents in terms of average distance, or with a high betweenness, for instance). Moreover, since the socioeconomic environment is usually volatile (it keeps changing), actors need to be continuously looking for better contacts and ‘fresh opportunities’ [49, 50]. In such a dynamical scenario, where “who communicates with whom” and the social structure are strongly entangled, this issue is especially suitable for study from a coevolutive viewpoint.
Following this perspective, we can find recent works focused on an individual’s movements across the social structure to reach strategic positions while minimizing linking costs [51, 52], or works targeting key positioned individuals [48]. Other authors investigate the impact of communication on social structure both quantitatively (more or less communication) and qualitatively (different communication strategies) [53]. In general, the models employed in these works generate social structures that present complex patterns like modular structures [48]. Furthermore, some of these works report interesting behaviors of the modeled system, like self-organization to states close to the transition between fragmented and ordered states [51], sharp phase transitions and resilience of the structure [49, 50].

3.3 Opinion and Cultural Dynamics

Opinion and cultural dynamics are other important social topics which have been addressed from a coevolutive viewpoint.
Centola and co-workers [54] presented a coevolutionary version of Axelrod’s model of the dissemination of culture [55]. As in the seminal model by Axelrod, they represent cultural traits and features by numerical values that are transmitted (copied) among individuals in contact, with the difference that the topology of interactions among individuals also evolves. More concretely, agents can erase links to neighbors with whom they have no common social trait (i.e. the affinity among them is 0) and rewire them. The model presents a complex relationship between heterogeneity and cultural diversity, in which a high diversity can reduce cultural group formation while simultaneously increasing social connectedness.

The coevolutive approach has also been used in several recent works addressing opinion formation processes. In Refs. [56, 57, 58, 59], the authors propose coevolutive versions of the two-state voter model [60] to study consensus in populations’ opinions. In this kind of model, interactions between agents are enhanced or penalized (or even broken) according to whether or not they succeed in reaching an agreement. From a complex network point of view, these models are used to explore the transition between different states, with a special interest in the emergence and duration of metastable states reached before consensus.

Another model of opinion formation based on a coevolutive approach that has received considerable attention is proposed in [61]. This model is especially interesting regarding time-scale separation. In each time step, either a rewiring (structural change) or an opinion imitation (evolution of local state) occurs, with probabilities φ and 1 − φ, respectively. Therefore, by tuning φ the authors can easily recover either of the two extreme single-evolving cases (static network for φ = 0, nonevolving nodes for φ = 1) or move through the intermediate scenarios.
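The role of φ can be sketched in a few lines of code. The text only specifies that a rewiring happens with probability φ and an imitation with probability 1 − φ; the concrete rules below (a random node either moves one of its links to a random like-minded node, or copies the opinion of a random neighbour), the ring used as initial network and all names are assumptions made for illustration, not necessarily the exact model of [61].

```python
import random


def coevolve(n, edges, n_opinions, phi, steps, seed=0):
    """Run `steps` updates; each is a rewiring (prob. phi) or an imitation (prob. 1 - phi)."""
    rng = random.Random(seed)
    opinion = [rng.randrange(n_opinions) for _ in range(n)]
    nbrs = [[] for _ in range(n)]           # adjacency lists (multi-links allowed)
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(steps):
        i = rng.randrange(n)
        if not nbrs[i]:
            continue                        # isolated node: nothing to do
        j = rng.choice(nbrs[i])
        if rng.random() < phi:
            # Structural change: move the link i-j to a node k sharing i's opinion.
            peers = [k for k in range(n) if k != i and opinion[k] == opinion[i]]
            if peers:
                k = rng.choice(peers)
                nbrs[i].remove(j)
                nbrs[j].remove(i)
                nbrs[i].append(k)
                nbrs[k].append(i)
        else:
            # State change: i copies the opinion of its neighbour j.
            opinion[i] = opinion[j]
    return opinion, nbrs


def discordant_links(opinion, nbrs):
    """Number of links whose endpoints disagree (each link is seen from both ends)."""
    return sum(opinion[i] != opinion[j] for i in range(len(nbrs)) for j in nbrs[i]) // 2


n = 60
ring = [(i, (i + 1) % n) for i in range(n)]
start_op, start_nb = coevolve(n, ring, n_opinions=3, phi=1.0, steps=0, seed=2)
rewired_op, rewired_nb = coevolve(n, ring, n_opinions=3, phi=1.0, steps=5000, seed=2)
static_op, static_nb = coevolve(n, ring, n_opinions=3, phi=0.0, steps=5000, seed=2)
```

With φ = 1 the opinions never change and only the network reorganizes, while with φ = 0 the ring stays intact and only opinions evolve; intermediate values of φ mix the two time scales.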
By studying the whole range of possible situations, the authors of [61] find that the model undergoes a continuous phase transition as φ is varied, from a regime in which opinions are highly diverse to one in which most individuals hold the same opinion.

3.4 Spreading Phenomena

Last but not least, there is a growing literature on spreading phenomena (epidemics, diseases, infections) from an adaptive network perspective. We find an example of this in the series of works proposing and analyzing an adaptive version of the SIS (Susceptible-Infected-Susceptible) model, where susceptible individuals try to avoid infection by erasing their links with the infected population [62, 63, 64]. This sort of work analyzes how different levels of rewiring modify the dynamics of the adaptive SIS model (note that this implies, once again, studying the effect of having separated time scales). One common observation is that high levels of rewiring lead to the self-organization of the susceptible population into a single, densely connected cluster. If an individual in the cluster eventually becomes infected, this sort of organization favors a rapid spreading of the disease, which is seen as an avalanche of state changes from a macroscopic viewpoint.
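A single update of such an adaptive SIS dynamics might be sketched as follows. The text only states that susceptible individuals erase their links to infected ones; the specific rules used here (per-link rewiring to another susceptible node, per-link infection, per-node recovery, synchronous updating) and the parameter names are assumptions for illustration and are not taken from [62, 63, 64].

```python
import random


def adaptive_sis_step(state, edges, p_infect, p_recover, p_rewire, rng):
    """One synchronous update of an adaptive SIS sketch.

    state: list of 'S' / 'I' labels; edges: list of [a, b] pairs, rewired in place.
    Returns the new list of states."""
    new_state = list(state)
    susceptibles = [v for v, s in enumerate(state) if s == 'S']
    for e in edges:
        a, b = e
        if state[a] == state[b]:
            continue                        # only S-I links are active
        s = a if state[a] == 'S' else b     # the susceptible end of the link
        if rng.random() < p_rewire:
            # S drops its link to the infected node and reconnects to another susceptible.
            candidates = [v for v in susceptibles if v != s]
            if candidates:
                e[0], e[1] = s, rng.choice(candidates)
        elif rng.random() < p_infect:
            new_state[s] = 'I'              # infection travels along the S-I link
    for v, st in enumerate(state):
        if st == 'I' and rng.random() < p_recover:
            new_state[v] = 'S'              # recovery back to the susceptible state
    return new_state


rng = random.Random(0)
state = ['I', 'S', 'S', 'S', 'S']
edges = [[0, 1], [0, 2], [3, 4]]
after = adaptive_sis_step(state, edges, p_infect=0.5, p_recover=0.0, p_rewire=1.0, rng=rng)
```

With `p_rewire=1.0` every susceptible node cuts its contact with the infected node in one step, so the infected node ends up isolated and all links connect susceptibles, a small-scale version of the clustering of susceptibles described above. The sketch does not prevent duplicate links, which a more careful implementation would handle.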
We also mention the work presented in [65], where the authors propose an innovative coevolutionary model of HIV infection spreading through the use of dynamic complex networks. On the one hand, the state of each individual (her health situation) is determined by means of a Markov process that takes into account both topological data (such as the number of infected neighbours) and information regarding the HIV infection (the probabilities of infection and of progression from HIV to AIDS, for instance). On the other hand, the social structure of the population is defined at each time step as a function of certain statistical features and the state of the nodes (nodes with AIDS are removed from the network). The authors find a good correspondence between simulation results and real historical demographic and epidemiological data from the United States. Moreover, this epidemiological prediction model could be integrated in related decision support systems (regarding anti-drug policy, for instance).
4 Conclusions

In summary, the analysis of the dynamics of social networks has proved to be an outstanding application of complex network theory, as is demonstrated by the huge (and increasing) amount of work developed in the field during recent years. Two factors have contributed decisively to this success: the availability of large longitudinal social datasets obtained from communication technologies, and the massive influx of scientists from complexity science (especially physicists) into social network analysis.

In this chapter, we have provided a general view of recent research in this area. Following the evolution of the literature in the field, we have first referred to works treating dynamics on and of social networks separately, and have then addressed a more recent approach integrating both sorts of dynamics in a coevolutive scheme. In both cases, but especially in the latter, we have also echoed the results reported by authors regarding some of the points in Vega-Redondo’s check list given in the Introduction (emergence of nontrivial structural patterns, nonlinear macroscopic behaviors induced by local processes, etc.). When talking about coevolutive models, we have also stressed the effect of having more or less separated time scales for dynamics on and of social networks.

Finally, research on the dynamics of social complex networks is expected to keep growing, as the availability of datasets is increasing and the field continues to attract scholars. Nevertheless, to ensure this growth, issues like the ethical implications of social data collection and analysis [66, 67] and the integration of different disciplines and perspectives within the aforementioned science of networks should be seriously addressed.
References
1. Freeman, L.C.: The Development of Social Network Analysis: A Study in the Sociology of Science. Empirical Press, Vancouver (BC Canada) (2004).
2. Scott, J.: Social Network Analysis: A Handbook. SAGE Publications, London (2000).
3. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, New York (1994).
4. Holme, P., Edling, C.R., Liljeros, F.: Structure and time evolution of an Internet dating community. Social Networks 26, 155–174 (2004).
5. Vega-Redondo, F.: Complex Social Networks. Cambridge University Press, New York (2007).
6. Watts, D.J.: Six Degrees: The Science of a Connected Age. W. W. Norton & Company Inc., New York (2003).
7. Barabási, A.-L.: Linked: The New Science of Networks. Perseus Publishing, Cambridge (USA) (2002).
8. Coleman, J.: Foundations of Social Theory. Harvard University Press, Cambridge, MA (1990).
9. Gould, R.V.: Collective action and network structure. American Sociological Review 58 (2), 182–196 (1993).
10. Gould, R.V.: The origins of status hierarchies: A formal theory and empirical test. American Journal of Sociology 107 (5), 1143–78 (2002).
11. Epstein, J.M.: Generating classes without conquest. In: Generative Social Science: Studies in Agent-Based Computational Modeling. Princeton University Press, Princeton, NJ (2007).
12. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Reviews of Modern Physics (Accepted) 348 (2008).
13. Rogers, E.M.: Diffusion of Innovations (5th ed.). Free Press, New York (2003).
14. Valente, T.W.: Models and methods for innovation diffusion. In: Carrington, P., Scott, J., Wasserman, S. (ed) Models and Methods in Social Network Analysis. Cambridge University Press, New York (2005).
15. Abramson, G., Kuperman, M.: Social games in a social network. Phys. Rev. E 63, 030901 (2001).
16. Duran, O., Mulet, R.: Evolutionary prisoner’s dilemma in random graphs. Physica D 208 (3–4), 257–265 (2005).
17. Santos, F.C., Pacheco, J.M., Lenaerts, T.: Evolutionary dynamics of social dilemmas in structured heterogeneous populations. Proc. Natl. Acad. Sci. 103, 3490–3494 (2006).
18. Lozano, S., Arenas, A., Sanchez, A.: Mesoscopic structure conditions the emergence of cooperation on social networks. PLoS ONE 3(4): e1892 doi: 10.1371/journal.pone.0001892 (2008).
19. Castellano, C., Loreto, V., Barrat, A., Cecconi, F., Parisi, D.: Comparison of voter and Glauber ordering dynamics on networks. Phys. Rev. E 71 (6), 066107 (2005).
20. Sood, V., Redner, S.: Voter model on heterogeneous graphs. Phys. Rev. Lett. 94 (17), 178701 (2005).
21. Castellano, C., Vilone, D., Vespignani, A.: Incomplete ordering of the voter model on small-world networks. Europhys. Lett. 63 (1), 153–158 (2003).
22. Szabó, G., Fáth, G.: Evolutionary games on graphs. Phys. Rep. 446 (4–6), 97–216 (2007).
142
S. Lozano
23. Stauffer, D.: Sociophysics Simulations II: Opinion Dynamics. arXiv:physics/ 0503115v1 [physics.soc-ph] (2005). 24. Bjelland, J., Canright, G., Engø-Monsen, K., Remple, V.P.: Topographic spreading analysis of an empirical sex workers network. In: (ed). Springer, Berlin (2008). 25. Doreian, P., Stokman, F.N. (ed): Evolution of Social Networks. Routledge, London (1997). 26. Borgatti, S.P.: The State of Organizational Social Network Research Today. Dept. of Organization Studies. Boston College, Boston, MA (2003). 27. Snijders, T.A.B.: Models for longitudinal network data. In: Carrington, P., Scott, J., Wasserman, S. (ed) Models and Methods in Social Network. Analysis. Cambridge University Press, New York (2005). 28. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford (2003). 29. Palla, G., Barab´ asi, A-L., Vicsek, T.: Quantifying social group evolution. Nature 446 (5), 664–667 (2007). 30. Eckmann, J.-P., Moses, E., Sergi, D.: Entropy of dialogues creates coherent strutures in e-mail traffic. PNAS 101 (40), 14333–14337 (2004). 31. Onnela, J.-P., Saram¨ aki, J., Hyv¨ onen, J., Szab´ o, G., Lazer, D., Kaski, K., Kert´esz, J., Barab´ asi, A.-L.: Structure and tie strengths in mobile communication networks. PNAS 104 (18), 7332–7336 (2007). 32. Braha, D., Bar-Yam Y.: From centrality to temporary fame: Dynamic centrality in complex networks. Complexity 12 (2), 59–63 (2006). 33. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998). 34. Barab´ asi, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999). 35. Jin, E.M., Girvan, M., Newman, M.E.J.: Structure of growing social networks. Phys. Rev. E 64, 046132 (2001). 36. Roth, C.: Generalized Preferential Attachment: Towards Realistic Social Network Models. ISWC 4th Intl Semantic Web Conference. (2005). 37. 
Gr¨ onlund, A., Holme, P.: Networking the seceder model: Group formation in social and economic systems. Phys. Rev. E 70, 036108 (2004). 38. Bogu˜ na, M., Pastor-Satorras, R., D´ıaz-Guilera A., Arenas A.: Models of social networks based on social distance attachment. Phys Rev E 70, 056122 (2004). 39. Lazer, D.: The co-evolution of individual and network. J. Math. Sociol. 25, 69108 (2001). 40. Gross T., Blassius, B.: Adaptive coevolutionary networks: A review. J. R. Soc. Interfac 5 (20), 259–271 (2007). 41. Skyrms, B., Pemantle, R.: A dynamic model of social network formation. Proc. Nat. Acad. Sci. 97 (16), 9340–9346 (2000). 42. Eguiluz, V.M., Zimmermann, M.G., Cela-Conde, C.J., San Miguel, M.: Cooperation and the emergence of role differentiation in the dynamics of social networks. AJS 110 (4), 9771008 (2005). 43. Biely, C., Dragosits, K., Thurner, S.: The prisoners dilemma on co-evolving networks under perfect rationality. Physica D 228, 4048 (2007). 44. Zimmermann, M.G., Egu´ıluz, V.M.: Cooperation, social networks, and the emergence of leadership in a prisoners dilemma with adaptive local interactions. Phys. Rev. E 72, 056118 (2005). 45. Zimmermann, M.G., Egu´ıluz, V.M., San Miguel, M.: Coevolution of dynamical states and interactions in dynamic networks. Phys. Rev. E 69, 065102(R) (2004).
Dynamics of Social Complex Networks
143
46. Ebel, H., Bornholdt, S.: Coevolutionary games on networks. Phys. Rev. E 66, 056118 (2002). 47. Pacheco, J.M., Traulsen, A., Nowak, M.A.: Coevolution of strategy and structure in complex networks with dynamical linking. Phys. Rev. Lett. 97, 258103 (2006). 48. Rosvall, M., Sneppen, K.: Dynamics of opinions and social structures. arXiv:0708.0368v2 [physics.soc-ph] (2007). 49. Marsili, M., Vega-Redondo, F., Slanina, F.: The rise and fall of a networked society: A formal model. Proc. Nat. Acad. Sci. 101, 1439–1442 (2004). 50. Ehrhardt, G.C.M.A, Marsili, M., Vega-Redondo, F.: Phenomenological models of socioeconomic network dynamics. Phys. Rev. E 74, 036106 (2006). 51. Holme, P., Ghoshal, G.: Dynamics of networking agents competing for high centrality and low degree. Phys. Rev. Lett. 96, 098701 (2006). 52. K¨ onig, M.D, Battiston, S., Napoletano, M., Schweitzer, F.: On algebraic graph theory and the dynamics of innovation networks. Networks and Heterogeneous Media 3 (2) 201–220 (2007). 53. Rosvall, M., Sneppen, K.: Modeling self-organization of communication and topology in social networks. Phys. Rev. E 74, 016108 (2006). 54. Centola, D., Gonz´ alez-Avella, J.C., Egui´ıluz, V.M., San Miguel, M.: Homophily, cultural drift, and the co-evolution of cultural groups. J. of Conflict Resolution 51 (6), 905–929 (2007). 55. Axelrod, R.: The dissemination of culture: A model with local convergence and global polarization. The Journal of Conflict Resolution 41 (2), 203–226 (1997). 56. Benczik, I.J., Benczik, S.Z., Schmittmann, B., Zia, V.: Lack of consensus in social systems. EPL 82, 48006 (2007). 57. V´ azquez, F., Egu´ıluz, V.M., San Miguel, M.: Generic absorbing transition in coevolution dynamics. Phys. Rev. Lett. 100, 108702 (2007). 58. Zanette, D.H., Gil, S.: Opinion spreading and agent segregation on evolving networks. Phys. D 224, 156–165 (2006). 59. Gil, S., Zanette, D.H.: Coevolution of agents and networks: Opinion spreading and community disconnection. Phys. Lett. 
A 356, 89–95 (2006). 60. Liggett, T.M.: Interacting Particle Systems. Springer, New York (1985). 61. Holme, P., Newman, M.E.J.: Nonequilibrium phase transition in the coevolution of networks and opinions. Phys. Rev. E 74, 056108 (2006). 62. Gross, T., D’Lima, C.J.D., Blasius, B.: Epidemic dynamics on an adaptive network. Phys. Rev. Lett. 96, 208701 (2006). 63. Gross, T., Kevrekidis, I.G.: Coarse-graining adaptive coevolutionary network dynamics via automated moment closure. arXiv:nlin/0702047v1 [nlin.AO] (2007). 64. Zanette, D.: Coevolution of agents and networks in an epidemiological model. arXiv:0707.1249v2 [physics.soc-ph] (2007). 65. Sloot, P.M.A., Ivanov, S.V., Boukhanovsky, A.V., Vijver, D., Boucher, C.A.: Stochastic simulation of HIV population dynamics through complex network modeling, Int. J. of Computer Mathematics 85 (8), 1175–1187 (2008). 66. Borgatti, S.P., Molina, J.L.: Toward ethical guidelines for network research in organizations. Social Networks. 27 (2), 107–117 (2005). 67. Birnbaum, M.H.: Methodological and ethical issues in conducting social psychology research via the Internet. In: Sansone, C., Morf, C.C., Panter, A.T. (ed) Handbook of Methods in Social Psychology. Sage, Thousand Oaks, CA (2004).
The Structure and Dynamics of Linguistic Networks
Monojit Choudhury¹ and Animesh Mukherjee²
¹ Microsoft Research India, Sadashivnagar, Bangalore, India – 560080, [email protected]
² Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India – 721302, [email protected]
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_9, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

1 Introduction
Human beings as a species are unique in the biological world, for they are the only organisms known to be capable of thinking, communicating and preserving a potentially infinite number of ideas that form the pillars of modern civilization. This unique ability is a consequence of the complex and powerful human languages, characterized by their recursive syntax and compositional semantics [40]. It has been argued that language is a dynamic complex adaptive system that has evolved through the process of self-organization to serve human communication needs [80]. The complexity of human languages has always attracted the attention of physicists, who have tried to explain several linguistic phenomena through models of physical systems (see, e.g., [32, 42]).
Like any physical system, a linguistic system (i.e., a language) can be viewed from three different perspectives [52]. At one extreme, a language is a collection of utterances that are produced by the speakers of a linguistic community during the course of their interactions with other speakers of the same community. This is analogous to the microscopic view of a thermodynamic system, where every utterance and its corresponding context contributes to the identity of the language, i.e., the grammar. At the other extreme, a language can be characterized by a set of grammar rules and a vocabulary. This is analogous to a macroscopic view. Sandwiched between these two extremes, one can also conceive of a mesoscopic view of language, where linguistic entities such as letters, words or phrases are the basic units and the grammar is an emergent property of the interactions among them.
Complex networks provide a suitable framework to model and study the structure and dynamics of linguistic systems from a mesoscopic perspective. Although multi-agent simulation is the preferred modeling paradigm for microscopic studies in linguistics (see, e.g., [15, 80]), there have been some works where networks are also involved. For instance, in [67], the interaction patterns between the agents are modeled as a social network, and the diffusion of linguistic innovations (which are key to language change) is studied on various network topologies. This survey is confined to works on linguistic networks at the mesoscopic level.
There has been a plethora of works on linguistic networks with various motivations and at various levels of linguistic structure. On the basis of the primary goal of the research, the work in this area can be broadly classified into two categories: (1) those which investigate the structural properties of language from the perspective of language evolution and, thereby, explain the emergence of certain universal characteristics of languages, and (2) those which try to exploit network-based representations to develop useful practical systems such as machine translation, information retrieval and summarization systems. This article focuses on the former, but a brief overview of the latter is also presented in Section 5.
The survey is organized from the perspective of linguistic structure. Section 2 describes lexical networks, where the nodes are words and the edges represent a lexical relationship between two words, such as phonetic or semantic similarity. In Section 3 we present an overview of various networks where the nodes are again the words but, unlike the case of lexical networks, the edges represent their co-occurrences in similar contexts. These networks are representations of the interactions among words as governed by the grammar rules of a language. Section 4 describes the phonological networks, where the nodes are sub-lexical units such as phonemes or syllables. Applications of linguistic networks in natural language processing (NLP) and information retrieval (IR) are discussed in Section 5. Section 6 concludes the survey by enumerating some open problems in the area of linguistic networks.

2 Lexical Networks
The phrase “mental lexicon” (ML) usually refers to the repository of word forms that is assumed to reside in the human brain. The average size of the receptive vocabulary for a normal high school student has been found to be more than 100,000 [63]. Quite surprisingly, speakers are capable of navigating this huge lexicon very efficiently; judging whether a word form is legitimate takes less than 100 milliseconds. Consequently, there are two important questions associated with ML: (a) how the words are stored in long-term memory, i.e., how ML is organized, and (b) how these words are retrieved from ML. Note that these questions are highly interrelated: to predict the organization one can investigate how words are retrieved from ML, and vice versa.
One of the earliest attempts to model the organization of ML was made in [13]. In this work, the authors propose a hierarchical structure of ML, where the concepts are arranged in the form of a tree and the attributes of a particular concept in this tree can be inherited by all the child concepts. Figure 1 shows a representative example formed from the concepts “animal”, “mammal” and “fish”.
Fig. 1. The hierarchical structure of ML.
While early studies like [13] focused mainly on the representation of the local structure of ML, its global structure remained largely unexplored. Recently, researchers have also started to investigate the global structure of ML, primarily within the framework of complex systems and, more specifically, complex networks (see [36, 45, 77, 83, 86]). In all of these studies ML is modeled as a web of interconnected nodes, where each node corresponds to a word form and the interconnections may be based on any one (or more) of the following:
• Phonological similarity (e.g., the words banana, bear and bean may be connected since they start with the same phoneme),
• Semantic similarity (e.g., the words banana, apple and pear may be connected since all of them are names of fruits),
• Frequency of usage,
• Age at which the word forms are acquired,
• Parts of speech, and
• Orthographic properties.
In the rest of this section we review one representative study each (referring, wherever applicable, to the other relevant ones) of such complex networks constructed based on (a) phonological, (b) semantic, and (c) orthographic similarities of the word forms. Syntactic similarity-based networks will be discussed in detail in the next section.

2.1 Phonological Similarity-Based Networks
Phonological similarity among the word forms has been extensively studied in the past to infer the structure of ML and, consequently, the nature of a linguistic system [4, 35, 71, 81]. This large-scale phonological ML has also been studied in the framework of complex networks in which the word forms represent the nodes and two nodes (i.e., words) are connected by an edge if they differ only by the addition, deletion or substitution of one or more phonemes [36, 45, 83, 86]. One of the most popular studies is reported in [45], where the author constructs a phonological neighborhood network (PNN) in order to uncover the organizing principles of ML. In PNN there is an edge (u, v) connecting the nodes u and v iff at least two-thirds of the phonemes that occur in the word represented by u also occur in the word represented by v. For instance, if the word is 6 phonemes long, then one can derive all its neighbors by changing at most two phonemes through insertions, deletions, and substitutions. The author uses the Hoosier Mental Lexicon database [68] and builds the above network from the phonologically transcribed forms of each word present in the database. More specifically, he constructs a directed network, where a long word can have a short word as its neighbor without the short word being the neighbor of the long word. For instance, if the number of segments in which the two words, say w1 and w2, differ is less than 1/3 of the length of w1, then there will be a directed edge from the node corresponding to w1 to the node corresponding to w2. The fraction 1/3 is chosen because it has been useful in earlier experiments for predicting reaction times and familiarity ratings (see [53]).
The author shows that PNN is characterized by a very high clustering coefficient (0.235) but at the same time exhibits a long average path length (6.06) and diameter (20). This indicates that, like a small-world network, the lexicon has many densely interconnected neighborhoods. However, unlike in a small-world network, links between two nodes from different neighborhoods are harder to find. Low mean path lengths are necessary in networks that are to be traversed quickly, the purpose of traversal being search in most cases. However, in the case of ML, the search should not inhibit the neighbors of the stimulus's neighbors that are non-neighbors of the stimulus itself and are, therefore, not similar to the stimulus.
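The directed edge rule of PNN can be illustrated with a minimal sketch. It follows the stricter formulation given above (an edge from w1 to w2 when the two words differ in fewer than 1/3 of the segments of w1) and assumes that "number of differing segments" is measured by Levenshtein edit distance over phoneme sequences. The toy lexicon, its ASCII phoneme symbols and the function names are hypothetical and are not taken from [45] or from the Hoosier Mental Lexicon.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def build_pnn(lexicon):
    """Directed edges (w1, w2): w1 and w2 differ in fewer than 1/3 of len(w1) segments."""
    edges = set()
    for w1, p1 in lexicon.items():
        for w2, p2 in lexicon.items():
            # Integer form of: edit_distance < len(p1) / 3
            if w1 != w2 and 3 * edit_distance(p1, p2) < len(p1):
                edges.add((w1, w2))
    return edges


# Hypothetical toy lexicon: word -> phoneme sequence (illustrative symbols only).
TOY_LEXICON = {
    "strands": ("s", "t", "r", "ae", "n", "d", "z"),
    "stand": ("s", "t", "ae", "n", "d"),
    "sand": ("s", "ae", "n", "d"),
}

pnn_edges = build_pnn(TOY_LEXICON)
print(sorted(pnn_edges))
```

In this toy lexicon the long word "strands" has "stand" as a neighbor but not the other way round, which is the asymmetry that motivates a directed network.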
Because lexical search needs to reach only words that are similar to the stimulus, it can be conjectured that traversal of links between distant nodes is usually not required in order to search in PNN. Instead, the search involves an activation of the structured neighborhoods that share a single sub-lexical chunk, which could be acoustically related during word recognition [55]. Further, the author shows that the degree distribution of the nodes in PNN is exponential rather than scale free. Thus, one can posit that the structure of ML is not consistent with “growth via preferential attachment”, at least for the neighborhood density metrics used for this study. The reason is that in the standard preferential attachment model the emergent degree distribution of the network is known to be scale free [5]. The cause of the exponential degree distribution of PNN is not yet well understood and remains an open area for further research.

2.2 Semantic Similarity-Based Networks
One of the classic examples of semantic similarity-based networks is WordNet [20]. In this network, concepts (known as synsets) are the nodes, and semantic relationships between them are represented through the edges. In [77] the authors analyze the structure of the nouns in the English WordNet database (version 1.6). The semantic relationships between the nouns can be primarily of four types: (i) hypernymy/hyponymy (e.g., animal/cat), (ii) antonymy (e.g., day/night), (iii) meronymy/holonymy (e.g., trunk/tree) and (iv) polysemy (e.g., the concepts “the main stem of a tree”, “the body excluding the head and neck and limbs”, “a long flexible snout as of an elephant” and “luggage consisting of a large strong case used when traveling or for storage” are connected to each other due to the polysemous word “trunk”, which can mean all of these). Some of the important findings of this work are as follows.
• Semantic relationships are scale invariant.
• The hypernymy tree forms the skeleton of the network.
• Inclusion of polysemy reorganizes the network into a small world.
• The nodes with the most traffic (i.e., nodes with the maximum number of paths passing through them) correspond to those concepts which are expressed by the most polysemous words. They are also found to have very high clustering coefficients.
• In the presence of polysemous edges, the distance between two nodes across the network is not in correspondence with the depth at which they are found in the hypernymy tree.
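The effect of polysemy edges on network distances can be made concrete with a small sketch. The hypernymy tree below is a hypothetical toy hierarchy built around three senses of “trunk”; it is not taken from WordNet, and the node names and the helper functions are illustrative assumptions. Breadth-first search gives the shortest path between two senses, first in the tree alone and then with polysemy edges added.

```python
from collections import deque

# Hypothetical toy hypernymy tree as (child, parent) pairs; not real WordNet data.
HYPERNYMY = [
    ("plant_part", "entity"),
    ("body_part", "entity"),
    ("artifact", "entity"),
    ("tree_trunk", "plant_part"),
    ("elephant_trunk", "body_part"),
    ("container", "artifact"),
    ("luggage_trunk", "container"),
]

# Polysemy edges: concepts expressed by the same word form "trunk".
POLYSEMY = [
    ("tree_trunk", "elephant_trunk"),
    ("tree_trunk", "luggage_trunk"),
    ("elephant_trunk", "luggage_trunk"),
]


def undirected(edges):
    """Adjacency sets of an undirected graph given as a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path_length(adj, source, target):
    """Number of edges on a shortest path (breadth-first search), or None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None


tree_only = undirected(HYPERNYMY)
with_polysemy = undirected(HYPERNYMY + POLYSEMY)
print(shortest_path_length(tree_only, "tree_trunk", "luggage_trunk"))
print(shortest_path_length(with_polysemy, "tree_trunk", "luggage_trunk"))
```

In the toy tree the two senses are five steps apart, whereas a single polysemy edge makes them direct neighbors, so distance in the full network no longer tracks depth in the hypernymy tree.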
Further references to studies on such semantic relationship-based networks can be found in [1, 82]. Although there are several works attempting to analyze the structure of the semantic network of words, one hardly finds any study explaining the emergence of these topological properties through models of network synthesis. It would be very interesting to study the correlates of semantic acquisition and symbol grounding with the model parameters.

2.3 Orthographic Similarity-Based Networks
Like phonological similarity networks, one can also construct networks based on orthographic similarity, where the nodes are the words and the edit distance between two words defines the weight of the edge between the corresponding nodes. Such networks have been studied in order to investigate the difficulties involved in spelling error detection and correction [11]. In this work the authors construct such networks (SpellNet) for three different languages (Bengali, Hindi and English) and analyze them to show the following.
• For a particular language, the probability of real word errors can be equated to the average weighted degree of SpellNet.
• The difficulty of non-word error correction correlates with the average clustering coefficient for a language.
• The basic topological properties are invariant across the languages; for instance, the authors find that the SpellNet for each of the three languages is characterized by an exponential degree distribution, a high clustering coefficient and a positive correlation between the degree and the clustering coefficient of the nodes.

3 Word Co-Occurrence Networks
In this section, we review the work on word co-occurrence networks, where the nodes are the words and an edge between two words indicates that the words have co-occurred in the language in certain context(s). Depending on the definition of the context, various networks can be defined. We describe in detail two such networks: the collocation network and the syntactic dependency network. As an application, we discuss the work of [79], where the collocation network has been used for unsupervised induction of the grammatical structure of a language.

3.1 Collocation Network
One of the most basic and well-studied types of co-occurrence network is the word collocation network, where two words are linked if they are neighbors, that is, if they collocate, in a sentence [24]. In [24], two types of collocation networks, unrestricted and restricted, were constructed for English from the British National Corpus. In an unrestricted network, all the collocation edges are preserved, whereas in a restricted one only those edges are preserved for which the probability of occurrence of the edge is higher than it would be if the two words occurred independently of each other. All these networks are undirected and unweighted, even though in language the order of words (“ticket book” is different from “book ticket”) as well as the frequency of the collocations have obvious significance.
The authors found that both networks exhibit small-world properties. The average path length between any two nodes is small (around 2 to 3), and the clustering coefficients are high (0.69 for the unrestricted and 0.44 for the restricted networks). However, the most striking observation regarding these networks is that the degree distributions follow a two-regime power law. The degree distribution of the 5000 most connected words follows a power law with an exponent −3.07, which is surprisingly close to that of the Barabási-Albert growth model [5].
These findings led the authors to argue that word usage in human languages is preferential in nature, where the frequency of a word determines how easily it is comprehended and produced. Thus, the higher the usage frequency of a word, the higher the probability that the speakers will be able to produce it easily and the listeners will comprehend it quickly. This is known as the recency effect in linguistics [3]. The small-world property of the collocation network, on the other hand, makes it easier to search the mental lexicon (ML). In essence, the authors conclude that the evolution of language has resulted in an optimal structure of word interactions that facilitates easier and faster production, perception and navigation of the words.
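The distinction between the unrestricted and the restricted collocation network can be sketched as follows. This is only an illustration under stated assumptions: the joint probability of a pair is estimated as its share of all adjacent word pairs, the word probabilities as relative token frequencies, and an edge is kept in the restricted network when the joint probability exceeds the product of the two word probabilities. The estimators, the toy corpus and the function name are hypothetical and are not taken from [24].

```python
from collections import Counter


def collocation_networks(sentences):
    """Return (unrestricted, restricted) edge sets; each edge is a frozenset of two words."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        word_counts.update(words)
        for left, right in zip(words, words[1:]):
            if left != right:  # ignore a word adjacent to itself
                pair_counts[frozenset((left, right))] += 1

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    unrestricted = set(pair_counts)  # every observed collocation
    restricted = set()
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        p_joint = count / n_pairs
        p_independent = (word_counts[a] / n_words) * (word_counts[b] / n_words)
        if p_joint > p_independent:  # collocates more often than chance
            restricted.add(pair)
    return unrestricted, restricted


# Hypothetical toy corpus.
TOY_CORPUS = [
    "the cat is here",
    "the dog is here",
    "the bird is there",
    "the fish is there",
    "this is the end",
    "the sun is up",
    "the moon is up",
]

unrestricted, restricted = collocation_networks(TOY_CORPUS)
print(len(unrestricted), len(restricted))
print(unrestricted - restricted)
```

In this toy corpus the two most frequent words, “the” and “is”, are adjacent only once, which is less often than their frequencies would predict under independence, so that edge is the only one dropped from the restricted network.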
It does not follow, however, from the collocation networks that a word with high degree is indeed a word with high usage frequency (unless the word co-occurrences are completely independent in nature, which is not the case). In a separate study, Cancho and Solé [25] have shown that the rank-degree distribution of the words in a very large corpus also follows a two-regime power law, supporting their claim regarding the presence of a core lexicon whose size is about 5000 words. In order to explain the two-regime power law in word collocation networks, Dorogovtsev and Mendes [18] proposed a preferential attachment-based growth model. At every time step t, a new word (i.e., a node) enters the language (i.e., the network) and connects itself preferentially to one of the pre-existing nodes. Simultaneously, ct (where c is a positive constant) new edges are grown between pairs of old nodes that are chosen preferentially. Through mathematical analysis and simulations, the authors establish that this model gives rise to a two-regime power law with exponents very close to those observed in [24].
There have been studies on the properties of collocation networks for languages other than English, including Russian [46] and many others [41]. The basic topological properties of the networks (e.g., scale-free, small-world, assortative) are similar across languages, which suggests that, like Zipf’s law, these characteristics are linguistic universals that call for a non-trivial psycholinguistic account of their emergence and existence.

3.2 Syntactic Dependency Network
Although collocation networks are easier to construct, they do not necessarily capture the syntactic and semantic relationships between the words, because syntactic and semantic relations often extend beyond the local neighborhood of a word. Syntactic relations between the words of a language are governed by the underlying grammar.
There are various formalisms, such as phrase structure grammar, tree-adjoining grammar and dependency grammar, to capture these relationships. In the dependency grammar formalism, a relationship, often shown as a directed edge, connects two words: the head and the dependent. The dependent word modifies the head word in a certain way. For example, the nouns are the heads of the adjectives that modify them. Similarly, the verbs are the heads of their subjects, objects and other arguments. Thus, in the dependency formalism, every sentence is represented as a directed acyclic graph or a dependency tree, as illustrated in Fig. 2. Usually, the finite verb is the head of the whole sentence and is not dependent on any other word.
Fig. 2. Example of a dependency tree. The arrows are labeled by the type of dependency relation and run from the dependent to the head words.
Cancho and his co-authors [21, 26] defined the syntactic dependency network (SDN), where the words are the nodes and there is a directed edge between two words if in any of the sentences of a given corpus there is a directed dependency relation between these words. The direction of the dependencies in their construction is from the dependent word to the head word. In order to construct the SDN, one needs to know the dependency relations between the words of a sentence. Fortunately, there are large dependency treebanks for some languages, consisting of human-annotated dependency trees for several thousand sentences. The authors studied the SDN for three languages, Czech, German and Romanian, and observed strikingly similar characteristics. All the networks exhibit power-law degree distributions and small-world structures. Some of the very interesting topological properties observed are the following.
• Disassortative mixing. This shows that words that are used for linking other words (such as prepositions) and, therefore, have high degree in the networks, are not linked to one another.
• Hierarchical organization. This implies that there is a top-down hierarchy that is the basis of the phrase structure formalism.
• Small-world structure. This is necessary for recursion and fast navigation of the mental lexicon.
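The construction of an SDN from a treebank, as described before the list above, can be sketched in a few lines. The two annotated sentences are hypothetical and stand in for a real dependency treebank; each sentence is given as a list of (dependent, head) arcs, and the function names are illustrative assumptions rather than anything defined in [21, 26].

```python
from collections import defaultdict


def build_sdn(treebank):
    """Collect directed edges (dependent, head) over all sentences of a treebank."""
    edges = set()
    for arcs in treebank:
        for dependent, head in arcs:
            edges.add((dependent, head))
    return edges


def total_degree(edges):
    """Total degree (in-degree plus out-degree) of every word in the network."""
    degree = defaultdict(int)
    for dependent, head in edges:
        degree[dependent] += 1
        degree[head] += 1
    return dict(degree)


# Hypothetical toy treebank: one list of (dependent, head) arcs per sentence.
TOY_TREEBANK = [
    # "the dog chased the cat"
    [("the", "dog"), ("dog", "chased"), ("the", "cat"), ("cat", "chased")],
    # "the cat slept"
    [("the", "cat"), ("cat", "slept")],
]

sdn = build_sdn(TOY_TREEBANK)
print(sorted(sdn))
print(total_degree(sdn))
```

Because the network pools arcs from all sentences, a dependency that recurs in several sentences (here “the” modifying “cat”) contributes a single edge, and a word accumulates degree from every distinct head or dependent it is ever attached to.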
It is a well-known fact that syntactic dependency links usually do not cross in any of the world’s languages. In [22], the author conjectured that this phenomenon is an outcome of minimization of the Euclidean distance between the syntactically related words of a sentence, where the Euclidean distance between two words is given by the number of words separating them.¹ Later on, Cancho et al. [23] showed that spectral clustering of SDN places words belonging to the same syntactic category in the same cluster. As we shall see in Section 5, quite similar techniques are being used in the field of NLP for unsupervised induction of syntactic categories.
¹ While it is true that syntactic dependencies have a tendency to avoid crossing, there are systematic exceptions to that generalization in languages with relatively free constituent order. In German, for example, about one-third of all relative clauses are extraposed, thus creating crossing dependencies.

3.3 Unsupervised Grammar Induction
One of the fascinating applications of word collocation networks, illustrated in [79], is related to unsupervised induction of grammar. Explaining the process of language acquisition is one of the greatest challenges to modern science. Children learn the languages that they are exposed to quite accurately and effortlessly. This is one of the strongest pieces of evidence in support of our instinctive capacities towards languages [70], which is dubbed the universal grammar by Noam Chomsky [10]. In [79], the authors proposed a very simple algorithm for learning hierarchical structures from the collocation graph of a raw text corpus. The algorithm, ADIOS, works as follows. A directed collocation graph is constructed from the corpus, where the words are the nodes, and an edge is drawn from word w to word v if v follows w in a sentence. In fact, each sentence is represented as a separate path in the graph. The algorithm then iteratively searches for motifs that are shared by different sentences. A linguistic motif is defined as a sequence of words which tends to occur quite frequently in the language and also serves some special function. For example, “that the X is Y” is a very commonly occurring motif in English, where X and Y can be substituted by a large number of words and the whole pattern can be embedded in various parts of a sentence. Solan et al. [79] define the probability of a particular structure being a motif in terms of network flows. After finding the motifs, the algorithm proceeds to identify interchangeable motifs and merge them into a single node. Thus, at every step the network becomes smaller and a hierarchical structure emerges. This structure can then be presented as a set of phrase structure grammar rules.
ADIOS has a high precision (≈70%), but low recall (≈40%). Through a comparative analysis of the induced grammars, the authors were able to construct a dendrogram of the 6 languages that were studied. Quite surprisingly, the dendrogram reflects the phylogenetic relations between these 6 languages. There are other graph-based methods for unsupervised induction of syntactic structures, but unlike ADIOS, these algorithms are based on standard probability theory and Bayesian models.
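The first step of the procedure described above, the directed collocation graph in which every sentence is a separate path, can be sketched as follows. The sketch stops well short of ADIOS itself: instead of the flow-based motif criterion of [79], it simply reports the edges traversed by at least two sentence paths as crude motif candidates. The toy sentences, the function names and the `min_paths` parameter are hypothetical.

```python
from collections import defaultdict


def build_path_graph(sentences):
    """Map each directed edge (w, v) to the set of sentence indices whose path uses it."""
    edge_paths = defaultdict(set)
    for index, sentence in enumerate(sentences):
        words = sentence.lower().split()
        for w, v in zip(words, words[1:]):  # v follows w in this sentence
            edge_paths[(w, v)].add(index)
    return dict(edge_paths)


def shared_edges(edge_paths, min_paths=2):
    """Edges used by at least min_paths sentences (a simplification, not the ADIOS test)."""
    return {edge: paths for edge, paths in edge_paths.items() if len(paths) >= min_paths}


# Hypothetical toy corpus.
TOY_SENTENCES = [
    "i know that the cat is hungry",
    "she said that the dog is tired",
    "a bird sings",
]

graph = build_path_graph(TOY_SENTENCES)
print(shared_edges(graph))
```

Here only the edge from “that” to “the” is shared by two paths, which hints at the pattern “that the X is Y”; recognizing the full pattern with its variable slots requires the motif search and merging steps that this sketch omits.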
4 Phonological Networks In the earlier sections, we have seen how complex networks can be used to study the different types of interactions (phonological, syntactic and semantic) between the words of a language. In this section, we shall review some of the works where the networks are constructed from linguistic units that are smaller than words, e.g., phonemes and syllables. 4.1 Network of Human Speech Sounds The most basic units of human languages are the speech sounds. The repertoire of sounds that make up the sound inventory of a language are not chosen arbitrarily, even though the speakers are capable of perceiving and producing a plethora of them. In contrast, the inventories show exceptionally regular patterns across the languages of the world, which is arguably an outcome of the self-organization that goes on in shaping their structure. In fact, numerous computational models have been proposed in the literature in order to
M. Choudhury and A. Mukherjee
explain the self-organization of the vowel inventories [15, 47, 51, 76]. A few attempts have also been made in linguistics to explain the patterns observed across the consonant inventories. Most of these works confine themselves to explaining certain individual principles rather than formulating a general theory of how the patterns emerge. However, complex networks have recently been used quite successfully to explain the self-organization of the consonant inventories. In [65] the authors construct a bipartite network called PlaNet, or the Phoneme-Language Network, in which one partition consists of nodes representing the languages and the other of nodes representing the consonants. There is an edge between a language node and a consonant node if the consonant occurs in that language. The authors further construct PhoNet (Phoneme-Phoneme Network), which is the one-mode projection of PlaNet onto the consonant nodes, i.e., a network of consonants in which two nodes are linked as many times as they co-occur across the language inventories. The data used for constructing these networks are drawn from the UCLA Phonological Segment Inventory Database (UPSID) [54], which consists of 317 languages and the 541 consonants found across them. Several important observations are made from the study of PlaNet and PhoNet; they are noted below.
From the study of PlaNet [65]
• The degree distribution of the consonant nodes in PlaNet roughly follows a power law with an exponential cut-off towards the tail.
• A synthesis model based on preferential attachment (a language node attaches itself to a consonant node depending on the current degree k of the consonant node) can explain the emergence of the degree distribution of PlaNet. The results match the empirical data more accurately if the attachment kernel is super-linear (i.e., the attachment probability is proportional to k^α, where α > 1).
From the study of PhoNet [64, 65]
• The degree distribution of the consonant nodes in PhoNet also roughly indicates a power-law behavior with an exponential cut-off.
• The clustering coefficient of PhoNet (=0.89) is significantly higher than that of a random graph with the same number of nodes and edges (=0.08).
• Community structure analysis of PhoNet can capture the strong patterns of co-occurrence of consonants that are prevalent across the languages of the world.
• The driving force that leads to the emergence of these communities is feature economy, which states that languages tend to use a small number of distinctive features and maximize their combinatorial possibilities to generate a large number of consonants.
• The emergence of the degree distribution and the clustering coefficient of PhoNet can be explained through a synthesis model that is based on both preferential attachment and triad (i.e., fully connected triplet) formation. While the preferential part of the model reproduces the degree distribution
Linguistic Networks
of the network, the triad formation part imposes a large number of triangles onto the generated network, thereby increasing the clustering coefficient.
• The emergence of feature economy can be explained by having a synthesis model, which is a linear combination of two different parts, one driven by the usual degree-dependent preference and the other by a factor that favors the choice of those consonants that share many features with the already chosen ones.
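The construction of PlaNet, its one-mode projection PhoNet, and a degree-dependent attachment kernel of the form k^α can be sketched as follows. This is a minimal illustration under stated assumptions: the three toy inventories are invented (they are not UPSID data), the function names are hypothetical, and the `epsilon` smoothing term is an added convenience rather than part of the model reported in [65].

```python
from collections import defaultdict
from itertools import combinations

def build_planet(inventories):
    """Bipartite network stored as: consonant -> set of languages using it."""
    consonant_to_langs = defaultdict(set)
    for lang, consonants in inventories.items():
        for c in consonants:
            consonant_to_langs[c].add(lang)
    return dict(consonant_to_langs)

def project_to_phonet(inventories):
    """One-mode projection onto consonants: the weight of an edge is the
    number of language inventories in which both consonants occur."""
    weights = defaultdict(int)
    for consonants in inventories.values():
        for a, b in combinations(sorted(set(consonants)), 2):
            weights[(a, b)] += 1
    return dict(weights)

def attachment_probabilities(degree, alpha=1.0, epsilon=0.0):
    """Probability of choosing each consonant, proportional to k**alpha.
    epsilon (an assumption) lets zero-degree consonants be chosen."""
    raw = {c: k ** alpha + epsilon for c, k in degree.items()}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}

# Toy inventories (hypothetical, for illustration only).
toy_inventories = {
    "L1": ["p", "t", "k", "m"],
    "L2": ["p", "t", "k", "s"],
    "L3": ["p", "k", "m", "n"],
}
planet = build_planet(toy_inventories)
phonet = project_to_phonet(toy_inventories)
consonant_degree = {c: len(langs) for c, langs in planet.items()}
linear = attachment_probabilities(consonant_degree, alpha=1.0)
superlinear = attachment_probabilities(consonant_degree, alpha=2.0)
```

With α > 1 the already frequent consonants receive a larger share of the probability mass than under the linear kernel, which is the qualitative effect of a super-linear kernel.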
The authors postulate that the physical significance of the synthesis models is grounded in the process of language change. Language change is a collective phenomenon that functions at the level of a population of speakers [80]. They also conjecture that it is possible to explain the significance of the models at the level of an individual, primarily in terms of the process of language acquisition. Further, they argue that two orthogonal preferences are instrumental in the acquisition of the inventories: (a) the occurrence frequency of a consonant, and (b) the feature-dependent preference (which increases the ease of learning). The synthesis model is essentially a linear combination of these two mutually orthogonal factors.
4.2 Network of Syllables
The syllable inventory of each language can also be modeled and analyzed in the framework of a complex network. Each node in this network is a syllable, and a link is established between two syllables each time they are shared by a word. In [78] the authors report a study of the network of Portuguese syllables from two different sources: a Portuguese dictionary (DIC) and the complete works of a very popular Brazilian writer, Machado de Assis (MA). The authors show that
• The networks have a low average shortest path (DIC: 2.44, MA: 2.61),
• The networks indicate a high clustering coefficient (DIC: 0.65, MA: 0.50),
• Both the networks show a power-law behavior.
Since in Portuguese the syllables are close to the basic phonetic units, unlike the case in English, the authors argue that the properties of the English syllabic network should be different from those of the Portuguese one. The authors further conjecture that, since Italian has a strong parallelism between its structure and syllable hyphenation, the Italian syllabic network may have properties close to those of the Portuguese network, pointing to certain universal characteristics of language.
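The syllable network construction described in this subsection, together with the two reported statistics (average shortest path and clustering coefficient), can be sketched as below. This is an illustrative sketch only: the syllabified toy words are invented, the network is treated as unweighted (repeated co-occurrences are collapsed into one link), and syllables that never share a word with another syllable are simply absent from the graph.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_syllable_network(syllabified_words):
    """Link two syllables whenever they occur together in a word."""
    adj = defaultdict(set)
    for syllables in syllabified_words:
        for a, b in combinations(sorted(set(syllables)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)

def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_shortest_path(adj):
    """Mean BFS distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical syllabified words (not taken from the Portuguese data).
toy_words = [["ca", "sa"], ["sa", "la"], ["ca", "ma", "la"]]
network = build_syllable_network(toy_words)
clustering = average_clustering(network)
path_length = average_shortest_path(network)
```

On a real lexicon, the same two functions would yield the kind of figures quoted above, although a weighted variant may be closer to the construction in [78].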
5 Applications in NLP and IR
Graph-based approaches are quite common in the areas of natural language processing (NLP) and information retrieval (IR). Interestingly, although there are no obvious technical differences between the scope of graph theory in these areas and in complex networks, the terminologies used and the objectives are
often quite different. The works on linguistic networks discussed in the last three sections were primarily targeted at the statistical physics community, and their objective was to uncover the structure of languages and their dynamics. In this section, we survey some equally interesting and significant works that use the same set of mathematical tools, but whose objective is to develop practical applications concerning languages.
5.1 Induction of Syntactic and Semantic Categories
One of the earliest and most recurrent applications of networks in NLP has been the automatic induction of syntactic and semantic categories based on the distributional hypothesis [39], which states that words of similar syntactic (semantic) category are found in similar contexts. To illustrate this concept, consider two unknown words X and Y that occur in the following sentences:
(1) The red X is very beautiful.
(2) If you Y then I shall punish you.
Even though we do not know what X and Y are, it is easy to infer that the former is a noun and the latter is a verb. We can draw such inferences about the syntactic categories (in this case the parts of speech) of words based on our knowledge that nouns, but not verbs, can be preceded by articles (the) and adjectives (red). The distributional hypothesis is equally relevant for semantic categories. Words belonging to the same domain cluster together. Thus, the word student is expected to be in the vicinity of the word school, rather than market. The extent to which two words appear in similar contexts defines their similarity [62]. The general methodology [12, 27, 31, 72, 74, 75] for inducing word class information can be outlined as follows.
1. Define the context of a word as a vector. It could be just the set of words that occur in the same sentence, or only the immediate neighbors of the word. For syntactic class induction, the word order is usually preserved during construction of the vectors, and the context vectors are defined only in terms of the function words (such as is, of, the and a).
2. Collect global context vectors for the words by summing up the local contexts.
3. Construct a weighted network, where the nodes are the words and the weight of the edge between two words is the distance (or similarity) between their context vectors. There are several ways to define this measure; common choices are Euclidean distance, cosine similarity and correlation coefficients.
4. Apply a clustering algorithm to this network to obtain the word classes.
In the syntactic category induction literature, the 150–250 words with the highest frequency are considered as function words, and the context vectors
are defined based on them. Some authors employ a much larger number of features and reduce the dimensions of the resulting matrix using singular value decomposition [72, 74]. [27] uses the Spearman rank correlation coefficient and hierarchical clustering, [74, 75] use the cosine of the angle between vectors and buckshot clustering, [31] uses cosine on mutual information vectors for hierarchical agglomerative clustering, and [12] applies Kullback–Leibler divergence in the CDC algorithm. [28] does not sum up the contexts of each word in a context vector, but uses the most frequent instances of four-word windows in a co-clustering algorithm [16]: rows and columns (here words and contexts) are clustered simultaneously. Two-step clustering is undertaken by [74]: clusters from the first step are used as features in the second step. More recently, Biemann [6] proposed the Chinese Whispers algorithm for clustering, which is fast and does not require any parameters to be specified. [7] reports the application of Chinese Whispers to part-of-speech (POS) induction in English, Finnish and German; the approach has also been applied very recently to Bengali [66]. In this work, the authors also investigate the topological properties of the word networks so constructed and report a scale-free degree distribution, a high clustering coefficient and a power-law cluster size distribution. Widdows and Dorow [87] propose an unsupervised incremental cluster building approach for the acquisition of semantic classes. There are also graph-based algorithms to infer semantic classes (sets of synonyms, to be specific) from lexicons (see, e.g., [17, 43]). Identification of syntactic or semantic classes is of great importance to NLP and IR. For instance, POS tagging is the first step towards parsing. However, supervised machine learning techniques for POS tagging demand a large amount of human-annotated data, which is expensive to produce and non-existent for most languages. 
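Steps 1 to 3 of the general methodology outlined earlier in this subsection can be sketched as follows. This is a minimal illustration, not any of the cited systems: the toy sentences, the small set of function words, the choice of cosine similarity and the similarity threshold are all assumptions made for the example, and step 4 (clustering) is left out.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, feature_words):
    """Steps 1-2: for each word, sum counts of function words seen
    immediately to its left ("L") and right ("R") over the corpus."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            if i > 0 and tokens[i - 1] in feature_words:
                vectors[word][("L", tokens[i - 1])] += 1
            if i + 1 < len(tokens) and tokens[i + 1] in feature_words:
                vectors[word][("R", tokens[i + 1])] += 1
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def similarity_graph(vectors, threshold):
    """Step 3: weighted edges between words with similar context vectors."""
    words = sorted(vectors)
    graph = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            s = cosine(vectors[a], vectors[b])
            if s >= threshold:
                graph[(a, b)] = s
    return graph

# Toy data (hypothetical); real systems use the 150-250 most frequent words.
toy_sentences = ["the cat is on the mat", "the dog is on the rug"]
function_words = {"the", "a", "is", "on", "in"}
vectors = context_vectors(toy_sentences, function_words)
word_graph = similarity_graph(vectors, threshold=0.9)
```

Any graph clustering algorithm, for example Chinese Whispers, could then be applied to `word_graph` to obtain the word classes.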
Since automatic induction of POS tags through graph clustering does not require annotated data, it might turn out to be a very useful technique in NLP for resource-poor languages. Similarly, semantic clustering of the words is useful for search and IR.
5.2 Word Sense Disambiguation
Word sense disambiguation (WSD) refers to the task of assigning the appropriate sense or meaning to a word in a given context (i.e., sentence or paragraph) out of several possibilities. For example, the English word bank has two different meanings as a noun: (1) a river bank, and (2) a financial institution. However, as shown in the following sentences, in a given context only one of the senses is appropriate.
(1) They were walking down the bank enjoying the cool river breeze.
(2) She went to the bank to cash her check.
There are several ways in which graph-based techniques have been applied for WSD. Examples include lexical chaining [29], semantic relatedness
Fig. 3. Example of HyperLex: (a) the network of words for disambiguation of the word “light”; (b) the minimal spanning tree obtained after introduction of the word “light”. The hubs are shown in bold font.
measures based on path lengths and random walks on semantic networks [57, 61] and lexicon graphs [50]. Owing to space constraints, we discuss in detail only one of the approaches, HyperLex [85], which relies on word co-occurrence graphs. Consider the problem of automatically identifying and disambiguating the various senses of the word light. The HyperLex algorithm works as follows. A sub-corpus consisting of all the paragraphs featuring at least one occurrence of the word light is extracted from a raw text corpus. A word co-occurrence graph is constructed from this sub-corpus, where the nodes are the content words except for the word light. Two words are connected by an edge if they co-occur in a paragraph more than a preset number of times. The weight of an edge decreases as the number of times the words co-occur increases. It has been found that word co-occurrence graphs built in this manner exhibit small-world properties. In this co-occurrence network, nodes with very high degree are identified as hubs. The word light, for which we want to build the disambiguator, is then introduced into the network and connected to the hubs. A minimal spanning tree is constructed from the co-occurrence graph, where light is the root node and the first level consists of the hubs. Figure 3 illustrates this process. Each node in the spanning tree can be thought of as a sense. Thus, the hubs denote the basic senses and, as we move further down the tree, we have more refined senses of the word. This tree can then be used for disambiguating the sense of the target word (here light) in a particular context.
5.3 Information Retrieval
The central problem of IR is to rank a given collection of documents by their similarity to a query. Queries are usually very short and the collection of
documents huge. In a typical IR setup, the whole web, consisting of billions of webpages, represents the collection of documents to be ranked, and the query is only one or two words long. One of the challenges of IR is to utilize the network structure of the web to compute the ranks of the documents. The web can be conceptualized as a directed graph where the nodes are the webpages and a hyperlink from webpage A to webpage B represents a directed edge from the node corresponding to A to the node corresponding to B. The very popular PageRank [9] is one of the first ranking algorithms of this kind and is reportedly used by the Google search engine. The basic idea behind the PageRank algorithm is that the rank (or popularity) of a node is a function of the ranks of its neighbors. In other words, a page that has a hyperlink from a popular page is also popular. An alternative view of the PageRank algorithm involves a random walker (here a random surfer). A random walker starts from a random node and follows the edges of the graph randomly to reach other nodes. The PageRank of a page is proportional to the probability that a random surfer reaches that page by following random hyperlinks on the web. Yet another way to define PageRank is as the components of the principal eigenvector of the (suitably normalized) link matrix of the web graph. Thus, PageRank is a form of what the complex network literature calls eigenvector centrality. PageRank considers only the incoming edges of a node. Kleinberg [48] proposed another ranking algorithm, called HITS, where every node has two scores, hub and authority. The authority scores are similar to PageRank, whereas the hub scores are based on the outgoing links, but computed in the same way. The final rank of a node is a combination of its hub and authority scores. Kleinberg and co-authors [33] also demonstrated how eigenvectors of the web structure can be used to cluster and disambiguate the pages corresponding to ambiguous words such as “Jaguar” (referring to an animal, a football team or a car). 
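The random-surfer view of PageRank described above can be sketched as a power iteration. This is a minimal illustration under stated assumptions: the damping factor of 0.85, the uniform redistribution of rank from pages without outgoing links, the convergence tolerance and the four-page toy web are all choices made for the example and are not taken from the text.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank.

    links maps each page to the list of pages it links to. With
    probability `damping` the surfer follows a random outgoing link,
    otherwise jumps to a page chosen uniformly at random (an assumption).
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Page without outgoing links: spread its rank uniformly.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        change = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Toy web graph (hypothetical): D links to C but receives no links.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph, page C collects links from three pages and ends up with the highest rank, while D, which has no incoming links, keeps only the share it receives from random jumps.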
One drawback of both PageRank and HITS is that they assume that all hyperlinks have the same importance. Various modifications of these algorithms use machine learning techniques to learn the weights of the different types of hyperlinks. Examples include RankNet [73], TrustRank [37] and NetRank [2]. Link analysis, as this field is popularly called, is a very active area of research in the IR community. Some of the other emerging applications of complex networks in IR include mining social networks and blogs. The blogosphere [49], for example, can be represented as a multi-tier network, where blogs, bloggers and other webpages (typically news articles) are the nodes, and there are various types of edges representing the social network of bloggers, the links between blogs and those between the blogs and other webpages. Analysis of the blogosphere network is useful in classification and personalized suggestion of blogs, opinion and sentiment analysis, as well as in investigating the dynamics of the world of blogs.
5.4 Other Applications
Owing to space limitations, it is impossible to do justice to the full range of network-based techniques in NLP and IR. There are a variety of NLP tasks, ranging from parsing to text summarization, where graph-based methods have been applied. In the previous three subsections we have discussed three specific problems to illustrate the various uses of such techniques. Before we wrap up this section, we list a few more example applications to demonstrate the extent and potential of graph-based techniques in these areas. Text summarization is an important and challenging application of NLP, which has been elegantly modeled within the framework of complex networks. The problem of text summarization involves identifying a small number of sentences from a set of given documents that best summarize the content of the documents. In [19] summarization has been reformulated as the problem of computing node centrality in a network whose nodes are the sentences and whose edges represent the word-level similarity between two sentences. The most central sentences are those which cover most of the ideas present in the given documents. Other application areas include dependency parsing [56], textual entailment [38], sentiment classification [34, 69], keyword extraction [60], novelty detection [30] and prepositional phrase disambiguation [84]. See [8, 58, 59] for further references.
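The centrality-based formulation of summarization mentioned in this subsection can be sketched as follows. This is a simplified illustration, not the method of [19] itself: sentences are compared with a plain bag-of-words cosine similarity, centrality is reduced to the degree of a sentence in the thresholded similarity graph, and the threshold value and the toy sentences are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def rank_sentences(sentences, threshold=0.3):
    """Rank sentences by degree in the sentence similarity graph.

    Two sentences are linked when their cosine similarity reaches the
    threshold; ties are broken by the original sentence order.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(bags[i], bags[j]) >= threshold:
                degree[i] += 1
                degree[j] += 1
    order = sorted(range(n), key=lambda i: (-degree[i], i))
    return order, degree

# Toy document (hypothetical sentences, for illustration only).
toy_sentences = [
    "networks model language structure",
    "complex networks model word co-occurrence",
    "language structure emerges from usage",
    "the weather was pleasant yesterday",
]
order, degree = rank_sentences(toy_sentences)
summary = [toy_sentences[i] for i in order[:1]]
```

An eigenvector-style centrality, such as a PageRank computed on the same similarity graph, could replace the degree count without changing the rest of the sketch.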
6 Conclusion
So far we have seen that there has been a substantial amount of work to understand the structure and dynamics of languages at the mesoscopic level within the framework of complex networks. A parallel thread of research in the field of NLP and IR tries to achieve a different goal, but uses very much the same means. Nevertheless, mesoscopic models of language as well as network-based approaches to NLP are in a nascent state, especially when compared to similar lines of research in the fields of biology, economics and other social sciences (refer to the surveys in this volume). On the other hand, there seems to be a great potential for application of complex network theory to a variety of open problems in linguistics and language engineering. One of the fundamental problems of linguistics is characterization and explanation of linguistic universals, i.e., properties that are common to all human languages. Differences among the languages, on the other hand, are restricted by the typologies and implicational hierarchies [14]. We have seen that, like Zipf’s law, there are many linguistic universals observable in the linguistic networks. For example, the SDNs as well as word collocation networks of all languages exhibit scale-free degree distributions and the small-world property. A systematic investigation of topological universals of linguistic networks can substantially improve our understanding of languages. At the same
time, there are properties for which the linguistic networks vary across languages. For example, the average degrees of the SpellNets are very different for English, when compared to Hindi or Bengali. This difference has been attributed to the different writing systems used by English (which is alphabetic) and the two Indo-Aryan languages (which use abugidas). Typological variations have also been predicted in the topological properties of syllable networks. Thus, it would be interesting to have a typological theory of languages based on the structure of the linguistic networks. Another question of great importance for any linguistic network concerns the emergence of its structural properties. It is far from clear why the word collocation networks should display small-world and scale-free properties. Even though the Dorogovtsev and Mendes model [18] can explain the emergence of the two-regime power law observed in the collocation networks, it does not by itself establish the validity and the physical significance of a model based on preferential attachment. In other words, the phenomenon of preferential attachment at the mesoscopic level needs an independent microscopic explanation in terms of psycholinguistic factors, because words cannot voluntarily link to other words. Similar microscopic explanations are required for the non-trivial topological properties of the other linguistic networks, such as ML, SDN, PhoNet and SpellNet. This is presumably a hard problem, but any mesoscopic explanation is incomplete without a corresponding microscopic model. In the context of NLP and IR applications, network-based models are mostly ad hoc, which reduces their credibility and thereby their popularity compared to the more principled Bayesian approaches. A network-based language model can bridge this gap and provide us with a more systematic way of solving NLP problems within this framework. 
Although there have been some initiatives in this direction [44], this area is largely unexplored and presents numerous challenging problems. Another relatively unexplored, but potentially fecund, area of research is processes “on” linguistic networks. Navigation of the ML can be modeled as guided random walks on the ML network; similarly, typographical errors can be modeled as walks on SpellNet. The exact nature of such guided walks is still to be explored and can provide a strong understanding of underlying cognitive principles. In the previous sections we have seen several ways to define networks where the nodes represent words. One can conceive of a universal word network obtained through superimposition of these partial representations of a linguistic system into a multi-tier network where the nodes are the words and two nodes can be connected by several labeled edges signifying their phonetic, collocational, syntactic, orthographic, semantic and various other kinds of similarities. Studies on such a network can reveal a holistic picture of the interaction patterns between the words, thereby providing a unified model of grammar at different levels of linguistic structure.
References
1. A. E. Motter, A. P. S. de Moura, Y. C. Lai, and P. Dasgupta. Topology of the conceptual network of language. Physical Review E, 65(065102):1–4, 2002.
2. A. Agarwal, S. Chakrabarti, and S. Aggarwal. Learning to rank networked entities. In Proceedings of KDD, 2006.
3. A. Akmajian. Linguistics: An Introduction to Language and Communication. MIT Press, Cambridge, MA, 1995.
4. A. Albright and B. Hayes. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90:119–161, 2003.
5. A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
6. C. Biemann. Chinese whispers - an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of TextGraphs: the Second Workshop on Graph Based Methods for Natural Language Processing, pages 73–80, New York, NY, June 2006. Association for Computational Linguistics.
7. C. Biemann. Unsupervised part-of-speech tagging employing efficient graph clustering. In Proceedings of the COLING/ACL 2006 Student Research Workshop, pages 7–12, Sydney, Australia, July 2006. Association for Computational Linguistics.
8. C. Biemann, I. Matveeva, R. Mihalcea, and D. Radev, editors. Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing. Association for Computational Linguistics, Rochester, NY, 2007.
9. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. CNIS, 30(1–7):107–117, 1998.
10. N. Chomsky. The Minimalist Program. MIT Press, Cambridge, MA, 1995.
11. M. Choudhury, M. Thomas, A. Mukherjee, A. Basu, and N. Ganguly. How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach. In Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing, pages 81–88, Rochester, NY, 2007. Association for Computational Linguistics.
12. A. Clark. Inducing syntactic categories by context distribution clustering. In C. Cardie, W. Daelemans, C. Nédellec, and E. T. K. Sang, editors, Proceedings of the Fourth Conference on Computational Natural Language Learning and of the Second Learning Language in Logic Workshop, Lisbon, 2000, pages 91–94. Association for Computational Linguistics, Somerset, NJ, 2000.
13. A. M. Collins and M. R. Quillian. Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8:240–247, 1969.
14. W. Croft. Typology and Universals. Cambridge University Press, Cambridge, 1990.
15. B. de Boer. Self-organisation in vowel systems. Journal of Phonetics, 28(4):441–465, 2000.
16. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pages 89–98, 2003.
17. W. B. Dolan, L. Vanderwende, and S. Richardson. Automatically deriving structured knowledge base from on-line dictionaries. In Proceedings of the Pacific Association for Computational Linguistics, 1993.
18. S. N. Dorogovtsev and J. F. F. Mendes. Language as an evolving word Web. Proceedings of the Royal Society of London B, 268(1485):2603–2606, December 22, 2001.
19. G. Erkan and D. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. JAIR, 22:457–479, December 4, 2004.
20. C. Fellbaum. WordNet, an Electronic Lexical Database for English. MIT Press, Cambridge, MA, 1998.
21. R. Ferrer-i-Cancho. The structure of syntactic dependency networks: insights from recent advances in network theory. In G. Altmann, V. Levickij, and V. Perebyinis, editors, The Problems of Quantitative Linguistics, pages 60–75. Ruta, Chernivtsi, 2005.
22. R. Ferrer-i-Cancho. Why do syntactic links not cross? Europhysics Letters, 76:1228–1235, 2006.
23. R. Ferrer-i-Cancho, A. Capocci, and G. Caldarelli. Spectral methods cluster words of the same class in a syntactic dependency network. International Journal of Bifurcation and Chaos, 17(7):2453–2463, 2007.
24. R. Ferrer-i-Cancho and R. V. Solé. The small world of human language. Proceedings of The Royal Society of London. Series B, Biological Sciences, 268(1482):2261–2265, November 2001.
25. R. Ferrer-i-Cancho and R. V. Solé. Two regimes in the frequency of words and the origin of complex lexicons: Zipf’s law revisited. Journal of Quantitative Linguistics, 8:165–173, 2001.
26. R. Ferrer-i-Cancho and R. V. Solé. Patterns in syntactic dependency networks. Physical Review E, 69(051915), 2004.
27. S. Finch and N. Chater. Bootstrapping syntactic categories using statistical methods. In Background and Experiments in Machine Learning of Natural Language: Proceedings of the 1st SHOE Workshop, pages 229–235. Katholieke Universiteit, Brabant, Holland, 1992.
28. D. Freitag. Toward unsupervised whole-corpus tagging. In COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics, page 357, Morristown, NJ, 2004. Association for Computational Linguistics.
29. M. Galley and K. McKeown. Improving word sense disambiguation in lexical chaining. In Proceedings of IJCAI, 2003.
30. M. Gamon. Graph-based text representation for novelty detection. In Proceedings of the Workshop on TextGraphs at HLT-NAACL, pages 17–24, 2006.
31. S. Gauch and R. Futrelle. Experiments in Automatic Word Class and Word Sense Identification for Information Retrieval. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 425–434, Las Vegas, NV, April 1994.
32. M. Gell-Mann. Language and complexity. In J. W. Minett and W. S.-Y. Wang, editors, Language Acquisition, Change and Emergence: Essays in Evolutionary Linguistics. City University of Hong Kong Press, July 2005.
33. D. Gibson, J. M. Kleinberg, and P. Raghavan. Inferring Web communities from link topology. In Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia, pages 225–234, 1998.
34. A. B. Goldberg and J. Zhu. Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization. In HLT-NAACL 2006 Workshop on Textgraphs: Graph-based Algorithms for Natural Language Processing, 2006.
35. J. H. Greenberg and J. J. Jenkins. Studies in the psychological correlates of the sound system of American English. Word, 20:157–177, 1964.
36. T. M. Gruenenfelder and D. B. Pisoni. Modeling the mental lexicon as a complex system: Some preliminary results using graph theoretic measures. In Research on Spoken Language Processing Progress Report No. 27, Bloomington, Indiana University, 27–47, 2005.
37. Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In Proceedings of VLDB, pages 576–587, 2004.
38. A. D. Haghighi, A. Y. Ng, and C. D. Manning. Robust textual inference via graph matching. In HLT ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 387–394, Morristown, NJ, 2005. Association for Computational Linguistics.
39. Z. S. Harris. Mathematical Structures of Language. Wiley, New York, 1968.
40. M. D. Hauser, N. Chomsky, and W. T. Fitch. The faculty of language: What is it, who has it, and how did it evolve? Science, 298:1569–1579, 2002.
41. R. Ferrer-i-Cancho, A. Mehler, O. Pustylnikov, and A. Díaz-Guilera. Correlations in the organization of large-scale syntactic dependency networks. In TextGraphs-2: Graph-Based Algorithms for Natural Language Processing, pages 65–72. Association for Computational Linguistics, 2007.
42. Y. Itoh and S. Ueda. The Ising model for changes in word ordering rules in natural languages. Physica D: Nonlinear Phenomena, 198(3-4):333–339, 2004.
43. J. Jannink and G. Wiederhold. Thesaurus entry extraction from an on-line dictionary. In Proceedings of Fusion, 1999.
44. B. Jedynak and D. Karakos. Unigram language models using diffusion smoothing over graphs. In Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing, pages 33–36, Rochester, NY, 2007. Association for Computational Linguistics.
45. V. Kapatsinski. Sound similarity relations in the mental lexicon: Modeling the lexicon as a complex network. Speech Research Lab Progress Report, Indiana University, Bloomington, IN, 2006.
46. V. Kapustin and A. Jamsen. Vertex degree distribution for the graph of word co-occurrences in Russian. In Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing, pages 89–92, Rochester, NY, 2007. Association for Computational Linguistics.
47. J. Ke, M. Ogura, and W. S.-Y. Wang. Optimization models of sound systems using genetic algorithms. Computational Linguistics, 29(1):1–18, 2003.
48. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of ACM, 46, 1999.
49. R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. Structure and evolution of blogspace. Communications of the ACM, 47(12):35–39, 2004.
50. M. Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of SIGDOC, 1986.
51. J. Liljencrants and B. Lindblom. Numerical simulation of vowel quality systems: the role of perceptual contrast. Language, 48:839–862, 1972.
52. H. Liljenstrom. Micro Meso Macro: Addressing Complex Systems Couplings. World Scientific Publishing, Singapore, 2005.
53. P. A. Luce and D. B. Pisoni. Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19:1–36, 1998.
54. I. Maddieson. Patterns of Sounds. Cambridge University Press, Cambridge, 1984.
Linguistic Networks
165
55. W. Marslen-Wilson. Activation, competition, and frequency in lexical access. In: G. T. M. Altmann (ed.), Cognitive Models of Speech Processing: Psycholinguistic and Computational Perspectives, MIT Press, Cambridge, MA, pages 148–173, 1990. 56. R. McDonald, F. Pereira, K. Ribarov, and J. Hajiˇc. Non-projective dependency parsing using spanning tree algorithms. In HLT ’05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Morristown, NJ, 2005. Association for Computational Linguistics. 57. R. Mihalcea. Graph-based ranking algorithms for large vocabulary word sense disambiguation. In Proceedings of HTL-EMNLP, 2005. 58. R. Mihalcea and D. Radev. Graph-based algorithms for information retrieval and natural language processing. Tutorial at HLT/NAACL 2006, 2006. 59. R. Mihalcea and D. Radev, editors. Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing. Association for Computational Linguistics, 2006. 60. R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP, 2004. 61. R. Mihalcea, P. Tarau, and E. Figa. PageRank on semantic networks with applications to word sense disambiguation. In Proceedings of COLING, 2004. 62. G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991. 63. G. A. Miller and P. M. Gildea. How children learn words. Scientific American, 257(3):86–91, 1987. 64. A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. Modeling the cooccurrence principles of the consonant inventories: A complex network approach. International Journal of Modern Physics C, 18(2):281–295, 2007. 65. A. Mukherjee, M. Choudhury, A. Basu, and N. Ganguly. Self-organization of sound inventories: Analysis and synthesis of the occurrence and co-occurrence networks of consonants. Journal of Quantitative Linguistics, http://arXiv.org/ physics/0610120. 66. J. 
Nath, M. Choudhury, A. Mukherjee, C. Biemann, and N. Ganguly. Unsupervised parts-of-speech induction for Bengali. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC), 2008. 67. D. Nettle. Using social impact theory to simulate language change. Lingua, 108: 95–117, 1999. 68. H. G. Nusbaum, D. B. Pisoni, and C. K. Davis. Sizing up the Hoosier mental lexicon: Measuring the familiarity of 20,000 words, Indiana University. Research on Speech Perception Progress Report No. 10, pages 357–376, 1984. 69. B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 271–278, Barcelona, Spain, July 2004. 70. S. Pinker. The Language Instinct: How the Mind Creates Language. HarperCollins, New York, 1994. 71. S. Pinker and A. Price. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28:195–247, 1988. 72. R. Rapp. A practical solution to the problem of automatic part-of-speech induction from text. In Conference Companion Volume of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, MI, 2005.
166
M. Choudhury and A. Mukherjee
73. M. Richardson, A. Prakash, and E. Brill. Beyond PageRank: Machine learning for static ranking. In Proceedings of WWW, pages 707–715, 2006. 74. H. Sch¨ utze. Part-of-speech induction from scratch. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, pages 251–258, Morristown, NJ, 1993. Association for Computational Linguistics. 75. H. Sch¨ utze. Distributional part-of-speech tagging. In Proceedings of the 7th Conference on European Chapter of the Association for Computational Linguistics, pages 141–148, San Francisco, CA, 1995. Morgan Kaufmann Publishers Inc. 76. J.-L. Schwartz, L.-J. Bo¨e, N. Vall´ee, and C. Abry. The dispersion-focalization theory of vowel systems. Journal of Phonetics, 25:255–286, 1997. 77. M. Sigman and G. A. Cecchi. Global organization of the wordnet lexicon. Proceedings of the National Academy of Science, 99(3):1742–1747, 2002. 78. M. M. Soares, G. Corso, and L. S. Lucena. The network of syllables in Portuguese. Physica A: Statistical Mechanics and its Applications, 355(2-4): 678–684, 2005. 79. Z. Solan, D. Horn, E. Ruppin, and S. Edelman. Unsupervised learning of natural languages. Proceedings of National Academy of Sciences, 102(33):11629–11634, 2005. 80. L. Steels. Language as a complex adaptive system. In Proceedings of PPSN VI, pages 17–26, 2000. 81. D. Steriade. Knowledge of similarity and narrow lexical override. BLS, 29: 583–598, 2004. 82. M. Steyvers and J. B. Tenenbaum. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1): 41–78, 2005. 83. M. Tamariz. Exploring the Adaptive Structure of the Mental Lexicon. Ph.D. thesis, Department of Theoretical and Applied Linguistics, Univerisity of Edinburgh, Scotland, 2005. 84. K. Toutanova, C. D. Manning, and A. Y. Ng. Learning random walk models for inducing word dependency distributions. 
In ICML ’04: Proceedings of the TwentyFirst International Conference on Machine Learning, page 103, New York, NY, 2004. 85. J. V´eronis. HyperLex: Lexical cartography for information retrieval. Computer Speech and Language, 18(3):223–252, 2004. 86. M. S. Vitevitch. Phonological neighbors in a small world (network): What can graph theory tell us about the mental lexicon? Departmental Colloquy co-sponsored by the Linguistics and Psychology Departments, Rice University, January 27, 2006. 87. D. Widdows and B. Dorow. A graph model for unsupervised lexical acquisition. In Proceedings of COLING, 2002.
Networks Generated from Natural Language Text

Chris Biemann and Uwe Quasthoff

Institute for Computer Science, NLP Department, University of Leipzig, Johannisgasse 26, 04103 Leipzig, Germany; [email protected], [email protected]

N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_10, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

1 Introduction

The study of large-scale characteristics of graphs that arise in natural language processing is an essential step in finding structural regularities. Structure discovery processes have to be designed with an awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.

In this chapter, we examine power-law distributions and small world graphs (SWGs) originating from natural language data. There are several reasons for the special interest in these structures.

1. Power laws appear in many rank-frequency statistics. Furthermore, we can construct graphs with words as nodes and use various rules to introduce edges between words. In many cases, this results in SWGs, which in turn often have a power-law distribution for their node degrees.
2. SWGs appear in many other kinds of real-world data, such as social networks of many kinds, the link structure of the World Wide Web, and traffic networks. It is interesting to analyze all these networks in more detail to identify similarities and differences.
3. From an application-driven view, SWGs allow effective clustering strategies in nearly linear time. Because these clusters are often related to the growth process of the underlying graph, they are often meaningful. In the case of natural language, these clusters usually reflect semantic and/or syntactic structures.

After discussing several data sources that exhibit power-law distributions with respect to rank frequency in Section 2, graphs with small world properties in language data are discussed in Section 3. We shall see that these characteristics are omnipresent in language data, and we should be aware of them when designing structure discovery processes. For example, the knowledge that a few hundred words make up the bulk of words in a text allows one to use only these words as contextual features with only a minor loss in text coverage. Knowing that word co-occurrence networks possess the scale-free small world property has implications for clustering these networks.

An interesting aspect is whether these characteristics are only inherent to real natural language data or whether they can be produced with generators of linear sequences in a much simpler way than our intuition about language complexity would suggest. In other words, we shall see how distinctive these characteristics are with respect to tests deciding whether a given sequence is natural language or not.
2 Power Laws in Rank-Frequency Distribution

G. K. Zipf [31, 32] described the following phenomenon: if all words in a corpus of natural language are arranged in decreasing order of frequency, then the relation between a word's frequency and its rank in the list follows a power law. Since then, a significant amount of research has been devoted to the question of how this property emerges and what kinds of processes generate such Zipfian distributions. In this section, several language-related datasets are presented that exhibit a power law in their rank-frequency distribution, starting with basic units of language.

2.1 Word Frequency

The relation between the frequency of a word and its rank r is given by $f(r) \sim r^{-z}$, where z is the exponent of the power law, which corresponds to the (negative) slope of the curve in a log-log plot. Zipf assumed the exponent z to be exactly 1. In natural language data, slightly differing exponents in the range of about 0.7 to 1.2 are also observed [30]. B. Mandelbrot [21] provided a formula that more closely approximates the frequency distributions in language data after noticing that Zipf's law holds only for the medium range of ranks, whereas the curve is flatter for very frequent words and steeper for high ranks. Figure 1 displays the word rank-frequency distributions of corpora of different languages taken from the Leipzig Corpora Collection.¹ There exist several exhaustive collections of research capitalising on Zipf's law and related distributions,² ranging over a wide area of datasets; here, only findings related to natural language will be reported.

¹ LCC, see http://www.corpora.uni-leipzig.de [July 7th, 2007].
² E.g. http://www.nslij-genetics.org/wli/zipf/index.html [April 1, 2007].

A related distribution is the lexical spectrum [16], which gives the probability of choosing a word from the vocabulary with a given frequency. For natural language, the lexical spectrum follows a power law with slope $\gamma = \frac{1}{z} + 1$, where z is the exponent
[Figure 1: log-log plot titled "Zipf's law for various corpora"; x-axis: rank, y-axis: frequency; curves: German 1M, English 300K, Italian 300K, Finnish 100K, power law gamma=1, power law gamma=0.8.]
Fig. 1. Zipf's law for various corpora. The numbers next to the language give the corpus size in sentences. Enlarging the corpus does not affect the slope of the curve, but merely moves it upwards in the plot. Most lines are almost parallel to the ideal power-law curve with z = 1. Finnish exhibits a lower slope of γ ≈ 0.8, consistent with its higher morphological productivity.
of the Zipfian rank-frequency distribution. For the relation between lexical spectrum, Zipf's law and Pareto's law, see [1].

Zipf's law in its original form is, however, just the tip of the iceberg of power-law distributions in a quantitative description of language. While a Zipfian distribution for word frequencies can be obtained by a simple model that generates letter sequences with space characters as word boundaries [21, 22], such models based on "intermittent silence" can neither reproduce the distribution of sentence lengths [26] nor explain the relations between words in sequence. Next, further power-law distributions in natural language are discussed and exemplified.

2.2 Letter N-Grams

To continue with a counterexample, letter frequencies do not obey a power law in the rank-frequency distribution. This also holds for letter N-grams (including the space character), yet for higher N, the rank-frequency plots show a large power-law regime with exponential tails for high ranks. Figure 2 shows the rank-frequency plots for letter N-grams up to N = 6 for the first 10,000 sentences of the British National Corpus (BNC,³ [10]).

³ http://www.natcorp.ox.ac.uk/ [April 1, 2007]

Still, letter frequency distributions can be used to show that letters do not combine into bigrams independently of one another; rather, there are restrictions on their combination. While this intuitively seems obvious for
[Figure 2: log-log plot titled "rank-frequency letter N-gram"; x-axis: rank, y-axis: frequency; curves: letter 1-gram to letter 6-gram, power law gamma=0.55.]

Fig. 2. Rank-frequency distributions for letter N-grams for the first 10,000 sentences in the BNC. Letter N-gram rank-frequency distributions do not exhibit power laws on the full scale, but increasing N results in a larger power-law regime for low ranks.
letter combination, the following test is proposed for quantitatively examining the effects of these restrictions: from the letter unigram probabilities, a text is generated by randomly and independently drawing letters according to their distribution and concatenating them. The letter bigram frequency distribution of this generated text can then be compared to the letter bigram frequency distribution of the real text from which the unigram distribution was measured. Figure 3 shows the generated and the real rank-frequency plots, again from the small BNC sample. The two curves clearly differ. The bigrams generated without restrictions predict a higher number of different bigrams and lower frequencies for bigrams of high ranks as compared to the real text bigram statistics. This shows that letter combination restrictions do exist, as not all bigrams predicted by the generation process were observed, resulting in higher counts for valid bigrams in the sample.

2.3 Word N-Grams

For word N-grams, the relation between rank and frequency follows a power law, just as for words (unigrams). Figure 4 (left) shows the rank-frequency plots up to N = 4, based on the first 1 million sentences of the BNC. As more different word combinations are possible with increasing N,
[Figure 3: log-log plot titled "letter bigram: generated and real"; x-axis: rank, y-axis: frequency; curves: letter 2-grams generated by letter 1-gram distribution, letter 2-gram real.]

Fig. 3. Rank-frequency plots for letter bigrams, for a text generated from letter unigram probabilities and for the BNC sample.

[Figure 4: two log-log plots; x-axis: rank, y-axis: frequency. Left panel "word N gram rank-frequency": word 1-gram, word 2-gram, word 3-gram, word 4-gram. Right panel "word bigram: generated and real": word 1-gram-generated word 2-grams, word 2-grams.]

Fig. 4. Left: Rank-frequency distributions for word N-grams for the first one million sentences in the BNC. Word N-gram rank-frequency distributions exhibit power laws. Right: Rank-frequency plots for word bigrams, for a text generated from word unigram probabilities and for the BNC sample.
the curves become flatter as the same total frequency is shared amongst more units, as previously observed (e.g. [27, 18]). When concatenation restrictions are tested quantitatively in the same way as for letters, it may at first seem surprising that the curve for a text generated with word unigram frequencies differs only very little from the word bigram curve, as Fig. 4 (right) shows. Small differences are only observable for low ranks: more top-rank generated bigrams reflect
that words are usually not repeated in the text. More low-ranked and fewer high-ranked real bigrams indicate that word concatenation does not take place entirely without restrictions, yet is subject to much more variety than letter concatenation. This coincides with the intuition that, for a given word pair, it is almost always possible to form a correct English sentence in which these words are neighbours. Regarding quantitative (as opposed to syntactic or semantic) aspects, the frequency distribution of word bigrams can be produced by a generation process based on word unigram probabilities.

2.4 Sentence Frequency

Larger corpora that are compiled from a variety of sources contain a considerable amount of duplicate sentences. In the full BNC, which serves as the data basis in this case, 7.3% of the sentences occur two or more times. The most frequent sentences are "Yeah.", "Mm.", "Yes." and "No.", which are mostly found in the section of spoken language. But longer expressions like "Our next bulletin is at 10.30 p.m." also have a count of over 250. The sentence frequencies also follow a power law with an exponent close to 1 (see Fig. 5), indicating that Zipf's law also holds for sentence frequencies.

2.5 Other Power Laws in Language Data

The preceding results strongly suggest that when counting document frequencies in large collections such as the World Wide Web, another power-law
[Figure 5: log-log plot titled "rank-frequency for sentences in the BNC"; x-axis: rank, y-axis: frequency; curves: sentences, power law gamma=0.9.]

Fig. 5. Rank-frequency plot for sentence frequencies in the full BNC, following a power law with γ ≈ 0.9, but with a high fraction of sentences occurring only once.

[Figure 6: log-log plot titled "rank-frequency for search queries"; x-axis: rank, y-axis: frequency; curves: search queries, power law gamma=0.75.]

Fig. 6. Rank-frequency plot for AltaVista search queries, following a power law with γ ≈ 0.75.
distribution would be found, but such an analysis has not been carried out and would require access to the index of a web search engine. Further, there are more power laws in language-related areas; some are mentioned here briefly to illustrate their omnipresence.

• Web page requests follow a power law, which was employed for a caching mechanism in [17].
• Related to this, frequencies of web search queries during a fixed time span also follow a power law, as exemplified in Fig. 6 for a log of 7 million AltaVista⁴ queries as used by Lempel [19].
• The number of authors of Wikipedia⁵ articles was found to follow a power law with γ ≈ 2.7 for a large regime in [29]. The same paper further discusses various other power-law relationships.
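The rank-frequency analyses of this section can be sketched in a few lines of Python. The following is a minimal illustration, not the procedure used to produce the figures: the function names are hypothetical, and the exponent is estimated by a plain least-squares fit in log-log space, which is an assumption (the text does not state how the reference curves were fitted). The last function mimics the generation test of Sections 2.2 and 2.3 by drawing units independently from their unigram distribution.

```python
import random
from collections import Counter
from math import log


def ngrams(sequence, n):
    """Contiguous n-grams of a string or token list, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def rank_frequency(units):
    """Frequencies in decreasing order; position i holds the frequency at rank i + 1."""
    return sorted(Counter(units).values(), reverse=True)


def fit_exponent(frequencies):
    """Estimate z in f(r) ~ r^-z as the negated least-squares slope of log f over log r.

    Assumption: a simple regression over all ranks; this is a crude estimator
    and needs at least two frequencies.
    """
    xs = [log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


def generate_from_unigrams(sequence, length, seed=0):
    """Draw `length` units independently, weighted by their unigram frequencies."""
    counts = Counter(sequence)
    units = sorted(counts)
    weights = [counts[unit] for unit in units]
    return random.Random(seed).choices(units, weights=weights, k=length)
```

For a text held in a variable `text` (a string for letters, a token list for words), comparing `rank_frequency(ngrams(text, 2))` with `rank_frequency(ngrams(generate_from_unigrams(text, len(text)), 2))` mirrors the generated-versus-real comparison of Fig. 3 and Fig. 4 (right).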
⁴ http://www.altavista.com
⁵ http://www.wikipedia.org

3 Scale-Free Small Worlds in Language Data

The previous section discussed the shape of rank-frequency distributions for natural language units. Now the properties of graphs with units represented as vertices and relations between them as edges will be the focus of interest. Internal as well as contextual features can be employed for computing similarities between language units that are represented as (possibly weighted) edges
in the graph. Some of the graphs discussed here can be classified as scale-free SWGs; others have different characteristics and represent other, but related, graph classes.

3.1 Word Co-Occurrence Graph

The notion of word co-occurrence is used to model dependencies between words. If two words X and Y occur together in some contextual unit of information (as neighbours, in a word window of 5, in a clause, in a sentence, in a paragraph), they are said to co-occur. When regarding words as vertices and edge weights as the number of times two words co-occur, the word co-occurrence graph of a corpus is given by the entirety of all word co-occurrences. In the following, two specific types of co-occurrence graphs are considered: the graph induced by neighbouring words, henceforth called the neighbour-based graph, and the graph induced by sentence-based co-occurrence, henceforth called the sentence-based graph. The neighbour-based graph can be undirected, or directed with edges going from the left to the right word as found in the corpus; the sentence-based graph is undirected.

To find out whether the co-occurrence of two specific words A and B is merely due to chance or exhibits a statistical dependency, measures are used that compute to what extent the co-occurrence of A and B is statistically significant. Many significance measures can be found in the literature; for extensive overviews consult e.g. [9] or [14]. In general, the measures compare the probability for A and B to co-occur under the assumption of their statistical independence with the actual probability of their joint co-occurrence in the corpus. In this work, the log likelihood ratio [13] is used to sort the wheat from the chaff.
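As a rough sketch of how a significant sentence-based co-occurrence graph might be computed, the following Python code counts sentence-level co-occurrences and scores each pair with the log likelihood ratio in the expanded form quoted from [9]. The function names and the input format (a list of tokenised sentences) are assumptions made for illustration; `t` is a minimum co-occurrence count and `s` a minimum significance value, with defaults taken from the thresholds reported for the graphs in this section.

```python
from collections import Counter
from itertools import combinations
from math import log


def xlogx(x):
    """x * log(x), with the usual convention 0 * log(0) = 0."""
    return x * log(x) if x > 0 else 0.0


def log_likelihood_ratio(n, n_a, n_b, n_ab):
    """-2 log lambda for n contexts, word counts n_a and n_b, joint count n_ab."""
    return 2.0 * (
        xlogx(n) - xlogx(n_a) - xlogx(n_b) + xlogx(n_ab)
        + xlogx(n - n_a - n_b + n_ab)
        + xlogx(n_a - n_ab) + xlogx(n_b - n_ab)
        - xlogx(n - n_a) - xlogx(n - n_b)
    )


def significant_sentence_graph(sentences, t=2, s=6.63):
    """Return {(A, B): significance} for word pairs passing both thresholds.

    Assumption: every sentence is one context and a word counts once per sentence.
    """
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # sorted, so each pair has one canonical order
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))
    n = len(sentences)
    edges = {}
    for (a, b), n_ab in pair_freq.items():
        if n_ab < t:  # frequency threshold on the joint count
            continue
        significance = log_likelihood_ratio(n, word_freq[a], word_freq[b], n_ab)
        if significance >= s:  # significance threshold
            edges[(a, b)] = significance
    return edges
```

Note that the statistic is two-sided: it also grows when two words co-occur less often than expected under independence, so a fuller version might additionally require the joint count to exceed its expected value.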
The log likelihood ratio is given in expanded form in [9]:

$$
-2 \log \lambda = 2 \left[
\begin{aligned}
& n \log n - n_A \log n_A - n_B \log n_B + n_{AB} \log n_{AB} \\
& + (n - n_A - n_B + n_{AB}) \log (n - n_A - n_B + n_{AB}) \\
& + (n_A - n_{AB}) \log (n_A - n_{AB}) + (n_B - n_{AB}) \log (n_B - n_{AB}) \\
& - (n - n_A) \log (n - n_A) - (n - n_B) \log (n - n_B)
\end{aligned}
\right],
$$

where $n$ is the total number of contexts, $n_A$ the frequency of A, $n_B$ the frequency of B and $n_{AB}$ the number of co-occurrences of A and B. As pointed out by Moore [23], this formula overestimates the co-occurrence significance for small $n_{AB}$. For this reason, a frequency threshold t on $n_{AB}$ (e.g. a minimum of $n_{AB} = 2$) is often applied. Further, a significance threshold s regulates the density of the graph; for the log likelihood ratio, the significance values correspond to the $\chi^2$ tail probabilities [23], which makes it possible to translate the significance value into an error rate for rejecting the independence assumption.⁶

⁶ For example, a log likelihood ratio of 3.84 corresponds to a 5% error in stating that two words do not co-occur by chance; a significance of 6.63 corresponds to a 1% error.

The operation of applying a significance test results in pruning edges
that exist due to random noise and keeping almost exclusively those edges that reflect a true association between their endpoints. Graphs that contain all significant co-occurrences of a corpus, with edge weights set to the significance value between their endpoints, are called significant co-occurrence graphs in the remainder. For convenience, no singletons are allowed in the graph, i.e. if a vertex is not contained in any edge because none of the co-occurrences of the corresponding word is significant, then the vertex is excluded from the graph.

As observed previously [15, 24], word co-occurrence graphs exhibit the scale-free small world property. This is in line with co-occurrence graphs reflecting human associations [25] and human associations in turn forming SWGs [28]. The claim is confirmed here on an exemplary basis with the graph for the Leipzig Corpora Collection's (LCC's) 1-million sentence corpus for German. Figure 7 gives the degree distributions and graph characteristics for various co-occurrence graphs. The shape of the distribution depends on the language, as Fig. 8 shows. Some languages (here English and Italian) have a hump-shaped distribution in the log-log plot where the first regime follows a power law with a lower exponent than the second regime, as observed in [15]. For the Finnish and German corpora examined here, this effect could not be found in the data. This property of two power-law regimes in the degree distribution of word co-occurrence graphs motivated the Dorogovtsev-Mendes (DM)-model, see [12]. There, the
[Figure 7: two log-log plots; x-axis: degree interval, y-axis: fraction of vertices per degree. Panel "de1M neighbour-based graphs degree distribution": de1M nb. t=2 indegree, de1M nb. t=2 outdegree, power law gamma=2, de1M sig. nb. t=10 s=10 indegree, de1M sig. nb. t=10 s=10 outdegree. Panel "de1M sentence-based graphs degree distribution": de1M sb t=10, power law gamma=2, de1M sig. sb t=10 s=10.]

Fig. 7. Graph characteristics for various co-occurrence graphs of LCC's 1-million sentence German corpus. Abbreviations: nb = neighbour-based, sb = sentence-based, sig. = significant, t = co-occurrence frequency threshold, s = co-occurrence significance threshold. While the exact shapes of the distributions are language and corpus dependent, the overall characteristics are valid for all samples of natural language of sufficient size. The slope of the distribution is invariant to changes of thresholds. Characteristic path length and a high clustering coefficient at low average degrees are characteristic for SWGs.

[Figure 8: log-log plot titled "significant sentence-based graphs for various languages"; x-axis: degree interval, y-axis: fraction of vertices per degree; curves: Italian 300K, English 300K and Finnish 100K sig. sentence-based graphs (t=2, s=6.63), power law gamma=2.5, power law gamma=1.5, power law gamma=2.8.]

Fig. 8. Degree distribution of significant sentence-based co-occurrence graphs of similar thresholds for Italian, English and Finnish.

[Figure 9: two log-log plots, each titled "degree distribution with window size 2"; x-axis: degree, y-axis: # of vertices for degree. Left: Icelandic window 2, German window 2, power law gamma=2. Right: Italian window 2, English BNC window 2, power law gamma=1.6, power law gamma=2.6.]

Fig. 9. Degree distributions in word co-occurrence graphs for window size 2. Left: The distribution for German and Icelandic is approximated by a power law with γ = 2. Right: For English (BNC) and Italian, the distribution is approximated by two power-law regimes.
crossover point of the two power-law regimes is motivated by a kernel lexicon of about 5000 words that can be combined with all words of a language.

The original experiments of [15] operated on a word co-occurrence graph with window size 2: an edge is drawn between two words if they appear together at least once at a distance of one or two words in the corpus. Reproducing their experiment with the first 70 million words of the BNC and corpora of German, Icelandic and Italian of similar size reveals that the degree distribution of the English and the Italian graph is in fact approximated by two power-law regimes. In contrast to this, German and Icelandic show a single power-law distribution, just as in the experiments above; see Fig. 9. These results suggest
[Figure 10: two log-log plots; x-axis: degree, y-axis: # of vertices for degree. Panel "degree distribution with distance 1": Italian distance 1, English BNC distance 1, power law gamma=1.8, power law gamma=2.2. Panel "degree distribution with distance 2": Italian distance 2, English BNC distance 2, power law gamma=1.6, power law gamma=2.6.]

Fig. 10. Degree distributions in word co-occurrence graphs for distance 1 and distance 2 for English (BNC) and Italian. The hump-shaped distribution is much more distinctive for distance 2.
that two power-law regimes in word co-occurrence graphs with window size 2 are not a language universal, but only hold for some languages.

To examine the hump-shaped distributions further, Fig. 10 displays the degree distributions for the neighbour-based word co-occurrence graphs and for the word co-occurrence graphs that connect only words appearing at a distance of 2. As becomes clear from the plots, the hump-shaped distribution is mainly caused by words co-occurring at distance 2, whereas the neighbour-based graph shows only a slight deviation from a single power law. Together with the observations from sentence-based co-occurrence graphs of different languages in Fig. 8, it becomes clear that a hump-shaped distribution with two power-law regimes, if present at all, is caused by long-distance relationships between words.

3.1.1 Applications of Word Co-Occurrences

Word co-occurrence statistics are an established standard and have been used in many language processing systems. The authors have used co-occurrences in practical applications like bilingual dictionary acquisition [4, 11], semantic lexicon extension [8] and visualisation of concept trails [7]. The aim of this chapter is to underpin these applications with a theoretical foundation.

3.2 Co-Occurrence Graphs of Higher Order

The significant word co-occurrence graph of a corpus represents words that are likely to appear near each other. When one is interested in words co-occurring with similar other words, it is possible to transform the above-defined (first-order) co-occurrence graph into a second-order co-occurrence graph by drawing an edge between two words A and B if they share a common
[Figure 11: two graph drawings showing the neighbourhoods of the words jazz and rock; vertex labels include album, music, singer, band, saxophonist, trumpeter, pianist, musicians, concert, blues, pop, coal and strata.]

Fig. 11. Neighbourhoods of jazz and rock in the significant sentence-based word co-occurrence graph as displayed on LCC's English corpus website. Both neighbourhoods contain album, music, singer and band, which leads to an edge weight of 4 in the second-order graph.
neighbour in the first-order graph. Whereas the first-order word co-occurrence graph represents the global context per word, the corresponding second-order graph contains relations between words which have similar global contexts. The edge can be weighted according to the number of common neighbours, e.g. by weight = |neigh(A) ∩ neigh(B)|. Figure 11 shows neighbourhoods of the significant sentence-based first-order word co-occurrence graph from LCC’s English web corpus7 for the words jazz and rock. Taking into account only the data depicted, jazz and rock are connected with an edge of weight 4 in the second-order graph, corresponding to their common neighbours album, music, singer and band. The fact that they share an edge in the first-order graph is ignored. In general, a graph of order N + 1 can be obtained from the graph of order N , using the same transformation. The higher-order transformation without thresholding is equivalent to a multiplication of the unweighted adjacency matrix A with itself, then a zeroing of the main diagonal by subtracting the degree matrix of A. Since the average path length of scale-free SWGs is short and local clustering is high, this operation leads to an almost fully connected graph in the limit, which does not allow one to draw conclusions about the initial structure. Thus, the graph is pruned in every iteration N in the following way. For each vertex, only the maxN outgoing edges with the highest weights are taken into account. Notice that this vertex degree threshold maxN does not limit the maximum degree, as thresholding is asymmetric. This operation is equivalent to only keeping the maxN largest entries per row in the adjacency matrix A = (aij ), then At = (sign(aij + aji )), resulting in an undirected graph. To examine quantitative effects of the higher-order transformation, the sentence-based word co-occurrence graph of LCC’s 1-million German sentence corpus (s = 6.63, t = 2) underwent this operation. 
Figure 12 depicts the degree distributions for N = 2 and N = 3 for different maxN.

⁷ http://corpora.informatik.uni-leipzig.de/?dict=en [April 1, 2007]
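The matrix formulation of the higher-order transformation can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes NumPy, a dense 0/1 adjacency matrix, and a hypothetical function name `higher_order`; ties between equal weights are broken by vertex index, which is an arbitrary choice made here. The toy graph contains only the jazz/rock data described for Figure 11.

```python
import numpy as np

def higher_order(adj, max_n=None):
    """One step of the higher-order transformation on an unweighted graph."""
    a = (np.asarray(adj) != 0).astype(int)
    # (A @ A)[i, j] counts common neighbours of i and j; the diagonal of
    # A @ A holds the degrees, so subtracting the degree matrix zeroes it.
    w = a @ a - np.diag(a.sum(axis=1))
    if max_n is None:
        return w
    # Asymmetric pruning: keep the max_n largest positive entries per row.
    kept = np.zeros_like(w)
    for i, row in enumerate(w):
        top = np.argsort(-row, kind="stable")[:max_n]
        top = top[row[top] > 0]
        kept[i, top] = row[top]
    # Symmetrise: an edge survives if at least one endpoint kept it.
    return np.sign(kept + kept.T)

# Toy first-order graph: jazz and rock are linked to each other and to
# album, music, singer and band (indices follow the order of `words`).
words = ["jazz", "rock", "album", "music", "singer", "band"]
first = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1)] + [(h, n) for h in (0, 1) for n in (2, 3, 4, 5)]:
    first[i, j] = first[j, i] = 1

second = higher_order(first)           # unpruned second-order weights
pruned = higher_order(first, max_n=1)  # pruned and symmetrised
print(second[0, 1])                    # weight of the jazz-rock edge
print(pruned[2].sum())                 # degree of "album" after pruning
```

In this toy example the jazz-rock edge receives weight 4, and "album" ends up with degree 3 even though max_n = 1, because other vertices select it as their strongest neighbour. This illustrates why the threshold does not bound the maximum degree.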
[Figure 12, two log-log panels. Left: "German cooc order 2", with series German order 2 full, German order 2 max 10 and German order 2 max 3. Right: "German cooc order 3", with series German order 2 max 3 and power-law reference lines gamma=1 and gamma=4. Axes: degree (horizontal), vertices per degree interval (vertical).]
Fig. 12. Degree distributions of word co-occurrence graphs of higher order. The first-order graph is the sentence-based word co-occurrence graph of LCC’s 1-million German sentence corpus (s = 6.63, t = 2). Left: N = 2 for max2 = 3, max2 = 10 and max2 = ∞. Right: N = 3 for t2 = 3, t3 = ∞, using the second-order graph with max2 = 3.
Applying the maxN threshold causes the degree distribution to change, especially for high degrees. In the third-order graph, two power-law regimes are observable. Studying the degree distribution of higher-order word co-occurrence graphs revealed that the characteristic of being governed by power laws is invariant to the higher-order transformation, yet the power-law exponent changes. This indicates that the power-law characteristic is inherent at many levels in natural language data.

To examine what this transformation yields on the graphs generated by other random graph models, Figure 13 shows the degree distribution of second-order and third-order graphs as generated by the graph generation models of [3] (Barabási-Albert (BA)-model), [28] (Steyvers-Tenenbaum (ST)-model) and [12] (DM-model). The underlying first-order graphs are the undirected graphs of order 10,000 and size 50,000 (k=10) from these three models. While the thorough interpretation of second-order graphs of random graphs might be subject to further studies, the following should be noted: the higher-order transformation reduces the power-law exponent of the BA-model graph from γ = 3 to γ = 2 in the second order and to γ ≈ 0.7 in the third order. For the ST-model, the degree distribution of the full second-order graph shows a maximum around 2M, then decays with a power law with exponent γ ≈ 2.7. In the third-order ST-graph, the maximum moves to around 4M^2 for sufficient max2. The DM-model second-order graph shows, like the first-order DM-model graph, two power-law regimes in the full version, and a power law with γ ≈ 2 for the pruned versions. The third-order degree distribution exhibits many more vertices with high degrees than predicted by a power law.
[Figure 13, six log-log panels: BA order 2, BA order 3, ST order 2, ST order 3, DM order 2, DM order 3. The order-2 panels show the full, max 10 and max 3 graphs; the order-3 panels show the series labelled order 2 max 10 and order 2 max 3. Power-law reference lines: gamma=2 (BA order 2), gamma=0.7 (BA order 3), gamma=2.5 (ST order 2), and gamma=2, gamma=1 and gamma=4 (DM order 2). Axes: degree (horizontal), vertices per degree interval (vertical).]
Fig. 13. Second- and third-order graph degree distributions for BA-model, ST-model and DM-model graphs.
In summary, all random graph models differ clearly from word co-occurrence networks with respect to the higher-order transformation. The ST-model shows maxima depending on the average degree of the first-order graph. The BA-model’s power-law exponent decreases with higher orders, but the model is able to explain a degree distribution with power-law exponent 2. The full DM model exhibits the same two power-law regimes in the second order as observed for German sentence-based word co-occurrences in the third order.

3.2.1 Applications of Co-Occurrence Graphs of Higher Orders

In [6] and [20], the utility of word co-occurrence graphs of higher orders is examined for lexical semantic acquisition. The highest potential for extracting paradigmatic semantic relations can be attributed to second- and third-order word co-occurrences. In [9] second-order graphs are evaluated against lexical semantic resources.

3.3 Sentence Similarity

Using words as internal features, the similarity of two sentences can be measured by the number of common words they share. Since the few top-frequency words are contained in most sentences as a consequence of Zipf’s law, their influence should be downweighted or they should be excluded to arrive at a useful measure for sentence similarity. Here, the sentence similarity graph of sentences sharing at least two common words is examined, with the maximum frequency of these words bounded by 100. This maximum frequency threshold was arbitrarily chosen and could be replaced by a weighting scheme that attributes more weight to less frequent words. However, a hard threshold reduces the computational cost significantly. The corpus examined here is LCC’s 3-million-sentence German corpus. Figure 14 shows the component size distribution for this sentence similarity graph; Figure 15 shows the degree distributions for the entire graph and for its largest component.
The degree distribution of the entire graph follows a power law with γ close to 1 for low degrees and decays faster for high degrees; the largest component’s degree distribution plot is flatter for low degrees. This can be attributed to limited sentence length: as sentences are not arbitrarily long, they cannot be similar to an arbitrarily high number of other sentences with respect to the measure discussed here, as the number of sentences per feature word is bounded by the word frequency limit. However, the extremely high values for transitivity and clustering coefficient, the low γ values of the degree distribution for low-degree vertices and the comparably long average shortest path lengths indicate that the sentence similarity graph belongs to a different graph class than all other graphs discussed above.
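The construction of the sentence similarity graph described in Section 3.3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure actually used on the LCC corpus: tokenisation by lowercasing and whitespace splitting, the function name `sentence_similarity_edges` and its parameter names are all hypothetical. The hard frequency threshold keeps the inverted index small, which is where the reduction in computational cost comes from.

```python
from collections import Counter, defaultdict
from itertools import combinations

def sentence_similarity_edges(sentences, min_common=2, max_freq=100):
    """Map sentence-index pairs (i, j), i < j, to their number of shared
    words, counting only words with corpus frequency <= max_freq and
    keeping only pairs that share at least min_common such words."""
    tokens = [s.lower().split() for s in sentences]  # assumed tokenisation
    freq = Counter(w for toks in tokens for w in toks)
    # Inverted index from each sufficiently rare word to its sentences.
    index = defaultdict(list)
    for sid, toks in enumerate(tokens):
        for w in set(toks):
            if freq[w] <= max_freq:
                index[w].append(sid)
    # Every rare word contributes 1 to each pair of sentences containing it.
    shared = Counter()
    for sids in index.values():
        for pair in combinations(sids, 2):
            shared[pair] += 1
    return {pair: n for pair, n in shared.items() if n >= min_common}

corpus = ["the jazz band played", "the rock band played", "the cat sat"]
edges = sentence_similarity_edges(corpus, min_common=2, max_freq=2)
print(edges)
```

With the toy threshold max_freq = 2, the word "the" (frequency 3) is ignored, so only the first two sentences are connected, through "band" and "played".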
[Figure 14, log-log plot: sentence similarity graph component distribution. Series: sentence similarity components, power-law gamma=2.7. Axes: component size (horizontal), # of vertices (vertical).]
Fig. 14. Component size distribution for the sentence similarity graph of LCC’s 3-million sentence German corpus. The component size distribution follows a power law with γ ≈ 2.7 for small components; the largest component comprises 211,447 out of 416,922 total vertices. The component size distribution complies with the theoretical results of [2].

[Figure 15 plot residue: sentence similarity graph component distribution; series "sentence similarity de3M sentences, >1 common, freq"]

The Big Friendly Giant
Y. Berchenko et al.

¹ More accurately: ∀ε > 0, Pr(C > ε) → 0 as N → ∞.
² In Section 2.2, when C > 0, we will need a similar observation; namely, that the probability of having a cycle of length four that is not composed of two triangles scales as $N^{-1}$ and hence can also be neglected for large N.

[…] 0) [7]. Analogous to the excess degree, beginning at a randomly chosen node v0 and following one of the edges at that node, we reach a neighbor v1. We are now interested in $\{e_i\}_{i=0}^{\infty}$, the distribution of the outgoing edges of v1 that are not connected to a neighbor of v0. Suppose we travel from node v0 along an edge to node v1 having degree d(v1) = i + 1 (i.e., with an excess degree of i). The probability that it will have k neighbors that are not connected back to v0 (via a triangle) is

$$\binom{i}{k} (1 - C)^{k} C^{i-k}. \tag{8}$$
This is just the probability that, of the i outgoing edges of v1, i − k are connected in a triangular formation that includes v0, while the other k edges are not. Here, as before, C is just the probability of a triangular formation. When d(v1) is not known, from (8) we obtain

$$e_k := \sum_{i=0}^{\infty} q_i \binom{i}{k} (1 - C)^{k} C^{i-k} = \left(\frac{1 - C}{C}\right)^{k} \sum_{i=0}^{\infty} \binom{i}{k} q_i C^{i}. \tag{9}$$
The generating function, $G_c(x)$, for the distribution is

$$G_c(x) := \sum_{k=0}^{\infty} e_k x^{k} = \sum_{k=0}^{\infty} \sum_{i=0}^{\infty} \left(\frac{1 - C}{C}\right)^{k} \binom{i}{k} q_i C^{i} x^{k}. \tag{10}$$
The order of summation may be changed to obtain

$$G_c(x) = \sum_{i=0}^{\infty} q_i C^{i} \sum_{k=0}^{\infty} \binom{i}{k} \left(\frac{1 - C}{C}\, x\right)^{k}. \tag{11}$$
Using the binomial theorem we obtain

$$G_c(x) = \sum_{i=0}^{\infty} q_i C^{i} \left(1 + \frac{1 - C}{C}\, x\right)^{i} = \sum_{i=0}^{\infty} q_i \left(C + (1 - C)x\right)^{i} = G_1(C + (1 - C)x). \tag{12}$$
Thus, we arrive at the key relationship

$$G_c(x) = G_1(C + (1 - C)x). \tag{13}$$
Let us remark that in deriving (8)–(13), it is possible to use any other clustering index, such as c(k), the degree-dependent clustering coefficient used in [28]. However, it might be hard, if not impossible, to obtain a solution with such a simple closed form. As an example of how (13) may be useful, it is possible to determine the mean free-excess degree:

$$\sum_{i} i e_i = \left.\frac{dG_c(x)}{dx}\right|_{x=1} = (1 - C)\, G_1'(1) = (1 - C) z_e. \tag{14}$$
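Relations (9), (13) and (14) can be checked numerically. The sketch below is only an illustration: it assumes a Poisson excess-degree distribution $q_i$ truncated at 60 terms, and the helper names `poisson_pmf`, `free_excess` and `pgf`, the parameter values and the evaluation point x = 0.3 are arbitrary choices made here.

```python
import math

def poisson_pmf(mean, n_max):
    """Poisson probabilities for 0..n_max (truncated tail is negligible here)."""
    return [math.exp(-mean) * mean**i / math.factorial(i) for i in range(n_max + 1)]

def free_excess(q, C):
    """e_k = sum_i q_i * binom(i, k) * (1 - C)^k * C^(i - k), as in (9)."""
    n = len(q)
    return [sum(q[i] * math.comb(i, k) * (1 - C)**k * C**(i - k)
                for i in range(k, n))
            for k in range(n)]

def pgf(dist, x):
    """Probability generating function sum_k dist[k] * x^k."""
    return sum(p * x**k for k, p in enumerate(dist))

C, z_e = 0.2, 1.25                    # illustrative values
q = poisson_pmf(z_e, 60)              # assumed excess-degree distribution
e = free_excess(q, C)

x = 0.3
lhs = pgf(e, x)                       # G_c(x), computed from e_k
rhs = pgf(q, C + (1 - C) * x)         # G_1(C + (1 - C) x), relation (13)
mean_free_excess = sum(k * ek for k, ek in enumerate(e))

print(abs(lhs - rhs))                 # close to zero
print(mean_free_excess, (1 - C) * z_e)  # both sides of (14)
```

The two sides of (13) agree up to rounding error, and the mean of $e_k$ matches $(1 - C) z_e$, as (14) states.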
Similarly, it will prove useful to calculate the mean number of edges emanating outwards from nodes at a distance one to nodes at a distance two, beginning from some arbitrary source node (note that this is not the mean number of nodes at a distance two, due to the fact that there is a positive probability that two edges reach the same node at a distance two). Similarly to (6) and (7), the mean is

$$\left.\frac{dG_0(G_c(x))}{dx}\right|_{x=1} = G_0'(1)\, G_1'(1) \cdot (1 - C) = (1 - C) z_1 z_e. \tag{15}$$
This parameter was also calculated in [23] by a different technique, but as will be discussed shortly, its importance appears to have been overlooked.
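As a worked illustration of (13)–(15), consider the Poisson case, for which z1 = ze [24]. The lines below assume the standard Poisson generating functions $G_0(x) = G_1(x) = e^{z_1 (x - 1)}$; this explicit form is an assumption introduced here for illustration and is not written out in the text.

```latex
\begin{align*}
G_c(x) &= G_1\bigl(C + (1 - C)x\bigr) = e^{z_1 (1 - C)(x - 1)}, \\
\sum_i i\, e_i &= G_c'(1) = (1 - C)\, z_1, \\
\left.\frac{dG_0(G_c(x))}{dx}\right|_{x=1} &= G_0'(1)\, G_c'(1) = (1 - C)\, z_1^{2}.
\end{align*}
```

Under this assumption the free-excess degree is again Poisson distributed, with its mean reduced by the factor (1 − C).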
3 The Critical Point

The interest in random graph theory was initiated by, and is in great debt to, a striking discovery by Erdos and Renyi [11]. They studied the following simple model of a network, referred to as $G_{N,p}$, or simply as the ER random graph: take some number N of nodes and connect each pair with probability p,³ thus defining a probability measure over the ensemble of all such graphs. Erdos and Renyi demonstrated what is considered to be one of the most important properties of the random graph, namely that it possesses a phase transition, from a low-p state ($p(N) < (1 - \varepsilon)/N$) in which all components are small (of size o(N)), to a high-p state ($p(N) > (1 + \varepsilon)/N$) in which an extensive fraction of all nodes (i.e., Θ(N)) are joined together in a single GC. This result has been extended by Molloy and Reed [18, 19] and [1] to graphs with an arbitrary degree distribution, thus making them more applicable for analyzing real-world networks. Here we examine the critical point, where a GC emerges, in the context of clustered networks (Section 3.1). There is yet another interesting point, though less studied, where the graph becomes connected, i.e., there is a path from each of the nodes to any other node. For the ER graph, $G_{N,p}$, this occurs when $p = \ln(N)/N$ [8]. In Section 3.2 we shall briefly discuss this issue for clustered networks.

3.1 The Emergence of the GC

In their seminal paper, Molloy and Reed [18] introduced the parameter $Q := \sum_i i p_i (i - 2)$, which identifies the phase transition in random graphs, i.e., the point where a GC is born. Their procedure utilizes a method for constructing a random graph, which may be viewed as “walking through a graph” (Fig. 1a) and assessing the number of unknown nodes encountered along the way. Suppose one follows a random edge to a node v having degree k. How does this change the number of unknown nodes? First of all, by arriving at v the number of unknown nodes decreases by one.
³ p is usually a function of N, p(N).

[Figure 1, panels a and b: node V0 with neighbouring nodes a1–a4 and further nodes b1–b5, c1 and c2.]

Fig. 1. Graphical illustration of the exposure procedure. Choose a node at random, say V0, and start diffusing from it and counting the nodes encountered on the way. a) When C = 0 and the network is tree-like (see footnote 1), after counting the new nodes (a1–a4) we pick one of them at random, say a1, and count its new neighboring nodes (b1–b3), which are distributed according to $\{q_i\}_{i=0}^{\infty}$. In the next step, we randomly choose one of the nodes (a2–a4, b1–b3) and continue until the entire component is exposed. b) When C > 0, two modifications are required to deal with cycles due to triangles (the dashed edges): we use $\{e_i\}_{i=0}^{\infty}$ and diffuse depthwise. After counting a1–a4, when we count the neighbors of a1 we avoid overcounting a2 because $\{e_i\}_{i=0}^{\infty}$ governs the distribution of the solid-black edges. In the next step, if we go from a1 to b3 in order to count the neighbors of b3, again we avoid overcounting a2 (because it is connected to a1). The depthwise exposure, which is a permissible scheme [18], is used to avoid dependencies.

However, because v itself has degree k, this leads to an increase of (k − 1) in the number of unknown nodes. The net effect is that the number of unknown nodes increases by (k − 2). In order to calculate the expected change, the probability
of arriving at v, which is proportional to the degree k, must also be factored in. This makes the expected increase in the number of unknown neighbors proportional to $Q = \sum_i i p_i (i - 2)$. If Q is positive, then with each step of the walk through the graph the number of unknown nodes, and the size of the component, grows larger, which is the hallmark trait of the GC. If Q is negative, then the number of unknown neighbors reduces to zero; therefore, we are not walking through a GC. Recalling earlier definitions, the condition Q > 0 may be stated as

$$z_e > 1. \tag{16}$$

Since in unclustered (C = 0) networks ze = z2/z1, Ref. [24] advocates the following equivalent criterion.

Criterion A. There is a GC in random networks if z2 > z1, i.e., the mean number of second-nearest neighbors is greater than the mean number of neighbors.

This has the intuitive epidemiological interpretation: if the mean number of infected individuals grows with distance from the source, an epidemic outbreak will occur. In [7] we have adapted Molloy and Reed’s procedures in a manner that makes them applicable for clustered networks. Again, suppose we follow a random edge that begins from a source node and ends at some node v. Previously, if v had degree k, the number of “unknown” neighbors would increase by k − 2. However, with triangles there is a possibility that some of the k − 1 outgoing edges will return to nodes that are already known (via dashed edges
in Fig. 1b). It is possible to avoid counting these nodes twice by counting them in a manner that considers the free-excess degree distribution $e_k$. Thus, when a node v of free-excess degree i is encountered, the number of “unknown” neighbors increases by i − 1, and the expected increase in the number of unknown neighbors is thus proportional to $Q_c = \sum_i e_i (i - 1)$. The criterion for the GC in a clustered network is just $Q_c > 0$. From (14), this condition becomes

$$(1 - C) z_e > 1, \tag{17}$$

which differs from (16) by the scale factor (1 − C). Multiplying both sides by z1, we obtain (1 − C) z1 ze > z1. Recalling (15), this may be interpreted as the following criterion.

Criterion B. There is a GC if the mean number of edges emanating outwards from nodes at a distance one to nodes at a distance two (beginning from some arbitrary source node) is larger than the mean degree.
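Conditions (16) and (17) are straightforward to evaluate for a given degree distribution. The sketch below is illustrative only: it assumes that ze denotes the mean excess degree $G_1'(1) = \sum_k k(k-1)p_k / z_1$ (its definition lies outside this passage), and the function name `giant_component_criteria` and the truncated Poisson input are hypothetical choices.

```python
import math

def giant_component_criteria(p, C):
    """p[k] is the probability of degree k; C is the clustering coefficient."""
    z1 = sum(k * pk for k, pk in enumerate(p))               # mean degree
    # Assumed definition: mean excess degree z_e = sum_k k (k - 1) p_k / z1.
    ze = sum(k * (k - 1) * pk for k, pk in enumerate(p)) / z1
    Q = sum(k * pk * (k - 2) for k, pk in enumerate(p))      # Molloy-Reed Q
    return {
        "z1": z1,
        "ze": ze,
        "Q": Q,
        "unclustered_gc": ze > 1,           # condition (16), same as Q > 0
        "clustered_gc": (1 - C) * ze > 1,   # condition (17)
    }

# Poisson degree distribution with mean 1.2, truncated at 60 terms.
z = 1.2
p = [math.exp(-z) * z**k / math.factorial(k) for k in range(60)]
result = giant_component_criteria(p, C=0.2)
print(result["unclustered_gc"], result["clustered_gc"])
```

For this example, (16) is satisfied (ze ≈ 1.2 > 1) while (17) is not ((1 − C) ze ≈ 0.96 < 1), so the two conditions disagree about whether a GC exists.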
Note that in the epidemiological sense, the emphasis is on the growth in the number of outward edges or transmission routes from a typical source node to its neighbors, and then to its neighbors’ neighbors (Fig. 2a). Although criterion A was previously used for clustered networks without any proper justification [31, 22], Fig. 3a shows that it provides poor predictions of the critical mean degree $z_1^*$ as a function of the clustering, C (predictions are made using estimates of z2 in the presence of clustering as detailed in [31, 23]). The accuracy of the prediction can be assessed against simulations (Fig. 3). In contrast, criterion B is a much better predictor, as shown in Fig. 2b and Fig. 3a. The latter plots the analytic result for a Poisson degree distribution, where z1 = ze [24] and $z_1^* = (1 - C)^{-1}$ (from (17)).

[Figure 2. Panel a: node V0 with layers L1 and L2. Panel b, log-log plot: largest component (vertical) vs. size of network (N) (horizontal); series: Simulation, y = Const × N^{2/3}.]

Fig. 2. The difference between the new criterion B and the conventional criterion A. a) Consider the following example: a typical node has a neighborhood similar to V0: 3 nodes at a distance one in the first layer, L1, and 2 nodes at a distance two in the second layer, L2, but 4 edges to the second layer (from L1 to L2). Criterion B predicts a GC, while criterion A fails to predict a GC. b) The size of the largest component plotted vs. N for Poisson networks having mean degree z1 = 1.25 and C = 0.2 (i.e., at the critical point according to criterion B). Indeed the size at the critical point correctly scales as $\sim N^{2/3}$, as is known for the case z1 = 1, C = 0 (see references in [8]). Note that criterion A would wrongly predict this regime to be below the critical point (since z2 ≈ 1.19 < z1) and would suggest that all components should scale as O(log N).
[Figure 3. Panel a: $z_1^*$ vs. C. Panel b: size vs. z1 for C = 0 and C = 0.25. Panel c: $z_1^*$ vs. C.]

Fig. 3. The critical mean degree $z_1^*$ for the formation of a GC, plotted as a function of C. a) Poisson degree distribution. Predictions of criterion A (grey line; z2 estimated as in [31]). Predictions of criterion B (black line; $z_1^* = (1 - C)^{-1}$ (see text)). Empirical estimates of $z_1^*$ (circles) were obtained through the following procedure in order to overcome finite size effects: first the value of the size of the largest component was found for networks with C = 0 at the known threshold $z_1^* = 1$ (b; dashed line). This value was used to identify the critical threshold in comparable networks with C > 0. c) SF degree distribution. Symbols as in a. Black and grey lines, which practically overlap, are based on expressions for z1 and ze for SF networks [24].
Scale-free (SF) networks, where $p_k \sim k^{-\alpha}$, are usually characterized by their exponent α. However, for the purpose of discussing criticality, when α ≈ 3.45 and the tail of the distribution is not very significant, we can also characterize them by their mean degree. Taking this approach, we see in Fig. 3c that, as opposed to the Poisson case, the critical mean degree for SF networks is almost constant as a function of C. Its constancy results from the fact that z1 ze and ze increases to a great extent with a small increase in z1 [24]. However, criterion A, being based on the behavior of the second moment of the distribution as well, gives similar predictions (Fig. 3c) from the same considerations.

3.2 Complete Connectivity

Although the transition to complete connectivity is less well studied, the following example makes clear the need for further work in this area, particularly for clustered networks. In a recent series of papers [12, 17], the effect of clustering on a network of coupled phase oscillators was examined. These authors made the plausible assumption that by investigating a network with a very high mean degree their network will be connected. When they [17] found groups of oscillators, each group oscillating at a different frequency, they named them “dynamical clusters,” in order to distinguish them from the topological clusters (i.e., connected components).
[Figure 4: GC (vertical axis, 0.85 to 1) vs. C (horizontal axis, 0 to 0.6); series: N=100, N=200.]

Fig. 4. Size of the GC vs. C for Poisson network with z1 = 1.5 ln(N).
However, from the previous section we might be tempted to guess that the second critical point, where the graph becomes connected, scales with $(1 - C)^{-1}$. Unfortunately, while simulations do not confirm our guess for a disintegration at $C^* = 1 - \frac{\ln(N)}{z_1 N}$, they do clearly demonstrate that introducing clustering to the network breaks it down quite early (Fig. 4). When conducting studies such as [12, 17] or considering the validity of their implications, one should be especially careful when checking complete connectivity by counting the multiplicity of the eigenvalue 0 of the graph Laplacian (as done in [17]).⁴ In practical use, numerical implementations will often find very small, though non-zero, eigenvalues instead of the correct ones [2].
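The numerical caveat about Laplacian eigenvalues can be illustrated with a short sketch. It assumes NumPy and a dense symmetric adjacency matrix. The tolerance `tol` and both function names are hypothetical choices made here, and the traversal-based count is added only as a cross-check that does not depend on floating-point thresholds.

```python
import numpy as np

def count_components_laplacian(adj, tol=1e-8):
    """Count eigenvalues of L = D - A that are zero up to a tolerance."""
    a = np.asarray(adj, dtype=float)
    lap = np.diag(a.sum(axis=1)) - a
    eigvals = np.linalg.eigvalsh(lap)   # symmetric matrix, real eigenvalues
    return int(np.sum(np.abs(eigvals) < tol))

def count_components_search(adj):
    """Count connected components by depth-first traversal."""
    a = np.asarray(adj)
    seen, count = set(), 0
    for start in range(len(a)):
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for u in np.flatnonzero(a[v]):
                u = int(u)
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
    return count

# Two disjoint triangles plus one isolated node: three components.
adj = np.zeros((7, 7), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    adj[i, j] = adj[j, i] = 1

print(count_components_laplacian(adj), count_components_search(adj))
```

Comparing the two counts on a given network is one way to detect a tolerance that is too strict or too loose for the eigenvalue-based check.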
⁴ The idea is basically as follows: find the eigenvalues of the matrix L = D − A, where A is the graph adjacency matrix and D is a diagonal matrix whose entry $D_{jj}$ is the degree of node j; the multiplicity of the eigenvalue 0 is the number of connected components.

4 The Size of the GC and Its Robustness

4.1 The Size of the GC

In order to find the size of the GC, Andersson [3] examined the probability of extinction in a two-phase branching process that mimics the construction of a random graph (with C = 0). In this branching process the source node has a number of direct descendants distributed according to $\{p_i\}_{i=0}^{\infty}$ (the first phase), while each of its descendants has a number of direct descendants distributed according to $\{q_i\}_{i=0}^{\infty}$ (the second phase). First, consider the probability u for a lineage of a single branch that arrives at some node, v1, to eventually die out. This necessitates that all k branches leaving v1 die out, an event that occurs with probability $u^k$. Since the degree of v1 is unspecified, we obtain the self-consistency condition $u = \sum_{k=0}^{\infty} q_k u^k = G_1(u)$, which can be solved to find u. The second step takes into consideration that the branching process begins from some arbitrary source node. Because all branches originating from the source must die out in order for the process to become extinct, the probability of extinction (which is equivalent to belonging to a small component) is equal to $G_0(u)$, while the probability of persistence (or belonging to a GC) is $S = 1 - G_0(u)$, which is also the size of the GC.

The preceding argument needs to be modified for clustered networks [7]. For the latter, the probability u for the lineage of a single branch to die out no longer fulfills the condition $u = G_1(u)$, because the progeny in the second phase are no longer distributed by $\{q_i\}_{i=0}^{\infty}$. Instead, we can replace $q_i$ with $e_i$ so that the self-consistency condition is, to a close approximation, $u = G_c(u)$. The error remaining is largely due to higher order correlations between nodes in the branching process that occur with probability of the order of $C^2$ (and even smaller when triangles sharing an edge are known to be rare, as is the focus of Ref. [28]). Indeed, $C^2 \ll 1$ in many real-world networks. Thus, we get the following procedure:

(a) Solve for u such that $G_c(u) = u$.
(b) Calculate GC size as $S = 1 - G_0(u)$.

4.2 The Robustness and Resilience of the GC

Another related question concerns the size of the GC in the presence of dilution, i.e., when a fraction r of the nodes or edges (or a combination of nodes and edges) has been randomly removed.⁵ This is understood to be related to the robustness and resilience of networks against breakdowns of their units, the classic example being the World Wide Web. Although the naive identification of functionality with the existence of the GC is sometimes considered problematic,⁶ this formalism does have important applications as in, for example, the study of epidemic outbreaks [10]. We can take the same approach as in the previous section and ask again for the probability u that a lineage of a single branch that arrives at some node, v1, eventually dies out. In the case of node removal, in the branching process, following an edge we reach a node that is unoccupied (was removed) with probability rn.
Therefore, the lineage will die out with probability rn plus 1 − rn times the probability that all of the lineages of the outgoing edges from v1 will eventually die out (found via the self-consistency condition). Thus, step (a) becomes: solve for u such that $r_n + (1 - r_n) G_1(u) = u$. With a similar consideration of edge removal with probability re, replacing the $\{q_i\}_{i=0}^{\infty}$ with the free-excess probabilities $\{e_i\}_{i=0}^{\infty}$ (or $G_1$ with $G_c$) and demanding that all branches originating from the source eventually die out, we get the size of the GC in clustered networks after joint edge+node removal:

(a) Solve for u such that $1 - (1 - r_n)(1 - r_e) + (1 - r_n)(1 - r_e) G_c(u) = u$.
(b) Calculate GC size as $S = 1 - r_n - (1 - r_n) G_0(u)$.

When C = 0, these equations coincide with those in [9]. Indeed, we feel that our formalism, in contrast to that of [28], has the advantage of being a natural generalization of previous theory [9, 24]. This theory for the size of the GC is evaluated against simulation and real-world data in the next section, showing good agreement.

⁵ Also known respectively as site, bond and joint site+bond percolation.
⁶ Durrett [10] gives a nice critique of the claim that “the internet is robust.. after dilution (in a certain parameters regime) we still get a GC.” In the regime referred to, “if all 6 billion people were initially connected then after the removal only 36 people can check their email.”
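The two-step procedure (a)–(b) can be sketched numerically. The example below is a minimal sketch, not the authors' code: it assumes a Poisson degree distribution, for which $G_0(x) = G_1(x) = e^{z_1(x-1)}$, and solves step (a) by plain fixed-point iteration started at u = 0. The function name `gc_size_poisson` and the iteration count are hypothetical choices.

```python
import math

def gc_size_poisson(z1, C=0.0, r_n=0.0, r_e=0.0, iters=10000):
    """GC size after joint node/edge removal, Poisson degrees assumed."""
    def G0(x):
        return math.exp(z1 * (x - 1.0))   # Poisson: G0 and G1 coincide

    def Gc(x):
        return G0(C + (1.0 - C) * x)      # relation (13) with G1 = G0

    keep = (1.0 - r_n) * (1.0 - r_e)
    u = 0.0
    for _ in range(iters):                # step (a): iterate to a fixed point
        u = 1.0 - keep + keep * Gc(u)
    return 1.0 - r_n - (1.0 - r_n) * G0(u)   # step (b)

S_plain = gc_size_poisson(2.0)                  # C = 0, no removal
S_clustered = gc_size_poisson(2.0, C=0.2)       # clustering only
S_diluted = gc_size_poisson(2.0, C=0.2, r_n=0.2, r_e=0.2)
S_subcritical = gc_size_poisson(1.2, C=0.2)     # (1 - C) z1 < 1
print(S_plain, S_clustered, S_diluted, S_subcritical)
```

Starting the iteration at u = 0 makes it increase monotonically towards the smallest fixed point, which is the relevant root. Convergence slows near the critical point, which is why a generous iteration count is used instead of a tight stopping rule.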
5 Simulations and Real Data

Clustered networks were generated by three different methods, all giving similar results, each having its own advantages in terms of efficiency. In all the methods, a degree sequence was generated by sampling from a desired distribution. In two of the methods, a network was constructed according to the generated degree sequence by using a fill algorithm [13]. In one case we then selectively switched links [4] to reach a desired degree of clustering. In the second case, we selectively reconnected links to nodes at distance two, which led to an increase in the number of triangles. The third method was based on distributing triangles in an empty network under the restrictions of the degree sequence, and later filling in additional links using a fill algorithm [13]. In Fig. 5 we plot simulations against theory for the size of the GC for a variety of parameters. Figure 5a shows the size of the GC vs. the mean degree for different values of C, rn and re, where rn and re are the fractions of nodes and edges removed, respectively. In order to isolate the effect of clustering, we have also plotted in Fig. 5b the size of the GC vs. C for a fixed mean degree.

[Figure 5. Panel a: GC vs. z1; series: C=0.1 rn=0 re=0; C=0.1 rn=0.1 re=0; C=0.2 rn=0.2 re=0; C=0.2 rn=0.2 re=0.2. Panel b: GC vs. C; series: rn=0 re=0; rn=0 re=0.2; rn=0.2 re=0; rn=0.2 re=0.2.]

Fig. 5. The size of the GC after dilution. a) As a function of the mean degree for networks with Poisson degree distribution. A fraction rn and re of the nodes/edges were removed randomly, for C = 0.1 and C = 0.2. b) As a function of C for networks with Poisson degree distribution and z1 = 2. A fraction rn and re of the nodes/edges were removed randomly. Black lines: our prediction for each case.

[Figure 6: panels a, b and c; horizontal axes run from 0 to 1.]

Fig. 6. The size of the GC after dilution in real-world networks. Grey: simulations with bars at a width of one std, black: our predictions, broken line: the naive predictions which do not consider C (i.e., C = 0). a) Node removal for the C. elegans neural network. N = 453, C = 0.124, z1 = 8.9. b) Edge removal for the yeast protein-protein interaction network. N = 2112, C = 0.055, z1 = 2.1. c) Joint node+edge removal for the network of Zachary’s Karate club. N = 34, C = 0.255, z1 = 4.4.

The most revealing plot is that of the case rn = re = 0 (top line in Fig. 5b), where there is good agreement at the lower values of C (i.e. C < 0.3),
as well as for its higher values (at C ≈ 0.5), as opposed to a deviation at intermediate values. This is explained by the fact that initially the $O(C^2)$ error in our approximation is rather small; at intermediate values it can grow (but is still < $C^2$), and towards the critical point it needs to converge back to the exact result, producing again a very small deviation. Notice as well that after dilution the deviations become smaller still (Fig. 5b). This might be explained by the sensitivity of the higher order correlations, which require many edges, and their fast destruction under dilution. We can also take data from real-world networks and compare their behavior under dilution with the prediction. When doing so, we often find, due to the skewed degree distribution that characterizes many real-world networks and their “denseness,” that the network stays almost as one connected unit for a large range of dilution. It is thus not surprising that allowing for clustering does not improve the predictions. A distinct example is given in Fig. 6a, where the size of the GC of the neural network of C. elegans [34] is plotted vs. rn, the fraction of nodes removed. The size of the GC decreases almost linearly with rn. Nevertheless, Fig. 6b, c show two real-world networks, the yeast protein-protein interaction network and Zachary’s Karate club [36], where considering the value of C gives an advantage in predicting the size of the GC as a function of dilution.
6 Discussion

Perhaps the most far-reaching result presented here is our criterion B for the existence of the GC. This simple and intuitive criterion (Is the mean number of edges going to the second layer larger than the one going to the first?) is a natural generalization of the well-established Molloy–Reed condition (Is the mean number of nodes at the second layer larger than the one at the first?),
which is often misused. It might be that the Molloy–Reed condition gained much of its appeal due to the interpretation which identifies the existence of a GC with the possibility of a random walker, originating from a source node, to reach a large distance from the source (see as well the related and interesting electrostatic approach [27]). Although grossly oversimplified, we may conjecture that this is true for the general case. Indeed, when inspecting Fig. 2a, for example, we see that in order to have a positive drift away from the source we need not have an increasing number of nodes at each layer, but rather an increasing number of edges between layers! We did not study the topological effects of having z2 > z1 in clustered networks. We still expect to find interesting behavior at z2 = z1 in quantities such as the diameter of the network. This is indeed a subject for future work.
Acknowledgments

MT and YB are grateful for the support of the EC (project MATHfSS 15661) and DIP (project Compositionality F 1.2). LS and YAR are grateful for the support of the James S. McDonnell Foundation and the Israeli Science Foundation.
References 1. Aiello, W., Chung, F., Lu, L.: A random graph model for massive graphs, Proc. of the 32nd Annu. ACM Symposium on Theory of Computing (2000) 2. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK User’s Guide, 3rd edition, SIAM, Philadelphia (1999) 3. Andersson, H.: Limit theorems for a random graph epidemic model, Ann. Appl. Probab. 8, 1331–1349 (1998) 4. Artzy-Randrup, Y., Stone, L.: Generating uniformly distributed random networks, Phys. Rev. E. 72 (5): 056708 (2005) 5. Barabasi, A.-L., Albert, R.: Emergence of scaling in random networks, Science 286, 509512 (1999) 6. Bender, E. A., Canfield, E. R.: The asymptotic number of labeled graphs with given degree sequences, J. Combin. Theory A 24, 296307 (1978) 7. Berchenko, Y., Artzy-Randrup, Y., Teicher, M., Stone, L.: The emergence and the size of the giant component in clustered random graphs with a given degree distribution, submitted. 8. Bollobas, B.: Random Graphs, 2nd edition, Academic Press, New York (2001) 9. Callaway, D. S., Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Network robustness and fragility: Percolation on random graphs, Phys. Rev. Lett. 85, 5468 (2000) 10. Durrett, R.: Random Graph Dynamics, Cambridge U. Press, Cambridge, UK (2006)
11. Erdos, P., Renyi, A.: On the evolution of random graphs, Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5, 1761 (1960). 12. Gomez-Gardenes, J., Moreno, Y., Arenas, A.: Paths to synchronization on complex networks, Phys Rev Lett. 98 (3):034101 17358685 (2007) 13. Gotelli, N. J., Entsminger, G. L.: Swap and fill algorithms in null model analysis: Rethinking the Knight’s Tour, Oecologia 129, 281–291 (2001) 14. Guillaume, J. L., Latapy, M.: A realistic model for complex networks, (2003) condmat/0307095. 15. Jeong, H., Mason, S., Barabasi, A.-L., Oltvai, Z. N.: Lethality and centrality in protein networks, Nature 411, 4142 (2001) 16. Keeling, M. J.: The effects of local spatial structure on epidemiological invasion. Proc. R. Soc. London B 266, 859–867 (1999) 17. McGraw, P. N., Menzinger, M.: Analysis of nonlinear synchronization dynamics of oscillator networks by Laplacian spectral methods, Phys. Rev. E 75, 027104 (2007) 18. Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence, Random Structures and Algorithms 6, 161179 (1995) 19. Molloy, M., Reed, B.: The size of the giant component of a random graph with a given degree sequence, Combin. Probab. Comput. 7, 295 (1998) 20. Montoya, J. M., Sole, R. V.: Small world patterns in food webs, J. Theor. Bio., 214, 405–412 (2002) 21. Newman, M. E. J.: The structure and function of complex networks, SIAM Review 45, 167 (2003) 22. Newman, M. E. J.: Properties of highly clustered networks, Phys. Rev. E 68, 026121 (2003) 23. Newman, M. E. J.: Random graphs as models of networks. In: Bornholdt, S., Schuster, H. G. (eds.) Handbook of Graphs and Networks, Wiley-VCH, Berlin (2003) 24. Newman, M. E. J., Strogatz, S. H., Watts, D. J.: Random graphs with arbitrary degree distributions and their applications, Phys. Rev. E. 64, (2001) 25. Park, J., Newman, M. E. J.: Solution for the properties of a clustered network, Phys. Rev. E 72, 026136 (2005) 26. 
Pastor-Satorras, R., Vazquez, A., Vespignani, A.: Dynamical and correlation properties of the internet, Phys. Rev. Lett. 87, 258701 (2001) 27. Redner, S.: A Guide to First-Passage Processes, Cambridge University Press, New York (2001) 28. Serrano, M. A., Boguna, M.: Percolation and epidemic thresholds in clustered networks, Phys. Rev. Lett. 97, 088701 (2006) 29. Strauss, D.: On a general class of models for interaction, SIAM Review 28, 513–527 (1986) 30. Vazquez, A.: Growing networks with local rules: Preferential attachment, clustering hierarchy and degree correlations, cond-mat/0211528 (2002) 31. Volz, E.: Networks with tunable degree distribution and clustering, Phys. Rev. E 70, 056115 (2003) 32. Wasserman, S., Pattison, P.: Logit models and logistic regressions for social networks: I. An introduction to Markov random graphs and p*, Psychometrika 61, 401426 (1996) 33. Watts, D. J., Strogatz, S. H.: Collective dynamics of small-world networks, Nature 393, 440442 (1998)
34. White, J. G., Southgate, E., Thompson, J. N., Brenner, S.: Structure of the nervous system of the nematode C. elegans, Phil. Trans. R. Soc. London 314, 1340 (1986) 35. Wilf, H. S.: generatingfunctionology, 2nd edition, Academic Press, London (1994) 36. Zachary, W.: An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452–473 (1977)
Technological Networks Bivas Mitra Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, 721302, India;
[email protected] 1 Introduction The study of networks in the form of mathematical graph theory is one of the fundamental pillars of discrete mathematics. However, recent years have witnessed a substantial new movement in network research. The focus of the research is shifting away from the analysis of small graphs and the properties of individual vertices or edges to consideration of statistical properties of large scale networks. This new approach has been driven largely by the availability of technological networks like the Internet [12], World Wide Web network [2], etc. that allow us to gather and analyze data on a scale far larger than previously possible. At the same time, technological networks have evolved as a socio-technological system, as the concepts of social systems that are based on self-organization theory have become unified in technological networks [13]. In today’s society, we have a simple and universal access to great amounts of information and services. These information services are based upon the infrastructure of the Internet and the World Wide Web. The Internet is the system composed of ‘computers’ connected by cables or some other form of physical connections. Over this physical network, it is possible to exchange e-mails, transfer files, etc. On the other hand, the World Wide Web (commonly shortened to the Web) is a system of interlinked hypertext documents accessed via the Internet where nodes represent web pages and links represent hyperlinks between the pages. Peer-to-peer (P2P) networks [26] also have recently become a popular medium through which huge amounts of data can be shared. P2P file sharing systems, where files are searched and downloaded among peers without the help of central servers, have emerged as a major component of Internet traffic. An important advantage in P2P networks is that all clients provide resources, including bandwidth, storage space, and computing power. 
In this chapter, we discuss these technological networks in detail. The review is organized as follows. Section 2 presents an introduction to the Internet and different protocols related to it. This section also specifies the socio-technological properties of the Internet, like scale invariance, the small-world property, network resilience, etc. Section 3 describes the P2P networks, their categorization, and other related issues like search, stability, etc. Section 4 concludes the chapter.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_15, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009
2 The Internet
The Internet is a global network connecting millions of computers in a decentralized form. Each Internet computer, called a host, is independent, and its operators can choose any of the commercial Internet service providers (ISPs). Many computer scientists regard the Internet as a “prime example of a large-scale, highly engineered, yet highly complex system” (Fig. 1). The Internet is extremely heterogeneous in nature; for instance, data transfer rates and physical characteristics of connections vary widely. In addition, the Internet evolves and emerges based upon its large-scale self-organization property. Technically, the Internet can be defined as the network of networks working with Transmission Control Protocol (TCP)/Internet Protocol (IP). This definition treats the Internet as a purely technological system. However, it overlooks the fact that knowledgeable human activities make the Internet work. Hence, more accurately, the Internet is a global socio-technological system that is based on a technological structure and a set of protocols [13]. Some of the important Internet-based services are e-mail, the World Wide Web, remote access, and Internet telephony.
Fig. 1. Internet as complex network.
2.1 Protocols Used in the Internet
Once we have more than one computer, it is theoretically possible to communicate, provided that the computers ‘speak’ a common language. The Internet uses a suite of communication protocols, of which the two most important are the TCP and the IP [19]. These protocols have the following responsibilities: First, the protocol defines the basic unit of data transfer, called the ‘datagram’, used throughout the Internet. Thus, it specifies the exact format of all data as it passes across the Internet. Second, the TCP/IP software performs the routing function, choosing a path over which data will be sent. Third, the protocol includes a set of rules that embody the idea of reliable packet delivery over unreliable connections. In addition, these protocols introduce the IP addressing scheme, which is integral to the process of routing datagrams through the Internet to the particular destination host. Each host on a TCP/IP network is assigned a unique 32-bit IP address that is divided into two main parts: the network number and the host number (Fig. 2). The network number identifies a network and must be assigned by the Internet Network Information Center (InterNIC) if the network is to be part of the Internet. An ISP can obtain blocks of network addresses from the InterNIC and can itself assign address space as necessary. The host number identifies a host on a network and is assigned by the local network administrator. To make them easier to remember, IP addresses are normally expressed in decimal format as a ‘dotted decimal number’. The four numbers in an IP address are called octets, because they each have eight bit positions when viewed in binary form. Currently three classes of networks (A, B, C) are commonly used. These classes may be distinguished by the number of octets used to identify the network, and also by the range of numbers used by the first octet. 
If the value of the first octet is 127, it represents the local host, regardless of what network it is really in.
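The split of a 32-bit address into a network number and a host number can be illustrated with a small sketch. The first-octet ranges used below (up to 126 for class A, 128 to 191 for class B, 192 to 223 for class C) are the conventional classful ranges and are an assumption here, since the text does not list them; the function names `parse_ipv4` and `split_classful` are hypothetical.

```python
def parse_ipv4(dotted):
    """Convert a dotted decimal string such as '10.1.2.3' to a 32-bit integer."""
    octets = [int(part) for part in dotted.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a valid IPv4 address: {dotted!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # each octet contributes eight bits
    return value


def split_classful(dotted):
    """Return (class, network number, host number) under the classful scheme.

    The first-octet ranges are the conventional ones and are an assumption
    of this sketch, not something stated in the chapter.
    """
    value = parse_ipv4(dotted)
    first_octet = value >> 24
    if first_octet == 127:
        return ("loopback", None, None)  # local host, whatever the network
    if first_octet <= 126:
        name, network_octets = "A", 1
    elif first_octet <= 191:
        name, network_octets = "B", 2
    elif first_octet <= 223:
        name, network_octets = "C", 3
    else:
        return ("other", None, None)  # classes beyond A, B, C are not split here
    host_bits = 8 * (4 - network_octets)
    return (name, value >> host_bits, value & ((1 << host_bits) - 1))


if __name__ == "__main__":
    for address in ("10.1.2.3", "172.16.5.4", "192.168.1.7", "127.0.0.1"):
        print(address, split_classful(address))
```

The class determines how many leading octets form the network number; the remaining bits are the host number assigned by the local administrator.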
Fig. 2. IP addressing.
2.2 Scale Invariance and Small World Property of the Internet
The topology of the Internet is studied at two different levels. At the router level, the nodes are the routers, and edges are the physical connections between them. At the interdomain (or autonomous system) level, each domain, composed of hundreds of routers and computers, is represented by a single node, and an edge is drawn between two domains if there is at least one route that connects them. The topology of large-scale networks like the Internet is characterized by the degree distribution p_k, which is defined as the fraction of nodes in the network having degree k. In 1999, Faloutsos et al. [12] studied the Internet at both levels, concluding that in each case the degree distribution follows a power law (Fig. 3), i.e., p_k ∼ k^{−γ}. The interdomain topology of the Internet, captured at three different dates between 1997 and the end of 1998, resulted in degree exponents between γ = 2.15 and γ = 2.2. The 1995 survey of the Internet topology at the router level, containing 3888 nodes, found γ = 2.48. In 2000, Govindan and Tangmunarunkit [15] mapped the connectivity of nearly 150,000 router interfaces and nearly 200,000 router adjacencies, confirming the power-law scaling with γ = 2.3. It is widely believed that the scale invariance property of the Internet is related to the self-organization property of the participating nodes.
Fig. 3. The first data file holds link directions corresponding to the traceroute directions, while the second file is an undirected version of the first file. There are a total of 192,244 nodes, 636,643 directed links, and 609,066 undirected links. The average and maximum node degrees (undirected) are 6.34 and 1071 respectively, and the node degree distribution is plotted.
The preferential attachment tendency of
the nodes to join the network [42] stabilizes the degree distribution as the size of the Internet becomes very large. Internet as small world. An accurate characterization of the emergent topological properties of the Internet and a better understanding of the underlying processes that yield these characteristics are crucial for proper evaluation of network protocols and systems. In that vein, recent works [20, 5] have shown the prevalence of small-world phenomena [24, 44] in the Internet. Small-world graphs exhibit a high degree of clustering, yet have typically short path lengths between arbitrary vertices. Yook [47] and Pastor-Satorras [32] have studied the Internet at the domain/autonomous system level between 1997 and 1999 and found that its clustering coefficient ranges between 0.18 and 0.3, compared to the clustering coefficient 0.001 for random networks of similar parameters. On the other hand, the average path length of the Internet ranges between 3.70 and 3.77 and at the router level it is around 9, indicating its small-world character. Small-world behavior in the Internet maps to two possible causes: first, the high variability of node degree distributions and, second, the preference of vertices to have local connections [20]. With the high variability of the node degree distribution, it is likely that two interconnected vertices, say u and v, will have the same neighbor, say w specifically, when w is a node with extremely large degree. It means that u, v, and w form a triangle. Such a pattern contributes directly to the computation of the clustering coefficients of u, v, and w, (i.e. Cu , Cv , and Cw ) and results in a larger overall average clustering coefficient C of the network. Thus, C grows with the variability of vertex degree. Also, notice that with highly variable vertex degrees, the average distance between two vertices (L) is short. This happens because the shortest path is usually through those extremely popular vertices. 
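The triangle argument above can be made concrete on a toy graph. The following is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the graph `TOY` and the function names are illustrative and not taken from the cited studies. A hub w is linked to u, v, x, and y, and u and v are also linked to each other, so u, v, and w form a triangle.

```python
from collections import deque
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention used in this sketch for degree 0 or 1
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Average clustering coefficient C over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def average_path_length(adj):
    """Average shortest-path length L over all connected ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            current = queue.popleft()
            for nxt in adj[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical toy graph: hub w plus one triangle u-v-w.
TOY = {
    "w": {"u", "v", "x", "y"},
    "u": {"w", "v"},
    "v": {"w", "u"},
    "x": {"w"},
    "y": {"w"},
}

if __name__ == "__main__":
    print(average_clustering(TOY), average_path_length(TOY))
```

In this toy case every shortest path between non-adjacent nodes passes through the hub w, which is the mechanism the paragraph describes.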
That is, highly popular vertices serve as good navigators through the graph. On the other hand, a preference for local connectivity also results in small-world behavior. The reason is that, with a non-negligible probability of a local connection, if a node u is connected to v and w, then it is likely that v and w are also close to each other. As a result, there is a non-negligible probability that a triangle will form among these vertices, resulting in a higher clustering coefficient. Meanwhile, since there are still many long-range connections, it is easy to find a short path between two randomly chosen nodes. In addition, researchers from Stanford University [37] found that as networks grow very large, they become very efficient in the number of steps a data packet takes to get from one node to another node. The number of steps grows logarithmically with the size of the network, which means that for 10,000 nodes we need five steps, but for 100 million the number grows only to 6.5. Such networks also exhibit a clustering property, i.e., the relationships among nodes are not randomly distributed, but are grouped. Short path lengths mean that there are some very short paths sprinkled throughout the network that
may directly link one group to another. This conforms to Watts and Strogatz’s model [44], where a low dimensional regular lattice is transformed to a small world network. 2.3 Fault Tolerance of the Internet The Internet and other communication networks display a high degree of robustness: while key components regularly malfunction, local failures rarely lead to the loss of the global information-carrying ability of the network [3]. It has been observed that network topology plays an important role in the robustness of the Internet. Consider an arbitrary connected graph of N nodes, and assume that an f fraction of the nodes have been removed. This leads to important questions, like: What is the probability that the resulting subgraph is connected, and how does it depend on the removal probability f ? For a broad class of graphs there exists a threshold probability fc such that if f < fc the resulting subgraph is connected, but if f > fc the subgraph becomes disconnected (Fig. 4). Here fc is termed the percolation threshold. In the following discussion, we will call a network fault tolerant (or robust) if it contains a giant component comprising of most of the nodes even after a fraction of its nodes are removed. 2.3.1 Stability Criteria The topology of the Internet and the failure probability of nodes can be characterized by probability distributions pk and fk respectively. Here pk signifies the degree distribution which is the probability that a randomly chosen node has degree k. Similarly fk is the probability that a vertex of degree k, will be removed from the network. Nodes leave the Internet due to their faulty nature [8] or due to the attack mounted on the important nodes [9]. Based upon these basic parameters, an analytical framework has been derived to
Fig. 4. Illustration of the effects of node removal on an initially connected network.
examine the stability of the Internet (or any kind of networks) where the vertices undergo some dynamics [28]. The analytical framework can be expressed with the help of the following equation:
\[ \sum_{k=0}^{\infty} k\, p_k \bigl( k(1 - f_k) - (1 - f_k) - 1 \bigr) = 0. \tag{1} \]
Equation (1) states the critical condition for the stability of the Internet (characterized by p_k) undergoing any type of failure and attack (characterized by f_k). Stability analysis of networks under different node disturbance schemes. The existing empirical and theoretical results indicate that complex networks can be divided into two major classes based on their degree distribution p_k. In the first class of networks, p_k peaks at an average degree ⟨k⟩ and decays exponentially for large k. The most investigated examples of such exponential networks are the random graph model of Erdos and Renyi [11] and the small-world model of Watts and Strogatz [44], both leading to a fairly homogeneous network. In contrast, results on the Internet, World Wide Web, and other large networks indicate that many systems belong to a class of inhomogeneous networks, referred to as scale-free networks, for which p_k decays as a power law, i.e., p_k ∼ k^{−γ} [8]. While a node with a very large number of connections (k ≫ ⟨k⟩) is practically prohibited in exponential networks, highly connected nodes are statistically significant in scale-free networks. In this review, we concentrate on the scale-free network, as this kind of network is widely used to model the Internet. In this section, we consider two types of node removal schemes. The first scheme studies the removal of randomly selected nodes. In this case, the probability of removal of any randomly chosen node having degree k is f_k = f (independent of k) [8]. In the second scheme, the most highly connected nodes are removed at each step. This second scheme emulates an intentional attack on the network [9]. Formally, f_k = 0 when k ≤ k_max and f_k = 1 when k > k_max, i.e., all the nodes in the network having degree more than k_max are removed. Next we discuss the stability of scale-free networks in the face of failure and attack. 
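As a rough numerical illustration of Eq. (1), the sketch below evaluates its left-hand side, the sum over k of k p_k (k(1 − f_k) − (1 − f_k) − 1), for a truncated power-law degree distribution. Setting f_k = f and solving for f gives f_c = 1 − ⟨k⟩/(⟨k²⟩ − ⟨k⟩); this closed form is a derived step, not quoted from the text. The exponent 2.5, the degree cutoffs, and all function names are assumptions chosen for illustration.

```python
def power_law(gamma, k_min, k_max):
    """Truncated power-law degree distribution p_k proportional to k**(-gamma)."""
    weights = {k: k ** -gamma for k in range(k_min, k_max + 1)}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


def stability_sum(pk, fk):
    """Left-hand side of Eq. (1); `fk` maps a degree k to its removal probability."""
    return sum(
        k * p * (k * (1 - fk(k)) - (1 - fk(k)) - 1) for k, p in pk.items()
    )


def critical_random_fraction(pk):
    """Value of f that makes Eq. (1) hold when f_k = f for every degree."""
    mean_k = sum(k * p for k, p in pk.items())
    mean_k2 = sum(k * k * p for k, p in pk.items())
    return 1 - mean_k / (mean_k2 - mean_k)


def random_failure(f):
    """Removal scheme f_k = f, independent of the degree."""
    return lambda k: f


def targeted_attack(k_max):
    """Removal scheme f_k = 0 for k <= k_max and f_k = 1 for k > k_max."""
    return lambda k: 1.0 if k > k_max else 0.0


if __name__ == "__main__":
    for cutoff in (100, 1000, 10000):
        pk = power_law(2.5, 1, cutoff)
        print(cutoff, critical_random_fraction(pk))
    pk = power_law(2.5, 1, 10000)
    print(stability_sum(pk, targeted_attack(10)))
```

Under these assumptions the critical fraction for random failure moves toward 1 as the degree cutoff grows, in line with the robustness of scale-free networks against random failure discussed in this section.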
The stability is measured by the change in the size of the giant component S and the average path length l after removal of the fraction of nodes. The maximum reduction in the size of the giant component indicates the breakdown of the network. Stability against random failure. We start by investigating the stability of scale-free network to random removal of nodes, looking at the changes in the relative size of the giant component S and the average path length l [8]. In a scale-free network, the size of the giant component S decreases slowly from S = 1 as the fraction of nodes removed f increases (see Fig. 5). In random failure, most of the removed nodes in the network have low degree; hence, they have little impact upon the size of the giant component S. Eventually,
Fig. 5. The size of the giant component S and average path length l of an initially connected network when a fraction f of the nodes are removed. Scale-free network generated by the scale-free model with N = 10,000 and ⟨k⟩ = 4. Squares indicate random node removal, while circles correspond to preferential removal of the most connected nodes [3].
S reaches 0 at some higher f , which is denoted as the percolation threshold fc . The analytical calculations indicate that the percolation threshold fc → 1 as the size of the network increases to infinity. In simple terms, scale-free networks display an exceptional robustness against random node failures. On the other hand, the average path length l increases with the fraction of removed nodes f , as paths are disrupted in the network, and eventually l peaks at percolation threshold fc . In random failure, the average path length l increases slowly with f ; hence, its peak becomes less prominent. After the network breaks into isolated components, l decreases as well since in this regime the size of the largest component gradually decreases. Stability against intentional attack. In the case of intentional attack, the nodes with the highest degrees are targeted for removal. Naturally, in this kind of attack, the network breaks down into components faster than in the case of random failure. The stability of the scale-free networks mainly depends upon a few highly connected nodes. Removal of these key nodes during the intentional attack severely affects the stability of the scale-free networks [9]. This phenomenon also becomes predominant from the behavior of the average path length l, which increases rapidly and reaches its peak at percolation threshold fc . After the network breaks into isolated components, l decreases quickly since in this regime the size of the largest component decreases. 2.4 Spreading of Viruses in Internet Computer viruses and worms are posing serious challenges to the network research community. In computer science jargon, ‘virus’ refers to malicious software that spreads from computer to computer and can halt or hinder operations at numerous businesses and other organizations, disrupt
cash-dispensing machines, delay airline flights, and even affect emergency call centers [41, 23, 4]. The structure of contact networks affects the rate and extent of spreading of computer viruses, just as it does for human diseases; understanding this structure is a key element in the control of infection. Thus, recent work on epidemiological models has emphasized the effects of virus spread in scale-free networks, in which the degree distribution follows a power law [16]. There are various epidemic models available in the literature which can be used to formalize the spread of viruses in the network [33]. In these models, the susceptible (S) individuals do not have the disease and can contract it if they come in contact with infected (I) individuals. The infected individuals may gain permanent or temporary immunity after some time period and become recovered (R). The R individuals do not take part in disease transmission. Various epidemic dynamics like SI, SIS, SIR, and SIRS exist in the literature [35, 36]. In SI dynamics, the number of infected individuals increases until all the S individuals become infected. If the I individuals in SI dynamics become susceptible again after some time period, the SIS dynamics results [34]. Computer viruses mostly fall into this category; they can be ‘cured’ by antivirus software, but without a permanent virus-checking program the computer has no way to fend off subsequent attacks by the same virus. Let us assume that any susceptible individual has a uniform probability β per unit time of being infected from any other infected one, and that infected individuals recover and become immune at some stochastically constant rate γ. Then s, i, r, the fractions of nodes in the states S, I, and R respectively, are governed by the following differential equations:
\[ \frac{ds}{dt} = -\beta i s, \qquad \frac{di}{dt} = \beta i s - \gamma i. \tag{2} \]
The classical SIS model can be applied to the networked system where infection probability of the node is not constant but varies between the nodes of the network depending upon its degree. The quantity βi represents the average rate at which a susceptible individual becomes infected by its neighbors. If λ is the rate of infection via contact with the single infective node and θ(λ) is the probability that the neighbor of a k degree susceptible node is infective, then the average rate of infection of the k degree susceptible node becomes βi = kλθ(λ). The implicit expression for θ(λ) is obtained in [35] by the following expression:
\[ \frac{\lambda}{z} \sum_{k} \frac{k^{2} p_k}{1 + k \lambda \theta(\lambda)} = 1, \tag{3} \]
where z is the average degree and pk is the degree distribution. For particular choices of pk , this equation can be solved for θ(λ) either exactly or approximately. For instance, for a power-law degree distribution, Pastor-Satorras and Vespignani [34] solve it by making an integral approximation, and hence show
that there is no non-zero epidemic threshold for the SIS model in the power-law case, i.e., the disease will always persist, regardless of the value of the infection rate parameter. They have also generalized the solution to a number of other cases, including other degree distributions, finite-sized networks, and models that include vaccination of some fraction of individuals [35, 36]. In the latter case, they tackle both random vaccination and vaccination targeted at the vertices with highest degree. The results show that the propagation of the disease is relatively robust against random vaccination, at least in networks with right-skewed degree distributions, but highly susceptible to vaccination of the highest-degree individuals.
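The vanishing threshold can be explored numerically from the implicit relation of Eq. (3), (λ/z) Σ_k k² p_k / (1 + kλθ(λ)) = 1. Letting θ tend to 0 in that relation gives λ_c = z/⟨k²⟩; this is a derived step rather than a quoted result. The sketch below assumes a truncated power law with exponent 2.5 and solves for θ by bisection; the distribution, the cutoffs, and the function names are illustrative assumptions.

```python
def power_law(gamma, k_min, k_max):
    """Truncated power-law degree distribution p_k proportional to k**(-gamma)."""
    weights = {k: k ** -gamma for k in range(k_min, k_max + 1)}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


def epidemic_threshold(pk):
    """lambda_c = z / <k^2>, obtained by letting theta tend to 0 in Eq. (3)."""
    z = sum(k * p for k, p in pk.items())
    mean_k2 = sum(k * k * p for k, p in pk.items())
    return z / mean_k2


def solve_theta(pk, lam, tol=1e-12):
    """Solve Eq. (3) for theta by bisection; returns 0.0 below the threshold."""
    z = sum(k * p for k, p in pk.items())

    def lhs(theta):
        return lam / z * sum(k * k * p / (1 + k * lam * theta) for k, p in pk.items())

    if lhs(0.0) <= 1.0:
        return 0.0  # no non-zero solution: the infection dies out
    low, high = 0.0, 1.0  # lhs is decreasing in theta and lhs(1) < 1
    while high - low > tol:
        mid = (low + high) / 2
        if lhs(mid) > 1.0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


if __name__ == "__main__":
    for cutoff in (100, 1000, 10000):
        print(cutoff, epidemic_threshold(power_law(2.5, 1, cutoff)))
```

With these assumptions the computed threshold shrinks as the maximum degree grows, which is consistent with the absence of a non-zero threshold in the power-law case.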
3 Peer-to-Peer Networks
In client-server architecture, each computer or process in the network is either a client or a server. A large number of clients request and receive the service from the servers, and a fixed set of servers provides the service to those clients. Peer-to-peer (P2P) networks (shown in Fig. 6) provide a different paradigm of computer networks, where each workstation has equivalent capabilities and responsibilities [26, 6]. P2P networks distribute responsibility among the participants in a network and aggregate the bandwidth of those participants rather than relying on conventional centralized resources. An important advantage in this kind of network is that all clients provide resources, including bandwidth, storage space, and computing power. Thus, as nodes arrive and demand on the system increases, the total capacity of the system also increases. This is not true for a traditional client-server architecture, in which adding more clients could mean slower data transfer for all users. In addition, popular items (like songs, movies) in the network become replicated over multiple peers due to repeated exchange of items, which increases the robustness of the shared items in the face of frequent joining and leaving of peers (termed peer churn).
Fig. 6. Client-server model and P2P model.
Overlay networks. Peers in the P2P networks are typically connected via ad hoc overlay connections. If a participating peer knows the location of another peer in the network, then there is a link from the former node to the latter in the overlay network. Based on how the nodes in the overlay network are linked to each other, the current P2P architecture can be classified into three types [43], centralized, decentralized and structured, and decentralized but unstructured. 1. Centralized: All object index items are kept in a centralized server in the form of object key, node address etc. Each arriving node needs to actively notify this server about its kept object information. Therefore, the querying node only needs to consult the central server to obtain the peer address containing its searched object. In order to download the searched object from the peer, the querying node directly establishes the connection with that peer and downloads the item. This type of P2P architecture is very simple and easy to deploy. But it has the problem of a single point of failure, although we can use several parallel servers. An example of this network type is Napster [31]. 2. Decentralized and structured: A structured P2P network employs a globally consistent protocol to ensure that any node can efficiently route a search query to a peer that has the desired file. Most of the structured P2P networks are based on the distributed hash table (DHT), in which a variant of consistent hashing is used to assign ownership of each file to a particular peer [27]. A DHT is a hash table whose table entries are distributed among different peers located in arbitrary locations. Each data item is hashed to a unique numeric key. Each node is also hashed to a unique ID in the same key space. Each node is responsible for a certain number of keys; that is, the responsible node stores the key and a pointer to the data item with that key. Keys are mapped to their responsible nodes. 
The searching and routing algorithms support two basic operations: lookup(key) and put(key); lookup(k) is used to find the location of the node that is responsible for the key k, and put(k) is used to store a data item (or a pointer to the data item) with the key k in the node responsible for k. It appears that searches in structured systems follow the well-defined neighboring links; henceforth, these systems provide guarantees on finding existing data in bounded overlay hops. However, the strict network structure imposes high overhead for handling dynamicity in P2P networks due to peer churn. Some well-known DHT based structured P2P networks are Chord, Pastry, Tapestry, CAN, and Tulip. 3. Decentralized and unstructured: An unstructured and decentralized P2P network is formed when the overlay links are established arbitrarily. As no special network structure needs to be maintained, unstructured P2P systems are extremely resilient to peer churn. Searching in unstructured networks is often based on flooding or its variation because there is no control over data storage [26]. The main disadvantage with such networks is that the queries may not always be resolved. Popular content is likely to
be available at several peers, but if a peer is looking for rare data shared by only a few other peers, then it is highly unlikely that the search will be successful [10]. Since there is no correlation between a peer and the content managed by the peer, there is no guarantee that flooding will find a peer that has the desired data. However, due to the high dynamicity of peers, robustness is given the topmost priority. Most of the popular P2P networks such as Gnutella and FastTrack are unstructured in nature [14]. In addition, superpeer topologies have also emerged as the most influencing unstructured networks. Here some peers, called dominating nodes or superpeers, serve the search request of other regular peers [39, 46]. Most of the commercial systems like KaZaA, Skype have adopted superpeers in their design. In these systems, superpeer nodes with higher bandwidth and connectivity connect to each other, forming the upper level in the network hierarchy. Each superpeer node provides service to a set of regular peers which form the lower level of the network hierarchy. 3.1 Peer-to-Peer Search Schemes Searching is one of the most important services and utilities provided by the P2P networks where users try to locate the desired object in the network. Existing P2P systems support the simple object lookup by key or identifier. Some existing P2P systems can handle more complex keyword queries, which find documents containing keywords in queries. Searching techniques are primarily forwarding based. Starting with the requesting node, a query is forwarded or routed until the node which has the desired object is reached. To forward query messages, each node must keep information about some other nodes called neighbors. The information of these neighbors constitutes the routing table of a node. 
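For the structured (DHT-based) networks described earlier, the lookup(key) and put(key) operations can be sketched with a toy consistent-hashing ring. This is a minimal sketch under stated assumptions: the successor rule (a key belongs to the first peer whose ID is not smaller than the key ID, wrapping around) is borrowed from Chord-style designs, the 16-bit key space and the class `ToyDHT` are hypothetical, and `put` takes an explicit location argument for illustration. Real systems also route the query hop by hop through per-node routing tables, which is omitted here.

```python
import hashlib
from bisect import bisect_left

KEY_BITS = 16  # illustrative size of the shared key space


def hash_id(name):
    """Hash a peer name or a data key into the same numeric key space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << KEY_BITS)


class ToyDHT:
    """Toy hash table whose entries are spread over a set of named peers."""

    def __init__(self, peer_names):
        # Peers sorted by their hashed ID form the ring.
        self.ring = sorted((hash_id(name), name) for name in peer_names)
        self.store = {name: {} for name in peer_names}

    def lookup(self, key):
        """Return the peer responsible for `key` (successor rule, assumed)."""
        ids = [peer_id for peer_id, _ in self.ring]
        index = bisect_left(ids, hash_id(key)) % len(self.ring)
        return self.ring[index][1]

    def put(self, key, location):
        """Store a pointer to the data item at the responsible peer."""
        owner = self.lookup(key)
        self.store[owner][key] = location
        return owner


if __name__ == "__main__":
    dht = ToyDHT(["peer-a", "peer-b", "peer-c", "peer-d"])
    owner = dht.put("song.mp3", "peer-b:/music/song.mp3")
    print(owner, dht.lookup("song.mp3"))
```

Because peers and data items are hashed into the same key space, any node can compute which peer is responsible for a key, which is what allows the bounded-hop guarantees mentioned for structured systems.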
The desired features of searching algorithms in P2P systems include high-quality query results, minimal query packet overhead, high routing efficiency, load balance, resilience to node failures, and support of complex queries. The quality of query results is application dependent. Generally, it is measured by the number of results and relevance. The query packet overhead signifies the amount of packets generated in the network to satisfy a specific search query. The routing efficiency is generally measured by the number of overlay hops per query. Different searching techniques make different trade-offs between these desired characteristics. Searching in structured P2P networks follows the well-defined neighboring links to locate some specific object. This provides guarantees on finding existing data and bounds data lookup efficiency in terms of the number of overlay hops. But it shows poor performance in the dynamic condition where peers join and leave the network quite frequently. Searching in the unstructured P2P systems is more challenging, as the overlay network does not follow any structure dependent on the data storage. Searching techniques in unstructured networks can be classified as either flooding based or random walker
based. Broadly, flooding-based techniques are the fastest but the least efficient in terms of overhead, whereas random-walk-based schemes have low overhead but are the slowest. The two families therefore lie at opposite ends of the efficiency/speed spectrum. The following sections describe flooding techniques and their variations, and then the random-walk-based techniques.

3.1.1 Flooding-Based Search Techniques

Searching in unstructured P2P networks is often based on flooding or its variations because there is no control over the location of objects. In these techniques, query packets are propagated to all neighbors within a certain radius until the desired object is found. However, blind flooding generates large numbers of redundant query packets, which waste valuable bandwidth and make unstructured P2P systems far from scalable. Controlled flooding schemes such as iterative deepening/expanding ring, informed search, dynamic query-based flooding, LightFlood, and Hurricane flooding try to improve bandwidth utilization.

Iterative deepening. Yang and Garcia-Molina [45] borrowed the idea of iterative deepening (also called the expanding ring) from artificial intelligence. As in ordinary flooding, no node has information about the location of the desired data. The querying node periodically issues a sequence of breadth-first searches (BFSs) with increasing depth limits. The search terminates when the query is satisfied or when the maximum depth limit has been reached.

LightFlood. The LightFlood technique [17] not only retains the merits of pure flooding, but also eliminates most of the redundant messages caused by pure flooding. Thus, LightFlood greatly enhances the scalability of Gnutella-style P2P systems.
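As an aside on the iterative deepening scheme described above, the following is a minimal sketch rather than the scheme of [45] itself. It assumes an overlay stored as a dictionary of neighbor lists, a hypothetical set `holders` of nodes that own the object, and a depth schedule `depths` chosen purely for illustration. Each round is a TTL-limited flood, and every forwarded copy is counted as one message, including redundant copies that are dropped on arrival.

```python
from collections import deque


def flood(graph, source, holders, ttl):
    """TTL-limited flooding; returns (nodes that hold the object, messages sent)."""
    visited = {source}
    frontier = deque([(source, ttl)])
    hits = set()
    messages = 0
    while frontier:
        node, remaining = frontier.popleft()
        if node in holders:
            hits.add(node)
        if remaining == 0:
            continue  # TTL exhausted: this node does not forward
        for neighbor in graph[node]:
            messages += 1  # every forwarded copy costs one message
            if neighbor not in visited:  # duplicate copies are dropped on arrival
                visited.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits, messages


def iterative_deepening(graph, source, holders, depths=(1, 2, 4)):
    """Repeat the flood with growing depth limits until a holder is found."""
    total = 0
    for depth in depths:
        hits, messages = flood(graph, source, holders, depth)
        total += messages
        if hits:
            return hits, depth, total
    return set(), None, total


# Illustrative overlay: a path 0-1-2-3-4 with the object stored at node 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(iterative_deepening(path, 0, {3}))
```

The total message count accumulates over all rounds, which is the price iterative deepening pays for avoiding a single deep flood when the object happens to be nearby.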
The design of LightFlood is motivated by two observations: first, the majority of redundant messages are generated within high hops; second, the network coverage growth rates in low hops are much higher than those within high hops. Thus, the LightFlood scheme is divided into two stages. In the first stage, the messages are allowed on their low hops to be flooded by pure flooding (by giving a small time to live (TTL) number). Those peers reached on the last hop of pure flooding (TTL = 0) become seeds, from which the flooding is initiated for the second stage. The initial pure flooding ensures that a considerable number of seeds are dispersed across the overlay with a small number of redundant messages. The next stage of flooding ensures that most redundant messages caused by pure flooding within the rest of its hops are eliminated. The integration of these two stages retains the advantages of pure flooding: low latency, high coverage, and high reliability. Hurricane flooding. In Hurricane flooding [21], the source of a search cautiously but exponentially expands its search horizon in a spiral pattern. Like the expanding ring algorithm, Hurricane flooding increases the scope of flooding after each round. The source peer divides its neighbors into several
groups of approximately the same size. The source sends query packets to its neighbors in the first group, starting the first round of flooding. These neighbors faithfully broadcast the query packets (but not back to the source). The source also sets a limit on the scope of these broadcast query packets, e.g., by using a TTL value. The first round of flooding may have a very narrow scope, with a small TTL, and may not return the desired result. The source then sends query packets to its neighbors in the second group, with a larger limit on the scope of the flooding. This process repeats until the source obtains the desired result. It has been shown that Hurricane flooding reduces the search cost to arbitrarily close to a lower bound for any search algorithm and bounds the search latency by a logarithmic function of the location of the target.

3.1.2 Random-Walk-Based Search Techniques

Random walk is a popular alternative to flooding for locating resources in P2P networks when network bandwidth is scarce. In the standard random walk algorithm [25], the querying node forwards the query message to one randomly selected neighbor with some specific TTL value T. When an intermediate node receives the random walker, it checks whether it has the resource. If it does not, it checks the TTL field: if T > 0, it decrements T by 1 and forwards the query to a randomly chosen neighbor; if T = 0, the query message is dropped. If, on the other hand, the intermediate node has the resource, the query is not forwarded and a reply is sent back to the querying node. This random walk technique greatly reduces the message overhead but causes a longer search delay. In the k-walker random walk algorithm [26], k walkers are deployed by the querying node to search for the desired item. That is, the querying node forwards k copies of the query message to k randomly selected neighbors.
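The random walk rules above can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the implementation of [25] or [26]: the overlay is a dictionary of neighbor lists with no isolated nodes, `holders` is a hypothetical set of nodes that own the resource, walkers are simulated one after another although they would run concurrently in a real system, and the first hop of each walker is drawn independently, so two walkers may start at the same neighbor.

```python
import random


def random_walk(graph, start, holders, ttl, rng):
    """One walker; returns (holder reached or None, messages forwarded)."""
    node, messages = start, 0
    while True:
        if node in holders:
            return node, messages  # resource found: stop forwarding
        if ttl == 0:
            return None, messages  # TTL exhausted: the query is dropped
        ttl -= 1
        node = rng.choice(graph[node])  # forward to one random neighbor
        messages += 1


def k_walker_search(graph, source, holders, k, ttl, seed=None):
    """Deploy k walkers from the querying node; returns (holders found, total messages)."""
    rng = random.Random(seed)
    found, total = set(), 0
    for _ in range(k):
        first_hop = rng.choice(graph[source])
        hit, messages = random_walk(graph, first_hop, holders, ttl, rng)
        total += 1 + messages  # one message for the initial forward
        if hit is not None:
            found.add(hit)
    return found, total


# Illustrative overlay: a ring of four peers.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(k_walker_search(ring, 0, {2}, k=2, ttl=3, seed=1))
```

With this accounting a search never costs more than k(1 + T) messages, which makes the trade-off between k, T, and the success rate easy to explore numerically.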
Each query message takes its own random walk, and each walker stops when it reaches a node holding the resource or when its TTL value reaches zero. In this way, the k-walker random walk algorithm attempts to reduce the routing delay by a factor of k. However, an arbitrary increase in the number of walkers results in a significant increase in redundant visits in the initial stage, which increases the message overhead. In fact, the performance of the k-walker random walk largely depends on the choice of k and TTL. Intuitively, the average number of nodes that must be probed to discover a resource is inversely proportional to the popularity of the resource. Choosing low values of k and TTL when searching for a resource with low popularity would result in a low success rate and high delays; choosing high values of k and TTL when searching for a resource with high popularity would result in excessive overhead. Thus, the parameters of the random walk must be chosen according to the popularity of the resource being searched for. The popularity of a resource may not be known a priori at the querying node. In addition, the popularity may change due to the arrival/departure of nodes,
replication/deletion/exhaustion of resources, or other random changes in the network. Thus, the parameters of the random walk must be set in an adaptive manner. The modified random BFS technique [22] is a modification of the k-walker random walk scheme that reduces unnecessary message overhead. Here the querying node forwards the query to a randomly selected subset of its neighbors. On receiving a query message, each neighbor forwards the query to a randomly selected subset of its own neighbors, excluding the node from which the query arrived. This procedure continues until the query stop condition is satisfied. This approach is expected to visit more nodes and to have a higher query success rate than the k-walker random walk. Some hybrid schemes have also been developed [25] as a compromise between flooding and random walks. One of them uses local flooding until exactly K (predefined) new outer nodes have been discovered. Then, each of the K nodes initiates an independent random walk.

Gradient-based search in scale-free networks. Recent measurements of Gnutella networks [7] and simulated Freenet networks [18] have shown that their topological structure follows a power-law degree distribution. Adamic et al. [1] proposed a message-passing algorithm that can be used to search efficiently in scale-free networks such as Gnutella. Random walks in scale-free networks naturally gravitate towards the high degree nodes, but even better coverage is achieved by intentionally choosing high degree nodes. Adamic et al. [1] showed analytically that if the node of highest degree is visited first and the search then proceeds down the degree sequence, a significant portion of the network can be covered very quickly. In the proposed algorithm, the walker approximately follows the degree sequence across the entire scale-free network with an exponent close to 2 (2.0 < γ < 2.3).
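The idea of deliberately steering a walker towards high degree nodes can be illustrated as follows. This is a simplified sketch and not the algorithm of [1]: it assumes that every node knows the degrees of its neighbors, that a node only checks its own content, and that the walker backtracks through already visited nodes when no fresh neighbor is left. The function name, the `holders` set, and the example overlay are all hypothetical.

```python
def degree_seeking_walk(graph, source, holders, max_steps):
    """Greedy walk that always moves to the unvisited neighbor of highest degree.

    Returns (path taken, True if a holder was reached within max_steps).
    """
    path, seen, node = [source], {source}, source
    for _ in range(max_steps):
        if node in holders:
            return path, True
        fresh = [nb for nb in graph[node] if nb not in seen]
        pool = fresh or list(graph[node])  # backtrack if every neighbor was seen
        if not pool:
            break  # isolated node: nowhere to go
        node = max(pool, key=lambda nb: len(graph[nb]))
        seen.add(node)
        path.append(node)
    return path, node in holders


# Illustrative overlay: hub 0 with leaves 2, 3, 4 and a two-hop tail 0-1-5.
hub = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}
print(degree_seeking_walk(hub, 5, {4}, max_steps=10))
```

Starting from the peripheral node 5, the walker climbs to the hub in two steps and then works through the hub's remaining neighbors, which mirrors the short initial climb followed by a descent through the degree sequence.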
At each step, the random walker chooses a node with a degree higher than the current node, quickly finding the highest degree node. Once the highest degree node has been visited, it will be avoided, and a node of approximately second highest degree will be chosen. Effectively, after a short initial climb, one goes down the degree sequence. This is the most efficient way to do this kind of sequential search, visiting highest degree nodes in sequence. These algorithms are completely decentralized and exploit the power-law link distribution in the node degree. The paper demonstrates that the search algorithms work well on real Gnutella networks, scale sublinearly with the number of nodes, and may help to reduce the network search traffic that tends to cripple such networks. 3.2 Topological Dynamics and Stability of Superpeer Networks From the point of view of topological dynamics, P2P networks exhibit similar behavior to that of the Internet. However, the special superpeer topology exhibited by many commercial P2P networks makes the outcome of the dynamics different from that of the Internet (mainly scale free networks).
A superpeer network can be modeled by a bimodal degree distribution, where a small fraction of nodes are superpeers with high degree and a large fraction of nodes are low degree peers [28]. Formally, the degree distribution pk of a superpeer network can be specified as pk > 0 if k = kl or k = km, and pk = 0 otherwise, where kl and km are the degrees of peers and superpeers respectively. Moreover, there are some differences in the dynamics of P2P networks and the Internet. We explain the different kinds of peer dynamics and then illustrate the outcomes in each case. Peers in a P2P system join and leave the network randomly without any central coordination. This is termed peer churn. In addition, important peers are targeted for attack [38]. All these peer dynamics can be modeled by different kinds of node removal schemes in a random graph.

1. Random failure: Peer churn can be modeled by random removal of nodes from the graph. This is the simplest model of churn, and the probability of removal of a node is independent of its degree.
2. Degree-dependent failure: Peers having higher connectivity are more stable in the network than peers having lower connectivity, because loosely connected peers enter and leave the network quite frequently. This observation leads us to model churn in a more realistic manner, where the probability of removal of a node is inversely proportional to the degree of that node.
3. Degree-dependent attack: In case of attack, the nodes having higher degrees are more likely to be removed from the network.

Let fk, the probability that a node of degree k is removed, model the different node removal schemes. In the following we consider a unified churn/attack model of the form fk = C k^γ, where γ is a parameter called the attack exponent and C is a constant. The different node removal schemes can be realized from this unified model just by changing the parameter γ.

1. Random failure: For γ = 0, fk = C, i.e., the probability of removal of a node is independent of the degree of the node.
2. Degree-dependent failure: For γ < 0, the probability of removal of a node of degree k decreases with the degree of the node, i.e., fk ∝ 1/k^|γ|.
3. Degree-dependent attack: For γ > 0, the probability of removal of a node of degree k increases with the degree of the node, i.e., fk ∝ k^γ.

3.2.1 Outcomes

Next we illustrate the impact of different peer dynamics on the stability of superpeer networks. Peer churn has been modeled by random failure and degree-dependent failure, and attack has been modeled by degree-dependent attack.
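To make the unified removal model concrete, the sketch below evaluates fk = C k^γ on a bimodal degree distribution. It is only an illustration under stated assumptions: the values of C, γ, kl, km, and the peer fraction are arbitrary, the clipping of fk to the interval [0, 1] is added here so that fk stays a valid probability, and the function names are hypothetical.

```python
def removal_probabilities(degrees, gamma, c):
    """f_k = C * k**gamma for each degree k, clipped to [0, 1] (an added assumption)."""
    return {k: min(1.0, c * k ** gamma) for k in degrees}


def fraction_removed(p_k, gamma, c):
    """Expected fraction of nodes removed: the sum over k of p_k * f_k."""
    f_k = removal_probabilities(p_k, gamma, c)
    return sum(p_k[k] * f_k[k] for k in p_k)


# Illustrative bimodal network: 90% peers of degree 3, 10% superpeers of degree 30.
bimodal = {3: 0.9, 30: 0.1}
print(removal_probabilities(bimodal, 0, 0.2))   # random failure
print(removal_probabilities(bimodal, -1, 0.3))  # degree-dependent failure
print(removal_probabilities(bimodal, 1, 0.01))  # degree-dependent attack
```

Changing only the sign of γ moves the removal pressure from the peers (γ < 0) to the superpeers (γ > 0), while γ = 0 treats both classes alike.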
[Figure 7: percolation threshold fr versus fraction of peers r; theoretical and simulation curves for ⟨Ksp⟩ = 30 and ⟨Ksp⟩ = 50.]
Fig. 7. The impact of random failure upon the stability of superpeer networks.
Random failure. The analysis in [30] shows that superpeer networks are quite robust against churn (Fig. 7). Since churn affects peers and superpeers in proportion to their fractions in the network, peers are affected much more than superpeers. The removal of a significant number of low degree peers along with a few high degree superpeers has little impact upon the stability of the network. Practical experience also confirms that superpeer networks exhibit high robustness in the face of churn. Another significant observation is that a low fraction of superpeers in the network (specifically, below 5%) results in a sharp fall in the percolation threshold; that is, the vulnerability of the network increases drastically when the fraction of superpeers is below 5%.

Degree-dependent failure. Figure 8 shows that as the superpeer degree km increases, the critical attack exponent γc at which the network percolates decreases. This increases the fraction of superpeers that must be removed to break down the network. Since the increase of km increases the fraction of peers r, the removal of most of the low degree peers along with a fraction of superpeers increases the percolation threshold fd. It is also interesting to observe that the percolating γc remains quite low, less than 0.1, for the entire range of km. The reason is that small values of γc result in the removal of a higher fraction of superpeer nodes from the network. Since degree-dependent failure mainly removes the lower degree nodes, whose removal does little to break the network down, removal of a significant number of superpeers becomes necessary.

Degree-dependent attack. Reference [29] analyzes the behavior of superpeer networks against degree-dependent attack, where kl and km are the degrees of peers and superpeers respectively and r is the fraction of peers in the network. In [29],
[Figure 8, two panels: critical attack exponent γc (left) and percolation threshold fd (right) versus superpeer degree km, with theoretical, simulation, and fitted curves.]
Fig. 8. Change in critical attack exponent γc and percolation threshold fd with respect to superpeer degree km for superpeer networks undergoing degree-dependent failure. Here the mean degree ⟨k⟩ varies from 8 to 16. The x-axis represents the superpeer degree (km) and the y-axis represents the corresponding γc and fd.
the authors have established the critical condition for the stability of the network against degree-dependent attack:

\[ r\,k_l^{\gamma+1}(k_l - 1) + (1 - r)\,k_m^{\gamma+1}(k_m - 1) \;\ge\; k_m^{\gamma}\bigl(\langle k\rangle (k_m + k_l) - k_m - 2\langle k\rangle\bigr). \tag{4} \]
The inequality gives the set of solutions for the critical exponent γc and subsequently the normalizing constant C, which determines the fraction of peers and superpeers to be attacked. The nature of the solution set Sγc of the inequality has a profound impact upon the fraction of peers and superpeers required to be removed and the percolation threshold fc . The breakdown of the network can be due to one of the following three situations. Case A. Removal of all the superpeers along with a fraction of peers. Networks having bounded solution set Sγc where 0 ≤ γc ≤ γcbd exhibit this kind of behavior at the maximum value of the solution γc = γcbd . Case B. Removal of only a fraction of superpeers. Networks having unbounded solution set Sγc where 0 ≤ γc ≤ +∞ exhibit this kind of behavior as γc → ∞. Case C. Removal of some fraction of both superpeers and peers. Intermediate critical exponent γc ∈ Sγc signifies the fractional removal of both peers and superpeers. Figure 9 shows that solution set Sγc of the networks up to a threshold superpeer fraction spth (spth = 0.19 and 0.41 for kl = 3 and kl = 4 respectively) remains bounded. Hence, the removal of all the superpeers is necessary to disintegrate the network along with a fraction of the peers (Fig. 9). It also represents some instances of case B where only some fraction of superpeers are needed to be removed.
[Figure 9, two panels, both plotted against the superpeer fraction for peer degrees kl = 3 and kl = 4: the boundary critical exponent γcbd (left); the percolation threshold fc, the peer fraction removed fp, and the superpeer fraction removed fsp (right).]
Fig. 9. Impact of degree-dependent attack on superpeer networks. Behavior of γcbd and percolation threshold due to the change of superpeer fraction is shown.
4 Conclusion

In this chapter, we have presented a comprehensive study of various aspects of technological networks. We have considered two different technological networks: the Internet and P2P networks. The protocols used in the Internet have been discussed briefly along with their services. Empirical studies of the topological properties of the Internet, such as scale invariance and the small-world property, have been described. The fault tolerance of the Internet has been discussed in the light of general stability analysis. The spread of computer viruses has been modeled by network-aware epidemic models. We have also shed some light on recent advancements in, and classifications of, P2P networks. As search is one of the most important services provided by P2P systems, different search techniques and their comparative study have been provided. The stability of P2P networks in the face of churn and attack has also been discussed as a continuation of the Internet fault tolerance.

The advancement of the Internet has also posed some serious challenges to the network research community. One of the significant problems is modeling the widely varying Internet traffic. An appropriate model of the Internet is often useful for measuring the efficiency of routing algorithms and the quality of service (QoS) of different web applications. Maintaining a specific QoS in a faulty environment is another major research issue. There is always substantial uncertainty when making network management decisions. A decision maker is limited not only by having partial information, a consequence of decentralized control, but also by the impossibility of predicting future traffic demand and network topology status. Hence, managing the large-scale Internet is also a non-trivial issue.
Understanding the assortative or disassortative relation among different participating nodes and their impact upon the complex structural properties is also a major research problem.
Advancements in P2P networks also raise some issues regarding security and trust. The P2P philosophy is based upon the cooperative nature of the participating peers. However, it has been found that in Gnutella networks, as many as 65% of the nodes do not contribute resources, but free-ride on other peers’ resources. Hence, the problem of selfish peers and free riders is a serious threat to the performance of any P2P system. Development of low overhead trust-aware protocols to ensure trust among the peers is necessary to enhance the utility of P2P networks. Understanding the self-organizing features, evolution, and scalability of superpeer networks is also interesting and necessary.
References

1. L. A. Adamic, R. M. Lukose, A. R. Puniyani, B. A. Huberman, Search in power-law networks, Physical Review E, 64, 046135, 2001.
2. R. Albert, H. Jeong, A.-L. Barabasi, Diameter of the world wide web, Nature, 401, 130–131, 1999.
3. R. Albert, H. Jeong, A.-L. Barabasi, Error and attack tolerance of complex networks, Nature, 406, 2000.
4. N. Berger, C. Borgs, T. Chayes, A. Saberi, On the spread of viruses on the Internet, Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA), 301–310, 2005.
5. T. Bu, D. Towsley, On distinguishing between Internet power law topology generators, Proceedings of INFOCOM, New York, NY, USA, 2002.
6. D. Clark, Face-to-face with peer-to-peer networking, IEEE Computer, 34 (1), pp. 18–21, January 2001.
7. Clip2 Company, Gnutella. http://www.clip2.com/gnutella.html.
8. R. Cohen, K. Erez, D. ben-Avraham, S. Havlin, Resilience of the Internet to random breakdown, Physical Review Letters, 85 (21), 2000.
9. R. Cohen, K. Erez, D. ben-Avraham, S. Havlin, Resilience of the Internet under intentional attack, Physical Review Letters, 86 (16), 2001.
10. Q. Deng, H. Lv, Analyzing unstructured peer-to-peer search networks with QIL, Proceedings of the IEEE International Conference on Services Computing, pp. 547–550, Shanghai, China, 2004.
11. P. Erdős, A. Rényi, On random graphs I, Publicationes Mathematicae Debrecen, 6, 290–297, 1959.
12. M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the Internet topology, Computer Communications Review, 29, 251–262, 1999.
13. C. Fuchs, The Internet as a self-organizing socio-technological system, Cybernetics and Human Knowing, 12 (31), pp. 37–81, 2005.
14. Gnutella: www.gnutellaforums.com.
15. R. Govindan, H. Tangmunarunkit, Heuristics for Internet map discovery, Proceedings of IEEE Infocom, 2000.
16. C. Griffin, R. Brooks, A note on the spread of worms in scale-free networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Feb. 2006.
17. L. Guo, S. Jiang, X. Zhang, H. Wang, LightFlood: Minimizing redundant messages and maximizing scope of peer-to-peer search, IEEE Transactions on Parallel and Distributed Systems (TPDS), 19 (5), pp. 601–614, May 2008.
18. T. Hong, in Peer-to-Peer: Harnessing the Benefits of a Disruptive Technology, Andy Oram (ed.), O’Reilly, Sebastopol, CA, Chap. 14, pp. 203–241, 2001.
19. C. Hunt, TCP/IP Network Administration, Second Edition, O’Reilly Networking, December 1997.
20. S. Jin, A. Bestavros, Small-world Internet topologies: possible causes and implications on scalability of end-system multicast, Boston University, Technical Report BUCS-TR-2002-004, January 2002.
21. S. Jin, H. Jiang, Novel approaches to efficient flooding search in peer-to-peer networks, Computer Networks: The International Journal of Computer and Telecommunications Networking, 51 (10), pp. 2818–2832, July 2007.
22. V. Kalogeraki, D. Gunopulos, D. Zeinalipour-Yazti, A local search mechanism for peer-to-peer networks, Proc. of the 11th ACM Conference on Information and Knowledge Management (ACM CIKM02), 2002.
23. J. O. Kephart, A biologically inspired immune system for computers, Artificial Life IV: Proceedings of the Fourth International Workshop on the Synthesis and Simulation of Living Systems, Cambridge, MA, July 1994.
24. J. M. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, The Web as a graph: Measurements, models and methods, in Proceedings of the International Conference on Combinatorics and Computing, Lecture Notes in Computer Science, pp. 1–18, Springer, Berlin, 1999.
25. X. Li, J. Wu, Searching techniques in peer-to-peer networks, Handbook of Theoretical and Algorithmic Aspects of Sensor, Ad Hoc Wireless and Peer-to-Peer Networks, CRC Press, Ann Arbor, MI, 2005.
26. Q. Lv, P. Cao, E. Cohen, K. Li, S. Shenker, Search and replication in unstructured peer-to-peer networks, ACM International Conference on Supercomputing, New York, USA, 2002.
27. G. Manku, Routing networks for distributed hash tables, Proceedings of the Twenty-Second Annual ACM Symposium on Principles of Distributed Computing, Boston, Massachusetts, pp. 133–142, 2003.
28. B. Mitra, F. Peruani, S. Ghose, N. Ganguly, Analyzing the vulnerability of superpeer networks against attack, 14th ACM Conference on Computer and Communications Security, Alexandria, USA, 29 Oct–2 Nov, 2007.
29. B. Mitra, Md. M. Afaque, S. Ghose, N. Ganguly, Developing analytical framework to measure robustness of peer-to-peer networks, 8th International Conference on Distributed Computing and Networking - ICDCN 2006 (formerly IWDC), December 27–30, 2006, IIT Guwahati, India.
30. B. Mitra, S. Ghose, N. Ganguly, Effect of dynamicity on peer-to-peer networks, 14th International Conference on High Performance Computing, Goa, India, 19–22 December 2007.
31. Napster: http://www.napster.com/.
32. R. Pastor-Satorras, A. Vázquez, A. Vespignani, Dynamical and correlation properties of the Internet, Physical Review Letters, 87, 258701, 2001.
33. R. Pastor-Satorras, A. Vespignani, Epidemics and immunization in scale-free networks, in S. Bornholdt and H. G. Schuster (eds.), Handbook of Graphs and Networks, Wiley-VCH, Berlin, 2003.
34. R. Pastor-Satorras, A. Vespignani, Epidemic dynamics in finite size scale-free networks, Physical Review E, 65, 035108, 2002.
35. R. Pastor-Satorras, A. Vespignani, Epidemic dynamics and epidemic states in complex networks, Physical Review E, 63, 066117, 2001.
36. R. Pastor-Satorras, A. Vespignani, Epidemic spreading in scale-free networks, Physical Review Letters, 86, 3200–3203, 2001.
37. K. Patch, Internet stays small world, Technology Research News, 2003.
38. B. Pretre, Attacks on peer-to-peer networks, Ph.D. thesis, Swiss Federal Institute of Technology (ETH) Zurich, 2005.
39. Y. J. Pyun, D. S. Reeves, Constructing a balanced, log(N)-diameter super-peer topology, Proceedings of the 4th International Conference on Peer-to-Peer Computing, Zurich, Switzerland, August 2004.
40. K. Singh, H. Schulzrinne, Peer-to-peer Internet telephony using SIP, Columbia University Technical Report CUCS-044-04, New York, NY, October 2004.
41. P. Szor, The Art of Computer Virus Research and Defense, Symantec Press, Indianapolis, IN, 2005.
42. A. Vazquez, R. Pastor-Satorras, A. Vespignani, Large-scale topological and dynamical properties of the Internet, Physical Review E, 65, 066130, 2002.
43. C. Wang, B. Li, Peer-to-Peer Overlay Networks: A Survey, Department of Computer Science, The Hong Kong University of Science and Technology, Technical Report, 2003.
44. D. J. Watts, S. H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature, 393, 440–442, 1998.
45. B. Yang, H. Garcia-Molina, Improving search in peer-to-peer networks, Proc. of the 22nd IEEE International Conference on Distributed Computing (IEEE ICDCS02), 2002.
46. B. Yang, H. Garcia-Molina, Designing a super-peer network, Proceedings of the International Conference on Data Engineering (ICDE), Los Alamitos, CA, March 2003.
47. S. Yook, H. Jeong, Y. Tu, A. L. Barabasi, Weighted evolving networks, Phys. Rev. Lett., 86, 5835, 2001.
Advances in the Theory of Complex Networks

Fernando Peruani (1, 2)

(1) CEA-Service de Physique de l’Etat Condensé, Centre d’Etudes de Saclay, 91191 Gif-sur-Yvette, France
(2) Institut des Systèmes Complexes de Paris Île-de-France, 57/59, rue Lhomond F-75005 Paris, France; [email protected]

1 Introduction

An exhaustive and comprehensive review of the theory of complex networks would nowadays be a titanic task, and it would result in a lengthy work containing plenty of technical details of arguable relevance. Instead, this chapter addresses very briefly the ABC of complex network theory, visiting only the hallmarks of the theoretical foundations, to finally focus on two of the most interesting and promising current research problems: the study of dynamical processes on transportation networks and the identification of communities in complex networks.
N. Ganguly et al. (eds.), Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, DOI: 10.1007/978-0-8176-4751-3_16, © Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

2 The ABC of Complex Networks

A network or a graph is a set of interconnected nodes (or vertices). Nodes are connected through edges; an edge represents a link between two nodes. More than one edge can run between two vertices. Alternatively, an edge can have a number associated with it denoting its importance or weight. Edges can be directed or undirected. A directed edge from a node A to a node B symbolizes that, for example, node A “speaks” to node B, while the opposite is not possible. Undirected edges, on the other hand, are completely symmetric. This review deals exclusively with undirected edges. For a comprehensive review of the theory of complex networks we refer the reader to [1, 2].

2.1 Network Characterization

2.1.1 Degree Distribution

A network can be characterized in many ways. For example, we could measure the mean degree of the network $\langle k\rangle$. Here, $k$ stands for one of the most relevant properties of a node, its degree, which indicates the number of edges attached to it, and $\langle\ldots\rangle$ denotes the average over all nodes of the system. Though $\langle k\rangle$ is a useful and informative quantity, by itself it cannot characterize the structure of the network, and typically a good characterization also requires higher order moments, such as $\langle k^2\rangle$, $\langle k^3\rangle$, etc. How many moments do we need to know to unequivocally characterize the network? All the information about the moments is contained in the degree probability distribution of the network $p_k$ (see Fig. 1), which is the probability of picking a node at random and observing that its degree is $k$. The moments are computed as $\langle k^n\rangle = \sum_k k^n p_k$. If the network is such that the vertices (nodes) are statistically independent, that is, the connections are completely at random, then the degree probability distribution unequivocally determines the properties of the network. If this is not the case and there are correlations among nodes, the characterization of the network will require the use of a degree-degree probability distribution, or an even higher n-point probability distribution, etc. Let us assume for the moment that vertices are statistically independent. There are three types of degree distributions which, due to their ubiquity and simplicity, deserve special mention: a) the Poisson distribution, defined as $p_k = e^{-\langle k\rangle}\langle k\rangle^k/k!$, which is the degree distribution of a classical random graph; b) the exponential distribution, defined as $p_k \sim e^{-k/\langle k\rangle}$ (see Fig. 1(a)); and c) the power-law distribution (see Fig. 1(b)), $p_k \sim k^{-\gamma}$ with $\gamma > 0$, for which (in infinite networks) all moments of order $m \ge \gamma - 1$ diverge (for this reason these distributions are referred to as scale-free).

Fig. 1. The figure shows two network topologies: (a) network with exponential degree distribution and (b) network with power-law (scale-free) degree distribution. Figure taken from Ref. [12].
For distributions like a) and b), the first moment of the distribution, i.e., $\langle k\rangle$, unequivocally characterizes the network topology, but in general higher moments are required to unequivocally determine the network topology.
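The role of the moments can be checked numerically. The sketch below is an illustration only: the function names, the truncation of each distribution at a cutoff `k_max`, and the parameter values are assumptions made here. It computes the n-th moment as the sum of k^n p_k, recovers the known first and second moments of a Poisson distribution, and shows that the second moment of a power law with γ = 2.5 keeps growing as the cutoff is raised.

```python
import math


def moment(p_k, n):
    """n-th moment of a degree distribution given as {degree: probability}."""
    return sum((k ** n) * p for k, p in p_k.items())


def poisson_pk(mean, k_max):
    """Poisson degree distribution truncated at k_max (the tail is negligible)."""
    return {k: math.exp(-mean) * mean ** k / math.factorial(k) for k in range(k_max + 1)}


def power_law_pk(gamma, k_max):
    """Normalized power law p_k ~ k**(-gamma) on k = 1..k_max."""
    norm = sum(k ** -gamma for k in range(1, k_max + 1))
    return {k: k ** -gamma / norm for k in range(1, k_max + 1)}


poisson = poisson_pk(4.0, 60)
print(moment(poisson, 1), moment(poisson, 2))
for cutoff in (100, 10000):
    print(cutoff, moment(power_law_pk(2.5, cutoff), 2))
```

For the Poisson case a single number, the mean, fixes every other moment, whereas for the power law the second moment depends on the cutoff and hence on the size of the network.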
2.1.2 Clustering Coefficient

Another important quantity used to characterize the network topology is the clustering coefficient. The clustering coefficient measures the degree of connectivity in the environment close to a node, i.e., the degree of cliquishness of the closest environment of a node. In a more colloquial way, it is an answer to the question: Are my friends also friends of each other? If a node has degree $z$, i.e., $z$ neighbors, and all these $z$ nodes are connected among themselves, there would be $z(z-1)/2$ edges linking them. The clustering coefficient is defined as the ratio between the number $y$ of edges actually connecting the $z$ nearest neighbors and the total number of possible edges between them,

\[ C = \frac{y}{z(z-1)/2}. \tag{1} \]

Logically, a network is associated with a distribution of clustering coefficients; however, typically only the average clustering coefficient is reported, which is a simple estimate of the probability that any pair of neighbors of a given node are also connected to each other. A simple approximation for the average clustering coefficient of Poissonian (or exponential) random networks is given by

\[ C_{\mathrm{rand}} = \frac{\langle k\rangle}{N}. \tag{2} \]

Another definition of the average clustering coefficient extensively used in the literature is given by

\[ C_{\mathrm{triangle}} = \frac{3A}{B}, \tag{3} \]

where $A$ stands for the number of interconnected triplets of nodes, such that each node is connected to the other two nodes (i.e., a triangle), and $B$ is the number of connected triplets, in which each node is connected to at least one of the other two. The factor 3 accounts for the fact that from each triangle three simple triplets can be formed.

2.1.3 Network Diameter

The network diameter is defined as the maximal distance between any pair of nodes.
The above definition strictly works only for connected networks; however, by redefining the diameter as the maximum distance found within any connected component (cluster) of the system, the definition becomes applicable to all kinds of networks. Assuming that the network has a tree-like structure, a rough estimate can be obtained by equating $\langle k \rangle^d$ with $N$:

$$d \sim \frac{\ln N}{\ln \langle k \rangle}. \qquad (4)$$
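The quantities of Eqs. (1), (3), and (4) can be evaluated directly on a small graph. The following is a minimal sketch, not a reference implementation: the toy graph, the function names, and the convention that a node with fewer than two neighbors has $C = 0$ are assumptions made here for illustration.

```python
from collections import deque
from itertools import combinations
from math import log

def local_clustering(adj, node):
    """Eq. (1): C = y / (z (z - 1) / 2); returns 0.0 if z < 2 (assumed convention)."""
    nbrs = adj[node]
    z = len(nbrs)
    if z < 2:
        return 0.0
    y = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return y / (z * (z - 1) / 2)

def transitivity(adj):
    """Eq. (3): 3A / B, with A the number of triangles and B of connected triplets."""
    triplets = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # Every triangle is found once from each of its three corners.
    closed = sum(1 for c in adj for u, v in combinations(adj[c], 2) if v in adj[u])
    triangles = closed // 3
    return 3 * triangles / triplets if triplets else 0.0

def diameter(adj):
    """Largest shortest-path distance, taken over all connected components."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

# Hypothetical toy graph: a triangle 0-1-2 with a tail 2-3-4.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
adj = {i: set() for i in range(5)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

mean_degree = sum(len(n) for n in adj.values()) / len(adj)
estimate = log(len(adj)) / log(mean_degree)   # Eq. (4): a scaling estimate only
```

On such a tiny, non-random graph Eq. (4) is only indicative; it is meant to capture the scaling with $N$ and $\langle k \rangle$, not the exact value.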
It has been shown that Eq. (4) predicts the correct scaling of $d$ with $N$ and $\langle k \rangle$ for random networks. Note that when $\langle k \rangle > \ln(N)$, a random network
F. Peruani
has a high probability of being totally connected [1]. The concept of network diameter is closely related to another important quantity, the average path length, which is the average distance between any pair of nodes.

2.1.4 Network Spectrum

The network topology can also be studied through the adjacency matrix $A$, an $N \times N$ symmetric matrix whose elements $A_{i,j}$ represent the connections among the nodes of the network. If nodes $i$ and $j$ are connected, then $A_{i,j} = 1$; otherwise, $A_{i,j} = 0$. The spectrum of the network is the set of eigenvalues of $A$, and since $A$ has $N$ eigenvalues $\lambda_j$, the spectral density takes the form

$$\rho(\lambda) = \frac{1}{N} \sum_{j=1}^{N} \delta(\lambda - \lambda_j). \qquad (5)$$

In the limit $N \to \infty$, $\rho(\lambda)$ becomes a continuous function. Interestingly, the topology of the network is related to the spectral density through

$$\int d\lambda\, \lambda^k \rho(\lambda) = \frac{1}{N} \sum_{j} (\lambda_j)^k = \frac{1}{N} \sum_{i_1, i_2, \ldots, i_k} A_{i_1,i_2} A_{i_2,i_3} \cdots A_{i_k,i_1}. \qquad (6)$$
Equation (6) relates the $k$-th moment of the spectral density to the number of paths of length $k$ that return to the same node in the network. One of the most remarkable results connected to this kind of approach is Wigner's law, which applies to infinite random networks with a connectivity $p \sim N^{-\xi}$. When $0 < \xi < 1$, Wigner's law predicts that the spectral density is a semicircular distribution,

$$\rho(\lambda) = \frac{\sqrt{4Np(1-p) - \lambda^2}}{2\pi N p(1-p)} \quad \text{for } |\lambda| < 2\sqrt{Np(1-p)},$$

and vanishes for larger values of $|\lambda|$, except for the principal eigenvalue, which is isolated from the bulk and increases with network size. For $\xi > 1$ the spectral density deviates from Wigner's law and its odd moments vanish (i.e., $\int d\lambda\, \lambda^{2m+1} \rho(\lambda) = 0$), indicating that the only way to come back to the original node is to retrace the nodes previously visited, i.e., there are no closed loops [4–9, 26].

2.2 Building a Network

There are equilibrium and non-equilibrium random networks. These terms refer to the way in which the network is grown. In this subsection we briefly review how a network can be built.

2.2.1 Equilibrium Random Networks

Given a fixed number $N$ of nodes and a fixed number $M$ of edges, the network is built by taking, for each edge, a randomly selected pair of nodes and inserting an edge between them.
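The construction of an equilibrium random network, and the right-hand side of Eq. (6), can be sketched in a few lines. This is an illustrative sketch under assumptions not stated in the text: self-loops and repeated edges are rejected, and the function names are hypothetical. The closed-walk count is obtained from plain matrix powers rather than from the eigenvalues.

```python
import random

def equilibrium_network(n_nodes, n_edges, seed=0):
    """Insert n_edges edges between randomly chosen pairs of distinct nodes.

    Assumption: self-loops and repeated edges are rejected, so n_edges must not
    exceed n_nodes * (n_nodes - 1) / 2.
    """
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        i, j = rng.sample(range(n_nodes), 2)
        edges.add((min(i, j), max(i, j)))
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A

def closed_walks_per_node(A, k):
    """Right-hand side of Eq. (6): (1/N) * trace(A^k)."""
    n = len(A)
    P = [row[:] for row in A]
    for _ in range(k - 1):
        P = [[sum(P[i][l] * A[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
    return sum(P[i][i] for i in range(n)) / n

A = equilibrium_network(n_nodes=30, n_edges=60)
mean_degree = sum(map(sum, A)) / len(A)
second_moment = closed_walks_per_node(A, 2)   # equals the mean degree
third_moment = closed_walks_per_node(A, 3)    # 6 * (number of triangles) / N
```

For $k = 2$ the only closed paths go along an edge and back, so the second moment equals $\langle k \rangle = 2M/N$; for $k = 3$ every triangle is counted six times.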
2.2.2 Non-Equilibrium Random Networks

In this case, the network is grown by simultaneously adding vertices and edges. The procedure is as follows.

a) A node is added at each time step.
b) Simultaneously, a pair (or several pairs) of randomly chosen vertices are connected by an edge.

If at some moment the addition of nodes is stopped while the addition of edges continues, the network will tend to an equilibrium. However, the network will never approach the equilibrium state given by equilibrium networks, since the growing process produces a correlation by which 'old' nodes are more connected than 'young' nodes. The only way to achieve an equilibrium network configuration is by also allowing the removal of old edges.

2.2.3 Preferential Attachment

A large number of real-world networks are scale free; i.e., they exhibit a power-law degree distribution. The Barabási–Albert model [33] was the first model that satisfactorily described a non-equilibrium network whose asymptotic degree distribution is a power law. The growth model is as follows. Starting with a small number $m_0$ of nodes, at every time step add a new node with $m$ ($m \le m_0$) edges and link the new node to $m$ different nodes already present in the system according to the following rule: choose each node to which the new node connects with probability $\Pi$ proportional to the node degree $k_i$,

$$\Pi(k_i) = \frac{k_i}{\sum_j k_j}. \qquad (7)$$
The attachment rule described by Eq. (7) is referred to as preferential attachment. After $t$ time steps this procedure produces a network with $N = t + m_0$ nodes and $mt$ edges. Asymptotically in $t$, the degree distribution of the network approaches a power law with exponent $\gamma = 3$. This remarkable fact can be understood according to the following simple continuum theory [37]. Assume that $k_i$ is a continuous variable whose growth rate is proportional to $\Pi(k_i)$; then $k_i$ evolves according to

$$\frac{\partial k_i}{\partial t} = m\Pi(k_i) = \frac{k_i}{2t}, \qquad (8)$$

where the last equality uses $\sum_j k_j = 2mt$. The solution of this equation, with the initial condition that every node $i$ at its introduction (at time $t_i$) has $k_i(t_i) = m$, is $k_i(t) = m(t/t_i)^{\beta}$ with $\beta = 1/2$. Since nodes are added at a constant rate, the introduction times $t_i$ are uniformly distributed, and the cumulative probability takes the form

$$p[k_i(t) < k] = 1 - \frac{m^{1/\beta}\, t}{k^{1/\beta}\,(t + m_0)}. \qquad (9)$$

Taking the derivative of Eq. (9) with respect to $k$, we obtain the degree distribution

$$p_k = \frac{2 m^{1/\beta}\, t}{(m_0 + t)\, k^{1/\beta + 1}}. \qquad (10)$$

In the limit $t \to \infty$, $p_k \sim 2 m^{1/\beta} k^{-\gamma}$ with $\gamma = \beta^{-1} + 1 = 3$.
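The growth rule of Eq. (7) can be simulated with a simple list of edge ends. The sketch below is one possible reading of the model, not the authors' code: it assumes that the $m_0$ initial nodes carry no edges, so that the first new node chooses its targets uniformly at random, and it enforces distinct targets by redrawing. The function name is hypothetical.

```python
import random
from collections import Counter

def barabasi_albert(m0, m, t_steps, seed=0):
    """Grow a network by preferential attachment, Eq. (7).

    Assumption: the m0 initial nodes have no edges, so the first new node picks
    its m targets uniformly (Eq. (7) is undefined while all degrees are zero).
    """
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    edges = []
    # Each node appears in `stubs` once per edge end, so a uniform draw from
    # `stubs` selects node i with probability k_i / sum_j k_j.
    stubs = []
    for step in range(t_steps):
        new = m0 + step
        targets = set()
        if not stubs:
            targets = set(rng.sample(range(m0), m))
        while len(targets) < m:          # redraw until m distinct targets
            targets.add(rng.choice(stubs))
        for old in targets:
            edges.append((new, old))
            stubs.extend((new, old))
    return edges

m0, m, t = 3, 2, 2000
edges = barabasi_albert(m0, m, t)
degree = Counter(node for edge in edges for node in edge)
```

The resulting `degree` counter can be histogrammed and compared with the $k^{-3}$ tail of Eq. (10); initial nodes that were never selected do not appear in it.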
2.3 Network Stability: Breaking Down a Network

A finite network can be formed by many isolated clusters of various sizes, or it can be fully connected with only one giant component. For infinite networks this statement has to be rephrased in the following way. An infinite network can exhibit a giant cluster with an infinite number of nodes contained in it, or on the contrary, all clusters in the system can be finite. If a network exhibits a giant cluster, we say that the network is stable and highly connected. We now review the already classical results on percolation of complex networks [10–14]. Specifically, we follow the method proposed in [10, 11] and extended in [15, 16]. The goal is to find the minimum fraction of nodes that should be removed from a network in order to break down its connectivity. By definition, a network is no longer connected when the initial giant component disappears, i.e., when the biggest cluster of connected nodes in the system is much smaller than the total initial number of nodes.

Let $p_k$ be the network degree distribution, i.e., the probability of finding a randomly chosen vertex with degree $k$, and let $q_k$ be the probability that a node of degree $k$ survives the failure or attack. Correspondingly, $1 - q_k$ is the probability that a node of degree $k$ is removed. In consequence, $p_k q_k$ represents the fraction of nodes that have degree $k$ and survive the failure or attack. The objective is now to characterize the cluster size distribution of surviving nodes, and to determine under which condition cluster sizes can be infinite. We make use of the generating function formalism and define $G(x)$ as the generating function of the network degree distribution $p_k$:

$$G(x) = \sum_{k=0}^{\infty} p_k x^k. \qquad (11)$$

Recall that the connection between the generating function and the probability distribution it generates is given by

$$p_k = \lim_{x \to 0} \frac{1}{k!} \frac{d^k G(x)}{dx^k}. \qquad (12)$$
We still need to derive the generating function $F_0(x)$ of the probability of finding a node of degree $k$ that has survived the attack. Since $p_k q_k$ is the probability of finding a surviving node of degree $k$ after the disruptive event, applying the definition of generating function, Eq. (11), we find that $F_0(x)$ takes the form

$$F_0(x) = \sum_{k=0}^{\infty} p_k q_k x^k. \qquad (13)$$

Another important generating function is the one associated with the probability of finding a randomly chosen edge connected to a node of degree $k$ (after the attack):

$$A(x) = \sum_{k=0}^{\infty} \frac{k p_k q_k}{z}\, x^k = \frac{x F_0'(x)}{G'(1)}, \qquad (14)$$

where $z = \langle k \rangle = \sum_{k=0}^{\infty} k p_k = dG(1)/dx$. To obtain an expression for the cluster size distribution, we first need to find the generating function of the probability that one of the outgoing edges of the node we arrived at connects to a surviving node of degree $k$. This is simply $A(x)/x$, and the desired generating function can be expressed as

$$F_1(x) = F_0'(x)/G'(1) = F_0'(x)/z. \qquad (15)$$
Now we look for the generating function $H_1(x)$ of the distribution of cluster sizes of surviving nodes that are reached by randomly choosing an edge and following it to one of its ends. If we choose an edge that leads us to a removed node, regardless of the degree of the node, we say that the cluster size we find is zero. The probability of following the randomly chosen edge and finding a surviving node of degree zero is zero, the probability of finding a surviving node of degree one is $p_1 q_1/z$, the probability of finding a surviving node of degree two is $2 p_2 q_2/z$, and so on. So, the probability of finding a surviving node, regardless of its degree, is $\sum_{k=0}^{\infty} k p_k q_k/z = F_1(1)$. In consequence, the probability of finding an edge that leads to a removed node is $1 - F_1(1)$. Clearly, this is also the probability of following a randomly chosen edge that leads to a zero-size component, and so also the coefficient $s_0$ that accompanies $x^0$ in $H_1(x)$. To find the full expression of $H_1(x)$, we still have to look for the probabilities that accompany non-zero-size components, i.e., $x^k$ with $k > 0$. We start with the probability $s_1$ of finding, by following a randomly chosen edge, a component of size 1. This is nothing other than the sum of the probabilities of following an edge and finding a surviving node of degree $k$ which has its other $k - 1$ edges connected to removed nodes:

$$s_1 = \sum_{k=1}^{\infty} \frac{k p_k q_k}{z}\, (1 - F_1(1))^{k-1} = F_1(H_1(0)). \qquad (16)$$

Similarly for $s_2$, we can obtain

$$s_2 = \sum_{k=2}^{\infty} (k-1)\, \frac{k p_k q_k}{z}\, (1 - F_1(1))^{k-2}\, s_1 = F_1'(H_1(0))\, H_1'(0), \qquad (17)$$

where $(1 - F_1(1))^{k-2} s_1$ is the probability of taking randomly $k - 1$ edges and finding that $k - 2$ edges are attached to removed nodes, and one to a size 1 component. The term $k - 1$ indicates that there are $k - 1$ possible configurations for these edges. We observe that Eq. (17) is the derivative with respect to $x$ of $F_1(H_1(x))$, cf. Eq. (16), evaluated at $x = 0$. However, from the definition given by Eq. (11), and its inverse, Eq. (12), we know that the term $x^1$ is accompanied by a first derivative, while the term $x^2$ is associated with a second derivative and a factor 1/2. We solve this problem by considering that the function we have to differentiate successively is $x F_1(H_1(x))$. The first derivative of this function evaluated at $x = 0$ is $F_1(H_1(0))$, while the second derivative evaluated at $x = 0$ is $2 F_1'(H_1(0)) H_1'(0)$. This suggests a self-consistent equation for $H_1(x)$ of the form

$$H_1(x) = (1 - F_1(1)) + x F_1(H_1(x)). \qquad (18)$$
It can be easily verified that Eq. (18) leads to the correct expressions of $s_0, s_1, \ldots, s_n$ by applying the definition given by Eq. (12). Along similar lines, we can obtain the generating function $H_0(x)$ of the distribution of the component size to which a randomly chosen node belongs. The main difference is that instead of determining the probability of finding a randomly chosen edge attached to a component of size $s$, we now randomly choose a node and want to determine the probability of finding this node belonging to a cluster of size $s$. For this reason, instead of using the edge-based probability $k p_k q_k/z$ as before, we use $p_k q_k$ and its corresponding generating function $F_0(x)$. The expression for $H_0(x)$ takes the form

$$H_0(x) = (1 - F_0(1)) + x F_0(H_1(x)). \qquad (19)$$

Finally, from Eq. (19) we can obtain the average size of the components:

$$H_0'(1) = \langle s \rangle = F_0(1) + \frac{F_0'(1)\, F_1(1)}{1 - F_1'(1)}. \qquad (20)$$
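Equation (20) can be cross-checked numerically by solving Eq. (18) with a fixed-point iteration and differentiating Eq. (19) by finite differences. The sketch below assumes, purely as an example, a Poisson degree distribution with $\langle k \rangle = 2$ (truncated at $k = 40$) and a uniform survival probability $q_k = 0.3$, which lies below the percolation threshold; none of these values comes from the text.

```python
from math import exp, factorial

K_MAX = 40
MEAN_K = 2.0
# Assumed example: Poisson p_k and uniform survival probability q_k = 0.3.
p = [exp(-MEAN_K) * MEAN_K**k / factorial(k) for k in range(K_MAX + 1)]
q = [0.3] * (K_MAX + 1)
z = sum(k * pk for k, pk in enumerate(p))

def F0(x):                                   # Eq. (13)
    return sum(pk * qk * x**k for k, (pk, qk) in enumerate(zip(p, q)))

def F0_prime(x):
    return sum(k * pk * qk * x**(k - 1)
               for k, (pk, qk) in enumerate(zip(p, q)) if k >= 1)

def F1(x):                                   # Eq. (15)
    return F0_prime(x) / z

def F1_prime(x):
    return sum(k * (k - 1) * pk * qk * x**(k - 2)
               for k, (pk, qk) in enumerate(zip(p, q)) if k >= 2) / z

S0 = 1.0 - F1(1.0)                           # probability of a zero-size component

def H1(x, iterations=500):
    """Solve Eq. (18) by fixed-point iteration, starting from H1 = 0."""
    h = 0.0
    for _ in range(iterations):
        h = S0 + x * F1(h)
    return h

def H0(x):                                   # Eq. (19)
    return (1.0 - F0(1.0)) + x * F0(H1(x))

# Eq. (20), closed form, against a central finite difference of H0 at x = 1.
mean_size_closed = F0(1.0) + F0_prime(1.0) * F1(1.0) / (1.0 - F1_prime(1.0))
eps = 1e-5
mean_size_numeric = (H0(1.0 + eps) - H0(1.0 - eps)) / (2 * eps)
```

Note that $\langle s \rangle$ can be smaller than one here, because removed nodes are counted as components of size zero.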
As mentioned above, we are interested in knowing the threshold at which the average cluster size changes from finite to infinite. Clearly, Eq. (20) diverges when $1 - F_1'(1) = 0$, and this critical condition sets the threshold between finite and infinite cluster sizes. Finally, replacing $F_1'(1)$ by its definition, Eq. (15), we obtain a critical condition for $q_k$, which was our initial goal:

$$\sum_{k=0}^{\infty} k p_k (k q_k - q_k - 1) = 0. \qquad (21)$$
Equation (21) defines the critical condition for the stability of an uncorrelated infinite network under an arbitrary attack. For failure, i.e., when the attack does not depend on the degree $k$ of the node, $q_k = q$, and from Eq. (21) the classical percolation threshold for failure [13, 10] is retrieved. Since $q$ is the survival probability, the critical fraction of removed nodes reads

$$1 - q_c = 1 - \frac{\langle k \rangle}{\langle k^2 \rangle - \langle k \rangle}. \qquad (22)$$
Notice that Eq. (22) defines the percolation threshold for infinite networks. The critical qc strongly depends on system size and thus Eq. (22) fails to describe the stability of finite networks [17]. Also notice that a basic assumption behind Eq. (21) is that the original network is uncorrelated. Expressions for the percolation threshold of finite and/or correlated networks are still missing.
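The critical condition of Eq. (21) is straightforward to evaluate for any degree distribution. The following sketch uses hypothetical function names and, as an assumed example, a truncated Poisson distribution, for which $\langle k^2 \rangle - \langle k \rangle = \langle k \rangle^2$ and hence $q_c = 1/\langle k \rangle$.

```python
from math import exp, factorial

def critical_survival_probability(p):
    """Eq. (21) with q_k = q: q_c = <k> / (<k^2> - <k>); p[k] is the degree distribution."""
    k1 = sum(k * pk for k, pk in enumerate(p))
    k2 = sum(k * k * pk for k, pk in enumerate(p))
    return k1 / (k2 - k1)

def stability_margin(p, q):
    """Left-hand side of Eq. (21); a positive value means F1'(1) > 1,
    i.e. the average cluster size has diverged and a giant cluster survives."""
    return sum(k * pk * (k * qk - qk - 1)
               for k, (pk, qk) in enumerate(zip(p, q)))

mean_k = 4.0
poisson = [exp(-mean_k) * mean_k**k / factorial(k) for k in range(80)]  # assumed example
q_c = critical_survival_probability(poisson)
removed_fraction_c = 1.0 - q_c            # the quantity given by Eq. (22)
```

Passing a degree-dependent list `q` to `stability_margin` covers targeted attacks as well as uniform failure.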
3 Two Current Hot Problems in Complex Networks

In this section we address two current hot problems in complex networks: dynamics on transportation networks and community identification in complex networks. Part of the future advances of complex network theory is clearly going to be along the lines of the problems reviewed in this section. However, we warn the reader that this selection just gathers a small number of timely issues on networks which are particularly attractive for the author. The number of relevant open problems in the fast-evolving area of network theory exceeds by far the small selection presented here.

3.1 Dynamics on Transportation Networks

A transportation network typically models the movement of entities across the nodes of the network (see Fig. 2). A classical example is the airline transportation network where each node denotes a city (i.e., an airport) and edges indicate direct flights between cities. If we associate to each node $i$ a number $n_i(t)$ denoting the number of individuals at node $i$ at time $t$, we can model the dynamical flow of mass (or individuals) across the network. It is not difficult to imagine a transportation network moving various types (e.g., species) of individuals or entities. This means that at a given instant of time there will be various species of individuals coexisting at each node. If in turn there is a dynamics among the various types of individuals, on top of the transport dynamics there will be an inter-species dynamics. A chemical reaction where the chemical species diffuse across the transportation network [18] would be an example of this type of dynamical process. Another example would be the spreading of a disease through the airline transportation network [19, 20, 21], as occurred in 2002 during the outbreak of the severe acute respiratory syndrome (SARS) [19]. In this case, susceptible, infected, and recovered individuals are the reacting species.
In this section we briefly review some recent results [18, 22–25] which have helped to elucidate some key aspects of the metapopulation dynamics which occurs on transport networks. Let us start by understanding the transport dynamics.

3.1.1 Transport Dynamics

For the moment we assume that there is only one species diffusing in the system. A metapopulation description of the transport process can be obtained by thinking in terms of the mean occupation number $\tilde{n}_k(t)$ of nodes of degree $k$ at time $t$, which by definition reads as

$$\tilde{n}_k(t) = \frac{1}{N_k} \sum_{i \,:\, k^{(i)} = k} n^{(i)}(t), \qquad (23)$$
where the sum runs over all nodes whose degree is $k$, $N_k$ refers to the total number of nodes with degree $k$, and $n^{(i)}(t)$ denotes the occupation number (= number of individuals) at node $i$. It is assumed that there is a diffusion rate $d(k, k')$ that controls the migration of individuals from a subpopulation with degree $k$ to another of degree $k'$. In consequence, the probability per unit time $L_k$ for an individual at a node of degree $k$ of leaving the node is $L_k = k \sum_{k'} p(k'|k)\, d(k, k')$, where $p(k'|k)$ is the conditional probability that an edge departing from a node of degree $k$ points to a node of degree $k'$. Thus, the (mean-field) time evolution of $\tilde{n}_k(t)$ can be expressed as

$$\partial_t \tilde{n}_k(t) = -L_k \tilde{n}_k(t) + k \sum_{k'} p(k'|k)\, d(k', k)\, \tilde{n}_{k'}(t). \qquad (24)$$

The reasoning behind Eq. (24) is very simple. The first term on the right-hand side accounts for the number of individuals that initially are in a node of degree $k$ and then leave it, while the second term considers the increase of individuals in $k$-degree nodes due to the migration of individuals from subpopulations of degree $k'$ to $k$. For uncorrelated networks, $p(k'|k)$ takes the form $p(k'|k) = k' p_{k'}/\langle k \rangle$ and Eq. (24) reduces to

$$\partial_t \tilde{n}_k(t) = -L_k \tilde{n}_k(t) + \frac{k}{\langle k \rangle} \sum_{k'} k' p_{k'}\, d(k', k)\, \tilde{n}_{k'}(t). \qquad (25)$$

If in addition it is assumed that the probability for an individual to leave a given population is independent of its degree, then $L_k = L$ for all $k$, and $d(k, k') = L/k$. Writing $N_k \equiv \tilde{n}_k(t \to \infty)$ for the stationary occupation number of a node of degree $k$, the stationary solution of Eq. (25) then reads

$$N_k = \frac{k}{\langle k \rangle}\, \bar{N}, \qquad (26)$$

where $\bar{N} = \sum_k p_k N_k$ is the mean occupation number per node.
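The stationary state of Eq. (26) can be checked on an explicit small network. The sketch below integrates a node-level version of the transport dynamics (rather than the degree-block equations above) with a simple Euler scheme, assuming a degree-independent leaving rate so that $d = L/k$; the toy network, the parameter values, and the function name are assumptions made for illustration.

```python
def diffuse(adj, occupancy, leave_rate, dt, steps):
    """Euler integration of node-level transport: each individual leaves its node
    at rate `leave_rate` and follows one of the node's edges chosen uniformly,
    which corresponds to d(k, k') = L / k."""
    n = dict(occupancy)
    for _ in range(steps):
        flow_in = {i: 0.0 for i in adj}
        for j, nbrs in adj.items():
            share = leave_rate * n[j] / len(nbrs)   # outflow of j along each edge
            for i in nbrs:
                flow_in[i] += share
        n = {i: n[i] + dt * (flow_in[i] - leave_rate * n[i]) for i in adj}
    return n

# Hypothetical toy network: a hub 0 linked to 1, 2, 3, plus the edge 1-2.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
start = {0: 0.0, 1: 0.0, 2: 0.0, 3: 400.0}
final = diffuse(adj, start, leave_rate=1.0, dt=0.05, steps=4000)

mean_degree = sum(len(v) for v in adj.values()) / len(adj)
mean_occupancy = sum(start.values()) / len(adj)
# Eq. (26): stationary occupation proportional to the degree.
predicted = {i: len(adj[i]) / mean_degree * mean_occupancy for i in adj}
```

Whatever the initial condition, the occupation numbers relax to values proportional to the node degree, while the total number of individuals is conserved.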
A more realistic transportation process has to consider the migration of individuals to be proportional to the traffic intensity along the network edges. This can be obtained by defining a heterogeneous diffusion probability for any given individual to go from a subpopulation of degree $k$ to another one of degree $k'$ as $d(k, k') = L w_0 (k k')^{\theta}/T_k$, where $T_k$ provides the normalization that ensures that the overall outflow is still $L$, $\theta$ is a model parameter that controls the impact of the network topology, and $w_0$ is simply a constant.

Fig. 2. The scheme illustrates a transportation network. Each node is a container of agents, i.e., a subpopulation. Agents are transported through the network edges, e.g., from node j to i. Inside each node, individual agents interact. The figure depicts a SIR dynamics in which susceptible agents get the disease from infected agents, which in turn become, after a characteristic time, recovered.

3.1.2 Dynamics Among Different Species

In the following discussion we assume that there are multiple species traveling across the network which interact among themselves. We consider three interacting species: susceptible, infected, and recovered individuals which follow the classical Susceptible-Infected-Recovered (SIR) dynamics (see Fig. 2). For a single population (node), an epidemic outbreak can occur depending on the basic reproductive number $R_0$, which accounts for the number of secondary infected cases generated by a primary infected individual. The basic reproductive number is defined as

$$R_0 = \frac{\beta}{\mu}, \qquad (27)$$
where $1/\beta$ is the characteristic time required by a susceptible individual to acquire the disease from any given neighbor, and $1/\mu$ is the characteristic time an individual remains infected after getting the disease. If $R_0 > 1$, infections initially outnumber recoveries, and the disease spreads. When $R_0 < 1$ the epidemic goes to extinction. Note that even if $R_0 > 1$, i.e., when the disease at the node level affects many individuals, the infection does not necessarily spread over the metapopulation system, which in turn means that a macroscopic fraction of nodes remains free of the disease. For a global spread to occur, we still require a fast enough diffusion of individuals. In the following we review the derivation of the metapopulation disease invasion predictor $R_*$, which determines under which parameters (including $R_0$ and $d(k, k')$) a disease infects a finite fraction of the network.

Let us start out by estimating the number of new infected individuals (seeds) that may appear in a connected subpopulation of degree $k'$ during the duration of an outbreak in a subpopulation of degree $k$. We denote by $\alpha N_k$ the number of infected individuals during the evolution of the epidemic in a closed subpopulation ($\alpha$ depends on the specific disease model). If each infected individual holds the disease for a characteristic time $\mu^{-1}$ during which it can travel to a neighboring subpopulation of degree $k'$ with a rate $d(k, k')$, then the number of new seeds can be expressed as

$$\lambda_{k,k'} = \frac{d(k, k')\, \alpha N_k}{\mu}. \qquad (28)$$
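To make the roles of $R_0$, $\alpha$, and $\lambda_{k,k'}$ concrete, the sketch below integrates a single closed subpopulation and then evaluates Eq. (28). It assumes the standard SIR rate equations, which are not written out in the text, and takes $\alpha$ to be the fraction of the subpopulation that is ever infected; all parameter values and function names are illustrative.

```python
def sir_attack_fraction(beta, mu, population=1000.0, seed_infected=1.0,
                        dt=0.01, t_max=200.0):
    """Euler integration of the standard SIR equations (assumed form):
    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - mu*I,  dR/dt = mu*I.
    Returns the fraction of the population that was ever infected."""
    s, i, r = population - seed_infected, seed_infected, 0.0
    for _ in range(int(t_max / dt)):
        new_infections = beta * s * i / population * dt
        new_recoveries = mu * i * dt
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
    return (i + r) / population

def seeds(d, alpha, n_k, mu):
    """Eq. (28): lambda_{k,k'} = d(k,k') * alpha * N_k / mu."""
    return d * alpha * n_k / mu

alpha_spreading = sir_attack_fraction(beta=2.0, mu=1.0)   # R0 = 2
alpha_dying = sir_attack_fraction(beta=0.5, mu=1.0)       # R0 = 0.5
lam = seeds(d=0.01, alpha=alpha_spreading, n_k=1000.0, mu=1.0)
```

For $R_0 = 2$ a large fraction of the subpopulation is eventually infected, whereas for $R_0 = 0.5$ the outbreak stays confined to a few individuals, so that almost no seeds are exported.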
Now we can derive a simple approximate evolution equation for the number $D^n$ of infected subpopulations at generation $n$ for a random graph in which every subpopulation has the same degree $k$:

$$D^n = D^{n-1} (k - 1) \left[ 1 - \left( \frac{1}{R_0} \right)^{\lambda_{kk}} \right] \left( 1 - \frac{D^{n-1}}{N} \right). \qquad (29)$$

The reasoning behind Eq. (29) is the following. Each of the $D^{n-1}$ infected subpopulations at generation $n - 1$ will seed during the next generation a number of subpopulations proportional to $k - 1$, times the probability that the neighboring subpopulations are not yet infected (i.e., $1 - D^{n-1}/N$), times the probability that the newly arrived infected individuals cause a local outbreak (this probability is proportional to $1 - R_0^{-\lambda_{kk}}$, since the probability that a single individual will not transmit the disease is $R_0^{-1}$ [27]). Assuming, as before, a degree-independent leaving rate, now denoted $p$, so that $d(k, k') = p/k$, we have $\lambda_{kk} = p N_0 \alpha/(\mu k)$ (where $N_0 = N_k$). If in addition $R_0 \simeq 1$, such that $1 - R_0^{-\lambda_{kk}} \sim \lambda_{kk} (R_0 - 1)$, and we consider the early stage of the spread ($D^{n-1} \ll N$), Eq. (29) reduces to

$$D^n = p N_0 \alpha \mu^{-1}\, \frac{k - 1}{k}\, (R_0 - 1)\, D^{n-1}. \qquad (30)$$
From Eq. (30) it is easy to observe that a macroscopic outbreak can only occur if

$$R_* = p N_0 \alpha \mu^{-1}\, \frac{k - 1}{k}\, (R_0 - 1) > 1. \qquad (31)$$

Thus, the global invasion threshold is defined by Eq. (31). This implies that to observe global spread the mobility rate has to be such that

$$p \ge \frac{\mu k}{N_0\, \alpha (k - 1)(R_0 - 1)}. \qquad (32)$$
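The invasion threshold can be illustrated by iterating Eq. (29) directly and comparing the outcome with the sign of $R_* - 1$ from Eq. (31). The sketch below uses hypothetical function names and illustrative parameter values that do not come from the text. Since the linearization behind Eq. (30) assumes $R_0 \simeq 1$, the exact growth factor of Eq. (29) is somewhat smaller than $R_*$.

```python
def r_star_homogeneous(p, n0, alpha, mu, k, r0):
    """Eq. (31): R* = p N0 alpha / mu * (k - 1) / k * (R0 - 1)."""
    return p * n0 * alpha / mu * (k - 1) / k * (r0 - 1.0)

def infected_subpopulations(d0, generations, p, n0, alpha, mu, k, r0, n_sub):
    """Iterate Eq. (29) for the number D^n of infected subpopulations;
    n_sub is the total number N of subpopulations."""
    lam = p * n0 * alpha / (mu * k)          # seeds sent to one neighbour
    history = [d0]
    for _ in range(generations):
        d = history[-1]
        history.append(d * (k - 1) * (1.0 - r0 ** (-lam)) * (1.0 - d / n_sub))
    return history

# Illustrative values only (not taken from the text).
common = dict(n0=1000.0, alpha=0.3, mu=1.0, k=4, r0=1.2)
fast = infected_subpopulations(1.0, 5, p=0.045, n_sub=10_000, **common)
slow = infected_subpopulations(1.0, 5, p=0.010, n_sub=10_000, **common)
r_fast = r_star_homogeneous(p=0.045, **common)
r_slow = r_star_homogeneous(p=0.010, **common)
```

With the larger mobility rate the number of infected subpopulations grows from one generation to the next, while with the smaller one the outbreak remains local and dies out.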
In a heterogeneous metapopulation network, i.e., when the subpopulation degree varies across the network, Eq. (29) has to be replaced by

$$D_k^n = \sum_{k'} D_{k'}^{n-1} (k' - 1)\, \lambda_{k'k}\, (R_0 - 1)\, p(k|k') \left( 1 - \frac{D_k^{n-1}}{N_k} \right), \qquad (33)$$

where again it was assumed that $R_0 \simeq 1$. Since $p(k|k')$ is the conditional probability that an edge attached to a node of degree $k'$ has its other tip connected to a node of degree $k$, $p(k|k')\, k'$ is the probability that at least one edge is connected to a node of degree $k$. In Eq. (33), $p(k|k')(k' - 1)$ refers to the probability that a recently infected node with degree $k'$, discounting the edge from which the node got the disease, is linked to a node of degree $k$. As said above, when degree correlations can be neglected, $p(k|k') = k\, p(k)/\langle k \rangle$, and Eq. (33) can be expressed, in the early stage of the spread, as

$$D_k^n = \frac{k\, p(k)}{\langle k \rangle}\, (R_0 - 1) \sum_{k'} D_{k'}^{n-1} (k' - 1)\, \lambda_{k'k}. \qquad (34)$$
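Similarly, with $d(k', k) = p/k'$, Eq. (28) is reduced to

$$\lambda_{k'k} = \frac{p\, \alpha N_{k'}}{\mu k'}. \qquad (35)$$

Inserting the stationary occupation numbers of Eq. (26), so that $N_{k'}/k' = N_0/\langle k \rangle$ with $N_0$ the mean subpopulation size, the evolution equation for $D_k^n$ reads

$$D_k^n = (R_0 - 1)\, \frac{k\, p(k)\, p N_0 \alpha}{\mu \langle k \rangle^2} \sum_{k'} D_{k'}^{n-1} (k' - 1). \qquad (36)$$

Multiplying both sides by $(k - 1)$ and taking the sum over $k$ on both sides, Eq. (36) can be expressed as

$$\Theta^n = (R_0 - 1)\, \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle^2}\, \frac{p N_0 \alpha}{\mu}\, \Theta^{n-1}, \qquad (37)$$

where $\Theta^n$ is defined as $\Theta^n = \sum_k D_k^n (k - 1)$. From Eq. (37) we learn that the disease spreads only if

$$R_* = (R_0 - 1)\, \frac{\langle k^2 \rangle - \langle k \rangle}{\langle k \rangle^2}\, \frac{p N_0 \alpha}{\mu} > 1. \qquad (38)$$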
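Equation (38) shows that degree heterogeneity lowers the invasion threshold. A minimal sketch, with a hypothetical function name and illustrative numbers, compares a degree-regular network with a broader degree distribution that has the same mean degree:

```python
def r_star_heterogeneous(pk, p, n0, alpha, mu, r0):
    """Eq. (38): R* = (R0 - 1) (<k^2> - <k>) / <k>^2 * p N0 alpha / mu.

    `pk` maps each degree k to its probability p(k)."""
    k1 = sum(k * w for k, w in pk.items())
    k2 = sum(k * k * w for k, w in pk.items())
    return (r0 - 1.0) * (k2 - k1) / k1**2 * p * n0 * alpha / mu

# Illustrative values only (not taken from the text).
params = dict(p=0.02, n0=1000.0, alpha=0.3, mu=1.0, r0=1.2)
regular = {4: 1.0}            # every subpopulation has degree 4
broad = {2: 0.8, 12: 0.2}     # same mean degree <k> = 4, larger <k^2>

r_regular = r_star_heterogeneous(regular, **params)
r_broad = r_star_heterogeneous(broad, **params)
```

For the regular network the factor $(\langle k^2 \rangle - \langle k \rangle)/\langle k \rangle^2$ reduces to $(k-1)/k$, so Eq. (38) coincides with Eq. (31); for the broad distribution the same mobility rate is enough to push $R_*$ above one.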
Equation (38) defines the global invasion threshold for a heterogeneous network. Though in recent years we have observed important progress related to the dynamics on transportation networks, there are still many open questions to be answered. For example, the degree of a subpopulation has been considered so far decoupled from the subpopulation size. However, we know that, in many cases, as in an airline transportation network which connects cities of different sizes, degree and subpopulation size are strongly correlated. In fact, a satisfactory network growth model for transportation networks is still lacking. Regarding the dynamics on the nodes, typically death and birth processes are ignored, even though small size nodes could experience large fluctuations which in turn could dramatically affect global flow on the network. Bottleneck effects due to limitation in the transportation channel, as well as limitation in node capacity, are important problems that deserve to be investigated.

3.2 Identifying Communities in Complex Networks

If we observe real-world networks, we notice that typically there are small sets of nodes which are highly connected to each other but with only few links to the rest of the network (see Fig. 3). These sets of highly connected nodes are typically referred to as communities or modules.

Fig. 3. The scheme illustrates a network comprising two modules or communities. Notice the high connectivity exhibited by nodes in each community.

To fully understand the
internal topological structure of a network it is crucial to correctly detect the community structure in it. A general method for identification of communities in unipartite networks is the maximization of the modularity function $Q$ introduced by Newman and Girvan [28]. The function $Q$ evaluates the "goodness" of a partition of a network into communities. The basic assumption behind the modularity function is that a community or module of a network should exhibit a number of internal links greater than the number of links of a subset of a random network. For a network with $N$ nodes and $L$ links, the modularity function $Q$ is defined as follows:

$$Q = \sum_{s=1}^{m} q_s = \sum_{s=1}^{m} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right], \qquad (39)$$

where the sum runs over the $m$ modules of the network, $l_s$ is the number of links inside module $s$, and $d_s$ is the total degree of the nodes in module $s$. The term $l_s/L$ denotes the fraction of links connecting pairs of nodes belonging to module $s$, while $(d_s/2L)^2$ represents the fraction of links that one would find in the module if links were placed at random in the network, under the constraint of respecting the degree distribution of the original network. If $q_s$ is such that

$$q_s = \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \ge 0, \qquad (40)$$

the module is well defined, in the sense that the module presents more links than expected by random chance. The greater $q_s$, the better defined the module. The identification method implies the maximization of $Q$, which in turn involves sampling over all possible partitions of the network. Unfortunately,
the number of possible partitions grows exponentially with the network size, and the modularity optimization is an NP-complete problem [29]. Typically, the ambitious goal of finding the true optimum of the measure is not attainable. However, approximations of the maximum can be obtained by applying optimization algorithms such as simulated annealing, extremal optimization, or spectral division. Other drawbacks of the Newman–Girvan method are that it cannot scan the network below some scale, leaving small modules undetected, and that it may be affected by the time evolution of the network, i.e., by network size.

3.2.1 Identifying Communities in Bipartite Networks

Bipartite networks are a special and important class of networks in which nodes are divided in two disjoint subsets and edges link nodes of one subset with nodes in the other. The number of applications of bipartite networks is very large; however, one application in social science has become the prototypical example of a bipartite network: the movie-actor network [30–34]. This network is divided in two sets, the set of actors and the set of movies (also referred to as teams; see Fig. 4). An edge that connects an actor $a$ and a movie $m$ indicates that $a$ has participated in the movie $m$. Note that the behavior of these networks strongly depends on whether both partitions grow with time, which leads to scale-free degree distributions of actors, or on whether one of the partitions, e.g., the actor set, is fixed over time while the remaining set grows unboundedly, which results in a beta-distribution for the degree of actor nodes [35].

Fig. 4. The figure shows a scheme of a growing bipartite network. The team node D represents a new incoming node. The scheme at the bottom indicates the resulting unipartite projection of actor nodes (see text).
Many relevant properties of bipartite networks become evident in the unipartite projection of actor nodes. In this unipartite network, an edge running from an actor $a$ to an actor $a'$ indicates that $a$ and $a'$ have co-starred in the same movie (see Fig. 4). Notice that in consequence the actors attached to a movie $m$ in the bipartite network are part of a clique in the unipartite projection. Bipartite networks have intrinsically very strong modularity and typically exhibit complex structure. Guimerà et al. [36] have recently proposed a simple and elegant model for bipartite network growth which allows us to study different levels of modularity in bipartite networks. The model assumes that each actor and each movie has an associated color. The number of colors is a model parameter that has to be defined in advance. The next step is to assign a color to each actor. Once all this has been defined, the network is grown according to the following steps.

a) Create team $m$.
b) Select the number $\mu_m$ of actors in the team.
c) Select the color $c_m$ of the team.
d) For each of the $\mu_m$ actors in $m$ proceed as follows: with probability $p$, select the actor from the pool of actors with the team color $c_m$; otherwise, select an actor at random with equal probability.
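Steps a) to d) can be turned into a short simulation, and the modularity of Eq. (39) can then be evaluated on the unipartite projection using the actor colors as modules. The sketch below is one possible reading of the model and not the original implementation: the fixed team size, the round-robin color assignment, the uniform choice of the team color, and the rule that an actor appears at most once per team are all assumptions, and repeated co-starring is collapsed into a single link.

```python
import random
from itertools import combinations

def grow_teams(n_actors, n_colors, n_teams, team_size, p, seed=0):
    """Steps a)-d). Assumptions: fixed team size, actor colors assigned
    round-robin, team color drawn uniformly, no actor repeated within a team."""
    rng = random.Random(seed)
    actor_color = {a: a % n_colors for a in range(n_actors)}
    pools = {c: [a for a in range(n_actors) if actor_color[a] == c]
             for c in range(n_colors)}
    everyone = list(range(n_actors))
    teams = []
    for _ in range(n_teams):                     # a) create team m
        color = rng.randrange(n_colors)          # c) select the team color
        members = set()
        while len(members) < team_size:          # b), d) fill the mu_m slots
            pool = pools[color] if rng.random() < p else everyone
            members.add(rng.choice(pool))
        teams.append(sorted(members))
    return actor_color, teams

def projection_modularity(actor_color, teams):
    """Eq. (39) on the unweighted unipartite projection, with modules = colors."""
    links = {pair for team in teams for pair in combinations(team, 2)}
    total = len(links)
    inside, degree = {}, {}
    for a, b in links:
        for node in (a, b):
            c = actor_color[node]
            degree[c] = degree.get(c, 0) + 1
        if actor_color[a] == actor_color[b]:
            inside[actor_color[a]] = inside.get(actor_color[a], 0) + 1
    return sum(inside.get(c, 0) / total - (d / (2 * total)) ** 2
               for c, d in degree.items())

colors, segregated = grow_teams(60, 3, 100, 4, p=1.0)
_, mixed = grow_teams(60, 3, 100, 4, p=0.0)
q_segregated = projection_modularity(colors, segregated)
q_mixed = projection_modularity(colors, mixed)
```

With $p = 1$ all links of the projection fall inside a color class and $Q$ is large, whereas with $p = 0$ the color partition carries no structure and $Q$ is close to zero.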
The parameter $p$ is called team homogeneity and quantifies how homogeneous a team is. For $p = 1$ all the actors in the team belong to the same module and modules are perfectly segregated, whereas for $p = 0$ the color of the team is irrelevant, actors are perfectly mixed, and the network does not have a modular structure. Guimerà et al. [36] have adapted the modularity criterion of Newman–Girvan, Eq. (39), to account for modularity in bipartite networks. They consider that the expected number of times a given actor $a$ belongs to a team composed of $\mu$ actors is

$$p_{a \to m} = \mu\, \frac{t_a}{\sum_k t_k}, \qquad (41)$$

where $t_a$ is the total number of movies in which actor $a$ has participated, i.e., the degree of node $a$. Eq. (41) represents the probability that a given team $m$ with $\mu$ actors is connected to actor $a$. Thus, the probability that a team $m$ is connected to $a$ and $a'$ is given by

$$p_{a,a' \to m} = \mu(\mu - 1)\, \frac{t_a t_{a'}}{\left( \sum_k t_k \right)^2}. \qquad (42)$$

In consequence, the average number $n_{a,a'}$ of movies in which $a$ and $a'$ have co-starred (assuming a non-correlated random process) is

$$n_{a,a'} = \frac{\sum_m \mu_m (\mu_m - 1)}{\left( \sum_m \mu_m \right)^2}\, t_a t_{a'}, \qquad (43)$$

where $\sum_m \mu_m = \sum_k t_k$. From Eq. (43) the bipartite modularity can be expressed as the cumulative deviation from the random expectation of the number of co-starred movies:

$$Q_B = \sum_s \left[ \frac{\sum_{a \neq a' \in s} c_{aa'}}{\sum_m \mu_m (\mu_m - 1)} - \frac{\sum_{a \neq a' \in s} t_a t_{a'}}{\left( \sum_m \mu_m \right)^2} \right], \qquad (44)$$

where $c_{aa'}$ is the actual number of movies in which $a$ and $a'$ have co-starred. Notice that the identification of modules through the optimization of $Q_B$ leads to the same type of problems present in the Newman–Girvan method: the method leaves small modules undetected and strongly depends on network size.

The identification of communities in complex networks is extremely important, since it can reveal functional relationships between nodes. So far the available methods for modularity identification are purely phenomenological and they cannot guarantee the correct identification of the community structure. A theoretical foundation for modularity identification is still lacking. Due to the relevance of the problem, we expect to observe important theoretical progress in this direction in the near future.
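As a closing illustration for this section, the bipartite modularity of Eq. (44) can be evaluated for a given assignment of actors to modules. The sketch below assumes that the sums over $a \neq a'$ run over ordered pairs, so that every unordered pair contributes twice, which matches the normalization $\sum_m \mu_m(\mu_m - 1)$; this convention, the function name, and the toy data are assumptions made here.

```python
from itertools import combinations

def bipartite_modularity(teams, module_of):
    """Eq. (44). Assumed convention: the sums over a != a' run over ordered
    pairs, so each unordered pair inside a module is counted twice.
    teams: list of lists of actors; module_of: dict actor -> module label."""
    t = {}                                   # t_a: number of teams of actor a
    costar = {}                              # c_{aa'}: number of shared teams
    for team in teams:
        for a in team:
            t[a] = t.get(a, 0) + 1
        for pair in combinations(sorted(team), 2):
            costar[pair] = costar.get(pair, 0) + 1
    sum_mu = sum(len(team) for team in teams)                  # equals sum_a t_a
    sum_pairs = sum(len(team) * (len(team) - 1) for team in teams)
    q_b = 0.0
    for a, b in combinations(sorted(t), 2):
        if module_of[a] == module_of[b]:
            observed = costar.get((a, b), 0) / sum_pairs
            expected = t[a] * t[b] / sum_mu ** 2
            q_b += 2 * (observed - expected)
    return q_b

# Hypothetical toy data: two groups of three actors, each group starring twice together.
teams = [[0, 1, 2], [0, 1, 2], [3, 4, 5], [3, 4, 5]]
good_partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
bad_partition = {0: "X", 3: "X", 1: "Y", 4: "Y", 2: "Z", 5: "Z"}
q_good = bipartite_modularity(teams, good_partition)
q_bad = bipartite_modularity(teams, bad_partition)
```

The partition that follows the actual teams yields a clearly positive $Q_B$, while a partition that cuts across the teams yields a negative value.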
4 Concluding Remarks

The complex network community has been growing for years. Every day we see new articles on complex networks, and the evolution of the field seems limitless. In such a dynamic research field, any prediction about the future of complex network theory is extremely risky. The two hot topics selected in this chapter, dynamical processes on transportation networks and identification of communities in complex networks, are certainly areas that will experience important progress in the near future. Very important progress is also expected in many other areas, for example, in dynamical networks of moving agents. In the coming years we will witness substantial new progress in network research.
References 1. R. Albert and A.-L. Barab´ asi, Rev. Mod. Phys. 74, 47 (2002). 2. S.N. Dorogovtsev and J.F.F. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW, Oxford University Press, Oxford, UK (2003). 3. F. Chung and L. Lu, Adv. Appl. Math. 26, 257 (2001). 4. E.P. Wigner, Ann. Math. 62, 548 (1955). 5. E.P. Wigner, Ann. Math. 65, 203 (1957). 6. E.P. Wigner, Ann. Math. 67, 325 (1958). 7. M.L. Metha, Random Matrices, 2nd ed., Academic Press, New York (1991).
292
F. Peruani
8. A. Crisanti, G. Paladin, and A. Vulpiani, Products of Random Matrices in Statistical Physics, Springer, Berlin (1993).
9. T. Guhr, A. Mueller-Groeling, and H.A. Weidenmueller, Phys. Rep. 299, 189 (1998).
10. D.S. Callaway, M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. Lett. 85, 5468 (2000).
11. M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev. E 64, 026118 (2001).
12. R. Albert, H. Jeong, and A.-L. Barabási, Nature (London) 406, 6794 (2000); 406, 378 (2000).
13. R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, Phys. Rev. Lett. 85, 4626 (2000).
14. R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, Phys. Rev. Lett. 86, 3682 (2001).
15. B. Mitra, F. Peruani, S. Ghose, and N. Ganguly, in Proceedings of 14th ACM Conference on Computer and Communications Security (Association for Computing Machinery, Inc. New York, 2007).
16. B. Mitra, F. Peruani, S. Ghose, and N. Ganguly, in Proceedings of 26th Symposium on Principles of Distributed Computing (Association for Computing Machinery, Inc. New York, 2007).
17. B. Mitra, N. Ganguly, S. Ghose, and F. Peruani, Phys. Rev. E 78, 026115 (2008).
18. V. Colizza, R. Pastor-Satorras, and A. Vespignani, Nature Physics 3, 276–282 (2007).
19. L. Hufnagel, D. Brockmann, and T. Geisel, Proc. Natl. Acad. Sci. USA 101, 15124 (2004).
20. Z. Wu, L.A. Braunstein, V. Colizza, R. Cohen, S. Havlin, and H.E. Stanley, Phys. Rev. E 74, 056104 (2006).
21. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, Proc. Natl. Acad. Sci. USA 103, 2015–2020 (2006).
22. V. Colizza and A. Vespignani, J. Theor. Biol. 251, 450–467 (2008).
23. V. Colizza and A. Vespignani, Phys. Rev. Lett. 99, 148701 (2007).
24. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, Int. J. Bifurcation and Chaos 17, 2491–2500 (2007).
25. V. Colizza, A. Barrat, M. Barthelemy, and A. Vespignani, BMC Medicine 5, 34 (2007).
26. I.J. Farkas, I. Derenyi, A.-L. Barabási, and T. Vicsek, Phys. Rev. E 64, 026704 (2001).
27. N.T. Bailey, The Mathematical Theory of Infectious Diseases, 2nd edition, Hodder Arnold (1975).
28. M.E.J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004).
29. S. Fortunato, e-print arXiv:0705.4445.
30. J.J. Ramasco, S.N. Dorogovtsev, and R. Pastor-Satorras, Phys. Rev. E 70, 036106 (2004).
31. D.J. Watts and S.H. Strogatz, Nature (London) 393, 440 (1998).
32. R. Albert and A.-L. Barabási, Phys. Rev. Lett. 85, 5234 (2000).
33. R. Albert and A.-L. Barabási, Science 286, 509 (1999).
34. L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley, Proc. Natl. Acad. Sci. 97, 11149 (2000).
Advances in the Theory of Complex Networks
35. F. Peruani, M. Choudhury, A. Mukherjee, and N. Ganguly, Europhys. Lett. 79, 28001 (2007).
36. R. Guimera, M. Sales-Pardo, and L.A. Nunes Amaral, Phys. Rev. E 76, 036102 (2007).
37. A.-L. Barabási, H. Jeong, and R. Albert, Physica A 272, 173 (1999).
Glossary of Essential Terms
Adjacency Matrix: Let G be a graph with n vertices. The n × n matrix A, such that $a_{ij} = 1$ if there is an edge between vertices $v_i$ and $v_j$ and $a_{ij} = 0$ otherwise, is called the adjacency matrix of graph G.
Assortativity: Assortativity refers to a preference for a network’s nodes to attach to others that are similar or different in some way.
Assortativity Coefficient: The assortativity coefficient is the Pearson correlation coefficient r of the degrees of pairs of linked nodes. Hence, positive values of r indicate a correlation between nodes of similar degree, while negative values indicate relationships between nodes of different degree.
Automorphic Equivalence: Two vertices u and v of a labeled graph G are automorphically equivalent if all the vertices can be relabeled to form an isomorphic graph with the labels of u and v interchanged.
Betweenness Centrality: Betweenness centrality of a node v is defined as the sum, over pairs of vertices s and t (s, t ∈ V), of the ratio of the number of shortest paths between s and t that pass through v, $\sigma_{st}(v)$, to the total number of shortest paths between s and t, $\sigma_{st}$. The betweenness centrality g(v) of v is given by
\[ g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}. \tag{1} \]
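The following Python sketch evaluates Eq. (1) by brute force for a small unweighted graph. It is meant only as an illustration: the adjacency-dictionary layout and the function names are assumptions, the sum is taken over ordered pairs (s, t), and faster algorithms exist for large networks.

```python
from collections import deque

def shortest_path_counts(adj, source):
    """BFS from source: return (distance, number of shortest paths) per node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """g(v) = sum over ordered pairs s != v != t of sigma_st(v) / sigma_st."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    g = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue                 # skip s == t and unreachable targets
            for v in adj:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    g[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return g

path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(path))
```

For an undirected graph each unordered pair is visited twice, so halving the result gives the value per unordered pair.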
Biological Networks: Biological networks are representations of biological systems, such as metabolic networks and protein interaction networks.
Bipartite Graphs: Bipartite graphs are graphs that contain vertices of two distinct types, with edges running only between unlike types.
Centrality: The centrality of a node in a network is a measure of the structural importance of the node.
Citation Networks: A citation network is a network whose nodes are articles, with a directed edge from node i to node j if article i cites article j.
Clique: Cliques are complete graphs where all nodes are connected to all other nodes.
Closeness Centrality: The closeness centrality $C_c(v)$ for a vertex v is the reciprocal of the sum of geodesic distances to all other vertices in the graph:
\[ C_c(v) = \frac{1}{\sum_{t \in V} d_G(v, t)}. \tag{2} \]
Clustering Coefficient: The clustering coefficient for a vertex v in a network is defined as the ratio of the total number of connections among the neighbors of v to the total number of possible connections between the neighbors. For a vertex i with neighborhood $N_i$ and degree $k_i = |N_i|$, the clustering coefficient is given by
\[ C_i = \frac{|\{e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E\}|}{k_i (k_i - 1)}. \tag{3} \]
In this form each link between two neighbors is counted in both directions ($e_{jk}$ and $e_{kj}$); if every undirected link is counted only once, the numerator carries an additional factor of 2.
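A short Python sketch of Eq. (3) for an undirected graph follows. The adjacency layout (a dictionary mapping each vertex to the set of its neighbors) and the convention of returning 0 for vertices with fewer than two neighbors are assumptions made for this example.

```python
def clustering_coefficient(adj, i):
    """Local clustering coefficient of vertex i, following Eq. (3).

    adj maps each vertex to the set of its neighbours (undirected graph,
    stored symmetrically). Ordered pairs (j, m) are counted, so every
    undirected link among the neighbours contributes twice, which matches
    the k_i (k_i - 1) normalisation.
    """
    neighbours = adj[i]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention for vertices with < 2 neighbours
    links = sum(1 for j in neighbours for m in neighbours
                if j != m and m in adj[j])
    return links / (k * (k - 1))

# Triangle 1-2-3 with a pendant vertex 4 attached to vertex 1
graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print({v: clustering_coefficient(graph, v) for v in graph})
```

Vertex 1 has three neighbors, only one pair of which is linked, so one of its three possible neighbor links is present.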
Community: A community is a subgraph, where in some reasonable sense the nodes in the subgraph have more to do with each other than with the nodes that are outside the subgraph.
Coordination Number: The coordination number of a graph is the average degree z of the nodes of the network.
Cumulative Advantage: Cumulative advantage means that the more connected a node is, the more likely it is to receive new links. Nodes with higher degree have a stronger ability to grab links added to the network. This concept is more popularly known as “preferential attachment.”
Degree Centrality: Degree centrality is defined as the number of links incident upon a node.
Degree Distribution: The degree distribution of a network gives the probability distribution of the degree of a random node in a network.
Diameter: The diameter of a graph is defined as the maximum of all the shortest distances between any two nodes in the graph.
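The degree distribution and the diameter defined above can be computed directly from an adjacency list. The sketch below is one simple way to do so in Python; it assumes an unweighted, connected, undirected graph stored as a dictionary of neighbor lists, and the function names are illustrative.

```python
from collections import Counter, deque

def degree_distribution(adj):
    """Return {k: fraction of nodes with degree k}."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}

def diameter(adj):
    """Largest shortest-path distance, found by one BFS per node.

    Assumes the graph is connected; otherwise only distances inside each
    node's own component are considered.
    """
    longest = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        longest = max(longest, max(dist.values()))
    return longest

chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(degree_distribution(chain), diameter(chain))
```

Running one breadth-first search per node costs on the order of n(n + m) operations for n nodes and m edges, which is acceptable for small graphs but not for very large ones.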
Dual Graphs: A dual graph of a given planar graph G is a graph that has a vertex for each plane region of G, and an edge for each edge in G joining two neighboring regions, for a certain embedding of G.
Edge Connectivity: The edge connectivity of G, κ′(G), is the minimum size of a disconnecting set of edges.
Edge Cutset: An edge cutset is a set F, a subset of E(G), such that G − F has more than one component.
Eigenvector Centrality: Eigenvector centrality is a measure of the importance of a node in a network. It assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. Thus, the centrality of a node is proportional to the centrality of the nodes to which it is connected, a definition that is applied recursively.
Erdős–Rényi Graph: In the E-R graph model, each pair of n vertices is connected by an edge with some probability p. The probability of a vertex having degree k is given by (z = np)
\[ p_k = \binom{n}{k} p^k (1 - p)^{n-k} \simeq \frac{z^k e^{-z}}{k!}, \tag{4} \]
where the approximation holds for large n at fixed z.
Euclidean Distance: The Euclidean distance between two nodes A and B with coordinates $A_i$ and $B_i$ is defined as
\[ ED(A, B) = \sqrt{\sum_i (A_i - B_i)^2}. \tag{5} \]
Euler’s Formula: If a connected planar graph G has exactly n vertices, e edges, and f faces, then n − e + f = 2.
Euler Tour: An Euler tour of a connected, directed graph G = (V, E) is a cycle that traverses each edge of graph G exactly once, although it may visit a vertex more than once.
Euler Walk: An Euler walk in an undirected graph is a path that uses each edge exactly once.
Geodesic Path: The geodesic path between two vertices is the shortest path between them.
Giant Component: The giant component refers to a connected subgraph that contains a majority of the entire graph’s nodes.
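To illustrate the two expressions for the Erdős–Rényi degree distribution in Eq. (4), the Python sketch below evaluates the binomial form and its Poisson approximation side by side. The values n = 1000 and p = 0.004 are arbitrary choices for this example, and the function names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pk(n, p, k):
    """Binomial degree probability: C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pk(z, k):
    """Poisson approximation: z^k e^(-z) / k!."""
    return z**k * exp(-z) / factorial(k)

n, p = 1000, 0.004   # illustrative values, not taken from the text
z = n * p            # mean degree as defined in the glossary entry
for k in range(9):
    print(k, binomial_pk(n, p, k), poisson_pk(z, k))
```

For a sparse graph such as this one the two columns should agree closely, and the agreement improves as n grows at fixed z.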
Hierarchical Clustering: Hierarchical clustering builds (agglomerative) or breaks up (divisive) a hierarchy of clusters.
Hyperedges: Hyperedges are edges in a network that join more than two nodes together.
Hypergraphs: Hypergraphs are graphs that have hyperedges.
Incidence Matrix: The incidence matrix of a graph is the (0, 1)-matrix that has a row for each vertex and a column for each edge, with entry (v, e) = 1 iff edge e is incident on vertex v.
Jaccard Coefficient: The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets:
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{6} \]
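Eq. (6) translates almost directly into Python set operations. In the sketch below, the value returned for two empty sets is an assumed convention, and applying the coefficient to the neighbor sets of two nodes is just one possible use.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0   # assumed convention when both sets are empty
    return len(a & b) / len(union)

# Similarity of two nodes measured on their (hypothetical) neighbour sets
neighbours_u = {1, 2, 3}
neighbours_v = {2, 3, 4}
print(jaccard(neighbours_u, neighbours_v))
```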
k-core: A k-core is defined as the maximal subset of nodes in which each node is connected to at least k other members of the subset.
k-connected: A connected graph G is k-connected iff every pair of vertices in G is joined by at least k non-intersecting paths and there exists at least one pair with exactly k non-intersecting paths.
k-plex: In a k-plex of n nodes, all the nodes have degree at least (n − k). A 1-plex represents a clique.
Laplacian Matrix: If $d_i$ is the degree of node i, then the Laplacian matrix L is defined as follows:
\[ L_{ij} = \begin{cases} d_i & \text{if } i = j, \\ -1 & \text{if } i \neq j \text{ and } i \text{ is connected to } j, \\ 0 & \text{otherwise.} \end{cases} \tag{7} \]
n-clan: An n-clan is an n-clique S such that the subgraph induced by S has a diameter (D) less than or equal to n.
n-clique: An n-clique is the maximal subset S of the nodes where the distance between any two nodes u and v is less than or equal to n:
\[ d(u, v) \leq n, \quad \forall u, v \in S. \tag{8} \]
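The matrix defined in Eq. (7), which is the graph Laplacian L = D − A, can be built from an adjacency matrix in a few lines. The Python sketch below assumes a simple undirected graph without self-loops, given as a list of 0/1 rows; the function name is illustrative.

```python
def laplacian(adj_matrix):
    """Build the matrix of Eq. (7) from a 0/1 adjacency matrix."""
    n = len(adj_matrix)
    lap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = sum(adj_matrix[i])   # degree d_i (no self-loops assumed)
            elif adj_matrix[i][j]:
                lap[i][j] = -1                   # i is connected to j
    return lap

# Path graph on three vertices: 0 - 1 - 2
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
for row in laplacian(path):
    print(row)
```

Each row of the resulting matrix sums to zero, which is a quick sanity check on any implementation.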
Network Motif: Network motifs are patterns that occur in different parts of a network at frequencies much higher than those found in randomized networks.
Pearson’s Correlation Coefficient: Pearson’s correlation coefficient between two nodes x and y, measured over n paired values, can be computed as
\[ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left( \sum x^2 - \dfrac{(\sum x)^2}{n} \right) \left( \sum y^2 - \dfrac{(\sum y)^2}{n} \right)}}. \tag{9} \]
Percolation Theory: Percolation theory is based on adding nodes and connections to an empty graph until a giant component surfaces. A percolation process is one in which vertices or edges on a graph are randomly designated as either occupied or unoccupied and one asks about various properties of the resulting patterns of vertices.
Planar Graphs: A graph is planar if it has a drawing without crossings.
Power Law: A power law is any polynomial relationship that exhibits the property of scale invariance. The most common power laws relate two variables and have the form
\[ f(x) = a x^k + o(x^k). \tag{10} \]
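Returning to Eq. (9), the Python sketch below computes Pearson's r from the raw sums that appear in the formula. It assumes two equally long sequences of numbers with non-zero variance; the function name and the sample data are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from the sums used in Eq. (9)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = sxy - sx * sy / n
    # Zero variance in xs or ys would make the denominator vanish
    denominator = sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
    return numerator / denominator

xs = [1, 2, 3, 4]
print(pearson_r(xs, [2, 4, 6, 8]), pearson_r(xs, [8, 6, 4, 2]))
```

Applied to the degrees found at the two ends of every edge, the same function gives an estimate of the assortativity coefficient described earlier in this glossary.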
Preferential Attachment: Preferential attachment means that the more connected a node is, the more likely it is to receive new links. Nodes with higher degree have a stronger ability to grab links added to the network.
Random Graphs: A random graph is a graph that is generated by some random process.
Reciprocity: Reciprocity is the probability that a pair of vertices in a directed network are connected to each other by directed edges in both directions.
Regular Equivalence: Two nodes are said to be regularly equivalent if they have the same profile of ties with other nodes that are also regularly equivalent.
Resilience: Resilience is the ability of a network to retain its connectivity when some of its vertices are removed.
Scale-Free Network: The defining characteristic of scale-free networks is that their degree distribution follows the Yule–Simon distribution, a power-law relationship defined by $p_k \sim k^{-\gamma}$.
SIR Epidemic Model: SIR (Susceptible-Infected-Recovered/Removed) is a model of disease spread where individuals are susceptible to a disease, potentially contract the disease, and then recover without becoming susceptible any further. The removed class can also include individuals who die of the disease.
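As a rough illustration of the SIR entry above, the Python sketch below integrates the standard well-mixed (mean-field) SIR equations with a simple Euler scheme. The glossary itself gives no equations, so the rate parameters beta and gamma, the time step, and the initial fractions are all assumptions chosen for this example.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt, steps):
    """Euler integration of dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I,
    dR/dt = gamma*I, with S, I, R given as population fractions."""
    s, i, r = s0, i0, r0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt   # susceptible -> infected
        new_recoveries = gamma * i * dt      # infected -> recovered/removed
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Hypothetical parameters: infection rate 0.3, recovery rate 0.1
trajectory = simulate_sir(0.3, 0.1, 0.99, 0.01, 0.0, dt=0.1, steps=1000)
print(trajectory[-1])
```

Because recovered individuals never return to the susceptible class, S can only decrease and R can only increase, while S + I + R stays constant.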
SIS Epidemic Model: SIS (Susceptible-Infected-Susceptible) is a model of disease spread where individuals are susceptible to a disease, potentially contract the disease, and are once again susceptible as soon as they recover.
Small-World Network: A small-world network is a type of mathematical graph in which most nodes are not neighbors of one another, but most nodes can be reached from every other node by a small number of hops or steps. Such networks show a large clustering coefficient value and a small average shortest path distance.
Social Network: A social network is a social structure made of nodes that are tied by one or more specific types of interdependency, such as values, visions, ideas, financial exchange, friends, kinship, dislike, conflict, trade, web links, sexual relations, disease transmission (epidemiology), or airline routes.
Strongly Connected Components: A strongly connected component of a directed graph G is a maximal set of vertices C ⊆ V such that, for every pair of vertices u and v in C, there is a directed path from u to v and a directed path from v to u.
Structural Equivalence: Two nodes are said to be structurally equivalent if they have the same relationships to all other nodes.
Structural Holes: Structural holes are nodes that separate non-redundant sources of information; that is, they act as a bridge between two networks that are not directly linked.
Technological Networks: Technological networks are man-made networks designed typically for distribution of some commodity or resource, such as electricity or information.
Vertex Connectivity: The connectivity of G, κ(G), is the minimum size of a vertex set S such that G − S is disconnected or has only one vertex.
Vertex Cutset: A vertex cutset of a graph G is a set S, a subset of V(G), such that G − S has more than one component.
Weakly Connected Components: A weakly connected component is a maximal subgraph of a directed graph such that, for every pair of vertices u, v in the subgraph, there is an undirected path between u and v, that is, a path that exists when the directions of the edges are ignored.
Zipf’s Law: Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
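A rank-frequency table of the kind referred to in the Zipf's law entry can be built with a few lines of Python. The sketch below uses naive whitespace tokenization and a made-up sample text, both of which are simplifying assumptions; real corpora need proper tokenization.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    # Sort by decreasing frequency, breaking ties alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

sample = "a a a a b b c"   # toy text, far too small to test the law
for rank, word, freq in rank_frequency(sample):
    print(rank, word, freq, rank * freq)
```

Under Zipf's law the product of rank and frequency stays roughly constant across the table, which is why that product is printed in the last column.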
Index
C. elegans, 4 E. coli, 74 ADIOS, 153 adjacency matrix, 98, 226 Akaike information criterion, 211 antonymy, 149 Apache, 200 apoptosis, 19, 27, 31 assembly model, 60 attachment kernel, 220, 222, 229–231, 234 attack, 258 authority, 159 B-cell antigen receptor, 7 bandwidth, 219, 224, 225, 227, 234 Barabási–Albert model, 194 Bayesian network, 45, 46 Belousov–Zhabotinsky reaction, 11 bifurcation point, 43 binding motif, 39 biochemical reaction, 38 biological systems, 35 bistability, 86 blogosphere, 159 Boolean model, 20, 23, 25, 31 network, 45 rules, 23 cascade model, 60 cells, 35 cellular system, 36, 73, 90
child concepts, 147 Chinese Whispers algorithm, 157 chlamydia, 97 clustering coefficient, 119, 207, 218, 257, 277 coevolution, 137, 139 community, 58 matrix, 60 structure, 135, 138, 287 competition, see ecological interaction complex adaptive system, 145 complexity science, 133, 135 compositional semantics, 145 computer viruses, 260 condition specific, 42 configuration model, 232 context vectors, 156 core lexicon, 151 cost, 77 function, 77 cross talk, 78 deep sequencing, 48 degradation, 81 degree, 222, 232 correlations, 220, 230–232, 235 distribution, 136, 203, 217, 234, 256, 275 cumulative, 223, 232 excess, 222, 232 exponential, 232 Poisson, 223, 227, 228, 234 power-law, 217–219, 224, 232 Weibull, 218
free-excess, 240 heterogeneity, 133, 135 deletion kernel, 220, 230, 231 dendrogram, 153 diameter, 277 differential equation, 36, 46 directed tree, 108 disassortative network, 5 disease, 36 distinctive features, 154 distributional hypothesis, 156 DNA, 73 microarray, 42 dynamic modeling, 46 dynamical, 5 eccentricity, 190 ecological interaction, 58 ammensalism, 59 commensalism, 59 competition, 59 mutualism, 59 parasitism, 59 predation, 58 symbiosis, 59 edge duplication, 123 pruning, 174 eigenfunction, 121–124 eigenvalue, 119–125, 128 eigenvector centrality, 98, 159 electric current, 45 elementary mode, 46 entangled, 86 epidemics, 9, 97, 244, 247 models, 261 epigenetic, 83 eukaryotes, 86 evolutionary model, 60 explanatory variables, 209 exponential random graph models, 200, 209 expression data, 42 factor graph, 45, 46 failures, 258 false negative, 41 positive, 41
fault tolerance, 258 feature economy, 154 feedback loops, 74 female sex workers, 97 fixed point, 81 flooding, 265 flux balance analysis, 46 food webs, 9, 58 game theory, 135, 137 gene, 35 generating function, 222, 234 genome, 35 giant component, 229, 242 glial cells, 11 global structure, 147 gonorrhea, 97 grammar dependency, 151 phrase structure, 151 tree-adjoining, 151 graph visualization, 99 bipartite, 99, 120, 121, 128 complete, 120 function call, 200 neighbour-based, 174 random, 122 regions, 97 second order co-occurrence, 177 sentence-based, 174 small-world, 167 steepest-ascent, 98 topographic analysis, 99 graphlet, 211 habitat fragmentation, 68 Heaviside step function, 230 hierarchical structure, 146 higher order transformation, 178 histones, 83 HITS, 159 HIV, 97 condom use, 103 Holme–Kim model, 194 holonymy, 149 homeostasis, 87 homosexual, 108
hub, 159 HyperLex, 158 hyperlink, 159 hypernymy, 149 hyponymy, 149 IκB, 81 implicational hierarchies, 160 indexing, 189–193, 195, 196 inflammation, 80 information, 189, 197 inhibitor, 81 interaction strength, 41, 61 Internet, 197, 253 intra-cellular signaling, 7 k-core, 201 kernel lexicon, 175 keystone species, 60 kinase, 8 cascade, 45 substrate cascade, 38 kinetic modeling, 45 kinetics, 46 Laplacian, 7 algebraic graph Laplacian, 120 graph Laplacian, 117 normalized Laplacian, 120 Leipzig Corpora Collection, 168 letter frequency distribution, 169 lexical spectrum, 168 linguistic systems, 145 universals, 160 local structure, 147 logical model, 45 Lyapunov exponent, 125 machine learning, 38, 41 macroscopic, 145 mass spectrometry, 44 maximum likelihood, 210 May–Wigner theorem, 12 mental lexicon, 146 meronymy, 149 mesoscopic, 135, 145 metabolic, 80 microscopic, 145
minimal cut set, 47 modeling, 135, 137 modular structure, 136 motif, 80, 122 collector, 89 consumer, 88 duplication, 122, 123 fashion, 89 joining, 122 socialist, 87 multistability, 74, 80 mutualism, see ecological interaction natural language processing, 167 negative feedback, 80 network adaptive, 137 assortative, 5 clustered, 237 co-authorship, 129 co-expression, 43 dynamic, 140 e-mail interchange, 127 electronic circuit, 130 equilibrium, 278 food web, 127 gene-gene, 38 Internet, 126 jazz band, 129 logical, 38 metabolic, 38, 46, 126 modular, 9 neuronal, 9, 127 peer-to-peer, 253 phonological neighborhood, 148 power-grid, 129 processes, 135 protein contact, 6 protein-protein interaction, 38, 126 protein-RNA, 39 random, 4 randomized, 76 reactive, 219 regular, 4 regulatory, 38 scale-free, 119, 259 sex, 97 signaling, 45 small-world, 9
social, 134–136, 138, 140 star, 14 static, 138 structured P2P, 263 superpeer, 268 syntactic dependency, 151 technological, 253 transcriptional, 43 transportation, 283 unstructured and decentralized P2P, 263 weblog, 127 word co-occurrence, 167 word collocation, 150 word-adjacency, 126 neurons, 11 NF-κB, 80 niche model, 60 node deletion, 219, 229 preferential, 230, 232 random, 219, 221, 227, 230 targeted, 230, 232 duplication, 119, 125 noise, 40 non-equilibrium network, 279 non-local diffusion, 10 nucleosome, 83 open source software, 199 opinion dynamics, 135, 138 orthographic similarity, 149 oscillations, 29, 31, 74, 80 oscillatory behaviour, 27 PageRank, 159 parasitism, see ecological interaction Pareto’s law, 168 peer churn, 268 peer-to-peer, 197, 218, 219, 223, 227, 230, 234 percolation theory, 280 Petri net, 45, 46 PhoNet Phoneme-Phoneme Network, 154 phonological similarity, 147 phylogenetic profile, 38 relations, 153
piecewise linear, 25, 31 PlaNet Phoneme-Language Network, 154 Polya–Cheeger constant, 124 polysemy, 149 population dynamics, 13 positive feedback, 74 posttranscriptional process, 44 power law, 257 distribution, 167 in language-related areas, 173 two-regime, 151 preferential attachment, 118, 119, 217, 218, 279 prey prey-predator, 63 prey-preference, 65 procedural software, 200 propagation cost, 205 protein, 35, 73 activation, 42 complex, 37, 38 localization, 42 modification, 45 protein-DNA interaction, 39, 43 turnover, 44 qualitative dynamics, 32 properties, 20 quality assessment, 40 random matrix, 12 walk, 225, 266 biased, 229 rank-degree distribution, 151 rate equation, 219, 221, 230–232, 235 recency effect, 150 recursive syntax, 145 regulatory, 80 rich-get-richer principle, 118 robustness, 13 saturated degradation, 81 scale-free, 135, 137 small-world graphs in language, 173 science of networks, 133, 140 search time, 219, 224–227, 234
search tree, 191 searching techniques, 264 self-organization, 145 semantic similarity, 147 sentence frequency, 172 signal propagation, 74 signals, 74 simulated annealing, 43 SIS model, 139 small-world, 135, 207, 257 social dynamics, 134, 135 groups, 135 media, 136 network analysis, 133, 135, 140 networking, 136 networks, 134–136, 138, 140 phenomena, 135, 137 structure, 138 socio-technological system, 253, 254 software engineering, 200 systems, 199 sound inventory, 153 spectral gap, 124 plot, 123, 125, 128 spectrum, 120–122, 125, 128, 278 SpellNet, 149 spiky, 81 spiral waves, 11 spreading, 135, 139 square lattices, 194 stability, 5 state change, 44, 46 steady state, 46 stimulus, 148 stoichiometry, 38, 46 structure discovery, 167
sub-lexical units, 146 symbiosis, see ecological interaction synchronization, 125 solution, 125 syntactic similarity, 147 synthetic lethal, 39 text summarization, 160 time course data, 43 time dependent, 42 time scales, 138 time-lagged correlation, 43 topology, 79 transcription factor, 41, 80 factor binding, 43 regulatory cascade, 39 translation, 39, 44 transmission probability, 103 tree, 120 treebanks, 151 triangle duplication, 123 trophic level, 58 typologies, 160 UCLA Phonological Segment Inventory Database, 154 unsupervised induction, 150 Watts–Strogatz model, 6 weak spot, 46 webpages, 159 word N -gram frequency, 170 word co-occurrence, 174 word sense disambiguation, 157 World Wide Web, 253 Zipf’s law, 168 Zipfian distribution, 168