Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5498
Lee Giles Marc Smith John Yen Haizheng Zhang (Eds.)
Advances in Social Network Mining and Analysis Second International Workshop, SNAKDD 2008 Las Vegas, NV, USA, August 24-27, 2008 Revised Selected Papers
13
Volume Editors Lee Giles John Yen Pennsylvania State University, College of Information Science and Technology University Park, PA 16802, USA E-mail: {giles, jyen}@ist.psu.edu Marc Smith Microsoft Research, One Microsoft Way, Redmond, WA 98002, USA E-mail:
[email protected] Haizheng Zhang Amazon.com., Seattle, WA, USA E-mail:
[email protected] Library of Congress Control Number: 2010931753 CR Subject Classification (1998): H.3, H.4, I.2, H.2.8, C.2, H.2 LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues ISSN ISBN-10 ISBN-13
0302-9743 3-642-14928-6 Springer Berlin Heidelberg New York 978-3-642-14928-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
This year’s volume of Advances in Social Network Analysis contains the proceedings for the Second International Workshop on Social Network Analysis (SNAKDD 2008). The annual workshop co-locates with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). The second SNAKDD workshop was held with KDD 2008 and received more than 32 submissions on social network mining and analysis topics. We accepted 11 regular papers and 8 short papers. Seven of the papers are included in this volume. In recent years, social network research has advanced significantly, thanks to the prevalence of the online social websites and instant messaging systems as well as the availability of a variety of large-scale offline social network systems. These social network systems are usually characterized by the complex network structures and rich accompanying contextual information. Researchers are increasingly interested in addressing a wide range of challenges residing in these disparate social network systems, including identifying common static topological properties and dynamic properties during the formation and evolution of these social networks, and how contextual information can help in analyzing the pertaining social networks. These issues have important implications on community discovery, anomaly detection, trend prediction and can enhance applications in multiple domains such as information retrieval, recommendation systems, security and so on. The second SNAKDD workshop focused on knowledge discovery and data mining in social networks, such as contextual community discovery, link analysis, the growth and evolution of social networks, algorithms for large-scale graphs, techniques that can be used for recovering and constructing social networks from online social systems, search on social networks, multi-agent-based social network simulation, trend prediction of social network evolution, and related applications in other domains such as information retrieval and security. The workshop was concerned with inter-disciplinary and cross-domain studies spanning a variety of areas in computer science including graph and data mining, machine learning, computational organizational and multi-agent studies, information extraction and retrieval, and security, as well as other disciplines such as information science, and social science. In the first paper “Leveraging Label-Independent Features for Classification in Sparsely Labeled Networks: An Empirical Study,” Brian Gallagher and Tina Eliassi-Rad study the problem of within-network classification in sparsely labeled networks. The authors present an empirical study and show that the use of LI features produces classifiers that are less sensitive to specific label assignments and can lead to significant performance improvement.
VI
Preface
In the second paper “Community Detection Using a Measure of Global Influence,” Rumi Ghosh and Kristina Lerman, define “influence” as the number of paths, of any length, that exist between two nodes and argue that this gives a better measure of network connectivity. The authors use the influence metric to partition a network into groups or communities by looking for regions of the network where nodes have more influence over each other than over nodes outside the community. The third paper, “Communication Dynamics of Blog Networks,” by Mark Goldberg, Malik Magdon-Ismil, Stephen Kelley, Konstantin Mertsalov, and William Wallace, studies the communication dynamics of Blog networks by exploring the Russian section of LiveJournal. The two fundamental questions that this paper is concerned with include (1) what models adequately describe such dynamic communication behavior; and (2) how does one detect changes in the nature of the communication dynamics. The paper leverages stable statistics in their research in disclosing the dynamics of the networks and characterizing the locality properties. In the fourth paper, “Finding Spread Blockers in Dynamic Networks,” Habiba, Yintao Yu, Tanya Berger-Wolf, and Jared Saia extend standard structural network measures to dynamic networks. The authors also compare the blocking ability of individuals in the order of ranking by the new dynamic measures. The authors found that overall, simple ranking according to a nodes static degree, or the dynamic version of a nodes degree, performed consistently well. Surprisingly the dynamic clustering coefficient seems to be a good indicator, while its static version performs worse than the random ranking. The fifth paper, “Social Network Mining with Nonparametric Relational Models,” by ZZhao Xu, Volker Tresp, Achim Rettinger, and Kristian Kersting, discusses how the infinite hidden relational model (IHRM) can be used to model and analyze social networks, where each edge is associated with a random variable and the probabilistic dependencies between variables are specified by the model based on the relational structure. The hidden variables are able to transport information such that non-local probabilistic dependencies can be obtained. The IHRM provides effective relationship prediction and cluster analysis for social networks. The experiments demonstrate that this model can provide good prediction accuracy and capture the inherent relations among social actors. In the sixth paper, “Using Friendship Ties and Family Circles for Link Prediction,” the authors (Elena Zheleva, Lise Getoor, Jennifer Golbeck, and Ugur Kuter) investigate how networks can be overlaid and propose a feature taxonomy for link prediction. The authors show that the accuracy of link prediction can be improved when there are tightly-knit family circles in a social network. Their experiments demonstrate significantly higher prediction accuracy compared to using traditional features such as descriptive node attributes and structural features. The last paper, “Information Theoretic Criteria for Community Detection” by Karl Branting, studies the resolution limit problem with two compressionbased algorithms that were designed to overcome such limits. The author
Preface
VII
identifies the aspect of each approach that is responsible for the resolution limit and proposes a variant, SGE, that addresses this limitation. The paper demonstrates on three artificial data sets that (1) SGE does not exhibit a resolution limit on graphs in which other approaches do, and that (2) modularity and the compression-based algorithms, including SGE, behave similarly on graphs not subject to the resolution limit. We would like to thank the authors of all submitted papers for both the joint workshop and this proceedings volume. We are further indebted to the Program Committee members for their rigorous and timely reviewing. They allowed us to make this workshop a major success. Lee Giles Marc Smith John Yen Haizheng Zhang
Organization
Program Chairs Lee Giles Marc Smith John Yen Haizheng Zhang
Program Committee Lada Adamic Aris Anagnostopoulos Arindam Banerjee Tanya Berger-Wolf Yun Chi Aaron Clauset Isaac Councill Tina Eliassi-Rad Lise Getoor Mark Goldberg Larry Holder Andreas Hotho Gueorgi Kossinets Kristina Lerman Wei Li Yi Liu Ramesh Nallapati Jennifer Neville Cheng Niu Dou Shen Bingjun Sun Jie Tang Andrea Tapia Alessandro Vespigani Xuerui Wang Michael Wurst Xiaowei Xu
Pennsylvania State University, USA Microsoft, USA Pennsylvania State University, USA Amazon.com, USA
X
Organization
Referees Vladimir Barash Mustafa Bilgic Matthias Broecheler Guihong Cao Bin Cao
Sanmay Das Anirban Dasgupta ˜ Robert JAschke Liu Liu Galileo Namata
Evan Xiang Xiaowei Xu Limin Yao Jing Zhang Yi Zhang
Table of Contents
Leveraging Label-Independent Features for Classification in Sparsely Labeled Networks: An Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Brian Gallagher and Tina Eliassi-Rad
1
Community Detection Using a Measure of Global Influence . . . . . . . . . . . . Rumi Ghosh and Kristina Lerman
20
Communication Dynamics of Blog Networks . . . . . . . . . . . . . . . . . . . . . . . . . Mark Goldberg, Stephen Kelley, Malik Magdon-Ismail, Konstantin Mertsalov, and William (Al) Wallace
36
Finding Spread Blockers in Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . Habiba, Yintao Yu, Tanya Y. Berger-Wolf, and Jared Saia
55
Social Network Mining with Nonparametric Relational Models . . . . . . . . . Zhao Xu, Volker Tresp, Achim Rettinger, and Kristian Kersting
77
Using Friendship Ties and Family Circles for Link Prediction . . . . . . . . . . Elena Zheleva, Lise Getoor, Jennifer Golbeck, and Ugur Kuter
97
Information Theoretic Criteria for Community Detection . . . . . . . . . . . . . . L. Karl Branting
114
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
131
Leveraging Label-Independent Features for Classification in Sparsely Labeled Networks: An Empirical Study Brian Gallagher and Tina Eliassi-Rad Lawrence Livermore National Laboratory P.O. Box 808, L-560, Livermore, CA 94551, USA {bgallagher,eliassi}@llnl.gov
Abstract. We address the problem of within-network classification in sparsely labeled networks. Recent work has demonstrated success with statistical relational learning (SRL) and semi-supervised learning (SSL) on such problems. However, both approaches rely on the availability of labeled nodes to infer the values of missing labels. When few labels are available, the performance of these approaches can degrade. In addition, many such approaches are sensitive to the specific set of nodes labeled. So, although average performance may be acceptable, the performance on a specific task may not. We explore a complimentary approach to within-network classification, based on the use of label-independent (LI ) features – i.e., features calculated without using the values of class labels. While previous work has made some use of LI features, the effects of these features on classification performance have not been extensively studied. Here, we present an empirical study in order to better understand these effects. Through experiments on several real-world data sets, we show that the use of LI features produces classifiers that are less sensitive to specific label assignments and can lead to performance improvements of over 40% for both SRL- and SSL-based classifiers. We also examine the relative utility of individual LI features; and show that, in many cases, it is a combination of a few diverse network-based structural characteristics that is most informative. Keywords: Statistical relational learning; semi-supervised learning; social network analysis; feature extraction; collective classification.
1
Introduction
In this paper, we address the problem of within-network classification. We are given a network in which some of the nodes are “labeled” and others are “unlabeled” (see Figure 1). Our goal is to assign the correct labels to the unlabeled nodes from among a set of possible class labels (i.e., to “classify” them). For example, we may wish to identify cell phone users as either ‘fraudulent’ or ‘legitimate.’ Cell phone fraud is an example of an application where networks are often very sparsely labeled. We may have a handful of known fraudsters and a handful L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 1–19, 2010. c Springer-Verlag Berlin Heidelberg 2010
2
B. Gallagher and T. Eliassi-Rad
Fig. 1. Portion of the MIT Reality Mining call graph. We know the class labels for the black (dark) nodes, but do not have labels for the yellow (light) nodes.
of known legitimate users, but for the vast majority of users, we do not know the correct label. For such applications, it is reasonable to expect that we may have access to labels for fewer than 10%, 5%, or even 1% of the nodes. In addition, cell phone networks are generally anonymized. That is, nodes in these networks often contain no attributes besides class labels that could be used to identify them. It is this kind of sparsely labeled, anonymized network that is the focus of this work. Put another way, our work focuses on univariate within-network classification in sparsely labeled networks. Relational classifiers have been shown to perform well on network classification tasks because of their ability to make use of dependencies between class labels (or attributes) of related nodes [1]. However, because of their dependence on class labels, the performance of relational classifiers can substantially degrade when a large proportion of neighboring instances are also unlabeled. In many cases, collective classification provides a solution to this problem, by enabling the simultaneous classification of a number of related instances [2]. However, previous work has shown that the performance of collective classification can also degrade when there are too few labels available, eventually to the point where classifiers perform better without it [3]. In this paper, we explore another source of information present in networks that does not depend on the availability or accuracy of node labels. Such information can be represented using what we call label-independent (LI ) features. The main contribution of this paper is an in-depth examination of the effects of label-independent features on within-network classification. In particular, we address the following questions: 1. Can LI features make up for a lack of information due to sparsely labeled data? Answer: Yes.
Leveraging Label-Independent Features
3
2. Can LI features provide information above and beyond that provided by the class labels? Answer: Yes. 3. How do LI features improve classification performance? Answer: Because they are less sensitive to the specific labeling assigned to a graph, classifiers that use label-independent features produce more consistent results across prediction tasks. 4. Which LI features are the most useful ? Answer: A combination of a few diverse network-based structural characteristics (such as node and link counts plus betweenness) is the most informative. Section 2 covers related work. Section 3 describes our approach for modeling label-independent characteristics of networks. Sections 4 and 5, respectively, present our experimental design and results. We conclude the paper in Section 6.
2
Related Work
In recent years, there has been a great deal of work on models for learning and inference in relational data (i.e., statistical relational learning or SRL) [3,4,5,6,7]. All SRL techniques make use of label-dependent relational information. Some use label-independent information as well. Relational Probability Trees (RPTs) [8] use label-independent degree-based features (i.e., neighboring node and link counts). However, existing RPT studies do not specifically consider the impact of label-independent features on classifier performance. Perlich and Provost [9] provide a nice study on aggregation of relational attributes, based on a hierarchy of relational concepts. However, they do not consider label-independent features. Singh et al. [10] use descriptive attributes and structural properties (i.e., node degree and betweenness centrality) to prune a network down to its ‘most informative’ affiliations and relationships for the task of attribute prediction. They do not use label-independent features directly as input to their classifiers. Neville and Jensen [11] use spectral clustering to group instances based on their link structure (where link density within a group is high and between groups is low). This group information is subsequently used in conjunction with attribute information to learn classifiers on network data. There has also been extensive work on overcoming label sparsity through techniques for label propagation. This work falls into two research areas: (1) collective classification [2,3,7,12,13,14] and (2) graph-based semi-supervised learning (SSL) [15,16]. Previous work confirms our observation that the performance of collective classification can suffer when labeled data is very sparse [3]. McDowell et al. [14] demonstrate that “cautious” collective classification procedures produce better classification performance than “aggressive” ones. They recommend only propagating information about the top-k most confident predicted labels.
4
B. Gallagher and T. Eliassi-Rad
The problem of within-network classification can be viewed as a semi-supervised learning problem. The graph-based approaches to semi-supervised learning are particularly relevant here. In their study, Macskassy and Provost [7] compare the SSL Gaussian Random Field (GRF) model [15] to a SRL weighted-vote relational neighbor (wvRN) model that uses relaxation labeling for collective classification (wvRN+RL). They conclude that the two models are nearly identical in terms of accuracy, although GRF produces slightly better probability rankings. Our results with wvRN+RL and GRF are consistent with this conclusion. The “ghost edge” approach of Gallagher et al. [17] combines aspects of both SRL and SSL, and compares favorably with both wvRN+RL and GRF.
3
Label-Dependent vs. Label-Independent Features
Relational classifiers leverage link structure to improve performance. Most frequently, links are used to incorporate attribute information from neighboring nodes. However, link structure can also be used to extract structural statistics of a node (e.g., the number of adjacent links). We can divide relational features into two categories: label-dependent and label-independent. Label-dependent (LD) features use both structure and attributes (or labels) of nodes in the network. The most commonly used LD features are aggregations of the class labels of nodes one link away (e.g., the number of neighbors with the class label ‘fraudulent’). LD features are the basis for incorporating relational information in many SRL classifiers. Label-independent (LI) features are calculated using network structure, but not attributes or class labels of nodes. An example of a simple LI feature is the degree of a node (i.e., the number of neighboring nodes). Of course, we assume that there is an underlying statistical dependency between the class label of a node and its LI features. Otherwise, LI features would be of no value in predicting a node’s class. However, because they are calculated based only on network structure, LI feature values do not directly depend on the current class label assignments of nodes in a network. This means that, unlike LD features, LI features may be calculated with perfect accuracy regardless of the availability of class label information and are impervious to errors in class label assignments. 3.1
Extracting Label-Independent Features
We consider four LI features on nodes: (1) the number of neighboring nodes, (2) the number of incident links, (3) betweenness centrality, and (4) clustering coefficient. Features 1 and 2, respectively, are node-based and link-based measures of degree. Note that in multigraphs, these two are different. Betweenness centrality measures how “central” a node is in a network, based on the number of shortest paths that pass through it. Clustering coefficient measures neighborhood strength, based on how connected a node’s neighbors are to one another. For details, we refer the reader to a study by Mark Newman [18]. The success of network-based structural characteristics as predictors of class relies on two assumptions. First, members of different classes play different roles
Leveraging Label-Independent Features
5
in a network. Second, these roles can be differentiated by structural characteristics. The second assumption is met in many cases. For instance, “popular” nodes can be identified by degree and “important” nodes can be identified by centrality measures. Whether the first assumption is met depends on the class label. Suppose that executives tend to be more popular and central than an average employee in a company’s communication network, and that employees with a particular job title tend to have similar popularity and centrality, regardless of department. Then, we would expect structural features to be more useful for identifying executives than members of a particular department.
4
Experimental Design
We have designed our experiments to answer the following questions: 1. Can LI features make up for a lack of information due to sparsely labeled data? 2. Can LI features provide information above and beyond that provided by the class labels? 3. How do LI features improve classification performance? 4. Which LI features are the most useful ? To avoid confounding effects as much as possible, we focus on univariate binary classification tasks, and extend simple classifiers to incorporate label-independent features. 4.1
Classifiers
On each classification task, we ran ten individual classifiers: four variations of a link-based classifier [5], four variations of a relational neighbor classifier [19,7], and two variations of the Gaussian Random Field classifier [15]. We describe each of them below. nLB is the network-only link-based classifier [5]. It uses logistic regression to model a node’s class given the classes of neighboring nodes. To generate features, a node’s neighborhood is summarized by the link-weighted count of each class label. For example, given a binary classification task, two features will be generated: (1) the link-weighted count of a node’s neighbors with the positive class and (2) the linkweighted count of a node’s neighbors with the negative class. nLBLI is composed of two logistic regression models: (1) nLB with its two LD features and (2) logLI, which uses the four LI features (see Section 3.1). The nLBLI classifier calculates the probability of each class as: P (C) = w · PnLB (C) + (1 − w) · PlogLI (C)
(1)
where w is calculated based on the individual performance of nLB and logLI over 10-fold cross validation on the training data. We calculate area under the
6
B. Gallagher and T. Eliassi-Rad
ROC curve (AUC) for each fold and then obtain an average AUC score for each classifier, AU CLD and AU CLI . We then set w as follows: w=
AU CLD AU CLD + AU CLI
(2)
nLB+ICA uses the nLB classifier, but performs collective classification using the ICA algorithm described in Section 4.2. nLBLI+ICA uses the nLBLI classifier, but performs collective classification using the ICA algorithm described in Section 4.2. wvRN is the weighted-vote relational neighbor classifier [19,7]. It is a simple non-learning classifier. Given a node i and a set of neighboring nodes, N , the wvRN classifier calculates the probability of each class for node i as: 1 wi,j if Ci = c P (Ci = c|N ) = (3) Li 0 otherwise j∈N where wi,j is the number of links between nodes i and j and Lj is the number of links connecting node i to labeled nodes. When node i has no labeled neighbors, we use the prior probabilities observed in the training data. wvRNLI combines the LI features with wvRN in the same way that nLBLI does with nLB (i.e., using a weighted sum of wvRN and logLI). wvRN+ICA uses the wvRN classifier, but performs collective classification using the ICA algorithm described in Section 4.2. wvRNLI+ICA uses wvRNLI, but performs collective classification using the ICA algorithm described in Section 4.2. GRF is the semi-supervised Gaussian Random Field approach of Zhu et al. [15]. We made one modification to accommodate disconnected graphs. Zhu computes the graph Laplacian as L = D − cW , where c = 1. We set c = 0.9 to ensure that L is diagonally dominant and thus invertible. We observed no substantial impact on performance in connected graphs due to this change. GRFLI combines the LI features with GRF as nLBLI does with nLB (i.e., using a weighted sum of GRF and logLI). We also tried the approach of Zhu et al. [15], where one attaches a “dongle” node to each unlabeled node and assigns it a label using the external LI classifier. The transition probability from node i to its dongle is η and all other transitions from i are discounted by 1 − η . This approach did not yield any improvements. So, we use the weighted sum approach (i.e., Equation 1) for consistency. 4.2
Collective Classification
To perform collective classification, we use the iterative classification algorithm (ICA) [7], with up to 1000 iterations. We chose ICA because (1) it is simple, (2) it
Leveraging Label-Independent Features
7
performs well on a variety of tasks, and (3) it tends to converge more quickly than other approaches. We also performed experiments using relaxation labeling (RL) [7]. Our results are consistent with previous research showing that the accuracy of wvRN+RL is nearly identical to GRF, but GRF produces higher AUC values [7]. We omit these results due to the similarity to GRF. For a comparison of wvRN+RL and GRF on several of the same tasks used here, see Gallagher et al. [17]. Overall, ICA slightly outperforms RL for the nLB classifier. Several of our data sets have large amounts of unlabeled data since ground truth is simply not available. In these cases, there are two reasonable approaches to collective classification: (1) perform collective classification over the entire graph and (2) perform collective classification over the core set of nodes only (i.e., nodes with known labels). In our experiments, attempting to perform collective classification over the entire graph produced results that were often dramatically worse than the noncollective base classifier. We hypothesize that this is due to an inadequate propagation of known labels across vast areas of unlabeled nodes in the network. Note that for some of our experiments, fewer than 1% of nodes are labeled. Other researchers have also reported cases where collective classification hurts performance due to a lack of labeled data [3,11]. We found that the second approach (i.e., using a network of only the core nodes) outperformed the first approach in almost all cases, despite disconnecting the network in some cases. Therefore, we report results for the second approach only. 4.3
Experimental Methodology
Each data set has a set of core nodes for which we know the true class labels. Several data sets have additional nodes for which there is no ground truth available. Classifiers have access to the entire graph for both training and testing. However, we hide labels for 10% − 90% of the core nodes. Classifiers are trained on all labeled core nodes and evaluated on all unlabeled core nodes. For each proportion labeled, we run 30 trials. For each trial, we choose a class-stratified random sample containing 100 × (1.0 − proportionlabeled)% of the core nodes as a test set and the remaining core nodes as a training set. Note that a single node will necessarily appear in multiple test sets. However, we carefully choose test sets to ensure that each node in a data set occurs in the same number of test sets over the course of our experiments; and therefore, carries the same weight in the overall evaluation. Labels are kept on training nodes and removed from test nodes. We use identical train/test splits for each classifier. For more on experimental methodologies for relational classification, see Gallagher and Eliassi-Rad [20]. We use the area under the ROC curve (AUC) to compare classifiers because it is more discriminating than accuracy. In particular, since most of our tasks have a large class imbalance (see Section 4.4), accuracy cannot adequately differentiate between classifiers.
8
B. Gallagher and T. Eliassi-Rad
4.4
Data Sets
We present results on four real-world data sets: political book purchases [21], Enron emails [22], Reality Mining (RM) cellphone calls [23], and high energy physics publications (HEP-TH) from arXiv [24]. Our five tasks are to identify neutral political books, Enron executives, Reality Mining students, Reality Mining study participants, and HEP-TH papers with the topic “Differential Geometry.” Table 1 summarizes the prediction tasks. The Sample column describes the method used to obtain a data sample for our experiments: use the entire set (full ), use a time-slice (time), or sample a continuous subgraph via breadth-first search (BFS ). The Task column indicates the class label we try to predict. The |V |, |L|, and |E| columns indicate counts of total nodes, labeled nodes, and total edges in each network. The P (+) column indicates the proportion of labeled nodes that have the positive class label (e.g., 12% of the political books are neutral). For Enron, Reality Mining students, and HEP-TH, we have labels for only a subset of nodes (i.e., the “core” nodes) and can only train and test our classifiers on these nodes. However, unlabeled nodes and their connections to labeled nodes are exploited to calculate LI features of the labeled nodes. Table 1. Summary of Data Sets and Prediction Tasks Data Set Political Books Enron Reality Mining Reality Mining HEP-TH
5
Sample
Task
|V |
Full Neutral? 105 Time Executive? 9K BFS Student? 1K BFS In Study? 1K BFS Differential Geometry? 3K
|L|
|E|
P (+)
105 1.6K 84 1K 284
441 50K 32K 32K 36K
0.12 0.02 0.62 0.08 0.06
Experimental Results
In this section, we discuss our results. We assess significance using paired t-tests (p-values ≤ 0.05 are considered significant).1 5.1
Effects of Learning Label Dependencies
Figures 2 and 3 show results for statistical relational learning and semi-supervised learning approaches on all of our classification tasks. Supervised learning approaches, like nLB, use labeled nodes as training data to build a dependency model over neighboring class labels. The non-learning wvRN and GRF assume 1
It is an open issue whether the standard significance tests for comparing classifiers (e.g., t-tests, Wilcoxon signed-rank) are applicable for within-network classification, where there is typically some overlap in test sets across trials. It remains to be seen whether the use of such tests produces a bias and the extent of any errors caused by such a bias. This is an important area for future study that will potentially affect a number of published results.
Leveraging Label-Independent Features
Learning Algorithms
9
Non-learning Algorithms
nLB
nLB+ICA
wvRN
wvRN+ICA
nLBLI
nLBLI+ICA
wvRNLI
wvRNLI+ICA
Political Books 0.85
AUC AUC
0.85 0.8
0.8
0.75
0.75
0.7
0.7
0.65
0.65
0.6
0.6
0.55
0.55
0.5
0.5 0.1
0.3
0.5
0.7
0.9
0.1
0.3
0.5
0.7
0.9
0.1
0.3
0.5
0.7
0.9
0.3
0.5
0.7
0.9
0.3
0.5
0.7
0.9
0.3
0.5
0.7
0.9
AUC A UC
Enron Executives 0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4 0.1
0.3
0.5
0.7
0.9
Reality Mining Students 0.95
AUC AUC
0.95 0.9
0.9
0.85
0.85
0.8
0.8
0.75
0.75
0.7
0.7
0.65
0.65
0.6
0.6
0.55
0.55 0.1
0.3
0.5
0.7
0.9
0.1
AUC AUC
Reality Mining Study Participants 0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0
0.1
0.3
0.5
0.7
0.9
0.1
HEP-TH Differential Geometry Papers
AUC AUC
0.85
0.85
0.8
0.8
0.75
0.75
0.7
0.7
0.65
0.65
0.6
0.6
0.55
0.55
0.5
0.5
0.45
0.45 0.1
0.3
0.5
0.7
Proportion of Core Nodes Labeled
Proportion of Core Nodes Labeled
0.9
0.1
Proportion of Core Nodes Labeled
Proportion of Core Nodes Labeled
Fig. 2. Classification results for statistical relational learning approaches on our data sets. For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty of classification tasks, the y-axis scales are not consistent across tasks. However, for a particular classification task, the y-axis scales are consistent across the algorithms shown both in this figure and in Figure 3.
10
B. Gallagher and T. Eliassi-Rad Semi-supervised Learning Algorithms GRF
GRFLI
Political Books
Enron Executives
0.85
0.9
0.8
0.8
AUC
0.75
0.7
0.7 0.65
0.6
0.6
0.5
0.55 0.5
0.4 0.1
0.3
0.5
0.7
0.9
0.1
Proportion of Core Nodes Labeled
0.3
0.5
0.7
0.9
Proportion of Core Nodes Labeled
Reality Mining Students
Reality Mining Study Participants
0.95 0.9
AUC
0.9
0.8
0.85
0.7
0.8
0.6
0.75
0.5 0.4
0.7
0.3 0.65
0.2
0.6
0.1
0.55
0 0.1
0.3
0.5
0.7
0.9
0.1
Proportion of Core Nodes Labeled
0.3
0.5
0.7
0.9
Proportion of Core Nodes Labeled
HEP-TH Differential Geometry Papers 0.85 0.8 0.75
AUC
0.7 0.65 0.6 0.55 0.5 0.45 0.1
0.3
0.5
0.7
0.9
Proportion of Core Nodes Labeled
Fig. 3. Classification results for semi-supervised learning approaches on our data sets. For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty of classification tasks, the y-axis scales are not consistent across tasks. However, for a particular classification task, the y-axis scales are consistent across the algorithms shown both in this figure and in Figure 2.
that class labels of neighboring nodes tend to be the same (i.e., high label consistency). GRF performs well on the Enron and RM student tasks, which have high label consistency between neighbors. On the RM study task, where neighboring labels are inversely correlated (i.e., low label consistency), wvRN and GRF perform poorly, whereas nLB can learn the correct dependencies.
Leveraging Label-Independent Features
5.2
11
Effects of Label-Independent Features
Figures 2 and 3 illustrate several effects of LI features. In general, the performance of the LI classifiers degrades more slowly than that of the corresponding base classifiers as fewer nodes are labeled. At ≤ 50% labeled, the LI features produce a significant improvement in 36 of 45 cases. The exceptions mainly occur for GRF on Enron, RM Student, and HEP-TH, where (in most cases) we have a statistical tie. In general, the information provided by the LI features is able to make up, at least in part, for information lost due to missing labels. Note that there are three separate effects that lower performance as the number of labels decreases. (1) Fewer labels available for inference lead to lower quality LD features at inference time, but do not impact the quality of LI features. (2) Fewer labels at training time mean that (labeled) training examples have fewer labeled neighbors. This impacts the quality of the LD features available at training time and the quality of the resulting model. LI features are not affected. (3) Fewer labels mean less training data. This impacts model quality for both LD and LI features. Note that wvRN and GRF are affected only by 1, since they do not rely on training data. In general, the LI models outperform the corresponding base models, leading to significant improvements in 49 out of 75 cases across all proportions of labeled data. There is only one case where the use of LI features significantly degrades performance: using GRF on the Enron task at 0.3 labeled. The GRF classifier does so well on this task that the LI features simply add complexity without additional predictive information. However, the degradation here is small compared to gains on other tasks. Another effect demonstrated in Figures 2 and 3 is the interaction between LI features and label propagation (i.e., ICA or GRF). In several cases, combining the two significantly outperforms either on its own (e.g., GRFLI on political books and the RM tasks). However, the benefit is not consistent across all tasks. The improved performance due to LI features on several tasks at 90% labeled (i.e., political books, both RM tasks) suggests that LI features can provide information above and beyond that provided by class labels. Recall that political books and RM study are the only data sets fully labeled to begin with. This indicates that LI features may have more general applicability beyond sparsely labeled data. Figure 4 shows the sensitivity of classifiers to the specific nodes that are initially labeled. For each classifier and task, we measure variance in AUC across 30 trials. For each trial, a different 50% of nodes is labeled. ICA has very little impact on the sensitivity of nLB to labeling changes. However, the LI features decrease the labeling sensitivity of nLB dramatically for all but one data set. The results for wvRN are qualitatively similar. LI features also decrease sensitivity for GRF in most cases. Since GRF has low sensitivity to begin with, the improvements are less dramatic. The observed reduction in label sensitivity is not surprising since LI features do not rely on class labels. However, it suggests that LI features make classifiers more stable. So, even in cases where average classifier performance does not increase, we expect an increase in the worst case due to the use of LI features.
12
B. Gallagher and T. Eliassi-Rad
Lableing Sensitivity (50% Labeled) 0.04 HepthArea EnronTitle PoliticalBook RealityMiningInStudy RealityMiningStudent
Variance in AUC
0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 nLB
nLB+ICA
nLBLI
GRF
GRFLI
Classifier
Fig. 4. Sensitivity of classifiers to specific assignments of 50% known labels across data sets
5.3
Performance of Specific LI Features
To understand which LI features contribute to the observed performance gains, we re-ran our experiments using subsets of the LI features. We used logistic regression with different combinations of the four LI features: each alone (4 classifiers), leave one feature out (4 classifiers), degree-based features only (1 classifier), non-degree-based features only (1 classifier), and all features (1 classifier).2 This yields 11 classifiers. We present results for 50% of nodes labeled. Results for other proportions labeled are similar. Figure 5 shows AUC using each LI feature alone vs. all features together. This demonstrates the utility of each feature in the absence of any other information.
Individual Label-Independent Features at 50% Labeled 0.9
AUC
0.8
All Node count
0.7
Link count Betweenness
0.6
Clust. coef. 0.5 0.4 Enron
HEP-TH
P. Books
RM Students
RM Study
Data Set
Fig. 5. Performance of LI features in isolation
Figure 6 shows the increase in AUC due to adding the specified feature to a classifier that already has access to all other LI features. The y-axis is the AUC 2
Degree-based features are node (or neighbor) count and link (or edge) counts. Nondegree-based features are betweenness and clustering coefficient.
Leveraging Label-Independent Features
13
Combined Label-Independent Features at 50% Labeled
0.3 Increase in AUC
0.25
Degree-based
0.2
Non-degree
0.15
Node count
0.1
Link count
0.05
Betweenness
0
Clust. coef.
-0.05 -0.1 Enron
HEP-TH
P. Books
RM Students
RM Study
Data Set
Fig. 6. Performance of LI features in combination. Degree-based features are node and link count. Non-degree features are betweenness and clustering coefficient.
of a classifier that uses all LI features minus the AUC of a classifier that uses all except the specified feature. This demonstrates the power of each feature when combined with the others. All features appear to be useful for some tasks. Clustering coefficient is the least useful overall, improving AUC slightly on two tasks and degrading AUC slightly on three. For all tasks, a combination of at least three features yields the best results. Interestingly, features that perform poorly on their own can be combined to produce good results. On the RM student task, node count, betweenness, and clustering coefficient produce AUCs of 0.57, 0.49, and 0.48 alone, respectively. When combined, these three produce an AUC of 0.78. Betweenness, which performs worse than random (AUC < 0.5) on its own, provides a boost of 0.32 AUC to a classifier using node count and clustering coefficient. For most tasks, performance improves due to using all four LI features. On Enron, however, clustering coefficient appears to mislead the classifier to the point where it is better to use either node or link count individually than to
Enron Executives 0.8 0.75
AUC
0.7 Logistic
0.65
Rand Forest
0.6 0.55 0.5 0.1
0.3
0.5
0.7
0.9
Proportion of Core Nodes Labeled
Fig. 7. Comparison of logistic regression and random forest classifiers with all four LI features
14
B. Gallagher and T. Eliassi-Rad
use all features. This is one case where we might benefit from a more selective classifier. Figure 7 compares logistic regression with a random forest classifier [25], both using the same four LI features. As expected, the random forest is better able to make use of the informative features without being misled by the uninformative ones. To get a feel for why some LI features make better predictors than others, we examine the distribution of each feature by class for each prediction task. Table 2 summarizes these feature distributions by their mean and standard deviation. In general, we expect features that cleanly separate the classes to provide the most predictive power. As mentioned previously, clustering coefficient appears to be the least powerful feature overall for our set of prediction tasks. One possible explanation for clustering coefficient’s general poor performance is that it does not vary enough from node to node; therefore, it does not help to differentiate among instances of different classes. Table 2. Mean and standard deviation (SD) of feature values by class and data set. The larger mean value for each feature (i.e., row) is shown in bold. Data Set/Feature
Mean (SD) for the ‘+’ Class
Mean (SD) for the ‘-’ Class
Political Books
Neutral
Other
Node Count Link Count Betweenness Clust. Coef.
5.8 (3.3) 5.8 (3.3) 0.027 (0.030) 0.486 (0.25)
8.8 (5.6) 8.8 (5.6) 0.019 (0.029) 0.489 (0.21)
Enron
Executive
Other
Node Count Link Count Betweenness Clust. Coef.
22 (27) 61 (100) 0.0013 (0.0037) 0.91 (0.77)
9.6 (20) 25 (66) 0.00069 (0.0025) 1.75 (4.5)
RM Student
Student
Other
Node Count Link Count Betweenness Clust. Coef.
19 (27) 471 (774) 0.027 (0.050) 15 (22)
22 (38) 509 (745) 0.022 (0.056) 8.0 (7.0)
RM Study
In-study
Out-of-study
Node Count Link Count Betweenness Clust. Coef.
18 (30) 418 (711) 0.022 (0.048) 10 (17)
1.4 (2.8) 30 (130) 0.00086 (0.022) 5.8 (51)
HEP-TH
Differential Geometry
Other
Node Count Link Count Betweenness Clust. Coef.
14 (9.0) 14 (9.0) 0.000078 (0.00010) 0.42 (0.19)
21 (26) 21 (26) 0.0011 (0.0056) 0.40 (0.23)
Leveraging Label-Independent Features
15
Feature Variability
Coefficient of Variation
12 10 PoliticalBook
8
EnronTitle 6
RealityMiningInStudy RealityMiningPosition
4
HepthArea
2
clu
st er in gC oe ffi cie nt
ee nn es s be tw
un t ed ge Co
ne ig hb or
Co un t
0
Fig. 8. Degree of variability for each LI feature on each prediction task
Figure 8 shows the degree of variability of each LI feature across the five prediction tasks. To measure variability, we use the coefficient of variation, a normalized measure of the dispersion of a probability distribution. The coefficient of variation is defined as: σ (4) cv(dist) = μ where μ is the mean of the probability distribution dist and σ is the standard deviation. A higher coefficient of variation indicates a feature with more varied values across instances in the data set. The variability of the clustering coefficient appears comparable to the degree features (i.e., node and link count) (see Figure 8). We even observe that the degree of variability of the clustering coefficient for the Enron task is higher than the degree of variability for the neighbor count feature, even though neighbor count provides much more predictive power (see Figure 5). So, clustering coefficient appears to have sufficient variability over the nodes in the graph. However, it is possible that the clustering coefficient exhibits similar variability for nodes of both classes; and thus, still fails to adequately distinguish between nodes of different classes. Therefore, we wish to quantify the extent to which the feature distributions can be separated from one another by class. Figure 9 shows how well each LI feature separates the two classes for each prediction task. We measure class separation by calculating the distance between the empirical distributions of the LI feature values for each class. Specifically, we use the Kolmogorov-Smirnov statistic (K-S) to measure the distance between two empirical (cumulative) distribution functions: D(Fn (x), Gn (x)) = max(|Fn (x) − Gn (x)|)
(5)
We observe from Figure 9 that, on average, the per-class distributions of values for the clustering coefficient are more similar to one another than for other LI features. So, although the clustering coefficient does vary from node to node,
16
B. Gallagher and T. Eliassi-Rad Class Separation
Kolmogorov-Smirnov Distance
0.8 0.7 0.6 PoliticalBook 0.5
EnronTitle
0.4
RealityMiningInStudy RealityMiningPosition
0.3
HepthArea 0.2 0.1
clu
st er in gC oe ffi cie nt
ee nn es s be tw
un t ed ge Co
ne ig hb or
Co un t
0
Fig. 9. Degree of class separation for each LI feature on each prediction task
the values do not differ consistently based on class. Therefore, clustering coefficient has a hard time distinguishing between instances of different classes, and exhibits poor predictive power overall. The exception is on the Reality Mining study-participant task, where we observe a high K-S distance (Figure 9) and a correspondingly high classification performance (Figure 5). In fact, the K-S distances in Figure 9 generally correspond quite well to the classification performance we observe in Figure 5. 5.4
Observations about Our Problem Domains and Data Sets
Table 2 highlights a number of interesting characteristics of our data sets. An examination of these characteristics provide insights into the underlying problem domains. We describe several such insights here. More politically extreme books tend to have higher degree (neighboring nodes and adjacent edges) and clustering coefficient, but lower betweenness than the neutral books. This tells us that there are two very different types of readers represented in our network: (1) party loyalists that tend to have more extreme viewpoints, strong agreement with others inside their party, and strong disagreement with outsiders and (2) political moderates who are more inclined to consider multiple differing perspectives on an issue. Enron executives tend to have higher degree and betweenness, but lower clustering coefficients than others. So, as we would expect, executives maintain more relationships than other employees and are positioned in a place of maximal control over information flow. The lower clustering coefficient suggests that executives maintain ties to multiple communities within the company and are less closely tied to a particular community. Reality Mining students tend to have higher betweenness and clustering coefficient, but lower degree than others. This indicates that students tend to be more cliquey, with many of their communications made within a single strong
Leveraging Label-Independent Features
17
community. It also indicates that students play an important role in keeping information flowing between more distant parts of the network. Reality Mining study participants tend to have higher degree, betweenness, and clustering coefficient than non-participants. These findings may reveal more about how the data were collected than about the underlying problem domain, but they are interesting nonetheless. Because their phones were instrumented with special software, we have information on all calls of study participants. However, we have only a partial view of the calls made and received by non-participants. More specifically, the only calls we observe for non-participants are those to or from a study participant. The result of this is that we end up with a central community of interconnected study participants, surrounded by a large, but diffuse periphery of non-participants. Thus, the participants appear to have more neighbors, higher centrality, and a more closely knit surrounding community. In HEP-TH, differential geometry papers tend to have higher clustering coefficient, but lower degree and betweenness than others topics. This indicates that differential geometry papers play a relatively isolated and peripheral role among high-energy physics papers, at least in our subset of the arXiv data.
6
Conclusion
We examined the utility of label-independent features in the context of withinnetwork classification. Our experiments revealed a number of interesting findings: (1) LI features can make up for large amounts of missing class labels; (2) LI features can provide information above and beyond that provided by class labels alone; (3) the effectiveness of LI features is due, at least in part, to their consistency and their stabilizing effect on network classifiers; (4) no single label-independent feature dominates, and there is generally a benefit to combining a few diverse LI features. In addition, we observed a benefit to combining LI features with label propagation, although the benefit is not consistent across tasks. Our findings suggest a number of interesting areas for future work. These include: – Combining attribute-based (LD) and structural-based (LI) features of a network to create new informative features for node classification. For instance, will the number of short paths to nodes of a certain label or the average path length to such nodes improve classification performance? – Exploring the relationship between attributes and network structure in timeevolving networks, where links appear and disappear and attribute values change over time. For example, in such a dynamic network, could we use a time-series of LI feature values to predict the values of class labels at a future point in time?
Acknowledgments We would like to thank Luke McDowell for his insightful comments. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48 and No. DE-AC52-07NA27344 (LLNL-JRNL-411529).
18
B. Gallagher and T. Eliassi-Rad
References 1. Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. In: Proceedings of the 18th Conference on Uncertainty in AI, pp. 485–492 (2002) 2. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., Eliassi-Rad, T.: Collective classification in network data. AI Magazine 29(3), 93–106 (2008) 3. Neville, J., Jensen, D.: Relational dependency networks. Journal of Machine Learning Research 8, 653–692 (2007) 4. Getoor, L., Friedman, N., Koller, D., Taskar, B.: Learning probabilistic models of link structure. Journal of Machine Learning Research 3, 679–707 (2002) 5. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th International Conference on Machine Learning, pp. 496–503 (2003) 6. Neville, J., Jensen, D., Gallagher, B.: Simple estimators for relational bayesian classifiers. In: Proceedings the 3rd IEEE International Conference on Data Mining, pp. 609–612 (2003) 7. Macskassy, S., Provost, F.: Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8, 935–983 (2007) 8. Neville, J., Jensen, D., Friedland, L., Hay, M.: Learning relational probability trees. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 625–630 (2003) 9. Perlich, C., Provost, F.: Aggregation-based feature invention and relational concept classes. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 167–176 (2003) 10. Singh, L., Getoor, L., Licamele, L.: Pruning social networks using structural properties and descriptive attributes. In: Proceedings of the 5th IEEE International Conference on Data Mining, pp. 773–776 (2005) 11. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group models. In: Proceedings the 5th IEEE International Conference on Data Mining, pp. 322–329 (2005) 12. Chakrabarti, S., Dom, B., Indyk, P.: Enhanced hypertext categorization using hyperlinks. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 307–318 (1998) 13. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational classification. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 593–598 (2004) 14. McDowell, L., Gupta, K., Aha, D.: Cautious inference in collective classification. In: Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pp. 596–601 (2007) 15. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using gaussian fields and harmonic functions. In: Proceedings of the 20th International Conference on Machine Learning, pp. 912–919 (2003) 16. Zhu, X.: Semi-supervised learning literature survey. Technical Report CS-TR-1530, University of Wisconsin, Madison, WI (December 2007), http://pages.cs.wisc.edu/~ jerryzhu/pub/ssl_survey.pdf 17. Gallagher, B., Tong, H., Eliassi-Rad, T., Faloutsos, C.: Using ghost edges for classification in sparsely labeled networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 256–264 (2008)
Leveraging Label-Independent Features
19
18. Newman, M.: The structure and function of complex networks. SIAM Review 45, 167–256 (2003) 19. Macskassy, S., Provost, F.: A simple relational classifier. In: Notes of the 2nd Workshop on Multi-relational Data Mining at KDD 2003 (2003) 20. Gallagher, B., Eliassi-Rad, T.: An examination of experimental methodology for classifiers of relational data. In: Proceedngs of the 7th IEEE International Conference on Data Mining Workshops, pp. 411–416 (2007) 21. Krebs, V.: Books about U.S. politics (2004), http://www.orgnet.com/divided2.html 22. Cohen, W.: Enron email data set, http://www.cs.cmu.edu/~ enron/ 23. Eagle, N., Pentland, A.: Reality mining: sensing complex social systems. Journal of Personal and Ubiquitous Computing 10(4), 255–268 (2006), http://reality.media.mit.edu 24. Jensen, D.: Proximity HEP-TH database, http://kdl.cs.umass.edu/data/hepth/hepth-info.html 25. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
Community Detection Using a Measure of Global Influence Rumi Ghosh and Kristina Lerman USC Information Sciences Institute, Marina del Rey, California 90292 {ghosh,lerman}@isi.edu
Abstract. The growing popularity of online social networks gave researchers access to large amount of network data and renewed interest in methods for automatic community detection. Existing algorithms, including the popular modularity-optimization methods, look for regions of the network that are better connected internally, e.g., have higher than expected number of edges within them. We believe, however, that edges do not give the true measure of network connectivity. Instead, we argue that influence, which we define as the number of paths, of any length, that exist between two nodes, gives a better measure of network connectivity. We use the influence metric to partition a network into groups or communities by looking for regions of the network where nodes have more influence over each other than over nodes outside the community. We evaluate our approach on several networks and show that it often outperforms the edge-based modularity algorithm. Keywords: community, social networks, influence, modularity.
1
Introduction
Communities and social networks have long interested researchers [5,13]. However, one of the main problems faced by the early researchers was the difficulty of collecting empirical data from human subjects [5]. The advent of the internet and the growing popularity of online social networks changed that, giving researchers access to huge amount of social interactions data. This, coupled with the ever increasing computation speed, storage capacity and data mining capabilities, led to the reemergence of interest in the social networks in general, and community detection specifically. Many existing community finding algorithms look for regions of the network that are better connected internally and have fewer connections to nodes outside the community [4]. Graph partitioning methods [7,27], for example, attempt to minimize the number of edges between communities. Modularity maximizationbased methods, on the other hand, identify groups of nodes that have higher than expected number of edges within them [22,21,24,23]. We believe, however, that edges do not give the true measure of network connectivity. We generalize the notion of network connectivity to be the number of paths, of any length, that exist between two nodes (Section 2). We argue that this metric, called influence L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 20–35, 2010. c Springer-Verlag Berlin Heidelberg 2010
Community Detection Using a Measure of Global Influence
21
by sociologists [13], because it measures the ability of one node to affect (e.g., send information to) another, gives a better measure of connectivity between nodes. We use the influence metric to partition a (directed or undirected) network into groups or communities by looking for regions of the network where nodes have more influence over each other than over nodes outside the community. In addition to discovering natural groups within a network, the influence metric can also help identify the most influential nodes within the network, as well as the “weak ties” who bridge different communities. We formalize our approach by describing a general mathematical framework for representing network structure (Section 3). We show that the metric used for detecting communities in random walk models, modularity-based approaches, and influence-based modularity are special cases of this general framework. We evaluate our approach (in Section 4) on the standard data sets used in literature, and find performance at least as good as that of the edge-based modularity algorithm.
2
A Measure of Global Influence
If a network has N nodes and E links it can be graphically represented by G(N, E) where N is the number of vertices and E is the number of edges. Edges are directed; however, if there exists an edge from vertex i to j and also from j to i, it is represented as an undirected edge. A path p is an n-hop path from i to j, if there are n vertices between the i and j along the path. We allow paths to be non-selfavoiding, meaning that the same edge could be traversed more than once. The graph G(N, E) can be represented by an adjacency matrix A, whose elements are defined as Aij = 1 if ∃ an edge from vertex i to j; otherwise, Aij = 0. A is symmetric for undirected graphs. The Oxford English Dictionary defines influence as “the capacity to have an effect on the character, development, or behavior of someone or something, or the effect itself.” The measure of influence that we adopt is along lines of Pool and Kochen [5], who state that “influence in large part is the ability to reach a crucial man through the right channels, and the more the channels in reserve the better.” This metric depends not only on direct edges between nodes, but also on the number of ways an effect or a message can be transmitted through other nodes. Therefore, the capacity of node i to influence node j can be measured by the weighted sum of the number of n-hop paths present from the i to j. The underlying hypothesis is that the more the number of paths from one node to another, the greater is the capacity to influence. This model is analogous to the independent cascade model of information spread [11,14]. The strength of the effect via longer paths is likely to be lower than via shorter paths. We model the attenuation using parameters αi where αi (0 ≤ αi ≤ 1) is the probability of transmission of effect in the (i-1)th hop. Let us consider transmitting an effect or a message from nodes b to c in the network shown in Figure 1. The probability of transmission via the immediate neighbors of c such as e to c or g to c is α1 . The probability of transmission over 1-hop paths such as b to c via e is α1 α2 . In general, the probability of a transmission along an
22
R. Ghosh and K. Lerman
Fig. 1. A directed graph representing a network n+1 n-hop path is Πi=1 αi . The total influence of b on c thus depends on the number of (attenuated) channels between b and c, or the sum of all the weighted paths from b to c. This definition of influence makes intuitive sense, because the greater the number of paths between b and c, the more opportunities there are for b to transmit messages to c or to affect c. For ease computation we simplify this model by taking α1 = β and αi = α, ∀i = 1. β is called the direct attenuation factor and is the probability of transmission of effect directly between adjacent nodes. α is the indirect attenuation factor and is the probability of transmission of effect through intermediaries. If α = β, i.e., the probability of transmission of effect through all links is the same, then this index reduces to the metric used to find the Katz status score [13]. n / j , is given The number of paths from i to j with n intermediaries, i n+1 times
by An = A · A · · · A = A(n−1) · A. Adding weights to take into account the attenuation of effect, we get the weighted total capacity of i to affect j as / j = β i 0 / j + · · · + βαn i n / j + · · ·. We represent this weighted i total capacity to influence by the influence matrix P : P = (βA + βαA1 + · · · + βαn An + · · ·) = βA(I − αA)
−1
,
(1)
As mentioned by Katz[13], the equation holds while α < 1/λ, where λ is the largest characteristic root of A [6]. We use the influence matrix to help find community structure in a network. We claim (without much theoretical or empirical support) that a community is composed of individuals who have a greater capacity to influence others within their community than outsiders. As a result, actions of community members will tend to become correlated with time, whether by adopting a new fashion trend,
Community Detection Using a Measure of Global Influence
23
vocabulary, watching a movie, or buying a product. Armed with this alternative definition of community, we adapt modularity maximization-based approach to identifying communities. 2.1
Influence-Based Modularity
The objective of the algorithms proposed by Newman and coauthors is to discover “community structure in networks — natural divisions of network nodes into densely connected subgroups” [25]. They proposed modularity as a measure for evaluating the strength of the discovered community structure. Algorithmically, their approach is based on finding groups with higher than expected edges within them and lower than expected edges between them [22,21,23]. The modularity Q optimized by the algorithm is given by: Q =(fraction of edges within community)-(expected fraction of such edges). Thus, Q is used as a numerical index to evaluate a division of the network. The underlying idea, therefore, is that connectivity of nodes within a community is greater than that of nodes belonging to different communities, and they take the number of edges as the measure of connectivity. However, we claim that path-based, rather than edge-based, connectivity is the true measure of network connectivity. Consider again the graph in Figure 1, where there exists an edge from a to c but not from b to c. Clearly, however, c is not unconnected from b, as there are several distinct paths from b to c. We use the influence matrix, which gives the global connectivity of the network, to identify communities. We redefine modularity as Q =(connectivity within the community) - ( expected connectivity within the community) and adopt the influence matrix P as the measure of connectivity. This definition implies that in the best division of the network, the influence of nodes within their community is greater than their influence outside their community. A division of the network into communities, therefore, maximizes the difference between the actual capacity to influence and the expected capacity to influence, given by the capacity to influence in an equivalent random graph. Let us denote the expected capacity to influence by an N × N matrix P¯ . We round off the values of Pij to the nearest integer values. Modularity Q can then be expressed as Q= [Pij − P¯ij ]δ(si , sj ) (2) ij
where si is the index of the community i belongs to and δ(si , sj ) = 1 if si = sj ; otherwise, δ(si , sj ) = 0. When allthe vertices are placed in a single group, then axiomatically, Q = 0. Therefore ij [Pij − P¯ij ] = 0. Hence, the total capacity to influence W is P¯ij = W = Pij (3) ij
ij
Hence the null model has the same number of vertices N as the original model, and in it the expected influence of the entire network equals to the actual influence of the original network. We further restrict the choice of null model to that
24
R. Ghosh and K. Lerman
where the expected influence on a vertex j, Wjin , is equal to the actual influence on the corresponding vertex in the real network. P¯ij = Wjin = Pij (4) i
i
Similarly, we also assume that in the null model, the expected capacity of a vertex i to influence others, Wiout , is equal to the actual capacity to influence of the corresponding vertex in the real network P¯ij = Pij . (5) Wiout = j
j
In order to compute the expected influence, we reduce the original graph G to a new graph G that has the same number of nodes as G and total number of edges W , such that each edge has weight 1 and the number of edges between nodes i and j in G is Pij . So now the expected influence between nodes i and j in graph G could be taken as the expected number of the edges between node i and j in graph G and the actual influence between nodes i and j in graph G can be taken as the actual number of edges between nodes i and node j in graph G . The equivalent random graph G is used to find the expected number of edges from node i to node j. In this graph the edges are placed at random subject to constraints: – The total number of edges in G is W . – The out-degree of a node i in G = out-degree of node i in G = Wiout . – The in-degree of a node j in graph G =in-degree of node j in graph G = Wjin . Thus in G the probability that an edge will emanate to a particular vertex i is dependent only on the out-degree of that vertex; and the probability that an edge is incident on a particular vertex i is dependent only on the in-degree of that vertex and the probabilities of the two vertices being the two ends of a single edge are independent of each other. In this case, the probability that an edge exists from i to j is given by P (emanates from i) · P (incident on j )=(Wiout /W )(Wjin /W ). Since the total number of edges is W in G , therefore the expected number of edges between i and j is W · (Wiout /W )(Wjin /W ) = P¯ij , the expected influence between i and j in G. 2.2
Detecting Community Structure
Once we have derived Q, we have to select an algorithm to divide the network into communities that optimize Q. Brandes et al. [3] have shown that the decision version of modularity maximization is NP-complete. Like others [23,17], we use the the leading eigenvector method to obtain the approximate solution. In [9] we applied this approach to the standard data sets used in literature, and found performance at least as good as that of the edge-based modularity algorithm. As can be mathematically derived from the formulation, we find that the
Community Detection Using a Measure of Global Influence
25
communities detected are independent of the value of β. So, henceforth without loss of generality, we shall assume the value of β = 1. In Section 4 we use this approach to partition several example networks into communities.
3
A Generalized Model of Influence
In this section, we present a mathematical framework that generalizes the notion of influence. In algebraic topology a k-simplex, with k ≥ 0, is a convex hull σ of k + 1 linearly independent points v0 , v1 , . . . , vk and dimension k. The points vi are called vertices of σ. Let σ = {v0 , v1 , . . . , vk } be a k-simplex and let ω = = wj if i = j. Then ω = {wi , . . . , wl } be a nonempty subset of σ, where wi {w0 , w1 , . . . , wl } is called the l-dimensional face of σ. A simplectic complex K is a finite collection of simplices in some Rn satisfying: – If σ ∈ K, then all faces of σ belong to K. – If σ1 , σ2 ∈ K, then either σ1 σ2 = Ø or σ1 σ2 is a common face of σ1 and σ2 . The dimension of K is defined to be −1 if K = Ø and the maximum of the dimensions of K otherwise. An undirected graph can then be viewed as a simplectic complex with a single-element set per vertex and a two-element set per edge. Suppose we are given finite sets X = {x1 , x2 , . . . , xn } and Y = {y1 , y2 , . . . , ym }, and a binary relation γ ⊆ X × Y between elements of X and elements of Y (X and Y could be the same). Then the relation γ may be expressed as an n × m incidence matrix A (γ) = (Aij ) where Aij = 1 if (xi , yj ) ∈ γ and 0 otherwise. Each row in the incidence matrix A (γ) may be viewed as a simplex in the following way: Let Y be a set of vertices. The i-th row of A (γ) can be identified with a k dimensional simplex {yj1 , yj2 , . . . , yj(k+1) } = σk (xi ) on the vertices Y (where Aij =1). Thus each xi ∈ X determines (with γ ) a row of A (γ) and each row A (γ) can be identified a simplex. The set of simplices is a simplical complex denoted by KX (γ, Y ). Since an arbitrary element xi is γ-related to exactly k + 1 yj , σk (xi ) is distinguished as a named simplex. If we let d denote the maximum dimension of KX (γ, Y ), we immediately see that d ≤ m − 1. Let σ and τ be two simplices in KX (γ, Y ). Then σ and τ are q − near if they have a common q-face, i.e., their intersection contains at least q + 1 elements. (This q-face need not be an element of the simplex family.) Then τ and σ are q-connected if there exists a sequence σ1 , σ2 , . . . , σp of simplices in KX (γ, Y ), such that σ1 = σ and σp = τ and σi is q-near to σi+1 ∀1 ≤ i ≤ p − 1. Thus q-connectivity is the transitive closure of q-nearness. Q analysis using q-nearness and q-connectivity was used by Atkin [2] to deal with pairs of sets and sets of contextual relations. As Legrand [16] points out, q-nearness and q-connectivity are not necessarily a true measure of how similar the vertices are to each other, for which the length of sequence of q-connectivity should be the true indicator. We therefore take the length of sequences into account by calculating how q-near vertex i is from vertex j, making it dependent on the length of the path between them.
26
R. Ghosh and K. Lerman
Therefore the adjacency matrix A, Aij = (q1ij ), shows if two simplices are zero-near to one another in a 0-hop path. The product A2 =A × A gives the value of q2ij such that A2 ij = q2ij , i.e., vertex i and vertex j when separated by a one-hop path are q-near each other with q = q2ij − 1. In the same way, A3 = A × A × A = (A3ij ) = (q3ij ) shows that vertices i and j connected by a two-hop path are q3ij − 1 near from each other. We then take the length of the sequence into account to calculate the expected q-nearness of one vertex to another by taking the weighted average of q-nearness of varying length of paths. The expected value of qij between two elements i and j, such that they are expected to be qij − 1 near each other, with (qkij ) = Akij is: E(qij ) =
(W1 · q1ij + W2 · q2ij + . . . + Wn · qnij + . . .) ∞ i=1 Wi
(6)
This expected value can be used to find out how connected two vertices are to each other, taking paths of all lengths into account. Note that Wi can be a scalar or a vector. This formulations allows us to generalize different network models for community detection and scoring like the random walk model [28,29,31], the Katz model [13] of status score, and the influence-based model. In random walk models, a particle starts a random walk from node i. The particle iteratively transitions to its neighbors with probability proportional to the corresponding edge weights. Also at each step, the particle returns to node i with some restart probability (1 − c). The proximity score from node i to node j is defined as the steady-state probability ri,j that the particle will be on node j [29]. These models can be shown to be special cases of the formulations of the expected q-nearness (without loss of generality we assume that T is an n × n matrix): k−1 −(k−1) 1. If Wk = where c is a constant and D is an n × n matrix with cn · D Dij = j=1 Aij if i = j and 0 otherwise; then, the expected q-nearness score reduces to proximity score in random walk model [28,29]. i 2. If Wi = Πj=1 αj , where the scalar αj is the attenuation factor of a (j − 1)-th hop in a (i−1) hop path, then the expected q-nearness reduces to metric used to find the influence score and represented by the influence matrix. For ease of computation of the influence matrix, we have taken α1 = β and αi = α ∀i = 1. As stated before, α < 1/λ where λ is the largest characteristic root of A. Gershgorins Circle Theorem (1931) gives the simple sufficient condition α < 1/maxi (Dii ). 3. When β = α, this in turn reduces to the metric used to find the Katz status score [13] with α as the attenuation factor. 4. When α1 = 1 and α2 = . . . = αn = . . . = 0, the expected q-nearness is the q-nearness of the 0-hop path which is metric used to calculate similarity in edge-based modularity approaches [22].
In summary, the capacity to influence is a measure of the expected q-nearness between vertices. Liben-Nowell and Kleinberg [20] have shown that Katz measure is the most effective measure for link prediction task. The influence score, which
Community Detection Using a Measure of Global Influence
27
is a generalization of the Katz score, can then be used to find communities as described in Section 2.1.
4
Evaluation
We applied influence-based community finding method to small networks studied previously in literature, as well as the friendship network extracted from the social photosharing site Flickr. On all the data sets we studied, the performance of the influence-based modularity optimization algorithm was at least as good as that of the edge-based modularity (α = 0 case). In several cases, the influencebased approach led to purer groups. 4.1
Zachary’s Karate Club
The karate club data represents the friendship network of members of a karate club studied by Zachary [30]. During the course of the study, a disagreement developed between the administrator and the club’s instructor, resulting in the division of the club into two factions, represented by circles and squares in Figure 2. We used this data to study the communities detected for different values of α and compare the performance of the influence-based modularity approach to Newman’s community-finding algorithms (which is the special case where α = 0) [21]. Using α < 1/λ (Section 2) we get the upper bound on α ≤ 0.29. When both Newman’s edge-based modularity maximization approach and our method (0 ≤ α ≤ 0.29) are used to bisect the network into just two communities, we recover the two factions observed by Zachary (Figure 2(c)). However, algorithms run until a termination condition is reached (no more bisections are possible), different values of α lead to different results, as shown in Figure 2. As stated when α = 0, the method reduces to Newman’s edge-based modularity maximization approach [21], and we get four communities (Figure 2(a)). For 0 < α < 0.14 the number of communities reduces to three (Figure 2(b)). As α is increased further (0.14 ≤ α ≤ 0.29) we get two communities(Figure 2(c)), which are the same as the factions found in Zachary’s study.
(a)
(b)
(c)
Fig. 2. Zachary’s karate club data. Circles and squares represent the two actual factions, while colors stand for discovered communities as the strength of ties increases: (a) α = 0 (b) 0 < α < 0.14 (c) 0.14 ≤ α ≤ 0.29.
28
R. Ghosh and K. Lerman
4.2
College Football
We also ran our approach on the US College football data from Girvan et al. [10].1 The network represents the schedule of Division 1 games for the 2000 season where the vertices represent teams and the edges represent the regular season game between the two teams they connect. The teams are divided into “conferences” (or communities) containing 8 to 12 teams each. Games are more frequent between members of the same conference than members of different conferences. Inter-conference games, however, are not uniformly distributed, with teams that are geographically closer likely to play more games with one another than teams separated by geographic distances. However, some conferences have teams playing nearly as many games against teams in other conferences as teams within their own conference. This leads to the intuition, that conferences may not be the natural communities, but the natural communities may actually be bigger in size than conferences, with teams playing as many games against others in the same conferences being put into the same community.
College football
Political books
Fig. 3. The graph showing the purity of communities predicted with different values of α and β in the (a) college football and (b) political books data sets. We see that purity increases with α, and is independent of β. When α = 0, the method reduces to eigenvector based modularity maximization method postulated by Newman [23].
We measure the quality of discovered communities in terms of purity. The purity of a community is the fraction of all pairs of teams that were assigned to that community that actually belong to the same conference. The quality of a network division produced by an algorithm is the average purity of the discovered communities. Figure 3 shows the purity the discovered communities as α is varied. Purity is independent of β, the weight of direct edges, but increases with α, reaching ∼ 90% near α = 0.1 (the upper bound to α is determined by the reciprocal of the largest eigenvalue of the adjacency matrix). When α = 0, the modularity reduces to edge-based modularity studied by Newman [23], the purity is around 72%. The number of predicted groups changes from 8 at α = 0 to four at higher values of α. 1
http://www-personal.umich.edu/∼mejn/netdata/
Community Detection Using a Measure of Global Influence
4.3
29
Political Books
We evaluated our approach on the political books data compiled by V. Krebs.2 In this network the nodes represent books about US politics sold by the online bookseller Amazon. Edges represent frequent co-purchasing of books by the same buyers, as indicated by the “customers who bought this book also bought these other books” feature of Amazon. This feature influences the book purchasing decisions of customers. The nodes were given labels liberal, neutral, or conservative by Mark Newman on a reading of the descriptions and reviews of the books posted on Amazon.3 49 of the books were marked as conservative, 43 books were marked as liberal and 13 books were marked as neutral. We use our algorithm to find the existing community structure in the network by varying parameters as shown in Figure 3. Purity is independent of the value of β, and similarly to the football data, as α increases, the number of communities decreases (from four at α = 0 to two at α = 0.08). Also the purity of the communities increases from 60% at α = 0 to 92% at α = 0.08. Again, α = 0 corresponds to Newman’s modularity method. Another observation is that when α = 0.08, leading to the formation of two groups, only the neutral books are split between group, indicates a possibility that some of the 13 neutral books were conservatively inclined and some liberally. 4.4
Flickr Social Network
We also ran our algorithm on the social network data collected from Flickr for the image search personalization study [18]. Flickr is a social photosharing site that allows users to upload images, tag them with descriptive keywords, known as tags, and to join social networks by adding other users as contacts. We believe that network structure, create by independent decisions to add another photographer as a contact, capture social knowledge, including knowledge about users’ photography interests. Thus, users who are interested in a particular topic are more likely to be connected than users interested in different topics. Since the actual social network on Flickr is rather vast, we sampled it by identifying users who were broadly interested in one of three topics: child and family portraiture, wildlife photography and technology. For each topic, we used the Flckr API to perform a tag search using a keyword relevant to that topic, and retrieved 500 ‘most interesting’ images. We then extracted the names of users who submitted these images to Flickr. These users were added to our data set. The keywords used for image search were (a) newborn for the portraiture topic, (b) tiger and beetle for the wildlife topic, and (c) apple for the technology topic. Each keyword is ambiguous. Tiger, for example, could mean a big cat of the panthera genus, but also a flower (Tiger lily), Mac operating system (OS X Tiger), or a famous golfer (Tiger Woods), while beetle could describe a bug or a car. The keyword newborn could refer to human babies just as well as to kittens and puppies, while apple could mean the computer maker or a fruit. 2 3
http://www.orgnet.com/ available at http://www-personal.umich.edu/∼mejn/netdata/
30
R. Ghosh and K. Lerman
From the set of users in each topic, we identified four (eight for the wildlife topic) who were interested in the topics we identified: i.e., wildlife for tiger and beetle query terms, portraiture for the newborn query, and technology for the apple query. We studied each user’s profile to confirm that the user was indeed interested in that topic. Specifically, we looked at group membership and user’s most common tags. Thus, groups such as “Big Cats”, “Zoo”, “The Wildlife Photography”, etc. pointed to user’s interest in the wildlife topic. In addition to group membership, tags that users attached to their images could also help identify their interests. For example, users who used tags nature and macro were probably interested wildlife rather than technology. Similarly, users interested in human, rather than animal, portraiture tagged their images with baby and family. We used the Flickr API to retrieve the contacts of each of the users we identified, as well as their contacts’ contacts. We labeled users by the topic through which they were discovered. In other words, users who uploaded one of the 500 most interesting images retrieved by the query tiger, were labeled wildlife, whether or not they were interested in wildlife photography. The contacts and contacts’s contacts of the four users within this set identified as being interested in wildlife photography were also labeled wildlife. Although we did not verify that all the labeled users were indeed interested in the topic, we use these soft labels to evaluate the discovered communities. Once we retrieved the social networks of target set of users, we reduced it to an undirected network containing mutual contacts only. In other words, every link in the network between two nodes, say A and B, implies that A lists B as contact and vice versa. This resulted in a network of 5747 users. Of these, 1620 users were labeled technology, 1337 and 2790 users were labeled portraiture and wildlife respectively. We ran our community finding algorithm for different values of α on this data set. For α = 0, we found four groups, while for higher values of α (α < 0.01), we found three groups. Figure 4 shows composition of the discovered groups in terms of soft labels. Group 1 is composed mainly of technology users, group 2 mainly wildlife users, and group 3 is almost exclusively portraiture. The fourth group found at α = 0.0 has 932 members, of which 497 are labeled wildlife, 242 technology, and 193 members portraiture. Except for the portraiture group (group 3), groups become purer as α increases.
5
Related Research
There has been some work in motif-based communities in complex networks [1] which like our work extends traditional notion of modularity introduced by Girvan and Newman [10]. The underlying motivation for motif-based community detection is that “the high density of edges within a community determines correlations between nodes going beyond nearest-neighbours,” which is also our motivation for applying the influence-based modularity metric to community detection. Though the motivation of this method is to determine the correlations between nodes beyond nearest neighbors, yet it does impose a limit on the proximity of neighbors to be taken into consideration dependent on the size of
Community Detection Using a Measure of Global Influence
31
Fig. 4. Composition of groups discovered in the Flickr social network for different values of α
32
R. Ghosh and K. Lerman
the motifs. The method we propose, on the other hand, imposes no such limit on proximity. On the contrary, it considers the correlation between nodes in a more global sense. The measure of global correlation evaluated using the influence metric would be equal to the weighted average of correlations when motifs of different sizes are taken. The influence matrix enables the calculation of this complex term in a quick and efficient manner. Resolution limit is one of the main limitations of the original modularity detection approach [8]. It can account for the comment by Leskovec et al. [19] that they “observe tight but almost trivial communities at very small scales, the best possible communities gradually ‘blend in’ with rest of the network and thus become less ‘community-like’.” However, that study is based on the hypothesis that communities have “more and/or better-connected ‘internal edges’ connecting members of the set than ‘cut edges’ connecting to the rest of the world.” Hence, like most graph partitioning and modularity-based approaches to community detection, their process depends on the local property of connectivity of nodes to neighbors via edges and is not dependent on the structure of the network on the whole. Therefore, it does not take into account the characteristics of node types, that is ‘who’ are the nodes that a node is connected to and how influential these nodes are. In their paper on motif-based community detection, Arenas et al.[1] state that the extended quality functions for-motif based modularity also obey the principle of the resolution limit. But this limit is now motif-dependent and then several resolution of substructures can be achieved by changing the motif. However, it would be difficult to verify which resolution of substructures is closest to natural communities. In influence-based modularity, on the other hand, the resolution limit would depend on the probability of transmission of the effect between nodes, i.e., the strength of ties. The probability of transmission of effect can indeed be calculated from the graph, by say observing the dynamics of spread of idea within a graph at different times. As stated before, Liben-Nowell and Kleinberg [20] have shown that Katz measure is the most effective measure for the link prediction task, better than hitting time, PageRank [26] and its variants. Thus we use influence score, which is a generalization of the Katz score, to detect communities and compute rankings of individuals. Recently researchers have used probabilistic models, e.g., mixture models, for community discovery. These models can probabilistically assign a node to more than one community, as it has been observed “objects can exhibit several distinct identities in their relational patterns” [15]. This indeed may be true, but whether the nodes in the network is to be divided into distinct communities or probabilities with which each node belongs to community is to be discovered, really depends on the specific application. By this, we mean that if the application we are interested in is finding the natural communities say in the karate club data, and if we use a probabilistic method (say [15]), we would be assigning the nodes into groups into which their probability of belonging is the highest, and the communities thus formed do not necessarily portray the division of the network into natural communities observed.
Community Detection Using a Measure of Global Influence
6
33
Conclusion and Future Work
We have proposed a new definition of community in terms of the capacity of nodes to influence each other. We gave a mathematical formulation of this effect in terms of the number of paths of any length that link two nodes, and redefined modularity in terms of the influence metric. We use the new definition of modularity to partition a network into communities. We applied this framework to networks well-studied in literature and found that it produces results at least as good as the edge-based modularity approach. Although the formulation developed in this paper applies equally well to directed graphs, we have only implemented it on undirected ones. Hence future work includes implementation of the of the algorithm on directed graphs that are common on social networking sites, as well applying it to bigger networks. The influence matrix approximates capacity to influence along the lines of independent cascade model of information spread. Future work includes approximation of capacity to influence along other models of information spread like the threshold influence model. Leskovec et al. [19] state that they “observe tight but almost trivial communities at very small scales, the best possible communities gradually ‘blend in’ with rest of the network and thus become less ‘community-like’.” However the hypothesis that they employ to detect communities is that communities have “more and/or better-connected ‘internal edges’ connecting members of the set than ‘cut edges’ connecting to the rest of the world.” Hence, like most graph partitioning and modularity based approaches to community detection, their process depends on the local property of connectivity of nodes to neighbors via edges and is not dependent on the structure of the network on the whole. Besides, it also does not take into account the heterogeneity of node types, that is ‘who’ are the nodes that a node is connected to and how influential these nodes are. Therefore, we argue that a global property, such as the measure of influence, is a better approach to community detection. It remains to be seen whether communities will similarly ‘blend in’ with the larger network if one uses the influence metric to discriminate them.
Acknowledgements This research is based on work supported in part by the National Science Foundation under Award Nos. IIS-0535182, BCS-0527725 and IIS-0413321.
References 1. Arenas, A., Fernandez, A., Fortunato, S., Gomez, S.: Motif-based communities in complex networks. Mathematical Systems Theory 41, 224001 (2008) 2. Atkin, R.: From cohomology in physics to q-connectivity in social science. International Journal of Man-Machines Studies 4, 341–362 (1972) 3. Brandes, U., Delling, D., Gaertler, M., Gorke, R., Hoefer, M., Nikoloski, Z., Wagner, D.: On modularity clustering. IEEE Trans. on Knowl. and Data Eng. 20(2), 172– 188 (2008)
34
R. Ghosh and K. Lerman
4. Clauset, A.: Finding local community structure in networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 72(2) (2005) 5. de Sola Pool, I., Kochen, M.: Contacts and influence. Social Networks 1(1), 39–40 (1978–1979) 6. Ferrar, W.L.: Finite Matrices. Oxford University Press, Oxford (1951) 7. Fiedler, M.: Algebraic connectivity of graphs. Czech. Math. J. 23, 298–305 (1973) 8. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. USA 104, 36 (2007) 9. Ghosh, R., Lerman, K.: Community detection using a measure of global influence. In: Proc. of the 2nd KDD Workshop on Social Network Analysis, SNAKDD 2008 (2008) 10. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821 (2002) 11. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters (2001) 12. Granovetter, M.: The strength of weak ties. The American Journal of Sociology (May 1973) 13. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18, 39–40 (1953) ´ Maximizing the spread of influence through 14. Kempe, D., Kleinberg, J., Tardos, E.: a social network. In: KDD 2003: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 137–146. ACM, New York (2003) 15. Koutsourelakis, P.S., Eliassi-Rad, T.: Finding mixed-memberships in social networks. In: AAAI Spring Symposium Social Information Processing (2008) 16. Legrand, J.: How far can q-analysis go into social systems understanding? In: Fifth European Systems Science Congress (2002) 17. Leicht, E.A., Newman, M.E.J.: Community structure in directed networks. Physical Review Letters 100, 118703 (2008) 18. Lerman, K., Plangprasopchok, A., Wong, C.: Personalizing results of image search on flickr. In: AAAI workshop on Intelligent Techniques for Web Personlization (2007) 19. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the World Wide Web Conference (2008) 20. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007) 21. Newman, M.E.J.: Detecting community structure in networks. The European Physical Journal B 38, 321–330 (2004) 22. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004) 23. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74, 036104 (2006) 24. Newman, M.E.J.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 103, 8577 (2006) 25. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004) 26. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project (1998)
Community Detection Using a Measure of Global Influence
35
27. Pothen, A., Simon, H., Liou, K.P.: Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11, 430–452 (1990) 28. Tong, H., Faloutsos, C., Pan, J.: Fast random walk with restart and its applications. In: Sixth International Conference on Data Mining, ICDM 2006, pp. 613–622 (2006) 29. Tong, H., Papadimitriou, S., Yu, P.S., Faloutsos, C.: Proximity tracking on timeevolving bipartite graphs. In: SDM, pp. 704–715. SIAM, Philadelphia (2008) 30. Zachary, W.W.: An information ow model for conict and ssion in small groups. Journal of Anthropological Research 33, 452–473 (1977) 31. Zhou, H.: Network landscape from a brownian particles perspective. Physical Review E 67 (2003)
Communication Dynamics of Blog Networks Mark Goldberg1 , Stephen Kelley1 , Malik Magdon-Ismail1, Konstantin Mertsalov1, and William (Al) Wallace2 1 CS Department, RPI, 110 8th Street, Troy, NY {goldberg,kelles,magdon,mertsk2}@cs.rpi.edu 2 DSES Department, RPI, 110 8th Street, Troy, NY
[email protected] Abstract. We study the communication dynamics of Blog networks, focusing on the Russian section of LiveJournal as a case study. Communication (blogger-to-blogger links) in such online communication networks is very dynamic: over 60% of the links in the network are new from one week to the next, though the set of bloggers remains approximately constant. Two fundamental questions are: (i) what models adequately describe such dynamic communication behavior; and (ii) how does one detect the phase transitions, i.e. the changes that go beyond the standard high-level dynamics? We approach these questions through the notion of stable statistics. We give strong experimental evidence to the fact that, despite the extreme amount of communication dynamics, several aggregate statistics are remarkably stable. We use stable statistics to test our models of communication dynamics postulating that any good model should produce values for these statistics which are both stable and close to the observed ones. Stable statistics can also be used to identify phase transitions, since any change in a normally stable statistic indicates a substantial change in the nature of the communication dynamics. We describe models of the communication dynamics in large social networks based on the principle of locality of communication: a node’s communication energy is spent mostly within its own “social area,” the locality of the node.
1
Introduction
The structure of large social networks, such as the WWW, the Internet, and the Blogosphere, has been the focus of intense research during the last decade (see [1], [7], [8], [12], [17], [19], [20], [21], [22]. One of the main foci of this research has been the development of dynamic models of network creation ([2], [11], [22], [18]) which incorporates two fundamental elements: network growth, with nodes arriving one at a time; and some form of preferential attachment in which an arriving node is more likely to attach itself to a more prominent existing node than a less prominent one (the rich get richer). Once a network has grown and stabilized in size, how does it evolve? Such an evolution is governed by the communication dynamics of the network: links being broken and formed as social groups form, evolve and disappear. The communication dynamics of these networks have been studied much less, partially L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 36–54, 2010. c Springer-Verlag Berlin Heidelberg 2010
Communication Dynamics of Blog Networks
37
because the typical networks studied (the WWW, the Internet, collaboration networks) mainly exhibit growth dynamics and not communication dynamics. Clearly, as a network matures, the growth (addition of new users) becomes a minor ingredient of the total change (see Figure 1). Further, links in a socially dynamic network such as the Blogosphere should not be interpreted as static. The posts made by a blogger a week ago may not be reflective of his/her current interests and social groups. In fact, blog networks display extreme communication dynamics. Over the 20 week period shown in Figure 1, in a typical week, 510,000 pairs of bloggers communicated via blog comments. Out of those about 380,000 are between pairs of bloggers who did not communicate the week before, i.e. over 70% of the communications are new. What models adequately describe the dynamics of the communications in such networks which have more or less stabilized in terms of growth? 0.9 vertex growth new edges
0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
51 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Fig. 1. Edge and vertex dynamics. Clearly the rate of growth is decreasing however the fraction of new edges which appear in a week remains approximately constant at over 70%.
To begin to address this question, one must first develop methods for testing the validity of a model. In such an environment of extreme stochastic dynamics, one cannot hope to replicate the dynamics of the individual communications; this explains our focus on the evolution of interesting macroscopic properties of the communication dynamics. Particularly interesting ones are those which are time invariant. We refer to such properties as stable statistics. As we demonstrate, even in such an active environment, certain statistics are remarkably stable. For example: the power-law coefficient for the in-degree distribution, the clustering coefficient, and the size of the giant component (see Table 1). 1.1
Our Contributions
In this paper, we first demonstrate by experimentation that a certain set of statistics is stable in the graph of the Blogosphere. The stability implies that these (and conceivably some other) statistics can be used to validate models, characterize networks and identify their phases. Second, we present several models for
38
M. Goldberg et al.
communication dynamics that are based on the principle of locality of communication. We show via simulation that our models stabilize to an equilibrium in which the aggregate statistics of the communication dynamics are stable. Furthermore, among the set of models we tested, we select the one whose stable statistics best reproduce the statistics of the observed network. Stable Statistics. Our case study was the Russian section of LiveJournal. During the observed period of 20 weeks, close to 153,000 users were active in any one week period. The size of this set is quite stable (changes typically from 1 to 2%). Although, the makeup of the set changes drastically from week to week. Surprisingly many aggregated statistics computed for the Blogograph show strong stability. Among those stable statistics are: the distribution of the in-degrees and the out-degrees of the nodes; the (overlapping) coalition distribution as described by the cluster density and size; and the size of the giant component. The nodes of the Blogograph represent bloggers and the directed edges between them represent all pairs {A, B} where blogger A visited the blog of B during the week in question and left a comment to a specific post already in the blog. We consider the following four types of stable statistics: (i) Individual Statistics: properties of individual nodes such as the in-degree and out-degree distributions for the graph (ii) Relational Statistics: properties of edges in the graph such as the persistence of edges and clustering coefficients (iii) Global Statistics: properties reflecting global information such as the size and diameter of the largest component and total density (iv) Community Statistics: properties relating to group structure such as the community size and density distributions The purpose of collecting these statistics is two-fold. First, they create a baseline which describes the normal behavior of individuals, communities, and the network as a whole. Once this base has been established, anomalous behavior at each of these levels can be identified and investigated further. Second, stable statistics can be used for testing any model of the network dynamics, as any model which attempts to replicate the communication dynamics must, in particular, be able to reproduce these statistics. Furthermore, the quality of a model can be measured by how well the statistics computed from the network generated by the model (in equilibrium) replicate those observed in the real-life network. Locality based Models of Communication Dynamics. Existing growthbased models fail to adequately replicate the observed stable statistics, as they do not capture communication dynamics. We consider models for communication dynamics which take as input: (a) the current (observed) communication graph; and, (b) each user’s out-degree (communication energy) at the next time step (or a distribution over for the user’s out-degree). These two inputs are standard for existing growth models (such as the preferential attachment growth model). Such models are only applicable when the communications are open (observable to all nodes). The output is the communication graph at the next time step, based on the model for probabilistic attachment of each node’s out-edges.
Communication Dynamics of Blog Networks
39
We discuss intuitive extensions of growth models for modeling communication dynamics and illustrate that these extensions are inadequate for modeling the observed stable statistics. We present a locality based model which relies on two fundamental principles to more accurately reflect the observed communication dynamics. First, our concept of locality reduces the set of nodes a node can attach to in the next time step (a week in our case). This locality is based on structural properties of the current (observable to all) communication graph. The locality represents a semi-stable set of “neighbor” nodes that an individual is highly likely to connect to and can be interpreted as that individuals view of the communities she belongs to. We test various structural (graph theoretic) definitions of a node’s social locality, ranging from trivial localities such as the entire graph to notions of a node’s neighborhood (e.g. the 2-neighborhood; the clusters to which a node belongs). Second, after obtaining a node’s locality, one must specify the attachment mechanism, the mechanism used by the individual to select the nodes in her locality to which she will connect at the next time step. We test a number of different attachment mechanisms which one could consider, ranging from uniform attachment to some form of preferential attachment. Thus, we present results using each of the various choices for the locality and attachment mechanism. Such probabilistic models are Markov chains, and we test a model’s performance by comparing the values it produces for the stable statistics after it has equilibrated. We find experimentally that the mixing times are small and the equilibrium statistics are independent of the starting state (the chains are ergodic), hence the equilibrium distribution is unique. Our results indicate that our locality based model with locality defined as the union of clusters to which a node belongs and a preferential attachment mechanism produces the best values for the stable statistics.
2
Clusters
The notion of a social community is crucial to our model of a Blog network. The underlying idea of our model is that every user selects the nodes to visit (to leave a comment) from the set of nodes that belong to a relatively small “area” of a node. Our experiments with different definitions of the local area of the node show that the best approximation to the observed statistics is achieved if the area is taken as the union of clusters containing a given node. Our definition of network clusters is borrowed from [4], [5], [6] with an important specification of the notion of the density of a set of nodes in a network. Definition. Given a graph G(V.E) let function D, called the density, be defined on the set of all subsets of V . Then, a set C ⊆ V is called a cluster if it is locally maximal w.r.t. D in the following sense: for every vertex x ∈ C (resp. x ∈ C), removing x from C (resp. adding x to C) creates a set whose density is smaller than D(C). The idea of the definition matches the common understanding of a social community as a set of members that forge more communication links within the
40
M. Goldberg et al.
set than that with the outside the set. The function D is not specified by the definition, but its precise formulation is crucial in “catching” the nature of social communities. The density function considered in [3] is as follows: D(C) =
win , win + wout
(1)
where win is the number of edges xy with x, y ∈ C and wout is the number of edges xy with either x ∈ C & y ∈ C or x ∈ C & y ∈ C (to allow for directed graphs). The main deficiency of the definition of a cluster as a computational representation of a social community is that it is easy to find examples of networks that permits very large and loosely connected clusters, that intuitively are not representing any community. The idea of our modification of 1 is to introduce an additional parameter which represents the edge probability in the set D(C) =
win 2win , +λ win + wout |C|(|C| − 1)
(2)
where the parameter λ depends on the specific network under the consideration, and is supposed to be selected by the researcher. For our experiments, we selected λ = 0.125.
3
Data
We define the blogograph as a directed, unweighted graph representing the communication of the blog network within a fixed time-period. There is a vertex in the blogograph representing each blogger and a directed edge from the author of any comment to the owner of the blog where the comment was made during the observed time period. Parallel edges are not allowed and a comment is ignored if the corresponding edge is already present in the graph. Loops, comments on a bloggers own blog, are ignored as well. To study the communication dynamics, we consider consecutive weekly snapshots of the network; the communication graph contains the bloggers that either posted or commented during a week and the edges represent the comments that appeared during the week. We chose to split graphs into one week periods due to the highly cyclic nature of activity in the blogosphere (see Figure 3). An illustration of the blogograph’s construction is given on Figure 2. The data used for our research was collected from the popular blogging service LiveJournal. As of May 2008, there were more than 15 million users for the whole network; the number of posts during a 24 hour period was approximately 191,000 (see html://www.livejournal.com/stats). Much of the communication in LiveJournal is public, which allows for open access. LiveJournal provides a real time RSS update feature that publishes all open posts that appear on any of the hosted blogs. We record the permament addresses of the posts and wait for the comments to accumulate. In our experience, the overwhelming majority of comments appear on these posts within two weeks of the posting date. Thus,
Communication Dynamics of Blog Networks
Thread on Alice’s Blog Alice posted Bill commented Alice commented Cory commented
Thread on Bill’s Blog
A Edges: B −>A C −>A
D
41
B
Edges: A −>B C −>B D −>B
C
Bill posted Alice commented Cory commented Dave commented
Fig. 2. Blogograph generation example. Vertices are placed for every blogger who posted or commented, the edges are placed from the author of the comment to the author of the post (the blog owner). Parallel edges and loops are not allowed. Table 1. Statistics for observed blogograph: order of the graph (|V |), graph size (|E|), fraction of vertices that are part of giant component (GC size), clustering coefficient (C), average separation (d), power law exponent (α) week 49 50 51 52 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
|V | 155,615 156,026 155,093 151,559 118,979 142,478 159,436 158,429 156,144 156,301 154,846 156,064 156,362 154,820 155,267 156,872 155,338 155,099 153,440 154,012 151,427
|E| 530,160 532,189 527,364 516,483 327,356 444,457 559,506 550,436 534,917 526,194 523,235 528,363 524,441 523,304 516,280 514,269 510,070 506,892 504,850 512,094 503,802
GC 95.88% 95.91% 95.62% 95.62% 93.55% 95.14% 96.16% 95.60% 95.49% 95.70% 95.44% 95.59% 95.58% 95.48% 95.13% 95.20% 95.42% 95.19% 95.32% 95.34% 95.30%
C 0.0639 0.0644 0.0635 0.0635 0.0573 0.0587 0.0629 0.0631 0.0627 0.0615 0.0622 0.0609 0.0602 0.0593 0.0600 0.0590 0.0601 0.0607 0.0601 0.0599 0.0611
d 5.333 5.327 5.316 5.316 5.777 5.392 5.268 5.224 5.293 5.338 5.337 5.320 5.377 5.368 5.356 5.367 5.342 5.309 5.303 5.298 5.288
α 2.63 2.66 2.65 2.71 2.92 2.68 2.68 2.67 2.72 2.72 2.69 2.69 2.68 2.68 2.68 2.63 2.71 2.73 2.73 2.60 2.75
our screen-scraping program visits the page of a post after it has been published for two weeks and collects the comment threads. We then generate the communication graph. We have focused on the Russian section of LiveJournal as it is reasonably but not excessively large (currently close to 580,000 bloggers out of the total 15 million) and almost self contained. We identify Russian blogs by the presence of Cyrillic characters in the posts. Technically this also captures the posts in other
42
M. Goldberg et al. 450000 400000 350000 300000 250000 200000 150000 01/19
02/02
02/16
03/01
03/15
03/29
(a) Comments per day 30000 25000 20000 15000 10000 5000 0 Mon
Tue
Wed
Thu
Fri
Sat
Sun
(b) Comments per hour Fig. 3. Number of comments per day that appeared between January 14, 2008 and April 6th, 2008 and number of comments per hour during a week between March 24th, 2008 and March 30th. The periodic drops in the number of comments per day correspond to Saturdays and Sundays.
languages with a Cyrillic alphabet, but we found that the vast majority of the posts are in Russian. The network of Russian bloggers is very active. On average, 32% of all posts contain Cyrillic characters. LiveJournal blogging has become a cultural phenomenon in Russia. Discussion threads often contain intense and interesting discussions which encourage communication through commenting. Our work is based on data collected between December 2007 and April 2008. The basic statistics about the size of obtained data are presented in Table 1. A simpler set of statistics on a smaller set of observed data is presented in [16].
4
Stable Statistics
The observed communication graph has interesting properties. The graph is very dynamic on the level of nodes and edges but has stable aggregated statistics. About 75% of active bloggers will also be active in the next week. Further, about 28% of edges that existed in a week will also be found in the next week. A large part of the network changes weekly, but a significant part is preserved. The stability of various statistics of the blogograph is presented in Table 1. The giant component (GC) is the largest connected (not necessarily strongly connected)
Communication Dynamics of Blog Networks
43
subgraph of the undirected blogograph. A giant component of similar size has been observed in other large social networks [18], [14]. The clustering coefficient (C) refers to the probability that the neighbors of a node are connected. The clustering coefficient of a node with degree k is the ratio of the number of edges between it’s neighbors and k(k − 1). The clustering coefficient of the graph is defined to be the average of the node clustering coefficients. The observed clustering coefficient is stable over multiple weeks and significantly different from the clustering coefficient in a random graph with the same out-degree distribution, which is 0.00029. The average separation (d) is the average shortest path between two randomly selected vertices of the graph. We computed it by sampling 10,000 random pairs of nodes and finding the undirected shortest path between them. The observed value in the blogograph is similar to what has been found in many other social networks ([18], [23]). Many large social networks ([2], [14]) display a power law in the degree distribution, P (k) ∝ ck −α , where P (k) is the probability a node has degree k. Figure 5 shows the mean in-degree distribution of the collected blogographs. In these graphs, we observed a power law tail with parameter α ≈ 2.70 which is stable from week to week. This value was computed using maximum likelihood method described in [10] and Matlab code made available by Aaron J. Clauset. To evaluate the dynamic in the observed communication we consider the change in the set of links or edges from one week to another. Figure 4 shows the distribution of number of weeks a particular pair of bloggers communicated. It is evident from this plot that the vast majority of communication does not re-occur. Yet, some links reappear every week. We also look at the past relationship between the bloggers who communicated. We define the history of the edge (i, j) that appeared in time cycle t to the the shortest undirected distance between i and j in the graph of the time cycle t − 1. Figure 4 presents the distribution of the edge histories of all observed edges of all time cycles. The edge history distribution of the particular observed weeks is very close to the
1
0.5
0.4
0.1
0.3 0.01 0.2 0.001
0.1
0.0001
0 0
5
10
15
(a) Edge Stability
20
0
2
4
6
8
10
12
14
16
(b) Edge History
Fig. 4. Edge stability: distribution of the number of weeks an edge appeared in 60% of all edges only appeared once. Edge history: the distribution of the shortest undirected distance of end points of an edge in a previous time cycle.
44
M. Goldberg et al. 1 observed distribution α = 2.70
0.1
P(x = k)
0.01 0.001 0.0001 1e-05 1e-06 1
10
100 k
1000
10000
Fig. 5. Average in-degree distribution in the blogograph observed over 21 weeks from Dec. 03, 2007 and Apr. 28, 2008 Table 2. 19 weeks of communities from the Russian section of Live Journal. |C| is the number of communities, δavg is the average density, and ep is the average edge probability within the communities. week 51 52 1 2 3 4 5 6 7 8
|C| 19631 19520 23187 20970 17986 18510 18808 19318 19343 19796
avg size 10.0183 10.0615 10.0915 9.98412 9.86184 9.71891 9.88255 9.79242 9.80381 9.83113
δavg 0.456677 0.453763 0.473676 0.458161 0.448757 0.453578 0.455823 0.454656 0.456364 0.453577
ep week |C| avg size δavg ep 0.253212 9 20136 9.95401 0.473693 0.252607 0.252101 10 19670 9.71678 0.45449 0.255778 0.248130 11 20212 9.66842 0.456908 0.256098 0.251843 12 20415 9.70331 0.461118 0.255819 0.254203 13 20030 9.78058 0.455676 0.254681 0.257481 14 19893 9.74936 0.455234 0.254384 0.254305 15 19392 9.73407 0.455365 0.254687 0.253901 16 19113 9.74787 0.454531 0.254721 0.255236 17 18737 9.72333 0.455775 0.255658 0.252818
presented distribution (the variation at each point is less then 2%). As the figure suggests, the majority of communicating vertices were less or at 3 hops away in the network on the previous time cycle. This provides evidence for the strong locality of communication that occurs in the observed network. In addition to looking for stability in structural statistics, it is also useful to examine stable community behavior. Using the notion of clusters discussed previously in this text, we find locally optimal communities using each edge in the graph as a seed. Once all seeds are optimized, duplicates and clusters of size 2 are removed. Statistics of the remaining clusters are showing in Table 2. A size vs density plot is also given in Figure 6. The general shape and scale of this plot is replicated across all observed weeks.
Communication Dynamics of Blog Networks
45
1
Density
0.8
0.6
0.4
0.2
0 0
10
20
30
40
50 Size
60
70
80
90
100
Fig. 6. A size vs density plot for week 5 of the observed data. The x-axis is a measure of the community size while the y-axis shows the value of δ. Each point represents a community.
5
Modeling
As previously stated, networks with such strong communication dynamics have not been well modeled. Much of the previous work aims to replicate the growth phase of a network’s life-cycle, ignoring the evolution of communication once the network’s size stabilizes. Models which replicate these dynamics would be useful as a sand-box within which social hypotheses on information diffusion, the emergence of leaders, and group formation and dissolution can be tested. To be considered useful, any model should create a set of graphs whose statistics come as close as possible to mirroring the statistics of the observed data presented previously. Before delving into the creation of a new model, let us first consider the modification of a previously existing one. The simplest method of producing a set of evolving graphs is to grow each week’s graph using a known network growth algorithm. Vertices can be assigned an out-degree based on the observed data and connected to each other via preferential attachment for each of the weeks. If done correctly, this would yield a set of graphs whose in-degree and out-degree distributions come close to matching the observed data’s power law distributions. Despite this initial positive result, examining the rest of the statistics demonstrates that the model is insufficient. Relational statistics such as edge stability, edge history, and clustering coefficient all significantly depart from the observed values, which we will show in detail further in the paper. This model’s inability do recreate these statistics is expected, since it generates each graph independently. Below, we propose a model which performs its edge connection within some locality in an effort to more closely mirror the edge stability, edge history, clustering coefficient, and community based statistics of the network.
46
6
M. Goldberg et al.
A Locality Based Model
The goal of our model is to produce a sequence of graphs which simulate the connection and reconnection of vertices. Our model specifies how nodes update their edges in response to the observed communication activity. In specifying this model of evolution, we take as input the out-degree distribution of the blogograph. The justification for this is that, while the out-degree distribution would be an interesting object to model, it mainly reflects the individual properties of the users in the network such as the level of energy and involvement of the user. Such quantities tend to be innate to a user. Different people have different social habits; some manage to communicate with hundreds of people while others interact with only a small group. Hence, out-degrees should be specified either ab initio (e.g. from social science theory) or extracted directly from the observed data. We take the latter approach to specifying the out-degree distribution when it comes to testing our model. An early version of this model with preliminary results is presented in [15]. Given the out-degrees for all nodes, the task is now to specify how to attach the out-edges of the nodes and to obtain the in-degree distribution. It is the in-degree distribution that characterizes the global communication structure of the network (for example, who is considered by others to be important). Clearly, the out-degree distribution of a graph alone does not determine its in-degree distribution. Algorithms for generating undirected random graphs with a prescribed degree distribution are well known (see [9], [13], [24]). However, even if those algorithms are expanded to the domain of directed graphs, they will still be insufficient for our purpose of modeling evolution, which requires repeated generation of the next graph given the previous one. To summarize, we are interested in models which reproduce the observed evolution given the out-degrees of the nodes. Thus, all our locality models assume that a node, when deciding where to attach its communication links, has some fixed budget of emanating edges which it can attach. The main task of our model is to develop an evolution mechanism that re-creates an in-degree distribution close to the observed one. We use standard graph theory terminology in describing our model (see for example [25]). The sequence of blogographs are represented by directed graphs G0 , G1 , G2 , . . ., where at every time step t, Gt = (V, Et ). V is the common vertex set of all known bloggers, V = {v1 , ..., vn }. An edge (vi , vj ) is in the edge set Et if blogger vi commented on a post by vj during the time period t. One time period covers one week, which appears to be the natural time scale in the blogosphere. The input to the model is the set of out-degrees at time t for each vertex, {kt1 , . . . , ktn } and Gt−1 , the blogograph at time t − 1. The output of the model is Gt , the blogograph at time t. Our model is locality based. At time t, every node vi identifies its area, and assigns its out-edges with destinations in its area. More formally, denote the area of vi at time t by Ait ⊆ V . Ait represents the locality of node vi at time t. Typically, a node’s locality at time t will depend on Gt−1 , the blogograph at time t − 1. The attachment mechanism is probabilistic for each node. Node vi attaches its kti out-edges according to its own probability
Communication Dynamics of Blog Networks
47
Algorithm 1. Evolution Model 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12:
Function Model (T , OutDeg, Area, Prob) // Output: Blogographs G1 , . . . , GT . {k01 , . . . , k0n } ← OutDeg Initialize G0 (e.g. to a random graph) for t = 1 to T do Et ← ∅; {kt1 , . . . , ktn } ← OutDeg for i = 1 to n do Ait ← Area(i, Gt−1 ); pit ← Prob(i, Ait , Gt−1 ) Eti ← Attach(i, , Ai , pit , kti ); Et ← Et ∪ Eti . end for Gt+1 ← (V, Et ) end for
Algorithm 2. Edge attachment algorithm 1: Function Attach (i, Ait , pit , kti ) 2: // Output: Eti : edges in Gt originating at i 3: while kti > 0 do 4: if ( v∈Ai pit (v) > 0) then t
5: Select node v ∈ Ait−1 with probability pit (v) 6: pit (v) ← 0; renormalize pit 7: else 8: Select node v ∈ V \ Ait−1 with uniform probability 9: end if 10: kti = kti − 1 11: Eti = Eti ∪ (i, v) 12: end while
distribution pit , where pit (v) specifies the probability for node vi to attach to node v for v ∈ V . The probability distribution pit may depend on Ait and Gt−1 (e.g. higher degree nodes may get higher probabilities). In particular, we assume that i i v∈At pt (v) = 1, which corresponds to the assumption that every nodes expends all its communication energy within its local area. Since we do not allow parallel edges, if kti > |Ait |, it is not possible for node vi to expend all its communication energy within its local area Ait . In this case, we assume that kti − Ait edges are attached uniformly at random to nodes outside its area and the remaining edges are attached within its area. The precise algorithm for distributing the edges given the probability distribution pit is given in Algorithm 2. The evolution model is illustrated in Figure 7. In more detail, the model first obtains the out-degrees (which are exogenously specified). From Gt−1 , it computes Ait and pit for all nodes vi ∈ V . For all nodes, it then attaches edges according to Algorithm 2. This entire process is iterated for a user specified number of time steps. The entire process is given in Algorithm 1. The inputs to the model are the procedure OutDeg which specifies the out-degrees (assumed
48
M. Goldberg et al. Initial Graph
Iterate For each node:
Generate random graph with fixed out−degees
Determine local area
Assign out−degrees
Form new graph
Place edges w/in area
Fig. 7. Model execution flow
to be exogenous), the procedure Area which identifies the local areas of the nodes given the previous graph, and the procedure Prob which specifies the attachment probabilities according to the attachment model. We will now discuss some approaches to defining the areas and the attachment probabilities. When testing our model, we will also need the procedure for obtaining the out-degrees, which will be discussed in Section 7. 6.1
Locality Models
A node expends the majority of it’s communication energy within it’s local area. This captures the intuition that people mostly communicate with in a small group that contains friends, family, colleagues, etc. We propose the following definitions of an area: 1. Global. Every node vi is aware of the whole network, the local area of vi is Ait = V at every time period t. 2. k-neighborhood. The local area Ait of node vi at time t consists of all vj such that undirected shortest distance δt (vi , vj ) ≤ k. 3. Clusters. This definition is based on the notion of a cluster, defined as follows (see also [5], [4], [6]). First, the notion of the density of a set is introduced, which generally can be any function defined on a subset of nodes of the graph (e.g. ratio of number of edges within subset to a number of edges with at least one endpoint in the subset). For every density function, a cluster is defined to be any subset of nodes that is locally maximal w.r.t. to its density. Our definition of a cluster permits clusters to overlap. In fact, our experiments show that clusters overlap quite frequently. This is is expected in a graph of a social network, where the same member may belong to more than one community represented by clusters. Finally, using the definition of a cluster presented earlier in this paper, we define the local area of a blogger as the union of all clusters which she is a member of. Intuitively, this restricts a blogger’s activity to the set of individuals in groups which they have shown interest in previously.
Communication Dynamics of Blog Networks
6.2
49
Attachment Models
Given the local area Ait of the node vi at time t, the attachment model describes the probability pit+1 (vj ) of occurrence of an edge (vi , vj ) at time t + 1 for vj ∈ V . We propose the following attachment modes: 1. Uniform. Node vi attaches to any vj ∈ Ait with equal probability pit (vj ) =
1 |Ait |
(3)
and for vj ∈ / Ait , pit (vk ) = 0. 2. Preferential Attachment. Node vi attaches to any vj ∈ Ait with probability pit (vj ) ∝ indegt−1 (vj ) + γ
(4)
where indegt−1 (vj ) is the in-degree of vertex vj in graph Gt−1 and γ is a constant. 3. Markov Chain. To obtain the attachment probabilities for vertex vi we simulate the particle traveling over undirected edges of graph Gt starting from the node vi and randomly selecting edges to travel over until it arrives at first node ve ∈ / Ait . Every time the particle arrives at some node vj ∈ Ait , the counter cij is incremented. After this simulation is repeated with out resetting the counters cij , ∀ vj ∈ Ait a number of times, we determine the attachment probability pit (vj ) ∝ cij .
(5)
4. Inverse distance. Node vi attaches to some node vj ∈ Ait with probability pit (vj ) ∝
1
, ρ δt−1 (i, j)
(6)
where δt−1 (i, j) is the shortest undirected distance between vertices vi , vj in graph Gt−1 and ρ is a constant. The combination of the locality model and attachment model specifies the evolution model that, given the out-degree distribution, will produce a series of graphs that represent the blogograph at different time periods.
7
Experiments and Results
In this section we present the results of execution of few of the models and the evaluation of their performance.
50
M. Goldberg et al.
To evaluate the performance of the models, we compare the sequence of graphs produced by the model to the sequence of graphs produced by the observed communication in LiveJournal. In particular, we compare the clustering coefficient, the size of the giant component, average separation between two nodes, in-degree distributions, and community size and density distributions. To compare the in-degrees, we compute the point-wise difference of the normalized distributions. Formally, for each graph Gi we compute the normalized distribution Di (k) = |Vki | , where k is the degree and |Vi | is the number of vertices in the graph. The differences between distributions of observed graph Go and generated graph Gg is E= |Do (d) − Dg (d)|. (7) d
Notice, E ∈ [0, 2] and lower value of E corresponds to a closer match. To compare community structure we compute gerr , the average error between the size and density distributions that represent communities in generated and observed graphs. This value is computed by splitting the size-density plot into bins where bin width on the desity axis is 0.05 and on the size axis is 5. The resulting bin sizes are normalized with respect to the number of communities in the graph. Value of gerr is measured by the sum of difference of sizes of corresponding bins in size-density distributions of observed and generated data. To evaluate a particular model we execute enough iterations to let the model stabilize. We determine the stabilization by inspection of plots of the major parameters (including in-degree distribution, clustering coefficient, etc). Then, we compare the sequence of the graphs produced by the model after the stabilization to the sequence mined from LiveJournal. Table 3 contains the results of execution of models with various combinations of local area and attachment mechanisms compared to the average parameters of graphs of different observed weeks. Note, for the observed data average parameter E is computed by comparing distributions of graphs corresponding to various
Table 3. The stable parameters of graphs generated by various models compared to the parameters of the observed data Area Observed Global Global Global 3-Neighb. 3-Neighb. 3-Neighb. Clusters Clusters Clusters
Attch Uniform P.A. (in) P.A. (out) uniform P.A.(in) P.A.(out) uniform P.A. (in) P.A. (out)
GC 0.9545 0.9867 0.9688 0.8939 0.9776 0.9646 0.9643 0.9523
C 0.0613 5.2 × 10−6 0.00018 0.00045 0.00133 0.00252 0.00149 0.03156
d 5.34 7.86 5.21 5.30 4.53 6.73 6.88 6.56
E 0.0289 1.075 0.427 0.4331 0.1412 0.7267 0.1713 0.5320
gerr 0.00144 0.04215 0.01189 0.01792 0.03504 0.03484 0.03811 0.02034
Communication Dynamics of Blog Networks
51
observed weeks. Figure 8 compares the observed degree distribution to the ones generated by some of the best area/attachment combinations. As defined in Section 4, edge history conveys information about how close the end points of the observed edge were in the previous time cycle and therefore measures the significance of locality in the communications. Figure 8 compares the observed edge history with the edge histories produces by the best models. 7.1
Global Area Model
First, we consider the model with global area where vertices are aware of and can connect to any other vertex in the network. In the case of a uniform attachment, the resulting model is very similar to the Erd¨ os-R´enyi model. The in-degree distribution and other parameters generated by such model are predictably very different from the power law degree distribution in the observed graph. Global area with preferential attachment strictly proportional to the in-degree of the vertices in the graph of the previous iteration results in a formation of a power house - small set of vertices with very high in-degree that attract all of out-degree. This effect is caused directly by preferential attachment; since vertices with zero in-degree will never be attached to, any vertex that receives no incoming vertices at some iteration will not receive any incoming vertices in any of the following iterations. Clearly, the graph with small set of vertices that attract all of the in-degree is very different from the observed graph. The combination of global area and preferential attachment proportional to the out-degree of the vertices in the graph of the previous iteration produced results that were more similar to the observed network than the other global models, but the results were also significantly worse compared to models with other area definitions (k-neighborhood and union of clusters). Since this model allows for random selection of the end points of edges from the whole graph, the edge history (Figure 8) is very different from the one observed in the real-life network. 7.2
k-Neighborhood Area Model
We experimented with different values of k (k ∈ {2, 3, 4}) and determined that k = 3 produced the best models. The combination of 3-neighborhood area and uniform attachment produced a model that showed mediocre results when compared to the observed parameters. The combination of 3-neighborhood area and preferential attachment proportional to the in-degree produced a graph with a small power house in just a few iterations. 3-neighborhood area with preferential attachment proportional to the out-degree produced a model that generated graphs with in-degree distributions very similar to the observed graph. In particular, the power law tail resembled the tail of the observed graph. The edge history of this model was quite different from the observed, since most of the end points for new edges are selected such that their distance in the previous iteration’s graph was 3.
52
7.3
M. Goldberg et al.
Union of Clusters Area Model
An area constructed via the union of clusters in combination with preferential attachment proportional to the in-degree produced in-degree distributions more similar to the observed than other attachment mechanisms combined with this area definition. This model also produced a sequence of graphs with an edge history that was closest to the observed as evident from Figure 8.
1
1 observed Global + P.A. (out) Clusters + P.A. (in) 3-Neighb + P.A. (out)
0.95
P(d>=k)
0.9
Observed Global + P.A. (out) Clusters + P.A. (in) 3-Neighb + P.A. (out)
0.9 0.8
0.85
0.7
0.8
0.6
0.75
0.5
0.7
0.4
0.65
0.3
0.6
0.2
0.55
0.1
0.5 1
10
100 k
1000
(a) In-degree distribution
10000
0 2
4
6
8
10
12
14
16
(b) Edge history distribution
Fig. 8. In-degree and edge history distributions for various models and observed network
Models with this area definition were the only ones that produced non-trivial edge stability defined as the likelihood of a repetition of a recently observed edge. To evaluate this stability, we consider the number of edges that appeared more then once in 21 iterations of the model after stabilization. Models with global and 3-neighborhood area definitions that did not result in the formation of power houses produced a set of graphs such that less then 1% of edges that appeared more the once in all of the graphs. Models with an area defined by the union of clusters produced a set of graphs in which, on average, 14% of edges appear more then once in 21 iterations. In particular, a combination of this area definition with preferential attachment proportional to the in-degree produced a sequence of graphs with 18% of edges appear more than once, while in the observed network, 40% of edges (see Figure 4) appear more then once during 21 observed weeks. After considering all of the parameters of the models, we determined the combination of an area defined by the union of clusters with preferential attachment proportional to the in-degrees of the vertices to be the best model to describe the dynamics of communication in the observed network.
8
Conclusion
We have presented a set of statistics which display strong stability even for a dynamic network such as the blogosphere. Our list of stable statistics is not
Communication Dynamics of Blog Networks
53
exhaustive. However, they represent a comprehensive set of interesting properties of a network that any model for communication dynamics should capture. Our experiments have shown that the communication dynamics of large social networks are best explained as a result of local communication, where the majority of members communicate within their social locality, a relatively small set of nodes reflective of their interests or communities. The best approximation to this locality, among the models we evaluated on LiveJournal data was the one determined by the union of clusters a node belonged to. Our notion of a cluster is a set of nodes which locally maximized a cluster density. This notion of a cluster has the important property that it allows clusters to overlap, which is important if a cluster is to represent a social community or coalition. Many possibilities exist for enhancing the definitions of locality and the attachment mechanisms. One direction which we intend to pursue as future research is the combination of local with global attachment mechanisms. Acknowledgment. This material is based upon work partially supported by the U.S. National Science Foundation (NSF) under Grant Nos. IIS-0621303, IIS-0522672, IIS-0324947, CNS-0323324, NSF IIS-0634875 and by the U.S. Office of Naval Research (ONR) Contract N00014-06-1-0466 and by the U.S. Department of Homeland Security (DHS) through the Center for Dynamic Data Analysis for Homeland Security administered through ONR grant number N00014-07-1-0150 to Rutgers University. The content of this paper does not necessarily reflect the position or policy of the U.S. Government, no official endorsement should be inferred or implied.
References 1. Albert, R., Barab´ asi, A.-L.: Statictical mechanics of complex networks. Reviews of Modern Physics 74(47-97) (2002) 2. Barab´ asi, A.L., Jeong, J., N¨eda, Z., Ravasz, E., Shubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A 311(590-614) (2002) 3. Baumes, J., Chen, H.-C., Francisco, M., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Dynamics of bridging and bonding in social groups, a multi-agent model. In: Third conference of the North American Association for Computational Social and Organizational Science (NAACSOS 2005), Notre-Dame, Indiana, June 26-28 (2005) 4. Baumes, J., Goldberg, M., Krishnamoorthy, M., Magdon-Ismail, M., Preston, N.: Finding comminities by clustering a graph into overlapping subgraphs. In: Proceedings of IADIS International Conference, Applied Computing 2005, pp. 97–104 (2005) 5. Baumes, J., Goldberg, M., Magdon-Ismail, M.: Efficient identification of overlapping communities. In: IEEE International Conference on Intelligence and Security Informatics (ISI), May 2005, pp. 27–36 (2005) 6. Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Identification of hidden groups in communications. In: Handbooks in Information Systems, National Security, vol. 2 (2007) 7. Berger-Wolf, T.Y., Saia, J.: A framework for analysis of dynamic social networks. DIMACS Technical Report 28 (2005)
54
M. Goldberg et al.
8. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stat, R., Tomkins, A., Wiener, J.: Graph structure in the web. Computer Networks 33(1-6), 309–320 (2000) 9. Chung, F., Lu, L.: Connected components in random graphs with given degree sequence. Annals of Combinatoreics 6, 125–1456 (2002) 10. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data (2007) 11. Doreian, P., Stokman, E.F.N.: Evolution of social networks. Gordon and Breach (1997) 12. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. In: SIGCOMM, pp. 251–252 (1999) 13. Gkantsidi, C., Mihail, M., Zegura, E.: The markov chain simulatiopn methods for generating connected power law random graphs. In: Proc. of ALENEX 2003, pp. 16–50 (2003) 14. Goh, K.-I., Eom, Y.-H., Jeong, H., Kahng, B., Kim, D.: Structure and evolution of online social relationships: Heterogeneity in unrestricted discussions. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 73(6), 66123 (2006) 15. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: A locality model for the evolution of blog networks. In: IEEE Information and Security Informatics, ISI (2008) 16. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: Stable statistics of the blogograph. In: Interdisciplinary Studies in Information Privacy and Security (2008) 17. Kleinberg, J.M., Lawrence, S.: The structure of the web. Science, 1849–1850 (2001) 18. Kossinets, G., Watts, D.J.: Empirical analysis of an evolving social network. Science 311, 88–90 (2006) 19. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: Structure and evolution of blogospace. Communications of the ACM 33(1-6), 309–320 (2004) 20. Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of online social networks. In: KDD 2006 (2006) 21. Newman, M.: The structure and function of complex networks. SIAM Review 45(2), 167–256 (2003) 22. Newman, M., Barab´ asi, A.-L., Watts, D.: The structure and dynamics of networks. Princeton University Press, Princeton (2006) 23. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 98, 404 (2001) 24. Stauffer, A.O., Barbosa, V.C.: A study of the edge-switching markov chain method for the generation of random graphs. arxiv: cs. DM/0512105 (2006) 25. West, D.B.: Introduction to graph theory. Prentice Hall, Upper Saddle River (2003)
Finding Spread Blockers in Dynamic Networks Habiba1, , Yintao Yu2, , Tanya Y. Berger-Wolf1, , and Jared Saia3,† 1
2
University of Illinois at Chicago, {hhabib3,tanyabw}@uic.edu University of Illinois at Urbana-Champaign,
[email protected] 3 University of New Mexico,
[email protected] Abstract. Social interactions are conduits for various processes spreading through a population, from rumors and opinions to behaviors and diseases. In the context of the spread of a disease or undesirable behavior, it is important to identify blockers: individuals that are most effective in stopping or slowing down the spread of a process through the population. This problem has so far resisted systematic algorithmic solutions. In an effort to formulate practical solutions, in this paper we ask: Are there structural network measures that are indicative of the best blockers in dynamic social networks? Our contribution is two-fold. First, we extend standard structural network measures to dynamic networks. Second, we compare the blocking ability of individuals in the order of ranking by the new dynamic measures. We found that overall, simple ranking according to a node’s static degree, or the dynamic version of a node’s degree, performed consistently well. Surprisingly the dynamic clustering coefficient seems to be a good indicator, while its static version performs worse than the random ranking. This provides simple practical and locally computable algorithms for identifying key blockers in a network.
1
Introduction
How can we stop a process spreading through a social network? This problem has applications to diverse areas such as preventing or inhibiting the spread of diseases [7, 26, 40], computer viruses1 [8, 22], rumors, and undesirable fads or risky behaviors [23, 24, 37, 38]. A common approach to spread inhibition is to † 1
(No last name). Work supported in part by the Fulbright fellowship. Work performed in part while being a visiting student at the University of New Mexico. Work supported in part by the NSF grant IIS-0705822 and NSF CAREER Award 0747369. Work supported in part by the NSF grant IIS-0705822, NSF CAREER Award 0644058, and an AFO MURI award. In particular, we are concerned with computer malware that spreads through social networks, such as email viruses and worms, cell-phone viruses, and other related malware such as the recent MySpace worm.
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 55–76, 2010. c Springer-Verlag Berlin Heidelberg 2010
56
Habiba et al.
identify key individuals whose removal will most dampen the spread. In the context of the spread of a disease, it is a question of finding individuals to be quarantined, inoculated, or vaccinated so that the disease is prevented from becoming an epidemic. We call this set of key individuals the blockers of the spreading process. There has been significant previous work related to studying and controlling the spread of dynamic processes in a network [9,10,11,16,18,22,23,26,35,40,43, 44, 46, 47, 51, 54, 57, 59, 60, 67]. Unfortunately, these results have three properties rendering them ineffective for identifying good blockers in large networks. First, many proposed algorithms focus on a slightly different objective: they aim to identify nodes that will be most effective in starting the spread of a process rather than blocking it [44, 47]; or alternatively, nodes that would be most effective in sensing that a process has started to spread, and where the process initiated [9, 10, 11]. In this paper, we are focused specifically on identifying those nodes that are good blockers. Second, algorithms proposed in previous work all require computationally expensive calculations of some global properties over the entire network, or rely on expensive, repeated stochastic simulations of the spread of a dynamic process. In this paper, we present heuristics that identify good blockers quickly, based only on local information. Finally, perhaps the most critical problem in previous work is the omission of the dynamic nature of social interactions. The very nature of a spreading process implies an explicit time axis [52]. For example, the flow of information through a social network depends on who starts out with the information when, and which individuals are in contact at the starting point with the information carrier [43]. In this paper, we consider explicitly dynamic networks, defined in Section 3.1. In these networks, we study the social interactions over a finite period of time, measured in discrete time steps. The main contributions of this paper are summarized below. – We formally define dynamic networks in Section 3.1. This representation of networks encompasses the traditional “aggregate”view of networks defined in Section 3.2 and adds the explicit temporal component to the interactions. The time axis is necessary since most spreading processes take place on networks that evolve over time. – We formally define the problem of identifying key spread blockers in networks in Section 3.3. – We modify various network measures, such as the centrality measures and clustering coefficient, to incorporate the dynamic nature of the networks (Section 3.4). – We compare the reduction in the extent of spread based on removing individuals from a network in the ranking order imposed by various network measures. We identify measures that consistently give a good approximation of the best spread blockers. – We compare the difference in the sets of top blockers identified by various measures.
Finding Spread Blockers in Dynamic Networks
57
– We extensively evaluate our methods on real networks (Section 5). We use the Enron email network dataset, the MIT Reality Mining dataset, DBLP co-authorship network, animal population networks of Grevy’s zebras, Plains zebras, and onagers. Ultimately, we show that the dynamics of interactions matters, and moreover that simple local measures, such as degree, are highly indicative of an individual’s capacity to prevent the spread of a phenomenon in a population. The implication of our results are that there are practical scalable heuristics for identifying quarantine and vaccination targets in order to prevent an epidemic.
2
Related Work
Dynamic phenomena such as opinions, information, fads, behavior, and disease spread through a network by contacts and interactions among the entities of the network. Such spreading phenomena have been studied in a number of domains including epidemiology [22,26,40,51,54,57,59], diffusion of technological innovations and adoption of new products [7, 16, 18, 23, 24, 38, 35, 44, 46, 60, 67], voting, strikes, rumors [36, 37, 53, 68], as well as spread of contaminants in distribution networks [8, 9, 10, 11, 46] and numerous others. One of the fundamental questions about dynamic processes is: Which individuals, if removed from the network, would block the spread of such process? Several previous results have addressed the problem of identifying such individuals [26, 40, 43]. Eubank et al. [26] experimentally show that global graph theoretic measures like expansion factor and overlap ratio are good indicators for devising vaccination strategies in static networks. Cohen et al. [21] propose another immunization strategy based on the aggregate network model. In particular, they propose an efficient method of picking high degree nodes in a network to immunize, thus inhibiting the spread of disease. Kempe et al. [43] show that a variant of the blocker identification problem is NP-hard. While these problems and suggested approaches are similar to finding good blockers in a network, unfortunately, there are critical differences that make these results inappropriate for our formulation. First of all our objective is to minimize the expected extent of spread in a network. We do not make any assumption about the source of the spread. Second, almost all the above methods simplify the spreading process by ignoring the time ordering of interactions. There has also been significant related work on the problem of determining where to place a small number of detectors in a network so as to minimize the time required to detect the spread of a dynamic process, and, ideally, also the location at which the spread began. Berger-Wolf et al. [9] give algorithms for the problem of minimizing the size of the infected population before an outbreak is detected. Berry et al. [10,11] give algorithms to strategically place sensors in utility distribution networks to minimize worst case time until detection. In [47], Leskovec et al. demonstrate that many objectives of the detection problem exhibit the property of submodularity. They exploit this fact to develop efficient and elegant algorithms for placing detectors in a network. While the detection problem is related to the
58
Habiba et al.
problem of blocking a process, it is only concerned with detecting a spreading process once, whereas a good blocker prevents multiple spreading paths. Moreover, the algorithms proposed for the detection problem all require global information and work only for a stable, relatively unchanging network. Another related problem is that of identifying nodes in a network that are most critical for spreading a dynamic process. Kempe et al. [44] show that identifying key spreaders – individuals that help to propagate the spread most – is NP-hard, but admits a simple greedy (1−1/e)-approximation. Later, Mossel and Roch [55] showed that the general case of finding a set of nodes with the largest “influence” is NP-hard, and has a (1 − 1/e − ε) approximation algorithm. Unfortunately, this approximation algorithm is computationally intensive. Strong inapproximability results for several variants of identifying nodes with high influence in social networks have been shown in [19]. Asur et al. in [5] present an event based characterization of critical behavior in interaction graphs for the purposes of modeling evolution, link prediction, and influence maximization. Finally, Aspnes et al. [4] have studied the inoculation problem from a graph theoretic perspective. They show that finding an optimum inoculation strategy is related to the sum-of-squares partition problem. Moreover, they show that the social welfare of an inoculation strategy found when each node is a selfish agent can be significantly less than the social welfare of an optimal inoculation strategy.
3
Definitions
Populations of individuals interacting over time are often represented as networks, or graphs, where the nodes correspond to individuals and a pairwise interaction is represented as an edge between the corresponding individuals. The idea of representing societies as networks of interacting individuals dates back to Lewin’s earlier work of group behavior [48]. Typically, there is a single network representing all interactions that have happened during the entire observation period. We call this representation an aggregate network (Section 3.2). In this paper we use an explicitly dynamic network representation (Section 3.1) that takes the history of interactions into account. 3.1
Dynamic Network
We represent dynamic network as a series G1 , . . . , GT of static networks where each Gt is a snapshot of individuals and their interactions at time t. For this work, we assume that the time during which the individuals are observed is finite. For simplicity, we also assume that the time period is divided into discrete steps {1, . . . , T }. The nontrivial problem of appropriate time discretization is beyond the scope of this paper. We assume that an interaction between a pair of individuals takes place within one time step. Definition 1. Let {1, . . . , T } be a finite set of discrete time steps. Let V = {1, . . . , n} be a set of individuals. Let Gt = (Vt , Et ) be a graph representing the snapshot of the network at time t. Vt ⊆ V , is a subset of individuals V observed
Finding Spread Blockers in Dynamic Networks
59
at time t. An edge (ut , vt ) ∈ Et if individuals u and v have interacted at time t. Further, for all v ∈ V and t ∈ {1, . . . , T − 1} the edges (vt , vt+1 ) ∈ E are directed self edges of individuals across time steps. . . , GT is the graph GD= (V, E) of the time A dynamic network GD = G1 , . series of graphs Gt such that V = t Vt and E = t Et ∪ t−1 (vt , vt+1 ). The definition is equivalent to an undirected multigraph representation in [43]. Figure 1 shows an example of several dynamic networks that have the same unweighted aggregate network representation.
(a)
(b)
(c)
(d)
(e) Fig. 1. Example of several dynamic networks that have the same unweighted aggregate network representation. Figures (a)–(d) show a dynamic networks of three individuals interacting over four time steps. The solid line edges represent interactions among individuals in a time step. Empty circles are individuals observed during a time step. While at any given time step some individuals may be unobserved, the particular example shows all the individuals being observed at all time steps. Figure (e) shows an unweighted aggregate network that has the same interactions as every dynamic network in the example. Figures (a)–(c) have the multiplicity two of each edge while figure (d) has the multiplicity four for every edge in the aggregate representation.
3.2
Aggregate Network
The aggregate network is the graph GA = (V, E) of individuals V and their interactions E observed over a period of time. In this representation an edge exists between a pair of individuals if they have ever interacted during the observed time period. Multiple interactions between a pair of individuals over time are represented as a single, possibly weighted, edge or multiple edges between them. This representation provides an aggregate view of the population where the information about the timing and order of interactions is discard. In this work we represent aggregate networks as multigraphs. Definition 2. Let {1, . . . , T } be a finite set of discrete time steps. Let Vt be the set of individuals observed at time t and let Et be the set of interactions among of such a network is individuals Vt at t. Then the aggregate graph GA = (V, E) the set of individuals V and interactions E such that V = t Vt and (u, v) ∈ E if ∃(ut , vt ) at some time step t ∈ {1, . . . , T }.
60
Habiba et al.
Using this aggregate network model, the structure and properties of many social networks have been studied from different perspectives [6,13,12,15,41]. However, as we have mentioned, this and other similar models do not explicitly consider the temporal aspect of the network. 3.3
Spread Blockers
We now formalize the notions of processes spreading in a network and individuals blocking this spread. Spread(.) is a function that gives the overall average extent of spread in a network, that is, the expected number of individuals affected by a stochastic spreading process within a specified number of time steps. The estimate of the spread is dependent on the model of the spreading process, the structure of the network, and, of course, the number of time steps under consideration. Spreadv (.) is the expected spread in a network, when the spreading process is initiated by the individual v. Given a model of a spreading process M and a distribution of the probability of infection X : E → [0, 1], we define the spreading functions as follows: Spreadv : {N etworks × Spread M odels × P robability × T ime} → R+ 1 Spread(G, M, X , T ) = Spreadv (G, M, X , T ) (1) |V | v∈V
The limit equilibrium state of spread is denoted by Spread(G, M, X ) = Spread(G, M, X , ∞)
(2)
For a fixed spread model, probability distribution and a time period we will use the overloaded shorthand notation Spread(G). We define BlX (.) as a function that measures the reduction in the expected spread size after removing the set X of individuals from the network. Hence, the blocking capacity of a single individual v, Blv (.), is the reduction in expected spread size after removing individual v from the network. BlX : {N etworks × Spread M odels × P robability × T ime} → R+ BlX (G) = BlX (G, M, X , T ) = Spread(G) − Spread(G \ X).
(3)
kBl(.) is the function that finds the maximum possible reduction in spread in a network when set of individuals of size k is removed from the network. Notice, that the value of this function is always at least k. The argmax of this function finds the best blocker(s) in a network. kBl(G) =
max
X⊆V,|X|=k
BlX (G).
(4)
Thus, finding the best blockers in the network is equivalent to finding the (set of) individuals whose removal from the network minimizes the expected extent of spread. Spread(G \ X). (5) kBl(G) = Spread(G) − min X⊆V,|X|=k
Finding Spread Blockers in Dynamic Networks
61
This definition of the individuals’ blocking capacity by removal corresponds in the disease spread context to the quarantine action. Vaccination or inoculation leave the node in the network but deactivate its ability to propagate the spread. For the Independent Cascade model of spread (Section 3.5) the two actions are equivalent at the abstract level of estimating the spread and identifying blockers in networks. Since no good analytical approaches are known for identifying blockers in networks, in this paper we focus on examining the possibility of using structural network measures as practical indicators of nodes’ blocking ability. We next briefly define the structural measures used in this paper. 3.4
Network Structural Measures
In network analysis various properties of the graph representing the population are studied as proxies of the properties of the individuals, their interactions, and the population itself. For example, the degree, various centrality measures, clustering coefficients, or the eigenvalues (PageRank) of the nodes have been used to determine the relative importance of the individuals, e.g., [17, 42]. Betweenness centrality has been used to identify cohesive communities [33] and the distributions of shortest path lengths employed to measure the “navigability” of the network [66]. These and many other graph theoretic measures have been translated to many social properties [50, 56, 57]. The ability of an individual to block a process spreading over a network can be seen as another such social property. Graph measures such as clustering and assortative mixing coefficients have been used to design local vaccination strategies [40]. However, it is not clear that those are the best network measures to be used as an indicator of a node which is a good blocker. In this paper we evaluate the power of all the standard network measures of a node to indicate the blocking ability of the corresponding individual. Moreover, we extend the standard static measures to reflect the dynamic nature of the underlying network. We examine the following measures: degree, average degree, betweenness, closeness centralities and clustering coefficient. We modify those to incorporate the time ordering of the interactions. We use the following terms interchangeably in this paper: individuals or nodes are the vertices of the network and interactions are edges that can be both directed or undirected. Neighbors of a node, N (.), are the set of nodes adjacent to it. The subscript T with a function name indicates the dynamic variant of the function. We now state the standard network measures for aggregate networks and define corresponding measures for dynamic networks. We focus first on the global measures that summarize the entire network and then address local measures that characterize a node. Global Structural Properties Density is the proportion of the number of edges |E| present in a network relative to the possible number of edges |V2 | .
62
Habiba et al.
|E| D(G) = |V | .
(6)
2
Dynamic Density is the average density of an observed time snapshot. 1 DT (G) = D(Gt ). T
(7)
1 1.
k =1
(b) Draw i.i.d. mixture components θkc ∼ Gc0 , k = 1, 2, . . .
(1)
Social Network Mining with Nonparametric Relational Models
83
2. For each relationship class b between two object classes ci and cj , draw φbk, ∼ Gb0 i.i.d. with component indices k for ci and for cj . 3. For each object eci in a class c, (a) Draw cluster assignment Zic ∼ Mult(·|π c ); (b) Draw object attributes Aci ∼ P (·|θc , Zic ). c c b ∼ P (·|φb , Zici , Zj j ). 4. For eci i and ej j with a relationship of class b, draw Ri,j The basic property of SBC is that: the distributions of the parameters (θkc and φb ) are sampled, e.g., the distribution of θkc can be represented as Gc = ∞ k,c c c c k=1 πk δθk , where δθk is a distribution with a point mass on θk . In terms of this property, SBC can sample objects independently; thus it might be efficient when a large domain is involved.
3
Inference
The key inferential problem in the IHRM model is computing posterior of unobservable variables given the data, i.e. P ({π c, Θc , Z c }c , {Φb }b |D, {αc0 , Gc0 }c , {Gb0 }b ). Unfortunately, the computation of the joint posterior is analytically intractable, thus we consider approximate inference methods to solve the problem. 3.1
Inference with Gibbs Sampling
Markov chain Monte Carlo (MCMC) sampling has been used to approximate posterior distribution with a DP mixture prior. In this section, we describe the efficient blocked Gibbs sampling (GS) with truncated stick breaking representation [14] for the IHRM model. The advantage is that given the posterior distributions, we can independently sample hidden variables in a block, which highly accelerates the computation. The Markov chain is thus defined not only on hidden variables, but also on parameters. Truncated stick breaking construction (TSB) fixes a value K c for each class of objects by letting VKc c = 1. That means the mixing weights πkc are equal to 0 for k > K c (refer to Equ. 1). The number of the clusters is thus reduced to K c . Note, that K c is an additional parameter in the inference method. At each iteration, we first update the hidden variables conditioned on the parameters sampled in the last iteration, and then update the parameters conditioned on the hidden variables. In detail: 1. For each class of objects, c(t+1) with probability proportional to: (a) Update each hidden variable Zi c(t)
c(t+1)
πk P (Aci |Zi
= k, Θc(t) )
b
c(t+1)
b P (Ri,j |Zi
j
c (t)
= k, Zj j
, Φb
(t)
), (2)
b where Aci and Ri,j denotes the known attributes and relationships about c (t)
i. cj denotes the class of the object j , Zj j denotes hidden variable of j at the last iteration t. Intuitively, the equation represents to what extent the cluster k agrees with the data Dic about the object i.
84
Z. Xu et al.
(b) Update π c(t+1) as follows: c(t+1) c(t+1) c(t+1) i. Sample vk from Beta(λk,1 , λk,2 ) for k = {1, . . . , K c − 1} c
c(t+1)
λk,1
= 1+
N
c
c(t+1)
δk (Zi
c(t+1)
), λk,2
= αc0 +
k =k+1
i=1
c(t+1)
c
N K
c(t+1)
c(t+1)
δk (Zi
), (3)
i=1
c(t+1)
and set vK c = 1. δk (Zi ) equals to 1 if Zi = k and 0 otherwise. c(t+1) c(t+1) k−1 c(t+1) = vk ) for k > 1 ii. Compute π c(t+1) as: πk k =1 (1 − vk c(t+1) c(t+1) and π1 = v1 . c(t+1) b(t+1) 2. Update θk ∼ P (·|Ac , Z c(t+1) , Gc0 ) and φk, ∼ P (·|Rb , Z (t+1) , Gb0 ). The parameters are drawn from their posterior distributions conditioned on the sampled hidden states. Again, since we assume conjugated priors as the base distributions (Gc0 and Gb0 ), the simulation is tractable. After convergence, we collect the last W samples to make predictions for the relationships of interest. Note that in blocked Gibbs sampling, the MCMC sequence is defined by hidden variables and parameters, including Z c(t) , π c(t) , Θc(t) , and b between a new object Φb(t) . The predictive distribution of a relationship Rnew,j c j c enew and a known object ej is approximated as b b B P (Rnew,j |D, {αc0 , Gc0 }C c=1 , {G0 }b=1 )
≈
W +w 1 b b(t) B P (Rnew,j |D, {Z c(t) , π c(t) , Θc(t) }C }b=1 ) c=1 , {Φ W t=w+1
∝
W +w K 1 b(t) c(t) c(t) b (t) b b P (Rnew,j |φk, ) πk P (Acnew |θk ) P (Rnew,j |φ k, ), W t=w+1 k=1
c
b
j
where and denote the cluster assignments of the objects j and j , respectively. The equation is quite intuitive. The prediction is a weighted sum of predictions b(t) b P (Rnew,j |φk, ) over all clusters. The weight of each cluster is the product of the last three terms, which represents to what extent this cluster agrees with the known data (attributes and relationships) about the new object. Since the blocked method also samples parameters, the computation is straightforward. 3.2
Inference with Variational Approximation
The IHRM model has multiple DPs which interact through relationships, thus blocked Gibbs sampling is still slow due to the slow exchange of information between DPs. To solve the problem, we outline an alternative solution by variational inference method. The main strategy is to convert a probabilistic inference problem into an optimization problem, and then to solve the problem with the known optimization techniques. In particular, the method assumes a distribution q, referred to as a variational distribution, to approximate the true posterior P as close as possible. The difference between the variational distribution q and the true posterior P can be measured via Kullback-Leibler (KL) divergence. Let
Social Network Mining with Nonparametric Relational Models
85
ξ denote a set of unknown quantities, and D denote the known data. The KL divergence between q(ξ) and P (ξ|D) is defined as: KL(q(ξ)||P (ξ|D)) =
q(ξ) log q(ξ) −
ξ
q(ξ) log P (ξ|D).
(4)
ξ
The smaller the divergence, the better is the fit between the true and the approximate distributions. The probabilistic inference problem (i.e. computing the posterior) now becomes: to minimize the KL divergence with respect to the variational distribution. In practice, the minimization of the KL divergence is formulated as the maximization of the lower bound of the log-likelihood: log P (D) ≥
q(ξ) log P (D, ξ) −
ξ
q(ξ) log q(ξ).
(5)
ξ
A mean-field method was explored in [6] to approximate the posterior of unobservable quantities in a DP mixture model. The main challenge of using mean-field inference for the IHRM model is that there are multiple DP mixture models coupled together with relationships and correlation mixture components. In the IHRM model, unobservable quantities include Z c , π c , Θc and Φb . Since mixing weights π c are computed on V c (see Equ. 1), we can replace π c with V c in the set of unobservable quantities. To approximate the posterior P ({V c , Θc , Z c }c , {Φb }b |D, {αc0 , Gc0 }c , {Gb0 }b ), we define a variational distribution b B q({Z c , V c , Θc }C c=1 , {Φ }b=1 ) as:
c
C N c
c
q(Zic |ηic )
i
K
⎤ ⎡ B K ci K cj q(Vkc |λck )q(θkc |τkc ) ⎣ q(φbk, |ρbk, )⎦ ,
k
b
k
(6)
where ci and cj denote the object classes involved in the relationship class b. k and denote the cluster indexes for ci and cj . Variational parameters include {ηic , λck , τkc , ρbk, }. q(Zic |ηic ) is a multinomial distribution with parameters ηic . Note, that there is one ηic for each object eci . q(Vkc |λck ) is a Beta distribution. q(θkc |τkc ) and q(φbk, |ρbk, ) are respectively with the same forms as Gc0 and Gb0 . We substitute Equ. 6 into Equ. 5 and optimize the lower bound with a coordinate ascent algorithm, which generates the following equations to iteratively update the variational parameters until convergence: c
λck,1 = 1 +
N
c
c ηi,k ,
λck,2 = αc0 +
i=1
c
K N i=1
c
c τk,1
=
β1c
+
N
c ηi,k
c τk,2
=
β2c
+
N
c
ci b j ηi,k ηj, T(Ri,j ),
c
ci j ηi,k ηj, ,
(9)
i,j
j
(8)
ρbk,,2 = β2b +
i,j k−1 k =1
b
c ηi,k ,
i=1
∝ exp Eq [log Vkc ] + +
(7)
c
c ηi,k T(Aci ),
i=1
ρbk,,1 = β1b +
c ηi,k ,
k =k+1
Eq [log(1 − Vkc )] + Eq [log P (Aci |θkc )]
cj b ηj, Eq [log P (Ri,j |φbk, )]
,
(10)
86
Z. Xu et al.
where λck denotes parameters of Beta distribution q(Vkc |λck ), λck is a twodimensional vector λck = (λck,1 , λck,2 ). τkc denotes parameters of the exponential c family distribution q(θkc |τkc ). We decompose τkc such that τk,1 contains the first c c c dim(θk ) components and τk,2 is a scalar. Similarly, β1 contain the first dim(θkc ) components and β2c is a scalar. ρbk,,1 , ρbk,,2 , β1b and β2b are defined equivalently. b ) denote the sufficient statistics of the exponential family disT(Aci ) and T(Ri,j c c b tributions P (Ai |θk ) and P (Ri,j |φbk, ), respectively. It is clear that Equ. 7 and Equ. 8 correspond to the updates for variational parameters of object class c, and they follow equations in [6]. Equ. 9 represents the updates of variational parameters for relationships, which is computed on the involved objects. The most interesting updates are Equ. 10, where the posteriors of object cluster-assignments are coupled together. These essentially connect the c include a DPs together. Intuitively, in Equ. 10 the posterior updates for ηi,k prior term (first two expectations), the likelihood term about object attributes (third expectation), and the likelihood terms about relationships (last term). To calculate the last term we need to sum over all the relationships of the object cj that is variational expectation about cluster-assignment of eci weighted by ηj, the other object involved in the relationship. Once the procedure reaches stationarity, we obtain the optimized variational parameters, by which we can approximate the predictive distribution b b B b P (Rnew,j |D, {αc0 , Gc0 }C c=1 , {G0 }b=1 ) of the relationship Rnew,j between a new cj c b object enew and a known object ej with q(Rnew,j |D, λ, η, τ, ρ) proportional to: c
c
K K j k
c
c
b c q(Rnew,j |ρbk, )q(Zj j = |ηj j )q(Znew = k|λc )
× q(Acnew |τkc )
b
j
c
c
b b q(Zj j = |ηj j )q(Rnew,j |ρk, ).
(11)
b The prediction is a weighted sum of predictions q(Rnew,j |ρbk, ) over all clusters. The weight consists of two parts. One is to what extent the cluster agrees with c the object ej j (i.e. the 2nd term), the other is to what extent the cluster k agrees with the new object (i.e. the product of the last 3 terms). The computations c about the two parts are different. The reason is that ej j is a known object, we c have optimized variational parameters ηj j about its cluster assignment.
4 4.1
Experimental Analysis Monastery Data
The first experiment is performed on the Sampson’s monastery dataset [19] for community discovery. Sampson surveyed social relationships between 18 monks in an isolated American monastery. The relationships between monks included esteem/disesteem, like/dislike, positive influence/negative influence, praise and blame. Breiger et al. [7] summarized these relationships and yielded a single
Social Network Mining with Nonparametric Relational Models
87
2 4 6 8 10 12 14 16 18 2
4
6
8
10
12
14
16
18
Fig. 3. Left: The matrix displaying interactions between Monks. Middle: A sociogram for three monks. Right: The IHRM model for the monastery sociogram.
relationship matrix, which reflected interactions between monks, as shown in Fig. 3 (left). After observing the monks in the monastery for several months, Sampson provided a description of the factions among the monks: the loyal opposition (Peter, Bonaventure, Berthold, Ambrose and Louis), the young turks (John Bosco, Gregory, Mark, Winfrid, Hugh, Boniface and Albert) and the outcasts (Basil, Elias and Simplicius). The other three monks (Victor, Ramuald and Amand) wavered between the loyal opposition and the young turks, and were identified as the fourth group, the waverers. Sampson’s observations were confirmed by the event that the young turks group resigned after the leaders of the group were expelled over religious differences. The task of the experiment is to cluster the monks. Fig. 3 (middle) shows a sociogram with 3 monks. The IHRM model for the monastery network is illustrated as Fig. 3 (right). There is one hidden variable for each monk. The relationships between monks are conditioned on the hidden variables of the involved monks. The mean field method is used for inference. We initially assume that each monk is in his own cluster. After convergence, the cluster number is optimized as 4, which is exactly the same as the number of the groups that Sampson identified. The clustering result is shown as Table 2. It is quite close to the real groups. Cluster 1 corresponds to the loyal opposition. Cluster 2 is the young turks, and cluster 3 is the outcasts. The waverers are split. Amand is assigned to cluster 4, Victor and Ramuald are assigned to the loyal opposition. Actually, previous research analysis has questioned the distinction of the waverers, e.g., [7,12] clustered Victor and Ramuald into the loyal opposition, which coincides with the result of the IHRM model. Table 2. Clustering result of the IHRM model on Sampson’s monastery data Cluster 1 2 3 4
Members Peter, Bonaventure, Berthold, Ambrose, Louis, Victor, Ramuald John, Gregory, Mark, Winfrid, Hugh, Boniface, Albert Basil, Elias, Simplicius Amand
88
4.2
Z. Xu et al.
Bernard & Killworth Data
In the second experiment, we perform link analysis with IHRM on the Bernard & Killworth data [5]. Bernard and Killworth collected several data sets on human interactions in bounded groups. In each study they obtained measures of social interactions among all actors, and ranking data based on the subjects’ memory of those interactions. Our experiments are based on three datasets. The BKFRAT data is about interactions among students living in a fraternity at a West Virginia college. All subjects had been residents in the fraternity from three months to three years. The data consists of rankings made by the subjects of how frequently they interacted with other subjects in the observation week. The BKOFF data concern interactions in a small business office. Observations were made as the observer patrolled a fixed route through the office every fifteen minutes during two four-day periods. The data contains rankings of interaction frequency as recalled by employees over the two-week period. The BKTEC data is about interactions in a technical research group at a West Virginia university. It contains the personal rankings of the remembered frequency of interactions. In the experiments, we randomly select 50% (60%, 70%, 80%) interactions as known and predict the left ones. The experiments are repeated 20 times for each setting. The average prediction accuracy is reported in Table 3. We compare our model with the Pearson collaborative filtering method. It shows that the IHRM model provides better performance on all the three datasets. Fig. 4 illustrates the link prediction results on the BKOFF dataset with 70% known links. The predicted interaction matrix is quite similar with the real one. Table 3. Link prediction on the Bernard & Killworth data with the IHRM
50% IHRM Pearson BKFRAT 66.50 61.82 BKOFF 66.21 57.32 BKTEC 65.47 58.85
Prediction Accuracy (%) 60% 70% IHRM Pearson IHRM Pearson 67.63 64.56 68.26 66.91 67.89 59.45 69.20 60.58 66.79 62.04 68.31 63.61
80% IHRM Pearson 68.69 67.41 69.82 61.54 69.58 64.46
Fig. 4. Left: Interaction matrix on the BKOFF data. Right: The predicted one, which is quite similar with the real situation.
Social Network Mining with Nonparametric Relational Models
4.3
89
MovieLens Data
We also evaluate the IHRM model on the MovieLens data [21]. There are in total 943 users and 1680 movies, and we obtain 702 users and 603 movies after removing low-frequent ones. Each user has about 112 ratings on average. The model is shown in Fig. 5. There are two classes of objects (users and movies) and one class of relationships (Like). The task is to predict preferences of users. The users have attributes Age, Gender, Occupation, and the movies have attributes Published-year, Genres and so on. The relationships have two states, where R = 1 indicates that the user likes the movie and 0 otherwise. The user ratings in MovieLens are originally based on a five-star scale, so we transfer each rating to binary value with R = 1 if the rating is higher than the user’s average rating, vice versa. The performance of the IHRM model is analyzed from 2 points: prediction accuracy and clustering effect. To evaluate the prediction performance, we perform 4 sets of experiments which respectively select 5, 10, 15 and 20 ratings for each test user as the known ratings, and predict the remaining ratings. These experiments are referred to as given5, given10, given15 and given20 in the following. For testing the relationship is predicted to exist (i.e., R = 1) if the predictive probability is larger than a threshold ε = 0.5. We implement the following three inference methods: Chinese restaurant process Gibbs sampling (CRPGS), truncated stick-breaking Gibbs sampling (TSBGS), and the corresponding mean field method TSBMF. The truncation parameters Ks for TSBGS and TSBMF are initially set to be the number of entities. For TSBMF we consider α0 = {5, 10, 100, 1000}, and obtain the best prediction when α0 = 100. For CRPGS and TSBGS α0 is 100. For the variational inference, the change of variational parameters between two iterations is monitored to determine the convergence. For the Gibbs samplers, the convergence was
Fig. 5. Top: A sociogram for movie recommendation system, illustrated with 2 users and 3 movies. For readability, only two attributes (user’s occupation and movie’s genre) show in the figure. Bottom: The IHRM model for the sociogram.
90
Z. Xu et al.
Fig. 6. Left: The traces of the number of user clusters for the runs of two Gibbs samplers. Middle: The trace of the change of the variational parameter η u for mean field method. Right: The sizes of the largest user clusters of the three inference methods.
analyzed by three measures: Geweke statistic on likelihood, Geweke statistic on the number of components for each class of objects, and autocorrelation. Fig. 6 (left) shows the trace of the number of user clusters in the 2 Gibbs samplers. Fig. 6 (middle) illustrates the change of variational parameters η u in the variational inference. For CRPGS, the first w = 50 iterations (6942 s) are discarded as burn-in period, and the last W = 1400 iterations are collected to approximate the predictive distributions. For TSBGS, we have w = 300 (5078 s) and W = 1700. Although the number of iterations for the burn-in period is much less in the CRPGS if compared to the blocked Gibbs sampler, each iteration is approximately a factor 5 slower. The reason is that CRPGS samples the hidden variables one by one, which causes two additional time costs. First, the expectations of attribute parameters and relational parameters have to be updated when sampling each user/movie. Second, the posterior of hidden variables have to be computed one by one, thus we can not use fast matrix multiplication techniques to accelerate the computation. Therefore if we include the time, which is required to collect a sufficient number of samples for inference, the CRPGS is slower by a factor of 5 (the row Time(s) in Table 4 ) than the blocked sampler. The mean field method is again by a factor around 10 faster than the blocked Gibbs sampler and thus almost two orders of magnitude faster than the CRPGS.
Table 4. Performance of the IHRM model on MovieLens data
Given5 Given10 Given15 Given20 Time(s) Time/iter. #C.u #C.m
CRPGS 65.13 65.71 66.73 68.53 164993 109 47 77
TSBGS 65.51 66.35 67.82 68.27 33770 17 59 44
TSBMF 65.26 65.83 66.54 67.63 2892 19 9 6
Pearson 57.81 60.04 61.25 62.41 -
SVD 63.72 63.97 64.49 65.13 -
Social Network Mining with Nonparametric Relational Models
91
The prediction results are shown in Table 4. All IHRM inference methods under consideration achieve comparably good performance; the best results are achieved by the Gibbs samplers. To verify the performance of the IHRM, we also implement Pearson-coefficient collaborative filtering (CF) method [18] and a SVD-based CF method [20]. It is clear that the IHRM outperforms the traditional CF methods, especially when there are few known ratings for the test users. The main advantage of the IHRM is that it can exploit attribute information. If the information is removed, the performance of the IHRM becomes close to the performance of the SVD approach. For example, after ignoring all attribute information, the TSBMF generates the predictive results: 64.55% for Given5, 65.45% for Given10, 65.90% for Given15, and 66.79% for Given20. The IHRM provides cluster assignments for all objects involved, in our case for the users and the movies. The rows #C.u and #C.m in Table 4 denote the number of clusters for users and movies, respectively. The Gibbs samplers converge to 46-60 clusters for the users and 44-78 clusters for the movies. The mean field solution have a tendency to converge to a smaller number of clusters, depending on the value of α0 . Further analysis shows that the clustering results of the methods are actually similar. First, the sizes of most clusters generated by the Gibbs samplers are very small, e.g., there are 72% (75.47%) user clusters with less than 5 members in CRPGS (TSBGS). Fig. 6 (right) shows the sizes of the 20 largest user clusters of the 3 methods. Intuitively, the Gibbs samplers tend to assign the outliers to new clusters. Second, we compute the rand index (0-1) of the clustering results of the methods, the values are 0.8071 between CRPGS and TSBMF, 0.8221 between TSBGS and TSBMF, which demonstrates the similarity of the clustering results. Fig. 7 gives the movies with highest posterior probability in the 4 largest clusters generated from TSBMF. In cluster 1 most movies are very new and
Fig. 7. The major movie clusters generated by TSBMF on MovieLens data
92
Z. Xu et al.
popular (the data set was collected from September 1997 through April 1998). Also they tend to be action and thriller movies. Cluster 2 includes many old movies, or movies produced by the non-USA countries. They tend to be drama movies. Cluster 3 contains many comedies. In cluster 4 most movies include relatively serious themes. Overall we were quite surprised by the good interpretability of the clusters. Fig. 8 (top) shows the relative frequency coefficient (RFC) of the attribute Genre in these movie clusters. RFC of a genre s in a cluster k is calculated as (fk,s − fs )/σs , where fk,s is the frequency of the genre s in the movie cluster k, fs is mean frequency, and σs is standard deviation of frequency. The labels for each cluster specify the dominant genres in the cluster. For example, action and thriller are the two most frequent genres in cluster 1. In general, each cluster involves several genres. It is clear that the movie clusters are related to, but not just based on, the movie attribute Genre. The clustering effect depends on both movie attributes and user ratings. Fig. 8 (bottom) shows RFC of the attribute Occupation in user clusters. Equivalently, the labels for each user cluster specify the dominant occupations in the cluster. Note that in the experiments we predicted a relationship attribute R indicating the rating of a user for a movie. The underlying assumption is that in principle anybody can rate any movie, no matter whether that person has watched the movie or not. If the latter is important, we could introduce an additional attribute Exist to specify if a user actually watched the movie. The relationship R would then only be included in the probabilistic model if the movie was actually watched by a user.
Fig. 8. Top: The relative frequency coefficient of the attribute Genre in different movie clusters, Bottom: that of the attribute Occupation in different user clusters
Social Network Mining with Nonparametric Relational Models
5
93
Related Work
The work on infinite relational model (IRM) [15] is similar to the IHRM, and has been developed independently. One difference is that the IHRM can specify any reasonable probability distribution for an attribute given its parent, whereas the IRM would model an attribute as a unary predicate, i.e. would need to transform the conditional distribution into a logical binary representation. Aukia et al. also developed a DP mixture model for large networks [4]. The model associates an infinite-dimensional hidden variable for each link (relationship), and the objects involved in the link are drawn from a multinomial distribution conditioned on the hidden variable of the link. The model is applied to the community web data with promising experimental results. The latent mixed-membership model [1] can be viewed as a generalization of LDA model on relational data. Although it is not nonparametric, the model exploits hidden variables to avoid the extensive structure learning and provides a principled way to model the relational networks. The model associates each object with a membership probability-like vector. For each relationship, cluster assignments of the involved objects are generated with respect to their membership vectors, and then relationship is drawn conditioned on the cluster assignments. There are some other important SRL research works for complex relational networks. The probabilistic relational model (PRM) with class hierarchies [10] specializes distinct probabilistic dependency for each subclass, and thus obtains refined probabilistic models for relational data. A group-topic model is proposed in [23]. It jointly discovers latent groups in a network as well as latent topics of events between objects. The latent group model in [16] introduces two latent variables ci and gi for an object, and ci is conditioned on gi . The object attributes depends on ci and relations depend on gi of the involved objects. The limitation is that only relations between members in the same group are considered. These models demonstrate good performance in certain applications. However, most are restricted to domains with simple relationships. These models demonstrate good performance in certain applications. However, most are restricted to domains with simple relationships.
6
Extension: Conditional IHRM
We have presented the IHRM model and an empirical analysis of social network data. As a generative model, the IHRM models both object attributes and relational attributes as random variables conditioned on clusters of objects. If the goal is to predict relationship attributes, one might expect to obtain improved prediction performance if one trains a model conditioned on the attributes. As part of ongoing work we study the extension of the IHRM model to discriminative learning. A conditional IHRMs directly models the posterior probability of relations given features derived from attributes of the objects. Fig. 9 illustrates the conditional IHRM model with a simple sociogram example.
94
Z. Xu et al.
Fig. 9. A conditional IHRM model for a simple sociogram. The main difference from the IHRM model in Fig. 2 is that attributes G are not indirect influence over relations R via object clusters Z, but are direct conditions of relations.
The main difference to the IHRM model in Fig. 2 is that relationship attributes are conditioned on both the states of the latent variables and features derived from attributes. A simple conditional model is based on logistic regression of the form log P (Ri,j |Zi = k, Zj = , F (Gi , Gj )) = σ(ωk, , xi,j ),
where xi,j = F (Gi , Gj ) denotes a vector describing features derived from all attributes of i and j. ωk, is a weight vector, which determines how much a particular attribute contributes to a choice of relation and can implicitly implement feature selection. Note that there is one weight vector for each cluster pair (k, ). ·, · denotes an inner product. σ(·) is a real-valued function with any form σ : R → R. The joint probability of the conditional model is now written as: P (R, Z|G) =
i
P (Zi )
P (Ri,j |Zi , Zj , F (Gi , Gj )),
(12)
i,j
where P (Zi ) is still defined as a stick breaking construction (Equ. 1). The preliminary experiments show promising results, and we will report the further results in future work.
7
Conclusions
This paper presents a nonparametric relational model IHRM for social network modeling and analysis. The IHRM model enables expressive knowledge representation of social networks and allows for flexible probabilistic inference without the need for extensive structural learning. The IHRM model can be applied to community detection, link prediction, and product recommendation. The empirical analysis on social network data showed encouraging results with interpretable clusters and relation prediction. For the future work, we will explore discriminative relational models for better performance. It will also be interesting to perform analysis on more complex relational structures in social network systems, such as domains including hierarchical class structures.
Social Network Mining with Nonparametric Relational Models
95
Acknowledgments This research was supported by the German Federal Ministry of Economy and Technology (BMWi) research program THESEUS, the EU FP7 project LarKC, and the Fraunhofer ATTRACT fellowship STREAM.
References 1. Airoldi, E.M., Blei, D.M., Xing, E.P., Fienberg, S.E.: A latent mixed-membership model for relational data. In: Proc. ACM SIGKDD Workshop on Link Discovery (2005) 2. Aldous, D.: Exchangeability and related topics. In: Ecole d’Ete de Probabilites de Saint-Flour XIII 1983, pp. 1–198. Springer, Heidelberg (1985) 3. Antoniou, G., van Harmelen, F.: A Semantic Web Primer. MIT Press, Cambridge (2004) 4. Aukia, J., Kaski, S., Sinkkonen, J.: Inferring vertex properties from topology in large networks. In: NIPS 2007 workshop on statistical models of networks (2007) 5. Bernard, H., Killworth, P., Sailer, L.: Informant accuracy in social network data iv. Social Networks 2 (1980) 6. Blei, D., Jordan, M.: Variational inference for dp mixtures. Bayesian Analysis 1(1), 121–144 (2005) 7. Breiger, R.L., Boorman, S.A., Arabie, P.: An algorithm for clustering relational data with applications to social network analysis and comparison to multidimensional scaling. Journal of Mathematical Psychology 12 (1975) 8. Dzeroski, S., Lavrac, N. (eds.): Relational Data Mining. Springer, Berlin (2001) 9. Getoor, L., Friedman, N., Koller, D., Pfeffer, A.: Learning probabilistic relational models. In: Dzeroski, S., Lavrac, N. (eds.) Relational Data Mining, Springer, Heidelberg (2001) 10. Getoor, L., Koller, D., Friedman, N.: From instances to classes in probabilistic relational models. In: Proc. ICML 2000 Workshop on Attribute-Value and Relational Learning (2000) 11. Getoor, L., Taskar, B. (eds.): Introduction to Statistical Relational Learning. MIT Press, Cambridge (2007) 12. Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social networks. Journal of the Royal Statistical Society 170 (2007) 13. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. In: Proc. 16th International Joint Conference on Artificial Intelligence (1999) 14. Ishwaran, H., James, L.: Gibbs sampling methods for stick breaking priors. Journal of the American Statistical Association 96(453), 161–173 (2001) 15. Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. In: Proc. 21st Conference on Artificial Intelligence (2006) 16. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group models. In: Proc. 4th international workshop on Multi-relational mining, pp. 49– 55. ACM Press, New York (2005) 17. Raedt, L.D., Kersting, K.: Probabilistic logic learning. SIGKDD Explor. Newsl. 5(1), 31–48 (2003)
96
Z. Xu et al.
18. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: An open architecture for collaborative filtering of netnews. In: Proc. of the ACM 1994 Conference on Computer Supported Cooperative Work, pp. 175–186. ACM, New York (1994) 19. Sampson, F.S.: A Novitiate in a Period of Change: An Experimental and Case Study of Social Relationships. PhD thesis (1968) 20. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender systems–a case study. In: WebKDD Workshop (2000) 21. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Analysis of recommender algorithms for e-commerce. In: Proc. ACM E-Commerce Conference, pp. 158–167. ACM, New York (2000) 22. Sethuraman, J.: A constructive definition of dirichlet priors. Statistica Sinica 4, 639–650 (1994) 23. Wang, X., Mohanty, N., McCallum, A.: Group and topic discovery from relations and text. In: Proc. 3rd international workshop on Link discovery, pp. 28–35. ACM, New York (2005) 24. Xu, Z., Tresp, V., Yu, K., Kriegel, H.-P.: Infinite hidden relational models. In: Proc. 22nd UAI (2006) 25. Xu, Z., Tresp, V., Yu, S., Yu, K.: Nonparametric relational learning for social network analysis. In: Proc. 2nd ACM Workshop on Social Network Mining and Analysis, SNA-KDD 2008 (2008) 26. Yedidia, J., Freeman, W., Weiss, Y.: Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51(7), 2282–2312 (2005)
Using Friendship Ties and Family Circles for Link Prediction Elena Zheleva1 , Lise Getoor1 , Jennifer Golbeck2 , and Ugur Kuter1 1
Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742, USA {elena,getoor,ukuter}@cs.umd.edu 2 College of Information Studies, University of Maryland, College Park, Maryland 20742, USA
[email protected] Abstract. Social networks can capture a variety of relationships among the participants. Both friendship and family ties are commonly studied, but most existing work studies them in isolation. Here, we investigate how these networks can be overlaid, and propose a feature taxonomy for link prediction. We show that when there are tightly-knit family circles in a social network, we can improve the accuracy of link prediction models. This is done by making use of the family circle features based on the likely structural equivalence of family members. We investigated the predictive power of overlaying friendship and family ties on three realworld social networks. Our experiments demonstrate significantly higher prediction accuracy (between 15% and 30% more accurate) compared to using more traditional features such as descriptive node attributes and structural features. The experiments also show that a combination of all three types of attributes results in the best precision-recall trade-off.
1
Introduction
There is a growing interest in social media and in data mining methods which can be used to analyze, support and enhance the effectiveness and utility of social media sites. The analysis methods being developed build on traditional methods from the social network analysis community, extend them to deal with the heterogeneity and growing size of the data being generated and use tools from graph mining, statistical relational learning and methods for information extraction from unstructured and semi-structured text. Traditionally, social network analysis has focused on actors and ties (or relationships) between them, such as friendships or kinships. The two most common types of networks are (1) unimodal networks, where the nodes are actors and the edges represent ties such as friendships, and (2) affiliation networks which can be represented as bipartite graphs, where there are two types of nodes, the actors and organizations, and the edges represent the affiliations between actors and organizations. Most of the existing work has focused on networks that exhibit a single relationship type, either friendship or affiliation. L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 97–113, 2010. c Springer-Verlag Berlin Heidelberg 2010
98
E. Zheleva et al.
In this paper, we investigate the power of combining friendship and affiliation networks. We use the notion of structural equivalence, when two actors are similar based on participating in equivalent relationships, which is fundamental to finding groups in social networks. Our approach is an attempt to bridge approaches based on structural equivalence and community detection, where densely connected groups of actors are clustered together into communities. We show how predictive models, based on descriptive, structural, and community features, perform surprisingly well on challenging link-prediction tasks. We validate our results on a trio of social media websites describing friendships and family relationships. We show that our models are able to predict links accurately, in this case friendship relationships, in held-out test data. This is typically a very challenging prediction problem. With our results, we also hope to motivate further research in discovering closely-knit groups in social networks and using them to improve link-prediction performance. Our link-prediction approach can be applied in a variety of domains. The important properties of the data that we use are that there are actors, links between them and closely-knit groups such as families, housemates or officemates. In some data, groups are given; in other datasets, it may be necessary to first cluster the nodes in a meaningful manner. For example, in email communication networks, such as Enron [1,2], groups could be cliques of people that email each other frequently. In the widely studied co-authorship networks [3,4,5,6,7,8,9], affiliation groups may be cliques of authors that collaborate on many papers together. In these domains, the link-prediction task translates to finding people who are likely to communicate with each other [1] or authors who are likely to collaborate in the future [5,8]. Our contributions include the following: – We propose a general framework for combining social and affiliation networks. – We show how to instantiate it for overlaying friendship and family networks. – We show how features of the overlaid networks can be used to accurately predict friendship relationships. – We validate our results on three social media websites. In Section 2, we describe the link prediction problem that we focus on in this paper, and in Section 3, the social network model. Section 4 addresses the taxonomy of the descriptive, structural, and group features that we used for link prediction in our overlaid networks. We then propose a comparison of our network overlay method with two alternatives in Section 5. We describe experimental results in Section 6, related work in Section 7, the generality of our approach in Section 8, and discuss conclusions and future work in Section 9.
2
Link Prediction Problem
In this paper we study the problem of predicting friendship links in multirelational social networks. This problem is closely related to problems of link
Using Friendship Ties and Family Circles for Link Prediction
99
prediction [4,1,5,8], link completion [10], and anomalous link discovery [1,9] which are covered in more depth in Section 7. Link prediction in social networks is useful for a variety of tasks. The most straightforward use is for making data entry easier – a link-prediction system can propose links, and users can select the friendship links that they would like to include, rather than users having to enter the friendship links manually. Link prediction is also a core component of any system for dynamic network modeling—the dynamic model can predict which actors are likely to gain popularity, and which are likely to become central according to various social network metrics. Link prediction is challenging for a number of reasons. When it is posed as a pair-wise classification problem, one of the fundamental challenges is dealing with the large outcome space; if there are n actors, there are n2 possible relations. In addition, because most social networks are sparsely connected, the prior probability of any link is extremely small, thus we have to contend with a large class skew problem. Furthermore, because the number of links is potentially so large, the number of the negative instances will be huge, so constructing a representative training set is challenging. In our approach to link prediction in multi-relational social networks, we explore the use of both attribute and structural features, and, in particular, we study how group membership (in our case, family membership) can significantly aid in accurate link (here, friendship) prediction.
3
Social Network Model
Social networks describe actors and their relationships. The actors can have properties or attributes such as age and income. Relationships can represent dyadic (binary) relationships or group memberships (cliques or hyperedges); in addition, relationships can be directed, undirected and/or weighted. Here we consider both dyadic and group-membership relationships. Specifically, we consider friendship relationships and family group memberships. In our domain, these are undirected, unweighted relationships. More formally, the networks we consider consist of the following: actors: a set of actors A = {a1 , . . . , an }, and a grouping or partitioning of the actors into non-overlapping groups: groups: a group of individuals connected through a common affiliation. The affiliations group the actors into sets G = {G1 , . . . , Gm }. Affiliation groups can be a partitioning of the actors, or overlapping groups of actors. In this work, families of actors are such affiliation groups. We consider the following relationships: friends: F {ai , aj } denotes that ai is friends with aj , and family: M {ai , Gk } denotes that ai is a member of family Gk .
100
E. Zheleva et al.
Fig. 1. Actors in the same tightly-knit group often exhibit structural equivalence, i.e., they have the same connections to all other nodes. Using the original network (a), and a structural equivalence assumption, one can construct a network with new predicted links (b).
Actors can have attributes; if b is an attribute, then we use ai .b to denote the b attribute of actor ai . We denote the set of friends of actor ai by ai .F , and the set of family members of the same actor as ai .M . Figure 1(a) shows an example network of eight actors and five groups. Each node represents an actor, and a group is shown as a circle around the actors. The thick lines inside a group mark family relationships, and the thin black lines denote friendship relationships. Every actor belongs to at least one group. There are single-member groups, and there are actors without friends.
4
A Feature Taxonomy for Multi-relational Social Networks
We identified three classes of features that describe characteristics of potential links in a multi-relational social network: – Descriptive attributes are attributes inherent to the nodes, and they do not consider the structure of the network. – Structural attributes include characteristics of the networks based on the friendship relationships such as node degree. – Group attributes are based on structural properties of the network when both types of relationships, friendship and family, are considered. The groups in this case are the cliques of family members. Each feature within a class can be assigned to an actor or to a pair of actors (corresponding to a potential edge). The following sections describe our taxonomy of the features in more detail.
Using Friendship Ties and Family Circles for Link Prediction
4.1
101
Descriptive Attributes
The descriptive attributes are attributes of nodes in the social network that do not consider the link structure of the network. These features vary across domains. They provide semantic insight into the inherent properties of each node in a social network, or compare the values of the same inherent attributes for a pair of nodes. We define two classes of descriptive attributes for multi-relational social networks: 1. Actor features. These are inherent characteristics of an actor. 2. Actor-pair features. The actor-pair features compare the values of the same node attribute for a pair of nodes. 4.2
Structural Features
The next set of features that we introduce describe features of network structure. The first is a structural features for a single node, ai , while the remaining describe structural attributes of pairs of nodes, ai and aj . 1. Actor features. These features describe the link structure around a node. Number of friends. The degree, or number of friends, of an actor ai : |ai .F |. 2. Actor-pair features. These features describe how interconnected two nodes are. They measure the sets of friends that two actors have ai .F and aj .F . Number of common friends. The number of friends that the pair of nodes have in common in the network: |ai .F ∩ aj .F |. Jaccard coefficient of the friend sets. The Jaccard coefficient over the friend sets of two actors describes the ratio of the number of their common friends to their total number of friends: Jaccard(ai , aj ) =
|ai .F ∩aj .F | |ai .F ∪aj .F | .
The Jaccard coefficient is a standard metric for measuring the similarity of two sets. Unlike the feature number of common friends, it considers the size of the friendship circle of each actor. Density of common friends. For the set of common friends, the density is the number of friendship links between the common friends over the number of all possible friendship links in the set. The density of common friends of two nodes describes the strength in the community of common friends. Density is also known as clustering coefficient. 4.3
Group Features
The third category of features that we consider are based on group membership; in the networks we studied, the groups are families. These are the features that overlay friendship and affiliation networks.
102
E. Zheleva et al.
1. Actor features. These are features that describe the groups to which an actor belongs. Family Size. This is the simplest attribute and describes the size of an actor’s family: |ai .M |. 2. Actor-pair features. There are two types of features for modeling these interfamily relations based on the overlapping friend and family sets of two actors ai .F and aj .M : Number of friends in the family. The first feature describes the number of friends ai has in the family of aj : |ai .F ∩ aj .M |. This feature allows one to reason about the relationship between an actor and a group of other actors, where the latter is semantically defined over the network through the family relations. Portion of friends in the family. The second feature on inter-family relations describes the ratio between the number of friends that ai has in aj ’s family (the same as the above feature) and the size of aj ’s family. The rationale behind this feature is that the higher this ratio is, the more likely it is that aj is close to ai in the network since more of its family members are friends with ai . The idea behind the group features is based on the notion of structural equivalence of nodes within a group. Two nodes are structurally equivalent if they have the same links to all other actors. If we can detect tightly-knit groups in a social network and we assume that the nodes in each group are likely to behave similarly, then new links can be predicted by projecting links such that the nodes in the group become structurally equivalent. In our networks, such groups are the family cliques. In a weighted graph, a tight group could map to a clique of nodes with highly-weighted edges. Figure 1 shows an example of how a structural equivalence assumption can help in predicting new links. For example, if one of the actors from Group A is friends with an actor from Group B, as shown on the original network (a), then it may be more likely that there is a link between the other actor from Group A and the actor from Group B, shown as a dashed line in (b).
5
Alternative Network Representations
The traditional approach to studying networks is to treat all relationships as equal. In the previous section, we described overlaying networks with different link types in a way that distinguishes between these types, and uses information about affiliation groups. In other words, our link-prediction approach uses information about the actors A, the groups G, the friendship relationships F , and the family relationships M . We call our representation different-link and affiliation overlay. Therefore, a logical question one may ask is what the benefits of treating links as different are, and whether affiliation groups really make a difference in link prediction. Our claim is that affiliations are important and that they can have a predictive value. To illustrate the benefit of our approach as compared to the traditional one, we compare it to two alternative representations of the network.
Using Friendship Ties and Family Circles for Link Prediction
103
In the first alternative representation, which we call same-link and no affiliation overlay, the family and friendship links are treated the same, and affiliation groups are not given. More formally, in this representation, the graph consists of these components: actors A, and a set of edges to which we refer as implied friendships Fimplied = F ∪ M . We can compute the descriptive and structural features in this alternative overlay, and use them for link prediction. In our experiments, we investigate whether this alternative overlay can offer the same or better link-prediction accuracy as the different-link and affiliation overlay. Even if the first alternative overlay does not offer better accuracy, we still need to check whether the predictive value of the different-link and affiliation overlay comes from treating the links as different or from the fact that we are given the affiliation groups. To investigate that, we look at a second alternative overlay, the same-link and affiliation overlay, in which the family and friendship links are treated the same, and affiliation groups are given. In this overlay, the graph consists of these components: actors A, groups G, and implied friendships Fimplied . We can compute all classes of features in this alternative overlay, and use them for link prediction.
6
Experimental Evaluation
6.1
Social Media Data Sets
This research is based upon using networks that have two sets of connections: friendship links and family ties. We performed our experiments on three novel datasets describing petworks: Dogster, Catster, and Hamsterster1 . On these sites, profiles include photos, personal information, characteristics, as well as membership in community groups. Members also maintain links to friends and family members. As of February 2007, Dogster has approximately 375,000 members. Catster is based on the same platform as Dogster and contains about 150,000 members. Hamsterster has a different platform, but it contains similar information about its members. It is much smaller than Dogster and Catster about 2,000 members. These sites are the only three of the hundreds we visited that publicly share both family and friendship connections2 . However, these are networks where both types of connections are realistic and representative of what we would expect to see in other social networks if they collected this data. The family connections are representative of real life, since family links are only made between profiles of pets created by the same owner. The friendship linking behavior is in line with patterns seen in other social networks [11]. 1. Actor features: Breed. This is the pet breed such as golden retriever or chihuahua. A pet can have more than one breed value. 1 2
At http://www.dogster.com, http://www.catster.com, and http://www.hamsterster.com. For a full list, see http://trust.mindswap.org/SocialNetworks
104
E. Zheleva et al.
Fig. 2. Sample profile on Dogster which includes family and friends
Breed category. Each breed belongs to a broader category set. For example in Dogster, the major breed categories we identified are working, herding, terrier, toy, sporting, non-sporting, hound, and other, a catchall for the other breeds that appear in a the site, but not as frequently as the previous ones. When a dog has multiple breeds, its breed category is mixed. Single Breed. This boolean feature describes whether a pet has a single breed or whether it has multiple breed characteristics. Purebred. This is a boolean feature which specifies whether a dog owner considers its pet to be purebred or not. 2. Actor-pair features. All of the above features describe characteristics of a single user in the network. Same breed. This boolean feature is true if two profiles have at least one common breed. 6.2
Data Description
We have obtained a random sample of 10, 000 profiles each from Dogster and Catster, and all 2059 profiles registered with Hamsterster. Each instance in the test data contained the features for a pair of profiles where some of the features were individual node features. To construct the test data, we chose the pairs of
Using Friendship Ties and Family Circles for Link Prediction
105
nodes for which there was an existing friendship link, and we sampled from the space of node pairs which did not have a link. We computed the descriptive, structural and group features for each of the profiles. For each pair of profiles in the test data, we computed the features from the three classes described in Section 4. A test instance for a pair of profiles ai and aj includes both the individual actor features and the actor-pair features. It has the form < ai features, aj features, (ai , aj )-pair features, class > where class is the binary class which denotes whether a friendship link exists between the actors. For Dogster, the sample of 10,000 dogs had around 17,000 links among themselves, and we sample from the non-existing links at a 10:1 ratio (i.e., the nonexisting links are 10 times more than the existing links). For Catster, the 10,000 cats had 43,000 links, and for the whole Hamsterster dataset, the number of links was around 22,000. We sampled from the non-existing links in these datasets at the same 10:1 ratio. 6.3
Experimental setup
We used three well-known classifiers, namely Na¨ıve Bayes, logistic regression and decision trees for our experiments. The goal was to perform binary classification on the test instances and predict friendship links. The implementations of these classifiers were from the latest version of Weka (v3.4.12) from http://www.cs. waikato.ac.nz/ml/weka/. We allocated a maximum of 2GB of memory for each classifier we ran. We measured prediction accuracy by computing precision, recall, and their harmonic mean, F1 score, using 10-fold cross-validation. 6.4
Link-Prediction Results
We report only on the results from decision-tree classification because it consistently had the highest accuracy among the three classifiers. Table 1 summarizes our results. Adding group features to the descriptive and structural features increased accuracy by 15% to 30%. We discuss the results in more detail in the subsequent subsections. Table 1. Comparison of F1 values in the three datasets, with the feature types from our taxonomy Feature Type Dogster Catster Hamsterster Descriptive 37.6% 0.4% 19.8% Structural 76.1% 83.1% 59.9% Group 90.8% 95.2% 89.2% Descriptive and structural 78.6% 83.0% 60.3% Descriptive, structural, and group 94.8% 97.9% 90.5%
106
E. Zheleva et al.
Fig. 3. a) Recall, precision, and F1 score for Dogster using descriptive and structural attributes; b) F1 score across datasets. Using descriptive attributes together with structural attributes leads to a better F1 score in Dogster but not in Catster and Hamsterster.
Descriptive attributes can be useful in combination with structural attributes. In these experiments, we have investigated the predictive power of the simplest features, i.e., the descriptive attributes versus the impact of the structural attributes. Figure 3 shows the accuracy results from the decision-tree classifier. When we use only descriptive attributes, the link-prediction accuracy varies across datasets. In Dogster, there is some advantage to using descriptive attributes, yet the accuracy (F1 score) is relatively low (37.6%). In Catster and Hamsterster, building the complete decision trees led to 0.4% and 19.8% accuracy, respectively (using Weka’s default pruning parameter, the trees were empty, and the accuracies were 0%). This confirms that, in general, link prediction is a challenging prediction task. When we used the structural features (such as number of friends that two profiles share), the link-prediction accuracy increased to 76.1% in in Dogster. This suggests that the structural features are much more predictive than simple descriptive attributes. This effect was even more pronounced for Catster and Hamsterster. In Dogster, combining the node attributes and the structural features leads to futher improvement. Using descriptive attributes together with structural attributes leads to a better F1 score (78.6%) as compared to using either category alone (37.6% and 76.1%, respectively) in Dogster, as shown in Figure 3. For Catster and Hamsterster, the difference was less than 0.4%. Family group features are highly predictive. As the previous experiments showed, structural attributes are stronger predictors than the descriptive attributes alone. Next, we investigate the predictive power of the group features in our taxonomy. In Dogster, Catster and Hamsterster, the group features involve the families and friends of the users. Figure 4 shows our comparisons. Our results suggested that family groups are strong predictors for friendship links (F 1 = 90.8% for Dogster). We also ran experiments where we used not only family cliques, but also the structural and descriptive features. In these experiments, the results show that the accuracy (F1) improves by 4% in Dogster, 0.6% in Catster and 1.3% Hamsterster.
Using Friendship Ties and Family Circles for Link Prediction
107
Fig. 4. Link-prediction accuracy using all feature classes: descriptive, structural and group features. a) Recall, precision, and F1 score for Dogster; b) F1 score across datasets. Group features are highly predictive, yet adding the other features provided benefit too.
Computing more expensive structural attributes is not highly beneficial. Some structural features in our taxonomy were more computationally expensive to construct than others. For example, the feature that described the number of friends is easy to compute, whereas the feature that described the density of common friends for each pair of profiles is the hardest. Using a database, computing density of common friends for all pairs of profiles requires several joins of large tables. In order to investigate the trade-off between computing expensive features and their predictive impact on our results, we have performed the following experiments. We have designed experiments in which we add more expensive structural features one by one, and assess the link-prediction accuracy at each step. We used the following combinations of features: (1) using number of friends only, (2) using number of friends and number of common friends, (3) using number of friends, number of common friends and jaccard coefficient, and finally (4) using number of friends, number of common friends, jaccard coefficient and density of common friends. We are reporting on the results of these four sets of structural features together with the descriptive attributes since we showed in the previous subsection that using descriptive attributes can sometimes be beneficial. We also report on the setting in which group features were used. Surprisingly, it turned out that computing the more expensive features added very little benefit. Figure 5 shows the results of the experiments. For example, in the Dogster case, adding the number of common friends of two nodes improved accuracy (F1 score) by 2% over the individual number of friends. Computing the most expensive feature density of common friends pays off slightly (improves F1 score by 0.4%) only when there are no group attributes. Computing the more expensive jaccard coefficient did not pay off over using the simpler feature number of common friends. In the Catster and Hamsterster cases, the improvement was less that 0.5%. Our results also support the claim made in the preferential attachment model [3] that the number of friends of a node (node degree) plays a role in the process of new nodes linking to it. They contradict the link-prediction results in co-authorship networks [5] where jaccard coefficient and the number
108
E. Zheleva et al.
Fig. 5. Link-prediction accuracy using structural features of increasing computation cost (number of friends, number of common friends, jaccard coefficient of common friends, density of common friends). Computing more expensive structural attributes is not highly beneficial, especially in the presence of group information.
of common friends consistently out-performed the metric based on number of friends. This may be inherent to the types of networks discussed. Alternative network representations. In the next set of experiments, we used the alternative network overlays to test whether there was an advantage to keeping the different types of links and the affiliation groups. We compare our proposed different-link and affiliation overlay to the alternative representations same-link and no affiliation overlay and same-link and affiliation overlay (see Section 5). We compute only the descriptive and structural features in the overlay with no affiliation information, and compute all classes of features in the overlays where affiliation information was given. The results on Figure 6 show that when family affiliations were given, it did not matter whether the links were treated as the same type or different
Using Friendship Ties and Family Circles for Link Prediction
109
Fig. 6. Prediction accuracy when links are treated equal, with and without group affiliations. As the results from the affiliation overlays suggest, group features are the main contributor to the high link-prediction accuracy.
types: the link-prediction accuracy was the same. However, in the case when the affiliations were not given, it was better to compute the structural features using both types of relationships but treat them as one type. When family links were treated as friendship links, the accuracy of the predictions made by the structural attributes improved by 6% to 20%. This may be due to the fact that the overlap between friends and family links in the data was very small, and using both types of links when computing the structural features was beneficial. Using the affiliation information and computing all features on the data led to the best accuracy, and the accuracy was the same both in the different-link and same-link cases. These experiments also confirmed the previous results: group affiliation was the main contributor to the high link-prediction accuracy.
7
Related Work
In general, link-prediction algorithms process a set of features in order to learn and predict whether it is likely that two nodes in the data are linked. Sometimes, these features are hand-constructed by analyzing the problem domain, the attributes of the actors, and the relational structure around those actors [12,4,5,9]. Other times, they are automatically generated, i.e., the prediction algorithm first learns the best features to use and then predicts new links [8]. In this section, we discuss the existing work that is most relevant to the link-prediction problem in multi-relational social networks. The link-prediction techniques that are based on feature-construction are closest to our work [12,4,1,5,9]. As most of the relational domains can be represented as a network model, the constructed features not only include the attributes of the actors, but also the characteristics of the structure. Most of this work examines co-authorship and citation networks [4,5,8,9] whereas we validate our method using online social networks. Some of the approaches use machine learning techniques for classification [4,13,8,14], and others rely on ranking the feature values [12,5,9].
110
E. Zheleva et al.
Adamic and Adar [12] use a similarity-based approach to predicting friendships amongst students. They gather data from university student websites and mailing lists, and construct a vector of features for each student such as website text, in-links, out-links, and mailing lists the students belong to. Their approach uses descriptive features whereas ours also considers structural and group features. It has been shown that there is ”useful information contained in the network topology alone” [5]. Liben-Nowell and Kleinberg use a variety of structural features such as shortest path, (a variant of) number of friends, number of common friends, jaccard coefficient, and more elaborate structural features based on all paths between two nodes in co-authorship networks. Their experiments compare the link-prediction accuracy of each feature in isolation. They rank the nodepairs by each feature value and pick the top pairs as their predicted links. Their results suggest that simple features such as number of common friends coefficient perform well compared to others. In our work, besides structural features, we also constructed descriptive and group features, and instead of using the features in isolation, we combined them. Rattigan and Jensen [9] recognize that the extremely large class skew associated with the link-prediction task makes it very challenging. They look at a related problem, anomalous link discovery, in which instead of discovering new links, they are interested in learning properties of the existing links. They use structural features in co-authorship networks and rank the most and least likely collaborations based on an expensive structural feature, the Katz score. Another work that uses link prediction for anomaly discovery is the work of Huang and Zeng [1], in which they rank anomalous emails in the Enron dataset. The work described so far uses descriptive and structural attributes in isolation. Hasan et al. [4] use both. Their work studies classification for link prediction based on hand-constructed features in co-authorship networks. They report prediction accuracy (F score), precision, and recall results from a range of classifiers such as decision trees, k-nearest neighbor, multilayer perceptron, and supportvector machines. The novelty in our work compared to theirs is that we study link prediction in richer social network settings and we explore the use of group features and alternate representations. The link-prediction problem has also been studied in the domain of citation networks for scientific publications [8]. The authors posed the link-prediction problem as a binary classification problem, and used logistic regression to solve it. Their features are database queries such as most cited author, and thus they are similar to both the descriptive and structural features we have discussed so far. Their work describes a statistical learning approach for feature generation. In particular, it extends the traditional Inductive Logic Programming (ILP) to reason about probabilities, and uses this extension to learn new features from the problem domain both statistically and inductively. The experiments in this work suggest that the ratio of existing to non-existing links in the test data mattered, and the fewer non-existing link examples were included, the better the precisionrecall curve was. However, testing with more non-existing link examples would give a better estimate of the probability of a randomly picked pair of nodes in the
Using Friendship Ties and Family Circles for Link Prediction
111
network to be classified correctly. Another statistical learning approach to link prediction was presented by Taskar et al. [14]. The authors use relational Markov networks to define a probabilistic model over the entire link graph. Their features are both descriptive and relational. They apply their method to two domains: linked university websites and a student online social network. Another automated feature-generation method has been presented by Kubica et al. [13] who described a learning method for the task of friend identification which is similar to anomalous link discovery. Their method, called cGraph, learns an approximate graph model of the actual underlying link data given noisy linked data. Then, this graph model is used to predict pairwise friendship information regarding the nodes in the network. The types of features that they use are descriptive, structural and group. The difference with our work is that we use these features for link prediction rather than ranking the links. Link completion is a problem related to link prediction. Given the arity of a relationship and all but one entity participating in it, the goal is to predict the missing entity, as opposed to classifying the missing link itself. Goldenberg et al. [10] present a comparison of several classification algorithms such as Naive Bayes, Bayesian networks, cGraph (mentioned above), logistic regression, and nearest neighbor. This study used several real-world datasets from different domains, including co-authorship networks, and data collected from the Internet Movie Database site. It suggested that logistic regression performs well in general in the datasets above; in our study on real-world social networks, logistic regression usually performed worse than decision-tree classifier in terms of accuracy. There has also been interest in learning group features in social networks. Kubica et al. [15] describe a group-detection algorithm that uses descriptive features and links. First, they perform clustering based on the descriptive features (clustering) and find the groups. They allow group overlap and assume that group memberships are conditionally independent of each other given descriptive features. Then, their algorithm assigns a probability of a link between two actors based on the similarity of their groups, and it can answer ranking queries similar to the ones in the anomalous link discovery work. One of the issues with the proposed algorithm is that it is slow [13]. The work of Friedland and Jensen [16] studies the problem of identifying groups of actors in a social network that exhibit a common behavior over time. The authors focused on networks of employees in large organizations, and investigated the employee histories to identify the employees who worked together intentionally from those who simply shared frequently occurring employment patterns in the industry.
8
Discussion
When studying other large social networks, family information is not always relevant or available. However, groups and affiliations are often available, or communities can be discovered. The networks used here had a binary relationships - friend or family - but a similar effect can be achieved in networks where relationships are weighted. For example,
112
E. Zheleva et al.
co-authorship networks are widely studied as social networks [3,4,5,6,7,8,9], and edges can be weighted by the number of articles a pair of authors have authored together. In email communication networks - the Enron email corpus [1,2], for example - the number of messages between two senders can be used as a weight. To mimic the strong family-type relationship we used in this article, a threshold weight can be set. Any edge with a weight over that threshold can be treated as a “strong” relationship (like our family relationship). Clusters of nodes connected with strong ties would represent the equivalent of a family unit.
9
Conclusions and Future Work
Link prediction is a notoriously difficult problem. In this research, we found that overlaying friendship and affiliation networks was very effective. For the networks used in our study, we found that family relationships were very useful in predicting friendship links. Our experiments show that we can achieve significantly higher prediction accuracy (between 15% and 30% more accurate) as compared to using more traditional features such as descriptive node attributes and structural features. Family groups helped not only because they represent a clique of actors, but because the family relationship itself was indicative of structural equivalence. As future work, we plan to investigate the use of edge weights and thresholds to define strongly connected clusters, and see if it works as well in link prediction as the family groups did here.
Acknowledgments This work was partially supported by NSF under Grants No. 0746930 and No. 0423845.
References 1. Huang, Z., Zeng, D.: A Link Prediction Approach to Anomalous Email Detection. In: IEEE International Conference on Systems, Man, and Cybernetics (2006) 2. Klimt, B., Yang, Y.: The enron corpus: A new dataset for email classification research. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004) 3. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. PHYSICA A 311, 3 (2002) 4. Hasan, M., Chaoji, V., Salem, S., Zaki, M.: Link Prediction using Supervised Learning. In: Proceedings of the Workshop on Link Analysis, Counter-terrorism and Security (with SIAM Data Mining Conference) (2006) 5. Liben-Nowell, D., Kleinberg, J.: The Link Prediction Problem for Social Networks. In: Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM) (2003) 6. Newman, M.: Who is the best connected scientist? a study of scientific coauthorship networks. Working Papers 00-12-064, Santa Fe Institute (December 2000), http://ideas.repec.org/p/wop/safiwp/00-12-064.html
Using Friendship Ties and Family Circles for Link Prediction
113
7. Newman, M.: Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences 101, 5200–5205 (2004) 8. Popescul, A., Ungar, L.H.: Statistical relational learning for link prediction. In: Proceedings of the Workshop on Learning Statistical Models from Relational Data at IJCAI 2003 (2003) 9. Rattigan, M.J., Jensen, D.: The case for anomalous link discovery. SIGKDD Explor. Newsl. 7(2), 41–47 (2005) 10. Goldenberg, A., Kubica, J., Komarek, P., Moore, A., Schneider, J.: A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion. In: KDD Workshop on Link Analysis for Detecting Complex Behavior (2003) 11. Golbeck, J.: The dynamics of web-based social networks: Membership, relationships, and change. First Monday 12 (2007) 12. Adamic, L., Adar, E.: Friends and neighbors on the web. Social Networks 25(3), 211–230 (2003) 13. Kubica, J.M., Moore, A., Cohn, D., Schneider, J.: cGraph: A Fast Graph-Based Method for Link Analysis and Queries. In: Proceedings of the 2003 IJCAI TextMining & Link-Analysis Workshop (2003) 14. Taskar, B., Wong, M.F., Abbeel, P., Koller, D.: Link Prediction in Relational Data. In: Advances in Neural Information Processing Systems, NIPS 2003 (2003) 15. Kubica, J., Moore, A., Schneider, J., Yang, Y.: Stochastic Link and Group Detection. In: Proceedings of the Eighteenth National Conference on Artificial Intelligence, AAAI 2002 (2002) 16. Friedland, J., Jensen, D.: Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns. In: Proceedings of Knowledge Discovery and Data Mining (KDD 2007) (2007)
Information Theoretic Criteria for Community Detection L. Karl Branting The MITRE Corporation, 7525 Colshire Drive McLean, VA 22102, USA
[email protected] Abstract. Many algorithms for finding community structure in graphs search for a partition that maximizes modularity. However, recent work has identified two important limitations of modularity as a community quality criterion: a resolution limit; and a bias towards finding equal-sized communities. Information-theoretic approaches that search for partitions that minimize description length are a recent alternative to modularity. This paper shows that two information-theoretic algorithms are themselves subject to a resolution limit, identifies the component of each approach that is responsible for the resolution limit, proposes a variant, SGE (Sparse Graph Encoding), that addresses this limitation, and demonstrates on three artificial data sets that (1) SGE does not exhibit a resolution limit on sparse graphs in which other approaches do, and that (2) modularity and the compression-based algorithms, including SGE, behave similarly on graphs not subject to the resolution limit.
1
Introduction
Many complex networks, such as the Internet, metabolic pathways, and social networks, are characterized by a community structure that groups related vertices together. Traditional clustering techniques group vertices based on some metric for attribute similarity [2]. More recent research has focused on detection of community structure from graph topology. Under this approach, the input to a community-detection algorithm is a graph in which vertices correspond to individuals (e.g., URLs, molecules, or people) and edges correspond to relationships (e.g., hyperlinks, chemical reactions, or marital and business ties). The output consists of a partition of the graph in which subgraphs correspond to meaningful groupings (e.g., web communities, families of molecules, or social clans).1 Community detection algorithms can be viewed as comprising two components: a utility function that expresses the quality of any given partition of a 1
Some communities, such as social clubs and families, can overlap. Membership in such communities is better modeled as attributes of vertices rather than through a partition of the graph [3]. The focus of this paper, however, as in the bulk of community detection research, is on partition-based community structure.
L. Giles et al. (Eds.): SNAKDD 2008, LNCS 5498, pp. 114–130, 2010. c Springer-Verlag Berlin Heidelberg 2010
Information Theoretic Criteria for Community Detection
115
Table 1. Utility functions and search strategies for various community-detection algorithms. DHC represents divisive hierarchical clustering, ADHH represents agglomerative hierarchical clustering, and MDL represents “Minimum Description length.” utility function modularity modularity modularity modularity modularity log-likelihood MDL MDL
search strategy DHC/betweenness centrality AHC Genetic Algorithm DHC/network structure index AHC/spectral division fixed-point iteration simulated annealing iterated hill climbing
algorithm Newman & Girvan (2004) [11] Newman (2004) [1] Tasgin & Bingol (2006) [12] Rattigan et al. (2007) [13] Donetti & Munoz (2004) [14] Zhang et al. (2007) [15] Rosvall & Bergstrom (2007) [16] Chakrabarti (2004) [8]
graph; and a search strategy that specifies a procedure for finding a partition that optimizes the utility function. Table 1 sets forth utility functions and search strategies of eight recent community-detection algorithms, showing that utility functions have been paired with a variety of different search strategies. The utility function most prevalent in recent community detection research is the modularity function introduced in [1]: Q= (w(Dii )/l − (li /l)2 ) (1) 1 0 The encoding schemes used in RB and AP are as follows: 1. Bits needed to represent the number of vertices in the graph. Since this term doesn’t vary with differing community structure, it is irrelevant to the choice between different community structures and can be ignored. 2. Bits needed to represent the number of groups. – RB. Not explicitly represented. – AP. log∗ (m). log∗ (x) = log2 (x) + log2 log2 (x) + ... where only positive terms are included in the sum. This series is apparently intended to represent the mean coding length of integers given that the probability of an integer of a given length is a monotonically decreasing function of the integer’s length, i.e., longer integers are less probable, but no maximum length is known [17]. 3. Bits needed to represent the association between vertices and groups – RB. n log(m). The rationale appears to be that for each of the n vertices, log(m) bits are needed to identify the group to which the vertex belongs. – AP. If the groups are placed in decreasing order of length, i.e., a1 ≥ a2 ≥ ... ≥ am ≥ 1, m−1 log(ai ) i=1
m where ai = ( t=1 at ) − m + i. 4. Bits needed for the group adjacency matrix, i.e., the number of edges between pairs of groups. – RB. 12 m(m+1) log(l). The first term ( 12 m(m+1)) represents the number of pairs of groups, and the second term (log(l)) the number of bits needed to specify the number of edges between any pair of groups. – AP. log(ai aj + 1) 10
If the cost of representing the group adjacent matrix is calculated as expression 2 + expression 3, the cost will grow with the number of connected pairs rather than with the total number of pairs. SGE employs several additional minor modifications to further reduce the description length. The entire calculation is as follows: 1. Bits needed to represent the number of vertices in the graph. As with RB and AP, these bits are ignored. 2. Bits needed to represent the number of groups. The log* function of [17] used in AP is predicated on the assumption that no maximum integer size is known a priori. Here, however, the maximum number of groups is bounded by both the machine word size and the virtual memory size of the machine on which the algorithm is executed. Therefore, SGE uses instead RB’s calculation: log(m) 3. Bits needed to represent the association between vertices and groups. No group can contain more than n − m + l vertices (since each group must have at least one vertex). Accordingly, the following expression contains sufficient bits to represent the number of vertices in all m groups: m log(n − m + 1) 4. Bits needed for the group adjacency matrix, i.e., the number of edges between pairs of groups. As discussed above, the number of bits is: H (B) + log(ai aj ) + log(ai (aj − 1)) i=j∧w(Dij )>0
i=j∧w(Dij )>0
Information Theoretic Criteria for Community Detection
121
5. Bits needed to represent the full adjacency matrix for vertices given the group structure represented in terms 2-4. This consists, for every pair of groups i and j, of size of the i, j adjacency matrix, ai aj , times the entropy per entry in the corresponding binary matrix, H(Dij ). This is equivalent to the AP calculation, shown above: m m
ai aj H(Dij )
i=1 j=1
In summary, the relationship between SGE, RB, and AP is as follows: 1. Bits needed to represent the number of vertices in the graph. Ignored as in RB and AP. 2. Bits needed to represent the number of groups. Follows RB. 3. Bits needed to represent the association between vertices and groups. Uses an expression with fewer bits than that used in RB, and that is simpler than that used in AP. 4. Bits needed for the group adjacency matrix. The primary novelty of SGE, in that for sparse adjacency matrices this term grows more slowly than the square of the number of groups. 5. Bits needed to represent the full adjacency matrix for vertices. Follows AP.
4
Empirical Evaluation
The previous section suggested that a graph encoding in which the calculation of the number bits required to represent a group adjacency matrix was reduced from an expression that grows as the square of the number of groups, as in RB and AP, to an expression that grows in proportion to the number of pairs of connected groups, as in SGE, would reduce or eliminate any resolution limit in sparsely connected graphs. This hypothesis was tested by comparing the communities found by optimizing RB, AP, SGE, and modularity on three different artificial data sets. To avoid conflating the effect of a utility function with the behavior of a search strategy, it was necessary to compare alternative utility functions using a single common search strategy. Accordingly, a single search function was applied to all for utility functions in the experimental evaluation: the greedy divisive hierarchical clustering algorithm of Newman & Girvan (2004) [11]. In the Newman & Girvan procedure, the edge with the highest betweenness centrality is iteratively removed, and the partition in the resulting sequence having the optimal value under the utility function was returned as the community structure. Using a single search strategy removes the potentially confounding disparity of the search algorithms used in published descriptions of RB, AP, and modularity. 4.1
Evaluation Criteria
Various objective functions have been proposed for evaluating the quality of a proposed community structure given the actual correct community structure,
122
L. Karl Branting
including the Rand index [23], the adjusted Rand index [24], and f-measure. There is no consensus regarding the most informative objective function. In this evaluation, f-measure was selected since its use in information retrieval has made it familiar to a wide range of researchers. The intuition underlying the use of f-measure is that group structure can be expressed as a relation c(G) = {vi , vj | ∃g ∈ G ∧ vi , vj ∈ g}, that is, the community structure can be represented by specifying for each pair of vertices whether that pair is in the same group. The similarity between the proposed group structure and the actual group structure can be evaluated by comparing c(proposed) with c(actual). One way to make the comparison is to view each pair in c(proposed) that is also in c(actual) as a true positive, whereas each pair in c(proposed) that is not in c(actual) is a false positive. Under this view, recall and precision can be defined as follows: – Recall =
|c(proposed)|
– Precision =
|c(actual)|
|c(actual)| |c(proposed)| |c(actual)| |c(proposed)|
F-measure is the harmonic mean of recall and precision: – F-measure =
4.2
2 ∗ recall ∗ precision recall + precision
Experimental Procedure
Three experiments were performed, each with a different type of artificial graph. The first, ring graphs, are characterized by the sparsity of connections between groups observed in many large-scale real-world graphs [20]. The second, uniform random graphs, has been used in a number of evaluations of community-detection algorithms. The third, embedded Barabasi-Albert Graphs, consists of communities generated by preferential attachment [20] embedded in a random graph. Fifty trials were performed under each experimental condition for uniform random and EBA graphs. There is no randomness in the construction of ring graphs, so a single trial was sufficient. Experiment 1: Ring graphs. Ring graph Rm,c comprises m communities, each consisting of a ring of c vertices, connected to two other communities, each by a single link, such that all communities are connected. Ring graphs are similar to the clique rings of [4] but differ in that the individual communities are themselves rings rather than cliques. For example, Figure 1 depicts ring graph R15,4 . The evaluation compared RB, AP, SGE, and modularity on 91 ring graphs for which m, c ∈ {4 . . . 16} × {3 . . . 9}.3 Strikingly different behavior was observed 3
Note that for m, c > 3 ring graphs contain no triangles. Therefore, community detection techniques based on clustering coefficient, e.g., [18], are ineffective for finding communities in such ring graphs.
Information Theoretic Criteria for Community Detection
123
among the four community-structure utility functions. Optimizing SGE led to the correct partitions in all but two ring graphs, but RB and AP found no correct partitions. Optimizing modularity led to correct partitions only for those graphs below the resolution threshold identified by [4]. – SGE. The partition having the optimal (lowest) SGE had the correct partition (i.e., no separate communities were conflated) in every graph except for R4,3 and R13,3 In other words, the correct community structure was found in 89 out of 91 ring graphs. – RB and AP. No community structure was found by optimizing either RB or AP. The partition having the optimal (lowest) value for RB and AP contained at least one pair of communities that were grouped together in every ring graph tested. – Modularity. Optimizing modularity led to incorrect community structure for rings of more than 8 triangles, more than 10 squares, more than 11 pentagons, or more that 13 hexagons or heptagons. In other words, the correct partitions were obtained with modularity only for rings and communities of the following sizes:
Fig. 2. A uniform random graph with 32 vertices, 4 groups, size ratio 1.25, and io ratio 0.67
124
L. Karl Branting
• • • • • • •
R4,3 − R8,3 R4,4 − R10,4 R4,5 − R11,5 R4,6 − R13,6 R4,7 − R13,7 R4,8 − R16,8 R4,9 − R16,9
This evaluation confirmed empirically the existence of the resolution limit for modularity derived formally in [4]. The evaluation also showed the surprising result that optimizing RB and AP leads to even more conflation of distinct communities than does modularity. The observation that optimizing SGE led to the correct community structure provides confirmation for the hypothesis that the conflation of communities in RB and AP arises from term 4, which uses more bits than necessary to represent the number of edges connecting groups in sparse graphs. Substituting rings of cliques for rings of graphs that are themselves rings leads to almost identical results to those described here.
Fig. 3. An Embedded Barabasi-Albert (EBA) graph with 4 communities, each with 5 initial vertices per community, 3 new edges per time step, 10 time steps, and 25 singleton-group edges
Information Theoretic Criteria for Community Detection
125
Experiment 2: Uniform random graphs. A common data set for testing community-extraction algorithms consists of random networks of 128 vertices divided into 4 communities with average vertex degree of 16 [11,16,19]. In this experiment, the relative size of the communities was controlled by a size ratio | parameter s such that if the communities were placed in ascending order, |a|ai+1 = i| s, where ai is the ith communities. The connections among the vertices were determined by the average vertex degree d and in/out ratio i such that the average number of within-community edges incident to each vertex was i ∗ d and the average number of cross-community edges incident to each vertex was (1 − i) ∗ d. For example, Figure 2 shows a uniform random graph with s = 1.25 and i = 0.6. Tests were performed for each combination of n = 32, m = 4, d = 6, s ∈ {1.0, 1.25, 1.5, 1.75, 2.0}, and i ∈ {0.6, 0.75, 0.9}. Figures 4, 5, and 6 show the results of the 4 algorithms on uniform graphs for i ∈ {0.6, 0.75, 0.9} respectively. For i ∈ {0.75, 0.9}, in which the community structure is relatively distinct, all four algorithms led to similar results except when the size ratio s was equal to 2.0 (i.e., the sizes of the groups were highly skewed). Under these circumstances, modularity led to much lower f-measure than the other algorithms. When i was equal to 0.6 (i.e., the community structure was relatively unclear) modularity was best and AP worst for low size ratio, and RB and AP were best for high size ratio. These results are consistent with [7], which showed better performance for RB than modularity for skewed community sizes, but comparable performance when community sizes were equal.
Fig. 4. F-measure for uniform random graphs with i=0.6 (weak community structure)
126
L. Karl Branting
Fig. 5. F-measure for uniform random graphs with i=0.75 (moderate community structure)
Fig. 6. F-measure for uniform random graphs with i=0.9 (strong community structure)
Information Theoretic Criteria for Community Detection
127
Experiment 3: Embedded Barabasi-Albert Graphs. A wide range of naturally occurring graphs, including those mentioned in the introduction (the Internet, biochemical pathways, social networks) exhibit a power-law degree distribution that is not present in uniform random graphs [20,21,22]. However, few such “scale-free” graphs are annotated with correct community structure. The third data set consisted of communities with scale-free structure embedded in a sparse random graph. Each graph consists of m communities generated by the Jung 1.74 implementation of the Barabasi-Albert preferential attachment algorithm, each starting with i initial vertices in each community, with e new edges per time step following the preferential attachment rule of [20] for each of t time steps, together with c singleton-group vertices. The singleton-group vertices were connected to 1. . . e vertices randomly selected from the entire graph, i.e., including both community and singleton-group vertices. The graphs used tor testing had 4 communities, 4 initial vertices per community, 2–4 edges added per time step, 20 time steps, and 25 singleton-group vertices. For example, Figure 3 depicts an EBA graph with 3 edges added per time step. In evaluating EBA graphs, singleton-group vertices were ignored, regardless of whether they were grouped into new communities or added to existing communities. As shown in Figure 7, the behavior of all four algorithms was quite similar when the number of edges added per time step was 3 or 4, which leads to relatively densely connected graphs. When only 2 edges were added per time step (i.e., the communities where quite sparse), AP’s performance was much worse, and SGE’s somewhat worse, than that of the other two algorithms.
Fig. 7. F-measure for embedded Barabasi-Albert graph with 2–4 edges added per time step
128
5
L. Karl Branting
Conclusion
The empirical evaluation demonstrated that RB and AP conflate distinct communities in ring graphs, and that changing the calculation of the number of bits needed to represent the group adjacency matrix eliminated this conflation over the range of ring graphs tested. Ring graphs are artifacts not likely to occur in many real-world graphs of interest, but many real-world graphs are like ring graphs in having very sparse group adjacency matrices (i.e., communities with links to few other communities). The ring-graph experiment suggests that RB and AP may perform even more poorly than modularity in such graphs. SGE’s description length calculation did not entirely eliminate resolution limits in clustering. For example, SGE combines adjacent communities in extremely large rings, such as R100,4 . Moreover, SGE combines adjacent communities in R3,4 and R13,3 . Thus, it appears that SGE’s bit encoding is not optimal even in sparse graphs. No one algorithm consistently outperformed the others in EBA or uniform random graphs, but modularity was consistently worse than the MDL algorithms on highly skewed uniform random graphs, and AP and SGE had lower performance than the others on sparse EBA graphs. Neither uniform random graphs nor EBA graphs have the sparse group adjacency matrices that characterize ring graphs, so most errors consist of assigning a vertex to the wrong community rather than combining two communities that should remain distinct. Under these circumstances, SGE’s representation of the group adjacency matrix confers no particular benefit. While MDL is clearly a powerful tool for identifying community structure, there are many options for MDL encodings, and the consequences of each choice can be difficult to anticipate. SGE demonstrates that the resolution limits of RB and AP in graphs with sparse group adjacency matrices can be easily addressed, but the fact that SGE did not outperform RB or RB on other types of graphs suggests that considerable subtlety is required to identify the MDL encoding most effective over a wide range of graph and community types.
Acknowledgments This work was funded under contract number CECOM Wl5P7T-08-C-F600. The MITRE Corporation is a nonprofit Federally Funded Research and Development Center chartered in the public interest.
References 1. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004) 2. Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A.L., French, J.C.: Clustering large datasets in arbitrary metric spaces. In: Proceedings of the 15th IEEE International Conference on Data Engineering, Sydney, pp. 502–511 (1999)
Information Theoretic Criteria for Community Detection
129
3. Koutsourelakis, P., Eliassi-Rad, T.: Finding mixed-memberships in social networks. In: Papers from the 2008 AAAI Spring Symposium on Social Information Processing, Technical Report WW-08-06, pp. 48–53. AAAI Press, Menlo Park (2008) 4. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. USA 104, 36 (2007) 5. Ruan, J., Zhang, W.: Identifying network communities with a high resolution. PhysRevE (2007) 6. Ronhovde, P., Nussinov, Z.: An improved potts model applied to community detection. physics.soc-ph (2008) 7. Rosvall, M., Bergstrom, C.: An information-theoretic framework for resolving community structure in complex networks. Proc. Natl. Acad. Sci. USA 104(18), 7327– 7331 (2007) 8. Chakrabarti, D.: Autopart: Parameter-free graph partitioning and outlier detection. In: Proceedings of the European Conference on Machine Learning and Practice of Knowledge Discovery in Databases, pp. 112–124 (2004) 9. Sun, J., Faloutsos, C., Papadimitriou, S., Yu, P.: Graphscope: parameter-free mining of large time-evolving graphs. In: KDD 2007: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 687–696. ACM, New York (2007) 10. Wallace, C.S., Dowe, D.L.: Minimum message length and Kolmogorov complexity. The Computer Journal 42(4), 270–283 (1999) 11. Newman, M.E., Girvan, M.: Finding and evaluating community structure in networks. Physical review. E, Statistical, nonlinear, and soft matter physics 69(2 Pt 2) (February 2004) 12. Tasgin, M., Bingol, H.: Community detection in complex networks using genetic algorithm. In: ECCS 2006: Proc. of the European Conference on Complex Systems (2006) 13. Rattigan, M.J., Maier, M., Jensen, D.: Graph clustering with network structure indices. In: ICML 2007: Proceedings of the 24th international conference on Machine learning, pp. 783–790. ACM, New York (2007) 14. Donetti, L., Muoz, M.: Detecting net-work communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment 10012, 1–15 (2004) 15. Zhang, H., Giles, C.L., Foley, H.C., Yen, J.: Probabilistic community discovery using hierarchical latent gaussian mixture model. In: AAAI 2007: Proceedings of the 22nd national conference on Artificial intelligence, pp. 663–668. AAAI Press, Menlo Park (2007) 16. Rosvall, M., Bergstrom, C.T.: An information-theoretic framework for resolving community structure in complex networks. PNAS 104(7327) (2007) 17. Rissanen, R.: A universal prior for integers and estimation by minimum description length. The Annals of Statistics 2, 416–431 (1983) 18. Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale social networks. In: WebKDD/SNA-KDD 2007: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pp. 16–25. ACM, New York (2007) 19. Bagrow, J.: Evaluating local community methods in networks. J. Stat. Mech. 2008(05), P05001 (2008) 20. Barabasi, A., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999) 21. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data, cite arxiv:0706.1062 (2007) http://www.santafe.edu/~ aaronc/powerlaws/
130
L. Karl Branting
22. Clauset, A., Shalizi, C., Newman, M.: Power-law distributions in empirical data. SIAM Review 51(4), 661–703 (2009) 23. Rand, W.M.: Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association 66(336), 846–850 (1971) 24. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218 (1985)
Author Index
Berger-Wolf, Tanya Y. 55 Branting, L. Karl 114
Magdon-Ismail, Malik Mertsalov, Konstantin
Eliassi-Rad, Tina
Rettinger, Achim
1
Gallagher, Brian 1 Getoor, Lise 97 Ghosh, Rumi 20 Golbeck, Jennifer 97 Goldberg, Mark 36
Saia, Jared
77
55
Tresp, Volker
77
Wallace, William (Al) Habiba,
55
Kelley, Stephen 36 Kersting, Kristian 77 Kuter, Ugur 97 Lerman, Kristina
20
Xu, Zhao Yu, Yintao
36 36
77 55
Zheleva, Elena
97
36