First of all, I am grateful to my dissertation advisor and mentor, Professor Hsinchun
Chen, for his guidance and encouragement throughout my five years at the University of
Arizona. It has been an invaluable opportunity for me to work in the Artificial
Intelligence Lab under his direction. I feel very fortunate to have had such a wonderful advisor. Many thanks go to my major committee members, Dr. Jay F. Nunamaker, Jr. and
Dr. Daniel D. Zeng, and my minor committee members in the Department of
Communication, Dr. Chris Segrin and Dr. Kyle Tusing, for their guidance and encouragement. I also thank all the faculty members in the MIS Department for their support.
My dissertation has been partly supported by grants from the National Science
Foundation/Central Intelligence Agency (EIA9983304) “Knowledge Discovery and
Dissemination ARJIS/COPLINK ‘Border Safe’” and “COPLINK Center for Excellence:
Information and Knowledge Management in Law Enforcement,” and (CTS0311652)
“Intelligent Patent Analysis for Nanoscale Science and Engineering.” Most projects discussed in this dissertation have been supported by other AI Lab members: Dr. Homa
Atabakhsh and Ms. Cathy Larson, and personnel from the Tucson Police Department:
Detective Tim Peterson, Sergeant Mark Nisbet, and Lieutenant Jennifer Shroeder. I thank former AI Lab members whom I regard as role models for my research: Dr. Michael Chau,
Dr. Gondy Leroy, Dr. Chienting Lin, Dr. Bin Zhu, and Dr. Dorbin Ng. I also thank Ms.
Barbara Sears and Ms. Sarah Marshall for editing my papers.
I would like to thank my colleagues for their tremendous help and support through the past five years: Yilu Zhou, Yiwen Zhang, Ming Lin, Gang Wang, Jialun Qin, Jason Li,
Zan Huang, Jinwei Cao, Xiaoyun Sun, Rong Zheng, Daniel McDonald, Byron Marshall,
Xin Li, Jiannan Wang, Yang Xiang, Huihui Zhang, Dr. Edna Reid, Dr. Hua Su, ChunJu
Tseng, and Shing Ka Wu. I especially thank Yilu Zhou for her encouragement and emotional support during the stressful time of my last year of study, and Yiwen Zhang,
Ming Lin, and many other friends for their invaluable care, concern, and help during the time when I was struggling with a big challenge in my personal life.
I am extremely grateful to my parents, sister, and brother. Their unconditional love is my source of energy for working hard through the years. I appreciate the love, care, and encouragement from my husband, Yanhai Sun. He is the one that I can always count on when I feel frustrated and discouraged. Last but not least, I thank my 14-month-old son,
Patrick R. Sun, the most precious present I have received from God. He makes me believe that life is beautiful and research is only a part of my life.
This dissertation is dedicated to my parents.
Foundations................................................................................ 21
Concepts.................................................................................. 26
Representation........................................................................ 27
Presentation............................................................................ 28
2.3 The Computational Framework for Network Structure Mining ................... 31
2.3.1 Static Structure Mining......................................................................... 33
2.3.1.1 Locating Critical Resources in Networks ..................................... 33
2.3.1.2 Reducing Network Complexity .................................................... 38
2.3.1.3 Extracting Topological Properties ................................................. 44
2.3.2 Dynamic Structure Mining ................................................................... 47
2.3.2.1 Describing Structural Dynamics ................................................... 48
2.3.2.2 Modeling Structural Dynamics ..................................................... 49
3.1 Introduction................................................................................................... 53
Work ................................................................................................ 55
Analysis ........................................................................................ 56
3.2.1.1 Network Construction................................................................... 56
3.3
3.2.1.2 Link Analysis Tools...................................................................... 58
Algorithms...................................................................... 59
The Modified BFS Algorithm....................................................................... 62
3.4.1 Network Representation Transformation.............................................. 64
Algorithms...................................................................... 68
3.4.2.1 The Modified PFS Algorithm ....................................................... 69
3.4.2.2 The Two-Tree Dijkstra/PFS Algorithm ......................................... 71
Evaluation ........................................................................................ 73
3.5.1.1 COPLINK Concept Space and AZNP .......................................... 74
3.5.1.2 Data Set......................................................................................... 75
3.5.2 Results
3.5.2.1 User Evaluation: Effectiveness Issue............................................ 76
3.5.2.2 Simulation Experiment: Efficiency Issue ..................................... 81
3.6 Conclusions................................................................................................... 85
4.1 Introduction................................................................................................... 87
4.2 Background................................................................................................... 88
4.2.1 Implications of Structural Network Analysis ....................................... 89
4.2.2 Special Network Structures................................................................... 90
Work ................................................................................................ 91
4.3.1 Existing Network Analysis Tools ......................................................... 91
4.3.1.1 First Generation: Manual Approach .............................................. 91
4.3.1.2 Second Generation: Graphics-Based Approach............................. 93
4.3.1.3 Third Generation: Structural Analysis Approach .......................... 95
4.3.2 Social Network Analysis....................................................................... 96
Analysis ........................................................................ 96
4.3.2.4 Visualization of Social Networks ................................................ 100
4.4 CrimeNet Explorer: Extracting Structural Patterns in Criminal Networks . 101
Partition................................................................................ 104
Analysis.............................................................................. 106
4.4.5 CrimeNet
Evaluation ...................................................................................... 110
4.5.1 The Narcotics and Gang Networks ..................................................... 111
4.5.2.1 Task I: Subgroup Detection (Clustering)..................................... 114
4.5.2.2 Tasks II and III: Interaction Pattern and Central Members Identification ............... 116
4.5.3 Results Discussion ....................................................................... 118
4.6 Conclusions................................................................................................. 123
5.1 Introduction................................................................................................. 125
Work .............................................................................................. 127
5.2.2 Determining Link Weights for Weighted Graphs............................... 128
Unweighted
Algorithms..................................................................... 130
5.3 The Proposed Approach: Local Density Based Partition Algorithms ........ 133
5.3.1 Defining Edge Local Density.............................................................. 133
5.3.2 Illustrating Edge Local Density .......................................................... 135
Case 1: CliqueBridgeClique..................................................................... 136
Case 2: TreeBridgeTree ........................................................................... 139
Case 3: CliqueBridgeTree ........................................................................ 139
Case 4: CliqueClique................................................................................. 141
Case 5: CliqueTree .................................................................................... 142
Metrics........................................................................... 145
5.4.2 Hypotheses.......................................................................................... 147
5.4.3 Results Discussion ....................................................................... 149
5.4.3.1 Effectiveness ................................................................................ 149
5.4.3.2 Efficiency..................................................................................... 155
5.5 Conclusions................................................................................................. 158
6.1 Introduction................................................................................................. 160
Work .............................................................................................. 161
Sets ..................................................................................................... 163
6.4.1 Statistical Properties of the Dark Networks........................................ 164
6.4.1.1 SmallWorld
Properties ................................................................... 168
6.4.2 Robustness of the Dark Networks....................................................... 170
6.5 Conclusions................................................................................................. 173
7.1 Introduction................................................................................................. 175
Work .............................................................................................. 177
7.3 The Composite Evolution Model................................................................ 181
7.3.1 The Composite Model......................................................................... 181
7.3.2 The Simple Degree Model .................................................................. 183
7.3.3 The Simple Fitness Model .................................................................. 184
7.3.4 The Multiplicative Fitness Model....................................................... 184
7.3.5 The Additive Fitness Model................................................................ 185
7.4 The Evolution of Patent Citation Networks................................................ 187
Questions............................................................................. 188
Analysis ........................................................................... 189
Evolutionary
7.4.4.2 Estimating the Composite Model................................................. 199
7.5 Conclusions................................................................................................. 205
8.1 Contributions............................................................................................... 206
Contributions ................................................................... 206
Contributions...................................................................... 208
8.2 Relevance to Business, Management, and MIS.......................................... 211
Directions ........................................................................................ 212
A1: Instructions for Experiment Participants .................................................... 214
A2: Introduction to System Functionality.......................................................... 215
Sheet................................................................................................... 216
Questionnaire .................................................................................. 217
Figure 2.1: Graph representation……… 28
Figure 2.2: The computational framework for network structure mining……… 32
Figure 3.1: The modified BFS algorithm……… 63
Figure 3.2: Two indirectly connected nodes……… 65
Figure 3.3: The modified PFS algorithm……… 70
Figure 3.4: The two-tree PFS algorithm……… 72
Figure 3.5: Execution time scatter plot……… 82
Figure 4.1: The terrorist network surrounding the 19 hijackers on September 11, 2001……… 93
Figure 4.2: Second-generation criminal network analysis tools……… 95
Figure 4.3: Procedures for automated criminal network mining and visualization……… 101
Figure 4.4: The pseudocode of the modified version of the RNN-based complete-link algorithm……… 105
Figure 4.5: CrimeNet Explorer……… 110
Figure 5.1: The transformation of an unweighted graph into a weighted graph using the edge local density measure……… 135
Figure 5.2: The five illustrative cases for edge local density measure……… 136
Figure 5.3: Three illustrative networks with different p_out/p_in ratios……… 150
Figure 5.4: Effectiveness results of the six clustering methods: sLD, sECC, iLD, iECC, GN, and modularity…………………………………... 151
Figure 5.5: The efficiency of sLD, iLD, modularity based, and GN algorithm.. 156
Figure 6.1: The giant component in the GSJ Network…………………………. 164
Figure 6.2: The degree distributions of the dark networks……………………... 168
Figure 6.3: The aging effect in the Meth World………………………………... 170
Figure 6.4: Dark networks’ vulnerability to attacks……………………………. 171
Figure 7.1: The size dynamics in patent citation networks of the four technology fields…………………………………………………… 189
Figure 7.2: Dynamics of average degrees………………………………………. 191
Figure 7.3: Dynamics in average path lengths………………………………….. 193
Figure 7.4: Degree distributions of the four fields……………………………… 194
Figure 7.5: Institutions’ productivity distribution for the drug field…………… 196
Figure 7.6: The log-log plot of conditional content similarity between linked patent pairs…………………………………………………………. 198
Figure 7.7: The distribution of the content similarity between linked drug patents………………………………………………………………. 201
Figure 7.8: The fits of different models………………………………………… 203
Table 2.1: The statistics for network topology……… 46
Table 3.1: Sample statistics of two networks……… 75
Table 3.2: Effectiveness evaluation results……… 78
Table 3.3: Mean execution time (in seconds) for the two shortest-path algorithms……… 81
Table 4.1: Sizes of networks generated from the two datasets……… 112
Table 4.2: Clustering recall and precision……… 118
Table 4.3: Effectiveness……… 120
Table 4.4: Efficiency……… 120
Table 5.1: Hypotheses regarding clustering effectiveness……… 148
Table 5.2: Mean values of the effectiveness metrics of the six methods……… 152
Table 5.3: Summary of hypothesis testing results for effectiveness……… 153
Table 5.4: Mean running times (in seconds) of sLD, iLD, and the modularity based methods……… 156
Table 6.1: The statistics and parameters in the exponentially truncated power-law degree distribution of the dark networks……… 165
Table 6.2: Small-world properties of the dark networks……… 165
Table 7.1: Basic statistics of the four patent citation data sets……… 188
Table 7.2: The five most productive institutions in the four technology fields……… 195
Table 7.3: Exponent values of productivity distributions and similarity distributions for the four fields……… 196
Table 7.4: The similarity coefficients between linked patents and those between unlinked patents……… 199
Table 7.5: Estimated parameter values in the content similarity distributions……… 201
Contemporary organizations live in an environment of networks: internally, they manage the networks of employees, information resources, and knowledge assets to enhance productivity and improve efficiency; externally, they form alliances with strategic partners, suppliers, buyers, and other stakeholders to conserve resources, share risks, and gain market power. Many managerial and strategic decisions are made by organizations based on their understanding of the structure of these networks. This dissertation is devoted to network structure mining, a new research topic on knowledge discovery in databases (KDD) for supporting knowledge management and decision making in organizations.
A comprehensive computational framework is developed to provide a taxonomy and summary of the theoretical foundations, major research questions, methodologies, techniques, and applications in this new area based on an extensive literature review.
Research in this new area is categorized into static structure mining and dynamic structure mining. The major research questions of static structure mining are locating critical resources in networks, reducing network complexity, and capturing topological properties of large-scale networks. An inventory of techniques developed in multiple reference disciplines such as social network analysis and Web mining is reviewed. These techniques have been used in mining networks in various applications including knowledge management, marketing, Web mining, and intelligence and security. Dynamic structure mining is concerned with network evolution, and major findings are reviewed.
A series of case studies are presented in this dissertation to demonstrate how network structure mining can be used to discover valuable knowledge from various networks ranging from criminal networks to patent citation networks. Several techniques are developed and employed in these studies. Performance evaluation results are provided to demonstrate the usefulness and potential of this new research field in supporting knowledge management and decision making in real applications.
In today’s information age the competitive advantages of organizations no longer depend on organizations’ information storage capabilities (Carr, 2003) but on their ability to analyze information and discover valuable knowledge. Knowledge discovery in databases (KDD) plays an indispensable role in supporting contemporary organizations’ knowledge management and decision making by “identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad et al., 1996a, p. 30). The core of KDD is data mining, a process of using appropriate techniques to extract patterns and knowledge from data. Research on KDD and data mining has advanced substantially and many techniques have been developed for a spectrum of data mining problems including association rule mining (Agrawal et al., 1993), clustering (Jain et al., 1999), classification (Quinlan, 1986), outlier analysis, and sequential pattern extraction (Fayyad et al., 1996a; Fayyad et al., 1996b).
Recently, a new data mining topic, network structure mining, has attracted much attention in the KDD research community (Cook & Holder, 2000; Domingos & Richardson, 2001; Palmer et al., 2002). Unlike conventional data mining that extracts patterns based on individual data objects, network structure mining is intended to mine patterns based on the relationships between objects.
The concept of network is not new to most people. Regardless of its context, a network often refers to a set of nodes (objects) connected by links (relationships). Networks are prevalent in nature and society. Familiar networks include social networks, information networks, communication networks, and biological networks (Newman, 2003b).
• Social networks are collections of social actors such as individuals and organizations who interact with one another through various relationships. Relationships between individuals can be kinship, friendship, co-membership, and affective or influential ties (Wasserman & Faust, 1994). Relationships between organizations can be strategic partnership, buyer-supplier relationship, transactions, and other business associations (Gulati & Gargiulo, 1999; Powell et al., 2005 (forthcoming); Stuart, 1998).
• In information networks nodes can be documents, articles, words and phrases, or other objects containing data and information assets. Links are formed because of the underlying relevance or similarity in the content of the nodes. Examples of information networks are citation networks (Garfield, 2001; Hajra & Sen, 2005), which consist of documents and citation links, and the World Wide Web (Brin & Page, 1998; Kleinberg, 1998), which consists of a large number of Web pages connected by hyperlinks.
• Communication networks such as electric power grids and the Internet are often used to facilitate the transmission of certain resources or information (Amaral et al., 2000; Watts & Strogatz, 1998). On the Internet, for example, computers and routers are connected through cables and wires that transmit digitized data.
• Biological networks contain biological components that interact with each other. Examples of biological networks include metabolic pathways (Jeong et al., 2000), genetic regulatory networks (Somogyi & Sniegoski, 1996), biochemical networks, food webs (Garlaschelli et al., 2003), and neural networks (Watts & Strogatz, 1998).
Network structure mining is aimed at extracting valid, novel, and useful structural patterns in various networks. Structural patterns refer to a range of regularities in the structure of networks and help answer questions such as:
• Who are the most influential customers whose purchasing decisions may influence other customers (Domingos & Richardson, 2001)? What are the classic articles that are cited frequently by other articles in a scientific discipline (Culnan, 1987; Small, 1999)? How can people locate high-quality pages on the World Wide Web (Brin & Page, 1998; Kleinberg, 1998)?
• Are there different research specialties and paradigms in a scientific discipline (Culnan, 1986; Giannakis & Croom, 2001; Small, 1977)? How can users find communities of Web pages that discuss similar topics (Flake et al., 2000; Gibson et al., 1998)? Do criminals or terrorists form groups or teams to carry out offenses (McAndrew, 1999; Xu & Chen, Forthcoming)?
• What is the “big picture” of a large network (Chen et al., 2001; Small, 1999; Toyoda & Kitsuregawa, 2001)? What are the properties that characterize networks of specific topologies (Albert & Barabási, 2002; Bollobás, 1985; Watts & Strogatz, 1998)?
• How do information, technology, fads, diseases, and viruses spread in social, communication, and biological networks (Kephart et al., 1998; Liljeros et al., 2001; Valente, 1995)? Does the structure of the network affect the speed of spreading?
• How robust is a network against failures and attacks (Albert et al., 2000)? How can people protect computer networks (Tu, 2000), social networks, and biological networks (Jeong et al., 2001) from attacks?
• What are the patterns of dynamics in the network structure over time (Barabási et al., 2002; Doreian & Stokman, 1997)? How do networks evolve (Dorogovtsev & Mendes, 2003)? What are the mechanisms that govern the evolution of networks (Barabási & Albert, 1999; Bianconi & Barabási, 2001; Menczer, 2004)?
The research on network structure can support decision making in a wide variety of application domains, including e-commerce and marketing (Domingos & Richardson, 2001; Janssen & Jager, 2003), strategic planning (Powell et al., 2005 (forthcoming)), citation analysis (Culnan, 1986; Small, 1999), Web mining (Gibson et al., 1998; Kleinberg, 1998; Toyoda & Kitsuregawa, 2003), knowledge sharing (Kautz et al., 1997), and security and intelligence (McAndrew, 1999; Sparrow, 1991; Xu & Chen, 2005).
However, because the research on network structure mining is young compared with other data mining fields, it faces several challenges. First, there has not been a comprehensive research framework that provides a taxonomy and summary of the major research questions, techniques, and applications of network structure mining. This new field is multidisciplinary in nature and has been studied in several reference disciplines including sociology, mathematics, statistics, physics, computer science, and biology. These disciplines share many common research questions related to network structure and also offer unique perspectives and methodologies for studying networks. It is desirable to develop a research framework that consolidates these different perspectives, summarizes existing techniques and findings, and provides guidance for future research.
Second, although many techniques have been proposed to tackle various network-related problems, such as the identification of important nodes and the detection of groups, research on network structure mining still strives to find new techniques that are more effective, efficient, scalable, and useful.
Third, most existing network studies focus on the static structural patterns in networks. How to extract the patterns of dynamics in networks is still a challenging problem. In addition, because specific evolution processes lead to specific network structures, which in turn affect the function and performance of networks, the search for the underlying mechanisms that govern the evolution of networks is particularly important. Presently, research on network evolution is still in its infancy (Dorogovtsev & Mendes, 2003).
Last, it is believed that the research on networks has led to a “new science of networks” (Barabási, 2002; Watts, 2004). The significance of this new science in terms of its roles for supporting knowledge management and decision making in real world applications, together with the impacts of network mining technology on users, organizations, and society, is still an open question. A large number of empirical studies that are intended to evaluate such significance and impacts need to be conducted to demonstrate the value of this new field.
Facing these challenges, this dissertation is intended to achieve the following research objectives:
• To develop a comprehensive research framework that incorporates major research questions, techniques, methodologies, and applications of network structure mining;
• To develop and employ effective and efficient techniques for mining static and dynamic structural patterns in networks in several application domains;
• To evaluate the performance of these techniques in terms of their abilities to support knowledge management and decision making.
The remainder of this dissertation is organized as follows. Chapter 2 presents the research framework of network structure mining after reviewing related literature. Chapters 3, 4, and 6 are demonstrations of several network mining techniques to support knowledge management in the law enforcement, intelligence, and security domains (Xu & Chen, 2004, 2005). Chapter 5 is devoted to a new network partition approach that is more efficient than existing approaches. Chapter 7 proposes a new network evolution model. Chapter 8 summarizes the contributions of this dissertation, points out the connections between network structure mining and business and management, and suggests future research directions.
Based on an extensive review of prior work, I present the computational framework in this chapter. This computational framework consists of several major research questions in network structure mining and existing techniques for addressing these questions. Before the literature review, I first introduce the theoretical foundations and fundamental concepts of network structure mining.
The study of network structure is a multidisciplinary area and is grounded on three different theoretical foundations: graph theory from mathematics and computer science, social network analysis from sociology, and topological analysis from statistical physics.
Graph theory is the study of properties of graphs (Bollobás, 1998). It provides the mathematical formalism for defining, representing, and solving a series of graph related problems such as graph isomorphism problems, graph coloring problems, network flow problems, etc. (Bollobás, 1998; Harary, 1994). Graph theory was first introduced in mathematics and has advanced substantially in computer science. While mathematicians focus on formal solutions of graph related problems, computer scientists focus on the development of efficient algorithms to deal with graphs. Since its introduction in the 18th century, graph theory has grown into a full-fledged branch of its own. Developments and results in graph theory have been used to tackle problems in a wide variety of applications including electronic circuit layout (vanCleemput, 1976), task scheduling (Hesham et al., 1994), resource allocation (Deo, 1974), and computer network design, among many others.
The mathematical and algorithmic solutions to graph related problems were not intended for structural pattern mining purposes but their applications can help extract regularities in network structures. For example, the algorithms for finding maximum flow and minimum cut in network flow problems have been used to identify Web communities (Flake et al., 2000). A Web community is a set of Web pages that discuss similar topics or are created by authors sharing common interests (Flake et al., 2000; Gibson et al., 1998).
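As a rough illustration of how a minimum cut can separate a tightly linked group of pages from the rest of a network, the following Python sketch computes the source side of a minimum cut with the Edmonds-Karp method. It is a simplified sketch, not the procedure of Flake et al. (2000): the function name `min_cut_source_side`, the unit link capacities, and the choice of a single seed page and a single outside page are all assumptions made for this example.

```python
from collections import deque


def min_cut_source_side(edges, source, sink):
    """Return the nodes on the source side of a minimum source-sink cut.

    edges: iterable of (u, v) undirected links, each given unit capacity
    (an assumption of this sketch). Both source and sink must appear in edges.
    """
    # Residual capacities: an undirected link has capacity 1 in each direction.
    cap = {}
    for u, v in edges:
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + 1
        cap[v][u] = cap[v].get(u, 0) + 1

    def reachable_parents():
        # Breadth-first search over links that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_parents()
        if sink not in parent:
            # No augmenting path is left, so the reachable nodes
            # form the source side of a minimum cut.
            return set(parent)
        # Find the bottleneck capacity along the augmenting path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
```

For two triangles of pages joined by a single hyperlink, choosing a seed page in one triangle and an outside page in the other cuts the joining link and returns the seed's triangle as the candidate community.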
Another important theoretical foundation of network structure mining is social network analysis (SNA). SNA is used in sociology research to analyze patterns of relationships and interactions between social actors in order to discover the underlying social structure (Berkowitz, 1982; Breiger, 2004; Scott, 1991; Wasserman & Faust, 1994). The most distinctive feature of SNA is “the use of structural or relational information to study or test social theories” (Wasserman & Faust, 1994, p. 21). Not only the attributes of social actors, such as their age, gender, socioeconomic status, and education, but also the properties of relationships between social actors, such as the nature, intensity, and frequency of the relationships, are believed to have important impact on the social structure. SNA methods have been employed to study organizational behavior (Borgatti & Foster, 2003; Brass, 1984), interorganizational relations (Powell et al., 2005 (forthcoming); Stuart, 1998), citation patterns (Baldi, 1998; Price, 1965), computer-mediated communication (Garton et al., 1999), and many other domains.
SNA has both behavioral and computational focuses. The behavioral focus is on the validation of social theories based on the regularities found in social relationships. The computational focus is on the development of methods and measures for finding the regularities. The computational focus is thus the most relevant to network structure mining research.
Computational SNA distinguishes between relational analysis and positional analysis (Burt, 1980; Wasserman & Faust, 1994). Relational analysis studies the connectivity of a social network. It is often used to identify central members or to find subgroups in a social network. In such studies, links usually are weighted by relational strength. Positional analysis is concerned with structural roles of social actors. The purpose of positional analysis is to discover the overall structure of a social network. Both relational analysis and positional analysis are very relevant to the extraction of structural patterns from networks. For example, the centrality measures in relational analysis can be used to identify influential authors in citation networks (Culnan, 1987).
A recent movement in statistical physics has brought revolutionary insights and research methodology to the study of network structure. This new movement is best described as statistical analysis of network topology (Albert & Barabási, 2002). Unlike graph theory and SNA, which deal primarily with the static structure of networks, topological analysis views the structure of a network as the result of some evolutionary processes, which can be described and modelled using certain statistical mechanisms. The power of this new perspective lies in its ability to explain and predict the structural phenomena observed in large networks such as the World Wide Web (Albert et al., 1999).
Three models have been proposed to characterize the topology of large, complex networks: the random graph model (Bollobás, 1985; Erdős & Rényi, 1960), the small-world model (Watts & Strogatz, 1998), and the scale-free model (Barabási & Albert, 1999). A random network starts with a fixed number of nodes. Any two nodes are then connected by a link with probability $p$. As a result each node has roughly the same number of links. The degree distribution, $P(k)$, is the probability that a node has exactly $k$ links. It has been shown that the degree distribution of a random graph follows the Poisson distribution (Bollobás, 1985), peaking at the average degree. A random network usually has a small average path length so that an arbitrary node can reach any other node in a few steps. The assumption of the random graph model is that the evolution of real networks is primarily a random process. For the past few decades the random graph model has been used as the single model of network topology. However, it has recently been found that most complex systems and real networks are not random but are governed by certain organizing principles encoded in the topology of the networks (Albert & Barabási, 2002).
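The random graph construction and its degree distribution can be illustrated with a short sketch. It is not taken from the cited studies; the function names, the network size, the link probability, and the fixed random seed are illustrative assumptions. The script prints the empirical $P(k)$ next to the Poisson value for the same average degree so the two can be compared by eye.

```python
import math
import random
from collections import Counter

def random_graph(n, p, seed=0):
    """Link every pair of the n nodes independently with probability p."""
    rng = random.Random(seed)  # fixed seed (assumption) so runs are repeatable
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

def degree_distribution(n, links):
    """Return P(k): the fraction of nodes that have exactly k links."""
    degree = [0] * n
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    counts = Counter(degree)
    return {k: c / n for k, c in sorted(counts.items())}

def poisson(k, mean):
    """Poisson probability of observing k for the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

n, p = 1000, 0.01  # illustrative values, not from the text
links = random_graph(n, p)
P = degree_distribution(n, links)
mean_degree = 2 * len(links) / n

for k in P:
    print(k, round(P[k], 4), round(poisson(k, mean_degree), 4))
```

With these settings the average degree should be close to $p(n-1) \approx 10$, and the empirical distribution should peak near that value.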
The small-world model and the scale-free model deviate substantially from the random graph model (Albert & Barabási, 2002; Newman, 2003b). A small-world network has a significantly higher tendency to form clusters and groups (Watts & Strogatz, 1998), which are rarely present in random graphs. Scale-free networks (Barabási & Albert, 1999), on the other hand, are characterized by a power-law degree distribution, meaning that while a large percentage of nodes in the network have just a few links, a small percentage of the nodes have a large number of links. It is believed that scale-free networks evolve following the self-organizing principle, where growth and preferential attachment play a key role in the emergence of the power-law degree distribution (Barabási & Albert, 1999).
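Growth and preferential attachment can be sketched as follows. This is a simplified illustration in the spirit of the scale-free model, not the authors' published procedure; the seed network, the parameter names `n` and `m`, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow a network to n nodes. Each new node links to m distinct existing
    nodes, chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed starting point: a small complete network on m + 1 nodes.
    links = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Each node appears in `ends` once per link end, so a uniform draw from
    # this list is a degree-proportional draw over nodes.
    ends = [node for link in links for node in link]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            links.append((new, t))
            ends.extend((new, t))
    return links

links = preferential_attachment(n=2000, m=2)
degree = Counter(node for link in links for node in link)
histogram = Counter(degree.values())

# A few nodes should collect many links while most keep only a few.
print("largest degrees:", sorted(degree.values(), reverse=True)[:5])
print("nodes with exactly m links:", histogram[2])
```

Keeping a list of link ends is a common trick for degree-proportional sampling; it avoids recomputing selection probabilities after every new link.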
The small-world model and the scale-free model have spurred research on the topological properties of large-scale networks and complex systems since they were proposed in the late 1990s. A large number of papers have been published in leading science journals such as Nature, Science, and the Proceedings of the National Academy of Sciences (PNAS). The new findings and the variants of the two models reported there have greatly enriched our knowledge about large, complex networks (Albert & Barabási, 2002).
In addition to the three theoretical foundations, many other disciplines and research communities have contributed to the study of network structure. Among these research communities Web mining is the most important. Web mining is about the automatic discovery of information, services, and valuable patterns from the content of Web documents (Web content mining), the structure of hyperlinks (Web structure mining), and the usage of Web pages and services (Web usage mining) (Etzioni, 1996). An important application of Web mining is to improve the design of online search engines and crawlers to help users find what they are looking for more effectively and efficiently (Chau et al., 2003). In particular, Web structure mining, often called link analysis in Web mining research (Kleinberg, 1998; Kleinberg & Lawrence, 2001), exploits the structure of hyperlinks between Web pages to locate high-quality Web documents (Broder et al., 2000; Chakrabarti et al., 1999; Kleinberg & Lawrence, 2001; Kumar et al., 1999).
The computational framework will incorporate a number of the most important research questions, technologies, and findings in network structure mining built upon the three theoretical foundations. Before presenting this framework I will introduce the fundamental concepts in structural analysis, including the definition, representation, and presentation of networks.
Networks are essentially graphs. A graph is a mathematical abstraction of networks of various types. In graph theory a graph is formally defined as a pair of sets $G = (V, A)$, where $V$ is the set of vertices and $A$ is the set of edges, $|V| = n$, $|A| = m$, and $m \le n^2$.
Vertices are also called nodes, points, and objects. Edges are also called arcs, links, and lines. A graph can be directed or undirected depending on whether the links have origins and destinations. A graph can also be weighted or unweighted depending on whether each link is associated with a numeric label called a weight.
Throughout this dissertation I will use nodes and links to refer to the two basic types of elements in graphs. I will also use network and graph interchangeably. Note that “network” here refers to the general graph and is not the same as the formal definition of network in graph theory, which refers only to directed, weighted graphs (Harary, 1994).
There are a large number of other concepts and terms related to graphs, such as path, density, and subgraph, among many others. I will introduce and define them in later sections when needed.
Graphs can be represented in various formats. The two most widely used formats for representing graphs are graphics and matrices. The graphic representation is quite intuitive. For example, in Figure 2.1a, an undirected, unweighted graph consisting of five nodes is drawn. The circles represent nodes and lines between the circles represent links. Nodes are labeled with numbers in this graph. The graph can also be represented as a matrix (see Figure 2.1b). Such a matrix is called a sociomatrix in social network analysis (Moreno, 1953) and an adjacency matrix in computer science. The graph is represented as an $n \times n$ square matrix with rows and columns representing nodes. The value of a cell, $(i, j)$, is set to be the weight of the link between nodes $i$ and $j$. A zero means that $i$ and $j$ are not directly connected. In this simple example cell values are either 1 or 0, indicating the presence or absence of links.
Figure 2.1: Graph representation. (a) Graphic representation. (b) Matrix representation.
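The matrix representation can be sketched in a few lines. The link list below is hypothetical, since the exact links of Figure 2.1 are not given in the text; only the five node labels and the conventions (1 for a link, 0 for none, symmetric for an undirected graph) follow the description above.

```python
def adjacency_matrix(n, links, directed=False):
    """Build an n x n matrix for nodes labeled 1..n.
    Each link is (i, j) or (i, j, weight); a missing weight defaults to 1."""
    matrix = [[0] * n for _ in range(n)]
    for link in links:
        i, j = link[0], link[1]
        weight = link[2] if len(link) > 2 else 1
        matrix[i - 1][j - 1] = weight
        if not directed:
            matrix[j - 1][i - 1] = weight  # undirected: mirror the cell
    return matrix

# Hypothetical links among five nodes (an assumption, not read from Figure 2.1).
links = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
A = adjacency_matrix(5, links)
for row in A:
    print(row)
```

Passing `directed=True` fills only cell $(i, j)$, and a third tuple element stores a link weight instead of 1.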
An important issue related to network structure mining is network presentation and visualization. As the old saying goes, “a picture is worth a thousand words”: a good presentation can reflect the intrinsic structure of a network (Battista et al., 1999; Herman et al., 2000) and help “visually mine” the network. For example, one can easily find the popular nodes that have many links in a network if these nodes are placed close to the center of the network. The hierarchical structure of a tree will become more apparent if nodes at the same level are placed along the same horizontal line (Reingold & Tilford, 1981).
Two types of approaches have been employed to present and visualize networks, namely multidimensional scaling (MDS) and graph layout approaches. MDS is the most commonly used method for social network visualization (Freeman, 2000). It is a statistical method that projects higher-dimensional data onto a lower-dimensional display. It seeks to provide a visual representation of proximities (dissimilarities) among nodes so that nodes that are more similar to each other are closer on the display and nodes that are less similar to each other are farther apart (Kruskal & Wish, 1978; Young, 1987). Metric MDS deals with numerical proximities. Nonmetric MDS is used when only the rank order of the proximities is considered (Kruskal, 1964; Torgerson, 1952; Young, 1987). For both methods, Kruskal’s STRESS statistic (Kruskal, 1964), which measures the goodness-of-fit when reducing the dimensionality of data, is the objective function to be optimized. A high STRESS value indicates that the network is significantly distorted and that two distant nodes in the higher-dimensional space may be placed close to each other on the lower-dimensional display. The advantage of MDS is that the physical distance between two nodes on the visual display indicates the “similarity” of the two nodes. However, MDS considers only the positions of nodes and ignores the placement of links. A popular node may not necessarily be placed at the center of the network. In addition, there might be many crossing links, making it difficult to visualize the structure of the network.
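To make the role of the STRESS statistic concrete, the sketch below evaluates a candidate display against a matrix of dissimilarities. The normalization used (the square root of the squared residuals divided by the squared display distances, often called stress-1) is an assumption, since the text does not state the formula, and the sketch compares raw dissimilarities directly, as in the metric case. The function name and the toy data are illustrative.

```python
import math
from itertools import combinations

def stress(positions, dissimilarity):
    """Stress of a display: 0 means the display distances reproduce the
    dissimilarities exactly; larger values mean more distortion."""
    residual = 0.0
    scale = 0.0
    for i, j in combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])  # distance on the display
        residual += (d - dissimilarity[i][j]) ** 2
        scale += d ** 2
    return math.sqrt(residual / scale)

# Toy dissimilarities among three nodes (the sides of a 3-4-5 triangle).
dissimilarity = [[0, 3, 5],
                 [3, 0, 4],
                 [5, 4, 0]]

exact_2d = [(0, 0), (3, 0), (3, 4)]   # two dimensions: no distortion
squeezed_1d = [(0,), (3,), (7,)]      # one dimension: nodes 1 and 3 pushed apart

print(stress(exact_2d, dissimilarity))
print(stress(squeezed_1d, dissimilarity))
```

An MDS procedure would search for the positions that minimize this value; the sketch only scores a given display.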
Graph layout algorithms have been developed particularly for drawing aesthetically pleasing network presentations (Fruchterman & Reingold, 1991). One of the most important aesthetic rules is to minimize the number of crossing links (Purchase, 1997). Other aesthetic rules include distributing nodes evenly, making link lengths uniform, and keeping nodes from being too close to links (Davidson & Harel, 1996; Fruchterman & Reingold, 1991). To automatically draw graphs of high aesthetic quality, computer scientists have proposed a type of graph layout algorithm called the spring embedder, also known as the force-directed method (Davidson & Harel, 1996; Eades, 1984; Fruchterman & Reingold, 1991; Kamada & Kawai, 1989). This algorithm treats a network as an energy system in which steel rings (nodes) are connected by springs (links). Nodes attract and repulse each other and finally settle down when the total energy carried by the springs is minimized. The network layout generated by a spring embedder might be quite different from that generated by MDS because of their different objective functions and node position handling mechanisms.
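A minimal force-directed layout can be sketched as below. It is loosely modeled on the Fruchterman and Reingold formulation (repulsion $k^2/d$ between every pair of nodes, attraction $d^2/k$ along links, and a shrinking cap on movement); the unit-square drawing area, the starting temperature, the cooling rate, and the iteration count are all assumptions chosen for illustration, not values from the cited papers.

```python
import math
import random

def spring_layout(n, links, iterations=200, seed=0):
    """Return 2-D positions for nodes 0..n-1 after simulating the springs."""
    rng = random.Random(seed)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    k = math.sqrt(1.0 / n)   # ideal link length for a unit-square area
    temperature = 0.1        # maximum movement per iteration (assumed value)
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        # Every pair of nodes repels each other.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = max(math.hypot(dx, dy), 1e-9)
                push = k * k / d
                disp[i][0] += dx / d * push
                disp[i][1] += dy / d * push
                disp[j][0] -= dx / d * push
                disp[j][1] -= dy / d * push
        # Linked nodes attract each other.
        for i, j in links:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = max(math.hypot(dx, dy), 1e-9)
            pull = d * d / k
            disp[i][0] -= dx / d * pull
            disp[i][1] -= dy / d * pull
            disp[j][0] += dx / d * pull
            disp[j][1] += dy / d * pull
        # Move each node along its net force, capped by the temperature.
        for i in range(n):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature *= 0.98  # cool down so the layout settles
    return pos

# Two triangles joined by one link (illustrative network).
links = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
pos = spring_layout(6, links)
for node, (x, y) in enumerate(pos):
    print(node, round(x, 3), round(y, 3))
```

The pairwise repulsion loop is what makes each iteration quadratic in the number of nodes.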
The size of a network can pose a great challenge to the performance of both the MDS and spring embedder algorithms. The time complexities of the MDS and spring embedder algorithms are $O(n^2)$ and $O(n^3)$, respectively (Herman et al., 2000), where $n$ is the size of the network. It can be quite slow to draw a network consisting of thousands of nodes.
More importantly, as the size of a network increases the presentation will become more and more cluttered, making it difficult for users to comprehend the structure. One approach to address this problem is the focus+context technique (Furnas, 1986). This technique mimics the distorting effect of a fisheye lens such that objects around the focal point selected by a user are enlarged and distant objects are shown with less detail. As a result users can examine the local details of a network without losing the sense of the context of the whole network.
These fundamental concepts of network structure such as graph definition, representation, and presentation are helpful for understanding the technology of network structure mining, which will be presented in the computational framework in the next section.
Network structure mining has great potential for supporting knowledge management and decision making in many application domains, yet it faces many challenges. I therefore develop this computational framework. The framework provides a taxonomy of the major research areas in this new field, identifies the key research questions in each area, and reviews existing techniques for addressing these research questions.
Figure 2.2 presents this computational research framework for network structure mining. There are two major areas: static structure mining and dynamic structure mining. Static structure mining studies a “snapshot” of a network, that is, nodes and links observed at a single point in time. Dynamic structure mining, in contrast, analyzes a network based on data observed at multiple points in time. Static analysis is aimed at discovering the structural regularities in the specific configuration of the nodes and links of a network at the time of observation. Dynamic analysis is aimed at finding the patterns of changes in the network over time. The focus of static analysis is on structure, while the focus of dynamic analysis is on the processes and the evolutionary mechanisms that lead to the structure (Barabási & Albert, 1999; Doreian & Stokman, 1997).
Static structure mining
    Locating critical resources
        Identifying key nodes
            Graph theoretical: centrality measures; neighborhood function
            Link-analysis-based: HITS; PageRank
        Identifying key links/paths
            Graph theoretical: edge betweenness; shortest-path algorithms
    Reducing network complexity
        Identifying subgroups
            Weighted graph partitioning: spectral clustering; hierarchical clustering
            Unweighted graph partitioning: link-analysis-based; graph theoretical; hierarchical clustering
        Modeling between-group relationships
            Blockmodeling
    Extracting topological properties
        General properties
        Characterizing properties: average path length; clustering coefficient; degree distribution
Dynamic structure mining
    Describing structural dynamics
        General properties
        Characterizing properties
    Modeling structural dynamics
        Analytical approaches
        Simulation approaches
Figure 2.2: The computational framework for network structure mining.
The three major problems of static structure mining are locating critical resources in networks, reducing network complexity, and extracting topological properties of networks.
2.3.1.1 Locating Critical Resources in Networks
A network can be viewed as a collection of resources. On the World Wide Web, for example, the contents of Web documents can be viewed as information resources. Users search for quality Web pages whose contents match their information needs. Cables and wires in a computer network are also resources whose breakage may bring the whole network down. The key people, documents, relations, and communication channels in a network often are critical to the function of the network. Existing techniques for locating critical resources have been used in a number of applications, such as finding high-quality pages on the Web (Chakrabarti et al., 1999; Kleinberg, 1998), locating cables and wires whose failure reduces the robustness of the Internet (Kleinberg et al., 2004; Kumar et al., 2002; Tu, 2000), searching for experts for a specific problem in collaboration networks (Kautz et al., 1997; Newman, 2001b), and identifying leaders and gatekeepers in criminal and terrorist networks (Krebs, 2001; Xu & Chen, 2005).
In general, the key resources in a network are its important nodes, links, or paths, where a path is a sequence of links.
• Identifying Key Nodes
Methods for identifying key nodes can be categorized into two types: graph theoretical approaches and link analysis based approaches.
o Graph Theoretical Approaches
Graph theoretical approaches originate from graph theory and social network analysis. They treat a network as a graph and identify the key nodes based on the link structure of the network. Centrality measures in SNA are often used to locate key nodes. Freeman (1979) provides definitions of the three most popular centrality measures: degree, betweenness, and closeness.
Degree measures how active a particular node is. It is defined as the number of direct links a node has. “Popular” nodes with high degree scores are the leaders, experts, or hubs in a network. It has been shown that these popular nodes can be a network’s “Achilles’ heel,” whose failure or removal will cause the network to quickly fall apart (Albert et al., 2000; Holme et al., 2002). In particular, in some communication networks such as electric power grids and the Internet, a key node’s failure may cause cascading breakdown of other nodes due to traffic rerouting (Watts, 2002; Zhao et al., 2004). In the counterterrorism and crime fighting context, the removal of key offenders is often an effective disruptive strategy (McAndrew, 1999; Sparrow, 1991).
Betweenness measures the extent to which a particular node lies between other nodes in a network. The betweenness of a node is defined as the number of geodesics (shortest paths between two nodes) passing through it. Nodes with high betweenness scores often serve as gatekeepers and brokers between different parts of a network. They are important communication channels through which information, goods, and other resources are transmitted or exchanged (Newman, 2004a; Wasserman & Faust, 1994). Holme et al. (2002) show that the removal of nodes with high betweenness scores can be more devastating than the removal of nodes with high degrees.
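Node betweenness can be computed with one breadth-first search per node. The sketch below uses the accumulation technique introduced by Brandes, which is an assumption on my part since the text does not name an algorithm; when several geodesics connect the same pair of nodes, each one contributes a fractional share rather than a full count. The example network and all names are illustrative.

```python
from collections import deque

def betweenness(adj):
    """adj maps each node to the set of its neighbors (undirected, unweighted).
    Returns the number of geodesics passing through each node."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        paths = {v: 0 for v in adj}   # number of geodesics from s to v
        paths[s] = 1
        preds = {v: [] for v in adj}  # predecessors of v on geodesics from s
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing path shares to predecessors.
        share = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share[v] += paths[v] / paths[w] * (1 + share[w])
            if w != s:
                score[w] += share[w]
    # Each undirected pair was counted from both of its end points.
    return {v: b / 2 for v, b in score.items()}

# Two triangles joined by the link 3-4 (illustrative network).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
scores = betweenness(adj)
print(scores)
```

In this example nodes 3 and 4 are the gatekeepers: every geodesic between the two triangles passes through them.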
Closeness is the sum of the lengths of the geodesics between a particular node and all the other nodes in a network. Because this sum actually measures how far away one node is from the other nodes, it is sometimes called “farness” (Baker & Faulkner, 1993; Freeman, 1979). A node with low closeness (that is, a large sum of geodesic lengths) may find it very difficult to communicate with other nodes in the network. Such nodes are thus more “peripheral” and can become outliers in the network (Sparrow, 1991; Xu & Chen, 2005).
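Degree and the geodesic-length sum can be sketched for an unweighted, undirected network as follows. The function names and the example network are illustrative assumptions; the sum is reported under the name `farness` to make clear that a larger value means a more peripheral node.

```python
from collections import deque

def degree(adj):
    """Number of direct links of every node."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def farness(adj, source):
    """Sum of geodesic lengths from source to every reachable node.
    Assumes a connected network; unreachable nodes would be left out."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return sum(dist.values())

# Two triangles joined by the link 3-4 (illustrative network).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}

degrees = degree(adj)
far = {v: farness(adj, v) for v in adj}
print(degrees)
print(far)
```

Node 3 has both the highest degree and the smallest sum, while node 1 sits farther from the rest of the network.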
Another centrality-related measure in SNA is prestige, which is similar to degree but is defined for directed graphs (Wasserman & Faust, 1994). The prestige of a node is the number of in-links the node has. A prestigious node tends to have many nominations from other nodes.
Both degree and prestige measure the importance of a node based on the direct neighbors of the node. Recently, Palmer et al. (2002) have proposed a neighborhood function for categorizing the importance of a node. The neighborhood function for a node $u$ at distance $h$ is the total number of nodes that can be reached from $u$ within $h$ or fewer hops. An important router in a computer network, for example, will be one that can reach most of the other routers within a few hops.
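The neighborhood function can be sketched with a breadth-first search that stops at depth $h$. This is an exact computation for small networks, not the approximation scheme of Palmer et al.; whether $u$ itself is counted is not stated in the text, so the sketch excludes it (an assumption).

```python
from collections import deque

def neighborhood(adj, u, h):
    """Number of nodes reachable from u within h or fewer hops (u excluded)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == h:
            continue  # do not expand beyond h hops
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return len(dist) - 1

# A chain of five routers, 1 - 2 - 3 - 4 - 5 (illustrative network).
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
for u in adj:
    print(u, [neighborhood(adj, u, h) for h in (1, 2, 3)])
```

The middle router reaches every other router within two hops, while the end routers need four.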
o Link Analysis based Approaches
In Web mining research, the HITS (Kleinberg, 1998) and PageRank (Brin & Page, 1998) algorithms are the two most widely used methods for locating high-quality documents on the Web. Unlike centrality measures, which calculate the scores directly, both the HITS and PageRank algorithms are iterative procedures.
The HITS (Hyperlink-Induced Topic Search) algorithm is based on a simple intuition. High-quality Web pages can be either authoritative pages or hub pages. The authoritative pages contain high-quality information related to a particular topic and thus may be pointed to by many other pages. Hub pages are not necessarily authoritative pages but provide links to many authoritative pages. The authoritative score of a page thus is measured by the number of in-links from hub pages. The hub score of a page is measured by the number of out-links that point to authoritative pages. The algorithm begins by assigning random numbers to the authoritative and hub scores of all pages. The two scores of each page are iteratively updated until they converge. Similarly, the PageRank algorithm determines the quality of a page based on the number of in-links the page receives. In addition, each in-link is weighted based on the quality of the page where the link originates. The quality of this neighbor page is also determined by PageRank.
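Both iterative procedures can be sketched on a toy set of pages. The sketch departs from the description above in two labeled ways: scores start from a uniform value instead of random numbers so that runs are repeatable, and a fixed number of iterations replaces a convergence test. The damping factor of 0.85 and the even redistribution of score from pages without out-links are common choices that the text does not mention, so they are assumptions here. Page names are hypothetical.

```python
import math

def hits(out_links, iterations=50):
    """Return (authority, hub) scores; out_links maps a page to the pages it links to."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        h_norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        auth = {p: a / a_norm for p, a in auth.items()}
        hub = {p: h / h_norm for p, h in hub.items()}
    return auth, hub

def pagerank(out_links, damping=0.85, iterations=50):
    """Return a PageRank score per page; the scores sum to 1."""
    pages = sorted(set(out_links) | {q for t in out_links.values() for q in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                # Each in-link is weighted by the rank of its source page.
                for q in targets:
                    new[q] += damping * rank[p] / len(targets)
            else:
                # A page without out-links spreads its rank evenly (assumption).
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical pages: three hub-like pages pointing at two content pages.
out_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
auth, hub = hits(out_links)
rank = pagerank(out_links)
print(sorted(auth.items(), key=lambda item: -item[1]))
print(sorted(rank.items(), key=lambda item: -item[1]))
```

Page `a1` receives one more in-link than `a2`, so it comes out ahead under both procedures.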
• Identifying Key Links/Paths
A set of graph theoretical approaches have been proposed to identify the key links and paths in a network. Girvan and Newman (2002) define a measure called edge betweenness to find links that serve as bridges between different groups in a network. Analogous to node betweenness, the edge betweenness of a link is the number of shortest paths passing through it. If a network contains groups and there are a few bridges connecting these groups, one must pass through these bridges when traveling from one group to another. These bridges become critical to the connectivity of the whole network. Removal of links with high edge betweenness scores will easily cause network breakdown. Edge betweenness has been used for network partition tasks, which will be reviewed shortly (Girvan & Newman, 2002). Because the calculation of edge betweenness requires a global traversal of a graph, which is computationally costly, Radicchi et al. (2004) propose the edge clustering coefficient measure, which requires only local traversal, to approximate edge betweenness.
For key path identification, the most widely used algorithm is the shortest-path algorithm (Dijkstra, 1959). The algorithm can find the shortest path between two nodes, which might be the quickest way to travel from one city to another (Wang & Crowcroft, 1992), the most efficient route to transmit data from one router to another (Perkins & Bhagwat, 1994), or the strongest relationships between two people (Xu & Chen, 2004). It has also been used to find the node-independent paths in a network (White & Newman, 2001). The number of node-independent paths between two nodes is the minimum number of nodes that must be removed to disconnect the two nodes. Node-independent paths thus have a direct impact on the robustness of a network. The classic Dijkstra algorithm computes the shortest paths from a single source node to every other node in a graph. Other variants improve the speed of the algorithm using efficient data structures. For example, the Priority-First-Search (PFS) algorithm (Cormen et al., 1991) is faster than the Dijkstra algorithm through the use of a priority queue.
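A single-source shortest-path computation with a priority queue can be sketched as follows. It uses Python's binary heap and assumes non-negative link weights; the node names and weights are hypothetical, and the sketch is a generic textbook form, not the PFS implementation cited above.

```python
import heapq

def shortest_paths(adj, source):
    """adj maps a node to {neighbor: weight}. Returns (dist, prev), where dist
    holds the shortest distance from source to every reachable node and prev
    holds the previous node on that shortest path."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for v, weight in adj.get(u, {}).items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                prev[v] = u
                heapq.heappush(heap, (candidate, v))
    return dist, prev

def path_to(prev, source, target):
    """Rebuild the path by walking back from target (target must be reachable)."""
    nodes = [target]
    while nodes[-1] != source:
        nodes.append(prev[nodes[-1]])
    return nodes[::-1]

# Hypothetical directed network with travel costs on the links.
adj = {"A": {"B": 4, "C": 1},
       "B": {"D": 1},
       "C": {"B": 2, "D": 6},
       "D": {}}
dist, prev = shortest_paths(adj, "A")
print(dist)
print(path_to(prev, "A", "D"))
```

Leaving stale entries in the heap and skipping them on removal is a simple substitute for a decrease-key operation.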
2.3.1.2 Reducing Network Complexity
A network can be very complex due to the large number of nodes and links it contains. Understanding the structure of a network becomes increasingly difficult when its size scales up. For example, a marketing manager may get lost when he/she faces a network consisting of thousands of existing and potential customers. A researcher may find it difficult to understand the intellectual structure of an unfamiliar discipline when studying its citation networks containing hundreds of papers or authors. Therefore, it is desirable to extract the “big picture” out of a complex network by reducing it into a simpler image while preserving the intrinsic structure. To achieve this goal, a network can be first partitioned into subgroups, each of which contains a set of nodes. The between-group relationships can then be extracted. A number of applications can benefit from this technology. In particular, network partition methods have been employed to find communities on the Web (Flake et al., 2000; Gibson et al., 1998; Toyoda & Kitsuregawa, 2001), major research topics and paradigms in a discipline in citation networks (Small, 1999; White & McCain, 1998), and criminal groups in criminal networks (Xu & Chen, 2005).
• Identifying Subgroups
In SNA a group is cohesive if nodes in this group have stronger or denser links with nodes within the group than with nodes outside of the group (Wasserman & Faust, 1994). The methods for identifying cohesive subgroups and partitioning a network differ depending on whether the network is weighted or unweighted. A weighted graph can be partitioned into cohesive groups by maximizing the within-group link weights while minimizing the between-group link weights. Because the link weight represents node similarity or link strength and intensity, nodes in the same group are more similar to each other or more strongly connected. An unweighted graph can be partitioned into cohesive groups by maximizing within-group link density while minimizing between-group link density. In this case, cohesive groups are densely knit subsets of the graph. Weighted graph partitioning is less challenging than unweighted graph partitioning.
o Weighted Graph Partitioning
Given a weighted graph, spectral clustering and hierarchical clustering methods can be used to find subgroups in the graph.
Spectral clustering methods partition a graph by analyzing the spectrum of the Laplacian matrix representing the graph (Fiedler, 1973; Pothen et al., 1990). The Laplacian matrix is constructed from the graph’s adjacency matrix and its spectrum is found by calculating the eigenvalues and the eigenvectors of the matrix. The optimal partition is found by minimizing the total link weights between groups. The eigenvectors corresponding to the optimal solution to the objective function give a reduced lower-dimensional representation of the graph. Nodes are then mapped to this lower-dimensional representation and closer nodes will be in the same cluster. The problem with spectral clustering methods is that the number of clusters to be found must be specified beforehand (Chung, 1997; Kannan et al., 2004; Pothen et al., 1990). They cannot be used to partition a network when the number of groups is unknown.
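The simplest spectral method splits a graph into two groups using the eigenvector of the second-smallest eigenvalue of the Laplacian (often called the Fiedler vector). The sketch below finds that vector by shifted power iteration so that it needs no numerical library; this solver, the starting vector, the iteration count, and the sign-based split are illustrative assumptions, and the sketch expects a connected graph with at least one link. The number of groups is fixed at two, which reflects the limitation noted above.

```python
import math

def fiedler_vector(weights, iterations=500):
    """weights is a symmetric matrix of link weights (0 = no link).
    Returns an approximate eigenvector for the second-smallest eigenvalue
    of the Laplacian L = D - W."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2 * max(degree)  # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        # Multiply by (shift * I - L): the small eigenvalues of L become dominant.
        y = [shift * x[i] - degree[i] * x[i]
             + sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Remove the constant vector, which belongs to eigenvalue 0 of L.
        mean = sum(y) / n
        y = [v - mean for v in y]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

def bisect(weights):
    """Split the nodes into two groups by the sign of the Fiedler vector."""
    x = fiedler_vector(weights)
    negative = [i for i, v in enumerate(x) if v < 0]
    positive = [i for i, v in enumerate(x) if v >= 0]
    return sorted([negative, positive])

# Two triangles (nodes 0-2 and 3-5) joined by the single link 2-3.
W = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
groups = bisect(W)
print(groups)
```

For larger graphs a dedicated eigenvalue solver would normally replace the power iteration used here.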
Hierarchical clustering is an alternative approach which does not require prior knowledge of the number of groups. There are two types of hierarchical clustering methods: agglomerative and divisive (Jain & Dubes, 1988; Jain et al., 1999; Johnson, 1967). These methods partition a graph into a series of nested clusters rather than a fixed number of clusters. In hierarchical clustering, link weights are often transformed into distances.
Agglomerative methods start with individual nodes, each of which is treated as a cluster. The algorithm merges two clusters into one cluster if the two clusters are closest to each other. Smaller clusters are progressively merged until all nodes in the network fall into one big cluster. These nested clusters are organized in a tree-like structure often called a dendrogram. A dendrogram can be “cut” at a specific distance level corresponding to a specific partition of the network. In contrast to agglomerative methods, divisive methods treat the whole network as one cluster at the beginning. They progressively remove the longest/weakest links until the network is dissolved into individual nodes. The most efficient hierarchical clustering algorithm runs in $O(n^2)$ time and space (Murtagh, 1984).
The disadvantage of hierarchical clustering methods is that the determination of the cut level of the dendrogram is often ad hoc and rather subjective (Jain et al., 1999).
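Agglomerative clustering and the dendrogram cut can be sketched as below. The sketch assumes single linkage (the distance between two clusters is the smallest distance between their members), which is one of several possible readings of "closest," and it uses a naive search that is slower than the $O(n^2)$ algorithm cited above. The distance matrix and the cut level are illustrative.

```python
def agglomerate(distance):
    """distance is a symmetric matrix. Returns the dendrogram as a list of
    merges (merge_distance, cluster_a, cluster_b) in the order they occur."""
    clusters = [frozenset([i]) for i in range(len(distance))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of members (assumption).
                d = min(distance[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

def cut(n, merges, level):
    """Partition obtained by applying every merge made at a distance <= level."""
    clusters = {frozenset([i]) for i in range(n)}
    for d, a, b in merges:
        if d > level:
            break  # single-linkage merge distances never decrease
        clusters -= {a, b}
        clusters.add(a | b)
    return clusters

# Link weights already transformed into distances (illustrative values).
distance = [[0, 1, 5, 6],
            [1, 0, 5, 7],
            [5, 5, 0, 2],
            [6, 7, 2, 0]]
merges = agglomerate(distance)
print(merges)
print(cut(4, merges, level=3))
```

Changing `level` changes the partition, which is exactly the subjective choice discussed above.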
Hierarchical clustering is widely used to partition weighted graphs. However, for graphs such as the World Wide Web, citation networks, and other networks where link weights are not available, the problem becomes more challenging.
o Unweighted Graph Partitioning
Three types of methods have been proposed to partition unweighted graphs: link analysis based methods, graph theoretical approaches, and hierarchical clustering.
Link analysis based methods are used in Web mining research to identify Web communities (Gibson et al., 1998; Kumar et al., 1999; Toyoda & Kitsuregawa, 2001, 2003). These methods are rooted in the HITS algorithm proposed by Kleinberg (1998). Kumar et al. (1999) propose a trawling approach to find a set of core pages containing both authoritative and hub pages for a specific topic. The core is a directed bipartite subgraph whose node set is divided into two sets with all hub pages in one set and authoritative pages in the other. The core and the other related pages constitute a Web community (Gibson et al., 1998; Toyoda & Kitsuregawa, 2001, 2003).
In addition to link analysis based approaches, graph theoretical approaches have also been used to find Web communities (Flake et al., 2000; Flake et al., 2002; Imafuji & Kitsuregawa, 2002). These approaches focus on the minimum-cut problem, which finds clusters of roughly equal sizes while minimizing the number of links between clusters. Realizing that the minimum-cut problem is equivalent to the maximum-flow problem in graph theory (Ford Jr. & Fulkerson, 1956), Flake et al. (2000) formulate the Web community identification problem as an $s$-$t$ maximum flow problem. Efficient algorithms for solving the minimum-cut problem, such as the Kernighan-Lin algorithm (Kernighan & Lin, 1970), run in $O(n^2)$ time. However, the size of the communities must be specified beforehand (Newman, 2004b).
Both link analysis based methods and graph theoretical approaches are proposed for graph partition in the Web context. They require seed nodes, i.e., the starting pages, to find Web communities. They are not appropriate for finding communities in general graphs where no seed nodes are available.
Recently, researchers have proposed a number of hierarchical clustering methods to partition unweighted networks. The GN algorithm (Girvan & Newman, 2002), for example, is a divisive clustering algorithm. When deciding which link to remove at each step the GN algorithm selects the one with the highest edge betweenness (Girvan & Newman, 2002) and iteratively removes the links with the highest betweenness. In each iteration, the betweenness of each remaining link must be recomputed. It has been shown that the algorithm is effective in identifying groups in various real networks (Girvan & Newman, 2002; Newman & Girvan, 2004; Radicchi et al., 2004). However, the algorithm is rather slow and runs in $O(m^2 n)$ time, for two reasons. First, the calculation of betweenness depends on the computation of shortest paths, which requires global traversals of a network. Second, the algorithm must recompute betweenness in every iteration. The lack of scalability severely limits the GN algorithm’s ability to partition large networks such as the World Wide Web and the Internet.
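The divisive procedure can be sketched as follows. Edge betweenness is computed with one breadth-first search per node, using Brandes-style accumulation with fractional shares when geodesics tie (an assumption, since the text does not give the computation), and it is recomputed after every removal, which is the source of the cost discussed above. Stopping once a requested number of groups is reached is a simplification chosen for the example; the names and the example network are illustrative.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of geodesics passing through each link of an undirected network."""
    score = {}
    for s in adj:
        dist = {s: 0}
        paths = {v: 0 for v in adj}
        paths[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        share = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                part = paths[v] / paths[w] * (1 + share[w])
                link = frozenset((v, w))
                score[link] = score.get(link, 0.0) + part
                share[v] += part
    return {link: b / 2 for link, b in score.items()}  # pairs counted twice

def components(adj):
    """Connected groups of nodes."""
    seen, groups = set(), []
    for s in adj:
        if s in seen:
            continue
        group, queue = {s}, deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in group:
                    group.add(w)
                    queue.append(w)
        seen |= group
        groups.append(group)
    return groups

def divisive_partition(adj, target_groups):
    """Remove the link with the highest edge betweenness until the network
    falls into target_groups groups (stopping rule is an assumption)."""
    adj = {v: set(neighbors) for v, neighbors in adj.items()}  # work on a copy
    while len(components(adj)) < target_groups and any(adj.values()):
        scores = edge_betweenness(adj)  # recomputed in every iteration
        v, w = max(scores, key=scores.get)
        adj[v].discard(w)
        adj[w].discard(v)
    return components(adj)

# Two triangles joined by the bridge 3-4 (illustrative network).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
groups = divisive_partition(adj, target_groups=2)
print(groups)
```

In this example the bridge carries all nine geodesics between the triangles, so it is removed first and the two triangles emerge as groups.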
Variants of the GN algorithm have been proposed to improve the efficiency of the algorithm. Radicchi et al. (2004) propose an alternative divisive algorithm using the edge clustering coefficient to approximate edge betweenness. Newman (2004c) proposes an agglomerative approach that is based on a measure called modularity. The modularity of a network indicates how much the graph structure deviates from a random graph, in which no group exists. In each iteration the algorithm seeks a pair of clusters whose merge results in the largest increase or smallest decrease in the value of modularity. Although they are faster than the GN algorithm, the time complexities of the two new algorithms still scale with $m^2$. Details of these algorithms will be provided in Chapter 5.
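The modularity of a given partition can be computed directly. The text does not state the formula, so the sketch assumes the commonly used form $Q = \sum_c (e_{cc} - a_c^2)$, where $e_{cc}$ is the fraction of links that fall inside group $c$ and $a_c$ is the fraction of link ends attached to nodes of group $c$; the function name and the example network are illustrative.

```python
def modularity(links, groups):
    """links is a list of undirected (v, w) pairs; groups is a list of node sets
    that together cover every node. Returns the modularity of the partition."""
    group_of = {v: g for g, members in enumerate(groups) for v in members}
    m = len(links)
    inside = [0] * len(groups)  # links with both ends in the group
    ends = [0] * len(groups)    # link ends attached to the group
    for v, w in links:
        ends[group_of[v]] += 1
        ends[group_of[w]] += 1
        if group_of[v] == group_of[w]:
            inside[group_of[v]] += 1
    return sum(inside[g] / m - (ends[g] / (2 * m)) ** 2
               for g in range(len(groups)))

# Two triangles joined by the link 3-4 (illustrative network).
links = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

q_split = modularity(links, [{1, 2, 3}, {4, 5, 6}])
q_whole = modularity(links, [{1, 2, 3, 4, 5, 6}])
q_poor = modularity(links, [{1, 4}, {2, 5}, {3, 6}])
print(q_split, q_whole, q_poor)
```

The natural split into the two triangles scores highest, a single all-inclusive group scores zero, and a partition that cuts across the triangles scores below zero.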
• Modeling Between-group Relationships
After a network is partitioned into groups, the between-group relationships become composites of links between individual nodes. In SNA, a positional analysis method called blockmodeling is often used to discover the overall structure of a social network (White et al., 1976).
Blockmodeling identifies between-group relationships and interaction patterns after network partition. However, rather than being partitioned into subgroups, the network is clustered into positions based on a structural equivalence measure (Lorrain & White, 1971; Wasserman & Faust, 1994). Two nodes are structurally equivalent if they have identical links to and from other nodes. A position thus is a collection of nodes that are structurally substitutable, or in other words, similar in social activities, status, and connections with other members. Position is different from the concept of subgroup in relational analysis because two network members who are in the same position need not be directly connected (Lorrain & White, 1971; Scott, 1991).
Although it is a positional analysis, blockmodeling can be used to model relationships between subgroups (Xu & Chen, Forthcoming; Xu & Chen, 2005). Given subgroups in a network, blockmodel analysis determines the presence or absence of a relationship between two subgroups based on the link density (Wasserman & Faust, 1994). When the density of the links between the two subgroups is greater than a predefined threshold value, a between-group relationship is present, indicating that the two subgroups interact with each other constantly and thus have a strong relationship. By this means, blockmodeling summarizes individual relational details into relationships between groups so that the overall structure of the network becomes more prominent.
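The density test between subgroups can be sketched as follows for an undirected, unweighted network. Density is taken here as the number of observed links divided by the number of possible links between (or within) the groups, and the threshold of 0.3 is an arbitrary illustrative value; both are assumptions, as are the example groups and links.

```python
def blockmodel(links, groups, threshold):
    """Return (density, image). density[a][b] is the link density between
    groups a and b; image[a][b] is 1 when that density exceeds the threshold."""
    group_of = {v: g for g, members in enumerate(groups) for v in members}
    k = len(groups)
    counts = [[0] * k for _ in range(k)]
    for v, w in links:
        a, b = group_of[v], group_of[w]
        counts[a][b] += 1
        if a != b:
            counts[b][a] += 1  # undirected: record the link for both groups
    density = [[0.0] * k for _ in range(k)]
    image = [[0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            if a == b:
                possible = len(groups[a]) * (len(groups[a]) - 1) / 2
            else:
                possible = len(groups[a]) * len(groups[b])
            density[a][b] = counts[a][b] / possible if possible else 0.0
            image[a][b] = 1 if density[a][b] > threshold else 0
    return density, image

# Illustrative network: two triangles and a pair, with a single bridge between
# the triangles and four links between the first triangle and the pair.
groups = [{1, 2, 3}, {4, 5, 6}, {7, 8}]
links = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (7, 8),
         (3, 4), (1, 7), (1, 8), (2, 7), (2, 8)]
density, image = blockmodel(links, groups, threshold=0.3)
for row in image:
    print(row)
```

The single bridge between the two triangles falls below the threshold, so no between-group relationship is reported for them, while the first triangle and the pair are reported as related.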
2.3.1.3 Extracting Topological Properties
Recent years have witnessed an increasing interest in the topological properties of large-scale networks such as the World Wide Web (Broder et al., 2000), metabolic pathways (Jeong et al., 2000), food webs (Garlaschelli et al., 2003), citation networks (Hajra & Sen, 2005), and collaboration networks (Newman, 2001b; Watts & Strogatz, 1998), among many others. This new interest in the statistical properties of networks has two primary causes. First, data collection and analysis of extremely large networks have become possible due to greatly improved computing power. The portions of the World Wide Web studied, for example, have contained up to several million nodes (Lawrence & Giles, 1999). Second, the recently proposed small-world and scale-free network models (Barabási & Albert, 1999; Watts & Strogatz, 1998) have motivated scientists to search for the universal organizing principles that may be responsible for the commonality observed in a range of networks. These commonalities are found by categorizing, comparing, and contrasting the networks’ topological properties (Albert & Barabási, 2002) using two categories of statistics: general statistics and topology characterizing statistics.
• General Statistics
These statistics are intended to capture the size and scale of a network regardless of its specific structure. They include the number of nodes or network size, the number of links, and several others. Table 2.1 provides a relatively complete inventory of the statistics often found in network topology studies. The size of a network is a direct indicator of the complexity of a network. Networks that have been studied range from food webs consisting of a few hundred nodes (Solé & Montoya, 2001) to scientific collaboration networks consisting of millions of authors and papers (Newman, 2001a, 2004a). The giant component is the largest connected component in a network (Bollobás, 1985). Most giant components have been found to contain more than 70% of the nodes in various networks (Newman, 2001a). The average degree of a network is the average number of links an arbitrary node has and is defined as $\langle k \rangle = m/n$. The density of a network is the number of links that are actually present divided by the possible number of links in a network (Wasserman & Faust, 1994). The density of an undirected network thus is $d = \frac{m}{n(n-1)/2}$. Sparse networks have low densities. The diameter of a network (Wasserman & Faust, 1994) is the length of the longest shortest path in the network.
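As a rough illustration, the general statistics described above can be computed for a small undirected network as sketched below. The function names and the toy network are hypothetical. Two choices are assumptions rather than definitions taken from the text: the diameter is measured inside the giant component (it is undefined for a disconnected network), and the average degree follows the $m/n$ form given above (the mean of the node degrees in an undirected network would be $2m/n$). The sketch assumes at least two nodes.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def general_statistics(nodes, links):
    """Compute n, m, S, s, <k>, d, and D for an undirected network."""
    adj = {u: set() for u in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    n, m = len(nodes), len(links)
    # Find the giant (largest connected) component by repeated BFS.
    seen, giant = set(), set()
    for u in nodes:
        if u not in seen:
            component = set(bfs_distances(adj, u))
            seen |= component
            if len(component) > len(giant):
                giant = component
    # Longest shortest path, measured within the giant component (assumption).
    diameter = max(max(bfs_distances(adj, u).values()) for u in giant)
    return {
        "n": n,
        "m": m,
        "S": len(giant),
        "s": len(giant) / n,
        "avg_degree": m / n,            # <k> = m/n as defined in the text
        "density": m / (n * (n - 1) / 2),
        "D": diameter,
    }

# Hypothetical toy network: a four-node chain plus one isolated node.
stats = general_statistics(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("c", "d")],
)
print(stats)
```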
Category                            Statistics                                     Symbol
General statistics                  Number of nodes, network size                  n
                                    Number of links                                m
                                    Number of nodes in the giant component         S
                                    Percentage of nodes in the giant component     s
                                    Average degree                                 <k>
                                    Density                                        d
                                    Largest shortest path length, diameter         D
Topology characterizing statistics  Average shortest path length                   L
                                    Clustering coefficient                         C
                                    Degree distribution                            P(k)
Table 2.1: The statistics for network topology.
• Topology Characterizing Statistics
Three special statistics are used to categorize the topology of a network and to distinguish among the random network (Erdös & Rényi, 1960), the small-world network (Watts & Strogatz, 1998), and the scale-free network (Barabási & Albert, 1999). The three statistics are average shortest path length, (vertex) clustering coefficient, and degree distribution. As mentioned in Section 2.1, random networks are characterized by small shortest path length, low clustering coefficient, and Poisson degree distribution with a single characterizing degree, $\langle k \rangle$. A small-world network is different from random networks due to its high tendency to form clusters and groups. The small shortest-path length together with the high clustering coefficient of small-world networks reflects the six degrees of separation phenomenon (Milgram, 1967). The distinctive characteristic of the scale-free network is its power-law degree distribution, which is skewed toward small degrees and has a long flat tail for large degrees. Networks of different types and sizes have been found to be strikingly similar in their topologies and to have both small-world and scale-free properties (Albert & Barabási, 2002). These findings lead to a conjecture that networks in nature and society are governed by a universal self-organizing principle (Albert & Barabási, 2002).
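Two of the three characterizing statistics can be sketched in a few lines. The example below is illustrative only. It assumes the common definition of the clustering coefficient as the average, over all nodes, of the fraction of a node's neighbor pairs that are themselves linked (nodes with fewer than two neighbors contribute zero), and it estimates the degree distribution as the fraction of nodes having each degree. The function names and the toy network are hypothetical.

```python
from collections import Counter
from itertools import combinations

def clustering_coefficient(adj):
    """Average local clustering coefficient C of an undirected network."""
    total = 0.0
    for u, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue  # local coefficient taken as 0 when degree < 2
        linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        total += linked / (k * (k - 1) / 2)
    return total / len(adj)

def degree_distribution(adj):
    """Empirical P(k): the fraction of nodes that have degree k."""
    counts = Counter(len(neighbors) for neighbors in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy network: a triangle (a, b, c) with a pendant node d on c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
C = clustering_coefficient(adj)
P = degree_distribution(adj)
print(C, P)
```

A Poisson-like $P(k)$ concentrated around $\langle k \rangle$ suggests a random network, whereas a heavy tail on a log-log plot of $P(k)$ suggests a scale-free one.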
Static structure mining provides a means of discovering structural patterns in networks. However, networks are not static but constantly change. How to reveal the dynamics of networks and the evolutionary mechanisms leading to a certain topology is the focus of dynamic structure mining. The advantage of dynamic structure mining is its ability to explain and predict the structure of networks (Albert & Barabási, 2002; Doreian & Stokman, 1997).
Networks are subject to all kinds of changes and dynamics in their nodes and links. New nodes may be added to the system and old nodes may be removed. New links may be formed between old nodes or between old and new nodes. Understanding the dynamics and the process of evolution in networks is of vital practical importance. The evolutionary mechanisms lead to a specific type of network topology, which has a direct impact on the function of a system. For example, it is found that protein interaction networks in cells are scale-free networks. That is, a small percentage of hub proteins mediate the interactions with the rest of the proteins. Such a topology is critical to the survival of a cell because it is rather robust against random attacks (Jeong et al., 2001). How a cell evolves into such a structure can be the key to developing effective means to protect healthy cells or attack harmful cells such as cancer cells. Existing dynamic mining approaches can be divided into descriptive and modeling approaches.
2.3.2.1 Describing Structural Dynamics
Descriptive approaches are aimed at capturing and observing the changes in a network over time using a set of topological statistics.
• Changes in General Statistics
General statistics such as those listed in Table 2.1 are often measured at different points in time. The changes observed are then plotted with respect to time in order to examine the dynamic patterns. For example, Barabási et al. (2002) study the evolution of the scientific collaboration networks in mathematics and in neuroscience over the period 1991-1998. Based on coauthorship information from papers published in journals, they analyze the patterns of changes in the number of papers, number of authors (network size), average degree, and the relative size of the giant component in the network. They find that the networks are growing in that $n$, $s$, and $\langle k \rangle$ all increase over time. Other studies that use general statistics of networks can be found in (Csányi & Szendroi, 2004; Hajra & Sen, 2005).
• Changes in Characterizing Statistics
This type of statistics can be used to distinguish between different topologies. It is found in (Barabási et al., 2002) that both the clustering coefficient and the average path length of the scientific collaboration networks decrease over time, and that the degree distribution follows a power law. The decreasing $L$ deviates from existing models, which predict that $L$ scales with $n$. This might be due to the addition of internal links, which act as shortcuts between distant parts of the network, and to the limited time window of the data set (Barabási et al., 2002).
Modeling usually follows the descriptive analysis in an attempt to explain the observed patterns of dynamics using certain mechanisms.
2.3.2.2 Modeling Structural Dynamics
Modeling approaches are aimed at explaining the emergence of specific types of network topology (random, small-world, or scale-free) based on microscopic mechanisms. Presently, the research focus is primarily on the evolution process of scale-free topology, for three reasons. First, the degree distribution of scale-free networks deviates significantly from the Poisson distribution (Albert & Barabási, 2002). Second, the scale-free topology has been shown to be robust to random failures but vulnerable to targeted attacks (Albert et al., 2000). Third, scale-free topology can facilitate efficient resource transmission (Toroczkai & Bassler, 2004). The evolution of scale-free topology thus is particularly interesting because the structures of many real networks ranging from the Internet to gene-protein interaction networks are scale-free (Faloutsos et al., 1999; Garlaschelli et al., 2003; Jeong et al., 2000; Newman, 2004a). The core research question is: what are the mechanisms responsible for the power-law distribution in degree (Albert & Barabási, 2002)?
Several mechanisms, such as growth (Barabási & Albert, 1999), preferential attachment (Barabási & Albert, 1999), competition (Bianconi & Barabási, 2001), and individual preference (Menczer, 2004; Pennock et al., 2002), have been proposed to explain the emergence of scale-free topology in real networks. To examine the role of these mechanisms in the evolution of scale-free networks, researchers have employed simulation and analytical approaches.
• Simulation Approaches
With simulation approaches, a network evolves as new nodes and links are added to the network over time. The mechanisms are incorporated into the evolution process by controlling which two nodes are selected for a newly added link. In the basic evolution model proposed by Barabási and Albert (1999), for example, the evolution starts with a small number, say $m_0$, of nodes. At each time step, a new node is added to the system. The new node is allowed to link to $m$ ($m \le m_0$) different nodes that are already in the network.
When choosing the target nodes to link to, the new node makes a decision based on how many links the target nodes have. Therefore, the more links a node has, the more likely it is to be linked to by the new node. This preferential attachment mechanism thus leads to the rich-get-richer phenomenon, manifesting the scale-free topology. In the fitness model, which considers the competition effect (Bianconi & Barabási, 2001), the target nodes are selected based not only on the number of their links but also on their intrinsic abilities to attract links. A Web page with high-quality content thus may quickly attract much attention although it does not have many in-links initially. The resulting network has a different topology than the scale-free one and contains a few stars that connect to almost every node in the network, a phenomenon described as winners-take-all (Pennock et al., 2002).
The simulation approach helps observe and demonstrate the evolution of a network. However, the simulation approach lacks generalizability.
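The growth and preferential attachment steps of the basic evolution model can be sketched as a short simulation. This is an illustrative sketch under stated assumptions rather than the original authors' code. Because the $m_0$ seed nodes start with no links, the first new node is assumed to choose its targets uniformly at random among them; every later node chooses targets with probability proportional to their current degree. The function name and parameter defaults are hypothetical.

```python
import random

def barabasi_albert(n, m, m0=None, seed=None):
    """Grow an n-node network by preferential attachment; return its links."""
    rng = random.Random(seed)
    m0 = m if m0 is None else m0
    if not 1 <= m <= m0 < n:
        raise ValueError("need 1 <= m <= m0 < n")
    links = []
    # A node appears in `pool` once per link end, i.e. degree-many times,
    # so a uniform draw from `pool` is a degree-proportional draw of a node.
    pool = []
    for new in range(m0, n):
        if not pool:
            # Seed nodes have no links yet: choose uniformly (assumption).
            targets = set(rng.sample(range(m0), m))
        else:
            targets = set()
            while len(targets) < m:  # m different existing nodes
                targets.add(rng.choice(pool))
        for t in sorted(targets):
            links.append((new, t))
            pool.extend((new, t))
    return links

links = barabasi_albert(n=200, m=3, seed=1)
degree = {}
for u, v in links:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(links), max(degree.values()))
```

Running the sketch with larger `n` and plotting the degree counts on log-log axes is one way to observe the rich-get-richer effect: a few early nodes accumulate far more links than the typical node.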
• Analytical Approaches
Analytical approaches seek the general solution to a problem and often require the formal definition of a problem and various assumptions. Using mean-field theory, Barabási et al. (1999) derive the functional form of the power-law distribution of scale-free networks and claim that regardless of the network size the exponent of the power law is 3. In the fitness model, instead, the exponent is a function of the fitness of a node (Bianconi & Barabási, 2001). Nodes with higher fitness scores will acquire links at higher speeds than nodes with lower fitness scores. The resulting degree distribution is a weighted sum of a spectrum of power-law distributions.
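The mean-field argument for the basic model can be summarized as follows. This is a condensed sketch that treats the degree $k_i$ of node $i$ as a continuous variable; $t_i$, the time step at which node $i$ joined, is notation introduced here, and the total degree in the network at time $t$ is approximated by $2mt$.

```latex
\frac{\partial k_i}{\partial t}
  = m \, \frac{k_i}{\sum_j k_j}
  = \frac{k_i}{2t}
\quad\Longrightarrow\quad
k_i(t) = m \left( \frac{t}{t_i} \right)^{1/2},
\qquad
P(k) \approx \frac{2m^{2}}{k^{3}} \;\propto\; k^{-3} .
```

The exponent 3 does not depend on $m$ or on the network size, which is the sense in which the result is independent of scale.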
The research on network dynamics is a recent development and fairly new compared with static research. More innovative approaches and models are expected to be added to this line of research in the near future.
The computational framework presented in this chapter provides a guideline for network structure mining. In Chapters 3-7, I will present a series of case studies that demonstrate how static and dynamic structural patterns can be mined from various networks ranging from criminal networks to patent citation networks using the technologies reviewed in this chapter.
As discussed in Chapters 1 and 2, networks can be viewed as collections of resources. Important relations and relational paths are critical resources that may reveal important structural information about the network. In this chapter I propose a graph-theoretical approach to locate important relations between criminals in criminal networks (Xu & Chen, 2004). The objective is to support knowledge management and decision making in the law enforcement domain to help fight organized crime.
Organized crimes such as terrorism, narcotics violations, armed robbery, and kidnapping often involve multiple offenders who are connected through various relationships (e.g., kinship, friendship, coworkers, or business associates) (Harper & Harris, 1975). These criminals can be treated as a network in which they interact and play different roles in illegal activities (McAndrew, 1999). For instance, a narcotics network may consist of interrelated criminals who are responsible for handling the supply, distribution, sale, and smuggling of drugs, or even money laundering. Members in a terrorist network may have shared religious beliefs or attended terrorist training together previously so that they trust each other and cooperatively plan and commit terrorist attacks (Krebs, 2001). In a broader sense, a criminal network may be composed of a variety of entities (e.g., organizations, locations, vehicles, weapons, properties, bank accounts, etc.) in addition to persons. Learning relations between these entities is a critical part of uncovering criminal activities and fighting crimes. To achieve this goal, crime investigators often employ a method called link analysis (Coady, 1985; Harper & Harris, 1975; Sparrow, 1991), which can help generate investigative leads and uncover missing information that may be buried in a criminal network. In a narcotics network, for example, link analysis may reveal that a group of offenders actually belong to the same drug supply chain. In a homicide case, link analysis may find “hidden”, intermediate persons connecting the victim with the suspect who denies knowing the victim. Note that the concept of link analysis here is not the same as “link analysis” in Web mining. It refers to the task of identifying criminal relations in the specific context of crime fighting.
Link analysis usually consists of two major tasks: (1) extracting information about entity relations from raw data (e.g., telephone records, surveillance logs, and crime reports) and constructing a network representation, and (2) identifying relations between seemingly unrelated entities in a network. Both tasks can be very time-consuming and labor-intensive. Current link analysis practice in law enforcement is mainly an ad hoc manual process. To solve a crime, investigators may spend a large amount of time performing extensive database searches, reading crime reports, and looking for clues of criminal relations. Although some software packages have been labeled as “link analysis tools”, they provide only visual representations of criminal networks and are “still not doing the analysis” (Sparrow, 1991). Because of these problems, link analysis is used only for high-profile cases. Effective and efficient link analysis techniques are needed to help fight crime (McAndrew, 1999).
To address the lack-of-technique problem, I propose using a graph-theoretical approach, namely two variations of the classical shortest-path algorithm (Dijkstra, 1959), for link analysis. The evaluation studies assess both the effectiveness and efficiency of the proposed algorithms. The effectiveness issue concerns whether relation paths found by the proposed algorithms are more useful for uncovering investigative leads than those found by a modified Breadth-First Search (BFS) algorithm. The modified BFS algorithm to a large extent simulates the manual approach to relation search used by crime investigators and was used as a benchmark technique for effectiveness comparison. The efficiency issue concerns which shortest-path algorithm is faster in which type of network.
The rest of the chapter is organized as follows. Section 3.2 reviews the literature on link analysis and the shortest-path algorithms. Section 3.3 presents the modified BFS algorithm. The two proposed shortest-path algorithms are introduced in Section 3.4. Evaluation and results are presented and discussed in Section 3.5. In Section 3.6 I conclude the chapter and suggest directions for future work.
In this section I review network construction techniques proposed in previous research and existing link analysis tools. I then introduce the algorithms for computing shortest paths in a graph.
3.2.1.1 Network Construction
To enable link analysis, an indispensable task is to extract information about entities and their relations from large amounts of raw data and convert the information into a network representation. Usually entities are represented by nodes and relations between them are represented by links in a network. Different network construction methods may be needed, depending on whether the raw data are structured database records or unstructured textual documents.
Several techniques have been developed for constructing network representations of structured data records. For example, Goldberg and Senator (1998) suggested that consolidation and link formation operations be performed on transactional data records during investigations of financial crimes. Consolidation is a process of “disambiguating and combining identification information into a unique key which refers to specific individuals” (Goldberg & Senator, 1998). Links or relations between consolidated individuals are formed based on a set of heuristics such as whether the individuals have shared addresses, shared bank accounts, or related transactions. This technique has been employed by the U.S. Department of the Treasury to detect money laundering transactions and activities (Goldberg & Wong, 1998). A different network construction method used by COPLINK Detect (Hauck et al., 2002) is based on the concept space approach developed by Chen and Lynch (1992). A concept space can be treated as a network in which nodes represent domain-specific concepts and links represent weighted co-occurrence relations between concepts (Hauck et al., 2002). In COPLINK Detect, nodes are records of entities (persons, organizations, vehicles, and locations) stored in crime databases. In such a network, a relation exists between a pair of entities if they appear together in the same criminal incident. The more frequently they occur together, the stronger the relation. The concept space approach is primarily a statistic-based approach and differs from the heuristic-based one in (Goldberg & Senator, 1998).
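The co-occurrence idea can be illustrated with a small sketch. It uses raw co-occurrence counts as link strengths, which is a simplifying assumption: the weighting function actually used by the concept space approach is not reproduced here. The function name, the entity labels, and the incident records are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(incidents):
    """Map each unordered entity pair to the number of incidents they share."""
    counts = Counter()
    for entities in incidents:
        # Sort so that (a, b) and (b, a) are counted under the same key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical incident records, each listing the entities involved.
incidents = [
    ["person:Smith", "person:Jones", "vehicle:ABC123"],
    ["person:Smith", "person:Jones"],
    ["person:Jones", "location:5th St"],
]
net = cooccurrence_network(incidents)
for pair, count in net.most_common():
    print(pair, count)
```

Here the two persons appear together in two incidents and therefore receive the strongest link, matching the intuition that more frequent co-occurrence indicates a stronger relation.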
Some other techniques can build networks based on information extracted from unstructured data or textual documents. Lee (1998) developed a technique to construct criminal networks from free texts. This approach can extract entities and events from textual crime reports by applying a large collection of predefined patterns. Relations among extracted entities and events are formed using relation-specifying words and phrases. For example, the phrase “member of” indicates an entity-to-entity relation between an individual and an organization; the word “arrest” may suggest an entity-to-event relation between an individual and an arrest event. This approach relies heavily on a fixed set of predefined patterns and rules and thus has a limited scope of application.
The concept space approach (Chen & Lynch, 1992; Hauck et al., 2002), as mentioned earlier, can also be used to construct networks from textual documents. Instead of using structured data from databases, it uses noun phrases extracted from crime reports as entities to build a criminal network. A relation or co-occurrence relationship exists between a pair of entities as long as they appear together in the same report. However, the noun phrases extracted may not necessarily be the entities that interest the crime investigators. The success of this type of network construction approach depends, to a large extent, on the development of named-entity extraction techniques (Chinchor, 1998), which automatically identify in text documents the names of entities of interest, such as date, time, number expression, person, location, and organization (Chau et al., 2002; Chinchor, 1998).
3.2.1.2 Link Analysis Tools
In addition to network construction, another important link analysis task is searching for possible relations between entities. However, most existing link analysis tools can only visualize criminal networks and do not offer much help with relation search. This section provides a review of existing link analysis tools.
The earliest link analysis tool is the Anacapa charting system (Harper & Harris, 1975), which has been used extensively in law enforcement since its introduction. Based on human-extracted relation information, the system can generate a two-dimensional visual representation of a network with different symbols representing different types of entities. However, this tool does not facilitate relation search, and an investigator must manually examine the network display to find relation paths between entities or confirm initial suspicions about specific suspects (Sparrow, 1991). Other link analysis tools such as Netmap (Goldberg & Wong, 1998) and Analyst’s Notebook (Klerks, 2001) are also designed for network visualization rather than for relation search.
A link analysis tool called Watson (Anderson et al., 1994) can search and identify direct relations between entities by querying databases. Given a specific entity such as a person’s name, Watson automatically forms a query to search for other records that are related to the person. For example, an analyst may want to find out who is related to a kidnapped child. The related records found by Watson, which may include the child’s relatives, friends, or other acquaintances, will be linked to this child and presented in a link chart. COPLINK Detect (Hauck et al., 2002) can also be treated as a link analysis tool that provides direct relation search functionality.
In the next section I review shortest-path algorithms, which I propose to address the problem of identifying the strongest relations between entities that are not directly related. Although these algorithms have been studied and employed widely in other domains, their importance and relevance to link analysis have not yet been recognized in law enforcement.
Shortest-path algorithms are a type of graph search algorithm. They can identify the optimal paths between nodes in a graph (i.e., a network) by examining link weights. Conventional shortest-path algorithms have been used in many applications such as robot motion planning (Asano et al., 2002), computer network routing (Perkins & Bhagwat, 1994), transportation and traffic control (Wang & Crowcroft, 1992), critical path computation in PERT charts, etc. Recently, a neural network approach in artificial intelligence has been proposed for shortest-path computation (Ali & Kamoun, 1993; Araujo et al., 2001). In this section I review the conventional approaches and briefly introduce the neural network approach.
The Dijkstra algorithm (Dijkstra, 1959) is the classical method for computing the shortest paths from a single source node to every other node in a weighted graph. Most other algorithms for solving this problem are based on this algorithm but have improved data structures for implementation (Evans & Minieka, 1992). For example, the Priority-First Search (PFS) algorithm (Cormen et al., 1991) is faster than the Dijkstra algorithm because of the use of a priority queue.
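The single-source computation with a priority queue can be sketched as follows. This is a generic textbook-style version using Python's binary heap, offered as an illustration rather than as the PFS implementation evaluated later in this chapter. It assumes nonnegative link lengths; the function name and the toy graph are hypothetical.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances and parent pointers from source.

    adj[u] maps each neighbor v of u to the nonnegative length of link (u, v).
    """
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]   # priority queue keyed on tentative distance
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue         # stale queue entry; u was already finalized
        done.add(u)
        for v, length in adj.get(u, {}).items():
            nd = d + length
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent

# Hypothetical toy graph with directed, weighted links.
adj = {
    "s": {"a": 1, "b": 4},
    "a": {"b": 2, "t": 6},
    "b": {"t": 1},
    "t": {},
}
dist, parent = dijkstra(adj, "s")
print(dist["t"], parent["t"])
```

The priority queue always yields the unfinished node with the smallest tentative distance, which avoids scanning all nodes at every step.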
Unlike the classical Dijkstra algorithm, the two-tree Dijkstra algorithm computes the shortest path from a single source node to a single destination node, rather than to every other node in a graph. Previous studies have demonstrated that the two-tree Dijkstra algorithm can be much faster than the Dijkstra algorithm. According to Helgason et al. (1993), in most cases the Dijkstra algorithm generated a shortest-path tree containing approximately 50% of the nodes in a graph before the shortest path between a source node and a destination node was found. Shortest-path trees generated by the two-tree Dijkstra algorithm, in contrast, contained only 6% of the nodes in the graph. This might save a substantial amount of computational time.
Some researchers have proposed neural network approaches to solving the shortest-path problem. Araujo et al. (2001) extended Ali and Kamoun’s study (1993) and applied a two-layer Hopfield net to the shortest-path problem. In their Hopfield net, each neuron corresponds to a link in a graph. The value of a neuron is 1 if the link it represents participates in the shortest path and 0 otherwise. It has been found that the two-layer Hopfield net could be faster than conventional shortest-path algorithms because of its parallel architecture. However, these proposed Hopfield net approaches work only for networks of small size (e.g., 40 in (Araujo et al., 2001)).
In summary, previous studies have proposed some techniques for network construction in link analysis. However, little research has been done to address the relation search problem. Specifically, an effective and efficient link analysis technique is needed to find relation paths between two or more source entities not directly related. Moreover, the paths found should reveal strong relations between entities so that important investigative leads can be uncovered. I propose to use the shortest-path algorithms to achieve this goal.
To compare the proposed algorithms with current link analysis practices, in my pilot study I recorded and analyzed the relation search processes of crime investigators experienced in link analysis. I found that the typical relation search approach can be described as a breadth-first search (Cormen et al., 1991). However, such an approach cannot guarantee finding the strongest relations between entities and thus may not successfully generate investigative leads. In the next section I present the modified BFS algorithm, which simulates the typical relation search.
Since existing link analysis tools are limited to direct relation search, crime investigators must explore links manually when they have entities that are not directly related. I found that a typical search starts with a single source entity and incrementally builds up a relation path during link exploration. For example, a crime investigator may need to find relations between two seemingly unrelated drug offenders. In this case, the crime investigator may start with one offender’s name and use a link analysis tool to find all entities that are associated with the offender in previous crimes. By reading each crime report, the investigator can determine whether a link is useful for generating a new lead to connect the two offenders. He then selects those useful links and does further searches, in which entities associated with the newly selected entities from the previous round are examined. He keeps exploring new entities until a relation path is found that connects the two offenders.
Such a search process is very similar to a graph traversal algorithm called Breadth-First Search (BFS) (Cormen et al., 1991), except that an investigator may consider link usefulness during exploration. Given a weighted directed graph $G = (N, A)$, a nonnegative number, $l_{ij}$, is used to represent the weight of the link $(i, j) \in A$. Each node $u \in N$ has an incoming link set, $In(u)$, and an outgoing link set, $Out(u)$. Since the criminal networks are undirected graphs, $In(u) = Out(u)$.
Starting at a source node $s$, BFS can find paths leading to a target node $t$. It works by maintaining a traversal tree $T$ rooted at the node $s$. In this tree, the child nodes of a specific node $u$ are $u$’s outgoing neighbors in the graph $G$. Initially $T$ contains only $s$. The algorithm then collects all the outgoing neighbors of $s$ in $G$ and sets them as the child nodes of $s$. For each child node of $s$, the algorithm further finds its children and adds them to the tree. This procedure is repeated until the target node $t$ is reached. The time complexity of a BFS algorithm is $O(n + m)$ (Cormen et al., 1991).
As indicated earlier, a crime investigator may not explore all entities associated with a specific entity but selects only those having strong relations. I therefore modified the BFS algorithm so that when it finds the children of a node, it selects only those neighbors that have a link weight greater than a predefined threshold value. The modified BFS algorithm is presented in Figure 3.1.
Modified BFS algorithm
// This modified BFS algorithm computes the paths from the first node in K to every other node in K.
// K may contain multiple source nodes
Begin
    Initialize:
        s = the 1st element of K;
        p_s = s;                          // p_i is the parent node of i
        p_i = 0 for all i ∈ N, i ≠ s;
        L_0 = {s};                        // L_i stores the current nodes
    while (L_i ≠ Ø)
        L_{i+1} = Ø;                      // L_{i+1} stores the child nodes of the current nodes in L_i
        for each u ∈ L_i do
            // Explore a link only if its length is less than the threshold value 1, which
            // corresponds to link weight of 0.5 in the original, untransformed graph
            for each (u, v) ∈ Out(u) such that l_uv < 1 do
                if v ∉ T then
                    // Include v into the tree and set u as the parent of v
                    T = T ∪ {v};
                    p_v = u;
                    L_{i+1} = L_{i+1} ∪ {v};
                    if v ∈ K, then K = K \ {v};
                    if (K = Ø) break;     // Stop when all source nodes in K are included in the tree
            end
        end
        i = i + 1;
    endwhile;
end.
Figure 3.1: The modified BFS algorithm.
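A runnable sketch of the procedure in Figure 3.1 is given below. It is an illustration under stated assumptions, not the code used in the evaluation: a single first-in-first-out queue replaces the explicit level sets (the visiting order is the same), the stopping test is made after each node has been fully expanded, and the length threshold is exposed as the hypothetical parameter `max_length`. The helper `path_to` and the toy network are also hypothetical.

```python
from collections import deque

def modified_bfs(out_links, sources, max_length=1.0):
    """Thresholded BFS from sources[0]; returns the parent of each tree node.

    out_links[u] maps each neighbor v of u to the link length l_uv.
    """
    s = sources[0]
    remaining = set(sources) - {s}   # source entities not yet in the tree
    parent = {s: s}                  # membership in `parent` plays the role of T
    queue = deque([s])
    while queue and remaining:
        u = queue.popleft()
        for v, length in out_links.get(u, {}).items():
            # Explore a link only if it is short (strong) enough.
            if length < max_length and v not in parent:
                parent[v] = u
                queue.append(v)
                remaining.discard(v)
    return parent

def path_to(parent, target):
    """Follow parent pointers back to the root; None if target was not reached."""
    if target not in parent:
        return None
    path = [target]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path[::-1]

# Hypothetical undirected network, stored with both link directions.
out_links = {
    "A": {"B": 0.3, "D": 1.5, "C": 0.2},
    "B": {"A": 0.3, "D": 0.4},
    "C": {"A": 0.2},
    "D": {"A": 1.5, "B": 0.4},
}
parent = modified_bfs(out_links, ["A", "D"])
print(path_to(parent, "D"))
```

In the toy network the direct link between A and D is longer than the threshold and is therefore ignored, so the search connects the two entities through B instead.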
Notice that multiple paths may exist between the source entities $s$ and $t$. BFS simply finds one such path and is not guaranteed to identify the strongest relations between source entities. This suggests that the shortest-path algorithms may be a better option.
To find the strongest relations between two or more source entities I propose to employ conventional shortest-path algorithms. However, to apply the algorithms, a network representation transformation must be made.
In the criminal networks, the strength of a relation between two directly connected nodes is represented by their link weight, which is a number between zero and one. A link weight can be treated as a probability measure indicating how likely it is that two nodes are related. In general, the probability of a set of mutually independent events occurring together is the product of the probabilities of the individual events. Therefore, if two nodes are not connected directly but by a path consisting of a sequence of intermediate links, the strength of the relation between these two nodes should be the product of the weights of these intermediate links. For example, if node A and node C are connected through node B, and the weights of the intermediate links (AB) and (BC) are 0.5 and 0.8, respectively, then the weight of the path (ABC) would be 0.4. To find the strongest relation between a pair of nodes, therefore, is to find the path with the largest weight product. Figure 3.2 presents an illustrative example.
In this figure, the number beside each link is that link’s weight or relation strength. Two paths, (ABCD) and (AED), exist between the source node A and the destination node D. The relation strength of path (ABCD) is 0.28 (0.5 × 0.8 × 0.7), and the relation strength of path (AED) is 0.24 (0.8 × 0.3). Therefore, path (ABCD) represents a stronger relation between node A and node D than path (AED).
[Figure: a graph with five nodes and weighted links AB (0.5), BC (0.8), CD (0.7), AE (0.8), and ED (0.3).]
Figure 3.2: Two indirectly connected nodes (A and D).
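The arithmetic of the example can be checked with a few lines of code. The assignment of weights to individual links is read off the products quoted in the text; the function name is hypothetical.

```python
from math import prod

# Link weights of the example network, as implied by the products in the text.
weights = {
    ("A", "B"): 0.5, ("B", "C"): 0.8, ("C", "D"): 0.7,
    ("A", "E"): 0.8, ("E", "D"): 0.3,
}

def path_strength(path, weights):
    """Relation strength of a path: the product of its link weights."""
    return prod(weights[(u, v)] for u, v in zip(path, path[1:]))

strength_abcd = path_strength(["A", "B", "C", "D"], weights)  # about 0.28
strength_aed = path_strength(["A", "E", "D"], weights)        # about 0.24
print(round(strength_abcd, 2), round(strength_aed, 2))
```

The longer path has the larger product here, which shows that the strongest relation is not necessarily the one with the fewest intermediate links.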
Although the shortest-path algorithms can identify the optimal path between a pair of nodes, they cannot be used directly to identify the strongest relation between the two nodes. This is because of the following two representation problems:
(a) In a general weighted graph, the weight of a link represents the distance or cost of traveling from one end of the link to the other. Therefore, a low weight is preferred to a high weight. However, a link weight in a criminal network is an
indicator of how strongly the two nodes are related to each other. Thus, a high weight is preferred to a low weight.
(b) The shortest path is often computed based on the minimum total weight, which is the sum of the weights of the links along this path. However, my objective is to find a path with the maximum weight product.
In order to address the two representation problems, I transformed the link weight in a criminal network to a distance measure in a new graph representation. In this new graph, the nodes are the same as those in the original network, but the new link weights are computed based on the original weights using a simple logarithmic transformation:

$$l = -\ln w, \quad 0 < w \le 1, \tag{3.1}$$

where $l$ is the link weight in the new graph, and $w$ is the corresponding link weight in the original network. Given this transformation, I postulate the following axioms:
(1) All link weights in the new graph are nonnegative numbers.
(2) A lower link weight in the new graph corresponds with a higher link weight in the original network.
(3) The shortest path (using summation of link weights) between a pair of nodes in the new graph generates a path with the maximum link weight product among all the alternative paths between these two nodes in the original network.
Proof: Proofs of these three axioms are fairly straightforward, following the transformation equation directly.

Axiom (1). Since $0 < w \le 1$, we have $\ln w \le 0$, which implies that $-\ln w \ge 0$.
Axiom (2). Let $l_1 < l_2$; then $-\ln w_1 < -\ln w_2$, or $\ln w_1 > \ln w_2$. Since $\ln w$ is a monotonically increasing function, it follows that $w_1 > w_2$.
Axiom (3). Consider the shortest path, say P, between a pair of nodes A and B. P consists of a set of links with weights $(l_1, l_2, \ldots, l_p)$, $1 \le p \le n$, where $n$ is the total number of nodes in this graph. The total length of this path is $\sum_{i=1}^{p} l_i$. Consider another path between node A and node B, say Q, consisting of another set of links with weights $(l'_1, l'_2, \ldots, l'_q)$, $1 \le q \le n$. The total length is $\sum_{i=1}^{q} l'_i$. Because P is the shortest path between node A and node B, we know that

$$\sum_{i=1}^{p} l_i < \sum_{i=1}^{q} l'_i.$$

Since $l_i = -\ln w_i$ and $l'_i = -\ln w'_i$ by definition, we have

$$\sum_{i=1}^{p} \ln w_i > \sum_{i=1}^{q} \ln w'_i.$$

It follows that

$$\exp\left(\sum_{i=1}^{p} \ln w_i\right) > \exp\left(\sum_{i=1}^{q} \ln w'_i\right),$$

which suggests that

$$\prod_{i=1}^{p} w_i > \prod_{i=1}^{q} w'_i.$$
Axiom (1) ensures that the new graph does not contain negative-weight links, which is a necessary condition for the shortest-path algorithms (Evans & Minieka, 1992). Axioms (2) and (3) respectively address the two representation problems. Therefore, with such a transformation, I am able to use conventional shortest-path algorithms to identify the strongest relations between a pair of nodes or entities in a criminal network.
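To illustrate how the transformation turns a maximum-product search into an ordinary shortest-path search, the following sketch runs a textbook heap-based Dijkstra search on lengths $l = -\ln w$. It is not the dissertation's implementation: the function name `strongest_path` and the dictionary-based graph format are assumptions made for this example, and the test data are the link weights quoted for Figure 3.2.

```python
import heapq
from math import exp, log


def strongest_path(weights, source, target):
    """Return (path, strength) for the strongest relation path.

    weights maps undirected links (u, v) to a weight w with 0 < w <= 1.
    Each weight is converted to a length -ln(w), so the shortest path by
    summed length is the path with the largest weight product.
    """
    adjacency = {}
    for (u, v), w in weights.items():
        length = -log(w)  # nonnegative because 0 < w <= 1
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    dist = {source: 0.0}
    parent = {source: None}
    settled = set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        if u == target:
            break
        for v, length in adjacency.get(u, []):
            candidate = d + length
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))

    if target not in settled:
        return None, 0.0  # no path: no relation between the two nodes

    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    # exp(-sum of lengths) recovers the product of the original weights.
    return path, exp(-dist[target])


figure_3_2 = {
    ("A", "B"): 0.5, ("B", "C"): 0.8, ("C", "D"): 0.7,
    ("A", "E"): 0.8, ("E", "D"): 0.3,
}
path, strength = strongest_path(figure_3_2, "A", "D")
print(path, round(strength, 2))
```

On the Figure 3.2 weights the search selects path (ABCD), in agreement with the product computed by hand.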
I propose using the Priority-First-Search (PFS) algorithm (Cormen et al., 1991) and the two-tree Dijkstra algorithm (Helgason et al., 1993). Both algorithms can compute the shortest path between two source nodes. Considering the situation where an investigator needs to find relations between more than two entities, I repeatedly use the algorithms to identify the strongest relations among multiple source nodes.
I assume that a group of nodes is strongly associated if each pair of nodes in the group is strongly associated. That is, given $k$ source nodes $(u_1, u_2, \ldots, u_k)$, I first find the shortest paths between $u_1$ and every other source node ($u_2$ through $u_k$). Then I find the shortest paths between $u_2$ and the remaining source nodes ($u_3$ through $u_k$). This process is repeated until the shortest paths between all possible pairs of the $k$ source nodes are found.
The total number of these shortest paths is $k(k-1)/2$. It is possible that some of these paths share common links. If this happens, I combine the common links to avoid redundancy.
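The pairing scheme and the merging of common links can be sketched as follows. This is only an illustration under assumed data structures: `source_pairs` and `merge_paths` are hypothetical helper names, and paths are represented as plain lists of node labels.

```python
from itertools import combinations


def source_pairs(sources):
    """All unordered pairs of source nodes: k(k - 1)/2 pairs for k sources."""
    return list(combinations(sources, 2))


def merge_paths(paths):
    """Union of the links used by several paths, with shared links kept once.

    Each link is stored as a frozenset so that (u, v) and (v, u) coincide.
    """
    links = set()
    for path in paths:
        for u, v in zip(path, path[1:]):
            links.add(frozenset((u, v)))
    return links


pairs = source_pairs(["u1", "u2", "u3", "u4"])
merged = merge_paths([["a", "b", "c"], ["d", "b", "c"]])
print(len(pairs), len(merged))
```

With four source nodes there are six pairs, and the two example paths share the link between b and c, so only three distinct links remain after merging.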
3.4.2.1 The Modified PFS Algorithm
The PFS algorithm (Cormen et al., 1991) is a variation of the classical Dijkstra algorithm (Dijkstra, 1959). The algorithm works by maintaining a shortest-path tree $T$ rooted at a source node $s$. $T$ contains nodes whose shortest distances from $s$ are already known. Each node $u$ in $T$ has a parent, which is represented by $p_u$. A set of labels, $d_u$, is used to record the distances from the node $u$ to $s$. Initially, $T$ contains only $s$. At each step, I select from the candidate set $Q$ a node with the minimum distance to $s$ and add this node to $T$. Once $T$ includes all nodes in the graph, the shortest paths from the source node $s$ to all the other nodes have been found. PFS differs from the Dijkstra algorithm because it uses an efficient priority queue for the candidate set $Q$.
With modifications, PFS can be used to compute the shortest paths from a single source node to a set of specified nodes in the graph. That is, given a set of nodes $K \subseteq N$, $|K| = k \ge 2$, and a source node $s \in K$, the modified PFS algorithm can compute the shortest paths from $s$ to all $u \in K$, $u \ne s$. I therefore modify the algorithm so that it stops as soon as all $u \in K$ are included in the shortest-path tree $T$. Note that when $K$ contains only two nodes, the problem is reduced to a one-to-one shortest-path problem (Helgason et al., 1993). The modified PFS algorithm is presented in Figure 3.3.
Modified PFS algorithm
// This modified PFS algorithm computes the shortest paths from the first node in K to every other node in K
Begin
  Initialize: s = the 1st element of K;
              d_s = 0, p_s = s;
              d_i = ∞, p_i = 0 for all i ∈ N, i ≠ s;
              T = Q = {s}.
  while (K ≠ Ø)
    // Search Q for the node with minimum distance to s
    u = {i : d_i ≤ d_j, i, j ∈ Q, i ≠ j};
    Q = Q \ {u};
    // The shortest path between u and s has been found and u is added to T
    T = T ∪ {u};
    // Update the distance label of v
    for each (u, v) ∈ Out(u) such that d_u + l_uv < d_v do
      d_v = d_u + l_uv;
      p_v = u;
      if v ∉ Q then Q = Q ∪ {v};
    end
    if u ∈ K, then K = K \ {u};
  endwhile;
end.
Figure 3.3: The modified PFS algorithm.
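The pseudocode in Figure 3.3 can be approximated in Python as shown below. This is a minimal sketch rather than the original implementation: it uses the standard `heapq` module as the priority queue, skips stale heap entries instead of updating labels in place, and adds a guard that stops the loop when the queue is empty (for example, when a target is unreachable). The function name `modified_pfs` and the adjacency-list format are assumptions.

```python
import heapq


def modified_pfs(out_links, targets):
    """Shortest paths from targets[0] to every other node in targets.

    out_links maps a node to a list of (neighbor, length) pairs with
    nonnegative lengths. The search stops as soon as every node in
    targets has entered the shortest-path tree.
    """
    source = targets[0]
    remaining = set(targets)          # plays the role of K in Figure 3.3
    dist = {source: 0.0}
    parent = {source: source}
    tree = set()                      # T: nodes with final distances
    heap = [(0.0, source)]            # Q: candidate set as a binary heap

    while remaining and heap:
        d, u = heapq.heappop(heap)
        if u in tree:
            continue                  # stale entry for a finished node
        tree.add(u)
        remaining.discard(u)
        for v, length in out_links.get(u, []):
            if d + length < dist.get(v, float("inf")):
                dist[v] = d + length
                parent[v] = u
                heapq.heappush(heap, (dist[v], v))

    # Only labels of nodes in the tree are guaranteed to be final.
    final = {u: dist[u] for u in tree}
    return final, parent, tree


graph = {
    "s": [("a", 1.0), ("b", 3.0)],
    "a": [("s", 1.0), ("b", 1.0)],
    "b": [("s", 3.0), ("a", 1.0), ("c", 1.0)],
    "c": [("b", 1.0), ("d", 10.0)],
    "d": [("c", 10.0)],
}
final, parent, tree = modified_pfs(graph, ["s", "b"])
print(final["b"], sorted(tree))
```

In this small example the search ends once b has been added to the tree, so nodes c and d are never finalized, which is the early-termination behavior the modification is meant to provide.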
When computing the shortest paths from $K$'s second node to every other node in $K$, I repeat this procedure. Note that I do not need to compute the shortest path from the second node to the first node again, since it has already been computed. This procedure is repeated $k-1$ times until the shortest paths between all possible pairs of the nodes in $K$ have been found.
I implement the priority queue for the candidate set $Q$ using a heap. At each iteration of the while loop, it takes $O(\log n)$ time to search for the minimum element $u$ from $Q$, and $O(|Out(u)| \log n)$ time to examine and update the distances of incident links of $u$. Thus the execution time for the while loop is $\sum_{u \in N} (1 + |Out(u)|) \log n$, or $O((n+m) \log n)$, because $\sum_{u \in N} |Out(u)| = m$. As a result, the overall time complexity for computing all shortest paths for $k$ nodes is $O(k(n+m) \log n)$. PFS is faster than the Dijkstra algorithm, whose time complexity is $O(k(n^2+m))$ (Evans & Minieka, 1992).
3.4.2.2 The Two-Tree Dijkstra/PFS Algorithm
No modification is made to the two-tree Dijkstra algorithm because it can find the shortest path only between two nodes. The two-tree Dijkstra algorithm works by searching from both ends of the shortest path simultaneously (Helgason et al., 1993). A shortest-path tree rooted at the source node $s$ and a shortest-path tree rooted at another source node $t$ grow in alternate steps. The two trees are analogous except that the tree rooted at $s$ expands a node by examining its outgoing links, and the tree rooted at $t$ expands a node by examining its incoming links. A shortest path is found when both trees have a common node, say $r$, such that $d_r^s + d_r^t$ is a minimum, where $d_r^s$ is the distance between $r$ and $s$, and $d_r^t$ is the distance between $r$ and $t$. I define $\beta$ as the minimum distance and $J$ as the set of nodes that can be used to identify the shortest path.
The following two-tree Dijkstra algorithm is provided in Helgason et al. (1993). Assuming a priority queue is used for the candidate set $Q$, I call this algorithm two-tree PFS (Figure 3.4).
Two-Tree PFS algorithm
// Two-tree PFS computes the shortest path between node s and node t
Begin
  Initialize: d_s^s = 0, p_s^s = s, T_s = {s}; Q_s = {s}; p_i^s = 0, d_i^s = ∞ for all i ∈ N;
              d_t^t = 0, p_t^t = t, T_t = {t}; Q_t = {t}; p_i^t = 0, d_i^t = ∞ for all i ∈ N.
  while (T_s ∩ T_t = Ø) do
    // Search Q_s for the node with minimum distance to s
    u = {i : d_i^s ≤ d_j^s, i, j ∈ Q_s, i ≠ j};
    Q_s = Q_s \ {u};
    // The shortest path between u and s has been found and u is added to T_s
    T_s = T_s ∪ {u};
    // Examine outgoing links of u
    for each (u, v) ∈ Out(u) such that d_u^s + l_uv < d_v^s do
      d_v^s = d_u^s + l_uv;
      p_v^s = u;
      if v ∉ Q_s then Q_s = Q_s ∪ {v};
    end
    // Search Q_t for the node with minimum distance to t
    v = {i : d_i^t ≤ d_j^t, i, j ∈ Q_t, i ≠ j};
    Q_t = Q_t \ {v};
    // The shortest path between v and t has been found and v is added to T_t
    T_t = T_t ∪ {v};
    // Examine incoming links of v
    for each (u, v) ∈ In(v) such that d_v^t + l_uv < d_u^t do
      d_u^t = d_v^t + l_uv;
      p_u^t = v;
      if u ∉ Q_t then Q_t = Q_t ∪ {u};
    end
    // Stopping criterion
    β = min{d_i^s + d_i^t : i ∈ T_s ∪ T_t};
    J = {i ∈ T_s ∪ T_t : d_i^s + d_i^t = β};
  endwhile;
end.
Figure 3.4: The two-tree PFS algorithm.
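For comparison with Figure 3.4, the sketch below shows a common textbook variant of bidirectional (two-tree) Dijkstra search. It is not a line-by-line transcription of the figure: it stops when the sum of the smallest keys in the two queues is no smaller than the best connecting path found so far, and it returns only the distance, not the path. The function name and the adjacency-list arguments are assumptions made for this example.

```python
import heapq
from math import exp, inf, log


def two_tree_shortest_distance(out_links, in_links, s, t):
    """Length of the shortest s-t path, found by searching from both ends.

    out_links[u] lists (v, length) for links u -> v; in_links[v] lists
    (u, length) for links u -> v. For an undirected graph, pass the same
    adjacency dictionary for both arguments.
    """
    if s == t:
        return 0.0
    dist = [{s: 0.0}, {t: 0.0}]       # labels of the forward / backward tree
    settled = [set(), set()]          # T_s and T_t
    heaps = [[(0.0, s)], [(0.0, t)]]  # Q_s and Q_t
    links = [out_links, in_links]
    best = inf                        # shortest connecting path seen so far
    side = 0                          # 0 grows the s-tree, 1 grows the t-tree

    while heaps[0] and heaps[1]:
        # No undiscovered path can be shorter than the two smallest keys.
        if heaps[0][0][0] + heaps[1][0][0] >= best:
            break
        d, u = heapq.heappop(heaps[side])
        if u not in settled[side]:
            settled[side].add(u)
            for v, length in links[side].get(u, []):
                candidate = d + length
                if candidate < dist[side].get(v, inf):
                    dist[side][v] = candidate
                    heapq.heappush(heaps[side], (candidate, v))
                # If the other tree has labeled v, the two partial paths
                # join into a complete s-t path through this link.
                if v in dist[1 - side]:
                    best = min(best, candidate + dist[1 - side][v])
        side = 1 - side               # the two trees grow in alternate steps
    return best


# Figure 3.2 weights, transformed to lengths -ln(w); the graph is undirected.
weights = {("A", "B"): 0.5, ("B", "C"): 0.8, ("C", "D"): 0.7,
           ("A", "E"): 0.8, ("E", "D"): 0.3}
adjacency = {}
for (a, b), w in weights.items():
    adjacency.setdefault(a, []).append((b, -log(w)))
    adjacency.setdefault(b, []).append((a, -log(w)))

distance = two_tree_shortest_distance(adjacency, adjacency, "A", "D")
print(round(exp(-distance), 2))
```

Recovering the path itself would additionally require parent pointers in both trees and the link at which the best connection was found; those are omitted here to keep the sketch short.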
Because the two-tree PFS algorithm computes the shortest path only between two nodes, it must be used $k(k-1)/2$ times to identify the shortest paths for all possible node pairs in $K$. As a result, the overall time complexity is $O(k^2(n+m) \log n)$.
I did not use Floyd's (Floyd, 1962) or Dantzig's (Dantzig, 1960) all-pair shortest-path algorithms, which compute the shortest path for every pair of nodes in a graph. These algorithms require a substantial execution time of $O(n^3)$ (Evans & Minieka, 1992). However, the execution time of the two proposed algorithms will not exceed $O(k^2 n^2)$, which is less than $O(n^3)$ as long as $k^2 < n$. In most situations where $k$ is rather small compared with $n$, these two proposed algorithms will work faster than all-pair shortest-path algorithms.
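The condition $k^2 < n$ follows from dividing the two bounds. As a rough illustration only (constant factors are ignored), the second line plugs in $k = 5$, the largest number of source nodes used in the simulations of this chapter, and $n = 4257$, the size of the narcotics network:

```latex
\[
\frac{k^2 n^2}{n^3} = \frac{k^2}{n} < 1 \iff k^2 < n .
\]
\[
k = 5,\; n = 4257: \qquad \frac{k^2}{n} = \frac{25}{4257} \approx 0.006 .
\]
```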
I conducted a user evaluation and a simulation experiment in order to assess the performance of the proposed shortest-path algorithms. The user evaluation was aimed at addressing the effectiveness issue, namely, whether relation paths identified by the shortest-path algorithms are more likely to generate investigative leads than those identified by the modified BFS algorithm, which is representative of the typical relation search approach. The purpose of the simulation experiment, on the other hand, was to determine which shortest-path algorithm was more efficient for which type of network.
Crime investigators often encounter the efficiency issue when they work on a large network (Goldberg & Wong, 1998). In this section I first briefly describe the network construction process and then present the evaluation results.
3.5.1.1 COPLINK Concept Space and AZNP
The criminal networks used in my experiment were constructed based on the same concept space approach (Chen & Lynch, 1992) used in COPLINK Detect (Hauck et al., 2002). In such networks, the strength of a relation is indicated by a co-occurrence weight.
As reviewed previously, the nodes in COPLINK Detect are structured database records of entities. COPLINK Detect allows for link analysis with depth 1, that is, only nodes directly associated with source nodes can be found.
Rather than using structured database records, the criminal networks were constructed from unstructured textual documents. This is because law enforcement agencies often rely on crime report narratives to obtain detailed criminal relation information that may not otherwise be available in structured data. I used an automated noun-phrasing tool called AZNP to extract noun phrases from texts based on part-of-speech tagging and noun-phrasing rules (Tolle & Chen, 2000). The extracted noun phrases included various entity types such as persons, locations, vehicles, and properties. Co-occurrence weights between these entities were calculated to generate relation strength measures.
3.5.1.2 Data Set
The Phoenix Police Department provided me with one year's worth of crime reports. The size of the data set is 1 GB. These reports described various types of crimes ranging from shoplifting to auto theft, and from credit card fraud to narcotics possession and sales. I selected two samples as my test bed, namely, kidnapping and narcotics, both of which are organized crimes. The size of the kidnapping report collection is 4.5 MB, and the size of the narcotics report collection is 38 MB.
The crime reports varied substantially in length. For example, in the kidnapping sample, some documents simply contained a few lines about a phoned-in kidnapping report, while others had hundreds of lines detailing a kidnapping investigation. Since the length of a document can affect the co-occurrence weights of the concepts it contains (Chen & Lynch, 1992), I removed from my data sets those reports containing fewer than five lines of text. The noun phrases were extracted from the resulting document collections, and irrelevant terms were filtered out based on a 3,400-item stop-word list. The noun phrases left after filtering were used as network nodes and their co-occurrence weights were calculated. Two networks were constructed: one for the kidnapping sample and the other for the narcotics sample. Table 3.1 presents the statistics for the two samples.
                                        Kidnapping    Narcotics
Number of reports                              271        3,572
Number of noun phrases extracted            95,328      861,516
Network size (n)                               280        4,257
Number of links (m)                         25,862      733,572
Average number of links a node has            92.4        172.3
Table 3.1: Sample statistics of two networks.
3.5.2.1 User Evaluation: Effectiveness Issue
In the user evaluation, I compared the effectiveness of the relation paths identified by the shortest-path algorithms and those identified by the modified BFS algorithm. The purpose of the evaluation was to ascertain whether the shortest-path algorithms would be more useful for uncovering crime investigative leads.
The paths identified by an algorithm may consist of links that are not useful for crime investigations. With the concept space approach, a link between two entities is created if they co-occur in crime reports. However, a co-occurrence relation does not necessarily indicate an important relationship between entities. For example, the shortest-path algorithms identified three relation paths for a kidnapping case with three source nodes:
Juan (person), Jose (person), and West Van Buren (location):
(1) Juan – Jose
(2) Juan – Maria – West Van Buren
(3) Jose – Maria – West Van Buren
Path (1) is useful because both Juan and Jose are listed in a report as victims in a kidnapping crime. Path (2) is considered to be non-useful. Two reports describe the relation between Juan and Maria: one records that Juan Balderaz's ex-wife was Maria Palma; the other indicates that Juan Rodriguez kidnapped Maria Molina's daughter. The relation between Maria and West Van Buren is recorded in another report, which indicates that Maria Dillon lived at 3100 West Van Buren. Notice that the three Marias are different persons. Thus, the relation path with Maria as the intermediate node cannot provide information about how Juan and West Van Buren are related. Path (3) is a useful path because one report indicates that Jose Carrasco's friend was Maria Dillon, who lived at 3100 West Van Buren. All entity names are scrubbed to ensure data confidentiality.
To measure the effectiveness of my algorithms, I used a precision rate defined as follows:

$$\text{Precision} = \frac{\text{Number of useful paths selected by experts}}{\text{Total number of paths identified by the algorithm}} \times 100\% \tag{3.2}$$
Because the modified BFS algorithm is not guaranteed to identify the strongest relation paths between entities, I predicted that the shortest-path algorithms would achieve a higher precision than the modified BFS algorithm.
I randomly selected 30 pairs of source nodes from each of the kidnapping network and the narcotics network. Relation paths were computed using both a shortest-path algorithm and the modified BFS algorithm. As shown in Table 3.2, the paths found by the modified BFS algorithm generally contain more intermediate links than those found by a shortest-path algorithm. Which shortest-path algorithm was used is not important here because they always generate the same paths.
A domain expert from the Tucson Police Department evaluated the resulting relation paths. The expert had been serving in law enforcement for more than 30 years and had a substantial amount of experience in link analysis. For the results produced by an algorithm, he examined the 30 paths from each network by reading the original crime reports. He determined whether a relation path was useful for generating investigative leads based on his past experience investigating similar crimes. It took 2.5-3 hours to complete the evaluation task for each network. The results show that on average the shortest-path algorithms identified more useful relation paths than the modified BFS algorithm. Around 70% of the paths found by the shortest-path algorithms were considered useful for both networks. For modified BFS, in contrast, only 30% of the paths from the kidnapping network and 16.7% of the paths from the narcotics network were considered to be useful. Table 3.2 shows the precision rate of each algorithm.
[Table: rows are Shortest-path algorithms and Modified BFS; columns are Average number of links in relation paths and Precision; the cell values are missing in this copy.]
Table 3.2: Effectiveness evaluation results.
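The precision rate is a simple ratio, as the sketch below shows. The counts used in the example (9 and 5 useful paths out of 30) are not reported in the text; they are back-calculated here from the stated 30% and 16.7% rates for the modified BFS algorithm and should be read as an illustrative assumption.

```python
def precision(useful_paths, total_paths):
    """Percentage of identified paths that the expert judged useful."""
    if total_paths <= 0:
        raise ValueError("total_paths must be positive")
    return 100.0 * useful_paths / total_paths


# Assumed counts, inferred from the reported percentages for modified BFS.
bfs_kidnapping = precision(9, 30)
bfs_narcotics = precision(5, 30)
print(round(bfs_kidnapping, 1), round(bfs_narcotics, 1))
```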
The shortest-path algorithms can achieve a higher precision because they always select relations with high co-occurrence weights during link exploration. As discussed previously, a co-occurrence weight is a measure of how frequently two entities are related. Therefore, the more frequently two entities are associated, the less likely they are to be related by chance, and the more likely such a relation will be useful for investigations. In contrast, the modified BFS algorithm produces arbitrary paths between entities. It is very likely that these paths contain unimportant relations, resulting in a low precision rate.
Although promising, the shortest-path algorithms still failed to identify useful paths about 30% of the time. Based on my analysis of the non-useful paths found by the shortest-path algorithms, I categorized the reasons for the failures as follows (using the kidnapping network as an example):
• Some nodes in the networks do not represent unique entities. This situation often occurs for the person type. Usually, after a person's full name is provided at the beginning of a crime report narrative, he/she is referred to only by the first name in later parts of the report. During network construction, the same first names extracted by the noun phraser from different reports are indiscriminately treated as one single node. As a result, a node (e.g., Maria) may not refer to a unique person but to different people with the same first name (e.g., Maria Palma, Maria Molina, Maria Dillon, etc.). This problem also exists for other types of entities such as vehicles, locations, and properties. For example, "white car" may refer to different white cars owned by different persons; "North 7th Street" includes a number of addresses on that particular street. A non-useful relation path may result if it contains such intermediate nodes. In my test bed, 54.2% of the non-useful relation paths fell into this category.
• Whether an entity is relevant or not depends on specific contexts. This problem seldom affects entities such as persons and addresses, because their presence in a crime report usually implies that they are relevant to that particular crime. Indeed, any person mentioned in a report has a role descriptor. For example, "sp" means suspect, "v" means victim, and "w" means witness. However, property entities may include any physical object that a person possesses. It is much more difficult to determine whether or not a property is relevant to a particular crime without considering the specific context of a crime. When a property is the target of a crime, it usually is considered to be relevant. However, if a physical object is mentioned simply to describe the environment or a situation, it is often treated as irrelevant. For example, a "cell phone" is a relevant property if it is stolen in a crime; it is irrelevant if a witness used his or her cell phone to report a crime to the police. Unlike a human, who can determine an entity's relevance based on contextual clues, the noun phraser cannot examine texts semantically to distinguish between relevant and irrelevant entities. As a result, a relation path will be non-useful if it happens to include an irrelevant entity. Over 37% of the non-useful paths had this problem.
• Two entities may have a "fake" relationship even though they are listed in the same report. A link is established when two entities appear together in the same document. However, this link may be a trivial relation between the two entities. Usually, relations between a person and other entities (e.g., another person, vehicles, addresses, etc.) are less frequently subject to this problem. However, relations between entities other than persons are often less informative. For example, a link exists between "white Toyota" and "North 7th Street" because they are listed in the same report narrative. In this report, I found that a male driving a white Toyota car kidnapped the daughter of a person who lived on North 7th Street. Such a link does not imply a useful relationship between these two entities but a "fake" one. Around 5% of the non-useful paths fell into this category.
The results of this analysis suggest that the effectiveness of my algorithms may be improved if more appropriate entities and relations are extracted and used.
3.5.2.2 Simulation Experiment: Efficiency Issue
The simulation experiment focused on the efficiency of the two shortest-path algorithms (modified PFS and two-tree PFS). I define the efficiency of an algorithm as its average execution time. The experiment was intended to ascertain which algorithm is more efficient for which type of network in terms of network size and other structural characteristics.
To compare the efficiency of these two algorithms in the case of multiple source nodes, I varied the number of source nodes, $k$, from 2 to 5 in the simulations. I chose these numbers based on the observation from my pilot studies that investigators usually used fewer than five source entities during a relation search. Given a specific $k$, I randomly generated 100 cases for each network and ran both algorithms on them. The execution time for the algorithms was recorded and is presented in Table 3.3.
(a)
Algorithm       k = 2              k = 3              k = 4                k = 5
Modified PFS    1.00 (0.54)        2.89 (0.97)        6.00 (1.26)          10.67 (2.09)
Two-tree PFS    0.35 (0.19)        0.95 (0.28)        1.94 (0.37)          3.45 (0.65)

(b)
Algorithm       k = 2              k = 3              k = 4                k = 5
Modified PFS    66.75 (27.06)      194.05 (53.97)     419.47 (61.91)       661.10 (132.22)
Two-tree PFS    239.00 (132.00)    709.50 (263.75)    1,350.56 (348.70)    2,322.28 (546.25)

Table 3.3: Mean execution time (in seconds) for the two shortest-path algorithms (numbers in parentheses are standard deviations). (a) Results for the kidnapping network. (b) Results for the narcotics network.
For all four values of $k$, the pairwise t-tests for the mean execution time suggest that two-tree PFS is significantly faster than PFS ($p < 0.001$) in the kidnapping network. However, PFS is significantly faster than the two-tree PFS algorithm ($p < 0.01$) in the narcotics network. Figure 3.5 presents the execution time plot with $k = 5$ for the kidnapping and narcotics networks, respectively.
[Figure: execution time of PFS and two-tree PFS plotted against simulation case (1 to 100); panels (a) and (b).]
Figure 3.5: Execution time scatter plot ($k = 5$). (a) Results for the kidnapping network. (b) Results for the narcotics network.
The result from the kidnapping network is consistent with the findings in Helgason et al. (1993). According to Helgason et al. (1993), a two-tree algorithm usually is faster than one-tree algorithms. In their study, a shortest-path tree in a one-tree algorithm contains about 50% of the nodes in a network before the shortest path is found, whereas a two-tree algorithm can find the shortest path when its trees contain only 6% of the nodes. I found similar results in terms of the number of nodes contained in the shortest-path trees. For the kidnapping network, the one-tree PFS algorithm generated a tree containing 52% of the nodes, and the two-tree PFS algorithm generated two trees containing 14.7% of the nodes in total. For the narcotics network, the tree in the one-tree PFS algorithm contained 49.6% of the nodes, and the trees in the two-tree PFS algorithm contained only 3.9% of the nodes.
However, the one-tree algorithm outperformed its two-tree counterpart in the narcotics network. Based on my analysis of the structural characteristics of both networks, I found that two factors might have caused this discrepancy.
• Network size. As the size of a network increases, the size of the candidate set $Q$, which contains temporarily labeled nodes, also increases. It takes time to search and update the labels in $Q$ when incident links of a node are explored. Therefore, when a network is large and the computational cost of processing the candidate sets becomes high, the two-tree algorithm will be inefficient. For the narcotics network ($n$ = 4,257), the two candidate sets in the two-tree PFS algorithm together contained 120% of the total nodes, whereas the candidate set in the one-tree PFS algorithm contained only 47.9% of the total nodes. Thus, the time for processing the candidate sets in the two-tree PFS algorithm was much longer than the time spent in the one-tree PFS algorithm, causing the two-tree PFS algorithm to be slower.
• Network density. The density of a network is defined as the ratio of the total number of links to the possible number of links (Wasserman & Faust, 1994). Thus, the density of an undirected network consisting of $n$ nodes and $m$ links is $2m/(n(n-1))$. Network density may have an impact on the efficiency of a two-tree algorithm, which can find a shortest path only if the two trees have overlapping nodes. The lower the density of a network, the less likely two trees will overlap. In my experiment the density of the narcotics network is 0.08. This means that the two trees have overlapping nodes only 8% of the time and that the algorithm must spend more time growing the trees. The kidnapping network, in contrast, has a much higher density (0.66), causing the two-tree algorithm to be faster than the one-tree algorithm.
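The two density values can be reproduced from the node and link counts reported in Table 3.1. The short sketch below applies the density formula for an undirected network; the function name `density` is illustrative.

```python
def density(n, m):
    """Density of an undirected network with n nodes and m links."""
    return 2 * m / (n * (n - 1))


# Node and link counts as reported in Table 3.1.
kidnapping = density(280, 25862)
narcotics = density(4257, 733572)
print(round(kidnapping, 2), round(narcotics, 2))
```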
Based on the analysis, I suggest that the two-tree PFS algorithm be used for small and dense networks. For large and sparse networks, the one-tree PFS algorithm is faster.
Effective and efficient link analysis techniques can assist investigation of organized crimes. With the help of such techniques, crime investigators may acquire better understanding of the interrelationships between offenders, thereby discovering new leads for investigation.
In this chapter, I proposed a link analysis technique that employs shortest-path algorithms (PFS and two-tree PFS) to identify the strongest relations between two or more entities in a criminal network. Modifications were made to the algorithms to solve the shortest-path computation problem for multiple source nodes. After a logarithmic transformation of the link weights, these shortest paths could identify the strongest relations between given entities.
The evaluation study focused on the approach's effectiveness and efficiency, both of which are desirable features of a sophisticated decision-support system. The results show that the shortest-path algorithms outperformed the typical relation search approach of crime investigators (as represented by the modified BFS algorithm) in terms of effectiveness. The relation paths identified using the shortest-path algorithms were considered useful about 70% of the time, as opposed to precision rates of 30% (for the kidnapping network) and 16.7% (for the narcotics network) with the modified BFS algorithm. The two shortest-path algorithms always produced identical results, but the two-tree PFS algorithm was faster for the small and dense kidnapping network and the PFS algorithm was faster for the large and sparse narcotics network.
Analysis of the evaluation results suggests that the effectiveness might be improved by extracting more appropriate entities from texts and using them as network nodes. In my future research I will apply effective named-entity extraction techniques to replace my current noun phraser. I will also incorporate some domain-specific heuristics to help the system select only entities and relations that are considered useful by crime investigators.
In Chapter 3, I proposed using shortest-path algorithms to identify important relations between criminals. Many other static structural patterns such as key nodes and subgroups in criminal networks are also valuable knowledge resources for the investigation of organized crimes. In this chapter I propose using a number of techniques to address the static structural pattern mining problems to help law enforcement and intelligence agencies better manage their knowledge assets about crimes and criminals (Xu & Chen, 2005).
Law enforcement and intelligence agencies have long realized that knowledge about criminal networks is important to crime investigation and may to a large extent shape police efforts (McAndrew, 1999). A clear understanding of network structures, operations, and individual roles can help develop effective control strategies to prevent crimes from taking place.
However, criminal network analysis and mining currently is primarily a manual process, usually consuming much time and human effort at each stage of the knowledge discovery process (data processing, transformation, analysis, and visualization). Although some existing tools provide visual representations of criminal networks to assist investigation, they lack structural network analysis functionality that may offer a deeper insight into the structure and organization of criminal enterprises.
To help discover criminal network knowledge efficiently and effectively, I propose in this chapter a series of procedures for automated network structure mining and visualization: network creation, network partition, structural analysis, and network visualization. I have developed a prototype system called CrimeNet Explorer that incorporates several advanced techniques (a concept space approach, social network analysis methods, etc.) for automatically extracting structural patterns in criminal networks, namely, key members, subgroups, and interaction patterns between subgroups.
The remainder of the chapter is organized as follows: Section 4.2 introduces the background of criminal network analysis; Section 4.3 reviews existing network analysis tools and social network analysis techniques; Section 4.4 provides details about the mining procedures and CrimeNet Explorer. System evaluation is discussed in Section 4.5, and Section 4.6 concludes this chapter.
When analyzing criminal networks, crime investigators often focus on characteristics of the network structure to gain insight into the following questions (McAndrew, 1999; Sparrow, 1991):
• Who is central in the network?
• What subgroups exist in the network?
• What are the patterns of interaction between subgroups?
• What is the overall structure of the network?
• Which member's removal would result in disruption of the network?
• How do information or goods flow in the network?
Knowledge of these structural characteristics can help reveal vulnerabilities of criminal networks and may have important implications for crime investigation.
Usually, criminal network members who occupy central positions should be targeted for removal or surveillance (Baker & Faulkner, 1993; McAndrew, 1999; Sparrow, 1991). A central member may play a key role in a network by acting as a leader who issues commands and provides steering mechanisms or serving as a gatekeeper who ensures that information or goods flow effectively among different parts of the network. Removal of these central members may effectively disrupt the network and put the operation of a criminal enterprise out of action.
In addition to studying roles of individual members, crime investigators also need to pay special attention to subgroups in criminal enterprises. Each subgroup or team may be responsible for specific tasks. Group members have to interact and cooperate to
accomplish the tasks. Therefore, detecting subgroups in which members are closely related to one another can increase understanding of a network’s organization.
Moreover, groups may interact with each other in such a way that interactions and relationships may reveal certain patterns. For example, one group may have frequent interactions with one other specific group but seldom interact with the rest of the network.
When interaction and relationship patterns between groups are found, the overall structure of the network can become more apparent. Indeed, different structures have different points of vulnerability. Intelligence regarding the overall structure of a network can help law enforcement and intelligence agencies develop the most effective strategies to disrupt that network.
Different criminal network structures such as chain, star/wheel, and complete/clique
(Evan, 1972; Ronfeldt & Arquilla, 2001) require specific disruptive strategies. A chain structure consists of members (individuals or groups) that are connected one by one so that information or goods must flow from one member to its neighbor before getting to the next. In a star structure, members are all connected to a central member who acts as a leader or hub. In a complete network, all members are fully connected with one another so that communication between any two members can be carried out directly. A star structure is a centralized network, whereas chain and complete structures are considered decentralized networks (Baker & Faulkner, 1993; Freeman, 1979). A centralized network can be disrupted by removing its central member(s), which can cause the network to fall apart.
A decentralized network, however, is more difficult to disrupt and more resistant to damage.
Although criminal network knowledge has important implications for crime investigation, little research has been done to develop advanced, automated techniques to assist with such tasks (Klerks, 2001; McAndrew, 1999; Sparrow, 1991). In the next section I review existing network analysis and visualization tools and introduce several new techniques that could be used for network analysis and structural pattern mining.
Existing network analysis tools used by law enforcement and intelligence agencies mainly focus on network visualization and do not have much structural analysis capability. Such a limitation might be successfully addressed by several methods from social network analysis research.
Klerks (2001) categorized existing criminal network analysis tools into three generations.
4.3.1.1 First Generation: Manual Approach
Representative of the first generation is the Anacapa Chart (Harper & Harris, 1975), which was briefly reviewed in Chapter 1. In this approach, an investigator first constructs an association matrix by examining data files to identify relations between criminals. Based on this association matrix, a link chart can be drawn for visualization purposes. The criminal having the most links to other people may be placed at the center of the link chart, indicating his/her importance in the network. The investigator then can study the structure of the graphical portrayal of the network to discover patterns of interest. Krebs (2001), for example, mapped a terrorist network comprising the 19 hijackers in the September 11 attacks. He first examined publicly released information reported in several major newspapers to gather data about relationships among the terrorists. He then manually constructed an association matrix to integrate these relations and drew a terrorist network depicting possible patterns of interactions based on the matrix (see Figure 4.1).
Although such a manual approach is helpful for crime investigation, for very large data sets its use becomes extremely ineffective and inefficient.
Figure 4.1: The terrorist network surrounding the 19 hijackers on September 11, 2001
(Source: http://www.orgnet.com).
4.3.1.2 Second Generation: Graphics-Based Approach
Second-generation tools are more sophisticated because they can produce graphical representations of networks automatically. Most current criminal network analysis tools belong to this generation; among them are Analyst’s Notebook, Netmap, and Watson.
These three tools have also been briefly reviewed in Chapter 1.
Analyst’s Notebook has been widely employed by law enforcement in the United States and the Netherlands (Klerks, 2001). Like the first-generation approach, Analyst’s Notebook relies on a human analyst to detect criminal relationships in data, but it can automatically generate a link chart based on relational data stored in a spreadsheet or text file. It uses icons to distinguish between different types of entities (e.g., persons, bank accounts, companies, addresses, etc.) and allows a user to drag those icons around to rearrange the network layout. For example, an icon representing a key person can be dragged to the center of the chart, and less important icons can be placed on the periphery (see Figure 4.2a).
Similarly, Netmap provides network visualization functionality (see Figure 4.2b). The system lays out entities of various types on the perimeter of a circle and places straight lines between entities to represent links. By examining the links, an analyst may discover useful patterns of interactions and relations hidden behind the network. Netmap has been adopted in the FinCEN system at the U.S. Department of the Treasury to analyze patterns of financial transaction data to detect money laundering (Goldberg & Senator, 1998).
Another second-generation tool called Watson (Anderson et al., 1994) can search and identify possible relations between persons by querying databases (see Figure 4.2c).
Given a person’s name, Watson can automatically form a database query to search for related persons. The related persons found are linked to the given person and the result is presented in a link chart.
Although second-generation tools are capable of visualizing criminal networks, their sophistication level remains modest because they offer little structural analysis capability.
The analysis burden is still on human crime analysts.
Figure 4.2: Second-generation criminal network analysis tools. (a) Analyst’s Notebook. Network members are automatically arranged for easy interpretation (Source: i2, Inc.). (b) Netmap. The thickness of a line indicates the relational strength of the link it represents. Different colors are used to represent different entity types (Source: Netmap Analytics, LLC.). (c) Watson. Relations among a group of people (the central sphere) are extracted from telephone records. Phone calls that are not to or from the group are also displayed (the peripheral nodes). A color is used to represent phone calls related to a particular person (Source: Xanalys, Ltd.).
4.3.1.3 Third Generation: Structural Analysis Approach
No existing tool is sophisticated enough to be categorized as being of the third generation. Tools of this new generation are expected to provide more advanced analytical functionality that helps discover structural characteristics of criminal networks: central members, subgroups, interaction patterns between groups, and the overall structure.
SNA has recently been recognized as a promising technology for studying criminal organizations and enterprises (McAndrew, 1999; Sparrow, 1991). Studies involving evidence mapping in fraud and conspiracy cases have recently been added to this list
(Baker & Faulkner, 1993; Saether & Canter, 2001). These studies, however, focused only on central network members and did not identify subgroups and interaction patterns in criminal networks. Actually, both relational and positional analysis in SNA are relevant to the study of criminal networks (McAndrew, 1999).
4.3.2.1 Relational Analysis
Relational analysis focuses on the connectivity of a network. It is often used to identify central members or to partition a network into subgroups. In such studies, links usually are weighted by relational strength. The three most popular centrality measures are defined as follows (Freeman, 1979):
The degree of a node $u$ is defined as the number of links $u$ has,

$$C_D(u) = \sum_{i=1}^{n} a(i, u), \qquad (4.1)$$

where $n$ is the total number of nodes in a network and $a(i, u)$ is a binary variable indicating whether a link exists between nodes $i$ and $u$. A network member with a high degree could be the leader or “hub” in a network. The betweenness of a node $u$ is defined as the number of geodesics (shortest paths between two nodes) passing through $u$,

$$C_B(u) = \sum_{i}^{n} \sum_{j}^{n} g_{ij}(u), \quad i < j, \qquad (4.2)$$

where $g_{ij}(u)$ indicates whether the shortest path between two other nodes $i$ and $j$ passes through $u$. A member with high betweenness may act as a gatekeeper or “broker” in a network for smooth communication or flow of goods (e.g., drugs). The closeness of a node $u$ is the sum of the lengths of the geodesics between $u$ and all the other nodes in a network,

$$C_C(u) = \sum_{i=1}^{n} l(i, u), \qquad (4.3)$$

where $l(i, u)$ is the length of the shortest path connecting nodes $i$ and $u$.
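As an illustration, the three measures in (4.1) to (4.3) can be sketched in Python for a small unweighted, connected, undirected network. The function names and the example star network are hypothetical, geodesic lengths are counted in hops, and the sketch assumes that $g_{ij}(u) = 1$ whenever $u$ lies on at least one geodesic between $i$ and $j$; it is not the implementation used in CrimeNet Explorer.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop-count geodesic lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def centralities(adj):
    """Return (degree, betweenness, closeness) for a connected, undirected graph."""
    nodes = sorted(adj)
    dist = {v: bfs_distances(adj, v) for v in nodes}
    degree = {u: len(adj[u]) for u in nodes}                         # Eq. 4.1
    closeness = {u: sum(dist[u][i] for i in nodes) for u in nodes}   # Eq. 4.3
    betweenness = {u: 0 for u in nodes}
    for a, i in enumerate(nodes):
        for j in nodes[a + 1:]:            # each unordered pair i < j once
            for u in nodes:
                if u in (i, j):
                    continue
                # Assumption: u counts if it lies on at least one geodesic.
                if dist[i][u] + dist[u][j] == dist[i][j]:
                    betweenness[u] += 1    # Eq. 4.2
    return degree, betweenness, closeness

# Hypothetical star network: "hub" is linked to four peripheral members.
star = {"hub": {"a", "b", "c", "d"},
        "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
degree, betweenness, closeness = centralities(star)
```

Because closeness in (4.3) is a sum of geodesic lengths, a smaller value marks a more central member, whereas larger degree and betweenness values do.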
Another type of relational analysis is to partition a network based on the strength of relationships between network members. Because criminals often form groups or teams to commit crimes, such an approach can help detect subgroups in a large criminal network.
Two methods have been employed for network partition in SNA studies: matrix permutation and hierarchical clustering (Arabie et al., 1978; Wasserman & Faust, 1994). The purpose of matrix permutation is to rearrange rows and columns of a matrix so that members who occupy adjacent rows (or columns) can be organized into the same group. Since matrix permutation is inherently an NP-hard problem, many SNA studies use hierarchical clustering methods (Arabie et al., 1978). Hierarchical clustering is reviewed in Section 4.3.2.3.
4.3.2.2 Positional Analysis
Unlike relational analysis, positional analysis examines how similarly two network members connect to other members. The purpose of positional studies is to discover the overall structure of a social network using a blockmodeling approach (White et al., 1976).
To model interaction patterns between positions after network partition, blockmodel analysis compares the density of links between two positions with the overall density of a network (Arabie et al., 1978; Breiger et al., 1975; White et al., 1976). Link density between two positions is the actual number of links between all pairs of nodes drawn from each position divided by the possible number of links between the two positions. In a network with undirected links, for example, the between-position link density can be calculated by

$$d_{ij} = \frac{m_{ij}}{n_i n_j}, \qquad (4.4)$$

where $d_{ij}$ is the link density between positions $i$ and $j$; $m_{ij}$ is the actual number of links between positions $i$ and $j$; and $n_i$ and $n_j$ represent the number of nodes within positions $i$ and $j$, respectively. The overall link density of a network is defined as the total number of links divided by the possible number of links in the whole network, i.e.,

$$d = \frac{m}{n(n-1)/2},$$

where $m$ is the total number of links and $n$ is the total number of nodes in the network. Notice that for an undirected network the possible number of links is always $n(n-1)/2$.
A blockmodel of a network is thus constructed by comparing the density of the links between each pair of positions, $d_{ij}$, with $d$: a between-position interaction is present if $d_{ij} \geq d$, and absent otherwise. Blockmodeling therefore reduces a complex network to a simpler structure by summarizing individual interaction details into relationship patterns between positions (White et al., 1976). As a result, the overall structure of the network becomes more evident.
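The density comparison can be sketched in Python as follows. The function name, the position labels, and the example links are hypothetical; the sketch assumes an undirected network without self-links in which every node belongs to exactly one position.

```python
from itertools import combinations

def blockmodel(links, positions):
    """Compare each between-position density d_ij with the overall density d."""
    where = {v: p for p, members in positions.items() for v in members}
    n = len(where)
    link_set = {frozenset(link) for link in links}    # undirected, deduplicated
    d = len(link_set) / (n * (n - 1) / 2)             # overall link density
    counts = {}
    for link in link_set:
        a, b = tuple(link)
        if where[a] != where[b]:
            key = frozenset((where[a], where[b]))
            counts[key] = counts.get(key, 0) + 1      # m_ij
    image = {}
    for pi, pj in combinations(sorted(positions), 2):
        m_ij = counts.get(frozenset((pi, pj)), 0)
        d_ij = m_ij / (len(positions[pi]) * len(positions[pj]))   # Eq. 4.4
        image[(pi, pj)] = (d_ij, d_ij >= d)           # present if d_ij >= d
    return d, image

# Hypothetical five-member network partitioned into three positions.
positions = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}
links = [("a1", "a2"), ("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("b1", "b2")]
d, image = blockmodel(links, positions)
```

In this example five of the ten possible links exist, so $d = 0.5$; positions A and B share three of four possible links ($d_{AB} = 0.75$), so only the A-B interaction is marked present.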
4.3.2.3 Hierarchical Clustering
Although they are based on different measures, both relational and positional analysis in
SNA may employ hierarchical clustering to partition a network. When used in relational analysis, hierarchical clustering treats relational strength as a similarity measure.
Therefore, the resulting clusters represent subgroups whose members are closely related.
When applied in positional analysis, on the other hand, hierarchical clustering uses structural equivalence to measure similarity and resulting clusters represent positions whose members are similar in the way they connect to other members.
The advantage of hierarchical clustering is that a network can be partitioned into different numbers of clusters at different similarity levels. With this feature, the underlying structure of a network can be analyzed at different levels of detail. The disadvantage of hierarchical clustering, on the other hand, is that each node can be assigned to only one cluster at a specific level of similarity (Wasserman & Faust, 1994). There is no overlap between clusters.
Among the three most popular hierarchical clustering methods (single-link, complete-link, and Ward’s algorithm), the complete-link algorithm is most widely used because it gives more homogeneous and stable clusters than the others (Jain & Dubes, 1988; Jain et al., 1999; Lance & Williams, 1967).
4.3.2.4 Visualization of Social Networks
SNA studies employ multidimensional scaling (MDS) in both relational and positional analysis of social networks (Breiger et al., 1975; Burt, 1976; Freeman, 2000; Wasserman & Faust, 1994). When applied to a relational analysis, MDS uses relational strength as a measure of proximity and outputs x-y coordinates for each object on a two-dimensional plane so that closely related members are also close visually. When applied to positional analysis, MDS uses the structural equivalence between members as a proximity measure so that members who are structurally substitutable are close together on the display.
Recent SNA studies have also used spring embedder algorithms to visualize social networks (Freeman, 2000).
In summary, SNA offers several structural analysis techniques that can be used to extract structural patterns from criminal networks. However, existing network analysis tools are
not sophisticated enough to employ these techniques. To analyze a criminal network, an investigator has to extract information about criminal relationships from data, create a network representation, and perform structural analysis manually to identify central members, to detect subgroups, and to discover interaction patterns among groups. It is highly desirable to automate the whole process of criminal network analysis so that knowledge can be extracted more efficiently and effectively.
I propose using several techniques to facilitate structural pattern extraction. I have also developed a system called CrimeNet Explorer, which incorporates these techniques and can be categorized as a third-generation network analysis tool. Figure 4.3 presents the proposed structural pattern mining processes: network creation, network partition, structural analysis, and network visualization.
[Figure 4.3 flowchart: Criminal-justice data → Network Creation (concept space) → Networked data → Network Partition (hierarchical clustering) → Cluster hierarchies → Structural Analysis (centrality, blockmodeling) → Network Visualization (MDS)]
Figure 4.3: Procedures for automated criminal network mining and visualization.
Criminal-justice data collected from crime incident reports, telephone records, surveillance logs, financial transaction records, and other sources usually do not store explicit information about criminal relationships. The task of extracting relational information from raw data and transforming it into a networked format could be quite labor-intensive and time-consuming.
To address this problem, I employed a concept space approach (Chen & Lynch, 1992) to create networks automatically (Chen et al., 2003; Hauck et al., 2002). The concept space approach was originally employed in information retrieval applications for extracting term relations in documents. It uses a co-occurrence weight to measure the frequency with which two words or phrases appear in the same document. The more frequently two words or phrases appear together, the more likely it is that they are related.
The criminal-justice data used in this chapter consisted of crime incident summaries provided by the Tucson Police Department (TPD). I treated each incident summary (database records specifying the date, location, persons involved, and other information about a specific crime) as a document and each person’s name as a phrase. I then calculated co-occurrence weights based on the frequency with which two individuals appeared together in the same crime incident. I assumed that criminals who committed crimes together might be related and that the more often they appeared together the more likely it would be that they were related. As a result, the value of a co-occurrence weight not only implied a relationship between two criminals but also indicated the strength of the relationship (Hauck et al., 2002).
With the concept space approach, criminal relationships therefore could be extracted from crime incident data and transformed into a networked format automatically.
Resulting networks were undirected, weighted graphs in which nodes represented individual criminals and co-occurrence weights of links represented relational strength. It is worth mentioning that the concept space approach has both advantages and disadvantages for extracting relations. On the one hand, the weight of a link was normalized to a range between 0 and 1, an improvement over the simple co-occurrence count. More importantly, the distribution of co-occurrences was extremely skewed. More than 90% of the criminal pairs resulted from a one-time co-occurrence and a small portion (around 2.4%) of pairs co-occurred 10 times or more. The concept space approach, which penalized extremely large co-occurrences (Chen & Lynch, 1992), helped prevent the link weights from being skewed. On the other hand, the concept space approach is limited since the relational strength can be affected by other factors such as crime type. For example, a co-occurrence relation in a gang-related crime in which a large number of criminals participated might not be as strong as a relation in an auto-theft crime in which only two criminals were involved.
I also observed that the network generated might not necessarily be a single connected graph that contained all criminals in a set of data. This might be due to the fact that some
criminal enterprises might not have any connection with other criminal organizations. It could also be caused by the incompleteness of the data (McAndrew, 1999).
The networks created were stored in a database table in which each tuple specified a pair of criminals and an associated co-occurrence weight. These co-occurrence weights would be used later in both structural pattern mining and network visualization.
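The network creation step can be sketched in Python as below. The function name, the person identifiers, and the incident lists are hypothetical. The normalization shown (shared incidents divided by incidents involving either person) is only a stand-in that keeps weights between 0 and 1; it is not the concept space weighting of Chen and Lynch (1992), which additionally penalizes very frequent co-occurrences.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(incidents):
    """Return sorted (person_a, person_b, weight) tuples from incident records."""
    appearances = Counter()    # number of incidents each person appears in
    together = Counter()       # number of incidents each pair shares
    for people in incidents:
        unique = sorted(set(people))
        appearances.update(unique)
        together.update(combinations(unique, 2))
    rows = []
    for (a, b), shared in sorted(together.items()):
        # Stand-in normalization (an assumption, not the concept space formula).
        weight = shared / (appearances[a] + appearances[b] - shared)
        rows.append((a, b, weight))
    return rows

# Hypothetical incident summaries, each reduced to the persons involved.
incidents = [["P1", "P2"], ["P1", "P2", "P3"], ["P3", "P4"], ["P1"]]
network = cooccurrence_network(incidents)
```

Each returned tuple corresponds to one row of the kind of table described above: a pair of persons and the weight of the link between them. Persons who never appear with anyone else produce no rows, which is one way a dataset can yield several disjoint networks.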
With data expressed in a networked format, I employed hierarchical clustering to partition a network into subgroups based on relational strength. I used a complete-link algorithm since it was less likely to be subject to the chaining effect (Jain et al., 1999).
Existing complete-link algorithms vary in space and time complexity (Day & Edelsbrunner, 1984; Defays, 1977; Voorhees, 1986). Although clustering was an offline operation that did not necessarily require high speed, I took into consideration that online dynamic clustering would be needed under some circumstances in the future. Therefore, time complexity was the primary criterion for algorithm selection. The algorithm I chose was an RNN-based complete-link algorithm that used the reciprocal nearest neighbor (RNN) approach developed by Murtagh (1984). It took $O(n^2)$ time and $O(n^2)$ space and was significantly faster than other algorithms that typically required $O(n^3)$ time (Roussinov & Chen, 1999).
Co-occurrence weights generated in the previous stage were first transformed into distances/dissimilarities. Since I was employing a complete-link algorithm, the distance between two clusters was defined as the distance between the farthest pair of nodes drawn from each cluster.
Initially, the algorithm treated each node as a cluster and then arbitrarily selected a cluster and incrementally built for it a nearest-neighbor chain (NN-chain). In an NN-chain, each cluster was the nearest neighbor of its previous cluster. A chain terminated with two clusters that were the nearest neighbor of each other. The two nearest clusters were then merged into a larger cluster and the dendrogram was updated. The algorithm kept merging nearest clusters until all the nodes were merged into one big cluster. The resulting hierarchy had multiple levels and each level corresponded to a specific partition of a network.
Since the previous stage created multiple disjoint networks, I modified the algorithm to make it generate a separate cluster hierarchy for each network. The hierarchies generated were stored in a database for later use. Figure 4.4 presents the pseudocode of the modified algorithm.
Form a cluster for each node;
while at least one between-cluster distance is less than infinite do
    currentCluster = an arbitrary cluster; found = false;
    while not found do
        find the nearest neighbor, C, to the currentCluster;
        if isRNN(C, currentCluster) then
            merge C and currentCluster into a larger cluster; update the dendrogram; found = true;
        else
            currentCluster = C;
    end while
end while
Figure 4.4: The pseudocode of the modified version of the RNN-based complete-link algorithm.
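The procedure in Figure 4.4 can be sketched in Python as follows. This is a minimal illustration rather than the implementation described here: the function name and the example distance matrix are hypothetical, pairs of nodes in different networks are assumed to be marked with an infinite distance, and the nearest neighbor is found by a plain linear scan, so the sketch does not attain the $O(n^2)$ bound quoted above.

```python
import math

def complete_link_rnn(dist):
    """Complete-link clustering by repeatedly merging reciprocal nearest neighbors.

    dist is a symmetric matrix of dissimilarities; math.inf marks pairs of
    nodes that belong to different (disjoint) networks.
    Returns (merges, roots): merges lists (cluster_a, cluster_b, distance) in
    merge order, and roots holds the top cluster of each separate hierarchy.
    """
    clusters = {i: (i,) for i in range(len(dist))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i != j}
    merges = []
    next_id = len(dist)

    def nearest(c):
        return min((k for k in clusters if k != c), key=lambda k: d[c, k])

    def has_finite_neighbor(c):
        return any(d[c, k] < math.inf for k in clusters if k != c)

    while True:
        current = next((c for c in clusters if has_finite_neighbor(c)), None)
        if current is None:          # every remaining distance is infinite
            break
        while True:
            c = nearest(current)
            # Reciprocal test that tolerates ties: current is at least as
            # close to c as c's own nearest neighbor.
            if d[c, current] <= d[c, nearest(c)]:
                break
            current = c              # extend the nearest-neighbor chain
        merges.append((clusters[current], clusters[c], d[current, c]))
        members = tuple(sorted(clusters.pop(current) + clusters.pop(c)))
        for k in clusters:
            # Complete link: distance to the merged cluster is the farthest pair.
            d[next_id, k] = d[k, next_id] = max(d[current, k], d[c, k])
        clusters[next_id] = members
        next_id += 1
    return merges, sorted(clusters.values())

# Hypothetical data: nodes 0-2 form one network and nodes 3-4 another.
INF = math.inf
dist = [[0.0, 0.2, 0.9, INF, INF],
        [0.2, 0.0, 0.5, INF, INF],
        [0.9, 0.5, 0.0, INF, INF],
        [INF, INF, INF, 0.0, 0.3],
        [INF, INF, INF, 0.3, 0.0]]
merges, roots = complete_link_rnn(dist)
```

The outer loop stops as soon as only infinite distances remain, so each disjoint network ends with its own hierarchy, here one over nodes 0 to 2 and one over nodes 3 and 4.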
In structural analysis, central member identification and blockmodeling are online operations performed by request.
I used the three centrality measures (degree, betweenness, and closeness) to identify central members in a given subgroup. The degree of a node could be obtained by counting the total number of links a node had to all the other group members. A node’s score of betweenness and closeness required computing the shortest paths (geodesics).
In my implementation, Dijkstra’s classical shortest-path algorithm (Dijkstra, 1959) was used to compute the geodesics from a single node to every other node in a subgroup. Given an undirected graph representing a subgroup $i$ that consisted of $n_i$ nodes, applying the algorithm $n_i - 1$ times could generate the shortest paths between all pairs of nodes in the subgroup. Betweenness of a specific node $u$ was thus obtained by counting the number of geodesics between the other nodes passing through node $u$. Because running Dijkstra’s algorithm once took $O(n_i^2)$ time, the overall time complexity for calculating betweenness of nodes in subgroup $i$ was $O(n_i^3)$.
There are specific algorithms for all-pair shortest path calculations such as Dantzig’s (Dantzig, 1960) and Floyd’s (Floyd, 1962) algorithms. These algorithms’ time complexity is also $O(n^3)$. The advantage of using Dijkstra’s algorithm was that by the time all the geodesics for a specific node were found the computation of the closeness of that node was also finished, because the closeness was simply the sum of the lengths of the geodesics. Thus, closeness was a “byproduct” of betweenness and was obtained with no extra cost.
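The way closeness falls out of the single-source computation can be sketched in Python as follows. The function names and the three-node example are hypothetical, and the sketch uses a binary-heap variant of Dijkstra’s algorithm (via the standard `heapq` module) rather than the $O(n_i^2)$ array-based form assumed in the complexity analysis above.

```python
import heapq

def dijkstra(graph, source):
    """Geodesic lengths from source in a weighted, undirected graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d_v, v = heapq.heappop(heap)
        if d_v > dist[v]:
            continue                       # stale heap entry
        for w, length in graph[v].items():
            candidate = d_v + length
            if candidate < dist.get(w, float("inf")):
                dist[w] = candidate
                heapq.heappush(heap, (candidate, w))
    return dist

def closeness_all(graph):
    """Closeness of every node: the sum of its geodesic lengths."""
    return {u: sum(dijkstra(graph, u).values()) for u in graph}

# Hypothetical subgroup: a chain a - b - c with link lengths 1 and 2.
chain = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 2.0}, "c": {"b": 2.0}}
closeness = closeness_all(chain)
```

For the chain, the middle node b has the smallest sum of geodesic lengths, which agrees with its position between the other two members.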
To extract between-group interaction patterns and the overall structure of a criminal network, I performed blockmodel analysis. Unlike general blockmodel analysis in SNA research, which reveals interaction patterns between network positions based on the structural equivalence measure, my blockmodel analysis examined relationships between subgroups based on the relational strength measure. I decided on this approach based on interviews with the crime investigators from TPD and evidence that crime investigators often are more interested in interaction patterns between subgroups than in those between positions.
Blockmodeling therefore was used to identify interaction patterns between subgroups discovered in the network partition stage. At a given level of a cluster hierarchy, I compared between-group link densities with the network’s overall link density to determine the presence or absence of between-group relationships.
To map a criminal network onto a two-dimensional display, I employed MDS to assign a location to each node in a network of $n$ nodes, given the corresponding $n \times n$ distance matrix. Since distances transformed from co-occurrence weights were quantitative data, I selected Torgerson’s classical metric MDS algorithm (Torgerson, 1952). This algorithm first transformed the distance matrix into a scalar product matrix $B$ by double-centering. It then solved the singular value decomposition (SVD) problem for $B$ to generate an $n \times n$ matrix $X$, the first two columns of which stored the coordinates of the $n$ nodes. The key step in this algorithm was SVD, which could be solved efficiently using the library routine provided by Press et al. (1992).
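The two steps of classical metric MDS can be sketched in pure Python as below. All function names and the example points are hypothetical. In place of the SVD library routine mentioned above, the sketch extracts the two leading components of $B$ by power iteration with deflation, which yields the same components when $B$ is symmetric and positive semidefinite; the fixed start vector is a simplification and could in principle be orthogonal to a leading component for some inputs.

```python
import math

def double_center(dist):
    """Scalar product matrix B obtained by double-centering squared distances."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]

def dominant_eigenpair(b, iterations=500):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(b)
    v = [1.0 / (i + 1) for i in range(n)]       # fixed start vector (assumption)
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:                        # nothing left to extract
            break
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * bv[i] for i in range(n)), v   # Rayleigh quotient, vector

def classical_mds(dist, dims=2):
    """Coordinates whose pairwise distances approximate the given matrix."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        value, vec = dominant_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflate: remove the extracted component before the next pass.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords

# Hypothetical check: distances among the corners of a 3-by-1 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 1.0)]
distances = [[math.hypot(px - qx, py - qy) for (qx, qy) in points]
             for (px, py) in points]
layout = classical_mds(distances)
```

Because the example distances come from points that already lie in a plane, the recovered layout reproduces them up to rotation, reflection, and translation.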
In CrimeNet Explorer a graphical user interface was provided for easy interaction between a user and the system. Figure 4.5 shows screen shots of the system interface.
Each node was labeled with the name of the criminal it represented. Criminal names were scrubbed for data confidentiality. A straight line connecting two nodes indicated that the two corresponding criminals committed crimes together and thus were related.
To find subgroups and interaction patterns between groups, a user could adjust the “level of abstraction” slider at the bottom of the panel. A high level of abstraction corresponded with a high distance level in the cluster hierarchy. At any level of abstraction, a circle represented a subgroup. The size of the circle was proportional to the number of criminals in the subgroup. To view how group members were connected within a subgroup a user could click on the corresponding circle to bring up a small window depicting the group’s inner structure. At the same time, rankings in terms of the three
centrality measures of the group members were listed at the right-hand side of the small window.
Straight lines connecting circles represented between-group relationships. The thickness of a line was proportional to the density of the links between the two corresponding groups. Such a design was different from general blockmodel analysis, which treats a low link density as an indicator of the absence of a between-group relationship. I thought that the absence of a line between two subgroups might possibly cause a user to infer mistakenly that there was no actual link connecting members from the two groups. I therefore kept a line between two groups as long as there was a link between members from the two groups. This design decision could be more informative than the treatment in general blockmodel analysis for crime investigations.
Figure 4.5: CrimeNet Explorer. In this example, the network appeared to be a star structure after performing blockmodel analysis. The vulnerability of this network, therefore, lay in the central members. (a) A 57-member criminal network. Each node is labeled using the name of the criminal it represents. Lines represent the relationships between criminals. (b) The reduced structure of the network. Each circle represents one subgroup labeled by its leader’s name. The size of the circle is proportional to the number of criminals in the group. A line represents a relationship between two groups. The thickness represents the strength of the relationship. (c) The inner structure of the biggest group (the relationships between group members). Centrality rankings of members in the biggest group are listed in a table at the right-hand side.
As discussed previously, the purpose of this chapter is to employ advanced structural analysis and visualization techniques to help discover valuable criminal network structural patterns. The major advantage of CrimeNet Explorer over existing network analysis tools is its structural analysis capabilities.
I conducted system evaluation to answer the following research questions:
• Will the system detect subgroups from criminal networks correctly?
• Will the structural analysis functionality help extract structural properties of criminal networks more effectively and efficiently?
Prior to the system evaluation I carefully examined the TPD datasets and found that networks generated from them varied in size and structure.
I extracted two datasets from TPD databases: (a) incident summaries of narcotics crimes from January 2000 to May 2002, and (b) incident summaries of gang-related crimes from January 1995 to May 2002. Both narcotics and gang-related crimes were organized crimes likely to have been committed by networked offenders. I chose a longer time period for gang data because in each year there were substantially fewer gang-related crimes than narcotics crimes.
I analyzed the sizes of the networks generated from the two datasets. The narcotics dataset consisted of 12,842 criminals from 2,628 networks. The gang dataset consisted of 4,376 criminals from 289 networks. Both datasets contained a single large network (e.g., the 502-member network in the narcotics dataset) and a large number of small networks with fewer than 20 members. The biggest gang network was much larger than the biggest narcotics network although the gang dataset contained fewer criminals.
Table 4.1 provides network-size statistics of the two datasets. Further examination of the incident summaries revealed that members in the large networks (those having more than 20 members) were mostly serial offenders and possibly came from various criminal organizations. In contrast, small networks (those having fewer than 20 members) consisted primarily of “one-time” offenders and would probably be less interesting for a study of criminal organizations and enterprises.
Network size        Narcotics networks          Gang networks
2-20 members        2,618
21-100 members      9
>100 members        (a 502-member network)
Table 4.1: Sizes of networks generated from the two datasets.
In addition to network size, I examined network structures using the blockmodeling function of CrimeNet Explorer. Because it was quite difficult to display the biggest networks in the two datasets on a screen, each having several hundred members, I analyzed only the structures of networks with 21-100 members. I found that the two types of networks had distinguishing structural patterns:
• Two out of the four gang networks studied had a star structure similar to the example in Figure 4.5. The third network had a chain of stars. The fourth network had a star structure with each branch being a smaller star or a clique and its overall structure looked like a snowflake.
• All nine narcotics networks had a chain structure. Three of these networks were chains of stars. One network had a circle in the middle of the chain.
Analysis of network size and structure revealed that gang networks tended to be bigger and more centralized, whereas narcotics networks were smaller and more decentralized.
This finding implied that different strategies could be used to disrupt the two types of networks.
I selected a 60-member narcotics network and a 24-member gang network and used them in a subject study to evaluate CrimeNet Explorer.
To address the research questions, I conducted a controlled laboratory experiment to evaluate system performance. Thirty students from the Department of Management
Information Systems at the University of Arizona participated in the experiment. I used students rather than crime investigators as research subjects based on two considerations.
First, it was difficult to recruit a sufficient number of crime investigators because of their busy work schedules. Second, although the prototype system was designed for criminal network analysis, finding structural patterns from networks of nodes was not a domain-specific task. Student subjects should be able to perform the tasks assigned to them even without domain knowledge in crime investigation.
Each subject participated in four sessions: demographic survey, training, testing, and post-test questionnaire. The demographic survey focused on subjects’ background information such as gender, age, and computer experience. The training session was designed to help subjects understand the major concepts (e.g., subgroups and central members) and gain hands-on experience with the system. During the testing session, subjects performed nine tasks on each of two test networks. They then completed a post-test questionnaire on which they reported their attitudes towards the system’s ease of use and their satisfaction with the system’s functionality.
The 18 tasks used in the experiment were divided into three types: (1) detecting subgroups in a network, (2) identifying interaction patterns between subgroups, and (3) identifying central members within a given subgroup.
4.5.2.1 Task I: Subgroup Detection (Clustering)
I wanted to learn through task I whether the system could achieve performance comparable to that of untrained users when partitioning a network into clusters (subgroups). I asked a domain expert (a detective who had served in law enforcement for more than 20 years) to provide partitions of the two test networks based on his knowledge of narcotics and gang-related crimes. His partitions were used as “gold standards” to evaluate clustering results generated by the system and by the subjects, who represented untrained users.
There is no generally accepted metric for evaluating clustering results (Jain & Dubes, 1988). For the experiment I selected the cluster precision and cluster recall metrics developed by Roussinov and Chen (1999). These two measures examined whether or not a pair of documents was put in the same cluster by human subjects and by the system (Sahami et al., 1998). Based on the same rationale, I defined cluster recall and precision as:
\[
\text{Recall}_{\text{system}} = \frac{\text{Number of node pairs in both system partition and expert partition}}{\text{Number of node pairs in expert partition}} \tag{4.5}
\]
\[
\text{Recall}_{\text{human}} = \frac{\text{Number of node pairs in both human partition and expert partition}}{\text{Number of node pairs in expert partition}} \tag{4.6}
\]
\[
\text{Precision}_{\text{system}} = \frac{\text{Number of node pairs in both system partition and expert partition}}{\text{Number of node pairs in system partition}} \tag{4.7}
\]
\[
\text{Precision}_{\text{human}} = \frac{\text{Number of node pairs in both human partition and expert partition}}{\text{Number of node pairs in human partition}} \tag{4.8}
\]
I developed two hypotheses to compare the clustering results from the system and the human subjects:
• H1: The system and subjects will achieve different clustering recall.
• H2: The system and subjects will achieve different clustering precision.
Since hierarchical clustering generated nested partitions for a network, I selected the partition containing the same number of clusters as in the expert’s partition to be the system’s clustering result. During the experiment, subjects were asked to partition a given network into the same number of clusters as in the expert partition. Although both the system and subjects generated the same number of clusters, they could assign different node pairs in a cluster, resulting in different recall and precision.
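The pairwise recall and precision in equations (4.5)-(4.8) can be sketched in a few lines of Python. This is only an illustration: it assumes that a "node pair in a partition" means two nodes placed in the same cluster, and the function names and the six-node toy partitions are hypothetical, not data from the experiment.

```python
from itertools import combinations


def co_clustered_pairs(partition):
    """Return every unordered node pair whose two nodes share a cluster."""
    pairs = set()
    for cluster in partition:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pairwise_recall_precision(candidate, expert):
    """Compare a candidate partition (system or human) with the expert partition."""
    candidate_pairs = co_clustered_pairs(candidate)
    expert_pairs = co_clustered_pairs(expert)
    shared = candidate_pairs & expert_pairs
    recall = len(shared) / len(expert_pairs) if expert_pairs else 0.0
    precision = len(shared) / len(candidate_pairs) if candidate_pairs else 0.0
    return recall, precision


# Hypothetical toy partitions with the same number of clusters.
expert = [{1, 2, 3}, {4, 5, 6}]
system = [{1, 2}, {3, 4, 5, 6}]
recall, precision = pairwise_recall_precision(system, expert)
print(f"recall={recall:.3f}, precision={precision:.3f}")
```

In the toy case both partitions contain two clusters, yet the misplaced node 3 lowers both measures, which is the effect described in the paragraph above.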
4.5.2.2 Tasks II and III: Interaction Pattern and Central Members Identification
Because the major advantage of CrimeNet Explorer was its structural analysis capability in addition to its network visualization functionality, I was interested in comparing subjects’ performances under two experimental conditions: (1) structural analysis plus visualization, and (2) visualization only.
I considered two general information systems performance metrics (Jordan, 1998):
Effectiveness = the total number of correct answers a subject generated for a given type of task.
Efficiency = the average time a subject spent to complete a given type of task.
Since the system could automatically identify interaction patterns between subgroups and central members within a subgroup, it was expected that a subject could achieve higher efficiency and effectiveness with the help of structural analysis functionality than with only visualization functionality. Specifically, I developed four hypotheses to compare the performance under two experimental conditions:
• H3: A subject will achieve higher effectiveness for interaction pattern identification tasks using the system having both structural analysis and visualization functionality than with that having visualization functionality only.
• H4: A subject will achieve higher effectiveness for central member identification tasks using the system having both structural analysis and visualization functionality than with that having visualization functionality only.
• H5: A subject will achieve higher efficiency for interaction pattern identification tasks using the system having both structural analysis and visualization functionality than with that having visualization functionality only.
• H6: A subject will achieve higher efficiency for central member identification tasks using the system having both structural analysis and visualization functionality than with that having visualization functionality only.
The domain expert validated answers to all the questions for tasks II and III. To eliminate a learning effect, the orders of experimental conditions and tasks were randomized for each test network.
For task II, subjects were asked to answer two questions regarding the interaction patterns between subgroups:
• Given two subgroups, determine whether they were related;
• Given three subgroups (e.g., A, B, and C), determine whether group A had more interactions with group B than with group C.
For task III, subjects were asked to identify central members with the highest degree. I did not assign tasks of identifying central members with the highest betweenness and closeness because these two measures required computation of shortest paths, which were difficult for subjects to find under the visualization-only condition. I therefore included only degree for fair comparison between the two experimental conditions.
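The degree-based task can be illustrated with a short Python sketch. It assumes, for illustration only, that a member's degree counts every link incident on that member, whether the other endpoint is inside or outside the subgroup; the function name and the toy network are hypothetical.

```python
def highest_degree_members(edges, subgroup):
    """Return (members, degree) for the subgroup member(s) with the most links."""
    if not subgroup:
        raise ValueError("subgroup must contain at least one member")
    degree = {node: 0 for node in subgroup}
    for a, b in edges:
        # Assumption: a link counts toward a member's degree even if its
        # other endpoint lies outside the subgroup.
        if a in degree:
            degree[a] += 1
        if b in degree:
            degree[b] += 1
    top = max(degree.values())
    return sorted(node for node, d in degree.items() if d == top), top


# Hypothetical toy network; "E" lies outside the subgroup of interest.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
central, top_degree = highest_degree_members(edges, {"A", "B", "C", "D"})
print(central, top_degree)
```

Counting degree needs only a single pass over the links, whereas betweenness and closeness would require shortest paths, which is why degree was the measure that could also be worked out by hand.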
For tasks II and III, subjects were encouraged to complete the tasks as quickly as possible.
Each subject’s task completion time was recorded. On average, it took a subject 30-45 minutes to complete all 18 tasks.
4.5.3.1 Quantitative Analysis
Clustering recall and precision. H1 and H2 were supported. Paired t-tests showed that the system’s clustering recall and precision were significantly higher than subjects’ (recall: t = 4.39, p < 0.001; precision: t = 5.33, p < 0.001). Table 4.2 gives the recall and precision rates of the system and the subjects. Numbers in parentheses are standard deviations.
            Human          System
Recall      0.86 (0.07)    0.93 (0.00)
Precision   0.77 (0.03)    0.91 (0.00)
Table 4.2: Clustering recall and precision.
I believe that the difference in clustering recall and precision resulted from visual clues that subjects relied on when performing clustering tasks.
• In the experiment, the domain expert based his partitioning of the test networks on his knowledge of network members and grouped criminals who frequently hung out together into the same clusters. His judgment of clusters was not affected by visual clues from the network layouts.
• The system neither had domain knowledge nor was affected by visual clues from the network layouts. Thus, partitioning of the networks depended entirely on link weights (relational strengths). Since relational strength was determined by the frequency with which two criminals committed crimes together, it could reflect reality relatively accurately. Therefore, partitions generated by the system closely resembled the expert’s partitions.
• Untrained subjects had to rely entirely on relative locations of nodes in the visual display of networks to determine relational strength between criminals. Visual clues thus could affect subjects’ judgment heavily. When a network display was distorted (because of the dimensionality problem in the MDS algorithm), a subject could group weakly related criminals into one cluster if they appeared to be close visually. The test networks used in the experiment suffered from the distortion problem, which may have caused the subjects’ clustering recall and precision to be worse than those of the system.
Effectiveness. H3 and H4 were not supported. I performed paired t-tests for both tasks II and III to compare the effectiveness under the two experimental conditions (Task II: t = 1.41, p > 0.05; Task III: t = 1.80, p > 0.05). These results implied that the analysis functionality did not yield significantly higher effectiveness. Table 4.3 shows the results.
                              Task type 2    Task type 3
Visualization plus analysis   3.90 (0.31)    3.30 (1.02)
Visualization only            3.73 (0.59)    3.20 (1.13)
Table 4.3: Effectiveness.
Two factors may explain this result.
• For both tasks II and III, a subject could obtain a correct answer by counting lines on the network display under the visualization-only condition. For example, to compare the frequency of interactions between one group (A) and the other two groups (B and C), a subject could count the number of lines between A and B, and the number of lines between A and C. A simple comparison of these two numbers would suggest which two groups had more frequent interactions. As long as the subject was careful, he or she could find the correct answer.
• The two test networks used in this experiment were not very large, making these two types of tasks relatively simple.
Efficiency. The paired t-tests for efficiency comparison supported both H5 and H6 (task II: t = 6.92, p < 0.001; task III: t = 10.66, p < 0.001). This means that subjects could achieve significantly higher efficiency under the visualization plus analysis condition than under the visualization-only condition. Table 4.4 shows the efficiency statistics.
                              Task type 2     Task type 3
Visualization plus analysis   7.13 (2.19)     6.24 (3.85)
Visualization only            12.10 (4.81)    26.93 (12.45)
Table 4.4: Efficiency.
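For readers unfamiliar with the paired t-test used throughout this section, the statistic can be sketched as follows. The sketch uses only the Python standard library; the function name and the four pairs of completion times are hypothetical and are not the measurements collected in the experiment.

```python
import math
import statistics


def paired_t_statistic(condition_a, condition_b):
    """Return (t, degrees of freedom) for paired measurements of the same subjects."""
    if len(condition_a) != len(condition_b) or len(condition_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical completion times for four subjects under two conditions.
visualization_only = [12.0, 15.0, 9.0, 14.0]
visualization_plus_analysis = [7.0, 8.0, 6.0, 7.0]
t, df = paired_t_statistic(visualization_only, visualization_plus_analysis)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The test is paired because each subject contributes one measurement per condition, so the statistic is computed on within-subject differences rather than on the two group means separately.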
The results implied that with the help of structural analysis functionality subjects could identify interaction patterns among subgroups and the central members in a given subgroup significantly faster. Under the visualization plus analysis condition, a subject did not have to count lines manually to identify interaction patterns between groups because a straight line between two groups implied the presence of a between-group interaction. At the same time, the thickness of the line indicated the frequency of the interaction. In addition, the degrees of all group members were computed by the system so that a subject could find the one with the highest degree directly from the centrality table on the interface.
In summary, the structural analysis functionality provided by the system could significantly improve efficiency of network analysis tasks although the gain in effectiveness was not significant. Moreover, the system could identify subgroups of a network significantly better than untrained subjects.
4.5.3.2 Qualitative Feedback
Most subjects reported that features provided by the system were easy to learn and easy to use. For example, it was easy to adjust the slider to view different partitions at different abstract levels; it was convenient to visualize the inner structure of a subgroup in a small window. The table used to list degree rankings of group members was similar to an Excel spreadsheet and easy to understand.
Subjects’ negative comments about the system were primarily concerned with network layout and network partitions.
• Network layout. Many subjects felt that the network was too cluttered in some areas where nodes were so close to each other that labels overlapped and were hard to read.
• Network partition. Most reported the difficulty of deciding where to put nodes that had many connections to nodes from different groups. They said they wished overlapping groups could be allowed so that some very popular nodes could belong to more than one group. However, hierarchical clustering algorithms always generated mutually exclusive clusters that did not overlap.
The domain expert also provided positive feedback. He said he had enjoyed using this system and believed that CrimeNet Explorer could be very useful for crime investigation in the following ways:
• Increasing work productivity. With the structural analysis functionality of CrimeNet Explorer, a large amount of investigation time could be saved.
• Assisting training for new crime investigators. New investigators who did not have sufficient knowledge about local criminal organizations could use the system to grasp the essence of the networks and crime history quickly. They would not have to spend a significant amount of time studying hundreds of incident reports.
• Suggesting investigative leads that might otherwise be overlooked.
• Assisting prosecution. Known relationships between individual criminals and criminal groups would be helpful to the prosecution when seeking to prove guilt in court.
Overall, the results of the quantitative and qualitative analysis showed that the system could be efficient and useful for extracting criminal network knowledge from large volumes of data.
Network structure mining is important for understanding the structure and organization of criminal enterprises. Advanced, automated techniques and tools are needed to extract knowledge about criminal networks efficiently and effectively. Such knowledge could help intelligence and law enforcement agencies enhance public safety and national security by developing comprehensive disruptive strategies to prevent and respond to organized crimes such as terrorist attacks and narcotics trafficking. I proposed in this chapter several techniques for automated criminal network analysis and visualization to help network creation, network partition, structural analysis, and network visualization.
The main contribution is the proposal of a series of procedures to guide structural pattern mining in the criminal network analysis domain. I incorporated various techniques to automatically extract valuable criminal network knowledge from large volumes of data.
Most of these techniques originated in other disciplines and initially were not intended for knowledge discovery. For example, the concept space approach, originally designed to generate automated thesauri from textual documents, was used to identify criminal relationships from crime incident summary data. The blockmodeling approach in SNA research was designed for validating theories of social structures and focused on interactions between “positions” of network members who were similar in social status and roles. I used the blockmodeling approach to extract interaction patterns among criminal groups in which members were closely related.
The prototype system, CrimeNet Explorer, has structural analysis functionality to detect subgroups, to identify between-group interaction patterns, and to identify central members of subgroups. Quantitative evaluation of the system demonstrated that subjects could achieve significantly higher efficiency with the help of structural analysis functionality than with only network visualization. No significant gain in effectiveness was present, however. Feedback from the subjects and the domain expert showed that CrimeNet Explorer was very promising and could be useful for crime investigation.
This chapter focuses on the identification of groups in unweighted networks. In Chapter 4, I showed how to partition the criminal networks using hierarchical clustering algorithms. These criminal networks were weighted by the co-occurrence of criminal names in crime reports. In unweighted networks, however, all links are essentially equally weighted. Conventional hierarchical clustering algorithms will fail because they cannot determine the order in which to merge or divide clusters. In this chapter I propose an edge local density measure to approximate the weight of a link based on the local link structure. This measure can be incorporated into both single-pass and iterative clustering algorithms to find groups in unweighted networks.
A group is also called a community (Gibson et al., 1998; Newman & Girvan, 2004), cluster (Wasserman & Faust, 1994; Xu & Chen, 2005), compartment (Krause et al., 2003), or module (Ravasz et al., 2002; Rives & Galitski, 2003). A group is a set of nodes connected by dense or strong links. A group in a social network can be a set of social actors with similar background and socioeconomic status (Galaskiewicz & Krohn, 1984; Wasserman & Faust, 1994). A group in a citation network may be a collection of articles of a specific research specialty or paradigm (Chen et al., 2001; Culnan, 1986; Garfield, 2001). A Web community is a group of Web pages whose authors share similar interests (Flake et al., 2000; Gibson et al., 1998; Kumar et al., 1999).
Finding groups or identifying the community structure of networks has important empirical implications because the community structure of a network often relates to the function of the system. For example, biological components such as proteins are organized in modules in cells. It has been found that the modular structure is critical to the survival of cells because harmful effects of, or attacks on, a single module can be confined to that module without affecting other modules (Ravasz et al., 2002; Rives & Galitski, 2003). In the context of Web mining, identifying Web community structure can be of great help for designing focused crawlers, developing Web portals, and improving search engine performance (Flake et al., 2000; Imafuji & Kitsuregawa, 2002; Kumar et al., 1999).
Researchers have long been working on the development of effective and efficient graph partition techniques for finding groups and identifying community structure in unweighted networks. The generic form of unweighted graph partitioning is an NP-complete problem, for which no polynomial-time algorithm is known (Flake et al., 2000).
Various approximation methods have been proposed to address this problem. However, although some existing algorithms are rather effective, most are inefficient, which limits their applicability to large networks. It is desirable to develop methods that balance efficiency and effectiveness according to the demands of different applications. In addition, general guidance is needed for selecting appropriate clustering methods in different situations where efficiency and effectiveness are valued differently. To address these issues I propose the edge local density measure in this chapter.
The remainder of this chapter is organized as follows. In Section 5.2 I review related work on network partition. Section 5.3 presents the design of the local density measure.
Section 5.4 discusses the experimental design, hypotheses, and results of the performance evaluation. Section 5.5 concludes this chapter.
Before I review related work for the unweighted graph partitioning problem it is worth briefly reviewing the definition of group and the determination of link weights in weighted networks.
There has not been a widely accepted definition for group in networks (Flake et al., 2000). In social network analysis, whether a subset of actors in a network can be viewed as a group depends on the cohesion of the subset. A subset is a cohesive group if its members connect with each other through stronger or denser links than with actors outside of the subset (Wasserman & Faust, 1994). This definition implies that groups are identified based on link weights in weighted networks and on link density in unweighted networks. In Web community research, a community is defined as a subset of nodes, each of which has at least as many links connecting with nodes in the same subset as it does with nodes in the rest of the network (Flake et al., 2000). This definition is equivalent to the strong community definition given in Radicchi et al. (2004). Radicchi et al. (2004) also define a weak community, in which the total number of links connecting its members is greater than the number of links connecting its members with the rest of the nodes in the network. This chapter follows the definition of cohesive groups in SNA.
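The two community definitions can be checked mechanically on a small graph. The following Python sketch follows the wording of the definitions as stated above (the original formulation in Radicchi et al. may count links differently); the helper names are hypothetical, and the test graph is a 5-node clique joined to a 4-node clique by one link.

```python
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_strong_community(adj, subset):
    """Every member has at least as many links inside the subset as outside."""
    subset = set(subset)
    return all(len(adj[v] & subset) >= len(adj[v] - subset) for v in subset)


def is_weak_community(adj, subset):
    """Links among members outnumber links from members to the rest (as worded above)."""
    subset = set(subset)
    internal = sum(len(adj[v] & subset) for v in subset) // 2  # each link seen twice
    external = sum(len(adj[v] - subset) for v in subset)
    return internal > external


# Two cliques (nodes 1-5 and 6-9) joined by the single link (5, 6).
edges = (list(combinations(range(1, 6), 2))
         + list(combinations(range(6, 10), 2))
         + [(5, 6)])
adj = build_adjacency(edges)
print(is_strong_community(adj, {1, 2, 3, 4, 5}), is_weak_community(adj, {1, 2, 3, 4, 5}))
print(is_strong_community(adj, {5, 6}), is_weak_community(adj, {5, 6}))
```

The first clique satisfies both definitions, whereas the pair of bridge endpoints satisfies neither because most of their links lead outside the pair.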
In weighted graphs each link receives a weight that indicates the link strength and intensity, or the similarity between the two nodes incident on the link. There are many ways to infer link weights between nodes in weighted networks. Roughly speaking, these methods can be categorized into two types: link intensity based and node similarity based.
Link intensity based methods represent the strength or weight of a link based on the frequency of interactions between the two incident nodes. For example, the weight of friendship between two people can be approximated by the frequency with which they meet, make phone calls, or write emails to each other. In scientific collaboration networks, the weight of a collaboration link between two authors often is estimated using the number of times the two authors publish papers together (Barabási et al., 2002; Newman, 2001b). In Chapter 4, I used the co-occurrence weight (Chen & Lynch, 1992) to approximate the frequency with which two criminals commit crimes together (Xu & Chen, 2005).
Similarity based methods infer the similarity between the properties of the two incident nodes. In SNA the weight of a similarity link between two people can be estimated based on how similar they are in terms of their biographical, educational, and socioeconomic background. In document networks, where each node represents a document, the content similarity between documents is often used to approximate the weight of similarity links between documents. The content similarity between documents can be measured by Jaccard (Rasmussen, 1992) or Cosine coefficients, which are widely employed in information retrieval and document categorization applications.
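As a brief illustration of node similarity based weights, the two coefficients can be computed as follows. This is a minimal sketch: documents are assumed to be represented as term-frequency dictionaries, Jaccard is applied to the sets of terms and Cosine to the frequency vectors, and the two toy documents are hypothetical.

```python
import math


def jaccard_coefficient(terms_a, terms_b):
    """Shared terms divided by all distinct terms in the two documents."""
    terms_a, terms_b = set(terms_a), set(terms_b)
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def cosine_coefficient(freq_a, freq_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * freq_b.get(term, 0) for term, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical term frequencies for two short documents.
doc_a = {"gang": 2, "narcotics": 1}
doc_b = {"gang": 1, "network": 1}
jaccard = jaccard_coefficient(doc_a, doc_b)   # dict keys serve as the term sets
cosine = cosine_coefficient(doc_a, doc_b)
print(f"Jaccard={jaccard:.3f}, Cosine={cosine:.3f}")
```

Both weights depend only on the two documents being compared, not on where those documents sit in the network.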
Both the link intensity based and the node similarity based methods rely on information about the intrinsic properties of the links or nodes. They do not consider the structure of the network in which the nodes and links reside.
Given a weighted graph, hierarchical clustering algorithms can be used to find groups based on link weight. As a result, nodes in the same group have stronger links with each other. However, for graphs such as the World Wide Web, citation networks, and other networks where link weight is not available, the partition problem becomes more challenging.
Because link weight information is not available, methods in this category must rely on the graph structure for the partitioning task. As reviewed in Chapter 2, there are three types of unweighted graph partition methods: link analysis based, graph theoretical, and hierarchical clustering. Both link analysis based methods and graph theoretical approaches were proposed for graph partition in the Web context. They require seed nodes to find Web communities and are not appropriate for finding groups in general graphs.
Chapter 2 briefly reviewed recent developments in hierarchical clustering methods for partitioning general unweighted graphs. I provide more details about these new algorithms here.
5.2.3.1 Divisive Algorithms
Divisive algorithms treat a whole network as a single cluster at the beginning and progressively remove links until all links are removed. When deciding which link to remove at each step the GN algorithm (Girvan & Newman, 2002) selects the one with the highest edge betweenness. The algorithm is rather effective for identifying natural groups in various real networks (Girvan & Newman, 2002; Newman & Girvan, 2004; Radicchi et al., 2004). However, it is by no means an efficient algorithm and runs in $O(m^2 n)$ time, where $m$ is the number of links and $n$ is the number of nodes. As reviewed in Chapter 2, the lack of efficiency results from the algorithm’s recomputation of edge betweenness in each iteration and its demand for global traversal of the graph. This algorithm becomes extremely slow when a network contains a few thousand nodes (Newman, 2004c).
The alternative algorithm proposed by Radicchi et al. (2004) reduces the time complexity to $O(\langle k \rangle^2 m^2)$ (Newman, 2004b), where $\langle k \rangle$ is the average degree, by using the edge clustering coefficient (ECC) to approximate edge betweenness. The edge clustering coefficient of a link $(i, j)$ is defined as
\[
ECC_{ij} = \frac{z_{ij} + 1}{\min[(k_i - 1), (k_j - 1)]}, \tag{5.1}
\]
where $z_{ij}$ is the number of triangles to which link $(i, j)$ belongs, $k_i$ and $k_j$ are the degrees of nodes $i$ and $j$, and the denominator is the number of triangles that could possibly include link $(i, j)$. The numerator is $z_{ij} + 1$ so that the coefficient does not vanish when link $(i, j)$ does not belong to any triangle. Because the computation of ECC does not require global graph traversal, Radicchi’s algorithm is slightly faster than the GN algorithm. However, it is worse than the GN algorithm in effectiveness (Radicchi et al., 2004). More importantly, this algorithm has three major disadvantages. First, the definition of ECC sometimes leads to degeneracy. For example, when the degree of one of the incident nodes is 1, the denominator of equation (5.1) becomes 0, leaving the ECC undefined. Second, the algorithm relies on the existence of a large number of triangles in the network. For networks containing few triangles, such as nonsocial networks, the algorithm will fail to find groups (Newman, 2004b, c). Third, although the algorithm runs faster than the GN algorithm, the time complexity is still rather high.
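Equation (5.1) and its degenerate case can be illustrated with a short Python sketch. The helper names and the four-node toy graph are hypothetical, and returning `None` for a zero denominator is a choice made for this sketch rather than the treatment used in the cited algorithm.

```python
def build_adjacency(edges):
    """Map each node to the set of its neighbors in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_clustering_coefficient(adj, i, j):
    """ECC of link (i, j) per equation (5.1); None when the denominator is 0."""
    z_ij = len(adj[i] & adj[j])  # each common neighbor closes one triangle on (i, j)
    denominator = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if denominator == 0:
        return None  # one endpoint has degree 1, so the ratio is undefined
    return (z_ij + 1) / denominator


# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = build_adjacency([(1, 2), (1, 3), (2, 3), (3, 4)])
ecc_triangle_link = edge_clustering_coefficient(adj, 1, 2)
ecc_pendant_link = edge_clustering_coefficient(adj, 3, 4)
print(ecc_triangle_link, ecc_pendant_link)
```

The pendant link (3, 4) shows the first disadvantage listed above: node 4 has degree 1, so the coefficient cannot be evaluated for that link.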
5.2.3.2 Agglomerative Algorithm
In order to improve the efficiency of hierarchical clustering algorithms, Newman (2004c) proposes an agglomerative approach based on the modularity measure. The modularity $Q$ of a graph is defined as
\[
Q = \sum_i \left( e_{ii} - a_i^2 \right), \tag{5.2}
\]
where $e_{ij}$ is the fraction of links in the graph that connect nodes in cluster $i$ with those in cluster $j$, and $a_i = \sum_j e_{ij}$, so that $a_i^2$ is the expected value of $e_{ii}$ if nodes are randomly connected (Newman, 2004c). The modularity indicates how much the graph structure deviates from a random graph, in which no significant community structure exists. $Q$ is 0 if the number of within-group links is no more than would be expected by random chance (Newman, 2004c). At each step, the algorithm seeks a pair of clusters whose merge results in the largest increase or smallest decrease in $Q$. The best partition can be obtained by finding the maximal value of $Q$ along the resulting dendrogram. This is a relatively fast algorithm with $O((m + n)n)$ time complexity. The effectiveness of the modularity algorithm has also been shown to be comparable to that of the GN algorithm. Both the GN and Radicchi’s algorithms are iterative procedures that must update the edge betweenness or ECC of links in each round. The modularity-based algorithm is a “single-pass algorithm” that does not require iterative updates of link weights.
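The modularity of a given partition can be computed directly from the link list. The sketch below is illustrative only: it assumes the usual convention that each link contributes half of its weight to each endpoint's cluster when accumulating $a_i$, and the function name and test graph (a 5-node clique joined to a 4-node clique by one link) are chosen for this example.

```python
from itertools import combinations


def modularity(edges, membership):
    """Q = sum over clusters of (e_ii - a_i^2) for an undirected, unweighted graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    e_within = {}    # e_ii: fraction of links with both ends in cluster i
    a = {}           # a_i: fraction of link ends attached to cluster i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
        if cu == cv:
            e_within[cu] = e_within.get(cu, 0.0) + 1.0 / m
    return sum(e_within.get(c, 0.0) - a_c ** 2 for c, a_c in a.items())


# Two cliques (nodes 1-5 and 6-9) joined by the single link (5, 6).
edges = (list(combinations(range(1, 6), 2))
         + list(combinations(range(6, 10), 2))
         + [(5, 6)])
two_groups = {node: (0 if node <= 5 else 1) for node in range(1, 10)}
one_group = {node: 0 for node in range(1, 10)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(f"Q(two groups) = {q_two:.4f}, Q(one group) = {q_one:.4f}")
```

Placing all nodes in one cluster gives a modularity of 0, while splitting the graph at the bridge link gives a clearly positive value, which is the kind of difference the agglomerative algorithm tracks along the dendrogram.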
In summary, the major problem facing unweighted graph partition is efficiency. Most existing hierarchical clustering algorithms suffer from high time complexities. Some algorithms, such as Radicchi’s algorithm, slightly improve efficiency at the cost of effectiveness. It is desirable to develop methods that help balance effectiveness and efficiency based on the demands of real-world applications. In the next section I propose edge local density, a measure that is potentially helpful for addressing this problem.
To address the problems of existing algorithms, especially Radicchi’s algorithm, I propose a new measure called edge local density for unweighted graphs.
The edge local density measure is derived from graph link density. Recall that the link density of an undirected graph is defined as (Wasserman & Faust, 1994)
\[
d = \frac{m}{n(n - 1)/2}. \tag{5.3}
\]
It is the number of links that actually are present in a network divided by the total possible number of links. The value of $d$ is between 0 and 1. The link density is 1 when we have a complete graph, in which every node is connected with all other nodes. A complete graph is also called a clique (Wasserman & Faust, 1994).
Consider a subgraph representing a group. According to the cohesive group definition, the density of within-group links should be greater than that of between-group links. This implies that every within-group link is involved in a densely-knit neighborhood of links and the between-group links are relatively sparse. Based on this rationale I propose edge local density for measuring the potential of a link $(i, j)$ to be involved in a cohesive group:
\[
LD_{ij} = \frac{m_{ij} + c_{ij}}{n_{ij}(n_{ij} - 1)/2}, \tag{5.4}
\]
where $n_{ij}$ is the total number of nodes in the neighborhood of the link $(i, j)$, $m_{ij}$ is the total number of links in the neighborhood, and $c_{ij}$ is the number of common neighbors of nodes $i$ and $j$. The neighborhood of the link $(i, j)$ includes all nodes that are adjacent to nodes $i$ or $j$, including $i$ and $j$ themselves. The denominator of equation (5.4) is the number of possible links in the neighborhood.
Note that the value of $LD_{ij}$ can be greater than 1 because of the additional term $c_{ij}$. For example, the local density of links in a clique is $1 + \frac{n - 2}{n(n - 1)/2}$, because the maximum number of common neighbors that two nodes can share is $n - 2$. The reason for adding this extra term is based on the observation that nodes in the same group often share many common neighbors, while two nodes belonging to different groups share few or no common neighbors. As a result, the local densities of within-group links are raised further and the between-group link densities are lowered further.
With this local density measure all originally unweighted links receive weights reflecting their local link structures. Thus, nodes in densely-knit groups are connected by strong links and nodes from different groups are separated by weak links. This is illustrated in Figure 5.1. Note that this measure is different from similarity based and link intensity based weights because it relies entirely on the structure of the network rather than on the properties of nodes or links. In addition, the calculation of this measure requires knowledge only of the local structure rather than of the global structure of the whole network.
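The edge local density of equation (5.4) can be sketched in Python as follows. The sketch assumes that the links "in the neighborhood" are those with both endpoints among the neighborhood nodes; the helper names are hypothetical. The test graph is the clique-bridge-clique configuration of Figure 5.2a (nodes 1-5 fully connected, nodes 6-9 fully connected, and a single link between nodes 5 and 6).

```python
from fractions import Fraction
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_local_density(adj, i, j):
    """LD of link (i, j): (m_ij + c_ij) / (n_ij * (n_ij - 1) / 2)."""
    neighborhood = adj[i] | adj[j] | {i, j}
    n_ij = len(neighborhood)
    # Assumption: count links whose two endpoints both lie in the neighborhood.
    m_ij = sum(1 for u, v in combinations(neighborhood, 2) if v in adj[u])
    c_ij = len(adj[i] & adj[j])  # common neighbors of i and j
    return Fraction(m_ij + c_ij, n_ij * (n_ij - 1) // 2)


# Clique on nodes 1-5, clique on nodes 6-9, and the bridge link (5, 6).
edges = (list(combinations(range(1, 6), 2))
         + list(combinations(range(6, 10), 2))
         + [(5, 6)])
adj = build_adjacency(edges)
ld_insider = edge_local_density(adj, 1, 2)
ld_gatekeeper_side = edge_local_density(adj, 1, 5)
ld_bridge = edge_local_density(adj, 5, 6)
print(ld_insider, ld_gatekeeper_side, ld_bridge)
```

Only the neighbors of the two endpoints are inspected for each link, so no global traversal of the graph is needed, and the bridge link receives a lower weight than either kind of within-group link.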
Figure 5.1: The transformation of an unweighted graph into a weighted graph using the edge local density measure. (a) The unweighted graph. (b) The transformed weighted graph, which can be divided into two densely-knit groups.
In this section I illustrate how edge local density works in different situations. At the same time I compare it with ECC (Radicchi et al., 2004). As mentioned in Section 5.2.3.1, a major disadvantage of ECC is that it works only for networks containing many triangles.
For tree-structured networks that contain many nodes of degree 1, Radicchi’s algorithm will fail to find natural groups. In contrast, the local density measure does not depend on the presence of triangles in networks. It can help find groups in more generic networks.
The following five cases illustrate how local density assigns different weights to within-group and between-group links and the necessary conditions for the local density measure to outperform ECC.
Figure 5.2: The five illustrative cases for edge local density. (a) Clique-Bridge-Clique. (b) Tree-Bridge-Tree. (c) Clique-Bridge-Tree. (d) Clique-Clique. (e) Clique-Tree.
Case 1: Clique-Bridge-Clique
This case represents a situation where two densely-knit groups are connected by a few bridge links (see Figure 5.2a). For simplification, I make two assumptions: (a) both groups are cliques, i.e., all within-group nodes are fully connected, and (b) there is only a single bridge link between the two groups. In real networks, these two assumptions will not always be true. Groups are not always complete, and there might be many links between groups.
Let $G_1 = (V_1, A_1)$ and $G_2 = (V_2, A_2)$ be two groups connected by a single link. Because both $G_1$ and $G_2$ are cliques, $m_1 = n_1(n_1 - 1)/2$ and $m_2 = n_2(n_2 - 1)/2$. In addition, it is assumed that $n_1 \geq 3$, $n_2 \geq 3$, and $n_1 \geq n_2$. The graph reduces to a trivial chain structure when $n_1 < 3$ and $n_2 < 3$.
In Figure 5.2a, $G_1$ contains nodes 1–5 and $G_2$ contains nodes 6–9, so $n_1 = 5$ and $n_2 = 4$. Between $G_1$ and $G_2$ there is a bridge link (5, 6), with nodes 5 and 6 acting as gatekeepers. The other nodes, nodes 1–4 and nodes 7–9, are “insiders.” There are three types of links in the network: within-group insider-insider links (bold lines), within-group insider-gatekeeper links (dashed lines), and the between-group gatekeeper-gatekeeper link (dotted line).
Obviously, the same types of links in a group have the same local density values. For example, link (1, 3) and link (2, 4) are equally weighted. Based on the definition of edge local density, the weights of the three types of links are as follows:
• Insider-insider links. Considering link (1, 2) as an example, the neighborhood of nodes 1 and 2 includes nodes 1–5. Thus the local density is
$$LD_{1,2} = \frac{n_1(n_1 - 1)/2 + (n_1 - 2)}{n_1(n_1 - 1)/2} = \frac{13}{10}.$$
• Insider-gatekeeper links. Considering link (1, 5) as an example, the neighborhood of nodes 1 and 5 includes nodes 1–6. Thus the local density is
$$LD_{1,5} = \frac{n_1(n_1 - 1)/2 + 1 + (n_1 - 2)}{n_1(n_1 + 1)/2} = \frac{14}{15}.$$
• Gatekeeper-gatekeeper link. There is only one between-group link, link (5, 6). Its neighborhood includes all nodes in the network since nodes 5 and 6 together connect with all other nodes. The local density is
$$LD_{5,6} = \frac{n_1(n_1 - 1)/2 + n_2(n_2 - 1)/2 + 1}{(n_1 + n_2)(n_1 + n_2 - 1)/2} = \frac{17}{36}.$$
It is expected that the strongest links are insider-insider links and that the weakest links are the bridge links, that is, $LD_{1,2} \geq LD_{1,5} \geq LD_{5,6}$. Because $LD_{1,2} > 1$, $LD_{1,5} < 1$, and $LD_{5,6} < 1$, we have $LD_{1,2} > LD_{1,5}$ and $LD_{1,2} > LD_{5,6}$. However, we do not necessarily have $LD_{1,5} > LD_{5,6}$ because
$$LD_{1,5} - LD_{5,6} = \frac{2(n_1^3 n_2 + n_1^2 n_2 - 2 n_1^2 - 2 n_1 n_2 - n_2^2 + n_2)}{n_1(n_1 + 1)(n_1 + n_2)(n_1 + n_2 - 1)}.$$
There is no guarantee that the numerator of this expression is greater than 0 for arbitrary values of $n_1$ and $n_2$, although $LD_{1,5}$ is greater than $LD_{5,6}$ in this particular example. This means that the bridge link is not necessarily the weakest. On the other hand, it is easy to show that ECC distinguishes well between within-group links and between-group links:
$$ECC_{1,2} = ECC_{1,5} = \frac{(n_1 - 2) + 1}{\min(n_1 - 2,\ n_1 - 2)} = \frac{n_1 - 1}{n_1 - 2} = \frac{4}{3} > 1,$$
$$ECC_{5,6} = \frac{1}{\min(n_1 - 2,\ n_2 - 2)} = \frac{1}{n_2 - 2} = \frac{1}{2} < 1,$$
so that $ECC_{1,2} = ECC_{1,5} > ECC_{5,6}$.
This is one of the situations where local density is worse than ECC.
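The three Case 1 local density values can be checked numerically. The sketch below is illustrative only: it assumes that edge local density is the number of links among the joint neighborhood of the two endpoints plus the number of their common neighbors, divided by the number of possible links in that neighborhood, which is the form consistent with the values 13/10, 14/15, and 17/36 above. The helper names `build_adjacency` and `edge_local_density` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_local_density(adj, i, j):
    """Assumed form of the measure: (links inside the joint neighborhood of
    i and j, plus their common neighbors) / (possible links inside it)."""
    hood = adj[i] | adj[j] | {i, j}
    links = sum(1 for u, v in combinations(hood, 2) if v in adj[u])
    common = len(adj[i] & adj[j])
    possible = len(hood) * (len(hood) - 1) // 2
    return Fraction(links + common, possible)


# Figure 5.2a: a clique on nodes 1-5, a clique on nodes 6-9, bridge (5, 6).
edges = (list(combinations(range(1, 6), 2))
         + list(combinations(range(6, 10), 2))
         + [(5, 6)])
adj = build_adjacency(edges)

ld_insider = edge_local_density(adj, 1, 2)     # insider-insider link
ld_gatekeeper = edge_local_density(adj, 1, 5)  # insider-gatekeeper link
ld_bridge = edge_local_density(adj, 5, 6)      # gatekeeper-gatekeeper link
print(ld_insider, ld_gatekeeper, ld_bridge)    # 13/10 14/15 17/36
```

Exact fractions are used so that the output can be compared directly with the closed-form values for $n_1 = 5$ and $n_2 = 4$.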
Case 2: Tree-Bridge-Tree
In this situation two trees are connected by a single bridge link (see Figure 5.2b). There are only two types of links in this network: insider-gatekeeper links and gatekeeper-gatekeeper links. Let us assume $n_1 \geq 2$, $n_2 \geq 2$, and $n_1 \geq n_2$. Because there are no insider-insider links in either group, $m_1 = n_1 - 1$ and $m_2 = n_2 - 1$. Again using link (1, 5) and link (5, 6) as examples, the local densities of the two types of links are:
• Insider-gatekeeper links:
$$LD_{1,5} = \frac{n_1 - 1 + 1}{n_1(n_1 + 1)/2} = \frac{2}{n_1 + 1} = \frac{1}{3}.$$
• Gatekeeper-gatekeeper link:
$$LD_{5,6} = \frac{n_1 + n_2 - 1}{(n_1 + n_2)(n_1 + n_2 - 1)/2} = \frac{2}{n_1 + n_2} = \frac{2}{9}.$$
Since $n_2 \geq 2$, we have
$$LD_{1,5} - LD_{5,6} = \frac{2(n_2 - 1)}{(n_1 + 1)(n_1 + n_2)} > 0.$$
This implies that the two groups are connected by a relatively weak bridge link. The ECCs for insider-gatekeeper links are indeterminate since the denominators are 0. Therefore, for networks containing no cyclic structure, ECC and Radicchi’s algorithm will fail.
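The contrast in Case 2 can be reproduced with a small sketch. It assumes the same form of edge local density as the worked values above, and it uses the edge clustering coefficient as published by Radicchi et al. (2004), that is, the number of triangles on the link plus one, divided by the smaller of the two endpoint degrees minus one. The star-shaped layout of Figure 5.2b and all function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_local_density(adj, i, j):
    """Assumed form: (links in the joint neighborhood + common neighbors)
    divided by the possible links in the joint neighborhood."""
    hood = adj[i] | adj[j] | {i, j}
    links = sum(1 for u, v in combinations(hood, 2) if v in adj[u])
    possible = len(hood) * (len(hood) - 1) // 2
    return Fraction(links + len(adj[i] & adj[j]), possible)


def edge_clustering_coefficient(adj, i, j):
    """ECC: (triangles on (i, j) + 1) / min(k_i - 1, k_j - 1).
    Returns None when the denominator is 0 (indeterminate)."""
    denominator = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if denominator == 0:
        return None
    return Fraction(len(adj[i] & adj[j]) + 1, denominator)


# Assumed layout of Figure 5.2b: leaves 1-4 attached to gatekeeper 5,
# leaves 7-9 attached to gatekeeper 6, and the bridge (5, 6).
edges = [(1, 5), (2, 5), (3, 5), (4, 5), (5, 6), (6, 7), (6, 8), (6, 9)]
adj = build_adjacency(edges)

ecc_leaf = edge_clustering_coefficient(adj, 1, 5)  # None: node 1 has degree 1
ld_leaf = edge_local_density(adj, 1, 5)            # 1/3
ld_bridge = edge_local_density(adj, 5, 6)          # 2/9
print(ecc_leaf, ld_leaf, ld_bridge)
```

Every within-group link of the tree has an indeterminate ECC, while local density still ranks the bridge below the insider-gatekeeper links.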
Case 3: Clique-Bridge-Tree
In this case a clique is connected with a tree through a bridge (see Figure 5.2c). Let us assume $n_1 \geq 3$, $n_2 \geq 2$, and $n_1 \geq n_2$. We also have $m_1 = n_1(n_1 - 1)/2$ and $m_2 = n_2 - 1$. The local densities of the three types of links are:
• Insider-insider links. This type of link exists only in $G_1$. The values are the same as in Case 1.
• Insider-gatekeeper links. There are two subtypes in this category of link: the links in $G_1$ and the links in $G_2$. The local densities of insider-gatekeeper links in $G_1$ are the same as in Case 1. The local density of insider-gatekeeper links in $G_2$, such as link (6, 8), is
$$LD_{6,8} = \frac{n_2 - 1 + 1}{n_2(n_2 + 1)/2} = \frac{2}{n_2 + 1} = \frac{2}{5}.$$
• Gatekeeper-gatekeeper link. The local density of the bridge link is different from that in Case 1 and is given by
$$LD_{5,6} = \frac{n_1(n_1 - 1)/2 + n_2}{(n_1 + n_2)(n_1 + n_2 - 1)/2} = \frac{7}{18}.$$
As in Case 1, it is obvious that the local densities of insider-insider links are greater than those of the other two types of links. It can be shown that
$$\text{the numerator of } LD_{1,5} - LD_{5,6} = n_1^2 n_2 (2 n_1 - 1) + n_1^2 (n_2^2 - 2) + n_2^2 (n_1 - 2) + 2(n_1 + n_2) - 7 n_1 n_2$$
$$> 3(n_1 n_2)(6 - 1) - 7 n_1 n_2 = 8 n_1 n_2 > 0.$$
This means that the insider-gatekeeper links in the clique are guaranteed to be stronger than the bridge link. In addition, we have
$$\text{the numerator of } LD_{6,8} - LD_{5,6} = n_2 (n_1 - 1)[4 n_2 - n_1 (n_2 - 1)] \begin{cases} > 0 & \text{if } n_1 \leq 4, \text{ or if } n_2 \leq \dfrac{n_1}{n_1 - 4} \text{ and } n_1 \leq 8, \\ < 0 & \text{if } n_1 \geq 9. \end{cases}$$
The insider-gatekeeper links in $G_2$ will be stronger than the bridge link only if the size of the clique is less than 9. When the clique becomes larger, the bridge link becomes stronger than these links. This is another situation where the local density causes some problems. ECC does not work either because it is indeterminate for the insider-gatekeeper links in the tree group.
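The dependence on clique size in Case 3 can be explored with the two closed forms above. This is a sketch rather than part of the original analysis; the function names are hypothetical, and the comparison is strict, so ties (such as $n_1 = 8$, $n_2 = 2$) are not counted as the tree link being stronger.

```python
from fractions import Fraction


def ld_tree_link(n2):
    """LD of an insider-gatekeeper link in the tree group: 2 / (n2 + 1)."""
    return Fraction(2, n2 + 1)


def ld_bridge(n1, n2):
    """LD of the bridge between an n1-node clique and an n2-node tree."""
    links = Fraction(n1 * (n1 - 1), 2) + n2
    possible = Fraction((n1 + n2) * (n1 + n2 - 1), 2)
    return links / possible


def tree_sizes_where_tree_link_wins(n1):
    """Tree sizes n2 (2 <= n2 <= n1) for which LD_{6,8} > LD_{5,6}."""
    return [n2 for n2 in range(2, n1 + 1)
            if ld_tree_link(n2) > ld_bridge(n1, n2)]


# The example of Figure 5.2c uses n1 = 5 and n2 = 4.
print(ld_tree_link(4), ld_bridge(5, 4))  # 2/5 7/18

# For each clique size, list the tree sizes that keep the tree link stronger.
for n1 in range(3, 12):
    print(n1, tree_sizes_where_tree_link_wins(n1))
```

For clique sizes of 9 or more the list is empty for every admissible tree size, which matches the condition stated above.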
Case 4: Clique-Clique
Sometimes a popular node may belong to multiple groups at the same time. This is illustrated in Figure 5.2d, where a common gatekeeper sits between two cliques. In this example, $n_1 = 5$ and $n_2 = 3$. Because of the absence of a bridge link, there are only two types of links, whose densities are as follows:
• Insider-insider links. The values are the same as in Case 1.
• Insider-gatekeeper links. There are two subtypes in this category of link: the links in $G_1$ and the links in $G_2$. The local density of insider-gatekeeper links in $G_1$ is
$$LD_{1,5} = \frac{n_1(n_1 - 1)/2 + n_2(n_2 + 1)/2 + (n_1 - 2)}{(n_1 + n_2)(n_1 + n_2 - 1)/2} = \frac{17}{28},$$
and the local density of the insider-gatekeeper links in $G_2$ is
$$LD_{5,6} = \frac{n_1(n_1 - 1)/2 + n_2(n_2 + 1)/2 + (n_2 - 2)}{(n_1 + n_2)(n_1 + n_2 - 1)/2} = \frac{15}{28}.$$
Because $n_1 \geq n_2$, we have $LD_{1,5} \geq LD_{5,6}$. Thus, based on local density, the gatekeeper is more strongly connected with the larger clique than with the smaller clique. The values of ECC in $G_1$ and $G_2$ are 4/3 and 3/2, respectively. Based on ECC, the gatekeeper is more strongly connected with the smaller clique.
Case 5: Clique-Tree
In this case, one clique in the network is replaced by a tree (Figure 5.2e). It is easy to show that $LD_{1,5} \geq LD_{5,6}$. Similar to Case 4, the gatekeeper is more strongly associated with the clique than with the tree. Again, ECCs for the links in the tree cannot be determined.
The five cases cover general situations in which local density can be compared with ECC. I omit the Tree-Tree situation because two trees connected by a common gatekeeper can be viewed as one single tree. The advantage of local density over ECC is that it is not limited to networks containing cyclic structures. It can be applied to sparser networks such as trees, which do not contain cycles. However, in certain situations, such as Case 1 and Case 3, local density is not guaranteed to reflect the structural role of links optimally.
The local density measure can be used in two types of hierarchical clustering methods:
• Single-Pass Agglomerative Method. “Single pass” means that link weights are computed only once. The idea is to transform an unweighted graph into a weighted one using edge local density. After the transformation each link receives a local-density-based weight so that the weighted graph can be partitioned using agglomerative clustering algorithms (Day & Edelsbrunner, 1984; Jain & Dubes, 1988).
Agglomerative methods merge nodes that are more strongly related first. Consider the five cases above. The insiders in the clique groups in all five cases and the nodes in the tree groups in Case 2 are connected by the strongest links. Thus, these nodes will be merged together first. The gatekeepers are then added to the cliques they belong to. Finally, the two separate groups merge and the whole network becomes one single cluster. Note that the gatekeeper in Case 4 joins the larger clique first. This is rather intuitive because a popular node belonging to multiple groups may be considered a member of the largest group it belongs to. However, it is possible that the two gatekeepers in Case 1 and Case 3 merge together before they are added to their own groups. This is because of the problems discussed in Cases 1 and 3.
The clustering algorithm selected is the same as the one used in Chapter 4: the reciprocal nearest neighbor (RNN) based complete-link algorithm (Murtagh, 1984) with $O(n^2)$ time complexity. Calculating the local densities of all links in the network takes an additional $O(\langle k \rangle^2 m)$ time, assuming every node maintains its own list of neighbors. Therefore, the overall running time is $O(\langle k \rangle^2 m + n^2)$, faster than all existing hierarchical algorithms.
• Iterative Divisive Method. Like existing divisive methods such as the GN algorithm (Girvan & Newman, 2002) and Radicchi’s algorithm (Radicchi et al., 2004), this divisive method recomputes the local densities of all links and removes the weakest links in each iteration. The bridge links in the five cases, except for Cases 1 and 3, will be the ones removed first, breaking each network into two groups. The time complexity of this iterative method is the same as that of Radicchi’s algorithm, $O(\langle k \rangle^2 m^2)$ (Newman, 2004b).
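The iterative divisive idea can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this chapter: it assumes the same form of edge local density as the worked cases, removes one weakest link per iteration (ties broken arbitrarily), and stops at the first split instead of building a full dendrogram. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_local_density(adj, i, j):
    """Assumed form: (links in the joint neighborhood + common neighbors)
    divided by the possible links in the joint neighborhood."""
    hood = adj[i] | adj[j] | {i, j}
    links = sum(1 for u, v in combinations(hood, 2) if v in adj[u])
    possible = len(hood) * (len(hood) - 1) // 2
    return Fraction(links + len(adj[i] & adj[j]), possible)


def components(adj):
    """Connected components, found by depth-first search."""
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        found.append(comp)
    return found


def split_once(edges):
    """Recompute LD and remove the weakest link until the graph splits."""
    adj = build_adjacency(edges)
    removed = []
    while len(components(adj)) == 1:
        links = [(u, v) for u in adj for v in adj[u] if u < v]
        if not links:
            break  # a single isolated node cannot be split further
        u, v = min(links, key=lambda e: edge_local_density(adj, *e))
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append((u, v))
    return removed, components(adj)


# Case 2 (Tree-Bridge-Tree), assuming star-shaped trees around nodes 5 and 6.
edges = [(1, 5), (2, 5), (3, 5), (4, 5), (5, 6), (6, 7), (6, 8), (6, 9)]
removed, groups = split_once(edges)
print(removed, groups)
```

On this network the bridge (5, 6) has the lowest local density (2/9), so it is removed first and the two trees separate.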
For simplification, I call these two local-density-based methods sLD (single-pass) and iLD (iterative), respectively. The sLD method is expected to provide higher efficiency with compromised yet acceptable effectiveness. The iLD method is expected to provide higher effectiveness than Radicchi’s algorithm and higher efficiency than the GN algorithm.
To evaluate the performance of the local density based clustering methods I conducted a series of experiments. These experiments were intended to answer the following research questions:
• How does the edge local density measure perform compared with the edge clustering coefficient (ECC) measure?
• How do the local-density-based clustering methods, sLD and iLD, perform compared with existing hierarchical clustering methods?
• How should an appropriate method be chosen for the different effectiveness and efficiency demands of real applications?
The experiments tested the performance of the proposed methods using simulated network data. The two performance metrics were effectiveness and efficiency. In effectiveness testing, the community structure was predetermined so that the effectiveness of a partition could be objectively measured. The effectiveness metrics used in this chapter were clustering precision, recall, F value, and accuracy. Precision, recall, and F value are frequently used in information retrieval applications and have been used for evaluating clustering effectiveness in document categorization applications (Roussinov & Chen, 1999). In this evaluation, the predetermined partition was called the true partition and the algorithm-generated partition was called the algorithm partition. A node pair was considered correct if it was in both the algorithm partition and the true partition. An incorrect node pair was in the algorithm partition but not in the true partition. That is, the two nodes placed in the same group by the algorithm actually belonged to different groups. A missed node pair was in the true partition but not in the algorithm partition. That is, the two nodes separated into different groups by the algorithm were actually in the same group. The clustering precision and recall were defined as
$$\text{Precision} = \frac{\text{Number of correct node pairs}}{\text{Number of correct node pairs} + \text{number of incorrect node pairs}}, \quad (5.5)$$
$$\text{Recall} = \frac{\text{Number of correct node pairs}}{\text{Number of correct node pairs} + \text{number of missed node pairs}}. \quad (5.6)$$
The precision reflected how accurate a clustering algorithm was and the recall reflected how well the algorithm captured the correct pairs. Because precision can be increased by compromising recall and vice versa, the F value was used to reflect the combined effect of precision and recall (Shaw et al., 1997):
$$F\ \text{value} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \quad (5.7)$$
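Equations 5.5 to 5.7 can be computed directly from two partitions. The sketch below is illustrative; it represents each partition as a dictionary from node to group label, and the function names and the toy partitions are hypothetical.

```python
from itertools import combinations


def same_group_pairs(labels):
    """Node pairs that a partition (node -> group label) puts together."""
    return {frozenset(pair) for pair in combinations(labels, 2)
            if labels[pair[0]] == labels[pair[1]]}


def pair_metrics(true_labels, algorithm_labels):
    """Pair-based precision, recall, and F value (Equations 5.5 to 5.7)."""
    true_pairs = same_group_pairs(true_labels)
    algorithm_pairs = same_group_pairs(algorithm_labels)
    correct = len(true_pairs & algorithm_pairs)
    incorrect = len(algorithm_pairs - true_pairs)  # joined but should be apart
    missed = len(true_pairs - algorithm_pairs)     # apart but should be joined
    precision = correct / (correct + incorrect) if correct + incorrect else 0.0
    recall = correct / (correct + missed) if correct + missed else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Toy example: the algorithm wrongly moves node 3 into the second group.
true_partition = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
algorithm_partition = {1: "x", 2: "x", 3: "y", 4: "y", 5: "y", 6: "y"}
precision, recall, f_value = pair_metrics(true_partition, algorithm_partition)
print(precision, recall, f_value)
```

In this toy case there are 4 correct, 3 incorrect, and 2 missed pairs, giving a precision of 4/7, a recall of 2/3, and an F value of 8/13. Group labels need not match between the two partitions because only co-membership of pairs is compared.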
Because the true partition was known in this evaluation, each node in the network received a label indicating its group membership. Thus, the effectiveness was also evaluated by measuring the percentage of nodes that were assigned correct labels. The clustering accuracy was defined as
$$\text{Accuracy} = \frac{\text{Number of correctly classified nodes}}{\text{Total number of nodes}}. \quad (5.8)$$
In real applications the true partition is often unknown so that nodes do not have associated class labels. In addition, the number of clusters in a network often has to be determined subjectively. In these cases, accuracy cannot be used for evaluating clustering effectiveness.
The efficiency of an algorithm was defined as the algorithm’s running time.
For effectiveness testing, I compared the two local-density-based methods, sLD and iLD, with several existing algorithms: the GN algorithm (Girvan & Newman, 2002), Radicchi’s iterative ECC-based algorithm (iECC) (Radicchi et al., 2004), and the modularity-based algorithm (Newman, 2004c). In addition, to compare the performance of edge local density and ECC, I also included a single-pass ECC method (sECC), which clustered a transformed graph with ECC-based link weights. In the implementation, links with indeterminate ECCs were assigned a large constant number.
There were four categories of hypotheses corresponding to the four effectiveness metrics: precision, recall, F value, and accuracy. Table 5.1 lists the detailed hypotheses for the precision category. The detailed hypotheses for the other three metrics are omitted from Table 5.1 because they are similar to the precision hypotheses with only the metric names changed.
Hypotheses H1.1–H1.5 focused on the precision of the sLD method and H1.6–H1.8 on the iLD method. The rationale behind these hypotheses was as follows:
• The local-density-based methods, sLD and iLD, were expected to be more effective than the ECC-based methods, sECC and iECC (H1.1 and H1.6), because in most of the five cases above local density could better distinguish between different types of links than ECC.
• The single-pass method, sLD, would be less effective than the iterative methods, iLD and iECC (H1.2 and H1.3), because the single-pass method computed link weights only once. It did not recalculate link weights each time the dendrogram was updated.
• Both the sLD and iLD methods were expected to be less effective than the GN algorithm (H1.4 and H1.7), which had been shown to outperform all existing hierarchical clustering methods (Newman, 2004c; Radicchi et al., 2004).
• No research had systematically evaluated the effectiveness of the modularity-based algorithm. H1.5 and H1.8 thus predicted that the modularity-based algorithm would achieve effectiveness comparable with iLD but higher than sLD.

H1: Clustering Precision
  H1.1: The sLD method will achieve higher precision than the sECC method
  H1.2: The sLD method will achieve lower precision than the iECC method
  H1.3: The sLD method will achieve lower precision than the iLD method
  H1.4: The sLD method will achieve lower precision than the GN algorithm
  H1.5: The sLD method will achieve lower precision than the modularity-based algorithm
  H1.6: The iLD method will achieve higher precision than the iECC method
  H1.7: The iLD method will achieve comparable precision with the GN algorithm
  H1.8: The iLD method will achieve comparable precision with the modularity-based algorithm
H2: Recall (detailed hypotheses similar to H1.1–H1.8)
H3: F value (detailed hypotheses similar to H1.1–H1.8)
H4: Accuracy (detailed hypotheses similar to H1.1–H1.8)
Table 5.1: Hypotheses regarding clustering effectiveness.
I compared the efficiency of only three algorithms: sLD, iLD, and the modularity-based algorithm (Newman, 2004c). Their time complexities were $O(\langle k \rangle^2 m + n^2)$, $O(\langle k \rangle^2 n^2)$, and $O(mn + n^2)$, respectively. The single-pass algorithm, sLD, was expected to have the highest efficiency. I did not include sECC and iECC in the comparison because calculating ECC takes the same time as calculating local density.
H5: The sLD method achieves higher efficiency than the iECC algorithm;
H6: The sLD method achieves higher efficiency than the modularity-based algorithm.
5.4.3.1 Effectiveness
I considered the simulated networks used in previous studies (Girvan & Newman, 2002; Newman, 2004c; Radicchi et al., 2004) for effectiveness testing. The network consisted of 128 nodes divided into four groups of equal size. The average degree was set to 16. Nodes in the same group were connected with probability $p_{in}$, and nodes in different groups with probability $p_{out}$. The two parameters $p_{in}$ and $p_{out}$ control the structure of the network. Figure 5.3 presents three illustrative networks, which correspond to low, medium, and high $p_{out}/p_{in}$ ratios. When $p_{out}$ is rather small compared with $p_{in}$, the groups are well separated, with only a few links connecting these densely-knit groups (Figure 5.3a). As $p_{out}$ increases, the boundaries of the groups become more “blurred” and it is harder to identify the groups (Figure 5.3b). When $p_{in}$ and $p_{out}$ are equal ($p_{in} = p_{out} = 16/127 \approx 0.125$ in this example), the network becomes totally random and no group exists (Figure 5.3c).
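A generator for such test networks might look like the sketch below. It is not the program used in the experiments. It assumes that $p_{in}$ is tied to $p_{out}$ by fixing the expected degree, $31 p_{in} + 96 p_{out} = 16$ (each node has 31 potential neighbors in its own group and 96 in the others), which reproduces the value $16/127$ when the two probabilities are equal. The function names and the seed parameter are hypothetical.

```python
import random


def p_in_for(p_out, group_size=32, groups=4, mean_degree=16):
    """p_in implied by p_out under the assumed expected-degree constraint."""
    same_group = group_size - 1               # 31 potential within-group links
    other_groups = group_size * (groups - 1)  # 96 potential between-group links
    return (mean_degree - other_groups * p_out) / same_group


def planted_partition(p_out, group_size=32, groups=4, mean_degree=16, seed=0):
    """Return (node -> group, edge list) for one simulated network."""
    rng = random.Random(seed)
    p_in = p_in_for(p_out, group_size, groups, mean_degree)
    n = group_size * groups
    group = {v: v // group_size for v in range(n)}
    edges = [(u, v)
             for u in range(n) for v in range(u + 1, n)
             if rng.random() < (p_in if group[u] == group[v] else p_out)]
    return group, edges


group, edges = planted_partition(p_out=0.05)
mean_degree = 2 * len(edges) / len(group)
ratio = 0.05 / p_in_for(0.05)
print(len(group), round(mean_degree, 1), round(ratio, 2))
```

Under this assumption, $p_{out} = 0.01$ and $p_{out} = 0.05$ give ratios of about 0.02 and 0.14, matching the settings reported for Figure 5.3.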
Figure 5.3: Three illustrative networks with different $p_{out}/p_{in}$ ratios. (a) $p_{out}/p_{in} = 0.02$ ($p_{out} = 0.01$). (b) $p_{out}/p_{in} = 0.14$ ($p_{out} = 0.05$). (c) $p_{out}/p_{in} = 1.0$ ($p_{out} = 0.125$).
For each specific value of $p_{out}$ I generated 30 networks, which were clustered using the six methods: sLD, sECC, iLD, iECC, GN, and modularity. Because there were four groups in the true partition, the dendrograms generated by these algorithms were cut at the level where the network was divided into four clusters. The effectiveness metrics were recorded and plotted against the $p_{out}/p_{in}$ ratio (Figure 5.4).
In addition, a series of paired t-tests were performed to test the hypotheses. Table 5.2 provides the mean values of the four effectiveness metrics of the six methods. Table 5.3 summarizes the results of the hypothesis testing.
[Figure 5.4 plots: four panels, each showing one effectiveness metric (vertical axis, 0 to 1) against p_out/p_in (horizontal axis, 0 to 1), with one curve per method: Single-Pass Local Density, Single-Pass ECC, Modularity, Iterative Local Density, Iterative ECC, and GN.]
Figure 5.4: Effectiveness results of the six clustering methods: sLD, sECC, iLD, iECC, GN, and modularity. (a) Precision. (b) Recall. (c) F value. (d) Accuracy.
            sLD          sECC         iLD          iECC         GN           Modularity
Precision   0.59 (0.32)  0.54 (0.30)  0.73 (0.31)  0.67 (0.31)  0.71 (0.31)  0.68 (0.31)
Recall      0.54 (0.34)  0.48 (0.30)  0.67 (0.36)  0.58 (0.35)  0.59 (0.40)  0.64 (0.33)
F value     0.56 (0.33)  0.50 (0.30)  0.70 (0.34)  0.61 (0.34)  0.63 (0.38)  0.66 (0.32)
Accuracy    0.67 (0.27)  0.62 (0.25)  0.79 (0.25)  0.76 (0.24)  0.88 (0.13)  0.75 (0.26)
Table 5.2: Mean values of the effectiveness metrics of the six methods. Numbers in parentheses are standard deviations.
Hypothesis                             Precision  Recall  F value  Accuracy
.1: sLD better than sECC               H1.1       H2.1    H3.1     H4.1
.2: sLD worse than iECC                H1.2       H2.2    H3.2     H4.2
.3: sLD worse than iLD                 H1.3       H2.3    H3.3     H4.3
.4: sLD worse than GN                  H1.4       H2.4    H3.4     H4.4
.5: sLD worse than modularity          H1.5       H2.5    H3.5     H4.5
.6: iLD better than iECC               H1.6       H2.6    H3.6     H4.6
.7: iLD comparable with GN             H1.7       H2.7    H3.7     H4.7
.8: iLD comparable with modularity     H1.8       H2.8    H3.8     H4.8
Table 5.3: Summary of hypothesis testing results for effectiveness. Shaded cells indicate confirmed hypotheses. Blank cells are not confirmed hypotheses. All differences are significant with p < 0.001.
• Precision. Hypotheses H1.1–H1.6 were supported. H1.7 and H1.8 were not supported because the precision of iLD was significantly higher than those of the GN algorithm and the modularity-based algorithm. The iLD method significantly outperformed all other methods by identifying more correct node pairs and fewer incorrect node pairs. In particular, local density performed better than ECC in both the single-pass and the iterative methods. This means that local density is a better measure for approximating link weights than ECC.
The sLD method performed worse than GN, modularity, and iECC. There are two possible reasons. First, both the GN algorithm and the iECC method recalculated link weights each time a link was removed. The recalculated link weights reflected the changes in structure and helped improve the performance. Second, both the GN algorithm and the modularity-based algorithm depended on knowledge of the global structure, whereas local density relied only on local link structure.
• Recall. Hypotheses H2.1–H2.3, H2.5, and H2.6 were supported. Similar to the precision results, the iLD method achieved significantly higher recall than all the other methods. Other methods missed more correct node pairs. The sLD method performed better than the sECC method and worse than the modularity-based algorithm, iLD, and iECC. Note that the sLD method achieved recall comparable with that of the GN algorithm (H2.4 thus was not supported). Figure 5.4b shows that when $p_{out}/p_{in} > 0.31$, the GN algorithm was worse than the sLD method, causing its average recall to be comparable with that of the sLD method.
• F value. Consistent with the precision results, hypotheses H3.1–H3.6 were supported. The iLD method appeared to be the best method. The sLD method outperformed sECC.
• Accuracy. Hypotheses H4.1–H4.6 were supported. H4.7 was not supported because the accuracy of the GN algorithm was significantly higher than that of the iLD method. The iLD method was more accurate than the modularity-based algorithm. Thus, H4.8 was not supported.
An interesting pattern was observed when Figure 5.4 was reviewed. Overall, these methods performed equally well for low $p_{out}/p_{in}$ ratios and equally poorly when the ratio approached 1. The difference was most significant for medium values of the ratio. The only exception was the significantly higher accuracy of the GN algorithm across all ratios. The second highest accuracy was achieved by the iLD method. For precision, recall, and F value, the medium range of the $p_{out}/p_{in}$ ratio was roughly between 0.1 and 0.4. The sECC method seemed to be the worst method in terms of all four metrics.
5.4.3.2 Efficiency
To test the efficiency of the different methods I considered the testing networks used in (Radicchi et al., 2004). A series of random networks of increasing size was generated. For each specific size $n$, 30 networks were generated and clustered using sLD, iLD, and modularity. The networks were generated using a Java program running on a desktop computer with a 2.8 GHz CPU. The average running time for each algorithm was recorded and plotted in Figure 5.5. The mean running times for sLD, iLD, and the modularity-based method are reported in Table 5.4.
Both hypotheses H5 and H6 were supported, with p values less than 0.001. Figure 5.5 shows that sLD was the fastest method because it required only a single-pass calculation of link weights. The sLD algorithm was faster than the modularity-based method because it relied on local link structure, while the modularity of a network must be evaluated based on the global structure. The iLD method was also slower than sLD due to its iterative nature.
Network size         sLD          iLD          Modularity
n ≤ 100              157.0        37.4         103.2
100 < n ≤ 10^3       1,185.6      4,539.3      2,205.9
10^3 < n ≤ 10^4      191,192.8    801,811.0    494,119.2
Table 5.4: Mean running times (in seconds) of sLD, iLD, and the modularity-based methods.
[Figure 5.5 plot: running time (vertical axis, 0 to 1,800) against network size n (horizontal axis, 0 to 11,000), with one curve each for sLD, Modularity, iLD, and GN.]
Figure 5.5: The efficiency of the sLD, iLD, modularity-based, and GN algorithms.
In Figure 5.5, the running time of the GN algorithm is also shown for networks with fewer than 1,000 nodes. The running time of the GN algorithm increased very quickly as networks grew. It was much less scalable than the other three methods.
The networks in efficiency testing were rather sparse, that is, $\langle k \rangle^2 \ll n$. When $\langle k \rangle^2$ approaches $n$, the modularity-based method and sLD will achieve similar efficiency.
In summary, the local density measure was better than ECC in approximating link weights. Compared with existing clustering algorithms, the two local-density-based methods, sLD and iLD, also achieved promising performance in terms of effectiveness and efficiency.
The performance experiments also suggest guidance for selecting appropriate clustering methods in different situations:
• For networks that have salient community structures ($p_{out}/p_{in}$ close to 0) or that are rather random ($p_{out}/p_{in}$ close to 1), these algorithms, except for sECC, are almost equally effective or ineffective. Thus, the fastest algorithm, sLD, can be used to find groups in such networks.
• Within the medium range of $p_{out}/p_{in}$, the selection of an algorithm depends on the demands of the particular application:
  o If the application requires a fast algorithm that can partition large networks with compromised yet acceptable effectiveness, the sLD algorithm is a good choice;
  o If efficiency is not the major concern and the network is relatively small, the iLD method outperforms Radicchi’s algorithm (iECC), the GN algorithm, and the modularity-based algorithm in terms of clustering precision, recall, and F value. In addition, it takes significantly less time than the GN algorithm and the modularity-based algorithm.
In this chapter I propose the edge local density measure to approximate link weights based on the structure of unweighted graphs. When the local density is used in a single-pass clustering algorithm, the unweighted graph is transformed into a weighted graph in which each link receives a weight reflecting its local link density. Agglomerative methods such as the complete-link algorithm can then partition the transformed graph. When the local density is used in an iterative method, the local-density-based weights of links are updated in each iteration.
The performance evaluation shows that local density was a better measure than the edge clustering coefficient, which may fail to find groups in a graph with few or no triangles. The single-pass algorithm based on local density was more effective than the ECC-based algorithm and more efficient than iterative algorithms such as the GN algorithm and Radicchi’s algorithm. The iterative algorithm based on this measure outperformed all existing algorithms in terms of clustering precision and recall. These two local-density-based methods strike a better balance between effectiveness and efficiency than existing algorithms.
This chapter contributes to research on the unweighted graph partition problem not only by proposing the new measure but also by providing guidance for selecting appropriate clustering algorithms in different situations.
Future research is needed to evaluate the new measure’s performance in real networks such as the World Wide Web and citation networks.
In recent years scientists have revealed the topological properties of a wide variety of complex systems characterized as large-scale networks (Albert & Barabási, 2002), such as scientific collaboration networks (Newman, 2001b, 2004a), the World Wide Web (Albert et al., 1999), the Internet (Faloutsos et al., 1999), electric power grids (Watts & Strogatz, 1998), food webs (Garlaschelli et al., 2003), and biological networks (Jeong et al., 2000), among many others. Despite the tremendous variation in their components, functions, and sizes, these networks are surprisingly similar in topology (e.g., the power-law degree distribution (Albert & Barabási, 2002; Wasserman & Faust, 1994)). This leads to a conjecture that complex systems are governed by the ubiquitous self-organizing principle (Albert & Barabási, 2002).
One missing piece in this picture, however, is the analysis of the topology of “dark” networks (Raab & Milward, 2003) that are hidden from view yet could have a devastating impact on our society and economy, analogous to the “dark matter” in the galaxy. Terrorist networks, drug-trafficking rings, arms-smuggling networks, gang networks, and many other covert networks are all dark networks. The structure of dark networks is largely unknown due to the difficulty of collecting and accessing reliable data (Krebs, 2001). Do dark networks share the same topological properties with other types of networks? Do they follow the same organizing principle? How do they achieve efficiency under constant surveillance and threats from authorities? How robust are they against attacks? In this chapter I report the topological properties of several covert criminal or terrorist-related networks. I hope not only to contribute to general knowledge of the topological properties of complex systems in a hostile environment but also to provide authorities with insights regarding disruptive strategies.
The remainder of this chapter is organized as follows. In Section 6.2 I briefly review existing network models and their linkages to the function of complex systems. In Section 6.3 I introduce the four terrorist- and criminal-related covert networks under study and the methods I used to collect the data. In Section 6.4 I report the statistical properties of these four networks, test the robustness of these dark networks, and suggest some disruptive strategies. Section 6.5 summarizes the results and points to future research directions.
As reviewed in Chapter 2, network topology has been studied using three models: the random graph model (Bollobás, 1985; Erdös & Rényi, 1960), the small-world model (Watts & Strogatz, 1998), and the scale-free model (Barabási & Albert, 1999). Random networks are characterized by small average path lengths and low clustering coefficients. The degree distribution of a random graph follows the Poisson distribution (Bollobás, 1985). A small-world network also has a small average path length relative to its size but has a rather high tendency to form clusters and groups (Watts & Strogatz, 1998). The degree distribution of scale-free networks (Barabási & Albert, 1999) follows a power law, a skewed distribution that deviates significantly from the Poisson distribution. The power-law distribution takes the form
$$P(k) \sim k^{-\gamma}, \quad (6.1)$$
where $P(k)$ is the degree distribution, indicating the probability that a randomly selected node has exactly $k$ links, and $\gamma$ is the exponent of the distribution, which often takes on a value between 2.0 and 3.0 (Albert & Barabási, 2002).
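To make Equation 6.1 concrete, the sketch below computes an empirical degree distribution and a rough estimate of $\gamma$. It is illustrative only: the least-squares fit on log-log axes is a simple choice assumed here, not a method taken from this chapter, and the function names are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Empirical P(k): the fraction of nodes that have exactly k links."""
    counts = Counter(degrees)
    return {k: c / len(degrees) for k, c in sorted(counts.items())}


def loglog_slope(pk):
    """Least-squares slope of log P(k) against log k.
    For P(k) ~ k^(-gamma), the slope is approximately -gamma."""
    points = [(math.log(k), math.log(p)) for k, p in pk.items()
              if k > 0 and p > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# A small degree sequence and its distribution.
print(degree_distribution([1, 1, 2, 3]))

# Sanity check on exact power-law values (unnormalized, gamma = 2.5).
exact = {k: k ** -2.5 for k in range(1, 50)}
gamma = -loglog_slope(exact)
print(round(gamma, 6))
```

On real degree data the tail of the distribution is noisy, so such a fit gives only a coarse indication of the exponent.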
The analysis of the topology of complex systems has important implications for our understanding of nature and society. Research has shown that the function of a complex system may be affected to a great extent by its network topology (Albert & Barabási, 2002; Newman, 2003b). For instance, the small average path length of the World Wide Web makes cyberspace a very convenient, strongly navigable system, in which any two web pages are on average only 19 clicks away from each other (Albert et al., 1999). It has also been shown that the higher tendency for clustering in metabolic networks corresponds to the organization of functional modules in cells, which contributes to the behavior and survival of organisms (Ravasz et al., 2002; Rives & Galitski, 2003). In addition, networks with scale-free properties (e.g., protein-protein interaction networks) are highly robust against random failures and errors (e.g., mutations) but quite vulnerable to targeted attacks (Albert et al., 2000; Jeong et al., 2001; Solé & Montoya, 2001).
To understand the topology and function of dark networks I studied four terrorist and criminal-related networks:
• The Global Salafi Jihad (GSJ) terrorist network (Sageman, 2004) (see Figure 6.1), which consists of 366 members, including members of Osama Bin Laden’s Al-Qaeda. These terrorists were connected by kinship, friendship, religious ties, and relations formed after they joined the GSJ network. The network was constructed based entirely on open-source data, but all nodes (terrorists) and links (relations) were examined and carefully validated by a domain expert (Sageman, 2004).
• A narcotics-trafficking criminal network (“Meth World”) whose members mainly deal in methamphetamines (Xu & Chen, 2003). Based on data on narcotics-related crimes that occurred in Tucson, Arizona, between 1985 and 2002, I generated a network consisting of 1,349 criminals. Two criminals were considered related if they committed at least one crime together.
• A gang criminal network consisting of 3,917 criminals who were involved in gang-related crimes in Tucson between 1985 and 2002 (Xu & Chen, 2003).
• A terrorist web site network (“Dark Web”) collected based on reliable governmental sources (Chen et al., 2004). I identified 104 web sites created by four major international terrorist groups (Chen et al., 2004), namely, Al-Gama’a al-Islamiyya, Hizballa, Al-Jihad, and Palestinian Islamic Jihad, and their supporters. A link is created between two web sites if at least one hyperlink exists between any two web pages in them.
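The co-offending rule used for the Meth World and gang networks (two people are linked if they appear in at least one crime together) can be sketched as follows. The incident identifiers, person labels, and function name are hypothetical; the actual records and extraction pipeline are not shown in this chapter.

```python
from itertools import combinations

def co_offense_links(incidents):
    """Map each unordered pair of people to the number of incidents they share.

    `incidents` maps an incident id to the list of people involved.
    A pair becomes a link as soon as its count is at least one.
    """
    weights = {}
    for people in incidents.values():
        for a, b in combinations(sorted(set(people)), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights

# Hypothetical incident records (not real data).
incidents = {
    "case-1": ["p1", "p2", "p3"],
    "case-2": ["p2", "p3"],
    "case-3": ["p4"],          # a lone offender creates no link
}
links = co_offense_links(incidents)
```

Keeping the shared-incident count, rather than a bare yes/no link, leaves room for weighting links by how often two people co-offended.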
Figure 6.1: The giant component in the GSJ network, data courtesy of Marc Sageman (2004). The terrorists belong to one of four groups (Sageman, 2004): Bin Laden’s Al-Qaeda or Central Staff (pink), Core Arabs (yellow), Maghreb Arabs (blue), and Southeast Asians (green). Each circle represents one or more terrorist activities, such as the 9/11 attacks and the Bali bombing, which are noted.
Table 6.1 and Table 6.2 present the statistics of the four networks. Each network contains many small components and a single giant component. The 356 terrorists in the giant component of the GSJ network are separated from the remaining 10 terrorists because no valid evidence has been found to connect those 10 to the giant component. The giant components in the Meth World and gang network contain only 68.5% and 57.0% of the nodes, respectively (see Table 6.1). This may be because the data were collected from a single law enforcement jurisdiction, which may not have complete information about all relations between criminals, causing missing links between the giant component and other smaller components. The isolated components in the Dark Web are possibly due to differences in the terrorist groups’ ideologies (Chen et al., 2004).
| | GSJ | Meth World | Gang Network | Dark Web |
|---|---|---|---|---|
| Number of Nodes | 366 | 1349 | 3917 | 104 |
| Number of Links | 1247 | 4784 | 9051 | 156 |
| Size of Giant Component | 356 (97.3%) | 924 (68.5%) | 2231 (57.0%) | 80 (77.9%) |
| Link Density | 0.02 | 0.01 | 0.003 | 0.05 |
| Average Degree, $\langle k \rangle$ | 6.97 | 4.62 | 2.87 | 1.94 |
| Exponent, $\gamma$ | 0.67 | 1.41 | 1.11 | 1.33 |
| Cutoff, $\kappa$ | 15.35 | 23.60 | 14.65 | 34.59 |

Table 6.1: The statistics and parameters in the exponentially truncated power-law degree distribution of the dark networks.
| | GSJ (Real) | GSJ (Random) | Meth World (Real) | Meth World (Random) | Gang Network (Real) | Gang Network (Random) | Dark Web (Real) | Dark Web (Random) |
|---|---|---|---|---|---|---|---|---|
| Average path length | 4.20 | 3.23 | 6.49 | 4.52 | 9.56 | 6.23 | 4.70 | 3.35 |
| Diameter | 9 | 6.00 | 17 | 9.57 | 22 | 16.40 | 12 | 13.16 |
| Clustering coefficient | 0.55 | 0.2×10^-1 | 0.60 | 0.5×10^-1 | 0.68 | 0.6×10^-3 | 0.47 | 0.1×10^-1 |

Table 6.2: Small-world properties of the dark networks. For each network, the metrics in the network (real) and those in the random graph counterpart (random) are presented.
6.4.1.1 Small-World Properties
I focused only on the giant component in each of these networks and performed topology analysis. I found that all these networks are small worlds (see Table 6.2). The average path lengths and diameters of these networks are small with respect to their network sizes. Thus, a terrorist or criminal can connect with any other member in a network through just a few mediators. In addition, these networks are quite sparse, with very low link density (Wasserman & Faust, 1994). These two properties have important implications for the efficiency of the covert network function: transmission of goods and information. Because the risk of being detected by authorities increases as more people are involved, the small path length and link sparseness can help lower risks and enhance efficiency.
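The quantities reported in Table 6.2, together with link density, can be computed for a connected component along the following lines. This is a sketch under stated assumptions: the graph is given as a dictionary of neighbour sets, the clustering coefficient is taken to be the average of the local (Watts-Strogatz) coefficients, and the four-node graph is a made-up example rather than any of the networks studied here.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def small_world_metrics(adj):
    """adj maps each node to the set of its neighbours (connected, undirected)."""
    n = len(adj)
    total, diameter = 0, 0
    for source in adj:
        dist = bfs_distances(adj, source)
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    clustering = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            # Links among the neighbours of u; each one is found twice.
            closed = sum(1 for v in nbrs for w in adj[v] if w in nbrs) / 2
            clustering += closed / (k * (k - 1) / 2)
    links = sum(len(nbrs) for nbrs in adj.values()) / 2
    return {
        "average_path_length": total / (n * (n - 1)),
        "diameter": diameter,
        "clustering_coefficient": clustering / n,
        "link_density": links / (n * (n - 1) / 2),
    }

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
metrics = small_world_metrics(toy)
```

The all-pairs breadth-first search costs on the order of (nodes × links) operations, which is manageable for networks of a few thousand nodes such as those considered here.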
In addition, I calculated the path length from each node to a central node, a measure that is called the “Erdös number” in the collaboration networks of mathematicians (Newman, 2001a). This measure is also related to closeness centrality (Wasserman & Faust, 1994). I found that members in the criminal and terrorist networks are extremely close to their leaders. The terrorists in the GSJ network are on average only 2.5 steps away from Bin Laden, meaning that Bin Laden’s command can reach an arbitrary member through only two mediators. Similarly, the average path length to the leader in the Meth World (Xu & Chen, 2003) is only 3.9. Such a short chain of command means efficient communication. However, special attention should be paid to the Dark Web. Despite the small size of its giant component (80 nodes), its average path length is 4.70, larger than that (4.20) of the GSJ network, whose giant component has more than four times as many nodes (356). Since hyperlinks help visitors navigate between web pages, and because terrorist web sites are often used for soliciting new members and donations (Chen et al., 2004), the relatively long path length may be due to the reluctance of terrorist groups to share potential resources with other terrorist groups.
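The distance-to-leader measure described above amounts to a single breadth-first search from the central node. A small sketch follows; the node labels and the hierarchy are invented for illustration.

```python
from collections import deque

def average_distance_to(adj, leader):
    """Mean shortest-path length from `leader` to every other reachable node."""
    dist = {leader: 0}
    queue = deque([leader])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for node, d in dist.items() if node != leader]
    return sum(others) / len(others)

# Hypothetical two-level command structure (not real data).
toy = {
    "leader": {"lieutenant_1", "lieutenant_2"},
    "lieutenant_1": {"leader", "member_1", "member_2"},
    "lieutenant_2": {"leader", "member_3"},
    "member_1": {"lieutenant_1"},
    "member_2": {"lieutenant_1"},
    "member_3": {"lieutenant_2"},
}
avg_steps = average_distance_to(toy, "leader")
```

A node at distance $d$ from the leader is reached through $d - 1$ mediators, which is how an average of about 2.5 steps translates into roughly two intermediaries.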
The other small-world property, a high clustering coefficient, is also present in these dark networks (see Table 6.2). The clustering coefficients of these four networks are significantly higher than those of their random graph counterparts. Previous studies have also shown evidence of groups and teams in these networks (Chen et al., 2004; Sageman, 2004; Xu & Chen, 2003, Forthcoming). In these groups and teams, members tend to have denser and stronger relations with one another. The communication between group members becomes more efficient, making a crime or an attack easier to plan, organize, and execute (McAndrew, 1999).
In Table 6.1 I also report the average degrees and maximum degrees of the four networks. It can be seen that some terrorists in the GSJ network and some terrorist web sites in the Dark Web are extremely popular, connecting to more than 10% of the nodes in the networks. The assortativity in Table 6.1 indicates the tendency for nodes to connect with others who are similarly popular in terms of degree (Newman, 2003a). The assortativity coefficients of the GSJ and the gang networks are positive, meaning that popular members tend to connect with other popular members. However, the Meth World and Dark Web have negative assortativity coefficients. This may be because the Meth World consists of drug dealers who sold drugs to many individual buyers; the buyers did not connect with many other buyers or dealers. The popular web sites on the Dark Web, on the other hand, received many inbound hyperlinks from less popular web sites.
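Degree assortativity in the sense of Newman can be computed as the Pearson correlation between the degrees found at the two ends of each link. The sketch below assumes a simple undirected edge list without repeated links; the star-shaped example is a stylized stand-in for one dealer selling to several unconnected buyers, not actual Meth World data.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees at either end of each link."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count every undirected link in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs)
    if variance == 0:
        raise ValueError("all link ends have the same degree; coefficient undefined")
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    return covariance / variance

# Stylised example: one dealer linked to four buyers who are not linked to each other.
star = [("dealer", "buyer_1"), ("dealer", "buyer_2"),
        ("dealer", "buyer_3"), ("dealer", "buyer_4")]
r_star = degree_assortativity(star)

# A four-node chain, for comparison.
chain = [("a", "b"), ("b", "c"), ("c", "d")]
r_chain = degree_assortativity(chain)
```

The star is the extreme disassortative case, since every link joins the single high-degree node to a degree-one node.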
6.4.1.2 Scale-Free Properties
Moreover, these dark networks are scale-free systems. The three human networks have an exponentially truncated power-law degree distribution (Amaral et al., 2000; Newman, 2001a),

$$P(k) \sim k^{-\gamma} e^{-k/\kappa}, \qquad (6.2)$$

with exponent $\gamma$ and cutoff $\kappa$ (see Table 6.1 and Figure 6.2). Unlike other types of networks (Albert et al., 1999; Faloutsos et al., 1999; Newman, 2001b; Watts & Strogatz, 1998), whose exponents usually are between 2.0 and 3.0, the dark networks have exponents whose absolute values are fairly small. For small degrees, the degree distribution decays much more slowly than that of other types of networks, indicating a higher frequency for small degrees. At the same time, the exponential cutoff implies that the distribution for large degrees decays faster than is expected for a power-law distribution, preventing the emergence of large hubs that have many links.
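To make the effect of the cutoff in equation (6.2) concrete, the sketch below evaluates a pure and a truncated power law with the GSJ parameters from Table 6.1 ($\gamma = 0.67$, $\kappa = 15.35$) and compares how much probability mass each places on large degrees. The maximum degree of 60 and the tail threshold of 30 are arbitrary choices made for illustration, not values taken from the data.

```python
import math

def pure_power_law(k, gamma):
    """Unnormalised weight k^(-gamma)."""
    return k ** (-gamma)

def truncated_power_law(k, gamma, kappa):
    """Unnormalised weight k^(-gamma) * exp(-k / kappa), as in equation (6.2)."""
    return k ** (-gamma) * math.exp(-k / kappa)

def normalise(weights):
    """Scale a {k: weight} mapping so that the values sum to one."""
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

gamma, kappa = 0.67, 15.35   # GSJ values from Table 6.1
k_max = 60                   # assumed largest degree considered (illustrative)

pure = normalise({k: pure_power_law(k, gamma) for k in range(1, k_max + 1)})
truncated = normalise(
    {k: truncated_power_law(k, gamma, kappa) for k in range(1, k_max + 1)}
)

# Probability of a degree of 30 or more under each form.
tail_pure = sum(p for k, p in pure.items() if k >= 30)
tail_truncated = sum(p for k, p in truncated.items() if k >= 30)
```

At $k = \kappa$ the truncated form is already a factor of $e^{-1}$ below the pure power law, which is why large hubs become rare once degrees approach the cutoff.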
[Figure 6.2, panels (a)-(d): degree distributions plotted against ln(k); plotted series: Data, Pure power-law, Truncated power-law.]
Figure 6.2: The degree distributions of the dark networks. (a) The GSJ network. (b) The Meth World. (c) The gang network. (d) The Dark Web. The truncated power-law distribution fits the data slightly better than the pure power-law distribution for networks (a)-(c).
Two possible reasons have been suggested that may attenuate the effect of growth and preferential attachment (Amaral et al., 2000): (a) the aging effect: as time progresses some older nodes may stop receiving new links, and (b) the cost effect: because maintaining links induces costs (Hummon, 2000), there is a constraint on the maximum number of links a node can have. I believe that the aging effect does exist in the dark networks. In the Meth World, for example, some criminals who were present in the network several years ago may have become inactive due to arrest or death, and thus could not receive new links even though they are still included in the network (see Figure 6.3). Moreover, the cost of links takes the form of risks. Under constant threats from authorities, criminals or terrorists may avoid attaching to too many people, limiting the effects of preferential attachment. Evidence has shown that hubs in criminal networks may not be the real leaders (Sparrow, 1991; Xu & Chen, 2003). Another possible constraint on preferential attachment is trust (Krebs, 2001). This constraint is especially common in the GSJ network, where the terrorists preferred to attach to those who were their relatives, friends, or religious partners (Sageman, 2004).
Figure 6.3: The aging effect in the Meth World. As time progresses, fewer older members stay in the network due to arrest or death. The overall size of the network is increasing, however, due to the addition of new nodes every year.
Because scale-free networks usually are resilient to random failures (Albert et al., 2000), I tested the dark networks’ robustness only against targeted attacks. I simulated two types of attack (Holme et al., 2002): attacks targeting the hubs and attacks targeting the bridges. While hubs are nodes that have many links (high degree), bridges are nodes through which many shortest paths pass (high betweenness (Wasserman & Faust, 1994)). When simulating the attacks I distinguished between two attack strategies (Holme et al., 2002): simultaneous removal of a fraction of nodes based on a measure (degree or betweenness) without updating the measure after each removal, and progressive removal of nodes with the measure being updated after each removal.
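The two removal strategies differ only in whether the ranking measure is recomputed after each removal. The sketch below shows this for hub (degree-based) attacks and reports $S$, here taken as the size of the largest remaining component divided by the original number of nodes. All function names and the seven-node graph are illustrative assumptions, and a betweenness function could be passed as `score_fn` to simulate bridge attacks in the same way.

```python
def largest_component_fraction(adj, n_original):
    """Size of the largest connected component divided by the original node count."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / n_original

def remove_node(adj, node):
    """Delete a node and all links attached to it (in place)."""
    for neighbour in adj.pop(node):
        adj[neighbour].discard(node)

def degree_scores(adj):
    return {u: len(nbrs) for u, nbrs in adj.items()}

def attack(adj, n_remove, score_fn=degree_scores, progressive=True):
    """Remove the n_remove highest-scoring nodes and return S afterwards."""
    work = {u: set(nbrs) for u, nbrs in adj.items()}  # leave the input untouched
    n_original = len(work)
    if progressive:
        for _ in range(n_remove):
            scores = score_fn(work)               # re-rank after every removal
            target = max(sorted(scores), key=scores.get)
            remove_node(work, target)
    else:
        scores = score_fn(work)                   # rank once, then remove in bulk
        ranked = sorted(scores, key=lambda u: (-scores[u], u))
        for target in ranked[:n_remove]:
            remove_node(work, target)
    return largest_component_fraction(work, n_original)

# Hypothetical graph: triangles a-b-c and d-e-f joined through node g.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "g"},
    "d": {"e", "f", "g"}, "e": {"d", "f"}, "f": {"d", "e"},
    "g": {"c", "d"},
}
s_after_one = attack(toy, 1)
s_after_two_simultaneous = attack(toy, 2, progressive=False)
```

Ties are broken alphabetically so that repeated runs give the same result; on real data the tie-breaking rule is a modeling choice that should be stated.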
Figure 6.4: Dark networks’ vulnerability to attacks. (a) Simultaneous attacks (filled markers) and progressive attacks (empty markers) on bridges in the GSJ network. The critical points, $f$, at which the network falls into many small components, are marked on the diagram. It can be seen that progressive attacks are more devastating ($f_p < f_s$). (b) The changes in the average path length of the GSJ network under different attack strategies. (c)-(f) Progressive attacks on the GSJ network (c), the Meth World (d), the gang network (e), and the Dark Web (f). Two types of attacks are used: hub attacks (filled markers) and bridge attacks (empty markers). The plots show that bridge attacks are more devastating ($f_b < f_h$). In (f), $f_b$ and $f_h$ are very close, indicating that hub attacks and bridge attacks can be equally effective in disrupting a pure scale-free network.
Figure 6.4 (a)-(b) presents the comparison between simultaneous and progressive removal of bridges. I plot the changes in $S$ (the fraction of the nodes in the largest component), $\langle s \rangle$ (the average size of the remaining components), and the average path length after a fraction of nodes is removed. The plots show that progressive attacks are more devastating than simultaneous attacks. The progressive attacks are similar to “cascading failures” in power grids, where an initial failure can cause a series of failures because unbearably high traffic is redirected to the next bridge node.
Figure 6.4 (c)-(f) presents the difference between the network reactions to bridge attacks and hub attacks. It shows that dark networks are more sensitive to attacks targeting the bridges than to those targeting the hubs. In a small-world network, which consists of communities and groups, there might be many bridges linking different communities together. Intuitively, when these bridges are removed, the network will quickly fall apart. Note that a bridge may not necessarily be a hub, since a node that connects two communities can have as few as two links. Small-world networks such as the dark networks thus may be more vulnerable to bridge attacks than to hub attacks. In these networks bridges and hubs usually are not the same nodes. The rank order correlations between degree and betweenness in the GSJ, Meth World, and gang networks are 0.63, 0.47, and 0.30, respectively. Note that although bridge attacks are more devastating, strategies targeting the hubs are also fairly effective, since these networks are also scale-free networks (Barabási & Albert, 1999). Hub attacks and bridge attacks can be equally effective in tearing apart a pure scale-free network (e.g., the Dark Web, with a high degree-betweenness rank order correlation of 0.70), in which hubs are also bridges connecting different parts of the network.
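The distinction between hubs and bridges, and the rank order correlation between degree and betweenness, can be illustrated with the sketch below. It computes node betweenness with Brandes' algorithm for unweighted graphs and a Spearman correlation with average ranks for ties. Both are standard techniques assumed here for illustration; the text does not state which implementation was used, and the seven-node graph is hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Node betweenness for an unweighted, undirected graph (Brandes' algorithm)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):          # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}   # each pair was counted twice

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical graph: triangles a-b-c and d-e-f joined only through node g.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "g"},
    "d": {"e", "f", "g"}, "e": {"d", "f"}, "f": {"d", "e"},
    "g": {"c", "d"},
}
bc = betweenness(toy)
nodes = sorted(toy)
rho = spearman([len(toy[u]) for u in nodes], [bc[u] for u in nodes])
```

In this toy graph node g has only two links yet carries every shortest path between the two triangles, so it is the top bridge without being a hub.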
In summary, I examined the structures of several covert networks and found that these networks share many common topological properties with other types of networks. Their efficiency in communication and flow of information, commands, and goods can be tied to their small-world structures, characterized by small average path length and high clustering coefficient. On the other hand, while the dark networks are also governed by self-organizing principles, various constraints on the formation and maintenance of links keep these networks from evolving into pure scale-free networks. This results in a phenomenon that I refer to as “constrained dark networks.” In addition, I found that dark networks are more vulnerable to attacks on the bridges that connect different communities than to attacks on the hubs. This may provide authorities with insights for intelligence and security purposes.
An interesting future research direction is to examine the evolution of dark networks. By comparing a simulated evolution model with real data, I may be able to test the effects of various dynamic mechanisms such as growth (Barabási & Albert, 1999), linear and nonlinear preferential attachment (Jeong et al., 2003; Krapivsky et al., 2000), aging (Amaral et al., 2000), costs (Amaral et al., 2000), and fitness (Bianconi & Barabási, 2001), among many others.
Chapters 3-6 present several studies that focus on the static structural pattern mining part of the computational framework. Various techniques have been developed and employed to locate critical resources (e.g., the key nodes and paths) in networks, reduce network complexity, and capture the topological properties of networks. This chapter shifts the focus to the dynamic pattern mining part of the framework and proposes a composite evolution model based on prior work on network evolution.
Many networks in nature and society are dynamic systems. Identifying the underlying mechanisms that govern the evolution of networks is the key to explaining the function and predicting the behavior of these systems. During evolution the structure of a network may change. Such changes may be reflected in the following three types of dynamics:
• Node dynamics. The number of nodes in a network can change over time. In a growing network (Barabási & Albert, 1999), new nodes are added to the system and the size of the network increases over time. In a decaying network the network size decreases. Citation networks often are growing networks that keep including new papers that cite existing papers in the networks. In reality, a network may undergo both addition and removal of nodes at the same time while the overall size displays a monotonically increasing or decreasing pattern. The overall size of the World Wide Web, for example, has increased from a few thousand pages initially to almost $10^9$ as of 1999 (Lawrence & Giles, 1999), although both page addition and deletion occur every day.
• Link dynamics. The dynamics of links is more complex than node dynamics. First, the number of links may change. New links may be created between existing nodes or between existing nodes and new nodes that are added to the system. Existing links may also break. Second, links may be rewired (Watts & Strogatz, 1998). That is, one end of a link is reconnected to a different node. In this case, although the total number of links in the network is fixed, the structure of the network changes. Third, the strength of a link may not always stay the same. For example, acquaintances may become friends, thereby strengthening their relationship.
• Group dynamics. Although changes in groups result directly from node and link changes, the group is a different unit of analysis that may offer a different view. Group dynamics occur when one group splits, two or more groups merge, existing members leave the group, or new members join the group. All these changes cause the structure of the network to change accordingly.
This chapter is devoted to the modeling of node and link dynamics. Specifically, the proposed composite model is aimed at describing and explaining the evolution processes of patent citation networks using several microscopic mechanisms. Applying this model to the patent citation networks helps discover how these mechanisms interplay and affect network evolution.
A patent is a type of open source document regarding technology innovations. It is a reliable source of information for analyses with various purposes. The analysis of patent content and citation patterns can be used to evaluate technology development, performance, transfer, and trends in technology fields, countries, institutions, and industries. More importantly, patent citation networks are ideal for the study of network evolution, because the time at which new patent documents are added to the citation networks is explicitly available and accurately recorded in large patent databases.
The remainder of this chapter is organized as follows. Section 7.2 reviews related work on network evolution models. Section 7.3 presents the composite model. Evaluation and results are discussed in Section 7.4. Section 7.5 concludes this chapter and points to future research directions.
Recent years have witnessed increased attention to the evolution of scale-free topology, which is found to be ubiquitous in real networks. Research in this area seeks the underlying mechanisms that govern the evolution processes of scale-free networks. As reviewed in Chapter 2, such research can be roughly categorized into two types: descriptive and explanatory. I do not repeat the review of descriptive analysis but provide more details on the explanatory (modeling) analysis in this section.
A number of mechanisms have been proposed in prior research, including growth, preferential attachment, competition, and individual preference, among others.
• Growth. One of the key differences between the random graph model (Erdös & Rényi, 1960) and the scale-free model (Barabási & Albert, 1999) is the latter’s growth assumption. The size of a scale-free network increases rather than stays fixed over time. Moreover, because the number of links also increases at the same time, the average degree of the network is roughly constant (Albert & Barabási, 2002).
• Preferential attachment. Motivated by the rich-get-richer phenomenon, Barabási and Albert (1999) proposed the so-called BA model. In the BA model the probability that an old node receives links from new nodes is proportional to the degree of this old node. As a result, the degree distribution is a power law with a constant exponent, $P(k) \sim k^{-\gamma}$. This implies that whereas a large percentage of the nodes have a small number of links, a small percentage of nodes have a large number of links.
It has been shown that both growth and preferential attachment are indispensable to the emergence of scale-free topology (Barabási et al., 1999; Barabási & Albert, 1999). If a network is not growing it will eventually become fully connected. The absence of preferential attachment, on the other hand, will lead to an exponential degree distribution rather than a power law. The BA model, together with a few other models that provide similar results for power-law distributions (Dorogovtsev et al., 2000; Krapivsky et al., 2000), was among the first models to explain the evolution of scale-free networks.
However, the BA model is subject to several weaknesses. First, it predicts that the asymptotic value of the power-law exponent ($\gamma$) is 3 (Barabási et al., 1999). However, empirical studies show that many real networks have exponents ranging between 2 and 3. Second, the degree distributions of many real networks deviate from a strict power-law curve (Amaral et al., 2000), which appears as a straight line on a log-log plot. Some of the curves have a unimodal body and a power-law tail (Pennock et al., 2002), and some others have an exponential cutoff (Jeong et al., 2001; Newman, 2001b). Third, the BA model implies that old nodes in a network will be more popular than younger nodes because they have had more time to acquire links. However, in real networks we often see that some younger nodes can acquire a large number of links and become new “stars” in a very short period of time. For example, a Web page with excellent content may quickly become more popular than older Web pages with mediocre content. Fourth, the BA model assumes that all new nodes have knowledge of the global structure of a network (Albert & Barabási, 2000; Gómez-Gardenes & Moreno, 2004). That is, new nodes know how many links each old node has. This is not always true. To overcome these weaknesses, researchers have proposed several new models based on alternative mechanisms.
• Competition. In many real systems nodes compete for links. For example, companies compete for customers’ attention in product markets. A new product may quickly dominate a market and wipe out older products because of its superior quality, functionality, or other attractions. The fitness model was proposed to incorporate the effect of competition (Berger et al., forthcoming; Bianconi & Barabási, 2001). The fitness of a node may be considered its intrinsic ability to attract links from others, which increases the node’s competitive advantage. Therefore, it is possible for a younger node with high fitness to have more links than old nodes with low fitness. If a few nodes have extremely high fitness, they will become the “winners” and connect to almost every other node in the network (Pennock et al., 2002).
• Individual preference. The global knowledge assumption shared by both the BA model and the fitness model is not always realistic. In addition, it has been observed that although the degree distribution for the whole Web follows a power law (Broder et al., 2000; Huberman & Adamic, 1999), the degree distributions for specific categories of Web pages, such as company, education, and government pages, differ from a power law. Specifically, these distributions have a unimodal body and a power-law tail. To explain this discrepancy Pennock et al. (2002) proposed adding a random mechanism to the BA model. This random mechanism reflects the observation that when a Web page author chooses target pages to link to, he/she may consider not only the target pages’ popularity but also their relevance to his/her needs. In this model, a tuning parameter, $\alpha$, balances preferential attachment against the random mechanism. The analytical and simulation results show that this model fits the category-specific distributions better than the BA model in the Web context.
Another model that explicitly considers individual preference is the degree-similarity mixture model proposed in (Menczer, 2004). The extra mechanism in this model is directly related to the content similarity between the new document and the target document: the more similar the target document’s content is to the new document, the more likely it is to obtain the citation link from the new document.
In addition to the mechanisms reviewed above, researchers have proposed various other mechanisms such as the copying effect (Kleinberg et al., 1999), internal links and link rewiring (Barabási et al., 2002), and the aging effect (Hajra & Sen, 2005). However, there has not been a composite model that incorporates most of these mechanisms. How does a mechanism affect the network structure when other mechanisms also play roles in the evolution? Which mechanism is more responsible for the emergence of a specific topology? Answers to these questions remain unknown. In an attempt to address these questions, I propose a composite evolution model in the next section.
The composite model consists of two general types of microscopic mechanisms:
• Attractiveness of the target node. When a new node is added to the network it must select a set of target nodes to link to. The more attractive an existing node is, the more likely it is to be selected as a target node. Based on prior models, attractiveness can be measured by degree (Barabási & Albert, 1999) and fitness (Bianconi & Barabási, 2001).
• Usefulness of the link. When a new node selects a target node it considers not only how attractive the target node is but also how useful the potential link is. For example, when an author cites other papers he/she probably selects papers that are well-cited and also relevant to his/her own paper. It is unlikely for the author to cite a paper in an irrelevant discipline even though that paper is popular. Similarly, before two corporations decide to become strategic partners they must consider how much they can benefit from the partnership to reduce uncertainty, leverage resources, and gain market power (Stuart, 1998). In the two individual preference models (Menczer, 2004; Pennock et al., 2002) the random and content similarity mechanisms can be viewed as link usefulness effects.
The composite model captures the scaling behavior of node degrees. In this model, the probability ($\Pi_i$) that an old node $i$ acquires a link from a new node is determined by

$$\Pi_i = \alpha \left[ \beta \frac{\eta_i^{\phi} k_i}{\sum_j \eta_j^{\phi} k_j} + (1 - \beta) \frac{\zeta_i}{\sum_j \zeta_j} \right] + (1 - \alpha)\, u(\Theta). \qquad (7.1)$$
The first part of equation (7.1) represents the attractiveness effect and is a function of both the degree ($k_i$) and fitness ($\eta_i$ or $\zeta_i$) of node $i$. The second part represents the usefulness effect and is a function of one or more link-usefulness-related variables ($\Theta$). The parameter $\alpha$ weighs the attractiveness and usefulness effects and takes a value between 0 and 1. Note that in the first part, both $\eta_i$ and $\zeta_i$ refer to the fitness of node $i$. They are used to reflect the two different ways that the fitness effect enters the model: multiplicative and additive. The parameter $\beta$ balances between these two ways and ranges between 0 and 1. The parameter $\phi$ is either 0 or 1, switching the multiplicative fitness off ($\phi = 0$) or on ($\phi = 1$). This model is a composite model because with different parameter settings it reduces to different models. Some of these models have been proposed in prior research.
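The way equation (7.1) collapses into simpler models can be checked numerically. The sketch below evaluates $\Pi_i$ for every existing node; it assumes the usefulness term $u(\Theta)$ is supplied as one value per node that already sums to one, and the degrees and fitness values are made up for illustration.

```python
def composite_probabilities(k, eta, zeta, u, alpha, beta, phi):
    """Evaluate equation (7.1) for every existing node.

    k, eta, zeta : per-node degree, multiplicative fitness, additive fitness
    u            : per-node usefulness value u(Theta), assumed to sum to 1
    alpha, beta  : weights in [0, 1]; phi is 0 or 1
    """
    multiplicative = [(e ** phi) * d for e, d in zip(eta, k)]
    mult_total = sum(multiplicative)
    zeta_total = sum(zeta)
    return [
        alpha * (beta * m / mult_total + (1 - beta) * z / zeta_total)
        + (1 - alpha) * ui
        for m, z, ui in zip(multiplicative, zeta, u)
    ]

# Hypothetical three-node system.
k = [3, 2, 1]
eta = [0.5, 1.0, 2.0]
zeta = [1.0, 1.0, 2.0]
u = [1 / 3, 1 / 3, 1 / 3]   # uniform usefulness, i.e. 1 / N

degree_only = composite_probabilities(k, eta, zeta, u, alpha=1, beta=1, phi=0)
fitness_only = composite_probabilities(k, eta, zeta, u, alpha=1, beta=0, phi=0)
multiplicative = composite_probabilities(k, eta, zeta, u, alpha=1, beta=1, phi=1)
mixed = composite_probabilities(k, eta, zeta, u, alpha=0.7, beta=0.4, phi=1)
```

With $\alpha = 1$, $\beta = 1$, $\phi = 0$ the result depends on degree alone, and with $\alpha = 1$, $\beta = 0$ it depends on $\zeta$ alone, matching the reductions described in the text.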
When $\alpha = 1$, $\beta = 1$, and $\phi = 0$, equation (7.1) reduces to the simple degree model, which is the BA model (Barabási & Albert, 1999),

$$\Pi_i = \frac{k_i}{\sum_j k_j}. \qquad (7.2)$$
I rename the BA model because in this model only preferential attachment based on degree is considered. Other mechanisms such as fitness and link usefulness are omitted.
To derive the functional form of the degree distribution of scale-free topology, Barabási and Albert (1999) use numerical simulations. Initially, there is a small number, $m_0$, of nodes in the system. At each time step, a new node is added to the system. The new node is allowed to link to $m$ ($m \le m_0$) different nodes that are already in the network. When selecting the target nodes the new node makes a decision based on the probability defined in equation (7.2).
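The growth procedure just described can be sketched in a few lines. Two choices below are assumptions rather than part of the description: the $m_0$ seed nodes start fully connected (so that each has a nonzero degree to attach to), and the $m$ distinct targets are drawn by repeated degree-proportional sampling with duplicates rejected.

```python
import random

def simulate_ba(n_nodes, m, m0=None, seed=0):
    """Grow a network by degree-proportional attachment and return its edge list."""
    rng = random.Random(seed)
    m0 = max(m, 2) if m0 is None else m0
    if m0 < 2 or not 1 <= m <= m0:
        raise ValueError("need m0 >= 2 and 1 <= m <= m0")
    # Assumption: the seed nodes form a complete graph.
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Each node appears in `ends` once per link end, so a uniform draw from
    # `ends` picks a node with probability k_i / sum_j k_j, as in equation (7.2).
    ends = [node for edge in edges for node in edge]
    for new in range(m0, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for target in sorted(targets):
            edges.append((new, target))
            ends.extend((new, target))
    return edges

edges = simulate_ba(n_nodes=200, m=2, m0=3, seed=1)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
```

Feeding `degree` into a histogram on log-log axes should, for large enough networks, show the heavy tail that the analytical result below predicts.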
Treating $k_i$ as a continuous variable, the analytical solution to equation (7.2) is (Barabási et al., 1999):

$$k_i(t) = m \left( \frac{t}{t_i} \right)^{0.5}, \qquad (7.3)$$

$$P(k) \sim k^{-3}, \qquad (7.4)$$

where $t_i$ is the time at which node $i$ is added to the system. Equation (7.3) implies that the degree scales with time and that older nodes have an advantage over younger nodes in acquiring links. The degree distribution $P(k)$ is a power law with 3 as the exponent, which is independent of $n$ and $m$.
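The scaling in equations (7.3) and (7.4) can be traced through the standard continuum (mean-field) argument, sketched below under the usual assumptions that the total degree at time $t$ is approximately $2mt$ and that nodes are added at uniformly spaced times. The prefactor $2m^2$ comes from that argument rather than from the text above.

```latex
\begin{align*}
\frac{\partial k_i}{\partial t} &= m\,\Pi_i = m\,\frac{k_i}{\sum_j k_j} \approx \frac{k_i}{2t},
  \qquad k_i(t_i) = m \\
\Rightarrow\quad k_i(t) &= m\left(\frac{t}{t_i}\right)^{1/2} \\
P\bigl(k_i(t) < k\bigr) &= P\!\left(t_i > \frac{m^2 t}{k^2}\right)
  = 1 - \frac{m^2 t}{k^2\,(t + m_0)} \\
P(k) &= \frac{\partial}{\partial k} P\bigl(k_i(t) < k\bigr)
  = \frac{2 m^2 t}{(t + m_0)\,k^3}
  \;\xrightarrow{\;t \to \infty\;}\; \frac{2 m^2}{k^3}
\end{align*}
```

The exponent 3 arises from the square-root growth law alone, which is why it does not depend on $m$.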
If $\alpha = 1$ and $\beta = 0$, the composite model reduces to a simple fitness model,

$$\Pi_i = \frac{\zeta_i}{\sum_j \zeta_j}. \qquad (7.5)$$
This model considers only the effect of the fitness of nodes. This model is rather unique because the effect of preferential attachment, which is believed to be the key to the evolution of scale-free networks in all prior models (Barabási & Albert, 1999; Bianconi & Barabási, 2001; Menczer, 2004; Pennock et al., 2002), is completely excluded. Such a situation may occur when the new node is attracted to an old node that has significantly high fitness. The new node does not care about whether the old node is popular or not.
When $\alpha = 1$, $\beta = 1$, and $\phi = 1$, equation (7.1) becomes a multiplicative fitness model,

$$\Pi_i = \frac{\eta_i k_i}{\sum_j \eta_j k_j}. \qquad (7.6)$$
This model is equivalent to the fitness model proposed in (Bianconi & Barabási, 2001). In this model, $\Pi_i$ is a function of the product of fitness and degree (Bianconi & Barabási, 2001). That is, the fitness and preferential attachment mechanisms interact with each other. As in the BA model, the analytical solution can be derived using mean-field theory (Bianconi & Barabási, 2001):

$$k_{\eta_i}(t) = m \left( \frac{t}{t_i} \right)^{\beta(\eta_i)}, \qquad (7.7)$$

$$P(k) = \int \rho(\eta)\, \frac{C}{\eta} \left( \frac{m}{k} \right)^{\frac{C}{\eta} + 1} d\eta, \qquad (7.8)$$

where $\rho(\eta)$ is the fitness distribution, $\beta(\eta) = \frac{\eta}{C}$, and $C = \int \rho(\eta) \frac{\eta}{1 - \beta(\eta)}\, d\eta$. Equation (7.7) implies that the scaling behavior of degrees depends on the dynamic exponent $\beta(\eta_i)$ and that nodes with higher fitness acquire links faster. The degree distribution given in equation (7.8) is a weighted sum of different power laws.
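Because $C$ appears on both sides of its definition, it has to be found self-consistently. Substituting $\beta(\eta) = \eta / C$ turns the definition into the condition $\int \rho(\eta)\, \eta / (C - \eta)\, d\eta = 1$, which the sketch below solves by bisection. The uniform fitness distribution on $[0, 1]$, the midpoint integration rule, and the search bracket are assumptions chosen for illustration.

```python
import math

def consistency_gap(C, rho, steps=2000):
    """Midpoint-rule value of the integral of rho(eta) * eta / (C - eta)
    over [0, 1], minus 1. The gap is zero at the self-consistent C."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        eta = (i + 0.5) * h
        total += rho(eta) * eta / (C - eta)
    return total * h - 1.0

def solve_C(rho, lo=1.0001, hi=10.0, tol=1e-9):
    """Bisection on C > 1; the gap decreases as C grows."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if consistency_gap(mid, rho) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def uniform(eta):
    """Assumed fitness distribution: rho(eta) = 1 on [0, 1]."""
    return 1.0

C = solve_C(uniform)
max_dynamic_exponent = 1.0 / C   # beta(eta) at the largest fitness, eta = 1
```

For the uniform case the integral has the closed form $-1 - C \ln(1 - 1/C)$, so the solution satisfies $e^{-2/C} = 1 - 1/C$; the numerical root is approximately 1.255, giving a largest dynamic exponent of about 0.80 compared with 0.5 in the BA model.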
It is easy to imagine that in some situations the preferential attachment and fitness effects might work independently: they do not interact, and their combined effect governs the evolution of a network. This situation is described by the additive fitness model when α = 1, 0 < β < 1, and φ = 0,
\[
\Pi_i = \beta \frac{k_i}{\sum_j k_j} + (1 - \beta) \frac{\zeta_i}{\sum_j \zeta_j}. \tag{7.9}
\]
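A minimal sketch of equation (7.9) follows. The function name and the toy degree and fitness values are assumptions chosen for illustration only.

```python
def additive_probabilities(degrees, fitness, beta):
    """Pi_i = beta * k_i / sum(k) + (1 - beta) * zeta_i / sum(zeta)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    k_total = sum(degrees)
    z_total = sum(fitness)
    return [
        beta * k / k_total + (1.0 - beta) * z / z_total
        for k, z in zip(degrees, fitness)
    ]


# Hypothetical two-node network: node 0 is popular but unfit,
# node 1 is fit but unpopular.
probs = additive_probabilities(degrees=[3, 1], fitness=[1.0, 3.0], beta=0.5)
```

With β = 0.5 the two effects cancel in this toy case and both nodes are equally likely to be chosen, which shows how the additive form lets a high-fitness node compensate for a low degree.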
In addition to the four reduced models, the composite model reduces to the unimodal power-law model (Pennock et al., 2002) when 0 < β < 1 and $u(\Theta) = 1/N(t)$, where N(t) is the number of nodes at time t. When 0 < β < 1 and $u(\Theta) \sim (1/\sigma_c - 1)^{-\mu}$, where $\sigma_c$ is the similarity between the new node and old node i, equation (7.1) reduces to the degree-similarity mixture model (Menczer, 2004).
The composite model is rather flexible, allowing different mechanisms to affect network evolution independently or interactively. Models that were not proposed in prior research, such as the simple fitness model, are also made possible. When both α and β take on values between 0 and 1 and φ = 1, the model becomes rather complex and multiple mechanisms can play roles in network evolution simultaneously. Further mechanisms can also be incorporated into the model. For example, the variable set (Θ) of the usefulness function can include other mechanisms in addition to the random and similarity effects.
To ascertain the composite model’s applicability to real networks I used several patent citation networks.
In patent citation networks each node is a patent document. A patent document often contains several standard fields including title, application date, issue date, assignee (the institution to which the patent is assigned), inventors, citations, and technology field classifications, among many others (Huang et al., 2003). For this chapter, I chose the citation networks of nanoscale science and engineering (NSE) related patents. NSE has been very active in recent years and has been recognized as critical to a country's future science and technology competence.
Many countries have established comprehensive patent repositories to facilitate the application, management, and research on patents. Among the various patent systems, that of the US Patent and Trademark Office (USPTO) is the most complete and reliable (Huang et al., 2003).
The test data sets of NSE-related patents were retrieved from the USPTO's patent databases in March 2003. A keyword-based approach was used to retrieve a subset of the NSE-related patents from 1976 to 2002. The keywords used in the retrieval process are provided in (Huang et al., 2003). The number of NSE-related patents collected was 88,546, which covered 418 of 462 first-level US Patent Classification categories of technology fields, including chemistry, drug, etc. From the top 10 technology fields that generated the largest number of patents in the period 1976-2002, I selected four fields: drug, material science, optics, and semiconductor. I extracted the citation links among these patents and created a patent citation network for each of the four fields.
Table 7.1 shows the basic statistics of these four data sets.
                          Drug   Material   Optics   Semiconductor    Total
Total number of patents   8228       8093     6093            3903   26,317
Network size              4377       4156     4377            2247   15,157
Number of links           7548       6867     7099            2772   24,286
Table 7.1: Basic statistics of the four patent citation data sets.
The analysis of patent citation networks is aimed at answering the following research questions:
• How do patent citation networks evolve?
• Do patent citation networks have similar dynamic patterns across different technology fields?
• Are growth and preferential attachment the only mechanisms that are responsible for the emergence of scale-free topology?
• How do other mechanisms affect the network evolution?
I performed both descriptive and explanatory analysis on the patent citation networks.
The descriptive analysis was intended to answer the first two research questions. The general and topology-characterizing statistics were collected for each year for each technology field. I also explored two additional statistics: institutions' productivity distribution and patent content similarity distribution. The major results are summarized as follows.
• The sizes of the networks increased over time.
[Figure 7.1 plots, panels (a)-(d): Size (Drug), Size (Material), Size (Optics), Size (Semiconductor). x-axis: Year, 1980-2000. Series: Total Size, N, M, S, # isolated.]
Figure 7.1: The size dynamics in patent citation networks of the four technology fields. (a) Drug. (b) Material science. (c) Optics. (d) Semiconductor.
Figure 7.1 shows that the size dynamics are very similar across the four technology fields.
Before the mid-1990s the sizes of these fields rose slowly. Since the mid-1990s they have grown rapidly. The total number of patents issued (Total Size), the number of patents in the network (N), and the number of citation links (M) all increased dramatically after the mid-1990s. This was because NSE-related research attracted a substantial amount of attention during that period. The number of isolated patents (# isolated), which neither cited other patents nor were cited, also increased linearly with time. The sizes of the giant components stayed almost constant until the mid-1990s, after which they increased over time. In addition, M rose faster than N, resulting in increasing average degrees (see Figure 7.2).
• The average degree increased over time.
[Figure 7.2 plots, panels (a)-(d): Degree (Drug), Degree (Material), Degree (Optics), Degree (Semiconductor). x-axis: Year, 1980-2000. Series: <k>, <k_in>, <k_out>.]
Figure 7.2: Dynamics of average degrees. (a) Drug. (b) Material Science. (c) Optics. (d)
Semiconductor.
Most prior models predict that the average degree is constant over time. An exception is the model proposed in (Barabási et al., 2002), where the increasing average degree results primarily from new links formed between existing nodes. However, this does not apply to citation networks, since once a patent is issued its citations do not change: it is not possible to form new links between existing patents. Thus, the only possible reason is that, on average, younger patents cited more patents than older patents did. This can be seen from the <k_out> curves in Figure 7.2, in which the average number of citations made by a patent increased over time.
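The argument that younger patents cite more can be checked directly from a citation list by averaging the number of citations made by the patents issued in each year. The sketch below is one possible way to do so; the data layout (a dictionary of issue years and a list of citing/cited pairs), the function name, and the toy records are assumptions, and the per-cohort average computed here is not necessarily the cumulative <k_out> statistic plotted in Figure 7.2.

```python
from collections import defaultdict


def mean_out_degree_by_year(issue_year, citations):
    """Average number of citations made by patents issued in each year.

    issue_year: dict mapping patent id -> issue year
    citations:  iterable of (citing_patent, cited_patent) pairs
    """
    out_degree = defaultdict(int)
    for citing, _cited in citations:
        out_degree[citing] += 1

    totals = defaultdict(int)
    counts = defaultdict(int)
    for patent, year in issue_year.items():
        totals[year] += out_degree[patent]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(counts)}


# Hypothetical toy data: two patents from 1990, two from 2000.
issue_year = {"A": 1990, "B": 1990, "C": 2000, "D": 2000}
citations = [("B", "A"), ("C", "A"), ("C", "B"),
             ("D", "A"), ("D", "B"), ("D", "C")]
by_year = mean_out_degree_by_year(issue_year, citations)
```

In the toy data the 1990 cohort makes 0.5 citations per patent and the 2000 cohort makes 2.5, the same qualitative pattern the text describes.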
• The clustering coefficients increased over time.
Because the clustering coefficient measures the tendency of nodes to form groups, the increasing clustering coefficients mean that groups of patents discussing related topics became more common over time. This is quite natural because, as a field matures, an initially general topic may develop into different subtopics. Patents mostly cite only patents discussing the same subtopic.
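For readers who want to reproduce this statistic, the sketch below computes an average local clustering coefficient. It assumes the commonly used definition C_i = 2e_i / (k_i(k_i − 1)), where e_i is the number of links among the k_i neighbors of node i, and it treats citation links as undirected. Both choices are assumptions, since this section does not restate the exact definition used.

```python
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph.

    Nodes with fewer than two neighbors contribute 0 to the average.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    total = 0.0
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbors) if neighbors else 0.0


# Hypothetical example: a triangle (1, 2, 3) with a pendant node 4.
c = average_clustering([(1, 2), (2, 3), (1, 3), (3, 4)])
```

Nodes 1 and 2 score 1, node 3 scores 1/3, and node 4 scores 0, giving an average of 7/12.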
• The average shortest paths increased over time.
Prior models (Albert et al., 1999; Bollobás, 1985) predict that the average shortest path should increase logarithmically with network size. However, for all four fields, there was a jump in the average shortest path around the mid-1990s. This implies that during the booming period of NSE-related research patents were more “distant” from each other, possibly as a result of the lack of cross-references between different subtopics.
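The average shortest path can be computed with a breadth-first search from every node. The sketch below assumes an undirected, unweighted view of the network and averages only over pairs that are connected; how unreachable pairs were handled in the dissertation is not stated in this section, so that choice is an assumption.

```python
from collections import deque


def average_path_length(edges):
    """Mean shortest-path length over all connected node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable nodes other than the source
    return total / pairs if pairs else 0.0


# Hypothetical example: a simple chain 1 - 2 - 3 - 4.
apl = average_path_length([(1, 2), (2, 3), (3, 4)])
```

For the four-node chain the six pairwise distances sum to 10, so the average is 10/6.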
[Figure 7.3 plots, panels (a)-(d): Average Path Length (Drug), Average Path Length (Material), Average Path Length (Optics), Average Path Length (Semiconductor). x-axis: Year, 1980-2000.]
Figure 7.3: Dynamics in average path lengths. (a) Drug. (b) Material Science. (c) Optics.
(d) Semiconductor.
• The degree distribution followed a power law, but deviated from it for small degrees.
Overall, these degree distributions followed power laws. However, their bodies deviated significantly from the power-law line for small degrees. The deviation was more severe for the optics and semiconductor fields. The exponents of the power laws for drug, material, optics, and semiconductor were 2.12, 2.24, 2.29, and 2.58, respectively. They were rather close to the value predicted by the simple degree model (Barabási et al., 1999; Barabási & Albert, 1999). However, the simple degree model cannot explain the observed deviation from the power law.
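One simple way to estimate a power-law exponent from a degree sequence is an ordinary least-squares fit on the log-log frequency plot, matching the ln(k) axes of Figure 7.4. The sketch below illustrates that approach; the fitting procedure actually used for the reported exponents is not described in this section, and least squares on log-log data is known to be sensitive to noise in the tail, so this is an assumption-laden illustration rather than a reproduction.

```python
import math
from collections import Counter


def loglog_slope(degrees):
    """Least-squares slope of ln(frequency) against ln(degree)."""
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    points = [(math.log(k), math.log(c)) for k, c in counts.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic degree sequence with frequency exactly proportional to k^-2.
degrees = [1] * 4096 + [2] * 1024 + [4] * 256 + [8] * 64
gamma = -loglog_slope(degrees)
```

The synthetic sequence recovers an exponent of 2; on real data, a deviation at small degrees like the one described above would show up as points falling off the fitted line.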
[Figure 7.4 plots, panels (a)-(d): Degree Distribution (Drug), Degree Distribution (Material), Degree Distribution (Optics), Degree Distribution (Semiconductor). x-axis: ln(k).]
Figure 7.4: Degree distributions of the four fields. (a) Drug. (b) Material Science. (c)
Optics. (d) Semiconductor.
• The productivity distributions followed power laws.
An institution's productivity was measured by the number of patents it generated divided by the total number of patents in the field that it belonged to. The five most productive institutions in the four fields are listed in Table 7.2.
Field           Institution                                    No. of patents
Drug            L'Oreal                                        303
                Merck & Co., Inc.                              163
                University of California                       167
                Eli Lilly and Company                          130
                Genentech, Inc.                                111
Material        Minnesota Mining and Manufacturing Company     260
                Xerox Corp.                                    226
                IBM                                            158
                PPG Industries, Inc.                           152
                3M Innovative Properties Company               149
Optics          IBM                                            140
                The Secretary of the Navy                      111
                Lucent Technologies Inc.                       107
                Hughes Aircraft Company                        103
                Corning Incorporated                           80
Semiconductor   IBM                                            303
                Micron Technology, Inc.                        245
                Advanced Micro Devices, Inc.                   221
                Motorola, Inc.                                 185
                Texas Instruments Incorporated                 169
Table 7.2: The five most productive institutions in the four technology fields.
Moreover, I found that the most productive institutions were also the ones that received the most citations. The productivity distribution was found to follow a power law, $P(x) \sim x^{-\mu}$, where x is the number of patents generated by a specific institution (see Figure 7.5).
The exponent values for the four fields are given in Table 7.3. The power-law distributions imply that while a large number of institutions generated only a small number of patents, a small number of institutions generated a large number of patents.
[Figure 7.5 plot: Institution's productivity (drug), log-log axes. x-axis: # patents generated.]
Figure 7.5: Institutions’ productivity distribution for the drug field.
                Productivity distribution   Similarity distribution
Drug            1.41                        1.08
Material        1.24                        0.93
Optics          1.48                        1.08
Semiconductor   0.89                        1.30
Table 7.3: Exponent values of productivity distributions and similarity distributions for the four fields.
• The conditional similarity distributions scaled with similarity and followed power laws.
To measure the content similarity between two patent documents, I extracted the noun phrases from the title and abstract of each patent and calculated the Jaccard coefficient (Rasmussen, 1992), a similarity measure often used in information retrieval applications.
The similarity ($\sigma_{ij}$) between patents i and j was defined as (Chen et al., 1998):
\[
\sigma_{ij} = \frac{\sum_{q=1}^{Q} w_{iq} w_{jq}}{\sum_{q=1}^{Q} w_{iq}^2 + \sum_{q=1}^{Q} w_{jq}^2 - \sum_{q=1}^{Q} w_{iq} w_{jq}}, \tag{7.10}
\]
where Q was the total number of terms extracted from a data set (e.g., drug field) and $w_{iq} = tf_{iq} \times idf_q$. Term frequency ($tf_{iq}$) was the number of occurrences of term q in document i.
Inverse document frequency ($idf_q$) was the inverse of the logarithm of the number of documents in which term q occurred.
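Equation (7.10) can be sketched as follows. To sidestep the details of the term weighting, the function takes the weights $w_{iq}$ as ready-made dictionaries; the function name, the noun phrases, and the weight values are hypothetical.

```python
def jaccard_similarity(w_i, w_j):
    """Weighted Jaccard coefficient of equation (7.10).

    w_i, w_j: dicts mapping term -> weight (e.g. tf * idf).
    Terms missing from a dict are treated as having weight 0.
    """
    shared = sum(w * w_j[q] for q, w in w_i.items() if q in w_j)
    norm_i = sum(w * w for w in w_i.values())
    norm_j = sum(w * w for w in w_j.values())
    denom = norm_i + norm_j - shared
    return shared / denom if denom > 0 else 0.0


# Hypothetical noun-phrase weights for two patents.
w_i = {"carbon nanotube": 1.0, "thin film": 2.0}
w_j = {"thin film": 2.0, "quantum dot": 1.0}
sim = jaccard_similarity(w_i, w_j)
```

Here the shared term contributes 4 to the numerator and the denominator is 5 + 5 − 4 = 6, so the similarity is 2/3; two identical weight vectors yield exactly 1.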
The distribution of content similarity was a conditional probability distribution. It was defined as the percentage of patent pairs that were linked by citations over all possible patent pairs in a data set, given a specific value of content similarity.
Unlike the phase transition distribution of content similarity observed in the degree-similarity mixture model (Menczer, 2004), the content similarity between linked patent pairs followed a power law, $P(\sigma_c) \sim \sigma_c^{\upsilon}$. This means that the more similar the contents of two patents, the more likely it was that one would cite the other. Figure 7.6 presents the similarity distribution of drug patents. The exponent of the power-law distribution for each field is given in Table 7.3.
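The conditional distribution defined above (the share of patent pairs that are linked, given their content similarity) can be estimated by binning pairs on similarity. The sketch below is one possible implementation; the number of bins, the function name, and the toy pairs are assumptions.

```python
from collections import defaultdict


def conditional_link_fraction(pairs, n_bins=10):
    """Fraction of linked pairs within each similarity bin.

    pairs: iterable of (similarity in [0, 1], is_linked) tuples,
           one entry per patent pair in the data set.
    Returns a dict mapping bin index -> linked pairs / all pairs in the bin.
    """
    linked = defaultdict(int)
    total = defaultdict(int)
    for sim, is_linked in pairs:
        b = min(int(sim * n_bins), n_bins - 1)  # similarity 1.0 goes to last bin
        total[b] += 1
        if is_linked:
            linked[b] += 1
    return {b: linked[b] / total[b] for b in sorted(total)}


# Hypothetical pairs: dissimilar pairs are rarely linked, similar ones often.
pairs = [(0.05, False), (0.05, False), (0.05, True),
         (0.95, True), (0.95, True)]
fractions = conditional_link_fraction(pairs)
```

Plotting the resulting fractions against the bin centers on log-log axes would give a chart of the kind shown in Figure 7.6.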
7.4.4.1 Possible Evolutionary Mechanisms
The two categories of mechanisms that might possibly affect the evolution of patent citation networks were attractiveness of patents and usefulness of citation links.
The attractiveness of a patent document could be based on degree and fitness. The degree of a patent was the number of citations it received from other patents and could be easily measured by its number of in-links. The fitness of a patent was its intrinsic traits, such as the quality of the content or the innovativeness of the technology presented. Measuring fitness, however, was not as straightforward. I observed that most popular patents that received a large number of citations were written by assignees from the productive institutions. That is, patents from large, productive institutions appeared to be more attractive to patent assignees and tended to receive more citations. The fitness of a patent was thus estimated by the productivity of the institution of the assignee.
[Figure 7.6 plot: Conditional Similarity Distribution (Drug), log-log axes. x-axis: Similarity.]
Figure 7.6: The log-log plot of conditional content similarity between linked patent pairs.
The usefulness of citation links was estimated primarily by the content similarity between the citing patent and the cited patent. To test whether content similarity played a role in citation, I compared the average similarity between linked patents with the average similarity between unlinked patents. I found that the former was significantly higher than the latter, indicating that content similarity affected citation selection. Table 7.4 presents the similarity coefficients of the four technology fields.
                Average similarity between   Average similarity between
                linked patents               unlinked patents
Drug            0.153***                     0.008
Material        0.097***                     0.007
Optics          0.068***                     0.006
Semiconductor   0.059***                     0.004
*** p < 0.0001
Table 7.4: The similarity coefficients between linked patents and those between unlinked patents.
7.4.4.2 Estimating the Composite Model
To estimate parameters in the composite model, the best approach would be regression based on equation (7.1) using the real data. However, the probability $\Pi_i$ was very difficult to measure (Jeong et al., 2003). I therefore used a simulation approach to estimate the parameters. Using simulation for parameter estimation was rather ad hoc. However, it could provide some insights into the impacts of various mechanisms in network evolution.
Given a specific citation network, the simulation began with two linked nodes. At each time step, a new node was added to the network. The new node was allowed to link to <k> existing nodes in the network. The target nodes were selected based on one of the variants of the composite model: the simple degree model, the simple fitness model, the multiplicative model, the additive model, and the composite model. The simulated network continued to include new nodes until its size was equal to the size of the real citation network under study.
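The simulation loop described above can be sketched for the simplest variant, the simple degree model. Several details are assumptions made for the example and are not specified in this section: the number of links per new node is passed as an integer m standing in for <k>, targets are drawn without repetition, total degree (not in-degree only) is used as the selection weight so that every node can be chosen, and a fixed random seed is used for reproducibility.

```python
import random


def simulate_degree_model(n_nodes, m, seed=0):
    """Grow a network by preferential attachment from two linked nodes.

    n_nodes: final network size (size of the real network under study)
    m:       links each new node makes (stand-in for <k>), m >= 1
    Returns (degree list, list of (new_node, target) edges).
    """
    rng = random.Random(seed)
    degree = [1, 1]          # the two initial, linked nodes
    edges = [(1, 0)]
    while len(degree) < n_nodes:
        new = len(degree)
        n_links = min(m, new)  # cannot link to more nodes than exist
        targets = set()
        while len(targets) < n_links:
            # Choose an existing node with probability proportional to degree.
            targets.add(rng.choices(range(new), weights=degree, k=1)[0])
        degree.append(0)
        for t in sorted(targets):
            edges.append((new, t))
            degree[t] += 1
            degree[new] += 1
    return degree, edges


degree, edges = simulate_degree_model(n_nodes=200, m=2)
```

Replacing the `weights` argument with fitness scores, or with a mixture of normalized degree and fitness, would give the other variants; the resulting degree sequence can then be compared with the empirical one.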
The fitness scores of the nodes in these models were drawn from the empirical distributions of institution productivity. The usefulness function in the composite model was based on the empirical distributions of the content similarity between linked patents,
\[
u(\sigma_c) =
\begin{cases}
\sigma_c^{-\mu} & \text{if } \sigma_c < 0.991, \\
\sigma & \text{if } \sigma_c \ge 0.991.
\end{cases} \tag{7.11}
\]
Here $u(\sigma_c)$ was the number of linked patent pairs with similarity $\sigma_c$ divided by the total number of linked pairs. Note that this distribution was not the conditional content similarity presented in Section 7.4.3. It was a distribution with a phase transition (Menczer, 2004).
When the content similarity between two patents was less than 0.991, the distribution was a power law. This indicated that while a large number of linked pairs had small similarity, a small percentage of linked patent pairs were very similar. When the similarity was close to 1.0, the probability was a constant number σ. Figure 7.7 presents the similarity distribution of the drug network. The constant similarity σ was marked on this chart. The values of µ and σ are given in Table 7.5.
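The piecewise usefulness function of equation (7.11) is easy to express in code. The sketch below uses the drug-field parameters (µ = 0.72, σ = 0.06) from Table 7.5 in its example; the function and argument names are hypothetical, and treating the boundary value 0.991 as belonging to the constant branch is an assumption. The power-law branch is returned without any normalizing constant, so its values are best read as relative weights.

```python
def usefulness(sim, mu, sigma_const, threshold=0.991):
    """Usefulness u(sigma_c) following equation (7.11).

    sim:         content similarity sigma_c, with 0 < sim <= 1
    mu:          power-law exponent for the field
    sigma_const: constant value used when similarity is close to 1
    """
    if not 0.0 < sim <= 1.0:
        raise ValueError("similarity must lie in (0, 1]")
    if sim < threshold:
        return sim ** (-mu)   # power-law branch
    return sigma_const        # constant branch near sigma_c = 1


# Drug-field parameters from Table 7.5.
u_mid = usefulness(0.5, mu=0.72, sigma_const=0.06)
u_top = usefulness(0.995, mu=0.72, sigma_const=0.06)
```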
Because the parameters in the composite model were unknown, the first four models had to be simulated and analyzed first. Based on the analysis, the contribution (coefficient) of each mechanism was determined by multiple trials. Some mechanisms might be dropped from the composite model because of their poor fit.
                µ      σ
Drug            0.72   0.06
Material        0.85   0.03
Optics          0.79   0.03
Semiconductor   0.61   0.02
Table 7.5: Estimated parameter values in the content similarity distributions.
[Figure 7.7 plot: Distribution of Content Similarity between Linked Patents (Drug). x-axis: σ_c, 0 to 1. y-axis: p.]
Figure 7.7: The distribution of the content similarity between linked drug patents.
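The estimated models for the drug, material science, optics, and semiconductor fields are, respectively,
\[
\Pi_i = 0.9 \frac{k_i}{\sum_j k_j} + 0.1 \, \sigma_c^{-0.72}, \tag{7.12}
\]
\[
\Pi_i = 0.7 \frac{k_i}{\sum_j k_j} + 0.3 \, \sigma_c^{-0.85}, \tag{7.13}
\]
\[
\Pi_i = 0.35 \left( \frac{k_i}{\sum_j k_j} + \frac{\zeta_i}{\sum_j \zeta_j} \right) + 0.3 \, \sigma_c^{-0.79}, \tag{7.14}
\]
\[
\Pi_i = 0.45 \left( \frac{k_i}{\sum_j k_j} + \frac{\zeta_i}{\sum_j \zeta_j} \right) + 0.1 \, \sigma_c^{-0.61}. \tag{7.15}
\]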
Figure 7.8 presents the model fits of the five models. Major findings are summarized as follows:
• Equations (7.12)-(7.13) show that the attractiveness mechanisms were more responsible for the degree distributions observed in the four patent citation networks. The values of α for all four fields ranged between 0.7 and 0.9. In equations (7.14) and (7.15), the values of β were 0.5.
• Preferential attachment was the most significant mechanism in the evolution of the networks. Figure 7.8 shows that the simple degree model fit the data well, although it might not be the best.
• However, preferential attachment may not be the only mechanism that leads to scale-free topology. Figure 7.8 shows that the simple fitness model could also result in a scale-free topology, although the degree scaled very fast.
• The multiplicative and additive fitness models scaled too fast in most cases. They generated networks with a few nodes that had extremely large degrees.
• The composite model seemed to be the model that best fit the data because it was more flexible and included the additional mechanism, content similarity.
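The reported values of α and β can be related to the numerical coefficients in equations (7.12)-(7.15) with a short calculation. The decomposition below assumes that the estimated models take the additive form $\Pi_i = \alpha[\beta \, k_i/\sum_j k_j + (1-\beta) \, \zeta_i/\sum_j \zeta_j] + (1-\alpha) \, u(\sigma_c)$; this form is inferred from the estimated equations themselves and is not a restatement of equation (7.1). Under that assumption, the absence of a fitness term for the drug and material fields corresponds to β = 1.

```latex
\begin{align*}
\text{Drug (7.12):}          \quad & \alpha = 0.9,\ \beta = 1:   && \alpha\beta = 0.9,  && \alpha(1-\beta) = 0,    && 1-\alpha = 0.1 \\
\text{Material (7.13):}      \quad & \alpha = 0.7,\ \beta = 1:   && \alpha\beta = 0.7,  && \alpha(1-\beta) = 0,    && 1-\alpha = 0.3 \\
\text{Optics (7.14):}        \quad & \alpha = 0.7,\ \beta = 0.5: && \alpha\beta = 0.35, && \alpha(1-\beta) = 0.35, && 1-\alpha = 0.3 \\
\text{Semiconductor (7.15):} \quad & \alpha = 0.9,\ \beta = 0.5: && \alpha\beta = 0.45, && \alpha(1-\beta) = 0.45, && 1-\alpha = 0.1
\end{align*}
```

Read this way, the attractiveness mechanisms (degree and fitness together) carry 70% to 90% of the weight in every field, with content similarity accounting for the remainder.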
[Figure 7.8 plots, panels (a)-(d): Model Fit (Drug), Model Fit (Material), Model Fit (Optics), Model Fit (Semiconductor). x-axis: ln(k). Series: Data, Simple Degree Model, Multiplicative Fitness Model, Simple Fitness Model, Additive Fitness Model, Composite Model.]
Figure 7.8: The fits of different models. (a) Drug. (b) Material Science. (c) Optics. (d)
Semiconductor.
I proposed a composite model for network evolution in this chapter. The microscopic mechanisms that may affect the evolution of networks are the attractiveness of the target nodes and the usefulness of the links. Using the NSE-related patent citation networks, I compared this model with several models proposed in prior research. The preliminary results showed that the composite model had the potential to better fit real networks.
The major limitation of this chapter lies in the use of a simulation approach to estimate the parameters in the composite model, owing to the difficulty of measuring the link-selection probability. Future research will focus on removing this limitation and on systematically estimating and testing the significance of the parameters. In addition, various other mechanisms will be added to the model in my future research.
Contemporary organizations live in an environment of networks: internally, organizations manage the networks of employees, information resources, and knowledge assets to enhance productivity and improve efficiency; externally, they form alliances with strategic partners, suppliers, buyers, and other stakeholders to conserve resources, share risks, and gain market power. Organizations make many managerial and strategic decisions based on their understanding of the structure of these various networks.
This dissertation is devoted to network structure mining, a new research topic on knowledge discovery in databases (KDD) for supporting knowledge management and decision making in organizations. A comprehensive computational framework is proposed and a series of case studies are presented to address various research questions in this new field. In this chapter, I summarize the theoretical, technical, and empirical contributions of this dissertation, discuss its relevance to management, business, and MIS research, and suggest future research directions.
The dissertation contributes to various aspects of the research and practice of network structure mining. Specifically, the theoretical contributions of this dissertation are as follows:
• The computational framework proposed in this dissertation is the first comprehensive framework that offers a relatively complete taxonomy and summary of the theoretical foundations, major research questions and methodologies, existing technologies, and applications of network structure mining. Books and articles that provide excellent surveys and summaries of the study of networks can be found in graph theory (Bollobás, 1998), social network analysis (Wasserman & Faust, 1994; Watts, 2004), and statistical physics (Albert & Barabási, 2002; Newman, 2003b), which are the three major theoretical foundations on which network structure mining is grounded. However, they focus only on the research questions and methodologies that are relevant to their own disciplines and have not taken much advantage of the multidisciplinary nature of network research. For example, SNA studies seldom address the network robustness question. Statistical physics research, on the other hand, never uses the blockmodeling approach from SNA to reduce network complexity. This computational framework, in contrast, consolidates the research questions and techniques from multiple reference disciplines and can also be used for guiding future research.
• The framework and the case studies presented in this dissertation contribute to the KDD research community by defining the new area of network structure mining and demonstrating how structural patterns can be extracted from networks using conventional data mining techniques, such as hierarchical clustering algorithms, and new methods borrowed from other disciplines, such as the blockmodeling approach. Network structure mining, together with conventional data mining topics such as association rule mining, clustering, and classification, will be the major pillars of KDD research.
This dissertation has also made the following technical contributions:
• A new shortest-path algorithm, two-tree priority-first search (two-tree PFS), was developed and compared with a few other graph traversal algorithms, such as the one-tree priority-first search (PFS) and breadth-first search (BFS) algorithms, to locate important relational paths in networks (Xu & Chen, 2004). The performance evaluation results showed that both one-tree and two-tree PFS algorithms were more effective than the BFS algorithm. In addition, the two-tree PFS algorithm was more efficient than the one-tree PFS algorithm in dense networks.
• A number of techniques that were previously used in other disciplines, such as the concept space approach from information retrieval (Chen & Lynch, 1992), hierarchical clustering algorithms from data mining (Aldenderfer & Blashfield, 1984), the blockmodeling approach from SNA (Arabie et al., 1978), and the multidimensional scaling (MDS) approach from statistics (Kruskal & Wish, 1978), were employed to mine and visualize structural patterns in networks (Xu & Chen, 2005). Compared with the graphics-based approaches that are employed in current network analysis tools, the prototype system developed based on these new techniques was more efficient and useful.
• To address the inefficiency of unweighted graph partitioning methods, I proposed edge local density to approximate link weights based on the structure of the network. When incorporated in single-pass and iterative hierarchical clustering algorithms, this measure was shown to be potentially helpful for enhancing partitioning efficiency with acceptable effectiveness or for improving effectiveness with acceptable efficiency. It could be used to provide a better balance between the different effectiveness and efficiency demands of applications than existing clustering methods.
• A composite model was proposed to explain evolution processes and the emergence of scale-free topology in networks. The composite model could reduce to different models proposed in prior research or to new models under different parameter settings. This model incorporated more evolutionary mechanisms than prior models and was more flexible and realistic.
This dissertation addresses network structure mining from the perspective of knowledge management and decision making. Specifically, the case studies presented were aimed at supporting knowledge management and decision making in various application domains:
• Chapters 3 and 4 focused on the law enforcement domain, proposing effective methods to help extract knowledge about the structures of criminal networks of organized crimes (Xu & Chen, 2004, 2005). The techniques employed, such as the shortest-path algorithms and SNA methods, have been found to be very promising in supporting knowledge discovery tasks related to crime investigation.
• Chapter 5 presented the new measure for addressing the unweighted network partition problem. It can be used in a variety of applications, such as identifying research specialties in a research discipline based on citation networks and extracting communities of Web pages on the Internet.
• Chapter 6 used the new topological analysis approaches from statistical physics to analyze the structure and robustness of “dark networks” such as criminal networks, terrorist networks, and Web sites created by terrorists and their supporters. The findings could help authorities better understand the organization of these dark networks and develop effective disruptive strategies.
• Chapter 7 described and modeled the evolution of several patent citation networks. The findings would be useful for understanding the history of technology development and predicting future technology trends.
In addition to these contributions, this dissertation is especially relevant to management, business, and MIS research.
The science of networks (Barabási, 2002; Watts, 2004) has motivated a new way of thinking that views everything surrounding us as connected and makes us ponder what it means for science, business, and everyday life. In particular, managers of organizations may find a number of new opportunities for business and management by thinking in terms of networks and employing network structure mining techniques presented in this dissertation:
• Marketing managers can exploit customer networks and mine the “network value” of customers (Domingos & Richardson, 2001). Some well-connected customers are rather important. They may be early adopters of some new products and can influence the purchasing decisions of many other people. Approaches proposed in this dissertation can help marketing managers locate these key customers and develop better marketing strategies.
• Boards of directors are the decision-making bodies of large corporations (Robins & Alexander, 2004). Many strategic practices regarding corporate governance, adoption of new technology, and technology outsourcing spread among directors sitting on different corporation boards. Executives and directors may find network structure mining helpful for understanding the structures and evolution processes of these elite networks and making more intelligent strategic decisions.
• Financial equities, stocks, and banks form networks in the financial market (Bonanno et al., 2004; Inaoka et al., 2004), which is an inherently complex system. With the techniques presented in this dissertation, managers, banks, and financial workers may better understand the behavior of this complex system and select profitable financial portfolios or financial policies.
• Information systems that incorporate the network mining techniques will be able to provide organizations with not only information storage functionality but also the ability to discover useful knowledge from networks, thereby enhancing organizations’ competitive advantages.
Network structure mining is a fairly new area. Many new methodologies and technologies are needed. My future research on network structure mining will proceed in the following directions.
From the theoretical perspective, I will develop a more comprehensive research framework as the research on network structure mining matures. New research questions, techniques, and findings will be added to the framework. Although it is rather comprehensive, the current framework does not incorporate the research on resource diffusion in networks. The future framework will consider this missing piece. I will also continue to work on the network evolution problem by developing new models and revealing new mechanisms responsible for network evolution. Such research will contribute to the theory building of network evolution.
From the technical perspective, my future research will include the development of more techniques and methods for mining structural patterns in networks. In particular, the unweighted network partition approach proposed in this dissertation still has much room for improvement. My objective is to develop more effective and efficient algorithms.
From the empirical perspective, I will experiment with my techniques in more application domains. This dissertation covers only a few domains where network structure mining can apply. In the future, I will apply the techniques to Web mining, biological network mining, and citation network mining, among many others.
Participant number: _________________________
Date: _____________________________________
The purpose of this study is to evaluate the performance of a criminal network analysis system.
The network presented in this study consists of criminals. However, no detailed information about any criminals is shown except for their scrubbed names. Therefore, the network can be simply treated as a general network consisting of nodes. This study does not require any domain knowledge about criminal networks and crime investigation. You are eligible to participate because you have basic experience of using computers.
Your participation will involve completing the tasks of discovering the structure of a network.
You may choose not to answer some or all of the questions. During the observation, time will be recorded for each task you complete. Your name will not appear on any written notes.
Any questions you have will be answered and you may withdraw from the study at any time.
There are no known risks from your participation and no direct benefit from your participation is expected.
Questionnaire and observation information will be assigned a subject number and locked in a cabinet in a secure place. Your name will not be revealed in any reports that result from this project.
Operation: Explanation
Switch between two tabbed panes, one of which is for narcotics and the other for gangs.
Show the network of individuals.
Reset the network to its original display.
Adjust the level of abstraction, with 0% indicating the original network of individuals.
View the centrality rankings of individuals.
Drag-and-drop: Move points around on the display.
View a group’s inner structure: Single-click on a bubble representing a group on the display (level > 0) to see the inner structure of that group.
Display rankings of group members’ roles: Single-click on a bubble representing a group on the display (level > 0) to see the rankings of group members in terms of their degree*.
* Degree of a point: the number of links the point has. Hint: A point with a high degree score is like a “leader”.
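The degree measure defined above can be illustrated with a minimal sketch. It assumes the network is given as a list of undirected links between named points; the function name `degree_ranking` and the sample names P1 to P4 are hypothetical and are not part of the system evaluated in the study.

```python
from collections import defaultdict


def degree_ranking(links):
    """Return (point, degree) pairs sorted by degree, highest first.

    `links` is an iterable of (point_a, point_b) pairs describing
    undirected links. Repeated links between the same two points are
    counted once, and self-links are ignored (an assumption of this sketch).
    """
    neighbors = defaultdict(set)
    for a, b in links:
        if a == b:
            continue  # ignore self-links
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Sort by descending degree, then by name so that ties are stable.
    return sorted(
        ((point, len(adjacent)) for point, adjacent in neighbors.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical example network with four points.
links = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
ranking = degree_ranking(links)
leader, leader_degree = ranking[0]
```

In this example the first entry of `ranking` is the point with the most links, which corresponds to the “leader” hint given to participants.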
Task 1. Do you think the people who are included in circle A should be in the same group? If not, who should be included in this group and who should be excluded from it?
Task 2. Do you think the people who are included in circle B should be in the same group? If not, who should be included in this group and who should be excluded from it?
Task 3. Do group A and group B have direct relations (are there lines linking members of group A and members of group B)?
Task 4. Does group A have more links to group B than to group C?
Task 5. Identify the person who scores the highest in degree (give his/her name).
Task 6. Identify the person who scores the highest in degree (give his/her name).
Task 7. Group A is labeled by “PERALES JASON”; group B is labeled by “SANCHEZ KEDI”. Do group A and group B have direct relations?
Task 8. Group A is labeled by “PERALES JASON”; group B is labeled by “SANCHEZ KEDI”; group C is labeled by “TEMPLETON SERGIO”. Does group C have a stronger relation to A than to B?
Task 9. Identify the person who scores the highest in degree (give his/her name).
Task 10. Identify the person who scores the highest in degree (give his/her name).
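The intergroup questions in Tasks 3, 4, 7, and 8 amount to counting the links that join members of two groups. The following is a minimal sketch of that count, assuming undirected links and non-overlapping groups; the function names and the sample members are hypothetical, and measuring the strength of a relation by the raw number of links is an assumption of this sketch rather than the system's documented definition.

```python
def count_links_between(links, group_x, group_y):
    """Count links with one endpoint in group_x and the other in group_y."""
    x, y = set(group_x), set(group_y)
    return sum(
        1
        for a, b in links
        if (a in x and b in y) or (a in y and b in x)
    )


def have_direct_relation(links, group_x, group_y):
    """Two groups are directly related if at least one link joins them."""
    return count_links_between(links, group_x, group_y) > 0


# Hypothetical groups and links.
group_a = {"A1", "A2"}
group_b = {"B1"}
group_c = {"C1"}
links = [("A1", "B1"), ("A2", "B1"), ("A1", "C1"), ("A1", "A2")]

a_to_b = count_links_between(links, group_a, group_b)
a_to_c = count_links_between(links, group_a, group_c)
a_more_linked_to_b = a_to_b > a_to_c
```

Links inside a single group, such as the one between A1 and A2, do not contribute to any intergroup count.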
1. Participant #: ______________
2. Year of Birth: ______________
3. Gender: Male / Female
4. Academic background: Undergraduate / Graduate
5. In general, I am very comfortable with computers.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
6. I am very experienced with the Internet.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
7. I am very experienced with Microsoft Excel.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
Please answer the following questions regarding the interface of this system.
8. Switching between the two tabbed panes is easy.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
9. The reset button is NOT easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
10. The slider is easy to adjust.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
11. The meaning of the slider is confusing.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
12. The table is easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
13. The table is confusing.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
14. Moving points around on the network is easy.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
15. The meaning of the network at 0 level of abstraction (points are individuals) is confusing.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
16. The meaning of groups is confusing (groups are represented by circles).
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
17. I learned how to use the system interface (including buttons, slider, table, etc.) very quickly.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
18. In general, the interface is easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
You have just completed many tasks. Recall how you performed these tasks and answer the following questions.
19. I was very comfortable with evaluating the groupings produced by the system.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
20. I felt confused with the groupings produced by the system.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
21. It was easier to find intergroup relations when group members are put into circles than when they are shown individually.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
22. It was easier to use the table to find the person with the highest degree than to count lines in a small window.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
23. In general, this system is easy to learn.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
24. In general, this system is easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
25. In general, I am satisfied with the system.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree
26. Please give any comments regarding the system.
Thank you very much.
220
Agrawal, R., Imielinski, T., & Swami, A. (1993).
Mining association rules between sets of items in large databases.
Proceedings of the ACM SIGMOD International
Conference on Management of Data
, Washington, D.C.
Albert, R., & Barabási, A.L. (2000). Topology of evolving networks: Local events and universality.
Physical Review Letters, 85
(24), 52345237.
Albert, R., & Barabási, A.L. (2002). Statistical mechanics of complex networks.
Reviews of Modern Physics, 74
(1), 4797.
Albert, R., Jeong, H., & Barabási, A.L. (1999). Diameter of the WorldWide Web.
Nature, 401
, 130131.
Albert, R., Jeong, H., & Barabási, A.L. (2000). Error and attack tolerance of complex networks.
Nature, 406
, 378382.
Aldenderfer, M. S., & Blashfield, R. K. (1984).
Cluster Analysis
. Beverly Hills: Sage
Publications.
Ali, M., & Kamoun, F. (1993). Neural networks for shortest path computation and routing In computer networks.
IEEE Transactions on Neural Networks, 4
(5), 941
953.
Amaral, L. A. N., Scala, A., Barthelemy, M., & Stanley, H. E. (2000). Classes of smallworld networks.
Proceedings of the National Academy of Science of the United
States of America, 97
, 1114911152.
Anderson, T., Arbetter, L., Benawides, A., & LongmoreEtheridge, A. (1994). Security works.
Security Management, 38
(17), 1720.
Arabie, P., Boorman, S. A., & Levitt, P. R. (1978). Constructing blockmodels: How and why.
Journal of Mathematical Psychology, 17
, 2163.
Araujo, F., Ribeiro, B., & Rodrigues, L. (2001). A neural network for shortest path computation.
IEEE Transactions on Neural Networks, 12
(5), 10671073.
Asano, T., Kirkpatrick, D., & Yap, C. (2002).
Pseudo approximation algorithms, with applications to optimal motion planning.
Proceedings of the 18th Annual
Symposium on Computational Geometry
, Barcelona, Spain.
221
Baker, W. E., & Faulkner, R. R. (1993). The social organization of conspiracy: Illegal networks in the heavy electrical equipment industry.
American Sociological
Review, 58
(12), 837860.
Baldi, S. (1998). Normative versus social constructivist processes in the allocation of citations: A networkanalytic model.
American Sociological Review, 63
(6), 829
846.
Barabási, A.L. (2002).
Linked: The New Science of Networks
. New York, NY: Perseus
Books Group.
Barabási, A.L., Albert, R., & Jeong, H. (1999). Meanfield theory for scalefree random networks.
Physica A, 272
, 173187.
Barabási, A.L., & Alert, A.L. R. (1999). Emergence of scaling in random networks.
Science, 286
(5439), 509512.
Barabasi, A.L., & Alert, R. (1999). Emergence of Scaling in Random Networks.
Science,
286
(5439), 509512.
Barabási, A.L., Jeong, H., Zéda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002).
Evolution of the social network of scientific collaborations.
Physica A, 311
, 590
614.
Battista, G. d., Eades, P., Tamassia, R., & Tollis, I. G. (1999).
Graph Drawing:
Algorithms for the Visualization of Graphs
. Upper Saddle River, NJ: Prentice Hall.
Berger, N., Borgs, C., Chayes, J. T., D'Souza, R. M., & Kleinberg, R. D. (forthcoming).
Degree distribution of competitioninduced preferential attachment.
Combinatorics, Probability and Computing
.
Berkowitz, S. D. (1982).
An Introduction to Structural Analysis: The Network Approach to Social Research
. Toronto: Butterworth.
Bianconi, G., & Barabási, A.L. (2001). Competition and multiscaling in evolving networks.
Europhysics Letters, 54
, 436442.
Bollobás, B. (1985).
Random Graphs
. London: Academic.
Bollobás, B. (1998).
Modern Graph Theory
. New York, NY: SpringerVerlag.
Bonanno, G., Caldarelli, G., Lillo, F., Micciche, S., Vandewalle, N., & Mantegna, R. N.
(2004). Networks of equities in financial markets.
The European Physical Journal
B, 38
, 363371.
222
Borgatti, S. P., & Foster, P. C. (2003). The network paradigm in organizational research:
A review and typology.
Journal of Management, 29
, 9911013.
Brass, D. J. (1984). Being in the right place: A structural analysis of individual influence in an organization.
Administrative Science Quarterly, 29
, 518539.
Breiger, R. L. (2004). The analysis of social networks. In M. A. Hardy & A. Bryman
(Eds.),
Handbook of Data Analysis
(pp. 505526). London, UK: Sage Publications.
Breiger, R. L., Boorman, S. A., & Arabie, P. (1975). An algorithm for clustering relational data, with applications to social network analysis and comparison with multidimensional scaling.
Journal of Mathematical Psychology, 12
, 328383.
Brin, S., & Page, L. (1998).
The anatomy of a largescale hypertextual web search engine.
Proceedings of the 7th WWW Conference
, Brisbane, Australia.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., State, R., et al.
(2000). Graph structure in the web.
Computer Networks, 33
(16), 309320.
Burt, R. S. (1976). Positions in networks.
Social Forces, 55
, 93122.
Burt, R. S. (1980). Models of network structure.
Annual Review of Sociology, 6
, 79141.
Carr, N. G. (2003). IT doesn't matter.
Harvard Business Review, 81
(5), 4149.
Chakrabarti, S., Dom, B. E., Kumar, S. R., Raghavan, P., Rajagopalan, S., Tomkins, A., et al. (1999). Mining the web's link structure.
IEEE Computer, 32
(8), 6067.
Chau, M., Xu, J., & Chen, H. (2002).
Extracting meaningful entities from police narrative reports.
Proceedings of National Conference on Digital Government Research
,
Los Angeles, CA.
Chau, M., Zeng, D., Chen, H., Huang, M., & Hendriawan, D. (2003). Design and evaluation of a multiagent collaborative Web mining system.
Decision Support
Systems, 35
(1), 167183.
Chen, C., Paul, R. J., & O'Keefe, B. (2001). Fitting the jigsaw of citation: Information visualization in domain analysis.
Journal of American Society of Information
Science and Technology, 52
(4), 315330.
Chen, H., Chung, Y., Ramsey, M., & Yang, C. (1998). A smart itsy bitsy spider for the web.
Journal of the American Society for Information Science, 49
(7), 604618.
223
Chen, H., & Lynch, K. J. (1992). Automatic construction of networks of concepts characterizing document databases.
IEEE Transactions on Systems, Man and
Cybernetics, 22
(5), 885902.
Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., et al. (2004).
The dark web portal: Collecting and analyzing the presence of domestic and international terrorist groups on the web.
Proceedings of the 7th Annual IEEE Conference on
Intelligent Transportation Systems (ITSC 2004)
, Washington, D. C.
Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., & Schroeder, J. (2003). COPLINK managing law enforcement data and knowledge.
Communications of the ACM,
46
(1), 2834.
Chinchor, N. A. (1998).
Overview of MUC7/MET2.
Proceedings of the 7th Message
Understanding Conference (MUC7)
, Washington, D.C.
Chung, F. R. K. (Ed.). (1997).
Spectral graph theory
(Vol. 92): American Mathematical
Society.
Coady, W. F. (1985). Automated link analysis: Artificial intelligencebased tool for investigators.
Police Chief, 52
(9), 2223.
Cook, D. J., & Holder, L. B. (2000). Graphbased data mining.
IEEE Intelligent Systems,
15
, 3241.
Cormen, T. H., Leiserson, C. E., & Rivest, R. L. (1991).
Introduction to Algorithms
.
Cambridge, MA: The MIT Press.
Csányi, G., & Szendroi, B. (2004). Structure of a large social network.
Physical Review E,
69
, 036131.
Culnan, M. J. (1986). The intellectual development of management information systems,
19721982: A cocitation analysis.
Management Science, 32
(2), 156172.
Culnan, M. J. (1987). Mapping the intellectual structure of MIS, 19801985: A cocitation analysis.
MIS Quarterly
, 341353.
Dantzig, G. (1960). On the shortest route through a network.
Management Science, 6
,
187190.
Davidson, R., & Harel, D. (1996). Drawing graphs nicely using simulated annealing.
ACM Transactions on Graphics, 15
(4), 301331.
224
Day, W. H. E., & Edelsbrunner, H. (1984). Efficient algorithms for agglomerative hierarchical clustering methods.
Journal of Classification, 1
, 724.
Defays, D. (1977). An efficient algorithm for a complete link method.
Computer Journal,
20
(4), 364366.
Deo, N. (1974).
Graph Theory with Applications to Engineering and Computer Science
.
Englewood Cliffs, New Jersey: PrenticeHall.
Dijkstra, E. (1959). A note on two problems in connection with graphs.
Numerische
Mathematik, 1
, 269271.
Domingos, P., & Richardson, M. (2001).
Mining the network value of customers.
Proceedings of the 7th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining
, San Francisco, CA.
Doreian, P., & Stokman, F. N. (1997). The dynamics and evolution of social networks. In
P. Doreian & F. N. Stokman (Eds.),
Evolution of Social Networks
(pp. 117).
Australia: Gordon and Breach.
Dorogovtsev, S. N., & Mendes, J. F. F. (2003).
Evolution of networks: From biological nets to the Internet and WWW
. New York, NY: Oxford University Press.
Dorogovtsev, S. N., Mendes, J. F. F., & Samukhin, A. N. (2000). Structure of growing networks with preferential linking.
Physical Review Letters, 85
(21), 46334636.
Eades, P. (1984). A heuristic for graph drawing.
Congressus Numerantium, 42
, 149160.
Erdös, P., & Rényi, A. (1960). On the evolution of random graphs.
Publications of the
Mathematical Institute of the Hungarian Academy of Sciences, 5
, 1761.
Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine.
Communications of the ACM, 39
(11), 6568.
Evan, W. M. (1972). An organizationset model of interorganizational relations. In M.
Tuite, R. Chisholm & M. Radnor (Eds.),
Interorganizational DecisionMaking
(pp. 181200). Chicago: Aldine.
Evans, J., & Minieka, E. (1992).
Optimization Algorithms for Networks and Graphs
(2 ed.). New York, NY: Marcel Dekker.
Faloutsos, M., Faloutsos, P., & Faloutsos, C. (1999).
On powerlaw relationships of the
Internet topology.
Proceedings of Annual Conference of the Special Interest
Group on Data Communication (SIGCOMM '99)
, Cambridge, MA.
225
Fayyad, U., PiatetskShapiro, G., & Smyth, P. (1996a). The KDD process for extracting useful knowledge from volumes of data.
Communications of the ACM, 39
(11),
2734.
Fayyad, U. M., PiatetskyShapiro, G., & Smyth, P. (1996b). From data mining to knowledge discovery: An overview. In U. M. Fayyad, G. PiatetskShapiro, P.
Smyth & R. Uthurusamy (Eds.),
Advances in Knowledge Discovery and Data
Mining
. Menlo Park, CA: AAAI Press/The MIT Press.
Fiedler, M. (1973). Algebraic connectivity of graphs.
Czechoslovak Mathematical
Journal, 23
, 298305.
Flake, G. W., Lawrence, S., & Giles, C. L. (2000).
Efficient identification of web communities.
Proceedings of the 6th International Conference on Knowledge
Discovery and Data Mining (ACM SIGKDD 2000)
, Boston, MA.
Flake, G. W., Lawrence, S., Giles, C. L., & Coetzee, F. M. (2002). Selforganization and identification of web communities.
IEEE Computer, 35
(3), 6671.
Floyd, R. W. (1962). Algorithm 97: Shortest path.
Communications of the ACM, 5
(6),
345370.
Ford Jr., L. R., & Fulkerson, D. R. (1956). Maximal flow through a network.
Canadian
Journal of Mathematics, 8
, 399404.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification.
Social
Networks, 1
, 215240.
Freeman, L. C. (2000). Visualizing social networks.
Journal of Social Structure, 1
(1).
Fruchterman, T. M. J., & Reingold, E. M. (1991). Graph drawing by forcedirected placement.
SoftwarePractice & Experience, 21
(11), 11291164.
Furnas, G. W. (1986).
Generalized fisheye views.
Proceedings of ACM Conference on
Human Factors in Computing Systems (CHI '86)
, Boston, MA.
Galaskiewicz, J., & Krohn, K. (1984). Positions, roles, and dependencies in a community interorganization system.
Sociological Quarterly, 25
, 527550.
Garfield, E. (2001).
From bibliographic coupling to cocitation analysis via algorithmic historiobibliography: A citationist's tribute to Belver C. Griffith, Lazerow
Lecture presented at Drexel University, Philadelphia PA. November 27, 2001
, from http://garfield.library.upenn.edu/ papers/drexelbevergriffith92001.pdf
226
Garlaschelli, D., Caldarelli, G., & Pietronero, L. (2003). Universal scaling relations in food webs.
Nature, 423
(6936), 165168.
Garton, L., Haythornthwaite, C., & Wellman, B. (1999). Studying online social networks.
In S. Jones (Ed.),
Doing Internet Research
(pp. 75105). Thousand Oaks, CA:
Sage Publications.
Giannakis, M., & Croom, S. (2001).
The intellectual structure of supply chain management: An application of the social network analysis and citation analysis to SCM related journals.
Proceedings of the 10th International Annual IPSERA
Conference
, Jönkoping, Sweden.
Gibson, D., Kleinberg, J., & Raghavan, P. (1998).
Inferring web communities from link topology.
Proceedings of the 9th ACM Conference on Hypertext and Hypermedia
,
Pittsburgh, PA.
Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological networks.
Proceedings of the National Academy of Science of the United States of
America, 99
, 78217826.
Goldberg, H. G., & Senator, T. E. (1998).
Restructuring databases for knowledge discovery by consolidation and link formation.
Proceedings of 1998 AAAI Fall
Symposium on Artificial Intelligence and Link Analysis
, Orlando, FL.
Goldberg, H. G., & Wong, R. W. H. (1998).
Restructuring transactional data for link analysis in the FinCen AI system.
Proceedings of 1998 AAAI Fall Symposium on
Artificial Intelligence and Link Analysis
, Orlando, FL.
GómezGardenes, J., & Moreno, Y. (2004). Local versus global knowledge in the
BarabásiAlbert scalefree network model.
Physical Review E, 69
, 037103.
Gulati, R., & Gargiulo, M. (1999). Where do interorganizational networks come from?
American Journal of Sociology, 104
(4), 14391493.
Hajra, K. B., & Sen, P. (2005). Aging in citation networks.
Physica A, 346
, 4448.
Harary, F. (1994).
Graph Theory
. Reading, MA: AddisonWesley.
Harper, W. R., & Harris, D. H. (1975). The application of link analysis to police intelligence.
Human Factors, 17
(2), 157164.
Hauck, R. V., Atabakhsh, H., Ongvasith, P., Gupta, H., & Chen, H. (2002). Using coplink to analyze criminaljustice data.
IEEE Computer, 35
(3), 3037.
227
Helgason, R. V., Kennington, J. L., & Stewart, B. D. (1993). The onetoone shortestpath problem: An empirical analysis with the twotree Dijkstra algorithm.
Computational Optimization and Applications, 1
, 4775.
Herman, I., Melancon, G., & Marshall, M. S. (2000). Graph visualization and navigation in information visualization: A survey.
IEEE Transactions on Visualization and
Computer Graphics, 6
(1), 2443.
Hesham, E., Theodore, G. L., & Hesham, H. A. (1994).
Task scheduling in parallel and distributed systems
. Upper Saddle River, NJ: PrenticeHall.
Holme, P., Kim, B. J., Yoon, C. N., & Han, S. K. (2002). Attack vulnerability of complex networks.
Physical Review E, 65
, 056109.
Huang, Z., Chen, H., Yip, A., Ng, G., Guo, F., Chen, Z.K., et al. (2003). Longitudinal patent analysis for nanoscale science and engineering: Country, institution, and technology field.
Journal of Nanoparticle Research, 5
, 333363.
Huberman, B. A., & Adamic, L. A. (1999). Growth dynamics of the WorldWide Web.
Nature, 401
, 131.
Hummon, N. P. (2000). Utility and dynamic social networks.
Social Networks, 22
, 221
249.
Imafuji, N., & Kitsuregawa, M. (2002).
Effects of maximum flow algorithm for identifying web community.
Proceedings of the 4th ACM CIKM International
Workshop on Web Information and Data Management (WIDM'02)
, McLean, VA.
Inaoka, H., Takayasu, H., Shimizu, T., Ninomiy, T., & Taniguchi, K. (2004). Selfsimilarity of banking network.
Physica A, 339
, 621634.
Jain, A. K., & Dubes, R. C. (1988).
Algorithms for Clustering Data
. Upper Saddle River,
NJ: PrenticeHall.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review.
ACM
Computing Surveys, 31
(3), 264323.
Janssen, M. A., & Jager, W. (2003). Simulating market dynamics: Interactions between consumer psychology and social networks.
Artificial Life, 9
, 343356.
Jeong, H., Mason, S. P., Barabási, A.L., & Oltvai, Z. N. (2001). Lethality and centrality in protein networks.
Nature, 411
(6833), 41.
228
Jeong, H., Neda, Z., & Barabási, A.L. (2003). Measuring preferential attachment for evolving networks.
Europhysics Letters, 61
, 567572.
Jeong, H., Tombor, B., Albert, R., Oltval, Z. N., & Barabási, A.L. (2000). The largescale organization of metabolic networks.
Nature, 407
(6804), 651654.
Johnson, S. C. (1967). Hierarchical clustering schemes.
Psychometrika, 32
, 241254.
Jordan, P. W. (1998).
An Introduction to Usability
. Bristol, PA: Taylor & Francis.
Kamada, T., & Kawai, S. (1989). An algorithm for drawing general undirected graphs.
Information Processing Letters, 31
(1), 715.
Kannan, R., Vempala, S., & Vetta, A. (2004). On clustering: Good, bad and spectral.
Journal of the Association for Computing Machinery, 51
(3), 497515.
Kautz, H., Selman, B., & Shah, M. (1997). ReferralWeb: Combining social networks and collaborative filtering.
Communications of the ACM, 40
(3), 2736.
Kephart, J. O., Sorkin, G. B., Arnold, W. C., Chess, D. M., Tesauro, G. J., & White, S. R.
(1998). Biologically inspired defenses against computer viruses. In R. S.
Michalski (Ed.),
Machine Learning and Data Mining: Methods and Applications
.
New York, NY: John Wiley.
Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs.
Bell System Technical Journal, 49
, 291307.
Kleinberg, J. (1998).
Authoritative sources in a hyperlinked environment.
Proceedings of the 9th ACMSIAM Symposium on Discrete Algorithms
, San Francisco, CA.
Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. S. (1999).
The web as a graph: Measurements, models, and methods.
Proceedings of 5th Annual
International Conference on Computing and Combinatorics (COCOON'99)
,
Tokyo, Japan.
Kleinberg, J., & Lawrence, S. (2001). The structure of the web.
Science, 294
, 18491850.
Kleinberg, J., Sandler, M., & Slivkins, A. (2004).
Network failure detection and graph connectivity.
Proceedings of the 15th Annual ACMSIAM Symposium on Discrete
Algorithms
, New Orleans, LA.
Klerks, P. (2001). The network paradigm applied to criminal organizations: Theoretical nitpicking or a relevant doctrine for investigators? Recent developments in the
Netherlands.
Connections, 24
(3), 5365.
229
Krapivsky, P. L., Redner, S., & Leyvraz, F. (2000). Connectivity of growing random networks.
Physical Review Letters, 85
(21), 46294632.
Krause, A. E., Frank, K. A., Mason, D. M., Ulanowicz, R. E., & Tayloar, W. W. (2003).
Compartments revealed in foodweb structure.
Nature, 426
, 282285.
Krebs, V. E. (2001). Mapping networks of terrorist cells.
Connections, 24
(3), 4352.
Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method.
Psychometrika, 29
(2), 115128.
Kruskal, J. B., & Wish, M. (1978).
Multidimensional Scaling
. Beverly Hills, CA: Sage
Publications.
Kumar, S. R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the web for emerging cybercommunities.
Computer Networks, 31
(1116), 14811493.
Kumar, S. R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (2002). The web and social networks.
IEEE Computer, 35
(11), 3236.
Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies: II. Clustering systems.
Computer Journal, 10
, 271277.
Lawrence, S., & Giles, C. L. (1999). Accessibility of information on the web.
Nature,
400
, 107109.
Lee, R. (1998).
Automatic information extraction from documents: A tool for intelligence and law enforcement analysts.
Proceedings of 1998 AAAI Fall Symposium on
Artificial Intelligence and Link Analysis
, Orlando, FL.
Liljeros, F., Edling, C. R., Amaral, L. A. N., Stanley, H. E., & Aberg, Y. (2001). The web of human sexual contacts.
Nature, 411
, 907908.
Lorrain, F. P., & White, H. C. (1971). Structural equivalence of individuals in social networks.
Journal of Mathematical Sociology, 1
, 4980.
McAndrew, D. (1999). The structural analysis of criminal networks. In D. Canter & L.
Alison (Eds.),
The Social Psychology of Crime: Groups, Teams, and Networks,
Offender Profiling Series, III
(pp. 5394). Dartmouth: Aldershot.
Menczer, F. (2004). Evolution of document networks.
Proceedings of the National
Academy of Science of the United States of America, 101
, 52615265.
Milgram, S. (1967). The small world problem.
Psychology Today, 2
, 6067.
230
Moreno, J. L. (1953).
Who Shall Survive?
Beacon, NY: Beacon House.
Murtagh, F. (1984). A survey of recent advances in hierarchical clustering algorithms which use cluster centers.
Computer Journal, 26
, 354359.
Newman, M. E. J. (2001a). Scientific collaboration networks. I. Network construction and fundamental results.
Physical Review E, 64
, 016131.
Newman, M. E. J. (2001b). The structure of scientific collaboration networks.
Proceedings of the National Academy of Science of the United States of America,
98
, 404409.
Newman, M. E. J. (2003a). Mixing patterns in networks.
Physical Review E, 67
(2),
026126.
Newman, M. E. J. (2003b). The structure and function of complex networks.
SIAM
Review, 45
(2), 167256.
Newman, M. E. J. (2004a). Coauthorship networks and patterns of scientific collaboration.
Proceedings of the National Academy of Science of the United
States of America, 101
, 52005205.
Newman, M. E. J. (2004b). Detecting community structure in networks.
European
Physical Journal B, 38
, 321330.
Newman, M. E. J. (2004c). Fast algorithm for detecting community structure in networks.
Physical Review E, 69
(6), 066133.
Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks.
Physical Review E, 69
(2), 026113.
Palmer, C. R., Gibbons, P. B., & Faloutsos, C. (2002).
ANF: A fast and scalable tool for data mining in massive graphs.
Proceedings of the 8th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining
, Edmonton,
Alberta, Canada.
Pennock, D. M., Flake, G. W., Lawrence, S., Glover, E. J., & Giles, C. L. (2002).
Winners don't take all: Characterizing the competition for links on the web.
Proceedings of the National Academy of Science of the United States of America,
99
(8), 52075211.
Perkins, C. E., & Bhagwat, P. (1994).
Highly dynamic destinationsequenced distancevector routing (DSDV) for mobile computers.
Proceedings of
231
SIGCOMM Symposium on Communications Architectures and Protocols
,
London, UK.
Pothen, A., Simon, H. D., & Liou, K.P. (1990). Partitioning sparse matrices with eigenvectors of graphs.
SIAM Journal on Matrix Analysis and Applications, 11
(3),
430  452.
Powell, W. W., White, D. R., Koput, K. W., & OwenSmith, J. (2005 (forthcoming)).
Network dynamics and field evolution: The growth of interorganizational collaboration in the life sciences.
American Journal of Sociology
.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical
Recipes in C (Version 2nd edition). Cambridge: Cambridge University Process.
Price, D. J. D. (1965). Networks of scientific papers.
Science, 149
, 510515.
Purchase, H. C. (1997).
Which aesthetic has the greatest effect on human understanding?
Proceedings of the 5th International Symposium on Graph Drawing
, Rome, Italy.
Quinlan, J. R. (1986). Introduction of decision trees.
Machine Learning, 1
, 86106.
Raab, J., & Milward, H. B. (2003). Dark networks as problems.
Journal of Public
Administration Research and Theory, 13
(4), 413439.
Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., & Parisi, D. (2004). Defining and identifying communities in networks.
Proceedings of the National Academy of
Science of the United States of America, 101
, 26582663.
Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes & R. BaezaYates (Eds.),
Information Retrieval: Data Structures and Algorithms
. Englewood Cliffs, NJ:
Prentice Hall.
Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., & Barabási, A.L. (2002).
Hierarchical organization of modularity in metabolic networks.
Science, 297
,
15511555.
Reingold, E. M., & Tilford, J. S. (1981). Tidier drawing of trees.
IEEE Transactions on
Software Engineering, 7
(2), 223228.
Rives, A. W., & Galitski, T. (2003). Modular organization of cellular networks.
Proceedings of the National Academy of Science of the United States of America,
100
(3), 1128–1133.
232
Robins, G., & Alexander, M. (2004). Small worlds among interlocking directors:
Network structure and distance in bipartite graphs.
Computational &
Mathematical Organization Theory, 10
, 6994.
Ronfeldt, D., & Arquilla, J. (2001). What next for networks and netwars? In J. Arquilla &
D. Ronfeldt (Eds.),
Networks and Netwars: The Future of Terror, Crime, and
Militancy
. Santa Monica, CA: Rand Press.
Roussinov, D. G., & Chen, H. (1999). Document clustering for electronic meetings: An experimental comparison of two techniques.
Decision Support Systems, 27
, 6779.
Saether, M., & Canter, D. V. (2001).
A structural analysis of fraud and armed robbery networks in Norway.
Proceedings of the 6th International Investigative
Psychology Conference
, Liverpool, England.
Sageman, M. (2004).
Understanding Terror Networks
. Philadelphia, PA: University of
Pennsylvania Press.
Sahami, M., Yusufali, S., & Baldonado, Q. W. (1998).
SONIA: A service for organizing networked information autonomously.
Proceedings of the 3rd ACM International
Conference on Digital Libraries
, Pittsburgh, PA.
Scott, J. (1991).
Social Network Analysis
. London, UK: Sage Publications.
Shaw, W. M. J., Burgin, R., & Howell, P. (1997). Performance standards and evaluations in information retrieval test collections: Clusterbased retrieval models.
Information Processing & Management, 33
(1), 114.
Small, H. (1999). Visualizing science by citation mapping.
Journal of American Society of Information Science, 50
(9), 799813.
Small, H. G. (1977). A cocitation model of a scientific specialty: A longitudinal study of collagen research.
Social Studies of Science, 7
, 139166.
Solé, R. V., & Montoya, J. M. (2001). Complexity and fragility in ecological networks.
Proceedings of the Royal Society B, 268
, 20392045.
Somogyi, R., & Sniegoski, S. A. (1996). Modeling the complexity of genetic networks:
Understanding multigenic and pleiotropic regulation.
Complexity, 1
(6), 4563.
Sparrow, M. K. (1991). The application of network analysis to criminal intelligence: An assessment of the prospects.
Social Networks, 13
, 251274.
233
Stuart, T. E. (1998). Network positions and propensities to investigation of strategic alliance formation in a hightechnology industry.
Administrative Science
Quarterly, 43
, 668698.
Tolle, K. M., & Chen, H. (2000). Comparing noun phrasing techniques for use with medical digital library tools. Journal of the American Society for Information Science, 51(4), 352-370.
Torgerson, W. S. (1952). Multidimensional scaling: Theory and method. Psychometrika, 17, 401-419.
Toroczkai, Z., & Bassler, K. E. (2004). Jamming is limited in scale-free systems. Nature, 428, 716.
Toyoda, M., & Kitsuregawa, M. (2001). Creating a web community chart for navigating related communities. Proceedings of the 12th ACM Conference on Hypertext and Hypermedia, Arhus, Denmark.
Toyoda, M., & Kitsuregawa, M. (2003). Extracting evolution of web communities from a series of web archives. Proceedings of the 14th Conference on Hypertext and Hypermedia, Nottingham, UK.
Tu, Y. (2000). How robust is the Internet? Nature, 406, 353-354.
Valente, T. W. (1995). Network Models of the Diffusion of Innovations. Cresskill, NJ: Hampton Press.
vanCleemput, W. M. (1976). On the topological aspects of the circuit layout problem. Proceedings of the 13th Conference on Design Automation, San Francisco, CA.
Voorhees, E. M. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing & Management, 22(6), 465-476.
Wang, Z., & Crowcroft, J. (1992). Analysis of shortest-path routing algorithms in a dynamic network environment. ACM Computer Communication Review, 22(2), 63-71.
Wasserman, S., & Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.
Watts, D. J. (2002). A simple model of global cascades on random networks. Proceedings of the National Academy of Sciences of the United States of America, 99, 5766-5771.
Watts, D. J. (2004). The "new" science of networks. Annual Review of Sociology, 30, 243-270.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of "small-world" networks. Nature, 393, 440-442.
White, D. R., & Newman, M. E. J. (2001). Fast approximation algorithms for finding node-independent paths in networks, from http://ideas.repec.org/p/wop/safiwp/01-07-035.html
White, H. C., Boorman, S. A., & Breiger, R. L. (1976). Social structure from multiple networks: I. Blockmodels of roles and positions. American Journal of Sociology, 81, 730-780.
White, H. D., & McCain, K. W. (1998). Visualizing a discipline: An author co-citation analysis of information science, 1972-1995. Journal of American Society of Information Science and Technology, 49(4), 327-355.
Xu, J., & Chen, H. (2003). Untangling criminal networks: A case study. Proceedings of the 1st NSF/NIJ Symposium on Intelligence and Security Informatics (ISI'03), Tucson, AZ.
Xu, J., & Chen, H. (Forthcoming). Criminal network analysis and visualization: A data mining perspective. Communications of the ACM.
Xu, J. J., & Chen, H. (2004). Fighting organized crime: Using shortest-path algorithms to identify associations in criminal networks. Decision Support Systems, 38(3), 473-487.
Xu, J. J., & Chen, H. (2005). CrimeNet Explorer: A framework for criminal network knowledge discovery. ACM Transactions on Information Systems, 23(2).
Young, F. W. (1987). Multidimensional Scaling: History, Theory, and Applications. Hillsdale, NJ: Lawrence Erlbaum Associates.
Zhao, L., Park, K., & Lai, Y.-C. (2004). Attack vulnerability of scale-free networks due to cascading breakdown. Physical Review E, 70, 035101.