MINING STATIC AND DYNAMIC STRUCTURAL PATTERNS IN NETWORKS FOR KNOWLEDGE MANAGEMENT:

A COMPUTATIONAL FRAMEWORK AND CASE STUDIES

by

Jie Xu

A Dissertation Submitted to the Faculty of the

COMMITTEE ON BUSINESS ADMINISTRATION

In Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

WITH A MAJOR IN MANAGEMENT

In the Graduate College

THE UNIVERSITY OF ARIZONA

2005


THE UNIVERSITY OF ARIZONA

GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Jie Xu entitled “Mining Static and Dynamic Structural Patterns in Networks for Knowledge Management: A Computational Framework and Case Studies,” and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

_______________________________________________________________________

Date: May 10, 2005

Hsinchun Chen, Ph.D.

_______________________________________________________________________

Date: May 10, 2005

Jay F. Nunamaker Jr., Ph.D.

_______________________________________________________________________

Date: May 10, 2005

Daniel D. Zeng, Ph.D.

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

________________________________________________ Date: May 10, 2005

Dissertation Director: Hsinchun Chen, Ph.D.


STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.

Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the head of the major department or the Dean of the Graduate College when in his or her judgment the proposed use of the material is in the interests of scholarship. In all other instances, however, permission must be obtained from the author.

SIGNED: _____Jie Xu___________________


ACKNOWLEDGEMENT

First of all, I am grateful to my dissertation advisor and mentor, Professor Hsinchun Chen, for his guidance and encouragement throughout my five years at the University of Arizona. It has been an invaluable opportunity for me to work in the Artificial Intelligence Lab under his direction. I feel very fortunate to have had such a wonderful advisor. Many thanks go to my major committee members, Dr. Jay F. Nunamaker, Jr. and Dr. Daniel D. Zeng, and my minor committee members in the Department of Communication, Dr. Chris Segrin and Dr. Kyle Tusing, for their guidance and encouragement. I also thank all the faculty members in the MIS Department for their support.

My dissertation has been partly supported by grants from the National Science Foundation/Central Intelligence Agency: (EIA-9983304) “Knowledge Discovery and Dissemination ARJIS/COPLINK ‘Border Safe’” and “COPLINK Center for Excellence: Information and Knowledge Management in Law Enforcement,” and (CTS-0311652) “Intelligent Patent Analysis for Nanoscale Science and Engineering.” Most projects discussed in this dissertation have been supported by other AI Lab members, Dr. Homa Atabakhsh and Ms. Cathy Larson, and personnel from the Tucson Police Department: Detective Tim Peterson, Sergeant Mark Nisbet, and Lieutenant Jennifer Shroeder. I thank former AI Lab members whom I regard as role models for my research: Dr. Michael Chau, Dr. Gondy Leroy, Dr. Chienting Lin, Dr. Bin Zhu, and Dr. Dorbin Ng. I also thank Ms. Barbara Sears and Ms. Sarah Marshall for editing my papers.

I would like to thank my colleagues for their tremendous help and support through the past five years: Yilu Zhou, Yiwen Zhang, Ming Lin, Gang Wang, Jialun Qin, Jason Li, Zan Huang, Jinwei Cao, Xiaoyun Sun, Rong Zheng, Daniel McDonald, Byron Marshall, Xin Li, Jiannan Wang, Yang Xiang, Huihui Zhang, Dr. Edna Reid, Dr. Hua Su, Chun-Ju Tseng, and Shing Ka Wu. I especially thank Yilu Zhou for her encouragement and emotional support during the stressful time of my last year of study, and Yiwen Zhang, Ming Lin, and many other friends for their invaluable care, concern, and help during the time when I was struggling with a big challenge in my personal life.

I am extremely grateful to my parents, sister, and brother. Their unconditional love is my source of energy for working hard through the years. I appreciate the love, care, and encouragement from my husband, Yanhai Sun. He is the one I can always count on when I feel frustrated and discouraged. Last but not least, I thank my 14-month-old son, Patrick R. Sun, the most precious present I have received from God. He makes me believe that life is beautiful and research is only a part of my life.

DEDICATION

This dissertation is dedicated to my parents.


TABLE OF CONTENTS

LIST OF FIGURES ....................................................................................... 10
LIST OF TABLES ........................................................................................ 11
ABSTRACT .................................................................................................. 12
CHAPTER 1: INTRODUCTION ................................................................. 14
CHAPTER 2: LITERATURE REVIEW AND THE RESEARCH FRAMEWORK ......... 21
Foundations ................................................................................................... 21
Concepts ....................................................................................................... 26
Representation .............................................................................................. 27
Presentation .................................................................................................. 28
2.3 The Computational Framework for Network Structure Mining ................... 31
2.3.1 Static Structure Mining ......................................................................... 33
2.3.1.1 Locating Critical Resources in Networks ..................................... 33
2.3.1.2 Reducing Network Complexity .................................................... 38
2.3.1.3 Extracting Topological Properties ................................................ 44
2.3.2 Dynamic Structure Mining ................................................................... 47
2.3.2.1 Describing Structural Dynamics ................................................... 48
2.3.2.2 Modeling Structural Dynamics ..................................................... 49
CHAPTER 3: LOCATING KEY RELATIONSHIPS IN CRIMINAL NETWORKS ....... 53
3.1 Introduction ................................................................................................... 53
Work .................................................................................................................. 55
Analysis ............................................................................................................. 56
3.2.1.1 Network Construction ................................................................... 56
3.2.1.2 Link Analysis Tools ...................................................................... 58
3.3 Algorithms .................................................................................................... 59
The Modified BFS Algorithm ....................................................................... 62
3.4.1 Network Representation Transformation .............................................. 64
Algorithms ..................................................................................................... 68
3.4.2.1 The Modified PFS Algorithm ....................................................... 69
3.4.2.2 The Two-Tree Dijkstra/PFS Algorithm ........................................ 71
Evaluation ..................................................................................................... 73
3.5.1.1 COPLINK Concept Space and AZNP .......................................... 74
3.5.1.2 Data Set ......................................................................................... 75
3.5.2 Results
3.5.2.1 User Evaluation: Effectiveness Issue ............................................ 76
3.5.2.2 Simulation Experiment: Efficiency Issue ..................................... 81
3.6 Conclusions ................................................................................................... 85
CHAPTER 4: EXTRACTING STATIC STRUCTURAL PATTERNS IN CRIMINAL NETWORKS ....... 87
4.1 Introduction ................................................................................................... 87
4.2 Background ................................................................................................... 88
4.2.1 Implications of Structural Network Analysis ....................................... 89
4.2.2 Special Network Structures ................................................................... 90
Work .................................................................................................................. 91
4.3.1 Existing Network Analysis Tools ......................................................... 91
4.3.1.1 First Generation: Manual Approach ............................................. 91
4.3.1.2 Second Generation: Graphics-Based Approach ............................ 93
4.3.1.3 Third Generation: Structural Analysis Approach ......................... 95
4.3.2 Social Network Analysis ....................................................................... 96
Analysis ............................................................................................................. 96
4.3.2.4 Visualization of Social Networks ................................................ 100
4.4 CrimeNet Explorer: Extracting Structural Patterns in Criminal Networks .... 101
Partition ....................................................................................................... 104
Analysis ....................................................................................................... 106
4.4.5 CrimeNet
Evaluation .................................................................................................... 110
4.5.1 The Narcotics and Gang Networks ..................................................... 111
4.5.2.1 Task I: Subgroup Detection (Clustering) ..................................... 114
4.5.2.2 Tasks II and III: Interaction Pattern and Central Members Identification ........ 116
4.5.3 Results Discussion .............................................................................. 118
4.6 Conclusions ................................................................................................. 123
CHAPTER 5: IDENTIFYING GROUPS IN UNWEIGHTED NETWORKS ....... 125
5.1 Introduction ................................................................................................. 125
Work ................................................................................................................ 127
5.2.2 Determining Link Weights for Weighted Graphs ............................... 128
Unweighted Algorithms .............................................................................. 130
5.3 The Proposed Approach: Local Density Based Partition Algorithms ........ 133
5.3.1 Defining Edge Local Density .............................................................. 133
5.3.2 Illustrating Edge Local Density .......................................................... 135
Case 1: Clique-Bridge-Clique ..................................................................... 136
Case 2: Tree-Bridge-Tree ............................................................................ 139
Case 3: Clique-Bridge-Tree ........................................................................ 139
Case 4: Clique-Clique ................................................................................. 141
Case 5: Clique-Tree .................................................................................... 142
Metrics ......................................................................................................... 145
5.4.2 Hypotheses .......................................................................................... 147
5.4.3 Results Discussion .............................................................................. 149
5.4.3.1 Effectiveness ................................................................................ 149
5.4.3.2 Efficiency ..................................................................................... 155
5.5 Conclusions ................................................................................................. 158
CHAPTER 6: THE TOPOLOGICAL PROPERTIES OF DARK NETWORKS ....... 160
6.1 Introduction ................................................................................................. 160
Work ................................................................................................................ 161
Sets .................................................................................................................. 163
6.4.1 Statistical Properties of the Dark Networks ........................................ 164
6.4.1.1 Small-World Properties ............................................................... 168
6.4.2 Robustness of the Dark Networks ...................................................... 170
6.5 Conclusions ................................................................................................. 173
CHAPTER 7: MODELING THE EVOLUTION OF PATENT CITATION NETWORKS ....... 175
7.1 Introduction ................................................................................................. 175
Work ................................................................................................................ 177
7.3 The Composite Evolution Model ................................................................ 181
7.3.1 The Composite Model ........................................................................ 181
7.3.2 The Simple Degree Model .................................................................. 183
7.3.3 The Simple Fitness Model .................................................................. 184
7.3.4 The Multiplicative Fitness Model ....................................................... 184
7.3.5 The Additive Fitness Model ................................................................ 185
7.4 The Evolution of Patent Citation Networks ................................................ 187
Questions ..................................................................................................... 188
Analysis ....................................................................................................... 189
Evolutionary
7.4.4.2 Estimating the Composite Model ................................................ 199
7.5 Conclusions ................................................................................................. 205
CHAPTER 8: CONCLUSIONS AND FUTURE DIRECTIONS ......... 206
8.1 Contributions ............................................................................................... 206
Contributions ............................................................................................... 206
Contributions ............................................................................................... 208
8.2 Relevance to Business, Management, and MIS .......................................... 211
Directions .................................................................................................... 212
APPENDIX A: DOCUMENTS FOR THE CRIMENET EXPLORER EXPERIMENT ....... 214
A1: Instructions for Experiment Participants .................................................... 214
A2: Introduction to System Functionality ......................................................... 215
Sheet ................................................................................................................. 216
Questionnaire ................................................................................................... 217
REFERENCES ........................................................................................... 220


LIST OF FIGURES

Figure 2.1: Graph representation .......................................................................... 28
Figure 2.2: The computational framework for network structure mining ........... 32
Figure 3.1: The modified BFS algorithm ............................................................. 63
Figure 3.2: Two indirectly connected nodes ........................................................ 65
Figure 3.3: The modified PFS algorithm ............................................................. 70
Figure 3.4: The two-tree PFS algorithm .............................................................. 72
Figure 3.5: Execution time scatter plot ................................................................ 82
Figure 4.1: The terrorist network surrounding the 19 hijackers on September 11, 2001 ....... 93
Figure 4.2: Second-generation criminal network analysis tools .......................... 95
Figure 4.3: Procedures for automated criminal network mining and visualization ....... 101
Figure 4.4: The pseudocode of the modified version of the RNN-based complete-link algorithm ....... 105
Figure 4.5: CrimeNet Explorer .......................................................................... 110
Figure 5.1: The transformation of an unweighted graph into a weighted graph using the edge local density measure ....... 135
Figure 5.2: The five illustrative cases for edge local density measure .............. 136
Figure 5.3: Three illustrative networks with different p_out/p_in ratios ........... 150
Figure 5.4: Effectiveness results of the six clustering methods: sLD, sECC, iLD, iECC, G-N, and modularity ....... 151
Figure 5.5: The efficiency of sLD, iLD, modularity based, and G-N algorithm ....... 156
Figure 6.1: The giant component in the GSJ Network ....................................... 164
Figure 6.2: The degree distributions of the dark networks ................................. 168
Figure 6.3: The aging effect in the Meth World ................................................. 170
Figure 6.4: Dark networks’ vulnerability to attacks ........................................... 171
Figure 7.1: The size dynamics in patent citation networks of the four technology fields ....... 189
Figure 7.2: Dynamics of average degrees .......................................................... 191
Figure 7.3: Dynamics in average path lengths ................................................... 193
Figure 7.4: Degree distributions of the four fields ............................................. 194
Figure 7.5: Institutions’ productivity distribution for the drug field ................. 196
Figure 7.6: The log-log plot of conditional content similarity between linked patent pairs ....... 198
Figure 7.7: The distribution of the content similarity between linked drug patents ....... 201
Figure 7.8: The fits of different models ............................................................. 203

LIST OF TABLES

Table 2.1: The statistics for network topology ..................................................... 46
Table 3.1: Sample statistics of two networks ....................................................... 75
Table 3.2: Effectiveness evaluation results .......................................................... 78
Table 3.3: Mean execution time (in seconds) for the two shortest-path algorithms ....... 81
Table 4.1: Sizes of networks generated from the two datasets ........................... 112
Table 4.2: Clustering recall and precision .......................................................... 118
Table 4.3: Effectiveness ..................................................................................... 120
Table 4.4: Efficiency .......................................................................................... 120
Table 5.1: Hypotheses regarding clustering effectiveness ................................. 148
Table 5.2: Mean values of the effectiveness metrics of the six methods ........... 152
Table 5.3: Summary of hypothesis testing results for effectiveness .................. 153
Table 5.4: Mean running times (in seconds) of sLD, iLD, and the modularity based methods ....... 156
Table 6.1: The statistics and parameters in the exponentially truncated power-law degree distribution of the dark networks ....... 165
Table 6.2: Small-world properties of the dark networks .................................... 165
Table 7.1: Basic statistics of the four patent citation data sets ........................... 188
Table 7.2: The five most productive institutions in the four technology fields ....... 195
Table 7.3: Exponent values of productivity distributions and similarity distributions for the four fields ....... 196
Table 7.4: The similarity coefficients between linked patents and those between unlinked patents ....... 199
Table 7.5: Estimated parameter values in the content similarity distributions ....... 201


ABSTRACT

Contemporary organizations live in an environment of networks: internally, they manage networks of employees, information resources, and knowledge assets to enhance productivity and improve efficiency; externally, they form alliances with strategic partners, suppliers, buyers, and other stakeholders to conserve resources, share risks, and gain market power. Many managerial and strategic decisions are made by organizations based on their understanding of the structure of these networks. This dissertation is devoted to network structure mining, a new research topic in knowledge discovery in databases (KDD), for supporting knowledge management and decision making in organizations.

A comprehensive computational framework is developed to provide a taxonomy and summary of the theoretical foundations, major research questions, methodologies, techniques, and applications in this new area based on extensive literature review.

Research in this new area is categorized into static structure mining and dynamic structure mining. The major research questions of static mining are locating critical resources in networks, reducing network complexity, and capturing topological properties of large-scale networks. An inventory of techniques developed in multiple reference disciplines, such as social network analysis and Web mining, is reviewed. These techniques have been used to mine networks in various applications, including knowledge management, marketing, Web mining, and intelligence and security. Dynamic structure mining is concerned with network evolution, and major findings are reviewed.


A series of case studies are presented in this dissertation to demonstrate how network structure mining can be used to discover valuable knowledge from various networks ranging from criminal networks to patent citation networks. Several techniques are developed and employed in these studies. Performance evaluation results are provided to demonstrate the usefulness and potential of this new research field in supporting knowledge management and decision making in real applications.


CHAPTER 1: INTRODUCTION

In today’s information age the competitive advantages of organizations no longer depend on their information storage capabilities (Carr, 2003) but on their ability to analyze information and discover valuable knowledge. Knowledge discovery in databases (KDD) plays an indispensable role in supporting contemporary organizations’ knowledge management and decision making by “identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad et al., 1996a, p. 30). The core of KDD is data mining, a process of using appropriate techniques to extract patterns and knowledge from data. Research on KDD and data mining has advanced substantially, and many techniques have been developed for a spectrum of data mining problems, including association rule mining (Agrawal et al., 1993), clustering (Jain et al., 1999), classification (Quinlan, 1986), outlier analysis, and sequential pattern extraction (Fayyad et al., 1996a; Fayyad et al., 1996b).

Recently, a new data mining topic, network structure mining, has attracted much attention in the KDD research community (Cook & Holder, 2000; Domingos & Richardson, 2001; Palmer et al., 2002). Unlike conventional data mining, which extracts patterns based on individual data objects, network structure mining is intended to mine patterns based on the relationships between objects.

The concept of network is not new to most people. Regardless of its context, a network often refers to a set of nodes (objects) connected by links (relationships). Networks are prevalent in nature and society. Familiar examples include social networks, information networks, communication networks, and biological networks (Newman, 2003b).

Social networks are collections of social actors, such as individuals and organizations, who interact with one another through various relationships. Relationships between individuals can be kinship, friendship, co-membership, and affective or influential ties (Wasserman & Faust, 1994). Relationships between organizations can be strategic partnerships, buyer-supplier relationships, transactions, and other business associations (Gulati & Gargiulo, 1999; Powell et al., 2005 (forthcoming); Stuart, 1998).

In information networks, nodes can be documents, articles, words and phrases, or other objects containing data and information assets. Links are formed because of the underlying relevance or similarity in the content of the nodes. Examples of information networks are citation networks (Garfield, 2001; Hajra & Sen, 2005), which consist of documents and citation links, and the World Wide Web (Brin & Page, 1998; Kleinberg, 1998), which consists of a large number of Web pages connected by hyperlinks.

Communication networks, such as electric power grids and the Internet, are often used to facilitate the transmission of certain resources or information (Amaral et al., 2000; Watts & Strogatz, 1998). On the Internet, for example, computers and routers are connected through cables and wires that transmit digitized data.

Biological networks contain biological components that interact with each other. Examples of biological networks include metabolic pathways (Jeong et al., 2000), genetic regulatory networks (Somogyi & Sniegoski, 1996), biochemical networks, food webs (Garlaschelli et al., 2003), and neural networks (Watts & Strogatz, 1998).
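The node-and-link abstraction above maps directly onto a simple data structure. As a minimal sketch (the node labels and the Python representation are illustrative, not taken from any data set in this dissertation), a sparse undirected network can be stored as an adjacency list:

```python
# A network as an adjacency list: each node maps to the set of
# nodes it is directly linked to (undirected, unweighted links).
network = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def add_link(net, u, v):
    """Insert an undirected link between nodes u and v."""
    net.setdefault(u, set()).add(v)
    net.setdefault(v, set()).add(u)

def degree(net, node):
    """Number of links incident to a node."""
    return len(net.get(node, set()))

add_link(network, "D", "E")
print(degree(network, "B"))  # 3 links: A, C, D
print(degree(network, "E"))  # 1 link: D
```

The adjacency-list form stores only the links that exist, which is why it suits the large, sparse social and information networks discussed throughout this dissertation.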

Network structure mining is aimed at extracting valid, novel, and useful structural patterns in various networks. The structural patterns refer to a range of regularities in the structure of networks, such as:

Who are the most influential customers whose purchasing decisions may influence other customers (Domingos & Richardson, 2001)? What are the classic articles that are cited frequently by other articles in a scientific discipline (Culnan, 1987; Small, 1999)? How can people locate high-quality pages on the World Wide Web (Brin & Page, 1998; Kleinberg, 1998)?

Are there different research specialties and paradigms in a scientific discipline (Culnan, 1986; Giannakis & Croom, 2001; Small, 1977)? How can users find communities of Web pages that discuss similar topics (Flake et al., 2000; Gibson et al., 1998)? Do criminals or terrorists form groups or teams to carry out offenses (McAndrew, 1999; Xu & Chen, Forthcoming)?

What is the “big picture” of a large network (Chen et al., 2001; Small, 1999; Toyoda & Kitsuregawa, 2001)? What are the properties that characterize networks of specific topologies (Albert & Barabási, 2002; Bollobás, 1985; Watts & Strogatz, 1998)?

How do information, technology, fads, diseases, and viruses spread in social, communication, and biological networks (Kephart et al., 1998; Liljeros et al., 2001; Valente, 1995)? Does the structure of the network affect the speed of spreading?

How robust is a network against failures and attacks (Albert et al., 2000)? How can people protect computer networks (Tu, 2000), social networks, and biological networks (Jeong et al., 2001) from attacks?

What are the patterns of dynamics in the network structure over time (Barabási et al., 2002; Doreian & Stokman, 1997)? How do networks evolve (Dorogovtsev & Mendes, 2003)? What are the mechanisms that govern the evolution of networks (Barabási & Albert, 1999; Bianconi & Barabási, 2001; Menczer, 2004)?
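Several of these questions amount to ranking nodes by a structural measure. As an illustrative sketch only (degree centrality is a standard social network analysis measure rather than a technique proposed here, and the customer names are invented), the “most influential” node in a small network can be found by normalized degree:

```python
def degree_centrality(network):
    """Degree centrality: a node's link count divided by the
    maximum possible count (n - 1 other nodes)."""
    n = len(network)
    return {node: len(neighbors) / (n - 1)
            for node, neighbors in network.items()}

# A toy customer network stored as an adjacency list.
network = {
    "Ann": {"Bob", "Cai", "Dee"},
    "Bob": {"Ann"},
    "Cai": {"Ann", "Dee"},
    "Dee": {"Ann", "Cai"},
}

scores = degree_centrality(network)
best = max(scores, key=scores.get)
print(best, scores[best])  # Ann is linked to all 3 others: 3/3 = 1.0
```

Chapters 3 and 4 review richer measures (betweenness, closeness) that the later case studies actually use; degree is shown here only because it makes the ranking idea concrete in a few lines.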

The research on network structure can support decision making in a wide variety of application domains, including e-commerce and marketing (Domingos & Richardson, 2001; Janssen & Jager, 2003), strategic planning (Powell et al., 2005 (forthcoming)), citation analysis (Culnan, 1986; Small, 1999), Web mining (Gibson et al., 1998; Kleinberg, 1998; Toyoda & Kitsuregawa, 2003), knowledge sharing (Kautz et al., 1997), and security and intelligence (McAndrew, 1999; Sparrow, 1991; Xu & Chen, 2005).


However, because the research on network structure mining is young compared with other data mining fields, it faces several challenges. First, there has been no comprehensive research framework that provides a taxonomy and summary of the major research questions, techniques, and applications of network structure mining. This new field is multidisciplinary in nature and has been studied in several reference disciplines, including sociology, mathematics, statistics, physics, computer science, and biology. These disciplines share many common research questions related to network structure and also offer unique perspectives and methodologies for studying networks. It is desirable to develop a research framework that consolidates these different perspectives, summarizes existing techniques and findings, and provides guidance for future research.

Second, although many techniques have been proposed to tackle various network-related problems, such as the identification of important nodes and the detection of groups, research on network structure mining still strives to find new techniques that are more effective, efficient, scalable, and useful.

Third, most existing network studies focus on the static structural patterns in networks. How to extract patterns of dynamics in networks is still a challenging problem. In addition, because specific evolution processes lead to specific network structures, which in turn affect the function and performance of networks, the search for the underlying mechanisms that govern network evolution is particularly important. Presently, research on network evolution is still in its infancy (Dorogovtsev & Mendes, 2003).


Last, it is believed that the research on networks has led to a “new science of networks” (Barabási, 2002; Watts, 2004). The significance of this new science for supporting knowledge management and decision making in real-world applications, together with the impacts of network mining technology on users, organizations, and society, remains an open question. A large number of empirical studies intended to evaluate such significance and impacts need to be conducted to demonstrate the value of this new field.

Facing these challenges, this dissertation is intended to achieve the following research objectives:

To develop a comprehensive research framework that incorporates major research questions, techniques, methodologies, and applications of network structure mining;

To develop and employ effective and efficient techniques for mining static and dynamic structural patterns in networks in several application domains;

To evaluate the performance of these techniques in terms of their abilities to support knowledge management and decision making.

The remainder of this dissertation is organized as follows. Chapter 2 presents the research framework of network structure mining after reviewing related literature. Chapters 3, 4, and 6 demonstrate several network mining techniques that support knowledge management in the law enforcement, intelligence, and security domains (Xu & Chen, 2004, 2005). Chapter 5 is devoted to a new network partition approach that is more efficient than existing approaches. Chapter 7 proposes a new network evolution model.

Chapter 8 summarizes the contributions of this dissertation, points out the connections between network structure mining and business and management, and suggests future research directions.


CHAPTER 2: LITERATURE REVIEW AND THE RESEARCH FRAMEWORK

Based on an extensive review of prior work, I present the computational framework in this chapter. This computational framework consists of several major research questions in network structure mining and existing techniques for addressing these questions. Before the literature review I first introduce the theoretical foundations and fundamental concepts of network structure mining.

The study of network structure is a multidisciplinary area and is grounded on three different theoretical foundations: graph theory from mathematics and computer science, social network analysis from sociology, and topological analysis from statistical physics.

Graph theory is the study of the properties of graphs (Bollobás, 1998). It provides the mathematical formalism for defining, representing, and solving a series of graph-related problems such as graph isomorphism problems, graph coloring problems, network flow problems, etc. (Bollobás, 1998; Harary, 1994). Graph theory was first introduced in mathematics and has advanced substantially in computer science. While mathematicians focus on formal solutions of graph-related problems, computer scientists focus on the development of efficient algorithms for dealing with graphs. Since its introduction in the 18th century, graph theory has grown into a full-fledged branch of its own. Developments and results in graph theory have been used to tackle problems in a wide variety of applications including electronic circuit layout (vanCleemput, 1976), task scheduling (Hesham et al., 1994), resource allocation (Deo, 1974), and computer network design, among many others.

The mathematical and algorithmic solutions to graph-related problems were not intended for structural pattern mining purposes, but their applications can help extract regularities in network structures. For example, the algorithms for finding the maximum flow and minimum cut in network flow problems have been used to identify Web communities (Flake et al., 2000). A Web community is a set of Web pages that discuss similar topics or are created by authors sharing common interests (Flake et al., 2000; Gibson et al., 1998).

Another important theoretical foundation of network structure mining is social network analysis (SNA). SNA is used in sociology research to analyze patterns of relationships and interactions between social actors in order to discover the underlying social structure (Berkowitz, 1982; Breiger, 2004; Scott, 1991; Wasserman & Faust, 1994). The most distinctive feature of SNA is “the use of structural or relational information to study or test social theories” (Wasserman & Faust, 1994, p. 21). Not only the attributes of social actors, such as their age, gender, socioeconomic status, and education, but also the properties of relationships between social actors, such as the nature, intensity, and frequency of the relationships, are believed to have an important impact on the social structure. SNA methods have been employed to study organizational behavior (Borgatti & Foster, 2003; Brass, 1984), inter-organizational relations (Powell et al., 2005 (forthcoming); Stuart, 1998), citation patterns (Baldi, 1998; Price, 1965), computer-mediated communication (Garton et al., 1999), and many other domains.

SNA has both behavioral and computational focuses. The behavioral focus is on the validation of social theories based on the regularities found in social relationships. The computational focus is on the development of methods and measures for finding those regularities. The computational focus thus is the most relevant to network structure mining research.

Computational SNA distinguishes between relational analysis and positional analysis (Burt, 1980; Wasserman & Faust, 1994). Relational analysis studies the connectivity of a social network. It is often used to identify central members or to find subgroups in a social network. In such studies, links usually are weighted by relational strength.

Positional analysis is concerned with structural roles of social actors. The purpose of positional analysis is to discover the overall structure of a social network. Both relational analysis and positional analysis are very relevant to the extraction of structural patterns from networks. For example, the centrality measures in relational analysis can be used to identify influential authors in citation networks (Culnan, 1987).

A recent movement in statistical physics has brought revolutionary insights and research methodology to the study of network structure. This new movement is best described as statistical analysis of network topology (Albert & Barabási, 2002). Unlike graph theory and SNA, which deal primarily with the static structure of networks, topological analysis views the structure of a network as the result of some evolutionary process, which can be described and modeled using certain statistical mechanisms. The power of this new perspective lies in its ability to explain and predict the structural phenomena observed in large networks such as the World Wide Web (Albert et al., 1999).

Three models have been proposed to characterize the topology of large, complex networks: the random graph model (Bollobás, 1985; Erdös & Rényi, 1960), the small-world model (Watts & Strogatz, 1998), and the scale-free model (Barabási & Albert, 1999). A random network starts with a fixed number of nodes. With a probability p, two arbitrary nodes are selected and connected by a link. As a result, each node has roughly the same number of links. The degree distribution, P(k), is the probability that a node has exactly k links. It has been shown that the degree distribution of a random graph follows the Poisson distribution (Bollobás, 1985), peaking at the average degree. A random network usually has a small average path length, so that an arbitrary node can reach any other node in a few steps. The assumption of the random graph model is that the evolution of real networks is primarily a random process. In the past few decades the random graph model has been used as the single model of network topology. However, it has recently been found that most complex systems and real networks are not random but are governed by certain organizing principles encoded in the topology of the networks (Albert & Barabási, 2002).
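The G(n, p) construction just described can be sketched in a few lines of Python (an illustrative, stdlib-only sketch, not code from this dissertation; the function names random_graph and degree_distribution are mine):

```python
# Erdos-Renyi random graph G(n, p): every pair of nodes is linked
# independently with probability p, so each node's degree concentrates
# around the mean (n - 1) * p, approximately Poisson-distributed.
import random
from collections import Counter

def random_graph(n, p, seed=0):
    """Return an undirected G(n, p) graph as an adjacency set per node."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def degree_distribution(adj):
    """P(k): fraction of nodes with exactly k links."""
    counts = Counter(len(neighbors) for neighbors in adj.values())
    n = len(adj)
    return {k: c / n for k, c in counts.items()}

if __name__ == "__main__":
    g = random_graph(n=1000, p=0.01)
    dist = degree_distribution(g)
    mean_degree = sum(k * p_k for k, p_k in dist.items())
    print(f"mean degree ~ {mean_degree:.2f}")
```

For n = 1000 and p = 0.01 the mean degree comes out near (n - 1)p, about 10, and P(k) clusters around that value, matching the Poisson picture above.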

The small-world model and scale-free model substantially deviate from the random graph model (Albert & Barabási, 2002; Newman, 2003b). A small-world network has a significantly higher tendency to form clusters and groups (Watts & Strogatz, 1998), which are rarely present in random graphs. Scale-free networks (Barabási & Albert, 1999), on the other hand, are characterized by a power-law degree distribution, meaning that while a large percentage of nodes in the network have just a few links, a small percentage of the nodes have a large number of links. It is believed that scale-free networks evolve following a self-organizing principle, in which growth and preferential attachment play key roles in the emergence of the power-law degree distribution (Barabási & Albert, 1999).
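Growth with preferential attachment can likewise be sketched (an illustrative toy, not the original Barabási-Albert implementation; each new node here adds a single link, and the degree-proportional choice is implemented with a repeated-node list):

```python
# Preferential attachment: each new node links to an existing node chosen
# with probability proportional to its current degree, which produces a
# few highly connected hubs alongside many low-degree nodes.
import random

def preferential_attachment(n, seed=0):
    """Grow a network one node at a time; return each node's degree."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}   # start from a single link between nodes 0 and 1
    targets = [0, 1]        # node i appears degree[i] times in this list,
                            # so a uniform draw is degree-proportional
    for new in range(2, n):
        chosen = rng.choice(targets)
        degree[new] = 1
        degree[chosen] += 1
        targets.extend([new, chosen])
    return degree

if __name__ == "__main__":
    deg = preferential_attachment(10000)
    print("largest degree:", max(deg.values()))
```

With n = 10,000 the largest degree typically lands far above the average degree of about 2, the hub signature of the power-law distribution described above.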

The small-world model and the scale-free model have spurred research on the topological properties of large-scale networks and complex systems since they were proposed in the late 1990s. A large number of papers have been published in leading science journals such as Nature, Science, and the Proceedings of the National Academy of Sciences (PNAS). The new findings and variants of the two models have greatly enriched our knowledge about large, complex networks (Albert & Barabási, 2002).

In addition to the three theoretical foundations, many other disciplines and research communities have contributed to the study of network structure. Among these research communities, Web mining is the most important. Web mining is about the automatic discovery of information, services, and valuable patterns from the content of Web documents (Web content mining), the structure of hyperlinks (Web structure mining), and the usage of Web pages and services (Web usage mining) (Etzioni, 1996). An important application of Web mining is to improve the design of online search engines and crawlers to help users find what they look for more effectively and efficiently (Chau et al., 2003). In particular, Web structure mining, often called link analysis in Web mining research (Kleinberg, 1998; Kleinberg & Lawrence, 2001), exploits the structure of hyperlinks between Web pages to locate high-quality Web documents (Broder et al., 2000; Chakrabarti et al., 1999; Kleinberg & Lawrence, 2001; Kumar et al., 1999).

The computational framework incorporates a number of the most important research questions, technologies, and findings in network structure mining built upon the three theoretical foundations. Before presenting this framework I will introduce the fundamental concepts in structural analysis, including the definition, representation, and presentation of networks.

2.2 Fundamental Concepts

2.2.1 Network Definition

Networks are essentially graphs. A graph is a mathematical abstraction of networks of various types. In graph theory a graph is formally defined as a pair of sets G = (V, A), where V is the set of vertices and A is the set of edges, with |V| = n, |A| = m, and m ≤ n².

Vertices are also called nodes, points, and objects. Edges are also called arcs, links, and lines. A graph can be directed or undirected depending on whether the links have origins and destinations. A graph can also be weighted or unweighted depending on whether each link is associated with a numeric label called a weight.

Throughout this dissertation I will use nodes and links to refer to the two basic types of elements in graphs. I will also use network and graph interchangeably. Note that “network” here refers to a general graph and is not the same as the formal definition of network in graph theory, which refers only to a directed, weighted graph (Harary, 1994).

There are a large number of other concepts and terms related to graphs, such as path, density, and subgraph, among many others. I will introduce and define them in later sections as needed.

2.2.2 Network Representation

A graph can be represented in various formats. The two most widely used formats for representing graphs are graphics and matrices. The graphic representation is quite intuitive. For example, Figure 2.1a draws an undirected, unweighted graph consisting of five nodes. The circles represent nodes and the lines between the circles represent links.

Nodes are labeled with numbers in this graph. The graph can also be represented as a matrix (see Figure 2.1b). Such a matrix is called a sociomatrix in social network analysis (Moreno, 1953) and an adjacency matrix in computer science. The graph is represented as an n × n square matrix with rows and columns representing nodes. The value of a cell (i, j) is set to the weight of the link between nodes i and j. A zero means that i and j are not directly connected. In this simple example cell values are either 1 or 0, indicating the presence or absence of links.
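The matrix construction described above can be illustrated with a short sketch (the five-node edge list is a made-up example, not Figure 2.1 itself):

```python
# Build the n x n adjacency matrix of an undirected, unweighted graph:
# cell (i, j) holds 1 when nodes i and j are linked, 0 otherwise.

def adjacency_matrix(n, links):
    """links: iterable of (i, j) node pairs, nodes numbered 0..n-1."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in links:
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: the matrix is symmetric
    return matrix

if __name__ == "__main__":
    links = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    for row in adjacency_matrix(5, links):
        print(row)
```

For a weighted graph the same structure applies, with the cell value set to the link weight instead of 1.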



Figure 2.1: Graph representation. (a) Graphic representation. (b) Matrix representation.

2.2.3 Network Presentation

An important issue related to network structure mining is network presentation and visualization. As the old saying goes, “a picture is worth a thousand words”: a good presentation can reflect the intrinsic structure of a network (Battista et al., 1999; Herman et al., 2000) and help users “visually mine” the network. For example, one can easily find the popular nodes that have many links in a network if these nodes are placed close to the center of the network. The hierarchical structure of a tree becomes more apparent if nodes at the same level are placed along the same horizontal line (Reingold & Tilford, 1981).

Two types of approaches have been employed to present and visualize networks, namely multidimensional scaling (MDS) and graph layout approaches. MDS is the most commonly used method for social network visualization (Freeman, 2000). It is a statistical method that projects higher-dimensional data onto a lower-dimensional display. It seeks to provide a visual representation of proximities (dissimilarities) among nodes so that nodes that are more similar to each other are closer on the display and nodes that are less similar are farther apart (Kruskal & Wish, 1978; Young, 1987). Metric MDS deals with numerical proximities. Nonmetric MDS is used when only the rank order of the proximities is considered (Kruskal, 1964; Torgerson, 1952; Young, 1987). For both methods, Kruskal’s STRESS statistic (Kruskal, 1964), which measures the goodness-of-fit when reducing the dimensionality of the data, is the objective function to be optimized. A high STRESS value indicates that the network is significantly distorted and that two distant nodes in the higher-dimensional space may be placed close to each other on the lower-dimensional display. The advantage of MDS is that the physical distance between two nodes on the visual display indicates the “similarity” of the two nodes. However, MDS considers only the positions of nodes and ignores the placement of links. A popular node may not necessarily be placed at the center of the network. In addition, there might be many crossing links, making it difficult to visualize the structure of the network.

Graph layout algorithms have been developed particularly for drawing aesthetically pleasing network presentations (Fruchterman & Reingold, 1991). One of the most important aesthetic rules is to minimize the number of crossing links (Purchase, 1997). Other aesthetic rules include distributing nodes evenly, making link lengths uniform, and keeping nodes from being too close to links (Davidson & Harel, 1996; Fruchterman & Reingold, 1991). To automatically draw graphs of high aesthetic quality, computer scientists have proposed a type of graph layout algorithm called the spring embedder, also known as the force-directed method (Davidson & Harel, 1996; Eades, 1984; Fruchterman & Reingold, 1991; Kamada & Kawai, 1989). This algorithm treats a network as an energy system in which steel rings (nodes) are connected by springs (links). Nodes attract and repulse each other and finally settle down when the total energy carried by the springs is minimized. The network layout generated by a spring embedder might be quite different from that generated by MDS because of their different objective functions and node-position handling mechanisms.

The size of a network can impose a great challenge on the performance of both the MDS and spring embedder algorithms. The time complexities of MDS and the spring embedder algorithms are O(n²) and O(n³), respectively (Herman et al., 2000), where n is the size of the network. It can be quite slow to draw a network consisting of thousands of nodes. More importantly, as the size of a network increases the presentation becomes more and more cluttered, making it difficult for users to comprehend the structure. One approach to address this problem is the focus+context technique (Furnas, 1986). This technique mimics the distorting effect of a fisheye lens such that objects around the focal point selected by a user are enlarged and objects in the distance are shown with less detail. As a result users can examine the local details of a network without losing the sense of the context of the whole network.

These fundamental concepts of network structure, such as graph definition, representation, and presentation, are helpful for understanding the technology of network structure mining, which is presented in the computational framework in the next section.


Because network structure mining has great potential for supporting knowledge management and decision making in many application domains yet faces many challenges, I develop this computational framework. The framework provides a taxonomy of the major research areas in this new field, identifies the key research questions in each area, and reviews existing techniques for addressing these research questions.

Figure 2.2 presents this computational research framework for network structure mining.

There are two major areas: static structure mining and dynamic structure mining. Static structure mining studies a “snapshot” of a network, that is, the nodes and links observed at a single point in time. Dynamic structure mining, in contrast, analyzes a network based on data observed at multiple points in time. Static analysis is aimed at discovering the structural regularities in the specific configuration of the nodes and links of a network at the time of observation. Dynamic analysis is aimed at finding the patterns of change in the network over time. The focus of static analysis is on structure, while the focus of dynamic analysis is on the processes and evolutionary mechanisms that lead to the structure (Barabási & Albert, 1999; Doreian & Stokman, 1997).


Locating critical resources
- Identifying key nodes: graph theoretical (centrality measures, neighborhood function); link-analysis-based (HITS, PageRank)
- Identifying key links/paths: graph theoretical (edge betweenness, shortest-path algorithms)

Reducing network complexity
- Identifying subgroups: weighted graph partitioning (spectral clustering, hierarchical clustering); unweighted graph partitioning (link-analysis-based, graph theoretical, hierarchical clustering)
- Modeling between-group relationships: blockmodeling

Extracting topological properties
- General properties; characterizing properties (average path length, clustering coefficient, degree distribution)

Describing structural dynamics
- General properties; characterizing properties

Modeling structural dynamics
- Analytical approaches; simulation approaches

Figure 2.2: The computational framework for network structure mining.


2.3.1 Static Structure Mining

The three major problems of static structure mining are locating critical resources in networks, reducing network complexity, and extracting the topological properties of networks.

2.3.1.1 Locating Critical Resources in Networks

A network can be viewed as a collection of resources. On the World Wide Web, for example, the contents of Web documents can be viewed as information resources. Users search for quality Web pages whose contents match their information needs. Cables and wires in a computer network are also resources, whose breakage may bring the whole network down. The key people, documents, relations, and communication channels in a network often are critical to the function of the network. Existing techniques for locating critical resources have been used in a number of applications, such as finding high-quality pages on the Web (Chakrabarti et al., 1999; Kleinberg, 1998), locating cables and wires whose failure reduces the robustness of the Internet (Kleinberg et al., 2004; Kumar et al., 2002; Tu, 2000), searching for experts on a specific problem in collaboration networks (Kautz et al., 1997; Newman, 2001b), and identifying leaders and gatekeepers in criminal and terrorist networks (Krebs, 2001; Xu & Chen, 2005).

In general, the key resources in a network are those important nodes, links, or paths, which are sequences of links.

Identifying Key Nodes


Methods for identifying key nodes can be categorized into two types: graph theoretical approaches and link analysis based approaches.

Graph Theoretical Approaches

Graph theoretical approaches originate from graph theory and social network analysis. They treat a network as a graph and identify the key nodes based on the link structure of the network. Centrality measures in SNA are often used to locate key nodes. Freeman (1979) provides definitions of the three most popular centrality measures: degree, betweenness, and closeness.

Degree measures how active a particular node is. It is defined as the number of direct links a node has. “Popular” nodes with high degree scores are the leaders, experts, or hubs in a network. It has been shown that these popular nodes can be a network’s “Achilles’ heel,” whose failure or removal will cause the network to quickly fall apart (Albert et al., 2000; Holme et al., 2002). In particular, in some communication networks such as electric power grids and the Internet, a key node’s failure may cause cascading breakdowns of other nodes due to traffic rerouting (Watts, 2002; Zhao et al., 2004). In the counter-terrorism and crime-fighting context, the removal of key offenders is often an effective disruptive strategy (McAndrew, 1999; Sparrow, 1991).

Betweenness measures the extent to which a particular node lies between other nodes in a network. The betweenness of a node is defined as the number of geodesics (shortest paths between two nodes) passing through it. Nodes with high betweenness scores often serve as gatekeepers and brokers between different parts of a network. They are important communication channels through which information, goods, and other resources are transmitted or exchanged (Newman, 2004a; Wasserman & Faust, 1994). Holme et al. (2002) show that the removal of nodes with high betweenness scores can be more devastating than the removal of nodes with high degrees.

Closeness is the sum of the lengths of the geodesics between a particular node and all the other nodes in a network. It actually measures how far away one node is from the other nodes and is sometimes called “farness” (Baker & Faulkner, 1993; Freeman, 1979). A node that is far from the other nodes (a high farness score) may find it very difficult to communicate with them. Such nodes are thus more “peripheral” and can become outliers in the network (Sparrow, 1991; Xu & Chen, 2005).
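The three measures can be sketched for a small undirected, unweighted graph as follows (an illustrative, stdlib-only implementation, not the dissertation's code; betweenness here follows the standard convention of summing, over all node pairs, each node's share of the pair's geodesics):

```python
# Freeman's three centrality measures: degree (direct links), farness
# (sum of geodesic lengths), and betweenness (share of geodesics through
# a node). Geodesic counts come from one BFS per node.
from collections import deque

def geodesics_from(adj, s):
    """BFS from s: geodesic length and number of geodesics to every node."""
    dist, npaths = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], npaths[w] = dist[u] + 1, 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                npaths[w] += npaths[u]
    return dist, npaths

def centralities(adj):
    nodes = list(adj)
    info = {v: geodesics_from(adj, v) for v in nodes}
    degree = {v: len(adj[v]) for v in nodes}
    farness = {v: sum(info[v][0].values()) for v in nodes}
    betweenness = {v: 0.0 for v in nodes}
    for s in nodes:
        ds, ns = info[s]
        for t in nodes:
            if t == s or t not in ds:
                continue
            for v in nodes:
                if v == s or v == t:
                    continue
                dv, nv = info[v]
                # v lies on a geodesic from s to t
                if v in ds and t in dv and ds[v] + dv[t] == ds[t]:
                    betweenness[v] += ns[v] * nv[t] / ns[t]
    # each undirected pair (s, t) was visited twice
    return degree, farness, {v: b / 2 for v, b in betweenness.items()}

if __name__ == "__main__":
    # toy graph: node c bridges the triangle (a, b, c) and the tail d-e
    adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
           "d": {"c", "e"}, "e": {"d"}}
    degree, farness, betweenness = centralities(adj)
    print(degree["c"], farness["c"], betweenness["c"])  # 3 5 4.0
```

On this toy graph the bridging node c has the highest degree and betweenness, while the end node e has the largest farness, i.e., it is the most peripheral.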

Another centrality-related measure in SNA is prestige, which is similar to degree but is defined for directed graphs (Wasserman & Faust, 1994). The prestige of a node is the number of in-links the node has. A prestigious node tends to receive many nominations from other nodes.

Both degree and prestige measure the importance of a node based on the node’s direct neighbors. Recently, Palmer et al. (2002) have proposed a neighborhood function for categorizing the importance of nodes. The neighborhood function for a node u at distance h is the total number of nodes that can be reached from u within h or fewer hops. An important router in a computer network, for example, will be the one that can reach most of the other routers within a few hops.

Link Analysis Based Approaches

In Web mining research, the HITS (Kleinberg, 1998) and PageRank (Brin & Page, 1998) algorithms are the two most widely used methods for locating high-quality documents on the Web. Unlike centrality measures, which calculate the scores directly, both the HITS and PageRank algorithms are iterative procedures.

The HITS (Hyperlink-Induced Topic Search) algorithm is based on a simple intuition: high-quality Web pages can be either authoritative pages or hub pages. Authoritative pages contain high-quality information related to a particular topic and thus may be pointed to by many other pages. Hub pages are not necessarily authoritative themselves but provide links to many authoritative pages. The authority score of a page thus is measured by the number of in-links it receives from hub pages, and the hub score of a page is measured by the number of its out-links that point to authoritative pages. The algorithm begins by assigning random numbers to the authority and hub scores of all pages; the two scores of each page are then iteratively updated until they converge. Similarly, the PageRank algorithm determines the quality of a page based on the number of in-links the page receives. In addition, each in-link is weighted by the quality of the page from which the link originates, and the quality of this neighbor page is itself determined by PageRank.
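The HITS iteration can be sketched as follows (an illustrative implementation, not Kleinberg's original; for reproducibility it starts from uniform rather than random scores, and the toy Web graph is made up):

```python
# HITS: authority scores accumulate from the hub scores of in-linking
# pages, hub scores accumulate from the authority scores of out-linked
# pages, and both are normalized each round until they converge.

def hits(out_links, iterations=50):
    """out_links: dict mapping each page to the list of pages it points to."""
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in out_links[q])
                for p in pages}
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # normalize so the scores do not grow without bound
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

if __name__ == "__main__":
    # h1 and h2 are pure hubs; a1 and a2 are pure authorities
    web = {"h1": ["a1", "a2"], "h2": ["a2"], "a1": [], "a2": []}
    auth, hub = hits(web)
    print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In the toy graph, a2 receives in-links from both hubs and ends with the top authority score, while h1, which points to both authorities, ends with the top hub score, mirroring the mutual reinforcement described above.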

Identifying Key Links/Paths

A set of graph theoretical approaches has been proposed to identify the key links and paths in a network. Girvan and Newman (2002) define a measure called edge betweenness to find links that serve as bridges between different groups in a network. Analogous to node betweenness, the edge betweenness of a link is the number of shortest paths passing through it. If a network contains groups and there are a few bridges connecting these groups, one must pass through these bridges when traveling from one group to another. These bridges are critical to the connectivity of the whole network. Removing links with high edge betweenness scores can easily cause the network to break down. Edge betweenness has been used for network partition tasks, which will be reviewed shortly (Girvan & Newman, 2002). Because the calculation of edge betweenness requires a global traversal of the graph, which is computationally costly, Radicchi et al. (2004) propose the edge clustering coefficient, a measure that requires only local traversal to approximate edge betweenness.

For key path identification, the most widely used algorithm is the shortest-path algorithm (Dijkstra, 1959). The algorithm can find the shortest path between two nodes, which might be the quickest way to travel from one city to another (Wang & Crowcroft, 1992), the most efficient route to transmit data from one router to another (Perkins & Bhagwat, 1994), or the strongest relationship between two people (Xu & Chen, 2004). It has also been used to find the node-independent paths in a network (White & Newman, 2001). The number of node-independent paths between two nodes is the minimum number of nodes that must be removed to disconnect the two nodes. Node-independent paths thus have a direct impact on the robustness of a network. The classic Dijkstra algorithm computes the shortest paths from a single source node to every other node in a graph. Other variants improve the speed of the algorithm using efficient data structures. For example, the Priority-First-Search (PFS) algorithm (Cormen et al., 1991) is faster than the Dijkstra algorithm through the use of a priority queue.
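The priority-queue idea can be sketched with a compact Dijkstra variant (illustrative code using Python's heapq; the weighted graph is made up):

```python
# Dijkstra's shortest path with a priority queue: always expand the
# frontier node with the smallest accumulated cost, so the first time
# the target is popped its cost is optimal (for nonnegative weights).
import heapq

def shortest_path(adj, source, target):
    """adj: dict node -> list of (neighbor, weight). Returns (cost, path)."""
    heap = [(0, source, [source])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, w in adj[node]:
            if nbr not in visited:
                heapq.heappush(heap, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

if __name__ == "__main__":
    g = {
        "a": [("b", 1), ("c", 4)],
        "b": [("a", 1), ("c", 2), ("d", 5)],
        "c": [("a", 4), ("b", 2), ("d", 1)],
        "d": [("b", 5), ("c", 1)],
    }
    print(shortest_path(g, "a", "d"))  # (4, ['a', 'b', 'c', 'd'])
```

In the toy graph the cheapest a-to-d route goes a, b, c, d at cost 4, beating the direct but heavier alternatives, which is exactly the behavior a link-strength-weighted relationship search relies on.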

2.3.1.2 Reducing Network Complexity

A network can be very complex due to the large number of nodes and links it contains. Understanding the structure of a network becomes increasingly difficult as its size scales up. For example, a marketing manager may get lost when facing a network consisting of thousands of existing and potential customers. A researcher may find it difficult to understand the intellectual structure of an unfamiliar discipline when studying its citation networks containing hundreds of papers or authors. Therefore, it is desirable to extract the “big picture” from a complex network by reducing it to a simpler image while preserving the intrinsic structure. To achieve this goal, a network can first be partitioned into subgroups, each of which contains a set of nodes. The between-group relationships can then be extracted. A number of applications can benefit from this technology. In particular, network partition methods have been employed to find communities on the Web (Flake et al., 2000; Gibson et al., 1998; Toyoda & Kitsuregawa, 2001), major research topics and paradigms of a discipline in citation networks (Small, 1999; White & McCain, 1998), and criminal groups in criminal networks (Xu & Chen, 2005).


Identifying Subgroups

In SNA a group is cohesive if nodes in this group have stronger or denser links with nodes within the group than with nodes outside the group (Wasserman & Faust, 1994). The methods for identifying cohesive subgroups and partitioning a network differ depending on whether the network is weighted or unweighted. A weighted graph can be partitioned into cohesive groups by maximizing the within-group link weights while minimizing the between-group link weights. Because link weight represents node similarity or link strength and intensity, nodes in the same group are more similar to each other or more strongly connected. An unweighted graph can be partitioned into cohesive groups by maximizing within-group link density while minimizing between-group link density. In this case, cohesive groups are densely knit subsets of the graph. Weighted graph partitioning is less challenging than unweighted graph partitioning.

Weighted Graph Partitioning

Given a weighted graph, spectral clustering and hierarchical clustering methods can be used to find subgroups in the graph.

Spectral clustering methods partition a graph by analyzing the spectrum of the Laplacian matrix representing the graph (Fiedler, 1973; Pothen et al., 1990). The Laplacian matrix is constructed from the graph’s adjacency matrix, and its spectrum is found by calculating the eigenvalues and eigenvectors of the matrix. The optimal partition is found by minimizing the total link weight between groups. The eigenvector corresponding to the optimal solution of the objective function gives a reduced, lower-dimensional representation of the graph. Nodes are then mapped into this lower-dimensional representation, and closer nodes fall in the same cluster. The problem with spectral clustering methods is that the number of clusters to be found must be specified beforehand (Chung, 1997; Kannan et al., 2004; Pothen et al., 1990). They cannot be used to partition a network when the number of groups is unknown.
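Spectral bisection, the simplest member of this family, can be sketched as follows (assumes NumPy; the two-triangle example and the sign-based split are illustrative simplifications of the methods cited above):

```python
# Spectral bisection: build the Laplacian L = D - W, take the eigenvector
# of the second-smallest eigenvalue (the Fiedler vector), and split the
# nodes by the sign of their entries.
import numpy as np

def spectral_bisect(W):
    """W: symmetric nonnegative weight matrix; returns a 0/1 label per node."""
    degrees = W.sum(axis=1)
    laplacian = np.diag(degrees) - W
    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector for the second-smallest eigenvalue
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return (fiedler > 0).astype(int)

if __name__ == "__main__":
    # two triangles joined by a single weak link between nodes 2 and 3
    W = np.zeros((6, 6))
    for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                    (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
        W[i, j] = W[j, i] = w
    print(spectral_bisect(W))  # one triangle per group
```

Cutting by the sign pattern minimizes (approximately) the total link weight between the two groups, so the weak 0.1 link is the one that gets cut; note that this sketch hard-codes the number of groups at two, the very limitation noted above.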

Hierarchical clustering is an alternative approach that does not require prior knowledge of the number of groups. There are two types of hierarchical clustering methods: agglomerative and divisive (Jain & Dubes, 1988; Jain et al., 1999; Johnson, 1967). These methods partition a graph into a series of nested clusters rather than a fixed number of clusters. In hierarchical clustering, link weights are often transformed into distances.

Agglomerative methods start with individual nodes, each of which is treated as a cluster. The algorithm merges the two clusters that are closest to each other into one cluster. Smaller clusters are progressively merged until all nodes in the network fall into one big cluster. The nested clusters are organized in a tree-like structure often called a dendrogram. A dendrogram can be “cut” at a specific distance level, corresponding to a specific partition of the network. In contrast to agglomerative methods, divisive methods treat the whole network as one cluster at the beginning and progressively remove the longest (weakest) links until the network dissolves into individual nodes. The most efficient hierarchical clustering algorithms run in O(n²) time and space (Murtagh, 1984).


The disadvantage of hierarchical clustering methods is that the determination of the cut level of the dendrogram is often ad hoc and rather subjective (Jain et al., 1999).

Hierarchical clustering is widely used to partition weighted graphs. However, for graphs such as the World Wide Web, citation networks, and other networks where link weights are not available, the problem becomes more challenging.

Unweighted Graph Partitioning

Three types of methods have been proposed to partition unweighted graphs: link analysis based methods, graph theoretical approaches, and hierarchical clustering.

Link analysis based methods are used in Web mining research to identify Web communities (Gibson et al., 1998; Kumar et al., 1999; Toyoda & Kitsuregawa, 2001, 2003). These methods are rooted in the HITS algorithm proposed by Kleinberg (Kleinberg, 1998). Kumar et al. (1999) propose a trawling approach to find a set of core pages containing both authoritative and hub pages for a specific topic. The core is a directed bipartite subgraph whose node set is divided into two sets, with all hub pages in one set and authoritative pages in the other. The core and the other related pages constitute a Web community (Gibson et al., 1998; Toyoda & Kitsuregawa, 2001, 2003).

In addition to link analysis based approaches, graph theoretical approaches have also been used to find Web communities (Flake et al., 2000; Flake et al., 2002; Imafuji & Kitsuregawa, 2002). These approaches focus on the minimum-cut problem, which finds clusters of roughly equal sizes while minimizing the number of links between clusters.


Realizing that the minimum-cut problem is equivalent to the maximum-flow problem in graph theory (Ford Jr. & Fulkerson, 1956), Flake et al. (2000) formulate the Web community identification problem as an s-t maximum flow problem. Efficient algorithms for solving the minimum-cut problem, such as the Kernighan-Lin algorithm (Kernighan & Lin, 1970), run in O(n²) time. However, the size of the communities must be specified beforehand (Newman, 2004b).

Both link analysis based methods and graph theoretical approaches were proposed for graph partitioning in the Web context. They require seed nodes, i.e., starting pages, to find Web communities. They are not appropriate for finding communities in general graphs where no seed nodes are available.

Recently, researchers have proposed a number of hierarchical clustering methods to partition unweighted networks. The G-N algorithm (Girvan & Newman, 2002), for example, is a divisive clustering algorithm. When deciding which link to remove at each step, the G-N algorithm selects the one with the highest edge betweenness (Girvan & Newman, 2002) and iteratively removes the links with the highest betweenness. In each iteration, the betweenness of each remaining link must be recomputed. It has been shown that the algorithm is effective in identifying groups in various real networks (Girvan & Newman, 2002; Newman & Girvan, 2004; Radicchi et al., 2004). However, the algorithm is rather slow and runs in O(m²n) time, for two reasons. First, the calculation of betweenness depends on the computation of shortest paths, which requires global traversals of the network. Second, the algorithm must recompute betweenness in every iteration. The lack of scalability severely limits the G-N algorithm's ability to partition large networks such as the World Wide Web and the Internet.

Variants of the G-N algorithm have been proposed to improve its efficiency. Radicchi et al. (2004) propose an alternative divisive algorithm that uses the edge clustering coefficient to approximate edge betweenness. Newman (2004c) proposes an agglomerative approach based on a measure called modularity. The modularity of a network indicates how much the graph structure deviates from a random graph, in which no group structure exists. In each iteration, the algorithm seeks the pair of clusters whose merge results in the largest increase or smallest decrease in the value of modularity. Although they are faster than the G-N algorithm, the two new algorithms' time complexities still scale with m². Details of these algorithms will be provided in Chapter 5.
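The modularity measure these agglomerative variants optimize can be sketched as follows. The formula Q = Σ_c [L_c/m − (d_c/2m)²], where L_c is the number of intra-community links and d_c the total degree of community c, is the standard undirected form; the six-node graph (two triangles joined by one bridge link) is a hypothetical example, not data from the studies cited above.

```python
# Sketch: modularity Q of a partition of an undirected graph.
# Q = sum over communities c of [L_c/m - (d_c/(2m))^2].
def modularity(edges, communities):
    m = len(edges)
    comm_of = {v: i for i, c in enumerate(communities) for v in c}
    intra = [0] * len(communities)        # L_c: intra-community links
    degree_sum = [0] * len(communities)   # d_c: total degree per community
    for u, v in edges:
        degree_sum[comm_of[u]] += 1
        degree_sum[comm_of[v]] += 1
        if comm_of[u] == comm_of[v]:
            intra[comm_of[u]] += 1
    return sum(lc / m - (dc / (2 * m)) ** 2
               for lc, dc in zip(intra, degree_sum))

# Two triangles joined by a single bridge link (hypothetical example).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
good = modularity(edges, [{0, 1, 2}, {3, 4, 5}])   # the natural split
bad = modularity(edges, [{0, 1, 2, 3, 4, 5}])      # everything together
print(round(good, 4), round(bad, 4))  # 0.3571 0.0
```

An agglomerative algorithm in Newman's style would start from singleton communities and repeatedly perform the merge that maximizes the change in Q.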

Modeling Between-group Relationships

After a network is partitioned into groups, the between-group relationships become composites of links between individual nodes. In SNA, a positional analysis method called blockmodeling is often used to discover the overall structure of a social network (White et al., 1976).

Blockmodeling identifies between-group relationships and interaction patterns after network partition. However, rather than being partitioned into subgroups, the network is clustered into positions based on a structural equivalence measure (Lorrain & White, 1971; Wasserman & Faust, 1994). Two nodes are structurally equivalent if they have identical links to and from other nodes. A position thus is a collection of nodes that are structurally substitutable, or in other words, similar in social activities, status, and connections with other members. Position is different from the concept of subgroup in relational analysis because two network members who are in the same position need not be directly connected (Lorrain & White, 1971; Scott, 1991).

Although it is a positional analysis method, blockmodeling can be used to model relationships between subgroups (Xu & Chen, Forthcoming; Xu & Chen, 2005). Given subgroups in a network, blockmodel analysis determines the presence or absence of a relationship between two subgroups based on the link density (Wasserman & Faust, 1994). When the density of the links between two subgroups is greater than a predefined threshold value, a between-group relationship is present, indicating that the two subgroups interact with each other constantly and thus have a strong relationship. In this way, blockmodeling summarizes individual relational details into relationships between groups so that the overall structure of the network becomes more prominent.
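This density-based blockmodel test can be sketched minimally as follows; the edge list, the two subgroups, and the threshold of 0.2 are toy assumptions for illustration.

```python
# Sketch: blockmodel-style between-group link density.
# A between-group relationship is "present" when the density of links
# between two subgroups exceeds a predefined threshold.
def block_density(edges, group_a, group_b):
    # links actually present between the groups / links possible
    present = sum(1 for u, v in edges
                  if (u in group_a and v in group_b)
                  or (u in group_b and v in group_a))
    possible = len(group_a) * len(group_b)
    return present / possible

# Hypothetical partitioned network with two subgroups.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3), (3, 4), (4, 5)]
a, b = {0, 1, 2}, {3, 4, 5}
density = block_density(edges, a, b)
print(density, density > 0.2)  # relationship present at threshold 0.2
```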

2.3.1.3 Extracting Topological Properties

Recent years have witnessed an increasing interest in the topological properties of large-scale networks such as the World Wide Web (Broder et al., 2000), metabolic pathways (Jeong et al., 2000), food webs (Garlaschelli et al., 2003), citation networks (Hajra & Sen, 2005), and collaboration networks (Newman, 2001b; Watts & Strogatz, 1998), among many others. This new interest in the statistical properties of networks results from two primary reasons. First, data collection and analysis of extremely large networks has become possible due to greatly improved computing power. The size of the World Wide Web studied, for example, has reached several million nodes (Lawrence & Giles, 1999). Second, the recently proposed small-world and scale-free network models (Barabási & Albert, 1999; Watts & Strogatz, 1998) have motivated scientists to search for the universal organizing principles that may be responsible for the commonality observed in a range of networks. These commonalities are found by categorizing, comparing, and contrasting the networks' topological properties (Albert & Barabási, 2002) using two categories of statistics: general statistics and topology characterizing statistics.

General Statistics

These statistics are intended to capture the size and scale of a network regardless of its specific structure. They include the number of nodes, or network size, the number of links, and several others. Table 2.1 provides a relatively complete inventory of the statistics often found in network topology studies. The size of a network is a direct indicator of its complexity. Networks that have been studied range from food webs consisting of a few hundred nodes (Solé & Montoya, 2001) to scientific collaboration networks consisting of millions of authors and papers (Newman, 2001a, 2004a). The giant component is the largest connected component in a network (Bollobás, 1985). Giant components have been found to contain more than 70% of the nodes in various networks (Newman, 2001a). The average degree of a network is the average number of links an arbitrary node has and is defined for an undirected network as <k> = 2m/n. The density of a network is the number of links that are actually present divided by the maximum possible number of links (Wasserman & Faust, 1994). The density of an undirected network thus is d = m / [n(n - 1)/2]. Sparse networks have low densities. The diameter of a network (Wasserman & Faust, 1994) is the length of the longest shortest path in the network.
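These general statistics can be computed directly from an edge list, as the following sketch shows on a hypothetical six-node graph. The giant component is found by breadth-first search, and the diameter is measured within the giant component; the graph itself is an invented example.

```python
# Sketch: computing the general statistics of Table 2.1 for a small
# undirected toy graph (the edge list is a made-up example).
from collections import deque

edges = [(0, 1), (1, 2), (2, 3), (0, 3), (4, 5)]
nodes = sorted({v for e in edges for v in e})
adj = {v: set() for v in nodes}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

n, m = len(nodes), len(edges)
avg_degree = 2 * m / n               # <k> = 2m/n
density = m / (n * (n - 1) / 2)      # d = m / [n(n-1)/2]

def bfs_dist(src):
    """Shortest-path distances from src by breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

# Giant component: the largest set of nodes reachable from any node.
giant = max((set(bfs_dist(v)) for v in nodes), key=len)
# Diameter of the giant component: the longest shortest path.
diameter = max(max(bfs_dist(v).values()) for v in giant)
print(n, m, avg_degree, round(density, 3), len(giant), diameter)
```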

    Statistics                                   Symbol

    General statistics:
    Number of nodes, network size                n
    Number of links                              m
    Number of nodes in the giant component       S
    Percentage of nodes in the giant component   s
    Average degree                               <k>
    Density                                      d
    Largest shortest path length, diameter       D

    Topology characterizing statistics:
    Average shortest path length                 L
    Clustering coefficient                       C
    Degree distribution                          P(k)

Table 2.1: The statistics for network topology.

Topology Characterizing Statistics

Three special statistics are used to categorize the topology of a network and distinguish among random networks (Erdös & Rényi, 1960), small-world networks (Watts & Strogatz, 1998), and scale-free networks (Barabási & Albert, 1999). The three statistics are average shortest path length, (vertex) clustering coefficient, and degree distribution. As mentioned in Section 2.1, random networks are characterized by small shortest path length, low clustering coefficient, and a Poisson degree distribution with a single characterizing degree, <k>. A small-world network differs from random networks due to its high tendency to form clusters and groups. The small shortest-path length together with the high clustering coefficient of small-world networks reflects the six degrees of separation phenomenon (Milgram, 1967). The distinctive characteristic of the scale-free network is its power-law degree distribution, which is skewed toward small degrees and has a long flat tail for large degrees. Networks of different types and sizes have been found to be strikingly similar in their topologies and to have both small-world and scale-free properties (Albert & Barabási, 2002). These findings lead to a conjecture that networks in nature and society are governed by a universal self-organizing principle (Albert & Barabási, 2002).
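Two of these characterizing statistics, the vertex clustering coefficient C and the degree distribution P(k), can be computed as sketched below. The six-node graph is a hypothetical illustration, and C here is the average of the local clustering coefficients over all nodes.

```python
# Sketch: vertex clustering coefficient C and degree distribution P(k)
# for a hypothetical toy graph.
from collections import Counter

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def local_clustering(v):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return links / (k * (k - 1) / 2)

C = sum(local_clustering(v) for v in adj) / len(adj)   # average over nodes
degrees = [len(adj[v]) for v in adj]
P = {k: c / len(adj) for k, c in Counter(degrees).items()}  # P(k)
print(round(C, 3), P)
```

A power-law P(k) would show up as a straight line when plotted on log-log axes, which is the usual diagnostic for scale-free topology.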

Static structure mining provides a means of discovering structural patterns in networks. However, networks are not static but change constantly. How to reveal the dynamics of networks and the evolutionary mechanisms leading to a certain topology is the focus of dynamic structure mining. The advantage of dynamic structure mining is its ability to explain and predict the structure of networks (Albert & Barabási, 2002; Doreian & Stokman, 1997).

2.3.2 Dynamic Structure Mining

Networks are subject to all kinds of changes and dynamics in their nodes and links. New nodes may be added to the system and old nodes may be removed. New links may be formed between old nodes or between old and new nodes. Understanding the dynamics and the process of evolution in networks is of vital practical importance. The evolutionary mechanisms lead to a specific type of network topology, which has a direct impact on the function of a system. For example, it has been found that protein interaction networks in cells are scale-free networks. That is, a small percentage of hub proteins mediate the interactions with the rest of the proteins. Such a topology is critical to the survival of a cell because it is rather robust against random attacks (Jeong et al., 2001). How a cell evolves into such a structure can be the key to developing effective means to protect healthy cells or attack harmful cells such as cancer cells. Existing dynamic mining approaches can be divided into descriptive and modeling approaches.

2.3.2.1 Describing Structural Dynamics

Descriptive approaches are aimed at capturing and observing the changes in a network over time using a set of topological statistics.

Changes in General Statistics

General statistics such as those listed in Table 2.1 are often measured at different points in time. The changes observed are then plotted with respect to time in order to examine the dynamic patterns. For example, Barabási et al. (2002) study the evolution of the scientific collaboration networks in mathematics and neuroscience in the period 1991-1998. Based on co-authorship information from papers published in journals, they analyze the patterns of changes in the number of papers, the number of authors (network size), the average degree, and the relative size of the giant component in the network. They find that the networks are growing in that n, s, and <k> all increase over time. Other studies that use general statistics of networks can be found in (Csányi & Szendroi, 2004; Hajra & Sen, 2005).


Changes in Characterizing Statistics

These statistics can be used to distinguish between different topologies. It is found in (Barabási et al., 2002) that both the clustering coefficient and the average path length of the scientific collaboration networks decrease over time, and the degree distribution follows a power law. The decreasing L deviates from existing models, which predict that L scales with n. This might be due to the addition of internal links, which act as shortcuts between distant parts of the network, and the limited time window of the data set (Barabási et al., 2002).

Modeling usually follows the descriptive analysis in an attempt to explain the observed patterns of dynamics using certain mechanisms.

2.3.2.2 Modeling Structural Dynamics

Modeling approaches are aimed at explaining the emergence of specific types of network topology (random, small-world, or scale-free) based on microscopic mechanisms.

Presently, the research focus is primarily on the evolution process of scale-free topology, for three reasons. First, the degree distribution of scale-free networks deviates significantly from the Poisson distribution (Albert & Barabási, 2002). Second, the scale-free topology has been shown to be robust to random failures but vulnerable to targeted attacks (Albert et al., 2000). Third, scale-free topology can facilitate efficient resource transmission (Toroczkai & Bassler, 2004). The evolution of scale-free topology thus is particularly interesting because the structures of many real networks, ranging from the Internet to gene-protein interaction networks, are scale-free (Faloutsos et al., 1999; Garlaschelli et al., 2003; Jeong et al., 2000; Newman, 2004a). The core research question is: what are the mechanisms responsible for the power-law degree distribution (Albert & Barabási, 2002)?

Several mechanisms, such as growth (Barabási & Albert, 1999), preferential attachment (Barabási & Albert, 1999), competition (Bianconi & Barabási, 2001), and individual preference (Menczer, 2004; Pennock et al., 2002), have been proposed to explain the emergence of scale-free topology in real networks. To examine the role of these mechanisms in the evolution of scale-free networks, researchers have employed simulation and analytical approaches.

Simulation Approaches

With simulation approaches, a network evolves as new nodes and links are added to the network over time. The mechanisms are incorporated into the evolution process by controlling which two nodes are selected for a newly added link. In the basic evolution model proposed by Barabási and Albert (1999), for example, the evolution starts with a small number, say m0, of nodes. At each time step, a new node is added to the system. The new node is allowed to link to m (m ≤ m0) different nodes that are already in the network. When choosing the target nodes to link to, the new node makes a decision based on how many links the target nodes have. Therefore, the more links a node has, the more likely it will be linked to by the new node. This preferential attachment mechanism thus leads to the rich-get-richer phenomenon, manifesting the scale-free topology. In the fitness model, which considers the competition effect (Bianconi & Barabási, 2001), the target nodes are selected not only based on the number of their links but also on their intrinsic abilities to attract links. A Web page with high-quality content thus may quickly attract much attention although it does not have many in-links initially. The resulting network has a different topology from the scale-free one and contains a few stars that connect to almost every node in the network, a phenomenon described as winners-take-all (Pennock et al., 2002).
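The growth-plus-preferential-attachment process described above can be simulated in a few lines. The sketch below uses a common implementation trick, sampling from a list that holds one entry per link endpoint so that selection probability is proportional to degree; the parameter values and the random seed are arbitrary assumptions, not the settings used in the studies cited.

```python
# Sketch of the Barabási-Albert growth process: start with m0 seed nodes,
# then attach each new node to m existing nodes chosen with probability
# proportional to their degree (the rich-get-richer mechanism).
import random

def preferential_attachment(m0=3, m=2, steps=200, seed=42):
    rng = random.Random(seed)
    edges = []
    # Degree-weighted sampling via a list with one entry per link endpoint;
    # seed nodes are listed twice so they start with nonzero weight.
    endpoints = list(range(m0)) * 2
    for new in range(m0, m0 + steps):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))   # rich get richer
        for t in targets:
            edges.append((new, t))
            endpoints += [new, t]
    return edges

edges = preferential_attachment()
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
# A skewed distribution: the best-connected hub far exceeds the average.
print(len(edges), max(degree.values()), 2 * len(edges) / len(degree))
```

Repeating the run with larger step counts makes the heavy tail of the degree distribution increasingly visible.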

The simulation approach helps observe and demonstrate the evolution of a network. However, the simulation approach lacks generalizability.

Analytical Approaches

Analytical approaches seek the general solution to a problem and often require the formal definition of the problem and various assumptions. Using mean-field theory, Barabási et al. (1999) derive the functional form of the power-law distribution of scale-free networks and claim that, regardless of the network size, the exponent of the power law is -3. In the fitness model, instead, the exponent is a function of the fitness of a node (Bianconi & Barabási, 2001). Nodes with higher fitness scores will acquire links at higher speeds than nodes with lower fitness scores. The resulting degree distribution is a weighted sum of a spectrum of power-law distributions.

The research on network dynamics is a recent development and fairly new compared with research on static structures. More innovative approaches and models are expected to be added to this line of research in the near future.


The computational framework presented in this chapter provides a guideline for network structure mining. In Chapters 3-7, I will present a series of case studies that demonstrate how static and dynamic structural patterns can be mined from various networks ranging from criminal networks to patent citation networks using the technologies reviewed in this chapter.


CHAPTER 3: LOCATING KEY RELATIONSHIPS IN CRIMINAL NETWORKS

3.1 Introduction

As discussed in Chapters 1 and 2, networks can be viewed as collections of resources. Important relations and relational paths are critical resources that may reveal important structural information about the network. In this chapter I propose a graph theoretical approach to locating important relations between criminals in criminal networks (Xu & Chen, 2004). The objective is to support knowledge management and decision making in the law enforcement domain to help fight organized crime.

Organized crimes such as terrorism, narcotics violations, armed robbery, and kidnapping often involve multiple offenders who are connected through various relationships (e.g., kinship, friendship, co-workers, or business associates) (Harper & Harris, 1975). These criminals can be treated as a network in which they interact and play different roles in illegal activities (McAndrew, 1999). For instance, a narcotics network may consist of interrelated criminals who are responsible for handling the supply, distribution, sale, and smuggling of drugs, or even money laundering. Members of a terrorist network may have shared religious beliefs or attended terrorist training together previously, so that they trust each other and cooperatively plan and commit terrorist attacks (Krebs, 2001). In a broader sense, a criminal network may be composed of a variety of entities (e.g., organizations, locations, vehicles, weapons, properties, bank accounts, etc.) in addition to persons. Learning relations between these entities is a critical part of uncovering criminal activities and fighting crime. To achieve this goal, crime investigators often employ a method called link analysis (Coady, 1985; Harper & Harris, 1975; Sparrow, 1991), which can help generate investigative leads and uncover missing information that may be buried in a criminal network. In a narcotics network, for example, link analysis may reveal that a group of offenders actually belong to the same drug supply chain. In a homicide case, link analysis may find "hidden" intermediate persons connecting the victim with a suspect who denies knowing the victim. Note that the concept of link analysis here is not the same as "link analysis" in Web mining. It refers to the task of identifying criminal relations in the specific context of crime fighting.

Link analysis usually consists of two major tasks: (1) extracting information about entity relations from raw data (e.g., telephone records, surveillance logs, and crime reports) and constructing a network representation, and (2) identifying relations between seemingly unrelated entities in a network. Both tasks can be very time-consuming and labor-intensive. Current link analysis practice in law enforcement is mainly an ad-hoc manual process. To solve a crime, investigators may spend a large amount of time performing extensive database searches, reading crime reports, and looking for clues of criminal relations. Although some software packages have been labeled as "link analysis tools", they provide only visual representations of criminal networks and are "still not doing the analysis" (Sparrow, 1991). Because of these problems, link analysis is used only for high-profile cases. Effective and efficient link analysis techniques are needed to help fight crime (McAndrew, 1999).


To address this lack of techniques, I propose using a type of graph theoretical approach, namely two variations of the classical shortest-path algorithm (Dijkstra, 1959), for link analysis. The evaluation studies assess both the effectiveness and the efficiency of the proposed algorithms. The effectiveness issue concerns whether relation paths found by the proposed algorithms are more useful for uncovering investigative leads than those found by a modified Breadth-First-Search (BFS) algorithm. The modified BFS algorithm to a large extent simulated the manual approach of relation search by crime investigators and was used as a benchmark technique for effectiveness comparison. The efficiency issue concerns which shortest-path algorithm is faster in which type of network.

The rest of the chapter is organized as follows. Section 3.2 reviews the literature on link analysis and the shortest-path algorithms. Section 3.3 presents the modified BFS algorithm. The two proposed shortest-path algorithms are introduced in Section 3.4. Evaluation and results are presented and discussed in Section 3.5. In Section 3.6 I conclude the chapter and suggest directions for future work.

3.2 Literature Review

In this section I review network construction techniques proposed in previous research and existing link analysis tools. I then introduce the algorithms for computing shortest paths in a graph.


3.2.1 Link Analysis

3.2.1.1 Network Construction

To perform link analysis, an indispensable first task is to extract information about entities and their relations from large amounts of raw data and convert the information into a network representation. Usually, entities are represented by nodes and relations between them by links. Different network construction methods may be needed, depending on whether the raw data are structured database records or unstructured textual documents.

Several techniques have been developed for constructing network representations from structured data records. For example, Goldberg and Senator (1998) suggested that consolidation and link formation operations be performed on transactional data records during investigations of financial crimes. Consolidation is a process of "disambiguating and combining identification information into a unique key which refers to specific individuals" (Goldberg & Senator, 1998). Links or relations between consolidated individuals are formed based on a set of heuristics, such as whether the individuals have shared addresses, shared bank accounts, or related transactions. This technique has been employed by the U.S. Department of the Treasury to detect money laundering transactions and activities (Goldberg & Wong, 1998). A different network construction method, used by COPLINK Detect (Hauck et al., 2002), is based on the concept space approach developed by Chen and Lynch (1992). A concept space can be treated as a network in which nodes represent domain-specific concepts and links represent weighted co-occurrence relations between concepts (Hauck et al., 2002). In COPLINK Detect, nodes are records of entities (persons, organizations, vehicles, and locations) stored in crime databases. In such a network, a relation exists between a pair of entities if they appear together in the same criminal incident. The more frequently they occur together, the stronger the relation. The concept space approach is primarily a statistics-based approach and differs from the heuristic-based one in (Goldberg & Senator, 1998).
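This co-occurrence weighting can be sketched minimally as follows; the incident records and entity names are fabricated for illustration and stand in for the structured database records described above.

```python
# Sketch of concept-space-style network construction: entities that appear
# in the same incident are linked, weighted by co-occurrence frequency.
# The incident records below are fabricated for illustration.
from collections import Counter
from itertools import combinations

incidents = [
    {"J. Doe", "R. Roe", "blue sedan"},
    {"J. Doe", "R. Roe"},
    {"R. Roe", "Elm St warehouse"},
    {"J. Doe", "blue sedan"},
]

weights = Counter()
for entities in incidents:
    for pair in combinations(sorted(entities), 2):
        weights[pair] += 1   # one co-occurrence per shared incident

strongest = weights.most_common(1)[0]
print(strongest)  # the most frequently co-occurring entity pair
```

The resulting weighted edge list is exactly the kind of network the shortest-path algorithms of Section 3.4 operate on, with higher co-occurrence counts indicating stronger relations.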

Some other techniques can build networks based on information extracted from unstructured data or textual documents. Lee (1998) developed a technique to construct criminal networks from free texts. This approach can extract entities and events from textual crime reports by applying a large collection of predefined patterns. Relations among extracted entities and events are formed using relation-specifying words and phrases. For example, the phrase "member of" indicates an entity-to-entity relation between an individual and an organization; the word "arrest" may suggest an entity-to-event relation between an individual and an arrest event. This approach relies heavily on a fixed set of predefined patterns and rules and thus has a limited scope of application. The concept space approach (Chen & Lynch, 1992; Hauck et al., 2002), as mentioned earlier, can also be used to construct networks from textual documents. Instead of using structured data from databases, it uses noun phrases extracted from crime reports as entities to build a criminal network. A relation, or co-occurrence relationship, exists between a pair of entities as long as they appear together in the same report. However, the noun phrases extracted may not necessarily be the entities that interest crime investigators. The success of this type of network construction approach, to a large extent, depends on the development of named-entity extraction techniques (Chinchor, 1998), i.e., the automatic identification from text documents of the names of entities of interest, such as dates, times, number expressions, persons, locations, and organizations (Chau et al., 2002; Chinchor, 1998).

3.2.1.2 Link Analysis Tools

In addition to network construction, another important link analysis task is searching for possible relations between entities. However, most existing link analysis tools can only visualize criminal networks and do not offer much help with relation search. This section provides a review of existing link analysis tools.

The earliest link analysis tool is the Anacapa charting system (Harper & Harris, 1975), which has been used extensively in law enforcement since its introduction. Based on human-extracted relation information, the system can generate a two-dimensional visual representation of a network with different symbols representing different types of entities. However, this tool does not facilitate relation search; an investigator must manually examine the network display to find relation paths between entities or confirm initial suspicions about specific suspects (Sparrow, 1991). Other link analysis tools such as Netmap (Goldberg & Wong, 1998) and Analyst's Notebook (Klerks, 2001) are also designed for network visualization rather than for relation search.


A link analysis tool called Watson (Anderson et al., 1994) can search for and identify direct relations between entities by querying databases. Given a specific entity such as a person's name, Watson automatically forms a query to search for other records that are related to the person. For example, an analyst may want to find out who is related to a kidnapped child. The related records found by Watson, which may include the child's relatives, friends, or other acquaintances, will be linked to this child and presented in a link chart. COPLINK Detect (Hauck et al., 2002) can also be treated as a link analysis tool that provides direct relation search functionality.

In the next section I review shortest-path algorithms, which I propose to use to address the problem of identifying the strongest relations between entities that are not directly related. Although these algorithms have been studied and employed widely in other domains, their importance and relevance to link analysis have not yet been recognized in law enforcement.

3.2.2 Shortest-Path Algorithms

Shortest-path algorithms are a type of graph search algorithm. They can identify the optimal paths between nodes in a graph (i.e., a network) by examining link weights. Conventional shortest-path algorithms have been used in many applications such as robot motion planning (Asano et al., 2002), computer network routing (Perkins & Bhagwat, 1994), transportation and traffic control (Wang & Crowcroft, 1992), and critical path computation in PERT charts. Recently, a neural network approach in artificial intelligence has been proposed for shortest-path computation (Ali & Kamoun, 1993; Araujo et al., 2001). In this section I review the conventional approaches and briefly introduce the neural network approach.

The Dijkstra algorithm (Dijkstra, 1959) is the classical method for computing the shortest paths from a single source node to every other node in a weighted graph. Most other algorithms for solving this problem are based on this algorithm but have improved data structures for implementation (Evans & Minieka, 1992). For example, the Priority-First-Search (PFS) algorithm (Cormen et al., 1991) is faster than the Dijkstra algorithm because of its use of a priority queue.
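A priority-queue formulation along these lines can be sketched in Python with a binary heap. The weighted graph below is a made-up example in which lower weights stand for stronger relations, so that the "shortest" path is the strongest relation path.

```python
# Sketch of a priority-queue (PFS-style) variant of Dijkstra's algorithm.
# The weighted undirected toy graph is a hypothetical example.
import heapq

graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("a", 1.0), ("c", 2.0), ("d", 5.0)],
    "c": [("a", 4.0), ("b", 2.0), ("d", 1.0)],
    "d": [("b", 5.0), ("c", 1.0)],
}

def dijkstra(graph, source, target):
    """Return (cost, path) of the lightest path from source to target."""
    pq = [(0.0, source, [source])]   # priority queue keyed on path cost
    done = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == target:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nbr, w in graph[node]:
            if nbr not in done:
                heapq.heappush(pq, (cost + w, nbr, path + [nbr]))
    return float("inf"), []          # target unreachable

cost, path = dijkstra(graph, "a", "d")
print(cost, path)  # 4.0 ['a', 'b', 'c', 'd']
```

Note that the direct route a-c-d also costs 5.0 via its first hop, so the algorithm correctly prefers the three-link path with the smaller total weight.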

Unlike the classical Dijkstra algorithm, the two-tree Dijkstra algorithm computes the shortest path from a single source node to a single destination node, rather than to every other node in a graph. Previous studies have demonstrated that the two-tree Dijkstra algorithm can be much faster than the Dijkstra algorithm. According to Helgason et al. (1993), in most cases the Dijkstra algorithm generated a shortest-path tree containing approximately 50% of the nodes in a graph before the shortest path between a source node and a destination node was found. Shortest-path trees generated by the two-tree Dijkstra algorithm, in contrast, contained only 6% of the nodes in the graph. This might save a substantial amount of computational time.

Some researchers have proposed neural network approaches to solving the shortest-path problem. Araujo et al. (2001) extended Ali and Kamoun's study (1993) and applied a two-layer Hopfield net to the shortest-path problem. In their Hopfield net, each neuron corresponds to a link in a graph. The value of a neuron is 1 if the link it represents participates in the shortest path and 0 otherwise. It has been found that the two-layer Hopfield net can be faster than conventional shortest-path algorithms because of its parallel architecture. However, these proposed Hopfield net approaches work only for networks of small size (e.g., 40 nodes in (Araujo et al., 2001)).

In summary, previous studies have proposed some techniques for network construction in link analysis. However, little research has been done to address the relation search problem. Specifically, an effective and efficient link analysis technique is needed to find relation paths between two or more source entities not directly related. Moreover, the paths found should reveal strong relations between entities so that important investigative leads can be uncovered. I propose to use the shortest-path algorithms to achieve this goal.

To compare the proposed algorithms with current link analysis practices, in my pilot study I recorded and analyzed the relation search processes of crime investigators experienced in link analysis. I found that the typical relation search approach can be described as a breadth-first search (Cormen et al., 1991). However, such an approach cannot guarantee finding the strongest relations between entities and thus may not successfully generate investigative leads. In the next section I present the modified BFS algorithm, which simulates the typical relation search.


3.3 The Modified BFS Algorithm

Since existing link analysis tools are limited to direct relation search, crime investigators must explore links manually when they have entities that are not directly related. I found that a typical search starts with a single source entity and incrementally builds up a relation path during link exploration. For example, a crime investigator may need to find relations between two seemingly unrelated drug offenders. In this case, the crime investigator may start with one offender’s name and use a link analysis tool to find all entities that are associated with the offender in previous crimes. By reading each crime report, the investigator can determine whether a link is useful for generating a new lead to connect the two offenders. He then selects those useful links and does further searches, in which entities associated with the newly selected entities from the previous round are examined. He keeps exploring new entities until a relation path is found that connects the two offenders.

Such a search process is very similar to a graph traversal algorithm called Breadth-First Search (BFS) (Cormen et al., 1991), except that an investigator may consider link usefulness during exploration. Given a weighted directed graph G = (N, A), a nonnegative number, l_ij, is used to represent the weight of the link (i, j) ∈ A. Each node u ∈ N has an incoming link set, In(u), and an outgoing link set, Out(u). Since the criminal networks are undirected graphs, In(u) = Out(u).


Starting at a source node s, BFS can find paths leading to a target node t. It works by maintaining a traversal tree T rooted at the node s. In this tree, the child nodes of a specific node u are u's outgoing neighbors in the graph G. Initially T contains only s. The algorithm then collects all the outgoing neighbors of s in G and sets them as the child nodes of s. For each child node of s, the algorithm further finds its children and adds them to the tree. This procedure is repeated until the target node t is reached. The time complexity of a BFS algorithm is O(n + m) (Cormen et al., 1991).

As indicated earlier, a crime investigator may not explore all entities associated with a specific entity but selects only those having strong relations. I therefore modified the BFS algorithm so that when it finds the children of a node, it selects only those neighbors that have a link weight greater than a predefined threshold value. The modified BFS algorithm is presented in Figure 3.1.

Modified BFS algorithm
// This modified BFS algorithm computes the paths from the first node in K to every other node in K.
// K may contain multiple source nodes
Begin
Initialize:
    s = the 1st element of K;
    p_s = s; // p_i is the parent node of i
    p_i = 0 for all i ∈ N, i ≠ s;
    i = 0; L_0 = {s}; T = {s}; // L_i stores the current nodes
while (L_i ≠ Ø)
    L_{i+1} = Ø; // L_{i+1} stores the child nodes of the current nodes in L_i
    for each u ∈ L_i do
        // Explore a link only if its length is less than the threshold value 1, which
        // corresponds to a link weight of 0.5 in the original, untransformed graph
        for each (u, v) ∈ Out(u) such that l_uv < 1 do
            if v ∉ T then
                // Include v in the tree and set u as the parent of v
                T = T ∪ {v};
                p_v = u;
                L_{i+1} = L_{i+1} ∪ {v};
                if v ∈ K, then K = K - {v};
                if (K = Ø) break; // Stop when all source nodes in K are included in the tree
        end
    end
    i = i + 1;
endwhile;
end.

Figure 3.1: The modified BFS algorithm.
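The threshold-limited breadth-first search of Figure 3.1 can be sketched in Python as follows. This is a minimal sketch only: the adjacency-dictionary representation, the function names, and the use of a queue in place of the explicit level sets L_i are my own choices.

```python
from collections import deque

def modified_bfs(adj, source, targets, threshold=1.0):
    """Breadth-first search that only explores links whose (transformed)
    length is below a threshold, stopping once all target nodes are reached.

    adj: dict mapping node -> list of (neighbor, length) pairs.
    Returns a dict of parent pointers from which paths can be read off.
    """
    targets = set(targets) - {source}
    parent = {source: source}          # the traversal tree T as parent pointers
    frontier = deque([source])
    while frontier and targets:
        u = frontier.popleft()
        for v, length in adj.get(u, []):
            if length < threshold and v not in parent:
                parent[v] = u          # include v in the tree; u is its parent
                frontier.append(v)
                targets.discard(v)     # stop when all source nodes are reached
    return parent

def path_to(parent, node):
    """Reconstruct the path from the BFS root to `node`."""
    path = [node]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path[::-1]
```

As the text notes, the path found this way is simply the first one encountered, not necessarily the one with the strongest relations.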

Notice that multiple paths may exist between the source entities s and t. BFS simply finds one such path and does not guarantee identifying the strongest relations between source entities. This suggests that the shortest-path algorithms may be a better option.

To find the strongest relations between two or more source entities I propose to employ conventional shortest-path algorithms. However, to apply the algorithms, a network representation transformation must be made.

3.4.1 Network Representation Transformation

In the criminal networks, the strength of a relation between two directly connected nodes is represented by their link weight, which is a number between zero and one. A link weight can be treated as a probability measure indicating how likely it is that two nodes are related. In general, the probability of a set of mutually independent events occurring together is the product of the probabilities of the individual events. Therefore, if two nodes are not connected directly but by a path consisting of a sequence of intermediate links, the strength of the relation between these two nodes should be the product of the weights of these intermediate links. For example, if node A and node C are connected through node B, and the weights of the intermediate links (A-B) and (B-C) are 0.5 and 0.8, respectively, then the weight of the path (A-B-C) would be 0.4. To find the strongest relation between a pair of nodes, therefore, is to find the path with the largest weight product. Figure 3.2 presents an illustrative example.

In this figure, the number beside each link is that link's weight or relation strength. Two paths, (A-B-C-D) and (A-E-D), exist between the source node A and the destination node D. The relation strength of path (A-B-C-D) is 0.28 (0.5 × 0.8 × 0.7), and the relation strength of path (A-E-D) is 0.24 (0.8 × 0.3). Therefore, path (A-B-C-D) represents a stronger relation between node A and node D than path (A-E-D).

[Figure: a five-node network in which A connects to B (0.5) and E (0.8), B connects to C (0.8), C connects to D (0.7), and E connects to D (0.3).]

Figure 3.2: Two indirectly connected nodes (A and D).

Although the shortest-path algorithms can identify the optimal path between a pair of nodes, they cannot be used directly to identify the strongest relation between the two nodes. This is because of the following two representation problems:

(a) In a general weighted graph, the weight of a link represents the distance or cost of traveling from one end of the link to the other. Therefore, a low weight is preferred to a high weight. However, a link weight in a criminal network is an indicator of how strongly the two nodes are related to each other. Thus, a high weight is preferred to a low weight.

(b) The shortest path is often computed based on the minimum total weight, which is the sum of the weights of the links along this path. However, my objective is to find a path with the maximum weight product.

In order to address the two representation problems, I transformed the link weight in a criminal network to a distance measure in a new graph representation. In this new graph, the nodes are the same as those in the original network, but the new link weights are computed based on the original weights using a simple logarithmic transformation:

    l = −ln w,   0 < w ≤ 1,   (3.1)

where l is the link weight in the new graph, and w is the corresponding link weight in the original network. Given this transformation, I postulate the following axioms:

(1) All link weights in the new graph are nonnegative numbers.

(2) A lower link weight in the new graph corresponds with a higher link weight in the original network.

(3) The shortest path (using summation of link weights) between a pair of nodes in the new graph generates the path with the maximum link weight product among all the alternative paths between these two nodes in the original network.


Proof: Proofs of these three axioms are fairly straightforward, following the transformation equation directly.

Axiom (1). Since 0 < w ≤ 1, we have ln w ≤ 0, which implies that −ln w ≥ 0.

Axiom (2). Let l_1 < l_2. Then −ln w_1 < −ln w_2, or ln w_1 > ln w_2. Since ln w is a monotonically increasing function, it follows that w_1 > w_2.

Axiom (3). Consider the shortest path, say P, between a pair of nodes A and B. P consists of a set of links with weights (l_1, l_2, ..., l_p), 1 ≤ p ≤ n, where n is the total number of nodes in this graph. The total length of this path is ∑_{i=1}^{p} l_i. Consider another path between node A and node B, say Q, consisting of another set of links with weights (l'_1, l'_2, ..., l'_q), 1 ≤ q ≤ n. The total length is ∑_{i=1}^{q} l'_i. Because P is the shortest path between node A and node B, we know that

    ∑_{i=1}^{p} l_i < ∑_{i=1}^{q} l'_i.

Since l_i = −ln w_i and l'_i = −ln w'_i by definition, we have

    ∑_{i=1}^{p} ln w_i > ∑_{i=1}^{q} ln w'_i.

It follows that exp(∑_{i=1}^{p} ln w_i) > exp(∑_{i=1}^{q} ln w'_i), which implies that

    ∏_{i=1}^{p} w_i > ∏_{i=1}^{q} w'_i.

Axiom (1) ensures that the new graph does not contain negative-weight links, which is a necessary condition for the shortest-path algorithms (Evans & Minieka, 1992). Axioms (2) and (3) respectively address the two representation problems. Therefore, with such a transformation, I am able to use conventional shortest-path algorithms to identify the strongest relations between a pair of nodes or entities in a criminal network.
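As a quick numerical check, the example network of Figure 3.2 can be run through the transformation. This is a small Python sketch; the hard-coded path enumeration is for illustration only.

```python
import math

# Each original link weight w in (0, 1] becomes a length l = -ln(w).
# Minimizing the sum of lengths is then equivalent to maximizing the
# product of the original weights.
weights = {('A', 'B'): 0.5, ('B', 'C'): 0.8, ('C', 'D'): 0.7,
           ('A', 'E'): 0.8, ('E', 'D'): 0.3}          # the Figure 3.2 network
lengths = {e: -math.log(w) for e, w in weights.items()}

paths = [['A', 'B', 'C', 'D'], ['A', 'E', 'D']]       # the two A-to-D paths

def product(path):
    """Relation strength of a path: product of the original link weights."""
    return math.prod(weights[(a, b)] for a, b in zip(path, path[1:]))

def total_length(path):
    """Path length in the transformed graph: sum of -ln(w) terms."""
    return sum(lengths[(a, b)] for a, b in zip(path, path[1:]))

best_by_product = max(paths, key=product)        # strongest relation
best_by_length = min(paths, key=total_length)    # shortest path in new graph
assert best_by_product == best_by_length == ['A', 'B', 'C', 'D']
```

Both criteria pick the same path (A-B-C-D, strength 0.28), illustrating Axiom (3).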

3.4.2 Shortest-Path Algorithms

I propose using the Priority-First-Search (PFS) algorithm (Cormen et al., 1991) and the two-tree Dijkstra algorithm (Helgason et al., 1993). Both algorithms can compute the shortest path between two source nodes. Considering the situation where an investigator needs to find relations between more than two entities, I repeatedly use the algorithms to identify the strongest relations among multiple source nodes.

I assume that a group of nodes is strongly associated if each pair of nodes in the group is strongly associated. That is, given k source nodes (u_1, u_2, …, u_k), I first find the shortest paths between u_1 and every other source node (u_2 through u_k). Then I find the shortest paths between u_2 and the remaining source nodes (u_3 through u_k). This process is repeated until the shortest paths between all possible pairs of the k source nodes are found. The total number of these shortest paths is k(k-1)/2. It is possible that some of these paths share common links. If this happens, I combine the common links to avoid redundancy.

3.4.2.1 The Modified PFS Algorithm

The PFS algorithm (Cormen et al., 1991) is a variation of the classical Dijkstra algorithm (Dijkstra, 1959). The algorithm works by maintaining a shortest-path tree T rooted at a source node s. T contains nodes whose shortest distances from s are already known. Each node u in T has a parent, which is represented by p_u. A set of labels, d_u, is used to record the distances from the node u to s. Initially, T contains only s. At each step, I select from the candidate set Q a node with the minimum distance to s and add this node to T. Once T includes all nodes in the graph, the shortest paths from the source node s to all the other nodes have been found. PFS differs from the Dijkstra algorithm in that it uses an efficient priority queue for the candidate set Q.

With modifications, PFS can be used to compute the shortest paths from a single source node to a set of specified nodes in the graph. That is, given a set of nodes K ⊆ N, |K| = k ≥ 2, and a source node s ∈ K, the modified PFS algorithm computes the shortest paths from s to all u ∈ K, u ≠ s. I therefore modify the algorithm so that it stops as soon as all u ∈ K are included in the shortest-path tree T. Note that when K contains only two nodes, the problem is reduced to a one-to-one shortest-path problem (Helgason et al., 1993). The modified PFS algorithm is presented in Figure 3.3.

Modified PFS algorithm
// This modified PFS algorithm computes the shortest paths from the first node in K to every
// other node in K
Begin
Initialize:
    s = the 1st element of K;
    d_s = 0, p_s = s;
    d_i = ∞, p_i = 0 for all i ∈ N, i ≠ s;
    T = Ø; Q = {s}.
while (K ≠ Ø)
    // Search Q for the node with minimum distance to s
    u = {i : d_i ≤ d_j, ∀i, j ∈ Q, i ≠ j};
    Q = Q - {u};
    // The shortest path between u and s has been found and u is added to T
    T = T ∪ {u};
    for each (u, v) ∈ Out(u) such that d_u + l_uv < d_v do
        // Update the distance label of v
        d_v = d_u + l_uv;
        p_v = u;
        if v ∉ Q then Q = Q ∪ {v};
    end
    if u ∈ K, then K = K - {u};
endwhile;
end.

Figure 3.3: The modified PFS algorithm.
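A heap-based Python sketch of this procedure follows. The representation is my own: a lazy-deletion binary heap stands in for the priority queue Q, and the set of settled nodes plays the role of the tree T.

```python
import heapq
import math

def modified_pfs(adj, K):
    """One-to-many Dijkstra with a binary-heap priority queue, stopping as
    soon as every node in K has entered the shortest-path tree.

    adj: dict node -> list of (neighbor, length) pairs.
    K: list of source nodes; paths are computed from K[0] to the rest.
    Returns (dist, parent) dictionaries.
    """
    s, remaining = K[0], set(K[1:])
    dist, parent = {s: 0.0}, {s: s}
    done = set()                           # the shortest-path tree T
    heap = [(0.0, s)]                      # the candidate set Q
    while heap and remaining:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                       # stale heap entry; skip it
        done.add(u)                        # u's shortest distance is now final
        remaining.discard(u)               # stop once all of K is in the tree
        for v, luv in adj.get(u, []):
            nd = d + luv
            if nd < dist.get(v, math.inf):
                dist[v] = nd               # relax: update the distance label
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent
```

On the transformed Figure 3.2 graph, this returns the path A-B-C-D with total length −ln(0.28), as expected.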

When computing the shortest paths from K's second node to every other node in K, I repeat this procedure. Note that I do not need to compute the shortest path from the second node to the first node again, since it has already been computed. This procedure is repeated k-1 times until the shortest paths between all possible pairs of the nodes in K have been found.

I implement the priority queue using a heap tree for the candidate set Q. At each iteration of the while loop, it takes O(log n) time to search for the minimum element u in Q, and O(|Out(u)| log n) time to examine and update the distances of the incident links of u. Thus the execution time for the while loop is ∑_{u∈N} (1 + |Out(u)|) log n, or O((n + m) log n), because ∑_{u∈N} |Out(u)| = m. As a result, the overall time complexity for computing all shortest paths for k nodes is O(k(n + m) log n). PFS is faster than the Dijkstra algorithm, whose time complexity is O(k(n² + m)) (Evans & Minieka, 1992).

3.4.2.2 The Two-Tree Dijkstra/PFS Algorithm

No modification is made to the two-tree Dijkstra algorithm because it can find the shortest path only between two nodes. The two-tree Dijkstra algorithm works by searching from both ends of the shortest path simultaneously (Helgason et al., 1993). A shortest-path tree rooted at the source node s and a shortest-path tree rooted at another source node t grow in alternate steps. The two trees are analogous except that the tree rooted at s expands a node by examining its outgoing links, and the tree rooted at t expands a node by examining its incoming links. A shortest path is found when both trees have a common node, say r, such that d_r^s + d_r^t is a minimum, where d_r^s is the distance between r and s, and d_r^t is the distance between r and t. I define β as the minimum distance and J as the set of nodes that can be used to identify the shortest path. The following two-tree Dijkstra algorithm is provided in (Helgason et al., 1993).

Assuming a priority queue is used for the candidate set Q, I call this algorithm two-tree PFS (Figure 3.4).

Two-Tree PFS algorithm
// Two-tree PFS computes the shortest path between node s and node t
Begin
Initialize:
    d_s^s = 0, p_s^s = s, T^s = {s}; Q^s = {s}; p_i^s = 0, d_i^s = ∞ for all i ∈ N, i ≠ s;
    d_t^t = 0, p_t^t = t, T^t = {t}; Q^t = {t}; p_i^t = 0, d_i^t = ∞ for all i ∈ N, i ≠ t.
while (T^s ∩ T^t = Ø) do
    // Search Q^s for the node with minimum distance to s
    u = {i : d_i^s ≤ d_j^s, ∀i, j ∈ Q^s, i ≠ j};
    Q^s = Q^s - {u};
    // The shortest path between u and s has been found and u is added to T^s
    T^s = T^s ∪ {u};
    // Examine outgoing links of u
    for each (u, v) ∈ Out(u) such that d_u^s + l_uv < d_v^s do
        d_v^s = d_u^s + l_uv;
        p_v^s = u;
        if v ∉ Q^s then Q^s = Q^s ∪ {v};
    end
    // Search Q^t for the node with minimum distance to t
    v = {i : d_i^t ≤ d_j^t, ∀i, j ∈ Q^t, i ≠ j};
    Q^t = Q^t - {v};
    // The shortest path between v and t has been found and v is added to T^t
    T^t = T^t ∪ {v};
    // Examine incoming links of v
    for each (u, v) ∈ In(v) such that d_v^t + l_uv < d_u^t do
        d_u^t = d_v^t + l_uv;
        p_u^t = v;
        if u ∉ Q^t then Q^t = Q^t ∪ {u};
    end
endwhile;
// Stopping criterion
β = min{d_i^s + d_i^t : i ∈ T^s ∩ T^t};
J = {i ∈ T^s ∩ T^t : d_i^s + d_i^t = β};
end.

Figure 3.4: The two-tree PFS algorithm.
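A Python sketch of the bidirectional idea is below. This is a simplified rendering, not Helgason et al.'s exact procedure: the two trees grow in alternate steps, the search stops once a node has been settled by both sides, and the best meeting distance (β in Figure 3.4) is tracked along the way. All data-structure choices are my own.

```python
import heapq
import math

def two_tree_pfs(adj, s, t):
    """Bidirectional ("two-tree") Dijkstra on an undirected graph.

    adj: dict node -> list of (neighbor, length) pairs.
    Returns the length of the shortest s-t path (math.inf if none exists).
    """
    dist = [{s: 0.0}, {t: 0.0}]        # distance labels d^s and d^t
    done = [set(), set()]              # the trees T^s and T^t
    heaps = [[(0.0, s)], [(0.0, t)]]   # candidate sets Q^s and Q^t
    best = math.inf                    # beta: best meeting distance found
    while heaps[0] and heaps[1] and not (done[0] & done[1]):
        for side in (0, 1):            # grow the two trees alternately
            if not heaps[side]:
                continue
            d, u = heapq.heappop(heaps[side])
            if u in done[side]:
                continue               # stale heap entry; skip it
            done[side].add(u)
            other = 1 - side
            if u in dist[other]:       # the two trees meet at u
                best = min(best, d + dist[other][u])
            for v, luv in adj.get(u, []):
                nd = d + luv
                if nd < dist[side].get(v, math.inf):
                    dist[side][v] = nd
                    heapq.heappush(heaps[side], (nd, v))
                if v in dist[other]:   # a path crossing both frontiers
                    best = min(best, nd + dist[other][v])
    return best
```

On the transformed Figure 3.2 graph, this returns −ln(0.28), the same length found by the one-tree search.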

Because the two-tree PFS algorithm computes the shortest path only between two nodes, it must be run k(k-1)/2 times to identify the shortest paths for all possible node pairs in K. As a result, the overall time complexity is O(k²(n + m) log n).

I did not use Floyd's (Floyd, 1962) or Dantzig's (Dantzig, 1960) all-pairs shortest-path algorithms, which compute the shortest path for every pair of nodes in a graph. These algorithms require a substantial execution time of O(n³) (Evans & Minieka, 1992). In contrast, the execution time of the two proposed algorithms will not exceed O(k²n²), which is less than O(n³) as long as k² < n. In most situations, where k is rather small compared with n, the two proposed algorithms will work faster than all-pairs shortest-path algorithms.

I conducted a user evaluation and a simulation experiment in order to assess the performance of the proposed shortest-path algorithms. The user evaluation was aimed at addressing the effectiveness issue, namely, whether relation paths identified by the shortest-path algorithms are more likely to generate investigative leads than those identified by the modified BFS algorithm, which is representative of the typical relation search approach. The purpose of the simulation experiment, on the other hand, was to determine which shortest-path algorithm was more efficient for which types of networks. Crime investigators often encounter the efficiency issue when they work on a large network (Goldberg & Wong, 1998). In this section I first briefly describe the network construction process and then present the evaluation results.

3.5.1 Network Construction

3.5.1.1 COPLINK Concept Space and AZNP

The criminal networks used in my experiment were constructed based on the same concept space approach (Chen & Lynch, 1992) used in COPLINK Detect (Hauck et al., 2002). In such networks, the strength of a relation is indicated by a co-occurrence weight. As reviewed previously, the nodes in COPLINK Detect are structured database records of entities. COPLINK Detect allows for link analysis with depth 1; that is, only nodes directly associated with source nodes can be found.

In contrast, the criminal networks in this study were constructed from unstructured textual documents, because law enforcement agencies often rely on crime report narratives to obtain detailed criminal relation information that may not otherwise be available in structured data. I used an automated noun-phrasing tool called AZNP to extract noun phrases from texts based on part-of-speech tagging and noun phrasing rules (Tolle & Chen, 2000). The extracted noun phrases included various entity types such as persons, locations, vehicles, and properties. Co-occurrence weights between these entities were calculated to generate relation strength measures.
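The co-occurrence idea can be illustrated with a toy sketch. This is not the COPLINK concept space algorithm itself, which uses tf-idf-style weighting and AZNP noun-phrase extraction; here entities are assumed to be pre-extracted per document, and the weights are simply counts normalized into (0, 1].

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_network(docs):
    """Toy co-occurrence network: two entities are linked if they appear
    in the same report, weighted by normalized co-occurrence counts.

    docs: iterable of entity lists, one list per report.
    Returns a dict mapping (entity_a, entity_b) -> weight in (0, 1].
    """
    counts = defaultdict(int)
    for entities in docs:
        # Count each unordered entity pair once per document.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    top = max(counts.values())
    return {pair: c / top for pair, c in counts.items()}
```

For example, entities that co-occur in every report receive weight 1.0, while a pair seen only once gets a proportionally smaller weight.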


3.5.1.2 Data Set

The Phoenix Police Department provided me with one year's worth of crime reports; the dataset totals 1 GB. These reports described various types of crimes, ranging from shoplifting to auto theft and from credit card fraud to narcotics possession and sales. I selected two samples as my test bed, namely, kidnapping and narcotics, both of which are organized crimes. The kidnapping report collection is 4.5 MB, and the narcotics report collection is 38 MB.

The crime reports varied substantially in length. For example, in the kidnapping sample, some documents simply contained a few lines about a phoned-in kidnapping report, while others had hundreds of lines detailing a kidnapping investigation. Since the length of a document can affect the co-occurrence weights of the concepts it contains (Chen & Lynch, 1992), I removed from my data sets those reports containing fewer than five lines of text. The noun phrases were extracted from the resulting document collections, and irrelevant terms were filtered out based on a 3400-item stop word list. The noun phrases left after filtering were used as network nodes and their co-occurrence weights were calculated. Two networks were constructed: one for the kidnapping sample and the other for the narcotics sample. Table 3.1 presents the statistics for the two samples.

                                     Kidnapping    Narcotics
Number of reports                    271           3,572
Number of noun phrases extracted     95,328        861,516
Network size (n)                     280           4,257
Number of links (m)                  25,862        733,572
Average number of links per node     92.4          172.3

Table 3.1: Sample statistics of the two networks.


3.5.2 Results and Discussions

3.5.2.1 User Evaluation: Effectiveness Issue

In the user evaluation, I compared the effectiveness of the relation paths identified by the shortest-path algorithms and those identified by the modified BFS algorithm. The purpose of the evaluation was to ascertain whether the shortest-path algorithms would be more useful for uncovering crime investigative leads.

The paths identified by an algorithm may consist of links that are not useful for crime investigations. With the concept space approach, a link between two entities is created if they co-occur in crime reports. However, a co-occurring relation does not necessarily represent an important relationship between entities. For example, the shortest-path algorithms identified three relation paths for a kidnapping case with three source nodes, Juan (person), Jose (person), and West Van Buren (location):

(1) Juan – Jose

(2) Juan – Maria – West Van Buren

(3) Jose – Maria – West Van Buren

Path (1) is useful because both Juan and Jose are listed in a report as victims in a kidnapping crime. Path (2) is considered nonuseful. Two reports describe the relation between Juan and Maria: one records that Juan Balderaz's ex-wife was Maria Palma; the other indicates that Juan Rodriguez kidnapped Maria Molina's daughter. The relation between Maria and West Van Buren is recorded in another report, which indicates that Maria Dillon lived at 3100 West Van Buren. Notice that the three Marias are different persons. Thus, the relation path with Maria as the intermediate node cannot provide information about how Juan and West Van Buren are related. Path (3) is a useful path because one report indicates that Jose Carrasco's friend was Maria Dillon, who lived at 3100 West Van Buren. (All entity names are scrubbed to ensure data confidentiality.)

To measure the effectiveness of my algorithms, I used a precision rate defined as follows:

    Precision = (Number of useful paths selected by experts / Total number of paths identified by the algorithm) × 100%   (3.2)

Because the modified BFS algorithm did not guarantee to identify the strongest relation paths between entities, I predicted that the shortest-path algorithms could achieve a higher precision than the modified BFS algorithm.

I randomly selected 30 pairs of source nodes from each of the kidnapping and narcotics networks. Relation paths were computed using both a shortest-path algorithm and the modified BFS algorithm. As shown in Table 3.2, the paths found by the modified BFS algorithm generally contain more intermediate links than those found by a shortest-path algorithm. Which shortest-path algorithm was used is not important here, because they always generate the same paths.

A domain expert from the Tucson Police Department evaluated the resulting relation paths. The expert had been serving in law enforcement for more than 30 years and had a substantial amount of experience in link analysis. For the results produced by each algorithm, he examined the 30 paths from each network by reading the original crime reports. He determined whether a relation path was useful for generating investigative leads based on his past experience investigating similar crimes. It took 2.5 to 3 hours to complete the evaluation task for each network. The results show that, on average, the shortest-path algorithms identified more useful relation paths than the modified BFS algorithm. Around 70% of the paths found by the shortest-path algorithms were considered useful for both networks. For the modified BFS algorithm, in contrast, only 30% of the paths from the kidnapping network and 16.7% of the paths from the narcotics network were considered useful. Table 3.2 shows the precision rate of each algorithm.

Algorithm                                   Shortest-path algorithms    Modified BFS
Average number of links in relation paths   (values not recovered)      (values not recovered)
Precision                                   ~70% (both networks)        30% (kidnapping), 16.7% (narcotics)

Table 3.2: Effectiveness evaluation results.

The shortest-path algorithms can achieve a higher precision because they always select relations with high co-occurrence weights during link exploration. As discussed previously, a co-occurrence weight is a measure of how frequently two entities are related. Therefore, the more frequently two entities are associated, the less likely they are to be related by chance, and the more likely such a relation will be useful for investigations. In contrast, the modified BFS algorithm produces arbitrary paths between entities. It is very likely that these paths contain unimportant relations, resulting in a low precision rate.

Although promising, the shortest-path algorithms still failed to identify useful paths about 30% of the time. Based on my analysis of the nonuseful paths found by the shortest-path algorithms, I categorized the reasons for the failures as follows (using the kidnapping network as an example):

Some nodes in the networks do not represent unique entities. This situation often occurs for the person type. Usually, after a person's full name is provided at the beginning of a crime report narrative, he or she is referred to only by first name in later parts of the report. During network construction, the same first names extracted by the noun phraser from different reports are indiscriminately treated as a single node. As a result, a node (e.g., Maria) may not refer to a unique person but to different people with the same first name (e.g., Maria Palma, Maria Molina, Maria Dillon, etc.). This problem also exists for other types of entities such as vehicles, locations, and properties. For example, "white car" may refer to different white cars owned by different persons; "North 7th Street" includes a number of addresses on that particular street. A nonuseful relation path may result if it contains such intermediate nodes. In my test bed, 54.2% of the nonuseful relation paths fell into this category.

Whether an entity is relevant or not depends on the specific context. This problem seldom affects entities such as persons and addresses, because their presence in a crime report usually implies that they are relevant to that particular crime. Indeed, any person mentioned in a report has a role descriptor: for example, "sp" means suspect, "v" means victim, and "w" means witness. However, property entities may include any physical object that a person possesses. It is much more difficult to determine whether or not a property is relevant to a particular crime without considering the specific context of the crime. When a property is the target of a crime, it usually is considered relevant. However, if a physical object is mentioned simply to describe the environment or a situation, it is often treated as irrelevant. For example, a "cell phone" is a relevant property if it is stolen in a crime; it is irrelevant if a witness used his or her cell phone to report a crime to the police. Unlike a human, who can determine an entity's relevance based on contextual clues, the noun phraser cannot examine texts semantically to distinguish between relevant and irrelevant entities. As a result, a relation path will be nonuseful if it happens to include an irrelevant entity. Over 37% of the nonuseful paths had this problem.

Two entities may have a "fake" relationship even though they are listed in the same report. A link is established when two entities appear together in the same document. However, this link may reflect only a trivial relation between the two entities. Usually, relations between a person and other entities (e.g., another person, vehicles, addresses, etc.) are less frequently subject to this problem, but relations between entities other than persons are often less informative. For example, a link exists between "white Toyota" and "North 7th Street" because they are listed in the same report narrative. In this report, I found that a male driving a white Toyota kidnapped the daughter of a person who lived on North 7th Street. Such a link does not imply a useful relationship between these two entities but a "fake" one. Around 5% of the nonuseful paths fell into this category.

The results of this analysis suggest that the effectiveness of my algorithms may be improved if more appropriate entities and relations are extracted and used.

3.5.2.2 Simulation Experiment: Efficiency Issue

The simulation experiment focused on the efficiency of the two shortest-path algorithms (modified PFS and two-tree PFS). I define the efficiency of an algorithm as its average execution time. The experiment was intended to ascertain which algorithm is more efficient for which types of networks in terms of network size and other structural characteristics.

To compare the efficiency of these two algorithms in the case of multiple source nodes, I varied the number of source nodes, k, from 2 to 5 in the simulations. I chose these numbers based on the observation from my pilot studies that investigators usually used fewer than five source entities during a relation search. Given a specific k, I randomly generated 100 cases for each network and ran both algorithms on each case. The execution times for the algorithms were recorded and are presented in Table 3.3.

(a)

Algorithm        k = 2          k = 3          k = 4          k = 5
Modified PFS     1.00 (0.54)    2.89 (0.97)    6.00 (1.26)    10.67 (2.09)
Two-tree PFS     0.35 (0.19)    0.95 (0.28)    1.94 (0.37)    3.45 (0.65)

(b)

Algorithm        k = 2             k = 3             k = 4               k = 5
Modified PFS     66.75 (27.06)     194.05 (53.97)    419.47 (61.91)      661.10 (132.22)
Two-tree PFS     239.00 (132.00)   709.50 (263.75)   1,350.56 (348.70)   2,322.28 (546.25)

Table 3.3: Mean execution time (in seconds) for the two shortest-path algorithms (numbers in parentheses are standard deviations). (a) Results for the kidnapping network. (b) Results for the narcotics network.

For all four values of k, the pairwise t-tests on the mean execution time suggest that two-tree PFS is significantly faster than PFS (p < 0.001) in the kidnapping network. However, PFS is significantly faster than the two-tree PFS algorithm (p < 0.01) in the narcotics network. Figure 3.5 presents the execution time plots for k = 5 for the kidnapping and narcotics networks, respectively.

[Scatter plots comparing PFS and two-tree PFS execution times across the 100 simulation cases.]

Figure 3.5: Execution time scatter plot (k = 5). (a) Results for the kidnapping network. (b) Results for the narcotics network.

The result from the kidnapping network is consistent with the findings in (Helgason et al., 1993). According to Helgason et al. (1993), a two-tree algorithm usually is faster than a one-tree algorithm. In their study, a shortest-path tree in a one-tree algorithm contains about 50% of the nodes in a network before the shortest path is found, whereas a two-tree algorithm can find the shortest path when its trees contain only 6% of the nodes. I found similar results in terms of the number of nodes contained in the shortest-path trees. For the kidnapping network, the one-tree PFS algorithm generated a tree containing 52% of the nodes, and the two-tree PFS algorithm generated two trees containing 14.7% of the nodes in total. For the narcotics network, the tree in the one-tree PFS algorithm contained 49.6% of the nodes, and the trees in the two-tree PFS algorithm contained only 3.9% of the nodes.


However, the one-tree algorithm outperformed its two-tree counterpart in the narcotics network. Based on my analysis of the structural characteristics of both networks, I found that two factors might have caused this discrepancy.

Network size. As the size of a network increases, the size of the candidate set Q, which contains temporarily labeled nodes, also increases. It takes time to search and update the labels in Q when incident links of a node are explored. Therefore, when a network is large and the computational cost of processing the candidate sets becomes high, the two-tree algorithm will be inefficient. For the narcotics network (n = 4,257), the two candidate sets in the two-tree PFS algorithm together contained 120% of the total nodes, whereas the candidate set in the one-tree PFS algorithm contained only 47.9% of the total nodes. Thus, the time spent processing the candidate sets in the two-tree PFS algorithm was much longer than in the one-tree PFS algorithm, causing the two-tree PFS algorithm to be slower.

Network density. The density of a network is defined as the ratio of the total number of links to the possible number of links (Wasserman & Faust, 1994). Thus, the density of an undirected network consisting of n nodes and m links is 2m/n(n-1). Network density may have an impact on the efficiency of a two-tree algorithm, which can find a shortest path only if the two trees have overlapping nodes. The lower the density of a network, the less likely the two trees will overlap. In my experiment the density of the narcotics network is 0.08. This means that the two trees have overlapping nodes only 8% of the time and that the algorithm must spend more time growing the trees. The kidnapping network, in contrast, has a much higher density (0.66), causing the two-tree algorithm to be faster than the one-tree algorithm.

Based on the analysis, I suggest that the two-tree PFS algorithm be used for small and dense networks. For large and sparse networks, the one-tree PFS algorithm is faster.

3.6 Conclusions

Effective and efficient link analysis techniques can assist the investigation of organized crime. With the help of such techniques, crime investigators may acquire a better understanding of the interrelationships between offenders, thereby discovering new leads for investigation.

In this chapter, I proposed a link analysis technique that employs shortest-path algorithms (PFS and two-tree PFS) to identify the strongest relations between two or more entities in a criminal network. Modifications were made to the algorithms to solve the shortest-path computation problem for multiple source nodes. After a logarithmic transformation of the link weights, these shortest paths could identify the strongest relations between given entities.

The evaluation study focused on the approach’s effectiveness and efficiency, both of which are desirable features of a sophisticated decision-support system. The results show that the shortest-path algorithms outperformed the typical relation search approach of crime investigators (as represented by the modified BFS algorithm) in terms of effectiveness. The relation paths identified using the shortest-path algorithms were considered useful about 70% of the time, as opposed to precision rates of 30% (for the kidnapping network) and 16.7% (for the narcotics network) with the modified BFS algorithm. The two shortest-path algorithms always produced identical results, but the two-tree PFS algorithm was faster for the small and dense kidnapping network and the PFS algorithm was faster for the large and sparse narcotics network.

Analysis of the evaluation results suggests that the effectiveness might be improved by extracting more appropriate entities from texts and using them as network nodes. In my future research I will apply effective named-entity extraction techniques to replace my current noun phraser. I will also incorporate some domain-specific heuristics to help the system select only entities and relations that are considered useful by crime investigators.


CHAPTER 4: EXTRACTING STATIC STRUCTURAL PATTERNS IN CRIMINAL NETWORKS

4.1 Introduction

In Chapter 3, I proposed using shortest-path algorithms to identify important relations between criminals. Many other static structural patterns, such as key nodes and subgroups in criminal networks, are also valuable knowledge resources for the investigation of organized crime. In this chapter I propose using a number of techniques to address the static structural pattern mining problems to help law enforcement and intelligence agencies better manage their knowledge assets about crimes and criminals (Xu & Chen, 2005).

Law enforcement and intelligence agencies have long realized that knowledge about criminal networks is important to crime investigation and may to a large extent shape police efforts (McAndrew, 1999). A clear understanding of network structures, operations, and individual roles can help develop effective control strategies to prevent crimes from taking place.

However, criminal network analysis and mining currently is primarily a manual process, usually consuming much time and human effort at each stage of the knowledge discovery process (data processing, transformation, analysis, and visualization). Although some existing tools provide visual representations of criminal networks to assist investigation, they lack structural network analysis functionality that could offer deeper insight into the structure and organization of criminal enterprises.

To help discover criminal network knowledge efficiently and effectively, I propose in this chapter a series of procedures for automated network structure mining and visualization: network creation, network partition, structural analysis, and network visualization. I have developed a prototype system called CrimeNet Explorer that incorporates several advanced techniques (a concept space approach, social network analysis methods, etc.) for automatically extracting structural patterns in criminal networks, namely, key members, subgroups, and interaction patterns between subgroups.

The remainder of the chapter is organized as follows: Section 4.2 introduces the background of criminal network analysis; Section 4.3 reviews existing network analysis tools and social network analysis techniques; Section 4.4 provides details about the mining procedures and CrimeNet Explorer. System evaluation is discussed in Section 4.5, and Section 4.6 concludes this chapter.

4.2 Background

When analyzing criminal networks, crime investigators often focus on characteristics of the network structure to gain insight into the following questions (McAndrew, 1999; Sparrow, 1991):

Who is central in the network?

What subgroups exist in the network?

What are the patterns of interaction between subgroups?

What is the overall structure of the network?

Which member’s removal would result in disruption of the network?

How do information or goods flow in the network?

Knowledge of these structural characteristics can help reveal vulnerabilities of criminal networks and may have important implications for crime investigation.

4.2.1 Implications of Structural Network Analysis

Usually, criminal network members who occupy central positions should be targeted for removal or surveillance (Baker & Faulkner, 1993; McAndrew, 1999; Sparrow, 1991). A central member may play a key role in a network by acting as a leader who issues commands and provides steering mechanisms or serving as a gatekeeper who ensures that information or goods flow effectively among different parts of the network. Removal of these central members may effectively disrupt the network and put the operation of a criminal enterprise out of action.

In addition to studying roles of individual members, crime investigators also need to pay special attention to subgroups in criminal enterprises. Each subgroup or team may be responsible for specific tasks. Group members have to interact and cooperate to accomplish the tasks. Therefore, detecting subgroups in which members are closely related to one another can increase understanding of a network’s organization.

Moreover, groups may interact with each other in such a way that interactions and relationships may reveal certain patterns. For example, one group may have frequent interactions with one other specific group but seldom interact with the rest of the network.

When interaction and relationship patterns between groups are found, the overall structure of the network can become more apparent. Indeed, different structures have different points of vulnerability. Intelligence regarding the overall structure of a network can help law enforcement and intelligence agencies develop the most effective strategies to disrupt that network.

4.2.2 Special Network Structures

Different criminal network structures such as chain, star/wheel, and complete/clique (Evan, 1972; Ronfeldt & Arquilla, 2001) require specific disruptive strategies. A chain structure consists of members (individuals or groups) that are connected one by one so that information or goods must flow from one member to its neighbor before getting to the next. In a star structure, members are all connected to a central member who acts as a leader or hub. In a complete network, all members are fully connected with one another so that communication between any two members can be carried out directly. A star structure is a centralized network, whereas chain and complete structures are considered decentralized networks (Baker & Faulkner, 1993; Freeman, 1979). In a centralized network, removal of the central member(s) can cause the network to fall apart. A decentralized network, however, is more difficult to disrupt and more resistant to damage.

Although criminal network knowledge has important implications for crime investigation, little research has been done to develop advanced, automated techniques to assist with such tasks (Klerks, 2001; McAndrew, 1999; Sparrow, 1991). In the next section I review existing network analysis and visualization tools and introduce several new techniques that could be used for network analysis and structural pattern mining.

Existing network analysis tools used by law enforcement and intelligence agencies mainly focus on network visualization and do not have much structural analysis capability. Such a limitation might be successfully addressed by several methods from social network analysis research.

4.3.1 Existing Network Analysis Tools

Klerks (2001) categorized existing criminal network analysis tools into three generations.

4.3.1.1 First Generation: Manual Approach

Representative of the first generation is the Anacapa Chart (Harper & Harris, 1975), which was briefly reviewed in Chapter 1. In this approach, an investigator first constructs an association matrix by examining data files to identify relations between criminals. Based on this association matrix, a link chart can be drawn for visualization purposes. The criminal having the most links to other people may be placed at the center of the link chart, indicating his/her importance in the network. The investigator then can study the structure of the graphical portrayal of the network to discover patterns of interest. Krebs (2001), for example, mapped a terrorist network comprising the 19 hijackers in the September 11 attacks. He first examined publicly released information reported in several major newspapers to gather data about relationships among the terrorists. He then manually constructed an association matrix to integrate these relations and drew a terrorist network depicting possible patterns of interactions based on the matrix (see Figure 4.1).

Although such a manual approach is helpful for crime investigation, for very large data sets its use becomes extremely ineffective and inefficient.


Figure 4.1: The terrorist network surrounding the 19 hijackers on September 11, 2001 (Source: http://www.orgnet.com).

4.3.1.2 Second Generation: Graphics-Based Approach

Second-generation tools are more sophisticated because they can produce graphical representations of networks automatically. Most current criminal network analysis tools belong to this generation; among them are Analyst’s Notebook, Netmap, and Watson.

These three tools have also been briefly reviewed in Chapter 1.

Analyst’s Notebook has been widely employed by law enforcement in the United States and the Netherlands (Klerks, 2001). Like the first-generation approach, Analyst’s Notebook relies on a human analyst to detect criminal relationships in the data, but it can automatically generate a link chart based on relational data stored in a spreadsheet or text file. It uses icons to distinguish between different types of entities (e.g., persons, bank accounts, companies, addresses) and allows a user to drag those icons around to rearrange the network layout. For example, an icon representing a key person can be dragged to the center of the chart, and less important icons can be placed on the periphery (see Figure 4.2a).

Similarly, Netmap provides network visualization functionality (see Figure 4.2b). The system lays out entities of various types on the perimeter of a circle and places straight lines between entities to represent links. By examining the links, an analyst may discover useful patterns of interactions and relations hidden behind the network. Netmap has been adopted in the FinCEN system at the U.S. Department of the Treasury to analyze patterns of financial transaction data to detect money laundering (Goldberg & Senator, 1998).

Another second-generation tool called Watson (Anderson et al., 1994) can search for and identify possible relations between persons by querying databases (see Figure 4.2c). Given a person’s name, Watson can automatically form a database query to search for related persons. The related persons found are linked to the given person, and the result is presented in a link chart.

Although second-generation tools are capable of visualizing criminal networks, their sophistication level remains modest because they offer little structural analysis capability.

The analysis burden is still on human crime analysts.


Figure 4.2: Second-generation criminal network analysis tools. (a) Analyst’s Notebook. Network members are automatically arranged for easy interpretation (Source: i2, Inc.). (b) Netmap. The thickness of a line indicates the relational strength of the link it represents. Different colors are used to represent different entity types (Source: Netmap Analytics, LLC.). (c) Watson. Relations among a group of people (the central sphere) are extracted from telephone records. Phone calls that are not to or from the group are also displayed (the peripheral nodes). A color is used to represent phone calls related to a particular person (Source: Xanalys, Ltd.).

4.3.1.3 Third Generation: Structural Analysis Approach

No existing tool is sophisticated enough to be categorized as being of the third generation.

Tools of this new generation are expected to provide more advanced analytical facilitation that helps discover structural characteristics of criminal networks: central members, subgroups, interaction patterns between groups, and the overall structure.

4.3.2 Social Network Analysis

Social network analysis (SNA) has recently been recognized as a promising technology for studying criminal organizations and enterprises (McAndrew, 1999; Sparrow, 1991). Studies involving evidence mapping in fraud and conspiracy cases have recently been added to this list (Baker & Faulkner, 1993; Saether & Canter, 2001). These studies, however, focused only on central network members and did not identify subgroups and interaction patterns in criminal networks. In fact, both relational and positional analysis in SNA are relevant to the study of criminal networks (McAndrew, 1999).

4.3.2.1 Relational Analysis

Relational analysis focuses on the connectivity of a network. It is often used to identify central members or to partition a network into subgroups. In such studies, links usually are weighted by relational strength. The three most popular centrality measures are defined as follows (Freeman, 1979):

The degree of a node u is defined as the number of links u has,

    C_D(u) = \sum_{i=1}^{n} a(i, u),    (4.1)

where n is the total number of nodes in a network and a(i, u) is a binary variable indicating whether a link exists between nodes i and u. A network member with a high degree could be the leader or “hub” in a network.

The betweenness of a node u is defined as the number of geodesics (shortest paths between two nodes) passing through u,

    C_B(u) = \sum_{i<j}^{n} g_{ij}(u),    (4.2)

where g_{ij}(u) indicates whether the shortest path between two other nodes i and j passes through u. A member with high betweenness may act as a gatekeeper or “broker” in a network for smooth communication or flow of goods (e.g., drugs).

The closeness of a node u is the sum of the lengths of the geodesics between u and all the other nodes in a network,

    C_C(u) = \sum_{i=1}^{n} l(i, u),    (4.3)

where l(i, u) is the length of the shortest path connecting nodes i and u.
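As an illustration of the three measures, the following sketch (my own, not from the dissertation) computes degree, betweenness, and closeness on a small unweighted network, using BFS to enumerate geodesics; `adj` is a hypothetical adjacency-set representation:

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    """All geodesics between s and t: BFS distances, then backtracking from t."""
    dist = {s: 0}
    q = deque([s])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    if t not in dist:
        return []
    paths = []
    def back(v, path):
        if v == s:
            paths.append(path[::-1])
            return
        for w in adj[v]:
            if dist.get(w, -1) == dist[v] - 1:  # step one layer closer to s
                back(w, path + [w])
    back(t, [t])
    return paths

def centralities(adj):
    nodes = sorted(adj)
    deg = {u: len(adj[u]) for u in nodes}          # Eq. 4.1: number of links
    btw = {u: 0 for u in nodes}
    clo = {u: 0 for u in nodes}
    for i, j in combinations(nodes, 2):
        geos = shortest_paths(adj, i, j)
        if not geos:
            continue
        length = len(geos[0]) - 1
        clo[i] += length                            # Eq. 4.3: sum of geodesic lengths
        clo[j] += length
        for p in geos:
            for u in p[1:-1]:                       # interior nodes of the geodesic
                btw[u] += 1                         # Eq. 4.2: geodesics through u
    return deg, btw, clo

# A star network: 'a' is the hub connected to b, c, d
adj = {'a': {'b', 'c', 'd'}, 'b': {'a'}, 'c': {'a'}, 'd': {'a'}}
deg, btw, clo = centralities(adj)
print(deg['a'], btw['a'], clo['a'])  # prints: 3 3 3
```

The hub scores highest on all three measures, matching the intuition that the central member of a star structure is its leader.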

Another type of relational analysis is to partition a network based on the strength of relationships between network members. Because criminals often form groups or teams to commit crimes, such an approach can help detect subgroups in a large criminal network.

Two methods have been employed for network partition in SNA studies: matrix permutation and hierarchical clustering (Arabie et al., 1978; Wasserman & Faust, 1994). The purpose of matrix permutation is to rearrange the rows and columns of a matrix so that members who occupy adjacent rows (or columns) can be organized into the same group. Since matrix permutation is inherently an NP-hard problem, many SNA studies use hierarchical clustering methods (Arabie et al., 1978). Hierarchical clustering will be reviewed in Section 4.3.2.3.

4.3.2.2 Positional Analysis

Unlike relational analysis, positional analysis examines how similarly two network members connect to other members. The purpose of positional studies is to discover the overall structure of a social network using a blockmodeling approach (White et al., 1976). To model interaction patterns between positions after network partition, blockmodel analysis compares the density of links between two positions with the overall density of the network (Arabie et al., 1978; Breiger et al., 1975; White et al., 1976). Link density between two positions is the actual number of links between all pairs of nodes drawn from each position divided by the possible number of links between the two positions. In a network with undirected links, for example, the between-position link density can be calculated by

    d_{ij} = m_{ij} / (n_i n_j),    (4.4)

where d_{ij} is the link density between positions i and j; m_{ij} is the actual number of links between positions i and j; and n_i and n_j represent the number of nodes within positions i and j, respectively. The overall link density of a network is defined as the total number of links divided by the possible number of links in the whole network, i.e., d = m / [n(n-1)/2], where m is the total number of links and n is the total number of nodes in the network. Notice that for an undirected network the possible number of links is always n(n-1)/2.

A blockmodel of a network is thus constructed by comparing the density of the links between each pair of positions, d_{ij}, with d: a between-position interaction is present if d_{ij} ≥ d, and absent otherwise. Blockmodeling therefore reduces a complex network to a simpler structure by summarizing individual interaction details into relationship patterns between positions (White et al., 1976). As a result, the overall structure of the network becomes more evident.
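The blockmodel construction rule can be sketched in a few lines (my own illustration; within-position blocks are omitted for brevity, and `adj` is a hypothetical adjacency structure):

```python
def blockmodel(adj, positions):
    """Reduce an undirected network to a blockmodel image: for each pair of
    positions, compare the between-position link density (Eq. 4.4) with the
    overall density d and record whether an interaction is present."""
    nodes = [v for p in positions for v in p]
    n = len(nodes)
    m = sum(len(adj[v]) for v in nodes) // 2      # each undirected link counted twice
    overall = m / (n * (n - 1) / 2)               # overall density d
    image = []
    for pi in positions:
        row = []
        for pj in positions:
            if pi is pj:
                row.append(None)                  # within-position block omitted here
                continue
            links = sum(1 for u in pi for v in pj if v in adj[u])
            d_ij = links / (len(pi) * len(pj))    # Eq. 4.4
            row.append(d_ij >= overall)           # interaction present iff d_ij >= d
        image.append(row)
    return image

# Three 2-node positions: A-B densely linked, B-C by a single link, A-C unlinked
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3, 5}, 5: {4, 6}, 6: {5}}
positions = [[1, 2], [3, 4], [5, 6]]
image = blockmodel(adj, positions)
# only the dense A-B block exceeds the overall density, so only it is "present"
```

Here m = 7 links over 15 possible gives d ≈ 0.47; the A-B block (density 0.75) is present, while B-C (0.25) and A-C (0) are absent.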

4.3.2.3 Hierarchical Clustering

Although they are based on different measures, both relational and positional analysis in SNA may employ hierarchical clustering to partition a network. When used in relational analysis, hierarchical clustering treats relational strength as a similarity measure; the resulting clusters therefore represent subgroups whose members are closely related. When applied in positional analysis, on the other hand, hierarchical clustering uses structural equivalence to measure similarity, and the resulting clusters represent positions whose members are similar in the way they connect to other members.

The advantage of hierarchical clustering is that a network can be partitioned into different numbers of clusters at different similarity levels. With this feature, the underlying structure of a network can be analyzed at different levels of detail. The disadvantage of hierarchical clustering, on the other hand, is that each node can be assigned to only one cluster at a specific level of similarity (Wasserman & Faust, 1994). There is no overlap between clusters.

Among the three most popular hierarchical clustering methods (single-link, complete-link, and Ward’s algorithm), the complete-link algorithm is most widely used because it gives more homogeneous and stable clusters than the others (Jain & Dubes, 1988; Jain et al., 1999; Lance & Williams, 1967).

4.3.2.4 Visualization of Social Networks

SNA studies employ multidimensional scaling (MDS) in both relational and positional analysis of social networks (Breiger et al., 1975; Burt, 1976; Freeman, 2000; Wasserman & Faust, 1994). When applied to relational analysis, MDS uses relational strength as a measure of proximity and outputs an x-y coordinate for each object on a two-dimensional plane so that closely related members are also close visually. When applied to positional analysis, MDS uses the structural equivalence between members as a proximity measure so that members who are structurally substitutable are close together on the display. Recent SNA studies have also used spring-embedder algorithms to visualize social networks (Freeman, 2000).

In summary, SNA offers several structural analysis techniques that can be used to extract structural patterns from criminal networks. However, existing network analysis tools are not sophisticated enough to employ these techniques. To analyze a criminal network, an investigator has to extract information about criminal relationships from data, create a network representation, and perform structural analysis manually to identify central members, detect subgroups, and discover interaction patterns among groups. It is highly desirable to automate the whole process of criminal network analysis so that knowledge can be extracted more efficiently and effectively.

4.4 Mining Structural Patterns in Criminal Networks

I propose using several techniques to facilitate structural pattern extraction. I have also developed a system called CrimeNet Explorer, which can be categorized as a third-generation network analysis tool and incorporates these techniques. Figure 4.3 presents the proposed structural pattern mining processes: network creation, network partition, structural analysis, and network visualization.

Criminal-justice data → Network Creation (concept space) → Networked data → Network Partition (hierarchical clustering) → Cluster hierarchies → Structural Analysis (centrality, blockmodeling) → Network Visualization (MDS)

Figure 4.3: Procedures for automated criminal network mining and visualization.


4.4.1 Network Creation

Criminal-justice data collected from crime incident reports, telephone records, surveillance logs, financial transaction records, and other sources usually do not store explicit information about criminal relationships. The task of extracting relational information from raw data and transforming it into a networked format could be quite labor-intensive and time-consuming.

To address this problem, I employed a concept space approach (Chen & Lynch, 1992) to create networks automatically (Chen et al., 2003; Hauck et al., 2002). The concept space approach was originally employed in information retrieval applications for extracting term relations in documents. It uses a co-occurrence weight to measure the frequency with which two words or phrases appear in the same document. The more frequently two words or phrases appear together, the more likely it is that they are related.

The criminal-justice data used in this chapter consisted of crime incident summaries provided by the Tucson Police Department (TPD). I treated each incident summary (a database record specifying the date, location, persons involved, and other information about a specific crime) as a document and each person’s name as a phrase. I then calculated co-occurrence weights based on the frequency with which two individuals appeared together in the same crime incident. I assumed that criminals who committed crimes together might be related and that the more often they appeared together, the more likely it was that they were related. As a result, the value of a co-occurrence weight not only implied a relationship between two criminals but also indicated the strength of the relationship (Hauck et al., 2002).

With the concept space approach, criminal relationships therefore could be extracted from crime incident data and transformed into a networked format automatically. The resulting networks were undirected, weighted graphs in which nodes represented individual criminals and co-occurrence weights of links represented relational strength. It is worth mentioning that the concept space approach has both advantages and disadvantages for extracting relations. On one hand, the weight of a link was normalized to a range between 0 and 1, which is better than a simple co-occurrence count. More importantly, the distribution of co-occurrences was extremely skewed: more than 90% of the criminal pairs resulted from a one-time co-occurrence, and only a small portion (around 2.4%) of pairs co-occurred 10 times or more. The concept space approach, which penalized extremely large co-occurrences (Chen & Lynch, 1992), helped prevent the link weights from being skewed. On the other hand, the concept space approach is limited because the relational strength can be affected by other factors such as crime type. For example, a co-occurrence relation in a gang-related crime in which a large number of criminals participated might not be as strong as a relation in an auto-theft crime in which only two criminals were involved.
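The idea of dampened, normalized co-occurrence weights can be sketched as follows. This is only an illustration of the principle; the exact weighting formula of Chen & Lynch (1992) differs, and the incident data here are invented:

```python
import math
from collections import defaultdict
from itertools import combinations

def cooccurrence_weights(incidents):
    """Concept-space-style weighting sketch: log-dampened co-occurrence
    counts normalized into (0, 1], penalizing very large counts."""
    counts = defaultdict(int)
    for names in incidents:                       # one incident summary = one "document"
        for a, b in combinations(sorted(set(names)), 2):
            counts[(a, b)] += 1
    max_damped = max(math.log1p(c) for c in counts.values())
    return {pair: math.log1p(c) / max_damped for pair, c in counts.items()}

incidents = [
    ["Smith", "Jones"], ["Smith", "Jones"], ["Smith", "Jones"],  # frequent pair
    ["Jones", "Lee"],                                            # one-time pair
]
w = cooccurrence_weights(incidents)
# weights lie in (0, 1]; the frequent pair scores higher, but log dampening
# keeps the gap smaller than the raw 3:1 count ratio
```

With these data the frequent pair gets weight 1.0 and the one-time pair 0.5, rather than the 3:1 spread a raw count would give.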

I also observed that the network generated might not necessarily be a single connected graph containing all criminals in a data set. This might be due to the fact that some criminal enterprises have no connection with other criminal organizations. It could also be caused by the incompleteness of the data (McAndrew, 1999).

The networks created were stored in a database table in which each tuple specified a pair of criminals and an associated co-occurrence weight. These co-occurrence weights would be used later in both structural pattern mining and network visualization.

4.4.2 Network Partition

With data expressed in a networked format, I employed hierarchical clustering to partition a network into subgroups based on relational strength. I used a complete-link algorithm since it was less likely to be subject to the chaining effect (Jain et al., 1999).

Existing complete-link algorithms vary in space and time complexity (Day & Edelsbrunner, 1984; Defays, 1977; Voorhees, 1986). Although clustering was an offline operation that did not necessarily require high speed, I took into consideration that online dynamic clustering would be needed under some circumstances in the future. Therefore, time complexity was the primary criterion for algorithm selection. The algorithm I chose was an RNN-based complete-link algorithm that used the reciprocal nearest neighbor (RNN) approach developed by Murtagh (1984). It took O(n^2) time and O(n^2) space and was significantly faster than other algorithms that typically required O(n^3) time (Roussinov & Chen, 1999).

Co-occurrence weights generated in the previous stage were first transformed into distances/dissimilarities. Since I was employing a complete-link algorithm, the distance between two clusters was defined as the distance between the farthest pair of nodes drawn from each cluster.

Initially, the algorithm treated each node as a cluster, then arbitrarily selected a cluster and incrementally built for it a nearest-neighbor chain (NN-chain). In an NN-chain, each cluster was the nearest neighbor of its previous cluster. A chain terminated with two clusters that were each other’s nearest neighbor. These two nearest clusters were then merged into a larger cluster and the dendrogram was updated. The algorithm kept merging nearest clusters until all the nodes were merged into one big cluster. The resulting hierarchy had multiple levels, and each level corresponded to a specific partition of the network.

Since the previous stage created multiple disjoint networks, I modified the algorithm to make it generate a separate cluster hierarchy for each network. The hierarchies generated were stored in a database for later use. Figure 4.4 presents the pseudocode of the modified algorithm.

Form a cluster for each node;
while at least one between-cluster distance is less than infinity do
    currentCluster = an arbitrary cluster;
    found = false;
    while not found do
        find the nearest neighbor, C, of currentCluster;
        if isRNN(C, currentCluster) then
            merge C and currentCluster; update the dendrogram;
            found = true;
        else
            currentCluster = C;
        end if
    end while
end while

Figure 4.4: The pseudocode of the modified version of the RNN-based complete-link algorithm.
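The NN-chain idea can be made executable in a compact sketch (my own simplification: the chain is restarted after every merge, so it does not achieve Murtagh's full O(n^2) bound, and distinct pairwise distances are assumed to avoid ties):

```python
def rnn_complete_link(dist):
    """Complete-link agglomerative clustering via reciprocal nearest neighbors.
    `dist` is a symmetric dict-of-dicts of pairwise node distances; returns the
    merge history as (clusterA, clusterB, merge_distance) tuples."""
    clusters = {frozenset([v]) for v in dist}

    def d(a, b):  # complete link: farthest pair across the two clusters
        return max(dist[u][v] for u in a for v in b)

    merges = []
    while len(clusters) > 1:
        chain = [next(iter(clusters))]            # start a chain at an arbitrary cluster
        while True:
            cur = chain[-1]
            nn = min((c for c in clusters if c is not cur), key=lambda c: d(cur, c))
            if len(chain) >= 2 and nn == chain[-2]:   # reciprocal nearest neighbors found
                merges.append((cur, nn, d(cur, nn)))  # record the dendrogram step
                clusters -= {cur, nn}
                clusters.add(cur | nn)
                break
            chain.append(nn)                      # otherwise extend the NN-chain
    return merges

# Four nodes: 1 and 2 are close, 3 and 4 are close, the two pairs are far apart
dist = {1: {2: 1.0, 3: 10.0, 4: 11.0},
        2: {1: 1.0, 3: 12.0, 4: 13.0},
        3: {1: 10.0, 2: 12.0, 4: 2.0},
        4: {1: 11.0, 2: 13.0, 3: 2.0}}
merges = rnn_complete_link(dist)
# merges {1,2} at 1.0 and {3,4} at 2.0, then the two pairs at 13.0 (farthest pair)
```

The last merge distance (13.0) shows the complete-link criterion at work: the two pair-clusters join at the distance of their farthest members, not their nearest.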

4.4.3 Structural Analysis

In structural analysis, central member identification and blockmodeling are online operations performed by request.

I used the three centrality measures (degree, betweenness, and closeness) to identify central members in a given subgroup. The degree of a node could be obtained by counting the total number of links the node had to all the other group members. A node’s betweenness and closeness scores required computing the shortest paths (geodesics).

In my implementation, Dijkstra’s classical shortest-path algorithm (Dijkstra, 1959) was used to compute the geodesics from a single node to every other node in a subgroup. Given an undirected graph representing a subgroup i that consisted of n_i nodes, applying the algorithm n_i - 1 times could generate the shortest paths between all pairs of nodes in the subgroup. The betweenness of a specific node u was thus obtained by counting the number of geodesics between the other nodes passing through node u. Because running Dijkstra’s algorithm once took O(n_i^2) time, the overall time complexity for calculating the betweenness of nodes in subgroup i was O(n_i^3).

There are algorithms designed specifically for all-pairs shortest-path calculation, such as Dantzig’s (Dantzig, 1960) and Floyd’s (Floyd, 1962) algorithms, but their time complexity is also O(n^3). The advantage of using Dijkstra’s algorithm was that by the time all the geodesics for a specific node were found, the computation of the closeness of that node was also finished, because the closeness is simply the sum of the lengths of the geodesics. Thus, closeness was a “byproduct” of betweenness and was obtained at no extra cost.
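The closeness-as-byproduct observation can be seen in a standard Dijkstra implementation (a sketch with a hypothetical three-member weighted subgroup):

```python
import heapq

def dijkstra_lengths(adj, s):
    """Geodesic lengths from source s in a weighted undirected graph
    given as {node: {neighbor: weight}}."""
    dist = {s: 0.0}
    pq = [(0.0, s)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist[v]:
            continue  # stale queue entry
        for w, wt in adj[v].items():
            nd = d + wt
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(pq, (nd, w))
    return dist

# Once all geodesic lengths from u are known, closeness is just their sum
adj = {"a": {"b": 1.0, "c": 4.0},
       "b": {"a": 1.0, "c": 2.0},
       "c": {"a": 4.0, "b": 2.0}}
closeness_a = sum(dijkstra_lengths(adj, "a").values())  # 0 + 1 + 3 = 4.0
```

Note that the direct a-c link of weight 4 is bypassed by the shorter two-hop path through b, so the geodesic length to c is 3, not 4.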

To extract between-group interaction patterns and the overall structure of a criminal network, I performed blockmodel analysis. Unlike general blockmodel analysis in SNA research, which reveals interaction patterns between network positions based on the structural equivalence measure, my blockmodel analysis examined relationships between subgroups based on the relational strength measure. I chose this approach based on interviews with crime investigators from TPD and evidence that crime investigators often are more interested in interaction patterns between subgroups than between positions.

Blockmodeling therefore was used to identify interaction patterns between subgroups discovered in the network partition stage. At a given level of a cluster hierarchy, I compared between-group link densities with the network’s overall link density to determine the presence or absence of between-group relationships.
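The density comparison described above can be illustrated with a minimal sketch. The function and variable names here are my own, and the data layout (links as a set of unordered node pairs) is an assumption for illustration.

```python
def overall_density(links, nodes):
    # Overall link density of an undirected network: 2m / (n(n - 1)).
    n = len(nodes)
    return 2 * len(links) / (n * (n - 1))

def between_group_density(links, group_a, group_b):
    # Share of the possible links between two disjoint groups that exist.
    present = sum(1 for a in group_a for b in group_b
                  if frozenset((a, b)) in links)
    return present / (len(group_a) * len(group_b))

def groups_related(links, nodes, group_a, group_b):
    # Record a between-group relationship when the between-group link
    # density reaches the network's overall link density.
    return between_group_density(links, group_a, group_b) >= overall_density(links, nodes)
```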

4.4.4 Network Visualization

To map a criminal network onto a two-dimensional display, I employed MDS to assign a location to each node in a network of n nodes, given the corresponding n × n distance matrix. Since distances transformed from co-occurrence weights were quantitative data, I selected Torgerson's classical metric MDS algorithm (Torgerson, 1952). This algorithm first transformed the distance matrix into a scalar product matrix B by double-centering.

It then solved the singular value decomposition (SVD) problem for B to generate an n × n matrix X, the first two columns of which stored the coordinates of the n nodes. The key step in this algorithm was SVD, which could be solved efficiently using the library routine provided by Press et al. (Press et al., 1992).
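The double-centering-plus-SVD procedure can be sketched as follows. This is a minimal NumPy illustration of classical MDS, not the Press et al. routine used in the system.

```python
import numpy as np

def classical_mds(D, dims=2):
    # Torgerson's classical MDS: double-center the squared distances to
    # obtain the scalar-product matrix B, then factor B by SVD.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # scalar-product matrix
    U, s, _ = np.linalg.svd(B)
    # First `dims` columns give the node coordinates.
    return U[:, :dims] * np.sqrt(s[:dims])
```

For distances that embed exactly in the plane, the pairwise distances between the returned coordinates reproduce the input matrix.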

4.4.5 CrimeNet Explorer

In CrimeNet Explorer, a graphical user interface was provided for easy interaction between a user and the system. Figure 4.5 shows screen shots of the system interface.

Each node was labeled with the name of the criminal it represented. Criminal names were scrubbed for data confidentiality. A straight line connecting two nodes indicated that the two corresponding criminals committed crimes together and thus were related.

To find subgroups and interaction patterns between groups, a user could adjust the "level of abstraction" slider at the bottom of the panel. A high level of abstraction corresponded with a high distance level in the cluster hierarchy. At any level of abstraction, a circle represented a subgroup. The size of the circle was proportional to the number of criminals in the subgroup. To view how group members were connected within a subgroup, a user could click on the corresponding circle to bring up a small window depicting the group's inner structure. At the same time, the rankings of the group members in terms of the three centrality measures were listed at the right-hand side of the small window.

Straight lines connecting circles represented between-group relationships. The thickness of a line was proportional to the density of the links between the two corresponding groups. This design differed from general blockmodel analysis, which treats a low link density as an indicator of the absence of a between-group relationship. I thought that the absence of a line between two subgroups might cause a user to infer mistakenly that there was no actual link connecting members from the two groups. I therefore kept a line between two groups as long as there was a link between members from the two groups. For crime investigations, this design decision could be more informative than the treatment in general blockmodel analysis.

Figure 4.5: CrimeNet Explorer. In this example, the network appeared to be a star structure after blockmodel analysis was performed; the vulnerability of this network therefore lay in the central members. (a) A 57-member criminal network. Each node is labeled with the name of the criminal it represents. Lines represent the relationships between criminals. (b) The reduced structure of the network. Each circle represents one subgroup, labeled with its leader's name. The size of the circle is proportional to the number of criminals in the group. A line represents a relationship between two groups; its thickness represents the strength of the relationship. Centrality rankings of members in the biggest group are listed in a table at the right-hand side. (c) The inner structure of the biggest group (the relationships between group members).

As discussed previously, the purpose of this chapter is to employ advanced structural analysis and visualization techniques to help discover valuable criminal network structural patterns. The major advantage of CrimeNet Explorer over existing network analysis tools is its structural analysis capabilities.

I conducted a system evaluation to answer the following research questions:

Will the system detect subgroups from criminal networks correctly?


Will the structural analysis functionality help extract structural properties of criminal networks more effectively and efficiently?

Prior to the system evaluation I carefully examined the TPD datasets and found that networks generated from them varied in size and structure.

4.5.1 The Narcotics and Gang Networks

I extracted two datasets from TPD databases: (a) incident summaries of narcotics crimes from January 2000 to May 2002, and (b) incident summaries of gang-related crimes from January 1995 to May 2002. Both narcotics and gang-related crimes were organized crimes likely to have been committed by networked offenders. I chose a longer time period for the gang data because in each year there were substantially fewer gang-related crimes than narcotics crimes.

I analyzed the sizes of the networks generated from the two datasets. The narcotics dataset consisted of 12,842 criminals from 2,628 networks. The gang dataset consisted of 4,376 criminals from 289 networks. Both datasets contained a single large network (e.g., the 502-member network in the narcotics dataset) and a large number of small networks with fewer than 20 members. The biggest gang network was much larger than the biggest narcotics network, although the gang dataset contained fewer criminals.

Table 4.1 provides network-size statistics of the two datasets. Further examination of the incident summaries revealed that members in the large networks (those having more than 20 members) were mostly serial offenders and possibly came from various criminal organizations. In contrast, small networks (those having fewer than 20 members) consisted primarily of "one-time" offenders and would probably be less interesting for a study of criminal organizations and enterprises.

                   Narcotic networks           Gang networks
2-20 members       2,618                       284
21-100 members     9                           4
>100 members       1 (a 502-member network)    1

Table 4.1: Sizes of networks generated from the two datasets.

In addition to network size, I examined network structures using the blockmodeling function of CrimeNet Explorer. Because it was quite difficult to display the biggest networks in the two datasets on a screen, each having several hundred members, I analyzed only the structures of networks with 21-100 members. I found that the two types of networks had distinguishing structural patterns:

Two out of the four gang networks studied had a star structure similar to the example in Figure 4.5. The third network had a chain of stars. The fourth network had a star structure with each branch being a smaller star or a clique and its overall structure looked like a snowflake.

All nine narcotics networks had a chain structure. Three of these networks were chains of stars. One network had a circle in the middle of the chain.

Analysis of network size and structure revealed that gang networks tended to be bigger and more centralized, whereas narcotics networks were smaller and more decentralized. This finding implied that different strategies could be used to disrupt the two types of networks.

I selected a 60-member narcotics network and a 24-member gang network and used them in a subject study to evaluate CrimeNet Explorer.

4.5.2 Experimental Design

To address the research questions, I conducted a controlled laboratory experiment to evaluate system performance. Thirty students from the Department of Management Information Systems at the University of Arizona participated in the experiment. I used students rather than crime investigators as research subjects based on two considerations. First, it was difficult to recruit a sufficient number of crime investigators because of their busy work schedules. Second, although the prototype system was designed for criminal network analysis, finding structural patterns in networks of nodes is not a domain-specific task. Student subjects should be able to perform the tasks assigned to them even without domain knowledge in crime investigation.

Each subject participated in four sessions: demographic survey, training, testing, and post-test questionnaire. The demographic survey focused on subjects' background information such as gender, age, and computer experience. The training session was designed to help subjects understand the major concepts (e.g., subgroups, central members, etc.) and gain hands-on experience with the system. During the testing sessions, subjects performed nine tasks on each of two test networks. They then completed a post-test questionnaire on which they reported their attitudes towards the system's ease of use and their satisfaction with the system's functionality.

The 18 tasks used in the experiment were divided into three types: (1) detecting subgroups in a network, (2) identifying interaction patterns between subgroups, and (3) identifying central members within a given subgroup.

4.5.2.1 Task I: Subgroup Detection (Clustering)

I wanted to learn through task I whether the system could achieve performance comparable to that of untrained users when partitioning a network into clusters (subgroups). I asked a domain expert (a detective who had served in law enforcement for more than 20 years) to provide partitions of the two test networks based on his knowledge of narcotics and gang-related crimes. His partitions were used as "gold standards" to evaluate clustering results generated by the system and by the subjects, who represented untrained users.

There has not been a generally accepted metric for evaluating clustering results (Jain & Dubes, 1988). I selected for the experiment the cluster precision and cluster recall metrics developed by Roussinov and Chen (Roussinov & Chen, 1999). These two measures examine whether or not a pair of documents is put in the same cluster by human subjects and by the system (Sahami et al., 1998). Based on the same rationale, I defined cluster precision and recall as:

Recall_system = (number of node pairs in both the system partition and the expert partition) / (number of node pairs in the expert partition)   (4.5)

Recall_human = (number of node pairs in both the human partition and the expert partition) / (number of node pairs in the expert partition)   (4.6)

Precision_system = (number of node pairs in both the system partition and the expert partition) / (number of node pairs in the system partition)   (4.7)

Precision_human = (number of node pairs in both the human partition and the expert partition) / (number of node pairs in the human partition)   (4.8)
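Equations 4.5-4.8 can be computed by enumerating same-cluster node pairs. The following is an illustrative sketch (names are mine), assuming a partition is given as a list of node collections:

```python
from itertools import combinations

def same_cluster_pairs(partition):
    # All unordered node pairs that share a cluster in the partition.
    pairs = set()
    for cluster in partition:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def cluster_recall(candidate, expert):
    # Shared same-cluster pairs over the expert partition's pairs (Eq. 4.5/4.6).
    both = same_cluster_pairs(candidate) & same_cluster_pairs(expert)
    return len(both) / len(same_cluster_pairs(expert))

def cluster_precision(candidate, expert):
    # Shared same-cluster pairs over the candidate partition's pairs (Eq. 4.7/4.8).
    both = same_cluster_pairs(candidate) & same_cluster_pairs(expert)
    return len(both) / len(same_cluster_pairs(candidate))
```

The same two functions apply whether the candidate partition comes from the system or from a human subject.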

I developed two hypotheses to compare the clustering results from the system and the human subjects:

H1: The system and subjects will achieve different clustering recall.

H2: The system and subjects will achieve different clustering precision.

Since hierarchical clustering generated nested partitions for a network, I selected the partition containing the same number of clusters as the expert's partition to be the system's clustering result. During the experiment, subjects were asked to partition a given network into the same number of clusters as in the expert partition. Although both the system and the subjects generated the same number of clusters, they could group different node pairs together, resulting in different recall and precision.


4.5.2.2 Tasks II and III: Interaction Pattern and Central Members Identification

Because the major advantage of CrimeNet Explorer was its structural analysis capability in addition to its network visualization functionality, I was interested in comparing subjects' performance under two experimental conditions: (1) structural analysis plus visualization, and (2) visualization only.

I considered two general information systems performance metrics (Jordan, 1998):

Effectiveness = the total number of correct answers a subject generated for a given type of task.

Efficiency = the average time a subject spent to complete a given type of task.

Since the system could automatically identify interaction patterns between subgroups and central members within a subgroup, I expected that a subject could achieve higher efficiency and effectiveness with the help of the structural analysis functionality than with the visualization functionality alone. Specifically, I developed four hypotheses to compare performance under the two experimental conditions:

H3: A subject will achieve higher effectiveness for interaction pattern identification tasks using the system with both structural analysis and visualization functionality than with visualization functionality only.


H4: A subject will achieve higher effectiveness for central member identification tasks using the system with both structural analysis and visualization functionality than with visualization functionality only.

H5: A subject will achieve higher efficiency for interaction pattern identification tasks using the system with both structural analysis and visualization functionality than with visualization functionality only.

H6: A subject will achieve higher efficiency for central member identification tasks using the system with both structural analysis and visualization functionality than with visualization functionality only.

The domain expert validated answers to all the questions for tasks II and III. To eliminate a learning effect, the orders of experimental conditions and tasks were randomized for each test network.

For task II, subjects were asked to answer two questions regarding the interaction patterns between subgroups:

Given two subgroups, determine whether they were related;

Given three subgroups (e.g., A, B, and C), determine whether group A had more interactions with group B than with group C.

For task III, subjects were asked to identify central members with the highest degree. I did not assign tasks of identifying central members with the highest betweenness and closeness because these two measures required computation of shortest paths, which were difficult for subjects to find under the visualization-only condition. I therefore included only degree for a fair comparison between the two experimental conditions.

For tasks II and III, subjects were encouraged to complete the tasks as quickly as possible. Each subject's task completion time was recorded. On average, it took a subject 30-45 minutes to complete all 18 tasks.

4.5.3 Results and Discussion

4.5.3.1 Quantitative Analysis

Clustering recall and precision. H1 and H2 were supported. Paired t-tests showed that the system's clustering recall and precision were significantly higher than the subjects' (recall: t = 4.39, p < 0.001; precision: t = 5.33, p < 0.001). Table 4.2 gives the recall and precision rates of the system and the subjects. Numbers in parentheses are standard deviations.

             Human          System
Recall       0.86 (0.07)    0.93 (0.00)
Precision    0.77 (0.03)    0.91 (0.00)

Table 4.2: Clustering recall and precision.

I believe that the difference in clustering recall and precision resulted from visual clues that subjects relied on when performing clustering tasks.

In the experiment, the domain expert based his partitioning of the test networks on his knowledge of network members and grouped criminals who frequently hung out together into the same clusters. His judgment of clusters was not affected by visual clues from the network layouts.

The system neither had domain knowledge nor was affected by visual clues from the network layouts. Thus, partitioning of the networks depended entirely on link weights (relational strengths). Since relational strength was determined by the frequency with which two criminals committed crimes together, it could relatively accurately reflect reality. Therefore, partitions generated by the system closely resembled the expert’s partitions.

Untrained subjects had to rely entirely on the relative locations of nodes in the visual display of networks to determine the relational strength between criminals. Visual clues thus could affect subjects' judgment heavily. When a network display was distorted (caused by the dimensionality problem in the MDS algorithm), a subject could actually group weakly related criminals into one cluster if they appeared visually close. The test networks used in the experiment suffered from this distortion problem, which may have caused the subjects' clustering recall and precision to be worse than the system's.

Effectiveness. H3 and H4 were not supported. I performed paired t-tests for both tasks II and III to compare effectiveness under the two experimental conditions (task II: t = 1.41, p > 0.05; task III: t = 1.80, p > 0.05). These results implied that the analysis functionality did not help achieve significantly higher effectiveness. Table 4.3 shows the results.

                                Task type 2    Task type 3
Visualization plus analysis     3.90 (0.31)    3.30 (1.02)
Visualization only              3.73 (0.59)    3.20 (1.13)

Table 4.3: Effectiveness.

Such a result could be attributed to two reasons. First, for both tasks II and III, a subject could obtain a correct answer by counting lines on the network display under the visualization-only condition. For example, to compare the frequency of interactions between one group (A) and two other groups (B and C), a subject could count the number of lines between A and B, and the number of lines between A and C. A simple comparison of these two numbers would suggest which two groups had more frequent interactions. As long as the subject was careful, he or she could find the correct answer. Second, the two test networks used in this experiment were not very large, making these two types of tasks relatively simple.

Efficiency. The paired t-tests for efficiency comparison supported both H5 and H6 (task II: t = 6.92, p < 0.001; task III: t = 10.66, p < 0.001). This means that subjects could achieve significantly higher efficiency under the visualization-plus-analysis condition than under the visualization-only condition. Table 4.4 shows the efficiency statistics.

                                Task type 2     Task type 3
Visualization plus analysis     7.13 (2.19)     6.24 (3.85)
Visualization only              12.10 (4.81)    26.93 (12.45)

Table 4.4: Efficiency.

The results implied that with the help of structural analysis functionality subjects could identify interaction patterns among subgroups and the central members in a given subgroup significantly faster. Under the visualization plus analysis condition, a subject did not have to count lines manually to identify interaction patterns between groups because a straight line between two groups implied the presence of a between-group interaction. At the same time, the thickness of the line indicated the frequency of the interaction. In addition, the degrees of all group members were computed by the system so that a subject could find the one with the highest degree directly from the centrality table on the interface.

In summary, the structural analysis functionality provided by the system could significantly improve efficiency of network analysis tasks although the gain in effectiveness was not significant. Moreover, the system could identify subgroups of a network significantly better than untrained subjects.

4.5.3.2 Qualitative Feedback

Most subjects reported that features provided by the system were easy to learn and easy to use. For example, it was easy to adjust the slider to view different partitions at different abstract levels; it was convenient to visualize the inner structure of a subgroup in a small window. The table used to list degree rankings of group members was similar to an Excel spreadsheet and easy to understand.


Subjects’ negative comments about the system were primarily concerned with network layout and network partitions.

Network layout. Many subjects felt that the network was too cluttered in some areas, where nodes were so close to each other that labels overlapped and were hard to read.

Network partition. Most reported difficulty deciding where to put nodes that had many connections to nodes from different groups. They wished that overlapping groups could be allowed, so that some very popular nodes could belong to more than one group. However, hierarchical clustering algorithms always generate mutually exclusive clusters that do not overlap.

The domain expert also provided positive feedback. He said he had enjoyed using the system and believed that CrimeNet Explorer could be very useful for crime investigation in the following ways:

Increasing work productivity. With the structural analysis functionality of CrimeNet Explorer, a large amount of investigation time could be saved.

Assisting training for new crime investigators. New investigators who did not have sufficient knowledge about local criminal organizations could use the system to grasp the essence of the networks and crime history quickly. They would not have to spend a significant amount of time studying hundreds of incident reports.


Suggesting investigative leads that might otherwise be overlooked.

Assisting prosecution. Known relationships between individual criminals and criminal groups would be helpful to the prosecution when seeking to prove guilt in court.

Overall, the results of the quantitative and qualitative analysis showed that the system could be efficient and useful for extracting criminal network knowledge from large volumes of data.

4.6 Conclusions

Network structure mining is important for understanding the structure and organization of criminal enterprises. Advanced, automated techniques and tools are needed to extract knowledge about criminal networks efficiently and effectively. Such knowledge could help intelligence and law enforcement agencies enhance public safety and national security by developing comprehensive disruptive strategies to prevent and respond to organized crimes such as terrorist attacks and narcotics trafficking. I proposed in this chapter several techniques for automated criminal network analysis and visualization to support network creation, network partition, structural analysis, and network visualization.

The main contribution is the proposal of a series of procedures to guide structural pattern mining in the criminal network analysis domain. I incorporated various techniques to automatically extract valuable criminal network knowledge from large volumes of data.


Most of these techniques originated in other disciplines and initially were not intended for knowledge discovery. For example, the concept space approach, originally designed to generate automated thesauri from textual documents, was used to identify criminal relationships from crime incident summary data. The blockmodeling approach in SNA research was designed for validating theories of social structures and focused on interactions between “positions” of network members who were similar in social status and roles. I used the blockmodeling approach to extract interaction patterns among criminal groups in which members were closely related.

The prototype system, CrimeNet Explorer, has structural analysis functionality to detect subgroups, to identify between-group interaction patterns, and to identify central members of subgroups. Quantitative evaluation of the system demonstrated that subjects could achieve significantly higher efficiency with the help of the structural analysis functionality than with network visualization alone. No significant gain in effectiveness was present, however. Feedback from the subjects and the domain expert showed that CrimeNet Explorer was very promising and could be useful for crime investigation.


CHAPTER 5: IDENTIFYING GROUPS IN UNWEIGHTED NETWORKS

5.1 Introduction

This chapter focuses on the identification of groups in unweighted networks. In Chapter 4, I showed how to partition criminal networks using hierarchical clustering algorithms. These criminal networks were weighted by the co-occurrence of criminal names in crime reports. In unweighted networks, however, all links are essentially equally weighted. Conventional hierarchical clustering algorithms fail because they cannot determine the order in which to merge or divide clusters. In this chapter I propose an edge local density measure to approximate the weight of a link based on the local link structure. This measure can be incorporated into both single-pass and iterative clustering algorithms to find groups in unweighted networks.

Group is also called community (Gibson et al., 1998; Newman & Girvan, 2004), cluster (Wasserman & Faust, 1994; Xu & Chen, 2005), compartment (Krause et al., 2003), and module (Ravasz et al., 2002; Rives & Galitski, 2003). A group is a set of nodes connected by dense or strong links. A group in a social network can be a set of social actors with similar backgrounds and socioeconomic status (Galaskiewicz & Krohn, 1984; Wasserman & Faust, 1994). A group in a citation network may be a collection of articles of a specific research specialty or paradigm (Chen et al., 2001; Culnan, 1986; Garfield, 2001). A Web community is a group of Web pages whose authors share similar interests (Flake et al., 2000; Gibson et al., 1998; Kumar et al., 1999).


Finding groups, or identifying the community structure of networks, has important empirical implications because the community structure of a network often relates to the function of the system. For example, biological components such as proteins are organized in modules in cells. It has been found that this modular structure is critical to the survival of cells because harmful effects or attacks on a single module can be confined to that module without affecting other modules (Ravasz et al., 2002; Rives & Galitski, 2003). In the context of Web mining, identifying Web community structure can be of great help for designing focused crawlers, developing Web portals, and improving search engine performance (Flake et al., 2000; Imafuji & Kitsuregawa, 2002; Kumar et al., 1999).

Researchers have long been working on the development of effective and efficient graph partition techniques for finding groups and identifying community structure in unweighted networks. The generic form of unweighted graph partitioning is an NP-complete problem for which no polynomial-time algorithm exists (Flake et al., 2000). Various approximation methods have been proposed to address this problem. However, although some existing algorithms are rather effective, most suffer from low efficiency, which limits their applicability to large networks. It is desirable to develop methods that balance efficiency and effectiveness well based on the demands of different applications. In addition, general guidance is needed for selecting appropriate clustering methods in situations where efficiency and effectiveness are valued differently. To address these issues I propose the edge local density measure in this chapter.


The remainder of this chapter is organized as follows. In Section 5.2 I review related work on network partition. Section 5.3 presents the design of the local density measure.

Section 5.4 discusses the experimental design, hypotheses, and results of the performance evaluation. Section 5.5 concludes this chapter.

5.2 Related Work

Before I review related work on the unweighted graph partitioning problem, it is worth briefly reviewing the definition of group and the determination of link weights in weighted networks.

5.2.1 Defining Group

There has not been a widely accepted definition of group in networks (Flake et al., 2000). In social network analysis, whether a subset of actors in a network can be viewed as a group depends on the cohesion of the subset. A subset is a cohesive group if its members connect with each other through stronger or denser links than with actors outside the subset (Wasserman & Faust, 1994). This definition implies that groups are identified based on link weights in weighted networks and on link density in unweighted networks. In Web community research, a community is defined as a subset of nodes, each of which has at least as many links connecting with nodes in the same subset as it does with nodes in the rest of the network (Flake et al., 2000). This definition is equivalent to the strong community definition given in (Radicchi et al., 2004). Radicchi et al. (2004) also define a weak community, in which the total number of links connecting its members is greater than the number of links connecting its members with the rest of the nodes in the network. This chapter follows the definition of cohesive groups in SNA.
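The strong and weak community definitions above can be checked directly from an adjacency list. The following is a small illustrative sketch; the function names are mine, not from the cited papers.

```python
def is_strong_community(subset, adj):
    # Strong community (Flake et al., 2000; Radicchi et al., 2004):
    # every member has at least as many links inside the subset
    # as it has to the rest of the network.
    return all(
        sum(v in subset for v in adj[u]) >= sum(v not in subset for v in adj[u])
        for u in subset
    )

def is_weak_community(subset, adj):
    # Weak community (Radicchi et al., 2004): links inside the subset
    # outnumber links leaving the subset.
    internal = sum(sum(v in subset for v in adj[u]) for u in subset) // 2
    external = sum(sum(v not in subset for v in adj[u]) for u in subset)
    return internal > external
```

Every strong community is also weak under these definitions, but not vice versa.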

5.2.2 Determining Link Weights for Weighted Graphs

In weighted graphs, each link receives a weight that indicates the link's strength and intensity, or the similarity between the two nodes incident on the link. There are many ways to infer link weights between nodes in weighted networks. Roughly speaking, these methods can be categorized into two types: link intensity based and node similarity based.

Link intensity based methods represent the strength or weight of a link based on the frequency of interactions between the two incident nodes. For example, the weight of a friendship between two people can be approximated by the frequency with which they meet, make phone calls, or write emails to each other. In scientific collaboration networks, the weight of a collaboration link between two authors often is estimated using the number of times the two authors have published papers together (Barabási et al., 2002; Newman, 2001b). In Chapter 4, I used the co-occurrence weight (Chen & Lynch, 1992) to approximate the frequency with which two criminals committed crimes together (Xu & Chen, 2005).

Node similarity based methods infer the similarity between the properties of the two incident nodes. In SNA, the weight of a similarity link between two people can be estimated based on how similar they are in terms of their biographical, educational, and socioeconomic backgrounds. In document networks, where each node represents a document, the content similarity between documents is often used to approximate the weight of the similarity links between documents. The content similarity between documents can be measured by the Jaccard (Rasmussen, 1992) or Cosine coefficients, which are widely employed in information retrieval and document categorization applications.
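As a small illustration, the Jaccard and Cosine coefficients over two term lists can be computed as follows. This is a sketch only; real applications differ in tokenization and term weighting.

```python
import math

def jaccard(doc_a, doc_b):
    # Jaccard coefficient: shared terms over all distinct terms.
    a, b = set(doc_a), set(doc_b)
    return len(a & b) / len(a | b)

def cosine(doc_a, doc_b):
    # Cosine coefficient over raw term-frequency vectors.
    terms = set(doc_a) | set(doc_b)
    va = [doc_a.count(t) for t in terms]
    vb = [doc_b.count(t) for t in terms]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm
```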

Both the link intensity based and the node similarity based methods rely on information about the intrinsic properties of the links or nodes. They do not consider the structure of the network in which the nodes and links reside.

Given a weighted graph, hierarchical clustering algorithms can be used to find groups based on link weight. As a result, nodes in the same group have stronger links with each other. However, for graphs such as the World Wide Web, citation networks, and other networks where link weight is not available, the partition problem becomes more challenging.

5.2.3 Partitioning Unweighted Graph

Because link weight information is not available, methods in this category must rely on the graph structure for the partitioning task. As reviewed in Chapter 2, there are three types of unweighted graph partition methods: link analysis based, graph theoretical, and hierarchical clustering. Both link analysis based methods and graph theoretical approaches are proposed for graph partition in the Web context. They require seed nodes to find Web communities and are not appropriate for finding groups in general graphs.

Chapter 2 briefly reviewed recent developments in hierarchical clustering methods for partitioning general unweighted graphs. I provide more details about these new algorithms here.

5.2.3.1 Divisive Algorithms

Divisive algorithms treat a whole network as a single cluster at the beginning and progressively remove links until all links are removed. When deciding which link to remove at each step, the G-N algorithm (Girvan & Newman, 2002) selects the one with the highest edge betweenness. The algorithm is rather effective for identifying natural groups in various real networks (Girvan & Newman, 2002; Newman & Girvan, 2004; Radicchi et al., 2004). However, it is by no means an efficient algorithm and runs in O(m^2 n) time. As reviewed in Chapter 2, the lack of efficiency results from the algorithm's recomputation of edge betweenness in each iteration and its demand for global traversal of the graph. The algorithm becomes extremely slow when a network grows to a few thousand nodes (Newman, 2004c).

The alternative algorithm proposed by Radicchi et al. (2004) reduces the time complexity to O(<k>^2 m^2) (Newman, 2004b), where <k> is the average degree, by using the edge clustering coefficient (ECC) to approximate edge betweenness. The edge clustering coefficient of a link (i, j) is defined as

ECC_{ij} = (z_{ij} + 1) / min[(k_i - 1), (k_j - 1)], (5.1)

where z_{ij} is the number of triangles to which link (i, j) belongs, and the denominator is the number of triangles that could possibly include link (i, j). The numerator is z_{ij} + 1 to avoid the situation where link (i, j) does not belong to any triangle. Because the computation of ECC does not require global graph traversal, Radicchi's algorithm is slightly faster than the G-N algorithm. However, it is worse than the G-N algorithm in effectiveness (Radicchi et al., 2004). More importantly, this algorithm has three major disadvantages. First, the definition of ECC sometimes leads to degeneracy: when the degree of one of the incident nodes is 1, the denominator of equation (5.1) becomes 0, causing the ECC to be indeterminate. Second, the algorithm relies on the existence of a large number of triangles in the network; for networks containing few triangles, such as nonsocial networks, the algorithm will fail to find groups (Newman, 2004b, c). Third, although the algorithm runs faster than the G-N algorithm, its time complexity is still rather high.
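Under the definitions above, equation (5.1) and its degenerate case can be sketched in a few lines. The adjacency-set encoding and the None return for the indeterminate case are my own choices, not the dissertation's:

```python
def edge_clustering_coefficient(adj, i, j):
    """ECC of link (i, j), equation (5.1): (z_ij + 1) / min(k_i - 1, k_j - 1),
    where z_ij counts the triangles containing the link. Returns None in
    the indeterminate case (a degree-1 endpoint makes the denominator 0)."""
    z = len(adj[i] & adj[j])                       # triangles through (i, j)
    denom = min(len(adj[i]) - 1, len(adj[j]) - 1)
    return None if denom == 0 else (z + 1) / denom

# Toy graph: triangle {a, b, c} with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(edge_clustering_coefficient(adj, "a", "b"))  # (1 + 1) / min(1, 1) = 2.0
print(edge_clustering_coefficient(adj, "c", "d"))  # None: node d has degree 1
```

The pendant link (c, d) illustrates the first disadvantage discussed above: its denominator is 0, so no ECC value can be assigned.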

5.2.3.2 Agglomerative Algorithm

In order to improve the efficiency of hierarchical clustering algorithms, Newman (2004c) proposes an agglomerative approach based on the modularity measure. The modularity Q of a graph is defined as

Q = Σ_i (e_ii - a_i^2), (5.2)

where e_ij is the fraction of links in the graph that connect nodes in cluster i to nodes in cluster j, and a_i = Σ_j e_ij, so that a_i^2 is the expected value of e_ii if nodes are randomly connected (Newman, 2004c). The modularity indicates how much the graph structure deviates from a random graph, in which no significant community structure exists. Q is 0 if the number of within-group links is no more than would be expected by random chance (Newman, 2004c). At each step, the algorithm seeks the pair of clusters whose merger results in the largest increase, or smallest decrease, in Q. The best partition can be obtained by finding the maximal value of Q along the resulting dendrogram. This is a relatively fast algorithm with O((m + n)n) time complexity, and its effectiveness has been shown to be comparable to that of the G-N algorithm. Both the G-N and Radicchi's algorithms are iterative procedures that must update the edge betweenness or ECC of links in each round; the modularity based algorithm is a "single-pass algorithm" that does not require iterative updates of link weights.

In summary, the major problem facing unweighted graph partition is efficiency. Most existing hierarchical clustering algorithms suffer from high time complexity. Some algorithms, such as Radicchi's, slightly improve efficiency at the cost of effectiveness. It is desirable to develop methods that help balance effectiveness and efficiency based on the demands of real-world applications. In the next section I propose the edge local density measure, which is potentially helpful for addressing this problem.


5.3 The Proposed Approach: Local Density Based Partition Algorithms

To address the problems of existing algorithms, especially Radicchi's algorithm, I propose a new measure called edge local density for unweighted graphs.

5.3.1 Defining Edge Local Density

The edge local density measure is derived from graph link density. Recall that the link density of an undirected graph is defined as (Wasserman & Faust, 1994)

d = m / [n(n - 1)/2]. (5.3)

It is the number of links actually present in a network divided by the total possible number of links. The value of d is between 0 and 1. The link density is 1 for a complete graph, in which every node is connected with all other nodes; a complete graph is also called a clique (Wasserman & Faust, 1994).

Consider a subgraph representing a group. According to the cohesive group definition, the density of within-group links should be greater than that of between-group links. This implies that every within-group link is involved in a densely-knit neighborhood of links while the between-group links are relatively sparse. Based on this rationale I propose edge local density for measuring the potential of a link (i, j) to be involved in a cohesive group:

LD_{ij} = (m_ij + c_ij) / [n_ij(n_ij - 1)/2], (5.4)

where n_ij is the total number of nodes in the neighborhood of the link (i, j), m_ij is the total number of links in the neighborhood, and c_ij is the number of common neighbors of nodes i and j. The neighborhood of the link (i, j) includes all nodes that are adjacent to node i or node j, including i and j themselves. The denominator of equation (5.4) is the number of possible links in the neighborhood.

Note that the value of LD_{ij} can be greater than 1 because of the additional term c_ij. For example, the local density of links in a clique is 1 + (n - 2)/[n(n - 1)/2], because the maximum number of common neighbors that two nodes can share is n - 2. The reason for adding this extra term is the observation that nodes in the same group often share many common neighbors, while two nodes belonging to different groups share few or no common neighbors. As a result, the local densities of within-group links are raised further and those of between-group links are lowered further.

With this local density measure all originally unweighted links receive weights reflecting their local link structures. Thus, nodes in densely-knit groups are connected by strong links and nodes from different groups are separated by weak links, as illustrated in Figure 5.1. Note that this measure differs from similarity based and link intensity based weights because it relies entirely on the structure of the network rather than on the properties of nodes or links. In addition, calculating the measure requires knowledge only of the local structure rather than the global structure of the whole network.


Figure 5.1: The transformation of an unweighted graph into a weighted graph using the edge local density measure. (a) The unweighted graph. (b) The transformed weighted graph which can be divided into two densely-knit groups.
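Equation (5.4) depends only on the immediate neighborhood of a link, so it can be computed locally. The sketch below (the adjacency-set encoding is my own) reproduces the three link weights derived for the Clique-Bridge-Clique case in the next section:

```python
def edge_local_density(adj, i, j):
    """Edge local density (equation 5.4):
    LD_ij = (m_ij + c_ij) / (n_ij * (n_ij - 1) / 2),
    where the neighborhood holds every node adjacent to i or j (plus i
    and j), m_ij counts links among neighborhood nodes, and c_ij counts
    common neighbors of i and j."""
    nbhd = adj[i] | adj[j] | {i, j}
    n = len(nbhd)
    m = sum(1 for u in nbhd for v in adj[u] if v in nbhd and u < v)
    c = len(adj[i] & adj[j])
    return (m + c) / (n * (n - 1) / 2)

# Network of Figure 5.2a: cliques {1..5} and {6..9} joined by link (5, 6).
adj = {v: set() for v in range(1, 10)}
for grp in ({1, 2, 3, 4, 5}, {6, 7, 8, 9}):
    for a in grp:
        for b in grp:
            if a != b:
                adj[a].add(b)
adj[5].add(6); adj[6].add(5)

print(edge_local_density(adj, 1, 2))  # insider-insider link: 13/10 = 1.3
print(edge_local_density(adj, 1, 5))  # insider-gatekeeper link: 14/15
print(edge_local_density(adj, 5, 6))  # bridge link: 17/36
```

The within-group link exceeds 1 thanks to the common-neighbor term, while the bridge link receives the lowest weight, exactly the separation the measure is designed to produce.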

5.3.2 Illustrating Edge Local Density

In this section I illustrate how edge local density works in different situations and compare it with ECC (Radicchi et al., 2004). As mentioned in Section 5.2.3.1, a major disadvantage of ECC is that it works only for networks containing many triangles. For tree-structured networks that contain many degree-1 nodes, Radicchi's algorithm will fail to find natural groups. In contrast, the local density measure does not depend on the presence of triangles and can help find groups in more generic networks. The following five cases illustrate how local density assigns different weights to within-group and between-group links, and the conditions necessary for the local density measure to outperform ECC.


Figure 5.2: The five illustrative cases for edge local density. (a) Clique-Bridge-Clique. (b) Tree-Bridge-Tree. (c) Clique-Bridge-Tree. (d) Clique-Clique. (e) Clique-Tree.

Case 1: Clique-Bridge-Clique

This case represents a situation where two densely-knit groups are connected by a few bridge links (see Figure 5.2a). For simplification, I make two assumptions: (a) both groups are cliques, i.e., all within-group nodes are fully connected, and (b) there is only a single bridge link between the two groups. In real networks these two assumptions will not always hold: groups are not always complete, and there may be many links between groups.


Let G_1 = (V_1, A_1) and G_2 = (V_2, A_2) be two groups connected by a single link. Because both G_1 and G_2 are cliques, m_1 = n_1(n_1 - 1)/2 and m_2 = n_2(n_2 - 1)/2. In addition, it is assumed that n_1 ≥ 3, n_2 ≥ 3, and n_1 ≥ n_2. The graph reduces to a trivial chain structure when n_1 < 3 and n_2 < 3.

In Figure 5.2a, G_1 contains nodes 1-5 and G_2 contains nodes 6-9; n_1 = 5 and n_2 = 4.

Between G_1 and G_2 there is a bridge link (5, 6), with nodes 5 and 6 acting as gatekeepers. The other nodes, nodes 1-4 and nodes 7-9, are "insiders." There are three types of links in the network: within-group insider-insider links (bold lines), within-group insider-gatekeeper links (dashed lines), and the between-group gatekeeper-gatekeeper link (dotted line). Obviously, links of the same type in a group have the same local density value; for example, link (1, 3) and link (2, 4) are equally weighted. Based on the definition of edge local density, the weights of the three types of links are as follows:

Insider-insider links. Taking link (1, 2) as an example, the neighborhood of nodes 1 and 2 includes nodes 1-5. Thus the local density is

LD_{1,2} = [n_1(n_1 - 1)/2 + (n_1 - 2)] / [n_1(n_1 - 1)/2] = 13/10.

Insider-gatekeeper links. Taking link (1, 5) as an example, the neighborhood of nodes 1 and 5 includes nodes 1-6. Thus the local density is

LD_{1,5} = [n_1(n_1 - 1)/2 + 1 + (n_1 - 2)] / [(n_1 + 1)n_1/2] = 14/15.


Gatekeeper-gatekeeper link. There is only one between-group link, link (5, 6). Its neighborhood includes all nodes in the network, since nodes 5 and 6 together connect to all other nodes. The local density is

LD_{5,6} = [n_1(n_1 - 1)/2 + n_2(n_2 - 1)/2 + 1] / [(n_1 + n_2)(n_1 + n_2 - 1)/2] = 17/36.

It is expected that the strongest links are the insider-insider links and the weakest link is the bridge link, that is, LD_{1,2} > LD_{1,5} > LD_{5,6}. Because LD_{1,2} > 1, LD_{1,5} < 1, and LD_{5,6} < 1, we have LD_{1,2} > LD_{1,5} and LD_{1,2} > LD_{5,6}. However, we do not necessarily have LD_{1,5} > LD_{5,6}, because

LD_{1,5} - LD_{5,6} = (n_1 - 1)(n_1 + 2) / [n_1(n_1 + 1)] - [n_1(n_1 - 1) + n_2(n_2 - 1) + 2] / [(n_1 + n_2)(n_1 + n_2 - 1)],

and the numerator of this difference over a common (positive) denominator cannot be guaranteed to be greater than 0 for arbitrary values of n_1 and n_2, although LD_{1,5} is greater than LD_{5,6} in this particular example. This means that the bridge link is not necessarily the weakest. On the other hand, it is easy to show that ECC distinguishes well between within-group links and between-group links:

ECC_{1,2} = ECC_{1,5} = (n_1 - 1)/(n_1 - 2) = 4/3 > 1,
ECC_{5,6} = 1/min(n_1 - 1, n_2 - 1) = 1/3 < 1,

so that ECC_{1,2} = ECC_{1,5} > ECC_{5,6}. This is one of the situations where local density is worse than ECC.


Case 2: Tree-Bridge-Tree

In this situation two trees are connected by a single bridge link (see Figure 5.2b). There are only two types of links in this network: insider-gatekeeper links and gatekeeper-gatekeeper links. Let us assume n_1 ≥ 2, n_2 ≥ 2, and n_1 ≥ n_2. Because there are no insider-insider links in either group, m_1 = n_1 - 1 and m_2 = n_2 - 1. Again using link (1, 5) and link (5, 6) as examples, the local densities of the two types of links are:

Insider-gatekeeper links:

LD_{1,5} = [(n_1 - 1) + 1] / [(n_1 + 1)n_1/2] = 2/(n_1 + 1) = 1/3.

Gatekeeper-gatekeeper link:

LD_{5,6} = [(n_1 - 1) + (n_2 - 1) + 1] / [(n_1 + n_2)(n_1 + n_2 - 1)/2] = 2/(n_1 + n_2) = 2/9.

Since n_2 ≥ 2, we have

LD_{1,5} - LD_{5,6} = 2(n_2 - 1) / [(n_1 + 1)(n_1 + n_2)] > 0.

This implies that the two groups are connected by a relatively weak bridge link. The ECCs for insider-gatekeeper links are indeterminate since the denominators are 0. Therefore, for networks containing no cyclic structure, ECC and Radicchi's algorithm will fail.

Case 3: Clique-Bridge-Tree

In this case a clique is connected with a tree through a bridge (see Figure 5.2c). Let us assume n_1 ≥ 3, n_2 ≥ 2, and n_1 ≥ n_2. We also have m_1 = n_1(n_1 - 1)/2 and m_2 = n_2 - 1. The local densities of the three types of links are:


Insider-insider links. This type of link exists only in G_1. The values are the same as in Case 1.

Insider-gatekeeper links. There are two subtypes in this category: the links in G_1 and the links in G_2. The local densities of the insider-gatekeeper links in G_1 are the same as in Case 1. The local density of an insider-gatekeeper link in G_2, such as link (6, 8), is

LD_{6,8} = [(n_2 - 1) + 1] / [(n_2 + 1)n_2/2] = 2/(n_2 + 1) = 2/5.

Gatekeeper-gatekeeper link. The local density of the bridge link differs from that in Case 1 and is given by

LD_{5,6} = [n_1(n_1 - 1)/2 + n_2] / [(n_1 + n_2)(n_1 + n_2 - 1)/2] = 7/18.

As in Case 1, it is obvious that the local densities of insider-insider links are greater than those of the other two types of links. It can also be shown that the numerator of the difference LD_{1,5} - LD_{5,6}, taken over a common positive denominator, is always greater than 0. This means that the insider-gatekeeper links in the clique are guaranteed to be stronger than the bridge link.

In addition, the sign of the numerator of the difference LD_{6,8} - LD_{5,6} depends on the relative sizes of the two groups: it is positive only when the clique is small enough relative to the tree.

The insider-gatekeeper links in G_2 will be stronger than the bridge link only if the size of the clique is less than 9; when the clique becomes larger, the bridge link becomes stronger. This is another situation where local density causes some problems. The ECC does not work either, because it is indeterminate for the insider-gatekeeper links in the tree group.

Case 4: Clique-Clique

Sometimes a popular node may belong to multiple groups at the same time, as illustrated in Figure 5.2d, where a common gatekeeper sits between two cliques. In this example, n_1 = 5 and n_2 = 3. Because there is no bridge link, there are only two types of links, whose densities are:

Insider-insider links. The values are the same as in Case 1.

Insider-gatekeeper links. There are two subtypes in this category: the links in G_1 and the links in G_2. For the network in Figure 5.2d, the local density of the insider-gatekeeper links in G_1, such as link (1, 5), is LD_{1,5} = 17/28, and the local density of the insider-gatekeeper links in G_2, such as link (5, 6), is LD_{5,6} = 15/28.

Because n_1 > n_2, we have LD_{1,5} > LD_{5,6}. Thus, based on local density the gatekeeper is more strongly connected with the larger clique than with the smaller clique. The values of ECC in G_1 and G_2 are 4/3 and 3/2, respectively, so based on ECC the gatekeeper is more strongly connected with the smaller clique.

Case 5: Clique-Tree

In this case, one clique in the network is replaced by a tree (Figure 5.2e). It is easy to show that LD_{1,5} > LD_{5,6}. Similar to Case 4, the gatekeeper is more strongly associated with the clique than with the tree. Again, the ECCs for the links in the tree cannot be determined.

The five cases cover the general situations in which local density can be compared with ECC. I omit the Tree-Tree situation because two trees connected by a common gatekeeper can be viewed as a single tree. Local density's advantage over ECC is that it is not limited to networks containing cyclic structures; it can be applied to sparser networks, such as trees, which contain no cycles. However, in certain situations, such as Cases 1 and 3, local density is not guaranteed to reflect the structural role of links optimally.

5.3.3 Clustering Unweighted Graphs based on Edge Local Density

The local density measure can be used in two types of hierarchical clustering methods:

Single-Pass Agglomerative Method. "Single pass" means link weights are computed only once. The idea is to transform an unweighted graph into a weighted one using edge local density. After the transformation each link carries a local density based weight, so the weighted graph can be partitioned using agglomerative clustering algorithms (Day & Edelsbrunner, 1984; Jain & Dubes, 1988).


Agglomerative methods merge the most strongly related nodes first. Consider the five cases above. The insiders in the clique groups in all five cases, and the nodes in the tree groups in Case 2, are connected by the strongest links, so these nodes are merged first. The gatekeepers are then added to the cliques they belong to. Finally, the two separate groups merge and the whole network becomes one single cluster. Note that the gatekeeper in Case 4 joins the larger clique first. This is rather intuitive: a popular node belonging to multiple groups may be considered a member of the largest group it belongs to. However, it is possible that the two gatekeepers in Cases 1 and 3 merge together before they are added to their own groups, because of the problems discussed in those cases.

The clustering algorithm selected is the same as the one used in Chapter 4: the reciprocal nearest neighbor (RNN) based complete-link algorithm (Murtagh, 1984), which has O(n^2) time complexity. The additional step of calculating the local densities of all links in the network takes O(<k>^2 m) time, assuming every node maintains its own list of neighbors. Therefore, the overall running time is O(<k>^2 m + n^2), faster than all existing hierarchical algorithms.

Iterative Divisive Method. Like existing divisive methods such as the G-N algorithm (Girvan & Newman, 2002) and Radicchi's algorithm (Radicchi et al., 2004), this divisive method recomputes the local densities of all links and removes the weakest links in each iteration. The bridge links in the five cases, except for Cases 1 and 3, will be removed first, breaking each network into two groups. The time complexity of this iterative method is the same as that of Radicchi's algorithm, O(<k>^2 m^2) (Newman, 2004b).

For brevity, I call these two local density based methods sLD (single-pass) and iLD (iterative), respectively. The sLD method is expected to provide higher efficiency with compromised yet acceptable effectiveness. The iLD method is expected to provide higher effectiveness than Radicchi's algorithm and higher efficiency than the G-N algorithm.

To evaluate the performance of the local density based clustering methods I conducted a series of experiments. These experiments were intended to answer the following research questions:

How does the edge local density measure perform compared with the edge clustering coefficient (ECC) measure?

How do the local density based clustering methods, sLD and iLD, perform compared with existing hierarchical clustering methods?

How should an appropriate method be chosen for the different effectiveness and efficiency demands of real applications?


5.4.1 Performance Metrics

The experiments tested the performance of the proposed methods using simulated network data. The two performance metrics were effectiveness and efficiency. In effectiveness testing, the community structure was predetermined so that the effectiveness of a partition could be objectively measured. The effectiveness metrics used in this chapter were clustering precision, recall, F value, and accuracy. Precision, recall, and F value are frequently used in information retrieval applications and have been used for evaluating clustering effectiveness in document categorization applications (Roussinov & Chen, 1999). In this evaluation, the predetermined partition was called the true partition and the algorithm-generated partition was called the algorithm partition. A node pair was considered correct if it was in both the algorithm partition and the true partition. An incorrect node pair was in the algorithm partition but not in the true partition; that is, the two nodes placed in the same group by the algorithm actually belonged to different groups. A missed node pair was in the true partition but not in the algorithm partition; that is, the two nodes separated into different groups by the algorithm were actually in the same group. The clustering

That is, the two nodes separated into different groups by the algorithm were actually in the same group. The clustering

precision

and

recall

were defined as

Precision

=

Number of correct node pairs

Number of correct node pairs

+ number of incorrect node pairs

, (5.5)

Recall

=

Number of

Number of correct node pairs correct node pairs

+ number of missed node pairs

. (5.6)

146

The precision reflected how accurate a clustering algorithm was, and the recall reflected how well the algorithm captured the correct pairs. Because precision can be increased by compromising recall and vice versa, the F value was used to reflect the combined effect of precision and recall (Shaw et al., 1997):

F value = 2 × Precision × Recall / (Precision + Recall). (5.7)

Because the true partition was known in this evaluation, each node in the network received a label indicating its group membership. Thus, effectiveness was also evaluated by measuring the percentage of nodes that were assigned correct labels. The clustering accuracy was defined as

Accuracy = (number of correctly classified nodes) / (total number of nodes). (5.8)

In real applications the true partition is often unknown so that nodes do not have associated class labels. In addition, the number of clusters in a network often has to be determined subjectively. In these cases, accuracy cannot be used for evaluating clustering effectiveness.

The efficiency of an algorithm was defined as its running time.


5.4.2 Hypotheses

For effectiveness testing, I compared the two local density based methods, sLD and iLD, with several existing algorithms: the G-N algorithm (Girvan & Newman, 2002), Radicchi's iterative ECC based algorithm (iECC) (Radicchi et al., 2004), and the modularity based algorithm (Newman, 2004c). In addition, to compare the performance of edge local density and ECC directly, I also included a single-pass ECC method (sECC), which clustered a transformed graph with ECC-based link weights. In the implementation, links with indeterminate ECCs were assigned a large constant weight.

There were four categories of hypotheses corresponding to the four effectiveness metrics: precision, recall, F value, and accuracy. Table 5.1 lists the detailed hypotheses for the precision category; the detailed hypotheses for the other three metrics are omitted because they mirror the precision hypotheses with only the metric names changed.

Hypotheses H1.1-H1.5 focused on the precision of the sLD method and H1.6-H1.8 on the iLD method. The rationale behind these hypotheses was as follows:

The local density based methods, sLD and iLD, were expected to be more effective than the ECC based methods, sECC and iECC (H1.1 and H1.6), because in most of the five cases discussed above local density could better distinguish between different types of links than ECC.


H1: Clustering Precision
H1.1: The sLD method will achieve higher precision than the sECC method.
H1.2: The sLD method will achieve lower precision than the iECC method.
H1.3: The sLD method will achieve lower precision than the iLD method.
H1.4: The sLD method will achieve lower precision than the G-N algorithm.
H1.5: The sLD method will achieve lower precision than the modularity-based algorithm.
H1.6: The iLD method will achieve higher precision than the iECC method.
H1.7: The iLD method will achieve comparable precision with the G-N algorithm.
H1.8: The iLD method will achieve comparable precision with the modularity-based algorithm.
H2: Recall (detailed hypotheses similar to H1.1-H1.8)
H3: F value (detailed hypotheses similar to H1.1-H1.8)
H4: Accuracy (detailed hypotheses similar to H1.1-H1.8)

Table 5.1: Hypotheses regarding clustering effectiveness.

The single-pass method, sLD, would be less effective than the iterative methods, iLD and iECC (H1.2 and H1.3), because the single-pass method computed link weights only once; it did not recalculate link weights each time the dendrogram was updated.

Both the sLD and iLD methods were expected to be less effective than the G-N algorithm (H1.4 and H1.7), which had been shown to outperform all existing hierarchical clustering methods (Newman, 2004c; Radicchi et al., 2004).

No research had systematically evaluated the effectiveness of the modularity based algorithm. H1.5 and H1.8 thus predicted that the modularity based algorithm would achieve comparable effectiveness with iLD but higher effectiveness than sLD.

I compared the efficiency of only three algorithms: sLD, iLD, and the modularity based algorithm (Newman, 2004c). Their time complexities were O(<k>^2 m + n^2), O(<k>^2 m^2), and O((m + n)n), respectively. The single-pass algorithm, sLD, was expected to have the highest efficiency. I did not include sECC and iECC in the comparison because calculating local density takes the same time as calculating ECC.

H5: The sLD method achieves higher efficiency than the iECC algorithm.

H6: The sLD method achieves higher efficiency than the modularity based algorithm.

5.4.3 Results and Discussion

5.4.3.1 Effectiveness

I considered the simulated networks used in previous studies (Girvan & Newman, 2002; Newman, 2004c; Radicchi et al., 2004) for effectiveness testing. The network consisted of 128 nodes divided into four groups of equal size. The average degree was set to 16. Nodes in the same group were connected with probability p_in, and nodes in different groups with probability p_out. The two parameters p_in and p_out control the structure of the network.

Figure 5.3 presents three illustrative networks corresponding to high, medium, and low p_out/p_in ratios. When p_out is rather small compared with p_in, the groups are well separated, with only a few links connecting the densely-knit groups (Figure 5.3a). As p_out increases, the boundaries of the groups become more "blurred" and it is harder to identify the groups (Figure 5.3b). When p_in and p_out are equal (p_in = p_out = 16/127 ≈ 0.125 in this example), the network becomes totally random and no groups exist (Figure 5.3c).

Figure 5.3: Three illustrative networks with different p_out/p_in ratios. (a) p_out/p_in = 0.02 (p_out = 0.01). (b) p_out/p_in = 0.14 (p_out = 0.05). (c) p_out/p_in = 1.0 (p_out = 0.125).

For each specific value of p_out I generated 30 networks, which were clustered using the six methods: sLD, sECC, iLD, iECC, G-N, and modularity. Because there were four groups in the true partition, the dendrograms generated by these algorithms were cut at the level where the network was divided into four clusters. The effectiveness metrics were recorded and plotted against the p_out/p_in ratio (Figure 5.4).

In addition, a series of paired t-tests was performed to test the hypotheses. Table 5.2 provides the mean values of the four effectiveness metrics for the six methods, and Table 5.3 summarizes the results of the hypothesis testing.


Figure 5.4: Effectiveness results of the six clustering methods: sLD, sECC, iLD, iECC, G-N, and modularity. (a) Precision. (b) Recall. (c) F value. (d) Accuracy.

            Precision    Recall       F value      Accuracy
sLD         0.59 (0.32)  0.54 (0.34)  0.56 (0.33)  0.67 (0.27)
sECC        0.54 (0.30)  0.48 (0.30)  0.50 (0.30)  0.62 (0.25)
iLD         0.73 (0.31)  0.67 (0.36)  0.70 (0.34)  0.79 (0.25)
iECC        0.67 (0.31)  0.58 (0.35)  0.61 (0.34)  0.76 (0.24)
G-N         0.71 (0.31)  0.59 (0.40)  0.63 (0.38)  0.88 (0.13)
Modularity  0.68 (0.31)  0.64 (0.33)  0.66 (0.32)  0.75 (0.26)

Table 5.2: Mean values of the effectiveness metrics of the six methods. Numbers in parentheses are standard deviations.

                                   Precision  Recall  F value  Accuracy
.1: sLD better than sECC           H1.1       H2.1    H3.1     H4.1
.2: sLD worse than iECC            H1.2       H2.2    H3.2     H4.2
.3: sLD worse than iLD             H1.3       H2.3    H3.3     H4.3
.4: sLD worse than G-N             H1.4       H2.4    H3.4     H4.4
.5: sLD worse than modularity      H1.5       H2.5    H3.5     H4.5
.6: iLD better than iECC           H1.6       H2.6    H3.6     H4.6
.7: iLD comparable with G-N        H1.7       H2.7    H3.7     H4.7
.8: iLD comparable with modularity H1.8       H2.8    H3.8     H4.8

Table 5.3: Summary of hypothesis testing results for effectiveness. Shaded cells indicate confirmed hypotheses and blank cells hypotheses that were not confirmed. All differences are significant at p < 0.001.

Precision. Hypotheses H1.1-H1.6 were supported. H1.7 and H1.8 were not supported because the precision of iLD was significantly higher than those of the G-N algorithm and the modularity based algorithm. The iLD method significantly outperformed all other methods by identifying more correct node pairs and fewer incorrect node pairs. In particular, local density performed better than ECC in both the single-pass and iterative methods, which means that local density is a better measure for approximating link weights than ECC.

The sLD method performed worse than G-N, modularity, and iECC. There are two possible reasons for this. First, both the G-N algorithm and the iECC method recalculated link weights each time a link was removed; the recalculated weights reflected the changes in structure and helped improve performance. Second, both the G-N and the modularity based algorithms depended on knowledge of the global structure, whereas local density relied only on the local link structure.

Recall. Hypotheses H2.1-H2.3, H2.5, and H2.6 were supported. Similar to the precision results, the iLD method achieved significantly higher recall than all the other methods, which missed more correct node pairs. The sLD method performed better than the sECC method and worse than the modularity based algorithm, iLD, and iECC. Note that the sLD method achieved recall comparable with the G-N algorithm (H2.4 thus was not supported). Figure 5.3b shows that when p_out/p_in > 0.31, the G-N algorithm was worse than the sLD method, causing its average recall to be comparable with that of the sLD method.

F value. Consistent with the precision results, hypotheses H3.1-H3.6 were supported. The iLD appeared to be the best method, and the sLD method outperformed sECC.

Accuracy. Hypotheses H4.1-H4.6 were supported. H4.7 was not supported because the accuracy of the G-N algorithm was significantly higher than that of the iLD method. The iLD method was more accurate than the modularity based algorithm; thus, H4.8 was not supported either.

An interesting pattern emerged in Figure 5.4. Overall, these methods performed equally well for low p_out/p_in ratios and equally poorly when the ratio approached 1. The difference was most significant for medium values of the ratio. The only exception was the significantly higher accuracy of the G-N algorithm across all ratios.

The second highest accuracy was achieved by the iLD method. For precision, recall, and F value, the medium range of the p_out/p_in ratio was roughly between 0.1 and 0.4. The sECC method appeared to be the worst method in terms of all four metrics.

5.4.3.2 Efficiency

To test the efficiency of the different methods I used the testing networks from (Radicchi et al., 2004). A series of random networks with increasing sizes was generated. For each size n, 30 networks were generated and clustered using sLD, iLD, and modularity. The networks were generated using a Java program running on a desktop computer with a 2.8-GHz CPU. The average running time for each algorithm was recorded and plotted in Figure 5.5. The mean running times for the sLD, iLD, and modularity based methods are reported in Table 5.4.
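The benchmark setup can be sketched as follows: generate planted-partition random networks with within-group link probability p_in and between-group probability p_out, then time each algorithm on a series of increasing sizes (e.g., with time.perf_counter). The generator below is a minimal stand-in for the test networks of Radicchi et al. (2004); the parameter values are arbitrary.

```python
import random

def planted_partition(n, groups, p_in, p_out, seed=0):
    """Test network: n nodes split into equal groups; a link appears with
    probability p_in inside a group and p_out between groups."""
    rng = random.Random(seed)
    label = {v: v % groups for v in range(n)}
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < (p_in if label[u] == label[v] else p_out):
                adj[u].add(v)
                adj[v].add(u)
    return adj, label

adj, label = planted_partition(128, 4, 0.30, 0.05, seed=42)
within = sum(1 for u in adj for v in adj[u] if u < v and label[u] == label[v])
between = sum(1 for u in adj for v in adj[u] if u < v and label[u] != label[v])
print(within > between)  # True: dense inside groups, sparse between them
```

Varying the p_out/p_in ratio from near 0 (salient communities) toward 1 (essentially random) reproduces the difficulty spectrum used in the effectiveness experiments above.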

Both hypotheses H5 and H6 were supported with p-values less than 0.001. Figure 5.5 shows that sLD was the fastest method because it required only a single-pass calculation of link weights. The sLD algorithm was faster than the modularity based method because it relied on the local link structure, while the modularity of a network must be evaluated based on the global structure. The iLD was also slower than sLD due to its iterative nature.

Network size      sLD        iLD        Modularity
n ≤ 100           157.0      37.4       103.2
100 < n ≤ 10^3    1,185.6    4,539.3    2,205.9
10^3 < n ≤ 10^4   191,192.8  801,811.0  494,119.2

Table 5.4: Mean running times (in seconds) of sLD, iLD, and the modularity based methods.

Figure 5.5: The efficiency of the sLD, iLD, modularity based, and G-N algorithms (running time against network size n).

In Figure 5.5, the running time of the G-N algorithm is also shown for networks with fewer than 1,000 nodes. Its running time increased very quickly as networks grew; it was much less scalable than the other three methods.

The networks in the efficiency testing were rather sparse, that is, <k>^2 << n. When <k>^2 approaches n, the modularity based method and sLD will achieve similar efficiency.

In summary, the local density measure was better than ECC in approximating link weights. Compared with existing clustering algorithms, the two local density based methods, sLD and iLD, also achieved promising performance in terms of effectiveness and efficiency.

The performance experiments also suggest guidance for selecting appropriate clustering methods in different situations:

- For networks that have salient community structures (p_out/p_in close to 0) or that are rather random (p_out/p_in close to 1), all the algorithms except sECC are almost equally effective or ineffective. Thus, the fastest algorithm, sLD, can be used to find groups in such networks.

- Within the medium range of p_out/p_in, the choice of algorithm depends on the demands of the particular application:

  o If the application requires a fast algorithm that can partition large networks with compromised yet acceptable effectiveness, the sLD algorithm is a good choice;

  o If efficiency is not the major concern and the network size is relatively small, the iLD method outperforms Radicchi's algorithm (iECC), the G-N algorithm, and the modularity based algorithm in terms of clustering precision, recall, and F value. In addition, it takes significantly less time than the G-N algorithm and the modularity based algorithm.


5.5 Conclusions

In this chapter I propose the edge local density measure to approximate link weights based on the structure of unweighted graphs. When the local density is used in a single-pass clustering algorithm, the unweighted graph is transformed into a weighted graph in which each link receives a weight reflecting its local link density. Agglomerative methods such as the complete-link algorithm can then partition the transformed graph. When the local density is used in an iterative method, the local density based link weights are updated in each iteration.
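The single-pass transform can be sketched as follows. The exact definition of the edge local density measure is given earlier in this chapter; the weight used here (the fraction of possible links present in the joint neighborhood of a link's endpoints) is only an illustrative stand-in, and the toy graph is invented.

```python
from itertools import combinations

def local_density_weights(adj):
    """Single-pass transform: give each link of an unweighted graph a weight
    reflecting the link density around its two endpoints (an illustrative
    stand-in for the chapter's edge local density measure)."""
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected link once
                hood = adj[u] | adj[v] | {u, v}
                possible = len(hood) * (len(hood) - 1) / 2
                actual = sum(1 for a, b in combinations(sorted(hood), 2)
                             if b in adj[a])
                weights[(u, v)] = actual / possible
    return weights

# Two triangles joined by a single bridge link (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
w = local_density_weights(adj)
print(all(w[(2, 3)] <= x for x in w.values()))  # True: the bridge scores lowest
```

An agglomerative method such as complete-link can then treat the weights as similarities, so low-weight links like the bridge are the last to be merged across.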

The performance evaluation shows that local density was a better measure than the edge clustering coefficient, which may fail to find groups in a graph with few or no triangles. The single-pass algorithm based on local density was more effective than the ECC based algorithm and more efficient than iterative algorithms such as the G-N algorithm and Radicchi's algorithm. The iterative algorithm based on this measure outperformed all existing algorithms in terms of clustering precision and recall. These two local density based methods strike a better balance between effectiveness and efficiency than existing algorithms.

This chapter contributes to research on the unweighted graph partitioning problem by not only proposing the new measure but also providing guidance for selecting appropriate clustering algorithms in different situations.


Future research needs to evaluate the new measure's performance in real networks such as the World Wide Web and citation networks.


CHAPTER 6: THE TOPOLOGICAL PROPERTIES OF DARK NETWORKS

6.1 Introduction

In recent years scientists have revealed the topological properties of a wide variety of complex systems characterized as large-scale networks (Albert & Barabási, 2002), such as scientific collaboration networks (Newman, 2001b, 2004a), the World Wide Web (Albert et al., 1999), the Internet (Faloutsos et al., 1999), electric power grids (Watts & Strogatz, 1998), food webs (Garlaschelli et al., 2003), and biological networks (Jeong et al., 2000), among many others. Despite the tremendous variation in their components, functions, and sizes, these networks are surprisingly similar in topology (e.g., the power-law degree distribution (Albert & Barabási, 2002; Wasserman & Faust, 1994)). This leads to a conjecture that complex systems are governed by a ubiquitous self-organizing principle (Albert & Barabási, 2002). One missing piece in this picture, however, is the analysis of the topology of "dark" networks (Raab & Milward, 2003), which are hidden from view yet could bring devastating impact to our society and economy, analogous to the "dark matter" in the galaxy. Terrorist networks, drug-trafficking rings, arms smuggling networks, gang networks, and many other covert networks are all dark networks. The structures of dark networks are largely unknown due to the difficulty of collecting and accessing reliable data (Krebs, 2001). Do dark networks share the same topological properties with other types of networks? Do they follow the same organizing principle? How do they achieve efficiency under constant surveillance and threats from authorities? How robust are they against attacks? In this chapter I report the topological properties of several covert criminal- or terrorist-related networks. I hope not only to contribute to general knowledge of the topological properties of complex systems in a hostile environment but also to provide authorities with insights regarding disruptive strategies.

The remainder of this chapter is organized as follows. In Section 6.2 I briefly review existing network models and their linkages to the function of complex systems. In Section 6.3 I introduce the four terrorist- and criminal-related covert networks under study and the methods used to collect the data. In Section 6.4 I report the statistical properties of these four networks, test their robustness, and suggest some disruptive strategies. Section 6.5 summarizes the results and points to future research directions.

As reviewed in Chapter 2, network topology has been studied using three models: the random graph model (Bollobás, 1985; Erdös & Rényi, 1960), the small-world model (Watts & Strogatz, 1998), and the scale-free model (Barabási & Albert, 1999). Random networks are characterized by small average path lengths and low clustering coefficients. The degree distribution of a random graph follows the Poisson distribution (Bollobás, 1985). A small-world network also has a small average path length relative to its size but has a rather high tendency to form clusters and groups (Watts & Strogatz, 1998). The degree distribution of scale-free networks (Barabási & Albert, 1999) follows a power law, a skewed distribution that significantly deviates from the Poisson distribution. The power-law distribution takes the form

P(k) ~ k^γ,    (6.1)

where P(k) is the degree distribution indicating the probability that a randomly selected node has exactly k links, and γ is the exponent of the distribution, which often takes on a value between -2.0 and -3.0 (Albert & Barabási, 2002).
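Given an observed degree sequence, the exponent γ in Equation 6.1 can be estimated, for illustration, by a least-squares fit on the log-log degree histogram. This is a rough sketch, not the estimation method used in this dissertation; maximum-likelihood estimators are generally preferred in practice, and the sample below is synthetic.

```python
import math
from collections import Counter

def powerlaw_exponent(degrees):
    """Estimate gamma in P(k) ~ k^gamma as the slope of a least-squares line
    through the log-log degree histogram (illustrative only)."""
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / total) for c in counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# A sample whose degree frequencies follow k^-2 should recover gamma near -2.
sample = [k for k in range(1, 50) for _ in range(round(10000 / k ** 2))]
print(round(powerlaw_exponent(sample), 1))
```

On a log-log plot a pure power law appears as a straight line, which is why the slope of that line serves as the exponent estimate.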

The analysis of the topology of complex systems has important implications for our understanding of nature and society. Research has shown that the function of a complex system may be to a great extent affected by its network topology (Albert & Barabási, 2002; Newman, 2003b). For instance, the small average path length of the World Wide Web makes cyberspace a very convenient, strongly navigable system, in which any two web pages are on average only 19 clicks away from each other (Albert et al., 1999). It has also been shown that the higher tendency for clustering in metabolic networks corresponds to the organization of functional modules in cells, which contributes to the behavior and survival of organisms (Ravasz et al., 2002; Rives & Galitski, 2003). In addition, networks with scale-free properties (e.g., protein-protein interaction networks) are highly robust against random failures and errors (e.g., mutations) but quite vulnerable under targeted attacks (Albert et al., 2000; Jeong et al., 2001; Solé & Montoya, 2001).


To understand the topology and function of dark networks I studied four terrorist- and criminal-related networks:

- The Global Salafi Jihad (GSJ) terrorist network (Sageman, 2004) (see Figure 6.1), which consists of 366 members including members of Osama Bin Laden's Al Qaeda. These terrorists were connected by kinship, friendship, religious ties, and relations formed after they joined the GSJ network. The network was constructed entirely from open-source data, but all nodes (terrorists) and links (relations) were examined and carefully validated by a domain expert (Sageman, 2004).

- A narcotics-trafficking criminal network ("Meth World") whose members mainly deal with methamphetamines (Xu & Chen, 2003). Based on data about narcotics-related crimes that occurred in Tucson, Arizona, between 1985 and 2002, I generated a network consisting of 1,349 criminals. Two criminals were considered related if they committed at least one crime together.

- A gang criminal network consisting of 3,917 criminals who were involved in gang-related crimes in Tucson between 1985 and 2002 (Xu & Chen, 2003).

- A terrorist web site network ("Dark Web") collected based on reliable governmental sources (Chen et al., 2004). I identified 104 web sites created by four major international terrorist groups (Chen et al., 2004), namely, Al-Gama'a al-Islamiyya, Hizballa, Al-Jihad, and Palestinian Islamic Jihad, and their supporters. A link is created between two web sites if at least one hyperlink exists between any two web pages in them.

Figure 6.1: The giant component in the GSJ network, data courtesy of Marc Sageman (2004). The terrorists belong to one of four groups (Sageman, 2004): Bin Laden's Al Qaeda or Central Staff (pink), Core Arabs (yellow), Maghreb Arabs (blue), and Southeast Asians (green). Each circle represents one or more terrorist activities, such as the 9/11 attacks and the Bali bombing, which are noted.

6.4 Results and Discussion

6.4.1 Statistical Properties of the Dark Networks

Table 6.1 and Table 6.2 present the statistics of the four networks. Each network contains many small components and a single giant component. The separation between the 356 terrorists in the GSJ network and the remaining 10 terrorists exists because no valid evidence has been found to connect the 10 terrorists to the giant component of the network. The giant components in the Meth World and the gang network contain only 68.5% and 57.0% of the nodes, respectively. This may be because the data was collected from a single law enforcement jurisdiction, which may not have complete information about all relations between criminals, causing missing links between the giant component and other smaller components. The isolated components in the Dark Web are possibly due to differences in the terrorist groups' ideologies (Chen et al., 2004).
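The component statistics reported in Table 6.1 reduce to a connected-components computation, typically done with breadth-first search. The sketch below uses a tiny invented network rather than the actual data:

```python
from collections import deque

def components(adj):
    """Connected components of an undirected graph via breadth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

# A giant component plus two small components, as in the dark networks above.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
comps = components(adj)
giant = max(comps, key=len)
print(len(comps), len(giant) / len(adj))  # 3 components; the giant holds 4/7 of the nodes
```

Applied to each data set, the size of `giant` divided by the total node count gives the "Size of Giant Component" percentages in Table 6.1.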

                         GSJ          Meth World   Gang Network   Dark Web
Number of nodes          366          1,349        3,917          104
Number of links          1,247        4,784        9,051          156
Size of giant component  356 (97.3%)  924 (68.5%)  2,231 (57.0%)  80 (77.9%)
Link density             0.02         0.01         0.003          0.05
Average degree, <k>      6.97         4.62         2.87           1.94
Exponent, -γ             -0.67        -1.41        -1.11          -1.33
Cutoff, κ                15.35        23.60        14.65          34.59

Table 6.1: The statistics and parameters in the exponentially truncated power-law degree distribution of the dark networks.

                        GSJ               Meth World        Gang Network      Dark Web
                        Real    Random    Real    Random    Real    Random    Real    Random
Average path length     4.20    3.23      6.49    4.52      9.56    6.23      4.70    3.35
Diameter                9       6.00      17      9.57      22      16.40     12      13.16
Clustering coefficient  0.55    0.2×10^-1 0.60    0.5×10^-1 0.68    0.6×10^-3 0.47    0.1×10^-1

Table 6.2: Small-world properties of the dark networks. For each network, the metrics in the network (real) and those in its random graph counterpart (random) are presented.

6.4.1.1 Small-World Properties

I focused only on the giant component of each network and performed topology analysis. I found that all these networks are small worlds (see Table 6.2). The average path lengths and diameters of these networks are small with respect to their network sizes. Thus, a terrorist or criminal can connect with any other member in a network through just a few mediators. In addition, these networks are quite sparse, with very low link density (Wasserman & Faust, 1994). These two properties have important implications for the efficiency of the covert network function: transmission of goods and information. Because the risk of being detected by authorities increases as more people become involved, the small path length and link sparseness can help lower risks and enhance efficiency.
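The small-world diagnostics in Table 6.2 reduce to two computations, average shortest-path length and average clustering coefficient, compared against a G(n, m) random graph with the same numbers of nodes and links. A minimal pure-Python sketch on an invented toy graph, not the dissertation's data:

```python
import random
from collections import deque

def avg_path_length(adj):
    """Mean shortest-path distance over reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for s in adj:
        dist, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def clustering(adj):
    """Average clustering coefficient over nodes with degree >= 2."""
    cc = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        cc.append(2 * links / (k * (k - 1)))
    return sum(cc) / len(cc)

def random_counterpart(n, m, seed=0):
    """G(n, m) random graph: same node and link counts as the real network."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    while sum(len(s) for s in adj.values()) // 2 < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Toy network: two triangles joined by one link (clustered, short paths).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
rnd = random_counterpart(len(adj), 7, seed=1)
print(round(avg_path_length(adj), 2), round(clustering(adj), 2))  # 1.8 0.78
```

Running the same two functions on `rnd` gives the "random" baseline columns; a real network whose clustering coefficient far exceeds that baseline while its path length stays comparably small is a small world.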

In addition, I calculated the path length of a node to a central node, a measure which is called the "Erdös number" in the collaboration networks of mathematicians (Newman, 2001a). This measure is also related to closeness centrality (Wasserman & Faust, 1994). I found that members in the criminal and terrorist networks are extremely close to their leaders. The terrorists in the GSJ network are on average only 2.5 steps away from Bin Laden, meaning that Bin Laden's command can reach an arbitrary member through only two mediators. Similarly, the average path length to the leader in the Meth World (Xu & Chen, 2003) is only 3.9. Such a short chain of command implies communication efficiency. However, special attention should be paid to the Dark Web. Despite its small size (80), its average path length is 4.70, larger than that (4.20) of the GSJ network, which has almost nine times more nodes. Since hyperlinks help visitors navigate between web pages, and because terrorist web sites are often used for soliciting new members and donations (Chen et al., 2004), the relatively large path length may be due to the reluctance of terrorist groups to share potential resources with other terrorist groups.


The other small-world property, a high clustering coefficient, is also present in these dark networks (see Table 6.2). The clustering coefficients of these four networks are significantly higher than those of their random graph counterparts. Previous studies have also shown evidence of groups and teams in these networks (Chen et al., 2004; Sageman, 2004; Xu & Chen, 2003, Forthcoming). In these groups and teams, members tend to have denser and stronger relations with one another. The communication between group members becomes more efficient, making a crime or an attack easier to plan, organize, and execute (McAndrew, 1999).

In Table 6.1 I also report the average degrees and maximum degrees of the four networks. It can be seen that some terrorists in the GSJ network and some terrorist web sites in the Dark Web are extremely popular, connecting to more than 10% of the nodes in their networks. The assortativity in Table 6.1 indicates the tendency for nodes to connect with others who are similarly popular in terms of degree (Newman, 2003a). The assortativity coefficients of the GSJ and the gang networks are positive, meaning that popular members tend to connect with other popular members. However, the Meth World and the Dark Web have negative assortativity coefficients. This may be because the Meth World consists of drug dealers who sold drugs to many individual buyers; the buyers did not connect with many other buyers or dealers. The popular web sites on the Dark Web, on the other hand, received many inbound hyperlinks from less popular web sites.
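Degree assortativity can be computed as the Pearson correlation between the degrees at the two ends of every link, with each link counted in both directions; this sketch follows that standard definition (Newman, 2003a) on a toy star graph:

```python
import math

def degree_assortativity(adj):
    """Pearson correlation between the degrees at the two ends of each link
    (each undirected link counted in both directions)."""
    deg = {u: len(adj[u]) for u in adj}
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:
            xs.append(deg[u])
            ys.append(deg[v])
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A star is maximally disassortative: the hub links only to degree-1 leaves,
# like a dealer connected to many otherwise unconnected buyers.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(round(degree_assortativity(star), 2))  # -1.0
```

The negative coefficients of the Meth World and the Dark Web indicate structures leaning toward this hub-and-leaves pattern, while the positive GSJ and gang coefficients indicate the opposite, popular-to-popular pattern.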


6.4.1.2 Scale-Free Properties

Moreover, these dark networks are scale-free systems. The three human networks have an exponentially truncated power-law degree distribution (Amaral et al., 2000; Newman, 2001a),

P(k) ~ k^-γ e^(-k/κ),    (6.2)

with exponent -γ and cutoff κ (see Table 6.1 and Figure 6.2). Different from other types of networks (Albert et al., 1999; Faloutsos et al., 1999; Newman, 2001b; Watts & Strogatz, 1998), whose exponents usually are between -2.0 and -3.0, the absolute values of the exponents of dark networks are fairly small. The degree distribution decays much more slowly for small degrees than that of other types of networks, indicating a higher frequency of small degrees. At the same time, the exponential cutoff implies that the distribution for large degrees decays faster than is expected for a power-law distribution, preventing the emergence of large hubs with many links.
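Whether the truncated form (6.2) fits better than the pure power law (6.1), as Figure 6.2 reports, can be checked by comparing least-squares residuals of the two models on the log-log histogram: log P(k) is linear in log k for the pure model, and linear in (log k, k) for the truncated one. The sketch below fits both by solving the normal equations on synthetic data; it is illustrative only, not the fitting procedure used in the dissertation.

```python
import math

def rss_loglog(ks, ps, truncated):
    """Least-squares residual of log P(k) = c + gamma*log(k) (pure power law)
    or log P(k) = c + gamma*log(k) - k/kappa (truncated). Lower is better."""
    ys = [math.log(p) for p in ps]
    cols = [[1.0] * len(ks), [math.log(k) for k in ks]]
    if truncated:
        cols.append([-float(k) for k in ks])
    d = len(cols)
    # Normal equations A coef = b, solved by Gaussian elimination.
    A = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(d)]
         for i in range(d)]
    b = [sum(a * y for a, y in zip(cols[i], ys)) for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            f = A[j][i] / A[i][i]
            A[j] = [x - f * y for x, y in zip(A[j], A[i])]
            b[j] -= f * b[i]
    coef = [0.0] * d
    for i in reversed(range(d)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, d))) / A[i][i]
    return sum((y - sum(c * col[t] for c, col in zip(coef, cols))) ** 2
               for t, y in enumerate(ys))

# Synthetic frequencies drawn exactly from a truncated power law:
ks = list(range(1, 30))
ps = [k ** -1.2 * math.exp(-k / 15) for k in ks]
print(rss_loglog(ks, ps, True) < rss_loglog(ks, ps, False))  # True
```

For data generated with a cutoff, the truncated model absorbs the downward bend at large k that a straight log-log line cannot, which is exactly the pattern visible in panels (a)-(c) of Figure 6.2.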


Figure 6.2: The degree distributions of the dark networks, plotted as ln P(k) against ln(k) with pure power-law and truncated power-law fits. (a) The GSJ network. (b) The Meth World. (c) The gang network. (d) The Dark Web. The truncated power-law distribution fits the data slightly better than the pure power-law distribution for networks (a)-(c).

Two possible reasons have been suggested that may attenuate the effects of growth and preferential attachment (Amaral et al., 2000): (a) the aging effect: as time progresses some older nodes may stop receiving new links; and (b) the cost effect: as maintaining links induces costs (Hummon, 2000), there is a constraint on the maximum number of links a node can have. I believe that the aging effect does exist in the dark networks. In the Meth World, for example, some criminals who were present in the network several years ago may have become inactive due to arrest or death, and thus could not receive new links even though they are still included in the network (see Figure 6.3). Moreover, the cost of links takes the form of risk. Under constant threats from authorities, criminals or terrorists may avoid attaching to too many people, limiting the effects of preferential attachment. Evidence has shown that hubs in criminal networks may not be the real leaders (Sparrow, 1991; Xu & Chen, 2003). Another possible constraint on preferential attachment is trust (Krebs, 2001). This constraint is especially common in the GSJ network, where the terrorists preferred to attach to those who were their relatives, friends, or religious partners (Sageman, 2004).

Figure 6.3: The aging effect in the Meth World. As time progresses, fewer older members stay in the network due to arrest or death. The overall size of the network is nevertheless increasing due to the addition of new nodes every year.

6.4.2 Robustness of the Dark Networks

Because scale-free networks usually are resilient to random failures (Albert et al., 2000), I tested the dark networks' robustness only against targeted attacks. I simulated two types of attack (Holme et al., 2002): attacks targeting the hubs and attacks targeting the bridges. While hubs are nodes that have many links (high degree), bridges are nodes through which many shortest paths pass (high betweenness (Wasserman & Faust, 1994)). When simulating the attacks I distinguished between two attack strategies (Holme et al., 2002): simultaneous removal of a fraction of the nodes based on a measure (degree or betweenness) without updating the measure after each removal, and progressive removal of nodes with the measure updated after each removal.
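The progressive strategy can be sketched as follows: repeatedly remove the current highest-scoring node (by degree for hub attacks; substituting a betweenness score gives bridge attacks), recompute the score after each removal, and track S, the fraction of nodes in the largest remaining component. The toy network and the tie-breaking details below are assumptions for illustration.

```python
from collections import deque

def giant_fraction(adj):
    """S: fraction of the remaining nodes that sit in the largest component."""
    seen, best = set(), 0
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        best = max(best, len(comp))
    return best / len(adj) if adj else 0.0

def progressive_attack(adj, score, steps):
    """Remove the highest-scoring node `steps` times, recomputing the score
    after every removal, and record S after each step."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    history = [giant_fraction(adj)]
    for _ in range(steps):
        target = max(adj, key=lambda u: score(adj, u))
        for v in adj[target]:
            adj[v].discard(target)
        del adj[target]
        history.append(giant_fraction(adj))
    return history

degree = lambda g, u: len(g[u])  # hub attack; a betweenness score gives bridge attacks

# Two triangles joined through cut vertices: removing them splits the network.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
history = progressive_attack(adj, degree, 2)
print(history[0], history[-1] < 0.5)  # starts connected; S collapses under attack
```

The simultaneous strategy differs only in that all targets are ranked once on the intact network before any removal, which is why it misses the cascade effects discussed below.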


Figure 6.4: Dark networks' vulnerability to attacks. (a) Simultaneous attacks (filled markers) and progressive attacks (empty markers) on bridges in the GSJ network. The critical points, f, at which the network falls into many small components, are marked on the diagram. It can be seen that progressive attacks are more devastating (f_p < f_s). (b) The changes in the average path length of the GSJ network under different attack strategies. (c)-(f) Progressive attacks on the GSJ network (c), the Meth World (d), the gang network (e), and the Dark Web (f). Two types of attacks are used: hub attacks (filled markers) and bridge attacks (empty markers). It shows that bridge attacks are more devastating (f_b < f_h). In (f), f_b and f_h are very close, indicating that hub attacks and bridge attacks can be equally effective at disrupting a pure scale-free network.

Figure 6.4 (a)-(b) presents the comparison between simultaneous and progressive removal of bridges. I plot the changes in S (the fraction of nodes in the largest component), <s> (the average size of the remaining components), and the average path length after a fraction of the nodes are removed. The figure shows that progressive attacks are more devastating than simultaneous attacks. Progressive attacks are similar to "cascading failures" in power grids, where an initial failure can cause a series of failures because unbearably high traffic is redirected to the next bridge node.

Figure 6.4 (c)-(f) presents the difference between the network reactions to bridge attacks and hub attacks. It shows that dark networks are more sensitive to attacks targeting the bridges than to those targeting the hubs. In a small-world network, which consists of communities and groups, there may be many bridges linking different communities together. Intuitively, when these bridges are removed, the network will quickly fall apart. Note that a bridge is not necessarily a hub, since a node that connects two communities can have as few as two links. Small-world networks such as the dark networks thus may be more vulnerable to bridge attacks than to hub attacks. In these networks bridges and hubs usually are not the same nodes: the rank order correlations between degree and betweenness in the GSJ, Meth World, and gang networks are 0.63, 0.47, and 0.30, respectively. Note that although bridge attacks are more devastating, strategies targeting the hubs are also fairly effective, since these networks are also scale-free networks (Barabási & Albert, 1999). Hub attacks and bridge attacks can be equally effective in tearing apart a pure scale-free network (e.g., the Dark Web, with a high degree-betweenness rank order correlation of 0.70), in which hubs are also bridges connecting different parts of the network.

6.5 Conclusions

In summary, I examined the structures of several covert networks and found that these networks share many common topological properties with other types of networks. Their efficiency in communication and flow of information, commands, and goods can be tied to their small-world structures characterized by small average path length and high clustering coefficient. On the other hand, while the dark networks are also governed by self-organizing principles, various constraints on the formation and maintenance of links keep these networks from evolving into pure scale-free networks. This results in a phenomenon that I refer to as “constrained dark networks.” In addition, I found that dark networks are more vulnerable to attacks on the bridges that connect different communities than to attacks on the hubs. This may provide authorities with insights for intelligence and security purposes.

An interesting future research direction is to examine the evolution of dark networks. By comparing a simulated evolution model with real data, I may be able to test the effects of various dynamic mechanisms such as growth (Barabási & Albert, 1999), linear and nonlinear preferential attachment (Jeong et al., 2003; Krapivsky et al., 2000), aging (Amaral et al., 2000), costs (Amaral et al., 2000), and fitness (Bianconi & Barabási, 2001), among many others.


CHAPTER 7: MODELING THE EVOLUTION OF PATENT CITATION NETWORKS

7.1 Introduction

Chapters 3-6 present several studies that focus on the static structural pattern mining part of the computational framework. Various techniques have been developed and employed to locate critical resources (e.g., the key nodes and paths) in networks, reduce network complexity, and capture the topological properties of networks. This chapter shifts the focus onto the dynamic pattern mining part of the framework and proposes a composite evolution model based on prior work on network evolution.

Many networks in nature and society are dynamic systems. Identifying the underlying mechanisms that govern the evolution of networks is the key to explaining the function and predicting the behavior of these systems. During evolution the structure of a network may change. Such changes may be reflected in the following three types of dynamics:

Node dynamics. The number of nodes in a network can change over time. In a growing network (Barabási & Albert, 1999), new nodes are added to the system and the size of the network increases over time. In a decaying network the network size decreases. Citation networks often are growing networks that keep including new papers that cite existing papers in the networks. In reality, a network may undergo both addition and removal of nodes at the same time while its overall size displays a monotonically increasing or decreasing pattern. The overall size of the World Wide Web, for example, has increased from a few thousand pages initially to almost 10^9 as of 1999 (Lawrence & Giles, 1999), although both page addition and deletion occur every day.

Link dynamics. The dynamics of links are more complex than those of nodes. First, the number of links may change. New links may be created between existing nodes or between existing nodes and new nodes that are added to the system. Existing links may also break. Second, links may be rewired (Watts & Strogatz, 1998); that is, one end of a link is reconnected to a different node. In this case, although the total number of links in the network is fixed, the structure of the network changes. Third, the strength of a link may not always stay the same. For example, acquaintances may become friends, thereby strengthening their relationship.

Group dynamics. Although changes in groups result directly from node and link changes, the group is a different unit of analysis that may offer a different view. Group dynamics occur when one group splits, two or more groups merge, existing members leave a group, or new members join a group. All these changes cause the structure of the network to change accordingly.

This chapter is devoted to the modeling of node and link dynamics. Specifically, the proposed composite model is aimed at describing and explaining the evolution processes of patent citation networks using several microscopic mechanisms. Applying this model to the patent citation networks helps discover how these mechanisms interplay and affect network evolution.


Patents are open-source documents regarding technology innovations and a reliable source of information for various kinds of analysis. The analysis of patent content and citation patterns can be used to evaluate technology development, performance, transfer, and trends in technology fields, countries, institutions, and industries. More importantly, patent citation networks are ideal for the study of network evolution because the time at which new patent documents are added to the citation networks is explicitly available and accurately recorded in large patent databases.

The remainder of this chapter is organized as follows. Section 7.2 reviews related work on network evolution models. Section 7.3 presents the composite model. Evaluation and results are discussed in Section 7.4. Section 7.5 concludes this chapter and points to future research directions.

Recent years have witnessed increased attention to the evolution of the scale-free topology, which is found to be ubiquitous in real networks. Research in this area seeks the underlying mechanisms that govern the evolution processes of scale-free networks. As reviewed in Chapter 2, such research can be roughly categorized into two types: descriptive and explanatory. I do not repeat the review of descriptive analysis but provide more details on the explanatory (modeling) analysis in this section.

A number of mechanisms have been proposed in prior research, including growth, preferential attachment, competition, and individual preference, among others.


Growth. One of the key differences between the random graph model (Erdös & Rényi, 1960) and the scale-free model (Barabási & Albert, 1999) is the latter's growth assumption. The size of a scale-free network increases rather than stays fixed over time. Moreover, because the number of links also increases at the same time, the average degree of the network remains roughly constant (Albert & Barabási, 2002).

Preferential attachment. Motivated by the rich-get-richer phenomenon, Barabási and Albert (1999) proposed the so-called BA model. In the BA model the probability that an old node receives links from new nodes is proportional to the degree of the old node. As a result, the degree distribution is a power law with a constant exponent, P(k) ~ k^γ. This implies that whereas a large percentage of the nodes have a small number of links, a small percentage of the nodes have a large number of links.

It has been shown that both growth and preferential attachment are indispensable to the emergence of scale-free topology (Barabási et al., 1999; Barabási & Albert, 1999). If a network does not grow, it will eventually become fully connected. The absence of preferential attachment, on the other hand, leads to an exponential degree distribution rather than a power-law. The BA model, together with a few other models that produce similar power-law distributions (Dorogovtsev et al., 2000; Krapivsky et al., 2000), was the first to explain the evolution of scale-free networks.

However, the BA model is subject to several weaknesses. First, it predicts that the asymptotic value of the power-law exponent (−γ) is −3 (Barabási et al., 1999). However, empirical studies show that many real networks have exponents ranging between −2 and −3. Second, the degree distributions of many real networks deviate from a strict power-law curve (Amaral et al., 2000), which would appear as a straight line on a log-log plot. Some of the curves have a unimodal body and a power-law tail (Pennock et al., 2002), and others have an exponential cutoff (Jeong et al., 2001; Newman, 2001b). Third, the BA model implies that old nodes in a network will be more popular than younger nodes because they have had more time to acquire links. However, in real networks we often see that some younger nodes can acquire a large number of links and become new "stars" in a very short period of time. For example, a Web page with excellent content may quickly become more popular than older Web pages with mediocre content. Fourth, the BA model assumes that all new nodes have knowledge of the global structure of the network (Albert & Barabási, 2000; Gómez-Gardenes & Moreno, 2004). That is, new nodes know how many links each old node has. This is not always true. To overcome these weaknesses, researchers have proposed several new models based on alternative mechanisms.

Competition. In many real systems nodes compete for links. For example, companies compete for customers' attention on product markets. A new product may quickly dominate a market and wipe out older products because of its superior quality, functionality, or other attractions. The fitness model was proposed to incorporate the effect of competition (Berger et al., forthcoming; Bianconi & Barabási, 2001). The fitness of a node may be considered its intrinsic ability to attract links from others, increasing the node's competitive advantage. Therefore, it is possible for a younger node with high fitness to have more links than old nodes with low fitness. If a few nodes have extremely high fitness, they will become the "winners" and connect to almost every other node in the network (Pennock et al., 2002).

Individual preference. The global knowledge assumption shared by both the BA model and the fitness model is not always realistic. In addition, it has been observed that although the degree distribution for the whole Web follows a power-law (Broder et al., 2000; Huberman & Adamic, 1999), the degree distributions for specific categories of Web pages, such as company, education, and government pages, differ from a power-law. Specifically, these distributions have a unimodal body and a power-law tail. To explain this discrepancy, Pennock et al. (2002) proposed adding a random mechanism to the BA model. This random mechanism reflects the observation that when a Web page author chooses target pages to link to, he/she may consider not only the target pages' popularity but also their relevance to his/her needs. In this model, a tuning parameter, α, balances the preferential attachment and random mechanisms. Analytical and simulation results show that this model fits the category-specific distributions in the Web context better than the BA model.

Another model that explicitly considers individual preference is the degree-similarity mixture model proposed in (Menczer, 2004). The extra mechanism in this model is directly related to the content similarity between the new document and the target document: the more similar the target document's content is to the new document's, the more likely it is to obtain a citation link from the new document.


In addition to the above-reviewed mechanisms, researchers have proposed various other mechanisms such as the copying effect (Kleinberg et al., 1999), internal links and link rewiring (Barabási et al., 2002), and the aging effect (Hajra & Sen, 2005). However, there has not been a composite model that incorporates most of these mechanisms. How does a mechanism affect the network structure when other mechanisms also play roles in the evolution? Which mechanism is more responsible for the emergence of a specific topology? Answers to these questions remain unknown. In an attempt to address them, I propose a composite evolution model in the next section.

7.3 The Composite Evolution Model

7.3.1 The Composite Model

The composite model consists of two general types of microscopic mechanisms:

Attractiveness of the target node. When a new node is added to the network, it must select a set of target nodes to link to. The more attractive an existing node is, the more likely it is selected as a target node. Based on prior models, attractiveness can be measured by degree (Barabási & Albert, 1999) and fitness (Bianconi & Barabási, 2001).

Usefulness of the link. When a new node selects a target node, it considers not only how attractive the target node is but also how useful the potential link is. For example, when an author cites other papers, he/she probably selects papers that are well-cited and also relevant to his/her own paper. It is unlikely for the author to cite a paper in an irrelevant discipline even if that paper is popular. Similarly, before two corporations decide to become strategic partners, they must consider how much they can benefit from the partnership to reduce uncertainty, leverage resources, and gain market power (Stuart, 1998). In the two individual preference models (Menczer, 2004; Pennock et al., 2002), the random and content similarity mechanisms can be viewed as link usefulness effects.

The composite model captures the scaling behavior of node degrees. In this model, the probability (Π_i) that an old node i acquires a link from a new node is determined by

    Π_i = α [ β η_i^φ k_i / Σ_j η_j^φ k_j + (1 − β) ζ_i / Σ_j ζ_j ] + (1 − α) u(Θ). (7.1)

The first part of equation (7.1) represents the attractiveness effect and is a function of both the degree (k_i) and fitness (η_i or ζ_i) of node i. The second part represents the usefulness effect and is a function of one or more link-usefulness-related variables (Θ). The parameter α weighs the attractiveness and usefulness effects and takes on a value between 0 and 1. Note that in the first part, both η_i and ζ_i refer to the fitness of node i. They reflect the two different ways that the fitness effect can enter the model: multiplicative and additive. The parameter β balances between these two ways and ranges between 0 and 1. The parameter φ is either 0 or 1, controlling the presence or absence of the multiplicative fitness. This is a composite model because with different parameter settings it reduces to different models, some of which have been proposed in prior research.
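To make the roles of the parameters concrete, the link probability of equation (7.1) can be sketched in a few lines of Python. The function and variable names here are my own, and normalizing the raw usefulness scores u(Θ) into a probability is an assumption made for illustration.

```python
def composite_prob(k, eta, zeta, u, alpha, beta, phi):
    """Link probability of equation (7.1) for every existing node.

    k    : degrees of the existing nodes
    eta  : multiplicative fitness values
    zeta : additive fitness values
    u    : raw usefulness scores u(Theta) (normalized here by assumption)
    """
    mult = [(e ** phi) * d for e, d in zip(eta, k)]  # eta_i^phi * k_i
    sm, sz, su = sum(mult), sum(zeta), sum(u)
    return [alpha * (beta * m / sm + (1 - beta) * z / sz)
            + (1 - alpha) * w / su
            for m, z, w in zip(mult, zeta, u)]

# With alpha = beta = 1 and phi = 0 this reduces to the BA rule k_i / sum_j k_j:
p = composite_prob([1, 2, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], 1.0, 1.0, 0)
# p == [0.25, 0.5, 0.25]
```

Setting the parameters as in Sections 7.3.2-7.3.5 recovers each reduced model in turn.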

7.3.2 The Simple Degree Model

When α = 1, β = 1, and φ = 0, equation (7.1) reduces to the simple degree model, which is the BA model (Barabási & Albert, 1999),

    Π_i = k_i / Σ_j k_j. (7.2)

I rename the BA model because in this model only preferential attachment based on degree is considered; other mechanisms such as fitness and link usefulness are omitted.

To derive the functional form of the degree distribution of scale-free topology, Barabási and Albert (1999) use numerical simulations. Initially, there are a small number, m_0, of nodes in the system. At each time step, a new node is added to the system. The new node is allowed to link to m (m ≤ m_0) different nodes that are already in the network. When selecting the target nodes, the new node makes its decision based on the probability defined in equation (7.2).

Treating k_i as a continuous variable, the analytical solution to equation (7.2) is (Barabási et al., 1999):

    k_i(t) = m (t / t_i)^0.5, (7.3)


    P(k) ~ k^(−3), (7.4)

where t_i is the time at which node i is added to the system. Equation (7.3) implies that the degree scales with time and that older nodes have an advantage over younger nodes in acquiring links. The degree distribution P(k) is a power-law with −3 as the exponent, which is independent of n and m.

7.3.3 The Simple Fitness Model

If α = 1 and β = 0, the composite model reduces to a simple fitness model,

    Π_i = ζ_i / Σ_j ζ_j. (7.5)

This model considers only the effect of node fitness. It is rather unique because the effect of preferential attachment, which is believed to be the key to the evolution of scale-free networks in all prior models (Barabási & Albert, 1999; Bianconi & Barabási, 2001; Menczer, 2004; Pennock et al., 2002), is completely excluded. Such a situation may occur when a new node is attracted to an old node with significantly high fitness; the new node does not care whether the old node is popular or not.

7.3.4 The Multiplicative Fitness Model

When α = 1, β = 1, and φ = 1, equation (7.1) becomes a multiplicative fitness model,

    Π_i = η_i k_i / Σ_j η_j k_j. (7.6)

This model is equivalent to the fitness model proposed in (Bianconi & Barabási, 2001). In this model, Π_i is a function of the product of fitness and degree; that is, the fitness and preferential attachment mechanisms interplay with each other. As in the BA model, the analytical solution can be derived using mean-field theory (Bianconi & Barabási, 2001):

    k_{η_i}(t) = m (t / t_i)^{β(η_i)}, (7.7)

    P(k) = ∫ ρ(η) (C/η) (m/k)^{C/η + 1} dη, (7.8)

where β(η) = η/C and C = ∫ ρ(η) η / (1 − β(η)) dη. Equation (7.7) implies that the scaling behavior of degrees depends on the dynamic exponent β(η_i) and that nodes with higher fitness acquire links faster. The degree distribution given in equation (7.8) is a weighted sum of different power-laws.
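A minimal simulation of the multiplicative rule (7.6) shows high-fitness nodes acquiring links faster. The uniform fitness distribution below is an illustrative assumption; the experiments later in this chapter draw fitness from empirical productivity distributions instead, and the virtual stub given to each seed node is likewise an assumption to start the process.

```python
import random

def simulate_fitness(n, m=2, m0=3, seed=7):
    """Each new node links to m existing nodes chosen with probability
    proportional to eta_i * k_i (the multiplicative fitness model, eq. 7.6)."""
    rng = random.Random(seed)
    eta = [rng.uniform(0.01, 1.0) for _ in range(n)]   # illustrative fitness
    degree = [1] * m0 + [0] * (n - m0)                 # virtual stub per seed
    for new in range(m0, n):
        weights = [eta[i] * degree[i] for i in range(new)]
        total = sum(weights)
        targets = set()
        while len(targets) < m:
            # roulette-wheel draw proportional to eta_i * k_i
            r, acc = rng.random() * total, 0.0
            for i, w in enumerate(weights):
                acc += w
                if acc >= r:
                    targets.add(i)
                    break
        for t in targets:
            degree[t] += 1
        degree[new] = m
    return eta, degree
```

Comparing the final degrees of high- and low-fitness nodes of the same age illustrates the fitness-dependent scaling of equation (7.7).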

7.3.5 The Additive Fitness Model

It is easy to imagine that in some situations the preferential attachment and fitness effects might work independently: they do not interplay, and their combined effect governs the evolution of a network. This situation is described by the additive fitness model, obtained when α = 1, 0 < β < 1, and φ = 0,

    Π_i = β k_i / Σ_j k_j + (1 − β) ζ_i / Σ_j ζ_j. (7.9)

In addition to the four reduced models, the composite model reduces to the unimodal power-law model (Pennock et al., 2002) when 0 < β < 1 and u(Θ) = 1/N(t), where N(t) is the number of nodes at time t. When 0 < β < 1 and u(Θ) ~ (1/σ_c − 1)^(−µ), where σ_c is the similarity between the new node and old node i, equation (7.1) reduces to the degree-similarity mixture model (Menczer, 2004).

The composite model is rather flexible, allowing different mechanisms to affect network evolution independently or interactively. Models not proposed in prior research, such as the simple fitness model, also become possible. When both α and β take on values between 0 and 1 and φ = 1, the model becomes rather complex and multiple mechanisms can play roles in network evolution simultaneously. Moreover, additional mechanisms can be incorporated into the model; for example, the variable set (Θ) of the usefulness function can include other mechanisms besides the random and similarity effects.


7.4 The Evolution of Patent Citation Networks

To ascertain the composite model's applicability to real networks, I used several patent citation networks.

7.4.1 Data Sets

In patent citation networks, each node is a patent document. A patent document often contains several standard fields, including title, application date, issue date, assignee (the institution to which the patent is assigned), inventors, citations, and technology field classifications, among many others (Huang et al., 2003). For this chapter, I chose the citation networks of nanoscale science and engineering (NSE) related patents. NSE has been very active in recent years and has been recognized as critical to a country's future science and technology competence.

Many countries have established comprehensive patent repositories to facilitate the application, management, and study of patents. Among various patent systems, that of the US Patent and Trademark Office (USPTO) is the most complete and reliable (Huang et al., 2003).

The test data sets of NSE-related patents were retrieved from the USPTO's patent databases in March 2003. A keyword-based approach was used to retrieve a subset of the NSE-related patents from 1976 to 2002; the keywords used in the retrieval process are provided in (Huang et al., 2003). The number of NSE-related patents collected was 88,546, which covered 418 of 462 first-level US Patent Classification categories of technology fields, including chemistry, drugs, etc. From the top 10 technology fields that generated the largest number of patents in the 1976-2002 period, I selected four: drug, material science, optics, and semiconductor. I extracted the citation links among these patents and created a patent citation network for each of the four fields. Table 7.1 shows the basic statistics of these four data sets.

                          Drug    Material  Optics  Semiconductor  Total
Total number of patents   8228    8093      6093    3903           26,317
Network size              4377    4156      4377    2247           15,157
Number of links           7548    6867      7099    2772           24,286

Table 7.1: Basic statistics of the four patent citation data sets.

7.4.2 Research Questions

The analysis of patent citation networks is aimed at answering the following research questions:

How do patent citation networks evolve?

Do patent citation networks have similar dynamic patterns across different technology fields?

Are growth and preferential attachment the only mechanisms that are responsible for the emergence of scale-free topology?

How do other mechanisms affect the network evolution?


I performed both descriptive and explanatory analysis on the patent citation networks.

7.4.3 Descriptive Analysis

The descriptive analysis was intended to answer the first two research questions. The general and topology-characterizing statistics were collected for each year and each technology field. I also explored two additional statistics: the institutions' productivity distribution and the patent content similarity distribution. The major results are summarized as follows.

The sizes of the networks increased over time.

Figure 7.1: The size dynamics in patent citation networks of the four technology fields. (a) Drug. (b) Material science. (c) Optics. (d) Semiconductor.

Figure 7.1 shows that the size dynamics are very similar across the four technology fields. Before the mid-1990s, the sizes of these fields rose slowly; since the mid-1990s, they have experienced rapid growth. The total number of patents issued (Total Size), the number of patents in the network (N), and the number of citation links (M) all increased dramatically after the mid-1990s, because NSE-related research attracted a substantial amount of attention during that period. The number of isolated patents (# isolated), which neither cited other patents nor were cited, also increased linearly with time. The sizes of the giant components stayed almost constant until the mid-1990s, after which they increased over time. In addition, M rose faster than N, resulting in increasing average degrees (see Figure 7.2).

The average degree increased over time.

Figure 7.2: Dynamics of average degrees. (a) Drug. (b) Material Science. (c) Optics. (d) Semiconductor.

Most prior models predict that the average degree is constant over time. An exception is the model proposed in (Barabási et al., 2002), where the increasing average degree results primarily from new links formed between existing nodes. However, this does not apply to citation networks: once a patent is issued, its citations do not change, so it is not possible to form new links between existing patents. Thus, the only possible reason is that, on average, younger patents cited more patents than older patents did. This can be seen from the <k_out> curves in Figure 7.2, which show that the average number of citations a patent makes increased over time.

The clustering coefficients increased over time.

Because the clustering coefficient measures the tendency of nodes to form groups, the increasing clustering coefficients mean that groups of patents discussing related topics became more common over time. This is quite natural: as a field matures, an initially general topic may develop into different subtopics, and patents mostly cite only patents discussing the same subtopic.


The average shortest paths increased over time.

Prior models (Albert et al., 1999; Bollobás, 1985) predict that the average shortest path should increase logarithmically with network size. However, for all four fields there was a jump in the average shortest path around the mid-1990s. This implies that during the booming period of NSE-related research, patents were more "distant" from each other, possibly resulting from a lack of cross-referencing between different subtopics.

Figure 7.3: Dynamics in average path lengths. (a) Drug. (b) Material Science. (c) Optics. (d) Semiconductor.

The degree distribution followed a power-law, but deviated from the power-law for small degrees.


Overall, these degree distributions followed power-laws. However, their bodies deviated from the power-law line significantly for small degrees, and the deviation was more severe for the optics and semiconductor fields. The exponents of the power-laws for drug, material, optics, and semiconductor were -2.12, -2.24, -2.29, and -2.58, respectively. They were rather close to the value predicted by the simple degree model (Barabási et al., 1999; Barabási & Albert, 1999). However, the simple degree model cannot explain the observed deviation from the power-law.
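Exponents like these come from the slope of the log-log plot. A least-squares fit of ln P(k) against ln k, sketched below, is the simplest version of that estimate (maximum-likelihood estimators are generally more robust; this sketch only mirrors the plotting-based approach).

```python
import math
from collections import Counter

def powerlaw_slope(degrees):
    """Slope of ln P(k) vs. ln k; for P(k) ~ k^(-gamma) this returns -gamma."""
    counts = Counter(d for d in degrees if d > 0)
    n = sum(counts.values())
    pts = [(math.log(k), math.log(c / n)) for k, c in counts.items()]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den

# A sample whose frequencies fall off exactly as k^(-2):
sample = [1] * 400 + [2] * 100 + [4] * 25
# powerlaw_slope(sample) is -2.0 up to floating-point error
```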

Figure 7.4: Degree distributions of the four fields. (a) Drug. (b) Material Science. (c) Optics. (d) Semiconductor.

The productivity distributions followed power-laws.


An institution's productivity was measured by the number of patents it generated divided by the total number of patents in the field to which it belonged. The five most productive institutions in the four fields are listed in Table 7.2.

Drug:          L'Oreal 303; Merck & Co., Inc. 163; University of California 167; Eli Lilly and Company 130; Genentech, Inc. 111
Material:      Minnesota Mining and Manufacturing Company 260; Xerox Corp. 226; IBM 158; PPG Industries, Inc. 152; 3M Innovative Properties Company 149
Optics:        IBM 140; The Secretary of the Navy 111; Lucent Technologies Inc. 107; Hughes Aircraft Company 103; Corning Incorporated 80
Semiconductor: IBM 303; Micron Technology, Inc. 245; Advanced Micro Devices, Inc. 221; Motorola, Inc. 185; Texas Instruments Incorporated 169

Table 7.2: The five most productive institutions in the four technology fields.

Moreover, I found that the most productive institutions were also the ones that received the most citations. The productivity distribution was found to follow a power-law, P(x) ~ x^(−µ), where x is the number of patents generated by a specific institution (see Figure 7.5). The exponent values for the four fields are given in Table 7.3. The power-law distributions imply that while a large number of institutions generated only a small number of patents, a small number of institutions generated a large number of patents.


Figure 7.5: Institutions' productivity distribution for the drug field.

                                    Drug    Material  Optics  Semiconductor
Productivity distribution exponent  -1.41   -1.24     -1.48   -0.89
Similarity distribution exponent    1.08    0.93      1.08    1.30

Table 7.3: Exponent values of productivity distributions and similarity distributions for the four fields.

The conditional similarity distributions scaled with similarity and followed power-laws.

To measure the content similarity between two patent documents, I extracted the noun phrases from the title and abstract of each patent and calculated the Jaccard coefficient (Rasmussen, 1992), a similarity measure often used in information retrieval applications. The similarity (σ_ij) between patents i and j was defined as (Chen et al., 1998):

    σ_ij = Σ_q w_iq w_jq / (Σ_q w_iq² + Σ_q w_jq² − Σ_q w_iq w_jq), (7.10)

where Q was the total number of terms extracted from a data set (e.g., the drug field), the sums run over q = 1, ..., Q, and w_iq = tf_iq × idf_q. Term frequency (tf_iq) was the number of occurrences of term q in document i. Inverse document frequency (idf_q) was the inverse of the logarithm of the number of documents in which term q occurred.
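Equation (7.10) can be computed directly from sparse term-weight dictionaries. The sketch below follows the tf × idf weighting described above; the exact idf variant (1 over the log of document frequency, guarded with +1 against log 1 = 0) and all names are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists -> list of {term: tf*idf} dictionaries.
    idf_q is taken as 1/log(df_q + 1), an illustrative guard against log(1)=0."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return [{t: tf / math.log(df[t] + 1) for t, tf in Counter(doc).items()}
            for doc in docs]

def jaccard_similarity(wi, wj):
    """Weighted Jaccard coefficient of equation (7.10)."""
    common = sum(w * wj[t] for t, w in wi.items() if t in wj)
    si = sum(w * w for w in wi.values())
    sj = sum(w * w for w in wj.values())
    denom = si + sj - common
    return common / denom if denom else 0.0
```

Identical term-weight vectors yield similarity 1, and documents sharing no terms yield 0, as the Jaccard coefficient requires.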

The distribution of content similarity was a conditional probability distribution, defined as the percentage of patent pairs linked by citations over all possible patent pairs in a data set, given a specific value of content similarity. Unlike the phase-transition distribution of content similarity observed in the degree-similarity mixture model (Menczer, 2004), the content similarity between linked patent pairs followed a power-law, P(σ_c) ~ σ_c^υ. This means that the more similar the contents of two patents, the more likely one was to cite the other. Figure 7.6 presents the similarity distribution of drug patents. The exponent of the power-law distribution for each field is given in Table 7.3.

7.4.4 Explanatory Analysis

7.4.4.1 Possible Evolutionary Mechanisms

The two categories of mechanisms that might possibly affect the evolution of patent citation networks were the attractiveness of patents and the usefulness of citation links.

The attractiveness of a patent document could be based on degree and fitness. The degree of a patent was the number of citations it received from other patents and could be easily measured by its number of in-links. The fitness of a patent reflected its intrinsic traits, such as the quality of its content or the innovativeness of the technology presented. Measuring fitness, however, was not as straightforward. I observed that the most popular patents, those that received a large number of citations, were written by assignees from the productive institutions. That is, patents from large, productive institutions appeared to be more attractive to patent assignees and tended to receive more citations. The fitness of a patent was thus estimated by the productivity of its assignee's institution.

Figure 7.6: The log-log plot of conditional content similarity between linked patent pairs.

The usefulness of citation links was estimated primarily by the content similarity between the citing patent and the cited patent. To test whether content similarity played a role in citation, I compared the average similarity between linked patents with the average similarity between unlinked patents. The former was significantly higher than the latter, indicating the impact of similarity on citation selection. Table 7.4 presents the similarity coefficients of the four technology fields.


                                             Drug      Material  Optics    Semiconductor
Average similarity between linked patents    0.153***  0.097***  0.068***  0.059***
Average similarity between unlinked patents  0.008     0.007     0.006     0.004

*** p < 0.0001

Table 7.4: The similarity coefficients between linked patents and those between unlinked patents.

7.4.4.2 Estimating the Composite Model

To estimate the parameters in the composite model, the best approach would be regression based on equation (7.1) using the real data. However, the probability Π_i was very difficult to measure (Jeong et al., 2003). I therefore used a simulation approach to estimate the parameters. Using simulation for parameter estimation is rather ad hoc; however, it could provide some insights into the impacts of the various mechanisms on network evolution.

Given a specific citation network, the simulation began with two linked nodes. At each time step, a new node was added to the network and allowed to link to <k> existing nodes. The target nodes were selected based on one of the variants of the composite model: the simple degree model, the simple fitness model, the multiplicative model, the additive model, or the full composite model. The simulated network continued to include new nodes until its size equaled the size of the real citation network under study.


The fitness scores of the nodes in these models were drawn from the empirical distributions of institution productivity. The usefulness function in the composite model was based on the empirical distribution of the content similarity between linked patents,

    u(σ_c) = σ_c^(−µ)  if σ_c < 0.991,
    u(σ_c) = σ         if σ_c ≥ 0.991, (7.11)

where u(σ_c) was the number of linked patent pairs with similarity σ_c divided by the total number of linked pairs. Note that this distribution was not the conditional content similarity presented in Section 7.4.3; it was a distribution with a phase transition (Menczer, 2004). When the content similarity between two patents was less than 0.991, the distribution was a power-law, indicating that while a large number of linked pairs had small similarity, a small percentage of linked patent pairs were very similar. When the similarity was close to 1.0, the probability was a constant number σ. Figure 7.7 presents the similarity distribution of the drug network, with the constant similarity σ marked on the chart. The values of µ and σ are given in Table 7.5.

Because the parameters in the composite model were unknown, the first four models had to be simulated and analyzed first. Based on that analysis, the contribution (coefficient) of each mechanism was determined through multiple trials. Some mechanisms might be dropped from the composite model because of their poor fit.


   Drug   Material  Optics  Semiconductor
µ  0.72   0.85      0.79    0.61
σ  0.06   0.03      0.03    0.02

Table 7.5: Estimated parameter values in the content similarity distributions.

Table 7.5: Estimated parameter values in the content similarity distributions.

Figure 7.7: The distribution of the content similarity between linked drug patents.

The estimated models for the drug, material science, optics, and semiconductor fields are

    Π_i = 0.9 k_i / Σ_j k_j + 0.1 σ_c^(−0.72), (7.12)

    Π_i = 0.7 k_i / Σ_j k_j + 0.3 σ_c^(−0.85), (7.13)

    Π_i = 0.35 (k_i / Σ_j k_j + ζ_i / Σ_j ζ_j) + 0.3 σ_c^(−0.79), (7.14)

    Π_i = 0.45 (k_i / Σ_j k_j + ζ_i / Σ_j ζ_j) + 0.1 σ_c^(−0.61). (7.15)


Figure 7.8 presents the fits of the five models. The major findings are summarized as follows:

The attractiveness mechanisms were more responsible for the degree distributions observed in the four patent citation networks. Equations (7.12)-(7.15) show that the values of α for all four fields ranged between 0.7 and 0.9; in equations (7.14) and (7.15), the values of β were 0.5.

Preferential attachment was the most significant mechanism in the evolution of the networks. We see from Figure 7.8 that the simple degree model fit the data well, although it might not be the best.

However, preferential attachment may not be the only mechanism that led to scale-free topology. Figure 7.8 shows that the simple fitness model could also result in a scale-free topology, although the degree scaled very fast.

The multiplicative and additive fitness models scaled too fast in most cases. They generated networks with a few nodes that had extremely large degrees.

The composite model seemed to fit the data best because it was more flexible and included the additional mechanism, content similarity.

Figure 7.8: The fits of different models. (a) Drug. (b) Material Science. (c) Optics. (d) Semiconductor.


7.5 Conclusions

I proposed a composite model for network evolution in this chapter. The microscopic mechanisms that may impact the evolution of networks are the attractiveness of the target nodes and the usefulness of the links. Using the NSE-related patent citation networks, I compared this model with several models proposed in prior research. The preliminary results showed that the composite model has the potential to fit real networks better.

The major limitation of this chapter lies in the use of a simulation approach to estimate the parameters of the composite model, owing to the difficulty of measuring the link-selection probability. Future research will focus on removing this limitation and on systematically estimating and testing the significance of the parameters. In addition, various other mechanisms will be added to the model in my future research.


CHAPTER 8: CONCLUSIONS AND FUTURE DIRECTIONS

Contemporary organizations live in an environment of networks: internally, organizations manage the networks of employees, information resources, and knowledge assets to enhance productivity and improve efficiency; externally, they form alliances with strategic partners, suppliers, buyers, and other stakeholders to conserve resources, share risks, and gain market power. Organizations make many managerial and strategic decisions based on their understanding of the structure of these various networks.

This dissertation is devoted to network structure mining, a new research topic on knowledge discovery in databases (KDD) for supporting knowledge management and decision making in organizations. A comprehensive computational framework is proposed and a series of case studies are presented to address various research questions in this new field. In this chapter, I summarize the theoretical, technical, and empirical contributions of this dissertation, discuss its relevance to management, business, and MIS research, and suggest future research directions.

8.1 Contributions

8.1.1 Theoretical Contributions

The dissertation contributes to various aspects of the research and practice of network structure mining. Specifically, the theoretical contributions of this dissertation are as follows:


The computational framework proposed in this dissertation is the first comprehensive framework that offers a relatively complete taxonomy and summary of the theoretical foundations, major research questions and methodologies, existing technologies, and applications of network structure mining. Books and articles that provide excellent surveys and summaries of the study of networks can be found in graph theory (Bollobás, 1998), social network analysis (Wasserman & Faust, 1994; Watts, 2004), and statistical physics (Albert & Barabási, 2002; Newman, 2003b), which are the three major theoretical foundations on which network structure mining is grounded. However, they focus only on the research questions and methodologies relevant to their own disciplines and have not taken much advantage of the multidisciplinary nature of network research. For example, SNA studies seldom address the network robustness question. Statistical physics research, on the other hand, never uses the blockmodeling approach from SNA to reduce network complexity. This computational framework, in contrast, consolidates the research questions and techniques from multiple reference disciplines and can also be used to guide future research.

The framework and the case studies presented in this dissertation contribute to the KDD research community by defining the new area of network structure mining and demonstrating how structural patterns can be extracted from networks using conventional data mining techniques, such as hierarchical clustering algorithms, and new methods borrowed from other disciplines, such as the blockmodeling approach. Network structure mining, together with conventional data mining topics such as association rule mining, clustering, and classification, will be one of the major pillars of KDD research.

8.1.2 Technical Contributions

This dissertation has also made the following technical contributions:

A new shortest-path algorithm, two-tree priority-first search (two-tree PFS), was developed and compared with other graph traversal algorithms, such as the one-tree priority-first search (PFS) and breadth-first search (BFS) algorithms, to locate important relational paths in networks (Xu & Chen, 2004). The performance evaluation showed that both the one-tree and two-tree PFS algorithms were more effective than the BFS algorithm. In addition, the two-tree PFS algorithm was more efficient than the one-tree PFS algorithm in dense networks.
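The intuition behind the two-tree search can be sketched as a bidirectional Dijkstra search: one priority-first tree grows from the source and another from the target, and the search stops once some node has been settled by both trees. The code below is a minimal illustrative sketch under that interpretation, not the actual CrimeNet Explorer implementation.

```python
import heapq

def bidirectional_dijkstra(graph, s, t):
    """Two-tree priority-first search for the s-t shortest-path length.
    graph: dict mapping node -> list of (neighbor, weight) pairs."""
    if s == t:
        return 0.0
    dist = [{s: 0.0}, {t: 0.0}]        # tentative distances from s and from t
    heaps = [[(0.0, s)], [(0.0, t)]]   # one priority queue per search tree
    done = [set(), set()]
    best = float("inf")                # best s-t path length seen so far
    while heaps[0] and heaps[1]:
        # Grow the tree whose frontier is currently cheaper (priority-first).
        side = 0 if heaps[0][0][0] <= heaps[1][0][0] else 1
        d, u = heapq.heappop(heaps[side])
        if u in done[side]:
            continue
        done[side].add(u)
        # Classic stopping rule: once a node is settled by both trees,
        # no undiscovered path can be shorter than `best`.
        if u in done[1 - side]:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist[side].get(v, float("inf")):
                dist[side][v] = nd
                heapq.heappush(heaps[side], (nd, v))
            if v in dist[1 - side]:    # the two trees meet at v
                best = min(best, nd + dist[1 - side][v])
    return best

g = {
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 2), ("D", 5)],
    "C": [("A", 4), ("B", 2), ("D", 1)],
    "D": [("B", 5), ("C", 1)],
}
print(bidirectional_dijkstra(g, "A", "D"))  # A-B-C-D costs 4.0
```

Because each tree only has to cover roughly half of the distance between source and target, the two frontiers together visit far fewer nodes than a single tree would in a dense network, which matches the efficiency result reported above.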

A number of techniques previously used in other disciplines, such as the concept space approach from information retrieval (Chen & Lynch, 1992), hierarchical clustering algorithms from data mining (Aldenderfer & Blashfield, 1984), the blockmodeling approach from SNA (Arabie et al., 1978), and multidimensional scaling (MDS) from statistics (Kruskal & Wish, 1978), were employed to mine and visualize structural patterns in networks (Xu & Chen, 2005). Compared with the graphics-based approaches employed in current network analysis tools, the prototype system developed with these new techniques was more efficient and useful.

To address the lack of efficiency in unweighted graph partitioning methods, I proposed edge local density to approximate link weights based on the structure of the network. When incorporated into single-pass and iterative hierarchical clustering algorithms, this measure was shown to be potentially helpful for enhancing partitioning efficiency with acceptable effectiveness, or for improving effectiveness with acceptable efficiency. It can provide a better balance between the differing effectiveness and efficiency demands of applications than existing clustering methods.
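One plausible way to realize such a structure-based weight, sketched below under the assumption that local density is measured from shared neighborhoods (the dissertation's exact definition may differ), is the Jaccard overlap of the two endpoints' neighbor sets:

```python
from collections import defaultdict

def edge_local_density(edges):
    """Approximate a weight for each edge of an unweighted graph from local
    structure: the Jaccard overlap of the endpoints' neighborhoods. Densely
    embedded edges (many shared neighbors) get high weights; bridges get low
    ones."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    weights = {}
    for u, v in edges:
        shared = nbrs[u] & nbrs[v]               # common neighbors of u and v
        union = (nbrs[u] | nbrs[v]) - {u, v}     # all other neighbors
        weights[(u, v)] = len(shared) / len(union) if union else 0.0
    return weights

# Two triangles joined by a single bridge (1, 4): intra-cluster edges
# should score higher than the bridge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (1, 4)]
w = edge_local_density(edges)
print(w[(1, 4)], w[(1, 2)])  # prints 0.0 0.5
```

Edges embedded in dense regions receive weights near 1 while bridges between communities receive weights near 0, which is exactly what a weighted hierarchical clustering step can then exploit to cut between groups.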

A composite model was proposed to explain evolution processes and the emergence of scale-free topology in networks. The composite model could reduce to different models proposed in prior research or new models under different parameter settings. This model incorporated more evolutionary mechanisms than prior models and was more flexible and realistic.

8.1.3 Empirical Contributions

This dissertation addresses network structure mining from the perspective of knowledge management and decision making. Specifically, the case studies presented were aimed at supporting knowledge management and decision making in various application domains:


Chapters 3 and 4 focused on the law enforcement domain, proposing effective methods to help extract knowledge about the structures of criminal networks involved in organized crime (Xu & Chen, 2004, 2005). The techniques employed, such as the shortest-path algorithms and SNA methods, have been found to be very promising in supporting crime-investigation-related knowledge discovery tasks.

Chapter 5 presented the new measure for addressing the unweighted network partition problem. It can be used in a variety of applications such as identifying research specialties in a research discipline based on citation networks and extracting communities of Web pages on the Internet.

Chapter 6 used the new topological analysis approaches from statistical physics to analyze the structure and robustness of “dark networks” such as criminal networks, terrorist networks, and Web sites created by terrorists and their supporters. The findings could help authorities better understand the organization of these dark networks and develop effective disruptive strategies.
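The kind of robustness analysis referred to here can be illustrated with a toy experiment: remove nodes in decreasing order of degree (a targeted "hub" attack) and watch the size of the largest connected component collapse. The sketch below uses a hypothetical star network for illustration, not the dark-network data.

```python
from collections import defaultdict, deque

def largest_component(nodes, edges):
    """Size of the largest connected component among the surviving nodes."""
    adj = defaultdict(set)
    alive = set(nodes)
    for u, v in edges:
        if u in alive and v in alive:
            adj[u].add(v)
            adj[v].add(u)
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                      # breadth-first component sweep
            n = queue.popleft()
            size += 1
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
        best = max(best, size)
    return best

def targeted_attack(nodes, edges, fraction):
    """Remove the top `fraction` of nodes by degree (hub attack)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    ranked = sorted(nodes, key=lambda n: -degree[n])
    removed = set(ranked[: int(len(nodes) * fraction)])
    return [n for n in nodes if n not in removed]

# A hub-and-spoke network is efficient but fragile: removing the single
# hub (10% of the nodes) shatters it into isolated leaves.
nodes = list(range(10))
edges = [(0, i) for i in range(1, 10)]    # node 0 is the hub
print(largest_component(nodes, edges))                               # 10
print(largest_component(targeted_attack(nodes, edges, 0.1), edges))  # 1
```

Running the same attack against random node removal instead of hub removal shows the asymmetry that makes scale-free topologies robust to failures yet vulnerable to targeted disruption.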

Chapter 7 described and modeled the evolution of several patent citation networks. The findings would be useful for understanding the history of technology development and predicting future technology trends.

In addition to these contributions, this dissertation is especially relevant to management, business, and MIS research.


8.2 Relevance to Business, Management, and MIS

The science of networks (Barabási, 2002; Watts, 2004) has motivated a new way of thinking that views everything surrounding us as connected and makes us ponder what it means for science, business, and everyday life. In particular, managers of organizations may find a number of new opportunities for business and management by thinking in terms of networks and employing network structure mining techniques presented in this dissertation:

Marketing managers can exploit customer networks and mine the “network value” of customers (Domingos & Richardson, 2001). Some well-connected customers are rather important. They may be early adopters of some new products and can influence the purchasing decisions of many other people. Approaches proposed in this dissertation can help marketing managers locate these key customers and develop better marketing strategies.

Boards of directors are the decision-making bodies of large corporations (Robins & Alexander, 2004). Many strategic practices regarding corporate governance, adoption of new technology, and technology outsourcing spread among directors sitting on different corporation boards. Executives and directors may find network structure mining helpful for understanding the structures and evolution processes of these elite networks and making more intelligent strategic decisions.


Financial equities, stocks, and banks form networks in the financial market (Bonanno et al., 2004; Inaoka et al., 2004), which is by nature a complex system. With the techniques presented in this dissertation, managers, banks, and financial professionals may better understand the behavior of this complex system and select profitable financial portfolios or policies.

Information systems that incorporate the network mining techniques will be able to provide organizations with not only information storage functionality but also the ability to discover useful knowledge from networks, thereby enhancing organizations’ competitive advantages.

8.3 Future Directions

Network structure mining is a fairly new area, and many new methodologies and technologies are needed. My future research on network structure mining will proceed in the following directions.

From the theoretical perspective, I will develop a more comprehensive research framework as the research on network structure mining matures. New research questions, techniques, and findings will be added to the framework. Although it is rather comprehensive, the current framework does not incorporate research on resource diffusion in networks. The future framework will consider this missing piece. I will also continue to work on the network evolution problem by developing new models and revealing new mechanisms responsible for network evolution. Such research will contribute to theory building in network evolution.

From the technical perspective, my future research will include the development of more techniques and methods for mining structural patterns in networks. In particular, the unweighted network partitioning approach proposed in this dissertation still has much room for improvement. My objective is to develop more effective and efficient algorithms.

From the empirical perspective, I will experiment with my techniques in more application domains. This dissertation covers only a few of the domains where network structure mining can apply. In the future, I will apply the techniques to Web mining, biological network mining, and citation network mining, among many others.


APPENDIX A: DOCUMENTS FOR THE CRIMENET EXPLORER EXPERIMENT

A1: Instructions for Experiment Participants

Participant number: _________________________

Date: _____________________________________

The purpose of this study is to evaluate the performance of a criminal network analysis system.

The network presented in this study consists of criminals. However, no detailed information about any criminal is shown except for their scrubbed names. Therefore, the network can simply be treated as a general network consisting of nodes. This study does not require any domain knowledge about criminal networks or crime investigation. You are eligible to participate because you have basic experience using computers.

Your participation will involve completing the tasks of discovering the structure of a network.

You may choose not to answer some or all of the questions. During the observation, time will be recorded for each task you complete. Your name will not appear on any written notes.

Any questions you have will be answered and you may withdraw from the study at any time.

There are no known risks from your participation, and no direct benefit from your participation is expected.

Questionnaire and observation information will be assigned a subject number and locked in a cabinet in a secure place. Your name will not be revealed in any reports that result from this project.


A2: Introduction to System Functionality

* Degree of a point: the number of links the point has. Hint: A point with a high degree score is like a “leader”.

Operations and explanations:

- Switch between two tabbed panes, one of which is for narcotics and the other for gangs.
- Show the network of individuals.
- Reset the network to its original display.
- Adjust the level of abstraction, with 0% indicating the original network of individuals.
- View the centrality rankings of individuals.
- Drag-and-drop: move points around on the display.
- Single-click on a bubble representing a group on the display (level > 0) to see the inner structure of that group.
- Single-click on a bubble representing a group on the display (level > 0) to see the rankings of group members in terms of their degree*.


Task 1. Do you think the people who are included in circle A should be in the same group? If not, who should be included in this group and who should be excluded from it?

Task 2. Do you think the people who are included in circle B should be in the same group? If not, who should be included in this group and who should be excluded from it?

Task 3. Do group A and group B have direct relations (are there lines linking members of group A and members of group B)?

Task 4. Does group A have more links to group B than to group C?

Task 5. Identify the person who scores the highest in degree (give his/her name).

Task 6. Identify the person who scores the highest in degree (give his/her name).

Task 7. Group A is labeled “PERALES JASON”; group B is labeled “SANCHEZ KEDI”. Do group A and group B have direct relations?

Task 8. Group A is labeled “PERALES JASON”; group B is labeled “SANCHEZ KEDI”; group C is labeled “TEMPLETON SERGIO”. Does group C have a stronger relation to A than to B?

Task 9. Identify the person who scores the highest in degree (give his/her name).

Task 10. Identify the person who scores the highest in degree (give his/her name).

1. Participant #: ______________

2. Year of Birth: ______________

3. Gender: Male / Female

4. Academic background: Undergraduate / Graduate

5. In general, I am very comfortable with computers.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

6. I am very experienced with the Internet.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

7. I am very experienced with Microsoft Excel.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree


Please answer the following questions regarding the interface of this system.

8. Switching between the two tabbed panes is easy.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

9. The reset button is NOT easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

10. The slider is easy to adjust.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

11. The meaning of the slider is confusing.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

12. The table is easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

13. The table is confusing.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

14. Moving points around on the network is easy.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

15. The meaning of the network at 0 level of abstraction (points are individuals) is confusing.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

16. The meaning of groups is confusing (groups are represented by circles).
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

17. I learned how to use the system interface (including buttons, slider, table, etc.) very quickly.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree


18. In general, the interface is easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

You have just completed many tasks. Recall how you performed these tasks and answer the following questions.

19. I was very comfortable with evaluating the groupings produced by the system.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

20. I felt confused by the groupings produced by the system.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

21. It was easier to find inter-group relations when group members were put into circles than when they were shown individually.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

22. It was easier to use the table to find the person with the highest degree than to count lines in a small window.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

23. In general, this system is easy to learn.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

24. In general, this system is easy to use.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

25. In general, I am satisfied with the system.
A. Strongly agree B. Agree C. Neither agree nor disagree D. Disagree E. Strongly disagree

26. Please give any comments regarding the system. Thank you very much.


REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C.

Albert, R., & Barabási, A.-L. (2000). Topology of evolving networks: Local events and universality. Physical Review Letters, 85(24), 5234-5237.

Albert, R., & Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 47-97.

Albert, R., Jeong, H., & Barabási, A.-L. (1999). Diameter of the World-Wide Web. Nature, 401, 130-131.

Albert, R., Jeong, H., & Barabási, A.-L. (2000). Error and attack tolerance of complex networks. Nature, 406, 378-382.

Aldenderfer, M. S., & Blashfield, R. K. (1984). Cluster Analysis. Beverly Hills: Sage Publications.

Ali, M., & Kamoun, F. (1993). Neural networks for shortest path computation and routing in computer networks. IEEE Transactions on Neural Networks, 4(5), 941-953.

Amaral, L. A. N., Scala, A., Barthelemy, M., & Stanley, H. E. (2000). Classes of small-world networks. Proceedings of the National Academy of Sciences of the United States of America, 97, 11149-11152.

Anderson, T., Arbetter, L., Benawides, A., & Longmore-Etheridge, A. (1994). Security works. Security Management, 38(17), 17-20.

Arabie, P., Boorman, S. A., & Levitt, P. R. (1978). Constructing blockmodels: How and why. Journal of Mathematical Psychology, 17, 21-63.

Araujo, F., Ribeiro, B., & Rodrigues, L. (2001). A neural network for shortest path computation. IEEE Transactions on Neural Networks, 12(5), 1067-1073.

Asano, T., Kirkpatrick, D., & Yap, C. (2002). Pseudo approximation algorithms, with applications to optimal motion planning. Proceedings of the 18th Annual Symposium on Computational Geometry, Barcelona, Spain.

Baker, W. E., & Faulkner, R. R. (1993). The social organization of conspiracy: Illegal networks in the heavy electrical equipment industry. American Sociological Review, 58(12), 837-860.

Baldi, S. (1998). Normative versus social constructivist processes in the allocation of citations: A network-analytic model. American Sociological Review, 63(6), 829-846.

Barabási, A.-L. (2002). Linked: The New Science of Networks. New York, NY: Perseus Books Group.

Barabási, A.-L., Albert, R., & Jeong, H. (1999). Mean-field theory for scale-free random networks. Physica A, 272, 173-187.

Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509-512.

Barabási, A.-L., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A, 311, 590-614.

Battista, G. d., Eades, P., Tamassia, R., & Tollis, I. G. (1999). Graph Drawing: Algorithms for the Visualization of Graphs. Upper Saddle River, NJ: Prentice Hall.

Berger, N., Borgs, C., Chayes, J. T., D'Souza, R. M., & Kleinberg, R. D. (forthcoming). Degree distribution of competition-induced preferential attachment. Combinatorics, Probability and Computing.

Berkowitz, S. D. (1982). An Introduction to Structural Analysis: The Network Approach to Social Research. Toronto: Butterworth.

Bianconi, G., & Barabási, A.-L. (2001). Competition and multiscaling in evolving networks. Europhysics Letters, 54, 436-442.

Bollobás, B. (1985). Random Graphs. London: Academic.

Bollobás, B. (1998). Modern Graph Theory. New York, NY: Springer-Verlag.

Bonanno, G., Caldarelli, G., Lillo, F., Micciche, S., Vandewalle, N., & Mantegna, R. N. (2004). Networks of equities in financial markets. The European Physical Journal B, 38, 363-371.

Borgatti, S. P., & Foster, P. C. (2003). The network paradigm in organizational research: A review and typology. Journal of Management, 29, 991-1013.

Brass, D. J. (1984). Being in the right place: A structural analysis of individual influence in an organization. Administrative Science Quarterly, 29, 518-539.

Breiger, R. L. (2004). The analysis of social networks. In M. A. Hardy & A. Bryman (Eds.), Handbook of Data Analysis (pp. 505-526). London, UK: Sage Publications.

Breiger, R. L., Boorman, S. A., & Arabie, P. (1975). An algorithm for clustering relational data, with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12, 328-383.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th WWW Conference, Brisbane, Australia.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., et al. (2000). Graph structure in the web. Computer Networks, 33(1-6), 309-320.

Burt, R. S. (1976). Positions in networks. Social Forces, 55, 93-122.

Burt, R. S. (1980). Models of network structure. Annual Review of Sociology, 6, 79-141.

Carr, N. G. (2003). IT doesn't matter. Harvard Business Review, 81(5), 41-49.

Chakrabarti, S., Dom, B. E., Kumar, S. R., Raghavan, P., Rajagopalan, S., Tomkins, A., et al. (1999). Mining the web's link structure. IEEE Computer, 32(8), 60-67.

Chau, M., Xu, J., & Chen, H. (2002). Extracting meaningful entities from police narrative reports. Proceedings of the National Conference on Digital Government Research, Los Angeles, CA.

Chau, M., Zeng, D., Chen, H., Huang, M., & Hendriawan, D. (2003). Design and evaluation of a multi-agent collaborative Web mining system. Decision Support Systems, 35(1), 167-183.

Chen, C., Paul, R. J., & O'Keefe, B. (2001). Fitting the jigsaw of citation: Information visualization in domain analysis. Journal of the American Society for Information Science and Technology, 52(4), 315-330.

Chen, H., Chung, Y., Ramsey, M., & Yang, C. (1998). A smart itsy bitsy spider for the web. Journal of the American Society for Information Science, 49(7), 604-618.

Chen, H., & Lynch, K. J. (1992). Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, 22(5), 885-902.

Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., et al. (2004). The dark web portal: Collecting and analyzing the presence of domestic and international terrorist groups on the web. Proceedings of the 7th Annual IEEE Conference on Intelligent Transportation Systems (ITSC 2004), Washington, D.C.

Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., & Schroeder, J. (2003). COPLINK: Managing law enforcement data and knowledge. Communications of the ACM, 46(1), 28-34.

Chinchor, N. A. (1998). Overview of MUC-7/MET-2. Proceedings of the 7th Message Understanding Conference (MUC-7), Washington, D.C.

Chung, F. R. K. (1997). Spectral Graph Theory (Vol. 92). Providence, RI: American Mathematical Society.

Coady, W. F. (1985). Automated link analysis: Artificial intelligence-based tool for investigators. Police Chief, 52(9), 22-23.

Cook, D. J., & Holder, L. B. (2000). Graph-based data mining. IEEE Intelligent Systems, 15, 32-41.

Cormen, T. H., Leiserson, C. E., & Rivest, R. L. (1991). Introduction to Algorithms. Cambridge, MA: The MIT Press.

Csányi, G., & Szendroi, B. (2004). Structure of a large social network. Physical Review E, 69, 036131.

Culnan, M. J. (1986). The intellectual development of management information systems, 1972-1982: A co-citation analysis. Management Science, 32(2), 156-172.

Culnan, M. J. (1987). Mapping the intellectual structure of MIS, 1980-1985: A co-citation analysis. MIS Quarterly, 341-353.

Dantzig, G. (1960). On the shortest route through a network. Management Science, 6, 187-190.

Davidson, R., & Harel, D. (1996). Drawing graphs nicely using simulated annealing. ACM Transactions on Graphics, 15(4), 301-331.

Day, W. H. E., & Edelsbrunner, H. (1984). Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1, 7-24.

Defays, D. (1977). An efficient algorithm for a complete link method. Computer Journal, 20(4), 364-366.

Deo, N. (1974). Graph Theory with Applications to Engineering and Computer Science. Englewood Cliffs, NJ: Prentice-Hall.

Dijkstra, E. (1959). A note on two problems in connection with graphs. Numerische Mathematik, 1, 269-271.

Domingos, P., & Richardson, M. (2001). Mining the network value of customers. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA.

Doreian, P., & Stokman, F. N. (1997). The dynamics and evolution of social networks. In P. Doreian & F. N. Stokman (Eds.), Evolution of Social Networks (pp. 1-17). Australia: Gordon and Breach.

Dorogovtsev, S. N., & Mendes, J. F. F. (2003). Evolution of Networks: From Biological Nets to the Internet and WWW. New York, NY: Oxford University Press.

Dorogovtsev, S. N., Mendes, J. F. F., & Samukhin, A. N. (2000). Structure of growing networks with preferential linking. Physical Review Letters, 85(21), 4633-4636.

Eades, P. (1984). A heuristic for graph drawing. Congressus Numerantium, 42, 149-160.

Erdös, P., & Rényi, A. (1960). On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5, 17-61.

Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine? Communications of the ACM, 39(11), 65-68.

Evan, W. M. (1972). An organization-set model of interorganizational relations. In M. Tuite, R. Chisholm & M. Radnor (Eds.), Interorganizational Decision-Making (pp. 181-200). Chicago: Aldine.

Evans, J., & Minieka, E. (1992). Optimization Algorithms for Networks and Graphs (2nd ed.). New York, NY: Marcel Dekker.

Faloutsos, M., Faloutsos, P., & Faloutsos, C. (1999). On power-law relationships of the Internet topology. Proceedings of the Annual Conference of the Special Interest Group on Data Communication (SIGCOMM '99), Cambridge, MA.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996a). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Fayyad, U. M., Piatetsky-Shapiro, G., & Smyth, P. (1996b). From data mining to knowledge discovery: An overview. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press/The MIT Press.

Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23, 298-305.

Flake, G. W., Lawrence, S., & Giles, C. L. (2000). Efficient identification of web communities. Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2000), Boston, MA.

Flake, G. W., Lawrence, S., Giles, C. L., & Coetzee, F. M. (2002). Self-organization and identification of web communities. IEEE Computer, 35(3), 66-71.

Floyd, R. W. (1962). Algorithm 97: Shortest path. Communications of the ACM, 5(6), 345-370.

Ford Jr., L. R., & Fulkerson, D. R. (1956). Maximal flow through a network. Canadian Journal of Mathematics, 8, 399-404.

Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social Networks, 1, 215-240.

Freeman, L. C. (2000). Visualizing social networks. Journal of Social Structure, 1(1).

Fruchterman, T. M. J., & Reingold, E. M. (1991). Graph drawing by force-directed placement. Software--Practice & Experience, 21(11), 1129-1164.

Furnas, G. W. (1986). Generalized fisheye views. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '86), Boston, MA.

Galaskiewicz, J., & Krohn, K. (1984). Positions, roles, and dependencies in a community interorganization system. Sociological Quarterly, 25, 527-550.

Garfield, E. (2001). From bibliographic coupling to co-citation analysis via algorithmic historio-bibliography: A citationist's tribute to Belver C. Griffith. Lazerow Lecture presented at Drexel University, Philadelphia, PA, November 27, 2001. Retrieved from http://garfield.library.upenn.edu/papers/drexelbevergriffith92001.pdf

Garlaschelli, D., Caldarelli, G., & Pietronero, L. (2003). Universal scaling relations in food webs. Nature, 423(6936), 165-168.

Garton, L., Haythornthwaite, C., & Wellman, B. (1999). Studying online social networks. In S. Jones (Ed.), Doing Internet Research (pp. 75-105). Thousand Oaks, CA: Sage Publications.

Giannakis, M., & Croom, S. (2001). The intellectual structure of supply chain management: An application of the social network analysis and citation analysis to SCM related journals. Proceedings of the 10th International Annual IPSERA Conference, Jönköping, Sweden.

Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring web communities from link topology. Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA.

Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99, 7821-7826.

Goldberg, H. G., & Senator, T. E. (1998). Restructuring databases for knowledge discovery by consolidation and link formation. Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis, Orlando, FL.

Goldberg, H. G., & Wong, R. W. H. (1998). Restructuring transactional data for link analysis in the FinCEN AI system. Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis, Orlando, FL.

Gómez-Gardeñes, J., & Moreno, Y. (2004). Local versus global knowledge in the Barabási-Albert scale-free network model. Physical Review E, 69, 037103.

Gulati, R., & Gargiulo, M. (1999). Where do interorganizational networks come from? American Journal of Sociology, 104(4), 1439-1493.

Hajra, K. B., & Sen, P. (2005). Aging in citation networks. Physica A, 346, 44-48.

Harary, F. (1994). Graph Theory. Reading, MA: Addison-Wesley.

Harper, W. R., & Harris, D. H. (1975). The application of link analysis to police intelligence. Human Factors, 17(2), 157-164.

Hauck, R. V., Atabakhsh, H., Ongvasith, P., Gupta, H., & Chen, H. (2002). Using COPLINK to analyze criminal-justice data. IEEE Computer, 35(3), 30-37.


Helgason, R. V., Kennington, J. L., & Stewart, B. D. (1993). The one-to-one shortest-path problem: An empirical analysis with the two-tree Dijkstra algorithm. Computational Optimization and Applications, 1, 47-75.

Herman, I., Melançon, G., & Marshall, M. S. (2000). Graph visualization and navigation in information visualization: A survey. IEEE Transactions on Visualization and Computer Graphics, 6(1), 24-43.

El-Rewini, H., Lewis, T. G., & Ali, H. H. (1994). Task Scheduling in Parallel and Distributed Systems. Upper Saddle River, NJ: Prentice-Hall.

Holme, P., Kim, B. J., Yoon, C. N., & Han, S. K. (2002). Attack vulnerability of complex networks. Physical Review E, 65, 056109.

Huang, Z., Chen, H., Yip, A., Ng, G., Guo, F., Chen, Z.-K., et al. (2003). Longitudinal patent analysis for nanoscale science and engineering: Country, institution, and technology field. Journal of Nanoparticle Research, 5, 333-363.

Huberman, B. A., & Adamic, L. A. (1999). Growth dynamics of the World Wide Web. Nature, 401, 131.

Hummon, N. P. (2000). Utility and dynamic social networks. Social Networks, 22, 221-249.

Imafuji, N., & Kitsuregawa, M. (2002). Effects of maximum flow algorithm for identifying web community. Proceedings of the 4th ACM CIKM International Workshop on Web Information and Data Management (WIDM'02), McLean, VA.

Inaoka, H., Takayasu, H., Shimizu, T., Ninomiya, T., & Taniguchi, K. (2004). Self-similarity of banking network. Physica A, 339, 621-634.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Janssen, M. A., & Jager, W. (2003). Simulating market dynamics: Interactions between consumer psychology and social networks. Artificial Life, 9, 343-356.

Jeong, H., Mason, S. P., Barabási, A.-L., & Oltvai, Z. N. (2001). Lethality and centrality in protein networks. Nature, 411(6833), 41.


Jeong, H., Néda, Z., & Barabási, A.-L. (2003). Measuring preferential attachment for evolving networks. Europhysics Letters, 61, 567-572.

Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., & Barabási, A.-L. (2000). The large-scale organization of metabolic networks. Nature, 407(6804), 651-654.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.

Jordan, P. W. (1998). An Introduction to Usability. Bristol, PA: Taylor & Francis.

Kamada, T., & Kawai, S. (1989). An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1), 7-15.

Kannan, R., Vempala, S., & Vetta, A. (2004). On clusterings: Good, bad and spectral. Journal of the Association for Computing Machinery, 51(3), 497-515.

Kautz, H., Selman, B., & Shah, M. (1997). ReferralWeb: Combining social networks and collaborative filtering. Communications of the ACM, 40(3), 27-36.

Kephart, J. O., Sorkin, G. B., Arnold, W. C., Chess, D. M., Tesauro, G. J., & White, S. R. (1998). Biologically inspired defenses against computer viruses. In R. S. Michalski (Ed.), Machine Learning and Data Mining: Methods and Applications. New York, NY: John Wiley.

Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49, 291-307.

Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA.

Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. S. (1999). The web as a graph: Measurements, models, and methods. Proceedings of the 5th Annual International Conference on Computing and Combinatorics (COCOON'99), Tokyo, Japan.

Kleinberg, J., & Lawrence, S. (2001). The structure of the web. Science, 294, 1849-1850.

Kleinberg, J., Sandler, M., & Slivkins, A. (2004). Network failure detection and graph connectivity. Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA.

Klerks, P. (2001). The network paradigm applied to criminal organizations: Theoretical nitpicking or a relevant doctrine for investigators? Recent developments in the Netherlands. Connections, 24(3), 53-65.


Krapivsky, P. L., Redner, S., & Leyvraz, F. (2000). Connectivity of growing random networks. Physical Review Letters, 85(21), 4629-4632.

Krause, A. E., Frank, K. A., Mason, D. M., Ulanowicz, R. E., & Taylor, W. W. (2003). Compartments revealed in food-web structure. Nature, 426, 282-285.

Krebs, V. E. (2001). Mapping networks of terrorist cells. Connections, 24(3), 43-52.

Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29(2), 115-128.

Kruskal, J. B., & Wish, M. (1978). Multidimensional Scaling. Beverly Hills, CA: Sage Publications.

Kumar, S. R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the web for emerging cyber-communities. Computer Networks, 31(11-16), 1481-1493.

Kumar, S. R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (2002). The web and social networks. IEEE Computer, 35(11), 32-36.

Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies: II. Clustering systems. Computer Journal, 10, 271-277.

Lawrence, S., & Giles, C. L. (1999). Accessibility of information on the web. Nature, 400, 107-109.

Lee, R. (1998). Automatic information extraction from documents: A tool for intelligence and law enforcement analysts. Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis, Orlando, FL.

Liljeros, F., Edling, C. R., Amaral, L. A. N., Stanley, H. E., & Aberg, Y. (2001). The web of human sexual contacts. Nature, 411, 907-908.

Lorrain, F. P., & White, H. C. (1971). Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, 1, 49-80.

McAndrew, D. (1999). The structural analysis of criminal networks. In D. Canter & L. Alison (Eds.), The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III (pp. 53-94). Aldershot: Dartmouth.

Menczer, F. (2004). Evolution of document networks. Proceedings of the National Academy of Sciences of the United States of America, 101, 5261-5265.

Milgram, S. (1967). The small world problem. Psychology Today, 2, 60-67.


Moreno, J. L. (1953). Who Shall Survive? Beacon, NY: Beacon House.

Murtagh, F. (1984). A survey of recent advances in hierarchical clustering algorithms which use cluster centers. Computer Journal, 26, 354-359.

Newman, M. E. J. (2001a). Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E, 64, 016131.

Newman, M. E. J. (2001b). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98, 404-409.

Newman, M. E. J. (2003a). Mixing patterns in networks. Physical Review E, 67(2), 026126.

Newman, M. E. J. (2003b). The structure and function of complex networks. SIAM Review, 45(2), 167-256.

Newman, M. E. J. (2004a). Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences of the United States of America, 101, 5200-5205.

Newman, M. E. J. (2004b). Detecting community structure in networks. European Physical Journal B, 38, 321-330.

Newman, M. E. J. (2004c). Fast algorithm for detecting community structure in networks. Physical Review E, 69(6), 066133.

Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.

Palmer, C. R., Gibbons, P. B., & Faloutsos, C. (2002). ANF: A fast and scalable tool for data mining in massive graphs. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.

Pennock, D. M., Flake, G. W., Lawrence, S., Glover, E. J., & Giles, C. L. (2002). Winners don't take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences of the United States of America, 99(8), 5207-5211.

Perkins, C. E., & Bhagwat, P. (1994). Highly dynamic destination-sequenced distance-vector routing (DSDV) for mobile computers. Proceedings of the SIGCOMM Symposium on Communications Architectures and Protocols, London, UK.

Pothen, A., Simon, H. D., & Liou, K.-P. (1990). Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3), 430-452.

Powell, W. W., White, D. R., Koput, K. W., & Owen-Smith, J. (2005, forthcoming). Network dynamics and field evolution: The growth of inter-organizational collaboration in the life sciences. American Journal of Sociology.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical Recipes in C (2nd ed.). Cambridge: Cambridge University Press.

Price, D. J. D. (1965). Networks of scientific papers. Science, 149, 510-515.

Purchase, H. C. (1997). Which aesthetic has the greatest effect on human understanding? Proceedings of the 5th International Symposium on Graph Drawing, Rome, Italy.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Raab, J., & Milward, H. B. (2003). Dark networks as problems. Journal of Public Administration Research and Theory, 13(4), 413-439.

Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., & Parisi, D. (2004). Defining and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America, 101, 2658-2663.

Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.

Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., & Barabási, A.-L. (2002). Hierarchical organization of modularity in metabolic networks. Science, 297, 1551-1555.

Reingold, E. M., & Tilford, J. S. (1981). Tidier drawing of trees. IEEE Transactions on Software Engineering, 7(2), 223-228.

Rives, A. W., & Galitski, T. (2003). Modular organization of cellular networks. Proceedings of the National Academy of Sciences of the United States of America, 100(3), 1128-1133.


Robins, G., & Alexander, M. (2004). Small worlds among interlocking directors: Network structure and distance in bipartite graphs. Computational & Mathematical Organization Theory, 10, 69-94.

Ronfeldt, D., & Arquilla, J. (2001). What next for networks and netwars? In J. Arquilla & D. Ronfeldt (Eds.), Networks and Netwars: The Future of Terror, Crime, and Militancy. Santa Monica, CA: Rand Press.

Roussinov, D. G., & Chen, H. (1999). Document clustering for electronic meetings: An experimental comparison of two techniques. Decision Support Systems, 27, 67-79.

Saether, M., & Canter, D. V. (2001). A structural analysis of fraud and armed robbery networks in Norway. Proceedings of the 6th International Investigative Psychology Conference, Liverpool, England.

Sageman, M. (2004). Understanding Terror Networks. Philadelphia, PA: University of Pennsylvania Press.

Sahami, M., Yusufali, S., & Baldonado, Q. W. (1998). SONIA: A service for organizing networked information autonomously. Proceedings of the 3rd ACM International Conference on Digital Libraries, Pittsburgh, PA.

Scott, J. (1991). Social Network Analysis. London, UK: Sage Publications.

Shaw, W. M. J., Burgin, R., & Howell, P. (1997). Performance standards and evaluations in information retrieval test collections: Cluster-based retrieval models. Information Processing & Management, 33(1), 1-14.

Small, H. (1999). Visualizing science by citation mapping. Journal of the American Society for Information Science, 50(9), 799-813.

Small, H. G. (1977). A co-citation model of a scientific specialty: A longitudinal study of collagen research. Social Studies of Science, 7, 139-166.

Solé, R. V., & Montoya, J. M. (2001). Complexity and fragility in ecological networks. Proceedings of the Royal Society B, 268, 2039-2045.

Somogyi, R., & Sniegoski, S. A. (1996). Modeling the complexity of genetic networks: Understanding multigenic and pleiotropic regulation. Complexity, 1(6), 45-63.

Sparrow, M. K. (1991). The application of network analysis to criminal intelligence: An assessment of the prospects. Social Networks, 13, 251-274.


Stuart, T. E. (1998). Network positions and propensities to collaborate: An investigation of strategic alliance formation in a high-technology industry. Administrative Science Quarterly, 43, 668-698.

Tolle, K. M., & Chen, H. (2000). Comparing noun phrasing techniques for use with medical digital library tools. Journal of the American Society for Information Science, 51(4), 352-370.

Torgerson, W. S. (1952). Multidimensional scaling: Theory and method. Psychometrika, 17, 401-419.

Toroczkai, Z., & Bassler, K. E. (2004). Jamming is limited in scale-free systems. Nature, 428, 716.

Toyoda, M., & Kitsuregawa, M. (2001). Creating a web community chart for navigating related communities. Proceedings of the 12th ACM Conference on Hypertext and Hypermedia, Aarhus, Denmark.

Toyoda, M., & Kitsuregawa, M. (2003). Extracting evolution of web communities from a series of web archives. Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, Nottingham, UK.

Tu, Y. (2000). How robust is the Internet? Nature, 406, 353-354.

Valente, T. W. (1995). Network Models of the Diffusion of Innovations. Cresskill, NJ: Hampton Press.

vanCleemput, W. M. (1976). On the topological aspects of the circuit layout problem. Proceedings of the 13th Conference on Design Automation, San Francisco, CA.

Voorhees, E. M. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing & Management, 22(6), 465-476.

Wang, Z., & Crowcroft, J. (1992). Analysis of shortest-path routing algorithms in a dynamic network environment. ACM Computer Communication Review, 22(2), 63-71.

Wasserman, S., & Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.

Watts, D. J. (2002). A simple model of global cascades on random networks. Proceedings of the National Academy of Sciences of the United States of America, 99, 5766-5771.


Watts, D. J. (2004). The "new" science of networks. Annual Review of Sociology, 30, 243-270.

Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of "small-world" networks. Nature, 393, 440-442.

White, D. R., & Newman, M. E. J. (2001). Fast approximation algorithms for finding node-independent paths in networks. Retrieved from http://ideas.repec.org/p/wop/safiwp/01-07-035.html

White, H. C., Boorman, S. A., & Breiger, R. L. (1976). Social structure from multiple networks: I. Blockmodels of roles and positions. American Journal of Sociology, 81, 730-780.

White, H. D., & McCain, K. W. (1998). Visualizing a discipline: An author co-citation analysis of information science, 1972-1995. Journal of the American Society for Information Science, 49(4), 327-355.

Xu, J., & Chen, H. (2003). Untangling criminal networks: A case study. Proceedings of the 1st NSF/NIJ Symposium on Intelligence and Security Informatics (ISI'03), Tucson, AZ.

Xu, J., & Chen, H. (forthcoming). Criminal network analysis and visualization: A data mining perspective. Communications of the ACM.

Xu, J. J., & Chen, H. (2004). Fighting organized crime: Using shortest-path algorithms to identify associations in criminal networks. Decision Support Systems, 38(3), 473-487.

Xu, J. J., & Chen, H. (2005). CrimeNet Explorer: A framework for criminal network knowledge discovery. ACM Transactions on Information Systems, 23(2).

Young, F. W. (1987). Multidimensional Scaling: History, Theory, and Applications. Hillsdale, NJ: Lawrence Erlbaum Associates.

Zhao, L., Park, K., & Lai, Y.-C. (2004). Attack vulnerability of scale-free networks due to cascading breakdown. Physical Review E, 70, 035101.
