Gossip-based Protocols for
Large-scale Distributed Systems
DSc Dissertation
Márk Jelasity
Szeged, 2013
Contents

Preface                                                            1

1 Gossip Protocol Basics                                           3
  1.1 Gossip and Epidemics                                         3
  1.2 Information dissemination                                    4
      1.2.1 The Problem                                            5
      1.2.2 Algorithms and Theoretical Notions                     5
      1.2.3 Applications                                          12
  1.3 Aggregation                                                 13
      1.3.1 Algorithms and Theoretical Notions                    14
      1.3.2 Applications                                          17
  1.4 What is Gossip after all?                                   18
      1.4.1 Overlay Networks                                      18
      1.4.2 Prototype-based Gossip Definition                     19
  1.5 Conclusions                                                 19
  1.6 Further Reading                                             20

2 Peer Sampling Service                                           23
  2.1 Introduction                                                23
  2.2 Peer-Sampling Service                                       25
      2.2.1 API                                                   25
      2.2.2 Generic Protocol Description                          25
      2.2.3 Design Space                                          27
      2.2.4 Known Protocols as Instantiations of the Model        29
      2.2.5 Implementation                                        29
  2.3 Local Randomness                                            30
      2.3.1 Experimental Settings                                 31
      2.3.2 Test Results                                          31
      2.3.3 Conclusions                                           33
  2.4 Global Randomness                                           33
      2.4.1 Properties of Degree Distribution                     34
      2.4.2 Clustering and Path Lengths                           40
  2.5 Fault Tolerance                                             42
      2.5.1 Catastrophic Failure                                  42
      2.5.2 Churn                                                 43
      2.5.3 Trace-driven Churn Simulations                        46
  2.6 Wide-Area-Network Emulation                                 48
  2.7 Discussion                                                  50
      2.7.1 Randomness                                            50
  2.8 Related Work                                                52
      2.8.1 Gossip Membership Protocols                           52
      2.8.2 Complex Networks                                      52
      2.8.3 Unstructured Overlays                                 53
      2.8.4 Structured Overlays                                   53
  2.9 Concluding Remarks                                          53

3 Average Calculation                                             55
  3.1 Introduction                                                55
  3.2 System Model                                                56
  3.3 Gossip-based Aggregation                                    57
      3.3.1 The Basic Aggregation Protocol                        57
      3.3.2 Theoretical Analysis of Gossip-based Aggregation      58
  3.4 A Practical Protocol for Gossip-based Aggregation           67
      3.4.1 Automatic Restarting                                  67
      3.4.2 Coping with Churn                                     67
      3.4.3 Synchronization                                       67
      3.4.4 Importance of Overlay Network Topology for Aggregation 68
      3.4.5 Cost Analysis                                         71
  3.5 Aggregation Beyond Averaging                                72
      3.5.1 Examples of Supported Aggregates                      72
      3.5.2 Dynamic Queries                                       75
  3.6 Theoretical Results for Benign Failures                     75
      3.6.1 Crashing Nodes                                        75
      3.6.2 Link Failures                                         76
      3.6.3 Conclusions                                           78
  3.7 Simulation Results for Benign Failures                      78
      3.7.1 Node Crashes                                          78
      3.7.2 Link Failures and Message Omissions                   80
      3.7.3 Robustness via Multiple Instances of Aggregation      81
  3.8 Experimental Results on PlanetLab                           82
  3.9 Related Work                                                85
  3.10 Conclusions                                                86

4 Distributed Power Iteration                                     89
  4.1 Introduction                                                89
  4.2 Chaotic Asynchronous Power Iteration                        90
  4.3 Adding Normalization                                        91
  4.4 Controlling the Vector Norm                                 92
      4.4.1 Keeping the Vector Norm Constant                      92
      4.4.2 The Random Surfer Operator of PageRank                93
  4.5 Experimental Results                                        93
      4.5.1 Notes on the Implementation                           93
      4.5.2 Artificially Generated Matrices                       94
      4.5.3 Results                                               95
      4.5.4 PageRank on WWW Crawl Data                            97
  4.6 Related Work                                                98
  4.7 Conclusions                                                 98

5 Slicing Overlay Networks                                        99
  5.1 Introduction                                                99
  5.2 Problem Definition                                         100
      5.2.1 System Model                                         100
      5.2.2 The Ordered Slicing Problem                          100
  5.3 A Gossip-based Approach                                    101
  5.4 Analogy with Gossip-based Averaging                        103
  5.5 Experimental Analysis                                      105
      5.5.1 The Number of Successful Swaps                       105
      5.5.2 Message Drop                                         106
      5.5.3 Churn                                                107
      5.5.4 An Illustrative Example                              108
  5.6 Conclusions                                                110

6 T-Man: Topology Construction                                   111
  6.1 Introduction                                               111
  6.2 Related Work and Contribution                              112
  6.3 System Model                                               113
  6.4 The Overlay Construction Problem                           114
  6.5 The T-MAN Protocol                                         116
  6.6 Key Properties of the Protocol                             117
      6.6.1 Analogy with the Anti-Entropy Epidemic Protocol      118
      6.6.2 Parameter Setting for Symmetric Target Graphs        118
      6.6.3 Notes on Asymmetric Target Graphs                    120
      6.6.4 Storage Complexity Analysis                          122
  6.7 Experimental Results                                       125
      6.7.1 A Practical Implementation                           125
      6.7.2 Simulation Environment                               127
      6.7.3 Ranking Methods                                      128
      6.7.4 Performance Measures                                 129
      6.7.5 Evaluating the Starting Mechanism                    129
      6.7.6 Evaluating the Termination Mechanism                 130
      6.7.7 Parameter Tuning                                     131
      6.7.8 Failures                                             132
  6.8 Conclusions                                                134

7 Bootstrapping Chord                                            135
  7.1 Introduction                                               135
  7.2 System Model                                               136
  7.3 The T-CHORD protocol                                       136
      7.3.1 A Brief Introduction to Chord                        137
      7.3.2 T-CHORD                                              137
      7.3.3 T-CHORD-PROX: Network Proximity                      138
  7.4 Experimental Results                                       138
      7.4.1 Experimental Settings                                138
      7.4.2 Convergence                                          139
      7.4.3 Scalability                                          140
      7.4.4 Parameters                                           141
      7.4.5 Robustness                                           141
      7.4.6 Starting and Termination                             142
  7.5 Related Work                                               143
  7.6 Conclusions                                                144

8 Towards a Generic Bootstrapping Service                        145
  8.1 Introduction                                               145
  8.2 The Architecture                                           146
  8.3 Bootstrapping Prefix Tables                                147
  8.4 Simulation Results                                         148
  8.5 Conclusions                                                150

Bibliography                                                     151
Preface
This dissertation is based on my work related to gossip protocols that solve various problems in massively distributed large scale networks. Resisting the temptation to present
all my results in the area, my aim was to paint a coherent picture focusing on a subset of
my core results consisting of closely related algorithms and systems that strongly build
on each other. These results can be considered puzzle pieces that can be used to build a
class of self-organizing massively distributed and robust adaptive systems.
Complying with the requirement of the Hungarian Academy of Sciences, the results
included in the dissertation all originate from a period well after defending my PhD dissertation in the year 2001. Besides, my PhD dissertation was in the area of heuristic
optimization and genetic algorithms, so there is no overlap with the presented work at all.
In the following I present the outline of the dissertation, referring to the publications that
form the basis of each chapter. I will describe my own contribution to these publications
and I will mention results that are not covered in the dissertation when appropriate.
Chapter 1 is based mainly on [1], a textbook chapter I wrote to introduce gossip protocols. It was extended with material from [2], a survey paper that was written under my
direction, based on my insight that gossip protocols cannot be defined rigorously;
instead, they represent a design philosophy that one can follow to varying degrees and that
is shared by many other approaches. The chapter was also revised, most importantly, the
message passing formulations of the algorithms were improved so that they can form the
basis of the common conventions that I adopted throughout the dissertation.
Chapter 2 is based on [3]. The origins of this work can be traced back to the NEWSCAST protocol, which was published only as a technical report [4]. However, even as a technical report it achieved a rather high impact (138 citations as of now in Google Scholar)
and was recently reprinted as a book chapter [5]. After realizing that many similar protocols had been proposed independently of NEWSCAST, the original idea underlying the
unifying peer sampling framework was mine. This first resulted in a conference publication [6] followed by the journal article [3]. I am the main author of the paper and I
contributed most of the implementation and experimental work as well (except the trace-driven simulations, and the cluster implementation and experiments). The presentation
of the algorithm itself in this dissertation was thoroughly revised and was aligned with
the common conventions that are based on an asynchronous event-based approach. I also
added a section about the related protocols, in particular NEWSCAST, which inspired the
unified framework. One notable publication not covered in the dissertation is [7], which
presents an improvement over NEWSCAST to make it adaptive to different non-uniform
localized patterns of failure and application load.
Chapter 3 is based on a journal article [8] that in turn is based on two conference publications: [9] and [10]. The idea of the algorithm, as well as the theoretical analysis is my
contribution. I also implemented the algorithm, and contributed some of the experimental
evaluation as well. In this chapter the theoretical analysis was completely rewritten from
scratch. This is due to the unfortunate fact that in the original publication the analysis
was not correct. The new theoretical discussion leaves the main conclusions unchanged
but provides rigorous formulations and proofs for the original results. This way I now
precisely characterize the convergence speed in many interesting cases.
Chapter 4 is based on [11]. As the main author, I contributed the implementation and
the experimental evaluation, as well as most details of the algorithm. This chapter illustrates how the peer sampling service from Chapter 2 and averaging from Chapter 3 can
be combined to solve a practically relevant non-trivial problem. As in the other chapters,
the algorithm presentation was thoroughly revised and aligned with the conventions.
Chapter 5 is based on [12]. The idea of the algorithm, its implementation, as well as
its theoretical and experimental evaluation is my contribution. The interesting aspect of
this work is that it implements a sorting algorithm in a distributed way that can be characterized using the theoretical tools developed in Chapter 3. Since there the theoretical
results are new (as mentioned above), in this chapter the theoretical results are also thoroughly revised and extended, and a closer connection is made with averaging than what
was presented in the original publication. The workshop paper [13] should be mentioned
here as well (not covered in the dissertation) that implements an entirely different method
for distributed ranking, that is also based on gossip.
Chapter 6 is based on the journal article [14], which in turn goes back to [15]. The
original idea of the T-MAN algorithm and its first implementation is my contribution. My
co-authors contributed practical features such as starting and termination variants. The
approximative theoretical models and the related empirical analysis were also contributed
by me. A part of the experimental work was completed by me as well. In this chapter the
algorithm description was reworked to fit into the framework used by the other chapters,
and the theoretical discussion was slightly reformulated and clarified.
The remaining two chapters discuss applications of T-MAN. Chapter 7 is based on [16]
and partly also on [14]. Here I contributed to the design of the algorithm. The presentation of the algorithm was thoroughly revised to match the structure of the dissertation,
and the experimental section was extended with results from [14]. Chapter 8 is based
on [17], where the algorithm is based on my initial idea, and both its implementation and
experimental evaluation are my contribution. We note that this chapter also illustrates how
to build complex applications from the components presented in the dissertation.
Finally, it should be noted here that this dissertation was completed while visiting
Cornell University. This allowed me to fully focus on finalizing this work in an exceptional intellectual atmosphere, which clearly had a noticeable impact on the quality. I am
especially grateful to Prof. Kenneth Birman—my host during the visit—who was very
supportive of my efforts to finish the dissertation.
April 2013, Szeged, Hungary and Ithaca, NY, USA
Márk Jelasity
Chapter 1
Gossip Protocol Basics
Gossip plays a very significant role in human society. Information spreads throughout the
human grapevine at an amazing speed, often reaching almost everyone in a community,
without any central coordinator. Moreover, rumor tends to be extremely stubborn: once
spread, it is nearly impossible to erase it. In many distributed computer systems—most
notably in cloud computing and peer-to-peer computing—this speed and robustness, combined with algorithmic simplicity and the lack of central management, are very attractive
features.
Accordingly, over the past few decades several gossip-based algorithms have been
developed to solve various problems. The prototypical application of gossip is information spreading (also known as multicast) where a piece of news is being spread over a
large network. In the dissertation, this application is not discussed. However, since in this
chapter our goal is to provide the necessary background, intuition and motivation for the
gossip approach, we provide a brief introduction to this area.
After discussing information spreading, we move on to generalize the family of gossip
protocols. To illustrate the generality of the gossip approach, we first discuss information
aggregation (an area of distributed data mining), where distributed information is being
summarized. We then present a completely generic gossip algorithm framework that will
accommodate most of the work in the dissertation.
1.1 Gossip and Epidemics
Like it or not, gossip plays a key role in human society. In his controversial book, Dunbar (an anthropologist) goes as far as to claim that the primary reason for the emergence
of language was to permit gossip, which had to replace grooming—a common social
reinforcement activity in primates—due to the increased group size of early human populations in which grooming was no longer feasible [18].
Whatever the case, it is beyond any doubt that gossip—apart from still being primarily
a social activity—is highly effective in spreading information. In particular, information
spreads very quickly, and the process is most resistant to attempts to stop it. In fact,
it is sometimes so effective that it can cause serious damage, especially to big corporations.
Rumors associating certain corporations with Satanism, or claiming that certain restaurant chains sell burgers containing rat meat or milk shakes containing cow eyeball fluid as
thickener, are not uncommon. Accordingly, controlling gossip has long been an
important area of research. The book by Kimmel gives many examples and details on
human gossip [19].
While gossip is normally considered to be a means for spreading information, in reality information is not just transmitted mechanically but also processed. A person collects
information, processes it, and passes the processed information on. In the simplest case,
information is filtered at least for its degree of interest. This results in the most interesting pieces of news reaching the entire group, whereas the less interesting ones will stop
spreading before getting to everyone. More complicated scenarios are not uncommon
either, where information is gradually altered. This increases the complexity of the process and might result in emergent behavior where the community acts as a “collectively
intelligent” (or sometimes perhaps not so intelligent) information processing medium.
Gossip is analogous to an epidemic, where a virus plays the role of a piece of information, and infection plays the role of learning about the information. In the past years
we even had to learn concepts such as “viral marketing”, made possible through Web 2.0
platforms such as video sharing sites, where advertisers consciously exploit the increasingly efficient and extended social networks to spread ads via gossip. The key idea is that
shocking or very funny ads are especially designed so as to maximize the chances that
viewers inform their friends about it, and so on.
Not surprisingly, epidemic spreading has similar properties to gossip, and is equally
(if not more) important to understand and control. Due to this analogy and following common practice we will mix epidemiological and gossip terminology, and apply epidemic
spreading theory to gossip systems.
Gossip and epidemics are of interest for large scale distributed systems for at least
two reasons. The first reason is inspiration to design new protocols: gossip has several
attractive properties like simplicity, speed, robustness, and a lack of central control and
bottlenecks. These properties are very important for information dissemination and collective information processing (aggregation) that are both key components of large scale
distributed systems.
The second reason is security research. With the steady growth of the Internet, viruses
and worms have become increasingly sophisticated in their spreading strategies. Infected
computers typically organize into networks (called botnets) and, being able to cooperate
and perform coordinated attacks, they represent a very significant threat to IT infrastructure. One approach to fighting these networks is to try and prevent them from spreading,
which requires a good understanding of epidemics over the Internet.
In this chapter we focus on the former aspect of gossip and epidemics: we treat them
as inspiration for the design of robust self-organizing systems and services.
1.2 Information dissemination
The most natural application of gossip (or epidemics) in computer systems is spreading
information. The basic idea of processes periodically communicating with peers and exchanging information is not uncommon in large scale distributed systems, and has been
applied from the early days of the Internet. For example, the Usenet newsgroup servers
spread posts using a similar method, and the IRC chat protocol applies a similar principle as well among IRC servers. In many routing protocols we can also observe routers
communicating with neighboring routers and exchanging traffic information, thereby improving routing tables.
However, the first real application of gossip, that was based on theory and careful
analysis, and that boosted scientific research into the family of gossip protocols, was part
of a distributed database system of the Xerox Corporation, and was used to make sure each
replica of the database on the Xerox internal network was up-to-date [20]. In this section
we will employ this application as a motivating example and illustration, and at the same
time introduce several variants of gossip-based information dissemination algorithms.
1.2.1 The Problem
Let us assume we have a set of database servers (in the case of Xerox, 300 of them, but
this number could be much larger as well). All of these servers accept updates; that is,
new records or modifications of existing records. We want to inform all the servers about
each update so that all the replicas of the database are identical and up-to-date.
Obviously, we need an algorithm to inform all the servers about a given update. We
shall call this task update spreading. In addition, we should take into account the fact that
whatever algorithm we use for spreading the update, it will not work perfectly, so we need
a mechanism for error correction.
At Xerox, update spreading was originally solved by sending the update via email to
all the servers, and error correction was done by hand. Sending emails is clearly not scalable: the sending node is a bottleneck. Moreover, multiple sources of error are possible:
the sender can have an incomplete list of servers in the network, some of the servers can
temporarily be unavailable, email queues can overflow, and so on.
Both tasks can be solved in a more scalable and reliable way using an appropriate
(separate) gossip algorithm. In the following we first introduce several gossip models and
algorithms, and then we explain how the various algorithms can be applied to solve the
above mentioned problems.
1.2.2 Algorithms and Theoretical Notions
We assume that we are given a set of nodes that are able to pass messages to each other.
In this section we will focus on the cost of spreading a single update among these nodes.
That is, we assume that at a certain point in time, one of the nodes gets a new update from
an external source, and from that point we are interested in the dynamics of the spreading
of that update when using the algorithms we describe.
When discussing algorithms and theoretical models, we will use the terminology of
epidemiology. According to this terminology, each node can be in one of three states,
namely
• susceptible (S): The node does not know about the update
• infected (I): The node knows the update and is actively spreading it
• removed (R): The node has seen the update, but is not participating in the spreading
process (in epidemiology, this corresponds to death or immunity)
These states are relative to one fixed update. If there are several concurrent updates,
one node can be infected with one update, while still being susceptible to another update,
and so on.
In realistic applications there are typically many updates being propagated concurrently, and new updates are inserted continuously. Accordingly, our algorithms will in
fact be formulated to deal with multiple updates that are coming continuously in an unpredictable manner. However, we present the simplest possible forms of these algorithms.
It is important to note that additional techniques can be applied to optimize the amortized
cost of propagating a single update, when there are multiple concurrent updates in the
system. In Section 1.2.3 we discuss some of these techniques. In addition, nodes might
know the global list or even the insertion time of the updates, as well as the list of updates
available at some other nodes. This information can also be applied to reduce propagation
cost even further.

Algorithm 1 SI gossip
 1: loop
 2:     wait(∆)
 3:     p ← random peer
 4:     if push then
 5:         sendPush(p, known updates)
 6:     else if pull then
 7:         sendPullRequest(p)
 8: procedure ONPUSH(m)
 9:     if pull then
10:         sendPull(m.sender, known updates)
11:     store m.updates
12: procedure ONPULL(m)
13:     store m.updates
14: procedure ONPULLREQUEST(m)
15:     sendPull(m.sender, known updates)
The allowed state transitions depend on the model that we study. Next, we shall
consider the SI model and the SIR model. In the SI model, nodes are initially in state
S with respect to a fixed update, and can change to state I (when they learn about the
update). Once in state I, a node can no longer change its state (I is an absorbing state). In
the SIR model, we allow nodes in state I to switch to state R, where R is the absorbing
state. This means that in the SIR model nodes might stop spreading an update eventually,
but they never forget about the update.
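The two transition structures are simple enough to capture in a few lines. The sketch below is our own illustration (the names `State`, `TRANSITIONS`, and `can_switch` are not from the dissertation): it encodes that S → I is the only transition in the SI model, with I absorbing, while the SIR model additionally allows I → R, with R absorbing.

```python
from enum import Enum

class State(Enum):
    S = "susceptible"   # does not know about the update
    I = "infected"      # knows the update and actively spreads it
    R = "removed"       # has seen the update but no longer spreads it

# Allowed state transitions for one fixed update in each model.
TRANSITIONS = {
    "SI":  {State.S: {State.I}, State.I: set(), State.R: set()},
    "SIR": {State.S: {State.I}, State.I: {State.R}, State.R: set()},
}

def can_switch(model, old, new):
    """Return True if the given model permits moving from state old to new."""
    return new in TRANSITIONS[model][old]
```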
The Algorithm in the SI Model
The algorithm that implements gossip in the SI model is shown in Algorithm 1. It is
formulated in an asynchronous message passing style, where each node executes one
process (that we call the active thread) and, furthermore, it has message handlers that
process incoming messages.
The active thread is executed once every ∆ time units. We will call this waiting
period a gossip cycle (other terminology is also used such as gossip round or period).
In line 3 we assume that a node can select a random peer node from the set of all
nodes. This assumption is not trivial, especially in very large and dynamically changing
networks. In fact, peer sampling is a fundamental service that all gossip protocols rely on.
We will discuss random peer sampling briefly in Section 1.4. Chapter 2 discusses random
peer sampling in detail.
The algorithm makes use of two important Boolean parameters called push and pull.
At least one of them has to be true, otherwise no messages are sent. Depending on these
parameters, we can talk about push, pull, and push-pull gossip, each having significantly
different dynamics and cost. In push gossip, susceptible nodes are passive and infective
nodes actively infect the population. In pull and push-pull gossip each node is active.
Obviously, a node cannot stop pulling for updates unless it knows what updates can be
expected; and it cannot avoid getting known updates either unless it advertises which updates it has already. As mentioned before, we present only a simple formulation: we pull
continuously and we keep pushing all known updates as well. Practical applications will
involve various techniques to minimize the redundant messages; although if the updates
themselves are small, then in the SI model there is not much room for optimization.

Algorithm 2 SI gossip, simpler, but inferior version
 1: loop
 2:     wait(∆)
 3:     p ← random peer
 4:     if push then
 5:         sendUpdate(p, known updates)
 6:     if pull then
 7:         sendUpdateRequest(p)
 8: procedure ONUPDATE(m)
 9:     store m.updates
10: procedure ONUPDATEREQUEST(m)
11:     sendUpdate(m.sender, known updates)
We did in fact apply a form of optimization though. To see how, let us consider
Algorithm 2. This algorithm is simpler and slightly more intuitive than Algorithm 1 but
it is not identical: the difference is that in Algorithm 1, in the message handler ONPUSH,
we can explicitly control the order of processing the push message and sending the pull
message when the push-pull variant is being run. In this case, it makes more sense to
first send the pull message and then store the received updates, because this way some
redundancy can be avoided. In fact we can easily make sure we send only the non-redundant updates back (we do not indicate this in the pseudocode to keep it simple).
Algorithm 2 does not offer this possibility of control (note that message delay is not
under our control). For this reason, in the remaining parts of the thesis we will always use
the style of formulation of Algorithm 1.
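To make the behavior of Algorithm 1 concrete, here is a minimal single-process Python simulation; this is our own sketch, not part of the dissertation, with an invented Node class, instantaneous delivery by direct method call, and no failures. The ONPUSH handler replies with the pull message before storing the received updates, and sends back only the non-redundant ones, as discussed above.

```python
import random

class Node:
    def __init__(self, push=True, pull=True):
        self.updates = set()                 # the updates known to this node
        self.push, self.pull = push, pull

    def on_push(self, sender, updates):
        if self.pull:                        # reply before storing, as in ONPUSH,
            sender.on_pull(self.updates - updates)  # sending only non-redundant updates
        self.updates |= updates

    def on_pull(self, updates):
        self.updates |= updates

    def on_pull_request(self, sender):
        sender.on_pull(set(self.updates))

def gossip_cycle(nodes):
    """One gossip cycle: every node contacts one uniformly random peer."""
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        if node.push:
            peer.on_push(node, set(node.updates))
        elif node.pull:
            peer.on_pull_request(node)

random.seed(1)
nodes = [Node() for _ in range(100)]
nodes[0].updates.add("update-1")             # one node receives the update externally
cycles = 0
while any(not n.updates for n in nodes):     # loop until everyone is infected
    gossip_cycle(nodes)
    cycles += 1
```

Because delivery happens immediately within a cycle, the spread can be slightly faster than in the fully synchronized theoretical model below, but the qualitative push-pull dynamics are the same.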
Basic Theoretical Properties of the SI Model
For theoretical purposes we will assume that messages are transmitted without delay,
and for now we will assume that no failures occur in the system. We will also assume
that messages are sent at the same time at each node, that is, messages from different
cycles do not mix and cycles are synchronized. None of these assumptions are critical for
practical usability, but they are needed for theoretical derivations that nevertheless give a
fair indication of the qualitative and also quantitative behavior of gossip protocols.
Let us start with the discussion of the push model. We will consider the propagation speed of the update as a function of the number of nodes $N$. Let $s_0$ denote the proportion of susceptible nodes at the time of introducing the update at one node. Clearly, $s_0 = (N-1)/N$. Let $s_t$ denote the proportion of susceptible nodes at the end of the $t$-th cycle; that is, at time $t\Delta$. We can calculate the expectation of $s_{t+1}$ as a function of $s_t$, provided that the peer selected in line 3 is chosen independently at each node and independently of past decisions as well. In this case, we have

$$E(s_{t+1}) = s_t\left(1-\frac{1}{N}\right)^{N(1-s_t)} \approx s_t e^{-(1-s_t)}, \qquad (1.1)$$

where $N(1-s_t)$ is the number of nodes that are infected at cycle $t$, and $(1-1/N)$ is the probability that a fixed infected node will not infect some fixed susceptible node. Clearly, a node is susceptible in cycle $t+1$ if it was susceptible in cycle $t$ and all the infected nodes picked some other node. Actually, as it turns out, this approximative model is rather accurate (the deviation from it is small), as shown by Pittel in [21]: we can take the expected value $E(s_{t+1})$ as a good approximation of $s_{t+1}$.
It is easy to see that if we wait long enough, then eventually all the nodes will receive the update. In other words, the probability that a particular node never receives the update is zero. But what about the number of cycles that are necessary to let every node know about the update (become infected)? Pittel proves that, in probability,

$$S_N = \log_2 N + \log N + O(1) \quad \text{as } N \to \infty, \qquad (1.2)$$

where $S_N = \min\{t : s_t = 0\}$ is the number of cycles needed to spread the update.
The proof is rather long and technical, but the intuitive explanation is quite simple. In the initial cycles, most nodes are susceptible. In this phase, the number of infected nodes doubles in each cycle to a good approximation. However, in the last cycles, where $s_t$ is small, we can see from (1.1) that $E(s_{t+1}) \approx s_t e^{-1}$. This suggests that there is a first phase, lasting for approximately $\log_2 N$ cycles, and a last phase lasting for $\log N$ cycles. The “middle” phase, between these two phases, can be shown to be very fast, lasting a constant number of cycles.
Equation (1.2) is often cited as the key reason why gossip is considered efficient: it takes only $O(\log N)$ cycles to inform every node about an update, which suggests very good scalability. For example, with the original approach at Xerox, based on sending emails to every node, the time required is $O(N)$, assuming that the emails are sent sequentially.

However, let us consider the total number of messages that are sent in the network until every node gets infected. For push gossip it can be shown that this is $O(N \log N)$. Intuitively, the infected nodes keep sending messages throughout the last phase, which lasts $O(\log N)$ cycles while $s_t$ is already very small. Most of these messages are in vain, since they target nodes that are already infected. The optimal number of messages is clearly $O(N)$, which is attained by the email approach.
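These scaling claims are easy to check in simulation. The sketch below implements the synchronous, failure-free, push-only SI model discussed above (the network size and seed are arbitrary choices) and counts both the cycles and the messages needed for full infection.

```python
import random

def push_gossip(n, seed=42):
    """Synchronous push-only SI gossip; returns (cycles, messages)
    until every node is infected."""
    rng = random.Random(seed)
    infected = [False] * n
    infected[0] = True                        # the update enters at node 0
    cycles = messages = 0
    while not all(infected):
        # every node infected at the start of the cycle pushes once
        for i in [j for j in range(n) if infected[j]]:
            infected[rng.randrange(n)] = True # push to a random peer
            messages += 1
        cycles += 1
    return cycles, messages
```

For n = 10,000 this finishes in a couple of dozen cycles (log₂ N + log N ≈ 22) while sending on the order of N log N messages, most of them in the final redundant phase.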
Fortunately, the speed and message complexity of the push approach can be improved significantly using the pull technique. Let us consider $s_t$ in the case of pull gossip. Here, we get the simple formula

$$E(s_{t+1}) = s_t \cdot s_t = s_t^2, \qquad (1.3)$$

which intuitively indicates a quadratic convergence if we assume the variance of $s_t$ is small. When $s_t$ is large, it decreases slowly; in this phase the push approach clearly performs better. However, when $s_t$ is small, the pull approach results in a significantly faster convergence than push. In fact, the quadratic convergence phase, roughly after $s_t < 0.5$, lasts only for $O(\log \log N)$ cycles, as can be easily verified.
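The $O(\log \log N)$ claim can be checked numerically on the recurrence (1.3) itself (a sketch; we enter the quadratic phase at $s = 0.5$ and stop once $s < 1/N$):

```python
def pull_phase_cycles(n):
    """Iterate s <- s*s from s = 0.5 until s < 1/n; return the cycle count."""
    s, cycles = 0.5, 0
    while s >= 1.0 / n:
        s *= s
        cycles += 1
    return cycles

# After t cycles s = 0.5**(2**t), so the count grows like log2(log2(n)):
# squaring n (i.e., doubling log n) adds only about one extra cycle.
```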
One can, of course, combine push and pull. This can be expected to work faster than
either push or pull separately, since in the initial phase push messages will guarantee
fast spreading, while in the end phase pull messages will guarantee the infecting of the
remaining nodes in a short time. Although faster in practice, the speed of push-pull is still
O(log N), due to the initial exponential phase.
What about message complexity? Since in each cycle each node will send at least
one request, and O(log N) cycles are necessary for the update to reach all the nodes, the
message complexity is O(N log N). However, if we count only the updates, and ignore
request messages, we get a different picture. Just counting the updates is not meaningless,
because an update message is normally orders of magnitude larger than a request message.
It has been shown that in fact the push-pull gossip protocol sends only O(N log log N)
updates in total [22].
The basic idea behind the proof is again based on dividing the spreading process into
phases and calculating the message complexity and duration of each phase. In essence,
the initial exponential phase—that we have seen with push as well—requires only O(N)
update transmissions, since the number of infected nodes (that send the messages) grows
exponentially. But the last phase, the quadratic shrinking phase as seen with pull, lasts
only O(log log N) cycles. Needless to say, as with the other theoretical results, the mathematical proof is quite long and technical.
The SIR Model
In the previous section we outlined some important theoretical results regarding convergence speed and message complexity. However, we ignored one problem that can turn
out to be important in practical scenarios: termination.
Push protocols never terminate in the SI model, constantly sending useless updates
even after each node has received every update. Pull protocols could stop sending messages if the complete list of updates was known in advance: after receiving all the updates,
no more requests need to be sent. However, in practice not even pull protocols can terminate in the SI model, because the list of updates is rarely known.
Here we will discuss solutions to the termination problem in the SIR model. These
solutions are invariably based on some form of detecting and acting upon the “age” of the
update.
We can design our algorithm with two different goals in mind. First, we might wish
to ensure that the termination is optimal; that is, we want to inform all the nodes about
the update, and we might want to minimize redundant update transmissions at the same
time. Second, we might wish to opt for a less intelligent, simple protocol and analyze the proportion of nodes that will not get the update as a function of certain parameters.
One simple way of achieving the first design goal of optimality is to keep track of the age of the update explicitly, and to stop transmission (i.e., to switch to the removed state, hence implementing the SIR model) when a pre-specified age is reached. This age
threshold must be calculated to be optimal for a given network size N using the theoretical
results sketched above. This, of course, assumes that each node knows N. In addition, a
practically error- and delay-free transmission is also assumed, or at least a good model of
the actual transmission errors is needed.
Apart from this problem, keeping track of the age of the update explicitly represents
another, non-trivial practical problem. We assumed in our theoretical discussions that
messages have no delay and that cycles are synchronized. When these assumptions are
violated, it becomes rather difficult to determine the age of an update with an acceptable
precision.
From this point on, we shall discard this approach, and focus on simple asynchronous
methods that are much more robust and general, but are not optimal. To achieve the
second design goal of simplicity combined with reasonable performance, we can try to
guess when to stop based on local information and perhaps information collected from a
handful of peers. These algorithms have the advantage of simplicity and locality. Besides,
in many applications of the SIR model, strong guarantees on complete dissemination are
not necessary, as we will see later on.
Perhaps the simplest possible implementation is when a node moves to the removed
state with a fixed probability whenever it encounters a peer that has already received the
update. Let this probability be 1/k, where the natural interpretation of parameter k is the
average number of times a node sends the update to a peer that turns out to already have
Algorithm 3 an SIR gossip variant
 1: loop
 2:   wait(∆)
 3:   p ← random peer
 4:   if push then
 5:     sendPush(p, infective updates)
 6:   else if pull then
 7:     sendPullRequest(p)
 8: procedure onFeedback(m)
 9:   for all u ∈ m.updates do
10:     switch u to state R with pr. 1/k
11:
12: procedure onPush(m)
13:   if pull then
14:     sendPull(m.sender, infective updates)
15:   onPull(m)
16:
17: procedure onPull(m)
18:   buffer ← m.updates ∩ {known updates}
19:   sendFeedback(m.sender, buffer)
20:   store m.updates
21:
22: procedure onPullRequest(m)
23:   sendPull(m.sender, infective updates)
the update before stopping its transmission. Obviously, this implicitly assumes a feedback
mechanism because nodes need to check whether the peer they sent the update to already
knew the update or not.
As shown in Algorithm 3, this feedback mechanism is the only difference between SIR and SI gossip, apart from the fact that in the SI model all known updates are infective, whereas in the SIR model they are either infective or removed. The active thread and procedure onPullRequest are identical to Algorithm 1. However, procedures onPush and onPull send a feedback message containing the received updates that were already known. This message is processed by procedure onFeedback, eventually switching all updates to the removed state. Removed updates are stored but are no longer included in the push and pull messages.
A typical approach to modeling the SIR algorithm is to work with differential equations, as opposed to the discrete stochastic approach we applied previously. Let us illustrate this approach via an analysis of Algorithm 3, assuming a push variant. Following [20, 23], we can write

$$\frac{ds}{dt} = -si, \qquad (1.4)$$

$$\frac{di}{dt} = si - \frac{1}{k}(1-s)i, \qquad (1.5)$$

where $s(t)$ and $i(t)$ are the proportions of susceptible and infected nodes, respectively.
The nodes in the removed state are given by $r(t) = 1 - s(t) - i(t)$. We can take the ratio of (1.5) and (1.4), eliminating $t$:

$$\frac{di}{ds} = -\frac{k+1}{k} + \frac{1}{ks}, \qquad (1.6)$$
which yields

$$i(s) = -\frac{k+1}{k}\,s + \frac{1}{k}\log s + c, \qquad (1.7)$$

where $c$ is the constant of integration, which can be determined using the initial condition $i(1-1/N) = 1/N$ (where $N$ is the number of nodes). For a large $N$, we have $c \approx (k+1)/k$.
Now we are interested in the value $s^*$ where $i(s^*) = 0$: at that time sending the update is terminated, because all nodes are susceptible or removed. In other words, $s^*$ is the proportion of nodes that do not know the update when gossip stops. Ideally, $s^*$ should be zero. Using the results above, we can write an implicit equation for $s^*$ as follows:

$$s^* = \exp[-(k+1)(1-s^*)]. \qquad (1.8)$$
This tells us that the spreading is very effective. For $k = 1$, 20% of the nodes are predicted to miss the update, but with $k = 5$ only about 0.25% will miss it, while with $k = 10$ it will be as few as 0.0017%.
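For reference, equation (1.8) can be solved by simple fixed-point iteration (a sketch; the iteration converges here because the derivative of the right-hand side at the fixed point, $(k+1)s^*$, is below one for these values of $k$):

```python
import math

def residue(k, iters=100):
    """Fixed-point iteration for s* = exp(-(k+1)(1-s*))."""
    s = 0.0
    for _ in range(iters):
        s = math.exp(-(k + 1) * (1.0 - s))
    return s
```

This gives about 0.203 for k = 1, about 0.0025 for k = 5, and about 1.7e-5 for k = 10.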
Let us now proceed to discussing message complexity. Since full dissemination is not
achieved in general, our goal is now to approximate the number of messages needed to
decrease the proportion of susceptible nodes to a specified level.
Let us first consider the push variant. In this case, we make the rather striking observation that the value of s depends only on the number of messages m that have been sent
by the nodes. Indeed, each infected node picks peers independently at random to send the
update to. That is, every single update message is sent to a node selected independently
at random from the set of all the nodes. This means that the probability that a fixed node
is in state S after a total of $m$ update messages has been sent can be approximated by

$$s(m) = \left(1 - \frac{1}{N}\right)^{m} \approx \exp\left(-\frac{m}{N}\right). \qquad (1.9)$$
Substituting the desired value of $s$, we can easily calculate the total number of messages that need to be sent in the system:

$$m \approx -N \log s. \qquad (1.10)$$
If we demand that $s = 1/N$, that is, we allow only a single node not to see the update, then we need $m \approx N \log N$. This reminds us of the SI model, which had an $O(N \log N)$ message complexity to achieve full dissemination. If, on the other hand, we allow a constant proportion of the nodes not to see the update ($s = 1/c$), then we have $m \approx N \log c$; that is, a linear number of messages suffices. Note that $s$ and $m$ cannot be set directly, but only through other parameters such as $k$.
Another notable point is that (1.9) holds irrespective of whether we apply a feedback mechanism or not, and irrespective of the exact algorithm applied to switch to state R. In fact, it applies even to the pure SI model, since all we assumed was that it is a push-only gossip with random peer selection. Hence it is a strikingly simple, alternative way to illustrate the $O(N \log N)$ message complexity result shown for the SI model: roughly speaking, we need approximately $N \log N$ messages to make $s$ go below $1/N$.
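Relation (1.9) is just the classic random-allocation argument, and it is easy to verify empirically (the network size and message counts below are arbitrary choices): throw m update messages at uniformly random nodes and count how many were never hit.

```python
import math
import random

def susceptible_fraction(n, m, seed=1):
    """Send m update messages to uniformly random targets and return the
    fraction of nodes that were never targeted (still susceptible)."""
    rng = random.Random(seed)
    hit = [False] * n
    for _ in range(m):
        hit[rng.randrange(n)] = True
    return hit.count(False) / n

n = 100_000
# the measured fraction tracks exp(-m/n) from (1.9)
measured = [susceptible_fraction(n, c * n) for c in (1, 2, 3)]
predicted = [math.exp(-c) for c in (1, 2, 3)]
```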
Since m determines s irrespective of the details of the applied push gossip algorithm,
the speed at which an algorithm can have the infected nodes send m messages determines
the speed of convergence of s. With this observation in mind, let us compare a number of
variants of SIR gossip.
Apart from Algorithm 3, one can implement termination (switching to state R) in several different ways. For example, instead of a probabilistic decision in procedure onFeedback, it is also possible to use a counter, and switch to state R after receiving the k-th feedback message. Feedback could be eliminated altogether, and moving to state R could depend only on the number of times a node has sent the update.

It is not hard to see that the counter variants improve load balancing. This in turn improves speed, because we can always send more messages in a fixed amount of time if the message sending load is well balanced. In fact, among the variants described
above, applying a counter without feedback results in the fastest convergence. However,
parameter k has to be set appropriately to achieve a desired level of s. To set k and s
appropriately, one needs to know the network size. Variants using a feedback mechanism
achieve a somewhat less efficient load balancing but they are more robust to the value of k
and to network size: they can “self-tune” the number of messages based on the feedback.
For example, if the network is large, more update messages will be successful before the
first feedback is received.
Lastly, as in the SI model, it is apparent that in the end phase the pull variant is much
faster and uses fewer update messages. It does this at the cost of constantly sending update
requests.
In general we think that, especially when updates are constantly being injected, the push-pull algorithm with counter and feedback is probably the most desirable alternative.
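As an illustration, a stripped-down simulation of the push variant with probabilistic feedback (a sketch in which the feedback message of Algorithm 3 is folded directly into the push step; the network size and seed are arbitrary) reproduces the residue predicted by (1.8) reasonably well.

```python
import random

def sir_push(n, k, seed=3):
    """Push-only SIR gossip with coin feedback: after pushing to a peer
    that already knows the update, the sender switches to the removed
    state with probability 1/k. Returns the final susceptible fraction."""
    rng = random.Random(seed)
    state = ['S'] * n                       # S, I or R per node
    state[0] = 'I'                          # the update enters at node 0
    while 'I' in state:
        for i in [j for j in range(n) if state[j] == 'I']:
            peer = rng.randrange(n)
            if state[peer] == 'S':
                state[peer] = 'I'
            elif rng.random() < 1.0 / k:    # feedback: peer knew it already
                state[i] = 'R'
    return state.count('S') / n
```

A run with k = 1 should leave roughly a fifth of the nodes susceptible, in line with (1.8).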
1.2.3 Applications
We first explain how the various protocols we discussed were applied at Xerox for maintaining a consistent set of replicas of a database. Although we cannot provide a complete
picture here (see [20]), we elucidate the most important ideas.
In Section 1.2.1 we identified two sub-problems, namely update spreading and error
correction. The former is implemented by an SIR gossip protocol, and the latter by an
SI protocol. The SIR gossip is called rumor mongering and is run when a new update
enters the system. Note that in practice many fresh updates can piggyback on a single gossip message, but the above-mentioned convergence properties hold for any single fixed update.
The SI algorithm for error correction works for every update ever entered, irrespective of age, simultaneously for all updates. In a naive implementation, the entire database
would be transmitted in each cycle by each node. Evidently, this is not a good idea, since
databases can be very large, and are mostly rather similar. Instead, the nodes first try to
discover what the difference is between their local replicas by exchanging compressed descriptions such as checksums (or lists of checksums taken at different times) and transmit
only the missing updates. However, one cycle of error correction is typically much more
expensive than rumor mongering.
The SI algorithm for error correction is called anti-entropy. This is not a very fortunate
name: we should remark here that it has no deeper meaning than to express the fact that
“anti-entropy” will increase the similarity among the replicas thereby increasing “order”
(decreasing randomness). So, since entropy is usually considered to be a measure of
“disorder”, the name “anti-entropy” simply means “anti-disorder” in this context.
In the complete system, the new updates are spread through rumor mongering, and
anti-entropy is run occasionally to take care of any undelivered updates. When such an
undelivered update is found, the given update is redistributed by re-inserting it as a new
update into the database where it was not present. This is a very simple and efficient
method, because update spreading via rumor mongering has a cost that depends on the
number of other nodes that already have the update: if most of the nodes already have it,
then the redistribution will die out very quickly.
Let us quickly compare this solution to the earlier, email based approach. Emailing
updates and rumor mongering are similar in that both focus on spreading a single update
and have a certain small probability of error. Unlike email, gossip has no bottleneck nodes
and hence is less sensitive to local failure and assumes less about local resources such as
bandwidth. This makes gossip a significantly more scalable solution. Gossip uses slightly
more messages in total for the distribution of a single update. But with frequent updates
in a large set of replicas, the amortized cost of gossip (number of messages per update) is
more favorable (remember that one message may contain many updates).
In practical implementations, additional significant optimizations were performed.
Perhaps the most interesting one is spatial gossip where, instead of picking a peer at
random, nodes select peers based on a distance metric. This is important because if the
underlying physical network topology is such that there are bottleneck links connecting
dense clusters, then random communication places a heavy load on such links that grows
linearly with system size. In spatial gossip, nodes favor peers that are closer in the topology, thereby relieving the load from long distance links, but at the same time sacrificing
some of the spreading speed. This topic is discussed at great length in [24].
We should also mention the removal of database entries. This is solved through “death
certificates” that are updates stating that a given entry should be removed. Needless to say,
death certificates cannot be stored indefinitely because eventually the databases would be
overloaded by them. This problem requires additional tricks such as removing most but
not all of them, so that the death certificate can be reactivated if the removed update pops
up again.
Apart from the application discussed above, the gossip paradigm has recently received
yet another boost. After getting used to Grid and P2P applications, and witnessing the
emergence of the huge, and often geographically distributed data centers that increase in
size and capacity at an incredible rate, in the past years we had to learn another term:
cloud computing [25–27].
Cloud computing involves a huge amount of distributed resources (a cloud), typically
owned by a single organization, and organized in such a way that for the user it appears to
be a coherent and reliable storage or computing service. The exact details of commercially
deployed technology are not always clear, but from several sources it seems rather evident
that gossip protocols are involved. For example, after a recent crash of Amazon’s S3
storage service, the message explaining the failure included some details:
(...) Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around
failed or unreachable servers, among other things.1 (...)
In addition, a recent academic publication on the technology underlying Amazon’s computing architecture provides further details on gossip protocols [28], revealing that an
anti-entropy gossip protocol is responsible for maintaining a full membership table at
each server (that is, a fully connected overlay network with server state information).
1.3 Aggregation
The gossip communication paradigm can be generalized to applications other than information dissemination. In these applications some implicit notion of spreading information will still be present, but the emphasis is not only on spreading but also on processing
information on the fly.
This processing can be for creating summaries of distributed data; that is, computing
a global function over the set of nodes based only on gossip-style communication. For
1 http://status.aws.amazon.com/s3-20080720.html
Algorithm 4 push-pull averaging
 1: loop
 2:   wait(∆)
 3:   p ← random peer
 4:   sendPush(p, x)
 5:
 6: procedure onPush(m)
 7:   sendPull(m.sender, x)
 8:   x ← (m.x + x)/2
 9: procedure onPull(m)
10:   x ← (m.x + x)/2
example, we might be interested in the average, or maximum of some attribute of the
nodes. The problem of calculating such global functions is called data aggregation or
simply aggregation. We might want to compute more complex functions as well, such
as fitting models on fully distributed data, in which case we talk about the problem of
distributed data mining.
In the past few years, a lot of effort has been directed at a specific problem: calculating averages. Averaging can be considered the archetypical example of aggregation.
Chapter 3 will discuss this problem in detail, here we describe the basic notions to help
illustrate the generality of the gossip approach.
Averaging is a very simple problem, and yet very useful: based on the average of a
suitably defined local attribute, we can calculate a wide range of values. To elaborate on this notion, let us introduce some formalism. Let $x_i$ be an attribute value at node $i$ for all $0 < i \le N$. We are interested in the average $\sum_{i=1}^{N} x_i / N$. Clearly, if we can calculate the average then we can calculate any mean of the form

$$g(x_1, \ldots, x_N) = f^{-1}\left(\frac{\sum_{i=1}^{N} f(x_i)}{N}\right) \qquad (1.11)$$
as well, where we simply apply f () on the local attributes before averaging. For example,
f (x) = log x generates the geometric mean, while f (x) = 1/x generates the harmonic
mean. In addition, if we calculate the mean of several powers of xi , then we can calculate
the moments of the distribution of the values. For example, the variance can be expressed
as a function of the averages of $x_i^2$ and $x_i$:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2. \qquad (1.12)$$
Finally, other interesting quantities can be calculated using averaging as a primitive. For
example, if every attribute value is zero, except at one node, where the value is 1, then the
average is 1/N. This allows us to compute the network size N.
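As a sanity check of (1.11) and the tricks above, here is a tiny illustration (the attribute values are made up for the example) showing that averaging f(x_i) and inverting f recovers the geometric and harmonic means, and that averaging an indicator recovers the network size:

```python
import math

# hypothetical local attribute values held by N = 4 nodes
xs = [2.0, 4.0, 8.0, 16.0]
N = len(xs)

avg = sum(xs) / N                                    # plain average
geo = math.exp(sum(math.log(v) for v in xs) / N)     # f(x) = log x
har = 1.0 / (sum(1.0 / v for v in xs) / N)           # f(x) = 1/x

# network size: one node holds 1, all others hold 0, average is 1/N
size = 1.0 / (sum([1.0] + [0.0] * (N - 1)) / N)
```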
In the remaining parts of this section we focus on several gossip protocols for calculating the average of node attributes.
1.3.1 Algorithms and Theoretical Notions
The first, perhaps simplest, algorithm we discuss is push-pull averaging, presented in
Algorithm 4. Each node periodically selects a random peer to communicate with, and
then sends the local estimate of the average x. The recipient node then replies with its
own current estimate. Both participating nodes (the sender and the one that sends the
reply) will store the average of the two previous estimates as a new estimate.
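A minimal simulation of this scheme (synchronous cycles, no failures or delays; the network size, value range, and seeds are arbitrary choices) exhibits both mass conservation and the fast convergence discussed below:

```python
import random

def push_pull_averaging(xs, cycles, seed=7):
    """One-shot push-pull averaging: in every cycle each node contacts a
    random peer and both replace their estimates by the pairwise average."""
    rng = random.Random(seed)
    x = list(xs)
    n = len(x)
    for _ in range(cycles):
        for i in range(n):
            p = rng.randrange(n)               # random peer sampling
            x[i] = x[p] = (x[i] + x[p]) / 2.0  # both store the average
    return x

rng0 = random.Random(0)
values = [rng0.uniform(0, 100) for _ in range(1000)]
estimates = push_pull_averaging(values, 20)
true_avg = sum(values) / len(values)
# mass conservation: sum(estimates) equals sum(values) (up to rounding),
# and after ~20 cycles every estimate is very close to true_avg
```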
Figure 1.1: Illustration of the averaging protocol (snapshots: initial state and cycles 1–5). Pixels correspond to nodes (100×100 pixels = 10,000 nodes) and pixel color to the local approximation of the average.
Similarly to our treatment of information spreading, Algorithm 4 is formulated for an
asynchronous message passing model, but we will assume several synchronicity properties when discussing the theoretical behavior of the algorithm. We will return to the issue
of asynchrony in Section 1.3.1.
For now, we also treat the algorithm as a one-shot algorithm; that is, we assume that first the local estimate $x_i$ of node $i$ is initialized as $x_i = x_i(0)$ for all nodes $i = 1 \ldots N$, and subsequently the gossip algorithm is executed. This assumption will also be relaxed later in this section, where we briefly discuss the case where the local attributes $x_i(0)$ can change over time and the task is to continuously update the approximation of the average.
Let us first have a brief look at the convergence of the algorithm. It is clear that the state when all the $x_i$ values are identical is a fixed point, assuming there are no node failures and message failures, and that the messages are delivered without delay. In addition, observe that the sum of the approximations remains constant throughout. This very important property is called mass conservation. We can then look at the difference between the minimal and maximal approximations and show that this difference can only decrease and, furthermore, that it converges to zero in probability, using the fact that peers are selected at random. But if all the approximations are the same, they can only be equal to the average $\sum_{i=1}^{N} x_i(0)/N$ due to mass conservation.
The really interesting question, however, is the speed of convergence. The fact of convergence is easy to prove in a probabilistic sense, but such a proof is useless from a practical point of view without characterizing speed. The speed of the protocol is illustrated
in Figure 1.1. The process shows a diffusion-like behavior. The averaging algorithm is of
course executed using random peer sampling (the pixel pairs are picked at random). The
arrangement of the pixels is for illustration purposes only.
Algorithm 5 push averaging
1: loop
2:   wait(∆)
3:   p ← random peer
4:   sendPush(p, (x/2, w/2))
5:   x ← x/2
6:   w ← w/2

7: procedure onPush(m)
8:   x ← m.x + x
9:   w ← m.w + w
In Chapter 3 we characterize the speed of convergence and show that the variance of
the approximations decreases by a constant factor in each cycle. In practice, 10-20 cycles
of the protocol already provide an extremely accurate estimation: the protocol not only
converges, but it converges very quickly as well.
Asynchrony
In the case of information dissemination, allowing for unpredictable and unbounded message delays (a key component of the asynchronous model) has no effect on the correctness of the protocol; it only has an (in practice, marginal) effect on spreading speed. For Algorithm 4, however, correctness is no longer guaranteed in the presence of message delays. To see why, imagine that node $j$ receives a push message from node $i$ and as a result it modifies its own estimate and sends its own previous estimate back to $i$. From that point on, the mass conservation property of the network is violated: the sum of all approximations will no longer be correct. This is not a problem if neither node $j$ nor node $i$ receives or sends another message while node $i$ is waiting for the reply. However, if they do, then the state of the network may become corrupted. In other words, if the pair of push and pull messages is not atomic, asynchrony is not tolerated well.
Algorithm 5 is a clever modification of Algorithm 4 and is much more robust to message delay. The algorithm is very similar, but here we introduce another attribute called
$w$. For each node $i$, we initially set $w_i = 1$ (so the sum of these values is $N$). We also modify the interpretation of the current estimate: on node $i$ it will be $x_i/w_i$ instead of $x_i$, as in the push-pull variant.
To understand why this algorithm is more robust to message delay, consider that we
now have mass conservation in a different sense: the sum of the attribute values at the
nodes plus the sum of the attribute values in the undelivered messages remains constant,
for both attributes x and w. This is easy to see if one considers the active thread which
keeps half of the values locally and sends the other half in a message. In addition, it can
still be proven that the variance of the approximations $x_i/w_i$ can only decrease.
As a consequence, messages can now be delayed, but if message delay is bounded, then the variance of the set of approximations at the nodes and in the messages awaiting delivery will tend to zero. Due to mass conservation, these approximations will converge to the true average, irrespective of how much of the total “mass” is in undelivered messages. (Note that the variance of $x_i$ or $w_i$ alone is not guaranteed to converge to zero.)
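The delay tolerance of Algorithm 5 can be illustrated with a toy simulation (the delays, network size, and cycle count are arbitrary choices): some (x, w) messages sit undelivered in a queue for a while, yet the x/w estimates still converge to the true average because mass is conserved across nodes and in-flight messages.

```python
import random
from collections import deque

def push_averaging(xs, cycles, max_delay=3, seed=5):
    """Push averaging with (x, w) pairs where each message may be
    delayed by up to max_delay cycles before delivery."""
    rng = random.Random(seed)
    n = len(xs)
    x = list(xs)
    w = [1.0] * n
    in_flight = deque()                  # (deliver_at, dest, xm, wm)
    for t in range(cycles):
        for i in range(n):
            p = rng.randrange(n)
            due = t + rng.randint(1, max_delay)
            in_flight.append((due, p, x[i] / 2, w[i] / 2))
            x[i] /= 2                    # keep half, send half
            w[i] /= 2
        remaining = deque()
        for (due, p, xm, wm) in in_flight:
            if due <= t:                 # the message finally arrives
                x[p] += xm
                w[p] += wm
            else:
                remaining.append((due, p, xm, wm))
        in_flight = remaining
    for (_, p, xm, wm) in in_flight:     # flush leftover messages
        x[p] += xm
        w[p] += wm
    return [x[i] / w[i] for i in range(n)]

estimates = push_averaging([float(v) for v in range(100)], 60)
# every x_i/w_i ends up near the true average 49.5, even though mass
# spent time sitting in delayed messages
```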
Robustness to failure and dynamism
We will now consider message and node failures. Both kinds of failures are unfortunately
more problematic than asynchrony. In the case of information dissemination, failure had
no effect on correctness: message failure only slows down the spreading process, and
node failure is problematic only if every node fails that stores the new update.
In the case of push averaging, losing a message typically corrupts mass conservation.
In the case of push-pull averaging, losing a push message will have no effect, but losing
the reply (pull message) may corrupt mass conservation. The solutions to this problem
are either based on failure detection (that is, they assume a node is able to detect whether
a message was delivered or not) and correcting actions based on the detected failure, or
they are based on a form of rejuvenation (restarting), where the protocol periodically
re-initializes the estimates, thereby restoring the total mass. The restarting solution is
feasible due to the quick convergence of the protocol. Both solutions are somewhat inelegant, but gossip is attractive mostly because of its lack of reliance on failure detection, which makes restarting more compatible with the overall gossip design philosophy. Unfortunately, restarting still allows for a bounded inaccuracy due to message failures, while failure detection offers accurate mass conservation.
Node failures are a source of problems as well. By node failure we mean the situation
when a node leaves the network without informing the other nodes about it. Since the
current approximation $x_i$ (or $x_i/w_i$) of a failed node $i$ is typically different from $x_i(0)$,
the set of remaining nodes will end up with an incorrect approximation of the average of
the remaining attribute values. Handling node failures is problematic even if we assume
perfect failure detectors. Solutions typically involve nodes storing the contributions of each node separately. For example, in the push-pull averaging protocol, node $i$ would store $\delta_j^i$: the sum of the incremental contributions of node $j$ to $x_i$. More precisely, when receiving an update from $j$ (push or pull), node $i$ calculates $\delta_j^i = \delta_j^i + (x_j - x_i)/2$. When node $i$ detects that node $j$ has failed, it performs the correction $x_i = x_i - \delta_j^i$.
We should mention that this is feasible only if the selected peers are from a small fixed
set of neighboring nodes (and not randomly picked from the network), otherwise all the
nodes would need to monitor an excessive number of other nodes for failure. Besides,
message failure can interfere with this process too. The situation is further complicated
by nodes failing temporarily, perhaps not even being aware of the fact that they have been
unreachable for a long time by some nodes. Also note that the restart approach solves
the node failure issue as well, without any extra effort or failure detectors, although, as
previously, allowing for some inaccuracy.
Finally, let us consider a dynamic scenario where mass conservation is violated due to changing xi(0) values (so the approximations evolved at the nodes will no longer reflect the correct average). In such cases one can simply set xi = xi + xi^new(0) − xi^old(0), which corrects the sum of the approximations, although the protocol will need some time to converge again. As in the previous cases, restarting solves this problem too without any extra measures.
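In code, this correction is a one-liner (an illustrative sketch; the function name is ours):

```python
# When the locally observed value x_i(0) changes, the node adds the difference
# to its current estimate, so the sum of all estimates again equals the sum of
# the current observed values (mass conservation is restored).

def on_local_change(estimate, old_value, new_value):
    # x_i = x_i + x_i^new(0) - x_i^old(0)
    return estimate + (new_value - old_value)
```

For example, if two nodes with values 10 and 20 have both converged to the estimate 15, and the first node's value changes to 16, its estimate becomes 21; the sum of estimates (21 + 15) again matches the sum of values (16 + 20).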
1.3.2 Applications
The diffusion-based averaging protocols we focused on will most often be applied as a
primitive to help other protocols and applications such as load balancing, task allocation,
or the calculation of relatively complex models of distributed data such as spectral properties of the underlying graph [11, 29]. An example of this application will be described
in Chapter 4.
Sensor networks are especially interesting application targets because their very purpose is data aggregation, and they are inherently local: nodes can typically
Algorithm 6 The gossip algorithm skeleton.

loop
    wait(∆)
    p ← selectPeer()
    if push then
        sendPush(p, state)
    else if pull then
        sendPullRequest(p)

procedure onPullRequest(m)
    sendPull(m.sender, state)

procedure onPush(m)
    if pull then
        sendPull(m.sender, state)
    state ← update(state, m.state)

procedure onPull(m)
    state ← update(state, m.state)
communicate with their neighbors only [30]. However, sensor networks do not support
point-to-point communication between arbitrary pairs of nodes as we assumed previously,
which makes averaging slower, depending on the communication range of
the devices.
1.4 What is Gossip after all?
So far we have discussed two applications of the gossip idea: information dissemination
and aggregation. By now it should be rather evident that these applications, although
different in detail, have a common algorithmic structure. In both cases an active thread
selects a peer node to communicate with, followed by a message exchange and the update
of the internal states of both nodes (for push-pull) or one node (for push or pull). We
propose the template (or design pattern [31]) shown in Algorithm 6 to capture this structure. The three components that need to be defined to instantiate this pattern are the methods UPDATE and SELECTPEER, and the state of a node. This template covers our two examples
presented earlier. In the case of information dissemination the state of a node is defined by
the stored updates, while in the case of averaging the state is the current approximation of
the average at the node. In addition, the template covers a large number of other protocols
as well.
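To make the pattern concrete, the following Python sketch (ours, with illustrative names) instantiates the template for push-pull averaging in a round-based simulation: SELECTPEER is uniform random selection, UPDATE computes the mean of the two states, and the state of a node is its current estimate.

```python
import random

def select_peer(nodes, me):
    # uniform random peer selection (peer sampling idealized here as
    # global knowledge of all node indices)
    return random.choice([n for n in nodes if n is not me])

def update(state, peer_state):
    # averaging: both sides adopt the mean of the two states
    return (state + peer_state) / 2.0

def gossip_cycle(states):
    # one synchronous cycle: every node initiates one push-pull exchange
    nodes = list(range(len(states)))
    for i in nodes:
        j = select_peer(nodes, i)
        m = update(states[i], states[j])
        states[i] = states[j] = m   # push-pull: both internal states updated
    return states
```

Instantiating the same skeleton for information dissemination would only change the state (the set of stored updates) and UPDATE (the union of the two sets).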
1.4.1 Overlay Networks
To illustrate the power of this abstraction, we briefly mention one notable application we
have not covered in this chapter: the construction and management of overlay networks.
The larger part of this dissertation, in particular Chapters 2 and 6, discusses such applications. In this case the state of a node is a set of node addresses that define an overlay
network. (A node is able to send messages to an address relying on lower layers of the
networking stack; hence the name “overlay”.)
In a nutshell, the state (the set of overlay links) is then communicated via gossip, and
method UPDATE selects the new set of links from the set of all links the node has seen.
Through this mechanism one can create and manage a number of different overlay networks such as random networks, structured networks (like a ring) or proximity networks
based on some distance metric, for example semantic or latency-based distance. Method
SELECTPEER can also be implemented in a clever way, based on the actual neighbors, to
speed up convergence.
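As a hedged illustration of this idea (the function names and the ring metric are ours; Chapters 2 and 6 discuss real protocols of this kind), the following UPDATE keeps, out of all links the node has seen, the c node IDs closest to its own ID on a ring, thereby evolving the overlay toward a ring topology:

```python
# Sketch of gossip-based overlay management: the state is a set of node IDs
# (addresses); update merges the two link sets and keeps the c nearest
# neighbors under a distance metric, here distance on a ring of n IDs.

def ring_distance(a, b, n):
    return min((a - b) % n, (b - a) % n)

def update_links(my_id, my_links, peer_links, c, n):
    # merge everything seen, drop self, keep the c closest IDs
    candidates = (set(my_links) | set(peer_links)) - {my_id}
    return set(sorted(candidates, key=lambda x: ring_distance(my_id, x, n))[:c])
```

Replacing ring_distance with a semantic or latency-based metric yields the corresponding proximity network instead.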
These networks can be used by higher-level applications, or by other gossip protocols. For example, random networks are excellent for implementing random peer sampling, a service that all the algorithms in this chapter rely on when selecting a random peer to
communicate with.
1.4.2 Prototype-based Gossip Definition
The gossip abstraction is powerful, perhaps too much so. It is rather hard to capture
what this “gossipness” concept means exactly. Attempts have been made to define gossip
formally, with mixed success [32]. For example, periodic and local communication to
random peers appears to be a core feature. However, in the SIR model, nodes can stop
communicating. Besides, in some gossip protocols neighboring nodes need not be random
in every cycle but instead they can be fixed and static. For example, many secure gossip
protocols in fact use deterministic peer selection on controlled networks [33]. Also, quite
clearly, a protocol remains gossip if message sending is slightly irregular—for example,
due to an optimization that makes a protocol adaptive to system load or the progress
of information spreading. In general, the template allows us to model practically any
message passing protocol, since the definition of state is unrestricted, in any cycle a peer
can choose to send a zero length message (that is, no message), and the gossip period ∆
can be arbitrarily small.
For this reason it appears to be more productive to also have a feature list that defines
an idealized prototypical gossip protocol application (i.e., information spreading), and to
compare the features of a given protocol with this set. In this way, instead of giving a
formal, exact definition of gossip protocols, we make it possible to compare any given
protocol to the prototypical gossip protocol and assess the similarities and differences,
avoiding a rigid binary (gossip/non-gossip) decision over protocols. We propose the following features: (1) randomized peer selection, (2) only local information is available at
all nodes, (3) cycle-based (periodic), (4) limited transmission and processing capacity per
cycle, (5) all peers run the same algorithm.
The inherent and intentional fuzziness in this prototype-based approach turns the
yes/no distinction of a formal definition into a measure of distance from prototypical
gossip: a certain algorithm might have some of the properties, and might not have some
others. Even in the case of matching properties, we can talk about the degree of matching.
For example, we can ask how random peer selection is, or how local the decisions are.
Figure 1.2 is a simple illustration of this idea. The figure also illustrates the possibility
that some algorithms from other fields might actually be closer to prototypical gossip
than some protocols currently called gossip. The examples mentioned in the diagram are
explained in detail in [2].
1.5 Conclusions
In this chapter we introduced the gossip design pattern through the examples of information dissemination (the prototypical application) and aggregation. We showed that
both applications use a very similar communication model, and both applications provide
probabilistic guarantees for an efficient and effective execution. We also discussed the
gossip model in general, and briefly mentioned overlay network management as a further
application.
[Figure 1.2: a diagram placing prototypical gossip at the center. Closest to it are current gossip algorithms (overlay topology management, aggregation, peer sampling service); farther out are gossip-like algorithms from other fields (cellular automata, ant algorithms, asynchronous iteration, self-stabilizing systems, clock synchronization); farthest are gossip-friendly application areas (cooperation, routing, vehicular traffic control).]

Figure 1.2: Prototypical gossip in a multidisciplinary context.
We should emphasize that gossip protocols represent a departure from “classical” approaches in distributed computing where correctness and reliability were the top priorities
and performance (especially speed and responsiveness) was secondary. To put it simply:
gossip protocols—if done well—are simple, fast, cheap, and extremely scalable, but they
do not always provide a perfect or correct result under all circumstances and in all system models. But in many scenarios a “good enough” result is acceptable, and in realistic
systems gossip components can always be backed up by more heavy-weight, but more
reliable methods that provide eventual consistency or correctness.
A related problem is malicious behavior. Unfortunately, gossip protocols in open
systems with multiple administrative domains are rather vulnerable to malicious behavior. Current applications of gossip are centered on single administrative domain systems,
where all the nodes are controlled by the same entity and therefore the only sources of
problems are hardware, software or network failures. In an open system nodes can be
selfish, or even worse, they can be malicious. Current secure gossip algorithms are orders of magnitude more complicated than basic versions, thus losing many of the original
advantages of gossiping.
All in all, gossip algorithms are a great tool for solving certain kinds of very important problems under certain assumptions. In particular, they can help in building the enormous cloud computing systems that many consider the computing platform of the future, and they provide tools for programming sensor networks as well.
1.6 Further Reading
Here we provide a number of references that might help the reader probe deeper into the
topics that we discussed in this chapter. Our discussion on information dissemination was
based mostly on the seminal paper of Demers et al. [20], and partly on [21] and [22] to
elaborate on some of the details. In general [20] is highly recommended for further study,
since the paper contains a lot more material than what was covered here, and it touches
on almost all research issues associated with this field. One further important aspect—
gossiping taking physical distance into account—is elaborated on in a paper by Kempe,
Kleinberg and Demers [24].
In the case of aggregation, we based our discussion on [8] (as discussed in Chapter 3)
and [34], borrowing some ideas from [35] and [36] during the discussion of asynchrony,
presented in a simplified form. Certain aspects of gossip aggregation have been covered
in recent work such as increased fault tolerance [37, 38] or privacy preservation [39].
An alternative theoretical approach can be followed as well if gossip-based average
calculation is viewed mathematically as a random walk on a suitably defined graph (or,
equivalently, a Markovian process with a time-reversible Markovian transition matrix),
which has a uniform stationary distribution. This however assumes that computation is
performed synchronously, in lock step, so that averaging can be described as a matrix iteration converging to the uniform distribution. These approaches have been omitted here,
but the interested reader can find them in a number of seminal papers such as [40–42]. In
general, [43] gives an excellent and comprehensive tutorial on the relevant mathematical
theory.
There are yet other alternative ways of calculating averages that have not been covered
but that are based on gossip in one way or another. One approach is based on maintaining a hierarchical overlay (often involving gossip in the construction and maintenance
of the hierarchy) and using it to calculate various aggregates [44–46]. This approach is
rather typical in wireless sensor networks [47]. One can also use additional tricks such as
random walk-based statistical approximations [48] or cleverly constructed attributes such
that the averaging problem is reduced to finding the minima of these attributes, and then
using these minima to infer the true average with a controlled accuracy [49]. A method
inspired by belief propagation was also proposed, that has favorable properties in certain
communication topologies [50].
Let us now suggest some further reading on applications of gossip that have not been covered in this chapter. In Section 1.4 we mentioned peer sampling along with its gossip-based implementations. An extensive discussion of peer sampling and its variations is
provided in [3] and discussed in Chapter 2. A similar application (discussed in Chapter 6) is overlay construction and maintenance. We show that a wide range of network
topologies can be evolved with a slight modification of the peer sampling algorithm.
Gossip has also been applied in fault tolerant, practical distributed hash table implementations [51] to increase the robustness of the overlay maintenance algorithm for
dynamic conditions.
Chapter 2
Peer Sampling Service
As we have seen, Algorithm 6 in Chapter 1 crucially relies on a method called SELECTPEER. The peer sampling service implements this function. In short, this service provides
every node with peers to gossip with. We promote this service to the level of a first-class
abstraction of a large-scale distributed system, similar to a name service being a first-class
abstraction of a local-area system.
One important problem when implementing a peer sampling service in large dynamic
networks is making sure that it is as scalable and robust as possible. We present a generic
framework to implement a peer sampling service in a decentralized manner by constructing and maintaining dynamic unstructured overlays through gossiping membership information itself. Our framework generalizes existing approaches and makes it easy to
discover new ones. We use this framework to empirically explore and compare several
implementations of the peer sampling service. Through extensive simulation experiments
we show that—although all protocols provide a good quality uniform random stream of
peers to each node locally—traditional theoretical assumptions about the randomness of
the unstructured overlays as a whole do not hold in any of the instances. We also show that
different design decisions result in severe differences from the point of view of two crucial aspects: load balancing and fault tolerance. Our simulations are validated by means
of a wide-area implementation.
2.1 Introduction
The popularity of gossip protocols stems from their ability to reliably pass information
among a large set of interconnected nodes, even if the nodes regularly join and leave the
system (either purposefully or on account of failures), and the underlying network suffers
from broken or slow links.
In a gossip-based protocol, each node in the system periodically exchanges information with a subset of its peers. The choice of this subset is crucial to the wide dissemination of the gossip. Ideally, any given node should exchange information with peers
that are selected following a uniform random sample of all nodes currently in the system [20, 22, 52–54]. This assumption made it possible to rigorously establish many desirable features of gossip-based protocols like scalability, reliability, and efficiency (see
Chapter 1).
In practice, enforcing this assumption would require developing applications where
each node may be assumed to know every other node in the system [53, 55, 56]. However,
providing each node with a complete membership table from which a random sample can
be drawn, is unrealistic in a large-scale dynamic system, for maintaining such tables in
the presence of joining and leaving nodes (referred to as churn) incurs considerable synchronization costs. In particular, measurement studies on various peer-to-peer networks
indicate that an individual node may often be connected for only a few minutes
to an hour (see, e.g. [57, 58]).
Clearly, decentralized schemes to maintain membership information are crucial to the
deployment of gossip-based protocols. This chapter factors out the very abstraction of
a peer sampling service and presents a generic, yet simple, gossip-based framework to
implement it.
The peer sampling service is singled out from the application using it and, abstractly speaking, the same service can be used in different settings: information dissemination [20, 59], aggregation [8, 10, 34, 60], load balancing [61], and network management [36].
The service is promoted as a first class abstraction of a large-scale distributed system. In
a sense, it plays the role of a naming service in a traditional LAN-oriented distributed
system as it provides each node with other nodes to interact with.
The basic general principle underlying the framework we propose to implement the
peer sampling service, is itself based on a gossip paradigm. In short, every node (1) maintains a relatively small local membership table that provides a partial view on the complete set of nodes and (2) periodically refreshes the table using a gossiping procedure.
The framework is generic and can be used to instantiate known [5, 62, 63] and novel
gossip-based membership implementations. In fact, our framework captures many possible variants of gossip-based membership dissemination. These variants mainly differ in
the way the membership table is updated at a given node after the exchange of tables in a
gossip cycle. We use this framework to experimentally evaluate various implementations
and identify key design parameters in practical settings. Our experimentation covers both
extensive simulations and emulations on a wide-area cluster.
We consider many dimensions when identifying qualitative differences between the
variants we examine. These dimensions include the randomness of selecting a peer as
perceived by a single node, the accuracy of the current membership view, the distribution
of the load incurred on each node, as well as the robustness in the presence of failures and
churn.
Maybe not surprisingly, we show that communication should rather be bidirectional:
it should follow the push-pull model. Adhering to a push-only or pull-only approach
can easily lead to (irrecoverable) partitioning of the set of nodes. Another finding is that
robustness against failing nodes or churn can be enhanced if old table entries are dropped
when exchanging membership information.
However, as we shall also see, no single implementation outperforms the others along
all dimensions. In this study we identify these tradeoffs when selecting an implementation of the peer sampling service for a given application. For example, to achieve good
load balancing, table entries should rather be swapped between two peers. However, this
strategy is less robust against failures and churn than non-swapping ones.
The chapter is organized as follows. Section 2.2 presents the interface and generic
implementation of our peer sampling service. Section 2.3 characterizes local randomness:
that is, the randomness of the samples as seen by a fixed participating node. In Section 2.4
we analyze global randomness in a graph-theoretic framework. Robustness to failures
and churn is discussed in Section 2.5. The simulations are validated through a wide-area
experimentation described in Section 2.6. Sections 2.7, 2.8 and 2.9 present the discussion,
related work and conclusions, respectively.
2.2 Peer-Sampling Service
The peer sampling service is implemented over a set of nodes (a group) wishing to execute
one or more protocols that require random samples from the group. The task of the service
is to provide a participating node with a random subset of peers from the group.
2.2.1 API
The API of the peer sampling service simply consists of two methods: INIT and SELECTPEER. It would be technically straightforward to provide a framework for a multiple-application interface and architecture. For a better focus and simplicity of notations we
assume, however, that there is only one application. The specification of these methods is
as follows.
INIT()
Initializes the service on a given node if this has not been done before. The actual initialization procedure is implementation dependent.

SELECTPEER()
Returns a peer address if the group contains more than one node. The returned address is a sample drawn from the group. Ideally, this sample should be an independent unbiased random sample. The exact characteristics of this sample (e.g., its randomness or correlation in time and with other peers) are affected by the implementation.
Our focus is to give accurate information about the behavior of the SELECTPEER()
method in the case of a class of gossip-based implementations. Applications requiring
more than one peer simply invoke this method repeatedly.
Note that we do not define a STOP method. In other words, graceful leaves are handled
as crashes. The reason is to ease the burden on applications by delegating the responsibility of removing inactive nodes to the service layer.
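The API can be summarized by the following illustrative Python interface (the names and types are ours; the text prescribes only the two methods and their semantics):

```python
# Illustrative rendering of the two-method peer sampling API described above.

class PeerSamplingService:
    def init(self):
        """Initialize the service on this node (implementation dependent)."""
        raise NotImplementedError

    def select_peer(self):
        """Return one peer address sampled from the group (ideally an
        independent unbiased random sample), or None if alone."""
        raise NotImplementedError

def sample_many(service, k):
    # applications needing more than one peer invoke the method repeatedly
    return [service.select_peer() for _ in range(k)]
```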
2.2.2 Generic Protocol Description
We consider a set of nodes connected in a network. A node has an address that is needed
for sending a message to that node. Each node maintains a membership table representing its (partial) knowledge of the global membership. Traditionally, if this knowledge is
complete, the table is called the global view or simply the view. However, in our case
each node knows only a limited subset of the system, so the table is consequently called
a partial view. The partial view is a list of c node descriptors. Parameter c represents the
size of the list and is the same for all nodes.
A node descriptor contains a network address (such as an IP address) and an age that
represents the freshness of the given node descriptor. The partial view is a list data structure, and accordingly, the usual list operations are defined on it. Most importantly, this
means that the order of elements in the view is not changed unless some specific method
(for example, SHUFFLE, which randomly reorders the list elements) explicitly changes it.
The protocol also ensures that there is at most one descriptor for the same address in every
view.
The purpose of the gossiping algorithm, executed periodically on each node and resulting in two peers exchanging their membership information, is to make sure that the
partial views contain descriptors of a continuously changing random subset of the nodes
Algorithm 7 The skeleton of a gossip-based peer sampling service.

loop
    wait(∆)
    p ← selectGPSPeer()
    sendPush(p, toSend())
    view.increaseAge()

procedure onPush(m)
    if pull then
        sendPull(m.sender, toSend())
    onPull(m)

procedure onPull(m)
    update(m.buffer, c, H, S)
    view.increaseAge()

procedure update(buffer, c, H, S)
    view.append(buffer)
    view.removeDuplicates()
    view.removeOldItems(min(H, view.size − c))
    view.removeHead(min(S, view.size − c))
    view.removeAtRandom(view.size − c)

procedure toSend
    buffer ← ((MyAddress, 0))
    view.shuffle()
    move oldest H items to end of view
    buffer.append(view.head(c/2 − 1))
    return buffer
and (in the presence of failure and joining and leaving nodes) to make sure the partial
views reflect the dynamics of the system. We assume that each node executes the same
protocol of which the skeleton is shown in Algorithm 7.
Note that the algorithm is an instantiation of the generic scheme in Algorithm 6, though it is slightly simpler because we do not support the pure pull variant, as explained in
Section 2.2.3. As in Chapter 1, we define a cycle to be a time interval of ∆ time units
where ∆ is the parameter of the protocol. During a cycle, each node initiates one view
exchange.
Three globally known system-wide parameters are used in this algorithm: parameters
c (the size of the partial view of each node), H and S. For the sake of clarity, we leave the
details of the meaning and impact of H and S until the end of this section.
In the active thread, first a peer node is selected to exchange membership information with. This selection is implemented by the method SELECTGPSPEER that returns the address of a live node. This method should not be confused with the API method SELECTPEER. Although it serves a similar purpose, method SELECTGPSPEER is internal to the peer sampling implementation, and is itself a parameter of the generic protocol. We discuss the possible implementations of SELECTGPSPEER in Section 2.2.3.
Subsequently, a push message is sent. The list of descriptors to be sent is prepared by method TOSEND. There, a buffer is initialized with a fresh descriptor of the node running the thread. Then, c/2 − 1 elements are appended to the buffer. The implementation ensures that these elements are selected randomly from the view without replacement, giving the oldest H elements (as defined by the age stored in the descriptors) a lower priority to be included (they are sampled only if there are not enough younger elements). As a side-effect of shuffling the view to select the c/2 − 1 random elements without replacement, the view itself will also have exactly those elements as first items (i.e., in the list head) that are being sent in the buffer. This fact will play a key role in the interpretation of parameter S as we explain later. Parameter H is guaranteed to be less than or equal to c/2. The buffer created this way is sent to the selected peer.
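The TOSEND step can be sketched in Python as follows (an illustrative rendering of the pseudocode; here the view is a plain list of (address, age) pairs, and c and H are the protocol parameters). Note the side-effect: after the call, the head of the view holds exactly the items that were sent.

```python
import random

def to_send(view, my_address, c, H):
    # a fresh descriptor of this node always leads the buffer
    buffer = [(my_address, 0)]
    # randomize the view, then push the H oldest items to the tail so they
    # are sampled only if there are not enough younger elements
    random.shuffle(view)
    for d in sorted(view, key=lambda item: item[1])[len(view) - H:]:
        view.remove(d)
        view.append(d)
    # side-effect: the head of the view is exactly what is being sent
    buffer += view[:c // 2 - 1]
    return buffer
```

The slice `[len(view) - H:]` (rather than `[-H:]`) is deliberate: for H = 0 it selects nothing, so no items are demoted.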
When a push message arrives from a peer node (in method ONPUSH), and the boolean parameter PULL is true, a pull message, also prepared by TOSEND, is sent back. When any message arrives (in method ONPUSH or ONPULL), the received
buffer is passed to method UPDATE, which creates the new view based on the listed parameters and the current view, making sure the size of the new view does not decrease and is at most c. After appending the received buffer to the view, method UPDATE keeps only the
freshest entry for each address, eliminating duplicate entries. After this operation, there
is at most one descriptor for each address. At this point, the size of the view is guaranteed
to be at least the original size, since in the original view each address was included also
at most once. Subsequently, the method performs a number of removal steps to decrease
the size of the view to c. The parameters of the removal methods are calculated in such
a way that the view size never drops below c. First, the oldest items are removed, as defined by their age and by parameter H. The name H comes from healing; that is, this
parameter defines how aggressive the protocol should be when it comes to removing links
that potentially point to faulty nodes (dead links). Note that in this way self-healing is
implemented without actually checking if a node is alive or not. If a node is not alive,
then its descriptors will never get refreshed (and thus become old), and therefore sooner
or later they will get removed. The larger H is, the sooner older items will be removed
from views.
After removing the oldest items, the S first items are removed from the view. Recall
that it is exactly these items that were sent to the peer previously. As a result, parameter S
controls the priority that is given to the addresses received from the peer. If S is high, then
the received items will have a higher probability to be included in the new view. Since the
same algorithm is run on the receiver side, this mechanism in fact controls the number of
items that are swapped between the two peers, hence the name S for the parameter. This
parameter controls the diversity of the union of the two new views (on the passive and
active side). If S is low then both parties will keep many of their exchanged elements,
effectively increasing the similarity between the two respective views. As a result, more
unique addresses will be removed from the system. In contrast, if S is high, then the
number of unique addresses that are lost from both views is lower. The last step removes
random items to reduce the size of the view back to c.
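The UPDATE method can likewise be sketched in Python (illustrative; the view is the caller's list of (address, age) descriptors, modified in place, and c, H and S are the parameters described above):

```python
import random

def update(view, buffer, c, H, S):
    # 1. append the received buffer to the view
    merged = view + buffer
    # 2. keep only the freshest (lowest-age) descriptor per address,
    #    at the position of that address's first occurrence
    freshest = {}
    for addr, age in merged:
        if addr not in freshest or age < freshest[addr]:
            freshest[addr] = age
    seen = set()
    view[:] = []
    for addr, _ in merged:
        if addr not in seen:
            seen.add(addr)
            view.append((addr, freshest[addr]))
    # 3. self-healing: remove up to H of the oldest items
    for _ in range(min(H, max(0, len(view) - c))):
        view.remove(max(view, key=lambda d: d[1]))
    # 4. swap: drop up to S items from the head (exactly those just sent)
    del view[:min(S, max(0, len(view) - c))]
    # 5. trim back to c by removing random items
    while len(view) > c:
        view.pop(random.randrange(len(view)))
```

Each removal step recomputes the remaining surplus, which is what guarantees that the view size never drops below c.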
This framework captures the essential behavior of many existing gossip membership
protocols (although exact matches often require small changes). As such, the framework
serves two purposes: (1) we can use it to compare and evaluate a wide range of different
gossip membership protocols by changing parameter values, and (2) it can serve as a
unifying implementation for a large class of protocols. As a next step, we will explore the
design space of our framework, forming the basis for an extensive protocol comparison.
2.2.3 Design Space
In this section we describe a set of specific instances of our generic protocol by specifying
the values of the key parameters. These instances will be analyzed in the rest of the
chapter.
Peer Selection
As described before, peer selection is implemented by SELECT GPSP EER that returns the
address of a live node as found in the caller’s current view. In this study, we consider the
following peer selection policies:
rand      Uniform randomly select an available node from the view
tail      Select the node with the highest age
Note that the third logical possibility of selecting the node with the lowest age is not included since this choice is not relevant. It is immediately clear from simply considering
the protocol scheme that node descriptors with a low age refer to neighbors that have a
view that is strongly correlated with the node’s own view. More specifically, the node
descriptor with the lowest age always refers exactly to the last neighbor the node communicated with. As a result, contacting this node offers little possibility to update the view
with unknown entries, so the resulting overlay will be very static. Our preliminary experiments fully confirm this simple reasoning. Since the goal of peer sampling is to provide
uncorrelated random peers continuously, it makes no sense to consider any policies with
a bias towards low age, and thus protocols that follow such a policy.
View propagation
Once a peer has been chosen, the peers may exchange information in various ways. We
consider the following two view propagation policies:
push      The node sends descriptors to the selected peer
pushpull  The node and selected peer exchange descriptors
As in the case of the peer selection policies, one logical possibility, the pull strategy, is omitted. It is easy to see that the pull strategy cannot possibly provide satisfactory service.
The most important flaw of the pull strategy is that a node cannot inject information about
itself, except only when explicitly asked by another node. This means that if a node
loses all its incoming connections (which might happen spontaneously even without any
failures, and which is rather common as we shall see) there is no possibility to reconnect
to the network.
View selection
The parameters that determine how view selection is performed are H, the self-healing
parameter, and S, the swap parameter. Let us first note some properties of these parameters. First, assuming that c is even, all values of H for which H > c/2 are equivalent
to H = c/2, because the protocol never decreases the view size to under c. For the same
reason, all values of S for which S > c/2 − H are equivalent to S = c/2 − H. Furthermore, the last, random removal step of the view selection algorithm is executed only if
S < c/2 − H. Keeping these in mind, we have a “triangle” of protocols with H ranging
from 0 to c/2, and with S ranging from 0 to c/2 − H. In our analysis we will look at this
triangle at different resolutions, depending on the scenarios in question. As a minimum,
we will consider the three vertices of the triangle defined as follows.
blind     H = 0, S = 0      Keep blindly a random subset
healer    H = c/2           Keep the freshest entries
swapper   H = 0, S = c/2    Minimize loss of information
We must note here that even in the case of SWAPPER, only at most c/2 − 1 descriptors
can be swapped, because the first element of the received buffer of length c/2 is always a
fresh descriptor of the sender node. This fresh descriptor is always added to the view of
the recipient node if H + S = c/2, that is, when no random elements are removed. This
detail is very important as it is the only way fresh information can enter the system.
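To make the roles of H and S concrete, the view selection step can be sketched in Python as follows. This is a hypothetical rendering of the generic framework; the descriptor representation as (address, age) pairs, the ordering of duplicates, and the assumption that the head of the view holds the entries just sent to the peer are our own simplifications.

```python
import random

def select_view(view, received, c, H, S):
    """One view-selection step of the generic framework (sketch).

    `view` is the node's current view (its head holds the entries just
    sent to the peer); `received` is the buffer obtained from the peer.
    Descriptors are (address, age) pairs; lower age means fresher.
    """
    # Merge, removing duplicates while keeping the freshest age per address.
    best, order = {}, []
    for addr, age in view + received:
        if addr not in best:
            order.append(addr)
            best[addr] = age
        else:
            best[addr] = min(best[addr], age)
    buf = [(a, best[a]) for a in order]

    def surplus():
        return max(len(buf) - c, 0)

    # H: drop the oldest entries (self-healing), never shrinking below c.
    for _ in range(min(H, surplus())):
        buf.remove(max(buf, key=lambda d: d[1]))
    # S: drop entries from the head, i.e. those just sent to the peer (swap).
    del buf[:min(S, surplus())]
    # Fill the rest of the quota by removing random entries.
    while surplus() > 0:
        buf.pop(random.randrange(len(buf)))
    return buf
```

With H = S = 0 this degenerates to BLIND (purely random removal), with maximal H to HEALER, and with maximal S to SWAPPER.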
Algorithm 8 Newscast

loop
    wait(∆)
    p ← selectGPSPeer()
    sendPush(p, toSend())
    view.increaseAge()

procedure onPush(m)
    sendPull(m.sender, toSend())
    onPull(m)

procedure onPull(m)
    update(m.buffer, c)
    view.increaseAge()

procedure toSend()
    buffer ← ((MyAddress, 0))
    buffer.append(view)
    return buffer

procedure update(buffer, c)
    view.append(buffer)
    view.removeDuplicates()
    view.removeOldItems(view.size − c)
2.2.4 Known Protocols as Instantiations of the Model
The framework captures a number of protocols that were published previously. Here we briefly describe each in turn. The first is LPBCAST (lightweight probabilistic broadcast) [62]. The original publication describes a complete system for implementing a publish-subscribe service. Part of that system is a membership management layer that is implemented essentially as the push variant of protocol BLIND with peer selection RAND. The second protocol we cover is called CYCLON [63]. Apart from minor differences, CYCLON is equivalent to push-pull SWAPPER with RAND as peer selection.
Finally, we discuss NEWSCAST in more detail. The NEWSCAST protocol was published originally as a technical report [4] that was later reprinted as a book chapter [5]. The first version of the NEWSCAST protocol also included an aggregation protocol; the membership service was factored out in several subsequent publications when it became clear that it is useful for a wide range of other applications as well. The pseudocode of NEWSCAST is shown in Algorithm 8. The difference from Algorithm 7 is that NEWSCAST is always push-pull, the entire view is transferred (that is, c elements and not c/2 elements), and the freshest c elements are kept from the union of the views by both nodes that participate in an exchange. This results in an increased aggressiveness in removing old values, even compared to HEALER. Method SELECTGPSPEER simply returns a random element from the current view (that is, we use the RAND peer selection variant).
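For illustration, the NEWSCAST view update can be sketched in a few lines of Python. The (address, age) descriptor representation, with lower age meaning fresher, is our own assumption.

```python
def newscast_update(view, received, c):
    """NEWSCAST update (sketch): merge the two views, keep the freshest
    copy of each duplicate address, then keep the c freshest entries."""
    best = {}
    for addr, age in view + received:
        if addr not in best or age < best[addr]:
            best[addr] = age
    merged = sorted(best.items(), key=lambda d: d[1])  # freshest first
    return merged[:c]
```

Since both participants run the same update on the union of their views (plus the fresh descriptor of the peer), after an exchange both views contain the same c freshest entries.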
2.2.5 Implementation
We now describe a possible implementation of the peer sampling service API based on the
framework presented in Section 2.2.2. We assume that the service forms a layer between
the application and the unstructured overlay network.
Initialization
Method INIT() will cause the service to register itself with the gossiping protocol instance
that maintains the overlay network. From that point, the service will be notified by this
instance whenever the actual view is updated.
Sampling
As an answer to the SELECTPEER call, the service returns an element from the current
view. To increase the randomness of the returned peers, the service makes a best effort
not to return the same element twice during the period while the given element is in the
view: this would introduce an obvious bias that would damage the quality of the service.
To achieve this, the service maintains a queue of elements that are currently in the view but
have not been returned yet. Method SELECTPEER returns the first element from the queue
and subsequently it removes this element from the queue. When the service receives a
notification on a view update, it removes those elements from the queue that are no longer
in the current view, and appends the new elements that were not included in the previous
view. If the queue becomes empty, the service falls back on returning random samples
from the current view. In this case the service can set a warning flag that can be read by
applications to indicate that the quality of the returned samples is no longer reliable.
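The queue-based scheme above can be sketched as a minimal Python class; the class and method names are ours.

```python
import random

class PeerSampler:
    """Best-effort non-repeating sampling from a changing view (sketch)."""

    def __init__(self):
        self.view = []
        self.queue = []       # elements of the view not yet returned
        self.warning = False  # set when we fall back on random sampling

    def on_view_update(self, new_view):
        new, old = set(new_view), set(self.view)
        # Drop queued elements that are no longer in the view.
        self.queue = [p for p in self.queue if p in new]
        # Append elements that were not in the previous view.
        self.queue += [p for p in new_view if p not in old]
        self.view = list(new_view)

    def select_peer(self):
        if self.queue:
            return self.queue.pop(0)
        # Queue exhausted: sample quality is no longer guaranteed.
        self.warning = True
        return random.choice(self.view) if self.view else None
```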
In the following sections, we analyze the behavior of our framework in order to gradually come to various optimal settings of the parameters. Anticipating our discussion in
Section 2.7, we will show that there are some parameter values that never lead to good results (such as selecting a peer from a fresh node descriptor). However, we will also show
that no single combination of parameter values is always best and that, instead, tradeoffs
need to be made.
2.3 Local Randomness
Ideally, a peer-sampling service should return a series of unbiased independent random samples from the current group of peers. The assumption of such randomness has indeed been used to rigorously establish many desirable features of gossip-based protocols, such as scalability, reliability, and efficiency [21].
When evaluating the quality of a particular implementation of the service, one faces
the methodological problem of characterizing randomness. In this section we consider a
fixed node and analyze the series of samples generated at that particular node.
There are essentially two ways of capturing randomness. The first approach is based
on the notion of Kolmogorov complexity [64]. Roughly speaking, this approach considers
as random any series that cannot be compressed. Pseudo random number generators are
automatically excluded by this definition, since any generator, along with a random seed,
is a compressed representation of a series of any length. Sometimes it can be proven that
a series can be compressed, but in the general case, the approach is not practical to test
randomness due to the difficulty of proving that a series cannot be compressed.
The second, more practical approach assumes that a series is random if any statistic computed over the series matches the theoretical value of the same statistic under
the assumption of randomness. The theoretical value is computed in the framework of
probability theory. This approach is essentially empirical, because it can never be mathematically proven that a given series is random. In fact, good pseudo random number
generators pass most of the randomness tests that belong to this category.
Following the statistical approach, we view the peer-sampling service (as seen by a
fixed node) as a random number generator, and we apply the same traditional methodology that is used for testing random number generators. We test our implementations with
the “diehard battery of randomness tests” [65], the de facto standard in the field.
2.3.1 Experimental Settings
We have experimented with our protocols using the PEERSIM simulator [66]. All the simulation
results in this chapter were obtained using this implementation.
The DIEHARD test suite requires as input a considerable number of 32-bit integers: the most expensive test needs 6 · 10^7 of them. To be able to generate this input, we assume that all nodes in the network are numbered from 0 to N. Node N executes the peer-sampling service, obtaining one number between 0 and N − 1 each time it calls the service, thereby generating a sequence of integers. If N is of the form N = 2^n, then the bits of the generated numbers form an unbiased random bit stream, provided the peer-sampling service returns random samples.
Due to the enormous cost of producing a large number of samples, we restricted the set of implementations of the view construction procedure to the three extreme points: BLIND, HEALER and SWAPPER. Peer selection was fixed to be TAIL and PUSHPULL was fixed as the communication model. Furthermore, the network size was fixed to be 2^10 + 1 = 1025, and the view size was c = 20. These settings allowed us to complete 2 · 10^7 cycles for all the three protocol implementations. In each case, node N generated four samples in each cycle, thereby generating four 10-bit numbers. Ignoring two bits out of these ten, we generated one 32-bit integer for each cycle.
Our experiments showed that the choice of which two bits to ignore does not affect the results, so we consider this a noncritical decision. Note that we could have generated 40 bits per cycle as well. However, since many tests in the DIEHARD suite do respect the 32-bit boundaries of the integers, we did not want to artificially obscure any potential periodic behavior in terms of the cycles.
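The packing step can be sketched as follows. We arbitrarily keep the eight low-order bits of each 10-bit sample; as noted above, the choice of which two bits to discard does not matter.

```python
def pack_samples(samples, keep=8):
    """Pack groups of four samples into 32-bit integers (sketch),
    keeping the `keep` low-order bits of each sample."""
    mask = (1 << keep) - 1
    words = []
    for i in range(0, len(samples) - 3, 4):
        word = 0
        for s in samples[i:i + 4]:
            word = (word << keep) | (s & mask)
        words.append(word)
    return words
```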
2.3.2 Test Results
For a complete description of the tests in the DIEHARD benchmark we refer to [65]. In
Table 2.1 we summarize the basic ideas behind each class of tests. In general, the three
random number sequences pass all the tests, including the most difficult ones [67], with
one exception. Before discussing the one exception in more detail, note that for two tests
we did not have enough 32-bit integers, yet we could still apply them. The first case
is the permutation test, which is concerned with the frequencies of the possible orderings of 5-tuples of subsequent random numbers. The test requires 5 · 10^7 32-bit integers. However, we applied the test using the original 10-bit integers returned by the sampling service, and the random sequences passed. The reason is that ordering is not sensitive to the actual range of the values, as long as the range is not extremely small. The second case is the so-called “gorilla” test, which is a strong instance of the class of monkey tests [67]. It requires 6.7 · 10^7 32-bit integers. In this case we concatenated the output of the three protocols and executed the test on this sequence, with a positive result. The intuitive reasoning behind this approach is that if any of the protocols produced a nonrandom pattern, then the entire sequence would be expected to fail the test, especially given that this test is claimed to be extremely difficult to pass.
Consider now the test that proved to be difficult to pass. This test was an instance
of the class of binary matrix rank tests. In this instance, we take 6 consecutive 32-bit
integers, and select the same (consecutive) 8 bits from each of the 6 integers forming a
6 × 8 binary matrix whose rank is determined. That rank can be from 0 to 6. Ranks
are found for 100,000 random matrices, and a chi-square test is performed on counts for
ranks smaller than or equal to 4, and for ranks 5 and 6.
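The rank computation at the heart of this test can be sketched as follows: rows of the 6 × 8 matrix are represented as 8-bit integers, and the rank is computed by Gaussian elimination over GF(2).

```python
def gf2_rank(rows):
    """Rank of a binary matrix over GF(2); each row is an int bitmask."""
    rank = 0
    rows = list(rows)
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def rank_test_matrix(six_ints, byte_index=0):
    """Select the same byte from each of 6 consecutive 32-bit integers,
    forming the rows of a 6 x 8 binary matrix."""
    shift = 8 * byte_index
    return [(x >> shift) & 0xFF for x in six_ints]
```

A rank-5 outcome typically indicates a repeated byte; the chi-square statistic then compares the observed rank frequencies with their theoretical probabilities.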
Birthday Spacings: The k-bit random numbers are interpreted as “birthdays” in a “year” of 2^k days. We take m birthdays and list the spacings between the consecutive birthdays. The statistic is the number of values that occur more than once in that list.

Greatest Comm. Divisor: We run Euclid’s algorithm on consecutive pairs of random integers. The number of steps Euclid’s algorithm needs to find the greatest common divisor (GCD) of these consecutive integers in the random series, and the GCD itself, are the statistics used to test randomness.

Permutation: Tests the frequencies of the 5! = 120 possible orderings of consecutive integers in the random stream.

Binary Matrix Rank: Tests the rank of binary matrices built from consecutive integers, interpreted as bit vectors.

Monkey: A set of tests for verifying the frequency of the occurrences of “words,” interpreting the random series as the output of a monkey typing on a typewriter. The random number series is interpreted as a bit stream. The “letters” that form the words are given by consecutive groups of bits (e.g., for 2 bits there are 4 letters, etc.).

Count the 1-s: A set of tests for verifying the number of 1-s in the bit stream.

Parking Lot: Numbers define locations for “cars.” We continuously “park cars” and test the number of successful and unsuccessful attempts to place a car at the next location defined by the random stream. An attempt is unsuccessful if the location is already occupied (the two cars would overlap).

Minimum Distance: Integers are mapped to two- or three-dimensional coordinates and the minimal distance among thousands of consecutive points is used as a statistic.

Squeeze: After mapping the random integers to the interval [0, 1), we test how many consecutive values have to be multiplied to get a value smaller than a given threshold. This number is used as a statistic.

Overlapping Sums: The sum of 100 consecutive values is used as a statistic.

Runs Up and Down: The frequencies of the lengths of monotonously decreasing or increasing sequences are tested.

Craps: 200,000 games of craps are played and the number of throws and wins are counted. The random integers are mapped to the integers 1, . . . , 6 to model the dice.

Table 2.1: Summary of the basic idea behind the classes of tests in the DIEHARD test suite for random number generators. In all cases tests are run with several parameter settings. For a complete description we refer to [65].
When the selected byte coincides with the byte contributed by one call to the peer-sampling service (bits 0-7, 8-15, etc.), protocols BLIND and SWAPPER fail the test. To see why, consider the basic functioning of the rank test. In most of the cases, the rank of the matrix is 5 or 6. If it is 5, it typically means that the same 8-bit entry is copied twice into the matrix. Our implementation of the peer-sampling service explicitly ensures that the diversity of the returned elements is maximized in the short run (see Section 2.2.5). As a consequence, rank 6 occurs relatively more often than in the case of a true random sequence. Note that for many applications this property is actually an advantage. However, HEALER passes the test. The reason for this will become clearer later. As we will see, in the case of HEALER the view of a node changes faster and therefore the queue of the samples to be returned is frequently flushed, so the diversity-maximizing effect is less significant.
The picture changes if we consider only every 4th sample in the random sequence
generated by the protocols. In that case, BLIND and SWAPPER pass the test, but HEALER
fails. In this case, the reason for the failure of HEALER is exactly the opposite: there
are relatively too many repetitions in the sequence. Taking only every 8th sample, all
protocols pass the test.
Finally, note that even in the case of “failures,” the numeric deviation from random
behavior is rather small. The expected occurrences of ranks of ≤4, 5, and 6 are 0.94%,
21.74%, and 77.31%, respectively. In the first type of failure, when there are too many
occurrences of rank 6, a typical failed test gives percentages 0.88%, 21.36%, and 77.68%.
When ranks are too small, a typical failure is, for example, 1.05%, 21.89%, and 77.06%.
2.3.3 Conclusions
The results of the randomness tests suggest that the stream of nodes returned by the peer-sampling service is close to uniform random for all the protocol instances examined.
Given that some widely used pseudo-random number generators fail at least some of
these tests, this is a highly encouraging result regarding the quality of the randomness
provided by this class of sampling protocols.
Based on these experiments, however, we cannot draw conclusions about the global randomness of the resulting graphs. Local randomness, evaluated from a single peer’s point of view, is important; however, in a complex large-scale distributed system the streams of random nodes observed at different nodes might have complicated correlations, and merely looking at local behavior does not reveal key characteristics such as load balancing (existence of bottlenecks) and fault tolerance. In Section 2.4 we present a detailed analysis of the global properties of our protocols.
2.4 Global Randomness
In Section 2.3 we have seen that from a local point of view all implementations produce good-quality random samples. However, statistical tests for randomness and independence tend to hide important structural properties of the system as a whole. To capture these global correlations, in this section we switch to a graph-theoretic framework. To translate the problem into a graph-theoretic language, we consider the communication topology or overlay topology defined by the set of nodes and their views (recall that SELECTPEER() returns samples from the view). In this framework the directed edges of the communication graph are defined as follows. If node a stores the descriptor of node b in its view, then there is a directed edge (a, b) from a to b. In the language of graphs, the question is how similar this overlay topology is to a random graph in which the descriptors in each view represent a uniform independent random sample of the whole node set.
In this section we consider graph-theoretic properties of the overlay graphs. An important example of such properties is the degree distribution. The indegree of node i is defined as the number of nodes that have i in their views. The outdegree is constant and equal to the view size c for all nodes. The degree distribution has many significant effects. Most importantly, it determines whether there are hot spots and bottlenecks from the point of view of communication costs. In other words, load balancing is determined by the degree distribution. It is also directly related to resilience to different patterns of node failures [68], and it affects the exact way epidemics are spread [69]. Apart from the degree distribution we also analyze the clustering coefficient and the average path length, as described and motivated in Section 2.4.2.
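The indegree of every node can be read off the views directly; a minimal sketch, with the views given as a hypothetical dict mapping node addresses to lists of neighbor addresses:

```python
from collections import Counter

def indegrees(views):
    """Indegree of each node: the number of views its descriptor appears in.
    The outdegree is simply len(views[node]), i.e., the view size c."""
    indeg = Counter({node: 0 for node in views})
    for view in views.values():
        for neighbor in view:
            indeg[neighbor] += 1
    return indeg
```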
Our main goal is to explore the different design choices in the protocol space described in Section 2.2.2. More specifically, we want to assess the impact of the peer selection, view selection, and view propagation parameters. Accordingly, we chose to fix the network size to N = 10^4 and the maximal view size to c = 30. The results presented in this section were obtained using the PEERSIM simulation environment [66].
2.4.1 Properties of Degree Distribution
The first and most fundamental question is whether, for a particular protocol implementation, the communication graph has some stable properties, which it maintains during
the execution of the protocol. In other words, we are interested in the convergence behavior of the protocols. We can expect several sorts of dynamics, including chaotic behavior, oscillations, and convergence. In the case of convergence, the resulting state may or
may not depend on the initial configuration of the system. In the case of overlay networks
we obviously prefer to have convergence towards a state that is independent of the initial
configuration. This property is called self-organization. In our case it is essential that in
a wide range of scenarios the protocol instances should automatically produce consistent
and predictable behavior. Section 2.4.1 examines this question.
A related question is whether there is convergence and what kind of communication
graph a protocol instance converges to. In particular, as mentioned earlier, we are interested in what sense overlay topologies deviate from certain random graph models. We
discuss this issue in Section 2.4.1.
Finally, we are interested in looking at local dynamic properties along with globally
stable degree distributions. That is, it is possible that while the overall degree distribution
and its global properties such as maximum, variance, average, etc., do not change, the
degree of the individual nodes does. This is preferable because in this case even if there
are always bottlenecks in the network, the bottleneck will not be the same node all the
time which greatly increases robustness and improves load balancing. Section 2.4.1 is
concerned with these questions.
Convergence
We now present experimental results that illustrate the convergence properties of the protocols in three different bootstrapping scenarios:
protocol                partitioned runs   avg. number of clusters   avg. largest cluster
(rand, healer, push)         100%                 22.28                   9124.48
(rand, swapper, push)          0%                  n.a.                    n.a.
(rand, blind, push)           18%                  2.06                   9851.11
(tail, healer, push)          29%                  2.17                   9945.21
(tail, swapper, push)         97%                  4.07                   9808.04
(tail, blind, push)           10%                  2.00                   9936.20

Table 2.2: Partitioning of the push protocols in the growing overlay scenario. Data corresponds to cycle 300. Cluster statistics are over the partitioned runs only.
Growing In this scenario, the overlay network initially contains only one node. At the
beginning of each cycle, 500 new nodes are added to the network until the maximal
size is reached in cycle 20. The view of these nodes is initialized with only a single
node descriptor, which belongs to the oldest, initial node. This scenario is the most
pessimistic one for bootstrapping the overlays. It would be straightforward to improve it by using more contact nodes, which can come from a fixed list or which can
be obtained using inexpensive local random walks on the existing overlay. However, in our discussion we intentionally avoid such optimizations to allow a better
focus on the core protocols and their differences.
Lattice In this scenario, the initial topology of the overlay is a ring lattice, a structured
topology. We build the ring lattice as follows. The nodes are first connected into a
ring in which each node has a descriptor in its view that belongs to its two neighbors
in the ring. Subsequently, for each node, additional descriptors of the nearest nodes
are added in the ring until the view is filled.
Random In this scenario the initial topology is defined as a random graph, in which the
views of the nodes were initialized by a uniform random sample of the peer nodes.
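The lattice initialization described above can be sketched as follows; the helper is hypothetical and identifies nodes by their ring index:

```python
def ring_lattice_views(n, c):
    """Views for the lattice scenario (sketch): each node first gets its
    two ring neighbors, then descriptors of the next-nearest nodes are
    added until the view of size c is filled."""
    views = {}
    for i in range(n):
        view, d = [], 1
        while len(view) < c:
            view.append((i + d) % n)   # d steps clockwise
            if len(view) < c:
                view.append((i - d) % n)  # d steps counter-clockwise
            d += 1
        views[i] = view
    return views
```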
As we focus on the dynamic properties of the protocols, we did not wish to average
out interesting patterns, so in all cases the result of a single run is shown in the plots. Nevertheless, we ran all the scenarios 100 times to gain data on the stability of the protocols
with respect to the connectivity of the overlay. Connectivity is a crucial feature, a minimal
requirement for all applications. The results of these runs show that in all scenarios, every
protocol under examination creates a connected overlay network in 100% of the runs (as
observed in cycle 300). The only exceptions were detected in the growing overlay scenario; Table 2.2 shows the results for the push protocols. With the pushpull scheme we have not observed any partitioning.
The push versions of the protocols perform very poorly in the growing scenario in
general. Figure 2.1 illustrates the evolution of the maximal indegree. The maximal indegree belongs to the central contact node that is used to bootstrap the network. After
growing is finished in cycle 20, the pushpull protocols almost instantly balance the degree
distribution thereby removing the bottleneck. The push versions, however, get stuck in
this unbalanced state.
This is not surprising, because when a new node joins the network and gets an initial
contact node to start with, the only way it can get an updated view is if some other node
contacts it actively. This, however, is very unlikely. Because all new nodes have the same contact, the view at the contact node gets updated extremely frequently, causing all the joining nodes to be quickly forgotten. A node has to push its own descriptor many times before some other node actually contacts it. This also means that if the network topology degenerates into a star shape, the push protocols have extreme difficulty balancing the degree distribution back towards a random one. We conclude that this lack of adaptivity and robustness effectively renders push-only protocols useless. In the following we therefore consider only the pushpull model.

Figure 2.1: Evolution of maximal indegree in the growing scenario (recall that growing stops in cycle 20). The runs of the following protocols are shown: peer selection is either rand or tail, view selection is blind, healer or swapper, and view propagation is push or pushpull.
Figure 2.2 illustrates the convergence of the pushpull protocols. Note that the average
indegree is always the view size c. We can observe that in all scenarios the protocols
quickly converge to the same value, even in the case of the growing scenario, in which the
initial degree distribution is rather skewed. Other properties not directly related to degree
distribution also show convergence, as discussed in Section 2.4.2.
Static Properties
In this section we examine the converged degree distributions generated by the different protocols. Figure 2.3 shows the converged standard deviation of the degree distribution. We observe that increasing both H and S results in a lower—and therefore more
desirable—standard deviation. The reason is different for these two cases. With a large
S, links to a node come to existence only in a very controlled way. Essentially, new incoming links to a node are created only when the node itself injects its own fresh node
descriptor during communication. On the other hand, with a large H, the situation is the
opposite. When a node injects a new descriptor about itself, this descriptor is (exponentially often) copied to other nodes for a few cycles. However, one or two cycles later all
copies are removed because they are pushed out by new links (i.e., descriptors) injected
in the meantime. So the effect that reduces variance is the short lifetime of the copies of a given link.

Figure 2.2: Evolution of standard deviation of indegree in all scenarios of pushpull protocols.

Figure 2.3: Converged values of indegree standard deviation.

Figure 2.4: Converged indegree distributions on linear and logarithmic scales.
Figure 2.4 shows the entire degree distribution for the three vertices of the design
space triangle. We observe that the distribution of SWAPPER is narrower than that of the
random graph, while BLIND has a rather heavy tail and also a large number of nodes with few or no incoming links, which is not desirable from the point of view of load balancing.
Dynamic Properties
Although the distribution itself does not change over time during the continuous execution of the protocols, the behavior of a single node still needs to be determined. More
specifically, we are interested in whether a given fixed node has a variable indegree or
whether the degree changes very slowly. The latter case would be undesirable because
an unlucky node having above-average degree would continuously receive above-average
traffic while others would receive less, which results in inefficient load balancing.
Figure 2.5 compares the degree distribution of a node over time, and the entire network
at a fixed time point. The figure shows only the distribution for one node and only the
random peer-selection protocols, but the same result holds for tail peer selection and for
all the 100 other nodes we have observed. From the fact that these two distributions are
very similar, we can conclude that all nodes take all possible values at some point in time,
which indicates that the degree of a node is not static.
However, it is still interesting to characterize how quickly the degree changes, and
whether this change is predictable or random. To this end, we present autocorrelation
data of the degree time-series of fixed nodes in Figure 2.6. The band indicates a 99%
confidence interval assuming the data is random. Only one node is shown, but all the 100
nodes we traced show very similar behavior. Let the series d_1, \dots, d_K denote the indegree of a fixed node in consecutive cycles, and \bar{d} the average of this series. The autocorrelation of the series d_1, \dots, d_K for a given time lag k is defined as

r_k = \frac{\sum_{j=1}^{K-k} (d_j - \bar{d})(d_{j+k} - \bar{d})}{\sum_{j=1}^{K} (d_j - \bar{d})^2},

which expresses the correlation of pairs of degree values separated by k cycles.
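This statistic is straightforward to compute; a small Python sketch:

```python
def autocorrelation(d, k):
    """Autocorrelation r_k of the series d at time lag k."""
    K = len(d)
    mean = sum(d) / K
    # numerator: covariance of the series with itself shifted by k
    num = sum((d[j] - mean) * (d[j + k] - mean) for j in range(K - k))
    # denominator: total variance (unnormalized)
    den = sum((x - mean) ** 2 for x in d)
    return num / den
```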
We observe that in the case of HEALER it is impossible to make any prediction for a
degree of a node 20 cycles later, knowing the current degree. However, for the rest of the protocols the degree changes much more slowly, resulting in correlations at a distance of 80-100 cycles, which is not optimal from the point of view of load balancing.

Figure 2.5: Comparison of the converged indegree distribution over the network at a fixed time point and the indegree distribution of a fixed node during an interval of 50,000 cycles. The vertical axis represents the proportion of nodes and cycles, respectively.

Figure 2.6: Autocorrelation of indegree of a fixed node over 50,000 cycles. Confidence band corresponds to the randomness assumption: a random series produces correlations within this band with 99% probability.
2.4.2 Clustering and Path Lengths
Degree distribution is an important property of random graphs. However, there are other
equally important characteristics of networks that are independent of degree distribution.
In this section we consider the average path length and the clustering coefficient as two
such characteristics. The clustering coefficient is defined over undirected graphs (see
below). Therefore, we consider the undirected version of the overlay after removing the
orientation of the edges.
Average path length
The shortest path length between nodes a and b is the minimal number of edges that must be traversed in order to reach b from a. The average path length is the average of the shortest path lengths over all pairs of nodes in the graph. The motivation for looking
at this property is that, in any information dissemination scenario, the shortest path length
defines a lower bound on the time and costs of reaching a peer. For the sake of scalability
a small average path length is essential. In Figure 2.7, especially in the growing and
lattice scenarios, we verify that the path length converges rapidly. Figure 2.8 shows the
converged values of the average path length for the design space triangle defined by H
and S. We observe that all protocols result in a very low path length. Large S values are
the closest to the random graph.
Clustering coefficient
The clustering coefficient of a node a is defined as the number of edges between the neighbors of a divided by the number of all possible edges between those neighbors. Intuitively,
this coefficient indicates the extent to which the neighbors of a are also neighbors of each
other. The clustering coefficient of a graph is the average of the clustering coefficients of
its nodes, and always lies between 0 and 1. For a complete graph it is 1; for a tree it is 0. The motivation for analyzing this property is that a high clustering coefficient has potentially damaging effects both on information dissemination (by increasing the number of redundant messages) and on the self-healing capacity, by weakening the connection of a cluster to the rest of the graph and thereby increasing the probability of partitioning. Furthermore, it provides an interesting possibility to draw parallels with research on complex networks, where clustering is an important research topic (e.g., in social networks) [70].
Like average path length, the clustering coefficient also converges (see Figure 2.7);
Figure 2.8 shows the converged values. It is clear that clustering is controlled mainly by
H. The largest values of H result in rather significant clustering, where the deviation from
the random graph is large. The reason is that if H is large, then a large part of the views
of any two communicating nodes will overlap right after communication, since both keep
the same freshest entries. For the largest values of S, clustering is close to random. This
is not surprising either because S controls exactly the diversity of views.
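The clustering coefficient defined above can likewise be computed directly. The sketch below is again our own illustration with hypothetical names; nodes with fewer than two neighbors are assigned a coefficient of 0 by convention:

```python
def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph.

    adj: dict mapping each node to a set of neighbors.
    """
    coeffs = []
    for a, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # fewer than two neighbors: no possible edges
            continue
        # Count edges among the neighbors of a (each edge is seen twice).
        links = sum(1 for u in nbrs for v in adj[u] if v in nbrs) // 2
        coeffs.append(links / (k * (k - 1) / 2))
    return sum(coeffs) / len(coeffs)

# Complete graph on 4 nodes: coefficient 1.  Star graph: coefficient 0.
complete = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(clustering_coefficient(complete), clustering_coefficient(star))  # 1.0 0.0
```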
Figure 2.7: Evolution of the average path length and the clustering coefficient in all scenarios (growing, random, and lattice), for the blind, swapper, and healer protocols with rand and tail peer selection.
Figure 2.8: Converged values of clustering coefficient and average path length, plotted against H for S = 0, 3, 8, and 14 (rand and tail peer selection), compared to the random graph.
Figure 2.9: The number of nodes that do not belong to the largest connected cluster. The
average of 100 experiments is shown. The random graph almost completely overlaps with
the swapper protocols.
2.5 Fault Tolerance
In large-scale, dynamic, wide-area distributed systems it is essential that a protocol is
capable of maintaining an acceptable quality of service under a wide range of severe
failure scenarios. In this section we present simulation results on two classes of such
scenarios: catastrophic failure, where a significant portion of the system fails at the same
time, and heavy churn, where nodes join and leave the system continuously.
2.5.1 Catastrophic Failure
As in the case of the degree distribution, the response of the protocols to a massive failure
has a static and a dynamic aspect. In the static setting we are interested in the self-healing
capacity of the converged overlays to a (potentially massive) node failure, as a function
of the number of failing nodes. Removing a large number of nodes will inevitably cause
some serious structural changes in the overlay even if it otherwise remains connected.
In the dynamic case we would like to learn to what extent the protocols can repair the
overlay after a severe damage.
The effect of a massive node failure on connectivity is shown in Figure 2.9. In this setting the overlay in cycle 300 of the random initialization scenario was used as converged
topology. From this topology, random nodes were removed and the connectivity of the
remaining nodes was analyzed. In all of the 100 × 6 = 600 experiments performed we did
not observe partitioning until removing 67% of the nodes. The figure depicts the number
of the nodes outside the largest connected cluster. We observe consistent partitioning behavior over all protocol instances (with SWAPPER being particularly close to the random
graph): even when partitioning occurs, most of the nodes form a single large connected
cluster. Note that this phenomenon is well known for traditional random graphs [71].
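The static failure experiment can be paraphrased as follows: remove a random fraction of the nodes from a converged overlay and count the survivors that fall outside the largest connected component. A minimal sketch of this measurement (our own names; a toy random overlay stands in for a converged gossip overlay):

```python
import random

def largest_cluster_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    alive = set(adj) - removed
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        stack, comp = [start], 0
        seen.add(start)
        while stack:  # iterative DFS over the surviving subgraph
            u = stack.pop()
            comp += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, comp)
    return best

def nodes_outside_largest_cluster(adj, fraction, rng):
    removed = set(rng.sample(sorted(adj), int(fraction * len(adj))))
    survivors = len(adj) - len(removed)
    return survivors - largest_cluster_size(adj, removed)

# Toy overlay: each node picks 5 random out-neighbors; edges made undirected.
rng = random.Random(42)
n = 500
adj = {i: set() for i in range(n)}
for i in range(n):
    for j in rng.sample([x for x in range(n) if x != i], 5):
        adj[i].add(j)
        adj[j].add(i)
print(nodes_outside_largest_cluster(adj, 0.67, rng))
```

Averaging this quantity over many random removals, as done for Figure 2.9, smooths out the variance of individual experiments.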
In the dynamic scenario we made 50% of the nodes fail in cycle 300 of the random
initialization scenario and we then continued running the protocols on the damaged over-
Figure 2.10: Removing dead links following the failure of 50% of the nodes in cycle 300. The two panels show the proportion of dead links (%) over the cycles for the six protocol instances, and the number of cycles needed to remove all dead links as a function of S for different values of H.
lay. The damage is expressed by the fact that, on average, half of the view of each node
consists of descriptors that belong to nodes that are no longer in the network. We call
these descriptors dead links. Figure 2.10 shows how fast the protocols repair the overlay,
that is, remove dead links from the views. Based on the static node failure experiment it
was expected that the remaining 50% of the overlay is not partitioned and indeed, we did
not observe partitioning with any of the protocols. Self-healing performance is fully controlled by the healing parameter H, with H = 15 resulting in fully repairing the network
in as little as 5 cycles (not shown).
2.5.2 Churn
To examine the effect of churn, we define an artificial scenario in which a given proportion of the nodes crash and are subsequently replaced by new nodes in each cycle. This
scenario is a worst case scenario because the new nodes are assumed to join the system
for the first time, therefore they have no information whatsoever about the system (their
view is initially empty) and the crashed nodes are assumed never to join the system again,
so the links pointing to them will never become valid again. A more realistic trace-based
scenario is also examined in Section 2.5.3 using the Gnutella trace described in [58].
We focus on two aspects: the churn rate, and the bootstrapping method. Churn rate
defines the number of nodes that are replaced by new nodes in each cycle. We consider
realistic churn rates (0.1% and 1%) and a catastrophic churn rate (30%). Since churn is
defined in terms of cycles, in order to validate how realistic these settings are, we need to
define the cycle length. With the very conservative setting of 10 seconds, which results
in a very low load at each node, the trace described in [58] corresponds to 0.2% churn in
each cycle. In this light, we consider 1% a comfortable upper bound of realistic churn,
given also that the cycle length can easily be decreased as well to deal with even higher
levels of churn.
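The worst-case churn model described above amounts to a simple simulation step: in each cycle a fixed fraction of the live nodes crash forever and are replaced by newcomers with empty views. The sketch below is our own illustration with hypothetical names; no healing mechanism is modeled, so dead links only accumulate. It also measures the resulting average number of dead links per view:

```python
import itertools
import random

def churn_step(views, churn_rate, fresh_ids, rng):
    """One cycle of the worst-case churn model: a `churn_rate` fraction of
    the live nodes crash and are replaced by newcomers with empty views.
    Crashed nodes never rejoin, so descriptors pointing to them stay dead."""
    crashed = rng.sample(sorted(views), int(churn_rate * len(views)))
    for node in crashed:
        del views[node]                  # crashed node leaves forever
        views[next(fresh_ids)] = []      # newcomer starts with an empty view

def avg_dead_links(views):
    """Average number of descriptors per view that point to dead nodes."""
    live = set(views)
    return sum(sum(1 for d in v if d not in live)
               for v in views.values()) / len(views)

rng = random.Random(0)
n = 1000
views = {i: rng.sample([x for x in range(n) if x != i], 10) for i in range(n)}
fresh_ids = itertools.count(n)
for _ in range(5):                       # five cycles of 1% churn, no healing
    churn_step(views, 0.01, fresh_ids, rng)
print(avg_dead_links(views))
```

In the real protocols the view exchanges between two churn steps remove dead entries at a rate governed by H, which is why the converged number of dead links in Figure 2.12 depends so strongly on that parameter.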
We examine two bootstrapping methods. Both are rather unrealistic, but our goal here
is not to suggest an optimal bootstrapping implementation, but to analyze our protocols
under churn. The following two methods are suitable for this purpose because they represent two opposite ends of the design space:
Central We assume that there exists a server that is known by every joining node, and that
is stable: it is never removed due to churn or other failures. This server participates
in the gossip membership protocol as an ordinary node. The new nodes use the
server as their first contact. In other words, their view is initialized to contain the server.

Figure 2.11: Standard deviation of node degree with churn rate 1%. Node degree is defined over the undirected version of the subgraph of live nodes. The H = 0 case is not comparable to the shown cases; due to reduced self-healing, nodes have much fewer live neighbors (see Figure 2.12), which causes relatively low variance.
Random An oracle gives each new node a random live peer from the network as its first
contact.
Realistic implementations could use a combination of these two approaches, where one
or more servers serve random contact peers, using the peer-sampling service itself. Any
such implementation can reasonably be expected to result in a behavior in between the
two extremes described above.
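The two bootstrapping extremes reduce to two view-initialization rules for a joining node. The sketch below is our own illustration; the `join` function and the `SERVER` sentinel are hypothetical names, not part of any actual implementation:

```python
import random

SERVER = "server"  # hypothetical well-known, stable node

def join(views, new_node, method, rng):
    """Initialize a newcomer's view under the two bootstrapping extremes:
    'central' fills the view with the stable server only, while 'random'
    models an oracle handing out one uniformly random live peer."""
    if method == "central":
        views[new_node] = [SERVER]
    elif method == "random":
        views[new_node] = [rng.choice([n for n in views if n != new_node])]
    else:
        raise ValueError(method)

views = {SERVER: [], "a": [SERVER], "b": [SERVER]}
join(views, "c", "central", random.Random(1))
join(views, "d", "random", random.Random(1))
print(views["c"])  # ['server']
```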
Simulation experiments were run by initializing the network with random links and subsequently running the protocols under the given amount of churn until the observed properties reached a stable level (300 cycles). The experimental results reveal that for realistic
churn rates (0.1% and 1%) all the protocols are robust to the bootstrapping method and
the properties of the overlay are very close to those without churn. Figure 2.11 illustrates this by showing the standard deviation of the node degrees in both scenarios, for the
higher churn rate 1%. Observe the close correspondence with Figure 2.3. The clustering
coefficient and average path length show the same robustness to bootstrapping, and the
observed values are almost identical to the case without churn (not shown).
Let us now consider the damage churn causes in the networks. First of all, for all
protocols and scenarios the networks remain connected, even for H = 0. Still, a (low)
number of dead links remain in the overlay. Figure 2.12 shows the average number of
dead links in the views, again, only for the higher churn rate (1%). It is clear that the
extent of the damage is fully controlled by the healing parameter H. Furthermore, it is
clear that the protocols are robust to the bootstrapping scenario also in this case. If H ≥ 1
then the maximal (not average) number of dead links in any view for the different protocol
instances ranges from 5 to 13 in the case of churn rate 1%, and from 2 to 5 for churn rate
0.1%, where the lowest value belongs to the highest H. If H = 0, then the number of dead
links radically increases: it is at least 11 on average, and the maximal number of dead
links ranges from 20 to 25 for the different settings. That is, in the presence of churn, it is
essential for any implementation to set at least H = 1. We have already seen this effect
in Section 2.5.1 concerning self-healing performance.
Figure 2.12: Average number of dead links in a view with churn rate 1%. The H = 0 case
is not shown; it results in more than 11 dead links per view on average, for all settings.
Although the server participates in the overlay, the plots showing results under the
central bootstrapping scenario were calculated ignoring the server, because its properties
sharply differ from the rest of the network. In particular, it has a high indegree, because all
new nodes will have a fresh link to the server, and that link will stay in the view of joining
nodes for a few more cycles, possibly replicated in the meantime. Indeed, we observe that
for 1% churn, between 12% and 28% of the nodes have a link to the server at any time, depending
on H and S. However, if we assume that the server can handle the traffic generated by
joining nodes, a high indegree is noncritical. The expected number of incoming messages
due to indegree d is d/c (where c is the view size), with a very low variance. This means
that the generated traffic is of the same order of magnitude as the traffic generated by the
joining nodes. We note again, however, that we do not consider this simplistic server-based solution a practical approach; we treat it only as a worst-case scenario to help us
evaluate the protocols.
So far we have been discussing realistic churn rates. However, it is of academic interest to examine the behavior under extremely difficult scenarios, where the network suffers
a catastrophic damage in each cycle. The catastrophic churn rate of 30% combines the
effects of catastrophic failure (see Section 2.5.1) and churn.
Unlike with realistic churn rates, in this case the bootstrapping method has a strong effect on the performance of the protocols and therefore becomes the major design decision,
although the parameters H and S still have a very strong effect as well. Consequently,
we need to analyze the interaction of the gossip membership protocol and the bootstrapping method. In the case of the server-based solution, the overlay evolves into a ring-like
structure, with a few shortcut links. The reason is that the view of the server is predominantly filled with entries of the newly joined nodes, since each time a new node contacts
the server it also places a fresh entry about itself in the view of the server. These entries
are served to the subsequently joining nodes, thus forming a linear structure. This ring-like structure is rather robust: it remains connected (even after removing the server) for
all protocols with H ≥ 8. However, it has a slightly higher diameter than that of the
random graph (approximately 20-30 hops). For HEALER the average number of dead links
per view is still as low as 10 and 9 for random and tail peer selection, respectively.
The random scenario is rather different. In particular, we lose connectivity for all the
protocols, however, for large values of H the largest connected cluster almost reaches
the size of the network (see Figure 2.13). Besides, the structure of the overlay is also
different. As Figure 2.13 shows, tail peer selection results in a slightly more unbalanced
degree distribution (note that the low deviation for low values of H is due to the low
number of live nodes). The reason is that the nodes staying in the overlay for somewhat
longer receive more incoming traffic: since tail peer selection picks the oldest live node,
and since each view contains a very high number of dead links, these nodes tend to be
the oldest live node in most views they appear in. For HEALER the average number of
dead links per view is 11 and 9 for random and tail peer selection, respectively.

Figure 2.13: Size of largest connected cluster and degree standard deviation under catastrophic churn rate (30%), with the random bootstrapping method. Individual curves belong to different values of S, but the measures depend only on H, so we do not differentiate between them. Connectivity and node degree are defined over the undirected version of the subgraph of live nodes.
To summarize our findings: under realistic churn rates all the protocols perform very
similarly to the case when there is no churn at all, independently of the bootstrapping
method. Besides, some of the protocol instances, in particular, HEALER, can tolerate even
catastrophic churn rates with a reasonable performance with both bootstrapping methods.
2.5.3 Trace-driven Churn Simulations
In Section 2.5.2 we analyzed our protocols under artificial churn scenarios. Here, we
consider a realistic churn scenario using the so-called lifetime measurements on Gnutella,
carried out by Saroiu et al. [58]. These traces contain—among other information—the
connection and disconnection times for a total of 17,125 nodes over a period of 60 hours.
Throughout the trace, the number of connected nodes remains practically unchanged, on
the order of 10^4 nodes.
We noticed a periodic pattern occurring every 404 seconds in the traces. In each
404-second interval, all connections and disconnections take place during the first 344
seconds, rendering the network static during the last 60 seconds. These recurring gaps
would represent a positive bias for our churn simulations, as they periodically provide
the overlay with some “breathing space” to process recent changes. However, these gaps
are not realistic and are most probably an artifact of the logging mechanism. Therefore,
we decided to eliminate them by linearly expanding each 344 second interval to cover
the whole 404 seconds. Note that this transformation leaves the node uptimes practically
unaltered.
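The gap-elimination step can be expressed as a linear remapping of event timestamps: the 344-second active prefix of every 404-second interval is stretched to cover the whole interval. A sketch of this transformation (the function name is ours):

```python
def expand_gaps(timestamps, period=404, active=344):
    """Linearly stretch the `active`-second prefix of every `period`-second
    interval to cover the whole interval, eliminating the idle gaps.

    An event at offset t within its interval (0 <= t < active) is moved to
    offset t * period / active; interval boundaries are preserved.
    """
    out = []
    for ts in timestamps:
        base = (ts // period) * period   # start of the enclosing interval
        offset = ts - base               # position within the interval
        out.append(base + offset * period / active)
    return out

# An event at the end of the active part of the first interval (t = 344)
# lands exactly on the interval boundary (t = 404).
print(expand_gaps([0, 172, 344]))  # [0.0, 202.0, 404.0]
```

Because the stretch factor 404/344 is the same within every interval, relative uptimes of the nodes are left practically unchanged, as noted above.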
We have taken the following two decisions with respect to the parameters in the experiments presented. First, peer selection is fixed to random. Section 2.5.2 showed that
random is outperformed by tail peer selection in all cases. Therefore, random is a suitable
choice for this section as the worst-case peer selection policy. Second, the swap parameter
S is fixed to 0: Section 2.5.2 showed that S = 0 results in the highest (therefore worst)
degree deviation, while it does not affect the number of dead links.

Figure 2.14: Churn in the Saroiu traces. Full time span of 3600 one-minute cycles, and zoomed in to cycles 2250 to 2750.
We apply two join methods: central and random, as defined in Section 2.5.2. The only
difference is that a reconnecting node still remembers the links it previously had, some
of which may be dead at reconnection time. This facilitates reconnection, but generally
increases the total number of dead links.
The cycle length was chosen to be 1 minute. We anticipate that in reality the cycle
length will be shorter, resulting in lower churn per cycle. The choice of a cycle length
close to the upper end of realistic values is intentional, and is aimed at testing this specific
gossip membership protocol under increased stress.
Figure 2.14 shows the node connections and disconnections as a percentage of the
current network size. Connections are shown as positive points, whereas disconnections
as negative. Although we ran the experiments for the whole trace, we focus on its most
interesting part, namely cycles 2250 to 2750. Notice that at cycle 2367, around 450
nodes get disconnected at once and reconnect altogether 27 minutes later, at cycle 2394,
probably due to a router failure. Similar temporary—but shorter—group disconnections
are observed later on, around cycles 2450, 2550, and 2650, respectively.
Let us now examine the way the overlay is affected by those network changes. Figure 2.15 shows that the number of dead links is always kept at fairly small levels, especially when H is at least 1. As expected, the number of dead links peaks when there
are massive node disconnections and gets back to normal quickly. However, it is not affected by the observed massive node reconnections, because these happen shortly after
the respective disconnections, and the neighbors of the reconnected nodes are still alive.
Two observations regarding the effect of H can be made. First, higher values of H
result in fewer dead links per view, validating the analysis in Section 2.5.2. Second, higher
values of H trigger the faster elimination of dead links. The peaks caused by massive
node disconnections are wider for low H values, and become sharper as H grows. In
fact, these two observations are related to each other: in a persistently dynamic network,
the converged average number of dead links depends on the rate at which the protocol
disposes of them.
Figure 2.16 shows the evolution of the node degree deviation. It can be observed that
for H ≥ 1 the degree deviation under churn is very close to the corresponding converged
values in a static network (see Figure 2.3). For H = 0 though, the higher number of
pending dead links affects the degree distribution more. Note that both massive node disconnections and connections disturb the degree deviation, but in both cases a few cycles
are sufficient to recover the original overlay properties.

Figure 2.15: Average number of dead links per view, based on the Saroiu Gnutella traces. All experiments use random peer selection and S = 0; the curves correspond, from top down, to H = 0 (blind), H = 1, H = 3, and H = 15 (healer).

Figure 2.16: Evolution of the standard deviation of node degree, based on the Saroiu Gnutella traces. All experiments use random peer selection and S = 0; the curves correspond, from top down, to H = 0 (blind), H = 1, H = 3, H = 8, and H = 15 (healer).
To recap our analysis, we have shown that even with a pessimistic cycle length of 1
minute, all protocols for H ≥ 1 perform very similarly to the case of a stable network,
independently of the join method. Anomalies caused by massive node connections or
disconnections are repaired quickly.
2.6 Wide-Area-Network Emulation
Distributed protocols often exhibit unexpected behavior when deployed in the real world
that cannot always be captured by simulation. Typically, this is due to unexpected message loss, network and scheduling delays, as well as events taking place in unpredictable,
arbitrary order. In order to validate the correctness of our simulation results, we implemented our gossip membership protocols and deployed them on a wide-area network.
We utilized the DAS-2 wide-area cluster as our testbed [72]. The DAS-2 cluster consists of 200 dual-processor nodes spread across 5 sites in the Netherlands. A total of 50
nodes were used for our emulations, 10 from each site. Each node was running a Java
Virtual Machine emulating 200 peers, giving a total of 10,000 peers. Peers were running
in separate threads.

Figure 2.17: Evolution of indegree standard deviation, clustering coefficient, and average path length in all scenarios for real-world experiments.
Although 200 peers were running on each physical machine, communication within
a machine accounted for only 2% of the total communication. Local-area and wide-area
traffic accounted for 18% and 80% of the total, respectively. Clearly, most messages are
transferred through wide area connections. Note that the intra-cluster and inter-cluster
round-trip delays on the DAS-2 are on the order of 0.15 and 2.5 milliseconds, respectively. In all emulations, the cycle length was set to 5 seconds.
In order to validate our simulation results, we repeated the experiments presented in
Figures 2.2 and 2.7 of Section 2.4, using our real implementation. A centralized coordinator was used to initialize the node views according to the bootstrapping scenarios
presented in Section 2.4.1, namely growing, lattice, and random.
The first run of the emulations produced graphs practically indistinguishable from the
corresponding simulation graphs. Acknowledging the low round-trip delay on the DAS-2,
we ran the experiments again, this time inducing a 50 ms delay in each message delivery,
accounting for a round-trip delay of 100 ms on top of the actual one. The results presented
in this section are all based on these experiments.
Figure 2.17 shows the evolution of the indegree standard deviation, clustering coefficient, and average path length for all experiments, using the same scales as Figures 2.2
and 2.7 to facilitate comparison. The very close match between simulation-based and
real-world experiments for all three nodes of the design space triangle allows us to claim
that our simulations represent a valid approximation of real-world behavior.
The small differences of the converged values with respect to the simulations are due
to the induced round-trip delay. In a realistic environment, view exchanges are not atomic:
they can be intercepted by other view exchanges. For instance, a node having initiated a
view exchange and waiting for the corresponding reply may in the meantime receive a
view-exchange request from a third node. However, the view updates performed by the active and passive threads of a node are not commutative. The results presented correspond
to an implementation where we simply ignored this problem: all requests are served immediately regardless of the state of the serving node. This solution is extremely simple
from a design point of view but may lead to corrupted views.
As an alternative, we devised and implemented three approaches to avoid corrupted
views. In the first approach, a node’s passive thread drops incoming requests while its
active thread is waiting for a reply. In the second one, the node queues—instead of
dropping—incoming requests until the awaited reply comes. As a third approach, a node’s
passive thread serves all incoming requests, but its active thread drops a reply if an incoming request intervened.
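The four strategies (the "ignore" baseline and the three conflict-avoiding alternatives) amount to a small decision rule on the message-handling path. The sketch below is purely illustrative; the enum and function names are ours, not part of the actual implementation:

```python
from enum import Enum

class Policy(Enum):
    IGNORE = 1      # serve every request immediately (views may get corrupted)
    DROP = 2        # drop incoming requests while a reply is awaited
    QUEUE = 3       # queue incoming requests until the awaited reply arrives
    DROP_REPLY = 4  # serve requests, but discard the awaited reply instead

def handle_request(policy, waiting_for_reply, pending_queue):
    """What the passive thread does with an incoming view-exchange request.
    Under DROP_REPLY the request is served here; discarding the stale reply
    is then the active thread's job (not modeled in this sketch)."""
    if not waiting_for_reply or policy in (Policy.IGNORE, Policy.DROP_REPLY):
        return "serve"
    if policy is Policy.DROP:
        return "drop"
    pending_queue.append("request")  # Policy.QUEUE
    return "queued"

print(handle_request(Policy.DROP, True, []))    # drop
print(handle_request(Policy.IGNORE, True, []))  # serve
```

The dependency chains mentioned below arise under DROP and QUEUE: a node that drops or delays a request forces the requester to wait, which may in turn make the requester drop or delay requests it receives itself.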
Apart from the added complexity that these solutions impose on our design, their
benefit turned out to be difficult or impossible to notice. Moreover, undesirable situations
may arise in the case of the first two: dropping or delaying a request from a third node may
cause that node to drop or delay, in turn, requests it receives itself. Chains of dependencies
are formed this way, which can render parts of the network inactive for some periods.
Given the questionable advantage these approaches can offer, and considering the design
overhead they impose, we will not consider them further. Based on our experiments, the
best strategy is simply ignoring the problem, which further underlines the exceptional
robustness and simplicity of gossip-based design.
2.7 Discussion
In this section we summarize and interpret the results presented so far. As stated in the
introductory section, we were interested in determining the properties of various gossip
membership protocols, in particular their randomness, load balancing and fault tolerance.
In a sense, after we discussed in the last section why certain results were observed, we
discuss here what the results imply.
2.7.1 Randomness
We have studied randomness from two points of view: local and global. Local randomness is based on the analogy between a pseudo random-number generator and the peersampling service as seen by a fixed node. We have seen that all protocols return a random
sequence of peers at all nodes with a good approximation.
We have shown, however, that there are important correlations between the samples
returned at different nodes, that is, the overlay graphs that the implementations are based
upon are not random. Adopting a graph-theoretic approach, we have been able to identify
important deviations from randomness that are different for the several instances of our
framework.
In short, randomness is approached best by the view selection method SWAPPER (H =
0, S = c/2 = 15), irrespective of the peer selection method. In general, increasing H increases the clustering coefficient. The average path length is close to that of a random
graph for all protocols we examined. Finally, with SWAPPER the degree distribution has
a smaller variance than that of the random graph. This property can often be considered
“better than random” (e.g., from the point of view of load balancing).
Clearly, the randomness required by a given application depends on the very nature
of that application. For example, the upper bound of the speed of reaching all nodes
via flooding a network depends exclusively on the diameter of the network, while other
aspects such as degree distribution or clustering coefficient are irrelevant for this specific
question. Likewise, if the sampling service is used by a node to draw samples to calculate
a local statistical estimate of some global property, such as network size or the availability
of some resources, what is needed is that the local samples are uniformly distributed.
However, it is not required that the samples are independent at different nodes, that is,
we do not need global randomness at all; the unstructured overlay can have any degree
distribution, diameter, clustering, etc.
Load Balancing
We consider the service to provide good load balancing if the nodes evenly share the cost
of maintaining the service and the cost induced by the application of the service. Both
are related to the degree distribution: if many nodes point to a certain node, this node will
receive more sampling-service related gossip messages and most applications will induce
more overhead on this node, resulting in poor load balancing. Since the unstructured
overlays that implement the sampling service are dynamic, it is also important to note
that nodes with a high indegree become a bottleneck only if they keep having a high
indegree for a long time. In other words, a node is in fact allowed to have a high indegree
temporarily, for a short time period.
We have seen that the BLIND view selection is inferior to the other alternatives. The
degree distribution has a high variance (that is, there are nodes that have a large indegree)
and on top of that, the degree distribution is relatively static, compared to the alternatives.
Clearly, the best choice to achieve good load balancing is the SWAPPER view selection,
which results in an even lower variance of indegree than in the uniform random graph. In
general, the parameter S is strongly correlated with the variance of indegree: increasing
S for a fixed H decreases the variance. The degree distribution is almost as static as in
the case of HEALER, if H = 0. However, this is not a problem because the distribution has
low variance.
Finally, HEALER also performs reasonably. Although the variance is somewhat higher
than that of SWAPPER, it is still much lower than BLIND. Besides, the degree distribution
is highly dynamic, which means that the somewhat higher variance of the degree distribution does not result in bottlenecks because the indegree of the nodes change quickly. In
general, increasing H for a fixed value of S also decreases the variance.
Fault Tolerance
We have studied both catastrophic and realistic scenarios. In the first category, catastrophic failure and catastrophic churn were analyzed. In these scenarios, the most important parameter turned out to be H: it is always best to set H as high as possible. One
exception is the experiment with the removal of 50% of the nodes, where SWAPPER performs slightly better. However, SWAPPER is slow in removing dead links, so if failure can
be expected, it is highly advisable to set H ≥ 1.
In the case of realistic scenarios, such as the realistic (artificial) churn rates, and the
trace-based simulations, we have seen that the damaging effect is minimal, and (as long
as H ≥ 1) the performance of the protocols is very similar to the case when there is no
failure.
2.8 Related Work
2.8.1 Gossip Membership Protocols
Most gossip protocols for implementing peer sampling are covered by our framework: we
mentioned these in Section 2.2.4. One notable exception is [73], which we address here in
some more detail. The protocol is as follows. In each cycle, all nodes pull the full partial
views from F randomly selected peers. In addition, they record the addresses of the peers
initiating incoming pull requests during the given cycle. The old view is then discarded
and a new view is generated from scratch. In the most practical version, the new view is
generated by first adding the addresses of the incoming requests and subsequently filling
the rest of the view with random samples from the union of the previously pulled F views
without replacement.
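As a rough illustration, the view reconstruction step of this protocol can be sketched as follows (our own illustrative Python, with hypothetical function and parameter names; message handling and node addressing are abstracted away):

```python
import random

def rebuild_view(pull_initiators, pulled_views, view_size):
    """Sketch of the view reconstruction step: start from the addresses
    of peers that initiated incoming pull requests this cycle, then fill
    the remaining slots with samples drawn without replacement from the
    union of the F pulled views."""
    new_view = list(pull_initiators)[:view_size]
    # Candidate pool: union of the pulled views, minus peers already taken.
    pool = sorted(set().union(*pulled_views) - set(new_view))
    fill = random.sample(pool, min(view_size - len(new_view), len(pool)))
    return new_view + fill
```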
Notice that there are two features that are incompatible with our framework: the use of
F > 1 (in our case F = 1) and the asymmetry between push and pull, with a much greater
emphasis on pull: only one entry—the initiator peer's own entry—is pushed. Allowing
F > 1 is common in other proposals as well (e.g., [62]). In our framework, information
exchange is either symmetric or fully asymmetric, without any finer tuning possibility.
To compare this protocol with our framework, we implemented it and ran simulations
using our experimental scenarios. The view size and network size were the same as in
all simulations, and F was 1, 2, or 3. The main conclusions are summarized below. The
protocol class presented in [73] has some difficulty dealing with the scenarios when the
initial network is not random (the growing and lattice initializations, see Section 2.4.1).
For F = 1 we consistently observed partitioning in the lattice scenario (which was
otherwise never observed in our framework). In the growing scenario—mostly for F = 1
but also for F = 2 and F = 3—the protocols occasionally get stuck in a local attractor
containing a star subgraph: a node with a very high indegree, and a large number
of nodes with zero indegree and an outdegree of one. Apart from these issues, if we consider
self-healing, load balancing and convergence properties, the protocols roughly behave as
if they were instances in our framework using push-pull, with 0 ≤ H ≤ 1 and S = 0, with
increasing F tending towards H = 1. Since we have concluded that the “interesting”
protocols in our space have either a high H or a high S value, based on the empirical
evidence accumulated so far there is no urgent need to extend our framework to allow for
F > 1 or asymmetric information exchange. However, studying these design choices in
more detail is an interesting topic for future research.
In the following we summarize a number of other fields that are relevant.
2.8.2 Complex Networks
The assumption of uniform randomness has only fairly recently become subject to discussion when considering large complex networks such as the hyperlinked structure of
the WWW, or the complex topology of the Internet. Like social and biological networks,
the structures of the WWW and the Internet both follow the quite unbalanced power-law degree distribution, which deviates strongly from that of traditional random graphs.
These new insights pose several interesting theoretical and practical problems [74]. Several dynamic complex networks have also been studied and models have been suggested
for explaining phenomena related to what we have described here [75].
2.8.3 Unstructured Overlays
There are a number of protocols that are not gossip-based but that are potentially useful
for implementing peer sampling. An example is the Scamp protocol [76]. While this protocol is reactive and so less dynamic, an explicit attempt is made towards the construction
of a (static) random graph topology. Randomness has been evaluated in the context of information dissemination, and it appears that reliability properties come close to what one
would see in random graphs. Some other protocols have also been proposed to achieve
randomness [77, 78], although not having the specific requirements of the peer-sampling
service in mind. Finally, random walks on arbitrary (hence, also unstructured) networks
offer a powerful tool to obtain random samples, where even the sampling distribution
can be adjusted [79]. These protocols, however, have a significantly higher overhead if
many samples are required. This overhead and the convergence time also depend on the
structure of the overlay network the random walk operates on.
2.8.4 Structured Overlays
In a sense, structured overlays have also been considered as a basic middleware service
to applications [80]. However, a structured overlay [81–83] is by definition not dynamic.
Hence utilizing it for implementing the peer-sampling service requires additional techniques such as random walks [79, 84]. Another example of this approach is a method
assuming a tree overlay [85]. It is unclear whether a competitive implementation can be
given considering also the cost of maintaining the respective overlay structure.
Another issue in common with our own work is that graph-theoretic approaches have
been developed for further analysis [86]. Astrolabe [87] also needs to be mentioned as
a hierarchical (and therefore structured) overlay, which, although applying (nonuniform)
gossip to increase robustness and to achieve self-healing properties, does not even attempt to implement or apply a uniform peer-sampling service. It was designed to support
hierarchical information aggregation and dissemination.
2.9 Concluding Remarks
Gossip protocols have recently generated a lot of interest in the research community. The
overlays that result from these protocols are highly resilient to failures and high churn
rates. The underlying paradigm is clearly appealing for building large-scale distributed
applications.
Our contribution is to factor out the abstraction implemented by the membership
mechanism underlying gossip protocols: the peer-sampling service. The service provides
every peer with (local) knowledge of the rest of the system, which is key to having the
system converge as a whole towards global properties using only local information.
We described a framework to implement a reliable and efficient peer-sampling service. The framework itself is based on gossiping. This framework is generic enough to be
instantiated with most current gossip membership protocols [5, 62, 63, 88]. We used this
framework to empirically compare the range of protocols through simulations based on
synthetic and realistic traces as well as implementations. We showed that
these protocols ensure local randomness from each peer's point of view. We also observed
that as far as the global properties are concerned, the average path length is close to the
one in random graphs and that clustering properties are controlled by (and grow with) the
parameter H. With respect to fault tolerance, we observe a high resilience to high churn
rate and particularly good self-healing properties, again mostly controlled by the parameter H. In addition, these properties mostly remain independent of the bootstrapping
approach chosen.
In general, when designing gossip membership protocols that aim at randomness, following a push-only or pull-only approach is not a good choice. Instead, only the combination results in desirable properties. Likewise, it makes sense to build in robustness
by purposefully removing old links when exchanging views with a peer. This situation
corresponds in our framework to a choice for H > 0.
Regarding other parameter settings, it is much more difficult to come to general conclusions. As it turns out, tradeoffs between, for example, load balancing and fault tolerance will need to be made. When focusing on swapping links with a selected peer, the
price to pay is lower robustness against node failures and churn. On the other hand, making a protocol extremely robust will lead to skewed indegree distributions, affecting load
balancing.
To conclude, we demonstrated in this extensive study that gossip membership protocols can be tuned to both support high churn rates and provide graph-theoretic properties
(both local and global) close to those of random graphs so as to support a wide range of
applications.
Chapter 3
Average Calculation
As computer networks increase in size, become more heterogeneous and span greater geographic distances, applications must be designed to cope with the very large scale, poor
reliability, and often, with the extreme dynamism of the underlying network. Aggregation
is a key functional building block for such applications: it refers to a set of functions
that provide components of a distributed system access to global information including
network size, average load, average uptime, location and description of hotspots, etc.
Local access to global information is often very useful, if not indispensable, for building applications that are robust and adaptive. For example, in an industrial control application, some aggregate value reaching a threshold may trigger the execution of certain
actions; a distributed storage system will want to know the total available free space; load
balancing protocols may benefit from knowing the target average load so as to minimize
the load they transfer.
In this chapter we elaborate on the aggregation protocol we introduced in Section 1.3.
As mentioned there, the class of aggregate functions we can compute is very broad and includes many useful special cases such as counting, averages, sums, products and extremal
values. The protocol is suitable for extremely large and highly dynamic systems due to its
proactive structure—all nodes receive the aggregate value continuously, thus being able
to track any changes in the system. The protocol is also extremely lightweight, making
it suitable for many distributed applications including peer-to-peer and grid computing
systems. We demonstrate the efficiency and robustness of our gossip-based protocol both
theoretically and experimentally under a variety of scenarios including node and communication failures.
3.1 Introduction
In this chapter, we focus on aggregation, which is a useful building block in large, unreliable and dynamic systems [89] (see also Section 1.3). Aggregation is a common name for
a set of functions that provide a summary of some global system property. In other words,
they allow local access to global information in order to simplify the task of controlling, monitoring and optimization in distributed applications. Examples of aggregation
functions include network size, total free storage, maximum load, average uptime, location and intensity of hotspots, etc. Furthermore, simple aggregation functions can be
used as building blocks to support more complex protocols. For example, the knowledge
of average load in a system can be exploited to implement near-optimal load-balancing
schemes [61].
We distinguish reactive and proactive protocols for computing aggregation functions.
Reactive protocols respond to specific queries issued by nodes in the network. The answers are returned directly to the issuer of the query while the rest of the nodes may or
may not learn about the answer. Proactive protocols, on the other hand, continuously
provide the value of some aggregate function to all nodes in the system in an adaptive
fashion. By adaptive we mean that if the aggregate changes due to network dynamism
or because of variations in the input values, the output of the aggregation protocol should
track these changes reasonably quickly. Proactive protocols are often useful when aggregation is used as a building block for completely decentralized solutions to complex
tasks. For example, in the load-balancing scheme cited above, the knowledge of the global
average load is used by each node to decide if and when it should transfer load [61].
We introduce a robust and adaptive protocol for calculating aggregates in a proactive manner. We assume that each node maintains a local approximation of the aggregate
value. The core of the protocol is a simple gossip-based communication scheme in which
each node periodically selects some other random node to communicate with. During
this communication the nodes update their local approximate values by performing some
aggregation-specific and strictly local computation based on their previous approximate
values. This local pairwise interaction is designed in such a way that all approximate
values in the system will quickly converge to the desired aggregate value.
In addition to introducing our gossip-based protocol, our contributions are threefold.
First, we present a full-fledged practical solution for proactive aggregation in dynamic
environments, complete with mechanisms for adaptivity, robustness and topology management. Second, we show how our approach can be extended to compute complex aggregates such as variances and different means. Third, we present theoretical and experimental evidence supporting the efficiency of the protocol and illustrating its robustness
with respect to node and link failures and message loss.
In Section 3.2 we define the system model. Section 3.3 describes the core idea of the
protocol and presents theoretical and simulation results of its performance. In Section 3.4
we discuss the extensions necessary for practical applications. Section 3.5 introduces
novel algorithms for computing statistical functions including several means, network
size and variance. Sections 3.6 and 3.7 present analytical and experimental evidence on
the high robustness of our protocol. Section 3.8 describes the prototype implementation of
our protocol on PlanetLab and gives experimental results of its performance. Section 3.9
discusses related work. Finally, conclusions are drawn in Section 3.10.
3.2 System Model
We consider a network consisting of a large collection of nodes that are assigned unique
identifiers and that communicate through message exchanges. The network is highly dynamic; new nodes may join at any time, and existing nodes may leave, either voluntarily
or by crashing. Our approach does not require any mechanism specific to leaves: spontaneous crashes and voluntary leaves are treated uniformly. Thus, in the following, we limit
our discussion to node crashes. Byzantine failures, with nodes behaving arbitrarily, are
excluded from the present discussion (but see [90]).
We assume that nodes are connected through an existing routed network, such as the
Internet, where every node can potentially communicate with every other node. To actually communicate, a node has to know the identifiers of a set of other nodes, called its
neighbors. This neighborhood relation over the nodes defines the topology of an overlay
Algorithm 9 push-pull aggregation
1: loop
2:     wait(∆)
3:     p ← selectPeer()
4:     sendPush(p, x)
5:
6: procedure onPush(m)
7:     sendPull(m.sender, x)
8:     x ← update(m.x, x)
9: procedure onPull(m)
10:     x ← update(m.x, x)
network. Given the large scale and the dynamicity of our envisioned system, neighborhoods are typically limited to small subsets of the entire network. The set of neighbors
of a node (thus the overlay network topology) can change dynamically. Communication
incurs unpredictable delays and is subject to failures. Single messages may be lost, links
between pairs of nodes may break. Occasional performance failures (e.g., delay in receiving or sending a message in time) can be seen as general communication failures, and are
treated as such. Nodes have access to local clocks that can measure the passage of real
time with reasonable accuracy, that is, with small short-term drift.
We focus on node and communication failures. Some other aspects of the model that
are outside of the scope of the present analysis (such as clock drift and message delays)
are discussed only informally in Section 3.4.
3.3 Gossip-based Aggregation
We assume that each node i in the network of N nodes holds a numeric value x_i. In a
practical setting, this value can characterize any (possibly dynamic) aspect of the node or
its environment (e.g., the load at the node, available storage space, temperature measured
by a sensor network, etc.). The task of a proactive protocol is to continuously provide
all nodes with an up-to-date estimate of an aggregate function, computed over the values
held by the current set of nodes.
3.3.1 The Basic Aggregation Protocol
In Chapter 1 we have already presented push-pull averaging in Algorithm 4. For the sake
of convenience, we repeat the algorithm here as Algorithm 9, with a slight generalization:
instead of averaging the two values, the state update at the nodes is now expressed as an
abstract method UPDATE. Method UPDATE computes a new local state based on the current
local state and the remote state received during the information exchange. In most of
this chapter, we limit the discussion to computing the average over the set of numbers
distributed among the nodes, that is, method UPDATE(x, y) returns (x + y)/2. However,
additional functions (most of them derived from the averaging protocol) are described in
Section 3.5.
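As an illustration, a single push-pull exchange with the averaging instance of update can be sketched as follows (our own simplified Python, ignoring message delays and failures); note that the exchange preserves the sum of the two estimates:

```python
def update(x, y):
    # The averaging instance of the abstract update method.
    return (x + y) / 2

def push_pull_exchange(a, b, estimates):
    """One complete push-pull exchange: node a pushes its estimate to b,
    b replies with its own current estimate (the pull), and both nodes
    then apply update to the value they received."""
    pushed, pulled = estimates[a], estimates[b]
    estimates[b] = update(pushed, estimates[b])  # b handles the push
    estimates[a] = update(pulled, estimates[a])  # a handles the pull reply
    return estimates
```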
As for the peer sampling service, in Section 3.3.2 we will assume, for theoretical reasons, that selectPeer returns a true uniform random sample over the entire set of nodes.
In Section 3.4.4 we revisit the peer sampling service from a practical point of view, by
looking at realistic implementations based on non-uniform or dynamically changing overlay topologies.
Let us now consider the convergence of the protocol. It is easy to see that after one
complete push-pull exchange, the sum of the two local estimates remains unchanged since
Algorithm 10 AVG
1: for k = 1 to N do                         ⊲ vector x of length N is the input
2:     (i(k), j(k)) = getPair(k)
3:     x_{i(k)} = x_{j(k)} = (x_{i(k)} + x_{j(k)})/2    ⊲ perform elementary variance reduction step
4: return x
method UPDATE simply redistributes the initial sum equally among the two nodes, a property known as mass conservation. So the operation does not change the global average
but it decreases the variance over the set of all estimates in the system.
It is easy to see that the variance tends to zero in probability, that is, the value at
each node will converge to the true global average in probability, as long as the network
of nodes is not partitioned into disjoint clusters. To see this, consider the
minimal value in the system. Clearly, if selectPeer returns uniform samples then in
each cycle either the number of instances of the minimal value decreases or the global
minimum increases with a probability of at least 1/N, provided there are values different
from the minimal value (otherwise we are done because all values are equal). This is because
if there is at least one different value, then any instance of the minimal value will get a
neighbor with a different (thus larger) value with a probability of at least 1/N.
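The convergence argument can be illustrated with a small simulation (an illustrative sketch under simplifying assumptions: synchronized cycles and uniform random pair selection); the mean is preserved while the variance shrinks rapidly:

```python
import random
import statistics

def avg_cycle(x):
    """One cycle of algorithm AVG with uniform random pair selection:
    N elementary variance reduction steps on the estimate vector x."""
    n = len(x)
    for _ in range(n):
        i, j = random.sample(range(n), 2)  # a random pair of distinct nodes
        x[i] = x[j] = (x[i] + x[j]) / 2    # elementary variance reduction step
    return x

random.seed(42)
estimates = [random.uniform(0, 100) for _ in range(1000)]
mean_before = statistics.fmean(estimates)
var_before = statistics.pvariance(estimates)
for _ in range(20):  # the variance drops by roughly a constant factor per cycle
    avg_cycle(estimates)
```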
The only non-trivial problem is to characterize the speed of the convergence of the
expected variance. In the following, we will show that each cycle results in a reduction
of the variance by a constant factor, which provides exponential convergence. We will
assume that no failures occur and that the starting point of the protocol is synchronized.
All of these assumptions will be relaxed later in the chapter.
3.3.2 Theoretical Analysis of Gossip-based Aggregation
We will treat the averaging protocol as an iterative variance reduction algorithm over a
vector of numbers. To see how, consider that the distributed protocol in Algorithm 9
results in a series of push-pull exchanges between pairs of nodes. In fact, the behavior
of the protocol is completely characterized by the series of node pairs that perform a
push-pull exchange. This observation motivates the definition of Algorithm AVG (shown
as Algorithm 10) that takes a vector x of length N as a parameter and produces a new
vector x′ = AVG(x) of the same length. The elements of the vector represent the local
approximations at the nodes in the network of size N.
This centralized view of the protocol will let us develop the theoretical tools that will
be used to characterize the original distributed protocol. Of course, in reality the push-pull
exchanges in the network might overlap in time. For the sake of the theoretical
discussion, we assume that the exchanges that involve a fixed node are non-overlapping
in time, and thus these exchanges can be ordered. This defines a partial order that can
always be extended to a total order. Any such extensions are equivalent from the point of
view of convergence properties. Algorithm AVG represents the distributed execution by
generating such a total order of communicating pairs via getPair().
In this framework, we assume we are given an initial vector of numbers x(0) =
(x_1(0), . . . , x_N(0)). The elements of this vector correspond to the initial values at the nodes.
The consecutive cycles of the protocol result in a series of vectors x(1), x(2), . . ., where
x(t + 1) = AVG(x(t)). The behavior of our distributed gossip-based protocol can be
reproduced by an appropriate implementation of getPair. In addition, other implementations of getPair are possible that do not necessarily map to any distributed protocol but
are of theoretical interest. We will discuss some important special cases as part of our
analysis.
Without loss of generality, to simplify our expressions, let us assume that the average
of the values in the network is zero. Due to mass conservation, this will be true in all
cycles:
\[
\sum_{i=1}^{N} x_i(t) = 0, \qquad t = 0, 1, \ldots \tag{3.1}
\]
Under this assumption the variance in x is now given by
\[
\sigma^2(t) = \frac{1}{N} \sum_{i=1}^{N} x_i^2(t). \tag{3.2}
\]
Since the mean of the estimates remains constant (zero) due to mass conservation, from
now on we can focus on σ²(t) as t tends to infinity. In particular, we want σ²(t) to quickly
converge to zero, because a small variance means that all nodes have a very accurate
approximation.
Let us begin our analysis of the convergence of the variance with some basic observations. Let us have a look at the form of σ²(t + 1) when expressed using the elements of
x(t). First, for illustration, consider σ²(t)′, the variance of the vector after processing the first pair (i, j) returned by getPair:
\[
\begin{aligned}
N\sigma^2(t)' &= x_1^2(t) + \cdots + \left(\frac{x_i + x_j}{2}\right)^2 + \cdots + \left(\frac{x_i + x_j}{2}\right)^2 + \cdots + x_N^2(t) \\
&= x_1^2(t) + \cdots + \frac{x_i^2(t)}{2} + \cdots + \frac{x_j^2(t)}{2} + \cdots + x_N^2(t) + x_i(t)x_j(t).
\end{aligned} \tag{3.3}
\]
Clearly, after completing the N cycles of algorithm AVG, we have
\[
N\sigma^2(t+1) = \sum_{i=1}^{N} \left( \sum_{j=1}^{N} \alpha_{i,j} x_j(t) \right)^{\!2} = \sum_{i=1}^{N} a_i x_i^2(t) + \sum_{i \neq j} b_{i,j} x_i(t) x_j(t), \tag{3.4}
\]
where the parameters $\alpha_{i,j}$ (and thus $a_i$ and $b_{i,j}$) are random variables that depend on the
random decisions made by algorithm AVG.
We now discuss a few useful observations.
Proposition 3.3.1. If algorithm AVG is symmetric to permutations (that is, for any permutation of the nodes π the series of pairs (i(k), j(k)) has the same probability as the
series (π(i(k)), π(j(k))), k = 1, . . . , N) then for some constant a* we have
\[
a^* = E(a_1) = \cdots = E(a_N) \tag{3.5}
\]
(using the notations in Eq. (3.4)). We will call a* the convergence factor. We then have
\[
E(\sigma^2(t+1)) \le a^* \sigma^2(t). \tag{3.6}
\]
Proof. Eq. (3.5) follows directly from symmetry. Similarly, due to symmetry, it must be
the case that for any $i \neq j$ and $m \neq n$: $E(b_{i,j}) = E(b_{m,n})$. Let b denote this common
constant. Then we have
\[
\begin{aligned}
E(\sigma^2(t+1)) &= \frac{1}{N} \sum_{i=1}^{N} E(a_i) x_i^2(t) + \frac{1}{N} \sum_{i \neq j} E(b_{i,j}) x_i(t) x_j(t) \\
&= \frac{a^*}{N} \sum_{i=1}^{N} x_i^2(t) + \frac{b}{N} \sum_{i \neq j} x_i(t) x_j(t) \\
&= a^* \sigma^2(t) + \frac{b}{N} \sum_{i \neq j} x_i(t) x_j(t) \\
&= a^* \sigma^2(t) + \frac{b}{N} \left( \sum_{i=1}^{N} x_i(t) \right) \left( \sum_{j=1}^{N} x_j(t) \right) - \frac{b}{N} \sum_{i=1}^{N} x_i^2(t) \\
&= a^* \sigma^2(t) - b\sigma^2(t) \\
&\le a^* \sigma^2(t),
\end{aligned} \tag{3.7}
\]
where we used Eq. (3.1) and the fact that $b \ge 0$, which follows from the fact that all the
parameters $\alpha_{i,j}$ in (3.4) are non-negative.
Proposition 3.3.2.
\[
\sum_{i=1}^{N} a_i \ge \frac{N}{4} \tag{3.8}
\]
for any fixed execution of AVG with any implementation of method getPair if N is even.
Proof. Let us introduce the notations $a_i^{(k)}$ and $\alpha_{i,j}^{(k)}$ to represent the parameters that are
analogous to $a_i$ and $\alpha_{i,j}$ but in the state when only k cycles of AVG have been completed.
Clearly, for all $i = 1, \ldots, N$, we have $a_i^{(N)} = a_i$, $\alpha_{i,j}^{(N)} = \alpha_{i,j}$, $a_i^{(0)} = \alpha_{i,i}^{(0)} = 1$ and $\alpha_{i,j}^{(0)} = 0$ if $i \neq j$.

First, observe that when a pair (i, j) is picked by algorithm AVG in cycle k, then the
contribution of node i to the difference $\sum_{i=1}^{N} a_i^{(k)} - \sum_{i=1}^{N} a_i^{(k+1)}$ is
\[
2 \sum_{j=1}^{N} \frac{1}{4} \left( \alpha_{i,j}^{(k)} \right)^2 = \frac{1}{2} \sum_{j=1}^{N} \left( \alpha_{i,j}^{(k)} \right)^2 \tag{3.9}
\]
if the sets of non-zero α parameters of node i and j do not overlap, and strictly less if there
is an overlap. Using this insight, let us observe that the maximal possible value of this
contribution is 1/2. This happens when node i is picked for the first time, because in this
case the only non-zero α parameter is $\alpha_{i,i}^{(k)} = 1$. The second largest possible contribution
is 1/4. This can happen only when a node i is picked for the second time, and node i has
exactly two non-zero α values (both having a value of 1/2). To see this, consider that no
α value can possibly be in the interval (1/2, 1), and a node that is selected for the second
time will have at least two non-zero α values.

From these observations we can see that the maximal overall difference $\sum_{i=1}^{N} a_i^{(0)} - \sum_{i=1}^{N} a_i^{(N)}$ is given by N/2 + N/4. The proposition directly follows from this, since
$\sum_{i=1}^{N} a_i^{(0)} = N$.
The assumption that N is even is extremely weak, given that we are interested in
networks where N is very large, where this detail makes very little difference. Dealing
with an odd N would not add much insight to the analysis; however, it would make the
equations more complex, so we will not develop our results for that case.
Corollary 3.3.3. a∗ ≥ 1/4.
Remark 3.3.4. The equality in Corollary 3.3.3 can be achieved if AVG returns the N/2
pairs (let us assume that N is even) that form a perfect matching of the nodes as the first
N/2 pairs, followed by N/2 pairs that form a second perfect matching that has no
common pairs with the first perfect matching.
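The construction of the remark can be sketched as follows (illustrative code with our own naming; nodes are 0, …, N−1 and the second matching is the first one shifted by one position, so the two matchings share no pairs):

```python
import random
import statistics

def getpair_pm(n):
    """Pairs for one cycle of AVG built from two edge-disjoint perfect
    matchings on nodes 0..n-1 (n even): (0,1),(2,3),... followed by
    (1,2),(3,4),...,(n-1,0)."""
    first = [(i, i + 1) for i in range(0, n, 2)]
    second = [(i, (i + 1) % n) for i in range(1, n, 2)]
    return first + second

def run_cycle(x, pairs):
    # Apply the elementary variance reduction step to each pair in order.
    for i, j in pairs:
        x[i] = x[j] = (x[i] + x[j]) / 2
    return x
```

On random initial values, one such cycle reduces the empirical variance by roughly the optimal factor of 1/4, since every final estimate is the average of four distinct initial values.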
Proposition 3.3.5. If algorithm AVG is symmetric to permutations then
\[
E(\sigma^2(t+1)) = \left( 1 - O\!\left(\frac{1}{N}\right) \right) a^* \sigma^2(t). \tag{3.10}
\]
Proof. Considering (3.7), where we have seen that $E(\sigma^2(t+1)) = (a^* - b)\sigma^2(t)$, we
need to prove that $b/a^* = O(1/N)$. We first prove that $\sum_{i \neq j} b_{i,j} < N$. Let us consider
the coefficients $\alpha_{i,j}$ and $b_{i,j}$ in Eq. (3.4). First of all, we know that on any node k we have
$\sum_{i=1}^{N} \alpha_{k,i} = 1$; this follows from the mass conservation property of the algorithm. From
this it follows that on any node k we have $\sum_{i \neq j} \alpha_{k,i} \alpha_{k,j} < 1$. Now we know that
\[
\sum_{i \neq j} b_{i,j} = \sum_{i \neq j} \sum_{k=1}^{N} \alpha_{k,i} \alpha_{k,j} = \sum_{k=1}^{N} \sum_{i \neq j} \alpha_{k,i} \alpha_{k,j} < N. \tag{3.11}
\]
Since $E(\sum_{i \neq j} b_{i,j}) = N(N-1)b$, it follows that $b < 1/(N-1)$. Based on Corollary 3.3.3
we have $b/a^* < 4/(N-1)$, which concludes the proof.
This proposition indicates that the bound in (3.6) is tight in large networks. Having
established the properties of the convergence factor—most importantly, the property that
one needs to concentrate only on the quadratic terms in (3.4)—we can now give the expected value of the convergence factor for a number of interesting implementations of
method getPair, and ultimately, the convergence factor of Algorithm 9.
Pair Selection: Perfect Matching
As was discussed in Remark 3.3.4, the implementation of getPair that is based on two
perfect matchings is optimal, and results in a convergence factor of a* = 1/4. We will call
this implementation getPair_pm, where PM stands for perfect matching. This implementation cannot be mapped to an efficient distributed protocol directly because it requires
global knowledge of the system. What makes it interesting is the fact that it is optimal.
Pair Selection: Random Choice
Moving towards more practical implementations of getPair, our next example is getPair_rand, which simply returns a random pair of different nodes independently for each
call to getPair, with all such pairs having an equal probability.

getPair_rand can easily be implemented as a distributed protocol, provided that selectPeer returns a uniform random sample of the set of nodes. When iterating AVG, the
waiting time between two consecutive selections of a given node can be described by the
exponential distribution. In a distributed implementation, a given node can approximate
this behavior by waiting for a time interval randomly drawn from this distribution before
initiating communication. However, as we shall see, getPair_rand is not a very efficient
pair selector.
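This pair selector is straightforward to sketch, together with an empirical estimate of the per-cycle variance reduction factor (illustrative code; function names are our own):

```python
import random
import statistics

def getpair_rand(n):
    """One cycle worth of pairs: n independent uniform random pairs
    of distinct nodes."""
    return [tuple(random.sample(range(n), 2)) for _ in range(n)]

def empirical_factor(n=2000, seed=3):
    """Variance reduction factor of a single AVG cycle, measured on
    random initial estimates."""
    random.seed(seed)
    x = [random.gauss(0, 1) for _ in range(n)]
    before = statistics.pvariance(x)
    for i, j in getpair_rand(n):
        x[i] = x[j] = (x[i] + x[j]) / 2  # elementary variance reduction step
    return statistics.pvariance(x) / before
```

For a large n the measured factor should come out close to 1/e ≈ 0.368, in line with Theorem 3.3.6 below.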
Theorem 3.3.6. The limit of the convergence factor for getPair_rand is given by
\[
\lim_{N \to \infty} a^* = \frac{1}{e}. \tag{3.12}
\]
Proof. First, we repeat the observation of the proof of Proposition 3.3.2 that when a pair
(i, j) is picked by algorithm AVG in cycle k, then the contribution of node i to the difference $\sum_{i=1}^{N} a_i^{(k)} - \sum_{i=1}^{N} a_i^{(k+1)}$ is
\[
2 \sum_{j=1}^{N} \frac{1}{4} \left( \alpha_{i,j}^{(k)} \right)^2 = \frac{1}{2} \sum_{j=1}^{N} \left( \alpha_{i,j}^{(k)} \right)^2 \tag{3.13}
\]
if the sets of non-zero α parameters of node i and j do not overlap.

In the case of getPair_rand, the α parameters will overlap only with a diminishing
probability for a large N. This follows from the fact that any node i will influence only
a constant (O(1)) number of other nodes within one cycle on average, and conversely,
any node is influenced only by a constant number of other nodes on average. So the
probability that a node i gets a pair j with an overlapping set of α parameters is O(1/N).
Since originally the sum of the coefficients for the quadratic terms in the variance
is $\sum_{j=1}^{N} (\alpha_{i,j}^{(k)})^2$, this means that node i reduces its actual contribution by a half every
time it is picked. To be more precise, the remaining half contribution to the variance,
$\frac{1}{2} \sum_{j=1}^{N} (\alpha_{i,j}^{(k)})^2$, will now be distributed among two nodes equally, with $\frac{1}{4} \sum_{j=1}^{N} (\alpha_{i,j}^{(k)})^2$
contributed by both node i and j.

However, since from a statistical point of view all the nodes have exactly the same
future, because getPair_rand makes decisions that are independent of the previous decisions, we can assume that the original contribution ($a_i^{(0)} = 1$) gets halved each time node
i is picked. This will result in the same expected convergence factor. This convergence
factor will then be given by the expectation
\[
a^* = E\!\left( \frac{1}{2^{\varphi}} \right) \tag{3.14}
\]
where φ is a random variable that describes the number of times a node i was picked as a
member of a pair. The distribution of φ can be approximated by the Poisson distribution
with parameter 2 for a large N, that is,
\[
P(\varphi = j) = \frac{2^j}{j!} e^{-2}. \tag{3.15}
\]
Substituting this into the expression $E(2^{-\varphi})$ we get
\[
E(2^{-\varphi}) = \sum_{j=0}^{\infty} 2^{-j} \frac{2^j}{j!} e^{-2} = e^{-2} \sum_{j=0}^{\infty} \frac{1}{j!} = e^{-2} e = e^{-1}. \tag{3.16}
\]
Comparing the performance of getPair_rand and getPair_pm we can see that convergence is significantly slower than in the optimal case (the factors are $e^{-1} \approx 1/2.71$ vs.
1/4).
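The Poisson computation above is easy to check numerically (a truncated version of the sums involved; for φ distributed as Poisson(2) the result is e⁻¹, and the shifted variable φ = 1 + Poisson(1) used in the next subsection gives 1/(2√e)):

```python
import math

def expected_halvings(lam, terms=80):
    """E(2**-phi) for phi ~ Poisson(lam), summing the series directly
    (truncated; the exact value is exp(-lam/2))."""
    return sum(2.0**-j * lam**j / math.factorial(j) * math.exp(-lam)
               for j in range(terms))
```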
Pair Selection: a Distributed Solution
Building on the results we have so far, it is possible to analyze our original protocol
described in Algorithm 9. If messages are delivered without delay then (assuming that
the start of the cycle is not synchronized) nodes will wake up in a random order, and each
of them will pair up with a random other node and complete one exchange.
In order to simulate this fully distributed version, the implementation of pair selection
will return random pairs such that in each execution of AVG (that is, in each cycle), each
node is guaranteed to be a member of at least one pair. This can be achieved by picking
a random permutation of the nodes and pairing up each node in the permutation with
another random node, thereby generating N pairs. We call this algorithm getPair_distr.
As we shall see, the performance of this protocol is superior to that of getPair_rand,
although of course it does not match the optimal getPair_pm.
Theorem 3.3.7. The limit of the convergence factor for getPair_distr is given by

lim_{N→∞} a∗ = 1/(2√e).    (3.17)
Proof. As in the proof of Theorem 3.3.6, we define φ to be the number of times a node participates in a pair. Random variable φ can be approximated as φ = 1 + φ′, where φ′ has the Poisson distribution with parameter 1, that is, for j > 0

P(φ = j) = P(φ′ = j − 1) = e^{−1} / (j − 1)!.    (3.18)
Again, similarly to the proof of Theorem 3.3.6, we calculate E(2^{−φ}) and here we get

E(2^{−φ}) = Σ_{j=1}^{∞} 2^{−j} e^{−1} / (j − 1)! = (1/(2e)) Σ_{j=1}^{∞} 2^{−(j−1)} / (j − 1)! = (1/(2e)) √e = 1/(2√e),    (3.19)
which is the desired formula.
The proof is not complete, however, because we cannot apply the reasoning of Theorem 3.3.6 in this case. The only difference is that here nodes might have different futures,
since the pairs are not independent. In other words, different nodes might have a different
expected number of times to be picked as a member of a pair in light of the number of
times they have been selected before (except at the start of AVG, when no node has been
selected yet).
In Theorem 3.3.6 we imagined the variance reduction process as a sort of stick breaking process in which a given initial unit contribution gets halved each time a node is
selected (half of the stick is thrown away) and in addition, the remaining half stick is
broken into two equal pieces and added to the contributions of the two members of the
pair.
Let us divide the original unit stick into atoms such that when the stick is broken in
two, half of the atoms are sent to the other node and half of them stay. Using 2^N atoms of initial size 2^{−N} suffices. In this view, an atom takes a random walk in the network,
and half of its length is thrown away in each step. It makes one step each time the node
it sits on is selected; in that case it stays where it is, or moves to the other node with a
probability of 1/2.
In this setting, we show that in the limit of large N the number of steps one atom makes (φ′′) has the same distribution as φ, which is sufficient to complete the proof. We do not present the complete proof here, because it is rather technical. The main idea is applying induction. First, we show that P(φ′′ = 1) = 1/e in the limit of large N. In the inductive step we construct all the possible ways of making exactly k + 1 steps assuming the possible walks (and their probability) of k steps are known. We then calculate the relative probability and show that P(φ′′ = k + 1)/P(φ′′ = k) = 1/k (in the limit of large N), which completes the proof, since through induction we get

P(φ′′ = k) = e^{−1} / (k − 1)!,  k > 0.    (3.20)
Comparing the performance of getPair_distr to getPair_rand and getPair_pm, we can see that convergence is slower than the optimal case but faster than random selection (the factors are 1/(2√e) ≈ 1/3.3, e^{−1} ≈ 1/2.71 and 1/4, respectively).
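The getPair_distr factor can also be verified by simulation. The sketch below (hypothetical helper names) implements one cycle of the permutation-based pair selection described above, so every node participates in at least one exchange; the measured factor should be close to 1/(2√e) ≈ 0.303.

```python
import random

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def cycle_distr(xs):
    """getPair_distr: walk through a random permutation, pairing each node
    with another uniformly random node, so each node appears at least once."""
    n = len(xs)
    order = list(range(n))
    random.shuffle(order)
    for i in order:
        j = random.randrange(n - 1)
        if j >= i:
            j += 1                     # random node other than i
        m = (xs[i] + xs[j]) / 2.0
        xs[i] = xs[j] = m

random.seed(2)
xs = [random.random() for _ in range(5000)]
v0 = variance(xs)
cycle_distr(xs)
print(variance(xs) / v0)  # close to 1/(2*sqrt(e)) = 0.303
```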
Empirical Results for Convergence of Aggregation
We ran AVG using getPair_rand and getPair_distr for several network sizes and different initial distributions. For each parameter setting, 50 independent experiments were performed.
Recall that theory predicts that the average convergence factor is independent of the actual initial node values x(0). To test this, we initialized the nodes in two different ways.
In the uniform scenario, each node is assigned an initial value uniformly drawn from the
same interval. In the peak scenario, one randomly selected node is assigned a non-zero
value and the rest of the nodes are initialized to zero.
Note that in the case of the peak scenario, methods that approximate the average based
on a small random sample (that is, statistical sampling methods) are useless: one has to
know all the values to calculate the average. Also, for a fixed variance, we have the largest difference between any two values. In this sense the peak scenario represents a worst case. Last but not least, peak initialization has important practical applications as well, as we discuss in Section 3.5.
The results are shown in Figures 3.1 and 3.2. Figure 3.1 confirms our prediction that
convergence is independent of network size and that the observed convergence factors
match theory with very high accuracy. Note that smaller convergence factors result in
faster convergence.
The only difference between the peak and the uniform scenario is that the variance of
the convergence factor is higher for the peak scenario. Note that our theoretical analysis
does not tackle the question of convergence factor variance. We can see, however, that the average convergence factor is well predicted, and after a few cycles the variance decreases considerably.
Figure 3.3 shows the difference between the maximal and the minimal estimates in
the system for both the peak and uniform initialization scenarios. Note that although the
expected variance E(σi ) decreases at the predicted rate, in the peak distribution scenario,
the difference decreases faster. This effect is due to the highly skewed nature of the
distribution of estimates in the peak scenario. In both cases, the difference between the
maximal and the minimal estimate decreases exponentially and after as few as 20 cycles
the initial difference is reduced by several orders of magnitude. This means that after
a small number of cycles all nodes, including the outliers, will possess very accurate
estimates of the global average.
Figure 3.1: Convergence factor σ²(1)/σ²(0) after one execution of AVG as a function of network size. For the peak distribution, error bars are omitted for clarity (but see Figure 3.2). Values are averages and standard deviations for 50 independent runs. Dotted lines correspond to the two theoretically predicted convergence factors: e^{−1} ≈ 0.368 and 1/(2√e) ≈ 0.303.
Figure 3.2: Convergence factor σ²(i)/σ²(i − 1) for network size N = 10^6 for different iterations of algorithm AVG. Values are averages and standard deviations for 50 independent runs. Dotted lines correspond to the two theoretically predicted convergence factors: e^{−1} ≈ 0.368 and 1/(2√e) ≈ 0.303.
Figure 3.3: Normalized difference between the maximal and the minimal estimates as a function of cycles with network size N = 10^6. All 50 experiments are plotted as a single point for each cycle with a small horizontal random translation.
A Note on our Figures of Merit
Our approach for characterizing the quality of the approximations and convergence is based on the variance σ and the convergence factor of the variance, a∗, which describes the speed at which the expected value of σ decreases. To understand better what our results mean, it helps to compare our approach with other ways of characterizing the quality of aggregation.
First of all, since we are dealing with a continuous process, there is no end result in
a strict sense. Clearly, the figures of merit depend on how long we run the protocol. The
variance σ(i) characterizes the average accuracy of the approximates in the system in the
given cycle i. In our approach, apart from averaging the accuracy over the system, we also
average it over different runs, that is, we consider E(σ(i)). This means that an individual
node in a specific run can have rather different accuracy. We have not considered the
distribution of the accuracy (only the mean accuracy as described above), which depends
on the initial distribution of the values. However, Figure 3.3 suggests that our approach is
robust to the initial distribution.
Another frequently used measure is completeness [91]. This measure is defined under
the assumption that the aggregate is calculated based on the knowledge of a subset of the
values (ideally, based on the entire set, but due to errors this cannot always be achieved).
It gives the percentage of the values that were taken into account. In our protocol this
measure is difficult to adopt directly because at all times a local approximate can be
thought of as a weighted average of the entire set of values. Ideally, all values should
have equal weight in the approximations of the nodes (resulting in the global average
value). To get a similar measure, one could characterize the distribution of weights as a
function of time, to get a more fine-grained idea of the dynamics of the protocol.
3.4 A Practical Protocol for Gossip-based Aggregation
Building on the simple idea presented in the previous section, we now complete the details
so as to obtain a full-fledged solution for gossip-based aggregation in practical settings.
3.4.1 Automatic Restarting
The generic protocol described so far is not adaptive, as the aggregation takes into account neither the dynamicity in the network nor the variability in values that are being
aggregated. To provide up-to-date estimates, the protocol must be periodically restarted:
at each node, the protocol is terminated and the current estimate is returned as the aggregation output; then, the current local values are used to re-initialize the estimates and
aggregation starts again with these fresh initial values.
To implement termination, we adopt a very simple mechanism: each node executes
the protocol for a predefined number of cycles, denoted as γ, depending on the required
accuracy of the output and the convergence factor that can be achieved in the particular
overlay topology adopted (see the convergence factor given in Section 3.3).
To implement restarting, we divide the protocol execution in consecutive epochs of
length γ∆ (where ∆ is the cycle length) and start a new instance of the protocol in each
epoch. We also assume that messages are tagged with an epoch identifier that will be used by the synchronization mechanism as described below.
3.4.2 Coping with Churn
In a realistic scenario, nodes continuously join and leave the network, a phenomenon
commonly called churn. When a new node joins the network, it contacts a node that is
already participating in the protocol. Here, we assume the existence of an out-of-band
mechanism to discover such a node, and the problem of initializing the neighbor set of
the new node is discussed in Section 3.4.4.
The contacted node provides the new node with the next epoch identifier and the time
until the start of the next epoch. Joining nodes are not allowed to participate in the current
epoch; this is necessary to make sure that each epoch converges to the average that existed
at the start of the epoch. Continuously adding new nodes would make it impossible to
achieve convergence.
As for node crashes, when a node initiates an exchange, it sets a timeout period to
detect the possible failure of the other node. If the timeout expires before the message
is received, the exchange step is skipped. The effect of these missing exchanges due to
real (or presumed) failures on the final average will be discussed in Section 3.7. Note that self-healing (removing failed nodes from the system) is taken care of by the NEWSCAST protocol, which we propose as the implementation of method SELECTPEER (see Sections 3.4.4 and 3.7).
3.4.3 Synchronization
The protocol described so far is based on the assumption that cycles and epochs proceed
in lock step at all nodes. In a large-scale distributed system, this assumption cannot be
satisfied due to the unpredictability of message delays and the different drift rates of local
clocks.
Given an epoch j, let Tj be the time interval from when the first node starts participating in epoch j to when the last node starts participating in the same epoch. In our protocol
as it stands, the length of this interval would increase without bound given the different
drift rates of local clocks and the fact that a new node joining the network obtains the next
epoch identifier and start time from an existing node, incurring a message delay.
To avoid the above problem, we modify our protocol as follows. When a node participating in epoch i receives an exchange message tagged with epoch identifier j such that
j > i, it stops participating in epoch i and instead starts participating in epoch j. This
has the effect of propagating the larger epoch identifier (j) throughout the system in an
epidemic broadcast fashion forcing all (slow) nodes to move up to the new epoch. In other
words, the start of a new epoch acts as a synchronization point for the protocol execution
forcing all nodes to follow the pace being set by the nodes that enter the new epoch first.
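The epoch-jump rule above amounts to a single comparison in the exchange handler. The following minimal sketch (class and method names are ours, and most protocol details are omitted) illustrates it: a message from a newer epoch makes the node abandon its current instance, re-initialize its estimate from the fresh local value, and then process the exchange in the new epoch.

```python
class Node:
    """Minimal sketch of the epoch-synchronization rule for AVG."""

    def __init__(self, local_value):
        self.epoch = 0
        self.local_value = local_value
        self.estimate = local_value

    def on_exchange(self, msg_epoch, msg_estimate):
        if msg_epoch > self.epoch:
            # A newer epoch has started elsewhere: abandon the current one
            # and re-initialize the estimate from the fresh local value.
            self.epoch = msg_epoch
            self.estimate = self.local_value
        if msg_epoch == self.epoch:
            # Ordinary averaging exchange within the current epoch.
            self.estimate = (self.estimate + msg_estimate) / 2.0
        # Messages from older epochs (msg_epoch < self.epoch) are ignored.

n = Node(10.0)
n.on_exchange(3, 4.0)      # jump to epoch 3, then average with 4.0
print(n.epoch, n.estimate)
```

Because the larger epoch identifier spreads through exactly the same exchanges the protocol already performs, no extra messages are needed for synchronization.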
Informally, knowing that push-pull epidemic broadcasts propagate super-exponentially
(see Chapter 1) and assuming that each message arrives within the timeout used during
all communications, we can obtain a logarithmic bound on Tj for each epoch j. More
importantly, typically many nodes will start the new epoch independently with a very
small difference in time, so this bound can be expected to be sufficiently small, which allows picking an epoch length such that it is significantly larger than Tj. A more detailed
analysis of this mechanism would be interesting but is out of the scope of the present
discussion. The effect of lost messages (i.e., those that time out) however, is discussed
later.
3.4.4 Importance of Overlay Network Topology for Aggregation
The theoretical results described in Section 3.3 are based on the assumption that the underlying overlay is “sufficiently random”. More formally, this means that the neighbor
selected by a node when initiating communication is a uniform random sample among its
peers. Yet, our aggregation scheme can be applied to generic connected topologies, by
selecting neighbors from the set of neighbors in the given overlay network. This section
examines the effect of the overlay topology on the performance of aggregation.
All of the topologies we examine (with the exception of N EWSCAST) are static—the
neighbor set of each node is fixed. While static topologies are unrealistic in the presence
of churn, we still consider them due to their theoretical importance and the fact that our
protocol can in fact be applied in static networks as well, although they are not the primary
focus of the present discussion.
Static Topologies
All topologies considered have a regular degree of 20 neighbors, with the exception of
the complete network (where each node knows every other node) and the Barabási-Albert
network (where the degree distribution is a power-law). For the random network, the
neighbor set of each node is filled with a random sample of the peers.
The Watts-Strogatz and scale-free topologies represent two classes of realistic small-world topologies that are often used to model different natural and artificial phenomena [74, 92]. The Watts-Strogatz model [70] is obtained from a regular ring lattice. The
ring lattice is built by connecting the nodes in a ring and adding links to their nearest
neighbors until the desired node degree is reached. Starting with this ring lattice, each
edge is then randomly rewired with probability β. Rewiring an edge at node n means
removing that edge and adding a new edge connecting n to another node picked at random. When β = 0, the ring lattice remains unchanged, while when β = 1, all edges are rewired, generating a random graph. For intermediate values of β, the structure of the graph lies between these two extreme cases: complete order and complete disorder.

Figure 3.4: Convergence factor for Watts-Strogatz graphs as a function of parameter β. The dotted line corresponds to the theoretical convergence factor for peer selection through random choice: 1/(2√e) ≈ 0.303.
Figure 3.4 focuses on the Watts-Strogatz model showing the convergence factor as a
function of β ranging from 0 to 1. Although there is no sharp phase transition, we observe
that increased randomness results in a lower convergence factor (faster convergence).
Scale-free topologies form the other class of realistic small-world topologies. In particular, the Web graph, Internet autonomous systems, and P2P networks such as Gnutella
[93] have been shown to be instances of scale-free topologies. We have tested our protocol over scale-free graphs generated using the preferential attachment method of Barabási
and Albert [74]. The basic idea of preferential attachment is that we build the graph by
adding new nodes one-by-one, wiring the new node to an existing node already in the
network. This existing contact node is picked randomly with a probability proportional to
its degree (number of neighbors).
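Preferential attachment, as described above, can be sketched in a few lines using the classic repeated-endpoints trick: listing each node once per incident edge makes a uniform draw from that list a degree-proportional choice (function names and parameters are ours; each new node attaches with a single edge).

```python
import random

def barabasi_albert(n):
    """Grow a graph by preferential attachment: each new node links to one
    existing node chosen with probability proportional to its degree."""
    endpoints = [0, 1]        # node appears once per incident edge
    edges = [(0, 1)]          # start from a single edge between nodes 0 and 1
    for new in range(2, n):
        target = random.choice(endpoints)   # degree-proportional choice
        edges.append((new, target))
        endpoints += [new, target]
    return edges

random.seed(4)
g = barabasi_albert(10000)
degrees = {}
for a, b in g:
    degrees[a] = degrees.get(a, 0) + 1
    degrees[b] = degrees.get(b, 0) + 1
print(max(degrees.values()))  # hubs emerge: far above the average degree (~2)
```

The highly skewed degree distribution produced this way is exactly what makes aggregation slower on scale-free graphs than on uniform random graphs: low-degree leaf nodes exchange values with the rest of the network only rarely.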
Let us compare all the topologies described above. Figure 3.5 illustrates the performance of aggregation for different topologies by plotting the average convergence factor over a period of 20 cycles, for network sizes ranging from 10^2 to 10^6 nodes. Figure 3.6 provides additional details. Here, the network size is fixed at 10^5 nodes. Instead of displaying the average convergence factor, the curves illustrate the actual variance reduction (values are normalized so that the initial variance for all cases is 1) for the same set of topologies. We can conclude that performance is independent of network size for all topologies, while it is highly sensitive to the topology itself. Furthermore, the convergence factor is constant as a function of time (cycle), that is, the variance is decreasing exponentially, with non-random topologies being the only exceptions.
Figure 3.5: Average convergence factor computed over a period of 20 cycles in networks of varying size. Each curve corresponds to a different topology, where W-S(β) stands for the Watts-Strogatz model with parameter β.
Figure 3.6: Variance reduction for a network of 10^5 nodes. Results are normalized so that all experiments result in unit variance initially. Each curve corresponds to a different topology, where W-S(β) stands for the Watts-Strogatz model with parameter β.
Figure 3.7: Convergence factor for NEWSCAST graphs as a function of parameter c. The dotted line corresponds to the theoretical convergence factor for peer selection through random choice: 1/(2√e) ≈ 0.303.
Dynamic Topologies
From the above results, it is clear that aggregation convergence benefits from increased
randomness of the underlying overlay network topology. Furthermore, in dynamic systems, there must be mechanisms in place that preserve this property over time. To achieve
this goal, we propose to use NEWSCAST, described in Section 2.2.4 (see Algorithm 8). Recall that in NEWSCAST the overlay is generated by a continuous exchange of neighbor sets, where each element consists of a node identifier and a timestamp. These sets
have a fixed size, which will be denoted by c. After an exchange, participating nodes
update their neighbor sets by selecting the c node identifiers (from the union of the two
sets) that have the freshest timestamps. Nodes belonging to the network continuously
inject their identifiers in the network with the current timestamp, so old identifiers are
gradually removed from the system and are replaced by newer information. This feature allows the protocol to “repair” the overlay topology by forgetting information about
crashed neighbors, which by definition cannot inject their identifiers.
Figure 3.7 shows the performance of aggregation over a NEWSCAST network of 10^5 nodes, with c varying between 2 and 50. From these experimental results we conclude that choosing c = 30 is already sufficient to obtain fast convergence for aggregation. Furthermore, this same value for c is sufficient for very stable and robust connectivity (see Chapter 2). Figures 3.5 and 3.6 provide additional evidence that applying NEWSCAST with c = 30 already results in performance very similar to that of a random network.
3.4.5 Cost Analysis
Both the communication cost and time complexity of our scheme follow from properties
of the aggregation protocol and are inversely related. The cycle length ∆ defines the
time complexity of convergence. Choosing a short ∆ will result in proportionally faster
convergence but higher communication costs per unit time.
As we have seen earlier, if the overlay is sufficiently random then the number of
exchanges for any fixed node in ∆ time units can be described by the random variable
1 + φ′ where φ′ has a Poisson distribution with parameter 1. Thus, on the average, there
are two exchanges per node (one initiated by the node and the other one coming from
another node), with a very low variance. Based on this distribution, parameter ∆ must be
selected to guarantee that, with very high probability, each node will be able to complete
the expected number of exchanges before the next cycle starts. Failing to satisfy this
requirement results in a violation of our theoretical assumptions.
Similarly, parameter γ (the epoch length) must be chosen appropriately, based on the desired accuracy of the estimate and the convergence factor a∗ characterizing the overlay network. After γ cycles, we have E(σ²(γ))/σ²(0) = a∗^γ. If ε is the desired accuracy of the final estimate, then γ ≥ log_{a∗} ε. Note that a∗ is independent of N, so the time complexity of reaching a given average precision is O(1).
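As a worked example of this bound, take the getPair_distr factor a∗ = 1/(2√e) ≈ 0.303 and a target accuracy of ε = 10^{−10} (an illustrative value chosen by us):

```python
import math

a_star = 1 / (2 * math.sqrt(math.e))   # convergence factor, ~0.303
eps = 1e-10                            # desired accuracy of the estimate
gamma = math.log(eps) / math.log(a_star)
print(math.ceil(gamma))  # 20 cycles per epoch suffice
```

That is, about 20 cycles reduce the variance by ten orders of magnitude, which is consistent with the empirical behaviour shown in Figure 3.3.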
3.5 Aggregation Beyond Averaging
In this section we give several examples of gossip-based aggregation protocols to calculate
different aggregates. With the exception of minimum and maximum calculation, they are
all built on averaging. We also briefly discuss the question of dynamic queries.
3.5.1 Examples of Supported Aggregates
Minimum and maximum
To obtain the maximum or minimum value among the values maintained by all nodes,
method UPDATE(a, b) of the generic scheme of Algorithm 9 must return max(a, b) or
min(a, b), respectively. In this case, the global maximum or minimum value will be effectively broadcast like an epidemic. Well-known results about epidemic broadcasting [20]
are applicable.
Generalized means
We formulate the general mean of a vector of elements x = (x_1, . . . , x_N) as

f(x) = g^{−1}( (1/N) Σ_{i=1}^{N} g(x_i) ),    (3.21)

where function f is the mean function and function g is an appropriately chosen local function to generate the mean. Well-known examples include g(x) = x, which results in the average; g(x) = x^n, which defines the nth power mean (with n = −1 being the harmonic mean, n = 2 the quadratic mean, etc.); and g(x) = ln x, resulting in the geometric mean (the Nth root of the product). To compute the above general mean, UPDATE(a, b) returns g^{−1}[(g(a) + g(b))/2]. After each exchange the value of f remains unchanged, but the variance over the set of values decreases, so that the local estimates converge toward the general mean.
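The UPDATE rule for generalized means can be sketched directly from the formula above (function names are ours; shown here for the quadratic mean, g(x) = x²):

```python
import math

def update(a, b, g, g_inv):
    """One exchange step for the generalized mean defined by g:
    both nodes adopt g_inv of the average of g(a) and g(b)."""
    m = g_inv((g(a) + g(b)) / 2.0)
    return m, m

# quadratic mean of two estimates: g(x) = x^2, g_inv = sqrt
a, b = update(3.0, 4.0, lambda x: x * x, math.sqrt)
print(a)  # sqrt((9 + 16) / 2) = sqrt(12.5) ~ 3.5355
```

The invariant is that the quantity f computed over all estimates is unchanged by the exchange, since the sum of the g-transformed values is conserved.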
Variance and other moments
In order to compute the nth raw moment, that is, the average of the nth powers x_i^n of the original values, we need to initialize the estimates with the nth power of the local value at each node and simply calculate the average. To calculate the nth central moment, that is, the average of (x_i − μ)^n where μ is the global average, we can calculate all the raw moments in parallel up to the nth and combine them appropriately, or we can proceed in two sequential steps, first calculating the average and then the appropriate central moment. For example, the variance, which is the 2nd central moment, can be approximated as the average of the x_i² minus the square of the average of the x_i.
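The two-moment identity for the variance is easy to check on a concrete (centralized) example; in the gossip setting the two means below would each be computed by a separate averaging instance (the sample values are ours):

```python
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = sum(values) / len(values)                    # first raw moment
mean_sq = sum(v * v for v in values) / len(values)  # second raw moment
var = mean_sq - mean * mean                         # variance from the two means
print(var)  # 4.0 for this sample
```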
Counting
We base counting on the observation that if the initial distribution of local values is such
that exactly one node has the value 1 and all the others have 0, then the global average is
exactly 1/N and thus the network size, N, can be easily deduced from it. We will use this
protocol, which we call COUNT, in our experiments.
Using a probabilistic approach, we suggest a simple and robust implementation of this
scheme without any need for leader election: we allow multiple nodes to randomly start
concurrent instances of the averaging protocol, as follows. Each concurrent instance is led by a different node. Messages and data related to an instance are tagged with a unique
identifier (e.g., the address of the leader). Each node maintains a map M associating a
leader identifier with an average estimate. When nodes i and j maintaining the maps Mi
and Mj perform an exchange, the new map M (to be installed at both nodes) is obtained
by merging Mi and Mj in the following way:
M = {(l, x_i/2) | x_i = M_i(l) ∧ l ∉ D(M_j)}
  ∪ {(l, x_j/2) | x_j = M_j(l) ∧ l ∉ D(M_i)}
  ∪ {(l, (x_i + x_j)/2) | x_i = M_i(l) ∧ x_j = M_j(l)},    (3.22)

where D(M) corresponds to the domain (key set) of map M and x_i is the current estimate of node i. In other words, if the average estimate for a certain leader is known to only one node out of the two nodes that participate in an exchange, the other node is considered to have an estimate of 0.
Maps are initialized in the following way: if node l is a leader, the map is equal to
{(l, 1)}, otherwise the map is empty. All nodes participate in the protocol described in
the previous section. In other words, even nodes with an empty map perform random exchanges. An alternative approach, where only nodes with a non-empty map perform exchanges, would be less effective in the initial phase, when only a few nodes have non-empty maps.
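The merge rule of Equation (3.22) reduces to a few lines once a missing entry is treated as an estimate of 0 (function and variable names are ours):

```python
def merge(Mi, Mj):
    """Merge two leader -> estimate maps as in Equation (3.22):
    an entry missing from one map counts as an estimate of 0."""
    return {l: (Mi.get(l, 0.0) + Mj.get(l, 0.0)) / 2.0
            for l in set(Mi) | set(Mj)}

# leader 'a' known to one node only, leader 'b' known to both
Mi = {'a': 1.0, 'b': 0.5}
Mj = {'b': 0.25}
M = merge(Mi, Mj)
print(M)  # a -> 0.5, b -> 0.375
```

Note that mass is conserved per leader: the sum of the two nodes' estimates for each leader is unchanged by the merge, which is exactly what makes the 1/N average (and hence the COUNT estimate) correct.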
Clearly, the number of concurrent protocols in execution must be bounded, to limit
the communication costs involved. A simple mechanism that we adopt is the following.
At the beginning of each epoch, each node may become leader of a run of the aggregation
protocol with probability Plead . At each epoch, we set Plead = C/N̂ , where C is the desired
number of concurrent runs and N̂ is the estimate obtained in the previous epoch. If the system size does not change dramatically within one epoch, then this solution ensures that the number of concurrently running protocols will be approximately Poisson distributed with the parameter C.
Sums and products
Two concurrent aggregation protocols are run, one to estimate the size of the network, the
other to estimate the average or the geometric mean, respectively. The size and the means
together can be used to compute the sum or the product of the initial local values.
Rank statistics
Although the examples presented above are quite general, certain statistics appear to be
difficult to calculate in this framework. Statistics that have a definition based on the index of values in a global ordering (often called rank statistics) fall into this category.
While certain rank statistics like the minimum and maximum (see above) can be calculated easily, others, including the median, are more difficult. In our previous work we
have proposed protocols for this purpose as well [13].
An example application: Naive Bayes
A central problem in data mining is classification. We assume that every node i has a
training data set containing training samples. One sample consists of a feature vector
x = (x1 , . . . , xr ) and a label y. Let us assume that both the features xi and y have a
small discrete domain (indeed, for example, even binary features and binary labels are
common). One wants to build a classification procedure that assigns labels to new observations (feature vectors) that are not labeled. This classification procedure might have
the form of a decision tree, a regression formula, a description of a joint probability distribution, etc. [94]. Here, we will focus on a very simple, yet powerful, classification
procedure called Naive Bayes.
The Naive Bayes procedure finds the maximum a posteriori (MAP) estimate

y_MAP = arg max_y p(y|x)    (3.23)

with the help of some empirical probabilities that are easy to find. Indeed, if we assume that attributes are conditionally independent with respect to the class attribute (a naive assumption, hence the name Naive Bayes), the probabilities p(y|x) can be expressed in terms of p(x_i|y) and p(y):

p(y|x) = p(y)p(x|y) / p(x) ≈ p(y) Π_i p(x_i|y) / p(x).    (3.24)

The term p(x) is not needed, since it is constant for a given feature vector, so it does not change the MAP estimate y_MAP.
Using our averaging protocol, we now need to calculate the average number of training samples (n), the average number of training samples that have y as label (n_y), and the average number of training samples that have feature x_i and label y (n_{x_i,y}). Now, each node can estimate

p(y) ≈ n_y / n,   p(x_i|y) ≈ n_{x_i,y} / n_y.    (3.25)

Using these estimates, the MAP estimate y_MAP can be calculated. (Note that n_y is redundant, but it can nevertheless help in increasing stability.)
3.5.2 Dynamic Queries
Although here we target applications where the same query is calculated continuously
and proactively in a highly dynamic large network, having a fixed query is not an inherent
limitation of the approach. The aggregate value being calculated is defined by method
UPDATE and the semantics of the state of the nodes (the parameters of method UPDATE).
These components can be changed throughout the system at any time, using for example
an extension of the restarting technique discussed in Section 3.4, where in a new epoch
not only the start of the new epoch is being propagated through gossip but a new query as
well.
Typically, our protocol will provide aggregation service for an application. The exact
details of the implementation of dynamic queries (if necessary) will depend on the specific environment, taking into account efficiency and performance constraints and possible
sources of new queries.
3.6 Theoretical Results for Benign Failures
3.6.1 Crashing Nodes
The result on convergence discussed in Section 3.3 is based on the assumption that the overlay network is static and that nodes do not crash. In fact, in a dynamic environment there may be significant churn, with nodes coming and going continuously.
In this section we present results on the sensitivity of our protocols to dynamism of the
environment.
Our failure model is the following. Before each cycle, a fixed proportion, say Pf ,
of the nodes crash (recall that we do not distinguish between nodes leaving the network
voluntarily and those that crash). Given N nodes initially, Pf N nodes are removed. We
assume crashed nodes do not recover. Note that considering crashes only at the beginning
of cycles corresponds to a worst-case scenario since the crashed nodes render their local
values inaccessible when the variance among the local values is at its maximum. In other
words, the more times a node communicates with other nodes, the better it approximates
the correct global average (on average), so removing it at a later stage does not disturb
the end result as much as removing it at the beginning. Also recall that we are interested
in the average at the beginning of the current epoch as opposed to the real-time average
(see Section 3.4.1).
Let us begin with some simple observations. In our failure model the convergence
factor will stay the same independently of Pf since the failure model is completely blind
(there is no bias towards removing larger or smaller values), and the convergence factor
does not depend on the network size N (as long as N is large). However, the average will
now become a random variable that depends on Pf , since the mass conservation property
no longer holds. Again, due to symmetry, it is trivial to see that the expectation of the
average will not change (we still assume that it is zero). So, to characterize the expected
error of the approximation of the average, we consider the variance of the mean Var(µ(t)),
where

    µ(t) = (1/N(t)) · Σ_{i=1}^{N(t)} xi(t),    (3.26)

and N(t) = (1 − Pf)^t N. We will describe Var(µ(t)) as a function of Pf.
CHAPTER 3. AVERAGE CALCULATION
Proposition 3.6.1. Let us assume that the convergence factor is a∗ and algorithm AVG is
symmetric to permutation. Then µ(t) has a variance

    Var(µ(t)) ≈ (Pf σ²(0) / ((1 − Pf)N)) · (1 − (a∗/(1 − Pf))^t) / (1 − a∗/(1 − Pf)).    (3.27)
Proof. Let us take the decomposition µ(t + 1) = µ(t) + dt. Random variable dt is independent of µ(t), because knowing only the average of a set does not provide information
about the statistics of any strict subset (in this case, the subset that is removed) in the
absence of additional prior information. So
    Var(µ(t + 1)) = Var(µ(t)) + Var(dt),    (3.28)

which means that

    Var(µ(t)) = Σ_{j=0}^{t−1} Var(dj).    (3.29)
This allows us to consider only Var(dt) as a function of failures. Note that E(dt) = 0
since E(µ(t)) = E(µ(t + 1)). Then we have

    Var(dt) = E((µ(t) − µ(t + 1))²) ≈ (Pf / ((1 − Pf)N(t))) E(σ²(t))
            = (Pf / (1 − Pf)) · (a∗^t / N(t)) σ²(0)
            = (Pf / (1 − Pf)) · (a∗^t σ²(0) / (N(1 − Pf)^t)),    (3.30)

which gives the desired formula when substituting (3.30) into (3.29). In the first equation
we used the fact that E(dt) = 0. The approximation is then the result of elementary
calculations, in which we ignored the terms of the form a xi xj (i ≠ j). Although here
we do not formally quantify the error we make by ignoring these terms (instead, we
perform an experimental validation), the considerations in the proofs of Propositions 3.3.1
and 3.3.5 strongly suggest that the error is not large.
The results of simulations with N = 10^5 to validate this analysis are shown in Figure 3.8. For each value of Pf, the empirical data is based on 100 independent experiments,
whereas the prediction is obtained from (3.27) with a∗ = 1/(2√e). The empirical data
fits the prediction nicely. Note that the largest value of Pf examined was 0.3, which means
that in each cycle almost one third of the nodes is removed. This already represents an extremely severe scenario. See also Section 3.7.1, where we present additional experimental
analysis using NEWSCAST.
If a∗ > 1 − Pf then the variance is unbounded and grows with the cycle index;
otherwise it is bounded. Also note that increasing the network size decreases the variance of
the approximation µ(t). This is good news for scalability: the larger the network, the
more stable the approximation becomes.
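As an illustration, the prediction (3.27) is cheap to evaluate numerically. A minimal sketch, assuming the convergence factor a∗ = 1/(2√e) proven for Algorithm 9 (the helper name is hypothetical):

```python
import math

A_STAR = 1 / (2 * math.sqrt(math.e))  # convergence factor of Algorithm 9

def predicted_variance(t, p_f, n, sigma2_0, a_star=A_STAR):
    """Right-hand side of (3.27): predicted Var(mu(t)) when a fraction
    p_f of the nodes crashes before each cycle."""
    r = a_star / (1 - p_f)              # ratio of the geometric series
    partial_sum = (1 - r**t) / (1 - r)  # finite for any t; bounded iff r < 1
    return p_f * sigma2_0 / ((1 - p_f) * n) * partial_sum

# The variance stays bounded in t exactly when a* < 1 - p_f:
print(1 - A_STAR)  # about 0.697: the critical crash rate for boundedness
print(predicted_variance(t=20, p_f=0.3, n=10**5, sigma2_0=1.0))
```

With a∗ ≈ 0.303, the variance remains bounded for any realistic Pf well below roughly 0.7, consistent with the severity of the Pf = 0.3 scenario examined above.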
3.6.2 Link Failures
In a realistic system, links fail in addition to nodes crashing. This represents another
important source of error, although we note that from our point of view node crashes are
more important because we model leaves as crashes, so in the presence of churn crash
events dominate all other types of failure.
[Figure: Var(µ(20))/σ²(0) plotted against Pf, comparing the fully connected topology, NEWSCAST, and the theoretical prediction.]
Figure 3.8: Effects of node crashes on the variance of the average estimates at cycle 20.
Let us adopt a failure model in which an exchange is performed only with probability
1 − Pd , that is, each link between any pair of nodes is down with probability Pd . This
model is adequate because we focus on short-term link failures. For long-term failures
it is not sufficient to model failure as a probability, and long-term failures can hardly be
modeled as independent either. Besides, long-term link failure in an overlay network
means long-term partitioning in the underlying physical network (because if the physical network were connected, the routing service could normally still function), and thus
the overlay network is also partitioned. In such a partitioned topology our protocol will
simply calculate an aggregate value local to each partitioned cluster.
In Section 3.3.2 it was proven that a∗ = 1/e (where a∗ is the convergence factor) if
we assume that during a cycle, for each particular variance reduction step, each pair of
nodes has an equal probability of performing that step. For the protocol described in
Algorithm 9 we have proven that a∗ = 1/(2√e). For this protocol the uniform randomness
assumption does not hold, since the protocol guarantees that each node participates in at
least one variance reduction step—the one initiated actively by the node. In the random
model, however, it is possible for example that a node does not participate in a given
cycle at all.
Consider that a system model with Pd > 0 is very similar to a model in which Pd = 0
but which is “slower” (fewer pairwise exchanges are performed in a unit time interval). In
the limit case when Pd is close to 1, the uniform randomness assumption described above
(when a∗ = 1/e) is fulfilled with high accuracy.
This motivates our conclusion that the performance can be bounded from below by
the model where Pd = 0 and a∗ = 1/e instead of 1/(2√e), and which is 1/(1 − Pd) times
slower than the original system in terms of wall clock time. That is, the upper bound on
the convergence factor can be expressed as

    a∗_d = (1/e)^(1−Pd) = e^(Pd−1).    (3.31)
Since the factor 1/e is not significantly worse than 1/(2√e), we can conclude that practically only a proportional slowdown of the system is observed. In other words, link failures
do not result in any loss of approximation quality or increased unreliability.
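A small numerical sketch of the bound (3.31) (the helper names are hypothetical) makes the proportional slowdown explicit:

```python
import math

def factor_bound(p_d):
    """Upper bound (3.31) on the per-cycle convergence factor when each
    pairwise exchange fails independently with probability p_d."""
    return math.exp(p_d - 1.0)

def cycles_needed(target_reduction, p_d):
    """Cycles needed to shrink the variance by target_reduction under the
    bound; only the wall-clock time grows, by roughly 1/(1 - p_d)."""
    return math.log(target_reduction) / math.log(factor_bound(p_d))

print(factor_bound(0.0))         # 1/e, about 0.368
print(cycles_needed(1e-10, 0.0)) # about 23 cycles
print(cycles_needed(1e-10, 0.5)) # exactly twice as many
```

Halving the exchange success rate doubles the number of cycles needed for a given accuracy, but does not degrade the quality of the final estimate.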
3.6.3 Conclusions
We have examined two sources of random failures: node crashes and link failures. In
the case of node crashes, the relationship was given between the proportion of failing
nodes and the expected loss in accuracy of the average estimation. We have seen that the
protocol can tolerate a relatively large number of node crashes and still provide reasonable
estimates. We have also shown that performance degrades gracefully with increasing link
failure probability.
3.7 Simulation Results for Benign Failures
To complement the theoretical analysis, we have performed numerous experiments based
on simulation. In all experiments, we used NEWSCAST as the underlying overlay network
to implement function SELECTPEER in Algorithm 9. As a result, we need no unrealistic
assumptions about the amount of information available at the nodes locally.
Furthermore, all our experiments were performed with the COUNT protocol since it is
the aggregation example that is most sensitive to failures (both node crashes and message
omissions) and thus represents a worst case. During the first few cycles of an epoch, when
only a few nodes have a local estimate other than 0, their removal from the network due
to failures can cause the final result of COUNT to diverge significantly from the actual
network size.
All of the experimental results were obtained through PEERSIM, a simulator developed by
us and optimized for aggregation protocols [61, 66]. Unless stated otherwise, all simulations are performed on networks composed of 10^5 nodes. We do not present results for
different network sizes since they display similar trends (as predicted by our theoretical
results and confirmed by Figure 3.5).
The size of the neighbor sets maintained and exchanged by the NEWSCAST protocol is
set to 30. As discussed in Section 3.4.4, this value is large enough to result in convergence
factors similar to those of random networks; furthermore, as our experiments confirm, the
overlay network maintains this property also in the face of the node crash scenarios we
examined. Unless explicitly stated, the size estimates and the convergence factor plotted
in the figures are those obtained at the end of a single epoch of 30 cycles. In all figures,
50 individual experiments were performed for all parameter settings. When the result of
each experiment is shown in a figure (e.g., as a dot) to illustrate the entire distribution, the
x-coordinates are shifted by a small random value so as to separate results having similar
y-coordinates.
3.7.1 Node Crashes
The crash of a node may have several possible effects. If the crashed node had a value
smaller than the actual global average, the estimated average (which should be 1/N)
will increase and consequently the reported size of the network N will decrease. If the
crashed node had a value larger than the average, the estimated average will decrease and
consequently the reported size of the network N will increase.
The effects of a crash are potentially more damaging in the latter case. The larger
the removed value, the larger the estimated size. At the beginning of an epoch, relatively
large values are present, obtained from the first exchanges originated by the initial value
1. These observations are confirmed by Figure 3.9, which shows the effect of the “sudden
death” of 50% of the nodes in a network of 10^5 nodes at different cycles of an epoch.

[Figure: per-experiment size estimates (/10^5) as a function of the cycle at which the nodes crash.]

Figure 3.9: Network size estimation with protocol COUNT where 50% of the nodes crash
suddenly. The x-axis represents the cycle of an epoch at which the “sudden death” occurs.
Note that in the first cycles, the effect of crashing may be very harsh: the estimate can
even become infinite (not shown in the figure), if all nodes having a value different from 0
crash. However, around the tenth cycle the variance is already so small that the damaging
effect of node crashes is practically negligible.
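The mechanics behind these observations can be sketched in a few lines. This is a toy model with idealized uniform peer sampling, not the PEERSIM implementation used in our experiments:

```python
import math
import random

def count_protocol(n, cycles=30, seed=7):
    """Toy version of COUNT: the initiator starts at 1, everyone else at 0;
    push-pull averaging drives each local value toward 1/n, so the inverse
    of a local value estimates the network size n."""
    rng = random.Random(seed)
    values = [0.0] * n
    values[0] = 1.0  # the node that initiates the epoch
    for _ in range(cycles):
        for i in range(n):
            j = rng.randrange(n)  # idealized uniform peer sampling
            values[i] = values[j] = (values[i] + values[j]) / 2
    # A node still holding exactly 0 would report an infinite size,
    # mirroring the extreme case mentioned above.
    return [1.0 / v if v > 0 else math.inf for v in values]

estimates = count_protocol(1000)
print(min(estimates), max(estimates))  # both very close to 1000
```

Crashing a node holding one of the early large values in this toy model removes a disproportionate share of the total mass, which is exactly why the first cycles are the most vulnerable.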
A more realistic scenario is a network subject to churn. Figure 3.10 illustrates the
behavior of aggregation in such a network. Churn is modeled by removing a number of
nodes from the network and substituting them with new nodes at each cycle. According
to the protocol, the new nodes do not participate in the ongoing approximation epoch.
However, this scenario is not fully equivalent to a continuous node crashing scenario, because these new nodes do participate in the NEWSCAST network and so they are contacted
by participating nodes. These contacts are refused by the new nodes which results in an
additional effect similar to link failure.
The size of the network is constant, while its composition is dynamic. The plotted
dots correspond to the average estimate computed over all nodes that still participate in
the protocol at the end of a single epoch (30 cycles), that is, that were originally part of the
system at the start of the epoch. Note that although the average estimate is plotted over
all nodes, in cycle 30 the estimates are practically identical as Figure 3.6 confirms. Also
note that 2,500 nodes crashing in a cycle means that 75% of the nodes ((30 × 2500)/10^5)
are substituted during the epoch, leaving 25% of the nodes that make it until the end of
the epoch.
The figure demonstrates that (even when a large number of nodes are substituted during an epoch) most of the estimates fall within a reasonable range. This is consistent
with the theoretical result discussed in Section 3.6.1, although in this case we have an additional source of error: nodes are not only removed but replaced by new nodes. While the
new nodes do not participate in the epoch, they result in an effect similar to link failure,
as new nodes will refuse all connections that belong to the currently running epoch. However, the variance of the estimate continues to be described by the results in Section 3.6.1
because, according to Sections 3.6.2 and 3.7.2, link failures do not change the estimate
but only slow down convergence. Since an epoch lasts 30 cycles, this is enough time for
convergence even at the highest substitution rate. See also Figure 3.8 for the variance
of the estimates plotted against the theoretical prediction.

[Figure: per-experiment size estimates (/10^5) as a function of the number of nodes substituted per cycle.]

Figure 3.10: Network size estimation with protocol COUNT in a network of constant size
subject to churn. The x-axis is the churn rate, which corresponds to the number of nodes
that crash at each cycle and are substituted by the same number of new nodes.
The above experiment can be considered a worst-case analysis, since the level of
churn was much higher than could be expected in a realistic scenario, considering that
an epoch lasts for a relatively short time. We have repeated our experiments on the well-known Gnutella trace described in [58] to validate our results on a more realistic churn
scenario as well. Figure 3.11 illustrates the simulation results. Only a short time window
is shown (where the churn rate is particularly variable) to illustrate the accuracy of the
approach better. We can observe that the approximation is accurate (with a one epoch
delay), and the standard deviation is low as well. In this particular trace, during one
epoch approximately 5% of the nodes are replaced. This is a relatively low rate and as
we have seen earlier, the protocol can withstand much higher churn rates. Note that the
figure illustrates only the fluctuations in the network size as a result of churn and not the
actual churn rate itself.
3.7.2 Link Failures and Message Omissions
Figure 3.12 shows the convergence factor of COUNT in the presence of link failures. As
discussed earlier, in this case the only effect is a proportionally slower convergence. The
theoretically predicted upper bound of the convergence factor (see (3.31)) indeed bounds
the average convergence factor, and—as predicted—it is more accurate for higher values
of Pd .
Apart from link failures that interrupt communication between two nodes in a symmetric way, it is also possible that single messages are lost.

[Figure: estimated and actual network size per epoch over a window of the Gnutella trace (epochs 470–550).]

Figure 3.11: Network size estimation with protocol COUNT in the presence of churn according to a Gnutella trace [58]. 50 experiments were run to calculate statistics (mean and
standard deviation); each epoch consisted of 30 cycles, and each cycle lasted for 10 seconds.

If the message sent to initiate an exchange is lost, the final effect is the same as with
link failure: the entire exchange is lost, and the convergence process is just slowed down.
But if the message lost is the
response to an initiated exchange, the global average may change (either increasing or
decreasing, depending on the value contained in the message).
The effect of message omissions is illustrated in Figure 3.13. The given percentage of
all messages (initiated or response) was dropped. For each experiment, both the maximum
and the minimum estimates over the nodes in the network are shown, represented by
the ends of the bars. As can be seen, when a small percentage of messages are lost,
estimations of reasonable quality can be obtained. Unfortunately, when the number of
messages lost is higher, the results provided by aggregation can be larger or smaller by
several orders of magnitude. In this case, however, it is possible to improve the quality
of estimations considerably by running multiple concurrent instances of the protocol, as
explained in the next section.
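The asymmetry between the two loss cases can be illustrated with a toy exchange (one possible message ordering; illustrative only):

```python
def exchange(a, b, reply_lost=False):
    """Push-pull exchange between two local values. If the reply is lost,
    the responder has already updated but the initiator has not, so the
    global sum (and hence the average) is perturbed."""
    avg = (a + b) / 2
    if reply_lost:
        return a, avg  # initiator keeps its old value
    return avg, avg

a, b = 0.2, 0.8
print(sum(exchange(a, b)))              # 1.0: mass conserved
print(sum(exchange(a, b, True)))        # 0.7: mass changed by (a - b)/2
```

Losing the initiating message simply cancels the whole exchange, whereas losing the reply leaves the pair in an inconsistent state, which is why only the latter distorts the computed average.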
3.7.3 Robustness via Multiple Instances of Aggregation
To reduce the impact of “unlucky” runs of the aggregation protocol that generate incorrect
estimates due to failures, one possibility is to run multiple concurrent instances of the
aggregation protocol. To test this solution, we have simulated a number t of concurrent
instances of the COUNT protocol, with t varying from 1 to 50. At each node, the t estimates
that are obtained at the end of each epoch are ordered. Subsequently, the ⌊t/3⌋ lowest
estimates and the ⌊t/3⌋ highest estimates are discarded, and the reported estimate is given
by the average of the remaining results.
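The combination rule described above is a simple trimmed mean; a possible sketch (the helper name is hypothetical):

```python
def robust_size_estimate(instance_estimates):
    """Combine the estimates of t concurrent COUNT instances as described
    above: order them, discard the floor(t/3) lowest and floor(t/3)
    highest, and average the rest."""
    ordered = sorted(instance_estimates)
    k = len(ordered) // 3
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Example: most instances agree on roughly 10^5, but two epochs were
# ruined by message loss and are off by orders of magnitude.
estimates = [1.02e5, 0.97e5, 1.00e5, 9.0e6, 1.01e5, 2.0e3, 0.99e5, 1.00e5, 1.03e5]
print(robust_size_estimate(estimates))  # close to 1e5, unlike the raw mean
```

Discarding a fixed fraction on both sides keeps the estimator unbiased under symmetric noise while making it insensitive to a minority of arbitrarily wrong instances.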
Figure 3.14 shows the results obtained by applying this technique in a system where
1000 nodes per cycle are substituted with new nodes, while Figure 3.15 shows the results
in a system where 20% of the messages are lost. Recall that even though in the node
crashing scenario the number of nodes participating in the epoch decreases, the correct
estimation is 10^5, as the protocol reports network size at the beginning of the epoch.
[Figure: average convergence factor of COUNT plotted against Pd, together with the theoretical upper bound (3.31).]

Figure 3.12: Convergence factor of protocol COUNT as a function of link failure probability.

The results are quite encouraging; by maintaining and exchanging just 20 numerical
values (resulting in messages of still only a few hundred bytes), the accuracy that may
be obtained is very high, especially considering the hostility of the scenarios tested. It can
also be observed that the estimate is very consistent over the nodes (the bars are short) in
the crash scenario (as predicted by our theoretical results), and using multiple instances
the variance of the estimate over the nodes decreases significantly even in the message
omission scenario, so the estimate is sufficiently representative at every single node.
3.8 Experimental Results on PlanetLab
In order to validate our analytical and simulation results, we implemented the COUNT
protocol and deployed it on PlanetLab [95]. PlanetLab is an open, globally distributed
platform for developing, deploying and accessing planetary-scale network services. At the
time of performing these experiments, more than 170 academic institutions and industrial
research labs were members of the PlanetLab consortium, providing more than 400 nodes
for experimentation.
A summary of the experimental results obtained on PlanetLab is illustrated in Figure 3.16. During the experiment, 300 machines belonging to the PlanetLab testbed were
used. Each machine was running up to 20 virtual nodes, each participating as a distinct
entity. In other words, the maximum size of our emulated network was 6000 virtual
nodes, distributed over five continents. The size of the network was made to oscillate
between 2500 and 6000 nodes during the experiment. Virtual nodes were removed and
added using a central scheduler that randomly picked nodes from the network to produce
the oscillation effect shown in the figure. The number of concurrent protocol instances
was 20 (see Section 3.7.3), and parameter c of NEWSCAST was c = 30. The length of a
cycle is 5 seconds, while the number of cycles in an epoch is 30 (that is, the length of an
epoch is approximately 2.5 minutes). Several experiments were run, all of them starting
at 02:00 Central European Time during workdays. All of them produced results similar
to those shown in the figure. The communication mechanism of our implementation is
[Figure: per-experiment minimal and maximal size estimates (log scale, 10^2 to 10^9) as a function of the fraction of messages lost.]

Figure 3.13: Network size estimation with protocol COUNT as a function of lost messages.
The length of the bars illustrates the distance between the minimal and maximal estimated
size over the set of nodes within a single experiment.
[Figure: per-experiment minimal and maximal size estimates (/10^5) as a function of the number of aggregation instances, under node crashes.]

Figure 3.14: Network size estimation with multiple instances of protocol COUNT. 1000
nodes crash at the beginning of each cycle. The length of the bars corresponds to the
distance between the minimal and maximal estimates over the set of all nodes within a
single experiment.
[Figure: per-experiment minimal and maximal size estimates (/10^5) as a function of the number of aggregation instances, under 20% message loss.]

Figure 3.15: Network size estimation with protocol COUNT as a function of concurrent
protocol instances. 20% of messages are lost. The length of the bars corresponds to the
distance between the minimal and maximal estimates over the set of all nodes within a
single experiment.
[Figure: estimated and actual network size over 50 epochs on PlanetLab, oscillating between roughly 2500 and 6000 nodes.]

Figure 3.16: The estimated size (as provided by COUNT) and the actual size of a network oscillating between 2500 and 6000 nodes (approximately). Standard deviation of
estimated size is displayed using vertical bars.
based on UDP. This choice is motivated by the fact that in a network based on NEWSCAST,
interactions between nodes are short-lived, so establishing a TCP connection is relatively
expensive. On the other hand, the protocol can tolerate message omissions. The observed
message omission rate during our experiments varied between 3% and 8%.
The figure shows two curves, one representing the real size of the network at the
beginning of a given epoch, and the other representing the estimated size, averaged over
all nodes in the network. The (very small) standard deviation of the estimates over all
nodes is also illustrated using vertical bars. These experiments further confirm the validity
and practicality of our mechanisms.
3.9 Related Work
Since our work overlaps with a large number of fields, including gossip-based and epidemic protocols, load balancing, aggregation and network size estimation (in both overlay
and wireless ad hoc networks), we restrict our discussion to the most relevant publications
from each area.
Protocols based on epidemic and gossiping metaphors have found numerous practical applications. Examples include database replication [20] and failure detection [52].
A recently completed survey by Eugster et al. provides an excellent introduction to the
area [59]. Note that our approach applies gossiping only as the communication model
(periodic information exchange with random peers). Strictly speaking, nothing is “gossiped”; the dynamics of the system are closer to a diffusion process. This is why, for
example, theoretical results on epidemic spreading are not directly relevant here.
The load balancing protocol presented in [96] builds on the idea of generating a matching in the network topology and balancing load along the edges in the matching. Although
the basic idea is similar, our work assumes a random overlay network (that we provide
using N EWSCAST) and does not require the communications to take place in a matching
in this network. Recall however that we have shown that the matching is the optimal case
for our protocol; fortunately random pair selection has similar performance as well.
There are a number of general purpose systems for aggregation that offer a database
abstraction (supporting queries about the state of the system) and that are based on structured (typically hierarchical) topologies. Perhaps the best-known example of this approach is Astrolabe [87], and more recently, SDIMS [97]. In these systems a hierarchical
architecture is deployed which reduces the cost of finding the aggregates and enables the
execution of complex database queries. However, maintenance of the hierarchical topology introduces additional overhead, which can be significant if the environment is very
dynamic. Our gossip-based aggregation protocol is substantially different. Although the
class of aggregates that it can compute is fairly general, and dynamic queries can also
be implemented, it is not a general purpose system: it is extremely simple, lightweight,
and targeted for unstructured, highly dynamic environments. Furthermore, our protocol
is proactive: the updated results of aggregation are known to all nodes continuously.
The protocol presented in [91] suggests the so called Grid Box hierarchies to process queries in a structured fashion, which (compared to our protocol) involves increased
message sizes and more complicated (so more vulnerable) execution which involves a
logarithmic number of phases to calculate a single value. On the other hand, the overall
approach is similar in the sense that all nodes are equivalent (run the same algorithm) and
they all learn the end result.
Kempe et al. [34] propose an aggregation protocol similar to ours: it is based on gossiping and is tailored to work on random topologies. The main difference with the present
work is that they consider push-only gossiping mechanisms, which results in a slightly
more complicated (though still very simple) protocol. The complication comes from the
fact that in a push-only approach some nodes attract more “weight” due to their more
central position, so a normalization factor must be tracked as well. Besides, other
difficulties arise in practical settings if the directed graph used to push messages is not
strongly connected. In our case the effective communication topology is undirected so we
need only weak connectivity to allow the protocol to work. Furthermore, their discussion
is limited to theoretical analysis, while we consider the practical details needed for a real
implementation and evaluate their performance in unreliable and dynamic environments
through simulations.
Related work targeted specifically to network size estimation should also be mentioned. A typical approach is to sample some property of the system which is random but
depends on network size and so can be used to apply maximum likelihood estimation or
a similar technique. This approach was followed in [98] in the context of multicasting.
Another, probabilistic and localized technique is described in [99] where a logical ring
is maintained and all nodes estimate network size locally based on the estimates of their
neighbors. Unlike these approaches, our protocol provides the exact size in the absence
of failures (assuming also that size is an integer which limits the necessary numeric precision) with very low cost and the approximation continues to be very accurate in highly
unreliable and dynamic environments.
In principle, aggregation (even in the presence of malicious failures) could be achieved
as follows: nodes run a protocol solving the agreement problem [100] (or the weaker
approximate agreement problem [101, 102]) with their local values as the input. This
suggests that the problems of aggregation and agreement are related. However, agreement
protocols are designed for relatively small scale systems where the main problem is to deal
with Byzantine failure. Agreement protocols are typically round based, requiring each
node to communicate with every other node in a given interval of time (round). While the
problem itself is similar, this approach is clearly not practical in the highly dynamic and
extremely large scale settings we have in mind.
Finally, aggregation is an important problem in wireless and ad hoc networks as well.
For instance, [103] represents a reactive approach where queries are propagated through
the system and the answer propagates back to the source node (see the distinction between
reactive and proactive approaches in the Introduction). The approach introduced in [104]
is similar to ours. It is assumed that the network is a one-hop network (so all nodes can
directly communicate with any other node), and a protocol is described that can manage
the matching process that implements neighbor selection in this environment.
3.10 Conclusions
We have presented a full-fledged proactive aggregation protocol and have demonstrated
several desirable properties including low cost, rapid convergence, robustness and adaptivity to network dynamics through theoretical and experimental analysis.
We proved that in the case of average calculation, the variance of the approximation
of the average decreases exponentially fast, independently of network size. This result
suggests both efficiency and scalability. We demonstrated that the method can be applied
to calculate a number of aggregates beside the average. These include the maximum and
minimum, geometric and harmonic means, network size, sum and product. We proved
theoretically that the protocol is not sensitive to node crashes, which confirms our approach of not introducing a leave protocol, but instead handling leaves as crashes. Link
failures were also shown to only slightly slow down convergence.
The protocol was simulated on top of several different topologies, including random
graphs, the complete graph, small-world networks like the Watts-Strogatz and Barabási-Albert topologies, and a dynamic adaptive unstructured network: NEWSCAST. It was
demonstrated that the protocol is efficient on all of these topologies that have a small
diameter.
We tested the robustness of the protocol in several failure scenarios. We have seen that
very accurate estimates for the aggregate values can be obtained even if 75% of the nodes
crash during the running of the protocol. Furthermore, it was confirmed empirically that
the protocol is unaffected by link failures, which result only in a proportional slowdown
but no loss in accuracy. Effects of single messages being lost are more severe but for
reasonable levels of message loss, the protocol continues to provide highly-accurate aggregate values. Robustness to message loss can be greatly improved by the inexpensive
and simple extension of running multiple instances of the protocol concurrently and calculating the final estimate based on the results of the concurrent instances. For node crashes
and link failures, our experimental results are supported by theoretical analysis. Finally,
the empirical analysis of the protocol was completed with emulations on PlanetLab that
confirmed our theoretical and simulation results.
Chapter 4
Distributed Power Iteration
This chapter serves as our first example of a modular application of gossip components
that work together to solve a relatively complex problem. As we will see, in this application the peer sampling service and gossip-based averaging (described in Chapters 2 and
3, respectively) will both be used along with an asynchronous iteration algorithm.
The problem we tackle is determining the dominant eigenvector of matrices defined by
weighted links in overlay networks. These eigenvectors play an important role in many
peer-to-peer applications. Examples include trust management, importance ranking to
support search, and virtual coordinate systems to facilitate managing network proximity.
Robust and efficient asynchronous distributed algorithms are known only for the case
when the dominant eigenvalue is exactly one. We present a fully distributed algorithm
for a more general case: non-negative square matrices that have an arbitrary dominant
eigenvalue.
The basic idea is that we apply a gossip-based aggregation protocol coupled with
an asynchronous iteration algorithm, where the gossip component controls the iteration
component. The norm of the resulting vector is an unknown finite constant by default;
however, it can optionally be set to any desired constant using a third gossip control component. Through extensive simulation results on artificially generated overlay networks
and real web traces we demonstrate the correctness, the performance and the fault tolerance of the protocol.
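As a point of reference, the centralized computation that this chapter decentralizes can be sketched as follows; the normalization step marked in the code is precisely what the gossip components must replace in a fully distributed setting (an illustrative baseline with made-up example data, not the protocol itself):

```python
import random

def power_iteration(matrix, steps=100, seed=0):
    """Centralized power iteration. Repeatedly apply the matrix and
    renormalize; for a non-negative matrix with a unique dominant
    eigenvalue the vector converges to the dominant eigenvector."""
    rng = random.Random(seed)
    n = len(matrix)
    x = [rng.random() + 0.1 for _ in range(n)]
    for _ in range(steps):
        y = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(v) for v in y)  # global knowledge: the step to gossip
        x = [v / norm for v in y]
    return x

# A small column-stochastic example (dominant eigenvalue exactly 1,
# the special case handled by earlier asynchronous algorithms).
m = [[0.5, 0.4, 0.0],
     [0.5, 0.2, 0.7],
     [0.0, 0.4, 0.3]]
v = power_iteration(m)
print(v)  # the dominant eigenvector, normalized to sum 1
```

Computing the norm requires a global sum over all components, which is exactly the kind of aggregate that the gossip-based averaging of Chapter 3 can provide without central coordination.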
4.1 Introduction
The calculation of the dominant eigenvector of a matrix has long been a fundamental tool
in almost all areas of science. In recent years, eigenvector calculation has found new
and important applications in fully distributed environments such as peer-to-peer (P2P)
overlay networks.
For example, the PageRank algorithm [105] calculates the importance ranking for
hyperlinked web pages. The calculated ranks are given by the dominant eigenvector of
a matrix that can be derived from the adjacency matrix of the graph defined by the hyperlinks. Fully distributed algorithms have already been proposed to implement PageRank [106–108]. As another example, trust assignment is a key problem in P2P networks.
In [109] a method was proposed that assigns a global trust value to each peer by calculating the dominant eigenvector of the matrix containing local (pairwise) trust values. Finally, the eigenvectors that belong to the largest few absolute eigenvalues also play a role in esthetic low-dimensional graph layout [110]. This application is relevant in virtual coordinate assignment, which makes it possible to map the actual delays among all pairs of nodes onto distances in n-dimensional Euclidean space [111].
Motivated by these applications, and firmly believing that new ones will keep emerging, we identify fully distributed eigenvector calculation as an important P2P service that
should be studied in its own right.
The environments we target impose special requirements. We assume that there is a large number of nodes, that the connections are volatile and unreliable, and that the eigenvector needs to be continuously updated and maintained in a decentralized way. Communication
is implemented through message passing, where messages can be dropped or delayed.
However, nodes have access to a local clock that measures the passage of real time with
a reasonable accuracy. We do not assume that the local clocks at different nodes are
synchronized.
In this model, we propose a protocol that involves three components. The first is
an instantiation of the asynchronous iteration model described in [112]. This algorithm
requires that the dominant eigenvalue is exactly one. We extend this protocol with a
gossip-based control component that allows the iteration algorithm to converge even if
the dominant eigenvalue is less than or greater than one. A third gossip component can be
applied to explicitly control the exact value of the vector norm (which is an unspecified
finite value without this third component).
These extensions make the asynchronous iteration robust to dynamic change and errors. Traditional methods are very sensitive to the dominant eigenvalue being exactly one: the slightest deviation results in misbehavior in the long run. Besides, the protocol is able
to implement algorithms that assume a dominant eigenvalue different from one. A recent
promising example is a ranking method using unnormalized web-graphs [113].
We demonstrate the correctness, the performance and the fault tolerance of the protocol through extensive simulation results on artificially generated overlay networks and
real web traces.
4.2 Chaotic Asynchronous Power Iteration
Given a square matrix A, vector x is an eigenvector of A with eigenvalue λ, if Ax = λx.
Vector x is a dominant eigenvector if no other eigenvector has an eigenvalue larger than |λ| in absolute value. In this case λ is a dominant eigenvalue and |λ| is the spectral radius of A.
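For reference, the classical centralized power iteration that this chapter distributes can be sketched in a few lines of Python (a plain illustration under our own choice of example matrix, not part of the dissertation's protocol):

```python
def power_iteration(A, iterations=100):
    """Centralized power iteration: the direction of x converges to the
    dominant eigenvector of the square matrix A (a list of rows)."""
    n = len(A)
    x = [1.0] * n  # any positive start vector works for non-negative A
    for _ in range(iterations):
        # one matrix-vector multiplication: y = A x
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # normalize so the iterates neither blow up nor vanish;
        # at convergence, norm approximates the dominant eigenvalue
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    return x

# 2x2 example: dominant eigenvalue (5 + sqrt(5))/2 with
# max-normalized eigenvector (0.618..., 1)
A = [[2.0, 1.0],
     [1.0, 3.0]]
x = power_iteration(A)
```

The normalization step is exactly what becomes nontrivial in a distributed setting: no node can compute a global norm locally, which motivates the gossip-based normalization introduced later in this chapter.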
We concentrate on the abstract problem of calculating the dominant eigenvector of
a weighted neighborhood matrix of some large network, in a fully distributed way. By
“fully distributed” we mean the worst case, when the elements of the vector are held by
individual network nodes, one vector element per one node. The matrix A is defined by
physical or overlay links between the network nodes, more precisely, the weights assigned
to these links: let matrix element Aij be the weight of the link from node j to node i. If
there is no link from j to i then Aij = 0.
In [112], Lubachevsky and Mitra present a chaotic asynchronous family of message passing algorithms to calculate the dominant eigenvector of a non-negative irreducible matrix that has a spectral radius of one. Algorithm 11 shows an instantiation of this framework that we will apply here.
In the algorithm, the values xi represent the elements of the vector that converges to
the dominant eigenvector. The values bki are buffered incoming weighted values from
incoming neighbors in the graph. These values are not necessarily up-to-date, but, as
Algorithm 11 Asynchronous iteration executed at node i.
1: loop
2:     wait(∆)
3:     for each j ∈ out-neighbors_i do
4:         send weight A_ji x_i to j
5:     b_i ← Σ_{k ∈ in-neighbors_i} b_ki
6:     x_i ← b_i

7: procedure ONWEIGHT(m)
8:     k ← m.sender
9:     b_ki ← m.x
shown in [112], the only assumption about message failure is that there is a finite upper
bound on the age of these values. The age of value bki is defined by the time that elapsed
since k sent the last update successfully received by i. This bound can be very large, so
delays and message drop are tolerated extremely well. In addition, the values bki have to
be initialized to be positive.
In dynamic scenarios, when nodes or network links are added or removed, the algorithm is still functional. Temporary node failures, churn, and link failures are all regarded
as message failures, and are therefore covered by the assumption of the finite upper bound
on update delay. Permanent changes can be dealt with as well: after the change the vector
will start converging to the new eigenvector, provided simple measures are taken to make
sure nodes remove dead links and take new ones into consideration.
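A synchronous, single-process simulation of Algorithm 11 can be sketched as follows (Python; the lock-step scheduling and lossless delivery are simplifying assumptions of this sketch — the actual protocol tolerates message delays and drops up to a finite bound):

```python
def iteration_round(A, x, b):
    """One ∆ period of Algorithm 11, simulated in lock step.
    A[i][j] is the weight of the link from node j to node i;
    b[i] maps each in-neighbor k to the last value received from k."""
    n = len(A)
    # active thread, lines 3-4: node i sends A_ji * x_i to out-neighbor j
    for i in range(n):
        for j in range(n):
            if A[j][i] != 0.0:
                b[j][i] = A[j][i] * x[i]  # delivered via ONWEIGHT at j
    # lines 5-6: sum the buffered values and overwrite the local element
    for i in range(n):
        x[i] = sum(b[i].values())

# column-stochastic random walk matrix on the complete graph K3:
# spectral radius 1, dominant eigenvector is uniform
A = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
x = [1.0, 2.0, 3.0]
b = [{k: 1.0 for k in range(3) if A[i][k] != 0.0} for i in range(3)]
for _ in range(80):
    iteration_round(A, x, b)
```

Since this example matrix is column-stochastic, the sum of the vector elements is preserved and the iterates converge to the uniform vector; with a spectral radius other than one, the same loop would blow up or vanish, which is precisely what Section 4.3 addresses.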
4.3 Adding Normalization
Let λ1 be a dominant eigenvalue of A. We can assume that λ1 ≥ 0 since A was assumed to
be non-negative. The asynchronous method described above is known to work correctly
if λ1 = 1, but if λ1 > 1 or λ1 < 1, then the vector elements will grow indefinitely or tend
to zero, respectively. This motivates us to propose a control component that continuously
approximates the average growth rate of the vector elements, and normalizes each updated component with this value. Note that after we achieve convergence, the growth rate
of every single vector element becomes λ1 . This suggests that approximating the global
average using local, limited information is a viable plan.
We adopt the gossip protocol described in Chapter 3 to approximate the average growth rate. More precisely, we will use this algorithm to approximate the geometric mean of the local growth rates b_i^{(m+1)}/x_i^{(m)} over all nodes i, where b_i^{(m+1)} is the value calculated in line 5 of Algorithm 11 and x_i^{(m)} is the value of x_i before executing line 6. The geometric mean is a more natural choice since we average multiplicative factors (growth rates).
The averaging protocol in Algorithm 9 is run by all nodes in parallel with the distributed power iteration. As for notation, let the gossip period of the averaging protocol be ∆_r and let r_i be the current approximation of the average at node i. As a result of the protocol, at all nodes these approximations quickly converge to the average of the initial values of the local approximations. The protocol relies on a peer sampling service that returns a random node from the system. We use NEWSCAST to implement this service; a detailed description can be found in Section 2.2.4.
To calculate the geometric mean, each node i, when updating x_i, overwrites the local approximation of the growth rate with the logarithm of the locally observed growth rate of the vector element held by the node. That is, node i sets r_i = log(b_i^{(m+1)}/x_i^{(m)}). The approximation of the growth rate is therefore e^{r_i(t)} at node i at time t. This value is used to normalize b_i, that is, we replace line 6 by x_i = b_i / e^{r_i(t)} in the active thread of the iteration algorithm.
A cycle length ∆_r < ∆ is chosen so that a sufficiently accurate average is calculated, in spite of the continuous updates of r_i external to the averaging protocol. According to preliminary experiments, setting ∆_r = ∆/5 is already a safe choice on all the problems we examined. This is because, based on the results from Chapter 3, the approximation error decreases exponentially fast; besides, the growth rate is similar at all vector elements, as mentioned before.
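The effect of growth-rate normalization can be illustrated with an idealized sketch in which the gossip component is assumed to have fully converged, so every node knows the exact geometric mean of the local growth rates (Python; collapsing the distributed averaging into one global computation is a simplification of this sketch):

```python
import math

def normalized_round(A, x):
    """One synchronous iteration with growth-rate normalization,
    assuming every node knows the exact geometric mean of the local
    growth rates b_i / x_i (idealized gossip averaging)."""
    n = len(A)
    b = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    # averaging the logarithms yields the log of the geometric mean
    r = sum(math.log(b[i] / x[i]) for i in range(n)) / n
    growth = math.exp(r)  # converges to the dominant eigenvalue
    return [b[i] / growth for i in range(n)], growth

# dominant eigenvalue is 3, so the plain iteration would blow up
A = [[2.0, 1.0],
     [1.0, 2.0]]
x = [1.0, 2.0]
for _ in range(80):
    x, growth = normalized_round(A, x)
```

The normalized iterates stay bounded and converge in direction to the dominant eigenvector, while `growth` converges to the dominant eigenvalue; in the real protocol, `r` is only an approximation maintained by Algorithm 9 with period ∆_r.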
4.4 Controlling the Vector Norm
The iteration component combined with gossip-based normalization is designed to achieve convergence; however, the norm of the converged vector is not known in advance. In some applications this might not be sufficient, since interpreting a single vector element becomes impossible: only relative values carry information. Besides, in scenarios when the matrix A constantly and frequently changes, the vector norm can grow without bounds or tend to zero if it is not explicitly controlled. Finally, knowing a
suitable vector norm makes it possible to implement some algorithms that require global
knowledge. We will describe the random surfer operator of the PageRank algorithm as an
example.
To address these issues, we apply a second gossip component for calculating the measure that we want to keep under control: for example, the maximum or the average of
the absolute value of the vector elements. The calculation of these measures is accomplished by another instance of Algorithm 9, instantiated to calculate both the average and
the maximum of the vector elements xi (see Section 3.5.1).
Let the period of this component be ∆n . The initial values in the normalization gossip
component are updated at the same time when the growth rate gossip component updates
its own initial values, as described in the previous section. It must be noted that in the
case of norm calculation, ∆n = ∆/30 appears to be necessary according to preliminary
experiments, since, unlike growth rates, the vector elements themselves are not guaranteed
to be similar, so we need to achieve very good convergence during a single period of the
iteration algorithm.
Let us now present two examples for the application of the calculated average and
maximum.
4.4.1 Keeping the Vector Norm Constant
Let us now assume that n_i(t) is the approximation of either the maximum or the average of the vector at node i. To push this value towards one, we propose the following heuristic modification of the normalization factor, which introduces a bias towards a vector whose average or maximum, respectively, is one. Intuitively, if n_i(t) is too large, we decrease the local value a little more, and if it is too low, we increase it a little more. More formally, we calculate a factor c as
c = e^{r_i(t)} · ( 0.2 / (1 + 1/n_i(t)) + 0.9 )        (4.1)
and subsequently replace line 6 with x_i = b_i / c. The factor c in (4.1) is a sigmoid function over the logarithm of n_i(t), transformed to have range [0.9, 1.1]. This means that the growth rate approximation is never altered by more than 10%, no matter how far the average is from one.
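The behavior of the factor (4.1) is easy to check numerically (Python sketch; the helper name is ours):

```python
import math

def control_factor(r_i, n_i):
    """Heuristic control factor c of Eq. (4.1): the growth-rate
    estimate e^{r_i} is multiplied by a sigmoid of log(n_i) with
    range (0.9, 1.1), biasing the norm estimate n_i towards one."""
    return math.exp(r_i) * (0.2 / (1.0 + 1.0 / n_i) + 0.9)

# with r_i = 0 the bias term can be read off directly
mid = control_factor(0.0, 1.0)    # n_i = 1: no bias, c = e^{r_i}
low = control_factor(0.0, 1e-9)   # n_i << 1: divide by less, x_i grows
high = control_factor(0.0, 1e9)   # n_i >> 1: divide by more, x_i shrinks
```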
4.4.2 The Random Surfer Operator of PageRank
As a relatively more complex example of the possibilities this framework offers, we
present the implementation of the random surfer operator used in the PageRank algorithm [105]. This operator will in turn allow us to implement PageRank as well.
The PageRank algorithm is concerned with the normalized adjacency matrix of a directed graph (e.g., the WWW link graph). Apart from this directed graph, the PageRank
algorithm uses a “random surfer” operator R as well, defined as Rij = 1/N, for all
i, j = 1, . . . , N, where N is the number of nodes. This corresponds to the definition of R
as being a uniform random walk on the fully connected graph (hence the name “random
surfer”). A very attractive feature of the random surfer operator is that it sends the same
weight to every node. Hence there is no need to actually implement the all-to-all links, either in a centralized or in a distributed calculation. The net effect of R is to add a constant
weight to each node at each propagation step. In other words, R times any vector gives
a vector which is uniform, and whose value may be known if the average of the vector
is known [105]. Hence, we can effectively replace matrix A with the PageRank operator (1 − ǫ)A + ǫR, where the second term involving R may be known as long as the average of x is known. Note that ǫ is a parameter of the PageRank algorithm, and defines the weight of the random surfer operator.
As described above, we can in fact obtain an approximation of the vector average.
Then we can implement the PageRank R operator—a global operator—using purely local
operations: node i now has the update rule
x_i = (1 − ǫ) b_i / e^{r_i(t)} + ǫ n_i(t),        (4.2)
where n_i(t) is the locally known converged approximation of the average at time t. Finally, note that controlling the average of the vector and applying the random surfer operator can be done simultaneously as well, using the update rule

x_i = (1 − ǫ) b_i / c + ǫ n_i(t),        (4.3)

where c is defined as in (4.1).
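Under the idealized assumptions that the gossip layer delivers the exact vector average and that A is column-stochastic (so the growth rate is one and the e^{r_i(t)} factor vanishes), update rule (4.2) reduces to a plain synchronous PageRank iteration; a Python sketch on a toy two-page graph of our own choosing:

```python
def pagerank_rounds(A, eps=0.2, rounds=100):
    """Sketch of update rule (4.2) with an exact gossip layer: every
    node knows the true vector average, and the growth rate of the
    column-stochastic A is one, so the e^{r_i(t)} factor is 1."""
    n = len(A)
    x = [1.0] * n  # the vector average starts at one and stays one
    for _ in range(rounds):
        b = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        avg = sum(x) / n  # stands in for the gossip estimate n_i(t)
        x = [(1.0 - eps) * b[i] + eps * avg for i in range(n)]
    return x

# two pages: page 0 links to page 1; page 1 links to page 0 and to
# itself (a self-loop, kept only to make the toy matrix stochastic)
A = [[0.0, 0.5],
     [1.0, 0.5]]
ranks = pagerank_rounds(A)
```

Because the combined operator stays column-stochastic, the sum of the ranks is preserved, and the iterates converge to the unique PageRank fixed point of this toy graph.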
4.5 Experimental Results
We performed extensive event-based simulation experiments using the PEERSIM simulator [66]. The goal of the experiments was to demonstrate that our method is both efficient
and robust to failure.
4.5.1 Notes on the Implementation
In the case of one of the components—the gossip-based protocol that continuously calculates the average of the current vector approximation, described in Section 4.4—we
applied two modifications to increase robustness.
First, instead of Algorithm 9, we applied push averaging presented in [34] and in Section 1.3.1. This variant is very similar; the main difference is that it is slightly modified
so that it can apply the “push only” communication model, while the original version is
based on the “push-pull” model. In the push model, the nodes only send messages, but
need not answer them. In the push-pull model all messages must be answered immediately. We apply the push variant because it is more robust to message delays: while in the push-pull version the state of the nodes is inconsistent for a short time (between the sending and the reception of the answer), this problem does not exist in the push version.
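The push variant can be sketched along the lines of the push-sum scheme of [34] (a simplified round-based Python simulation; the uniform random peer choice stands in for the peer sampling service, and lock-step rounds are a simplification):

```python
import random

def push_round(values, weights, rng):
    """One round of push-only averaging: each node keeps half of its
    (value, weight) pair and pushes the other half to a random peer;
    no reply is needed, and values[i] / weights[i] is the estimate
    of the global average at node i."""
    n = len(values)
    new_v = [values[i] / 2.0 for i in range(n)]
    new_w = [weights[i] / 2.0 for i in range(n)]
    for i in range(n):
        j = rng.randrange(n)  # peer sampling service stand-in
        new_v[j] += values[i] / 2.0
        new_w[j] += weights[i] / 2.0
    values[:] = new_v
    weights[:] = new_w

rng = random.Random(1)
values = [10.0, 20.0, 30.0, 40.0]  # true average: 25
weights = [1.0] * 4
for _ in range(200):
    push_round(values, weights, rng)
estimates = [v / w for v, w in zip(values, weights)]
```

Mass conservation (the sums of values and weights never change) is what makes the push variant correct even though individual messages need no acknowledgment.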
The second modification of this component is that we apply the epoch-based restarting technique described in Chapter 3. This is necessary to prevent nodes from mixing converged values with freshly initialized ones: since the local values are constantly re-initialized by the iteration component, we could otherwise never achieve convergence.
4.5.2 Artificially Generated Matrices
For evaluating the protocol we applied a set of artificially generated matrices with controlled properties. To model real applications mentioned in the Introduction, all matrices
are sparse and are derived from the adjacency matrix of a link graph. First let us define
the graphs that were used to define the matrices.
All graphs have 5000 nodes. The baseline case is a directed random graph, according
to the k-out model. In this model, k random out-links are added to each node. We
generated an instance of this model with k = 8.
The second graph is a scale free graph generated by the Barabási-Albert model [114]. Most importantly, the degree distribution of this graph follows a power law, which is extremely unbalanced with many low degree nodes and a few high degree nodes, and which is known to describe many interesting emergent networks such as the WWW or social relationships [114]. The parameter of the model was set to two. In this case the Barabási-Albert model defines an undirected graph by starting with two disconnected nodes, and subsequently adding nodes one by one, linking each new node to two existing nodes. These two nodes are selected with a probability proportional to their degree (preferential attachment). The average degree in the graph is thus four.
The third graph was generated starting with an undirected ring, and adding two random out-links from all nodes (note that this procedure follows a modified version of the
Watts-Strogatz model [70]). The motivation behind using this graph is that, as we will
see, its adjacency matrix has a small eigenvalue gap, which results in a slow convergence
of the power iteration. This graph was chosen to test whether our method is sensitive to a
small eigenvalue gap.
The matrices were derived from the adjacency matrices of these graphs. We note
for completeness that the specific instances of the directed graphs we used (the random
k-out and the small gap graphs) were all strongly connected. Since our convention is
that the element Aij describes the weight for network link (j, i)—so that matrix vector
multiplication can be defined by sending messages along the outgoing (and not incoming)
links—the adjacency matrices were first transposed. The first set of matrices consists
of the transposed adjacency matrices. The second set contains the column normalized
versions of the matrices in the first set. The normalized versions are such that the weights
of the outgoing links sum up to one for all the nodes, therefore these matrices describe
random walks on the graphs.
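The construction of the matrices from the graphs can be sketched as follows (Python; the generator and helper names are illustrative, not from the dissertation):

```python
import random

def k_out_graph(n, k, rng):
    """Directed random graph in the k-out model: each node gets k
    random out-links (self-loops excluded here for simplicity)."""
    return [rng.sample([j for j in range(n) if j != i], k)
            for i in range(n)]

def transposed_matrix(out_links, normalize):
    """A[i][j] is the weight of the link j -> i, i.e. the transposed
    adjacency matrix; if normalize is set, the out-weights of every
    node sum to one (column normalization), so A describes a random
    walk on the graph."""
    n = len(out_links)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        w = 1.0 / len(out_links[j]) if normalize else 1.0
        for i in out_links[j]:
            A[i][j] = w  # link (j, i) becomes matrix element A_ij
    return A

rng = random.Random(42)
graph = k_out_graph(50, 8, rng)
A = transposed_matrix(graph, normalize=True)
```

The transposition reflects the messaging convention of Algorithm 11: values propagate along outgoing links, so the weight of link (j, i) must sit in element A_ij.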
        random k-out          scale free            small gap
        normalized  unnorm.   normalized  unnorm.   normalized  unnorm.
λ1      1.0000      8.0000    1.0000      1.3981    1.0000      4.1938
|λ2|    0.3573      2.8345    0.8373      1.1737    0.9754      3.9976

Table 4.1: The first and second largest magnitude eigenvalues. Note that the largest magnitude eigenvalue is guaranteed to be real and positive.
Table 4.1 shows the first two largest magnitude eigenvalues for all the problem instances. Note that the eigenvalue gap (the difference between the first and second largest
eigenvalues) determines the convergence speed of the power iteration [115], and thus it is
a good indicator of the speed of our method as well. For a small gap, convergence is slow.
With a zero gap, the power iteration does not converge at all. Since all matrix elements
are real and non-negative, the largest eigenvalue is real and non-negative as well.
4.5.3 Results
Each experiment was carried out as follows. First, each node i was initialized to have
xi = 1 and bi = 1. As we explained previously, the starting time of individual nodes
is irrelevant from the point of view of convergence results, as long as all nodes start
eventually. In the simulations we started each node at a random time within the first ∆
time units counted from the first snapshot time t0 .
Two versions of the method were run for each problem. In the first, we do not apply
the vector normalization gossip component described in Section 4.4. In this case we
expect the vector norm to converge to a previously unknown value, given that we do not
change the underlying matrices in these experiments. In the second version we do apply
vector normalization. In particular, we apply the maximum of the vector for this purpose,
and therefore we expect the maximum to converge to one.
The evaluation metrics were as follows. We first computed the correct dominant eigenvector (x*) using a centralized algorithm. Following general practice in matrix computations, we measured the angle between the actual approximation and the correct vector to characterize convergence. That is, we computed the cosine of the angle
cos α(t) = ‖x*^T x(t)‖_2 / ( ‖x*‖_2 ‖x(t)‖_2 ),        (4.4)
and used the angle α(t) as a metric, which tends to zero as t increases. As a second metric,
we measured the maximum of the vector elements to verify normalization.
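The metric of Eq. (4.4) can be computed as follows (plain Python sketch):

```python
import math

def angle(x_star, x):
    """Angle between the correct dominant eigenvector x* and the
    current approximation x, following Eq. (4.4); the absolute value
    makes the metric insensitive to the sign of the vectors."""
    dot = sum(a * b for a, b in zip(x_star, x))
    norms = math.sqrt(sum(a * a for a in x_star)) * \
            math.sqrt(sum(b * b for b in x))
    cos_a = min(1.0, abs(dot) / norms)  # clamp floating-point noise
    return math.acos(cos_a)

# parallel vectors have angle zero regardless of their norms,
# which is why this metric is independent of the vector norm
a_parallel = angle([1.0, 2.0], [3.0, 6.0])
a_orthogonal = angle([1.0, 0.0], [0.0, 1.0])
```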
The failure scenarios involved varying message drop probabilities and varying message delays. Message drop was modeled by dropping each message with a given probability, and message delay and delay jitter were modeled by drawing a delay value uniformly from a specified interval, for all messages. Obviously, these settings were applied equally to all messages sent by any of the components of the protocol.
Figure 4.1 shows the results of the experiments. First of all, even the more moderate
failure scenario can be considered pessimistic, not to mention the more severe scenario.
This is because in the application scenarios we envision, the interval ∆ can be rather
long, in the range of ten to thirty seconds, so a delay of 10% of ∆ is already large. Most
[Figure 4.1: nine panels plot the angle (radian) and the maximal vector value against cycles for the random k-out, scale free, and small gap graphs, with and without vector normalization; curves are shown for the normalized and unnormalized matrices under no failure, scenario 1, and scenario 2.]
Figure 4.1: Simulation results. Scenario 1 involves Pdrop = 0.1, and a random message
delay drawn from [0, ∆/10] uniformly. In scenario 2, Pdrop = 0.3 and the message delay
is drawn from [∆/10, ∆/2].
[Figure 4.2: two panels plot the angle (radian) and the average vector value against cycles for the Notre Dame crawl with the PageRank algorithm, under no failure, scenario 3, and scenario 1.]
Figure 4.2: Simulation results with the PageRank algorithm. Scenario 1 involves Pdrop =
0.1, and a random message delay drawn from [0, ∆/10] uniformly. In scenario 3, Pdrop =
0 and the message delay is the same as in scenario 1.
importantly, from the point of view of the averaging and maximum finding protocols, which have a much shorter cycle length of ∆_n = ∆/30, these delay values are extreme.
From the experiments we can conclude that when the vector norm is not controlled explicitly, convergence is fast, comparable to that of the centralized power iteration.
Our preliminary experiments (not shown) suggest that message delay has virtually no
effect on the convergence results, when Pdrop = 0. Higher drop rates slow down convergence but do not change its characteristics significantly.
When we do apply vector normalization, convergence slows down somewhat due to the interference between vector normalization and the asynchronous iteration. In the extreme failure scenario we do not achieve full convergence. The reason is that the extremely high delay and message drop rate prevent the propagation of the current maximum of the vector to
all nodes during the interval ∆, and so different nodes might normalize with a different
value. As a side effect, the maximum does not converge to one, and there is a constant
noise factor in the approximation of the eigenvector. However, in the less severe, but still
pessimistic scenario we do achieve convergence.
4.5.4 PageRank on WWW Crawl Data
As a more realistic case study, we tested our method on the same dataset used in [116],
available from the authors. It was generated by a crawler, starting from one page within
the domain of the University of Notre Dame. This sample has 325729 nodes. On this
dataset we executed the PageRank algorithm, as described in Section 4.4. The weight of
the random surfer operator was ǫ = 0.2. This way, for the complete linear operator of the
PageRank algorithm, we have λ1 = 0.84648, λ2 = 0.8.
The results of the method are shown in Figure 4.2 in various failure scenarios. We can
observe that the protocol is now more sensitive to failure than in the case of the previous
experiments, although the achieved accuracy is still satisfactory (note the logarithmic
scale of the plots). The reason is that to get correct rank values, the vector average must be used for controlling the norm of the vector, that is, to guarantee that the average of the vector stays one. The average is used to implement the random surfer operator as well. However, the calculation of the average is more sensitive to failure than the calculation of the maximum. This way, the approximation of the actual average of the vector has a small noise factor, which is inherited by the approximation of the ranks.
We can also note that the protocol scales well: the network examined here is two
orders of magnitude larger than the previously examined networks, while convergence
speed is still similar.
4.6 Related Work
Due to its importance, the distributed calculation of the dominant eigenvalue of large matrices has an extensive literature. In the area of parallel and cluster computing, the focus
has largely been the optimization of existing, often iterative, methods on parallel computers and clusters (for a summary, see [115]). Such optimizations include partitioning;
for example, different parts of the vector can be freely assigned to different processors
in order to minimize message exchange and to maximize speedup. Besides, due to the
reliable computing platform, synchronization can be efficiently implemented. This model
is radically different from ours: in our case the assignment is fixed and given a priori, and the main goal is to achieve robustness to high rates of message delivery failures.
Asynchronous protocols have also been proposed for implementing iterative methods,
and important convergence results are available as well (see [117] for a summary). These
protocols are extremely fault tolerant and also efficient, but so far no algorithms are known
that can deal with the case when the dominant eigenvalue is different from one. This
introduces a certain sensitivity to dynamic environments even if λ1 ≈ 1; besides, many interesting applications where λ1 ≠ 1 cannot be tackled, for example [113].
Finally, in the context of P2P systems the main focus is on distributed PageRank
implementations, where in all cases λ1 = 1 is assumed, for example, [106–108]. The
EigenTrust protocol in [109] also applies a similar implementation, but the authors assume that all values are updated in each round, presumably unaware of the advantages of the long-existing asynchronous version of the protocol, thereby offering a rather fragile algorithm.
4.7 Conclusions
In this chapter we have addressed the problem of designing a fully distributed and robust algorithm for finding the dominant eigenvector of large and sparse matrices that are represented as weights of links between nodes of a network. Our contribution can be summarized as follows. First of all, our algorithm does not require the dominant eigenvalue to
be one. This is an important feature even if the problem involves a dominant eigenvalue of
one (like PageRank does). In PageRank, sophisticated techniques for “fixing” the graph
are required to make sure the dominant eigenvalue is one, which are not needed in our
case, as we demonstrated. Besides, the protocol opens the door for applications where the
dominant eigenvalue is known to be different from one [113].
Second, the norm of the approximation of the dominant eigenvector can be controlled
as well. In other words, in addition to guaranteeing that the norm of the vector converges
to a finite value, we can define this value explicitly using an additional gossip component.
This also means that the algorithm can be run indefinitely in a continuously changing
environment.
Finally, we demonstrated the robustness of the algorithm through event-based simulation experiments, both on artificially generated graphs and on web-crawl data.
Chapter 5
Slicing Overlay Networks
In this chapter we demonstrate yet another application of the gossip scheme: we will
show how to apply the gossip framework to implement a form of resource sharing via
maintaining partitions in the network in the face of node churn and failures. An interesting
aspect of this application is that—although it is seemingly unrelated to averaging at first
sight—its convergence can be described with the same tools we developed in Chapter 3
to characterize the convergence of averaging.
The motivation of slicing is that recently there has been an increasing interest in harnessing the potential of P2P technology to design and build rich environments where services
are provided and multiple applications can be supported in a flexible and dynamic manner.
In such a context, resource assignment to services and applications is crucial. Current approaches require significant “manual-mode” operations and/or rely on centralized servers
to maintain resource availability. Such approaches are neither scalable nor robust enough.
Our contribution towards the solution of this problem is proposing and evaluating a
gossip-based protocol to automatically partition the available nodes into “slices”, also
taking into account specific attributes of the nodes. These slices can be assigned to run
services or applications in a fully self-organizing but controlled manner. The main advantages of the proposed protocol are extreme scalability and robustness. We present
approximative theoretical models and extensive empirical analysis of the proposed protocol.
5.1 Introduction
Following the scale shift in distributed systems and their increasing dynamism, peer-to-peer overlay networks have imposed themselves as the key to building and maintaining large-scale dynamic distributed systems. One important problem in the field of overlay networks is the design of infrastructures on which several applications might run together
and share resources. Examples of such applications are Desktop-grid like computing
platforms [118], and testbed platforms such as PlanetLab [95].
One key sub-problem in such environments is resource assignment to services and
applications, and the definition of the resource itself. For example, in PlanetLab, the core
concept is a slice, which refers to a virtualized network running over multiple physical
nodes, and where each node can participate in multiple slices. Such slices are assigned
to specific applications, sharing the platform. However, existing approaches are mostly
manual and/or centralized. In contrast to this, we are interested in massively large scale
and extremely dynamic networks, in which centralized slice assignment is not an option
and where slices need not only to be assigned, but also maintained, to face constant churn.
In this chapter, as a step towards a fully self-organizing architecture, we focus on a well-defined problem: ordered slicing. Our objective is to create and maintain a partitioning
of the network (we call the partitions slices in the following). This implies that slices are
defined as subsets of the network, that is, each node belongs to exactly one slice at any
given point in time. However, several such partitionings can be maintained in parallel.
The ordered nature of the slicing means that specific attributes can be taken into account
to partition the network: the partitioning is done along a fixed attribute of the nodes. For
example, a service might require a slice composed of the top 20% of the nodes providing
the largest bandwidth. Moreover, we need to provide this top 20% continuously, even if the
nodes in the top 20% constantly change due to churn or changing node properties.
Many metrics may be used to sort the nodes such as available resources (memory,
bandwidth, computing power) or some specific behavioral pattern such as up-time. Note
that slicing the network at random, and focusing only on the size of the slices is a special
case of our ordered slicing protocol. We also note that the slice sizes are expressed as a
percentage of the network, that is, if the network grows, slices grow accordingly.
The rest of the chapter is organized as follows. In Section 5.2 we provide the problem
statement and the system model. In Section 5.3 we describe our gossip-based slicing
protocol. An approximative theoretical model of our approach is presented in Section 5.4
and an extensive empirical analysis is presented in Section 5.5.
5.2 Problem Definition
5.2.1 System Model
We consider a network consisting of a large collection of nodes that are assigned unique
identifiers (typically IP addresses) and that communicate through message exchanges.
The network is highly dynamic; new nodes may join at any time, and existing nodes may
leave, either voluntarily or by crashing. In the following, we limit our discussion to node
crashes. Voluntary leaves are implemented as crashes: our protocols will not require a
dedicated leave procedure, nor any failure detection. Messages that are delivered arrive
without delay; however, messages may be dropped. Byzantine failures, with
nodes behaving arbitrarily, are excluded from the present discussion.
We assume that nodes are connected through an existing physical routed network, such
as the Internet, where every node can potentially communicate with every other node. To
actually communicate, a node has to know the identifiers of a set of other nodes (its
neighbors), for example, the IP address in the case of an IP network. This neighborhood
relation over the nodes defines the topology of the overlay network. Given the large scale
and the dynamism of our envisioned system, neighborhoods are typically limited to small
subsets of the entire network. The neighbors of a node (and, thus, the overlay topology)
may change dynamically over time.
5.2.2 The Ordered Slicing Problem
Intuitively, the ordered slicing problem asks for a partitioning of the nodes in the overlay
network into groups (slices) in such a way, that the groups are ordered with respect to
some given metric, such as the availability of a resource, or some other relevant property.
For example, we might be interested in creating and maintaining a slice composed of the
top 10% of nodes according to available bandwidth, expected up-time, and so on. Note that
creating a slice of a given size, populated with random nodes, is a special case where the
metric is not taken into account or, equivalently, where all nodes are assumed to have the
same value of the metric. Slice sizes are expressed as a percentage of the network size.
To define this problem, let N denote the network size and let each node i have an
attribute, xi . This value will typically measure the availability of some resource at node
i. We assume that there exists a total ordering over the domain of the attributes values, so
that the values in the network (x1 , . . . , xN ) can be ordered. Let us also assume that there
is a slice specification that defines an ordered partitioning of the nodes. That is, the slice
specification is a list of positive real numbers s1, . . . , sk such that s1 + · · · + sk = 1, which define
slices S1, . . . , Sk, where the size of Si is siN and, for all i < j, a ∈ Si and b ∈ Sj, we have
xa ≤ xb. We also assume that the slice specification is known at each node locally.
The problem is to automatically assign each node to a slice in a way that satisfies
the slice specification, using only local message exchanges with currently known neighbors. That is, we want each node to find out which slice it belongs to and, as the network
continuously changes, to keep this assignment up to date.
The difficulty lies in the fact that the correct solution requires global information: each
node needs to calculate the number of nodes that precede it in the total order,
and break ties whenever the number of preceding nodes is not well defined (for example,
if there are many identical attribute values in the network). Furthermore, the dynamism
and failures in the system add extra difficulty, as this assignment needs to be continuously
maintained in the face of a changing set of nodes.
As opposed to most traditional approaches, which guarantee only eventual correctness provided there is no failure or change in the system for a sufficiently long time, we focus
on a best-effort approximation that is as close to the optimal solution as possible. In
other words, instead of focusing on the worst case, we focus on optimizing performance under normal dynamic operation. Nevertheless, our solution is in fact eventually
consistent.
Note that the solution we present here can easily be extended to more general cases,
such as when several independent attributes are involved, or when overlapping groups need
to be maintained, and so on. However, the problem definition above allows us to keep the
focus on the analysis of the key novel contributions introduced here.
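To make the target of the protocol concrete, the globally correct ("oracle") assignment can be sketched as follows. This is only an illustration with our own naming, not part of the protocol itself; no node has the global view required to compute it.

```python
# Illustrative "oracle" slice assignment (our naming): computes, with
# global knowledge, which slice each node should end up in.  The gossip
# protocol can only approximate this using local information.

def ideal_slices(attrs, spec):
    """attrs: attribute value per node; spec: fractions summing to 1."""
    n = len(attrs)
    # Rank nodes by attribute value (ties broken by node index).
    order = sorted(range(n), key=lambda i: (attrs[i], i))
    # Cumulative slice boundaries expressed as node counts.
    bounds, acc = [], 0.0
    for s in spec:
        acc += s
        bounds.append(round(acc * n))
    bounds[-1] = n  # guard against floating-point rounding
    assignment = [0] * n
    slice_id = 0
    for rank, node in enumerate(order):
        while rank >= bounds[slice_id]:
            slice_id += 1
        assignment[node] = slice_id
    return assignment

# Six nodes, three equal slices: the two smallest attributes form
# slice 0, the next two slice 1, the two largest slice 2.
print(ideal_slices([5, 1, 9, 3, 7, 2], [1/3, 1/3, 1/3]))  # → [1, 0, 2, 1, 2, 0]
```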
5.3 A Gossip-based Approach
As mentioned previously, to let nodes decide locally whether they belong to a certain
slice or not (where slices are expressed as a percentage of the whole network size, which is not
known either), the key issue is to enable a node to approximate what percentage of nodes
precede it in the ordering according to the attribute value. There are at least two natural choices to
implement this functionality. The first is through the application of protocols to calculate
the ranking of the nodes in the ordering [13,34]. However, known protocols are expensive
and they are not suitable to maintain ranking information cheaply in the face of large scale
and dynamism. The second is to approximate the distribution of the attribute values and
use this information to map any attribute value to an approximate ranking [119,120]. This
approach, however, is not robust to skewed distributions and does not provide sufficiently
accurate information for the present purposes.
To deal with dynamism and large scale, we follow a third approach, which is based on
the sorting of randomly generated numbers. The basic idea is that each node generates one
Algorithm 12 Slicing
 1: loop
 2:     wait(∆)
 3:     p ← selectSlicePeer()
 4:     if p ≠ null then
 5:         sendPush(p, (x, r))

 6: procedure onPush(m)
 7:     sendPull(m.sender, (x, r))
 8:     onPull(m)

 9: procedure onPull(m)
10:     if (x − m.x)(r − m.r) < 0 then
11:         r ← m.r
uniform random number from a fixed interval, and subsequently the set of these random
numbers is sorted “along the attribute values” with the help of the protocol we describe
below. Sorting along the attribute values means that, via swapping the random numbers
among a suitable sequence of pairs of nodes, we would like to ensure that the order of
the random numbers reflects the order of the attribute values over the nodes.
After sorting, the node is able to make a judgment about its position in the ordering of
the attribute values based on the random number it currently holds, because the distribution of the random numbers is known (that is, uniform over a fixed interval) and because
the order of the random numbers reflects the order of the attribute values. For example, if
the random numbers are drawn from [0, 1], then a node decides that it is in the first half of
the sorting if, after sorting along the attribute values, it holds a value less than 0.5. Apart
from being simple, this approach supports dynamism well, as all joining nodes can locally
initialize their random number and subsequently participate in the sorting. Furthermore,
the approach works independently of the distribution of the attribute values: they can even
be identical at all nodes, in which case only the sizes of the slices are determined, but the
nodes will be assigned to slices at random.
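The local decision step can be sketched as follows (a minimal illustration with our own naming): given its current random number r and the locally known slice specification, a node maps r to a slice index by comparing it against the cumulative slice sizes.

```python
# Minimal sketch (our naming): a node maps its random number r ∈ [0,1)
# to a slice index using only the locally known slice specification.

def local_slice(r, spec):
    acc = 0.0
    for k, s in enumerate(spec):
        acc += s
        if r < acc:
            return k
    return len(spec) - 1  # guard against rounding when r is close to 1

# With spec (0.2, 0.8): r = 0.1 puts the node in the top-20% slice,
# r = 0.5 in the second slice.
print(local_slice(0.1, [0.2, 0.8]))  # → 0
print(local_slice(0.5, [0.2, 0.8]))  # → 1
```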
However, the sorting problem might seem just as difficult as our original problem.
Our main contribution, apart from proposing the application of sorting, is a gossip-based sorting protocol that is simple to implement, incurs minimal cost, and is efficient
enough for the purposes of ordered slicing. The basic idea relies on a simple swapping of
the random numbers between nodes. For example, let nodes i and j have attribute values
xi = 10 and xj = 20, and random numbers ri = 0.8 and rj = 0.1. These nodes simply
swap their random numbers in order to make them reflect the ordering of the attribute
values. In order to make such pairs of nodes discover each other, we rely on a gossip-based algorithm.
Our sorting protocol is based on the NEWSCAST protocol (see Section 2.2.4). The
basic idea behind the sorting algorithm is that each node will passively look for candidate
peers to swap its random value with, in order to improve the sorting. These candidates
are discovered using the constantly changing set of neighbors provided by the NEWSCAST
protocol. The sorting protocol is based on NEWSCAST not through the usual peer
sampling API (the method SELECTPEER, which returns a random peer); instead, it has a more
intimate relationship with NEWSCAST. First, the node descriptors in the NEWSCAST view
contain not only the node address and the timestamp, but also the values of x and r at
the node at the time of submitting the descriptor. Second, the slicing protocol asks for
peers from the NEWSCAST view that are likely to be good candidates to swap random values
with.
The algorithm is shown in Algorithm 12. On peer i, the method SELECTSLICEPEER returns
a peer j from the NEWSCAST view such that (xi − xj)(ri − rj) < 0, which means that the
given peer is a potential candidate to swap random values with. It is not guaranteed that
a suitable peer exists in the view. If there is no suitable peer then no push message is sent
and therefore no exchange is performed.
Note that if (xi − xj)(ri − rj) ≥ 0 according to the current view, it is still possible
that in reality peer j has become a good candidate in the meantime, because the relative
order of two peers can potentially be reversed and the information in the descriptor might
be slightly outdated. Similarly, it is possible that, although node j seems to be suitable,
its random value has changed in the meantime and it is not actually suitable anymore. In
this latter case the push and pull messages are sent (in vain) but no exchange happens.
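Under idealized assumptions (every node can sample c uniform random peers per cycle, and each push/pull exchange completes atomically with no message loss), the swap dynamics of Algorithm 12 can be sketched as follows; all names are ours.

```python
# Idealized simulation sketch of the swap dynamics of Algorithm 12
# (our naming): every node sees c random peers per cycle, and the
# push/pull exchange completes atomically with no message loss.
import random

def run_slicing(n=200, c=20, cycles=30, seed=1):
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]  # attribute values
    r = [rng.random() for _ in range(n)]  # random numbers to be sorted
    for _ in range(cycles):
        for i in range(n):
            # selectSlicePeer: scan c candidates for a misordered pair.
            for j in rng.sample(range(n), c):
                if (x[i] - x[j]) * (r[i] - r[j]) < 0:
                    r[i], r[j] = r[j], r[i]  # the swap of Algorithm 12
                    break
    return x, r

x, r = run_slicing()
# After a few cycles the order of the r values should follow the order
# of the x values almost everywhere; count remaining misordered pairs.
bad = sum((x[i] - x[j]) * (r[i] - r[j]) < 0
          for i in range(len(x)) for j in range(i))
print(bad, "misordered pairs out of", len(x) * (len(x) - 1) // 2)
```

Note that near convergence, suitable candidates become rare, so a small residue of misordered pairs can persist, matching the slowdown discussed in Section 5.5.1.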
5.4 Analogy with Gossip-based Averaging
In this section, we analyze the protocol based mostly on the insight that it can
be considered an instance of gossip-based averaging (see Chapter 3). For the present
section, we assume that the attribute values do not change and that the set of nodes
does not change either. This also means that the set of random numbers ri held by the
nodes remains the same at any given point in time.
Let us define ρ(xi ) to be the rank of xi in the sorting of the x values, and, similarly,
let ρ(ri ) be the rank of ri in the sorting of the r values. Let
δi(t) = ρ(xi(t)) − ρ(ri(t)).   (5.1)
When δi (t) = 0, we know that ri (t) is the random value that belongs to node i. The state
we want to reach is the one in which δi = 0 for all i = 1, . . . , N, at which point the ri
values are sorted w.r.t. the xi values. Let us assume that all the values xi and ri are unique,
so their rank is well defined. This is purely to keep our discussion simple. Note that this
is the worst case: if some of the values are not unique, then the problem becomes easier;
for example, if all the values ri or xi are the same, then all permutations are sorted and
there is no problem to solve.
Our main observation is that the slicing protocol in Algorithm 12 can be considered
an averaging algorithm over the values δi . To see this, consider first that the sum of these
values remains zero after each exchange. That is, we have the mass conservation property:
∑_{i=1}^{N} δi(t) = ∑_{i=1}^{N} [ρ(xi(t)) − ρ(ri(t))] = ∑_{i=1}^{N} ρ(xi(t)) − ∑_{i=1}^{N} ρ(ri(t)) = 0   (5.2)
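The mass conservation property is easy to verify numerically, since both rank sequences are permutations of 0, . . . , N − 1 (a quick illustrative check with our own naming):

```python
# Quick illustrative check of the mass conservation property (5.2):
# both rank sequences are permutations of 0..N-1, so the δ values
# always sum to zero, whatever the current assignment of r values is.
import random

def ranks(values):
    order = sorted(range(len(values)), key=values.__getitem__)
    rank = [0] * len(values)
    for pos, i in enumerate(order):
        rank[i] = pos
    return rank

rng = random.Random(0)
N = 1000
x = [rng.random() for _ in range(N)]
r = [rng.random() for _ in range(N)]
delta = [rx - rr for rx, rr in zip(ranks(x), ranks(r))]
print(sum(delta))  # → 0
```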
As with averaging, it is easy to verify that the maximal δ value will decrease every
time its node participates in a successful exchange, and, similarly, the minimal value will
increase. Besides, it is evident that, for each node, participating in a successful exchange
has a positive probability in each cycle. This means that the variance of the values decreases,
which proves convergence in probability. The following proposition describes the
effect of an exchange on the participating δ values in more detail.
Proposition 5.4.1. Let nodes i and j perform a successful exchange at time t. After the
exchange, we have

E(δi(t + 1)) = E(δj(t + 1)) = (δi(t) + δj(t))/2   (5.3)
Proof. Since the exchange was successful, we must have (xi(t) − xj(t))(ri(t) − rj(t)) < 0.
This also implies that

(ρ(xi(t)) − ρ(xj(t)))(ρ(ri(t)) − ρ(rj(t))) < 0.   (5.4)
Without loss of generality, let us assume throughout the proof that δi (t) < 0 and
|δi (t)| > |δj (t)| (the other logical cases are symmetrical to this). In this case, the possible
values of ρ(rj (t)) consistent with (5.4) are ρ(ri (t))−1, ρ(ri (t))−2, . . . , ρ(xi (t))−δj (t)+
1. Note that since we know δj (t), we know that ρ(xj (t)) = ρ(rj (t)) + δj (t). The (δi (t +
1), δj (t + 1)) pairs that belong to these options are (δi (t) + 1, δj (t) − 1), (δi (t) + 2, δj (t) −
2), . . . , (δj (t) − 1, δi (t) + 1).
Now, we know that 1 ≤ ρ(xi (t)) ≤ N + δi (t). For any fixed value of ρ(xi (t)) such
that 1 ≤ ρ(xi (t)) − δj (t) + 1 and ρ(ri (t)) + δj (t) − 1 ≤ N, we know that all options listed
above are possible and have equal probability, so
E(δi(t + 1) | ρ(xi(t))) = E(δj(t + 1) | ρ(xi(t))) = (1/n) ∑_{k=1}^{n} (δi(t) + k) = (1/n) · n(δi(t) + δj(t))/2 = (δi(t) + δj(t))/2,   (5.5)
where n = δj (t) − δi (t) − 1.
We only sketch how to deal with the case when the value of ρ(xi (t)) is such that
1 > ρ(xi (t)) − δj (t) + 1 or ρ(ri (t)) + δj (t) − 1 > N. In this case, some of the options
are not possible. However, these impossible options are symmetric: if some settings for
node j are not possible because ρ(xi (t)) is too close to 1, then we have a symmetric value
of ρ(xi (t)) (too close to N) where the same number of options are not possible on the
opposite end of the list of options. Considering the symmetry of the series of options for
(δi(t + 1), δj(t + 1)), it is not hard to see that the desired expectation also holds in this
regime.
This result makes slicing very similar to averaging, since the only difference is that in
the case of averaging both nodes will have exactly the average (xi (t) + xj (t))/2, whereas
here this is true only in expectation.
Like in Chapter 3, to characterize the convergence of the protocol we will focus on
the variance of the values, which here we will call the disorder measure:

σ(t) = (1/N) ∑_{i=1}^{N} δi(t)²,   (5.6)
This measure is minimized when the sorting is perfect, at which point it takes the value of zero.
The value of this measure is shown in Figure 5.1. Note that the figure indicates that fewer
than 20 cycles are sufficient to reduce the average error to 1% of the network size, which
means that nodes are N/100 positions away from their correct position on average, independently of network size. With c = 80, 40 cycles are enough to reach 0.1% precision.
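The quantities δi(t) and σ(t) defined above are straightforward to compute from the two rank sequences; a minimal sketch with our own naming:

```python
# Sketch (our naming) of the disorder measure: δ_i is the rank mismatch
# between a node's attribute value and its current random number, and
# σ is the mean of the squared mismatches.

def ranks(values):
    order = sorted(range(len(values)), key=values.__getitem__)
    rank = [0] * len(values)
    for pos, i in enumerate(order):
        rank[i] = pos
    return rank

def disorder(x, r):
    rx, rr = ranks(x), ranks(r)
    return sum((a - b) ** 2 for a, b in zip(rx, rr)) / len(x)

# A perfectly sorted assignment has zero disorder; a fully reversed one
# is the worst case for N = 4.
print(disorder([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4]))  # → 0.0
print(disorder([1, 2, 3, 4], [0.4, 0.3, 0.2, 0.1]))  # → 5.0
```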
Using Proposition 5.4.1, Proposition 3.3.5 and Corollary 3.3.3 from Chapter 3 can be
shown to hold in the case of slicing as well. Therefore we expect the disorder to be
reduced exponentially fast as a function of the number of successful exchanges (note that
in the case of averaging every peer can perform a successful exchange with every other
peer at any time, whereas here it becomes harder and harder to find suitable peers). As can
be seen in Figure 5.2, the qualitative prediction of exponential behavior is very accurate.
[Figure 5.1 plot: disorder σ on a log scale (10¹–10¹¹) against cycles (0–100), with one curve per combination of c = 20, 40, 80 and N = 30000, 100000, 300000.]
Figure 5.1: Disorder as a function of cycles. Curves for a fixed c completely overlap when
normalized by N².
Furthermore, slicing appears to approximate the theoretically minimal convergence
factor, that is, a∗ = 1/4 (using the notation from Proposition 3.3.5 and Corollary 3.3.3).
In the case of averaging, this minimal convergence factor could be reached using a series
of exchanges based on two independent perfect matchings in one cycle. In the case of
slicing, we speculate that any series of N/2 exchanges is closer to a perfect matching
than to a set of random pairs. This is because in each cycle the set of suitable peers for
any node is relatively small, and its size is proportional to δi(t), given that δi(t)
decreases at roughly the same rate for each node i.
5.5 Experimental Analysis
We have performed extensive simulation experiments in order to study the probability of
finding a peer to swap with, as well as the behavior of the protocol in the presence of
message drop and node dynamism (churn). The experiments were performed using the
PEERSIM simulator [66]. All scenarios were run with three network sizes (N = 30000,
100000 and 300000) and three view size settings (c = 20, 40 and 80).
5.5.1 The Number of Successful Swaps
Since the nodes are not guaranteed to find suitable peers to swap values with, the exponential convergence is valid only as a function of the number of successful swaps, and
not as a function of cycles. Here we experimentally evaluate the probability of finding a
suitable peer as a function of time.
Figure 5.3 shows our experimental results regarding the number of successful swaps
as a function of cycles. The number of swaps depends on the view size parameter c, but
in all cases it has a power-law tail that decreases approximately as 1/x. In the first few cycles,
however, the number of exchanges remains approximately constant. This is due to the
fact that the algorithm uses c > 1 candidates to eventually select a peer. If p(t) is the
probability that a random peer is a suitable peer, then in each selection step the algorithm
[Figure 5.2 plot: disorder σ on a log scale (10¹–10¹¹) against the number of successful swaps (normalized by N, 0–10), with curves for N = 30000, 100000, 300000 and c = 20, 40, 80, plus a reference line of slope 4⁻ˣ.]
Figure 5.2: The exponential decrease of the disorder as a function of the number of successful
swaps (normalized by the network size), for different values of parameter c (view size)
and network sizes. Lines that belong to the same network size fully overlap.
selects a suitable peer with probability 1 − (1 − p(t))^c. While p(t) is large (that is, while
t is small), this probability remains close to one, and as a result the disorder σ decreases
exponentially fast as a function of cycles (see Figure 5.1). However, when p(t) becomes
small (as δi(t) becomes small), convergence slows down. Most importantly, this result is
independent of network size, which allows for a scalable and robust setting of parameters.
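The claimed effect of the view size is easy to check numerically: the selection probability 1 − (1 − p)^c stays close to one until the per-peer probability p becomes quite small.

```python
# Numerical check of the selection probability 1 - (1 - p)^c for the
# view sizes used in the experiments.
for p in (0.5, 0.1, 0.01, 0.001):
    row = ", ".join(f"c={c}: {1 - (1 - p) ** c:.3f}" for c in (20, 40, 80))
    print(f"p={p}: {row}")
```

For instance, even at p = 0.1 a view of c = 20 finds a suitable peer with probability above 0.87, while at p = 0.001 the probability drops below 0.1 for all three view sizes, matching the observed slowdown.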
5.5.2 Message Drop
The protocol generates a large number of independent message exchanges (push and pull)
at all nodes. In the implementation, the messages are assumed to be sent using an unreliable channel, such as UDP, and there is no failure detection mechanism.
If the push message is dropped, the exchange is dropped as a whole. These cases simply slow down the convergence proportionally to the number of failures, without changing
its characteristics.
If the pull message is dropped then the random value originally held by the selected
peer is lost, since the selected peer first sets the value received in the push message. In
other words, one of the values gets duplicated and the other gets lost. This, however, has
no dramatic effect as long as there is still a sufficient number of different values, since
the distribution of the set of all values remains uniform (no bias is involved in the
message failures). Indeed, as shown in Figure 5.4, we observe only a proportional
slowdown under 10% uniform message drop, which can be considered a rather significant
drop rate. For other network sizes we obtain identical results. We can conclude that the
quality of the ordering is highly robust to message drop failures.
However, diversity of the values is important, because the “resolution” of the system
(the number and size of the groups it can order) depends on this diversity. If there is no
churn, then there will be fewer and fewer swaps as the system converges to the ordered
state, as described in Section 5.4. Besides, even when a value that is still represented has a small
number of copies, it is very unlikely that all copies of that value are removed. Due to these two properties, diversity practically stops decreasing very
[Figure 5.3 plot: the number of successful swaps on a log scale against cycles (log scale, up to 100), with curves for N = 30000, 100000, 300000 and c = 20, 40, 80, plus a 1/x reference line.]
Figure 5.3: Swaps as a function of cycles. Curves completely overlap when normalized
by N.
soon. In addition, in the presence of extreme failure rates, we can add a simple technique
to further combat the loss of diversity: whenever a node sees another node in its view that
holds the same random value, it replaces its own value with a new random one. This technique
in effect introduces a very low rate of artificial churn, which is dealt with just like real churn
(see Section 5.5.3). We also note that if there is natural churn, then diversity is maintained
by the churn itself.
5.5.3 Churn
To examine the effect of churn, we define an artificial scenario in which a given proportion of the nodes crash and are subsequently replaced by new nodes in each cycle. This
is a worst-case scenario because the new nodes are assumed to join the system
for the first time (their random value ri is independent of their attribute value xi) and the
crashed nodes are assumed never to join the system again. The view of joining nodes is
initialized with descriptors of randomly selected nodes.
The churn rate defines the proportion of nodes that are replaced by new nodes in each cycle.
We consider churn rates of 0.1% and 1%. Since churn is defined in terms of cycles, in
order to assess how realistic these settings are, we need to define the cycle length. With
the very conservative setting of 10 seconds, which results in a very low load at each node,
the trace described in [58] corresponds to 0.2% churn in each cycle. In this light, we
consider 1% a comfortable upper bound on realistic churn, given also that the cycle length
can easily be decreased to deal with even higher levels of churn.
The results of the experiments are shown in Figure 5.5. The ordering effort of the protocol and the continuously introduced disorder reach an equilibrium after a few cycles,
after which the level of order remains stable. Even with a 1% churn rate in each cycle, the
protocol manages to keep the average distance from the correct position approximately
an order of magnitude smaller than that in a random permutation. Note that in this scenario,
during the 50 cycles shown, almost half of the network gets replaced at least once. We
can further improve the performance of the protocol using techniques that take the age
(time spent in the network) into account. One technique is called age bias; when using
[Figure 5.4 plot: disorder σ on a log scale (10³–10¹⁰) against cycles (10–100), comparing c = 20, 40, 80 with a 10% message drop rate and with no message drop.]
Figure 5.4: Disorder as a function of cycles, with and without message drop, for N = 100000.
[Figure 5.5 plots: disorder σ on a log scale (10⁵–10¹⁰) against cycles (0–50); one panel for 0.1% churn per cycle and one for 1% churn per cycle, each with curves for N = 30000, 100000, 300000 and c = 20, 40, 80.]
Figure 5.5: Disorder as a function of cycles, for churn rates 0.1% and 1% per cycle.
Curves completely overlap when normalized by N².
this technique, a node, when selecting the neighbor to swap with, chooses the candidate
that has the most similar age. This can be easily implemented without
extra communication steps, if the node descriptors in the view also contain node age. As
a result, only the younger nodes tend to be disordered, while they can still converge and
while the older nodes that have already converged remain protected. Indeed, as shown
in Figure 5.6, we obtain a considerable improvement using the age bias technique, if, in
addition to the age bias, we also require a certain maturity (that is, minimal age) to be
considered as part of any slice. In other words, the order among the nodes that have a
certain minimal age improves significantly.
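The age-biased selection described above can be sketched as follows (a hypothetical helper with our own naming; view entries are modeled as (address, x, r, age) descriptor tuples):

```python
# Sketch of the age bias heuristic (hypothetical helper, our naming);
# view entries are modeled as (address, x, r, age) descriptor tuples.

def select_slice_peer_age_biased(my_x, my_r, my_age, view):
    # Keep only descriptors that satisfy the swap condition...
    candidates = [d for d in view if (my_x - d[1]) * (my_r - d[2]) < 0]
    if not candidates:
        return None  # no push is sent in this cycle
    # ...and among those, prefer the peer with the most similar age.
    return min(candidates, key=lambda d: abs(my_age - d[3]))

view = [("a", 0.9, 0.1, 5), ("b", 0.8, 0.2, 40), ("c", 0.1, 0.05, 50)]
# For a node with x=0.5, r=0.5, age=42: "a" and "b" are misordered
# relative to it, and "b" has the closer age, so "b" is selected.
print(select_slice_peer_age_biased(0.5, 0.5, 42, view))  # → ('b', 0.8, 0.2, 40)
```

No extra communication is needed, assuming (as the text describes) that ages are already carried in the node descriptors.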
5.5.4 An Illustrative Example
To illustrate how well our approach copes with highly dynamic environments, Figure 5.7 provides a visualization of three slices that are maintained in a network of size
1000, over 1200 cycles, using age bias and a maturity level of 20 cycles. The slice specification is (1/3, 1/3, 1/3), that is, we have three slices of equal size. The view size is c = 20.
[Figure 5.6 plot: converged disorder σ (4.0×10⁶ to 2.6×10⁷) against maturity age in cycles (0–50), comparing c = 20, 40, 80 with and without age bias.]
Figure 5.6: Effects of age-based techniques. The converged value of σ is shown. Network
size is 100000; error bars show standard deviation over 50 cycles (cycles 50–100).
Figure 5.7: Visualization of groups in extreme failure scenarios.
After the start of churn the network seems to shrink. This is due to the fact that we consider only mature nodes, that is, those that are older than 20 cycles. The scenario we
applied includes churn (1% in each cycle), removal of a random half of the network, and
subsequently doubling the network size in one cycle. We observe that the slices remain
relatively well defined, especially if we consider that the entire network gets replaced
several times during the period shown. We also observe that as soon as the churn stops,
the slices stabilize as well. Note that our goal cannot be to eliminate churn within
a slice completely, but only to make sure it is similar to the churn the entire network is
experiencing.
5.6 Conclusions
We have described a solution to automatically partition a highly dynamic network according to a given metric as well as to maintain such a partitioning in the presence of churn.
In our approach to the ordered slicing problem, each node has to identify which section
of the network it belongs to, ordered along an attribute xi , using only local information.
Our solution relies on a robust and scalable gossip-based sorting protocol. We have presented approximative theoretical results based on an analogy with average calculation and
demonstrated the robustness of the protocol in simulation experiments.
We focused only on the identification of the slices, which is in itself a challenging
problem. However, to be practically useful, these slices have to be presented to users and
applications as groups. One solution to this problem is to execute a slice-specific NEWSCAST protocol inside each slice, which implements the peer sampling service (providing
samples from the slice). Users of a slice will simply be part of the slice and access it
through the peer sampling service. Nodes (and users) can join a slice-specific NEWSCAST
via a random contact from the slice. Such contacts can be stored (and continuously updated) together with the slice specification, which, as we mentioned previously, can be
thought of as a very small database stored at each node and maintained cheaply using
anti-entropy gossip.
Chapter 6
T-Man: Topology Construction
So far we have focused on computations based on the peer sampling service that is implemented using a random overlay network. Now we turn our attention to the problem
of constructing structured overlay networks that can be used to implement distributed
data structures or to optimize distributed applications via exploiting geographical node
proximity or the similarity of interests of the participating agents.
In general, large-scale overlay networks have become crucial ingredients of fully decentralized applications and peer-to-peer systems. Depending on the task at hand,
overlay networks are organized into different topologies, such as rings, trees, semantic
and geographic proximity networks. We argue that the central role overlay networks play
in decentralized application development requires a more systematic study and an effort towards understanding the possibilities and limits of overlay network construction in full
generality.
The contribution of this chapter is a gossip protocol called T-MAN that can build a
wide range of overlay networks from scratch, relying only on minimal assumptions. The
protocol is fast, robust, and very simple. It is also highly configurable, as the desired
topology itself is a parameter, given in the form of a ranking method that orders nodes according
to a base node's preference for selecting them as neighbors. We present extensive empirical
analysis of the protocol along with theoretical analysis of certain aspects of its behavior.
6.1 Introduction
Overlay networks have emerged as perhaps the single most important abstraction for
implementing a wide range of functions in large, fully decentralized systems. The overlay network needs to be designed appropriately to support the application at hand efficiently. For example, application-level multicast might need carefully controlled random
networks or trees, depending on the multicast approach [62, 87]. Similarly, decentralized
search applications benefit from special overlay network structures such as random or
scale-free graphs [121,122], superpeer networks [123], networks that are organized based
on proximity and/or capacity of the nodes [124, 125], or distributed hash tables (DHTs),
for example [81, 83].
In current work, protocol designers typically assume that a given network exists for
a long period of time, and only a relatively small proportion of nodes join or leave concurrently. Furthermore, applications either rely on their own idiosyncratic procedures for
implementing join and repair of the overlay network or they simply let the network evolve
in an emergent manner based on external factors such as user behavior.
We believe that there is room and need for interesting research contributions on at
least two fronts. The first concerns the question of whether a single framework can be used
to develop flexible and configurable protocols, without sacrificing simplicity and performance, that tackle the plethora of overlay networks that have been proposed. The second
front concerns scenarios in overlay construction that are often overlooked, such as massive joins and leaves, as well as quick and efficient bootstrapping of a desired overlay from
scratch or from some initial state. Current approaches either fail or are prohibitively expensive
in such scenarios. Combining results on these two fronts would enable several interesting
possibilities. These include: (i) overlay network creation on demand, (ii) deployment of
temporary and adaptive decentralized applications with custom overlay topologies that
are designed on-the-fly, (iii) federation or splitting of different existing architectures [17].
We address both questions and present an algorithm called T-Man for creating a large class of overlay networks from scratch. The algorithm is highly configurable: the network to be created is defined compactly by a ranking method. The ranking method formalizes the following idea: when shown a set of nodes, we assume each node in the network is able to decide which ones it likes from the set more and which ones it likes less (we will later use this ability of nodes to help them have neighbors they like as much as possible). In other words, each node can order any set of nodes. Formally speaking, the ranking method is able to order any set of nodes given a so-called base node. By defining an appropriate ranking method, we will be able to build a wide variety of topologies, including sorted rings, trees, toruses, clustering and proximity networks, and even full-blown DHT networks, such as the Chord ring with fingers (as discussed in Chapter 7). T-Man relies only on an underlying peer sampling service (see Chapter 2) that creates an initial overlay network with random links as the starting point.
The algorithm is gossip based: all nodes periodically communicate with a randomly selected neighbor and exchange (bounded) neighborhood information in order to improve the quality of their own neighbor set. This approach, while requiring no more messages than the heartbeats already present in proactive repair protocols, is simple, and achieves fast and robust convergence as we demonstrate. Here, we limit our study to the overlay construction problem. Our main contribution is to show that a single, generic gossip-based algorithm can create many different overlay networks from scratch quickly and efficiently.
The roadmap of the chapter is as follows. Sections 6.3 and 6.4 present the system
model and the overlay construction problem. Section 6.5 describes the T-Man protocol.
In Section 6.6 we present theoretical and experimental results to characterize key properties of the protocol and to give guidelines on parameter settings. Section 6.7 presents
practical extensions to the protocol related to bootstrapping and termination, and extensive experimental results are also given to examine the behavior of the protocol in different
failure scenarios. Section 6.8 concludes the chapter.
6.2 Related Work and Contribution
Related work on bootstrapping includes the algorithm of Voulgaris and van Steen [126], who propose a method to jump-start Pastry [81]. This protocol is specifically tailored to Pastry and its message complexity is significantly higher than that of T-Man. More
recently, the bootstrapping problem has been addressed in other specific overlays [127–
129]. These algorithms, although reasonably efficient, are specific to their target overlay
networks.
An approach closer to T-Man is Vicinity, described in [130]. Although Vicinity was inspired by the earliest version of T-Man, it does contain notable original components related to overlay maintenance, such as churn management, and other techniques to boost performance.
Finally, we mention related work that uses gossip-based probabilistic and lightweight
algorithms. We note that these algorithms are targeted neither at efficient bootstrapping,
nor at generic topology management. Massoulié and Kermarrec [131] propose a protocol
to evolve a topology that reflects proximity. More recent protocols applying similar principles include [132] and [133]. Repair protocols used extensively in many DHT overlays
also belong to this category (e.g., [83, 134, 135]).
Our contribution with respect to related work is threefold. First, we introduce a
lightweight probabilistic protocol that can construct a wide range of overlay networks
based on a compact and intuitive representation: the ranking method. The protocol has
a small number of parameters, and relies on minimal assumptions, such as nodes being
able to obtain a random sample from the network (the peer sampling service). The protocol is an improved and simplified version of earlier variants presented at various workshops [15–17]. Second, we develop novel insights into the tradeoffs of parameter settings based on an analogy between T-Man and epidemic broadcasts. We describe the dynamics of the protocol considering it as an epidemic broadcast, restricted by certain factors defined by the parameters and properties of the ranking method (that is, the properties of the desired overlay network). We also analyze storage complexity. Third, we present novel algorithmic techniques for initiating and terminating the protocol execution. We present extensive simulation results that support the efficiency and reliability of T-Man.
6.3 System Model
We consider a set of nodes connected through a routed network. Each node has an address
that is necessary and sufficient for sending it a message. Furthermore, all nodes have
a profile containing any additional information about the node that is relevant for the
definition of an overlay network. Node ID, geographical location, available resources,
etc. are all examples of profile information. The address and the profile together form the
node descriptor. At times, we will use “node descriptor” and “node” interchangeably if
this does not cause confusion.
The network is highly dynamic; new nodes may join at any time and existing nodes
may leave, either voluntarily or by crashing. Our approach does not require any mechanism specific to leaves: spontaneous crashes and voluntary leaves are treated uniformly.
Thus, in the following, we limit our discussion to node crashes. Byzantine failures, with
nodes behaving arbitrarily, are excluded from the present discussion.
We assume that nodes are connected through an existing routed network, where every
node can potentially communicate with every other node. To actually communicate, a
node has to know the address of the other node. This is achieved by maintaining a partial
view (view for short) at each node that contains a set of node descriptors. Views can be
interpreted as sets of edges between nodes, naturally defining a directed graph over the
nodes that determines the topology of an overlay network.
Communication incurs unpredictable delays and may be subject to failures. Single messages could be lost, and links between pairs of nodes may break. Nodes have access to local
clocks that can measure the passage of real time with reasonable accuracy, that is, with
small short-term drift. Local clocks are not required to be synchronized.
Finally, we assume that all nodes have access to the peer sampling service (see Chapter 2) that returns random samples from the set of nodes in question. From a theoretical
point of view we will assume that these samples are indeed random. From a practical
point of view, we have seen that the peer sampling service indeed has suitable realistic
implementations that provide high quality samples at a low cost.
6.4 The Overlay Construction Problem
Intuitively, we are interested in constructing some desirable overlay network, possibly
from scratch, by filling the views at all nodes with descriptors of the appropriate neighbors. For example, we might want to organize the nodes into a ring where the nodes
appear in increasing order based on their ID. Or we might want to construct a proximity
network, where the neighbors of a node are those that are closest to it according to some
metric.
We allow for arbitrary initial content of the views of the nodes in this problem definition (including empty views), noting that, as mentioned in our system model, nodes have
access to random samples from the network, so they have access to at least random nodes
from the network. In other words, starting from any arbitrary network, we want to fill the
node views with the appropriate neighbors as fast as possible at a reasonable cost.
In order to have a well defined problem, we need to specify how the desired overlay
is represented as an input to the protocol. The representation must be compact, intuitive,
yet descriptive enough to capture the widest possible range of topologies.
Our proposal for representing the desired overlay is the ranking method. As explained before, the ranking method sorts a set of nodes (potential neighbors) according to
the “taste” of a given base node. More formally, the input of the problem is a set of N
nodes, the target view size K (bounded by N) and a ranking method RANK. The ranking
method takes as parameters the base node x and a set of nodes {y1 , . . . , yj }, j ≤ N, and
outputs an ordered list of these j nodes. All nodes in the network apply the same ranking
method, which they are assumed to know a priori. We will analyze and test only ranking
methods that are based on a partial ordering of the given set, and that return some total
ordering consistent with this partial ordering (note however, that this is not an inherent
restriction). Accordingly, we allow for an element of uncertainty (if there can be many
total orderings consistent with the partial ordering we pick a random one).
A target graph that we wish to construct is defined by the ranking method. We present
the definition of a target graph in a constructive way, through the following (inefficient)
approach, for illustration. In this approach, each node disseminates its descriptor to all
other nodes such that eventually, every node has collected locally the descriptor of every
node in the network. At this point, each node sorts this set of descriptors according to
the ranking method and picks the first K elements to be its neighbors. The resulting
structure is called a target graph. Note that in this manner we define a graph, and not only
a topology, because in addition to knowing the structure of the network, such as a ring,
we also know the exact location of each node in the structure.
Disseminating all the descriptors to all the nodes is a naive solution to this overlay
construction problem with a communication cost that is at least linear in N for each node
and a storage cost that is also linear in N for each node. A practical approach has to
significantly reduce the cost of this naive solution both in terms of communication and
storage. The T-M AN protocol described in the next section achieves precisely this.
Figure 6.1: Target graphs for different ranking methods and K = 2. (a) One-dimensional
distance-based, circular ranking method applied to a set of uniform node profiles; (b)
same ranking method as before but with a different set of node profiles that are clustered;
(c) direction-dependent ranking method achieves sorting even for clustered node profiles.
Although representing the target graph through the ranking method and parameter K
clearly restricts the scope of the algorithm, through the examples presented here and in the
rest of this chapter we will see that a wide range of interesting applications are covered.
One (but not the only!) way of actually defining useful ranking methods is through a
distance function that defines a metric space over the set of nodes. The ranking method
can simply return an ordering of the given set according to non-decreasing distance from
the base node.
To clarify the notions of ranking method and target graphs, let us consider a few
simple examples, where K = 2 and the profile of a node is a real number in the interval
[0, M[. We can define a ranking method based on the one-dimensional distance function
d(a, b) = |a − b|, in which case the target graph will be linear. Alternatively—to connect
the two ends of the line—we can use d(a, b) = min(M − |a − b|, |a − b|), which results
in a circular structure. As illustrated in Figure 6.1(a), if the node profiles are more-or-less
uniformly distributed over the interval [0, M[, the target graph that belongs to the circular
distance function will be a connected ring. If the node profiles are not evenly distributed
over [0, M[ but are clustered, the very same ranking method will result in a target graph
that consists of disconnected clusters (Figure 6.1(b)).
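The contrast between Figures 6.1(a) and 6.1(b) can be reproduced with a short sketch. The code and the profile values are ours, and M = 100 is an arbitrary choice; only the circular distance function comes from the text.

```python
# With uniform profiles, the K = 2 target graph under the circular distance
# d(a, b) = min(M - |a - b|, |a - b|) is a connected ring; with clustered
# profiles the very same ranking method yields disconnected clusters.

M = 100.0

def d_circ(a, b):
    return min(M - abs(a - b), abs(a - b))

def neighbors(profiles, K):
    return {x: sorted((y for y in profiles if y != x),
                      key=lambda y: d_circ(x, y))[:K]
            for x in profiles}

def connected(graph):
    """BFS over the undirected version of the K-nearest-neighbor graph."""
    nodes = list(graph)
    und = {x: set(graph[x]) for x in nodes}
    for x in nodes:
        for y in graph[x]:
            und[y].add(x)          # treat edges as undirected
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for y in und[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(nodes)

uniform = [10.0 * i for i in range(10)]                  # evenly spread on [0, M[
clustered = [1.0, 2.0, 3.0, 4.0, 51.0, 52.0, 53.0, 54.0]  # two clusters

ring_like = connected(neighbors(uniform, K=2))     # connected ring
split_up = connected(neighbors(clustered, K=2))    # two disconnected clusters
```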
It is important to note that there are target graphs of practical interest that cannot be
defined through a global distance function. This is the main reason for using ranking
methods, as opposed to relying exclusively on the notion of distance; the ranking method
is a more general concept than distance. This fact will become important in Chapter 7
(practical application example), where it is necessary to be able to build, for example, a
ring, even in the case of uneven node descriptor distributions when distance-based ranking methods would define clustered target graphs (as in Figure 6.1(b)). Figure 6.1(c)
illustrates how a direction-dependent ranking can be used to avoid clustering in the target
graph. Here, the output of the ranking method RANK(x, {y1 , . . . , yj }) is defined as follows. We first construct a sorted ring out of the set of input profiles y1 , . . . , yj and the
base node x. We then assign a rank value to each node defined as the minimal hop count
to the node from x in this ring. The output of the ranking method is a list of the input
profiles ordered according to this rank value. In this manner, the first 2α positions in the
ranking contain α nodes preceeding x and α nodes following x in the sorted ring; hence
the name “direction-dependent”.
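One possible reading of this direction-dependent ranking method in Python is sketched below. The function names are ours; profiles are numbers in [0, M[ and we assume all profiles are distinct.

```python
# Direction-dependent ranking: sort x together with the candidates into a
# ring, rank each candidate by its minimal hop count from x along that ring,
# and order the candidates by this rank.

def rank_directional(x, candidates):
    ring = sorted(candidates + [x])
    n = len(ring)
    i = ring.index(x)
    def hops(y):
        j = ring.index(y)
        return min((j - i) % n, (i - j) % n)   # minimal hop count in the ring
    return sorted(candidates, key=hops)

# Even for heavily clustered profiles, the first two positions hold the ring
# predecessor and successor of x, so a sorted ring can still be built.
out = rank_directional(10, [1, 2, 3, 51, 52, 53])
# the first two entries are x's two ring neighbors: 3 and 51
```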
Algorithm 13 T-Man
1: loop
2:    wait(∆)
3:    p ← selectPeer(ψ, rank(myDescriptor, view))
4:    sendPush(p, toSend(p, m))

5: procedure onPush(msg)
6:    sendPull(msg.sender, toSend(msg.sender, m))
7:    onPull(msg)

8: procedure onPull(msg)
9:    view.merge(msg.buffer)

10: procedure toSend(p, m)
11:    buffer ← (myDescriptor)
12:    buffer.append(view)
13:    buffer ← rank(p, buffer)
14:    return buffer.subList(1, m)
6.5 The T-Man Protocol
As mentioned earlier, the T-Man protocol is based on a gossiping scheme, in which all nodes periodically exchange node descriptors with peer nodes, thereby constantly improving the set of nodes they know—their partial views. Each node executes Algorithm 13. Notice that the algorithm fits in the scheme of the gossip protocols discussed so far. Any given view contains the descriptors of a set of nodes. As before, the view is a list data structure and it is also a set (each node has at most one descriptor in the view). Method merge is a set operation in the sense that it keeps at most one descriptor for each node. Parameter m denotes the message size as measured in the number of node descriptors that the message can hold. Method selectPeer selects a random sample among the first ψ entries in the list given as its second parameter.
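To illustrate how these pieces fit together, the following Python sketch simulates Algorithm 13 under the simplified synchronous model that we also adopt in Section 6.6. The code is ours, not part of the protocol specification: the ring ranking method, the parameter values and all function names are arbitrary illustrative choices.

```python
import random

# Synchronous toy simulation of T-Man. Profiles are the IDs 0..N-1, the
# ranking method orders descriptors by circular distance on the ID ring
# (so the K = 2 target graph is the sorted ring), and message passing is
# simulated by direct access to the views.
random.seed(1)
N, M, PSI, CYCLES = 50, 10, 3, 40   # network size, message size, psi, cycles

def dist(a, b):
    return min((a - b) % N, (b - a) % N)

def rank(base, descriptors):
    return sorted(descriptors, key=lambda y: dist(base, y))

# initial views: 5 random links per node
views = {x: set(random.sample([y for y in range(N) if y != x], 5))
         for x in range(N)}

def to_send(sender, receiver):
    # buffer = own descriptor plus view, ranked with the *receiver* as base
    return rank(receiver, [sender] + list(views[sender]))[:M]

def select_peer(x):
    # random choice among the PSI best-ranked known neighbors
    return random.choice(rank(x, list(views[x]))[:PSI])

for _ in range(CYCLES):
    for x in range(N):
        p = select_peer(x)
        push, pull = to_send(x, p), to_send(p, x)   # push-pull exchange
        views[p] |= set(push); views[p].discard(p)  # merge is a set union
        views[x] |= set(pull); views[x].discard(x)

# fraction of nodes that already know both of their ring neighbors
found = sum({(x - 1) % N, (x + 1) % N} <= views[x] for x in range(N)) / N
```

In our runs, nearly all nodes discover both of their ring neighbors well within the simulated cycles, in line with the rapid convergence discussed in this chapter.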
In this section we do not specify how node views are initialized. In the rest of the chapter, we always describe the particular node view initialization procedure that we assume.
These procedures include random initialization for the purposes of theoretical analysis in
Section 6.6 and practical solutions based on various broadcasting schemes and realistic
random peer sampling in Section 6.7.
We note that the protocol does not place a limit on the view size. This is done in order
to decrease the number of parameters, thereby simplifying the presentation. One might
expect that lack of a limit on view size might present scalability problems due to views
growing too large. As we will show in Section 6.6, however, the storage complexity of
nodes due to views grows only logarithmically as a function of the network size. Furthermore, preliminary experiments for the applications we consider show that imposing a
comfortable limit on view sizes (larger than both m and K) does not result in any observable decrease in performance. This suggests that the simplification of ignoring view size
limits is justified and is not critical for these applications. As usual, we define a cycle to
be an interval of ∆ time units.
Figure 6.2 illustrates the results of T-Man for constructing a torus (visualizations were obtained using [136]). For this example, it is clear that only a few cycles are sufficient for convergence, and the target graph is already evident even after the first few cycles. In the next sections we will show that this rapid convergence is not unique to the torus example but that T-Man performs well in a wide range of settings and that it is scalable, very similarly to epidemic broadcast protocols.
In Table 6.1 we summarize the parameters of the protocol. Note that K (target view
size) controls the size of the target graph, and consequently, affects the running time of
the protocol. For example, if we increase K while keeping the ranking method fixed, then
the protocol will take longer to converge since it has to find a larger number of links.
(snapshots after 2, 3, 4, 6, 7 and 10 cycles)
Figure 6.2: Illustration of constructing a torus over 50 × 50 = 2500 nodes, starting from a
uniform random graph with initial views containing 20 random entries and the parameter
values m = 20, ψ = 10, K = 4.
6.6 Key Properties of the Protocol
In this section we study the behavior of our protocol as a function of its parameters, in
particular, m (message size), ψ (peer sampling parameter) and the ranking method RANK.
Based on our findings, we will extend the basic version of the peer selection algorithm
with a simple “tabu-list” technique as described below. Furthermore, we analyze the
storage complexity of the protocol and conclude that on the average, nodes need O(log N)
storage space where N is the network size.
To be able to conduct controlled experiments with T-Man on different ranking methods, we first select a graph instead of a ranking method, and subsequently “reverse-engineer” an appropriate ranking method from this graph by defining the ranking to be the ordering consistent with the minimal path length from the base node in the selected graph. We will call this selected graph the ranking graph, to emphasize its direct relationship with the ranking method.
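This reverse-engineering step can be sketched as follows. The code is our illustration; tie-breaking by node id stands in for the random tie-breaking of equally ranked nodes mentioned in Section 6.4.

```python
from collections import deque

# Reverse-engineering a ranking method from a ranking graph: the rank of y
# with respect to base node x is the shortest-path length from x to y.

def bfs_distances(graph, source):
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def rank_from_graph(graph, x, candidates):
    dist = bfs_distances(graph, x)
    # unreachable nodes rank last; ties broken by node id for determinism
    return sorted(candidates, key=lambda y: (dist.get(y, float("inf")), y))

# A 6-cycle as the ranking graph:
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
order = rank_from_graph(ring, 0, [1, 2, 3, 4, 5])
# nodes at hop count 1 come first (1 and 5), then 2 and 4, then 3
```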
Note that the target graph is defined by parameter K, so the target graph is identical
to the ranking graph only if the ranking graph is K-regular. However, for convenience, in
this section we will not rely on K because we either focus on the dynamics of convergence
(as opposed to convergence time), which is independent of K, or we study the discovery
of neighbors in the ranking graph directly.
In order to focus on the effects of parameters, in this section we assume a greatly simplified system model where the protocol is initiated at the same time at all nodes, where there are no failures, and where messages are delivered instantly. While these assumptions are clearly unrealistic, in Section 6.7 we show through event-based simulations that the protocol is extremely robust to failures, asynchrony and message delays even in more realistic settings.

RANK()  Ranking method: determines the preference of nodes as neighbors of a base node
K       Target view size: along with RANK(), it determines the target graph
∆       Cycle length: sets the speed of convergence but also the communication cost
ψ       Peer sampling parameter: peers are selected from the ψ most preferred known neighbors
m       Message size: maximum number of node descriptors that can be sent in a single message

Table 6.1: Parameters of the T-Man protocol.
6.6.1 Analogy with the Anti-Entropy Epidemic Protocol
In Section 6.4 we defined the overlay construction problem with the help of a naive
approach that involved the full dissemination of all the node descriptors to every node.
Here we would like to elaborate on this idea further. Indeed, the anti-entropy epidemic
protocol—when used to implement the naive approach—can be seen as a special case of
T-Man, where the message size m is unlimited (i.e., m ≥ N such that every possible node
descriptor can be sent in a single message) and peer selection is uniform random from the
entire network. In this case, independently of the ranking method, all node descriptors
that are present in the initial views will be disseminated to all nodes. Furthermore, it is
known that full convergence is reached in less than logarithmic time in expectation (see
Chapter 1).
For this reason, the anti-entropy epidemic protocol is important also as a base case
protocol when evaluating the performance of T-Man, where the goal is to achieve similar
convergence speed to anti-entropy, but with the constraint that communication is limited
to exchanging a constant amount of information in each round. Due to the communication
constraint, the performance will no longer be independent of the ranking method.
6.6.2 Parameter Setting for Symmetric Target Graphs
We define a symmetric target graph to be one where all nodes are interchangeable. In
other words, all nodes have identical roles from a topological point of view. Such graphs
are very common in the literature of overlay networks. The behavior of T-Man is more
easily understood on symmetric graphs, because focusing on a typical (average) node
gives a good characterization of the entire system.
We will focus on two ranking graphs, both undirected: the ring and a k-out random
graph, where k random out-links are assigned to all nodes and subsequently the directionality of the links is dropped. We choose these two graphs to study two extreme cases for
the network diameter. The diameter (longest minimal path) of the ring is O(N) while that
of the random graph is O(log N) with high probability.
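A possible construction of this k-out ranking graph is sketched below. The code is ours; beyond “k random out-links per node, then drop directionality” all details are our assumptions.

```python
import random

# k-out undirected random graph: assign k random out-links to every node,
# then drop the directionality of the links. For k = 2 the average degree
# is close to 4, with a small variance.

def k_out_undirected(n, k, rng):
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in rng.sample([x for x in range(n) if x != i], k):
            adj[i].add(j)
            adj[j].add(i)   # undirected version: add the reverse link too
    return adj

g = k_out_undirected(100, 2, random.Random(7))
avg_degree = sum(len(neigh) for neigh in g.values()) / len(g)
min_degree = min(len(neigh) for neigh in g.values())
```

Every node has degree at least k from its own out-links; the average degree falls slightly below 2k only when two nodes happen to pick each other.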
Figure 6.3: Time to collect 50% of the neighbors at distance one in the ranking graph, for (a) the basic T-Man protocol and (b) T-Man with a tabu list. Network size is N = 2000. Node views are initialized to contain 5 random links each. Graph (b) was obtained using a tabu list of size 4. (Each plot shows the number of cycles as a function of ψ, for the ring and the 4-out random ranking graphs, with m = 10, 20 and 2000.)

Let us examine the differences between realistic parameter settings and the anti-entropy epidemic dissemination scenario described above. First, assume that the message
size m is a small constant rather than being unlimited. In this case, the random peer selection algorithm is no longer appropriate: if a node i contacts peer j that ranks low with
i as the base node, then i cannot expect to learn new useful links from j because now (due
to the small m) node j has a strong bias in its view towards nodes that rank high with j as
a base node.
On the other hand, if a node i selects peers that rank too high with i as the base node,
then convergence might slow down as well. The reason for this is that consecutive peers
returned by the peer selection method will more often get repeated; in part because a node
i is more likely to select a peer to communicate with that selected i shortly before, and in
part because there are simply fewer nodes that are “close” to any given node than nodes
that are far from it. This in turn results in increased correlation between the partial views
of communicating partners, so the epidemic process is not maximally efficient.
Figure 6.3 illustrates this tradeoff using two ranking graphs: the ring and a random
graph. The latter is generated by first constructing a 2-out directed regular random graph
by selecting two random out-edges for each node, and subsequently taking the undirected
version of this graph. The average degree of a node is thus 4, with a small variance. The
basic version in Figure 6.3(a) applies the peer selection algorithm which picks a random
peer from the highest ranking ψ nodes from the view, as described earlier. The point
ψ = N and m = N corresponds to an anti-entropy epidemic dissemination (i.e., peer
selection is unbiased and there are no limits on message size) which is optimal.
As predicted, with no limits on the message size (m = N), we can observe the effect
due to the lack of randomness if the selected peer ranks too high (ψ is small). Furthermore, for large ψ performance again degrades when we place a limit on the message size
since the correlation between communicating peers’ ranking of the same set of nodes is
reduced. This effect is less pronounced for larger m because now we might obtain useful
information by chance even if there is little correlation between the rankings.
Figure 6.4: Number of contacts made by nodes while constructing a binary tree. Statistics are over 30 independent runs. The parameters are N = 10000, m = 20, number of cycles is 15, ψ = 10 and the tabu list size is 4. In the ranking graph, the root is node 0 and the out-links of node i are 2i + 1 and 2i + 2. (The plot shows the average number of contacts and its empirical standard deviation as a function of the node profile.)

To verify our explanation as to why performance degrades with decreasing ψ, we apply a tabu list at all nodes in order to avoid contacting the same peers over and over again. The tabu list contains a fixed number of peers that a given node communicated with most recently. The node then does not initiate connection with any nodes in its tabu list. We experimented with a tabu list size of 4. This mechanism does not add any
communication overhead since it simply records the last 4 communications, but it is rather
effective in reducing the negative effects of small ψ values as Figure 6.3(b) illustrates.
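The tabu-list mechanism can be sketched as follows. This is a minimal illustration with our own names; the protocol selects a random peer among the top ψ candidates, whereas this sketch deterministically takes the best non-tabu candidate (the aggressive ψ = 1 style selection) for clarity.

```python
from collections import deque

TABU = 4   # remember the last 4 communications, as in the experiments

def select_peer_tabu(ranked_view, tabu):
    """ranked_view: candidates ordered by the ranking method, best first."""
    for peer in ranked_view:
        if peer not in tabu:
            tabu.append(peer)   # deque(maxlen=TABU) drops the oldest entry
            return peer
    return ranked_view[0]       # every candidate is tabu: fall back to best

tabu = deque(maxlen=TABU)
picks = [select_peer_tabu(["a", "b", "c", "d", "e"], tabu) for _ in range(5)]
# consecutive picks avoid repeats: a, b, c, d, then e
```

As noted above, the list adds no communication overhead: it only records past contacts locally.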
We can draw several other conclusions from the results in Figure 6.3. First, the tabu
list slightly improves even the performance of anti-entropy epidemic dissemination with
completely random peer selection (m = ψ = N). This is due to the fact that initially
the views contain only a few nodes (five, to be precise, in this case). Without a tabu list, this
significantly increases the chance of contacting the same peers in the first few cycles,
while the views are still small. Such communications are not effective in advancing dissemination due to the correlated views of the communicating peers. Also note that when
there is no limit on message size, the random graph outperforms the ring, especially when
the tabu list is applied. This is due to the fact that the number of neighbors of a node in the
random graph increases exponentially, so even for a small set of closest nodes, diversity
is very high.
Finally, we note that the exponentially increasing neighborhood becomes a disadvantage when ψ is larger, because the view of peers that are further away from the base node
in the ranking graph will be more uncorrelated to the view of the original peer. This suggests that for such graphs, peer selection should be aggressive (ψ = 1) and should be
combined with the use of tabu lists.
6.6.3 Notes on Asymmetric Target Graphs
The topological role of nodes in asymmetric target graphs is not identical. For example, some nodes can be more central or more connected than others, there can be bridge
nodes connecting isolated clusters, and so on. While symmetric graphs already exhibit
complex behavior, we argue that asymmetric graphs cannot be treated reasonably in a
common framework. Each case needs a separate analysis that needs to take into account
the particular structure of the graph.
Figure 6.5: Time to collect 50% of the neighbors at distance one in the ranking graph, for (a) T-Man with a tabu list and (b) T-Man with a tabu list and balancing. The network size is N = 2000. Node views are initialized by 5 random links each. The tabu list size is 4. (Each plot shows the number of cycles as a function of ψ, for the binary tree and the ring ranking graphs, with m = 10, 20 and 2000.)
To understand the problem better, consider a ranking method that is independent of
the base node. This ranking method will induce a star-like structure since all nodes will
be attracted to the very same high ranking nodes. In this case, more and more nodes
will contact the nodes that rank high in the (in this case, common) ranking. As a result,
convergence speeds up enormously, at the cost of a higher load on the central nodes. The
reason is simple: the central nodes can collect the high ranking descriptors faster because
they are contacted by many nodes. Due to their central position, they also distribute them
very rapidly. One can even exploit this effect. For example, if the goal is to build a superpeer topology, with the high bandwidth nodes in the center, then the central nodes might
actually be able to deal with the extra load, thus resulting in an efficient, but still fully
self-organizing solution.
This effect can be observed in other interesting topologies as well. For example,
rooted regular trees, where the non-leaf nodes have k out-links and one in-link, except the root, which has no in-links. If the ranking graph has such a topology, the resulting target
graph will be asymmetric with highly nonuniform average traffic at nodes, as shown in
Figure 6.4. One reason for this result is that a large proportion of the nodes are leaves.
Leaf nodes, having only one neighbor, will have a tendency to talk to nodes that are further
up in the hierarchy. This adds extra load on internal nodes and puts them in a more central
position.
This in turn has a non-trivial effect on the convergence of the protocol, and allows T-Man to have better performance for trees than for symmetric graphs. Figure 6.5 illustrates this effect. In Figure 6.5(a), we can observe the performance of T-Man for a rooted and
balanced binary tree as a ranking graph. We can see that there is a peculiar minimum
when message size is unlimited but ψ is small. In this region, the binary tree consistently
outperforms the ring, even for a small m.
This effect is due to the asymmetry of a binary tree. To show this, we ran T-Man
with an additional balancing technique, to cancel out the effect of central nodes. In this
technique, we limit the number of times any node can communicate (actively or passively)
in each cycle to two. In addition, nodes also apply hunting [20], that is, when a node
contacts a peer, and the peer refuses the connection due to having exceeded its quota, the
node immediately contacts another peer until a peer accepts the connection, or the node
runs out of potential contacts. The results are shown in Figure 6.5(b). In the region of
practical settings of ψ and m, the advantage of the binary tree disappears, while the ring
preserves the same performance.
More detailed analysis reveals that in the initial cycles, nodes that are close to the root
play a bootstrap function and communicate more than the rest of the nodes. After that, as
the overlay network is taking shape, nodes that are further down the hierarchy take over
the management of their local region, and so on. This is a rather complex behavior that is emergent (not planned) but nevertheless beneficial. This also suggests that if the target graph is not symmetric, then extra attention is needed when explaining the behavior of T-Man.
6.6.4 Storage Complexity Analysis
We derive an approximation for the storage space that is needed for maintaining views
by the nodes (recall that there is no hard limit enforced by the protocol). This approximation is based on a number of simplifying assumptions that convert the problem into
a model of disseminating news items, where only the most interesting news items can
spread due to limited message size. Subsequently, we present experimental validation of
the approximation using T-Man on different realistic target graphs.
The News Spreading Model
To derive the approximation, we assume that the ranking method is independent of the
base node, that is, all nodes rank a given set of node descriptors the same way. The
rationale for this assumption is the following. One conclusion of the previous sections was
that the success of T-MAN crucially depends on the fact that whenever a node i selects a
peer j using SELECTPEER, the ranking of the current neighbors of i with node j as a base
node is similar to the ranking with node i as a base node, because this way nodes i and j
can provide relevant node descriptors to each other with a higher probability. Assuming
that the ranking does not depend on the base node means that any selected node j is
guaranteed to produce an identical ranking to node i, which is the ideal case for T-MAN.
This assumption, however, introduces a side-effect: it implies that the target graph
is a star-like structure, with the m highest ranking nodes forming a clique, and all the
other nodes pointing to these m nodes. This level of asymmetry is highly atypical and
therefore represents an unrealistic scenario for T-MAN. To “fix” this side-effect, we assume that
SELECTPEER returns a random node from the entire network, which makes the role of all
nodes identical.
In this setting, node descriptors have no relation to actual nodes anymore (that is,
the node addresses in the descriptors are never used), so we can think of the model as
spreading news items that have a natural ranking based on “interestingness”.
In the following we present a simplified deterministic model of the dynamics of news
spreading to obtain a heuristic baseline, to which the storage complexity of T-MAN can be
compared. Let n_{i,j}(t) denote the number of news items of rank j at node i at time t. The
value of n_{i,j}(t) is 0 or 1, and n_{i,j}(t) is monotonically increasing, because we assumed that
the local view size is not bounded.
First of all, if m ≥ N then the values n_{i,j}(t) (i, j ∈ {1, . . . , N}) are identically
distributed random variables, since there is no competition between the news items for the
slots available for spreading. Clearly, when m < N, then E(n_{i,j}(t)) can grow undisturbed
until the effect of the competition kicks in, in other words, until the item with rank j is
no longer competitive for the available m slots. Motivated by this, in our simplified
deterministic model we approximate n_{i,j}(t) by
\hat{n}_{i,j}(t) = \begin{cases} E(n_{i,j}(t) \mid m \ge N) & \text{if } t \le t^* \\ E(n_{i,j}(t^*) \mid m \ge N) & \text{if } t > t^* \end{cases} \qquad (6.1)
where t^* is the point in time when \hat{n}_{i,j} stops growing due to the fact that there are already
enough more interesting news items at the local node i, so that the item with rank j will
no longer be included in the most interesting m items.
For j ≤ m we know that t^* = ∞, because these items will always be included in any
message sent. For j > m, this point in time can be calculated using the fact that at time t^*
\sum_{k=1}^{j} \hat{n}_{i,k}(t^*) = j\,\hat{n}_{i,j}(t^*) = m, \qquad (6.2)
where the first equation comes from the fact that the n_{i,j}(t) (i, j ∈ {1, . . . , N}) are identically
distributed. (Note that although t^* can be calculated, we do not actually need to calculate
it; we only need to know that it is well defined.) This means that we have
\hat{n}_{i,j}(t) = \frac{m}{j}, \quad \text{for } t > t^*,\ j > m. \qquad (6.3)
Figure 6.6 compares the prediction of this model and the converged distribution obtained
experimentally via simulation with T-MAN. The figure uses the notation n_j = \sum_i n_{i,j},
which is the overall number of news items of rank j in the network. The indicated prediction is, accordingly, \sum_i \hat{n}_{i,j} = Nm/j.
Equation (6.3) allows us to approximate the actual storage space that is required for
the views of the nodes. We focus only on the items with a rank value greater than m, because the
highest ranking m items will be stored by all the nodes, taking a constant amount of space.
The sum of all entries with a rank value greater than m stored in the system after convergence
(when all messages are composed of the m most popular items already) is
\sum_{j=m}^{N} \hat{n}_j = \sum_{j=m}^{N} \frac{Nm}{j} = Nm \sum_{j=m}^{N} \frac{1}{j} = \Theta(N \log N), \qquad (6.4)
where we used the well-known approximation of the harmonic number and the fact that
m is constant. Therefore each view stores Θ(log N) entries on average. Note that this
result is independent of the number of iterations executed to reach convergence, and it is
also independent of the actual form of the function \hat{n}_{i,j}(t); recall that the only assumption
we made was that these functions are monotonically increasing.
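The news-spreading model itself is easy to simulate, which gives a quick sanity check of the derivation above. The sketch below is illustrative Python, not the simulator used in the dissertation; the function name, default parameters and the push-only exchange are our own simplifications. Every node pushes its m most interesting known items to a random node in each cycle, and views are unbounded, matching the model's assumptions:

```python
import random

def simulate_news(N=200, m=10, cycles=60, seed=1):
    """Toy news-spreading model: N nodes and N news items ranked
    1..N (lower rank = more interesting). Each cycle, every node
    pushes its m most interesting known items to a random node.
    Views are unbounded, as assumed in the derivation."""
    rnd = random.Random(seed)
    # each node starts with a random constant-size set of items
    views = [set(rnd.sample(range(1, N + 1), m)) for _ in range(N)]
    for _ in range(cycles):
        for i in range(N):
            target = rnd.randrange(N)
            views[target] |= set(sorted(views[i])[:m])  # top-m push
    # n[j-1] = number of nodes that know the item with rank j
    return [sum(1 for v in views if j in v) for j in range(1, N + 1)]
```

After convergence, the items with rank j ≤ m are known to every node, while for j > m the counts n_j stop growing early and decay roughly as Nm/j, in line with Equation (6.3).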
Finally, we note that \hat{n}_j = Nm/j = Nmj^{-1} results in a power law distribution, as
it follows the form j^{-γ}. Power laws are very frequently observed in complex evolving
networks [114]. The phenomenon is often due to some form of “the rich get richer”
effect. One can link our results to the study of other complex networks, for example,
social networks. All nodes start with a random constant-size set of news items, and they
always gossip only the m most interesting ones that they currently know. These dynamics
result in a power law distribution of news items, with the most interesting news being
known to everyone. Furthermore, each participant learns about only Θ(log N) of the
overall N news items available.
Figure 6.6: Experimental results and values predicted by Equation (6.3) for n_j with two
sets of parameters: N = 10^4, m = 20 and N = 10^5, m = 40. For each j, the converged
value of n_j is indicated as a separate point. The observed values correspond exactly to the
predicted ones for the initial constant section, and are covered by the line segment on the
graph.
Figure 6.7: Experimental and predicted values of n_j for three different ranking graphs
(ring, binary tree, and 4-out random), with snapshots of the simulation after cycles 2, 4
and 10. Experiments were run with N = 10^4, m = 20 and ψ = 10, without a tabu list.
Empirical Validation
We verify experimentally that the prediction in (6.3) holds for T-MAN when different
ranking methods are employed. This, in turn, supports the claim that Equation (6.4)
characterizes the storage complexity of the protocol.
We need to generalize n_j, since the ranking can now depend on the base node. Let n_j be
the number of nodes that know about the node with rank j according to their own ranking
of the entire network. Figure 6.7 shows the values of n_j for three ranking graphs at three
different times. Although the experiments reported in Figure 6.7 were performed without
a tabu list, further experiments (not shown) indicate that tabu lists have no observable effect
on the distribution of ranks in the views. They only speed up convergence of the protocol,
as discussed earlier.
In Figure 6.7 we can observe that the ring fulfills the assumptions of Section 6.6.4
best: the n_j values that have not stopped growing have the same value at each time point,
which means they indeed grow at the same rate. The largest deviation can be observed in
the case of the random graph. There, the growth of the n_j values slows down smoothly,
which implies that the assumption that they grow at the same rate does not hold. This results
in a slight “overshoot”, where the observed values are slightly higher than those predicted.
Note that in the case of the binary tree, the predicted values closely match the observed
ones even though the topology is not symmetric. This further underlines the robustness
of the prediction. In other words, the seemingly strong assumptions of the theory in fact
leave the essential dynamics almost unchanged, which indicates that we have captured
important features of the protocol. Of course, the more central nodes need more storage
capacity; the prediction holds only on average. However, in our preliminary experiments
(not shown), we have seen that setting a reasonable hard limit on the view size that is
significantly larger than m (for example, 1000 items) does not result in any significant
difference in performance. For this reason we opted for the simplified discussion and we
omit hard limits on the view size in the following.
6.7 Experimental Results
In the previous section we considered the most basic version of the protocol to shed light
on its convergence properties and storage complexity. This section is concerned with
developing additional techniques that allow for the practical application of the protocol; in
particular, we address two important problems: how to start and how to stop the protocol.
We also present an extensive empirical analysis under different parameter settings and
different failure scenarios, preceded by a brief discussion of the simulation environment
and the figures of merit we focus on.
6.7.1 A Practical Implementation
So far we have assumed that the protocol is started at all nodes at once, in a synchronous
fashion, and we did not deal with termination at all. We also assumed that at all
nodes the initial set of known peers is a random sample from the network. In this section,
we replace these unrealistic assumptions with practically feasible solutions.
Peer Sampling Service
The peer sampling service provides each node with continuously up-to-date random samples of the entire population of nodes. Such samples fulfill two purposes: they enable the
random initialization of the T-MAN view, as discussed in Section 6.5, and they make it possible to implement a starting service as well, allowing for the deployment of various gossip-based broadcast and multicast protocols.
We consider an instantiation of the peer sampling service based on the NEWSCAST
protocol (see Section 2.2.4), chosen for its low cost, extreme robustness and minimal
assumptions. The basic idea of NEWSCAST is that each node maintains a local set of
random node addresses: the (partial) view. Periodically, each node sends its view to a
random member of the view itself. When receiving such a message, a node keeps a fixed
number of the freshest addresses (based on timestamps), selected from those locally available
in the view and those contained in the message.
Each node sends one message to one other node during a fixed time interval. Implementations exist in which these messages are small UDP messages containing approximately 20-30 IP addresses, along with the ports, timestamps, and descriptors such as node
IDs. The time interval is typically long, in the range of 10 s. The cost is therefore small,
similar to that of heartbeat messages in many distributed architectures. The protocol provides high quality (i.e., sufficiently random) samples not only during normal operation
(with relatively low churn), but also during massive churn and even after catastrophic
failures (up to 70% of the nodes may fail), quickly removing failed nodes from the local views
of correct nodes.
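The view-exchange rule described above can be sketched as follows. This is illustrative Python, not the NEWSCAST implementation: the class name, the `view_size` parameter and the direct method calls (standing in for UDP messages) are our own devices; only the merge rule (keep the freshest descriptors from the union of both views) follows the description in the text.

```python
import heapq
import random

class NewscastNode:
    """Minimal sketch of a NEWSCAST participant (illustrative)."""
    def __init__(self, addr, view_size=20):
        self.addr = addr
        self.view_size = view_size
        self.view = {}        # peer address -> timestamp of its descriptor
        self.clock = 0        # local logical clock used as a timestamp

    def gossip(self, peers):
        """One active cycle: push-pull with a random member of the view."""
        self.clock += 1
        if not self.view:
            return
        peer = peers[random.choice(list(self.view))]
        # each party sends its view plus a fresh descriptor of itself
        mine = dict(self.view); mine[self.addr] = self.clock
        theirs = dict(peer.view); theirs[peer.addr] = peer.clock
        self._merge(theirs)
        peer._merge(mine)

    def _merge(self, received):
        # keep the freshest timestamp per address, then the freshest
        # view_size addresses overall (never storing ourselves)
        merged = dict(self.view)
        for a, ts in received.items():
            if a != self.addr and ts > merged.get(a, -1):
                merged[a] = ts
        self.view = dict(heapq.nlargest(self.view_size, merged.items(),
                                        key=lambda kv: kv[1]))
```

Starting from any weakly connected bootstrap graph, repeated gossip cycles keep every view bounded by `view_size` while continuously refreshing its contents.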
Starting and Terminating the Protocol
We implemented a simple starting mechanism based on well-known broadcast protocols.
The content of the broadcast message may be a simple “wake up” specifying when to build
a predefined network, or it may include additional information specifying what network to
build (e.g., by providing the implementation of a specific ranking function). To simplify
our simulation environment, we adopt the first approach; technical issues related to the
second one may be easily solved in a real implementation.
The following terminology is used when discussing the starting mechanism. We say
that a node is active if it is aware of and explicitly participating in a specific instance of
T-MAN; if the node is not aware that a protocol is being executed, it is called inactive.
Initially, there is only one active node, the initiator, activated by an external event
(e.g., a user’s request). An inactive node may become active by exchanging information
with nodes that are already active. When a node becomes active, it immediately starts
executing the T-MAN protocol. The final goal is to activate all nodes in the system, i.e., to
start the protocol at all nodes.
The actual implementation of the broadcast can take many forms that differ mainly in
communication overhead and speed.
Flooding As soon as a node becomes active for the first time, it sends a “wake-up” message to a small set of random nodes, obtained from the peer sampling service. Subsequently, it remains silent.
Anti-Entropy, Push-only Periodically, each active node selects a random peer and sends
a “wake-up” message [20].
Figure 6.8: Probability distribution (left) and cumulative distribution (right) of end-to-end
delays as reported in the King data set [137].
Anti-Entropy, Push-Pull Periodically, each node (active or not) exchanges its activation
state with a random peer. If either of them was active, they both become active [20].
As described above, a node becomes active as soon as it receives a message from
another active node. Note, however, that messages belonging to the starting protocol are
not the only source of activation; a node may also receive a T-MAN message from a node
that has already started to execute the protocol. This message also activates the recipient
node.
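The anti-entropy push-pull starting mechanism is simple enough to sketch in a few lines. This is an illustrative Python model (function names and the synchronous round structure are our own; in a deployment these state exchanges would be piggybacked on peer-sampling messages):

```python
import random

def push_pull_round(active, rnd):
    """One round: every node exchanges its activation state with a
    random peer; if either of the two is active, both become active."""
    n = len(active)
    for i in range(n):
        j = rnd.randrange(n)
        if active[i] or active[j]:
            active[i] = active[j] = True

def rounds_to_activate_all(n=1024, seed=3):
    """Rounds needed to activate the whole network from one initiator.
    Push-pull spreads exponentially, so this is O(log n)."""
    rnd = random.Random(seed)
    active = [False] * n
    active[0] = True  # the initiator
    rounds = 0
    while not all(active):
        push_pull_round(active, rnd)
        rounds += 1
    return rounds
```

Running this for a network of 1024 nodes activates everyone within a couple of dozen rounds, illustrating why the anti-entropy variants introduce only a few seconds of extra delay compared to flooding.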
As is well known, flooding is fast and effective but very expensive due to message
duplications. In comparison, the most important advantage of the other two approaches is
the dramatically lower communication overhead per unit time. The overhead can be further
reduced to almost zero, due to the fact that the starting service messages can be piggybacked, for example, on the NEWSCAST messages that implement the peer sampling service.
After the target graph has been built, the protocol does not need to run anymore and
therefore must be terminated. Clearly, detecting global convergence is difficult and expensive: what we need is a simple local mechanism that can terminate the protocol at all
nodes independently.
We propose the following mechanism. Each node monitors its own local view. If
no changes (i.e., node additions) are observed for a specified period of time (δ_idle), it
suspends its active thread. We call this state suspended. If a view change occurs while
a node is suspended (due to an incoming message initiated by another node that is still
active), the node switches back to the active state and resets the timer that measures idle
time.
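This suspension rule can be captured in a few lines. The sketch below is illustrative (the class name and the injectable clock are our own devices, added so the rule is testable); it only encodes the logic stated above: suspend after δ_idle seconds without view changes, and reactivate on any view change.

```python
import time

class TerminationMonitor:
    """Local termination rule for a gossiping node (sketch)."""
    def __init__(self, idle_delta=4.0, now=time.monotonic):
        self.idle_delta = idle_delta   # the delta_idle parameter, in seconds
        self.now = now                 # clock function (injectable for tests)
        self.last_change = now()
        self.suspended = False

    def on_view_change(self):
        # any node addition resets the idle timer and reactivates the node
        self.last_change = self.now()
        self.suspended = False

    def should_run_active_cycle(self):
        # called once per cycle by the active (gossiping) thread
        if self.now() - self.last_change >= self.idle_delta:
            self.suspended = True
        return not self.suspended
```

A suspended node stops initiating exchanges but keeps listening; an incoming message that changes its view calls `on_view_change` and wakes it up again.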
6.7.2 Simulation Environment
All the experiments are event-based simulations, performed using PEERSIM, an open-source simulator designed for large-scale P2P systems and publicly available at SourceForge [66]. The applied transport layer emulates end-to-end delays between pairs of nodes
based on the traces of the King data set [137]. Delays reported in these traces range from
1 ms to 400 ms, and the probability distribution is as shown in Figure 6.8.
The following parameters are fixed in the experiments: the size of the tabu list is 4, and
the peer selection parameter (ψ) is 1. Unless different values are explicitly mentioned, the
message size (m) is 20, the cycle length (Δ) is 1 s, and the value of δ_idle is set to 4 s. Each
experiment is repeated 50 times with different random seeds. Plots show the average
of the observed measures, along with error bars; when graphically feasible, individual
experiments are displayed as separate dots with a small random translation.
6.7.3 Ranking Methods
To emphasize the robustness of T-MAN with respect to the actual target graph being built, we performed all experiments on two different tasks: building a sorted ring, and building a
binary tree. These two graphs have very different topologies: the ring has a large (linear)
diameter, while the tree has a small (logarithmic) one. Besides, as pointed out in Section 6.6.3, in the tree some nodes are more central than others, while in the ring all nodes
are equal from this point of view.
In the previous sections, we applied the concept of a ranking graph to (implicitly)
define the ranking method. This approach is not practical, so we need to define explicit
and locally computable ranking methods.
Sorted Ring
Creating a sorted ring is very useful, for example, for the decentralized computation of
the ranking of nodes [13] or for jump-starting distributed hash tables, such as CHORD [83].
The latter application is further discussed in Chapter 7.
We assume that the node profile is an element of a collection over which a total
ordering relation is defined. In particular, we work with 60-bit integers as node profiles,
initialized at random for each node. We want the target graph to be a ring in
which the node profiles are ordered, except for one pair, where the largest and smallest values
meet to close the ring.
To achieve this target graph, the output of the ranking method RANK(x, y_1, . . . , y_k)
is defined as follows. First we construct a sorted ring (as defined above) out of the set
of input profiles y_1, . . . , y_k and the base node x, and assign a rank value to all nodes: the
minimal hop count from x in this ring. The output of the ranking method is an ordered list
of the input profiles according to these assigned rank values. Note that this is a direction-dependent ranking method that cannot be induced by a distance metric over the node
profiles. For simplicity, we will call T-MAN with this ranking method SORTED RING.
Binary Tree
The second topology we consider is an undirected rooted binary tree. To achieve a well
controlled target graph for the sake of experimental comparison, the node profiles are defined as follows. If there are N nodes, then we assign the integers 1, . . . , N to the nodes in
some arbitrary order. The node with value 1 is the root. Using the fixed-width binary representation of
these integers, the node 0a_2 . . . a_m has two children: a_2 . . . a_m 0 and a_2 . . . a_m 1. Numbers
starting with 1 belong to leaves.
It is easy to calculate the shortest path length in this tree between two arbitrary nodes,
based on the two node profiles. This notion of distance is used to define the ranking
function required by T-MAN to build the tree: RANK(x, y_1, . . . , y_k) sorts the input profiles
y_1, . . . , y_k according to distance from the base node x. For simplicity, we will call T-MAN
with this ranking method TREE.
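Note that under the fixed-width encoding above, the children of the node with value v are exactly 2v and 2v+1 (standard binary-heap numbering with root 1), so the distance and the ranking method can be sketched as follows (illustrative Python; function names are our own):

```python
def tree_distance(u, v):
    """Hop count between nodes u and v in the binary tree, where the
    children of node k are 2k and 2k+1 and the root is 1. Repeatedly
    step the deeper node up to its parent until the two meet."""
    d = 0
    while u != v:
        if u > v:
            u //= 2   # the larger index is at least as deep: step up
        else:
            v //= 2
        d += 1
    return d

def tree_rank(x, profiles):
    """TREE ranking method (sketch): sort the input profiles by tree
    distance from the base node x."""
    return sorted(profiles, key=lambda p: tree_distance(x, p))
```

For example, from base node 4 the child 9 is one hop away, the root 1 is two hops away, and node 3 is three hops away.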
Figure 6.9: Convergence time as a function of network size (from 2^10 to 2^18), using
different starting protocols (anti-entropy push, anti-entropy push-pull, flooding, and
synchronous start), for SORTED RING (left) and TREE (right).
6.7.4 Performance Measures
We are interested both in the effectiveness (speed and quality) and efficiency (cost) of the
protocol. We evaluate our protocols using the following performance measures: convergence time, target links found, termination time and communication costs.
convergence time The time needed to obtain the perfect target graph. In the case of
SORTED RING, each node must know at least its first successor and predecessor in
the sorted ring. For TREE, each node other than the root must know its parent,
and non-leaf nodes must know their children.
target links found The number of links in the target graph that are actually found by
T-MAN at a certain time, typically at termination time. This allows for a more fine-grained assessment of performance than convergence time.
termination time The total time needed to complete (start, execute and stop) the protocol
at all nodes. This may be considerably longer than convergence time, although, as
we will see, typically only a few nodes are still active after reaching convergence.
communication cost The number of messages exchanged. Note that all messages ever
exchanged are of the same size.
The unit of time will be cycles or seconds, depending on which is more convenient
(note that cycle length defaults to 1 s). We also note that convergence time is not defined
if the protocol terminates before converging. In this case, we use the number of identified
target links as a measure.
6.7.5 Evaluating the Starting Mechanism
Figure 6.9 shows the convergence time for SORTED RING and TREE, using the starting
protocols described in Section 6.7.1. The cycle length of the anti-entropy versions was the
same as that of T-MAN, and the flooding protocol used 20 random neighbors at all nodes.
The case of synchronous start is also shown for comparison. Note that these figures do not
represent a direct measure of the performance of well-known starting protocols; rather,
the convergence time plotted here represents the overall time needed to both start the protocol
and reach convergence, with T-MAN and the broadcast protocol running concurrently.
Figure 6.10: Convergence time (bottom curves) and termination time (top curves) as a
function of δ_idle, for SORTED RING (left) and TREE (right), for network sizes 2^10, 2^13
and 2^16.
In the case of flooding, “wake-up” messages quickly reach all nodes and activate the
protocol; almost no delay is observed compared to the synchronous case. Anti-entropy
mechanisms result in a few seconds of delay. In the experiments that follow, we adopt the
anti-entropy push-pull approach, as it represents a good trade-off between communication costs and delay. Note, however, that (unlike the push approach) the push-pull approach
assumes that at least the starting service itself has already been started at all nodes.
6.7.6 Evaluating the Termination Mechanism
We experimented with various settings for δ_idle, ranging from 2 s to 12 s. Figure 6.10 shows
both convergence time (bottom three curves) and termination time (top three curves) for
different values of δ_idle, for SORTED RING and TREE, respectively. In both cases, termination time increases linearly with δ_idle. This is because, assuming the protocol has
converged, each additional cycle to wait simply adds to the termination time.
For small values, convergence was not always reached, especially for TREE. For
SORTED RING, all runs converged except when δ_idle = 2 and N = 2^16, in which case
76% of the runs converged. For TREE, all runs converged with δ_idle > 5, and no runs
converged for (δ_idle = 2, N = 2^13), (δ_idle = 2, N = 2^16), and (δ_idle = 3, N = 2^16). Even
in these cases, the quality of the target graph at termination time was almost perfect, as
shown in Figure 6.11. In the worst of our experiments, we observed that no more than
0.1% of the target links were missing at termination. This may be sufficient for most applications, especially considering that the target graphs will never be constructed perfectly
in a dynamic scenario, where nodes are added and removed continuously. Nevertheless,
from now on, we discard the parameter combinations that do not always converge.
Apart from longer executions, an additional consequence of choosing large values of
δ_idle is a higher communication cost. However, since not all nodes are active during the
execution, the overall number of messages sent per node on average is less than one quarter of the number of cycles until global termination. To understand this better, Figure 6.12
shows how many nodes are active during the construction of SORTED RING and TREE, respectively. The curves show both an exponential increase in the number of active nodes
when starting, and an exponential decrease when stopping. The period of time in which
all nodes are active is relatively short.
These considerations suggest the use of higher values for δ_idle, at the cost of a larger
termination time and a larger number of exchanged messages. The chosen value of δ_idle
Figure 6.11: Quality of the target TREE graph at termination time as a function of δ_idle,
for network sizes 2^10, 2^13 and 2^16.
Figure 6.12: Proportion of active nodes during execution, for SORTED RING (left) and
TREE (right), for network sizes 2^10, 2^13 and 2^16.
(4 s) represents a good trade-off between the desire to obtain a perfect target graph and
the consequently larger cost in time and communication.
6.7.7 Parameter Tuning
Cycle Length If a faster execution is desired, one can always decrease the cycle length.
However, after some point, decreasing the cycle length does not pay off, because the message
delay becomes longer than the cycle length and eventually the network will be congested
by T-MAN messages. Figure 6.13 shows the behavior of T-MAN with a cycle length varying between 0.2 s and 4 s. The figure shows the number of cycles required to terminate
the protocol. Small cycle lengths require a larger number of cycles, while beyond a given
threshold (around 1 s), the number of cycles required to complete the protocol is almost
constant. The reason for this behavior is that with short cycles, multiple cycles may be
executed before a message exchange is concluded, thus wasting bandwidth in sending and
receiving old information multiple times.
Figure 6.13: Termination time (in cycles) as a function of cycle length, for SORTED RING
(left) and TREE (right).
Figure 6.14: Termination time (in cycles) as a function of message size, for SORTED RING
(left) and TREE (right).
Message Size In Section 6.6, we examined the effect of the message size parameter (m) in detail. Here we are interested in the effect of message size on termination
time. Figure 6.14 shows that, by increasing the size of messages exchanged by SORTED
RING, termination time slightly increases after around m = 20. The reason is that a node
becomes suspended only after the local view remains unchanged for a fixed number of
cycles, but increasing the message size has the effect of increasing the number of cycles
in which view changes might occur, thus delaying termination. The results for TREE have
more variance, which might have to do with the unbalanced nature of the topology, as
discussed in Section 6.6.3.
6.7.8 Failures
The results discussed so far were obtained in static networks, without considering any
form of failure. Here, we consider two sources of failure: message losses and node
crashes. Since in this chapter we consider only the overlay construction problem, and
not maintenance, we do not explicitly consider scenarios involving node churn. Instead,
we model churn through nodes leaving, and we do not allow joining nodes to participate
in an ongoing construction. Furthermore, since we do not have a leave protocol, leaving
nodes are identical to crashing nodes from our point of view.
Message Loss While a simple solution could be to adopt a reliable, connection-oriented
transport protocol like TCP, it is more attractive to rely on a lightweight but perhaps
Figure 6.15: Termination time as a function of message loss rate, for SORTED RING (left)
and TREE (right), for network sizes 2^10, 2^13 and 2^16.
Figure 6.16: Target links found by the termination time as a function of message loss rate,
for SORTED RING (left) and TREE (right), for network sizes 2^10, 2^13 and 2^16.
unreliable transport. In this case, we need to demonstrate that T-MAN can cope well with
message loss. Figure 6.15 shows that T-MAN is highly resilient to message loss, and so
a datagram-oriented protocol like UDP is a perfectly suitable choice, as message losses
only slow down the protocol slightly. Many message exchanges are either never started
or never completed, thus requiring more cycles to terminate the protocol execution. The
quality does not suffer much either: in both SORTED RING and TREE, around 1% of the
target links may be missing, as shown in Figure 6.16. Note that the mean message loss
ratio for geographic networks like the Internet is around 2% [138], an order of magnitude
smaller than the maximum message loss ratio tested in our experiments.
Node Crashes Figure 6.17 shows the behavior of T-MAN with a variable failure rate,
measured as the total number of nodes leaving the network per second per node. We
experimented with values ranging from 0 to 10^-2, which is two orders of magnitude larger
than the value of 10^-4 suggested as the typical behavior of some P2P networks [139]. The
results show that both SORTED RING and TREE are robust in normal scenarios, with TREE
being considerably more reliable in the range of extreme failure rates. This is due to the
unbalanced nature of the topology, as discussed in Section 6.6.3.
Figure 6.17: Target links found by the termination time as a function of failure rate, for
SORTED RING (left) and TREE (right), for network sizes 2^10, 2^13 and 2^16.
6.8 Conclusions
We presented T-MAN, a lightweight gossip-based protocol for constructing various overlay networks. The target network is given by the ranking method, which is a parameter
of the protocol. T-MAN is robust with respect to the target network: it exhibits good performance that
is mostly invariant over a wide range of target networks, such as rings and trees. The
protocol is simple and robust to failure scenarios, which makes it attractive for practical
applications.
In closing, we note that T-MAN has been successfully applied for constructing the
CHORD overlay network (see Chapter 7) and the PASTRY overlay network (see Chapter 8).
In this chapter, we have chosen to focus on overlay construction as opposed to overlay
maintenance. Possible overlay maintenance techniques involve limited local view sizes
and the periodic removal of old entries from the view. In addition, random samples from the
network can constantly be injected into the local view.
Chapter 7
Bootstrapping Chord
In this chapter we describe a practically relevant application of T-MAN: we use it to create
a CHORD network [83]. Structured peer-to-peer overlay networks like CHORD are now an
established paradigm for implementing a wide range of distributed services. While the
problem of maintaining these networks in the presence of churn and other failures is the
subject of intensive research, the problem of building them from scratch has not been
addressed (apart from individual nodes joining an already functioning overlay). Here, we
address the problem of jump-starting a popular structured overlay, CHORD, from scratch.
This problem is of crucial importance in scenarios where one is assigned a limited
time interval in a distributed environment such as a data center for Cloud computing, and
the overlay infrastructure needs to be set up from the ground up as quickly and efficiently
as possible, or when a temporary overlay has to be generated to solve a specific task on
demand.
We introduce T-C HORD, that can build a C HORD network efficiently starting from a
random unstructured overlay. After jump-starting, the structured overlay can be handed
over to the C HORD protocol for further maintenance. We demonstrate through extensive
simulation experiments that the proposed protocol can create a perfect C HORD topology
in a logarithmic number of steps. Furthermore, using a simple extension of the protocol,
we can optimize the network from the point of view of message latency.
7.1 Introduction
Structured overlay networks have received considerable attention recently [81, 83]. A wide range of distributed services and applications can be implemented efficiently on top of structured overlays. The fundamental abstraction that forms the basis of numerous applications is key-based routing [80]. Key-based routing protocols are based on routing tables stored at each node, which are used to forward messages for a specific key towards the destination: the node that is responsible for the given key. The neighborhood relations specified by the routing tables define the overlay topology, whose structure depends on the specific implementation.
While the problem of maintaining these networks in the presence of churn and other failures is the subject of intensive research, the problem of building them from scratch has not been addressed, apart from handling node joins to an existing overlay. Yet, in some important scenarios, we face the problem of jump-starting structured overlays from scratch. This problem gains particular importance if one is assigned a limited time interval in a distributed environment such as PlanetLab [95] or a Grid [140], and the overlay infrastructure needs to be set up from the ground up as quickly and efficiently as possible, or when a temporary overlay has to be generated to solve a specific task on demand.
Existing join protocols are not designed to handle the massive concurrency involved in a jump-starting process, when all the nodes are trying to join at the same time [83]. On the other hand, naive approaches where nodes are forced to join the overlay in some specified order result in at least linear time needed to construct the network (not to mention the serious problem of synchronizing the operations).
We propose a solution to the jump-starting problem called T-Chord that is simple, scalable, robust, and efficient. T-Chord is a protocol for bootstrapping the Chord topology on demand, starting from an unstructured, uniform random overlay. The purpose of T-Chord is purely jump-starting the overlay; the constructed network is handed over to the Chord protocol for further maintenance.
T-Chord is based on T-Man. As we have seen in Chapter 6, T-Man is a generic mechanism for building and maintaining a wide range of different topologies, including rings, grids and trees. Briefly, T-Chord works as follows. We assume the existence of a connected unstructured overlay network characterized by a random topology (such as those produced by the protocols in Chapter 2). Nodes are assigned unique IDs from a circular ID space. Starting from the initial random overlay, T-Man is used to build the ring to be used by Chord for consistent routing. At all nodes, as a "side effect" of its execution (by remembering all the encountered nodes), T-Man can also provide a larger set of nodes from which Chord fingers can be selected.
We have evaluated the topologies obtained by T-Chord through simulation. The results, presented in Section 7.4, confirm that the obtained topology is equivalent to (in fact, at times slightly better than) the "optimal" Chord topology (as defined in the Chord protocol specification) in terms of routing performance: loss rate, hop count and latency.
7.2 System Model
We consider a network consisting of a large collection of nodes that are assigned unique identifiers and that communicate through message exchanges. The network is highly dynamic: new nodes may join at any time, and existing nodes may leave, either voluntarily or by crashing. Since voluntary leaves can be trivially managed by simple "logout" protocols, in the following we limit our discussion to node crashes, which are much more challenging. Byzantine failures, with nodes behaving arbitrarily, are excluded from the present discussion.

We assume that nodes are connected through an existing routed network, such as the Internet, where every node can potentially communicate with every other node. To actually communicate, a node has to know the identifiers of a set of other nodes (its neighbors). This neighborhood relation over the nodes defines the topology of the overlay network. Given the large scale and the dynamism of our envisioned system, neighborhoods are typically limited to small subsets of the entire network. The neighbors of a node (and, thus, the overlay topology) can change dynamically.
7.3 The T-Chord protocol
Let us now describe the proposed algorithms. In this section we build heavily on algorithms and concepts introduced in Chapter 6.
7.3.1 A Brief Introduction to Chord
Chord is an example of a key-based overlay routing protocol. In such protocols, subsets of the key space are assigned to nodes, and each node has a routing table that it uses to route messages addressed by a specific key towards the node that is responsible for that key. These routing protocols are used as a component in the implementation of the distributed hash table abstraction, where (key, object) pairs are stored over a decentralized collection of nodes and retrieved through the routing protocol.

We provide a simplified description of Chord, necessary to understand T-Chord. Nodes are assigned random t-bit IDs; keys are taken from the same space. The ID length t must be large enough to make the probability of two nodes or two keys having the same ID negligible. Nodes are ordered in a sorted ring as described in Section 6.7.3. The way this ring is constructed naturally induces a follows relation over the entire ID (and key) space: we say that a follows b if (a − b + 2^t) mod 2^t < 2^(t−1); otherwise, a precedes b. We also define a notion of distance, again inspired by the sorted ring, as follows: d(a, b) = min(|a − b|, 2^t − |a − b|). The successor of an arbitrary number i (that is, not necessarily an existing node ID) is the node with the smallest ID that follows i, as defined above. We denote the successor of i by succ_1(i). The concepts of predecessor, j-th successor, and j-th predecessor are defined similarly. Key k is under the responsibility of node succ_1(k).

Each node maintains a routing table that has two parts: leaves and fingers. Leaves define an r-regular ring lattice, where each node n is connected to its r nearest successors succ_1(n), ..., succ_r(n) in the sorted ring. Fingers are long-range links: for each node n, its j-th finger is defined as succ_1(n + 2^j), with j ∈ [0, t − 1]. Routing in Chord works by forwarding messages in the successor direction: when receiving a message targeted at key k, a node n forwards it to the leaf or finger that precedes (or is equal to) succ_1(k), the intended recipient of the message, and is closest to it.

Due to the fingers, the number of nodes that need to be traversed to reach a destination node is O(log N) with high probability, where N is the size of the network [83]. Leaves, on the other hand, are used to improve the probability of delivering a message in case of failures, and to prevent the ring from breaking into disjoint partitions.
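To make the definitions above concrete, the following small sketch (illustrative only, not the dissertation's implementation) expresses the follows relation, the ring distance, the successor, and the finger table for a t-bit circular ID space:

```python
T = 8              # ID length in bits (small, for illustration)
M = 2 ** T         # size of the circular ID space

def follows(a, b):
    """True iff a follows b on the ring: (a - b + 2^t) mod 2^t < 2^(t-1)."""
    return (a - b) % M < M // 2

def dist(a, b):
    """Ring distance d(a, b) = min(|a - b|, 2^t - |a - b|)."""
    return min(abs(a - b), M - abs(a - b))

def succ(i, nodes):
    """Successor of an arbitrary number i: the node with the smallest ID
    that follows i, i.e. the nearest node in the clockwise direction."""
    return min(nodes, key=lambda n: (n - i) % M)

def fingers(n, nodes):
    """The j-th finger of node n is succ_1(n + 2^j) for j in [0, t-1]."""
    return [succ((n + 2 ** j) % M, nodes) for j in range(T)]
```

For example, with node IDs {10, 50, 100, 200}, succ(60) is 100 and the longest finger of node 10 (for j = 7, target 138) is 200.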
7.3.2 T-Chord
In the context of Chord, our overlay construction problem translates to initializing the routing tables of all nodes simultaneously from scratch. The existing join protocol is not designed to handle the massive concurrency involved in a jump-starting process, when all the nodes are trying to join at the same time [83]. On the other hand, naive approaches where nodes are forced to join the overlay in some specified order result in at least linear time needed to construct the network (not to mention the serious problem of synchronizing the operations).

For constructing the leaf set and the fingers simultaneously, we apply T-Man with an appropriate ranking method. As usual, we use node IDs as node profiles. The ranking method we adopt is simply the ranking method of Sorted Ring as seen in Chapter 6. At any time, the actual leaf and finger sets are then constructed by each node locally from the nodes in its current view. Note that the view is not bounded, so the node descriptors that were received in the initial (non-converged) cycles are available as well. These nodes are not useful for the leaf set (which defines the ring); however, they are crucial for the fingers, which represent shortcuts in the ring.
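As a hypothetical illustration of this local extraction step (the function name and parameters below are our own, not part of the protocol specification), a node could derive its Chord routing table from its unbounded T-Man view of node IDs as follows:

```python
T = 8              # ID length in bits (small, for illustration)
M = 2 ** T         # size of the circular ID space

def extract_tables(n, view, r):
    """Build (leaves, finger_table) for node n from the set `view` of
    node IDs collected by T-Man (n itself excluded from the view)."""
    # Leaves: the r nearest successors of n in the sorted ring.
    ordered = sorted(view, key=lambda x: (x - n) % M)
    leaves = ordered[:r]
    # Fingers: for each exponent j, the successor of n + 2^j, i.e. the
    # view member nearest to that point in the clockwise direction.
    finger_table = []
    for j in range(T):
        target = (n + 2 ** j) % M
        finger_table.append(min(view, key=lambda x: (x - target) % M))
    return leaves, finger_table
```

Because old descriptors are kept, the same view serves both purposes: the closest entries form the leaf ring, while distant entries left over from early cycles populate the long-range fingers.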
CHAPTER 7. BOOTSTRAPPING CHORD
138
As for starting and termination, we experiment with both synchronous and realistic starting and termination policies. The realistic policies are those described in Chapter 6.
7.3.3 T-Chord-Prox: Network Proximity
At a node n, for an exponent j ∈ [1, t − 1], several nodes in the current view may belong to the finger range F_j = [(n + 2^(j−1)) mod 2^t, (n + 2^j − 1) mod 2^t]. In T-Chord, the finger nearest to n with respect to the ID space is selected among them, according to the convention applied in the original Chord. However, exploiting this degree of freedom allows us to optimize for message latency (a key measure of routing performance) and select the fastest possible finger that falls in the interval. This enables the construction of low-latency routing paths between nodes, improving the overall routing performance of the network.

Inspired by this insight, we propose T-Chord-Prox, a proximity-based variant of T-Chord. The protocol is the same as before; however, when constructing the finger table, for each finger exponent j, T-Chord-Prox picks p nodes at random from F_j (or the entire F_j set, if its size is less than or equal to the parameter p), and measures the latency by sending distance probes to them. A distance probe can be implemented as a simple ping-pong exchange, for example. This simple protocol builds a routing network that results in a number of hops similar to the original Chord, but outperforms it in terms of latency.
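This per-finger selection step can be sketched as follows. The sketch is illustrative: `probe_latency` stands in for the real ping-pong measurement, and the function name is our own.

```python
import random

def pick_finger(candidates, p, probe_latency):
    """Select one finger from the candidates falling in a range F_j:
    probe at most p randomly chosen candidates and keep the one with
    the smallest measured latency."""
    if len(candidates) <= p:
        sample = list(candidates)          # probe the whole range
    else:
        sample = random.sample(candidates, p)
    return min(sample, key=probe_latency)  # fastest probed candidate
```

Note that only the probed subset is compared, so the result is an approximation of the true latency-optimal finger whose quality is tunable through p.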
7.4 Experimental Results
We performed extensive simulation experiments in order to compare the jump-started overlay to the perfect Chord topology, and to characterize the scalability and robustness of our protocols. All of the experimental results were obtained using PeerSim [66].
7.4.1 Experimental Settings
By default, in all experiments all nodes are initialized with a random view obtained from the Newscast protocol (see Section 2.2.4). Subsequently, T-Man is run to create an ordered ring, and to collect long-range links as well. When T-Man reaches a pre-specified number of cycles, each node runs T-Chord locally to extract its routing tables from the T-Man view, creating the Chord topology. We note that in Section 7.4.6 we deviate from this default in order to analyze practical starting and termination mechanisms.
We focus on the routing performance of the obtained overlay. Three routing metrics have been taken into consideration. Hop count is the number of nodes that are traversed by a message to reach its destination. In case of failures, message timeouts (failed hops) are counted separately. Delivery delay measures the time needed to reach the destination. Our latency model is based on the King dataset [137], which provides end-to-end latency measurements for a set of 1740 routers. Each node is attached through a 1 ms link to a randomly selected router [81]. In case of failures, a time equal to twice the latency is added to the total delay in order to simulate timeouts. Loss rate is the fraction of messages that do not reach the destination node.
Since our goal is to jump-start Chord, the baseline routing performance is defined by the perfect Chord topology over the same set of nodes. We construct this topology off-line, using the specification of the Chord protocol, and we compare the performance of this ideal topology with the ones generated by T-Chord. We emphasize again that our goal is not to develop a novel routing mechanism or a new structured overlay: our goal is to create a Chord topology efficiently from scratch.

Figure 7.1: Loss rate, hop count and message delay as a function of the number of T-Man cycles executed
Besides routing performance, we also need to measure the communication overhead of building the topology. In the case of T-Chord without proximity, communication costs are given just by T-Man exchanges. Given the periodic nature of T-Man, these costs can easily be computed: each T-Man node sends one message and receives one message on average per cycle, with m descriptors included in each message. T-Man is run for O(log N) cycles. In T-Chord-Prox, the cost of latency probes must also be considered.

Unless stated otherwise, all figures are based on the following parameters: network size N = 2^16 nodes, message size m = 10, size of the leaf set in the Chord target topology r = 5, maximum number of probes per routing table entry p = 5. For all figures, 20 individual experiments were performed. Average values for each of the metrics are shown; error bars are used to show minimum and maximum values among the experiments (the standard deviation is often too small to be visualized). To aid visualization, some of the bars are slightly shifted horizontally.
7.4.2 Convergence
The routing performance of the topologies obtained by T-Chord depends on the number of T-Man cycles executed before the routing tables are built. In particular, the ring must be completed in order to guarantee the correct delivery of all messages. This is illustrated in Figure 7.1, where the loss rate and the observed hop count for T-Chord and T-Chord-Prox are shown as a function of the number of T-Man cycles that have been run. Initially, all messages are lost: local views contain only random nodes, so the routing algorithm is unable to deliver messages. The loss rate rapidly decreases, however, reaching 0 after only 14 cycles. At that point, the leaf ring is completely formed in all our experiments. Note that the curves for T-Chord and T-Chord-Prox overlap almost completely.

Regarding hop counts, the results confirm that the quality of the routing tables stabilizes after a few cycles, for both versions of T-Chord. Message delay follows a similar behavior, except that T-Chord-Prox shows a significant improvement. The increasing tendency of the hop count curves is explained by the fact that in the beginning, in spite of the low-quality overlay, a few messages reach their destination "by chance" in a few hops, while most of the messages are lost.

Figure 7.2: Average hop count and message delay as a function of network size

Figure 7.3: Convergence time (cycles) as a function of network size
7.4.3 Scalability
The experiments discussed so far were run in a network with a fixed size (2^16 nodes). To assess the scalability of T-Chord, Figure 7.2 plots measurements against network sizes varying in the range [2^10, 2^18]. Results for the ideal Chord topology are also shown. All algorithms scale logarithmically with size.

Quite interestingly, T-Chord performs slightly better than Chord. Regarding hop count, this is explained by the fact that the distance of the longest fingers tends to be larger in our case (due to not strictly satisfying the Chord specification), which speeds up reaching the destination node if it resides in the most distant half of the ring. Regarding message delay, as expected, T-Chord-Prox outperforms both T-Chord and Chord, due to its latency-optimized set of fingers. To obtain such performance, T-Chord-Prox pays a price in terms of latency probes. In this experimental setting, with parameter p set to 5, we have observed a total number of probes per node scaling logarithmically from 45 (for N = 2^10) to 77 (for N = 2^18). This is expected, as the expected number of distinct finger entries per node is O(log N) [83]. These values can be tuned by varying the parameter p.
Figure 7.4: Hop count and message delay as a function of message size m
Figure 7.3 plots the number of cycles needed to obtain the 1-regular lattice (the ring), sufficient to guarantee the consistent routing of messages (absence of message losses) [83], and the r-regular lattice used to provide additional fault tolerance. In both cases, convergence is obtained in a logarithmic number of cycles. Note that the actual execution time of the protocol depends on the length of a cycle, which is a parameter of the protocol. Based on the results in Chapter 6, a cycle length of 1-2 seconds is very reasonable. Considering the results, we can conclude that any practical network can very safely be constructed in less than a minute (30 cycles).

Finally, as part of our measurements regarding scalability, let us consider the storage complexity of T-Man. In Section 6.6.4 we argued that storage complexity is O(log N) per node. This is why we do not need an upper bound on the view size. Indeed, in our simulation experiments, the average number of descriptors discovered during the execution ranges from as few as 70 (N = 2^10) to 140 (N = 2^18).
7.4.4 Parameters
To evaluate the impact of the T-Man message size m on the routing performance of our algorithm, we performed the simulations shown in Figure 7.4. For message size m we set the size of the Chord leaf set to r = m/2. The plots show that good results are obtained even with small message sizes, although it must be noted that in the case of m = 4, approximately 0.6% of the messages are not delivered to their destination.
7.4.5 Robustness
To test robustness, we have considered two different failure models: crash and churn. In the former, failures are catastrophic: a given percentage of nodes are suddenly removed from the completed Chord network. In the latter, the same percentage of nodes are removed during the execution of T-Chord, evenly distributed over time.

The two models play different roles in our analysis. The crash model is the only one applicable to the ideal Chord network that we use for comparison, since we build it off-line, without using the actual Chord maintenance protocol. We use this model to obtain a lower bound on routing performance. In the churn model, on the other hand, failures influence the execution of T-Man; we use this model to show that our algorithm can indeed survive failures during its execution.
Figure 7.5: Loss rate, hop count, failed hops, and message delay under different failure scenarios
It is important to note that a direct comparison between the results of T-Chord-Prox and the other results is not fair. T-Chord-Prox probes nodes for latency before inserting them in the finger set, which means that only a few fingers (the ones that fail in the period after the probing) are down when the routing performance is evaluated.

We have simulated an increasing percentage of nodes removed in a network of size 2^16, with T-Man running for 20 cycles. The results are presented in Figure 7.5. Once again, our routing metrics show that the topology obtained by T-Chord without proximity is comparable to the ideal Chord topology, in both the crash and the churn models.

It is interesting to compare the simulated churn rate with the churn rates observed in deployed P2P networks [141]. In the worst case, the churn rate corresponds to 50% divided by 20 cycles, i.e. 2.5% per cycle. A cycle length of 2 seconds (a perfectly reasonable choice that enables the construction of a 2^16 topology in less than a minute) corresponds to 0.0125 failures per node per second, two orders of magnitude larger than the rates observed in deployed networks (around 10^-4 failures per node per second [141]).
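The worst-case churn arithmetic above can be reproduced directly (variable names are ours, values are those stated in the text):

```python
removed_fraction = 0.50   # 50% of the nodes removed in total
cycles = 20               # over 20 T-Man cycles
cycle_length_s = 2.0      # assumed cycle length of 2 seconds

per_cycle = removed_fraction / cycles        # fraction removed per cycle
per_node_per_s = per_cycle / cycle_length_s  # failures per node per second

observed_rate = 1e-4                         # rate in deployed networks [141]
ratio = per_node_per_s / observed_rate       # how much harsher our churn is
```

This yields 0.025 per cycle and 0.0125 failures per node per second, a factor of 125 (roughly two orders of magnitude) above the observed 10^-4 rate.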
7.4.6 Starting and Termination
So far we have experimented with the protocol assuming that the startup is synchronized, and that termination is based on a pre-determined number of cycles. Here we add to the protocol the startup and termination mechanisms described in Chapter 6. We will now characterize time in terms of seconds and not cycles, due to the more fine-grained nature of the simulations. We set ∆ = 1. In the experiments here m = 10; however, we build leaf sets of size r = 10 and not r = 5 as before, so convergence is somewhat slower even for synchronous startup.

Figure 7.6: The effect of parameters affecting starting and termination.
Figure 7.6 contains the results of this set of experiments. The upper left plot shows the convergence time for different starting protocols and for variable network sizes. The relative convergence times of the different mechanisms are similar to what can be seen in Figure 6.9.

The upper right plot presents termination times for different values of the parameter δidle. For small values of δidle and for large networks, we found that the protocol never reaches convergence. Nevertheless, the lower left plot shows that even for small values of δidle, the number of messages never delivered to the correct destination is smaller than 1%, which means that the obtained overlay is a good approximation of Chord. However, for δidle = 8, all our test runs resulted in 100% successful message delivery, so we adopt this value for the protocol. The slight disadvantages are a larger number of messages exchanged and a slower termination time.

Finally, the lower right plot shows the average number of messages sent by a node in the network until termination. This is significantly lower than the termination time, which could be expected based on our findings discussed before (see Figure 6.12), namely that most of the nodes terminate much earlier than the global termination time.
7.5 Related Work
Bootstrapping structured overlays is somewhat under-emphasized in comparison with other research topics. Existing proposals have assumed networks that are already formed, or networks that grow progressively, using the native join protocol. The discovery of the node to join may be facilitated either by a central (well-known) node, or through a universal ring, a shared overlay providing discovery and deployment services [142].

Join protocols enable a new node to find its position inside the structured topology [81, 83]. For example, the single-join protocol of Chord requires a node to locate its position inside the ring, and then to locate each of its O(log N) distinct fingers [83]. Since both operations require O(log N) hops (messages), the cost of a single join is O(log^2 N).

This aggressive protocol is superseded by a lightweight one that can support concurrent joins. In this case, nodes just find their position in the ring (with an O(log N) routing operation), while fingers are updated subsequently by a stabilization protocol. The protocol is efficient "... unless a tremendous number of nodes joins the system" [83], in which case the updating rate of fingers is not sufficient and routing requires a linear number of hops. In comparison, our approach builds the topology in O(log N) cycles, with two messages sent and two messages received per node per cycle, with each message being a collection of m 128-bit IDs.
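As a back-of-the-envelope illustration of this comparison (not a measurement from the dissertation: the formulas below simply use the asymptotic costs as literal message counts, ignoring constants), per-node message counts can be contrasted as follows:

```python
import math

def chord_single_join_msgs(n):
    """Aggressive Chord single-join: position lookup plus ~log N finger
    lookups, each costing ~log N hops, i.e. on the order of (log N)^2."""
    return math.log2(n) ** 2

def t_chord_msgs(cycles):
    """T-Chord: two messages sent and two received per node per cycle,
    run for O(log N) cycles."""
    return 4 * cycles
```

For N = 2^16 this gives on the order of 256 messages per node for the aggressive single-join, versus 80 messages per node for T-Chord over 20 cycles (each T-Chord message carrying m IDs).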
The problem of bootstrapping an overlay topology has recently started to gain interest from the research community. Angluin et al. [129] propose an asynchronous algorithm whose goal is to build a linked list of nodes sorted by their identifiers. Their approach is based on binary search trees that are built in O(W log N) time, where W is the length of node identifiers. In comparison, our approach builds the ring in O(log N) time, independently of the size of identifiers. Furthermore, our approach can deal with high levels of churn, while churn has not been considered in [129]. Aberer et al. [127] propose a mechanism for bootstrapping a P-Grid topology in O(log^2 N) time.

Finally, Voulgaris and van Steen [126] propose an epidemic protocol with a similar goal: jump-starting Pastry. However, their proposal is rather expensive: it requires running O(log K) instances of a modified Newscast protocol [6] in parallel (where K is the size of the ID space), and it does not take latency into account. Besides, it is highly specific to Pastry, whereas our approach, being based on T-Man, which is able to evolve a wide range of topologies, is potentially more generic. Indeed, we already have preliminary results for building Pastry as well, through an XOR-based ranking function for T-Man, with costs similar to T-Chord.
7.6 Conclusions
We have addressed the problem of jump-starting a popular structured overlay, Chord, from scratch. The proposed protocols, T-Chord and T-Chord-Prox, are scalable, lightweight and robust, and can be applied in scenarios (such as Grids [140] and large-scale testbeds like PlanetLab [95]) where the overlay infrastructure needs to be built from the ground up as quickly and efficiently as possible.

Although here we targeted Chord, our approach is more general, and it can be applied to other overlay protocols as well. Chapter 8 will present an example where we build a prefix-table based overlay.
Chapter 8
Towards a Generic Bootstrapping Service
In this last chapter, we present another application of T-Man, namely the bootstrapping of prefix-based structured overlay networks such as Pastry. In addition, we also propose a modular approach to combining several gossip components, and we argue that bootstrapping different overlay networks is an important service in this architecture.

The novel application scenarios for P2P systems that are supported by this bootstrapping service include the merging and splitting of large networks, and the multiplexing of relatively short-lived applications over a pool of shared resources. In such scenarios, the architecture needs to be (re)generated frequently, quickly and efficiently, often from scratch. We propose the bootstrapping service abstraction as a solution to this problem. We present an instance of the service that can jump-start any prefix-table based routing substrate quickly, cheaply and reliably from scratch. We experimentally analyze the proposed bootstrapping service, demonstrating its scalability and robustness.
8.1 Introduction
Structured overlay networks are increasingly seen as a key layer (or service) in peer-to-peer (P2P) systems, supporting a wide variety of applications. Index-based lookup is generally considered to be a "bottom" layer (e.g., [135, 142]), based on the assumption that the life cycle of supported systems is similar to that of grassroots file sharing networks: there exists at least one functional network, membership can change due to churn, and the network size can also fluctuate, but relatively smoothly. Join operations are assumed to be uncorrelated. Most simulation and analytical studies also reflect these assumptions, since they are often based on traces collected from real file sharing networks.
While this scenario may be appropriate for many important applications, we believe that overlay networks can be important design abstractions in radically different scenarios that have not yet been considered by the P2P research community. In particular, massive joins to a large overlay network are not supported well by known protocols, and many protocols have trouble dealing with massive departures as well. Other related scenarios that are important yet under-emphasized include bootstrapping a large network from scratch, merging two or more networks, splitting a large network into several pieces, and recovering from catastrophic failure.
If these scenarios were to be supported efficiently, we could build a fully open and flexible computing infrastructure that points well beyond current applications. We envision scenarios that involve (virtual) organizations with (possibly) large pools of resources organized in overlay networks. We want to allow these overlay networks to freely and flexibly merge with and split from the networks of other organizations on demand, and we want to admit the allocation (or sale) of pools of resources for relatively short periods to users who could then build their own infrastructures on demand and abandon them when they are done. This vision is in line with current efforts to enhance the flexibility of Grid infrastructures using P2P technology [143].
To support the above vision, we propose a P2P architecture with two main components: the peer sampling service and a dedicated bootstrapping service. Merging several large networks or starting an application from scratch within its time-slice are unusual and radical events that many existing P2P protocols are not designed to cope with. To provide a reliable platform in the face of massive joins and departures, we propose the peer sampling service (see Chapter 2) as the lightweight bottom-most layer of our P2P architecture. The bootstrapping service is then built on top of this peer sampling service. In the proposed architecture, large collections of resources can be aggregated into global structured overlays rapidly and efficiently. This then allows the use of existing, well-tuned protocols, without modification, to maintain the overlays once they have been formed. As a concrete example of the bootstrapping service, we present a novel protocol that can efficiently build prefix-based overlay routing substrates such as Pastry [81], Kademlia [144], Tapestry [145] and Bamboo [51] from scratch.
Considering related work in the area, massive joins to already running overlays have been addressed previously (e.g., [134, 135]), proposing a form of periodic repair mechanism for maintaining the leaf set, not unlike the one presented here. More recently, the bootstrapping problem has been addressed as well, focusing on specific overlays [16, 127, 128]. Our contribution with respect to related work is twofold. First, we propose an architecture that can support a protocol that jump-starts an entire overlay from scratch. Our protocol is independent of the protocol that manages the routing substrate: we singled out the abstract bootstrapping service as an important architectural component. Second, our protocol is efficient and lightweight, and supports overlays based on prefix tables and leaf sets.
The outline of this chapter is as follows. Section 8.2 presents the architecture to support the scenarios mentioned above. Section 8.3 describes the protocol implementing
the bootstrapping of routing substrates, while Section 8.4 presents experimental results.
Finally, Section 8.5 concludes the chapter.
8.2 The Architecture
Our ultimate goal is to design a P2P architecture that allows for large pools of resources to
behave almost like a liquid substance: it should be possible to merge large pools, or split
existing pools into several pieces easily. Furthermore, it should be possible to bootstrap
potentially complex architectures on top of these liquid pools of resources quickly on
demand.
The architecture is outlined in Figure 8.1. The lowest layer, the peer sampling service,
implicitly defines a group abstraction by allowing higher layers to obtain addresses of
random samples from the actual set of participating peers; even shortly after massive
joins or catastrophic failures. The basic idea of the architecture is that we require only
this lowest layer to be liquid, that is, resilient to the radical scenarios we described, and
we propose to build all other overlays on demand. In other words, the sampling service
Figure 8.1: The layers of the proposed architecture. The highlighted part is discussed in
this chapter.
functions as a last resort that provides a very basic, but at the same time extremely robust
service, which is sufficient to enable jump starting or recovering all higher layers of the
architecture.
As shown in Figure 8.1, the architecture supports other components in addition to
structured overlays. For example, a number of components rely only on random samples,
like probabilistic broadcast (gossip) or aggregation (see Chapter 3). The architecture can
also support other overlays, such as proximity based ones (see Chapter 6).
The bottom layer of the proposed P2P architecture is the peer sampling service (see
Chapter 2). Due to its low cost, extreme robustness and minimal assumptions, gossip
based peer sampling protocols are an ideal bottom layer that makes the bootstrap service
feasible. The sampling service is useful (and, in fact, sufficient) for gossip-based protocols
that are based on sending information periodically to random peers. In this chapter, as
usual, we again use the NEWSCAST protocol in our experiments (see Section 2.2.4).
8.3 Bootstrapping Prefix Tables
As argued earlier, our architecture crucially relies on the existence of a lightweight and
efficient implementation of the bootstrapping service, that in turn relies on peer sampling.
Here we develop a protocol that fulfills these requirements. We have already addressed
bootstrapping CHORD in Chapter 7, based on a sorted ring and additional fingers that
are defined based on distance in the ID space. However, an important alternative design
decision in DHTs is to apply prefix-based routing tables, which have some important
advantages, such as independence of the ID distribution, but which pose a significantly different construction and maintenance task. The protocol that we present here constructs prefix-based
routing tables at all participating nodes simultaneously, and from scratch. The key idea is
similar to that of T-CHORD: nodes build a sorted ring, and during the process they collect
entries to fill the prefix tables at all nodes.
The prefix table is defined as follows. We assume that all nodes have unique numeric
IDs. An ID is represented as a sequence of digits in base 2^b; each digit is encoded as
a b-bit number. The prefix table of a given node contains IDs that belong to different
types based on the length of their common prefix with the node's own ID. The types are
defined by a pair (i, j), where i is the length (in base-2^b digits) of the longest common
prefix of the ID and the node's own ID, and j is the value of the first differing (base-2^b)
digit. For each entry type (i, j) the table contains up to k alternative IDs. (Note that
it is possible that there are fewer than k node IDs with the desired prefix and digit among
the participating nodes, in which case we cannot fill all k slots; hence k is only an
upper bound.) Many overlay routing substrates are based on this prefix table: for example
Pastry [81], Kademlia [144], Tapestry [145] and Bamboo [51]. Using the constructed
prefix tables and the leaf sets (that define the sorted ring), the routing tables of all these
networks can be bootstrapped.
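To make the (i, j) classification above concrete, the following sketch (our illustration, not one of the dissertation's algorithms; all names are our own) computes the prefix-table entry type of a candidate ID relative to a node's own 64-bit ID, with b = 4 as used later in the simulations:

```python
# Illustrative sketch (not from the dissertation): classify a candidate ID
# into its prefix-table slot (i, j) relative to a node's own ID, using
# base-2^b digits with b = 4 and 64-bit IDs.

B = 4                # bits per digit (the parameter b)
N_DIGITS = 64 // B   # number of base-2^b digits in a 64-bit ID

def digits_of(node_id, b=B, n_digits=N_DIGITS):
    """Return the ID as a list of base-2^b digits, most significant first."""
    mask = (1 << b) - 1
    return [(node_id >> (b * (n_digits - 1 - d))) & mask
            for d in range(n_digits)]

def entry_type(own_id, other_id):
    """Return (i, j): i is the length of the common digit prefix, and
    j is the value of the first differing digit of other_id."""
    own, other = digits_of(own_id), digits_of(other_id)
    for i, (a, c) in enumerate(zip(own, other)):
        if a != c:
            return (i, c)
    return None  # identical IDs are not stored in the prefix table
```

For example, IDs starting 0xAB... and 0xAC... share one hexadecimal digit and then differ at the digit 0xC, so the second ID falls into slot (1, 12) of the first node's prefix table.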
To sum up, each node has a prefix table and a leaf set to fill, and the leaf set is being
evolved to contain the nearest neighbors in the sorted ring of node IDs. Unlike in the case
of T-CHORD, here the leaf set is symmetric, so constructing the leaf set can be achieved using
SORTED RING directly. The size of the leaf set is denoted by m. The protocol we propose is
similar to T-CHORD in that it is an instance of SORTED RING with some modifications. Like
in T-CHORD, the prefix table entries are constantly being filled using any new information
that arrives in the incoming messages.
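A minimal, self-contained sketch of the symmetric leaf set implied above, ranking candidates by distance along the 64-bit ID ring (our own simplification for illustration, not the SORTED RING protocol itself):

```python
# Hypothetical sketch: a symmetric leaf set keeps the m candidates nearest
# to the base node on the ID ring. Names are illustrative.

ID_SPACE = 1 << 64  # 64-bit ID ring, as in the simulations

def ring_distance(a, b, size=ID_SPACE):
    """Distance between two IDs along the ring (shorter of the two arcs)."""
    d = abs(a - b) % size
    return min(d, size - d)

def leaf_set(base_id, candidates, m=20):
    """Keep the m candidates closest to base_id on the ring."""
    return sorted(candidates, key=lambda x: ring_distance(base_id, x))[:m]
```

Because the distance is taken along the ring, neighbors on both sides of the base ID are retained, which is what makes the leaf set symmetric.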
The main new idea w.r.t. T-CHORD is that the gradually improving prefix table is fed
back into the ring building process, so that the two components mutually boost each other.
That is—although the ring-building process fills in most of the entries in the prefix tables
as a side-effect—the prefix tables can already fulfill a kind of routing function before
being completed, just like in DHTs. Especially in the end phase, when most of the
nodes have found their place in the ring, but a few still have an incorrect neighborhood,
the gossip mechanism of T-MAN and the almost complete prefix tables together can help
these last nodes find their correct neighborhood quickly, essentially as if they were routed
by the routing substrate under construction.
To implement this idea, we modify method TOSEND() in Algorithm 13, which is responsible for generating a set of node descriptors to be sent to the peer node. Knowing the
ID of the peer, the method optimizes the information to be sent as follows. First it takes
the union of the leaf set, r random samples taken from the sampling service, the current
prefix table, and its own descriptor (in other words, all locally available information). It
applies the ranking function to this set, and keeps the first m entries. In addition, it adds
to the message all node descriptors that are potentially useful for the peer for its prefix
table (i.e., have a common prefix with the peer ID). The size of this additional part is not
fixed but is bounded by the size of the full prefix table, and usually is smaller in practice.
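The message construction just described can be sketched as follows. This is a self-contained illustration with hypothetical names: node descriptors are reduced to bare IDs, and the ranking function is taken to be ring distance to the peer.

```python
# Hypothetical sketch of the modified toSend(): pool all local information,
# keep the m entries ranked best for the peer, then append everything that
# may be useful for the peer's prefix table.

ID_BITS = 64
ID_SPACE = 1 << ID_BITS

def ring_distance(a, b):
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)

def common_prefix_len(a, b, bits_per_digit=4):
    """Number of leading base-2^b digits shared by the two IDs."""
    n = ID_BITS // bits_per_digit
    for i in range(n):
        shift = ID_BITS - bits_per_digit * (i + 1)
        if (a >> shift) != (b >> shift):
            return i
    return n

def to_send(own_id, peer_id, leaf_set, random_samples, prefix_entries, m=20):
    # 1. pool all locally available information
    pool = set(leaf_set) | set(random_samples) | set(prefix_entries) | {own_id}
    # 2. rank the pool w.r.t. the peer and keep the first m entries
    message = sorted(pool, key=lambda x: ring_distance(x, peer_id))[:m]
    # 3. add all descriptors potentially useful for the peer's prefix table
    message += [x for x in pool
                if x not in message and common_prefix_len(x, peer_id) > 0]
    return message
```

In the actual protocol the additional part is bounded by the size of the full prefix table, exactly as stated above: only IDs sharing a prefix with the peer qualify.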
Let us summarize the parameters specific to the protocol. The prefix table is defined
by b (the number of bits in a digit) and k, the number of alternative entries to be stored
for each prefix type. The size of the leaf set is m. Finally, r is the number of
random samples used for improving the messages to be sent. Note that these samples are
“free” (if r is not too large) since the generic peer sampling layer is assumed to function
independently of the bootstrapping service.
8.4 Simulation Results
Both the sampling service and the bootstrapping service were implemented for the PEERSIM simulator [66]. We focus on two aspects of the protocol: scalability and fault tolerance. To this end, we fix all the parameters of the protocol, except the network size and
failure model. In our simulations IDs are 64-bit integers. Although typical definitions
of the ID space consider 128-bit integers, using only 64 bits for our simulations is not
limiting since the length of the largest common prefix is much less than 64 bits for all
node pairs in networks of any practical size. The extra bits play no role in this protocol.
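This claim can be checked with a back-of-the-envelope, birthday-style estimate (our own illustration, not a computation from the dissertation): among N uniformly random IDs, some pair is likely to share a prefix of roughly 2 log2 N bits, which stays far below 64 for any practical N.

```python
# Birthday-style estimate: a shared k-bit prefix among some pair of N
# uniformly random IDs becomes likely once N ~ 2^(k/2), i.e. k ~ 2*log2(N).

import math

def expected_max_common_prefix_bits(n):
    """Rough estimate of the longest common prefix (in bits) over all
    pairs of n uniformly random IDs."""
    return 2 * math.log2(n)

for n in (2**14, 2**16, 2**18):
    print(n, round(expected_max_common_prefix_bits(n)))  # 28, 32, 36 bits
```

Even for the largest simulated network (2^18 nodes) the estimate is around 36 bits, so truncating IDs to 64 bits indeed leaves the prefix tables unaffected.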
[Two log-scale plots: the proportion of missing leaf set entries (left) and of missing prefix table entries (right) versus cycles, for N = 2^14, 2^16 and 2^18.]
Figure 8.2: Results in the absence of failures. When a curve ends, the corresponding
tables are perfect at all nodes.
The parameters of the prefix table were chosen to match common settings: b = 4 and
k = 3. For networks that do not require multiple alternatives of a given table entry, setting
k > 1 is still useful because it allows for optimizing the routes according to proximity, as
we did in the case of T-CHORD-PROX. The leaf set size was m = 20 and the parameter r
was set to 30. We experimented with network sizes (N) of 2^14, 2^16 and 2^18 nodes.
Method SELECTPEER() of T-MAN uses parameter ψ = m/2 (a peer is selected from
the m/2 highest ranking nodes) and no tabu list is used. The startup technique we use is
flooding, as described in Chapter 6. The protocol is then run until the perfect leaf sets and
prefix tables are found at all nodes, based on the actual set of IDs in the network. This
cannot be decided locally, and indeed, the protocol has no stopping criterion. However,
since our protocol is cheap and needs only a small number of iterations, in practice, after initialization it can be run simply for a fixed number of cycles that are known to be
sufficient for convergence.
To test scalability, in the first set of experiments (shown in Figure 8.2) there are no
failures and all messages are delivered reliably. For network sizes 2^14, 2^16 and 2^18, we
performed 50, 10 and 4 independent experiments, respectively. The plots show the results
of each individual experiment, ending when perfect convergence is obtained.
From the left plot of Figure 8.2 we observe that the time required to reach a desired
quality of the leaf sets increases by an additive constant despite a four-fold increase in
the network size. This is a strong indication that the time needed for convergence is
logarithmic in network size. In addition to being logarithmic, the actual convergence
times are also rather small. Convergence of the leaf sets clearly follows an exponential
behavior.
The convergence of the prefix tables is rather surprising (right plot of Figure 8.2): the
network of 2^18 nodes converges faster in the final phase than a network that is four times
smaller, with the same parameters. Note that in this final phase, the vast majority of the
entries are already available (fewer than 1 in 1000 entries is missing). This slight
difference has to do with the scarcity of suitable IDs for the remaining positions to fill.
In the second set of experiments we tested the robustness of our protocol by dropping
messages with a uniform probability (Figure 8.3). This failure model is appropriate for
study because we designed the protocol with a cheap, unreliable transport layer in mind
(UDP). The drop probability was chosen to be 20%, which is unrealistically large. Since
the protocol is based on message-answer pairs, if the first message is dropped, then the
answer is not sent either. Taking this effect into account, elementary calculation shows
[Two log-scale plots: the proportion of missing leaf set entries (left) and of missing prefix table entries (right) versus cycles, for N = 2^14, 2^16 and 2^18.]
Figure 8.3: Results with 20% of the messages dropped. When a curve ends, the corresponding tables are perfect at all nodes.
that the expected overall loss of messages is 28%.
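Spelling out this elementary calculation: a request is dropped with probability p = 0.2, in which case both intended messages of the pair are lost; otherwise the answer alone may still be dropped.

```python
# Of the two intended messages in a request-answer pair: the request is
# lost with probability p (and then the answer is never sent, so both
# count as lost); otherwise the answer is lost with probability p.

p = 0.2
lost = 2 * p + (1 - p) * p   # expected number of lost messages per pair
fraction = lost / 2          # fraction of all intended messages lost
print(round(fraction, 2))    # 0.28
```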
The main conclusion of these experiments is that the behavior of the protocol is very
similar to the case when there are no failures; convergence is merely slowed down proportionally.
The protocol is not sensitive to churn either (not shown). In short, the quality of
the routing tables generated by our protocol is similar to that obtained by known routing
substrates in the presence of similar churn. Furthermore, since our protocol is based on
cheap UDP messages and can be completed in a small number of cycles, the effect of
churn during this short time is naturally limited.
8.5 Conclusions
We proposed a P2P architecture that relies on a robust peer sampling service and a bootstrapping service. Although the functionality of the sampling service is basic, its implementation is more robust and flexible than those of currently available structured overlays.
The architecture we presented here, and in particular, the bootstrapping service, bridges
the robustness and flexibility of the sampling service and the functionality of structured
overlays.
Based on our simulation results, the proposed instantiation of the bootstrapping service can build a perfect prefix table and leaf set at all nodes, in a logarithmic number
of cycles, even in the presence of message delivery failures. This performance, in combination with the support of the sampling layer, enables the on-demand deployment of
complex (multi-layered) P2P applications in short time-slices over large pools of shared
resources, in addition to allowing large pools of resources to be merged or split temporarily. Note that (as presented in Chapter 7) the bootstrapping service can be instantiated by
T-CHORD as well, in which case a finger table is produced instead of a prefix table. Also
note that the finger table can also be fed back to the construction process like the prefix
table in this chapter; a possibility we have not used in our presentation in Chapter 7.
Bibliography
[1] Jelasity, M.: Gossip. In Di Marzo Serugendo, G., Gleizes, M.P., Karageorgos,
A., eds.: Self-Organising Software: From Natural to Artificial Adaptation. Natural
Computing Series. Springer (2011) 139–162
[2] Costa, P., Gramoli, V., Jelasity, M., Jesi, G.P., Le Merrer, E., Montresor, A., Querzoni, L.: Exploring the interdisciplinary connections of gossip-based systems.
ACM SIGOPS Operating Systems Review 41(5) (2007) 51–60
[3] Jelasity, M., Voulgaris, S., Guerraoui, R., Kermarrec, A.M., van Steen, M.: Gossip-based peer sampling. ACM Transactions on Computer Systems 25(3) (August
2007) 8
[4] Jelasity, M., Kowalczyk, W., van Steen, M.: Newscast computing. Technical Report IR-CS-006.03, Vrije Universiteit Amsterdam, Department of Computer Science, Amsterdam, The Netherlands (November 2003)
[5] Jelasity, M., Kowalczyk, W., van Steen, M.: Newscast computing. In Enachescu,
C., Filip, F.G., Iantovics, B., eds.: Advanced Computational Technologies. Romanian Academy Publishing House, Bucharest, Romania (2012) 22–44. Reprint of
VU University Tech. Rep. IR-CS-006.03.
[6] Jelasity, M., Guerraoui, R., Kermarrec, A.M., van Steen, M.: The peer sampling
service: Experimental evaluation of unstructured gossip-based implementations. In
Jacobsen, H.A., ed.: Middleware 2004. Volume 3231 of Lecture Notes in Computer
Science., Springer-Verlag (2004) 79–98
[7] Tölgyesi, N., Jelasity, M.: Adaptive peer sampling with newscast. In Sips, H.,
Epema, D., Lin, H.X., eds.: Euro-Par 2009. Volume 5704 of Lecture Notes in
Computer Science., Springer-Verlag (2009) 523–534
[8] Jelasity, M., Montresor, A., Babaoglu, O.: Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems 23(3) (August 2005)
219–252
[9] Jelasity, M., Montresor, A.: Epidemic-style proactive aggregation in large overlay
networks. In: Proceedings of The 24th International Conference on Distributed
Computing Systems (ICDCS 2004), Tokyo, Japan, IEEE Computer Society (2004)
102–109
[10] Montresor, A., Jelasity, M., Babaoglu, O.: Robust aggregation protocols for large-scale overlay networks. In: Proceedings of The 2004 International Conference
on Dependable Systems and Networks (DSN), Florence, Italy, IEEE Computer
Society (2004) 19–28
[11] Jelasity, M., Canright, G., Engø-Monsen, K.: Asynchronous distributed power
iteration with gossip-based normalization. In Kermarrec, A.M., Bougé, L., Priol,
T., eds.: Euro-Par 2007. Volume 4641 of Lecture Notes in Computer Science.,
Springer-Verlag (2007) 514–525
[12] Jelasity, M., Kermarrec, A.M.: Ordered slicing of very large-scale overlay networks. In: Proceedings of the 6th IEEE International Conference on Peer-to-Peer Computing (P2P 2006), Cambridge, UK, IEEE Computer Society (September
2006) 117–124
[13] Montresor, A., Jelasity, M., Babaoglu, O.: Decentralized ranking in large-scale
overlay networks. In: Second IEEE International Conference on Self-Adaptive
and Self-Organizing Systems Workshops (SASOW 2008), IEEE Computer Society
(2008) 208–213
[14] Jelasity, M., Montresor, A., Babaoglu, O.: T-Man: Gossip-based fast overlay
topology construction. Computer Networks 53(13) (2009) 2321–2339
[15] Jelasity, M., Babaoglu, O.: T-Man: Gossip-based overlay topology management.
In Brueckner, S.A., Di Marzo Serugendo, G., Hales, D., Zambonelli, F., eds.: Engineering Self-Organising Systems: Third International Workshop (ESOA 2005),
Revised Selected Papers. Volume 3910 of Lecture Notes in Computer Science.,
Springer-Verlag (2006) 1–15
[16] Montresor, A., Jelasity, M., Babaoglu, O.: Chord on demand. In: Proceedings
of the 5th IEEE International Conference on Peer-to-Peer Computing (P2P 2005),
Konstanz, Germany, IEEE Computer Society (August 2005) 87–94
[17] Jelasity, M., Montresor, A., Babaoglu, O.: The bootstrapping service. In: Proceedings of the 26th International Conference on Distributed Computing Systems
Workshops (ICDCS WORKSHOPS), Lisboa, Portugal, IEEE Computer Society
(2006) International Workshop on Dynamic Distributed Systems (IWDDS).
[18] Dunbar, R.: Grooming, Gossip, and the Evolution of Language. Harvard University Press (1998)
[19] Kimmel, A.J.: Rumors and Rumor Control: A Manager’s Guide to Understanding
and Combatting Rumors. Lawrence Erlbaum Associates (2003)
[20] Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis,
H., Swinehart, D., Terry, D.: Epidemic algorithms for replicated database maintenance. In: Proceedings of the 6th Annual ACM Symposium on Principles of
Distributed Computing (PODC’87), Vancouver, British Columbia, Canada, ACM
Press (August 1987) 1–12
[21] Pittel, B.: On spreading a rumor. SIAM Journal on Applied Mathematics 47(1)
(February 1987) 213–223
[22] Karp, R., Schindelhauer, C., Shenker, S., Vöcking, B.: Randomized rumor spreading. In: Proceedings of the 41st Annual Symposium on Foundations of Computer
Science (FOCS’00), Washington, DC, USA, IEEE Computer Society (2000) 565–
574
[23] Bailey, N.T.J.: The mathematical theory of infectious diseases and its applications.
second edn. Griffin, London (1975)
[24] Kempe, D., Kleinberg, J., Demers, A.: Spatial gossip and resource location protocols. Journal of the ACM 51(6) (2004) 943–967
[25] Hand, E.: Head in the clouds. Nature 449 (October 2007) 963
[26] Lohr, S.: Google and I.B.M. join in ‘cloud computing’ research. The New York
Times (October 8, 2007)
[27] Amazon Web Services: http://aws.amazon.com
[28] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin,
A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s highly
available key-value store. In: SOSP’07: Proceedings of twenty-first ACM SIGOPS
symposium on Operating systems principles, New York, NY, USA, ACM (2007)
205–220
[29] Kempe, D., McSherry, F.: A decentralized algorithm for spectral analysis. In:
Proceedings of the 36th ACM Symposium on Theory of Computing (STOC’04),
New York, NY, USA, ACM (2004) 561–568
[30] Xiao, L., Boyd, S., Lall, S.: A scheme for robust distributed sensor fusion based on
average consensus. In: IPSN’05: Proceedings of the 4th international symposium
on Information processing in sensor networks, Piscataway, NJ, USA, IEEE Press
(2005) 9
[31] Babaoglu, O., Canright, G., Deutsch, A., Di Caro, G.A., Ducatelle, F., Gambardella, L.M., Ganguly, N., Jelasity, M., Montemanni, R., Montresor, A., Urnes,
T.: Design patterns from biology for distributed computing. ACM Transactions on
Autonomous and Adaptive Systems 1(1) (September 2006) 26–66
[32] Kermarrec, A.M., van Steen, M., eds.: ACM SIGOPS Operating Systems Review
41. ACM (October 2007) Special issue on Gossip-Based Networking.
[33] Johansen, H., Allavena, A., van Renesse, R.:
Fireflies: scalable support
for intrusion-tolerant network overlays. In: Proceedings of the 1st ACM
SIGOPS/EuroSys European Conference on Computer Systems (EuroSys’06), New
York, NY, USA, ACM (2006) 3–13
[34] Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregate information. In: Proceedings of the 44th Annual IEEE Symposium on Foundations of
Computer Science (FOCS’03), IEEE Computer Society (2003) 482–491
[35] Mehyar, M., Spanos, D., Pongsajapan, J., Low, S.H., Murray, R.M.: Asynchronous
distributed averaging on communication networks. IEEE/ACM Trans. Netw. 15(3)
(2007) 512–520
[36] Wuhib, F., Dam, M., Stadler, R., Clemm, A.: Robust monitoring of network-wide
aggregates through gossiping. In: Proc. 10th IFIP/IEEE International Symposium
on Integrated Management (IM 2007), Munich, Germany (May 2007) 21–25
[37] Jesus, P., Baquero, C., Almeida, P.S.: Fault-tolerant aggregation for dynamic networks. In: Proceedings of the 29th IEEE Symposium on Reliable Distributed
Systems (SRDS). (November 2010) 37–43
[38] Eyal, I., Keidar, I., Rom, R.: Limosense – live monitoring in dynamic sensor
networks. In Erlebach, T., Nikoletseas, S., Orponen, P., eds.: Algorithms for Sensor
Systems. Volume 7111 of Lecture Notes in Computer Science. Springer Berlin /
Heidelberg (2012) 72–85
[39] He, W., Liu, X., Nguyen, H.V., Nahrstedt, K., Abdelzaher, T.: PDA: Privacy-preserving data aggregation for information collection. ACM Trans. Sen. Netw.
8(1) (August 2011) 6:1–6:22
[40] Olshevsky, A., Tsitsiklis, J.N.: Convergence speed in distributed consensus and
averaging. SIAM Journal on Control and Optimization 48(1) (February 2009) 33–
55
[41] Boyd, S., Ghosh, A., Prabhakar, B., Shah, D.: Randomized gossip algorithms.
IEEE Transactions on Information Theory 52(6) (2006) 2508–2530
[42] Xiao, L., Boyd, S., Kim, S.J.: Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing 67(1) (January
2007) 33–46
[43] Lovász, L.: Random walks on graphs: A survey. In Miklós, D., Sós, V.T., Szőnyi,
T., eds.: Combinatorics, Paul Erdős is Eighty. Volume 2. János Bolyai Mathematical Society, Budapest (1996) 353–398
[44] Prieto, A.G., Stadler, R.: A-gap: An adaptive protocol for continuous network
monitoring with accuracy objectives. IEEE Trans. on Netw. and Serv. Manag. 4(1)
(June 2007) 2–12
[45] Birman, K.P., van Renesse, R., Vogels, W.: Scalable data fusion using astrolabe. In:
Proceedings of the Fifth International Conference on Information Fusion (FUSION
2002). Volume 2. (2002) 1434–1441
[46] Bawa, M., Garcia-Molina, H., Gionis, A., Motwani, R.: Estimating aggregates on
a peer-to-peer network. Technical Report 2003-24, Stanford InfoLab (April 2003)
[47] Chen, J.Y., Pandurangan, G., Xu, D.: Robust computation of aggregates in wireless
sensor networks: distributed randomized algorithms and analysis. IEEE Transactions on Parallel and Distributed Systems 17(9) (2006) 987–1000
[48] Massoulié, L., Merrer, E.L., Kermarrec, A.M., Ganesh, A.: Peer counting and
sampling in overlay networks: random walk methods. In: PODC ’06: Proceedings
of the twenty-fifth annual ACM symposium on Principles of distributed computing,
New York, NY, USA, ACM Press (2006) 123–132
[49] Mosk-Aoyama, D., Shah, D.: Fast distributed algorithms for computing separable
functions. IEEE Transactions on Information Theory 54(7) (2008) 2997–3007
[50] Moallemi, C.C., Van Roy, B.: Consensus propagation. IEEE Transactions on
Information Theory 52(11) (2006) 4753–4766
[51] Rhea, S., Geels, D., Roscoe, T., Kubiatowicz, J.: Handling churn in a DHT. In:
Proceedings of the USENIX Annual Technical Conference. (June 2004)
[52] van Renesse, R., Minsky, Y., Hayden, M.: A gossip-style failure detection service.
In: Middleware ’98, The Lake District, England, IFIP (1998) 55–70
[53] Birman, K.P., Hayden, M., Ozkasap, O., Xiao, Z., Budiu, M., Minsky, Y.: Bimodal
multicast. ACM Transactions on Computer Systems 17(2) (May 1999) 41–88
[54] Kowalczyk, W., Vlassis, N.: Newscast EM. In Saul, L.K., Weiss, Y., Bottou,
L., eds.: 17th Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, MIT Press (2005) 713–720
[55] Gupta, I., Birman, K.P., van Renesse, R.: Fighting fire with fire: using randomized
gossip to combat stochastic scalability limits. Quality and Reliability Engineering
International 18(3) (2002) 165–184
[56] Kermarrec, A.M., Massoulié, L., Ganesh, A.J.: Probablistic reliable dissemination
in large-scale systems. IEEE Transactions on Parallel and Distributed Systems
14(3) (March 2003) 248–258
[57] Stutzbach, D., Rejaie, R.: Understanding churn in peer-to-peer networks. In:
Proceedings of the 6th ACM SIGCOMM conference on Internet measurement
(IMC’06), New York, NY, USA, ACM (2006) 189–202
[58] Saroiu, S., Gummadi, P.K., Gribble, S.D.: Measuring and analyzing the characteristics of Napster and Gnutella hosts. Multimedia Systems Journal 9(2) (August
2003) 170–184
[59] Eugster, P.T., Guerraoui, R., Kermarrec, A.M., Massoulié, L.: Epidemic information dissemination in distributed systems. IEEE Computer 37(5) (May 2004)
60–67
[60] Jelasity, M., Kowalczyk, W., van Steen, M.: An approach to massively distributed aggregate computing on peer-to-peer networks. In: Proceedings of the
12th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP’04), A Coruna, Spain, IEEE Computer Society (2004) 200–207
[61] Jelasity, M., Montresor, A., Babaoglu, O.: A modular paradigm for building self-organizing peer-to-peer applications. In Di Marzo Serugendo, G., Karageorgos, A.,
Rana, O.F., Zambonelli, F., eds.: Engineering Self-Organising Systems. Volume
2977 of Lecture Notes in Artificial Intelligence., Springer (2004) 265–282. Invited
paper.
[62] Eugster, P.T., Guerraoui, R., Handurukande, S.B., Kermarrec, A.M., Kouznetsov,
P.: Lightweight probabilistic broadcast. ACM Transactions on Computer Systems
21(4) (2003) 341–374
[63] Voulgaris, S., Gavidia, D., van Steen, M.: CYCLON: Inexpensive Membership
Management for Unstructured P2P Overlays. Journal of Network and Systems
Management 13(2) (June 2005) 197–217
[64] Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and its Applications. 2nd edn. Springer Verlag (1997)
[65] Marsaglia, G.: The Marsaglia Random Number CDROM including the Diehard
Battery of Tests of Randomness. Florida State University (1995) online at
http://www.stat.fsu.edu/pub/diehard.
[66] Montresor, A., Jelasity, M.: Peersim: A scalable P2P simulator. In: Proceedings
of the 9th IEEE International Conference on Peer-to-Peer Computing (P2P 2009),
Seattle, Washington, USA, IEEE (September 2009) 99–100. Extended abstract.
[67] Marsaglia, G., Tsang, W.W.: Some difficult-to-pass tests of randomness. Journal
of Statistical Software 7(3) (2002) 1–8
[68] Albert, R., Jeong, H., Barabási, A.L.: Error and attack tolerance of complex networks. Nature 406 (2000) 378–382
[69] Pastor-Satorras, R., Vespignani, A.: Epidemic dynamics and endemic states in
complex networks. Physical Review E 63 (2001) 066117
[70] Watts, D.J., Strogatz, S.H.: Collective dynamics of ’small-world’ networks. Nature
393 (1998) 440–442
[71] Newman, M.E.J.: Random graphs as models of networks. In Bornholdt, S., Schuster, H.G., eds.: Handbook of Graphs and Networks: From the Genome to the Internet. John Wiley, New York, NY (2002)
[72] DAS-2: http://www.cs.vu.nl/das2/
[73] Allavena, A., Demers, A., Hopcroft, J.E.: Correctness of a gossip based membership protocol. In: Proceedings of the 24th annual ACM symposium on principles
of distributed computing (PODC’05), Las Vegas, Nevada, USA, ACM Press (2005)
[74] Barabási, A.L.: Linked: the new science of networks. Perseus, Cambridge, Mass.
(2002)
[75] Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of networks. Advances in Physics
51 (2002) 1079–1187
[76] Ganesh, A.J., Kermarrec, A.M., Massoulié, L.: Peer-to-peer membership management for gossip-based protocols. IEEE Transactions on Computers 52(2) (February
2003)
[77] Law, C., Siu, K.Y.: Distributed construction of random expander networks. In: Proceedings of The 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’2003), San Francisco, California, USA (April
2003)
[78] Pandurangan, G., Raghavan, P., Upfal, E.: Building low-diameter peer-to-peer networks. IEEE Journal on Selected Areas in Communications (JSAC) 21(6) (August
2003) 995–1002
[79] Zhong, M., Shen, K., Seiferas, J.: Non-uniform random membership management
in peer-to-peer networks. In: Proc. of the IEEE INFOCOM, Miami, FL (2005)
[80] Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., Stoica, I.: Towards a common
API for structured peer-to-peer overlays. In: Proceedings of the 2nd International
Workshop on Peer-to-Peer Systems (IPTPS’03), Berkeley, CA, USA (February
2003)
[81] Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location and
routing for large-scale peer-to-peer systems. In Guerraoui, R., ed.: Middleware
2001. Volume 2218 of Lecture Notes in Computer Science., Springer-Verlag (2001)
329–350
[82] Ratnasamy, S., Francis, P., Handley, M., Karp, R., Schenker, S.: A scalable
content-addressable network. In: Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications
(SIGCOMM), San Diego, CA, ACM, ACM Press (2001) 161–172
[83] Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: A
scalable peer-to-peer lookup service for internet applications. In: Proceedings of
the 2001 Conference on Applications, Technologies, Architectures, and Protocols
for Computer Communications (SIGCOMM), San Diego, CA, ACM, ACM Press
(2001) 149–160
[84] King, V., Saia, J.: Choosing a random peer. In: Proceedings of the 23rd annual
ACM symposium on principles of distributed computing (PODC’04), ACM Press
(2004) 125–130
[85] Kostić, D., Rodriguez, A., Albrecht, J., Bhirud, A., Vahdat, A.: Using random
subsets to build scalable network services. In: Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS 2003). (2003)
[86] Loguinov, D., Kumar, A., Rai, V., Ganesh, S.: Graph-theoretic analysis of structured peer-to-peer systems: Routing distances and fault resilience. In: Proceedings
of ACM SIGCOMM 2003, ACM Press (2003) 395–406
[87] van Renesse, R., Birman, K.P., Vogels, W.: Astrolabe: A robust and scalable
technology for distributed system monitoring, management, and data mining. ACM
Transactions on Computer Systems 21(2) (May 2003) 164–206
[88] Stavrou, A., Rubenstein, D., Sahu, S.: A Lightweight, Robust P2P System to
Handle Flash Crowds. IEEE Journal on Selected Areas in Communications 22(1)
(January 2004) 6–17
[89] van Renesse, R.: The importance of aggregation. In Schiper, A., Shvartsman, A.A.,
Weatherspoon, H., Zhao, B.Y., eds.: Future Directions in Distributed Computing.
Number 2584 in Lecture Notes in Computer Science, Springer (2003) 87–92
[90] Jelasity, M., Montresor, A., Babaoglu, O.: Detection and removal of malicious
peers in gossip-based protocols. In: 2nd Bertinoro Workshop on Future Directions in Distributed Computing: Survivability: Obstacles and Solutions (FuDiCo
II: S.O.S.), Bertinoro, Italy (June 2004) Invitation-only workshop; proceedings online at http://www.cs.utexas.edu/users/lorenzo/sos/.
BIBLIOGRAPHY
[91] Gupta, I., van Renesse, R., Birman, K.P.: Scalable fault-tolerant aggregation in
large process groups. In: Proceedings of the International Conference on Dependable Systems and Networks (DSN’01), Göteborg, Sweden, IEEE Computer Society
Press (2001)
[92] Watts, D.J.: Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton University Press (1999)
[93] Ripeanu, M., Iamnitchi, A., Foster, I.: Mapping the gnutella network. IEEE Internet Computing 6(1) (2002) 50–57
[94] Mitchell, T.M.: Machine Learning. McGraw-Hill (1997)
[95] Bavier, A., Bowman, M., Chun, B., Culler, D., Karlin, S., Muir, S., Peterson, L.,
Roscoe, T., Spalink, T., Wawrzoniak, M.: Operating system support for planetary-scale services. In: Proceedings of the First Symposium on Network Systems Design and Implementation (NSDI'04), USENIX (2004) 253–266
[96] Ghosh, B., Muthukrishnan, S.: Dynamic load balancing by random matchings.
Journal of Computer and System Sciences 53(3) (December 1996) 357–370
[97] Yalagandula, P., Dahlin, M.: A scalable distributed information management system. In: Proceedings of ACM SIGCOMM 2004, Portland, Oregon, USA, ACM
Press (2004) 379–390
[98] Nekovee, M., Soppera, A., Burbridge, T.: An adaptive method for dynamic audience size estimation in multicast. In Stiller, B., Carle, G., Karsten, M., Reichl,
P., eds.: Group Communications and Charges: Technology and Business Models.
Number 2816 in Lecture Notes in Computer Science, Springer (2003) 23–33
[99] Horowitz, K., Malkhi, D.: Estimating network size from local information. Information Processing Letters 88(5) (2003) 237–243
[100] Pease, M., Shostak, R., Lamport, L.: Reaching agreement in the presence of faults.
Journal of the ACM 27(2) (1980) 228–234
[101] Dolev, D., Lynch, N., Pinter, S., Stark, E., Weihl, W.: Reaching approximate
agreement in the presence of faults. Journal of the ACM 33(3) (July 1986) 499–
516
[102] Fekete, A.: Asynchronous approximate agreement. Information and Computation
115(1) (November 1994) 95–124
[103] Madden, S., Szewczyk, R., Franklin, M.J., Culler, D.: Supporting aggregate
queries over ad-hoc wireless sensor networks. In: Fourth IEEE Workshop on Mobile Computing Systems and Applications (WMCSA’02), Callicoon, New York,
IEEE Computer Society Press (2002) 49–58
[104] Kutyłowski, M., Letkiewicz, D.: Computing average value in ad hoc networks.
In Rovan, B., Vojtáš, P., eds.: Mathematical Foundations of Computer Science
(MFCS’2003). Number 2747 in Lecture Notes in Computer Science, Springer
(2003) 511–520
[105] Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
Bringing order to the web. Technical report, Stanford Digital Library Technologies
Project (1998)
[106] Sankaralingam, K., Sethumadhavan, S., Browne, J.C.: Distributed pagerank for
p2p systems. In: Proceedings of the 12th IEEE International Symposium on High
Performance Distributed Computing (HPDC-12 ’03). (2003) 58–69
[107] Shi, S., Yu, J., Yang, G., Wang, D.: Distributed page ranking in structured p2p
networks. In: Proceedings of the 2003 International Conference on Parallel Processing (ICPP’03). (October 2003) 179–186
[108] Parreira, J.X., Donato, D., Michel, S., Weikum, G.: Efficient and decentralized
PageRank approximation in a peer-to-peer web search network. In: Proceedings of
the 32nd international conference on Very large data bases (VLDB’2006), VLDB
Endowment (2006) 415–426
[109] Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The eigentrust algorithm for
reputation management in p2p networks. In: Proceedings of the 12th international
conference on World Wide Web (WWW’03), New York, NY, USA, ACM Press
(2003) 640–651
[110] Koren, Y.: On spectral graph drawing. In: Proceedings of the 9th International
Computing and Combinatorics Conference (COCOON’03). Number 2697 in Lecture Notes in Computer Science, Springer (2003) 496–508
[111] Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A decentralized network
coordinate system. In: Proceedings of ACM SIGCOMM 2004, Portland, Oregon,
USA, ACM Press (2004)
[112] Lubachevsky, B., Mitra, D.: A chaotic asynchronous algorithm for computing
the fixed point of a nonnegative matrix of unit radius. Journal of the ACM 33(1)
(January 1986) 130–150
[113] Burgess, M., Canright, G., Engø-Monsen, K.: Importance-ranking functions derived from the eigenvectors of directed graphs. Technical Report DELIS-TR-0325,
DELIS Project (2006)
[114] Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Reviews
of Modern Physics 74(1) (January 2002) 47–97
[115] Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., van der Vorst, H., eds.: Templates
for the Solution of Algebraic Eigenvalue Problems: a Practical Guide. SIAM,
Philadelphia (2000)
[116] Albert, R., Jeong, H., Barabási, A.L.: Diameter of the world wide web. Nature 401
(1999) 130–131
[117] Frommer, A., Szyld, D.B.: On asynchronous iterations. Journal of Computational
and Applied Mathematics 123(1-2) (2000) 201–216
[118] Anderson, D.P.: BOINC: A system for public-resource computing and storage.
In: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
(GRID’04), Washington, DC, USA, IEEE Computer Society (2004) 4–10
[119] Sacha, J., Dowling, J., Cunningham, R., Meier, R.: Using aggregation for adaptive
super-peer discovery on the gradient topology. In Keller, A., Martin-Flatin, J.P.,
eds.: Proceedings of the Second IEEE International Workshop on Self-Managed
Networks, Systems and Services (SelfMan 2006). Volume 3996 of Lecture Notes
in Computer Science., Springer (2006)
[120] Sacha, J., Napper, J., Stratan, C., Pierre, G.: Adam2: Reliable distribution estimation in decentralised environments. In: 2010 International Conference on Distributed Computing Systems, (ICDCS), IEEE Computer Society (2010) 697–707
[121] Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peer-to-peer networks. In: Proceedings of the 16th ACM International
Conference on Supercomputing (ICS’02). (2002)
[122] Adamic, L.A., Lukose, R.M., Puniyani, A.R., Huberman, B.A.: Search in power-law networks. Physical Review E 64 (2001) 046135
[123] Montresor, A.: A robust protocol for building superpeer overlay topologies. In:
Proceedings of the 4th IEEE International Conference on Peer-to-Peer Computing
(P2P’04), Zurich, Switzerland, IEEE Computer Society (August 2004) 202–209
[124] Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., Shenker, S.: Making
gnutella-like p2p systems scalable. In: Proceedings of ACM SIGCOMM 2003.
(2003) 407–418
[125] Voulgaris, S., Kermarrec, A.M., Massoulié, L., van Steen, M.: Exploiting semantic
proximity in peer-to-peer content searching. In: Proceedings of 10th IEEE International Workshop on Future Trends of Distributed Computing Systems (FTDCS
2004). (2004) 238–243
[126] Voulgaris, S., van Steen, M.: An epidemic protocol for managing routing tables
in very large peer-to-peer networks. In: Proceedings of the 14th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, (DSOM
2003). Number 2867 in Lecture Notes in Computer Science, Springer (2003)
[127] Aberer, K., Datta, A., Hauswirth, M., Schmidt, R.: Indexing data-oriented overlay networks. In: Proceedings of 31st International Conference on Very Large
Databases (VLDB), Trondheim, Norway, ACM (August 2005)
[128] Shaker, A., Reeves, D.S.: Self-stabilizing structured ring topology p2p systems.
In: Proceedings of the Fifth IEEE International Conference on Peer-to-Peer Computing (P2P 2005), Konstanz, Germany, IEEE Computer Society (August 2005)
39–46
[129] Angluin, D., Aspnes, J., Chen, J., Wu, Y., Yin, Y.: Fast construction of overlay
networks. In: Seventeenth Annual ACM Symposium on Parallelism in Algorithms
and Architectures (SPAA). (July 2005) 145–154
[130] Voulgaris, S., van Steen, M.: Epidemic-style management of semantic overlays
for content-based searching. In Cunha, J.C., Medeiros, P.D., eds.: Proceedings of
Euro-Par. Number 3648 in Lecture Notes in Computer Science, Springer (2005)
1143–1152
[131] Massoulié, L., Kermarrec, A.M., Ganesh, A.J.: Network awareness and failure resilience in self-organising overlay networks. In: Proceedings of the 22nd Symposium on Reliable Distributed Systems (SRDS 2003), Florence, Italy (2003) 47–55
[132] Bonnet, F., Kermarrec, A.M., Raynal, M.: Small-world networks: From theoretical
bounds to practical systems. In: Principles of Distributed Systems. Volume 4878.,
Springer (2007) 372–385
[133] Patel, J.A., Gupta, I., Contractor, N.: JetStream: Achieving predictable gossip
dissemination by leveraging social network principles. In: Proceedings of the Fifth
IEEE International Symposium on Network Computing and Applications (NCA
2006), Cambridge, MA, USA (July 2006) 32–39
[134] Zhao, B.Y., Huang, L., Stribling, J., Joseph, A.D., Kubiatowicz, J.D.: Exploiting
routing redundancy via structured peer-to-peer overlays. In: Proceedings of the
11th IEEE International Conference on Network Protocols (ICNP 2003). (2003)
246–257
[135] Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., Yu, H.: OpenDHT: A public DHT service and its uses. In: Proceedings of
ACM SIGCOMM 2005, ACM Press (2005) 73–84
[136] Koren, Y.: Embedder. http://www.research.att.com/~yehuda/index_programs.html
[137] Gummadi, K.P., Saroiu, S., Gribble, S.D.: King: Estimating latency between
arbitrary internet end hosts. In: Internet Measurement Workshop (SIGCOMM
IMW). (2002)
[138] Kalidindi, S., Zekauskas, M.J.: Surveyor: An infrastructure for Internet performance measurements. In: Proceedings of INET’99, San Jose, CA, USA (1999)
[139] Castro, M., Costa, M., Rowstron, A.: Performance and dependability of structured
peer-to-peer overlays. In: Proceedings of the 2004 International Conference on
Dependable Systems and Networks (DSN’04), IEEE Computer Society (2004)
[140] Foster, I., Kesselman, C., eds.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers (1999)
[141] Saroiu, S., Gummadi, P.K., Gribble, S.D.: A measurement study of peer-to-peer
file sharing systems. In: Proceedings of Multimedia Computing and Networking
2002 (MMCN’02), San Jose, CA (2002)
[142] Castro, M., Druschel, P., Kermarrec, A.M., Rowstron, A.: One ring to rule them
all: Service discovery and binding in structured peer-to-peer overlay networks. In:
Proceedings of the 10th ACM SIGOPS European Workshop. (2002)
[143] Foster, I., Iamnitchi, A.: On death, taxes, and the convergence of peer-to-peer and
grid computing. In: Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS'03), Berkeley, CA, USA (February 2003)
[144] Maymounkov, P., Mazières, D.: Kademlia: A peer-to-peer information system
based on the XOR metric. In: Proceedings of the 1st International Workshop on
Peer-to-Peer Systems (IPTPS '02), Cambridge, MA (2002)
[145] Zhao, B.Y., Kubiatowicz, J.D., Joseph, A.D.: Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141,
University of California, Berkeley (April 2001)