Project no. 034084
Project acronym: SELFMAN
Project title: Self Management for Large-Scale Distributed Systems based on Structured Overlay Networks and Components

European Sixth Framework Programme
Priority 2, Information Society Technologies

The Adventures of Selfman - Year Two

Due date of deliverable: July 15, 2008
Actual submission date: July 15, 2008
Start date of project: June 1, 2006
Duration: 36 months
Dissemination level: PU
Contents

1 Introduction

2 An overview of the SELFMAN project

3 D1.2: Report on high-level self-management primitives for structured overlay networks
   3.1 Executive summary
   3.2 Contractors contributing to the Deliverable
   3.3 Results
      3.3.1 Introduction
      3.3.2 Range Queries
      3.3.3 Relaxed-Ring
      3.3.4 API
      3.3.5 Sloppy Management
      3.3.6 Ongoing and Future work
   3.4 Novel P2P NAT Traversal Approach
      3.4.1 Network Address Translators
      3.4.2 Current NAT Traversal methods
      3.4.3 Peerialism’s system
      3.4.4 Limitations of current approaches
      3.4.5 Our solution
      3.4.6 Conclusion and Future Work
   3.5 Papers and publications

4 D1.3b: Final report on Security in Structured Overlay Networks
   4.1 Executive Summary
   4.2 Contractors contributing to the Deliverable
   4.3 Introduction
      4.3.1 Relating D1.3b and D1.3a
   4.4 Self Organizing Networks with Small World Networks
      4.4.1 Small World Network Models
      4.4.2 Small World Network Testbed and Simulator
      4.4.3 Comparing Small World Networks
   4.5 A Look at Security using Skype
   4.6 System-wide Monitoring Infrastructure
      4.6.1 An Example from Monitoring Skype
   4.7 Authenticating Software Components and Version Management
   4.8 Papers

5 D1.4: Java library of SELFMAN structured overlay network
   5.1 Executive summary
   5.2 Contractors contributing to the Deliverable
   5.3 The Kompics P2P architecture for the SELFMAN structured overlay network

6 D1.5: Mozart library of SELFMAN structured overlay network
   6.1 Executive summary
   6.2 Contractors contributing to the Deliverable
   6.3 Introduction
   6.4 P2PS
   6.5 PEPINO
      6.5.1 Using PEPINO
   6.6 CiNiSMO
      6.6.1 Using CiNiSMO
   6.7 Publications and Submissions

7 D2.1b: Report on computation model with self-management primitives
   7.1 Executive summary
   7.2 Contractors contributing to the Deliverable
   7.3 An overview of the Fractal component model
      7.3.1 Component model
      7.3.2 Fractal ecosystem
   7.4 Kompics: Reactive component model for distributed computing
      7.4.1 Component model
      7.4.2 Component execution and interaction semantics
      7.4.3 Example component architectures
   7.5 Kompics and Fractal integration
      7.5.1 Example component architecture with sharing
      7.5.2 Conceptual mapping of model entities
      7.5.3 Component sharing example revisited
      7.5.4 Implementation aspects

8 D2.1c: Component-based computation model
   8.1 Executive summary
   8.2 Contractors contributing to the Deliverable
   8.3 The Kompics component framework

9 D2.2b: Report on architectural framework tool support
   9.1 Event-Condition-Action Rule-Based Service for Decision Making
   9.2 Composite Probes: an Architectural Framework for Hierarchical Monitoring Data Aggregation
      9.2.1 Probe components
      9.2.2 Composite probes
   9.3 MyP2PWorld: The Case for Application-level Network Emulation of P2P Systems
      9.3.1 Introduction
      9.3.2 Motivation
      9.3.3 What MyP2PWorld is Not
      9.3.4 Related Work
      9.3.5 System Architecture Overview
      9.3.6 Discrete-Event Simulation (DES) Layer
      9.3.7 Emulation Layer
      9.3.8 Scenario Management Layer
      9.3.9 Conclusion & Future Work

10 D2.2c: Architectural framework – Components & Navigation
   10.1 Executive summary
   10.2 Contractors contributing to the deliverable
   10.3 Components and navigation API
      10.3.1 Notations
      10.3.2 Component model
      10.3.3 Deployment primitives
      10.3.4 Introspection, navigation and query primitives
   10.4 Starting with FructOz and LactOz

11 D2.3b: Report on Formal Operational Semantics - Formal Fractal Specification
   11.1 Executive Summary
   11.2 Contractors contributing to the deliverable
   11.3 Introduction
   11.4 Related work
   11.5 Foundations
   11.6 Naming and binding
   11.7 Component controller
   11.8 Binding controller
   11.9 Content controller
   11.10 Lifecycle controller
   11.11 Future work

12 D3.1b: Second report on formal models for transactions over structured overlay networks
   12.1 Executive Summary
   12.2 Partners Contributing to the Deliverable
   12.3 Results
      12.3.1 Consistency on the Routing-Level
      12.3.2 Transactional DHTs
   12.4 Conclusion

13 D3.2a: Report on replicated storage service over a structured overlay network
   13.1 Executive summary
   13.2 Contractors contributing to the Deliverable
   13.3 Introduction
   13.4 Installation and Configuration
      13.4.1 Requirements
      13.4.2 Building Chord#
      13.4.3 Installation
      13.4.4 Configuration
   13.5 User Guide
      13.5.1 Starting Chord#
      13.5.2 Java-API

14 D3.3a: Simple database query layer for replicated storage service
   14.1 Executive summary
   14.2 Contractors contributing to the Deliverable
   14.3 Introduction
   14.4 API
      14.4.1 de.zib.chordsharp.ChordSharp
      14.4.2 de.zib.chordsharp.Transaction

15 D4.1a: First report on self-configuration support
   15.1 Executive summary
   15.2 Contractors contributing to the deliverable
   15.3 Motivations
   15.4 Related work
   15.5 Reference model
   15.6 The FructOz framework
      15.6.1 Overview
      15.6.2 Interfaces and components
      15.6.3 Bindings
      15.6.4 FructOz entities
      15.6.5 Components as packaging and deployment entities
      15.6.6 Distributed environments
      15.6.7 LactOz: a dynamic FPath library
   15.7 Case studies
      15.7.1 Parameterized architectures
      15.7.2 Synchronization and workflows
      15.7.3 Lazy deployments
      15.7.4 Error handling
      15.7.5 Self-configurable architecture
      15.7.6 Deployment scenarios
   15.8 Evaluation
      15.8.1 Microbenchmarks
      15.8.2 Local deployments
      15.8.3 Distributed deployments
   15.9 Discussion and future work
   15.10 Supplement: Workflow patterns in Oz
      15.10.1 Basic control flow patterns
      15.10.2 Advanced branching and synchronization patterns
      15.10.3 Structural patterns
      15.10.4 Multiple instances patterns
      15.10.5 State-based patterns
      15.10.6 Cancellation patterns

16 D4.2a: First report on self-healing support
   16.1 Executive summary
   16.2 Contractors contributing to the deliverable
   16.3 Introduction
   16.4 Asynchronous failure handling in a network transparent system
      16.4.1 Network transparency
      16.4.2 Relationship with Kompics
      16.4.3 Asynchronous failure detection
   16.5 Network partitioning and merging
   16.6 Transactional reconfiguration of component-based systems

17 D4.3a: First report on self-tuning support
   17.1 Executive Summary
   17.2 Partners Contributing to the Deliverable
   17.3 Results
      17.3.1 Introduction
      17.3.2 Self-benchmarking
      17.3.3 DHT Load-balancing

18 D4.4a: First report on self-protection support
   18.1 Executive Summary
   18.2 Contractors contributing to the Deliverable
   18.3 Introduction
   18.4 Small World Network Experiment Testbed
   18.5 Small World Network as the Network
   18.6 New Security Issues with Small World Networks
   18.7 Papers

19 D5.2a: Application design specifications
   19.1 Executive Summary
   19.2 Contractors Contributing to the Deliverable
   19.3 Results
   19.4 Wiki Application Design Specifications
   19.5 P2P TV Application Design Specifications
      19.5.1 Peerialism’s system
      19.5.2 Introduction to Kompics
      19.5.3 The Tracker application
      19.5.4 Porting and Design
      19.5.5 Preliminary Evaluation
      19.5.6 Conclusion and Future Work

20 D6.1c: Second-year project workshop
   20.1 Call for Papers: Decentralized Self Management for Grids, P2P, and User Communities
      20.1.1 Submission of position paper or technical paper (required for attendance)
      20.1.2 Organizing committee
      20.1.3 Program committee

21 D6.5b: Second progress and assessment report with lessons learned
   21.1 Executive summary
   21.2 Contractors contributing to the deliverable
   21.3 Results
      21.3.1 Vision level
      21.3.2 Implementation level
      21.3.3 Application level

A Publications
   A.1 Overcoming Software Fragility with Interacting Feedback Loops and Reversible Phase Transitions
   A.2 The Limits of Network Transparency in a Distributed Programming Language
   A.3 Range queries on structured overlay networks
   A.4 Sloppy Management of Structured P2P Services
   A.5 The Relaxed-Ring: a Fault-Tolerant Topology for Structured Overlay Networks
   A.6 WinResMon: A Tool for Discovering Software Dependencies, Configuration and Requirements in Microsoft Windows
   A.7 A Lightweight Binary Authentication System for Windows
   A.8 PEPINO: peer-to-peer network inspector
   A.9 Visualizing Transactional Algorithms for DHTs
   A.10 Partitioning and Merging the Ring
   A.11 Transactions for Distributed Wikis on Structured Overlays
   A.12 Transactional DHT Algorithms
   A.13 Key-based Consistency
   A.14 Consistency of Data in Structured Overlays
   A.15 Handling Network Partitions and Mergers in Structured Overlay Networks
   A.16 Reliable Dynamic Reconfiguration of Component-Based Systems
   A.17 Language Support for Navigation and Reliable Reconfiguration of Component-Based Architectures
   A.18 A Multi-staged Approach to Enable Reliable Dynamic Reconfiguration of Component-Based Systems
   A.19 Security Issues in Small World Network Routing
   A.20 A Transactional Scalable Distributed Data Store: Wikipedia on a DHT
List of Figures

3.1 Routing fingers of a SONAR node in a two-dimensional data space.
3.2 SONAR overlay network with 1.9 million keys (city coordinates) over 2048 nodes. Each rectangle represents one node.
3.3 Extreme case of a relaxed-ring with many branches.
3.4 Failure recovery mechanism of the relaxed ring modeled as a feedback loop. The labels exemplify the failure of peer q, placed in between peers p and r.
3.5 Average size of branches depending on the quality of connections: avg corresponds to existing branches and totalavg represents how the whole network is affected.
3.6 Load of messages in Chord due to periodic stabilization, compared to the load of the Relaxed-Ring maintenance with bad connectivity. Y-axis presented in logarithmic scale.
3.7 Connectivity Table which shows the compatibility between pairs of NAT types. It also dictates which approach should be used in establishing a connection between peers behind those types of NAT.
3.8 NAT Traversal: Connection establishment process of class II
3.9 NAT Traversal: Connection establishment process of class III

4.1 Watts and Strogatz Model
4.2 Characteristic Path Length L(p) and Clustering Coefficient C(p)
4.3 Kleinberg Model
4.4 Simulator Interface
4.5 log(n) and 6 ∗ log(n) rings
4.6 Routing Length Distribution
4.7 Network configuration of three Skype clients: GVI, FX1 and RR.
4.8 Idle Skype network traffic over 16 hours. The X axis is time measured in CPU clocks, and the Y axis is the network traffic.
4.9 Skype network traffic during two party calling. The X axis is time measured in CPU clocks, and the Y axis is the network traffic of the two machines.
4.10 Network connectivity graph in a three-way conference call

5.1 Kompics peer-to-peer system architecture.

6.1 A peer-to-peer network visualized by PEPINO
6.2 Testing relaxed-ring’s branches with PEPINO
6.3 Events displayed with PEPINO
6.4 Architecture of CiNiSMO
6.5 Data comparing the network traffic of different instances of Chord and P2PS.

7.1 Graphical representation of Kompics components.
7.2 Graphical representation of a Kompics composite component.
7.3 Two composite components sharing a subcomponent.
7.4 Static membership distributed abstractions system architecture.
7.5 Kompics peer-to-peer system architecture.
7.6 Example software architecture. A Leader Elector component and a Remote Procedure Call component share a Failure Detector component: (a) architectural view, (b) sharing view.
7.7 Example Kompics architecture.
7.8 Example Fractal architecture.
7.9 Simple primitive component with two channel parameters in Kompics.
7.10 Simple Kompics primitive component with two channel parameters in Fractal.
7.11 Fractalized example Kompics architecture.

12.1 An inconsistent configuration. Due to imperfect failure detection, N1 suspects N2 and N3, thus pointing to N4 as successor.
12.2 The different phases and message exchanges in a single instance of the transaction protocol.

15.1 Distributed environments representation
15.2 Anatomy of a simple union dynamic set: handling of an update.
15.3 Parameterized architectures: interconnection scheme
15.4 Simple distributed component
15.5 Local deployment evaluation
15.6 Distributed deployments evaluation

16.1 The ring merge algorithm

17.1 Big picture of a typical load testing infrastructure.
17.2 Self-regulated load injection for autonomic search of system performance limits.
17.3 Autonomic saturation search with self-regulated load injection applied to an XML appliance.
17.4 Autonomic benchmarking: self-tuning of a system under test and autonomic search of performance saturation
17.5 Geographic Load-Balancing for Wikipedia.
17.6 A node Ni with successor and predecessor and their respective responsibilities.
17.7 A chain of underloaded nodes leaving the system.
17.8 Two consecutive free slots being filled by joining nodes.
17.9 Sketch of a search tree.
17.10 An increasing value of epsilon decreases the allowed difference between node utilization.
17.11 Number of items per node.
17.12 Absolute number of moved items for a network with 100 nodes with increasing epsilon for Karger and Karger with average load information.
17.13 The imbalance with increasing number of epsilon for Karger and Karger with average load.
17.14 The load imbalance as a function of the number of moved items.

18.1 Simulator Interface
18.2 Routing Length Distribution
18.3 Comparisons between 3 models
18.4 Greediness
18.5 Perfect node positions
18.6 Shuffled node positions
18.7 Restored node positions
18.8 Switching Probability
18.9 Partial Restart Strategy

19.1 Distributed Wikipedia on a transactional data store based on Chord#
19.2 Tracker Kompics Design
19.3 Kompics tracker preliminary evaluation, 2000 requests
List of Tables

9.1 Summary of the tools needed for reasoning, evaluation or testing in the different stages of designing large-scale distributed systems.
9.2 Comparison of MyP2PWorld against other testing tools.

15.1 Deployment and remote invocation costs comparison

21.1 Self-managing application requirements
Chapter 1
Introduction
This document presents all the deliverables of the second year of the SELFMAN project except the Periodic Activity Report which is submitted as a
separate document. As an experiment, we have decided to bundle all the
deliverables together, similar to a book. Each deliverable corresponds to a
single chapter in this book, supplemented with the appendices (relevant papers) and bibliographic references. Note that each deliverable can be transformed into a separate document if necessary by extracting these pages with
Adobe Acrobat. We decided to create a single book-length document for
three reasons:
• Some of the relevant papers are part of more than one deliverable.
Putting them in a single appendix removes this duplication.
• We have additional results that do not fit easily into a single deliverable, for example Raphaël Collet’s Ph.D. dissertation and the release
of Mozart 1.4.0 (see Appendix A.2) and the papers on feedback loop
architectures (see Chapter 2 and Appendix A.1). These results are
not listed in the Description of Work but they are logically part of the
project. They would have been listed if we had been prescient
enough when we originally defined the project.
• The organization in book form shows the coherence of the project. The
deliverables form an ascending ladder from low-level structures (the
SONs, Workpackage 1), followed by the component models (Workpackage 2), the transaction protocol (Workpackage 3), the merge protocol
and other self-* services (Workpackage 4), and finally the application
scenarios (Workpackage 5). We have tried to cross-reference the deliverables to clarify the connections between these parts.
Chapter 2 starts the document by giving an overview of the SELFMAN
project, summarizing the second-year results in a way that is strongly flavored
by our ultimate vision (which is further explained in Appendix A.1).
Chapter 2
An overview of the SELFMAN project
This paper derives from a talk given originally in the Software Technologies Concertation on Formal Methods for Components and Objects (FMCO
2007), held at CWI, Amsterdam, Oct. 24-26, 2007. The paper was
written in March 2008 and will appear in the revised postproceedings. It
gives an overview of the SELFMAN vision and a summary of the main results of the project in its second year. For a more far-ranging vision based
on the concept of reversible phase transitions, we refer you to Appendix A.1.
The network merge algorithm developed in SELFMAN is an example of a
robust software system that shows reversible phase transitions.
Self Management for Large-Scale Distributed Systems: An Overview of the SELFMAN Project

Peter Van Roy (1), Seif Haridi (2), Alexander Reinefeld (3), Jean-Bernard Stefani (4), Roland Yap (5), and Thierry Coupaye (6)

(1) Université catholique de Louvain (UCL), Louvain-la-Neuve, Belgium
(2) Royal Institute of Technology (KTH), Stockholm, Sweden
(3) Konrad-Zuse-Zentrum für Informationstechnik (ZIB), Berlin, Germany
(4) Institut National de Recherche en Informatique et Automatique (INRIA), Grenoble, France
(5) National University of Singapore (NUS)
(6) France Télécom Recherche et Développement, Grenoble, France
Abstract. As Internet applications become larger and more complex,
the task of managing them becomes overwhelming. “Abnormal” events
such as software updates, failures, attacks, and hotspots become frequent.
The SELFMAN project will show how to handle these events automatically by making the application self managing. SELFMAN combines two
technologies, namely structured overlay networks and advanced component models. Structured overlay networks (SONs) developed out of
peer-to-peer systems and provide robustness, scalability, communication
guarantees, and efficiency. Component models provide the framework to
extend the self-managing properties of SONs over the whole application.
SELFMAN is building a self-managing transactional storage and using it
for three application demonstrators: a machine-to-machine messaging
service, a distributed Wiki, and an on-demand media streaming service.
This paper provides an introduction and motivation to the ideas underlying SELFMAN and a snapshot of its contributions midway through the
project. We explain our methodology for building self-managing systems
as networks of interacting feedback loops. We then summarize the work
we have done to make SONs a practical basis for our architecture: using
an advanced component model, handling network partitions, handling
failure suspicions, and doing range queries with load balancing. Finally,
we show the design of a self-managing transactional storage on a SON.
1 Introduction
It is now possible to build applications of a higher level of complexity than ever
before, because the Internet has reached a higher level of reliability and scale
than ever before using computing nodes that are more powerful than ever before.
Applications that take advantage of this complexity cannot be managed directly
by human beings; they are just too complicated. In order to build them, they
need to manage themselves. In that way, human beings only need to manage the
high-level policies.
The SELFMAN project targets one part of this application space: applications built on top of structured overlay networks. Such networks are already
self managing in the lower layers: they self organize around failures to provide
reliable routing and lookup. We are building a service architecture on top of the
overlay network using an advanced component model. To make it self managing, the service architecture is designed as a set of interacting feedback loops.
Furthermore, by studying several application scenarios we find that support for
distributed transactions is important. We are therefore building a replicated
transactional storage as a key service on top of the structured overlay network.
We will build three application demonstrators that use the service architecture
and its transactional storage.
SELFMAN is a specific targeted research project (STREP) in the Information Society Technologies (IST) Strategic Objective 2.5.5 (Software and Services) of the European Sixth Framework Programme [30]. It started in June
2006 for a duration of three years with a budget of 1.96 MEuro. The project has
seven partners: Université catholique de Louvain, Kungliga Tekniska Högskolan,
INRIA (Grenoble), France Télécom Recherche et Développement (Grenoble),
Konrad-Zuse-Zentrum für Informationstechnik (Berlin), National University of
Singapore, and Stakk AB in Stockholm. This paper gives an overview of the
motivations of SELFMAN, its approach, and its contributions midway through
the project. The paper consists of the following six sections:
– Section 2: Motivation for self-managing systems. We give a brief history
of system theory and cybernetics. We then explain why programs must be
structured as systems of interacting feedback loops.
– Section 3: Presentation of the SELFMAN project. We present SELFMAN’s
decentralized service architecture and its three demonstrator applications.
– Section 4: Understanding and designing feedback structures. We explain
some techniques for analyzing feedback structures and we give two realistic
examples taken from human biology: the human respiratory system and the
human endocrine system. We infer some design rules for feedback structures
and present a tentative architecture and methodology for building them.
– Section 5: Introduction to structured overlay networks. We explain the basic
ideas of SONs and the low-level self-management operations they provide.
We then explain how they need to be extended for self-managing systems.
We have extended them in three directions: to handle network partitions,
failure suspicions, and range queries.
– Section 6: The transaction service. From our application scenarios, we have
concluded that transactional storage is a key service for building self-managing
applications. We are building the transaction service on top of a SON by using symmetric replication for the storage and a modified version of the Paxos
nonblocking atomic commit.
– Section 7: Some conclusions. We recapitulate the progress that has been
made midway through the project and summarize what remains to be done.
2 Motivation

2.1 Software complexity
Software is fragile. A single bit error can cause a catastrophe. Hardware and
operating systems have been reliable enough in the past so that this has not
unduly hampered the quantity of software written. Hardware is verified to a high
degree. It is much more reliable than software. Good operating systems provide
strong encapsulation at their cores (virtual memory, processes) and this has been
polished over many years. New techniques in fault tolerance (e.g., distributed
algorithms, Erlang) and in programming (e.g., structured programming, object-oriented programming, more recent methodologies) have arguably kept the pace
so far. In fact we are in a situation similar to the Red Queen in Through the
Looking-Glass: running as hard as we can to stay in the same place [7].
In our view, the next major increase in software complexity is now upon
us. The Internet now has sufficient bandwidth and reliability to support large
distributed applications. The number of devices connected to the Internet has increased exponentially since the early 1980s and this is continuing. The computing
power of connected devices is continuously increasing. Many new applications are
appearing: file sharing (Napster, Gnutella, Morpheus, Freenet, BitTorrent, etc.),
information sharing (YouTube, Flickr, etc.), social networks (LinkedIn, FaceBook, etc.), collaborative tools (Wikis, Skype, various Messengers), MMORPGs
(Massively Multiplayer On-line Role-Playing Games, such as World of Warcraft,
Dungeons & Dragons, etc.), on-line vendors (Amazon, eBay, PriceMinister, etc.),
research testbeds (SETI@home, PlanetLab, etc.), networked implementations of
value-added chains (e.g., in the banking industry). These applications act like
services. In particular, they are supposed to be long-lived. Their architectures
are a mix of client/server and peer-to-peer. The architectures are still rather
conservative: they do not take full advantage of the new possibilities.
The main problem that comes from the increase in complexity is that software
errors cannot be eliminated [2, 38]. We have to cope with them. There are many
other problems: scale (large numbers of independent nodes), partial failure (part
of the system fails, the rest does not), security (multiple security domains) [18],
resource management (resources tend to be localized), performance (harnessing
multiple nodes or spreading load), and global behavior (emergent behavior of
the system as a whole). Of these, global behavior is particularly relevant. Experiments show that large networks show behavior that is not easily predicted
by the behaviors of the individual nodes (e.g., the power grid [11]).
2.2 Self-managing systems
What solution do we propose to these problems? For inspiration, we go back
fifty years, to the first work on cybernetics and system theory: designing systems that regulate themselves [37, 4, 5]. A system is a set of components (called
subsystems) that are connected together to form a coherent whole. Can we predict the system’s behavior from its subsystems? Can we design a system with desired behavior? These questions are particularly relevant for the distributed
systems we are interested in. No general theory has emerged yet from this work.
We do not intend to develop such a theory in SELFMAN. Our aim is narrower:
to build self-managing software systems. Such systems have a chance of coping
with the new complexity. Our work is complementary to [17], which applies control theory to design computing systems with feedback loops. We are interested
in distributed systems with many interacting feedback loops.
Self management means that the system should be able to reconfigure itself to handle changes in its environment or its requirements without human
intervention but according to high-level management policies. In a sense, human intervention is lifted to the level of the policies. Typical self-management
operations include adding/removing nodes, performance tuning, failure detection & recovery, intrusion detection & recovery, software rejuvenation. It is clear
that self management exists at all levels of a system: the single node level, the
network routing level, the service level, and the application level. For large-scale
systems, environmental changes that require recovery by the system become normal and even frequent events. “Abnormal” events (such as failures) are normal
occurrences.
Fig. 1. Randomness versus complexity (taken from Weinberg [35])

Figure 1 (taken from [35]) classifies systems according to two axes: their complexity (the number of components and interactions) and the amount of randomness they contain (how unpredictable the system is). There are two shaded areas
that are understood by modern science: machines (organized simplicity) and aggregates (unorganized complexity). The vast white area in the middle is poorly
understood. We extend the original figure of [35] to emphasize that computing
research is the vanguard of system theory: it is pushing inwards the boundaries
of the two shaded areas. Two subdisciplines of computing are particularly relevant: programming research (developing complex programs) and computational
science (designing and simulating models). In SELFMAN we do both: we design
algorithms and architectures and we simulate the resulting systems in realistic
conditions.
2.3 Designing self-managing software systems
Designing self-managing systems means in large part to design systems with
feedback loops. Real life is filled with variations on the feedback principle. For
example:
– Bending a plastic ruler: a system with a single stable state. The ruler resists
with a force that increases with the degree of bending, until equilibrium is
reached (or until the ruler breaks: a change of phase). The ruler is a simple
self-adaptive system with a single feedback loop.
– A clothes pin: a system with one stable and one unstable state. It can be kept
temporarily in the unstable state by pinching. When the force is released, it
will go back to (a possibly more complex) stable state.
– A safety pin: a system with two stable states, open and closed. Within each
stable state the system is adaptive like the ruler. This is an example of a
feedback loop with management (see Section 4): the outer control (usually
a human being) chooses the stable state.
In general, anything that has continued existence is managed by a feedback loop.
Lack of feedback means that there is a runaway reaction (an explosion or an
implosion). This is true at all size scales, from the atomic to the astronomic. For
example, binding of atoms to form a molecule is governed by a negative feedback
loop: when perturbed it will return to equilibrium (or find another equilibrium).
A star at the end of its lifetime collapses until it finds a new stable state. If
there is no force to counteract the collapse, then the star collapses indefinitely
(at least, until it goes beyond our current understanding of physics). If the star
is too heavy to become a neutron star, then it becomes a black hole, which in
our current understanding is a singularity.
Most products of human civilization need an implicit management feedback
loop, called “maintenance”, done by a human. For example, changing lightbulbs,
replacing broken windows, or refueling a car. Each human mind is at the center of
an enormous number of these feedback loops. The human brain has a large capacity for creating such loops; they are called “habits” or “chores”. Most require
very little conscious awareness. Repetition has caused them to be programmed
into the brain below consciousness. However, if there are too many feedback
loops to manage then the brain is overloaded: the human complains that “life is
too complicated”! We can say that civilization advances by reducing the number
of feedback loops that have to be explicitly managed [36]. A dishwashing machine reduces the work of washing dishes, but it needs to be bought, filled and
emptied, maintained, replaced, etc. Is it worth it? Is the total effort reduced?
Software is in the same situation as other products of human civilization. In
the current state, most software products are very fragile: they require frequent
maintenance by a human. This is one of the purposes of SELFMAN: to reduce
this need for maintenance by designing feedback loops into the software. This
is a vast area of work; we have decided to restrict our efforts to large-scale
distributed systems based on structured overlay networks. Because they have
low-level self management built in, we consider them an ideal starting point.
SONs have greatly matured since the first work in 2001 [33]; current SONs
are (almost) ready to be used in real systems. We are adapting them in two
directions for SELFMAN. First, we are extending the SON algorithms to handle
important network issues that are not handled in the SON literature, such as
network partitioning (see Section 5). Second, we are rebuilding the SON using
a component model [1]. This is needed because the SON algorithms themselves
have to be managed and updated while the SON is running, for example to add
new basic functionality such as load balancing or new routing algorithms. The
component model is also used for the other services we need for self management.
3 The SELFMAN project
The SELFMAN project is designing a decentralized service architecture and using it to build three demonstrator applications. Here we introduce the service
architecture and the demonstrator applications. We also mention two important inspirations of SELFMAN: IBM’s Autonomic Computing Initiative and the
Chord system. Section 4.3 explains how the service architecture is used as a basis
for self management.
3.1 Decentralized service architecture
SELFMAN is based on the premise that there is a synergy between structured
overlay networks (SONs) and component models:
– SONs already provide low-level self-management abilities. We are reimplementing our SONs using a component model that adds lifecycle management
and hooks for supporting services. This makes the SON into a substrate for
building services.
– The component model is based on concurrent components and asynchronous
message passing. It uses the communication and storage abilities of the SON
to enable it to run in a distributed setting. Because the system may need
to update and reorganize itself, the components need introspection and reconfiguration abilities. We have designed a process calculus, Oz/K, that has
these abilities in a practical form [23].
This leads to a simple service architecture for decentralized systems: a SON lower
layer providing robust communication and routing services, extended with other
basic services and a transaction service. Applications are built on top of this
service architecture. The transaction service is important because many realistic
application scenarios need it (see Section 3.2).
The structured overlay network is the base. It provides guaranteed connectivity and fast routing in the face of random failures (Section 5). It does not
protect against malicious failures: in our current design we must consider the
network nodes as trusted. We assume that untrusted clients may use the overlay as a basic service, but cannot modify its algorithms. See [42] for more on
security for SONs and its effect on SELFMAN. We have designed and implemented robust SONs based on the DKS, Chord#, and Tango protocols [13, 29,
8]. These implementations use different styles and platforms, for example DKS is
implemented in Java and uses locking algorithms for node join and leave. Tango
is implemented in Oz and uses asynchronous algorithms for managing connectivity (Section 5.2). We have also designed an algorithm for handling network
partitions and merges, which is an important failure mode for structured overlay
networks (Section 5.1).
The transaction service uses a replicated storage service (Section 6). The
transaction service is implemented with a modified version of the Paxos nonblocking atomic commit [15] and uses optimistic concurrency control. This algorithm is based on a majority of correct nodes and eventual leader detection (the
so-called partially synchronous model). It should therefore cope with failures as
they occur on the Internet.
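To make the quorum idea behind this design concrete, the following Java sketch shows the commit rule that a majority-based atomic commit relies on: a transaction may commit only if every item it touches has received "prepared" votes from a strict majority of that item's replicas. This is an illustrative simplification only, not the Paxos commit protocol as implemented in SELFMAN; the class, method, and vote names are invented for the example.

import java.util.List;
import java.util.Map;

// Illustrative sketch of the majority (quorum) rule behind non-blocking atomic
// commit over replicated items. All names are invented for this example; this
// is not the SELFMAN transaction service itself.
public class MajorityCommitSketch {

    enum Vote { PREPARED, ABORT, NO_REPLY }  // NO_REPLY models a silent or suspected replica

    // The transaction may commit only if every item it touches has PREPARED
    // votes from a strict majority of that item's replicas.
    static boolean canCommit(Map<String, List<Vote>> votesPerItem) {
        for (List<Vote> votes : votesPerItem.values()) {
            long prepared = votes.stream().filter(v -> v == Vote.PREPARED).count();
            if (prepared <= votes.size() / 2) {
                return false;  // no majority for this item: the transaction must abort
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two items, each replicated on 5 nodes. Item "x" has 3 PREPARED votes,
        // one ABORT and one missing reply: a strict majority is still reached.
        Map<String, List<Vote>> votes = Map.of(
            "x", List.of(Vote.PREPARED, Vote.PREPARED, Vote.PREPARED, Vote.ABORT, Vote.NO_REPLY),
            "y", List.of(Vote.PREPARED, Vote.PREPARED, Vote.PREPARED, Vote.PREPARED, Vote.NO_REPLY));
        System.out.println(canCommit(votes) ? "commit" : "abort");
    }
}

Requiring a strict majority per item is what allows such a protocol to keep making progress while a minority of replicas are crashed or merely suspected, which matches the partially synchronous assumptions mentioned above.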
This simple service architecture is our starting point for building self-managing
applications. Section 4.3 shows how this service architecture is used to build the
feedback structures that are needed for self management.
Application              Self-* Properties   Components   Overlays   Transactions
M2M Messaging            ++                  ++           +          +
Distributed Wiki         ++                  +            ++         ++
P2P Media Streaming      ++                  +            ++
J2EE Application Server  ++                  ++                      +

Table 1. Requirements for selected self-managing applications
3.2 Demonstrator applications and guidelines
Using this self-management architecture, we will build three application demonstrators [12]:
– A machine-to-machine messaging application (specified by partner France
Télécom). This is a decentralized messaging application. It must recover on
node failure, gracefully degrade and self optimize, and have transactional
behavior.
– A distributed Wiki application (specified by partner ZIB). This is a Wiki
(a user-edited set of interlinked Web pages) that is distributed over a SON
using transactions with versioning and replication, supporting both editing
and search.
– An on-demand video streaming application (specified by partner Stakk).
This application provides distributed live media streams with quality of
service to large and dynamically varying numbers of customers. Dynamic
reconfiguration is needed to handle the fluctuating structure.
Table 1 shows how much these applications need in four areas: self-* properties,
components, overlay networks (decentralized execution), and transactions. Two
pluses (++) mean strong need and one plus (+) means some need. An empty
space means no need for that area according to our current understanding. All
these applications have a strong need for self-management support. The table
shows a fourth application that was initially considered, an application server
written in J2EE, but we rejected it for SELFMAN because it does not have any
requirements for decentralized execution.
At the end of the project we will provide a set of guidelines and general
programming principles for building self-managing applications. One important
principle is that these applications are built as a set of interacting feedback
loops. A feedback loop, where part of the system is monitored and then used
to influence the system, is an important basic element for a system that can
adjust to its surroundings. As part of SELFMAN, we are carefully studying
how to build applications with feedback loops and how feedback interacts with
distribution.
3.3 Related work
The SELFMAN project is related to two important areas of work:
– IBM’s Autonomic Computing Initiative [19]. This initiative started in 2001
and aims to reduce management costs by removing humans from low-level
system management loops. The role of humans is then to manage policy and
not to manage the mechanisms that implement it.
– Structured overlay network research. The most well-known SON is the Chord
system, published in 2001 [33]. Other important early systems are Ocean
Store and CAN. Inspired by popular peer-to-peer applications, these systems
led to much active research in SONs, which provide low-level self management of routing, storage and smart lookup in large-scale distributed systems.
Other important related work is research in ambient and adaptive computing,
and research in biophysics on how biological systems regulate and adapt themselves. For example, [21] shows how systems consisting of two coupled feedback
loops behave in a biological setting.
Fig. 2. A feedback loop
4 Understanding and designing feedback structures
A self-managing system consists of a large set of interacting feedback loops.
We want to understand how to build systems that consist of many interacting
feedback loops. Systems with one feedback loop are well understood, see, e.g.,
the book by Hellerstein et al [17], which shows how to design computing systems
with feedback control, for example to maximize throughput in Apache HTTP
servers, TCP communication, or multimedia streaming. The book focuses on
regulating with single feedback loops. Systems with many feedback loops are
quite different. To understand them, we start by doing explorations both in
analysis and synthesis: we study existing systems (e.g., biological systems) and
we design decentralized systems based on SONs.
A feedback loop consists of three parts that interact with a subsystem (see
Figure 2): a monitoring agent, a correcting agent, and an actuating agent. The
agents and the subsystem are concurrent components that interact by sending
each other messages. We call them “agents” because they play specific roles in
the feedback loop; an agent can of course have subcomponents. As explained in
[34], feedback loops can interact in two ways:
– Stigmergy: two loops monitor and affect a common subsystem.
– Management: one loop directly controls another loop.
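To make this structure concrete, the following minimal Java sketch models the loop of Figure 2 as a monitoring agent, a correcting agent, and an actuating agent acting on a subsystem. It is illustrative only: the agents run sequentially in one thread rather than as message-passing components, and the thermostat-like Subsystem and all names are invented for the example.

// Minimal sketch of the feedback loop of Figure 2: a monitoring agent, a
// correcting agent, and an actuating agent around a managed subsystem.
// Names and the thermostat-like subsystem are invented for this example;
// the agents run sequentially here instead of as message-passing components.
public class FeedbackLoopSketch {

    // The managed subsystem: exposes an observable quantity and an actuation knob.
    static class Subsystem {
        double value = 25.0;               // the monitored quantity
        void actuate(double correction) {  // applied by the actuating agent
            value += correction;
        }
        void disturb(double noise) {       // the environment perturbs the subsystem
            value += noise;
        }
    }

    // Monitoring agent: observes the subsystem.
    static double monitor(Subsystem s) {
        return s.value;
    }

    // Correcting agent: computes a corrective action towards a setpoint.
    static double correct(double observed, double setpoint) {
        double gain = 0.5;                 // simple proportional correction
        return gain * (setpoint - observed);
    }

    public static void main(String[] args) {
        Subsystem s = new Subsystem();
        double setpoint = 20.0;
        java.util.Random rnd = new java.util.Random(42);
        for (int step = 0; step < 20; step++) {
            s.disturb(rnd.nextGaussian());                    // environmental change
            double observed = monitor(s);                     // monitoring agent
            double correction = correct(observed, setpoint);  // correcting agent
            s.actuate(correction);                            // actuating agent
            System.out.printf("step %2d: observed %.2f%n", step, observed);
        }
    }
}

In these terms, stigmergy corresponds to two such loops monitoring and actuating the same Subsystem instance, while management corresponds to one loop setting another loop's setpoint or gain.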
How can we design systems with many feedback loops that interact both through
stigmergy and management? We want to understand the rules of good feedback
design, in analogy to structured and object-oriented programming. Following
these rules should give us good designs without having to laboriously analyze all
possibilities. The rules can tell us what the global behavior is: whether the system
converges or diverges, whether it oscillates or behaves chaotically, and what
states it settles in. To find these rules, we start by studying existing feedback
loop structures that work well, in both biological and software systems. We
try to understand these systems by analysis and by simulation. Many feedback
systems and feedback patterns have been investigated in the literature [34, 27,
22]. Sections 4.1 and 4.2 give two approaches to understanding existing systems
and summarize some of the design rules we can infer from them. Finally, Section
4.3 gives a first tentative methodology for designing feedback structures.
Fig. 3. The human respiratory system
4.1 Feedback structures in the human body
We investigate two feedback loop structures that exist in the human body: the
human respiratory system and the human endocrine system. Figure 3 (taken
from [34]) shows the human respiratory system, which has four feedback loops:
three are arranged in a management hierarchy and the fourth interacts with
them through stigmergy. This design works quite well. Laryngospasm can temporarily interfere with the breathing reflex, but after a few seconds it lets normal
breathing take over. Conscious control can modulate the breathing reflex, but
it cannot bypass it completely: in the worst case, the person falls unconscious
and normal breathing takes over. We can already infer several design rules from
this system: one loop managing another is an example of data abstraction, loops
can avoid interference by working at different time scales, and since complex
loops (such as conscious control) can have an unpredictable effect (they can be
either stabilizing or destabilizing), it is a good idea to have an outer “fail-safe”
management loop. Conscious control is a powerful problem solver but it needs
to be held in check.
The respiratory system is a simple example of a feedback loop structure
that works; we now give a more complex biological example, namely the human
endocrine system [10]. The endocrine system regulates many quantities in the
human body. It uses chemical messengers called hormones which are secreted by
specialized glands and which exercise their action at a distance, using the bloodstream as a diffusion channel. By studying the endocrine system, we can obtain
insights in how to build large-scale self-regulating distributed systems. There are
many feedback loops and systems of interacting feedback loops in the endocrine
system. It provides homeostasis (stability) and the ability to react properly to
environmental stresses. Much of the regulation is done by simple negative feedback loops. For example, the glucose level in the blood stream is regulated by
the hormones glucagon and insulin. In the pancreas, A cells secrete glucagon and
B cells secrete insulin. An increase in blood glucose level causes a decrease in
the glucagon concentration and an increase in the insulin concentration. These
hormones act on the liver, which releases glucose in the blood. Another example
is the calcium level in the blood, which is regulated by parathyroid hormone
(parathormone) and calcitonine, also in opposite directions, both of which act
on the bone. The pattern here is of two hormones that work in opposite directions (push-pull). This pattern is explained by [21] as a kind of dual negative
feedback loop (an NN loop) that improves regulation.
Fig. 4. The hypothalamus-pituitary-target organ axis. (Figure notes: hormone secretion is inhibited by high local concentration; hormones are consumed by target tissues; carrier proteins in the bloodstream buffer the hormone (reduce variations); estrogens increase and androgens decrease the carrier proteins; many hormones have pulsed secretion, regulated by melatonin (pineal gland).)

More complex regulatory mechanisms also exist in the endocrine system, e.g.,
the hypothalamus-pituitary-target organ axis. Figure 4 shows its main parts as a
feedback structure. This system consists of two superimposed groups of negative
feedback loops (going through the target tissues and back to the hypothalamus
and anterior pituitary), a third short negative loop (from the anterior pituitary
to the hypothalamus), and a fourth loop from the central nervous system. The
hypothalamus and anterior pituitary act as master controls for a large set of other
regulatory loops. Furthermore, the nervous system affects these loops through
the hypothalamus. This gives a time scale effect since the hormonal loops are
slow and the nervous system is fast. Letting it affect the hormonal loop helps to
react quickly to external events.
Figure 4 shows only the main components and their interactions; there are
many more parts in the full system. There are more interacting loops, “short circuits”, special cases, interaction with other systems (nervous, immune). Negative
feedback is used for most loops, saturation (like in the Hill equations introduced
in Section 4.2) for others. Realistic feedback structures can be complex. Evolution is not always a parsimonious designer! The only criterion is that the system
has to work.
Computational architecture We can say something about the computational architecture of the human endocrine system. There are components and communication channels. Components can be either local (glands, organs, clumps of cells) or global (diffuse, over large parts of the body). Channels can be point-to-point or broadcast. Point-to-point channels are fast, e.g., nerve fibers from the spinal cord to the muscle tissue. Broadcast is slower, e.g., diffusion of a
hormone through the blood circulation. Buffering is used to reduce variations,
e.g., the carrier proteins in the bloodstream act as buffers by storing and releasing hormones. Regulatory mechanisms can be modeled by interactions between
components and channels. Often there are intermediate links (like the carrier
proteins). Abstraction (e.g., encapsulation) is almost always approximate. This is an important difference from digital computers. Biological and social abstractions tend to be leaky; computer abstractions tend not to be. This can have a large effect on the design. In biological systems, security is done through a mechanism that is itself leaky, e.g., the human immune system. In computer systems,
the security architecture tries to be as nonleaky as possible, although this cannot
be perfect because of covert channels.
4.2 Analysis of feedback structures
How can we design a system with many interacting feedback loops like that of
Figure 3? Mathematical analysis of interacting feedback loops is quite complex,
especially if they have nonlinear behavior. Can we simplify the system to have
linear or monotonic behavior? Even then, analysis is complex. For example,
Kim et al [21] analyze biological systems consisting of just two feedback loops
interacting through stigmergy. They admit that their analysis only has limited
validity because the coupled feedback loops they analyze are parts of much larger
sets of interacting feedback loops. Their analysis is based on Matlab simulations
using the Hill equations, first-order nonlinear differential equations that model
the time evolution and mutual interaction of molecular concentrations. The Hill
equations model nonlinear monotonic interaction with saturation effects. We give
a simple example using two molecular concentrations X and Y . The equations
have the following form (taken from [21]):
\[
\frac{dY}{dt} = \frac{V_X\,(X/K_{XY})^H}{1 + (X/K_{XY})^H} - K_{dY}\,Y + K_{bY}
\]
\[
\frac{dX}{dt} = \frac{V_Y}{1 + (Y/K_{YX})^H} - K_{dX}\,X + K_{bX}
\]
Here we assume that X activates Y and that Y inhibits X. The equations
model saturation (the concentration of a molecule has an upper limit) and activation/inhibition with saturation (one molecule can affect another, up to a point). We see that X and Y, when left on their own, will each asymptotically approach a limit value with an exponentially decaying difference. Figure 5 shows a simplified system where X activates Y but Y does not affect X. X has a discrete step decrease at $t_0$ and a continuous step increase at $t_1$. Y follows these changes with a delay and eventually saturates. The constants $K_{dY}$ and $K_{bY}$ model saturation of Y (and similarly for X). The constants $V_X$, $K_{XY}$, and $H$ model the activation effect of X on Y. We see that activation and inhibition have upper and lower limits.

Fig. 5. Example of a biological system where X activates Y (figure not reproduced)
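To make the behavior of these equations concrete, the following sketch integrates the two coupled Hill equations above with a simple forward Euler scheme. All constants, the step size, and the initial concentrations are illustrative values chosen for the sketch; they are not taken from [21].

```python
# Sketch: forward Euler integration of the two coupled Hill equations above.
# All constants are illustrative placeholders, not the values used in [21].

def simulate(T=50.0, dt=0.01,
             V_X=1.0, V_Y=1.0, K_XY=0.5, K_YX=0.5, H=2,
             K_dX=0.2, K_bX=0.05, K_dY=0.2, K_bY=0.05,
             X0=0.1, Y0=0.1):
    X, Y = X0, Y0
    trace = []
    for step in range(int(T / dt)):
        # X activates Y through a saturating Hill term; Y also decays (K_dY)
        # and has a basal production rate (K_bY).
        dY = V_X * (X / K_XY) ** H / (1.0 + (X / K_XY) ** H) - K_dY * Y + K_bY
        # Y inhibits X through a saturating repression term; X decays (K_dX)
        # and has a basal production rate (K_bX).
        dX = V_Y / (1.0 + (Y / K_YX) ** H) - K_dX * X + K_bX
        X, Y = X + dt * dX, Y + dt * dY
        trace.append((step * dt, X, Y))
    return trace

if __name__ == "__main__":
    for t, X, Y in simulate()[::500]:
        print(f"t={t:5.1f}  X={X:.3f}  Y={Y:.3f}")
```

Running the sketch shows both concentrations settling toward limit values, which illustrates the saturation behavior discussed above.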
By simulating these equations, Kim et al determine the effect of two coupled
feedback loops, each of which can be positive or negative.
– A positive loop is bistable or multistable; it is commonly used in biological
systems for decision making. Two coupled positive loops cause the decision
to be less affected by environmental perturbations: this is useful for biological
processes that are irreversible (such as mitosis, i.e., cell division).
– A negative loop reduces the effect of the environment; it is commonly used
in biological systems for homeostasis, i.e., to keep the biological system in
a stable state despite environmental changes. Negative loops can also show
oscillation because of the time delay between the output and input. Two coupled negative loops can show stronger and more sustained oscillations than
a single loop. They can implement biological oscillations such as circadian
(daily) rhythms.
– A combined positive and negative loop can change its behavior depending
on how it is activated, to become more like a positive or more like a negative
loop. This is useful for regulation.
These results are interesting because they give insight into nonlinear monotonic
interaction with saturation. They can be used to design structures with two
coupled feedback loops.
Many patterns of feedback loops have been analyzed in this way. For example, [27] shows how to model oscillations in biological systems by cycles of
feedback loops. The cycle consists of molecules where each molecule activates or
inhibits the next molecule in the cycle. If the total effect of the cycle is a negative
feedback then the cycle can give oscillations. Given an oscillatory behavior, the
topology of the cycle (the molecules involved and their interaction types) can
be reconstructed. Many other patterns have been analyzed as well in biological
systems, but there is as yet no general theory for analyzing these feedback structures. In SELFMAN we are interested in investigating the kinds of equations
that apply to software. In software, the feedback structures may not follow the
Hill equations. For example, they may not be monotonic. Nevertheless, the Hill
equations are a useful starting point because they model saturation, which is an
interesting form of nonlinearity.
4.3 Feedback structures for self management
From the examples given in the previous sections and elsewhere [34, 4, 37, 5,
35], we can give a tentative methodology for designing feedback structures. We
assume that the overall architecture follows the decentralized structure given in
Section 3.1: a set of loosely-coupled services built on top of a structured overlay
network. We build the feedback structure within this framework. We envisage
the following three layers for a self-managing system:
1. Components and events. This basic layer corresponds to the service architecture of Section 3.1: services based on concurrent components that interact
through events [1, 9]. There can be publish/subscribe events, where any component that subscribes to a published type will receive the events. There is a
failure detection service that is eventually perfect with suspect and resume
events. There can be more sophisticated services, like the transaction service
mentioned in Section 3.1 and presented in more detail in Section 6.
2. Feedback loop support. This layer supports building feedback loops. This is
sufficient for cooperative systems. The two main services needed for feedback
loops are a pseudoreliable broadcast (for actuating) and a monitoring layer.
Pseudoreliable broadcast guarantees that nodes will receive the message if
the originating node survives [13]. Monitoring detects both local and global
properties. Global properties are calculated from local properties using a
gossip algorithm [20] or using belief propagation [39] (a small gossip-based sketch is given after this list). The multicast and monitoring services are used to implement self-management abilities.
3. Multiple user support. This layer supports users that compete for resources.
This is a general problem that requires a general solution. If the users are
independent, one possible approach is to use collective intelligence techniques
(see Section 4.4). These techniques guarantee that when each user maximizes
its private utility function, the global utility will also be maximized. This
approach does not work for Sybil attacks (where one user appears as multiple
users to the system). No general solution to Sybil attacks is known. A survey
of partial solutions is given in [42]. We cite two solutions. One solution
is to validate the identities of users using a trusted third party. Another
solution is to use algorithms designed for a Byzantine failure model, which
can handle multiple identical users up to some upper bound. Both solutions
give significant performance penalties.
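As a concrete illustration of the monitoring service in layer 2, the sketch below shows push-pull gossip averaging in the spirit of gossip-based aggregation [20]: every node repeatedly averages its estimate with a randomly chosen peer, and all estimates converge to the global mean of the monitored local values. The synchronous rounds and the globally known node set are simplifications for the sketch; a real overlay would gossip asynchronously over its links.

```python
# Sketch: push-pull gossip averaging of a locally monitored value (e.g., load).
import random

def gossip_average(local_values, rounds=20, seed=1):
    """Each entry of local_values is one node's local measurement; after enough
    rounds every node's estimate approximates the global average."""
    rng = random.Random(seed)
    estimates = list(local_values)      # each node starts with its own value
    n = len(estimates)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n)        # pick a random gossip partner
            avg = (estimates[i] + estimates[j]) / 2.0
            estimates[i] = estimates[j] = avg   # push-pull: both adopt the average
    return estimates

if __name__ == "__main__":
    rng = random.Random(7)
    loads = [rng.uniform(0.0, 100.0) for _ in range(50)]
    print("true mean       =", sum(loads) / len(loads))
    print("node 0 estimate =", gossip_average(loads)[0])
```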
We now discuss two important issues that affect feedback structures: simple
versus complex components (how much computation each component does) and
time scales (different time scales can be independent). A complex component
does nontrivial reasoning, but in most cases this reasoning is only valid in part
of the system’s state space and should be ignored in other parts. This affects
the architecture of the system. At different time scales, a system can behave
as separate systems. We can take advantage of this to improve the system’s
behavior.
Complex components A self-managing system consists of many different kinds
of components. Some of these can be quite simple (e.g., a thermostat). Others
can be quite complex (e.g., a human being or a chess program). We define a
component as complex if it can do nontrivial reasoning. Some examples are a
human user, a computer chess program, a compiler that translates a program
text, a search engine over a large data set, and a problem solver based on SAT
or constraint algorithms.
Whether or not a component is simple or complex can have a major effect
on the design of the feedback structure. For example, a complex component
may introduce instability that needs fail-safe protective mechanisms (see, e.g.,
the human respiratory system) or mechanisms to avoid “freeloaders” (see Section 4.4). Many systems have both simple and complex components. We have
seen regulatory systems in the human body which may have some conscious
control in addition to simpler components. Other systems, called social systems,
have both human and software components. Many distributed applications (e.g.,
MMORPGs) are of this kind.
A complex component can radically affect the behavior of the system. If
the component is cooperative, it can stabilize an otherwise unstable system. If
the component is competitive, it can destabilize an otherwise stable system. All
four combinations of {simple,complex} × {cooperative,competitive} appear in
practice. With respect to stability, there is no essential difference between human
components and programmed complex components; both can introduce stability
and instability. Human components excel in adaptability (dynamic creation of
new feedback loops) and pattern matching (recognizing new situations as variations of old ones). They are poor whenever a large amount of precise calculation
is needed. Programmed components can easily go beyond human intelligence in
such areas. Whether or not a component can pass a Turing test is irrelevant for
the purposes of self management.
How do we design a system that contains complex components? If the component is external to the designed system (e.g., human users connecting to a
system) then we must design defensively to limit the effect of the component on
the system’s behavior. We need to protect the system from the users and the
users from each other. For example, the techniques of collective intelligence can
be used, as explained in Section 4.4. Getting this right is not just an algorithmic
problem; it also requires social engineering, e.g., incentive mechanisms [28].
If the component is inside the system, then it can improve system behavior but fail-safe mechanisms must be built in to limit its effect. For example,
conscious control can improve the behavior of the human respiratory system,
but it has a fail-safe to avoid instability (see Section 4.1). In general, a complex
component will only enhance behavior in part of the system’s state space. The
system must make sure that the component cannot affect the system outside of
this part.
Time scales Feedback loops that work at different time scales can often be
considered to be completely independent of each other. That is, each loop is
sensitive to a particular frequency range of system behavior and these ranges are
often nonoverlapping. Wiener [37] gives an example of a human driver braking
an automobile on a surface whose slipperiness is unknown. The human “tests”
the surface by small and quick braking attempts; this allows the driver to infer whether the surface is slippery or not. The human then uses this information to modify how to brake the car. This technique uses a loop at a short time scale to gain information about the environment, which is then used to help control at a longer time
scale. The fast loop manages the slow loop.
4.4 Managing multiple users through collective intelligence
An important part of feedback structures that we have not yet explained is
how to support users that compete for resources. A promising technique for
this is collective intelligence [40, 41]. It can give good results when the users
are independent (no Sybil attacks or collusion). The basic question is how to
get selfish agents to work together for the common good. Let us define the
problem more precisely. We have a system that is used by a set of agents. The
system (called a “collective” in this context) has a global utility function that
measures its overall performance. The agents are selfish: each has a private utility
function that it tries to maximize. The system’s designers define the reward (the
increment in its private utility) given to each of the agent’s actions. The agents
choose their actions freely within the system. The goal is that agents acting to
maximize their private utilities should also maximize the global utility. There is
no other mechanism to force cooperation. This is in fact how society is organized.
For example, employees act to maximize their salaries and work satisfaction and
this should benefit the company.
A well-known example of collective intelligence is the El Farol bar problem
[3], which we briefly summarize. People go to El Farol once a week to have fun.
Each person picks which night to attend the bar. If the bar is too crowded or
too empty it is no fun. Otherwise, they have fun (receive a reward). Each person
makes one decision per week. All they know is last week’s attendance. In the
idealized problem, people don’t interact to make their decision, i.e., it is a case
of pure stigmergy! What strategy should each person use to maximize his/her
fun? We want to avoid a “Tragedy of the Commons” situation where maximizing
private utilities causes a minimization of the global utility [16].
We give the solution according to the theory of collective intelligence. Assume
we define the global utility G as follows:
\[
G = \sum_{w} W(w)
\]
\[
W(w) = \sum_{d} \phi_d(a_d)
\]
This sums the week utility W (w) over all weeks w. The week utility W (w) is the
sum of the day utilities φd (ad ) for each weekday d where the attendance ad is the
total number of people attending the bar that day. The system designer picks
the function $\phi_d(y) = \alpha_d\,y\,e^{-y/c}$. This function is small when y is too low or too
high and has a maximum in between. Now that we know the global utility, we
need to determine the agents’ reward function. This is what the agent receives
from the system for its choice of weekday. We assume that each agent will try to
maximize its reward. For example, [40] assumes that each agent uses a learning algorithm where it picks a night randomly according to a Boltzmann distribution over the energies in a 7-vector. When it gets its reward, it
updates the 7-vector accordingly. Real agents may use other algorithms; this one
was picked to make it possible to simulate the problem.
How do we design the agent’s reward function R(w), i.e., the reward that the
agent is given each week? There are many bad reward functions. For example,
Uniform Division divides $\phi_d(a_d)$ uniformly among all $a_d$ agents present on day $d$.
This one is particularly bad: it causes the global utility to be minimized. One
reward that works surprisingly well is called Wonderful Life:
\[
R_{\mathrm{WL}}(w) = W(w) - W_{\text{agent absent}}(w)
\]
$W_{\text{agent absent}}(w)$ is calculated in the same way as $W(w)$ but when the agent is missing (dropped from the attendance vector). We can say that $R_{\mathrm{WL}}(w)$ is the difference that the agent's existence makes, hence the name Wonderful Life, taken from the title of the Frank Capra movie [6]. We can show that if each agent maximizes its reward $R_{\mathrm{WL}}(w)$, the global utility will also be maximized. Let us
see how we can use this idea for building collective services. We assume that
agents try to maximize their rewards. For each action performed by an agent,
the system calculates the reward. The system is built using security techniques
such as encrypted communication so that the agent cannot “hack” its reward.
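A minimal sketch of these definitions, assuming the day utility $\phi_d(y) = \alpha_d\,y\,e^{-y/c}$ given above with illustrative constants; it computes the week utility and the Wonderful Life reward that an agent attending a given day would receive.

```python
# Sketch: week utility W(w) and the Wonderful Life reward R_WL(w) for one agent.
import math

ALPHA = 1.0   # alpha_d, taken equal for all days in this sketch (illustrative)
C = 6.0       # capacity parameter c of the day utility (illustrative)

def phi(attendance):
    """Day utility: small when the bar is too empty or too crowded."""
    return ALPHA * attendance * math.exp(-attendance / C)

def week_utility(attendance_by_day):
    """W(w) = sum over the 7 days of phi(a_d)."""
    return sum(phi(a) for a in attendance_by_day)

def wonderful_life_reward(attendance_by_day, agent_day):
    """R_WL(w) = W(w) - W(w with this agent dropped from its chosen day)."""
    absent = list(attendance_by_day)
    absent[agent_day] -= 1
    return week_utility(attendance_by_day) - week_utility(absent)

if __name__ == "__main__":
    attendance = [3, 8, 15, 2, 6, 20, 5]   # example attendance vector for one week
    for day, a in enumerate(attendance):
        print(f"day {day}: attendance {a:2d}, "
              f"R_WL = {wonderful_life_reward(attendance, day):+.3f}")
```

The reward is largest for under-attended days and becomes negative for days that are already overcrowded (adding one more person there lowers the week utility), which is exactly the incentive needed to spread attendance.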
This approach does not solve all the security problems in a collaborative system. For example, it does not solve the collusion problem when many agents get
together to try to break the system. For collusion, one solution is to have a monitor that detects suspicious behavior and ejects colluding users from the system.
This monitor is analogous to the SEC (Securities and Exchange Commission)
which regulates and polices financial markets in the United States. Collective
intelligence can still be useful as a base mechanism. In many cases, the default
behavior is that the agents cannot or will not talk to each other, since they do
not know each other or are competing. Collective intelligence is one way to get
them to cooperate.
Fig. 6. Three generations of peer-to-peer networks
5 Structured overlay networks
Structured overlay networks are a recent development of peer-to-peer networks.
In a peer-to-peer network, all nodes play equal roles. There are no specialized
client or server nodes. There have been three generations of peer-to-peer networks, which are illustrated in Figure 6:
– The first generation is a hybrid: all client nodes are equal but there is a
centralized node that holds a directory. This is the structure used by the
Napster file-sharing system.
– The second generation is an unstructured overlay network. It is completely
decentralized: each node knows a few neighbor nodes. This structure is used
by systems such as Gnutella, Kazaa, Morpheus, and Freenet. Lookup is done
by flooding: a node asks its neighbors, which ask their neighbors, up to a
fixed depth. There are no guarantees that the lookup will be successful (the
item may be just beyond the horizon) and flooding is highly wasteful of
network resources. Recent versions of this structure use a hierarchy with two
kinds of peer nodes: normal nodes and super nodes. Super nodes have higher
bandwidth and reliability than normal nodes. This alleviates somewhat the
disadvantages.
– The third generation is the structured overlay network. A well-known early
example of this generation is Chord [33]. The nodes are organized in a structured way called an exponential network. Lookup can be done in logarithmic
time and is guaranteed to find the item if it exists. If nodes fail or new
nodes join, then the network reorganizes itself to maintain the structure.
Since 2001, many variations of structured overlay networks with different
advantages and disadvantages have been designed: Chord, Pastry, Tapestry,
CAN, P-Grid, Viceroy, DKS, Chord#, Tango, etc. In SELFMAN we build
on our previous experience in DKS, Chord#, and Tango.
Structured overlay networks provide two basic services: name-based communication (point-to-point and group) and distributed hash table (also known as DHT,
which provides efficient storage and retrieval of (key,value) pairs). Routing is
done by a simple greedy algorithm that reduces the distance between the message's current node and its destination node. Correct routing means that
the distance converges to zero in finite time.
Almost all current structured overlay networks are organized in two levels, a
ring complemented by a set of fingers:
– Ring structure. All nodes are connected in a simple ring. The ring must
always be connected despite node joins, leaves, and failures.
– Finger tables. For efficient routing, extra routing links called fingers are
added to the ring. They are usually exponential, e.g., for the fingers of one
node, each finger jumps twice as far as the previous finger. The fingers can
temporarily be in an inconsistent state; this has an effect only on efficiency,
not on correctness. Within each node, the finger table continuously converges toward correct contents.
Ring maintenance is a crucial part of the SON. Peer nodes can join and leave
at any time. Peers that crash are like peers that leave but without notification.
Temporarily broken links create false suspicions of failure.
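The following sketch illustrates the greedy routing rule just described on a one-dimensional identifier ring: every node forwards a lookup to the known node (successor or finger) that is closest to the target key, measured as clockwise distance, and stops when no known node is closer than itself. The Node class and the final responsibility rule (handing over to the key's successor) are simplifications for illustration, not the DKS, Chord# or Tango code.

```python
# Sketch: greedy key-based routing on a ring with fingers.
# Identifiers live in [0, 2**M); distance is measured clockwise on the ring.
M = 16
RING = 2 ** M

def clockwise_dist(a, b):
    """Clockwise distance from identifier a to identifier b."""
    return (b - a) % RING

class Node:
    def __init__(self, ident):
        self.ident = ident
        self.neighbors = []   # successor plus finger nodes (other Node objects)

    def next_hop(self, key):
        """Greedy step: forward to the known node with the smallest remaining
        clockwise distance to key, provided it is strictly closer than we are."""
        best = min(self.neighbors,
                   key=lambda n: clockwise_dist(n.ident, key),
                   default=None)
        if best is None or \
           clockwise_dist(best.ident, key) >= clockwise_dist(self.ident, key):
            return None       # no progress possible from here
        return best

def route(start, key, max_hops=64):
    """Follow greedy hops until no neighbor is closer to key. In a real SON the
    node reached here would hand the message to its successor, which is the
    node responsible for key."""
    node = start
    for _ in range(max_hops):
        nxt = node.next_hop(key)
        if nxt is None:
            return node
        node = nxt
    return node
```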
We give three examples of structured overlay network algorithms developed in
SELFMAN that are needed for important aspects of ring maintenance: handling
network partitioning (Section 5.1), handling failure suspicions (Section 5.2), and
handling range queries with load balancing (Section 5.3). These algorithms can
be seen as dynamic feedback structures: they converge toward correct or optimal structures. The network partitioning algorithm restores a single ring in the
case when the ring is split into several rings due to network partitioning. The
failure handling algorithm restores a single ring in the case of failure suspicion
of individual nodes. The range query algorithm handles multidimensional range
queries. It has one ring per dimension. When nodes join or leave, each of these
rings is adjusted (by splitting or joining pieces in the key space) to maintain
balanced routing.
5.1 Handling network partitioning: the ring merge algorithm
Network partitioning is a real problem for any long-lived application on the
Internet. A single router crash can cause part of the network to become isolated
from another part. SONs should behave reasonably when a network partition occurs. If no special actions are taken, what actually happens when a partition occurs is that the SON splits into several rings. What we need to do is efficiently
detect when such a split happens and efficiently merge the rings back into a single
ring [31].
The merging algorithm consists of two parts. The first part detects when the
merge is needed. When a node detects that another node has failed, it puts the
node in a local data structure called the passive list. It periodically pings nodes
in its passive list to see whether they are in fact alive. If so, it triggers the ring
unification algorithm. This algorithm can merge rings in O(n) time for network
size n. We also define an improved gossip-based algorithm that can merge the
network in O(log n) average time.
Fig. 7. The ring merge algorithm
Ring unification happens between pairs of nodes that may be on different
rings. The unification algorithm assumes that all nodes live in the same identifier
space, even if they are on different rings. Suppose that node p detects that node
q on its passive list is alive. Figure 7 shows an example where we are merging the
black ring (containing node p) and the white ring (containing node q). Then p
does a modified lookup operation (mlookup(q)) to q. This lookup tries to reduce
the distance to q. When it has reduced this distance as much as possible, then
the algorithm attempts to insert q at that position in the ring using a second
operation, trymerge(pred,succ), where pred and succ are the predecessor and
successor nodes between which q should be inserted. The actual algorithm has
several refinements to improve speed and to ensure termination.
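The detection half of the merge algorithm can be sketched as follows. The names passive_list and mlookup and the periodic ping follow the description above, but ping and mlookup are injected placeholders here, not the actual operations of [31].

```python
# Sketch: detection part of the ring-merge algorithm. Each node keeps a passive
# list of nodes it believes have failed and periodically pings them; a node that
# answers may be on another ring, so ring unification is started toward it.
import time

class MergeDetector:
    def __init__(self, ping, mlookup, period=30.0):
        self.ping = ping            # ping(remote) -> True if remote answers
        self.mlookup = mlookup      # mlookup(remote): start unification toward remote
        self.passive_list = set()   # nodes previously detected as failed
        self.period = period        # seconds between rounds of pings

    def on_failure_detected(self, remote):
        """Called by the failure detector when remote is suspected of failing."""
        self.passive_list.add(remote)

    def run_once(self):
        """One periodic round: ping passive nodes; alive ones trigger unification."""
        for remote in list(self.passive_list):
            if self.ping(remote):
                self.passive_list.discard(remote)
                self.mlookup(remote)     # reduce the distance to remote, then trymerge

    def run_forever(self):
        while True:
            self.run_once()
            time.sleep(self.period)
```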
5.2 Handling failure suspicions: the relaxed ring algorithm
A typical Internet failure mode is that a node suspects another node of failing.
This suspicion may be true or false. In both cases, the ring structure must
be maintained. This can be handled through the relaxed ring algorithm [24].
This algorithm maintains the invariant that every peer is in the same ring as its
successor. Furthermore, a peer can never indicate another peer as the responsible
node for data storage: a peer knows only its own responsibility. If a successor node is suspected of having failed, then it is ejected from the ring. However, the node may still be alive and point to a successor. This leads to a structure we call the relaxed ring, which looks like a ring with “bushes” sticking out (see Figure 8). The bushes appear only if there are failure suspicions. At all times there is a perfectly connected ring at the core of the relaxed ring. The relaxed ring is always converging toward a perfect ring. The number of nodes in the bushes existing at any time depends on the churn (the rate of change of the ring, i.e., the number of failures and joins per unit time).

Fig. 8. The relaxed ring structure
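A minimal sketch of how a failure suspicion creates a branch: the predecessor skips over the suspected successor, while the suspected node, if it is actually alive, keeps pointing to its own successor and therefore hangs off the core ring as a bush until it rejoins. The Peer class is an illustration, not the actual relaxed-ring implementation of [24].

```python
# Sketch: branch ("bush") formation in the relaxed ring after a failure suspicion.
class Peer:
    def __init__(self, ident):
        self.ident = ident
        self.pred = None   # predecessor on the relaxed ring
        self.succ = None   # successor; invariant: a peer is in the same ring as its succ

    def suspect_successor(self, next_succ):
        """Eject the suspected successor by pointing to the node after it.
        If the suspected peer is actually alive, it keeps its own succ pointer
        and stays reachable only through it, forming a branch."""
        self.succ = next_succ
        next_succ.pred = self

# Example: core ring segment p -> q -> r, and p falsely suspects q.
p, q, r = Peer("p"), Peer("q"), Peer("r")
p.succ, q.pred, q.succ, r.pred = q, p, r, q
p.suspect_successor(r)
print(p.succ.ident, q.succ.ident, r.pred.ident)   # r r p: q now hangs in a branch
```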
5.3 Handling multidimensional range queries with load balancing
Efficient data lookup is at the heart of peer-to-peer computing. Many SONs,
including DKS and Tango, use consistent hashing to store (key,value) pairs in a
distributed hash table (DHT). The hashing distributes the keys uniformly over
the key space. Unfortunately, this scheme is unable to handle queries with partial
information (such as wildcards and ranges) because adjacent keys are spread over
all nodes. In this section, we argue that using DHTs is not a good idea in SONs.
We support this argument by showing how to build a practical SON that stores
the keys in lexicographic order. We have developed a first protocol, Chord#,
and a generalization for multidimensional range queries, SONAR [29].
In SONAR the overlay has the shape of a multidimensional torus, where
each node is responsible for a contiguous part of the data space. A uniform
distribution of keys on the data space is not necessary, because denser areas get
assigned more nodes. To support logarithmic routing, SONAR maintains, per
dimension, fingers to other nodes that span an exponentially increasing number of nodes. Figure 9 shows an example in two dimensions.

Fig. 9. Two-dimensional routing tables in SONAR (figure not reproduced; it shows node n's box in the two-dimensional key space, its per-dimension routing-table markers, and the successors, which are the nodes adjacent to these markers)

Most other overlays
maintain such fingers in the key space instead and therefore require a uniform data distribution (typically obtained using hashing). SONAR, in contrast, avoids hashing and is therefore able to perform range queries of arbitrary shape in a logarithmic number of routing steps, independent of the number of system and query dimensions.
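The per-dimension routing state can be sketched as follows: each node owns a box of the d-dimensional key space and keeps, for every dimension, a list of fingers; a lookup is forwarded greedily along the first dimension in which the target coordinate falls outside the node's box. The box representation and the tie-breaking rule are simplifications for illustration, not the SONAR implementation of [29].

```python
# Sketch: per-dimension finger tables and greedy forwarding in a SONAR-like overlay.
class SonarNode:
    def __init__(self, box_low, box_high):
        self.low = box_low      # lower corner of the node's box (one value per dimension)
        self.high = box_high    # upper corner of the node's box
        # fingers[dim] holds nodes at (roughly) exponentially increasing
        # distances along that dimension.
        self.fingers = {dim: [] for dim in range(len(box_low))}

    def center(self, dim):
        return (self.low[dim] + self.high[dim]) / 2.0

    def contains(self, point):
        return all(l <= x < h for l, x, h in zip(self.low, point, self.high))

    def next_hop(self, point):
        """Greedy step: along the first dimension where the target coordinate is
        outside our box, forward to the known node whose box center is closest
        to that coordinate; return None when we are the responsible node."""
        if self.contains(point):
            return None
        for dim, x in enumerate(point):
            if self.low[dim] <= x < self.high[dim]:
                continue                      # already aligned in this dimension
            candidates = self.fingers[dim] + [self]
            best = min(candidates, key=lambda n: abs(n.center(dim) - x))
            if best is not self:
                return best
        return None   # no closer finger known; a real node would fall back to neighbors
```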
6 Transactions over structured overlay networks
For our three decentralized application scenarios, we need a decentralized transactional storage. We need transactions because the applications need concurrent
access to shared data. We have therefore designed a transaction algorithm over
SONs. We are currently simulating it to validate its assumptions and measure
its performance [25, 26]. Implementing transactions over a SON is challenging
because of churn (rate of node leaves, joins, and crashes and subsequent reorganizations of the SON) and because of the Internet’s failure model (crash stop
with imperfect failure detection).
The transaction algorithm is built on top of a reliable storage service. We
implement this using replication. There are many approaches to replication on a
SON. For example, we could use file-level replication (symmetric replication) or
block-level replication using erasure codes. These approaches all have their own
application areas. Our algorithm uses symmetric replication [14].
To avoid the problems of failure detection, we implement atomic commit
using a majority algorithm based on a modified version of the Paxos algorithm
[15].

Fig. 10. Transaction with replicated manager and participants (figure not reproduced; it shows a client contacting a transaction manager, with replicated transaction managers (rTM) and replicated participants identified by symmetric replica keys such as 1,5,9,13 and 3,7,11,15)

In a companion paper, we have shown that majority techniques work well for DHTs [32]: the probability of data consistency violation is negligible. If a
consistency violation does occur, then this is because of a network partition and
we can use the network merge algorithm of Section 5.1.
A client initiates a transaction by asking its nearest node, which becomes a
transaction manager. Other nodes that store data are participants in the transaction. Assuming symmetric replication with degree f, we have f transaction managers, and each other participating node gives f replicated participants. Figure 10 shows a situation with f = 4 and two nodes participating in addition
to the transaction manager. Each transaction manager sends a Prepare message
to all replicated participants, each of which sends back a Prepared or Abort message to all replicated transaction managers. Each replicated transaction manager
collects votes from a majority of participants and locally decides on abort or commit. It sends this to the transaction manager. After having collected a majority,
the transaction manager sends its decision to all participants. This algorithm has
six communication rounds. It succeeds if more than f /2 nodes of each replica
group are alive.
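The local decision of one replicated transaction manager can be sketched as the vote-counting rule below: every item's replica group must reach a Prepared majority (more than f/2 votes) before the manager decides commit, while a group that reaches an Abort majority, or can no longer reach a Prepared majority, forces abort. The message handling and the final collection of manager decisions are omitted; this is an illustration of the rule, not the algorithm of [25, 26].

```python
# Sketch: majority-based local decision of one replicated transaction manager.
F = 4   # symmetric replication degree f; a majority is more than F // 2 votes

def local_decision(votes_by_item):
    """votes_by_item maps each data item to the list of votes ("prepared" or
    "abort") received so far from its F replicated participants. Returns
    "commit", "abort", or None if we must keep waiting for more votes."""
    decision = "commit"
    for votes in votes_by_item.values():
        prepared = votes.count("prepared")
        aborted = votes.count("abort")
        if prepared > F // 2:
            continue                                 # this group is ready to commit
        if aborted > F // 2 or prepared + (F - len(votes)) <= F // 2:
            return "abort"                           # abort majority, or commit impossible
        decision = None                              # still waiting on this group
    return decision

if __name__ == "__main__":
    print(local_decision({"x": ["prepared"] * 3, "y": ["prepared"] * 3}))  # commit
    print(local_decision({"x": ["abort", "abort", "abort"]}))              # abort
    print(local_decision({"x": ["prepared", "prepared"]}))                 # None (wait)
```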
7 Conclusions and future work
The SELFMAN project is using self-management techniques to build large-scale
distributed systems. This paper gives a snapshot of the SELFMAN project at
its halfway point. We explain why self management is important for software
design and we give some first results on how to design self-managing systems
as feedback loop structures. We show how to use structured overlay networks
(SONs) as the basis of large-scale distributed self-managing systems. We explain
how we have adapted SONs for our purposes by handling network partitioning,
failure suspicions, and range queries with load balancing, and by providing a
transactional store service running over the SON. We present three realistic
application scenarios, a machine-to-machine messaging application, a distributed
Wiki, and an on-demand video streaming application. In the rest of the project,
we will complete the transactional store and build the demonstrator applications.
The final result will be a set of guidelines on how to build decentralized self-managing applications.
Acknowledgements
This work is funded by the European Union in the SELFMAN project (contract
34084) and in the CoreGRID network of excellence (contract 004265). Peter
Van Roy is the coordinator of the SELFMAN project. He acknowledges all the
partners in the SELFMAN project for their insights and research results, some
of which are summarized in this paper. He also acknowledges Mahmoud Rafea
for encouraging him to look at the human endocrine system and Mohamed El-Beltagy for introducing him to collective intelligence.
References
1. Arad, Cosmin, Roberto Roverso, Seif Haridi, Yves Jaradin, Boris Mejias, Peter Van
Roy, Thierry Coupaye, B. Dillenseger, A. Diaconescu, A. Harbaoui, N. Jayaprakash,
M. Kessis, A. Lefebvre, and M. Leger. Report on architectural framework specification,
SELFMAN Deliverable D2.2a, June 2007, www.ist-selfman.org.
2. Armstrong, Joe. “Making Reliable Distributed Systems in the Presence of Software Errors,” Ph.D. dissertation, Royal Institute of Technology (KTH), Stockholm,
Sweden, Nov. 2003.
3. Arthur, W. B. Complexity in economic theory: Inductive reasoning and bounded
rationality. The American Economic Review, 84(2), May 1994, pages 406-411.
4. Ashby, W. Ross. “An Introduction to Cybernetics,” Chapman & Hall Ltd., London,
1956. Internet (1999): pcp.vub.ac.be/books/IntroCyb.pdf.
5. von Bertalanffy, Ludwig. “General System Theory: Foundations, Development, Applications,” George Braziller, 1969.
6. Capra, Frank. “It’s a Wonderful Life,” Liberty Films, 1946.
7. Carroll, Lewis. “Through the Looking-Glass and What Alice Found There,” 1872
(Dover Publications reprint 1999).
8. Carton, Bruno, and Valentin Mesaros. Improving the Scalability of Logarithmic-Degree DHT-Based Peer-to-Peer Networks, 10th International Euro-Par Conference,
Aug. 2004, pages 1060–1067.
9. Collet, Raphaël, Michael Lienhardt, Alan Schmitt, Jean-Bernard Stefani, and Peter Van Roy. Report on formal operational semantics (components and reflection),
SELFMAN Deliverable D2.3a, Nov. 2007, www.ist-selfman.org.
10. Encyclopaedia Britannica. Article Human Endocrine System, 2005.
11. Fairley, Peter. The Unruly Power Grid, IEEE Spectrum Online, Oct. 2005.
12. France Télécom, Zuse Institut Berlin, and Stakk AB. User requirements, SELFMAN Deliverable D5.1, Nov. 2007, www.ist-selfman.org.
13. Ghodsi, Ali. “Distributed K-ary System: Algorithms for Distributed Hash Tables,”
Ph.D. dissertation, Royal Institute of Technology (KTH), Stockholm, Sweden, Oct.
2006.
14. Ghodsi, Ali, Luc Onana Alima, and Seif Haridi. Symmetric Replication for Structured Peer-to-Peer Systems, Databases, Information Systems, and Peer-to-Peer Computing (DBISP2P 2005), Springer-Verlag LNCS volume 4125, pages 74-85.
15. Gray, Jim and Leslie Lamport. Consensus on transaction commit. ACM Trans.
Database Syst., ACM Press, 2006(31), pages 133-160.
16. Hardin, Garrett. The Tragedy of the Commons, Science, Vol. 162, No. 3859, Dec.
13, 1968, pages 1243–1248.
17. Hellerstein, Joseph L., Yixin Diao, Sujay Parekh, and Dawn M. Tilbury. “Feedback
Control of Computing Systems,” Aug. 2004, Wiley-IEEE Press.
18. Hoglund, Greg and Gary McGraw. “Exploiting Online Games: Cheating Massively
Distributed Systems,” Addison-Wesley Software Security Series, 2008.
19. IBM. Autonomic computing: IBM’s perspective on the state of information technology, 2001, researchweb.watson.ibm.com/autonomic.
20. Jelasity, Márk, Rachid Guerraoui, Anne-Marie Kermarrec, and Maarten van Steen.
The Peer Sampling Service: Experimental Evaluation of Unstructured Gossip-Based
Implementations, Springer LNCS volume 3231, 2004, pages 79–98.
21. Kim, Jeong-Rae, Yeoin Yoon, and Kwang-Hyun Cho. Coupled Feedback Loops Form
Dynamic Motifs of Cellular Networks, Biophysical Journal, 94, Jan. 2008, pages 359365.
22. Kobayashi, Tetsuya, Luonan Chen, and Kazuyuki Aihara. Modeling Genetic
Switches with Positive Feedback Loops, J. theor. Biol., 221, 2003, pages 379-399.
23. Lienhardt, Michael, Alan Schmitt, and Jean-Bernard Stefani. Oz/K: A Kernel Language for Component-Based Open Programming, Sixth International Conference on
Generative Programming and Component Engineering (GPCE’07), Oct. 2007.
24. Mejias, Boris, and Peter Van Roy. A Relaxed Ring for Self-Organising and Fault-Tolerant Peer-to-Peer Networks, XXVI International Conference of the Chilean Computer Science Society (SCCC 2007), Nov. 2007.
25. Moser, Monika, and Seif Haridi. Atomic Commitment in Transactional DHTs,
Proc. of the CoreGRID Symposium, Rennes, France, Aug. 2007.
26. Moser, Monika, Seif Haridi, Thorsten Schütt, Stefan Plantikow, Alexander
Reinefeld, and Florian Schintke. First report on formal models for transactions over structured overlay networks, SELFMAN Deliverable D3.1a, June 2007,
www.ist-selfman.org.
27. Pigolotti, Simone, Sandeep Krishna, and Mogens H. Jensen. Oscillation patterns
in negative feedback loops, Proc. National Academy of Sciences, vol. 104, no. 16, April
2007.
28. Salen, Katie, and Eric Zimmerman. “Rules of Play: Game Design Fundamentals,”
MIT Press, Oct. 2003.
29. Schütt, Thorsten, Florian Schintke, and Alexander Reinefeld. Range Queries on
Structured Overlay Networks, Computer Communications 31(2008), pages 280-291.
30. SELFMAN: Self Management for Large-Scale Distributed Systems based on Structured Overlay Networks and Components, European Commission 6th Framework
Programme, June 2006, www.ist-selfman.org.
31. Shafaat, Tallat M., Ali Ghodsi, and Seif Haridi. Dealing with Network Partitions in
Structured Overlay Networks, Journal of Peer-to-Peer Networking and Applications,
Springer-Verlag, 2008 (to appear).
32. Shafaat, Tallat M., Monika Moser, Ali Ghodsi, Thorsten Schütt, Seif Haridi, and
Alexander Reinefeld. On Consistency of Data in Structured Overlay Networks, CoreGRID Integration Workshop, Heraklion, Greece, Springer LNCS, 2008.
33. Stoica, Ion, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications,
SIGCOMM 2001, pages 149-160.
34. Van Roy, Peter. Self Management and the Future of Software Design, Third International Workshop on Formal Aspects of Component Software (FACS 2006), ENTCS
volume 182, June 2007, pages 201-217.
35. Weinberg, Gerald M. “An Introduction to General Systems Thinking: Silver Anniversary Edition,” Dorset House, 2001 (original edition 1975).
36. Whitehead, Alfred North. Quote: Civilization advances by extending the number
of important operations which we can perform without thinking of them.
37. Wiener, Norbert. “Cybernetics, or Control and Communication in the Animal and
the Machine,” MIT Press, Cambridge, MA, 1948.
38. Wiger, Ulf. Four-Fold Increase in Productivity and Quality – Industrial-Strength
Functional Programming in Telecom-Class Products, Proceedings of the 2001 Workshop on Formal Design of Safety Critical Embedded Systems, 2001.
39. Wikipedia, the free encyclopedia. Article Belief Propagation, March 2008,
en.wikipedia.org/wiki/Belief_propagation.
40. Wolpert, David H., Kevin R. Wheeler, and Kagan Tumer. General principles of
learning-based multi-agent systems, Proc. Third Annual Conference on Autonomous
Agents (AGENTS ’99), May 1999, pages 77-93.
41. Wolpert, David H., Kevin R. Wheeler, and Kagan Tumer. Collective intelligence
for control of distributed dynamical systems, Europhys. Lett., 2000.
42. Yap, Roland, Felix Halim, and Wu Yongzheng. First report on security in structured
overlay networks, SELFMAN Deliverable D1.3a, Nov. 2007, www.ist-selfman.org.
Chapter 3
D1.2: Report on high-level
self-management primitives for
structured overlay networks
3.1 Executive summary
In order to build self-managing large-scale distributed systems, SELFMAN
is aiming for a combination of component models and structured overlay
networks. The goal is to achieve self management along four axes: self-configuration, self-healing, self-tuning, and self-protection. This deliverable is mainly focused on the self-configuration and self-healing properties of structured overlay networks, defining the high-level primitives to be used by applications running on top of the peer-to-peer infrastructure.
Structured overlay networks are used for building self-organising peer-to-peer systems with efficient routing. There are many ways of structuring these kinds of networks. In SELFMAN, we focus our work on the ring topology, the relaxed-ring, and range queries. The ring and the relaxed-ring distribute the resources uniformly among peers, providing a Distributed Hash Table (DHT). If a uniform distribution of the resources is not feasible, it is recommended to use a topology that supports range queries. The results presented in this deliverable are related to these three kinds of networks and to how applications interface with them using high-level primitives.
This deliverable is the continuation of the “Report on low-level self-management primitives for structured overlay networks” (D1.1) from year one. The results presented here achieve what was proposed as future work in Deliverable D1.1, and this deliverable presents the Application Programming Interface (API) that has been used to implement software deliverables D1.4 and D1.5,
presented in Chapters 5 and 6. It is also related to the results presented
in Deliverable D3.3a, Chapter 14, where a replicated storage service is implemented on top of the results of this deliverable.
3.2 Contractors contributing to the Deliverable
Contributions measured by publications related to this deliverable are provided by Université catholique de Louvain UCL(P1), Zuse Institute Berlin ZIB(P5), Kungliga Tekniska Högskolan KTH(P2), and Peerialism. Contractors National University of Singapore NUS(P7) and Institut National de Recherche en Informatique et Automatique INRIA(P3) have contributed analyses of how to improve the current results. Their related work in the form of publications is presented in other deliverables. Partner Peerialism has contributed a novel NAT traversal approach.
UCL(P1) has continued to work on the relaxed-ring in order to provide a
stable self-managing peer-to-peer network to be used for the implementation
of decentralized applications. It provides high-level primitives by defining the
Application Programming Interface (API) to be used by other deliverables
and work packages. This work is directly related to D1.5.
ZIB(P5) has continued its work on range queries, which provide an alternative solution to the ring topology. It has also been developing work on sloppy management of structured peer-to-peer services. The work concerning the API
is presented in Deliverable D3.3a.
KTH(P2) has also contributed to defining the API by continuing the development of DKS. The work done on this deliverable is directly related to
Deliverable D1.4.
Peerialism has contributed with a novel NAT traversal approach. This
work might belong to low-level primitives, but since Peerialism joined the project only in year two, we have also included these results because of
their strong relationship with the rest of the project.
3.3 Results
This section is dedicated to reporting on the results of year two in the design of self-management primitives for structured overlay networks. Part of the results corresponds to concluding work on low-level primitives, and the other part
is the definition of an Application Programming Interface (API) as high-level
primitives for developing services and applications on top of the structured
overlay networks we have designed.
3.3.1 Introduction
Structured overlay networks are the foundation for the development of self-managing large-scale applications. They provide a self-configurable and self-healing network which is decentralized, avoiding the classical problem of a single point of failure. SELFMAN has been working on different approaches
to build such networks, targeting different and complementary issues. The
work has been divided into low-level and high-level self-management primitives. The former were reported in the first year of the project. The latter are presented in this deliverable.
Since this is the continuation of the work reported in Deliverable D1.1,
we analyze the future work of that deliverable. It was said that we would produce further publications extending the results on multi-dimensional range queries in SONAR and the Relaxed-Ring. We have satisfactorily achieved this task. The
publications are described in Section 3.5, and the main concepts and ideas
of these results are presented in the following sections of this deliverable.
We also promised that KTH would continue with the development
of DKS and UCL with P2PS, which is based on the Relaxed-Ring. These
projects are presented as software deliverables D1.4 and D1.5, in chapters 5
and 6 respectively. The relation with this deliverable is based on the Application Programming Interface (API) we present here as high-level primitives.
Both software deliverables implement a large part of this API. Of course, all the
results on the relaxed-ring are included in the development of P2PS.
In the following sections we present the main concepts of the results
achieved by this deliverable, and we also discuss ongoing and future work,
and how it would be included in the next year of the project. We conclude
by summarizing the publications which are added as appendices.
3.3.2 Range Queries
The results presented in this section target the efficient handling of multiple range queries, which is not well achieved by our other results based on the distributed hash-table ring topology, as in P2PS and DKS. This is why this work is complementary to the other results presented in this deliverable.

Figure 3.1: Routing fingers of a SONAR node in a two-dimensional data space.
We briefly summarize the results obtained by two structured overlay networks that support arbitrary range queries. The first one, named Chord# ,
has been derived from Chord [136] by replacing Chord's hashing function with a key-order-preserving function. It has a logarithmic routing performance
and it supports range queries, which is not possible with Chord. Its O(1)
pointer update algorithm can be applied to any peer-to-peer routing protocol
with exponentially increasing pointers.
Chord# is extended to support multiple dimensions, resulting in SONAR,
a Structured Overlay Network with Arbitrary Range queries. SONAR covers
multi-dimensional data spaces and, in contrast to other approaches, SONAR's
range queries are not restricted to rectangular shapes but may have arbitrary
shapes. Empirical results with a data set of two million objects show the
logarithmic routing performance in a geospatial domain.
Figure 3.1 depicts a routing table in a two-dimensional data space. The
keys are specified by attribute vectors (x, y) and hypercuboids cover the complete key space. The hypercuboids are presented in the figure as rectangular
boxes which are managed by the nodes. Their different areas are due to the
key distribution, which confirms that this data would not be balanced in a
ring architecture. In SONAR, at runtime, the load balancing scheme ensures
that each box holds about the same number of keys.
Figure 3.2: SONAR overlay network with 1.9 million keys (city coordinates) over 2048 nodes. Each rectangle represents one node.

SONAR maintains two data structures, a neighbor list and a routing table. The neighbor list contains links to all neighbors of a node. The node
depicted by the grey box in Figure 3.1, for example, has ten neighbors. The
routing table comprises d subtables, one for each dimension. Each subtable s
with 1 ≤ s ≤ d contains fingers that point to remote nodes in exponentially
increasing distances.
One application of SONAR is shown in Figure 3.2, which is also used as an evaluation experiment. It represents a data set of a traveling salesman
problem with the 1,904,711 largest cities worldwide. Their GPS locations
follow a Zipf distribution, which is a common distribution pattern of many
other application domains. In a preprocessing step we partitioned the globe
into non-overlapping rectangular patches so that each patch contains about
the same number of cities. During the first year of the project we managed the data with a network of 256 peers. This year we were able to scale to 2048 nodes while still providing very efficient
routing. More specific results concerning this work can be found in Appendix
A.3.
3.3.3 Relaxed-Ring
Continuing the work of D1.1, we have finished the design, analysis and validation of the Relaxed-Ring, the network topology used to implement P2PS.
Figure 3.3: Extreme case of a relaxed-ring with many branches.

We have also defined its high-level primitives, which are included in the API of Section 3.3.4. This section is dedicated to describing the work done with
respect to the analysis of the algorithm and its validation. The analysis has
been done with the help of results coming from WP2, more precisely the
definition of feedback loops in the design of self-managing software. The
validation is done via simulations that will be described later in this section.
The Relaxed-Ring topology is a Chord-like [136] ring, where every peer
has a successor (succ) and a predecessor (pred). It is a structured overlay
network providing a Distributed Hash Table (DHT) where every peer is responsible for a certain range of hash-keys, which is delimited by its own key
and the key of its predecessor, pred. In order to efficiently route messages in
the network, every peer has a set of fingers to jump across the ring. When a
new peer joins the network, it uses a hash key as identifier, joining between
its corresponding succ and pred. In addition, the relaxed-ring allows loosely
coupled peers that can be attached in branches when they cannot contact
their predecessors. This property makes the system more robust and fault-tolerant. An extreme case of the relaxed-ring topology can be observed in
Figure 3.3.
Taken from system theory, feedback loops can be observed not only in
existing automated systems, but also in self-managing systems in nature.
Several examples of this can be found in [145]. The loop consists of three main concurrent components interacting with the subsystem. There is at least one agent in charge of monitoring the subsystem, passing the monitored information to another component in charge of deciding a corrective action
if needed. An actuating agent is used in order to perform this action in the
subsystem.

Figure 3.4: Failure recovery mechanism of the relaxed ring modeled as a feedback loop. The labels exemplify the failure of peer q, placed between peers p and r.

These three components together with the subsystem form the
entire system. It has similar properties to PID-controllers, with the difference
that the evolution of a running software application is measured discretely.
The results of modeling the relaxed-ring using feedback loops have been published in [101]. The work focuses on the fault-tolerant maintenance of the network under high churn, with the join and failure-recovery mechanisms being the key parts of the architecture. Both feedback loops can be understood as independent loops, with the observation that we made the failure-recovery mechanism reuse the join mechanism. This makes it easy for them to interact.
In this deliverable we show the interaction of both loops, which is depicted
in Figure 3.4. We use a concrete example to explain it.
Let us consider a particular section of the ring having peers p, q and r
connected through successor and predecessor pointers. Figure 3.4 describes
how the ring is perturbed and stabilised in the presence of a failure of peer
q. Only relevant monitored and actuating actions are included in the figure
to avoid a bigger and verbose diagram.
Initially, the crash of peer q is detected by peers p and r (1). Both peers
will update their routing tables removing q from the set of valid peers (2b).
But, since p is q’s predecessor, only p will trigger the correcting event join
(2a). This first iteration corresponds to a loop from the failure recovery
mechanism. The join event will be monitored by peer r (3), starting an
iteration in the join maintenance loop. The correcting action join ok will be
triggered (4a) together with the corresponding update of the routing table
(4b). Then, the event join ok will be monitored (5) by the failure recovery
component in order to perform the corresponding update of the routing table
(6). Since the join ok event is also detected by the join loop, both loops will
consider the network stable again.
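The numbered interaction can be sketched as event handlers on the peers; the event names follow Figure 3.4, while the handler bodies are simplified placeholders rather than the P2PS code, and the transport function send is assumed.

```python
# Sketch: the failure-recovery and join loops of Figure 3.4 as event handlers.
# Step numbers in the comments refer to the description above.
class PeerLoops:
    def __init__(self, ident, send):
        self.ident = ident
        self.send = send          # send(dest, message): transport, assumed given
        self.valid_peers = set()  # simplified routing table
        self.pred = None
        self.succ = None

    # --- failure-recovery loop -------------------------------------------
    def on_crash_detected(self, crashed, next_alive):       # (1)
        self.valid_peers.discard(crashed)                    # (2b)
        if self.succ is crashed:                             # only the predecessor reacts
            self.send(next_alive, ("join", self))            # (2a): reuse the join loop

    def on_join_ok(self, new_succ):                          # (5)
        self.succ = new_succ                                 # (6)
        self.valid_peers.add(new_succ)

    # --- join-maintenance loop -------------------------------------------
    def on_join(self, joiner):                               # (3): monitored by peer r
        self.pred = joiner                                   # (4b)
        self.valid_peers.add(joiner)
        self.send(joiner, ("join_ok", self))                 # (4a)
```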
Figure 3.5: Average size of branches depending on the quality of connections:
avg corresponds to existing branches and totalavg represents how the whole
network is affected.
One of the missing parts of the results of year one is the empirical validation of the relaxed-ring, and of how the created branches influence the routing efficiency and the load of the network. To validate the relaxed-ring, we analyse four aspects: the number of branches that can appear in a network, the size of branches, the number of messages generated by the ring-maintenance protocol, and the verification of lookup consistency in unstable scenarios. The evaluation is done using a simulator implemented in Mozart, which is released in software deliverable D1.5. In this simulator (called CiNiSMO, for Concurrent Network Simulator in Mozart-Oz), every node runs autonomously on its own lightweight thread. Nodes communicate with each other by message passing using ports. We consider that these properties make the simulator more realistic. Every network is run several times using different seeds for random number generation. Charts are built using the average values of these executions. This deliverable focuses on the size of branches and on the network traffic in comparison with Chord. Other results are presented in detail in Appendix A.5.
Figure 3.5 shows the average size of branches depending on the quality of
connections. The coefficient c represents the connectivity level of the network,
where for instance c = 0.95 means that when a node contacts another one,
there is only a 95% probability that they will establish a connection. A
value of c = 1.0 means 100% of connectivity.
The average size of branches appears to be independent of the size of
the network. The value is very similar for both cases where the quality of
the connectivity is poor. In none of the cases is the average higher than 2
peers, which is a very reasonable value. If we want to analyse how the size of
branches degrades routing performance of the whole network, we have to look
at the average considering all nodes that belong to the core ring as providing
branches of size 0. This value is represented by the curves totalavg on the
figure. In both cases the value is smaller than 0.25. Experiments with 100%
of connectivity are not shown because there are no branches, so the average
size is always 0.
We have also implemented Chord in our simulator. Experiments were
only run in fault-free scenarios with full connectivity between peers, and
thus, in better conditions than our experiments with the relaxed-ring. We
have observed that lookup consistency can be maintained in Chord at very
good levels if periodic stabilization is triggered often enough. The problem
is that periodic stabilization demands a lot of resources.
Figure 3.6 depicts the load related to every different stabilization rate.
Logically, the worst case corresponds to the most frequently triggered stabilization. If we only consider networks of up to 3000 nodes, it seems that the cost of periodic stabilization pays off for the level of lookup consistency that it
offers, but this cost seems too expensive with larger networks.
In any case, the difference with the relaxed-ring is considerable. While the relaxed-ring does not exceed $5 \times 10^4$ messages for a network of 10000 nodes, a stabilization rate of 7 on a Chord network starts already at $2 \times 10^5$ with the smallest network of 1000 nodes. Figure 3.6 clearly depicts the difference in the number of messages sent. The point is that too many stabilization messages are triggered without the network being modified. On the contrary, every join on the relaxed-ring generates more messages, but these messages are only triggered when they are needed.
3.3.4 API
To be able to build services and applications on top of our structured overlay
networks, it is necessary to define a simple high-level API that can express
the functionality of the underlying self-managing peer-to-peer network. The
API we present and discuss here is inspired by OpenDHT [117], because it
fits the ring topology used by DKS and P2PS, both presented as software
deliverables D1.4 and D1.5. Those deliverables implement a large part of the following API. The interface associated with range queries and Chord# is presented in detail in its dedicated Deliverable D3.3a, Chapter 14.
Figure 3.6: Load of messages in Chord due to periodic stabilization, compared to the load of the Relaxed-Ring maintenance with bad connectivity.
Y-axis presented in logarithmic scale.
Basic functionality
The basic functionality that one can expect from a peer-to-peer network
corresponds to the ability to create a node, make it join a network, look up
other nodes in the network, and leave the network. Applications do not need
to worry about detecting failures or recovering from other peers'
crashes.
The API we provide is untyped. Its objective is to represent a language-independent
interface that can be implemented by different kinds of peer-to-peer networks.
A Java sketch of this interface is given after the list below.
• create peer(): Creates a node with an identifier id, and a namespace,
which can be used to bootstrap a new network. Other peers can use this
namespace to join this node.
• join(id, namespace): It uses an identifier id to join a network identified by namespace.
• lookup(key) It finds the peer responsible for a given key. It returns
the id of that peer. No specification about how the message is routed
is given.
SELFMAN Deliverable Year Two, Page 52
CHAPTER 3. D1.2: REPORT ON HIGH-LEVEL SELF-MANAGEMENT
PRIMITIVES FOR STRUCTURED OVERLAY NETWORKS
• send(msg, key) Reliable send of message msg to the peer responsible
for the provided key.
• direct send(msg, id) Reliable send of message msg to the particular
peer identified by id. If that peer is not connected to the network, the
message will not be delivered to the peer currently responsible for id. This
is why the operation takes an id, in contrast to a key.
• broadcast(msg, from, to) It provides a pseudo-reliable broadcast of
message msg to all peers responsible for keys in the range [from, to].
Since the functionality considers a range of keys, it can be used as a
multicast. Using [0, N − 1] broadcasts the message to all peers.
• leave This functionality explicitly disconnects the peer from the network. It is not strictly necessary to implement, because leaving can be treated as a particular case of failing. We include it to give the implementing network the possibility of sending leaving messages for the sake of efficiency,
not for correctness.
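To make the interface concrete, the following Java sketch shows one way the basic functionality could be rendered. The type and method names are illustrative only and do not correspond to the actual D1.4/D1.5 interfaces; identifiers and keys are modelled as positions on the ring, and a static factory corresponding to create peer() is assumed but omitted.

    import java.math.BigInteger;

    /** Illustrative sketch of the basic overlay API; all names are hypothetical.
     *  Identifiers and keys are modelled as BigInteger positions on the ring. */
    public interface OverlayPeer {

        /** Joins the network identified by namespace, using identifier id. */
        void join(BigInteger id, String namespace);

        /** Returns the id of the peer responsible for key; the routing strategy is unspecified. */
        BigInteger lookup(BigInteger key);

        /** Reliable send of msg to the peer responsible for key. */
        void send(Object msg, BigInteger key);

        /** Reliable send of msg to the particular peer identified by id. The message
         *  is not delivered to whoever currently covers id if that peer has left. */
        void directSend(Object msg, BigInteger id);

        /** Pseudo-reliable broadcast of msg to all peers responsible for keys in [from, to]. */
        void broadcast(Object msg, BigInteger from, BigInteger to);

        /** Optional explicit leave, provided for efficiency rather than correctness. */
        void leave();
    }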
Basic storage functionality
This functionality is provided as the basic storage service that a DHT can
offer. The implementing system is free to offer the replication guarantees that
it considers necessary. However, we assume that the transactional operations, described
in the following section, should offer the replication guarantees expected from
a distributed data storage service, and that this group of functions
should remain a more basic set of operations without strong persistence
guarantees. Still, all these functions require the lookup consistency provided
by the underlying network. The work presented in sections 3.3.2 and 3.3.3
targets this issue.
We do not use secrets associated with keys, and we do not use time-to-live
values either. This is because we are focused on applications such as
the distributed wiki, described in WP3, where the data is not supposed to
disappear after a fixed TTL. The reason for not including secrets is that
we are still analysing different approaches to security on DHTs, as will be
described in section 3.3.6. A Java sketch of these operations is given after the list.
• put(key, value): Stores the value value under the key key. The value
will reside on the peer responsible for key.
• get(key): Returns the value associated with key.
• remove(key): Removes the value associated with key.
SELFMAN Deliverable Year Two, Page 53
CHAPTER 3. D1.2: REPORT ON HIGH-LEVEL SELF-MANAGEMENT
PRIMITIVES FOR STRUCTURED OVERLAY NETWORKS
• put immutable(key, value): Stores an immutable value associated
with key. This means that no other value can be put using the same
key until the value is removed. Note that there is no need for a
get immutable operation, because its functionality is already covered
by the get operation.
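As with the basic functionality, the storage operations can be sketched as a small Java interface; again the names are illustrative, not the actual D1.4/D1.5 interface, and they carry no replication or persistence guarantees beyond those stated above.

    import java.math.BigInteger;

    /** Illustrative sketch of the basic storage operations (no secrets, no TTLs). */
    public interface OverlayStorage {

        /** Stores value under key, on the peer responsible for key. */
        void put(BigInteger key, Object value);

        /** Returns the value associated with key, or null if there is none. */
        Object get(BigInteger key);

        /** Removes the value associated with key. */
        void remove(BigInteger key);

        /** Stores an immutable value; returns false if an immutable value is already
         *  bound to key and has not been removed. */
        boolean putImmutable(BigInteger key, Object value);
    }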
Transactional functionality
This functionality is directly connected with the technical report “Transactional
DHT Algorithms” [107], included in Appendix A.12 of Deliverable
D3.1b. The objective of this interface is to provide access to transactional
operations that use symmetric replication, as described in [57], giving strong
guarantees on persistent storage. If these operations are implemented following
the mentioned technical report, they also provide strong guarantees
with respect to failures of peers while the transaction is being processed. If the
majority of the replica peers are alive at the moment of committing the transaction,
then the transaction will commit. A usage sketch in Java is given after the list of operations.
• begin transaction Returns a transaction identifier to be used to
commit or abort the transaction. The transaction allows a set of read
and write operations over one or more items to be performed atomically.
• write(item, value) Writes value to the item identified by item, using
symmetric replication.
• read(item) Returns the value stored in item.
• commit(tid) Commits the transaction identified by tid.
• abort(tid) Aborts the transaction identified by tid.
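The following Java sketch illustrates how an application such as the distributed wiki might use these operations. The interface and the assumption that read and write operate on the currently open transaction are ours; they are not taken from the technical report.

    /** Hypothetical transactional interface, following the operations listed above. */
    interface TransactionalDht {
        String beginTransaction();              // returns the transaction identifier (tid)
        void write(String item, Object value);  // buffered within the open transaction
        Object read(String item);               // read within the open transaction
        boolean commit(String tid);             // true if the transaction committed
        void abort(String tid);
    }

    class WikiUpdater {
        /** Atomically updates a wiki page and reads its link list. */
        static boolean updatePage(TransactionalDht dht, Object newContent) {
            String tid = dht.beginTransaction();
            dht.write("wiki/home", newContent);     // stored via symmetric replication
            Object links = dht.read("wiki/links");  // read in the same transaction
            // The commit succeeds only if a majority of the replica peers is alive.
            return dht.commit(tid);
        }
    }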
3.3.5 Sloppy Management
As an unexpected result, we have observed the potential of sloppy management
as an alternative approach for dealing with self-configuration and self-healing
services on structured overlay networks. The proposal is that instead of explicit repair algorithms, systems could use continuous background probabilistic algorithms to handle non-functional management tasks such as routing
table maintenance, while relying on the original structured algorithms for the
functional tasks such as routing messages through a DHT. This probabilistic
overlay maintenance is what we call “sloppy management”. We still need to
further study the implications of such an approach. More details about this
work can be found in Appendix A.4.
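To give the flavour of the idea, the following Java sketch shows a background thread that, with a small probability per period, refreshes one randomly chosen routing-table entry instead of running an explicit repair protocol. The RoutingTable abstraction, the period and the probability are hypothetical and are not taken from the work in Appendix A.4.

    import java.util.Random;

    /** Hypothetical routing-table abstraction used by the sketch below. */
    interface RoutingTable {
        int size();
        void refreshEntry(int index);   // re-run a lookup to refresh the given entry
    }

    /** Sketch of sloppy maintenance: probabilistic background refresh of routing
     *  state, leaving the functional routing algorithm untouched. */
    class SloppyMaintainer implements Runnable {
        private final RoutingTable table;
        private final double refreshProbability;   // e.g. 0.1 per period
        private final Random random = new Random();

        SloppyMaintainer(RoutingTable table, double refreshProbability) {
            this.table = table;
            this.refreshProbability = refreshProbability;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                if (random.nextDouble() < refreshProbability) {
                    table.refreshEntry(random.nextInt(table.size()));
                }
                try {
                    Thread.sleep(1000);             // maintenance period of one second
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }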
3.3.6 Ongoing and Future Work
The analysis on how to improve the results of this deliverable, contributed by
NUS and INRIA, is presented here as ongoing work, motivating our future
work on structured overlay networks. Since Work Package 1 does
not carry on during year 3, we expect to report progress on these ideas
in WP4, Self-management services, more specifically in T1 self-configuration
and T4 self-protection.
Security
The API presented in section 3.3.4 is based on that of OpenDHT [117].
Therefore, our security analysis starts by spotting some issues in OpenDHT,
and how they are related to our API.
First of all, with respect to availability, OpenDHT does not offer any
load balancing mechanism to avoid data being discarded. Our system does
offer symmetric replication by using the transactional procedures. Another
difference is that the storage quota in OpenDHT is granted per IP instead of
per user as in our system, where the quota is uniform; this introduces different
challenges we still have to investigate.
One risk in our system is that we have removed the time-to-live value
from the stored entries. This is because our storage services are meant to
be persistent, as in the Distributed Wiki presented in Deliverable D3.1b,
Chapter 12. This implies that our stored data could grow forever, just as the
size of a Wiki can grow forever. Of course, this risk is driven by the application
we want to implement.
With respect to Sybil attacks, we expect to apply the results from
Deliverable D1.3b, which propose the use of social networks to prevent them.
Slicing
The ongoing work on slicing is inspired by the results presented in [48]. The
objective is to categorize the network by analyzing the available resources of
each peer. The criteria considered include processing power, storage capacity
and locality, among other variables. By having a slicing service one can
partition the network into groups that represent a controllable amount of
some resource, where peers present homogeneous capacities.
Future Work
Considering what has been described as ongoing work, it is absolutely necessary to carry on with the work on security. Since there are other work
packages specifically dedicated to this topic, we believe that it is feasible to
continue the work and to integrate it during the next year of the project.
The work on slicing should also be continued in order to decide if it provides
some benefit to our project.
With respect to the API, even though it is quite stable, it is expected that
we will have to adapt and extend it during the development of services and
applications on top of the structured overlay networks we have developed in
this deliverable. We consider such iteration a normal part of the life
cycle of any software development and research project. As mentioned at
the beginning of this section, the results of the ongoing and future work are
expected to be reported in other work packages, mainly in WP4, dedicated
to self-management services.
We can conclude that the amount of work that remains to be done in
this deliverable is reasonably small and can be assimilated and integrated by other
work packages in the next year of the project.
3.4 Novel P2P NAT Traversal Approach
In this section we present a new method for NAT Traversal designed in
the context of Peerialism’s peer-to-peer content distribution system. The
suggested method is based on a technique widely adopted in current home
routers: NAT port preservation. The method advances the current state-of-the-art of NAT Traversal by improving support for two widely deployed types of
NAT: Restricted Cone and Port Restricted Cone.
3.4.1 Network Address Translators
Network Address Translators (NATs) are deployed ubiquitously in today’s Internet. Home routers and corporate firewalls are the most
common examples of devices which implement the Network Address Translation technique. However, as a result of a lack of standardization of this
technology, the behavior of NAT devices is often vendor-specific.
Network Address Translators have been mostly designed around the client-server paradigm. The client lies in a private network where access to the open
Internet happens only through a NAT device. Servers are instead located
outside that network, in hosts which have public IP addresses and static DNS
records. The great advantage of NAT-enabled devices is their ability to
translate connections coming from their internal hosts such that they appear
to the external servers to be issued from the NAT device, thus protecting
the clients by hiding their real addresses in the private network. Network
Address Translators also provide an additional form of protection: the
dropping of unwelcome incoming connections that were not first initiated
by internal hosts. This approach has been designed to support Web or e-mail
services, where it is always the hosts inside the private network that initiate
the communication to the outside servers.
When peer-to-peer applications are involved instead, hosts behind NAT
might need to accept connections from other peers and not only to initiate them. In this case, the NAT “client-server” approach might constitute
a problem. A typical example of a peer-to-peer application which suffers from this
problem is a VOIP system where calls need to be established between pairs
of peers with the help of a central rendez-vous server. Typically, the peer
would register to the rendez-vous server and obtain the address of another
peer that it wants to communicate with. It would then try to connect to the
latter to establish a call. If the destination peer is behind NAT, the address
obtained from the rendez-vous server will be either the private address of the
peer or the address of the NAT device, or even both. This is not enough to
establish a connection between the two peers, since the NAT device will not
expose any open ports for the call initiator to contact the destination peer.
A number of methods have been studied to avoid this problem. The first
step in solving the problem is to define the behavior of different NATs and
group them together in different types. This has been done in the STUN
RFC. We summarize the types as follows:
• Full cone. Once a host behind a NAT initiates a connection to another
host outside its private network, the NAT box establishes a mapping, and
any external host can use that mapping to send packets back to the initial host.
• Restricted cone. When a private host initiates a connection to another
host, only that external host is allowed to respond back to the host inside
the private network. Any packet coming from a different host will be dropped.
• Port restricted cone. If a private host initiates a connection to a host
with address IP1 and port number N2, only packets coming from that
same IP and port will be forwarded back by the NAT device to the
private host.
• Symmetric. Each request coming from an internal host with a certain
IP address and port to a specific destination IP address and port is
mapped to a distinct external port. If the internal host initiates a connection
from the same source address and port but to a different destination,
a different mapping is used.
3.4.2 Current NAT Traversal methods
IGD and NAT-PMP
The Internet Gateway Device (IGD) [51] and the NAT Port Mapping (NAT-PMP) [32] protocols are application layer protocols which allow for the discovery and configuration of port forwarding rules in NAT-enabled devices. IGD
is implemented via UPnP [79]; it has been widely adopted by router vendors but it is not an Internet Engineering Task Force document. NAT-PMP,
on the other hand, is an Internet Draft filed by Apple and is mainly
used in Apple routers. Both protocols enable hosts to discover UPnP- and
NAT-PMP-compliant routers in their local network, learn their public IP address, retrieve and modify existing port mappings and assign corresponding
lease times.
STUN
STUN[121] is a network protocol which allows a host to discover which type
of NAT it is behind and which public IP address is associated with it. STUN
does not require any explicit interaction with the NAT device as in UPnP
and NAT-PMP; it instead makes use of an external STUN server to test the
behavior of the NAT device through a well-defined discovery process. The
process consists of first contacting the STUN server to learn the observed
public IP of the client and then requesting the STUN server to send UDP
packets back to the acquired IP using different combinations of source IPs
and ports. The client is able to determine which kind of NAT it is
behind according to which packets it receives, as different types of
NATs handle incoming UDP packets in different ways. STUN also provides
a method to discover which port has been opened by the Network Address
Translator on the public interface. If the STUN client communicates this
information to a public rendez-vous server, it can be used by other peers
to establish a direct connection with the same host. For more information
about the STUN protocol, please refer to [121].
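A much simplified sketch of this decision logic, written in Java, is given below. The boolean parameters stand for the outcomes of the STUN tests; this is an approximation of the discovery process in [121], not a complete implementation, and all names are ours.

    /** Simplified sketch of STUN-style NAT type inference. */
    final class NatClassifier {

        enum NatType { OPEN_INTERNET, FULL_CONE, RESTRICTED_CONE, PORT_RESTRICTED_CONE, SYMMETRIC }

        static NatType classify(boolean publicAddressEqualsLocal,
                                boolean replyFromDifferentIpAndPortReceived,
                                boolean mappingChangesPerDestination,
                                boolean replyFromSameIpDifferentPortReceived) {
            if (publicAddressEqualsLocal) return NatType.OPEN_INTERNET;          // no NAT in the path
            if (replyFromDifferentIpAndPortReceived) return NatType.FULL_CONE;   // endpoint-independent
            if (mappingChangesPerDestination) return NatType.SYMMETRIC;          // new mapping per destination
            return replyFromSameIpDifferentPortReceived
                    ? NatType.RESTRICTED_CONE          // filtered by source IP only
                    : NatType.PORT_RESTRICTED_CONE;    // filtered by source IP and port
        }
    }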
TURN
Traversal Using Relay NAT (TURN)[120] is a protocol which enables a host
behind a NAT or firewall to receive incoming data over TCP or UDP connections using relaying. It is mostly useful for hosts behind Symmetric NATs
or firewalls with similar behavior. Clients running TURN must authenticate
to a server in the public domain. When a TURN client wants to send data
to another one, it sends it to the server which will in turn forward it to the
destination. Although TURN provides connectivity between clients in
almost all cases, it comes at the cost of relaying all the traffic through a public
server. Therefore, TURN is used as a last resort when all attempts to establish
a connection with other protocols, such as STUN, have failed.
ICE
The Interactive Connectivity Establishment (ICE) [119] protocol provides a comprehensive mechanism for NAT traversal. The ICE approach consists of deploying a number of STUN and TURN servers in the public network and
automating the choice of which of the two protocols to use, such that the choice is transparent to the client running the ICE protocol. In particular,
it is used to allow SIP-based VoIP clients to successfully traverse the variety
of firewalls that may exist between a remote user and a network.
3.4.3 Peerialism’s system
Peerialism’s media distribution platform performs audio and video streaming using proprietary peer-to-peer technologies. The target customers of our
products are home users. A customer typically installs a client application on their machine and starts requesting content. Every running client
application constitutes a peer in Peerialism’s system. Peers collaborate to
distribute content over an ad-hoc overlay network managed by a central coordinator, called the Tracker. The Tracker is entrusted with the task of periodically
reorganizing the network such that the content distribution’s load is balanced
among all participating peers. The content itself is composed of a number
of different streams which consist of RTP and RTSP packets sent over UDP.
Given the characteristics of the system, it is therefore crucial that Clients
are able to connect to each other and exchange parts of the stream. However, this is not always possible when peers lie on private networks behind
NAT. In the next two paragraphs we explain the limitations of the current
NAT technologies which have been discovered during the process of building
our system. Then we will expose our simple approach to reach connectivity
between peers using UDP.
3.4.4 Limitations of current approaches
STUN is sometimes referred to as the universal solution for NAT Traversal when
speaking about peer-to-peer applications. This is a common misconception.
STUN, as mentioned in the STUN RFC, “..will work for any application for
full cone NATs only. For restricted cone and port restricted cone NAT, it will
work for some applications depending on the application. Application specific processing will generally be needed.” However, the RFC does not define
which kind of behavior the application should implement to cope with the
limitations of the protocol. A comprehensive analysis concerning the state of
peer-to-peer communication across NATs has recently been published [135].
It complements the STUN RFC by describing in detail how a peer-to-peer
application should behave when multiple levels of NATs are involved, when
peers are behind the same NAT device or when using traffic relaying. The
analysis underlines that the common approach used by peer-to-peer applications when dealing with NATs is to use a public host which behaves both
as a rendez-vous server, like in STUN, and as a coordinator for connection
establishment between peers. An earlier paper by Ford et al.[50] documents
similar hole punching techniques and analyzes their reliability on a wide
variety of deployed NATs. However, both documents suggest the use of a
technique called port prediction [154][157] when Restricted, Port Restricted
and Symmetric NATs are involved. Port prediction works by analyzing the
behavior of the NAT device and attempting to predict the public port numbers it will assign to future outgoing communication sessions. As mentioned
in [157], this technique is not reliable as NAT devices might implement unpredictable behaviors when assigning public ports. Furthermore, it is almost
impossible to predict which public port will be allocated for a certain session
when many peers are behind the same NAT.
Two independent surveys [149][138] show the distribution of pre-configured
NAT types in common home routers. The results reveal that Restricted and
Port Restricted types of NAT account for a significant share of the market,
as much as 40%. However, all tests take into consideration the type of NAT
configured by default on the devices, which might be changed by the user, as
home routers usually implement several flavours of NAT on the same device.
Even though the validity of the surveys might be questionable, the first one
being unofficial and the second one being somewhat outdated, statistics collected during our preliminary system tests show similar results on the home
routers of our customers. It is therefore important to provide a good degree
of support to those NAT types.
The only existing protocols which guarantee peer connectivity for all NAT
types are IGD, NAT-PMP and TURN. IGD and NAT-PMP are usually disabled by default in home routers since they might expose internal hosts to
serious security threats [60]. However, in our system we do make use of these
protocols when available. TURN, and consequently ICE, suggest
an approach which consists of relaying peer-to-peer traffic through a public
server. In our system, this technique cannot be used since the cost of relaying multiple video and audio streams through a public server is too high. In
fact, it would require an enormous amount of both computational and bandwidth resources on the public host. Therefore, when connectivity cannot
be achieved between two peers, it is better to redirect the requester of the
content to another provider which is publicly reachable instead of relaying
traffic.
3.4.5 Our solution
We base our NAT Traversal solution on previous studies [135][50][157], which
we extend to provide better support for Restricted and Port Restricted NATs.
In our approach, we exploit a technique widely adopted in currently deployed
home NATs: port preservation. We refer to port preservation as the attempt
of a NAT box to use the same external port as the internal host used to
establish a connection to an outside host. This means that, for instance, if
Client A behind NAT tries to contact Server B in the public domain from its
local port 123, the NAT device will try to use port 123 on its public interface
to communicate with Server B. Port preservation may be supported by all
types of NATs except the Symmetric one. In fact, by definition, Symmetric
NATs allocate a different port on the external interface every time a host tries
to contact a different endpoint, where the endpoint is given by IP address and
port. Port preservation is defined in [154] as “Exceptional Behavior”. We
argue that port preservation is becoming more and more common as NAT
vendors adapt their products to support peer-to-peer applications. That is
confirmed by tests we performed on a limited set of home routers currently
on the market.
Given the assumption of port preservation, we derived a connectivity
table, shown in Figure 3.7. It represents the compatibility between pairs
of peers according to their NAT types. The compatibility is only theoretical
since there is no guarantee that a certain NAT device will behave as assumed
by the classification of NAT types or that it implements port preservation.
Combinations of NAT types are grouped in five different classes in the table.
Every class defines which method will be used when attempting to establish
a connection between a pair of peers. The last class instead defines the
combinations where port preservation cannot be used to obtain connectivity.
In our system, the connectivity table is used by the Tracker application
to decide whether it is possible for a host to provide content to another one
when NATs are involved. The Tracker is then aware of the NAT types of its
clients. This is because, when started, the Client application performs a STUN
discovery and reports to the Tracker the following information:
• Its private IP address and listening UDP port.
• The public IP address of the NAT device and the external port mapped
during the STUN discovery process.
• The type of NAT implemented by the gateway.
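A minimal Java sketch of such a report message is shown below; the class and field names are illustrative (the NatType enum is the one from the STUN sketch earlier) and do not describe Peerialism’s actual wire format.

    import java.net.InetSocketAddress;

    /** Illustrative sketch of the information a Client reports to the Tracker
     *  after its STUN discovery. */
    final class NatReport {
        final InetSocketAddress privateEndpoint;   // private IP and listening UDP port
        final InetSocketAddress publicEndpoint;    // NAT public IP and port mapped during STUN discovery
        final NatClassifier.NatType natType;       // NAT type implemented by the gateway

        NatReport(InetSocketAddress privateEndpoint,
                  InetSocketAddress publicEndpoint,
                  NatClassifier.NatType natType) {
            this.privateEndpoint = privateEndpoint;
            this.publicEndpoint = publicEndpoint;
            this.natType = natType;
        }
    }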
We now detail the aforementioned classes, providing examples in the context
of our system.
• (I) Only one of the two Clients is behind NAT. In this case, we implement a classic Client-Server behavior: the peer behind NAT connects
to the peer in the public domain. The peers assume their role of content provider or content requester only after the connection has been
established.
• (II) One or both peers are behind Full Cone NAT. In this case, both
Clients start to send to each other’s public endpoint at the same time.
Figure 3.7: Connectivity Table which shows the compatibility between pairs
of NAT types (classes: (I) Normal Connection, (II) Simultaneous Connection,
(III) Simultaneous Connection + Exploit Port Preservation, (IV) Not supported).
It also dictates which approach should be used in establishing
a connection between peers behind those types of NAT.
This is necessary to create the translation entries in the two NAT devices and make hole punching possible. If a peer is behind Full Cone
NAT, its endpoint is the public IP address and port learnt during the
initial STUN discovery. Since a device implementing Full Cone NAT
is expected to behave as endpoint-independent, the other peer will be
able to use that same endpoint for contacting the peer behind Full Cone
NAT.
Consider now the scenario in Figure 3.8. Both Client A and Client B
are behind NAT: A is behind a Full Cone NAT and B is behind a Port
Restricted NAT. Both peers have previously performed a STUN discovery and reported the findings to the Tracker. Client A now makes a request for a certain stream by sending a Content Request message through
its TCP connection to the Tracker. The Tracker then decides that
Client B should provide the content A requested. The Tracker issues
notifications both to Client A and to Client B at the same time, in
the figure called Receipt Notification and Delivery Notification. The
message sent to Client B contains the public endpoint of Client A, and
vice versa. Note that the public endpoint of Client A is composed of
the NAT device’s public address and the external port learnt through
Figure 3.8: NAT Traversal: Connection establishment process of class II
the STUN discovery, in this case port 3478. Instead, for Client B, the
public endpoint is NAT B’s public address, and the port is the one used
locally by the peer, port 4321. The peers then start sending test packets to each other’s public endpoint at the same time. As mentioned
earlier, this is to guarantee that the NAT devices create their translation entries as soon as possible. If the entries are not present before
the first packets are received, those packets will be dropped. However,
subsequent packets will be forwarded correctly to the private host.
Now, if the first test packet sent from Client B opens port 4321 on NAT
B’s external interface as expected by port preservation, test packets
from Client A will be forwarded correctly to B. Port 3478 on NAT A
instead is supposed to be still open since it was used for the STUN discovery. We make sure to keep this mapping active by sending periodic
messages to the STUN server in the periods when Client A is not receiving or providing any content. If Client A receives a test packet, it marks
it as received and sends it back to the source, as does Client B.
When Client B receives the marked packet from A, it starts to deliver
data. Note that Client B might have previously received an unmarked
test packet from A but it will not start to deliver at that point. This
is to guarantee symmetric connectivity between peers, since it might
happen that Client A can send to Client B but is unable to receive
from it, in particular when both Clients are behind NAT.
Finally, Client A periodically acknowledges received stream packets to
Client B sending cumulative ACK messages.
The aforementioned connection establishment process works even if
NAT B does not implement port preservation. In that case what would
happen is that Client A receives a test packet from an endpoint which
has the public address of NAT B but a different port than the one
notified by the Tracker. Client A would then update Client B’s
endpoint and start sending test packets to the new endpoint.
• (III) Both peers are behind either Restricted or Port Restricted NAT. In this
case we rely on port preservation to know which port will be opened by
the NAT device on its external interface.
Let us consider the scenario shown in Figure 3.9. Both Client A and
Client B are behind Port Restricted NAT. As in the scenario shown
in Figure 3.8, Client A makes a request for a certain stream to the
Tracker. The Tracker then decides that Client B should provide the
content. Again, the Tracker issues notifications both to Client A and
Client B at the same time. However, in this case the public endpoint
of Client A is 192.0.2.1:1234, which is NAT A’s public address and the
private port which Client A is listening on locally. Similarly, Client
B’s public endpoint is 192.0.2.254:4321. The peers then start their
connection setup phase sending to each other’s public endpoint. If
NAT A and B behave as expected, they will open the same external
port as the one their internal hosts are sending from, namely port 1234
for NAT A and port 4321 for NAT B. If this happens, both peers will
be able to receive test packets and reply to their counter-parts. Client
B will then start delivering content to Client A.
• (IV) This class is not supported since port preservation cannot be
used to predict which port will be allocated by a Symmetric NAT on
its external interface.
The class includes two combinations of NAT types: Port Restricted-Symmetric and Symmetric-Symmetric. In the first combination, even
Figure 3.9: NAT Traversal: Connection establishment process of class III
if the NAT implementing a Port Restricted behavior supports port
preservation, there is no way of knowing which endpoint should be
used when contacting a client behind Symmetric NAT. The same is
true for the second combination, where both endpoints are completely
unpredictable.
To show the incompatibility between combinations of NAT types included in this class, let us assume that a Client A is behind Port Restricted NAT (NAT A) and wants to connect to another Client B behind
Symmetric NAT (NAT B). The Tracker can provide Client A with NAT
B’s public address, the local port which Client B is listening on and
the port which the Symmetric NAT allocated for the STUN discovery.
Let us now say that Client A starts to send to the public endpoints
given by the Tracker, first to NAT B’s public address and Client B’s
local port, then again to NAT B’s public address but to the port that
Client B learnt during its STUN discovery. NAT B will drop the incoming packets since they come from an endpoint that was not previously
contacted by Client B. If Client B tries to establish a connection at the
same time, as described in (II), NAT A will drop the incoming packets
since the new endpoint that the Symmetric NAT allocated is not in
NAT A’s translation map.
Given the limitations of port preservation for combinations of NAT
types included in class (IV), it is necessary to use the Port Prediction
technique to obtain connectivity between pairs of peers whose NAT
combination falls in this class.
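The following Java sketch shows how the Tracker could map a pair of reported NAT types to one of the classes above. Only the combinations explicitly discussed in the text are encoded; the remaining combinations involving a Symmetric NAT are returned as UNKNOWN rather than guessed from the connectivity table, and all names are illustrative (NatType is the enum from the earlier STUN sketch).

    /** Sketch of the class (I)-(IV) decision used by the Tracker. */
    final class ConnectivityClasses {

        enum Method {
            NORMAL_CONNECTION,                    // class (I)
            SIMULTANEOUS_CONNECTION,              // class (II)
            SIMULTANEOUS_WITH_PORT_PRESERVATION,  // class (III)
            NOT_SUPPORTED,                        // class (IV)
            UNKNOWN                               // combinations not spelled out in the text
        }

        static Method methodFor(NatClassifier.NatType a, NatClassifier.NatType b) {
            boolean aOpen = a == NatClassifier.NatType.OPEN_INTERNET;
            boolean bOpen = b == NatClassifier.NatType.OPEN_INTERNET;
            boolean aSym = a == NatClassifier.NatType.SYMMETRIC;
            boolean bSym = b == NatClassifier.NatType.SYMMETRIC;

            if (aOpen || bOpen) {
                return Method.NORMAL_CONNECTION;            // at most one peer is behind NAT
            }
            if (aSym || bSym) {
                boolean notSupported = (aSym && bSym)
                        || a == NatClassifier.NatType.PORT_RESTRICTED_CONE
                        || b == NatClassifier.NatType.PORT_RESTRICTED_CONE;
                return notSupported ? Method.NOT_SUPPORTED : Method.UNKNOWN;
            }
            if (a == NatClassifier.NatType.FULL_CONE || b == NatClassifier.NatType.FULL_CONE) {
                return Method.SIMULTANEOUS_CONNECTION;      // hole punching via STUN-learnt endpoints
            }
            return Method.SIMULTANEOUS_WITH_PORT_PRESERVATION;  // Restricted / Port Restricted pairs
        }
    }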
3.4.6 Conclusion and Future Work
In this section we presented the current NAT Traversal technologies for
peer-to-peer applications and their limitations. We then suggested and detailed a new method based on the Port Preservation technique to improve
hole punching when Restricted Cone and Port Restricted Cone NATs are
involved.
We now plan to evaluate the suggested NAT Traversal method in Peerialism’s deployed platform. Before doing that, we intend to add support
for Port Prediction, as defined by Takeda et al. [157], to be used when Port
Preservation alone does not provide connectivity. We would also like to provide valid statistical results on the distribution of NAT types in today’s home
routers as observed in a real system, such as Peerialism’s platform. We would
then like to combine those results with the success rate of our NAT Traversal
technique depending on the various combinations of NAT types.
3.5 Papers and publications
This section gives a brief introduction to the publications
produced by the work done on this deliverable. Some of them are included
as appendices of this book, as explained below.
The Relaxed-Ring: a Fault-Tolerant Topology for Structured Overlay
This paper has been accepted with revisions by the journal “Parallel Processing
Letters”. It is annexed in Appendix A.5, and it includes the results presented
in two other publications mentioned in the Periodic Activity Report, but
that are not included as appendices: “A Relaxed-Ring for Self-Organising
and Fault-Tolerant Peer-to-Peer Networks” and “Improving the Peer-to-peer
Ring for Building Fault-tolerant Grids”. This article concludes the work on
low-level primitives, motivating the work on high-level primitives for self-managing systems.
Range queries on structured overlay networks
Similarly to the article on the Relaxed-Ring, this paper concludes the work
on low-level primitives for range queries, motivating the work on high-level
primitives. It is included in Appendix A.3.
Sloppy Management of Structured P2P Services
This publication is the result of a collaboration between ZIB and the Vrije
Universiteit Amsterdam. It proposes sloppy management for structured overlay networks using continuous background probabilistic algorithms instead
of explicit repair strategies. This work is included in Appendix A.4.
Chapter 4
D1.3b: Final report on Security in Structured Overlay Networks
4.1 Executive Summary
SELFMAN aims at building self-managing distributed systems. Self-management
is achieved through mechanisms which provide: self-configuration, self-healing,
self-tuning and self-protection. Although SELFMAN includes self-protection, the
objective of SELFMAN is not to develop highly secure distributed systems,
which would be too ambitious an objective. Rather, the self-protection aspects are to provide mechanisms which can enhance security.
This deliverable presents results providing security functionality that
is useful for the self-management work in Workpackage 4 (WP4). Although
security is not the primary focus, an attack on a distributed self-managing
application can affect self-configuration, self-healing and self-tuning as well.
Thus the results in this deliverable are partly about mechanisms for
self-protection and partly about making other self-management aspects of
SELFMAN more resilient to attacks.
Most of the work on Structured Overlay Networks (SONs) is based on
the notion of a Distributed Hash Table (DHT). Unfortunately, while DHT-based SONs have many desirable properties from a self-management viewpoint, they also have many security drawbacks. The first report on security,
D1.3a, suggested a promising direction which moves away from the fundamental drawbacks of DHTs (from a security perspective) by employing Small
World Networks (SWNs). SWNs can get around the problems of identity
which can plague DHTs and, being more robust, also simplify self-healing and
self-configuration.
We have built a testbed for investigating SWNs which provides the following functionalities: creation of various SWNs, simulation of a network using
the SWN, and visualization of the experiments, including statistics. The most important feature is that the testbed makes it easy to test different algorithms
which operate on the SWN. We have successfully run simulations of up to
100000 nodes without any problems. The intention of the SWN testbed is to
investigate whether a SWN can be a suitable replacement for a DHT. Our
experimental results show that although routing guarantees are probabilistic,
the routing success rate is found to be around 100%.
Self-configuration and self-healing mean that a self-managing system will
add/modify/update software components on the fly. We have implemented
a software component authentication system which can guarantee that only
authorized code can be used. This means that we can be sure of the provenance and identity of all software components in the system. Our prototype
runs in Windows, the most common platform where many subtle attacks on
components are possible. We protect against all attacks which try to load
malware to replace a software component.
A SELFMAN application would communicate with others on the Internet. Software often contains bugs which can be exploited externally by attackers,
and we do not expect SELFMAN applications to be an exception. As
such, monitoring the behavior of multiple processes communicating over the
network can be used to detect unexpected or illegal behavior. We have developed a monitoring infrastructure which can capture the behavior of a
collection of processes and threads. It has low overheads and allows actions
to be related to the software components which cause them. The low overhead
means that permanent monitoring is feasible. We tested monitoring on Skype
running on several machines. The results demonstrate that one can clearly
see that Skype is a Peer-to-Peer (P2P) application from its network traffic
flow, and one can observe how Skype works.
4.2 Contractors contributing to the Deliverable
UCL (P1), KTH (P2), INRIA (P3), FT (P4) and NUS (P7) have contributed
to this deliverable. Self-protection issues are related to the other self-* properties;
the work here on basic underlying security mechanisms was carried out by
NUS, working on the security aspects with the cooperation of the other partners,
who have contributed to design or requirements.
UCL (P1) UCL has cooperated with NUS to refine security issues for the
low level SON.
KTH (P2) KTH has cooperated with NUS to refine security issues for the
low level SON.
INRIA (P3) INRIA has cooperated with NUS in the design of aspects of
the monitoring infrastructure which allows for secure monitoring for
self-management.
FT (P4) France Telecom has contributed to the design of messaging requirements and use cases from an application standpoint which is the
starting point used by NUS for the Small World Network routing
testbed.
NUS (P7) NUS has designed and implemented the Small World Network
testbed and simulator being a new low level SON-like infrastructure,
the WinResMon monitoring infrastructure and the BinAuth software
component authentication infrastructure.
4.3 Introduction
In general, making a non-distributed application secure is a difficult task.
Making distributed self-managing systems and applications, as in SELFMAN, secure is correspondingly much more difficult. Nevertheless, in this
age where attacks on software are common, some security mechanisms are
necessary. The goals of SELFMAN with respect to security are not to provide
high security, which would be simply too ambitious, but rather to provide security mechanisms which can help self-protection in self-managing distributed
systems. This includes increased self-protection for the overlay networks used
in SELFMAN.
Security for a system needs to be approached from a holistic perspective.
Self-protection for the (structured)1 overlay network is one aspect. Other
aspects include whether self-tuning, self-configuration or self-healing have
been compromised. One also needs mechanisms to help determine whether
malicious behavior is occurring in the distributed system/application.2
In the D1.3a deliverable, we surveyed security in P2P systems and identified that the final security properties desired are often specific to the
needs of the P2P application. In this workpackage, such domain/application
specific security measures are not part of the low level infrastructure and are
not dealt with here. We also identified Structured Overlay Networks (SON)
such as Distributed Hash Tables (DHT) as having a number of fundamental drawbacks. The two most important are the difficulties of dealing with
identities in a decentralized setting and the problem of maintaining the distributed data structures which comprise the SON. The former leads to the
problem of Sybil attacks [47], which are problematic for SONs. In the latter case,
e.g. in a DHT like Chord where periodic stabilization is used for finger maintenance under churn, the maintenance effort can itself lead to more security
problems such as routing and data attacks.
We proposed in D1.3a to take an orthogonal approach that avoids some of the
drawbacks, rather than trying to retrofit more security mechanisms onto a DHT;
there are already many such proposals, but they are not satisfactory. A Social Network can automatically provide trust relationships, which
mean that the problem of Sybil attacks disappears since it is not feasible to
create multiple identities easily. Social networks are a form of Small World
Network (SWN). Routing is also simpler in SWNs which reduces the problem
of routing attacks.
1 We will later advocate studying small world networks which have a mix of random
and structured properties.
2 We cannot hope to guarantee that no malicious behavior occurs since that question is
undecidable in general.
We built a SWN testbed for investigating the suitability of SWN as a
replacement for a DHT. This is complementary to the other work in SELFMAN in task T1.1 which makes use of DHT-based SONs. The SWN testbed
can generate and simulate several types of SWNs. It provides visualization
as well as statistics. Our testbed is able to handle reasonably large networks;
we have run simulations on the order of 100000 nodes. Our experiments
show that although, in a distributed setting, routing algorithms based purely
on local information only have probabilistic guarantees, in practice the probability of success is very high. We report further
on self-protection in SWN in D4.4a.
The other security mechanisms described in this report serve to gain
better self-protection and assist in the other self-* tasks in WP4. The monitoring infrastructure allows one to understand and analyze the behavior of a
distributed application organized as a collection of processes and threads running on various machines. The software component authentication ensures
that only those components which are trusted can be executed. This allows
complete control over what software components are loaded and executed in
a system.
4.3.1 Relating D1.3b and D1.3a
We take the opportunity to clarify the work performed between deliverables
D1.3b and D1.3a. The SELFMAN project started late at NUS3 and the
work reported in D1.3a was for a period of three months. Given the short
timeframe, the primary focus of D1.3a was to understand the major security
issues in SONs and P2P systems. Some work was also started on the other
tasks in WP1 in this timeframe, which was not reported simply because it
was preliminary. In deliverable D1.3b, as we shifted focus from SON to
SWN, some of the initial effort was no longer relevant.
The usefulness of D1.3a was that it allowed us to identify some of the
key problems which are pertinent to self protection in structured overlay
networks. The work in this deliverable reflects the directions adopted from
the results in D1.3a. We also identified from D1.3a that only two of the
applications in SELFMAN would have relevance from a security standpoint,
namely the M2M application and the Wiki scenario. The P2P TV application, being a closed system, does not require security.4 We remark that
3 Due to late receipt of the SELFMAN funding, work on SELFMAN could only effectively start in Feb 2007. The work in D1.3a was for the timeframe of Feb to May 2007.
4 As far as we understood, they are interested in the self-management aspects other
than self-protection.
the M2M application, since it performs data collection and monitoring, has
generic properties which typify a broader class of applications for which security mechanisms can be used. Based on our existing experimental
results on SWNs, we believe that use of a SWN may simplify and make
the self-management and networking aspects of the M2M application more
robust. The Wiki application, being more specialized, may ultimately need
more domain-specific mechanisms such as Wiki-specific trust management.
4.4 Self Organizing Networks with Small World Networks
In the D1.3a survey, we found that structured overlay networks such as those
using DHTs have many drawbacks. The main ones arise from problems
with identity in a P2P setting and the need for constant maintenance of the
DHT data structures. One of the most problematic is the Sybil attack [95],
which exploits the lack of identity. Unfortunately, it is not possible to stop
Sybil attacks while retaining the decentralized properties which make DHTs
attractive.
The problem of identity in DHTs arises because the decentralized and
P2P nature of a DHT is not conducive to creating trust. One way of increasing trust is to have networks which are based on trust or friendship, and these
provide a natural defense against Sybil attacks. One kind of network which naturally
has these properties is the social network. For example, in a social networking site, e.g. Facebook, Friendster, LinkedIn, etc., we usually add
(real/unique) people that we know directly as our friends. The nature of the
social network verifies the identity of the nodes. Thus a social network can
be thought of as ”full” of identities. A social network is usually regarded as
a kind of Small World Network (SWN).
The classic experiment by Milgram on social networks shows that the chain
of social acquaintances required to connect one arbitrary person to another
arbitrary person anywhere in the world is generally short (this is the origin of
the phrase “six degrees of separation” and concepts such as the Erdos number).
The motivation for investigating SWNs is that the network has automatic
identity properties which reduce the problem of Sybil nodes. In addition,
SWNs do not require much maintenance, as the SWN graph does not change
frequently. As we will see, it is possible to route efficiently in a SWN. In
this section we introduce Small World Networks (since they might be less
familiar). The SWN infrastructure described here is used later in deliverable
D4.4a (Section 18).
4.4.1 Small World Network Models
A SWN as defined by Watts and Strogatz [155] is a graph with small diameter
and a high clustering coefficient. Examples of SWNs are: social networks,
electric power grids, neural networks, telephone call graphs, paper authorship
relations (Erdos numbers), etc. There are many ways to model a SWN; here
we describe two SWN models: the Watts-Strogatz model [155] and the Kleinberg
model [86].
Figure 4.1: Watts and Strogatz Model
The Watts and Strogatz SWN construction depicted in Figure 4.1 first
starts with a regular graph where each node is connected to its k nearest
neighbors (Figure 4.1, left-most graph). This graph has no shortcuts, a high
clustering coefficient, and a large diameter. With probability p, the links of
each node in the regular graph are rewired to a node chosen uniformly at
random over the entire ring. This results in edges being rewired to act
as shortcut links to far-away nodes. The middle graph in Figure 4.1 is
called a Small-World network by Watts and Strogatz. It has quite a high
clustering coefficient and a small diameter, expected O(log n). If we continue
rewiring the links, the graph no longer has a high clustering coefficient. It
now resembles a random graph (Figure 4.1, right-most graph). Note that the
diameter can still be small.
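For concreteness, the construction can be sketched in a few lines of Java; this is an illustration only and is not the generator used in the testbed described later.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    /** Sketch of the Watts-Strogatz construction: a ring lattice with k nearest
     *  neighbours per node, followed by probabilistic rewiring of each forward edge. */
    final class WattsStrogatz {
        static List<Set<Integer>> build(int n, int k, double p, Random rnd) {
            List<Set<Integer>> adj = new ArrayList<>();
            for (int i = 0; i < n; i++) adj.add(new HashSet<>());
            // regular ring lattice: node i linked to its k nearest neighbours (k even)
            for (int i = 0; i < n; i++) {
                for (int j = 1; j <= k / 2; j++) {
                    int neighbour = (i + j) % n;
                    adj.get(i).add(neighbour);
                    adj.get(neighbour).add(i);
                }
            }
            // rewire each forward edge with probability p to a uniformly random node
            for (int i = 0; i < n; i++) {
                for (int j = 1; j <= k / 2; j++) {
                    if (rnd.nextDouble() >= p) continue;
                    int oldTarget = (i + j) % n;
                    int newTarget = rnd.nextInt(n);
                    if (newTarget == i || adj.get(i).contains(newTarget)) continue;
                    adj.get(i).remove(oldTarget);
                    adj.get(oldTarget).remove(i);
                    adj.get(i).add(newTarget);
                    adj.get(newTarget).add(i);
                }
            }
            return adj;
        }
    }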
The path length L is defined as the average shortest path length between
any two vertices. The clustering coefficient C is defined as the average of
the fractions of the neighbors of each node that know each other. L(p)
and C(p) are the path length and the clustering coefficient of the graph
constructed with the algorithm above with probability p. Figure 4.2
shows that with p = 0, the graph is a regular graph where each node is
connected to its k nearest neighbors, which has high clustering and also a
large diameter. With p ≈ 0.01, the graph still has high clustering but the
path length has decreased significantly; this is the SWN defined by Watts and
Strogatz. With p = 1, the graph is totally rewired and thus has no clustering.
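Written out explicitly (the notation is ours, following the definitions above), the two quantities are:

    L(p) = \frac{1}{n(n-1)} \sum_{u \neq v} d(u, v),
    \qquad
    C(p) = \frac{1}{n} \sum_{v} \frac{\bigl|\{\, \{i, j\} \subseteq N(v) : \{i, j\} \in E \,\}\bigr|}{\binom{k_v}{2}},

where d(u, v) is the shortest-path length between vertices u and v, N(v) is the neighbour set of v, k_v = |N(v)|, and E is the edge set of the graph constructed with rewiring probability p.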
Figure 4.2: Characteristic Path Length L(p) and Clustering Coefficient C(p)
Figure 4.3: Kleinberg Model
The construction of the Kleinberg model [86] starts with the nodes and
links connected as a lattice (Figure 4.3, left-most graph). Then a constant
number of non-local links is added: Kleinberg adds two shortcuts with
probability proportional to d(u, v)^-r (Figure 4.3, right-most graph), where r is
the dimension of the lattice. The shortcuts make the diameter of the graph
small, expected O(log n). The node identifiers are assigned based on the
lattice coordinates, and the cost of greedy routing is expected O(log^2 n)
steps.
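The greedy routing used in this analysis can be sketched as follows in Java; the graph interface is hypothetical and simply stands in for whichever representation a simulator uses.

    /** Hypothetical graph abstraction for a Kleinberg-style small world. */
    interface SmallWorldGraph {
        Iterable<Integer> neighbours(int node);
        long latticeDistance(int a, int b);   // distance in the underlying lattice
    }

    /** Sketch of greedy routing: always forward to the neighbour closest to the
     *  target in lattice distance. */
    final class GreedyRouter {
        /** Returns the number of hops taken, or -1 if routing gets stuck. */
        static int route(SmallWorldGraph g, int source, int target) {
            int current = source;
            int hops = 0;
            while (current != target) {
                int best = current;
                long bestDistance = g.latticeDistance(current, target);
                for (int neighbour : g.neighbours(current)) {
                    long d = g.latticeDistance(neighbour, target);
                    if (d < bestDistance) {
                        bestDistance = d;
                        best = neighbour;
                    }
                }
                if (best == current) {
                    return -1;   // no neighbour is closer: a local minimum
                }
                current = best;
                hops++;
            }
            return hops;   // expected O(log^2 n) when shortcuts follow d(u, v)^-r
        }
    }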
Figure 4.4: Simulator Interface
4.4.2 Small World Network Testbed and Simulator
We built a SWN testbed (see Figure 4.4) to allow us to experiment with
SWNs. The testbed contains a number of generators for SWNs and also
SONs. It has a simulator which monitors routing performance (average routing hops), success percentage (what percentage of routings are successful),
node positions (visualized as a ring), routing-hop percentiles (to show how robust the routing is across percentiles), and other statistical distributions such
as hop-counts, edge-distance, edge-count, etc. The simulator allows us to
easily build new SWN algorithms and experiment with them. The SWN
testbed GUI has extensive use of visualization and animation which is useful
for understanding performance of the SWN.
To handle large networks, we have a parallel version of the simulator
which runs on our cluster. We have run networks up to 100000 nodes and
depending on the type of experiment, these typically take between 1 minute
and under an hour. Thus, our SWN testbed platform can handle realistic
network sizes.
Figure 4.5: log(n) and 6 ∗ log(n) rings
4.4.3 Comparing Small World Networks
Typically, DHTs use O(log n) or more links per node. Our experiments show
that for a SWN, O(log n) links also work reasonably well. We experiment
with a Normal SWN (called “normal” as it has the same number of
links as typical DHTs) with log n links, and the Sandberg SWN which has
6 log n links. Figure 4.5 shows a Normal and a Sandberg SWN with n = 100.
We also use the Kleinberg SWN with just 4 links.
To test the SWN simulator, to get some initial insights into the performance of SWN models, and to determine the important factors,
we conducted routing experiments on all three models above. We use
100000 nodes and, in order to establish baseline performance, this experiment assumes no node or link failure.
The results of the experiment are shown in Figure 4.6. They show that a SWN
is more robust in terms of routing length if we add more edges to it. The
Kleinberg model, which has only 4 constant edges, has a very large deviation
in routing length, between 1 and 80. The Normal model, which has log(N)
edges, has a moderate deviation of routing length, from 1 to 30. The Sandberg
model, which has 6 log(N) edges, has a very small deviation of routing length,
from 1 to 10. We observed a 100% success rate in all the experiments.
This experiment shows that as a basic network, a SWN can have good
performance, though it might need more edges than a DHT. The performance is actually rather good, since much less structure and only a few
assumptions are needed. For example, given a social network, one already
automatically has a usable network. A DHT, on the other hand, is more
complex. As this deliverable is concerned mainly with the low-level infrastructure, more details on SWNs are covered in deliverable D4.4a (Section 18).
Figure 4.6: Routing Length Distribution
We intend to submit a position paper to the SELFMAN workshop (Decentralized Self Management for GRIDS, P2P, and user communities) at the
SASO conference (Second IEEE International Conference on Self-Adaptive
and Self-Organizing Systems) on the potential of small world networks as an
overlay network to replace more structured networks such as DHTs.
4.5 A Look at Security using Skype
In this section, we take a look at Skype as it is perhaps the largest and most
well known P2P system which has extensive security mechanisms built in.
The Skype network is a large distributed system with many nodes and users.
The main concern of Skype is to make sure that only the Skype application
makes use of the Skype network.
Skype achieves this by network traffic encryption and code obfuscation.
Obfuscation is feasible since Skype is a closed source application. Network
traffic encryption includes RC4 encryption using a key derived from the
source address, destination address and message ID.
Many code obfuscation and anti-debugging mechanisms are adopted. For
example, code is packed and encrypted in the executable file. Integrity checks
are placed in the code to prevent debuggers from modifying the Skype executable. Other anti-debugging mechanisms include timing checks and detection of known debuggers. Code obfuscation includes adding indirect calls,
conditional jumps, and execution flow rerouting.
Even though Skype uses many techniques to hide the protocol, some parts
of the protocol have been discovered by researchers [15, 21, 17, 67, 74]. For
example, it is possible to send a command to a remote Skype node to ping
any host. This example shows that basic security mechanisms are useful, in
this case, monitoring to understand Skype and integrity checking to ensure
that only the unmodified Skype code is executed.
4.6 System-wide Monitoring Infrastructure
WinResMon [113] is a monitoring infrastructure for determining resource usage and interactions among programs in Microsoft Windows environments.
Programmers can write system monitoring tools on top of the infrastructure
of WinResMon. We enhanced WinResMon to monitor network activity,
so that it is possible to explain how network activity (and other resource usage) is related
to particular programs, processes and threads.5 We also enhanced WinResMon so that activity can be attributed to individual software components.
Although WinResMon is implemented for Windows, the general monitoring
infrastructure can be applied to other operating systems such as Linux.
A monitoring tool using WinResMon can register for a set of events of
interest. These events are reported to the monitoring tool whenever they
occur. An event is characterized by the following information (an illustrative
sketch of such an event record is given after the list):
• serial number is a monotonically increasing integer that can be used to
track progress and detect missing events.
• start time and finish time are timestamps that delimit the duration of the
event. Some events (network, large I/O) take more time, while others
(registry, small I/O) are quick. We provide high-resolution timestamps
based on the internal CPU performance counters. This makes it possible to differentiate events even when they occur with high frequency.
• process ID and thread ID signify which process/thread caused the
event.
• program name is the pathname of the executable.
• user name is the user who owns the process that caused the event.
• return status shows whether the event is a success or failure. If it is
a failure, e.g. file-not-found, the return status gives the reason.
5
We remark that we discovered that monitoring networking was much more complex
in Microsoft Windows than we expected. This is because networking in Windows is not
part of the core operating system kernel.
• operation is the resource type and the operation on the resource. E.g.
file create, net connect, net send datagram.
• resource path is the pathname of the resource. It applies only to file and
registry operations.
• parameters are additional information related to each operation. For
example, the parameters of file create include file access flags and
file attributes. The parameters of net connect include the remote network address. The parameters of net send datagram include the remote
network address and the packet size. Note that this field is flexible and we
can include actual network data if needed.
• api tracing can be used to determine which software components call
each other to generate an event. The stack trace is analyzed to determine this information.
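To make the structure of these events concrete, here is an illustrative Java sketch of such an event record together with a registration call, as promised above. All class, field and method names are hypothetical; they are not the actual WinResMon API, which is described in [113].

import java.util.Map;

// Hypothetical sketch of a WinResMon-style event record; names are illustrative
// only and do not correspond to the actual WinResMon interfaces.
final class MonitoredEvent {
    long serialNumber;                // monotonically increasing, reveals missing events
    long startTime, finishTime;       // high-resolution timestamps (CPU performance counters)
    int processId, threadId;          // which process/thread caused the event
    String programName;               // pathname of the executable
    String userName;                  // user who owns the process
    int returnStatus;                 // success, or the reason for failure (e.g. file-not-found)
    String operation;                 // e.g. "file_create", "net_connect", "net_send_datagram"
    String resourcePath;              // pathname, for file and registry operations only
    Map<String, Object> parameters;   // operation-specific data (flags, remote address, packet size)
    String[] apiTrace;                // software components found on the call stack
}

// A monitoring tool registers the events it is interested in and receives callbacks.
interface EventSink {
    void onEvent(MonitoredEvent e);
}

interface MonitorSession {
    // hypothetical registration call: filter by operation pattern and process
    // (a negative processId could mean "any process" under this sketch's convention)
    void register(String operationPattern, int processId, EventSink sink);
}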
WinResMon can be used in the composite probes framework described
in Section 9.2 as basic probes to collect system information. As this is done
at the operating system level, the information is guaranteed to be accurate
and cannot easily be subverted. In addition to system-wide information such
as network throughput and disk I/O, basic probes may use WinResMon to
monitor a specific program or a specific component in a program, because
WinResMon provides context information such as process ID and component
information.
WinResMon also has the basic capabilities needed to virtualize some aspects of load testing (described in Section 17.3.2) that relate to the external environment, such as increased operation latency under load and resource failures due to excessive resource usage.
4.6.1 An Example from Monitoring Skype
Figure 4.7: Network configuration of three Skype clients: GVI, FX1 and RR.
We use WinResMon to monitor network interaction in Skype. In the
experiments, we installed Skype in three machines, GVI, FX1 and RR. GVI
is directly connected to the NUS network and has a NUS NATed IP address,
172.18.178.175. FX1 is a VMWare virtual machine running on another machine (different from GVI) in the NUS network; it has a VMWare NATed IP
address, 192.168.0.5. RR is a VMWare virtual machine running on a machine
connected to the Singapore ISP StarHub; it has a VMWare NATed IP
address, 192.168.0.6. Figure 4.7 shows the network configuration.
We did three experiments.
• idle
It is known that Skype will sometimes behave as a supernode unless
a registry flag is explicitly set to prevent this. It is also known that
Skype participates in routing traffic for other nodes. In our experiment,
we wanted to see the traffic flow when Skype was run for an extended
period.
Figure 4.8: Idle Skype network traffic over 16 hours. The X axis is time
measured in CPU clocks, and the Y axis is the network traffic.
We ran Skype on GVI and let it idle for a period of 16 hours. The
average traffic was 3.1 B/s, and 30% of the traffic went to just 2 IP
addresses. A total of 126 different IP addresses connected
to GVI. Figure 4.8 shows the network traffic over time.
• two-way call
We had 2 machines (RR and FX1) in different networks calling each
other and simultaneously used WinResMon to monitor them. The idea
was to see whether the machines would communicate directly with each
other or go through a few hops. FX1 was the call initiator. We found
that there was no direct connection between the two machines. We also
found that about 95% of the traffic from both machines was to 2 IP
addresses located in Japan. Figure 4.9 shows the network traffic of the two
machines over time.
Figure 4.9: Skype network traffic during two party calling. The X axis is
time measured in CPU clocks, and the Y axis is the network traffic of the
two machines.
• three-way call
Figure 4.10: Network connectivity graph in a three-way conference call
We had three machines in a three-way conference call. The machine
FX1 was the call initiator. Figure 4.10 shows most of the network traffic.
Minor network traffic, such as DNS queries, is omitted from the
graph.
In conclusion, we show that using WinResMon on a distributed application
such as Skype reveals interesting traffic flows.
4.7 Authenticating Software Components and Version Management
In SELFMAN, the self-managing aspect of software components may itself
lead to attacks on SELFMAN applications. For example, some researchers
used the Storm Worm's self-update mechanism to remove the Storm Worm itself. (The Storm Worm itself uses P2P techniques for self-management and
control.)
We developed BinAuth [68], a software component authentication
system for Windows. BinAuth ensures that only binaries whose data integrity
has been verified can be executed. Software components in Windows include
all kinds of binaries: the main program executables (EXE), dynamic link libraries (DLL and others), and kernel drivers (SYS). Because
the BinAuth authentication system works at the operating system level, it
can guarantee that only software components which pass the authentication
test can be used and executed. This stops most kinds of malware attacks
which attempt to subvert an application with foreign code. Interestingly, we
note that Skype works perfectly fine under BinAuth but would fail under
many systems which use dynamic code instrumentation.
For SELFMAN, component distribution and updating can be authenticated
by BinAuth by signing library files, assuming that each component consists
of several library files. We use a flexible signing mechanism which does not
require modifying the format of existing binaries. The component authentication mechanism also helps to control the versions of software components on
a system. Often, a real system relies on many software components, some
of which are third party, and others we may simply not even know about.
BinAuth allows all the components to be managed so that automatic patching and updating can be achieved without the patching and updating mechanism
itself becoming the attack vector.
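The following is a minimal, user-level Java sketch of the idea behind such binary authentication: a binary is accepted only if a detached signature over its contents verifies against a trusted public key. It is not BinAuth's implementation, which enforces the check inside the Windows kernel when a binary is loaded; the class and method names here are illustrative only.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.PublicKey;
import java.security.Signature;

// Illustrative sketch only: verify a detached signature over a binary
// (EXE/DLL/SYS) before allowing it to be used. BinAuth itself hooks binary
// loading in the kernel; this sketch shows only the verification step.
public class BinaryAuthSketch {
    public static boolean isAuthentic(Path binary, byte[] detachedSignature,
                                      PublicKey trustedKey) throws Exception {
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(trustedKey);
        try (InputStream in = Files.newInputStream(binary)) {
            byte[] buf = new byte[8192];
            int read;
            while ((read = in.read(buf)) != -1) {
                if (read > 0) verifier.update(buf, 0, read);
            }
        }
        return verifier.verify(detachedSignature);
    }
}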
4.8 Papers
The papers which describe the work here are as follows:
• Felix Halim, Rajiv Ramnath, Sufatrio, Yongzheng Wu and Roland H.C.
Yap. “A Lightweight Binary Authentication System for Windows”. In
IFIPTM 2008: Joint iTrust and PST Conferences on Privacy, Trust
Management and Security. It is attached in Appendix A.7.
• Felix Halim, Yongzheng Wu and Roland H.C. Yap, “Small World Networks as Self Organizing Networks”. This is intended to be submitted to the Workshop on Decentralized Self Management for Grids, P2P,
and User Communities (SELFMAN) held in conjunction with SASO
2008: Second IEEE International Conference on Self-Adaptive and
Self-Organizing Systems, 2008.
Chapter 5
D1.4: Java library of SELFMAN structured overlay network
5.1 Executive summary
In this deliverable we present the Java prototype for the SELFMAN structured overlay network. This has been implemented as a set of components,
using the Kompics [9] component model, presented in D2.1b (see Chapter 7),
and its prototype implementation delivered as D2.1c (see Chapter 8).
We have used the Kompics component framework to implement the Java
version of the SELFMAN structured overlay network. We have devised a
generic component architecture for a peer-to-peer system that supports multiple virtual peers in one address space. This includes components like: network, timer, virtual peer, bootstrap client and server, monitoring agent and
server, failure detector, ring based overlay, web server, web handler, etc.
Our architecture allows the various components to be written once
and then executed either in a simulation scenario or in a real deployment.
This is made possible by replacing the network and application components
with components for network simulation and user simulation.
The SELFMAN structured overlay network prototype is available as a
public release at http://kompics.sics.se/p2p.
5.2 Contractors contributing to the Deliverable
KTH(P2) has contributed to this deliverable.
KTH(P2) KTH has implemented, and is still implementing and testing,
components of the SELFMAN structured overlay network using the Kompics
component model, presented in D2.1b (see Chapter 7), and its prototype
implementation delivered as D2.1c (see Chapter 8).
5.3 The Kompics P2P architecture for the SELFMAN structured overlay network
The latest release of the Kompics P2P architecture, which contains the Java
implementation of the SELFMAN structured overlay network, is publicly
available at http://kompics.sics.se/p2p. The release includes technical
documentation, source code, API documentation, the binary library and a user
guide.
Figure 5.1 shows the Kompics peer-to-peer system architecture. The
architecture allows many virtual peers in one address space, which makes it
possible to execute the system either in a simulation scenario (where all peers
live in the same process) or in a real deployment. This is made possible
by replacing the network and application components with components for
network simulation and user simulation.
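The following illustrative Java sketch (not the actual Kompics API) shows the underlying idea: the peer logic is written once against a network abstraction, and a simulated implementation that delivers messages locally in one process can be substituted for the real transport.

// Illustrative sketch of swapping a real network component for a simulated one;
// all names are hypothetical and do not correspond to the Kompics classes.
interface Network {
    void send(String destination, Object message);
}

// real deployment: messages would be handed to a transport layer (e.g. Apache MINA)
class DeployedNetwork implements Network {
    public void send(String destination, Object message) {
        // transport details elided in this sketch
    }
}

// simulation: all peers live in the same process; delivery is a local method call
class SimulatedNetwork implements Network {
    private final java.util.Map<String, Peer> peers = new java.util.HashMap<>();
    public void addPeer(String address, Peer p) { peers.put(address, p); }
    public void send(String destination, Object message) {
        Peer p = peers.get(destination);
        if (p != null) p.deliver(message);
    }
}

// the peer is written once, against the Network abstraction only
class Peer {
    private final Network network;
    Peer(Network network) { this.network = network; }
    void lookup(String key, String via) { network.send(via, "LOOKUP " + key); }
    void deliver(Object message) { System.out.println("received: " + message); }
}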
Figure 5.1: Kompics peer-to-peer system architecture.
Chapter 6
D1.5: Mozart library of SELFMAN structured overlay network
6.1 Executive summary
The objective of this deliverable is to implement the conceptual results
achieved by other deliverables, as a proof of concept. This deliverable consists of three software packages: P2PS [44], PEPINO [45] and CiNiSMO [100], all
three released as free software under the Mozart Public License.
P2PS, the Peer-to-Peer System, is a library implemented in Mozart [108] providing a distributed hash table (DHT) on top of a structured overlay network
(SON), using the Relaxed-Ring topology [101]. It implements the Application Programming Interface (API) presented in Deliverable D1.2, Chapter 3.
PEPINO, the PEer-to-Peer network INspectOr, is an end-user application implemented using the P2PS library. It provides a dynamic visualizer that can
inspect a running network, as well as simulate one for research purposes.
CiNiSMO, the Concurrent Network Simulator in Mozart-Oz, is a programming framework used for running network simulations in a realistic scenario. This is achieved by running every node in its own lightweight thread
with its own memory scope, as if it were an independent process.
6.2 Contractors contributing to the Deliverable
UCL(P1) and ZIB(P5) have contributed to this deliverable.
UCL(P1) UCL has implemented, documented and tested P2PS, PEPINO
and CiNiSMO. It has also contributed by submitting demonstrator proposals
that are included as appendices. The web pages dedicated to making the software
available were also developed by UCL.
ZIB(P5) ZIB has contributed by testing PEPINO and helping to understand the transactional algorithms presented in Work Package 3. ZIB has
co-authored one of the demonstrator proposals, which is dedicated to the study of
decentralized transactions.
6.3 Introduction
This software deliverable is intended as a proof of concept for the results
achieved in other deliverables. In particular, it builds on deliverables D1.1 and D1.2 (Chapter 3), which provide the low-level and high-level
self-management primitives for building structured overlay networks. We
also take input from Work Package 3, dedicated to providing a distributed storage service, which is integrated in the Application Programming Interface
(API) that has been used to implement the software presented here.
The deliverable is composed of a programming library for building peer-to-peer systems, called P2PS; an end-user tool for visualizing and inspecting running and simulated networks, called PEPINO; and a programming
framework, called CiNiSMO, that has been used to evaluate many of the concepts presented in
Deliverable D1.2.
6.4 P2PS
This is a library implementing the API described in Deliverable D1.2 (see
Chapter 3), which provides the means for building a peer-to-peer network using
the Relaxed-Ring topology [101]. It is implemented in Mozart [108], and it
is meant for programming in Mozart as well.
We have created a dedicated web site for the documentation of P2PS, where the
library can also be downloaded. The URL is http://p2ps.info.ucl.ac.be.
We believe that the best way to test what P2PS provides is by using PEPINO,
a graphical application for inspecting peer-to-peer networks, which is described in the following section.
As an example of how this library can be used in Mozart, here is a
code sample that shows how to create a peer, join a network, put a value
into the network and recover it for display. The API corresponds to a Mozart-style
implementation of what is presented in Deliverable D1.2.
declare
Peer = {New P2PSNode init} % creates a node
{Peer join(RingRef)} % the ring reference is supposed to be known
{Peer put(key:’hello’ value:’world’)} % put a value under key ’hello’
{Show {Peer get(key:’hello’ value:$)}} % show the recovered value
As we have mentioned above, the best way to test what P2PS can
provide is by running PEPINO, which is presented in detail in the next
section.
Figure 6.1: A peer-to-peer network visualized by PEPINO
6.5 PEPINO
PEPINO is a graphical PEer-to-Peer network INspectOr running on top
of P2PS, a structured overlay network using the Relaxed-Ring topology.
PEPINO has been built to monitor an existing network or to simulate one
in order to study the system. The inspection of the network is done
by detecting failures and by observing the messages sent between peers. A
dynamic and self-organizing view of the network is presented to the user, who
can interact with it by injecting failures or by sending messages to arbitrary
peers.
Since many systems implement DHTs in different ways (in particular by
choosing the finger table with a different strategy), PEPINO also helps
to study three different strategies, namely finger tables following the
strategy of DKS [57], Tango [29], and Chord [136].
Some screen-shots are presented in this deliverable to depict some of the
features of PEPINO. Figure 6.1 shows a ring composed of 10 nodes. Arrows
have different colours in order to present meaningful information. Green
arrows represent successor pointers. Red ones correspond to predecessors.
Blue arrows are fingers. In the bottom-right corner there are 3 buttons
to organize the ring according to a particular colour. In the case of the
image, predecessor pointers are followed to verify that no inconsistency is
present (correct sharing of responsibility). Fingers and other arrows are
highlighted when the mouse hovers over a particular node.
One of the main features that differentiates the relaxed-ring from other
structured overlay networks is its ability to accept nodes with connectivity
problems, which form branches annexed to the core ring. Figure 6.2 depicts
such a branch. The screen-shot shows the network organizing the visualization with respect to the successor pointers (green arrows). It is possible to
observe peers painted in yellow as members of the core ring, and white peers
belonging to branches.

Figure 6.2: Testing relaxed-ring’s branches with PEPINO
PEPINO not only visualizes the network as a ring of peers. It also shows
the messages exchanged between nodes, as depicted in the screen-shot of
Figure 6.3. This feature is placed in the resizable left panel of the application.
If the mouse is placed over a message exchanged between two peers, all the
other messages between them are highlighted.

Figure 6.3: Events displayed with PEPINO
A demonstrator of the main features of PEPINO was shown at the Seventh IEEE International Conference on Peer-to-Peer Computing (P2P’07).
The abstract published in the proceedings of the conference is presented in
Appendix A.8. Two other demonstrator proposals were submitted to the
eighth edition of the IEEE P2P conference, in 2008. Both submissions propose
an extension to PEPINO, one related to distributed transactions, and the
other related to network partitioning and merging.
The submission included in Appendix A.9 proposes a demonstration of
the transactional DHT algorithm based on a modified version of the Paxos
consensus algorithm. The whole algorithm for transactions is described in
Deliverable D3.1b, Chapter 12, Appendix A.12. The implementation done
in P2PS, and used by PEPINO, is completely based on the result of the
mentioned deliverable. The design of the transactional DHT is mainly the
result of the collaboration between partners ZIB(P5) and KTH(P2). The
demonstrator is the result of the collaboration between partners UCL(P1)
and ZIB(P5). A report on the interface of the transaction algorithm is presented in Deliverable D1.2, Chapter 3.
In order to demonstrate the robustness of our algorithm for the transactional
DHT, the demonstrator compares it to two-phase commit, which is one of
the most popular choices for implementing distributed transactions, having been
used since the 1980s. Two-phase commit is very inefficient on peer-to-peer
networks because it relies on the survival of the transaction manager,
and therefore it has to be used with a robust, reliable server.
The second submitted proposal, included in Appendix A.10, proposes a
demonstration of the merge algorithm [129], which is a result of the first
year of SELFMAN. The demonstrator allows the user to inject a network
partition and observe how the system survives by forming two independent rings.
The network partition is simulated because otherwise the PEPINO
application would be able to observe only one of the rings. Once the rings
are reorganized, the connection between the two sets of nodes can be restored
in order to observe the merging algorithm.
6.5.1 Using PEPINO
PEPINO is implemented in Mozart 1.3.2, and it is currently being ported to
Mozart 1.4.0. This means that it runs on many platforms, the main ones being
Linux, Mac OS X and Windows. The following instructions are meant
for Unix-based systems, with some specific instructions for Windows users.
These commands can usually be executed in a terminal on any of these systems
(cmd on Windows).
Getting the Software
The software is available at http://p2ps.info.ucl.ac.be/pepino in the
download section. The user can choose between a compiled version and the
source code. The release is open to everyone under a free software license; it
is not internal to SELFMAN.
Requirements
To run PEPINO, you need to install Mozart [108]. Below we specify
some versioning issues. To build PEPINO from the source, GNU Make
is needed.
Building and Running PEPINO
Once you uncompress the downloaded file, you will find a directory called
pepino inside the directory p2ps. If you got the compiled version, run PEPINO
as follows (on Windows, use pepino.exe instead of pepino):
cd p2ps/pepino/
./pepino
If you want to compile the sources, then do as follows:
cd p2ps
make all
cd pepino
./pepino
Running PEPINO like that will open the network inspector with the
default values, which is a network simulation of 11 peers. You can try out
the different arrows to play the simulation at different speeds, and move the
peers around the network with the mouse. You can also reorganize
the network according to the different colours of the arrows. Right-clicking
on a peer triggers a temporary or permanent failure, which can be used to
test failure recovery.
PEPINO can be run with different options to study the network from
different points of view. We recommend the following options:
To create branches in the Relaxed-ring, run
SELFMAN Deliverable Year Two, Page 95
CHAPTER 6. D1.5: MOZART LIBRARY OF SELFMAN
STRUCTURED OVERLAY NETWORK
./pepino -b
If it does not create enough branches, increase the probability by running, for instance,
./pepino -b --prob=40
To create a network of a bigger size, run
./pepino -s 42
To see all possible options, ask for help as follows
./pepino --help
That will return the following help menu, which gives you an idea of
the kinds of options you can use:
Usage: ./pepino [option]
Options:
 -b, --branches BOOL  Create randomly branches
     --prob NUM       Probability of having a branch
 -d, --dist ATOM      Determines the type of session [dl, dss, sim (default)]
 -k, --maxkey NUM     Maximun value for a key (default 100000)
 -l, --logfile FILE   Log file name (default test.log)
 -n, --network ATOM   Name of the network (default guinness)
 -o, --ozstore FILE   Ticket to OzStore (default OzStoreTicket)
 -p, --logport FILE   File to store the logger port (default logger.tket)
 -s, --size NUM       Size of the network (default 11, minimun 4)
     --version        Version number
     --viewer         Viewer mode. Just read a log file
 -h, -?, --help       This help

This PEPINO comes without mayonnaise
Looking at the options, you can see that a logged experiment can be reproduced later by running
./pepino --viewer --logfile=test.log
Running real networks
Instead of running just a simulation using lightweight threads, you can run a real
network created with different Unix processes. Since the new Mozart version comes
with a new distributed implementation, you have to choose the distribution layer
for running PEPINO accordingly. If you want to use Mozart 1.3.2, you have to
run
./pepino -dist=dl
If you want to test PEPINO with Mozart 1.4.0, run it as follows:
./pepino -dist=dss
Note that as we mentioned above, PEPINO is currently being ported to
Mozart 1.4.0 and this version might be very unstable.
6.6 CiNiSMO
CiNiSMO is a Concurrent Network Simulator implemented in Mozart-Oz. It has
been used for evaluating the claims made about the Relaxed-Ring in Deliverable
D1.2, and we continue to use it for ongoing research with other network topologies.
In CiNiSMO, every node runs autonomously on its own lightweight thread. Nodes
communicate with each other by message passing using ports. We consider that
these properties make the simulator much more realistic. We have released it as
a programming framework that can be used to run other tests with other kinds
of structured overlay networks. Another motivation for releasing CiNiSMO is to
allow other researchers to reproduce the experiments we have run to reach our
conclusions.
The general architecture of CiNiSMO is described in Figure 6.4. At the center,
we observe the component called “CiNetwork”. This component is in charge of creating n
peers using the component “Core Node”. The core node delegates every message
it receives to another component which implements the algorithms of a particular network; currently, we have implemented Chord, P2PS, fully
connected networks and PALTA in CiNiSMO. To add a new kind of network to this simulator,
it is sufficient to create the corresponding component that handles the messages
delegated by the core node.
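The delegation pattern just described can be illustrated with the following Java sketch; CiNiSMO itself is written in Mozart/Oz and all names below are hypothetical, so the sketch only shows the shape of the design, not the actual framework.

// Illustrative sketch of the CiNiSMO delegation pattern (hypothetical names):
// the core node forwards every message to a pluggable network behaviour, so a
// new kind of network is added by implementing a single interface.
interface NetworkBehaviour {
    void handle(Object message, CoreNode self);
}

class CoreNode {
    private final int id;
    private final NetworkBehaviour behaviour;   // e.g. Chord, P2PS, PALTA, ...
    CoreNode(int id, NetworkBehaviour behaviour) {
        this.id = id;
        this.behaviour = behaviour;
    }
    int id() { return id; }
    // every message received by the core node is delegated unchanged
    void receive(Object message) { behaviour.handle(message, this); }
}

// adding a new network type only requires a new behaviour implementation
class ChordLikeBehaviour implements NetworkBehaviour {
    public void handle(Object message, CoreNode self) {
        // protocol-specific reaction, e.g. routing a lookup towards its key
        System.out.println("node " + self.id() + " handles " + message);
    }
}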
Every core node transmits information about the messages it receives to a component called “Stats”, which can summarize information such as how many lookup
messages were generated or how many crash events were triggered. The component
that typically demands this kind of information is the “Test”. This is another component that can be implemented to define the size of the network and the kind of
events we want to study. Only one CiNetwork is created per Test. When the
relevant information is gathered, it is sent to a “Logger”, which outputs the results
into a file.

Figure 6.4: Architecture of CiNiSMO
Since it is cumbersome to run every test individually many times, it is possible to implement the component called “Master Test”, which can organize the
execution of many tests, changing the seed for random number generation or
a parameter that is used for the creation of the CiNetwork.
Figure 6.5, which is presented as a result in Deliverable D1.2, can give us an idea
of the limits of execution of CiNiSMO. The data is generated by CiNiSMO, but
the plot is generated by separate software, namely gnuplot. Note that we can
run networks of 10000 peers, which means 10000 threads running simultaneously
and constantly exchanging messages. The Y-axis shows that about 1e+07
messages were created in the most loaded networks.
6.6.1 Using CiNiSMO
The source code and documentation of CiNiSMO are available on its dedicated
website, http://p2ps.info.ucl.ac.be/cinismo. It is released as Free Software.
Since it is a programming framework, we provide it as source code. To compile it
you need Mozart and GNU Make. It can run on Linux, Mac OS X and Windows,
among other operating systems. Even though it is meant for programming your own
tests, here are instructions for building and running some of the tests we have performed to
validate our conceptual results. The instructions are given for running CiNiSMO
on a Unix-based system.
Once you uncompress the downloaded files, you must execute the following
steps:
Figure 6.5: Data comparing the network traffic of different instances of Chord
and P2PS.
cd CiNiSMO
make all
./cinismo --help
That will produce the following output, which is a help menu that gives an
idea about the different possibilities offered by CiNiSMO.
CiNiSMO is the Concurrent Network Simulator in Mozart-Oz
Usage: ./cinismo [option]
Options:
 -k, --maxkey NUM    Maximun value for a key (default 666)
 -l, --logfile FILE  Log file name (default nolog)
     --logger FILE   Logger’s Port ticket (default nolog)
 -n, --netsize NUM   Network size (default 7)
 -o, --omega NUM     Omega value for PALTA (default 666)
     --prob NUM      Probability of having a broken connection
 -s, --seed NUM      Seed for random generator (default 1)
     --stabrate NUM  Stabilization rate (default 0)
Choosing tests:
     --mastertest ATOM  Master Test you want to run
     --test ATOM        Test you want to run
     --version          Version number
 -h, -?, --help         This help

CiNiSMO also stands for Cynical Network Simulator in Mozart-Oz,
where ’i’ stands for "ignored letter"
Then, you can check which kind of tests or master tests are already implemented by running
./cinismo list test
This will produce the following output:
List of possible tests you can run
chord_lookup
full_connectivity
p2ps_branches
p2ps_lookup
palta_build
palta_hops
revereendo_hops
revereendo_test
Let us run the first one, saving the result into a file called output.log. Some
results will also be printed on the standard output, but we are not interested
in that right now.
./cinismo --test chord_lookup -l output.log
6.7 Publications and Submissions
This section gives a brief introduction to the demo proposals we
have submitted using the results of this deliverable. The documents are included
as appendices.
PEPINO: PEer-to-Peer network INspectOr
This is the first proposal we submitted, with the goal of showing the main
features of our network inspector. The demonstrator was accepted and presented
at the Seventh IEEE Peer-to-Peer Conference. The article included in Appendix
A.8 appears in the proceedings of the conference.
Visualizing Transactional Algorithms for DHTs
The focus of this demonstrator is on the study of algorithms for implementing
transactions on peer-to-peer networks. Their visualization contributes to the analysis and testing of the protocols, verifying their tolerance to failures. In particular, we
compare a DHT running two-phase commit with one running the Paxos consensus algorithm.
The submission has been accepted at the Eighth IEEE Peer-to-Peer Conference,
which will take place in September 2008. The article is included in Appendix A.9.
Partitioning and Merging the Ring
This demonstrator offers a graphical way to study how tolerant a system is with
respect to network partitions, and how efficiently the network is merged
when the network partition disappears. The article, included in Appendix A.10,
has been submitted to the Eighth IEEE Peer-to-Peer Conference.
Chapter 7
D2.1b: Report on computation model with self-management primitives
7.1 Executive summary
In SELFMAN we aim to construct long-running, self-manageable and self-configurable
dynamic distributed systems. In this deliverable we present aspects of the SELFMAN computation and programming model that facilitate the construction of
such complex, dynamic and reactive systems. Systems of this type, where many
software modules execute concurrently, reactively and interact in complex ways,
are cumbersome to build, maintain and manage without a rigorous architectural
and computational model.
The architectural and management aspects of SELFMAN systems are catered
for by the guidelines provided by the Fractal [27] component model. In Fractal,
the software is organized into components that are reflective, hierarchical and
dynamically reconfigurable. However, Fractal is agnostic with respect to the model
of execution and interaction between components.
KTH(P2) has worked on Kompics [9], a reactive component model that is
compatible with Fractal but provides a concrete execution and interaction model
for components, and is particularly aimed at components that implement distributed abstractions. Kompics components are reactive/event-driven, concurrent,
and readily exploit multi-core architectures. They are fault-tolerant and can form
flexible fault supervision hierarchies. Kompics components provide basic primitives for self-healing and self-configuration.
In this deliverable we first present a recapitulating overview of the Fractal
component model in Section 7.3. We introduce the Kompics model for component
execution and interaction in Section 7.4. In Section 7.5 we report our ongoing
work on integrating Kompics and Fractal.
7.2 Contractors contributing to the Deliverable
KTH(P2), INRIA(P3) and FT R&D(P4) have contributed to this deliverable.
KTH(P2) KTH has contributed by defining and implementing the Kompics
reactive component model for building distributed protocols as components. KTH
has worked in cooperation with INRIA on integrating the Kompics component
model with the Fractal component model.
INRIA(P3), FT R&D(P4) INRIA and FT have worked in cooperation with
KTH on integrating the Kompics component model with the Fractal component
model.
7.3 An overview of the Fractal component model
Fractal [27] is an advanced component model with associated, still-growing programming and management support, devised initially by France Telecom and INRIA
and developed since 2001. Most developments are framed by the Fractal project inside the ObjectWeb open source middleware consortium1. The Fractal project targets the
development of a reflective component technology for the construction of highly
adaptable and reconfigurable distributed systems.
In this section, we first focus on the concepts that form the component model
itself and then we give some elements about Fractal implementations and tools
that are relevant for the Selfman project, with a view to defining an event-based
component model (Kompics) as a “Fractal personality”.
7.3.1 Component model
Classical concepts The Fractal component model relies on some classical concepts in the CBSE area.
Components are runtime entities that conform to the component model. The
component model defines specific interaction and composition standards. Components are units of development (design, modeling, implementation, test), deployment and management. Components exist as such during execution and
can be manipulated as such for management purposes. Components do not have a
predefined granularity (as in EJB, for instance, where components have a fixed and
large granularity): Fractal components may be of arbitrary size, from a pool to a
complete DBMS, including services, resources, protocol stacks, name servers and application servers. Also, Fractal does not have dedicated “targets” such as, typically,
“technical” or “applicative/business” components.
Interfaces (somewhat similar to “ports” in other component models) are the only
interaction points between components. Interfaces express dependencies between
components in terms of required/client and provided/server interfaces. A client
interface can typically emit operation invocations (or signals, or events), while a
server interface can receive operation invocations (resp. signals, events).
Bindings are communication channels between component interfaces and can
be primitive or composite. Primitive bindings are local communication channels
between components (interfaces) that reside in the same address space. Primitive
bindings are typically implemented as Java references or C pointers. Composite bindings are specialized assemblies of components and bindings dedicated to
advanced communication channels such as distributed, secured or transactional
communications. From the Fractal point of view, “connectors” or “adaptors” as
used in Architecture Description Languages (ADL) are just specialized bindings,
1
cf. http://fractal.objectweb.org
i.e. bindings with a predefined semantics (communication, type matching, etc.).
These specialized semantics can typically be implemented as binding components
in Fractal. Structurally, they are not different from other components.
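The following plain-Java sketch illustrates the client/server interface and primitive-binding idea; it deliberately does not use the real Fractal API, in which the same binding step would go through the component's binding controller (operations such as bindFc/unbindFc) rather than an ad hoc setter.

// Plain-Java illustration of required (client) and provided (server) interfaces
// and of a primitive binding implemented, as in Julia, by a plain reference.
interface Logger {                    // a server (provided) interface type
    void log(String message);
}

class ConsoleLogger implements Logger {           // a component providing Logger
    public void log(String message) { System.out.println(message); }
}

class Service {                       // a component with a client (required) interface
    private Logger logger;            // the dependency is externalized, not hidden in code

    void bindLogger(Logger target) { this.logger = target; }   // establish the binding
    void unbindLogger() { this.logger = null; }                // remove the binding

    void doWork() {
        if (logger != null) logger.log("service did some work");
    }
}

The binding is established by an external program, for example new Service().bindLogger(new ConsoleLogger()), which mirrors the Fractal principle that bindings are manipulated from outside the components rather than being hard-wired in their code.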
Types Fractal components can be typed. The Fractal specification defines a
basic type system in which a component type is defined as a set of interface types.
Fractal interfaces are not to be confused with language interfaces: a Java interface, which essentially defines a list of operations (methods) on a class of objects,
is referred to as an interface signature in Fractal, while the term “interface” in
Fractal designates the actual runtime entity that can be named and accessed by
components. A Fractal interface type defines an interface signature and additional
properties (constraints): role (client or server), cardinality (singleton or collection)
and contingency (optional or mandatory). A sub-typing relation based on “substitutability” between components is defined by the model.
Factories Instantiation (creation) of components in Fractal can be done using
factories. Factories are typically dedicated factory components (i.e. factories are
themselves implemented as components which implies of course the need for special
bootstrap factories). Three kinds of factories are defined in the Fractal specification.
A generic (or parametric) factory can create components of arbitrary types given
as inputs and a description of control and content. A standard factory can create
components of one specific type, i.e. the factory is explicitly programmed to do
so. Templates can create components that are similar (isomorphic) to themselves.
Templates are very useful to instantiate at once complex (hierarchical) component
assemblies.
Original concepts Fractal also exhibits more original (in the sense of “less
common”) concepts.
A component is the composition of a membrane and a content. A membrane
exercises an arbitrary control over its content. It embodies the control behavior
associated to a particular component:
• it can provide an explicit and causally connected representation of its content (sub-components),
• it can intercept incoming and outgoing operation invocations targeting or
originating from its content and superpose a control behavior: suspending,
checkpointing or resuming activities, reifying or changing operation invocation parameters, transparently managing technical services (e.g. persistency or security), managing QoS (memory consumption, garbage collection),
• or it can do nothing at all!
Control is based on reflection. Reflection is defined as the ability of a component
(seen as a program) to manipulate as data the entities that represent its execution
state during its own execution. This manipulation can take two forms:
• introspection: the ability of a component (seen as a program) to observe
and reason about its own execution state;
• intercession: the ability of a component (seen as a program) to alter its own
execution state, or to alter its interpretation or semantics.
The model is recursive (hierarchical) with sharing at arbitrary levels. The hierarchical structure involves the associated notions of sub-components and super-components
and an import/export mechanism based on complementary interfaces used to bind
super- and sub-components. A pair of complementary interfaces consists of an external and an internal interface (of the same component) of the same type but
with symmetrical roles (client/server). Internal interfaces only exist as complementary interfaces (with the basic type system). Only sub-components can be bound
to internal interfaces of their super-component(s). The recursion stops with base
components that have an empty content. Base components encapsulate entities in
an underlying programming language (e.g. objects in Java).
A component can be shared by multiple enclosing components: a component
can be a sub-component of several super-components. The behavior of a shared
component C is under the control of the direct enclosing component of C’s super-components. Sharing is intrinsic to resource management: without sharing,
encapsulation would have to be enforced by applications and/or with complex
mechanisms (e.g. replication) in pure hierarchical settings. Sharing may be used
for other purposes such as activity management (e.g. transactions, processes)
and domain management (e.g. security, faults, administrative domains).
Organization of the model The model specification is organized by “levels of
control”. The “foundations level” defines base components with no reflexive capabilities (legacy code), an IDL (Interface Description Language) and a naming and
binding API (which defines the Name, NamingContext and Binder interface signatures).
The “introspection level” defines the component and interface API and allows
for introspection of component boundaries. The “configuration level” provides
structural introspection and intercession through the Attribute, Content, Binding and
Lifecycle control APIs. These represent predefined reflexive control of (white-box)
component structure, but arbitrary control features may be defined. The model
also defines the type system and instantiation APIs (factories).
The model is programming-language independent and open: everything is optional and extensible2 in the model, which only defines some “standard” APIs for
controlling bindings between components, the hierarchical structure of a component system and the component life-cycle (creation, start, stop, etc.).
2
This openness leads to the need for conformance levels and conformance test suites so
as to compare distinct implementations of the model.

Fractal principles The Fractal component model enforces a limited number
of very structuring architectural principles. Components are runtime entities conforming to a model and do have to exist at runtime per se for management purposes.
There is a clear separation between interfaces and implementations, which allows
for transparent modification of implementations without changing the structure
of the system. Bindings are programmatically controllable: bindings/dependencies
are not “hidden in code” but systematically externalized so as to be manipulated
by (external) programs.
Fractal systems exhibit a recursive structure with composite components that
can overlap, which naturally enforces encapsulation and easily models resource
sharing. Components exercise arbitrary reflexive control over their content: each
component is a management domain of its own.
Altogether, these principles make Fractal systems self-similar (hence the name
of the model): architecture is expressed homogeneously at arbitrary levels of abstraction in terms of bindings and reflexive containment relationships.
Finally, the model (including type systems, controllers and forms of bindings),
the Julia platform and some tools such as Fractal ADL are open and extensible, which makes us think Fractal is very suitable for the convergence effort with the
Kompics model, with a foreseeable result of defining Kompics as an event-based
Fractal personality.
7.3.2 Fractal ecosystem
We give a partial snapshot of the existing Fractal implementations, languages,
tools and component libraries collectively known as the Fractal ecosystem.
Implementations There currently exist 8 implementations (a.k.a. execution
platforms)3 providing support for Fractal component programming in 8 programming languages. We focus here on the Julia platform, which is considered in Selfman
in the sense that some building blocks of the platform may be used in the implementation of the Kompics component model.
Julia was historically (2002) the first Fractal implementation4, provided by
France Telecom. Since its second version, Julia makes use of AOP-like techniques
based on interceptors and controllers built as a composition of mixins. It comes
with a library of mixins and interceptors mixed at load time (Julia relies very much
on load-time bytecode transformation as the main underlying technique thanks to
the ASM Java bytecode Manipulation Framework).
The latest evolutions of the platform, thanks to joint work between INRIA
and France Telecom on the AOKell platform, allow for i) AOP-based (Aspect-Oriented Programming) programming of Fractal membranes based on standard
AOP technologies (static weaving with AspectJ and load-time weaving with Spoon)
3
Julia, AOKell, ProActive and THINK are available in the ObjectWeb code base.
FracNet, FractTalk and Flone are available as open source on specific web sites.
4
And sometimes considered for this reason as “the reference implementation” in Java.
instead of mixins; and ii) the implementation of component-based membranes: Fractal component controllers can themselves be implemented as Fractal components.
The design of Julia cared very much about performance: the goal was to prove that
component-based systems were not doomed to be inefficient compared to plain
Java. Julia allows for intra-component and inter-component optimizations which
altogether exhibit very acceptable performance.
Languages and tools A large number of R&D activities are being conducted
inside the Fractal community around languages and tools, with the overall ambition of providing a complete environment covering the entire component-based
software life cycle: modeling, design, development, deployment and (self-)
management.
A relevant, though not exhaustive, list of such activities for Selfman is the
following:
• development of formal foundations for the Fractal model, typically by means
of calculi, essentially by INRIA Sardes,
• development of basic and higher levels (e.g. transactional) mechanisms for
trusted dynamic reconfigurations, by France Telecom, INRIA Sardes and
Ecole des Mines de Nantes (EMN),
• support for configuration, development of ADL support and associated tool
chain, by INRIA Sardes, Jacquard, France Telecom, ST Microelectronics,
• support for packaging and deployment, by INRIA Jacquard, Sardes Oasis,
IMAG LSR laboratory, ENST Bretagne,
• development of navigation and management tools, by INRIA Jacquard and
France Telecom,
• development of architectures that mix components and aspects (AOP), at
the component (applicative) level and at the membrane (technical) level, by
INRIA, France Telecom, ICS/Charles University Prague.
The most mature among these works are typically incorporated as new modules
into the Fractal code base. Examples of such modules relevant for Selfman are the
following:
• Fractal RMI is a set of Fractal components that provide a binding factory
to create synchronous distributed bindings between Fractal components (“à
la Java RMI”). These components are based on a re-engineering process of
the Jonathan framework.
• Fractal ADL (Architecture Description Languages) is a language for defining Fractal configurations (components assemblies) and an associated retargetable parsing tool with different back-ends for instantiating these configurations on different implementations (Julia, AOKell, THINK, etc.). Fractal
ADL is a modular (XML modules defined by DTDs) and extensible language
to describe components, interfaces, bindings, containment relationships, attributes and types - which is classical for an ADL - but also to describe
implementations and especially membrane constructions that are specific
to each Fractal implementation, deployment information, behavior and QoS
contracts or any other architectural concern. Fractal ADL can be considered
as the favorite entry point to Fractal component programming (it offers a
much higher level of abstraction than the bare Fractal APIs) that embeds the
concepts of the Fractal component model5.
• FScript is a scripting language used to describe architectural reconfigurations
of Fractal components. FScript includes a special notation called FPath
(loosely inspired by XPath) to query, i.e. navigate and select elements from
Fractal architectures (components, interfaces...) according to some properties (e.g. which components are connected to this particular component?
how many components are bound to this particular component?). FPath is
used inside FScript to select the elements to reconfigure, but can be used by
itself as a query language for Fractal.
Component library and real-life usage We would like to emphasize the
maturity of the Fractal technology as a whole. Fractal is not a “paper” or “toy”
component model: it has been used effectively to build several middleware and
operating system components (several are available in the ObjectWeb code base), including CLIF, a framework for performance testing, load injection and monitoring
(management of blades, probes, injectors, data aggregators, etc.), that may be
used by the evaluation campaigns in the last year of Selfman WP5.
Some of these components that embed Fractal technology are used operationally, for instance JOnAS, Speedo and CLIF by France Telecom: JOnAS is
widely used by France Telecom6 for its service platforms, information systems and
networks, in more than 200 applications, including vocal services such as VoIP,
enterprise web portals, phone directories, client management, billing management,
sales management, and line and incident management.
5
It is worth noticing that Fractal ADL is not (yet) a complete component-oriented
language (in the Turing sense), hence the need for execution support in host programming
languages a.k.a. “implementations”.
6
See http://jonas.objectweb.org/success.html for a more comprehensive list of operational usage of JOnAS.
7.4 Kompics: Reactive component model for distributed computing
The Kompics component model targets the development of reliable and adaptable
long-lived, dynamic, and self-managing distributed systems. Such systems are
composed of many software modules which implement various distributed protocols
(e.g. failure detectors, reliable group communication, agreement protocols, gossip
protocols, etc.) and interact in complex ways.
Kompics aims to facilitate the construction of complex distributed systems
by providing a computation model that accommodates their reactive nature and
makes their programming as easy as possible. The Kompics run-time system provides primitives for self-configuration, self-healing, and self-tuning of component
architectures. Components are executed concurrently and multi-core hardware
architectures are exploited with no extra effort.
7.4.1 Component model
In Kompics, distributed abstractions are encapsulated into components that can
be composed into hierarchical architectures of composite components. Subcomponents can be safely shared by multiple components at any level in the component
hierarchy. Kompics components interact by passing asynchronous data-carrying
events and they are decoupled by a flexible event publish-subscribe system.
The concepts of the Kompics model are: components, events, channels, event
handlers, event subscriptions, component types, component membranes, component sharing, management and fault isolation.
Components A component is a unit of functionality and management. Com-
ponents are active entities that interact with each other by triggering (sending)
and handling (receiving) events. Components react to events by executing event
specific procedures to handle the received events. Components are decoupled (by
channels) which makes them independent and reusable.
Every component contains some internal state and a set of event-handling
procedures. Composite components also contain subcomponents and thus form
a component hierarchy. We sometimes call subcomponents child components
and the containing composite component the parent component. We say that the
parent component is at a higher level in the component hierarchy than its child
components.
Events Events are passive objects that contain a set of immutable attributes.
Events are typed and they can form type hierarchies. Components subscribe for
events to channels and publish events into channels.
Channels Channels are interaction links between components. They carry
events from publisher components to subscriber components. Every channel is
parameterized with a set of event types that can be subscribed for or published
into the channel. Channels exist in the context of the composite components which
create them. However, references to channels can be passed between components
through events.
Event handlers An event handler is an event-specific procedure that a compo-
nent executes as a reaction to a received event. An event handler is a component
method that takes as argument one event of a certain type. While being executed,
event handlers may trigger new events. Event handlers can be guarded by boolean
guards.
Event subscriptions Components subscribe their event handlers to channels
by registering event subscriptions at the respective channels. Event subscriptions
can be made either by event type or by both event type and event attributes,
whereby a subscription contains a set of (attribute, value) attribute filters. Events
published into a channel are delivered to all subscriber components that registered, at the channel, subscriptions matching the published events. Components
can publish or subscribe for subtypes of the event types carried by the channel.
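The following illustrative Java sketch (not the actual Kompics API, and synchronous where real Kompics handlers execute concurrently) shows the interaction style just described: typed events, channels that carry them, and a component whose event handler is subscribed to a channel.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only: events, channels and subscriptions in the style
// described above. Real Kompics components run their handlers on their own
// execution contexts; here publication simply calls the handlers directly.
class Event {}                                     // events can form type hierarchies
class Ping extends Event {
    final long id;
    Ping(long id) { this.id = id; }
}

class Channel<E extends Event> {                   // carries events of one (sub)type
    private final List<Consumer<E>> subscribers = new ArrayList<>();
    void subscribe(Consumer<E> handler) { subscribers.add(handler); }
    void publish(E event) { subscribers.forEach(h -> h.accept(event)); }
}

class Ponger {                                     // a component with one event handler
    Ponger(Channel<Ping> input) {
        // subscription: the handler runs for every Ping published on the channel
        input.subscribe(ping -> System.out.println("got ping " + ping.id));
    }
}

class Example {
    public static void main(String[] args) {
        Channel<Ping> pings = new Channel<>();
        new Ponger(pings);                         // the component subscribes its handler
        pings.publish(new Ping(1));                // some other component publishes events
    }
}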
Component types A component interacts with its environment (other com-
ponents) by triggering (output) and handling (input) events. The component is
subscribed for the input events to input channels and publishes output events into
output channels. Kompics components are parameterized by their input and output channels. The types of input and output events of a component together with
the input and output channel parameters that carry them represent the component’s type. A composite component expresses its dependencies on subcomponents
in terms of the component types of those subcomponents.
Component membranes A component membrane is the runtime incarnation
of the component’s type. The component membrane is a set of references to
the actual channels that the component is using as input and output channel
parameters. The membrane maps every pair (event type, in/out direction) to an
actual channel reference.
Component sharing A component is shared between multiple composite
components essentially by sharing the channels in its membrane. To share one of its
subcomponents, a composite component registers the subcomponent’s membrane
under a name, in a registry of shared components. Other composite components
can retrieve the membrane (by name) from the registry and use its channels to
communicate with the subcomponent. This registry is hierarchical in the following
sense: (1) names registered at some level in the component hierarchy (the level of
the parent component of the shared component) are not visible at higher levels,
and (2) names registered at a lower level in the component hierarchy shadow the
same names registered at higher levels.
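The lookup rule of this hierarchical registry can be illustrated by the following Java sketch; the class and method names (Registry, register, lookup) are ours and do not come from the Kompics code base.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a hierarchical registry of shared component membranes.
    class Registry<M> {
        private final Registry<M> parent;  // registry of the enclosing level, or null
        private final Map<String, M> local = new HashMap<>();

        Registry(Registry<M> parent) {
            this.parent = parent;
        }

        // Register a shared subcomponent's membrane under a name at this level.
        void register(String name, M membrane) {
            local.put(name, membrane);
        }

        // Lookup searches the local level first and then the enclosing levels, so
        // names registered locally shadow identical names registered higher up,
        // while names registered here are never visible to higher levels.
        M lookup(String name) {
            M membrane = local.get(name);
            if (membrane != null) {
                return membrane;
            }
            return parent != null ? parent.lookup(name) : null;
        }
    }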
Component management The management of Kompics components is event-
based, i.e., synchronous with the handling of functional events. Every component
has an associated built-in control channel. A component manager publishes management events into the control channel and they are handled either by built-in
default management handlers or by management handlers programmed explicitly
by the component developer. Dynamic reconfiguration operations like adding or
removing components and channels, replacing the channels in a component’s membrane, replacing components, or changing component subscriptions provide basic
primitives for self-configuration.
Component fault isolation Any error or exception that is not caught within
an event handler is isolated by the runtime system and wrapped into a fault event
which is published into the component’s control channel. A supervisor component
is subscribed to the faulty component’s control channel and handles fault events.
As a reaction to fault events, the supervisor component can manage/reconfigure
the faulty component. Flexible fault supervision hierarchies [10] (possibly different
from the component ownership hierarchy) can be formed. Hence, Kompics provides
basic primitives for self-healing.
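The following short Java sketch illustrates the idea of a fault event and a supervisor handler; the types and method names are illustrative assumptions, not the published Kompics API.

    // Sketch of fault isolation (hypothetical types, not the Kompics API).
    class Fault {
        final Throwable cause;        // the uncaught exception from an event handler
        final Object faultyComponent; // reference to the component that failed

        Fault(Throwable cause, Object faultyComponent) {
            this.cause = cause;
            this.faultyComponent = faultyComponent;
        }
    }

    class Supervisor {
        // Handler subscribed to the faulty component's control channel.
        void handleFault(Fault fault) {
            // Possible self-healing reactions: restart, replace, or disconnect
            // the faulty component (reconfiguration not shown).
            System.err.println("Component fault: " + fault.cause);
        }
    }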
Figure 7.1 shows an example graphical representation of Kompics components.
Here we have two components: A and B, and a channel carrying events of type E1.
Component A has an output formal channel parameter and component B has an
input formal channel parameter. Both components use the same actual channel
for their formal (input, respectively output) channel parameter. Component B
has an event handler that is subscribed to B’s input channel and handles events
of type E1. Component A has an event handler that publishes events of type E1
in A’s output channel. Both components A and B and the actual channel exist in
the context of a parent component, Node.
Figure 7.1: Graphical representation of Kompics components (legend: event handlers that trigger or handle events of type E1, component input and output channel parameters, and a channel carrying events of type E1).
Figure 7.2 shows an example graphical representation of a Kompics composite
component, namely a Best-Effort Broadcast (BEB) [66] component that contains
a Perfect Point-to-point Links (PP2P) [66] subcomponent. The BEB component
is parameterized by an input channel carrying BebBroadcast events and an output channel carrying BebDeliver events. The PP2P component is parameterized
by an input channel carrying Pp2pSend events and an output channel carrying
Pp2pDeliver events. The BEB component contains two local channels that are
used as the actual channels that parameterize the PP2P subcomponent and two
event handlers that handle BebBroadcast and Pp2pDeliver events respectively
and trigger Pp2pSend and BebDeliver events respectively. The arrows indicate
subscriptions and publications.
Figure 7.2: Graphical representation of a Kompics composite component.
Figure 7.3 shows an example of two composite components sharing a common subcomponent. A Perfect Point-to-point Links (PP2P) [66] component and
a Fair-Loss Point-to-point Links (FLP2P) [66] component share a Network subcomponent. The shared component as well as the channels in its membrane are
represented with double borders. The Network component accepts NetSend events
(in the sender process) and triggers NetDeliver events (in the receiver process).
NetSend events triggered by PP2P in the sender process are to result in
NetDeliver events handled only by PP2P in the receiver process. Similarly,
NetSend events triggered by FLP2P in the sender process are to result in NetDeliver
events handled only by FLP2P in the receiver process. Filtering NetDeliver events
between PP2P and FLP2P is done by event subtyping, i.e., PP2P and FLP2P subscribe to the Network component's output channels for different subtypes of the
NetDeliver event type. Events of these subtypes are respectively encapsulated in
NetSend events.
Figure 7.3: Two composite components sharing a subcomponent.
7.4.2 Component execution and interaction semantics
Typical Kompics components do not have execution threads of their own. Their
event handlers are executed on their behalf by worker threads from a worker pool.
Components that have received events are scheduled for execution on one of the
worker threads.
Concurrent component execution The event handlers of the same compo-
nent instance are guaranteed to be executed sequentially, but different component
instances can execute event handlers concurrently (or in parallel on multi-core
machines). In other words, the event handlers of the same component instance
are mutually exclusive, while the event handlers of different component instances
are not. However, the execution of an event handler is not atomic (in the all-or-nothing
sense). That means that events triggered by one event handler are visible
to the corresponding subscriber components (and thus executable) immediately after
they are triggered. It also entails that the execution of an event handler is not
failure atomic, i.e., it can fail before completion with observable partial side effects
(some of the events that were supposed to be triggered by the handler are indeed
triggered while others are not).
Event subscription Components subscribe their event handlers to input channels for a particular event type. When component a subscribes an event handler h
for events of type T to channel x, a subscription of the form (a, T, h) is registered
at channel x. At the same time, a FIFO work queue qx is created at a and associated with channel x (if it does not already exist from a previous subscription
of a to x). A channel y has associated work queues qy in every component that is
subscribed to it. A component can subscribe more than one of its event handlers
to the same channel.
Component scheduling A component a can be in exactly one of the
following three scheduling states: Busy, Ready, or Idle. We say that a is Busy
if one of the worker threads is actively executing one of a's event handlers.
We say that a is Ready if it is not Busy and at least one of its work queues qx is
not empty, so a is ready to execute some event. We say that a is Idle if it is not
Busy and all its work queues qx are empty, so a has no event to execute.
Event publication While executing event handlers, components may publish
events into output channels. Assume component a triggers event e of type T in
channel x. Let S be the subset of all subscriptions of the form (b, T′, h) to channel
x, where T′ is either T or a super-type of T. For each subscription (b, T′, h) in S,
a work item of the form (e, h) is enqueued at subscriber b in work queue qx and if
b was Idle then b becomes Ready.
Channel FIFO guarantees The execution model guarantees the following
FIFO semantics for channels. Each component a subscribed to a channel x, receives
events published in x, in the same order in which they are published. Events
triggered sequentially by one component instance will be published in the channel
in the order in which they were triggered. A channel serializes the concurrent
publication of events into the channel, i.e., events published concurrently into the
same channel by different component instances are delivered in one channel-wide order.
This means that all subscribers to channel x for event type T observe the same
order of publications of events of type T in their local work queues qx .
Event handler execution Worker threads execute event handlers on behalf
of components. Worker threads atomically pick Ready components and make
them Busy. When a worker picks Ready component a, it immediately makes a
Busy. An invariant of the execution model is that at this point a has at least
one work queue qx that is not empty. After making a Busy, the worker dequeues
one work item (e, h) from some work queue qx selected according to some fairness
criteria. Thereafter, the worker proceeds to execute a’s event handler h by passing
it as an argument the event e. Upon completing the execution of h, if all a’s work
queues qx are empty, then the worker makes a Idle. Otherwise it makes a Ready.
Worker threads loop Worker threads wait for components to become Ready.
When a component a becomes Ready, a worker w picks it and executes one work
item (e, h), the head of some work queue qx of a. The execution of event handler
h may trigger new events ei of types Ti , published in channels xi . All components
subscribed to channels xi for event types Ti become Ready if they were not
Busy. Upon completing the execution of event handler h, worker w picks another
Ready component, if one exists, and repeats the above steps. If no component
is Ready, worker w waits for a component to become Ready and then
repeats the above steps.
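The scheduling discipline just described can be summarized by the Java sketch below. It is a deliberately simplified illustration (a single ready queue, and one collapsed work queue per component), with names of our own choosing rather than the actual Kompics implementation.

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.BlockingQueue;

    // Simplified sketch of the Busy/Ready/Idle scheduling discipline
    // (illustrative names, not the Kompics implementation).
    class SimpleComponent {
        enum State { IDLE, READY, BUSY }

        private State state = State.IDLE;
        // One FIFO work queue per subscribed channel, collapsed to one queue here.
        private final Queue<Runnable> workQueue = new ArrayDeque<>();

        // Called on event publication: enqueue a work item (an event/handler pair,
        // represented as a Runnable); if the component was Idle it becomes Ready
        // and the caller must add it to the scheduler's ready queue.
        synchronized boolean enqueue(Runnable workItem) {
            workQueue.add(workItem);
            if (state == State.IDLE) {
                state = State.READY;
                return true;
            }
            return false;
        }

        // Atomically claim the component for execution (Ready -> Busy).
        synchronized Runnable pickWork() {
            state = State.BUSY;
            return workQueue.poll();
        }

        // After executing one handler: Busy -> Ready if more work remains, else Idle.
        synchronized boolean release() {
            state = workQueue.isEmpty() ? State.IDLE : State.READY;
            return state == State.READY;
        }
    }

    class Worker implements Runnable {
        // The ready queue (e.g. a LinkedBlockingQueue) is shared by all workers.
        private final BlockingQueue<SimpleComponent> readyQueue;

        Worker(BlockingQueue<SimpleComponent> readyQueue) {
            this.readyQueue = readyQueue;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    SimpleComponent c = readyQueue.take(); // wait for a Ready component
                    Runnable workItem = c.pickWork();      // component becomes Busy
                    if (workItem != null) {
                        workItem.run();                    // execute one event handler
                    }
                    if (c.release()) {                     // still has work: re-schedule
                        readyQueue.put(c);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();        // worker shut down
            }
        }
    }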
Worker pools The number of workers in the worker pool can be proportional
to the number of hardware processing cores. Locking is needed only on channels
and on the component work queues qx . No coarse-grained component locking
is needed. This enables Kompics component architectures to efficiently exploit
multi-core hardware architectures at no extra cost.
Additional worker pools of various sizes can be created and destroyed. Each
component is a member of one worker pool at any one time and it shares the
workers in the pool with the other member components. A component can be
moved from one worker pool to another. Groups of “hot” components can be
placed in their own worker pools in order to prioritize their execution. Hence,
Kompics provides basic primitives for self-tuning.
Threaded components Typical Kompics event handlers are “short” and do
not block. To facilitate programming of protocols in a continuation style, we
provide a blocking receive primitive that allows an event handler to block waiting
for an event with an expected type and/or attribute values. Upon receiving an
expected event, the handler continues its execution.
Components that make use of the receive primitive use a private thread for
executing event handlers which allows them to wait for an event without blocking
one of the workers in the worker pool. However, from an observational point of
view, threaded components look just like typical components: they accept and
trigger events.
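The blocking receive of a threaded component can be pictured with the sketch below, in which the component's private thread waits on an internal queue until a matching event arrives; the names and the deferral strategy are illustrative assumptions, not the actual Kompics primitive.

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Predicate;

    // Sketch of a threaded component's blocking receive
    // (illustrative only; not the actual Kompics API).
    class ThreadedComponent<E> {
        private final BlockingQueue<E> mailbox = new LinkedBlockingQueue<>();
        private final Queue<E> deferred = new ArrayDeque<>(); // events set aside while waiting

        // Called by the runtime to deliver an event to this component.
        void deliver(E event) {
            mailbox.offer(event);
        }

        // Blocking receive: the component's private thread waits until an event
        // matching the expected type/attribute predicate arrives. Non-matching
        // events are set aside and handled after the continuation resumes.
        E receive(Predicate<E> expected) throws InterruptedException {
            while (true) {
                E event = mailbox.take();   // blocks the private thread, not a pool worker
                if (expected.test(event)) {
                    return event;
                }
                deferred.add(event);        // defer until the receive completes
            }
        }
    }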
Security Components can have references to other (child or parent) components.
Also, channel references can be passed between components inside events.
References to Kompics components and channels embed fine-grained revocable
capabilities (cf. the caretaker pattern [116]). For example, component a can give
component b a reference to a channel that only allows b to publish events into
the channel but not to subscribe for events to the channel. Hence, Kompics provides basic mechanisms for the Principle of Least Authority (POLA) [106], and thus
provides some basic primitives for self-protection.
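The idea of narrowed, revocable channel references can be sketched as follows: the full channel reference is split into separate publish and subscribe facets, and only the needed facet, wrapped in a revocable caretaker, is handed out. The interfaces below are our own illustration of the pattern, not the Kompics types.

    // Illustration of narrowed channel capabilities (not the actual Kompics types).
    interface Publisher<E> {
        void publish(E event);
    }

    interface Subscriber<E> {
        void subscribe(java.util.function.Consumer<E> handler);
    }

    // A caretaker wraps the full channel but exposes only the publish facet,
    // and can revoke the capability at any time.
    class PublishOnlyRef<E> implements Publisher<E> {
        private volatile Publisher<E> target; // null once revoked

        PublishOnlyRef(Publisher<E> target) {
            this.target = target;
        }

        @Override
        public void publish(E event) {
            Publisher<E> t = target;
            if (t == null) {
                throw new IllegalStateException("capability revoked");
            }
            t.publish(event);
        }

        void revoke() {
            target = null;
        }
    }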
7.4.3 Example component architectures
In this section we present as examples two component architectures that have been
implemented with Kompics.
Reliable distributed abstractions architecture The example in Figure 7.4 shows stacking and composition of protocols in Kompics. For example,
the Abortable Consensus component makes use of Best-Effort Broadcast and Perfect Links and the Consensus Instance uses the Leader Detector. The example also
shows how components can be used to cater for functional and non-functional aspects. The Consensus Instance implements a Paxos uniform consensus algorithm.
The Consensus Port offers to an application component a sequence of consensus
instances while garbage collecting the already decided instances. The Consensus
Service component allocates consensus ports to different applications.
Figure 7.4: Static membership distributed abstractions system architecture.
Peer-to-peer system architecture The example in Figure 7.5 shows a peer-
to-peer system architecture that supports multiple virtual peers in one node. This
is an example of hierarchical sharing where for example we want to share the
Perfect Network abstraction among the protocols of one virtual peer, but have different Perfect Networks in different virtual peers. On the other hand, the Network
component is shared and used by all Perfect Network abstractions. In long-running
systems like this, where peers build a communication structure among themselves (the
overlay network), we would like to reconfigure peers without restarting them (which
would make them forget the structure). This motivates the need for dynamic
reconfiguration capabilities in the framework.
Figure 7.5: Kompics peer-to-peer system architecture.
7.5 Kompics and Fractal integration
The Fractal [27] component model allows the specification of components that
are reflective, hierarchical, and dynamically reconfigurable. However, the Fractal
model is agnostic with respect to the execution model of components. Kompics [9]
is a reactive component model that is similar to Fractal but it enforces a particular execution and component interaction model. Hence, at a high level, Kompics
can be regarded as a specialization of Fractal. Here we document the process of
“fractalizing” Kompics components, essentially giving Kompics a Fractal personality and making it compatible with Fractal. We present the conceptual mapping
between the concepts of the two models and the design choices made for the implementation of their integration.
We start by introducing a toy example of an architecture with 2 composite
components that share a primitive component. We use this example as a support
for introducing the concepts of the two component models and to discuss the
conceptual mapping between these concepts. We assume the reader has some
familiarity with the Fractal and Kompics component models.
7.5.1 Example component architecture with sharing
Let us consider the software architecture depicted in Figure 7.6. This is a possible subset of the architecture of a process participating in a distributed system.
We have 2 composite components, Leader Elector (LE) and Remote Procedure
Call (RPC), that share a primitive component, Failure Detector (FD). The FD
component is a subcomponent of both the LE and RPC components.
Figure 7.6: Example software architecture. A Leader Elector component and a Remote Procedure Call component share a Failure Detector component: (a) architectural view, (b) sharing view.
The FD component accepts requests to start or stop monitoring the liveness
of other processes, and when it detects the crash of a monitored process it triggers
a crash notification. We assume both the LE and RPC components use the FD
component. The LE component needs to be notified when the elected leader
crashes, to initiate a new leader election, hence it requests the FD to monitor the
leader process after every election. The RPC component needs to be notified when
the remote process to which it issued a procedure call crashes, so that it can throw
an exception for that remote procedure call; hence it requests the monitoring of
the remote process for every RPC invocation.
A realistic architecture would also contain a Network component that is a
subcomponent of, and shared by, all three components. However, we omit it here
for simplicity of presentation.
More concretely, the FD component accepts Start and Stop requests and
delivers Crash notifications. The LE component delivers NewLeader notifications. The RPC component accepts RemoteCall requests. These calls either
return successfully or throw an exception when the remote process crashes during
the invocation.
Kompics architecture
The Kompics architecture that corresponds to our abstract example architecture
is depicted in Figure 7.7. The FD component is parameterized by 2 channels: a
request channel and a notification channel. The sharing of the FD component
between the LE and the RPC components is done by sharing these 2 channels. The
request channel carries Start and Stop events, and the notification channel
carries Crash events.
The LE component is parameterized by a notification channel. LE has one
event handler that is subscribed to the FD notification channel and handles
Crash events. Upon handling a Crash event, the handler executes a leader
election protocol to elect a new leader. Thereafter, it publishes a Start event in
Figure 7.7: Example Kompics architecture.
the FD request channel to request the FD to monitor the newly elected leader. (At
the same time the handler publishes a NewLeader event in the LE notification
channel.)
The RPC component is parameterized by an invoke channel and a return
channel. RPC has one event handler that is subscribed to the RPC invoke channel and handles RemoteCall events, and one event handler that is subscribed
to the FD notification channel and handles Crash events. Upon handling a
RemoteCall event, the handler publishes a Start event in the FD request
channel in order to ask the FD to monitor the remote process to which the RemoteCall refers. We omit here the details of RPC including the case where the
call returns successfully. In the case when the remote process is detected by the
FD to have crashed during a remote call, FD publishes a Crash event in the FD
notification channel, which is handled by the Crash event handler of RPC.
This handler publishes a CallReturn event in the RPC return channel. (This
CallReturn event would contain a process crashed exception.)
Fractal architecture
The Fractal architecture that corresponds to our abstract example architecture is
depicted in Figure 7.8. The FD component is a primitive component shared by the
LE/LE and RPC/RPC composite components. (Once again, in a realistic architecture FD, LE, and RPC would be composite components that would encapsulate
a Network component but we abstract that for the purpose of this presentation.)
The FD component has a request server interface and a notification client
interface. The notification client interface is a collection interface, thus the FD
can be bound to more than one component that needs to receive crash notifications.
In our example, both LE and RPC have a crash server interface, to which the
FD’s notification interface is bound. Both LE and RPC have a start client
interface which is bound to FD’s request server interface.
The LE/LE composite component exports LE’s leader client interface. The
RPC/RPC composite component exports RPC’s call server interface and return
client interface.

Figure 7.8: Example Fractal architecture.
Notice that Kompics components communicate by event-passing. The Fractal
equivalent of this type of component interaction is bindings with an asynchronous
invocation semantics. Hence, all bindings in Figure 7.8 are asynchronous invocation bindings.
7.5.2 Conceptual mapping of model entities
The Kompics concepts are: component, channel, event handler, the subscription of
an event handler to a channel and the publication of an event to a channel. Notice
that our purpose is to give Kompics a Fractal personality, i.e., to map the Kompics
concepts to Fractal concepts in order to make it possible for Kompics architectures
to be handled by tools that were designed to handle Fractal architectures. Therefore, in the following we are going to look only at the Fractal concepts that are
needed to represent Kompics components: component, (client/server, internal/external) interface, and (primitive) binding. For each Kompics concept we now discuss
the equivalent Fractal construction.
Component Fractal components interact through interfaces. Kompics components interact by passing events through channels. Typically, Kompics components
are parameterized by the input and output channels through which they interact
with their environment, by subscribing for events or publishing events, respectively. In order to reflect this component parameterization at the Fractal architecture level, we represent the input and output channel parameters of a Kompics
component as server and client interfaces of the corresponding Fractal component.
There is a direct mapping between a Kompics component and a Fractal component. Every Kompics component, regardless of being primitive or composite, will
have a composite Fractal counterpart. The Fractal composite reflects the Kompics
component’s input/output channel parameters, as server/client interfaces (both
external and internal). The Fractal primitive reflects the Kompics component’s
event handlers as server interfaces. The bindings between the Fractal composite’s internal client interfaces and the Fractal primitive’s server interfaces reflect
Kompics event handlers’ subscription to formal input channel parameters. This is
motivated by the need, on the one hand, to reflect the Kompics component parameterization
and, on the other hand, to encapsulate the information about the Kompics component's event
handlers while still reflecting their subscription to channel parameters as
Fractal bindings.
We reflect the Kompics component’s output channel parameters as client interfaces both of the Fractal primitive and Fractal composite component. We always
have a hardwired Fractal binding between the primitive’s client interface representing the output channel parameter and the internal interface of the composite
that corresponds to the external interface representing the same output channel
parameter.
Channel The channel is a concept specific to Kompics. We choose to represent
a Kompics channel as a Fractal component for two reasons. First, channels are
objects that can be created and destroyed in the context of a Kompics composite
component, thus they resemble subcomponents. Also, a subscription to a channel resembles a Fractal binding. Second, channels have a life-cycle management
interface similar to that of components.
A Kompics channel is represented at the Fractal architecture level as a primitive
component with five Fractal management interfaces and two functional interfaces.
The management interfaces are: BindingController, LifeCycleController, SuperController, NamingController and AttributeController. The functional interfaces
are: a publish server interface, to which the client interfaces of components representing output channel parameters are bound, and a handle client interface, which
is bound to the server interfaces of components representing input channel parameters. The handle interface is a collection interface so that it is possible to
bind it to multiple component input channel server interfaces.
Event handler We chose to represent the event handlers of a Kompics compo-
nent as server interfaces of the corresponding Fractal primitive. This makes them
visible at the Fractal architecture level, and their subscription to input channel
parameters is visible as Fractal bindings (from client interfaces internal to the
Fractal composite, representing input channel parameters to the server interfaces
of the Fractal primitive, representing the event handlers). At the same time the
handlers information is encapsulated in the corresponding Fractal composite.
Subscription Event handler subscriptions to input channel parameters are represented as Fractal bindings from the client interfaces internal to the corresponding Fractal composite (representing the input channel parameters) to the server
interfaces of the corresponding Fractal primitive (representing the event handlers).
Publication The publication of an event in a channel is the equivalent of an
invocation over a Fractal binding. More concretely, in our mapping, the publication of an event into an output channel (including the delivery of the event to the
component that has the same channel as an input parameter), is represented by
an invocation on the following Fractal binding path: from the client interface of
the Fractal primitive (representing the output channel parameter) to the internal
server interface of the Fractal composite (representing the output channel parameter), from the external client interface of the Fractal composite (representing the
output channel parameter) to the channel’s publish server interface, and from
the channel’s handle client interface to the external server interface of the Fractal
composite (representing the input channel parameter).
A simple example
Let us now clarify the preceding discussion by means of an example. Consider the Kompics
primitive component depicted in Figure 7.9. It is a simple server component parameterized by an input and an output formal channel. It has one event handler
that is subscribed to the input channel, handles In events and publishes Out
events in the output channel. Figure 7.9 also shows the actual channels that are
used for the formal channel parameters.
Figure 7.9: Simple primitive component with two channel parameters in Kompics.
In Figure 7.10 you can observe the Fractal counterpart as a composite component. The input channel parameter is represented as an external (internal) server
(client) interface and the output channel is represented as an external (internal)
client (server) interface of the Fractal composite. The event handler of the Kompics
component is represented as a server interface of the Fractal primitive component
and its subscription to the formal input channel parameter is represented as a
Fractal binding from the internal interface of the Fractal composite to the server
interface of the Fractal primitive representing the event handler. We always have
a hardwired Fractal binding from the client interfaces of the Fractal primitive to
the internal server interface of the Fractal composite, both representing a Kompics
output channel parameter.
Figure 7.10: Simple Kompics primitive component with two channel parameters in Fractal.
7.5.3 Component sharing example revisited
Let us now look again at the example we introduced in Section 7.5.1. We have two
composite components, LE and RPC, sharing the FD component. We have seen
the Kompics architecture in Figure 7.7 and the Fractal architecture in Figure 7.8.
Figure 7.11 shows the mapped Fractal architecture of the Kompics components
according to our conceptual mapping. The details of the FD component are hidden.
Notice that the LE and RPC composite components are represented as Fractal
composites that contain a delegate primitive component whose server interfaces
represent the event handlers of the Kompics composite components.
7.5.4 Implementation aspects
Both Fractal and Kompics have implementations in the Java programming language. The next step after the conceptual mapping is to make Kompics compatible with Fractal at the Java implementation level. This is desirable because it
makes a Kompics architecture (in Java) introspectable and
manipulable by Fractal tools, which normally operate on Fractal architectures
through the (Java) API that Fractal components implement.
The goal of our exercise is to give Kompics a Fractal personality, without
changing the programming style of Kompics components. This allows giving a
Fractal personality to already existing Kompics components while preserving backward
compatibility.
We had two choices in making the Kompics Java implementation Fractal compatible. One choice was to build a wrapper around the core of a Kompics component
which would implement the Fractal Java APIs, and thus make Kompics components able to be handled by Fractal tools. At the same time, these wrappers would
form a hierarchy parallel to the actual Kompics component hierarchy. We believe
that implementing management operations that change the component hierarchy
would be very difficult since they would have to keep the two parallel hierarchies
in sync, while being invoked from either side.

Figure 7.11: Fractalized example Kompics architecture.
The second choice was to directly make Kompics components compatible with
Julia [4] components, the reference Java implementation for Fractal components.
We took this choice because it avoids the difficulty mentioned for the other
choice and, at the same time, it makes Kompics components inherit some of the
Fractal concepts that do not yet exist in Kompics. This means that the Kompics
component core and channel core are re-engineered as sets of Julia mixins.
Event-based management
A Kompics component’s management interface is event-based, in the form of the
component’s control channel. Internal component faults are reported to fault supervisors as fault events on the control channel. Also, management commands
are sent to the component through the control channel and they are handled by
management event handlers, either default or programmed by the component developer.
Because in Fractal management operations are invoked on the management
interfaces of a component, our Julia custom (for Kompics) implementation of the
various management interfaces consists of publishing corresponding Kompics management events in the Kompics component’s control channel.
Chapter 8
D2.1c: Component-based computation model
8.1 Executive summary
In deliverable D2.1b (see Chapter 7) we reported on the SELFMAN component-based architectural and computational model. In this deliverable we present a
Java prototype of the SELFMAN component-based computation model, the Kompics [9] framework and run-time system for specifying, composing, and executing
distributed protocols as reactive components.
KTH(P2) has worked on Kompics, a reactive component model that is compatible with Fractal [27] but provides a concrete execution and interaction model
for components, particularly aimed at components that implement distributed abstractions. Kompics components are reactive/event-driven, concurrent,
and readily exploit multi-core architectures. They are fault-tolerant and can form
flexible fault supervision hierarchies. Kompics components provide basic primitives for self-healing and self-configuration.
We have used the Kompics component framework to implement the Java version of the SELFMAN structured overlay network. This includes components
like: network, timer, virtual peer, bootstrap client and server, monitoring agent
and server, failure detector, ring based overlay, web server, web handler, etc. We
report that work in deliverable D1.4 (see Chapter 5).
An earlier release of the Kompics component framework was also successfully
used as a teaching tool in the Advanced Distributed Systems course (ID2203)
at KTH. The framework was used as support for student assignments that required the implementation of distributed abstractions as reactive components.
Distributed abstractions [66] implemented in Kompics include: perfect failure
detector, eventually perfect failure detector, eventual leader elector, best-effort
broadcast, reliable broadcast, uniform reliable broadcast, probabilistic broadcast,
multiple-writer atomic register, abortable consensus, and Paxos consensus.
8.2 Contractors contributing to the Deliverable
KTH(P2) has contributed to this deliverable.
KTH(P2) KTH has implemented and tested a prototype of the Kompics reactive component model, presented in D2.1b (see Chapter 7), as a Java library.
8.3 The Kompics component framework
The latest release of the Kompics component framework is publicly available at
http://kompics.sics.se. The release includes technical documentation, source
code, API documentation, the binary library and user guide.
KOMPICS — Reactive Component Model for Distributed Computing (KTH poster; its architecture diagrams correspond to Figures 7.4 and 7.5)

Context Decentralized dynamic distributed systems encompass core distributed protocols like failure detectors, reliable group communication, agreement protocols, etc. These are inherently reactive, concurrent, and present complex interactions, which makes them challenging to program and compose in complex hierarchical architectures.

Goals
» to make programming of complex distributed systems an easy and painless job
» to enable the implementation of distributed systems in a way that reflects their nature: concurrent activities, reactive behavior, complex interaction

Kompics components
» are reactive / event-driven
» are decoupled by a flexible publish-subscribe system
» are concurrent and readily exploit multi-core architectures
» can be composed out of encapsulated subcomponents; subcomponents can be shared between multiple composite components
» form dynamically reconfigurable architectures
» are fault tolerant and can form flexible fault supervision hierarchies

Contribution
» Kompics component model
» Java implementation
» methodologies and patterns: composition and sharing, dynamic reconfiguration
» distributed abstractions component library: communication abstractions, failure detectors, overlay networks, reliable group communication, gossip based systems

Documentation and source at http://kompics.sics.se/
Contact persons: Cosmin Arad ([email protected]), Seif Haridi ([email protected])
Distributed Computer Systems Group, Electronic, Computer and Software Systems, Information and Communication Technology, Computer Systems Laboratory, KTH
Chapter 9
D2.2b: Report on architectural framework tool support
9.1 Event-Condition-Action Rule-Based Service for Decision Making
Decision making mechanisms provide the tools (models, languages and runtime) for
implementing the reactive part of autonomic management policies defined by autonomic managers in their control loop. Each management policy is "distributed"
across the different functions of the architecture:
• in the monitoring function for extracting the relevant information. A part
of the filtering and aggregation task can be done in the monitoring feature.
• in the analysis function for providing the mechanisms that correlate and
model complex situations (for example, time-series forecasting and queuing
models). These mechanisms allow the autonomic manager to learn about
the IT environment and help predict future situations. The analysis function
evaluates the different conditions in order to update the global state of the
system. This state and its changes are used as inputs in the condition part
of the plan’s rules.
• in the plan function for providing the mechanisms that construct the actions needed to achieve goals and objectives. The plan function applies the
adaptation policy and fires the rules acting on the system. Depending on the
complexity of the operation and the number of steps, actions can be organized
in a workflow process. In this case, the plan rules trigger an action part that
creates an instance of a process. The different interactions at each task can
be handled by other rules instead of a human.
• in the execute function by providing the mechanisms that control the execution of a plan.
An approach which has been investigated in Selfman for decision making and
more globally for reactive capabilities in component-based systems is based on
active rules (Event Condition Action or ECA rules).
Objective Reactive behaviour, i.e., the ability to act/react automatically and take
corrective actions in response to the occurrence of situations of interest (events),
is a key feature in autonomic computing. This reactive behaviour is typically
incorporated by active rules (Event-Condition-Action or ECA rules), a mechanism widely used in active database systems to provide a reactive behaviour (an
elaborated form of the triggers found in most commercial DBMS). Active rules in
ADBMS are used for the implementation of integrity constraints, derived data,
update propagation, default values, versions and schema evolution management,
authorisations, etc.
The approach followed here consists in defining a mechanism for the integration of active rules in component-based systems to augment them with autonomic
properties. The fundamental idea is to "extract" the reactive functionality
of active database systems, and to "adapt" and "inject" it into component-based
systems so as to provide them with autonomic capabilities.
Rationale Active rules in database systems have been extensively studied but
cannot be directly applied to component-based systems. Three main points deserve
special attention in this respect:
• the definition of an active rule definition model suitable for component-based
distributed systems. In active database systems, events, conditions and
actions are essentially related to data manipulation through (SQL) query
statements - while in component-based autonomic systems, events, conditions and actions are essentially related to interactions between components
(operation invocations on component interfaces),
• the definition of a rule execution model suitable for component-based distributed systems. In active database systems, rules are triggered by events
generated in the context of a transaction, conditions are evaluated and actions executed in the context of a transaction as well (the three steps in
one single transaction or in concurrent transactions). All dimensions/parameters of rule execution in active database systems are also based on the
presence of transactions in the ADBMS, which represent a natural and convenient
execution unit. Transactions are generally absent in autonomic component-based systems. Active rule execution models in ADBMS have to be revisited
for component-based systems.
• the architectural integration of rules in component-based distributed systems. In active database systems, rules are generally represented and manipulated as any other data: typically relations (tables) in relational DBMS
or objects in object-oriented DBMS. Their scope is global to a database
schema (a set of relations in relational DBMS, a set of persistent classes in
object DBMS). In component-based systems, the nature (e.g. implicit rules
implemented as part of a component platform or rules as components) and
the scope (a rule attached to one component or rules with broader scopes)
of rules have to be stated.
Also important, the extensive study of active rules in database systems has shown
that one semantics (specified by an execution model) does not match all applicative
needs. On the contrary, what is needed are flexible execution models that allow
programmers to adapt the rule execution semantics to their specific needs. An
overall objective is then to come up with an adaptable architecture that would
support flexible rule execution models.
Reference models and architecture We draw here the big picture of the
ECA rule mechanism:
Definition Model The rule definition model specifies the form (format) of events,
conditions and actions. The considered events are applicative events, generated
by operation invocations on component interfaces and accesses to component attributes; structural events, related to changes in the topology of the
considered target system (additions, removals, replacements of components
and bindings between components); and system events, typically generated
by the underlying JVM and OS. Applicative and structural events will
typically be detected and notified by interceptors. System events will typically
come from monitoring systems such as JMX, WildCat, Lewys/CLIF, Fractal
JMX, etc. in the Fractal context.1 Conditions relate to the states of the considered system, typically known through FPath queries on component attributes
and system structure (and possibly behavior). Actions range from simple
component attribute settings or external notifications (e-mail, SMS) to
complex (possibly transactional) reconfigurations (typically expressed with
FScript).
Execution Model The basis of the execution model for component-based systems is the "execution unit" delimited by the interval between the reception
of an operation invocation on a server interface and the emission of a response onto a client interface. Applicative events (generated by operation
invocations) and structural events (addition and removal of components and bindings) are thus decomposed into two signals, begin and end. Other forms of
events (e.g. system events) can be integrated in the model by considering that
their begin and end signals are merged (i.e. both represent the same
execution point or point in time). The execution model will also typically
define event processing modes (instance-oriented or set-oriented triggering
of rules) and coupling modes (execution in an immediate, delayed or deferred
mode, in the same or a separate thread of execution).

1 Cf. the OW2 open source middleware consortium (http://www.ow2.org/view/Activities/Projects) for information about the WildCat, CLIF, and Lewys projects.
Architectural Integration The reference architecture of the ECA mechanism
is based on the concept of management domain. A domain is a set of entities
to which a common policy is applied. A domain embodies a unit of composition and a unit of control. The reference architecture is a hierarchy of nested
domains implemented as components: a policy component encapsulates a
set of rule components and provides them with an execution strategy in case
of multiple or cascading rules; a rule component encapsulates an event component, a condition component and an action component and provides them
with a local execution strategy (event processing mode, coupling modes).
Event, condition and action components encapsulate sets of applicative components, which embody the scope of event detections, condition evaluations
and action executions.
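A minimal Java sketch of this rule/policy decomposition is given below; the interfaces and names are an illustration of the design under assumed names, not the actual SELFMAN rule service API.

    import java.util.List;

    // Sketch of the ECA decomposition into components (assumed names).
    interface Condition { boolean evaluate(Object event); } // e.g. an FPath query
    interface Action    { void execute(Object event); }     // e.g. an FScript reconfiguration

    // A rule component encapsulates a condition component and an action component.
    class Rule {
        private final Condition condition;
        private final Action action;

        Rule(Condition condition, Action action) {
            this.condition = condition;
            this.action = action;
        }

        void fire(Object event) {
            if (condition.evaluate(event)) {
                action.execute(event);
            }
        }
    }

    // A policy component encapsulates a set of rules and an execution strategy;
    // here the strategy is simply to fire the rules in registration order.
    class Policy {
        private final List<Rule> rules;

        Policy(List<Rule> rules) {
            this.rules = rules;
        }

        void onEvent(Object event) {
            for (Rule rule : rules) {
                rule.fire(event);
            }
        }
    }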
Summary The rule service described here is proposed as part of the Selfman architectural framework. It proposes an active rule model, i.e. a rule definition
model and a rule execution model, that can be coherently integrated into a component model (Fractal in the context of this work); and a graceful architecture for
the integration of active rules into component-based systems in which the rules
as well as their semantics (execution model, behaviour) are represented as components, which permits (i) constructing personalized rule-based systems and (ii)
dynamically modifying the rules and their semantics in the same manner as the underlying component-based system, by means of configuration and reconfiguration.
These foundations form the basis of a framework/toolkit which can be seen as a
library of components to construct events, conditions, actions, rules and policies
(and their execution sub-components). The framework is extensible: additional
components can be added at will to the library to render more elaborate and more
specific semantics according to specific applicative requirements.
9.2 Composite Probes: an Architectural Framework for Hierarchical Monitoring Data Aggregation
Autonomic control loops - as considered in Selfman and more generally in autonomic computing, e.g. in the reference architecture by IBM - link autonomic
elements and autonomic managers and need a monitoring service in charge of getting data from sensors associated with the Managed Elements, and of making these
data available for the decision function (typically an Autonomic Manager).
These data typically describe the dynamic state of a managed element rather
than its static constitution (e.g. for a computer, number of processors or memory
size). However, changes may occur even to something that would look like a ”static
constitution”. For instance, some advanced computers may have a varying number
of processors and memory size. Such changes may be of interest for the autonomic
management features, and shall be taken into account by the monitoring service.
There are actually two kinds of data:
• plain measures of resource consumption (e.g. CPU time, free memory,
database connection pool usage, request queue size in an arbitrary middleware...);
• alarms that notify the occurrence of an event that is not necessarily measurable (e.g. a garbage collector occurrence in a JVM, a node failure, etc.).
The monitoring service relies on components that observe a given resource,
namely probes. In the following, we describe the architectural description of these
probes in two steps:
• basic probes that provide the monitoring service;
• an extension of the basic probes to introduce probe sharing and aggregation
through a composite probe architecture.
It should be noted here that the work on basic probes was well initiated prior to the
Selfman project. The work done in the context of Selfman concerns the development of composite probes. We discuss basic probes here for completeness
(basic probes have to be known before introducing composite probes).
9.2.1 Probe components
Basically, a probe is a component with an autonomous activity for observing and
getting measures from the resource it observes. This activity is controlled by a
given lifecycle:
• A probe component is first instantiated in the 'deployed' state.
• It is then typically 'initialized' and 'started', and then possibly 'suspended'
and 'resumed'.
• The end of activity is depicted as a pseudo state that actually represents
three states:
– 'aborted' means something went wrong and the probe could not achieve
what it was supposed to do (i.e. either its computation or a lifecycle
transition request);
– ’completed’ means the probe normally terminated its activity;
– ’stopped’ means that the blade did not reach the end of its activity,
but simply conformed to the stop lifecycle transition request.
Once the end of activity has been reached, the blade activity may be rerun
after an initialization step. Suspend and resume requests may be useful when
some faults have been detected or some reconfiguration is under way, in order
to avoid getting meaningless measures and possibly bursty generations of alarms.
Suspending a probe is also a way to check the disturbance caused by its activity.
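The lifecycle just described can be summarized by the following Java sketch of the probe states and a plausible set of allowed transitions; the enum and method names are ours, and the exact transition set is our reading of the description above, not the CLIF API.

    import java.util.EnumMap;
    import java.util.EnumSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the probe/blade lifecycle (illustrative names and transitions).
    enum ProbeState { DEPLOYED, INITIALIZED, RUNNING, SUSPENDED, ABORTED, COMPLETED, STOPPED }

    class ProbeLifecycle {
        private static final Map<ProbeState, Set<ProbeState>> ALLOWED =
                new EnumMap<>(ProbeState.class);
        static {
            ALLOWED.put(ProbeState.DEPLOYED,    EnumSet.of(ProbeState.INITIALIZED));
            ALLOWED.put(ProbeState.INITIALIZED, EnumSet.of(ProbeState.RUNNING));
            ALLOWED.put(ProbeState.RUNNING,     EnumSet.of(ProbeState.SUSPENDED,
                    ProbeState.ABORTED, ProbeState.COMPLETED, ProbeState.STOPPED));
            ALLOWED.put(ProbeState.SUSPENDED,   EnumSet.of(ProbeState.RUNNING,
                    ProbeState.ABORTED, ProbeState.STOPPED));
            // End-of-activity states: the activity may be rerun after re-initialization.
            ALLOWED.put(ProbeState.ABORTED,     EnumSet.of(ProbeState.INITIALIZED));
            ALLOWED.put(ProbeState.COMPLETED,   EnumSet.of(ProbeState.INITIALIZED));
            ALLOWED.put(ProbeState.STOPPED,     EnumSet.of(ProbeState.INITIALIZED));
        }

        private ProbeState state = ProbeState.DEPLOYED;

        synchronized void transitionTo(ProbeState next) {
            if (!ALLOWED.getOrDefault(state, EnumSet.noneOf(ProbeState.class)).contains(next)) {
                throw new IllegalStateException(state + " -> " + next + " not allowed");
            }
            state = next;
        }
    }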
We now go into the details of the basic probe component architecture using the
Fractal model. This architecture comes from the CLIF load testing framework’s
so-called blade architecture, hence the frequent use of ”blade” in the terminology.
The probe component type consists of three mandatory server interfaces (namely
DataCollectorAdministration, StorageProxyAdministration and BladeControl) and
one mandatory client interface (SupervisorInformation).
Interfaces BladeControl and SupervisorInformation are tightly coupled because
most probe activity control operations (init, start, stop, suspend...) are asynchronous: the call returns as soon as the operation processing starts. Once the
operation is terminated, a call-back operation from interface SupervisorInformation is used to inform the supervisor component about the actual probe state. The
reason for asynchronous operations in probe activity control is that we consider
scalability issues. A typical usage of activity control operations is to simultaneously initialize, start, suspend, etc. a whole set of probes. We could implement
asynchrony at the supervisor’s side, simply by using parallel threads calling activity control operations and waiting for operation return. But, first, this could
introduce a possibly high overload on the supervisor in large-scale
systems (hundreds of probes or more). Second, we still need a call-back operation
to give feedback about the probe state at least for states aborted and completed.
As a result, we’d rather introduce asynchronous operations and a unified way of
providing the supervisor with feedback information about probe states. Finally,
interface SupervisorInformation provides an operation to notify arbitrary alarm
events to the Supervisor. Interface BladeControl offers two extra operations, respectively to consult and modify specific properties. These properties include the
activation or deactivation of the memory of the various events (measures, lifecycle,
alarms) it generates.
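The asynchronous control style can be sketched as follows: a control operation returns as soon as its processing starts, and the resulting state change is reported later through a callback on the SupervisorInformation interface. The method names below are assumptions made for illustration, since the exact signatures are not listed here.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of asynchronous probe control with a supervisor callback
    // (method names are assumed for illustration, not the actual interfaces).
    interface SupervisorInformation {
        void stateChanged(String probeId, String newState); // e.g. "running", "aborted"
        void alarm(String probeId, Object alarmEvent);
    }

    class BladeControlSketch {
        private final String probeId;
        private final SupervisorInformation supervisor;
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        BladeControlSketch(String probeId, SupervisorInformation supervisor) {
            this.probeId = probeId;
            this.supervisor = supervisor;
        }

        // Returns as soon as the start processing begins; the actual outcome is
        // reported back asynchronously through the SupervisorInformation callback.
        void start() {
            executor.submit(() -> {
                try {
                    // ... begin observing the resource ...
                    supervisor.stateChanged(probeId, "running");
                } catch (Exception e) {
                    supervisor.stateChanged(probeId, "aborted");
                }
            });
        }
    }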
Interface DataCollectorAdministration provides statistical data about the probe
- typically about the measures obtained from the resource it observes. These data
are represented as an array of integer values. It may look like an arbitrary limitation not to be able to deliver other data types, but it is actually a pragmatic choice
that is directly inspired by the LeWYS project. This choice seems particularly relevant to monitor such things like CPU usage percentage, free memory, average
throughputs and response times, etc. In a general way, we consider that for needs
other than numerical monitoring of resource usage, alarms are a good way of notifying probe events holding data of arbitrary type. For instance, a node failure would
be typically notified through an alarm. Interface StorageProxyAdministration is
bound to the storage proxy role played by the probe, to enable possible buffering
and final collection of probe events. This interface provides methods to possibly
allocate a buffer for a new run, and to collect events.
9.2.2 Composite probes
The basic probes described above are primarily designed to be used in a single level,
as a flat layer: each probe is managed one by one and monitors one resource. For
both scalability and convenience reasons, it appears useful to be able to compose
these basic probes into composite probes whose data do not come from a resource
observation, but from a set of other probes, whether they are basic or composite.
Here, the idea is to take advantage of the Fractal model's support for component
hierarchy and sharing to be able to:
• obtain as many measures as possible from a minimal set of basic probes,
with an adaptable level of details and different aggregated values;
• transparently manage a whole hierarchy of probes through a single composite
probe.
Let’s take as an illustrative example the use case of monitoring the system
load of a clustered computing system (details and figures in the article in the Appendix).
For each cluster node, a basic probe is necessary to observe the CPU load and
the memory usage. Other system resources could be added to this use case: network bandwidth, disk transfer rate, etc. Now, getting all the measures from all
these basic probes, as well as managing all these probes, is quite cumbersome.
Conversely, composite probes enable getting a global system load indicator for the
cluster obtained through a single probe that transparently handles control operations for the underlying sub-probes. Then, the global system load probe may be
based on individual system load probes that aggregate measures from basic probes
(CPU, memory). Finally, component/probe sharing enables an arbitrary number
of different aggregations, such as the global cluster CPU load indicator provided
by the clusterCPU composite probe.
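The aggregation idea can be sketched as follows, with a composite probe exposing the same integer-array measures as a basic probe but computing them from its sub-probes, and forwarding control operations to the whole sub-hierarchy; the interface below is our own simplification, not the actual probe interfaces.

    import java.util.List;

    // Simplified sketch of probe aggregation (our own simplification).
    interface Probe {
        long[] getMeasures();          // e.g. [cpuLoadPercent, freeMemoryMB]
        void control(String request);  // init, start, suspend, ... forwarded by composites
    }

    class CompositeProbe implements Probe {
        private final List<Probe> subProbes;

        CompositeProbe(List<Probe> subProbes) {
            this.subProbes = subProbes;
        }

        // Aggregate: here, the average of each measure over all sub-probes,
        // e.g. a global cluster CPU load indicator.
        @Override
        public long[] getMeasures() {
            if (subProbes.isEmpty()) {
                return new long[0];
            }
            long[] sum = new long[subProbes.get(0).getMeasures().length];
            for (Probe p : subProbes) {
                long[] m = p.getMeasures();
                for (int i = 0; i < sum.length; i++) {
                    sum[i] += m[i];
                }
            }
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= subProbes.size();
            }
            return sum;
        }

        // A single control operation transparently manages the whole probe hierarchy.
        @Override
        public void control(String request) {
            for (Probe p : subProbes) {
                p.control(request);
            }
        }
    }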
9.3 MyP2PWorld: The Case for Application-level Network Emulation of P2P Systems
9.3.1 Introduction
Reflecting on previous research in Peer-To-Peer systems, one can find that the
majority of the research work passes through a number of typical stages where
every stage has its associated tools for reasoning and evaluation. In the algorithm
design stage, formal/semi-formal reasoning is used to prove aspects like liveness
and safety properties. In the second stage, where a system design is outlined, the
goal is to understand the effect of different parameters on the performance of the
system. Simulation is used intensively and is probably the most dominant tool;
it can be found in virtually every paper proposing a new system. At this
stage, analytical modeling is also a frequently used tool. Examples include fluid
models for BitTorrent[158] and Chord[89]. At the final stage, where the system is
implemented, global-scale testbeds like PlanetLab are used as well as emulators like
ModelNet[140] or NCTUns[152].
Stage                      Tools
1. Algorithm Design        Formal/Semi-Formal Reasoning
2. Performance Analysis    Numerical Simulation, Analytical Modelling
3. Implementation          Testbeds (e.g. PlanetLab), Emulation (e.g. ModelNet)

Table 9.1: Summary of the tools needed for reasoning, evaluation or testing in the different stages of designing large-scale distributed systems.
In this work, we report our experience with P2P systems testing while developing a P2P solution for streaming live video events at Peerialism Inc.[2]. We mainly argue that at late development stages, i.e. the implementation stage, testbeds and emulators are not sufficient for testing and evaluation needs. We describe a tool that we developed, entitled “MyP2PWorld”. In the next section, we elaborate on why we needed to come up with yet another testing tool. Afterwards, we present the architecture of MyP2PWorld and how it was used in our projects, and finally we discuss its current limitations and our future plans for it.
9.3.2 Motivation
We faced a number of problems. The first was the discrepancy between the simulated protocols and their implementation in the production code. The situation in an industrial context is further complicated by the fact that the people who design and simulate the protocol (researchers) are different from those who deliver the production-quality software (developers). The main issue, while scientifically unprovable but anecdotally evident, is that when one designs a protocol and specifies it for others to implement, some intuitive or trial-and-error design decisions remain implicit. When the specification is handed to another person, the question “Why do we not do it the other way?” always comes up, and there is no fast way to answer it, especially when it comes to non-obvious second-order effects.
The second issue was debugging and reproducibility. We have used PlanetLab extensively and it has been very useful, but debugging and reproducing problems on it proved difficult. We started by implementing an environment for collecting log files from all nodes, which helped, but we were still frequently faced with problems that we could not reproduce, or with logs that were not sufficiently verbose to reveal the problem.
The third issue was the testing environment. With PlanetLab, we had the practical issue that the environment was not available to everybody at all times, and there was the coordination overhead of sharing slices. ModelNet and NCTUns also suffer from this problem because of the many OS-level customizations that need to be done before one can start using them; in fact, on PlanetLab things are much simpler because of the central administration.
Therefore, the needed requirements for a testing environment were:
• Testing is done on the production code base.
• Easy to deploy on every development and testing machine.
• Provides total reproducibility.
• Can be used for automated integration testing, not only unit testing. By that we mean automated testing of particular sections of the protocol implementations, in contrast to non-reproducible complete-system runs in testbeds or emulators.
• Allows debugging with a debugger, rather than relying solely on log files.
We tried to achieve all of the above while sacrificing one property: the real code has to be modified so that it can run interchangeably in emulation and real modes. Nevertheless, we tried to keep these modifications as transparent as possible.
Desirable Property      Simulation   TestBed   Emulation   MyP2PWorld
Production Code Base    No           Yes       Yes         Yes
Ease of deployment      High         Medium    Low         High
Reproducibility         Yes          No        No          Yes
Automated testing       N/A          No        No          Yes
Modified App. Code      N/A          No        No          Yes

Table 9.2: Comparison of MyP2PWorld against other testing tools.
9.3.3 What MyP2PWorld is Not
We have to state explicitly that the point of MyP2PWorld is not to replace other tools, but rather to complement them by providing an additional tool in the toolbox of P2P systems testing. It is another point in the design space of such tools, one that aims at retaining the reproducibility property of simulators while working on the production code like the testbeds and the emulators.
9.3.4 Related Work
The work most similar to ours is EmuSockets[12], in the sense that it advocates application-level emulation of the network and that it specifies a congestion model for TCP links. However, we complement the above with local concurrency emulation to achieve exact reproducibility, a property not attainable with EmuSockets. [12] also references a number of application-level emulators, none of which shares our focus on reproducibility.
9.3.5 System Architecture Overview
MyP2PWorld is organized into four layers:
Discrete-Event Simulation (DES) Layer: Provides simulated time and the network model; it is not visible to the real application.
Emulation Layer: Provides to the real application an interface that looks like the real network/OS services, but whose calls get routed to the DES instead of to the corresponding network/OS services.
Real Application Under Test: Multiple instances of the real application, modified to use the emulation (glue) layer.
Scenario Management Layer: The main execution entry point. It takes as input a scenario file and configures all layers, e.g. forking and killing instances of the peers at specified times, configuring network behavior, etc.
9.3.6 Discrete-Event Simulation (DES) Layer
This layer could be (and in fact has been) used on its own as a traditional simulator. Every simulated node has a unique identifier and has access to a timer abstraction with which it can schedule events in the future, e.g. to time out on an event or perform a periodic activity. For the network model, we provide random delays between nodes; we do not currently model a physical network topology. For bandwidth, we support reliable FIFO links where data is always transmitted at the maximum rate possible on a given link. Our work has mainly been inspired by BitTorrent simulators such as [20] and [156]; however, we have focused on providing a well-specified model with an efficient implementation. While we will not delve into the details of the DES, we describe our bandwidth model in more detail below.
Bandwidth Model
We assume that, given a sender S and a receiver R, the sender sends blocks of data
that are substantially bigger than an IP packet. Once the sender starts sending a
block, the network should try to send the block at the maximum possible speed
between the two parties. While the block is in transit, we say that S and R have an ongoing “transfer”. Naturally, the transfer of a certain block is affected by other transfers taking place between S or R and any third party. The main quantities needed for the description of the model are:
βS sender’s maximum bandwidth
βR receiver’s maximum bandwidth
αS sender’s available (free) bandwidth
αR receiver's available (free) bandwidth
τS set of sender’s ongoing transfers
τR set of receiver’s ongoing transfers
We sometimes use the above symbols without specifying sender/receiver side to
mean that the argument is interchangeably used for either side.
Bandwidth allocation    Each time a block is sent, i.e. a transfer is started, the amount of bandwidth given to the new transfer is equal to:

    t = min( max(αS, βS / (|τS| + 1)), max(αR, βR / (|τR| + 1)) )        (9.1)
If α > t, then a bandwidth of α is reserved for the new connection and the algorithm halts. Otherwise, an amount π = t − α will be deducted collectively from the transfers in τ according to the rules described in the next subsection.
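As a small illustration of the allocation rule, the following Java sketch computes the rate t of equation (9.1); the Endpoint bookkeeping class is an assumption made for this example and is not part of MyP2PWorld's actual implementation.

// Hypothetical sketch of the initial bandwidth allocation of equation (9.1).
class Endpoint {
    double maxBandwidth;        // beta: the endpoint's maximum bandwidth
    double freeBandwidth;       // alpha: the endpoint's available (free) bandwidth
    int ongoingTransfers;       // |tau|: number of ongoing transfers at this endpoint
}

class BandwidthModel {
    // Rate granted to a new transfer between sender s and receiver r.
    static double allocate(Endpoint s, Endpoint r) {
        double senderSide   = Math.max(s.freeBandwidth, s.maxBandwidth / (s.ongoingTransfers + 1));
        double receiverSide = Math.max(r.freeBandwidth, r.maxBandwidth / (r.ongoingTransfers + 1));
        return Math.min(senderSide, receiverSide);
    }
}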
Deduction algorithm
A transfer gets a deduction only if it is using more than its fair share (its bandwidth > fs), where the fair share is fs = β / (|τ| + 1). Note that fs = t for at least one side, but this might not hold for the other side (t ≤ fs). Let τ' be the set of transfers with bandwidth > fs. We never cut a transfer below its fair share. We cut a total of π from the transfers in τ', each according to its bandwidth.
Deduction distribution    For transfer i, let Xi be that transfer's bandwidth and πi = Xi − fs, and let X'i be the new bandwidth after deduction:

    X'i = Xi − π · p(i),   where p(i) = πi / Σj πj

Note that Σi πi ≥ π when t > α. If t = α then Σi πi = π, but in that case we do not need to deduct.
After allocation, other nodes might have unused bandwidth due to the deduction process; in that case these connections are boosted (see bandwidth deallocation).
performDeduction(amountToReclaim) begin
    τ' ← { t ∈ τ such that t.bandwidth > β / (|τ| + 1) } ;
    µ ← Σ_{t∈τ'} t.bandwidth ;
    for t ∈ τ' do
        t.bandwidth -= t.bandwidth / µ * amountToReclaim ;
    end
end
Algorithm 1: Bandwidth: Allocation Deduction
Bandwidth deallocation
As in bandwidth allocation, each connection gets additional bandwidth from the free bandwidth in proportion to its current bandwidth. We define a 'loose' transfer as a transfer for which the nodes at both ends have available bandwidth, and we define its loose value as the minimum of the two.
boostTransfers() begin
    τ' ← { t ∈ τ such that t.loose = true } ;
    µ ← Σ_{t∈τ'} t.bandwidth ;
    for t ∈ τ' do
        t.bandwidth += t.bandwidth / µ * t.getLooseValue() ;
    end
end
Algorithm 2: Bandwidth: Deallocation Boost
This process can take several iterations to converge to full bandwidth utilization, and bandwidth fragmentation frequently occurs. However, accepting a threshold of 2% of unutilized bandwidth usually results in quick convergence.
9.3.7 Emulation Layer
This layer is actually the core layer of MyP2PWorld. We can say that it provides
three core functionalities:
Network Services
The point of this part is to make all network communication code exactly the same whether the system is running in real mode or in emulation mode. In the real application, we have been using the Apache MINA framework [52], a modular framework on top of Java's non-blocking I/O libraries that has many advantages
such as filter chains and decoupling of marshaling formats from communication
logic among other things.
Listing 9.1: MINA TCP server showing the minimal changes that enable switching between real and emulated modes (the emulated-mode alternatives are shown as comments)
import org.apache.mina.common.IoAcceptor;
import org.apache.mina.transport.socket.nio.SocketAcceptor;     // real mode
// import com.peerialism.simpipe.SimPipeAcceptor;                // emulated mode
import java.net.InetSocketAddress;
import java.net.SocketAddress;
....
SocketAddress serverAddress = new InetSocketAddress("localhost", 1234);
IoAcceptor acceptor = new SocketAcceptor();                      // real mode
// IoAcceptor acceptor = new SimPipeAcceptor();                   // emulated mode
acceptor.bind(serverAddress, new IoHandlerAdapter(){
    public void messageReceived(IoSession session, Object message){...}
    public void messageSent(IoSession session, Object message){...}
    public void sessionClosed(IoSession session){...}
    public void sessionCreated(IoSession session) {...}
    public void sessionIdle(IoSession session, IdleStatus status){...}
    ...
});
Time & Concurrency Services
Given that exact reproducibility is one of our main goals, we have to make sure that we have total control over how concurrent events get scheduled. For a given experiment, we want to be able to run it many times with exactly the same sequence of events happening in exactly the same way every single time. Concurrency occurs on two levels. The first level is between different nodes that are running concurrently. In real mode, nodes naturally run either on different machines or, if required, in separate OS processes on the same machine. In an emulation with the level of reproducibility that we target, having a separate process per node would violate reproducibility; therefore, we run all nodes in one OS process. The consequences of that are discussed in the next section. The second level of concurrency is between the different threads running inside one node. In real mode, nodes typically use multiple threads for network/local I/O, timeouts, periodic activities, etc. In emulated mode, however, we cannot keep these multiple threads, which would be scheduled by the OS in a non-reproducible fashion. Therefore, we also had to find a way of running all these activities in one thread, which we explain in this section.
In general, the DES layer gives us concurrency by means of events being scheduled at discrete instants on the simulated time scale. Many events can happen at one such instant, on the same or on different application nodes. To get rid of the threads, our solution was to make sure that the application architecture avoided blocking threads and instead depended on an event-based architecture to provide concurrency. Thus, event scheduling and handling is managed by a thread pool in real mode, and by the DES scheduler in emulated mode.
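The following sketch illustrates this idea of hiding the scheduling mechanism behind a single interface, with a thread-pool backend for real mode and a single-threaded, DES-driven backend for emulated mode. The interface and class names are ours and only illustrate the design; they are not MyP2PWorld's real classes.

// Hypothetical sketch: one event-dispatch abstraction, two backends.
import java.util.ArrayDeque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface EventExecutor {
    void execute(Runnable event);
}

// Real mode: events are dispatched by a thread pool.
class ThreadPoolBackend implements EventExecutor {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    public void execute(Runnable event) { pool.submit(event); }
}

// Emulated mode: events are queued and run deterministically, in order,
// on the single thread driven by the DES scheduler.
class DesBackend implements EventExecutor {
    private final ArrayDeque<Runnable> queue = new ArrayDeque<Runnable>();
    public void execute(Runnable event) { queue.add(event); }
    void runAllScheduledEvents() {
        Runnable e;
        while ((e = queue.poll()) != null) e.run();
    }
}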
In our case, Apache MINA already provided non-blocking I/O and an event-based model, so we wrote emulation hooks to let the events be scheduled by the DES layer, while keeping the code changes at the application layer minimal, as explained in the previous section. However, all periodic activities and timeouts were based on blocking threads. Therefore, we had to refactor the code to make it
event-based, using Java's scheduled futures. The end result was that we found a way of writing code for scheduling concurrent activities where the real-mode and emulated-mode code are minimally different.
It is also important to mention here that, in real mode, threads use real time for specifying timeouts and the frequencies of periodic events. In emulated mode, we also had to override the library calls for getting the system time. We configured the DES layer such that the smallest unit of simulated time models one millisecond.
import java.util.concurrent.ScheduledFuture;               // real mode
// import com.peerialism.ScheduledFuture;                   // emulated mode
import java.util.concurrent.ScheduledThreadPoolExecutor;   // real mode
// import com.peerialism.ScheduledExecutor;                 // emulated mode
import java.util.concurrent.TimeUnit;
...
class SomePeriodicAction implements Runnable {
    public void run() {
        // Action
    }
}
....
SomePeriodicAction action = new SomePeriodicAction();
ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);   // real mode
// ScheduledExecutor executor = new ScheduledExecutor();                      // emulated mode
ScheduledFuture<?> future =
    executor.scheduleAtFixedRate(action, 0, 2000, TimeUnit.MILLISECONDS);
Context Services
The third set of changes that made our approach feasible was the management of many nodes inside one OS process, namely one Java virtual machine. The main problem is that our application (like most other P2P applications) was not designed for many nodes to run in the same OS process. Global data structures like singletons and loggers are examples of major issues in this category. For that, we had to introduce to the DES layer the concept of a “context”: when a node is created, it has to request from the DES layer the creation of a context labeled by the node's unique id. When the time comes for an event to be fired, the scheduler switches to the context of the executing node, and we expose to the application the service of querying the emulation layer about the current context. Using the context services, the singletons and loggers of all nodes were able to coexist in the same OS process, as described below.
A singleton, in real mode, stores one instance of an object. In emulated mode, singletons were made to store sets of objects indexed by context ids: every time a singleton is requested to return an instance, it asks the scheduler in which context it is running and returns the corresponding instance. This is one place where we really could not find a directly transparent way to make the real-mode and emulated-mode code look identical. However, we plan to improve that in the next major release using a component framework.
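A minimal sketch of this context-indexed singleton pattern is shown below; Scheduler.currentContextId() stands in for the emulation layer's context query and is an assumption, not the real API.

// Hypothetical context-aware singleton: one instance per emulated node.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeSingleton {
    private static final Map<String, NodeSingleton> instances =
            new ConcurrentHashMap<String, NodeSingleton>();

    static NodeSingleton getInstance() {
        // Ask the emulation layer which node (context) is currently executing.
        String contextId = Scheduler.currentContextId();
        NodeSingleton instance = instances.get(contextId);
        if (instance == null) {
            instance = new NodeSingleton();
            instances.put(contextId, instance);
        }
        return instance;
    }

    private NodeSingleton() { }
}

// Stand-in for the DES/emulation layer's context query.
class Scheduler {
    static String currentContextId() { return "node-0"; }   // placeholder
}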
For logging, we were using the slf4j[3] package, whose purpose is to let an application log through a simple interface, while the package is configured to bind all logging to a real logging backend. We used this feature to implement a context-aware binding for slf4j. That change was therefore totally transparent from the application's point of view.
Other minor issues like port numbers, file locations, etc. were solved using
configuration parameters.
Misc. Performance Issues
External Web Services    A minor issue was that each node in real mode depended on an external web service (called the “publisher”) that acted as a library for content-specific meta-information. To avoid any “real” networking while in emulated mode, we cached all the information we needed from the publisher and transparently provided a fake publisher that loads the meta-information from the local cache while in emulation mode.
Byte Buffers    All communication between nodes is actually accomplished by cloning byte buffers that contain the messages in transit. Creating lots of byte buffers and garbage-collecting them was degrading performance and increasing peak memory usage. Instead, we allocated pools of byte buffers that we reused.
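A minimal sketch of such a buffer pool follows; the sizes and the API are assumptions for illustration, not the actual MyP2PWorld code.

// Hypothetical fixed-size ByteBuffer pool to avoid repeated allocation and GC pressure.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

class ByteBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<ByteBuffer>();
    private final int bufferSize;

    ByteBufferPool(int bufferSize, int initialCount) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < initialCount; i++) free.push(ByteBuffer.allocate(bufferSize));
    }

    synchronized ByteBuffer acquire() {
        return free.isEmpty() ? ByteBuffer.allocate(bufferSize) : free.pop();
    }

    synchronized void release(ByteBuffer buf) {
        buf.clear();            // reset position and limit before reuse
        free.push(buf);
    }
}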
Multiple CPU Cores    To make use of multi-core processors without violating our reproducibility constraints, we made it possible for events that run in the same discrete time step on different nodes to run in multiple OS threads. The change needed for that was to provide a deterministically seeded random number generator per node instead of per simulation.
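A sketch of the per-node deterministic seeding (the names are illustrative, not MyP2PWorld's):

// Hypothetical per-node random number generators, derived deterministically from a
// single experiment seed so that runs stay reproducible even when same-time-step
// events execute on different OS threads.
import java.util.Random;

class NodeRandomFactory {
    private final long experimentSeed;

    NodeRandomFactory(long experimentSeed) { this.experimentSeed = experimentSeed; }

    Random forNode(int nodeId) {
        // The seed depends only on the experiment seed and the node id,
        // never on thread scheduling order.
        return new Random(experimentSeed * 31L + nodeId);
    }
}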
9.3.8 Scenario Management Layer
This is the top layer that binds everything together. It loads: i) configuration files containing DES configuration parameters, application parameters, etc.; ii) scenario files containing a particular setup of joins and failures; iii) binaries for the different types of nodes, in our case a tracker, a source and many clients.
9.3.9 Conclusion & Future Work
We presented a new tool for testing Peer-to-Peer systems at the implementation stage. We were mainly motivated by the lack of a testing tool that can exercise the production code base while providing exact reproducibility. The main difference between this tool and other application-level emulators is that we do not only simulate the network layer, but we also handle local concurrency: despite running many nodes on the same machine, a single OS thread is used to execute all nodes and their threads. The tool has been used in production environments to fix a substantial number of bugs that were extremely hard to catch on a testbed like PlanetLab.
The current status of MyP2PWorld is that we have started to understand how to make application-level emulation work. However, there are a number of things that we need to do before the tool is generic enough to be used in other applications. The first future task would be to reengineer the switching between real and emulation modes using a component framework, to be able to
make the emulation hooks as transparent as possible. The second task would be
to have a UDP bandwidth model like our TCP model.
Chapter 10
D2.2c: Architectural framework
– Components & Navigation
10.1 Executive summary
This deliverable defines the API of the Oz library that forms the core of the component and navigation framework detailed in Deliverable D4.1a on self-configuration (Chapter 15 of this book), and gives some developer instructions for its use. The API takes the form of Oz functions and procedures for creating, navigating and querying Fractal-like structures. It enables the construction of distributed, self-deployable and self-monitoring components, at least for mid-size (cluster-size) systems.
10.2 Contractors contributing to the deliverable
INRIA (P3) contributed to this deliverable.
10.3 Components and navigation API
We list here most primitives related to the component model and to the deployment process introduced in the FructOz framework, presented in Chapter 15 of this book. Deployment primitives are an integral part of the framework since they allow the construction of distributed (i.e. whose implementations span multiple machines), self-deployable and self-configurable components. The primitives are annotated with typing information in an ML-style notation. The first letter of a primitive usually indicates the type of data it mainly applies to.
10.3.1 Notations
Typing information uses the following notations:
C : a component (i.e. a Membrane)
I : an interface
B : a binding
S : a set
S{X} : set of elements of type X
N : a numeral (either an integer: Z or a float: R)
B : a boolean
10.3.2 Component model
The following primitives, sorted by the main entity they apply to, are related to the manipulation of component entities.
Components (i.e. Membranes)
CNew: unit → C
This operation creates a new empty component membrane.
CAddInterface, CRemoveInterface: C × I → unit
These operations add or remove, respectively, an interface to or from the
target component.
CGetInterfaces: C → S{I}
This operation retrieves the set of interfaces of the target component;
Interfaces
INew: (client|server) → I
This operation creates a new interface, client or server, not associated to
any component;
IImplements: I × (native Oz object) → unit
This operation defines the procedure or object to invoke to process messages
addressed to an interface.
IResolveSync, IResolveAsync: I → (message → unit)
These operations resolve an interface into a synchronous or asynchronous
proxy that may directly be invoked to send messages to the interface. An
asynchronous proxy just fires off a message without waiting for a response.
A synchronous proxy blocks the calling thread in wait of a response.
IGetComponent: I → C
This operation retrieves the component owning the target interface.
IGetBindingsFrom, IGetBindingsTo: I → S{B}
These operations get the set of bindings connected from (resp. to) the target
interface.
IIsClient, IIsServer: I → B
These operations test whether the target interface is a client (resp. server)
interface with respect to its owning component.
Bindings
BNew: I × I → B
This operation creates a binding between a client and a server interface;
BNewLazy: I × I → B
This operation creates a lazy binding between a client and a server interface:
the client interface is required to be instantiated to establish the binding,
while the server interface will remain lazy (unneeded), as long as no introspection occurs involving the interface and no functional usage of the binding
happens.
BBreak: B → unit
This operation breaks the target binding.
BGetClientInterface, BGetServerInterface: B → I
These operations retrieve the interface the given binding connects from (resp.
to).
Generic browseable entities
Components, interfaces and bindings are browseable entities, and are thus associated with a set of tags. Tags allow the identification of any entity. Tags are
manipulated with the following primitives:
Tag, Untag: entity × tag → unit
These operations apply or remove a tag to or from the given entity.
HasTag: entity × tag → B
This operation tests an entity (a component membrane, an interface or a
binding) for the given tag.
Controllers
The FructOz framework comprises a set of functions and procedures providing additional controller capabilities. In particular, one finds the equivalent of Fractal attribute controllers (for attaching arbitrary metadata to components) and content controllers (for manipulating the subcomponents of a composite component).
Attribute control primitives are given below.
CListAttributes: C → S{Name}
This operation returns the set of attribute names of the target component
(an attribute is essentially a pair Name#Value, where Value can be an arbitrary Oz value).
CHasAttribute: C × Name → B
This operation returns true if the target component has an attribute of the
indicated name.
CSetAttribute: C × Name × Value → unit
This operation sets the attribute of the target component designated by its
name to the indicated value.
CGetAttribute: C × Name → Value
This operation returns the value associated with an attribute of a component, designated by its name.
CRemoveAttribute: C × Name → unit
This operation removes the attribute from the set of attributes of the target
component.
Content control primitives are given below.
CListContentContexts:C → S{Context}
This operation retrieves the set of contexts associated with the target component. A context is essentially a (dynamic) set of sub-components of a
given component.
CAddSubComponent : C × C × Context → unit
This operation adds a subcomponent to the given context of the target
component.
CRemoveSubComponent : C × C × Context → unit
This operation removes a subcomponent from the given context of the target
component.
CGetSubComponents: C → S{C}
This operation retrieves the set of subcomponents of the target component.
CGetSuperComponents: C → S{C}
This operation retrieves the set of parent components of the target component.
10.3.3 Deployment primitives
The following primitives are related to the deployment process and to the control of the distributed environment.
Deploy: package → C
This operation deploys the target component package on the local host.
RemoteDeploy: package × (C : Host) → C
This operation deploys the target component package on the specified host
component.
NewCluster: unit → (C : Cluster)
This operation creates an empty cluster component.
NewRemoteHost: (C : Cluster) × (string : hostname) → (C : Host)
This operation creates a new host component on the given remote host
identified by its host name (string), and integrates the new host component into the given cluster component.
CloseHost: (C : Host) → unit
This operation removes the host component and shuts down the corresponding remote virtual machine.
CloseCluster: (C : Cluster) → unit
This operation shuts down all hosts contained in the target cluster.
10.3.4 Introspection, navigation and query primitives
The dynamic computation model reimplements a set of standard primitive and collection data types, such as booleans, numerals and sets.
Dynamic standard data types
BNot: B → B
This operation computes the boolean negation.
BAnd, BOr: B × · · · × B → B or S{B} → B
These operations compute the logical “and” and “or” operations.
BWait: B → unit
This operation waits until the given boolean becomes true.
NSum, NMultiply,
NMin, NMax, NAverage: N × · · · × N → N or S{N} → N
These operations apply numerical aggregation operators (sum, multiply,
minimum, etc).
NSubtract: N × N → N
This operation computes the difference between two numerals.
NDivide: R × R → R
This operation computes the floating-point division of two numerals.
NIDivide: Z × Z → Z
This operation computes the integer division of two numerals.
SNew: unit → S
This operation creates a new empty dynamic set.
SSize: S → Z
This operation retrieves the number of elements of the target dynamic set.
SIsEmpty: S → B
This operation tests the emptiness of the target dynamic set.
SUnion: S × · · · × S → S or S{S} → S
This operation computes the union of the given dynamic sets (or of a dynamic set of dynamic sets).
SFilter: S × F → S
This operation extracts a subset from the target set given a predicate function, of the type F : V → B.
SMap: S × F → S
This operation, where F : V → V is assumed deterministic, computes a
mapped set obtained when applying the given map function to all elements
of the dynamic set.
SSubtract: S × S → S
This operation computes the difference between two dynamic sets.
SIntersect: S × S → S
This operation computes the intersection of two dynamic sets.
The list above provides an overview of the most useful dynamic primitives. This set can easily be extended to cover additional dynamic operations.
Navigation primitives
The LactOz library provides a set of common reusable navigation primitives that
may be composed and extended to describe arbitrarily complex dynamic navigation
expressions.
IIsBoundExternally, IIsBoundInternally: I → B
These operations test whether the interface is externally (resp. internally)
bound relatively to the implicit inside of its owning component.
CGetExternalBindingsFrom, CGetExternalBindingsTo,
CGetExternalBindings: C → S{B}
These operations retrieve the external bindings to and/or from the target
component.
CGetInternalBindingsFrom, CGetInternalBindingsTo,
CGetInternalBindings: C → S{B}
These operations retrieve the internal bindings to and/or from the target
component.
BGetClientComponent, BGetServerComponent: B → C
These operations retrieve the client (resp. server) component of the target
binding.
CGetExternalComponentsBoundTo, CGetExternalComponentsBoundFrom,
CGetExternalComponentsBoundWith: C → S{C}
These operations retrieve the external components bound to and/or from
the target component.
CGetInternalComponentsBoundTo, CGetInternalComponentsBoundFrom,
CGetInternalComponentsBoundWith: C → S{C}
These operations retrieve the internal components bound to and/or from
the target component.
As before, this list only provides an overview of the navigation primitives, and
may be extended as necessary to cover more operations.
10.4 Starting with FructOz and LactOz
Here is a short description explaining how to start using the primitives listed in
the previous sections.
We assume that the FructOz and LactOz compiled modules are copied to a
well known and available location. For instance, copying the *.ozf module files
into the ˜/.oz/cache/x−ozlib/<username>/ user directory will make the modules
available under the following URL: x−ozlib://<username>/∗.ozf.
Once the module files have been compiled and correctly set up, one may import
and use them as follows:
functor
import
Utils at ’x−ozlib://<username>/Utils.ozf’ % dynamic computation toolset
FructOz at ’x−ozlib://<username>/FructOz.ozf’ % FructOz and LactOz primitives
ClusterModule at ’x−ozlib://<username>/Cluster.ozf’ % distributed cluster bootstrap
define
%% Include FructOz and LactOz definitions
\insert ’FructOzHeader.oz’
%% FructOz and LactOz primitives are now directly usable
HostA = {NewRemoteHost ’hostname’}
end
Furthermore, we provide an Oz code stub named FructOzHeader.oz which adds a
few naming shortcuts to the current declaration scope. Including this code stub
allows one to make direct use of most FructOz and LactOz primitives without the
need for module prefixing.
Chapter 11
D2.3b: Report on Formal
Operational Semantics - Formal
Fractal Specification
11.1 Executive Summary
This report contains a formal specification of the Fractal model, written in the
Alloy v4 specification language, and verified using the Alloy Analyzer model checking tool. The Fractal model is the programming-language-independent component
model at the basis of the development in WP2 and WP4 of the Selfman project.
This specification is intended as a first step towards the formal specification of the
Kompics model, reported in Deliverable D2.2b (Chapter 8 of this book).
11.2 Contractors contributing to the deliverable
INRIA (P3) contributed to this deliverable.
11.3 Introduction
The Fractal component model [25] is a programming-language-independent component model, which has been introduced for the construction of highly configurable
software systems. The Fractal model combines ideas from three main sources: software architecture, distributed configurable systems, and reflective systems. From
software architecture, Fractal inherits basic concepts for the modular construction
of software systems, encapsulated components and explicit connections between
them. From reflective systems, Fractal inherits the idea that components can
exhibit meta-level activities and reify through controller interfaces part of their
internal structure. From configurable distributed systems, Fractal inherits explicit
component connections across multiple address spaces, and the ability to define
meta-level activities for run-time reconfiguration. The Fractal model has been used
as a basis for the development of several kinds of configurable middleware, and has
been used successfully for building automated, architecture-based, distributed systems management capabilities, including deployment and (re)configuration management capabilities [6, 28, 49], self-repair capabilities [22, 134], overload management capabilities [23], and self-protection capabilities [33].
The Fractal model is currently defined by an informal specification. The specification only briefly mentions the general foundations that constitute the Fractal
model per se, and focuses mostly on default meta-level capabilities (or controllers,
in Fractal parlance). The specification has been successfully implemented in different languages and environments, notably in Java and C, without giving rise to
serious issues, which is a testimony to its consistency. However, there are aspects
of the specification that remain insufficiently detailed or ambiguous. The present report attempts to correct these deficiencies by developing a formal specification of the Fractal component model which makes explicit the underlying general
component model which constitutes the foundation of Fractal; which clarifies a
number of ambiguities in the informal Fractal specification; and which identifies
places where the informal Fractal specification may be overconstraining.
Beyond ensuring the consistency of the Fractal model, a formal specification
for the Fractal model can serve several purposes: to provide a more abstract,
truly programming language independent specification of the Fractal model; to
allow a formal verification of Fractal designs; to allow a formal specification and
verification of Fractal tools; to allow a rigorous comparison with other component models, and in particular to assess whether a component model constitutes a
proper refinement or specialization of the Fractal model. The latter is important
because the Fractal specification aims to define a very general component model
(e.g. meta-level capabilities in Fractal are not fixed, nor is the semantics of composition realized by composite components), from which more specialized component
models can be derived and combined.
The specification in this report is written in Alloy 4 [1, 75, 76], a lightweight
specification language based on first-order relational logic. Alloy is interesting because of its simplicity and because of the straightforward usage of its analyzer,
which acts essentially as a model checker and counter-example generator, and
which enables rapid iterations between modelling and analysis when writing a
specification (very much akin to debugging a specification). For a detailed introduction and motivation of Alloy, we refer the interested reader to the book [76].
An online tutorial for Alloy is also available on the Alloy Analyzer Web site [1].
The report is written in a literate programming style: the specification is
presented in its entirety, the (informal) commentary on the formal specification
being interspersed with excerpts of the Alloy code. All assertions (Alloy facts) and
theorems (Alloy assertions) have been checked with the Alloy analyzer, checking
for the existence of finite models in the first case, and for the absence of counterexamples in models below a certain size in the second case. We do not introduce
Alloy nor the Fractal model. Hopefully, the commentary running along the Alloy
code excerpts will suffice.
The report is organized as follows. Section 11.4 discusses related work. Section 11.5 details the Alloy specification of the core Fractal concepts. Section 11.6
details the Alloy specification of the naming and (distributed) binding framework
associated with Fractal. The following sections, Section 11.7 to Section 11.10,
detail the Alloy specification of the different optional Fractal controllers mentioned in the informal Fractal specification. These different controllers are key primitive effectors for management operations in an architecture-based approach to self-management: this is shown e.g. by [134] in the context of self-repair.
11.4 Related work
There have been several approaches to the formalization of component-based software and component models. Representative samples are provided by the two
books [93, 97]. The two bodies of work closest to ours are: the co-algebraic approach developed by Barbosa, Meng et al. [92, 13, 102, 103], and the formal
specification in Alloy of the Microsoft COM component model developed by Jackson
and Sullivan [77], following work by Sullivan et al. on the formal specification of
the COM model in Z [137]. Although the presentation we give is relational, the
notion of component or kell we develop in this report is essentially coalgebraic in
nature since a kell can be understood primarily as a set of transitions. Whereas
Barbosa et al. develop a categorical framework, we prefer to adopt a simpler set-based approach: while we lose the benefit of dealing in the same way with multiple
forms of behavior (e.g. probabilistic, time-based, etc) as in [92], the intuition is in
our view better aided by a set-based presentation, and it is easier to understand
for it directly generalizes the well-known notion of transition system. The COM
specification presented in [77] focuses on the structural aspects of the COM model,
and notably on the definition of its query interface and aggregation mechanism.
While the Component controller in the Fractal specification provides much the same
functionality as the query interface in COM, the Fractal model does not exhibit
the COM-specific difficulty arising in the interplay between query interface and aggregation highlighted in [137], and possesses several forms of meta-level behavior
(so-called controllers and controller interfaces). Thus, our work focuses more on
the specification of meta-level behavior, and in particular on the interplay between
base level behavior and meta-level behavior in a component.
11.5 Foundations
This first part of the specification captures the underlying core of the Fractal
model: a very general notion of component, called kell (a remote reference to the
biological cell). At this level of abstraction, the notion of kell first emphasizes two
facts:
• A kell has entry points, called gates. The notion of gate is an abstract form
of the notion of interface in the Fractal specification. A gate constitutes a
named point of interaction between a kell and its environment. The set of
gates of a kell constitutes its sole means of interaction with its environment,
i.e. a kell is a unit of encapsulation.
• A kell may have subcomponents, called subkells. All transitions in a kell
may act on the set of subcomponents, and modify it in arbitrary ways.
This flexibility is key to allow different semantics for composition, and to
support different kinds of meta-level operations (i.e. operations operating on
the internal structure and behavior of components).
The first primitive sets in the Alloy specification of the core Fractal model are
given below. 1
module fractal/foundations
sig Id {}
sig Val {}
sig Op extends Id {}
The above declarations introduce three primitive sets: Id, Val, and Op. They correspond, respectively, to the sets of identifiers, base values, and operation
names. Identifiers are just primitive forms of names or references. Base values represent values of some (unspecified) data types, such as integers, booleans, strings,
1
In Alloy, primitive sets are just sets of atoms, i.e. elements which have no internal
structure (and are not sets – atoms are sometimes called urelements in the logic literature, e.g. as in [14]). Primitive sets are called signatures in Alloy, hence the keyword sig for introducing them. Note also the module declaration: in Alloy, specifications can be
broken down into modules, which can then be imported for use in other modules using a
declaration of the form: open moduleX as X, where X is some local name used, in the current
module, as an abbreviation for the imported module.
etc. At this level of abstraction, the exact forms base values can take are of no import, hence their specification as just atoms. 2
The general notions of interface and component in the core Fractal model are
given below. They are called, respectively, gate and kell.
sig Gate {
gid: Id
}
sig Kell {
gates: set Gate,
sc: set Kell,
kid: Id
}
fact GatesInIKellHaveUniqueIds {
all c:Kell | all i,j:c.gates | i.gid = j.gid implies i = j
}
A gate, i.e. an element of the set Gate, is an entry point to communicate with
a kell. The declaration above stipulates that a gate has an identifier (in Alloy, a
declaration of the form gid:Id can be read as declaring a feature, or instance variable,
of the class Gate; formally, it declares a binary relation gid : Gate → Id between the
set of gates, Gate, and the set of identifiers, Id). A kell, i.e. an element of the set
Kell, is defined as having an identifier, given by the feature kid, a set of gates, given
by the feature gates, and a set of subcomponents, given by the feature sc. The
fact that a kell has an identifier is necessary (e.g. for management purposes) to
manifest a notion of identity that persists throughout state changes. The Alloy
fact named GatesInIKellHaveUniqueIds expresses an invariant on kells, namely that gates
that belong to a kell have distinct identifiers. 3
These elements provide the basic structure of a kell but do not explain how it
behaves. This is captured by the definition of the set TKell below, which endows
kells with transitions. Transitions are defined below as 4-tuples that comprise a
set of initial kells (feature tsc), a set of input signals (feature sin), a set of output
signals (feature sout), and a set of residual kells (feature res). Intuitively, the initial
set of kells of a transition corresponds to subkells of the kell to which the transition
2
Keyword extends in Alloy indicates that a primitive set is declared as a subset of another
one (and that it will form, with other subsets similarly declared, a partition of the set it
extends).
3
This invariant takes the form of a simple first-order logical formula, where the keyword
all denotes the universal quantifier ∀, where a declaration such as c:Kell denotes an arbitrary
element c of the set Kell (likewise, i:c.gates denotes an arbitrary element i of the set of
gates of the kell c – the dot notation c.gates is the standard notation for accessing a
feature, or attribute, of an instance of a class). In a more classical logical notation, the
GatesInIKellHaveUniqueIds formula would read:
∀c ∈ Kell, ∀i, j ∈ gates(c), gid(i) = gid(j) ⇒ i = j
belongs (the set of subkells on which the transition acts). The set of residual kells
are the kells produced by the transition. The kell to which the transition belongs
may or may not belong to the residual of the transition. This allows us to model
component factories, as in the Fractal specification, i.e. components that can create
other components, or operations that delete or transform the target component.
Effectively, this means that a kell can be seen as some sort of generalized Mealy
machine (a labelled transition system, whose labels denote input and output signals
handled during a transition).
sig TKell in Kell {
transitions: set Transition
}
sig Transition {
tsc: set Kell,
sin: set Signal,
sout: set Signal,
res: set Kell
}
fact TransMayNotHaveDifferentSubComps { all c:TKell | all t:c.transitions | t.tsc = c.sc }
The invariant TransMayNotHaveDifferentSubComps ensures that the initial kells associated
with each transition of a given kell c are indeed the subkells of c.
Signals are defined below as records of arguments (feature args), with a target
gate (feature target), i.e. the gate at which a signal is received (if it is an input
signal) or emitted (if it is an output signal), and an operation name (feature operation).
In object-oriented terms, a signal looks very much like a reified method invocation.
sig Signal {
target: Gate,
operation: Op,
args: Id −> set Arg
}
sig Arg in Id + Val + Gate + Kell {}
fact SignalsTargetInterfaces { all c: TKell | c.transitions.(sin + sout).target in c.gates }
Signal arguments belong to the set Arg defined above as the union4 of four sets:
identifiers, values, gates and kells. This means in particular that signals may carry gates (much as messages in the π-calculus may carry channel names 5 ), and kells.
The latter capability is not explicitly reflected in the Fractal specification, but
4
In Alloy, in denotes the subset relation, or the set membership relation, + denotes set
union, & denotes set intersection, − denotes set difference, # denotes set cardinality.
5
Note that in the π-calculus, channels, i.e. communication capabilities, are just names.
Strictly speaking, we could have avoided including gates as possible arguments to signals, by relying just on identifiers. However, we have been careful in the specification to ensure that gates remain immutable, in contrast to kells (i.e. no operation will transform a gate into another with the same identifier). Passing a gate identifier or the gate itself as an argument is thus strictly equivalent, but allowing gates as signal arguments simplifies
the specification.
is required to model mobile agents and strong mobility, as well as deployment,
checkpointing and reconfiguration capabilities6 . The fact SignalsTargetInterfaces expresses the invariant that all target gates in signals appearing in transitions of a
kell c are gates of c.
Discussion
This completes the specification of the core Fractal model. As can be seen the
core is very small, and it merely asserts that components are higher-order Mealy
machines that can be hierarchically organized. This core model allows a number
of seemingly unusual, or unexpected, features:
• We allow kells to have no gates, and thus only internal behavior. As a result,
the following assertion 7 is not valid:
assert AllKellsHaveGates { all c:TKell | some c.gates }
In contrast, the following assertion is valid:
assert NoInterfaceImpliesInternalActions {
all c:TKell | (no c.gates) implies (no c.transitions.(sin + sout) )
}
It asserts that if a kell has no gate, then its transitions are merely internal
as they involve no exchange of signals with the environment.
• We allow component structures with sharing, i.e. a kell may be a subkell of
two different kells. Thus, the following assertion to the contrary is invalid:
assert SharingIsImpossible {
all c1,c2:Kell | no cs:Kell {
cs in c1.sc & c2.sc and (not c1 = c2) and (not c1 in c2.sc) and (not c2 in c1.sc)
}
}
Component sharing is an original feature of the Fractal model, which has been found useful to model situations with resource sharing, i.e. where components at different places in a component hierarchy need access to the same resource, such as a software library or an operating system service.
6
Alternatively, one could model this through marshalling and unmarshalling operations, allowing one to transform a kell into a value, and vice versa. The above specification of
arguments makes this higher-order character of signals and of the kell model explicit.
7
An assertion in Alloy is written exactly like a fact, except for the use of the assert
keyword to declare it. An invalid assertion is detected by the Alloy Analyzer when it
generates a finite model that contradicts it (a counterexample). The Alloy keyword some
denotes the existential quantifier. Thus some c:Kell | P, where P is some predicate, asserts the
existence of some kell c verifying P. By extension, some s, where s is a set, asserts that s is
not empty (i.e. that there is some element in s).
• We allow component structures which are not well-founded, i.e. where a kell
may appear as a subkell of itself, or as a subkell of some of its subkells, etc.
Thus, the following assertion to the contrary is invalid8 :
assert ContainmentIsWellFounded {
all c:Kell | (not c in c.∗sc)
}
This may seem counterintuitive, however this is not really different from
allowing recursive procedure calls, and we therefore do not enforce it at the
level of abstraction of the core model. The Fractal specification explicitly
disallows this (ContainmentIsWellFounded would be written as an invariant – an Alloy fact), but this feature could be interesting to model recursive component
structures. This is one occurrence where the present formal specification
relaxes the constraints from the informal Fractal specification.
• We allow components to have a varying number of interfaces during their
lifetime, i.e. kells to have a varying number of gates in the course of their
execution. Thus, the following assertion to the contrary is invalid:
assert NumberOfGatesInKellDoesNotVary {
all c:TKell | all t:c.transitions | all cr:TKell {
(cr in t.res and cr.kid = c.kid) implies (cr.gates = c.gates)
}
}
• In contrast to most other component models (see e.g. [91] for a discussion of
recent ones), we do not need to introduce a notion of connector or binding
to mediate the communication, or reify the communication paths, between
components. Sharing, containment (i.e. the kell-subkells relationship) and
the fact that each component or kell institutes its own composition semantics
suffice to make explicit communication channels in a component structure,
and to define their semantics.
The specification of kells as Mealy machines has however one major drawback
as an Alloy specification. Because transitions make explicit the state changes that
a kell may go through, specifying state changes in the present specification amounts
to requiring that certain facts hold, which would take the form of closure properties
such as “kells of this kind – and the kells that appear in the residues of their
transitions – must have transitions of this sort”. For instance, we would require
all kells that support the Component gate to have certain transitions implementing
the Component operations, and all the kells in the residues of their transitions to be
kells of a similar kind. Closure properties of this kind are unfortunately instances
of so-called generator axioms that may lead to a state explosion in models of the
specification, which makes them impossible to analyze using the model checking
8
In Alloy, ∗r of some relation r denotes the reflexive and transitive closure of r, while ˆr
denotes its transitive closure.
approach of the Alloy Analyzer. This problem is an instance of the unbounded
universal quantifiers problem, discussed in Section 5.3 of [76], and needs to be
avoided if we want to exploit the Alloy Analyzer in assessing the consistency of
the specification. Our approach in this specification is to not describe explicitly
the set of transitions logically associated with a kell. Instead, we will define Alloy
predicates that describe state changes on certain kells, but refrain from imposing
that these state changes appear as explicit transitions in the supporting kells. In
effect, for the purposes of this specification, we will deal only with elements of the
Kell set, and will not consider elements of the TKell set. In the following sections,
we adopt this approach: all properties and predicates considered will apply
to elements of Kell.
Before moving to the specification of the (optional) default meta-level capabilities of the Fractal model, we gather here a number of declarations used in the rest
of the specification. The distinction between Client and Server is here merely a primitive type distinction, which governs bindings between gates: to bind two gates
together, one must be a dual of the other. The denominations Client and Server
merely reflect that duality. The predicate isoKell can be interpreted as a strong
identity predicate, whereby two kells are strongly identical if they have the same
identifier, the same gates and the same subkells (they may differ in their internal
state).
sig Client extends Gate {}
sig Server extends Gate {}
one sig NoSuchInterfaceException extends Val {}
one sig IllegalBindingException extends Val {}
one sig IllegalLifeCycleException extends Val {}
one sig IllegalContentException extends Val {}
one sig Ok extends Val {}
one sig Null extends Val {}
pred isoKell[c:Kell, c1:Kell] {
c.kid = c1.kid
c.sc = c1.sc
c.gates = c1.gates
}
11.6 Naming and binding
The naming and binding part of the specification captures the notions necessary
for the construction of distributed configurations. We follow here the informal
Fractal specification, [26], Section 2.2.
The first concept is that of name. A name is merely an entity that is used
to refer to another one. A name comes equipped with a reference to its naming
context (feature context).
module fractal/naming
open util/relation as RR
open fractal/foundations as FF
sig Name {
context: Id,
pack: NamePickle
}
A name can also be pickled (or marshalled) to make it persistent or to communicate
it between different machines. This is obtained through the combination of the
pack feature and of operations encode and decode 9 , defined as follows:
fact PackUnpackIdempotent { all n:Name | n.pack.unpack = n }
pred encode(n:Name, p:NamePickle) { p = n.pack }
pred decode(nc:NamingContext, p:NamePickle, n:Name) {
p.context = nc.nid and p.unpack.context = nc.nid implies n = p.unpack
}
assert DecodingYieldsSameNameThanEncoded {
all p:NamePickle, n:Name, nc:NamingContext {
encode[n,p] and nc.nid = n.context implies decode[nc,p,n]
}
}
Names exist only within contexts. Contexts are primarily associations between names and referents (i.e. entities which are referred to by names). Making contexts
explicit allows us to define different systems of names and to have them coexist and
cooperate without the need to rely on a global naming authority to disambiguate
independently created names. Contexts are defined as follows:
sig NamingContext {
nid: Id,
exported: Name −> lone Referent
}
fact NameRefersToContext { all nc:NamingContext | all n:dom[nc.exported] | n.context = nc.nid }
The feature exported in a naming context identifies the association between names
in the context and their referents. 10
Two key invariants, given below, apply to names and contexts. The first one
merely asserts that names appearing in a context correctly refer to this context.
The second one clarifies the fact that referents cannot be names that belong to the
context (they can be names that belong to other contexts, though, thus allowing
referral chains to be constructed across multiple contexts).
fact NameRefersToContext {
9
Alloy allows the definition of first-order predicates, whose declarations start with keyword pred and are optionally followed by the list of the predicate arguments. Operations
that perform state changes in a system can be defined as predicates with some arguments
corresponding to the initial state, i.e. the state prior to the execution of the operation, and
other arguments corresponding to the final state, i.e. the state resulting from the execution
of the operation. For a discussion on how to model state changes in Alloy.
10
In Alloy a declaration of the form exported: Name −> lone Referent denotes an injective
binary relation between the set Name and the set Referent. The keyword lone is an example
of a relation multiplicity. In our case, it signifies that a given name is to be asociated with
one, and only one, referent. Of course, two distinct names can have the same referent.
SELFMAN Deliverable Year Two, Page 164
CHAPTER 11. D2.3B: REPORT ON FORMAL OPERATIONAL
SEMANTICS - FORMAL FRACTAL SPECIFICATION
}
all nc:NamingContext | all n:dom[nc.exported] | n.context = nc.nid
fact InContextNamesNotExported {
all nc:NamingContext | all n:ran[nc.exported] | n.context != nc.nid
}
The main operation supported by a naming context is the export operation. Operation export returns a new name n for referent r in context nc; the name n is a valid name for referent r in the context nc. The naming context nc can for instance be a network context where remotely accessible interfaces are given names of a special form (e.g. URLs for a Web service context). Note that a name can be exported as well; this is necessary to handle names across different naming contexts. A referent that is already a name in the target context cannot be exported. Operation export is specified below (see footnote 11).
one sig NamingException extends Val {}
pred export (nc1, nc2: NamingContext, r: Referent, n:Name + NamingException){
let A = not (some s:Referent − r | n−>s in nc1.exported),
B = (not r.context = nc1.nid) {
(A and B) implies nc2.exported = nc1.exported + n −> r and nc1.nid = nc2.nid
else n in NamingException and nc1 = nc2
}
}
One may verify a number of properties in relation to operation export. Here are a few self-explanatory ones:
assert ExportReturnsNewNameOrOldMap {
all nc,ncc:NamingContext, r:Referent, n:Name |
export[nc,ncc,r,n] implies
let A = (not n in dom[nc.exported]),
B = (n.(nc.exported) = r),
C = (not r.context = nc.nid) {
(A or B) and C
}
}
assert ExportNameBelongsToContext {
all nc,ncc:NamingContext, r:Referent, n:Name |
export[nc,ncc,r,n] implies n.context = nc.nid
}
assert ExportExceptionLeavesContextUnchanged {
all nc,ncc: NamingContext, r:Referent, n:NamingException |
export[nc,ncc,r,n] implies nc = ncc
}
The following assertions highlight the fact that name resolution within a single naming context can be partial. To be complete, name resolution must typically take place across several naming contexts. However, we also allow partial name resolution across several naming contexts. This takes care of situations where name resolution cannot be carried out in full (e.g. in disconnected situations) or need not be carried out in full (e.g. when no access to a referenced interface or component is attempted).
Footnote 11: Note the use of the Alloy construct let A = ... { S }, which declares a variable A to stand as a denotation for some value, a denotation which is then used inside the statement S. Note also the use of a nested implication of the form C1 implies F1 else F2, which is equivalent to (C1 and F1) or ((not C1) and F2). Note, finally, the keyword one which precedes the declaration of the NamingException value: it signifies that the set NamingException is a singleton. In Alloy, set elements are essentially identified with singletons.
assert ExportClosureIsJustExport {
all nc:NamingContext | nc.exported = ^(nc.exported)
}
assert ExportClosureEndsInInterfaceOrNotInContextName {
all nc:NamingContext, n:Name, r:Referent |
r in n.(nc.exported) implies (r in Gate) or (r in Name and r.context != nc.nid)
}
A binder is a naming context that can resolve names and establish connections (bindings) towards entities referred to by resolved names. A binding is typically created by a bind operation. The creation of a binding results in the creation of a component that provides a (local) interface which corresponds to (e.g. is a proxy to) the resolved name. A binder records the association (bindings) between resolved names and the (local) interfaces they refer to. Binders are specified below.
sig Binder extends NamingContext {
bindings: Name −> lone Gate
}
fact BindingNamesBelongToContext {
all b:Binder | all n: dom[b.bindings] | n.context = b.nid
}
fact BindingsAndExportedDomainsDisjoint {
all b:Binder | no (dom[b.bindings] & dom[b.exported])
}
The bind operation is specified below, together with some self-explanatory properties.
pred bind(b,b1:Binder, n:Name, i:Gate + NamingException) {
b.nid = b1.nid
n −> i in b.exported implies b = b1
else i in Gate implies b1.bindings = b.bindings + n −> i
else i in NamingException and b = b1
}
assert BindExceptionLeavesBinderUnchanged {
all b,b1:Binder, n:Name, i: NamingException | bind[b,b1,n,i] implies b = b1
}
assert BindReturnsNewInterfaceOrFromExported {
all b,b1:Binder, n:Name, i:Gate | bind[b,b1,n,i] implies n −> i in b.exported + b1.bindings
}
Finally, one can prove a correct interplay between export and bind, namely that
bind returns a local (in-context) gate referred to by a previously exported name.
assert BindReturnsPreviouslyExportedReferent {
all b,b1:Binder, n:Name, i:Gate |
export[b,b1,i,n] implies bind[b1,b1,n,i]
}
11.7 Component controller
The component controller in Fractal supports basic introspection capabilities: discovering all the interfaces associated with a component and their type. We follow
here the informal Fractal specification, [26], Section 3.
Our formal specification of the component controller begins with the declaration of the Type signature, with its subtype relation, noted sstypes. At this level of
abstraction, the only property recorded of the subtype relation is that it constitutes
a partial order (see footnote 12).
module fractal/component
open util/relation as RR
open fractal/foundations as FF
open util/graph[Type] as GG
sig Type extends Val {
sstypes: set Type
}
fact SubTypingIsPartialOrder { GG/dag[sstypes] }
sig CompType extends Type {}
sig InterfaceType extends Type {}
The next signatures declare the Component gates and the Interface gates. A Component gate is a server gate which also records the type of the component it belongs to. An Interface gate records its type, as well as the Component gate of the component it belongs to. As noted in the informal Fractal specification, this setting is similar to that adopted by the Microsoft COM model, with Component corresponding to the COM IUnknown interface.
sig Component extends Server {
ctype: CompType
}
sig Interface in Gate {
owner: Component,
itype: InterfaceType
}
A ckell is now defined as a kell with one gate which is an instance of Component
(its other gates can be arbitrary gates). The fact CKellsHaveComponent constrains
ckells to have only one Component gate.
sig CKell in Kell {
comp: Component
}
fact CKellsHaveComponent { all c:CKell | c.comp = c.gates & Component }
Likewise, we define an ikell as a kell whose gates are all instances of
Interface.
sig IKell in Kell {}
fact IKellsHaveInterfaces { all c: IKell | c.gates in Interface }
Footnote 12: Note the use of the Alloy utility module graph, and the predicate dag from this module.
We now define compkells as ckells which are also ikells, thus, as kells which
have a Component gate, and whose gates are all instances of Interface.
sig CompKell in Kell {}
fact CompKellsAreCKellsAndIKells { CompKell = CKell & IKell }
fact InterfacesInCompKellsHaveOwner { all c: CompKell | all i:c.gates | i.owner = c.comp }
The basic properties of compkells are corroborated by the following simple,
self-explanatory assertions.
assert OneComponentPerCompKell {
all c:CKell | one c.gates & Component
}
assert ComponentInCompKellsIsInterface {
all c:CompKell | c.comp in Interface
}
assert CompKellsHaveOnlyInterfaces {
no c:CompKell { some c.gates & (Gate − Interface) }
}
assert OwnersInCompKellsAreComponent {
all c:CompKell | all i:c.gates | i.owner in Component & Interface
}
Before specifying the different operations that are attached to Component and Interface, we first define an equivalence predicate on compkells. Roughly, isoCKell indicates that two compkells have the same identifier, the same subcomponents, and the same gates, i.e. their internal and external structures (but not necessarily their exact states) are the same. By virtue of the above invariants, two equivalent compkells have the same Component gate.
pred isoCKell[c:CompKell, c1:CompKell] {
isoKell[c,c1]
}
assert IsoCompKellsHaveSameComponent {
all c,c1:CompKell | isoCKell[c,c1] implies c.comp = c1.comp
}
We give below the different operations attached to a Component gate. Operation getInterfaces returns the set of gates is of a compkell c, given its Component interface o. In the process, compkell c becomes compkell c1, which is required to be equivalent to c, i.e. to have the same gates, the same identifier, and the same subkells. Operation getInterface returns the gate i whose identifier iid is passed as argument to the operation. In the process, compkell c becomes compkell c1. Operation getCType returns the component type ct of compkell c.
// Operations from the Component interface
pred getInterfaces[c:CompKell, o:Component, is: set Interface, c1:CompKell] {
o = c.comp
is = c.gates
isoCKell[c,c1]
}
pred getInterface[c:CompKell, o:Component, iid:Id, i:Interface + NoSuchInterfaceException, c1:CompKell] {
o = c.comp
isoCKell[c,c1]
i in c.gates implies iid = i.gid
else i = NoSuchInterfaceException
}
pred getCType[c:CompKell, o:Component, ct: CompType, c1:CompKell] {
o = c.comp
ct = o.ctype
isoCKell[c,c1]
}
The specification of the above operations provides examples of ambiguities that arise in the informal Fractal specification (in fact, in any informal specification), and which are difficult to weed out without a formal model. The Fractal specification [26] leaves the exact postconditions of these operations unspecified. Here we strike a middle ground between a strong form, which would require that the target compkell be left untouched, i.e. which would specify c = c1 in place of our isoCKell[c,c1], and a very weak form, which would only require c.kid = c1.kid. The strong form would forbid any kind of side effect in such meta-level operations (such as, e.g., setting a counter or updating a log of such operations), whereas the very weak form would make these introspection operations essentially useless (since the obtained information would be obsolete as soon as it is obtained).
We specify below operations associated with an Interface gate. Operation getOwner
returns the Component gate associated with the compkell c that hosts the target
Interface gate i. Operation getName returns the identifier of the target Interface gate i.
Operation getType returns the interface type it of the target Interface gate i.
// Operations from the Interface interface
pred getOwner(c:CompKell, i:Interface, o:Component, c1:CompKell) {
i in c.gates
o = i.owner
isoCKell[c,c1]
}
pred getName[c:CompKell, i:Interface, iid:Id, c1:CompKell] {
i in c.gates
iid = i.gid
isoCKell[c,c1]
}
pred getType[c:CompKell, i:Interface, it: InterfaceType, c1:CompKell] {
i in c.gates
it = i.itype
isoCKell[c,c1]
}
We give below two simple properties, which assert the consistency of the Component and Interface operations.
assert ComponentToInterfacesAndBack {
all c,c1:CompKell | all i: Interface | all is: set Interface {
getInterfaces[c,c.comp,is,c1] and i in is implies getOwner[c,i,c.comp,c1]
}
}
assert InterfaceToComponentAndBack {
all c,c1:CompKell | all i: Interface | all is: set Interface {
getOwner[c,i,c.comp,c1] and getInterfaces[c,c.comp,is,c1] implies i in is
}
}
11.8 Binding controller
The binding controller in Fractal supports the binding of client interfaces of a component to server interfaces. The effect of this binding is to allow the components
that are connected via these bound interfaces to communicate. We follow here the
informal Fractal specification, [26] Section 4.3.
In our case, we do not specify the exact effect of binding a client and a server
gate, since the semantics of this binding typically depends on the enclosing component where it takes place. However, kells providing a BindingController gate record which client interfaces are bound (feature bindings in an instance of BCKell). Notice the different constraints that apply:
• a client gate is bound to at most a single server gate (lone multiplicity in the bindings feature declaration);
• client gates must be gates of the hosting kell (fact ClientsInBindingCntrlAreBCKellGates);
• the bindings relation records the binding of client gates (fact BindingsBindClientGates).
module fractal/binding
open util/relation as RR
open fractal/foundations as FF
sig BindingController extends Server {}
sig BCKell in Kell {
bctrl: BindingController,
clients: set Client,
bindings: Client −> lone Server
}
fact BindingsBindClientGates {
all c:BCKell | dom[c.bindings] in c.clients
}
fact ClientsInBCHaveUniqueIds {
all c:BCKell | all ci,cj:c.clients | ci.gid = cj.gid implies ci = cj
}
fact ClientsInBindingCntrlAreBCKellGates {
all c:BCKell | c.clients in c.gates
}
fact BindingControllerAreBCKellGates {
all c:BCKell | c.bctrl in c.gates
}
Before specifying the operations attached to BindingController gates, we define an
equivalence predicate between kells with a BindingController gate. Two such kells are
equivalent if they have the same identifier, the same subkells, the same client gates
and the same BindingController gate.
pred isoBCKell(c:BCKell, c1:BCKell) {
isoKell[c,c1]
c1.clients = c.clients
c1.bctrl = c.bctrl
}
We specify below the different operations associated with BindingController gates.
Operation list returns the set of client gates of the kell hosting the target BindingController
gate; in the process, the hosting kell c evolves into kell c1. Operation lookup returns
the server gate i that is bound to the client gate whose identifier iid is passed as
argument. Operation bind binds the client interface whose identifier cid is passed
as argument to the server gate si passed as argument. Finally, operation unbind
unbinds the client gate whose identifier cid is passed as argument.
// Operations from the BindingController interface
sig BindingReturn in Ok + NoSuchInterfaceException + IllegalBindingException + IllegalLifeCycleException {}
pred list(c:BCKell, bc:BindingController, cs: set Client, c1:Kell) {
c.bctrl = bc
cs = c.clients
isoBCKell[c,c1]
}
pred lookup(c:BCKell, bc:BindingController, iid: Id, i: Server + NoSuchInterfaceException, c1:Kell) {
c.bctrl = bc
isoBCKell[c,c1]
iid in (c.clients).gid implies (some if: Client { if.gid = iid and if in c.clients and i in if.(c.bindings) })
else i = NoSuchInterfaceException
}
pred bind(c:BCKell, bc:BindingController, cid: Id, si:Server, r: BindingReturn, c1:Kell) {
bc = c.bctrl
some ci:Client {
r = IllegalLifeCycleException implies c1 = c
else no cid & (c.clients).gid implies r = NoSuchInterfaceException and c1 = c
else some ci.(c.bindings) implies r = IllegalBindingException and c1 = c
else ( cid in (c.clients).gid and no ci.(c.bindings) and
ci.gid = cid and ci in c.clients and
r = Ok and c1.bindings = c.bindings + ci −> si and
isoBCKell[c,c1] )
}
}
pred unbind(c:BCKell, bc:BindingController, cid: Id, r: BindingReturn, c1:Kell) {
some ci:Client, si: Server {
c.bctrl = bc
r = IllegalLifeCycleException implies c1 = c
else no cid & (c.clients).gid implies r = NoSuchInterfaceException and c1 = c
else no ci.(c.bindings) implies r = IllegalBindingException and c1 = c
else ( cid in (c.clients).gid and ci −> si in (c.bindings) and
ci.gid = cid and r = Ok and
c1.bindings = c.bindings − ci −> si and
isoBCKell[c,c1] )
}
}
We give below a number of properties that assess the mutual consistency
of the different BindingController operations. Predicates getClient, getBoundClient, and
getBoundServer are just abbreviations for some simple conditions. The last two properties UnbindAfterBindPossible and BindAfterUnbindPossible are commutation conditions on
the bind and unbind operations.
assert LookupAfterBindYieldsCorrectServer {
all c:BCKell, cid: Id, si:Server, r: Ok, c1:BCKell |
bind[c,c.bctrl,cid,si,r,c1] implies lookup[c1,c1.bctrl,cid,si,c1]
}
pred getClient[c:BCKell, cid:Id, ci:Client] {
cid = ci.gid
ci in c.clients
}
pred getBoundClient[c:BCKell, cid:Id, ci:Client] {
getClient[c,cid,ci]
ci in dom[c.bindings]
}
pred getBoundServer[c:BCKell, cid:Id, si:Server] {
some ci:Client | getBoundClient[c,cid,ci] and ci −> si in c.bindings
}
assert UnbindPossibleMeansBindingExists {
all c:BCKell, cid:Id, c1:BCKell {
unbind[c,c.bctrl,cid,Ok,c1] implies some s:Server { s in Client.(c.bindings) }
}
}
assert UnbindAfterBindPossible {
all c:BCKell, cid: Id, si:Server, c1:Kell |
bind[c,c.bctrl,cid,si,Ok,c1] implies unbind[c1,c1.bctrl,cid,Ok,c]
}
assert BindAfterUnbindPossible {
all c:BCKell, cid: Id, c1:BCKell, si:Server {
unbind[c,c.bctrl,cid,Ok,c1] and getBoundServer[c,cid,si] implies bind[c1,c1.bctrl,cid,si,Ok,c]
}
}
11.9 Content controller
The ContentController interface in Fractal allows one to introspect the internal structure of a component in the form of its so-called internal interfaces and of its subcomponents. We follow here the informal Fractal specification, [26], Section 4.4.
We specify below kells with ContentController gates, i.e. elements of CCKell. Internal gates appear only as a set of gates which are not gates for interaction with the environment, i.e. the exterior, of a kell. There are no further semantics associated with this notion of internal gate, since it typically varies with each kell (internal gates typically allow subkells to be explicitly connected to some inner functionality of their parent kell). Making internal gates explicit allows one to control, through the ContentController gate, the internal connections between a parent kell and its subkells. Instances of CCKell also provide access to (in general, a subset of) their subkells (feature subcomps in the CCKell signature). Notice that CCKell is defined as a subset of CKell, i.e. each instance of CCKell has both a ContentController gate and a Component gate. All subkells of a kell in CCKell, or cckell, are ckells, i.e. they all have a Component gate.
module fractal/content
open util/relation as RR
open fractal/foundations as FF
open fractal/component as FC
sig ContentController extends Server {}
sig CCKell in CKell {
cctrl: ContentController,
internals: set Gate,
subcomps: set CKell
}
fact ContentControllerInCCKellIsExternalGate { all c:CCKell | c.cctrl in c.gates }
fact InternalsAreNotExternalsInCCKells { all c: CCKell | no (c.internals & c.gates) }
fact SubcompsAreSubComponentsInCCKells { all c: CCKell | c.subcomps in c.sc }
fact InternalsIdsAreDistinct { all c:CCKell | all g,g1:c.internals | g.gid = g1.gid implies g = g1 }
fact CCKellsHaveDistinctComponentsInSubComps {
all c:CCKell | all c1,c2:c.subcomps {
c1.comp = c2.comp implies c1 = c2
c1.kid = c2.kid implies c1 = c2
}
}
assert CCKellsHaveCompAsIdsInSubComps {
all c:CCKell | all c1,c2:c.subcomps {
c1.kid = c2.kid <=> c1.comp = c2.comp
}
}
We now define an equivalence predicate on cckells.
pred isoCCKell(c:CCKell, c1:CCKell) {
isoCKell[c,c1]
c.cctrl = c1.cctrl
c.internals = c1.internals
c.subcomps = c1.subcomps
}
We specify below the different operations attached to a ContentController gate. Operation getInternalInterfaces returns the set of internal interfaces of the cckell which hosts
the target ContentController gate. Operation getInternalInterface returns the internal gate
whose identifier iid is passed as argument. Operation getSubComponents returns the
set scc of subkells accessible via the ContentController gate. Operation addSubComponent
adds a ckell designated by its Component gate icc to the set of subcomps of the host
cckell. Operation removeSubComponent does the reverse.
// Operations from the ContentController interface
pred getInternalInterfaces(c:CCKell, cc:ContentController, sg: set Gate, c1:Kell) {
cc = c.cctrl
sg = c.internals
isoCCKell[c,c1]
}
pred getInternalInterface(c:CCKell, cc:ContentController, iid: Id, ig: Gate, c1:Kell) {
cc = c.cctrl
ig.gid = iid
ig in c.internals
isoCCKell[c,c1]
}
pred getSubComponents(c:CCKell, cc:ContentController, scc: set Component, c1:Kell) {
cc = c.cctrl
scc = (c.subcomps).comp
isoCCKell[c,c1]
}
pred addSubComponent(c:CCKell, cc:ContentController, icc: Component, r: ContentReturn, c1:Kell) {
some scc:CKell {
cc = c.cctrl
icc = scc.comp
r = IllegalLifeCycleException implies c1 = c
else icc in c.subcomps.comp implies r = IllegalContentException and c1 = c
else r = IllegalContentException implies c1 = c
else r = Ok and c1.kid = c.kid and c1.comp = c.comp and c1.cctrl = c.cctrl and scc in c1.subcomps
}
}
pred removeSubComponent(c:CCKell, cc:ContentController, icc:Component, r:ContentReturn, c1:CCKell) {
cc = c.cctrl
r = IllegalLifeCycleException implies c1 = c
else no icc & c.subcomps.comp implies r = IllegalContentException and c1 = c
else r = IllegalContentException implies c1 = c
else some scc: c.subcomps {
icc = scc.comp and
r = Ok and
c1.cctrl = c.cctrl and c1.comp = c.comp and c1.kid = c.kid and
no scc & c1.subcomps
}
}
sig ContentReturn in Ok + IllegalContentException + IllegalLifeCycleException {}
Finally we give some consistency properties on operations.
assert RemoveAfterAddIsPossible {
all c,c1: CCKell, icc:Component {
(addSubComponent[c,c.cctrl,icc,Ok,c1] and
c1.gates = c.gates and
some scc:CKell {scc.comp = icc and c1.subcomps = c.subcomps + scc} ) implies
removeSubComponent[c1,c1.cctrl,icc,Ok,c]
}
}
assert AddAfterRemoveIsPossible {
all c,c1: CCKell, icc:Component {
(removeSubComponent[c,c.cctrl,icc,Ok,c1] and
c1.gates = c.gates and
some scc:CKell {scc.comp = icc and c1.subcomps = c.subcomps − scc} ) implies
addSubComponent[c1,c1.cctrl,icc,Ok,c]
}
}
assert GetSubCompSucceedsAfterAdd {
all c,c1: CCKell, icc:Component {
addSubComponent[c,c.cctrl,icc,Ok,c1] implies
(getSubComponents[c1,c1.cctrl,c1.subcomps.comp, c1] and icc in c1.subcomps.comp)
}
}
//
// The following property does not hold.
// Because of the weak conditions on removeSubComponent,
// it may well be that a component with the same Component gate cc
// exists as a subcomponent of a component from which a component
// with Component gate cc has just been removed.
//
assert GetSubCompFailsAfterRemove {
all c,c1: CCKell, icc:Component {
removeSubComponent[c,c.cctrl,icc,Ok,c1] implies no icc & c1.subcomps.comp
}
}
11.10 Lifecycle controller
The LifecycleController interface in the Fractal model provides basic capabilities to
control the execution of a component. The execution of a component from the
point of view of this LifecycleController is abstracted as evolving between two macro-states, Started and Stopped. We follow here the informal Fractal specification, [26]
Section 4.5.
We specify first these two macro-states.
module fractal/lifecycle
open util/relation as RR
open fractal/foundations as FF
sig LFState extends Val {}
one sig Started extends LFState {}
one sig Stopped extends LFState {}
We then define the set LFKell of kells that offer a LifecycleController gate. The feature ctrls identifies the set of “control” gates, i.e. those gates whose operations are not inhibited when in the Stopped state.
sig LifeCycleController extends Server {}
sig LFKell in Kell {
lfctrl: LifeCycleController,
state: LFState,
ctrls: set Gate
}
fact LFCtrlIsACntrlGate { all c:LFKell | c.lfctrl in c.ctrls }
fact CtrlGatesAreInGates { all c:LFKell | c.ctrls in c.gates }
fact LFStateIsStoppedOrStarted { all c:LFKell | c.state in Started + Stopped }
We define first an equivalence predicate between kells with LifecycleController gates.
pred isoLFKell(c:LFKell, c1:LFKell) {
isoKell[c,c1]
c.lfctrl = c1.lfctrl
c.state = c1.state
c.ctrls = c1.ctrls
}
We specify below the operations attached to LifecycleController gates. Operation getState
returns the macro-state s of the kell hosting the target LifecycleController gate lfc.
Operation start places the kell c hosting the target LifecycleController gate lfc into the
Started macro-state. This may imply all sorts of changes in c, hence the weak
constraint on the resulting kell c1: it has the same identifier as c, and the same
LifecycleController interface, and it is in the Started macro-state.
// Operations from the LifecycleController interface
pred getState(c:LFKell, lfc:LifeCycleController, s:LFState, c1:LFKell) {
c.lfctrl = lfc
s = c.state
isoLFKell[c,c1]
}
pred start(c:LFKell, lfc:LifeCycleController, r:LFReturn, c1:LFKell) {
c.lfctrl = lfc
r = IllegalLifeCycleException implies c1 = c
else c.state = Started implies r = Ok and c1 = c
else c.state = Stopped and r = Ok and c1.state = Started and c1.kid = c.kid and c1.lfctrl = c.lfctrl
}
pred stop(c:LFTKell, lfc:LifeCycleController, r:LFReturn, c1:LFTKell) {
c.lfctrl = lfc
r = IllegalLifeCycleException implies c1 = c
else c.state = Stopped implies r = Ok and c1 = c
else c.state = Started and r = Ok and c1.state = Stopped and c1.kid = c.kid and c1.lfctrl = c.lfctrl
}
Unfortunately, in this instance, the exact semantics of the Started and Stopped
states, and hence of the start and stop operations, can only be given by reference
to the behavior of the hosting kell. We specify below this semantics, exploiting
the notion of transition. Essentially, the Stopped state is defined as one where no
transition involving signals targeting non-control gates is possible.
sig LFTKell in LFKell {}
fact LFTKellsAreTKells { all c: LFTKell | c in TKell }
fact LFTKellStoppedHasNoFunctionalTransitions {
all c:LFTKell | all t:c.transitions | c.state = Stopped implies t.(sin+sout).target in c.ctrls
}
11.11 Future work
This report formalizes the programming language independent programming model
that is at the basis of much of the work in WP2 and WP4 of the Selfman project.
We have started expanding this specification to cover the Kompics model reported
in Chapter 8 of this book, so as to formally describe the event-based execution
model that Kompics embodies. This work will find its place in a revised version
of this deliverable.
Chapter 12
D3.1b: Second report on formal models for transactions over structured overlay networks
12.1 Executive Summary
Application developers using a storage system such as a relational database or a file system require well-defined semantics for reading and writing data. In a database storage layer this also includes transaction functionality, where reads and writes over multiple data entries are Atomic, Consistent, Isolated and Durable (ACID). We aim to develop a self-managing and scalable storage layer supporting transactions, based on Structured Overlay Networks and DHTs. A system with these properties will enable applications with higher requirements on data consistency and transaction support. An example is the Wikipedia demonstrator presented in D5.2a.
In this deliverable we present a transaction algorithm suitable for DHTs, based on the Paxos transaction commit protocol [64, 107]. Transaction algorithms for DHTs are particularly challenging due to the issue of lookup inconsistencies [132]. Empirical studies [131] showed that by using quorum-based algorithms the effects of lookup inconsistencies are negligible.
12.2 Partners Contributing to the Deliverable
ZIB (P5) and KTH (P2) have contributed to this deliverable.
ZIB (P5) ZIB has contributed to the transaction model and the DHT consistency model. The largest focus during the reporting period was to finalize the transaction model (Appendix A.12) and to apply the developed techniques to the wiki demonstrator.
KTH (P2) KTH contributed to the transaction model as well as leading the work on the consistency model. The deliverable summarizes the results from the papers [131] and [132], included as Appendices A.14 and A.13.
12.3 Results
A SON-based DHT is a self-managing storage layer providing basic item manipulation primitives such as put(key, value), for inserting a new (key, value)-pair and
get(key), for retrieving a value associated with a given key [128]. Traditionally,
DHTs have been used by applications with immutable state or weak consistency
guarantees. Applications with higher requirements on data consistency and interface flexibility increasingly demand easier-to-manage and more scalable storage layers than current systems can provide [41, 110].
Self-management implies that the storage layer must deal with SON node failures. The consequence of a node failure is twofold. First, the items stored in the node's responsibility range are not available. Second, as discussed in Section 12.3.1, responsibilities may become inconsistent. The first problem is solved
by replicating items to more than one node, using e.g. symmetric replication [58].
The second issue was extensively studied in [132, 131], showing the frequency of
occurrence of this problem and methods to remedy it.
The transaction processing framework initially presented in D3.1a enables updates and/or reads over multiple data items stored in a DHT. In this deliverable
we expand this model and present an overview of the transaction algorithm. A
detailed description of the algorithm is available in the Appendix A.12.
To demonstrate the use of transactions in DHTs, a Wikipedia application was implemented on top of a DHT. In a wiki, users can concurrently make changes
implemented on top of a DHT. In a wiki, users can concurrently make changes
to the same entry. In order to avoid any data being overwritten, an update to
a wiki-page is wrapped in a transaction. When the user saves the changes, the
transaction will detect any concurrent saves from other users. The user can then
incorporate these changes in the text and try to save the wiki-entry again without
any lost changes. The wiki-application demonstrator is further described in D5.2a.
We address consistency at two levels: 1) the routing level (Section 12.3.1), working mainly with the overlay routing pointers, and 2) the data level (Section 12.3.2), working mainly with the data stored in the DHT.
12.3.1 Consistency on the Routing-Level
To achieve data consistency in DHTs with high probability, algorithms are required to achieve consistency on the routing level as well. In this section, we discuss the importance of routing-level consistency and its effect on data consistency.
It is easy to see that even without concurrent operations, data consistency can be violated in DHTs due to lookup inconsistencies. Informally, a lookup inconsistency is a case where, in an overlay configuration, multiple lookups for the same key return different results. Figure 12.1 illustrates such a configuration where lookups for key k can return inconsistent results. This configuration arises when, due to inaccuracy of the failure detector, N1 falsely suspects N2 and N3 as failed. Thus, N1 believes that the next (clockwise) alive node on the ring is N4, so it points its successor pointer to N4. Subsequently, a lookup for key k ending at N1 will return N4 as the responsible node for k, whereas a lookup ending at N2 would return N3.

Figure 12.1: An inconsistent configuration. Due to imperfect failure detection, N1 suspects N2 and N3, thus pointing to N4 as successor.
In the scenario depicted in Figure 12.1, an update for the data stored under key k will be stored at either N3 or N4. A read of the data at k will return inconsistent or stale results if it reaches the node that did not receive the update.
The aforementioned scenario shows that an inconsistency on the routing level leads to inconsistency on the data level, i.e. data inconsistency. Thus, as a first step towards data consistency in DHTs, we aim at achieving routing-level consistency.
In our work, first, we ran simulations to see the frequency of occurrence of
lookup inconsistencies. We showed that even if there is no churn in the system,
there will be lookup inconsistencies due to imperfect failure detectors. This is an
important result as previous research mainly focuses on churn. Next, we devised
two techniques to reduce lookup inconsistencies, 1) local responsibilities and 2)
quorum-techniques.
The basic idea of using local responsibilities is to modify the lookup operation
such that a lookup always returns from the locally responsible node. A node n is
said to be locally responsible for a certain key, if the key is in the range between
its predecessor and itself, noted as (n.pred, n]. Thus, before returning the result
of a lookup, the node checks if it is locally responsible for the key being looked
up. Using this technique reduces inconsistencies significantly, but has a side-effect
of keys being unavailable. This can be seen as a trade-off between availability and
consistency.
Since DHTs replicate data on different nodes to increase availability and prevent loss of data, we employ these replicas to increase consistency by using majority-based quorum techniques. Thus, a read or write operation has to operate on a majority of the replicas. Due to lookup inconsistencies, a single replica might appear as two replicas, thus changing the number of replicas in the system. Consequently, an operation can still be inconsistent, yet with lower probability than without quorum techniques. The reason is that previously a lookup inconsistency would directly generate inconsistent data, whereas with the aforementioned technique, even with a lookup inconsistency, multiple intersecting quorums exist, which eventually leads to data consistency. A sketch of a majority-based read is given below.
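For illustration only, the sketch below shows the majority rule for a replicated read: the operation succeeds once more than half of the configured replicas have answered, and the reply with the highest version number wins. The names (ReplicaReply, majorityRead) are hypothetical; the deployed transaction code is written in Erlang.

import java.util.List;

/** Illustrative sketch (not SELFMAN code): majority-based read over replicas. */
public final class MajorityRead {

    /** A reply from one replica: the stored value and its version number. */
    public record ReplicaReply(String value, long version) {}

    /**
     * Returns the value with the highest version if a majority of the
     * replicationFactor replicas replied, otherwise null (read must be retried).
     */
    public static String majorityRead(List<ReplicaReply> replies, int replicationFactor) {
        int majority = replicationFactor / 2 + 1;
        if (replies.size() < majority) {
            return null;               // not enough replicas answered
        }
        ReplicaReply newest = replies.get(0);
        for (ReplicaReply r : replies) {
            if (r.version() > newest.version()) {
                newest = r;            // keep the most recent version
            }
        }
        return newest.value();
    }
}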
The details of the work done on routing-level consistency as part of this deliverable were published as [131] and [132], both included in the appendix.
12.3.2 Transactional DHTs
With replication, concurrent get/put operations on a single item could block the system if all replicas had to be available. In order to make progress for each operation, this requirement is relaxed by using majority-based reads and writes. Reads and writes are then successful even if a minority of the replicas have failed. The
replication factor decides how many replicas are supposed to be available in the
system. This is a critical parameter for the system to function and must be set
appropriately depending on system churn.
A transaction involves a read- and write-set over one or more items. Each
transaction is executed optimistically. This means that if a transaction fails due to, for example, another concurrent transaction with an overlapping write/read-set, the full transaction must be re-executed by the initiator. With optimistic
transactions, items are only locked during the commit phase of the transaction
algorithm.
The node types involved in the execution of a transaction are the client, a set of transaction managers (TMs) whose number corresponds to the replication factor, and a set of transaction participants (TPs) consisting of the nodes responsible for the items in the read/write-set as well as the item replica nodes.
The first node contacted by the client is part of the TMs and is called the leader.
Transaction algorithm The transaction processing consists of two phases: read and commit, as shown in Figure 12.2. During the read phase, the client initiates a leader. The leader maintains the state of the read and write operations from the client and contacts the TPs to find out the item values and versions. The client can either decide to commit the transaction, if all intermediary conditions were satisfied, or cancel it. If the client commits the transaction, the leader starts the commit phase. The client API for transactions is described in Deliverable D3.3a.
The commit phase contains three internal phases: Initialization, Validation and Consensus. During the Initialization phase, the leader finds all TMs, which correspond to the replica nodes responsible for the transaction ID. The leader waits for a majority of TMs, which are necessary for the protocol to make progress. If a leader fails, a new leader needs to be elected among the TMs; thus, in the first step of the Validation phase, the leader tells all TMs about all other TMs. The second step is to send a prepare message to all TPs, including the replicas, responsible for an item. When a TP receives a prepare request, it compares the item versions and, if the request is valid, sends a commit vote to all TMs, otherwise an abort vote. A TP also locks the item in the transaction at this point. The involved items remain locked until the TP learns the outcome of the transaction. The sending of the vote messages starts the Consensus phase, which follows the atomic commit protocol [64]. The TPs send their decisions to the TMs, which then forward the result to the leader. The leader collects all results and decides whether the transaction succeeded or not. This result is then shared with the client, the TMs and the TPs. A sketch of the leader's decision rule is given after Figure 12.2. A detailed description of the algorithm is available in [107], which can also be found in the appendix.
Figure 12.2: The different phases and message exchanges in a single instance
of the transaction protocol.
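To make the leader's decision rule concrete, the following hedged sketch collects the votes sent by the TPs and commits only if every vote is a commit vote; otherwise the transaction is aborted and must be retried by the initiator. The names (Vote, Outcome, decide) are ours, and the sketch ignores the Paxos-style fault tolerance that the full protocol in [107] adds on top of this rule.

import java.util.Collection;

/** Illustrative sketch (not SELFMAN code): the leader's commit decision. */
public final class CommitDecision {

    public enum Vote { COMMIT, ABORT }
    public enum Outcome { COMMITTED, ABORTED }

    /**
     * The transaction commits only if all collected votes are COMMIT;
     * a single ABORT vote (e.g. a stale item version seen by a TP) aborts it.
     */
    public static Outcome decide(Collection<Vote> votes) {
        for (Vote v : votes) {
            if (v == Vote.ABORT) {
                return Outcome.ABORTED;
            }
        }
        return Outcome.COMMITTED;
    }
}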
12.4 Conclusion
Due to the dynamics and decentralization of DHTs and the asynchronous nature
of the Internet on which DHTs are deployed, it is difficult to build abstractions
with stronger consistency guarantees on top of DHTs. We propose using techniques on both the routing level and the data level to decrease data inconsistencies. Although it is impossible to guarantee consistency, availability and partition tolerance in Internet-based systems [59], our results show that it is reasonable
(consistency is maintained with high probability) to build reliable services on top
of a DHT. As an application of a transactional storage service on top of a DHT,
a distributed Wikipedia was implemented (D5.2a).
While studying the factors contributing to inconsistencies in DHTs, we found
that, besides the obvious factor of churn, imperfect failure detectors contribute significantly to generating inconsistencies. Thus, the choice of a failure detection algorithm, the implementation of the failure detector and the trade-off between the accuracy and
completeness of a failure detector are of crucial importance in DHTs.
Chapter 13
D3.2a: Report on replicated storage service over a structured overlay network
13.1 Executive summary
The replicated storage service is based on Chord# and the transaction framework
presented in D3.1b (see Chap. 12). It is completely developed in Erlang. The
storage service can be accessed using a command line interface or from Java. The
Java Interface is described in more detail in D3.3a (see Chap. 14).
13.2 Contractors contributing to the Deliverable
ZIB (P5) has contributed to this deliverable.
ZIB (P5) ZIB has contributed to the design and implementation of the replicated storage service.
13.3 Introduction
The replicated storage service is based on Chord# and the transaction framework
presented in D3.1b (see Chap. 12). It is completely developed in Erlang and the
architecture is based on three layers:
DHT layer At the bottom is a DHT, Chord#, which is described in great detail in Sec. A.3. It provides a simple key-value store with range queries and load balancing. For load balancing we use the algorithm of [81]; in the future we will investigate how to include other algorithms described in D4.3a (see Chap. 17).
Replication Layer The middle layer implements a simple replication scheme based on symmetric replication [58]. For Chord#, we use different prefixes to identify the different replicas, and the load-balancing scheme distributes the data in a way similar to symmetric replication. The replication degree as well as the prefixes can be specified before startup in the configuration file (a sketch of the prefix idea is given after this list).
Transaction Layer On top of the replication layer, we implemented the transaction algorithms described in D3.1b (see Chap. 12). Further details and
evaluations can be found in [107, 131, 132].
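To illustrate the prefix-based replica placement mentioned in the Replication Layer item above, the following sketch derives the replica keys for a given user key; the method name replicaKeys is hypothetical and the real implementation is part of the Erlang code base.

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch (not Chord# code): deriving replica keys by prefix. */
public final class ReplicaKeys {

    /**
     * Given a user key and the configured replica prefixes, return the keys
     * under which the replicas are stored. With four prefixes this yields a
     * replication degree of four.
     */
    public static List<String> replicaKeys(String key, List<String> prefixes) {
        List<String> keys = new ArrayList<>();
        for (String prefix : prefixes) {
            keys.add(prefix + key);   // each replica lives under its own prefix
        }
        return keys;
    }
}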
The storage service can be accessed using a command line interface or from
Java. The Java Interface is described in more detail in D3.3a (see Chap. 14).
13.4 Installation and Configuration
Source Code is available at http://www.zib.de/schuett/chordsharp-selfman.tgz.
Note that this is an internal SelfMan release.
Chord# Directory Structure. The directory tree under chordsharp is structured as follows:
src       contains the Chord# source code
bin       contains shell scripts needed to work with Chord# (e.g. start the boot services, start a node, ...)
docs      contains Chord# documentation files
java-api  contains the Java API
13.4.1 Requirements
For building and running Chord# , some third-party modules are required which
are not included in the Chord# release:
• Erlang R12
• GNU Make
• rrdtool
Note that version R12 of Erlang is required; Chord# will not work with older versions.
To build the Java API the following modules are required additionally:
• Java Development Kit 1.6
• Apache Ant
Before building the Java API, make sure that JAVA_HOME and ANT_HOME are set. JAVA_HOME has to point to a JDK 1.6 installation, and ANT_HOME has to point to an Ant installation.
13.4.2 Building Chord#
Go into the chordsharp directory and execute:
%> ./configure
%> make
%> make rrd-init
%> make docs
The configure script will probably note that you are missing common_test. You can ignore this message, as the module is not required for the correct execution of Chord#.
13.4.3 Installation
Note: there is no make install at the moment! The nodes have to be started
from the bin directory.
13.4.4 Configuration
Chord# is configured by two configuration files (bin/chordsharp.cfg and bin/chordsharp.local.cfg).
It will read the former for default values and then the latter which can override the
defaults. After going through the build process there will be no chordsharp.local.cfg.
It has to be created by the user, because there are two configuration parameters
which have no default value: boot_host and log_host, and Chord# won’t start if
the file is missing.
%IP Address, Port, and label of the boot server
{boot_host, {{130,73,72,80},14195,boot}}.
%IP Address, Port, and label of the log server
{log_host, {{130,73,72,80},14195,boot_logger}}.
boot_host defines the node where the boot server is running.
%possible values: 14195, [14195, 14196, 14197] (list of ports), or 14195, 15000 (range of ports)
{listen_port, 14195}.
%undefined or an ip tuple, e.g. 130.73.108.1
{listen_ip, undefined}.
13.5 User Guide
13.5.1 Starting Chord#
In Chord# there are two kinds of processes:
• boot servers
• regular servers
In every Chord# system, at least one boot server is required. It maintains a list of nodes taking part in the system and allows other nodes to join the ring. For
redundancy, it is also possible to have several boot servers.
Open at least two shells. In the first, go into the bin directory:
%> cd bin
%> ./boot.sh
This will start the boot server. On success http://localhost:8000 should point
to the statistics page of the boot server. The main page will show you the number
of nodes currently in the system. After a couple of seconds a first Chord# node should
have started in the boot server and the number should increase to one. The main
page will also allow you to store and retrieve key-value pairs.
The boot server should show output similar to the following when starting the first Chord# nodes. The first line is printed when the Chord# node is spawned. Afterwards it will try to connect to the boot server. When the third line is printed, it has managed to contact the boot server and joined the ring. In this case, it was the first node in the ring.
[ I | Node | <0.97.0> ] joining "23947834870"
[ I | Node | <0.97.0> ] join as first [50,51,57,52,55,56,51,52,56,55,48]
[ I | Node | <0.97.0> ] joined
In a second shell, you can now start a second Chord# node. This will be a
“regular server”. Go into the bin directory:
%> cd bin
%> ./cs_local.sh
The second node will read the configuration file and use this information to contact
the boot server and will join the ring. The number of nodes on the web page should
have increased to two by now.
Optionally, a third and fourth node can be started on the same machine. In a
third shell:
%> cd bin
%> ./cs_local2.sh
In a fourth shell:
%> cd bin
%> ./cs_local3.sh
This will add 3 nodes to the network. The web pages should show the additional
nodes.
Chord# can be installed on other machines in the same way as described in
Sect. 13.4. Please make sure that the chordsharp.local.cfg is the same on all nodes; otherwise the other nodes will not find the boot server. On the remote nodes, you only need to call ./cs_local.sh; they will automatically contact the configured boot server.
13.5.2 Java-API
The following commands will build the Java API for Chord# :
%> cd java-api
%> ant
This will build chordsharp4j.jar, which is the library for accessing the overlay
network. Optionally, the documentation can be built:
%> ant doc
The jar file additionally contains a small CLI client.
%> java -jar chordsharp4j.jar -help
usage: chordsharp
-getsubscribers <topic>   get subscribers of a topic
-help                     print this message
-publish <params>         publish a new message for a topic: <topic> <message>
-read <key>               read an item
-subscribe <params>       subscribe to a topic: <topic> <url>
-write <params>           write an item: <key> <value>
Read and write can be used to read from resp. write to the overlay; getsubscribers, publish, and subscribe are the PubSub functions.
%> java -jar chordsharp4j.jar -write foo bar
write(foo, bar)
%> java -jar chordsharp4j.jar -read foo
read(foo) == bar
The chordsharp4j library requires that you are running a “regular server” on
the same node. Having a boot server running on the same node is not sufficient.
Chapter 14
D3.3a: Simple database query layer for replicated storage service
14.1 Executive summary
The “Simple database query layer” is a small Java library for accessing the replicated storage service described in D3.2a (see Chap. 13). It provides functions
for reading and writing key-value pairs. Several read and write requests can be
executed within a transaction.
14.2 Contractors contributing to the Deliverable
ZIB (P5) has contributed to this deliverable.
ZIB (P5) ZIB has contributed to the API design, implemented the Java API and developed the Java-to-Erlang interface.
14.3 Introduction
The “Simple database query layer for replicated storage service” is a Java API for
accessing data stored in the replicated storage service presented in D3.2a (see
Chap. 13). The storage service has a native interface for accessing the storage,
which allows to specify transactions in Erlang. This interface is very similar to
the API for mnesia, an Erlang database. For Java users, we developed a more
traditional interface.
Simple API The simple API allows reading and writing key-value pairs. The respective read and write operations are executed within a transaction and the replicas are accessed with strong consistency; however, each transaction contains exactly one operation, a read or a write. The functions are provided by the ChordSharp class.
Transactions The Transaction class provides a more powerful interface, as several operations can be executed within one transaction.
For both interfaces we use JInterface, a Java library which can send messages to Erlang VMs. The communication between the Java VM and the Erlang VM uses the native Erlang protocol, and the Transaction and ChordSharp classes provide a wrapper around JInterface to make it more usable for Java programmers. A short usage sketch of the simple API is given below.
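As an orientation for Java users, the following sketch shows how the ChordSharp class documented in Section 14.4.1 might be used from application code; it assumes a running local Chord# node and chordsharp4j.jar on the classpath, and the key names and URL are made up for the example.

import de.zib.chordsharp.ChordSharp;

/** Minimal usage sketch of the simple API (assumes a running local Chord# node). */
public class SimpleApiExample {
    public static void main(String[] args) {
        try {
            // Each call below is executed as its own single-operation transaction.
            ChordSharp.write("wiki:MainPage", "Hello SELFMAN");
            String value = ChordSharp.read("wiki:MainPage");
            System.out.println("read(wiki:MainPage) == " + value);

            // PubSub: register a subscriber and publish an event.
            ChordSharp.subscribe("wiki-updates", "http://localhost:8080/notify");
            ChordSharp.publish("wiki-updates", "MainPage changed");
            System.out.println(ChordSharp.getSubscribers("wiki-updates"));
        } catch (Exception e) {
            // ConnectionException, TimeoutException, NotFoundException, UnknownException
            e.printStackTrace();
        }
    }
}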
14.4 API
14.4.1 de.zib.chordsharp.ChordSharp
public class ChordSharp
Public ChordSharp Interface.
Version: 1.1
Author: Nico Kruber, [email protected]
Method Summary
static Vector<String> getSubscribers(String topic)
    Gets a list of subscribers of a topic.
static void publish(String topic, String content)
    Publishes an event under a given topic.
static String read(String key)
    Gets the value stored with the given key.
static void subscribe(String topic, String url)
    Subscribes a url for a topic.
static void write(String key, String value)
    Stores the given key/value pair.
read
public static String read(String key)
throws ConnectionException,
TimeoutException,
UnknownException,
NotFoundException
Gets the value stored with the given key.
Parameters: key - the key to look up
Returns: the value stored under the given key
Throws:
ConnectionException - if the connection is not active or a communication error occurs or an exit signal was received or the remote node sends a message containing an invalid cookie
TimeoutException - if a timeout occurred while trying to fetch the value
NotFoundException - if the requested key does not exist
UnknownException - if any other error occurs
write
public static void write(String key,
String value)
throws ConnectionException,
TimeoutException,
UnknownException
Stores the given key/value pair.
Parameters:
key - the key to store the value for
value - the value to store
Throws:
ConnectionException - if the connection is not active or a communication error occurs or an exit signal was received or the remote node sends a message containing an invalid cookie
TimeoutException - if a timeout occurred while trying to write the value
UnknownException - if any other error occurs
publish
public static void publish(String topic,
String content)
throws ConnectionException
Publishes an event under a given topic.
Parameters:
topic - the topic to publish the content under
content - the content to publish
Throws:
ConnectionException - if the connection is not active or a communication error occurs or an exit signal was received or the remote node sends a message containing an invalid cookie
subscribe
public static void subscribe(String topic,
String url)
throws ConnectionException
Subscribes a url for a topic.
Parameters:
topic - the topic to subscribe the url for
url - the url of the subscriber (this is where the events are sent to)
Throws:
ConnectionException - if the connection is not active or a communication error occurs or an exit signal was received or the remote node sends a message containing an invalid cookie
getSubscribers
public static Vector<String> getSubscribers(String topic)
throws ConnectionException,
UnknownException
Gets a list of subscribers of a topic.
Parameters: topic - the topic to get the subscribers for
Returns: the subscriber URLs
Throws:
ConnectionException - if the connection is not active or a communication error occurs or an exit signal was received or the remote node sends a message containing an invalid cookie
UnknownException - if the return type of the Erlang method does not match the expected one
14.4.2 de.zib.chordsharp.Transaction
public class Transaction
Provides means to realise a transaction with the chordsharp ring using Java.
It reads the connection parameters from a file called ChordSharpConnection.properties
or uses default properties defined in ChordSharpConnection.defaultProperties.
OtpErlangString otpKey;
OtpErlangString otpValue;
OtpErlangString otpResult;
String key;
String value;
String result;
// Transaction()
Transaction transaction = new Transaction();
// start()
transaction.start();
// write(OtpErlangString, OtpErlangString)
transaction.write(otpKey, otpValue);
// write(String, String)
transaction.write(key, value);
//read(OtpErlangString)
otpResult = transaction.read(otpKey);
//read(String)
result = transaction.read(key);
// commit()
transaction.commit();
For more examples, have a look at TransactionReadExample, TransactionParallelReadsExample, TransactionWriteExample and TransactionReadWriteExample.
Attention:
If a read or write operation fails within a transaction, all subsequent operations
on that key will fail as well. This behaviour may particularly be undesirable if a
read operation just checks whether a value already exists or not. To overcome this
situation, call revertLastOp() immediately after the failed operation, which restores
the state as it was before that operation.
The TransactionReadWriteExample example shows such a use case.
Version: 1.0
Author: Nico Kruber, [email protected]
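To illustrate the pattern described in the Attention note above, the following sketch checks whether a key exists inside a transaction and calls revertLastOp() after a failed read so that the transaction stays usable; it only uses methods documented in this section, while the key name and the written value are made up.

import de.zib.chordsharp.Transaction;

/** Sketch of the revertLastOp() pattern from the Attention note (hypothetical key names). */
public class ExistenceCheckExample {
    public static void main(String[] args) throws Exception {
        Transaction t = new Transaction();
        t.start();
        try {
            t.read("wiki:MainPage");   // only checks whether the page already exists
        } catch (Exception notFound) { // documented as NotFoundException
            t.revertLastOp();          // undo the failed read so later operations
                                       // on this key do not fail as well
            t.write("wiki:MainPage", "initial content");
        }
        t.commit();
    }
}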
Constructor Summary
Transaction()
Creates the object’s connection to the chordsharp node specified in the
“ChordSharpConnection.properties” file.
Method Summary
void abort()
    Cancels the current transaction.
void commit()
    Commits the current transaction.
OtpErlangString read(OtpErlangString key)
    Gets the value stored under the given key.
String read(String key)
    Gets the value stored under the given key.
void revertLastOp()
    Reverts the last (read or write) operation by restoring the last state.
void start()
    Starts a new transaction by generating a new transaction log.
void write(OtpErlangString key, OtpErlangString value)
    Stores the given key/value pair.
void write(String key, String value)
    Stores the given key/value pair.
Transaction
public Transaction()
throws ConnectionException
Creates the object’s connection to the chordsharp node specified in the “ChordSharpConnection.properties” file.
Throws: ConnectionException - if the connection fails
start
public void start()
throws ConnectionException,
TransactionNotFinishedException,
UnknownException
Starts a new transaction by generating a new transaction log.
Throws:
ConnectionException - if the connection is not active or a communication error occurs or an exit signal was received or the remote node sends a message containing an invalid cookie
TransactionNotFinishedException - if an old transaction is not yet finished (via commit() or abort())
UnknownException - if the returned value from Erlang does not have the expected type/structure
commit
public void commit()
throws UnknownException,
ConnectionException
Commits the current transaction. The transaction’s log is reset if the commit was successful; otherwise it is retained in the transaction, which must be successfully committed or aborted before it can be restarted.
Throws:
UnknownException - if the commit fails or the returned value from Erlang is of an unknown type/structure. Neither the transaction log nor the local operations buffer is emptied, so that the commit can be tried again.
ConnectionException - if the connection is not active, a communication error occurs, an exit signal was received, or the remote node sends a message containing an invalid cookie
See Also:
abort()
abort
public void abort()
Cancels the current transaction.
For a transaction to be cancelled, only the transLog needs to be reset. Nothing
else needs to be done since the data was not modified until the transaction was
committed.
read
public OtpErlangString read(OtpErlangString key)
throws ConnectionException,
TimeoutException,
UnknownException,
NotFoundException
Gets the value stored under the given key.
Parameters: key - the key to look up
Returns:
the value stored under the given key
Throws:
ConnectionException - if the connection is not active, a communication error occurs, an exit signal was received, or the remote node sends a message containing an invalid cookie
TimeoutException - if a timeout occurred while trying to fetch the value
NotFoundException - if the requested key does not exist
UnknownException - if any other error occurs
read
public String read(String key)
throws ConnectionException,
TimeoutException,
UnknownException,
NotFoundException
Gets the value stored under the given key.
Parameters: key - the key to look up
Returns:
the value stored under the given key
Throws:
ConnectionException - if the connection is not active, a communication error occurs, an exit signal was received, or the remote node sends a message containing an invalid cookie
TimeoutException - if a timeout occurred while trying to fetch the value
NotFoundException - if the requested key does not exist
UnknownException - if any other error occurs
write
public void write(OtpErlangString key,
OtpErlangString value)
throws ConnectionException,
TimeoutException,
UnknownException
Stores the given key/value pair.
Parameters: key - the key to store the value for value - the value to store
Throws:
ConnectionException - if the connection is not active, a communication error occurs, an exit signal was received, or the remote node sends a message containing an invalid cookie
TimeoutException - if a timeout occurred while trying to write the value
UnknownException - if any other error occurs
write
public void write(String key,
String value)
throws ConnectionException,
TimeoutException,
UnknownException
Stores the given key/value pair.
Parameters: key - the key to store the value for value - the value to store
Throws:
ConnectionException - if the connection is not active, a communication error occurs, an exit signal was received, or the remote node sends a message containing an invalid cookie
TimeoutException - if a timeout occurred while trying to write the value
UnknownException - if any other error occurs
revertLastOp
public void revertLastOp()
Reverts the last (read, parallelRead or write) operation by restoring the last state. If no operation was initiated yet, this method does nothing.
This method is especially useful if, after an unsuccessful read, a value should be written under the same key, which is not possible as long as the failed read is still in the transaction’s log.
Chapter 15
D4.1a: First report on
self-configuration support
15.1
Executive summary
The work on self-configuration support in the second year covered mostly the
development of an Oz-based framework (called FructOz) for the construction of
self-deployable and self-configurable components. This work builds on three earlier
developments by the Selfman partners:
• The Oz/Mozart distributed programming environment, which is used for
supporting distributed deployment and configuration processes, and for integrating self-deployment and self-configuration capabilities within component
packages.
• The Fractal component model, which provides the basic concepts and structures for defining self-configurable components, including basic introspection
and navigation capabilities.
• The FPath language for navigating and querying Fractal architectures (an
analog of the XPath language for querying XML documents).
The development of FructOz was motivated primarily by two objectives:
1. To provide basic support for complex deployment and configuration processes, including the definition of potentially complex workflows as exhibited
e.g. by large enterprise-wide software deployments.
2. To provide basic support for embedding deployment and configuration capabilities within component themselves, as a first step towards the construction
of distributed and self-configurable components.
The deliverable presents in more detail the motivations and objectives for the
work, focusing in particular on shortcomings of existing technology to support
complex deployment and configuration processes. After a presentation of related
work, the deliverable introduces briefly the language-independent reference model
which we use as our baseline. The reference model introduces the notion of component package as a unit of executable software installation, and defines a minimal
structure required for performing local, i.e. node level, component deployment.
The deliverable then presents the FructOz framework, which exploits the Oz environment as a partial implementation of the reference model, and which consists
of two parts:
• A lightweight implementation of the Fractal model in Oz, and an implementation of component packages as Oz functors.
• An implementation of a dynamic variant of the FPath query language, which makes it possible to automate the generation of non-trivial monitoring predicates on a
distributed component implementation (reified as a Fractal architecture).
To illustrate how our technical requirements for supporting complex deployment processes are met with the FructOz framework, we then present a set of
examples, including the deployment of a cluster-sized system and the construction of a self-configurable component, which continuously monitors itself and can react
to changes in its internal configuration.
The deliverable also presents some preliminary evaluation of the FructOz framework, in the form of comparative micro-benchmarks for local deployment, and of a
comparison of the performance of different deployment processes built with FructOz.
The deliverable ends with a discussion of the limitations of the FructOz framework and of future work. To support the claim that FructOz provides a way to
handle complex deployment and configuration processes, we have added, as a supplement, material documenting how well-known control-flow workflow patterns can
be supported in Oz.
15.2
Contractors contributing to the deliverable
INRIA contributed to this deliverable, with the help of UCL for developing the
FructOz framework and for carrying out the preliminary experiments reported in
this deliverable.
15.3
Motivations
Deploying and configuring a distributed software system can be a complex process,
involving multiple distributed activities. This complexity is well illustrated by papers and specifications which document deployment and configuration activities.
For instance, Hall et al. [69] characterize software deployment as a collection of interrelated activities (such as release, install, adapt, reconfigure, update, activate, deactivate, remove and retire); Coupaye et al. [36] extend this analysis to activities involved in the enterprise-wide deployment of large software applications (including activities such as assemble, install, and activate, which subsume or encompass those identified in [69]); and standards such as the OMG specification for the deployment and configuration of component-based applications [65] identify a number of activities in a deployment process (including installation, configuration, planning, preparation, and launch).
A general approach for dealing with this complexity has been proposed by
van der Hoek [143], under the term architecture-based deployment and configuration. Roughly, the main thrust of the approach is to exploit software architecture
descriptions (possibly extended with domain-specific annotations) to drive deployment and configuration activities. As already noted in [143], an architecture-based
approach to deployment and configuration management has a number of benefits:
• A rise in abstraction level, which allows idiosyncrasies of system configuration and deployment in legacy and heterogeneous systems to be encapsulated and dealt with uniformly.
• A seamless integration between software configuration management activities and software deployment activities, limiting architectural erosion and
enabling a more rapid development / deployment cycle.
Actually, these benefits carry over more generally in an architecture-based approach to distributed systems management, as demonstrated by works such as
Rainbow [54], Automate [109], and our own Jade [22]. In this broader context, software architecture descriptions can serve as pivot information for multiple management functions, including fault management [134], performance management [23],
and security management [33], and an architecture description language (ADL)
can serve as a pivot language for supporting tools.
An architecture-based approach to configuration management and deployment
has been pursued in a number of different works, including e.g. [6, 7, 11, 28, 30,
37, 49, 72, 73, 87, 90, 105, 104, 112].
In these works, the deployment process is either an ad-hoc algorithm or task
framework operating on descriptions of mostly static software architectures, as
in [6, 28, 49, 83, 87, 90, 112], or generated by a constraint solving algorithm or
an automated AI planner, that interprets software descriptions with deployment
constraints as a goal, as in [11, 37, 73, 104]. The level of support for the deployment process that is provided in these different works is unsatisfactory, however.
Deployment in situations such as the enterprise environments envisaged by [36] must take into account multiple constraints, policies, synchronization and error conditions that require more sophisticated distributed coordination facilities than are provided by these different approaches, or than can currently be handled effectively by an AI planner. In particular, we expect support for deployment and (re)configuration processes to satisfy the following requirements, which are not adequately covered in the existing literature:
1. Ability to define complex deployment workflows, including support for well-known control flow and exception patterns.
2. Ability to define parameterized and higher-order workflows for the concise
specification of complex distributed architectures.
3. Ability to finely control activities involved in the deployment and (re)configuration
of a distributed software architecture, and the moment they are triggered.
4. Ability to internalize deployment and (re)configuration activities in the description of a self-configurable software architecture.
Requirement 1 concerns the ability to directly support the main patterns of synchronization and exception handling that have been identified by the workflow
community [142]. These patterns are useful yardsticks for the specification and
programming of deployment and configuration processes, for they embody higher-level abstractions and operators for process synchronisation and task composition
which have proved useful in the design of enterprise-wide formal processes and
on-line services.
Requirement 2 concerns the ability to define parametric deployment and configuration processes, and is key to support scalable and compositional definitions
of deployment and configuration processes. Parameters of a deployment and configuration process may include, for instance, the size and structure of the target
environment, or binding and interconnection schemas between deployed components as in multi-tier systems.
Requirement 3 concerns in particular the ability to delay the deployment and
configuration of individual components up to the point where they are finally
needed. This lazy deployment capability in turn supports different performance
and adaptation tradeoffs.
Requirement 4 is a necessary step towards autonomic components and systems.
In this paper, we present an approach for the development of architecture-based deployment processes that meets the above requirements. It is based on the
Fractal component model [25], and the Oz programming language and its Mozart
distributed environment [144]. Specifically, we present an Oz framework, called
FructOz, that can be used for the development of complex distributed deployment and (re)configuration workflows, and the development of self-configurable
distributed components. FructOz can be understood as providing the basis for the
description of dynamic distributed software architectures, i.e. software architectures whose descriptions embody provisions for change and evolution, in response
to events from their environment.
The paper makes the following contributions:
• We introduce a language-independent reference model for distributed deployment and configuration.
• We introduce the FructOz framework, written in the Oz programming language, that can be used for building complex distributed software architectures, including complex deployment and reconfiguration workflows.
• We illustrate via several examples how the above identified requirements are
met by the use of the FructOz framework and the Oz programming language.
• We report preliminary performance results that demonstrate the viability
and the potential benefits of the FructOz approach to distributed deployment
and configuration.
15.4
Related work
The reference model presented in Section 15.5 builds on several works, including
the consideration of general component dependencies as in [87], a notion of local
installation store inspired from the Nix system [46], and a notion of component
package analogous to that of the Edos project [99]. Our notions of component
and component packages are analogous to, but generalize the wiring notions of
Assemblages [96]; in particular, plugging and mixing notions in Assemblages are
examples of our notion of binding between components and component packages.
Our reference model does not attempt to identify formally different activities or
phases involved in a deployment process, e.g. as in [69, 65]. Not only are these different phases potentially environment- and application-dependent, but our examples on lazy deployment and on self-configurable components show that such distinctions can be elusive.
Previous approaches for supporting complex distributed deployment processes
fall roughly into four categories: (i) approaches that rely on a fixed deployment
algorithm or more extensive task framework, driven by (static) software architecture descriptions, including [6, 7, 28, 30, 42, 49, 83, 85, 87, 90, 105, 112, 153]; (ii)
approaches that rely on a workflow language for describing deployment processes,
including [8, 63, 82, 122, 141]; (iii) approaches that rely on a constraint solving
algorithm for determining a deployment target, including [16, 104, 98, 139]; (iv)
approaches that rely on an AI planner for generating deployment processes, including [11, 73, 84].
Approaches that rely on a fixed deployment algorithm or a supporting framework do not meet the requirements we have identified in the introduction. However, the work reported in this paper could be exploited in a number of the frameworks referenced above. In particular, the work in this paper complements our
previous work on component deployment and configuration in Jade [6], and would
be directly usable in the FDF [49] framework.
Approaches that rely on constraint solving and AI planning are interesting for their high degree of declarativity, but they do not provide sufficient support for dealing with complex deployment workflows, nor for effectively supporting our four requirements above. Approaches based on constraint-solving tend to focus on specific deployment problems and objectives (e.g. ensuring a certain degree
of availability [98]). They must be complemented with additional mechanisms to
deal correctly with exceptional conditions or to ensure synchronization conditions
which are not enforceable within their scheduling framework. Approaches based
on AI planning are interesting because they have the potential to automate the
generation of complex deployment workflows, but it is not clear at this point that
they can be employed successfully beyond simple system configurations. There
is certainly promising work for integrating AI planning techniques with workflow
management systems [118] but issues with respect to expressivity and scalability
are still open.
Approaches that rely on a workflow language are closer to ours. They include
the SmartFrog system [8, 63], the Workflakes system [141], the Andrea system
[122], and the use of the BPWS4J workflow engine with the IBM Tivoli deployment engine for the provisioning of application services [82]. Compared to our approach,
which relies on the coordination and distributed programming capabilities of the
Oz programming language, these systems do not provide direct support for parameterized, higher-order workflows (requirement 2 – see our case study 15.7.1),
and they do not provide direct support for lazy execution (requirement 3 – see
our case study 15.7.3). Also the use of a workflow or process management engine
remains essentially external to the components used or the system being managed,
and no provision has been made for the construction of self-deployable components
and self-configurable hierarchical components (requirement 4 – see our case study
15.7.5). This is also the case with the SmartFrog system, even though its workflow
constructs take the form of SmartFrog components, because these are used for the
structuring of the initial deployment process. Note that, in our approach, even if
components are not programmed in Oz, internalizing a complex deployment and
configuration behavior is made simpler by the reflective character of the Fractal
model. We could, for instance, make use of the aspect-oriented capabilities of
the reference implementation of Fractal in Java, to program advices interfacing
directly to the (meta-level) deployment and configuration behavior written in Oz.
Our work is also related to architecture description languages for dynamic architectures, i.e. ADLs that can describe dynamically evolving architectures. A
recent survey of such ADLs can be found in [24]. The survey shows that the
subject of ADLs for dynamic architectures is well researched but that none of
the surveyed approaches provided support for unconstrained evolution. Interestingly, this can be traced to the fact that none of the surveyed approaches (such as Dynamic Wright, Darwin, etc.) are higher-order, e.g. none can specify the receipt of a new component from the environment that is not already specified in
the original architecture description. Although some of the works surveyed such
as Darwin [56] provide a supporting distributed infrastructure, none deal directly
with deployment issues. Another ADL for dynamic architectures is Plastik [78],
which benefits from a supporting infrastructure that provides a causal connection between architecture descriptions and the supporting OpenCOM component
model [35]. The Plastik infrastructure provides basic support for local deployment,
through its loader component, and supports reconfiguration scripts coded with the
Lua programming language. Plastik provides good support for local reconfiguration but does not provide direct support for our requirements above.
15.5
Reference model
The architectural background for our work takes the form of a programming-language-independent reference model. It consists of two main elements: a component model and a local node structure. The component model is a refinement
of the Fractal component model to make explicit notions of software component
packages and their dependencies. The local node structure identifies a set of abstractions and functions for local installation and deployment.
The component model we adopt as our basis is the Fractal component model
[25]. Fractal is a general component model which builds on classical software
architecture concepts as captured e.g. in ACME [55]: components as encapsulated
data and behavior, which clearly expose their dependencies and connections to their environment via interfaces (entry points to components, which generalize the port notion) and bindings (interrelations between components, which generalize the connector notion). To this classical foundation, Fractal adds the following
elements:
• Bindings can be reified as components, in particular to build them by composition.
• Components can be endowed with meta-level behavior, made explicit via
so-called controller interfaces.
• Components can be shared between multiple composite components, i.e. the
containment relation is in general a directed acyclic graph of components
and not necessarily a tree.
• Fractal does not impose a predefined semantics for meta-level behavior, component containment and bindings.
The general form of a component is that of a composite, with several subcomponents, and a membrane, which encompasses all the meta-level activity of the component, as well as base-level (or functional) behavior that is characteristic of the
component (i.e. not delegated to its subcomponents). A component membrane can
thus contain different controllers such as a content controller, which provides access
to, and allows manipulation of subcomponents, a lifecycle controller, which provides control over the execution status of the component, an attribute controller,
which provides access to data associated with the component, etc.
A Fractal component is a run-time entity, i.e. an entity whose presence is manifest during execution. Deploying and configuring a Fractal application informally implies:
installing the appropriately configured executables (e.g. source code, binaries) at
chosen nodes (i.e. physical or virtual machines), creating and activating the necessary components (including the necessary bindings constituting interaction pathways with other components).
In our reference model, executables are made explicit and manipulated by
means of component packages. A component package (or package for brevity) is a
bundle that contains executables necessary for creating run-time components, as
well as data needed for their correct functioning, and metadata describing their
properties and requirements with respect to the target environment. A package is itself a Fractal component, which means it can contain other packages, be
shared between multiple containing packages, provide certain interfaces, and require other interfaces from its environment. A containment relation between packages
corresponds to a requirement dependency between packages: the composite package requires the presence of the sub-package in the target environment. As with
general Fractal components, different qualifications (such as mandatory, optional,
or lazy) may apply to sub-packages and to required interfaces. As for component attributes in general, our model does not enforce a specific set of metadata
to associate with a package. As a minimum, a package’s metadata must identify mandatory qualifications for package dependencies. Package metadata may also contain additional requirements, such as version information, resource requirements,
or conflicts (identifying that certain packages, or components in general, must not
be present in the target environment). Note that there are two aspects of configuration involved in our model: the first one is related to package resolution,
and the second one corresponds to setting appropriate activation parameters and
attributes, and establishing necessary bindings between run-time components.
A distributed deployment and configuration process for Fractal systems can
be understood as a distributed application executing on a set of nodes. Nodes that
form the targets of the deployment process are called managed nodes. To effect
deployment and configuration, managed nodes must be equipped with a minimal
structure. This comprises, for each managed node:
• One or more binding factories for establishing bindings supporting remote
communication with components residing on the managed node.
• An installation store component, which is a repository of packages.
• One or more loader components, which are responsible for loading executables from packages in the installation store.
At least one binding factory per managed node is required to enable communication
with component interfaces located on managed nodes, and to transfer packages to
managed nodes (either in a pull or a push mode). The installation store constitutes
a local target environment for deployment, and provides the context for deciding
whether a package is installable or not, i.e. whether all the mandatory dependencies
of a package can transitively be resolved (see [99] for a formalization of – a form of –
installability). The installation store component need not execute on the managed
node it belongs to, but may be located on a different node (e.g. because of resource
constraints). Loader components (loosely analogous to Java class loaders) can take
many forms, performing strictly binary loads for execution, or engaging in run-time
resolution of package dependencies, with installation of required packages in the
local installation store. By definition, loader components execute on the managed
node they belong to.
15.6
The FructOz framework
15.6.1
Overview
The FructOz framework allows Fractal developers to leverage the Oz programming
language for the deployment and configuration of Fractal systems (be they programmed in Oz or in some other language). FructOz also allows the development
of self-configurable and self-deployable distributed components, i.e. components
whose implementation may span several nodes, and which can reconfigure themselves during execution, possibly making changes to subcomponents and to
supporting component packages. FructOz currently comprises:
• A lightweight implementation of the Fractal model in Oz.
• A “dynamic FPath” library providing support for navigating, querying, and
monitoring Fractal component structures.
The Fractal implementation which FructOz provides is “lightweight” in the sense
that it does not provide all the features and capabilities of the reference Java
implementation of Fractal. Also, we have not aimed at this point for a highly
optimized implementation. The “dynamic FPath” library has been inspired by
the FPath language [40], and allows navigating and querying a Fractal component
structure, much like XPath allows one to navigate and query XML documents. In our
case, navigating and querying can take place in a component structure that spans
multiple nodes.
FructOz can typically be complemented by a library implementing distributed
control flow and exception patterns, such as e.g. those documented in the workflow
literature [142, 125]. We present elements of such a library in Section 15.10. They
take the form of operators that formalize the main control flow patterns identified
in [124, 142]. We also provide some examples showing how to support the different
exception patterns presented in [125].
FructOz provides a simple implementation of the reference model presented in
Section 15.5. The construction of Fractal components is explained below. Component packages are constructed as Oz functors (a general form of module, analogous
to functors in ML), with package dependencies corresponding to import clauses in
functors. Loaders in FructOz correspond to Oz module managers, which support
the resolution of import clauses in functors and return components (in the form of
membrane constructs – see below). Loaders in FructOz are thus also component
factories. The installation store is not reified in FructOz: it just corresponds to
the Oz memory store since FructOz packages are plain Oz functors, and thus language values. Binding factories are not reified either: they just correspond to the
communication capabilities of the Oz infrastructure.
We lack the space in this paper to provide a detailed introduction to the Oz
language. A comprehensive reference is [144]. Tutorial material is available on the
Oz/Mozart Web site [71]. To facilitate the understanding of program fragments
below, here are a few indications:
• Variables in Oz are logic variables: they can be bound or unbound. An
unbound variable holds no value. Once bound, i.e. once a value has been
assigned to it, a variable is immutable. Variables in Oz programs are denoted
by tokens that begin with an upper case letter. Assignment is denoted by
an = sign. Thus: X = V denotes the assignment to variable X of some value V.
• Values in Oz can be integers, strings, atoms (denoted by tokens that begin
with a lower case letter), records, cells, ports, or procedures. A record takes
the following form r(f1:X1 ... fn:Xn) where r (the label of the record) and f1 ... fn
(the fields of the record) are typically atoms, and X1 ... Xn are (bound or
unbound) variables. Access to a record field is noted with a . sign (thus:
r(f1:X1 f2:X2).f1 returns X1). Special cases of records are pairs, noted with an
infix # sign (thus: X#Y corresponds to the pair (X,Y)), and lists, noted with
a | sign (thus: H | T denotes a list whose head is H and whose tail is T).
• A cell corresponds to a mutable reference. A cell content is a variable. Thus
@C denotes the content of a cell C, and C := X (assignment) updates the
content of cell C with variable X.
• A procedure call takes the form {P A1 ... An} where P is a procedure name,
A1 ... An are variables denoting the arguments of the call. A procedure can
update several of its (unbound) arguments, and thus return several results.
A procedure declaration takes the form proc {P X1 ... Xn} E end where P denotes
the name of the procedure, X1 ... Xn denote the formal arguments of the procedure, statement E corresponds to the body of the procedure. A function,
declared with a statement of the form fun {F X1 ... Xn} E end is a special case of
procedure which returns a single result. Anonymous procedures and functions can be declared with the anonymous marker $.
• Variable declarations typically take the form X in ... or X = E in ..., where E is some statement.
• A sequence of two statements S1 and S2 is written simply by juxtaposing the two statements, horizontally as S1 S2, or vertically.
• A new concurrent thread is spawned by a statement of the form thread S end,
where S is the statement to be executed in parallel with the current thread.
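To make these indications concrete, here is a small, self-contained Oz fragment; it is our own illustration (all names are chosen for the example) and simply exercises records, pairs, lists, cells, procedures, functions and threads as described above.
declare
R = point(x:1 y:2)          % a record with label 'point' and fields x and y
P = R.x#R.y                 % a pair built from the two field values
L = R.x | R.y | nil         % a list containing the two field values
C = {NewCell 0}             % a cell, i.e. a mutable reference
fun {Twice X} 2 * X end     % a function returning a single result
proc {Bump Cell By}         % a procedure updating a cell
   Cell := @Cell + By
end
thread                      % a new concurrent thread
   {Bump C {Twice R.y}}     % once this thread has run, @C is 4
end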
The rest of this section describes in more detail the main elements of the FructOz framework.
15.6.2
Interfaces and components
FructOz implements interfaces as Oz ports: “An Oz port is an asynchronous channel that supports many-to-one communication. A port P encapsulates a stream S.
A stream is a list with unbound tail. The operation {Send P M} adds M to the end
of S. Successive sends from the same thread appear in the order they were sent.”
(quotation from the Mozart/Oz tutorial). Looking at a port in more detail, the
port itself constitutes the input of the interface, i.e. where clients address their
messages, while the stream encapsulated in the port constitutes its output, i.e.
where the implementation reads the messages it processes.
Actually, a FructOz interface combines a port with a cell (a mutable reference), so that the implementation which processes clients’ messages can be changed at runtime, thus enabling dynamic reconfiguration.
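The following fragment is a minimal sketch, in plain Oz, of this port-plus-cell idea; it is our own illustration and not the actual FructOz implementation. The current handler is kept in a cell, and swapping the cell content changes the behaviour of the interface at runtime.
declare Stream Port Handler in
{NewPort Stream Port}                        % Port is the input side, Stream the output side
Handler = {NewCell proc {$ M} skip end}      % the current implementation, initially a no-op
thread                                       % reader loop: apply whatever handler is current
   {ForAll Stream proc {$ M} {@Handler M} end}
end
Handler := proc {$ M} {Show M} end           % swapping the handler reconfigures the interface
{Send Port hello}                            % clients only ever see the port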
A FructOz component is represented by its membrane, a structure holding
a mutable set of interfaces. Some interfaces are server interfaces and export a
functionality outside the component, while others are client interfaces that import
functionalities inside the component.
There is no distinction between primitive, composite or compound components in FructOz, as there is in SmartFrog or Fractal/Julia (the reference implementation of Fractal in Java): a component may well implement and make direct use of some of its interfaces and, at the same time, have subcomponents bound to other interfaces.
%% Create a new empty component membrane
Comp = {CNew}
%% Create and add an interface to the component membrane
Itf = {INew server [/∗tags list∗/]}
{CAddInterface Comp Itf}
Note that there is no explicit representation for subcomponents or composition
at this level. The functional content of a component is here implicitly defined as
the set of components directly or indirectly bound to any internal interface side of
the component.
15.6.3
Bindings
Interfaces can be bound to each other and, ultimately, to some native code.
Technically, a binding between two interfaces is an active pump implemented with
an Oz thread waiting for messages on the output stream of one interface and
forwarding these messages to the input port of the other interface. Though this
representation of a binding by an Oz thread is probably not the most efficient, it
is simple and affordable for our case studies since Oz threads are designed to be
very light. Establishing a binding between two interfaces is realized through the
BNew primitive.
IClient = {INew client [/∗tags list∗/]} % declare a client interface
IServer = {INew server [/∗tags list∗/]} % declare a server interface
Binding = {BNew IClient IServer} % bind the client interface to the server interface
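As an illustration of the active pump described above, the following plain-Oz sketch (our own code, not the FructOz internals) forwards every message read on the output stream of one port to the input of another port:
declare ClientStream ClientPort ServerStream ServerPort in
{NewPort ClientStream ClientPort}
{NewPort ServerStream ServerPort}
thread                                       % the pump thread realizing the binding
   {ForAll ClientStream proc {$ M} {Send ServerPort M} end}
end
{Send ClientPort request(1)}                 % eventually appears on ServerStream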
Bindings may be chained to connect two interfaces through multiple reconfiguration points.
At the ends of a binding chain, the interfaces are connected to native Oz code. From an interface provider’s perspective, this boils down to defining what to do with the messages in the output stream of an interface. The IImplements primitive allows one to define the procedure to invoke on every message. The primitive may be invoked multiple times on the same interface to replace the previous procedure with another one.
%% How to implement a server interface of a component
Itf = {INew server [/∗tags list∗/]} % declare the server interface
{IImplements Itf % bind the interface to a procedure
proc {$ Message}
/∗ process the Message here ∗/
end}
A convenient way to implement an interface is to associate it with an Oz object,
because the object methods determine the different types of message expected on
the interface:
Itf = {INew server [/∗tags list∗/]} % declare the server interface
{IImplements Itf % bind the interface to an object
{New class $
meth message1(Parameter1 ...)
/∗ process Messages of type ’message1’ ∗/
end
...
end init}}
From an interface user’s perspective, this involves resolving the interface into a
proxy for the implementation procedure of the interface. The primitive IResolveSync
(resp. IResolveAsync) performs the resolution of an interface into a synchronous (resp.
asynchronous) proxy.
%% How to use a client interface inside a component
Itf = {INew client [/∗tags list∗/]} % declare the client interface
ItfProxy = {IResolveSync Itf} % resolve the interface into a proxy
{ItfProxy message1(Parameter1 ...)} % invoke the implementation procedure through the proxy
15.6.4
FructOz entities
FructOz represents the interfaces, components and bindings presented above as objects inheriting from an abstract Entity base class, which provides generic navigation facilities through tag filtering: an Entity instance may be tagged with a set of keywords, and sets of entities can then be filtered based on their tags. FructOz thus
exports the following classes:
Interface : the representation of an interface is oriented, being either client or
server; an interface has a static reference to the membrane it belongs to; the
interface also references the bindings starting from and pointing to it.
Membrane : the representation of a component membrane, containing a dynamic
set of interfaces.
Binding : the representation of a binding between a client interface and a server
interface. A binding is immutable and keeps two static references to the
client and the server interfaces.
Here are the notations that we use in the following:
C : a component (i.e. a membrane)
I : an interface
B : a binding
B : a boolean
S : a set
S{X} : set of elements of type X
Entities programming interface.
The three entities presented above may be created and manipulated by the following set of primitives:
CNew: unit → C, create a new empty component membrane;
INew: (client|server) → I, create a new interface, client or server, not associated to any component;
BNew: I × I → B, create a binding between a client and a server interface;
BNewLazy: I × I → B, create a lazy binding between a client and a server interface: the client interface is required to be instantiated to establish the binding, while the server interface will remain lazy (unneeded) as long as no introspection occurs involving the interface and no functional usage of the binding happens.
Components accept the following operations:
CAddInterface, CRemoveInterface: C × I → unit, add or remove an interface to or from a component.
Interfaces also accept the following:
IImplements: I × (native Oz object) → unit, define the procedure or object to invoke to process messages addressed to an interface.
IResolveSync, IResolveAsync: I → (message → unit), resolve an interface into a synchronous or asynchronous proxy that may directly be invoked to send messages to the interface.
Bindings are immutable. Bindings can only be broken (or introspected), in which case they become garbage:
BBreak: B → unit, break a binding.
As explained previously, components, interfaces and bindings are entities, and
are thus associated with a set of tags. Tags allow the identification of any entity.
Tags are manipulated with the following primitives:
Tag, Untag: entity × tag → unit, apply or remove a tag to or from the given entity.
HasTag: entity × tag → B, test an entity (a component membrane, an interface or a binding) for the given tag.
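A short usage sketch of these tag primitives, assuming only the signatures listed above (the tag name frontend and the membrane Comp are ours):
Comp = {CNew}                  % a fresh component membrane
{Tag Comp frontend}            % attach a tag
if {HasTag Comp frontend} then % test for the tag
   {Untag Comp frontend}       % and remove it again
end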
Immediate introspection primitives.
CGetInterfaces: C → S{I}, get the set of interfaces of the specified component.
IGetComponent: I → C, get the component owning the given interface.
IIsClient, IIsServer: I → B, test whether the given interface is a client (resp. server) interface with respect to its owning component.
IGetBindingsFrom, IGetBindingsTo: I → S{B}, get the set of bindings connected from (resp. to) the given interface.
BGetClientInterface, BGetServerInterface: B → I, get the interface the given binding connects from (resp. to).
15.6.5
Components as packaging and deployment entities
A functor is a function which creates (i.e. instantiates and exports) a new module
(essentially, a record of Oz values), as a result of linking a module definition to its
module dependencies (imports):
functor : {modules} → module.
The choice of the modules to use as imports, i.e. the resolution of imports, is done
by a module manager.
A component is implemented as an Oz module exporting the component membrane. The functor which initializes the module thus corresponds to the deployment procedure of the component. Deploying the component means instantiating
the module with a module manager. Note that a module (i.e. a component) may
be instantiated multiple times by the same module manager. A module manager
is thus a component factory.
functor ComponentPackage
export membrane: Membrane
define
%% Create a new empty component membrane
C = {CNew}
...
%% Create the component content (interfaces, implementation, subcomponents, bindings, etc)
...
%% Now that the component is deployed and ready to be
%% used, make the membrane available
Membrane = C
end
Note that the module imports in the functor declaration representing the component do not represent the client interfaces of the component (in the example
functor ComponentPackage above, there is no import section).
The deployment of the component depicted above into a running entity can
be achieved in the local running virtual machine with the following deployment
primitive example:
fun {Deploy PackedComponent}
%% Create a new module manager
ModuleManager = {New Module.manager init}
%% Ask the module manager to apply the functor procedure,
%% this instantiates the component and exports a reference to its membrane
C = {ModuleManager apply(PackedComponent $)}
in
%% Return the membrane exported by the instantiated module
C.membrane
end
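As a usage sketch (relying only on the Deploy primitive above and the introspection primitives of Section 15.6.4; the variable names are ours), a package can be deployed and its membrane introspected as follows:
M = {Deploy ComponentPackage}  % instantiate the package and obtain its membrane
Itfs = {CGetInterfaces M}      % the set of interfaces exported by the new component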
15.6.6
Distributed environments
The Mozart/Oz platform integrates a distributed programming environment. The
platform aggregates several nodes into a single global store in which objects (values) can be freely shared and accessed. The control of the distribution over the
nodes is explicitly handled through the joint use of functors and module managers
(i.e. functor executors). An Oz module manager is tied to the machine it has been
created on, and thus always deploys modules locally on that machine.
In a centralized, non-distributed environment, components may be deployed on
the same machine and in the same virtual machine, using the {Deploy PackedComponent}
primitive presented in section 15.6.5. To represent a non-distributed environment
consisting of a single machine, FructOz provides a Host component acting as a
component factory: the Host component exports a factory interface, with a deploy
operation relying on the Deploy primitive. A Host component represents a managed
node in the reference model of Section 15.5.
Figure 15.1: Distributed environments representation
As an example showing how to extend this to a distributed environment, a
Cluster component has been implemented to represent a logical set of fully interconnected hosts. The Cluster component provides operations to add and remove
machines from the set of nodes, hence acting as a Host factory. Introspection of
the Cluster component thus allows the distributed infrastructure to be discovered. Distributed deployments may then be parameterized with a cluster parameter identifying the scope of the deployment. Deploying a component on this infrastructure now requires making explicit the host the component should be deployed on. For this
purpose, we provide a new deployment primitive: {RemoteDeploy Host PackedComponent},
relying on the factory interface of the Host component to dispatch the deployment
on the remote hosts. The Deploy primitive remains available to deploy components
locally.
fun {RemoteDeploy Host PackedComponent}
%% Get and resolve the ’factory’ interface of the Host component
FactoryItf = {CResolveSyncInterface Host factory}
in
{FactoryItf newComponent(PackedComponent $)}
end
As an implementation detail, the cluster component creates and integrates
new hosts in the distributed environment as follows: it remotely starts an Oz
virtual machine via an SSH shell command; the new remote Oz virtual machine
opens a connection with the source virtual machine, creates a module manager and
transmits it (more exactly, a proxy for it) back to the source machine, thus making
it available to the cluster component. Finally the cluster component instantiates
a host component remotely on the new virtual machine using the proxy for the
remote module manager.
fun {NewRemoteHost Hostname}
%% First start a new Oz virtual machine on the remote Host and obtain a proxy
%% to a module manager in this virtual machine
RemoteModuleManager = {New Remote.manager init(host:Hostname fork:ssh detach:false)}
%% Remotely deploy a Host component on the new virtual machine
HostModule = {RemoteModuleManager apply(PackedComponent $)}
in
%% Return the membrane of the new Host component
HostModule.membrane
end
15.6.7
LactOz: a dynamic FPath library
FPath [39] is a query language for Fractal architectures. It is to Fractal architectures what XPath is to XML documents: a compact notation inspired by XPath
for navigating through Fractal architectures and for matching elements according
to some predicate. However, contrary to XML documents, Fractal architectures
are dynamic and their structure and content may evolve over time. The evaluation of a standard FPath expression, as proposed in [39], produces a static result,
only meaningful with respect to the original architecture it has been applied on,
and which might not reflect the architecture as it may have evolved since the
evaluation of the FPath expression. LactOz provides Dynamic FPath expressions
that capture the dynamicity of architectures within FPath expressions. LactOz
endows FPath expressions with the ability to be dynamically updated following architecture evolutions.
Updates to dynamic FPath expressions may happen synchronously or asynchronously. Thus FPath variables and expressions may be static, dynamic with synchronous updates, or dynamic with asynchronous updates.
• Static variables are defined once and forever.
• Dynamic variables can be updated. Dynamic variables can be:
– explicit: such variables are “manually” updated,
– implicit: those variables are defined relative to a set of (dynamic) source variables and are automatically updated when their sources change. A source propagates its changes to an implicit variable either synchronously or asynchronously.
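A small sketch of this distinction, using the dynamic-set object interface of Section 15.7.1 and the SUnion primitive listed below (the situation mirrors Figure 15.2; SomeEntity stands for any previously created entity and is ours):
S1 = {SNew}            % explicit dynamic set, updated manually
S2 = {SNew}            % another explicit dynamic set
S3 = {SUnion S1 S2}    % implicit dynamic set: follows S1 and S2 automatically
{S1 add(SomeEntity)}   % the addition is propagated to S3 by LactOz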
Predicates
FPath predicates are built using Oz expressions on top of FructOz introspection
primitives. We demonstrate here how to build the dynamic FPath expression
contained in the dynamic FPath library:
CGetExternalComponentsBoundFrom: C → S{C}. Given a component C, we want to select all the components C′ such that there exists a binding from a client interface of C to a server interface of C′. The exploration process to achieve this computation starts from C, then gets the client interfaces of C, then the client bindings of C, then the server interfaces of those client bindings, and finally the components pointed to by these bindings.
Figure 15.2: Anatomy of a simple union dynamic set: handling of an update. (The figure, not reproduced here, shows two explicit sets S1 and S2 whose add notifications are propagated by listeners to an implicit set S3 = Union(S1,S2).)
%% Auxilliary introspection functions
fun {CGetClientInterfaces C} % get the dynamic set of client interfaces of a component
   AllItfs = {CGetInterfaces C} % all the interfaces of the component
in
   %% The kind of an interface cannot change, so we optimize
   %% using a dynamic set filtering with a static filter function
   {SStaticFilter AllItfs (fun {$ I} I.kind == client end)}
end
fun {BGetServerComponent B} % get the component owning the interface the binding points to
{IGetComponent {BGetServerInterface B}}
end
fun {CGetExternalComponentsBoundFrom C}
ClientItfs = {CGetClientInterfaces C}
%% SMap maps the Set of (client) Interfaces into a Set of Set of Bindings,
%% that we flatten thanks to SUnion into a Set of Bindings
ClientBindings = {SUnion {SMap ClientItfs (fun {$ I} {IGetBindingsFrom I} end)}}
in
%% and finally, map the Set of Bindings into a Set of Components
{SMap ClientBindings (fun {$ B} {BGetServerComponent B} end)}
end
Now suppose we want to filter the set of components obtained above with some predicate. For example, if we want only those having a sub-component tagged “interesting”, we might build and use the following predicate:
%% filter the dynamic set of components generated with the previous function
{SFilter {CGetExternalComponentsBoundFrom C}
fun {$ C}
%% get the sub−components of C
Children = {CGetSubComponents C Context}
in
%% only keep components for which the sub−component set contains at least
%% one child component tagged with ’interesting’, i.e. for which the sub−component
%% set filtered with respect to the ’interesting’ tag is not empty
{BNot {SIsEmpty {SFilterHasTag Children ’interesting’}}}
end}
The predicate here is dynamic, as its prototype is C → B, where the boolean might change if children are added to or removed from the component, or if some children are tagged or untagged with “interesting”.
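Such a dynamic predicate can also serve directly as a monitoring trigger. For instance, assuming that BWait (listed below) blocks until its dynamic boolean becomes true, and reusing C, Context and the helpers from the example above, one might write:
%% dynamic boolean: does C currently have an 'interesting' sub-component?
HasInteresting = {BNot {SIsEmpty {SFilterHasTag {CGetSubComponents C Context} 'interesting'}}}
{BWait HasInteresting}   % resumes once the predicate becomes true (assumed semantics of BWait)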
LactOz comes with a general set of primitives such as, for instance, BNot which
creates a dynamic boolean computing a logical not of another dynamic boolean,
SIsEmpty which tests the emptiness of a set, SMap, SUnion, SFilter etc. All these primitives can be tuned to be static or dynamic, synchronous or asynchronous. This
requires some extra naming (such as Sync.sFilter or Async.sFilter) not shown here for
brevity.
Basic entities primitives
BNot: B → B
BAnd, BOr: B × · · · × B → B or S{B} → B
BWait: B → unit
NSum, NMultiply, NMin, NMax, NAverage: N × · · · × N → N or S{N} → N
NSubtract: N × N → N
NDivide: R × R → R
NIDivide: Z × Z → Z
SSize: S → Z, get the number of elements of the set
SIsEmpty: S → B, test the emptiness of the set
SUnion: S × · · · × S → S or S{S} → S
SFilter: S × F → S, where F : V → B
SMap: S × F → S, where F : V → V is assumed deterministic
SSubtract: S × S → S
SIntersect: S × S → S
This set can be easily extended.
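For instance, one possible extension (our own sketch, reusing SFilterHasTag from the predicate example above) derives a dynamic boolean telling whether a set contains an entity carrying a given tag:
fun {SContainsTagged S T}
   {BNot {SIsEmpty {SFilterHasTag S T}}}   % dynamic boolean derived from the listed primitives
end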
Basic component introspection primitives
Based on these primitives, we build higher level navigation operations. Note:
the terms “external” and “internal” refer to the implicit inside and outside of a
component.
IIsBoundExternally, IIsBoundInternally: I → B, test whether the interface is externally (resp. internally) bound relative to the implicit inside of its owning component.
CGetExternalBindingsFrom, CGetExternalBindingsTo, CGetExternalBindings: C → S{B}, get the external bindings to and/or from the component.
CGetInternalBindingsFrom, CGetInternalBindingsTo, CGetInternalBindings: C → S{B}, get the internal bindings to and/or from the component.
BGetClientComponent, BGetServerComponent: B → C, get the client (resp. server) component of the binding.
CGetExternalComponentsBoundTo, CGetExternalComponentsBoundFrom, CGetExternalComponentsBoundWith: C → S{C}, get the external components bound to and/or from the component.
CGetInternalComponentsBoundTo, CGetInternalComponentsBoundFrom, CGetInternalComponentsBoundWith: C → S{C}, get the internal components bound to and/or from the component.
15.7
Case studies
In this section, we present examples of dynamic architectures illustrating how our
framework supports the requirements we identified in the introduction.
15.7.1
Parameterized architectures
This example shows how to describe highly parameterized architectures. For this
purpose we present an architecture composed of two sets of components and where
a parameter defines the interconnection scheme between these two sets (see Figure 15.3).
The parameters for this architecture are:
• the size of the first set of components
• the size of the second set of components
• an interconnection scheme function: f : S{C} × S{C} → S{Desc(B)} where
Desc(B) represents a descriptor for a binding (in our scenario, a descriptor
is a pair of interfaces IFrom#ITo).
Figure 15.3: Parameterized architectures: interconnection scheme
fun {ParameterizedComposite N1 N2 BindingScheme}
functor $
export Membrane
define
C = {CNew nil}
%% Create the first set S1 with N1 instances of component C1.
S1 = {SNew} % create a new empty set S1
for I in 1..N1 do
SubComp = {Deploy C1} % deploy a component C1 locally
in
{S1 add(SubComp)} % add the new component into the set S1
end
%% Create S2 similarly to S1
...
%% Invoke the interconnection scheme function: generate a list of binding descriptors.
LDescs = {BindingScheme S1 S2}
%% Translate these descriptors into real bindings.
LBindings = {List.map LDescs
fun {$ Desc}
IFrom#ITo = Desc
in
{BNew IFrom ITo}
end}
Membrane = C
end
end
A trivial example of an interconnection scheme is the full interconnection scheme, which realizes the complete bipartite graph between the two sets.
fun {FullInterconnect S1 S2}
L = {NewCell nil} % create a new empty descriptor list
in
%% for all pairs (CFrom, CTo) in (S1 x S2)
{S1 forAll(
proc {$ CFrom}
{S2 forAll(
proc {$ CTo}
IFrom = {CGetInterface CFrom [/∗tags list∗/]} % locate the client interface
ITo = {CGetInterface CTo [/∗tags list∗/]} % locate the server interface
Desc = IFrom#ITo % make the descriptor
in
L := Desc | @L % prepend the new descriptor to the list
end)}
end)}
@L % return the list of descriptors
end
15.7.2
Synchronization and workflows
In the example depicted in section 15.7.1, the sub-components are deployed sequentially, one after the other. We describe here how to express and integrate distributed synchronizations. For this purpose, we define a Barrier synchronization pattern in the Barrier procedure, analogous to a parallel AND workflow pattern. The procedure is used to parallelize the deployment of N2 instances of component C2 and to synchronize on its completion. Expressing the synchronization pattern as a procedure allows the synchronization pattern to be treated as a parameter of an architecture (such as the BindingScheme parameter of the example shown in section 15.7.1).
%% Barrier synchronization pattern
proc {Barrier P N}
Bar = {Tuple.make barrier N}
in
for I in 1..N do
thread
{P} % invoke the procedure
Bar.I = true % ready signal once the procedure is over
end
end
{Record.forAll Bar Wait} % wait for all N ready signals
end
%% Apply the barrier pattern to deploy N2 instances of C2 components
{Barrier
proc {$}
SubComp = {Deploy C2}
in
{S2 add(SubComp)}
end
N2}
15.7.3
Lazy deployments
In this example we show how to implement lazily deployed components. We consider three levels of deployment: (i) level 0: the component is represented by a lazy variable and is not deployed at all; (ii) level 1: the component membrane is deployed, which allows its interfaces and immediate sub-components to be introspected, but the implementation of the component is not deployed yet; (iii) level 2: the component is fully deployed.
Contrary to the transition from level 0 to level 1, which is atomic, the transition from level 1 to level 2 may contain several intermediate states with partial deployments of the component. For instance, a composite component with many sub-components may be somewhere between level 1 and level 2, with only a few of its sub-components deployed.
This relies on lazy bindings instantiated via the BNewLazy primitive. Lazy bindings require the client interface to be determined, but allow the server interface to be determined lazily. The BNewLazy primitive differs from BNew in that the server interface is only made needed, and thus deployed, when the binding is used (for either functional or control purposes such as introspection).
%% Lazily deployed implementation
Itf = {INew server [tags ...]}
{IImplements Itf
{ByNeed fun {$} % Implementation will be lazily instantiated
{New class $ ... end} end}}
%% Lazily deployed component
Client = {ByNeed fun {$} {Deploy LazyClient} end}
%% Lazily deployed binding between a Client component and a Server component
B = thread
%% Wait until someone needs and thus triggers the deployment of the Client component
%% Once this has happened, get the client interface
IFrom = {CGetInterface {WaitQuietValue Client} [client]}
%% Create a lazy reference to the server interface
ITo = {ByNeed fun {$} {CGetInterface Server [service]} end}
in
%% Create a lazy binding between Client (now determined) and Server (might still be lazy)
{BNewLazy IFrom ITo}
end
15.7.4 Error handling
In the following, we sketch how to take deployment failures into account. First we define a (trivial) error handling pattern, which we then use in a concrete scenario. The scenario consists of a simple compensation behavior for the instantiation of a logging component: by default it tries to deploy a remote logger, and falls back on a file logger if the remote logger cannot be instantiated. The handling pattern may thus be used as a parameter of an architecture, much like the synchronization pattern in the previous example.
%% Basic try/catch error handling pattern
proc {HandleError P E}
try {P} % execute procedure P
catch AnyException then
{E} % if any error happens during P, then execute E
end
end
%% Example of deployment with error handling
{HandleError
proc {$}
% try to deploy a component RemoteLogger
Log = {Deploy RemoteLogger}
end
proc {$}
% if RemoteLogger cannot be instantiated, fall back on component FileLogger
Log = {Deploy FileLogger}
end}
%% Within the deployment procedure of RemoteLogger
if (/*remote resource is not available*/) then
raise instantiationFailure(cause: unavailableResource) end
end
15.7.5 Self-configurable architecture
We now present how to extend the example of parameterized architecture presented in section 15.7.1 with dynamic parameters. The original example is a parameterized architecture {ParameterizedComposite N1 N2 BindingScheme} where N1 and N2 are two constant integers and where BindingScheme is a function which generates a set of binding descriptors given two sets of components.
Making the architecture dynamic with respect to N1 means: (i) dynamically deploying or undeploying instances of C1, and (ii) dynamically adding and removing bindings between C1 instances and C2 instances. For this purpose, the interconnection scheme takes the dynamicity into account by generating a dynamic set of binding descriptors. The composite component listens to this dynamic set of descriptors and translates new (resp. removed) set entries into binding creations (resp. removals).
The dynamic set of descriptors is implemented as an object integrated into an event-based system. The object is parameterized with two dynamic sets S1 and S2 and is thus listening and reacting to these sets' updates. For example, in response to an update event related to S1 such as a removal of elements, the method removeS1(SElements) is invoked, which adjusts the state of the descriptor set. Adjustments to the descriptor set made in removeS1 generate new events that are propagated in cascade to the listeners of this set, such as a dynamic architecture.
%% How to construct a dynamic set
class DynamicBindingDescSet from ImplicitSet
meth init
%% Listen to S1 updates
self.s1Listener = {New SetListenerForwarder init(self S1 sync addS1 removeS1)}
{S1 listen(self.s1Listener)}
...
end
...
meth removeS1(SElements) % On removal of elements from S1:
lock % mutual exclusion to prevent race conditions
%% Update our local view of S1
s1 := {FSet.removeAll @s1 SElements}
%% Remove all descriptors referencing components that are removed from S1
{FSet.forAll SElements
proc {$ CFrom}
Set,filterInPlace(
fun {$ Desc}
({IGetComponent Desc.iFrom} == CFrom)
end)
end}
end
end
...
end
%% How to listen and react to events generated by the dynamic descriptor set
class MyDescSetListener from SetListener
...
meth remove(SElements) % On removal of descriptors:
{FSet.forAll SElements
proc {$ Desc} % destroy the bindings corresponding to the removed descriptors
{BBreak Desc.binding}
end}
end
end
%% Create and activate our listener on DescSet
{DescSet listen({New MyDescSetListener init(C DescSet sync)})}
15.7.6 Deployment scenarios
We present here how to describe different deployment strategies. We consider the deployment of a set of identical components on a cluster of nodes, organized as subcomponents of the composite component described below. The size of the set of components drives the complexity of the overall deployment process.
fun {CompositePackage Cluster N DeployProc}
functor $
export Membrane
define
Comp = {CNew}
{DeployProc SubcomponentPackage Cluster N Comp}
Membrane = Comp
end
end
The composite component description is parameterized by a deployment procedure which is responsible for the deployment of the N identical subcomponents of the composite.
Sequential deployment. Our first deployment strategy consists of deploying all subcomponents sequentially from a single centralized controller as follows:
proc {DeploySeq CompPackage Cluster N Parent}
NextRoundRobin = {MakeRoundRobin Cluster}
for I in 1..N do
Host = {NextRoundRobin}
NewComp = {RemoteDeploy Host CompPackage}
{CAddSubComponent Parent NewComp}
end
end
Centralized asynchronous parallel deployment. Adding uncontrolled
parallelism to the previous strategy is trivial thanks to Oz:
proc {DeployPar CompPackage Cluster N Parent}
NextRoundRobin = {MakeRoundRobin Cluster}
for I in 1..N do
thread
Host = {NextRoundRobin}
NewComp = {RemoteDeploy Host CompPackage}
{CAddSubComponent Parent NewComp}
end
end
end
Centralized synchronous parallel deployment. Relying on the barrier synchronization described in section 15.7.2, we build a centralized deployment strategy that spawns and synchronizes on concurrent deployments as follows:
proc {DeployParallel CompPackage Cluster N Parent}
NextRoundRobin = {MakeRoundRobin Cluster}
%% Deployment script
proc {DeployProc}
Host = {NextRoundRobin}
NewComp = {RemoteDeploy Host CompPackage}
{CAddSubComponent Parent NewComp}
end
in
%% Execute and synchronize on the scripts’ execution
{Barrier DeployProc N}
end
Tree distributed deployment. All deployment strategies presented up to now are executed on a single centralized controller node. Here is how to build a distributed deployment process based on a tree distribution strategy. The deployment process is initiated on the root node of the tree. Each node of the tree locally deploys one instance of the component; each node also initiates the deployment process on all its branches and synchronizes on them.
NextRoundRobin = {MakeRoundRobin Cluster}
proc {DeployTree CompPackage Arity Depth Parent}
functor DistributedDeployProc
export Membrane
define
if (Depth > 0) then
{DeployTree CompPackage Arity (Depth - 1) Parent}
end
Membrane = {Deploy CompPackage} % deploy component locally
end
proc {DeployProc}
Host = {NextRoundRobin}
NewComp = {RemoteDeploy Host DistributedDeployProc}
{CAddSubComponent Parent NewComp}
end
in
{Barrier {MakeCopyList DeployProc Arity}}
end
                       Component deployment   Remote invocation
SmartFrog/Java RMI     295.5 ms               0.097 ms
Julia/Fractal RMI      159.5 ms               0.099 ms
FructOz                146.3 ms               0.0048 ms
FructOz/Julia Bridge   262.3 ms               N/A

Table 15.1: Deployment and remote invocation costs comparison
15.8 Evaluation
15.8.1 Microbenchmarks
We show here a performance evaluation to compare the deployment process and
the cost of remote method invocations on SmartFrog/Java RMI, Julia/Fractal
RMI and FructOz/Mozart. Additionally we evaluate the deployment process of
a “FructOz/Julia bridge” (see the details at the end of this section). For this
purpose, we measure the latency of the deployment of the distributed composite
component depicted in Figure 15.4 and we also evaluate the cost of a synthetic
remote method invocation between the client and the server subcomponents.
Figure 15.4: Simple distributed component
The experiments took place on the Grid'5000 infrastructure; the measurements were performed on dual-Opteron 252 (2.6 GHz) machines with 4 GB of memory, interconnected with 1 Gb/s network interfaces. We used SmartFrog 3.12.000 and Fractal 2.0.1, running on the Sun Java virtual machine v1.6.03, and the Mozart/Oz virtual machine v1.3.2. To evaluate the remote method invocation cost, we measure the time required to complete 10000 synthetic method calls. The garbage collector is manually triggered between measurement sequences so as to minimize its impact on the performance. The experiments were repeated 30 times and the reported values are averages. Table 15.1 summarizes the results of this evaluation.
The deployment of the distributed component is more efficient on FructOz, which may be explained by the fact that, unlike SmartFrog and Fractal ADL, FructOz does not need any descriptor parsing. In fact, SmartFrog reported during the experiments an average descriptor parsing time of 150 ms, which is about half the time of our SmartFrog deployment process. Remote method invocations are more efficient on the Mozart/Oz platform. This may be explained by the fact that the Mozart/Oz marshaler is directly implemented and optimized in C++, while the Java and Fractal RMI marshalers are implemented in Java and use Java reflection. Additionally, Mozart/Oz heavily relies on lazy marshalling, and only serializes complex structures when they are required by a remote process. Overall, these microbenchmarks show that the basic operations in our platform compare favorably with other environments.
FructOz/Julia bridge: We extended the FructOz framework with an Oz/Java bridge that endows FructOz with the ability to drive Java tasks. Moreover, the bridge exports specific hooks to manipulate the Fractal/Julia component
model in the Oz world. This way, FructOz constitutes a deployment engine for
Fractal/Julia components. The Oz/Java bridge is built as a set of XML-RPC
handlers, and we used the Apache XML-RPC v3.1 Java implementation and the
Oz XML-RPC client modules for the experimentation.
The deployment of the distributed component in this configuration thus adds
the cost of XML-RPC invocations and Fractal/Julia primitive executions to the
latency measured on the regular FructOz environment, which explains the higher
deployment cost (262.3 ms). Finally, remote method invocations could happen either as Oz invocations or as Fractal RMI invocations, thus mixing the performance
of both configurations.
This demonstrates the applicability of the FructOz framework to heterogeneous non-Oz environments. More concretely, driving the deployment of legacy applications such as J2EE servers would require the design of component wrappers for these legacy applications (see e.g. [6] for details).
15.8.2 Local deployments
In the following we evaluate the scalability of local single-machine deployments on
the Mozart/Oz platform. As an evaluation base, we deploy a composite component containing a single client subcomponent bound to a number of server subcomponents. The number of server subcomponents thus constitutes the size of the
deployed component. All measures are repeated 3 times and averaged.
Results are presented in Figure 15.5 and show that the deployment cost increases with the number of components deployed. This behavior may be correlated with the garbage collector of the Oz virtual machine. Indeed, we reported an
issue preventing the correct collection of generated values, leading to an increase
in the heap size, thus slowing down the collection mechanism. Moreover, the FructOz framework has not been designed with performance in mind, and circumvents
a number of limitations of the current implementation of the Oz virtual machine
with some design and performance overhead. Thus there is room for significant
optimizations.
Figure 15.5: Local deployment evaluation (latency in ms against the number of components)
15.8.3 Distributed deployments
The following experiments consist of deploying increasing numbers of components on a distributed environment made of a cluster of 16 machines of the Grid'5000 infrastructure (described in more detail in section 15.8.1). The machine on which a new component is deployed is determined with a simple round-robin policy over the machines available in the cluster. In Figures 15.6(a) and 15.6(b) we demonstrate
machines available in the cluster. In Figures 15.6(a) and 15.6(b) we demonstrate
the gains that can be obtained by defining the appropriate deployment workflow.
Figure 15.6(a) compares two simple workflows, executing on a single machine:
the first one is just the sequential deployment of a number of components on
different machines; the second one is the parallel deployment of the same number
of components on the same number of machines. Just increasing the parallelism
provides an interesting improvement. A more drastic speed-up is obtained when
changing the workflow to a distributed one. In Figure 15.6(b), we show the result
of distributing the deployment workflow on a tree of machines. We experimented
with different forms of trees, as reported in the Figure. One can notice a huge
improvement (5× speedup) over a centralized sequential workflow. Interestingly, as can be seen in the code for our centralized and distributed deployment scenarios in Section 15.7.6, for the same target component configuration, changing the workflow process with our framework is only a matter of changing a parameter in a higher-order procedure.
Note that in Figure 15.6(b) we have not been able to obtain results beyond 600 components, because of instabilities of the current Oz virtual machine. These are due, we believe, to some interplay between the garbage collector and distribution, but we have not been able at this point to ascertain the exact source of the problem.
Figure 15.6: Distributed deployments evaluation. (a) Centralized deployment: sequential vs. parallel deployment; (b) tree-distributed deployment: binary, ternary, and quad trees. Both plots show latency (s) against the number of components.
15.9 Discussion and future work
The work reported in this deliverable constitutes a first step towards meeting our
two main objectives:
1. To provide basic support for complex deployment and configuration processes.
2. To provide basic support for the construction of distributed and self-configurable
components.
Compared to the state of the art in software deployment in general and architecture-based deployment in particular, we have in place a framework that allows highly parameterized descriptions of complex distributed deployment and configuration processes, and that allows the construction of distributed components embedding their own monitoring and control loop capabilities. However, our work still faces a number of limitations that call for additional study:
• The FructOz framework was developed using the 1.3.2 version of the Mozart
environment. Hence, we could not benefit from the failure handling facilities introduced in the 1.4.0 version, and documented in R. Collet’s PhD
thesis (see appendix to this book, and Year 1 Selfman deliverable D2.3a).
We plan to exploit these facilities in the next version of our FructOz framework, to support more comprehensive and systematic patterns for handling
distributed failures.
• Handling distributed failures in deployment and configuration processes calls
for transactional support. Work introducing transactional support for reconfiguration processes (exploiting the Fpath/Fscript technology for the definition of reconfiguration programs) is reported in Deliverable D4.2a (Chapter 16 of this book). We plan to exploit this work and to extend it to fit with
the FructOz Dynamic FPath and distributed deployment and configuration
capabilities.
• The FructOz framework has been used successfully in cluster-size environments (although we experienced some scalability issues, which we are revisiting with the new Mozart 1.4.0 distributed infrastructure). However, it is not clear that we can readily make use of FructOz in the larger-scale and more
dynamic peer-to-peer environments targeted by Selfman. We identify three
areas of work to consider in Year 3 of the Selfman project: extending our dynamic FPath monitoring capabilities to a P2P context, adding aggregation capabilities and coupling them with dynamic slicing capabilities [48]; studying
the question of high-level programming abstractions for reconfiguration effectors in large P2P systems (for instance, to efficiently support large-scale
push-style deployments); exploiting the work on DHT-based transactions
reported in Deliverable D3.1b (Chapter 12 in this book) to support transactional deployment and configuration processes in a P2P environment.
15.10 Supplement: Workflow patterns in Oz
We present in this section an interpretation of a collection of workflow control-flow
patterns in Oz. For simplicity, we present only an interpretation of the first twenty
control-flow patterns taken from [142], which have then been refined and extended
in [124]. Other patterns in [124] can be similarly captured in Oz, but we leave
that as an item for future work. In the description of workflow patterns below, we
keep the names and descriptions of patterns given by [124].
Before presenting our formalization of workflow patterns in Oz, a few words
about task modelling. The interpretation presented in this appendix is related to
the π-calculus formalization of the same control-flow workflow patterns presented
in [111]. The π-calculus can be seen as a direct subcalculus of the Oz kernel
language, so it is worthwhile to review the formalization proposed in [111], and
to contrast it with that of this appendix. In [111], a basic task, i.e. one which
is not built using workflow pattern operators, is modelled as a simple sequential
π-calculus process, of the form
x_1 ... x_n . [a_1 = b_1] ... [a_p = b_p] . τ . y_1 ... y_m . 0
where the x_i are input actions that trigger the basic task, the name equality checks [a_i = b_i] correspond to the checking of some optional conditions (e.g. checking a cancellation flag), τ is the π-calculus silent action, which models the execution of the functional part of the basic task, and the y_i are output actions that can trigger some other process and also denote the termination of the basic task. Basic tasks
that can be triggered more than once are modelled using the π-calculus replication
operator, thus:
!x_1 ... x_n . [a_1 = b_1] ... [a_p = b_p] . τ . y_1 ... y_m . 0
The workflow patterns that are described then apply only to basic tasks, or to
slight variants of basic tasks as modelled above. This modelling is, in our view,
overly simplistic:
• It does not consider data dependencies between tasks, and in particular
it does not allow the functional part of a basic task to depend on input
parameters.
• It does not allow the functional part of a basic task to interact with its
environment.
• It only allows task cancellation before the actual execution of its functional
part.
• It does not allow a compositional definition of workflows.
The last point is crucial: because control flow operators operate only on basic tasks, one cannot build higher-order workflow operators, i.e. workflow operators acting on workflow processes. In our modelling, we lift all the above limitations and provide a set of composable programming abstractions for building workflow processes in Oz.
Specifically, in our approach, each workflow pattern is captured as a particular operator (an Oz higher-order procedure) that acts on task procedures. Task
procedures are Oz procedures, whose invocation corresponds to the launch of the
modelled task. A task procedure can be arbitrary (e.g. it can launch multiple concurrent threads as part of its execution), except for certain constraints (such as
having certain arguments and behaving in a certain way), which are necessary for
the proper functioning of the operator. Task procedures, provided they meet the
constraints required for their composition using a given workflow operator, can
encapsulate arbitrary workflow processes, including ones which have been built
using other workflow operators.
When describing each workflow operator, we clarify the constraints that the
procedures which are passed as arguments to the operator must meet. We also
give a brief informal description for each pattern, directly taken from [124], to
clarify the intent of the pattern. Where necessary, we add clarifications to the pattern's intended semantics in the form of so-called context conditions, also taken from [124], i.e. conditions on how the pattern is to be used and how it is expected to behave in its target environment.
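Before turning to the patterns, the following minimal sketch illustrates the task procedure convention used throughout this section: a unary procedure that performs some work (possibly in its own thread) and signals termination by binding its argument. The name SampleTask and the use of Delay are illustrative only and are not part of the pattern library.
%% Illustrative task procedure: do some work in a thread and bind Z to signal termination.
proc {SampleTask Z}
   thread
      {Delay 100} % stand-in for the functional part of the task
      Z = done    % termination signal
   end
end
%% It can then be composed with the operators below, e.g. {Seq SampleTask SampleTask}.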
15.10.1 Basic control flow patterns
Sequence
Pattern description: An activity in a workflow process is enabled after
the completion of a preceding activity in the same process.
The sequence pattern can be modelled by the Seq operator below. Seq acts on a pair of tasks. Each task is modelled by a unary procedure whose unique argument is a logic variable that will be bound to some value by the procedure to signal its termination. Note that the procedure execution can give rise to a concurrent process: the only requirement is that termination of the execution be signalled by binding the parameter variable.
proc{Seq P Q}
ZP ZQ in
{P ZP} {Wait ZP} {Q ZQ} {Wait ZQ}
end
This operator can be generalized to act on a list of tasks and to allow for the
result of a task to be fed to the following task in the list. Each task in the task list
is modelled by a unary procedure whose argument corresponds to a pair: (input
argument, termination variable). Note that procedure SeqL returns only after the
last task in the list has terminated.
proc{SeqL L I}
case L of
H | T then Z in {H I#Z} {Wait Z} {SeqL T Z}
[] nil then skip
end
end
Parallel split
Pattern description: The divergence of a branch into two or more
parallel branches each of which execute concurrently.
The parallel split pattern can be modelled by the ParSplit operator below. ParSplit acts on a pair of tasks, each simply modelled as a nullary procedure.
proc{ParSplit P Q}
thread {P} end
thread {Q} end
end
This operator can be generalized to act on a list of tasks. Each task in the task list is modelled as a nullary procedure.
proc{ParSplitL L}
case L of
H | T then thread {H} end {ParSplitL T}
[] nil then skip
end
end
Synchronization
Pattern description: The convergence of two or more branches into a
single subsequent branch such that the thread of control is passed to
the subsequent branch when all input branches have been enabled.
The synchronization pattern can be modelled by the ParSync operator below. ParSync implements a synchronization barrier that is passed only when the two argument tasks P and Q have both terminated. When both tasks have terminated, the third task R is triggered. Tasks P and Q are modelled by unary procedures, which bind their unique parameter variable to some value when they have terminated. Task R is modelled by a unary procedure that takes as argument a list corresponding to the results from the two synchronized tasks.
proc{ParSync P Q R}
ZP ZQ in
thread {P ZP} end
thread {Q ZQ} end
{Wait ZP} {Wait ZQ} {R [ZP ZQ]}
end
The ParSync operator can be generalized to act on a list of tasks. Each task in
the list is modelled by a unary procedure, as above. As above, the results from
the synchronized tasks are gathered in a list that is passed as argument to the
triggered task R.
proc{ParSyncL L R}
fun{ParSyncLF L Z Lr}
case L of
H | T then
ZH ZT in
thread {H ZH} {Wait ZH} ZT = Z end
{ParSyncLF T ZT {List.append Lr [ZH]}}
[] nil then Z#Lr
end
end
Z#Lr = {ParSyncLF L unit nil}
in
{Wait Z} {R Lr}
end
Exclusive choice
Pattern description: The divergence of a branch into two or more
branches. When the incoming branch is enabled, the thread of control
is immediately passed to precisely one of the outgoing branches based
on the outcome of a logical expression associated with the branch.
The exclusive choice pattern between two alternative tasks can be modelled by the ExChoice operator below. ExChoice takes as argument a nullary function BF which evaluates to a boolean and which corresponds to the logical expression associated with the pattern. It also takes two nullary procedures which correspond to the two alternative tasks in the pattern.
proc{ExChoice BF P Q}
if {BF} == true then {P} else {Q} end
end
The ExChoice can be generalized to act on a list of tasks, and a corresponding
list of boolean conditions. The ExChoiceL operator takes as argument a list of pairs
of the form BF#P, where BF is a nullary function that evaluates to a boolean, and
P is a nullary procedure, corresponding to the task triggered if the condition BF
evaluates to true.
proc{ExChoiceL L}
case L of
BF#P | T then if {BF} == true then {P} else {ExChoiceL T} end
[] nil then skip
end
end
Simple merge
Pattern description: The convergence of two or more branches into
a single subsequent branch. Each enablement of an incoming branch
results in the thread of control being passed to the subsequent branch.
The simple merge pattern between two alternative tasks can be modelled by the SimpleMerge operator. SimpleMerge takes three unary procedures, P, Q, and R, as arguments. Procedures P and Q model tasks whose termination is indicated by binding their unique argument to some value. Procedure R corresponds to the task that is triggered as soon as one of P or Q terminates. The value recorded in the termination variable of P or Q is passed as an argument to the task R.
proc{SimpleMerge P Q R}
ZP ZQ in
thread {P ZP} end
thread {Q ZQ} end
if {Record.waitOr ZP#ZQ} == 1 then {R 1#ZP} else {R 2#ZQ} end
end
The code of the SimpleMerge operator uses the function waitOr from the Record module
of the base Mozart environment [144]. The statement {Record.waitOr ZP#ZQ} blocks
until at least one field of the pair ZP#ZQ is determined (i.e. until at least one of
ZP or ZQ is bound to some value), and it returns the feature (here ’1’ or ’2’) of a
determined field.
The SimpleMerge operator can be generalized to act on a list of tasks.
proc{SimpleMergeL L R}
fun{Smlf L Zs}
case L of
H | T then ZH in thread {H ZH} end {Smlf T {List.append Zs [ZH]}}
[] nil then Zs
end
end
Zs = {List.toTuple r {Smlf L nil}}
Z = {Record.waitOr Zs}
in
{Wait Z} {R Z#Zs.Z}
end
The code of the SimpleMergeL operator makes use of the Append function of the List
module of the Mozart system, and of the toTuple function of the same module (that
transforms a list into a tuple – note that in Oz a tuple is just a record with
consecutive integer features).
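As a small illustration (again not part of the pattern code), the snippet below builds a tuple from a list and accesses one of its fields by its integer feature:
local T = {List.toTuple r [a b c]} in % T is the tuple r(a b c)
   {Show T.2} % displays the second field, here the atom b
end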
An alternative to the definitions above, which relies only on the asynchrony of thread execution in Oz, is given below. In this case, with operator SimpleMerge2, we expect tasks P and Q to be unary procedures accepting a pair Id#G as argument, where Id is an integer (the index of the task in the list of tasks to merge), and
where Id is an integer (the index of the task in the list of tasks to merge), and
G is an Oz port on which the termination values of the tasks to merge are sent.
Each merged task is expected to terminate with a termination value of the form
Id#M, where Id is the index of the task (which was passed as argument to the task
procedure), and M is some value.
proc{SimpleMerge2 P Q R}
S G = {Port.new S}
in
thread {P 1#G} end
thread {Q 2#G} end
case S of Id#M|_ then {R Id#M} else skip end
end
proc{SimpleMergeL2 L R}
S G = {Port.new S}
proc{Smlfb L I}
case L of
H | T then thread {H I#G} end {Smlfb T I+1}
[] nil then skip
end
end
in
{Smlfb L 0}
case S of Id#M|_ then {R Id#M} else skip end
end
A note on the above modelling of the simple merge pattern. In [111], it is
strangely stated that the tasks to merge “will never be executed in parallel”.
However, this is clearly wrong since in [124] the Petri net-like description of the
simple merge pattern behaviour, with two tasks, allows for the two tasks to run in
parallel. We follow [124] as the reference for pattern descriptions.
15.10.2 Advanced branching and synchronization patterns
Multi-Choice
Pattern description: The divergence of a branch into two or more
branches. When the incoming branch is enabled, the thread of control
is passed to one or more of the outgoing branches based on the outcome
of distinct logical expressions associated with each of the branches.
The multi-choice pattern can be modelled by the MultiChoice operator below. MultiChoice takes a list of pairs of the form BF#P, where BF is a nullary boolean function (modelling a triggering condition), and P is a nullary task procedure. Note
that, in contrast to ExChoiceL which only triggers one task, provided its triggering
condition is true, MultiChoice triggers all tasks in the list whose triggering condition
evaluates to true.
proc{MultiChoice L}
case L of
BF#P | T then if {BF} == true
then thread {P} end {MultiChoice T}
else {MultiChoice T}
end
[] nil then skip
end
end
Structured synchronizing merge
Pattern description: The convergence of two or more branches (which
diverged earlier in the process at a uniquely identifiable point) into a
single subsequent branch. The thread of control is passed to the subsequent branch when each active incoming branch has been enabled.
The modelling of this pattern in Oz is different from the previous ones. The description of the pattern in [124] highlights a number of so-called context conditions
for the pattern:
1. There must be a single Multi-Choice construct earlier in the process model
with which the Synchronizing Merge is associated, and it must merge all of
the branches emanating from the Multi-Choice.
2. The Multi-Choice construct must not be re-enabled before the associated Synchronizing Merge construct has fired.
3. Once the Multi-Choice has been enabled, none of the activities in the branches
leading to the Synchronizing Merge can be cancelled before the merge has
been triggered. The only exception is that it is possible for all of the activities leading up to the Synchronizing Merge to be cancelled.
4. The Synchronizing Merge must be able to resolve the decision as to when
it should fire based on local information available to it during the course
of execution. Critical to this decision is knowledge of how many branches
emanating from the Multi-Choice are active and require synchronization.
Because of these context conditions, we model this pattern with a higher-order
procedure that encompasses both the initial Multi-Choice construct and the Synchronizing Merge that follows. The SyncMerge operator is defined as follows:
proc{SyncMerge L R}
S G = {Port.new S}
fun{MCh L LT I}
case L of
BF#P | T then if {BF} == true then thread {P I#G} end {MCh T {List.append LT [I]} I+1}
else {MCh T LT I+1}
end
[] nil then LT
end
end
fun{SMg S LT LR}
case S of Id#M | T then
LTn = {List.subtract LT Id} in
if LTn == nil then LR else {SMg T LTn {List.append LR [Id#M]}} end
end
end
LT = {MCh L nil 0}
in
if LT == nil then skip else LR = {SMg S LT nil} in {R LR} end
end
In the code of the SyncMerge operator above, the function MCh corresponds to the
Multi-Choice construct associated with the Synchronizing Merge. The synchronizing merge proper is realized by the function SMg which receives on port G and
its stream S termination values from all the tasks which have been launched by
the Multi-Choice construct. It is only after all these tasks have terminated that the task R is launched, with the list of all received termination values, together with their indices, as argument.
Multi-Merge
Pattern description: The convergence of two or more branches into
a single subsequent branch. Each enablement of an incoming branch
results in the thread of control being passed to the subsequent branch.
The Multi-Merge pattern can be modelled by the MultiMerge operator below.
proc{MultiMerge L R}
S G = {Port.new S}
proc{Launch L}
case L of
H | T then thread {H G} end {Launch T}
[] nil then skip
end
end
proc{Handle S}
case S of M | T then thread {R M} end {Handle T} end
end
in
{Launch L} {Handle S}
end
Structured discriminator
Pattern description: The convergence of two or more branches into a
single subsequent branch following a corresponding divergence earlier
in the process model. The thread of control is passed to the subsequent
branch when the first incoming branch has been enabled. Subsequent
enablements of incoming branches do not result in the thread of control
being passed on. The discriminator construct resets when all incoming
branches have been enabled.
The discriminator pattern comes with a number of context conditions that
complete its semantics:
1. The Discriminator is associated with precisely one Parallel Split earlier in
the process and each of the outputs from the Parallel Split is an input to
the Discriminator.
2. The branches from the Parallel Split to the Discriminator are structured in form and any splits and merges in the branches are balanced.
3. Each of the incoming branches to the Discriminator must only be triggered
once prior to it being reset.
4. The Discriminator resets (and can be re-enabled) once all of its incoming
branches have been enabled precisely once.
5. Once the Parallel Split has been enabled none of the activities in the branches
leading to the Discriminator can be cancelled before the join has been triggered. The only exception to this is that it is possible for all of the activities
leading up to the Discriminator to be cancelled.
The structured discriminator pattern can be modelled by the StructDiscrim operator below. Each task in the list L passed as argument to StructDiscrim is supposed to repeatedly produce results, which are sent on a result port G that is passed as argument to the unary procedure modelling the task when it is instantiated (as part of the Launch procedure execution). The operator embeds: a parallel split in its first phase, manifested by the Launch procedure; a synchronization condition, manifested by the Barrier procedure, that waits for all the launched tasks to have yielded some result before triggering a new phase of the discriminator; and the discriminator itself, manifested by the Discrim procedure, that recursively waits for the result from one of the launched tasks before triggering procedure R with the obtained result as argument.
proc{StructDiscrim L R}
fun{Launch L Lg}
case L of
H | T then S G = {Port.new S} in thread {H G} end {Launch T {List.append Lg [G#S]} }
[] nil then Lg
end
end
fun{Result Lg}
Rg = {List.toRecord r Lg}
G = {Record.waitOr Rg}
in
case Rg.G of M|_ then G#M end
end
fun{Barrier Lg Ld}
case Lg of
G#S | Tg then case S of _|T then {Barrier Tg {List.append Ld [G#T]}} end
[] nil then Ld
end
end
proc{Discrim Lg}
GM = {Result Lg} in thread {R GM} end {Discrim {Barrier Lg nil}}
end
LG = {Launch L nil}
in
{Discrim LG}
end
15.10.3 Structural patterns
Arbitrary cycles
Pattern description: The ability to represent cycles in a process model
that have more than one entry or exit point.
Modelling a process model with multiple entry and exit points, and arbitrary cycles between its tasks, can be done directly in Oz, e.g. by modelling each task in the process model as a port object (see [144] chapter 5 for a definition of port objects). It is also possible to model each task as a FructOz component (which extends the port object idea), and to model connections between tasks as FructOz bindings (see Section 15.6). This pattern is not amenable to formalization as a single higher-order Oz procedure like the preceding patterns, because of its free-form character. However, the bindings in FructOz illustrate how to link an arbitrary output (client) stream to an input (server) port. For the sake of illustration, the StrangeLoop procedure below shows a simple loop: when
called, it creates a port object, which can receive messages from its environment
on two ports G and B; each message received on port G is forwarded to port B, and
vice-versa (it is strange because messages received are endlessly forwarded around
the loop...).
proc{StrangeLoop G B}
Gs Bs
proc{Handle Is O}
case Is of M | Ts then {Port.send O M} {Handle Ts O} end
end
in
G = {Port.new Gs} B = {Port.new Bs}
thread {Handle Gs B} end
thread {Handle Bs G} end
end
Implicit termination
Pattern description: A given process (or sub-process) instance should
terminate when there are no remaining work items that are able to be
done either now or at any time in the future.
This pattern essentially stipulates that a task or process is deemed terminated
when it can effect no further action. Modelling tasks as procedure executions in Oz
means that a task is completed when the procedure returns and the threads it has
spawned have terminated. Hence the implicit termination pattern is supported in
our Oz interpretation. Note that it is possible to define in Oz threads with explicit
termination detection: see [144] chapter 5 for the definition of a thread abstraction
with termination detection.
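As an illustration only, the following sketch shows one way to make termination explicit for a dynamically created group of threads; it is a simplified variant and not the abstraction defined in [144], and the name NewThreadGroup is hypothetical. A port counts +1 for every spawned thread and -1 for every terminated one; Done is bound when the count returns to zero, so new threads must be spawned before all previously spawned ones terminate.
proc {NewThreadGroup Spawn Done}
   Is Pt = {Port.new Is}
   proc {Count N Is}
      case Is of I|Ir then
         if N+I == 0 then Done = unit else {Count N+I Ir} end
      end
   end
in
   proc {Spawn P} % spawn P in its own thread, counted by the group
      {Send Pt 1}
      thread {P} {Send Pt ~1} end
   end
   thread {Count 0 Is} end
end
A caller can then spawn tasks with Spawn (including from within spawned tasks) and use {Wait Done} to detect that the whole group has terminated.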
15.10.4 Multiple instances patterns
Multiple Instances without Synchronization
Pattern description: Within a given process instance, multiple instances of an activity can be created. These instances are independent of each other and run concurrently. There is no requirement to
synchronize them upon completion.
The pattern can be modelled by the MultiInst operator below. Argument N to procedure MultiInst represents the number of task instances to create.
proc{MultiInst P N}
if N > 0 then
thread {P} end {MultiInst P N-1}
else skip
end
end
Alternatively, one can have a list version, modelled by the MultiInstL operator
below. Argument L to procedure MultiInstL represents the list of inputs to be passed
as arguments in calls to the unary procedure P. Each call to P represents an
instantiation of the same task (albeit with potentially different initial inputs).
proc{MultiInstL P L}
case L of
H | T then thread {P H} end {MultiInstL P T}
[] nil then skip
end
end
Multiple Instances with a priori design-time knowledge
Pattern description: Within a given process instance, multiple instances of an activity can be created. The required number of instances is known at design time. These instances are independent of
each other and run concurrently. It is necessary to synchronize the
activity instances at completion before any subsequent activities can
be triggered.
This pattern is modelled by the MultiInstLD operator below, where the size of the argument list L and the task procedure P passed to MultiInstLD are known at design time. Note that MultiInstLD is a variation on the ParSyncL operator that models the Synchronization pattern. In contrast to the Synchronization pattern, the list L which appears as argument to the MultiInstLD procedure represents a list of arguments to be passed to the different instances of task P that are created. Also, the inner function MIf, which is used to create the different task instances, accumulates in list Lr the different results produced by the different task instances. This list is then passed on to the subsequent task R for further processing.
proc{MultiInstLD L P R}
fun{MIf L Z Lr}
case L of
H | T then
ZH ZT in
thread {P H#ZH} {Wait ZH} ZT = Z end
{MIf T ZT {List.append Lr [ZH]}}
[] nil then Z#Lr
end
end
Z#Lr = {MIf L unit nil}
in
{Wait Z} {R Lr}
end
Multiple instances with a priori run-time knowledge
Pattern description: Within a given process instance, multiple instances of an activity can be created. The required number of instances may depend on a number of runtime factors, including state
data, resource availability and inter-process communications, but is
known before the activity instances must be created. Once initiated,
these instances are independent of each other and run concurrently.
It is necessary to synchronize the instances at completion before any
subsequent activities can be triggered.
This pattern is modelled by the same MultiInstLD operator above. The difference
lies in the fact that the argument list L only becomes known at run time, prior to
the call to MultiInstLD.
Multiple instances without a priori run-time knowledge
Pattern description: Within a given process instance, multiple instances of an activity can be created. The required number of instances may depend on a number of runtime factors, including state
data, resource availability and inter-process communications and is
not known until the final instance has completed. Once initiated,
these instances are independent of each other and run concurrently.
At any time, whilst instances are running, it is possible for additional
instances to be initiated. It is necessary to synchronize the instances
at completion before any subsequent activities can be triggered.
This pattern is modelled by the MultiInstR operator below. MultiInstR creates a port object whose port G is returned; it functions in two phases. The first phase is similar to that of the MultiInstLD operator above. The second phase establishes a server thread that handles requests for new task instances sent on port
G. This second phase ends when the server thread receives an “end of task”
(eOT) message on port G. When all task instances have terminated, task R is
launched, with all the results from the terminated task instances as argument.
Note that procedure MultiInstR terminates immediately after having created the port
G, the server thread handling requests received on G, and the thread responsible
for triggering the new task R upon the termination of all the task instances.
proc{MultiInstR L P R G}
fun{MIf L Z Lr}
case L of
H | T then
ZH ZT in
thread {P H#ZH} {Wait ZH} ZT = Z end
{MIf T ZT {List.append Lr [ZH]}}
[] nil then Z#Lr
end
end
fun{MIp S Z Lr}
case S of
eOT|_ then Z#Lr
[] r(M) | T then
ZR ZT in
thread {P M#ZR} {Wait ZR} ZT = Z end
{MIp T ZT {List.append Lr [ZR]}}
end
end
Zf#Lf = {MIf L unit nil}
Zp Lp S
in
G = {Port.new S}
thread Zp#Lp = {MIp S Zf Lf} end
thread {Wait Zp} {R Lp} end
end
15.10.5 State-based patterns
Deferred choice
Pattern description: A point in a workflow process where one of several branches is chosen based on interaction with the operating environment. Prior to the decision, all branches present possible future
courses of execution. The decision is made by initiating the first activity in one of the branches i.e. there is no explicit choice but rather a
race between different branches. After the decision is made, execution
alternatives in branches other than the one selected are withdrawn.
This pattern can be modelled by the DefChoice operator below. The operator takes as arguments a record Tr of tasks (unary procedures) and a nullary choice function Cf that, when evaluated, returns a record of the form r(index:I input:M), where index designates the index, in the task record, of the task to trigger, and input contains the argument to be passed to the newly created task.
proc{DefChoice Cf Tr}
Z = {Cf}
in
{Tr.(Z.index) Z.input}
end
Interleaved parallel routing
Pattern description: A set of activities has a partial ordering defining
the requirements with respect to the order in which they must be executed. Each activity in the set must be executed once and they can be
completed in any order that accords with the partial order. However,
as an additional requirement, no two activities can be executed at the
same time (i.e. no two activities can be active for the same process
instance at the same time).
This pattern can be modelled by the InParRoute operator below. This operator takes a single argument POr, which is a record whose fields are pairs of the form P#L, where: P is a unary task procedure whose argument is a pair, with a termination variable as first element and a list of input parameters as second element; and L is a specification of the partial order among tasks, which takes the form of a list of features of POr indicating which tasks in POr must terminate prior to launching the task modelled by P.
proc{InParRoute POr}
Zr = {Record.clone POr}
Lk = {Lock.new}
proc{Prepare I X}
P#L = X
proc{Prep L Lr}
case L of
H | T then {Wait Zr.H} {Prep T {List.append Lr [Zr.H]}}
[] nil then lock Lk then {P Zr.I#Lr} end
end
end
in
thread {Prep L nil} end
end
in
{Record.forAllInd POr Prepare}
end
In the code of InParRoute above, we make use of the Mozart environment record
module procedure forAllInd, which applies the same binary procedure to each field of
a record, with the current index in the record passed as first actual argument to
the procedure. We also use the lock statement from Oz, which ensures an execution
in mutual exclusion.
Milestone
Pattern description: An activity is only enabled when the process
instance (of which it is part) is in a specific state (typically in a parallel
branch). The state is assumed to be a specific execution point (also
known as a milestone) in the process model. When this execution
point is reached the nominated activity can be enabled. If the process
instance has progressed beyond this state, then the activity cannot
be enabled now or at any future time (i.e. the deadline has expired).
Note that the execution does not influence the state itself, i.e. unlike
normal control-flow dependencies it is a test rather than a trigger.
This pattern can be modelled simply using the Milestone operator below. Note that a milestone is simply represented as a pair comprising a state value V and an associated result value R. The milestone is reached when the milestone variable M is bound to a pair with the particular state V as its first element.
proc{Milestone M V P}
{Wait M} case M of !V#R then {P R} else skip end
end
15.10.6 Cancellation patterns
Cancel activity
Pattern description: An enabled activity is withdrawn prior to it commencing execution. If the activity has started, it is disabled and, where
possible, the currently running instance is halted and removed.
This pattern can be modelled with the CancelWrap operator below. CancelWrap essentially creates a port object when provided with a procedure P. The latter takes the form of a ternary procedure that records its termination by binding its third argument Out to some state value, and that takes as its first two arguments some message M and a state value In. Intuitively, the activity thus obtained takes the form of a state machine whose transition function is given by the procedure P. The operator CancelWrap ensures that the activity can be cancelled in any state.
proc{CancelWrap P In G}
S
proc{Handle S In}
case S of
cancel|_ then skip
[] r(M) | T then Out = {P M In $} in {Handle T Out}
end
end
in
G = {Port.new S}
thread {Handle S In} end
end
Cancel case
Pattern description: A complete process instance is removed. This includes currently executing activities, those which may execute at some
future time and all sub-processes. The process instance is recorded as
having completed unsuccessfully.
Assuming that each activity creation in the target process instance takes the
form of a call to CancelWrap, and that the list of the corresponding ports has been
recorded, the pattern can be modelled by the CancelCase operator below, which just
iterates over the list of ports to send the cancel message to every recorded activity.
proc{CancelCase Lp}
{List.forAll Lp proc{$ G} {Send G cancel} end}
end
Chapter 16
D4.2a: First report on self-healing support
16.1 Executive summary
The work on self-healing in the second year covered the abilities that are needed to complete or complement those of structured overlay networks. This work was done in three areas: (1) asynchronous failure handling in a network-transparent system, (2) network partitioning and merging, and (3) transactional reconfiguration support. This work complements the work on transaction support over structured overlay networks, which is reported in deliverable D3.1b.
16.2 Contractors contributing to the deliverable
KTH, UCL, and FT contributed to this deliverable.
KTH KTH contributed to the network partitioning and merge algorithm.
UCL UCL contributed to the Mozart 1.4.0 release.
FT FT R&D contributed to the transactional reconfiguration.
16.3 Introduction
Work on self healing was done in three parts:
• Asynchronous failure handling at the language level. We released the Mozart
1.4.0 system, which has advanced support for building fault-tolerance abstractions. The support is based on the notion of fault streams, a stream
of messages from a failure detector attached to a language entity. Together
with lightweight concurrency and network transparency, this allows complex fault-tolerance abstractions, such as those in [66] and Erlang’s process
linking, to be implemented in just a few lines of code.
• Network partitioning and merging. A major gap in the abilities of structured overlay networks was the inability of the overlay to merge back together after a temporary network partition. This gap is now closed with the development
of the merge algorithm.
• Transactional reconfiguration. A problem in the reconfiguration of component-based systems is what to do when there is a fault during the reconfiguration process. One solution is to have transactional reconfiguration.
The work on transaction support over structured overlay networks is also relevant
to self healing, but at the application level. This work is presented in deliverable
D3.1b. We explain each of these topics in the sections that follow. The appendices
contain papers that cover all the work in detail.
16.4 Asynchronous failure handling in a network transparent system
A distributed system can be made much easier to program if the language in which
it is written is properly designed. The Oz language, implemented in the Mozart
system, was extended before the SELFMAN project with a network-transparent
distribution layer. This distribution layer did achieve a simplification of programs,
but it exposed the distributed system in an overly complex way to the programmer. In the context of SELFMAN, we are interested in exploring the simplest
possible way that a self-managing system can be programmed. To achieve this,
we have redesigned and reimplemented the distribution layer of Mozart to support
asynchronous failure handling. This work is reported in the Ph.D. dissertation
of Raphaël Collet (see Appendix A.2). We have made a public release of Mozart
1.4.0 in July 2008 (see www.mozart-oz.org) and we are currently retargeting all
our SELFMAN work for this new system.
In SELFMAN, the abilities of Mozart 1.4.0 are used to simplify the construction of self-managing systems. Self-healing is simplified by the asynchronous failure detection and by the fine-grained concurrency (lightweight threads). Self-configuration is simplified by the first-class components (based on closures and
symbolic values) and the lightweight threads.
16.4.1 Network transparency
The Oz language is implemented in Mozart with a network-transparent and network-aware implementation. Network transparency means that the same program can
execute over a network, distributed in any way, and (if there are no failures) it will
have exactly the same functionality (only the timing of operations may change).
Network awareness means that the language allows the programmer to predict and control the
distribution and network behavior of the program. These two properties, taken
together, give an enormous simplification of distributed programming. If your only
experience with distributed programming is in a system such as Java with RMI,
then you will have difficulty imagining the simplification that is possible.
Per Brand has given an example to illustrate the simplification that is possible.
The full example is available at:
http://www.sics.se/~seif/JavaVSMozart.html
In this example, we write a distributed producer/consumer program where the
producer generates a stream of data (one million integers) that is read by the
consumer. The Mozart program for this is 32 lines of code, and exactly the same
program runs on a single machine or over the network (this is network transparency). The Java program is 108 lines in the single-machine case and 220 lines
in the distributed case (Java is not network transparent). Regarding performance,
the differences are also striking: the Mozart program consistently runs much faster
than the Java program. In the single machine case: 17.6 seconds versus 3.9 seconds, and in the distributed case, 1 hour versus 8.0 seconds.
The problems with Java are twofold: first, it has no specific support for asynchronous communication, which is essential for a distributed system, and second,
it is not network transparent. The first problem can be overcome with an asynchronous communication library. The second problem cannot be overcome without
redesigning the language. Any Java application inherently exposes the distribution
structure of the underlying system it runs on. The Oz language implemented by
Mozart does not expose the distribution structure. It is possible, for example, to
write an application and test it on a single machine, and then to distribute the
application over a network without rewriting any of its code.
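To give a flavour of the idiom, here is a minimal single-machine sketch of such a stream-based producer/consumer in Oz; it is illustrative only and is not the program used in the comparison above.
fun {Produce N Max} % build the stream of integers N..Max
   if N =< Max then N|{Produce N+1 Max} else nil end
end
fun {Consume Xs Acc} % fold the stream as its elements become available
   case Xs of X|Xr then {Consume Xr Acc+X}
   [] nil then Acc
   end
end
local Xs Sum in
   thread Xs = {Produce 1 1000000} end % producer thread
   thread Sum = {Consume Xs 0} end     % consumer thread
   {Wait Sum}
end
With network transparency, the producer and consumer threads can run in different Mozart processes while the stream-based code itself stays the same, which is the point of the comparison above.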
16.4.2 Relationship with Kompics
The Kompics component model is inspired by the programming model of Mozart.
KTH, which developed Kompics, has also worked on Mozart. For example, the
channels in Kompics are very similar to ports in Mozart. Ports in Mozart were
first introduced by KTH in the late 1990s. Experience of using ports in Mozart
led to the proposal of channels in Kompics. Kompics also uses first-class component values and relies on lightweight threads, two characteristics that are directly derived from Mozart.

Figure 16.1: The ring merge algorithm
16.4.3 Asynchronous failure detection
In Mozart 1.4.0, we have made a major advance with respect to the previous work by making failure detection asynchronous. Each distributed language entity (object, communication channel, dataflow variable) has an associated fault stream: a stream containing the tokens ok, tempFail, and permFail. Whenever the distribution behavior of the entity changes, a token is added to the stream. Together with a few other operations, such as the ability to change the distributed protocol that implements the entity and the ability to globally or locally “kill” the entity, this lets us build sophisticated distributed abstractions (e.g., all the algorithms in [66]) in just a few lines of code.
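To illustrate the idea only, the following minimal Java sketch (hypothetical classes, not the Mozart/Oz API) models a per-entity fault stream: the distribution layer publishes tokens as the fault state changes, and a separate watcher thread consumes them, so the application threads that use the entity are never blocked by failure handling.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of a per-entity fault stream (not the Mozart/Oz API).
enum FaultToken { OK, TEMP_FAIL, PERM_FAIL }

class FaultStream {
    private final BlockingQueue<FaultToken> tokens = new LinkedBlockingQueue<>();

    // Called by the distribution layer whenever the entity's fault state changes.
    void publish(FaultToken t) { tokens.add(t); }

    // Called by a watcher thread; blocks until the next state change.
    FaultToken next() throws InterruptedException { return tokens.take(); }
}

class FaultWatcher implements Runnable {
    private final FaultStream stream;
    FaultWatcher(FaultStream stream) { this.stream = stream; }

    public void run() {
        try {
            while (true) {
                FaultToken t = stream.next();
                if (t == FaultToken.TEMP_FAIL) {
                    // e.g. switch to a backup replica, or retry later
                } else if (t == FaultToken.PERM_FAIL) {
                    // e.g. discard the local proxy and rebuild the abstraction
                    return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}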
16.5 Network partitioning and merging
Network partitioning is a real problem for any long-lived application on the Internet. A single router crash can cause part of the network to become isolated from another part. SONs should behave reasonably when a network partition occurs. If no special actions are taken, what actually happens when a partition occurs is that the SON splits into several rings. We have observed this behavior for correctly implemented SONs. What we need to do is efficiently detect when such a split
happens and efficiently merge the rings back into a single ring. We have designed
and implemented an algorithm that does exactly this [129, 130]. Appendix A.15
defines the algorithm in detail. Here we give the main insights.
The merging algorithm consists of two parts. The first part detects when the
merge is needed. When a node detects that another node has failed, it puts the
node in a local data structure called the passive list. It periodically pings nodes
in its passive list to see whether they are in fact alive. If so, it triggers the ring
unification algorithm. This algorithm can merge rings in O(n) time for network
size n. We also define an improved gossip-based algorithm that can merge the
network in O(log n) average time.
Figure 16.1: The ring merge algorithm
Ring unification happens between pairs of nodes that may be on different rings.
The unification algorithm assumes that all nodes live in the same identifier space,
even if they are on different rings. Suppose that node p detects that node q on
its passive list is alive. Figure 16.1 shows an example where we are merging the
black ring (containing node p) and the white ring (containing node q). Then p
does a modified lookup operation (mlookup(q)) to q. This lookup tries to reduce
the distance to q. When it has reduced this distance as much as possible, then the
algorithm attempts to insert q at that position in the ring using a second operation,
trymerge(pred,succ), where pred and succ are the predecessor and successor nodes
between which q should be inserted. The actual algorithm has several refinements
to improve speed and to ensure termination.
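The following Java skeleton (hypothetical types; the actual algorithm, with its refinements, is specified in Appendix A.15 and [129, 130]) sketches the detection side and the shape of the two operations, mlookup and trymerge.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Skeleton of the detection part of ring unification (simplified sketch only).
class MergeDetector {
    private final Set<Node> passiveList = ConcurrentHashMap.newKeySet();
    private final Node self;

    MergeDetector(Node self) { this.self = self; }

    // Called when the failure detector reports that a peer has failed.
    void onSuspectedFailure(Node q) { passiveList.add(q); }

    // Called periodically: ping nodes on the passive list; if one answers,
    // the two rings may have to be merged, so start the unification.
    void periodicCheck() {
        for (Node q : passiveList) {
            if (q.respondsToPing()) {
                passiveList.remove(q);
                self.mlookup(q);   // walk towards q's identifier, reducing the distance
            }
        }
    }
}

interface Node {
    boolean respondsToPing();
    // mlookup(q): route towards q; when the distance can no longer be reduced,
    // call trymerge(pred, succ) to splice the node in between pred and succ.
    void mlookup(Node q);
    void trymerge(Node pred, Node succ);
}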
16.6 Transactional reconfiguration of component-based systems
Advanced component models such as Fractal, considered in SELFMAN, are fully dynamic and reflective and make it possible to program dynamic reconfigurations, even unanticipated ones, to be executed in a running application. This is important in order to evolve applications without stopping and redeploying them (for example to update a component or subsystem). However, direct use of the Fractal APIs to program reconfigurations has several drawbacks: verbose and error-prone code due to the lack of language integration and the minimalist design of the APIs, and compilation and code deployment phases which complicate the process (in the case of Java). Using the bare APIs – especially in a general-purpose language – also makes it difficult to guarantee the correctness of the reconfigurations: individually correct Fractal reconfigurations can result in a globally incorrect reconfiguration depending on when and how they are executed with respect to each other and to the normal execution of the application. If the tools of the SELFMAN platform are to be used to reconfigure applications during their execution, it is essential that we guarantee that reconfigurations are reliable.
Dynamic reconfigurations allow modification of a part of a system during its execution, without stopping it entirely, so as to maximise its availability. Thanks to properties of component models such as modularity and loose coupling, reconfigurations can rely on component-based architectures. However, runtime modifications can leave the system in an inconsistent state, and we have identified three main reliability problems when reconfiguring systems:
1. A first problem when modifying a system at runtime is the synchronization between reconfigurations and the functional execution of the system. Indeed, the part of the system which is modified may be unavailable for functional execution during the reconfiguration. To take the hot-swap example with a stateful component, calls on the old component must be blocked until a “quiescent state” is reached, then the state must be transferred, and finally the pending calls are forwarded to the new component.
2. A second problem, at the model level, is consistency violation by reconfigurations. Component models and application models should define what a consistent system is. We must therefore ensure the conformity of the system to the model and to what we call integrity constraints after reconfigurations.
3. The third and last problem is linked to the composition of reconfiguration operations. The semantics of reconfiguration operations implies that there can be conflicts between them, both when they are composed and when several reconfigurations must be synchronized.
Well-defined transactions associated with the verification of structural and behavioral constraints are a means to guarantee the reliability of reconfigurations in component models. We revisit the ACID properties in the context of component-based systems:
• Atomicity: either the system is reconfigured or it is not. Each reconfiguration operation must specify its reverse operation. Thus, if a reconfiguration transaction is rolled back, it is possible to return to a previous stable state by undoing operations. Transaction demarcation is either programmed in the language or automatic.
• Consistency: a transaction must be a correct transformation of the system state, so the reconfigured application must conform to the component model and to application-specific constraints. A reconfiguration transaction can be committed only if the resulting system respects the constraints. Other faults, such as software and hardware failures, are the responsibility of the commit protocol.
• Isolation: several reconfiguration transactions are independent, and any schedule of reconfiguration operations must be equivalent to their serialization. The scheduling must respect the operation semantics and conflicts.
• Durability: once a reconfiguration completes successfully (commit), the new state is persistent. For every transaction, operations are logged in a journal so that reconfigurations can be redone in case of failure. The application state (architecture and component state) is periodically checkpointed so that any component can be recovered in its last stable state, resulting from the last successful reconfiguration.
In our proposal, system consistency relies on integrity constraints both at the application level and at the model level. An integrity constraint is a predicate concerning the validity of an assembly of architectural elements, but it can also concern component state. An example of such a constraint at the component model level is hierarchical integrity (bindings between components must respect the component hierarchy). Constraints must be checked both at compile time on the ADL configuration and at runtime. We represent the Fractal component model as a typed graph; each Fractal-based application is then a graph which is a well-typed instance of the typed graph and which is provided at runtime by the reflective capabilities of the model. The vertices are elements from the component model (components, interfaces, etc.) and the edges represent relations between the elements (composition links, binding links, etc.). Integrity constraints can then be specified on the graphs with a constraint language “à la OCL”, basically an extension of FPath with invariants, preconditions, and postconditions.
// Example of a precondition for removing a component operation:
void removeSubComponent(Component sub);
preconditions:
// all interfaces of the sub-component are unbound
not(exists(sub/interface::*[not(bound(.))]));
To compose operations, and independently of any dedicated reconfiguration language, we consider sequences or parallel executions of intercession operations with conditions expressed by means of introspection operations; however, not all compositions are valid. We want to make the operation semantics explicit in terms of preconditions and postconditions with our constraint language, and eventually to be able to change these semantics and to specify new primitive operations. We distinguish two types of conflicts between operations:
• Parallel conflicts: for two given reconfigurations R1 and R2 executed
on the same system, a parallel conflict occurs if R1 and R2 modify the
same manageable elements in the system model (e.g. bind and unbind
operations).
• Execution dependencies: an execution dependency occurs if R1 either needs R2 to be executed first (e.g., stop before unbind) or if R1 cannot be executed after R2; that is to say, R2's postconditions do or do not cover R1's preconditions.
To deal with reconfiguration concurrency, we propose a pessimistic approach with locking based on operation semantics to avoid inconsistent compositions of operations. We see two different possibilities for the locking algorithm:
• The first one is to lock reconfiguration operations directly: either conflicts between operations are automatically computed from their preconditions and postconditions, or the conflicts must be explicitly defined.
• The second one is to use a modified DAG locking algorithm on our instance graph. The lock granularity is then defined by the manageable elements in the graph representation (e.g., a lock acquisition on a component also locks all its interfaces and every operation in each interface).
In the context of SELFMAN we developed a transactional reconfiguration framework (see Appendix A.16) and integrated this new transactional backend into the FScript language runtime (see Appendix A.17). We also worked on a multi-stage approach that performs static analysis prior to actually executing reconfigurations, so as to prevent the (tentative) execution of inconsistent reconfiguration transactions (see Appendix A.18).
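As a rough illustration of how atomicity and consistency interact (a sketch with hypothetical types, not the actual FScript/Fractal framework described in Appendices A.16–A.18): each primitive reconfiguration operation records its reverse, so that a transaction whose resulting system violates the integrity constraints can be rolled back to the previous stable state.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Hypothetical sketch of a reversible reconfiguration transaction.
interface ReconfigOp {
    void apply();    // e.g. unbind, stop, removeSubComponent
    void undo();     // the reverse operation, e.g. bind, start, addSubComponent
}

class ReconfigTransaction {
    private final Deque<ReconfigOp> done = new ArrayDeque<>();
    private final Object systemModel;                       // graph instance of the application
    private final Predicate<Object> integrityConstraints;   // checked against the system model

    ReconfigTransaction(Object systemModel, Predicate<Object> integrityConstraints) {
        this.systemModel = systemModel;
        this.integrityConstraints = integrityConstraints;
    }

    void execute(ReconfigOp op) {
        op.apply();
        done.push(op);               // remember it so we can roll back in reverse order
    }

    boolean commit() {
        if (integrityConstraints.test(systemModel)) {
            return true;             // consistent: the new state can be made durable
        }
        rollback();                  // violated: restore the previous stable state
        return false;
    }

    void rollback() {
        while (!done.isEmpty()) {
            done.pop().undo();
        }
    }
}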
Chapter 17
D4.3a: First report on
self-tuning support
17.1 Executive Summary
The first part of this deliverable shows how self-tuning capabilities may be useful, and used, in order to deploy an optimally pre-configured distributed application for best performance. A supporting architecture is described, based on the combination of a self-tuning system and a self-regulated load injection system. While the former system generates possible configurations, the latter system autonomously evaluates the performance of each generated configuration. Such an architecture actually provides a self-benchmarking system, autonomously supporting benchmarking campaigns. Implementation details are given about the self-regulated load injection system.
In the second part of this deliverable, we investigate the current state of load-balancing algorithms for DHTs. A distributed hash table (DHT) extends a structured overlay network (SON) with primitives for storing (key, value)-pairs and for retrieving the value associated with a key. DHTs support both direct key lookups [136, 123, 115] and range queries [127]. In the former case, a hash function is used to distribute the keys evenly among the nodes in the system. However, in a DHT supporting range queries this is not possible, since the order of the stored keys would then be destroyed. Depending on the distribution of the keys, the DHT can quickly become unbalanced. An unbalanced system can lead to network congestion and unresponsive nodes. In order to avoid this, the SELFMAN storage service is extended with algorithms that balance the load evenly over the nodes.
In the work presented here, we assume that by introducing local knowledge about global state, the performance of self-tuning algorithms for DHTs can be improved. We modify an existing load-balancing algorithm [81] with additional knowledge of the average load in the system. With this simple modification, our results showed a decrease of the overall data movement cost induced by the load-balancing. In addition, two centralized algorithms are discussed which can provide optimal solutions and thereby benchmark values used to indicate the performance of the decentralized algorithms. We plan to continue this work by improving the efficiency of the centralized algorithms and by evaluating the introduction of more complex global knowledge such as network topology.
17.2 Partners Contributing to the Deliverable
ZIB (P5), KTH (P2) and FT have contributed to this deliverable.
ZIB (P5) ZIB is contributing the work on self-tuning of DHTs in cooperation with KTH.
KTH (P2) KTH contributed to the work on self-tuning together with ZIB.
FT (P4) FT R&D contributed work on self-tuning before deployment
through self-benchmarking.
17.3 Results
17.3.1 Introduction
The first contribution of this deliverable develops a self-benchmarking architecture. It combines self-tuning capabilities with a self-regulated load
injection system. This combination allows for optimally configuring a distributed application or service before actually deploying and operating it.
Our goal is to avoid deploying a badly configured application that may fail
for performance reasons.
In addition to exact-match queries on a key, the SELFMAN storage service can also support range searches [128]. With range-query support, a hash function cannot be applied to the inserted keys, since this would violate their order. In addition, the distance between keys is typically not uniform, e.g. for the article names in Wikipedia. Therefore, a normal DHT with uniform distance between nodes would quickly become unbalanced when storing different key distributions.
In the second contribution of this deliverable we evaluate current load-balancing algorithms for SON-based DHTs. Our main goals are to a) minimize the overall network utilization and b) investigate how global parameters
can be introduced to improve the performance of existing load-balancing algorithms. Furthermore, we present two centralized approaches, one based on
tree search [88] and one based on auction algorithms [19]. The centralized
algorithms are used to benchmark the decentralized algorithms.
17.3.2 Self-benchmarking
Introduction
Self-tuning is generally considered from a runtime capability perspective,
used while an application or any other system is actually deployed and being
operated. However, self-tuning is also an interesting autonomic capability
before operating an application. As a matter of fact, common (and good)
practice requires a qualification/validation step before deployment, including for performance requirements. This validation process typically involves
a load test campaign, which consists in using a load injection tool to generate a flow of requests on the System Under Test (SUT) and observing its behavior (response times, throughput, call rejections, errors). The use of these
performance measurements is twofold:
1. the SUT is qualified or not qualified according to the performance requirements;
2. the evaluation is used to size the SUT, for instance in terms of replication level (e.g. in the case of a distributed application, set the right
number of servers, with the right power, for the right features).
But, in both situations, the question of the SUT tuning arises. In the first
situation, a SUT may be unfairly disqualified because of a configuration that is bad with regard to performance. In the second situation, the optimal
sizing requires an optimization of the SUT in order to avoid a proliferation
of servers. Optimal sizing is not only important with regard to the cost
of buying servers, but also (and increasingly) with regard to the operation
costs: human resources for system management and maintenance, energy
consumption (including for air conditioning), hosting space, etc.
This pre-deployment tuning step can be considered as a benchmarking activity, since it relies on measuring the performance of a number of SUT configurations and comparing the results in order to choose the best configuration. The next section describes the benchmarking process and explains why self-tuning is relevant to this activity. The following sections focus on the specific autonomic features that are necessary for self-benchmarking, namely self-regulated load injection and self-tuning.
Introduction to load testing and benchmarking
Load testing campaigns consist in generating a flow of client requests on
a SUT in order to assess its performance and sustainable throughput. As
shown by figure 17.1, a load testing infrastructure is typically composed of:
• one or several load injectors sending requests to a SUT and waiting for
responses to measure the corresponding response times;
• probes measuring the usage of computing resources, at the SUT side, to help detect performance problems, as well as at the load injection side, in order to check that it is performing as expected;
• a supervision user interface to deploy, control and monitor the distributed set of load injectors and probes;
• a storage space to gather all measurements (e.g. as a set of log files or a database);
• tools for post-mortem analysis and report generation.
The traffic generated by the load injectors is commonly modeled through the definition of virtual users, i.e. programs that emulate the behavior of
Figure 17.1: Big picture of a typical load testing infrastructure.
real users, through successions of requests and think times (time spent by
a user between 2 consecutive requests). For a given SUT, there are often
a number of different typical usages, thus resulting in defining a number
of different virtual users exhibiting different behaviors. In the case of web
applications, some users may just consult available information, while others
will strongly interact and trigger complex processes, resulting in different
usage of computing resources and thus different impact on performance. For
instance, some behaviors induce database write operations while others don’t.
Performance benchmarking aims at comparing and ranking, from a performance point of view, a variety of options such as configuration parameters or alternative hardware or software implementations.
Benchmarking relies on load testing campaigns where the SUT must be tuned
for optimal performance in order to obtain a meaningful ranking. As a matter
of fact, comparing results from an optimally configured SUT with results from
a badly configured alternative would make no sense.
Towards self-benchmarking
It typically takes a lot of manpower, skills and time to carry a load test
campaign out. The test infrastructure is a complex combination of the load
injection system, probe system, and SUT involving a tremendous number
of parameters that are likely to strongly interact with each other (e.g. size
of buffers, pools of database connections, size and policy of caches, network
configuration, multi-threading policy...). Testers must be experts in every element of the global system (hardware, software, operating system, middleware, network equipment and protocols...) in order to handle troubleshooting and performance optimization. In an empirical and iterative process, tests are repeated again and again with different parameter arrangements
and different configurations until sufficient confidence and satisfaction about
results are met. Then, we see that testers behave like a feedback/control
loop, observing the SUT and the load injection system on the one hand, and
modifying the SUT and load injection configuration on the other hand as a
reaction to observations.
Self-benchmarking consists in considering that the tremendous complexity of the whole computing system used in a benchmarking campaign justifies an autonomic computing approach, that is: trying to use computing power to autonomously deal with the computing system complexity. In other words, self-benchmarking shall carry out test campaigns by autonomously controlling the load injection system and the SUT configurations, with the objective of maximizing performance (the concept of performance may of course be practically mapped to a variety of criteria, such as request throughput, rejection or error rate, response times, number of users, etc.).
Self-regulated load injection
The first step in benchmarking generally consists in searching the approximate performance limit of the SUT in a given configuration. For example,
in the context of application servers, the tester would typically try to find
out the maximum number of users that the SUT may sustain with regard to
given saturation criteria. Common criteria are expressed in terms of response
time, request rejection, error occurrence, or computing resource shortage. A
common way of looking for the saturation limit is to run a variable (generally
growing) number of virtual users and look for the saturation point. Another
way of varying the load injection is to change the proportion between the
different virtual user families (i.e. of different behaviors). During this experimental search, the tester plays an empirical feedback role on the load
injection system: according to the distance between the observation and the
given saturation criteria, the tester more or less increases (or possibly decreases) the load.
Figure 17.2 presents a load testing infrastructure featuring autonomous search of the saturation point. This infrastructure is composed of:
• the system under test;
Figure 17.2: Self-regulated load injection for autonomic search of system performance limits.
• the load injection system, made of one or several traffic generators
depending on the required load level;
• probes measuring computing resource usage at the SUT, including for instance typical system probes (CPU, memory, network...) as well as possibly probes related to specific software elements involved in the environment (middleware, database...);
Since the concept of saturation may be practically characterized in a number of ways, the architecture of our self-regulated load injection system also exhibits a component dedicated to providing and isolating the saturation criteria and the feedback (load injection) policy.
To practically implement this self-regulated load injection system, we can
rely on the open source, component-based CLIF load injection framework
[43]. This framework provides generic load injection and probe components,
that can be controlled (e.g. to dynamically change the load injection level)
and monitored through a supervisor component. We just have to define a
load injection controller component and bind it to the supervisor component
to implement the self-regulated load injection feature. This architecture has
been presented in [70] and used to automatically find the saturation limit
of an XML appliance in the context of Service Oriented Architectures. For illustration purposes, we reproduce the self-saturation experiment results in figure 17.3. The load injection system starts with a minimal workload of 1 virtual user. Step by step, the number of virtual users is increased (or decreased) in order to reach 80% system load (the chosen saturation criterion) for the XML appliance (named ESB load in the figure). It takes about 3 minutes to reach and maintain this limit, dynamically adjusting the number of virtual users.
(Figure data: ESB load in %, number of virtual users, and throughput per second, plotted against time in seconds.)
Figure 17.3: Autonomic saturation search with self-regulated load injection
applied to an XML appliance.
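The control loop behind this self-saturation search is conceptually simple. The following Java sketch (hypothetical interfaces, not the actual CLIF controller component) illustrates the kind of feedback used: the number of virtual users is stepped up or down until the observed load stays close to the saturation criterion.

// Hypothetical sketch of a self-regulated load injection loop (not the CLIF API).
interface LoadInjector { void setVirtualUsers(int n); }
interface Probe { double currentLoad(); }          // e.g. appliance or CPU load, in percent

class SaturationController {
    private final LoadInjector injector;
    private final Probe probe;
    private final double target;                   // saturation criterion, e.g. 80.0
    private int vUsers = 1;                         // start with a minimal workload

    SaturationController(LoadInjector injector, Probe probe, double target) {
        this.injector = injector;
        this.probe = probe;
        this.target = target;
    }

    // One control step: compare the observation with the criterion and adjust the load.
    void step() {
        double load = probe.currentLoad();
        if (load < target - 5.0) {
            vUsers += Math.max(1, vUsers / 10);     // far below target: ramp up
        } else if (load > target + 5.0) {
            vUsers = Math.max(1, vUsers - Math.max(1, vUsers / 10)); // overshoot: back off
        }
        injector.setVirtualUsers(vUsers);
    }
}

Calling step() periodically (e.g. every few seconds) reproduces the behavior shown in figure 17.3: the number of virtual users converges to, and then tracks, the chosen saturation limit.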
Adding self-tuning
The ultimate goal of benchmarking is to qualify the optimal performance of a system and to compare it to other similar but different (in configuration or implementation) systems. Therefore, finding the optimal settings of the tested system, in other words tuning it, is key to the autonomic benchmarking principle. This second step includes step one, since benchmarking requires reaching the maximum performance (i.e. the saturation limit). For instance,
the result of step one could be the conclusion that the SUT in a given configuration sustains a number of active virtual users representing a given mix
of a number of behaviors. Then, the question is: is that result the best the
SUT can deliver, or is it improvable by tuning the SUT? In other words, the
Figure 17.4: Autonomic benchmarking: self-tuning of a system under test
and autonomic search of performance saturation
tester will try to optimize the SUT and rerun the load test until s/he has
the conviction that the maximum performance has been reached. This is,
of course, a matter of estimation, whose accuracy depends on the skill and
experience of the tester, since the combinatorics of the tuning parameters and the complexity of the interactions between them are so huge that a full exploration
of the solution space is not humanly feasible.
Self-tuning allows for replacing the tester with a second control loop introducing an autonomous optimization of the SUT. Synchronization must be achieved between the self-tuning and self-regulated load injection processes in order to alternate optimization phases and saturation search phases in a consistent way. The resulting architecture (see figure 17.4) adds a configuration controller, observing the maximum system performance (i.e. that achieved at the saturation limit) reached for the current configuration, and generating
new possible system configurations. SUT-specific configuration rules must
be provided in order to identify possible tunable parameters and their possible values. A general controller component orchestrates the configuration
controller and the load injection controller.
Self-benchmarking relies on a classical generate-evaluate process which is
supposed to explore a multi-dimensional space of solutions. Roughly speaking, the tested elements can be configured by setting a number of parameters,
with a range of possible values. A number of issues arise then:
• some parameter values may be incompatible with each other;
• some parameters may be correlated;
• the number of parameter value combinations may be far too large to be able to explore them all.
Factor analysis statistical techniques may be used in order to identify and
eliminate parameters that don’t influence performance. Heuristics may also
be introduced in order to guide the exploration in an efficient way. Finally,
a sufficiently good configuration, albeit not the best configuration, shall be
found.
Conclusion
This section has presented how to apply self-tuning capabilities before the actual deployment and operation of an application. The approach consists in generating possible application configurations and evaluating them. It requires adding a self-regulated load injection system that allows for autonomously finding the performance limits of each configuration. This combination can be considered more generally as a self-benchmarking system: not only does it allow for deploying an optimally configured application (with regard to performance concerns), but it also allows for comparing the performance of alternative configurations, which is the key concern of benchmarking. Finally, note that the self-regulated load injection system, based on the CLIF load injection framework, is likely to be very useful for generating user traffic when the SELFMAN project performs experimental assessments of its applications and autonomic scenarios.
17.3.3 DHT Load-balancing
A distributed hash table (DHT) extends a SON with primitives for storing
(key, value)-pairs and for retrieving a value associated with a key. DHTs
support both direct key lookups [136, 123, 115] and range queries [127]. In
the former case, a hash function is used to distribute the keys evenly among
the nodes in the system. However, in a DHT supporting range queries this
is not possible since the order of the stored keys would then be destroyed.
Depending on the distribution of the keys, the DHT can quickly become unbalanced. An unbalanced system can lead to network congestion and unresponsive nodes. In order to avoid this, the DHT is extended with algorithms that balance the load fairly over the nodes depending on their capacity [114]. Figure 17.5 shows an example of a load-balanced system storing a data set from the Wikipedia demonstrator presented in D5.2a. In this scenario, entries containing text in a certain language are more likely to be accessed from countries with a large population speaking that language. Thus, placing replicas in the country or nearby tends to improve client latency. A load-balancing algorithm that considers the data locality can therefore indirectly be used to improve server placement.
(Figure data: a ring of nodes labeled with the languages se, de, nl and en, organized into replica groups 0–4; the item de:Main Page is stored at its responsible replica group.)
Figure 17.5: Geographic Load-Balancing for Wikipedia.
Recent developments in gossiping for unstructured P2P networks have shown that it is possible to estimate global properties with high confidence [151, 150]. For load-balancing, proximity information has proven useful in order to improve the network utilization [159, 133]. However, not only the node topology is of interest: knowledge about, for example, the average node utilization could potentially be used to improve load-balancing efficiency.
With I/O-intensive applications using the DHT, such as the Wikipedia demonstrator, the network can easily become a bottleneck. It is therefore
important that the algorithms used for tuning and DHT maintenance use the
network efficiently. This is especially the case for load-balancing algorithms
since their main operations trigger data movements [81, 53]. By introducing
more global properties there is a trade-off between the gain and the cost in
terms of network usage.
This deliverable explores the effects of introducing knowledge of global
properties to a well-known decentralized load-balancing algorithm [81] which
only considers local knowledge. In addition, we discuss two centralized approaches, one based on the auction algorithm [19] and one on tree-search [88].
The centralized approaches find the optimal solution or a close approximation, which can be used as a comparative benchmark for the decentralized algorithms.
Related Work
Load-balancing algorithms in DHTs focus on three different problems. First, in a DHT where each item is hashed uniformly over the ID space, some nodes can have an O(log N) imbalance in terms of stored items [114, 61]. Second, in DHTs with range-query support, the items must be mapped to the ID space in order, keeping their original distribution. Therefore, for the system to be balanced, i.e. for the nodes to store an equal number of items, the node IDs must be distributed according to the key distribution [81, 53]. Third, independently of the item distribution, certain items can have much higher request rates than others. This is typically solved through caching and replication, together with exploiting redundant network routes [38]. The last issue is not covered further in this deliverable.
Virtual Servers is a technique where each physical node maintains a set
of virtual nodes. Balancing of the load is done by moving virtual servers
from overloaded physical nodes to more lightly loaded physical nodes. The
assignment of virtual nodes to new physical nodes is typically performed at a
directory node. A directory node periodically receives load information from
random nodes in the system. When it has received load data from a sufficient number of nodes, it executes the load-balancing algorithm [114, 61, 31].
Virtual Servers increase the routing table state maintained at each node. In [62], Godfrey et al. introduce a scheme where physical nodes host virtual servers which have overlapping links in the routing table. With this placement restriction, a physical node only needs Θ(log N) routing state while hosting Θ(log N) virtual servers.
Another issue with virtual servers is that a physical node failure causes
the hosted virtual nodes to fail as well. This increases the churn in the
system and must be considered when selecting global parameters such as the
replication factor. In [94], Ledlie et al. present the k-Choice algorithm, wherein each node samples the load from a small set of IDs and directs joining virtual servers to overloaded nodes. They show through simulation that this decreases the amount of load-balancing-induced churn.
The above approaches use simple metrics for the cost of the load-balancing operations, e.g. the number of transferred items or bytes. However, a better cost metric should include the overall network utilization. Both [159] and [133] investigate the effects of proximity-aware load-balancing algorithms for Virtual Servers. In [31], the assignment of Virtual Servers to physical
nodes is modeled as an optimization problem which allows for an arbitrary
cost function.
Item-balancing Most of the research on load-balancing in DHTs has focused on Virtual Servers. However, these approaches assume that items are uniformly distributed over the ID space using a hash function. For a data structure without hashed items, a single virtual server can be overloaded if it is responsible for a popular ID range. For example, when storing a dictionary, keys with the prefix “e” are more common than those with the prefix “w”, resulting in the node responsible for “e” storing more items. The goal of item-balancing schemes is to adapt the location of the nodes in the system to correspond to the item distribution. This is done using two operations, jump and slide. Jump allows a node to move to an arbitrary ID in the system, while a slide operation only exchanges items with a node's direct neighbors.
In [81], Karger et al. present a randomized item-balancing scheme where each node periodically contacts another random node. If the loads of the nodes differ by more than a factor ε, where 0 < ε < 1/4, they share each other's load by either jumping or sliding. Karger provides a theoretical analysis of the protocol, but does not evaluate the algorithms in an experimental or real-world setting.
Ganesan et al. [53] use a reactive approach which triggers a recursive algorithm when the node utilization exceeds a threshold value. A node executing the algorithm first checks whether it should slide by comparing its load with its neighbors' loads. If this is not possible, it finds the least loaded node in the system and requests that it jump to share the overloaded node's load.
We base our algorithms on the work presented by Karger, but introduce global parameters made available at each node through a gossiping protocol. In addition, we develop two different centralized algorithms used to evaluate the global parameters applied to the decentralized approach.
System model and problem definition
A DHT consists of N nodes, where each node has an ID in the range [0, 1). A node has a successor pointer to the node ni+1 with the next larger ID, and a predecessor pointer to the node ni−1 with the next smaller ID. The node with the largest ID has the node with the smallest ID as successor. Thus, the nodes and their pointers can be seen as a ring or doubly linked list where the first and last items link to each other.
The DHT stores a set of items I, where each item has an ID in the range [0, 1) and a weight. Each node stores the subset of items whose IDs fall in the node's range (ni−1, ni], i.e. the node is responsible for those IDs.
Figure 17.6: A node Ni with successor and predecessor and their respective
responsibilities.
Figure 17.6 shows three nodes and their respective responsibilities. Each node has a capability c(ni) indicating the amount of data it can store. The utilization of a node is defined as the fraction u(ni) = (Σj w(Ij)) / c(ni), where the sum ranges over the items Ij stored at ni. The system has an average utilization U(N) = (Σi u(ni)) / N. We say that a system is balanced when the utilization of all nodes falls within the range U(N) ± ε, where ε is a user-defined parameter. Increasing ε leads to a system with a higher variation of node utilization, but it also lowers the cost of reaching a load-balanced state. When u(ni) > U(N) + ε, we say that node i is overloaded, and when u(ni) < U(N) − ε, the node is underloaded. The remaining nodes, which fall within the range U(N) − ε ≤ u(ni) ≤ U(N) + ε, are balanced.
In order to change the utilization of the nodes in the system, two types of operations are used: jump and slide.
Jump allows a node to move to an arbitrary position in the ID space. A jumping node ni first leaves its current position and re-joins at its new location, lk. Data is moved twice: first, the range (ni−1, ni] is transferred to ni+1; second, when ni joins at lk, all data in the range (nj−1, lk] is transferred from ni's new successor nj.
Slide is a specialized form of jumping where a node moves to an ID in the range (ni−1, ni+1). When moving to an ID < ni, the node moves the items in (ID, ni] to ni+1, while when moving to an ID > ni, the node becomes responsible for the items in (ni, ID) in addition to its current responsibilities.
Problem definitions. The load-balancing problem can be summarized as follows: given a configuration C0 with a set of nodes N and items I, where each item ij is assigned to a responsible node, find a configuration Cb that only contains balanced nodes, using the operations jump and slide. A solution to the load-balancing problem is a set of (operation, iteration, node, ID) tuples indicating which operation, at which iteration, the given node should use to reach the new ID.
In addition to the load-balancing problem, we search for an optimal solution set that minimizes the data movement cost of the transition from C0 to Cb. The cost function is defined as cost(ni, li), where ni is a node and li is an ID (location) to which the node will jump or slide. The cost metric can be chosen arbitrarily, but is typically based on the number of bytes moved or the network utilization.
The minimum node utilization required to take the system from any configuration to Cb is the sum of the distances of all overloaded nodes to the upper utilization limit plus the sum of the distances of all underloaded nodes to the lower utilization limit; more formally, Σ(i∈overloaded) u(ni) + Σ(j∈underloaded) u(nj).
Item balancing heuristics In order to reach a load-balanced configuration, we rely on the heuristics introduced for Karger's item-balancing algorithm [81]. Expressed in our notation, a load-balance operation is only performed between a pair of nodes ni, nj if u(ni) ≤ ε·u(nj) or u(nj) ≤ ε·u(ni). When this restriction is satisfied, the following cases are possible (assuming u(ni) ≤ u(nj)).
Case 1 ni is underloaded, nj == ni+1, and nj is overloaded. Slide ni towards nj, letting ni take responsibility for a fraction of nj's items.
Case 2 ni is overloaded, nj == ni+1, and nj is underloaded. Slide ni towards nj, letting nj reduce ni's load by taking responsibility for a fraction of ni's items.
Case 3 When nj is not a successor, ni jumps to a location in (nj−1, nj).
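A compact Java sketch of this decision is given below. It is only an illustration: the Node type and its operations are assumed, the three cases above are collapsed into a single neighbour test, and ε is the Karger parameter with 0 < ε < 1/4.

// Sketch of one item-balancing step between two nodes (after the cases above).
interface Node {
    double utilization();                          // u(n)
    Node successor();
    Node predecessor();
    void slideTowards(Node other);                 // cases 1 and 2
    void jumpBetween(Node pred, Node succ);        // case 3
}

class ItemBalancing {
    static void tryBalance(Node ni, Node nj, double epsilon) {
        Node light = (ni.utilization() <= nj.utilization()) ? ni : nj;
        Node heavy = (light == ni) ? nj : ni;
        // Only balance when the loads differ enough: u(light) <= epsilon * u(heavy).
        if (light.utilization() > epsilon * heavy.utilization()) return;

        if (heavy == light.successor() || light == heavy.successor()) {
            // Cases 1 and 2: the nodes are direct neighbours, so sliding is enough.
            light.slideTowards(heavy);
        } else {
            // Case 3: the light node jumps to a position just before the heavy node.
            light.jumpBetween(heavy.predecessor(), heavy);
        }
    }
}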
Heuristics to minimize data transfer. The order in which the join and slide operations are performed influences the total amount of data transferred. First, consider a set of neighboring underloaded nodes as depicted in figure 17.7. If node X leaves the system before node Y, the data from X is transferred to Y. When Y leaves the system, the data from X + Y is transferred to node B. In contrast, if node Y leaves before X, its data is transferred to B; later, when node X leaves, it transfers its data directly to B. Thus, when a set of neighboring underloaded nodes are leaving the system, the nodes should leave in ascending order to avoid redundant transfer of data.
Figure 17.7: A chain of underloaded nodes leaving the system.
The second heuristic considers when a set of nodes take over responsibility
from an overloaded node as depicted in figure 17.8. This is similar to the
previous case since if node Y joins first, it will transfer all items in the range
(A, Y]. Thereafter, when node X joins, it will transfer the data in the range (A, X], which is now held by node Y. If X joins first, the data in the range (X, Y] will not be transferred twice. Thus, when a set of nodes are joining the system at the same node, the nodes should join in descending order.
Figure 17.8: Two consecutive free slots being filled by joining nodes.
Centralized Algorithms
In a centralized algorithm the global state of the system is known by an
oracle. The centralized approaches presented below are used as comparative
benchmarks for the decentralized algorithms. We discuss two centralized approaches: tree search, where a decision tree describing all possible choices for each configuration is traversed using depth-first search [88]; and a second approach based on auction algorithms [19], where overloaded and underloaded nodes are matched to find an optimal assignment.
Tree-search The goal of the tree-search algorithm is to find a solution with the lowest possible total cost. A node in the tree is a system configuration, Ci, and an arc represents a node's decision resulting in a new configuration Cj. A node decision is an operation, jump or slide, and a location. Each decision resulting in a new configuration has a cost. The solution with the lowest possible cost is therefore the path in the tree with the lowest cost. Figure 17.9 shows an example sketch of a search tree.
Figure 17.9: Sketch of a search tree.
The expansion at each node in the search tree is limited to the possible
operations that any node can perform. The available operations for a node
are as follows: (1) slide, if the node is underloaded and its successor is overloaded, or vice versa; (2) jump, if the node is underloaded. Note that balanced nodes do not perform any operations. However, depending on the movements of underloaded nodes, balanced nodes can become overloaded. An underloaded jumping node can choose any free position at any overloaded node. Since the number of overloaded and underloaded nodes decreases after performing an operation, there will be fewer choices towards the end of a search. The complexity of the tree search is in the worst case O((N · d)^o), where N is the number of nodes, d is the maximum number of decisions for any node, and o is the number of operations needed to reach a load-balanced state.
In order to lower the amount of computation, the tree is only searched up to a given maximum cost, α. This is also known as depth-limited search (DLS). However, since DLS returns after the limit has been reached, it will return the first solution, but not necessarily the solution with the lowest cost. Therefore, we apply a variant of iterative deepening depth-first search [88], called iterative lengthening, which tries alternative paths up to α until the solution with minimum cost is found.
An alternative approach is to use randomization, as done by Karger in
[81] for a decentralized algorithm. Instead of expanding the tree with each
possible node operation, an operation is chosen at random. Thus, the complexity is lowered to O(N^o), but without the guarantee of finding a solution
with minimal cost. However, since Karger’s algorithm is a well-known decentralized load-balancing solution, it is interesting to study the variation of
its performance with access to global knowledge.
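For illustration, the following Java sketch (hypothetical Config and Move types, not our simulator code) shows the shape of the cost-bounded depth-first search; the iterative-lengthening driver here uses a simplified fixed-step increase of the cost limit, whereas the real variant raises the limit to the smallest pruned cost.

import java.util.List;

// Sketch of cost-bounded depth-first search over configurations.
interface Config {
    boolean isBalanced();
    List<Move> possibleMoves();          // slide/jump decisions available in this configuration
}
interface Move {
    Config apply(Config c);
    double cost();                        // data movement cost of the decision
}

class TreeSearch {
    // Depth-limited search: returns the cost of the cheapest solution within budget alpha,
    // or Double.POSITIVE_INFINITY if none exists within that budget.
    static double search(Config c, double spent, double alpha) {
        if (c.isBalanced()) return spent;
        double best = Double.POSITIVE_INFINITY;
        for (Move m : c.possibleMoves()) {
            double next = spent + m.cost();
            if (next > alpha) continue;                 // prune paths above the cost limit
            best = Math.min(best, search(m.apply(c), next, alpha));
        }
        return best;
    }

    // Iterative lengthening (simplified): restart with a growing cost limit until a
    // solution is found, so the returned solution is (close to) the cheapest one.
    static double iterativeLengthening(Config start, double step, double maxAlpha) {
        for (double alpha = step; alpha <= maxAlpha; alpha += step) {
            double cost = search(start, 0.0, alpha);
            if (cost < Double.POSITIVE_INFINITY) return cost;
        }
        return Double.POSITIVE_INFINITY;
    }
}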
The Auction Algorithm Since the complexity of the tree-search algorithm increases exponentially, we are currently investigating an alternative
approach involving auction algorithms [19]. An auction algorithm finds an
optimal one-to-one assignment of persons to objects in polynomial time. The
assignment depends on the cost of the object and the benefit of the person
being assigned to the object. For the load-balancing problem this is analogous to finding a lowest cost match between underloaded and overloaded
nodes.
More formally, each person i has a benefit aij of selecting an object j with price pj. The net value for person i of choosing object j is aij − pj. The goal of the auction is to find an assignment where each person finds an object which maximizes its net value. Thus, an auction is finished when the equilibrium aij − pj = max(k∈Objects)(aik − pk) is satisfied for every person i and its assigned object j.
Each iteration of the algorithm consists of a bidding phase followed by
an assignment phase. During the bidding phase, each person finds an object
resulting in maximum net value after which it computes a bidding increment.
The value of the bidding increment is used after the assignment phase to
increase the price of the object. In the assignment phase the persons with
the highest bids are assigned to the respective objects. When all persons are
assigned to an object the algorithm terminates. This also means that the
equilibrium has been satisfied [18].
When applying the auction algorithm to the load-balancing problem, an underloaded node translates to a person and an overloaded node is seen as an object. In a single iteration, an auction returns an assignment between underloaded nodes and overloaded nodes that minimizes the data transfer cost. This assignment is used to relocate the underloaded nodes. Relocation is currently done within a single iteration according to the heuristics presented in section 17.3.3. Subsequent iterations are executed in the same way until there are no more unbalanced nodes, that is, until for every node i, ε · max(i=0..N)(u(ni)) ≤ u(ni) ≤ max(i=0..N)(u(ni)), where 0 < ε < 1/4.
We implemented the described algorithm in a simulation environment. Figure 17.10 shows the number of moved items for increasing values of epsilon. The system has 100 nodes, and the data set is an American dictionary with 380,645 entries. The figure shows that epsilon can be used to control the data movement cost necessary to reach a load-balanced configuration: as epsilon goes towards zero, the allowed difference in node utilization when nodes are compared is relaxed.
An advantage of the auction algorithm is that the benefit function and
the object price can be chosen arbitrarily. This allows us to explore more
complicated costs than, e.g., the number of moved data items. Furthermore, the order of the load-balancing operations slide and join affects the total price of the load-balancing process.
(Figure data: number of moved items for the auction algorithm, plotted against epsilon from 0 to 0.25.)
Figure 17.10: An increasing value of epsilon decreases the allowed difference
between node utilization.
Due to the apparent advantages in computational
complexity of the auction algorithm approach, we are actively investigating
an appropriate cost-function which can include proximity information and
the order of operations.
Decentralized algorithms
Unlike the centralized algorithms, a decentralized algorithm can only use the
information locally available at each node. We modify Karger’s randomized
item-balancing algorithm to work with different globally known parameters,
for example, the system's average load.
Perfect global knowledge is not available in a distributed system without
using expensive aggregation algorithms. However, by using gossiping techniques such as Vicinity and Cyclon [151, 150] it is possible to get a good
approximation of a parameter’s value with low network traffic overhead. Initially, we are interested in the parameters below.
Average Load The average system load can be used by each node to decide
if it is underloaded or overloaded or within the balanced range.
Location Proximity-information allows a node which will transfer load to
select a target node which minimizes the network utilization [159, 133].
SELFMAN Deliverable Year Two, Page 276
CHAPTER 17. D4.3A: FIRST REPORT ON SELF-TUNING SUPPORT
Over- and Underloaded nodes A list of the k most overloaded and most underloaded nodes. These lists can be used, for example, in the Karger algorithm to improve the convergence rate.
Load Error Margin The error margin indicates how much a node’s load
can differ from the average load. By changing this parameter it is
possible to tune the aggressiveness of the load-balancing algorithm.
We implemented Karger's algorithm, and a version with knowledge of the system's average load, in a simulation. Using the average load, we apply the heuristics to minimize data transfer from section 17.3.3. When a node jumps to another node, instead of splitting the node's load in half, the minimum of the average load and the load that would otherwise be shared is used. The effect of using the extra information on the average load is especially distinct in figures 17.14 and 17.13.
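The modification itself is small. The following sketch (hypothetical helper, with load expressed in stored items) shows the amount of load a jumping node takes over from an overloaded node, with and without the gossiped average.

// Sketch of the share computation in the modified Karger algorithm.
class AvgLoadKarger {
    // Plain Karger: the joining node takes half of the overloaded node's load.
    static double shareWithoutGlobalKnowledge(double overloadedLoad) {
        return overloadedLoad / 2.0;
    }

    // With a gossip-based estimate of the average load, the joining node never takes
    // more than the average, which avoids overshooting and re-balancing later.
    static double shareWithAverageLoad(double overloadedLoad, double averageLoad) {
        return Math.min(averageLoad, overloadedLoad / 2.0);
    }
}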
(Figure data: initial load, in number of items, for each of the 100 nodes.)
Figure 17.11: Number of items per node.
The simulation was set up with 100 nodes and an initial load distribution as shown in figure 17.11. Each simulation step tries to perform a load-balancing operation for each node, using both the normal Karger load-balancing algorithm and the one with knowledge of the system's average load. The algorithms were each tested with different ε values in the range 0 < ε < 0.25, as suggested by Karger.
Figures 17.12 and 17.13 show how the algorithms perform using different ε values, taking into account the total number of moved items during the simulation and the standard deviation of the load distribution among the nodes at the end of the simulation. The standard deviation is used to measure the degree of imbalance of a given load distribution.
(Figure data: number of moved items plotted against epsilon for Karger and Karger with average load information.)
Figure 17.12: Absolute number of moved items for a network with 100 nodes
with increasing epsilon for Karger and Karger with average load information.
Figure 17.14 shows how the standard deviation decreases as the load-balancing algorithms move items from one node to another, either by shifting
the address range of two adjacent nodes or by moving a node to a different
position.
Discussion
We showed that it is possible to reduce the absolute cost of data movement by introducing simple heuristics and knowledge about basic global parameters. We plan to continue this work by evaluating the effects of more
complex global knowledge such as the network topology. In addition, by
improving centralized algorithms that can find the optimal solution, we can
effectively evaluate the result of changes to the decentralized algorithm. Furthermore, there are also other aspects of the storage system such as replication, transactions and churn that should be considered in combination with
load-balancing.
(Figure data: standard deviation of the load distribution plotted against epsilon for Karger and Karger with average load.)
Figure 17.13: The imbalance with increasing epsilon for Karger and Karger with average load.
(Figure data: standard deviation of the load distribution plotted against the number of moved items, for Karger (ε=0.21) and Karger with average load (ε=0.21).)
Figure 17.14: The load imbalance as a function of the number of moved
items.
Chapter 18
D4.4a: First report on
self-protection support
18.1 Executive Summary
In deliverables D1.3a and D1.3b (Section 4), we proposed and built a testbed for testing Small World Networks (SWN). The reason for investigating SWN is that they have better properties with respect to some of the drawbacks of Distributed Hash Table (DHT) based structured overlay networks. SWN have trust and identity relationships which mitigate the serious problems caused by Sybil-type attacks in DHTs. They are also more robust because of their partly random nature, which places them between a structured network and a random one.
Here, we investigate SWN as a suitable, more secure self-organizing network which may be able to replace DHT-based structured overlay networks in situations which require self-protection against Sybil and DHT maintenance attacks. In D1.3b, we describe the SWN testbed and simulator as well as give background on SWNs.
D4.4a investigates self-protection using two kinds of SWNs. One is where the SWN has a global property on the identifiers of nodes (the kind of SWN proposed by Kleinberg). So far our experiments indicate that the routing success rate is very high. This suggests that a SWN may be feasible as a replacement for a structured overlay network in scenarios where there can be malicious nodes or users.
The other kind of SWN does not need any special property on the node
identifier. Rather it attempts to reorganize the node identifiers so that it
resembles the first SWN. We have identified some attacks which attempt to
poison the reorganization process and delete node identifiers. We have also
investigated some initial security mechanisms against these attacks based on
self-tuning ideas (these happen to fit well with SELFMAN).
18.2 Contractors contributing to the Deliverable
NUS (P7) has contributed to this deliverable.
NUS (P7) NUS has designed and implemented the Small World Network testbed and simulator, which has been used to investigate two initial self-protection mechanisms: a Small World Network with global identifiers as a SON, and a self-tuning protection mechanism for Small World Networks which use Kleinberg-style reconstruction.
18.3
Introduction
In deliverable D1.3b (Section 4), we introduced the motivation of Small World
Network (SWN), described several SWN models, and introduced our SWN
simulator testbed. We also saw that routing performance for the baseline
case of a static SWN was rather good since even without much structure,
apart from the SWN properties, the number of hops was small.
This deliverable goes deeper into the use of a SWN as the base of a
self organizing network which does not have the drawbacks of DHT-based
structured overlay networks. We look at two kinds of SWNs. One is where
the SWN has a global property on the identifier of nodes (the kind of SWN
proposed by Kleinberg). So far our experiments indicate that the routing
success rate is very high. This suggests that a SWN may be feasible as a replacement for a structured overlay network in scenarios where there can be malicious nodes or users.
The other kind of SWN does not need any special property on the node
identifier. Rather it attempts to reorganize the node identifiers so that it
resembles the first SWN. We have identified some attacks which attempt to
poison the reorganization process and delete node identifiers. We have also
investigated some initial security mechanisms against these attacks based on
self-tuning ideas (these happen to fit well with SELFMAN). A paper based
on this work is attached in Appendix A.19.
18.4
Small World Network Experiment Testbed
In the simulator described in deliverable D1.3b (Section 4), we are able to test different kinds of SWN models, vary the number of nodes, set the number of malicious nodes, tune the greediness of the greedy routing (explained later), simulate node and link failures, and plot the results as graphs. Figure 18.1 shows the GUI of the simulator.
Three kinds of SWN model can be selected by configuring the simulator. The first is the Kleinberg model, which has base connections, global knowledge of the positions, and a constant number of links. The second is the Normal (DHT-like) model, which has log(n) links. The third is the Sandberg model [126], which uses 6 ∗ log(n) links. All the links in the models follow a Kleinberg-style power-law distribution [86] to make the graph a small world (i.e. having low diameter and high clustering coefficient).
In order to test how resilient the SWN is against malicious nodes, the simulator provides a way to simulate any number of malicious nodes. The behavior of the malicious nodes can also be configured to perform certain types of attacks, for example by switching aggressively (active) or defensively (passive).
Figure 18.1: Simulator Interface
To measure the robustness of a routing, the greediness of the routing algorithm is tunable. Less greedy means we can make use of routing alternatives
(by not picking the best one). By utilising more alternatives, routing can
better withstand node failure.
Last but not least, the simulator can test the dynamism of the network, for example churn: it can generate a variable number of node or link failures and then freeze the network for later analysis using 10,000 routing tests. If a route picks a failed link or node, the routing length gets longer, and the routing is declared a failure if a certain TTL is exceeded or a dead end is reached.
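As a rough illustration of one such routing test, the sketch below shows the TTL and dead-end checks; Node, isAlive, neighbors and distanceTo are hypothetical names of our own, and the actual simulator code differs.

    // Simplified sketch of one routing test with a TTL and dead-end detection.
    interface Node {
        boolean isAlive();
        Iterable<Node> neighbors();
        double distanceTo(Node target);
    }

    class RoutingTest {
        static boolean route(Node source, Node target, int ttl) {
            Node current = source;
            for (int hops = 0; hops < ttl; hops++) {
                if (current == target) return true;      // routing succeeded
                Node next = null;
                double best = Double.MAX_VALUE;
                for (Node n : current.neighbors()) {
                    if (!n.isAlive()) continue;          // skip failed nodes/links
                    double d = n.distanceTo(target);
                    if (d < best) { best = d; next = n; }
                }
                if (next == null) return false;          // dead end: no live neighbor left
                current = next;                          // greedy step toward the target
            }
            return false;                                // TTL exceeded: declared a failure
        }
    }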
The results of the experiments are visualized as graphs to give immediate feedback. The graphs displaying the "Infection Percentage", "Switch Percentage", "Success Rate", "Average Routing Hops", and "Routing Hops Percentiles" are useful for the self-protection experiments. For example, we need to design the self-protection mechanism such that the resulting network has a high success rate (good routing success rate), good routing performance (good average and routing hop percentiles), a low infection percentage (little poison), and a low switch percentage (low network traffic). These five graphs give us different insights when designing and tuning the self-protection experiments.
To simulate a very large number of nodes, our simulator has a mode which can run in parallel on a cluster of machines. We make use of the machine cluster at NUS, which has about 100 nodes with 200 CPU cores. The simulator can efficiently simulate a large number of nodes (e.g. N = 100,000) together with many different parameters of the SWN within a few to tens of minutes.
18.5
Small World Network as the Network
The motivation for the SWN is to avoid the problems with DHTs regarding
the lack of identity, network maintenance including node/link failure, and
robustness in routing in terms of alternative paths. We have begun an initial
study to examine the question of whether it is feasible to replace a DHT-based structured overlay network with a SWN.
In this section we will measure some aspects of SWN robustness with
respect to node/link failure, and routing performance. We assume that the
network has the global coordinate position [86], so the routing test will route
messages from a particular node to another node with known coordinate
in the network (based on the Kleinberg model). An example of a coordinate position in a 2-D graph is the Cartesian coordinates of a node. Similarly, the coordinate position of a node in a 1-D graph could be its ring identifier.
The greedy routing algorithm routes the message to the neighbor that is closest to the target/destination node's coordinate. Note that in a distributed setting, greedy routing only requires local operations (i.e. routing only requires knowing the neighbor positions and the target node position).
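For the 1-D case, the closeness used by greedy routing can be taken as the circular distance between ring identifiers. The helper below is a small illustrative sketch of our own, not code from the simulator; greedy routing then forwards the message to the neighbor whose identifier minimizes this distance to the target.

    // Circular distance between two ring identifiers in an identifier space of size n.
    class RingMath {
        static long ringDistance(long a, long b, long n) {
            long d = Math.abs(a - b) % n;
            return Math.min(d, n - d);
        }
    }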
We already saw in deliverable D1.3b (Section 4) that without node/link failure the routing experiments essentially succeeded all the time (100% success); see Figure 18.2. Note that the number of nodes for all experiments in this section is 100,000.
In Figure 18.2, the Kleinberg model can indeed route within O(log²(n)) steps using greedy routing. However, the performance can be improved by increasing the number of links to log(n) as in the Normal model and 6 ∗ log(n) as in the Sandberg model.
In the presence of node/link failure, the effect on the SWN models can
be seen in Figure 18.3. Node failure is equivalent to having all links to the node fail, while in link failure only some of the links of a node fail. Our experiments are performed by freezing the network at a point in time, doing routing tests, and then carrying on with network changes.
Figure 18.3 shows our preliminary experimental results on the robustness
of a SWN. In the Kleinberg model, link failure is less damaging than node failure. This is because the Kleinberg model has only 2 additional links, making it very vulnerable to node/link failure. The success percentage drops below 50% with only 15% node failure or 40% link failure. In the Normal model, with log(N) shortcuts instead of a constant 2 additional shortcuts, the robustness increases: the success percentage drops below 50% with 60% node failure or 70% link failure. In the Sandberg model, with 6*log(N) shortcuts, the success rate is better again.
Figure 18.2: Routing Length Distribution
Figure 18.3: Comparisons between 3 models
To understand the issues of routing robustness, we varied the choice of
route from the best to the worst local choice. This shows the effect of route
choice on eventual success.
Figure 18.4: Greediness
In Figure 18.4, it can be seen that in the Kleinberg model, where the number of edges is very small, the greediness strongly affects the routing success percentage. As the number of edges increases to log(N) and 6∗log(N), as in the Normal model and the Sandberg model, the success percentage increases a lot. The greediness tells how greedy the routing is. For example, 100% greedy means the closest neighbor to the target node is always selected. 80% greedy means the closest neighbor to the target node is selected with probability 0.8; if it is not selected, then the next closest neighbor to the target node is selected with probability 0.8, and so on. If no neighbor is chosen this way, a neighbor is selected at random.
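The sketch below shows one way such a greediness parameter can be implemented; the assumption that the neighbors are already sorted by distance to the target, and all names, are ours rather than the simulator's.

    // Greediness-based neighbor selection: with probability g the best remaining
    // (closest-to-target) neighbor is taken, otherwise the next one is tried, and
    // so on; if none is accepted this way, a neighbor is picked at random.
    import java.util.List;
    import java.util.Random;

    class GreedyChoice {
        static <T> T select(List<T> neighborsSortedByDistance, double g, Random rng) {
            for (T candidate : neighborsSortedByDistance) {
                if (rng.nextDouble() < g) {
                    return candidate;                    // accept this candidate with probability g
                }
            }
            // fell through all candidates: choose uniformly at random (list assumed non-empty)
            return neighborsSortedByDistance.get(rng.nextInt(neighborsSortedByDistance.size()));
        }
    }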
Our preliminary results suggest that this version of a SWN is quite promising. Routing works well even in the presence of failures. This can be explained by the robustness experiments, since there is quite a lot of leeway in choosing routes before the success rate drops.
18.6
New Security Issues with Small World
Networks
The Kleinberg SWN model works under the assumption that the coordinates of the network are given; thus, every node knows its own coordinate, its neighbors' coordinates, and the target node's coordinate (see Figure 18.5). This property allows greedy routing to work.
In P2P settings, it might be the case that global information such as
coordinate either does not exist or is not known. As we will see, this causes
a problem for greedy routing.
Sandberg [126] proposed a method to partially recover the node positions
(coordinates) by using only local knowledge at each node. Thereafter, greedy
routing can be applied, and as we show below, it works almost as well as in the original network with coordinates as identifiers. This can be thought of as a Self Organizing Network that can reorganize the positions in the network. For an illustration of how the self-organization works, see Figures 18.5, 18.6, and 18.7.
Figure 18.5: Perfect node positions
Figure 18.5 is a graph with perfect coordinates, where the coordinates are colored continuously along the ring using red-green-blue hues. The figure depicts 1000 nodes: if the position coordinate space is from 0 to 1000, then nodes with positions between 0 and 333 are colored reddish, those with positions from 334 to 666 are colored greenish, and those with positions from 667 to 999 are colored bluish. In the figure, the reddish nodes are on the right side, the greenish ones are on the bottom left, and the bluish ones are on the top left side of the ring. This coloring scheme makes it easy to see a node's position and its coordinates.
Figure 18.6: Shuffled node positions
Then the coordinates are shuffled, yielding randomized colors along the ring as in Figure 18.6. In this graph, greedy routing does not work since the coordinates are random: they cannot be used to guide the greedy routing to the target node, so the routing performance of this graph is very poor. The situation is similar to an initial peer-to-peer system that does not have any coordinates, in which nodes begin by generating their own unique coordinates (which are essentially random coordinates, just like the shuffled ring above). The network is then expected to recover the coordinates so that greedy routing works well.
Sandberg [126] shows that the network can be restored as in Figure 18.7.
We can see in the picture that the ring colors are not the same as in the original perfect ring in Figure 18.5; in fact, without global information it is very difficult, if not impossible, to get back to the perfect state. Even though the restored ring is imperfect, greedy routing works very well on the restored network. Routing performance on the restored ring
is close to the perfect ring performance.
The recovery (self-organizing) algorithm was not designed to defend against
attacks. In this deliverable, we have begun an initial investigation on the
attacks against the self-organizing recovery algorithm and protection mechanisms against attacks on the Sandberg reorganization mechanism. More details can be found in the attached paper in Appendix A.19.
Figure 18.7: Restored node positions
The self-organizing algorithm works by first assigning each node a random position. Then, on each iteration, each node does a random walk of a certain length to another node and attempts to switch positions with that node if it is beneficial. The switch is determined by the switching probability, which is calculated as the product of the edge lengths before the switch divided by the product of the edge lengths after the switch. If the resulting value is larger than 1, the nodes always switch (see Figure 18.8).
Figure 18.8: Switching Probability
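A minimal sketch of this switch decision is given below; it assumes each node can compute the lengths of all edges incident to the two nodes, both with the current and with the proposed positions (the inputs and names are ours, with Sandberg's algorithm [126] as the reference).

    // Switch decision of the self-organizing algorithm: the acceptance value is
    // the product of the edge lengths before the switch divided by the product
    // of the edge lengths after the switch; values above 1 always switch.
    import java.util.Random;

    class SwitchDecision {
        static boolean accept(double[] edgeLengthsBefore, double[] edgeLengthsAfter, Random rng) {
            double ratio = 1.0;
            for (double d : edgeLengthsBefore) ratio *= d;
            for (double d : edgeLengthsAfter)  ratio /= d;
            // ratio >= 1 means the switch shortens the edges overall: always switch.
            return ratio >= 1.0 || rng.nextDouble() < ratio;
        }
    }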
This self-organizing algorithm (without any protection) is vulnerable even to a small number of malicious nodes. This is because the malicious nodes can fake positions, drop positions, create similar positions, etc. Basically, these attacks try to poison the positions of nodes while the network is self-reorganizing the positions. Even with a small number of malicious nodes, a small amount of poisoning can eventually propagate and infect the entire network. Hence, self-protection mechanisms are necessary.
Figure 18.9: Partial Restart Strategy
We proposed a partial restart strategy to minimize the effect of poison: each node generates a new position with probability r. Using this strategy, the position distribution in the network remains uniform. Since the convergence of the recovery algorithm is faster than the restart strategy, the network is able to maintain good routing performance even though the positions are being randomized. Figure 18.9 shows the effect of self-protection by the partial restart strategy: it keeps the infection below 10% and the routing success above 80%.
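A sketch of one round of the protected self-organization, combining the random-walk-and-switch step (reusing the SwitchDecision sketch shown after Figure 18.8) with the partial restart, might look as follows; the SwnNode interface and all names are our own illustration, not the simulator's code.

    // One round of self-organization with the partial restart strategy: each node
    // first takes a fresh random position with probability r, then attempts the
    // usual random-walk-and-switch step.
    import java.util.Random;

    class ProtectedRound {
        interface SwnNode {
            void setPosition(double p);
            SwnNode randomWalk(Random rng);
            double[] edgeLengthsBefore(SwnNode peer);
            double[] edgeLengthsAfter(SwnNode peer);
            void swapPositionWith(SwnNode peer);
        }

        static void run(Iterable<SwnNode> nodes, double r, Random rng) {
            for (SwnNode node : nodes) {
                if (rng.nextDouble() < r) {
                    node.setPosition(rng.nextDouble());  // partial restart: new random position
                }
                SwnNode peer = node.randomWalk(rng);     // random walk of a certain length
                if (SwitchDecision.accept(node.edgeLengthsBefore(peer),
                                          node.edgeLengthsAfter(peer), rng)) {
                    node.swapPositionWith(peer);
                }
            }
        }
    }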
Our preliminary experiments showed that simple decentralized security mechanisms help in minimizing the effect of the attack. It is interesting that both the self-organizing algorithm and the self-protection mechanisms here are also instances of self-tuning algorithms.
18.7
Papers
The paper which describes the work here is as follows:
Felix Halim, Yongzheng Wu and Roland H.C. Yap, “Security Issues in
Small World Network Routing”. It has been accepted by the Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO
2008). It is attached in Appendix A.19.
Chapter 19
D5.2a: Application design
specifications
19.1
Executive Summary
The purpose of WP5 is:
1. to provide use cases and requirements in different applicative contexts
(applications),
2. so as to help the Selfman project as a whole to choose which application(s) will be demonstrated,
3. and then to perform evaluations of the Selfman demonstrators and more
globally the Selfman technologies and overall approach (autonomics
based on components and overlays).
Four applications were considered in WP5.¹ The first one, proposed by France Telecom, concerns M2M systems. The second one, proposed by ZIB, concerns a distributed database system. Two additional applications were investigated as replacements for the one that should have been proposed by the partner E-plus, which left the Selfman project in its first year. The first one, proposed by the Peerialism company (previously named PeerTV and then Stakk; contact established by KTH(P2)), concerns P2P video streaming (P2P TV). The second one, proposed by the Bull company (contact established by FT R&D(P4)), concerns a J2EE application server. After investigation, the applications from Bull and FT R&D(P4) finally appeared not suitable as Selfman demonstrators and were then discarded (more details later in this deliverable concerning the M2M application by FT R&D(P4)).
¹ More elements on the history/process of Selfman WP5 in the first year of the project are detailed in the corresponding deliverable documents of the first year.
This deliverable provides design specifications for the wiki distributed
database application and P2P TV application.
19.2
Contractors Contributing to the Deliverable
ZIB(P5), Peerialism and FT R&D(P4) have contributed to this deliverable.
ZIB(P5) has contributed to the design specification and implementation of the wiki distributed database application. This work is based on the development in WP3 on transactions in structured overlay networks and the user requirements specified in WP5 (D5.1).
Peerialism has contributed to the design specification and implementation of the P2P TV application. This work is based on the user requirements specified in WP5 (D5.1).
FT R&D(P4) has contributed to the editing of this deliverable. FT R&D(P4) provided very detailed user requirements on a multi-service M2M application in D5.1a. This application was finally not implemented by FT R&D(P4) in the context of Selfman. It is perhaps worth giving here the main reasons why:
• This was not what FT proposed to do in Selfman to start with (cf. the DOW: FT was supposed to provide component technology, user requirements and evaluation technology, not applications).
• FT did its best on user requirements for the M2M application (not to mention additional work in autumn 2007 because the deliverable was not accepted) but did not have the manpower to develop an M2M application (defining user requirements for an application is not the same activity as actually implementing this application).
• As discussed in the conclusion of D5.1, Selfman technologies might be interesting for M2M in the long term, but M2M, in FT R&D(P4) settings at least, is not mature enough to make use of largely distributed architectures (M2M applications today send sensor data directly to one centralized J2EE server; they are years away from using overlays).
• FT R&D(P4) does not see how to implement an M2M application with the current Selfman technologies and (lack of) architectural vision (several component models, several overlays).
19.3
Results
19.4
Wiki Application Design Specifications
The distributed Wikipedia is a reimplementation of Wikipedia using the principles and techniques developed within the Selfman project. Wikipedia is particularly interesting for us because it is the only large-scale web application which gives access to its source code and data. The code is written in PHP and available on their web site. More importantly for us, they provide complete database dumps.
For the distributed Wikipedia, we use the replicated storage service (see D3.XX) as the backend and built a layer on top which maps the Wikipedia operations onto the key-value store. Fig. 19.1 shows the general architecture. At the top we have the transactional key-value store using Chord#. The presentation layer consists of a couple of webservers which handle the rendering of wiki text to HTML and provide forms for changing the contents. A load balancer distributes the user requests over all the webservers.
[Diagram: a client's HTTP request for page A goes through the load balancer to a webserver, which accesses the replicas of page A stored in Chord#.]
Figure 19.1: Distributed Wikipedia on a transactional data store based on
Chord# .
A more detailed description as well as performance results can be found
in App. A.20.
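To make the mapping of wiki operations onto the key-value store concrete, a page edit can be expressed as a single transaction, roughly as sketched below. The store interface shown here is invented for illustration only; it is not the actual Chord# or SELFMAN transaction API, and the real backend is not written in Java.

    // Illustrative mapping of a wiki "save page" operation onto a transactional
    // key-value store. All interfaces and names here are hypothetical.
    interface Transaction {
        String read(String key);                 // returns null if the key does not exist
        void write(String key, String value);
    }

    interface TransactionBody {
        void run(Transaction tx);
    }

    interface KeyValueStore {
        // Executes the body as one atomic transaction; returns false if it aborts.
        boolean atomically(TransactionBody body);
    }

    class WikiBackend {
        private final KeyValueStore store;
        WikiBackend(KeyValueStore store) { this.store = store; }

        boolean savePage(final String title, final String newText) {
            return store.atomically(new TransactionBody() {
                public void run(Transaction tx) {
                    String oldText = tx.read("page:" + title);
                    String history = tx.read("history:" + title);
                    tx.write("page:" + title, newText);   // current revision
                    tx.write("history:" + title,
                             (history == null ? "" : history) + "\n"
                             + (oldText == null ? "" : oldText));
                }
            });
        }
    }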
We participated in the first IEEE Scale Challenge with the distributed Wikipedia and won the first prize. There we showed a deployment of the Bavarian Wikipedia over several dozen nodes in Europe. A second deployment was running on a cluster in Berlin with a simplified English version.
19.5
P2P TV Application Design Specifications
In this section we present our experience with using Kompics [9], a software framework for building distributed applications developed at KTH. We first give a short introduction to the framework itself, then we describe our system and explain the reasons why we decided to develop part of our application in Kompics. Finally, we provide a description of the software's design and a preliminary evaluation of it.
19.5.1
Peerialism’s system
Peerialism’s product is a content distribution platform which performs audio
and video streaming directly to the customer’s home computer. It does that
by building an ad-hoc overlay network between all hosts requesting a certain
stream. This network is organized in such a way that the load of the content
distribution is shared among all the participating peers.
The main entities in our system are:
• The Clients, which are the peers where Peerialism’s client application
has been installed, i.e. the customers' home computers. The installed
application requests audio and video streams according to the input
received from the customer. It then receives streams from other peers,
delivers them to the local media player and streams them once more
to other customers.
• The Source. It represents a host which has all data of a certain stream.
The Source itself is a Peer. A Peer becomes a source for a specific
stream when it has received all the data of that same stream.
• The Tracker. It is the central coordinator of the system. It is not part
of the overlay network but it organizes it. It receives requests from the
clients, forwards them to an optimization engine and issues directions
to the peers once the request has been satisfied.
• The Optimization Engine. It receives the forwarded requests from the tracker and makes decisions according to the overall status of the network. In addition, it periodically redefines the structure of the overlay network to optimize the flow of streams and normalize the delivery load among the peers.
A typical example of the steps needed to deliver a stream is as follows: the customer requests a specific audio or video stream from the client application running on his or her home computer, and the application translates the request into a message to the tracker. After receiving the message, the
tracker forwards it to the optimization engine. The engine builds a map of
the overlay network, which represents the global status of the delivery of the
stream, and makes a decision on which peers should provide the content to
the requesting peer. Once the decision has been made, the tracker notifies
the peers involved in the operation. The requesting peer will then start to
receive the content from other peers. The providers might be either sources
or normal peers which have already received parts of the stream. Once the
delivery has been completed, the peer notifies the Tracker and becomes a
Source for that stream.
19.5.2
Introduction to Kompics
Kompics is a software framework which allows for programming, configuring, and executing distributed protocols as software components. Kompics
components interact using data-carrying events and can be composed into
complex architectures. In particular, they can be organized in composite
and shared components, as well as in hierarchies. The components are safely
reconfigurable at run-time so that the architecture of the system can be dynamically modified and components replaced. Software faults which might
occur during execution are safely isolated and handled by supervisor components.
The component programming style used in the Kompics framework allows
for reusability, since components can be reused in other applications, and it
enables flexible management of computational resources, as it is possible to
define various policies to assign threads to components for their execution.
Kompics defines an execution model where components don’t share any
state and can communicate exclusively using events through predefined channels. The events are then executed atomically with respect to the component
instance. This model allows for parallel execution of components and takes
advantage of multi-core hardware architectures. Furthermore, it eases the
implementation of distributed protocols by relieving the programmer of the difficulty of programming concurrency with threads.
The Kompics framework also provides a number of components which
can be used out-of-the-box to develop any distributed application. These
components implement useful abstractions such as reliable and lossy network
communication and failure detection.
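The execution model can be pictured with the toy sketch below: each component owns its state and reacts to events taken one at a time from its own channel, so components never share state. This is deliberately not the Kompics API, only an illustration of the style it enables.

    // Toy illustration of the component/event execution model (NOT the Kompics API).
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    abstract class Component implements Runnable {
        private final BlockingQueue<Object> channel = new LinkedBlockingQueue<Object>();

        // Another component delivers an event by putting it on this channel.
        public void trigger(Object event) { channel.add(event); }

        // Events are handled one at a time, atomically with respect to this component.
        public void run() {
            try {
                while (true) { handle(channel.take()); }
            } catch (InterruptedException e) { /* shut down */ }
        }

        protected abstract void handle(Object event);
    }

    class EchoComponent extends Component {
        protected void handle(Object event) {
            System.out.println("handled: " + event);
        }
    }

A scheduler can run many such components on a thread pool; since handlers never block and never touch another component's state, they can execute in parallel on multi-core hardware.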
19.5.3
The Tracker application
As mentioned earlier, the Tracker is the most important entity in the system,
since it interacts and coordinates peers in the overlay network. Therefore,
the application implementing its behavior must be reliable, to avoid internal
software failures, and resilient to failures, both protocol and network ones,
that might happen in the interaction with the peers. Furthermore, according to the specification of our system, the Tracker application should be able to handle on the order of hundreds of thousands of clients. Scalability is therefore a major requirement.
The Tracker application is currently developed in Java, using threads to exploit multi-core architectures and improve performance. It also makes use of the MINA framework [52] to abstract the network layer and provide non-blocking I/O.
In general, the Tracker greatly differs from all other parts of the system, such as the Client application. The Tracker is in fact a much smaller and less complex application than the Client. This is because the latter is meant to fulfill more complicated tasks, such as receipt and transmission of streams, delivery to the media player, bandwidth measurement and NAT traversal. However, the role of the Tracker is more critical than that of the Client, since all peers in the network depend on it. Even a temporary Tracker failure would cause great harm, as all peers in the system would be unable to request streams. In contrast, in case of a Client failure, only the peers which are receiving a stream from that specific host will be affected.
For the aforementioned reasons, the software implementing the Tracker behavior has been tested both with unit tests [80] and with our simulator. However, dealing with threads and shared structures has made the design of the tracker cumbersome and error-prone. Moreover, it has been difficult to obtain scalability and cope with the number of peers that the Tracker should be able to handle by specification.
Considering the aforementioned problems, we decided to explore the possibility of using Kompics to re-implement our Tracker application. In particular, we were attracted by the following reasons:
• Clean design, which could be obtained by organizing our software in
components.
• Reusability, components might be reused in future products.
• No shared state, components logically encapsulate the state that they need for execution and can communicate only using events.
• Explicit concurrency, components are concurrency units which can be executed in parallel without need of synchronization.
19.5.4
Porting and Design
In general, the biggest challenge encountered in the process of porting the
tracker application to Kompics was to split our software into components. In
fact, it has been quite difficult to identify and isolate the state that should
be directly and exclusively modified by the component enclosing it. Once we
did that, we collected the parts of the pre-existing code that were to modify that same state. We packed those parts of code into event handlers and
created the events which would trigger them. Finally, we created the components and defined the interconnections between them by creating channels
and specifying which events they would carry. The resulting design can be
observed in Figure 19.2.
The network component, shown at the top of the figure, constitutes the
Kompics abstraction for the network layer. The network component provides
network communication using both the TCP and UDP protocols. Its interface is quite simple: components can trigger events which extend an abstract
Message event. The network component subscribes to those events and sends
them to the destination peers. In turn, components which are interested in
receiving messages subscribe to those same events on the other side.
The MemberManager component, shown on the left in Figure 19.2, keeps track of all information regarding the peers. When started, a peer registers with the tracker by sending its credentials, such as the username and password of the customer, and its characteristics, namely its download and upload bandwidth capacity, the kind of NAT it is behind, and its public and private IPs. It then reports which data it has available locally. In fact, the peer might have stored some content as a result of a previous streaming session. That data might be reused later to serve other peers. The MemberManager component is not only entrusted with the task of storing all this information
coming initially from the peers, but it also receives periodic notifications from them. The notifications contain the status of the delivery of the streams requested by each peer. Such information consists of which pieces of the stream have been received and how many of them have been delivered to the media player.
Figure 19.2: Tracker Kompics Design
On the right side of Figure 19.2 we can find the OptoManager component and its child components, the OptoInstances. The Tracker application can
handle a number of different streams. For each of them, it spawns an OptoInstance component. The latter contains an instance of the Optimization
Engine introduced in Section 19.5.1. The engine’s instance handles requests
concerning a single stream. The OptoManager is instead a supervisor component. It creates and destroys OptoInstance components as new streams are
added or removed from the system. The OptoManager also acts as a hub for
the requests coming from the peers: it redirects them to the corresponding
OptoInstance component.
When an OptoInstance component receives a request forwarded by the
OptoManager, it triggers a number of events requesting from the MemberManager information about the peers involved in the delivery of the stream. Once
the corresponding response events have been received, the embedded Optimization Engine decides which one of the peers should provide content to
the requesting host.
In addition, every OptoInstance is periodically requested to gather information about all peers involved in the delivery of the stream it is responsible for. It
does that, as mentioned earlier, by issuing events to the MemberManager.
The response data is then passed to the embedded Optimization Engine.
The latter builds a map representing the overlay network and the status of
its members. Once the operation is concluded, it tries to reorganize the map
such that the load of the delivery is shared among all the involved peers.
After a decision has been made, notification messages are sent to the peers
by triggering the corresponding Message events.
The design explained above has two main advantages:
1. Isolation of peer state. The MemberManager component is the only component containing information about the status of peers. Changes carried by incoming events are committed atomically with respect to the MemberManager component, as mentioned in Section 19.5.2. The OptoInstance instead does not retain any information about the status of the peers. It just possesses a list containing the ids of the peers which participate in the delivery of the stream. It then needs to request their status from the MemberManager component before making decisions. After every decision, the data concerning the peers is discarded, as it might have changed during the decision process.
With this design, the OptoInstances and the MemberManager components can be executed concurrently without need of synchronization.
Consequently, information about the peers can be stored immediately
even if the Optimization Engine is running and the Optimizer does not
need to acquire any lock for accessing the peers' status, as was the case in the previous implementation of the Tracker.
2. Concurrent execution of Optimization Engines. The design of
the Tracker application has been specifically designed to allow parallel execution of OptoInstance components, and consequently of instances of the Optimization Engine. This is because the optimization process accounts for most of the load of the Tracker application. In fact, every time a decision must be made, whether triggered by a periodic reorganization of the stream or a direct request from a peer, the Tracker application has to perform a computationally intensive task. With this design, OptoInstance components are independent of one another and of any other component in the system. Consequently, they can be executed concurrently on a multi-core machine. Furthermore, OptoInstance components do not keep threads busy while idle. In Kompics, event handlers cannot make blocking calls, and the executing
thread is released after the event handler has terminated its execution.
Figure 19.3: Kompics tracker preliminary evaluation, 2000 requests (response time in ms as a function of the request number, for the original Tracker and the Kompics Tracker)
19.5.5
Preliminary Evaluation
After having reimplemented the Tracker application following the aforementioned design, we compared it with the original Tracker application. To do that, we used MINA's VMPipe library. The library provides a fake network layer which can be easily integrated into applications designed for MINA, such as the original Tracker application and its Kompics version. We made this
choice to be able to test the raw performance of the Tracker, without the
burden of connection and message handling. We then implemented a bogus
version of the Client application which simply authenticates to the Tracker,
makes a request for content and awaits a response. Every Client measures
how long it takes to receive a response from the moment it triggered a request.
In our tests, we create a number of those bogus Clients and we make sure
that requests are issued sequentially as fast as possible. The Tracker then
receives the requests, forwards them to the Optimization Engine and replies
back to the Client. Content Requests are processed in the order they have
been received. The test starts when the Tracker receives the first request.
Figure 19.3 shows the results of an experiment with 2000 bogus Clients which
trigger as many content requests. On the X-axis is displayed the number of
the request and, on the corresponding value of the Y-axis, the time that it
took to satisfy the same request. As we can see from the picture, the Kompics
version of the Tracker always outperforms the original version. This is due
to the concurrent nature of Kompics, which does not rely on synchronization of threads. We believe that the latter is the cause of the original Tracker having
an almost constant response time, since the application gets flooded with
requests and the access to shared structures becomes therefore sequential.
19.5.6
Conclusion and Future Work
We presented our experience in porting Peerialism’s Tracker application to
the Kompics framework developed at KTH. We outlined the requirements of
the application and we detailed the design choices that we made for meeting
those requirements. We then presented the results of a preliminary evaluation
of the software.
We now would like to perform a more extensive evaluation of the Kompics Tracker software before integrating it in our product. In particular, we
would like to test its resilience to software failures and its stability. We then
would like to test the application’s behavior in our remote test environment,
with real Clients, to verify that the performance is the same as shown in the preliminary tests. If expectations are met, we intend to also port our
Client application to Kompics to exploit the clear advantages in clean design,
reusability and explicit concurrency that the framework provides.
Chapter 20
D6.1c: Second-year project
workshop
In the second year we are organizing the Workshop on Decentralized Self
Management for Grids, P2P, and User Communities:
http://www.ist-selfman.org/wiki/index.php/SelfmanWorkshop
which will be held in conjunction with the Second International Conference
on Self-Adaptive and Self-Organizing Systems (SASO 2008):
http://polaris.ing.unimo.it/saso2008
on Oct. 20-24, 2008. We advertised the workshop widely on mailing lists and
internal lists. The Workshop Call for Papers is given in Section 20.1. The
submission deadline is July 11, 2008.
The workshop is organized by SELFMAN together with the Grid4All
project, with corporate sponsorship by France Télécom Research and Development. The workshop organizing committee consists of Peter Van Roy
(SELFMAN), Marc Shapiro (Grid4All), and Seif Haridi (SELFMAN and
Grid4All).
20.1
Call for Papers: Decentralized Self Management for Grids, P2P, and User Communities
The Internet is a fantastic tool for information and resource sharing. User
communities such as families, friends, schools, clubs, etc., can pool their resources and their knowledge: hardware, computation time, file space, photos,
data, annotations, pointers, opinions, etc.
However, infrastructures and tools for supporting such activities are still
relatively primitive. Existing P2P networks enable world-wide file sharing
but are limited to read-only data and provide no security or confidentiality
guarantees. Grids support closed-membership virtual organizations (VOs),
but their management remains largely manual. Web 2.0 social networks,
blogs, and wikis remain centralized and have limited functionality.
This workshop examines issues of decentralized self management as they
relate to these areas. The workshop is co-located with SASO 2008, the Second International Conference on Self-Adaptive and Self-Organizing Systems
(http://polaris.ing.unimo.it/saso2008/). The workshop covers the application of:
• Peer-to-peer techniques such as structured overlay networks
• Self-adaptive techniques such as feedback loop architectures
• Agoric systems, collective intelligence, and game theory
• Decentralized distributed algorithms
• Autonomic networking techniques
to:
• Virtual organizations
• Collaborative and social applications
• Computer-supported cooperative work, collaborative editing, code management tools, and co-operative engineering
• Data replication, distributed file systems or distributed databases
• Security and confidentiality in distributed systems
20.1.1
Submission of position paper or technical paper
(required for attendance)
To attend the workshop, you must submit a position paper of no more than
5 pages to the workshop organizers. Technical papers may be submitted as
well (of any reasonable length). All accepted submissions will be published
in an IEEE postproceedings. Submissions should be sent to the following
address (preferably by email in pdf format):
Peter Van Roy
Dept. of Computing Science and Engineering
Place Sainte Barbe, 2
Université catholique de Louvain
B-1348 Louvain-la-Neuve
Belgium
Email: [email protected]
Each submission will be reviewed by at least three reviewers. The review
will focus not only on the paper’s quality but also on its novelty and ability
to engender fruitful discussions. All authors of accepted position papers are
invited to attend the workshop. Note that workshop attendees must register for both the conference and the workshop. Workshop PC members are also
encouraged to submit position papers and these papers will be reviewed to
the same standards as outside submissions.
This workshop is sponsored by the European projects Grid4All:
(http://www.grid4all.eu)
and SelfMan:
(http://www.ist-selfman.org/)
and with corporate sponsorship from France Télécom Research and Development.
20.1.2
Organizing committee
Peter Van Roy, Université catholique de Louvain, Belgium
Marc Shapiro, INRIA & LIP6, Paris, France
Seif Haridi, SICS & KTH, Stockholm, Sweden
20.1.3
Program committee
Gustavo Alonso, ETH Zurich, Switzerland
Seif Haridi, SICS & KTH, Stockholm, Sweden
Bernardo Huberman, HP Labs, Palo Alto, USA
Adriana Iamnitchi, University of South Florida, USA
Mark Miller, Google Research, USA
Pascal Molli, LORIA, Nancy, France
Luc Onana Alima, UMH, Mons, Belgium
Nuno Preguiça, Universidade Nova de Lisboa, Portugal
Alexander Reinefeld, Zuse Institut Berlin, Germany
Marc Shapiro, INRIA & LIP6, Paris, France
Peter Van Roy, Université catholique de Louvain, Belgium
Hakim Weatherspoon, Cornell University, Ithaca NY, USA
Chapter 21
D6.5b: Second progress and
assessment report with lessons
learned
21.1
Executive summary
We can see that the SELFMAN project runs on three different levels that
interact and cross-fertilize each other: a vision level, an implementation level,
and an application level. In the vision level, we set the long-distance goals of
the project: how to build self-managing applications using concurrent components, interacting feedback loops, and reversible phase transitions. In the
implementation level, we build a complete application infrastructure with a
transaction protocol running on top of a structured overlay network, implemented with a component model. In the application level, we investigated
four application scenarios that touch on four different parts of the design
space of self-managing applications. In this report, I recapitulate the progress
made in each level and how the levels interact with each other and direct each
other. The implementation level has strongly interacted both with the vision
level and the application level, and it also influences both.
21.2
Contractors contributing to the deliverable
All contractors contributed ideas to this deliverable.
UCL UCL (Peter Van Roy) wrote the present report.
21.3
Results
After the second year of the project, we can judge the effectiveness of the
different levels that are guiding the SELFMAN project. There are three
levels: the vision level, the implementation level, and the application level.
We recapitulate these levels, their results, and how they influence each other.
In essence, the implementation level gives the practical results of the
project and it implements both the vision and the applications. The vision level gives a future view of how self-managing applications should be
organized and the implementation level has realized part of this vision in
a working application infrastructure. The application level gives a practical view of what services the self-managing applications really need and the
implementation level has realized these services.
21.3.1
Vision level
The first level is a high-level “vision” thread in the project, fed by four
successive papers. We can see how the vision has evolved during the project:
• Self Management of Large-Scale Distributed Systems by Combining
Peer-to-Peer Networks and Components, CoreGRID technical report
TR-0018, Dec. 2006 [147]. This paper gives the initial vision of the
project at its start: extending structured overlay networks into full-fledged self-managing systems by using components to organize the
extension.
• Self Management and the Future of Software Design, FACS 2006, Sept.
2006 [145]. This paper explains the importance of interacting feedback
loops and gives many examples of feedback loop structures in successful
biological and software systems.
• Self Management for Large-Scale Distributed Systems: An Overview
of the SELFMAN Project, FMCO 2007, 2008 (to appear) [148] (this
paper is reproduced as Chapter 2 of the present document). This paper
gives ideas on how to design and analyze systems with interacting feedback loops and explains how collective intelligence makes it possible to manage users
with conflicting goals. It also presents the SELFMAN architecture of a
structured overlay network with a replicated storage and transactional
protocol, as an application architecture. Finally, it explains how to
handle network partitions using the merge algorithm.
• Overcoming Software Fragility with Interacting Feedback Loops and
Reversible Phase Transitions, BCS 2008 (to appear) [146] (see Appendix A.1). This paper pushes the concept of interacting feedback
loops to its logical conclusion. Inspired by physical systems, whose behavior (both non-critical and critical) can be explained with interacting
feedback loops, the paper motivates a new architecture for robust software based on reversible phase transitions. Most existing fault-tolerant
software does a phase transition when highly stressed, but does not revert to its initial condition when the stress is relieved. The paper argues
that this is incorrect and that the phase transition should be reversible.
The merge algorithm developed in SELFMAN is the first example of
such a reversible system.
These papers present an ultimate vision that is subsequently realized in the
rest of the project. We admit that the realization is not complete: we do
not yet understand how to program with interacting feedback loops and the
notion of reversible phase transition in software is only partially understood.
Yet, we have made definite progress. E.g., the different algorithms in the
project (relaxed ring, Paxos uniform consensus, etc.) can be considered as
feedback loop structures. The merge algorithm gives the first overlay network
that actually does reversible phase transitions (previous “folklore wisdom”
in the overlay research community did not consider merge to be practical).
Furthermore, the algorithm has been simulated extensively and all indications are that it is practical. We are in the process of implementing it over
DKS and P2PS. We conclude that the vision level is guiding the project well
at this point.
21.3.2
Implementation level
The second level is the building of robust structured overlay networks. In
the second year of the project, we have arrived at overlay networks that are
practical in applications. To do this, we had to modify them in several ways:
• Modifications of the overlay’s self-organization algorithms to reduce the
probability of lookup inconsistency (e.g., relaxed ring). At UCL, the
relaxed ring has now been completely simulated and implemented in
P2PS and it is well understood. Basically, its join and leave algorithms
are completely asynchronous and involve only two nodes at a time.
This eliminates lookup inconsistency in the case of joins and leaves, and reduces its probability in the case of failures (true failures
or false suspicions). At KTH, lookup consistency has been studied in
the case of DKS and with some small changes in the self-organization
algorithms it can also be greatly reduced. KTH notes that the network partition (merge) algorithm is essential: at some point when failures happen, the network is actually partitioned.

Use Case                   Self-* Properties   Components   Overlay Networks   Transactions
Machine to Machine         ++                  ++           +                  +
Distributed Wiki           ++                  +            ++                 ++
P2P Media Streaming        ++                  +            ++                 +
J2EE Application Server    ++                  ++

Table 21.1: Self-managing application requirements
• Reimplementation of overlay networks using component models, to allow self-configuration. Since the component models themselves, in particular Kompics, were only implemented in the second year (the design
started in the first year), the reimplementation of the overlay networks
is only partially complete by the end of the second year. The feedback
loop needed for self configuration has been implemented by INRIA in
Oz with two tools for dynamic component-based systems, FructOz and
LactOz, which provide for distributed deployment (actuating agent)
and navigation and monitoring (monitoring agent), both essential parts
of a self-configuring system.
• Design and implementation of a transaction algorithm, with a replicated storage algorithm, on top of the structured overlay networks. By
the end of the second year, the design is complete and several implementations are on the way. ZIB has made significant progress: they
have successfully implemented the algorithm in Erlang and written a
Distributed Wiki on top. KTH is implementing the algorithm on top of
DKS and UCL is implementing it on top of P2PS. ZIB has a head start
because of the choice of Erlang. KTH had to spend time designing and
implementing its component model, Kompics, and UCL had to spend
time for the Mozart 1.4.0 release with its new advanced distribution
model.
21.3.3
Application level
The third level is the choice and study of the self-managing applications we
would build. In the first year, we decided to explore the design space by
choosing four application scenarios (see Deliverable D5.1):
• Machine-to-Machine (M2M) messaging application (FT). This application was inspired by an existing fielded application in France Télécom.
• Distributed database and Wiki (ZIB). This application is inspired by
existing work at ZIB on distributed databases.
• Media streaming application (Peerialism, formerly called Stakk). This
application is the initial proposal for a product by the Peerialism company.
• J2EE application server (proposed by Bull). This runs on a cluster and
hosts applications in a self-managing way.
These applications cover quite different points in the design space of distributed self-managing applications. We summarize this in Table 21.1. In the first year, we decided to drop the J2EE Application Server because it is not an Internet application (it does not need overlay networks). In the second year, we made a further choice to drop the M2M application because of lack of resources. It has less need of an overlay than the others, and FT does not
have the resources to develop it further. We therefore keep two applications:
the Distributed Wiki and the P2P Media Streaming application. The Distributed Wiki has been implemented in the second year and it won a prize in
a scalable computing competition. The Media Streaming application will be
a product of Peerialism. In the third year we will evaluate these applications
and improve the self-* abilities depending on the results of the evaluations.
Appendix A
Publications
A.1
Overcoming Software Fragility with Interacting Feedback Loops and Reversible
Phase Transitions
Overcoming Software Fragility with
Interacting Feedback Loops and
Reversible Phase Transitions
Peter Van Roy
Dept. of Computing Science and Engineering
Université catholique de Louvain
B-1348 Louvain-la-Neuve, Belgium
[email protected]
Abstract
Programs are fragile for many reasons, including software errors, partial failures, and
network problems. One way to make software more robust is to design it from the
start as a set of interacting feedback loops. Studying and using feedback loops is
an old idea that dates back at least to Norbert Wiener’s work on Cybernetics. Up to
now almost all work in this area has focused on how to optimize single feedback
loops. We show that it is important to design software with multiple interacting
feedback loops. We present examples taken from both biology and software to
substantiate this. We are realizing these ideas in the SELFMAN project: extending
structured overlay networks (a generalization of peer-to-peer networks) for large-scale
distributed applications. Structured overlay networks are a good example of systems
designed from the start with interacting feedback loops. Using ideas from physics, we
postulate that these systems can potentially handle extremely hostile environments. If
the system is properly designed, it will perform a reversible phase transition when the
failure rate increases beyond a critical point. The structured overlay network will make
a transition from a single connected ring to a set of disjoint rings and back again when
the failure rate decreases. There is a complete research agenda based on the use of
reversible phase transitions for building robust systems. In our current work we are
exploring how to expose phase transitions to the application so that it can continue
to provide a service. For validation we are building three realistic applications taken
from industrial case studies, using a distributed transaction layer built on top of the
overlay.
Keywords: software development, self management, feedback, distributed computing, distributed
transaction, network partition, Internet, phase transition
1. INTRODUCTION
How can we build software systems that are not fragile? For example, we can exploit concurrency
to build systems whose parts are mostly independent. Keeping parts as independent as possible
is a necessary first step. But it is not sufficient: as systems become larger, their inherent fragility
becomes more and more apparent. Software errors and partial failures become common, even
frequent occurrences. Both of these problems can be made less severe by rigorous system
design, but for fundamental reasons the problems will always remain. They must be addressed.
One way to address them is to build systems as multiple interacting feedback loops. Each
feedback loop continuously observes and corrects part of the system. As much as possible of
the system should run inside feedback loops, to gain this robustness.
Building a system with feedback loops puts conditions on how it must be programmed. We find that
message passing is a satisfactory model: the system is a set of concurrent component instances
that communicate through asynchronous messages. Component instances may have internal
state but there is no global shared state. Failures are detected at the component level. Using
this model lets us reason about the feedback behavior. Similar models have been used by E for
building secure distributed systems [18] and by Erlang for building reliable telecommunications
systems [1]. More reasons for justifying this model are given in [24]. For the rest of this paper, we
will use this model.
Now that we can program systems with feedback loops, the next question is how these
systems should be organized. A first rule is that systems should be organized as multiple interacting
feedback loops. We find that this gives the simplest structure and makes it easier to reason about
the system (see Sections 2 and 3). Single feedback loops can be analyzed using techniques
specific to their operation; for example, Hellerstein et al. [10] give a thorough course on how to
use control theory to design and analyze systems with single feedback loops. The problem with
systems consisting of multiple feedback loops is their global behavior: how can we understand it,
predict it, and design for a desired behavior? We need to understand the issues before we can do
a theoretical analysis or a simulation.
In the SELFMAN project [20], we are tackling the problem by starting from an area where
there is already some understanding: structured overlay networks (SONs). These networks are
an outgrowth of peer-to-peer systems. They provide two basic operations, communication and
storage, in a scalable and guaranteed way over a large set of peer nodes (see Section 4). By
giving the network a particular topology and by managing this topology well, the SON shows self-organizing properties: it can survive node failures, node leaves, and node joins while maintaining
its specification. By using concepts and techniques taken from theoretical physics, we are able to
understand in a deep way how SONs work and we can begin to understand how to design them
to build robust systems. The concepts of feedback loop and phase transition play an important
role in this understanding.
This paper is structured as follows:
• Section 2 defines what we mean by a feedback loop, explains how feedback loops can
interact, and motivates why feedback loops are essential parts of any system. We briefly
present the mean field approximation of physics and show how it uses feedback to explain
the stability of ordinary matter.
• Section 3 gives two nontrivial examples of successful systems that consist of multiple
interacting feedback loops: the human respiratory system and the Transmission Control
Protocol.
• Section 4 summarizes our own work in this area. We are building a self-management
architecture based on a structured overlay network. We conjecture that when designed to
support reversible phase transitions, a SON can survive in extremely hostile environments.
We support this conjecture by analytical work [14], system design [21], and by analogy
from physics [15]. We are currently setting up an experimental framework to explore this
conjecture. We target three large-scale distributed applications, built using a transactional
service on top of a structured overlay network.
• Section 5 concludes by recapitulating how feedback loops can overcome software fragility and
why all software should be designed with feedback loops. An important lesson is that systems
should be constructed so that they can do reversible phase transitions. Most existing fault-tolerant
systems are not designed with this goal in mind, so they are broken in a fundamental sense. We
explain what this means for structured overlay networks and we show how we have fixed them.
We then explain what remains to be done: there is a complete research agenda on how to build
robust systems based on the principle of reversible phase transitions.
2. FEEDBACK LOOPS ARE ESSENTIAL
2.1. Definition and history
In its general form, a feedback loop consists of four parts: an observer, a corrector, an actuator,
and a subsystem. These parts are concurrent agents that interact by sending and receiving
messages. The corrector contains an abstract model of the subsystem and a goal. The feedback
loop runs continuously, observing the subsystem and applying corrections in order to approach
the goal. The abstract model should be correct in a formal sense (e.g., according to the semantics
of abstract interpretation [5]) but there is no need for it to be complete.
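To make the four parts concrete, here is a minimal, hypothetical Oz sketch (the names FeedbackLoop and Subsystem and the numeric goal are ours, not part of any SELFMAN component). For brevity the observer, corrector, and actuator are collapsed into a single thread that continuously reads the subsystem state and nudges it towards the goal; in a real design they would be separate concurrent agents exchanging messages.

   declare
   Subsystem = {NewCell 0}        % the managed subsystem, modelled here as a mutable cell
   proc {FeedbackLoop Goal}
      Observed = @Subsystem       % observer: read the current subsystem state
   in
      if Observed < Goal then Subsystem := Observed + 1      % actuator: correct upwards
      elseif Observed > Goal then Subsystem := Observed - 1  % actuator: correct downwards
      end
      {Delay 100}                 % the loop runs continuously
      {FeedbackLoop Goal}
   end
   thread {FeedbackLoop 10} end   % the feedback loop itself is a concurrent agent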
An example of a software system that contains a feedback loop is a transaction manager.
It manages system resources according to a goal, which can be optimistic or pessimistic
concurrency control. The transaction manager contains a model of the system: it knows at all
times which parts of the system have exclusive access to which resources. This model is not
complete but it is correct.
In systems with more than one feedback loop, the loops can interact through two mechanisms:
stigmergy (two loops acting on a shared subsystem) and management (one loop directly
controlling another). Very little work has been done to explore how to design with interacting
feedback loops. In realistic systems, however, interacting feedback loops are the norm.
Feedback loops were studied as a part of Norbert Wiener’s cybernetics in the 1940’s [29] and
Ludwig von Bertalanffy’s general system theory in the 1960’s [3]. W. Ross Ashby’s introductory
textbook of 1956 is still worth reading today [2], as is Gerald M. Weinberg’s textbook of 1975
explaining how to use system theory to improve general thinking processes [27]. System theory
studies the concept of a system. We define a system recursively as a set of subsystems
(component instances) connected together to form a coherent whole. Subsystems may be
primitive or built from other subsystems. The main problem is to understand the relationship
between the system and its subsystems, in order to predict a system’s behavior and to design
a system with a desired behavior.
2.2. Feedback loops in the real world
In the real world, feedback structures are ubiquitous. They are part of our primal experience of
the world. For example, bending a plastic ruler has one stable state near equilibrium enforced by
negative feedback (the ruler resists with a force that increases with the degree of bending) and a
clothes pin has one stable and one unstable state (it can be put temporarily in the unstable state
by pinching). Both objects are governed by a single feedback loop. A safety pin has two nested
loops with an outer loop managing an inner loop. It has two stable states in the inner loop (open
and closed), each of which is adaptive like the ruler’s. The outer loop (usually a human being)
controls the inner loop by choosing the stable state.
In general, anything with continued existence is managed by one or more feedback loops. Lack
of feedback means that there is a runaway reaction (an explosion or implosion). This is true at all
size and time scales, from the atomic to the astronomic. For example, the binding of atoms in a
molecule is governed by a simple negative feedback loop that maintains equilibrium within given
perturbation bounds. At the other extreme, a star at the end of its lifetime collapses until it finds a
new stable state. If there is no force to counteract the collapse, then the star collapses indefinitely
(at least, until it is beyond our current understanding of how the universe works).
2.2.1. The mean field approximation
The stability of ordinary matter is explained by a feedback loop. An acceptable model for ordinary
matter is the mean field approximation, which gives good results outside of critical points (see
chapter 1 of [15]). To explain this approximation, we start from the simple assumption that a uniform
substance reacts linearly when an external force is applied:
Reaction = A × Force
For example, for a gas we can assume that density n is proportional to pressure p:
n = (1/kT) × p
This is the Boyle-Mariotte law for ideal gases, which is valid for small pressures. But this equation
gives a bad approximation when the pressure is high. It leads to the conclusion that infinite
pressure on a gas will reduce its volume to zero, which is not true.
We can obtain a much better approximation by making the assumption that throughout the
substance there exists a force that is a function of the reaction. This force is called the mean
field. This gives a new equation:
Reaction = A × (Force + a(Reaction))
That is, even in the absence of an external force, there is an internal force a(Reaction) that causes
the reaction to maintain itself at a nonzero value. This internal force is the mean field. There is
a feedback effect: the mean field itself causes a reaction, which engenders a mean field, and so
on. It is this feedback effect that explains, e.g., why a condensed state such as a liquid can exist
at low temperatures independent of external pressure. J. Van der Waals applied this reasoning to
the ideal gas law, by adding a term:
n = (1/kT) × (p + a(n))
where n is the density of the gas and p is the pressure. According to this equation, the density n
of a fluid can stay at a high value even though the external pressure is low: a condensed state
can exist at low temperature independent of the pressure. The internal pressure a(n) replaces
the external pressure. Van der Waals chose a(n) = a × n² by following the reasoning that
internal pressure is proportional to n, the number of molecules per unit of volume, multiplied
by the influence of all neighboring molecules on each molecule. This influence is assumed to be
proportional to n. This gives a new equation that is a good approximation over a wide range of
densities and pressures.
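As a small worked step, using only the equation above, set the external pressure to zero; in LaTeX form,

   n = \frac{a}{kT}\, n^{2} \quad\Longrightarrow\quad n = 0 \;\;\text{or}\;\; n = \frac{kT}{a},

so besides the trivial empty state there is a condensed state whose density n = kT/a is sustained purely by the internal pressure a(n) = a × n². This is the feedback effect described above made explicit.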
The mean field approach can be applied to a wide range of problems. The limits of the approach
are attained near critical points. This is because the correlation distance between molecules
diverges. Near a critical point, there is a phase change of the fluid, e.g., a liquid can boil to
become a gas. The global behavior of the fluid changes. The behavior of matter near critical
points no longer follows the mean field approximation but can be explained using scale invariance
laws. We are using this behavior as a guide for the design of software systems (see Section 4).
2.3. Feedback loops in human society
Most products of human civilization need an implicit management feedback loop, called
“maintenance,” done by a human. Each human is at the center of a large number of these feedback
loops. The human brain has a large capacity for creating these loops; some are called “habits” or
“chores.” If there are too many feedback loops to manage, then the brain can no longer cope: the
human complains that “life is too complicated”! We can say that civilization advances by reducing
the number of feedback loops that have to be managed explicitly [28]. We postulate that this is
also true of software.
2.4. Feedback loops in software
Software is in the same situation as other products of human civilization. Existing software
products are very fragile: they require frequent maintenance by a human. To avoid this, we
propose that software must be constructed as multiple interacting feedback loops, as an effective
way to reduce its fragility. This is already being done in specific domains; here are five examples:
• The subsumption architecture of Brooks is a way to implement intelligent systems by
decomposing complex behaviors into layers of simple behaviors, each of which controls
the layers below it [4].
• IBM’s Autonomic Computing initiative aims to reduce management costs of computing
systems by removing humans from low-level management loops [11]. The low-level loop
is managed by a high-level loop that contains a human.
• Hellerstein et al show how to design computing systems with feedback control, to optimize
global behavior such as maximizing throughput [10]. Hellerstein shows two examples of
adaptive systems with interacting feedback loops: gain scheduling (with dynamic selection
among multiple controllers) and self-tuning regulation (where controller gain is continuously
adjusted).
• Distributed algorithms for fault tolerance handle a special case of feedback where the
observer is a failure detector [16]. The implementation of the failure detector itself requires
a feedback loop.
• Structured overlay networks (SONs, closely related to distributed hash tables, DHTs) are
inspired by peer-to-peer networks [23]. They use principles of self organization to guarantee
scalable and efficient storage, lookup, and routing despite volatile computing nodes and
networks. Our own work is in the area of SONs; we explain it further in Section 4.
3. EXAMPLES OF INTERACTING FEEDBACK LOOPS
We give two examples of nontrivial systems that consist of multiple interacting feedback loops
(for more examples see [25, 26]). Our first example is taken from biology: the human respiratory
system. Our second example is taken from software design: the TCP protocol family.
FIGURE 1: The human respiratory system as a feedback loop structure
3.1. The human respiratory system
Successful biological systems survive in natural environments, which can be particularly harsh.
Studying them gives us insight in how to design robust software. Figure 1 shows the components
of the human respiratory system and how they interact. The rectangles are concurrent component
instances and the arrows are message channels. We derived this figure from a precise medical
description of the system’s behavior [30]. The figure is slightly simplified when compared to reality,
but it is complete enough to give many insights. There are four feedback loops: two inner loops
(breathing reflex and laryngospasm), a loop controlling the breathing reflex (conscious control),
and an outer loop controlling the conscious control (falling unconscious). From the figure we can
deduce what happens in many realistic cases. For example, when choking on a liquid or a piece
of food, the larynx constricts so we temporarily cannot breathe (this is called laryngospasm). We
can hold our breath consciously: this increases the CO2 threshold so that the breathing reflex is
delayed. If you hold your breath as long as possible, then eventually the breath-hold threshold is
reached and the breathing reflex happens anyway. A trained person can hold his or her breath
long enough so that the O2 threshold is reached first and the person falls unconscious without breathing.
Once unconscious, the breathing reflex is re-established.
We can infer some plausible design rules from this system. The innermost loops (breathing reflex
and laryngospasm) and the outermost loop (falling unconscious) are based on negative feedback
using a monotonic parameter. This gives them stability. The middle loop (conscious control) is not
stable: it is highly nonmonotonic and may run with either negative or positive feedback. It is by far
the most complex of the four loops. We can justify why it is sandwiched in between two simpler
loops. On the inner side, conscious control manages the breathing reflex, but it does not have to
understand the details of how this reflex is implemented. This is an example of using nesting to
implement abstraction. On the outer side, the outermost loop overrides the conscious control (a
fail safe) so that it is less likely to endanger the body's survival. Conscious control seems to
be the body’s all-purpose general problem solver: it appears in many of the body’s feedback loop
structures. This very power means that it needs a check.
FIGURE 2: TCP as a feedback loop structure
3.2. TCP as a feedback loop structure
The TCP family of network protocols has been carefully tailored over many years to work
adequately for the Internet. We consider therefore that its design merits close study. We explain
the heart of TCP as two interacting feedback loops that implement a reliable byte stream transfer
protocol with congestion control [12]. The protocol sends a byte stream from a source to a
destination node. Figure 2 shows the two feedback loops as they appear at the source node.
The inner loop does reliable transfer of a stream of packets: it sends packets and monitors
the acknowledgements of the packets that have arrived successfully. The inner loop manages
a sliding window: the actuator sends packets so that the sliding window can advance. The sliding
window can be seen as a case of negative feedback using monotonic control. The outer loop does
congestion control: it monitors the throughput of the system and acts either by changing the policy
of the inner loop or by changing the inner loop itself. If the rate of acknowledgements decreases,
then it modifies the inner loop by reducing the size of the sliding window. If the rate becomes zero
then the outer loop may terminate the inner loop and abort the transfer.
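As an illustration only (a schematic sketch in Oz, not the actual TCP algorithm or its constants; all names are hypothetical), the shape of the outer loop can be written as a recursive procedure that reads a stream of measured acknowledgement rates and adjusts the window used by the inner sliding-window loop:

   declare
   proc {CongestionLoop Window PrevRate Rates}
      case Rates
      of Rate|Rest then
         NewWindow = if Rate == 0 then 1                               % no acks at all: restart with a tiny window
                     elseif Rate < PrevRate then {Max 1 (Window div 2)} % rate dropping: shrink the window
                     else Window + 1                                    % rate stable or rising: grow slowly
                     end
      in
         {CongestionLoop NewWindow Rate Rest}
      [] nil then skip
      end
   end

The window value is the policy parameter through which the outer loop manages the inner loop.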
4. STRUCTURED OVERLAY NETWORKS AS A FOUNDATION FOR FEEDBACK
ARCHITECTURES
Our own work on feedback structures targets large-scale distributed applications. This work is
being done in the SELFMAN project [20]. Summarizing briefly, we are building an infrastructure
based on a transaction service running over a structured overlay network [19, 26]. We target our
design on three application scenarios taken from industrial case studies: a machine-to-machine
messaging application, a distributed knowledge management application (similar to a Wiki), and
an on-demand media streaming service [6].
FIGURE 3: Three generations of peer-to-peer networks
4.1. Structured overlay networks
Structured overlay networks are inspired by peer-to-peer networks [23]. In a peer-to-peer network,
all nodes play equal roles. There are no specialized client or server nodes. Figure 3 summarizes
the history of peer-to-peer networks in three generations. In the first generation (exemplified by
Napster), clients are peers but the directory is centralized. In the second generation (exemplified
by Gnutella), peer nodes communicate by random neighbor links. The third generation is the
structured overlay network. Compared to peer-to-peer systems based on random neighbor
graphs, SONs guarantee efficient routing and guarantee lookup of data items. Almost all existing
structured overlay networks are organized as two levels, a ring complemented by a set of fingers:
• Ring structure. All nodes are connected in a simple ring. The ring is maintained connected
despite node joins, leaves, and failures.
• Finger tables. For efficient routing, extra links called fingers are added to the ring. The fingers
can temporarily be in an inconsistent state. This has an effect only on efficiency, not on
correctness. Within each node, the finger table is continuously converging to a consistent
state.
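For concreteness, the finger targets used by Chord [23] can be computed as in the following Oz sketch; this is only the target-key computation (the function name FingerTargets is ours), not the routing-table maintenance that keeps the fingers converging:

   declare
   fun {FingerTargets N M}
      % In a 2^M identifier space, finger I of node N points towards key (N + 2^(I-1)) mod 2^M.
      {Map {List.number 1 M 1}
       fun {$ I} (N + {Pow 2 (I-1)}) mod {Pow 2 M} end}
   end
   {Browse {FingerTargets 0 4}}   % displays [1 2 4 8]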
Atomic ring maintenance is a crucial part of the overlay. Peer nodes can join and leave at any
time. Peers that crash are like peers that leave but without notification. Temporarily broken links
create false suspicions of failure.
Structured overlay networks are already designed as feedback structures. They already solve the
problem of self management for scalable communication and storage. We are using them as the
basis for designing a general architecture for self-managing applications. To achieve this goal, we
are extending the SONs in three ways:
• We have devised algorithms for handling imperfect failure detection (false suspicions) [17],
which vastly reduces the probability of lookup inconsistency. Imperfect failure detection is
handled by relaxing the ring invariant to obtain a so-called “relaxed ring”, which maintains
connectivity even with nodes that are suspected (possibly falsely) to be failed. The relaxed
ring is always converging to a perfect ring as suspicions are resolved.
• We have devised algorithms for detecting and merging network partitions [21]. This is a
crucial operation when the SON crosses a critical point (see Section 4.3).
• We have devised and implemented a transaction algorithm on top of the SON using a
symmetric replicated storage and a modified version of the Paxos uniform consensus
algorithm to achieve atomic commit with the Internet failure model [19].
4.2. Transactions over a SON
FIGURE 4: Distributed transactions on a structured overlay network
The highest-level service that we are implementing on a SON is a transactional storage.
Implementing transactions over a SON is challenging because of churn (the rate of node leaves,
joins, and failures and the subsequent reorganizations of the overlay) and because of the
Internet’s failure model (crash stop with imperfect failure detection). The transaction algorithm
is built on top of a reliable storage service. We implement this using symmetric replication [7].
To avoid the problems of failure detection, we implement atomic commit using a majority
algorithm based on a modified version of Paxos [19, 8]. We use an imperfect failure detector to
change coordinators in this algorithm. This is implementable on the Internet; because the failure
detection is imperfect we may change coordinators too often, but this only affects efficiency, not
correctness. We have shown that majority techniques work well for DHTs [22]: the probability of
data consistency violation is negligible. If a consistency violation does occur, then this is because
of a network partition and we can use the network merge algorithm [21].
We give a simple scenario to show how the algorithm works. A client initiates a transaction by
asking its nearest node, which becomes a transaction manager. Other nodes that store data
are participants in the transaction. Assuming symmetric replication with degree f, we have f
transaction managers and f replicas for each other participating node. Figure 4 shows a situation
with f = 4 and two nodes participating in addition to the transaction manager. Each transaction
manager sends a Prepare message to all replicated participants, each of which sends back a
Prepared or Abort message to all replicated transaction managers. Each replicated transaction
manager collects votes from a majority of participants and locally decides on abort or commit. It
sends this decision to the transaction manager. After having collected a majority, the transaction manager
sends its decision to all participants. This algorithm has six communication rounds. It succeeds if
more than f/2 nodes of each replica group are alive.
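To make the scenario concrete, here is a small Oz sketch (our own helper names, assuming an identifier space of size Size) of the two ingredients the scenario relies on: the placement of the f replica keys under symmetric replication [7] and the majority test used by each replicated transaction manager:

   declare
   fun {ReplicaKeys K Size F}
      % Symmetric replication: the F replicas of key K are spaced Size div F apart.
      {Map {List.number 0 (F-1) 1}
       fun {$ I} (K + I*(Size div F)) mod Size end}
   end
   fun {HasMajority Prepared F}
      % A group decision needs Prepared votes from more than F div 2 replicas.
      Prepared > F div 2
   end
   {Browse {ReplicaKeys 1 16 4}}   % displays [1 5 9 13], the key group shown in Figure 4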
FIGURE 5: Conjectured phase transitions for a relaxed ring SON
4.3. Phase transitions in SONs and their effect on application design
At low node failure rates, a SON is a single ring where each node has fixed neighbors. This
corresponds to a solid phase. At high failure rates, a SON will separate into many small rings. At
the limit, a SON with n nodes will separate into n single-node SONs. This is the gaseous phase.
In between these two extremes we conjecture that there is a liquid phase, the relaxed ring, where
the ring is connected but each node does not have a fixed set of neighbors. When a node is
subject to a failure suspicion then its set of neighbors changes.
We conjecture that for properly designed SONs phase transitions can occur for changing values
of the failure rate. Figure 5 shows the kind of behavior we expect for the relaxed ring. In this figure,
we assume that the node failure rate is equal to the node join rate, so that the total number of
nodes is stationary. In accord with the Internet’s failure model, we also assume that some of the
reported failures are not actual failures (they are called failure suspicions [9]). At low failure rates,
the ring is connected and does not change (solid phase). At high failure rates, the ring “boils” to
become a set of small rings (of size 1, in the extreme case). At intermediate failure rates, the ring
may stay connected but because of failure suspicions some nodes get pushed into side branches
(relaxed ring).
We support this conjecture by citing [14], which uses the analytical model of [13] to show that
phase transitions should occur in the Chord SON [23]. Specifically, [14] shows that three phases
are traversed when the average network delay increases, in the following order: a region of
efficient lookup, followed by a region where the longest fingers are dead (inefficient lookup),
followed by a region where the ring is disconnected. We are setting up simulation experiments
to verify this behavior and further explore the phase behavior of SONs.
A SON that behaves in this way will never “fail”; it will just change phase. Each phase has
well-defined behavior that can be programmed for. These phase transitions should therefore be
considered as normal behavior that can be exposed to the application running on top of the
SON. An important research question is to determine what the application API should be for
phase transitions. At high failure rates, the application will run as many separate parts. When
the rate lowers, these parts will combine (they will “condense” using the merge algorithm) and
the application should resolve conflicts by an appropriate merge of the information stored in the
separate rings. We can see that the application will probably have different consistency models at
different failure rates. The transaction algorithm of the previous section will need to be modified to
take this into account.
As a final remark, we conclude that the merge algorithm is a necessary part of a SON. Without
the merge algorithm, condensation of a gaseous system is not possible. The SON is incomplete
without it. With the merge algorithm, the SON and its applications can live indefinitely at any failure
rate.
5. CONCLUSIONS
To overcome the fragility of software, we propose to build the software as a set of interacting
feedback loops. Each feedback loop monitors and corrects part of the system. No part of the
system should exist outside of a feedback loop. We motivate this idea by showing how it exists in
real systems taken from biology and software (the human respiratory system and the Internet’s
TCP protocol family). If the feedback structure is properly designed, then it reacts to a hostile
environment by doing a reversible phase transition. For example, when the node failure rate
increases, a large overlay network may become a set of disjoint smaller overlay networks. When
the failure rate decreases, these smaller networks will coalesce into a large network again. These
transitions can be exposed to the application as an API so that it can be written to survive the
transition. Important research questions are to determine what this API should be and how it
affects application design.
In our own work in the SELFMAN project [20], we have built structured overlay networks
that survive in realistically harsh environments (with imperfect failure detection and network
partitioning). We have developed a network merge algorithm that allows structured overlay
networks to do reversible phase transitions. We are extending our SON with transaction
management to implement three application scenarios derived from our industrial partners. We
are currently finishing our implementation and evaluating the behavior of our system. Much
remains to be done, e.g., we need to extend the transaction algorithm of Section 4.2 so that it
also works correctly during phase transitions.
One important lesson from this work is that all future software systems should be designed so
that they can support reversible phase transitions. For example, up to the work reported in [21],
SONs could not merge. That means that they could not “condense” (move from a gaseous back to
a solid phase) as failure rates decreased. They would “boil” (become disconnected) when failure
rates increased and they would stay disconnected when the failure rates decreased. We conclude
that network merge is more than just an incremental improvement that helps improve reliability.
It is fundamental because it allows the system to survive any number of phase transitions. The
system is reversible and therefore does not break. Without it, the system breaks after just a single
phase transition.¹
6. ACKNOWLEDGEMENTS
This work is funded by the European Union in the SELFMAN project (contract 34084) and in
the CoreGRID network of excellence (contract 004265). Peter Van Roy is the coordinator of
SELFMAN. He acknowledges all SELFMAN partners for their insights and research results.
In particular, he acknowledges the work on the relaxed ring, network partitioning, symmetric
replication, distributed transactions, and the analytic study of SONs, all done by SELFMAN
partners. Some of this work was done in the earlier PEPITO and EVERGROW projects.
REFERENCES
[1] Armstrong, Joe. “Making reliable distributed systems in the presence of software errors,” Ph.D.
dissertation, Royal Institute of Technology (KTH), Kista, Sweden, Nov. 2003.
[2] Ashby, W. Ross. “An Introduction to Cybernetics,” Chapman & Hall Ltd., London, 1956. Internet (1999):
http://pcp.vub.ac.be/books/IntroCyb.pdf.
[3] von Bertalanffy, Ludwig. “General System Theory: Foundations, Development, Applications,” George
Braziller, 1969.
[4] Brooks, Rodney A. A Robust Layered Control System for a Mobile Robot, IEEE Journal of Robotics and
Automation, RA-2, April 1986, pp. 14-23.
[5] Cousot, Patrick, and Radhia Cousot. Abstract Interpretation: A Unified Lattice Model for Static Analysis
of Programs by Construction or Approximation of Fixpoints, 4th ACM Symposium on Principles of
Programming Languages (POPL 1977), Jan. 1977, pp. 238-252.
[6] France Télécom, Zuse Institut Berlin, and Stakk AB. User requirements for self managing applications:
three application scenarios, SELFMAN Deliverable D5.1, Nov. 2007, www.ist-selfman.org.
[7] Ghodsi, Ali, Luc Onana Alima, and Seif Haridi. Symmetric Replication for Structured Peer-to-Peer
Systems, Databases, Information Systems, and Peer-to-Peer Computing (DBISP2P 2005), Springer-Verlag LNCS volume 4125, pages 74-85.
[8] Gray, Jim, and Leslie Lamport. Consensus on transaction commit. ACM Trans. Database Syst., ACM
Press, 2006(31), pages 133-160.
[9] Guerraoui, Rachid, and Luis Rodrigues. “Introduction to Reliable Distributed Programming,” Springer-Verlag, Berlin, 2006.
[10] Hellerstein, Joseph L., Yixin Diao, Sujay Parekh, and Dawn M. Tilbury. “Feedback Control of Computing
Systems,” Aug. 2004, Wiley-IEEE Press.
[11] IBM. Autonomic computing: IBM’s perspective on the state of information technology, 2001,
researchweb.watson.ibm.com/autonomic.
[12] Information Sciences Institute. “RFC 793: Transmission Control Protocol Darpa Internet Program
Protocol Specification,” Sept. 1981.
[13] Krishnamurthy, Supriya, Sameh El-Ansary, Erik Aurell, and Seif Haridi. A statistical theory of Chord
under churn, Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS’05), Ithaca,
New York, Feb. 2005.
¹ An interesting question for physicists is to explain why matter behaves in reversible fashion. Software has to be designed for reversibility while simple molecules have this property implicitly.
[14] Krishnamurthy, Supriya, and John Ardelius. An Analytical Framework for the Performance Evaluation
of Proximity-Aware Overlay Networks, Tech. Report TR-2008-01, Swedish Institute of Computer Science,
Feb. 2008 (submitted for publication).
[15] Laguës, Michel and Annick Lesne. “Invariances d’échelle: Des changements d’états à la turbulence”
(“Scale invariances: from state changes to turbulence”), Belin éditeur, Sept 2003.
[16] Lynch, Nancy. “Distributed Algorithms,” Morgan Kaufmann, San Francisco, CA, 1996.
[17] Mejias, Boris, and Peter Van Roy. A Relaxed Ring for Self-Organising and Fault-Tolerant Peer-to-Peer
Networks, XXVI International Conference of the Chilean Computer Science Society (SCCC 2007), Nov.
2007.
[18] Miller, Mark S., Marc Stiegler, Tyler Close, Bill Frantz, Ka-Ping Yee, Chip Morningstar, Jonathan Shapiro,
Norm Hardy, E. Dean Tribble, Doug Barnes, Dan Bornstien, Bryce Wilcox-O’Hearn, Terry Stanley, Kevin
Reid, and Darius Bacon. E: Open source distributed capabilities, 2001, www.erights.org.
[19] Moser, Monika, and Seif Haridi. Atomic Commitment in Transactional DHTs, Proc. of the CoreGRID
Symposium, Rennes, France, Aug. 2007.
[20] SELFMAN: Self Management for Large-Scale Distributed Systems based on Structured Overlay
Networks and Components, European Commission 6th Framework Programme three-year project, June
2006 – May 2009, www.ist-selfman.org.
[21] Shafaat, Tallat M., Ali Ghodsi, and Seif Haridi. Dealing with Network Partitions in Structured Overlay
Networks, Journal of Peer-to-Peer Networking and Applications, Springer-Verlag, 2008 (to appear).
[22] Shafaat, Tallat M., Monika Moser, Ali Ghodsi, Thorsten Schütt, Seif Haridi, and Alexander Reinefeld. On
Consistency of Data in Structured Overlay Networks, CoreGRID Integration Workshop, Heraklion, Greece,
Springer LNCS, 2008 (to appear).
[23] Stoica, Ion, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A
Scalable Peer-to-Peer Lookup Service for Internet Applications, SIGCOMM 2001, pp. 149-160.
[24] Van Roy, Peter. Convergence in Language Design: A Case of Lightning Striking Four Times in the
Same Place, 8th International Symposium on Functional and Logic Programming (FLOPS 2006), April
2006, Springer LNCS volume 3945, pp. 2-12.
[25] Van Roy, Peter. Self Management and the Future of Software Design, Third International Workshop on
Formal Aspects of Component Software (FACS 2006), Springer ENTCS volume 182, June 2007, pages
201-217.
[26] Van Roy, Peter, Seif Haridi, Alexander Reinefeld, Jean-Bernard Stefani, Roland Yap, and Thierry
Coupaye. Self Management for Large-Scale Distributed Systems: An Overview of the SELFMAN Project,
Springer LNCS, 2008 (to appear). Revised postproceedings of FMCO 2007, Oct. 2007.
[27] Weinberg, Gerald M. “An Introduction to General Systems Thinking: Silver Anniversary Edition,” Dorset
House, 2001 (original edition 1975).
[28] Whitehead, Alfred North. Quote: Civilization advances by extending the number of important operations
which we can perform without thinking of them.
[29] Wiener, Norbert. “Cybernetics, or Control and Communication in the Animal and the Machine,” MIT
Press, Cambridge, MA, 1948.
[30] Wikipedia, the free encyclopedia. Entry “drowning,” August 2006. Internet: http://en.wikipedia.org/wiki/Drowning.
APPENDIX A. PUBLICATIONS
A.2 The Limits of Network Transparency in a Distributed Programming Language
SELFMAN Deliverable Year Two, Page 328
The Limits of Network Transparency in a Distributed Programming Language
Raphaël Collet
Thesis submitted in partial fulfillment of the requirements
for the Degree of Doctor in Applied Sciences
December 2007
Faculté des Sciences Appliquées
Département d’Ingénierie Informatique
Université catholique de Louvain
Louvain-la-Neuve
Belgium
Thesis committee:
Yves Deville (chair), UCL/INGI, Belgium
Marc Lobelle, UCL/INGI, Belgium
Per Brand, SICS, Sweden
Joe Armstrong, Ericsson, Sweden
Peter Van Roy (advisor), UCL/INGI, Belgium
Abstract
This dissertation presents a study on the extent and limits of network transparency in distributed programming languages. This property states that the
result of a distributed program is the same as if it were executed on a single
computer, in the case when no failure occurs. The programming language may
also be network aware if it allows the programmer to control how a program is
distributed and how it behaves on the network. Both aim at simplifying distributed programming, by making non-functional aspects of a program more
modular.
We show that network transparency is not only possible, but also practical:
it can be efficient, and smoothly extended in the case of partial failure. We
give a proof of concept with the programming language Oz and the system
Mozart, of which we have reimplemented the distribution support on top of
the Distribution Subsystem (DSS). We have extended the language to control
which distribution algorithms are used in a program, and reflect partial failures
in the language. Both extensions make it possible to handle non-functional aspects of a
program without breaking the property of network transparency.
Acknowledgments
I want to thank all the people that have supported me during those nine years.
This thesis took a long time to mature, and would not have been possible
without the following people.
I am grateful to my advisor Peter Van Roy, who introduced me to Mozart
and its distributed protocols, and trained me as a researcher. His constant
enthusiasm has pushed me to complete this work.
I want to thank all my colleagues at the university. In particular, the people
with whom I have explored the distribution of Oz: Boris Mejı́as, Yves Jaradin,
Valentin Mesaros, Donatien Grolaux, Kevin Glynn, Iliès Alouini; and people
who experienced constraint programming in Oz with me: Fred Spiessens, Luis
Quesada, Stefano Gualandi, Renaud de Landtsheer, and Isabelle Dony. I also
thank all the people who worked with me, and who I forgot to list here.
I am also grateful to the people of the Mozart consortium, in particular Per
Brand, Erik Klintskog, Konstantin Popov, and Christian Schulte. They helped
me when I needed some guidance to modify the virtual machine and the DSS.
I also thank all the scientists I have met during these years, and with whom I
learned so much.
This research has been supported by the projects PIRATES (Walloon Region), PEPITO (European IST FET Global Computing), EVERGROW (European 6th Framework Programme), SELFMAN (European 6th Framework Programme), and CoreGRID (European Network of Excellence on Grid and Peer-to-Peer technologies). I thank all the institutions that funded those projects.
Because my life has changed since I began playing music, I would like to
thank all the musicians that made some great music with me. The following bands have been an immense opportunity to broaden my horizons: Glycerinn (with Jack, Sarah, and Felipe), the Confused Deputies (with Boris and
Fred), and the Yellows (with Jean-François, Marc, Jérôme, Alexandre, and
Christophe).
My last thanks go to my family, my parents, my brother, my sister with
her husband and two little boys.
Contents

Abstract  i
Acknowledgments  iii
Contents  v

1 Introduction  1
  1.1 Distributed systems  2
  1.2 Models of distributed programming  3
  1.3 Network transparency  5
  1.4 Thesis and contributions  7
  1.5 Structure of the document  9

2 An introduction to Oz  11
  2.1 The kernel language approach  11
  2.2 Declarative kernel language  12
    2.2.1 The store  13
    2.2.2 Declarative statements  13
    2.2.3 A few convenient operations  16
  2.3 Nondeclarative extensions  16
    2.3.1 Exceptions  16
    2.3.2 Read-only views  17
    2.3.3 State  18
  2.4 Syntactic convenience  18
    2.4.1 Declarative programming  19
    2.4.2 Message passing  21
    2.4.3 Stateful entities  22
  2.5 Distribution  24
    2.5.1 Application deployment  24

3 Application structure and distribution behavior  27
  3.1 Layered structure  27
    3.1.1 Using the declarative model  28
    3.1.2 Using message passing  29
    3.1.3 Using shared state concurrency  31
  3.2 Classification of language entities  32
    3.2.1 Mutable entities  32
    3.2.2 Monotonic entities  33
    3.2.3 Immutable entities  33
  3.3 Annotations  34
    3.3.1 Annotations and semantics  35
    3.3.2 Annotation system  35
    3.3.3 Partial and default annotations  36
    3.3.4 Access architecture  36
    3.3.5 State consistency protocols  37
    3.3.6 Reference consistency protocols  38
  3.4 Related work  39
    3.4.1 Erlang  39
    3.4.2 Java RMI  40
    3.4.3 E  40

4 Asynchronous failure handling  43
  4.1 Fault model  44
    4.1.1 Failures  44
    4.1.2 Failure detectors  45
    4.1.3 Entity fault states  46
    4.1.4 Concrete interpretation of fault states  47
  4.2 Failure handlers  49
    4.2.1 Definition  49
    4.2.2 No synchronous handlers for Oz  49
    4.2.3 Entity fault stream  50
    4.2.4 Discussion  52
  4.3 Making entities fail  52
    4.3.1 Global failure  53
    4.3.2 Local failure  53
  4.4 Failures and memory management  54
    4.4.1 Blocked threads and fault streams  54
    4.4.2 Entity resurrection  55
  4.5 Related work  56
    4.5.1 Java RMI  56
    4.5.2 Erlang  56
    4.5.3 The first fault model of Mozart  56

5 Applications  59
  5.1 Distributed lazy producer/consumer  59
    5.1.1 A bounded buffer  60
    5.1.2 A correct bounded buffer  61
    5.1.3 An adaptive bounded buffer  62
    5.1.4 A batch processing buffer  63
  5.2 Processes à la Erlang  64
  5.3 Failure by majority  67
    5.3.1 Algorithm  67
    5.3.2 Correctness  68
    5.3.3 The whole code of processes  69
    5.3.4 Variants  69

6 Language semantics  75
  6.1 Full language to kernel language  75
  6.2 Basics of the semantics  79
    6.2.1 The store  79
    6.2.2 Structural rules  81
  6.3 Declarative subset of the language  82
    6.3.1 Sequential and concurrent execution  82
    6.3.2 Variable introduction  82
    6.3.3 Unification  83
    6.3.4 Conditional statements  85
    6.3.5 Names and procedures  86
    6.3.6 By-need synchronization  86
  6.4 Nondeclarative extensions  87
    6.4.1 Nondeterministic wait  87
    6.4.2 Exception handling  88
    6.4.3 Read-only views  89
    6.4.4 State  90

7 Distributed semantics  93
  7.1 Reflecting network and site behavior  94
    7.1.1 Locality  94
    7.1.2 Network failures  94
    7.1.3 Site failures  95
  7.2 Reflecting entity behavior  95
    7.2.1 Entity failures  96
    7.2.2 Entity annotations  96
  7.3 Declarative kernel language  98
    7.3.1 Purely local reductions  98
    7.3.2 Variable introduction and binding  98
    7.3.3 Procedure creation and copying  98
    7.3.4 By-need synchronization  99
  7.4 Nondeclarative extensions  99
    7.4.1 Exception handling and read-only views  100
    7.4.2 State  100
  7.5 Failure handling  102
    7.5.1 Failure detectors  102
    7.5.2 Making entities fail  104
  7.6 Mapping distributed to centralized configurations  105
    7.6.1 The mapping  105
    7.6.2 Network transparency  105

8 Implementation  107
  8.1 Architecture of Mozart/DSS  107
  8.2 The Distribution Subsystem  109
    8.2.1 Protocols for mutables  110
    8.2.2 Protocols for immutables  114
    8.2.3 Protocols for transients  115
    8.2.4 Handling failures  116
    8.2.5 Distributed garbage collection  117
  8.3 The language interface  118
    8.3.1 Distributed operations in general  118
    8.3.2 Distributed immutables  120
    8.3.3 Remote invocations and thread migration  120
    8.3.4 Unification and by-need synchronization  121
    8.3.5 Fault stream and annotations  122
    8.3.6 Garbage collection  123

9 Evaluation  129
  9.1 Ease of programming  129
  9.2 Performance  129
    9.2.1 Mozart/DSS vs. Mozart  130
    9.2.2 Comparing protocols  131

10 Conclusion  135
  10.1 Achievements  135
  10.2 Future directions  136

A Summary of the model  139
  A.1 Program structure  139
  A.2 Failure handling  140

Bibliography  141
1 Introduction
This dissertation presents a study on the extent and limits of network transparency in distributed programming languages. A programming language is
said to be network transparent if a distributed program gives the same result
as if it were executed on a single computer, provided network delays are ignored and no network failure occurs. The language is said to be network aware
if the language definition allows the programmer to predict and control how the program is distributed, and its network behavior. The conjunction of both properties aims
at simplifying distributed programming by separating a program’s functionality, in which distribution can be ignored, from its distribution behavior, which
includes network performance, partial failure (when part of the system fails),
and security.
Earlier work, such as [Jul88], has shown that network transparency is possible.
We show that network transparency is also practical: it can be efficient, and
smoothly extended in the case of partial failure. Efficiency is possible if the
programming language supports programming with asynchronicity, which is
reasonable in general, and fits well with distribution. Performance can also be
tuned by the choice of distributed algorithms used by the underlying system,
without affecting functionality in the case when there are no failures. Partial
failure can be reflected in the language in a simple way, so that fault tolerance
can be added in a modular fashion completely within the language. Security
is beyond the scope of this thesis and is a subject of future work. We give a
proof of concept with the programming language Oz and the system Mozart
[Moz99]. We have extended the language to improve its network awareness,
both for controlling distribution and handling partial failures.
This work is a continuation of earlier works on distributed programming
languages, mainly done at the Swedish Institute of Computer Science (SICS).
Among the results of those works are the system Mozart and the Distribution
Subsystem (DSS), and two dissertations:
1
2
Introduction
• Per Brand, in The Design Philosophy of Distributed Programming Systems: the Mozart Experience [Bra05], presents the first design, implementation, and evaluation of the distributed system Mozart.
• Erik Klintskog, in Generic Distribution Support for Programming Systems [Kli05], presents the design, implementation, and evaluation of the
DSS, a middleware which provides efficient distribution support for programming languages.
Per Brand showed that asynchronous stream communication can be orders of
magnitude faster than synchronous communication (such as Java RMI). He also
showed that an Oz program was almost unchanged when going from centralized
to distributed, and much simpler than a corresponding Java program.
This thesis both extends and simplifies the network-transparent distribution in Mozart. We have modified and extended the language Oz, in order to
improve the network awareness in the language. We have reimplemented the
distribution layer in the platform Mozart on top of the DSS middleware, and
completed the latter to make it able to handle and reflect partial failures. We
have also redesigned failure handling in Oz to make it completely asynchronous,
and have shown that this is the right default.
1.1 Distributed systems
Distributed systems are becoming ubiquitous; today virtually all computers are connected to the Internet, which provides many collaborative tools and programs.
Moreover, many computers today contain multiple processors that run in parallel. This is a consequence of the current limits in increasing processors’ speed,
which makes manufacturers increase the number of processors instead. Computers with multiple processors can be considered as distributed systems on
their own, with fast communication between processors.
Software development is progressively shifting towards concurrent and distributed programs that can take advantage of this available parallelism. Sequential programming is still acceptable for small programs, but not for large
applications. By necessity, large programs will be distributed, and therefore
concurrent. Alas, the way concurrency has been introduced into existing systems is
rather poor, and inter-thread communication is often based on shared state.
This model is difficult and bug-prone for programmers, who are therefore discouraged
from programming in a concurrent style. However, some systems propose different models for concurrent programming, like message passing in Erlang and dataflow
concurrency in Mozart.
Partial failures also make distributed programs more complex than centralized ones. Programs that run on a certain number of machines should be able
to deal with faults in parts of the system. On one hand, writing distributed
applications without taking partial failures into account was quickly seen as
unrealistic. On the other hand, one needs abstractions to prevent the application
code from being cluttered with failure-handling code everywhere. Failure handling
should be as modular as possible.
Programming languages and systems. We claim that the design of the
programming language is essential when writing such applications. Extending
a sequential language may lead to bad surprises, because the distributed program will not be sequential. So the language or its libraries should at least
provide good abstractions to handle concurrency. Moreover, letting parts of
a program share data introduces many subtle problems. For instance, remote
references may be provided either via proxy entities, or transparently by the
language itself. The programmer needs clear indication about what can be
shared between sites¹. Also, transferring data requires serialization.
There are basically two ways for a programming language to support the
development of distributed applications. The first approach is to augment the
language with libraries for distribution. Those libraries provide abstractions
to make sites interact with each other. This typically involves communication channels, abstract representations of distributed entities, and so on. The
programmer is responsible for integrating the application with the distribution
library.
The second approach is to provide distribution as an inherent property of
the programming language itself. In this case, we talk about a distributed programming language. A program is seen as a collection of threads and data that
are spread over a set of sites. From a functional point of view, the interaction between the parts of the application is not different from the interaction
between concurrent threads on a single host. However, the semantics of the
language is extended to incorporate new aspects, like network latency and partial failures. This thesis explores some of the possibilities that are offered by
these systems.
1.2 Models of distributed programming
The choices made to bring distribution into a programming language clearly
determine the model that the programmer has to use. We define the programming model as a set of language constructs together with how they are
executed [VH04]. It is sometimes more informally called a programming paradigm. Examples are declarative programming, object-oriented programming,
processes with message passing. Here we are interested in the underlying models of the programming systems (the languages and their libraries) that are
used to build distributed applications. We consider a few concepts that may
or may not be part of a programming model.
¹ The site is the unit of localization. Sites execute code concurrently, and are independent of each other. A typical example is a system process.
Concurrency. By definition, a distributed program involves several activities
that run more or less independently on different sites. This implies that the programming model is necessarily concurrent and non-deterministic, because those
properties are intrinsic to distributed systems. Therefore concurrent programming languages have an advantage over sequential languages when it comes
to distribution. A concurrent language makes it possible to test a distributed
program in a single site.
Moreover, good language support for controlling concurrency is also an
advantage. For instance, in Oz one can synchronize two threads with a dataflow
variable. This technique is simple, elegant, and is useful even if the threads are
located on different sites.
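For example (a minimal sketch of our own, in the Oz syntax introduced in Chapter 2), the two threads below synchronize through the dataflow variable X: the first thread suspends on X until the second one binds it, whether or not the threads run on the same site.

thread Result=X+1 end    % blocks until X is determined
thread X=41 end          % binding X wakes up the first thread; Result becomes 42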
Synchronous and asynchronous operations. Many programming languages only provide synchronous operations in their model. Synchronous operations are fast and natural in centralized applications, but they can be pretty
slow in a distributed setting. Indeed, distributed synchronous operations often
require several sites to exchange messages, and cannot proceed immediately.
Asynchronous operations do not wait: the operation terminates immediately,
and its effect will be performed eventually. This scheme fits well in a distributed
environment: the operation may simply prepare a message and terminate, while
the message delivery will perform the expected effect. The network latency is
partly hidden to the user.
Most programming languages designed with distribution in mind provide
asynchronous operations in their model [Van06]. In some of them, like Erlang,
synchronous operations are not even part of the core of their model, but are
simply defined as a derived concept [AWWV96].
Stateful and stateless data. Does the programming model make a distinction between stateful and stateless data? This is more important than it
seems at first sight. “One size fits all” does not hold for distributed data. On
one hand, stateless data can be copied between sites, which provides minimum
latency for operations. Once the data are copied, all operations are purely
local.
On the other hand, stateful data need different protocols to handle their
state and keep it consistent between sites. The state may be stationary, and
behave like a server for remote operations, like distributed objects in Java
Remote Method Invocation (RMI) [GJS96, Sun97]. But other protocols are
useful as well. A migratable state may give better performance when a site
has to perform many operations on it. Once the state has moved to the site,
it can be considered as a cache, because several operations on that site may
be performed in a batch. Replicating the state is yet another option, if read
operations are more frequent than updates.
Multiplicity of paradigms. The ability to choose between several paradigms when writing a distributed application may lead to better programs. If
a problem has a natural solution in a given programming model, its implementation will be simpler. The programmer may also choose the paradigm
depending on the problem to solve and how various concepts of the language
are distributed. The system does not force the programmer to emulate a distributed protocol on top of inappropriate concepts.
Distributed and local references. In programming systems that provide
distribution as a separate library, distributed and local references often have
different interfaces. An example is given by Java RMI, where distributed objects
introduce exceptions where equivalent local objects would not. Distributed
objects have a slightly different semantics, too. Reference integrity of remote
objects is not guaranteed in general, for instance [Sun97]. Turning local objects
into distributed ones may break an application.
Making no visible difference between local and distributed data allows the
runtime system to choose the right representation for a given datum. A local
reference that is sent to a remote site automatically switches to a distributed
representation for the corresponding data. The conversion may be reverted
once the distributed reference is used by one site only. This implies less effort
from the programmer.
Partial failures. They are inherent to distributed systems, so reflecting failures in the programming model is very important. It basically provides the
programmer with a semantic representation of the failure. This semantic support allows the programmer to reason about failures, and handle them properly.
Besides a semantic representation, the programming model should also provide
a way to detect failures. Failure detectors are the basic ingredient of failure
handlers.
We believe that causing a partial failure programmatically can be useful: it may
simplify a failure handler. Sometimes a component cannot be fixed easily because it strongly depends on another component, which has failed. Making the
former one fail may accelerate the failure recovery, which can be handled at a
higher level in the application.
1.3
Network transparency
As we said, network transparency states that the result of a distributed program
is the same as if it were executed on a single computer, in the case when
no failure occurs. The meaning can be more precise if we consider network
transparency at the level of the programming language. A given entity or piece
of code is network transparent if its semantics does not depend on whether it
is run in a distributed environment, provided no failure occurs.
Several forms of transparency have been proposed in the literature. The
following ones are taken from [ISO98]. They use the term resource in a very
general sense. When applied to a programming language, a resource typically
corresponds to an object or an agent.
• Access transparency masks differences in data representation and invocation mechanisms, to provide a single and uniform access to resources.
• Location transparency states that the user of a resource should not be
aware of where the resource is physically located. Migration transparency
states that the user should not be aware of whether a resource or computation has the ability to move to a different location, while relocation
transparency guarantees that its migration should not be noticeable to
the user.
• Replication transparency makes a resource appear as unique even if it is
replicated among several locations. Persistence transparency makes no
difference between resources located in volatile and permanent memory.
• Failure transparency masks the failure and possible recovery of resources
or computations.
• Transaction transparency masks coordination of activities amongst a configuration of entities to achieve consistency.
Our definition of network transparency covers the above notions of access, location, migration, relocation, and replication transparency. We also cover transaction transparency in the sense that primitive operations on distributed entities should be atomic, just like in the centralized case. Persistence transparency
is rarely present in programming languages, where the program’s memory is
often considered as volatile. Our definition does not cover failure transparency,
and our proposal for a distribution model in Chapter 4 will even make failures
explicit in the language.
In practice. Some researchers have maintained that network transparency
cannot be made practical, see, e.g., Waldo et al. [WWWK94]. They cite four
reasons: pointer arithmetic, partial failure, latency, and concurrency. The first
reason (pointer arithmetic) disappears if the language has an abstract store.
The second reason (partial failure) requires a reflective fault model, which we
designed for the Distributed Oz language. The authors of the paper above
expected that failures could always be hidden behind abstractions. They were wrong: sometimes failures cannot be resolved locally, and require some global action.
The final two reasons (latency and concurrency) lead to a layered language
design. Latency is a problem if the language relies primarily on synchronized
operations, like procedure calls. The authors of the paper explicitly mention
the disappointing experience of remote procedure calls, which they see as the
only way to make distributed objects. In the terminology of Cardelli, latency
is a network awareness issue [Car95]. The solution is that the language must
make asynchronous programming both simple and efficient.
Concurrency is also seen as an obstacle. A closer look at the paper reveals
that the authors actually talk about shared-state concurrency. Indeed, most
people consider that programming languages are always stateful. The problem with concurrency in those languages is how to control concurrent accesses
to state, in order to avoid invariant violations and glitches. A solution is to
support a form of stateless concurrency, known as dataflow concurrency. Concurrent threads interact by sharing values, and automatically synchronize on
the availability of data. Threads can be programmed as if they were never
waiting for data. An example of this kind of communication is pipelining in
Unix-like systems. Moreover, values can be copied between memory stores,
which substantially reduces the latency of accessing them. Using dataflow concurrency can reduce the need for shared state to a minimum. We conclude that language
design is an important issue for network transparency.
1.4
Thesis and contributions
This thesis proves that network transparency is practical in a distributed programming language. It gives concrete proposals of language extensions that
deal with performance and failure handling, and demonstrates their usage
with practical examples. It also describes the implementation of the platform
Mozart/DSS, with insights on various implementation issues.
Contributions. This thesis extends, simplifies, and completes the past work
on network-transparent distribution in Mozart. The initial distribution model
and the initial failure detection model [HVS97, VHB99] formed the core of
the first distributed release of Mozart in 1999. Erik Klintskog made the first
design of a distribution subsystem (DSS) in which the distribution support is
completely factored out of the run-time emulator [Kli05]. The work on the DSS
was incomplete, however. The present thesis brings the following scientific and
technical contributions.
• We extend the distribution model of Oz to make it customizable. We
introduce entity annotations, so that the programmer has the ability to
choose between several protocols for each entity, including its distributed
memory management.
• We design a failure handling model for Oz that is simpler and more
expressive than the initial one. Each language entity produces a stream
of fault states that is extended asynchronously, whenever the entity’s
fault state changes.
• We design an effective post-mortem finalization mechanism based on the
fault stream. This mechanism did not exist in the language.
• We give distributed programming patterns that show how the system
simplifies programming robust distributed systems.
• We complete Erik Klintskog’s work by presenting more precise definitions
of the distribution protocols that include failure handling, in particular
the mobile state protocol.
• We have rebuilt the distribution support of the platform Mozart on top
of the DSS library, and implemented the new distributed programming
model of the language. The reflection of failures in the language, and the
implementation of the new language features (annotations, fault stream,
making entities fail) are entirely our work.
• We have also completed the implementation of the DSS. In particular
we have rewritten all entity protocols such that they can handle partial
failures. We have also extended the DSS interface to handle and reflect
entity failures.
• We evaluate the new implementation in a realistic setting.
Publications. The following publications contain substantial contributions
by the author on the topics of this thesis. The first two papers focus on the
extension of a mobile state protocol to make it handle failures. That protocol
is part of the platform Mozart. The implementation of that protocol is now
part of the DSS. Its semantics as a migratory protocol are given in Chapter 7,
and its implementation is described in Chapter 8.
• Peter Van Roy, Per Brand, Seif Haridi, and Raphaël Collet. A lightweight
reliable object migration protocol. Lecture Notes in Computer Science,
1686:32–46, 1999 [VBHC99].
• Per Brand, Peter Van Roy, Raphaël Collet, and Erik Klintskog. Path
redundancy in a mobile-state protocol as a primitive for language-based
fault tolerance. Research Report RR2000-01, Université catholique de
Louvain, Département INGI, 2000 [BVCK00].
The next two papers propose a formal definition of lazy computations in terms
of concurrent constraints. That definition led to an efficient distributed implementation of that concept. Laziness is mentioned in Chapter 3, and a concrete
example of its usage is shown in Chapter 5. Its semantics are defined in Chapters 6 and 7, and its implementation is described in Chapter 8.
• Alfred Spiessens, Raphaël Collet, and Peter Van Roy. Declarative laziness in a concurrent constraint language. 2nd International Workshop on
Multiparadigm Constraint Programming Languages MultiCPL’03, 2003
[SCV03].
• Raphaël Collet. Laziness and declarative concurrency. 2nd Workshop on
Object-Oriented Language Engineering for the Post-Java Era: Back to
Dynamicity PostJava’04, 2004 [Col04].
The fifth paper shows how to design a transactional system for a distributed
store of objects, on top of an overlay network. That work put some emphasis
on what kind of primitives were desired in the language to handle failures.
• Valentin Mesaros, Raphaël Collet, Kevin Glynn, and Peter Van Roy. A
transactional system for structured overlay networks. Research Report
RR2005-01, Université catholique de Louvain, Département INGI, 2005
[MCGV05].
The last paper is the author's proposal to favor asynchronous failure handling in a distributed programming language. The paper contains the essential
contributions of Chapter 4.
• Raphaël Collet and Peter Van Roy. Failure handling in a network-transparent distributed programming language. In C. Dony et al., editor,
Advanced Topics in Exception Handling Techniques, volume 4119 of Lecture Notes in Computer Science, pages 121–140. Springer-Verlag, 2006
[CV06].
1.5
Structure of the document
Chapter 2 gives an introduction to the programming language Oz. That language is the vehicle we have chosen to explain all our proposals. This chapter
may safely be skipped if the reader already knows the language. Note that
most code snippets in Oz appear in special figures called “snippets”.
Chapters 3 and 4 detail our proposals for dealing with an application’s
distributed behavior, and failure handling, respectively. The programming
model is exposed and explained, together with some practical intuition for
all concepts. Concrete examples using those language features are given in
Chapter 5.
Chapters 6 and 7 propose a formal definition of the language, in centralized
and distributed settings, respectively. Operational semantics are given for the
core of the language, and the centralized semantics are refined into distributed
semantics that reflect the aspects related to distribution.
Chapter 8 describes the implementation of Mozart/DSS, the reimplementation of Mozart on top of the DSS library. It gives a definition of the protocols
that are used to implement basic operations on distributed entities, sketches
the DSS application programming interface, and explains how the distribution
support is implemented on top of it.
Chapter 9 evaluates the work done so far. Comparisons with other systems
are made. Chapter 10 concludes the work. Scientific results are summarized,
and future directions are given. Appendix A gives a summary of the model,
and the language extensions proposed in this work.
2
An introduction to Oz
In this text, we use the programming language Oz as a vehicle to express a
certain number of concepts related to distributed programming. The concepts
themselves are language independent, but few programming languages are able
to express them in a natural way. Oz makes it possible, thanks to its support for
multiple programming paradigms, among which we find declarative programming, dataflow concurrency, and object-oriented programming [Smo95, VH04].
This chapter gives a quick introduction to the language, and the basic model
of its distribution. A formal definition of the language is given in Chapter 6.
2.1
The kernel language approach
The language Oz is based on a small set of concepts that form a kernel language, called Kernel Oz. In the kernel language, all concepts are primitive:
they cannot be defined in terms of each other. The full language is defined
as the kernel language extended with language abstractions, i.e., programming
abstractions with syntactic support. All those abstractions are defined in terms
of Kernel Oz, so that every program can be reduced to an equivalent program
in the kernel language.
The advantage of the kernel language approach is that the language is defined by layers. The bottom layer is the kernel, with all primitive concepts.
The upper layers then define abstractions built on concepts defined in layers
below. In practice, two layers are enough to define a very expressive language.
Figure 2.1 on the following page shows a certain number of concepts that are
present in the language. All concepts in bold font are part of the kernel language, while the others are derived concepts. The arrows indicate from what
a given concept derives.
Figure 2.1: Primitive and derived concepts in the language Oz
The paradigm space. When writing a program or module, the programmer
will often use only a subset of the concepts provided by the programming
language. Every subset of concepts defines a programming model, or paradigm.
Each paradigm comes with its own set of techniques and design rules. As you
can see, the derived concepts shown in figure 2.1 are not placed randomly.
There are two major axes in the diagram.
The base of all derived concepts is a sequential declarative language, which
can be made a functional language with the appropriate language support.
It contains no state and no concurrency. The language is mainly enriched by
adding state (right direction in the diagram) and concurrency (upper direction).
Moreover, the derived concepts are grouped together to form three major paradigms, namely declarative, message passing, and shared state programming.
Section 2.2 introduces the kernel concepts of the core language, together
with concurrency. This part of Oz is completely declarative. The non-declarative kernel concepts are introduced in section 2.3. In section 2.4 we present
syntactic extensions, and language abstractions used in the text. There we
show two important techniques for controlling concurrency in the presence of
state: message passing with ports, and locks. Note that we have chosen to
introduce concurrency before state in the presentation. This is because it is
possible to make distributed programs that are declarative; all they need is
concurrency. The base distribution model is given in Section 2.5.
2.2
Declarative kernel language
An Oz program consists of a set of threads computing over a shared store.
Threads execute statements in a sequential way. The program will be sequential
if it contains a single thread. The store is the memory of the program; it
contains logic variables and data structures used by the threads. Threads are
connected to the store by variables.
2.2.1
The store
The program store has two parts: the so-called constraint store and the procedure store. The constraint store contains logic statements about the program’s
variables. For now consider that those logic statements have the form x=t,
where x is a variable, and t is either a variable or a value. The procedure store
contains the procedures created by the program.
Values. The different kinds of values are integers, atoms, names, and records.
Names are primitive entities that have no structure; they are used to give an
identity to other data structures, like procedures. The boolean values true
and false are defined as names.
Atoms are literal constants that are defined by a sequence of characters. Syntactically they are words starting with a lowercase letter, like atom or nil, or they are surrounded by quotes, like 'Hello world' or '|'. A literal is either a name or an atom, and a simple value is either a literal or an integer.
A record is a compound value l(k1:v1 ... kn:vn) formed by a label l (a literal) and fields v1, ..., vn. Each field vi is associated with a key ki, which is a simple
value. An example is person(name:raph age:32). A tuple is a record whose
keys are the integers 1 to n. Syntactically the keys can be omitted, as in
person(raph 32), which is equivalent to person(1:raph 2:32). Lists are
defined in terms of tuples and atoms as follows. A list is either the empty list
(the atom nil), or a head-tail pair '|'(x y). The latter can be written infix as
x|y.
Variables. The store contains logic variables that can be bound to values and
other variables. Upon creation, a variable x is always unbound. It is bound to
a value v if the store contains the statement x=v. It can be bound to at most
one value, and the binding cannot change over time. A variable bound to a
value is said to be determined.
Variables x and y can also be bound together, if the store contains x=y.
The binding relation is transitive: if x is bound to a value v, then y is also
bound to the same value v. Note that undetermined variables can be bound
together.
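In terms of store contents (a small sketch of ours; the statement syntax is introduced in the next section), the following bindings first link undetermined variables together, and then determine all of them at once:

X=Y    % X and Y are bound together, both still undetermined
Y=Z    % Z is bound to them as well
Z=5    % now X, Y, and Z are all determined to 5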
2.2.2
Declarative statements
The declarative kernel statements are given in Figure 2.2 on the following page.
S ::= skip                                 empty statement
   |  S1 S2                                sequential composition
   |  local X in S end                     declaration
   |  X=Y  |  X=v                          unification
   |  if X then S1 else S2 end             conditional statement
   |  proc {P X1 ... Xn} S end             procedure creation
   |  {P X1 ... Xn}                        procedure application
   |  thread S end                         thread creation
   |  {WaitNeeded X}                       by-need synchronization

v ::= s  |  l(k1:X1 ... kn:Xn)             simple value, record
s ::= l  |  i                              literal or integer
l ::= atom  |  true  |  false              atom or name
P, X, Y ::= identifier                     variable identifier

Figure 2.2: Declarative kernel concepts of Oz
Empty statement and sequential composition. The execution of the
statement skip simply has no effect. The sequence of statements S1 S2 first
executes S1 , then executes S2 .
Declaration. The statement local X in S end creates a fresh variable x
in the store, and reduces to the statement S, where X is associated to x. It
defines a lexical scope between the keywords in and end for the identifier X.
In order to be executable, a statement must have all its free identifiers
correspond to store variables. For the sake of simplicity, in the rest of the text,
we will refer to the variable corresponding to identifier X as “the variable X.”
Unification. Variables are bound in the store by unification. The statements X=Y and X=v add the necessary variable bindings to make their arguments equivalent. For instance, the following statements bind R to the record
foo(a:X b:Y), then by unification of records makes both X and Y equal to Z,
and finally binds them all to 42.
R=foo(a:X b:Y) R=foo(a:Z b:Z) Z=42
A unification triggers an error if its arguments cannot be made equal. The
error shows up as an exception (see below).
Conditional statement. The statement if X then S1 else S2 end blocks
until the variable X is determined. It reduces to S1 if X equals true, and S2
if X equals false. Other values trigger an error (in the form of an exception,
see below).
Procedure creation. The statement proc {P X1 . . . Xn } S end creates
a fresh name ξ, adds the procedure λX1 . . . Xn .S under the name ξ in the
procedure store, and reduces to the statement P =ξ. Since the name ξ is fresh,
the procedure store defines a mapping of names to procedures. Note also that
the name makes the procedure a first-class entity: it can be passed as an argument
and stored in data structures.
The procedure may refer to anything in the lexical scope of its definition.
Those external references are determined by the variables corresponding to the
free identifiers in S that do not occur in the procedure’s parameters X1 , . . . , Xn .
The proc statement defines a lexical scope for the identifiers X1 , . . . , Xn . Note
that the declaration of the identifier P is done outside the procedure definition.
This implies that the order of creation of procedures that use each other is
irrelevant, provided they are all created before their use.
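As an illustration (a sketch of ours, not from the original text), the procedure Inc below has the external reference N, captured from its lexical scope, and the order of the two procedure definitions does not matter:

local N Add Inc R in
   N=1
   proc {Inc X Y} {Add X N Y} end   % uses Add and N as external references
   proc {Add X Y Z} Z=X+Y end       % defined after Inc, but before Inc is applied
   {Inc 41 R}                       % binds R to 42
end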
Procedure application. The statement {P Y1 . . . Yn } blocks until the variable P is determined. If P is the name of an n-ary procedure λX1 . . . Xn .S in
the procedure store, it reduces to the statement S where each Xi corresponds
to the variable Yi . If P is not a procedure, or a procedure with a different arity,
the statement triggers an error.
Thread creation. The statement thread S end creates a new thread that
consists of the statement S. The new thread is independent from the current
thread. The threads in a program execute concurrently. The following primitive
store operations are guaranteed to be atomic: creating a fresh name or variable,
binding a variable, and storing a procedure. This makes the concurrency in Oz
fine-grained.
A thread is runnable if its first statement is executable. When the first
statement of the thread blocks on a variable, we say that the thread is blocked
or suspended. It remains blocked until the variable is determined by the store.
It then becomes runnable again.
By-need synchronization. We say that a variable X is needed when either
it is determined, or a thread waits for its determination. This property is
monotonic: when a variable is needed, it remains needed. The statement
{WaitNeeded X } blocks until the variable X becomes needed. This primitive
is used to attach a lazy computation to the variable X: a thread blocks until
X becomes needed, then computes the value of X.
thread
   {WaitNeeded X}   % block until X is needed
   ...              % compute a value and assign it to X
end
2.2.3
A few convenient operations
Here are a few extra statements that complete the declarative kernel language
with common primitive operations.
Tests. The statement X=(Y==Z) blocks until there is enough information in the constraint store to logically entail Y=Z or Y≠Z. The statement then reduces to X=true or X=false, respectively. The operator \= is similar, but returns the opposite result. The statement
X=Y op Z
with op ∈ {<, =<, >, >=}
blocks until both Y and Z are determined to atoms or integers. It then reduces
to X=true or X=false, depending on the result of the comparison. Atoms
are ordered lexically, and atoms and integers are not comparable.
Arithmetic operations. The statements
X=~Y
X=Y op Z
with op ∈ {+, -, *, div, mod}
block until their arguments (Y and Z) are determined to integers, and reduce
to X=i, where i is the result of the corresponding operation.
Record operations. The statements
{Label X Y }
{Width X Y }
{Arity X Y }
block until X is determined to be a record l(k1:v1 ... kn:vn). They then reduce to Y=l, Y=n, and Y=k1|...|kn|nil, respectively.
The statement X=Y.Z blocks until Z is determined to a simple value and Y to a record l(k1:v1 ... kn:vn). If Z is equal to ki for a certain i, the statement reduces to X=vi. Otherwise an exception is raised.
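For example (a small sketch using the record from Section 2.2.1):

P=person(name:raph age:32)
{Label P L}    % binds L to the atom person
{Width P W}    % binds W to 2
{Arity P A}    % binds A to the list of keys, here [age name]
X=P.name       % binds X to raph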
2.3
Nondeclarative extensions
Figure 2.3 on the next page shows nondeclarative extensions to the language.
They include exceptions, state, and read-only views.
2.3.1
Exceptions
Try statement. The statement try S1 catch X then S2 end starts by
executing S1 . If S1 terminates normally, the statement reduces to skip. If an
exception x “escapes” from S1 , i.e., x is raised inside S1 but not caught, the
try statement reduces to S2 (the exception handler), with X corresponding to
variable x. Note that the try statement can only catch exceptions raised in its
own thread.
S ::= try S1 catch X then S2 end           try statement
   |  raise X end                          raise statement
   |  {FailedValue X Y}                    failed value creation
   |  X=!!Y                                read-only view creation
   |  {NewCell X Y}                        cell creation
   |  X=Y:=Z                               cell exchange

Figure 2.3: Nondeclarative kernel concepts of Oz
Raise statement. The statement raise X end throws an exception with
the variable X. The exception will be caught by the closest exception handler in
the thread, if it exists.
Failed values. A failed value is a special value that encapsulates an exception. The statement {FailedValue X Y } creates a failed value v encapsulating
the variable X, and reduces to Y =v. Any statement that attempts to use the
value v automatically raises an exception with X.
Failed values are used to pass exceptions between threads. This is particularly useful when an exception occurs in a thread that is responsible for computing
the value of a variable Z. If the thread binds Z to a failed value encapsulating
the exception, all the other threads that use Z will know why the value of Z
could not be determined.
thread
   try
      local V in
         ...              % compute a value V
         Z=V              % return the value in Z
      end
   catch E then
      {FailedValue E Z}   % return a failed value in Z
   end
end
2.3.2
Read-only views
“Bang bang” operator. The statement X=!!Y creates a read-only view u
of variable Y , and reduces to X=u. A read-only view of a variable is logically
equal to that variable, but it cannot be bound by the program; every statement
that attempts to bind it blocks. When the variable is determined to a value,
the read-only view is also bound to that value.
Read-only views are used to protect abstractions from accidental bindings
that may break them. They are often used to build robust streams. A stream
is a list that is built incrementally. During its construction, a read-only view of its tail variable is put in the list tuple. This prevents the consumer of the stream from binding the tail and provoking a failure in the producer of the stream.
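As a small sketch of ours (the names are hypothetical), a producer can build an unbounded stream of integers while handing out only read-only views of its tail variables; a consumer that tries to bind such a tail simply blocks instead of breaking the producer.

proc {Produce N Tail}
   local NewTail in
      Tail=N|!!NewTail       % consumers only see a read-only view of the new tail
      {Produce N+1 NewTail}  % only the producer keeps the bindable tail
   end
end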
2.3.3
State
The primitive concept of state in Oz is the cell. A cell is a mutable container
for a variable. Cells are contained in a new part of the store called the cell
store. A cell is a first-class value, identified by a name, and its content is given
by a (possibly determined) variable. Similarly to the procedure store, the cell
store defines a mapping of names to variables.
A cell can be used by a single thread or shared among several threads.
Concurrent accesses to a cell are relatively easy to synchronize, because read
and write operations are combined in a single, atomic exchange operation.
Concurrency control abstractions can be built with cells, like ports and locks
(see Section 2.4).
Cell creation. The statement {NewCell X Y } creates a fresh name ξ, adds
the pair ξ:X in the cell store, and reduces to the statement Y =ξ.
State exchange. The statement X=Y:=Z blocks until Y is determined. If
Y is a cell ξ:w, the cell store is updated with ξ:Z, and the statement reduces
to X=w.
For instance, if C is a cell that contains an integer, the following procedure
adds N to the contents of the cell. Assume two threads concurrently update C with AddCounter. If the initial state is x, the first thread takes x and puts a variable y in the cell, then the second thread takes y and puts a variable z. The second thread computes the new state z from y, and automatically waits for the first thread to determine y. No race condition occurs if all state updates
are done this way.
proc {AddCounter N}
   local X Y in
      X=C:=Y   % get contents X, and put new contents Y
      Y=X+N    % determine the value of Y
   end
end
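A possible use, matching the scenario just described (a sketch; C is assumed to be the cell captured by AddCounter):

C={NewCell 0}
thread {AddCounter 1} end   % thread A
thread {AddCounter 2} end   % thread B
% once both threads have terminated, the cell contains 3: no update is lost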
2.4
Syntactic convenience
This section introduces syntactic convenience that corresponds to the full language, or at least the part of the language that is relevant for this text. All
the rules in this section suggest how to rewrite statements in the full language
to statements in the kernel language.
Comments. All the characters that follow the percent sign (%) until the end
of the line are comments.
2.4.1
Declarative programming
Declarations. Multiple variables can be declared simultaneously. For instance,
local X Y in S end
⇒ local X in local Y in S end end
If a declaration statement comprises the body of a procedure definition or the
branch of a conditional, local and end can be omitted. For example:
proc {P} Y in S end
⇒ proc {P} local Y in S end end
Declaration can be combined with initialization through unification:
local X=5 in S end
⇒ local X in X=5 S end
Expressions. We first define the statement Z={P X1 . . . Xn } as shorthand
for {P X1 . . . Xn Z }. Similarly, nesting of record construction and procedure
application avoids declaration of auxiliary variables. For example:
X=b({F N+1})
⇒ local Y Z in
Y=N+1 X=b(Z) {F Y Z}
end
Record construction is given precedence over procedure application to allow
more procedure definitions to be tail recursive. The construction is extended
analogously to other statements, allowing statements as expressions. For example:
X=local Y=2 in
{P Y}
end
⇒ local Y=2 in X={P Y} end
Procedure definitions as expressions are tagged with a dollar sign ($) to distinguish them from definitions in statement position:
X=proc {$ Y} Y=1 end
⇒ proc {X Y} Y=1 end
Another common expression is the anonymous variable:
_
⇒ local X in X end
Functions. Motivated by the functional notation of procedure calls, we define
a function of n arguments as being equivalent to a procedure of n+1 arguments,
the extra argument being bound to the result of the function body, which is an
expression:
fun {Inc X} X+1 end
⇒ proc {Inc X Y} Y=X+1 end
Lazy functions. Function definitions with the decoration lazy create a
thread that uses WaitNeeded to wait for the result to be needed. Chapter
6 proposes a translation that avoids creating threads in the presence of tail
recursive calls.
fun lazy {Inc X}
X+1
end
⇒ proc {Inc X Y}
thread
{WaitNeeded Y} Y=X+1
end
end
Lists. Complete lists can be written by enclosing the elements in square
brackets. For example, [1 2] abbreviates 1|2|nil, which abbreviates '|'(1 '|'(2 nil)).
Infix tuples. The label '#' for tuples '#'(X Y Z) can be written infix: X#Y#Z.
Pattern matching. Programming with records and lists is greatly simplified
by pattern matching. For instance, a pattern matching conditional
case X of person(name:N age:A) then S1 else S2 end
is an abbreviation for
if {Label X}#{Arity X} == person#[age name] then
N A in X=person(name:N age:A) S1
else S2 end
The else part is optional and defaults to else skip. Multiple clauses are
handled sequentially, for example:
case X
of f(Y) then S1
[] g(Z) then S2
end
⇒ case X of f(Y) then S1 else
case X of g(Z) then S2 end
end
The try statements are also subject to pattern matching. For example:
try S1
catch f(X) then S2
end
⇒ try S1 catch Y then
case Y of f(X) then S2
else raise Y end
end
end
Loops. Recursive functions are expressive enough to implement all kinds of
loops in the language. Oz supports a simple yet powerful for loop on lists:
for X in L do S end
⇒ {ForAll L proc {$ X} S end}
where ForAll is defined as
proc {ForAll L P}
case L of X|T then {P X} {ForAll T P} else skip end
end
Waiting for determinacy. A common abstraction is the procedure Wait,
which blocks until its argument is determined. It is often used to explicitly
synchronize threads. It can be defined as follows, using the blocking behavior
of ==.
proc {Wait X} _=(X==1) end
2.4.2
Message passing
Ports. A port is associated with a stream (defined in Section 2.3.2 above), which
lists all the messages sent on the port. Ports are defined by two operations:
NewPort, which creates a port and its stream, and Send, which sends a message
on a given port. They can be defined in terms of cells as
fun {NewPort S}
   T in S=!!T {NewCell T}
end
proc {Send P X}
   T in X|!!T=P:=T
end
Note, however, that ports are not truly defined like that. When it comes to distribution, they do not behave like cells, but have a behavior of their own. This
is because ports are intrinsically asynchronous, whereas cells are synchronous.
Ports are very convenient for handling nondeterminism, since they are asynchronous (they never block), and all the messages sent to a port are serialized
into a list (the stream).
Port objects. A port object consists of a port and a thread that sequentially reads its message stream. Because the message processor is sequentially
reading a list, it can be written as a simple recursive function. The latter can
use an accumulator to maintain a state.
Snippet 2.1 on the following page shows a simple abstraction that builds
port objects. The argument function Func takes the object’s current state and
a message, and returns the new state of the object. The function FoldL is used
to apply Func on every message. An example is shown below, with an object Counter. This object recognizes three kinds of messages: inc, inc(N), and
get(N). See how the latter binds N to the current value of the counter.
fun {NewPortObject Init Func}
   S in
   thread {FoldL S Func Init} end
   {NewPort S}
end

fun {FoldL L F I}
   case L of X|T then {FoldL T F {F I X}} else nil end
end

local
   fun {F Count Msg}
      case Msg
      of inc    then Count+1
      [] inc(N) then Count+N
      [] get(N) then N=Count Count
      end
   end
in
   Counter={NewPortObject 0 F}
end

{Send Counter inc(3)}   % increment counter by 3

Snippet 2.1: An abstraction to create port objects, and an example of a counter object
Active objects. Active objects are similar to port objects, except that they
use a stateful object instead of a function to process the messages. This technique is pretty easy to work with, because the underlying object is used sequentially.
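A possible implementation (our own sketch; the name NewActive is hypothetical) combines an ordinary object with a port and a serving thread:

fun {NewActive Class Init}
   Obj={New Class Init}
   S
   P={NewPort S}
in
   thread
      for M in S do {Obj M} end   % the object processes one message at a time
   end
   proc {$ M} {Send P M} end      % invoking the active object is an asynchronous send
end

For instance, {NewActive Stack2 init} (with the class Stack2 of Snippet 2.2 below) would return a procedure whose applications are asynchronous method invocations.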
2.4.3
Stateful entities
State operations. The full language supports two simplified versions of the
state exchange operation, to simply read and write the state.
X=@C   ⇒  X=C:=X
C:=X   ⇒  _=C:=X
Objects and classes. In Oz an object is defined as a unary procedure that takes a method as its argument. The method is represented as a record, and is therefore first-class. An object usually maintains its own state. A class is a value that creates objects. A precise kernel equivalent of classes, and their inheritance mechanism, is given in [VH04].
Snippet 2.2 below illustrates the class syntax with an example containing two classes and one object. A base class Stack defines an attribute elements, which is identified by an atom. It also defines four methods: init, isEmpty, push, and pop. The state operators are extended to attributes.
class Stack
   attr elements                 % list of elements, from top to bottom
   meth init
      elements:=nil              % initializer
   end
   meth isEmpty(B)
      B=(@elements==nil)
   end
   meth push(X)
      T in T=elements:=X|T       % put X in front of elements
   end
   meth pop(X)
      T in X|T=elements:=T       % extract front element
   end
end

class Stack2 from Stack          % Stack2 extends Stack
   meth top(X)
      {self pop(X)} {self push(X)}
   end
end

S={New Stack2 init}              % create an object of class Stack2
{S push(42)}                     % call method push(42) of S

Snippet 2.2: Two classes and an object
The class Stack2 extends the class Stack. It defines a method top, which is implemented with the methods push and pop of the object, accessible via the keyword self. Then the function New is used to instantiate the class and initialize the object.
There is no concurrency control by default in objects. Concurrent method
invocations will be executed concurrently, and state updates are subject to
the same kind of atomicity as cells. Objects often use locks to create critical
sections inside methods.
Locks. A lock is a binary semaphore, which controls the access to the lock
itself. At most one thread can be in a given lock. The only operation takes
the lock, executes a statement, and releases the lock. If another thread already owns the lock, the operation blocks until the lock is available. The lock
statement is translated as follows.
lock L then S end
⇒ {L proc {$} S end}
The lock L is created by the following function, which implements a basic lock
with a cell and a procedure.
fun {NewLock}
   C={NewCell unit}
in
   proc {$ P}
      X Y in X=C:=Y {Wait X} {P} Y=X
   end
end
When the lock is applied, it places a new variable into its cell, and waits
until the former variable in the cell is determined. Once it is determined, the
lock is available. The lock then executes the statement, which is abstracted by a nullary procedure. It then releases the lock by binding the new variable to
the value. Several threads applying the same lock will form a chain, and pass
the value unit between each other via a shared variable. The cell’s function is
to connect a thread to its successor in the waiting queue.
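For instance (a sketch of ours), such a lock can protect a read-modify-write sequence written with the @ and := shorthands, using the lock statement translation given above:

L={NewLock}
C={NewCell 0}
thread lock L then C:=@C+1 end end   % the read and the write form one critical section
thread lock L then C:=@C+2 end end   % the two updates can no longer interleave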
The language Oz actually provides reentrant locks. Those locks permit a
thread to take the same lock several times. This is useful when two procedures
or methods call each other, and protect a shared state with the same lock. For
a definition of reentrant locks in Kernel Oz, see [VH04]. Note that distinct
locks are not connected to each other in any way; deadlocks are possible, and
the language provides no deadlock detection mechanism.
2.5
Distribution
In Oz, a distributed program is usually defined as a centralized program where
entities and threads would be partitioned among sites. One can also define a
distributed program as a set of centralized programs running on their own sites,
and sharing language entities. Both definitions are in fact valid and equivalent,
because the language is network-transparent.
Most of the distribution is hidden from the programmer, as shared entities keep
their semantics almost intact. What happens is that the programming system
uses dedicated protocols to implement the entities’ semantics. In Mozart the
distribution of entities is designed to give the programmer full control over
network communication patterns that occur because of language operations.
Not all entities use the same protocol; every entity uses a protocol that is adapted
to its nature: mutable, immutable, or transient. This subject is discussed in
detail in Chapter 3.
2.5.1
Application deployment
The deployment of an application covers two situations that are handled differently. The first one is how the distribution between sites that already share entities evolves. The second one is how to create new sites, or connect independent sites.
Sites that know each other. Sites that already share entities evolve according to which entities they share. They can share new entities by transitivity: a site obtains a reference to an entity via another entity that it already refers to.
For instance, if site a sends a value x on a port, and site b reads the message
stream of that port, site b automatically has access to x.
Note that sites also connect to each other by transitivity: if sites a and b
are connected together, and b and c as well, then a and c will automatically
connect to each other, if the entities they share require it. This depends on
which protocols are used by the shared entities.
Sites can also reduce their set of shared references. This is handled by distributed memory management. The system detects when a site no longer refers
to a given entity, and can globally remove an entity from the distributed program. This always works, except for distributed reference cycles, i.e., reference
cycles that involve several sites.
Connecting sites. The definitions we just gave of a distributed program
suggest two ways of deploying an application over new sites: either by splitting
a site into several sites, or by connecting distinct sites. Mozart uses the second
approach, because it is easier to implement and to control in the program. This
is provided by the module Connection.
The function call {Connection.offer X} returns a ticket, i.e., an atom
that represents a reference to X. Conversely, the call {Connection.take T}
returns the entity corresponding to the ticket T. Those functions allow a site
to offer an entity reference to other sites via textual communication means.
Indeed, as an atom is nothing more than a string of characters, it can be
transmitted by e-mail, via a web site, or even dictated over the phone.
This mechanism is often used as a bootstrapping mechanism for distributing
an application. The first entities that sites share are used to transmit other
entities, by transitivity.
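For example (a sketch; Ticket stands for the ticket string received out of band), a port can be offered by one site and taken by another:

% on site a: offer a port, and obtain a ticket (an atom)
P={NewPort S}
T={Connection.offer P}
% T can now be published by e-mail, on a web page, etc.

% on site b: take the ticket and send a message
P2={Connection.take Ticket}
{Send P2 hello}   % the atom hello appears on the stream S of site a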
Managing resources. Not everything can be distributed. Assume a site b
executes a procedure sent by site a, and that procedure has to save temporary
data in a file. The site b may grant access to its file system, but it needs a way
to provide this access to the running procedure. In order to solve this issue,
Mozart has a module system based on functors. A functor is the specification
of a module, with a list of modules to import, exported references, and code.
If b receives a functor, it can instantiate it with a module manager that will provide (or possibly deny) the necessary local resources to the new module. Consult Mozart's documentation [Moz99] for more details about functors.
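A minimal functor for the scenario above might look as follows (a sketch of ours; the exported procedure name is hypothetical). A module manager on site b then decides whether to provide the imported Open module when instantiating it.

functor
import
   Open                 % file access: a resource granted by the local site
export
   saveTemp:SaveTemp
define
   proc {SaveTemp FileName Text}
      F={New Open.file init(name:FileName flags:[write create])}
   in
      {F write(vs:Text)}
      {F close}
   end
end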
3
Application structure and distribution behavior
With network transparency it is possible to take a program and distribute
it, and it will run correctly. But it might be slow, for instance because its
distribution involves much communication between sites. Here the programmer
can take advantage of the network-aware aspects of the language to control the
communication involved by the program’s distribution. These aspects do not
break transparency in the sense that the program is always a correct centralized
program. So transparency gives two advantages. First, a centralized program is
a correct distributed program. Second, tuning a centralized program for best
distributed performance can be done by modifying the centralized program,
e.g., with annotations that have no effect on centralized meaning. The program
always retains a correct centralized semantics.
When tuning distribution, the fundamental distribution behavior is determined by the structure of the program. The latter defines the paradigms
that are used: functional, dataflow, message passing, sequential or concurrent
object-oriented, etc. The type of shared entities will determine communication
patterns between sites. At a finer level, a given entity may be distributed in
several ways, for instance stationary or replicated state, each having a specific
distributed behavior. Choosing the most appropriate program structure is the
way to make network transparency work.
3.1
Layered structure
We assume that a distributed application is structured in terms of components.
We define a component as a program fragment with well-defined inputs and
outputs. A component is itself defined in terms of simpler components. Examples of components are: a procedure, an object, several objects linked together.
In this context, components can themselves be distributed. A distributed component can be decomposed into several components running on different sites,
and communicating through shared language entities.
A component can be defined with a mixture of paradigms. Some of its
subcomponents can be purely functional, while others use message passing and
dataflow variables, for instance. The choice of paradigm is an advantage for
reasoning about the program, since simpler components will require simpler
reasoning techniques. For example, a declarative component handling lists will
not be subject to race conditions, provided no other component concurrently
binds its outputs.
The layered structure of the language encourages the programmer to pick
the simplest programming paradigm to solve his or her problem. Concurrency
with shared state is by far the most complex paradigm to program with. Most
components do not need this expressive power. By limiting oneself to a part of
the language, the programmer can take advantage of methodological support
from the paradigm or the abstractions chosen. The network transparency ensures that this support is independent of whether the component is distributed.
The general advice is to keep the most general paradigm only for the components that need it, and to limit the extent of this paradigm inside the component itself. The usage of shared state concurrency is easier to manage when
it is well confined in the program.
3.1.1
Using the declarative model
The simplest declarative components only provide stateless values, like pure
functions without stateful dependencies. Dynamicity in declarative components
is provided by shared logic variables. An example is a pipeline of components,
where inputs and outputs are dataflow streams. Another example is several
sites synchronizing on a given event. The sites only need to share one logic
variable, and block until it is determined. The site notifying the event binds
the variable to a conventional value, which automatically wakes up threads
blocking on that variable.
Declarative components communicate by sharing values through logic variables. Communication is therefore purely dataflow and monotonic. From a
distribution point of view, the programmer should pay attention to how data
is shared among sites. Sharing stateless data is cheap in general, since it can be
copied. But sharing too much data may imply much communication between
sites.
Using the full power of the declarative model, one can also share lazy computations between components. What is shared is actually a logic variable.
The by-need synchronization mechanism works through distribution.
Nondeterminism. Distributed declarative components are subject to nondeterminism in the sense that concurrent threads may bind variables in any
order. However, this nondeterminism is not observable from within the declarative model. If a declarative component terminates without failing, its outputs
(defined by variable bindings) can be expressed as a mathematical function of
its inputs. Their values do not depend on the execution order of the various
threads in the component.
Note that if a failure occurs, for instance because of incompatible concurrent
bindings, a part of the component will be in a failed state. From a strictly declarative point of view, the whole program has failed. But in the wider model, the
failure does not automatically propagate to the whole program. If the rest of
the program continues to run, then we have observable nondeterminism. But
we are no longer in the declarative model.
3.1.2
Using message passing
When observable nondeterminism is required in a component, message passing
is a good way to go. Ports provide an easy and efficient way to handle the
nondeterminism in the component. The Send operation is asynchronous, and
therefore it only requires a message to be sent by the system. The port’s stream
itself is monotonic, and behaves like a declarative component if it is distributed.
An effective use of this model is to let only one thread read the stream
of messages, and treat them sequentially. This is the idea underlying active
objects. An active object is a component that runs on only one site, and
communicates with other active objects by sending messages.
Replying with variables. The model naturally offers two ways to reply to
a given message. The simplest solution is to use the declarative model, and put
a logic variable in the message. This logic variable will be bound to the reply.
Snippet 3.1 on the following page shows two abstractions that implement this
technique. The function MakeServer takes a function, and returns a port with
a server. The server thread applies the function to each message, and binds
the reply variable to the result of the call. The procedure SendRecv sends
a message X to a port P, together with a reply variable Y. The code below
creates a stateless server which adds 42 to each message it receives. Then the
server is called with 54, with Result as the reply variable. As you can see, the
functional notation allows SendRecv to be called as a function.
Server={MakeServer fun {$ X} X+42 end}
Result={SendRecv Server 54}
Replying with continuations. A slightly more general technique is to put
a continuation in the message, i.e., a procedure that is called by the receiver
to reply the message. Note that the procedure is copied to the receiving site,
so that it can be applied there. The continuation allows to program more
sophisticated patterns of communication, like the promise pipelining provided
in the language E [Mil06].
fun {MakeServer F}
   S in
   thread
      for X#Y in S do Y={F X} end
   end
   {NewPort S}
end

proc {SendRecv P X Y}
   {Send P X#Y}
end

Snippet 3.1: Abstractions to create and call a server with a reply variable
Snippet 3.2 on the next page defines a few abstractions that can be used to
program in a “promise pipelining” style. The function MakeServerC creates
a port with a server. The server thread applies a function F to each received
message, and calls the continuation with the result. Let us create three servers
P, Q, and R with that function. The server P replies either Q or R, depending
on the message it receives. Servers Q and R expect an integer as message,
and return the message after an arithmetic operation. Note that those servers
should be created on different sites.
fun {F X} (if X==foo then Q else R end) end
P={MakeServerC F}
Q={MakeServerC fun {$ X} X+42 end}
R={MakeServerC fun {$ X} X div 2 end}
Now let us call P, thanks to the procedure SendToPort, with a continuation
C1 that is not determined yet. We will determine its value right after.
C1={SendToPort P foo}
%% is equivalent to: {Send P foo#C1}
The server will determine a result for the message, in this case Q, and eventually
call {C1 Q}. During that time, the sender determines what C1 does: it should
send the message 54 to its argument. The fact that the continuation C1 is
called by server P makes that the message to Q is sent directly from P. There
is no need to come back to the sender. The procedure SendToCont determines
its first argument to do exactly that:
C2={SendToCont C1 54}
%% is equivalent to: proc {C1 Res} {Send Res 54#C2} end
The variable C2 will thus be sent to Q as a continuation, so server Q will
eventually call {C2 96}. We now define this continuation with the procedure
GetResultC: C2 binds its result to the variable X.
X={GetResultC C2}
%% is equivalent to: proc {C2 Res} X=Res end
fun {MakeServerC F}
   S in
   thread
      for X#Cont in S do {Cont {F X}} end
   end
   {NewPort S}
end

proc {SendToPort P Msg Cont}      % send Msg to port P
   {Send P Msg#Cont}
end

proc {SendToCont C Msg Cont}      % send Msg to promise C
   proc {C Res} {SendToPort Res Msg Cont} end
end

proc {GetResultC C X}             % get result from continuation C
   proc {C Res} X=Res end
end

Snippet 3.2: Promise pipelining with continuations
These continuations have defined the following pipeline: the client sends message foo with continuation C1 to server P; then P sends message 54 with continuation C2 to server Q; then Q sends its result 96 back to the client via the
variable X. Note that this machinery relies on the fact that procedures are
copied from site to site.
3.1.3
Using shared state concurrency
This is the most complex model from a programming point of view. And it
is also the most demanding for the distribution. Shared state implies that
read/write operations can be performed from multiple sites. Moreover, those
operations are synchronous, and create many dynamic dependencies between
sites. The difficulty lies in managing the state so that it remains consistent:
all the updates of a stateful entity must be serializable, as in a centralized
multithreaded program.
Moreover, the shared state concurrency model is very sensitive to errors.
This is because the threads follow an interleaving semantics. Consider the
following example, where two threads each perform a read and a write operation
on the cell C.
C={NewCell 0}
thread C:=@C+1 end   % thread A
thread C:=@C+2 end   % thread B
The result of the execution does not depend on where the threads and the cell
are localized in the distributed system. When both threads terminate, the cell
may contain either 1, 2, or 3. Some executions lead to surprising behaviors. For
instance, the cell contents may decrease in this execution: thread A reads 0,
then thread B reads 0, then thread B writes 2, then thread A writes 1.
Managing the nondeterminism in the programming language can be a problem. Using locks helps to create critical sections in the code, but locks quickly lead to deadlock avoidance issues. In both the centralized and distributed cases, the programmer should use transactions to handle atomicity issues in his or her program. Transactions can be implemented quite efficiently in a centralized setting [ST95, VH04].
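For instance, the two conflicting updates above can be serialized with a lock, giving a critical section around each read-modify-write. The following sketch is ours (it reuses the cell C from the example above and Oz's built-in locks); with it, the cell always contains 3 when both threads terminate:
L={NewLock}
thread lock L then C:=@C+1 end end   % thread A, update now atomic
thread lock L then C:=@C+2 end end   % thread B, update now atomic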
Implementing a distributed transaction system with good network behavior is not an easy task. Two systems have been proposed so far by our research team. The “GlobalStore” is a fault-tolerant transactional replicated object store designed and implemented by Iliès Alouini and Mostafa Al-Metwally [AM03]. It takes advantage of replication to reduce latency when computing a transaction. Another transactional system, designed by the author and Valentin Mesaros, was proposed to run on structured overlay networks [MCGV05]. The latter uses transaction priorities to avoid deadlocks, and the two-phase commit algorithm to ensure consistency between sites. Note that these systems only handle permanent failures.
3.2 Classification of language entities
The design of Oz is such that different entities may use different distribution protocols, and thus have different network behaviors. In fact, the distributed behavior of the whole program is determined by which types of entities are used, and how they are shared among sites. This section explores the different kinds of entities in Oz, and how they can be distributed.
Entities can be partitioned into three main categories: mutable, immutable, and monotonic. Each category has specific requirements that influence its possible distributed behavior. All the entities in a given category share the same set of distribution protocols. So the category of an entity determines which protocols it may use, and thus what network behavior it may have. We present each category, with the protocols available for it, and examples of Oz entities in that category.
3.2.1 Mutable entities
This is the category of stateful entities in general. Those entities have an
internal state, and the distribution must maintain a globally consistent view of
the state. Here we sketch three possible distribution strategies for them.
• The simplest way to distribute a mutable entity is to make its state
stationary. All operations are sent to the site holding the state, performed
there, and a value can be returned.
• In the “mobile state” strategy, the state moves to the sites where operations are attempted. When the state is on a given site, it can be accessed locally. The state behaves like a cache, since several operations can be performed locally before the state leaves the site.
• One can also replicate the state across sites. An update of the entity first invalidates all copies, then sends the new state. Reading the state is a purely local operation, and it can be done immediately if the copy is valid. This scheme is efficient when reading the state is more common than updating it.
Read and write operations are synchronous in general. Oz cells, objects, locks, dictionaries, and threads belong to this category. Ports can be considered a special subcategory: the operation Send is an asynchronous update. The simplest strategy in this case is to leave the state stationary; to make an update, it is enough to send a message to the site holding the state.
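As a minimal sketch (ours) of this asynchronous update: sending to a port with stationary state ships a single message to the site holding the stream, with no round-trip.
P={NewPort S}
thread for M in S do {Show M} end end   % consumes messages on the port's home site
{Send P hello}                          % asynchronous update: one message, no reply needed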
3.2.2 Monotonic entities
Those are also stateful, but their state is updated in a monotonic way. From a distribution point of view, they are more flexible than mutable entities. Their state can be replicated without the need to synchronize all the sites to perform an update. Single-assignment variables and streams are examples of monotonic entities.
Monotonic entities support the concurrent constraint operations ask and
tell [Sar93]. The ask operation is just like a read. The tell operation updates
the state of the entity. To ensure the consistency of the tell, all updates are
serialized on one site. This site forwards the operation to all the other sites.
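A minimal sketch (ours) of the ask and tell operations on a shared single-assignment variable X:
thread {Show X+42} end   % ask: blocks by dataflow until X is determined
X=12                     % tell: binds X and wakes up the reader, on every site sharing X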
Single-assignment variables are transients, a subcategory of monotonic entities. Transients have a final state in which they become another entity. The entity exists until it is bound. In Oz, logic variables have three states: free and not needed, free and needed, and determined. Note that streams are built from transient entities (read-only views), and therefore inherit the distributed behavior of transients.
3.2.3 Immutable entities
These are constants: one can only read them. They are usually copied eagerly or lazily between sites. When the entity has an identity, the protocol can guarantee that the copy is made only once. This is useful when the value is large. In some cases the value cannot be copied, for instance because of implementation or security limitations. Read operations are then performed as in the mutable case.
Immutable entities range from simple values (atoms, numbers), to compound values (records), and even closures (procedures, classes). A compound
value is copied with its fields, which may also be compound values or closures.
The copy of a compound value should have the same structure as its origin,
including possible cyclic references and coreferences. Cycles and coreferences
are detected when the value is serialized, so that they are not an issue for the
programmer.
Closures are extremely powerful, because they contain both code and external references. The latter are handled just like records, and the code is copied between sites. The promise pipelining example in Snippet 3.2 on page 31 makes use of them, for instance. Note that cycles may also occur with closures, since closures can contain record references, which can contain references to closures. Once again, the cycles are gracefully handled by the system, such that each entity in the reference graph of a given entity is copied at most once.
3.3 Annotations
The application is structured into components, and those components are themselves decomposed, layer by layer, down to primitive components, i.e., language entities. In its first implementation of distribution, Mozart provided its distributed entities with a fixed behavior for each kind of entity. It was considered that those choices were expressive enough for the programmer to code the distributed behavior of his or her choice. Objects were distributed with mobile state, for instance; stationary objects had to be reimplemented in Oz on top of ports.
We now let the programmer choose the distribution strategy for each language entity in his or her program. This choice is stated by annotating the entity. An application can be structured from top to bottom, with the annotation system shaping the distributed behavior at the lowest level of the structure. Annotations are part of the network awareness of the language, since they give some explicit control over an entity’s distributed behavior.
Annotations may cover many facets of the distribution system. The first
and most evident one is how the state of an entity is distributed, and the impact
on primitive operations on that entity. Another facet is the distributed memory
management of that entity. The system could also provide some robustness for
its entities, and annotations may help to parameterize how robust an entity
must be.
How to annotate. Conceptually, an annotation is a bit like a declarative statement: it states something about an entity. It can even be thought of as a constraint that the user posts about the distributed behavior of an entity. It is not fully declarative, since logic variables have a specific status: annotating a variable is not equivalent to annotating its value.
In our proposal, which is explained in detail in the next section, annotating
an entity is done by the statement
{Annotate entity parameter}
A nice property of annotations is that they can be ignored in the absence of failures: they do not change the semantics of the program in that case. Moreover, if the program is not distributed at all, they are not taken into account by the system.
3.3.1 Annotations and semantics
Annotations describe programmer choices for the distribution of an entity. Network transparency implies that the distribution must be an implementation, or a refinement, of the entity’s semantics. The centralized semantics of an entity often admits several distributed semantics. Annotations give the programmer the possibility to specify which distributed semantics should be used for a given entity. The semantics of the language is given in chapters 6 and 7; the latter also gives the semantics of annotations themselves.
The semantics of a language entity is thus partly reflected in the application. The annotation system lets the application make semantic choices for language entities at runtime. We could say that annotations allow the programmer to change the semantics of the language. But one can only choose between semantic variants, which are well defined, and which do not break the centralized semantics of an entity in the absence of failures.
Annotations are thus a limited form of reflection in the programming language. Their boundaries are defined by the programming system and by the centralized semantics. The programmer may tweak an entity’s semantics within safe boundaries.
3.3.2 Annotation system
In practice an annotation specifies parameters of the distribution subsystem.
The actual annotation system goes slightly beyond the conceptual level, because it allows implementation compromises that may break the semantics of
an entity. The typical example is the time-lease based garbage collector, which
may remove an entity from memory even if some sites in the application still
refer to it.
The procedure Annotate is called to specify distribution parameters for an entity. We have chosen annotations to be monotonic: you cannot change your mind once you have chosen an option. Moreover, once an entity is actually distributed, i.e., when it has been shared by at least two sites, its distribution parameters can no longer be changed. As a consequence, an entity can only be annotated before it becomes distributed. It is therefore useful to annotate entities within the abstractions that create them.
Distribution parameters are specified as atoms or records. For instance, stationary state is specified with the atom stationary, and the use of a mobile access reference is specified by the record access(migratory). Several parameters are combined in a list, as in the example
We will see in the next section that this statement is equivalent to the following three statements.
{Annotate E stationary}
{Annotate E access(migratory)}
{Annotate E lease}
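As suggested above, a natural place to annotate an entity is inside the abstraction that creates it. The following sketch is ours (it only uses the Annotate procedure and the parameters introduced in this section) and creates a cell with stationary state:
fun {NewStationaryCell Init}
   C={NewCell Init}
in
   {Annotate C [stationary access(stationary) lease]}
   C
end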
3.3.3 Partial and default annotations
Our annotation system is not only monotonic, it is even incremental. Several annotations can be put on a given entity, at different times. The result is that the entity is annotated by the conjunction of them, provided that this conjunction is consistent. For instance, mobile and stationary state are mutually inconsistent, but stationary state and time-lease garbage collection are consistent, because the two parameters are orthogonal to each other. In our implementation, Mozart considers three orthogonal distribution parameters, namely the access architecture, the state protocol, and the distributed garbage collection. These are described in more detail in the next sections.
The system also permits partial annotations. This means that some distribution parameters may still be left unspecified when an entity becomes distributed. In that case, the system completes the annotation with default values. For instance, if the programmer annotates a cell with lease (time-lease based garbage collection), the system may complete the annotation to
[migratory access(stationary) lease]
right before distributing the cell.
Each type of entity has a default annotation, giving a value for each distribution parameter. The default annotation must be complete, of course. The
system implementation may or may not allow the programmer to modify default annotations. In our prototype, default annotations can be modified by
the program at any time.
3.3.4 Access architecture
The access architecture of an entity defines how all the sites sharing the entity coordinate with each other. This architecture is the basis of the other protocols (state and reference consistency, see below). It could be anything, as long as one can implement the entity operations and garbage collection on top of it.
In Mozart, all sites sharing the entity own a proxy, and all those proxies
refer to a unique coordinator, which is hosted by one of the sites. It is used by
the other protocols as a reference point. When an entity reference is sent from
one site to another, a network address of its coordinator is given. This allows
the receiver to connect its proxy to the other proxies of the same entity in the
whole system. The architecture is similar to a client-server architecture.
Knowing the type of access architecture already gives some information
about the network behavior of the entity. For instance, an entity with stationary state will have its state located on the same site as the entity’s coordinator. Another example is garbage collection, where the coordinator determines
whether the entity is referred to by remote sites. If not, the entity is no longer
distributed.
Mozart/DSS defines one parameter for the access architecture, which states
whether the coordinator is stationary or mobile.
• access(stationary): the coordinator is located where the entity was
created, and remains on that site. This is the simplest strategy to manage
the access architecture.
• access(migratory): the coordinator can be moved from one site to another by a specific operation. There are several possibilities for how proxies can find where it is; details can be found in [Kli05].
A single point of failure. The coordinator of an entity is obviously a single point of failure. When it crashes, the entity’s proxies are usually no longer capable of finding each other. Note however that there is one single point of failure per entity. This design decision is motivated by the fact that entity protocols should not solve all the problems. Entities are generally not robust to failures. Instead, failures are detected, and can be handled at the language level. Fault-tolerant protocols can be implemented in Oz, and hide failures behind abstractions.
3.3.5 State consistency protocols
These protocols implement the operations of the entity itself. The choice of protocol depends on the entity’s nature, i.e., mutable, immutable, or monotonic. This parameter is the most important one when considering the network behavior of entity operations. Here are the protocol annotations considered in Mozart/DSS; a small usage sketch follows the list.
1. Mutable entities.
• stationary: the state of the entity is located on the site of the entity’s coordinator. All operations on the entity’s state are performed
on this site. Synchronous operations therefore need a full round-trip
from the requesting site to the coordinator to complete.
• migratory and pilgrim: the state of the entity migrates from one
site to another, and a site executes operations locally when the state
is on that site. The state behaves like a cache: once the state is on a
site, that site may perform several operations without extra network
overhead.
• replicated: the state of the entity is copied on all the sites that use
the entity, and the copies are synchronized by a two-phase commit
protocol. This protocol is useful for data structures that are rarely
updated. Read operations can be performed locally, while write
operations require an atomic update of all sites using the entity.
2. Monotonic entities.
• variable: this corresponds to the protocol described in [HVB+ 99].
The binding of the variable is performed on the site of the entity’s
coordinator.
• reply: this is a variant of the variable protocol, where the binding is done on the first site that receives the reference. This protocol has the best network behavior when the receiving site attempts to bind the variable. The variable is typically used as a reply to a query.
3. Immutable entities.
• immediate: the value is sent together with the reference of the
entity. The unicity of the entity is guaranteed, even if its value is
sent multiple times. All values with structural equality (numbers,
atoms, records) use this protocol.
• eager and lazy: those protocols guarantee that the value is sent
at most once. When the entity is sent to a site, only its reference
is actually sent. The receiving site requests the entity’s value if it
does not have it yet. In the eager case, the value is requested upon
receipt, while in the lazy case, it is requested once the value is
actually needed.
• stationary: the value is not copied to other sites. Remote operations require a full round-trip to the coordinator. For instance, one can provide access to a chunk without allowing copies on possibly untrusted sites.
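As the usage sketch announced above (ours; the entities are hypothetical, the annotation names are the ones listed in this section):
Config={NewCell unit}           % read-mostly data: replicate the state
{Annotate Config replicated}
Counter={NewCell 0}             % frequently updated data: keep the state stationary
{Annotate Counter stationary}
{Annotate Answer reply}         % logic variable used as a reply to a query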
3.3.6 Reference consistency protocols
These protocols ensure reference integrity by implementing a distributed garbage collector. Here the choice is not exclusive: one can combine several protocols in a single annotation. An entity is kept in memory if all the protocols require it. Mozart proposes three algorithms:
• persistent keeps the entity alive forever on its coordinator: the entity is simply never removed by the garbage collector. This can be useful for providing a service on a site, which should run until the site terminates.
• refcount uses a weighted reference counting scheme. Each reference to
the entity is assigned a weight, and the entity’s coordinator keeps track
of the sum of the weights of all remote references. When this number
reaches zero, the entity is no longer distributed. The advantage of using
weighted references is that new remote references can be created without
notifying the coordinator. It suffices to keep the total weight constant.
• lease uses a time-lease based mechanism. Sites holding a remote reference to an entity regularly notify the coordinator of the entity of their presence. The time between successive notifications is called the lease period. If the coordinator has not been notified for a long time (typically much longer than the lease period), the entity is no longer considered distributed.
The algorithm refcount guarantees consistency, i.e., a coordinator remains
alive while its proxies are, even in case of long network failures, but is not
robust to site failures. The algorithm lease does not guarantee consistency
in case of network delays, but handles site failures gracefully. It is up to the
programmer to choose what fits best for his or her application.
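For instance, a service port that should stay reachable until its site terminates could be annotated with persistent, while a shared dictionary could combine both reference counting and leases. This is a sketch of ours, reusing the annotation syntax above:
ServicePort={NewPort ServiceStream}
{Annotate ServicePort [stationary access(stationary) persistent]}
SharedDict={NewDictionary}
{Annotate SharedDict [stationary refcount lease]}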
3.4 Related work
How do other systems provide tuning of a distributed program? Are they easy
to tune at all? Can the tuning process be programmed in the language, or is it
external to it? In the latter case, are the tuning techniques heavily modifying
the program?
3.4.1 Erlang
In the Erlang philosophy, everything is a process, and the only communication
primitive between processes is message passing with values. Processes are independent of each other, and cannot share memory. Every process is sequential
and programmed in functional style. This simple model fits pretty well with
distribution, and allows efficient implementations.
Process identifiers can be sent between sites, and sending messages to a
remote process is transparent. Processes do not migrate between sites, and garbage collection is up to the programmer. The language favors a lightweight client-server style. For instance, the Open Telecom Platform (OTP) provides generic server abstractions, which support transactional semantics (crashed servers are restarted with a valid former state) and code swapping (the server code can be changed on the fly). In fact, OTP provides many more abstractions to build large-scale, fault-tolerant, distributed applications [Arm07].
With the simplicity of Erlang’s programming model, any communication pattern can be programmed with processes and messages. Libraries like OTP already provide powerful abstractions for distributing applications. Of course, the programmer should always use this kind of abstraction to build his or her applications. Changing the network behavior can then be done in a modular way. Using the distribution facilities directly makes the program harder to adapt.
3.4.2 Java RMI
There is no distribution mechanism defined by the language Java itself. But
the Java Remote Method Invocation (RMI) library has quickly become the
most popular distribution mechanism in the Java community. The library
provides two ways to distribute an entity: full serialization and remote method
invocation. In the first case, a copy of the object is made once a reference
to that object is sent to a site. In the second case, only a reference to the
original object is created on the receiving site. When that reference is invoked,
the method invocation is sent to the object’s site and the calling thread waits
for its termination. These objects are said to be remote, while fully serialized
objects are non-remote.
This mechanism imposes synchronous interaction between sites, which can
be slow in a distributed setting. But worse, reference integrity is only guaranteed per method invocation, and not in general [Sun97]. The consequence
is important: distributed objects have a different semantics than centralized
objects.
Besides these semantic issues, Java RMI is not transparent to the programmer. Remote objects must implement the interface java.rmi.Remote, while non-remote objects implement java.io.Serializable. Turning a local object into a distributed one requires modifying its class. The library provides a few abstractions to write servers, though.
3.4.3 E
The language E is an object-oriented programming language designed for secure
distributed computing. It was created by Mark S. Miller, Dan Bornstein, and
others at Electric Communities in 1997. It combines capabilities and a message
passing concurrency model with Java-like syntax. Its concurrency model is
based on event loops and promises, in order to prevent deadlocks. More on
security aspects of E can be found in Mark Miller’s thesis [Mil06].
Objects behave like concurrent sequential agents with synchronous or asynchronous method invocation. Each object is stored in a vat, which is the unit of localization. Synchronous invocation can only happen between objects in the same vat, and corresponds to a sequential method call.
def result := bob.foo(carol)    /* synchronous call */
println(`done: $result`)
A method is always executed atomically, and should never block or run forever.
This strong limitation to concurrency was chosen to avoid common programming errors due to shared state concurrency.
Objects can also invoke each other asynchronously, with the eventual-send operator <- (see below). If the method returns a result, the operation immediately returns a promise for the result.
def result := bob <- foo(carol)    /* eventual send */
when (result) -> {                 /* promise resolution */
    println(`done: $result`)
} catch problem {
    println(`oops: $problem`)
}
One can send to the promise immediately; a promise pipelining mechanism ensures that the resulting object is eventually sent the method (as in Section 3.1.2 on page 29). One can also synchronize on the result with a when statement. Once the promise is resolved, the code is eventually run (atomically). The catch part allows one to handle a promise failure.
The distribution model of E is strongly determined by the security aspect of the language. By default, vats do not trust each other. Therefore objects are never copied or moved between vats. Vats are strongly isolated from each other, and inter-vat communications are encrypted. The encryption also guarantees that object references cannot leak into intermediate vats in the promise pipelining process. The distribution model is thus limited to client-server communication with promises, but the capability system is guaranteed safe.
4 Asynchronous failure handling
We go one step further in the distribution support by reflecting partial failures
in the programming language. We propose a language-level fault model that is
compatible with network transparency. Because a site or network failure may
affect the proper functioning of a distributed entity, our model defines how
entities can fail, and how those failures are reflected at the entity level in the
language. Here are the principles of the model, each being described in the
corresponding subsection below.
• Each site assigns a local fault state to each entity, which reflects the site’s
knowledge about the entity.
• There is no synchronous failure handler. A thread attempting to use a
failed entity blocks until the failure possibly goes away. In particular, no
exception is raised because of the failure.
• Each site provides a fault stream for each entity, which reifies the history of fault states of that entity. Asynchronous failure handlers are
programmed with this stream.
• Some fault states can be enforced by the user. In particular, a program
may explicitly provoke the global failure of an entity.
This fault model is an evolution of the first fault model of Oz, and integrates
parts of another proposal made by Donatien Grolaux et al. [GGV04]. The
next chapter will demonstrate that it improves the ease of programming and
modularity of failure handlers. A comparison with the other fault models of
Oz is given in Section 4.5 on page 56.
4.1 Fault model
We first provide a precise description of the kinds of faults we consider in the system in which the program runs. This system is composed of sites that communicate through a network. We model a certain number of failures in that system, how they affect the system, and how they can be detected. Failures affect language entities, so we also consider failures at the entity level. The model we use here is inspired by Rachid Guerraoui et al. [GR06], and is quite standard in the field of distributed systems.
Note that we will sometimes use the term process. A process is simply
something that has some autonomous behavior, some internal state, and which
interacts with other processes by exchanging messages through the network.
Both sites and language entities can be considered as processes.
4.1.1 Failures
Site failures. A site may fail by crashing, i.e., at a given time t, it stops doing anything, in particular sending and receiving network messages. There is no recovery mechanism by default; the failure is permanent. This kind of failure is called crash-stop or fail-stop. A site that has not crashed yet is said to be correct.
We assume that sites are subject to neither omission faults (where the process may “miss” some messages), nor Byzantine faults (where the site may perform any arbitrary action). Those are much harder to handle and detect. The lack of a generic detection mechanism makes any kind of language support for them virtually impossible.
Network failures. The network is considered reliable in the sense that a
message sent by a site a to a site b will eventually be delivered, unless a or b
crashes. Messages are never corrupted nor delivered more than once. However,
the communication between two sites may take arbitrary time. The communication link may appear to have failed if no message is delivered to the destination site during a long but finite period of time. In other words, network
failures can be defined as communication delays that are longer than expected.
We do not consider the case where network failures would be permanent. Such
a failure would mean that a site can no longer communicate with any other
site. By convention, network failures are always considered to be temporary.
Entity failures. Language entities are subject to the same kinds of failures as sites. A failed entity stops being functional. No language operation can have an effect on it. Its state is lost, and there is no recovery mechanism. The failure is permanent and global: a failed entity is unusable for all sites.
We also consider a fail-stop failure that is valid for a given site only: the entity is crashed for that site, but may be functional for other sites.
In other words, that site can no longer use the entity. We say that the entity is
locally failed on the given site. This failure is triggered by the operation Break,
which is described in Section 4.3.2. It is typically used to prevent the site from
using the entity, and does not affect the other sites.
4.1.2 Failure detectors
In order to handle failures at the program level, we need failure detectors. A failure detector is a component that tries to determine whether a given process has crashed. It notifies the program when it suspects the process to have crashed. Not all failure detectors are identical; there exist several types of them. We can classify them according to three properties: their completeness, accuracy, and monotonicity.
• A failure detector is complete if a crashed process is always eventually suspected.
• A failure detector is accurate if a suspected process has actually crashed, and inaccurate if it may suspect a correct process. It is eventually accurate if no correct process is suspected forever.
• A failure detector is monotonic if a suspected process is never later notified as correct. In other words, the failure detector never “changes its mind.”
Completeness is a liveness property: it ensures the eventual detection of a crash. Accuracy, on the other hand, is a safety property: it prevents erroneous suspicions. Those two properties are essential for reasoning about the correctness of an algorithm that handles failures. Monotonicity also helps for reasoning, because monotonic detectors have a simpler behavior than nonmonotonic ones.
A failure detector is perfect if it is complete and accurate. It is eventually perfect if it is complete and eventually accurate. Perfect failure detectors are not so common, because they require strong properties of the underlying system. For instance, perfect detectors are possible on a local area network (LAN), but not on the Internet in general. For the Internet, one has to use eventually perfect detectors.
Three simple failure detectors. Our model proposes a combination of
three failure detectors, namely tempFail, permFail, and localFail. Each detector
has its own properties in terms of completeness, accuracy, and monotonicity.
We will show in the next sections how we use them to handle language entity
failures.
• The tempFail detector is eventually perfect. It uses two notifications: tempFail and ok. The first one occurs when the target process is suspected; the second one occurs when the detector has found evidence of the correctness of the process. This detector is nonmonotonic.
detector     complete   accurate     monotonic
tempFail     yes        eventually   no
localFail    no         no           yes
permFail     no         yes          yes

Table 4.1: Summary of the properties of the three failure detectors (for global failures)
• The permFail detector is accurate but incomplete. It is not guaranteed
to detect a crash, but it never erroneously reports a crash. It uses the
notification permFail. By definition it is monotonic.
• The localFail detector is perfect for local failures, and uses the notification localFail. However, it is neither complete nor accurate for global failures in general. Its completeness and accuracy depend entirely on the program. It is monotonic, though: suspicion remains forever.
A summary of the properties of the three failure detectors is shown in Table 4.1.
Note that these properties are given with respect to global failures. This is why
localFail is neither complete nor accurate.
4.1.3 Entity fault states
For each entity e, every site has a failure detector that combines the three failure
detectors tempFail, permFail, and localFail. That failure detector maintains a
local fault state, which is like a view of the actual fault state of the entity. The
failure detector sends a notification at every state transition. The notification
mechanism is described in Section 4.2.3.
The failure detector has four states, called local fault states, or fault views:
ok, tempFail, localFail, and permFail. Valid state transitions are depicted
in Figure 4.1 on the facing page. The semantics of the states are the following.
• ok is the initial state, and can also be triggered by the tempFail detector.
It means that the entity is not suspected by any of the basic failure
detectors.
• tempFail is triggered by the tempFail detector. It means that the site
is temporarily unable to complete any operation on the entity. This
typically happens when this site cannot communicate with other sites
that are necessary for performing language operations on the entity.
• localFail is triggered by the localFail detector. It means that the entity
is permanently unavailable for this site. Note that it is local, i.e., other
sites may still have access to the entity. This state can be enforced by
the program.
[figure: local fault state diagram with the states ok, tempFail, localFail, and permFail]
Figure 4.1: Local fault state diagram of an entity
• permFail is triggered by the permFail detector. It means that the entity
has crashed. No site can ever perform an operation on it. This state is
final.
The main advantage of this model is that it provides a simple yet precise description of an entity’s state, from the viewpoint of one site. It abstracts away the type of hardware and system used, the protocols, and even the kind of entity it applies to. It describes the failure from the programming language’s point of view. Yet its simplicity still allows one to reason about partial failures in a program.
4.1.4 Concrete interpretation of fault states
Knowing the kind of an entity and its distribution strategy, one can easily
give a more precise interpretation of a fault state. Here we give the various
concrete reasons for an entity to fail. The only concepts we rely on are the ones
given in Chapter 3. All failures can be expressed in terms of sites, protocols,
coordinators, and memory management.
Note that the interpretation we provide for fault states is of course related
to how the distribution of an entity is implemented. A sophisticated implementation of distribution would have led to a complex fault model. In our work
we favor a simple fault model, therefore keeping the implementation simple.
The programmer should be able to reason easily about the properties of the
distribution. Complex fault-tolerant abstractions should be built at the higher
user level, not at the low level.
Mutable entities. Two sites are usually involved when reasoning about mutable entity failures: the coordinator site and the site holding the state. The
coordinator is necessary in all protocols to manage the entity’s state. It is
the site holder if the state is stationary; it manages to bring the state to the
requester in case of a mobile state; and it ensures mutual exclusion when the
state is replicated. The state holder is also crucial: its failure always implies
that the state is lost.
A mutable entity is in state tempFail if the coordinator or state holder
is unreachable. The state localFail is triggered by the program. The state
permFail is reached when the coordinator or the site holding the state has
crashed, or the coordinator has removed the entity from its memory. The
coordinator crash can be provoked by the program (see Section 4.3).
The second reason for the state permFail has already been mentioned in Chapter 3: time-lease based garbage collection is not correct in case of network failures. The coordinator considers that the entity is no longer used when no other site has shown interest for a certain duration. A problem arises when a site cannot reach the coordinator because of a network problem. In that case, the entity will fail on that site. The good property is that the failure is diagnosed properly, and reflected in the language. If the network recovers and the site can reach the coordinator again, the removal of the entity will be notified, resulting in the failure of the entity.
Monotonic entities. Transients are pretty similar to mutable entities when
it comes to failures. In the protocol variable, the coordinator is also a state
holder. The state holder might be different in the reply protocol. If only one
site refers to the variable besides the coordinator, then that site is the state
holder. The same reasoning as with mutable entities applies here.
A property of logic variables is that they conceptually disappear once they
are bound. In fact, bound variables have reached their final state, and become
invisible to the program. For the sake of consistency, bound variables do not
fail, and failed variables remain unbound.
Immutable entities. Immutable entities are simply values. Their possible fault states depend on whether they are copied between sites (protocols immediate, eager, lazy) or not (protocol stationary). Note that entities using the immediate protocol never fail, since one cannot have a reference to the entity without having its state. As all entities with structural equality (numbers, atoms, records) use this protocol, they are not subject to failure.
Values cannot fail permanently if they are copied between sites. If the site from which the copy is made is unreachable or has crashed, a temporary failure will be notified. The local fault state can even be localFail. But the state permFail should never be observed, because any other site may provide a copy of the value. The fault state permFail requires that no copy of the value is available anywhere, even in a file. This property is difficult to verify in practice.
Values distributed with a stationary state are different. This protocol can
be used when copying the whole value is too costly or insecure. Remote sites
can still access the value, typically with the dot operation “.”. In that case, the causes of failure are the same as for mutable entities with stationary state.
4.2 Failure handlers
We now discuss the possible ways to handle entity failures in the language. We
make a clear distinction between two basic ways of handling failures, namely
synchronous and asynchronous handlers. As we shall see, asynchronous failure
handling is preferable to synchronous failure handling.
4.2.1 Definition
A synchronous failure handler is executed in place of a statement that attempts
to perform an operation on a failed entity. In other words, the failure handling
of an entity is synchronized with the use of that entity in the program. Raising
an exception is one possibility: the failure handler simply raises an exception.
In contrast, an asynchronous failure handler is triggered by a change in the
fault state of the entity. The handler is executed in its own thread. One could
call it a “failure listener”. It is up to the programmer to synchronize with the
rest of the program, if that is required.
The following rules give a small-step semantics for both kinds of handlers. The symbol σ represents the store, i.e., the memory of the program. The store is partitioned among the sites, and the elements of the store that are specific to a site a are subscripted by a. Each site a reflects its view of the fault state of an entity in the store through a system-defined function fstate_a(x), which gives the local fault state of x. Each execution rule shows on its left side a statement and the store before execution, and on its right side the result of one execution step.
Rule (4.1) describes the semantics of a synchronous failure handler. It states
that a statement S attempting an operation on entity x can be replaced by a
handler H if the fault state of entity x is not ok, i.e., if x has failed.
S_a, σ  →  H_a, σ    if statement S uses entity x and σ |= fstate_a(x) ≠ ok    (4.1)
Rule (4.2) gives the semantics of an asynchronous failure handler. A new thread
is spawned with handler H whenever the fault state of x changes. Note that
there may be more than one handler on x; we assume all handlers are run when
the fault state changes.
σ ∧ fstate_a(x)=fs  →  H_a, σ ∧ fstate_a(x)=fs′    if fs → fs′ is valid    (4.2)

4.2.2 No synchronous handlers for Oz
In Oz, when the fault state of a given entity is not ok, operations on that entity may not succeed. Raising an exception in that case might look reasonable, but our experience suggests that it is not. Because of the highly concurrent nature of the language, raising exceptions quickly creates race conditions between threads, and the functional code becomes cluttered with failure handling code. Other kinds of handlers have been tried, but without success.
We have chosen to use asynchronous failure handlers only. We propose the
following model.
“Failure causes blocking”: an operation on a failed entity simply
blocks until the entity’s fault state becomes ok again.
The operation naturally resumes if the failure proves to be temporary. It suspends forever if the failure is permanent (localFail or permFail). With this
model, nothing extra can happen in a program that does not handle distribution failures.
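As a minimal sketch of the resulting style (RemoteCell stands for some distributed cell shared with another site; the fault stream used by the monitoring thread is introduced in the next subsection):
thread X=@RemoteCell end   % blocks while RemoteCell is in a failed state
thread                     % asynchronous handler, runs in its own thread
   for FS in {GetFaultStream RemoteCell} do
      if FS==permFail then {Show permFail} end
   end
end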
4.2.3 Entity fault stream
In our proposal, asynchronous failure handlers are programmed as threads that monitor entities, and take action when an entity changes its local fault state. On every site, each entity is associated with a fault stream, which reflects the history of the site’s view of the entity’s fault state. The system maintains the current fault stream, which is a list fs|s, where fs is the current view of the fault state, and s is an unbound variable. It is defined semantically as a system-defined function fstream_a(x) that returns the current fault stream of the entity x on site a. The semantic rule
σ ∧ fstream_a(x)=fs|s  →  σ ∧ fstream_a(x)=s ∧ s=fs′|s′    if fs → fs′ is valid    (4.3)
reflects how the system updates the fault state to fs′. The dataflow synchronization mechanism wakes up every thread blocked on s, which is bound to fs′|s′. An asynchronous handler can thus observe the new fault state simply by reading the elements of the fault stream.
To get access to the fault stream of an entity x, a thread simply calls the function GetFaultStream with x, which returns the fault stream of x on the current site. A formal definition is given below. To read the current fault state, one simply takes the first element of the returned list.
(y={GetFaultStream x})_a, σ  →  (y=fs|s)_a, σ    if σ |= fstream_a(x)=fs|s    (4.4)
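For instance, a small convenience function (the name is ours, not part of the system) that returns an entity's current fault state:
fun {CurrentFaultState E}
   {GetFaultStream E}.1   % the head of the fault stream is the current state
end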
Figure 4.2 on the next page shows an example of how an entity’s fault stream
may evolve over time. The stream is a partially known list, and the underscore
“_” denotes an anonymous logic variable. In the last step, the stream is closed
with nil. This may happen in two situations, which are explained below.
Snippet 4.1 on the facing page shows a thread monitoring an entity E, and
printing a message for each fault state appearing on the stream. The printed
message is chosen by pattern matching. The thread is woken up each time the
stream is extended with a new state.
FS={GetFaultStream E}
FS=ok|_
FS=ok|tempFail|_
FS=ok|tempFail|ok|_
FS=ok|tempFail|ok|localFail|_
FS=ok|tempFail|ok|localFail|nil
Figure 4.2: An example of a fault stream evolving over time
thread
   for S in {GetFaultStream E} do
      T = case S   % pattern matching on S
          of ok        then "entity is fine"
          [] tempFail  then "some problem, don't know"
          [] localFail then "no longer usable locally"
          [] permFail  then "no longer usable globally"
          end
   in
      {Show T}
   end
end
Snippet 4.1: A thread that prints messages when entity E’s fault state changes
Special case: variables. Monitoring variables requires a bit more care than
other entities. This is because variables conceptually disappear once they are
bound: they become what they are bound to. The question is: what happens to
the fault stream of a variable once the latter is bound? There are two distinct
cases to consider, as the variable is bound to either a value, or another variable.
Consider a variable x that is bound to another variable y. From a programmer’s point of view, the binding is transparent: x remains a variable. For a
thread monitoring x, it seems quite natural to smoothly switch to monitoring
y. We propose to make this transition automatic by merging the fault streams
of x and y. Basically the tail of the fault stream of x is bound to the tail of
fault stream of y, prepended by y’s current fault state if it is different from x’s
current fault state. This binding makes sure that the monitor does not miss a
fault state.
The other case we have to consider is the binding of the variable to a value.
We think that merging the fault streams is not a good idea here, because the
entities are of different nature, and this may lead to confusion. However, we
need a clear mechanism to notify the monitoring thread that the variable has
been bound. We propose to close the fault stream by binding its tail to nil,
because the variable has conceptually disappeared. Once this happens, calling
GetFaultStream on the variable will return the fault stream of its value.
Failure history. The fault stream of an entity e on a site a reifies the history
of fault state observations of e by a. Moreover it transforms the nonmonotonic
changes of a fault state into monotonic changes in a stream. It provides an
almost declarative interface to the fault state maintained by the system. This
interface looks much simpler and more elegant than registered handlers, which
is what Mozart used before [VHB99]. In particular, the fault stream guarantees
that the failure handler cannot miss a state transition.
Note that the fault stream may also be closed, i.e., its tail bound to nil,
whenever it is no longer maintained by the system. This is performed by the
system when the entity is no longer in memory. See Section 4.4.
4.2.4 Discussion
Synchronous failure handlers are natural in single-threaded programs because
they follow the structure of the program. Exceptions are handy in this case
because the failures can be handled at the right level of abstraction. But the
failure modes can become very complex in a highly concurrent application.
Such applications are common in Oz and they are becoming more common in
other languages as well. Because of the various kinds of entities and distribution
protocols, there are many more interaction patterns than the usual client-server
scheme. Handlers for a given entity may run in many threads at the same time,
and those threads must be coordinated to recover from the failure.
All this conspires to make fault tolerance complicated to program if it is based on synchronous failure handling. This mechanism was in fact never used by Oz programmers developing robust distributed applications [GGV04]. Instead, programmers relied on the asynchronous handler mechanism to implement fault-tolerant abstractions. One such abstraction is the “GlobalStore”, a fault-tolerant transactional replicated object store designed and implemented by Iliès Alouini and Mostafa Al-Metwally [AM03].
4.3 Making entities fail
Failures in distributed systems are often partial. This will be the case with
entity failures in a distributed application, especially if the programmer defines
components that are spread among many sites. In many cases, if a subset of the entities of a component has failed, the component itself might no longer function. The components that use the failed component must be able to detect
the failure, and trigger a recovery mechanism. The question is: which entity
should they monitor?
A component should not monitor all entities of another component explicitly. This would prevent any encapsulation in the monitored component. But
monitoring the entities it has access to might not be enough, if none of the
monitored entities fails. One possibility is to design a component-level protocol that makes sites consider the component as failed. Another possibility is to
proc {SyncFail Es}
   Trigger in
   for E in Es do
      thread
         if {List.member permFail {GetFaultStream E}} then
            Trigger=unit
         end
      end
   end
   thread
      {Wait Trigger}
      for E in Es do {Kill E} end
   end
end
Snippet 4.2: Synchronize the failure of a set of entities
make the monitored entities fail on purpose. We propose to provide support
for the second alternative in our failure model, i.e., the program can make an
entity fail.
4.3.1 Global failure
We provide a new operation to make an entity fail. The statement {Kill e}
attempts to make the entity e permanently failed, i.e., in fault state permFail.
The operation is asynchronous, which means that it returns immediately, and
is idempotent. It initiates a protocol that tries to make the entity globally
failed. Once it is done, the local fault state of e becomes permFail. Because of
the definition of the state permFail, the operation may require some synchronization with other sites that refer to e. The operation must ensure that no
other site can perform operations on the entity. Therefore the operation Kill
is not guaranteed to succeed.
The example in Snippet 4.2 shows a simple yet quite powerful abstraction. It basically tries to ensure that all entities in a list eventually fail when one of them fails.
4.3.2 Local failure
Sometimes it is not possible to make an entity fail globally, for instance because a site that is involved in the operation Kill has silently crashed. We therefore provide the operation Break. The statement {Break e} has a purely local effect: it makes the entity e fail locally, and forces its fault state to be at least localFail.
The first motivation for Break is that it is irreversible. Once an entity is
permanently failed, even locally, it cannot go back to the fault state ok. This is
useful when a site triggers a recovery mechanism, based on the state tempFail
proc {FailAfter E TimeOut}
   proc {Loop L}
      case L of H|T then
         if H==tempFail andthen {WaitTwo {Alarm TimeOut} T}==1
         then {Break E}
         else {Loop T}
         end
      else skip end
   end
in
   thread {Loop {GetFaultStream E}} end
end
Snippet 4.3: A failure handler that provokes local failure after
a certain duration of temporary failure
of an entity e. Enforcing the failure of e simplifies the task of recovering. The
threads blocked because of the failure of e will never wake up, for instance.
This is useful if a service is backed up, and at most one instance of the service
can run at any given time.
The second motivation is resource management. By making an entity permanently failed, the programmer gives a hook to the memory management system. Threads that block because of the failure will block forever, unless
they can be woken up explicitly by other threads. The system can therefore
use the permanence of the failure to detect parts of the program (threads and
data) that will no longer affect its behavior. Those parts can be safely removed
from memory. Some issues about memory management are described in detail
in the next section.
Snippet 4.3 shows a small failure handler that can be used together with
other failure handlers. Basically it uses a timeout to make an entity locally
failed if it remains temporarily failed for a certain time. The timeout duration
is specified in the parameter TimeOut. Other failure handlers waiting for state
localFail are thus automatically triggered after the given inactivity duration.
4.4 Failures and memory management

4.4.1 Blocked threads and fault streams
Entity failures have an effect on the memory management of a program. First,
a failed entity can make a thread block. If the failure is temporary, that
thread must be kept in memory for its possible resumption. As that thread
normally refers to the entity, it keeps the entity alive. However, if the failure
is permanent, the thread will block forever, unless it is referred to by another
living entity. As we already mentioned in Section 4.3.2, a thread blocking
forever can be safely removed from memory.
Something similar happens with fault streams. An entity keeps its own fault
stream alive in memory. This guarantees that the threads monitoring the entity
do not silently disappear. But the fault stream itself does not keep the entity
alive, so the entity can be removed from memory anyway. Once the entity is
removed from memory, the fault stream will no longer be kept alive, and the
monitoring threads may block forever. In order to clearly reflect that the fault
stream has been “disconnected” from the entity, we make the system close the
stream, i.e., its tail is bound to nil. This action is perfectly consistent, since
once the entity is gone, the fault stream will no longer be updated.
Finalization. The closing of the fault stream provides a simple and effective post-mortem finalization mechanism. The following abstraction executes
a procedure P once the entity E is no longer in memory. The closing of the
stream simply lets the loop exit.
proc {Finalize E P}
   thread
      for X in {GetFaultStream E} do skip end
      {P}
   end
end
This is particularly useful to reclaim memory from failed components in a program. A thread monitoring an entity can already drop its references to the entity when the entity fails, and once the entity is removed from memory, the monitor can perform some extra actions.
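For instance, a hypothetical usage, which prints a message once E has been reclaimed:
{Finalize E proc {$} {Show 'E has been removed from memory'} end}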
4.4.2 Entity resurrection
It is possible for a site to remove an entity e from its memory, even when that
entity is still used by other sites. Indeed, if the site owns a proxy for e that is
not necessary for the distribution of the entity, it can safely remove the entity’s
proxy from its memory (see Sections 3.3.4 and 3.3.6). When this happens, the
site simply no longer refers to e, which remains alive on a global scale.
Now assume that the entity e was removed from the memory of site a, and
that a reference to e is sent again to that site. A new proxy for e is created on
a, and that proxy creates a new fault stream for e on a. This reintroduction of
e on site a brings a few issues. First, there is no connection between the new
fault stream of e and its former fault stream, which was bound to nil by the
finalization mechanism described above. The instances of the fault stream in
memory correspond to different sessions of the entity on the site.
Second, it is possible that the fault state of e was localFail before e’s
removal, and ok after its reintroduction. This state transition is normally
forbidden by the fault model. To avoid this situation, site a should have kept
some information about entity e in memory. But keeping that information in
memory is unreasonable in general, because site a’s memory would grow beyond
any limit. Our proposal is not to keep that information, but to handle the issue with a specific solution at the application level. An application may use a centralized or distributed repository of valid entities. Any occurrence of an invalid entity can then be discarded. The management of the repository and the choice of which entities to check are thus specific to the application.
4.5 Related work

4.5.1 Java RMI
In Java Remote Method Invocation (RMI), every distributed operation is a
method call. The standard way in that language to report a problem inside
a method call is the exception mechanism. The fault model thus favors synchronous failure handlers, which are implemented as exception handlers.
4.5.2 Erlang
The power and simplicity of failure handling in Erlang was an inspiration for our
work. Erlang provides asynchronous detection of permanent failures between
processes [Arm07]. Two processes can be linked together. When one of them
(say a) terminates normally or because of a failure, the other one (say b) is
notified by the runtime system. By default, process b will die if a died because
of a failure. However, if b is a system process, it will receive a message of the
form {’EXIT’,Pid,Why}, where Pid is the identifier of process a, and Why is a
value that describes the reason why a died. A special built-in turns a process
into a system process.
Erlang chose to model all failures as permanent failures, in accordance with its “let it crash” philosophy. That is, keeping the fault model simple allows the recovery algorithm to be simple as well. This simplicity is very important for correctness. We can see our model as an extension of Erlang’s model with temporary failures and with a fault stream. Furthermore, our model is designed for a richer language than Erlang, which only has stationary ports (in our terminology). Chapter 5 will show how to program something similar to process linking in Oz.
4.5.3 The first fault model of Mozart
Our argument against the use of exceptions to handle distribution failures
comes from the original fault model used in Oz, which was introduced with the
first release of Mozart in 1998. The original model overlaps with the model
we propose in this chapter. It provided much more fault information (most of
which was not used in practice) and supported both synchronous and
asynchronous handlers. The major difference was the ability to define
synchronous failure handlers, i.e., handlers that are called when attempting an
operation on a failed entity. The programmer could either ask for an exception
or provide a handler procedure that replaces the operation. The failure handler
was defined for a given entity and with certain conditions of activation.
Instead of the synchronous handlers, programmers favored a kind of asynchronous handler, called a watcher. A watcher is a user procedure that is
called in a new thread when a failure condition is fulfilled. The fault stream
we propose in this chapter simply factors out how the system informs the user
program. It also avoids race conditions related to the watcher registry system,
which could make one miss a fault state transition. And finally, a watcher
could not be triggered by a transition to state ok. The latter soon turned out
to be problematic for handling temporary failures.
An alternative model. The original model is criticized in [GGV04], which
proposes an alternative model. That paper proposes something similar to our
fault stream and an operation to make an entity fail locally. In order to handle
faults, it proposes to explicitly break the transparent distribution of a failed
entity. The local representative of the failed entity is disconnected from its
peers and is put in a fault state equivalent to localFail. Another operation
replaces that entity by a fresh new entity. This model has the advantage
of avoiding threads blocked on failed entities, because a failed entity can be
replaced by a healthy one. But this replacement introduces inconsistencies in the
application’s shared memory. We were not able to give a satisfactory semantics
that takes into account these inconsistencies.
5 Applications
We present several abstractions that show how to program with the model we
proposed in the former chapters. We first show how to hide network delays in
a lazy producer/consumer situation, with a bounded buffer. We propose two
implementations for the buffer: a fully declarative version, and a version that
automatically adapts the buffer size.
We also show how to implement Erlang-like processes in Oz. Process linking
and monitoring is very easy to implement. We then provide an abstraction that
deals with temporary failures, and guarantees a consensus about failures in a
set of processes monitoring each other. The consensus is reached by a vote
among the correct processes.
5.1 Distributed lazy producer/consumer
Assume we have a component producing a stream lazily, and sharing that
stream with other components, possibly on other sites. From a language point
of view, those components simply share a logic variable. Consumers make the
variable needed, which awakens the producer. The latter binds the variable to
a pair X|T, where T is computed lazily as well.
This scheme is a nice example of a declarative communication channel between components. Moreover, its performance is not bad: making the variable
needed typically costs one message from the consumer to the producer, and
binding the variable costs one message in the other direction. So the variable
imposes a communication delay of one round-trip per element. Note that this
delay is independent of the number of consumers.
fun {BoundedBuffer N Xs}
fun lazy {Deliver Xs Xr}
case Xs of X|Xt then X|{Deliver Xt thread Xr.2 end} end
end
in
{Deliver Xs thread {Drop N Xs} end}
end
Snippet 5.1: A first implementation of a bounded buffer
5.1.1 A bounded buffer
In order to avoid the communication delay, one may insert a buffer between the
producer and the consumer. The buffer triggers the evaluation of n elements
ahead of the consumer. If the number n is well chosen, and the consumer does
not run faster than the producer, then the consumer will not wait for reading
one element from the stream. The value n is chosen such that the average time
for producing one element, together with the communication delay, does not
exceed the average time for consuming n elements.
Snippet 5.1 shows an implementation of a bounded buffer which is equivalent to the one proposed in [VH04]. The function BoundedBuffer takes as
input the size of the buffer n, and the lazy stream Xs, and returns a lazy stream
Ys:
Ys={BoundedBuffer N Xs}
The value of Ys is equal to Xs, except that if m elements of Ys are computed,
m + n elements of Xs are computed. The function call {Drop N Xs} returns
the list Xs without its first N elements.
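For instance, with a hypothetical lazy producer of integers (the function Ints is ours, not part of the abstraction), the buffer is set up as follows:

   fun lazy {Ints N} N|{Ints N+1} end
   Xs={Ints 0}               % lazy stream, created on the producer's site
   Ys={BoundedBuffer 5 Xs}   % the consumer reads Ys; 5 elements are computed ahead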
Behavior analysis. First, let us notice that calling BoundedBuffer on the
producer’s site will not fit our needs. Indeed, in that case, the lazy computation associated to the output variable Ys is on the producer side. When the
consumer makes that variable needed, a full round-trip to the producer’s site
is necessary to trigger the lazy computation and send the value back.
Suppose now that BoundedBuffer is called on the consumer’s site. When
the consumer reads an element, it triggers a local lazy computation which returns immediately, if the element is available. At the same time, the lazy
computation triggers the need for an element n positions ahead in the stream.
However, when the consumer reads a further element, the element n positions
ahead is requested only once the preceding element has been delivered on the consumer's site.
To illustrate that behavior, assume we have a producer/consumer pair with
a bounded buffer of size n = 5. Let us analyze what happens when the consumer
reads three elements from the stream. The interactions between both sites are
shown in the left picture of Figure 5.1.

[Figure 5.1 is a message diagram, omitted here: for each of the two implementations it shows the messages exchanged between the producer site and the consumer site while the consumer reads three elements, with a buffer of size 5.]
Figure 5.1: Network behavior of two implementations of a bounded buffer of size 5

The arrows to the
left represent the messages that make a variable needed, while the arrows to the
right are the messages with the binding of the corresponding variable. What we
observe is that an element cannot be requested before the list pair containing
the former element arrives on the consumer’s site.
What we really want is something like the right picture of Figure 5.1. For
each element read, the element n positions ahead should be requested as soon
as possible. With this behavior, the elements are still produced in a sequential
way, but the message round-trips to trigger the production of elements are truly
concurrent. In the first behavior, those message round-trips are serialized.
5.1.2 A correct bounded buffer
Snippet 5.2 on the next page shows an implementation that provides the desired
behavior. The producer performs the following statement on its site.
Es={Encapsulate Xs}
The returned value is a pair of variables that will be bound to streams. The
first variable is the stream Xs, while the second variable is a stream Rs that is
built by the consumer and read by the producer. For each element appearing
on that stream, the producer requests one extra element on the data stream
Xs. This is done by the thread running procedure Prepare on the producer.
The consumer makes the following call to get a stream Ys.
Ys={DecapsulateN N Es}
This immediately builds a stream with N elements that will be read by the
producer. Then, for every element consumed on Ys, the stream Rs is appended
with the statement Rs=unit|Rt.
fun {Encapsulate Xs}
proc {Prepare Rs Xs} {Prepare Rs.2 Xs.2} end
Rs
in
thread {Prepare Rs Xs} end
Xs#Rs
end
fun {DecapsulateN N Es}
fun {Prepend K Xt}
if K>0 then unit|{Prepend K-1 Xt} else Xt end
end
fun lazy {Deliver Xs Rs}
case Xs of X|Xt then Rt in
Rs=unit|Rt                % trigger need at producer
X|{Deliver Xt Rt}
end
end
Rt
in
Es.2={Prepend N Rt}       % trigger N elements ahead
{Deliver Es.1 Rt}
end
Snippet 5.2: A correct implementation of a bounded buffer
Let us now check that the behavior of the abstraction corresponds to the
picture on the right of Figure 5.1. The arrows from right to left correspond
to the bindings Rs=unit|Rt, while the arrows from left to right correspond to
the bindings of the producer’s output stream. Note that the bindings of Rs are
performed immediately by the consumer. This is because the variables Rs are
created on the consumer’s site, hence the coordinators of those variables are on
that site, and variable bindings are performed by the variable’s coordinator. A
more detailed explanation can be found in Section 8.2.3.
If several consumers are present, the stream can be encapsulated once, and
each consumer decapsulates it by applying DecapsulateN. The producer will
be driven by the consumer that requests the furthest ahead. However, the
network behavior induced by the stream Rs is more difficult to describe, since
the consumers will share that stream. Therefore a binding like Rs=unit|Rt
might require an intermediate network message to another consumer site, if
that other site holds the coordinator of Rs.
5.1.3 An adaptive bounded buffer
Snippet 5.3 provides a replacement for the function DecapsulateN. The stream
is decapsulated on the consumer’s site by the statement
fun {Decapsulate Es}
fun lazy {Deliver Xs Rs}
case Xs of X|Xt then Rt in
Rs = if {Not {IsDet Xt}} then unit|unit|Rt
elseif {Not {IsDet Xt.2}} then unit|Rt
else Rt end
X|{Deliver Xt Rt}
end
end
Rt
in
Es.2=unit|Rt
{Deliver Es.1 Rt}
end
Snippet 5.3: An adaptive bounded buffer
Ys={Decapsulate Es}
This new function no longer takes a buffer size, but instead adapts how elements
are requested ahead in order to always have one element ready at the consumer’s
site.
Let us make a quick comparison between the functions DecapsulateN and
Decapsulate. The main difference is the binding of Rs, the second argument
of the internal lazy function, which is called Deliver in both versions. For
each consumed element, the adaptive version checks how many elements are
available in front of Xt. We use the function IsDet which returns true if its
argument is determined, and false otherwise. If no element is available (Xt
is not determined), the size of the buffer is increased by triggering the need
for two extra elements with the statement Rs=unit|unit|Rt. If exactly one
element is available (Xt.2 is not determined yet), we keep the same buffer
size by requesting one extra element (Rs=unit|Rt). If more than one element
is available, we decrease the buffer size by not requesting any extra element
(Rs=Rt).
This adaptive version of the bounded buffer will work well if the consumer
reads the stream at a regular pace.
5.1.4 A batch processing buffer
The reader might be surprised by the solution proposed in the previous sections.
The abstractions effectively improve the network behavior of lazy evaluation,
but they do so by bypassing the distributed mechanism of lazy evaluation. We
were somewhat disappointed by this ourselves once we realized it. So we came up
with a solution that relies directly on the distributed by-need mechanism.
In order to avoid the sequential “ping-pong” effect illustrated in Figure 5.1,
one may let the producer site trigger the evaluation of several elements in a row.
proc {BatchBuffer N Xs}
proc {BatchLoop I Xs}
if I>0 then {BatchLoop I-1 Xs.2} else
{WaitNeeded Xs} {BatchLoop N Xs}
end
end
in
thread {BatchLoop 0 Xs} end
end
Snippet 5.4: An abstraction that forces the evaluation of a
stream in batches
Whenever the first element is requested, an abstraction forces the production
of n elements. In other words, we can force the producer to work by batches.
The abstraction is shown in Snippet 5.4. One simply has to call
{BatchBuffer N Xs}
on the producer’s site. The procedure creates a thread that detects when an
element is needed, and automatically makes the n−1 following elements needed.
The thread then waits until the element after that batch becomes needed, and
requests a new batch.
This abstraction can be used on its own, or in combination with the simple
bounded buffer given in Snippet 5.1. When used alone, the full round-trip
delay will occur only once every n elements. If the consumer uses the simple
bounded buffer, the round-trip delay can be completely hidden if n is large
enough.
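A possible combination, with illustrative sizes: the producer evaluates its stream by batches of 50 elements, while the consumer keeps the same look-ahead with the simple buffer.

   {BatchBuffer 50 Xs}        % on the producer's site
   Ys={BoundedBuffer 50 Xs}   % on the consumer's site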
5.2 Processes à la Erlang
In the language Erlang, almost everything is a process. A process consists
of a port, on which messages can be sent, and a function that processes the
messages. A process is created by the primitive spawn, and messages are sent
with the binary operator !:
Pid = spawn(F)    % create a process from a function F
Pid ! Msg         % send a message Msg to process Pid
The function takes a message from the incoming queue with the statement
receive. The statement uses pattern matching to specify valid messages, and
subsequent actions.
It is pretty easy to write a function Spawn in Oz that is similar to the
corresponding Erlang primitive. The function, shown in Snippet 5.5 on the next
page with an example, creates a port and runs the procedure in its own thread.
%% create a process with unary procedure Process
fun {Spawn Process}
Xs Ys Self={NewPort Xs}
fun {Loop Xs}
case Xs of user(M)|Xt then M|{Loop Xt} end
end
in
thread Ys={Loop Xs} end
thread {Process Ys} end
Self
end
%% send message M to process A
proc {SendProc A M}
{Send A user(M)}
end
Snippet 5.5: Spawning an Erlang-like process in Oz
The procedure takes the stream of messages as its argument. It should
process the messages in a sequential way. The procedure SendProc sends a
message M to a process A. Note that messages are put in a record user(M), in
order to distinguish them from system messages that are introduced below.
Here is an example with two processes A and B sending each other ping-pong
messages:
proc {ProcessA Xs}
case Xs of X|Xt then
case X
of stop
then skip
[] ping(P) then {SendProc P pong(A)} {ProcessA Xt}
end
end
end
A={Spawn ProcessA}
proc {ProcessB Xs}
{SendProc A ping(B)}
case Xs of pong(P)|_ andthen P==A then skip end
end
B={Spawn ProcessB}
Process linking. Erlang processes can be linked together, such that when
one of them terminates abnormally, the other ones die also, unless they are
system processes. System processes are explained below. Linking is symmetric,
and implements a property which states that a group of processes must crash
as soon as one of them crashes. A process A adds the process B to its link set
by evaluating the built-in function link(B).
%% link process Self to process A
proc {Link Self A}
{Send Self link(A)} {Send A link(Self)}
end
%% change the ’system process’ flag
proc {SetSystem Self B}
X in {Send Self system(B X)} {Wait X}
end
%% create a process with unary procedure Process
fun {Spawn Process}
Xs Ys Self={NewPort Xs} T
fun {Loop Xs Linkset Sys}
case Xs of X|Xt then
case X
of user(M) then M|{Loop Xt Linkset Sys}
[] system(B X) then X=unit {Loop Xt Linkset B}
[] link(A) then {Monitor A} {Loop Xt A|Linkset Sys}
[] exit(E) then {Notify E Linkset} nil
[] exit(A E) andthen Sys then X|{Loop Xt Linkset Sys}
[] exit(A normal) then {Loop Xt Linkset Sys}
[] exit(A E) then {Kill T} {Notify E Linkset} nil
end
end
end
proc {Monitor A}
thread
if {Member permFail {GetFaultStream A}} then
{Send Self exit(A crashed)} end
end
end
proc {Notify E Linkset}
for A in Linkset do {Send A exit(Self E)} end
end
in
thread Ys={Loop Xs nil false} end
thread
T={Thread.this}
try {Process Ys} {Send Self exit(normal)}
catch E then {Send Self exit(E)} end
end
Self
end
Snippet 5.6: Asymmetric linking and monitoring of processes
In Snippet 5.6 on the facing page we propose a new implementation of Spawn
that handles linking and system processes. The internal loop of the process
handles system messages, and maintains a link set, i.e., a list of processes that
are notified of the termination of the current process. The message link(A) is
sent by the procedure Link, and notifies the current process that it is linked to
process A. The current process adds A to its link set, and monitors A to detect
a failure that A itself would not be able to notify.
When the process terminates, it sends the message exit(E) to itself, in
order to notify its link set. The value E describes the reason of the termination. Processes in the link set are notified with the message exit(Self E).
The latter message is handled in a different way, depending on whether the
current process is a system process. Non-system processes are killed when E is
not normal, while system processes simply receive the message. The message
system(B X) is sent by the procedure SetSystem on behalf of the process itself in
order to change its status (system process or not); the variable X is bound to unit to acknowledge the change.
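As a hedged usage sketch, assuming the processes A and B of the earlier ping-pong example were created with this extended Spawn, one could write:

   {SetSystem B true}   % B becomes a system process
   {Link A B}           % A and B are now linked

If A then crashes, B is not killed; instead it receives a message exit(A crashed) on its message stream, which it can handle like any other message.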
5.3 Failure by majority
Failure detectors are extremely useful for writing programs that react to partial
failures. Failure handling can be written in a rule-based style. However, our
detector model is weak in the sense that detectors are not required to be consistent across sites. This can sometimes lead distributed programs to behave
strangely, because some sites consider an entity failed while others do not.
In this section we propose an algorithm that makes a group of N processes
find a consensus about the failure status of a given process. Group members
may themselves fail during the consensus algorithm. However, the algorithm
is guaranteed to reach consensus about crashed processes if fewer than N/2
processes crash.
5.3.1 Algorithm
The idea is quite simple. For the sake of simplicity, let us assume that the
group wonders about the status of process S. Whenever process P suspects
S, it broadcasts vote(P, +1). If it changes its mind about S, it broadcasts
vote(P, −1). Every process sums the values received from every other process,
and keeps track of how many of them have a positive total. When this number
becomes greater than N/2, it means that a majority of processes have suspected
S. At that point a message is broadcast to make all correct processes consider
S as permanently failed, i.e., {Break S }. The latter message must be broadcast
in a reliable manner, in order to guarantee the consistency between processes.
The algorithm is described in Figure 5.2 on the next page in the style of
Guerraoui et al. [GR06]. Processes perform actions when some events occur,
some of those events being messages. An implementation with objects is given
in Snippet 5.7 on page 71. The specificity of the algorithm is that there is
upon S is suspect do
    broadcast vote(self, +1)
upon event vote(P, x) do
    count.P := count.P + x
upon S is non-suspect do
    broadcast vote(self, −1)
upon event kill(S) do
    execute {Break S}
upon #{P | count.P > 0} > N/2 do
    reliableBroadcast kill(S)

Figure 5.2: Majority voting for consensus on failure status of S
only one possible decision, and that decision is taken as soon as a majority of
processes have agreed with it. Moreover, the decision is monotonic.
Note. Counting the votes from a given process P consists in summing all the
+1’s and −1’s sent by P. The counter allows votes from P to be received in any
order. If all messages from P are received in order, the counter will always
belong to the set {0, 1}.
5.3.2 Correctness
Assume that S has crashed. Because we rely on eventually perfect failure
detectors, all correct processes will eventually suspect S forever. So at some
point, more than N/2 processes will broadcast a positive vote. And all correct
processes will eventually sum all the votes broadcast by the positive majority
we mentioned. Note that broadcasting the decision is only an optimization in
that case.
Now assume that S has not crashed. There may be enough suspicions
among the other processes to let one of them observe a majority of positive
votes. The latter observation might be temporary if processes change their
mind quickly. In that case, broadcasting the decision with a reliable algorithm
ensures that all correct processes will eventually consider S as permanently
failed.
In order to show the necessity of the final broadcast, let us imagine an
extreme case where only one process P observes a majority of positive votes.
Such a scenario is depicted in Figure 5.3 on the next page. A group of NX
processes X suspect S, then cancel their suspicion. Their messages reach P
faster than those of another group Y. The group Y has NY processes that also suspect
S and then revise their judgment. If we have both NX, NY < N/2 and NX +
NY > N/2, then only P will observe a majority of positive votes; for instance,
with N = 7 and NX = NY = 3, P may momentarily count six positive votes even
though no other process ever observes a majority. If P does not broadcast its
observation, its view will not be consistent with the other processes.
[Figure 5.3 is a message diagram, omitted here: it shows the votes (X, +1), (X, −1), (Y, +1) and (Y, −1) arriving at process P interleaved in such a way that only P momentarily counts a positive majority.]
Figure 5.3: A scenario where only one process P observes a majority
5.3.3 The whole code of processes
The process itself is implemented as an object that multiplexes voters. Each
process has one voter per other process it monitors. The whole code is given
in Snippets 5.8 to 5.12 on pages 72–74.
The class BaseProcess is the parent class of all processes. It creates the
process’ port, and a thread that processes messages. An identifier id is also
provided. The latter is useful if one wants to use a process as a key in a
dictionary.
The class ProcessWithFailureDetector extends BaseProcess by monitoring other processes. It is initialized with the list of processes, with their identifier and port. A process using this class should implement method failure()
in order to handle failures. In our example, this method is defined in subclass
MonitoringProcess.
The classes BestEffortBroadcast and ReliableBroadcast provide simple methods to broadcast messages to all processes. The latter is reliable in
the sense that either all correct processes deliver the message, or none of them
deliver it. Each process that receives the message broadcasts it once as well. This
ensures the delivery in case the original sender crashes. This implementation
is not efficient, but it fulfills its specification.
The class MonitoringProcess is the main class. It multiplexes its voters,
and provides all the support they need for communicating. In method failure,
you can see that the “opinion” of a voter is changed when the failure state of
the corresponding process changes.
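A hedged sketch of how a small group could be assembled (three processes; the names are ours); it mirrors the init() and initProcesses(IPs) methods defined above:

   Ps={Map [1 2 3] fun {$ _} {New MonitoringProcess init()} end}
   IPs={Map Ps fun {$ P} P.id#P.port end}
   for P in Ps do {P initProcesses(IPs)} end

Each process first gets its port and identifier, then learns about the whole group and starts monitoring the others.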
5.3.4 Variants
The algorithm we gave was kept simple for the sake of explanation. But it is
quite flexible, and admits variants, which are easy to implement. Here are a
few ideas that can improve the abstraction.
• One may change the number of positive votes that must be reached to
trigger the decision. A value of N/3 may be considered enough, for
instance.
• We have assumed the number of processes to be known and fixed. The
algorithm works fine if that number varies over time, and processes regularly update this number. The condition for triggering the decision simply
has to be reevaluated.
• One should discard crashed processes when counting votes. One may
also discard suspect processes, such that only known correct processes
are taken into account. The latter idea requires more attention, because
as such, it would allow one process to suspect all other processes.
class Voter
   attr
      broadcast     % broadcast procedure
      rbBroadcast   % reliable broadcast procedure
      decide        % decide procedure
      total         % number of processes
      id            % identifier of this process
      opinion       % this process’ opinion (true or false)
      votes         % accumulated votes from each voter
meth init(broadcast:B rbBroadcast:RB decide:D N)
broadcast := B
rbBroadcast := RB
decide := D
total := N
id := {NewName}
opinion := false
votes := {NewDictionary}
end
%% set the process’ opinion (true for suspicion)
meth propose(B)
if B \= @opinion then
opinion := B
{@broadcast vote(@id (if B then 1 else ~1 end))}
end
end
%% receive a vote from Id
meth vote(Id X)
@votes.Id := {Dictionary.condGet @votes Id 0} + X
if X > 0 then N in
N={Length {Filter {Dictionary.items @votes} IsPos}}
if N*2 > @total then {@rbBroadcast decide} end
end
end
%% receive decision
meth decide
{@decide true}
end
end
fun {IsPos X} X>0 end
Snippet 5.7: Implementation of the majority voting algorithm
class BaseProcess
feat port id
meth init()
Xs in
thread
try {ForAll Xs self}
catch _ then {Kill self.port}
end
end
self.port={NewPort Xs}
self.id={NewName}
end
end
Snippet 5.8: Base class of processes
class ProcessWithFailureDetector from BaseProcess
attr processes
meth initProcesses(IPs)
%% IPs is a list of pairs id#port
processes := {List.toRecord p IPs}
for Id#P in IPs do
thread
{Wait P}
for X in {GetFaultStream P} do
{Send self.port failure(Id X)}
end
end
end
end
end
Snippet 5.9: A class for processes that monitor each other
class BestEffortBroadcast from ProcessWithFailureDetector
meth broadcast(M)
{Record.forAll @processes proc {$ P} {Send P M} end}
end
end
Snippet 5.10: Implementation of best-effort broadcast
class ReliableBroadcast from BestEffortBroadcast
attr delivered
meth initProcesses(IPs)
BestEffortBroadcast,initProcesses(IPs)
delivered := nil
{self CheckDelivered}
end
meth CheckDelivered
%% some kind of ’garbage collection’ on delivered
delivered := unit|{List.takeWhile @delivered
fun {$ Id} Id \= unit end}
thread
{Delay 360000} {Send self.port CheckDelivered}
end
end
meth rbBroadcast(M)
%% note: unicity of messages is guaranteed by user
{self broadcast(rbDeliver(M))}
end
meth rbDeliver(M)
if {Not {Member M @delivered}} then
delivered := M|@delivered
{Send self.port M}
{self broadcast(rbDeliver(M))}
end
end
end
Snippet 5.11: Implementation of an “eager” reliable broadcast
class MonitoringProcess from ReliableBroadcast
attr voters
meth initProcesses(IPs)
N={Length IPs}
in
ReliableBroadcast,initProcesses(IPs)
voters := {Record.mapInd {List.toRecord v IPs}
fun {$ I P}
proc {B M}
{self broadcast(voting(I M))}
end
proc {RB M}
{self rbBroadcast(voting(I M))}
end
proc {D B}
if B then {self kill(I)} end
end
in
{New Voter init(broadcast:B
rbBroadcast:RB
decide:D
N)}
end}
end
%% relay a message for a voter
meth voting(I M)
{@voters.I M}
end
%% failure detector notification, maybe change opinion
meth failure(I State)
{@voters.I propose(State \= ok)}
end
%% decide whether a process has failed
meth kill(I)
{Break @processes.I} {Kill @processes.I}
voters := {AdjoinAt @voters I proc {$ _} skip end}
end
end
Snippet 5.12: Main class, with one voter per process in the
group
6 Language semantics
We now give formal support to the language concepts we presented in Chapters 2, 3 and 4. This chapter defines an operational semantics for Oz without
taking distribution into account. The next chapter presents a refinement of
that semantics, which models distribution, the network, and failures. The refinement also gives a semantics to the annotation system and the failure handling
primitives.
Section 6.1 defines how to translate a program in Full Oz into an equivalent
program in Kernel Oz. Section 6.2 gives the notations and basic ingredients of
the semantic definitions. Section 6.3 details the semantics of the declarative
part of the kernel language, while Section 6.4 gives the semantics of the nondeclarative part.
6.1 Full language to kernel language
Every Oz program is equivalent to a program in Kernel Oz. In Chapter 2, we
have introduced the kernel language, and syntactic sugar of common idioms in
the full language. We now see how to formally translate an Oz program into an
equivalent Kernel Oz program. The kernel language is given by the grammar
in Figure 6.1. Both the declarative and non-declarative parts of the language
are given in the grammar.
The translation is defined by the relation ⇒, which reduces statements to
simpler statements. The kernel program equivalent to a given Oz program
is defined as the fixpoint of the program by the relation. This relation is
structural: one can reduce a statement inside another statement.
For the rest of the section, D denotes a declaration (statement or identifier),
E an expression, P a pattern, S a statement, SE a statement or expression,
and X and Y identifiers.
S ::= skip | S1 S2 | thread S end
| local X in S end
| X =Y | X =f (Y1 . . . Yn )
| if X then S1 else S2 end
| case X of f (Y1 . . . Yn ) then S1 else S2 end
| {WaitNeeded X}
| proc {X Y1 . . . Yn } S end | {X Y1 . . . Yn }
| try S1 catch X then S2 end | raise X end | {FailedValue X Y }
| X =!!Y
| {NewCell X Y } | X0 =Y :=X1
Figure 6.1: Grammar of Kernel Oz
Expanding declarations. The following rules split up declarations into
declared identifiers and initializing statements. They simplify declarations such as
“local X=foo in”. We assume that the statements that appear in the declaration D have already been reduced to kernel statements.
D in SE ⇒ local D in SE end
local D in SE end ⇒ local decl(D) in stmt(D) SE end
local X1 X2 . . . Xn in SE end ⇒ local X1 in
local X2 . . . Xn in SE end
end
The functions decl and stmt respectively return the declared identifiers and the
statements of a declaration. The function ident returns the set of identifiers in
a pattern; each identifier is declared at most once. Those functions are defined
as
decl(X) = {X}
decl(P =E) = ident(P )
decl(P =E1 :=E2 ) = ident(P )
decl(proc {X Y1 . . . Yn } S end) = {X}
decl(S) = ∅
decl(D1 . . . Dn ) = decl(D1 ) ∪ . . . ∪ decl(Dn )
ident(X) = {X}
ident(f (P1 . . . Pn )) = ident(P1 ) ∪ · · · ∪ ident(Pn )
stmt(X) = ǫ
stmt(S) = S
stmt(D1 . . . Dn ) = stmt(D1 ) . . . stmt(Dn )
Expanding nested expressions. Those are the kernel statements that contain an expression E in place of an identifier. The reduction introduces an
identifier X, and expands the evaluation of E before evaluating the statement
itself. The identifier X is chosen so as not to occur in the original statement. Notice that in the procedure call, the first non-identifier argument is expanded.
We assume that m, n ≥ 0.
E =E ′ ⇒ local X =E in X =E ′ end
if E then . . . end ⇒ local X =E in (if X then . . . end) end
case E of . . . end ⇒ local X =E in (case X of . . . end) end
proc {E . . .} S end ⇒ local X =E in (proc {X . . .} S end) end
{Y1 . . . Ym E E1 . . . En } ⇒ local X =E in {Y1 . . . Ym X E1 . . . En } end
raise E end ⇒ local X =E in (raise X end) end
Expanding expressions. Function definitions are expanded to procedures.
The extra parameter X is chosen to not occur in a free position in E.
fun {. . .} E end ⇒ proc {. . . X} X =E end
Then we expand all the statements of the form X =E. The expansion often
brings the assignment to X inside the language constructs, which sometimes
declares new identifiers Yi . If X occurs in those declarations, then we substitute
this occurrence of X by another identifier. The result of this substitution is
denoted Yi∗ or E ∗ .
X =(S E) ⇒ S X =E
X =thread E end ⇒ thread X =E end
X =local Y in E end ⇒ local Y ∗ in X =E ∗ end
X =E1 =E2 ⇒ X =E1 X =E2
X =if E then E1 else E2 end ⇒ if E then X =E1 else X =E2 end
X =case E of f (Y1 . . . Yn ) then E1 else E2 end ⇒ case E of f (Y1∗ . . . Yn∗ ) then X =E1∗ else X =E2 end
X =proc {$ . . .} S end ⇒ proc {X . . .} S end
X ={E E1 . . . En } ⇒ {E E1 . . . En X}
X =try E1 catch Y then E2 end ⇒ try X =E1 catch Y ∗ then X =E2∗ end
X =raise E end ⇒ raise E end
The lazy expansion. Lazy functions can be defined by using fun lazy
instead of fun in their definition. The simplest way to expand this construct
is to create a thread that synchronizes on the demand, then evaluates the
function’s body expression. We assume that the parameter X does not occur
in a free position in E.
fun lazy {. . .} E end ⇒ proc {. . . X}
thread {WaitNeeded X} X =E end
end
While correct, this expansion may suffer a slight performance overhead, especially if the function is recursive. The overhead comes from the fact
that every recursive call creates a new thread. Recursive calls in tail position
do not need this extra thread. For those, one may let the current thread suspend. This is correct as long as the initial call to the function is in a different
thread. The following expansion optimizes tail recursive calls.
fun lazy {F X1 . . . Xn } E end ⇒ local F ′ in
proc {F ′ X1 . . . Xn X}
{WaitNeeded X} (X =E)∗
end
fun {F X1 . . . Xn }
thread {F ′ X1 . . . Xn } end
end
end
The identifier F ′ is chosen to not occur in the definition of F . The extra
operation (X =E)∗ expands the statement X =E, and replaces every tail call to
F by a similar call to F ′ .
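For instance (the function Ints is ours, used only for illustration), the tail-recursive lazy function

   fun lazy {Ints N} N|{Ints N+1} end

would be expanded by this second scheme into roughly

   local Ints2 in
      proc {Ints2 N X}
         {WaitNeeded X}
         local T in X=N|T {Ints2 N+1 T} end
      end
      fun {Ints N}
         thread {Ints2 N} end
      end
   end

where the recursive call goes to Ints2 in the same thread, so only the initial call creates a thread.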
The $ expansion. The main use of the $ sign is in expressions that define
a procedure, or in a procedure call. At some point such an expression E will
be reduced in a statement of the form X=E. The following rules show how to
reduce such a statement. P (X) denotes a pattern containing an identifier X,
and P ($) is the same pattern with X replaced by $.
X =proc {$ . . .} S end ⇒ proc {X . . .} S end
X =fun {$ . . .} E end ⇒ fun {X . . .} E end
X ={E1 . . . Em P ($) Em+1 . . . En } ⇒ {E1 . . . Em P (X) Em+1 . . . En }
More linguistic abstractions. The full language provides even more statements, like class definitions, functor definitions, support for constraints, etc.
We do not show how to expand those in this work. Material can be found in
the book [VH04], and the documentation of Mozart [Moz99].
6.2 Basics of the semantics
Here we give the basics of the operational semantics of the language. The
semantic rules prescribe how a running program can go from one state to
another. A program state is called a configuration. A configuration consists of
a set of threads connected to a shared store:
    thread  · · ·  thread
        ↘         ↙
           store
The thread is the basic unit of sequential computation. A computation consists
of a sequence of computation steps, each of which transforms a configuration
into another configuration. At each step, a thread is chosen, and executes an
atomic operation. The choice of the thread is nondeterministic among all the
executable threads in that configuration. Thread execution follows an interleaving semantics.
6.2.1 The store
The store is a single-assignment store (or constraint store), extended with first-class procedures, mutable entities, and a few other specific extensions [Smo95,
VH04]. The extensions will be introduced step by step, together with their
corresponding reduction rules. Those extensions are grouped together under
the term predicate store.
The constraint store contains variable assignments made by the program.
Assignments are between variables (x=y), or between variables and values
(x=v). The constraint store is a conjunction of such assignments. It has the
property of being monotonic, in the sense that one can only add assignments;
existing assignments cannot be removed.
Store entailment. The constraint store has a logical nature: it can entail
information that is not directly present in the store. For instance, the store
x=3 ∧ x=y entails y=3. We denote a store by σ, and a basic relation like an
equality by β. The statement σ |= β means that the store σ entails β. We
assume that the store conjunction is associative, commutative, and has neutral
element ⊤, which also denotes the empty store.
What the constraint store entails is defined by the following inference rules.
The rules are given with premises on top of a horizontal line, and a conclusion
below. The horizontal line is not shown when the premises are true. The very
first rule states that the store entails at least what it contains, and in particular,
that adding information in the store never reduces entailment.
σ ∧ β |= β
(6.1)
The next rules are specific to the equality relation. Rules (6.2) simply reflect
that equality is reflexive, symmetric, and transitive. The metavariables t, u, v
80
Language semantics
can be either variables or values. The values we consider here are either simple
values, or records.
σ |= t=t          σ |= u=v          σ |= t=u    σ |= u=v
                  ─────────         ─────────────────────        (6.2)
                  σ |= v=u                σ |= t=v
We now define rules for record equality. Two records are equal if and only
if they have identical labels, arities, and their fields are pairwise equal. The
following two rules establish the “positive” side of this statement.
σ |= u1=v1  · · ·  σ |= un=vn
──────────────────────────────────        (6.3)
σ |= f(u1 . . . un)=f(v1 . . . vn)

σ |= f(u1 . . . un)=f(v1 . . . vn)
──────────────────────────────────    1 ≤ i ≤ n        (6.4)
σ |= ui=vi
The constraint store can also disentail some equalities, i.e., inferring that they
are false. The following rules state explicitly that records with different labels,
arities, or different corresponding fields are unequal.
f ≠ g  or  m ≠ n
────────────────────────────────────        (6.5)
σ |= f(u1 . . . um) ≠ g(v1 . . . vn)

σ |= ui ≠ vi
────────────────────────────────────    1 ≤ i ≤ n        (6.6)
σ |= f(u1 . . . un) ≠ f(v1 . . . vn)
Determinacy. We can generalize store entailment a bit, in order to introduce
derived concepts like determinacy. We say a variable x is determined by a store
σ if σ entails that it is equal to a given value. We note this as σ |= det (x). If
the store cannot infer the value of the variable, we say that the variable is free.
σ |= x=v
───────────    for some value v        (6.7)
σ |= det(x)
Ask and tell. The two basic operations on a store are called ask and tell.
The ask operation queries the store to know whether a given constraint is
entailed or disentailed. Asking β on store σ returns a positive answer if σ |= β,
a negative answer if σ |= ¬β. There is no answer otherwise. The monotonicity
of the store guarantees that the answer of an ask never changes.
The tell operation adds a basic constraint to a store, provided that the
store remains consistent. The store becomes inconsistent as soon as it infers
something like 1=2, for instance. Telling β to σ updates the store to σ ∧β. The
rules that update the store are written such that they never make the store
inconsistent. If an inconsistency could be introduced by a program statement,
that statement should fail.
Predicate store. The predicate store is subject to the principle of substitution by equals. The following inference rule states that an instance of predicate
p is entailed by the store if the store contains a similar predicate whose arguments are pairwise equal.
σ |= u1=v1  · · ·  σ |= un=vn
──────────────────────────────────────        (6.8)
σ ∧ p(u1, . . . , un) |= p(v1, . . . , vn)
Contrary to the constraint store, elements of the predicate store can be removed or replaced. The valid ways to update the predicate store depend on
each predicate, and are defined by the semantic rules.
6.2.2 Structural rules
The semantics are given by transition rules (or reduction rules1 ) that describe
valid computation steps. The rules have the form
( T | σ )  →  ( T′ | σ′ )    if C
It states that a configuration with a multiset of threads T and store σ can
be reduced to the configuration with threads T ′ and store σ ′ , provided the
condition C is fulfilled. We often write the left-hand side of the rule as a pattern,
so that a configuration must match the pattern for the rule to be applicable.
The disjoint union of multisets is written with commas, and singletons are
written without curly braces. For instance, “T1 , T , T2 ” stands for {T1 } ⊎ T ⊎
{T2 }. There is no ambiguity because of the thread syntax.
The following two rules are convenient for simplifying the expression of the
rules. The first one expresses the relative isolation of concurrent threads: a
subset of the threads may reduce without directly affecting the other threads.
( T, U | σ )  →  ( T′, U | σ′ )    if  ( T | σ )  →  ( T′ | σ′ )        (6.9)
The second rule states that stores can be considered up to equivalence. This
allows one to choose the most convenient representation for a store in a reduction
rule.
( T | σ )  →  ( T | σ′ )    if σ and σ′ are equivalent        (6.10)
Store equivalence is defined as follows. Let us first consider the constraint
store. Two stores σ = β1 ∧ · · · ∧ βn and σ′ = β′1 ∧ · · · ∧ β′n′ are equivalent if
σ |= β′i for every i,   and   σ′ |= βj for every j.        (6.11)
1 This expression comes from the chemical analogy of transition rules, where the execution
takes a statement, and reduces it to a simpler statement.
The other part of the store follows a similar rule, except that each instance of a
predicate in σ must correspond to exactly one predicate in σ ′ . This is necessary
for some predicates, like the one that defines the current state of a cell, which
must occur exactly once per cell in the store.
6.3 Declarative subset of the language
6.3.1 Sequential and concurrent execution
A thread is a sequence of statements S1 S2 . . . Sn . Parentheses are introduced
to avoid ambiguities when necessary. The empty thread is written (). The
abstract syntax of threads can thus be defined as
T ::= () | S T
(6.12)
The empty thread reduces to an empty multiset of threads. A nonempty thread
reduces by reducing its first statement. The latter rule will again simplify the
expression of rules.
( () | σ )  →  ( ∅ | σ )
( S T | σ )  →  ( S′ T | σ′ )    if  ( S | σ )  →  ( S′ | σ′ )        (6.13)
The empty statement, sequential composition, and thread statement are
tied to the notion of thread. For those rules we have to show explicitly how
they modify the structure of the threads. Notice that the latter creates a new
thread with the statement S only.
( skip T | σ )  →  ( T | σ )
( (S1 S2) T | σ )  →  ( S1 (S2 T) | σ )
( thread S end T | σ )  →  ( T, S | σ )        (6.14)

6.3.2 Variable introduction
The local statement creates a new variable in the store, and makes the declared identifier correspond to that variable. Instead of maintaining an explicit
mapping between identifiers and variables, we directly substitute the declared
identifier by its corresponding variable. The notation S[X/x] stands for the
substitution of X by x in S. The substitution takes care of lexical scope issues.
( local X in S end | σ )  →  ( S[X/x] | σ )    where x is a fresh variable        (6.15)
The condition of the rule requires x to be a fresh variable. A fresh variable is
a variable that does not appear anywhere in the initial configuration. This can
be written formally, but we have chosen to keep the rule more readable.
Variable substitution. The identifier substitution operation is quite usual.
Assume that θ denotes the substitution [X/x]. Let χ denote an identifier or a
variable.
χθ = x if χ = X,   and   χθ = χ otherwise        (6.16)
We now define the substitution inductively on the syntax of statements. The
following statements do not involve lexical scoping.
(skip)θ = skip        (6.17)
(S1 S2)θ = S1θ S2θ        (6.18)
(thread S end)θ = thread Sθ end        (6.19)
(χ1=χ2)θ = χ1θ=χ2θ        (6.20)
(χ=c)θ = χθ=c        (6.21)
(χ=f(χ1 . . . χn))θ = χθ=f(χ1θ . . . χnθ)        (6.22)
(if χ then S1 else S2 end)θ = if χθ then S1θ else S2θ end        (6.23)
({χ χ1 . . . χn})θ = {χθ χ1θ . . . χnθ}        (6.24)
(raise χ end)θ = raise χθ end        (6.25)
In the following equations, we assume that the lexical scope introduced by the
statement does not catch X, i.e., X is different from the identifiers Y, Y1 , . . . , Yn .
(local Y in S end)θ = local Y in Sθ end        (6.26)
(case χ of f(Y1 . . . Yn) then S1 else S2 end)θ = case χθ of f(Y1 . . . Yn) then S1θ else S2θ end        (6.27)
(proc {χ Y1 . . . Yn} S end)θ = proc {χθ Y1 . . . Yn} Sθ end        (6.28)
(try S1 catch Y then S2 end)θ = try S1θ catch Y then S2θ end        (6.29)
We now define the substitution when X is caught by the lexical scope of the
statements. We assume that X ∈ {X1 , . . . , Xn }.
(local X in S end)θ = local X in S end        (6.30)
(case χ of f(X1 . . . Xn) then S1 else S2 end)θ = case χθ of f(X1 . . . Xn) then S1 else S2θ end        (6.31)
(proc {χ X1 . . . Xn} S end)θ = proc {χθ X1 . . . Xn} S end        (6.32)
(try S1 catch X then S2 end)θ = try S1θ catch X then S2 end        (6.33)
6.3.3 Unification
The unification operation in Oz imposes equality between two terms. It incrementally tells basic constraints to the store until the equality is entailed or
disentailed by the store. The operational semantics of unification is therefore
non atomic. This lack of atomicity permits a realistic extension of unification
in the distributed case.
The following two rules terminate the unification when it is either entailed,
or disentailed. The statement fail is used for the sake of readability; it is
shorthand for raise failure end, which raises a failure exception.
( u=v | σ )  →  ( skip | σ )    if σ |= u=v        (6.34)
( u=v | σ )  →  ( fail | σ )    if σ |= u≠v        (6.35)
We then give the rule that incrementally tells basic constraints to the store.
Those basic constraints are necessary for the unification to succeed. They are
of the form x=t, where x is not determined by the store yet, and t is either a
variable or a value.
( u=v | σ )  →  ( u=v | σ ∧ x=t )    if σ ∧ u=v |= x=t and σ ⊭ det(x)        (6.36)
There exists an optional simplification rule, that rewrites a unification as
another one. This simplification does not change the effect of unification, but
it allows an implementation to simplify it.
( u=v | σ )  →  ( u′=v′ | σ )    if σ ∧ u=v |= u′=v′ and σ ∧ u′=v′ |= u=v        (6.37)
Example. Executing x=f(y) with the store σ ≡ x=f(x1) ∧ x1=2 tells y=2
to the store, then reduces to skip. Indeed, the first reduction (telling y=2 by
rule 6.36) applies, since the store inference rules give:
  σ ∧ x=f(y) |= x=f(y), hence σ ∧ x=f(y) |= f(y)=x   (symmetry);
  σ ∧ x=f(y) |= x=f(x1), hence σ ∧ x=f(y) |= f(y)=f(x1)   (transitivity);
  σ ∧ x=f(y) |= y=x1   (rule 6.4);
  σ ∧ x=f(y) |= x1=2, hence σ ∧ x=f(y) |= y=2   (transitivity).
The rule leads to the store σ′ ≡ σ ∧ y=2, which entails x=f(y):
  σ′ |= y=2 and σ′ |= x1=2, hence σ′ |= x1=y   (symmetry and transitivity);
  σ′ |= f(x1)=f(y)   (rule 6.3);
  σ′ |= x=f(x1), hence σ′ |= x=f(y)   (transitivity).
6.3.4 Conditional statements
Those statements perform an ask operation on the store, and possibly block
until a condition is entailed or disentailed.
The if statement. The classical conditional statement reduces depending
on the value of its condition variable. The statement waits until the variable
equals true or false, then reduces accordingly:
( if x then S1 else S2 end | σ )  →  ( S1 | σ )    if σ |= x=true        (6.38)
( if x then S1 else S2 end | σ )  →  ( S2 | σ )    if σ |= x=false        (6.39)
The value of x is usually determined by a boolean function, like a comparison
operator. If x is different from true and false, the statement reduces by
raising an exception (see Section 6.4.2).
The case statement. It can be seen as a linguistic abstraction for pattern
matching, expressed in terms of a conditional statement, variable introduction,
and record operations Label and Arity. However the concept is important
enough to be presented with its semantic rules.
( case x of f(X1 . . . Xn) then S1 else S2 end | σ )  →  ( S1[X1/x1] · · · [Xn/xn] | σ )    if σ |= x=f(x1 . . . xn)        (6.40)
( case x of f(X1 . . . Xn) then S1 else S2 end | σ )  →  ( S2 | σ )    if σ |= x≠f(x1 . . . xn)        (6.41)
The pattern matches if the store entails the equality x=f (x1 . . . xn ), for some
variables x1 , . . . , xn . In case of a match, the statement reduces to S1 , where
the identifiers Xi are substituted by the corresponding variables xi in x. If
the store disentails any such equality, the statement reduces to S2 . If the store
does not contain enough information to decide one way or another, then the
statement cannot reduce.
Waiting for determinacy. We have defined above what it means for a store
to determine a variable. Waiting for the determination of a variable is the most
direct way to show the dataflow behavior of variables. It can be expressed
explicitly with the unary procedure Wait. Its semantics is extremely simple: it
reduces to skip once its argument is determined.
( {Wait x} | σ )  →  ( skip | σ )    if σ |= det(x)        (6.42)
6.3.5 Names and procedures
Names are unforgeable constants, and have therefore no textual representation.
They are useful to give a unique identity to a language entity like a procedure
or a cell. But they can also be used as first-class values by a programmer.
Such a value can be confined by lexical scope to the implementation of a data
structure, for keeping a feature hidden to the user. For instance, names are
used to define private methods in a class, which are by default only accessible
from within the class.
Names are created explicitly by the operation NewName. Its semantics are
given by the following reduction rule. Every fresh name is guaranteed to be
different from all other existing names and values. Names are created in a way
similar to variables. The semantic statement x=ξ is obtained by semantic rule
reduction only. This reduction clearly separates the name creation from the
binding of the variable x.
( {NewName x} | σ )  →  ( x=ξ | σ )    where ξ is a fresh name        (6.43)
Procedures. The proc statement creates a procedure in the store. The procedure value consists in a name ξ that is associated to a statement abstraction
in the procedure store by the pair ξ : λX1 . . . Xn .S. All the free identifiers of
S are in the set {X1 , . . . , Xn }. The name gives the procedure its identity.
Procedure application performs an ask to the store. For applying procedure
p, p must be equal to a name ξ that is associated to a statement abstraction.
Procedure application thus blocks if p is not determined by the store. Once the
procedure is known, the call reduces to the abstracted statement, where each
parameter is substituted by the corresponding argument in the call.
( proc {p X1 . . . Xn} S end | σ )  →  ( p=ξ | σ ∧ ξ:λX1 . . . Xn.S )    where ξ is a fresh name        (6.44)
( {p x1 . . . xn} | σ )  →  ( S[X1/x1] · · · [Xn/xn] | σ )    if σ |= p=ξ ∧ ξ:λX1 . . . Xn.S        (6.45)

6.3.6 By-need synchronization
Lazy evaluation, or demand-driven computation, is possible in Oz via the by-need synchronization mechanism. It works as follows. In a producer-consumer
scheme, the producer and the consumer are in separate threads, and share a
logic variable x. The producer simply blocks until a consumer requires x to
be determined in order to reduce. The mechanism that allows the producer to
detect the need of a consumer is called by-need synchronization. Note that the
mechanism is very general, and allows several producers and consumers for a
single variable.
The semantics is defined in terms of ask and tell on the by-need store. The
latter is an extension of the constraint store, and is monotonic as well, which
makes this language concept fully declarative. The predicate needed (x) is used
to synchronize producers and consumers: it is automatically told to the store by
the consumer if the determinacy of x is required for its reduction. The producer
uses the unary procedure WaitNeeded to synchronize on the entailment of the
predicate by the store.
In our proposal, we consider that determined variables are needed by convention. This simplifies the behavior of a variable. We identify three states,
which are ordered in this way: free, needed, and determined. State transitions follow that order, which ensures the monotonicity of the store. Moreover,
this convention disambiguates a producer-consumer situation where a consumer
would bind the shared variable. The variable automatically becomes needed,
and the producer is woken up. We provide this property with the following
inference rule.
σ |= det(x)
──────────────        (6.46)
σ |= needed(x)
Now consider a statement S. We define needed(S) as the set of variables
which must be determined for S to be executable.
x ∈ needed(S)   iff   for every store σ: if S is executable with σ, then σ |= det(x)        (6.47)
The condition can also be expressed as: the statement S cannot reduce in a
configuration where x is not determined. The definition directly applies to the
Wait statement: x ∈ needed ({Wait x}). The variable x is also needed by the
statements “if x . . .” and “case x . . .”.
The first reduction rule below describes how the predicate needed (x) is told
to the store. The second rule states that the statement {WaitNeeded x} asks
the store for the predicate needed (x), and reduces to skip once it is entailed.
( S | σ )  →  ( S | σ ∧ needed(x) )    if x ∈ needed(S) and σ ⊭ needed(x)        (6.48)
( {WaitNeeded x} | σ )  →  ( skip | σ )    if σ |= needed(x)        (6.49)
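As a small illustration of rules (6.48) and (6.49), the following fragment (ours) shows a producer suspended on WaitNeeded being woken up by a consumer:

   local X in
      thread {WaitNeeded X} X=42 end   % producer: suspends until X is needed
      {Wait X} {Show X}                % consumer: makes X needed, then shows 42
   end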
6.4 Nondeclarative extensions
6.4.1 Nondeterministic wait
The function WaitTwo takes two arguments, and returns the number of the
argument that is determined (1 or 2). The returned value is nondeterministic
in case both arguments are determined. It can be used for merging streams,
for instance.
( {WaitTwo x y z} | σ )  →  ( z=1 | σ )    if σ |= det(x)        (6.50)
( {WaitTwo x y z} | σ )  →  ( z=2 | σ )    if σ |= det(y)        (6.51)
6.4.2 Exception handling
We first introduce the try statement. In order to simplify the management of
the scope defined by the statement, we consider a catch statement, which can
only be obtained by the reduction of the first rule below. The catch statement
itself reduces to skip. These two rules model all executions where no exception
is raised.
( try S1 catch X then S2 end | σ )  →  ( S1 (catch X then S2 end) | σ )        (6.52)
( catch X then S2 end | σ )  →  ( skip | σ )        (6.53)
Consider now the raise statement. This statement is either written explicitly in the program, or is obtained by a reduction rule in case of an error.
For instance, an if statement reduces to a raise statement if the condition
variable is not of type boolean. The effect of the raise statement is to skip all
statements after it, except a catch statement.
( raise x end (catch X then S2 end) T | σ )  →  ( S2[X/x] T | σ )        (6.54)
( raise x end S T | σ )  →  ( raise x end T | σ )    if S is not a catch statement        (6.55)
This simple model works fine with any number of nested try statements, and
reflects well that the scope defined by the statement only covers the current
thread. An exception in a thread cannot be caught by another thread.
Failed values. Those special values provide a way to transmit exceptions
from one thread to another. A failed value y encapsulates an exception x,
and is represented in the store by y=failed (x). It is created by the operation
FailedValue.
( {FailedValue x y} | σ )  →  ( y=failed(x) | σ )        (6.56)
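As an illustration of how a failed value forwards an exception between threads (SomeComputation is hypothetical, and the rules governing the consumer side follow below), consider:

   local X in
      thread
         try X={SomeComputation}          % hypothetical computation that may raise
         catch E then {FailedValue E X}   % store the exception in X as a failed value
         end
      end
      try {Wait X}                        % needing X re-raises the stored exception
      catch E then {Show caught(E)} end
   end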
If a statement S needs a failed value, S immediately reduces by raising the
exception. The exception is also raised if the statement tries to bind the value.
( S | σ )  →  ( raise x end | σ )    if y ∈ needed(S) and σ |= y=failed(x)        (6.57)
( u=v | σ )  →  ( raise x end | σ )    if σ ∧ u=v |= det(y) and σ |= y=failed(x)        (6.58)

6.4.3 Read-only views
Oz provides a useful concept for protecting data structures from accidental
bindings from the user. This protection allows a user to read a variable without
being able to bind it. The idea is to pair two variables x and y by making y a
read-only view of x. We write this pairing as y=view (x). Such a pair is created
by the “bang bang” operator !!.
( y=!!x | σ )  →  ( y=z | σ ∧ z=view(x) )    where z is a fresh variable        (6.59)
Once the variable x is determined, being a view of x implies being equal to x.
This property is given by the following inference rule. Note that it could be
used to drop views from the store, and replace them by equalities: when x is
determined, y=view (x) is replaced by y=x.
σ |= y=view(x)    σ |= det(x)
─────────────────────────────        (6.60)
σ |= y=x
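A small illustration (ours) of the !! operator together with rule (6.60):

   local X Y in
      Y=!!X                          % Y is a read-only view of X
      thread {Wait Y} {Show Y} end   % this thread can read Y but cannot bind it
      X=42                           % determining X determines Y; 42 is shown
   end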
Preventing unification. As read-only views cannot be determined before
their variable, we have to strengthen the condition for binding a variable during
unification, and make sure that we never bind a read-only view of a variable.
The rule (6.36) is rewritten as
( u=v | σ )  →  ( u=v | σ ∧ x=t )    if σ ∧ u=v |= x=t, σ ⊭ det(x), and for all y, σ ⊭ x=view(y)        (6.61)
Views and by-need synchronization. Read-only views can be used to
protect lazy computations, provided that a variable becomes needed when its
view is needed. The following inference rule on the store does the job.
σ |= y=view(x)    σ |= needed(y)
────────────────────────────────        (6.62)
σ |= needed(x)
Just like a determined variable is needed, one can expect that any attempt to
determine a view by unification makes the view needed. Although it does not
exactly fit our definition of needing a variable, we propose the following rule,
which makes views needed in case of unification.
( u=v | σ )  →  ( u=v | σ ∧ needed(y) )    if σ ∧ u=v |= det(y), σ |= y=view(x), and σ ⊭ needed(y)        (6.63)
6.4.4 State
All stateful entities can be built on top of cells. The semantics of cells will
therefore serve as a reference for all stateful entities with synchronous operations: arrays, dictionaries, etc. Ports can also be built on top of cells. However
we will consider a fully asynchronous version of the Send operation, which will
be given a specific distributed semantics.
A cell is semantically defined as a name associated to a state in the stateful
store. If ξ is the name of the cell, the stateful store contains the predicate
ξ:x, where the variable x is the current state of the cell. The creation of a cell
consists in creating a name, and adding an initial state for it in the stateful
store.
{NewCell x c} / σ   →   c=ξ / σ ∧ ξ:x        where ξ is a fresh name                         (6.64)
Just as for names, the statement reduces to unifying the cell variable with the name:
c=ξ. This separates the creation of the cell from the binding of the variable c.
Synchronous operations. All synchronous operations can be modeled as
a cell exchange operation. This operation possibly changes the state of the
entity, and returns its former state. The semantics of the operation is given by
the following reduction rule.
x=c:=y / σ ∧ ξ:w   →   x=w / σ ∧ ξ:y        if σ |= c=ξ                                      (6.65)
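As an aside (a sketch only, not part of the formal model), the usual access and
assignment operations on cells can indeed be expressed with the exchange statement
alone, which is why the rule above suffices as a reference:

   fun {Access C}
      X in
      X = C := X      % exchange the state with a fresh X, then unify X with the old state
      X
   end

   proc {Assign C Y}
      _ = C := Y      % discard the former state
   end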
Asynchronous operations. The operation Send on ports will serve as a
reference for asynchronous operations. Its specificity is that the statement reduction
and the state update are not necessarily made together: the statement
reduction corresponds to the message being sent, and the state update to the
message being received.
Let us first propose a definition of ports on top of cells. The port is defined
as a cell that contains a part of the stream of received messages (its current tail).
This reference is used to extend the stream as new messages arrive. The stream of
messages does not reveal the tail itself, but a read-only view instead. This guarantees
that only the port abstraction can add messages to the stream.
proc {NewPort S P}
T in P={NewCell T} S=!!T
end
proc {Send P X}
T in X|!!T = P := T
end
The semantics of NewPort is derived from its code. However the semantics
given by this definition of Send is not satisfactory. This definition is actually
synchronous. It works perfectly in a centralized setting. It has the observable
property that all messages sent from a given thread are received in the order
they were sent. In other words, each thread imposes a partial order on the
reception of its own messages. We call this property the sender ordering.
Let us propose a fully asynchronous definition of Send. The definition below
allows messages to arrive in any order. We will use this definition as a reference.
proc {Send P X}
thread T in X|!!T = P := T end
end
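For illustration, here is a minimal usage sketch of the port abstraction defined above
(the messages foo and bar are arbitrary atoms):

   declare S P in
   {NewPort S P}
   {Send P foo}
   {Send P bar}
   % S is a read-only stream of the received messages; with the asynchronous
   % Send, foo and bar may appear on it in either order.
   case S of First|_ then {Show First} end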
Let us provide semantic rules that reflect well the semantics of the asynchronous
Send. The created thread is modeled as a special thread ξ⇐x, which represents
the message being sent. This special thread reduces upon message reception,
which adds the message to the message stream.
{Send p x} T / σ   →   T, ξ⇐x / σ        if σ |= p=ξ                                         (6.66)

ξ⇐x / σ ∧ ξ:t   →   ∅ / σ ∧ ξ:t′ ∧ t=x|s ∧ s=view(t′)        if s, t′ are fresh variables    (6.67)
Note that the real semantics of Send satisfies the sender ordering property.
This property is clearly not satisfied by our semantic rules: enforcing it would
require modeling one message queue per port and per thread. We have chosen to keep
the semantics simple, as we believe that this refinement is not essential to the
operation itself.
7 Distributed semantics
The operational semantics we give in this chapter refines the centralized semantics given in the former chapter. The refinement is defined in the following
way: every distributed configuration D maps to a centralized configuration C,
and every distributed reduction D →d D′ maps to a valid centralized reduction
C →c C′. The identity reduction, where C=C′, is considered valid. This kind of
property is usually visualized by a commutative diagram like

    C  →c  C′      (centralized)
    ↑       ↑
    D  →d  D′      (distributed)
The semantics should reflect some aspects of the distribution, like partial
failures and network latency. Those are necessary to reason about the program.
The semantics should also reflect the distribution strategy of entities. Not all
entities are distributed the same way. For instance, stateful entities allow at
least three schemes (centralized, mobile, and replicated). Each strategy should
be clearly identifiable in the semantics, but they all must map to the same
centralized semantics.
That being said, the semantics should abstract as much as possible the
details which are not relevant for the programmer. For instance, we do not
describe how failures are detected in practice, but we give conditions that failure detectors must satisfy. Neither do we specify how communication takes
place over the network, how data are serialized, or how Oz names are guaranteed unique across machines. Those issues are supposed to be solved for the
programmer. What the semantics give are the elements that the programmer
must be aware of, like network and site failures, and the elements that are under the programmer’s control, like the distribution strategy and failure modes
of language entities.
Sections 7.1 and 7.2 define store extensions that reflect the sites, the network, and how entities are distributed among sites. The semantics of Annotate
is also given there. Sections 7.3 and 7.4 give the distributed semantics for the
declarative and nondeclarative parts of the kernel language, respectively. Section 7.5 gives the semantics of the fault stream and the operations Kill and
Break. Section 7.6 gives the mapping from distributed to centralized semantics.
7.1 Reflecting network and site behavior
The basic principle of a distributed semantics is to incorporate some information about the network and sites in the system. The semantics implicitly
provides a formal model of the program environment. The advantage is to
reflect site and network behavior at the programming language level, so that
the programmer can explain or predict the effect of environment changes on
his or her program.
7.1.1 Locality
The very first consequence of distributing a program is to introduce a notion
of locality. Each site in the system has only a partial view of the whole store.
Though the network transparency aims at abstracting this fact, it is essential
for reflecting performance and failure issues. We consider that a distributed
configuration is like a centralized configuration, where each thread and each
store element is tagged with a site identifier. Consider for instance
(y =x+1 z =y *2)a , (w=z >100)b
(x=42)a ∧ (x=42)c ∧ . . .
The first thread runs on site a, while the second thread runs on site b. The
store contains at least the constraint x=42, which is present on both sites a
and c. Dropping the site index gives an equivalent centralized store.
The local configuration of a site a in the system is simply the restriction
of the distributed configuration to the elements indexed by a, that we denote
T |a /σ|a . The operator |a (“at a”) is defined by the following equations. Both
β and γ denote predicates, but γ has either a subscript different from a or no
subscript (like the predicate a↔b defined in the next section).
(T, T′)|a = T|a, T′|a          (σ ∧ σ′)|a = σ|a ∧ σ′|a
Ta|a = T                        βa|a = β
Tb|a = ∅                        γ|a = ⊤                                                      (7.1)
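As a quick check of these equations against the example configuration above: T|a is
the single thread y=x+1 z=y*2 and σ|a is x=42, while T|b is the thread w=z>100 and,
considering only the constraints shown, σ|b is ⊤.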
7.1.2 Network failures
Let us first enrich the store with information about network links. A network
link between sites a and b is operational when the predicate a↔b is present in
the store. We consider for the sake of simplicity that those links are bidirectional.
   σ |= a↔b
   ---------                                                                                 (7.2)
   σ |= b↔a
The rules below temporarily cut network links, and restore them. We consider
that those rules are triggered by the system itself. They define valid state
transitions for the model of the environment.

σ ∧ a↔b   →   σ              σ   →   σ ∧ a↔b        if σ ⊭ a↔b                               (7.3)

7.1.3 Site failures
The notion of locality above is expressed in terms of site. We now model site
failures in the semantics. Remember that a site failure is of the kind crash-stop.
Its effect is simply to drop the part of the configuration that is specific to that
site. It is described by the global reduction rule
T / σ   →   T↓a / σ↓a                                                                        (7.4)
where the operator ↓ a (“down a”) is defined by the following equations, with
the same convention as above for γ.
(T, T′) ↓ a = (T ↓ a), (T′ ↓ a)          (σ ∧ σ′) ↓ a = (σ ↓ a) ∧ (σ′ ↓ a)
Ta ↓ a = ∅                                βa ↓ a = ⊤
Tb ↓ a = Tb                               γ ↓ a = γ                                          (7.5)
Site failures have no synchronous effect on other sites. Therefore the operators ↓ and | have the following properties. The first states that a site failure
removes everything that is specific to the site, and the second states that other
sites are not affected by the failure.
(C ↓ a)|a = ⊤                                                                                (7.6)
(C ↓ a)|b = C|b                                                                              (7.7)
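Continuing the example of Section 7.1.1, a crash of site a leaves the thread
(w=z>100)b and, of the constraints shown, only (x=42)c: everything tagged with a has
disappeared, while the other sites keep their part of the configuration unchanged.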
7.2 Reflecting entity behavior
In order to handle distributed entities, we introduce three extra ingredients
that reflect parts of their behavior. Those elements are mostly independent
from the type of entity.
7.2.1 Entity failures
In order to reflect entity failures, we introduce the predicate alive(e) in the
predicate store. This predicate is put in the store at the creation of e, and
its absence means the permanent failure of e. It may occur at most once per
entity in the whole store, and is localized on a given site (its coordination site).
Note that it applies to variables as entities, and that alive(x) is not equivalent
to alive(ξ), even if x=ξ in the store. The principle of substitution by equals
does not apply to the predicate alive.
Remember that each site maintains a current fault state for each entity in
the system. We assume that on every site a, the store entails one equality like
(fstate(e)=s)a , which states that a considers entity e to be in fault state s. The
precise definition of how the store entails that fact and modifies it, is given in
Section 7.5. The principle of substitution by equals does not apply to fstate.
An entity e is correct if and only if the distributed store contains alive(e)a
for some site a. Removing the predicate automatically makes e permanently
failed. For a site b to perform an operation on e, we will require e to be correct,
accessible, and not locally failed on b. Note that this condition is necessary
but not sufficient. In order to abstract this condition a bit, we introduce the
predicate correct on b, which is defined by the following equation.
correct(e, a)b ≡ alive(e)a ∧ a↔b ∧ (fstate(e)=ok)b                                           (7.8)
Let us comment a bit on each of the conditions required by correct(e, a)b .
• The failure of site a causes the predicate alive(e)a to be dropped from
the store; this effectively prevents any further operation on e that would
require it to be correct. This property is enough to model the blocking
behavior of operations on failed entities.
• The second condition states that site b must be able to communicate with
site a. This is because making a consistent update of the entity generally
requires some synchronization with the coordination site of the entity.
• The third condition states that b considers e to be in fault state ok. This
means that b must be consistent with itself. This is necessary, as b may
consider e to be locally failed.
7.2.2 Entity annotations
Some entities have several alternatives for their distributed semantics. Which
alternative is used depends on an entity’s annotation. Every entity annotation
is visible in the store as a predicate, like stationary(e). Note that the predicate
is not localized to a site, which means that any site referring to the entity
should know its annotations.
7.2 Reflecting entity behavior
97
Let us consider protocol annotations. We define the predicates stationary,
migratory, replicated, variable, reply, immediate, eager, lazy. They are idempotent and mutually inconsistent for a given entity. For instance, we have
variable(x) ∧ variable(x) ≡ variable(x)
stationary(ξ) ∧ replicated(ξ) ≡ ⊥
Access architecture annotations are defined in a similar way, and so are reference
consistency protocol annotations, except that such an annotation specifies a subset
of the provided protocols. Note also that annotations may be inconsistent if
they are applied to the wrong type of entity. A cell cannot be annotated with
variable, for instance.
We also use the predicate annot(e, v) as an alternative notation to state that
entity e is annotated with v. Each predicate mentioned above corresponds to
exactly one value v:
annot(e, stationary) ≡ stationary(e)
annot(e, migratory)  ≡ migratory(e)
        ...
annot(e, lazy)       ≡ lazy(e)
Setting annotations. Here we define how annotations are set on entities.
The first rule tells the annotation of the entity to the store. There is a similar
rule, which we don’t mention here, that raises an exception if the annotation
is inconsistent.


({Annotate e t})a / σ   →   (skip)a / σ ∧ annot(e, v)
      if σ |= alive(e)a, σ|a |= t=v for a value v, and σ ∧ annot(e, v) is consistent         (7.9)
The second rule defines the effect of a default annotation: if the entity is shared
by more than one site and no annotation was specified, a default one is picked
at the home site of the entity.
σ   →   σ ∧ annot(e, v)      if σ|b refers to e, σ |= alive(e)a,
                              v is default for e on a,
                              and σ ∧ annot(e, v) is consistent                              (7.10)
The condition “σ|b refers to e” means that e occurs in a predicate or an equality
in the store σ|b .
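As an illustration (a sketch only; we assume here that protocol annotations are
passed as the corresponding atoms), a program can select distribution strategies for
its entities right after creating them:

   declare C S P in
   C={NewCell 0}
   {Annotate C stationary}    % every exchange on C is performed at its home site
   {NewPort S P}
   {Annotate P migratory}     % the port state may migrate between sites

If no annotation is given before the entity becomes distributed, rule (7.10) picks the
default one at the entity's home site.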
7.3 Declarative kernel language
We now give the distributed semantics of language statements.
7.3.1 Purely local reductions
The rules that do not modify the store require very small adaptation to the
distributed case. Basically the threads must be localized on a site, and the
condition must be evaluated with the local store σ|a , where a is the site of the
reduced thread.
This is the case for the sequential and concurrent composition. Notice that
the thread statement creates a thread on the site where the statement reduces.
The conditional statements and procedure application are also extended in this
way.
7.3.2 Variable introduction and binding
The rules that introduce and bind variables require an extra adaptation. The
creation of a variable x on a site a automatically introduces the predicate
alive(x)a in the store:
(local X in S end)a / σ   →   (S[X/x])a / σ ∧ alive(x)a      where x is a fresh variable     (7.11)
This predicate is necessary for binding the variable, as shown in the rules
below. The first rule binds x on its coordination site first, while the second
rule is responsible for propagating the basic constraint from the coordination
site to the other sites. Note that the binding x=t proposed by site b depends
on its local store σ|b .


(u=v)b / σ   →   (u=v)b / σ ∧ (x=t)a      if σ|b ∧ u=v |= x=t, σ|a ⊭ det(x),
                                           and σ |= correct(x, a)b                           (7.12)

σ ∧ (x=t)a   →   σ ∧ (x=t)a ∧ (x=t)b      if σ|b refers to x, σ|b ⊭ x=t,
                                           and σ |= correct(x, a)b                           (7.13)
Notice how the latter rule propagates references to t on all sites b that refer to
the variable x.
7.3.3 Procedure creation and copying
Procedures do not require much adaptation, since they are values. However, we
should model the copying of the value from site to site. The value is copied at
most once per site, which keeps the local stores consistent. In the rules below,
P is the abstraction λX1...Xn.S.

(proc {p X1 ... Xn} S end)a / σ   →   (p=ξ)a / σ ∧ (ξ:P)a      where ξ is a fresh name       (7.14)

σ ∧ (ξ:P)a   →   σ ∧ (ξ:P)a ∧ (ξ:P)b      if σ|b refers to ξ,
                                           σ |= eager(ξ) ∧ a↔b, and σ ⊭ (ξ:P)b               (7.15)
The latter rule propagates the abstraction P on all sites that refer to ξ. As a
consequence, all the references in the procedure body S are also propagated on
those sites.
Lazy copying. If the procedure is annotated with lazy, the copy of the
abstraction should be done lazily. The reduction rule for copying is similar to
the one above, except that it should be reducible only when it is needed on site
b, i.e., when a thread on b tries to call it.


Sb / σ ∧ (ξ:P)a   →   Sb / σ ∧ (ξ:P)a ∧ (ξ:P)b      if p ∈ needed(S), σ|b |= p=ξ,
                                                     σ |= lazy(ξ) ∧ a↔b, and σ ⊭ (ξ:P)b      (7.16)
7.3.4 By-need synchronization
This mechanism is easy to extend in the distributed case, since making variable
x needed consists in telling the constraint needed (x) everywhere in the system.
The two existing rules are extended as purely local rules, such that needed (x)
is told and asked locally. The additional rules below propagate the predicate
needed (x) to all sites via the coordination site, provided the variable is correct.
σ   →   σ ∧ needed(x)a      if σ |= correct(x, a)b ∧ needed(x)b and σ ⊭ needed(x)a           (7.17)

σ   →   σ ∧ needed(x)b      if σ |= correct(x, a)b ∧ needed(x)a and σ ⊭ needed(x)b           (7.18)
7.4 Nondeclarative extensions
We now complete the distributed semantics of Oz by extending the semantics of
nondeclarative language features to the distributed case. Like in the centralized
case, those features admit a semantics that is mostly compositional with respect
to the declarative part of the language.
7.4.1 Exception handling and read-only views
The exception mechanism interacts with thread execution. Its reduction rules
are extended to purely local reductions. Failed values are also reduced locally,
and are copied from site to site just like ordinary values.
Read-only views are also handled locally. The basic constraint x=view (y)
is handled like an ordinary binding, and copied on all sites that refer to x. The
binding rule (7.12) is extended with the extra condition
for all y, σ|a ⊭ x=view(y).
7.4.2 State
Stateful entities are somewhat richer when it comes to their distribution. As
we have seen already, several strategies are possible for maintaining their state.
We consider three strategies here, and each has its own semantics: stationary
state, migratory state, and replicated state. The properties of those strategies
have been discussed in Chapter 3. Our concern in this chapter is that all
variants are refinements of the centralized semantics.
Let us first extend the cell creation semantics. The new reduction rule is
pretty straightforward: we locate the state on the creation site, together with
the predicate alive(ξ). The cell is distributed once at least two sites refer to the
name ξ.

({NewCell x c})a / σ   →   (c=ξ)a / σ ∧ alive(ξ)a ∧ (ξ:x)a      where ξ is a fresh name      (7.19)
Before we go into the details of the state operations, we have to describe
how the state “becomes” distributed among sites. In the case of stationary or
migratory state, nothing special is needed. The state is already present on the
right site. Replicated state needs some extra support for distributing the state.
All we need is one rule that copies the state from its home site a to every other
site b that refers to the entity.


σ ∧ (ξ:x)a   →   σ ∧ (ξ:x)a ∧ (ξ:x)b      if σ|b refers to ξ,
                                           σ |= correct(ξ, a)b ∧ replicated(ξ),
                                           and σ ⊭ (ξ:x)b                                    (7.20)
Synchronous operations. We now give three semantic rules for the cell
exchange operation, each rule reflecting the cell’s possible distribution strategy.
The first rule shows a stationary cell: the state is located at site a. The
operation is performed at a.
The second rule shows a migratory cell. The operation reduces once the
state moves to b, coming from another site b′. The semantics does not tell how b
and b′ are chosen; this is left to the actual protocol. But there is an interesting
case when b = b′ : the state remains on b and can be updated without any
network operation. This is the “caching” behavior of the migratory state.
The third rule shows a cell whose state is replicated on several sites. In the
rule, the symbol ∗ denotes the set of sites that have a copy of the cell’s state.
One can see that updating the state is costly: it requires the home site of the
cell to communicate with all the state replicas. However, if an operation does
not change the state, it can be performed on the local copy of the state.
(x=c:=y)b / σ ∧ (ξ:w)a   →   (x=w)b / σ ∧ (ξ:y)a
      if σ|b |= c=ξ and σ |= correct(ξ, a)b ∧ stationary(ξ)                                  (7.21)

(x=c:=y)b / σ ∧ (ξ:w)b′   →   (x=w)b / σ ∧ (ξ:y)b
      if σ|b |= c=ξ, σ |= correct(ξ, a)b ∧ migratory(ξ), and σ |= b′↔b                       (7.22)

(x=c:=y)b / σ ∧ (ξ:w)∗   →   (x=w)b / σ ∧ (ξ:y)∗
      if σ|b |= c=ξ, σ |= correct(ξ, a)b ∧ replicated(ξ), and σ |= a↔∗                       (7.23)
All those rules have an interesting special case. If site b has the state and
performs a read operation, none of the other sites is affected, and the state can
be read locally. The only condition is that the local fault state of the entity
must be ok. For this case all rules can be simplified to
(x=@c)b / σ ∧ (ξ:w)b   →   (x=w)b / σ ∧ (ξ:w)b
      if σ|b |= c=ξ and σ |= (fstate(ξ)=ok)b                                                 (7.24)
Asynchronous operations. Just like in the centralized semantics, the Send
operation reduces immediately by sending a message p⇐x.
({Send p x} T)b / σ   →   Tb, (ξ⇐x)b / σ        if σ|b |= p=ξ                                (7.25)
The first rule below shows the case of the stationary port. The message is
received by the coordination site of the entity. Once delivered, the operation
is performed, and the binding of the stream is visible globally. The second
rule considers a mobile port. In that case, the message is kept locally until the
state arrives at the site; the operation is then performed locally. The third rule
considers a replicated port. All the copies of the state are atomically changed.
(ξ⇐x)b / σ ∧ (ξ:t)a   →   ∅ / σ ∧ (t=x|s)a ∧ (ξ:t′)a ∧ alive(t′)a ∧ (s=view(t′))a
      if σ |= correct(ξ, a)b and σ |= stationary(ξ); s, t′ are fresh variables               (7.26)

(ξ⇐x)b′ / σ ∧ (ξ:t)b   →   ∅ / σ ∧ (t=x|s)b ∧ (ξ:t′)b′ ∧ alive(t′)b′ ∧ (s=view(t′))b′
      if σ |= correct(ξ, a)b, σ |= migratory(ξ), and σ |= correct(t, b)b′;
      s, t′ are fresh variables                                                              (7.27)

(ξ⇐x)b / σ ∧ (ξ:t)∗   →   ∅ / σ ∧ (t=x|s)a ∧ (ξ:t′)∗ ∧ alive(t′)a ∧ (s=view(t′))a
      if σ |= correct(ξ, a)b, σ |= replicated(ξ), and σ |= a↔∗;
      s, t′ are fresh variables                                                              (7.28)
State failure. An important property of cells is that they fail once their state
is lost. In other words, if the store σ has no occurrence of a state predicate ξ:z
anywhere, the cell ξ must fail. In both the stationary and replicated protocols,
this situation follows from the failure of the coordinator site of the cell. But
in the migratory protocol, we have to add a specific rule that removes the
predicate alive(ξ)a from the store:
σ ∧ alive(ξ)a   →   σ        if σ ⊭ (ξ:z)b for all z and b                                   (7.29)

7.5 Failure handling
7.5.1 Failure detectors
We provide a generic rule that reflects how the system may update the fault
stream of an entity on a site. Upon creation, every entity e in the system has
a fault stream on each site a, which is described in the store as the predicate
(fs(e)=s|t)a , where s is the current fault state of e, and t is the tail of the
stream. The latter is a read-only view, but for the sake of simplicity we will
treat it as a plain logic variable. The programmer can access it by calling
GetFaultStream:
({GetFaultStream e x})a / σ   →   (x=s|t)a / σ        if σ |= (fs(e)=s|t)a                   (7.30)
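To give an idea of how this is used in a program (a sketch only; Monitor is a
hypothetical helper), one can spawn a thread that reads the fault stream of an entity
and reacts to every fault state it reports:

   proc {Monitor E}
      FS in
      {GetFaultStream E FS}
      thread
         for S in FS do       % S is ok, tempFail, localFail, or permFail
            {Show S}
         end
      end
   end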
The fault stream of e on a is updated by the rule
σ ∧ (fs(e)=s|t)a   →   σ ∧ (fs(e)=t)a ∧ alive(t′)a ∧ (t=s′|t′)a
      if t′ is a fresh variable and cond(s, s′)                                              (7.31)
where the condition cond(s, s′ ) is defined below. The site h is the coordination
site of e (its home site). Note that s in (7.34) must be different from permFail.
cond(ok, tempFail)    iff  σ ⊭ alive(e)h or σ ⊭ a↔h                                          (7.32)
cond(tempFail, ok)    iff  σ |= alive(e)h and σ |= a↔h                                       (7.33)
cond(s, permFail)     iff  σ ⊭ alive(e)h and σ |= a↔h                                        (7.34)
cond(s, s′) is false otherwise                                                               (7.35)
As you can see in the first condition, a temporary failure for an entity e can
be reported on a site a when either the entity actually failed, or the network
link between a and h is down. The third condition states that detecting the
permanent failure of an entity requires site a to be able to reach the home site
of the entity. This condition is not fulfilled in general if the site h has crashed.
However, it can happen in certain cases, for instance if sites a and h are on the
same local area network (LAN). The operating system may report the crash of
the process corresponding to site h.
The current fault state of an entity on a given site, as it is defined in
Section 7.2 on page 95, is derived from its fault stream on that site. We define
it with the following inference rule.
   σ |= (fs(e)=s|t)a
   -------------------                                                                       (7.36)
   σ |= (fstate(e)=s)a
Fault stream of variables. As stated in Section 4.2.3, the fault streams of
unified variables are merged. In order to define which one is bound to the
other, we assume that all variables in the system are ordered by a relation ≺.
This order is used by the system, but is not directly available to the programmer.
The semantic rules below therefore give the possible ways to merge. The last
rule finalizes a variable's fault stream when the variable is determined.
σ ∧ (fs(x)=s|t)a ∧ (fs(y)=s′|t′)a   →   σ ∧ t=t′ ∧ (fs(y)=s′|t′)a
      if σ|a |= x=y ∧ x≺y ∧ s=s′                                                             (7.37)

σ ∧ (fs(x)=s|t)a ∧ (fs(y)=s′|t′)a   →   σ ∧ t=s′|t′ ∧ (fs(y)=s′|t′)a
      if σ|a |= x=y ∧ x≺y ∧ s≠s′                                                             (7.38)

σ ∧ (fs(x)=s|t)a   →   σ ∧ t=nil        if σ|a |= det(x)                                     (7.39)
Creation and finalization of the fault stream. Assuming that every site
maintains a fault stream for every entity in the system may be misleading.
Indeed, this does not take into account the fact that a site may forget some
information about an entity (see the discussion of Section 4.4 on page 54). So
we propose the following two rules for creating and finalizing the fault stream
of an entity e on a site a. The concept of liveness of an entity is the usual
one used by garbage collectors. For the sake of conciseness, we skip its formal
definition.


T / σ   →   T / σ ∧ (fs(e)=ok|t)a      if e is alive in T|a/σ|a, σ ⊭ (fs(e)=...)a,
                                        and t is a fresh variable                            (7.40)

T / σ ∧ (fs(e)=s|t)a   →   T / σ ∧ t=nil        if e is not alive in T|a/σ|a                 (7.41)

7.5.2 Making entities fail
In this section, we define the operations Kill and Break.
Global failure. The operation Kill should make its argument fail, i.e., it
should remove the predicate alive(x)a from the store, where x is the argument
of the call. As the operation is asynchronous, we use a “kill” message x⇐†
that is similar to the messages used in Section 7.4.2.
({Kill x} T)b / σ   →   Tb, (x⇐†)b / σ                                                       (7.42)

(x⇐†)b / σ ∧ alive(x)a   →   ∅ / σ        if σ |= a↔b                                        (7.43)
Notice that the latter rule states explicitly that communication with the coordinator site of x is necessary to make x permanently failed.
Local failure. The procedure Break is pretty easy to define. Its effect is to
change the fault state of the entity to localFail, unless the fault state already
has that value or permFail.
({Break x})a / σ ∧ (fs(e)=s|t)a   →   (skip)a / σ ∧ (fs(e)=t)a ∧ alive(t′)a ∧ (t=localFail|t′)a
      if σ |= s=ok or σ |= s=tempFail                                                        (7.44)

({Break x})a / σ ∧ (fs(e)=s|t)a   →   (skip)a / σ ∧ (fs(e)=s|t)a
      if σ |= s=localFail or σ |= s=permFail                                                 (7.45)
If site b has locally killed an entity e, the predicate correct(e, a)b will never be
entailed by the store. Hence, all operations that require e to be correct on b
block forever.
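A minimal sketch of these operations in use, tying them to the fault stream of the
previous section (the cell and its initial value are arbitrary):

   declare C FS in
   C={NewCell 0}
   {GetFaultStream C FS}
   thread for S in FS do {Show S} end end   % typically prints ok, then permFail
   {Kill C}       % global, permanent failure: alive(C) is eventually dropped
   % {Break C} would instead only set the local fault state to localFail,
   % so that further local operations on C block forever.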
7.6 Mapping distributed to centralized configurations
The definition we gave of a refinement on page 93 states that every distributed
configuration maps to a centralized configuration. This section defines precisely
that mapping.
7.6.1
The mapping
The mapping itself is pretty easy to define. Basically we collect the configurations of all sites in the system. We define it in terms of the operator |a . The
disjoint union and conjunction operators range over the set of all sites in the
system.
centralized( T / σ )  =  (⊎a T|a) / (∧a σ|a)                                                 (7.46)
There is only a small issue with the conjunction of the local stores. They
are always consistent with each other, except for the entities’ fault streams,
which can be in different states. However, we can consider that a centralized
configuration possibly has several fault streams for a given entity. It is as if
the centralized system maintained several failure detectors for each entity,
which can have different views.
7.6.2 Network transparency
With this definition we can now formulate a theorem which relates the distributed and centralized semantics of the language Oz. As this property expresses network transparency at the semantic level, we call it the Network
transparency theorem. The theorem can be proven by induction on every distributed reduction rule.
Theorem (Network transparency). The distributed semantics of the language
is a refinement of its centralized semantics. In other words, for every pair of
distributed configurations D and D′ , if D → D′ is a valid distributed reduction,
then centralized (D) → centralized (D′ ) is a valid centralized reduction.
8 Implementation
This chapter describes the new implementation of the distribution of Oz, based
on the Distribution Subsystem (DSS). This work has been achieved by several
developers, among whom Erik Klintskog, Zacharias El Banna, Boris Mejías,
and myself. In order to make a clear distinction between the former and new
implementations, we will refer to them as “Mozart” and “Mozart/DSS”, respectively.
In Section 8.1, we explain the architecture, and some principles underlying Mozart/DSS. We show there that the implementation splits up into three
distinct layers. The topmost layer is the virtual machine, which has been modified as little as possible in order to take distribution into account. Section
8.2 describes the bottom layer, which provides all the distributed protocols.
Section 8.3 describes the middle layer, also known as the Glue, that interfaces
the virtual machine to the DSS layer.
8.1 Architecture of Mozart/DSS
The platform Mozart/DSS is a new implementation of the distribution of Oz,
based on the virtual machine of Mozart and the library DSS. The latter provides abstractions for distributing programming language entities. The general
architecture for a distributed entity is depicted in Figure 8.1 on the following
page. The diagram shows the fundamental components implementing an entity that is shared among three sites. The components are divided up into sites,
separated by bold vertical lines, and into implementation layers, separated by
dashed horizontal lines.
All language entities, local or distributed, are stored in virtual machine
heaps. A distributed entity can be seen as a set of local entities connected
together via the network, and cooperating in order to provide the illusion of a
single global entity to the programmer. Note that an entity in the heap might
[Figure 8.1: The three layers that implement the distribution in Mozart/DSS (virtual machine, Glue layer, and Distribution Subsystem), shown for one entity shared among three sites.]
be only a stub, i.e., the state of the entity is not available locally. In that case,
it is necessary for that site to cooperate with other sites in order to complete
a language operation.
A distributed entity has a special hook that connects it to a proxy in the DSS
library. The proxies of a given entity are connected together with a coordinator
via network links. The coordinator and proxies form the coordination network
of the entity, which has a unique global identity. It is used to identify the
language entity across site boundaries. The coordination network implements
the access architecture of the entity, which we introduced in Section 3.3.4.
The role of each layer. Each layer in the implementation plays a specific
role. The virtual machine layer implements the entity’s centralized semantics.
The DSS layer provides global naming for entities, a general serialization mechanism for user data, a set of selectable protocols implementing generic entity
operations, a distributed garbage collector with several protocols available, and
failure detectors for sites and entities.
The Glue layer implements the distributed semantics of entity operations by
mapping them to DSS entity operations. It implements the failure semantics of
entities, and makes both the local and distributed garbage collectors cooperate.
It also provides network communication channels for the DSS.
The author’s contributions. A prototype of the Glue layer was given to
us by Erik Klintskog. We revised its design, and implemented it entirely, with
a little help from our colleague Boris Mejías. Together with Boris we modified
the Mozart marshaler to serialize and deserialize entities using the DSS. The
new language features, like annotations, the fault stream, and the operations
[Figure 8.2: Architecture of the DSS, shown for one entity shared among three sites a, b, and c.]
Kill and Break, were implemented solely by the author. We rewrote all the
entity protocols in the DSS such that they could handle partial failures. We
also extended the DSS interface to handle and reflect entity failures.
We made a few contributions to the virtual machine, too. Together with
Fred Spiessens, we implemented our new design of the by-need synchronization
mechanism. We also had to adapt the virtual machine in order to handle the
new distribution of procedures and object, among others. Finally, we improved
the unification’s implementation to make it more incremental.
8.2 The Distribution Subsystem
The DSS library provides a set of protocols for distributing programming language entities [Kli05]. It is itself split into a protocol layer and a messaging
layer. The protocol layer brings together all the protocols to manage distributed entities. Protocols are partitioned into three classes, which handle orthogonal aspects of an entity's distribution.
• Coordination protocols provide an entity’s unique identity, and maintain
its coordination network. They let proxies and their coordinators reach
each other by message passing.
• Entity protocols, or consistency protocols, implement generic entity operations. They also handle partial failures of entities.
• Reference protocols implement distributed garbage collection policies for
shared language entities.
Figure 8.2 shows the main DSS components for one entity shared among
three sites a, b, and c. The sites are separated by the vertical bold lines, and
the horizontal dashed gray line splits up both layers. One can see that each site
has a proxy for the entity, and site b owns its coordinator; those components
implement the coordination protocol of the entity. Each proxy is connected to
a protocol proxy, while the coordinator has a protocol manager ; protocol proxies and manager implement the consistency protocol of the entity. Finally, the
coordinator owns a home reference, and the remote proxies have a remote reference; those components implement the distributed garbage collection protocol
of the entity.
Some of those components, like a proxy with its protocol proxy, may call
each other directly, but they interact more generally via the messaging layer.
The latter provides a channel-based communication mechanism (reliable and
ordered message passing) between sites. Each site is abstracted by a DSite
component, which hides the communication channel, and reflects the fault status of the given site. The DSS on each site maintains a set of known sites, as it is
shown in Figure 8.2 on the preceding page. Each site also has a representation
for itself, drawn with double lines in the figure.
Each component in the protocol layer can be addressed with its type (proxy,
coordinator, protocol proxy, protocol manager, or reference), the global identity
of its coordination network, and a DSite. For instance, the protocol proxy
on site c can easily send a message to its protocol manager via its proxy,
which knows the global identity and the DSite of its coordinator. The protocol
manager on site b can send a message to its protocol proxy on site a, with its
own reference to the DSite of a, and the global identity of its coordinator. This
facility greatly simplifies the implementation of the protocols.
By design, the coordination network provides a means for each proxy to send
messages to its coordinator. In case the coordinator is stationary, each proxy
just has to know the coordinator’s site. All the DSS protocols are built on top of
this architecture. By default the protocol manager does not know its proxies,
so in some cases it maintains a list of DSite references corresponding to its
proxies. This list is built either explicitly by making proxies register to their
manager, or implicitly by collecting message origins. The latter case uses the
fact that a message is always delivered together with the DSite representing
the sender’s site.
8.2.1 Protocols for mutables
Those protocols implement three operations, namely read, write, and send. The
operations read and write are synchronous, and may return a result. The send
operation is similar to an asynchronous version of a write, and does not return
any value. There are essentially four protocols available in this category: the
stationary state protocol, the migratory state protocol, the pilgrim protocol,
and the invalidation protocol. The author made significant contributions to
the last three.
[Figure 8.3: Basic migratory state protocol.]
The stationary state protocol
This protocol is the simplest of all. Remote proxies send their operation requests to their protocol manager, which performs the operation. If the operation is not a send, it returns a result message once the operation on the entity
has completed. The result message allows the proxy to resume the corresponding suspended operation on its site.
The migratory state protocol
This protocol was first described in [HVS97, VHB+ 97, HVBS98], then extended
in [VBHC99] to make it handle permanent failures. The author proposed a
formalization of the extended protocol, and proved it correct in [BVCK00].
Mozart used that protocol for distributed cells and objects.
The protocol uses a token which is passed between the proxies in the coordination network. The proxy holding the token has sole access to the state of
the distributed language entity, and the entity’s state is passed together with
the token. The migration of the token is shown in Figure 8.3. The protocol
manager M builds a forwarding chain with all the proxies requesting an operation. When a proxy P2 needs the state of the entity to perform an operation
for thread T, it sends a message get(P2) to M. The latter then sends a message forward(P2) to the last proxy in the forwarding chain. This message will
make P1 forward the state token to P2, so that P2 becomes the last proxy in
the forwarding chain. When P2 receives the state token, it sends a message
gotit to its manager. This message allows M to maintain a list of the proxies
that could hold the state token.
Bypassing failed proxies. This simple extension to the basic protocol allows a proxy to avoid sending the state token to a failed proxy. This situation is
depicted in Figure 8.4. The proxy P1 has detected that its successor P2 in the
chain is permanently failed. In order to find the next successor, it notifies its
manager. The manager M can then send a new message forward(), because
M owns a representation of the forwarding chain.
State loss detection. The state token may be lost either if a proxy holds
it and crashes, or it has been sent over the network in a message put, and the
[Figure 8.4: Bypassing a failed proxy.]
message is lost because of a site failure (of the sender or the receiver). When
the manager detects that a proxy in the chain has permanently failed, it runs
an inquiry protocol, which determines whether the state token has been lost.
The manager asks each proxy where the state token is. The proxy can answer
beforeMe, atMe, or afterMe. If the manager finds two proxies that answer
afterMe and beforeMe, all proxies between them have crashed, and there is
nothing in the network, then the entity state is lost. The permanent failure of
the entity is notified to all proxies.
The pilgrim protocol
This mobile state protocol is inspired by the work in [GLT97]. It can be seen as
a variant of the migratory token protocol, where the proxies accessing the token
form a ring instead of a chain. Each proxy in the ring has a successor, to which
it forwards the token. A proxy leaves the ring only if it has not performed
any entity operation for a certain period of time. The proxies interact with the
manager only to enter or leave the ring. This greatly reduces the interaction
with the manager when a set of proxies regularly access the token.
The author made a significant contribution to make that mobile state protocol handle failures. The original implementation, as provided by [Kli05], had
no simple way to detect whether the state token was lost. Moreover, proxy
insertions and removals in the ring were serialized at the manager. This implied a strong protocol invariant which was relied upon for garbage collection,
because it allowed proxies to know whether they were inside the ring, and consequently whether they had to be kept alive. Proxies inside the ring should not
be removed by their respective garbage collectors, since their removal would
create a gap in which the state token can be lost. But determining efficiently
when a proxy is no longer accessible from the ring can be tricky, in particular
when ring proxies crash.
In order to remove any dependency of the manager on its proxies, we have
simplified the proxy insertion and removal in the ring. The result is shown in
Figure 8.5. Dashed arrows represent the successor relation in the ring. When
the manager receives a request from a proxy P to enter the ring, it chooses two
consecutive proxies P1 and P2 in the ring. It then sends a message to P1 to
make its successor P, and another message to P to make its successor P2, so
[Figure 8.5: Pilgrim: entering and leaving the ring.]

[Figure 8.6: Pilgrim: ring coloring.]
that P is eventually between P1 and P2 in the ring. Proxy removal is even
simpler: its ring predecessor is sent a new successor. The simplified protocol
allows the manager to remove any suspect proxy from the ring without that
proxy’s cooperation. To compensate for the apparent sloppiness of the protocol,
we have added an orthogonal protocol that both detects the loss of the state
token, and solves the garbage collection issue.
Ring coloring. The idea of the coloring protocol is to mark the proxies that
are inside the ring, following the ring structure. The proxies that have not
been marked during the process are guaranteed to be unreachable from the
ring, and can therefore be safely removed. Moreover, the protocol is able to
detect whether the coloring token or a newly marked proxy has encountered the
state token. At the end of the coloring, the manager checks this information,
and notifies all proxies if the state token is no longer present.
More than two colors are necessary to make the protocol robust. The coloring can be interrupted at any moment by a failed proxy, and several color
changes can be performed concurrently. Eventually all proxies will change to
the most recent color.
Figure 8.6 shows how the coloring protocol works. The protocol attaches
a color to every proxy and the state token, and the color is either “light” or
“dark”. A proxy always attaches its own color to the state token. When the
manager initiates a color change, it sends a message to one of the proxies in the
ring with a light color (red in the figure). The proxy creates a color token, and
passes it around the ring. The color token changes the color of every proxy it
encounters. If the color token meets the state token, its color is darkened, and
the coloring continues. The first proxy also darkens its color when it receives
the state token. When that proxy receives the color token, it knows that the
state token has been lost if and only if the token’s color and its own color are
equal and light. The final color is sent back to the manager. Note that the
proxies never accept a state token with a less recent color, except the proxy
that initiated the coloring.
A color change is triggered each time a proxy fails, or when a proxy wants
to determine whether it is still reachable from the ring. The proxy is guaranteed to be unreachable if its color has not been changed by the process. The
manager keeps track of the proxies that left the ring, and forwards them the
new color after the coloring. A proxy that has a different color knows that it
is unreachable from the ring.
The invalidation protocol
This protocol, inspired by protocols presented in [Lam79], manages an entity whose state is replicated on its proxies. It implements the annotation
replicated. It maintains two types of tokens, multiple read tokens and a single write token. Holding a read token allows a proxy to read its local copy of
the state of the entity. To perform a write operation, the manager asks proxies
to release their read token, and invalidate their copy of the state. Once all read
tokens have been collected, the write token is used to update the state, then
read tokens are redistributed to proxies with the new state. The proxies delay
read operations until they receive a read token.
In its first formulation, the manager was giving the write token to the proxy
that requested the write operation [Kli05]. The new state was then sent to the
manager, which redistributed it with read tokens. However, that protocol was
sensitive to proxy failures. The author has chosen to simplify its failure modes
by performing all write operations on the protocol manager. This makes the
protocol insensitive to proxy failures. It also improves performance, since the
manager no longer has to send the write token to a proxy.
8.2.2 Protocols for immutables
Those protocols should only offer a read operation. The stationary protocol
is valid for immutable entities, as well as three others, namely the immediate
protocol, and the eager and lazy replication protocols. Their implementation
in the DSS was done by Per Sahlin [Sah04], then slightly extended by the author
to handle partial failures.
The immediate protocol is not really a protocol, since a full representation
of the entity is serialized when a reference is passed between sites. The eager
and lazy replication protocols are pretty simple: each proxy can ask its manager for
a copy of the entity’s state. Once the state is installed, all read operations are
[Figure 8.7: Transient protocol: bind and update operations.]
performed locally. In the eager protocol, a proxy requests the state right after
its creation. In the lazy protocol, the proxy delays the state request until the
first read operation. Both the eager and lazy protocols guarantee a unique copy
of the entity’s state. Indeed, if a reference to the entity is sent to a site, that
site will identify the reference to the entity’s proxy. If the proxy had already
requested the entity state, it will not request it again.
8.2.3 Protocols for transients
This protocol implements single assignment variables, with two operations:
bind and update. The protocol was first published in [HVB+99]. The
DSS implements this protocol extended with an incremental update operation.
Upon creation, the transient entity is unbound. Its transient state may be
updated as many times as desired, until the entity is bound. The binding
is unique and final; all subsequent updates and bindings will fail. By-need
synchronization in Mozart/DSS is implemented as an update (see Section 8.3.4).
Bind. In order to guarantee the uniqueness of the binding, the protocol manager
plays the role of an arbiter of binding attempts. The protocol is depicted in
Figure 8.7: proxies send binding requests to their manager; the latter accepts
the first request, and forwards the binding to all proxies. All subsequent binding
and update requests are ignored. When proxies receive the binding, they install
the final state in their entity, and check their former binding attempts to decide
whether they have succeeded.
As the manager broadcasts the binding to all its proxies, they must register
to their manager, unless the coordination network provides a way to reach
them all. The registration is done at proxy creation, when it is deserialized,
via an explicit registration message sent to the manager. If the entity is bound
at the message reception, the manager replies with the entity’s binding. Note
that the registration can be optimized if the entity reference was sent by the
home proxy: the manager may automatically register the destination site. This
autoregistration mechanism saves a registration message, and can improve the
throughput of the protocol in case of stream communication.
Update. State updates are handled in a similar way, but they do not put an
end to the entity. All updates are serialized by the manager, which forwards
them to all known proxies, so that the updates are applied in the same order on
all proxies (see the right drawing in Figure 8.7). When a proxy registers, the
manager may send back an update that summarizes all former updates. This
guarantees that the proxy does not miss past updates, provided that the entity’s
state can reflect all past updates. This summary update is a contribution of
the author. Its absence creates a race condition between proxy registration and
updates.
The transient remote protocol. This variant of the transient protocol,
chosen by the annotation reply, delegates the arbiter role to a proxy. That
proxy can directly bind the entity, and forward the binding to the manager,
which broadcasts it to the other known proxies. The manager forwards all the
other updates and binding requests to that proxy, which serializes them. The
protocol is optimal if there is only one remote proxy, and that proxy binds the
entity. Indeed, the proxy is autoregistered because the entity reference must
have been sent from the manager’s site, and the only message actually sent is
the forwarded binding.
The manager is responsible for choosing the arbiter proxy. The simplest rule
is to choose the first proxy registered outside the manager’s site. If that proxy
is deleted, it sends a deregistration message to the manager, which reassigns
the arbiter role to its home proxy.
8.2.4 Handling failures
All entities supported by the DSS may fail. Failures have two origins: the environment and the programmer, and both kinds must be detected and reflected
to the user. To implement that, each coordination proxy maintains a failure
state for its entity. Each time that state changes, the proxy notifies its corresponding mediator in the Glue layer. The state may be changed by the proxy
itself, or its protocol proxy.
Failures due to the environment are detected via DSites. We take for
granted that DSites have their own failure detection mechanism, which reflects their corresponding site’s fault state. Consider a DSite representing site
b on a site a. Once the DSite changes its fault state, this change is notified to
all coordination and protocol components on site a. Every component checks
whether it affects its entity. If so, it changes its failure state accordingly, and
notifies the mediator of its entity. How an entity is affected depends on the
entity’s coordination architecture and protocol. For instance, the mobile state
protocols will probe the proxies that may hold the entity’s state, in case one of
those proxies is reported as failed.
Failures are not only reported locally, but also on a global scale. A generic
protocol supports this global reporting. Once an entity is diagnosed as permanently failed by a protocol manager, the latter broadcasts a message PERMFAIL
to all its known proxies. Those proxies will then update their failure state,
and report the change to their own mediator. Note that this generic protocol is sometimes adapted to avoid inconsistencies. For instance, the transient
protocol never makes an entity permanently failed after its binding.
Kill. In order to let a program make an entity fail, all protocols support
an operation called kill. The same generic protocol is used to implement that
operation. To perform a kill, a protocol proxy simply sends a message PERMFAIL
to its manager, which propagates the failure globally as we explained in the
former paragraph.
8.2.5 Distributed garbage collection
The garbage collector provided by the DSS maintains a status for each coordination proxy in the system. The proxies of a given entity have different status,
depending on the references and the entity’s protocol. A given proxy can be in
one out of four states:
• PRIMARY: the entity is kept alive by remote references. The virtual machine must keep the entity, because other sites depend on it. This status
means that the coordinator of the entity is on the current site.
• WEAK: the entity is kept alive for protocol needs. This typically
happens when the current site is the only one to hold the entity’s state.
That status is generally not definitive: the DSS can be instructed to move
away from that status.
• LOCALIZE: no remote reference keeps this entity alive. This usually means
that all the proxies of the entity are gone except this one. The entity can
be localized or deleted, depending on whether it is kept alive locally.
• NONE: the liveness of the entity entirely depends on local information. If
the entity is alive, the proxy should be kept. Otherwise, both can be
removed.
Note that proxies are considered alive by the DSS until they are deleted explicitly by the upper layer. In fact all the components at the interface of the
DSS library must be deleted explicitly. The DSS uses them as roots for its own
internal garbage collector.
Section 8.3.6 on page 123 explains how these states are used by the local
garbage collectors. The distributed garbage collector itself is implemented by
the reference components shown in Figure 8.2 on page 109. Several protocols are available, and they can be combined together. The most important
protocols are a weighted reference counting algorithm, and a time lease algorithm. More details on the algorithms are given in Erik Klintskog’s works
[Kli05, KNBH01].
8.3 The language interface
The Glue layer implements the language interface to the distribution library.
As the distribution of Oz comes from sharing entities, the most important
component of this layer is the mediator, which interfaces an entity to its proxy,
and vice-versa. Every distributable entity has a mediator, which contains the
entity’s annotations, its fault state, and its memory status. For the sake of
performance, the mediator of a local entity is created lazily, once the entity is
annotated, broken, or serialized for a remote site.
An entity is distributed if and only if it has a proxy in the DSS layer,
otherwise it is purely local to its site. In general, distributed entities have a
pointer to their mediator, while local entities have an indirect access to their
mediator via a table. The mediator itself has pointers to both its entity and
proxy, and the latter has a pointer to the mediator. This provides both efficient
access for distributed entities, and small overhead for purely local entities.
8.3.1 Distributed operations in general
Performing a language operation on a distributed entity involves several components in the three layers of Mozart/DSS. Those components are depicted in
Figure 8.8 on the next page, with the horizontal dashed lines separating the
implementation layers. Each one has a specific role during the distributed execution of the operation. Note that entity failures are ignored here; they are
explained in the next section.
A language operation on a distributed entity is always delegated to the
Glue layer, which accesses the entity’s mediator, then the corresponding DSS
proxy, and invokes the latter with a generic operation. The proxy forwards the
call to its protocol proxy, which implements the protocol chosen for the entity.
The protocol proxy prescribes either to perform the operation locally, as if the
entity were purely local, or to suspend it and resume it later. The decision entirely
depends on the protocol.
Let us illustrate the decision taken in the case of the protocol migratory
(mobile state). If the entity’s state is on the site that attempts the operation,
it will be performed locally. This is fine, since the state is stored in the virtual
machine’s representation of the entity. If the entity’s state is not present,
the operation is suspended, and the distributed protocol is used to complete
the operation. Once performed, it is resumed, together with the thread that
attempted it.
Figure 8.8: The components involved in the distributed execution of a language operation

Suspension and resumption. If the operation has to be suspended, the
Glue creates an object that represents the suspended operation. That object
must be able to resume the operation whenever the protocol says so. Resumption
may be done in two ways: either
• the protocol installs a copy of the entity’s state in the local entity, and
the operation is performed locally, or
• the operation has been performed remotely, and the operation resumes
by delivering the results.
The suspended operation typically suspends the thread that attempted the
operation on a control variable. Once the operation completes, the thread is
woken up by binding the variable. The control variable also makes it possible
to raise an exception in the thread, or to resume the operation with another statement.
The suspended operation can also deliver an output from a remote execution
through a result variable.
When an operation is performed on a remote site, the entity’s protocol
proxy on that site invokes the corresponding mediator in order to perform a
virtual machine operation.
Passing values. Protocol messages may include Oz values. Those values can
be the input or output of an operation that is performed remotely, or a value
that represents the entity’s state. They are encapsulated in a Glue component
that takes care of their (de)serialization. The DSS library defines an interface
for a suspendable marshaler. This means that values are generally not serialized
as a whole; instead, serialization proceeds until a buffer is almost full. This
technique generally results in a smaller memory footprint.
8.3.2 Distributed immutables
The role of an immutable entity’s proxy is to either provide a copy of the entity’s
contents (protocols immediate, eager, and lazy), or to provide remote access
to the entity (protocol stationary). We call the first kind of entities copiable,
because their distribution protocol consists in copying their contents between
sites.
Copiable entities do not always require querying the Glue layer to perform
an operation. Indeed, once the contents of the entity are available on a given
site, all operations on that site will be performed locally. The overhead of
distribution can be reduced to nothing if the virtual machine performs this
optimization.
Another aspect of copiable entities is that they can survive the execution of
a program: they can be stored in a file, and reused later, possibly by another
program. This can be an issue if the entity’s identity is provided by its coordination network, because the latter is no longer functional once the program
stops. For this reason, we provided the entity with a global identifier that does
not refer to a live DSS component. As a consequence, it is possible to remove
the entity’s proxy once it has been copied. This can be done for instance by
the garbage collector.
8.3.3 Remote invocations and thread migration
Stationary objects and procedures are never copied between sites; only their
references are transmitted. The operation they have in common is
the call (also called invocation in the case of objects). For this operation, an
object can be seen as a special case of a procedure. Therefore the discussion
will only mention procedures, and calls to stationary procedures.
Assume that a thread on a site a attempts to execute the call statement
{p x1 ... xn}, where p is a stationary procedure on a site b. If the protocol
proxy of p has to perform the call remotely, it asks the Glue layer to transmit
arguments for the remote operation, i.e., in our case x1, ..., xn. The Glue
layer on site b then simply calls p again with the given arguments on a new
thread, which will execute p because it is now on its site. As the procedure call
may exit with an exception, site b creates a variable z, and pushes the following
statement on the thread:
   z = try {p x1 ... xn} unit
   catch E then {FailedValue E} end
The variable z is returned to site a, which replaces the original call to p by
{Wait z }. This automatically synchronizes the thread, and transmits an exception if needed. This is illustrated in Figure 8.9 on the facing page, for a
procedure P with two arguments.
Figure 8.9: A remote procedure call

This simple solution works in most cases, but it has a subtle issue: the
calling thread and the one that actually executes the procedure have different
thread identifiers. This is a problem if these threads lock critical sections with
reentrant locks. Those locks allow a thread to enter a critical section many
times, and use the thread identifier to distinguish between threads. For the
remote procedure call to be really transparent, both threads should have the
same identifier, just as if the thread had migrated between sites. Otherwise,
a deadlock may occur. In the example of Figure 8.9, we should have t1 = t2.
To realize this, the identifier t of the caller thread is sent together with the
procedure’s arguments. On site b, the procedure is executed on a thread with
identifier t. If no such thread exists on b, one is created. If such a thread exists,
then by design it must be suspended on another remote procedure call; the
topmost statement must be a call to Wait as above. Pushing a new statement
on that thread is safe, because after that statement is reduced, the thread will
re-suspend thanks to the Wait statement.
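The effect of the pushed statement can be reproduced without any distribution. The following is a minimal, purely local sketch of the failed-value idiom, using the same FailedValue constructor as the pushed statement above; P and the exception tooBig are illustrative names. The exception raised inside P reappears in the caller as soon as it waits on Z.

   declare P Z in
   proc {P A B}
      if A > B then
         raise tooBig(A) end
      end
   end
   Z = try {P 3 1} unit catch E then {FailedValue E} end
   try {Wait Z} catch E then {Show caught(E)} end   % shows caught(tooBig(3))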
8.3.4 Unification and by-need synchronization
While the unification operation belongs to the virtual machine, its implementation deserves special attention. The reason is that a single unification may
involve several concurrent distributed variable bindings. Performing those distributed bindings sequentially may have a significant impact on the operation’s
performance. This impact may be reduced to a minimum if those bindings are
truly performed in parallel. The author has modified the existing implementation of the unification in Mozart, which was interrupted by every distributed
binding it had to perform.
The unification of two terms basically traverses two directed graphs, and
binds the encountered variable nodes in order to make both graphs equivalent.
If a variable is local to a site, its binding is done immediately. But a distributed variable may require to invoke its protocol. In that case, the binding
is suspended until the protocol terminates the operation, and returns its result.
Suspending the whole unification at that point is correct but inefficient, since
all involved distributed bindings will be performed in sequence.
To make a better implementation of the unification’s semantics, the original
algorithm is modified as follows. When a variable binding suspends, it is put
aside in a “suspended set”. Note that bindings of read-only variables are also
put in that set. The bindings in the suspended set are considered valid until
the algorithm terminates. If the algorithm terminates without failing, and the
suspended set is nonempty, say {x1=v1, ..., xn=vn} with n > 0, the unification
resumes as unifying two tuples with the remaining bindings, i.e.,

   x1 # ... # xn = v1 # ... # vn.

Moreover, the current thread is suspended on the variables x1, ..., xn. The
thread will be woken up as soon as one of those variables is bound (possibly by a
distributed binding), and the unification will make progress. If all the variables
xi are distributed, then all the bindings xi=vi will proceed concurrently.
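The tuple trick itself can be observed in isolation: unifying two '#' tuples performs all the pairwise bindings as part of a single unification. A minimal, purely local sketch:

   declare X1 X2 X3 in
   X1#X2#X3 = foo#bar(1)#42
   {Show X1#X2#X3}   % shows foo#bar(1)#42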
Unifying distributed variables. The unification of two distributed variables requires a tiebreaker to decide which variable is bound to the other. This
is necessary for avoiding “binding cycles” in the system. Indeed, if two sites
attempt to perform x=y and decide differently, x may be bound to y and vice versa. This is problematic, since both transients are bound, and can therefore
no longer be bound to anything else: the variables are unbound forever. The
tiebreaker is borrowed from the DSS, which provides an arbitrary total order
between all its distributed entities. The order is guaranteed to be the same on
all sites.
By-need synchronization. As we have seen in Chapter 7, the semantics
of by-need evaluation simply requires propagating the need of a variable to
all sites. The update operation provided by the transient protocols fulfills
this job perfectly. Once a variable is made needed, an update operation is performed, which makes all representatives of that variable needed. Note that
making a variable needed never blocks, since that update can be considered
asynchronous.
8.3.5 Fault stream and annotations
The mediator of a language entity manages most aspects of the distribution
of that entity. Some of those aspects, like the entity’s fault stream, are visible
as language entities. Others, like the entity’s annotations, are stored directly in
the mediator. Figure 8.10 on the facing page shows the extra language entities
used by an entity’s mediator.
Figure 8.10: The entities managed by an entity's mediator

Fault stream and blocking threads. First, the mediator keeps track of the
entity's fault state. It is updated whenever a new fault state is reported by the
entity's proxy, or enforced by the user (localFail). The mediator manages
the entity's fault stream by keeping a reference to its tail, a read-only variable.
The stream is extended each time the fault state changes. The mediator also
manages a control variable. If a thread attempts an operation on the entity
while its fault state is not ok, the thread suspends on that variable. Whenever the fault state becomes ok, that control variable is bound to unit; this
automatically wakes up all blocked threads, which will retry their operation.
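At the language level, the fault stream maintained by the mediator is observed with GetFaultStream (see Appendix A.2). The following minimal sketch of a monitor thread blocks on the tail of the stream between fault-state changes, and terminates when the stream is closed with nil, i.e. once the entity has left memory; the atoms passed to Show are illustrative.

   proc {Monitor E}
      thread
         for S in {GetFaultStream E} do
            {Show faultState(S)}   % ok, tempFail, localFail or permFail
         end
         {Show entityGone}         % the stream was closed with nil
      end
   end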
Note that the memory footprint of the entity is pretty small in practice. Both
the fault stream and the control variable are created lazily, once the program
needs them. Another optimization comes from the fact that the control variable is only effective with the fault state tempFail. Indeed, a thread blocking
because of the fault states localFail and permFail will never resume. Therefore the control variable does not need to be kept alive by the mediator in that
case, since it will never be bound.
Annotations and proxy. The entity’s annotations are also stored in the
mediator. They are used to create a DSS proxy for the entity. The creation and
removal of the proxy are both managed by the mediator. The proxy creation
(called entity globalization) is triggered when the entity is serialized, while its
removal (called entity localization) is prescribed by the garbage collector when
the entity is no longer referenced outside its original site.
8.3.6 Garbage collection
The memory management of Mozart/DSS involves both the virtual machine’s
garbage collector and the DSS’s distributed garbage collector. We will call them
the local and DSS garbage collectors, respectively. The latter has already been
described in Section 8.2.5 on page 117. This section focuses on the cooperation
between the implementation layers.
The basic principle is that a live entity keeps its mediator alive, and a live
mediator keeps all its entities (the entity, its fault stream and control variable)
alive. We combine that principle with the information coming from both the
virtual machine and the DSS. When the Glue layer decides to remove or localize
a distributed entity, it deletes the entity’s proxy. The following paragraph
explains how the decision is taken.
Putting it all together. The cooperation between the garbage collectors is
quite generic. First, both the virtual machine and the DSS should provide correct information about what must be kept in memory. Second, some decisions,
like the correct handling of the WEAK state above, depend on whether an entity
is kept alive by local computations only. Because some of those entities have to
be kept anyway, the process requires two passes of the local garbage collector.
The main steps of the garbage collection process are the following.
1. Distributed entities in state PRIMARY are taken as roots for the local
garbage collector.
2. The local garbage collector is run, which recursively marks entities from
the roots. We can now determine which entities are marked by local
computations.
3. The distributed entities in state WEAK are checked. For each such entity,
if it has not been marked yet, mark it and instruct its proxy to move
away from that state.
4. Local garbage collection is performed again. Now all entities that must
be kept in memory are marked.
5. The distributed entities in state NONE are kept only if they are marked
locally; otherwise they are deleted together with their mediator and proxy.
The distributed entities in state LOCALIZE are localized (their proxy is
removed) if they are marked locally; otherwise they are deleted together
with their mediator and proxy.
6. The DSS performs its own internal garbage collection. This has no effect
at all on the virtual machine’s memory heap.
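The decision taken for each distributed entity in step 5 (together with the special cases of steps 1 and 3) can be summarized by the following sketch. It is not the actual Mozart/DSS code; the lowercase atoms stand for the DSS proxy states of Section 8.2.5, and Marked tells whether the local passes have marked the entity.

   fun {CollectDecision State Marked}
      case State#Marked
      of primary#_      then keep       % the coordinator is here (root in step 1)
      [] weak#_         then keep       % step 3 already marked it and asked to move away
      [] localize#true  then localize   % drop the proxy, keep the local entity
      [] localize#false then delete     % nobody needs it anymore
      [] none#true      then keep       % alive locally only
      [] none#false     then delete     % delete entity, mediator and proxy
      end
   end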
These steps give the broad idea for collecting the entities that must be
kept in memory. However some important details are missing, in particular at
step 1. The rest of the section identifies the missing roots for the local garbage
collection, with a detailed explanation for each. We analyze (in order) the
cases of distributed variables, fault streams, threads blocked on a failure, and
the components involved in a distributed operation.
Distributed variables. The process described above does the right thing
for most distributed entities. However, let us analyze what happens to threads
that suspend on distributed variables. Recall that when a variable is alive, its
suspensions are kept alive. If no live entity refers to any of them, the variable
and its suspensions are considered dead. The following situation, depicted in
Figure 8.11, might be problematic: a thread suspends on a distributed variable,
on a site that does not hold the variable's coordinator. If nothing else keeps
that variable alive, it might be considered dead, together with its associated
suspended threads. Should the Glue layer keep this variable alive?

Figure 8.11: A distributed variable with suspensions only

Figure 8.12: A distributed entity with monitoring threads only
Our answer is: yes, distributed variables with local suspensions should be
considered as roots for the local garbage collector. The reason is pretty simple.
Suspending on a distributed variable is a common idiom, where one site waits
for the result of a computation performed on another site. The programmer
rarely considers the blocked thread as possibly dead. On the contrary: that
thread is often used to keep the continuation of a local computation alive.
Silently removing the variable and its suspensions would be an error.
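The idiom in question can be sketched with a purely local stand-in for the remote site. In the distributed case, Answer is a distributed variable, and the thread blocked in {Wait Answer} is precisely the kind of suspension that rule 1.1 below keeps alive.

   declare S ServerPort Answer in
   ServerPort = {NewPort S}
   thread                                 % stands for the computation on the other site
      for query(X) in S do X = 42 end
   end
   {Send ServerPort query(Answer)}
   {Wait Answer}                          % this suspension must not be collected
   {Show Answer}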
Note, however, that the garbage collector is unable to find out whether
there exists a thread in the system that can bind the variable. If no thread
binds the variable, all suspensions are dead. The suspensions can also be
considered dead if the variable is permanently failed, locally (localFail) or
globally (permFail). If the program is able to detect a dead variable, it can
help the local garbage collectors by making the variable fail. We extend the
step 1 above with:
1.1. Distributed variables with local suspensions are taken as roots for the
local garbage collector, unless they are permanently failed.
Figure 8.13: A failed entity with blocked threads
Fault streams. While an entity is alive, its fault stream is alive. This is a
consequence of the basic rule we mentioned above: the entity keeps its mediator alive, which itself keeps the entity’s fault stream alive. Now consider
the situation depicted in Figure 8.12 on the previous page. The site has no
reference to the entity, but it still monitors it, with a thread suspended on the
entity’s fault stream.
The garbage collection process must implement the policy given in Section 4.4 on page 54. First, the entity may be removed from memory, since it
is not referred to. Second, if the entity is removed, its fault stream must be
closed with its tail bound to nil. Binding the stream tail should wake up the
monitoring threads suspended on it. In order to be effective, it requires both
the stream tail and its suspensions to be alive. Therefore, the fault stream
of an entity must be kept alive until it is closed. The implementation is very
simple; we add the step:
1.2. The fault streams of entities are taken as roots for the local garbage
collector.
Blocked threads. We are now interested in the threads that block on a
failed entity. Technically those threads are suspended on the control variable
used by the entity’s mediator to resume from temporary failures. The situation
is illustrated in Figure 8.13. Let us recall the principle stated in Section 4.4 on
page 54. If the failure is temporary, those threads must be kept alive, otherwise
they would not resume. If the failure is permanent, they are not kept alive by
the entity itself. We can easily ensure the liveness of those threads by adding
the following step:
1.3. The control variables used by the mediators of temporarily failed entities
are taken as roots for the local garbage collector.
Distributed operations. Those operations require handling two extra data
structures: components representing suspended operations and terms being
serialized (see Figure 8.8 on page 119). Both are freed explicitly by the DSS;
meanwhile, they must keep their associated virtual machine entities alive.
In other words, both suspended operations and terms being serialized can be
considered as roots for the local garbage collector. So we add the step:
1.4. Entities referred to by suspended operations and serialized terms are
taken as roots for the local garbage collector.
9 Evaluation
This chapter gives an evaluation of our work. Both the language definition and
its implementation are evaluated.
9.1 Ease of programming
In Chapter 5 we have shown a few abstractions built with our distribution
model. From these examples, we can already say that both the customization
of a distributed application and its handling of partial failure are not difficult.
The examples show that nontrivial abstractions can be coded relatively easily,
and without much code.
9.2 Performance
In this section we compare the performance of Mozart/DSS and Mozart. We
will see that their performance is similar, with Mozart/DSS being a bit slower
than Mozart. But the new protocols made available by Mozart/DSS make it possible
to take better advantage of the distribution of Oz than Mozart does. These performance comparisons complete the experiments done by Erik Klintskog on an
early version of Mozart/DSS in [KBBH03]. These experiments showed that the
Distribution Subsystem (DSS) incurred around 12% overhead on the total time
for a client to perform a given number of requests to a server, when compared
to Mozart. The same experiment also showed that Mozart/DSS was about
twice as slow as an equivalent C++ program, optimized for the experiment,
and using raw sockets for communication.
Issues. The execution of the performance tests was considerably delayed by
bugs we discovered in the implementation. Not all those bugs could be fixed.
proc {Server N Len}
   S ServerPort={NewPort S}
in
   {Offer ServerPort ...}            % make ServerPort available
   for I in 1..N  get(X) in S  do
      L={List.number I+1 I+Len 1}    % list of Len integers
   in
      X=L
   end
end

proc {Client N}
   ServerPort={Take ...}             % connect to the server's port
in
   for I in 1..N do X in
      {Send ServerPort get(X)}
      {Wait X}
   end
end
Snippet 9.1: Mozart/DSS vs. Mozart: server and client
Some of them were found in code that we did not write ourselves, notably in
the DSS, which made the debugging task quite difficult. So the experiments
we show here are the ones that could be run without triggering the remaining
bugs.
9.2.1 Mozart/DSS vs. Mozart
This section compares the performance of the distribution layers in both platforms Mozart and Mozart/DSS. We consider a simple client-server program,
and compute the CPU time spent in the distribution layer. Note that we do
not measure network delays in this case. The client sends N=100000 requests
of the form get(X) to the server, and the server replies by binding X to a list of
Len elements. The number N of requests has been chosen large enough in order
to trigger the garbage collector, so that all the components of the distribution
layer are involved in the test. We made experiments for Len=10 and Len=1000
to compare between small and big messages.
Snippet 9.1 above gives the code for the server and client. The server offers
a port for the client to communicate. The procedure Offer creates a ticket,
and makes it available (in a file, or on a web site). The procedure Take retrieves
the ticket, and connects to it. The client sends N requests, and waits for the
reply, so that requests are sequential. For the sake of simplicity, we made the
server wait for N requests, then exit.
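For reference, a run of the experiment is driven as sketched below, with the server on one site and the client on another (N=100000; Len is 10 or 1000 as explained above).

   % site 1 (server):  {Server 100000 10}    % or {Server 100000 1000}
   % site 2 (client):  {Client 100000}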
                           Mozart      Mozart/DSS
Len=10     centralized      0.260        0.292
           distributed     25.500       29.304
           difference      25.240       29.012 (+16%)
Len=1000   centralized      6.354        6.378
           distributed     53.080       80.910
           difference      46.726       74.532 (+60%)

Table 9.1: Comparison of total CPU times between platforms
(network delays not included in measurements)
Results. Table 9.1 gives the results of our experiments. The numbers are
the total CPU times spent by client and server in different situations. Network
delays are not taken into account. Times are measured in seconds, and averaged
over 5 executions. The programs were run on a computer with an Intel Core
Duo 2 GHz processor. The centralized case corresponds to the client and server
being run on a single site. In the distributed case, they are created on distinct
sites, and their CPU times are added. The difference between the two reflects
the global overhead in time of the distribution for that program.
The obvious observation is that Mozart/DSS is slower than Mozart, but the
ratio between the two is reasonable. The slower performance of Mozart/DSS
can be partly explained by its higher degree of flexibility, which requires a few
more indirections when performing an entity operation. Another observation is
that the size of shared data also has an impact on the relative performance of
both platforms: the marshaling process is a bit more involved in Mozart/DSS,
because the existing Mozart marshaler is wrapped to comply with the generic API
defined by the DSS.
9.2.2 Comparing protocols
We show how we can dramatically change the network behavior of a simple
program by changing how a distributed entity is annotated. The program used
in the experiment is a simplistic chat application: peers join a group, and
broadcast messages within that group. Each peer provides a port on which
other peers send messages. The group itself is handled by a small server, which
provides a cell with the list of ports. Message broadcast is done by sending the
message to all the ports in the list. A peer joins the group by connecting to
the server, and adding its port to the list in the cell.
Note that the distributed cell is not fault-tolerant. The purpose of the
program is only to show performance variations within a program, just by
changing distribution parameters. In this case, we will compare two protocols
for distributing the cell.
Snippet 9.2 on page 133 gives the code for the group server and the group
peers. The server is instantiated with a particular protocol for the cell.

protocol                                   total time
migratory                                    63.906
replicated                                   11.775
no distribution (all peers on one site)      10.754

Table 9.2: Comparison of total time to complete (network delays included)

A peer
is created with a nickname. It first subscribes to the group. It then waits for
one message to arrive on its port, then sends 1000 messages, with a pause of
10 milliseconds between two messages, unsubscribes and exits. Waiting for a
message synchronizes the peers, so they all start sending messages at
the same time. Typically the last connected peer starts the process by calling
{Broadcast start}.
The code to pay attention to is the procedure Broadcast. Each call to
Broadcast reads the contents of the cell Group, and the procedure is called
many times. Therefore the efficiency of the peer is largely influenced by how
fast it can read Group.
Results. Table 9.2 gives the results of our experiments. The experiment
consists in n peers broadcasting messages, with one of them also playing the
role of the group server. The value n = 3 was sufficient to emphasize differences
between protocols. The last connected peer broadcasts a dummy message to
synchronize all peers. We measured the total time for one of the peers to run
completely, i.e., the elapsed time between its startup and termination. In this
case, we do measure network delays.
The experiment was run on Monday 5th November 2007, between 16:00
and 16:30. The times are measured in seconds and averaged over 5 runs. The
peers were located on three machines: calc6.info.ucl.ac.be (everlab cluster
at UCL), planet8.cs.huji.ac.il (everlab cluster at the Hebrew University
of Jerusalem, Israel), and my own laptop, an Apple MacBook Pro connected to
the UCL network. The server was located together with the peer on the first
machine. The times were measured on the peer on the second machine.
As one can see, the migratory protocol, which is the default for distributing cells, is not efficient for that application. On the contrary, the replicated
protocol, where each site has a copy of the current state of the cell, is almost
as efficient as if all peers were on the same site. The reason is that the cell is
read more often than it is updated, and the replicated protocol requires little
network communication in that case. The choice of protocol has a noticeable
impact on the performance of that example.
proc {Server Protocol}
   Group={NewCell nil}
in
   {Annotate Group Protocol}          % annotate cell
   {Offer Group ...}                  % make Group available
end

proc {Peer NickName}
   Group={Take ...}                   % connect to the cell Group
   proc {Subscribe P}                 % add P to the group
      T in T=Group:=P|T
   end
   proc {Unsubscribe P}               % remove P from the group
      L T in
      L=Group:=T
      T={List.subtract L P}
   end
   proc {Broadcast M}                 % send M to all ports in the group
      for P in @Group do
         {Send P M}
      end
   end
   S P={NewPort S}                    % this peer's port
in
   thread
      for X in S do {Show X} end      % show all messages
   end
   {Subscribe P}
   {Wait S}                           % wait for start signal
   for I in 1..1000 do
      {Broadcast NickName#I}
      {Delay 10}
   end
   {Unsubscribe P}
end
Snippet 9.2: Comparing protocols: server and peer
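For reference, the two distributed configurations compared in Table 9.2 differ only in the annotation passed to Server. A sketch of how a run is started (the peer nicknames are arbitrary atoms):

   % server site:     {Server migratory}   % or {Server replicated}
   % each peer site:  {Peer peer1}  {Peer peer2}  {Peer peer3}
   % the last connected peer starts the run by calling {Broadcast start}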
10 Conclusion

10.1 Achievements
This work extends former studies on network transparency in distributed programming languages, by showing that this approach to distribution is practical
regarding both efficiency and failure handling. We have given a few guidelines
on how to structure a distributed program, and how to reason about its network behavior. We have extended the language Oz with annotations, which allow
the programmer to customize the distribution of a program by choosing
between distribution protocols for entities. The resulting program is a valid
centralized program, and all distributed executions of the program are valid in
a centralized setting.
We have also redesigned failure handling in Oz, making it simpler and more
modular. Failure handling is based exclusively on asynchronous failure detection. We have introduced the concept of a fault stream to monitor an entity,
and showed that this concept is sufficient to implement complex failure handling algorithms. The design of the fault stream also provides an effective
post-mortem finalization mechanism, which was missing in the language. We
have introduced new language operations to make entities fail. Those operations give the programmer more control to handle partial failures, for example
by propagating the failure to a group of related entities.
On the implementation side, we have completed Erik Klintskog’s work by
making all protocols of the Distribution Subsystem (DSS) handle partial failure. Finally we have reimplemented the distribution of the platform Mozart
on top of the DSS. The new implementation, Mozart/DSS, implements our distribution model for the language. We have been able to test it, and validate
its effectiveness. However, the implementation is a prototype, and still suffers
from quite a few bugs.
10.2 Future directions
Decentralized applications. This work should be a solid foundation to
make applications fully decentralized. Some existing work, using structured
overlay networks, has proved to be useful in this domain. Boris Mejías and
Donatien Grolaux have already started to reimplement the library P2PS, which
implements a structured overlay network in Oz [MCV05]. The new distribution
model seems to be promising for that implementation.
A better virtual machine. The implementation of the virtual machine of
Mozart is far from being simple. It is written in the language C++, but does
not make use of object polymorphism, for instance. It is therefore difficult to
maintain and to extend. It lacks modularity.
This lack of modularity is visible in the Glue layer: every language operation must be explicitly mapped to a DSS operation. For instance, the Oz cell
has two write operations: Exchange and Assign. For each one, we had to
provide a mapping to a DSS write operation, together with specific callbacks
to perform those operations remotely or resume them. A better option would
be to define only one write operation on cells, and both Exchange and Assign
could be expressed in terms of that operation. All write operations in the virtual machine could be given the same distribution support, and polymorphism
would make it possible to use the entity's write operation as a callback directly.
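A minimal sketch of that remark at the language level: if Exchange were kept as the single write operation on cells, the other cell operations could be defined in Oz itself, so that only Exchange would need distribution support. MyAssign and MyAccess are illustrative names, not existing library procedures.

   proc {MyAssign C X}
      Old in
      {Exchange C Old X}    % discard the previous content
   end
   proc {MyAccess C X}
      {Exchange C X X}      % read the content and write it back unchanged
   end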
The DSS roughly defines five entity operations: read, write, send, bind, and
update. Every language operation in Oz can be expressed as an instance of one
of these operations. A mapping of the five generic operations would provide
distribution support for all language operations. Of course this support has to
be carefully designed, in order to avoid bad performance of elementary language
operations, such as the sum operation.
Protocols written in Oz. In the DSS library, all entity protocols are written in C++. The existing protocols have limitations, for instance they have
no recovery mechanism in case of failure. An Oz programmer cannot define their
own protocol for objects, unless they modify the underlying library and recompile the platform. The recommended strategy is to define an Oz abstraction
that encapsulates the recovery mechanism, and has its own protocol defined
with ports and variables. This reduces the overall usefulness of the underlying
library, since few protocols are really crucial.
Together with my colleague Yves Jaradin, we have sketched the design of
a virtual machine that can be extended in the language itself. The idea is to
give the possibility for an entity to delegate an operation to another entity,
for instance a port. A distributed cell could be implemented by coordinating
a group of cells on different sites, such that they give the illusion of being a
unique cell. For each cell in the group, basic operations are delegated to a
local agent, which can send messages (Oz values) to the other sites. The global
identity of the cell can be provided by an Oz name. The virtual machine should
support serialization, and the possibility to send values to other sites.
The design has several advantages. First, it provides a framework to experiment with complex entity protocols at the language level. The latter protocol
could use a user-defined overlay network to implement the communication between the sites that coordinate for an entity. Second, it gives the possibility to
dynamically upgrade a protocol. Indeed, the code of the protocol is defined in
the language as a value (a procedure, a class, or a functor), which can be sent
to a network of sites. A new protocol for cells can be implemented and deployed
without recompiling the virtual machine platform.
Security. Oz is relatively close to the language E, as we can easily encapsulate
values and restrict their access via lexical scope. We can define capabilities in Oz
in a quite reliable way. However, the implementation is more permissive than
the language, in particular when distribution comes into play. The author has the
impression that, with a reasonable effort, the distribution layer of Mozart/DSS
can be made more secure. Some work was already done by Zacharias El-Banna
and Erik Klintskog to make the DSS more secure [EKB05].
Besides that, Mozart/DSS also provides some tools at the language level.
One can force all mutable entities to be stationary by default, for instance.
This prevents a third-party site from screwing up an entity by cheating with
its protocol.
A Summary of the model
This chapter summarizes the distribution model, and all the language extensions introduced in this thesis.
A.1 Program structure
A program is distributed by letting several sites share language entities. The
latter are stateless or stateful data. The basic operations on those entities
behave as in the centralized case, modulo some network latency.
The program is deployed over the network by connecting several centralized
programs with shared entities. A reference to an entity E can be converted to
and from an atom T with the following functions. The atom is sent between sites
by other means (e-mail, web site, etc.).
{Connection.offer E} returns an atom T.
{Connection.take T} returns the entity E from which the atom T was created
with the former function.
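A minimal sketch of this deployment step, run here within a single program (normally the atom T is transmitted to a second program by other means):

   declare S P T R in
   P = {NewPort S}
   T = {Connection.offer P}    % T is the atom to publish (e-mail, web site, ...)
   R = {Connection.take T}     % on the other program, after receiving T
   {Send R hello}              % behaves like a send to P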
Annotations. Entities can be annotated with protocol descriptions. The
annotation states which kind of protocol should be used to distribute the
entity. An entity’s annotation cannot be modified.
{Annotate E A} annotates entity E with value A. Raises an exception if the
value A is not valid for E.
Possible values for A are:
• access architecture: access(stationary), access(migratory)
• entity protocols: stationary, migratory, pilgrim, replicated,
variable, reply, immediate, eager, lazy
• garbage collection: persistent, refcount, lease
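For example, the following minimal sketch annotates a cell with the replicated protocol before offering it; a value that is not valid for a cell would make Annotate raise an exception.

   declare C T in
   C = {NewCell 0}
   {Annotate C replicated}     % the annotation cannot be modified afterwards
   T = {Connection.offer C}    % T can now be published to other sites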
A.2 Failure handling
Entity failures. An entity fails by crashing: it stops being functional (forever). It can also fail locally on a given site: the entity is crashed on that site,
but it can be correct on other sites. Operations on failed entities simply block.
Entity fault states and fault stream. The language provides failure detectors for entities. Those detectors define the following local fault states for
the entity:
• ok: no failure detected
• tempFail: entity suspected of having failed, may go back to ok. Language operations block on the entity, and resume if the fault state goes
back to ok.
• localFail: entity has failed locally
• permFail: entity has failed globally
Those states are notified in the fault stream of the entity. This stream reflects
the sequence of fault states of the entity. It is accessed with the following
function.
{GetFaultStream E} returns the fault stream of entity E.
The fault streams of two variables are merged when those variables are unified.
The tail of the fault stream is bound to nil once the entity is no longer in
memory. This can be used to program post-mortem finalization.
Making entities fail. Two operations are provided:
{Kill E} makes E fail globally. The operation is asynchronous, and is not
guaranteed to succeed.
{Break E} makes E fail locally. The fault state of E becomes localFail im-
mediately (unless it was already failed).
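As mentioned in the conclusion, these operations can be used to propagate a failure to a group of related entities. A minimal sketch, where PropagateFailure is an illustrative name rather than a library procedure:

   proc {PropagateFailure E Related}
      thread
         for S in {GetFaultStream E} do
            if S == permFail then
               for R in Related do {Break R} end   % break the related entities locally
            end
         end
      end
   end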
Bibliography
[AM03] Mostafa Al-Metwally. Design and Implementation of a Fault-Tolerant Transactional Object Store. PhD thesis, Al-Azhar University, Cairo, Egypt, December 2003.

[Arm07] Joe Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, July 2007.

[AWWV96] J. Armstrong, M. Williams, C. Wikström, and R. Virding. Concurrent Programming in Erlang. Prentice-Hall, Englewood Cliffs, N.J., 1996.

[Bra05] Per Brand. The Design Philosophy of Distributed Programming Systems: the Mozart Experience. PhD thesis, Royal Institute of Technology (KTH), Kista, Sweden, 2005.

[BVCK00] Per Brand, Peter Van Roy, Raphaël Collet, and Erik Klintskog. Path redundancy in a mobile-state protocol as a primitive for language-based fault tolerance. Research Report RR2000-01, Université catholique de Louvain, Département INGI, 2000.

[Car95] Luca Cardelli. A language with distributed scope. In Principles of Programming Languages (POPL), pages 286–297, January 1995.

[Col04] Raphaël Collet. Laziness and declarative concurrency. 2nd Workshop on Object-Oriented Language Engineering for the Post-Java Era: Back to Dynamicity, PostJava'04, 2004.

[CV06] Raphaël Collet and Peter Van Roy. Failure handling in a network-transparent distributed programming language. In C. Dony et al., editor, Advanced Topics in Exception Handling Techniques, volume 4119 of Lecture Notes in Computer Science, pages 121–140. Springer-Verlag, 2006.

[EKB05] Zacharias El Banna, Erik Klintskog, and Per Brand. Making the distribution subsystem secure. Technical Report T2004:14, Swedish Institute of Computer Science, Kista, Sweden, January 2005.

[GGV04] Donatien Grolaux, Kevin Glynn, and Peter Van Roy. A fault tolerant abstraction for transparent distributed programming. In Second International Mozart/Oz Conference (MOZ 2004), Charleroi, Belgium, October 2004. Springer-Verlag LNCS volume 3389.

[GJS96] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley, 1996. Available at http://java.sun.com (June 2007).

[GLT97] H. Guyennet, J.-C. Lapayre, and M. Tréhel. Distributed shared memory layer for cooperative work applications. In 22nd Annual Conference on Computer Networks, LCN'97, pages 72–78, Minneapolis, USA, November 1997. IEEE Computer Society and TC Computer Communications.

[GR06] Rachid Guerraoui and Luís Rodrigues. Introduction to Reliable Distributed Programming. Springer-Verlag, 2006.

[HVB+99] Seif Haridi, Peter Van Roy, Per Brand, Michael Mehl, Ralf Scheidhauer, and Gert Smolka. Efficient logic variables for distributed computing. ACM Transactions on Programming Languages and Systems, 21(3):569–626, May 1999.

[HVBS98] Seif Haridi, Peter Van Roy, Per Brand, and Christian Schulte. Programming languages for distributed applications. New Generation Computing, 16(3):223–261, 1998.

[HVS97] Seif Haridi, Peter Van Roy, and Gert Smolka. An overview of the design of Distributed Oz. In Proceedings of the Second International Symposium on Parallel Symbolic Computation (PASCO '97), pages 176–187, Maui, Hawaii, USA, July 1997. ACM Press.

[ISO98] ISO/IEC. Open distributed processing - reference model: Overview, 1998.

[Jul88] Eric Jul. Object Mobility in a Distributed Object-Oriented System. PhD thesis, Univ. of Washington, Seattle, Wash., 1988.

[KBBH03] Erik Klintskog, Zacharias El Banna, Per Brand, and Seif Haridi. The design and evaluation of a middleware library for distribution of language entities. In Asian Computing Science Conference, pages 243–259, 2003.

[Kli05] Erik Klintskog. Generic Distribution Support for Programming Systems. PhD thesis, Royal Institute of Technology (KTH), Kista, Sweden, 2005.

[KNBH01] Erik Klintskog, Anna Neiderud, Per Brand, and Seif Haridi. Fractional weighted reference counting. In Rizos Sakellariou, John Keane, John R. Gurd, and Len Freeman, editors, Euro-Par 2001: Parallel Processing, 7th International Euro-Par Conference, Manchester, UK, August 28–31, 2001, Proceedings, volume 2150 of Lecture Notes in Computer Science, pages 486–490. Springer, 2001.

[Lam79] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 1979.

[MCGV05] Valentin Mesaros, Raphaël Collet, Kevin Glynn, and Peter Van Roy. A transactional system for structured overlay networks. Research Report RR2005-01, Université catholique de Louvain, Département INGI, 2005.

[MCV05] Valentin Mesaros, Bruno Carton, and Peter Van Roy. P2PS: Peer-to-peer development platform for Mozart. In Peter Van Roy, editor, Multiparadigm Programming in Mozart/Oz: Extended Proceedings of the Second International Conference MOZ 2004, volume 3389 of Lecture Notes in Artificial Intelligence. Springer-Verlag, 2005.

[Mil06] Mark Samuel Miller. Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control. PhD thesis, Johns Hopkins University, Baltimore, Maryland, USA, May 2006.

[Moz99] The Mozart programming system (Oz 3), January 1999. Available at http://www.mozart-oz.org (June 2007).

[Sah04] Per Sahlin. Efficient distribution of immutable data structures in the distributed subsystem middleware library. Master's thesis, Royal Institute of Technology (KTH), Kista, Sweden, 2004.

[Sar93] Vijay A. Saraswat. Concurrent Constraint Programming. MIT Press, 1993.

[SCV03] Alfred Spiessens, Raphaël Collet, and Peter Van Roy. Declarative laziness in a concurrent constraint language. 2nd International Workshop on Multiparadigm Constraint Programming Languages, MultiCPL'03, 2003.

[Smo95] Gert Smolka. The Oz programming model. In Computer Science Today, Lecture Notes in Computer Science, vol. 1000, pages 324–343. Springer-Verlag, Berlin, 1995.

[ST95] Nir Shavit and Dan Touitou. Software transactional memory. In Symposium on Principles of Distributed Computing, pages 204–213, 1995.

[Sun97] Sun Microsystems. The Remote Method Invocation Specification, 1997. Available at http://java.sun.com (June 2007).

[Van06] Peter Van Roy. Convergence in language design: A case of lightning striking four times in the same place. In Functional and Logic Programming, 8th International Symposium (FLOPS), volume 3945 of Lecture Notes in Computer Science, pages 2–12. Springer, 2006.

[VBHC99] Peter Van Roy, Per Brand, Seif Haridi, and Raphaël Collet. A lightweight reliable object migration protocol. Lecture Notes in Computer Science, 1686:32–46, 1999.

[VH04] Peter Van Roy and Seif Haridi. Concepts, Techniques, and Models of Computer Programming. MIT Press, Cambridge, MA, 2004.

[VHB+97] Peter Van Roy, Seif Haridi, Per Brand, Gert Smolka, Michael Mehl, and Ralf Scheidhauer. Mobile objects in Distributed Oz. ACM TOPLAS, 19(5):804–851, September 1997.

[VHB99] Peter Van Roy, Seif Haridi, and Per Brand. Distributed Programming in Mozart – A Tutorial Introduction, 1999. In Mozart documentation, available at http://www.mozart-oz.org (June 2007).

[WWWK94] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. A note on distributed computing. Technical Report SMLI TR-94-29, Sun Microsystems Laboratories, Mountain View, CA, November 1994.
A.3 Range queries on structured overlay networks
Range queries on structured overlay networks
Thorsten Schütt, Florian Schintke, Alexander Reinefeld
Zuse Institute Berlin, Takustraße 7, 14195 Berlin-Dahlem, Germany
Computer Communications 31 (2008) 280–291. Available online 31 August 2007.
Abstract
The efficient handling of range queries in peer-to-peer systems is still an open issue. Several approaches exist, but their lookup schemes
are either too expensive (space-filling curves) or their queries lack expressiveness (topology-driven data distribution).
We present two structured overlay networks that support arbitrary range queries. The first one, named Chord#, has been derived from
Chord by substituting Chord’s hashing function by a key-order preserving function. It has a logarithmic routing performance and it supports range queries, which is not possible with Chord. Its O(1) pointer update algorithm can be applied to any peer-to-peer routing protocol with exponentially increasing pointers. We present a formal proof of the logarithmic routing performance and show empirical
results that demonstrate the superiority of Chord# over Chord in systems with high churn rates.
We then extend our routing scheme to multiple dimensions, resulting in SONAR, a Structured Overlay Network with Arbitrary Range
queries. SONAR covers multi-dimensional data spaces and, in contrast to other approaches, SONAR’s range queries are not restricted to
rectangular shapes but may have arbitrary shapes. Empirical results with a data set of two million objects show the logarithmic routing
performance in a geospatial domain.
2007 Elsevier B.V. All rights reserved.
Keywords: Structured overlay networks; Routing in P2P networks; Consistent hashing; Multi-dimensional range queries
1. Introduction
Efficient data lookup is at the heart of peer-to-peer computing: given a key, find the node that stores the associated
object. Many structured overlay protocols like CAN [23],
Chord [30], Kademlia [20], Pastry [26], P-Grid [1], or
Tapestry [31] use consistent hashing [16] to store (key,
value)-pairs in a distributed hash table (DHT). The hashing
distributes the keys uniformly. A data lookup needs
O(log N) communication hops with high probability in
networks of N nodes.
The basis of DHTs [2] is the key space which defines the
names of items storable in the DHT, e.g. bitstrings of
length 160. A partitioning scheme splits the ownership of
the key space uniformly among participating nodes, which
form a ring structure by maintaining links to neighboring
nodes in the node space. The ring structure allows to find
the owner of any given key.
A typical use of this system for storing and retrieving
files might work as follows. Suppose the key space contains
keys in the range [0, 2^160). To store a file with a given filename in the DHT, the SHA1 hash of the filename is computed, producing a 160-bit key, and a message
put(key,value) is sent to an arbitrary participating
node, where value is the content of the file. The message
is forwarded from node to node through the overlay network until it reaches the node responsible for key as specified by the key space partitioning, where the pair
(key, value) is stored. Any other client can now retrieve
the contents of the file by again hashing its filename to produce key and asking any node to find the value associated
with key with a message get(key). The message will
again be routed through the overlay to the node responsible for key, which will reply with the stored (key, value)
pair.
The described hashing uniformly distributes skewed
data items over nodes and thereby supports the logarithmic
routing, but it also incurs a major drawback: none of the mentioned P2P protocols is able to handle queries with partial
keywords, wildcards, or ranges, because the hashing
spreads lexicographically adjacent identifiers over all
nodes.
1.1. Schemes for range queries on structured overlays
Two different approaches for range queries on structured overlays have been proposed in the literature:
space-filling curves on DHTs and key-order preserving
mapping functions for various network topologies.
The approaches based on space-filling curves [5,10,14]
incur higher maintenance costs, because they are built on
top of a DHT and therefore require one additional mapping step. Even worse, space-filling curves cover the keyspace by discrete non-overlapping patches, which makes
it difficult to support large multi-dimensional range queries
covering several patches in a single lookup. Depending on
the patch size, such queries require several independent
lookups (Ref. Fig. 4).
The second type of structured overlays map adjacent
key ranges to contiguous ranges in the node space. These
methods are key-order-preserving and therefore capable
of supporting arbitrary range queries. Since the key distribution is not known a priori, one approach, Mercury [9],
approximates a density function of the keys to ensure logarithmic routing. The associated maintenance and storage
overhead is not negligible, despite a mediocre accuracy.
MURK [14] goes one step further. It is similar to our
approach in that it also splits the data space into hypercuboids that are managed by separate nodes. But in contrast
to our approach MURK uses a heuristic based on skip
graphs whereas our SONAR builds on an extension of
Chord#’s ring topology. More detailed information on
related work can be found in Section 4.
1.2. Structured overlays without consistent hashing
In this paper, we argue that it is neither necessary nor
beneficial to use DHTs in structured overlays. We first
introduce a scheme that matches the key space directly to
the node space and is thereby able to support range queries.
Thereafter we generalize the scheme to multi-dimensional
data spaces.
1.2.1. Chord#
Section 2 presents Chord#. It has been derived from
Chord by eliminating the hashing function and introducing
an explicit load balancing mechanism. Chord# has the
same maintenance cost as Chord, but is superior in a number of aspects: it has a proven upper-bound lookup cost of
log_b N for arbitrary b, it has a constant-time finger update
algorithm and it supports range queries. It does so by routing in the node space rather than the key space. We
describe the finger placement algorithm (Section 2.1), prove
its logarithmic lookup performance (Section 2.1.2 and
Appendix A), and demonstrate its improvements over
Chord in highly dynamical systems (Section 2.3) with
simulations.
1.2.2. SONAR
Section 3 extends the principal idea of Chord# to multiple dimensions. The resulting algorithm, SONAR, performs efficient multi-attribute queries. While other
systems [9,14] also support multi-dimensional range queries, to the best of our knowledge there exists no other
approach that is equally deterministic in its finger placement. We present details on SONAR's finger placement
(Section 3.2), routing strategy (Section 3.4), range query
execution (Section 3.5), and demonstrate its practical
lookup performance in networks storing two million
objects (Section 3.6).
Section 4 discusses related work and Section 5 presents a
summary and conclusion.
2. Chord#
Since its introduction in the year 2001, Chord [30] was
deployed in many practical P2P systems, making it one
of the best-understood overlay protocols. Chord’s lookup
mechanism is provably robust in the face of node failures
and re-joins. It uses consistent hashing to map the keys
and node IDs uniformly onto the identifier space. In the
following, we introduce our Chord# algorithm by deriving
it step by step from Chord as illustrated in the four parts of
Fig. 1.
Let us assume a key space of size 2^8 and a node space of
size 2^4. Chord organizes the nodes n0, ..., n15 in a logical
ring, each of them being responsible for a subset of the keys
0, ..., 2^8 - 1. Each node maintains a finger table that
contains the addresses of the peers halfway, quarter-way,
1/8-way, 1/16-way, . . . , around the ring. When a node
(e.g. n0) receives a query, it forwards it to the node in its finger table with the highest ID not exceeding hash(key).
This halves the distance in each step, resulting in O(log N)
hops in networks with N nodes, because the hashing
ensures a uniform distribution of the keys and nodes with
a high probability [30] (Fig. 1a).
Because this scheme does not support range queries,
we eliminate the hashing of the keys. All keys are now
sorted in lexicographical order, but unfortunately the
nodes that are responsible for popular keys (e.g. ‘E’) will
become overloaded – both in terms of storage space and
query load. Hence, this approach is impractical, even
though its routing performance is still logarithmic
(Fig. 1b).
When we additionally eliminate the hashing of the node
IDs, the nodes can be placed at any suitable place in
the ring to achieve a better load distribution. We must
introduce an explicit load balancing scheme [17] that
dynamically removes keys from overloaded nodes. Unfortunately, without adjusting the fingers in the finger table,
many more hops are needed to retrieve a given key.

Fig. 1. Transforming Chord into Chord#.

The
lookup started in node n13 for key ‘R’, for example, needs
six instead of four hops. In general, the routing degrades to
O(N) (Fig. 1c).
In the final step, we introduce a new finger placement
algorithm that dynamically adjusts the fingers in the routing table. The lookup performance is now again O(log N) –
just as in Chord. But in contrast to Chord this new variant
does the routing in the node space rather than the key
space, and it supports complex queries – all with logarithmic routing effort (Fig. 1d).
We now present the new finger placement algorithm and
discuss its logarithmic performance. A formal proof can be
found in Appendix A.
2.1. Finger placement
In the above transformation we substituted Chord’s
hash function by a key-order preserving function. When
doing so, the keys are no longer uniformly distributed over
the nodes but they follow some unknown density function.
To still obtain logarithmic routing, we must ensure that the
fingers in the routing tables cross an exponential number of
nodes in the ring. This can be achieved as follows: to calculate the longest finger (i.e., the finger pointing half-way
around the ring) we ask the node referred to by our second
longest entry (i.e., quarter-way around) for its second longest entry (again quarter-way). For calculating our second
longest finger, we follow our third-longest finger, and so
on.
In general, to calculate the i-th finger in its finger table, a node asks the remote node to which its (i-1)-th finger refers for its (i-1)-th finger. The fingers at level i are thus set to the fingers' pointers at the next lower level i-1. At the lowest level, the finger refers to the direct successor.
finger_i = successor                         : i = 0
finger_i = finger_{i-1} → getFinger(i-1)     : i ≠ 0
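A minimal, self-contained Python sketch of this recursive update, assuming each node can be asked for an entry of its own finger table via a remote accessor (here called get_finger); the class, the identifier space and the demo ring are our own illustration, not an implementation from the SELFMAN libraries.

    import random

    RING = 1 << 16                       # illustrative identifier space

    class Node:
        """Toy Chord#-style node: fingers are derived from neighbours' fingers."""

        def __init__(self, node_id):
            self.node_id = node_id
            self.successor = self
            self.fingers = []

        def get_finger(self, i):
            """Remote accessor: another node asks for our i-th finger (or None)."""
            return self.fingers[i] if i < len(self.fingers) else None

        def dist(self, other_id):
            """Clockwise distance from this node to other_id on the ring."""
            return (other_id - self.node_id) % RING

        def update_fingers(self):
            """finger_0 = successor; finger_i = finger_{i-1} -> getFinger(i-1)."""
            fingers = [self.successor]
            while True:
                i = len(fingers) - 1
                nxt = fingers[i].get_finger(i)          # one hop per new entry
                # keep the candidate only while finger_{i-1} < finger_i < n
                # holds in clockwise ring order (no wrapping past n)
                if nxt is None or not (self.dist(fingers[i].node_id) < self.dist(nxt.node_id)):
                    break
                fingers.append(nxt)
            self.fingers = fingers

    # Build a small ring and let the tables converge over a few update rounds.
    ids = sorted(random.sample(range(RING), 64))
    nodes = [Node(i) for i in ids]
    for k, node in enumerate(nodes):
        node.successor = nodes[(k + 1) % len(nodes)]
    for _ in range(8):
        for node in nodes:
            node.update_fingers()
    print(len(nodes[0].fingers))          # -> 6, i.e. log2(64) entries per node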
2.1.1. Logarithmic finger table size
The finger table of Chord# has a maximum of ⌈log N⌉ entries. This can be seen by observing its construction.
Chord# first enters the shortest finger (direct successor of
the node) and recursively doubles the distance in the node
space with each further entry. Formally, a node n inserts an additional finger_i into its routing table as long as the following equation holds true:

finger_{i-1} < finger_i < n
This construction process guarantees that – in contrast to
Chord – no two entries point to the same node. Since the
fingers in the table point to nodes at exponentially increasing distance, it becomes apparent that the routing table has
a total of ⌈log N⌉ entries.
2.1.2. Logarithmic routing performance
Like the original Chord, Chord# has a routing performance of O(log N) hops. Unlike Chord, the logarithmic
routing performance is not only proven ‘with high probability’, but it constitutes a guaranteed upper bound. In
the following, we outline the basic idea; the formal proof
can be found in Appendix A.
Let the key space be 0 … 2^m - 1 and let ⊕ be addition modulo 2^m. In the original Chord, the i-th finger (finger_i) in the routing table of node n refers to the node responsible for the key f_i with

f_i = (n.key ⊕ 2^{i-1})   for 1 ≤ i ≤ m
The original Chord calculates finger_i by sending a query to the node responsible for f_i. This requires O(log N) communication hops for each single entry in the routing table, which sums up to O(log^2 N) hops per routing table.
Chord#'s finger placement algorithm is derived by reformulating the above equation as

f_i = (n.key ⊕ 2^{i-2}) ⊕ 2^{i-2}
Having split the right hand side into two terms, the recursive structure becomes apparent and it is clear that the whole calculation can be done in just one hop. The first term represents the (i-1)-th finger and the second term the (i-1)-th finger on the node pointed to by finger_{i-1}.
Routing in the node space allows us to remove the hashing function and to arrange the keys in lexicographical
order among the nodes so that no node is overloaded. This
new finger placement has two advantages over Chord’s
algorithm: first, it works with any type of keys as long as
a total order over the keys exists, and second, finger
updates are cheaper than in Chord, because they need just
one hop instead of a full search. This is because Chord#'s recursive finger references exploit the better-informed remote nodes when adjusting the fingers in its finger table.
2.2. Fewer hops with log_b routing
The routing performance of Chord# can be further
improved at the cost of additional storage space. The idea
for this enhancement comes from DKS [3], which inserts
extra fingers into the routing table to allow for higher-order
search schemes than simple binary search. By this means,
the log_2 N routing performance can be reduced to log_b N
with arbitrary bases b.
In Chord#, the longest finger, finger_{m-1}, with m = ⌈log N⌉, and the position of the current node n split the ring into two intervals: [n, finger_{m-1}] and [finger_{m-1}, n]. The intervals have about equal sizes because of the recursive finger update algorithm. The next shorter finger, finger_{m-2}, splits the first half again into two halves [n, finger_{m-2}] and [finger_{m-2}, finger_{m-1}]. This splitting is recursively continued until the subsets contain only one key. It becomes obvious that each routing hop cuts the distance to the goal in half, resulting in O(log_2 N) hops.
By dividing the interval on the ring into b equally sized subsets at each level, each hop reduces the distance to 1/b-th and the overall number of hops per search to O(log_b N). To implement log_b routing, we need to extend the routing table by b - 2 extra columns. The calculation of the fingers
is similar to the finger placement algorithm presented in
Section 2.1.
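A back-of-the-envelope illustration of this trade-off (our own arithmetic, not a result from the paper): for N = 1024 nodes, each level of a base-b table keeps b - 1 fingers instead of one, while the hop bound drops to about log_b N.

    def logb_tradeoff(num_nodes, base):
        """Approximate (fingers per table, worst-case hops) for log_b routing."""
        levels = 1
        while base ** levels < num_nodes:       # levels = ceil(log_base(num_nodes))
            levels += 1
        return (base - 1) * levels, levels

    for b in (2, 8, 16, 32):
        fingers, hops = logb_tradeoff(1024, b)
        print("base %2d: ~%3d fingers, <= %d hops" % (b, fingers, hops))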
Note that with log_b routing (b > 2) there are several alternatives to calculate the long fingers via different intermediate nodes. This makes it possible to eliminate inconsistent fingers in dynamic systems based on local information. Moreover, the update process may be improved by piggybacking routing table information on search results. This on-the-fly correction has only negligible traffic overhead and allows more frequent updates.
2.3. Experimental results
This section presents empirical results of Chord# in a
highly dynamic network. To allow the comparison to other
P2P protocols, we followed the experimental setup of Li
et al. [19] who simulated a network of 1024 nodes under
heavy churn. The network runs for 6 h and each node fails
on average (exponentially distributed) after 1 h with an
absent time of 1 h (again exponentially distributed). Hence,
only 50% of the nodes are online at any moment. Each live node issues a lookup query for a randomly chosen key every 10 minutes, with exponentially distributed time intervals. Messages have a length of 20 bytes
plus 4 bytes for each additional node address contained
in the message. The latency between the nodes is given by
the King data set [15] which contains real data observed
at Internet DNS servers.
Note that the simulation does not account for user data.
Only the protocol overhead is measured, and hence the
result gives the worst case.
We used the same parameters for testing the algorithm’s
performance as Li et al. [19], resulting in a total of 480 simulation runs:
(i) Base is the branching factor of each finger table entry. Each finger table contains a total of (base - 1) * log_base(N) fingers. Values: 2, 8, 16, 32.
(ii) Successors is the number of direct successors stored in each node's successor list. Values: 4, 8, 16, 32.
(iii) Successor stabilization interval denotes the time spent
between two updates of the nodes’ successor lists.
Values: 30 s, 60 s, 90 s.
(iv) Finger update interval is the time spent between two
finger table updates. Values: 60 s, 300 s, 600 s, 900 s,
1200 s.
(v) Latency optimizer tells whether proximity routing was
used for improving the latency. Values: true, false.
Fig. 2 shows the average lookup latency versus the
maintenance overhead (measured in bytes per node per second). For simplicity, we plotted just the convex hull of the
parameter combinations – all inferior data points above the
graphs are omitted (more detailed results can be found in
[28]). As can be seen, Chord# (dotted graph) strictly outperforms Chord: it has a lower latency with reduced maintenance cost. Much of this favorable performance is
attributed to the better finger placement algorithm.
Fig. 2. The convex hull of 480 (resp. 960) experiments with different
parameter combinations: Chord# outperforms Chord under churn.
Table 1 lists the number of hops for searching random
keys in a network of 1024 nodes. As expected Chord# never
needs more than log 1024 = 10 hops while the original
Chord requires up to 41 hops. This is because Chord computes the finger placement in the key space, which does not
ensure that the distance in the node space is halved with
each hop.
In summary, the experimental results indicate that
Chord# has a lower maintenance overhead (Fig. 2) and
often requires fewer lookup hops (Table 1) than Chord. This is true for both static and dynamic networks. Most of these favorable results are due to Chord#'s
recursive finger update algorithm which uses O(1) instead
of O(log N) communication hops to calculate a single finger
entry in the routing table. Because of the recursive construction process, one could expect that the longer fingers
are more likely to be misplaced – especially in networks
with high churn rates. But our experimental results show
the contrary: the recursive finger placement is beneficial
even under a very high churn.
Table 1
Number of hops to find random keys in 1024 nodes

Hops    Chord    Chord#
0          27        44
1         549       320
2        1924      1433
3        3734      3342
4        4932      4783
5        4377      4930
6        2635      3212
7        1116      1365
8         317       315
9          78        29
10          9         3
12          9         –
13          5         –
14          4         –
15          1         –
16          7         –
17          6         –
18          7         –
19          6         –
20          4         –
21          3         –
22          2         –
23          3         –
24          1         –
25          2         –
26          2         –
27          4         –
28          3         –
29          3         –
30          1         –
31          1         –
33          1         –
37          1         –
38          1         –
41          1         –
Chord# needs a maximum of log (1024) = 10 hops whereas Chord exhibits
logarithmic routing performance only with ‘high probability’, i.e., there
are some degenerate cases with considerably more hops.
3. SONAR
In this section, we extend the concept of Chord# to multiple dimensions. The resulting algorithm, named SONAR
for Structured Overlay Network with Arbitrary Range
Queries, is capable of handling non-rectangular range queries over multiple dimensions with a logarithmic number of
routing hops. Non-rectangular range queries are important, for example, for geoinformation systems, where
objects are sought that lie within a given distance from a
specified position. Other applications include Internet
games with thousands or millions of online-players concurrently interacting in a virtual space or grid resource management systems [12,25].
3.1. d-Dimensional data space
SONAR operates on a virtual d-dimensional Cartesian
coordinate space with d attribute domains. Keys are represented by attribute vectors of length d. As in MURK [14,21], the total key space is dynamically partitioned among the nodes such that each node is responsible for approximately the same number of keys. Explicit load balancing allows nodes to be added or removed when the number of objects grows or shrinks or when additional storage space becomes available.
Similarly to CAN [23], each SONAR node participates
in d dimensions and has direct neighbors in all directions.
The torus is made up of hypercuboids, each of them managed by a single node. Taken together, all hypercuboids
cover the complete key space.
During runtime the system balances the key load in such
a way that each hypercuboid contains about the same number of keys, and hence each node has to handle a similar
amount of data. As a consequence, the hypercuboids have
different sizes and a node usually has more than one direct
neighbor per direction. Compared to Chord# or CAN, this
slightly complicates the routing, because there are usually
several options for selecting a ‘direct neighbor’ – we chose
the one that is adjacent to the center of the node (see the
small ticks in Fig. 3).
3.2. Building the routing table
Fig. 3 illustrates a routing table in a two-dimensional
data space. The keys are specified by attribute vectors
(x, y) and the hypercuboids (here: rectangular boxes),
which are managed by the nodes, cover the complete key
space. Their different areas are due to the key distribution:
at runtime, the load balancing scheme ensures that each
box holds about the same number of keys.
SONAR maintains two data structures, a neighbor list
and a routing table. The neighbor list contains links to
all neighbors of a node. The node depicted by the grey
box in Fig. 3, for example, has ten neighbors.

Fig. 3. Routing fingers of a SONAR node in a two-dimensional data space (x- and y-domain: 0.0-1.0; direct routing neighbours are the nodes adjacent to the tick marks).

The routing table comprises d subtables, one for each dimension. Each subtable s with 1 ≤ s ≤ d contains fingers that point to remote nodes at exponentially increasing distances. We use the same recursive finger placement algorithm as in Chord# (Section 2.1.2) for inserting fingers
into each subtable s of node n: starting with the neighbor
that is adjacent to the center of node n in routing direction
s, we insert an additional finger_i into the subtable as long as the following equation holds true:

finger_{i-1} < finger_i < n
This construction process ensures that each subtable s contains log N_s entries, where N_s is the number of nodes in dimension s. Taken together, all subtables s of a node n will have log N entries – just as in Chord#. Note that this is a self-regulating process [4]: the exact number of entries per dimension depends on the actual data distribution of the application. All that can be said is that the total number of entries is log N.
3.3. Routing performance
Assuming that the d-dimensional torus is filled uniformly by N_1 · N_2 · … · N_d = N nodes, the routing performance is log N_1 + log N_2 + … + log N_d = log N. So, even
though SONAR supports multi-dimensional range queries,
it exhibits a similar logarithmic routing performance as
Chord#. Our simulations presented in Section 3.6 confirm
this observation even for highly skewed, real world data
distributions.
3.4. Routing in d dimensions
Chord# uses greedy routing: it always selects the finger
that is nearest to the target and forwards the request to this
node. In SONAR the situation is more complex, because in
each routing step there are d dimensions to choose from. For regular grids the order of dimensions does not make a difference for the number of routing hops. Here it is best to
choose the dimension for the next hop based on network
latency, called proximity routing. When the grid is irregular
(as in Fig. 3), it is important to select the ‘best’ dimension
in each routing hop – either by distance (greedy routing),
by the volume of the managed data space, or simply at
random.
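The following sketch illustrates one way to pick the next hop greedily across all subtables; the data layout (position tuples, per-dimension finger lists) and the torus metric are our own simplifications, and ties between dimensions could equally be broken by latency, data-space volume, or at random as described above.

    def next_hop_greedy(position, target, subtables, extent=1.0):
        """Pick, over all dimensions, the finger closest to the target on the torus."""
        def torus_dist(a, b):
            return sum(min(abs(x - y), extent - abs(x - y)) for x, y in zip(a, b))

        best_node, best_dist = None, torus_dist(position, target)
        for fingers in subtables.values():
            for finger_pos, finger_node in fingers:
                d = torus_dist(finger_pos, target)
                if d < best_dist:
                    best_node, best_dist = finger_node, d
        return best_node           # None means no finger improves on this node

    # Two dimensions, a couple of fingers per subtable (illustrative values).
    tables = {0: [((0.25, 0.1), "n3"), ((0.5, 0.1), "n7")],
              1: [((0.1, 0.25), "n12"), ((0.1, 0.5), "n20")]}
    print(next_hop_greedy((0.1, 0.1), (0.6, 0.4), tables))   # -> n7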
3.5. Non-rectangular range queries
The one-dimensional range queries supported by
Chord# are defined by a lower and an upper bound: the
query returns all keys between those bounds. Extending
this approach to d dimensions results in rectangular range
queries. In contrast to other approaches, SONAR also efficiently supports non-rectangular range queries.
Fig. 5 shows a circular range query in two dimensions,
defined by a center and a radius. In this example, we
assume that a person in the governmental district of Berlin
is searching for a hotel. The center of the circle is the location of the person and the radius is chosen to find hotels in
‘walking distance’. In a first step, SONAR routes the query
to the node responsible for the center of the circle, taking
O(log N) hops. Thereafter, the query is broadcast to the neighbors that partially cover the circle; the number of these additional hops is proportional to the size of the range query (a single hop for each neighbor). The query is performed on the local data and the results are returned to the requesting node.
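A sketch of this two-phase processing, assuming a hypothetical node interface with a bounding rectangle (lo, hi), a neighbours list and a local_keys() iterator (none of these names come from the paper):

    def circle_range_query(owner, centre, radius):
        """Collect all values whose keys lie within the circle (centre, radius).

        `owner` is the node responsible for the circle's centre, reached
        beforehand with O(log N) routing hops; one extra hop is spent on
        each neighbour whose rectangle partially covers the circle.
        """
        def rect_intersects_circle(lo, hi, c, r):
            # squared distance from the centre to the closest point of the box
            clamped = [min(max(c[i], lo[i]), hi[i]) for i in range(len(c))]
            return sum((clamped[i] - c[i]) ** 2 for i in range(len(c))) <= r * r

        def inside(p):
            return sum((p[i] - centre[i]) ** 2 for i in range(len(p))) <= radius ** 2

        targets = [owner] + [nb for nb in owner.neighbours
                             if rect_intersects_circle(nb.lo, nb.hi, centre, radius)]
        results = []
        for node in targets:
            results.extend(v for p, v in node.local_keys() if inside(p))
        return results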
Other structured overlays, which support multi-dimensional range queries, like [5,14,27,29], use space filling
curves to map the multi-dimensional space to one dimension. As shown in Fig. 4, z-curves can be used for this. Different parts of the z-curve are assigned to different nodes. Even for simple range queries several non-contiguous parts of the curve may be responsible for the queried range. For each part a complete lookup with O(log N) hops is necessary (eight in the given example).

Fig. 4. 2-Dimensional range query using z-curves.
Fig. 5. 2-Dimensional range query by SONAR.
3.6. Performance results
For our experiments we selected a large data set of a
traveling salesman problem with the 1,904,711 largest cities
worldwide [6]. Their GPS locations follow a Zipf distribution [32], which is a common distribution pattern of many
other application domains. In a preprocessing step we partitioned the globe into non-overlapping rectangular
patches so that each patch contains about the same number
of cities. We did this by recursively splitting the patches
along alternate sides until the number of cities in the area
dropped below a given threshold. We mapped the coordinates onto a doughnut-shaped torus (Fig. 6) rather than a sphere, because the poles of a sphere would become a routing bottleneck and the rings for the western and eastern hemispheres would run in opposite directions (southwards vs. northwards).
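A simplified version of this preprocessing (our own kd-tree-like sketch: median splits along alternating axes until a patch holds fewer points than a threshold; the real partitioning of the city data set may differ in detail):

    import random

    def split_patches(points, threshold, axis=0, bounds=None):
        """Recursively split a patch along alternating axes; return (bounds, points) pairs."""
        if bounds is None:
            xs, ys = zip(*points)
            bounds = ((min(xs), min(ys)), (max(xs), max(ys)))
        if len(points) < threshold:
            return [(bounds, points)]
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2
        cut = pts[mid][axis]
        lo, hi = bounds
        left = (lo, tuple(cut if i == axis else hi[i] for i in range(2)))
        right = (tuple(cut if i == axis else lo[i] for i in range(2)), hi)
        return (split_patches(pts[:mid], threshold, 1 - axis, left)
                + split_patches(pts[mid:], threshold, 1 - axis, right))

    # Toy usage: 1000 random "cities", at most ~50 per patch.
    cities = [(random.uniform(-180, 180), random.uniform(-90, 90)) for _ in range(1000)]
    print(len(split_patches(cities, threshold=50)), "patches")   # -> 32 patches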
Fig. 7 gives an overview of the results for networks of
128–131,072 nodes. It shows the routing performance
(average hop number) and the routing table size in x- and y-direction. As expected, all data points increase logarithmically with the network size.
Most interesting are the results on the routing table
sizes: although the nodes autonomously determine the
table sizes solely on their local knowledge, the result perfectly matches the theoretical expectation of log2 N entries.
In the two-dimensional case (depicted here), each routing
table contains one subtable in x-direction, which is a bit larger, and one subtable in y-direction, which is a bit smaller. The deviations are due to the given key distribution of this domain. Together, both subtables contain log_2 N entries.

Fig. 6. SONAR overlay network with 1.9 million keys (city coordinates) over 2048 nodes. Each rectangle represents one node.
Fig. 7 also shows the average number of routing hops.
The slope of this graph is slightly above the expected value
(0.5 log_2 N) because the routing table entries are calculated
on local knowledge only, which may result in longer
lookup chains. To verify this hypothesis, we checked the
accuracy of the finger lengths. Fig. 8 shows the length deviations for fingers of length 16 and 32 in a torus of 2048
nodes. This data was obtained by comparing the entries
in the routing table to the actual distance computed with
global knowledge (based on a breadth-first search in all
directions along the neighbors). The results seem to indicate that the longer fingers of size 32 (straight line) more
often overestimate the actual length.
The observed inaccuracy of the finger lengths may also
be attributed to the specific data distribution in our experiment [24]. From Fig. 6 it is evident that the rectangles of
the whole data space have very different sizes. This
becomes more obvious when plotting the number of rectangles versus their size: Fig. 9 shows an almost perfect Zipf distribution, which is common for a large number of applications [11].

Fig. 7. Lookup performance (avg. #hops) and size of routing tables (#fingers) for various network sizes.
Fig. 8. Deviation of finger lengths due to local knowledge in a torus of 1024 nodes. The measured lengths are centered around their expected length of 16 resp. 32.
Fig. 9. Distribution of node sizes.
Assuming that large rectangles have more incoming and
outgoing fingers than the smaller ones, we plotted the in-degree of all nodes (see Fig. 10). If all keys were uniformly distributed over the key space, and consequently all nodes were of the same size, one would expect an in-degree of log_2 2048 = 11 for each node. Fig. 10 indicates
that the in-degree is slightly smaller for most nodes,
because there are a few (presumably large) nodes with an
extremely high in-degree of up to 100.
Due to the different node sizes the rings of the torus are
skewed, i.e., they do not end at the node where they started, but
may form spirals or may even join in a common node.1
For the routing process, this does no harm, because a data lookup never makes a full round. Only for performance reasons do we need to address the problem of uneven in-degrees: some nodes may have to handle more routing requests than others, which may affect the routing performance. One solution to this problem would be to include
the in-degree into the load metric. The load balancing
scheme would then autonomously balance the in-degrees
by splitting larger nodes and thereby reducing the in-degree. Alternatively, we could provide more flexibility in the selection of neighbors by also allowing neighbors that are not adjacent to the center of the node to be used for routing.
4. Related work
The first structured overlay networks like Chord [30] or
CAN [23], published in 2001, allow unreliable peers to be organized into a stable, regular structure, where each node is responsible for a part of the key space (a ring segment in Chord, a rectangle in CAN). Due to consistent hashing, these systems distribute the nodes and keys evenly across the system and route with O(log N) resp. O(N^{1/d}) performance. Both
handle one-dimensional keys but do not support efficient
range queries, because adjacent keys are not mapped to
adjacent nodes in the overlay.
4.1. One-dimensional range queries
To allow efficient range queries, structured overlays
without hashing had to be developed, which place adjacent keys on nodes that are adjacent in the overlay. So, with one logarithmic lookup for the start of the range, the range query can be
performed by that node and the nodes adjacent in the logical structure of the overlay. One major challenge for such
systems is the uneven distribution of keys to nodes and
the finger maintenance for efficient routing. SkipGraphs
[7], a distributed implementation of skip lists [22], for example, use probabilistically balanced trees and thereby allow
efficient range queries with O(log N) performance.
Ganesan et al. [13] further improved SkipGraphs with
an emphasis on load-balancing. Their load-balancing
scheme maps ordered sets of keys on nodes with formally proven performance bounds, similar to [17]. For routing, they deploy a SkipGraph on top of the nodes, guaranteeing the routing performance of O(log N) with high probability.

1 In fact, the overlay built by SONAR is only a torus when the keys are uniformly distributed. In all practical instances, SONAR builds a graph that only slightly resembles a torus.

Fig. 10. In-degree of nodes in a torus with 2048 nodes. The expected value is log_2 2048 = 11.
Mercury [9] does not use consistent hashing and therefore has to deal with load imbalance. It determines the density function (see Appendix A) with random walk sampling,
which generates additional traffic for maintaining the finger
table.
In contrast to Mercury, Chord# [28] never needs to compute the density function and therefore has significantly
less overhead. It is the recursive construction of the routing
table in the node space using the successor information,
which allows Chord# to handle load-imbalance while preserving an O(log N) routing performance.
4.2. Multi-dimensional keys and range queries
There exist several systems that support multi-dimensional keys and range queries. They can be split into two
groups:
4.2.1. Space filling curves
Several systems [5,14,27] have been proposed that use
space-filling curves to map multi-dimensional to one-dimensional keys. Space-filling curves are locality preserving, but they provide less efficient range queries than the
space partitioning schemes described below. This is because
a single range query may cover several parts which have to
be queried separately (Fig. 4).
Chawathe et al. [10] implemented a 2-dimensional range
query system on top of OpenDHT using z-curves for linearization. In contrast to many other publications, they
report performance results from a real-world application.
Due to the layered structure the query latency is only 2–
3 s for 24–30 nodes.
4.2.2. Space partitioning schemes
Another approach is the partitioning and direct mapping of the key space to the nodes. SONAR belongs to this
group of systems. The main differentiating factor between
such systems is the indexing and routing structure.
SWAM [8] employs a Voronoi-based space partitioning
scheme and uses a small-world graph overlay [18] with
routing tables of size O(1). The overlay does not rely on
a regular partitioning like a kd-tree, but it must sample
the network to place its fingers.
Multi-attribute range queries were also addressed by
Mercury [9], but their implementation uses a large number
of replicas per item to achieve logarithmic routing performance. Following this scheme, we could employ Chord#
for the same purpose, but SONAR supports multi-dimensional range queries with considerably less storage
overhead.
Ganesan et al. [14] proposed two systems for multidimensional range queries in P2P systems – SCRAP and
MURK. SCRAP follows the traditional approach of using
space-filling curves to map multi-dimensional data down to
one dimension. Each range-query can then be mapped to
several range queries on the one-dimensional mapping.
MURK is more similar to our approach as it also
divides the data space into hypercuboids with each partition assigned to one node. In contrast to SONAR, MURK
implements a heuristic approach based on skip graphs.
5. Conclusion
We presented two structured overlay protocols that do
not use consistent hashing and are therefore able to support range queries.
Our Chord# provides a richer query expressiveness with
the same logarithmic routing complexity as Chord. Its finger update algorithm needs just one communication hop
per routing table entry instead of O(log N) hops as in
Chord. As shown in our experimental results (Fig. 2), this
greatly reduces the maintenance overhead and it is beneficial in dynamic environments with a high churn rate.
The second part of the paper generalizes the concepts of
Chord# to multi-dimensional data. The resulting algorithm, named SONAR for Structured Overlay Network
with Arbitrary Range Queries, is capable of handling
non-rectangular range queries over multiple dimensions
with a logarithmic number of routing hops. Non-rectangular range queries are necessary for geo-information systems, where objects are sought that lie within a given
distance from a specified position, or in Internet games
with millions of online-players interacting in a virtual game
space.
Acknowledgements
We thank the anonymous reviewers and the guest editors for their valuable comments. Our thanks go also to
Slaven Rezić for his street map of Berlin (used in Figs. 4
and 5) and to NASA’s Earth Observatory for the topographic images from the ‘Blue Marble next generation’
project in Fig. 6. Part of this research was supported by
the EU projects SELFMAN and XtreemOS.
Appendix A. Proof of the logarithmic routing performance of
Chord#
Before proving the routing performance of Chord# to be
O(log_2 N), we briefly motivate our line of argumentation. Let the key space be 0 … 2^m - 1. In Chord, the i-th finger (finger_i) in the finger table of node n refers to the node responsible for the key f_i with²

f_i = (n.key ⊕ 2^{i-1})   for 1 ≤ i ≤ m    (A.1)
This procedure needs O(log N) hops for each entry. It can
be rewritten as
f_i = (n.key ⊕ 2^{i-2}) ⊕ 2^{i-2}

Having split the right hand side into two terms, the recursive structure becomes apparent and it is clear that the whole calculation can be done in just one hop. The first term represents the (i-1)-th finger and the second term the (i-1)-th finger on the node pointed to by finger_{i-1}.
For proving the correctness, we describe the node distribution by the density function d(x). It gives for each point
x in the key space the reciprocal of the width of the corresponding interval. For a Chord ring with N nodes and a key space size of K = 2^m, the density function can be approximated by d(x) = N/2^m (the reciprocal of K/N with K = 2^m), because it is based on consistent hashing.
Theorem 1 (Consistent hashing [16]). For any set of N
nodes and K keys, with high probability:
(i) Each node is responsible for at most (1 + ε) K/N keys.
(ii) When node (N + 1) joins or leaves the network, responsibility for O(K/N) keys changes hands (and only to or from the joining or leaving node).
The most interesting property of d(x) is the integral over
subsets of the key space:
Lemma 1. The integral over d(x) equals the number of nodes
in the corresponding range. Hence, the integral over the
whole key space is

∫_keyspace d(x) dx = N.
Proof. We first investigate the integral of an interval from a_i to a_{i+1}, where a_i and a_{i+1} are the left and the right end of the key range owned by a single node:

∫_{a_i}^{a_{i+1}} d(x) dx = 1.
Because a_i and a_{i+1} mark the beginning and the end of an interval served by one node, d is constant for the whole range.

² We assume calculations to be done in a ring using (mod 2^m).
The width of this interval is a_{i+1} - a_i and therefore, according to its definition, d(x) = 1/(a_{i+1} - a_i). Because we chose a_i and a_{i+1} to span exactly one interval, the result is 1, as expected.

The integral over the whole key space therefore equals the sum over all intervals, which is N:

∫_keyspace d(x) dx = Σ_{i=0}^{N-1} ∫_{a_i}^{a_{i+1}} d(x) dx = N
Note that Lemma 1 could also be used to estimate the number of nodes Ñ in the system, having an approximation of d(x) called d̃(x). Each node could compare ½ log(Ñ) to the observed average routing performance in order to estimate and improve its local approximation d̃(x).
A.1. Finger placement in Chord

Both Chord and Chord# use logarithmically placed fingers, so that searching is done in O(log N). Chord, in contrast to our scheme, computes the placement of its fingers in the key space. This ensures that with each hop the distance in the key space to the searched key is halved, but it does not ensure that the distance in the node space is also halved. So, a search may need more than O(log N) network hops. According to Theorem 1, the search in the node space still takes O(log N) steps with high probability. In regions with less than average sized intervals (d(x) > N/K) the routing performance degrades.

Chord places the fingers finger_i in a node n with the following scheme:

f_i = (n.key ⊕ 2^{i-1}),   1 ≤ i ≤ m    (A.2)
Using our integral approach from Lemma 1 and the
density function d(x), we develop an equivalent finger
placement algorithm as follows. First, we take a look
at the longest finger finger_{m-1}. It points to the node responsible for n + 2^{m-1} when the key space has a size of 2^m. This corresponds to the opposite side of n in the Chord ring. With a total of N nodes this finger links to the (N/2)-th node to the right with high probability due to the consistent hashing theorem.

With Lemma 1 the key f_{m-1}, which is stored on the (N/2)-th node to the right, can be predicted:

∫_n^{f_{m-1}} d(x) dx = N/2

Other fingers to the (N/4)-th, …, (N/2^i)-th node are calculated accordingly.
As a result we can now formulate the following more
flexible finger placement algorithm:
Theorem 2 (Chord finger placement). For Chord, the following two finger placement algorithms are equivalent:

(i) f_i = (n.key ⊕ 2^{i-1}),   1 ≤ i ≤ m
(ii) ∫_n^{f_i} d(x) dx = (2^{i-1}/2^m) N,   1 ≤ i ≤ m

Proof. To prove the equivalence, we set d(x) = N/2^m according to Theorem 1.

∫_n^{f_i} d(x) dx = (2^{i-1}/2^m) N
∫_n^{f_i} (N/2^m) dx = (2^{i-1}/2^m) N
(N/2^m) (f_i - n) = (2^{i-1}/2^m) N
f_i = n.key ⊕ 2^{i-1}

The equivalence of Chord's two finger placement algorithms will be used in the following section to prove the correctness of Chord#'s algorithm.

A.2. Finger placement in Chord#

Theorem 3 (Chord# finger placement).

finger_i = successor                         : i = 0
finger_i = finger_{i-1} → getFinger(i-1)     : i ≠ 0
Proof. We first analyze Chord's finger placement (Theorem 2) in more detail:

∫_n^{f_i} d(x) dx = (2^{i-1}/2^m) N,   1 ≤ i ≤ m    (A.3)

First we split the integral into two equal parts by introducing an arbitrary point X between n (the key of the local node) and f_i (the key of finger_i):

∫_n^X d(x) dx = (2^{i-2}/2^m) N    (A.4)
∫_X^{f_i} d(x) dx = (2^{i-2}/2^m) N    (A.5)

In Eqs. (A.4) and (A.5), the only unknown is X. Comparing Eq. (A.4) to Theorem 2, we see that X is f_{i-1}.

In summary, to calculate finger_i we go to the node addressed by finger_{i-1} in our finger table (Eq. (A.4)), which crosses half of the nodes to finger_i. From this node the (i-1)-th entry in the finger table is retrieved, which refers to finger_i according to Eq. (A.5). So, Eq. (A.3) is equivalent to

finger_i = finger_{i-1} → getFinger(i-1)
Instead of approximating d(x) for the whole range between n and f_i, we split the integral into two parts and treat them separately. The integral from n to f_{i-1} is equivalent to the calculation of finger_{i-1}, and the remaining equation is equivalent to the calculation of the (i-1)-th finger of the node at finger_{i-1}. We thereby proved the correctness of the pointer placement algorithm in Theorem 3. □
With this new routing algorithm, the cost for updating the complete finger table has been reduced from O(log^2 N) in Chord to O(log N) in Chord#.
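As a small numeric sanity check of this equivalence (our own toy experiment, not part of the paper), one can verify that on a perfectly uniform ring the recursive rule selects exactly the nodes that Chord's direct placement selects:

    M = 16                                  # key space size 2^M
    N = 256                                 # number of nodes, evenly spaced
    SPAN = (1 << M) // N                    # keys per node under uniform density

    def owner(key):
        """Node responsible for a key when nodes are placed uniformly."""
        return (key % (1 << M)) // SPAN

    def chordsharp_finger(n, i):
        """finger_i of node n via the recursive rule (finger_0 = successor)."""
        if i == 0:
            return (n + 1) % N
        return chordsharp_finger(chordsharp_finger(n, i - 1), i - 1)

    for n in range(N):
        for i in range(8):                  # log2(N) = 8 fingers per node
            direct = owner(n * SPAN + (1 << (i + M - 8)))   # Chord's Eq. (A.2), index shifted
            assert chordsharp_finger(n, i) == direct
    print("recursive and direct finger placement agree on a uniform ring")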
References
[1] K. Aberer, P-Grid: a self-organizing access structure for P2P
information systems, CoopIS (2001).
[2] K. Aberer, L. Onana Alima, A. Ghodsi, S. Girdzijauskas, S. Haridi,
M. Hauswirth, The essence of P2P: a reference architecture for
overlay networks, IEEE P2P (2005).
[3] L. Alima, S. El-Ansary, P. Brand, S. Haridi, DKS(N,k,f): a family of
low-communication, scalable and fault-tolerant infrastructures for
P2P applications, GP2PC (2003).
[4] A. Andrzejak, A. Reinefeld, F. Schintke, T. Schütt, On adaptability in
grid systems, in: V. Getov, D. Laforenza, A. Reinefeld (Eds.), Future
Generation Grids, 2006, pp. 29–46.
[5] A. Andrzejak, Z. Xu, Scalable, efficient range queries for Grid
information services, IEEE P2P (2002).
[6] D. Applegate, R. Bixby, V. Chvatal, W. Cook, Implementing the
Dantzig-Fulkerson-Johnson algorithm for large traveling salesman
problems, Mathematical Programming, Series B, 97 <http://
www.tsp.gatech.edu/world>, 2003.
[7] J. Aspnes, G. Shah, Skip graphs, SODA (2003).
[8] F. Banaei-Kashani, C. Shahabi, SWAM: a family of access
methods for similarity-search in peer-to-peer data networks, CIKM
(2004).
[9] A. Bharambe, M. Agrawal, S. Seshan, Mercury: supporting scalable
multi-attribute range queries, ACM SIGCOMM (2004).
[10] Y. Chawathe, S. Ramabhadran, S. Ratnasamy, A. LaMarca, S.
Shenker, J. Hellerstein, A case study in building layered DHT
applications, ACM SIGCOMM (2005).
[11] P.J. Denning, Network laws, CACM 47 (11) (2004).
[12] V. Gaede, O. Günther, Multidimensional access methods, ACM
Computing Surveys 30 (2) (1998).
[13] P. Ganesan, M. Bawa, H. Garcia-Molina, Online balancing of rangepartitioned data with applications to peer-to-peer systems, VLDB
(2004).
[14] P. Ganesan, B. Yang, H. Garcia-Molina, One torus to rule them all:
multidimensional queries in P2P systems, WebDB (2004).
[15] K.P. Gummadi, S. Saroiu, S.D. Gribble, King: estimating
latency between arbitrary internet end hosts, in: Proceedings of
the 2nd Usenix/ACM SIGCOMM Internet Measurement Workshop, 2002.
[16] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, D.
Lewin, Consistent hashing and random trees: distributed caching
protocols for relieving hot spots on the World Wide Web. 29th
Annual ACM Symposium on Theory of Computing, 1997.
[17] D. Karger, M. Ruhl, Simple efficient load balancing algorithms for
peer-to-peer systems, IPTPS (2004).
[18] J. Kleinberg, The small-world phenomenon: an algorithmic perspective, in: Proceedings of the 32nd ACM Symposium on Theory of
Computing, 2000.
[19] J. Li, J. Stribling, R. Morris, M.F. Kaashoek, T.M. Gil, A
performance vs. cost framework for evaluating DHT design tradeoffs
under churn,