A READ-ONLY DISTRIBUTED HASH TABLE

VERDI MARCH
B.Sc (Hons) in Computer Science, University of Indonesia

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2007

DECLARATION

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institution of learning.

Abstract

A distributed hash table (DHT) is an infrastructure to support resource discovery in large distributed systems. In a DHT, data items are distributed across an overlay network based on a hash function. This leads to two major issues. Firstly, to preserve ownership of data items, commercial applications may not allow a node to proactively store its data items on other nodes. Secondly, data-item distribution requires all nodes in a DHT overlay to be publicly writable, but some nodes do not permit external parties to share their storage because of differing economic interests.

In this thesis, we present R-DHT (Read-only DHT), a DHT-based resource discovery scheme that does not distribute data items. We further extend R-DHT to support multi-attribute queries with our Midas scheme (Multi-dimensional range queries).

R-DHT is a new DHT abstraction that does not distribute data items across an overlay network. To map each data item (e.g. a resource, an index to a resource, or resource metadata) back onto its resource owner (i.e. physical host), we virtualize each host into virtual nodes. These nodes are further organized as a segment-based overlay network, with each segment consisting of resources of the same type. The segment-based overlay also increases R-DHT resiliency to node failures. Compared to conventional DHT, R-DHT's overlay has a higher number of nodes, which increases lookup path length and maintenance overhead.
To reduce R-DHT lookup path length, we propose various optimizations, namely routing by segments and shared finger tables. To reduce the maintenance overhead of overlay networks, we propose a hierarchical R-DHT which organizes nodes as a two-level overlay network. The top-level overlay is indexed based on resource types and constitutes the entry points for resource owners at second-level overlays.

Midas is a scheme to support multi-attribute queries on R-DHT based on d-to-one mapping. A multi-attribute resource is indexed by a one-dimensional key which is derived by applying a Hilbert space-filling curve (SFC) to the type of the resource. The resource is then mapped (i.e. virtualized) onto an R-DHT node. To retrieve query results, a multi-attribute query is transformed into a number of exact queries using the Hilbert SFC. These exact queries are further processed using R-DHT lookups. To reduce the number of lookups required, we propose two optimizations to the Midas query engine, namely incremental search and search-key elimination.

We evaluate R-DHT and Midas through analytical and simulation analysis. Our main findings are as follows. Firstly, the lookup path length of each R-DHT lookup operation is indeed independent of the number of virtual nodes. This demonstrates that our lookup optimization techniques are applicable to other DHT-based systems that also virtualize physical hosts into nodes. Secondly, we found that R-DHT is effective in supporting multi-attribute range queries when the number of query results is small. Our results also imply that a selective data-item distribution scheme would reduce the cost of query processing in R-DHT. Thirdly, by not distributing data items, a DHT is more resilient to node failures. In addition, data updates at the source are performed locally and thus data-item inconsistency is avoided.
Overall, R-DHT is effective and efficient for resource indexing and discovery in large distributed systems with a strong commercial requirement in the ownership of data items and resource usage.

Acknowledgements

I thank God almighty who works mysteriously and amazingly to make things happen. I never had the slightest thought of pursuing a doctoral study, and yet His guidance has brought me this far. Throughout these five years, I have also slowly learned to appreciate His constant blessings and love.

To my supervisor, A/P Teo Yong Meng, I express my sincere gratitude for his advice and guidance throughout my doctoral study. His determined support when I felt my research was going nowhere is truly inspirational. I learned from him the importance of defining research problems, how to put solutions and findings into perspective, a mindset of always looking at both sides of a coin, and technical writing skills. I would also like to express my gratitude to my Ph.D. thesis committee, Professors Gary Tan Soon Huat, Wong Weng Fai, and Chan Mun Choon.

I acknowledge the contributions of Dr Wang Xianbing to this thesis. Due to his persistence, we managed to analytically prove the lookup path length of R-DHT. In addition, the backup-fingers scheme was invented when we discussed experimental results that were in contrast to the theoretical analysis.

I am indebted to Peter Eriksson (KTH, Sweden) who implemented a simulator that I use in Chapter 3. Dr Bhakti Satyabudhi Stephan Onggo (LUMS, UK) has given me advice regarding simulations and my thesis writing. Hendra Setiawan gave me a crash course on probability theory to help me in performing the theoretical analysis. Professor Seif Haridi (KTH, Sweden), Dr Ali Ghodsi (KTH, Sweden), and Gabriel Ghinita provided valuable input at various stages of my research. With Dr Lim Hock Beng, I have had some very insightful discussions regarding my research.
I owe a great deal to Tan Wee Yeh, the keeper of Angsana and Tembusu2 clusters, whom I bugged frequently during my experiments. I thank Johan Prawira Gozali for sharing with me major works in job scheduling when I was looking for a research topic. Many thanks to Arief Yudhanto, Djulian Lin, Fendi Ciuputra Korsen, Gunardi Endro, Hendri Sumilo Santoso, Kong Ming Siem, and other friends as well for their support.

Finally, I thank my parents who have devoted their greatest support and encouragement throughout my tough years in NUS. I would never have completed this thesis without their constant encouragement especially when my motivation was at its lowest point. Thank you very much for your caring support.

Contents

Abstract
Acknowledgements
Contents
List of Symbols
List of Figures
List of Tables
List of Theorems

1 Introduction
  1.1 P2P Lookup
  1.2 Distributed Hash Table (DHT)
    1.2.1 Chord
    1.2.2 Content-Addressable Network
    1.2.3 Kademlia
  1.3 Multi-Attribute Range Queries on DHT
    1.3.1 Distributed Inverted Index
    1.3.2 d-to-d Mapping
    1.3.3 d-to-one Mapping
  1.4 Motivation
  1.5 Objective
  1.6 Contributions
  1.7 Thesis Overview

2 Read-only DHT: Design and Analysis
  2.1 Terminologies and Notations
  2.2 Overview of R-DHT
  2.3 Design
    2.3.1 Read-only Mapping
    2.3.2 R-Chord
    2.3.3 Lookup Optimizations
      2.3.3.1 Routing by Segments
      2.3.3.2 Shared Finger Tables
    2.3.4 Maintenance of Overlay Graph
  2.4 Theoretical Analysis
    2.4.1 Lookup
    2.4.2 Overhead
    2.4.3 Cost Comparison
  2.5 Simulation Analysis
    2.5.1 Lookup Path Length
    2.5.2 Resiliency to Simultaneous Failures
    2.5.3 Time to Correct Overlay
    2.5.4 Lookup Performance under Churn
  2.6 Related Works
    2.6.1 Structured P2P with No-Store Scheme
    2.6.2 Resource Discovery in Computational Grid
  2.7 Summary

3 Hierarchical R-DHT: Collision Detection and Resolution
  3.1 Related Work
    3.1.1 Varying Frequency of Stabilization
    3.1.2 Varying Size of Routing Tables
    3.1.3 Hierarchical DHT
  3.2 Design of Hierarchical R-DHT
    3.2.1 Collisions of Group Identifiers
    3.2.2 Collision Detection
    3.2.3 Collision Resolution
      3.2.3.1 Supernode Initiated
      3.2.3.2 Node Initiated
  3.3 Simulation Analysis
    3.3.1 Maintenance Overhead
    3.3.2 Extent and Impact of Collisions
    3.3.3 Efficiency and Effectiveness
      3.3.3.1 Detection
      3.3.3.2 Resolution
  3.4 Summary

4 Midas: Multi-Attribute Range Queries
  4.1 Related Work
  4.2 Hilbert Space-Filling Curve
    4.2.1 Locality Property
    4.2.2 Constructing Hilbert Curve
  4.3 Design
    4.3.1 Multi-Attribute Indexing
      4.3.1.1 d-to-one Mapping Scheme
      4.3.1.2 Resource Type Specification
      4.3.1.3 Normalization of Attribute Values
    4.3.2 Query Engine and Optimizations
  4.4 Performance Evaluation
    4.4.1 Efficiency
    4.4.2 Cost of Query Processing
    4.4.3 Resiliency to Node Failures
    4.4.4 Query Performance under Churn
  4.5 Summary

5 Conclusion
  5.1 Summary
  5.2 Future Works

Appendices
A Read-Only CAN
  A.1 Flat R-CAN
  A.2 Hierarchical R-CAN
B Selective Data-Item Distribution

References

List of Symbols

R-DHT
  β      Ratio of the number of collisions in hierarchical R-DHT with detect & resolve to the number of collisions in hierarchical R-DHT without detect & resolve
  ξ      Stabilization degree of an overlay network
  ξ_n    Correctness of n's finger table
  f      Finger
  h      Host
  K      Number of unique keys in a system
  k      Key
  N      Number of hosts
  n      Node
  p      Stabilization period
  r      Resource
  S_k    Segment prefixed with k
  T      Average number of unique keys in a host
  T_h    Set of unique keys in host h
  V      Number of nodes

Midas
  a      Length parameter that determines the size of the query region for the experiments in Chapter 4
  C      Number of clusters in a query region
  c      Cluster, i.e. consecutive Hilbert identifiers from c.lo to c.hi
  d      Number of dimensions
  f_Hilbert^(−1)   Function to map a Hilbert identifier to a coordinate
  f_Hilbert        Function to map a coordinate to a Hilbert identifier
  H_l^d  The l-th-order Hilbert curve of a d-dimensional space
  I      Number of intermediate nodes required to locate a responsible node
  l      Approximation level of a multidimensional space and a Hilbert curve
  Q      Query region whose Q.lo and Q.hi are its smallest and largest coordinates
  q      Ordered set of search keys
  Q_akey   Number of available keys
  Q_cnode  Number of Chord nodes responsible for keys
  Q_skey   Number of search keys
  R      Number of responsible nodes

List of Figures

1.1 Classification of P2P Lookup Schemes
1.2 Chord Ring
1.3 Chord Lookup
1.4 Join Operation in Chord
1.5 Lookup in a 2-Dimensional CAN
1.6 Dynamic Partitioning of a 2-Dimensional CAN
1.7 Kademlia Tree Consisting of 14 Nodes (m = 4 Bits)
1.8 Kademlia Lookup (α = 1 Node)
1.9 Classification of Multi-Attribute Range Query Schemes on DHT
1.10 Example of Distributed Inverted Index on Chord
1.11 Intersecting Intermediate Result Sets
1.12 Example of Direct Mapping on 2-dimensional CAN
1.13 Hilbert SFC Maps Two-Dimensional Space onto One-Dimensional Space
1.14 Example of 2-Dimensional Hash on Chord

2.1 Host in the Context of Computational Grid
2.2 virtualize : hosts → nodes
2.3 Proposed R-DHT Scheme
2.4 Resource Discovery in a Computational Grid
2.5 Mapping Keys to Node Identifiers
2.6 Virtualization in R-DHT
2.7 R-DHT Node Identifiers
2.8 Virtualizing Host into Nodes
2.9 Chord and R-Chord
2.10 Node Failures and Stale Data Items
2.11 The Fingers of Node 2|3
2.12 Unoptimized R-Chord Lookup
2.13 R-Chord Lookup Exploiting R-DHT Mapping
2.14 lookup(k) with and without Routing by Segments
2.15 Effect of Shared Finger Tables on Routing
2.16 Finger Tables with Backup Fingers
2.17 Successor-Stabilization Algorithm
2.18 Finger-Correction Algorithm
2.19 Average Lookup Path Length
2.20 Average Lookup Path Length with Failures (N = 25,000 Hosts)
2.21 Percentage of Failed Lookups (N = 25,000 Hosts)
2.22 Correctness of Overlay ξ

3.1 Two-Level Overlay Consisting of Four Groups
3.2 Example of a Lookup in Hierarchical R-DHT
3.3 Join Operation
3.4 Collision at the Top-Level Overlay
3.5 Collision Detection Algorithm
3.6 Collision Detection Piggybacks Successor Stabilization
3.7 Collision Detection for Groups with Several Supernodes
3.8 Announce Leave to Preceding and Succeeding Supernodes
3.9 Supernode-Initiated Algorithm
3.10 Node-Initiated Algorithm
3.11 Maintenance Overhead of Hierarchical R-Chord
3.12 Size of Top-Level Overlay (V = 100,000 Nodes)

4.1 Retrieving Result Set of Resource Indexes with Attribute cpu = P3
4.2 SFC on 2-Dimensional Space
4.3 Clusters and Region
4.4 Constructing Hilbert Curve on 2-Dimensional Space
4.5 Midas Indexing and Query Processing
4.6 Midas Multi-dimensional Indexing
4.7 Attributes and Key
4.8 Example of Midas Indexing (d = 2 Dimensions and m = 4 Bits)
4.9 Dimension Values for Compound Attribute book
4.10 Sample XML Document of GLUE Schema
4.11 Range Query with Search Attributes cpu and memory
4.12 Naive Search Algorithm
4.13 Midas Incremental Search Algorithm
4.14 Search-Key Elimination
4.15 Example of Range Query Processing
4.16 Four Chord Nodes are Responsible for Twelve Search Keys
4.17 Locating Key and Accessing Resource in R-Chord and Chord

5.1 Multi-attribute Queries on R-DHT
5.2 Exploiting Host Virtualization to Selectively Distribute Data Items

A.1 VIDs of Node Identifier 1101₂
A.2 Zone Splitting in CAN may Violate Definition A.1
A.3 Zone Splitting in Flat R-CAN
A.4 Zone Splitting in Hierarchical R-CAN

B.1 Relaxing Node Autonomy
B.2 Lookup within Reserved Segment

List of Tables

2.1 Variables Maintained by Host and Node
2.2 Comparison of API in R-DHT with Conventional DHT
2.3 Comparison of Chord and R-Chord
2.4 Lookup Performance under Churn (N ∼ 25,000 Hosts)
2.5 Comparison of R-DHT with Related Work

3.1 Additional Variables Maintained by Node n in a Hierarchical R-DHT
3.2 Number of Collisions
3.3 Average Time to Detect a Collision (in Seconds)
3.4 Ratio of Number of Collisions (β)
3.5 Average Number of Nodes Affected by a Collision

4.1 Comparison of Multi-attribute Range Query Processing
4.2 Resource Type Specification for Compute Resources based on GLUE Schema
4.3 Performance of Query Processing in Naive Scheme vs Midas
4.4 Query Cost of Midas
4.5 Q_cnode
4.6 Average Number of Lookups per Query (based on Table 4.4b)
4.7 Average Number of Intermediate Nodes per Lookup (based on Table 4.4b)
4.8 Percentage of Keys Retrieved under Simultaneous Node Failures
4.9 Percentage of Keys Retrieved under Churn (N ∼ 25,000 Hosts)

List of Theorems

Definition 2.1 Resource Type
Definition 2.2 Host
Definition 2.3 Node
Definition 4.1 Key Derived from Hilbert SFC
Definition 4.2 Query Region
Definition A.1 R-CAN VID

Property 4.1 Refinement of Hilbert Cell
Property 4.2 Bit-Length of Dimension
Property 4.3 Bit-Length of Hilbert Codes

Lemma 2.1 Probability of a Host to own a Key
Lemma 2.2 Lookup Path Length of Routing by Segments

Theorem 2.1 Lookup Path Length in Chord
Theorem 2.2 Lookup Path Length in R-Chord
Theorem 2.3 Cost to Join Overlay
Theorem 2.4 Number of Fingers Maintained by Host in R-Chord
Theorem 2.5 Cost of Stabilizations
Theorem 2.6 Finger Flexibility
Theorem 2.7 Cost to Add Key
Theorem 2.8 Number of Replicas
Theorem A.1 Zone Splitting in Flat R-CAN

Chapter 1
Introduction

The advance of internetworking has led to initiatives to share resources and collaborate across geographically dispersed locations. One popular initiative is peer-to-peer-based systems. Peer-to-peer (P2P) is an architecture for building large distributed systems that facilitate resource sharing among nodes (peers) from different administrative domains, where nodes are organized as an overlay network on top of existing network infrastructure (e.g. the TCP/IP network). The main characteristics of P2P are (i) every node can be a resource provider (server) and a resource consumer (client), and (ii) the overlay network is self-organizing with minimum manual configuration [10, 18, 100, 112]. P2P has been specifically applied to file-sharing applications [6].
However, the popularity of the P2P paradigm has led to its adoption by other types of applications such as information retrieval [105, 109, 127, 135, 146], filesystems [38, 39, 42, 46, 66, 81, 83, 104], databases [70, 111], content delivery [34, 41, 48, 73, 82, 88, 125], and communication and messaging systems [3, 11, 12, 13, 102]. Recently, P2P has also been proposed to support resource discovery in computational grids [27, 28, 71, 91, 132, 145].

A key service in P2P is an effective and efficient resource discovery service. Effective means users should successfully find available resources with high result guarantee, while efficient means resource discovery processes are subject to performance constraints such as a minimum number of hops or minimum network traffic. As a P2P system comprises peer nodes from different administrative domains, an important design consideration of a resource discovery scheme is to address the problem of resource ownership and conflicting self-interest among administrative domains.

In this thesis, we present a resource discovery scheme based on read-only DHT (R-DHT). The remainder of this chapter is organized as follows. First, we review existing P2P lookup schemes in Section 1.1 and introduce a class of decentralized P2P lookup schemes called DHT in Section 1.2. In Section 1.3, we discuss how DHT supports a type of complex query called multi-attribute range queries. Then, we highlight the problem of data-item distribution in Section 1.4. Next, we present the objective of this thesis and our contributions in Sections 1.5–1.6. Finally, we describe the organization of this thesis in Section 1.7.

1.1 P2P Lookup

Based on their architecture, we classify P2P lookup schemes as centralized or decentralized (Figure 1.1). Centralized schemes such as Napster [8] employ a directory server to index all resources in the overlay network.
This leads to high result guarantee and efficiency since each lookup is forwarded only to the directory server. However, for large systems, a central authority needs a significant investment in providing a powerful directory server to handle a high number of requests. The directory server is also a potential single point of failure due to technical reasons such as hardware failure, and non-technical reasons such as political or legal actions. A well-publicized example is the termination of the Napster service in July 2001 due to legal actions.

Figure 1.1: Classification of P2P Lookup Schemes

Decentralized schemes minimize the reliance on a central entity by distributing the lookup processing among nodes in the overlay. Based on the overlay topology, decentralized schemes are further classified as unstructured P2P and structured P2P.

Unstructured P2P such as Gnutella [6] organizes nodes as a random overlay graph. In the earlier unstructured P2P, each node indexes only its own resources and a lookup floods the overlay: each node forwards an incoming lookup to all its neighbors. However, flooding limits scalability because in a P2P system consisting of N nodes, the lookup complexity, in terms of the number of messages, is O(N^2) [98, 121]. Hence, a high volume of network traffic is generated. To address this scalability issue, various approaches to limit search scope have been proposed, including heuristic-based routing [15, 37, 79, 94, 141], distributed index [33, 35, 40], superpeer architecture [142], and clustering of peers [33, 114]. Although it improves lookup scalability, limiting search scope leads to a lower result guarantee: a lookup returns a false negative answer when it is terminated before successfully locating resources. Thus, efficiently achieving a high result guarantee remains a challenging problem [35, 138].
Structured P2P, also known as distributed hash table (DHT) [62, 69, 89, 117], is another decentralized lookup scheme that aims to provide a scalable lookup service with high result guarantee. We review the mechanism of DHT in Section 1.2 and how DHT supports complex queries in Section 1.3.

1.2 Distributed Hash Table (DHT)

Like a hash-table data structure, a DHT provides an interface to retrieve a key-value pair. A key is an identifier assigned to a resource; traditionally this key is a hash value associated with the resource. A value is an object to be stored into the DHT; this could be the shared resource itself (e.g. a file), an index (pointer) to a resource, or resource metadata. An example of a key-value pair is ⟨SHA1(file name), http://peer-id/file⟩, where the key is the SHA1 hash of the file name and the value is the address (location) of the file.

DHT works in a similar way to hash tables. Whereas a hash table assigns every key-value pair to a bucket, DHT assigns every key-value pair to a node. There are three main concepts in DHT: key-to-node mapping, data-item distribution, and structured overlay networks.

Key-to-Node Mapping. Assuming that keys and nodes share the same identifier space, DHT maps key k to node n where n is the closest node to k in the identifier space; we refer to n as the responsible node of k. We use the terms one-dimensional DHT and d-dimensional DHT to refer to DHT that use a one-dimensional identifier space and a d-dimensional identifier space, respectively.

Data-Item Distribution. All key-value pairs (i.e. data items) whose key equals k are stored at node n regardless of who owns these key-value pairs. To improve the resilience of lookups when the responsible node fails, the key-value pairs can also be replicated in a number of neighbors of n.
However, the replication needs to consider application-specific requirements such as consistency among replicas, degree of replication, and overhead of replication [42, 54, 87, 113, 120].

Structured Overlay Network. In DHT, nodes are organized as a structured overlay network with the purpose of striking a balance between routing performance and the overhead of maintaining routing states. There are two important characteristics of a structured overlay network:

1. Topology. A structured overlay network resembles a graph with a certain topology such as a ring [123, 133], a torus [116], or a tree [14, 99].
2. Ordering of nodes. The position of a node in a structured overlay network is determined by the node identifier.

Compared to unstructured P2P, DHT is perceived to offer better lookup performance in terms of result guarantee and lookup path length [93]. Due to the key-to-node mapping, finding a key-value pair is equivalent to locating the node responsible for the key. This increases result guarantee (i.e. a lower number of false negative answers) because it avoids the termination of lookups before existing keys are found¹. By exploiting its structured overlay, DHT locates the responsible node in a shorter and bounded number of hops (i.e. the lookup path length).

Existing DHT implementations adopt all three main DHT concepts. Two of these concepts, i.e. key-to-node mapping and structured overlay network, can be implemented differently among DHT implementations. On the other hand, data-item distribution is implemented in existing DHT by providing a store operation [43, 120]. As an illustration of how DHT concepts are implemented, we present three well-known DHT examples, namely Chord [133], Content-Addressable Network (CAN) [116], and Kademlia [99].

1. Chord, a one-dimensional DHT, is the basis for implementing our proposed read-only DHT scheme in Chapters 2–4.
2. CAN, a d-dimensional DHT, is used in an alternative implementation of our proposed scheme in Appendix A.
3. Kademlia is another one-dimensional DHT with a different key-to-node mapping function and structured overlay topology compared to Chord.

For each of these examples, we first elaborate on its overlay topology and key-to-node mapping function. We also highlight that each of the presented examples distributes data items. Lastly, we discuss the process of looking up a key (i.e. the basic DHT lookup operation) and the construction of the overlay network.

¹ In contrast to DHT, the result guarantee in unstructured P2P depends on the popularity of key-value pairs. Lookups for popular key-value pairs, i.e. those that are highly replicated and frequently requested, have a higher probability of returning a correct answer than lookups for less popular key-value pairs [93].

1.2.1 Chord

Chord is a DHT implementation that supports O(log N)-hop lookup path length and O(log N) routing states per node, where N denotes the total number of nodes [133]. Chord organizes nodes as a ring that represents an m-bit one-dimensional circular identifier space, and as a consequence, all arithmetic is modulo 2^m. To form a ring overlay, each node n maintains two pointers to its immediate neighbors (Figure 1.2). The successor pointer points to successor(n), i.e. the immediate neighbor of n clockwise. Similarly, the predecessor pointer points to predecessor(n), the immediate neighbor of n counterclockwise.

Figure 1.2: Chord Ring

Chord maps key k to successor(k), the first node whose identifier is equal to or greater than k in the identifier space (Figure 1.3a). Thus, node n is responsible for keys in the range (predecessor(n), n], i.e. keys that are greater than predecessor(n) but less than or equal to n. For example, node 32 is responsible for all keys in (21, 32].
All key-value pairs whose key equals k are then stored on successor(k) regardless of who owns the key-value pairs (i.e. data-item distribution).

Finding key k implies that we route a request to successor(k). The simplest approach for this operation, as illustrated in Figure 1.3b, is to propagate a request along the Chord ring in a clockwise direction until the request arrives at successor(k). However, this approach is not scalable as its complexity is O(N), where N denotes the number of nodes in the ring [133].

Figure 1.3: Chord Lookup. (a) Map and Distribute Keys to Nodes; (b) Traverse the Ring to Find successor(54); (c) The Fingers of Node 8; (d) find_successor(54) Utilizing Finger Tables

To speed up the process of finding successor(k), each node n maintains a finger table of m entries (Figure 1.3c). Each entry in the finger table is also called a finger. The i-th finger of n is denoted as n.finger[i] and points to successor($n + 2^{i-1}$), where 1 ≤ i ≤ m. Note that the 1st finger is also the successor pointer, while the largest finger divides the circular identifier space into two halves. When $N < 2^m$, the finger table consists of only O(log N) unique entries.

By utilizing finger tables, Chord locates successor(k) in O(log N) hops with high probability [133]. Intuitively, the process resembles a binary search where each step halves the distance to successor(k). Each node n forwards a request to the nearest known node preceding k. This is repeated until the request arrives at predecessor(k), the node whose identifier most closely precedes k, which will forward the request to successor(k). Figure 1.3d shows an example of finding successor(54) initiated by node 8. Node 8 forwards the request to its 6th finger, which points to node 48. Node 48 is the predecessor of key 54 because its 1st finger points to node 56 and 48 < 54 ≤ 56. Thus, node 48 will forward the request to node 56.
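The finger-table lookup above can be sketched in Python as follows. This is a simplified, single-process illustration, not Chord's actual implementation: the ring (m = 6, nodes 8, 21, 32, 48, 56) is assumed, finger tables are recomputed from a global node list rather than maintained per node, and all function names are hypothetical.

```python
M = 6
RING = 2 ** M
NODES = [8, 21, 32, 48, 56]    # assumed ring, sorted


def successor_of(ident):
    """First node whose identifier is >= ident (mod 2^m), wrapping around."""
    ident %= RING
    for n in NODES:
        if n >= ident:
            return n
    return NODES[0]


def in_open(x, a, b):
    """True if x lies in the circular open interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)


def fingers(n):
    """finger[i] = successor(n + 2^(i-1)) for i = 1..m (index 0 holds finger[1])."""
    return [successor_of(n + 2 ** (i - 1)) for i in range(1, M + 1)]


def find_successor(start, key):
    """Route from node `start` towards successor(key); return (node, path taken)."""
    n, path = start, [start]
    while True:
        succ = fingers(n)[0]                  # the 1st finger is the successor
        if in_half_open(key, n, succ):        # n is predecessor(key)
            path.append(succ)
            return succ, path
        # forward to the nearest known node preceding key
        nxt = next((f for f in reversed(fingers(n)) if in_open(f, n, key)), succ)
        path.append(nxt)
        n = nxt


print(find_successor(8, 54))   # prints (56, [8, 48, 56])
```

Under these assumptions the 6th finger of node 8 is successor(8 + 32) = 48, so the route 8, 48, 56 agrees with the example in the text.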
Figure 1.4 illustrates the construction of a Chord ring. A new node n joins a Chord ring by locating its own successor. Then, n inserts itself between successor(n) and the predecessor of successor(n), as illustrated in Figure 1.4a. The key-value pairs stored on successor(n) whose keys are less than or equal to n are migrated to node n (Figure 1.4b). Because the join operation invalidates the ring overlay, every node performs periodic stabilizations to correct its successor and predecessor pointers (Figure 1.4c), and its fingers.

Figure 1.4: Join Operation in Chord. (a) Insert Node 25; (b) Migrate Key 22 from Node 32 to Node 25; (c) Correct Successor and Predecessor Pointers

1.2.2 Content-Addressable Network

CAN is a d-dimensional DHT that supports $O(N^{1/d})$-hop lookup path length and O(d) routing states per node, where N denotes the total number of nodes [116]. The design of CAN is based on a d-dimensional Cartesian coordinate space on a d-torus. The coordinate space is partitioned into zones and every node is responsible for a zone. Each node is also assigned a virtual identifier (VID) that reflects its position in the coordinate space. To facilitate routing (i.e. lookups), a node maintains pointers to its adjacent neighbors. For a d-dimensional coordinate space partitioned into N equal zones, every node maintains 2d neighbors. Figure 1.5 illustrates an example of a 2-dimensional CAN consisting of six nodes and an 8 × 8 coordinate space. Node E, whose VID is 101, is responsible for zone [6–8, 0–4], where the lower-left Cartesian point (6, 0) and the upper-right Cartesian point (8, 4) are the lowest and highest coordinates in this zone, respectively.

Figure 1.5: Lookup in a 2-Dimensional CAN

CAN maps key k to point p within a zone.
As in Chord, CAN also adopts data-item distribution, where the key-value pair whose key equals k is stored on the node responsible for the zone. Thus, finding a key implies locating the zone that contains point p. Intuitively, CAN routes a request to a destination zone by using a straight-line path from the source to the destination. Each node forwards a request to its neighbor whose coordinate is the closest to the destination coordinate. For a d-dimensional coordinate space divided into N equal zones, the lookup path length is $O(N^{1/d})$ [116]. Figure 1.5 shows a lookup for a key mapped to Cartesian point (7, 3). Initiated by node C, the lookup is routed to node E as its zone, [6–8, 0–4], contains the requested point.

To join a CAN coordinate space, a new node n randomly chooses a point p and locates zone z that contains p. Then, z is split into two child zones along a particular dimension based on a well-defined ordering. For instance, in a two-dimensional CAN, a zone is first split along the x axis followed by the y axis. Node e, which was responsible for z, will take over the lower child zone along the split dimension, while the new node n is responsible for the higher child zone. To properly reflect their new positions, the VIDs of both nodes are updated by concatenating the original VID of e with 0 (if the node is in the lower child zone) or 1 (if the node is in the higher child zone).

Figure 1.6 illustrates the construction of a 2-dimensional CAN consisting of six nodes. A binary string in parentheses denotes a node VID. Initially, the first node A is responsible for the whole coordinate space, i.e. [0–8, 0–8], and its VID is an empty string (Figure 1.6a). As node B arrives (Figure 1.6b), zone [0–8, 0–8] is split along the x axis into two child zones: [0–4, 0–8] and [4–8, 0–8], which correspond to the lower and higher zones, respectively, along the x axis.
Node A will be responsible for the lower child zone and therefore, its new VID is 0, which is the concatenation of A’s original VID and 0. Meanwhile, the new node B is responsible for the higher child zone and its VID will be 1. Figure 1.6c shows another node C arriving and further splitting zone [4–8, 0–8]. Because zone [4–8, 0–8] is the result of a previous split along the x axis, this zone is now split along the y axis, which results in [4–8, 0–4], i.e. the lower child zone along the y axis, and [4–8, 4–8], i.e. the higher child zone along the y axis. Node B will take over the lower child zone and its new VID will be 10. The new node C is responsible for the higher child zone and therefore, its VID will be 11. The zone splitting continues as more nodes join (Figure 1.6d–1.6f).

Figure 1.6: Dynamic Partitioning of a 2-Dimensional CAN. (a) Node A Occupies [0–8, 0–8]; (b) Node B Splits [0–8, 0–8] along x Axis into [0–4, 0–8] and [4–8, 0–8]; (c) Node C Splits [4–8, 0–8] along y Axis into [4–8, 0–4] and [4–8, 4–8]; (d) Node D Splits [4–8, 4–8] along x Axis into [4–6, 4–8] and [6–8, 4–8]; (e) Node E Splits [4–8, 0–4] along x Axis into [4–6, 0–4] and [6–8, 0–4]; (f) Node F Splits [6–8, 4–8] along y Axis into [6–8, 4–6] and [6–8, 6–8]

1.2.3 Kademlia

Assuming an m-bit identifier space, Kademlia supports O(log N)-hop lookup path length and O(κm) routing states per node, where N denotes the total number of nodes and κ denotes a coefficient for routing-state redundancy [99]. Kademlia organizes nodes as a prefix-based binary tree where each node is a leaf of the tree. The position of a node is determined by the shortest unique prefix of the node identifier. Figure 1.7 illustrates the position of node 5 ($0101_2$) in a Kademlia tree, assuming a 4-bit identifier space.
Figure 1.7: Kademlia Tree Consisting of 14 Nodes (m = 4 Bits)

To facilitate the routing of lookup requests, each node maintains a routing table consisting of O(m) buckets where each bucket consists of O(κ) pointers. First, node n divides the tree into m subtrees such that the i-th subtree consists of $O(N/2^i)$ nodes with the same (i − 1)-bit prefix as n, where 1 ≤ i ≤ m and N denotes the number of nodes. The i-th subtree is higher than the j-th subtree if i < j. Thus, the 1st subtree is also called the highest subtree, while the m-th subtree is the lowest subtree. For each subtree, node n maintains a bucket consisting of pointers to O(κ) nodes in the subtree. Figure 1.7 illustrates the routing states maintained by node 5. The node partitions the binary tree into four subtrees. The 1st subtree consists of nodes with prefix 1, which amount to (nearly) half of the tree. The remaining three subtrees consist of nodes with prefix 0, 01, and 010, respectively.

Kademlia maps key k to node n whose identifier is the closest to k. The distance between k and n is defined as d(k, n) = k ⊕ n, where ⊕ is the bitwise XOR operator and the value of d(k, n) is interpreted as an integer. Then, key-value pairs whose key equals k are distributed to n.

To find key k, each node forwards a lookup request to the lowest subtree that contains k, i.e. a subtree that has the same longest common prefix as k. This is repeated until the request arrives at the node closest to k. In an N-node tree, the lookup complexity is O(log N) hops and the reason is similar to Chord: every routing step halves the distance to the destination.

Kademlia reduces the turnaround time of lookups by exploiting its κ-bucket routing tables. When forwarding a request to a subtree, the request is concurrently sent to α (≤ κ) nodes in the subtree. Figure 1.8a illustrates a lookup for key 14 ($1110_2$) initiated by node 5 ($0101_2$). The key is mapped to node 15 where d(14, 15) = 1 ($0001_2$).
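The XOR metric and the subtree numbering can be checked with a few lines of Python. This is a sketch under assumptions: the node set below is a hypothetical subset of a 4-bit identifier space (not the 14 nodes of Figure 1.7), and the helper names are illustrative rather than taken from any Kademlia implementation.

```python
M = 4                         # identifier length in bits
NODES = [1, 5, 9, 12, 15]     # assumed node identifiers (hypothetical subset)


def distance(a, b):
    """Kademlia distance: bitwise XOR of two identifiers, read as an integer."""
    return a ^ b


def closest_node(key, nodes=NODES):
    """Key-to-node mapping: the node whose identifier is closest to key under XOR."""
    return min(nodes, key=lambda n: distance(key, n))


def subtree_index(n, other, m=M):
    """1-based index i of n's subtree containing `other` (other != n).

    The i-th subtree holds identifiers sharing exactly an (i - 1)-bit prefix with n,
    so i is the common prefix length plus one.
    """
    common_prefix = m - (n ^ other).bit_length()
    return common_prefix + 1


print(distance(14, 15))       # prints 1
print(closest_node(14))       # prints 15
print(subtree_index(5, 14))   # prints 1
```

With these definitions, key 14 falls in the 1st (highest) subtree of node 5, since $0101_2$ and $1110_2$ share no common prefix.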
Because key 14 and node 5 do not share a common prefix, node 5 forwards the request to any node in the 1st subtree (Figure 1.8a). Assuming that the request arrives at node 12 ($1100_2$), node 12 further forwards the request to its 3rd subtree, which contains only node 15 (Figure 1.8b). At node 15 ($1111_2$), the lookup request is terminated because the distance between k and any node in node 15’s lowest subtrees is larger than d(14, 15) (Figure 1.8c).

Figure 1.8: Kademlia Lookup (α = 1 Node). (a) Node 5 Initiates a Lookup for Key 14 ($1110_2$); (b) Node 12 Processes the Lookup; (c) Node 15 Terminates the Lookup

The construction of a Kademlia tree is straightforward. A new node n first locates another node n′ closest to it. Then, n probes and builds its m subtrees through node n′. In addition, every time n receives a request, it adds the sender of the request into the appropriate bucket. The replacement policy ensures that a bucket contains pointers to stable nodes (i.e. nodes with longer uptime).

1.3 Multi-Attribute Range Queries on DHT

The DHT lookup operation, presented in the previous section, offers a high result guarantee and short lookup path length for single-attribute exact queries [93]. This may suffice for applications such as CFS [42] and POST [102]. However, applications such as computational grids deal with resources described by many attributes [5, 7]. Users of such applications need to find resources that match a multi-attribute range query. To fulfill this need, DHT must support not only single-attribute exact queries (i.e. the basic DHT lookup operation), but also multi-attribute range queries.

A multi-attribute range query is a query that consists of multiple search attributes. Each search attribute can be constrained by a range of values using the relational operators <, ≤, =, >, and ≥.
An example of such a query is to find compute resources whose cpu = P3 and 1 GB ≤ memory ≤ 2 GB. A special case of multi-attribute range queries is multi-attribute exact queries, where each attribute is equal to a specific value. An example of a multi-attribute exact query is to find compute resources whose cpu = P3 and memory = 1 GB.

Supporting multi-attribute range queries is well researched in other fields such as databases [49] and information retrieval [21]. This thesis focuses on multi-attribute range queries on DHT. As illustrated in Figure 1.9, we classify multi-attribute range query processing on DHT into three categories, namely distributed inverted index, d-to-d mapping, and d-to-one mapping. Distributed inverted index and d-to-one mapping schemes are applicable to both one-dimensional DHT [99, 123, 133, 144] and d-dimensional DHT [116], whereas d-to-d mapping is applicable to d-dimensional DHT only. In Sections 1.3.1–1.3.3, we discuss the indexing scheme and query-processing scheme used in each of the categories.

Figure 1.9: Classification of Multi-Attribute Range Query Schemes on DHT

1.3.1 Distributed Inverted Index

For every resource described by d attributes, a distributed inverted index assigns d keys to the resource, i.e. one key per attribute. To facilitate range queries, each attribute is hashed into a key using a locality-preserving hash function [19, 28]; this ensures that consecutive attribute values are hashed to consecutive keys. Examples of DHT-based distributed inverted indexes are MAAN [28], CANDy [24], n-Gram Indexing [67], KSS [56], and MLP [129].

Figure 1.10 illustrates the indexing of a compute resource R with two attributes, cpu = P3 and memory = 1 GB. Based on these attributes, we assign two key-value pairs to the resource, one with key $k_{cpu}$ = hash(P3) and the other with key $k_{memory}$ = hash(1 GB). Then, we store the two key-value pairs in the underlying DHT.
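The one-key-per-attribute indexing can be sketched as follows. This is an illustration under assumptions only: a Python dictionary stands in for the DHT store operation, toy_hash is a placeholder and not a locality-preserving hash, and the attribute name memory_gb is hypothetical.

```python
from collections import defaultdict

# Stand-in for the DHT: key -> list of values stored under that key.
dht = defaultdict(list)


def toy_hash(attribute, value):
    """Placeholder key derivation; a real system would use a locality-preserving hash."""
    return (attribute, value)


def index_resource(name, attributes):
    """Store one key-value pair per attribute and return the keys that were used."""
    keys = []
    for attribute, value in attributes.items():
        key = toy_hash(attribute, value)
        # Each value carries the resource name and its complete attribute list.
        dht[key].append((name, dict(attributes)))
        keys.append(key)
    return keys


keys = index_resource("R", {"cpu": "P3", "memory_gb": 1})
print(len(keys))   # prints 2
```

Storing the complete attribute list in each value is a design choice of this sketch; it lets a single lookup result be filtered on the remaining attributes without further lookups.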
There are two main strategies for processing a d-attribute range query. The first strategy uses O(d) DHT lookups: one lookup (i.e. the selection operator, σ, in relational algebra) for each attribute. The result sets of these lookups need to be intersected (i.e. operator ∩) to produce a final result set. This can be performed at the query initiator [28] or by pipelining intermediate result sets through a number of nodes [24, 56, 129], as illustrated in Figure 1.11. The second strategy requires only O(1) lookups to obtain the final result set. Assuming that each key-value pair also includes the complete attributes of the resource (value), the intersection can be performed only once.

Figure 1.10: Example of Distributed Inverted Index on Chord

Figure 1.11: Intersecting Intermediate Result Sets. (a) At Query Initiator; (b) At Intermediate Nodes

1.3.2 d-to-d Mapping

A d-to-d mapping scheme, such as pSearch [135], MURK [50], and 2CAN [16], maps each d-attribute resource onto a point in a d-dimensional space. Figure 1.12 illustrates how a compute resource with cpu = P3 and memory = 1 GB is mapped to point (P3, 1 GB) in a 2-dimensional CAN. The x-axis and y-axis of the coordinate space correspond to attributes cpu and memory, respectively.

Figure 1.12: Example of Direct Mapping on 2-dimensional CAN

In d-to-d mapping, a d-attribute range query can be visualized as a region in the coordinate space. For example, the shaded rectangle in Figure 1.12 represents a query for resources with any type of cpu and 256 ≤ memory ≤ 768. The basic concept in processing a query involves two stages. First, a request is routed to any point in the query region. On reaching the initial point, the request is further flooded to the remaining points in the query region.

1.3.3 d-to-one Mapping

d-to-one mapping maps a d-attribute resource onto a point (i.e. a key) in a one-dimensional identifier space. Each d-attribute resource is assigned a key drawn from a one-dimensional identifier space. The key is derived by hashing the d-attribute resource using a locality-preserving function, i.e. the d-to-one mapping function. The resulting key (and key-value pair) is then stored on the underlying DHT. Compared to d-to-d mapping, d-to-one mapping can use one-dimensional DHT (e.g. Chord [133]) as the underlying DHT, as well as d-dimensional DHT (e.g. CAN [116]).

Examples of query processing schemes on DHT that are based on d-to-one mapping are Squid [127], SCRAP [50], ZNet [131], CISS [86], and CONE [16]. With the exception of CONE, all the above examples use a space-filling curve (SFC) as the hash function. Figure 1.13 shows an example of a Hilbert SFC [124] that maps each two-dimensional coordinate point onto an identifier, e.g. coordinate (3, 3) is mapped onto identifier 10.

Figure 1.13: Hilbert SFC Maps Two-Dimensional Space onto One-Dimensional Space

Figure 1.14 illustrates the indexing of resources with two attributes. Each resource corresponds to a point in the 2-dimensional attribute space, and each point is further hashed into a key (Figure 1.14a). Using the Hilbert curve, (cpu = P3, memory = 1 GB) and (cpu = sparc, memory = 4 GB) are assigned key 3 and key 10, respectively. Since each key is one-dimensional, it can be mapped directly to a one-dimensional DHT such as Chord (Figure 1.14b).

Figure 1.14: Example of 2-Dimensional Hash on Chord. (a) Map Points in 2-Dimensional Attribute Space to Keys in 1-Dimensional Identifier Space; (b) Map Keys to Chord Nodes

Similar to d-to-d mapping, a d-attribute range query can be visualized as a region in the d-dimensional attribute space. However, the difference between d-to-d mapping and d-to-one mapping is in the query processing. In d-to-one mapping, we apply the d-to-one mapping function to the query region to produce a number of search keys. A naive way of searching is to issue a lookup for each search key.
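The naive translation of a query region into search keys can be sketched with a textbook Hilbert-curve index computation. This is an assumption-laden illustration: the curve orientation used below is one common convention and may differ from the one drawn in Figure 1.13 (it does agree on the quoted example, mapping (3, 3) to 10), and the function names are hypothetical.

```python
def hilbert_key(n, x, y):
    """Index of cell (x, y) along a Hilbert curve over an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)      # offset contributed by this quadrant
        if ry == 0:                       # rotate/flip so the sub-quadrant is canonical
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def search_keys(n, x_range, y_range):
    """Naive d-to-one query translation: one search key per cell of the query region."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return sorted(hilbert_key(n, x, y)
                  for x in range(x_lo, x_hi + 1)
                  for y in range(y_lo, y_hi + 1))


print(hilbert_key(4, 3, 3))                # prints 10
print(search_keys(4, (2, 3), (2, 3)))      # prints [8, 9, 10, 11]
```

In this example the four cells of the query region receive four consecutive keys, which illustrates the locality that the space-filling curve is meant to preserve; a naive query engine would still issue one lookup per key.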
To reduce the number of lookups initiated, query processing is optimized by exploiting the facts that (i) some search keys do not represent available resources, and (ii) several search keys are mapped onto the same DHT node.

1.4 Motivation

Existing DHT implementations distribute data items: key-value pairs are proactively distributed by their owners across the overlay network. As each DHT node stores its key-value pair (i.e. data item) on a responsible node, which is determined by a key-to-node mapping function, data items from many nodes are aggregated in one responsible node. To exploit this property, various performance optimizations have been proposed, including load-balancing schemes [57, 58, 78], replication schemes to achieve high availability [42, 54, 81, 83, 87], and data aggregation schemes to support multi-attribute range queries (see Section 1.3).

Though it facilitates many performance optimizations in DHT, data-item distribution also reduces the autonomy (i.e. control) of nodes in placing their key-value pairs [44].

1. Node n has no control over where its key-value pairs will be stored because:
(a) A key-to-node mapping function considers only the distance between keys and nodes in the identifier space.
(b) A key can be remapped due to a new node, as illustrated in Figure 1.4b.
Hence, node n perceives its key-value pairs to be distributed to random nodes.

2. To join a DHT-based system, node n must make provision to store key-value pairs belonging to other nodes. However, n has limited control over the number of key-value pairs to store because:
(a) The number of keys mapped to n is affected by n’s neighbors (e.g. predecessor(n) in Chord).
(b) The number of key-value pairs with the same key (i.e. resources of the same type) depends on the popularity of the resource type; this is beyond the control of n.

The limited node autonomy potentially hinders the widespread adoption of DHT by commercial entities.
In large distributed systems, nodes can be managed by different administrative domains, e.g. different companies, different research institutes, etc. This has been observed in computational grids [47, 80] as well as earlier generations of distributed systems such as file-sharing P2P [6] and the World Wide Web (WWW). In such applications, distributing data items among different administrative domains (in particular, different commercial entities) leads to two major issues:

Ownership of Data Items
Commercial application requirements may not allow a node to proactively store its data items (even if the data items are just pointers to a resource) on other nodes. Firstly, the node is required to ensure that it is the sole provider of its own data items. As an example, a web site may not allow its contents to be hosted or even directly linked by other web sites, including search engines, to prevent customers being drawn away from the originating web site [107, 108, 118]. Secondly, a node may restrict the distribution of its data items to prevent their misuse [55, 59, 60].

Though a node can encrypt its key-value pairs before storing them on other nodes, we argue that encryption addresses the privacy issue instead of the ownership issue. The privacy issue is concerned with ensuring that data items are not accessible to illegitimate users, and this is addressed by encrypting data items. On the other hand, in the case of the ownership issue, data items are already publicly accessible.

Conflicting Self-Interest among Administrative Domains
Data-item distribution requires all nodes in a DHT overlay to be publicly writable. However, this may not happen when nodes do not permit the sharing of their storage resources with external parties due to different economic interests. Firstly, nodes want to protect their investment in their storage infrastructure by not storing data items belonging to other nodes.
Secondly, an individual node may limit the amount of storage it offers. However, limiting the amount of storage reduces the result guarantee if the total amount of storage in DHT becomes smaller than the total number of key-value pairs. In addition to the problem of enforcing storage policies, nodes also face a challenge where their infrastructure is used by customers of other parties [110, 130]. As an example, when a node stores many data items belonging to other parties, the node experiences increased usage of its network bandwidth and computing power due to processing a high number of lookup requests for these data items.

The above two issues can be addressed by not distributing data items. However, by design, DHT assumes that data items can be distributed across overlay networks.

1.5 Objective

User requirements may dictate that P2P systems provide an effective and efficient lookup service without distributing data items. In this thesis, we investigate a DHT-based approach without distributing data items and with support for multi-attribute range queries. The proposed scheme consists of two main parts: R-DHT (Read-only DHT) and Midas (Multi-dimensional range queries). R-DHT serves as the basic infrastructure to support the DHT lookup operations (i.e. single-attribute exact queries), and Midas adds support for multi-attribute range queries on R-DHT. As an example, we apply our proposed scheme to support decentralized resource indexing and discovery in large computational grids [47, 80].

R-DHT is a class of distributed hash tables where a node allows “read-only” accesses to its key-value pairs, but does not allow key-value pairs belonging to other nodes to be written (mapped) on it. The design criteria for R-DHT include:

1. Support for DHT-style lookup.
R-DHT must support the flat naming scheme in order to provide the hash table abstraction of locating data items.
The lookup operation of R-DHT requires users to specify only the key of the requested data items.

2. Effective and efficient lookup performance.
The result guarantee and the lookup path length of R-DHT must not be worse than those of conventional DHT.

3. Lookup resiliency to node failures.
R-DHT must be resilient to node failures even when resources by nature cannot be replicated. As an example, in a computational grid, resources are not replicable. However, there are multiple resource instances, shared by different administrative domains, with the same resource type, and finding a subset of these resources is sufficient. These properties are exploited to increase lookup resiliency in R-DHT.

In this thesis, we do not focus on “availability”, in which resources are replicated so that they can be located even if their master copy ceases to exist. Similar to DHT, resource replication can be introduced in R-DHT to increase resource availability.

Midas is a scheme to support multi-attribute range queries on R-DHT. Midas indexes multi-attribute resources by mapping each of the resources onto an R-DHT node. In this thesis, we focus on resources whose description conforms to a well-defined schema such as the GLUE schema [5]. Midas processes multi-attribute range queries using one or more R-DHT lookup operations. The design criteria of Midas include:

1. Efficient resource indexing.
To reduce the overhead of indexing resources, each multi-attribute resource is assigned only one key so that the resource is mapped onto one R-DHT node only. In addition, the indexing exploits locality, where resources with similar attributes are mapped to R-DHT nodes that are close in the overlay network.

2. Efficient query processing.
Midas processes a multi-attribute range query by invoking one or more R-DHT lookup operations. The number of lookup operations and the number of intermediate hops per lookup must be minimized.

3. Support for one-dimensional overlay network.
The majority of existing DHT implementations are one-dimensional DHT [99, 123, 133, 144]. Therefore, Midas must support one-dimensional R-DHT (e.g. Chord-based R-DHT) as the underlying DHT, in addition to d-dimensional R-DHT (e.g. CAN-based R-DHT).

1.6 Contributions

Our three main contributions are the R-DHT approach, hierarchical R-DHT, and multi-attribute range queries on R-DHT.

R-DHT Approach

Our proposed scheme enables DHT to map keys to nodes without distributing data items [97, 136]. The read-only mapping scheme in R-DHT virtualizes a host (i.e. a physical entity that shares resources) into nodes: one node is associated with each unique key belonging to the host. Nodes are organized as a segment-based structured overlay network. The node identifier space is split into two sub-spaces: key space and host identifier space. Our scheme inherits the good properties of DHT presented in Section 1.2, namely support for decentralized lookups to minimize single points of failure, high result guarantee, and bounded lookup path length. Compared to existing DHT, the R-DHT scheme results in the following benefits:

1. Node autonomy in placing data items
In R-DHT, each node stores only its own key-value pairs, i.e. the index of its shared resources, without depending on third-party publicly writable nodes. Thus, nodes are read-only because they store only their own key-value pairs, as in the “pay-for-your-own” usage model [22]. Updates to a key-value pair are reflected immediately by the node that owns it. This avoids the data inconsistency that can arise in conventional DHT.

2. Lookup performance
Though the size of an R-DHT overlay is larger than that of DHT, the lookup path length in R-DHT is at worst equal to that of DHT. The segment-based overlay of R-DHT allows messages to be routed by segments and finger tables to be shared among nodes. When the number of unique resource types (K) is larger than the number of hosts (N), e.g.
in file-sharing P2P systems, the lookup path length in R-DHT is bounded by O(log N) as in traditional DHT (Chord). However, when K ≤ N, e.g. in a computational grid, the O(log K)-hop lookup path length of R-DHT is shorter than that of traditional DHT.

3. Lookup resiliency to node failures
We demonstrate that the R-DHT segment-based overlay reduces the number of failed lookups (i.e. lookups that return a false-negative or false-positive answer) in the event of node failures. In R-DHT, lookups for key-value pairs shared by many nodes are more likely to succeed even without replicating key-value pairs. When one of the nodes fails, only its own key-value pairs become unavailable. The remaining key-value pairs in other nodes can still be discovered because R-DHT exploits the segment-based overlay by using backup fingers. Thus, the probability of finding resources of a certain type is higher when there are many nodes sharing resources of that type.

4. Reuse of DHT functionalities
R-DHT reuses the existing functionalities of conventional DHT and as such, improvements in DHT are beneficial to R-DHT as well. To demonstrate this ability of R-DHT, we present R-Chord, an implementation of the R-DHT scheme that uses Chord as the underlying overlay graph. R-Chord uses Chord’s join algorithm to construct its overlay network. In addition, R-Chord’s lookup and stabilization are based on Chord’s lookup and stabilization algorithms.

We show the performance and overhead of the R-DHT scheme through theoretical and simulation analysis.

Hierarchical R-DHT

We propose the design of a two-level R-DHT to reduce the maintenance overhead of R-DHT overlay networks. The hierarchical R-DHT partitions the maintenance overhead among smaller sub-overlays. To address the problem of collisions in the top-level overlay, we propose a scheme to detect and resolve collisions in hierarchical R-DHT [96]. Collisions occur in the top-level overlay because of membership changes when nodes join or fail.
We evaluate the effectiveness of this scheme through simulations.

Multi-Attribute Range Queries on R-DHT

Midas is our proposed scheme to support multi-attribute range queries on R-DHT [95]. The indexing scheme and query engine in Midas are based on d-to-one mapping for the following reasons:

1. One key per resource
d-to-one mapping assigns one key to each resource so that the resource is later mapped onto one R-DHT node only. This reduces the overhead of indexing resources on R-DHT. Midas uses the Hilbert space-filling curve [124] as the d-to-one mapping function because studies have indicated its effectiveness in preserving the locality of multidimensional indexes [74, 103].

2. Support for efficient query processing
Midas minimizes the number of lookup operations in processing a query by invoking lookups only for available resources. In addition, Midas exploits the locality of resource indexes to minimize the number of intermediate hops per lookup operation.

3. Support for one-dimensional overlay network
Midas assigns to each resource a key which is drawn from a one-dimensional key space. Therefore, resources can be mapped onto a node in a one-dimensional R-DHT.

Using simulations, we show that query processing on R-DHT achieves a higher result guarantee than conventional DHT. We also study the implication of data-item distribution for the cost of processing queries. Our study indicates that for the same size of queries, the cost of query processing in conventional DHT and R-DHT is determined by the number of nodes and the number of query results, respectively. This implies that R-DHT is more suitable when the number of query results is small. To reduce the query cost in R-DHT when the number of query results is large, an R-DHT-based system may perform data-item distribution only among a set of trusted nodes and search for query results only within the trusted nodes.
1.7 Thesis Overview

The remainder of this thesis is organized as follows.

Chapter 2 discusses the design of R-DHT and its Chord-based implementation called R-Chord. We first present read-only mapping, the main concept in R-DHT. Next, we discuss how the read-only mapping is applied to Chord, which results in read-only Chord (R-Chord). Subsequently, we present the optimizations to R-Chord lookup operations, i.e. routing by segments and shared finger tables. This is followed by the maintenance of the R-Chord overlay network, which exploits finger flexibility through backup fingers. We evaluate the performance of R-DHT through theoretical and simulation analysis.

Chapter 3 presents a hierarchical R-DHT that reduces the overhead of overlay-network maintenance. We discuss the design of hierarchical R-DHT where nodes in the top-level overlay network are organized into a Chord ring. Then, we present an approach to detect collisions in the hierarchical R-DHT by piggybacking on periodic stabilizations, followed by two approaches to resolve the collisions, namely supernode initiated and node initiated. Lastly, we present a simulation analysis of hierarchical R-DHT.

Chapter 4 discusses Midas, a scheme to support multi-attribute range queries on R-DHT. Midas uses the Hilbert space-filling curve as the d-to-one mapping function. We describe the indexing scheme of Midas, which maps resources to R-DHT nodes, followed by the Midas query engine, which searches for resources that satisfy a given query. We evaluate our approach through simulations.

Finally, Chapter 5 summarizes the results of this thesis and discusses some issues that require further investigation.

Chapter 2 Read-only DHT: Design and Analysis

A distributed hash table realizes the mapping of keys onto nodes through the store operation. As a result, key-value pairs are distributed across the overlay network [14, 99, 116, 123, 133].
Data-item distribution reduces node autonomy in the placement of key-value pairs. This leads to the issues of data-item ownership and conflicting self-interest among administrative domains.

In this chapter, we present R-DHT, a DHT scheme that does not distribute data items across its overlay network. We start with the terminologies and notations used throughout this chapter, followed by an overview of R-DHT. Then, we present the design of R-DHT using Chord [133] as the underlying overlay graph. This is followed by theoretical analysis and experimental evaluation, through simulations, of R-DHT lookup performance and maintenance overhead. Finally, we conclude this chapter with a summary.

2.1 Terminologies and Notations

In this section, we introduce the terms resource type, host, and node.

Definition 2.1. A resource type is the list of attribute names of a resource.

The resource type determines the key assigned to a resource. There could be many resource instances with the same type; these resources are assigned the same key.

Definition 2.2. A host refers to a physical entity that shares resources.

Let $T_h$ denote the set of unique keys (i.e. resource types) owned by a host whose host identifier is h. Figure 2.1 illustrates how the above terminologies are applied to a computational grid [47, 80]. In this example, host refers to the MDS server [4]. The two keys, $T_3 = \{2, 9\}$, denote that host 3 indexes two types of resources shared by administrative domain 3. One resource type consists of three resource instances (e.g. machines), each of which is identified by key 2. The other resource type refers to a resource whose key is 9. Details on assigning a key to a resource are discussed in Chapter 4.

Figure 2.1: Host in the Context of Computational Grid

Definition 2.3. A node refers to a logical entity in an overlay network.
In terms of set theory, a multiple-valued function, virtualize : hosts → nodes, describes the relationship between hosts and nodes (Figure 2.2). A host joins an overlay network as one or more nodes, i.e. by assuming one or more identities in the overlay. Clearly, each node corresponds to only one host. The notion of “nodes” is equivalent to “virtual servers” in Cooperative File System [42] or “virtual hosts” in Apache HTTP Server [1].

Figure 2.2: virtualize : hosts → nodes

Table 2.1 shows some of the important variables maintained by each host and Chord node. In addition to its node identifier (n), each node maintains its states in an overlay topology (finger, successor, and predecessor).

Entity    Variable          Description
Host h    T_h               a set of unique keys owned by host h
Node n    finger[1...F]     a finger table of F entries
          successor         the next node in the ring overlay, i.e. finger[1]
          predecessor       the previous node in the ring overlay

Table 2.1: Variables Maintained by Host and Node

In presenting pseudocode, we adopt the notations from [133]:

1. Let h denote a host or its identifier, and n denote a node or its identifier, as their meaning will be clear from the context.

2. Remote procedure calls or variables are preceded by the remote node identifier, while local procedure calls and variables omit the local node identifier.

2.2 Overview of R-DHT

R-DHT is a read-only DHT where a node supports “read-only” accesses to its keys, and does not allow keys belonging to other nodes to be written (mapped) on it. The read-only property ensures that keys are mapped onto their originating node. With its read-only property, R-DHT preserves node autonomy in placing key-value pairs, prevents stale data items, and increases lookup resiliency to node failures without the need to replicate keys.
As shown in Figure 2.3, R-DHT achieves the read-only mapping by virtualizing a host into a number of nodes where each node represents a unique key shared by the host. The node identifier space is divided into two sub-spaces: a key space and a host identifier space. This ensures the uniqueness of node identifiers without compromising R-DHT’s support for a flat naming scheme [22]. Shared resources of the same type are identified by the same key and form a segment on the overlay graph. A segment-based overlay reduces lookup path length and improves lookup resiliency to node failures without a need to replicate data items.

Figure 2.3: Proposed R-DHT Scheme

As an example, we discuss how R-DHT supports distributed resource discovery in a computational grid. A computational grid facilitates the sharing of compute resources from different administrative domains [47, 80]. Typically, grid users search for specific resources where their job will be executed [106]. In a centralized scheme, an MDS [4] indexes all resources from all administrative domains and processes all user queries (Figure 2.4a). As the adoption of grid increases, the central MDS becomes a potential bottleneck and a single point of failure. Recently, there has been a growing interest in studying the use of DHT-based resource discovery for large computational grids [27, 28, 132, 145]. Instead of depending on a third-party central MDS, DHT-based schemes distribute queries across administrative domains organized as nodes in an overlay network (Figure 2.4b). With R-DHT as the basis, a computational grid supports scalable distributed resource discovery with high result guarantee, while preserving the autonomy of administrative domains, where each administrative domain stores its own resource metadata.

2.3 Design

In this section, we present the design of R-DHT. We first describe read-only mapping, the main concept in R-DHT.
Then, we discuss the construction of the R-DHT overlay and the lookup algorithm using Chord [133] as the underlying overlay graph. An alternative design of R-DHT using CAN is described in Appendix A. Hereafter, we use “R-DHT” to refer to read-only DHT in general, and “R-Chord” to refer to a Chord-based R-DHT.

Figure 2.4: Resource Discovery in a Computational Grid. (a) Centralized (b) Decentralized

2.3.1 Read-only Mapping

The basic idea of our proposal is to exploit DHT mapping whereby a key can be mapped onto a specific node if the identifier of the node is equal to the key (Figure 2.5). Thus, each node in R-DHT is a bucket with one unique key, as opposed to conventional DHT where each node is a bucket with a number of unique keys. R-DHT realizes the key-to-node mapping through virtualization and splitting of node identifiers. Virtualization ensures that each host can share different keys, whereas the splitting of node identifiers prevents node collisions when several hosts share the same key.

Figure 2.5: Mapping Keys to Node Identifiers. (a) Conventional DHT Maps Key to Node Identifier Closest to the Key (b) R-DHT Maps Key to Node Identifier Equal to the Key

R-DHT virtualizes each host into a number of nodes by associating a node n with each unique key k (i.e. a resource type) belonging to host h. Figure 2.6a shows an example of virtualizing two hosts into four nodes, where each host with two unique keys is virtualized into two nodes. By making the node identifier equal to its associated key, R-DHT ensures that keys are not distributed. However, when nodes and keys share the same identifier space, virtualizing several hosts sharing the same key results in collisions of node identifiers (Figure 2.6b). To avoid the collision of node identifiers, a node associated with key k shared by host h is assigned $k|h$ as its identifier.
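The construction of $k|h$ can be sketched with a few bit operations. The snippet below is a minimal illustration, assuming 4-bit keys and 4-bit host identifiers; the names `node_id`, `prefix`, `suffix`, and `virtualize` mirror the terms used in the text but are otherwise hypothetical helpers, not code from an R-DHT implementation.

```python
KEY_BITS = 4    # assumed width of the key part
HOST_BITS = 4   # assumed width of the host-identifier part


def node_id(key, host):
    """Concatenate key and host identifier into the node identifier k|h."""
    return (key << HOST_BITS) | host


def prefix(n):
    """Key part of a node identifier."""
    return n >> HOST_BITS


def suffix(n):
    """Host-identifier part of a node identifier."""
    return n & ((1 << HOST_BITS) - 1)


def virtualize(host, keys):
    """Return one node identifier per unique key owned by the host."""
    return sorted(node_id(k, host) for k in set(keys))


if __name__ == "__main__":
    # Two hosts sharing key 2 no longer collide: they become nodes 2|3 and 2|9.
    print(virtualize(3, [2, 9]), virtualize(9, [2, 5, 9]))
```

Because the key occupies the high-order bits, all nodes that share a key are consecutive when identifiers are read as integers.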
Each node can be uniquely identified by its node identifier, which is the concatenation of the key (k) and the host identifier¹ (h). Thus, we split the node identifier space into two sub-spaces: key space and host identifier space. Figure 2.7a shows an example of a node identifier where the key and the host identifier are of the same bit-length², i.e. $m/2$ bits. We divide the m-bit node identifier space into $2^{m/2}$ segments. Each segment $S_k$ consists of $2^{m/2}$ consecutive node identifiers prefixed with k (Figure 2.7b). Therefore, each segment represents resources of the same type shared by different hosts. Segment indexing reduces lookup path length and improves fault-tolerance.

¹ A host identifier can be derived by hashing the host’s IP address or an identifier obtained from a certification authority [32].
² In general, the key space and the host identifier space need not be of the same size.

Figure 2.6: Virtualization in R-DHT. (a) Virtualize Two Hosts into Four Nodes (b) Collision of Node Identifiers

Figure 2.7: R-DHT Node Identifiers. (a) m-bit Node Identifier (b) Segmenting Node-Identifier Space

R-DHT is designed to maintain API compatibility with conventional DHT and supports the flat naming scheme [22]. As shown in Table 2.2, the R-DHT API requires the same arguments as its DHT counterpart. The lookup(k) API of R-DHT supports a flat naming scheme by abstracting away the location of keys in its argument. With the flat naming scheme, queries are formulated as “find resource type k”. This allows R-DHT to fully provide a hashtable abstraction where only the key is required to retrieve key-value pairs. In contrast, systems that support only a hierarchical naming scheme do not fully provide a hashtable abstraction of locating key-value pairs.
The hierarchical naming scheme requires users to specify the location of the key as an argument to the lookup operation, in addition to the key itself. Thus, queries are formulated as “find resource type k from host h”, which is reminiscent of HTTP requests: “retrieve index.html from www.comp.nus.edu.sg”.

                                                   API
Operation                                          DHT            R-DHT
Host h joins overlay through existing host e       h.join(e)      h.virtualize(e)
Host h shares new key k                            h.store(k)     h.newKey(k)
Users at host h search for key k                   h.lookup(k)    h.lookup(k)

Table 2.2: Comparison of API in R-DHT with Conventional DHT

R-DHT supports either a one-level overlay network (i.e. flat R-DHT) or a two-level overlay network (i.e. hierarchical R-DHT). In the remainder of this chapter, we discuss flat R-DHT in the following aspects: construction of overlay, lookup algorithm, and maintenance of overlay. Hierarchical R-DHT will be discussed in Chapter 3.

2.3.2 R-Chord

R-Chord is a flat R-DHT that organizes nodes as a (one-level) Chord overlay. Figure 2.8 presents how a new host joins R-Chord, and how an existing host shares a new key. Each new node joins the ring overlay using Chord’s join protocol (lines 5 and 13 in Figure 2.8). Nodes are organized as a logical ring in clockwise ascending order where each node identifier ($k|h$) is interpreted as an integer. Let N denote the number of hosts, and $K = \left|\bigcup_{h=0}^{N-1} T_h\right|$ denote the total number of unique keys in the system. R-Chord divides its ring overlay into K segments where segment $S_k$ consists of nodes prefixed with k. Thus, each segment $S_k$ represents resources of the same type (i.e. all resources with key k) shared by different hosts.

     1. // h joins R-DHT through an existing host e
     2. h.virtualize(e)
     3.   for each k ∈ T_h do
     4.     n = k|h;
     5.     n.join(e); // Chord’s protocol [133]
     6.
     7. // h shares a new key k
     8. h.newKey(k)
     9.   if k ∈ T_h
    10.     return;
    11.   T_h = T_h ∪ {k};
    12.   n = k|h;
    13.   n.join(h); // Chord’s protocol [133]

Figure 2.8: Virtualizing Host into Nodes

Figure 2.9 compares Chord and R-Chord in a grid consisting of three administrative domains (hosts), assuming that keys and host identifiers are 4 bits long. In Chord (Figure 2.9a), each host becomes a Chord node and its keys are distributed to another node whose identifier immediately succeeds the key. For example, all keys with identifier 2 will be stored on node 3. In R-Chord (Figure 2.9b), each key is mapped to its originating node. In this example, host 3 owns two unique keys, i.e. $T_3 = \{2, 9\}$, and thus, we virtualize host 3 into two nodes with node identifiers 2|3 = 35 and 9|3 = 147. Similarly, we virtualize host 6 and host 9 into one node and three nodes, respectively. We then organize the six nodes as an R-Chord ring based on their integral node identifiers. The R-Chord ring consists of three segments, namely segment $S_2$ with node 2|3 and node 2|9, segment $S_5$ with node 5|6 and node 5|9, and segment $S_9$ with node 9|3 and node 9|9. These three segments represent three keys (resource types): key 2, key 5, and key 9.

Figure 2.9: Chord and R-Chord. (a) Chord Distributes Data Items among Three Nodes (b) R-Chord Virtualizes Three Hosts into Six Nodes

R-DHT prevents data items from becoming stale when they are updated by their originating node, whereas a conventional DHT must route the update to the node where the data item is stored. Until the update reaches that node, the stored data item is stale. Referring to Figure 2.9, when host 3 updates its key 9, Chord routes the update to node 9, which stores the key. On the other hand, in R-Chord the update is reflected immediately because key 9 is mapped onto node 9|3, which is associated with host 3 itself.

By design, R-DHT is inherently fault tolerant without a need to incorporate data-item replication as a resiliency mechanism. Firstly, a node failure does not affect keys belonging to other nodes. Secondly, lookup(k) can still be successful since R-DHT can route the lookup request to other alive nodes in segment $S_k$. Figure 2.10 shows an example where administrative domain 3, with two shared resource types, is down. Since administrative domain 9 is still alive, R-Chord can still locate resource type 2 by routing a lookup(2) request to node 2|9 (Figure 2.10a). On the contrary, the lookup fails in Chord because it is routed to node 3 (Figure 2.10b). Typically, conventional DHT replicates keys to improve its resiliency. However, this increases the risk of stale data items, i.e. data items pointing to unavailable resources in the inaccessible administrative domain 3 (Figure 2.10c).

Figure 2.10: Node Failures and Stale Data Items. (a) R-Chord (b) Chord without Replication (c) Replication Introduces Stale Data Items

2.3.3 Lookup Optimizations

R-DHT supports a flat naming scheme where users need to specify only key k when searching. R-DHT bases its lookup on the underlying overlay’s lookup protocol. With Chord as the underlying overlay, each node maintains at most m unique fingers (Figure 2.11). Lookup for key k implies locating the successor of $k|0$, i.e. the first node in segment $S_k$.

Figure 2.11: The Fingers of Node 2|3

Figure 2.12 shows a direct application of the Chord lookup algorithm on R-Chord. If a lookup returns a node n' where prefix(n') = k, then key k is successfully found (line 11); otherwise, the key does not exist (line 14).

     1. // Ask h to find a node in segment S_k
     2. h.lookup(k)
     3.   k' = a key randomly chosen from T_h;
     4.   n = k'|h;
     5.   return n.node_lookup(k);
     6.
     7. // Ask n to find a node in segment S_k
     8. n.node_lookup(k)
     9.   if k == prefix(n.successor) then
    10.     // n’s successor shares k
    11.     return n.successor;
    12.
    13.   if n < k|0 < n.successor then
    14.     return n.successor;
    15.
    16.   n' = n.closest_preceding_node(k|0);
    17.   h' = suffix(n');
    18.   return h'.lookup(k);

(a) Main Algorithm

     1. // Ask n to find the closest predecessor of id.
     2. n.closest_preceding_node(id)
     3.   for i = m downto 1 do
     4.     if (n < finger[i] ≤ id) then
     5.       return finger[i];
     6.
     7.   return n;

(b) Helper Functions

Figure 2.12: Unoptimized R-Chord Lookup

The direct application of Chord’s lookup is not efficient because it does not exploit the advantages of read-only mapping. In a system with N hosts where each host has $T = \frac{1}{N}\sum_{h=0}^{N-1} |T_h|$ unique keys on average, R-Chord consists of $N \cdot T$ nodes and hence, its lookup path length is $O(\log NT)$ hops (see Theorem 2.1). To reduce the lookup path length, R-Chord exploits the read-only mapping scheme by incorporating two optimizations, namely routing by segments and shared finger tables. The complete algorithm is shown in Figure 2.13.

     1. // Ask h to find a node in S_k
     2. h.lookup(k)
     3.   for each j ∈ T_h do
     4.     if j == k then
     5.       return k|h; // h is in S_k
     6.
     7.   n = find_segment_in_fingers(k);
     8.   if n ≠ NOT_FOUND then
     9.     return n; // n is in the preceding segment of S_k
    10.
    11.   for each j ∈ T_h do
    12.     n = j|h;
    13.     if n < k|0 < n.successor then
    14.       return n.successor;
    15.
    16.   n = closest_preceding_node(k);
    17.   h' = suffix(n);
    18.   return h'.lookup(k);

(a) Main Algorithm

     1. // Ask h to find a finger pointing to S_k
     2. h.find_segment_in_fingers(k)
     3.   for each j ∈ T_h do // Iterate all local nodes
     4.     n = j|h;
     5.     for i = 1 to m do
     6.       if prefix(n.finger[i]) == k then
     7.         return n.finger[i];
     8.
     9.   return NOT_FOUND
    10.
    11. // Ask h to find the closest predecessor of id.
    12. h.closest_preceding_node(id)
    13.   x = id + 1; // Initialize to the farthest predecessor
    14.   for each k ∈ T_h do
    15.     n = k|h;
    16.     for i = m downto 1 do
    17.       f = n.finger[i];
    18.       if (n < f < id) // Is f the closest predecessor known by node n?
    19.          and (x < f < id) then // Is f a closer predecessor than x?
    20.         x = f;
    21.
    22.   return x;

(b) Helper Functions

Figure 2.13: R-Chord Lookup Exploiting R-DHT Mapping

2.3.3.1 Routing by Segments

R-DHT divides its node-identifier space into segments, with each segment representing a key. Therefore, locating key k is equivalent to locating any node within segment $S_k$, instead of strictly locating successor($k|0$) only. R-Chord exploits this segmentation by forwarding a lookup(k) request from one segment to another segment (lines 4, 7, and 16 in Figure 2.13a, and line 6 in Figure 2.13b). Each routing step halves the distance, in terms of segments, to the destination segment $S_k$ (see Lemma 2.2 for the proof). As such, in a system with K segments, routing by segments reduces lookup path length to $O(\log K)$. Since segments are identified by the prefix of node identifiers, our routing-by-segments scheme is essentially a prefix-based routing optimization applied to the Chord lookup protocol.

Figure 2.14 illustrates the processing of a lookup(k) request initiated by node $n_1$. Without routing by segments, i.e. with the direct application of Chord’s lookup protocol, the lookup path is $n_1 \to n_2 \to n_3 \to n_4 \to n_5$ because we always locate node $n_5$ = successor($k|0$). However, with routing by segments, we realize that one of the intermediate hops, node $n_2$, has a finger pointing to $n_6$. Though node $n_6$ is not successor($k|0$), it also shares key k and is in segment $S_k$. Since the lookup can then be completed at node $n_6$, the optimized lookup path becomes $n_1 \to n_2 \to n_6$.
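The lookup of Figure 2.13 can be tried out on the three-host grid of Figure 2.9 with the small simulation below. It is only a sketch under simplifying assumptions: the whole ring is held in one process, finger tables are recomputed on demand rather than stored, and the class `RChord`, the helper `between`, and the returned path of visited hosts are names introduced here for illustration.

```python
from bisect import bisect_left

HOST_BITS = 4            # assumed width of the host-identifier part
M = 8                    # assumed identifier width: 4-bit key | 4-bit host
SPACE = 1 << M


def node_id(k, h):
    return (k << HOST_BITS) | h


def prefix(n):
    return n >> HOST_BITS


def suffix(n):
    return n & ((1 << HOST_BITS) - 1)


def between(x, a, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


class RChord:
    def __init__(self, hosts):
        self.hosts = hosts   # host identifier -> set of keys (T_h)
        self.ring = sorted(node_id(k, h) for h, ks in hosts.items() for k in ks)

    def successor(self, ident):
        i = bisect_left(self.ring, ident % SPACE)
        return self.ring[i % len(self.ring)]

    def fingers(self, n):
        return [self.successor(n + (1 << i)) for i in range(M)]

    def lookup(self, h, k, path=None):
        """Return (node, hosts visited); key k exists iff prefix(node) == k."""
        path = (path or []) + [h]
        if k in self.hosts[h]:                      # h itself is in S_k
            return node_id(k, h), path
        local = [node_id(j, h) for j in sorted(self.hosts[h])]
        target = node_id(k, 0)
        for n in local:                             # routing by segments,
            for f in self.fingers(n):               # over all shared finger tables
                if prefix(f) == k:
                    return f, path
        for n in local:                             # S_k would start right after n
            succ = self.successor(n + 1)
            if between(target, n, succ):
                return succ, path
        candidates = [f for n in local for f in self.fingers(n)
                      if between(f, n, target)]
        nxt = min(candidates, key=lambda f: (target - f) % SPACE)
        return self.lookup(suffix(nxt), k, path)


if __name__ == "__main__":
    grid = RChord({3: {2, 9}, 6: {5}, 9: {2, 5, 9}})
    print(grid.lookup(6, 2))   # host 6 has a finger into segment S_2
    print(grid.lookup(3, 7))   # key 7 is not shared by any host
```

In this sketch a lookup for a key that nobody shares still terminates: it returns the first node after the empty segment, and the caller detects the miss by comparing the prefix of the returned node with the requested key.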
2.3.3.2 Shared Finger Tables

To limit the lookup path length to $O(\log N)$ hops even when $K > N$, our routing algorithm utilizes all the $|T_h|$ finger tables maintained by each host h (lines 3 and 11 in Figure 2.13a, and lines 3 and 14 in Figure 2.13b). In other words, a node’s finger table is shared by all nodes from the same host. As such, visiting one host is equivalent to visiting all the $|T_h|$ nodes which correspond to the host (Figure 2.15). The proof that this optimization leads to an $O(\log N)$-hop lookup path length, similar to Chord, is presented in Theorem 2.2. However, the intuitive explanation is as follows: since the distance between any two points in the overlay is at most N hosts, it takes $O(\log N)$ hops to locate any segment. Thus, even though the number of nodes in the overlay is greater than N due to host virtualization, the lookup path length is not affected.

Figure 2.14: lookup(k) with and without Routing by Segments. (a) No Routing by Segments (b) Routing by Segments

Figure 2.15: Effect of Shared Finger Tables on Routing. (a) Visiting Host 3 is Equal to Visiting Node 2|3 and Node 9|3 in the Overlay (b) Visiting Host 6 (c) Visiting Host 9: All Nodes are Visited

2.3.4 Maintenance of Overlay Graph

As with Chord, R-Chord maintains its overlay through periodic stabilizations. The periodic stabilization is implemented as two functions: stabilize_successor() and correct_fingers(). The first function corrects successor pointers, i.e. the first finger, in addition to predecessor pointers, whereas the latter corrects the remaining fingers in a finger table. The rate at which these functions are invoked is an implementation matter.

During the periodic stabilization, R-Chord exploits finger flexibility, which is inherent in our segment-based overlay. Finger flexibility denotes the amount of freedom available when choosing a finger.
A higher finger flexibility increases the robustness of lookup since finger tables deplete more slowly in the event of node failures [62]. Finger flexibility also allows proximity-based routing to reduce lookup latency. However, our current R-Chord implementation has yet to exploit this feature.

The segment-based overlay improves the finger flexibility of R-Chord, whereby n.finger[i] is allowed to point to any node in the same segment as successor($n + 2^{i-1}$); this is an improvement over the $O(1)$ finger flexibility of Chord³. To exploit this finger flexibility, we employ the backup fingers scheme, which is reminiscent of Kademlia’s κ-bucket routing tables [99]. With such a scheme, every R-Chord node n maintains backups for each of its fingers. Thus, when a finger f dies (i.e. points to a dead node), n still has a pointer to segment $S_{\text{prefix}(f)}$. The new structure of the R-Chord finger table is shown in Figure 2.16. We use n.finger[i] and backup(n.finger[i]) to denote the main finger and the list of backup fingers, respectively.

Description    Finger    Backups
1st finger     a|x       a|y, a|z
2nd finger     b|t       b|r, b|s, b|u
...            ...       ...

Figure 2.16: Finger Tables with Backup Fingers

Figure 2.17 shows the algorithm to correct successors in R-Chord. When a new successor is detected, the old successor pointer is added into the backup list instead of being discarded (line 7). Similarly, when a new predecessor is to be set, the old predecessor is also added into the backup list (line 19). We then ensure that the backup lists of successors and predecessors contain only nodes with the same prefix as the new successor (line 9) and predecessor (line 21), respectively.

Figure 2.18 shows that the algorithm to correct finger f in R-Chord exploits finger flexibility.
When a new finger is added, we also construct its backup list based on the entries piggybacked from the remote node and the older valid backup entries (lines 7 to 9 in Figure 2.18a). As with lookups, the finger-correction algorithm incorporates shared finger tables (line 4 in Figure 2.18a and line 3 in Figure 2.18b).

³ To improve its robustness in spite of its lower finger flexibility, each Chord node caches additional entries in its finger tables. Such a scheme can also be adopted in R-Chord.

     1. // n verifies its successor pointer, and announces itself to the successor.
     2. n.stabilize_successor()
     3.   p = successor.predecessor;
     4.   if n < p < successor then
     5.     // Prepare a new backup list
     6.     backup(successor) =
     7.       backup(successor) ∪ successor ∪ backup(p);
     8.     Remove backup(successor) entries
     9.       whose prefix ≠ prefix(p);
    10.     successor = p; // New successor pointer
    11.   successor.notify(n);
    12.
    13. // n' thinks it might be our predecessor
    14. n.notify(n')
    15.   if (predecessor == nil) or
    16.      (predecessor < n' < n) then
    17.     // Prepare a new backup list
    18.     backup(predecessor) =
    19.       backup(predecessor) ∪ predecessor ∪ backup(n');
    20.     Remove backup(predecessor) entries
    21.       whose prefix ≠ prefix(n');
    22.     predecessor = n'; // New predecessor pointer

Figure 2.17: Successor-Stabilization Algorithm

2.4 Theoretical Analysis

In this section, we analyze the lookup performance of R-Chord, and compare the overhead of the mapping scheme in R-Chord and conventional Chord-based DHT (hereafter referred to simply as Chord). Let N denote the number of hosts, $T_h$ denote the set of unique keys owned by host h, T denote the average number of unique keys owned by each host (i.e. $\frac{1}{N}\sum_{h=0}^{N-1} |T_h|$), and K denote the total number of unique keys in the system (i.e. $\left|\bigcup_{h=0}^{N-1} T_h\right|$). The Chord overlay consists of N nodes (one node per host) and the R-Chord overlay consists of $V$ ($= NT$) nodes (on average, T nodes per host).

     1. // Node n to correct its ith finger.
     2. n.correct_finger(i)
     3.   f = n + 2^(i−1);
     4.   h = suffix(n);
     5.   f' = h.find_successor(f);
     6.
     7.   // Prepare a new backup list
     8.   backup(finger[i]) = backup(finger[i]) ∪ finger[i] ∪ backup(f');
     9.   Remove backup(finger[i]) entries whose prefix ≠ prefix(f');
    10.
    11.   finger[i] = f'; // New finger

(a) Main Algorithm

     1. // Ask host h to find successor(id).
     2. h.find_successor(id)
     3.   for each k ∈ T_h do
     4.     n = k|h;
     5.     if n < id ≤ n.successor then
     6.       return n.successor;
     7.
     8.   n' = closest_preceding_node(id); // See Figure 2.13b
     9.   return n'.find_successor(id);

(b) Helper Functions

Figure 2.18: Finger-Correction Algorithm

2.4.1 Lookup

In order to analyze the lookup performance in R-Chord, we first present the proof of the Chord lookup path length (see also Theorem 2 in [133]).

Theorem 2.1. In an N-node Chord overlay, the lookup path length is $O(\log N)$.

Proof. Suppose that node n wishes to locate the successor of k. Let p be the node that immediately precedes k. To prove the lookup path length, we first show that each routing step halves the distance to p.

If $n \neq p$, then n forwards a lookup request to the closest predecessor of k in its finger table. If p is located in the interval $[n + 2^{i-1}, n + 2^{i})$, then n will contact its ith finger, the first node f in the interval. The distance between n and f is at least $2^{i-1}$, and the distance from f to p is at most $2^{i-1}$. This also means that the distance from f to p is at most half of the distance from n to p.

Assuming that node identifiers are uniformly distributed in the $2^m$ circular identifier space, the number of forwardings necessary to locate k will be $O(\log N)$, which is explained as follows. After i forwardings, there are $N/2^i$ remaining nodes to choose from for the next forwarding. Thus, with $\log N$ forwardings, there is only one node as the next hop; this node is the successor of k.
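The halving argument in the proof above can be checked numerically. The sketch below builds a random Chord ring, routes from a start node towards the node p that immediately precedes a key, and records the path so that the distance to p can be inspected at every hop. The 16-bit identifier space, the ring size, and the function names are assumptions for this illustration; finger tables are recomputed on the fly instead of being maintained by stabilization.

```python
import random
from bisect import bisect_left

M = 16                       # identifier bits (assumed for this sketch)
SPACE = 1 << M


def build_ring(n_nodes, seed=0):
    """Sorted list of distinct, uniformly drawn node identifiers."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(SPACE), n_nodes))


def successor(ring, ident):
    i = bisect_left(ring, ident % SPACE)
    return ring[i % len(ring)]


def predecessor(ring, ident):
    """Node that immediately precedes ident (index -1 wraps around)."""
    return ring[bisect_left(ring, ident % SPACE) - 1]


def dist(a, b):
    """Clockwise distance from a to b on the identifier circle."""
    return (b - a) % SPACE


def route(ring, start, key):
    """Return the list of nodes visited from start up to predecessor(key)."""
    p = predecessor(ring, key)
    n, path = start, [start]
    while n != p:
        span = dist(n, key) or SPACE          # (n, n) means the whole ring
        fingers = [successor(ring, n + (1 << i)) for i in range(M)]
        preceding = [f for f in fingers if 0 < dist(n, f) < span]
        n = max(preceding, key=lambda f: dist(n, f))   # closest preceding finger
        path.append(n)
    return path


if __name__ == "__main__":
    ring = build_ring(200, seed=1)
    key = 12345
    p = predecessor(ring, key)
    for node in route(ring, ring[0], key):
        print(node, "distance to p:", dist(node, p))
```

Printing the distances shows each hop at least halving the remaining distance to p, so the number of hops never exceeds the number of identifier bits.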
In the following, we present the proof of the lookup path length of R-Chord utilizing two intermediate results, Lemma 2.1 and Lemma 2.2.

Lemma 2.1. The probability that host h owns key k, i.e. $P(k \in T_h)$, is bounded by $\ln \frac{K}{K-T}$.

Proof. Let $k \in T_h$ denote key k owned by host h. We define the probability that host h owns key k as

\[
P(k \in T_h) = P(e_1) + P(e_2 \mid \bar{e}_1) + P(e_3 \mid \bar{e}_1, \bar{e}_2) + \dots + P(e_T \mid \bar{e}_1, \dots, \bar{e}_{T-1}) = \sum_{i=1}^{T} P(e_i \mid \bar{e}_1 \dots \bar{e}_{i-1})
\]

where $e_i$ denotes the outcome for $k_i = k$, and $\bar{e}_i$ denotes the outcome for $k_i \neq k$.

Assuming that $T \ll K$ and k is uniformly drawn from $\{1, \dots, K\}$, we approximate $P(k \in T_h)$ using the first-order Markov process as follows. Firstly, we consider K resource types as K balls, each of which has a unique color. If we pick T balls sequentially, then the probability that the ith outcome, where $i \leq T$, produces the k-colored ball is $\frac{1}{K-i+1}$. This leads to the following equation:

\[
P(k \in T_h) =
\begin{cases}
0 & \text{if } T = 0 \\
\frac{1}{K} & \text{if } T = 1 \\
\sum_{i=1}^{T} \frac{1}{K-i+1} = \sum_{i=K-T+1}^{K} \frac{1}{i} & \text{if } T > 1
\end{cases}
\]

Since $H_x = \sum_{i=1}^{x} \frac{1}{i} = \ln x + O(1)$, then

\[
P(k \in T_h) =
\begin{cases}
0 & \text{if } T = 0 \\
\frac{1}{K} & \text{if } T = 1 \\
H_K - H_{K-T} & \text{if } T > 1
\end{cases}
\;\approx\;
\begin{cases}
0 & \text{if } T = 0 \\
\frac{1}{K} & \text{if } T = 1 \\
\ln \frac{K}{K-T} & \text{if } T > 1
\end{cases}
\]

Lemma 2.2. Routing by segments leads to $O(\log K)$-hop lookup.

Proof. To analyze the lookup path length due to our routing-by-segments optimization, we compare the finger tables in R-Chord and Chord.

According to Theorem 2.1, in a Chord system consisting of N nodes, the lookup path length is $O(\log N)$ if Chord is able to route a lookup request from one node to another such that each step halves the distance to the destination node. To achieve this, each node n maintains $O(\log N)$ unique fingers where:

1. The distance between n and n.finger[i + 1] is twice the distance between n and n.finger[i].

2. The largest finger of n points to successor(N/2).

We now show the similarity of finger tables in R-Chord and Chord.
Let $S = O(N \cdot P(k \in T_h))$ denote the average number of nodes in a segment. In an R-Chord system consisting of V nodes, each node n maintains $O(\log V)$ unique fingers where:

1. The first $O(\log S)$ of the $O(\log V)$ fingers point to the segment containing n.successor. The remaining $O(\log V - \log S)$ fingers point to $O(\log V - \log S)$ different segments because the distance between n and n.finger[log(S + j + 1)] is twice the distance between n and n.finger[log(S + j)], where $0 \leq j \leq \log N - \log S$.

2. The largest finger of n will point to successor(N/2), which is a node in the segment that succeeds segment K/2.

Using the same argument as in Chord, R-Chord routes a lookup request from one segment to another and each hop halves the distance, in terms of the number of segments, to the destination segment. Since R-Chord consists of K segments, a lookup will cost $O(\log K)$ hops.

Theorem 2.2. With shared finger tables, the lookup path length in R-Chord is $O(\min(\log K, \log N))$ hops.

Proof. If $K \leq N$ then $\log K \leq \log N$. Thus, according to Lemma 2.2, this theorem is true.

Consider $K > N$. When host h processes lookup(k), we choose two consecutive keys $s, u \in T_h$ where

1. $s < k < u$

2. There is no $v \in T_h$ such that $s < v < u$

The two keys, s and u, are associated with two nodes, namely node $s|h$ and node $u|h$, respectively. The destination segment $S_k$ will be located between node $s|h$ and node $u|h$, and the distance between $s|h$ and $u|h$ is $O(K/T)$. Since $K \leq V$, then $K/T = O(V/T) = O(NT/T) = O(N)$. Thus, according to Theorem 2.1, lookup(k) can be routed from node $s|h$ to segment $S_k$ in $O(\log N)$ hops.

Theorem 2.2 shows that the lookup performance in R-Chord is at worst comparable to the lookup performance in Chord.
Due to the shared finger tables, R-Chord’s lookup path length is equivalent to Chord’s, where the number of hops to reach a certain node is affected by the number of hosts in the physical network (N) instead of the number of nodes in the overlay network (V).

2.4.2 Overhead

The following theorems compare the maintenance overhead in R-Chord and Chord in terms of the cost of virtualization, number of fingers per host, cost of updating data items, and overhead of replication.

Theorem 2.3. The cost for a host to join R-Chord and Chord is $O(|T_h| \log^2 V)$ and $O(\log^2 N + |T_h| \log N + K \ln \frac{K}{K-T})$, respectively.

Proof. R-Chord virtualizes a host into $|T_h|$ nodes in an overlay graph of size V. Since a node join costs $O(\log^2 V)$, the host join costs $O(|T_h| \log^2 V)$.

In Chord, a host join consists of a node-join operation, $|T_h|$ store operations to store the key-value pairs belonging to the new host, and migrations of key-value pairs. The node-join operation costs $O(\log^2 N)$ and each store operation costs $O(\log N)$. The migration process moves $O(K/N)$ unique keys from an existing node, which is the successor of the new node, to the new node n. As there are $O(N \cdot P(k \in T_h))$ key-value pairs per unique key, the migration costs $O(K \ln \frac{K}{K-T})$. Therefore, the host join costs $O(\log^2 N + |T_h| \log N + K \ln \frac{K}{K-T})$ in total.

Theorem 2.3 shows that the cost to join R-Chord is higher than the cost to join Chord. This is because R-Chord replaces a store operation with a join operation, and the cost of a join operation is higher than that of a store operation.

Theorem 2.4. A host maintains $O(|T_h| \log V)$ and $O(\log N)$ unique fingers in R-Chord and Chord, respectively.

Proof. In an R-Chord ring consisting of V nodes, each node maintains a finger table consisting of $O(\log V)$ unique fingers. Because R-Chord virtualizes a host into $|T_h|$ nodes, the host maintains $O(|T_h| \log V)$ fingers in total.

In Chord, each host joins a ring overlay as a node.
Thus, with N hosts, the Chord ring consists of N nodes where each node maintains O(log N) unique fingers.

The following theorem shows the overhead of maintaining an overlay, i.e. stabilization cost, in terms of the number of messages sent.

Theorem 2.5. In R-Chord, the stabilization cost to correct all fingers (including successor pointers) is O(V log N log V) messages. Since N ≤ V, the stabilization cost is also Ω(V log^2 V) messages. On the other hand, the stabilization cost in Chord is O(N log^2 N) messages.

Proof. The R-Chord overlay consists of V nodes and each node maintains O(log V) fingers. Correcting the i-th finger of node n is performed by locating the node that succeeds n + 2^(i−1); this costs O(log N) hops, i.e. with shared finger tables as the only lookup optimization (see Theorem 2.2). Thus, the cost of stabilization is O(V log V log N) hops. In Chord, the overlay consists of N nodes and each node maintains O(log N) fingers. Correcting each finger is performed by locating successor(n + 2^(i−1)), which costs O(log N) hops. Thus, the cost of stabilization is O(N log^2 N) hops.

Theorems 2.4 and 2.5 state that a higher number of fingers implies a higher overhead in maintaining an overlay graph. This affects the scalability of R-Chord, particularly when a host is virtualized into many nodes. To reduce the overhead of periodic stabilizations, which are employed by the current implementation of R-Chord to correct fingers, nodes need not correct all their fingers each time the stabilization procedure is invoked. Instead, each invocation corrects only a subset of a node's fingers, e.g. the successor pointer and another randomly chosen finger; this is similar to Chord's current implementation of periodic stabilizations. The drawback of this approach is an increase in the number of incorrect entries in a finger table; this increases the lookup path length. However, as long as the successor pointer, i.e.
the first finger, is maintained, the lookup will still terminate at the correct node.

Theorem 2.6. The finger flexibility in R-Chord and Chord is O(N ln(K/(K−T))) and O(1), respectively.

Proof. Assume that successor(n + 2^(i−1)) is in segment S_k. In R-Chord, the i-th finger of node n can point to any node in segment S_k. The number of nodes in this segment is equal to the number of hosts that own key k, which is O(N P(k ∈ T_h)) hosts. Hence, the finger flexibility is O(N ln(K/(K−T))). In Chord, the i-th finger of n must point to successor(n + 2^(i−1)), and hence the finger flexibility is O(1).

As mentioned in Section 2.3.4, a higher finger flexibility increases the robustness of lookup in the presence of node failures. Higher finger flexibility also allows proximity-based routing to reduce lookup latency.

Theorem 2.7. In Chord, adding a key-value pair costs O(log N). In R-Chord, adding a key-value pair whose key k already exists in host h (i.e. k ∈ T_h before the addition) costs O(1), while adding a key-value pair whose key is new for host h (i.e. k ∉ T_h before the addition) costs O(log^2 V).

Proof. In Chord, a key-value pair is stored on the successor of the key. This costs O(log N). In R-Chord, if k ∈ T_h before the addition, then no new node is created and hence the cost is O(1). However, if k ∉ T_h before the addition, then a new node is created and joins the R-Chord system. This costs O(log^2 V).

In applications such as P2P file sharing, sharing a new file is equal to adding a new resource type. However, in a computational grid, a resource type consists of many resource instances, and an administrative domain can add new instances to one of its existing resource types. Theorem 2.7 shows that using R-Chord, the administrative domain does not need to notify other nodes in the R-Chord overlay.

Theorem 2.8. In R-Chord, the total number of key-value pairs with the same key is O(N ln(K/(K−T))).
In Chord, assuming that each key-value pair is replicated O(log N) times, the total number of key-value pairs with the same key is O(N ln(K/(K−T)) log N).

Proof. Given N hosts, the number of key-value pairs with the same key is O(N P(k ∈ T_h)). Since R-Chord does not redistribute and replicate key-value pairs, the number of key-value pairs with the same key is also O(N ln(K/(K−T))). In Chord, because each key-value pair is replicated O(log N) times, the total number of key-value pairs with the same key will be O(N ln(K/(K−T)) log N).

R-Chord does not need to replicate data items to improve the lookup resiliency to node failures. This eliminates the network bandwidth required to replicate data items and the complexity in maintaining consistency among replicas.

Even when data items are not replicated, data-item distribution can still lead to the problem of inconsistent data items. In conventional DHT, updates must be propagated to the node responsible for storing a data item. This is shown in the following corollary.

Corollary 2.1. In Chord, the cost for a host to update its key-value pair is O(log N). In R-Chord, the cost is O(1).

Proof. In Chord, propagating an update on a key-value pair from a host (the originating node) to another node costs O(log N), according to Theorem 2.1. In R-Chord, the key-value pair is mapped to its originating node. Hence, the cost of updating the key-value pair is O(1).

Corollary 2.1 shows that R-Chord improves the performance of a host in updating its data items, including the deletion of data items. In the case of a computational grid, updates occur when an administrative domain changes the configuration of its shared resources, or changes the number of resource instances of a resource type.

2.4.3 Cost Comparison

Table 2.3 summarizes the performance analysis of R-Chord.
We show that the lookup path length in R-Chord is shorter than in Chord and, in the worst case, equal to that of Chord. However, our mapping scheme increases the cost for a host to join an overlay. In addition, each host in R-Chord has more fingers to correct due to the host being virtualized into nodes. Thus, the scalability of R-Chord is determined by the number of nodes associated with each host. When each host is virtualized into one node only, the scalability of R-Chord is equal to that of traditional Chord.

| Property | Chord | R-Chord |
|---|---|---|
| Lookup a key | O(log N) | O(min(log K, log N)) |
| Host join | O(log^2 N + \|T_h\| log N + K ln(K/(K−T))) | O(\|T_h\| log^2 V) |
| # unique fingers per host | O(log N) | O(\|T_h\| log V) |
| Finger flexibility | O(1) | O(N ln(K/(K−T))) |
| Stabilization | O(N log^2 N) | O(V log V log N) |
| Add a key that exists | O(log N) | O(1) |
| Add a new key | O(log N) | O(log^2 V) |
| Update a key-value pair | O(log N) | O(1) |
| # key-value pairs with the same key | O(N ln(K/(K−T)) log N) | O(N ln(K/(K−T))) |

Table 2.3: Comparison of Chord and R-Chord

2.5 Simulation Analysis

In this section, we evaluate R-DHT by simulating an R-Chord-based resource indexing and discovery scheme in a large computational grid. As illustrated in Figure 2.4b, a computational grid consists of many administrative domains, each of which shares one or more compute resources. Each administrative domain, represented by its MDS server (i.e. host), joins an R-Chord overlay and stores only its own resource metadata (i.e. data items). The key of each data item is determined by the type (i.e. attributes) of the compute resource associated with the data item. Thus, the number of unique keys owned by a host denotes the number of unique resource types shared by an administrative domain.

To facilitate our experiments, we implement R-Chord using the Chord simulator [2]. Let m = 18 bits unless stated otherwise.
The network latency between hosts is exponentially distributed with a mean of 50 ms, and the time for a node to process a request is uniformly distributed in [5, 15] ms. In the following subsections, we compare the lookup path length, resiliency to node failures, time to correct overlay, and lookup performance on incorrect overlays.

2.5.1 Lookup Path Length

To verify Theorem 2.2, we measure the average lookup path length of 500,000 lookups that arrive following a Poisson distribution with mean arrival rate λ = 1 lookup/second. Each lookup requests a randomly selected key and is initiated by a randomly chosen host. Assuming that T denotes the average number of unique keys per host, each host has |T_h| ∼ U[0.5T, 1.5T] unique keys.

As shown in Figure 2.19, the average lookup path length in R-Chord is 20-30% lower than in Chord. When K (= NT) > N, R-Chord's overlay consists of K segments and each segment consists of one node. According to Theorem 2.2, the lookup path length of R-Chord is affected only by N, and hence, increasing K does not increase the lookup path length. However, for K ≤ N, the lookup path length increases with K.

Figure 2.19 also shows that in R-Chord, increasing T reduces the average path length, which can be explained as follows. First, as each host maintains O(|T_h| log NT) unique fingers (Theorem 2.4), an increase in T also increases the number of fingers per host. Several studies such as [64, 122] also reveal that maintaining more fingers reduces the lookup path length. Secondly, an increase in T increases the number of segments occupied by a host, and hence, each host has a higher probability of being visited.

[Figure 2.19: Average Lookup Path Length. Panels: (a) N = 10,000 Hosts; (b) N = 25,000 Hosts]

Our simulation result confirms Theorem 2.2, i.e. when K > N, the lookup path length in R-Chord has the same upper bound as Chord.
When K ≤ N, the lookup path length in R-Chord is shorter than in Chord.

2.5.2 Resiliency to Simultaneous Failures

To evaluate the resiliency when no churn occurs, we measure the average lookup path length and failed lookups with the following procedure. First, we set up a system of N = 25,000 hosts where all hosts have T unique keys on average. Then, we fail a fraction of hosts simultaneously (see footnote 4), disable the periodic finger correction after the simultaneous failures, and simulate 500,000 lookup requests with mean arrival rate λ = 1 lookup/second (Poisson distribution). We define a lookup for a key as failed if it results in (i) a false negative answer, where existing resources (i.e. at least one originating node of the key is alive) cannot be located, or (ii) a false positive answer, where stale data items are returned. We also assume that Chord stores a key-value pair only on successor(key) and does not further replicate the key-value pair to several other nodes. Finally, we exploit the property of finger flexibility in R-Chord by maintaining a maximum of four backups per finger.

Footnote 4: In R-Chord, one host failure results in simultaneous node failures.

Figure 2.20 shows the average lookup path length with 25% and 50% of simultaneous host failures. For K (= TN) > N, the average lookup path length in R-Chord shows a trend similar to that in Chord, i.e. the lookup path length increases as more hosts fail (Figure 2.20a). Because each segment consists of only one node, R-Chord cannot exploit finger flexibility. Hence, as the percentage of host failures increases, the number of valid fingers, i.e. fingers pointing to alive nodes, decreases in each node's finger table.

For K ≤ N (Figure 2.20b), R-Chord has a shorter average path length than Chord, and the lookup path length is not significantly affected by the number of failed hosts. The reason is as follows. R-Chord provides an O(log K)-hop lookup path length only if each node has a correct finger table. In the case of node failures, the finger table has a reduced number of valid fingers; this increases the lookup path length. However, since K ≤ N implies that each segment consists of more than one node, R-Chord effectively exploits finger flexibility to maintain the number of valid fingers.

In terms of failed lookups, Figure 2.21 shows that for K > N and K ≤ N, the percentage of failed lookups in R-Chord is 70% and 95% lower than in Chord, respectively. For K (= NT) > N (Figure 2.21a), Chord has more failed lookups because key-value pairs are stored on another host. If the host that stores the key-value pair fails, a lookup request to that host will be unsuccessful even though the host that owns the key-value pair is still alive. For K ≤ N (Figure 2.21b), R-Chord achieves even fewer failed lookups (95% lower than Chord) because R-Chord exploits the property that each key is available in a segment consisting of several nodes. Hence, even if some of these nodes fail, R-Chord can still reach the remaining hosts in the segment through the backup fingers. Thus, R-Chord offers better resiliency to simultaneous failures in comparison to conventional DHT.

2.5.3 Time to Correct Overlay

The correctness of an overlay network is crucial to the lookup performance in DHT. To ensure the correctness of an overlay in the event of membership changes, each node periodically corrects its fingers, i.e. periodic stabilization. However, the larger size of the R-Chord overlay (Theorem 2.3 and Theorem 2.4) increases the stabilization cost (Theorem 2.5). To amortize the maintenance overhead, periodic stabilization in R-Chord is performed less aggressively, similar to Chord, where each invocation of the stabilization procedure corrects only a subset of a node's fingers, e.g. the successor pointer and another randomly chosen finger. However, this may increase the time required to correct an R-Chord overlay.

[Figure 2.20: Average Lookup Path Length with Failures (N = 25,000 Hosts). Panels: (a) K = NT Unique Keys; (b) K = 5,000 Unique Keys]

[Figure 2.21: Percentage of Failed Lookups (N = 25,000 Hosts). Panels: (a) K = NT Unique Keys; (b) K = 5,000 Unique Keys]

In this experiment, we evaluate the time required to correct the overlay topology, which is measured starting from the time when the last host arrives. To facilitate this measurement, we quantify the correctness of an overlay using the stabilization degree (ξ), which is derived by averaging the correctness of all finger tables in the overlay network (ξ_n):

\[ \xi = \frac{\sum_{n=0}^{N-1} \xi_n}{N}, \qquad 0 \le \xi \le 1 \tag{2.1} \]

\[ \xi_n = \begin{cases} 0 & \text{if } n.\mathit{finger}[1] \text{ is incorrect} \\ F'/F & \text{otherwise} \end{cases} \tag{2.2} \]

where 0 ≤ ξ_n ≤ 1, F' is the number of correct fingers in n, and F is the total number of fingers in n. Note that we do not consider backup fingers in calculating ξ_n.

The experiment is performed as follows. We simulate a system consisting of N hosts with mean arrival rate λ = 1 host/second. The number of unique keys per host is |T_h| ∼ U[2, 5] and therefore, T = 3.5 unique keys. We assume that the total number of unique keys in the system is K = 3N keys; this K approximates NT keys, where each segment consists of one node on average. We then periodically measure ξ starting from the time when the last host arrives.

A node joins a ring overlay through a randomly chosen existing node. We base our node-join process on the one described in [133]. First, a new node n starts the join process by adding n' = find_successor(n) as its successor pointer. After one or more rounds of finger correction, there will be at least one other node pointing to n. At this time, the join process completes. Note that R-Chord uses the find_successor() which incorporates shared finger tables (Figure 2.18b).
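The stabilization degree of Equations (2.1) and (2.2) can be sketched in Python as follows. This is a minimal illustration rather than the simulator's code: the function names are hypothetical, and each finger table is assumed to be given as a list of booleans (True for a correct finger), with index 0 standing for the successor pointer n.finger[1] and backup fingers excluded.

```python
from typing import Sequence


def node_correctness(fingers_ok: Sequence[bool]) -> float:
    """xi_n from Eq. (2.2): 0 if the successor pointer is wrong, else F'/F."""
    if not fingers_ok:
        raise ValueError("a finger table needs at least one finger")
    if not fingers_ok[0]:  # index 0 plays the role of n.finger[1]
        return 0.0
    return sum(fingers_ok) / len(fingers_ok)


def stabilization_degree(tables: Sequence[Sequence[bool]]) -> float:
    """xi from Eq. (2.1): the mean of xi_n over all finger tables."""
    if not tables:
        raise ValueError("the overlay needs at least one node")
    return sum(node_correctness(t) for t in tables) / len(tables)


# Three nodes: one stale finger, a wrong successor pointer, a correct table.
overlay = [
    [True, True, False, True],   # xi_n = 3/4
    [False, True, True, True],   # xi_n = 0 (successor pointer incorrect)
    [True, True, True, True],    # xi_n = 1
]
xi = stabilization_degree(overlay)
```

The second node contributes nothing even though three of its four fingers are correct, which shows how the metric gives priority to the successor pointer.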
Each node invokes the finger correction every [0.5p, 1.5p] seconds (uniform distribution). Each invocation of finger correction corrects the successor pointer (n.finger[1]) and one other finger. Correcting n.finger[i] is done by updating it with the node returned by find_successor(n + 2^(i−1)).

Figure 2.22 reveals that at larger p (960 seconds in this experiment), nodes in Chord correct their successor pointer (i.e. n.finger[1]) faster than in R-Chord. For example, when N = 25,000 hosts, ξ in Chord increases from 0.36 to 0.59 in the first three hours, which is faster than in R-Chord (0.31 to 0.38). The same behavior is also observed for N = 50,000 hosts. This is because ξ_n puts more priority on the successor pointer (see Equation (2.2)). Hence, by correcting successor pointers faster, Chord increases its ξ faster than R-Chord.

In conclusion, although R-Chord has a larger overlay and each of its hosts has to correct more fingers, R-Chord does not require a longer time than Chord to fully correct its overlay. This is due to shared finger tables reducing the time to locate the correct successor when correcting a finger.

2.5.4 Lookup Performance under Churn

Churn refers to membership changes in an overlay network. In R-DHT, churn occurs in two ways. Firstly, host arrivals, host failures, and host leaves result in simultaneous node joins, node failures, and node leaves, respectively. Secondly, adding a new unique key to a host also causes a node join (Theorem 2.7). When the frequency of membership changes (i.e. churn rate) is high, lookup performance may decrease because the larger overlay of R-DHT magnifies the impact of churn on the correctness of the overlay topology.

To evaluate the ability of R-DHT to cope with churn, we simulate lookups when the R-Chord ring overlay keeps changing due to host arrivals, failures, and leaves. When a node leaves, it notifies its successor and predecessor. In addition, a node leaving a Chord ring migrates all data items belonging to other nodes to its successor without delay.

[Figure 2.22: Correctness of Overlay ξ (Measured Every Three Hours Starting From the Last Host Arrival). Panels: (a) N = 25,000 Hosts; (b) N = 50,000 Hosts]

In this experiment, we assume that each host has |T_h| ∼ U[4, 12] unique keys (which results in T = 8 unique keys), and each node invokes the finger correction procedure every [30, 90] seconds (uniform distribution). We first set up an overlay network of 25,000 hosts, followed by a number of churn events (i.e. arrivals, failures, and leaves) produced by 25,000 hosts over a duration of one hour. Thus, there will be about N = 25,000 alive hosts at any time within this duration. During this one-hour period, we also simulate a number of lookup events and keep the ratio of arrive:fail:leave:lookup at 2:1:1:6. The mean arrival rate of these events, λ, models the churn rate. Assuming that these events follow a Poisson distribution, our simulation uses λ_B = 10 events/second and λ_G = 40 events/second; these are derived from the measurements of peer lifetime by Bhagwan et al. [25] and Gummadi et al. [63], respectively (see footnote 5).

Table 2.4 presents the results for various K, from 5,000 (K < N) to 150,000 (K ∼ NT). The average lookup path length (Table 2.4a) again confirms Theorem 2.2 and the result from Subsection 2.5.1. Though the number of nodes in R-Chord's overlay is at least three times that of Chord, when K < N the average lookup path length is shorter than in Chord since R-Chord routes lookups by segments. When K ≥ N, the average lookup path length is not worse than in Chord due to the shared finger tables.

Footnote 5: λ_B is obtained as follows. Bhagwan et al. [25] measure that on average, each host performs 6.4 joins and 6.4 fails per day. We interpret the measured fail events as consisting of 3.2 host fails and 3.2 host leaves.
Thus, including lookups, there are 32 events per host per day. With 25,000 hosts coming and going repeatedly, there are 800,000 events per day, which is approximately one event every 100 ms. Similar steps are used to derive λ_G. Gummadi et al. [63] measure 24 joins and 24 fails per host per day, and we interpret the measured fails as consisting of 12 host fails and 12 host leaves. Given 25,000 hosts and a ratio of arrive:fail:leave:lookup = 2:1:1:6, there is approximately one event every 25 ms.

(a) Average Lookup Path Length

| K | Chord (λ_B = 10 ev./sec.) | R-Chord (λ_B = 10 ev./sec.) | Chord (λ_G = 40 ev./sec.) | R-Chord (λ_G = 40 ev./sec.) |
|---|---|---|---|---|
| 5,000 | 8.4 | 4.1 | 9.1 | 4.7 |
| 7,500 | 8.5 | 4.4 | 9.1 | 5.0 |
| 10,000 | 8.5 | 4.6 | 9.3 | 5.4 |
| 15,000 | 8.5 | 4.9 | 9.1 | 5.8 |
| 25,000 | 8.5 | 5.3 | 9.2 | 6.5 |
| 50,000 | 8.5 | 5.9 | 9.3 | 7.2 |
| 75,000 | 8.5 | 6.2 | 9.5 | 7.7 |
| 100,000 | 8.6 | 6.4 | 9.4 | 8.0 |
| 125,000 | 8.6 | 6.5 | 9.2 | 8.1 |
| 150,000 | 8.6 | 6.6 | 9.4 | 8.3 |

(b) % of Failed Lookups

| K | Chord (λ_B = 10 ev./sec.) | R-Chord (λ_B = 10 ev./sec.) | Chord (λ_G = 40 ev./sec.) | R-Chord (λ_G = 40 ev./sec.) |
|---|---|---|---|---|
| 5,000 | 2% | <1% | 7% | 2% |
| 7,500 | 2% | <1% | 9% | 2% |
| 10,000 | 3% | 1% | 9% | 3% |
| 15,000 | 4% | 1% | 13% | 4% |
| 25,000 | 4% | 1% | 15% | 9% |
| 50,000 | 7% | 5% | 20% | 13% |
| 75,000 | 10% | 7% | 20% | 26% |
| 100,000 | 7% | 5% | 23% | 25% |
| 125,000 | 6% | 7% | 30% | 26% |
| 150,000 | 8% | 16% | 27% | 34% |

Table 2.4: Lookup Performance under Churn (N ∼ 25,000 Hosts)

Table 2.4b shows that lookup resiliency in R-Chord is comparable to Chord (from 8% lower to 9% higher than Chord). Under churn rates of λ_B and λ_G, R-Chord has a lower percentage of failed lookups when K ≤ 100,000 and K ≤ 50,000, respectively.

The results indicate the importance of exploiting finger flexibility through backup fingers. In R-DHT, lookup resiliency is increased due to the property that a key can be found in several nodes of the same segment. Hence, it is important that a segment can be reached as long as it contains one or more alive nodes. R-Chord addresses this issue by maintaining backup fingers as redundant pointers to a segment. With a higher number of nodes per segment (i.e.
finger flexibility), backup fingers are more effective in reducing the impact of the churn rate on lookup resiliency. As K is increased, the number of nodes per segment decreases. Because finger flexibility is reduced, there is less redundancy to be exploited through backup fingers. Considering that R-Chord's overlay is eight times larger than Chord's overlay, we conclude that the decrease is reasonable.

In summary, the results in this subsection suggest that R-Chord achieves better resiliency when finger flexibility is exploited. When R-Chord cannot exploit finger flexibility, it can still achieve resiliency comparable to Chord because, by not distributing data items, the failure of a host affects only its own data items.

2.6 Related Works

In this section, we first compare and contrast R-DHT with structured P2P systems that support the no-data-item-distribution scheme. Secondly, we discuss the current status of distributed resource indexing and discovery in a computational grid.

2.6.1 Structured P2P with No-Store Scheme

We discuss three structured P2P systems that also support the no-data-item-distribution scheme, namely SkipGraph [20], Structella [30], and SkipNet [68].

SkipGraph [20] supports the no-store scheme by associating a key with a node and organizing nodes in a skip-list-like topology. It is assumed that each key is shared by only one node, e.g. resources of the same type are shared by only one administrative domain. Our proposed scheme generalizes SkipGraph in two ways. First, it allows a key to be associated with several nodes. Secondly, it can organize nodes with different structured overlay topologies.

Structella [30] organizes nodes as a structured overlay (i.e. a Pastry ring [123]) but each node manages only its own keys, similar to R-DHT. However, unlike DHT, Structella does not map keys onto nodes.
Instead, Structella employs the routing schemes used in unstructured P2P, such as flooding and random walk, and exploits its structured overlay to reduce the overhead of those schemes. The authors reported that Structella offers a result guarantee similar to that of unstructured P2P. To improve the result guarantee, the authors propose to distribute and replicate data items to a number of nodes [31]. In contrast, R-DHT maps keys onto nodes and exploits DHT-based lookup schemes. Thus, even without distributing data items, R-DHT offers the same level of result guarantee as other DHT.

SkipNet [68] supports content locality to map a key onto a specific node. This is achieved through a hierarchical naming scheme: put(n|key) maps a key to node n, and lookup(n|key) retrieves the key. Compared to our proposed scheme, SkipNet provides greater flexibility for a host to decide where its data items are stored. However, the hierarchical naming scheme does not directly support queries such as "find resources of type k in any hosts". In contrast, though R-DHT addresses node autonomy only by ensuring that data items are stored on their originating host, it is compatible with a flat naming scheme.

Table 2.5 summarizes the comparison of R-DHT with the three no-data-item-distribution schemes.

2.6.2 Resource Discovery in Computational Grid

We classify distributed grid information systems based on their overlay topology into unstructured overlay networks [72, 91] and structured overlay networks (DHT) [27, 28, 132, 145].

| Characteristic | SkipGraph | Structella | SkipNet | R-DHT |
|---|---|---|---|---|
| Mapping Scheme | Yes | No | Yes | Yes |
| High Result Guarantee | Yes | No | Yes | Yes |
| Key Shared by Many Hosts | No | Yes | Yes | Yes |
| Controlled Data Placement | No | No | Yes | No |
| Flat Naming Scheme | Yes | Yes | No | Yes |
| Overlay Network | Skip List | Pastry | Multi-Level Ring | Multiple Choices |

Table 2.5: Comparison of R-DHT with Related Work

In unstructured overlay networks, the routing-transferring model [91] replicates resource information to all nodes. However, this consumes communication bandwidth. Iamnitchi [72] proposes to replicate information based on the small-world effect and uses heuristics to aid lookup. However, heuristics do not guarantee that a lookup will successfully find resources.

In contrast, DHT-based systems provide a stronger lookup guarantee and scalable lookup performance. MAAN [28], self-organizing Condor pools [27], XenoSearch [132], and RIC [145] are examples of grid information systems that are based on conventional structured overlay networks. Compared to such schemes, our R-DHT-based grid information system increases the autonomy of administrative domains by not distributing data items. In addition, our scheme does not introduce stale data items when the overlay topology changes, and it is resilient to node failures without a need to replicate data items.

2.7 Summary

A distributed hash table maps each key onto a node to achieve good lookup performance. A typical DHT realizes this mapping through the store operation and, as a result, key-value pairs are distributed across the overlay network. To address the requirements of applications where distributing key-value pairs is not desirable, we propose R-DHT, a new DHT mapping scheme without the store operation. R-DHT enforces the read-only property by virtualizing a host into nodes subject to the unique keys belonging to the host, and dividing the node identifier space into two sub-spaces (i.e. a key space and a host identifier space). By mapping data items back onto their owner, R-DHT is inherently fault tolerant. In addition, it increases the consistency of data items because updates need not be propagated in overlay networks.

R-DHT maintains API compatibility with existing DHT. In addition, R-DHT lookup operations exploit not only existing DHT lookup schemes, but also routing by segments, shared finger tables, and finger flexibility through backup fingers. Through theoretical and simulation analysis, we show that in a Chord-based R-DHT consisting of N hosts and K total unique keys, the lookup path length is O(min(log K, log N)) hops. This result suggests that our proposed lookup optimization schemes can reduce the lookup path length in other DHTs that also virtualize hosts into nodes. We further demonstrate that R-DHT lookup is resilient to node failures. In our simulations, when 50% of nodes fail simultaneously, the percentage of failed lookups in R-Chord is 5–30%; this is lower than in Chord, where at least 60% of lookups fail. Our simulation also shows that under churn, the lookup performance of R-Chord is comparable to Chord even though the R-Chord overlay is eight times larger: (i) the R-Chord lookup path length is shorter, and (ii) the percentage of failed lookups in R-Chord is at most 8% worse than in Chord.

The host-to-nodes virtualization in R-DHT increases the size of the overlay network in terms of the number of nodes. This leads to a higher overhead in maintaining an overlay network. In an R-Chord system that virtualizes N hosts into V nodes where V ≥ N, the maintenance overhead is O(V log V log N). With the same number of hosts, the maintenance overhead in Chord is O(N log^2 N).
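To give a feel for the gap between the two maintenance bounds, the Python sketch below evaluates their dominant terms for one configuration. It assumes V = N·T with T unique keys per host on average, uses base-2 logarithms, and ignores the constants hidden by the O-notation; the function names are hypothetical, and the numbers are indicative only, not measured message counts.

```python
import math


def rchord_maintenance(num_hosts: int, keys_per_host: float) -> float:
    """Dominant term V log V log N of the R-Chord bound, with V = N * T."""
    v = num_hosts * keys_per_host
    return v * math.log2(v) * math.log2(num_hosts)


def chord_maintenance(num_hosts: int) -> float:
    """Dominant term N log^2 N of the Chord bound."""
    return num_hosts * math.log2(num_hosts) ** 2


# Setting of Section 2.5.4: N = 25,000 hosts and T = 8 unique keys per host.
ratio = rchord_maintenance(25_000, 8) / chord_maintenance(25_000)
print(f"R-Chord / Chord maintenance term: {ratio:.2f}")
```

Under these assumptions the ratio simplifies to T · log V / log N, so it grows slightly faster than the number of nodes per host.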
Our simulation analysis further revealed that when the stabilization period is large, R-Chord requires more time to correct its overlay into a ring topology. To address the problem of overlay-network maintenance in R-Chord, we present in the next chapter a hierarchical R-DHT where nodes are organized into a two-level overlay network.

Chapter 3

Hierarchical R-DHT: Collision Detection and Resolution

Stabilization refers to the procedure for correcting routing states to adapt to changes in an overlay topology. The overlay topology changes when new nodes join, or existing nodes leave or fail. The one-level R-DHT (i.e. flat R-DHT) presented in the previous chapter achieves the same lookup path length as conventional DHT (Theorem 2.1) but with a higher stabilization cost, i.e. the maintenance overhead (Theorem 2.5).

In a system consisting of N hosts where each host has T unique keys on average, a flat R-DHT overlay consists of V (= NT) nodes. Compared to a conventional DHT overlay consisting of N nodes, R-DHT corrects a higher number of routing-table entries which, according to Theorem 2.5, is due to:

1. The number of routing tables is proportional to V, i.e. one routing table per node.
2. The size of each routing table increases as V increases (Theorem 2.4).

In the case of R-Chord, it takes Ω(V log^2 V) stabilization messages to correct all V finger tables, including successor and predecessor pointers, in an R-Chord overlay, i.e. Ω(V log^2 V) hops per finger table.

In this chapter, we discuss a hierarchical R-DHT scheme that reduces its maintenance overhead by organizing nodes into a two-level overlay network. To address the problem of collision of groups, we propose a collision-detection method that piggybacks on stabilization, and two collision-resolution methods.
Collisions occur when concurrent node joins result in nodes with the same group identifier being created at the top-level overlay. This increases the size of the top-level overlay, which in turn increases the total number of stabilization messages in the top-level overlay. In the worst case, collisions lead to the degeneration of hierarchical DHT into flat DHT, i.e. every node occupies the top-level overlay.

The rest of this chapter is organized as follows. We first present existing approaches to reduce maintenance overhead in DHT. Next, using R-Chord as the example, we propose a collision detection and resolution scheme. We evaluate our proposed scheme through simulation experiments. Finally, we conclude this chapter with a summary.

3.1 Related Work

A number of approaches have been proposed to reduce the maintenance overhead of DHT. We classify existing approaches into three main categories: (i) varying the frequency of stabilization, (ii) varying the number of routing states to correct, and (iii) hierarchical DHT. The first two approaches are directly applicable to flat R-DHT.

3.1.1 Varying Frequency of Stabilization

Frequency-based approaches such as adaptive stabilization [29, 52], piggybacking stabilization with lookups [17, 90], and reactive stabilization [17] reduce the maintenance overhead by reducing the frequency of correcting routing states. Adaptive stabilization adjusts the frequency based on the churn rate and the importance of each routing state to lookup performance (see footnote 1). Systems such as DKS [17] and Accordion [90] piggyback stabilization with lookups to reduce the necessity of performing dedicated periodic stabilization (see footnote 2). Reactive stabilization such as DKS's correction-on-change [53] does away altogether with periodic stabilization. Instead, changes to overlay networks due to membership changes are propagated immediately when membership-change events are detected. However, Rhea et al.
have reported that reactive stabilization can increase maintenance overhead under high churn rate and constrained bandwidth availability [119].

3.1.2 Varying Size of Routing Tables

This approach reduces the size of routing tables so that the number of routing states to correct becomes smaller. Examples of DHT that implement this approach include CAN [116], Koorde [76], and Accordion [90]. However, reducing the size of routing tables potentially increases lookup path length [139]. Besides reducing the size of routing tables, each routing table can also be partitioned into two parts: one part consisting of entries that are corrected through stabilization, and the other part consisting of cached entries. This reduces the maintenance overhead while achieving a shorter lookup path length. For example, a finger table in Chord consists of O(log N) fingers and a number of location caches, where the location caches are maintained by the LRU replacement policy [2].

¹ Routing states with higher importance, e.g. successor pointers in Chord [133] and leaf sets in Pastry [123], are refreshed/corrected more frequently.
² In DKS, this is referred to as correction-on-use.

3.1.3 Hierarchical DHT

Hierarchical DHT partitions stabilization among different overlays. This speeds up each stabilization process and reduces the number of stabilization messages in each of the overlays. Hierarchical DHT organizes nodes into a multi-level overlay network, where the top-level overlay consists of logical groups [51, 68, 77, 101, 137, 140, 143]. Each group, which consists of a number of nodes, is assigned a group identifier based on a common node property. For example:

1. Grouping by administrative domains [68, 101, 143] improves administrative autonomy and reduces latency.
2. Grouping by physical proximity [137, 140] reduces network latency.
3. Grouping by services [77] promotes the integration of services into one system.
In each group, one or more supernodes act as gateways to the nodes at the second level. Within each group, nodes can further form a second-level overlay network. In terms of topology maintenance, the hierarchical DHT has the following advantages compared to the flat DHT:

1. Each stabilization message in a hierarchical DHT is routed only in one of the smaller second-level overlays. This reduces the number of stabilization messages processed by each node.
2. Topology changes within a group due to churn do not affect the top-level overlay or other groups. Stable overlay topologies improve the result guarantee of DHT lookups.

In the following, we compare our proposed scheme with existing hierarchical DHT. We discuss how each scheme addresses the problem of collisions.

In hierarchical DHT such as Brocade [143], SkipNet [68], and hierarchical Scribe [101], collisions do not occur because a new node always chooses a bootstrap node from the same group. In such systems, nodes are grouped by their administrative domains. Therefore, it is natural for the new node to choose a bootstrap node from the same administrative domain. This grouping policy guarantees that multiple groups with the same group identifier are not created. However, such systems do not address other grouping policies that can introduce collisions, i.e. when a new node is bootstrapped from a node in a different group.

In systems such as the hierarchical DHT by Garcés-Erice et al. [51], Diminished Chord [77], Hieras [140], and HONet [137], collisions can occur but the problem is not directly addressed. They assume that collisions can be resolved by mechanisms inherent in the system structure, and the extent of collisions is not studied. In [77, 140], all nodes in a group are assumed to be supernodes. In such systems, the size of the top-level overlay, with or without collisions, is the same.
Hence, the stabilization procedure of the underlying DHT is sufficient to resolve collisions. However, the size of the top-level overlay is larger than in systems where only a subset of nodes become supernodes. Thus, the total number of stabilization messages is larger because more supernodes have to perform stabilization. In [51, 137], a new node can choose a bootstrap node from a different group. Hence, it is possible that the bootstrap node cannot locate the group associated with the new node, even though the group exists. However, the effect and impact of the collisions are not evaluated.

To summarize the above comparisons, our scheme relaxes the assumptions that a new node must be bootstrapped from the same group and that all group members must become supernodes. In addition, our scheme resolves collisions to keep the top-level overlay size close to the ideal size.

3.2 Design of Hierarchical R-DHT

In an R-DHT framework with V nodes and K (≤ V) unique keys, the hierarchical R-DHT organizes the nodes into a two-level overlay network. The top-level overlay consists of K groups, and each group consists of nodes that share the same key. Therefore, groups are equivalent to segments in a flat R-DHT. Every group has one or more supernodes that act as gateways to other nodes in the group. These supernodes are organized in the top-level overlay. Each group can further organize its nodes as a second-level overlay with a topology and stabilization mechanism that differ from the top level. Clearly, each of the overlay networks in a hierarchical R-DHT is smaller than the flat R-DHT overlay. Thus, while each host h is still virtualized into |T_h| nodes, each of the nodes joins a smaller overlay network than in the flat R-DHT network. As a result, each node maintains and corrects a smaller number of fingers than the flat R-DHT’s nodes. Figure 3.1 shows a hierarchical R-Chord where the top-level ring consists of four groups.
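The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the host identifiers and key sets are hypothetical, and choosing the lowest host identifier in each group as its supernode is an assumed policy, since supernode selection is left to the group.

```python
from collections import defaultdict

# Hypothetical hosts and the keys (resource types) each host shares.
hosts = {4: {2, 6}, 6: {6}, 7: {2, 9}, 3: {9, 12}}

# Virtualize each host h into one node k|h per key k in T_h.
nodes = [(k, h) for h, keys in hosts.items() for k in sorted(keys)]

# A group collects all nodes sharing key k, like a segment in flat R-DHT.
groups = defaultdict(list)
for k, h in nodes:
    groups[k].append((k, h))

# Assumed policy (not from the thesis): the node with the lowest host
# identifier in each group acts as the single supernode of that group.
top_level = sorted(min(members) for members in groups.values())

V, K = len(nodes), len(groups)
print(f"V = {V} nodes, K = {K} groups, top-level overlay = {top_level}")
```

With these inputs the flat overlay would contain all V nodes, whereas the top-level overlay contains only one supernode per group.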
Figure 3.1: Two-Level Overlay Consisting of Four Groups

Nodes in the hierarchical R-DHT are assigned two identifiers, as opposed to nodes in the flat R-DHT. In the flat R-DHT, n = k|h denotes that k|h is the node identifier of node n. In the hierarchical R-DHT, we also assign a group identifier to node n. The value of the group identifier is equal to prefix(n), which is k. In addition, each second-level node in the hierarchical R-DHT holds a pointer to at least one of the supernodes in its group. Table 3.1 summarizes the important variables maintained by each node, in addition to the ones presented in Figure 2.1.

Variable    Description
gid         m-bit group identifier (= prefix(n))
is_super    true if n is a supernode, false otherwise
supernode   Pointer to supernode of group gid, nil if n is a supernode

Table 3.1: Additional Variables Maintained by Node n in a Hierarchical R-DHT

In hierarchical R-DHT, locating key k implies locating the group responsible for k. Firstly, a lookup request for key k is routed to the supernode of the initiating group. Secondly, using the R-DHT lookup algorithm (Figure 2.13), the lookup request is further routed to the supernode of group k, i.e. a supernode whose node identifier is prefixed with k. Thirdly, the lookup request can be further forwarded to one of the second-level nodes in group k, depending on the application. As illustrated in Figure 3.2, a lookup request for key 2, initiated by second-level node 6|6, is forwarded to its supernode 6|4 (step 1). In the top-level overlay, the lookup request is routed to supernode 2|7 of group 2 (step 2). Finally, supernode 2|7 can further forward the request to its second-level nodes (step 3), e.g. a lookup for compute resources of type 2 in multiple administrative domains.

If new nodes join hierarchical R-DHT when some routing states in the top-level overlay are incorrect, i.e.
yet to be updated, the top-level overlay may end up with two or more groups with the same group identifier. In the following subsections, we discuss how collisions occur, and then present our proposed scheme to detect and resolve collisions. To avoid sending additional overhead messages, collision detection is performed together with successor stabilization, i.e. the process of correcting successor pointers. This is because successful collision detection requires the successor pointers in the top-level Chord overlay to be correct, and the correctness of the successor pointers is maintained by stabilization.

Figure 3.2: Example of a Lookup in Hierarchical R-DHT

3.2.1 Collisions of Group Identifiers

Collisions of group identifiers arise because of join operations invoked by nodes. Figure 3.3 shows the node-join algorithm for hierarchical R-Chord. Node n, whose group identifier is denoted as n.gid, makes a request to join group g through bootstrap node n′. In a hierarchical R-Chord, this means finding successor(g|0) in the top-level overlay. If n′ successfully finds an existing group g, then n joins this group using a group-specific protocol (lines 5–9). However, if n′ returns g′ > g, then n creates a new group with identifier g (lines 11–15). A collision occurs if the new group is created even though a group with identifier g has already been created. This happens because n and bootstrap node n′ are in two different groups, and the top-level overlay has not fully stabilized (i.e. some supernodes have incorrect successor pointers).

     1. // Node n joins through bootstrap node n′
     2. n.join(n′)
     3.   h′ = suffix(n′);
     4.   s = h′.find_successor(gid|0); // See Figure 2.18b
     5.   if (gid == s.gid)
     6.     // s is a supernode of group g
     7.     join_group(s);
     8.     is_super = false;
     9.     supernode = s
    10.   else
    11.     // n creates a new group.
    12.     // This can cause a collision.
    13.     predecessor = nil;
    14.     successor = s;
    15.     is_super = true;

Figure 3.3: Join Operation

Figure 3.4 illustrates a collision that occurs when node 1|2 and node 1|3, belonging to the same group g_1, join concurrently. Due to the concurrent joins, find_successor() invoked by both nodes during their join operations will return node 2|7. This causes both new nodes to create two groups with the same group identifier g_1.

Figure 3.4: Collision at the Top-Level Overlay

3.2.2 Collision Detection

We propose to perform collision detection during successor stabilization. This is achieved by extending Chord’s stabilization so that it not only checks and corrects the successor pointer of supernode n, but also detects if n and its new successor should be in the same group. Figure 3.5 presents our collision detection algorithm, assuming that each group has only one supernode. The algorithm first ensures that the successor pointer of a node is valid (lines 4–5). It then checks for a potential collision (lines 8–10), before updating the successor pointer to point to the correct node (lines 11–13).

     1. // n periodically verifies its successor pointer,
     2. // and announces itself to the successor.
     3. n.stabilize_successor()
     4.   if successor.is_super == false then
     5.     successor = successor.supernode();
     6.
     7.   p = successor.predecessor;
     8.   if ((p ≠ n) and (p.gid == gid)) then
     9.     if is_collision(p) then
    10.       merge(p);
    11.   else if n.gid < p.gid < successor.gid then
    12.     successor = p;
    13.   successor.notify(n);

(a) Main Algorithm

    14. // n′ thinks it might be our predecessor
    15. n.notify(n′)
    16.   if (predecessor == nil) or
    17.      (predecessor.is_super == false) or
    18.      (predecessor < n′ < n) then
    19.     predecessor = n′;
    20.
    21. // Assume one supernode per group
    22. n.is_collision(n′)
    23.   if (gid == n′.gid)
    24.     return true
    25.
    26.   return false

(b) Helper Functions

Figure 3.5: Collision Detection Algorithm
Figure 3.6: Collision Detection Piggybacks Successor Stabilization (panels a to d)

The following example illustrates the collision detection process. In Figure 3.6a, a collision occurs when nodes 1|2 and 1|3, belonging to the same group, group 1, join concurrently. In Figure 3.6b, node 1|3 stabilizes and causes node 2|7 to set its predecessor pointer to node 1|3 (step 1). Then, the stabilization by node 0|5 causes 0|5 to set its successor pointer to node 1|3 (step 2), and node 1|3 to set its predecessor pointer to node 0|5 (step 3). In Figure 3.6c, the stabilization by node 1|2 causes 1|2 to set its successor pointer to node 1|3. At this time, a collision is detected by node 1|2 and is resolved by merging 1|2 into 1|3.

If each group contains more than one supernode, then the is_collision routine shown in Figure 3.5 may incorrectly detect collisions. Consider the example in Figure 3.7a. When node n stabilizes, it incorrectly detects a collision with node n′ because n.successor.predecessor = n′ and n.gid = n′.gid. An approach to avoid this problem is for each group to maintain a set of its supernodes [51, 65] so that each supernode can accurately decide whether a collision has occurred. The modified collision detection algorithm is shown in Figure 3.7b.

(a) Multiple Supernodes in Each Group

    n.is_collision(n′)
      // L is a set of supernodes in my group
      if n′ ∉ L then
        return true

      return false

(b) Modified is_collision Algorithm

Figure 3.7: Collision Detection for Groups with Several Supernodes

3.2.3 Collision Resolution

To resolve collisions, groups associated with the same gid are merged. After the merging, some supernodes become ordinary nodes depending on the group policy. Before a supernode changes its state into a second-level node, the supernode notifies its successors and predecessors to update their pointers (Figure 3.8). Nodes in the second level also need to be merged into the new group.
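The walkthrough of Figure 3.6 and the pointer updates performed by a merge can be traced with a small simulation. The sketch below is a simplification, not the thesis implementation: it assumes one supernode per group and no second-level members, compares full identifiers (gid, host) on a circular interval through the hypothetical helper `between`, and guards the leave announcement against nil pointers, which the pseudocode does not spell out.

```python
class Node:
    def __init__(self, gid, host):
        self.gid, self.host = gid, host
        self.is_super = True
        self.supernode = None          # set only once n becomes second-level
        self.successor = self
        self.predecessor = None

    @property
    def ident(self):
        return (self.gid, self.host)


def between(a, x, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


def notify(n, cand):
    pred = n.predecessor
    if pred is None or not pred.is_super or between(pred.ident, cand.ident, n.ident):
        n.predecessor = cand


def merge(n, other):
    """n stops being a supernode and joins the group led by other."""
    n.is_super = False
    if n.successor.predecessor is n:           # announce leave only if linked in
        n.successor.predecessor = n.predecessor
    if n.predecessor is not None:
        n.predecessor.successor = n.successor
    n.predecessor = n.successor = None
    n.supernode = other


def stabilize_successor(n):
    if not n.successor.is_super:
        n.successor = n.successor.supernode
    p = n.successor.predecessor
    if p is not None and p is not n and p.gid == n.gid:
        merge(n, p)                            # same gid, one supernode: collision
        return
    if p is not None and between(n.ident, p.ident, n.successor.ident):
        n.successor = p
    notify(n.successor, n)


def ring(start):
    out, cur = [], start
    while True:
        out.append(cur.ident)
        cur = cur.successor
        if cur is start:
            return out


# Stable top-level ring with supernodes 0|5 and 2|7.
a, b = Node(0, 5), Node(2, 7)
a.successor = a.predecessor = b
b.successor = b.predecessor = a

# Concurrent joins: both 1|2 and 1|3 are told that successor(1|0) is 2|7.
x, y = Node(1, 2), Node(1, 3)
for new in (x, y):
    new.successor, new.predecessor = b, None

stabilize_successor(y)   # 2|7 adopts 1|3 as predecessor
stabilize_successor(a)   # 0|5 adopts 1|3 as successor
stabilize_successor(x)   # 1|2 sees 1|3 ahead of 2|7 and merges into it
print(ring(a), x.is_super, x.supernode.ident)
```

After the three stabilization rounds the top-level ring holds a single supernode for group 1, and node 1|2 has become a second-level node pointing at 1|3.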
We propose two methods to merge groups, namely supernode initiated and node initiated.

    // Set predecessor of n to n′
    n.replace_predecessor(n′)
      predecessor = n′;

    // Set successor of n to n′
    n.replace_successor(n′)
      successor = n′;

Figure 3.8: Announce Leave to Preceding and Succeeding Supernodes

    // Node n joins the group where n′ is the supernode
    n.merge(n′)
      is_super = false

      // Announce leave to neighbors in top-level overlay
      successor.replace_predecessor(predecessor);
      predecessor.replace_successor(successor);
      predecessor = successor = nil;

      n′.join_group(n);
      g = prefix(n);
      for each node x ∈ g do
        x.join_group(n′);
        x.supernode = n′

Figure 3.9: Supernode-Initiated Algorithm

3.2.3.1 Supernode Initiated

To merge a group n.gid with another group n′.gid, the supernode n notifies its second-level nodes to join group n′.gid (Figure 3.9). The advantage of this approach is that second-level nodes join a new group as soon as a collision is detected. However, n needs to keep track of its group membership, which may not always be correct. If n has only partial knowledge of group membership, some nodes in the second level can become orphans.

3.2.3.2 Node Initiated

In node-initiated merging, each second-level node periodically checks that its known supernode n′ is still a supernode (Figure 3.10). If n′ is no longer a supernode, then the second-level node asks n′ to find the correct supernode and joins a new group through the new supernode. This approach does not require supernodes to track group membership. However, an additional overhead is introduced because second-level nodes periodically check the status of their supernode.

    // Supernode n joins another group,
    // ignoring its second-level nodes
    n.merge(n′)
      is_super = false;

      // Announce leave to neighbors in top-level overlay
      successor.replace_predecessor(predecessor);
      predecessor.replace_successor(successor);
      predecessor = successor = nil;

(a) Main Algorithm

    // Second-level node n periodically
    // verifies its supernode pointer
    n.check_supernode()
      if supernode.is_super == false then
        x = supernode.supernode;
        supernode = x;
        join_group(x);

(b) Helper Functions

Figure 3.10: Node-Initiated Algorithm

3.3 Simulation Analysis

To evaluate the effectiveness of our proposed scheme, we first show that hierarchical R-Chord significantly reduces maintenance overhead compared to flat R-Chord. Then, we study the performance of collision detection and resolution by comparing two systems: without detect & resolve (i.e. hierarchical R-Chord without collision detection and resolution) and detect & resolve (i.e. hierarchical R-Chord with collision detection and resolution). We assume that each group contains one supernode. To resolve collisions, we use the supernode-initiated approach. Since the emphasis of this experiment is to study collisions at the top-level R-Chord, and the purpose of collision resolution is to ensure that second-level overlays are correct after a collision, the choice of collision-resolution approach does not significantly affect the result of this experiment.

We extend the Chord simulator included in Chord SDK [2] to model a hierarchical R-Chord. The inter-arrival time of nodes is exponentially distributed with a mean of one second. Each supernode maintains a successor pointer, a predecessor pointer, and O(log K_C) fingers. In addition, each supernode periodically invokes the stabilization procedure. With a stabilization period parameter of p (in seconds), the stabilization period is uniformly distributed in the interval [0.5p, 1.5p]. In the simulator, each stabilization corrects the successor pointer and one of the fingers.
The link latency between nodes is exponentially distributed with a mean of 50 ms, and the request-processing time by each node is uniformly distributed between 5 and 15 ms.

3.3.1 Maintenance Overhead

We measure the maintenance overhead by the total number of stabilization messages in the top-level overlay. We simulated hierarchical and flat systems with 50,000 and 100,000 nodes (V). For the number of groups (K) in the top-level overlay, we chose the values of 2,000 and 8,000. Thus, we evaluated four different R-Chord configurations. In addition, we compare the maintenance overhead of hierarchical R-Chord to Chord. The results are shown in Figure 3.11.

As shown in Figure 3.11, hierarchical R-Chord significantly reduces the maintenance overhead of its top-level overlay compared to flat R-DHT, because the top-level overlay consists of only K groups. Hence, there is a smaller number of fingers to correct. When we double V from 50,000 to 100,000 nodes, the total number of stabilization messages does not increase; this is in contrast to flat R-Chord. In both systems, the number of stabilization messages reduces by 50% when p is increased from 30 seconds to 60 seconds, because stabilization is performed less frequently.

The overhead to maintain the top-level overlay of hierarchical R-Chord is lower than that of Chord consisting of N = 10,000 hosts, but is still higher than that of Chord consisting of N = 1,000 hosts. This is because the maintenance overhead of hierarchical R-Chord and Chord depends on K and N, respectively. In hierarchical R-Chord, the maintenance overhead in the top-level overlay is O(K log² K) because there are K groups in the top-level overlay and each supernode of a group maintains O(log K) fingers³. Note that the total cost to maintain a hierarchical R-Chord, which includes the cost to maintain second-level overlays, is Ω(V log N log V).
However, this can be amortized through less frequent stabilizations in the second-level overlays. Unlike the top-level overlay, where an incorrect topology causes all nodes sharing the same resource type to be inaccessible, an incorrect second-level overlay causes only a subset of nodes sharing the same resource type to be inaccessible.

We briefly discuss the comparison with the hierarchical DHT schemes presented in Section 3.1.3. In [51, 68, 101, 137, 143], the maintenance overhead for their top-level overlay is higher than that of hierarchical R-DHT when their top-level overlay is larger than K. In [77, 140], the size of their top-level overlay is always O(N). In addition, each node joins more than one overlay network, i.e. the top-level overlay plus the lower-level overlay networks. Hence, the maintenance overhead for their top-level overlay is higher than that of hierarchical R-DHT when N > K. Moreover, in the case of Hieras [140], the number of overlay levels can be greater than two and each node is present in every level. Therefore, the maintenance overhead of the whole Hieras overlays is higher when the total size of all its overlays is greater than V.

³ The proof is similar to that of Theorem 2.5.

Figure 3.11: Maintenance Overhead of Hierarchical R-Chord: (a) p = 30 Seconds, (b) p = 60 Seconds

3.3.2 Extent and Impact of Collisions

Consider the total number of stabilization messages required at the top-level R-Chord overlay. Let K (≤ N) denote the number of groups and V denote the number of nodes. Each group employs one supernode and hence, we expect that the ideal size of the top-level overlay consists of K supernodes. Without collisions, the total number of stabilization messages (S) is O(K log² K) because there are K groups that perform stabilization, each group corrects O(log K) fingers, and the cost of correcting each finger is O(log K). With collisions, the size of the top-level overlay is increased by c times, i.e.
cK groups. As each group performs periodic stabilization, the cost of stabilization when collisions occur (S_C) is Ω(cS) (Equation 3.1).

\[
\frac{S_C}{S} = \frac{cK \log^2 cK}{K \log^2 K} = \frac{c \log^2 cK}{\log^2 K} = \Omega(c) \tag{3.1}
\]

Table 3.2 shows the extent of collisions, measured as the total number of collisions for different values of the stabilization period p. Without resolving collisions, the number of collisions is about 2 to 5 times K. With frequent stabilization, our scheme significantly reduces the number of collisions. But as p increases, the number of collisions grows because of the reduced frequency of collision resolution.

(a) V = 50,000 Nodes

           Without Detect & Resolve      Detect & Resolve
    p      K = 2,000    K = 8,000        K = 2,000    K = 8,000
    30     5,740        11,421           56           33
    60     5,941        11,511           113          153
    120    6,425        12,823           1,181        1,088
    240    8,914        15,905           1,609        2,349

(b) V = 100,000 Nodes

           Without Detect & Resolve      Detect & Resolve
    p      K = 2,000    K = 8,000        K = 2,000    K = 8,000
    30     7,097        16,930           35           23
    60     7,232        17,009           212          136
    120    7,830        17,979           641          1,133
    240    9,813        20,139           1,942        3,023

Table 3.2: Number of Collisions

The impact of collisions is measured by the growth in the size of the top-level overlay. Figure 3.12 shows the number of groups at an interval of one hour. Without collision resolution, the size of the top-level overlay grows to about 2 to 5 times K because the additional groups caused by collisions remain in the top-level overlay. If the size of the top-level overlay increases by 5 times, then the total number of stabilization messages is increased by Ω(5) times. On the other hand, detect & resolve merges the colliding groups so that the size of the overlay converges to the ideal size K.

Figure 3.12 also shows that more frequent stabilization keeps the top-level overlay size close to the ideal size. With a larger p, stabilization is performed less frequently. Thus, more stabilization rounds are required to correct the successor pointers.
Since our scheme is performed together with stabilization to reduce overhead, it takes a longer time to reduce the size of the top-level overlay close to the ideal size. As an example, with p = 240 seconds, it takes at least 15 hours to reduce the top-level overlay size to the ideal size (Figure 3.12b).

Figure 3.12: Size of Top-Level Overlay (V = 100,000 Nodes): (a) p = 30 Seconds, (b) p = 240 Seconds

3.3.3 Efficiency and Effectiveness

The efficiency and effectiveness of our scheme depend on the frequency of detection and resolution, which is determined by the stabilization period p.

3.3.3.1 Detection

The efficiency of collision detection is measured by the average time required to detect a collision. This is defined as the period between a join and a stabilization procedure that detects the collision. It is desirable to detect collisions as soon as possible to minimize the impact of collisions. Table 3.3 shows that the average time to detect collisions increases as p increases. From the results, the ratio of the collision detection time to the stabilization interval (p) is up to 104 times (i.e. p = 120 seconds and K = 8,000). This indicates that collision detection time is significant.

           V = 50,000                    V = 100,000
    p      K = 2,000    K = 8,000        K = 2,000    K = 8,000
    30     1,265        186              288          61
    60     3,211        2,849            1,764        4,236
    120    5,955        9,635            5,557        12,526
    240    9,281        22,070           6,960        23,646

Table 3.3: Average Time to Detect a Collision (in Seconds)

Table 3.4 shows the effectiveness of our scheme. Let β denote the ratio of the number of collisions in the detect & resolve case to the number of collisions in the without detect & resolve case. With frequent stabilization, when p is 30 seconds, β is at most 0.01, i.e. our scheme reduces the number of collisions by 99%. As p increases, the effectiveness of the scheme decreases. However, even when p is 240 seconds, our scheme still reduces the number of collisions by at least 80%.
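As a worked check of the definition of β, the sketch below recomputes the ratio from the collision counts reported in Table 3.2(a) for V = 50,000 and K = 2,000. The variable names are ours, and the values are simply the table entries for that configuration.

```python
# Collision counts from Table 3.2(a), V = 50,000 and K = 2,000,
# keyed by the stabilization period p in seconds.
without_resolve = {30: 5740, 60: 5941, 120: 6425, 240: 8914}
with_resolve = {30: 56, 60: 113, 120: 1181, 240: 1609}

# beta = collisions with detect & resolve / collisions without it.
beta = {p: with_resolve[p] / without_resolve[p] for p in without_resolve}

# Fraction of collisions removed by the scheme.
reduction = {p: 1 - b for p, b in beta.items()}

for p in sorted(beta):
    print(f"p = {p:3d} s  beta = {beta[p]:.2f}  reduction = {reduction[p]:.0%}")
```

For this configuration the ratio stays near 0.01 at p = 30 seconds and remains below 0.2 at p = 240 seconds, in line with the reduction of at least 80% stated above.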
           V = 50,000                    V = 100,000
    p      K = 2,000    K = 8,000        K = 2,000    K = 8,000
    30     0.01         0.01             0.01         0.01
    60     0.02         0.01             0.02         0.03
    120    0.18         0.08             0.08         0.11
    240    0.18         0.15             0.13         0.20

Table 3.4: Ratio of Number of Collisions (β)

3.3.3.2 Resolution

There are two main factors that affect the cost of collision resolution. The first factor is the number of groups and nodes to be merged. Table 3.5 shows the average number of nodes corrected in each collision resolution. The effectiveness of collision resolution improves with a higher frequency of stabilization. Overall, our results indicate that the average number of nodes corrected can be reduced to less than 10% of the average group size (V/K).

           V = 50,000                    V = 100,000
    p      K = 2,000    K = 8,000        K = 2,000    K = 8,000
    30     2.2          2.1              3.2          2.1
    60     2.9          2.5              3.1          2.3
    120    3.6          4.0              7.2          3.6
    240    7.0          4.9              11.8         6.2

Table 3.5: Average Number of Nodes Affected by a Collision

The second factor is the overhead of correcting stale finger pointers and the cost of updating fingers to point to the new group after merging. As each group is pointed to by O(log K_C) groups and the correction of each finger pointer requires O(log K_C), the total cost to update the fingers pointing to the merged group is O(log² K_C).

The results in this section, i.e. Tables 3.3–3.5, suggest that the efficiency and effectiveness of our scheme can be improved by having more frequent detections and resolutions. This will reduce both the number of collisions and the cost of correcting collisions. Based on the simulation results in Table 3.2, with p = 60 seconds, the number of collisions is smaller than 12% of the ideal size (when V = 100,000 and K = 2,000).

3.4 Summary

We have presented a hierarchical R-DHT and a scheme to detect and resolve collision of groups. A hierarchical R-DHT organizes nodes into a two-level overlay network.
It partitions stabilization among different overlays to speed up each stabilization process and reduce the number of stabilization messages in each overlay. In the hierarchical R-DHT, the maintenance overhead of the top-level overlay is O(K log² K). However, collision of groups increases the size of the top-level overlay by a factor c, which increases the total number of stabilization messages by Ω(c) times. Our scheme performs collision detection together with stabilization to avoid introducing additional messages. Two approaches are proposed to resolve collisions: supernode-initiated merging and node-initiated merging.

Our simulation results show that if collisions are not resolved, the size of the top-level overlay more than doubles. With our scheme, the number of collisions is reduced by at least 80%. In addition, the size of the top-level overlay remains close to the ideal size; otherwise it can be up to five times larger, which increases the total number of stabilization messages by Ω(5) times. The results also reveal the importance of minimizing collisions, as it takes several stabilization rounds to detect collisions. Thus, more frequent stabilization reduces collisions and keeps the top-level overlay close to the ideal size.

Chapter 4
Midas: Multi-Attribute Range Queries

DHT supports lookup with exact queries effectively (i.e. high result guarantee) and efficiently (i.e. short lookup path length). An exact query locates resources identified with a specific key. As an example, the query "find files whose file name = A.MP3" locates all files identified with a key of SHA1(A.MP3). Recently, supporting efficient multi-attribute range queries on DHT has been an active area of research (see Section 1.3). A multi-attribute query locates resources identified with multiple search attributes. Each search attribute can be constrained by a range of values using relational operators (<, ≤, =, >, and ≥).
As an example, "find compute resources whose cpu = P3 and 1 GB ≤ memory ≤ 2 GB" is a query consisting of two search attributes; the second search attribute, memory, has a range of 1 GB.

We propose Midas (Multi-dimensional range queries), an approach to support multi-attribute range queries on R-DHT based on a d-to-one mapping scheme. We focus on resources that are described by a well-defined schema, e.g. GLUE for describing compute resources [5]. Midas adopts the Hilbert space-filling curve (Hilbert SFC) [124] as the d-to-one mapping function because it has been shown that, for multi-dimensional indexing, it has a better clustering property than other types of SFC [74, 103].

The rest of this chapter is organized as follows. First, we discuss the related work, followed by an overview of the Hilbert space-filling curve. Next, we discuss the design of the Midas indexing scheme, followed by two optimizations of the query engine, namely incremental search and search-key elimination. The performance of Midas is evaluated using simulations. Finally, we conclude this chapter with a summary.

4.1 Related Work

We compare Midas with three main approaches in supporting multi-attribute range queries on DHT, namely distributed inverted index, d-to-d mapping, and d-to-one mapping (Section 1.3). We outline the rationale for choosing d-to-one as the basis for supporting multi-attribute range queries in R-DHT.

Compared to distributed inverted index, d-to-one mapping does not need to perform the intersection operator (∩). The intersection operation assumes that one or more intermediate results are created using selection operators (σ). However, a selection operation incurs a higher overhead in R-DHT as it visits every node within an R-DHT segment to create an intermediate result. With d-to-one, Midas avoids the intersection and thus does not need to create intermediate result sets.
Figure 4.1 compares how Chord and R-Chord create an intermediate result set consisting of resources whose cpu = P3. In Chord, resources whose cpu = P3 are indexed by a key of k = hash(P3). Thus, the select operation retrieves the relevant indexes from successor(k) only. However, the same operation in R-Chord retrieves the relevant indexes from all nodes in segment S_k.

Figure 4.1: Retrieving Result Set of Resource Indexes with Attribute cpu = P3: (a) Chord: Retrieve Key k from successor(k), (b) R-Chord: Retrieve Key k from All Nodes in Segment S_k

Compared to the d-to-d mapping scheme, d-to-one mapping offers a higher flexibility in selecting the underlying DHT. Because resources are mapped onto keys in a one-dimensional identifier space, we can use one of the many implementations of one-dimensional DHT [14, 17, 20, 68, 76, 99, 119, 122, 123, 133, 144] as the underlying infrastructure. On the other hand, d-to-d mapping requires a multi-dimensional DHT which, to the best of our knowledge, is implemented only by CAN [116].

A number of d-to-one mapping schemes have been proposed for DHT [16, 50, 86, 127, 131]. These schemes reduce the number of nodes visited during query processing by exploiting data-item distribution. Though Midas is also based on a d-to-one mapping scheme, it is designed for R-DHT, which does not distribute data items. To reduce query cost, Midas transforms a query into a number of search keys and performs R-DHT lookups only for available keys.

Table 4.1 summarizes the comparison of the three query processing schemes.
MIDAS: MULTI-ATTRIBUTE RANGE QUERIES Factor #keys/resource Distributed Inverted Index 105 d-to-one d-to-d General R-DHT d one one one Type of DHT any d-dimensional any any Query engine ∩ and σ flood region query exploit dataitem distribution search-key elimination Table 4.1: Comparison of Multi-attribute Range Query Processing 4.2 Hilbert Space-Filling Curve Let f : Nd → N denote a d-to-one mapping function which maps a d-dimensional space to a one-dimensional space. The function is also referred to as a spacefilling curve (SFC) because it can be visualized as a curve (i.e. the one-dimensional space) that traverses every coordinate in the d-dimensional space. A coordinate in a d-dimensional space is a tuple of d dimension values. Figure 4.2 illustrates two types of SFC, namely z-curve and Hilbert curve, on a 2-dimensional space consisting of two axes, x-axis and y-axis. Each dimension consists of four values from 0 to 3, resulting in a total of 16 coordinates (cells). SFC has been used in various applications, including multi-dimensional indexing in traditional databases [9, 45, 85, 115]. SFC allows every coordinate in a d-dimensional space to be assigned a unique identifier. The curve is divided into subcurves such that a coordinate covered by the ith subcurve, where i > 0, is assigned identifier i − 1. The curve traverses the whole d-dimensional space where every coordinate is covered by one subcurve only; this ensures all coordinates are assigned a unique identifier. Figure 4.2 shows that coordinate (0, 0) has the same z-identifier and Hilbert identifier, which is 0, as it is covered by the first subcurve of both SFC. On the other hand, coordinate (3, 3) has two different identifiers: 15 and 10 as its z-identifier and Hilbert identifier, CHAPTER 4. MIDAS: MULTI-ATTRIBUTE RANGE QUERIES 106 (a) z-Curve (b) Hilbert Curve Figure 4.2: SFC on 2-Dimensional Space respectively. 
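The identifiers quoted above can be reproduced with a short sketch. It assumes a z-curve that interleaves the bits of x and y (x first) and the widely used iterative conversion for the 2-dimensional Hilbert curve. The function names are illustrative, the curve orientation is assumed to match Figure 4.2, and this is not the table-driven algorithm used by the simulator in Section 4.4.

```python
def z_index(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y; x supplies the higher bit of each pair."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def hilbert_index(x: int, y: int, bits: int) -> int:
    """Iterative 2-D Hilbert mapping, one quadrant per approximation level."""
    n = 1 << bits          # number of values per dimension
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # position of the quadrant on the curve
        # Rotate/flip so that the sub-curve is seen in the base orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Coordinates mentioned in the text, on the 4 x 4 grid (2 bits per dimension).
print(z_index(0, 0, 2), hilbert_index(0, 0, 2))  # 0 0
print(z_index(3, 3, 2), hilbert_index(3, 3, 2))  # 15 10
```

Under these assumptions, the adjacent coordinates (1, 0) and (2, 0) map to z-identifiers 2 and 8 and to Hilbert identifiers 1 and 14.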
In Sections 4.2.1–4.2.2, we present the locality property of Hilbert SFC and the recursive construction of a Hilbert curve.

4.2.1 Locality Property

An SFC preserves locality if coordinates close in the d-dimensional space are mapped onto identifiers close in the one-dimensional space, and vice versa [61]. The locality property is desirable in many types of applications as it improves their performance. In traditional databases, preserving locality reduces the number of disk blocks to be fetched and the seek time during query processing. Similarly, in DHT, preserving locality reduces the number of nodes visited when a query is processed.

Gotsman et al. [61] and Jagadish [74] reported that, in general, for any d′ < d it is not possible to always map two points that are close in a d-dimensional space to two points that are close in a d′-dimensional space¹. Referring to Figure 4.2, two adjacent coordinates, (1, 0) and (2, 0), are mapped onto two identifiers that are further apart: 2 and 8 in z-curve, and 1 and 14 in Hilbert curve, respectively.

Though achieving optimal locality is not possible, studies have indicated that among various SFC, Hilbert SFC achieves better locality when applied to multi-dimensional indexing [74, 103]. Intuitively, this is because two consecutive Hilbert identifiers always correspond to two adjacent coordinate points. Jagadish [74] and Moon et al. [103] quantify the locality-preserving property using the number of clusters. A cluster is a group of coordinate points inside a d-dimensional region that are mapped onto consecutive identifiers. The region represents a multi-dimensional range query and is a subspace of a d-dimensional space. Using theoretical analysis and simulation, the Hilbert curve is shown to minimize the average number of clusters compared to other types of SFC [74, 103]. Figure 4.3 shows a region, i.e. the shaded area, which is mapped by z-curve and Hilbert curve.
With z-curve, the region is covered by two clusters: the first cluster consists of identifier 1 and the second cluster consists of identifiers 3–7. With Hilbert curve, the region is covered by one cluster only, which consists of identifiers 2–7.

Figure 4.3: Clusters and Region. (a) z-Curve: Region is Covered by Cluster 1 and Cluster 3–7. (b) Hilbert Curve: Region is Covered by Cluster 2–7.

¹ In the case of SFC, d′ = 1.

4.2.2 Constructing Hilbert Curve

To construct a Hilbert curve, we recursively divide a d-dimensional space up to L approximation levels. At approximation level l, where 1 ≤ l ≤ L, we divide the d-dimensional space into $2^{dl}$ cells. The l-th-level SFC, which traverses the $2^{dl}$ cells, is constructed from a number of first-level curves, each with a potentially different orientation.

Figure 4.4 shows an example of constructing a Hilbert SFC that covers a 2-dimensional space, up to approximation level 3. In this example, cell coordinates and Hilbert identifiers are shown in binary. For Hilbert identifiers, the decimal values are also shown in parentheses. Figure 4.4a shows the level-1 curve which starts from coordinate $(0_2, 1_2)$, i.e. Hilbert identifier $00_2$ (0), and ends at coordinate $(1_2, 1_2)$, i.e. Hilbert identifier $11_2$ (3). In Figure 4.4b, each level-1 cell is split into four cells and a level-1 Hilbert curve, with a potentially different orientation, is applied to the four cells. For example, the four lower-left cells are covered by a level-1 Hilbert curve that has been rotated 270 degrees along the x-axis and mirrored along the y-axis, whereas the four lower-right cells are covered by a level-1 Hilbert curve that has been rotated 90 degrees along the x-axis. To construct the next level of the Hilbert curve (Figure 4.4c), we follow the same process and reuse level-2 curves.
Thus, the sixteen lower-left cells, for example, are covered by a level-2 Hilbert curve that has been rotated 270 degrees along the x-axis and mirrored along the y-axis.

Figure 4.4: Constructing Hilbert Curve on 2-Dimensional Space. (a) Level 1. (b) Level 2. (c) Level 3.

The recursive construction of the Hilbert curve results in the following three properties.

Property 4.1. A cell at level l − 1 is refined into $2^d$ subcells at level l.

Proof. The cell is a region where the length of each of its dimensions is equal to one. The cell is refined to the next level by splitting each of its dimensions into two halves. This results in $2^d$ subcells in total.

Based on Property 4.1, assuming that 1 ≤ l ≤ l′ ≤ L, a cell at level l is equivalent to a region at level l′, i.e. a group of level-l′ cells. For example, the level-1 coordinate $(0_2, 1_2)$ in Figure 4.4a is equivalent to level-2 coordinates $(00_2, 10_2)$, $(00_2, 11_2)$, $(01_2, 11_2)$, and $(01_2, 10_2)$ in Figure 4.4b.

Property 4.2. Dimension values at approximation level l are l-bit long, where l > 0.

Proof. We prove this property by induction. If l = 1, each dimension consists of two values. Thus, one bit is needed to encode the values, and the proposition is true. Assume that the proposition $P_{l-1}$ is true for 1 < l ≤ L, i.e. l − 1 bits are needed to encode dimension values at level l − 1. We split each dimension value at level l − 1 into two subvalues at level l. Each level-l value is prefixed by the (l − 1)-bit value of its parent, and has one additional bit (0 or 1). Thus, dimension values at level l are l-bit long and the property holds.

In Figure 4.4, the prefix of dimension values is shown in bold. Consider an example where the first-level cell $(0_2, 1_2)$, i.e. the shaded area in Figure 4.4a, is divided into four second-level subcells, i.e. the shaded area in Figure 4.4b.
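The refinement behind Properties 4.1 and 4.2 can be sketched in a few lines. The helper name `refine` is hypothetical; it lists the finer-level cells covered by one coarser cell by appending bits to each dimension value of the parent.

```python
from itertools import product


def refine(cell, level, target_level):
    """Return the level-`target_level` cells covered by a level-`level` cell.

    Each dimension value of the parent gains (target_level - level) extra
    low-order bits, so the parent value remains the prefix of every child.
    """
    if target_level < level:
        raise ValueError("target_level must not be smaller than level")
    shift = target_level - level
    ranges = [range(c << shift, (c + 1) << shift) for c in cell]
    return list(product(*ranges))


# Level-1 cell (0, 1) refined to level 2 on a 2-dimensional space.
children = refine((0, 1), 1, 2)
print([(format(x, "02b"), format(y, "02b")) for x, y in children])
# [('00', '10'), ('00', '11'), ('01', '10'), ('01', '11')]
```

The four children are the level-2 coordinates listed for Property 4.1, and dropping the last bit of any child coordinate gives back the parent value.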
At the second-level cells, the possible values for the x-axis are derived by concatenating the x-value of the parent cell with 0 and 1. A similar process applies to the y-axis.

Property 4.3. Hilbert identifiers at approximation level l are dl-bit long.

Proof. Since there are $2^{dl}$ level-l coordinates to map, dl bits are required to encode all the coordinates. When a cell at level l − 1 is refined into $2^d$ cells at the next level, the subcurve that covers the parent cell is also refined into $2^d$ contiguous subcurves at level l. The resulting identifiers at level l are prefixed by their parent's Hilbert identifier.

In Figure 4.4, the prefix of Hilbert identifiers is underlined. In the example, when the first-level cell $(0_2, 1_2)$ with Hilbert identifier $01_2$ (Figure 4.4a) is refined into four second-level cells, the resulting second-level Hilbert identifiers are $0100_2$, $0101_2$, $0110_2$, and $0111_2$, all of which are prefixed by the parent cell's Hilbert identifier, $01_2$ (Figure 4.4b).

4.3 Design

As illustrated in Figure 4.5, Midas is divided into two main parts, namely the indexing scheme and the query engine. Each d-attribute resource is indexed as a key which is a Hilbert identifier. The key is further mapped onto an R-DHT node. A multi-attribute range query is first transformed into a number of exact queries using Hilbert SFC. These exact queries are further processed by the query engine to minimize the number of R-DHT lookups required.

Figure 4.5: Midas Indexing and Query Processing

4.3.1 Multi-Attribute Indexing

Midas indexing consists of three basic components as shown in Figure 4.6. Firstly, it extracts the type of a resource, i.e. attributes of the resource (Definition 2.1), using two supporting components: resource type specification and attribute-value normalization. The resource type specification defines d attributes that constitute a resource type, e.g.
attribute cpu and attribute memory. Then, the attribute-value normalization converts domain-specific attribute values into numbers, e.g. (cpu = 'P4', memory = '1 GB') is normalized into resource type (cpu = 2, memory = 1). Once the type of a resource is derived, the d-to-one mapping maps the resource type onto a key using Hilbert SFC. Subsequently, the key is mapped onto an R-DHT node.

Figure 4.6: Midas Multi-dimensional Indexing

In the rest of this section, we first describe the d-to-one mapping, followed by the two supporting components.

4.3.1.1 d-to-one Mapping Scheme

Based on its type, each resource is assigned a key which is a Hilbert identifier at the maximum approximation level. All resources are assumed to have the same number of attributes. These attributes can be derived based on a well-defined resource naming scheme such as the GLUE schema [5] for compute resources (see Section 4.3.1.2 for more details).

Definition 4.1. Let d denote the number of dimensions and m denote the bit length of the one-dimensional identifier space. A key in the identifier space is defined as an m-bit Hilbert code at the maximum approximation level L, where L = m/d.

Each resource is modeled as a point in a d-dimensional attribute space. The coordinate is determined by the resource type, which consists of d attributes. Each dimension represents an attribute encoded as an (m/d)-bit value, and thus there are $2^{m/d}$ possible values per dimension (Figure 4.7). The m-bit Hilbert identifier of the coordinate becomes the key assigned to the resource. Because the coordinate of a resource is determined by the resource type, resources of the same type occupy the same coordinate point and are assigned the same key (Definition 2.1). Finally, the key is mapped onto an R-DHT node using our read-only mapping scheme (Sections 2.3.1–2.3.2).
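As a worked instance of Definition 4.1, the relations between d, m and L can be checked against two parameter settings that appear in this chapter: d = 2 with m = 4 (the indexing example that follows) and d = 5 with 32-bit dimensions (the GLUE-based specification in Section 4.3.1.2).

```latex
\begin{align*}
d = 2,\ m = 4 &: \quad L = \frac{m}{d} = 2, \qquad 2^{m/d} = 4 \text{ values per dimension}, \qquad 2^{m} = 16 \text{ keys} \\
d = 5,\ \frac{m}{d} = 32 &: \quad m = 5 \times 32 = 160, \qquad L = 32
\end{align*}
```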
The following example illustrates the process of indexing resources characterized by two attributes, namely cpu and memory, assuming m = 4 bits.

• Assign Key to Resource
As illustrated in Figure 4.8a, resource r with cpu = P4 and memory = 1 GB is modeled as coordinate point (2, 1) in a 2-dimensional attribute space. The coordinate is derived by normalizing² the two attributes into a resource type consisting of two attributes, namely cpu = 2 and memory = 1. Using Hilbert SFC, coordinate (2, 1) is converted to Hilbert identifier 13, which is 4-bit long. Thus, r is assigned key k = 13.

• Map Key onto R-DHT Node Identifier
Assume that r is shared by host h. According to Definition 2.2, key 13 ∈ $T_h$ and the key is associated with node n = 13|h (Figure 4.8b). If key 13 represents a new resource type of host h, then node n joins an R-Chord overlay and occupies segment $S_{13}$ (Figure 2.8 and Theorem 2.7). Otherwise, no new node is created on the overlay (Corollary 2.1).

Figure 4.7: Attributes and Key

Figure 4.8: Example of Midas Indexing (d = 2 Dimensions and m = 4 Bits). (a) Assign Key 13 to Resource r. (b) Map SFC Key 13 onto R-DHT Node 13|h.

² See Section 4.3.1.3 for the details.

4.3.1.2 Resource Type Specification

The resource type specification defines d indexing attributes, e.g. indexed columns in a traditional database, that constitute a resource type out of d′ resource attributes (d ≤ d′). There are several reasons to index resources only by a subset of resource attributes. Firstly, keys are kept stable by excluding attributes that change frequently. Otherwise, a resource must be re-assigned a new key when some of its attributes change. Secondly, we need to ensure that an (m/d)-bit dimension is sufficient to represent all possible values of an attribute. Thirdly, it has been shown that for higher dimensions, the locality property of Hilbert SFC decreases, i.e.
a higher number of clusters per query region [74, 103].

To include as many resource attributes as possible without significantly increasing the dimensionality d, we can combine several resource attributes into one compound attribute [86]. Consider a compound attribute (attr) that consists of i member attributes. A value of this attribute, which corresponds to one of the $2^{m/d}$ dimension values, is denoted as a tuple $\langle member_0, \ldots, member_{i-1} \rangle$. To support range queries, we impose an ordering on attr tuples where $member_j$ must be logically contained by $member_k$ (j > k).

As an example, consider a compound attribute book with three member attributes called chapter, section, and subsection. We define each value of this compound attribute as a tuple ⟨chapter, section, subsection⟩, since each subsection is part of a section, and each section is part of a chapter. Assuming member attributes are encoded as 2-bit values, each book tuple is encoded to a 6-bit dimension value derived by concatenating the three member attributes. Figure 4.9 shows that the first chapter, i.e. all tuples with chapter = $00_2$, is encoded to dimension values prefixed by $00_2$ (i.e. dimension values 0–15). Similarly, the last section of the last chapter, i.e. all tuples with chapter = $11_2$ and section = $11_2$, is encoded to dimension values whose prefix is $1111_2$ (i.e. dimension values 60–63).

Figure 4.9: Dimension Values for Compound Attribute book

Table 4.2 shows an example of a resource type specification based on the GLUE schema [5] (Figure 4.10) to describe resources in a computational grid. We define a resource type using five attributes, out of more than 20 specified in the GLUE schema. Each of the attributes is modeled as a dimension with 32-bit values. Thus, each key (Hilbert identifier) is 160-bit long, which is a typical value used in Chord and several other DHT implementations.
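The encoding of compound attributes by concatenation can be sketched as follows. The helper name `pack` is hypothetical; member values are given most-significant first, so that a member is contained in the range spanned by the member before it.

```python
def pack(members, widths):
    """Concatenate member values (most significant first) into one dimension value."""
    if len(members) != len(widths):
        raise ValueError("one bit width per member attribute is required")
    value = 0
    for member, width in zip(members, widths):
        if not 0 <= member < (1 << width):
            raise ValueError(f"{member} does not fit in {width} bits")
        value = (value << width) | member
    return value


BOOK_WIDTHS = (2, 2, 2)  # chapter, section, subsection, 2 bits each

# All tuples of the first chapter, and of the last section of the last chapter.
first_chapter = [pack((0, s, ss), BOOK_WIDTHS) for s in range(4) for ss in range(4)]
last_section = [pack((3, 3, ss), BOOK_WIDTHS) for ss in range(4)]
print(first_chapter[0], first_chapter[-1])  # 0 15
print(last_section[0], last_section[-1])    # 60 63
```

Because each prefix maps to a contiguous block of dimension values, a query such as "all subsections of a given section" becomes a single range on that dimension. The same helper applies to wider members, e.g. bit widths (12, 10, 10) for a 32-bit dimension.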
The specification includes two compound attributes, namely OS and CPU, which consist of three and four member attributes, respectively.

Figure 4.10: Sample XML Document of GLUE Schema

Table 4.2: Resource Type Specification for Compute Resources based on GLUE Schema

Dimension        | Length (bit) | Description
Machine Count    | 32           | Number of instances of the resource type
CPU Count        | 32           | Number of processors per machine
Memory           | 32           | Size of physical memory (multiplied by 256 MB)
OS               | 32           | Operating system installed on a machine
  Name           | 12           |
  Release        | 10           |
  Version        | 10           |
CPU              | 32           | Processor type of a machine
  Vendor         | 7            |
  Architecture   | 5            |
  Model          | 5            |
  Clock Speed    | 15           | CPU speed (multiplied by 256 MHz)

4.3.1.3 Normalization of Attribute Values

Each domain-specific attribute value is encoded as an (m/d)-bit number (i.e. dimension value). For example, attribute cpu may consist of the following values: P3, P4, or SPARC. Each of these values is encoded as a number in the range of 0 to $2^{m/d} - 1$. To support range queries, the normalization encodes attribute values i and j to dimension values f(i) and f(j) such that f(i) < f(j) if and only if i < j. We outline three approaches for the normalization of domain-specific attribute values.

1. Static Conversion
This approach converts each domain-specific attribute value to a predefined dimension value, e.g. cpu = P4 is converted to dimension value 2. The concept of static conversion has been applied in other fields as well. For example, the Linux operating system allocates a predefined number and name to each device, e.g. the first SCSI device is allocated device number 0 and device name /dev/sda. We can further extend the static conversion by mapping a group of attribute values (e.g. all cpu from a particular vendor) to contiguous dimension values. The administrative authority responsible for the attribute values (e.g. the manufacturer) manages the allocated range. This is analogous to the allocation of IP addresses in networking.

2. Locality-Preserving Hashing
A locality-preserving hash function [19, 28] is applied to each attribute value to obtain the corresponding dimension value. With locality-preserving hashing, similar attribute values are hashed onto dimension values that are also similar. Locality-preserving hashing supports range queries since i < j < k is hashed to dimension values that satisfy the condition hash(i) < hash(j) < hash(k). If we use non-locality-preserving hashing, the condition hash(i) < hash(j) < hash(k) is not guaranteed.

3. Interval Mapping
For numerical attributes, attribute values can simply be divided into intervals and each interval i is directly mapped onto a dimension value. Thus, each dimension value represents an attribute value as a multiple of the interval size.

4.3.2 Query Engine and Optimizations

A multi-attribute range query is transformed into search keys. A naive scheme treats each search key as an exact query (i.e. a DHT lookup). This results in many nodes being visited during query processing. In Midas, we propose to minimize the number of nodes visited by initiating lookups only for search keys that represent available resources.

A multi-attribute range query specifies d search attributes where each of the attributes can be constrained by a range. A range imposes a limit on a search attribute using relational operators such as <, ≤, =, >, or ≥.
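Because a range on attribute values has to become a range on dimension values, the normalization of Section 4.3.1.3 must preserve order. The following is a minimal sketch of static conversion and interval mapping applied to the bounds of a range. The cpu table is hypothetical (P3 → 1 and P4 → 2 follow the examples in this chapter, SPARC → 3 is invented), and the 256 MB interval is one reading of the Memory row in Table 4.2.

```python
# Hypothetical static conversion table for attribute cpu.
CPU_CODES = {"P3": 1, "P4": 2, "SPARC": 3}

# Assumed interval size for attribute memory, in MB.
MEMORY_INTERVAL_MB = 256


def static_conversion(value: str) -> int:
    """Map a domain-specific value to its predefined dimension value."""
    return CPU_CODES[value]


def interval_mapping(value_mb: int, interval_mb: int = MEMORY_INTERVAL_MB) -> int:
    """Map a numerical value to the index of the interval that contains it."""
    if value_mb < 0:
        raise ValueError("attribute value must be non-negative")
    return value_mb // interval_mb


def normalize_range(lo, hi, convert):
    """Normalize both bounds of a range; order must survive the conversion."""
    lo_code, hi_code = convert(lo), convert(hi)
    if lo_code > hi_code:
        raise ValueError("normalization did not preserve the order of the bounds")
    return lo_code, hi_code


print(normalize_range("P3", "P4", static_conversion))  # (1, 2)
print(normalize_range(1024, 2048, interval_mapping))   # (4, 8)
```

Note that interval mapping preserves order only at interval granularity: two values inside the same interval share one dimension value. With the assumed interval, 1 GB ≤ memory ≤ 2 GB becomes the range 4–8; the small examples in this chapter use a coarser encoding with only four values per dimension.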
After normalizing search attributes into dimension values, a query becomes a region in the d-dimensional attribute space (Definition 4.2). For d = 2, a query region resembles a rectangle and the number of search keys is equal to the area of the rectangle. For d = 3 and d > 3, a query region resembles a cube and a hypercube, respectively; the number of search keys is equal to the volume of the cube or the hypercube.

Definition 4.2. A query region (Q) is represented by the two endpoints of its diagonal, namely Q.lo and Q.hi. Endpoint Q.lo refers to the smallest coordinate in the query region, i.e. the coordinate where each dimension takes the smallest value in the range specified for the dimension. Similarly, endpoint Q.hi refers to the largest coordinate in the query region.

Figure 4.11 shows an example of a 2-attribute range query: find compute resources with P3 ≤ cpu ≤ P4 and 1 GB ≤ memory ≤ 2 GB. The two ranges specified by the query are 1–2 and 0–1 for dimension cpu and dimension memory, respectively. The query region Q is illustrated as a shaded rectangle, where Q.lo = $(cpu_{min}, memory_{min})$ = (1, 0) and Q.hi = $(cpu_{max}, memory_{max})$ = (2, 1). Note that Q.lo and Q.hi are not necessarily mapped to the smallest and largest Hilbert identifiers within the query region.

Figure 4.11: Range Query with Search Attributes cpu and memory

Each query is transformed into search keys, i.e. the Hilbert identifiers assigned to all coordinates in the query region. For example, the query region covered by the shaded rectangle is transformed into four search keys grouped in two clusters, namely cluster 1–2 and cluster 13–14. Each search key is considered as an exact query. In naive query processing, a query initiator (i.e. the user) issues one lookup per search key. However, this is not efficient for the following reasons:

1. The naive search includes unnecessary lookups for search keys that do not represent resources.
Figure 4.12 shows two unnecessary lookups, for search keys 13 and 14, out of four lookups. Because search keys 13 and 14 do not correspond to resources, there are no segments $S_{13}$ and $S_{14}$ in the underlying R-Chord overlay. As a result, both unnecessary lookups terminate at a different segment, $S_{15}$.

2. The naive search ignores the clustering property of Hilbert SFC. Since search keys 1 and 2 are clustered, the close proximity between $S_1$ and $S_2$ can be exploited by issuing only lookup(1) and letting $S_1$ forward a request to $S_2$.

Figure 4.12: Naive Search Algorithm

To support efficient query processing, we propose an incremental search strategy that processes only available keys. A key is available if the resource type represented by the key consists of at least one resource instance.

Figure 4.13 shows Midas incremental search. After obtaining q, i.e. an ordered set of search keys (line 2 in Figure 4.13a), the algorithm initiates lookup(k) for the lowest key k ∈ q (line 4). This lookup will end up at node n in segment $S_{k'}$ (line 5), where $S_{k'}$ is the succeeding segment of k. If k′ = k then we add k (or the associated key-value pair) to the result set, otherwise we discard k (lines 3–9 in Figure 4.13b). Prior to continuing the incremental search, we remove any search key k″ that does not correspond to any resources (lines 10–11). A search key is eliminated subject to one of the following conditions:

1. preceding_segment(k) < k″ < k′
We eliminate any unavailable key k″ that precedes k′, i.e. key k″ which is within the left-side range of node n as shown in Figure 4.14. The reason is that in R-DHT, $S_{k'}$ is the succeeding segment of k″ (and k) only if k″ does not exist; otherwise the succeeding segment would be $S_{k''}$. To quickly obtain the preceding segment, R-Chord lookup(k) can be modified to return not only the succeeding segment of k, but also the preceding segment of k.

2.
k′ < k″ < succeeding_segment(k′)
We eliminate any unavailable key k″ that succeeds k′, i.e. k″ is within the right-side range of node n in Figure 4.14. To quickly locate the succeeding segment of $S_{k'}$, each node maintains a pointer to its succeeding segment. This new pointer can be considered as a finger and is put in the finger table. The pointer is maintained through periodic stabilization. With this new pointer, locating the succeeding segment of a node can be done by simply examining the node's finger table instead of issuing lookup(($S_k$ + 1)|0).

(a) Main Algorithm

1. h.mdq_search(Query Q)
2.   q = transform_query_region(Q.lo, Q.hi);
3.   rs = {};
4.   k = get_min(q);
5.   (n, p) = lookup(k);
6.   return n.incr_search(q, rs, p);

(b) Helper Functions

1. n.incr_search(List q, ResultSet rs, PrecedingSegment p)
2.   // Check if h owns a key equal to the search key
3.   k = get_min(q);
4.   T_h = the set of keys in h;
5.   for each y ∈ T_h do
6.     if y == k then
7.       rs = rs ∪ {(k, n)};
8.
9.   q = q − {k};
10.  q = eliminate_keys(q, p, prefix(n));
11.  q = eliminate_keys(q, prefix(n), prefix(succ_seg));
12.
13.  if q == {} then
14.    return rs;
15.
16.  // Search the next lowest key
17.  k = get_min(q);
18.  (n′, p) = lookup(k);
19.  return n′.incr_search(q, rs, p);
20. // Eliminate keys in the range of (low, high)
21. n.eliminate_keys(List q, Key low, Key high)
22.   k = get_min(q);
23.   while k ≠ nil and low < k < high do
24.     q = q − {k};
25.     k = get_min(q);
26.
27.   return q;

Figure 4.13: Midas Incremental Search Algorithm

Figure 4.14: Search-Key Elimination

Figure 4.15 illustrates an example of incremental search for the query illustrated in Figure 4.11. Given the four search keys, 1, 2, 13, and 14, Midas initiates a DHT lookup for the lowest key 1. Since the lookup finds the key at segment $S_1$, Midas adds key 1 to the result set and continues with lookup(2).
As this lookup arrives at $S_2$, Midas adds key 2 to the result set. Furthermore, it eliminates keys 13 and 14 since the succeeding segment of $S_2$ is $S_{15}$. Thus, the final result set consists of two keys: 1 and 2. To return results faster, query processing can be parallelized by partitioning the search keys and performing one incremental search per partition.

Figure 4.15: Example of Range Query Processing

4.4 Performance Evaluation

We evaluate the performance of multi-attribute range queries on DHT through simulations³. We first show the efficiency of Midas compared to the naive query processing scheme, and the impact of the underlying overlay (R-DHT and conventional DHT) on the performance of Midas. Next, we study the impact of data-item distribution on the performance of multi-attribute range queries. We compare R-Chord-based Midas and Chord-based Midas and measure query cost, query resiliency to node failures, and query performance under churn. The implementation of Midas on Chord and R-Chord differs in the search-key elimination algorithm:

1. Consider a lookup(k) request that ends up at node n (i.e. the successor of k). Node n eliminates only search keys k″ where n.predecessor < k″ < n; this is similar to the first condition described in Section 4.3.2. However, n does not eliminate the search key if n < k″ < n.successor, which is equivalent to the second condition in R-Chord, because Chord maps k″ to n.successor. Since n.successor is the node responsible for k″, it is the one responsible for eliminating the key.

2. When eliminating key k″, node n also checks if key k″ is actually available and stored on it. If k″ is available, it is added to the result set.

Unless stated otherwise, our experiments use the following parameters:

• d, the number of attributes per resource, is varied from 3 to 5.
Each dimension is 6-bit long (m/d), and thus is capable of holding $2^6 = 64$ dimension values.

• m, the number of bits for keys and host identifiers, is 6d bits.

• K, the number of unique keys (i.e. resource types), is varied from 5,000 to 150,000. This means that there are K points in the d-dimensional attribute space. We generate these points using the normal distribution.

• N, the number of hosts, is varied from 25,000 to 50,000. Each host shares T = 8 unique resource types on average (i.e. $|T_h| \sim U[4, 12]$ unique resource types), and each resource type may be offered for sharing by more than one host.

• The size of a range query, i.e. the number of search keys, is $a^d$ where a (≤ $2^{m/d}$) is a length parameter. Queries are classified based on their shape:
  – Type 1, i.e. $(a)^d$, are query regions where the length of each dimension is a.
  – Type 2, i.e. $(0.5a)(2a)(a)^{d-2}$, are query regions where the lengths of the first dimension and the second dimension are 0.5a and 2a, respectively, while the length of each remaining dimension is a.

³ Our simulator uses Lawder's table-driven algorithm to perform the Hilbert mapping [84]. The algorithm is applicable to an arbitrary number of dimensions, in contrast to several earlier algorithms [26, 36, 92] which are limited to 2-dimensional space. Recently, Jin et al. [75] proposed a table-driven framework capable of constructing different types of SFC.

4.4.1 Efficiency

To study the improvement of Midas over the naive scheme, we compare the average number of nodes visited per query, i.e. responsible nodes, which store available keys, and intermediate nodes. The naive scheme initiates one lookup per search key, whereas Midas initiates lookups only for available keys. Both schemes use R-Chord as the underlying overlay in a system comprising 25,000 hosts (N). For each value of length parameter a, we simulate 1,000 queries consisting of 500 type-1 queries and 500 type-2 queries.
Each query consists of $Q_{skey}$ (= $a^d$) search keys and $Q_{akey}$ (≤ $Q_{skey}$) available keys. Table 4.3 presents the query profile and the simulation results.

Table 4.3: Performance of Query Processing in Naive Scheme vs Midas

(a) Query Profile

K      | a  | d=3 Qskey | d=3 Qakey | d=4 Qskey | d=4 Qakey | d=5 Qskey | d=5 Qakey
5,000  | 4  | 64        | 7         | 256       | 1         | 1,024     | 0.1
5,000  | 8  | 512       | 56        | 4,096     | 13        | 32,768    | 3
5,000  | 16 | 4,096     | 442       | 65,536    | 206       | 1,048,576 | 97
50,000 | 4  | 64        | 46        | 256       | 8         | 1,024     | 1
50,000 | 8  | 512       | 375       | 4,096     | 125       | 32,768    | 29
50,000 | 16 | 4,096     | 3,058     | 65,536    | 2,068     | 1,048,576 | 379

(b) Average Number of Nodes Visited per Query

K      | a  | d=3 Naive | d=3 Midas | d=4 Naive | d=4 Midas | d=5 Naive | d=5 Midas
5,000  | 4  | 337       | 27        | 1,391     | 21        | 5,654     | 18
5,000  | 8  | 2,697     | 115       | 22,381    | 85        | 182,151   | 73
5,000  | 16 | 21,672    | 627       | 360,068   | 533       | 5,870,126 | 499
50,000 | 4  | 403       | 75        | 1,828     | 53        | 7,365     | 39
50,000 | 8  | 3,243     | 474       | 29,440    | 344       | 238,941   | 245
50,000 | 16 | 25,922    | 3,386     | 471,964   | 3,267     | 6,431,520 | 2,670

The result in Table 4.3b shows that Midas is more efficient than the naive scheme in processing multi-attribute range queries, because Midas initiates lookups only for available keys. Our result reveals that the number of nodes visited in Midas is at least five times smaller than in the naive scheme (the smallest ratio occurs at d = 3, K = 50,000, and a = 4).

In the naive scheme, the cost is determined by the size of the query region, i.e. $\Omega(Q_{skey})$. Because one R-Chord lookup is initiated per search key and each lookup visits $O(\min(\log K, \log N))$ nodes according to Theorem 2.2, the number of nodes visited for each query is $O(Q_{skey} \min(\log K, \log N))$. On the other hand, the cost of query processing in Midas is determined by the number of available keys ($Q_{akey}$), which is less than $Q_{skey}$. Because Midas looks up only the available keys, the number of nodes visited is at least $Q_{akey}$. In addition, due to incremental search, the cost of the initial lookup (for the smallest search key) becomes insignificant as $Q_{akey}$ increases. Thus, the number of nodes visited per query is $\Omega(Q_{akey})$.
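To illustrate why the cost moves from the number of search keys to the number of available keys, the sketch below replays the example of Section 4.3.2 (search keys 1, 2, 13 and 14; segments $S_1$, $S_2$ and $S_{15}$). It is a simplified model under stated assumptions: the overlay is represented as a sorted list of available keys, `lookup` is a hypothetical stand-in for an R-Chord lookup that returns both the succeeding and the preceding segment, and only the number of lookups is counted, not the intermediate hops of each lookup.

```python
from bisect import bisect_left


def lookup(k, segments):
    """Hypothetical stand-in for an R-Chord lookup.

    Returns (succeeding segment of k, preceding segment of k) on the ring,
    where `segments` is the sorted list of available keys.
    """
    i = bisect_left(segments, k)
    succ = segments[i % len(segments)]
    pred = segments[i - 1]  # i == 0 wraps around to the last segment
    return succ, pred


def in_open_ring_interval(x, lo, hi):
    """True if x lies strictly between lo and hi when walking clockwise."""
    if lo < hi:
        return lo < x < hi
    return x > lo or x < hi


def naive_search(search_keys, segments):
    """Naive scheme: one lookup per search key."""
    found = [k for k in sorted(search_keys) if lookup(k, segments)[0] == k]
    return found, len(search_keys)


def incremental_search(search_keys, segments):
    """Look up the lowest remaining key, then drop keys known to be unavailable."""
    q = sorted(search_keys)
    found, lookups = [], 0
    while q:
        k = q.pop(0)
        succ, pred = lookup(k, segments)
        lookups += 1
        if succ == k:
            found.append(k)
        next_seg = segments[(segments.index(succ) + 1) % len(segments)]
        q = [x for x in q
             if not in_open_ring_interval(x, pred, succ)        # condition 1
             and not in_open_ring_interval(x, succ, next_seg)]  # condition 2
    return found, lookups


segments = [1, 2, 15]          # available keys, i.e. segments on the overlay
search_keys = [1, 2, 13, 14]   # search keys of the example query
print(naive_search(search_keys, segments))        # ([1, 2], 4)
print(incremental_search(search_keys, segments))  # ([1, 2], 2)
```

In this toy setting both schemes return keys 1 and 2, but the incremental search issues two lookups instead of four because keys 13 and 14 are eliminated once the succeeding segment of $S_2$ is known to be $S_{15}$.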
4.4.2 Cost of Query Processing

The impact of data-item distribution on Midas query processing cost is evaluated using both Chord and R-Chord. The query processing cost is measured using the average number of nodes visited per query. We simulate 10,000 type-1 queries and 10,000 type-2 queries on systems with 25,000 to 50,000 hosts. The size of each query is kept constant at $Q_{skey} = 16^d$. Table 4.4 shows the query profile in terms of the number of available keys per query ($Q_{akey}$), and the simulation results.

Table 4.4b shows that the query cost in R-Chord-based Midas is affected by K, whereas the cost in Chord-based Midas is affected by N. Because each R-Chord node is responsible for its own key only, the number of nodes visited is $\Omega(Q_{akey})$. As $Q_{akey}$ increases when K is increased, so does the number of nodes visited. In Chord, the number of nodes visited is $\Omega(Q_{cnode})$, where $Q_{cnode}$ denotes the number of Chord nodes responsible for available keys. Due to data-item distribution, each Chord node is responsible for one or more keys, and thus $Q_{cnode} \le Q_{akey}$ (Figure 4.16). As N is increased, the value of $Q_{cnode}$ increases because available keys are distributed to a higher number of Chord nodes. This is shown in Table 4.5, which compares $Q_{cnode}$ in Chord rings consisting of 25,000 nodes (N25) and 50,000 nodes (N50). A similar observation regarding the performance of range queries on conventional DHT has also been made by Cristina et al. [128].

Table 4.4: Query Cost of Midas

(a) Average Number of Available Keys per Query ($Q_{skey} = 16^d$ Search Keys)

K            d=3     d=4     d=5
5,000        445     211      95
7,500        668     325     150
10,000       879     432     199
15,000     1,271     643     299
25,000     1,938   1,042     475
50,000     3,057   2,066     953
75,000     3,416   2,897   1,359
100,000    3,388   3,572   1,669
125,000    3,206   4,070   1,929
150,000    2,992   4,510   2,137

(b) Average Number of Nodes Visited

N = 25,000
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000            650         629         418         537         358         499
7,500            647         885         423         741         358         682
10,000           648       1,120         420         928         360         836
15,000           641       1,544         420       1,274         358       1,120
25,000           649       2,242         416       1,887         363       1,589
50,000           638       3,387         412       3,312         357       2,675
75,000           667       3,741         414       4,385         364       3,513
100,000          655       3,708         418       5,225         358       4,137
125,000          644       3,526         424       5,839         357       4,631
150,000          654       3,313         421       6,352         357       5,003

N = 50,000
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000          1,133         627         662         534         524         498
7,500          1,128         889         661         740         520         681
10,000         1,124       1,123         661         927         521         833
15,000         1,127       1,545         660       1,270         519       1,116
25,000         1,138       2,238         661       1,882         524       1,600
50,000         1,129       3,447         662       3,360         522       2,733
75,000         1,154       3,993         663       4,675         525       3,720
100,000        1,148       4,194         664       5,833         520       4,624
125,000        1,118       4,195         665       6,895         520       5,393
150,000        1,134       4,121         665       7,783         519       6,079

Figure 4.16: Four Chord Nodes are Responsible for Twelve Search Keys

Table 4.5: $Q_{cnode}$

K          d=3 N25   d=3 N50   d=4 N25   d=4 N50   d=5 N25   d=5 N50
5,000          222       296        92       118        47        55
7,500          271       383       114       149        58        71
10,000         310       453       129       173        67        83
15,000         351       553       147       205        79       101
25,000         407       677       168       247        95       125
50,000         457       817       195       302       112       157
75,000         499       895       207       331       126       176
100,000        493       911       216       347       128       187
125,000        478       889       226       362       131       195
150,000        483       902       225       369       135       200

Though Table 4.4b shows that the query cost in R-Chord is higher than in Chord as K is increased, this does not contradict our earlier analysis of R-Chord lookup performance (Theorem 2.2), which states that the path length of each R-Chord lookup is at most equal to that of Chord. Instead, the higher cost of query processing in R-Chord is caused by a higher number of lookup operations (Table 4.6).
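The relation $Q_{cnode} \le Q_{akey}$, and the growth of $Q_{cnode}$ with N, can be illustrated with a small sketch. It assumes a toy identifier ring with integer identifiers; the helper names `successor` and `responsible_nodes` and all identifiers below are hypothetical and are not taken from the simulator used in this chapter.

```python
from bisect import bisect_left


def successor(ring, key):
    """First node identifier >= key on a sorted ring, wrapping to the smallest."""
    i = bisect_left(ring, key)
    return ring[i % len(ring)]


def responsible_nodes(node_ids, available_keys):
    """Set of Chord nodes that store at least one of the available keys."""
    ring = sorted(node_ids)
    return {successor(ring, k) for k in available_keys}


# Twelve available keys, as in the situation depicted by Figure 4.16.
available_keys = list(range(10, 22))

# Sparse ring: several keys fall onto the same successor node.
sparse_ring = [12, 15, 18, 25, 40]
q_cnode_sparse = len(responsible_nodes(sparse_ring, available_keys))

# Denser ring (larger N): the same keys spread over more nodes.
dense_ring = [11, 12, 14, 15, 17, 18, 20, 25, 40]
q_cnode_dense = len(responsible_nodes(dense_ring, available_keys))

# In R-Chord every available key is held by its own node, so the count is Q_akey.
q_akey = len(available_keys)
print(q_cnode_sparse, q_cnode_dense, q_akey)
```

In this toy setting four Chord nodes cover the twelve keys on the sparse ring and eight on the denser one, while a read-only overlay always has to reach twelve nodes.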
As stated earlier, one lookup is required to locate each responsible node, and the number of responsible nodes in R-Chord (i.e. $Q_{akey}$) is higher than in Chord (i.e. $Q_{cnode}$). However, the number of intermediate hops per R-Chord lookup⁴ is lower than in Chord for various d and K (Table 4.7). Overall, the results in Tables 4.6 and 4.7 further emphasize that the cost of query processing in R-Chord is higher because of the absence of data-item distribution, not because of the cost of each (primitive) lookup operation.

⁴ In both Chord and R-Chord, the number of intermediate hops per lookup decreases as K is increased, which is explained as follows. The query cost in Chord and R-Chord is $\Omega(Q_{cnode})$ and $\Omega(Q_{akey})$, respectively. Both $Q_{cnode}$ and $Q_{akey}$, which also denote the number of nodes responsible for query results, increase as K is increased. Given a constant size of overlay network, a larger number of responsible nodes reduces the distance between responsible nodes.

Table 4.6: Average Number of Lookups per Query (based on Table 4.4b)

N = 25,000
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000            505         479         266         304         191         231
7,500            357         704         209         441         153         326
10,000           367         918         193         567         147         407
15,000           362       1,300         202         807         149         561
25,000           503       1,965         264       1,253         193         826
50,000           363       2,958         194       2,256         147       1,424
75,000           390       3,158         200       2,922         146       1,840
100,000          509       3,396         265       3,926         187       2,418
125,000          355       2,768         204       3,778         147       2,325
150,000          347       2,516         196       4,040         148       2,497

N = 50,000
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000            918         477         446         303         296         230
7,500            913         705         444         440         292         326
10,000           909         917         444         566         293         406
15,000           911       1,306         443         806         291         561
25,000           921       1,962         444       1,249         295         832
50,000           914       3,122         445       2,387         292       1,515
75,000           936       3,648         446       3,444         293       2,134
100,000          931       3,848         445       4,412         289       2,721
125,000          904       3,857         447       5,307         290       3,232
150,000          919       3,791         447       6,081         290       3,696

The result in Table 4.7 also indicates that the clustering property of Hilbert SFC becomes poorer as dimensionality d is increased.
On higher dimensions, locality preservation decreases: two resources that are semantically similar are assigned keys that are farther apart. These keys are in turn mapped onto responsible nodes that are farther apart, which increases the number of intermediate nodes per lookup. As a result, the path length of each lookup becomes longer and the overall query cost increases.

Table 4.7: Average Number of Intermediate Nodes per Lookup (based on Table 4.4b)

N = 25,000
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000           0.85        0.38        1.23        1.07        1.62        1.75
7,500           0.75        0.31        1.14        0.95        1.57        1.63
10,000          0.67        0.26        1.09        0.88        1.53        1.56
15,000          0.58        0.21        1.02        0.78        1.47        1.46
25,000          0.48        0.15        0.94        0.67        1.39        1.35
50,000          0.37        0.11        0.83        0.53        1.30        1.16
75,000          0.32        0.09        0.79        0.46        1.24        1.07
100,000         0.32        0.09        0.76        0.42        1.23        1.02
125,000         0.33        0.10        0.73        0.40        1.21        0.98
150,000         0.35        0.11        0.73        0.38        1.18        0.95

N = 50,000
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000           0.91        0.38        1.22        1.07        1.59        1.75
7,500           0.82        0.31        1.15        0.94        1.54        1.63
10,000          0.74        0.26        1.10        0.87        1.49        1.56
15,000          0.63        0.21        1.03        0.78        1.44        1.46
25,000          0.50        0.15        0.93        0.68        1.35        1.34
50,000          0.34        0.11        0.81        0.53        1.25        1.16
75,000          0.28        0.10        0.74        0.45        1.19        1.06
100,000         0.25        0.09        0.71        0.41        1.15        1.00
125,000         0.25        0.09        0.68        0.37        1.12        0.95
150,000         0.25        0.09        0.66        0.35        1.10        0.91

Our evaluation raises two implications to consider in reducing the cost of R-DHT query processing:

1. Combine Query Processing and Resource Accesses
An R-DHT implementation supports this feature by providing an API that combines a lookup request and a resource-access request into a single request. The query cost presented in Table 4.4b excludes the additional hops needed by Chord to access resources once resource indexes have been found through query processing. As illustrated in Figure 4.17a, Chord maps key k belonging to node n onto another node n′.
To access resource r that is assigned key k, a user first locates k stored at n′ (step 1), before accessing r at node n (step 2). The total number of hops to access all resources that match a query is $\Omega(Q_{akey})$. Therefore, the total cost in Chord-based Midas becomes the sum of the query cost in Table 4.4b and $Q_{akey}$ (Table 4.4a). For example, when N = 25,000 hosts, K = 5,000 unique keys, and d = 4 dimensions, the total cost is 537 + 211 = 748 hops. R-Chord, on the other hand, allows a user to access resources during query processing because resource r and its key k are located at the same node (Figure 4.17b).

Figure 4.17: Locating Key and Accessing Resource in R-Chord and Chord. (a) Resource r and key k are separated (Chord). (b) Resource r and key k are located at the same node (R-Chord).

2. Distributing Data Items to Trusted Nodes
We have identified that data-item distribution reduces the cost of query processing. Thus, introducing selective data-item distribution into R-DHT is a logical choice in optimizing its range query processing. R-DHT facilitates selective data-item distribution by grouping publicly-writable nodes in a reserved segment (Appendix B). The reserved segment is essentially another Chord ring embedded in a larger R-Chord ring. Query processing will search only among these trusted nodes. Assuming that the number of nodes in the reserved segment is denoted as $Q_{cnode}$, the cost to locate available keys in the reserved segment is $\Omega(Q_{cnode})$. Our experiment on Chord-based Midas is an extreme case of selective data-item distribution, where every node in the system is a trusted, writable node.

4.4.3 Resiliency to Node Failures

Query processing is resilient if it is able to locate available keys in the presence of node failures.
To evaluate query resiliency, we simulate range queries when a percentage (F) of 25,000 hosts and 50,000 hosts fail simultaneously, and we measure the percentage of available keys that are successfully retrieved. Table 4.8 shows that range-query resiliency is higher in R-Chord, where nearly all available keys are retrieved. The result of this experiment is consistent with our findings on R-DHT lookup resiliency (Section 2.5.2). Because each R-Chord node is responsible only for its own keys, when a node fails, only its own keys are affected. By design (i.e. routing by segments and finger flexibility through backup fingers), R-Chord can locate a key as long as at least one node sharing the key is still alive. On the other hand, Chord stores a key belonging to one node on another responsible node. When the responsible node fails, Chord fails to locate the keys (i.e. resource indexes) stored on the responsible node, even if the originating node of the keys (i.e. the node where the actual resources are located) is still alive.

Table 4.8: Percentage of Keys Retrieved under Simultaneous Node Failures

(a) N = 25,000 Hosts

F = 25%
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000             74         >99          64         >99          75         >99
7,500             68         >99          75         >99          74         >99
10,000            72         >99          68         >99          67         >99
15,000            70         >99          73         >99          72         >99
25,000            70         >99          65         >99          74         >99
50,000            76         >99          73          99          76          99
75,000            80          99          77          97          65          97
100,000           79          98          77          97          84          96
125,000           79          97          81          96          83          95
150,000           77          97          77          95          85          95

F = 50%
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000             28          96          30          95          33          95
7,500             31          97          33          96          33          95
10,000            28          97          31          96          35          96
15,000            22          97          27          96          26          95
25,000            34          98          33          97          36          96
50,000            39          97          35          94          43          93
75,000            36          93          39          89          32          88
100,000           39          89          43          84          49          82
125,000           47          86          52          81          42          78
150,000           41          83          51          74          28          75

(b) N = 50,000 Hosts

F = 25%
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000             69         >99          72         >99          72         >99
7,500             73         >99          70         >99          66         >99
10,000            68         >99          65         >99          72         >99
15,000            71         >99          70         >99          71         >99
25,000            71         >99          70         >99          76         >99
50,000            72         >99          72         >99          74         >99
75,000            70         >99          69         >99          74          99
100,000           72         >99          77          99          74          99
125,000           73          99          75          98          72          98
150,000           77          99          74          98          75          97

F = 50%
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000             24          97          31          95          31          95
7,500             26          97          30          95          21          97
10,000            25          97          25          95          31          95
15,000            30          97          29          96          37          95
25,000            25          98          30          96          36          95
50,000            27          99          29          96          32          96
75,000            30          99          32          96          41          95
100,000           32          98          34          94          36          94
125,000           29          96          37          91          37          91
150,000           36          94          32          89          47          88

4.4.4 Query Performance under Churn

To evaluate the performance of range query processing under churn, we compare Midas on R-Chord and Chord overlays that change dynamically. The procedure is similar to the churn experiment in Section 2.5.4. We begin the experiments by warming up 25,000 hosts. In the next one-hour period, we simulate churn events (i.e. arrivals, failures, and departures) produced by 25,000 hosts. Thus, there will be N ∼ 25,000 alive hosts at any time within this duration. During this one-hour period, we also simulate a number of range query events, where the ratio of arrive:fail:leave:query is set at 2:1:1:1. Assuming that these events follow a Poisson distribution, we derive two rates to represent the churn rate, $\lambda_B = 5$ events/second and $\lambda_G = 17$ events/second, based on the measurements of peer lifetime by Bhagwan et al. [25] and Gummadi et al. [63], respectively (refer to Section 2.5.4 for details of the derivation). Each node in the overlay invokes the finger correction every 60 seconds on average.
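The event workload described above can be sketched as a Poisson process whose events are labelled according to the 2:1:1:1 ratio. This is only an illustration under stated assumptions: the function name `generate_events` is hypothetical, and the sketch assumes that the given rate applies to the combined stream of arrive, fail, leave, and query events, which the text does not state explicitly.

```python
import random


def generate_events(rate, duration, seed=0):
    """Return a time-ordered list of (timestamp, kind) pairs.

    Inter-event gaps are exponentially distributed with the given rate
    (events/second), i.e. event times form a Poisson process. Each event
    kind is drawn with weights 2:1:1:1 for arrive:fail:leave:query.
    """
    rng = random.Random(seed)
    kinds = ["arrive", "fail", "leave", "query"]
    weights = [2, 1, 1, 1]
    events = []
    t = rng.expovariate(rate)
    while t < duration:
        events.append((t, rng.choices(kinds, weights=weights)[0]))
        t += rng.expovariate(rate)
    return events


# One simulated hour at the moderate rate (5 events/second).
moderate = generate_events(rate=5.0, duration=3600.0)
share_arrive = sum(1 for _, kind in moderate if kind == "arrive") / len(moderate)
print(len(moderate), round(share_arrive, 2))
```

With these weights arrivals make up about two fifths of the events, so arrivals balance the combined failures and departures and the number of alive hosts stays close to its initial value.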
Table 4.9 presents the percentage of available keys that are successfully retrieved. With the moderate churn rate (Table 4.9a), our result shows that R-Chord performs reasonably well compared with Chord. Though the R-Chord overlay is eight times larger than the Chord overlay, the number of available keys retrieved in R-Chord is, at most, 10% lower than in Chord. With the high churn rate (Table 4.9b), the number of keys retrieved in R-Chord is up to 25% lower than in Chord. This result again shows that R-DHT lookup performance under churn is influenced by finger flexibility. When K is increased, finger flexibility is reduced, i.e. each segment in the overlay consists of a small number of nodes (see Theorem 2.6). As there are not enough stable nodes within each segment, the effectiveness of backup fingers is reduced under churn because nodes have a higher number of incorrect fingers (i.e. fingers which point to incorrect segments). As a consequence, the resiliency of R-DHT, which stems from its segment-based overlay, is also reduced. However, in terms of the number of keys retrieved, R-Chord can still retrieve at least half of the available keys.

Table 4.9: Percentage of Keys Retrieved under Churn (N ∼ 25,000 Hosts)

(a) $\lambda_B = 5$ Events/Second
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000             98          98          98          98          98          98
7,500             98          98          96          98          98          97
10,000            96          97          97          97          96          97
15,000            95          96          95          98          96          95
25,000            94          95          95          93          95          93
50,000            91          91          90          88          91          89
75,000            95          87          92          86          93          81
100,000           95          86          96          81          90          87
125,000           90          88          98          87          92          80
150,000           94          87          98          77          93          83

(b) $\lambda_G = 17$ Events/Second
K          d=3 Chord d=3 R-Chord   d=4 Chord d=4 R-Chord   d=5 Chord d=5 R-Chord
5,000             95          94          93          92          93          90
7,500             92          91          87          88          88          89
10,000            90          88          89          87          89          87
15,000            87          86          84          84          87          83
25,000            84          80          82          77          85          75
50,000            81          67          73          62          75          61
75,000            79          63          80          60          82          60
100,000           74          63          78          54          82          58
125,000           78          59          78          52          84          56
150,000           81          59          82          51          79          55
In summary, the performance of Midas under churn is mainly influenced by the effectiveness of backup fingers. With a higher finger flexibility, i.e. a higher number of nodes per segment, backup fingers are effective in increasing the resiliency of routing by segments. However, when each segment has a small number of nodes, our results imply that it is important for each node to maintain fingers pointing to stable nodes. One approach to address this is to use a hierarchical R-DHT where only stable nodes can become supernodes.

4.5 Summary

We have presented Midas, a scheme to support multi-attribute range queries on R-DHT. Using Hilbert SFC, Midas assigns to each d-attribute resource a one-dimensional key. In processing a range query, Midas performs search-key elimination to avoid issuing unnecessary lookups, and incremental search to exploit the clustering property of Hilbert SFC. Due to its read-only property, R-DHT does not need to send additional requests to access resources, as a resource and its key are co-located. Performance evaluation of Midas is conducted through simulations, and the main results are as follows.

Efficiency of Midas
Compared to a naive scheme of query processing, Midas significantly reduces the number of nodes visited in processing a multi-attribute range query. To process a query consisting of $Q_{skey}$ search keys and $Q_{akey}$ ($\le Q_{skey}$) answers, Midas visits $\Omega(Q_{akey})$ nodes whereas the naive scheme visits $\Omega(Q_{skey})$ nodes. We further validate the efficiency of Midas through simulations. Our experiments reveal that Midas is at least five times more efficient than the naive scheme.

Cost of Query Processing
We study the implication of data-item distribution on the cost of query processing. For the same size of queries, the cost of query processing in R-DHT is determined by the number of resource types (K). In
contrast, in conventional DHT, the cost is determined by the number of nodes (N). This indicates that (i) R-DHT is more suitable in applications where query selectivity is much larger than the number of query answers, and (ii) relaxing node autonomy through selective data-item distribution can improve the performance of multi-attribute range queries in R-DHT.

Resiliency to Node Failures
We show that using R-DHT as the underlying overlay increases the resiliency of Midas without a need to replicate data items. Our simulation result shows that nearly all available keys are located even when 50% of nodes fail simultaneously.

Query Performance under Churn
In R-DHT, effective backup fingers are crucial in increasing the performance of query processing under churn. When there is less finger flexibility to exploit, we highlight hierarchical R-DHT as a possible solution where only stable nodes can be promoted into supernodes.

Chapter 5 Conclusion

We conclude this thesis by summarizing our main contributions and highlighting several directions for future research to address the limitations of our proposed scheme.

5.1 Summary

We have proposed a DHT-based system that does not distribute data items across an overlay network. For a large distributed system consisting of many administrative domains, our proposed system addresses the issues of data-item ownership and conflicting self-interest among different administrative domains. Figure 5.1 summarizes our proposed scheme, which consists of two main parts: R-DHT and Midas. R-DHT (Read-only DHT) is a new DHT abstraction that does not distribute data items across an overlay network. Two variants of R-DHT have been proposed, namely flat R-DHT and hierarchical R-DHT. In addition, we highlighted a hybrid scheme which allows selective data-item distribution in an R-DHT overlay network. Midas (Multi-dimensional range queries) supports multi-attribute
range queries on R-DHT by exploiting a d-to-one mapping scheme. We have presented Midas's indexing of multi-attribute resources and Midas's query engine. We have also demonstrated that, in addition to R-DHT, Midas is applicable to support multi-attribute range queries on conventional DHT as well.

Figure 5.1: Multi-attribute Queries on R-DHT

In the following, we summarize our main contributions.

Effective and Efficient R-DHT Lookup
A significant result in this thesis is the demonstration that, even without distributing data items, the performance of R-DHT lookups is better than that of conventional DHT in two areas:

1. Lookup Path Length
Although the size of the R-DHT overlay is larger than that of conventional DHT, an R-DHT lookup is as efficient as a DHT lookup due to the proposed optimizations, namely routing by segments and shared finger tables. Using Chord as the underlying overlay graph, we have shown that the lookup path length in Chord-based R-DHT is $O(\min(\log K, \log N))$ hops (i.e. better than Chord, which is $O(\log N)$ hops), where N denotes the number of hosts and K denotes the number of unique keys (i.e. unique resource types). Our simulation results further confirm our theoretical analysis of R-DHT lookup path length.

2. Lookup Resiliency to Node Failures
We have shown that R-DHT does not need to rely on active replication to achieve high resiliency. Firstly, the failure of a node does not affect data items belonging to other nodes. Secondly, the result of a lookup operation can be found in a segment which consists of multiple nodes. To achieve high lookup resiliency, we propose to exploit segment-based overlays through the backup fingers scheme. Through simulation experiments, we have demonstrated that both lookup operations and the Midas query engine on R-DHT achieve a higher result guarantee: nearly all available keys are found even when half of the nodes in an overlay network fail simultaneously.
Collision Detection and Resolution in Hierarchical R-DHT
To address the higher maintenance overhead in a flat overlay network, we proposed a hierarchical R-DHT, which collapses nodes within a segment into a second-level overlay network. We also addressed the problem of group collisions in hierarchical R-DHT. Collisions increase the size of the top-level overlay and in turn increase the maintenance cost of the top-level overlay. To detect collisions, we proposed a scheme whereby collision detection is performed together with stabilization to avoid introducing additional messages. To resolve collisions, we proposed a supernode-initiated and a node-initiated merging scheme. Simulation analysis shows that our collision detection and resolution scheme is more effective when stabilization is performed more frequently. With our scheme, the number of collisions is reduced by at least 80%. In addition, the size of the top-level overlay remains close to the ideal size; otherwise it can be two to five times larger.

Support for Multi-Attribute Range Queries on R-DHT
We have proposed Midas, an approach to support multi-attribute range queries on R-DHT based on a d-to-one mapping scheme. The selection of d-to-one as the basis of our solution, rather than a distributed inverted index or d-to-d, is due to two main reasons: R-DHT does not distribute data items, and there is more research interest, within the P2P community, in one-dimensional DHT compared to d-dimensional DHT. We described the multi-dimensional indexing scheme in Midas, which assigns to each multi-attribute resource a key derived using Hilbert SFC. We also proposed two optimizations for the query engine in Midas, namely incremental search and search-key elimination. Through simulation evaluations we have shown that Midas significantly reduces the number of nodes visited in processing a query, compared to the naive R-DHT query processing.
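The d-to-one mapping can be illustrated for the two-attribute case with the classic iterative Hilbert-curve conversion. This is a minimal sketch for d = 2 on an n × n grid (n a power of two), not the d-dimensional indexing code used by Midas; the function names `hilbert_key` and `search_keys` are hypothetical.

```python
def hilbert_key(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its position
    along a Hilbert curve, i.e. to a one-dimensional key in [0, n*n)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def search_keys(n, x_range, y_range):
    """One-dimensional search keys covering a rectangular two-attribute range query."""
    return sorted(hilbert_key(n, x, y) for x in x_range for y in y_range)


# A 2 x 2 query region in a 4 x 4 attribute space maps to four consecutive keys.
print(search_keys(4, range(0, 2), range(0, 2)))
```

Because neighbouring cells tend to receive nearby keys, a compact query region translates into a few runs of consecutive search keys, which is the clustering property that incremental search relies on.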
Impact of Data-Item Distribution on Multi-Attribute Range Queries
We have studied the implication of data-item distribution on the performance of query processing. We compared the cost of query processing in two Midas implementations: one on R-DHT and another on conventional DHT. Simulation evaluations reveal two main observations:

1. Data-item distribution reduces the number of nodes visited in processing a query. The number of lookups per query is determined by the number of nodes that are responsible for query results. In conventional DHT, each node is a bucket with a number of unique keys, as opposed to R-DHT where each node is a bucket of one unique key. Thus, the number of responsible nodes in conventional DHT is lower than in R-DHT. This reduces the query cost in conventional DHT.

2. The higher cost of query processing in R-DHT is due to a higher number of R-DHT lookups required, not the cost of each individual R-DHT lookup.

Based on these observations, we have highlighted possible optimizations to reduce the cost of query processing in R-DHT, such as combining query processing with resource accesses and selective data-item distribution.

Performance of R-DHT under Churn
We have studied the performance of query processing in R-DHT under churn. Churn introduces inconsistencies to the routing states maintained by nodes. From our simulation evaluations, we have observed that R-DHT achieves a higher result guarantee when there is a higher number of nodes sharing resources of the same type. With a higher number of nodes within a segment, the exploitation of finger flexibility mitigates the impact of inconsistent routing states on the number of keys successfully retrieved. As such, the difference between R-DHT and conventional DHT is negligible. However, when each resource type is provisioned only by a small number of nodes, the result guarantee in R-DHT is at most 20% worse than in conventional DHT.
Thus, R-DHT is suitable for a system whereby many administrative domains share resources of the same type, but administrative domains are willing to store only their own data items.

5.2 Future Works

We highlight several research directions for future work.

Selective Data-Item Distribution
An administrative domain that is willing to store data items belonging to other administrative domains joins R-DHT as one node only and occupies a reserved segment (see Appendix B for details). The benefits of selective data-item distribution include:

1. Selective data-item distribution improves the performance of multi-attribute range queries because data items are aggregated on a trusted node. This reduces the number of lookups needed to retrieve all query results.

2. Selective data-item distribution facilitates replication of data items to a set of trusted nodes. This improves the availability of resources as every resource is duplicated in different administrative domains. When the master copy of a resource is unavailable, R-DHT can locate the backup copies of the resource.

3. Selective data-item distribution can address the load imbalance problem where all lookups for a frequently-requested data item are routed to its originating domain.

Further studies are needed to effectively exploit selective data-item distribution in R-DHT and to evaluate its impact on R-DHT performance. Another interesting area to explore is how to selectively distribute data items without a reserved segment. The benefits of not requiring a reserved segment include:

1. In our current scheme, an R-DHT implementation has to provide two lookup interfaces: one to locate data items among read-only nodes, and another to locate data items among read-write nodes (in the reserved segment). Without a reserved segment, only one common lookup interface is required.

2.
In our current scheme, when query results must be retrieved from both read-write and read-only nodes, each query is processed twice: once to retrieve keys from the reserved segment and once to retrieve keys from the remaining segments. By removing the reserved segment, we can retrieve keys from both sets of nodes by processing each query only once.

A simple approach to remove the reserved segment is to exploit host virtualization. As illustrated in Figure 5.2, when host 5 stores a replicated version of key 5, a new node 5|3 is created. However, this simple approach increases the size of the overlay network, which in turn increases the maintenance overhead. Thus, a better approach is needed.

Figure 5.2: Exploiting Host Virtualization to Selectively Distribute Data Items

Dynamic Routing-Table Size
Virtualization in R-DHT increases the size of its overlay network in terms of the number of nodes. We have proposed a hierarchical R-DHT scheme to partition the maintenance overhead into multiple overlay networks. Assuming that N denotes the number of hosts and K denotes the number of unique keys, the size of the top-level overlay in a hierarchical R-DHT is K groups and its maintenance cost is a function of K. However, when K > N, the maintenance cost can be further reduced into a function of N, i.e. the maintenance cost in conventional DHT. A possible solution to address this issue, which complements hierarchical R-DHT, is to investigate a new scheme where each administrative domain adaptively adjusts the number of fingers maintained. In this scheme, each administrative domain approximates the size of the overlay network to determine the minimum number of fingers required to support robust lookups with short lookup path length.

Caching of Data Items
Caching is a common technique to improve the lookup performance in DHT.
From the perspective of nodes, caching data items belonging to other nodes is voluntary, as opposed to data-item distribution which is mandatory. Thus, caching can be exploited to improve the lookup performance in R-DHT without violating the storage-usage policy of a node. However, we need to address the problem of data-item ownership due to malicious nodes. For example, a node can cache a data item indefinitely by ignoring invalidation requests from the owner of the data item.

Semi-Structured Overlay Networks
R-DHT allows nodes autonomy in placing their key-value pairs. However, a higher degree of node autonomy is possible, in particular the autonomy in selecting neighboring nodes in an overlay network [44]. In DHT, the neighbors of a node are determined by a structured overlay network, i.e. the overlay network resembles a graph with a certain topology, and the position of a node in an overlay network is determined by the node identifier (Section 1.2). Recently, semi-structured overlay topologies have been proposed [37, 126]. In these schemes, nodes are free to choose their position in the overlay network and their neighbors, as long as the overlay network exhibits a global property such as a power-law network [126] or a square-root network [37]. Though both proposed schemes claim a provable lookup path length, we believe that more work is needed to improve these schemes. Firstly, percolation search [126] is based on earlier observations that file-sharing P2P systems are power-law networks [94, 121]. However, a recent study indicates that such systems are not power-law [134]. Secondly, the square-root network [37] assumes that the popularity of data items is known. However, the author does not describe how to measure the popularity of every data item in a large-scale P2P system.

Appendix A Read-Only CAN

In R-DHT, the identifier of a node is prefixed by the key shared by the node.
However, this property is not guaranteed in CAN [116] because CAN dynamically changes the identifier of existing nodes when splitting a zone (Section 1.2.2). In this appendix, we describe two R-CAN schemes, flat R-CAN (see also Section 2.3.1) and hierarchical R-CAN (see also Section 3.2), whereby the dynamic zone splitting guarantees that a node identifier is prefixed by a key shared by the node. Flat R-CAN allows a zone to be split into two unequal-size subzones, whereas hierarchical R-CAN allows several nodes to occupy the same zone. Subsequently, we make a distinction between VIDs and node identifiers as stated in Definition A.1.

Definition A.1. Let n denote a node identified by an m-bit node identifier. The i-bit VID of node n is defined as the i-bit prefix of n, where i ≤ m.

Based on Definition A.1, an m-bit node identifier is associated with m VIDs. Figure A.1 illustrates four possible VIDs of a 4-bit node identifier.

Figure A.1: VIDs of Node Identifier $1101_2$

A.1 Flat R-CAN

The original CAN dynamically changes the location of nodes and as such, the i-bit VID of a node may not be the same as the i-bit prefix of the node identifier. As illustrated in Figure A.2, when node n′ arrives, a zone, which is occupied by node n only, is split along the x-dimension into two equal-size subzones; each node is assigned one subzone. Nodes reflect their new subzone in their 3-bit VID, i.e. the 2-bit VID of n (before the splitting) concatenated with 0 or 1. However, this new VID for n′ violates Definition A.1 because it does not match the 3-bit prefix of n′, which is 111.

Figure A.2: Zone Splitting in CAN may Violate Definition A.1. (a) Zone occupied by node n whose 2-bit VID is 11. (b) 3-bit VID assigned to n′ is not equal to the 3-bit prefix of n′.

Flat R-CAN solves the above problems as follows:

1. A node's location in the Cartesian space is determined only by its (fixed) node identifier.
This is in contrast to the original CAN, where a new node chooses a random initial location.

2. A zone can be split into unequal-size subzones.

3. We ignore a zone splitting along a particular dimension if two nodes, i.e. one existing node and one new node, occupy the same coordinate in the splitting dimension.

When a zone splitting is ignored, we assign to the nodes their i-bit node-identifier prefix as their i-bit VID. Then, we continue to split the zone along a different dimension until the i′-bit VID produced by the splitting, where i′ > i, is equal to the i′-bit prefix of a node identifier. As illustrated in Figure A.3a, the location occupied by node n is determined by its node identifier. When node n′ with the same x-coordinate as n enters the zone, splitting the zone along the x-dimension is ignored (Figure A.3b). Thus, both nodes are assigned a 3-bit VID derived from their 3-bit node-identifier prefix. Afterwards, we split the zone along the y-dimension and assign a 4-bit VID to each node (Figure A.3c). Because n and n′ have different y-coordinates, two 4-bit VIDs are produced; these VIDs are guaranteed to be equal to the 4-bit prefixes of n and n′.

Figure A.3: Zone Splitting in Flat R-CAN. (a) Node identifier determines coordinates of a node. (b) Zone splitting along x-dimension is ignored. (c) Continue zone splitting along y-dimension.

Theorem A.1. Let node identifiers be m bits long. The number of ignored zone splittings is at most m − 1.

Proof. Assume that two node identifiers share the same (m − 1)-bit prefix, i.e. they differ in the least-significant bit only. Then, out of the m VIDs owned by each node, m − 1 VIDs are shared between the nodes. In R-CAN, the zone-splitting process ignores the splittings that result in the same VID for both nodes. Thus, m − 1 splittings are ignored.

In other words, given any two nodes, their coordinates must differ in at least one dimension.
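Definition A.1 and the counting argument behind Theorem A.1 can be made concrete with a short sketch. It assumes that node identifiers are non-negative integers smaller than $2^m$ and that VIDs are written as bit strings; the function names `vids` and `shared_vids` are hypothetical.

```python
def vids(node_id, m):
    """All m VIDs of an m-bit node identifier (Definition A.1), shortest first."""
    bits = format(node_id, f"0{m}b")
    return [bits[:i] for i in range(1, m + 1)]


def shared_vids(a, b, m):
    """Number of prefix lengths i for which two identifiers have the same i-bit VID."""
    return sum(1 for va, vb in zip(vids(a, m), vids(b, m)) if va == vb)


# The four VIDs of identifier 1101 (cf. Figure A.1).
print(vids(0b1101, 4))

# Two identifiers that differ only in the least-significant bit share m - 1 VIDs.
print(shared_vids(0b1100, 0b1101, 4))

# No pair of distinct 4-bit identifiers shares more than m - 1 = 3 VIDs.
print(max(shared_vids(a, b, 4) for a in range(16) for b in range(16) if a != b))
```

Each shared VID corresponds to one splitting that cannot separate the two nodes, so the last line gives the worst-case number of ignored splittings for m = 4 under these assumptions.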
When we split a zone, we ignore the splittings along dimensions in which both nodes have the same coordinate.

A.2 Hierarchical R-CAN

The primary difference between hierarchical R-CAN and flat R-CAN is that in hierarchical R-CAN, a zone is either split into two equal-size subzones or not split at all. As illustrated in Figure A.4, splitting the zone along the x-dimension into two equal-size subzones results in an empty subzone. However, unlike flat R-CAN, which selects an alternative splitting dimension, hierarchical R-CAN simply does not split the zone. Instead, the zone is shared by both the original node n and the new node n′. Thus, this zone can be considered as a group, as in a hierarchical DHT. Nodes within this zone may be further organized as a second-level overlay network.

Figure A.4: Zone Splitting in Hierarchical R-CAN
(a) Empty Subzone
(b) Two Nodes in One Zone
(c) Two Nodes in One Zone

Appendix B
Selective Data-Item Distribution

In Chapter 2, we demonstrated how R-DHT supports node autonomy, where each node stores only its own data items. In this chapter, we extend R-DHT to accommodate applications where some hosts may store data items belonging to other hosts. Examples of such hosts are DHT service providers [23] or MDS servers [4] serving as yellow pages in a computational grid. Selective data-item distribution also facilitates data-item replication in R-DHT.

To accommodate publicly-writable hosts, R-DHT restricts data-item distribution to a reserved segment S_r (e.g. r could be 0 or 2^m − 1). A publicly-writable host (h) is virtualized into only one node (n), identified by r|h. A key is then mapped and stored onto a node within S_r even if another node outside S_r is the closest to the key. For example, R-Chord maps and stores key k onto node n = r|h where r|h = successor(r|k); this can be further simplified as mapping key k to publicly-writable host h where h = successor(k).
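The mapping h = successor(k) within the reserved segment can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the R-Chord protocol itself: it uses a global, sorted view of the publicly-writable hosts instead of overlay routing, and the names `node_id`, `successor_in_reserved_segment`, and the constant `M` are hypothetical, introduced only for illustration.

```python
from bisect import bisect_left
from typing import Optional

M = 4  # bits in each half of the 2m-bit identifier; small value for illustration


def node_id(r: int, h: int, m: int = M) -> int:
    """Concatenate segment prefix r and host identifier h into the 2m-bit value r|h."""
    return (r << m) | h


def successor_in_reserved_segment(k: int, writable_hosts: list[int]) -> Optional[int]:
    """Return the publicly-writable host h with h = successor(k), or None if none exists.

    Assumes a global view of the hosts in the reserved segment; a real overlay
    would reach the same host by routing.
    """
    hosts = sorted(writable_hosts)
    if not hosts:
        return None
    i = bisect_left(hosts, k)     # index of the first host identifier >= k
    return hosts[i % len(hosts)]  # wrap to the first host if k is past the last one


# Illustrative usage: three publicly-writable hosts, reserved segment r = 2^m - 1.
hosts = [2, 7, 12]
print(successor_in_reserved_segment(9, hosts))   # 12
print(successor_in_reserved_segment(13, hosts))  # 2 (wraps to the first host)
print(format(node_id(2**M - 1, 12), "08b"))      # 11111100
```

Because only the m-bit host part varies inside the reserved segment, comparing host identifiers against k is equivalent to comparing r|h against r|k.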
Essentially, the selective data-item distribution scheme emulates an m-bit node-identifier space within the 2m-bit identifier space. Our selective data-item distribution reduces the maintenance overhead of R-DHT because each publicly-writable host increases the size of the overlay network by only one node. Figure B.1b shows the algorithm for a publicly-writable host joining the reserved segment.

Figure B.1: Relaxing Node Autonomy
(a) Map Keys to Nodes only in Segment S_0
(b) Virtualize Host to Reserved Segment

    // Host h joins segment S_r
    // through an existing host e.
    h.virtualize_to_reserve_segment(e)
      n = r|h;
      n.join(e)  // Chord's protocol [133]

Figure B.2 shows the algorithm for finding successor(k) in segment S_r. This operation allows the mapping of a key onto a node in S_r (i.e. the store operation) and the retrieval of a key from segment S_r (i.e. the lookup operation). The algorithm first finds the reserved segment S_r if necessary (the lookup(r) step), and then finds successor(k) in S_r. If no such node is found, i.e. k is beyond the last node in S_r, R-Chord maps k onto successor(r|0), i.e. the first node in S_r.

Figure B.2: Lookup within Reserved Segment

    // Find successor(k) in segment S_r
    h.find_successor_in_rsegment(k)
      n = r|h;
      if ∄ n then
        // Find S_r, as h is not publicly writable
        h′ = lookup(r);
        if h′ == NOT_FOUND then
          return NOT_FOUND;
        return h′.find_successor_in_rsegment(k);

      if n < r|k ≤ n.successor then
        // n is the predecessor of successor(k)
        if prefix(n.successor) == r then
          return n.successor;
        return find_successor(r|0);  // See Figure 2.18b

      // Go to the nearest known predecessor of k
      n′ = closest_preceding_node(r|k)
      if prefix(n′) == r then
        return n′.find_successor_in_segment(k);
      return n′.find_successor(r|0);  // See Figure 2.18b

References

[1] Apache HTTP server. http://httpd.apache.org.
[2] The Chord Project. http://www.pdos.lcs.mit.edu/chord.
[3] EarthLink SIPShare. http://www.research.earthlink.net/p2p/.
[4] Globus toolkit – Information Service. http://www.globus.org/toolkit/mds/.
[5] GLUE information model. http://glueschema.forge.cnaf.infn.it.
[6] Gnutella. http://www.gnutella.com.
[7] IEEE standard 1420.1-1995 (R2002), IEEE standard for information technology–software reuse–data model for reuse library interoperability: Basic interoperability data model (BIDM). http://standards.ieee.org/reading/ieee/std/se/1420.1-1995.pdf.
[8] Napster. http://www.napster.com.
[9] Oracle Spatial. http://www.oracle.com/technology/products/spatial/index.html.
[10] P2Pwg: Peer-to-peer working group. http://p2p.internet2.edu.
[11] Qnext. http://www.qnext.com.
[12] Skype. http://www.skype.com.
[13] The voP2P project. http://vop2p.jxta.org.
[14] K. Aberer, P. Cudré-Mauroux, A. Datta, Z. Despotovic, M. Hauswirth, M. Punceva, and R. Schmidt. P-Grid: A self-organizing structured p2p system. SIGMOD Record, 32(2):29–33, September 2003.
[15] L. A. Adamic, R. M. Lukose, A. R. Puniyani, and B. A. Huberman. Local search in unstructured networks. Handbook of Graphs and Networks: From the Genome to the Internet, pp. 295–316, January 2003.
[16] D.
Agrawal, A. E. Abbadi, and S. Suri. Attribute-based access to distributed data over P2P networks. Proc. of the 4th Intl. Workshop on Databases in Networked Information Systems, pp. 244–263, Springer-Verlag, Japan, March 2005. [17] L. O. Alima, S. El-Ansary, P. Brand, and S. Haridi. DKS (N, k, f): A family of low communication, scalable and fault-tolerant infrastructures for P2P applications. Proc. of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid, pp. 344–350, IEEE Computer Society Press, Japan, May 2003. [18] S. Androutsellis-Theotokis and D. Spinellis. A survey of peer-to-peer content distribution technologies. ACM Computing Surveys, 36(4):335–371, December 2004. [19] A. Andrzejak and Z. Xu. Scalable, efficient range queries for grid information services. Proc. of the 2nd Intl. Conf. on Peer-to-Peer Computing, pp. 33–40, IEEE Computer Society Press, Sweden, September 2002. [20] J. Aspnes and G. Shah. Skip Graphs. Proc. of the 14th Annual ACMSIAM Symp. on Discrete Algorithms, pp. 384–393, ACM/SIAM Press, USA, January 2003. [21] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. [22] H. Balakrishnan, K. Lakshminarayanan, S. Ratnasamy, S. Shenker, I. Stoica, and M. Walfish. A layered naming architecture for the internet. Proc. of ACM SIGCOMM, pp. 343–352, ACM Press, Germany, September 2004. [23] H. Balakrishnan, S. Shenker, and M. Walfish. Peering peer-to-peer providers. Proc. of the 4th Intl. Workshop on Peer-to-Peer Systems, pp. 104–114, Springer-Verlag, USA, February 2005. [24] D. Bauer, P. Hurley, R. Pletka, and M. Waldvogel. Bringing efficient advanced queries to distributed hash tables. Proc. of the 29th IEEE Intl. Conf. on Local Computer Networks, pp. 6–14, IEEE Computer Society Press, USA, November 2004. [25] R. Bhagwan, S. Savage, and G. M. Voelker. Understanding availability. Proc. of the 2nd Intl. Workshop on Peer-to-Peer Systems, pp. 256–267, SpringerVerlag, USA, February 2003. [26] G. Breinholt and C. 
Schierz. Algorithm 781: Generating Hilbert’s spacefilling curve by recursion. ACM Transactions on Mathematical Software, 24(2):184–189, June 1998. [27] A. R. Butt, R. Zhang, and Y. C. Hu. A self-organizing flock of Condors. Proc. of the ACM/IEEE SC2003 Conf. on High Performance Networking and Computing, pp. 42, ACM Press, USA, November 2003. REFERENCES 159 [28] M. Cai, M. Frank, J. Chen, and P. Szekely. MAAN: A Multi-Attribute Addressable Network for grid information services. Journal of Grid Computing, 2(1):3–14, 2004. [29] M. Castro, M. Costa, and A. Rowstron. Performance and dependability of structured peer-to-peer overlays. Proc. of the 2004 Intl. Conf. on Dependable Systems and Networks, pp. 9–18, June 2004. [30] M. Castro, M. Costa, and A. Rowstron. Should we build Gnutella on a structured overlay? ACM SIGCOMM Computer Communication Review, 34(2):131–136, April 2004. [31] M. Castro, M. Costa, and A. Rowstron. Debunking some myths about structured and unstructured overlays. Proc. of 2nd Symp. on Networked Systems Design and Implementation, pp. 85–98, USENIX Association, USA, May 2005. [32] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S. Wallach. Secure routing for structured peer-to-peer overlay networks. Proc. of the 5th USENIX Symp. on Operating Systems Design and Implementation, pp. 299– 314, USENIX Association, USA, December 2002. [33] Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker. Making Gnutella-like P2P systems scalable. Proc. of ACM SIGCOMM, pp. 407–418, ACM Press, Germany, August 2003. [34] F. Chen, T. Repantis, and V. Kalogeraki. Coordinated media streaming and transcoding in peer-to-peer system. Proc. of the 19th IEEE Intl. Parallel and Distributed Processing Symp., pp. 56b, IEEE Computer Society Press, USA, April 2005. [35] A-H. Cheng and Y-J. Joung. Probabilistic file indexing and searching in unstructured peer-to-peer networks. Proc. of the 4th IEEE Intl. Symp. on Cluster Computing and the Grid, pp. 
9–18, IEEE Computer Society Press, USA, April 2004. [36] A. J. Cole. Compaction techniques for raster scan graphics using space-filling curves. The Computer Journal, 31(1):87–92, 1987. [37] B. F. Cooper. Quickly routing searches without having to move content. Proc. of the 4th Intl. Workshop on Peer-to-Peer Systems, pp. 163–172, Springer-Verlag, USA, February 2005. [38] L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: Making backup cheap and easy. Proc. of the 5th USENIX Symp. on Operating Systems Design and Implementation, pp. 285–298, USENIX Association, USA, December 2002. [39] L. P. Cox and B. D. Noble. Samsara: Honor among thieves in peer-to-peer storage. Proc. of the 19th ACM Symp. on Operating Systems Principles, pp. 120–132, ACM Press, USA, October 2003. REFERENCES 160 [40] A. Crespo and H. Garcia-Molina. Routing indices for peer-to-peer systems. Proc. of the 22nd IEEE Intl. Conf. On Distributed Computing Systems, pp. 23–33, IEEE Computer Society Press, Austria, July 2002. [41] Y. Cui and K. Nahrstedt. Layered peer-to-peer streaming. Proc. of 13th Intl. Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 162–171, ACM Press, USA, June 2003. [42] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. Proc. of the 11th ACM Symp. on Operating Systems Principles, pp. 202–215, ACM Press, Canada, October 2001. [43] F. Dabek, B. Y. Zhao, P. Druschel, J. Kubiatowicz, and I. Stoica. Towards a common API for structured peer-to-peer overlays. Proc. of the 2nd Intl. Workshop on Peer-to-Peer Systems, pp. 33–44, Springer-Verlag, USA, February 2003. [44] N. Daswani, H. Garcia-Molina, and B. Yang. Open problems in data-sharing peer-to-peer systems. Proc. of the 9th Intl. Conf. on Database Theory, pp. 1–15, Springer-Verlang, Italy, January 2003. [45] F. K. H. A. Dehne, T. Eavis, and A. Rau-Chaplin. Parallel multi-dimensional ROLAP indexing. Proc. of the 3rd Intl. Symp. 
on Cluster Computing and the Grid, pp. 86–95, IEEE Computer Society Press, Japan, May 2003. [46] P. Druschel and A. I. T. Rowstron. PAST: A large-scale, persistent peer-topeer storage utility. Proc. of the 8th Workshop on Hot Topics in Operating Systems, pp. 75–80, IEEE Computer Society Press, Germany, May 2001. [47] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 1999. [48] M. J. Freedman, E. Freudenthal, and D. Mazières. Democratizing content publication with Coral. Proc. of the 1st Symp. on Networked Systems Design and Implementation, pp. 239–252, USENIX Association, USA, March 2004. [49] V. Gaede and O. Günther. Multidimensional access methods. volume 30, pp. 170–231, June 1998. [50] P. Ganesan, B. Yang, and H. Garcia-Molina. One torus to rule them all: Multi-dimensional queries in P2P systems. Proc. of the 7th Intl. Workshop on the Web and Databases, pp. 19–24, France, June 2004. [51] L. Garcés-Erice, E. W. Biersack, P. A. Felber, K. W. Ross, and G. UrvoyKeller. Hierarchical peer-to-peer systems. Proc. of the 9th Intl. Euro-Par Conf., pp. 1230–1239, Springer-Verlag, Austria, August 2003. [52] G. Ghinita and Y. M. Teo. An adaptive stabilization framework for distributed hash tables. Proc. of the 20th IEEE Intl. Parallel and Distributed Processing Symp., IEEE Computer Society Press, Greece, April 2006. REFERENCES 161 [53] A. Ghodsi, L. O. Alima, and S. Haridi. Low-bandwidth topology maintenance for robustness in structured overlay networks. Proc. of 38th Hawaii Intl. Conf. on System Sciences, pp. 302a, IEEE Computer Society Press, USA, January 2005. [54] A. Ghodsi, L. O. Alima, and S. Haridi. Symmetric replication for structured peer-to-peer systems. Proc. of the 3rd Intl. Workshop on Databases, Information Systems and Peer-to-Peer Computing, pp. 12, Spinger-Verlag, Norway, April 2005. [55] The Boston Globe. Google subpoena roils the web: US effort raises privacy issues. 
http://www.boston.com/news/nation/articles/2006/01/ 21/google subpoena roils the web?mode=PF, January 2006. [56] O. D. Gnawali. A keyword-set search system for peer-to-peer networks. Master’s thesis, Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 2002. [57] B. Godfrey, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica. Load balancing in dynamic structured P2P systems. Proc. of INFOCOM, pp. 2253– 2262, IEEE Press, China, March 2004. [58] P. B. Godfrey and I. Stoica. Heterogeneity and load balance in distributed hash tables. Proc. of INFOCOM, pp. 596–606, IEEE Press, USA, March 2005. [59] Google. Google’s opposition to the government’s motion to compel. http: //googleblog.blogspot.com/pdf/Google Oppo to Motion.pdf, February 2006. [60] Google. Response to the DoJ motion. http://googleblog.blogspot.com/ 2006/02/response-to-doj-motion.html, February 2006. [61] C. Gotsman and M. Lindenbaum. On the metric properties of discrete spacefilling curves. IEEE Transactions on Image Processing, 5(5):794–797, May 1996. [62] K. Gummadi, R. Gummadi, S. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica. The impact of DHT routing geometry on resilience and proximity. Proc. of ACM SIGCOMM, pp. 381–394, ACM Press, Germany, August 2003. [63] P. K. Gummadi, S. Saroiu, and S. Gribble. Measurement study of Napster and Gnutella as examples of peer-to-peer file sharing systems. Multimedia Systems Journal, 9(2):170–184, August 2003. [64] A. Gupta, B. Liskov, and R. Rodrigues. Efficient routing for peer-to-peer overlays. Proc. of 1st Symp. on Networked Systems Design and Implementation, pp. 113–126, USENIX Association, USA, March 2004. REFERENCES 162 [65] I. Gupta, K. Birman, P. Linga, A. Demers, and R. V. Renesse. Kelips: Building an efficient and stable P2P DHT through increased memory and background overhead. Proc. of the 2nd Intl. Workshop on Peer-to-Peer Systems, pp. 160–169, Springer-Verlag, USA, February 2003. 
[66] Andreas Haeberlen, Alan Mislove, and Peter Druschel. Glacier: Highly durable, decentralized storage despite massive correlated failures. Proc. of 2nd Symp. on Networked Systems Design and Implementation, pp. 143–158, USENIX Association, USA, May 2005. [67] M. Harren, J. M. Hellerstein, R. Huebsch, B.T.Loo, S. Shenker, and I. Stoica. Complex queries in DHT-based peer-to-peer networks. Proc. of the 1st Intl. Workshop on Peer-to-Peer Systems, pp. 242–249, Springer-Verlag, USA, March 2002. [68] N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A scalable overlay network with practical locality properties. Proc. of the 4th USENIX Symp. on Internet Technologies and Systems, pp. 113– 126, USENIX Association, USA, March 2003. [69] H-C. Hsiao and C-T. King. A tree model for structured peer-to-peer protocols. Proc. of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid, pp. 336–343, IEEE Computer Society Press, Japan, May 2003. [70] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the internet with PIER. Proc. of 29th Intl. Conf. on Very Large Data Bases, pp. 321–332, Morgan Kaufmann Publishers, Germany, September 2003. [71] A. Iamnitchi. Resource Discovery in Large Resource-Sharing Environments. PhD thesis, Dept. of Computer Science, The University of Chicago, December 2003. [72] A. Iamnitchi, M. Ripeanu, and I. Foster. Locating data in (small-world?) peer-to-peer scientific collaborations. Proc. of the 1st International Workshop on Peer-to-Peer Systems, pp. 232–241, Springer-Verlag, USA, March 2002. [73] S. Iyer, A. I. T. Rowstron, and P. Druschel. Squirrel: a decentralized peer-topeer web cache. Proc. of the 21th ACM Symp. on Principles of Distributed Computing, pp. 213–222, ACM Press, USA, July 2002. [74] H. V. Jagadish. Linear clustering of objects with multiple attributes. Proc. of the 1990 ACM SIGMOD Intl. Conf. on Management of Data, pp. 332–342, ACM Press, USA, June 1990. [75] G. 
Jin and J. Mellor-Crummey. SFCGen: A framework for efficient generation of multi-dimensional space-filling curve by recursion. ACM Transactions on Mathematical Software, 31(1):120–148, March 2005. REFERENCES 163 [76] M. F. Kaashoek and D. R. Karger. Koorde: A simple degree-optimal distributed hash table. Proc. of the 2nd Intl. Workshop on Peer-to-Peer Systems, pp. 98–107, Springer-Verlag, USA, February 2003. [77] D. R. Karger and M. Ruhl. Diminished Chord: A protocol for heterogeneous subgroup. Proc. of the 3rd Intl. Workshop on Peer-to-Peer Systems, pp. 288–297, Springer-Verlag, USA, February 2004. [78] D. R. Karger and M. Ruhl. Simple, efficient load balancing algorithms for peer-to-peer systems. Proc. of the 3rd Intl. Workshop on Peer-to-Peer Systems, pp. 131–140, Springer-Verlag, USA, February 2004. [79] F. B. Kashani and C. Shahabi. Criticality-based analysis and design of unstructured peer-to-peer networks as “complex systems”. Proc. of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid, pp. 351–358, IEEE Computer Society Press, Japan, May 2003. [80] K. Krauter, R. Buyya, and M. Maheswaran. A taxonomy and survey of grid resource management systems for distributed computing. Intl. Journal of Software, Practice and Experience, 32(2):135–164, February 2002. [81] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. Proc. of the 9th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 190–201, ACM Press, USA, November 2000. [82] J. B. Kwon and H. Y. Yeom. Distributed multimedia streaming over peer-topeer networks. Proc. of the 9th Intl. Euro-Par Conf., pp. 851–858, SpringerVerlag, Austria, August 2003. [83] M. Landers, H. Zhang, and K-L. Tan. Peerstore: Better performance by relaxing in peer-to-peer backup. Proc. of the 4th Intl. Conf. on Peer-to-Peer Computing, pp. 
72–79, IEEE Computer Society Press, Switzerland, August 2004. [84] J. K. Lawder. Using state diagrams for Hilbert curve mappings. Technical Report JL2/00, School of Computer Science and Information Systems, Birkbeck College, University of London, August 2000. [85] J. K. Lawder and P. J. H. King. Using space-filling curves for multidimensional indexing. Proc. of the 17th British National Conf. on Databases: Advances in Databases, pp. 20–35, Springer-Verlag, UK, July 2000. [86] J. Lee, H. Lee, S. Kang, S. Choe, and J. Song. CISS: An efficient object clustering framework for DHT-based peer-to-peer applications. Proc. of VLDB Workshop On Databases, Information Systems and Peer-to-Peer Computing, pp. 215–229, Spinger-Verlag, Canada, August 2004. REFERENCES 164 [87] M. Leslie, J. Davies, and T. Huffman. Replication strategies for reliable decentralised storage. Proc. of the 1st Workshop on Dependable and Sustainable Peer-to-Peer Systems, pp. 740–747, IEEE Computer Society Press, Japan, April 2006. [88] J. Li, P. A. Chou, and C. Zhang. Mutualcast: An efficient mechanism for content distribution in a peer-to-peer (P2P) network. Technical Report MSR-TR-2004-100, Microsoft Research, Communication and Collaboration Systems, September 2004. [89] J. Li, J. Stribling, T. M. Gil, R. Morris, and M. F. Kaashoek. Comparing the performance of distributed hash tables under churn. Proc. of the 3rd Intl. Workshop on Peer-to-Peer Systems, pp. 87–99, Springer-Verlag, USA, February 2004. [90] J. Li, J. Stribling, R. Morris, and M. F. Kaashoek. Bandwidth-efficient management of DHT routing tables. Proc. of 2nd Symp. on Networked Systems Design and Implementation, pp. 99–114, USENIX Association, USA, May 2005. [91] W. Li, Z. Xu, F. Dong, and J. Zhang. Grid resource discovery based on a routing-transferring model. Proc. of the 3rd Intl. Workshop on Grid Computing, pp. 145–156, Springer-Verlag, USA, November 2002. [92] X. Liu and G. Schrack. Encoding and decoding the Hilbert order. 
Software— Practice and Experience, 26(12):1335–1346, December 1996. [93] B. T. Loo, R. Huebsch, I. Stoica, and J. M. Hellerstein. The case for a hybrid P2P search infrastructure. Proc. of the 3rd Intl. Workshop on Peer-to-Peer Systems, pp. 141–150, Springer-Verlag, USA, February 2004. [94] Q. Lv, P Cao, E. Cohen, K. Li, and S. Shenker. Search and replication in unstructured peer-to-peer networks. Proc. of the 2002 Intl. Conf. on Supercomputing, pp. 84–95, ACM Press, USA, June 2002. [95] V. March and Y. M. Teo. Multi-attribute range queries on read-only DHT. Proc. of the 15th Intl. Conf. on Computer Communications and Networks, pp. 419–424, IEEE Communications Society Press, USA, October 2006. [96] V. March, Y. M. Teo, H. B. Lim, P. Eriksson, and R. Ayani. Collision detection and resolution in hierarchical peer-to-peer systems. Proc. of the 30th IEEE Conf. on Local Computer Networks, pp. 2–9, IEEE Computer Society Press, Australia, November 2005. [97] V. March, Y. M. Teo, and X. Wang. DGRID: A DHT-based grid resource indexing and discovery scheme for computational grids. Proc. of the 5th Australasian Symp. on Grid computing and e-Research, pp. 41–48, Australian Computer Society Inc., Australia, January 2007. REFERENCES 165 [98] E. P. Markatos. Tracing a large-scale peer to peer system: An hour in the life of Gnutella. Proc. of the 2nd IEEE Intl. Symp. on Cluster Computing and the Grid, pp. 65–74, IEEE Computer Society Press, Germany, May 2002. [99] P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. Proc. of the 1st Intl. Workshop on Peerto-Peer Systems, pp. 53–65, Springer-Verlag, USA, March 2002. [100] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pryune, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. Technical Report HPL-2002-57, HP Laboratories Palo Alto, March 2002. [101] A. Mislove and P. Druschel. Providing administrative control and autonomy in structured peer-to-peer overlays. Proc. 
of the 3rd Intl. Workshop on Peerto-Peer Systems, pp. 162–172, Springer-Verlag, USA, February 2004. [102] A. Mislove, A. Post, C. Reis, P. Willmann, P. Druschel, D. S. Wallach, X. Bonnaire, P. Sens, J-M. Busca, and L. B. Arantes. POST: A secure, resilient, cooperative messaging system. Proc. of the 9th Workshop on Hot Topics in Operating Systems, pp. 61–66, IEEE Computer Society Press, USA, May 2003. [103] B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis on the clustering properties of the Hilbert space-filling curve. IEEE Transactions on Knowledge and Data Engineering, 13(1):124–141, January 2001. [104] A. Muthitacharoen, R. Morris, T. M. Gil, and B. Chen. Ivy: A read/write peer-to-peer file system. Proc. of 5th USENIX Symp. on Operating System Design and Implementation, USENIX Association, USA, December 2002. [105] K. Nakauchi, Y. Ishikawa, H. Morikawa, and T. Aoyama. Peer-to-peer keyword search using keyword relationship. Proc. of the 3rd IEEE Intl. Symp. on Cluster Computing and the Grid, pp. 359–366, IEEE Computer Society Press, Japan, May 2003. [106] Z. Németh and V. Sunderam. Characterizing grids: Attributes, definitions, and formalisms. Journal of Grid Computing, 1(1):9–23, 2003. [107] World Association of Newspapers. Google must pay! http://www. wan-press.org/article9384.html?var recherche=google+news. [108] World Association of Newspapers. Newspaper, magazine and book publishers organizations to address search engine practices. http://www. wan-press.org/article9055.html, January 2006. [109] F. D. Ngoc, J. Keller, and G. Simon. MAAY: a decentralized personalized search system. Proc. of the IEEE/IPSJ Intl. Symp. on Applications and the Internet, IEEE Computer Society Press, USA, January 2006. REFERENCES 166 [110] S. J. Nielson, S. A. Crosby, and D. S. Wallach. A taxonomy of rational attacks. Proc. of the 4th Intl. Workshop on Peer-to-Peer Systems, pp. 36– 46, Springer-Verlag, USA, February 2005. [111] B. C. Ooi, Y. Shu, and K.L. Tan. 
Relational data sharing in peer-based data management systems. SIGMOD Record, 32(1):59–64, March 2003. [112] A. Oram. Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O’Reilly, 2001. [113] V. Ramasubramanian and E. G. Sirer. Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays. Proc. of 1st Symp. on Networked Systems Design and Implementation, pp. 99–112, USENIX Association, USA, March 2004. [114] L. Ramaswamy, B. Gedik, and L. Liu. A distributed approach to node clustering in decentralized peer-to-peer networks. IEEE Transaction on Parallel and Distributed Systems, 16(9):814–829, September 2005. [115] F. Ramsak, V. Markl, R. Fenk, M. Zirkel, K. Elhardt, and R. Bayer. Integrating the UB-tree into a database system kernel. Proc. of 26th Intl. Conf. on Very Large Data Bases, pp. 263–272, Morgan Kaufmann Publishers, Egypt, September 2000. [116] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable Content-Addressable Network. Proc. of ACM SIGCOMM, pp. 161–172, ACM Press, USA, August 2001. [117] S. Ratnasamy, I. Stoica, and S. Shenker. Routing algorithms for DHTs: Some open questions. Proc. the 1st Intl. Workshop on Peer-to-Peer Systems, pp. 45–52, Springer-Verlag, USA, March 2002. [118] Reuters. WPP’s Sorrell sees Google as threat, opportunity. http://today.reuters.com/news/articlebusiness.aspx?type= media&storyid=nN01402884&imageid=&cap=, March 2006. [119] S. Rhea, D. Geels, and T. Roscoe J. Kubiatowicz. Handling churn in a DHT. Proc. of the USENIX, pp. 127–140, USENIX Association, USA, June 2004. [120] S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica, and H. Yu. OpenDHT: A public DHT service and its uses. Proc. of ACM SIGCOMM, pp. 73–84, ACM Press, USA, August 2005. [121] M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the Gnutella network: Properties of large-scale peer-to-peer systems and implications for system design. 
IEEE Internet Computing Journal, 6(1):50–57, January 2002. [122] R. Rodrigues and C. Blake. When multi-hop peer-to-peer lookup matters. Proc. of the 3rd Intl. Workshop on Peer-to-Peer Systems, pp. 112–122, Springer-Verlag, USA, February 2004. REFERENCES 167 [123] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proc. of IFIP/ACM Intl. Conf. on Distributed Systems Platforms, pp. 329–350, Springer-Verlag, Germany, November 2001. [124] H. Sagan. Space-Filling Curves. Springer-Verlag, 1999. [125] D. Sandler, A. Mislove, A. Post, and P. Druschel. FeedTree: Sharing micronews with peer-to-peer event notification. Proc. of the 4th Intl. Workshop on Peer-to-Peer Systems, pp. 141–151, Springer-Verlag, USA, February 2005. [126] N. Sarshar, P. O. Boykin, and V. P. Roychowdhury. Percolation search in power law networks: Making unstructured peer-to-peer networks scalable. Proc. of the 4th Intl. Conf. on Peer-to-Peer Computing, pp. 2–9, IEEE Computer Society Press, Switzerland, August 2004. [127] C. Schmidt and M. Parashar. Flexible information discovery in decentralized distributed systems. Proc. of the 12th IEEE Intl. Symp. on High Performance Distributed Computing, pp. 226–235, IEEE Computer Society Press, USA, June 2003. [128] C. Schmidt and M. Parashar. Analyzing the search characteristics of space filling curve-based indexing within the Squid P2P data discovery system. Technical Report TR-276, Center for Advanced Information Processing (CAIP), Rutgers University, December 2004. [129] S. Shi, G. Yang, D. Wang, J. Yu, S. Qu, and M. Chen. Making peer-topeer keyword searching feasible using multi-level partitioning. Proc. of the 3rd Intl. Workshop on Peer-to-Peer Systems, pp. 151–161, Springer-Verlag, USA, February 2004. [130] J. Shneidman and D. C. Parkes. Rationality and self-interest in peer to peer networks. Proc. of the 2nd Intl. Workshop on Peer-to-Peer Systems, pp. 
139–148, Springer-Verlag, USA, February 2003. [131] Y. Shu, B. C. Ooi, K-L. Tan, and A. Zhou. Supporting multi-dimensional range queries in peer-to-peer systems. Proc. of the 5th Intl. Conf. on Peerto-Peer Computing, pp. 173–180, IEEE Computer Society Press, Germany, August 2005. [132] D. Spence and T. Harris. XenoSearch: Distributed resource discovery in the XenoServer open platform. Proc. of the 12th IEEE International Symp. on High Performance Distributed Computing, pp. 216–225, IEEE Computer Society Press, USA, June 2003. [133] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. Proc. of ACM SIGCOMM, pp. 149–160, ACM Press, USA, August 2001. REFERENCES 168 [134] D. Stutzbach, R. Rejaie, and S. Sen. Characterizing unstructured overlay topologies in modern P2P file-sharing systems. Proc. of the 2005 Internet Measurement Conf., pp. 49–62, USENIX Association, USA, May 2005. [135] C. Tang, Z. Xu, and S.Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. Proc. of ACM SIGCOMM, pp. 175–186, ACM Press, Germany, August 2003. [136] Y. M. Teo, V. March, and X. Wang. A DHT-based grid resource indexing and discovery scheme. Proc. of Singapore-MIT Alliance Annual Symp., Singapore, January 2005. [137] R. Tian, Y. Xiong, Q. Zhang, B. Li, B. Y. Zhao, and X. Li. Hybrid overlay structure based on random walks. Proc. of the 4th Intl. Workshop on Peerto-Peer Systems, pp. 152–162, Springer-Verlag, USA, February 2005. [138] D. Tsoumakos and N. Roussopoulos. A comparison of peer-to-peer search methods. Proc. of the Intl. Workshop on Web and Databases, pp. 61–66, USA, June 2003. [139] J. Xu. On the fundamental tradeoffs between routing table size and network diameter in peer-to-peer networks. Proc. of INFOCOM, pp. 2177–2187, IEEE Press, USA, March 2003. [140] Z. Xu, R. Min, and Y. Hu. HIERAS: A DHT based hierarchical P2P routing algorithm. Proc. 
of the 2003 Intl. Conf. on Parallel Processing, pp. 187–194, IEEE Computer Society Press, Taiwan, October 2003. [141] B. Yang and H. Garcia-Molina. Improving search in peer-to-peer networks. Proc. of the 22nd IEEE Intl. Conf. On Distributed Computing Systems, pp. 5–14, IEEE Computer Society Press, Austria, July 2002. [142] B. Yang and H. Garcia-Molina. Designing a super-peer network. Proc. of the 19th Intl. Conf. on Data Engineering, pp. 49–61, IEEE Computer Society Press, India, March 2003. [143] B. Y. Zhao, Y. Duan, L. Huang, A. D. Joseph, and J. Kubiatowicz. Brocade: Landmark routing on overlay networks. Proc. of the 2nd Intl. Workshop on Peer-to-Peer Systems, pp. 34–44, Springer-Verlag, USA, March 2002. [144] B. Y. Zhao, J. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, Computer Science Department, UC Berkeley, April 2001. [145] C. Zhu, Z. Liu, W. Zhang, W. Xiao, Z. Xu, and D. Yang. Decentralized grid resource discovery based on resource information community. Journal of Grid Computing, 2(3):261–277, September 2004. REFERENCES 169 [146] Y. Zhu, X. Yang, and Y. Hu. Making search efficient on Gnutella-like P2P systems. Proc. of the 19th IEEE Intl. Parallel and Distributed Processing Symp., pp. 56a, IEEE Computer Society Press, USA, April 2005.
