Improving the Performance of List Intersection

Dimitris Tsirogiannis, University of Toronto, [email protected]
Sudipto Guha, University of Pennsylvania, [email protected]
Nick Koudas, University of Toronto, [email protected]
ABSTRACT
List intersection is a central operation, used extensively for query processing on text and databases. We present list intersection algorithms for an arbitrary number of sorted and unsorted lists, tailored to the characteristics of modern hardware architectures. Two new list intersection algorithms are presented for sorted lists. The first algorithm, termed Dynamic Probes, dynamically decides the probing order on the lists by exploiting information from previous probes at runtime; this information is maintained as a cache-resident micro-index. The second algorithm, termed Quantile-based, deduces a good probing order in advance, thus avoiding the overhead of adaptivity, and is based on detecting lists with a non-uniform distribution of document identifiers. For unsorted lists, we present a novel hash-based algorithm that avoids the overhead of sorting.
A detailed experimental evaluation is presented based on real and synthetic data using existing chip multiprocessor architectures with eight cores, validating the efficiency and efficacy of the proposed algorithms.
1. INTRODUCTION
Set intersection is a central operation in query processing and text analytics. All modern query processors inherently support set operations. Additionally, set intersection is employed during the evaluation of conjunctive selection conditions. Consider, for example, the evaluation of the following conjunctive selection condition: a1 = 'X' ∧ a2 = 'Y', where a1, a2 are two relation attributes. Assume that for each attribute ai there is an index. Utilizing the index on each attribute, one can retrieve a set of pointers to records satisfying an individual condition. The conjunctive selection condition is evaluated by taking the intersection of all sets of record pointers and then fetching the qualifying records from memory. Sets of record pointers may be sorted or unsorted.
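To make the operation concrete, the following is a minimal merge-style intersection of two sorted lists of record ids in C. This is the textbook baseline (corresponding to the Linear Merge comparison algorithm of Section 5), not one of the algorithms proposed in this paper.

#include <stddef.h>

/* Merge-style intersection of two sorted lists of record ids.
 * Writes the common ids to out and returns their count. */
size_t intersect_sorted(const int *a, size_t na,
                        const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else {
            out[n++] = a[i];   /* id present in both lists */
            i++; j++;
        }
    }
    return n;
}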
Text data are generated at unprecedented rates, driven by user-generated content in the form of blogs and social networks. The huge amount of text data, as well as the increasing need to exploit textual information in order to solve business problems, necessitates efficient query processing techniques.
The fundamental data structure for indexing and query processing
on text is the inverted index [4]. Query processing consists
of retrieving the inverted indices corresponding to query keywords
and intersecting them to identify relevant documents. The inverted index for a keyword q consists of document identifiers and, commonly, a score associated with each identifier. The score represents a measure of the importance of q in each document (computed via one of several document scoring functions [4]). Depending on the platform and the application, query processing may involve the intersection of a varying number of inverted indices, which may or may not be ordered on document identifier.
Text analytics queries can be answered efficiently using inverted index (list) intersection. Consider, for example, the case where company X wants to know what male bloggers under the age of 25 and located in a specific geographical area think of a new product Y. To answer questions like this, we have to retrieve all blog posts that satisfy certain criteria. Assuming that both demographics (age, geographic location) and text keywords are indexed using inverted indices, text analytics queries like this one can be answered efficiently by computing the intersection of a number of inverted indices, each corresponding to a particular criterion, e.g., age.
Given the importance of this basic operation, list intersection
has attracted significant research attention in the past [3, 5, 7, 8,
13, 14, 27]. Prior research on list intersection algorithms has ignored the characteristics of modern hardware architectures: a) large on-chip caches, and b) on-chip parallelism in the context of chip multiprocessors. It is imperative to incorporate modern hardware trends in algorithm design to improve the performance of this increasingly important operation.
1.1 Hardware Trends
The use of 64-bit computer architectures increased the amount of memory that a system can utilize, and today it is common to have servers with many GBs of RAM performing tasks on memory-resident data. However, as memory sizes grew, the random access latency of memory reached hundreds of CPU cycles, introducing the "memory wall" problem [1]. A significant amount of research has been conducted toward "cache-aware" query processing techniques that effectively utilize multi-level cache hierarchies to reduce the overhead of random memory accesses [10, 24, 28]. On the other hand, traditional techniques for list intersection have not been adapted to modern processor architectures with large on-board caches, mostly due to the random memory access patterns that they exhibit.
On-chip parallelism was recently introduced in the context of chip multiprocessors (CMPs). Chip multiprocessors allow the concurrent execution of multiple threads of control through the use of multiple simple processing units (cores). Most vendors have adopted this design paradigm [2, 21, 23]. Compared to multi-processor architectures, CMPs offer different computation and communication tradeoffs. In a CMP, inter-processor communication is relatively inexpensive as it is performed through some shared level of the cache hierarchy, typically the L2 cache. On the other hand, shared resources, such as cache and memory bandwidth, must be carefully managed to avoid competition among cores, which may result in severe performance degradation.
Several research efforts have studied how to unleash the computational power of CMPs in order to improve the performance of various data management tasks [11, 16]. In contrast, prior research on list intersection algorithms assumes a uniprocessor architecture, ignoring the potential for performance improvement that CMPs offer.
1.2 List Intersection on Modern Hardware
In this paper, we study the design of in-memory list intersection algorithms, tailored to the characteristics of modern hardware
architectures. Our goals are: a) to reduce the overhead of random memory accesses by exploiting the large on-chip cache hierarchy, and b) to take advantage of the on-chip parallelism offered
in CMPs.
For the case of sorted lists, the majority of proposed algorithms follow a similar pattern in the way they compute the intersection. The
elements of one list are used as eliminators and probes are conducted in the other lists using a search method (e.g. binary or interpolation search) to locate these elements. If an eliminator belongs
to the intersection, the probes will locate it in all lists. If this is not
the case, the eliminator is simply ignored. The algorithms proposed
in the literature vary in how they select the eliminators and in the
order in which they probe the lists (probing order).
All search methods utilized for probing lists exhibit a random
memory access pattern, thus lacking temporal and spatial locality.
Consequently, the multi-level cache hierarchy is underutilized and
every probe incurs a number of expensive memory accesses, wasting hundreds of CPU cycles.
In this paper, we propose a list intersection algorithm for sorted
lists, termed Dynamic Probes, that utilizes previous probes to construct a cache-resident micro-index of the lists. The micro-index is
utilized to reduce the search range of each probe and, consequently, the number of random memory accesses performed. The micro-index is continuously updated during the computation of the intersection, allowing the algorithm to adapt to the characteristics of each list by dynamically changing its probing order.
An important consideration when designing parallel algorithms
is load balancing. Primarily, load has to be distributed evenly among
cores. If this is not the case, load disparity underutilizes the computational power available, thus reducing the speedup achieved. Furthermore, the overhead of load balancing should be kept to a minimum, otherwise it may offset the benefits of parallelization. It is
important to notice that executing a list intersection algorithm in
parallel may not always be the best solution due to the load balancing overhead introduced. In cases where computing the intersection
is not time consuming and throughput is the primary performance
objective, a serial algorithm may be preferred.
Exact algorithms for partitioning a number of lists ensure perfect load balancing. However, when a large number of lists is intersected, the overhead of such algorithms is high [22]. Approximate algorithms with explicit error guarantees offer attractive alternatives as they trade accuracy for speed. Load is not perfectly
balanced but the overhead is lower than the overhead of exact algorithms. Even when approximate algorithms are utilized, the overhead of load balancing is non-negligible as these algorithms are
not optimized for cache performance. Furthermore, for the case of
unsorted lists, they deploy an expensive sorting phase.
We propose a CMP adaptation of an approximate quantile identification algorithm demonstrating how it can be utilized to efficiently partition an arbitrary number of sorted (Partition Sorted algorithm) and unsorted lists (Partition Unsorted algorithm) into any
number of independent partitions. The quantile identification algorithm used has explicit error guarantees, allowing us to control the
load disparity across cores.
Applying a quantile identification algorithm for load balancing
purposes provides a unique opportunity to inspect list elements and
identify lists with highly non-uniform distribution of elements. As
Demaine et al. pointed out, this non-uniformity may result in severe performance degradation of list intersection algorithms [13];
they proposed adaptive intersection algorithms to address this issue.
However, when a large number of lists is intersected, their approach
introduces notable overhead because it tends to be “over-adaptive”
[14, 27]. In this paper, we propose a list intersection algorithm for
sorted lists, termed Quantile-based, that takes advantage of quantiles, computed by the load balancing algorithm, to detect lists with
highly non-uniform distribution of elements and determine a good
probing order at low cost, thus eliminating the overhead of adaptivity.
Finally, for the case of unsorted lists, all known approaches deploy a sorting algorithm to sort the lists. Then, a list intersection
algorithm for sorted lists is applied. Instead, we propose Hash-based, a list intersection algorithm tailored to CMPs. It utilizes the quantile identification algorithm to avoid sorting the lists
and computes their intersection in a cache-conscious manner using
hashing.
1.3 Contributions and Paper Outline
To the best of our knowledge, this is the first study on how to
improve the performance of list intersection for sorted and unsorted
lists on modern hardware architectures. The contributions of this
paper are the following:
• We demonstrate how we can exploit the characteristics of
modern hardware architectures to improve the performance
of list intersection for sorted and unsorted lists.
• We present Dynamic Probes, a novel list intersection algorithm that takes advantage of multi-level cache hierarchies to
reduce the overhead of random memory accesses.
• We study the problem of load balancing during the parallelization of list intersection algorithms and we propose an
effective and efficient load balancing algorithm based on quantiles.
• We present Quantile-based, a list intersection algorithm that
exploits the “free” inspection of lists, when a load balancing
algorithm is employed, to reduce the cost of intersection.
• We present Hash-based, a list intersection algorithm for unsorted lists tailored to CMPs that avoids the overhead of sorting.
• We present a detailed experimental evaluation using synthetic
and real data on a chip multiprocessor with eight cores. We
demonstrate that all algorithms proposed herein scale almost
linearly as the number of cores increases. For sorted lists,
the proposed techniques exhibit the best performance compared to existing list intersection algorithms. When the lists
are unsorted, the Hash-based algorithm performs better than
the combination of sorting and any list intersection algorithm
for sorted lists.
The remainder of the paper is organized as follows. Section 2
presents related work. Section 3 describes the approximate quantile identification algorithm and presents cache-conscious list partitioning algorithms for sorted and unsorted lists. Section 4 presents
intersection algorithms for sorted lists and the hash-based intersection algorithm for unsorted lists. We present the experimental
evaluation of the proposed algorithms in Section 5 and conclude in
Section 6.
2. RELATED WORK
The problem of intersecting two sorted lists has been studied in
the past by Hwang and Lin [20, 19]. Demaine et al. proposed the
Adaptive algorithm for computing the intersection of a collection
of sorted sets [13]. Their algorithm computes the intersection by
repeatedly cycling through the sets in a round-robin fashion. In a follow-up study, they compared the Adaptive algorithm with existing algorithms using real data and demonstrated that, in practice, their algorithm does not always exhibit the best performance due to the overhead of adaptivity (repeatedly cycling through the sets) [14]. Based on these findings, a number of improvements were proposed, the most significant of which are: a) galloping search [9], and b) a limited form of adaptivity, compared to the initial version.
The resulting algorithm, termed Small Adaptive, intersects the two
smallest sets and it starts cycling through the remaining sets only
when a match is found.
Barbay et al. repeated the study of [14] using a larger dataset
[8]. In their experiments they included a list intersection algorithm
proposed in [7]. Their study verified that Small Adaptive has the
best performance in practice and that interpolation search improves
the performance of all intersection algorithms.
An adaptive row id list intersection strategy for query processing has been proposed by Raman et al. [27]. In order to avoid
the risk of using wrong cardinality estimations during the computation of an AND-tree, they use an adaptive n-ary intersection
algorithm based on the Adaptive algorithm [13]. Instead of using
a round-robin probing policy as suggested in [13], they propose a
probabilistic probing policy to determine which list to examine next
based on historical data from previous probes. In their experimental
evaluation they compare their approach against the Adaptive algorithm and against the traditional pipelined approach that utilizes the
best AND-tree. As we will show in Section 5, the performance of
Small Adaptive [14] in many cases surpasses or is equivalent to the
performance of their algorithm.
Iyer et al. proposed an exact percentile identification algorithm
for an arbitrary number of sorted lists [22]. The overhead of the
algorithm increases by a factor of m log m, where m is the number
of sorted lists. Hence, it is not applicable to a large number of lists.
Furthermore, in order to apply the algorithm to unsorted lists, a sorting algorithm must first be utilized, introducing significant overhead.
Several algorithms for computing approximate quantiles of large
datasets have been proposed [25, 18, 12]. Manku et al. presented
single pass algorithms for computing approximate quantiles of large
datasets with explicit approximation guarantees [25]. Greenwald
and Khanna proposed a distributed single-pass algorithm to compute an ε-approximate quantile summary over sensor data [18]. Cormode et al. proposed deterministic algorithms for biased quantiles, ranked queries, and targeted quantile queries that have guaranteed space bounds [12].
3. LOAD BALANCING ON A CMP
A major issue when designing algorithms for CMP architectures
is load balancing. Since load disparity among cores and the overhead of the load balancing technique limit the achievable speedup
through parallelism, we seek low overhead load balancing techniques with explicit error guarantees. Techniques that achieve perfect load balancing usually introduce considerable overhead and in
many cases the benefit derived from the elimination of load disparity does not compensate for the overhead introduced [15]. We verify this effect in Section 5.
A quantile identification algorithm computes any order statistic over a set of elements S; we use element and document identifier (id) interchangeably in the text. Such an algorithm is used as a building block in our framework to divide (sorted and unsorted) lists into a number of independent partitions of approximately equal size. Each partition is assigned to a core, and in this way the computation of list intersection is parallelized in a load-balanced way. For the case of sorted lists, list elements are accessed once while computing the intersection and the only requirement is that load be evenly partitioned among cores. Hence, the number of partitions is set to be equal to the number of cores utilized. For the case of unsorted lists, we require each partition to be cache-resident. Consequently, for a system utilizing C cores, we choose the number of partitions such that each partition fits in a 1/C-th part of the L2 cache.
Let us first discuss how a quantile identification algorithm can
be utilized to partition k ≥ 2 sorted lists into P independent partitions of approximately equal size on a uniprocessor architecture
(Section 3.1). Next, we will demonstrate how these quantiles can
be computed efficiently on CMPs (Section 3.2).
3.1 List Partitioning on a Single Processor
The independence property states that for any pair of partitions Pi, Pj (i < j; i, j = 1, . . . , P), all the elements of Pi must be strictly smaller than all the elements of Pj. Consider for
example two sorted lists A1 = {1, 2, 3, 5, 9, 10, 12, 15, 18, 20, 40}
and A2 = {4, 8, 11, 13, 14, 16, 17, 39, 41, 42, 50}. A possible
partitioning of these two lists in two partitions that satisfies the independence property could be: P1 = {{1, 2, 3, 5}, {4}} and P2 =
{{9, 10, 12, 15, 18, 20, 40}, {8, 11, 13, 14, 16, 17, 39, 41, 42, 50}}.
In this partitioning, P1 contains four elements of A1 and one element of A2 ; P2 contains seven elements of A1 and ten elements of
A2 . Notice, that all the elements of P1 are strictly smaller than the
elements of P2 .
In order to identify the elements of each partition Pi, it is sufficient to determine its boundary values lo(Pi), hi(Pi). The following invariant holds for boundary values: if element id belongs to Pi, then lo(Pi) ≤ id ≤ hi(Pi). In the example described, the boundary values are: lo(P1) = 1, hi(P1) = 5, lo(P2) = 8, hi(P2) = 50. In general, a partition is a "set" of sublists, one from each of the original lists, where each of the sublists shares the same boundary values. The size of a partition is the sum of the sizes of the sublists.
Note that although this is one possible partitioning of the two lists, it is not necessarily the best, as P1, P2 are not of equal size. In order to divide an arbitrary number of sorted lists into P partitions of equal size, it is sufficient to compute P − 1 order statistics of the elements in the union of the lists; specifically, the elements with rank i·N/P, i = 1, . . . , P − 1, where N is the total number of elements. The P − 1 order statistics, along with the extreme values (min, max) of the lists, determine the boundary values of the partitions. In our example, we need to compute the median of the union of A1 and A2 (value 13). Hence, the boundary values of the partitions in the optimal partitioning of A1, A2 are: lo(P1) = 1, hi(P1) = 13, lo(P2) = 14, hi(P2) = 50, i.e., P1 = {{1, 2, 3, 5, 9, 10, 12}, {4, 8, 11, 13}}, P2 = {{15, 18, 20, 40}, {14, 16, 17, 39, 41, 42, 50}}. The resulting partitions P1, P2 have the same size (11 elements each).
Existing exact quantile identification algorithms can be used to compute order statistics and, consequently, to partition the lists in a load-balanced way. However, these algorithms introduce notable overhead. Several approximate quantile algorithms with explicit error guarantees have been proposed in the literature [25, 18, 12]; any of these algorithms can be utilized in our setting. In this study we choose to elaborate utilizing the algorithm proposed by Greenwald and Khanna [18], since it has better space requirements compared to the algorithm in [25]. If the size of the dataset N is known, the space requirement of the algorithm in [18] is O((1/ε) log(εN)), compared to the algorithm in [25], which requires O((1/ε) log²(εN)) memory, where ε is the error with which any order statistic is computed. Furthermore, the algorithm in [18] is utilized because we want the simplest algorithm for deterministic unbiased quantiles; the algorithm in [12] is for biased quantiles.
An ε-approximate quantile summary Q for a set of elements S is an ordered subset of S that can be utilized to answer any order statistic query over S with error ε. The resulting summary Q contains 1/(2ε) + 1 elements. If we query Q for an element with rank r, then the output element is guaranteed to have rank within r ± ε|S|. Next, we describe the computation of an ε-approximate quantile summary for a set of elements residing in two sorted lists on a uniprocessor architecture using the algorithm in [18]. For brevity, we refer to an ε-approximate quantile summary as an ε-summary.
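In symbols, and instantiated with the values used in the example that follows:

\[
|Q| = \tfrac{1}{2\epsilon} + 1, \qquad
\bigl|\,\mathrm{rank}(e) - r\,\bigr| \;\le\; \epsilon\,|S| .
\]

For instance, with \(\epsilon = 0.1\) and \(|S| = 22\), a median query (\(r = 11\)) may return any element whose true rank lies in \([8.8,\ 13.2]\), i.e., ranks 9 through 13.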
We will describe the basic operations of the algorithm using a simple example. Assume that we wish to compute an ε-summary Q for the elements residing in two sorted lists A1, A2 (the same as in the previous example) with error ε = 0.1. Initially, the algorithm computes two ε-summaries Q1, Q2, for the elements in A1 and A2, respectively. An ε-summary Qi for the elements of Ai is computed by choosing at most 1/(2ε) + 1 elements. These are the elements with ranks 1, 2ε|Ai|, 4ε|Ai|, . . . , |Ai|, where |Ai| is the number of elements in Ai. The ε-summaries Q1, Q2 contain six elements each: Q1 = {1, 3, 9, 12, 18, 40} and Q2 = {4, 11, 14, 17, 41, 50}. In order to compute the final ε-summary Q, we apply the combine(Q1, . . . , Qm) operation [18]. The combine operation produces a new summary Q by merging the smaller summaries Q1, . . . , Qm provided as input. The error of Q is the maximum error of the input summaries in combine. The size of Q is equal to the size of the union of the summaries provided as input to combine. In our example, we apply the combine operation using as input the ε-summaries Q1, Q2, obtaining Q = {1, 3, 4, 9, 11, 12, 14, 17, 18, 40, 41, 50}, which is an ε-summary that can be used to answer any order statistic over all the elements in A1 and A2 with error 10%.
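A minimal sketch of the two operations just described, in C. The ceiling-based rank selection reproduces Q1 = {1, 3, 9, 12, 18, 40} on A1 with ε = 0.1; the full combine operation in [18] also maintains rank bounds for each stored element, which this sketch omits.

#include <math.h>
#include <stddef.h>

/* Build an eps-summary of a sorted list: keep at most 1/(2*eps)+1
 * elements, at ranks 1, 2*eps*n, 4*eps*n, ..., n (1-based, rounded up). */
size_t build_summary(const int *sorted, size_t n, double eps, int *summary)
{
    size_t m = 0;
    summary[m++] = sorted[0];                       /* rank 1 */
    for (double r = 2.0 * eps * n; r < (double)n; r += 2.0 * eps * n)
        summary[m++] = sorted[(size_t)ceil(r) - 1]; /* interior ranks */
    summary[m++] = sorted[n - 1];                   /* rank n */
    return m;
}

/* combine of two equal-error summaries is, at this level of detail,
 * a merge of two sorted arrays. */
size_t combine2(const int *q1, size_t n1, const int *q2, size_t n2, int *q)
{
    size_t i = 0, j = 0, m = 0;
    while (i < n1 && j < n2)
        q[m++] = (q1[i] <= q2[j]) ? q1[i++] : q2[j++];
    while (i < n1) q[m++] = q1[i++];
    while (j < n2) q[m++] = q2[j++];
    return m;
}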
3.2 List Partitioning on CMPs
Given the description of the approximate quantile algorithm for uniprocessors, we now propose an adaptation tailored to CMPs, resulting in the Partition Sorted algorithm for sorted lists (presented in Section 3.2.1) and the Partition Unsorted algorithm for unsorted lists (presented in Section 3.2.2). The adaptation aims to: a) enhance the algorithm so that it exploits all the available processing elements, and b) improve its cache locality, thus reducing the number of expensive memory accesses performed.
We consider a CMP architecture consisting of C cores and an
on-chip cache hierarchy with a private L1 cache per core and a
shared L2 cache which can store |L2| elements. Our algorithms
are general enough to function on any type (organization) of the
on-chip cache hierarchy. We refer to L2 as the lowest level of the
on-chip cache hierarchy which is shared among cores.
3.2.1 Partition Sorted Algorithm
To parallelize the computation of an ε-summary from k sorted lists, we assign the lists to cores (approximately k/C lists per core). Let ki be the number of lists assigned to core i. Note that there is no physical transfer of lists, because they reside in memory which is shared among cores. Core i computes ki ε-summaries Q1, . . . , Qki, one from each assigned list. Then, it merges the ki ε-summaries using the combine operation to produce a single ε-summary Qi. A single core is responsible for merging the resulting ε-summaries Q1, . . . , QC using the combine operation and produces the final ε-summary Q. The ε-summaries are small enough to reside in cache, and since the cache is shared among cores, the intermediate ε-summaries Q1, . . . , QC as well as the final ε-summary Q are produced without the need for expensive off-chip memory accesses.
If k < C, we "greedily" partition the lists in order of decreasing length. The partitioning requires C steps, and at every step min{# of remaining unassigned elements of the current list, N/C} contiguous elements from the currently examined list are assigned to a core; N is the total number of elements in the lists. Each core produces an ε-summary and the ε-summaries are merged by a single core to produce the final ε-summary Q.
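A sketch of this parallel step using Pthreads (the paper's implementation language), assuming the build_summary and combine2 helpers sketched in Section 3.1; the buffer capacities, the task structure, and the work assignment are illustrative choices of ours, not the paper's exact code.

#include <pthread.h>
#include <string.h>
#include <stddef.h>

size_t build_summary(const int *sorted, size_t n, double eps, int *summary);
size_t combine2(const int *q1, size_t n1, const int *q2, size_t n2, int *q);

typedef struct { const int *ids; size_t len; } list_t;

typedef struct {
    const list_t *lists;        /* the ~k/C lists assigned to this core */
    size_t nlists;
    double eps;
    int summary[8192];          /* per-core output summary Qi (fixed cap) */
    size_t summary_len;
} task_t;

/* Each core summarizes its assigned lists and folds the results together. */
static void *summarize_lists(void *arg)
{
    task_t *t = (task_t *)arg;
    int tmp[1024], merged[8192 + 1024];
    t->summary_len = 0;
    for (size_t l = 0; l < t->nlists; l++) {
        size_t m = build_summary(t->lists[l].ids, t->lists[l].len,
                                 t->eps, tmp);
        size_t n = combine2(t->summary, t->summary_len, tmp, m, merged);
        memcpy(t->summary, merged, n * sizeof(int));
        t->summary_len = n;
    }
    return NULL;
}

/* One thread per core (assumes C <= 64); afterwards a single core merges
 * the C per-core summaries with combine to obtain the final summary Q. */
void partition_sorted(task_t tasks[], int C)
{
    pthread_t tid[64];
    for (int c = 0; c < C; c++)
        pthread_create(&tid[c], NULL, summarize_lists, &tasks[c]);
    for (int c = 0; c < C; c++)
        pthread_join(tid[c], NULL);
}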
3.2.2 Partition Unsorted Algorithm
In contrast to the sorted case, when the lists are unsorted it is not possible to create an ε-summary by simply choosing a set of elements with specific ranks. The naive solution is to sort the lists and then compute an ε-summary utilizing the Partition Sorted algorithm (Section 3.2.1). We demonstrate that it is possible to avoid a possibly expensive sorting phase.
In Algorithm 1, we present pseudo-code for the Partition Unsorted algorithm. The main idea is to partially sort each list into a number of sorted runs and create an ε-summary from each sorted run (lines 1-7). Once the ε-summaries have been produced, they are merged into the final ε-summary Q (lines 9-21).
Each run contains |L2|/C elements from a list and is sorted using Quicksort, which is an in-place sorting algorithm. The size of each run is chosen so that C runs can fit simultaneously in L2. Therefore, the generation of sorted runs does not cause any L2 capacity cache misses. Only compulsory cache misses are incurred when the elements are fetched for the first time into the cache. An ε-summary is produced from a sorted run immediately after its creation. The overhead of producing the ε-summary is negligible, as the run already resides in cache.
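This run-generation step (lines 3-5 of Algorithm 1) can be sketched in C as follows; run_cap stands for |L2|/C, build_summary is the sketch from Section 3.1, and both names are ours, not the paper's.

#include <stdlib.h>
#include <stddef.h>

size_t build_summary(const int *sorted, size_t n, double eps, int *summary);

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Turn the next chunk of an unsorted list into a sorted run and
 * summarize it while it is still cache resident. Returns the summary
 * length; *pos advances past the run. */
size_t make_run_summary(int *list, size_t len, size_t *pos,
                        size_t run_cap, double eps, int *summary)
{
    size_t n = (len - *pos < run_cap) ? len - *pos : run_cap;
    qsort(list + *pos, n, sizeof(int), cmp_int);   /* in-place Quicksort */
    size_t m = build_summary(list + *pos, n, eps, summary);
    *pos += n;
    return m;
}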
Once the initial ε-summaries of all the runs have been produced, they are merged in parallel by the cores to produce the final ε-summary Q. Let M be the number of initial ε-summaries. The M ε-summaries are assigned to cores (approximately M/C ε-summaries per core) and each core merges its assigned ε-summaries using the combine operation; an ε-summary Qi is produced by core i. The C resulting ε-summaries Q1, . . . , QC are merged by a single core to produce Q.
If all ε-summaries can fit in L2, the cores do not experience any L2 cache misses while the ε-summary Q is produced (lines 10-13). If this is not the case, we resort to a multi-phase merging (lines 15-21). In each phase the cores read a sufficient number of summaries fitting in L2. When in cache, the summaries are merged by the cores and a single summary is produced. Then, the size of the resulting summary is reduced using the prune operation [18].
The prune operation reduces the size of a quantile summary. It takes as input an ε-summary Q for a set of elements S and a parameter B, and produces a new ε′-summary Q′ of size at most B + 1 elements with error ε′, by querying Q for the elements of rank 1, |S|/B, 2|S|/B, . . . , |S|. The error of Q′ is ε′ = ε + 1/(2B).
In the next phase, a sufficient number of pruned summaries fitting in L2 is read by the cores. The summaries are merged and a new summary is created, which is again pruned. In the last phase all the summaries fit together in L2 and the final summary Q is computed (line 21).
Since the number of initial summaries is M, the prune operation will be applied at most log M times, each time increasing the error of the resulting quantile summary by 1/(2B). Hence, in order for the final quantile summary to have error ε, we must create the initial quantile summaries with error ε/2 (ε/2-summaries) and set B = (1/ε) log M. Consequently, if we apply the prune operation log M times, we will generate a quantile summary with error ε.
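As a concrete instance of this accounting (the numbers are ours, chosen for illustration):

\[
M = 64,\; \epsilon = 0.01:\qquad
B = \tfrac{1}{\epsilon}\log M = 600,
\]
\[
\epsilon_{\mathrm{final}} \;\le\;
\underbrace{\tfrac{\epsilon}{2}}_{\text{initial}} \;+\;
\underbrace{\log M\cdot\tfrac{1}{2B}}_{\log M\ \text{prunes}}
\;=\; 0.005 + 6\cdot\tfrac{1}{1200} \;=\; 0.01 \;=\; \epsilon .
\]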
Algorithm 1 Partition Unsorted
Input: k unsorted lists A1, . . . , Ak; |Ai| elements in list Ai; |L2| elements fit in the L2 cache; C cores; error ε
Output: an ε-approximate quantile summary Q
1: for l = 1, . . . , k do
2:   while ∃ unread elements in Al do
3:     Read the next |L2|/C elements from Al.
4:     Sort the elements in-place using Quicksort and generate a sorted run r.
5:     Compute an ε-approximate summary for r containing at most 1/(2ε) + 1 elements.
6:   end while
7: end for
8: Barrier // Synchronization point. There are M = Σ_{l=1}^{k} ml quantile summaries, where ml = ⌈|Al|·C/|L2|⌉ is the number of runs of list Al.
9: if M · (1/(2ε) + 1) < |L2| then
10:   The quantile summaries are assigned to cores, i.e., M/C quantile summaries per core.
11:   Each core merges its assigned quantile summaries using the combine operation.
12:   Barrier // Synchronization point. C quantile summaries have been created.
13:   The C quantile summaries are combined by a single core into the final quantile summary Q.
14: else
15:   while the summaries cannot fit in the L2 cache do
16:     Each core reads M′/C summaries into the L2 cache. // M′ is the number of summaries that can fit in the L2 cache.
17:     Each core merges its assigned summaries using the combine operation.
18:     Barrier // Synchronization point. C quantile summaries have been created.
19:     The C quantile summaries are combined by a single core into a quantile summary Q′; prune is applied to Q′ to reduce its size to at most B + 1 elements.
20:   end while
21:   Q is computed. // As in lines 10-13.
22: end if

4. LIST INTERSECTION ALGORITHMS
In this section, we present several algorithms to compute the intersection of sorted and unsorted lists. Section 4.1 presents the Dynamic Probes and Quantile-based algorithms for sorted lists. Section 4.2 presents the Hash-based algorithm for unsorted lists.
Assume we wish to compute the intersection of k lists A1, . . . , Ak, where k ≥ 2. Each list Ai contains |Ai| unique elements (ids); the total number of elements in the lists is N = Σ_{i=1}^{k} |Ai|. We assume that the lists reside in main memory and that the result of the intersection is also written to memory.
4.1 The Case of Sorted Lists
4.1.1 Dynamic Probes Algorithm
The basic idea explored by most of the algorithms proposed for intersecting k lists is to use the elements of one list as eliminators and conduct probes (lookups) in the remaining lists using a search method (binary search, galloping search, etc.). The probing order (the order in which the lists are probed) and the search interval length of each probe affect the performance of a list intersection algorithm.
For every probe, the number of steps (comparisons) required to either locate an element or exclude it from the intersection depends on the length of the search interval. Due to the random access pattern that most search methods exhibit, every step results in a random memory access that, with high probability, cannot be resolved through the cache. Furthermore, an effective probing order can speed up the computation of the intersection: probing first the lists that are less likely to contain an element reduces the number of probes required to exclude an element from the intersection.
The Dynamic Probes (DP) algorithm dynamically decides the probing order and the length of each probe using information from previous probes. This information is utilized as a cache-resident micro-index. Let us describe how the DP algorithm computes the intersection of k sorted lists A1, . . . , Ak. The pseudo-code for the DP algorithm is presented in Algorithm 2. Initially, the lists are sorted by increasing length (|A1| ≤ |A2| ≤ . . . ≤ |Ak|) and the first element of the smallest list is set as the eliminator, e = A1[0]. A binary search is performed for e in the remaining lists A2, . . . , Ak, and two arrays Val, Pos are utilized to store the ids and the positions, respectively, of the elements that are "touched" during a binary search. For every list Ai, the ids are stored in the i-th row of array Val and the positions are stored in the i-th row of array Pos (lines 3-5). Val and Pos store the ids and positions of only the elements with ids larger than e. Val and Pos are two (k − 1) × log nmax arrays, where nmax is the number of elements in the largest list. Next, we describe how the information stored in these two arrays is utilized as a micro-index to speed up subsequent probes.
In the next step, the successor of e in the current smallest unexplored list is used as the eliminator, e = succ(e) (line 6). Instead of using a round-robin probing policy like the one applied in [14], we dynamically decide the probing order and the search interval length for e as follows. For every list Ai, we search in Val for the ids lo_vali(e), hi_vali(e) that bound e, i.e., lo_vali(e) < e < hi_vali(e) (line 9). The corresponding entries in Pos (lo_posi(e), hi_posi(e)) of lo_vali(e) and hi_vali(e) define a range of positions to search for e in list Ai: e will either be found in the range [lo_posi(e), hi_posi(e)] or it does not exist in that list. Given their small size and the fact that they are frequently accessed, Val and Pos will reside with high probability in some level of the on-chip cache hierarchy. Consequently, computing the search intervals for e imposes minimal overhead. Once we determine the search interval of e for every list, we sort the intervals by increasing length (line 15) and we greedily probe the lists in that order using interpolation search (line 20). The same procedure is applied repeatedly until one of the lists is exhausted.
In Figure 1, we illustrate a sample run of DP on three lists A1, A2, A3 with 5, 100, and 150 elements (document identifiers), respectively. DP uses the first id (100) of list A1 as the eliminator and performs a binary search in A2 and A3. A binary search for id 100 in A2 locates id 100 at position 6 and "touches" the following position-id pairs: (50-750), (25-200), (12-150). Hence, the second row of Val is populated with ids (150, 200, 750) and the second row of Pos is populated with their positions (12, 25, 50). Accordingly, after performing a binary search for id 100 in list A3, the third row of Val stores ids (350, 500, 700) and the third row of Pos stores their positions (18, 37, 75).
Now, consider a search for the second id (400) of A1. Any list intersection algorithm would consider as the search interval the range (6, 100] of positions in A2 and the range (9, 150] of positions in A3. Moreover, if the probing order is determined using the remaining unexplored portion of each segment, then A2 is probed first and A3 second. Instead, the DP algorithm operates in a less conservative manner, exploiting the entries in Val and Pos to refine the search interval for id 400. By comparing the id searched for with the ids stored in Val, it sets the search interval to be the range [25, 50] of positions in A2 and the range [18, 37] of positions in A3. Furthermore, since the search interval in A3 is smaller, A3 is probed first and A2 second.
Figure 1: Example of Dynamic Probes algorithm.
The entries in arrays Val and Pos are updated, using binary search, in two cases: a) when the length of the search interval (number of elements) is larger than a threshold TDP, and b) when, for a given eliminator e and a list Ai, it is not possible to identify a search interval in Ai using the entries in Val and Pos (lines 10-13). The latter case occurs when the value of e is larger than all the values in the i-th row of Val. In that case, the remaining unexplored portion of Ai determines the search interval for e.
For example, consider searching for id 800 of A1 in Figure 1. Id 800 is larger than all the ids stored in Val for both A2 and A3. Hence, we set lo_posi(e) to be the maximum of the position in Ai where the search for the predecessor of e halted and the last entry in the i-th row of Pos (the maximum position in the i-th row). We set hi_posi(e) to be the length of list Ai, |Ai|. Thus, for id 800, the search interval is the range [50, 100] of positions in A2 and the range [75, 150] of positions in A3. When probing list Ai for e in this case, we perform a binary search and we update the entries in the i-th row of Val and Pos to benefit subsequent probes (lines 17-18); here, a binary search is performed in A2 and A3 for id 800. In Section 5 we describe how the value of TDP is determined.
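A sketch of the micro-index lookup in C, simplified from the update rules described above: val/pos hold one row of Val/Pos in increasing id order. On the running example, e = 400 against the A2 row {150, 200, 750}/{12, 25, 50} yields [25, 50], and e = 800 falls through to the caller's fallback.

#include <stddef.h>

/* Refine the search range for eliminator e in one list using the
 * cached ids (val) and positions (pos) of earlier binary searches.
 * Returns 1 with [*lo, *hi] set, or 0 when e is larger than every
 * cached id; in that case the caller searches the unexplored suffix
 * of the list with a binary search and refreshes the row. (The full
 * algorithm also uses the position where the previous search halted,
 * omitted here.) */
int refine_interval(const int *val, const size_t *pos, size_t m,
                    int e, size_t *lo, size_t *hi)
{
    size_t j = 0;
    while (j < m && val[j] < e)     /* find first cached id >= e */
        j++;
    if (j == m)
        return 0;                   /* e beyond the micro-index */
    *lo = (j == 0) ? 0 : pos[j - 1];
    *hi = pos[j];
    return 1;
}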
4.1.2 Quantile-based Algorithm
When we are dealing with real-life documents such as news articles or blogs, we may often encounter situations in which the distribution of document identifiers (ids) across lists is highly non-uniform. Demaine et al. referred to this non-uniformity as "burstiness" and argued that, although adaptive list intersection algorithms may be theoretically superior to static algorithms, in practice the effectiveness of adaptivity depends on the burstiness of the actual data [14]. Based on these findings, we try to detect as early as possible lists with a non-uniform distribution of document identifiers and select a good probing order in advance, to eliminate the overhead of adaptivity.
In order to determine the existence of lists with a non-uniform distribution of ids, some form of data inspection is required. When a serial list intersection algorithm is employed, inspecting the lists introduces notable overhead that offsets any benefit attained. In contrast, computing the intersection in a CMP environment provides a unique opportunity to examine the lists during the execution of a load balancing algorithm. The load balancing algorithms
presented in Section 3 generate a statistical description of the lists in the form of approximate quantiles.

Algorithm 2 Dynamic Probes
Input: k sorted lists A1, . . . , Ak; threshold TDP
Output: The elements in the list intersection
1: Sort the lists by length (|A1| ≤ |A2| ≤ . . . ≤ |Ak|).
2: Choose the eliminator e to be the first element in A1.
3: for each list Ai, i = 2 . . . k do
4:   Perform a binary search for element e and store the values and the positions of the elements "touched" by the binary search in arrays Val and Pos, respectively.
5: end for
6: Set new eliminator e = succ(e).
7: while the eliminator e ≠ ∞ do
8:   for each list Ai, i = 2 . . . k do
9:     Using Val and Pos, find the search interval of e in list Ai.
10:    if a search interval is not found for list Ai using Pos and Val then
11:      Set the maximum possible search interval of Ai.
12:      Mark Ai for binary search.
13:    end if
14:  end for
15:  Sort the lists by increasing search-interval length.
16:  for i = 2 . . . k do
17:    if Ai was marked for binary search or the search interval is larger than TDP then
18:      Perform binary search for e in Ai and update the Pos and Val entries.
19:    else
20:      Perform interpolation search in the search interval of Ai.
21:    end if
22:    if e is found in all k lists then
23:      Output e.
24:    end if
25:  end for
26:  Set new eliminator e = succ(e).
27: end while
In this section, we present a list intersection algorithm, termed Quantile-based (QB), that utilizes quantiles to detect lists with a non-uniform distribution of ids. QB recursively splits the partitions created by the load balancing algorithm into sub-partitions for which it determines a "good" probing order. The probing order of a partition remains fixed while that partition is being processed, thus avoiding the overhead of adaptivity. Pseudo-code for the QB algorithm is presented in Algorithm 3.
Let us illustrate the QB algorithm with a simple example. In Figure 2, we present parts of two sorted lists A1, A2 that belong to a partition p. We use the same notation (Ai) to refer to a list and to the part of that list that belongs to a partition. The median of the ids in p is 7. Given the position of the median in each list, it is evident that the distribution of ids in A1, A2 is non-uniform; A1 contains six elements with values smaller than 7, while A2 contains only two such elements. Now, consider how different probing policies would work in this setting. The adaptive probing policy [14] performs 10 probes to compute the intersection (shown in Figure 2). A probing policy with a fixed probing order requires either 9 or 10 probes, depending on the order used.
Algorithm QB, on the other hand, utilizes the median to detect the existence of non-uniformity in the distribution of ids (lines 3-8). The median splits every list Ai in two sub-lists A_i^1 and A_i^2; e.g., A1 is divided into A_1^1 = {1, 2, 3, 4, 5, 6, 7} and A_1^2 = {8, 11, 13}. For every list, QB compares the length difference of its sub-lists with a threshold TQB (the non-uniformity condition) to identify non-uniformity in the distribution of ids. In Section 5 we demonstrate how the value of TQB is determined.
If at least one list Ai satisfies the non-uniformity condition, i.e., ||A_i^1| − |A_i^2|| > TQB, QB divides p into two sub-partitions p1, p2 (line 6). p1 consists of sub-lists A_1^1, A_2^1 with elements smaller than or equal to the median (shown in Figure 2). Accordingly, p2 consists of sub-lists A_1^2, A_2^2 with elements larger than the median. For each sub-partition pi, QB greedily selects a probing order by sorting the sub-lists of pi by increasing length (A_2^1 → A_1^1 for p1 and A_1^2 → A_2^2 for p2), resulting in 6 probes.
Figure 2: Different probing policies in two segments with non-uniform distributions of document identifiers.
In general, the same procedure is applied recursively to every newly created sub-partition until either a threshold on the maximum number of generated partitions allowed is exceeded or none of the lists satisfies the non-uniformity condition (lines 2-10). The list intersection is computed in every (sub-)partition using the elements of the current smallest unexplored list as eliminators and probing the remaining lists in the probing order derived (lines 11-12). Each probe is performed using interpolation search.
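The non-uniformity test at the heart of QB can be sketched in C as follows; lower_bound is an assumed standard binary-search helper returning the index of the first element not smaller than its key, and t_qb stands for the threshold TQB above.

#include <stddef.h>

/* First index whose element is >= key (assumed helper). */
size_t lower_bound(const int *a, size_t n, int key);

/* Returns 1 if some list of the partition is split by the median
 * into two halves whose lengths differ by more than t_qb, i.e. the
 * partition should be split into two sub-partitions. */
int should_split(const int *const lists[], const size_t lens[],
                 size_t k, int median, size_t t_qb)
{
    for (size_t i = 0; i < k; i++) {
        size_t left  = lower_bound(lists[i], lens[i], median + 1);
        size_t right = lens[i] - left;          /* elements > median */
        /* left counts elements <= median */
        size_t diff  = (left > right) ? left - right : right - left;
        if (diff > t_qb)
            return 1;                           /* non-uniform list found */
    }
    return 0;
}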
Algorithm 3 Quantile-based
Input: ε-approximate quantile summary Q; k lists A1, . . . , Ak of partition P; threshold TQB
Output: The elements in the list intersection
1: enqueue(P); subpartitions = 0
2: while the queue is not empty && subpartitions < maximum number of sub-partitions allowed do
3:   P′ = dequeue()
4:   Compute the approximate median m of P′ using Q.
5:   Search for m in each list Ai, i = 1, . . . , k, of P′.
6:   if at least one list of P′ is split by the median into sub-lists whose length difference is larger than TQB then
7:     Split P′ into two sub-partitions P1, P2 using m.
8:     enqueue(P1); enqueue(P2); subpartitions += 2
9:   end if
10: end while
11: Sort the lists of each (sub-)partition by increasing length to determine its probing order.
12: Use the elements of the current smallest unexplored list as eliminators. Probe the remaining lists using interpolation search.
4.2 The Case of Unsorted Lists
When the lists are unsorted, one way to compute the intersection would be to sort the lists and then apply any algorithm for sorted lists. However, sorting imposes an overhead. In this section we present Hash-based, an algorithm that computes the intersection of unsorted lists in a CMP context. It is a cache-conscious algorithm that uses hashing and avoids the overhead of sorting.
4.2.1 Hash-based Algorithm
The Hash-based (HB) algorithm utilizes the Partition Unsorted algorithm (Section 3.2.2) to partition the lists. Furthermore, it exploits the partial sorting of the lists that Partition Unsorted performs to locate the elements of each partition at low cost, without examining all the elements in the lists. Then, hashing is utilized to identify the elements of the intersection.
Algorithm 4 presents pseudo-code for the HB algorithm. The Partition Unsorted technique (Section 3.2.2) is utilized to compute an ε-summary Q of the elements in the k unsorted lists A1, . . . , Ak (line 1). Using Q, the lists are divided into P partitions, where P = ⌈N·C/|L2|⌉ (line 2). The size of each partition p is chosen such that C partitions can fit in L2 simultaneously, i.e., |p| = |L2|/C. Note that, since the lists are unsorted, at this point we only know the boundary values (lo(p), hi(p)) of p. Next, the P partitions are assigned to cores so that each core processes approximately the same number of partitions, i.e., P/C.
The next step, for a core processing a partition p, is to locate the positions of the elements that belong to p in the lists. The naive approach is to read all the elements from the lists and filter out those that do not belong to p. However, this approach imposes an overhead, which is the cost of reading all the elements in the lists. Instead, the HB algorithm exploits the partial sorting of the lists (performed by Partition Unsorted) to locate the elements of p without reading undesirable elements, i.e., elements that belong to other partitions.
Recall that the Partition Unsorted algorithm generates, along with a quantile summary Q, a number ml = ⌈|Al|·C/|L2|⌉ of sorted runs per list Al (r1, . . . , rml), each containing |L2|/C elements. The core that processes partition p utilizes these sorted runs to locate the elements of p as follows (lines 4-8). In every sorted run of each list, it performs two binary searches for the boundary values of p (lo(p), hi(p)). For a sorted run r, the two binary searches determine the range of positions in r ([lo_pos(p), hi_pos(p)]) containing the elements that belong to p.
In Figure 3, we present the execution of the HB algorithm for two unsorted lists A1, A2 using a CMP with two cores. Let us assume for simplicity that the number of partitions is two (P = 2) and that the boundary values are (1, m) for P1 and (m, MAX_ID) for P2, where m is the median of the union of A1, A2 and MAX_ID is the maximum document identifier in A1, A2.

Figure 3: Example of Hash-based algorithm for two unsorted lists on a CMP with two cores.
Assume that every list in Figure 3 consists of two sorted runs, produced by the Partition Unsorted algorithm. Furthermore, assume that core 1 processes partition P1 and core 2 processes partition P2. Core 1 performs a binary search for the median m on each sorted run of A1 (r_1^1, r_2^1) and on each sorted run of A2 (r_1^2, r_2^2). The position of m in a sorted run r divides the elements of r into two partitions, P1 (denoted in black in the figure) and P2 (denoted in white). The elements of A1 that belong to P1 are distributed over two sorted runs, i.e., the black parts of r_1^1 and r_2^1. Accordingly, the elements of A2 that belong to P1 are distributed over the two sorted runs of A2 (the black parts of r_1^2 and r_2^2).
Next, we describe how a partition p is processed to compute the intersection. We refer to the set of elements of list Ai that belong to partition Pj as segment s_i^j. Let s1, . . . , sk be the segments of p (the superscript is omitted for brevity). Initially, the segments are sorted by increasing length, |s1| ≤ . . . ≤ |sk| (line 10). Algorithm HB uses the elements of the smallest segment, s1, to build a hash table, and the elements of the remaining segments, s2, . . . , sk, to probe that hash table and compute the intersection (lines 11-20). The entries of the hash table are <id, counter> pairs. The id is used as the key and the counter is the payload, recording the number of lists in which this id has been found; the counter is initialized to one. Every probe that finds an existing id increments the corresponding counter. When the value of the counter becomes equal to the number of lists, the corresponding id belongs to the intersection. In Figure 3, core 1 (processing partition P1) utilizes segment s_2^1 to build a hash table and segment s_1^1 to probe the hash table and compute the intersection (assuming |s_2^1| < |s_1^1|). In a similar fashion, core 2 processes partition P2.
Since the size of each partition is chosen appropriately, the cores do not compete for L2 while the intersection is computed. With high probability, each hash table will reside in the L2 cache and hash probes will not cause any off-chip memory accesses. The cores experience only compulsory misses, while reading the elements of a partition, and conflict misses due to cache associativity.
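A sketch of the build and probe phases for one partition, in C. Open addressing with linear probing, the hash function, and the requirement that cap exceed |s1| are our choices for illustration, not prescribed by the paper.

#include <string.h>
#include <stddef.h>

typedef struct { int id; int counter; } slot_t;   /* <id, counter> entry */

/* Slot holding id, or the first empty slot on its probe path.
 * cap must exceed the number of inserted keys so probing terminates. */
static size_t slot_for(const slot_t *h, size_t cap, int id)
{
    size_t s = ((size_t)(unsigned)id * 2654435761u) % cap;
    while (h[s].counter != 0 && h[s].id != id)
        s = (s + 1) % cap;                        /* linear probing */
    return s;
}

/* Build from the smallest segment s1, probe with s2..sk; an id whose
 * counter reaches k is in the intersection. Ids are unique per list. */
size_t intersect_partition(const int *segs[], const size_t lens[],
                           size_t k, slot_t *h, size_t cap, int *out)
{
    size_t n = 0;
    memset(h, 0, cap * sizeof(slot_t));
    for (size_t j = 0; j < lens[0]; j++) {        /* build phase */
        size_t s = slot_for(h, cap, segs[0][j]);
        h[s].id = segs[0][j];
        h[s].counter = 1;
    }
    for (size_t i = 1; i < k; i++)                /* probe phase */
        for (size_t j = 0; j < lens[i]; j++) {
            size_t s = slot_for(h, cap, segs[i][j]);
            if (h[s].counter == (int)i) {         /* seen in lists 1..i */
                h[s].counter++;
                if (h[s].counter == (int)k)
                    out[n++] = h[s].id;           /* in all k lists */
            }
        }
    return n;
}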
Algorithm 4 Hash-based
Input: N elements in total in k lists A1, . . . , Ak; C cores; ml = ⌈|Al|·C/|L2|⌉ sorted runs r1, . . . , rml per list Al
Output: The elements in the list intersection
1: Compute an ε-approximate quantile summary Q using the Partition Unsorted algorithm.
2: Using Q, divide the lists into P approximately equal-sized partitions, where P = ⌈N·C/|L2|⌉.
3: while ∃ unprocessed partitions do
4:   Process the next unprocessed partition p. // Each core processes a disjoint set of partitions.
5:   for l = 1 to k do
6:     for j = 1 to ml do
7:       Search for the boundary values of p in r_j^l. // The output is a range of positions in r_j^l, [lo_pos(p), hi_pos(p)].
8:     end for
9:   end for // The segments s1, . . . , sk of p have been computed.
10:  Sort the segments of p by increasing length (|s1| ≤ . . . ≤ |sk|).
11:  Build a hash table h using the elements of s1 (the smallest segment).
12:  for i = 2 to k do
13:    Probe h using the elements of si.
14:    if a match e is found then
15:      Increment counter(e).
16:      if counter(e) == k then
17:        Output element e.
18:      end if
19:    end if
20:  end for
21: end while

5. EXPERIMENTAL EVALUATION
In this section we present a series of experiments on sorted and unsorted lists using a chip multiprocessor with eight cores. We compare the Dynamic Probes (DP) and the Quantile-based (QB) algorithms with Small Adaptive (SA) [14], Probabilistic Probes (PP) [27], and Linear Merge (LM). The latter computes the intersection of sorted lists using merging, similar to the execution of a sort-merge join algorithm. Alternatively, one could view the problem of intersecting k lists as a special case of joining k relations and compare the performance of the DP and QB algorithms against static AND-trees. However, as Raman et al. demonstrated, adaptive list intersection algorithms perform in most cases at least as well as the best AND-tree and are not prone to cardinality estimation errors [27]. Hence, we compare our algorithms only against other adaptive list intersection algorithms.
For unsorted lists, we evaluate the performance of the Hash-based (HB) algorithm. Due to the lack of a multi-way hash join algorithm for CMPs, we compare HB against an adaptation of a parallel and cache-conscious sorting algorithm [15] for CMPs (PS for brevity). PS is a variant of AlphaSort [26] that utilizes an exact partitioning technique [22]. The sorting algorithm can be used in conjunction with any algorithm for sorted lists. We experiment with synthetically generated datasets as well as with real data. Section 5.1 discusses the datasets, the hardware configuration, and implementation details. The results are presented in Section 5.2.
5.1 Methodology
We experiment with real and synthetic datasets. Our synthetic data capture uniform and correlated distributions of document identifiers. Uniform synthetic data were generated following the guidelines in [17]. For real data we use a sample corpus from a web search engine (www.blogscope.net) [6]. The size of the corpus is 3.1GB of raw text and the query log contains 1000 multi-keyword queries.
All the experiments were conducted on a Dell PowerEdge 2950 server with dual quad-core Intel E5355 CPUs (8 cores) running at 2.6GHz. It has a 4MB L2 cache shared per quad-core (8MB in total) and each core has a private 64KB L1 cache. It runs the Linux operating system with kernel 2.6.15 and has 32GB of main memory. All the algorithms were implemented in C using POSIX threads (Pthreads). We used the Intel icc-10 compiler with optimization level O3. In all the experiments, the sched_setaffinity function of the Linux OS is employed to pin the threads to specific cores and ensure that they are not re-scheduled during the execution of the algorithms.
5.2 Experimental Results
In all the experiments we measured the wall-clock time required to read the lists from main memory, conduct partitioning using the quantile algorithms (for sorted and unsorted lists), compute the intersection, and write the intersection result to memory. For the case of unsorted lists, where we evaluate intersection time utilizing sorting, the time includes sorting the lists, computing the partitioning using the quantile algorithm, computing the intersection using the best algorithm for sorted lists, and writing the intersection result to memory. All the numbers reported are averages over 100 runs. The standard deviation for all the algorithms was less than 2%.
The maximum possible size of the intersection of k lists is the length of the smallest list. We consider the selectivity of a list intersection to be the ratio of the size of the intersection to the maximum possible size of the intersection; e.g., if the smallest list contains one million ids and the intersection contains 10,000 ids, the selectivity is 1%. Unless stated otherwise, all the lists in the synthetic datasets have the same size, containing one million elements. Each element is a four-byte integer. In Table 1, we present the range of parameter values examined in our experiments.
List size: 1 - 100 million ids
Cores: 1 - 8
Lists: 2 - 16
Table 1: Range of parameter values in our experimental design.
All the algorithms examined utilize the list partitioning techniques presented in Section 3 to achieve load balancing. In all the experiments that follow, we set the error threshold at ε = 0.01. The maximum load disparity across cores observed in all the experiments conducted (using eight cores) never exceeded 10%.
5.2.1 Sorted Lists
In the first set of experiments, we assess the performance of the list intersection algorithms at different selectivities and different numbers of sorted lists using uniform synthetic data. Figures 4, 5, and 6 present the results for 4, 8, and 16 lists, respectively, when four cores are utilized. We present the normalized time of each algorithm with respect to LM. The latter exhibits the worst performance, with running times reaching 0.3 seconds (for 16 lists and 4 cores).
For four sorted lists (Figure 4), when the selectivity is small (1%), SA and QB have approximately the same performance, with a small advantage for the latter (10%). SA cycles repeatedly through the lists only when a match is found in the two smallest lists currently being intersected. When the selectivity is small, the frequency at which matching elements are found is proportionally small too. Hence, SA changes its probing order less frequently and does not pay the overhead of cycling through the lists. For larger selectivities, though, where matching elements are discovered more frequently, SA acts more adaptively, continuously updating its probing order. That has a negative impact on its performance compared to algorithms that update their probing policy more effectively, such as DP and QB. As the number of lists increases (Figures 5, 6), the overhead of repeatedly cycling through a large number of lists results in SA performing worse than DP and QB even for small selectivities.
For small selectivities and small number of lists SA performs
better than DP. However, as selectivity increases (more than 10% in
our experiment), DP performs better than SA and the performance
difference between these two algorithms is more pronounced. For
every eliminator, DP consults the micro-index in order to refine the
search range for every list and to determine the best probing order.
Even though the micro-index is cache-resident, there is still some
overhead involved in this procedure. On the other hand, when selectivity is small, SA discards most of the eliminators by performing a single probe on the second smallest list. Hence, it is more
efficient than DP. However, as selectivity increases and the lists are
probed more frequently, the use of the micro-index starts paying
off and DP performs better than SA.
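To make the role of the micro-index concrete, the following is a minimal sketch under our own assumptions (the paper gives no code): the index stores every S-th element of a list together with its position, and for each eliminator it narrows the range that the search on the full list has to cover.

#include <algorithm>
#include <cstdint>
#include <vector>

// Cache-resident micro-index: every S-th element of a list and its
// position (construction omitted; S is a sampling step we assume).
struct MicroIndex {
    std::vector<uint32_t> sample;
    std::vector<size_t> pos;
};

// Position of the first element >= eliminator e in 'list', using the
// micro-index to restrict the range searched in the full list.
size_t probe(const std::vector<uint32_t>& list, const MicroIndex& idx,
             uint32_t e) {
    // Binary search on the small, cache-resident sample first.
    auto it = std::lower_bound(idx.sample.begin(), idx.sample.end(), e);
    size_t i = it - idx.sample.begin();
    size_t lo = (i == 0) ? 0 : idx.pos[i - 1];
    size_t hi = (it == idx.sample.end()) ? list.size() : idx.pos[i];
    // Only the narrowed range of the full list is then searched.
    return std::lower_bound(list.begin() + lo, list.begin() + hi, e) -
           list.begin();
}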
For small selectivities, DP and QB have approximately the same performance. However, for large selectivities the DP algorithm performs worse than QB. The latter determines a good probing order at low cost, and that order remains fixed during execution. The benefit of QB's simple probing policy is more pronounced for larger selectivities and larger numbers of lists, where more adaptive algorithms such as SA, DP and PP incur large overheads.
Algorithm LM has the worst performance overall for the range of selectivities and numbers of lists examined. On the other hand, unlike the other intersection algorithms it reads the lists sequentially, resulting in better utilization of cache lines. Furthermore, the sequential memory access pattern of LM is detected by the hardware prefetcher of the Intel architecture. For a small number of lists, this sequential access pattern compensates for the larger number of comparisons LM performs (compared to the other intersection algorithms); for two sorted lists the performance of LM is equivalent to that of the other list intersection algorithms (results omitted due to space constraints). As the number of sorted lists increases, the execution time of LM becomes dominated by the excessive number of comparisons, and the performance gap between LM and the rest of the algorithms grows proportionally.
The PP algorithm, although it performs significantly better than
LM, is 1.6X slower than QB and 1.44X slower than SA for four
lists and small selectivities (1%). The PP algorithm repeatedly cycles through the lists with some probability p and chooses the next
“promising” list to probe with probability 1 − p. Unless there is
a probing order that is highly beneficial, PP will behave like the
Adaptive algorithm [13]; the latter has been shown to be inferior to
SA. The results in Figures 4, 5, 6 demonstrate that the PP algorithm
is consistently worse than SA regardless of selectivity and number
of lists.
Figure 4: Performance of intersection algorithms for four
sorted lists with synthetic data using four cores.
Figure 5: Performance of intersection algorithms for eight
sorted lists with synthetic data using four cores.
Figure 6: Performance of intersection algorithms for sixteen
sorted lists with synthetic data using four cores.
We conducted the same set of experiments utilizing eight cores.
Figure 7 presents the results for sixteen lists (normalized time with
respect to LM). The results for smaller numbers of lists are consistent with those presented in Figures 4 and 5. The most important
observation is that, when eight cores are utilized, the relative performance of the intersection algorithms does not change, indicating
the effectiveness of the load balancing technique applied.
Next, we wanted to verify that our results are consistent when
larger lists are intersected. Figure 8 presents the performance of
list intersection algorithms for four lists when every list contains
100 million elements; four cores are used in this experiment. The
results presented in Figure 8 are similar to those in Figure 4, verifying that our results are consistent irrespective of list size.
Figure 7: Performance of intersection algorithms for sixteen sorted lists with synthetic data using eight cores.

Figure 8: Performance of intersection algorithms for four sorted lists with synthetic data using four cores when each list contains 100 million elements.

5.2.1.1 Anti-correlated Data.
In all experiments presented thus far, the document identifiers that belong to the intersection were positioned randomly in each list. In the next experiment, we study the performance of intersection algorithms for sorted lists in the case where the positions of the document identifiers that belong to the intersection are anti-correlated. This means that these document identifiers are clustered (placed close to each other in each list) and the clusters appear at different positions in each of the lists. The position of each cluster is selected randomly.
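One plausible way to synthesize such placements is sketched below; this is our own assumption, as the paper does not specify its generator. The shared (intersection) ids form a contiguous run of values, and each list draws its remaining ids so that this run lands at a randomly chosen rank.

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Build one sorted list whose shared ids appear as a cluster at a
// random rank r: draw r filler ids below the shared run and the rest
// above it. Assumes shared.front() > 0 and shared.back() < max_id;
// duplicate filler ids are possible (deduplication omitted).
std::vector<uint32_t> make_clustered_list(
        const std::vector<uint32_t>& shared, size_t n_filler,
        uint32_t max_id, std::mt19937& gen) {
    std::uniform_int_distribution<size_t> pick_rank(0, n_filler);
    size_t r = pick_rank(gen);  // cluster position differs per list
    std::uniform_int_distribution<uint32_t> below(0, shared.front() - 1);
    std::uniform_int_distribution<uint32_t> above(shared.back() + 1, max_id);

    std::vector<uint32_t> list;
    for (size_t i = 0; i < r; ++i) list.push_back(below(gen));
    list.insert(list.end(), shared.begin(), shared.end());
    for (size_t i = r; i < n_filler; ++i) list.push_back(above(gen));
    std::sort(list.begin(), list.end());  // lists are sorted by id
    return list;
}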
Figure 9 presents the results for 16 sorted lists with anti-correlated data when eight cores are utilized. LM is not included in these results, as its performance does not depend on the distribution of the document identifiers in the lists. We present the normalized time of each algorithm with respect to PP; its running times reach 0.03 seconds. DP and QB perform consistently better than SA and PP. As selectivity increases, the performance advantage of the proposed algorithms over SA and PP becomes more pronounced. QB performs consistently better than DP and SA, as it is able to detect the non-uniformity in the lists and select a good probing strategy from the beginning. For large selectivities and large numbers of lists (50% selectivity in our experiments), QB is 2.5X faster than DP and 6X faster than SA and PP. The anti-correlated document identifiers degrade the performance of SA due to its conservative round-robin probing policy. PP is slightly worse than SA. These results are consistent for different numbers of lists as well as for correlated data (omitted due to space constraints). In all cases examined, QB is the algorithm of choice.

Figure 9: Performance of intersection algorithms for sixteen sorted lists with anti-correlated data using eight cores.

5.2.1.2 Real Data.
The performance comparison of intersection algorithms as a function of the number of sorted lists using real data is presented in Figure 10; four cores were utilized. For each algorithm we measured the average time to execute a set of queries with the same number of keywords. We present the normalized average time of each algorithm with respect to PP; its running times reach 0.05 seconds. One characteristic of real data distributions on inverted lists is that the resulting selectivities of intersections are low. Hence, in practice the LM algorithm is not competitive. SA performs better than DP for small numbers of lists, but as the number of lists increases, the conservative probing policy of SA reduces its performance. QB has the best performance, and its advantage over the other algorithms grows with the number of lists. Notice that for 6 lists QB is 1.6X faster than SA and PP; the latter has approximately the same performance as SA.
Figure 10: Performance of intersection algorithms for different numbers of sorted lists with real data using four cores.
5.2.1.3 Speedup.
In the next set of experiments we measure the speedup of the
proposed algorithms for sorted lists as well as the performance of
the parallel partitioning algorithm. In Figures 11 and 12 we demonstrate the performance of DP and QB algorithms, respectively, for
computing the intersection of eight lists as we increase the number
of cores; selectivity is 10%. We present the normalized running
time with respect to the case when a single core is utilized (serial
algorithm). We also present the fraction of time spent on the partitioning technique.
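In symbols (our notation, not the paper's), with Tp denoting the wall-clock time on p cores, the reported normalized time is Tp/T1 and the corresponding speedup is T1/Tp.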
Both algorithms achieve adequate speedup as we increase the
number of cores from one to four, verifying the effectiveness of the
proposed partitioning technique. However, we observe that as we
increase the number of cores to eight, the reduction in total running time of the DP algorithm is merely 27% compared to the case
where four cores are utilized. Interestingly, the intersection time itself is reduced by approximately 50%, indicating the ability of the partitioning technique to balance the load evenly. The problem, as we increase the number of cores, is the increase in the partitioning cost. Recall that our partitioning technique is tailored to CMP architectures and relies on the fact that the cores share the L2 cache. However, the hardware that we utilized contains two quad-core processors which communicate through main memory. This has the following implication: in the last step of the partitioning technique, a single core produces the final ε-summary by merging a number of ε-summaries. Since each quad-core has its own shared L2 cache, half the accesses to the quantile summaries have to be resolved through main memory, thus increasing the partitioning cost.
In CMP architectures in which all the cores share the L2 (or L3) cache, all the quantile summaries will reside in the same cache hierarchy (as happens when four cores are utilized in our experiment) and the partitioning cost will be significantly lower. The behavior of the QB algorithm (presented in Figure 12) is analogous.
Figure 11: Performance of the DP algorithm as we increase the number of cores.

Figure 12: Performance of the QB algorithm as we increase the number of cores.

Figure 13: Impact of threshold TQB on the performance of the QB algorithm on real data; panels (a), (b) and (c) correspond to 2, 3 and 4 lists, respectively.
5.2.1.4 Micro-benchmarks.
We ran several micro-benchmarks to identify the best values for the thresholds used in algorithms QB and DP. Recall that algorithm QB uses threshold TQB in a non-uniformity condition to decide whether a partition should be divided into smaller sub-partitions, while DP compares the length of the search interval currently examined to threshold TDP to decide whether binary search should be applied to update the cache-resident index.
We measured the running time of the QB algorithm using real
data for increasing number of lists as a function of the value of
TQB . We report the average of 100 runs in Figure 13 for two, three
and four lists. As observed, a single global optimal value for TQB
does not exist; the value of TQB that yields the best performance depends on the number of lists intersected. A second observation is that the optimal value of TQB shifts towards smaller values as the number of lists increases: with more lists, a wrong decision in the probing policy has an even larger (negative) impact on performance, and for small values of TQB, QB is less conservative, changing its probing order more often to handle fluctuations in the distribution of document identifiers.
The results for larger numbers of lists are consistent with those presented here (omitted due to space constraints). We set the value of TQB in our experiments according to these micro-benchmarks; namely, for each experiment involving k lists, we set TQB to the best value observed in the micro-benchmarks for k lists.
Similar trends are observed in the micro-benchmarks conducted
for the DP algorithm (omitted due to space constraints). The optimal value of TDP likewise shifts towards smaller values as the number of lists increases. For small values of TDP, DP updates its cache-resident index more often and probes the lists more effectively. As with TQB, in our experiments we set TDP to the values yielding the best performance in these micro-benchmarks.
5.2.2 Unsorted Lists
In this section we evaluate the performance of intersection algorithms for unsorted lists using real and synthetic data. We compare
the performance of Hash-based (HB) with that of sorting (PS) in
conjunction with the best intersection algorithm for sorted lists; the
QB algorithm is used in all the experiments. The selectivity of the
intersection does not affect the performance of HB and PS algorithms. Their cost depends solely on the number of elements in the
lists. Hence, in all the experiments we fix the selectivity at 10%.
In Figure 14 we present a performance comparison of HB and PS
in conjunction with the QB algorithm as we increase the number of
unsorted lists from 2 to 16, when eight cores are utilized. All the
lists have the same length (1 million unique ids).
As is evident in Figure 14, HB performs consistently better than the combination of PS and QB for different numbers of unsorted lists. HB utilizes the Partition Unsorted algorithm to leave the data partially sorted and computes the intersection without sorting the lists completely. Each core reads only the elements of the partition currently processed; hence, the elements of the lists are read only once. However, HB pays the overhead of building and probing hash tables. We reduce this overhead by building the hash tables from the elements of the smallest segments. Moreover, since the partitions are cache-resident, probing the hash tables incurs a smaller overhead than the cost of sorting the lists (performed by PS). In addition to reading the elements again to perform the final merge phase, PS pays the overhead of storing the final sorted run in memory. Finally, we note that the cost of computing the intersection using QB is relatively small compared to the cost of sorting the lists.
Figure 14: Performance of intersection algorithms for different numbers of unsorted lists with synthetic data using eight cores.
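As an illustration of the per-partition step just described, the following is a minimal sketch under our own assumptions (the paper gives no code, and the counting scheme is ours): a hash table is built on the smallest segment of the partition and probed with the remaining segments, so an id survives only if it appears in every segment.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Intersect the co-partitioned segments of k unsorted lists.
// Assumes ids are unique within each list, as in our datasets.
std::vector<uint32_t> intersect_partition(
        const std::vector<std::vector<uint32_t>>& segments) {
    // Build on the smallest segment to keep the hash table cache-resident.
    size_t smallest = 0;
    for (size_t i = 1; i < segments.size(); ++i)
        if (segments[i].size() < segments[smallest].size()) smallest = i;

    // Count, per id, how many segments it has been seen in so far.
    std::unordered_map<uint32_t, uint32_t> seen;
    for (uint32_t id : segments[smallest]) seen.emplace(id, 1);

    uint32_t round = 1;
    for (size_t i = 0; i < segments.size(); ++i) {
        if (i == smallest) continue;
        for (uint32_t id : segments[i]) {
            auto it = seen.find(id);  // probe the cache-resident table
            if (it != seen.end() && it->second == round) ++it->second;
        }
        ++round;
    }

    // Ids seen in all k segments belong to the intersection.
    std::vector<uint32_t> result;
    for (const auto& [id, cnt] : seen)
        if (cnt == round) result.push_back(id);
    return result;
}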
A performance comparison of the algorithms for unsorted lists using real data, when eight cores are utilized, is presented in Figure 15. HB performs consistently better than the combination of sorting and QB (the best algorithm for sorted lists with real data), verifying the efficacy of HB in practice.

Figure 15: Performance of intersection algorithms for different numbers of unsorted lists with real data using eight cores.

In the next experiment we measure the speedup of the HB algorithm as well as the cost of the partitioning technique as we increase the number of cores (Figure 16). The first observation is the good speedup achieved by the HB algorithm, verifying the effectiveness of the partitioning technique. The second observation is that the total running time of HB is dominated by the partitioning cost and that this cost is significantly reduced as we increase the number of cores, demonstrating the necessity of a parallel partitioning technique. Note that, in contrast to the case of sorted lists, the partitioning technique is also employed when a single core is utilized, in order to improve cache locality.
In contrast to the speedup experiments of the DP and QB algorithms (Figures 11 and 12), when eight cores are utilized we achieve a significant reduction in the partitioning cost even though the cores do not operate on the same shared cache. The reason is that for unsorted lists this cost is dominated by the sorting phase, which is performed independently by each core.
Figure 16: Performance of the HB algorithm as we increase the
number of cores.
6. CONCLUSIONS
We have presented algorithms to compute the intersection of an
arbitrary number of sorted and unsorted lists tailored to commodity
chip-multiprocessors.
We first studied the problem of load balancing and presented cache-conscious algorithms for partitioning sorted and unsorted lists in a CMP context. Two new intersection algorithms were proposed for sorted lists: Dynamic Probes and Quantile-based. Dynamic Probes exploits information from previous probes as a cache-resident index. Quantile-based utilizes quantiles to detect lists with non-uniform distributions of document identifiers and to select, in advance and at low cost, a good probing policy. For unsorted lists we proposed a cache-conscious intersection algorithm, termed Hash-based, that computes the intersection of unsorted lists using hashing.
In a detailed experimental evaluation using real and synthetic
data on a CMP with eight cores, we demonstrate that our intersection algorithms for sorted and unsorted lists have superior performance and achieve very good speedup due to the effectiveness of
the load balancing strategy.
7. REFERENCES
[1] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a
modern processor: Where does time go? In VLDB, pp. 266–277,
1999.
[2] AMD. Multi-core processors - the next evolution in computing,
AMD white paper, 2005.
[3] R. A. Baeza-Yates. A fast set intersection algorithm for sorted
sequences. LNCS, 3109:400–408, 2004.
[4] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information
Retrieval. ACM Press / Addison-Wesley, 1999.
[5] R. A. Baeza-Yates and A. Salinger. Experimental analysis of a fast
intersection algorithm for sorted sequences. In SPIRE, pp. 13–24,
2005.
[6] N. Bansal and N. Koudas. Blogscope: A system for online analysis
of high volume text streams. In VLDB, pp. 1410–1413, 2007.
[7] J. Barbay and C. Kenyon. Adaptive intersection and t-threshold
problems. In SODA, pp. 390–399, 2002.
[8] J. Barbay, A. López-Ortiz, and T. Lu. Faster adaptive set intersections for text searching. LNCS, 4007:146–157, 2006.
[9] J. L. Bentley and A. C.-C. Yao. An almost optimal algorithm for
unbounded searching. Information Processing Letters, 5:82–87,
1976.
[10] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Inspector
joins. In VLDB, pp. 817–828, 2005.
[11] J. Cieslewicz and K. A. Ross. Adaptive aggregation on chip
multiprocessors. In VLDB, pp. 339–350, 2007.
[12] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Space- and time-efficient deterministic algorithms for biased quantiles over data streams. In PODS, pp. 263–272, 2006.
[13] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In SODA, pp. 743–752, 2000.
[14] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Experiments on adaptive set intersections for text retrieval systems. In ALENEX, pp. 91–104, 2001.
[15] D. J. DeWitt, J. F. Naughton, and D. A. Schneider. Parallel sorting on
a shared-nothing architecture using probabilistic splitting. In Parallel
and Distributed Information Systems, pp. 280–291, 1991.
[16] B. Gedik, P. S. Yu, and R. R. Bordawekar. Executing stream joins on
the cell processor. In VLDB, pp. 363–374, 2007.
[17] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J.
Weinberger. Quickly generating billion-record synthetic databases. In
SIGMOD, pp. 243–252, 1994.
[18] M. B. Greenwald and S. Khanna. Power-conserving computation of
order-statistics over sensor networks. In PODS, pp. 275–285, 2004.
[19] F. K. Hwang and S. Lin. Optimal merging of 2 elements with n
elements. Acta Informatica, 1(2):145–158, 1971.
[20] F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint
linearly ordered sets. SIAM Journal on Computing, 1(1):31–39, 1972.
[21] Intel. Intel multi-core processor architecture development backgrounder, white paper, 2005.
[22] B. R. Iyer, G. R. Ricard, and P. J. Varman. Percentile finding
algorithm for multiple sorted runs. In VLDB, pp. 135–144, 1989.
[23] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM POWER5 chip: A
dual-core multithreaded processor. IEEE Micro, 24(2):40–47, 2004.
[24] S. Manegold, P. A. Boncz, and M. L. Kersten. Generic database cost
models for hierarchical memory systems. In VLDB, pp. 191–202,
2002.
[25] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate
medians and other quantiles in one pass and with limited memory. In
SIGMOD, pp. 426–435, 1998.
[26] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A RISC machine sort. In SIGMOD, pp. 233–242, 1994.
[27] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and
F.-L. Ling. Lazy, adaptive RID-list intersection, and its application to
index anding. In SIGMOD, pp. 773–784, 2007.
[28] A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms
for relational query processing. In VLDB, pp. 510–521, 1994.