Processing Spatial Keyword Query as a Top-k Aggregation Query
Dongxiang Zhang
Chee-Yong Chan
Kian-Lee Tan
Department of Computer Science
School of Computing, National University of Singapore
{zhangdo,chancy,tankl}@comp.nus.edu.sg
ABSTRACT
We examine the spatial keyword search problem to retrieve objects
of interest that are ranked based on both their spatial proximity to
the query location as well as the textual relevance of the object’s
keywords. Existing solutions for the problem are based on either
using a combination of textual and spatial indexes or using specialized hybrid indexes that integrate the indexing of both textual and
spatial attribute values. In this paper, we propose a new approach that models the problem as a top-k aggregation problem, which enables the design of a scalable and efficient solution based on the ubiquitous inverted list index. Our performance study demonstrates that our approach outperforms the state-of-the-art hybrid methods by a wide margin.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process
Keywords
Spatial keyword search, Top-k aggregation
1. INTRODUCTION
The prevalence of smartphones and social network systems has enabled users to create, disseminate and discover content on the move. According to a recent study [2], Twitter has 164 million monthly active users accessing its site from mobile devices, while Facebook has 425 million mobile users doing so. Consequently, a tremendous amount of social content, including tweets, check-ins, POIs and reviews, is generated every day; notably, these data are tagged with geo-coordinates and form a large-scale geo-document corpus. Spatial keyword search is an important functionality for exploring useful information in a corpus of geo-documents and has been extensively studied for years [5, 13, 14,
16, 18, 22, 29, 33–36]. The work on spatial keyword search can
be broadly categorized into two classes: distance-insensitive and
distance-sensitive.
In the traditional distance-insensitive keyword search, the documents are organized based on a geographical ontology [5, 13, 18,
36]. For example, “University Ave” → “Palo Alto” → “South Bay” → “Bay Area” [5] is an example of a hierarchical path in a geographical ontology. Given a keyword query with a locational constraint, the constraint is used to restrict the document search within matching hierarchies of the ontology. Thus, the query’s locational
constraint here is used as a filtering criterion. In distance-sensitive
keyword search, each geo-document is associated with a precise
reference point¹ so that its spatial distance from a query location
can be calculated to evaluate its relevance in terms of spatial proximity. Thus, the query location here is used as a ranking criterion.
Distance-sensitive keyword search has many useful applications that allow a user to search for nearby locations/events/friends based on some keywords, with the matching results ranked by their proximity to the user’s location. For example, point-of-interest (POI) applications such as Foursquare and Yelp are very useful for searching for nearby venues or restaurants. As another example, location-based social discovery applications such as Tinder are very useful for searching for nearby users with mutual interests.
In this paper, we focus on the problem of top-k distance-sensitive
spatial keyword query search: given a spatial keyword query (with
a location and a set of keywords), the objective is to return the best k
documents based on a ranking function that takes into account both
textual relevance as well as spatial proximity. Figure 1 illustrates
a simple example of distance-sensitive spatial keyword search for
a corpus with seven geo-documents d1 , · · · , d7 . Each document is
associated with a location and a collection of keywords together
with their relevance scores. Consider a spatial keyword query Q
with keywords “seafood restaurant” issued at the location marked
by the star. In this example, document d2 is considered to match Q
better than document d4 because d2 is close to the query location
and it is highly relevant to the query keywords; in comparison, d4
does not match the query keywords well, though it is slightly closer
to Q than d2.
The efficiency of evaluating top-k distance-sensitive spatial keyword queries becomes critical for a large-scale geo-document corpus. Existing methods for this problem [14, 16, 22, 29, 35] have
shown that combining the spatial and textual attributes together
can effectively prune the search space. These methods can be divided into two broad categories: spatial-first and textual-first. In
the spatial-first strategy, a spatial data structure (such as an R-tree [10]) is first used to organize the geo-documents into spatial
regions based on their locations. Next, each spatial region is augmented with an inverted index to index all the documents contained
in that region. An example of a spatial-first scheme is the IR-tree [14]. On the other hand, in the textual-first strategy, the search
space of geo-documents is first organized by keywords, and then for
each keyword, a spatial data structure is employed to index the documents associated with that keyword. S2I [29] and I3 [35] are two schemes that follow this strategy. Recent extensive performance studies [29, 35] have demonstrated that both S2I and I3 are significantly faster than the IR-tree. As such, we consider S2I and I3 to be the state-of-the-art solutions for distance-sensitive spatial keyword search.

However, there are two key limitations with the state-of-the-art solutions. First, these solutions are not sufficiently scalable to cope with the demands of today’s applications. For example, Foursquare has over 60 million venues, and the number of geo-tweets published by mobile users per day is in the hundreds of millions. None of the existing solutions [29, 35] have been evaluated with such large-scale datasets; indeed, our experimental study reveals that the state-of-the-art solutions do not perform well for large datasets. Second, these solutions employ specialized “hybrid” indexes that integrate the indexing of both textual and spatial attributes, which adds complexity to their adoption in existing search engines in terms of implementation effort, robustness testing, and performance tuning.

In this paper, we propose a new approach to address the limitations of the state-of-the-art solutions. Our paper makes three key contributions. First, we present a simple but effective approach to solve the top-k distance-sensitive spatial keyword query by modeling it as the well-known top-k aggregation problem. Second, we propose a novel and efficient algorithm, named Rank-aware CA (RCA), which extends the well-known CA algorithm with more effective pruning. In contrast to the specialized hybrid solutions, our RCA algorithm is more practical as it relies only on the ubiquitous and well-understood inverted index. Third, we demonstrate that RCA outperforms the state-of-the-art approaches by a wide margin (up to 5 times faster) on a real dataset with 100 million geo-tweets.

The rest of the paper is organized as follows. We present the problem statement in Section 2 and review related work in Section 3. In Section 4, we present our new approach to model the problem and two baseline algorithms. We present our new algorithm, RCA, in Section 5. In Section 6, we evaluate the efficiency of RCA with an extensive performance study. Section 7 concludes the paper.

¹ The location information can be naturally derived from the GPS of smartphones.

Figure 1: An example spatial keyword search scenario. The query Q = (seafood, restaurant) is issued at the location marked by the star; the seven geo-documents carry the following weighted keywords: d1: (pizza, 0.6), (restaurant, 0.4); d2: (seafood, 0.9), (restaurant, 0.8); d3: (seafood, 0.2), (pizza, 0.5); d4: (noodle, 0.7), (seafood, 0.2); d5: (spicy, 0.8), (noodle, 0.5), (restaurant, 0.6); d6: (spicy, 0.4), (restaurant, 0.5); d7: (seafood, 0.1), (restaurant, 0.3).

2. PROBLEM STATEMENT

We model a geo-document as a tuple D = ⟨docID, x, y, terms⟩, where docID denotes the document id, (x, y) denotes the (longitude, latitude) location of the document, and terms denotes a collection of weighted terms. Each weighted term is of the form ⟨term_i, φt(D, term_i)⟩, where term_i denotes a term in document D and φt(D, term_i) denotes the textual relevance score between D and term_i. Here, φt denotes the tf-idf textual relevance scoring function [9, 25].

A top-k distance-sensitive spatial keyword query is denoted by Q = ⟨xq, yq, w1, w2, . . . , wm, k⟩, where (xq, yq) denotes the (longitude, latitude) location of the query, {w1, w2, . . . , wm} denotes the set of query keywords, and k denotes the number of top-matching documents to be returned for the query.

The relevance score between a document D and a query Q is evaluated by a monotonic aggregation of textual relevance and spatial proximity. The textual relevance is measured by a scoring function φt(D, Q) that sums up the relevance score of each query keyword:

φt(D, Q) = Σ_{wi ∈ Q} φt(D, wi)    (1)

If a document D does not contain any query keyword, we consider it irrelevant and set φt(D, Q) to −∞ to prevent it from appearing in the results. The spatial proximity is measured by φs(D, Q), which assigns a score based on the distance between D and Q such that a smaller spatial distance yields a higher φs(D, Q) value. In this paper, we use the same spatial ranking function as in [14, 29, 35]:

φs(D, Q) = 1 − d(D, Q) / γ    (2)

where d(D, Q) measures the Euclidean distance between the locations of the query Q and document D, and γ is a normalization parameter that keeps φs(D, Q) within [0, 1]; it is set to the maximum distance among all pairs of documents. Finally, as in [14, 16, 29, 35], we define the overall ranking function φ(D, Q) as a linear combination of textual relevance and spatial proximity:

φ(D, Q) = α · φt(D, Q) + (1 − α) · φs(D, Q)    (3)

Based on the ranking function φ, we can formally define the top-k spatial keyword query as follows:

DEFINITION 1 (Top-k spatial keyword query). Given a document corpus C, a top-k spatial keyword query Q retrieves a set O ⊆ C of k documents such that ∀D ∈ O and D′ ∈ C − O, φ(D, Q) ≥ φ(D′, Q).
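To make the ranking function concrete, the following Python sketch (ours, not part of the paper) implements Equations (1)-(3); the document representation as a keyword-to-score dictionary plus a coordinate pair, and the values of ALPHA and GAMMA, are illustrative assumptions.

import math

ALPHA = 0.3   # weight of textual relevance (the paper varies alpha from 0.1 to 0.9)
GAMMA = 10.0  # normalization constant: the maximum pairwise distance in the corpus

def phi_t(doc_terms, query_keywords):
    # Equation (1): sum the per-keyword relevance scores; -inf if no query keyword matches.
    matched = [doc_terms[w] for w in query_keywords if w in doc_terms]
    return sum(matched) if matched else float("-inf")

def phi_s(doc_xy, query_xy):
    # Equation (2): spatial proximity 1 - d(D, Q) / gamma.
    return 1.0 - math.dist(doc_xy, query_xy) / GAMMA

def phi(doc_terms, doc_xy, query_keywords, query_xy, alpha=ALPHA):
    # Equation (3): linear combination of textual relevance and spatial proximity.
    return alpha * phi_t(doc_terms, query_keywords) + (1 - alpha) * phi_s(doc_xy, query_xy)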
3. RELATED WORK
Spatial keyword search is a well-studied problem and can be categorized into two classes: distance-insensitive and distance-sensitive.
3.1 Distance-insensitive Approaches
The traditional local search engines such as Google Local [1]
and Yahoo! Local [3] belong to the distance-insensitive category.
In these systems, location information is first extracted from web
pages and organized using a hierarchical geographical ontology.
Given a query with a locational constraint, the candidate documents
are retrieved from the matching hierarchies in the ontology using
existing information retrieval methods such as TAAT [6–8, 30] or
DAAT [11, 31].
In [36], Zhou et al. built a separate R-tree [10] and inverted index for the spatial and textual attributes respectively, and evaluated a query by first searching with one of the indexes (spatial or textual) and then ranking the matching documents using information from the other index.

In [13], locational regions are first extracted from the web documents and then used to organize the documents using the grid file [27] or a space-filling curve [19] such that spatially close documents are clustered near to one another on disk. Given a query, documents whose locational regions contain the query location are first retrieved and their textual relevance w.r.t. the query is then computed.
In [18], Fontoura et al. proposed query relaxation techniques to
generate top-k results when there are not enough matching documents in the specified region.
In [22], Hariharan et al. proposed KR*-tree, the first hybrid index to handle spatial keyword query with AND-semantics, where
a matching document must contain all the query keywords and its
query location must intersect with the query region. In KR*-tree,
an inverted index is maintained for each tree node, where both spatial filtering and textual filtering can be employed at the same time
when accessing an index node. In [32], Zhang et al. recently proposed a more efficient hybrid index to process queries with AND-semantics. The index combines a linear quadtree [20] with fixed-length bitmaps for query processing.
3.2 Distance-sensitive Approaches
In the distance-sensitive category of work [12, 14, 16, 29, 35],
matching documents are ranked based on both their textual relevance and spatial proximity to the query.
In [16], Felipe et al. proposed the IR2-tree, which is a hybrid index of the
R-tree and signature file; their work is based on the AND-semantics
which sorts the matching documents by their distance to the query
location.
Cong et al. [14] proposed IR-tree which is a spatial-first approach
that integrates the R-tree with inverted index; each R-tree node is
augmented with inverted lists to index documents located within
that node.
More recently, two textual-first approaches, S2I index [29] and
I 3 [35], have been proposed. S2I uses aggregated R-tree [28] for
spatial relevance evaluation. The search algorithm expands from
the query location in the spatial index associated with each query
keyword until k best results are found. I 3 builds one Quadtree [17]
for each keyword and adopts the best-first search strategy in query
processing. Performance studies [29, 35] have shown that both S2I
and I 3 are significantly faster than IR-tree.
4. OUR APPROACH
In this section, we present a new approach for the top-k distance-sensitive spatial keyword search problem. Our approach is based
on the simple but effective idea of modeling the problem as a top-k
aggregation problem [15].
DEFINITION 2 (Top-k aggregation problem).
Consider a database D where each object o = (x1 , x2 , . . . , xn ) has n
scores, one for each of its n attributes. Given a monotonic aggregation function f , where f (o) or f (x1 , x2 , . . . , xn ) denote the overall
score of object o, the top-k aggregation problem is to find a set of
top-k objects in D with the highest overall scores; i.e., find O ⊆ D
with k objects such that ∀o ∈ O and o′ ∈ D − O, f (o) ≥ f (o′ ).
The reformulation of the top-k distance-sensitive spatial keyword search problem as a top-k aggregation problem is straightforward. Given a query Q with m keywords, each geo-document D in a database D can be modeled as an (m + 1)-tuple (x1, x2, . . . , xm+1), where xi (1 ≤ i ≤ m) is the textual relevance score (defined by φt) between D and the ith query keyword, and xm+1 is the spatial proximity score (defined by φs) between D and the query location. By setting the aggregation function to be φ in Equation 3, the top-k distance-sensitive spatial keyword query problem is reformulated as a top-k aggregation problem.
EXAMPLE 1. Consider again the example spatial keyword search problem in Figure 1. Since every document contains at least one query keyword, we have D = {d1, d2, . . . , d7}. Each document D is now modelled as a three-dimensional vector (x1, x2, x3), where x1 is D’s textual relevance to the keyword “seafood”, x2 is D’s textual relevance to the keyword “restaurant”, and x3 is the spatial proximity between D and the query. If we set the aggregation function to be φ, the top-k aggregation problem for this example would return the results with the highest scores ranked by φ, which are exactly the results for the spatial keyword search problem.

Figure 2: An example of top-k aggregation for the query “seafood restaurant”. The three sorted lists are:
Ψt(seafood): (d2, 0.9), (d3, 0.2), (d4, 0.2), (d7, 0.1), (d1, 0.0), (d5, 0.0), (d6, 0.0)
Ψt(restaurant): (d2, 0.8), (d5, 0.6), (d6, 0.5), (d1, 0.4), (d7, 0.3), (d3, 0.0), (d4, 0.0)
Ψs: (d4, 0.8), (d2, 0.7), (d1, 0.6), (d3, 0.55), (d6, 0.5), (d5, 0.47), (d7, 0.42)
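As a sanity check on the reformulation, the brute-force sketch below (our illustration, not from the paper) aggregates the example scores of Figures 1 and 2 with an arbitrary weight of α = 0.5 and ranks the seven documents directly.

ALPHA = 0.5  # arbitrary weight chosen for illustration

# (score for "seafood", score for "restaurant", spatial proximity), taken from Figures 1 and 2.
scores = {
    "d1": (0.0, 0.4, 0.60), "d2": (0.9, 0.8, 0.70), "d3": (0.2, 0.0, 0.55),
    "d4": (0.2, 0.0, 0.80), "d5": (0.0, 0.6, 0.47), "d6": (0.0, 0.5, 0.50),
    "d7": (0.1, 0.3, 0.42),
}

def aggregate(x, alpha=ALPHA):
    # Monotonic aggregation phi over an (m+1)-dimensional score vector (here m = 2).
    return alpha * (x[0] + x[1]) + (1 - alpha) * x[2]

ranking = sorted(scores, key=lambda d: aggregate(scores[d]), reverse=True)
print(ranking[:2])                                        # ['d2', 'd5'] for alpha = 0.5
print(aggregate(scores["d2"]) > aggregate(scores["d4"]))  # True: d2 matches Q better than d4

A naive scan like this touches every document; the aggregation algorithms reviewed next avoid that.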
By formulating the problem as a top-k aggregation problem, we
are able to design an efficient solution that relies only on the simple
and widely used inverted index in contrast to existing hybrid solutions. The rest of this section is organized as follows. We first review top-k aggregation algorithms in Section 4.1, and then explain
how they are adapted as baseline algorithms for the top-k distance-sensitive spatial keyword search problem in Section 4.2.
4.1 Top-k Aggregation Algorithms
In this section, we present an overview of top-k aggregation algorithms. A comprehensive survey of these algorithms is given
in [23].
For convenience, we present the algorithms in the context of spatial keyword search as follows. Given a query with m keywords and
a geo-location, assume that we have constructed m + 1 sorted lists,
L1, L2, . . . , Lm+1, w.r.t. a database of geo-documents. Each Li, i ∈ [1, m], is sorted (in non-ascending order) by the documents’ textual relevance scores w.r.t. the ith query keyword; and Lm+1 is sorted (in non-ascending order) by the documents’ spatial proximity to the query
location. Thus, each sorted list entry is a pair of document identifier
and a score (textual relevance or spatial proximity).
The first algorithm is the TA algorithm proposed by Fagin et
al. [15], which consists of two main steps.
1. Perform a sorted access in parallel to each of the m + 1 sorted
lists Li. For each document accessed, perform a random access to each of the other lists to find its scores in those lists.
Then, compute the aggregated score of the document using
the ranking function φ in Equation 3. If the computed aggregated score is one of the k highest we have seen so far,
remember the document and its score.
2. For each list Li , let high[i] be the score of the last document
seen under sorted access. Define the threshold value Bk to be the aggregation of the high[i] values under the ranking function φ. As
soon as at least k documents have been seen whose score is
at least equal to Bk , the algorithm terminates.
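For reference, here is a compact in-memory sketch of the TA procedure just described (our code, not the paper's); sorted_lists holds (doc, score) pairs in non-ascending score order, lookup provides the per-list scores used for random access, and weights encodes the linear ranking function.

import heapq

def ta(sorted_lists, lookup, weights, k):
    # Threshold Algorithm: round-robin sorted access plus random access to the other lists.
    # sorted_lists: one list of (doc, score) pairs per attribute, non-ascending by score.
    # lookup:       one dict per attribute mapping doc -> score (missing doc counts as 0.0).
    # weights:      per-attribute weights, so phi(doc) = sum_i weights[i] * score_i(doc).
    seen, topk = set(), []                      # topk is a min-heap of (score, doc) pairs
    depth, max_len = 0, max((len(l) for l in sorted_lists), default=0)
    while True:
        high = []                               # last score seen in each list at this depth
        for lst in sorted_lists:
            if depth >= len(lst):
                high.append(0.0)
                continue
            doc, score = lst[depth]
            high.append(score)
            if doc not in seen:                 # random access to all the other lists
                seen.add(doc)
                total = sum(w * l.get(doc, 0.0) for w, l in zip(weights, lookup))
                heapq.heappush(topk, (total, doc))
                if len(topk) > k:
                    heapq.heappop(topk)
        threshold = sum(w * h for w, h in zip(weights, high))   # the bound Bk
        depth += 1
        # Stop once k documents have been seen whose scores are at least the threshold.
        if (len(topk) == k and topk[0][0] >= threshold) or depth >= max_len:
            return sorted(topk, reverse=True)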
Although the TA algorithm has been proven to be instance optimal, the optimality depends on the cost of random access. The
algorithm does not perform well if the cost of random access is
too high. Subsequently, an improved variant of the TA algorithm,
the CA algorithm [15], was proposed to achieve a good tradeoff
between the number of sequential accesses and random accesses.
The CA algorithm uses a parameter h to control the depth of
sequential access. In each iteration, h documents in each list are
sequentially accessed. h is set to be the ratio of the cost of a random access to the cost of a sequential access. For each accessed document
doc, let B(doc) denote an upper bound of the aggregated score of
doc. A document doc is defined to be viable if B(doc) is larger than
the kth best score that has been computed so far. At the end of each
iteration, the viable document with the maximum B(doc) value is
selected for random access to determine its aggregated score. The
algorithm terminates when at least k distinct documents have been
seen and there are no viable documents.
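A simplified sketch of the CA strategy described above (ours; it assumes the same in-memory representation as the TA sketch and ignores disk-level concerns) looks as follows.

def ca(sorted_lists, lookup, weights, k, h):
    # CA sketch: depth-h sorted access per round, then one random access on the most
    # promising viable document.
    partial = {}                                   # doc -> (partial weighted score, lists seen in)
    depth = [0] * len(sorted_lists)
    high = [lst[0][1] if lst else 0.0 for lst in sorted_lists]   # last score seen per list
    best = {}                                      # doc -> exact aggregated score (after random access)

    def upper_bound(doc):                          # B(doc): partial score + best case for unseen lists
        score, seen_in = partial[doc]
        return score + sum(weights[i] * high[i] for i in range(len(sorted_lists)) if i not in seen_in)

    def full_score(doc):
        return sum(w * l.get(doc, 0.0) for w, l in zip(weights, lookup))

    while True:
        for i, lst in enumerate(sorted_lists):     # sequential access: h entries per list
            for _ in range(h):
                if depth[i] >= len(lst):
                    break
                doc, score = lst[depth[i]]
                depth[i] += 1
                high[i] = score
                s, seen_in = partial.get(doc, (0.0, set()))
                partial[doc] = (s + weights[i] * score, seen_in | {i})
        wk = sorted(best.values(), reverse=True)[k - 1] if len(best) >= k else float("-inf")
        viable = [d for d in partial if d not in best and upper_bound(d) > wk]
        exhausted = all(depth[i] >= len(lst) for i, lst in enumerate(sorted_lists))
        if not viable and (len(partial) >= k or exhausted):
            return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if viable:                                 # random access on the viable doc with the largest B(doc)
            d = max(viable, key=upper_bound)
            best[d] = full_score(d)

The upper bound B(doc) is computed from the last scores seen in the lists the document has not yet appeared in, which is what makes the viability test and the depth parameter h effective.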
Besides the TA and CA algorithms, there are a number of other
variants that optimize the approach for early termination. In [21], an improved TA variant, Quick-Combine, was proposed where, instead of accessing the sorted lists in a round-robin manner, Quick-Combine first estimates the influence of each sorted list and selects
the most influential list in each iteration for random access. In [26],
Marian et al. proposed the Upper and Pick algorithm to conduct
random access by minimizing the upper and lower bounds of all
objects.
4.2 Baseline Algorithms
In this section, we explain how the standard top-k aggregation
algorithms described in the previous section are adapted as baseline
algorithms for the top-k distance-sensitive spatial keyword search
problem.
First, note that while it is possible to precompute the sorted lists
for the textual attributes (i.e., L1 , L2 , . . . , Lm ) since the tf-idf scores
are independent of the query, this is not the case for the list Lm+1
for the spatial attribute as the spatial proximity score is dependent
on the query location. Thus, the sorted list Lm+1 needs to be created
at query time.
Second, since the documents matching a query’s keywords are
generally not clustered together (i.e., their docIDs are not related),
performing random access to matching documents to retrieve the
overall relevance score typically incurs a large number of random
disk I/O. To further minimize the number of random disk I/O for
such retrievals, we introduce a simple optimization to organize the
single, large document list into multiple smaller ones by maintaining a document list of matching documents for each keyword. By
clustering the document entries that match the same keyword, this
optimization helps to reduce the number of random disk I/O incurred for document identifier lookups. However, this advantage is
at the cost of additional storage for the document lists as a document’s information is now replicated among several lists.
EXAMPLE 2. Figure 3 illustrates how the use of multiple document lists can help reduce disk I/O. In this example, keyword w1 appears in documents L1 = {3, 5, 7, 10, 13} and keyword w2 appears in documents L2 = {5, 9, 10, 15}. If we need to perform random access on all the documents in L1 and L2, using a single, global document list could incur a total of 7 disk I/Os (one for each distinct document); in contrast, with the use of multiple document lists, the number of disk I/Os could be reduced to only 2 if each of the keyword-based document lists fits on one disk page.

Figure 3: One global document list vs. multiple small lists. The single global list stores the documents {3, 5, 7, 9, 10, 13, 15} on disk, whereas the multi-list organization stores the list {3, 5, 7, 10, 13} for keyword w1 and the list {5, 9, 10, 15} for keyword w2.
5. RANK-AWARE CA ALGORITHM
While the baseline algorithms presented in the previous section
could be easily incorporated into existing search engines that support the ubiquitous inverted list index, the baseline algorithms have
a performance drawback in that the spatial attribute list cannot be precomputed statically but needs to be sorted at runtime using the
query location to compute the spatial proximity values. In this section, we present an optimized variant of the CA algorithm, termed
Rank-aware CA (RCA), to address this limitation.
The key idea of our optimization is to sort the spatial attribute list
offline based on an approximate spatial order preserving encoding
such that the two-dimensional location attribute values are encoded
into totally ordered values with the desirable property that a pair
of encoded location values that are close together in the total order
represents a pair of locations that are likely to be spatially close to
each other. In this paper, we apply the well-known Z-order [4] to
obtain such a mapping.
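For illustration (the paper only states that the standard Z-order is used), the Morton code of a grid cell can be computed by interleaving the bits of its column and row indices, as in the following sketch.

def z_order(col, row, bits=3):
    # Interleave the bits of (col, row); with bits = 3 the grid is 8 x 8 and codes range over 0..63.
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)        # column bits occupy the even positions
        code |= ((row >> b) & 1) << (2 * b + 1)    # row bits occupy the odd positions
    return code

print(z_order(0, 0), z_order(1, 0), z_order(0, 1), z_order(1, 1))  # 0 1 2 3
# Adding 1 to these codes gives the cell labels 1, 2, 3, 4 of the top-left block in Figure 4.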
Figure 4: An example of Z-order encoding. The two-dimensional space is partitioned into an 8 × 8 grid and each cell is assigned a Z-order value from 1 to 64. The query Q lies in cell 51; the square region of radius r1 centred at Q corresponds to the Z-order range I1 = [38, 58], and the larger region of radius r2 corresponds to the range I2 = [15, 63].
EXAMPLE 3. Figure 4 illustrates an application of the Z-order to encode location values for a two-dimensional space that is partitioned into 8 × 8 cells. Applying the Z-order encoding, each cell is assigned a unique Z-order value from 1 to 64. Since each document is located within some cell, the document’s location is represented by the Z-order value of its cell. Note that the Z-order encoding provides only an approximate preservation of spatial proximity.
There are two useful properties of Z-order encoding that we exploit in our RCA algorithm. First, given any rectangular region R
in the location space, the top-left corner cell of R has the smallest
Z-order value (denoted by Rmin ) and the bottom-right corner cell
of R has the largest Z-order value (denoted by Rmax ) among all the
cells in R. Thus, all the cell locations in a region R are contained in
the range of Z-order values [Rmin , Rmax ]. As an example, consider
a query Q that is located in cell 51 and a region R that is centered
at Q with a radius of r1 as shown in Figure 4. We have [Rmin , Rmax ]
= [38, 58] which contains the Z-order values of all the cells in R.
Second, for any cell with a Z-order value c that is outside of a region R centered at Q with radius r1 (i.e., c < Rmin or c > Rmax ), the
distance of this cell from Q must be larger than r1 .
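The two properties can be exploited as in the sketch below (ours; the unit cell size and helper names are assumptions): the Z-order range of a square search region follows from its top-left and bottom-right corner cells, and a containment test on the actual distance filters out false positives.

import math

CELL = 1.0  # side length of a grid cell (assumed to be 1 for this sketch)

def interleave(col, row, bits=3):
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code

def z_range(qx, qy, radius, bits=3):
    # [Rmin, Rmax]: Z-values of the top-left and bottom-right corner cells of the square
    # of the given radius centred at (qx, qy), clamped to the grid.
    n = (1 << bits) - 1
    def clamp(v):
        return max(0, min(n, int(v // CELL)))
    return (interleave(clamp(qx - radius), clamp(qy - radius), bits),
            interleave(clamp(qx + radius), clamp(qy + radius), bits))

def is_false_positive(doc_xy, query_xy, radius):
    # A document inside the Z-order range may still lie outside the search radius.
    return math.dist(doc_xy, query_xy) > radius

print(z_range(4.5, 5.5, 1.0))  # (37, 57): cells 38..58 in the 1-based labelling of Figure 4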
Based on the properties of the Z-order encoding, the RCA algorithm progressively accesses the documents in the spatial attribute
list in iterations in a score-bounded manner. In each iteration, unlike the conventional CA algorithm which explores a fixed number of items, our RCA algorithm accesses all the documents whose spatial proximity scores fall within a fixed-length score interval.

5.1 Score-bounded Access of Spatial Attribute Lists

In this section, we elaborate on the score-bounded access of the spatial attribute lists.
Similar to the rationale for organizing a document list into multiple shorter lists as explained in Section 4.2, the spatial attribute list
is also organized as multiple shorter lists with one spatial inverted
list Lw for each keyword w; i.e., the entries in Lw represent all the
documents that contain the keyword w. The entries in Lw are sorted
in ascending order of their Z-order encodings of the document locations, and these spatial inverted lists are created offline without
incurring any runtime sorting overhead.
Assume that the spatial relevance scores are normalized to the
range (0, 1] by the scoring function φs . Let ηs denote the maximum
number of iterations to be used to access all the documents in the
spatial attribute lists. Therefore, the spatial score range (0, 1] is partitioned into ηs disjoint intervals (each of length 1/ηs): T1 = (1 − 1/ηs, 1], T2 = (1 − 2/ηs, 1 − 1/ηs], . . . , Tηs = (0, 1/ηs], where Ti denotes the range of spatial proximity score values for the documents accessed in the ith iteration. By Equation 2, it follows that for the ith iteration we have φs = 1 − d(D, Q)/γ > 1 − i/ηs. Thus, d(D, Q) < iγ/ηs for the ith iteration. In other words, in the ith iteration of the score-bounded access, the search radius is set to ri = iγ/ηs, and the radius progressively increases over the iterations.
At each iteration of the score-bounded access, the search on a
spatial inverted list is actually performed in terms of a forward-scan and a backward-scan. Consider again the example in Figure 4
where the Z-order encoding of the query location (denoted by zq )
is in cell 51 (i.e., zq = 51). With an initial radius of r1 , we access
the documents located in the Z-order range I1 = [38, 58] as shown
in Figure 4. At the next iteration when the radius is increased to
r2 , we have the Z-order range I2 = [15, 63] which contains I1 . To
access only the documents contained in I2 but not in I1 , the spatial
search is split into a backward-scan of Z-order range [15, 37] and a
forward-scan of Z-order range [58, 63].
Based on the properties of Z-order, it is possible that some of
the documents accessed in the searched spatial region (specified
by some range of Z-order values) could be false positives; i.e., the
actual distance between the accessed document and query could be
larger than the current search radius ri at the ith iteration. To avoid
processing these false positives too early, we maintain ηs buffers
to temporarily store these false positive documents: a false positive
document that should have been processed later in the jth iteration
(i.e., j > i) will be temporarily stored in the jth buffer. Thus, the
documents in the jth buffer will be considered later during the jth
iteration.
EXAMPLE 4. Figure 5 illustrates an example of score-bounded access for the spatial attribute. The algorithm starts from the region with radius λs = γ/ηs, which is then progressively increased to 2λs, 3λs, etc. until the k documents with the highest scores have been found. For the iteration with radius λs, the Z-order range is [38, 58] and two documents {d3, d4} are accessed during the scan of the spatial list. Among them, only d4 is a true candidate; d3 is a false positive and it is temporarily stored in the appropriate buffer to be considered in a subsequent iteration.

Figure 5: Buffering false positives in spatial search. The documents d1, . . . , d7 are placed on the 8 × 8 Z-order grid of Figure 4, with Q in cell 51; the search radius grows through λs, 2λs, . . . , 6λs, and a false positive whose distance from Q lies between (j − 1)λs and jλs is placed in the jth buffer.
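The bookkeeping behind this buffering scheme is small; the sketch below (ours, with illustrative parameter values) computes the per-iteration search radius and defers a scanned document whose true distance is not yet covered.

import math

GAMMA = 8.0                    # normalization constant of Equation 2 (maximum pairwise distance)
ETA_S = 8                      # number of score intervals / iterations for the spatial lists
LAMBDA_S = GAMMA / ETA_S
buffers = [[] for _ in range(ETA_S + 1)]   # buffers[j] holds false positives deferred to iteration j

def radius(i):
    # Search radius of the ith iteration; documents with phi_s > 1 - i/eta_s lie within it.
    return i * LAMBDA_S

def on_spatial_hit(doc, dist, i):
    # Handle a document scanned from the Z-order range of iteration i.
    if dist <= radius(i):
        return True                                  # true candidate: accumulate its spatial score now
    j = min(ETA_S, math.ceil(dist / LAMBDA_S))       # first iteration whose radius covers dist
    buffers[j].append(doc)                           # (Algorithm 1 indexes the buffer by floor(d / lambda_s))
    return False

def start_iteration(i, process):
    # At the start of iteration i, the documents deferred to buffers[i] are processed.
    for doc in buffers[i]:
        process(doc)
    buffers[i].clear()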
In contrast to the CA algorithm, which accesses a fixed number
of documents in each iteration, the documents accessed by RCA are determined by a score interval corresponding to the iteration.
This difference between CA and RCA is motivated by two reasons.
First, the upper bound score for the ranking function φs (Equation 2) relies on the minimum distance of all the unseen objects
to the query location and this distance computation is complex as
each Z-order range in general corresponds to an irregular polygon.
Second, the upper bound score for φs could decrease very slowly
if a fixed number of documents is accessed per iteration. This is
because the Z-order encoding only preserves spatial proximity approximately and spatially close objects could have Z-order values
that are far apart in the linear order. For example, in Figure 4, although the minimum distance between cells 16 and 49 is zero, the
difference in their Z-order values is 33. As another example, if
RCA were to adopt CA’s approach of accessing a fixed number of
objects per iteration, then there is actually no change in the upper
bound spatial relevance score from the objects accessed in cell 49
to cell 17.
5.2 Score-bounded Access of Textual Attribute Lists
Recall that our ranking function φ is a linear weighted combination of spatial proximity and textual relevance with the weight
parameter α. Intuitively, when the value of α is small, the spatial
relevance is more important than textual relevance; and it is therefore desirable to examine more documents in the spatial inverted
lists than in the textual inverted lists, so that the upper bound score for the unseen documents can decrease more significantly and enable an earlier termination of the algorithm.
To achieve the above property, the RCA algorithm also adopts
the score-bounded access method for the textual attribute lists. Specifically, let ηt denote the maximum number of iterations to be used
to access the textual attribute lists. Then, the textual relevance
score domain (0, 1] is partitioned into ηt intervals where documents
whose textual relevance score is within (1 − i/ηt, 1 − (i − 1)/ηt] are accessed in the ith iteration. In this way, our RCA algorithm enables
more relevant documents to be accessed before less relevant ones.
At the end of the ith iteration, the upper bound textual relevance
score for the unseen documents is given by
Bt(i) = 1 − i/ηt    (4)
It follows that the parameters ηs and ηt are related as follows:
ηt = ((1 − α)/α) · ηs    (5)
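The corresponding textual-side bookkeeping simply restates Equations (4) and (5); in the sketch below (ours) the parameter values are illustrative and the iteration count is rounded to an integer.

ALPHA = 0.3
ETA_S = 8
ETA_T = round((1 - ALPHA) / ALPHA * ETA_S)   # Equation (5), rounded to an integer iteration count

def b_t(i, eta_t=ETA_T):
    # Equation (4): upper bound on the textual relevance of any unseen document after iteration i.
    return 1.0 - i / eta_t

def textual_interval(i, eta_t=ETA_T):
    # Half-open score interval (B_t(i), B_t(i-1)] accessed in the ith iteration.
    return (1.0 - i / eta_t, 1.0 - (i - 1) / eta_t)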
EXAMPLE 5. Consider once more the spatial keyword query with keywords “seafood restaurant”, and assume that ηt = 4 (i.e., the length of the score interval is λt = 0.25). Figure 6 shows the documents that are accessed in each of the four iterations of score-bounded access w.r.t. the two inverted lists for the query keywords. In the first iteration, we access documents whose textual relevance score is within (0.75, 1], and d2 is accessed in both inverted lists. The upper bound textual relevance score for the unseen documents in each list is updated to 0.75. In the second iteration, no documents are accessed for the keyword “seafood” and only d5 is accessed for “restaurant”. In this way, our score-bounded approach prioritises the document retrievals so that documents that are more likely to be in the top-k results are accessed earlier. In contrast, the conventional CA algorithm would access the documents d4, d3 and d7 with low relevance scores in the inverted list of “seafood” very early.
Figure 6: Score-bounded sequential access for keywords “seafood restaurant” (ηt = 4, λt = 0.25). Ψt(seafood): (d2, 0.9) is accessed in the 1st iteration; (d3, 0.2), (d4, 0.2) and (d7, 0.1) are accessed in the 4th iteration. Ψt(restaurant): (d2, 0.8) is accessed in the 1st iteration, (d5, 0.6) in the 2nd iteration, and (d6, 0.5), (d1, 0.4) and (d7, 0.3) in the 3rd iteration.
5.3 Overall Approach

In this section, we discuss the overall approach of our RCA algorithm. Our approach adopts a different random access strategy and termination criterion from the traditional CA algorithm. Recall that the CA algorithm selects the viable document with the maximum upper bound score for random access, and the algorithm terminates if this upper bound score is no greater than the score of the kth best document seen so far (denoted by Wk). In contrast, our RCA algorithm does not maintain B(doc) to store the upper bound score for each viable document. Moreover, our random access is applied to the viable documents in the min-heap storing the top-k results so that Wk can be increased as much as possible towards an early termination. Based on our score-bounded sequential access approach, we can also compute a tighter upper bound for a document’s aggregated score (denoted by Bk). Specifically, after the ith iteration, Bk is calculated by²

Bk = α · m · Bt(i) + (1 − α) · Bs(i)    (6)

² More precisely, the upper bound of the textual relevance can be further reduced to α · Σ_{1≤j≤m} s_j, where s_j is the score last seen in the jth textual list. Since the ith iteration in the jth textual list terminates when we access a document with a score s_j ∉ T_i, we are assured that s_j ≤ Bt(i).

If Bk ≤ Wk, we stop the sequential access on the sorted lists as it is guaranteed that no unseen document can have an aggregated score higher than Wk. However, the algorithm cannot be terminated at this point because there could be some viable documents that are not in the top-k heap but have an upper bound score larger than Wk. For these documents, we need to conduct random access to obtain their full aggregated scores and update the top-k heap if we find a better result. In this way, we guarantee that no correct result is missed.

Our rank-aware CA algorithm for processing top-k spatial keyword queries is shown in Algorithm 1. The input parameter Lt is the collection of textual lists sorted by φt values and Ls is the collection of spatial lists sorted by Z-order encoding values. In lines 1 and 2, a top-k heap and ηs buffers are initialized, where buf[i] is used to temporarily store any false positive documents to be processed during the ith iteration. Three pointers pt, pf and pb are used for sequential access: pt is used for the textual lists and pf (resp. pb) is used for the forward (resp. backward) scan in the spatial lists (lines 6-8). In each iteration, we perform sequential access in the textual lists by calling exploreTextList (line 10) and forward/backward scans in the spatial lists by calling exploreForwardSpatialList and exploreBackwardSpatialList (lines 11-13). The function exploreTextList accepts the parameter Bt(i) defined in Equation 4 so as to scan the documents whose relevance is in the range (Bt(i), Bt(i − 1)]. Similarly, we calculate the Z-order range from the current search radius and perform forward and backward sequential access of documents within the computed Z-order range in the spatial lists. Any false positive documents are stored in the appropriate buffers in buf and examined in subsequent iterations (lines 14-15 in Algorithm 1). After the sequential access, we perform random access on the viable documents in the top-k heap (lines 16-17 in Algorithm 1). Finally, we update Bk according to Equation 6. If Bk ≤ Wk, we stop the sequential access and, for each viable document not in topk, we perform random access to obtain its complete score and update the top-k results if it yields a better result.
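The stopping test itself is a one-line comparison; the sketch below (ours; the function arguments stand in for the quantities maintained by Algorithm 1) shows the check of Equation (6) and the final resolution of viable documents by random access.

def should_stop(i, alpha, m, b_t, b_s, w_k):
    # Equation (6): once Bk <= Wk, no unseen document can beat the current kth best score.
    b_k = alpha * m * b_t(i) + (1 - alpha) * b_s(i)
    return b_k <= w_k

def finalize(candidates, topk_docs, w_k, upper_bound, random_access):
    # Before terminating, resolve viable documents that are not yet in the top-k heap.
    for doc in candidates:
        if doc not in topk_docs and upper_bound(doc) > w_k:
            random_access(doc)   # fetches the full aggregated score and may update the heap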
6. EXPERIMENT EVALUATION

In this section, we compare the methods derived from top-k aggregation with the state-of-the-art indexes that combine spatial partitioning and textual partitioning. More specifically, we compare the performance of the TA, CA and RCA algorithms with S2I [29] and I3 [35] in processing top-k spatial keyword queries. We set h = 8 in the CA algorithm and ηt = 20 in the RCA algorithm. All the methods are disk-based and implemented in Java. We conduct the experiments on a server with 48GB of memory and a Quad-Core AMD Opteron(tm) Processor 8356, running CentOS 5.8.
6.1 Twitter Datasets
We use a Twitter dataset for the experiments. Our collection contains 100 million real geo-tweets that take up 7.7GB of storage in
the raw data format. For scalability evaluation, we sample four subsets whose sizes vary from 20 million to 80 million. The statistics
of these datasets are shown in Table 1, where we report the dataset
size, number of distinct keywords, average number of keywords in
a document and amount of disk storage for each dataset.
In Table 2, we report the disk size of our inverted index and the
comparison indexes. To support Rank-aware CA (RCA), we build
three inverted lists for each keyword and use MapDB³ to store all the lists. Since TA and CA do not need the lists sorted by Z-order, we let them share the inverted lists of RCA that are sorted by textual relevance and document id. As shown in Table 2, the inverted index consumes only slightly more disk space than S2I. Although we
maintain three inverted lists for one keyword while S2I builds one R-tree for one keyword, the inverted lists have higher disk utilization than the R-tree and can be easily compressed to save disk space. I3 allocates at least one disk page even for an infrequent keyword, whereas S2I merges infrequent keywords into one file. Thus, I3 consumes the most disk space.

³ http://github.com/jankotek/mapdb

Table 1: The Twitter Datasets
DataSets      Number of geo-tweets   Number of distinct keywords   Average number of keywords   Disk storage
Twitter20M    20,000,000             2,719,087                     6.92                         1.6GB
Twitter40M    40,000,000             4,261,582                     6.94                         3.1GB
Twitter60M    60,000,000             5,530,216                     6.95                         4.6GB
Twitter80M    80,000,000             6,652,879                     6.94                         6.2GB
Twitter100M   100,000,000            7,672,170                     6.94                         7.7GB
6.2 Query Set
We use real keyword queries from the AOL search engine⁴ to generate spatial keyword queries. First, we select hotel and restaurant
as two typical location-based queries. All the keyword queries in
the log containing “hotel” or “restaurant” are extracted. Among
them, we keep the queries whose number of keywords is from 2 to 6. After removing duplicate queries, we note that, as shown in
Table 3, queries with 3 keywords are the most common and hotel queries are more frequently submitted by users than restaurant
queries. Next, we attach to each keyword query a spatial location.
The location is randomly sampled and follows the same distribution
as the tweet locations.

⁴ http://www.gregsadetsky.com/aol-data/
6.3 Parameter Setup
In the following experiments, we will evaluate the performance
of query processing in terms of increasing dataset size (from 20
million to 100 million), varying number of query keywords m (from
2 to 6), the number of query results k (from 10 to 200) and textual
relevance weight α in the ranking function (from 0.1 to 0.9). The
values in bold in Table 4 represent the default settings. In each experiment, we vary one parameter and fix the remaining parameters
at their default values. The performance is measured by the average
latency of query processing. Given a query set with thousands of
hotel or restaurant queries, we start the timing when the first query
arrives and stop when the last query finishes. Thus, the query processing time includes the I/O cost to load the index into memory.
We do not set a cache limit for query processing. The part of the index that is loaded into memory stays cached until all the keyword queries are processed.
6.4 Query Results
The query results are the top-k geo-tweets sorted by spatial proximity and textual relevance. Since tweets are normally short texts,
we merge the tweets with the same location into one geo-document.
For example, a restaurant may be checked in multiple times and the
term frequency can reflect the popularity of the location. Table 5
illustrates an example of top-10 geo-tweets for a query “seafood
restaurant” submitted at location (40.7312, −73.9977), corresponding to Washington Square Park in Manhattan, New York City. In
this example, α is set to 0.3 in favor of geo-tweets closer to the user
location. If there are multiple tweets at the same location, we only
select a representative one for presentation purposes.
6.5 Increasing Dataset Size
In this experiment, we examine the scalability in terms of increasing dataset size. We increase the number of geo-tweets from
20 million to 100 million and run the sets of 3-keyword hotel and
restaurant queries (there are 7,831 and 3,844 such queries, respectively).
We report the average query processing latency of different methods in Figure 7. As can be seen, the inverted-index-based solutions,
including TA, CA and RCA, achieve much better performance than
the hybrid indexes that combine spatial partitioning and textual partitioning. S2I and I 3 maintain a spatial index for each keyword.
Given a set of keywords, they start from the query location and
expand the search region from the spatial attribute only. The documents closest to the query location will be accessed first, regardless
of the textual relevance. In addition, the spatial index is disk-based
and the tree nodes are accessed with a large number of random I/Os.
Therefore, their performance does not scale well. When the dataset size increases to 100 million, their query latency is 5 times worse than that of RCA.
Figure 7: Query latency with increasing dataset size: (a) hotel queries; (b) restaurant queries. Each panel plots the average query latency (ms) of TA, CA, RCA, I3 and S2I against the number of geo-tweets (20 to 100 million).
TA and CA demonstrate comparable performance. Although
they need to sort the lists by the distance to the query location,
the performance is still around 2 times better than state-of-the-art
solutions. TA terminates earlier than CA because most of the documents only contain part of the query keywords. For these documents, even if all the contained keywords have been accessed, their
upper bound score is still higher than the real score.
Algorithm 1: RCA(Q, Lt, Ls, m, k, ηt, ηs)
1.  initialize a min-heap topk with k dummy documents with score 0
2.  initialize ηs buffers buf[1..ηs] for false positive documents
3.  Wk ← 0        // Wk is the minimum score in topk
4.  Bk ← 1        // Bk is the upper bound score for all the unseen documents
5.  for i = 1; i ≤ m; i++ do
6.      pt[i] ← 0
7.      pf[i] ← binary_search(Ls[i], zq)      // zq is the Z-order of Q
8.      pb[i] ← pf[i] − 1
9.  for i = 1; i ≤ max(ηt, ηs); i++ do
10.     exploreTextList(Bt(i))
11.     obtain the minimum Z-order zmin and maximum Z-order zmax from the rectangle (Q.lat − iλs, Q.lat + iλs, Q.lng − iλs, Q.lng + iλs)
12.     exploreForwardSpatialList(zmax, i · λs)
13.     exploreBackwardSpatialList(zmin, i · λs)
14.     for doc ∈ buf[i] do
15.         seqAccess(doc, m + 1, (1 − α) · φs(doc, Q))
16.     for each viable document doc in topk do
17.         randomAccess(doc)
18.     Bk ← α · m · Bt(i) + (1 − α) · Bs(i)
19.     if Wk ≥ Bk then
20.         for doc ∈ W do
21.             if doc ∉ topk and doc is viable then
22.                 randomAccess(doc)
23.         break
24. return topk

exploreTextList(minT)
1. for i = 1; i ≤ m; i++ do
2.     while Lt[i][pt[i]].score > minT do
3.         doc ← Lt[i][pt[i]]
4.         seqAccess(doc, i, α · φt(doc, Q.wi))
5.         pt[i] ← pt[i] + 1

exploreForwardSpatialList(maxZ, radius)
1. for i = 1; i ≤ m; i++ do
2.     while Ls[i][pf[i]].order ≤ maxZ do
3.         doc ← Ls[i][pf[i]]
4.         if d(doc, Q) ≤ radius then
5.             seqAccess(doc, m + 1, (1 − α) · φs(doc, Q))
6.         else
7.             put doc in buf[⌊d(doc, Q)/λs⌋]
8.         pf[i] ← pf[i] + 1

exploreBackwardSpatialList(minZ, radius)
1. for i = 1; i ≤ m; i++ do
2.     while Ls[i][pb[i]].order ≥ minZ do
3.         doc ← Ls[i][pb[i]]
4.         if d(doc, Q) ≤ radius then
5.             seqAccess(doc, m + 1, (1 − α) · φs(doc, Q))
6.         else
7.             put doc in buf[⌊d(doc, Q)/λs⌋]
8.         pb[i] ← pb[i] − 1

seqAccess(doc, i, s)
1. W(doc) ← W(doc) + s
2. E(doc) ← E(doc) ∪ {i}
3. updateTopk(doc)

randomAccess(doc)
1. if doc is not accessed then
2.     for i = 1; i ≤ m; i++ do
3.         W(doc) ← W(doc) + α · φt(doc, Q.wi)
4.     W(doc) ← W(doc) + (1 − α) · φs(doc, Q)
5.     updateTopk(doc)

updateTopk(doc)
1. if W(doc) > Wk then
2.     update doc in topk
3.     Wk ← min_{doc∈topk} W(doc)
Table 2: Disk space occupation of different indexes
DataSets      Inverted Index   S2I    I3
Twitter20M    12GB             11GB   24GB
Twitter40M    21GB             21GB   43GB
Twitter60M    35GB             31GB   62GB
Twitter80M    47GB             41GB   79GB
Twitter100M   58GB             51GB   97GB
Table 3: Statistics of hotel and restaurant queries
Number of keywords   Hotel Queries   Restaurant Queries
2                    3,471           3,319
3                    8,436           5,314
4                    7,831           3,844
5                    4,116           1,758
6                    1,619           587
Therefore, Bk decreases slowly in CA. TA does not need to maintain the upper
bound score for each document and terminates earlier. However, it
incurs the overhead of a larger number of random accesses. Each
random access involves at most m binary searches on the
sorted lists that have been loaded in memory and the cost of random access is moderate. Consequently, the query processing time
in TA and CA is close.
RCA achieves the best performance among all the methods. Even
in the dataset with 100 million tweets, it takes around 200ms to answer a query which is 2 times better than TA and CA. The main
reason is that it does not need to sort the spatial lists online and
the rank-aware expansion is effective in saving the cost of sequential access. Table 6 shows the average fraction of documents sequentially traversed by the different top-k aggregation algorithms.
TA uses a much smaller number of sequential accesses than CA, but
it requires a random access for each document retrieved from the
sequential traversal. RCA can be considered as a compromise between TA and CA with a moderate number of sequential access and
random access.
Finally, restaurant queries take a relatively longer time than hotel queries to answer. This is not because restaurant is a more frequent keyword in the Twitter dataset. In fact, the average length
of an inverted list for a keyword that appears in restaurant queries is 53,967, but this number is 73,045 for hotel queries. Instead,
our investigation shows that this performance difference is due to
the memory cache. In this experiment, we run 7,831 hotel queries, which is much more than the 3,844 restaurant queries. This means
more inverted lists about hotels are cached in memory and the disk
I/Os can be saved when the cached keyword appears in subsequent
queries. Note that the comparison among the different methods is fair because all of them adopt the same caching mechanism.
6.6 Increasing k
The average query latency of the various algorithms as k increases from 10 to 200 is shown in Figures 8(a) and 8(b).
Table 4: Parameter Setting
Dataset size               20M, 40M, 60M, 80M, 100M
Number of query keywords   2, 3, 4, 5, 6
k                          10, 50, 100, 150, 200
α                          0.1, 0.3, 0.5, 0.7, 0.9
Table 5: Example tweet results for query “seafood restaurant”
1   (40.717, -73.997)   I’m at Dun Huang Seafood Restaurant (New York, NY) http://t.co/jxx7FRg56i
2   (40.715, -73.996)   APALA holiday dinner! #union #labor #aapi @ Sunshine 27 Seafood Restaurant
3   (40.714, -73.996)   Moms picking out a fish for dinner #chinatown @ Fuleen Seafood Restaurant
4   (40.714, -73.996)   I’m at East Seafood Restaurant (New York, NY) http://t.co/IGA8Am8ph9
5   (40.707, -74.018)   Mediterranean (@ Miramar Seafood Restaurant) https://t.co/W9ZhS0ZKD0
6   (40.759, -73.982)   Dinner. (@ Oceana Seafood Restaurant Bar)
7   (40.813, -73.955)   I’m at Seafood Boca Chica Restaurant (New York, NY) http://t.co/oAEjm15drA
8   (40.645, -73.995)   I’m at New Fulin Kwok Seafood Restaurant (Brooklyn, NY) http://t.co/qr7rdhpbk0
9   (40.768, -73.911)   Seafood Egyptian restaurant @ Sabrys
10  (40.814, -73.956)   good 1 (@ Seafood Restaurant) http://t.co/INvKPKNNHF
Table 6: Fraction of sequential access for hotel queries
DataSets      TA      CA      RCA
Twitter20M    0.085   0.449   0.230
Twitter40M    0.073   0.369   0.227
Twitter60M    0.065   0.320   0.223
Twitter80M    0.060   0.294   0.219
Twitter100M   0.057   0.277   0.217

Figure 8: Query latency with increasing k, m and α: (a) hotel and (b) restaurant queries as k grows from 10 to 200; (c) hotel and (d) restaurant queries as the number of query keywords grows from 2 to 6; (e) hotel and (f) restaurant queries as α varies from 0.9 to 0.1. Each panel plots the average query latency (ms) of TA, CA, RCA, I3 and S2I.
The performance of S2I and I3 degrades more significantly than the solutions based on top-k aggregation. This is because they access the documents in order of their distance to the query location. Hence, their
pruning power only relies on the spatial attribute and the textual relevance is not taken into account. The remaining methods consider
the spatial and textual relevance as a whole and demonstrate better
scalability to k. The results also shed insight on the effectiveness of
search space pruning among TA, CA and RCA. In TA and CA, the
query processing time includes the cost to sort by spatial proximity
and the cost of sequential and random access. As k increases, the
overhead of sorting is fixed and the running time increases because
more documents are accessed when k becomes larger. Hence, we
can compare the pruning effectiveness of TA, CA and RCA from
the experiment figures. As k increases from 10 to 200, the running
time of RCA increases much more slowly than that of TA and CA, which means
our rank-aware expansion is effective in reducing the access cost in
the sorted lists.
6.7 Increasing Number of Query Keywords
In this experiment, we increase the number of query keywords
m from 2 to 6 and evaluate the performance on the Twitter60M dataset.
Since a document is considered relevant if it contains at least one
query keyword, the number of candidates grows dramatically as
m increases. The average query latency is reported in Figure 8(c)
and 8(d). As can be seen, the performance of spatial keyword query
processing degrades dramatically as m increases. When the number
of keywords increases from 5 to 6, the running time doubles for S2I
and I 3 . This is because as m increases, the number of documents
(containing at least one of the query keywords) whose locations
are near the query location also increases. Though many of these
documents’ relevance scores are too low to be among the top-k
results, they still need to be accessed. In comparison, TA, CA and
RCA scale smoothly with m; the pruning mechanism takes into
account both spatial and textual relevance which is clearly more
effective.
6.8 Increasing α
In the last experiment, we evaluate the effect of the weight α in
the ranking function φ on the performance. As α decreases, the spatial relevance plays a more important role in determining the final
score. We can see from Figure 8(e) and 8(f) that the performance of
S2I and I 3 improves significantly as α decreases. This is because
S2I and I 3 examine documents near the query location first while
the textual relevance is ignored. When α is 0.9, the textual relevance dominates spatial relevance and the spatial proximity is no
longer important. However, S2I and I 3 access documents based on
their distance to the query location. Even if the nearby documents
are not textually relevant, they need to examine all of them. When
α decreases, the spatial relevance becomes more important and the
top-k results are more likely to be located around the query location. This is advantageous for the expansion strategy of S2I and I 3 .
That is why the running time at α = 0.1 is nearly half of that at α = 0.9.
TA and CA are not sensitive to α. They build inverted lists sorted
by textual relevance and spatial proximity. Hence, when the ranking function is biased towards textual relevance, say α = 0.9, the most relevant documents are likely to appear at the front of the inverted
lists sorted by textual relevance. Similarly, when α is small, the
spatial proximity is more important and the documents close to the
query location will be accessed first by TA and CA algorithms.
RCA, however, is affected by α. When α decreases, its performance improves because its inverted lists on the spatial attribute
are not strictly sorted by the distance to the query location. When
α is small, the top-k documents are close to the query location and the spatial expansion
on the z-order list can be terminated earlier.
7. CONCLUSION
In this paper, we process the distance-sensitive spatial keyword query as a top-k aggregation query and present revised TA and CA algorithms for query processing. Furthermore, we propose a rank-aware CA algorithm that works well on inverted lists sorted by textual relevance and by space-filling-curve order. We conduct experiments on a Twitter dataset with up to 100 million geo-tweets. Our experimental results show that our proposed rank-aware CA scheme is superior to the state-of-the-art solutions.
8. ACKNOWLEDGEMENT
This work is funded by the NExT Search Centre (grant R-252-300-001-490), supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.
9. REFERENCES
[1] Google Local https://local.google.com/.
[2] http://techcrunch.com/2013/10/03/mobile-twitter-161m-access-from-handheld-devices-each-month-65-of-ad-revenues-coming-from-mobile/.
[3] Yahoo Local http://local.yahoo.com/.
[4] Computer Graphics: Principles and Practice, second edition. Addison-Wesley Professional, 1990.
[5] E. Amitay, N. Har’El, R. Sivan, and A. Soffer. Web-a-where:
geotagging web content. In SIGIR ’04: Proceedings of the 27th
annual international ACM SIGIR conference on Research and
development in information retrieval, pages 273–280, New York, NY,
USA, 2004. ACM.
[6] V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with
effective early termination. In SIGIR, pages 35–42, 2001.
[7] V. N. Anh and A. Moffat. Simplified similarity scoring using term
ranks. In SIGIR, pages 226–233, 2005.
[8] V. N. Anh and A. Moffat. Pruned query evaluation using
pre-computed impacts. In SIGIR, pages 372–379, 2006.
[9] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information
Retrieval. ACM Press / Addison-Wesley, 1999.
[10] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The
r*-tree: An efficient and robust access method for points and
rectangles. In SIGMOD Conference, pages 322–331, 1990.
[11] C. Buckley and A. F. Lewit. Optimization of inverted vector
searches. In SIGIR, pages 97–110, 1985.
[12] L. Chen, G. Cong, C. S. Jensen, and D. Wu. Spatial keyword query
processing: An experimental evaluation. PVLDB, 6(3):217–228,
2013.
[13] Y.-Y. Chen, T. Suel, and A. Markowetz. Efficient query processing in
geographic web search engines. In SIGMOD Conference, pages
277–288, 2006.
[14] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k
most relevant spatial web objects. PVLDB, 2(1):337–348, 2009.
[15] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms
for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003.
[16] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial
databases. In ICDE, pages 656–665, 2008.
[17] R. A. Finkel and J. L. Bentley. Quad trees: A data structure for
retrieval on composite keys. Acta Inf., 4:1–9, 1974.
[18] M. Fontoura, V. Josifovski, R. Kumar, C. Olston, A. Tomkins, and
S. Vassilvitskii. Relaxation in text search using taxonomies. PVLDB,
1(1):672–683, 2008.
[19] V. Gaede and O. Günther. Multidimensional access methods. ACM
Comput. Surv., 30(2):170–231, 1998.
[20] I. Gargantini. An effective way to represent quadtrees. Commun.
ACM, 25(12):905–910, 1982.
[21] U. Güntzer, W.-T. Balke, and W. Kießling. Optimizing multi-feature
queries for image databases. In VLDB, pages 419–428. Morgan
Kaufmann, 2000.
[22] R. Hariharan, B. Hore, C. Li, and S. Mehrotra. Processing
spatial-keyword (sk) queries in geographic information retrieval (gir)
systems. In SSDBM, page 16, 2007.
[23] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query
processing techniques in relational database systems. ACM Comput.
Surv., 40(4), 2008.
[24] Z. Li, K. C. K. Lee, B. Zheng, W.-C. Lee, D. L. Lee, and X. Wang.
Ir-tree: An efficient index for geographic document search. IEEE
Trans. Knowl. Data Eng., 23(4):585–599, 2011.
[25] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to
information retrieval. Cambridge University Press, 2008.
[26] A. Marian, N. Bruno, and L. Gravano. Evaluating top-k queries over
web-accessible databases. ACM Trans. Database Syst.,
29(2):319–362, 2004.
[27] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An
adaptable, symmetric multikey file structure. ACM Trans. Database
Syst., 9(1):38–71, 1984.
[28] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient olap operations
in spatial data warehouses. In SSTD, pages 443–459, 2001.
[29] J. B. Rocha-Junior, O. Gkorgkas, S. Jonassen, and K. Nørvåg.
Efficient processing of top-k spatial keyword queries. In SSTD,
pages 205–222, 2011.
[30] T. Strohman and W. B. Croft. Efficient document retrieval in main
memory. In SIGIR, pages 175–182, 2007.
[31] T. Strohman, H. R. Turtle, and W. B. Croft. Optimization strategies
for complex queries. In SIGIR, pages 219–225, 2005.
[32] C. Zhang, Y. Zhang, W. Zhang, and X. Lin. Inverted linear quadtree:
Efficient top k spatial keyword search. In ICDE, pages 901–912,
2013.
[33] D. Zhang, Y. M. Chee, A. Mondal, A. K. H. Tung, and
M. Kitsuregawa. Keyword search in spatial databases: Towards
searching by document. In ICDE, pages 688–699, 2009.
[34] D. Zhang, B. C. Ooi, and A. K. H. Tung. Locating mapped resources
in web 2.0. In ICDE, pages 521–532, 2010.
[35] D. Zhang, K.-L. Tan, and A. K. H. Tung. Scalable top-k spatial
keyword search. In EDBT, pages 359–370, 2013.
[36] Y. Zhou, X. Xie, C. Wang, Y. Gong, and W.-Y. Ma. Hybrid index
structures for location-based web search. In CIKM, pages 155–162,
2005.