Zhenjie Zhang, Reynold Cheng, Dimitris Papadias, Anthony K. H. Tung. Minimizing the Communication Cost for Continuous Skyline Maintenance. Accepted and to appear in the Proceedings of the ACM Conference on the Management of Data (SIGMOD), Providence, RI, USA, June 29th
Understanding the Meaning of a Shifted Sky
A General Framework on Extending Skyline Query
Zhenjie Zhang · Hua Lu · Beng Chin Ooi · Anthony K. H. Tung
Abstract Skyline queries are often used on data sets in multidimensional space for many decision-making applications. Traditionally, an object p is said to dominate another object q if it is no worse than q on all dimensions and better than q on at least one dimension. The skyline of a data set therefore consists of all objects not dominated by any other object.
To better cater to application requirements such as controlling the size of the skyline or handling data sets that are not well-structured, various works have extended the definition of skyline based on variants of the dominance relationship. In view of the proliferation of variants, this paper proposes a generalized framework to guide the extension of the skyline query from its conventional definition to different variants.
Our framework explicitly and carefully examines the various properties that should be preserved in a variant of the dominance relationship so that (1) the original advantages are maintained while adaptivity to application semantics is extended, and (2) the computational complexity remains almost unaffected. We prove that traditional dominance is the only relationship satisfying all desirable properties, and present some new dominance relationships obtained by relaxing some of the properties.
These relationships are general enough for us to design new top-k skyline queries that return robust results of a controllable size. We analyze the existing skyline algorithms based on their minimum requirements on dominance properties. We also extend our analysis to data sets with missing values, and present extensive experimental results on the combinations of new dominance relationships and skyline algorithms.
Zhenjie Zhang
Department of Computer Science
School of Computing
National University of Singapore
Email: [email protected]
Hua Lu
Department of Computer Science
Faculties of Engineering, Science, and Medicine
Aalborg University
Email: [email protected]
Beng Chin Ooi
Department of Computer Science
School of Computing
National University of Singapore
Email: [email protected]
Anthony K. H. Tung
Department of Computer Science
School of Computing
National University of Singapore
Email: [email protected]
1 Introduction
Given a set P of d-dimensional points, point p is said to dominate point q if p is no worse than q on any dimension and better than q on at least one. The subset of all points not dominated by others is called the skyline of P. A skyline query retrieves from P its skyline, which is interesting to users with multiple criteria, especially from an economic perspective [15,24]. These are the traditional skyline concepts that have formed the basis for most previous works on skyline queries [4,22,13,18,10].
The essence of a skyline query is its dominance definition, which not only determines the relationship between any pair of points but also shapes the final query result. There are both benefits and negative effects in adopting the traditional dominance relationship for the definition of a skyline point.
In terms of benefits, the traditional dominance relationship guarantees the robustness of the query result. This is because scaling and shifting on any dimension do not impact the query result. For example, if the objects are associated with two attributes, including temperature and weight, the skyline points remain the same regardless of whether temperature is represented in Fahrenheit or Celsius, or whether weight is measured in kilograms or pounds. This property is the most important advantage of the skyline query, making it the only option for the user when dealing with incomparable dimensions.
As for negative effects, the rigorous definition of the traditional dominance relationship restricts the usefulness of the skyline query in real-life applications, which usually place additional requirements on skyline points. One common requirement is control over the result size of a skyline query, i.e., the number of skyline points being returned. This has in fact motivated the introduction of variants of traditional skyline queries [12,16,6]. Another requirement is to provide systematic flexibility in tuning the selection of skyline points. Given a variant definition of the skyline query, the result returned is expected to be both semantically meaningful and controllable in cardinality. Yet another requirement is to apply skylining power to data sets that are not so well-structured, such as tables with missing values.
In view of these trends, in this paper, we formulate a generalized framework to serve as the basis for defining and examining variants. To formulate the framework, we carefully examine various properties that should be preserved in a variant of the dominance relationship so that: (1) the definition of skyline maintains its original advantages as much as possible while remaining adaptive to application semantics, and (2) the computational complexity of skyline based on a new variant is not too adversely affected. Through the introduction of this framework, we hope to remove the need for researchers to reexamine these properties whenever it is necessary to define a variant of the dominance relation or to develop a new algorithm for computing the skyline based on the new variant. This is done in the same spirit as previous work like [21], which provides a theoretical answer to the question of when nearest neighbor search is indexable.
Unlike previous studies on preference queries in relational databases [7,11] that focus on traditional properties for total and partial orders, we emphasize two important properties in the traditional dominance relationship: scaling robustness and shifting robustness.
With these two properties, the skyline set remains robust even when the dimensions are totally incomparable [4]. Besides these two properties, we are also interested in the rationality property and the transitivity property, both of which are crucial to algorithm design for the skyline query. It is also interesting to see the traditional dominance relationship being the only binary relationship satisfying all the properties above.
Further, we consider relaxing one of the following properties: transitivity, scaling robustness and shifting robustness, as doing so allows us to design dominance relationships such that the size of the skyline can be controlled. We show that these relationships are likely to form an ordered class $\{D_1, D_2, \ldots, D_n\}$, with the property that object p can dominate object q under $D_i$ for any $i \le j$ if p can dominate q under $D_j$. Based on such an ordered class property, we propose a new type of top-k skyline query which attempts to find the smallest $D_i$ such that the corresponding skyline is of a size no smaller than a user-specified parameter k.
On the efficiency issue of skyline query computation, we study some existing skyline algorithms, such as BNL [4], SFS [8], TSA [6] and BBS [19]. We analyze their applicable ranges by looking at the minimum requirement on the properties of the underlying dominance relationship. Additionally, we propose two algorithm frameworks for the top-k skyline query, namely Binary Search and Progressive Search, as well as their applicability conditions.
To illustrate the extensibility of our proposal, we apply our principles in two contexts. First, we apply our analysis to design a new dominance relationship called cone dominance, which allows us to reduce the skyline size while sacrificing either scaling or shifting robustness. Second, we apply our analysis to handle data sets with missing values without losing any of the desirable properties. In both cases, we then select the appropriate algorithms for finding the skyline and the top-k skyline based on the exhibited properties.
The rest of the paper is organized as follows. Section
2 introduces some basic properties of the dominance relationship and reviews some related works. Section 3 studies dominance relationships that satisfy all or parts of the properties. Section 4 summarizes the current algorithms based on their basic requirements on dominance properties. Section 5 looks at the design of cone dominance in order to control the size of the skyline.
Section 6 extends the analysis to data sets with missing values. Section 7 presents the experimental results and Section 8 concludes the paper.
2 Preliminaries
In this section, we first give the common definitions and notations used in the rest of the paper. Then, we review some related work in the literature of skyline query processing.
2.1 Definitions and Notations
Given a d-dimensional numerical space S, a point p in the space S is represented by a d-dimensional vector (p[1], p[2], . . . , p[d]). In this space, we can define a binary relationship D ⊆ S × S called a dominance relationship.
A point p is said to dominate another point q, if (p, q) is in D, denoted as D(p, q). We will also use ¬D(p, q) to denote that (p, q) is not in D, or p cannot dominate q.
Based on the dominance relationship, we can define the skyline of a data set P ⊆ S as a subset S(P, D) of P, which contains all points not dominated by any point in P, i.e., $S(P, D) = \{p \in P \mid \forall q \in P,\ \neg D(q, p)\}$. Since every dimension is simply numerical, the basic preference follows one of two cases: smaller values dominate larger values, or the opposite. Without loss of generality, we simply assume that a point p is better than another point q on a dimension i, if p[i] < q[i].
In most of the previous studies on skyline, a skyline query typically employs the traditional dominance relationship, where a point p dominates another point q, if p is not worse than q on all dimensions and p is better than q on at least one dimension. To distinguish the traditional dominance relationship from other dominance relationships, we shall call it T D.
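The definitions of T D and S(P, D) above can be sketched in a few lines of Python. This is only an illustrative sketch under the smaller-is-better assumption stated above; the function names `td` and `skyline` and the hotel data are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Dominance = Callable[[Point, Point], bool]


def td(p: Point, q: Point) -> bool:
    """Traditional dominance T D(p, q), assuming smaller values are better."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Point], dominates: Dominance) -> List[Point]:
    """S(P, D): every p in P that no q in P dominates (exhaustive check)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels given as (price, distance to the beach).
hotels = [(50, 8), (60, 5), (80, 2), (90, 6), (70, 9)]
result = skyline(hotels, td)
print(result)  # [(50, 8), (60, 5), (80, 2)]
```

The exhaustive comparison takes a quadratic number of dominance tests, but it makes no assumption about D beyond being a binary relationship, so it can serve as a reference when other algorithms are checked against a new dominance relationship.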
Definition 1 Dominance Region
Based on the specified dominance relationship D, the dominance region of a point p is the largest region in S such that every point in the region must be dominated by p with respect to D.
The dominance region of a point p in the traditional dominance relationship T D, for example, is the hyperrectangle in the space for points with no smaller values on all dimensions, except for the position of p itself.
In this paper, we focus on the study of dominance relationships for the skyline query. The basis of our study builds on the important properties of these relationships.
Definition 2 Rationality Property
A dominance relationship D satisfies the rationality property, if ¬D(p, q), i.e., p does not dominate q, for any pair of p and q such that q[i] < p[i] for all 1 ≤ i ≤ d.
The rationality property gives us the basic standard as to what is good and what is bad. A point cannot dominate another point if it is worse on all aspects.
Definition 3 Transitivity Property
A dominance relationship D satisfies the transitivity property, if D(p, q) when there exists another point r such that D(p, r) and D(r, q).
This property is intuitive since preference usually embodies the transitivity property.
Given a d-dimensional vector α = (α[1], . . . , α[d]) and a point p in S, we define the scaling operation as αp = (p[1]α[1], . . . , p[d]α[d]), where α[i] ≥ 0 for all i and α[j] > 0 for some j. α[i] is the scaling factor of dimension i.
Definition 4 Property of Scaling Robustness
A dominance relationship D satisfies the property of scaling robustness if D(αp, αq) when D(p, q) for any valid α.
Similarly, given a ddimensional constant vector β =
(β[1], . . . , β[d]) and a point p, we define the shifting operation as p + β = (p[1] + β[1], . . . , p[d] + β[d]), where
β[i] is any real number for all i. β[i] is said to be the shifting factor of dimension i.
Definition 5 Property of Shifting Robustness
A dominance relationship D satisfies the property of shifting robustness if D(p + β, q + β) when D(p, q) for any β.
The properties of scaling robustness and shifting robustness are important in real applications. This is because many real data sets contain incomparable dimensions. For example, in the case of hotel selection [4], the room price and distance to the beach are different in nature. The properties of scaling robustness and shifting robustness enable comparisons to be made among points with totally different dimensions, which is one of the most important advantages in the original skyline query.
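As a small illustration of why these two properties matter, the sketch below converts hypothetical (temperature, weight) readings from Celsius and kilograms to Fahrenheit and pounds, which is one scaling followed by one shifting, and checks that the traditional skyline selects the same objects. The data, the helper names and the conversion factor 2.20462 are assumptions made for the example.

```python
def td(p, q):
    """Traditional dominance, assuming smaller values are better."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def scale(alpha, p):
    """Scaling operation: multiply dimension i by alpha[i]."""
    return tuple(a * x for a, x in zip(alpha, p))


def shift(p, beta):
    """Shifting operation: add beta[i] to dimension i."""
    return tuple(x + b for x, b in zip(p, beta))


def skyline_indices(points, dominates):
    """Positions of the points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for q in points)]


# Hypothetical readings: (temperature in Celsius, weight in kilograms).
metric = [(20.0, 3.0), (25.0, 2.0), (30.0, 4.0), (22.0, 3.5)]

# Fahrenheit = 1.8 * Celsius + 32; pounds = 2.20462 * kilograms (approximate).
imperial = [shift(scale((1.8, 2.20462), p), (32.0, 0.0)) for p in metric]

before = skyline_indices(metric, td)
after = skyline_indices(imperial, td)
print(before, after)  # [0, 1] [0, 1]
```

Because each dimension is transformed by an increasing function, every pairwise comparison keeps its outcome, which is why the two index lists coincide.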
With the concept of dominance relationship, we can provide a generic definition of skyline query as follows.
Problem 1 Skyline Query
Given a data set P and a dominance relationship D, locate the skyline S(P, D).
In the following, we define the concept of dominance class. Given a group of dominance relationships, we define the ordering property as follows.
Definition 6 Ordering Property
Given a fully sorted index set Θ and a set of dominance relationships indexed by Θ, i.e., $\mathcal{D} = \{D_i \mid i \in \Theta\}$, $\mathcal{D}$ embodies the ordering property if for any $i \preceq j$, $D_i(p, q)$ must be valid if $D_j(p, q)$ is valid.
Lemma 1 Given a dominance relationship class $\mathcal{D}$ indexed by Θ and a data set P, we have $S(P, D_i) \subseteq S(P, D_j)$ for $i \preceq j$.
Fig. 1 Example of traditional dominance definition (points A to G plotted against the axes price and dist)
Table 2 Table of Notations
Notation         Description
S                underlying numerical space
d                dimensionality of S
p, q, r          points in S
D                dominance relationship
D(p, q)          p dominates q by D
P                a data set in S
S(P, D)          skyline of P with D dominance
Θ                index set for dominance class
α, β             scaling and shifting vector
$\mathcal{D}$    a set of dominance relationships
T D              traditional dominance relationship
CD_γ             cone dominance with parameter γ
E(p, q)          Euclidean distance between p and q
f, g             mappings from a distribution to a point
M D_λ            mapping dominance with parameter λ
Table 1 Example of Top-k Skyline Query
Dominance   ε     Skyline          Dominated Set
D_1         0.5   {B}              {A, C, D, E, F, G}
D_2         0.2   {B, C}           {A, D, E, F, G}
D_3         0.1   {A, B, C}        {D, E, F, G}
D_4         0     {A, B, C, D}     {E, F, G}
If Θ is a finite set, we say $\mathcal{D}$ is a finite ordered dominance class; otherwise, we say $\mathcal{D}$ is an infinite ordered dominance class. For example, Θ can be an integer set on [1, n] or a real number interval [a, b]. In the rest of the section, we will assume $\mathcal{D}$ is finite and Θ contains all integers in [1, n]. The definitions can be easily extended to infinite cases. Given an ordered dominance class $\mathcal{D} = \{D_1, D_2, \ldots, D_n\}$, we can define the new problem in a way similar to the traditional top-k query in database systems.
Problem 2 Top-k Skyline Query
Given the specified parameter k and an ordered dominance class $\mathcal{D}$, find a dominance relationship that is (1) $D_i \in \mathcal{D}$ such that $|S(P, D_i)| \ge k$ and $|S(P, D_{i-1})| < k$, if $|S(P, D_n)| \ge k$, or (2) $D_n$, if $|S(P, D_n)| < k$.
In other words, the top-k skyline query tries to discover the dominance relationship in the ordered dominance class whose skyline cardinality is minimal but not below k. If every relationship in the dominance class leads to a skyline smaller than k, the one with the maximal cardinality is returned instead. The top-k skyline query (Problem 2) is more attractive than the original skyline query (Problem 1) in many real applications since users can relate to results of manageable size more easily. Given the data shown in Figure 1, for example, if we construct an ordered dominance class based on ε-ADR [12], the corresponding skyline and dominated sets are as shown in Table 1. Therefore, the original skyline is {A, B, C, D} while the top-2 skyline based on this ordered class will return {B, C} as results.
For convenience and readability, we summarize the notations used in the rest of this paper in Table 2.
2.2 Related Work
We next review the existing dominance relationship definitions that are relevant to the skyline query. While the skyline query was previously known as the maximal vectors problem in the algorithms community, the earlier studies focus only on computational complexity [3,2]. In the database literature, by contrast, I/O cost becomes the bottleneck as the data grows beyond the capacity of memory. In this paper, we emphasize skyline query processing algorithms capable of handling large databases. Specifically, we consider five important aspects of interest. For each dominance relationship definition, we first consider whether its query result is deterministic, which indicates that re-executions (or different evaluations) of a given instance of a query type always produce the same subset as the query result. In addition, we look into whether each dominance relationship conforms to the four properties of rationality, transitivity, scaling robustness and shifting robustness. All existing dominance relationship definitions and respective properties are listed in Table 3.
We include the top-k query as it can be regarded as a special case of the skyline query with multiple dimensions for comparison degenerating into only one. A top-k query yields deterministic results, and it conforms to the rationality and transitivity properties. Whether it exhibits the property of scaling robustness or shifting robustness is dependent on the concrete aggregation functions that are used for ranking purposes.
A traditional skyline query [1,4,22,13,19,10,23,14] yields deterministic results. It also conforms to all the four properties described in the previous subsection.
The detailed proofs will be presented in Section 3.1. Another instance employing the traditional skyline is the recent k most representative skyline operator [16]. It introduces a constraint on the number of skyline points to be returned, and selects from the traditional skyline those points that maximize the total number of dominated points.
Table 3 Revisiting existing dominance relationships
Dominance Definition                        Determin.   Ration.   Transi.   Scaling Robust   Shifting Robust
Top-k                                       +           +         +         −                −
Traditional dominance [4, 22, 13, 19, 16]   +           +         +         +                +
Partially-ordered domain dominance [5]      +           +         +         N.A.             N.A.
ε-ADR dominance [12, 9]                     −           −         −         +                +
k-dominance [6]                             +           +         −         +                +
The skyline queries on partially ordered domains [5] and categorical domains [20,17] are special cases of the traditional skyline query, but the properties of scaling robustness and shifting robustness are not applicable because partially ordered domains do not support scaling or shifting operations.
In the approximate dominating representatives problem [12,9], each point is boosted (if larger values are preferred) by ε in all dimensions when being compared with other points. We call the underlying dominance relationship ε-dominance, and the corresponding skyline query the ε-ADR skyline query. The ε-ADR skyline query does not have deterministic results, and ε-dominance violates both the rationality and transitivity properties. However, ε-dominance conforms to the properties of scaling robustness and shifting robustness.
The k-dominant skyline [6] problem also alters the traditional dominance definition. Given a d-dimensional data set, a point p is said to k-dominate another point q if there exists a k-dimensional subspace (k ≤ d) within which p traditionally (fully) dominates q. The k-dominant skyline query yields deterministic results, but it does not conform to the transitivity property [6].
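The loss of transitivity under k-dominance can be seen on three points. The sketch below assumes smaller values are better and uses the observation that a k-dimensional subspace in which p traditionally dominates q exists exactly when p is no worse than q on at least k dimensions and strictly better on at least one. The function name `k_dominates` and the three points are hypothetical.

```python
def k_dominates(p, q, k):
    """True if p traditionally dominates q in some k-dimensional subspace.

    Assumes smaller values are better and 1 <= k <= len(p).
    """
    not_worse = sum(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return not_worse >= k and strictly_better


# Three hypothetical points in a three-dimensional space.
p, q, r = (1, 2, 3), (2, 3, 1), (3, 1, 2)

print(k_dominates(p, q, 2))  # True: p is no worse than q on dimensions 1 and 2
print(k_dominates(q, r, 2))  # True: q is no worse than r on dimensions 1 and 3
print(k_dominates(p, r, 2))  # False: p is no worse than r on dimension 1 only
print(k_dominates(r, p, 2))  # True: the three points dominate each other in a cycle
```

With k = 3 the relationship coincides with traditional dominance on these points, and none of the three dominates another.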
3 Analysis on Relationship and Properties
In this section, we analyze the connections between dominance relationships and desired properties, and show how relaxations on some of the properties reshape the dominance relationships.
3.1 Traditional Dominance Relationship
We first consider the traditional dominance relationship (T D). In the following, we use T D(p, q) to represent that p dominates q based on the definition of the traditional dominance relationship.
Theorem 1 T D satisfies the properties of rationality, transitivity, scaling robustness and shifting robustness.
Proof T D must satisfy the rationality property since a point p dominates another point q only when p is not worse than q on all dimensions.
As pointed out in [4], T D satisfies the transitivity property.
Given that T D(p, q), we have p[i] ≤ q[i] for any 1 ≤ i ≤ d, and there is at least one j such that p[j] < q[j]. Consider the relationship between αp and αq. Since α[i] > 0, α[i]p[i] ≤ α[i]q[i] for any 1 ≤ i ≤ d and α[j]p[j] < α[j]q[j]. This shows that T D(αp, αq) holds, i.e., T D satisfies the property of scaling robustness.
The proof for shifting robustness is similar to that of scaling robustness. Thus, we omit the detail here.
In the rest of the section, we will show that T D is the only binary relationship satisfying all the four properties. The proof begins with several lemmas.
Definition 7 HyperOctant
Given a point p in d-dimensional space S, the point divides the whole space into $2^d$ hyperoctants. For any two points q and r in the same octant, we have (q[i] − p[i])(r[i] − p[i]) ≥ 0 for any 1 ≤ i ≤ d.
By the definition, we know that for any two points in the same octant, both of them are of larger (smaller) values than p's value on any dimension i. In Figure 2, we present an example in three-dimensional space, where the cube with thick edges is the octant containing points of smaller values than q on the X and Z axes but of larger values than q on the Y axis.
Fig. 2 Example of octants in three-dimensional space (axes X, Y, Z; reference point q)
Lemma 2 Given a dominance relationship D that satisfies the properties of scaling robustness and shifting robustness, a point p and the octants induced by p, if D(p, q) for some q in an octant X, then D(p, r) for any r ∈ X − {p}.
Proof Consider any r in X. We construct a vector δ = (δ[1], . . . , δ[d]) such that δ[i] = (r[i] − p[i])/(q[i] − p[i]). By the property of octant, we are sure δ[i] ≥ 0 for all i. Then, we know that δp + (1 − δ)p = p and δq + (1 − δ)p = r, where all products are taken per dimension. By applying the property of scaling robustness, we have D(δp, δq). By applying also the property of shifting robustness, we have D(δp + (1 − δ)p, δq + (1 − δ)p), which directly leads to the conclusion that D(p, r).
The last lemma implies that any dominance relationship D exhibiting scaling robustness and shifting robustness has some ability to expand from a single dominated point to the whole hyperoctant.
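The construction used in the proof of Lemma 2 can be checked numerically. The sketch below uses exact rational arithmetic; the helper names and the three points are hypothetical, and the code assumes q[i] ≠ p[i] on every dimension so that δ is well defined.

```python
from fractions import Fraction


def same_octant(p, q, r):
    """Octant test of Definition 7: q and r lie in the same octant induced by p."""
    return all((qi - pi) * (ri - pi) >= 0 for pi, qi, ri in zip(p, q, r))


def lemma2_delta(p, q, r):
    """delta[i] = (r[i] - p[i]) / (q[i] - p[i]); assumes q[i] != p[i] for all i."""
    return tuple(Fraction(ri - pi, qi - pi) for pi, qi, ri in zip(p, q, r))


def scale_then_shift(delta, x, p):
    """Scale x by delta, then shift by (1 - delta) * p, dimension by dimension."""
    return tuple(d * xi + (1 - d) * pi for d, xi, pi in zip(delta, x, p))


# Hypothetical points: q and r lie in the same octant induced by p.
p, q, r = (2, 5), (4, 1), (3, 3)

delta = lemma2_delta(p, q, r)
image_of_p = scale_then_shift(delta, p, p)
image_of_q = scale_then_shift(delta, q, p)
print(same_octant(p, q, r), delta, image_of_p, image_of_q)
```

The transformation leaves p in place and carries q onto r, so any D that is preserved by scaling and shifting must contain (p, r) whenever it contains (p, q).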
Given a dominance relationship D, if D(p, q) and
D(p, r), we say D is convex if D(p, γq + (1 − γ)r) for any constant real value 0 ≤ γ ≤ 1.
Lemma 3 Given a dominance relationship D satisfying the properties of scaling robustness, shifting robustness and transitivity, D must be convex.
Proof Given p, q and r such that D(p, q) and D(p, r), by the properties of scaling robustness and shifting robustness, we have D(p, p + γ(q − p)) and D(p + γ(q − p), p + γ(q − p) + (1 − γ)(r − p)). By the transitivity property, we have D(p, p + γ(q − p) + (1 − γ)(r − p)). Since p + γ(q − p) + (1 − γ)(r − p) = γq + (1 − γ)r, we reach the convexity condition by D(p, γq + (1 − γ)r).
Theorem 2 If D is a dominance relationship satisfying all the properties proposed in the last section, D must be equal to T D in some subspace S′ ⊆ S.
Proof Given any point p in the space S, by the rationality property, p dominates the hyperoctant that contains points worse than p on all dimensions. If this is the only hyperoctant dominated by p, it is T D in S.
If p dominates another hyperoctant with points better than p on some dimensions in S″ ⊆ S, by the convexity property, p dominates all points worse than p on dimensions in S′ = S − S″.
By Theorem 2, we have proved that the traditional dominance relationship is the only dominance relationship satisfying the properties of rationality, transitivity, scaling robustness and shifting robustness.
3.2 Relaxation of Properties
Although Theorem 2 shows that T D is the only option for a dominance relationship to satisfy all the four properties, we can obtain some other relationships if we are able to relax some of the properties. In this part of the section, we investigate along this direction by relaxing one of the following properties: transitivity, scaling robustness and shifting robustness. We will also discuss the scenarios in real applications when the relaxations are reasonable. Note that the rationality property cannot be relaxed since doing so can lead to unreasonable results.
Fig. 3 Example of dominance region after relaxing transitivity property (axes X, Y, Z; reference point q)
3.2.1 Relaxing the Property of Transitivity
By relaxing the transitivity property, a point p may not be able to dominate another point r, even if it dominates some point q that dominates r. However, Lemma
2 still applies since we have not relaxed the properties of scaling robustness and shifting robustness. Therefore, the dominance region of a point p will occupy arbitrary hyperoctants divided by p in the space. By the rationality property, p cannot dominate the hyperoctant with points better than p on all dimensions. Therefore, the dominance region of p can be nonconvex as shown in the example of Figure 3.
In this example, the dominance relationship definitely does not follow the transitivity property since the dominance region is not convex. We call this type of dominance relationship Octant Dominance, or OD in short.
We use Φ to denote the set of all OD relationships satisfying all the properties except transitivity. Since there are $2^d - 1$ hyperoctants to choose for the dominance relationship, there are $2^{2^d - 1}$ dominance relationships in Φ. Unfortunately, there does not exist a total order on these $2^{2^d - 1}$ relationships. Therefore, Φ cannot be an ordered class of dominance relationships, which prohibits the top-k skyline query directly over Φ. However, it is possible to find some ordered subset of Φ. It is not difficult to verify that the k-dominance proposed in [6] is a subset of Φ and is an ordered dominance class with d dominance relationships.
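To make the ordered class concrete, the sketch below treats i-dominance as $D_i$ for i = 1, ..., d and answers the top-k skyline query of Problem 2 by scanning the class in order. It is only an illustration: the skyline is computed by exhaustive comparison (which does not rely on transitivity), the data set is hypothetical, and the names `k_dominates`, `skyline` and `topk_skyline` are not taken from the paper.

```python
from functools import partial


def k_dominates(p, q, k):
    """p is no worse than q on at least k dimensions and better on at least one."""
    not_worse = sum(a <= b for a, b in zip(p, q))
    return not_worse >= k and any(a < b for a, b in zip(p, q))


def skyline(points, dominates):
    """S(P, D) by exhaustive comparison; no transitivity is assumed."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def topk_skyline(points, relations, k):
    """Scan D_1, ..., D_n and stop at the first skyline with at least k points.

    `relations` must be ordered as in Definition 6. Returns (i, S(P, D_i)),
    falling back to D_n when every skyline is smaller than k.
    """
    if not relations:
        raise ValueError("the dominance class must not be empty")
    for i, dominates in enumerate(relations, start=1):
        sky = skyline(points, dominates)
        if len(sky) >= k:
            return i, sky
    return len(relations), sky


# Hypothetical three-dimensional data set; D_i is i-dominance for i = 1, 2, 3.
data = [(1, 1, 3), (2, 2, 1), (3, 3, 2), (4, 4, 4), (0, 5, 5)]
ordered_class = [partial(k_dominates, k=i) for i in (1, 2, 3)]

sizes = [len(skyline(data, d)) for d in ordered_class]
print(sizes)                                 # [0, 1, 3]
print(topk_skyline(data, ordered_class, 1))  # (2, [(1, 1, 3)])
```

The skyline sizes never decrease along the class, as Lemma 1 requires, so the first relationship whose skyline reaches k points is the one asked for by Problem 2.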
The property of transitivity is important in some applications, especially when consistent results are expected, as in recommendation systems. However, as dimensionality increases, transitivity comes at the cost of less selective, and therefore less meaningful, results, since points become hard to dominate. Therefore, the relaxation of transitivity is a natural option for data in high-dimensional space. More details can be found in [6].
3.2.2 Relaxing the Property of Scaling Robustness
If we relax the property of scaling robustness, we can have many different variants of the traditional dominance relationship since it will be unnecessary to consider the scaling factor any more. The dominance region of a point p can even be discrete. For example, we can define a dominance relationship, D, which only allows a point p to dominate another point q if $q[i] - p[i] \in \mathbb{Z}^{+}$ for all i. It is not hard to verify that such a dominance relationship must follow all properties except for scaling robustness.
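The integer-gap relationship described above can be checked on a pair of points. The sketch below uses exact rational arithmetic so that the integrality test is not affected by rounding; the function names, the points and the vectors α and β are hypothetical choices made for the example.

```python
from fractions import Fraction


def int_gap_dominates(p, q):
    """D(p, q) iff q[i] - p[i] is a positive integer on every dimension."""
    gaps = [Fraction(qi) - Fraction(pi) for pi, qi in zip(p, q)]
    return all(g > 0 and g.denominator == 1 for g in gaps)


def scale(alpha, p):
    """Scaling operation: multiply dimension i by alpha[i]."""
    return tuple(Fraction(a) * Fraction(x) for a, x in zip(alpha, p))


def shift(p, beta):
    """Shifting operation: add beta[i] to dimension i."""
    return tuple(Fraction(x) + Fraction(b) for x, b in zip(p, beta))


p, q = (1, 1), (2, 3)
alpha = (Fraction(1, 2), 1)   # halve the first dimension only
beta = (Fraction(1, 4), -7)   # arbitrary shift

holds = int_gap_dominates(p, q)
after_shift = int_gap_dominates(shift(p, beta), shift(q, beta))
after_scale = int_gap_dominates(scale(alpha, p), scale(alpha, q))
print(holds, after_shift, after_scale)  # True True False
```

Shifting leaves every gap unchanged, whereas halving the first dimension turns the gap of 1 into 1/2, so the dominance between p and q is lost under scaling.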
Compared to the limited number of OD relationships introduced above, there are infinitely many dominance relationships satisfying all properties except scaling robustness. Moreover, in these relationships, we can find some ordered classes of infinite size. For example, an infinite ordered class can be defined as follows. Given the index set on all positive integers, Θ = {1, 2, . . . , i, . . .}, a dominance relationship $D_i$ is defined as $D_i(p, q)$ if $q[k] - p[k]$ is divisible by $2^{i-1}$ for all 1 ≤ k ≤ d. Each $D_i$, as defined above, must be a valid dominance relationship, satisfying all properties except for scaling robustness. Each pair of $D_i$ and $D_j$ (i < j) must also satisfy the requirement of Definition 6. Thus, these dominance relationships form an infinite ordered class and the top-k skyline query can be issued on them.
The property of scaling robustness is helpful when the dimensions are in totally different domains. In the classic example of hotel selection [4], with two dimensions on price and distance to beach, scaling robustness plays an important role on deriving meaningful result.
However, this property is no longer important if the dimensions are recorded with the same measurement. In a movie rating data set with every user as a single dimension, such as Netflix¹, the score on each dimension can only be some integer between 1 and 5. In such cases, scaling robustness can be relaxed without affecting the meaningfulness of the skyline query result.
¹ http://www.netflix.com
3.2.3 Relaxing the Property of Shifting Robustness
We can easily extend the analysis on relaxing scaling robustness to the new case where shifting robustness is relaxed instead. The key observation is that the property of scaling robustness is equivalent to the property of shifting robustness in the log-scale space log S, in which every point p is transformed to another point p′ = (log p[1], . . . , log p[d]) with the assumption that p[i] > 0 for all i.
Based on this observation, there is a mapping between the set of all relationships violating only scaling robustness and the set of relationships violating only shifting robustness. Given a relationship D satisfying all properties except for scaling robustness, we can construct a new relationship D′ such that D′(p, q) if D(log p, log q). D′ must satisfy scaling robustness, since D(log αp, log αq) = D(log p + log α, log q + log α) with any scaling vector α, and the right-hand side follows from D(log p, log q) by the shifting robustness of D. Since the reverse mapping can be constructed similarly, we can prove that the number of relationships violating only scaling robustness must be equal to the number of relationships violating only shifting robustness.
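The mapping can be illustrated by lifting a relationship that is shifting robust but not scaling robust through the logarithm. The sketch below reuses an integer-gap relationship as D and takes base-2 logarithms; the base, the tolerance in the integrality test, the function names and the sample points are all assumptions made for the example.

```python
import math


def int_gap_dominates(p, q, tol=1e-9):
    """D(p, q) iff q[i] - p[i] is a positive integer on every dimension."""
    gaps = [qi - pi for pi, qi in zip(p, q)]
    return all(g > 0 and abs(g - round(g)) < tol for g in gaps)


def log_lifted(dominates):
    """Build D'(p, q) = D(log p, log q); all coordinates must be positive."""
    def d_prime(p, q):
        log_p = tuple(math.log2(x) for x in p)
        log_q = tuple(math.log2(x) for x in q)
        return dominates(log_p, log_q)
    return d_prime


d_prime = log_lifted(int_gap_dominates)

p, q = (2.0, 4.0), (4.0, 32.0)   # log2 gives (1, 2) and (2, 5)
alpha = (8.0, 0.5)               # scaling vector with positive entries
beta = (2.0, 4.0)                # shifting vector

holds = d_prime(p, q)
after_scale = d_prime(tuple(a * x for a, x in zip(alpha, p)),
                      tuple(a * x for a, x in zip(alpha, q)))
after_shift = d_prime(tuple(x + b for x, b in zip(p, beta)),
                      tuple(x + b for x, b in zip(q, beta)))
print(holds, after_scale, after_shift)  # True True False
```

Scaling in the original space becomes a shift in the log-scale space, which D tolerates, while a shift in the original space changes the logarithmic gaps and breaks the dominance.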
Shifting robustness is a valuable property if some of the dimensions are error sensitive or subjective. Considering the example of the movie rating data set, some of the raters are prone to give higher scores to all movies while some others always give low scores. Shifting robustness is able to remove the influence of these factors, rendering consistent results. In some applications with relatively stable and positive values on all dimensions, such as climate data collected over a sensor network, shifting robustness is no longer necessary, allowing relaxation on this property.
4 General Algorithm Design
In this section, we discuss how existing skyline algorithms can be modified to answer the skyline query as well as the top-k skyline query, based on the properties of the underlying dominance relationship used. Note that all the algorithm designs here assume complete information for every point on all dimensions. The extensions to data with missing values will be discussed in later sections.
4.1 General Algorithms for Skyline Queries
There are four algorithms discussed below: Block Nested Loop, Sort Filter Skyline, Two Scan Algorithm and Branch-and-Bound Skyline.
Algorithm 1 General Block Nested Loop (data set P, dominance relationship D)
1: clear the skyline buffer S
2: for each point p in P do
3:   isSky = TRUE
4:   for each point q in S do
5:     if D(q, p) then
6:       isSky = FALSE
7:     end if
8:     if D(p, q) then
9:       remove q from S
10:    end if
11:  end for
12:  if isSky then
13:    move p into S
14:  end if
15: end for
16: return S
Algorithm 2 General Sort Filter Skyline (data set P, dominance relationship D)
1: sort the data set P on the sum of all dimensions for every point
2: clear the skyline buffer S
3: for each point p in P do
4:   for each point q in S do
5:     if D(q, p) then
6:       go to line (3)
7:     end if
8:   end for
9:   move p into S
10: end for
11: return S
4.1.1 Block Nested Loop
In Algorithm 1, we list the details of the General Block
Nested Loop algorithm, which was first proposed in [4].
In this algorithm, a buffer S is maintained for the current skyline on the data seen so far. If a new point p dominates some current skyline point q, q is removed from S. If no point in S dominates p, p is moved into
S. This simple algorithm is of the widest applicability range, as stated in the following theorem.
Theorem 3 Algorithm 1 can be applied on any dominance relationship satisfying the transitivity property.
Proof By Algorithm 1, S must contain all the true skyline points S, since no other points dominates them by definition. On the other hand, assume S contains a false skyline point p, and p is dominated by another point q.
If q ∈ S, q must be eliminated when the algorithm visits
p or q. If q 6∈ S, q must be dominated by another point
r. By the transitivity property r, r must dominate p.
Following the logic, we can always find a skyline point dominating p, which contradicts the assumption.
If the underlying dominance relationship does not have the property of transitivity, BNL will fail to retrieve the correct skyline set, since some dominance pairs can be missed when some points are dropped from the temporary skyline buffer S.
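The buffer maintenance described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `general_bnl` and `traditional_dominance` are hypothetical, smaller values are assumed to be preferred, and the dominance test is passed in as a function so that any transitive relationship can be plugged in.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Dominance = Callable[[Point, Point], bool]


def traditional_dominance(p: Point, q: Point) -> bool:
    """TD with smaller values preferred: p is no worse everywhere, better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def general_bnl(points: List[Point], dominates: Dominance) -> List[Point]:
    """Block nested loop over an arbitrary (assumed transitive) dominance test."""
    skyline: List[Point] = []
    for p in points:
        is_sky = True
        survivors: List[Point] = []
        for q in skyline:
            if dominates(q, p):
                is_sky = False          # p is dominated by a buffered point
            if not dominates(p, q):
                survivors.append(q)     # q stays unless p dominates it
        skyline = survivors
        if is_sky:
            skyline.append(p)
    return skyline


data = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(general_bnl(data, traditional_dominance))  # [(1, 4), (2, 2), (4, 1)]
```

Passing the relationship as a callable mirrors the role of the parameter D in Algorithm 1; the sketch keeps the whole buffer in memory and ignores the block-wise I/O handling of the original BNL.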
4.1.2 Sort Filter Skyline
Next, we consider the General Sort Filter Skyline algorithm, which was first proposed in [8]. In this algorithm, the points are first sorted in an order that is topological with respect to all dimensions (Algorithm 2 sorts on the sum of all dimensions), and the block nested loop algorithm is then applied. The benefit of presorting is that any point moved into the skyline buffer must be a true skyline point at the end, which saves the time otherwise spent on pruning false skyline points.
The applicability range of Algorithm 2 is smaller than that of Algorithm 1.
Theorem 4 Algorithm 2 can be applied on any dominance relationship satisfying the rationality and transitivity properties.
Proof This algorithm must rely on the transitivity property since all points are pruned only by the skyline points kept in the buffer. On the other hand, sorting can prevent a current skyline point from being dominated later only when the dominance relationship follows the rationality and transitivity properties.
If either of the two properties does not hold, the SFS algorithm cannot output the correct result, because 1) the sorting of the points implicitly assumes the property of rationality when points with smaller values are preferred, and 2) the skyline buffer S may not contain enough points to dominate nonskyline points if transitivity is violated. Based on this observation, the requirements on SFS are tight.
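A minimal Python sketch of the presort-then-filter idea follows. It assumes smaller values are preferred and uses the sum of all dimensions as the sort key, as in Algorithm 2; the function names are hypothetical and the dominance test is again supplied by the caller.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Dominance = Callable[[Point, Point], bool]


def traditional_dominance(p: Point, q: Point) -> bool:
    """TD with smaller values preferred."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def general_sfs(points: List[Point], dominates: Dominance) -> List[Point]:
    """Sort on the sum of all dimensions, then keep every point that no buffered point dominates."""
    skyline: List[Point] = []
    for p in sorted(points, key=sum):
        if not any(dominates(q, p) for q in skyline):
            skyline.append(p)   # never removed later: the sort order protects it
    return skyline


data = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(general_sfs(data, traditional_dominance))  # [(2, 2), (1, 4), (4, 1)]
```

Unlike the BNL loop, no point is ever removed from the buffer, which is exactly where the rationality assumption enters: a point with a larger coordinate sum is assumed unable to dominate one with a smaller sum.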
4.1.3 Two Scan Algorithms
The third algorithm in our consideration is the General
Two Scan Algorithm, which was first proposed in [6].
This algorithm consists of two scans. In the first scan,
SFS is run on the data set to obtain a candidate set
S. In the second scan, points in S are compared with all points in P to eliminate false skyline points. The details are summarized in Algorithm 3. This algorithm returns correct results even on a dominance relationship without the transitivity property.
Theorem 5 Algorithm 3 can be applied on any dominance relationship satisfying the rationality property.
Algorithm 3 General Two Scan Algorithm (data set P, dominance relationship D)
1: get candidate set S by running SFS
2: for each point p in P do
3:   for each point q in S do
4:     if D(p, q) then
5:       remove q from S
6:     end if
7:   end for
8: end for
9: return S
Algorithm 4 General BBS Algorithm (data set P, dominance relationship D, index tree T)
1: clear a heap H and skyline buffer S
2: put root node N of T into H
3: while H is not empty do
4:   pick a node n from H with the minimum possible distance to the space origin, and remove n from H.
5:   Set M as the MBR on node n
6:   for each point p in S do
7:     if DP(M) is dominated by p then
8:       go to line (3)
9:     end if
10:  end for
11:  if n is a single point then
12:    move n into S
13:  else
14:    retrieve all children of n in T, and insert them into H
15:  end if
16: end while
17: return S
Proof (of Theorem 5) Compared against the SFS algorithm, Algorithm 3 employs a second scan. The second scan enables the algorithm to find skylines even when the dominance relationship does not follow the transitivity property.
The tightness of the requirement on TSA comes from the simple observation that the employment of SFS leads to the sorting of points, which depends on the property of rationality. As discussed for the SFS algorithm, rationality is the underlying reason for the validity of the sorting process.
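The effect of the second scan can be illustrated with a small Python sketch. The k-dominance test below is used purely as a familiar example of a non-transitive relationship; whether it meets the rationality property as defined in this paper is not checked here, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Dominance = Callable[[Point, Point], bool]


def make_k_dominance(k: int) -> Dominance:
    """p k-dominates q if p is no worse on at least k dimensions and better on one."""
    def dominates(p: Point, q: Point) -> bool:
        no_worse = sum(a <= b for a, b in zip(p, q))
        better = any(a < b for a, b in zip(p, q))
        return no_worse >= k and better
    return dominates


def general_tsa(points: List[Point], dominates: Dominance) -> List[Point]:
    # First scan: SFS-style filtering produces a candidate set.
    candidates: List[Point] = []
    for p in sorted(points, key=sum):
        if not any(dominates(q, p) for q in candidates):
            candidates.append(p)
    # Second scan: every point of P may eliminate a candidate.
    for p in points:
        candidates = [q for q in candidates if not dominates(p, q)]
    return candidates


# Cyclic 2-dominance: a dominates b, b dominates c, c dominates a.
a, b, c = (1, 2, 3), (2, 3, 1), (3, 1, 2)
print(general_tsa([a, b, c], make_k_dominance(2)))  # []
```

In this example the first scan alone would keep a and c, because b is discarded before it can eliminate c; the second scan removes both, leaving the empty skyline that the cyclic relationship implies.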
4.1.4 Branch-and-Bound Skyline
Finally, we look at the more complicated General BBS algorithm, which was first proposed in [19]. In this algorithm, there exists an index structure, such as the R-Tree, where every point can be found efficiently. Each node in the index has an MBR (minimum bounding rectangle), which is the bounding range of all the points stored in its descendant nodes.
To facilitate the adoption of indexing tree structure, we propose a new concept, Common Dominating Position, over the Minimum Bounding Rectangles. Intuitively, a common dominating position can be abstracted as some location in the space, which is able to dominate any possible point in the MBR. The following lemma implies the existence of a common dominating position for any MBR.
Lemma 4 If the dominance relationship satisfies the properties of rationality and transitivity, a common dominating position always exists for any MBR.
Proof We prove this by construction. Given two points $p_1$ and $p_2$, there is definitely at least one common dominating position for $\{p_1, p_2\}$ because of the property of rationality. Assuming there is a set $P = \{p_1, p_2, \ldots, p_n\}$ containing points stored in some MBR M, the common dominating position for M can be constructed in n − 1 steps. In the first step, the common dominating position $p'_2$ is discovered for $\{p_1, p_2\}$. In step i (1 < i ≤ n − 1), a new position $p'_{i+1}$ is found as the common dominating position for $\{p'_i, p_{i+1}\}$. Because $p'_{i+1}$ dominates $p'_i$, it also dominates any $p_j$ (j ≤ i) due to the property of transitivity. Therefore, the final position $p'_n$ must be a common dominating position for the whole set P, which completes the proof of the lemma.
In the rest of the paper, we use DP (M ) to denote the common dominating position for some MBR M .
The computation of the positions will be covered when the specific dominance relationship is introduced. Generally speaking, the applicability of BBS algorithm can be summarized by the following theorem.
Theorem 6 Algorithm 4 can be applied on any dominance relationship satisfying the rationality and transitivity properties.
Proof Considering a node n in the indexing tree, if there is at least one skyline point q in node n, n can never be removed by the BBS algorithm because of the properties of rationality and transitivity. Otherwise, some point p in the buffer S would be able to dominate the MBR of n, contradicting the fact that q is a skyline point.
Therefore, every skyline point must be included in the buffer S. On the other hand, S can never contain any false positive skyline point q, because there is at least one dominating point in S for q based on the property of transitivity. In summary, the output of BBS must be the correct skyline set.
To discuss the tightness of the BBS algorithm, we first look at the property of rationality. If the dominance relationship violates rationality, there is no longer any guarantee on the correctness of step (4), since the point selected may not be a skyline point. Secondly, when there is no property of transitivity, the skyline buffer S does not have full capacity to prune all nonskyline points, for a similar reason as for BNL and SFS. This leads to the conclusion that neither property is removable from the requirements for the BBS algorithm.
We summarize the necessary conditions of the four algorithms in this part in Table 4.
Algo Name   Rationality   Transitivity
BNL                       √
SFS         √             √
TSA         √
BBS         √             √
Table 4 Necessary Properties of Algorithms 1,2,3 and 4
Algorithm 5 General Binary Search Algorithm (data set P, dominance class D, skyline algorithm A, skyline size k)
1: set l = 1 and u = n
2: compute $S(P, D_u)$ by running a skyline query algorithm
3: if $|S(P, D_u)| \le k$ then
4:   return $S(P, D_u)$
5: end if
6: $i = \lfloor (l + u)/2 \rfloor$
7: while $|S(P, D_i)| \ne k$ and $l \ne u$ do
8:   if $|S(P, D_i)| < k$ then
9:     l = i
10:  else
11:    u = i
12:  end if
13:  $i = \lfloor (l + u)/2 \rfloor$
14: end while
15: return $S(P, D_i)$
4.2 General Algorithms for Topk Skyline Queries
To support efficient computation of topk skyline queries, we summarize two general methods which are based on the general algorithms proposed for skyline queries.
In this subsection, we also assume $\mathcal{D}$ is finite, where there are n dominance relationships $\mathcal{D} = \{D_1, D_2, \ldots, D_n\}$ with index set Θ on [1, n]. To handle an infinite dominance class with a dominance index Θ on a real interval [a, b] with the same algorithm, we can use a minimum gap ε to discretize the class, i.e., constructing $n = \lfloor (b - a)/\epsilon \rfloor$ dominance relationships where $D_i$ ($1 \le i \le n$) equals the original relationship with index $a + \epsilon(i - 1)$.
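As a small worked instance of this discretization, take the interval length as b − a and assume, purely for illustration, the interval [a, b] = [0, 0.5] with a gap of ε = 0.01 (these values are not prescribed by the text):

```latex
% Assumed illustrative values: [a, b] = [0, 0.5], \epsilon = 0.01
n = \left\lfloor \frac{b - a}{\epsilon} \right\rfloor
  = \left\lfloor \frac{0.5 - 0}{0.01} \right\rfloor = 50,
\qquad
D_i \leftrightarrow a + \epsilon (i - 1):\quad
D_1 \leftrightarrow 0,\; D_2 \leftrightarrow 0.01,\; \ldots,\; D_{50} \leftrightarrow 0.49 .
```

A smaller gap gives a finer resolution of result sizes at the cost of a logarithmically longer binary search.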
4.2.1 Binary Search
The first general algorithm in our consideration is a binary search on the index of dominance relationships $D_i \in \mathcal{D}$. It is so general that any algorithm proposed in Section 4.1 can be directly used to find the dominance relationship $D_i$, satisfying the condition in Problem 2, with the smallest index i in $\mathcal{D}$.
The correctness of Algorithm 5 is straightforward. Since the skyline monotonically shrinks with the decrease of the dominance relationship index i, binary search can definitely reach the smallest $D_i$ with exactly k results (or the smallest one above k). However, in some cases, even the weakest dominance relationship in the class cannot return a skyline of size no smaller than k. The simplest example is the construction of a dominance class with only one dominance relationship. In such cases, the algorithm can only return the maximal skyline calculated based on the weakest dominance relationship in the dominance class. Generally speaking, the different $D_i$s used in different iterations of the binary search do not correlate. Therefore, Algorithm 5 computes each $S(P, D_i)$ by calling the most appropriate algorithm as presented in Section 4.1.
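The search over the dominance index can be sketched in Python as below. This is a variant written for illustration rather than a transcription of Algorithm 5: it uses a lower/upper bound update that always terminates, a brute-force skyline routine that works for any relationship, and a hypothetical toy class `make_margin_dominance(t)` (p dominates q if p[i] + t < q[i] on every dimension), which is not one of the relationships proposed in this paper. The class is listed so that the skyline grows with the index.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Dominance = Callable[[Point, Point], bool]


def brute_force_skyline(points: List[Point], dominates: Dominance) -> List[Point]:
    """Correct for any relationship: keep the points dominated by nobody."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def make_margin_dominance(t: float) -> Dominance:
    """Hypothetical toy class: a larger margin t means weaker dominance."""
    return lambda p, q: all(a + t < b for a, b in zip(p, q))


def binary_search_topk(points, dominance_class, skyline_algorithm, k):
    """Return the skyline of the smallest index whose skyline has at least k points."""
    best = skyline_algorithm(points, dominance_class[-1])
    if len(best) <= k:
        return best                      # even the weakest relationship is too selective
    lo, hi = 0, len(dominance_class) - 1  # invariant: best is the skyline under index hi
    while lo < hi:
        mid = (lo + hi) // 2
        sky = skyline_algorithm(points, dominance_class[mid])
        if len(sky) < k:
            lo = mid + 1
        else:
            hi, best = mid, sky
    return best


points = [(1, 1), (2, 2), (3, 3), (4, 4)]
dominance_class = [make_margin_dominance(t) for t in (0, 1, 2, 3)]  # skyline sizes 1, 2, 3, 4
print(binary_search_topk(points, dominance_class, brute_force_skyline, 2))  # [(1, 1), (2, 2)]
```

Any of the skyline routines of Section 4.1 could replace `brute_force_skyline`, provided the relationship at hand meets that routine's requirements.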
4.2.2 Progressive Algorithm
The second general method is to modify any progressive skyline algorithm to return topk skyline query results.
A progressive skyline algorithm keeps all current skyline points in a buffer, which we use to answer topk skyline queries. The general progressive algorithm is shown in Algorithm 6. The algorithm starts with the largest dominance index i = n and runs a progressive skyline search with dominance relationship $D_i$, with i being decreased whenever a full buffer is encountered.
When the buffer contains k + 1 points, at least one point will be removed from the buffer. To guarantee the correctness of the current topk result, we need to find the point that is easier to dominate than any other point in S. This is implemented by Algorithm 7. Every pair of points in the buffer is examined, and the point p dominated by some q with the largest $D_i$ will be exactly the one wanted. The efficiency of the algorithm can be further improved if we store the previous pairwise computation results, since there is only one new point after Algorithm 7 is applied once.
The correctness of the algorithm depends on two conditions. First, the underlying skyline algorithm A must be progressive. Only with progressive algorithms can we make sure that the points in the buffer are definitely the topk skyline query result after the algorithm completes its run. Second, it must be easy to find the largest $D_i \in \mathcal{D}$ for q to dominate p. Fortunately, all dominance relationships and their relaxed variants listed in this paper meet this condition. Therefore, we can focus on the underlying skyline algorithm in analyzing the applicability of the progressive scheme.
In Table 5, we list the analysis of the possible combinations of general skyline algorithms and the general topk skyline algorithms. We can see that all the four algorithms can be directly used in binary search while only SFS and BBS can be used in the progressive algorithm.
Algo Name   Binary Search   Progressive
BNL         √
SFS         √               √
TSA         √
BBS         √               √
Table 5 Combinations of Algorithms and Topk Skyline
Algorithm 6 General Progressive Algorithm (data set P, dominance relationship class D, progressive skyline algorithm A, expected skyline size k)
1: construct a skyline buffer S of size k + 1
2: set i = n and use $D_i$ as the current dominance relationship
3: run A on P with $D_i$, run Skyline Pruning when the S is full
4: return S
Algorithm 7 Skyline Pruning (Skyline Buffer S, current relationship $D_i$, dominance class D)
1: set θ = 1
2: for each point p ∈ S do
3:   for each point q ∈ S and p ≠ q do
4:     compute the largest j that $D_j(q, p)$
5:     if j ≥ θ then
6:       θ = j and mark p
7:     end if
8:   end for
9: end for
10: set $D_\theta$ as new relationship and remove the last marked point from S
5 Cone Dominance with Arbitrary Resolution
A common problem with skyline query is its uncontrollable result size. Although there are studies on skyline variants to reduce the result size when conventional skyline is too large, there does not exist any systematic method which can adaptively output result with specified size, no matter whether conventional skyline is oversized or undersized.
In this section, we apply our framework on the design of some generalized dominance relationships. We propose a new class of dominance relationships, namely Cone Dominance, with arbitrary selection resolution if the parameters are set appropriately. In the following, we use E(p, q) to denote the Euclidean distance between p and q in the space S.
Definition 8 Cone Dominance: Given a parameter γ, a point p dominates q if (1) $\sum_i p[i] < \sum_i q[i]$, and (2) $p[i] \le q[i] + E(p, q)\gamma$ for all i, while there is at least one dimension j that $p[j] < q[j] + E(p, q)\gamma$. We use $CD_\gamma(p, q)$ to denote such a relationship.
In Figure 4, an example is shown in two-dimensional space. Given a point p in the space, the dominance region can be represented by a cone region with the top point at p. In the figure, for example, there are three pairs of boundary lines of the dominance region, represented by dashed, normal and thick lines respectively.
These dominance regions are achieved by assigning negative, zero and positive γs respectively. When γ = 0, as the pair of normal lines show, cone dominance degenerates to traditional dominance TD. The following lemma states the valid index set on γ.
Fig. 4 Example of cone dominance relationships (axes X and Y; dominance regions of a point p)
Lemma 5 $CD_\gamma$ satisfies all properties except scaling robustness, if γ is chosen from index set $\Theta = \left[-\sqrt{1/d}, \sqrt{(d-1)/d}\right]$.
Proof First, $CD_\gamma$ is null if γ is smaller than $-\sqrt{1/d}$. If there is a dominance relationship $CD_\gamma(p, q)$ and $\gamma < -\sqrt{1/d}$, then $E^2(p, q) = \sum_i (p[i] - q[i])^2 > \sum_i E^2(p, q)/d = E^2(p, q)$, which is a contradiction. Thus, we cannot find any dominance pair in the space, making the dominance relationship useless.
Second, consider a dominance relationship $CD_\gamma(p, q)$ with $\gamma > \sqrt{(d-1)/d}$ in which $p[i] - q[i] > E(p, q)\sqrt{(d-1)/d}$ on some dimension i. Therefore, we have $\sum_j p[j] - \sum_j q[j] > E(p, q)\sqrt{(d-1)/d} - (d-1)E(p, q)\sqrt{1/(d(d-1))} = 0$, which violates the first condition of cone dominance. Thus, $CD_\gamma$ with $\gamma > \sqrt{(d-1)/d}$ must be an empty relationship.
The shifting robustness property on cone dominance is straightforward since the distance between two points does not change after shifting operations. Rationality property is proved directly by the definition.
For transitivity, if $CD_\gamma(p, q)$ and $CD_\gamma(q, r)$, then $\sum_i p[i] < \sum_i q[i]$ and $\sum_i q[i] < \sum_i r[i]$. So, $\sum_i p[i] < \sum_i r[i]$, which proves the first condition of cone dominance. On the other hand, since $p[i] \le q[i] + E(p, q)\gamma$ and $q[i] \le r[i] + E(q, r)\gamma$, $p[i] \le r[i] + E(p, q)\gamma + E(q, r)\gamma$. By the triangle inequality of Euclidean distance, we have $p[i] \le r[i] + E(p, r)\gamma$.
Algorithm 8 Find Common Dominating Position (MBR M, parameter γ)
1: if γ ≥ 0 then
2:   Return (M.l[1], M.l[2], . . . , M.l[d])
3: end if
4: Find j with minimal M.u[j] − M.l[j]
5: Find the minimal α that $(M.u[j] - M.l[j] + \alpha)^2 \ge \left(\sum_i (M.u[i] - M.l[i] + \alpha)^2\right)\gamma^2$
6: Return (M.l[1] − α, . . . , M.l[d] − α)
Based on the last lemma, the index set Θ can be defined as all real numbers in the interval $\left[-\sqrt{1/d}, \sqrt{(d-1)/d}\right]$.
Lemma 6 If $CD_\gamma(p, q)$, $CD_{\gamma'}(p, q)$ for all $\gamma' \ge \gamma$.
Proof If $CD_\gamma(p, q)$, $p[i] \le q[i] + E(p, q)\gamma$. When $\gamma' \ge \gamma$, $p[i] \le q[i] + E(p, q)\gamma'$. Since the first condition does not change when changing γ to γ', $CD_{\gamma'}(p, q)$ is valid in all cases. Cone dominance is therefore an ordered dominance class indexed by $\Theta = \left[-\sqrt{1/d}, \sqrt{(d-1)/d}\right]$.
Note that the generic full order, $\preceq$, in Definition 6 is instantiated as $\gamma_1 \ge \gamma_2$ in cone dominance relationship.
Intuitively, we can interpret cone dominance as follows. If γ is negative, a point p dominates another point
q only when every dimension contributes enough to the difference between p and q. In other words, thorough advantage to some extent is expected on all dimensions.
On the other hand, if γ is positive, the definition allows
p to dominate q even when p does not show too much disadvantage on all dimensions compared to the total difference between them. Based on the adjustment on the parameter γ, cone dominance can adaptively choose the resolution on the query, leading to controllable size of query output.
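The effect of the sign of γ can be checked with a short Python predicate. The sketch assumes the two conditions of cone dominance are read as a comparison of coordinate sums plus a per-dimension test with slack E(p, q)γ, and that smaller values are preferred; `cone_dominates` is a hypothetical name.

```python
import math
from typing import Sequence


def cone_dominates(p: Sequence[float], q: Sequence[float], gamma: float) -> bool:
    """Cone dominance test under the assumed reading of Definition 8."""
    slack = math.dist(p, q) * gamma          # E(p, q) * gamma
    return (
        sum(p) < sum(q)                                       # condition (1)
        and all(a <= b + slack for a, b in zip(p, q))         # condition (2)
        and any(a < b + slack for a, b in zip(p, q))
    )


# gamma = 0 behaves like traditional dominance.
print(cone_dominates((1, 2), (2, 2), 0.0))
# A positive gamma tolerates a small disadvantage on one dimension.
print(cone_dominates((1, 2.2), (2, 2), 0.5), cone_dominates((1, 2.2), (2, 2), 0.0))
# A negative gamma demands a clear advantage on every dimension.
print(cone_dominates((1, 1), (2, 2), -0.5), cone_dominates((1, 1.9), (2, 2), -0.5))
```

Because the slack scales with the distance between the two points, the region dominated by p is a cone with its apex at p, widening for positive γ and narrowing for negative γ.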
By the mapping method derived in Section 3.2.3, LogScale Cone Dominance ($LCD_\gamma$) can be defined as $LCD_\gamma(p, q)$ if and only if $CD_\gamma(\log p, \log q)$. LogScale Cone Dominance thus violates only shifting robustness while satisfying all other properties.
Since cone dominance and logscale cone dominance follow both rationality and transitivity property, all of the algorithms presented in Section 4 can be used to answer skyline query and topk skyline query on them.
As presented in Section 4, the adoption of indexing tree structure depends on the existence of common dominating position for MBR M . Here, we discuss the details of the computation on the common dominating position with respect to cone dominance relationships.
The extension to logscale cone dominance is straightforward.
For an MBR M, the lower and upper boundaries of M on dimension i are represented by M.l[i] and M.u[i] respectively. In Algorithm 8, we present a method for computing the common dominating position, with input MBR M and the cone dominance parameter γ.
In Algorithm 8, if γ is not negative, the method simply returns the left-bottom corner of the MBR, since the corner position is enough to dominate all points in it. This can be easily interpreted by the example in Figure 4. When γ is negative, the problem becomes more complicated because the corner is not capable enough. To discover a stronger position, the algorithm searches on the line crossing the corner position, i.e. $p_\alpha = (M.l[1] - \alpha, \ldots, M.l[d] - \alpha)$ with α as a positive parameter. When the condition on line (5) of Algorithm 8 is satisfied, it is easy to verify that the whole MBR is covered by the dominance region of the position $p_\alpha$.
Since the condition on line (5) is actually a quadratic inequality on variable α, a simple solution can be derived in constant time. Therefore, $p_\alpha$ can be easily calculated and returned as the common dominating position for the MBR M.
6 Dominance Relationships on Data with Missing Values
Data tuples with missing values are a common kind of data found in many databases. In a movie rating data set, for example, if we consider every rater as a dimension, every dimension has only very few entries in which a value is present, while others are filled with
NULL values. This is because every user can possibly watch some small fraction of all movies. However, such uncertainties can lead to difficulty in using the skyline query on highquality movie selection since it is difficult to compare a pair of movies when they have been watched and rated by totally different users.
In this section, we extend our analysis from certain data to data with missing values. We first propose a mappingbased dominance relationship definition for data with missing values. Then, we study the properties of rationality, scaling robustness, shifting robustness and transitivity for the mappingbased dominance relationship. We also discuss the combinations of mapping dominance with other relaxed dominance relationships presented in previous section, as well as the algorithm applicabilities on the computation of both skyline and topk skyline queries with mapping dominance.
Note that mapping dominance is not the only option on possible dominance relationships defined on incomplete data tuples. The more important implication provided in this section is that welldefined dominance relationships can be easily incorporated into our general skyline framework, which facilitates easier analysis and querying algorithm design.
6.1 Mapping Dominance
Given a space S defined by a set of d dimensions {1, 2, . . . , d}, without loss of generality, we suppose that missing values appear on the first k dimensions, i.e., from dimension 1 to dimension k. For a dimension i (1 ≤ i ≤ k), we use $\phi_i(x_i)$ to denote the probability distribution function (or pdf) of the value $x_i$ on dimension i. Such a distribution can be retrieved by observing all the nonmissing values on dimension i.
Now consider a point $p(x_1, x_2, \ldots, x_d)$ with missing values on some dimensions. A generic mapping h is defined as a function converting incomplete point p to a fixed point $p'(x'_1, x'_2, \ldots, x'_d)$ in S, i.e., $h(p(x_1, x_2, \ldots, x_d)) = p'(x'_1, x'_2, \ldots, x'_d)$, where
$$x'_i = \begin{cases} F_i(\phi_i(x_i)), & \text{if } x_i \text{ is NULL} \\ x_i, & \text{otherwise} \end{cases}$$
The generic mapping h retains each concrete value of p, and estimates a concrete value for each missing value based on the known pdf on that relevant dimension.
The estimation is accomplished by a function $F_i$, which takes a pdf as input and outputs a fixed value.
To facilitate comparison of an arbitrary pair of points p and q with possible missing values in a generic way, we define two mappings, f and g, as instances of the general mapping h aforementioned. Based on these two mappings, we define a special dominance relationship called Mapping Dominance, or MD, as follows:
$MD = \{(p, q) \mid TD_S(f(p), g(q))\}$, where $TD_S$ is the traditional dominance relationship over S. We call the mapping f dominating position mapping and g dominated position mapping, in the sense that f tends to leave point p dominating others while g tends to leave point q being dominated by others.
Now we consider the properties of the mapping dominance (MD) relationship that we have proposed for data with missing values. The following lemmas state the conditions that enable the MD relationship to satisfy the properties of scaling robustness, shifting robustness and transitivity.
Lemma 7 Given any point p with possible missing values, if the two mappings f and g satisfy αf(p) = f(αp) and αg(p) = g(αp) for any scaling factor α, then MD satisfies the property of scaling robustness.
Proof Assume two points p, q and MD(p, q). If αf(p) = f(αp) and αg(q) = g(αq), (αp, αq) must be in MD since (f(p), g(q)) is in TD and TD satisfies scaling robustness for any scaling factor α.
Lemma 8 If for any point p, the mappings f and g satisfy f(p) + β = f(p + β) and g(p) + β = g(p + β) for any shifting factor β, then MD satisfies the property of shifting robustness.
Proof Assume two points p, q and MD(p, q). If f(p) + β = f(p + β) and g(q) + β = g(q + β), (p + β, q + β) must be in MD, since (f(p), g(q)) is in TD and TD is shifting robust.
Lemma 9 If for any point p with possible missing values, (g(p), f(p)) ∈ TD or g(p) = f(p), MD embodies the transitivity property.
Proof Assume three points p, q, r such that MD(p, q) and MD(q, r). Since TD(f(p), g(q)) and TD(g(q), f(q)), we have TD(f(p), f(q)) by the transitivity property of TD. Since TD(f(q), g(r)), we obtain TD(f(p), g(r)) and thus MD(p, r) by using the transitivity property again.
To satisfy the conditions of the lemmas above, we propose a new type of dominance relationships called $MD_\lambda$. If $O_i$ is the set of observed values on nonNULL entries on dimension i, we can construct the observation sets $O_1, O_2, \ldots, O_k$ by visiting the data set once. The dominating position mapping $f_{\lambda,i}$ for missing values on dimension i is thus defined as $f_{\lambda,i}(NULL) = x'_i$ such that $(1-\lambda)|O_i|$ values in $O_i$ are smaller than $x'_i$, while the dominated position mapping $g_{\lambda,i}$ is defined as $g_{\lambda,i}(NULL) = x''_i$ such that $\lambda|O_i|$ values in $O_i$ are smaller than $x''_i$.
Point   Dimension 1   Dimension 2
A       NULL          0.5
B       0.3           NULL
C       0.5           0.4
D       0.7           0.5
E       0.9           0.6
Table 6 Example of Mapping Dominance
In Table 6, we present a small example of mapping dominance over a data set with missing values. In this data set, the values on the first dimension of point A and on the second dimension of point B are missing. If λ = 0.25, $f_{\lambda,1}(A[1]) = 0.9$ and $g_{\lambda,1}(A[1]) = 0.3$, since one out of four existing values over the first dimension is no larger than 0.3 while only one value is no smaller than 0.9.
If λ = 0.5, $f_{\lambda,2}(B[2]) = 0.5$ and $g_{\lambda,2}(B[2]) = 0.5$, since at least half of the existing values are no larger than 0.5 while at least half are no smaller than 0.5.
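The example can be reproduced with a small Python sketch. The rank convention used here (the ⌈λ|O_i|⌉-th smallest observed value for g, and the ⌈λ|O_i|⌉-th largest for f) is an assumption chosen so that the numbers match Table 6; the function names are hypothetical and smaller values are taken as preferred.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Row = Tuple[Optional[float], ...]


def observed(data: Sequence[Row], dim: int) -> List[float]:
    """Sorted non-missing values on one dimension (the observation set O_i)."""
    return sorted(p[dim] for p in data if p[dim] is not None)


def _rank(lam: float, n: int) -> int:
    return max(1, math.ceil(lam * n - 1e-9))   # small tolerance against float noise


def g_map(obs: List[float], lam: float) -> float:
    return obs[_rank(lam, len(obs)) - 1]       # optimistic (small) estimate


def f_map(obs: List[float], lam: float) -> float:
    return obs[len(obs) - _rank(lam, len(obs))]  # conservative (large) estimate


def fill(p: Row, data: Sequence[Row], lam: float, est: Callable) -> Tuple[float, ...]:
    return tuple(v if v is not None else est(observed(data, i), lam) for i, v in enumerate(p))


def traditional_dominance(p, q) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def mapping_dominates(p: Row, q: Row, data: Sequence[Row], lam: float) -> bool:
    """MD_lambda(p, q): TD between f(p) and g(q)."""
    return traditional_dominance(fill(p, data, lam, f_map), fill(q, data, lam, g_map))


A, B, C, D, E = (None, 0.5), (0.3, None), (0.5, 0.4), (0.7, 0.5), (0.9, 0.6)
table6 = [A, B, C, D, E]
print(f_map(observed(table6, 0), 0.25), g_map(observed(table6, 0), 0.25))  # 0.9 0.3
print(f_map(observed(table6, 1), 0.5), g_map(observed(table6, 1), 0.5))    # 0.5 0.5
```

With this convention g never exceeds f as long as λ ≤ 1/2, which is the condition that the transitivity argument relies on.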
Theorem 7 If 0 ≤ λ ≤ 1/2, then the $MD_\lambda$ relationship satisfies the properties of rationality, scaling robustness, shifting robustness and transitivity.
Proof $MD_\lambda$ definitely follows the rationality property since it degenerates to TD when two points have no missing values. Given any scaling factor α and shifting factor β, the mappings $f_{\lambda,i}$ and $g_{\lambda,i}$ follow the conditions of Lemma 7 and Lemma 8. Therefore, it must be scaling and shifting robust. Finally, when 0 ≤ λ ≤ 1/2, $g_{\lambda,i}(x_i)$ must be smaller than or equal to $f_{\lambda,i}(x_i)$ for any $x_i$; this property satisfies the condition of Lemma 9, making the transitivity property valid.
Even when λ is larger than 1/2, $MD_\lambda$ only violates the transitivity property while keeping all the other properties valid. Another advantage of $MD_\lambda$ is its natural extension to ordered dominance class based on the following lemma.
Lemma 10 If $MD_\lambda(p, q)$, $MD_{\lambda'}(p, q)$ for all $\lambda' \ge \lambda$.
Proof Since $f_{\lambda',i}(x_i) \le f_{\lambda,i}(x_i)$ for any $x_i$ when $\lambda' \ge \lambda$, and $g_{\lambda',i}(x_i) \ge g_{\lambda,i}(x_i)$ for any $x_i$ when $\lambda' \ge \lambda$, p dominates q when λ increases to λ'.
Therefore, mapping dominance is an ordered dominance class with index set Θ based on all λ values on the real number interval [0, 0.5].
We note here that our method for data with missing values should not be seen as a simple plugin of constants for missing values. The reasons are twofold.
First, we decide concrete values for missing values carefully based on the probability distribution of the corresponding dimension, rather than on an ad hoc basis.
Second, our concrete value selection is enabled within the general framework we have established in this paper, which ensures solid semantics which simple plugins of constants apparently lack.
6.2 Combination with Other Dominance Relationships
In the original $MD_\lambda$ relationship, the points are compared with respect to the traditional dominance relationship after the mapping values are calculated. A natural question arises here on the possibility of combining mapping dominance with other dominance relationships on certain data, such as Cone Dominance and LogScale Cone Dominance introduced in the previous section. In this part of this section, we provide some positive answers to this question.
Here, let $MD^D_\lambda$ denote a new relationship on data with missing values, with another dominance relationship D on certain data replacing TD in the original definition of mapping dominance. The following lemmas imply that the properties are very likely to be inherited from D to $MD^D_\lambda$.
Lemma 11 If the properties of scaling robustness or shifting robustness hold for D, they also hold with $MD^D_\lambda$.
The proof of the lemma above is similar to those for Lemma 7 and Lemma 8.
Lemma 12 If the property of transitivity holds for D and TD(g(p), f(p)) → D(g(p), f(p)), transitivity is also valid with $MD^D_\lambda$.
Proof If $MD^D_\lambda(p, q)$ and $MD^D_\lambda(q, r)$, it implies that D(f(p), g(q)) and D(f(q), g(r)), based on the definition of mapping dominance. Due to the condition of the lemma, we have D(g(q), f(q)) because TD(g(q), f(q)).
With the property of transitivity on D, it is easy to derive that D(f(p), g(r)), leading to $MD^D_\lambda(p, r)$. This completes the proof on the validity of the transitivity property on $MD^D_\lambda$.
The last lemma implies that when D is stronger than TD, the property of transitivity in D can be passed to $MD^D_\lambda$. Considering the employment of CD and LCD in the generalized definition of mapping dominance, we can verify the properties of the new mapping dominance relationships based on the lemmas above. To simplify the notation, we use $MCD_{\lambda,\gamma}$ and $MLCD_{\lambda,\gamma}$ to denote the new mapping dominance with these two dominance relationships respectively, with λ and γ being the parameters for them correspondingly. When γ ≥ 0, both $MCD_{\lambda,\gamma}$ and $MLCD_{\lambda,\gamma}$ have the property of transitivity, since CD and LCD are stronger than TD when γ ≥ 0. Therefore, $MCD_{\lambda,\gamma}$ (γ ≥ 0) is consistent with the properties of shifting robustness and transitivity, and $MLCD_{\lambda,\gamma}$ with the properties of scaling robustness and transitivity.
6.3 Algorithm Applicability
In Section 4, we present some general algorithms for skyline queries and discuss their applicabilities with respect to the properties of the underlying dominance relationship. However, all of the algorithms are designed for data sets with complete information on all dimensions. In this part of the section, we will present some extensions over the algorithms to handle mapping dominance over incomplete data.
For the BNL algorithm, no modification is necessary for the extension to mapping dominance. The only operation called in BNL is some verification between some pair of points on their dominance relationship, which can be simply implemented by an independent component. Since the mapping dominance relationship can be verified based on the definition, this component can be seamlessly integrated with BNL algorithm.
For SFS, TSA and BBS algorithms, the problem remains when they need to use some sorting function or indexing structure, which does not support points with missing values directly. To overcome the difficulties with these three algorithms, we hereby propose two simple schemes, enabling the system to employ traditional sorting and indexing component without too much modification.
6.3.1 Sorting for Mapping Dominance
When some sorting function is invoked over the points with missing values, a virtual position p' for the original point p is constructed by filling the missing value on dimension i with $f_{0.5,i}(p)$. Given the virtual positions for all data points, some conventional sorting algorithm will be called to order the points based on their virtual positions on the sum of all dimensions in nondescending order.
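A minimal Python sketch of this sorting scheme is given below, using the data of Table 6. It assumes that $f_{0.5,i}$ picks the upper median of the observed values on dimension i (one possible rank convention, not necessarily the authors'), and the helper names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Row = Tuple[Optional[float], ...]


def upper_median(values: List[float]) -> float:
    """Assumed stand-in for f_{0.5,i}: the ceil(n/2)-th largest observed value."""
    obs = sorted(values)
    return obs[len(obs) - max(1, math.ceil(len(obs) / 2))]


def virtual_position(p: Row, data: List[Row]) -> Tuple[float, ...]:
    """Fill each missing entry with the median-style estimate of its dimension."""
    return tuple(
        v if v is not None
        else upper_median([r[i] for r in data if r[i] is not None])
        for i, v in enumerate(p)
    )


def sort_for_mapping_dominance(named: Dict[str, Row]) -> List[str]:
    """Order point names by the coordinate sum of their virtual positions."""
    data = list(named.values())
    return sorted(named, key=lambda name: sum(virtual_position(named[name], data)))


table6 = {"A": (None, 0.5), "B": (0.3, None), "C": (0.5, 0.4), "D": (0.7, 0.5), "E": (0.9, 0.6)}
print(sort_for_mapping_dominance(table6))  # ['B', 'C', 'A', 'D', 'E']
```

The virtual positions are used only for ordering; the dominance tests performed during the scan still go through the mappings f and g of the chosen mapping dominance relationship.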
Lemma 13 If p' is sorted before q' with the new sorting method, q cannot dominate p based on dominance relationship $MD^D_\lambda$, if 1) D is stronger than TD, i.e. TD(p, q) → D(p, q) and 2) D has the property of transitivity.
Proof We prove this lemma by contradiction. If the lemma does not hold, we can find some p' sorted before q' such that q dominates p by $MD^D_\lambda$. Then, the dominance must be valid on D(f(q), g(p)). By condition 1) on the dominance relationship D, we have D(g(p), p') and D(q', f(q)). Combining these results with the property of transitivity leads to D(q', p').
This contradicts the property of rationality, which implies that D(q', p') cannot hold since p' is sorted before q'.
The correctness of the last lemma directly implies that all of the mapping dominance relationships proposed in this section are consistent with the new sorting scheme, since the correctness of sorting in the SFS and TSA algorithms depends on the result of the lemma.
6.3.2 Indexing for Mapping Dominance
When indexing the points with missing values, each point p is represented by some rectangle with two corners at f (p) and g(p). Thus, given a node in the indexing tree, such as RTree, the Minimum Bounding
Box is the minimum rectangle in the space covering all the rectangles created for the points stored behind this node.
Lemma 14 Given two MBRs $M_1$ and $M_2$, if one point $p \in M_1$ cannot be dominated by any other point $q \in M_2$ by the definition of mapping dominance, $M_2$ cannot dominate $M_1$.
Proof Since no point $q$ in $M_2$ dominates $p$ in $M_1$, we know that $g(p)$ cannot be dominated by any $f(q)$. Considering the MBRs $M_1$ and $M_2$, there is at least one dimension on which the minimum boundary of $M_1$ cannot be bounded by the maximum boundary of $M_2$. Thus, $M_2$ cannot dominate $M_1$.
Based on the last lemma, the pruning strategies in any indexing tree must be valid, because no pruning will remove a real skyline point from the candidate set. Therefore, it is safe to use the indexing scheme for mapping dominance relationship when computing a skyline query or top-k skyline query.
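The rectangle representation used for indexing can be sketched as below. The sketch assumes only that the two corners f(p) and g(p) are available as coordinate vectors; it makes no assumption about which corner is the lower one, and the helper names `point_rectangle` and `bounding_box` are hypothetical.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[List[float], List[float]]  # (lower corner, upper corner)


def point_rectangle(f_p: Sequence[float], g_p: Sequence[float]) -> Rect:
    """Rectangle spanned by the two corners f(p) and g(p) of one point."""
    low = [min(a, b) for a, b in zip(f_p, g_p)]
    high = [max(a, b) for a, b in zip(f_p, g_p)]
    return low, high


def bounding_box(rects: List[Rect]) -> Rect:
    """Smallest rectangle covering every rectangle stored under a node."""
    dims = len(rects[0][0])
    low = [min(r[0][d] for r in rects) for d in range(dims)]
    high = [max(r[1][d] for r in rects) for d in range(dims)]
    return low, high


# Two hypothetical points, each given by its (f, g) corner pair.
rects = [
    point_rectangle((1.0, 2.0), (3.0, 2.0)),
    point_rectangle((0.0, 4.0), (0.0, 6.0)),
]
mbr = bounding_box(rects)
print(mbr)
```

A dimension with a known value gives a degenerate interval, so the rectangle only has extent on the dimensions where the point's value is missing.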
7 Experiments
In this section, we evaluate the efficiency of the algorithms on variants of skyline queries, and the effectiveness of the new dominance definitions, on synthetic and real data sets.
7.1 Experimental Settings
In the experiments, both synthetic data sets and real data sets are used to evaluate the performance. There are three common types of synthetic data sets that have been used in previous studies of skyline queries
[4], including correlated (C), independent (I) and anticorrelated (A) data sets. In correlated data sets, the dimensions of the points are positively correlated, meaning a point with a better value on one attribute is very likely to have better values on other attributes.
In independent data sets, the dimensions are independent and uniformly distributed. In anticorrelated data sets, the dimensions are anticorrelated, which is implemented by keeping the sums on all dimensions for all points around the same value [4]. We also adopt two real data sets: the NBA data set
(http://www.databasebasketball.com) and the MovieLens data set (http://movielens.umn.edu/), both of which have been used in the study of k-dominant skyline queries in [6]. The NBA data set contains more than 17,000 records of players' season statistics on 17 attributes from the first season of NBA in 1945 to the season in 2002. The MovieLens data set was collected by the movielens web site from September 1997 to April 1998. There are 100,000 ratings from 943 users on 1682 movies in the data set. All the users in the data set have rated at least 20 movies. However, this data set is still very sparse, with missing values in most of the entries.

| Parameter | Range |
|---|---|
| Dimensionality | 5, 10, 15, 20 |
| Data Size (100K) | 1, 2, 3, 4, 5, 6, 8, 10 |
| Distribution | C, I, A |
| Top-k Size | 50, 100, 150, 200, 250 |
| Parameter γ for CD (0.01) | 5, 10, 15, 20 |

Table 7 Parameters in Tests on Synthetic Data Sets

| γ (0.01) | C | I | A |
|---|---|---|---|
| 5 | 441 | 40999 | 81527 |
| 10 | 134 | 16214 | 58205 |
| 15 | 52 | 4908 | 32450 |
| 20 | 28 | 1165 | 8961 |

Table 8 Skyline Cardinalities on Synthetic Data Sets with Varying Dominance Parameter
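The three synthetic distributions described above can be sketched as follows. This is a simplified generator written for illustration, not the generator of [4]: the noise model, the `spread` parameter and all function names are assumptions.

```python
import random
from typing import List


def independent(n: int, d: int, rng: random.Random) -> List[List[float]]:
    """Each dimension is drawn independently and uniformly from [0, 1)."""
    return [[rng.random() for _ in range(d)] for _ in range(n)]


def correlated(n: int, d: int, rng: random.Random,
               spread: float = 0.05) -> List[List[float]]:
    """All dimensions stay close to one shared base value per point."""
    points = []
    for _ in range(n):
        base = rng.random()
        points.append([min(1.0, max(0.0, base + rng.gauss(0.0, spread)))
                       for _ in range(d)])
    return points


def anticorrelated(n: int, d: int, rng: random.Random,
                   spread: float = 0.05) -> List[List[float]]:
    """Coordinate sums are kept around the same value (d / 2)."""
    points = []
    for _ in range(n):
        raw = [rng.random() for _ in range(d)]
        target = d / 2 + rng.gauss(0.0, spread)
        shift = (target - sum(raw)) / d
        # No clipping here, so the sum equals the target up to rounding;
        # individual coordinates may therefore fall slightly outside [0, 1].
        points.append([x + shift for x in raw])
    return points


rng = random.Random(0)
ind = independent(100, 5, rng)
cor = correlated(100, 5, rng)
ant = anticorrelated(100, 5, rng)
print(len(ind), len(cor), len(ant))
```

Because every anti-correlated point has nearly the same coordinate sum, a point that is better on one dimension tends to be worse on another, which is why such data sets produce the largest skylines in Table 8.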
We compare the performance of different algorithms and different dominance relationships. The skyline algorithms evaluated here include: General Block Nested Loop (BNL), General Sort Filter Skyline (SFS) and General Branch-and-Bound Skyline (BBS). The top-k skyline algorithms evaluated include: Binary SFS (BSFS), Progressive SFS (PSFS), Binary BBS (BBBS) and Progressive BBS (PBBS). The dominance relationships tested include the three new variant relationships proposed in this paper: Cone Dominance (CD), Log-scale Cone Dominance (LCD) and Mapping Dominance (MD).
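As a point of reference for the algorithms listed above, the following is a minimal in-memory sketch of a block-nested-loop style skyline with a pluggable dominance predicate. It is not the paper's General BNL: it ignores disk blocks, assumes that smaller values are better, and uses traditional dominance as the default predicate. The window update shown is only safe for transitive relations such as traditional dominance.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def traditional_dominates(p: Point, q: Point) -> bool:
    """p is no worse than q everywhere and strictly better somewhere.

    Assumption: smaller values are better on every dimension.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def bnl_skyline(points: List[Point],
                dominates: Callable[[Point, Point], bool] = traditional_dominates
                ) -> List[Point]:
    """Scan the points once, keeping a window of current skyline candidates."""
    window: List[Point] = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated by a candidate, so drop it
        # p survives: evict the candidates that p dominates, then keep p
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


data = [(1, 4), (2, 2), (3, 3), (4, 1), (2, 5)]
result = bnl_skyline(data)
print(result)
```

Passing a different `dominates` function is the hook through which a variant relationship would be plugged in, subject to the properties that the chosen algorithm requires.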
All experiments are run on a PC with a PIII 1.8GHz CPU, 1GB main memory and a 20GB hard disk. The programs are compiled with GCC v3.4.3 on a Linux Fedora 3 system.
7.2 Experiments on Synthetic Data Sets
In the experiments on synthetic data sets, we vary some parameters of the data sets, such as dimensionality and data size. Since LCD is not very different from CD, we only test the performance of CD on the data sets. For skyline queries and top-k skyline queries, we also vary the dominance parameter γ and the specified result size k, respectively. The varying ranges of the parameters are summarized in Table 7, in which default values are marked in bold font.
7.2.1 On Skyline Algorithms for Cone Dominance
In Figures 5-7, we show the results on skyline queries with varying dimensionality. On correlated data, BBS is more CPU and IO efficient than the other two algorithms when the dimensionality is not very large. When the dimensionality grows, the CPU time performance of BBS deteriorates rapidly, mainly because of the worse indexing quality in high dimensional space. This phenomenon happens even earlier on independent data sets, when the dimensionality remains small. On independent data sets, SFS is usually the fastest method in high dimensional space. On anti-correlated data sets, the performances of the algorithms tend to converge when dimensionality increases, since all of them have to compare every pair of points.
When γ increases from 0.05 to 0.2, the expansion of dominance ability leads to a quick decrease of skyline cardinalities (Table 8). All three algorithms, BNL, SFS and BBS, become more CPU and IO efficient when given a larger γ (Figures 8-10). BNL is faster than SFS on high dimensional correlated data sets since the sorting in SFS takes too much time. SFS is more efficient on high dimensional independent data sets due to the advantage of sorting. The performances of SFS and BBS in CPU and IO are almost the same on anti-correlated data sets.
In Figures 11-13, we present the experimental results when the size of the synthetic data set is varied from 100K to 1M. In this group of tests, BNL is worse than SFS on CPU and IO on all data sets. BBS is much slower than SFS, but with almost the same IO cost, since the IO cost of retrieving new points from the underlying indexing structure is much smaller than the IO cost spent on comparing a point with all current skyline points.
7.2.2 On Top-k Skyline Algorithms for Cone Dominance
The experimental results on varying dimensionality are presented in Figures 14-16. The figures show the overall advantage of progressive search over binary search. Binary search with SFS or BBS can only work on data sets with fewer than 10 dimensions, while progressive search with SFS or BBS scales to high dimensional data sets. On all three types of data sets, PSFS is quicker than PBBS while the IO costs of PSFS and PBBS are about the same. Since BSFS and BBBS are much worse than PSFS and PBBS, we will only evaluate PSFS and PBBS in the rest of the section.
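The idea of searching an ordered dominance class for a result of controllable size can be illustrated with a small sketch. It is not the paper's BSFS or BBBS algorithm: `slack_dominates` is a hypothetical stand-in for a parameterized dominance relationship (it is not cone dominance), the skyline is computed by brute force, and the only property relied upon is that a larger parameter can only shrink the skyline.

```python
from typing import List, Optional, Sequence

Point = Sequence[float]


def slack_dominates(p: Point, q: Point, gamma: float) -> bool:
    """Hypothetical relaxed dominance (assumption, smaller is better):
    p may be worse than q by at most gamma on each dimension and must be
    strictly better on at least one dimension."""
    return (all(a <= b + gamma for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def skyline(points: List[Point], gamma: float) -> List[Point]:
    """Brute force: keep the points that no other point dominates."""
    return [p for i, p in enumerate(points)
            if not any(slack_dominates(q, p, gamma)
                       for j, q in enumerate(points) if j != i)]


def smallest_gamma_for_k(points: List[Point], k: int,
                         gammas: List[float]) -> Optional[float]:
    """Binary search the ascending list `gammas` for the first value whose
    skyline has at most k points; None if no listed value is large enough."""
    lo, hi = 0, len(gammas) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if len(skyline(points, gammas[mid])) <= k:
            best = gammas[mid]
            hi = mid - 1  # try a weaker relaxation
        else:
            lo = mid + 1  # skyline still too large, relax further
    return best


data = [(1, 4), (2, 2), (4, 1), (3, 3)]
gammas = [0.0, 0.5, 1.0, 1.5]
chosen = smallest_gamma_for_k(data, 2, gammas)
print(chosen, skyline(data, chosen))
```

Each probe of the binary search recomputes a full skyline, which gives some intuition for why a search of this kind becomes expensive as the per-skyline cost grows with dimensionality.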
Fig. 5 Skyline Queries with Varying Dimensionality on Correlated Data Sets: (a) CPU time, (b) IO cost, plotted against Dimensionality for BNL, SFS and BBS
Fig. 6 Skyline Queries with Varying Dimensionality on Independent Data Sets: (a) CPU time, (b) IO cost, plotted against Dimensionality for BNL, SFS and BBS
Fig. 7 Skyline Queries with Varying Dimensionality on Anti-Correlated Data Sets: (a) CPU time, (b) IO cost, plotted against Dimensionality for BNL, SFS and BBS
In the experiments varying the specified skyline size k on correlated data sets, shown in Figure 17, both the PSFS and PBBS algorithms scale well with k. PBBS has the advantage of IO efficiency while PSFS is better on CPU time. This is because simple sorting is more effective than indexing in high dimensional space, while indexing is better at reducing IO. We omit similar results on independent and anti-correlated data sets.
From the results shown in Figure 18, we can conclude that both the computation costs and the IO costs of PSFS and PBBS are linear in the data size on correlated data sets. Similar results on independent and anti-correlated data sets are also omitted here.
Fig. 8 Skyline Queries with Varying γ on Correlated Data Set: (a) CPU time, (b) IO cost, plotted against Gamma (0.01) for BNL, SFS and BBS
Fig. 9 Skyline Queries with Varying γ on Independent Data Set: (a) CPU time, (b) IO cost, plotted against Gamma (0.01) for BNL, SFS and BBS
Fig. 10 Skyline Queries with Varying γ on Anti-Correlated Data Set: (a) CPU time, (b) IO cost, plotted against Gamma (0.01) for BNL, SFS and BBS
7.3 Experiments on Real Data Sets
7.3.1 CD and LCD on NBA data set
In Table 9, we show the two skylines retrieved by the top-k skyline query with cone dominance and log-scale cone dominance respectively. By simple observation, we can see that the skyline with cone dominance (on the left) prefers center players while the skyline with log-scale cone dominance (on the right) prefers guard players. The difference stems from the Euclidean distances used in cone dominance and log-scale cone dominance. In cone dominance, the Euclidean distance between two players is dominated by the "large" attributes, such as points, rebounds and assists, which leads to a bias towards centers with high scoring and rebounds.
In log-scale cone dominance, the Euclidean distance is a more even aggregation of all attributes, preferring players who are more balanced across different attributes.
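The effect of the log scale on the Euclidean distance can be seen in a small numeric sketch. The two stat lines are made-up numbers, not records from the NBA data set, and the transform log(1 + x) is an assumption about how a log scale might be applied rather than the paper's exact definition.

```python
import math
from typing import List, Sequence


def log_scale(v: Sequence[float]) -> List[float]:
    """Assumed log transform; log(1 + x) keeps zero values finite."""
    return [math.log(1.0 + x) for x in v]


def largest_term_share(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of the squared Euclidean distance due to its largest term."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    return max(squares) / sum(squares)


# Hypothetical season totals: (points, rebounds, steals).
player_a = (2000.0, 800.0, 100.0)
player_b = (1000.0, 400.0, 50.0)

raw_share = largest_term_share(player_a, player_b)
log_share = largest_term_share(log_scale(player_a), log_scale(player_b))
print(round(raw_share, 2), round(log_share, 2))
```

On the raw scale the points attribute accounts for most of the squared distance, while after the log transform the three attributes contribute almost equally, which matches the qualitative explanation given for Table 9.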
Fig. 11 Skyline Queries with Varying Data Size on Correlated Data Sets: (a) CPU time, (b) IO cost, plotted against Data Size (100K) for BNL, SFS and BBS
Fig. 12 Skyline Queries with Varying Data Size on Independent Data Sets: (a) CPU time, (b) IO cost, plotted against Data Size (100K) for BNL, SFS and BBS
Fig. 13 Skyline Queries with Varying Data Size on Anti-Correlated Data Sets: (a) CPU time, (b) IO cost, plotted against Data Size (100K) for BNL, SFS and BBS
7.3.2 Mapping Cone Dominance on Movie Data Set
Although we can construct an ordered dominance class $\{MD_\lambda\}$ by gradually increasing λ from 0 to 1/2, the skyline size can still be much larger than expected. Experiments with MD on the MovieLens data set show that even if we set λ to 0.5, there are still more than 1000 skyline points returned, which makes the result meaningless. This indicates that mapping dominance by itself may not be effective in reducing the skyline size to the user's expectations. A straightforward alternative is to employ other mapping dominance relationships, combined with other dominance on certain data, such as MCD and MLCD, introduced in Section 6.2. Since MCD and MLCD have two parameters, with λ controlling the mapping procedure and γ controlling the degree of the cone, the index set on the ordered dominance class for the top-k skyline query becomes hard to define. To simplify the problem, we fix λ at 0.5 and leave γ free when computing the top-k skyline query.
Fig. 14 Top-k Skyline Query with Varying Dimensionality on Correlated Data Sets: (a) CPU time, (b) IO cost, plotted against Dimensionality for BSFS, PSFS, BBBS and PBBS
Fig. 15 Top-k Skyline Query with Varying Dimensionality on Independent Data Sets: (a) CPU time, (b) IO cost, plotted against Dimensionality for BSFS, PSFS, BBBS and PBBS
Fig. 16 Top-k Skyline Query with Varying Dimensionality on Anti-Correlated Data Sets: (a) CPU time, (b) IO cost, plotted against Dimensionality for BSFS, PSFS, BBBS and PBBS
In Table 10, we present the top-k skyline set with Mapping Cone Dominance (left) and Mapping Log-Scale Cone Dominance (right) as underlying dominance relationships. In terms of their IMDB scores (www.imdb.com), our method does discover the popular movies. The advantage of the top-k skyline query is that we do not need to manually adjust the aggregation function as IMDB does. By comparing the two results of MCD and MLCD, we find five movies shared by both skylines, all of which are well recognized classic movies. The remaining ones indicate the difference in the preferences of the two skylines. MCD prefers artistic movies, which are liked by a small fraction of reviewers, while MLCD is biased towards mass entertainment movies, such as the two animations "Wrong Trousers" and "Close Shave". This is again due to the distance used by these two dominance relationships, as discussed in relation to the results on the NBA data set.
Fig. 17 Top-k Skyline Query with Varying k on Correlated Data Set: (a) CPU time, (b) IO cost, plotted against k for PSFS and PBBS
Fig. 18 Top-k Skyline Query with Varying Data Size on Correlated Data Sets: (a) CPU time, (b) IO cost, plotted against Data Size (100K) for PSFS and PBBS
MCD (left):

| Movie & Year | IMDB score |
|---|---|
| 8 1/2 1963 | 8.2 |
| Wings of Desire 1987 | 8.0 |
| One Flew Over the Cuckoo's Nest 1975 | 8.8 |
| Star Wars 1977 | 8.8 |
| Forbidden Planet 1956 | 7.7 |
| Manchurian Candidate 1962 | 8.4 |
| Big Sleep 1946 | 8.3 |
| Killing Fields 1984 | 8.0 |
| As Good As It Gets 1997 | 7.7 |
| Godfather 1972 | 9.1 |

MLCD (right):

| Movie & Year | IMDB score |
|---|---|
| Living in Oblivion 1995 | 7.3 |
| Godfather 1972 | 9.1 |
| Wrong Trousers 1993 | 8.5 |
| As Good As It Gets 1997 | 7.7 |
| Schindler's List 1993 | 8.8 |
| One Flew Over the Cuckoo's Nest 1975 | 8.8 |
| Close Shave 1995 | 8.3 |
| Star Wars 1977 | 8.8 |
| Wild Bunch 1969 | 8.1 |
| Manchurian Candidate 1962 | 8.4 |

Table 10 Top-10 Skylines with MCD and MLCD on Movie Data Set
Cone dominance (left):

| Name & Year | Position |
|---|---|
| G. McGinnis 1974 | Forward |
| M. Malone 1978 | Center |
| M. Malone 1981 | Center |
| C. Barkley 1987 | Forward |
| H. Olajuwon 1989 | Center |
| J. Stockton 1990 | Guard |
| J. Stockton 1991 | Guard |
| G. Payton 1999 | Guard |
| A. Walker 2000 | Forward |
| P. Pierce 2001 | Guard |

Log-scale cone dominance (right):

| Name & Year | Position |
|---|---|
| C. Barkley 1987 | Forward |
| C. Barkley 1988 | Forward |
| M. Johnson 1989 | Guard |
| M. Jordan 1989 | Guard |
| S. Pippen 1994 | Forward |
| A. Walker 1997 | Forward |
| V. Carter 2000 | Guard |
| A. Walker 2000 | Forward |
| K. Bryant 2002 | Guard |
| T. McGrady 2002 | Guard |

Table 9 Top-10 Skylines on NBA data set
8 Conclusion
In this paper, we have investigated the possibility of using dominance relationships other than the traditional one in skyline queries. Among the many recent studies on skyline queries, we are the first to present a general framework on the robustness of skyline query results. While traditional dominance is the only binary relationship satisfying all desired properties, as we have proved, we have proposed some new dominance relationships with relaxed properties to improve the flexibility of skyline queries. Our study has also identified the basic requirements for use as a guide in designing specific skyline queries with expected results.
References
1. I. Bartolini, P. Ciaccia, and M. Patella. Efficient sort-based skyline evaluation. ACM Trans. Database Syst., 33(4), 2008.
2. J. L. Bentley, K. L. Clarkson, and D. B. Levine. Fast linear expected-time algorithms for computing maxima and convex hulls. In SODA, pages 179–187, 1990.
3. J. L. Bentley, H. T. Kung, M. Schkolnick, and C. D. Thompson. On the average number of maxima in a set of vectors and applications. J. ACM, 25(4):536–543, 1978.
4. S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421–430, 2001.
5. C.-Y. Chan, P.-K. Eng, and K.-L. Tan. Stratified computation of skylines with partially-ordered domains. In SIGMOD, pages 203–214, 2005.
6. C.-Y. Chan, H. V. Jagadish, K.-L. Tan, A. K. H. Tung, and Z. Zhang. Finding k-dominant skylines in high dimensional space. In SIGMOD, pages 503–514, 2006.
7. J. Chomicki. Preference formulas in relational queries. ACM TODS, 24(4):427–466, 2003.
8. J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In ICDE, pages 717–719, 2003.
9. I. Diakonikolas and M. Yannakakis. Succinct approximate convex pareto curves. In SODA, pages 74–83, 2008.
10. P. Godfrey, R. Shipley, and J. Gryz. Maximal vector computation in large data sets. In VLDB, pages 229–240, 2005.
11. W. Kießling. Foundations of preferences in database systems. In VLDB, pages 311–322, 2002.
12. V. Koltun and C. H. Papadimitriou. Approximately dominating representatives. In ICDT, pages 204–214, 2005.
13. D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: an online algorithm for skyline queries. In VLDB, pages 275–286, 2002.
14. K. C. K. Lee, B. Zheng, H. Li, and W.-C. Lee. Approaching the skyline in z order. In VLDB, pages 279–290, 2007.
15. C. Li, B. C. Ooi, A. K. H. Tung, and S. Wang. DADA: A data cube for dominant relationship analysis. In SIGMOD, pages 659–670, 2006.
16. X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. Selecting stars: The k most representative skyline operator. In ICDE, pages 86–95, 2007.
17. M. D. Morse, J. M. Patel, and H. V. Jagadish. Efficient skyline computation over low-cardinality domains. In VLDB, pages 267–278, 2007.
18. D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In SIGMOD, pages 467–478, 2003.
19. D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive skyline computation in database systems. ACM TODS, 30(1):41–82, 2005.
20. N. Sarkas, G. Das, N. Koudas, and A. K. H. Tung. Categorical skylines for streaming data. In SIGMOD, pages 239–250, 2008.
21. U. Shaft and R. Ramakrishnan. When is nearest neighbors indexable? In ICDT, pages 158–172, 2005.
22. K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efficient progressive skyline computation. In VLDB, pages 301–310, 2001.
23. P. Wu, D. Agrawal, Ö. Egecioglu, and A. El Abbadi. DeltaSky: Optimal maintenance of skyline deletions without exclusive dominance region generation. In ICDE, pages 486–495, 2007.
24. Z. Zhang, L. V. S. Lakshmanan, and A. K. H. Tung. On domination game analysis for microeconomic data mining. TKDD, 2(4), 2009.