2012 IEEE Fifth International Conference on Cloud Computing
Programmable Order-Preserving Secure Index for Encrypted Database Query
Dongxi Liu
Shenlu Wang ∗
CSIRO ICT Centre, Marsfield, NSW 2122, Australia
{shenlu.wang,dongxi.liu}@csiro.au
Abstract
intentionally, or by attackers who compromise the database
service platforms. Since the database services are a kind of
cloud computing services, the techniques of trusted cloud
computing have the potential to be used to build trusted
database services. However, there is still a gap in applying
the techniques of trusted cloud computing, such as [7, 15],
to address the security and privacy problem in database services.
For cloud database services, a straightforward approach
to addressing the security and privacy problem is to encrypt
the database. In this way, the service provider or an attacker can see only meaningless encrypted data. However, once encrypted, a database cannot be easily queried. It
is not acceptable to decrypt the entire database before performing each query, because decryption might be very
slow for a large database and the decrypted database is again
at risk of having its security and privacy breached. Ideally, a query should be executed directly over the encrypted
database.
A database query can be an equality query, a range query,
an aggregate query, or a combination of these. In this paper, we
focus on the problem of performing range queries on encrypted databases. For example, a range query can be “select staff who joined the company between 2000 and 2012”.
As for the other two types of queries over encrypted databases,
equality queries are not hard to handle when a deterministic
encryption scheme (e.g., AES in ECB mode) is used, since
in such a scheme the same plaintexts are always encrypted into
the same ciphertexts, while aggregate queries need homomorphic encryption algorithms [11] to process the SQL operations SUM and AVG over encrypted databases. We also
describe how to apply our method together with secure hash
algorithms and homomorphic encryption algorithms to deal
with all types of queries over encrypted databases.
To deal with range queries on encrypted databases, an
order-preserving encryption scheme has been proposed in
[2]. In this scheme, the ith value in the plaintext domain is
mapped to the ith value in the ciphertext domain, such that
the order between plaintexts is preserved between ciphertexts. To use this scheme, users need to be able to model
the distributions of values in the plaintext and ciphertext
Database services on the cloud are emerging as an attractive way of outsourcing databases. When a database
is deployed on a cloud database service, data security
and privacy become a major concern for users. A straightforward way to address this concern is to encrypt the database.
However, an encrypted database cannot be easily queried.
In this paper, we propose an order-preserving scheme for
indexing encrypted data, which facilitates range queries
over encrypted databases. The scheme is secure since it
randomizes each index with noise, such that the original data cannot be recovered from the indexes. Moreover, our
scheme allows the programmability of basic indexing expressions, and thus the distribution of the original data can
be hidden from the indexes.
1. Introduction
Cloud database services, such as Amazon Relational
Database Service (RDS) and Microsoft SQL Azure, are emerging as an attractive way for enterprises to outsource
their databases. In cloud database services, the hardware
and software underlying databases are shared among users.
The database services allow enterprises to deploy their
databases quickly without making a large investment in
proprietary hardware and software, hence reducing the
total cost of ownership. Moreover, database services on
the cloud can be elastic, meaning that an enterprise can dynamically increase or decrease the compute resources allocated
to its databases according to its business requirements.
Though attractive as a new paradigm of data management, database services cannot be fully exploited unless the
problem of data privacy and security is addressed
[1, 5]. When a database is deployed into a public database
service, the service provider has complete physical control over the database. The data in the database might be
improperly accessed by the service provider accidentally or
∗ Shenlu Wang is a vacation student from RMIT University.
978-0-7695-4755-8/12 $26.00 © 2012 IEEE
DOI 10.1109/CLOUD.2012.65
domains. However, when using cloud database services, an
enterprise may not have database professionals who know
the techniques [9] for data distribution modeling. In addition, the scheme in [2] can only deal with plaintexts in a finite
domain. A cryptographic study of order-preserving
encryption is given in [3].
The work [1] shows a way of building order-preserving
polynomials, which are based on the polynomials proposed
by Shamir for secret sharing [16]. However, the proposed
mechanism is only applicable to a finite plaintext domain,
where the number of plaintexts is needed to determine the
range of coefficients in a polynomial. Moreover,
the evaluation results of order-preserving polynomials may
reveal the distribution of plaintexts, since similar plaintexts
are transformed with similar polynomials. As discussed in
[2], the coupled distributions of the plaintext and ciphertext domains might be exploited by attackers to guess the range of
the plaintext corresponding to a ciphertext.
In [8], an indexing mechanism for range queries is proposed. This mechanism is not strictly order preserving since
two different values may be mapped into the same bucket,
which is used when checking query conditions. The mechanism can lead to inaccuracy of query results and hence some
post-processing is needed to remove unexpected query results.
In this paper, we propose an order-preserving indexing
scheme, which is secure and easy to use. The scheme is
built over simple linear expressions of the form a ∗ x + b.
The form of the expressions is public; however, the coefficients a and b are kept secret (not known by attackers).
Based on the linear expressions, the indexing scheme maps
an input value v to a ∗ v + b + noise, where noise is a random value. The noise is carefully selected, such that the
order of input values is preserved. For example, suppose
the linear expression is defined over integers (i.e., a, b and
x are all integers); then the noise is selected from the set
{0, 1, ..., a−1}. When more input values are indexed, more
noise is introduced into the results, implying that attackers
cannot recover the input values from the generated indexes.
Hence, our indexing scheme is information-theoretically secure, since attackers cannot get enough information to solve
the linear equations over the input values and the generated
indexes.
Our indexing scheme allows the programmability of basic indexing expressions (i.e., the linear expressions). Users
can write an indexing program that handles different input values with different indexing expressions. On the one
hand, the programmability improves the robustness of our
scheme against brute-force attacks, since there are more indexing expressions to attack. On the other hand, the programmability can help decouple the distributions of input
values and indexes. When a single linear expression is used
to index all input values, the distribution of indexes is identical to the distribution of input values. This problem can be
addressed by designing appropriate indexing programs. For
example, suppose input values are uniformly distributed.
If the indexing program maps a bigger input value
into an index that is distributed in a bigger range, then the
indexes do not follow the uniform distribution. Hence, the
distribution of input values is not revealed by indexes.
Figure 1. Architecture of Querying Encrypted Databases
Our indexing scheme is easier to use than that in [2],
since our scheme does not need users to model data distribution. Unlike the scheme in [2], our scheme does not
generate indexes with a specified distribution. We only
require that the indexes do not reveal the distribution of input
values. Our indexing scheme only depends on linear expressions, which are easier for users to understand and use
than the polynomials used in [1]. The usability of security
mechanisms is important for them to be effectively adopted
in practice. In addition, unlike the schemes in [1, 2], our
scheme is not an encryption scheme. It is used together
with existing encryption algorithms (e.g., AES) to deal with
range queries over encrypted databases. Thus, our scheme
can benefit from advances in encryption algorithm
research.
The rest of the paper is organized as follows. Section 2
describes the architecture of querying encrypted databases.
Section 3 gives the details of our indexing scheme. Section
4 introduces query translation. In Section 5, we describe a
prototype of the system. Finally, related work and the conclusion are given.
2. The Architecture of Querying Encrypted Databases
In this section, we describe the architecture in which
our indexing scheme is used in the queries to encrypted
databases. The architecture is shown in Figure 1. In this
the encrypted databases, the attackers there cannot break
the indexes if they do not know a, b and any input values.
That is, the basic indexing scheme is secure against ciphertext-only attacks. Though in our threat model we do not
allow attackers to choose arbitrary input values, the attackers may happen to know the input values of some particular
indexes. In this case, they may be able to recover a and
b by solving two linear equations, since the equations have
only two unknowns a and b. Suppose attackers know two
different input values v1 and v2 corresponding respectively
to indexes i1 and i2; then the following two equations can
be used to recover a and b.
architecture, there is a database service provided in a public cloud, and an enterprise that deploys into the cloud a
database, which is encrypted by the enterprise to protect its
privacy.
To query or update the encrypted database, the enterprise
has a query proxy managing the communication between
the database applications and the encrypted database. When
a query is received from an application, the proxy translates
it into a query that can be executed directly over the encrypted database. When a query result is returned from the
database, the query proxy decrypts it before forwarding the
result to the application. The query proxy depends on some
metadata, such as keys and database schema, to translate
queries and decrypt query results.
Briefly, when a value is put into the database, the proxy
uses the indexing mechanism to generate its index and also
encrypts the value with some encryption algorithm like
AES. The index and the encrypted value are then stored
into corresponding fields in the same record of the database.
When a range query is made, the proxy calculates the index
of the value in the query condition, which is then used by the
database service to search indexes stored in the databases.
The order-preserving indexing mechanism reveals the
order information of encrypted values. Hence, a cryptographic system based on order-preserving encryption or
order-preserving indexing is vulnerable to chosen-plaintext
attacks [2, 3]. In this architecture, the proxy is placed within
the administrative boundary of the enterprise. The attackers
from the cloud cannot control the proxy. Hence, the attackers cannot recover the encrypted values by using chosen-plaintext attacks.
a ∗ v1 + b = i1
3.2 Order-Preserving Indexing with Randomness
To address the vulnerability described above, our idea is
to add some random noise to each index. That is, given two
input values v1 and v2, their indexes i1 and i2 will be a ∗ v1 +
b + noise1 and a ∗ v2 + b + noise2, respectively, where noise1
and noise2 are randomly sampled from some range (to be
defined later) by the query proxy. Consequently, even if v1,
v2 and their indexes are known accidentally by attackers on
the cloud, they still do not have enough information (due
to the random noises) to solve the following equations.
a ∗ v1 + b + noise1 = i1
a ∗ v2 + b + noise2 = i2
In the following, we describe how to determine the range
of noises, such that if v1 > v2 and a > 0, then a ∗ v1 + b +
noise1 > a ∗ v2 + b + noise2.
3. Order-Preserving Secure Indexing and Its Programmability
There are several data types (e.g., integer, double, string)
used in a database. In our work, we design the indexing scheme primarily for numerical values, and other data
types are translated into integers before indexing.
3.1
a ∗ v2 + b = i2
3.2.1 Randomized Order-Preserving Indexing Over Integers
We start the definition of the noise range from a special case,
to build intuition for our method. In this special
case, we assume the input values and coefficients in the linear expression are all integers. Suppose v1 and v2 are two
integers and v1 > v2. Then, the gap between them is at
least 1, that is, v1 − v2 ≥ 1. We will use sensitivity to mean
the least gap, as in differential privacy research [10].
To determine how much noise can be added into indexes,
such that the indexes keep the order between v1 and v2, we
need to know the least gap between a ∗ v1 + b (denoted i1)
and a ∗ v2 + b (denoted i2). Since v1 − v2 ≥ 1, we have
i1 − i2 = a ∗ (v1 − v2) and hence i1 − i2 ≥ a ∗ 1 and
i1 ≥ i2 + a ∗ 1. If noise1 and noise2 are both randomly
sampled from the range [0, a ∗ 1) (we keep writing a ∗ 1 to
make the sensitivity of input values explicit in the noise range),
then we have i1 + noise1 > i2 + noise2, which holds even
when noise1 is 0 (its minimum) and noise2 is at its
maximum in [0, a ∗ 1).
Basic Order-Preserving Indexing
Our indexing scheme is based on the linear expression
a ∗ x + b, where x is the input value, and a and b are secret coefficients (only known by the query proxy in the architecture
of Figure 1). The input value and coefficients can be integers or real numbers. To make sure the linear expression
is strictly increasing, we require a > 0 in the linear expression. Hence, for all v1 and v2, if v1 > v2 and a > 0, then
a ∗ v1 + b > a ∗ v2 + b.
As shown above, the basic linear expression respects the
order of input values. When the outputs of the linear expressions, used as indexes of the input values, are put into
v2) + noise1 − noise2 > 0. According to the definition of randomized indexes, both noise1 and noise2 lie
in the range [0, a ∗ sens). Since noise1 ≥ 0, the proof goal holds
if a ∗ (v1 − v2) − noise2 > 0. Since sens is the sensitivity of the input values, we have v1 − v2 ≥ sens
and hence a ∗ (v1 − v2) ≥ a ∗ sens > noise2, that is,
a ∗ (v1 − v2) − noise2 > 0.
In the following, we introduce a special type of randomized indexes. In this type of indexes, the sensitivity of indexes is the same as that of input values. Such sensitivity-keeping indexes make indexing programs easier to
write, as discussed in the next subsection.
For example, suppose the linear expression over integers
is 5 ∗ x + 3, and then the noise can be randomly selected
from the range [0, 5). Hence, the index of input value 1 is
distributed in the range [8, 13), the index of 2 is in [13, 18),
and so on.
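The integer case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `int_index` and the fixed seed are hypothetical, and the coefficients 5 and 3 are the ones from the example, not a recommended choice.

```python
import random

def int_index(v: int, a: int, b: int, rng: random.Random) -> int:
    """Index an integer v as a*v + b + noise, with noise drawn from {0, ..., a-1}."""
    if a <= 0:
        raise ValueError("a must be positive so that the mapping is increasing")
    return a * v + b + rng.randrange(a)

# Fixed seed only so that the demonstration is repeatable (an assumption of this sketch).
rng = random.Random(2012)
samples_1 = [int_index(1, 5, 3, rng) for _ in range(1000)]
samples_2 = [int_index(2, 5, 3, rng) for _ in range(1000)]

# Every index of 1 stays below every index of 2, although each one is randomized.
print(min(samples_1), max(samples_1), min(samples_2), max(samples_2))
```

Because the noise never reaches a, the index ranges of consecutive integers touch but do not overlap, which is exactly what the argument above requires.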
3.2.2 Randomized Order-Preserving Indexing
As shown above, the sensitivity of input values is needed
to determine the amount of noise that can be added into
indexes. The following is the formal definition of sensitivity
of input values.
Definition Given the sensitivity sens of input values V , if
a > 1, then the sensitivity-keeping index of value v ∈ V is
a ∗ v + b + noise, where noise is randomly sampled from
the range [0, a ∗ sens − sens].
Definition Let V be the set of all input values. The sensitivity of V is the minimum element in the set {|v1 − v2| | v1 ∈
V, v2 ∈ V, v1 ≠ v2}.
By its definition, the sensitivity is always greater than 0.
The sensitivity of input values is usually specific to applications. For example, if the salary in a company takes the
format d1 d2 d3 .d4 d5, where each di is a digit, then the sensitivity of salary is 0.01. That is, the least salary difference
between two staff members is 0.01 in the company. As another
example, if the input values in an application can only be
even numbers, then the sensitivity of input values in this
application is 2.
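As a small illustration of this definition, the sketch below computes the minimum gap between distinct values of a finite list. It assumes the list stands in for the whole value domain V; in practice the sensitivity comes from the application's data format rather than from a sample. The helper name `sensitivity` is hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

def sensitivity(values):
    """Return the smallest gap between two distinct values of a finite domain."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    # After sorting, the minimum pairwise gap is attained by two neighbours.
    return min(hi - lo for lo, hi in zip(distinct, distinct[1:]))

salaries = [Fraction("2.04"), Fraction("2.05"), Fraction("3.10")]
even_numbers = [2, 4, 10, 6]
print(sensitivity(salaries), sensitivity(even_numbers))
```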
Note that the sensitivity-keeping index of value v is defined only when a > 1, which ensures a ∗ sens − sens > 0.
Consider the previous example where the linear expression is 7.2 ∗ x + 3.75 and the sensitivity of input values
is 0.01. Then, the range of noises is [0, 0.072 − 0.01] (i.e.,
[0, 0.062]). The sensitivity-keeping index of v is denoted
by $\mathrm{skindex}^{sens}_{[a,b]}(v)$. The following theorem
states that the sensitivity of input values is kept by indexes.
Theorem Given the sensitivity sens of input values V, v1 ∈
V and v2 ∈ V, if v1 − v2 = sens, then $\mathrm{skindex}^{sens}_{[a,b]}(v_1) - \mathrm{skindex}^{sens}_{[a,b]}(v_2) \ge sens$.
Definition Given the sensitivity sens of input values V , the
randomized index of value v ∈ V is a ∗ v + b + noise,
where a > 0 and noise is randomly sampled from the range
[0, a ∗ sens).
For the proof of this theorem, we have
$\mathrm{skindex}^{sens}_{[a,b]}(v_1) - \mathrm{skindex}^{sens}_{[a,b]}(v_2) = a (v_1 - v_2) + \mathit{noise}_1 - \mathit{noise}_2 = a \cdot sens + \mathit{noise}_1 - \mathit{noise}_2$.
According to the definition of skindex, we have
$0 \le \mathit{noise}_1 \le (a-1) \cdot sens$ and $0 \le \mathit{noise}_2 \le (a-1) \cdot sens$,
so $\mathit{noise}_1 - \mathit{noise}_2 \ge -(a-1) \cdot sens$ and hence $a \cdot sens + \mathit{noise}_1 - \mathit{noise}_2 \ge sens$. Since the
sensitivity sens is greater than 0, the theorem also shows that
the order between v1 and v2 is preserved.
To keep sensitivity, skindex withholds some noise (i.e.,
the amount of sens). In the next section, we will show that
skindex is always followed by rindex in an indexing program, such that there is no noise withheld from final indexes.
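The sensitivity-keeping index can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name mirrors the paper's skindex, but the noise granularity `NOISE_STEPS`, the fixed seed and the use of exact fractions are assumptions of the sketch.

```python
import random
from fractions import Fraction

NOISE_STEPS = 10**6  # hypothetical granularity of the random noise

def skindex(v, a, b, sens, rng):
    """Sensitivity-keeping index: a*v + b + noise, noise in [0, a*sens - sens]."""
    v, a, b, sens = (Fraction(x) for x in (v, a, b, sens))
    if a <= 1 or sens <= 0:
        raise ValueError("skindex needs a > 1 and sens > 0")
    width = a * sens - sens
    # randint includes both ends, so the noise covers the closed range [0, width].
    noise = width * Fraction(rng.randint(0, NOISE_STEPS), NOISE_STEPS)
    return a * v + b + noise

rng = random.Random(1)  # fixed seed only for repeatability
gaps = [skindex("2.05", "7.2", "3.75", "0.01", rng)
        - skindex("2.04", "7.2", "3.75", "0.01", rng)
        for _ in range(1000)]
print(min(gaps) >= Fraction("0.01"))
```

With the example coefficients, indexes of 2.04 fall in [18.438, 18.5] and indexes of 2.05 start at 18.51, so the gap of 0.01 between the inputs is never reduced.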
For example, suppose the linear expression is 7.2 ∗ x +
3.75, and the sensitivity of input values is 0.01. Then, the
range for generating noises is [0, 0.072). For two example
input values 2.04 and 2.05, their randomized indexes are
calculated by 7.2 ∗ 2.04 + 3.75 + noise1 and 7.2 ∗ 2.05 + 3.75 +
noise2, and hence distributed in the ranges [18.438, 18.51)
and [18.51, 18.582), respectively. Note that, due to the random
noise, two identical values can have different indexes.
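The worked example can be reproduced with a short sketch. As before, this is an illustration under assumptions: `NOISE_STEPS` and the seed are hypothetical, and exact fractions are used so that the range bounds 18.438, 18.51 and 18.582 can be checked without floating-point error.

```python
import random
from fractions import Fraction

NOISE_STEPS = 10**6  # hypothetical granularity of the random noise

def rindex(v, a, b, sens, rng):
    """Randomized index: a*v + b + noise, with noise drawn from [0, a*sens)."""
    v, a, b, sens = (Fraction(x) for x in (v, a, b, sens))
    if a <= 0 or sens <= 0:
        raise ValueError("rindex needs a > 0 and sens > 0")
    # randrange excludes NOISE_STEPS, so the noise stays strictly below a*sens.
    noise = a * sens * Fraction(rng.randrange(NOISE_STEPS), NOISE_STEPS)
    return a * v + b + noise

rng = random.Random(5)  # fixed seed only for repeatability
low = [rindex("2.04", "7.2", "3.75", "0.01", rng) for _ in range(1000)]
high = [rindex("2.05", "7.2", "3.75", "0.01", rng) for _ in range(1000)]
print(max(low) < min(high))
```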
We use the notation $\mathrm{rindex}^{sens}_{[a,b]}(v)$ to represent the randomized index of input value v, calculated by using the
above definition. The following theorem shows that the randomized index defined above is order-preserving, reflecting
the correctness of the randomized indexing scheme.
Theorem Given the sensitivity sens of input values V, for
all v1 ∈ V and v2 ∈ V, if v1 > v2, then $\mathrm{rindex}^{sens}_{[a,b]}(v_1) > \mathrm{rindex}^{sens}_{[a,b]}(v_2)$.
3.3 Programmability of Indexes
In this section, we describe how to compose basic indexing expressions (skindex or rindex) into indexing programs. Briefly, an indexing program allows different input values to be indexed by different linear indexing expressions and allows indexes to be indexed again (like the
3DES algorithm, in which a ciphertext is encrypted again
by DES).
To prove this theorem, we need to show that
$\mathrm{rindex}^{sens}_{[a,b]}(v_1) - \mathrm{rindex}^{sens}_{[a,b]}(v_2) > 0$. Let noise1 and
noise2 denote the noises added to the indexes of v1 and
v2, respectively. Then, our proof goal becomes a ∗ (v1 −
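$I ::= \mathrm{rindex}^{sens}_{[a,b]} \mid S;\ \mathrm{rindex}^{sens}_{[a,b]}$
$S ::= \mathrm{skindex}^{sens}_{[a,b]} \mid \text{if } C \text{ then } S_1 \text{ else } S_2 \mid S_1;\ S_2$
$C ::= gt(c) \mid ge(c)$
Figure 2. Abstract Syntax of Indexing Programs
$I = \mathrm{skindex}^{1}_{[3.1,14.7]};\ S;\ \mathrm{rindex}^{1}_{[0.3,73]}$
$S = \text{if } gt(1200) \text{ then } \mathrm{skindex}^{1}_{[12,121.5]} \text{ else } S_1$
$S_1 = \text{if } gt(900) \text{ then } \mathrm{skindex}^{1}_{[9.2,81.7]} \text{ else } S_2$
$S_2 = \text{if } gt(650) \text{ then } \mathrm{skindex}^{1}_{[6.3,78.3]} \text{ else } S_3$
$S_3 = \text{if } gt(400) \text{ then } \mathrm{skindex}^{1}_{[4.1,65.2]} \text{ else } S_4$
$S_4 = \text{if } gt(280) \text{ then } \mathrm{skindex}^{1}_{[3.3,43.6]} \text{ else } S_5$
$S_5 = \text{if } gt(150) \text{ then } \mathrm{skindex}^{1}_{[2.5,30.1]} \text{ else } S_6$
$S_6 = \text{if } gt(100) \text{ then } \mathrm{skindex}^{1}_{[1.8,19.7]} \text{ else } \mathrm{skindex}^{1}_{[1.2,3.7]}$
Figure 3. An Indexing Program Example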
The syntax of indexing programs is shown in Figure 2.
An indexing program I is either $\mathrm{rindex}^{sens}_{[a,b]}$ or has the form
$S;\ \mathrm{rindex}^{sens}_{[a,b]}$, where S is the composition of sensitivity-keeping indexing expressions. S can be a basic sensitivity-keeping indexing expression $\mathrm{skindex}^{sens}_{[a,b]}$, a conditional indexing expression, or a sequential composition of expressions. In the conditional indexing expression, C means a
condition, which can be gt(c) or ge(c), where c is a constant.
The semantics of indexing programs is defined as follows. Suppose v is an input value. Then, I(v) means the application of I to v, generating v’s index. If I is $\mathrm{rindex}^{sens}_{[a,b]}$,
then $I(v) = \mathrm{rindex}^{sens}_{[a,b]}(v)$. If I is $S;\ \mathrm{rindex}^{sens}_{[a,b]}$, then
$I(v) = \mathrm{rindex}^{sens}_{[a,b]}(i)$, where i = S(v). The semantics of
indexing steps S is defined inductively. If S is $\mathrm{skindex}^{sens}_{[a,b]}$,
then $S(v) = \mathrm{skindex}^{sens}_{[a,b]}(v)$. If S is the conditional indexing step, then S(v) = S1(v) if v makes the condition C
true; otherwise, S(v) = S2(v). The condition C is gt(c)
or ge(c). The condition gt(c) is true if v > c, and ge(c) is
true if v ≥ c. If S is a sequential composition of steps, then
S(v) = S2(i), where i = S1(v).
An indexing program is said to be well-formed if it is order-preserving. Since in an indexing program the basic indexing expressions skindex and rindex are already order-preserving, the program is order-preserving if all conditional indexing
expressions are also order-preserving. Any conditional
indexing expression if C then S1 else S2, where C
is gt(c) or ge(c), is order-preserving if S1(c) ≥ S2(c).
This condition also makes sure there is no overlap among
indexes generated by S1 and S2. Note that this order-preserving condition can be checked by using only the program
code (i.e., without using any input values).
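The semantics and the well-formedness condition can be sketched as a small interpreter. Everything about the encoding is an assumption of this sketch: programs are written as nested tuples, noise is supplied as a function of the allowed width, and the check compares the smallest possible S1(c) with the largest possible S2(c), which is one conservative reading of the condition S1(c) ≥ S2(c) once noise is taken into account. The program evaluated at the end is the one from Figure 3.

```python
import random
from fractions import Fraction as F

# Hypothetical tuple encoding of the Figure 2 grammar (not from the paper):
#   ("sk", a, b)                      skindex with coefficients a and b
#   ("if", (op, c), s_then, s_else)   conditional, op is "gt" or "ge"
#   ("seq", s1, s2)                   s1 followed by s2

def run_step(step, v, sens, noise):
    """Evaluate a sensitivity-keeping step; noise(width) returns a value in [0, width]."""
    kind = step[0]
    if kind == "sk":
        _, a, b = step
        return a * v + b + noise((a - 1) * sens)
    if kind == "if":
        _, (op, c), s_then, s_else = step
        taken = v > c if op == "gt" else v >= c
        return run_step(s_then if taken else s_else, v, sens, noise)
    if kind == "seq":
        _, s1, s2 = step
        return run_step(s2, run_step(s1, v, sens, noise), sens, noise)
    raise ValueError(f"unknown step kind: {kind}")

def well_formed(step, sens):
    """Conservative check: at each threshold c, min of S1(c) >= max of S2(c)."""
    kind = step[0]
    if kind == "sk":
        return step[1] > 1
    if kind == "seq":
        return well_formed(step[1], sens) and well_formed(step[2], sens)
    _, (_, c), s_then, s_else = step
    lowest_then = run_step(s_then, c, sens, lambda width: 0)
    highest_else = run_step(s_else, c, sens, lambda width: width)
    return (lowest_then >= highest_else
            and well_formed(s_then, sens) and well_formed(s_else, sens))

def run_program(steps, r_coeffs, v, sens, noise):
    """Evaluate I = S; rindex. The noise function must stay below its width here."""
    i = run_step(steps, v, sens, noise)
    a, b = r_coeffs
    return a * i + b + noise(a * sens)

# The program of Figure 3, built from the innermost else-branch outwards.
branches = [(1200, "12", "121.5"), (900, "9.2", "81.7"), (650, "6.3", "78.3"),
            (400, "4.1", "65.2"), (280, "3.3", "43.6"), (150, "2.5", "30.1"),
            (100, "1.8", "19.7")]
S = ("sk", F("1.2"), F("3.7"))
for c, a, b in reversed(branches):
    S = ("if", ("gt", F(c)), ("sk", F(a), F(b)), S)
FIG3 = ("seq", ("sk", F("3.1"), F("14.7")), S)

rng = random.Random(7)  # fixed seed only for repeatability
draw = lambda width: width * F(rng.randrange(1000), 1000)  # noise in [0, width)
indexes = [run_program(FIG3, (F("0.3"), F(73)), F(v), F(1), draw) for v in range(501)]
print(well_formed(FIG3, F(1)), all(x < y for x, y in zip(indexes, indexes[1:])))
```

Note that the check only inspects the program text, in line with the remark above that no input values are needed.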
When writing an indexing program, the argument sens
on all skindex and rindex expressions represents the sensitivity of
input values. In an indexing program that consists of a sequence of expressions, all intermediate indexes are calculated by skindex, which does not change the sensitivity of
input values. Hence, programmers can use the sensitivity
of input values in the whole program, easing the burden of
programming.
An indexing program example is given in Figure 3. In
this example, we assume the sensitivity of input values is 1.
Suppose input values are from the range [0, 500] and evenly
distributed. This indexing program first transforms the input values with $\mathrm{skindex}^{1}_{[3.1,14.7]}$, leading to intermediate indexes in the range [14.7, 1566.8] (i.e., the upper bound 1566.8
is calculated by 3.1 ∗ 500 + 14.7 + 3.1 ∗ 1 − 1). Then, the program divides the intermediate indexes into eight parts, processed by indexing expressions with different coefficients.
Finally, a randomized indexing expression is applied to
generate the final indexes. In this example the indexes are
not evenly distributed, since a bigger index is distributed in
a bigger range.
The programmability of indexes increases the robustness
of our indexing scheme in two respects. First, input values can
be indexed by multiple linear expressions, making brute-force attacks harder. Second, the distribution of indexes can
be decoupled from the distribution of input values, making
it harder to estimate the range of input values according to
the positions of indexes.
The following notations will be used later. Let Index be
an indexing program, which is used secretly by the proxy
when translating queries. Then, Index(v, s) generates the
index of v by using the program Index, with all indexing
expressions in the program taking s as their sensitivity. In particular, Index(v, 0) means the index of v without adding any
noise, which is the minimum index of v.
3.4 Indexing String Input Values
In this section, we introduce how to convert a string into
an integer, such that our indexing scheme can be applied.
Our basic idea is to convert a string into an integer, where a
character in the string has its ASCII encoding as the value
of the corresponding byte in the integer. For example, “BC”
is converted to 0x4243.
Strings are usually compared in the lexical order. For example, the string “BC” is greater than “ABC”. When strings
are converted into integers, their order must be preserved.
Hence, it is not acceptable that “BC” is converted to 0x4243
and “ABC” is converted to 0x414243, since 0x4243 is less
than 0x414243. To solve this problem, our indexing scheme
needs to know the maximum length of strings that will be
compared. If the maximum length of input strings is l and
a string has the length n, then (l − n) bytes of zeros will be
HMACSHA1. Thus, for an equality query or a query that
depends on equality comparison (e.g., a query using Group
By), it will be translated to make equality comparisons on
the column SalaryEqIdx.
To support the queries involving the operations SUM
and AVG, the proxy must use homomorphic encryption algorithms, such as [4, 13], to generate ciphertext for the
SalaryEnc column. Thus, the aggregate operations can be
performed directly on the encrypted data in the SalaryEnc
column. Figure 4 summarizes the table structure seen by
the database application and the table structure managed by
the cloud database service, where the notation Staff represents the hash of name Staff, and similar notations are also
for other names.
Figure 4. Change of Table Structures
padded to the end of the converted integer.
For example, suppose l = 4. Then, “BC” is converted to 0x42430000 (two bytes of zeros are padded) and
“ABC” is converted to 0x41424300 (one byte of zeros is
padded). Clearly, we have “BC” > “ABC”, and also
0x42430000 > 0x41424300.
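The conversion can be sketched as below, assuming ASCII-only strings and a known maximum length, as in the description above. The function name `string_to_int` is hypothetical.

```python
def string_to_int(s: str, max_len: int) -> int:
    """Convert an ASCII string to an integer, zero-padded on the right to max_len bytes."""
    raw = s.encode("ascii")
    if len(raw) > max_len:
        raise ValueError("string is longer than the configured maximum length")
    # Padding at the end keeps the first character in the most significant byte,
    # so integer order matches lexical order for strings of different lengths.
    return int.from_bytes(raw.ljust(max_len, b"\x00"), "big")

print(hex(string_to_int("BC", 4)), hex(string_to_int("ABC", 4)))
```

The resulting integers can then be passed to the numeric indexing expressions like any other integer input value.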
4. Query of Encrypted Databases
We introduce how to perform range queries on encrypted
databases, under the architecture in Figure 1. The equality
and aggregate queries are also discussed.
4.2 The Translation of SQL Statements
The queries from database applications are translated by
the proxy before being executed by the cloud database service. The translation of some representative queries is introduced below. Assume the proxy has the key k. We write
Enc(k, v) for the encryption of v with k, and Hash(k, v) for
the secure hash of v with k. The numeric and string data
types are represented by Num and String, respectively.
4.1 The Basic Idea
The basic idea of performing range queries is illustrated
with the following example. Suppose the database application developers have designed a database that has a Staff
table, which includes only one column Salary. When creating such a table in a cloud database service, the proxy
hashes the table name, such that the table name is meaningless to attackers on the cloud. For the column Salary, the proxy
actually creates two corresponding columns in the created
table; their names are obtained by hashing SalaryEnc and
SalaryRngIdx, respectively, where Enc and RngIdx are suffixes also applied to other columns.
When an input value from the database application is being put into the encrypted table, the proxy encrypts the value
with an encryption algorithm such as AES, generating
the ciphertext for the SalaryEnc column, and also indexes
the value for the SalaryRngIdx column (note that the column names are hashed in the cloud database service). When
the database application issues a range query on the column
Salary, the proxy translates the query into a new one that selects the encrypted values from the column SalaryEnc with
the range condition compared on the column SalaryRngIdx.
The new query is then executed by the database service.
The basic idea also applies to equality and aggregate
queries. To support equality queries, the proxy adds another extra column, which contains the secure hash of input values. Thus, the same value always has the same hash in this
column. For example, for the Salary column, another extra column SalaryEqIdx is added. When inserting a value
into the encrypted table, the proxy hashes the value for the
column SalaryEqIdx with a secure hash algorithm like
4.2.1 Creation of Encrypted Databases and Tables
To create a database and a table, the database application
can issue the following two statements.
create database dbname
create table tblname (colnm Type,... )
In the statement above, Type is the data type for the column colnm. The statements are translated into the following
statements by the proxy. In addition, the proxy records the
schema of the created table in its metadata.
create database Hash(k,dbname)
create table Hash(k,tblname)
(Hash(k,colnm+"EqIdx") String,
Hash(k,colnm+"RngIdx") Num,
Hash(k,colnm+"Enc") String,... )
That is, three columns are created for the column colnm.
The column colnm+“EqIdx” has the type String, since its
values are always hexadecimal strings generated by secure
hash functions. The values of column colnm+“RngIdx” are
generated by our indexing mechanism and have the numerical type. The column colnm+“Enc” for ciphertext also has
the type String.
4.2.2 Insertion of Values into Tables
After a table is created, the database application can put a
new record into the table by using the following statement.
insert into tblname (colnm,... )
values (v,...)
Assume the sensitivity of values in column colnm is sens,
which is configured in the proxy. The proxy translates the
above statement into the following one for execution. In the
new statement, the value v is hashed, indexed and encrypted
for storing into different columns.
insert into Hash(k,tblname)
(Hash(k,colnm+"EqIdx"),
Hash(k,colnm+"RngIdx"),
Hash(k,colnm+"Enc"),... ) values
(Hash(k,v),Index(v,sens),Enc(k,v),...)
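The translation of an insert statement can be sketched as follows. The sketch assumes HMAC-SHA1 for Hash(k, ·), as mentioned for the prototype, and takes the indexing and encryption functions as caller-supplied parameters; `translate_insert`, `keyed_hash`, the quoting of hashed identifiers, the parameter placeholders and the stub functions in the demonstration are all hypothetical and only illustrate the shape of the translated statement.

```python
import hashlib
import hmac

def keyed_hash(key: bytes, text: str) -> str:
    """Hash(k, text): HMAC-SHA1 rendered as an upper-case hexadecimal string."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest().upper()

def translate_insert(key, table, column, value, index_fn, encrypt_fn):
    """Build the translated single-column insert statement and its parameters."""
    cols = [keyed_hash(key, column + suffix) for suffix in ("EqIdx", "RngIdx", "Enc")]
    sql = 'insert into "{}" ("{}", "{}", "{}") values (?, ?, ?)'.format(
        keyed_hash(key, table), *cols)
    params = (keyed_hash(key, str(value)), index_fn(value), encrypt_fn(value))
    return sql, params

# Stubs for the demonstration only: a noise-free toy index and a placeholder
# standing in for a real cipher such as AES (not implemented here).
toy_index = lambda v: 5 * v + 3
toy_encrypt = lambda v: "<ciphertext>"
sql, params = translate_insert(b"secret key", "Staff", "Salary", 10, toy_index, toy_encrypt)
print(sql)
print(params)
```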
Figure 5. A Fragment of Encrypted Database
the web server and returning the decrypted query results. The database application is a web application, which
includes the web server and browser. The web services and
web server are deployed over the GlassFish 3.1 platform.
The web application is designed to manage the staff in a
company and the projects they are involved in. The database
in the application includes the following two tables.
4.2.3 Queries
A query from the database application can take the following basic form.
select colnm,... from tblname where cond
If ∗ is used in the query (i.e., select * from ...), the proxy
can replace ∗ with all column names according to the table
schema in its metadata. For the basic query statement, the
proxy translates it into the following form, where the translation of cond into cond’ is discussed below.
staff(id INTEGER, name VARCHAR(32),
email VARCHAR(255), level INTEGER)
project(id INTEGER, project VARCHAR(32),
deadline TIMESTAMP)
In the database service, the schema is expanded,
with the table name and column names hashed with
the HMACSHA1 algorithm. For example, in the
encrypted database, the staff table has the name
“9EE14475FCE3725D60410AE3A9DDA94A1CBA766E”,
and the id column has led to three columns,
and the idEnc column has the name
“D97B7C1AB660AF36862144A51C384964873C4EF5”.
To test the application, we put 200 staff records and 300
project records into the encrypted database. A fragment
of the database is shown in Figure 5, where the first row
contains the HMACSHA1 hashes of four column names (idEnc,
nameEnc, emailEnc and levelEnc) and the other rows are encrypted records. In the application, the AES algorithm is
used for encryption, and the indexing programs used are
different for different columns. As an example, the following is the indexing program used for the id
column, represented in XML.
select Hash(k,colnm+"Enc"),...
from Hash(k,tblname) where cond’
The condition cond is built from the primitive
logical forms colnm < c, colnm = c and colnm > c, where
c is a constant from the domain of the colnm column, by
using the logical connectives (i.e., and, or). When translating
the condition cond, we just need to replace each primitive
logical expression with the translated one.
The condition colnm < c is translated into
Hash(k,colnm+“RngIdx”) < Index(c,0). Recall that Index(c, 0)
is the minimum index of c. The condition colnm = c is simply translated into Hash(k,colnm+“EqIdx”) = Hash(k,c).
Assume the sensitivity of values in the colnm column is
sens. Then, c + sens is the next value after c, and colnm > c is
equivalent to the new condition colnm ≥ c + sens, which is
translated into Hash(k,colnm+“RngIdx”) ≥ Index(c+sens,0).
Note that Index(c+sens,0) is the minimum index of c+sens.
The keywords order by colnm and group by colnm
are frequently used in queries.
They are translated into
order by Hash(k,colnm+“RngIdx”) and
group by Hash(k,colnm+“EqIdx”), respectively.
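The translation rules for conditions can be sketched as a small recursive function. This is only an illustration under stated assumptions: conditions are modelled as nested tuples, and hash_col, hash_val and min_index are hypothetical stand-ins for Hash(k,·) on column names, Hash(k,·) on constants, and Index(·,0). The toy stand-ins used in the demonstration are not the paper's functions.

```python
def translate(cond, hash_col, hash_val, min_index, sens_of):
    """Translate a condition tree into a condition over the index columns.

    cond is ("and"|"or", left, right) or (op, colnm, c) with op in "<", "=", ">".
    sens_of maps a column name to the sensitivity of its values.
    """
    op = cond[0]
    if op in ("and", "or"):
        left = translate(cond[1], hash_col, hash_val, min_index, sens_of)
        right = translate(cond[2], hash_col, hash_val, min_index, sens_of)
        return "(" + left + " " + op + " " + right + ")"
    _, colnm, c = cond
    if op == "<":
        # colnm < c  becomes  range index < minimum index of c
        return "{} < {}".format(hash_col(colnm + "RngIdx"), min_index(c))
    if op == "=":
        # colnm = c  becomes  equality index = hash of c
        return "{} = {}".format(hash_col(colnm + "EqIdx"), hash_val(c))
    if op == ">":
        # colnm > c  is  colnm >= c + sens, compared with the minimum index of c + sens
        nxt = c + sens_of[colnm]
        return "{} >= {}".format(hash_col(colnm + "RngIdx"), min_index(nxt))
    raise ValueError("unsupported operator: " + str(op))


# Toy stand-ins so the sketch runs (assumptions, not the real Hash or Index).
def toy_hash_col(name):
    return "H(" + name + ")"


def toy_hash_val(value):
    return "H(" + str(value) + ")"


def toy_min_index(value):
    return 10 * value


cond = ("and", (">", "level", 3), ("<", "level", 7))
translated = translate(cond, toy_hash_col, toy_hash_val, toy_min_index, {"level": 1})
print(translated)
```

Only the primitive expressions are rewritten; the logical connectives are copied through unchanged, which is why the translation can be done by a single pass over the condition.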
<indexing table="STAFF" col="ID" sens="1">
<skstep><a>2</a><b>11</b></skstep>
<ifstep><gt>50</gt>
<skstep><a>5</a><b>17</b></skstep>
<skstep><a>3</a><b>13</b></skstep>
</ifstep>
<rstep><a>7</a><b>19</b></rstep>
</indexing>
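As a rough illustration of how a proxy might load such an indexing program, the sketch below parses the XML above into a nested structure with Python's standard library. It only reads the structure and does not evaluate the steps, since the meaning of skstep, ifstep and rstep is defined by the indexing scheme itself. The function name outline and the tuple layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

PROGRAM = """<indexing table="STAFF" col="ID" sens="1">
<skstep><a>2</a><b>11</b></skstep>
<ifstep><gt>50</gt>
<skstep><a>5</a><b>17</b></skstep>
<skstep><a>3</a><b>13</b></skstep>
</ifstep>
<rstep><a>7</a><b>19</b></rstep>
</indexing>"""


def outline(elem):
    """Return (tag, params, nested_steps) for one element.

    Child elements without children of their own (a, b, gt) are treated as
    integer parameters; child elements with children are treated as steps.
    """
    params = {child.tag: int(child.text) for child in elem if len(child) == 0}
    steps = [outline(child) for child in elem if len(child) > 0]
    return (elem.tag, params, steps)


root = ET.fromstring(PROGRAM)
settings = dict(root.attrib)  # table, col and sens attributes, kept as strings
program = outline(root)
print(settings)
for step in program[2]:
    print(step)
```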
5. Implementation and Experiment
We implemented a prototype of our indexing scheme for
querying encrypted databases. In the implementation, we
simulate a database service by wrapping up the Apache
Derby database management system with a SOAP-based
web service interface, which is accessed by the proxy to
query over the encrypted database. The query proxy is also
implemented as a web service, accepting SQL queries from
The query over the encrypted database is illustrated by
the following example. Given the range query below, Figure
6 shows the query result returned by the database service
and the decryption result generated by the proxy.
select * from staff natural join project
where "deadline">'2012/6/9' and
"deadline"<'2012/8/9'
scheme to query encrypted databases by query translation.
A prototype is implemented to demonstrate our system.
References
[1] D. Agrawal, A. E. Abbadi, F. Emekçi, and A. Metwally.
Database management as a service: Challenges and opportunities. In Proceedings of the 25th International Conference
on Data Engineering, pages 1709–1716, 2009.
[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order preserving encryption for numeric data. In Proceedings of the
2004 ACM SIGMOD international conference on Management of data, SIGMOD ’04, pages 563–574, 2004.
[3] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill. Order-preserving symmetric encryption. In Proceedings of the 28th
Annual International Conference on Advances in Cryptology, EUROCRYPT ’09, pages 224–241, 2009.
[4] Z. Brakerski and V. Vaikuntanathan. Fully homomorphic encryption from ring-lwe and security for key dependent messages. In Proceedings of the 31st annual conference on Advances in cryptology, CRYPTO’11, pages 505–524, 2011.
[5] CircleID Reporter. Survey: Cloud computing ‘no
hype’, but fear of security and control slowing adoption.
http://www.circleid.com/posts/20090226_cloud_computing_hype_security, Feb. 2009.
[6] E. A. Fox, Q. F. Chen, A. M. Daoud, and L. S. Heath. Order-preserving minimal perfect hash functions and information
retrieval. ACM Trans. Inf. Syst., 9:281–308, July 1991.
[7] A. Haeberlen. A case for the accountable cloud. SIGOPS
Oper. Syst. Rev., 44:52–57, April 2010.
[8] B. Hore, S. Mehrotra, and G. Tsudik. A privacy-preserving
index for range queries. In Proceedings of the 30th international conference on Very large data bases, 2004.
[9] A. C. König and G. Weikum. Combining histograms and
parametric curve fitting for feedback-driven query result-size estimation. In Proceedings of the 25th International
Conference on Very Large Data Bases, 1999.
[10] F. D. McSherry. Privacy integrated queries: an extensible
platform for privacy-preserving data analysis. In Proceedings of the 35th SIGMOD international conference on Management of data, SIGMOD ’09, pages 19–30, 2009.
[11] D. Micciancio. A first glimpse of cryptography’s holy grail.
Commun. ACM, 53(3):96, 2010.
[12] G. Ozsoyoglu, D. A. Singer, and S. S. Chung. Anti-tamper
databases: Querying encrypted databases. In Proc. of
the 17th Annual IFIP WG 11.3 Working Conference on
Database and Applications Security, pages 4–6, 2003.
[13] P. Paillier. Public-key cryptosystems based on composite
degree residuosity classes. In Proceedings of the 17th international conference on Theory and application of cryptographic techniques, pages 223–238, 1999.
[14] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. CryptDB: protecting confidentiality with encrypted
query processing. In Proceedings of the Twenty-Third ACM
Symposium on Operating Systems Principles, 2011.
[15] N. Santos, K. P. Gummadi, and R. Rodrigues. Towards
trusted cloud computing. In Proceedings of the 2009 conference on Hot topics in cloud computing, 2009.
[16] A. Shamir. How to share a secret. Commun. ACM, 22:612–
613, November 1979.
Figure 6. A Query Result and its Decryption
6. Related Works
The most closely related works include the order-preserving encryption scheme [2], the order-preserving polynomials [1]
and the order-preserving indexing scheme [8]. In addition
to the differences discussed before, the programmability of
indexing expressions is a unique feature of our scheme and
can improve the robustness of our scheme by indexing different input values with different indexing expressions.
The work in [12] uses strictly increasing functions to implement order-preserving encryption. These functions can be
of higher order and can be sequentially composed. However,
all input values are encrypted by the same functions. The
functions add no noise to the encryption result, and
hence the secret coefficients can be recovered when attackers
know some pairs of plaintexts and ciphertexts.
The order-preserving hash functions discussed in [6]
map a set of input values into a set of hash values for fast information retrieval, with the hash values preserving the order of input values. These hash functions are not designed
for security. For example, there are no secret values (like encryption keys) that prevent the recovery of input
values from hash values.
CryptDB [14] is a system supporting SQL queries
over encrypted databases, where range queries rely on
order-preserving encryption [3]. Our method can be incorporated into such systems to process range queries.
7. Conclusion
In this paper, we proposed a method of generating order-preserving indexes to facilitate range queries over encrypted databases. Our indexing is simple to use since it is
based on linear expressions. The basic linear indexing expression is information-theoretically secure since random noise is added to each index. We showed how
to control the amount of noise so that the randomized indexes remain order-preserving. Our scheme is programmable, meaning that the basic indexing expressions
can be composed together to improve the robustness of the
indexing programs and hide the distribution of input values from indexes. We introduced how to apply the indexing