2012 IEEE Fifth International Conference on Cloud Computing

Programmable Order-Preserving Secure Index for Encrypted Database Query

Dongxi Liu, Shenlu Wang∗
CSIRO ICT Centre, Marsfield, NSW 2122, Australia
{shenlu.wang,dongxi.liu}@csiro.au

Abstract

intentionally, or by attackers who compromise the database service platforms. Since database services are a kind of cloud computing service, the techniques of trusted cloud computing have the potential to be used to build trusted database services. However, there is still a gap in applying the techniques of trusted cloud computing, such as [7, 15], to address the security and privacy problem in database services.

For cloud database services, a straightforward approach to addressing the security and privacy problem is to encrypt the database. In this way, the service provider or an attacker can see only the meaningless encrypted data. However, once encrypted, a database cannot be easily queried. It is not acceptable to decrypt the entire database before performing each query, because the decryption might be very slow for a large database and the decrypted database is again at risk of having its security and privacy breached. Ideally, a query should be executed directly over the encrypted database.

A database query can be an equality query, a range query, an aggregate query or a combination of these. In this paper, we focus on the problem of performing range queries on encrypted databases. For example, a range query can be “select staff who joined the company between 2000 and 2012”. For the other two types of queries over encrypted databases, equality queries are not hard to handle when a deterministic encryption scheme (e.g., AES in ECB mode) is used, since in this scheme the same plaintexts are always encrypted into the same ciphertexts, and aggregate queries need homomorphic encryption algorithms [11] to process the SQL operations SUM and AVG over encrypted databases.
We also describe how to apply our method together with secure hash algorithms and homomorphic encryption algorithms to deal with all types of queries over encrypted databases.

To deal with range queries on encrypted databases, an order-preserving encryption scheme has been proposed in [2]. In this scheme, the ith value in the plaintext domain is mapped to the ith value in the ciphertext domain, such that the order between plaintexts is preserved between ciphertexts. To use this scheme, users need to be able to model the distributions of values in the plaintext and ciphertext

Database services on the cloud are emerging as an attractive way of outsourcing databases. When a database is deployed on a cloud database service, data security and privacy become a major concern for users. A straightforward way to address this concern is to encrypt the database. However, an encrypted database cannot be easily queried. In this paper, we propose an order-preserving scheme for indexing encrypted data, which facilitates range queries over encrypted databases. The scheme is secure since it randomizes each index with noise, such that the original data cannot be recovered from the indexes. Moreover, our scheme allows the programmability of basic indexing expressions, and thus the distribution of the original data can be hidden from the indexes.

1. Introduction

Cloud database services, such as Amazon Relational Database Service (RDS) and Microsoft SQL Azure, are emerging as an attractive way for enterprises to outsource their databases. In cloud database services, the hardware and software underlying databases are shared among users. The database services allow enterprises to deploy their databases quickly without making a large investment in proprietary hardware and software, hence reducing the total cost of ownership.
Moreover, database services on the cloud can be elastic, meaning that an enterprise can dynamically increase or decrease the compute resources allocated to its databases according to its business requirements.

Though attractive as a new paradigm of data management, database services cannot be fully exploited if the problem of data privacy and security is not addressed [1, 5]. When a database is deployed into a public database service, the service provider has complete physical control over the database. The data in the database might be improperly accessed by the service provider accidentally or

∗ Shenlu Wang is a vacation student from RMIT University.

978-0-7695-4755-8/12 $26.00 © 2012 IEEE DOI 10.1109/CLOUD.2012.65

domains. However, when using cloud database services, an enterprise may not have database professionals who know the techniques [9] for data distribution modeling. In addition, the scheme [2] can only deal with plaintexts in a finite domain. The cryptographic study of the order-preserving encryption scheme is done in [3].

The work [1] shows a way of building order-preserving polynomials, which are based on the polynomials proposed by Shamir for secret sharing [16]. However, the proposed mechanism is only applicable to a finite plaintext domain, where the number of plaintexts is needed to determine the range of coefficients in a polynomial. In addition, the evaluation results of order-preserving polynomials may reveal the distribution of plaintexts, since similar plaintexts are transformed with similar polynomials. As discussed in [2], the coupling of the distributions of the plaintext and ciphertext domains might be exploited by attackers to guess the scope of the corresponding plaintext for a ciphertext.

In [8], an indexing mechanism for range queries is proposed. This mechanism is not strictly order-preserving, since two different values may be mapped into the same bucket, which is used when checking query conditions.
The mechanism can lead to inaccurate query results, and hence some post-processing is needed to remove unexpected query results.

In this paper, we propose an order-preserving indexing scheme, which is secure and easy to use. The scheme is built over simple linear expressions of the form a ∗ x + b. The form of the expressions is public; however, the coefficients a and b are kept secret (not known by attackers). Based on the linear expressions, the indexing scheme maps an input value v to a ∗ v + b + noise, where noise is a random value. The noise is carefully selected, such that the order of input values is preserved. For example, suppose the linear expression is defined over integers (i.e., a, b and x are all integers); then the noise is selected from the set {0, 1, ..., a − 1}. When more input values are indexed, more noise is introduced into the result, implying that attackers cannot recover the input values from the generated indexes. Hence, our indexing scheme is information-theoretically secure, since attackers cannot get enough information to solve the linear equations over the input values and the generated indexes.

Figure 1. Architecture of Querying Encrypted Databases

Our indexing scheme allows the programmability of basic indexing expressions (i.e., the linear expressions). Users can write an indexing program that handles different input values with different indexing expressions. On the one hand, the programmability improves the robustness of our scheme against brute-force attacks, since there are more indexing expressions to attack. On the other hand, the programmability can help decouple the distributions of input values and indexes. When a single linear expression is used to index all input values, the distribution of indexes is identical to the distribution of input values. This problem can be addressed by designing appropriate indexing programs. For example, suppose input values are uniformly distributed.
Then, if the indexing program maps a bigger input value into an index that is distributed in a bigger range, the indexes do not follow the uniform distribution. Hence, the distribution of input values is not revealed by the indexes.

Our indexing scheme is easier to use than that in [2], since our scheme does not need users to model the data distribution. Unlike the scheme in [2], our scheme does not generate indexes with a specified distribution. We only require that the indexes do not reveal the distribution of input values. Our indexing scheme only depends on linear expressions, which are easier for users to understand and use than the polynomials used in [1]. The usability of security mechanisms is important for them to be adopted effectively in practice. In addition, unlike the schemes in [1, 2], our scheme is not an encryption scheme. It is used together with existing encryption algorithms (e.g., AES) to deal with range queries over encrypted databases. Thus, our scheme can benefit from advances in encryption algorithm research.

The rest of the paper is organized as follows. Section 2 describes the architecture of querying encrypted databases. Section 3 gives the details of our indexing scheme. Section 4 introduces query translation. In Section 5, we describe a prototype of the system. Finally, related work and the conclusion are given.

2. The Architecture of Querying Encrypted Databases

In this section, we describe the architecture in which our indexing scheme is used in queries to encrypted databases. The architecture is shown in Figure 1. In this

the encrypted databases, the attackers there cannot break the indexes if they do not know a, b and any input values. That is, the basic indexing scheme is secure against ciphertext-only attacks. Though in our threat model we do not allow attackers to choose arbitrary input values, the attackers may happen to know the input values of some particular indexes.
In this case, they may be able to recover a and b by solving two linear equations, since the equations have only two unknowns, a and b. Suppose attackers know two different input values v1 and v2 corresponding respectively to indexes i1 and i2; then the following two equations can be used to recover a and b.

architecture, there is a database service provided in a public cloud, and an enterprise that deploys into the cloud a database, which is encrypted by the enterprise to protect its privacy. To query or update the encrypted database, the enterprise has a query proxy managing the communication between the database applications and the encrypted database. When a query is received from an application, the proxy translates it into a query that can be executed directly over the encrypted database. When a query result is returned from the database, the query proxy decrypts it before forwarding the result to the application. The query proxy depends on some metadata, such as keys and the database schema, to translate queries and decrypt query results.

Briefly, when a value is put into the database, the proxy uses the indexing mechanism to generate its index and also encrypts the value with some encryption algorithm such as AES. The index and the encrypted value are then stored into corresponding fields in the same record of the database. When a range query is made, the proxy calculates the index of the value in the query condition, which is then used by the database service to search the indexes stored in the database.

The order-preserving indexing mechanism reveals the order information of encrypted values. Hence, a cryptographic system based on order-preserving encryption or order-preserving indexing is vulnerable to chosen-plaintext attacks [2, 3]. In this architecture, the proxy is placed inside the administrative boundary of the enterprise. The attackers from the cloud cannot control the proxy.
Hence, the attackers cannot recover the encrypted values by using chosen-plaintext attacks.

3. Order-Preserving Secure Indexing and Its Programmability

There are several data types (e.g., integer, double, string) used in a database. In our work, we design the indexing scheme primarily for numerical values, and other data types are translated into integers before indexing.

$$a * v_1 + b = i_1$$
$$a * v_2 + b = i_2$$

3.2 Order-Preserving Indexing with Randomness

To solve the vulnerability described above, our idea is to add some random noise to each index. That is, given two input values v1 and v2, their indexes i1 and i2 will be a ∗ v1 + b + noise1 and a ∗ v2 + b + noise2, respectively, where noise1 and noise2 are randomly sampled from some range (to be defined later) by the query proxy. Consequently, even if v1, v2 and their indexes are known accidentally by attackers on the cloud, they still cannot have enough information (i.e., due to the random noises) to solve the following equations.

$$a * v_1 + b + \mathit{noise}_1 = i_1$$
$$a * v_2 + b + \mathit{noise}_2 = i_2$$

In the following, we describe how to determine the range of noises, such that if v1 > v2 and a > 0, then a ∗ v1 + b + noise1 > a ∗ v2 + b + noise2.

3.2.1 Randomized Order-Preserving Indexing Over Integers

We start the definition of the noise range from a special case, building up the intuition behind our method. In this special case, we assume the input values and coefficients in the linear expression are all integers. Suppose v1 and v2 are two integers and v1 > v2. Then, the gap between them is at least 1, that is, v1 − v2 ≥ 1. We will use sensitivity to mean the least gap, as in differential privacy research [10]. To determine how much noise can be added into indexes, such that the indexes keep the order between v1 and v2, we need to know the least gap between a ∗ v1 + b (denoted i1) and a ∗ v2 + b (denoted i2). Since v1 − v2 ≥ 1, we have i1 − i2 = a ∗ (v1 − v2) and hence i1 − i2 ≥ a ∗ 1 and i1 ≥ i2 + a ∗ 1.
If noise1 and noise2 are both randomly sampled from the range [0, a ∗ 1) (we keep writing a ∗ 1 to make the sensitivity of input values explicit in the noise range), then we have i1 + noise1 > i2 + noise2, which holds even when noise1 is 0 (the minimum of noise1) and noise2 takes its maximum in [0, a ∗ 1).

3.1 Basic Order-Preserving Indexing

Our indexing scheme is based on the linear expression a ∗ x + b, where x is the input value, and a and b are secret coefficients (only known by the query proxy in the architecture of Figure 1). The input value and coefficients can be integers or real numbers. To make sure the linear expression is strictly increasing, we require a > 0 in the linear expression. Hence, for all v1 and v2, if v1 > v2 and a > 0, then a ∗ v1 + b > a ∗ v2 + b.

As shown above, the basic linear expression respects the order of input values. When the outputs of the linear expressions, used as indexes of the input values, are put into

v2) + noise1 − noise2 > 0. According to the definition of randomized indexes, both noise1 and noise2 lie in the range [0, a ∗ sens). Since noise1 ≥ 0, the proof goal holds if a ∗ (v1 − v2) − noise2 > 0. Since sens is the sensitivity of the input values, we have v1 − v2 ≥ sens and hence a ∗ (v1 − v2) ≥ a ∗ sens > noise2, that is, a ∗ (v1 − v2) − noise2 > 0.

In the following, we introduce a special type of randomized indexes. In this type of indexes, the sensitivity of indexes is the same as that of input values. Such sensitivity-keeping indexes will make indexing programs easier to write, as discussed in the next subsection.

For example, suppose the linear expression over integers is 5 ∗ x + 3; then the noise can be randomly selected from the range [0, 5). Hence, the index of input value 1 is distributed in the range [8, 13), the index of 2 is in [13, 18), and so on.
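The integer example above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `randomized_index_int` is hypothetical, and drawing the noise with `secrets.randbelow` is an assumption made here, since the text does not say which random source the proxy uses.

```python
import secrets


def randomized_index_int(v: int, a: int, b: int) -> int:
    """Map integer v to a*v + b + noise, with noise drawn from {0, ..., a-1}."""
    if a <= 0:
        raise ValueError("a must be a positive integer")
    noise = secrets.randbelow(a)  # uniform over 0, 1, ..., a-1
    return a * v + b + noise


# Example from the text: linear expression 5*x + 3, noise range [0, 5).
a, b = 5, 3
samples_1 = [randomized_index_int(1, a, b) for _ in range(1000)]
samples_2 = [randomized_index_int(2, a, b) for _ in range(1000)]

# Indexes of 1 fall in [8, 13) and indexes of 2 in [13, 18),
# so every index of 2 exceeds every index of 1.
print(min(samples_1), max(samples_1))
print(min(samples_2), max(samples_2))
```

Because the noise never reaches a, the index ranges of neighbouring integers touch but do not overlap, which is what keeps the order intact.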
3.2.2 Randomized Order-Preserving Indexing

As shown above, the sensitivity of input values is needed to determine the amount of noise that can be added into indexes. The following is the formal definition of sensitivity of input values.

Definition Let V be the set of all input values. The sensitivity of V is the minimum element in the set $\{|v_1 - v_2| \mid v_1 \in V, v_2 \in V, v_1 \neq v_2\}$.

By its definition, the sensitivity is always greater than 0. The sensitivity of input values is usually specific to applications. For example, if the salary in a company takes the format d1d2d3.d4d5, where each di is a digit, then the sensitivity of salary is 0.01. That is, the least salary difference between two staff members is 0.01 in the company. For another example, if the input values in an application can only be even numbers, then the sensitivity of input values in this application is 2.

Definition Given the sensitivity sens of input values V, if a > 1, then the sensitivity-keeping index of value v ∈ V is a ∗ v + b + noise, where noise is randomly sampled from the range [0, a ∗ sens − sens].

Note that the sensitivity-keeping index of value v is defined only when a > 1, which ensures a ∗ sens − sens > 0. Consider the previous example where the linear expression is 7.2 ∗ x + 3.75 and the sensitivity of input values is 0.01. Then, the range of noises is [0, 0.072 − 0.01] (i.e., [0, 0.062]). The sensitivity-keeping index of v is indicated by the notation $skindex^{sens}_{[a,b]}(v)$. The following theorem states that the sensitivity of input values is kept by indexes.

Theorem Given the sensitivity sens of input values V, v1 ∈ V and v2 ∈ V, if v1 − v2 = sens, then $skindex^{sens}_{[a,b]}(v_1) - skindex^{sens}_{[a,b]}(v_2) \geq sens$.

Definition Given the sensitivity sens of input values V, the randomized index of value v ∈ V is a ∗ v + b + noise, where a > 0 and noise is randomly sampled from the range [0, a ∗ sens).
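The three definitions above can be tried out with the following Python sketch. It is a minimal illustration under stated assumptions, not the paper's code: the function names `sensitivity`, `rindex` and `skindex` simply mirror the text, exact `Fraction` arithmetic is chosen here to avoid floating-point ties at range boundaries, and `secrets.SystemRandom` is an assumed noise source.

```python
import secrets
from fractions import Fraction

_rng = secrets.SystemRandom()  # assumed noise source; random() lies in [0, 1)


def sensitivity(values):
    """Smallest gap between two distinct input values."""
    distinct = sorted(set(values))
    return min(y - x for x, y in zip(distinct, distinct[1:]))


def rindex(v, a, b, sens):
    """Randomized index: a*v + b + noise with noise in [0, a*sens), a > 0."""
    if a <= 0:
        raise ValueError("rindex requires a > 0")
    noise = Fraction(_rng.random()) * a * sens
    return a * v + b + noise


def skindex(v, a, b, sens):
    """Sensitivity-keeping index: noise in [0, a*sens - sens], a > 1."""
    if a <= 1:
        raise ValueError("skindex requires a > 1")
    # Sampling from the half-open range [0, a*sens - sens) stays inside
    # the closed range given in the definition.
    noise = Fraction(_rng.random()) * (a * sens - sens)
    return a * v + b + noise


# Salary-style example from the text: expression 7.2*x + 3.75.
a, b = Fraction("7.2"), Fraction("3.75")
values = [Fraction("2.04"), Fraction("2.05"), Fraction("2.07")]
sens = sensitivity(values)

low, high = rindex(values[0], a, b, sens), rindex(values[1], a, b, sens)
print(float(low), float(high))
print(skindex(values[1], a, b, sens) - skindex(values[0], a, b, sens) >= sens)
```

The last line checks the sensitivity-keeping property for two values that differ by exactly sens; the gap between their indexes never drops below sens because at most (a − 1) ∗ sens of noise can be subtracted from a ∗ sens.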
For the proof of this theorem, we have $skindex^{sens}_{[a,b]}(v_1) - skindex^{sens}_{[a,b]}(v_2) = a * (v_1 - v_2) + \mathit{noise}_1 - \mathit{noise}_2 = a * sens + \mathit{noise}_1 - \mathit{noise}_2$. According to the definition of skindex, we have 0 ≤ noise1 ≤ (a − 1) ∗ sens and 0 ≤ noise2 ≤ (a − 1) ∗ sens. The difference is smallest when noise1 = 0 and noise2 = (a − 1) ∗ sens, which gives a ∗ sens − (a − 1) ∗ sens = sens, and hence a ∗ sens + noise1 − noise2 ≥ sens. Since the sensitivity sens is greater than 0, the theorem also shows that the order between v1 and v2 is preserved.

To keep sensitivity, skindex withholds some noise (i.e., the amount of sens). In the next section, we will show that skindex is always followed by rindex in an indexing program, such that there is no noise withheld from the final indexes.

For example, suppose the linear expression is 7.2 ∗ x + 3.75, and the sensitivity of input values is 0.01. Then, the range for generating noises is [0, 0.072). For two example input values 2.04 and 2.05, their randomized indexes are calculated by 7.2 ∗ 2.04 + 3.75 + noise1 and 7.2 ∗ 2.05 + 3.75 + noise2, and hence distributed in the ranges [18.438, 18.51) and [18.51, 18.582), respectively. Note that, due to the random noises, two identical values can have different indexes.

We use the notation $rindex^{sens}_{[a,b]}(v)$ to represent the randomized index of input value v, calculated by using the above definition. The following theorem shows that the randomized index defined above is order-preserving, reflecting the correctness of the randomized indexing scheme.

Theorem Given the sensitivity sens of input values V, for all v1 ∈ V and v2 ∈ V, if v1 > v2, then $rindex^{sens}_{[a,b]}(v_1) > rindex^{sens}_{[a,b]}(v_2)$.

3.3 Programmability of Indexes

In this section, we describe how to compose basic indexing expressions (skindex or rindex) into indexing programs. Briefly, an indexing program allows different input values to be indexed by different linear indexing expressions and allows indexes to be indexed again (like the 3DES algorithm, in which a ciphertext is encrypted again by DES).
To prove this theorem, we need to show that $rindex^{sens}_{[a,b]}(v_1) - rindex^{sens}_{[a,b]}(v_2) > 0$. Let noise1 and noise2 denote the noises added to the indexes of v1 and v2, respectively. Then, our proof goal becomes a ∗ (v1 −

$$
\begin{array}{lcl}
I & ::= & rindex^{sens}_{[a,b]} \mid S;\ rindex^{sens}_{[a,b]} \\
S & ::= & skindex^{sens}_{[a,b]} \mid \text{if } C \text{ then } S_1 \text{ else } S_2 \mid S_1;\ S_2 \\
C & ::= & gt(c) \mid ge(c)
\end{array}
$$

Figure 2. Abstract Syntax of Indexing Programs

$$
\begin{array}{lcl}
I & = & skindex^{1}_{[3.1,14.7]};\ S;\ rindex^{1}_{[0.3,73]} \\
S & = & \text{if } gt(1200) \text{ then } skindex^{1}_{[12,121.5]} \text{ else } S_1 \\
S_1 & = & \text{if } gt(900) \text{ then } skindex^{1}_{[9.2,81.7]} \text{ else } S_2 \\
S_2 & = & \text{if } gt(650) \text{ then } skindex^{1}_{[6.3,78.3]} \text{ else } S_3 \\
S_3 & = & \text{if } gt(400) \text{ then } skindex^{1}_{[4.1,65.2]} \text{ else } S_4 \\
S_4 & = & \text{if } gt(280) \text{ then } skindex^{1}_{[3.3,43.6]} \text{ else } S_5 \\
S_5 & = & \text{if } gt(150) \text{ then } skindex^{1}_{[2.5,30.1]} \text{ else } S_6 \\
S_6 & = & \text{if } gt(100) \text{ then } skindex^{1}_{[1.8,19.7]} \text{ else } skindex^{1}_{[1.2,3.7]}
\end{array}
$$

Figure 3. An Indexing Program Example

The syntax of indexing programs is shown in Figure 2. An indexing program I is either $rindex^{sens}_{[a,b]}$ or has the form $S;\ rindex^{sens}_{[a,b]}$, where S is the composition of sensitivity-keeping indexing expressions. S can be a basic sensitivity-keeping indexing expression $skindex^{sens}_{[a,b]}$, a conditional indexing expression, or a sequential composition of expressions. In the conditional indexing expression, C means a condition, which can be gt(c) or ge(c), where c is a constant.

The semantics of indexing programs is defined as follows. Suppose v is an input value. Then, I(v) means the application of I to v, generating v’s index. If I is $rindex^{sens}_{[a,b]}$, then $I(v) = rindex^{sens}_{[a,b]}(v)$. If I is $S;\ rindex^{sens}_{[a,b]}$, then $I(v) = rindex^{sens}_{[a,b]}(i)$, where i = S(v). The semantics of indexing steps S is defined inductively. If S is $skindex^{sens}_{[a,b]}$, then $S(v) = skindex^{sens}_{[a,b]}(v)$. If S is the conditional indexing step, then S(v) = S1(v) if v makes the condition C true; otherwise, S(v) = S2(v). The condition C is gt(c) or ge(c). The condition gt(c) is true if v > c, and ge(c) is true if v ≥ c.
If S is a sequential composition of steps, then S(v) = S2(i), where i = S1(v).

An indexing program is said to be well-formed if it is order-preserving. Since the basic indexing expressions skindex and rindex in an indexing program are already order-preserving, the program is order-preserving if all its conditional indexing expressions are also order-preserving. Any conditional indexing expression if C then S1 else S2, where C is gt(c) or ge(c), is order-preserving if S1(c) ≥ S2(c). This condition also makes sure there is no overlap among indexes generated by S1 and S2. Note that this order-preserving condition can be checked by using only the program code (i.e., without using any input values).

When writing an indexing program, the argument sens on all skindex and rindex expressions represents the sensitivity of input values. In an indexing program that consists of a sequence of expressions, all intermediate indexes are calculated by skindex, which does not change the sensitivity of input values. Hence, programmers can use the sensitivity of input values in the whole program, easing the burden of programming.

An indexing program example is given in Figure 3. In this example, we assume the sensitivity of input values is 1. Suppose input values are from the range [0, 500] and evenly distributed. This indexing program first transforms the input values with $skindex^{1}_{[3.1,14.7]}$, leading to intermediate indexes in the range [14.7, 1566.8] (i.e., the upper bound 1566.8 is calculated by 3.1 ∗ 500 + 14.7 + 3.1 ∗ 1 − 1). Then, the program divides the intermediate indexes into eight parts, processed by indexing expressions with different coefficients. Finally, a randomized indexing expression is applied to generate the final indexes. In this example the indexes are not evenly distributed, since a bigger index is distributed in a bigger range.

The programmability of indexes increases the robustness of our indexing scheme in two aspects.
First, input values can be indexed by multiple linear expressions, making brute-force attacks harder. Second, the distribution of indexes can be decoupled from the distribution of input values, making it harder to estimate the range of input values according to the positions of indexes.

The following notations will be used later. Let Index be an indexing program, which is used secretly by the proxy when translating queries. Then, Index(v, s) generates the index of v by using the program Index, with all indexing expressions in the program taking s as their sensitivity. In particular, Index(v, 0) means the index of v without adding any noise, which is the minimum index of v.

3.4 Indexing String Input Values

In this section, we introduce how to convert a string into an integer, such that our indexing scheme can be applied. Our basic idea is to convert a string into an integer, where a character in the string has its ASCII encoding as the value of the corresponding byte in the integer. For example, “BC” is converted to 0x4243.

Strings are usually compared in lexical order. For example, the string “BC” is greater than “ABC”. When strings are converted into integers, their order must be preserved. Hence, it is not acceptable that “BC” is converted to 0x4243 and “ABC” is converted to 0x414243, since 0x4243 is less than 0x414243. To solve this problem, our indexing scheme needs to know the maximum length of strings that will be compared. If the maximum length of input strings is l and a string has the length n, then (l − n) bytes of zeros will be

HMACSHA1. Thus, for an equality query or a query that depends on equality comparison (e.g., a query using Group By), it will be translated to make equality comparisons on the column SalaryEqIdx.

To support the queries involving the operations SUM and AVG, the proxy must use homomorphic encryption algorithms, such as [4, 13], to generate ciphertext for the SalaryEnc column.
Thus, the aggregate operations can be performed directly on the encrypted data in the SalaryEnc column. Figure 4 summarizes the table structure seen by the database application and the table structure managed by the cloud database service, where the notation Staff represents the hash of the name Staff, and similar notations are used for other names.

Figure 4. Change of Table Structures

padded to the end of the converted integer. For example, suppose l = 4. Then, “BC” is converted to 0x42430000 (two bytes of zeros are padded) and “ABC” is converted to 0x41424300 (one byte of zero is padded). Clearly, we have “BC” > “ABC”, and also 0x42430000 > 0x41424300.

4. Query of Encrypted Databases

We introduce how to perform range queries on encrypted databases, under the architecture in Figure 1. The equality and aggregate queries are also discussed.

4.2 The Translation of SQL Statements

The queries from database applications are translated by the proxy before being executed by the cloud database service. The translation of some representative queries is introduced below. Assume the proxy has the key k. We write Enc(k, v) for the encryption of v with k, and Hash(k, v) for the secure hash of v with k. The numeric and string data types are represented by Num and String, respectively.

4.1 The Basic Idea

The basic idea of performing range queries is illustrated with the following example. Suppose the database application developers have designed a database that has a Staff table, which includes only one column Salary. When creating such a table in a cloud database service, the proxy hashes the table name, such that the table name is meaningless to attackers on the cloud. For the column Salary, the proxy actually creates two corresponding columns in the created table; their names are obtained by hashing SalaryEnc and SalaryRngIdx, respectively, where Enc and RngIdx are postfixes also applied to other columns.
When an input value from the database application is being put into the encrypted table, the proxy encrypts the value with an encryption algorithm such as AES, generating the ciphertext for the SalaryEnc column, and also indexes the value for the SalaryRngIdx column (note that the column names are hashed in the cloud database service). When the database application issues a range query on the column Salary, the proxy translates the query into a new one that selects the encrypted values from the column SalaryEnc with the range condition compared on the column SalaryRngIdx. The new query is then executed by the database service.

The basic idea also applies to equality and aggregate queries. To support equality queries, the proxy adds another extra column, which contains the secure hash of input values. Thus, the same value appears the same in this column. For example, for the Salary column, another extra column SalaryEqIdx is added. When inserting a value into the encrypted table, the proxy hashes the value for the column SalaryEqIdx with the secure hash algorithms like

4.2.1 Creation of Encrypted Databases and Tables

To create a database and a table, the database application can issue the following two statements.

    create database dbname
    create table tblname (colnm Type,... )

In the statements above, Type is the data type for the column colnm. The statements are translated into the following statements by the proxy. In addition, the proxy records the schema of the created table in its metadata.

    create database Hash(k,dbname)
    create table Hash(k,tblname)
      (Hash(k,colnm+"EqIdx") String, Hash(k,colnm+"RngIdx") Num, Hash(k,colnm+"Enc") String,... )

That is, three columns are created for the column colnm. The column colnm+“EqIdx” has the type String, since its values are always hexadecimal strings generated by secure hash functions. The values of column colnm+“RngIdx” are generated by our indexing mechanism and have the numerical type.
The column colnm+“Enc” for ciphertext also has the type String.

4.2.2 Insertion of Values into Tables

After a table is created, the database application can put a new record into the table by using the following statement.

    insert into tblname (colnm,... ) values (v,...)

Assume the sensitivity of values in column colnm is sens, which is configured in the proxy. The proxy translates the above statement into the following one for execution. In the new statement, the value v is hashed, indexed and encrypted for storing into different columns.

    insert into Hash(k,tblname)
      (Hash(k,colnm+"EqIdx"), Hash(k,colnm+"RngIdx"), Hash(k,colnm+"Enc"),... )
      values (Hash(k,v),Index(v,sens),Enc(k,v),...)

4.2.3 Queries

A query from the database application can take the following basic form.

    select colnm,... from tblname where cond

If ∗ is used in the query (i.e., select * from ...), the proxy can replace ∗ with all column names according to the table schema in its metadata. For the basic query statement, the proxy translates it into the following form, where the translation of cond into cond’ is discussed below.

Figure 5. A Fragment of Encrypted Database

the web server and returning the decrypted query results. The database application is a web application, which includes the web server and browser. The web services and web server are deployed over the GlassFish 3.1 platform. The web application is designed to manage the staff in a company and the projects they are involved in. The database in the application includes the following two tables.

    staff(id INTEGER, name VARCHAR(32), email VARCHAR(255), level INTEGER)
    project(id INTEGER, project VARCHAR(32), deadline TIMESTAMP)

In the database service, the schema is expanded, with the table name and column names hashed with the HMACSHA1 algorithm.
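The hashing of schema names can be sketched with Python's standard `hmac` module. This is only an illustration under assumptions: the helper `hash_name`, the key value, the UTF-8 encoding and the uppercase hexadecimal output are all choices made here, so the digests it prints will not match the names produced by the prototype.

```python
import hashlib
import hmac


def hash_name(key: bytes, name: str) -> str:
    """Return the HMAC-SHA1 of a schema name as an uppercase hex string."""
    return hmac.new(key, name.encode("utf-8"), hashlib.sha1).hexdigest().upper()


key = b"example-proxy-key"  # hypothetical key; the paper does not give one

# One hashed table name, and three hashed column names per original column.
table = hash_name(key, "staff")
columns = {
    suffix: hash_name(key, "id" + suffix)
    for suffix in ("EqIdx", "RngIdx", "Enc")
}

print(table)
for suffix, hashed in columns.items():
    print(suffix, hashed)
```

Each digest is a 40-character hexadecimal string, the same length as the table and column names quoted from the prototype.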
For example, in the encrypted database, the staff table has the name “9EE14475FCE3725D60410AE3A9DDA94A1CBA766E”, and the id column is expanded into three columns, of which the idEnc column has the name “D97B7C1AB660AF36862144A51C384964873C4EF5”. To test the application, we put 200 staff records and 300 project records into the encrypted database. A fragment of the database is shown in Figure 5, where the first row contains the HMACSHA1 hashes of four column names (idEnc, nameEnc, emailEnc and levelEnc) and the other rows are encrypted records. In the application, the AES algorithm is used for encryption, and the indexing programs used are different for different columns. As an example, for the id column, the following is the indexing program used, represented in XML.

    select Hash(k,colnm+"Enc"),... from Hash(k,tblname) where cond’

The condition cond is defined over the primitive logical forms colnm < c, colnm = c and colnm > c, where c is a constant from the domain of the colnm column, by using the logical connectives (i.e., and, or). When translating the condition cond, we just need to replace each primitive logical expression with its translation.

The condition colnm < c is translated into Hash(k,colnm+“RngIdx”) < Index(c,0). Recall that Index(c,0) is the minimum index of c. The condition colnm = c is simply translated into Hash(k,colnm+“EqIdx”) = Hash(k,c). Assume the sensitivity of values in the colnm column is sens. Then, c+sens is the next value after c, and colnm > c is equivalent to the new condition colnm ≥ c+sens, which is translated into Hash(k,colnm+“RngIdx”) ≥ Index(c+sens,0). Note that Index(c+sens,0) is the minimum index of c+sens.

The keywords order by colnm and group by colnm are frequently used in queries. They are translated into order by Hash(k,colnm+“RngIdx”) and group by Hash(k,colnm+“EqIdx”), respectively.
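The translation rules for primitive conditions can be sketched as follows. This is a simplified illustration, not the proxy's actual code: `hash_hex` stands in for Hash(k, ·), `min_index` stands in for Index(c, 0) using a single hypothetical linear expression 7 ∗ c + 19, and the key, the quoting of identifiers and the use of `str(c)` to serialize constants are assumptions made here.

```python
import hashlib
import hmac


def hash_hex(key: bytes, text: str) -> str:
    """Keyed hash standing in for Hash(k, .) in the text (HMAC-SHA1, hex)."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest().upper()


def min_index(c, a=7, b=19):
    """Stand-in for Index(c, 0): one linear expression a*c + b, no noise."""
    return a * c + b


def translate_primitive(key: bytes, colnm: str, op: str, c, sens=1) -> str:
    """Translate colnm < c, colnm = c or colnm > c for the hashed columns."""
    rng_col = '"' + hash_hex(key, colnm + "RngIdx") + '"'
    eq_col = '"' + hash_hex(key, colnm + "EqIdx") + '"'
    if op == "<":
        return f"{rng_col} < {min_index(c)}"
    if op == "=":
        return f"{eq_col} = '{hash_hex(key, str(c))}'"
    if op == ">":
        # colnm > c is equivalent to colnm >= c + sens
        return f"{rng_col} >= {min_index(c + sens)}"
    raise ValueError(f"unsupported operator: {op}")


key = b"example-proxy-key"  # hypothetical key
cond = " and ".join([
    translate_primitive(key, "level", ">", 3),
    translate_primitive(key, "level", "<", 8),
])
print(cond)
```

Comparing against the minimum index is what makes the translation exact: every stored index of a value below c is smaller than Index(c, 0), and every stored index of c or a larger value is at least Index(c, 0).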
<indexing table="STAFF" col="ID" sens="1"> <skstep><a>2</a><b>11</b></skstep> <ifstep><gt>50</gt> <skstep><a>5</a><b>17</b></skstep> <skstep><a>3</a><b>13</b></skstep> </ifstep> <rstep><a>7</a><b>19</b></rstep> </indexing> 5. Implementation and Experiment We implemented a prototype of our indexing scheme for querying encrypted database. In the implementation, we simulate a database service by wrapping up the Apache Derby database management system with a SOAP-based web service interface, which is accessed by the proxy to query over the encrypted database. The query proxy is also implemented as a web service, accepting SQL queries from The query over the encrypted database is illustrated by the following example. Given a range query below, Figure 6 shows the query result returned by the database service and the decryption result generated by the proxy. select * from staff natural join project where "deadline">’2012/6/9’ and "deadline"<’2012/8/9’ 508 scheme to query encrypted databases by query translation. A prototype is implemented to demonstrate our system. References [1] D. Agrawal, A. E. Abbadi, F. Emekçi, and A. Metwally. Database management as a service: Challenges and opportunities. In Proceedings of the 25th International Conference on Data Engineering, pages 1709–1716, 2009. [2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order preserving encryption for numeric data. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, SIGMOD ’04, pages 563–574, 2004. [3] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill. Orderpreserving symmetric encryption. In Proceedings of the 28th Annual International Conference on Advances in Cryptology, EUROCRYPT ’09, pages 224–241, 2009. [4] Z. Brakerski and V. Vaikuntanathan. Fully homomorphic encryption from ring-lwe and security for key dependent messages. In Proceedings of the 31st annual conference on Advances in cryptology, CRYPTO’11, pages 505–524, 2011. [5] CircleID Reporter. 
Survey: Cloud computing ‘no hype’, but fear of security and control slowing adoption. http://www.circleid.com/posts/20090226_ cloud_computing_hype_security, Feb. 2009. [6] E. A. Fox, Q. F. Chen, A. M. Daoud, and L. S. Heath. Orderpreserving minimal perfect hash functions and information retrieval. ACM Trans. Inf. Syst., 9:281–308, July 1991. [7] A. Haeberlen. A case for the accountable cloud. SIGOPS Oper. Syst. Rev., 44:52–57, April 2010. [8] B. Hore, S. Mehrotra, and G. Tsudik. A privacy-preserving index for range queries. In Proceedings of the 30th international conference on Very large data bases, 2004. [9] A. C. König and G. Weikum. Combining histograms and parametric curve ﬁtting for feedback-driven query resultsize estimation. In Proceedings of the 25th International Conference on Very Large Data Bases, 1999. [10] F. D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 35th SIGMOD international conference on Management of data, SIGMOD ’09, pages 19–30, 2009. [11] D. Micciancio. A ﬁrst glimpse of cryptography’s holy grail. Commun. ACM, 53(3):96, 2010. [12] G. Ozsoyoglu, D. A. Singer, and S. S. Chung. Anti-tamper databases: Querying encrypted databases. In In Proc. of the 17th Annual IFIP WG 11.3 Working Conference on Database and Applications Security, pages 4–6, 2003. [13] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the 17th international conference on Theory and application of cryptographic techniques, pages 223–238, 1999. [14] R. A. Popa, C. M. S. Redﬁeld, N. Zeldovich, and H. Balakrishnan. CryptDB: protecting conﬁdentiality with encrypted query processing. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, 2011. [15] N. Santos, K. P. Gummadi, and R. Rodrigues. Towards trusted cloud computing. In Proceedings of the 2009 conference on Hot topics in cloud computing, 2009. [16] A. Shamir. 
How to share a secret. Commun. ACM, 22:612– 613, November 1979. Figure 6. A Query Result and its Decryption 6. Related Works The most related works include the order-preserving encryption scheme [2], the order-preserving polynomials [1] and the order-preserving indexing scheme [8]. In addition to the differences discussed before, the programmability of indexing expressions is a unique feature of our scheme and can improve the robustness of our scheme by indexing different input values with different indexing expressions. The work [12] uses strictly increasing functions to implement order-preserving encryption. Their functions can be higher order and can be sequentially composed. However, all input values are encrypted by the same functions. These functions do not add noises into the encryption result, and hence the secret coefﬁcients can be recovered when some pairs of plaintexts and ciphertexts are known by attackers. The order-preserving hash functions discussed in [6] map a set of input values into a set of hash values for fast information retrieval, with the hash values preserving the order of input values. These hash functions are not designed for protecting security. For example, there is no secret values (like encryption keys) that prevent the recovery of input values from hash values. The CryptDB [14] is a system supporting SQL queries over encrypted databases, where range queries rely on order-preserving encryption [3]. Our method can be incorporated into such systems to process range queries. 7. Conclusion In this paper, we proposed a method of generating orderpreserving indexes for facilitating range queries over encrypted databases. Our indexing is simple to use since it is based on linear expressions. The basic linear indexing expression is information-theoretically secure since each index is added with some random noise. We gave the way of controlling the amount of noises such that the randomized indexes are still order-preserving. 
Our scheme is programmable, meaning that the basic indexing expressions can be composed together to improve the robustness of the indexing programs and hide the distribution of input values from indexes. We introduced how to apply the indexing 509
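As an illustration of the condition translation in Section 4.2.3, the following Python sketch rewrites the three primitive conditions colnm < c, colnm = c and colnm > c into conditions over the hashed index columns. It is a minimal sketch under stated assumptions and not the prototype's code. The name hashing uses HMAC-SHA1 as in Section 5, but the key KEY, the coefficients A and B, and the function names are hypothetical. In particular, index() is only a stand-in for the Index function: it assumes a single linear expression a*v + b with noise drawn from [0, a*sens), so that index(c, 0) is the minimum index of c.

```python
import hashlib
import hmac
import random

KEY = b"demo-key"   # hypothetical secret key k held by the proxy
A, B = 2, 11        # hypothetical coefficients of one linear indexing step


def hash_name(key, text):
    """Hash(k, text): HMAC-SHA1 of a name or value, as upper-case hex."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest().upper()


def index(value, sens, a=A, b=B):
    """Assumed stand-in for Index(v, sens): a*v + b plus noise in [0, a*sens).

    With sens == 0 the noise is zero, giving the minimum index of value.
    """
    noise = random.random() * a * sens
    return a * value + b + noise


def translate_primitive(colnm, op, c, sens, key=KEY):
    """Translate one primitive condition on colnm into its encrypted form."""
    rng_col = hash_name(key, colnm + "RngIdx")
    eq_col = hash_name(key, colnm + "EqIdx")
    if op == "<":
        # colnm < c  becomes  RngIdx < minimum index of c
        return f"{rng_col} < {index(c, 0)}"
    if op == "=":
        # colnm = c  becomes  EqIdx = Hash(k, c)
        return f"{eq_col} = {hash_name(key, str(c))}"
    if op == ">":
        # colnm > c  is  colnm >= c + sens, compared with its minimum index
        return f"{rng_col} >= {index(c + sens, 0)}"
    raise ValueError(f"unsupported operator: {op}")


if __name__ == "__main__":
    for op in ("<", "=", ">"):
        print(translate_primitive("id", op, 50, sens=1))
```

Compound conditions would be handled by applying translate_primitive to each primitive expression and keeping the connectives and, or unchanged, as described in Section 4.2.3.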
