MTAT.07.006 Research Seminar in Cryptography

Building a secure aggregation database

Dan Bogdanov
University of Tartu, Institute of Computer Science
22.10.2006

1 Introduction

This paper starts by describing the privacy problem that arises when sensitive data is aggregated. Several general solutions are considered, and a secure aggregation scheme based on secret sharing and secure multi-party computation is introduced. We describe our current progress in building a database which provides data privacy but still allows various aggregation algorithms to run and produce correct results on the input data.

This paper is based on joint work with Sven Laur (Helsinki University of Technology) and Taneli Mielikäinen (HIIT Basic Research Unit).

2 Aggregation of sensitive data

2.1 Problem statement

Databases containing personal, medical or financial information about a person are usually classified as sensitive. Processing such data often requires special licenses from the respective authorities, as determined by law. This protection is a problem for research organisations, which cannot learn global properties or trends from collected data because they are forbidden from using it.

In this paper we address a simplified version of the problem. Assume that we have asked m people n sensitive questions. By collecting the answers we obtain an m × n matrix (denoted D) which represents our data. Our goal is to devise a method for calculating aggregate statistics from this matrix without compromising the privacy of any single person.

Let W_1, . . . , W_k be the participants who gather data and construct the matrix. Let M_1 be the data miner who is interested in global patterns. In the common scenario, all participants give their data to M_1, who constructs D and runs aggregation and data mining algorithms on it. The participants have to trust that M_1 will not use the data for selfish gains.
They have no way of ensuring the privacy of the people who have provided the data, because the miner requires control over the complete database to perform calculations.

2.2 General solutions

We want to keep any single party from having the complete database. This way the participants will not have to trust a single entity.

2.2.1 Distribution of rows

We can divide the m × n data matrix D into smaller matrices D_1, . . . , D_t so that each smaller matrix contains some rows from D. We can then distribute the matrices D_1, . . . , D_t to independent miners, who calculate results based on their part of the data and combine them with the results of the other miners.

Unfortunately this solution does not provide privacy of the data, because for each person who answered the questions some node will hold that person's complete data.

2.2.2 Distribution of columns

We can also divide the matrix D so that the matrices D'_1, . . . , D'_t each contain some columns from D. This allows us to keep the data identifying the person in a separate database from the sensitive data. Such a solution decreases the usability of the data, because each miner has access to only some of the attributes. For example, this could keep us from finding reliable aggregations or association rules based on all relevant attributes, because some of them might not be available to a given miner.

2.2.3 Distribution of values

It is important to notice that the two previous solutions are not secure. Our solution does not alter the dimensions of the matrix D. Instead we distribute each value of the matrix between three miners so that no single miner, and no pair of miners, can find the original value from their parts of it. Participants will need to trust the miners not to co-operate. This can be achieved if the miners are hosted by organisations that do not trust each other or that audit each other's work. Examples are competing companies and government organisations protecting people's privacy.
Secret sharing and multi-party computation are used in this scenario. Participants process their values and send each miner only its respective part of the input values. Miners can calculate aggregation results only if all three of them work together. The results are based on the complete data, but no miner has a complete row of the input data.

3 Prerequisites

3.1 Secure multi-party computation

3.1.1 Main idea

Assume that we have n participants p_1, . . . , p_n and each participant p_i knows an input value x_i [1]. Secure multi-party computation is the calculation of a function f(x_1, . . . , x_n) = (y_1, . . . , y_n) in such a way that the output is correct and the inputs of the nodes are kept private. Each node i will get the value of y_i and nothing else.

One of the methods of achieving secure multi-party computation is verifiable secret sharing. Consider the scenario in which a dealer shares a value s between n nodes. The dealer and the nodes may be malicious. If the dealer is honest, the nodes can gain no information about the value. Honest nodes can reconstruct s even in the presence of malicious nodes.

3.1.2 Modelling the adversary

We use malicious nodes to model the adversary. The adversary may corrupt any number of nodes before the protocol starts. The honest players do not know at first which nodes are corrupted and which are not.

In the case of a passive corruption the adversary can read all the data held, sent and received by the node. If the corruption is active, the adversary has complete control over the node.

If the adversary is static, the set of corrupted nodes remains the same for the whole duration of the protocol. The adversary may also be adaptive and corrupt new nodes during the execution of the protocol.

We must limit the adversary to keep secure protocols possible. If all nodes are corrupted, we have no hope of completing the calculation of f. Therefore we restrict the adversary to corrupting only a proper subset of all the nodes.
3.1.3 Models of communication

There are two main communication models used in secure multi-party computation: the cryptographic model and the information-theoretic model. In the cryptographic model the adversary is provided with read-only access to all the messages in the traffic between honest nodes. The adversary may not modify the messages. In the information-theoretic model all nodes have private channels between them.

The cryptographic model is secure only as long as the adversary cannot break the underlying cryptographic problem and read the messages. The information-theoretic model is stronger, as even a computationally unbounded adversary cannot read the messages exchanged between honest nodes.

Communication can be synchronous or asynchronous. In the synchronous mode nodes have synchronised clocks, which allows us to design protocols with rounds. Messages sent in each round will be delivered before the next round begins. An adaptive adversary may decide to corrupt further nodes, but only if the resulting set of corrupted nodes is in the adversary structure, that is, among the sets of nodes the adversary is allowed to corrupt.

The asynchronous model is more complex, as it has no guarantees on message delivery time. If we do not have guarantees for message delivery, we cannot demand that the protocol reaches a certain step at all.

3.1.4 Example

A classical problem in multi-party computation is the millionaire problem. Assume that Alice and Bob are millionaires who would like to know which one of them is richer without revealing their exact wealth.

We have two participants. Let Alice be p_1 and Bob be p_2. Let Alice's wealth be x_1 and Bob's wealth be x_2. The function we need to evaluate is "greater than", that is, we need to find out whether x_1 > x_2 without Alice learning x_2 or Bob learning x_1.

There are various solutions to this problem. The classic one was presented together with the introduction of the problem by Yao [2].

3.2 Secret Sharing

3.2.1 Introduction

Secret sharing is used to keep values such as cryptographic keys secure [3].
An algorithm is used to distribute the value between n nodes so that each node gets one share, and there is another algorithm which reconstructs the original value when given the shares of all nodes.

3.2.2 Threshold secret sharing scheme

Assume that we have an input value s from a finite set S that we wish to keep secret. Also consider that we have n nodes available for computation.

A threshold secret sharing scheme is a probabilistic algorithm 𝒮 which takes s as the input and outputs n bitstrings s_1, . . . , s_n. The values s_1, . . . , s_n are called shares. The secret sharing scheme has a threshold t ∈ N, 0 < t < n. The adversary may gain access to up to t shares without learning anything about the value s. If more than t shares are available, s can be calculated.

We define privacy and correctness as follows.

Privacy: Assume that we have run 𝒮 on an input value s ∈ S and created n shares s_1, . . . , s_n. The secret sharing scheme is secure if for each subset K ∈ P({1, . . . , n}) with |K| ≤ t the probability distribution of {s_k | k ∈ K} is independent of s.

Correctness: Assume that we have run 𝒮 on an input value s ∈ S and created n shares s_1, . . . , s_n. The secret sharing scheme is correct if for each subset L ∈ P({1, . . . , n}) with |L| ≥ t + 1 the value s is determined by the values {s_l | l ∈ L} and there is an efficient algorithm for calculating s based on these values.

3.2.3 Share conversion

Share conversion is a technique for converting shares of the same secret from one sharing scheme to a different one. This can be used to protect a system against malicious attacks by participants who do the share calculation. Other participants may convert the shares to another scheme and still retrieve the original value if necessary.

A share conversion solution for the Shamir scheme is described by Cramer, Damgård and Ishai [4].
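To make the privacy and correctness conditions of Section 3.2.2 concrete, the following minimal Python sketch checks them exhaustively for a toy instance of the polynomial scheme by Shamir [5] that Section 3.2.4 describes. The parameters (p = 5, n = 3, t = 1) and all function names are assumptions made here for illustration; they are not part of the system described in this paper.

```python
from itertools import combinations, product

P = 5  # small prime with P > N (toy value, assumed for illustration)
N = 3  # number of nodes
T = 1  # threshold: any T shares must reveal nothing about the secret


def make_shares(secret, coeffs):
    """Evaluate f(x) = secret + a_1*x + ... + a_T*x^T at x = 1..N, modulo P."""
    return [
        (secret + sum(a * pow(i, j + 1, P) for j, a in enumerate(coeffs))) % P
        for i in range(1, N + 1)
    ]


def reconstruct(points):
    """Lagrange interpolation at x = 0 from (node index, share) pairs."""
    secret = 0
    for i, s_i in points:
        num, den = 1, 1
        for j, _ in points:
            if j != i:
                num = num * (-j) % P
                den = den * (i - j) % P
        # den is non-zero modulo P; invert it with Fermat's little theorem
        secret = (secret + s_i * num * pow(den, P - 2, P)) % P
    return secret


def all_sharings(secret):
    """Every share vector the dealer could produce for a fixed secret."""
    return [make_shares(secret, c) for c in product(range(P), repeat=T)]


def check_correctness():
    """Every subset of T + 1 shares determines the secret."""
    for secret in range(P):
        for shares in all_sharings(secret):
            for subset in combinations(range(1, N + 1), T + 1):
                points = [(i, shares[i - 1]) for i in subset]
                if reconstruct(points) != secret:
                    return False
    return True


def check_privacy():
    """Every subset of T shares has the same distribution for all secrets."""
    for subset in combinations(range(N), T):
        views = [
            sorted(tuple(sh[i] for i in subset) for sh in all_sharings(s))
            for s in range(P)
        ]
        if any(view != views[0] for view in views):
            return False
    return True
```

Both `check_correctness()` and `check_privacy()` return `True` for these parameters. The exhaustive enumeration is only feasible because the field is tiny; it is meant as a way to read the two definitions, not as a test of a real deployment.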
3.2.4 Example

The following is a classic scheme by Shamir [5]. Assume that we have an input value s, n nodes and a desired threshold value t. We pick a prime p so that p > n and s is an element of Z_p. The algorithm 𝒮 consists of the following steps:

1. Choose a random polynomial f(x) over Z_p with degree at most t so that f(0) = s. A suitable polynomial is f(x) = s + a_1 x + a_2 x^2 + . . . + a_t x^t, where a_1, . . . , a_t are randomly selected elements of Z_p.

2. Distribute the values s_i = f(i) mod p (i = 1 . . . n) as shares to nodes 1 . . . n.

Each of the shares can be considered as a point on the curve determined by the polynomial f(x). If we have t + 1 or more points of the curve, we can rebuild the polynomial by using Lagrange interpolation and recover s = f(0). If we have t or fewer shares available, the points do not determine a polynomial of degree at most t uniquely, so we cannot restore the original secret.

4 A system for secure data aggregation

4.1 General model

Let M_1, M_2, M_3 be the miners. Let W_1, . . . , W_k be the participants. The participants collect data, send it to the miners and order the miners to perform aggregations on the data. The miners are responsible for storing the received data and calculating aggregations.

Data is shared between the three miners. Distributing data into shares is handled by the participants. The miners receive shares and store them in their databases. If we could combine the databases of the miners, we would get the original database.

If the participants are honest, a miner has no way of retrieving the original value without co-operating with the other two miners. If the miners are honest, the participants have no way of seeing other participants' data. We can also prevent malicious input by dishonest participants by using share conversion.

To run aggregations on the data we implement the necessary share computation operations as three-way protocols between the miners. Aggregation is done by combining these operations.
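As a hedged illustration of how an aggregation can be built from operations on shares, the sketch below computes a column sum. It assumes the additive sharing modulo 2^32 that Sections 4.2 and 4.3 specify; the sum query itself, the sample data and the function names are hypothetical examples and are not among the implemented operations.

```python
import secrets

MOD = 2**32  # data elements are members of Z_{2^32}


def share_value(value):
    """Split one value into three additive shares modulo 2^32."""
    r1 = secrets.randbelow(MOD)
    r2 = secrets.randbelow(MOD)
    return r1, r2, (value - r1 - r2) % MOD


def share_matrix(rows):
    """Return the three miner databases holding one share of every entry."""
    databases = ([], [], [])
    for row in rows:
        shared = [share_value(v) for v in row]
        for i in range(3):
            databases[i].append([s[i] for s in shared])
    return databases


def local_column_sum(database, col):
    """A miner adds up its own shares of one column; no communication needed."""
    return sum(row[col] for row in database) % MOD


def column_sum(databases, col):
    """Combine the three partial sums into the sum of the original column."""
    return sum(local_column_sum(db, col) for db in databases) % MOD


# Hypothetical answers of three people to three questions
data = [[1, 0, 30], [0, 1, 45], [1, 1, 27]]
total = column_sum(share_matrix(data), 2)  # 102
```

Addition works locally because the sharing is linear: the sum of the shares of several values is a sharing of their sum. Only the three partial sums are combined, so no individual entry is ever reconstructed.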
We have implemented two operations: adding a row to the miner database and multiplying values in the database.

4.2 Data storage

The participants have input data where each record is in the form of a vector with m elements. Each miner M_i has a database D_{M_i} which is a matrix of size m × n. Data elements in the system are members of Z_{2^32}.

4.3 Adding a row to the miner database

Assume that we have a random number generator RNG. The participant W_t has a record with new values it wants to add to the miners' databases. The record is represented by a vector R (|R| = m). The participant creates three new vectors R_1, R_2 and R_3 and calculates their values as follows: for all i = 1 . . . m, R_1[i] ← RNG, R_2[i] ← RNG and R_3[i] ← R[i] − R_1[i] − R_2[i], with all arithmetic in Z_{2^32}. W_t sends each vector R_i to the respective node M_i (i = 1 . . . 3). The miner node M_i adds the values of R_i to its database D_{M_i} as a new row.

All three vectors are required to restore the original value, because each individual miner sees only a value that is random to it. Two values are also not enough, because one of them is random and hence their sum is also randomly distributed.

4.4 Share multiplication

Let there be miners M_1, M_2, M_3 and their databases D_{M_1}, D_{M_2}, D_{M_3}. We want to find the product of two values in the database and store it in the same database. Note that the input values are distributed into three shares and we need the result in shares as well.

Assume that the values we want to multiply are in the k-th row and l-th column of the database matrix. The values are x = x_A + x_B + x_C and y = y_A + y_B + y_C, where x_i is the element from the k-th row and l-th column of the matrix D_{M_i}.

To calculate x · y we use the following protocol. We name the nodes M_1, M_2 and M_3 Alice, Bob and Charlie respectively.
Round 1: Sharing randomness

• Alice generates r_12, r_13, s_12, s_13 ← RNG
• Bob generates r_23, r_21, s_23, s_21 ← RNG
• Charlie generates r_31, r_32, s_31, s_32 ← RNG
• All values r_ij and s_ij are sent over a secure channel from M_i to M_j

Round 2: Sharing shares

• Alice computes â_12 = x_A + r_31, b̂_12 = y_A + s_31, â_13 = x_A + r_21, b̂_13 = y_A + s_21
• Bob computes â_23 = x_B + r_12, b̂_23 = y_B + s_12, â_21 = x_B + r_32, b̂_21 = y_B + s_32
• Charlie computes â_31 = x_C + r_23, b̂_31 = y_C + s_23, â_32 = x_C + r_13, b̂_32 = y_C + s_13
• All values â_ij and b̂_ij are sent over a secure channel from M_i to M_j

Round 3: Local computations

• Alice computes c_A = x_A b̂_21 + x_A b̂_31 + y_A â_21 + y_A â_31 − â_12 b̂_21 − b̂_12 â_21 + r_12 s_13 + s_12 r_13
• Bob computes c_B = x_B b̂_32 + x_B b̂_12 + y_B â_32 + y_B â_12 − â_23 b̂_32 − b̂_23 â_32 + r_23 s_21 + s_23 r_21
• Charlie computes c_C = x_C b̂_13 + x_C b̂_23 + y_C â_13 + y_C â_23 − â_31 b̂_13 − b̂_31 â_13 + r_31 s_32 + s_31 r_32

After running this protocol, the miners have calculated the product

xy = (c_A + x_A y_A) + (c_B + x_B y_B) + (c_C + x_C y_C).    (1)

The correctness of the protocol can be shown by expanding both sides of equation (1) and showing that they are equal: the masking terms cancel in pairs, and c_A + c_B + c_C is left with exactly the six cross products x_i y_j, i ≠ j. The protocol is secure, since in every round all miners see only values with a random distribution. The computation requires 24 messages altogether: each miner sends and receives 8 messages.

5 Software specification

5.1 Overview

Nodes run two kinds of software. Mining nodes run the miner software and participants run the controller software. The miner software consists of algorithms and protocols which perform the multi-party computation and data mining tasks. The controller software is used to send data and commands to the miner software.

We consider two separate implementations of the system. The first is a framework for algorithm testing and development and the second is a prototype of a database engine. The systems differ in purpose, security features and performance.
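The three rounds of the share multiplication protocol from Section 4.4 can be simulated in a single process as follows. This is a sketch under stated assumptions and not the miner software itself: the secure channels are replaced by dictionaries keyed by (sender, receiver), the miners Alice, Bob and Charlie are numbered 0, 1 and 2, Python's `secrets` module stands in for RNG, and all function names are hypothetical. Bob and Charlie use their own shares y_B and y_C in the fourth term of Round 3, which is what equation (1) requires.

```python
import secrets

MOD = 2**32  # data elements are members of Z_{2^32}
PAIRS = [(i, j) for i in range(3) for j in range(3) if i != j]


def rng():
    return secrets.randbelow(MOD)


def share(value):
    """Additive sharing as in Section 4.3."""
    r1, r2 = rng(), rng()
    return [r1, r2, (value - r1 - r2) % MOD]


def reconstruct(shares):
    return sum(shares) % MOD


def multiply_shares(x, y):
    """Take three shares of x and of y, return three shares of x*y mod 2^32."""
    # Round 1: miner i draws r[i, j] and s[i, j] and sends them to miner j.
    r = {pair: rng() for pair in PAIRS}
    s = {pair: rng() for pair in PAIRS}

    # Round 2: miner i masks its shares with the randomness it received
    # from the third miner k and sends the masked values to miner j.
    a_hat, b_hat = {}, {}
    for i, j in PAIRS:
        k = 3 - i - j
        a_hat[i, j] = (x[i] + r[k, i]) % MOD
        b_hat[i, j] = (y[i] + s[k, i]) % MOD

    # Round 3: local computation of c_i, followed by adding x_i * y_i.
    result = []
    for i in range(3):
        n, p = (i + 1) % 3, (i + 2) % 3  # next and previous miner
        c = (x[i] * (b_hat[n, i] + b_hat[p, i])
             + y[i] * (a_hat[n, i] + a_hat[p, i])
             - a_hat[i, n] * b_hat[n, i]
             - b_hat[i, n] * a_hat[n, i]
             + r[i, n] * s[i, p]
             + s[i, n] * r[i, p])
        result.append((c + x[i] * y[i]) % MOD)
    return result


product_shares = multiply_shares(share(6), share(7))
product = reconstruct(product_shares)  # 42
```

The simulation exchanges 12 values in Round 1 and 12 in Round 2, matching the 24 messages counted above. Because one process holds all values, it demonstrates correctness only; it says nothing about what a real miner can observe.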
5.2 Development framework

5.2.1 Overview

The development framework provides the user with tools for implementing multi-party computation solutions in our model. It is essentially a distributed virtual processor which has a memory and an instruction scheduler. This processor is wrapped into a function library, which can be used by client programs.

The system is self-organising. The miners are generic processes which are configured by the client program during system initialisation. This kind of self-organising network makes development easier, because running the system requires less manual configuration.

The miners can send any data back to the client program. This includes the contents of the database. This helps the programmer develop and debug algorithms. Automated tests are possible, because the client can ask the miners for intermediate results.

The software is implemented in the C++ programming language. It makes use of the RakNet [6] network library for communication. The system is designed to be cross-platform. Development is done on Apple Mac OS X and Microsoft Windows. Linux/UNIX versions are planned.

5.2.2 Communication

The communication between nodes is message-based. During system setup the client application locates all three miners and assigns node numbers 1, 2 and 3 to them. After that the miners connect to each other. At the end of start-up all nodes have communication channels to the other nodes in the system.

The miners in the development framework are not designed to support multiple client applications at the same time.

5.2.3 Storage

The client programs have access to three kinds of storage. The first is the database D_i, which is used for persistent storage. For runtime use the virtual processor provides a stack S and a heap H. The stack and heap are temporary: their contents are forgotten if the miners restart.

The stack is used for passing parameters to instructions.
For example, the share multiplication operation could pop two values from the stack, multiply them, and push the result back on top of the stack. The stack provides standard methods for pushing, peeking and popping. It also provides random access to the stack so that vector operations can find the start of their parameter list.

The heap is used to store intermediate results in algorithms. The heap is implemented as an index-addressed array of elements.

Protocols also have internal storage for intermediate results.

5.2.4 Instruction scheduler

The miners recognise a number of pre-defined commands. The commands are divided into the following categories:

1. system operations - managing the database, modifying miner configuration
2. data transfer - exchanging data between the client program and the miners
3. computation - performing calculations with the distributed database

The scheduler processes one instruction at a time. The order is determined by the arrival of the messages containing the instruction codes.

Parameter passing is handled by a simple and generic scheme. Parameters are not sent together with the instruction code but rather as data values from the client node to the miner node. The client sends the instruction code in one message and the parameters in others. The miner, when processing the instruction, first waits for the parameters to arrive.

5.3 Prototype database

In future work we will investigate the feasibility of building an actual prototype database platform which could be used to process sensitive data. Such a system will have much stricter security requirements than the development framework.

To avoid collusion by the miners, they will have to be controlled, configured and hosted by different organisations. This means that the miners will be a lot more independent and clients will have almost no control over their operation.

In a production environment the miners cannot give out any information which could compromise the sensitive data.
This means that under no circumstances can miners send raw shares to the clients. The miners will also need some logic to determine whether the results of an aggregation query reveal too much about the source data. If they do, the miners will refuse to process the query.

The miners will have to work as a standard database server. This means supporting multiple simultaneous clients. The miners' protocols will be more specific and optimised to perform the queries as fast as possible. Advanced scheduling and caching techniques have to be considered.

References

[1] R. Cramer, I. Damgård. Multiparty Computation, an Introduction. Course notes, 2002.
[2] A. C. Yao. Protocols for Secure Computations (extended abstract). Proceedings of the 21st Annual IEEE Symposium on the Foundations of Computer Science, pp. 160-164, 1982.
[3] I. Damgård. Secret sharing. Course notes, 2002.
[4] R. Cramer, I. Damgård, Y. Ishai. Share Conversion, Pseudorandom Secret-Sharing and Applications to Secure Computation. Lecture Notes in Computer Science, Volume 3378/2005, pp. 342-362, 2005.
[5] A. Shamir. How to share a secret. Communications of the ACM, Volume 22, Issue 11, pp. 612-613, 1979.
[6] RakNet - a reliable cross-platform UDP network library. Website: http://www.rakkarsoft.com/.
