# MTAT.07.006 Research Seminar in Cryptography: Building a secure aggregation database

```
MTAT.07.006 Research Seminar in Cryptography
Building a secure aggregation database
Dan Bogdanov
University of Tartu, Institute of Computer Science
22.10.2006
1 Introduction
This paper starts by describing the privacy problem regarding the aggregation of
sensitive data. Several general solutions are considered and a secure aggregation
scheme based on secret sharing and secure multi-party computation is introduced.
We describe our current progress in building a database which provides data privacy, but allows various aggregation algorithms to run and provide correct results
on the input data.
This paper is based on joint work with Sven Laur1 and Taneli Mielikäinen2.
2 Aggregation of sensitive data
2.1 Problem statement
Databases containing personal, medical or financial information about a person
are usually classified as sensitive. Processing such data often requires special licenses from the respective authorities, as determined by law. This protection is
a problem for research organisations, which cannot learn global properties or trends
from collected data because they are forbidden from using it.
In this paper we address a simplified version of the problem. Assume that we
have asked m people n sensitive questions. By collecting the answers we obtain
an m × n matrix (denoted D) which represents our data. Our goal is to devise a
method for calculating aggregate statistics from this matrix without compromising
the privacy of any single person.
1 Helsinki University of Technology
2 HIIT Basic Research Unit
Let W1, ..., Wk be the participants who gather data and construct the matrix.
Let M1 be the data miner who is interested in global patterns. In the common
scenario, all participants give their data to M1, who constructs D and runs aggregation and data mining algorithms on it. The participants have to trust that M1
will not use the data for selfish gains. They have no way of ensuring the privacy
of the people who have provided the data, because the miner requires control over the
complete database to perform calculations.
2.2 General solutions
We want to keep a specific party from having the complete database. This way
the participants won’t have to trust a single entity.
2.2.1 Distribution of rows
We can divide the m × n data matrix D into smaller matrices D1, ..., Dt so that
each smaller matrix contains some rows from D. We can then distribute matrices
D1, ..., Dt to independent miners who calculate results based on their part of the
data and combine them with the results of other miners. Unfortunately, this solution
does not protect the privacy of the data: for each person who answered the questions,
some node will have complete data about that person.
2.2.2 Distribution of columns
We can also divide the matrix D so that matrices D'1, ..., D't contain columns
from D. This allows us to keep the data identifying the person in a separate
database from the sensitive data. Such a solution decreases the usability of the
data, because each miner has access to only some of the attributes. For example, this could
keep us from finding reliable aggregations or association rules based on all relevant
attributes, because some of them might not be available to a given miner.
2.2.3 Distribution of values
Note that neither of the two previous solutions is secure.
Our solution does not alter the dimensions of matrix D. Instead, we distribute
the values of the matrix between three miners so that no single miner, and no pair of
miners, can find the original value given their parts of this value. Participants will
need to trust the miners not to co-operate. This can be achieved if the miners
are hosted by organisations that do not trust each other or that audit each other’s work.
Examples are competing companies and government organisations protecting people’s privacy.
Secret sharing and multi-party computation are used in this scenario. Participants will process the input values and send each miner only its respective part of each
value. Miners can calculate aggregation results if all three of them work together.
The results are based on the complete data, but no miner has a complete row of
the input data.
3 Prerequisites
3.1 Secure multi-party computation
3.1.1 Main idea
Assume that we have n participants p1, ..., pn and each participant i knows an
input value xi [1]. Secure multi-party computation is the calculation of a function
f(x1, ..., xn) = (y1, ..., yn) in such a way that the output is correct and the inputs
of the nodes are kept private. Each node i will get the value of yi and nothing else.
One of the methods of achieving secure multi-party computation is verifiable
secret sharing. Consider the scenario in which a dealer shares a value s between
n nodes. The dealer and the nodes may be malicious. If the dealer is honest, the
nodes can gain no information about the value. Honest nodes can reconstruct s
even in the presence of malicious nodes.
3.1.2
We use malicious nodes to model the adversary. The adversary may corrupt any
number of nodes before the protocol starts. At first, the honest players do not
know which nodes are corrupted and which are not.
In the case of a passive corruption the adversary can read all the data held,
sent and received by the node. If the corruption is active, the adversary has
complete control over the node. If the adversary is static, the set of corrupted
nodes remains the same for the whole duration of the protocol. The adversary
may also be adaptive and corrupt new nodes during the execution of the protocol.
We must limit the adversary to keep secure protocols possible. If all nodes are
corrupted, we have no hope of completing the calculation of f. Therefore we restrict
the adversary to corrupting only a proper subset of all the nodes.
3.1.3 Models of communication
There are two main communication models used in secure multi-party computation: the cryptographic model and the information-theoretic model. In
the cryptographic model the adversary has access to all
messages from the traffic between honest nodes. The adversary may not modify
the messages. In the information-theoretic model all nodes have private channels
between them.
The cryptographic model is secure if the adversary cannot break the underlying cryptographic problem and read the contents of the messages. The information-theoretic model is
stronger, as even a computationally unbounded adversary cannot read the messages exchanged between honest nodes.
Communication can be synchronous or asynchronous. In the synchronous mode
nodes have synchronised clocks, which allow us to design protocols with rounds.
Messages sent in each round will be delivered before the next round begins. An [...]
only if they are in the respective adversary structure.
The asynchronous model is more complex as it has no guarantees on message
delivery time. If we don’t have guarantees for message delivery, we can’t demand
that the protocol reaches a certain step at all.
3.1.4 Example
A classical problem in multi-party computation is the millionaire problem. Assume
that Alice and Bob are millionaires who would like to know which one of them is
richer without revealing their exact wealth.
We have two participants. Let Alice be p1 and Bob be p2. Let Alice’s wealth
be x1 and Bob’s wealth be x2. The function we need to evaluate is “greater than”,
that is, we need to find out whether x1 > x2 without Alice learning x2 or Bob learning
x1.
There are various solutions to this problem. The classic one was presented
together with the introduction of the problem by Yao [2].
3.2 Secret Sharing
3.2.1 Introduction
Secret sharing is used to keep values such as cryptographic keys secure [3]. An
algorithm is used to distribute the value between n nodes so that each node gets
one share, and there is another algorithm which reconstructs the original value
when given the shares of all nodes.
3.2.2 Threshold secret sharing scheme
Assume that we have an input value s from a finite set S that we wish to keep
secret. Also assume that we have n nodes available for computation.
A threshold secret sharing scheme is a probabilistic algorithm S which takes s
as the input and outputs n bitstrings s1, ..., sn. Values s1, ..., sn are called shares.
The secret sharing scheme has a threshold t ∈ N, 0 < t < n. The adversary may
learn up to t shares without gaining any information about s. If more
than t shares are available, s can be calculated.
We define privacy and correctness as follows.
Privacy: Assume that we have run S on an input value s ∈ S and created
n shares s1, ..., sn. The secret sharing scheme is private if for each subset K ∈
P({1, ..., n}), |K| ≤ t, the probability distribution of {sk | k ∈ K} is independent
of s.
Correctness: Assume that we have run S on an input value s ∈ S and
created n shares s1, ..., sn. The secret sharing scheme is correct if for each subset
L ∈ P({1, ..., n}), |L| ≥ t + 1, the value s is determined by the values {sl | l ∈ L} and
there is an efficient algorithm for calculating s based on these values.
3.2.3 Share conversion
Share conversion is a technique for converting shares of the same secret from one
sharing scheme to a different one. This can be used to protect a system against
malicious attacks by participants who do share calculation. Other participants
may convert the shares to another scheme and still retrieve the original value if
necessary.
A share conversion solution for the Shamir scheme is described by Cramer,
Damgård and Ishai [4].
3.2.4 Example
A classic implementation by Shamir [5] follows.
Assume that we have an input value s, n nodes and we want a threshold value
t. We pick a prime p so that p > n and s ∈ Zp. The algorithm S consists of the following
steps:
1. Choose a random polynomial f(x) over Zp with a degree at most t so that
f(0) = s. A suitable polynomial is f(x) = s + a1 x + a2 x^2 + ... + at x^t, where
a1, ..., at are randomly selected elements of Zp.
2. Distribute the values si = f(i) mod p (i = 1, ..., n) as shares to nodes 1, ..., n.
Each of the shares can be considered as a point on the curve determined by
the polynomial f(x). If we have t + 1 or more points of the curve, we can rebuild
the polynomial by using Lagrange interpolation and recover s = f(0). If we have t or fewer shares
available, they do not determine a polynomial of degree at most t: every candidate
value of f(0) is consistent with them, so we can’t restore the
original secret.
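As a small worked illustration (the numbers are chosen here for the example and do
not come from the original work), take p = 7, n = 3, t = 1 and s = 3, and suppose the
random coefficient is a1 = 2, so that f(x) = 3 + 2x over Z7. The shares are
s1 = f(1) = 5, s2 = f(2) = 7 mod 7 = 0, s3 = f(3) = 9 mod 7 = 2.
Any t + 1 = 2 shares determine the line f. For instance, from the points (1, 5) and
(3, 2), Lagrange interpolation at x = 0 gives
s = 5 · (0 − 3)/(1 − 3) + 2 · (0 − 1)/(3 − 1) = 5 · 5 + 2 · 3 = 31 = 3 (mod 7),
using 2^(−1) = 4 in Z7, so that 3/2 = 5 and −1/2 = 3. A single share reveals nothing:
s1 = 5 is consistent with every candidate secret s', because choosing a1 = 5 − s' mod 7
gives a valid polynomial with f(0) = s' and f(1) = 5.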
4 A system for secure data aggregation
4.1 General model
Let M1, M2, M3 be the miners. Let W1, ..., Wk be the participants. The participants collect data, send it to the miners and order the miners to perform aggregations on the data. The miners are responsible for storing the received data and
calculating aggregations.
Data is shared between the three miners. Distributing data into shares is
handled by the participants. The miners receive shares and store them in their
database. If we could combine the databases of the miners, we would get the
original database.
If the participants are honest, the miners have no way of retrieving the original
value without co-operating with the other two miners. If the miners are honest, the
participants have no way of seeing other participants’ data. We can also prevent
malicious input by dishonest participants by using share conversion.
To run aggregations on the data we implement the necessary share computation
operations as three-way protocols between the miners. Aggregation is done by
combining these operations. We have implemented two operations - adding a row
to the miner database and multiplying values in the database.
4.2 Data storage
The participants have input data where each record is in the form of a vector with
n elements. Each miner Mi has a database DMi which is a matrix of size m × n.
Data elements in the system are members of Z_{2^32}, so all arithmetic on them is done modulo 2^32.
4.3 Adding a row to the miner database
Assume that we have a random number generator RNG. The participant Wt has
a record with new values it wants to add to the miners’ database. It is represented
by a vector R (|R| = n). It creates three new vectors R1, R2 and R3 and calculates
their values as follows:
∀i = 1, ..., n: R1[i] ← RNG, R2[i] ← RNG and R3[i] = R[i] − R1[i] − R2[i].
Wt sends each vector Ri to the respective node Mi (i = 1, ..., 3). The miner
node Mi adds the values of Ri to its database DMi as a new row.
All three vectors are required to restore the original value, because an individual
miner sees only a value that is random to it. Two values are also not enough,
because at least one of them is random and hence their sum is also uniformly
random and reveals nothing about R[i].
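As a small worked illustration (the numbers are chosen here for the example and do
not come from the original work), take data elements modulo 2^32 = 4294967296 and
suppose R[i] = 5 and the generator returns R1[i] = 4000000000 and R2[i] = 1000000000.
Then
R3[i] = 5 − 4000000000 − 1000000000 mod 2^32 = 3589934597,
and R1[i] + R2[i] + R3[i] = 8589934597 = 2 · 2^32 + 5, which is 5 modulo 2^32.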
4.4 Share multiplication
Let there be miners M1, M2, M3 and their databases DM1, DM2, DM3. We want to
find the product of two values in the database and store it in the same database.
Note that the input values are distributed into three shares and we need the result
also in shares.
Assume that the values we want to multiply are in the k-th row and l-th column
of the database matrix. The values are x = xA + xB + xC and y = yA + yB + yC,
where xi is the element from the k-th row and l-th column of the matrix DMi.
To calculate x · y we use the following protocol. We name the nodes M1, M2
and M3 Alice, Bob and Charlie respectively.
Round 1: Sharing randomness
• Alice generates r12, r13, s12, s13 ← RNG
• Bob generates r23, r21, s23, s21 ← RNG
• Charlie generates r31, r32, s31, s32 ← RNG
• All values ∗ij are sent over a secure channel from Mi to Mj
Round 2: Sharing shares
• Alice computes â12 = xA + r31, b̂12 = yA + s31, â13 = xA + r21, b̂13 = yA + s21
• Bob computes â23 = xB + r12, b̂23 = yB + s12, â21 = xB + r32, b̂21 = yB + s32
• Charlie computes â31 = xC + r23, b̂31 = yC + s23, â32 = xC + r13, b̂32 = yC + s13
• All values ∗ij are sent over a secure channel from Mi to Mj
Round 3: Local computations
• Alice computes cA = xA b̂21 + xA b̂31 + yA â21 + yA â31 − â12 b̂21 − b̂12 â21 + r12 s13 + s12 r13
• Bob computes cB = xB b̂32 + xB b̂12 + yB â32 + yB â12 − â23 b̂32 − b̂23 â32 + r23 s21 + s23 r21
• Charlie computes cC = xC b̂13 + xC b̂23 + yC â13 + yC â23 − â31 b̂13 − b̂31 â13 + r31 s32 + s31 r32
After running this protocol, the miners have calculated the product
xy = (cA + xA yA) + (cB + xB yB) + (cC + xC yC).    (1)
The correctness of the protocol can be shown by expanding both sides of equation (1) and showing that they are equal. The protocol is secure, since in every
round each miner sees only values that are masked by random values and therefore have a random distribution. The computation requires
24 messages altogether: each miner sends and receives 8 messages.
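To spell out this expansion (a sketch; it assumes that the fourth terms of cB and cC
are yB â12 and yC â23, the form obtained by rotating Alice’s expression), substitute
the Round 2 definitions into the Round 3 expressions:
cA = xA yC + yA xC + xA s23 + yA r23 − xB s31 − yB r31 − r31 s32 − s31 r32 + r12 s13 + s12 r13
cB = xB yA + yB xA + xB s31 + yB r31 − xC s12 − yC r12 − r12 s13 − s12 r13 + r23 s21 + s23 r21
cC = xC yB + yC xB + xC s12 + yC r12 − xA s23 − yA r23 − r23 s21 − s23 r21 + r31 s32 + s31 r32
In the sum cA + cB + cC every term that contains a random value appears once with
each sign and cancels, which leaves the six cross terms xi yj (i ≠ j). Adding
xA yA + xB yB + xC yC then gives (xA + xB + xC)(yA + yB + yC) = xy.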
5 Software specification
5.1 Overview
Nodes run two kinds of software. Mining nodes run the miner software and participants run the controller software. The miner software consists of algorithms
and protocols which perform the multi-party computation and data mining tasks.
The controller software is used to send data and commands to the miner software.
We consider two separate implementations of the system. The first is a framework for algorithm testing and development and the second is a prototype of a
database engine. The systems differ in purpose, security features and performance.
5.2 Development framework
5.2.1 Overview
The development framework provides the user with tools for implementing multi-party computation solutions in our model. It is essentially a distributed virtual processor which has a memory and an instruction scheduler. This processor is wrapped
into a function library which can be used by client programs.
The system is self-organising. The miners are generic processes which are
configured by the client program during system initialisation. This kind of self-organising network makes development easier, because running the system requires
less manual configuration.
The miners can send any data back to the client program. This includes the
contents of the database. This helps the programmer develop and debug algorithms. Automated tests are possible, because the client can ask the miners for
intermediate results.
The software is implemented in the C++ programming language. It makes use
of the RakNet [6] network library for communication. The system is designed to be
cross-platform. Development is done on Apple Mac OS X and Microsoft Windows.
Linux/UNIX versions are planned.
5.2.2 Communication
The communication between nodes is message-based. During system setup the
client application locates all three miners and assigns node numbers 1, 2 and 3
to them. After that the miners connect to each other. At the end of start-up all
nodes have communication channels to other nodes in the system.
The miners in the development framework are not designed to support multiple
client applications at the same time.
5.2.3 Storage
The client programs have access to three kinds of storage. The first is the database
Di which is used for persistent storage. For runtime use the virtual processor
provides a stack S and a heap H. The stack and heap are temporary - their
contents are forgotten if the miners restart.
The stack is used for passing parameters to instructions. For example, the share
multiplication operation could pop two values from the stack, multiply them, and
push the result back on the stack top. The stack provides standard methods for
pushing, peeking and popping. It also provides random access to the stack so
that vector operations can find the start of their parameter list. The heap is
used to store intermediate results in algorithms. The heap is implemented as an [...]
Protocols also have internal storage for intermediate results.
5.2.4 Instruction scheduler
The miners recognise a number of pre-defined commands. The commands are
divided into the following categories:
1. system operations - managing the database, modifying miner configuration
2. data transfer - exchanging data between the client program and the miners
3. computation - performing calculations with the distributed database
The scheduler processes one instruction at a time. The order is determined
by the arrival of messages containing the instruction code. Parameter passing is
handled by a simple and generic scheme. Parameters are not sent together with
the instruction code but rather as data values from client node to miner node.
The client sends the instruction code in one message and the parameters in the
others. The miner, when processing the instruction, first waits for the parameters
to arrive.
5.3 Prototype database
In future work we will investigate the feasibility of building an actual prototype
database platform which could be used to process sensitive data. Such a system
will have much stricter security requirements than the development framework.
To avoid collusion by the miners, they will have to be controlled, configured
and hosted by different organisations. This means that the miners will be a lot
more independent and clients will have almost no control over their operation.
In a production environment the miners cannot give out any information which
could compromise the sensitive data. This means that under no circumstances can
miners send raw shares to the clients. The miners will also need some logic which
determines whether the results of an aggregation query reveal too much
about the source data. If they do, the miners will refuse to process the
query.
The miners will have to work as a standard database server. This means
supporting multiple simultaneous clients. The miners’ protocols
will be more specific and optimised to perform the queries as fast as possible.
Advanced scheduling and caching techniques have to be considered.
References
[1] R. Cramer, I. Damgård. Multiparty Computation, an Introduction.
Course notes, 2002.
[2] A. C. Yao. Protocols for Secure Computations (extended abstract).
Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, pp. 160-164, 1982.
[3] I. Damgård. Secret sharing. Course notes, 2002.
[4] R. Cramer, I. Damgård, Y. Ishai. Share Conversion, Pseudorandom
Secret-Sharing and Applications to Secure Computation. Lecture
Notes in Computer Science, Volume 3378/2005, pp. 342-362, 2005.
[5] A. Shamir. How to share a secret. Communications of the ACM,
Volume 22, Issue 11, pp. 612-613, 1979.
[6] RakNet - a reliable cross-platform UDP network library. Website:
http://www.rakkarsoft.com/.
```
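
The row-adding step of Section 4.3 and the three-round multiplication protocol of Section 4.4 can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not the authors' C++ implementation: the function names `rng`, `share`, `reconstruct` and `multiply` are hypothetical, Python's `secrets` module stands in for the paper's RNG, message passing is replaced by shared dictionaries, and the fourth terms of cB and cC are taken to be yB â12 and yC â23 (the form obtained by rotating Alice's expression).

```python
import secrets

MOD = 2 ** 32  # data elements are members of Z_{2^32}


def rng():
    """Uniformly random element of Z_{2^32} (stand-in for the paper's RNG)."""
    return secrets.randbelow(MOD)


def share(value):
    """Split a value into three additive shares, as a participant does in 4.3."""
    s1, s2 = rng(), rng()
    return s1, s2, (value - s1 - s2) % MOD


def reconstruct(shares):
    """Recombine additive shares; all three are needed."""
    return sum(shares) % MOD


def multiply(x_shares, y_shares):
    """Simulate the protocol of 4.4; miners 1, 2, 3 are Alice, Bob, Charlie.

    Returns the three miners' shares c_i + x_i * y_i of the product x * y.
    """
    x = dict(zip((1, 2, 3), x_shares))
    y = dict(zip((1, 2, 3), y_shares))
    pairs = [(i, j) for i in (1, 2, 3) for j in (1, 2, 3) if i != j]

    # Round 1: miner i generates r[i, j] and s[i, j] and sends them to miner j.
    r = {pair: rng() for pair in pairs}
    s = {pair: rng() for pair in pairs}

    # Round 2: miner i masks its shares for miner j with the randomness it
    # received from the third miner k, e.g. a_hat[1, 2] = xA + r31.
    a_hat, b_hat = {}, {}
    for i, j in pairs:
        k = 6 - i - j
        a_hat[i, j] = (x[i] + r[k, i]) % MOD
        b_hat[i, j] = (y[i] + s[k, i]) % MOD

    # Round 3: each miner combines only values it generated or received.
    result = []
    for i in (1, 2, 3):
        n = i % 3 + 1        # next miner in the cycle 1 -> 2 -> 3 -> 1
        p = (i + 1) % 3 + 1  # the remaining miner
        c = (x[i] * b_hat[n, i] + x[i] * b_hat[p, i]
             + y[i] * a_hat[n, i] + y[i] * a_hat[p, i]
             - a_hat[i, n] * b_hat[n, i] - b_hat[i, n] * a_hat[n, i]
             + r[i, n] * s[i, p] + s[i, n] * r[i, p])
        result.append((c + x[i] * y[i]) % MOD)
    return tuple(result)


if __name__ == "__main__":
    x, y = 123456, 7890
    product_shares = multiply(share(x), share(y))
    print(reconstruct(product_shares) == (x * y) % MOD)
```

Because the simulation keeps all values in one process, it only checks the arithmetic of equation (1); it says nothing about the security of the channels or about which party can observe which message.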