VoloDB: High Performance and ACID Compliant
Distributed Key Value Store with Scalable Prune
Index Scans
ALI DAR
Master of Science Thesis
KTH Royal Institute of Technology
School of Information and Communication Technology
Supervisor: Dr. Jim Dowling
Examiner: Prof. Seif Haridi
Stockholm, Sweden, June 2015
TRITA-ICT-EX-2015:96
Abstract
Relational databases provide an efficient mechanism to store and retrieve structured data with ACID properties, but they are not ideal for every scenario. Their scalability is limited by the huge data processing requirements of modern systems. NoSQL offers a different way of looking at a database: such systems generally store unstructured data and relax some of the ACID properties in order to achieve massive scalability. There are many flavors of NoSQL systems; one of them is the key value store. Most of the key value stores currently available offer reasonable performance but compromise on many important features, such as transactions, strong consistency and range queries. The stores that do offer these features lack good performance.
The aim of this thesis is to design and implement VoloDB, a key value store that provides high throughput for both reads and writes without compromising on ACID properties. VoloDB is built over MySQL Cluster and, instead of using high-level abstractions, communicates with the cluster through the highly efficient native low-level asynchronous C++ NDB API. VoloDB talks directly to the data nodes without going through the MySQL Server, which further enhances performance. It exploits many of MySQL Cluster's features, such as primary and partition key lookups and prune index scans that hit only one of the data nodes, to achieve maximum performance. VoloDB offers a high-level abstraction that hides the complexity of the underlying system without requiring the user to think about internal details. Our key value store also offers various additional features, such as multi-query transactions and bulk operation support. C++ client libraries are provided to allow developers to interface easily with our server. An extensive evaluation benchmarks various scenarios and compares VoloDB with another high-performance open source key value store.
Acknowledgement
I would like to thank my examiner Prof. Seif Haridi for giving me the opportunity
to work on this research project. I would especially like to thank Dr. Jim Dowling
for his constant support and guidance at every step of the way during the thesis.
His expert input was instrumental in solving complex design, implementation and
optimization problems in the project.
In the end, I would like to thank my wife and parents for their moral support
throughout my studies.
Contents

Acknowledgement
List of Figures
Acronyms

1 Introduction
   1.1 Motivation
   1.2 Contribution
   1.3 Outline

2 Background
   2.1 ACID Properties
   2.2 Strong and Eventual Consistency
   2.3 CAP Theorem
   2.4 BASE
   2.5 SQL Based Databases
   2.6 NoSQL Databases

3 Related Work
   3.1 MapReduce
   3.2 Apache Hadoop and YARN
   3.3 BigTable
   3.4 Apache HBase
   3.5 Dynamo
   3.6 Apache Cassandra
   3.7 MongoDB
   3.8 Riak
   3.9 FoundationDB
   3.10 Aerospike
   3.11 Spanner and F1

4 Design
   4.1 MySQL Cluster
      4.1.1 Architecture
      4.1.2 Partition Key
   4.2 NDB API
      4.2.1 Asynchronous API
   4.3 Abstractions and Mappings
      4.3.1 Database
      4.3.2 Store
      4.3.3 Key Value Pair
      4.3.4 Key
      4.3.5 Value
      4.3.6 Index
      4.3.7 Operations
   4.4 Feature Set
      4.4.1 Strong Consistency
      4.4.2 ACID Properties
      4.4.3 Transactions
      4.4.4 Prune Index Scans
      4.4.5 Strong Data Types
      4.4.6 Multi-column Key
      4.4.7 Multi-column Value
      4.4.8 Supported Data Types
   4.5 Performance Considerations
      4.5.1 Allowed Queries
      4.5.2 Disallowed Queries
   4.6 Definition Operations
      4.6.1 Create Store
      4.6.2 Delete Store
   4.7 Manipulation Operations
      4.7.1 Set
      4.7.2 Get
      4.7.3 Delete
      4.7.4 Atomic Mode
   4.8 High Level System Architecture
   4.9 VoloDB Architecture
      4.9.1 Network I/O Handler
      4.9.2 Definers
      4.9.3 Executors

5 Implementation
   5.1 Transport
   5.2 Serialization
   5.3 Worker Threads
   5.4 Inter-process Communication
   5.5 Memory Management and Garbage Collection
   5.6 Bulk Requests Handling
   5.7 Protocol Buffer Messages
   5.8 Parallel Database Connections
   5.9 Zero Copy Semantics
   5.10 Client Library

6 Evaluation
   6.1 Evaluation Setup
      6.1.1 Hardware
      6.1.2 Software
      6.1.3 Workload
   6.2 Experiment 1 - Set
   6.3 Experiment 2 - Get
   6.4 Experiment 3 - Delete
   6.5 Experiment 4 - Mixed Reads and Writes
   6.6 Experiment 5 - Prune Index Scans
   6.7 Experiment 6 - Comparison with Aerospike

7 Conclusion
   7.1 Future Work
      7.1.1 Joins
      7.1.2 Additional Data Types
      7.1.3 Language Bindings
      7.1.4 Load Balancing

Bibliography

Appendices

A User Guide
   A.1 Download
   A.2 Project Structure
   A.3 Prerequisites
   A.4 Setting up VoloDB
      A.4.1 Installation
      A.4.2 Quick Installation
      A.4.3 Configuration
      A.4.4 Execution
   A.5 Setting up VoloDB Client Library
      A.5.1 Compilation
      A.5.2 Usage
   A.6 Sample Application
      A.6.1 Setup
      A.6.2 Important Classes
      A.6.3 Store Creation
      A.6.4 Key Value Pair Insertion
      A.6.5 Fetching Key Value Pair
      A.6.6 Fetching Key Value Pairs using Partition Key
      A.6.7 Fetching Key Value Pairs using Non-Keyed Column
      A.6.8 Key Value Pair Deletion
      A.6.9 Store Deletion
List of Figures

4.1 Overview of MySQL Cluster Architecture
4.2 High Level System Architecture
4.3 VoloDB Architecture

5.1 Protocol Buffer Messages

6.1 Throughput of Set Operations
6.2 Throughput of Get Operations
6.3 Throughput of Delete Operations
6.4 Throughput of Mixed Reads and Writes
6.5 Throughput of Prune Index Scans with 50 Records Returned
6.6 Throughput of Prune Index Scans with Variable Records Returned
6.7 Throughput Comparison with Aerospike: 0% Reads, 100% Writes
6.8 Throughput Comparison with Aerospike: 20% Reads, 80% Writes
6.9 Throughput Comparison with Aerospike: 50% Reads, 50% Writes
6.10 Throughput Comparison with Aerospike: 80% Reads, 20% Writes
6.11 Throughput Comparison with Aerospike: 100% Reads, 0% Writes
Acronyms
ACID Atomicity, Consistency, Isolation, Durability
API Application Programming Interface
BASE Basic Availability, Soft State, Eventual Consistency
BSON Binary JavaScript Object Notation
CAP Consistency, Availability, Partition Tolerance
CPU Central Processing Unit
CQL Cassandra Query Language
GFS Google File System
HDFS Hadoop Distributed File System
IBM International Business Machines
JSON JavaScript Object Notation
NDB Network DataBase
NewSQL New Structured Query Language
NoSQL Not Only SQL
OLAP Online Analytical Processing
OLTP Online Transaction Processing
OODBMS Object Oriented Database Management System
OOP Object Oriented Programming
ORD Object-Relational Database
ORDBMS Object-Relational Database Management System
POSIX Portable Operating System Interface
RDBMS Relational Database Management System
SQL Structured Query Language
XML Extensible Markup Language
Chapter 1
Introduction
In the 1970s, E. F. Codd, while working at the International Business Machines (IBM)
Research Laboratory, proposed the relational model[1] to organize and manipulate data.
A database based on this model organizes records into rows, or tuples, which are
grouped into a two-dimensional relation, or table. A row itself is an ordered
collection of attributes containing associated values. The model presented was very
extensive and included things like relational joins, constraints and normalization
procedures to make sure that designed models were consistent and optimized. Most
current Relational Database Management Systems (RDBMS) use Structured Query Language
(SQL) to manage the data. The model was a big success and became an industry standard
for storing and manipulating data. It was, and still is, used by Online Transaction
Processing (OLTP) systems that ensure the ACID properties of every transaction. It is
also used for Online Analytical Processing (OLAP) to offer business intelligence
services to users. The relational model was later extended to Object-Relational
Databases (ORD) and Object Oriented Database Management Systems (OODBMS), which
support saving data in the form of objects. They were mainly inspired by object
oriented programming languages such as C++.
Relational databases are designed for a specific use case and perform rather
well in those scenarios. They work well when ACID properties are a requirement and
data is structured in predefined schemas that can be logically organized into
rows and tables. Although an RDBMS works and scales well with reasonably large
amounts of data, its scalability is limited[2].
Data requirements today have grown immensely and can no longer be handled
efficiently by conventional database systems. Processing petabytes and even exabytes
of data is nowadays a frequent task for large enterprises. Beyond the massive volume
of data, other characteristics have changed: the speed at which data is generated
and processed is enormous, and its complexity has increased[3].
The data is simply not rigid enough to be mapped into predefined schemas; it is
unstructured, complex and comes from various sources. Such data sets are termed
Big Data. Big Data posed an enormous challenge, and a new approach called Not Only
SQL (NoSQL) was conceived for data storage and retrieval. NoSQL presented a
model that is significantly different from the traditional relational way of looking
at things. NoSQL encompasses databases that do not impose rigid schemas as RDBMS
do and that can scale massively. The massive scaling is attributed to relaxed
ACID properties; for example, consistency can be relaxed to achieve the desired
effect.
NoSQL started as internal projects at big enterprises. Google came up with
an implementation of the MapReduce programming model[4] to efficiently process
huge amounts of data in parallel using clusters of distributed commodity computers.
It later presented storage solutions such as the column oriented database
BigTable[5], built upon the Google File System (GFS), which in turn inspired
many open source projects such as Hadoop, HBase and the Hadoop Distributed File
System (HDFS)[6]. FlockDB was developed by Twitter, Cassandra[7] was started
initially by Facebook, the highly available key value store Dynamo[8] came from
Amazon, and Voldemort from LinkedIn. There are other examples, such as the key value
store Riak. Document oriented stores like MongoDB[9] and CouchDB and graph
based stores such as OrientDB and Neo4j have become very popular as well. The
NoSQL landscape has become extremely versatile and provides numerous options,
letting users and enterprises choose among many solutions depending on their unique
requirements. NoSQL databases can be complex to use, which has resulted in the
creation of many high level tools and languages that help users and reduce
development time. For example, Pig Latin and HiveQL are high level languages built
over Hadoop that offer SQL-like features.
Though NoSQL databases continue to serve the purpose of handling big data, they
are still not perfect, especially because they do not guarantee ACID properties. New
systems are being researched and developed that seek to overcome this limitation.
A class of relational databases called NewSQL is a new initiative that ensures ACID
properties while allowing massive scaling. Google Spanner[10], MemSQL
and VoltDB are a few examples.
1.1 Motivation
The NoSQL databases that exist today are built to scale and offer good performance,
but they are limited in terms of many important functionalities offered
by traditional relational databases, such as strong data types, range queries and
transactions. They also lack strong Atomicity, Consistency, Isolation, Durability
(ACID) properties, relaxing one of them in order to scale better. Strong
consistency is generally relaxed to eventual consistency, which means that after some
time the system will converge to a consistent state. Though databases with these
properties work just fine for most use cases, they are not ideal for all scenarios.
The ideal is a database system that scales well but does not compromise
on ACID properties.
As mentioned earlier, the lack of features in NoSQL databases is a big limitation
for many users. Some NoSQL databases, such as HBase, provide transactions, but
their scope is very limited since transactions only ensure row level consistency[11].
Key value stores such as Riak, Aerospike and FoundationDB also impose numerous
limitations to ensure strong consistency[12][13]. We would like to run transactions
like those on traditional relational databases, without any limitations and with full
ACID properties.
Other than the lack of fully featured transactions, there are many other limitations.
Key value stores either have limited or no support for strong data types[14]; for
example, FoundationDB only allows bytes as a key or a value[15]. Due to the lack of
strong data types, the user has to encode to and decode from bytes every time, which
is inconvenient. Many stores also do not allow composite keys and values, which is
inefficient since the user has to serialize and deserialize data to and from a single
column. Support for non primary key lookups is either missing or, where supported,
does not scale well and lacks performance. There is also an opportunity to increase
the overall performance of key value stores by using novel techniques to boost read
and write throughput.
1.2 Contribution
The purpose of this thesis is to research, design and implement VoloDB, a distributed
NoSQL key value store that offers high read and write throughput. Besides its
high performance, it offers features that other key value stores, and
NoSQL databases in general, lack. The following main features are implemented:
• Provides high throughput of reads and writes for primary key lookups.
• Fully supports transactions, allowing multiple operations to execute as a
single atomic operation.
• Provides full compliance with ACID properties for all types of queries.
• Supports highly scalable non primary key lookup queries by exploiting the
distribution/partition key.
• Supports many strong data types, with composite keys and values.
1.3 Outline
• Chapter 1: Gives a brief introduction to and the motivation for the project.
• Chapter 2: Explains the background concepts required to understand the
design and implementation of the project.
• Chapter 3: Discusses work related to this project that was considered while
determining its design and feature set.
• Chapter 4: Describes in detail the system architecture and design of VoloDB.
• Chapter 5: Describes the implementation details of VoloDB.
• Chapter 6: Evaluates the performance of VoloDB by conducting various
experiments and comparing it with another open source high performance
key value store.
• Chapter 7: Gives concluding remarks and discusses future work that could be
carried out.
Chapter 2
Background
2.1 ACID Properties
ACID is a set of properties that describes the guarantees provided by the transactions
that are central to any RDBMS. The concept of these properties for reliable
transactions was described and implemented by Jim Gray[16], but the term ACID
was introduced by Theo Härder and Andreas Reuter in their 1983 paper[17]. The
properties are explained below:
Atomicity: All the operations in a transaction either complete or none do. There
cannot be a situation where some of the operations executed successfully and some
failed. Consider a financial transaction that transfers money between accounts: it
consists of two operations, one to debit money from the source account and one to
credit the deducted money to the destination account. If these operations are not
done atomically, the amount could be credited to the destination but never debited
at the source, causing financial damage to the bank; or the amount could be debited
at the source but never credited at the destination.
Consistency: This property ensures that only valid data is ever written to the
database. Validity means that the written data does not violate any constraints
or consistency rules. Any violation of the rules results in a rollback of all the
operations in the transaction. If a transaction succeeds, it takes the database
from one consistent state to another.
Isolation: Transactions happening at the same time must not affect each other and
must work in isolation. Any intermediate data generated by one transaction should
not be visible to another. Databases usually provide several isolation levels that
users can choose from depending on their use case.
For example, Read Committed allows only the committed data from one transaction
to be visible to any other concurrent transaction. Another isolation level,
Serializable, releases all locks only after the transaction completes, thus allowing
transactions to happen one after another in a serial manner.
Durability: This property ensures that any data committed by a transaction is
persisted and is never lost. No system, hardware or power failure should
result in the loss of committed data.
2.2 Strong and Eventual Consistency
Distributed systems generally offer two notions of consistency, described below:
Strong Consistency: This property ensures that all the users of the system have the
same view of the data. Put another way, a read operation should return to every user
the last committed value of the write operation[18].
Eventual Consistency: A weaker form of consistency that does not guarantee that a
read always returns the last written value, but states that after some period of time
all reads will start returning the last written value, thus reaching a consistent
state[19]. This is also referred to as weak consistency.
2.3 CAP Theorem
The CAP theorem, introduced by Eric A. Brewer[20], states that a distributed
system can achieve only two of the following three properties:
Consistency: All the users of the database have the exact same view of the data.
This is the same as strong consistency.
Availability: The system is able to respond to the requests of all users even in
case of node failure. Every request should get a response from some non-failed node.
Partition Tolerance: The system stays in working condition even in case of lost
messages, node failures, or any number of nodes forming a separate partition because
of isolation. The system must keep responding to any request received.
2.4 BASE
As discussed in section 2.1, the ACID model is used by transaction processing
relational databases. This model can be too strong for some applications, resulting
in very limited scalability. A weaker model called BASE[20], also introduced by
Eric A. Brewer, exists. This model is generally adopted by NoSQL databases and
has the following three properties:
Basic Availability: NoSQL databases tend to focus on availability to make sure
that services are always up and running to respond to user requests, even if there
are disruptions and node failures. They tend to achieve this through active
replication of data to many nodes, so that if a node dies the data remains available
to users.
Soft State: The requirement of consistency is generally weakened; data does not
have to be consistent at all times, but should eventually converge to a valid
consistent form after some period of time. The state is thus always soft, changing
without the intervention of the user.
Eventual Consistency: As mentioned in section 2.2, this is a weaker form of strong
consistency that drops the requirement that data be consistent at all times,
requiring only that it converge to a consistent state eventually. Weakening this
property can allow the system to scale massively.
2.5 SQL Based Databases
SQL is a declarative language used to manage and manipulate databases. SQL
based databases adhere to the ACID model and fall into the following main categories:
Relational Databases (RDBMS):
As mentioned in chapter 1, they organize records into rows, or tuples, which are
grouped into a two-dimensional relation, or table. A row itself is an ordered
collection of attributes containing associated values.
Object-Relational Databases (ORDBMS):
This kind of database is very similar to relational databases: the data is stored
in tables, but with many features inspired by Object Oriented Programming
(OOP)[21]. Added features include inheritance, custom types, classes, polymorphism
and object methods[22]. They are manipulated in a very similar fashion to
relational databases, using SQL.
Object Oriented Databases (OODBMS):
These store data purely in the form of objects, unlike an ORDBMS, which uses a
hybrid approach of relational and object orientation[23]. They offer tight
integration with and direct mapping from object oriented programming languages,
since both use the same model.
2.6 NoSQL Databases
This is another class of databases, one that does not adhere to the ACID properties
but instead follows the BASE model discussed in section 2.4. Adhering to strong ACID
properties prevents a system from scaling massively, so these databases weaken
one of the properties to achieve the desired scalability. Compared to relational
databases, their data can be highly unstructured and need not be bound to a static
schema. They are divided into the following main categories:
Key Value Store
It is the simplest type of NoSQL database: each record in the database has a
value corresponding to a given key, forming a (key, value) pair. The value is
generally a byte array that can store any serialized data. The most common operations
are Set, to insert or update a value, and Get, to retrieve it. Conceptually it is
like the map or dictionary data structure of an object oriented programming language.
Dynamo[8] by Amazon, Riak and FoundationDB are examples of this kind of database.
Document Oriented Store
It is one of the most popular types of NoSQL database, storing documents in the
form of semi structured data. Documents are most commonly encoded in XML or JSON.
They are identified by keys, but unlike relational databases, documents in a store
do not need to follow a fixed structure, and every document can have a totally
different structure. MongoDB[9] and CouchDB[24] are examples of such databases.
Column Oriented Store
In this type of database, data is stored by column instead of by row. There is
a concept of a column family that groups a number of columns. Each family can
contain any number of columns, which can be added at runtime. Since the values of
a particular column are stored together, extremely fast access to any column of the
table is possible. Google's BigTable, Apache Cassandra and HBase are examples of
column oriented databases.
Graph Database
These databases are ideal for data whose relationships are well suited to a graph
structure. A graph consists of nodes, with an edge between two nodes representing a
relationship. The World Wide Web, network topologies, road and rail networks and the
Facebook friends graph can all be represented well in these databases. Products such
as OrientDB, InfiniteGraph and Neo4j are examples of this category.
Chapter 3
Related Work
After the emergence of NoSQL databases, a lot of research and development has
been taking place, resulting in a variety of products in every category with
increasingly better performance. Databases that store big data can be complex, so
just storing the data is not enough: while giving high performance, a database
should also be easy for new users, so they can focus on the problem at hand
without thinking too much about how to use a complex database.
3.1 MapReduce
MapReduce is a programming model for parallel distributed data processing over a
group of computers connected in a cluster. It was inspired by an older concept from
functional programming, and in 2004 Google developed an implementation of this
model to process large scale data over commodity computers[4].
The main idea of the framework is that it takes (key, value) pairs as input and
produces another list of (key, value) pairs as output. The computation is done using
two main functions:
Map: Takes as input a set of (key, value) pairs and produces an intermediate list
of (key, value) pairs. It can be represented as Map(key, value) -> List(key1,
value1). This function is applied to the inputs in parallel on different distributed
computers, and a separate group is created for each unique key, with the list of
values for that key combined together.
Reduce: The reduction function takes as input each group generated after the map
stage, as a key with a list of values, and produces a list of values. It can be
represented as Reduce(key1, List(v1)) -> List(v2).
The map and reduce functions are written by users, but all the underlying complex
details and implementation are hidden and managed by the system. The user provided
input is partitioned by the system and then assigned to worker machines in the
cluster. All the workers are controlled and coordinated by a master node. The input
is then processed in parallel by different worker machines applying the Map function
to produce intermediate (key, value) pairs. The intermediate values are then grouped
together for each unique key and again assigned to worker machines to execute the
Reduce function. Each worker machine then produces the final output from the
reduction function.
3.2 Apache Hadoop and YARN
Google's MapReduce implementation has inspired many open source implementations of the model; Apache Hadoop is one of them. It allows distributed parallel processing of big data, using HDFS (modeled on Google's GFS) as storage and MapReduce as the processing framework. Apache Hadoop has itself inspired many other products and data processing tools that use it as the underlying base technology, which we will discuss later.
Apache Hadoop's MapReduce underwent a major overhaul with the introduction of MapReduce 2.0, also called Apache YARN[25]. The initial version tightly coupled the processing framework with resource management, which caused scalability issues; YARN introduced a different approach that fixed many of them. MapReduce had only one master node tracking and managing all the jobs and nodes in the cluster, which created both a scalability bottleneck and a single point of failure: if the master node crashes, all currently running jobs fail. Another issue is that if the number of running jobs becomes too high, scalability is limited, since it becomes difficult for one node to manage all the concurrent jobs effectively.
YARN decouples resource management from job scheduling by introducing a per-cluster Resource Manager and a per-application Application Master. The Resource Manager manages all the resources in the cluster. A Node Manager was also introduced, which runs on every node to manage the node's health and heartbeats. Every job is handled by a separate Application Master, which manages the job's lifecycle, such as negotiating the required resources. This decoupling significantly reduces the scalability issues, since the Resource Manager handles resources while Application Masters handle jobs separately, removing a major bottleneck.
3.3 BigTable
BigTable is a column oriented NoSQL database developed by Google in 2006[5]. It is a proprietary database built on GFS, used internally by Google in numerous products such as Google Earth, Google Analytics and web indexing. It efficiently handles petabytes of data running on clusters of thousands of machines. BigTable is in effect a sparse, distributed, sorted multidimensional map. A high performance C++ API is available to interact with the database. Google not only uses it to store data for its products; it is also used as an input source or output destination for MapReduce jobs.
BigTable arranges the data in the form of tables. In each table, data is stored in rows, which are identified by keys. Keys are byte arrays that have no specific type associated with them. Every row consists of column families, where each column family groups a number of related columns. Column families are fixed at table creation time, but during an insert a row can skip any column family or add new columns to it at runtime. Column values are untyped byte arrays. Each column value is called a cell, which is referred to by the combination of row id, column family and column name. Each cell value is versioned by a timestamp: whenever a new value is written to a column, an updated value corresponding to a new timestamp is added. At read time, if no timestamp is specified, the value associated with the latest timestamp is returned; otherwise the cell value with the user given timestamp is returned. The number of versions retained per column family is configurable.
3.4 Apache HBase
Apache HBase is an open source implementation of BigTable[26]. It is written in Java and built over Apache Hadoop and HDFS. It provides strong consistency of reads and writes and offers partition tolerance, so according to the CAP theorem it is a CP database system. It features in-memory operations, compression, compaction, linear scalability and automatic failover support. In 2010, Facebook implemented its messaging platform over HBase.
HBase provides a data model similar to BigTable's, with similar concepts such as the structure of a row, column families and versioning. In HBase each column family is saved into a separate file in HDFS, which allows extremely fast access to columns. It provides auto sharding by automatically partitioning tables that become too big. Contiguous ranges of rows are stored together as regions, and each region is managed by one region server that is responsible for all reads and writes in it. Region servers can also be added dynamically as needed, and a master node manages all the region servers.
HBase offers an easy and comprehensive set of client libraries. It provides a JRuby based shell, a Thrift gateway, a Java API, and REST, XML and Protocol Buffers support. It supports real time queries and provides convenient classes for running massive MapReduce jobs.
3.5 Dynamo
Dynamo is a NoSQL distributed key value store developed by Amazon[8]. Amazon is an e-commerce giant with a customer centric business that has to serve millions of customers at a time. With these needs in mind, Dynamo was designed to provide reliability at massive scale and to ensure an 'always on' experience. Amazon uses it for its internal core services. It provides high availability and partition tolerance with eventual consistency; according to the CAP theorem it is an AP database system.
Dynamo nodes are arranged into a ring structure like the Chord DHT[27]. Data is partitioned by consistent hashing. Each node is responsible for a range of keys, and an incoming request can be received by any node. For a given key there is a preference list of nodes, which is available at every node in the system. In order to provide high availability and failover support, the data is replicated to the N-1 successor nodes in the ring. Inconsistencies between replicas are handled through Merkle trees[28], which ensure that correction is done quickly and without generating much data traffic over the network. Another prime feature of Dynamo is that it is 'always writable', which is achieved through a concept called sloppy quorum. Instead of enforcing strict quorums, values are read from or written to the first N healthy nodes available at the given time. The number of replicas that must acknowledge a read (R) or a write (W) is configurable; for example, W can be set to 1 to significantly increase write availability. Every written value has a vector clock[29] associated with it. Conflicting values resulting from node failures or partitions are resolved during the read operation. Any conflict that can be resolved automatically is handled by the system by examining the causal relationships between the vector clock values. For unresolved conflicts, all causally unrelated values are returned and then reconciled; the reconciled value is considered correct and written back to the system. New nodes can easily be added to the system, and there is no central authority that manages membership. Membership and failure detection are handled through a gossip based protocol, which prevents a single point of failure.
Dynamo has inspired many other open source NoSQL databases that reuse its techniques. For end users, Amazon provides a database called DynamoDB, which is based on a data model similar to Dynamo's[30]. DynamoDB can be used as part of Amazon Web Services. It also provides numerous language bindings for developers, with support for Java, .NET, Erlang, Python and many other programming and scripting languages.
3.6 Apache Cassandra
Apache Cassandra is an open source distributed store inspired by Google's BigTable and Amazon's Dynamo[7]. It was initially developed internally by Facebook and later open sourced. Like Dynamo, it is highly available, partition tolerant and eventually consistent, so according to the CAP theorem it is an AP database system. It is incrementally scalable, requires minimal administration and has no single point of failure. Although it is an eventually consistent data store by default, it can be tuned to offer strong consistency; Cassandra lets the user tune the tradeoff between consistency and latency to suit their needs.
The data model of Cassandra is column oriented, similar to BigTable's, which is discussed in section 3.3. Internal mechanisms such as node arrangement, data partitioning and replication are based on techniques learned from Dynamo. It integrates with Hadoop, with MapReduce support. It also supports features such as secondary indexes on columns, online schema changes, compression, compaction and dynamic upgrades without downtime. In addition to providing bindings for popular programming and scripting languages such as C++, Java, Python and Ruby, it provides an easy to use SQL-like query language called the Cassandra Query Language (CQL) for querying the database.
3.7 MongoDB
MongoDB is a document oriented NoSQL database developed by MongoDB Inc[9]. Unlike traditional relational databases, it saves data in the form of JSON-like documents. The data format used by MongoDB for network transfer and storage is BSON, the binary form of JSON. BSON is extremely efficient in terms of storage, network transfer and scan speed. MongoDB is a strongly consistent database and is a CP system as described by the CAP theorem.
MongoDB has a very dynamic schema in which documents are stored as JSON. Documents are identified by keys, but unlike rows in a relational database, documents in a store do not need to have a fixed structure; every document can have a completely different structure or set of fields. It supports horizontal scaling, allowing deployments of thousands of nodes on cloud platforms. The database supports MapReduce operations and also provides an expressive query language for communicating with the database easily. MongoDB also has various management tools for deploying, monitoring and scaling the database. It has a large, active developer community that is driving the product forward at a rapid pace. MongoDB is considered the most popular and widely used NoSQL database. It is mostly used as a backend for medium and large scale websites; companies such as SAP, Foursquare, eBay and SourceForge use it in their backends.
3.8 Riak
Riak is a NoSQL key value store developed by Basho Technologies[31]. It is an open source project based on the principles and techniques learned from Dynamo. It is distributed and provides partition tolerance, high availability, scalability and fault tolerance. Riak has predictably low latency even during node failures, network partitions and peak load. It offers eventual consistency, though it can be tuned to offer strong consistency.
Riak is implemented in Erlang, a language well suited to massively distributed systems. The data model is extremely simple: each value is identified by a key and stored in a bucket. A value can be anything, such as raw bytes, XML, JSON or documents. Riak offers developers an extremely simple interface, providing a RESTful web services API for set and get operations. It also supports Protocol Buffers and offers language bindings for popular programming languages. Secondary indexes can be created, and non key value operations over large data sets are supported through MapReduce. Another product called Riak Cloud Storage is also available: an open source database built on top of Riak that offers both public and private cloud solutions.
Riak has several limitations: it does not allow composite keys or values, it supports only a few strong data types, it has limited support for transactions in CP mode, and it does not support prune index scans.
3.9 FoundationDB
FoundationDB is a database that provides both SQL and NoSQL access with high performance[32]. It is a multi model database based on a shared nothing architecture. At its core, FoundationDB is an ordered key value store, with additional features implemented on top in the form of layers. It is scalable, fault tolerant, and supports various operating systems and cloud platforms such as EC2. It provides various access methods, including a command line interface, numerous language bindings and a SQL-like layer.
Currently most NoSQL databases are based on BASE and do not support ACID properties. The distinctive feature of FoundationDB is its support for transactions, which fully provide ACID properties like traditional relational databases.
Although ACID properties are supported, there are numerous limitations. FoundationDB does not support transactions that last for more than five seconds, or whose commit takes place more than five seconds after the first read operation. Keys and values in the database must not exceed 10 KB and 100 KB respectively, and the total data written to keys and values in a single transaction must not exceed 10 MB. These limitations exist for performance reasons, and failing to comply with them results in transaction failure. Beyond the limitations on transactions, FoundationDB is also limited in terms of data types: it does not support strong data types, and keys and values can only be byte strings. It also lacks support for queries like scalable prune index scans, which are based on non primary key columns.
3.10 Aerospike
Aerospike is an in-memory NoSQL key value store that is highly optimized for Flash and SSD storage. It is mainly an AP system according to the CAP theorem, thus providing partition tolerance and availability. The properties of the system are tunable, however, and it can be set to CP mode to allow strong consistency and ACID compliance.
It also claims to be 10 times faster than both Cassandra and MongoDB according to the Yahoo! Cloud Serving Benchmark[33]. Despite its various features it has many limitations. Transaction support in CP mode is limited: bulk write operations are not allowed in a transaction, and a limit is imposed on batch size. It has limited support for data types and does not allow composite keys or values. It also does not support prune index scans.
3.11 Spanner and F1
Spanner is Google's proprietary modern semi-relational database management system and a successor to BigTable[10]. It is a NewSQL database that tries to solve the issues faced by NoSQL databases. Spanner fully supports transactions, which were lacking in BigTable. Data is replicated, stored and managed across distributed data centers. Its core API introduces a concept of bounded clock uncertainty to provide strong consistency. It offers the same scalability as a NoSQL database while guaranteeing the ACID properties of a traditional relational database system.
F1, a database management system based on Spanner, is used as the backend for Google's AdWords business[34]. Internally, Google had implemented a custom version of MySQL that failed to scale, and F1 replaced it as well. F1 is a fully relational database system in the traditional sense, with ACID properties and SQL query support, but it also scales massively, and hence it has the properties of both NoSQL and SQL databases.
Chapter 4
Design
VoloDB, our NoSQL key value store, is designed and implemented over MySQL Cluster. MySQL Cluster is an in-memory database that is scalable, ACID compliant, runs on commodity computers and provides 99.99% availability[35]. It is built upon shared nothing technology with auto-sharding to efficiently process massive numbers of read and write requests.
VoloDB's features are mapped onto the MySQL Cluster database: a store is mapped to a table, a key value pair to a table row, and an individual key or value of a pair to a table column. An index on a store column is mapped to a standard MySQL Cluster column index. Various other features specific to MySQL Cluster are exploited to enhance VoloDB's overall performance; these are discussed later.
4.1 MySQL Cluster
In order to understand how VoloDB is implemented over MySQL Cluster, it is important to understand the cluster's overall architecture and its available client libraries and interfaces.
4.1.1 Architecture
MySQL Cluster is composed of several components. It integrates a standard MySQL database with an in-memory storage engine called Network DataBase (NDB). A cluster setup consists of a number of host machines, each running one or more processes called nodes. A node can run a MySQL Server to provide access to the data, or it can store data only. There can also be one or more management servers to configure and manage the cluster.
The data stored by the NDB engine can be accessed by any of the MySQL Server nodes. An update made to a data node by any server is visible to the other server nodes. Data is actively replicated to other data nodes, so the failure of a data node does not cause availability issues: the data remains available on the other active data nodes. Nodes can be added, removed or restarted easily via the NDB management servers, which can also apply configuration changes or software updates. MySQL Cluster also offers various client interfaces, including the standard MySQL client, various language bindings and connectors, and even low level high performance interfaces. The architecture is shown in figure 4.1[36].
Figure 4.1: Overview of MySQL Cluster Architecture
4.1.2 Partition Key
A concept that is important to understand, and is used throughout this project, is key partitioning. Since MySQL Cluster contains multiple data nodes, the partition key of a table determines on which data node the main replica of a row will reside. The responsible data node is calculated by taking an MD5 hash of the partition key columns specified by the user. In this way the data is randomly distributed across the data nodes, which provides load balancing. If no partition key is specified, the primary key columns are automatically used as the partition key columns. The point to note is that the partition key must be part of the primary key.
4.2 NDB API
As briefly discussed in section 4.1.1, MySQL Cluster provides various interfaces that users and programmers can use to interact with it. Examples of the provided interfaces are shown in figure 4.1 under Clients/API. As we can see, most of the provided interfaces go through SQL nodes before reaching the data nodes. This extra layer in front of the data nodes can hinder performance for applications with extremely high throughput requirements. In figure 4.1 we can also see that the NDB API interface talks directly to the NDB storage engine running on the data nodes. The NDB API is a low level, high performance C++ programming interface for communicating with the data nodes.
VoloDB is built using the low level NDB C++ API. This interface provides functions that can be used to create tables and indexes and to insert, fetch, update and delete records asynchronously in a highly efficient way.
4.2.1 Asynchronous API
The NDB API comes in two flavors, synchronous and asynchronous. The synchronous API, as the name suggests, is a collection of functions and classes that wait for results to return from MySQL Cluster. Though handy in certain scenarios, this behavior can be extremely inefficient when high throughput is required, and thus it is not feasible here.
The asynchronous flavor allows the user to send queries and commands to the cluster repeatedly without waiting for responses. This allows the client to send as many requests as it wants without blocking and then receive the responses in bulk, increasing throughput considerably. VoloDB is built upon this asynchronous version of the NDB API.
4.3 Abstractions and Mappings
As mentioned earlier, our VoloDB key value store is built upon MySQL Cluster. To make this work, it is very important to correctly map the concepts and entities of a relational database onto our NoSQL key value store. The following are the main mappings onto MySQL Cluster.
4.3.1 Database
The concept of a database, a collection of individual stores, is mapped to a schema. A schema in MySQL Cluster is a collection of tables; in our case we treat it as a collection of stores.
4.3.2 Store
A store, which holds a set of key value pairs, is mapped to a database table. By abstracting a store over tables, we get all the features of MySQL Cluster tables for free, such as automatic sharding and replication for high availability and failover support.
4.3.3 Key Value Pair
A key value pair in the store is abstracted and mapped to a table row in MySQL Cluster. As MySQL Cluster is an in-memory database, it tries to keep all table rows in memory at all times, giving very fast access to the records of our store.
4.3.4 Key
The key of an individual store is mapped to the primary key of a table, which uniquely identifies a given row.
4.3.5 Value
The values for a given key are stored as the columns of a table row. VoloDB supports multiple values for a given key, so internally a value can map to more than one table column.
4.3.6 Index
Our key value store also internally creates indexes to allow faster access to data. An index on a store is simply mapped to an index on a MySQL Cluster table column.
4.3.7 Operations
The Set operation is mapped to Insert Into/Update Table, the Get operation to Select From Table, the Delete operation to Delete From Table, and the Create/Drop Store operations to Create/Drop Table.
4.4 Feature Set
The following are the salient high level features of VoloDB.
4.4.1 Strong Consistency
VoloDB is strongly consistent: all concurrent users have the same view of the store. Most current NoSQL databases have a weaker form of consistency according to the CAP theorem. The few NoSQL databases that do offer strong consistency are limited; HBase, for example, offers consistency only at the row level and not when a query affects more than one row. Our goal is to offer strong consistency without such limitations and to support table level consistency.
4.4.2 ACID Properties
As discussed earlier, NoSQL databases generally adhere to BASE instead of ACID because of scalability concerns. Some NoSQL databases do offer ACID properties, but they lack performance and impose certain restrictions. VoloDB fully supports ACID properties without such performance or feature limitations by using the underlying MySQL Cluster functionality.
4.4.3 Transactions
VoloDB fully supports transactions with ACID properties. Most current key value stores lack transactions or provide only a weak form of them. Our transactions support not just a single query per transaction but as many as the user wants, all of which run atomically. VoloDB maps a transaction to a standard MySQL Cluster transaction and thus inherits all of its required properties.
4.4.4 Prune Index Scans
VoloDB not only supports fetching a value by its unique key, it also supports queries that are not based on a primary key. Prune index scans are one of the distinctive features of VoloDB, allowing users to query a store using the partition key.
As explained in section 4.1.2, the records of a table are distributed to data nodes using the partition key, and since a prune index scan queries using this key it hits only a single data node, which tends to be very efficient.
4.4.5 Strong Data Types
Most current key value stores support only a generic byte array as the value for a key. Although this scheme allows the user to store any value, it can be inefficient and inconvenient, as the user has to encode and decode values to and from bytes. Since VoloDB's values are mapped to strongly typed table columns, it offers better usability and performance.
4.4.6 Multi-column Key
The key of a record in a store is not limited to a single column. A user can specify multiple key columns when creating a store, and they are not required to be of a generic byte type.
4.4.7 Multi-column Value
VoloDB supports more than one value for a given key. When creating a store, a user can specify as many value columns for a key as needed. Many key value stores support only a single value, so to store more than one value for a key, the user has to pack and encode the values into a single byte array. Our multi-column values support scenarios where the user needs to store structure-like values efficiently, without any such overhead.
4.4.8 Supported Data Types
VoloDB supports a wide range of strong data types. A total of ten data types are supported: boolean, int32, uint32, int64, uint64, float, double, char, varchar and varbinary.
4.5 Performance Considerations
VoloDB takes many performance considerations into account before running operations. Not all kinds of queries are supported: queries that tend to be very slow are disallowed to ensure throughput remains high at all times. An inherently slow query not only slows the current user, it can slow down the whole system, affecting the other concurrent users.
As discussed in section 4.1, data in MySQL Cluster is distributed across many data nodes. For every inserted row, an MD5 hash of the partition key columns is taken to compute the destination node where the data is stored. When querying a table in the cluster, every data node may have to be hit to access the required data. Queries of this type, which hit all the data nodes, are slow and are blocked by VoloDB.
4.5.1 Allowed Queries
Here we discuss the kinds of operations that are supported in VoloDB.
Primary Key Lookup/Insert: A primary key lookup or insert affects only one record, since the key is guaranteed to be unique. Also, as the partition key is required to be part of the primary key, the system can compute the data node responsible for the operation, so only a single data node is hit. Since this type of query hits only one data node and can be executed efficiently, it is allowed in our key value store.
Prune Index Scan: Prune index scan queries are those that do not use the primary key. They are based on the partition key, which hits only one data node. A query on the partition key columns using the equality operator can end up fetching more than one record, but since the records were originally distributed using these partition key columns, the cluster still hits only one data node. As this type of query accesses only one data node, it is allowed in VoloDB.
4.5.2 Disallowed Queries
As a rule, any query that can end up hitting more than one data node is disallowed for performance reasons.
Non Primary Key Operation: A query based on a non primary key column is not allowed, since the matching records are not unique and the cluster would have to hit all the data nodes to fulfill the query.
Non Partition Key Operation: The partition key is only required to be part of the primary key, but queries based on it are allowed, since records are distributed using this key and MySQL Cluster can go directly to the destination data node without contacting other nodes. A query that is not based on the partition key columns is therefore disallowed.
Full Table/Index Scan: Although it has already been stated that any query not based on either the primary or the partition key is blocked, to make things clear: queries resulting in a full table or index scan are disallowed. The point to note is that although an index scan can be much faster than a full table scan, the store would still have to visit all the data nodes to traverse all the column index values in order to satisfy the query, which would still be a performance hit.
4.6 Definition Operations
These operations are used to make changes to the schema/database of VoloDB. The supported operations are as follows:
4.6.1 Create Store
This operation creates a new store in VoloDB. A store is a container for key value pairs. The user must specify one or more fields with the following properties:
1. A single-column or composite key that uniquely identifies the values.
2. One or more columns to hold the values for a given key.
3. A single-column or composite partition key. This option is not mandatory; if it is not specified, the primary key is automatically taken as the partition key.
4.6.2 Delete Store
The Delete Store command is the opposite of Create: it removes the given store. Records that exist within the store are removed automatically.
4.7 Manipulation Operations
This set of operations allows the user to manipulate the data. The following manipulation operations are supported.
4.7.1 Set
This operation adds a key value pair to the given store. The user specifies the data for each key and value column to be inserted into the store. The key must be unique and contain no null values. The key and non key columns must exactly match those defined when the store was created; the user is not allowed to add or remove fields in a store on the fly. Note that no separate Update command exists to modify existing values in the store: values are updated with the same Set command. If no value exists for the key, a new key value pair is added; otherwise the old values are overwritten with the new ones.
4.7.2 Get
Given a key, this operation fetches the values that were inserted earlier using the Set operation. As mentioned in section 4.5.1, only fetch queries that hit a single data node are supported. To satisfy this condition, only the equality operator on either the primary key or the partition key is allowed. If the query is based on the primary key, a single key value pair is returned; if it is based on the partition key, more than one pair can be returned.
4.7.3 Delete
Given a key, this operation removes the key value pair from the store. To delete a key value pair, the user has to provide the full primary key data of the record.
4.7.4 Atomic Mode
All of the above manipulation operations can be grouped together into a transaction and run atomically: either all of the grouped operations run successfully or none do. The result is guaranteed to be strongly consistent even when concurrent transactions run on the same store.
4.8 High Level System Architecture
The high level system architecture of VoloDB and how it fits in with the external entities is shown in figure 4.2. Our key value store sits between the clients and the MySQL Cluster data nodes. Clients talk to VoloDB using a custom library that transports data on the wire using Protocol Buffers; the details of the client library and its implementation are discussed later. VoloDB receives requests from different clients and talks directly to the data nodes, without the additional layer of a MySQL Server in the middle. After receiving a response from a data node, VoloDB returns the result to the client using Protocol Buffers. The details of Protocol Buffers and the wire format are discussed in section 5.2.
Figure 4.2: High Level System Architecture
4.9 VoloDB Architecture
The design of VoloDB is inspired by Mikael Ronstrom's benchmark program that shows how to create a highly scalable key lookup engine[37]. The internal architecture of VoloDB is shown in figure 4.3, which presents the concepts most critical to the design and performance.
4.9.1 Network I/O Handler
VoloDB has a single asynchronous I/O thread that handles all network traffic. All requests from clients, and all responses sent back, pass through this thread, shown as the Network I/O Thread in figure 4.3. Whenever the network I/O thread receives a request from a client, it forwards it to one of the definer threads using a round robin approach to distribute the load equally. The network I/O thread can also receive a message from the internal executor component: the response to a user request is handed over to it, and it is responsible for delivering the message to the destination client. To reply to the correct destinations, it keeps state information about all connected clients.
[Figure: incoming requests flow through the Network I/O Thread to the definer threads, then to the executor threads, each of which communicates with one of the data nodes; outgoing responses flow back the same way.]
Figure 4.3: VoloDB Architecture
4.9.2 Definers
Definer threads sit between the network I/O thread and the executor threads and receive the user requests forwarded by the network I/O thread. Each definer thread is connected to all the executor threads running in the system. The number of definer threads is configurable and can be increased or decreased as per user requirements, but at least one definer thread is required.
The task of a definer thread is to decode the received messages and assign them to executors. It collects as many messages as possible within a specified period of time (10 milliseconds by default). The requests are then randomly assigned to one of the executors.
4.9.3 Executors
Executor threads are responsible for running the user queries by communicating directly with the data nodes using the asynchronous NDB API. The number of executor threads is configurable and can be increased or decreased as per user requirements, but at least one executor thread is required. For better performance, there should be at least a one-to-one mapping between executor threads and the data nodes in the cluster, so that executor 1 is responsible for handling queries for data node 1, executor 2 for data node 2, and so on.
Every executor thread collects the requests received from the definer threads, silently ignoring any unrecognized messages. It then prepares as many queries in bulk as possible, since sending queries to the data nodes in bulk has obvious performance benefits. After all the queries are prepared, they are sent in bulk to the data nodes using the NDB API in an asynchronous fashion. The executor thread then waits asynchronously for the responses from the data nodes. When a response arrives, the executor thread serializes it into the format expected by the client and forwards it to the network I/O thread for delivery.
Chapter 5
Implementation
VoloDB is written entirely in C++ using the latest standard, C++11. In this chapter we discuss implementation-specific details of the store, including the languages, frameworks and libraries used, as well as the optimizations and techniques applied to enhance its performance. The implementation follows the functional specifications and design guidelines discussed earlier.
5.1 Transport
Sending and receiving data over the network is one of the most important aspects of the system, since all client requests arrive over the wire. Inefficient handling of network messages slows down the system and adversely affects all connected users.
For our project, ZeroMQ [38] is used for all transport needs. ZeroMQ is an efficient, ultra fast socket library built for general purpose distributed applications with a focus on extreme optimization. It also ships with ready-made design patterns, transport mechanisms and language bindings. VoloDB's transport server, which services requests and responses over the network, is written entirely with this library. A pattern called the asynchronous server is used, with a user-configurable number of worker threads and the C/C++ language binding, to achieve the best possible performance.
As alternatives, several other approaches and frameworks were considered for handling requests and responses over the network. One obvious approach was to write our own library using native sockets. This approach was risky, since it takes a lot of effort and time to reach the level of optimization, efficiency and stability that a popular, community driven, stable and well-tested library provides out of the box. Another asynchronous, high performance and popular C++ library, Boost Asio [39], was considered but not chosen due to its usability issues and complexity compared to ZeroMQ.
5.2 Serialization
To send messages from clients to VoloDB and receive responses back, all messages must be encoded in a pre-defined format. The encoded message should be small, so that it takes less time to transfer over the wire; otherwise clients on high latency networks will suffer poor performance. It should also be encoded in such a way that decoding is cheap; otherwise precious CPU cycles are spent decoding messages instead of handling the user's actual request.
The encoding and decoding of network messages are handled by Google's Protocol Buffers [40]. Protocol Buffers are an extensible, language and platform neutral way of serializing data. All messages and operations supported by VoloDB are defined with it, and classes are then generated using the tools provided by Google. The generated classes can convert themselves into a raw stream of bytes that can be sent over the network. The same generated classes are provided to the clients, which can easily populate, encode and send them to VoloDB for handling.
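As an illustration, messages like those in figure 5.1 could be declared roughly as follows in proto2 syntax (the syntax version used by Protocol Buffers 2.6); this is a hypothetical sketch, not VoloDB's actual schema:

```proto
// Illustrative proto2 sketch; field names mirror figure 5.1.
message ColumnValue {
  repeated bytes value = 1;
}

message Row {
  repeated ColumnValue column = 1;
}

message Get {
  required string store_name = 1;
  repeated ColumnValue primary_key = 2;
}

message Result {
  repeated string transaction_identifier = 1;
  repeated Row result = 2;
  optional int32 error_code = 3;
  optional string error_description = 4;
}
```

From such a definition, the protoc compiler generates C++ classes with `SerializeToString` and `ParseFromString` methods for the wire conversion described above.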
As alternatives, several other approaches to data serialization were considered. For example, data could have been encoded as XML or JSON, but these formats are very verbose and take many bytes to encode, which would slow down transfer over the network. Cap'n Proto [41] was one framework that could have been used instead of Protocol Buffers. It is another language and platform neutral serialization library that requires no encoding or decoding step at all: the data added to an object is always in a form suitable both as an in-memory representation and as a data interchange format. As a side effect, however, the serialized object is far larger than the equivalent Protocol Buffers object. Cap'n Proto is therefore ideal for use cases where serialized objects are stored and retrieved locally, but sending them over a network incurs a significant performance loss because of the larger data size. Another framework by Google, FlatBuffers [42], which is based on the same techniques as Cap'n Proto, was considered but left out for the same reason.
5.3 Worker Threads
The worker threads run by VoloDB are standard C++11 threads, for best performance and portability. Other popular implementations such as Boost and POSIX threads were not used because of standardization and portability issues. The worker threads are configurable and take the form of either definers or executors, as described in section 4.9. There is also a worker thread run by ZeroMQ for all network I/O needs.
5.4 Inter-process Communication
Proper coordination between definer and executor threads requires constant communication between them. Since many definers may need to communicate with any of the executor threads, contention can result, and inefficient handling or unnecessary locks can severely degrade the performance of the system.
All inter-process communication between the threads and system components is done using ZeroMQ's lock free PUSH based local sockets. Being lock free, the message queue of every thread is handled efficiently by the ZeroMQ library.
5.5 Memory Management and Garbage Collection
As VoloDB is written entirely in C++, it is the responsibility of the author to manage the allocation and deallocation of memory. C++ offers no garbage collection mechanism, in the interest of speed and performance. Without sufficient care, managing all deallocation by hand can become unmanageable in large, performance centric applications: every allocation whose deallocation is forgotten results in a memory leak that accumulates over time and will eventually crash the system.
In this project, every dynamic memory allocation is assigned to a C++ standard smart pointer, shared_ptr, which then manages its life cycle through reference counting. Ownership is shared when one smart pointer is assigned to another, and the dynamically allocated memory is automatically deleted when the last smart pointer referring to it goes out of scope or is explicitly reset.
5.6 Bulk Request Handling
Handling data in bulk can significantly improve performance and throughput, and for this reason all requests received by the store are handled in bulk. By default VoloDB tries to collect as many requests as possible within 10 milliseconds and then processes and executes them together. NDB queries sent to MySQL Cluster are also sent in bulk: all NDB transactions are prepared and then executed at once. ZeroMQ is likewise designed to aggregate messages and handle them in bulk for performance reasons; as discussed in section 5.4, inter-process communication between definers and executors uses ZeroMQ PUSH sockets, so we get bulk message handling for free.
5.7 Protocol Buffer Messages
As described earlier, user requests such as the Get, Set and Delete operations are serialized using Protocol Buffers before being sent on the wire. Figure 5.1 shows some of the important messages, and their structures, used to encapsulate user queries.
[Figure: class diagram of the Protocol Buffer messages. Manipulation operations (Get, Set, Delete) and definition operations (CreateStoreOperation, DeleteStoreOperation) derive from Operation; they carry a store_name, primary_key and value ColumnValue arrays, or AttributeInfo definitions. A Column has a name, a ColumnType (BOOL, INT32, UINT32, INT64, UINT64, FLOAT, DOUBLE, CHAR, VARCHAR, BINARY), a length, and is_primary_key/is_distribution_key flags. A Result contains transaction identifiers, result Rows, an error code and an error description.]
Figure 5.1: Protocol Buffer Messages
5.8 Parallel Database Connections
NDB connections are not thread safe by default, and using a single connection would be extremely inefficient, since every request executed against the MySQL Cluster would require an exclusive lock on it. Exclusive use of the connection means only one request executes at a time, yielding very low throughput. To fix this, the executor threads keep many separate database connections open in parallel. This not only allows parallel execution of requests for higher throughput, but also gives performance a further boost, since no locking and unlocking is needed to share a connection.
5.9 Zero Copy Semantics
As described in section 5.1, ZeroMQ is a highly optimized network library for efficient data transfer. One of its important optimizations is to avoid copying data as much as possible, which is significant for high performance and low latency [43].
VoloDB takes a similar cue from ZeroMQ and applies zero copy semantics throughout its implementation. Data, whether small or large, is passed between functions and components by reference or pointer, and extra copies are also prevented automatically by the move constructor feature of C++11.
5.10 Client Library
A client library written in C++ is provided for end users to interact with VoloDB. It supports the main operations, such as the create store, drop store, get, set and delete commands, along with advanced features such as transaction and bulk operation support. The library is completely asynchronous, for performance reasons. Since the server operates on Protocol Buffer messages, clients in other programming languages can easily be added later.
Chapter 6
Evaluation
To evaluate VoloDB, we implemented our own benchmark program, inspired by flexAsynch [44], a benchmark program shipped with MySQL Cluster to test its scalability and performance. Our benchmark program uses VoloDB's C++ client library to simulate clients and their requests.
As the focus of the thesis is to achieve maximum throughput, we evaluated VoloDB extensively for all supported operations, benchmarking different scenarios for every operation. Finally, VoloDB was also compared against Aerospike, an open source high performance key value store.
6.1 Evaluation Setup
6.1.1 Hardware
The experiments were performed on up to nine machines that were part of an internal cluster at SICS. Every machine had a 24 core Intel Xeon 1.9 GHz processor and 94 GB of DDR3 RAM, with L1, L2 and L3 caches of 32K, 256K and 15360K respectively. The machines were connected over 10 Gigabit Ethernet. A two data node MySQL Cluster was also deployed on these machines.
6.1.2 Software
The machines were running the 64 bit version of CentOS 6.6. As our key value store depends on several third party software libraries, it is important to mention their version numbers: MySQL Cluster 7.4.5, ZeroMQ 4.0.5, Boost 1.57 and Protocol Buffers 2.6.0 were used to compile VoloDB.
6.1.3 Workload
Our benchmark program was designed to generate custom workloads, depending on the scenario, the operation and the number of VoloDB instances being tested. Each experiment was tested on up to seven instances of VoloDB. Every instance ran on a separate machine, with its clients running on the same machine. A total of 400 clients generated 2000 queries, making 1.6 million requests per VoloDB instance. Since we were testing up to seven instances of VoloDB, the total workload varied from 1.6 million to 11.2 million operations depending on the number of instances being run.
To make our experiments meaningful, we simulated a simple e-commerce application for online shopping. A total of five tables were created to store different types of user information:
Orders: contains information regarding pending orders of users.
History: contains the history of all completed and cancelled orders of all users.
Cart: contains the products currently in the cart of every user.
WishList: contains the product wish list of users.
WatchList: contains the list of products for which a user wants to be notified of price changes.
All tables had an int32 user_id column in common, which was also the distribution key, meaning that the records of every table were distributed across the data nodes using the user_id column. This property is important for prune index scans, which fetch all the information of a certain user quickly and which we evaluate later in an experiment. For the sake of simplicity, all tables were of the same size, with columns of int32, uint32, int64, uint64 and float types. They all had a four part composite primary key and a single distribution key column, as discussed above.
6.2 Experiment 1 - Set
In this experiment we simulated a scenario where our e-commerce business receives a large number of write requests, such as new orders placed by users. We tested up to seven instances of VoloDB; the total number of records added varied from 1.6 million for a single instance to 11.2 million for seven instances.
The experiment was first executed adding a single order per transaction. The results are shown in figure 6.1. The throughput achieved was very good, at over 100,000 writes per second, and grew almost linearly up to three VoloDB instances. It peaked at four instances, where we reached a throughput of almost 280,000 writes per second. After four instances it flattened out at around 250,000 writes per second, even when more instances were introduced. This behavior can be attributed to the fact that we were using a two node MySQL Cluster, which seemed to max out when more than four VoloDB instances were used; adding more data nodes should yield more throughput.
The experiment was repeated twice more, this time with five Set operations per transaction. In the first rerun, five orders were bulk inserted into the Orders table. In the second rerun, five records were again added in bulk, but each Set operation targeted a different table (Orders, History, Cart, WishList, WatchList). Adding records in this fashion simulated a scenario where large quantities of different kinds of data are generated across the whole e-commerce system.
As figure 6.1 shows, the behavior was similar to before, but we got a constant boost when write operations were executed in bulk. Both reruns peaked at four VoloDB instances at around 320,000 writes per second and flattened out after that at about 280,000 writes per second. Notably, executing the five operations within the same table consistently performed better than executing them across different tables.
Figure 6.1: Throughput of Set Operations
6.3 Experiment 2 - Get
In this experiment we simulated a scenario where our online shopping system has to handle a large number of read requests. We benchmarked the throughput of the Get operation by fetching all the records added in experiment 1.
We first fetched all the existing orders with a single Get operation per transaction, and then repeated the experiment twice with five operations per transaction: in the first rerun orders were fetched in bulk, and in the second rerun the orders, wish list, watch list, history and cart were fetched within a single transaction.
The results are shown in figure 6.2. As expected, Get operations achieved much higher throughput than Set operations. The growth was almost linear all the way up to seven VoloDB instances, even though we only had a two node MySQL Cluster. Throughput reached up to 600,000 reads per second with a single fetch per transaction. When Get operations were executed in bulk, they started off slower than single operation transactions up to four VoloDB instances, but then aggressively overtook them and peaked at around 750,000 reads per second. Bulk fetching from a single table or from different tables made little difference; both were equally efficient. When the total number of records to fetch is large, reading records in bulk can help achieve better throughput.
Figure 6.2: Throughput of Get Operations
6.4 Experiment 3 - Delete
To benchmark the performance of the Delete operation, we deleted all the records from our online shopping system that were added in experiment 1.
Following the pattern of the previous experiments, we first deleted the orders of every user one per transaction, then repeated the run deleting all orders in batches of five operations per transaction. Lastly, the test was repeated so that every record was deleted with a single transaction containing one Delete operation for each table (Orders, History, Cart, WishList, WatchList).
The results are shown in figure 6.3. The shape of the graph was quite similar to that of the Set operation in experiment 1. With a single operation per transaction, throughput peaked at just over 320,000 deletes per second at four VoloDB instances, but started to decline even as more server instances were added. The decrease can be attributed to the fact that MySQL Cluster was using only two data nodes, which seemed to be insufficient for more than four VoloDB instances. As expected, bulk deleting records achieved better throughput, especially beyond three VoloDB instances, where the gap between bulk and non bulk transactions started to widen considerably. Bulk deleting within a single table performed better than across different tables, peaking at around 400,000 and 360,000 deletes per second respectively.
Figure 6.3: Throughput of Delete Operations
6.5 Experiment 4 - Mixed Reads and Writes
In a real world scenario, reads and writes do not happen in isolation but as a mix of both. In an online shopping system, users are either browsing products (read operations) or adding items to the cart and placing new orders (write operations). This experiment simulated such a scenario to test the throughput of VoloDB when clients generate a mix of reads and writes. The setup was the same as in experiments 1 and 2, but Set and Get operations were mixed in different ratios to randomly fetch and add new user orders.
The results are shown in figure 6.4. Throughput remained good: as expected, it was highest, and scaled linearly, at a read/write ratio of 80:20. Performance was lowest when writes dominated the mix, but still reached a high throughput of around 250,000 transactions per second. The 50:50 read/write ratio lay in between, with a maximum of around 350,000 transactions per second. After four VoloDB instances, throughput started to level off, especially for the write intensive experiments. This suggests we hit the maximum that our two data node MySQL Cluster can handle, and that adding more nodes would improve throughput further.
Figure 6.4: Throughput of Mixed Reads and Writes
6.6 Experiment 5 - Prune Index Scans
The prune index scan is the most distinctive feature of VoloDB. It allows the user to query a store without the primary key while still achieving high scalability and throughput. In this experiment we put prune index scans to the test to benchmark their performance.
Before moving on to the experiment, it is important to understand the use case of this feature. Prune index scans are initiated by querying on partition key columns, and our online shopping system presents an ideal scenario for them. In a shopping system like ours, instead of fetching all the records in the database, generally only the records of a specific user need to be fetched: when a user logs in to his account, the system only has to fetch that user's details, such as the products in his cart, his order history and his wish list. If we look at the tables of our online shopping system, as discussed in section 6.1.3, all of them have user_id in common, which is also the partition key. This means the records in all tables are distributed to data nodes using this column, so a query on user_id with a particular value hits only one data node, from which all the records of that user can be fetched.
Figure 6.5: Throughput of Prune Index Scans with 50 Records Returned
To benchmark prune index scans, two types of experiment were performed. In the first, the orders of randomly chosen users were fetched with a single Get operation using the distribution key. It was then repeated in batches of five, fetching the orders of five random users in one transaction, and lastly with a transaction containing five Get operations, one per table, to randomly fetch all data for a particular user.
Before running the experiment, data was populated in such a way that only 50 records per table were returned for a particular user. The result of the experiment is shown in figure 6.5. Prune index scans with single operation transactions reached a maximum throughput of almost 650,000 scans per second, with linear growth even up to seven VoloDB instances, which is as good as Get operations on a primary key. The growth of the two batched prune index scan variants was also linear, and both performed almost identically, peaking at around 355,000 scans per second, about half of the non bulk variant.
Figure 6.6: Throughput of Prune Index Scans with Variable Records Returned
The same experiment was repeated with a variation: instead of a fixed 50 records, every scan returned a variable number of records, from 1 up to 2048. This was important to test because a prune index scan is not based on the primary key and can return any number of rows. The result of the experiment is shown in figure 6.6. The graph shows that, for both the bulk and non bulk variants, throughput decreased only slightly when up to 64 records were returned per scan. At 128 records and above, performance started to decrease considerably, but even with 2048 records returned it remained above 100,000 scans per second. Throughput decreased with an increasing number of records, even though each scan only hit a single data node, because VoloDB had to iterate over, process, encode and send all returned rows to the client, which slowed the system down.
6.7 Experiment 6 - Comparison with Aerospike
For the last experiment we compared the throughput of VoloDB with a high performance open source NoSQL key value store, Aerospike. We tested the community edition, version 3.5.9, with its C benchmark program version 3.1.16 [45], available with the source code.
Figure 6.7: Throughput Comparison with Aerospike: 0% Reads, 100% Writes
We tested reads, writes and mixes of both in different ratios. There was no option to test deletes, and since Aerospike supports neither bulk operations nor prune index scans, it was not possible to benchmark those either. Aerospike was tested on up to a cluster of nine instances running on the same machines described in section 6.1. The number of client threads was carefully chosen to get the maximum throughput out of each Aerospike instance: for every server instance, two instances of the benchmark program were run on the same machine with 40 client threads each. The benchmark program's settings were also changed to ensure strong consistency of reads and writes.
A total of five variations of the experiment were performed, measuring throughput at read/write ratios of 0:100, 20:80, 50:50, 80:20 and 100:0. The results are shown in figures 6.7, 6.8, 6.9, 6.10 and 6.11. For these experiments the two MySQL Cluster data nodes also count towards VoloDB, which is why the server instances on the x-axis start from 3 instead of 1.
Figure 6.8: Throughput Comparison with Aerospike: 20% Reads, 80% Writes
As seen in figures 6.7 and 6.8, VoloDB performed better than Aerospike in write intensive workloads on up to six server instances. Beyond six instances, Aerospike started to overtake VoloDB in write intensive workloads. This could be because VoloDB is backed by only two MySQL Cluster data nodes, which is insufficient for a higher number of VoloDB nodes and client requests, whereas each Aerospike instance is also a data node.
For the 50:50 read/write ratio and the read intensive workloads shown in figures 6.9, 6.10 and 6.11, Aerospike consistently performed better than VoloDB and peaked at over one million operations per second. As we moved from a write intensive mix to a read intensive one, the gap between VoloDB and Aerospike narrowed before Aerospike took over. We can conclude that for write intensive single operation transactions VoloDB is better, while for read intensive workloads Aerospike gives better throughput. Note that this comparison is only valid for transactions with a single operation, since Aerospike has limited support for multi-operation transactions; VoloDB achieved better throughput when operations were batched together in a transaction, which we could not compare against Aerospike because of this lack of features.
Figure 6.9: Throughput Comparison with Aerospike: 50% Reads, 50% Writes
Figure 6.10: Throughput Comparison with Aerospike: 80% Reads, 20% Writes
Figure 6.11: Throughput Comparison with Aerospike: 100% Reads, 0% Writes
Chapter 7
Conclusion
A NoSQL key value store provides an efficient alternative to traditional relational database management systems when flexibility and scalability are the main requirements. Most key value stores achieve this by weakening one of the ACID properties offered by an RDBMS. Popular key value stores such as Riak and Amazon Dynamo use a weaker form of consistency to achieve high scalability. Lacking ACID makes a database limited, since it cannot be used in the numerous scenarios where these properties are desired, while the NoSQL key value stores that do provide ACID properties tend to lack performance and throughput. They also have other issues, such as no scalable way to do non primary key lookups and a lack of strong data type support. VoloDB, the key value store presented in this thesis, tries to solve these problems. We have shown in chapter 6 that it is possible to achieve high throughput for Get, Set and Delete operations without compromising on ACID properties. Using just a two node MySQL Cluster, we performed better than Aerospike in write intensive workloads. Aerospike performed better in read intensive workloads, but there is a strong possibility that a larger MySQL Cluster would let us outperform it there as well. Our key value store implements advanced features that most other stores lack, such as strong data types, full transaction support and bulk operations. It also offers a distinctive feature, the prune index scan, which allows queries based on the partition key; our experiments showed that it scales linearly and performs as well as a Get operation. High performance, scalable prune index scans open up a wide range of possibilities that can now be done efficiently.
7.1 Future Work
VoloDB supports all the basic operations of a key value store, along with some advanced features, but it still has room for improvement in both feature support and optimization.
7.1.1 Joins
VoloDB does not support joining two stores on a common column, for example between a primary and a foreign key. This is because the NDB API does not support it: it only offers single table operations. To support this feature, all the join code would have to be written manually on the server or the client side.
7.1.2 Additional Data Types
VoloDB currently supports the ten most basic and important data types, as mentioned in section 4.4.8. Support for advanced data types is still required: the types awaiting implementation are the date and time types, all remaining variations of the binary, floating point and text types, and possibly the spatial data types.
7.1.3 Language Bindings
Currently only a C++ language binding is supported, so clients can only use C++ to communicate with VoloDB. As the VoloDB protocol is based on Protocol Buffers, supporting other languages such as Java and Python would take little time, since many of the required classes can be auto generated by the Protocol Buffers compiler.
7.1.4
Load Balancing
Every user request varies in nature. An operation can be as simple as a single
get or as complex as a transaction with numerous operations spanning many
tables. Users can also send a large number of queries in bulk in a single message.
Situations can therefore arise where one executor sits idle or is underutilized
because it has only been executing simpler queries.
Currently, definer threads assign incoming requests randomly to executor threads
without analyzing their complexity or the queue length and load of the executors.
Taking the complexity of user queries and the queue lengths of the executors into
account would enable better load balancing and could yield a performance boost.
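One possible refinement can be sketched in a few lines: instead of picking an executor at random, a definer thread could track a rough pending-cost counter per executor and dispatch to the least loaded one. The code below is plain standard C++ and purely illustrative (ExecutorLoad and pickExecutor are hypothetical names, not VoloDB internals):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-executor state: outstanding work, weighted by a rough
// cost estimate per query (e.g. 1 for a single get, more for a scan or a
// bulk batch). Incremented on dispatch, decremented when a result returns.
struct ExecutorLoad { std::size_t pendingCost; };

// Least-loaded dispatch: return the index of the executor with the
// smallest pending cost. Assumes at least one executor exists.
std::size_t pickExecutor(const std::vector<ExecutorLoad>& executors)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < executors.size(); ++i)
        if (executors[i].pendingCost < executors[best].pendingCost)
            best = i;
    return best;
}
```

A real implementation would also have to update these counters concurrently from definer and executor threads, which this sketch deliberately leaves out.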
Appendix A
User Guide
In this appendix we will discuss how to install, configure and run VoloDB. A tutorial
on the C++ client library is also presented to help developers write applications for
VoloDB.
A.1 Download
Clone VoloDB repository from [email protected]:dc-sics/VoloDB.git
A.2 Project Structure
The VoloDB directory contains the following projects and files:
1. volodb-common: a shared static library containing functionality used by
both the VoloDB server and its client library.
2. volodbserver: the project directory of the VoloDB server.
3. volodb: a static library of the VoloDB server that allows other projects to use
VoloDB server functionality.
4. volodb-client: a static library used by client applications to communicate
with VoloDB.
5. volodbclientapp: contains test programs to query VoloDB.
6. volodb-install.sh: an install script that installs all required binaries, libraries, header and configuration files.
A.3 Prerequisites
Several packages need to be installed before VoloDB can be compiled. Please note
that the version numbers mentioned are those with which VoloDB has been
tested; it should also work with older versions as long as they contain the
functionality used by VoloDB.
1. MySQL Cluster 7.4.5 (https://dev.mysql.com/downloads/cluster/).
2. ZeroMQ 4.0.5 (http://zeromq.org/intro:get-the-software).
3. Boost 1.57 (http://www.boost.org/users/download/).
4. Protocol Buffers 2.6.0 (https://developers.google.com/protocol-buffers/
docs/downloads).
Refer to each project's original guide to learn how to install it.
A.4 Setting up VoloDB
VoloDB has been tested with OS X Yosemite 10.10, Ubuntu 12.04 and CentOS 6.6,
but it should also work with other Linux flavors and versions.
A.4.1 Installation
Go to the directory of the volodb-server project and run the make command. When
the command completes successfully, the VoloDB server executable will be created
in the project's dist folder.
A.4.2 Quick Installation
Alternatively, volodb-install.sh in the root directory can be executed to install
the VoloDB server binary, libraries, header and configuration files.
A.4.3 Configuration
In order to run VoloDB, a configuration file must be provided. A sample configuration file is shown in listing 1.
58
[settings]
no_definer_threads = 5
no_executor_threads = 30
transport_poll_wait = 10
store_poll_wait = 10
store_port = 5570
mysql_server_ip = 192.168.0.1
mysql_port = 3306
ndb_connect_string = 192.168.0.1
ndb_port = 1186
mysql_user_name = hop
mysql_password = hop
database=vdb
log_level=INFO
Listing 1: Example of VoloDB Configuration File
The settings are briefly explained below:
no_definer_threads: Number of definer threads. Must be at least 1.
no_executor_threads: Number of executor threads. Must be at least 1.
transport_poll_wait: Time in milliseconds before definer threads check for client
messages on transport sockets. All messages queued in the specified time are processed in bulk.
store_poll_wait: Time in milliseconds before executor threads check for processed messages sent by definer threads. All messages queued in the specified time
are executed in bulk.
store_port: Port on which VoloDB listens for client requests.
mysql_server_ip: IP address of MySQL Server.
mysql_port: Port number of MySQL Server.
ndb_connect_string: Connect string (IP address) of the NDB management server.
ndb_port: Port number of NDB.
mysql_user_name: Username to use while connecting to MySQL Server.
mysql_password: Password to use while connecting to MySQL Server.
database: Database/Schema name where VoloDB stores are created. This database
must already exist before running VoloDB.
log_level: Logging level of the server. The supported values are: OFF, ERROR,
WARNING, INFO, DEBUG, DEBUG1, DEBUG2, DEBUG3, DEBUG4.
Create a file with settings similar to listing 1, using appropriate values. If
any of the settings are missing, default values will be used.
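The fallback rule can be pictured as a lookup against a table of defaults. The sketch below is illustrative only: the setting names mirror listing 1, but the default values and the settingOrDefault helper are assumptions, not necessarily what VoloDB compiles in:

```cpp
#include <map>
#include <string>

// Hypothetical sketch of "missing settings fall back to defaults".
// 'parsed' holds the key/value pairs read from the configuration file.
std::string settingOrDefault(const std::map<std::string, std::string>& parsed,
                             const std::string& key)
{
    // Illustrative defaults; keys mirror listing 1.
    static const std::map<std::string, std::string> defaults = {
        {"no_definer_threads",  "5"},
        {"no_executor_threads", "30"},
        {"store_port",          "5570"},
        {"log_level",           "INFO"},
    };
    auto it = parsed.find(key);
    if (it != parsed.end())
        return it->second;                            // value from the file wins
    auto d = defaults.find(key);
    return d != defaults.end() ? d->second : "";      // else default, or empty
}
```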
A.4.4 Execution
To run VoloDB, go to the directory containing the VoloDB executable and run the
following command:
./volodbserver -c <path_to_config_file>
where the -c flag is used to pass the file path of the configuration file. If VoloDB
is run without any configuration file information, it will try to open /etc/volodb.conf.
If correct information is given in the configuration file, the server will start
successfully and begin listening for client requests.
A.5 Setting up VoloDB Client Library
VoloDB also provides a C++ client library project that users can link against to
communicate with VoloDB.
A.5.1 Compilation
Go to the directory of the volodb-client project and run the make command. When
the command completes successfully, the library file will be created in the project's
dist folder. If VoloDB was installed through the volodb-install.sh script, the library
will already be available at /usr/local/lib/volodb without the need to run make.
A.5.2 Usage
In order to use the VoloDB client library, link it with your project. Since it is a
static library, it contains all the code it needs and no further configuration is
required. The client application will also need the library's header files, which are
available in the volodb-client project, or in /usr/local/include/volodb if VoloDB
was installed using the volodb-install.sh script.
A.6 Sample Application
We will create, step by step, a very simple application that tracks users and their
orders, using VoloDB as a backend.
A.6.1 Setup
Before writing an application, two steps have to be performed.
1. Link against libvolodb-client static library.
2. Include volodb-client project’s header files.
Let us suppose that we are creating an e-commerce website and need to store
and manage the relevant data. We want to save two types of information: user
data and order information. For the sake of simplicity, both stores will contain
only very basic information. They will have the following structure:
User Store: User ID (Primary Key, Distribution Key), Name.
Order Store: Order ID (Primary Key), User ID (Primary Key, Distribution Key),
Price.
A.6.2 Important Classes
Before looking into specific examples, we will first go through a few of the
important classes that will be used throughout the application.
Definition Operations:
Definition operations are declared in the volodb-common/ProtoBuf/DefinitionOperations.pb.h
header and consist of classes to create and delete a store in VoloDB. The two relevant classes are:
• CreateStoreOperation to create a store.
• DeleteStoreOperation to delete a store.
Manipulation Operations:
These operations are declared in the volodb-common/ProtoBuf/ManipulationOperations.pb.h
header and consist of classes to manipulate data in VoloDB stores.
• SetOperation is used to insert or update a key value pair.
• GetOperation is used for fetching one or more key value pairs.
• DeleteOperation is used for deleting a key value pair.
Column Classes:
These classes are declared in the volodb-common/ProtoBuf/Column.pb.h header.
There are two important classes related to store columns:
• ColumnInfo contains information about a column, such as its type, name, length
and flags indicating whether it is a primary or a distribution key.
• ColumnValue contains value for a particular ColumnInfo.
Builders:
These classes are declared in the volodb-common/AttributeUtils/ColumnsValueBuilder.h
and volodb-common/AttributeUtils/ColumnsValueDecoder.h header files. They are
helper classes that allow developers to encode and decode values in ColumnValue.
• ColumnsValueBuilder encodes native values into a form expected by VoloDB.
• ColumnsValueDecoder decodes values back into native form.
Executor:
It is declared in volodb-client/Executor/Executor.h and contains the class responsible
for executing user operations. Execution is completely asynchronous, so the
user must register a callback to be notified of a response from VoloDB.
Result:
Declared in volodb-common/ProtoBuf/Result.pb.h, this class contains a response
from VoloDB. It holds information such as the returned rows, error information
(if any) and the transaction ID.
Refer to the header files for complete information on the classes and functions provided.
A.6.3 Store Creation
We need to create two stores for our sample e-commerce application. After creating
an instance of CreateStoreOperation, we add a ColumnInfo for each column and set
its properties.
CreateStoreOperation user_info_operation;
user_info_operation.mutable_store_info()->set_store_name("UserInfo");
ColumnInfo* user_info_column1 = user_info_operation.add_column();
user_info_column1->set_name("user_id");
user_info_column1->set_primary_key(true);
user_info_column1->set_distribution_key(true);
user_info_column1->set_type(Column_Type_UINT32);
ColumnInfo* user_info_column2 = user_info_operation.add_column();
user_info_column2->set_name("name");
user_info_column2->set_type(Column_Type_VARCHAR);
user_info_column2->set_length(300);
CreateStoreOperation create_orders_operation;
create_orders_operation.mutable_store_info()->set_store_name("Orders");
ColumnInfo* orders_column1 = create_orders_operation.add_column();
orders_column1->set_name("order_id");
orders_column1->set_primary_key(true);
orders_column1->set_type(Column_Type_UINT32);
ColumnInfo* orders_column2 = create_orders_operation.add_column();
orders_column2->set_name("user_id");
orders_column2->set_primary_key(true);
orders_column2->set_distribution_key(true);
orders_column2->set_type(Column_Type_UINT32);
ColumnInfo* orders_column3 = create_orders_operation.add_column();
orders_column3->set_name("price");
orders_column3->set_type(Column_Type_FLOAT);
At some point in the application, an instance of Executor must be initialized with
callback information and the VoloDB server details.
Executor executor(this, "192.168.0.1", 5570);
//this refers to class conforming to result callback protocol
The operations are then executed using the execute function, either individually or
in bulk. IDs for the operations are also passed, which are received back in the
executor's callback.
//execute individually
executor.execute(user_info_operation, "user_table_oper_id");
executor.execute(create_orders_operation, "orders_table_oper_id");
//execute in bulk
executor.execute({user_info_operation,create_orders_operation},
"user_table_oper_id", "orders_table_oper_id");
The application should also conform to the ExecutorDelegate protocol to receive callbacks from the server, by overriding the following function:
virtual void didReceiveResponse(Executor* executor, vector<Result*>& result)
{
if(result[0]->has_error_code())
cout<<result[0]->error_description();
else
cout<<result[0]->transaction_identifier()<<" executed successfully";
}
A.6.4 Key Value Pair Insertion
We will now add a new user and two of his pending orders. Key value pairs are
added using SetOperation. Let us first prepare the values for the user information.
SetOperation new_user_info;
new_user_info.add_store_info()->set_store_name("UserInfo");
ColumnsValueBuilder userInfoPrimarykeyValueBuilder;
userInfoPrimarykeyValueBuilder.addUInt32Column("user_id", 1, true, true);
ColumnsValueBuilder userInfoValueBuilder;
userInfoValueBuilder.addVarCharColumn("name", string("Ali Dar"));
ColumnsValueBuilder::SetPrimaryKeys(new_user_info,
userInfoPrimarykeyValueBuilder.getColumns());
ColumnsValueBuilder::SetValues(new_user_info,
userInfoValueBuilder.getColumns());
For every SetOperation the user has to provide two sets of data: the keys and the
values, supplied separately. The user can set the ColumnValue for each column
manually, but this requires internal knowledge of the wire format. Instead, the
VoloDB client library provides a helper class, ColumnsValueBuilder, to populate and
set up column values; it provides overloaded functions for every supported data
type. After the data for a key value pair is populated, it is assigned to the
SetOperation using the SetPrimaryKeys() and SetValues() functions.
Similarly, set up two SetOperations for the user's two orders.
//Setup Order no 1
SetOperation new_order_1;
new_order_1.add_store_info()->set_store_name("Orders");
ColumnsValueBuilder order1PrimarykeyValueBuilder;
order1PrimarykeyValueBuilder.addUInt32Column("order_id", 1, true);
order1PrimarykeyValueBuilder.addUInt32Column("user_id", 1,
true, true);
ColumnsValueBuilder order1InfoValueBuilder;
order1InfoValueBuilder.addFloatColumn("price", 101.45);
ColumnsValueBuilder::SetPrimaryKeys(new_order_1,
order1PrimarykeyValueBuilder.getColumns());
ColumnsValueBuilder::SetValues(new_order_1,
order1InfoValueBuilder.getColumns());
//Setup Order no 2
SetOperation new_order_2;
new_order_2.add_store_info()->set_store_name("Orders");
ColumnsValueBuilder order2PrimarykeyValueBuilder;
order2PrimarykeyValueBuilder.addUInt32Column("order_id", 2, true);
order2PrimarykeyValueBuilder.addUInt32Column("user_id", 1,
true, true);
ColumnsValueBuilder order2InfoValueBuilder;
order2InfoValueBuilder.addFloatColumn("price", 200.0);
ColumnsValueBuilder::SetPrimaryKeys(new_order_2,
order2PrimarykeyValueBuilder.getColumns());
ColumnsValueBuilder::SetValues(new_order_2,
order2InfoValueBuilder.getColumns());
Now we can insert all the values individually or together in an atomic way. We
would like to execute them in a single transaction, so that either all are added or none.
//true is passed to tell executor to run all operations
//in a single transaction
executor.execute({new_user_info, new_order_1, new_order_2},
{"my_transaction_id"}, true);
The callback will be called when the transaction either completes successfully or
fails. Refer to section A.6.3 on how to receive a result from VoloDB.
A.6.5 Fetching a Key Value Pair
Suppose we would like to fetch the name of the user with ID 1. The value for a
key can be fetched using GetOperation by providing the full primary key values.
GetOperation getOperation;
StoreInfo* storeInfo = getOperation.add_store_info();
storeInfo->set_store_name("UserInfo");
ColumnsValueBuilder primarykeyValueBuilder;
primarykeyValueBuilder.addUInt32Column("user_id", (uint32_t)1, true);
ColumnsValueBuilder::SetPrimaryKeys(getOperation,
primarykeyValueBuilder.getColumns());
Values returned from VoloDB can be fetched in the callback.
virtual void didReceiveResponse(Executor* executor, vector<Result*>& result)
{
if(result[0]->has_error_code())
cout<<result[0]->error_description();
else
{
//fetching first result only in this case, system aggregates the
//received responses from server and delivers them in one go
Result* r = result[0];
//fetching only one result because transaction contains only
//one operation
const OperationResult& operationResult = r->result(0);
//fetching only one row because it was a primary key lookup
const Row& row = operationResult.row(0);
for(int k = 0; k < row.column_size(); k++)
{
const ColumnValue& columnValue = row.column(k);
if(columnValue.name().compare("name") == 0)
cout<<"User name"<<": "<<
ColumnsValueDecoder::getVarChar(columnValue)<<endl;
else if(columnValue.name().compare("user_id") == 0)
cout<<"User ID"<<": "<<
ColumnsValueDecoder::getUInt32(columnValue)<<endl;
}
}
}
A.6.6 Fetching Key Value Pairs using Partition Key
Let us now fetch all the orders of user 1. Fetching them requires a table scan on
the user_id column, since they can not be fetched through a primary key lookup.
Table scans are discouraged because they are slow: they have to hit every data node.
But if we observe the structure of the tables, user_id is the partition key in both
the UserInfo and Orders tables. VoloDB supports queries based on the partition key
because they will only hit one data node. Since the query is not based on a full
primary key, more than one key value pair can be returned.
To fetch all the orders of user 1, we will query the Orders table on the user_id
column.
GetOperation pruneIndexOperation;
StoreInfo* storeInfo = pruneIndexOperation.add_store_info();
storeInfo->set_store_name("Orders");
ColumnsValueBuilder primarykeyValueBuilder;
//passing (true, true) because user_id column is a part of
//primary and distribution key
primarykeyValueBuilder.addUInt32Column("user_id", (uint32_t)1, true, true);
ColumnsValueBuilder::SetPrimaryKeys(pruneIndexOperation,
primarykeyValueBuilder.getColumns());
//now execute it
executor.execute(pruneIndexOperation, "prune_index_transaction_id");
In the Executor's callback function we can fetch all the returned rows, as in the
example in section A.6.5.
A.6.7 Fetching Key Value Pairs using Non-Keyed Column
Suppose we want to fetch all the orders with a price of 200.0. The price column
is neither a primary key nor a distribution key, so we can not do a primary key
fetch or a prune index scan. We have to initiate a table scan on the price column
with the value 200.0. Instead of using the ColumnsValueBuilder::SetPrimaryKeys()
function, in this case we use ColumnsValueBuilder::SetValues(), because the price
column is not a primary key. The user is also allowed to do a full table scan without
qualification on any column, which can end up returning all rows of the store.
Though supported, this operation is not recommended, because table scans are slow
and can also slow down other users.
GetOperation scanOperation;
StoreInfo* storeInfo = scanOperation.add_store_info();
storeInfo->set_store_name("Orders");
ColumnsValueBuilder valueBuilder;
//not passing true flag, since it is not a primary or a distribution key
valueBuilder.addFloatColumn("price", (float)200.0);
ColumnsValueBuilder::SetValues(scanOperation,
valueBuilder.getColumns());
//now execute it
executor.execute(scanOperation, "scan_transaction_id");
In the Executor's callback function we can fetch all the returned rows, as in the
example in section A.6.5.
A.6.8 Key Value Pair Deletion
Let us assume that order 2 of user 1 is now fulfilled and we would like to delete
it from the Orders store. A key value pair is deleted by executing a
DeleteOperation. The user has to provide values for the primary key columns in
order to delete the pair.
DeleteOperation deleteOrderOperation;
StoreInfo* storeInfo = deleteOrderOperation.add_store_info();
storeInfo->set_store_name("Orders");
ColumnsValueBuilder primarykeyValueBuilder;
primarykeyValueBuilder.addUInt32Column("order_id", (uint32_t)2, true);
primarykeyValueBuilder.addUInt32Column("user_id", (uint32_t)1, true, true);
ColumnsValueBuilder::SetPrimaryKeys(deleteOrderOperation,
primarykeyValueBuilder.getColumns());
executor.execute(deleteOrderOperation, "my_transaction_id");
A.6.9 Store Deletion
A store can be deleted using DropStoreOperation. The user only has to set the
name of the store and then execute the operation using the Executor. Let us delete
the user information and order stores that we created earlier.
DropStoreOperation drop_user_operation;
drop_user_operation.mutable_store_info()->set_store_name("UserInfo");
DropStoreOperation drop_orders_operation;
drop_orders_operation.mutable_store_info()->set_store_name("Orders");
executor.execute({drop_user_operation,drop_orders_operation},
"user_table_oper_id", "orders_table_oper_id");