Scalable Data Management
NoSQL Data Stores in Research
and Practice
Felix Gessert
[email protected]
October 3, 2016, Percona Live, Amsterdam
@baqendcom
About me
Felix Gessert
NoSQL research for
PhD dissertation
CEO of startup for high-performance serverless
development based on caching and NoSQL
With me: Erik Witt (Baqend), Distributed Systems & Web
Performance Engineer
Slides: slideshare.net/felixgessert
Article: medium.com/baqend-blog
Outline
NoSQL Foundations and
Motivation
The NoSQL Toolbox:
Common Techniques
NoSQL Systems
Decision Guidance: NoSQL
Decision Tree
• The Database Explosion
• NoSQL: Motivation and
Origins
• The 4 Classes of NoSQL
Databases:
• Key-Value Stores
• Wide-Column Stores
• Document Stores
• Graph Databases
• CAP Theorem
Introduction: What are NoSQL
data stores?
Architecture
Typical Data Architecture:
Applications → Operative Database → Data Warehouse → Analytics (Reporting, Data Mining)
Data Management covers the operative side, Data Analytics the warehouse side; NoSQL systems increasingly complement or replace the operative database.
The era of one-size-fits-all database systems is over
 Specialized data systems
The Database Explosion
Sweetspots
RDBMS: general-purpose ACID transactions
Parallel DWH: aggregations/OLAP for massive data amounts
NewSQL: high-throughput relational OLTP
Wide-Column Store: long scans over structured data
Document Store: deeply nested data models
Key-Value Store: large-scale session storage
Graph Database: graph algorithms & queries
In-Memory KV-Store: counting & statistics
Wide-Column Store: massive user-generated content
The Database Explosion
Cloud-Database Sweetspots
Realtime BaaS: communication and collaboration
Managed RDBMS: general-purpose ACID transactions
Managed Cache: caching and transient storage
Azure Tables (Wide-Column Store): very large tables
Managed Wide-Column Store: massive user-generated content
Backend-as-a-Service: small websites and apps
Google Cloud Storage (Object Store): massive file storage
Amazon Elastic MapReduce (Hadoop-as-a-Service): big data analytics
Managed NoSQL: full-text search
How to choose a database system?
Many Potential Candidates
Application layer data: billing data, nested application data, friend network, session data, files, search index, recommendation engine, cached data & metrics
Question in this tutorial: How to approach the database decision problem given these requirements?
NoSQL Databases
• "NoSQL" term coined in 2009
• Interpretation: "Not Only SQL"
• Typical properties:
◦ Non-relational
◦ Open-Source
◦ Schema-less (schema-free)
◦ Optimized for distribution (clusters)
◦ Tunable consistency
NoSQL-Databases.org: current list has over 150 NoSQL systems
NoSQL Databases
• Two main motivations:
◦ Scalability: user-generated data and request load outgrow a single machine
◦ Impedance mismatch: an aggregate such as an order (ID, customer, line items, payment) has to be mapped onto several relational tables (Orders, Payment, Line Items, Customers)
Scale-up vs Scale-out
Scale-Up (vertical scaling): more RAM, more CPU, more HDD
Scale-Out (horizontal scaling): commodity hardware, shared-nothing architecture
Schemafree Data Modeling
RDBMS: explicit schema, e.g. SELECT Name, Age FROM Customers
NoSQL DB: implicit schema, e.g. Item[Price], Item[Discount]
Big Data
The Analytic side of NoSQL
• Idea: make existing massive, unstructured data amounts usable
• Sources: structured data (DBs), log files, documents, texts, tables, images, videos, sensor data, social media, data services
• Users: analysts, data scientists, software developers
• Results: statistics, cubes, reports, recommenders, classifiers, clustering, knowledge
Not covered here: NoSQL + Analytics
Example: Lambda Architecture
Given an analytic function f: result = f(all data) = f(old data) ⊙ f(new data)
New data is processed twice: the Batch Layer precomputes a Batch View over old data, the Speed Layer maintains a Realtime View over new data; queries merge both views.
NoSQL Paradigm Shift
Open Source & Commodity Hardware
Commercial DBMS
Open-Source DBMS
Specialized DB hardware
(Oracle Exadata, etc.)
Commodity hardware
Highly available network
(Infiniband, Fabric Path, etc.)
Commodity network
(Ethernet, etc.)
Highly Available Storage (SAN,
RAID, etc.)
Commodity drives (standard
HDDs, JBOD)
NoSQL Paradigm Shift
Shared Nothing Architectures
Shift towards higher distribution & less coordination:
Shared Memory
e.g. "Oracle 11g"
Shared Disk
e.g. "Oracle RAC"
Shared Nothing
e.g. "NoSQL"
NoSQL System Classification

Two common criteria:
Data
Model
Consistency/Availability
Trade-Off
Key-Value
AP: Available & Partition
Tolerant
Wide-Column
CP: Consistent &
Partition Tolerant
Document
Graph
CA: Not Partition
Tolerant
Key-Value Stores


Data model: (key) -> value
Interface: CRUD (Create, Read, Update, Delete)
Key → Value (an opaque blob), e.g.:
users:2:friends → {23, 76, 233, 11}
users:2:inbox → [234, 3466, 86, 55]
users:2:settings → Theme → "dark", cookies → "false"

Examples: Amazon Dynamo (AP), Riak (AP), Redis (CP)
Wide-Column Stores


Data model: (rowkey, column, timestamp) -> value
Interface: CRUD, Scan
Row Key: com.cnn.www
Columns with timestamped versions, e.g. content : "<html>…" (several versions), title : "CNN", crawled : …
Examples: Cassandra (AP), Google BigTable (CP),
HBase (CP)
Document Stores


Data model: (collection, key) -> document
Interface: CRUD, queries, Map-Reduce
JSON Document
ID/Key
order-12338
{
order-id: 23,
customer: { name : "Felix Gessert", age : 25 }
line-items : [ {product-name : "x", …} , …]
}

Examples: CouchDB (AP), RethinkDB (CP), MongoDB
(CP)
Graph Databases


Data model: G = (V, E), property-graph model
Interface: traversal algorithms, queries, transactions
Nodes and edges carry properties, e.g. node (company: Apple, value: 300 billion) –[WORKS_FOR since: 1999, salary: 140K]→ node (name: John Doe)

Examples: Neo4j (CA), InfiniteGraph (CA), OrientDB (CA)
Search Platforms
• Data model: vector-space model, documents + metadata (key-value fields)
• Examples: Solr, ElasticSearch
REST API example: POST /lectures/dis { "topic": "databases", "lecturer": "ritter", … }
The search server maintains an inverted index that maps terms to the documents containing them, e.g. database → 3, 4, 1; ritter → 1
Object-oriented Databases
• Data model: classes, objects (with properties), relations (references)
• Interface: CRUD, queries, transactions
• Examples: Versant (CA), db4o (CA), Objectivity (CA)
XML databases, RDF Stores



Data model: XML, RDF
Interface: CRUD, queries (XPath, XQuery, SPARQL),
transactions (some)
Examples: MarkLogic (CA), AllegroGraph (CA)
Distributed File System

Data model: files + folders
Network FS
Cluster FS
Distributed FS
Client
RPC
RPC
RPC
I/O Nodes
Stub
Server
NFS, AFS
SAN
GPFS, Lustre
HDFS
Big Data Frameworks


Data model: arbitrary (frequently unstructured)
Examples: Hadoop, Spark, Flink, DryadLINQ, Pregel
Algorithms
Log files
Unstructured
Files
-Aggregation
-Machine
Learning
-Correlation
-Clustering
Databases
Data
Batch Analytics
Statistics,
Models
Soft NoSQL Systems
Not Covered Here
Search Platforms (Full Text Search):
◦ No persistence and consistency guarantees for OLTP
◦ Examples: ElasticSearch (AP), Solr (AP)
Object-Oriented Databases:
◦ Strong coupling of programming language and DB
◦ Examples: Versant (CA), db4o (CA), Objectivity (CA)
XML-Databases, RDF-Stores:
◦ Not scalable, data models not widely used in industry
◦ Examples: MarkLogic (CA), AllegroGraph (CA)
CAP-Theorem
Consistency
Partition
Tolerance
Availability
Impossible
Only 2 out of 3 properties are
achievable at a time:
◦ Consistency: all clients have the same
view on the data
◦ Availability: every request to a non-failed node must result in a correct
response
◦ Partition tolerance: the system has to
continue working, even under
arbitrary network partitions
Eric Brewer, ACM-PODC Keynote, July 2000
Gilbert, Lynch: Brewer's Conjecture and the Feasibility of
Consistent, Available, Partition-Tolerant Web Services, SigAct News 2002
CAP-Theorem: simplified proof

Problem: when a network partition occurs, either
consistency or availability have to be given up
Block response until
ACK arrives
 Consistency
Replication
Value = V1
Response before
successful replication
 Availability
Value = V0
N2
N1
Network partition
NoSQL Triangle
C (Consistency): all clients share the same view on the data
A (Availability): every client can always read and write
P (Partition tolerance): all nodes continue working under network partitions
CA – Relational: Oracle, MySQL, …
AP – Key-Value: Dynamo, Redis, Riak, Voldemort; Wide-Column: Cassandra; Document-Oriented: SimpleDB
CP – Relational: Postgres, MySQL Cluster, Oracle RAC; Wide-Column: BigTable, HBase, Accumulo, Azure Tables; Document-Oriented: MongoDB, RethinkDB, DocumentDB
Nathan Hurst: Visual Guide to NoSQL Systems
http://blog.nahurst.com/visual-guide-to-nosql-systems
PACELC – an alternative CAP formulation

Idea: Classify systems according to their behavior
during network partitions
Partition?
yes → trade-off: Availability vs. Consistency
no → trade-off: Latency vs. Consistency (no consequence of the CAP theorem)
AL – Dynamo-style: Cassandra, Riak, etc.
AC – MongoDB
CC – always consistent: HBase, BigTable and ACID systems
Abadi, Daniel. "Consistency tradeoffs in modern distributed
database system design: CAP is only part of the story."
Serializability
Not Highly Available Either
Global serializability and availability are incompatible:
T1: Write A=1, Read B → w1(a=1) r1(b=⊥)
T2: Write B=1, Read A → w2(b=1) r2(a=⊥)
Under a partition neither transaction sees the other's write, so the resulting history is not serializable.

Some weaker isolation levels allow high availability:
◦ RAMP Transactions (P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, und I. Stoica, „Scalable
Atomic Visibility with RAMP Transactions“, SIGMOD 2014)
S. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in
partitioned networks. ACM CSUR, 17(3):341–370, 1985.
Impossibility Results
Consensus Algorithms
Consensus:
◦ Agreement (safety): No two processes can commit different decisions
◦ Validity / non-triviality (safety): If all initial values are the same, nodes must commit that value
◦ Termination (liveness): Nodes commit eventually


No algorithm guarantees termination (FLP)
Algorithms:
◦ Paxos (e.g. Google Chubby, Spanner, Megastore, Aerospike,
Cassandra Lightweight Transactions)
◦ Raft (e.g. RethinkDB, etcd service)
◦ Zookeeper Atomic Broadcast (ZAB)
Lynch, Nancy A. Distributed algorithms.
Morgan Kaufmann, 1996.
Where CAP fits in
Negative Results in Distributed Computing
Asynchronous Network, Unreliable Channel:
◦ Atomic Storage – Impossible: CAP Theorem
◦ Consensus – Impossible: 2 Generals Problem
Asynchronous Network, Reliable Channel:
◦ Atomic Storage – Possible: Attiya, Bar-Noy, Dolev (ABD) Algorithm
◦ Consensus – Impossible: Fischer, Lynch, Paterson (FLP) Theorem
Lynch, Nancy A. Distributed algorithms.
Morgan Kaufmann, 1996.
ACID vs BASE
ACID – "gold standard" for RDBMSs: Atomicity, Consistency, Isolation, Durability
BASE – model of many NoSQL systems: Basically Available, Soft State, Eventually Consistent
http://queue.acm.org/detail.cfm?id=1394128
Weaker guarantees in a database?!
Default Isolation Levels in RDBMSs
Database | Default Isolation | Maximum Isolation
Actian Ingres 10.0/10S | S | S
Aerospike | RC | RC
Clustrix CLX 4100 | RR | ?
Greenplum 4.1 | RC | S
IBM DB2 10 for z/OS | CS | S
IBM Informix 11.50 | Depends | RR
MySQL 5.6 | RR | S
MemSQL 1b | RC | RC
MS SQL Server 2012 | RC | S
NuoDB | CR | CR
Oracle 11g | RC | SI
Oracle Berkeley DB | S | S
Postgres 9.2.2 | RC | S
SAP HANA | RC | SI
ScaleDB 1.02 | RC | RC
VoltDB | S | S
RC: read committed, RR: repeatable read, S: serializability,
SI: snapshot isolation, CS: cursor stability, CR: consistent read
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Weaker guarantees in a database?!
Default Isolation Levels in RDBMSs
Theorem: Trade-offs are central to database systems.
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Data Models and CAP provide high-level
classification.
But what about fine-grained requirements, e.g. query capabilities?
Outline
NoSQL Foundations and
Motivation
The NoSQL Toolbox:
Common Techniques
NoSQL Systems
Decision Guidance: NoSQL
Decision Tree
• Techniques for Functional
and Non-functional
Requirements
• Sharding
• Replication
• Storage Management
• Query Processing
Functional
Techniques
Scan Queries
ACID Transactions
Conditional or Atomic Writes
Joins
Non-Functional
Sharding
Data Scalability
Range-Sharding
Hash-Sharding
Entity-Group Sharding
Consistent Hashing
Shared-Disk
Write Scalability
Read Scalability
Replication
Commit/Consensus Protocol
Synchronous
Asynchronous
Primary Copy
Update Anywhere
Elasticity
Consistency
Write Latency
Storage Management
Sorting
Filter Queries
Full-text Search
Aggregation and Analytics
Logging
Update-in-Place
Caching
In-Memory Storage
Append-Only Storage
Read Latency
Query Processing
Read Availability
Global Secondary Indexing
Local Secondary Indexing
Query Planning
Analytics Framework
Materialized Views
Write Throughput
Write Availability
Durability
Functional requirements from the application (scan queries, ACID transactions, conditional or atomic writes, joins, sorting, filter queries, full-text search, aggregation and analytics) determine which of the central techniques a NoSQL database has to employ; these techniques (sharding, replication, storage management, query processing) in turn enable the non-functional, operational requirements (data/read/write scalability, elasticity, consistency, read/write latency, write throughput, read/write availability, durability).
http://www.baqend.com
/files/nosql-survey.pdf
Functional
Techniques
Scan Queries
ACID Transactions
Conditional or Atomic Writes
Joins
Sorting
Non-Functional
Sharding
Data Scalability
Range-Sharding
Hash-Sharding
Entity-Group Sharding
Consistent Hashing
Shared-Disk
Write Scalability
Read Scalability
Elasticity
Sharding (aka Partitioning, Fragmentation)
Scaling Storage and Throughput

Horizontal distribution of data over nodes
Shard 1
Peter
Franz
[G-O]
Shard 2
Shard 3


Partitioning strategies: Hash-based vs. Range-based
Difficulty: Multi-Shard-Operations (join, aggregation)
Sharding
Approaches
Hash-based Sharding
◦ Hash of data values (e.g. key) determines partition (shard)
◦ Pro: Even distribution
◦ Contra: No data locality
Range-based Sharding
◦ Assigns ranges defined over fields (shard keys) to partitions
◦ Pro: Enables Range Scans and Sorting
◦ Contra: Repartitioning/balancing required
Entity-Group Sharding
◦ Explicit data co-location for single-node-transactions
◦ Pro: Enables ACID Transactions
◦ Contra: Partitioning not easily changeable
David J DeWitt and Jim N Gray: “Parallel database systems: The future of high performance
database systems,” Communications of the ACM, volume 35, number 6, pages 85–98, June 1992.
Sharding
Approaches
Hash-based Sharding – implemented in MongoDB, Riak, Redis, Cassandra, Azure Tables, Dynamo
◦ Hash of data values (e.g. key) determines partition (shard)
◦ Pro: Even distribution
◦ Contra: No data locality
Range-based Sharding – implemented in BigTable, HBase, Hypertable, MongoDB, RethinkDB, Espresso, DocumentDB
◦ Assigns ranges defined over fields (shard keys) to partitions
◦ Pro: Enables range scans and sorting
◦ Contra: Repartitioning/balancing required
Entity-Group Sharding – implemented in G-Store, MegaStore, Relational Cloud, Cloud SQL Server
◦ Explicit data co-location for single-node transactions
◦ Pro: Enables ACID transactions
◦ Contra: Partitioning not easily changeable
David J DeWitt and Jim N Gray: “Parallel database systems: The future of high performance
database systems,” Communications of the ACM, volume 35, number 6, pages 85–98, June 1992.
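A minimal sketch of the first two strategies (hash- vs. range-based shard assignment); the shard count and split points are illustrative, not from any particular system:

import java.util.TreeMap;

public class ShardingDemo {
    // Hash-based: even distribution, but no data locality for range scans.
    static int hashShard(String key, int shardCount) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    // Range-based: split points define key ranges; enables range scans and sorting.
    static final TreeMap<String, Integer> SPLIT_POINTS = new TreeMap<>();
    static {
        SPLIT_POINTS.put("", 0);   // [A..G) -> shard 0
        SPLIT_POINTS.put("G", 1);  // [G..O) -> shard 1
        SPLIT_POINTS.put("O", 2);  // [O..Z] -> shard 2
    }
    static int rangeShard(String key) {
        return SPLIT_POINTS.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        System.out.println(hashShard("user:42", 3)); // some shard, neighboring keys scattered
        System.out.println(rangeShard("Miller"));    // shard 1, neighboring keys co-located
    }
}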
Problems of Application-Level Sharding
Example: Tumblr
 Caching
 Sharding from
application
Moved towards:
 Redis
 HBase
Web
Server
Web
Servers
W W W
1
Web
Cache
MySQL
MySQL
LB
LB
Web
Cache
Web
Cache
Web
Servers
W W W
3
Memcached
2
Memcached
MySQL
Web
Cache
Web
Cache
Web
Cache
Web
Servers
W W W
Memcached
4
My
SQL
My
SQL
My
SQL
Manual
Sharding
Functional
Techniques
Non-Functional
ACID Transactions
Read Scalability
Conditional or Atomic Writes
Replication
Commit/Consensus Protocol
Synchronous
Asynchronous
Primary Copy
Update Anywhere
Consistency
Write Latency
Read Latency
Read Availability
Write Availability
Replication
Read Scalability + Failure Tolerance

Stores N copies of each data item
DB Node
DB Node
DB Node


Consistency model: synchronous vs asynchronous
Coordination: Multi-Master, Master-Slave
Özsu, M.T., Valduriez, P.: Principles of distributed database systems.
Springer Science & Business Media (2011)
Replication: When
Asynchronous (lazy)
◦
◦
◦
◦
Writes are acknowledged immediately
Performed through log shipping or update propagation
Pro: Fast writes, no coordination needed
Contra: Replica data potentially stale (inconsistent)
Synchronous (eager)
◦ The node accepting writes synchronously propagates
updates/transactions before acknowledging
◦ Pro: Consistent
◦ Contra: needs a commit protocol (more roundtrips),
unavailable under certain network partitions
Charron-Bost, B., Pedone, F., Schiper, A. (eds.): Replication: Theory and
Practice, Lecture Notes in Computer Science, vol. 5959. Springer (2010)
Replication: When
Asynchronous (lazy) – implemented in Dynamo, Riak, CouchDB, Redis, Cassandra, Voldemort, MongoDB, RethinkDB
◦ Writes are acknowledged immediately
◦ Performed through log shipping or update propagation
◦ Pro: Fast writes, no coordination needed
◦ Contra: Replica data potentially stale (inconsistent)
Synchronous (eager) – implemented in BigTable, HBase, Accumulo, CouchBase, MongoDB, RethinkDB
◦ The node accepting writes synchronously propagates updates/transactions before acknowledging
◦ Pro: Consistent
◦ Contra: needs a commit protocol (more roundtrips), unavailable under certain network partitions
Charron-Bost, B., Pedone, F., Schiper, A. (eds.): Replication: Theory and
Practice, Lecture Notes in Computer Science, vol. 5959. Springer (2010)
Replication: Where
Master-Slave (Primary Copy)
◦ Only a dedicated master is allowed to accept writes, slaves are
read-replicas
◦ Pro: reads from the master are consistent
◦ Contra: master is a bottleneck and SPOF
Multi-Master (Update anywhere)
◦ Any replica node may accept writes; updates or transactions are then propagated to the other replicas
◦ Pro: fast and highly-available
◦ Contra: either needs coordination protocols (e.g. Paxos) or is
inconsistent
Charron-Bost, B., Pedone, F., Schiper, A. (eds.): Replication: Theory and
Practice, Lecture Notes in Computer Science, vol. 5959. Springer (2010)
Synchronous Replication
Example: Two-Phase Commit is not partition-tolerant
1. The coordinator sends prepare to all replicas.
2. Each replica answers prepared once it can guarantee the commit.
3. The coordinator sends commit; the replicas apply the transaction and answer committed.
If a network partition separates the coordinator from some replicas, prepared replicas can neither commit nor abort and the protocol blocks, so 2PC gives up availability (a minimal sketch of this flow follows below).
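A minimal coordinator sketch of the message flow above; the interface and class names are illustrative, not from a specific system:

import java.util.List;

interface Participant {
    boolean prepare();   // vote: true = "prepared", false = abort
    void commit();
    void abort();
}

class TwoPhaseCommitCoordinator {
    // Blocks until every participant has voted; under a network partition,
    // missing votes or acks leave the protocol stuck -> not partition-tolerant.
    boolean execute(List<Participant> participants) {
        boolean allPrepared = true;
        for (Participant p : participants) {   // phase 1: prepare
            allPrepared &= p.prepare();
        }
        for (Participant p : participants) {   // phase 2: commit or abort
            if (allPrepared) p.commit(); else p.abort();
        }
        return allPrepared;
    }
}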
Consistency Levels
Linearizability
Causal
Consistency
PRAM
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
Monotonic
Reads
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Consistency Levels
Linearizability
Causal
Consistency
PRAM
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
Monotonic
Reads
Either version-based or
time-based. Both not
highly available.
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Consistency Levels
Linearizability
Causal
Consistency
Writes in one session are
strictly ordered on all
PRAM
replicas.
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
Monotonic
Reads
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Consistency Levels
Linearizability
Causal
Consistency
Versions a client reads in
a session increase
PRAM
monotonically.
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
Monotonic
Reads
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Consistency Levels
Linearizability
Causal
Consistency
Clients directly
see their own
writes.
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
PRAM
Monotonic
Reads
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Consistency Levels
Linearizability
Causal
Consistency
If a value is read, any causally
relevant data items that lead to
that value are available, too.
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
PRAM
Monotonic
Reads
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Consistency Levels
Achievable with high availability
Bailis, Peter, et al. "Bolt-on causal
consistency." SIGMOD, 2013.
Linearizability
Causal
Consistency
PRAM
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
Monotonic
Reads
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Consistency Levels
Linearizability
Causal
Consistency
Strategies:
• Single-mastered reads and
writes
• Multi-master replication with
consensus on writes
PRAM
Writes
Follow Reads
Read Your
Writes
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
Monotonic
Reads
Monotonic
Writes
Bounded
Staleness
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Problem: Terminology
Viotti, Paolo, and Marko Vukolić. "Consistency in NonTransactional Distributed Storage Systems." arXiv (2015).
Bailis, Peter, et al. "Highly available transactions: Virtues and
limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.
Read Your Writes (RYW)
Definition: Once the user has written a value, subsequent reads will
return this value (or newer versions if other writes occurred in
between); the user will never see versions older than his last write.
https://blog.acolyer.org/2016/02/26/distributed-consistencyand-session-anomalies/
Wiese, Lena. Advanced Data Management: For SQL, NoSQL, Cloud
and Distributed Databases. De Gruyter, 2015.
Monotonic Reads (MR)
Definition: Once a user has read a version of a data item on one replica
server, it will never see an older version on any other replica server
https://blog.acolyer.org/2016/02/26/distributed-consistencyand-session-anomalies/
Wiese, Lena. Advanced Data Management: For SQL, NoSQL, Cloud
and Distributed Databases. De Gruyter, 2015.
Monotonic Writes (MW)
Definition: Once a user has written a new value for a data item in a
session, any previous write has to be processed before the current
one. I.e., the order of writes inside the session is strictly maintained.
https://blog.acolyer.org/2016/02/26/distributed-consistencyand-session-anomalies/
Wiese, Lena. Advanced Data Management: For SQL, NoSQL, Cloud
and Distributed Databases. De Gruyter, 2015.
Writes Follow Reads (WFR)
Definition: When a user reads a value written in a session after that
session already read some other items, the user must be able to see
those causally relevant values too.
https://blog.acolyer.org/2016/02/26/distributed-consistencyand-session-anomalies/
Wiese, Lena. Advanced Data Management: For SQL, NoSQL, Cloud
and Distributed Databases. De Gruyter, 2015.
PRAM and Causal Consistency



Combinations of previous session consistency guarantees
◦ PRAM = MR + MW + RYW
◦ Causal Consistency = PRAM + WFR
All consistency levels up to causal consistency can be
guaranteed with high availability
Example: Bolt-on causal consistency
Bailis, Peter, et al. "Bolt-on causal consistency."
Proceedings of the 2013 ACM SIGMOD, 2013.
Bounded Staleness

Either time-based:
t-Visibility (Δ-atomicity): the inconsistency window comprises
at most t time units; that is, any value that is returned upon
a read request was up to date t time units ago.

Or version-based:
k-Staleness: the inconsistency window comprises at most k
versions; that is, a returned value lags at most k versions behind the most
recent version.

Both are not achievable with high availability
Wiese, Lena. Advanced Data Management: For SQL, NoSQL, Cloud
and Distributed Databases. De Gruyter, 2015.
Functional
Techniques
Non-Functional
Storage Management
Logging
Update-in-Place
Caching
In-Memory Storage
Append-Only Storage
Read Latency
Write Throughput
Durability
NoSQL Storage Management
In a Nutshell
RAM → SSD → HDD: speed and cost decrease while capacity increases; RAM is volatile, SSD and HDD are durable.
Typical uses in DBMSs:
◦ RAM: caching, primary storage, data structures → improves read latency
◦ SSD: caching, logging, primary storage
◦ HDD: logging, primary storage
In-memory storage and caching improve read latency; append-only I/O increases write throughput; logging promotes the durability of write operations.
(RR: random reads, RW: random writes, SR: sequential reads, SW: sequential writes)
Functional
Techniques
Non-Functional
Joins
Sorting
Read Latency
Filter Queries
Query Processing
Full-text Search
Aggregation and Analytics
Global Secondary Indexing
Local Secondary Indexing
Query Planning
Analytics Framework
Materialized Views
Local Secondary Indexing
Partitioning By Document
Partition I data: 12 → Red, 56 → Blue, 77 → Red; local index: Red → [12, 77], Blue → [56]
Partition II data: 104 → Yellow, 188 → Blue, 192 → Blue; local index: Yellow → [104], Blue → [188, 192]
Kleppmann, Martin. "Designing data-intensive
applications." (2016).
Local Secondary Indexing
Partitioning By Document
Indexing is always local to a partition, so a query like WHERE color=blue has to be sent to every partition and the results merged (scatter-gather query pattern).
Implemented in: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB
Kleppmann, Martin. "Designing data-intensive
applications." (2016).
Global Secondary Indexing
Partitioning By Term
The data is partitioned as before, but the index itself is partitioned by term: e.g. one index partition holds Yellow → [104] and Blue → [56, 188, 192], another holds Red → [12, 77].
Kleppmann, Martin. "Designing data-intensive
applications." (2016).
Global Secondary Indexing
Partitioning By Term
A query like WHERE color=blue becomes a targeted query to the single index partition responsible for the term, but consistent index maintenance requires a distributed transaction.
Implemented in: DynamoDB, Oracle Data Warehouse, Riak (Search), Cassandra (Search)
Kleppmann, Martin. "Designing data-intensive
applications." (2016).
Query Processing Techniques
Summary





Local Secondary Indexing: Fast writes, scatter-gather
queries
Global Secondary Indexing: Slow or inconsistent writes,
fast queries
(Distributed) Query Planning: scarce in NoSQL systems
but increasing (e.g. left-outer equi-joins in MongoDB
and θ-joins in RethinkDB)
Analytics Frameworks: fallback for missing query
capabilities
Materialized Views: similar to global indexing
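An illustrative sketch (not any system's API) contrasting the two indexing strategies on the color example from the previous slides:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SecondaryIndexDemo {
    // Each data partition keeps its own term -> keys index (partitioning by document).
    static List<Map<String, List<Integer>>> localIndexes = List.of(
        Map.of("Red", List.of(12, 77), "Blue", List.of(56)),         // partition I
        Map.of("Yellow", List.of(104), "Blue", List.of(188, 192)));  // partition II

    // Scatter-gather: ask every partition and merge the results.
    static List<Integer> queryLocal(String term) {
        List<Integer> result = new ArrayList<>();
        for (Map<String, List<Integer>> index : localIndexes) {
            result.addAll(index.getOrDefault(term, List.of()));
        }
        return result;
    }

    // Global index partitioned by term: a single targeted lookup,
    // but every write may have to update an index entry on another partition.
    static Map<String, List<Integer>> globalIndex = Map.of(
        "Red", List.of(12, 77), "Yellow", List.of(104), "Blue", List.of(56, 188, 192));

    static List<Integer> queryGlobal(String term) {
        return globalIndex.getOrDefault(term, List.of());
    }

    public static void main(String[] args) {
        System.out.println(queryLocal("Blue"));   // [56, 188, 192] gathered from two partitions
        System.out.println(queryGlobal("Blue"));  // [56, 188, 192] from one partition
    }
}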
How are the techniques from the NoSQL
toolbox used in actual data stores?
Outline
NoSQL Foundations and
Motivation
The NoSQL Toolbox:
Common Techniques
NoSQL Systems
Decision Guidance: NoSQL
Decision Tree
• Overview & Popularity
• Core Systems:
• Dynamo
• BigTable
• Riak
• HBase
• Cassandra
• Redis
• MongoDB
NoSQL Landscape
Document
Google
Datastore
Wide Column
Key-Value
Graph
Project Voldemort
Popularity
http://db-engines.com/de/ranking
# | System | Model | Score
1. | Oracle | Relational DBMS | 1462.02
2. | MySQL | Relational DBMS | 1371.83
3. | MS SQL Server | Relational DBMS | 1142.82
4. | MongoDB | Document store | 320.22
5. | PostgreSQL | Relational DBMS | 307.61
6. | DB2 | Relational DBMS | 185.96
7. | Cassandra | Wide column store | 134.50
8. | Microsoft Access | Relational DBMS | 131.58
9. | Redis | Key-value store | 108.24
10. | SQLite | Relational DBMS | 107.26
11. | Elasticsearch | Search engine | 86.31
12. | Teradata | Relational DBMS | 73.74
13. | SAP Adaptive Server | Relational DBMS | 71.48
14. | Solr | Search engine | 65.62
15. | HBase | Wide column store | 51.84
16. | Hive | Relational DBMS | 47.51
17. | FileMaker | Relational DBMS | 46.71
18. | Splunk | Search engine | 44.31
19. | SAP HANA | Relational DBMS | 41.37
20. | MariaDB | Relational DBMS | 33.97
21. | Neo4j | Graph DBMS | 32.61
22. | Informix | Relational DBMS | 30.58
23. | Memcached | Key-value store | 27.90
24. | Couchbase | Document store | 24.29
25. | Amazon DynamoDB | Multi-model | 23.60
Scoring: Google/Bing results, Google Trends, Stackoverflow, job offers, LinkedIn
History
Google File System
MapReduce
CouchDB
BigTable
Dynamo
MongoDB
Hadoop &HDFS
Cassandra
Riak
HBase
Redis
CouchBase
MegaStore
RethinkDB
HyperDeX
Spanner
F1
Espresso
CockroachDB
Dremel
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
NoSQL foundations

BigTable (2006, Google)
◦ Consistent, Partition Tolerant
◦ Wide-Column data model
◦ Master-based, fault-tolerant, large clusters (1,000+ nodes)
◦ HBase, Cassandra, HyperTable, Accumulo

Dynamo (2007, Amazon)
◦
◦
◦
◦
Available, Partition tolerant
Key-Value interface
Eventually Consistent, always writable, fault-tolerant
Riak, Cassandra, Voldemort, DynamoDB
Chang, Fay, et al. "Bigtable: A distributed storage system
for structured data."
DeCandia, Giuseppe, et al. "Dynamo: Amazon's highly
available key-value store."
Dynamo (AP)




Developed at Amazon (2007)
Sharding of data over a ring of nodes
Each node holds multiple partitions
Each partition replicated N times
DeCandia, Giuseppe, et al. "Dynamo: Amazon's
highly available key-value store."
Consistent Hashing

Naive approach: Hash-partitioning (e.g. in Memcache,
Redis Cluster)
partition = hash(key) % server_count
Consistent Hashing

Solution: Consistent Hashing – mapping of data to
nodes is stable under topology changes
2160 0
hash(key)
position = hash(ip)
Consistent Hashing

Extension: Virtual Nodes for load balancing – each physical node occupies several positions on the ring, so when a node (e.g. A) leaves, its ranges are split among several remaining nodes (B takes over two thirds of A, C takes over one third of A).
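A sketch of consistent hashing with virtual nodes; the SHA-1-based positions and class name are illustrative (Dynamo itself uses MD5):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    private long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Each physical node occupies several ring positions, so topology changes
    // only move small, well-distributed key ranges.
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    // A key is stored on the first node clockwise from hash(key).
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}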
Reading
Parameters R, W, N

An arbitrary node acts as a coordinator

N: number of replicas
R: number of nodes that need to confirm a read
W: number of nodes that need to confirm a write


N=3
R=2
W=1
Quorums

N (Replicas), W (Write Acks), R (Read Acks)
◦ 𝑅 + 𝑊 ≤ 𝑁 ⇒ No guarantee
◦ 𝑅 + 𝑊 > 𝑁 ⇒ newest version included
Read-Quorum
A
B
C
D
A
B
C
D
E
F
G
H
E
F
G
H
I
J
K
L
I
J
K
L
N = 12, R = 3, W = 10
Write-Quorum
N = 12, R = 7, W = 6
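A quick check of the quorum intersection condition from the figures (illustrative helper, not part of any system):

public class QuorumCheck {
    // With N replicas, a read quorum R and a write quorum W overlap
    // in at least one replica exactly when R + W > N.
    static boolean intersect(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(intersect(12, 3, 10)); // true: left figure, read sees newest write
        System.out.println(intersect(12, 7, 6));  // true: right figure (7 + 6 = 13 > 12)
        System.out.println(intersect(3, 1, 1));   // false: no guarantee
    }
}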
Writing

W Servers have to acknowledge
N=3
R=2
W=1
Hinted Handoff

Next node in the ring may take over, until original node
is available again:
N=3
R=2
W=1
Vector clocks

Dynamo uses Vector Clocks for versioning
C. J. Fidge, Timestamps in message-passing systems
that preserve the partial ordering (1988)
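A minimal vector-clock sketch (illustrative, not Dynamo's exact on-disk representation): each node increments its own entry on a local update; if neither clock dominates the other, the versions are concurrent and must be reconciled by the application:

import java.util.HashMap;
import java.util.Map;

public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // Increment this node's entry on every local update.
    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // true if this clock is at least as recent as `other` in every component.
    public boolean dominates(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    // Neither version dominates -> concurrent writes, a conflict to be healed by the application.
    public static boolean concurrent(VectorClock a, VectorClock b) {
        return !a.dominates(b) && !b.dominates(a);
    }
}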
Versioning and Consistency



𝑅 + 𝑊 ≤ 𝑁 ⇒ no consistency guarantee
𝑅 + 𝑊 > 𝑁 ⇒ newest acked value included in reads
Vector Clocks used for versioning
Read Repair
Conflict Resolution

The application merges data when writing (Semantic
Reconciliation)
Merkle Trees: Anti-Entropy

Every Second: Contact random server and compare
Each replica maintains a hash tree over its key ranges: a root hash, children Hash 0 and Hash 1, leaves Hash 0-0 … Hash 1-1. The trees are compared top-down and only subtrees whose hashes differ are synchronized.
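A sketch of this anti-entropy comparison over a binary Merkle tree; the tree layout and hashing are illustrative, not Dynamo's exact format:

import java.util.ArrayList;
import java.util.List;

class MerkleNode {
    final String range;          // key range this node covers, e.g. "0-0"
    final long hash;             // hash over all values in the range
    final MerkleNode left, right;

    MerkleNode(String range, long hash, MerkleNode left, MerkleNode right) {
        this.range = range; this.hash = hash; this.left = left; this.right = right;
    }

    // Compare two replicas' trees top-down; identical subtrees are skipped entirely,
    // only key ranges whose hashes differ need to be synchronized.
    static List<String> divergentRanges(MerkleNode a, MerkleNode b) {
        List<String> out = new ArrayList<>();
        if (a == null || b == null || a.hash == b.hash) return out;
        if (a.left == null && a.right == null) {   // differing leaf -> sync this range
            out.add(a.range);
            return out;
        }
        out.addAll(divergentRanges(a.left, b.left));
        out.addAll(divergentRanges(a.right, b.right));
        return out;
    }
}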
Merkle Trees: Anti-Entropy

Every Second: Contact random server and compare
Hash
Hash
Hash
0
Hash
0-0
Hash
1
Hash
0-1
Hash
1-0
Hash
0
Hash
1-1
Hash
0-0
Hash
1
Hash
0-1
Hash
1-0
Hash
1-1
Merkle Trees: Anti-Entropy

Every Second: Contact random server and compare
Hash
Hash
Hash
0
Hash
0-0
Hash
1
Hash
0-1
Hash
1-0
Hash
0
Hash
1-1
Hash
0-0
Hash
1
Hash
0-1
Hash
1-0
Hash
1-1
Merkle Trees: Anti-Entropy

Every Second: Contact random server and compare
Hash
Hash
Hash
0
Hash
0-0
Hash
1
Hash
0-1
Hash
1-0
Hash
0
Hash
1-1
Hash
0-0
Hash
1
Hash
0-1
Hash
1-0
Hash
1-1
Quorum

LinkedIn (SSDs): P(consistent) ≥ 99.9% after 1.85 ms
Typical configurations:
◦ Performance (Cassandra default): N=3, R=1, W=1
◦ Quorum with fast writes: N=3, R=3, W=1
◦ Quorum with fast reads: N=3, R=1, W=3
◦ Trade-off (Riak default): N=3, R=2, W=2
P. Bailis, PBS Talk: http://www.bailis.org/talks/twitter-pbs.pdf
𝑅 + 𝑊> 𝑁 does not imply linearizability

Consider the following execution:
Writer
set x=1
ok
Replica 1
0
Replica 2
ok
0
ok
0
Replica 3
1
Reader A
Reader B
get x  1
get x  0
Kleppmann, Martin. "Designing dataintensive applications." (2016).
CRDTs
Convergent/Commutative Replicated Data Types


Goal: avoid manual conflict-resolution
Approach:
◦ State-based – commutative, idempotent merge function
◦ Operation-based – broadcasts of commutative updates

Example: State-based Grow-Only Set (G-Set)
Node 1: S1 = {}, add(x) → S1 = {x}, merge({x}, {y}) → S1 = {x, y}
Node 2: S2 = {}, add(y) → S2 = {y}, merge({y}, {x}) → S2 = {x, y}
Marc Shapiro, Nuno Preguica, Carlos Baquero, and Marek
Zawirski "Conflict-free Replicated Data Types"
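A minimal state-based G-Set sketch matching the example above; merge is the set union, which is commutative, associative and idempotent, so replicas converge without manual conflict resolution:

import java.util.HashSet;
import java.util.Set;

public class GSet<T> {
    private final Set<T> elements = new HashSet<>();

    public void add(T value) {            // local update
        elements.add(value);
    }

    public void merge(GSet<T> other) {    // apply state received from another replica
        elements.addAll(other.elements);
    }

    public boolean contains(T value) {
        return elements.contains(value);
    }
}

// Node 1: s1.add("x");  Node 2: s2.add("y");
// after exchanging states, s1.merge(s2) and s2.merge(s1) both yield {x, y}.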
Riak (AP)
Riak – Model: Key-Value; License: Apache 2; Written in: Erlang and C
• Open-source Dynamo implementation
• Extends Dynamo:
◦ Keys are grouped into buckets
◦ KV-pairs may have metadata and links
◦ Map-Reduce support
◦ Secondary indices, update hooks, Solr integration
◦ Option for strongly consistent buckets (experimental)
◦ Riak CS: S3-like file storage, Riak TS: time-series database
• Consistency levels: N, R, W, DW
• Storage backends: Bitcask, Memory, LevelDB
• Data: KV-pairs in buckets
Riak Data Types

Implemented as state-based CRDTs:
Data Type
Convergence rule
Flags
enable wins over disable
Registers
The most chronologically recent value wins, based
on timestamps
Counters
Implemented as a PN-Counter, so all increments
and decrements are eventually applied.
Sets
If an element is concurrently added and removed,
the add will win
Maps
If a field is concurrently added or updated and
removed, the add/update will win
http://docs.basho.com/riak/kv/2.1.4/learn/concepts/crdts/
Hooks & Search

Hooks:
JS/Erlang Pre-Commit Hook
Update/Delete/Create
Response
JS/Erlang Post-Commit Hook

Riak Search:
Riak_search_kv_hook
Update/Delete/Create
/solr/mybucket/select?q=user:emil
Term
Document
database
3,4,1
rabbit
2
Search Index
Riak Map-Reduce
POST /mapred – map phases run locally on each node (Node 1, Node 2, Node 3) over the nosql_dbs bucket and emit partial counts.
Map function (JavaScript):
function(v) {
  var json = v.values[0].data;
  return [{count : json.stackoverflow_questions}];
}
http://docs.basho.com/riak/latest/tutorials/querying/MapReduce/
Riak Map-Reduce
Reduce function (JavaScript), executed over the mapped results:
function(mapped) {
  var sum = 0;
  for(var i in mapped) {
    sum += mapped[i].count;
  }
  return [{count : sum}];
}
http://docs.basho.com/riak/latest/tutorials/querying/MapReduce/
Riak Map-Reduce
The map results are pre-reduced per node (e.g. 494, 696 and 61) and a final reduce on the coordinating node combines them into the overall result (1251).
http://docs.basho.com/riak/latest/tutorials/querying/MapReduce/
Riak Map-Reduce




JavaScript/Erlang, stored/ad-hoc
Pattern: Chainable Reducers
Key-Filter: Narrow down input
Link Phase: Resolves links
Map
"key-filter" : [
["string_to_int"],
["less_than", 100]
]
"link" : {
"bucket":"nosql_dbs"
}
Reduce
Same
Data Format
Riak Cloud Storage
Stanchion:
Request
Serializer
Files
Amazon S3
API
1MB Chunks
Summary: Dynamo and Riak



Available and Partition-Tolerant
Consistent Hashing: hash-based distribution with stability
under topology changes (e.g. machine failures)
Parameters: N (Replicas), R (Read Acks), W (Write Acks)
◦ N=3, R=W=1  fast, potentially inconsistent
◦ N=3, R=3, W=1  slower reads, most recent object version contained



Vector Clocks: concurrent modification can be detected,
inconsistencies are healed by the application
API: Create, Read, Update, Delete (CRUD) on key-value pairs
Riak: Open-Source Implementation of the Dynamo paper
Dynamo and Riak
Classification
Sharding
RangeSharding
HashSharding
Entity-Group
Sharding
Replication
Transaction
Protocol
Sync.
Replication
Storage
Management
Logging
Updatein-Place
Caching
InMemory
Append-Only
Storage
Query
Processing
Global
Index
Local
Index
Query
Planning
Analytics
Materialized
Views
Async.
Replication
Consistent
Hashing
Primary
Copy
Shared
Disk
Update
Anywhere
Redis (CA)
Redis – Model: Key-Value; License: BSD; Written in: C
• Remote Dictionary Server: in-memory key-value store
• Asynchronous master-slave replication
• Data model: rich data structures stored under a key
• Tunable persistence: logging and snapshots
• Single-threaded event-loop design (similar to Node.js)
• Optimistic batch transactions (MULTI blocks)
• Very high performance: >100k ops/sec per node
• Redis Cluster adds sharding
Redis Architecture

Redis Codebase ≅ 20K LOC
Redis Server
hello
SET mykey hello
TCP Port
6379
+OK
Client
Plain Text Protocol
Event Loop
Local
Filesystem
Log
Dump
AOF
RDB
One Process/
Thread
- Periodic
- After X Writes
- SAVE
RAM
Persistence



Default: „Eventually Persistent“
AOF: Append Only File (~Commitlog)
RDB: Redis Database Snapshot
config set appendonly everysec
fsync() every second
Snapshot every 60s,
if > 1000 keys changed
config set save 60 1000
User
Space
1
App
Hardware
Memory
SET mykey hello
Database
Process
3
Kernel
Space
1. Resistance to client crashes
2. Resistance to DB process crashes
3. Resistance to hardware crashes with Write-Through
4. Resistance to hardware crashes with Write-Back
Client
Persistence
fsync()
In Memory Data
Structures
2
fwrite()
POSIX Filesystem API
Page Cache
(Reads)
Controller
Disk
Buffer Cache
(Writes)
Disk Cache
Write Through
vs Write Back
4
Persistence: Redis vs an RDBMS

PostgreSQL vs. Redis:
> synchronous_commit on ↔ > appendfsync always: latency > disk latency, group commits, slow
> synchronous_commit off ↔ > appendfsync everysec: periodic fsync(), data loss limited
> fsync false ↔ > appendfsync no: data loss possible, corruption prevented (Redis) / corruption possible (Postgres)
> pg_dump ↔ > save or bgsave
Master-Slave Replication
Slave Offsets
> SLAVEOF 192.168.1.1 6379
< +OK
Writes
Memory Backlog
Asynchronous
Replication
Master
Slave1
Slave2
Slave2.1
Slave2.2
Stream
Data structures

String, List, Set, Hash, Sorted Set
String
web:index
Set
users:2:friends
List
users:2:inbox
Hash
Sorted Set
Pub/Sub
users:2:settings
top-posters
users:2:notifs
"<html><head>…"
{23, 76, 233, 11}
[234, 3466, 86,55]
Theme → "dark", cookies → "false"
466 → "2", 344 → "16"
"{event: 'comment posted', time : …"
Data Structures

(Linked) Lists:
LPUSHX
LPUSH
Only if list
exists
inbox
234
4
LRANGE inbox 1 2
RPUSH
86
55
LINDEX inbox 2
RPOP
3466
LREM inbox 0 3466
LLEN
LPOP
Blocks until element
arrives
BLPOP
Data Structures

Sets:
23 10 2 28 325 64 70
SINTERSTORE common_friends
user:2 friends user:5:friends
SINTER
user:2:friends
4
23
76
233
11
SMEMBERS
23
SADD
SCARD
false
SREM
SRANDMEMBER
user:5:friends
SISMEMBER
common_friends
Data Structures

Pub/Sub:
users:2:notifs
"{event: 'comment posted', time : …"
PUBLISH user:2:notifs
"{
event: 'comment posted',
time : …
}"
SUBSCRIBE user:2:notifs
{
event: 'comment posted',
time : …
}
Example: Bloom filters
Compact Probabilistic Sets



Bit array of length m and k independent hash functions
insert(obj): add to set
contains(obj): might give a false positive
n
y
h2
h3
h1
=1?
y
h2
contained
h1
h3
1 1 0 0 1 0 1 0 1 1
1
m
1 1 0 0 1 0 1 0 1 1
1
m
Insert y
Query x
https://github.com/Baqend/
Orestes-Bloomfilter
Bloomfilters in Redis

Bitvectors in Redis: String + SETBIT, GETBIT, BITOP
// Jedis: Redis client for Java; SETBIT creates and resizes the bit string automatically
public void add(byte[] value) {
  for (int position : hash(value)) {
    jedis.setbit(name, position, true);
  }
}
public boolean contains(byte[] value) {
  for (int position : hash(value))
    if (!jedis.getbit(name, position))
      return false;
  return true;
}
Pipelining


If the Bloom filter uses 7 hashes: 7 roundtrips
Solution: Redis Pipelining
Redis
Client
SETBIT key 22 1
SETBIT key 87 1
...
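A sketch of the pipelined lookup with the Jedis client used above: the k GETBIT calls are queued and flushed in a single round trip instead of k round trips.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;
import java.util.ArrayList;
import java.util.List;

public class PipelinedBloomFilter {
    public boolean contains(Jedis jedis, String name, int[] positions) {
        Pipeline pipeline = jedis.pipelined();
        List<Response<Boolean>> bits = new ArrayList<>();
        for (int position : positions) {
            bits.add(pipeline.getbit(name, position)); // queued, not yet sent
        }
        pipeline.sync();                               // one round trip for all commands
        for (Response<Boolean> bit : bits) {
            if (!bit.get()) return false;              // some bit unset -> definitely not contained
        }
        return true;                                   // all bits set -> probably contained
    }
}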
Redis for distributed systems


Common Pattern: distributed system with shared state
in Redis
Example - Improve performance for legacy systems:
Hash
k
m
Bits
Slow Legacy
System
MD5
7
80000
0 1 0 0 1 0 1 0 1 1
Bloomfilter lookup:
GETBIT, GETBIT...
On Hit
App Server
Get Data
From Legacy System
Redis Bloom filters
Open Source
https://github.com/Baqend/
Orestes-Bloomfilter
Why is Redis so fast?
Pessimistic
transactions
are expensive
No Query
Parsing
AOF
Operations are
lock-free
Single-threading
Data in RAM
Harizopoulos, Stavros, Madden, Stonebraker "OLTP through
the looking glass, and what we found there."
Optimistic Transactions


MULTI: Atomic Batch Execution
WATCH: Condition for MULTI Block
WATCH users:2:followers, users:3:followers
Only executed if
both keys are
unchanged
MULTI
SMEMBERS users:2:followers
Queued
SMEMBERS users:3:followers
Queued
INCR transactions
EXEC
Queued
Bulk reply with 3 results
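The same optimistic transaction as a Jedis sketch (assumed client-side handling: an aborted EXEC due to a changed watched key yields no results):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;
import java.util.List;

public class OptimisticTransactionDemo {
    public boolean execute(Jedis jedis) {
        jedis.watch("users:2:followers", "users:3:followers"); // condition for the MULTI block
        Transaction tx = jedis.multi();
        tx.smembers("users:2:followers");                      // queued
        tx.smembers("users:3:followers");                      // queued
        tx.incr("transactions");                               // queued
        List<Object> results = tx.exec();                      // bulk reply with 3 results,
        return results != null && !results.isEmpty();          // or none if a watched key changed
    }
}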
Lua Scripting
Redis Server
SCRIPT LOAD
Script Hash
EVALSHA $hash 1
"mylock" "10"
1
--lockscript, parameters: lock_key, lock_timeout
local lock = redis.call('get', KEYS[1])
if not lock then
return redis.call('setex', KEYS[1], ARGV[1], "locked")
end
return false
Script Cache
Data
Ierusalimschy, Roberto. Programming in lua. 2006.
Redis Cluster
Work-in-Progress


Idea: Client-driven hash-based sharding (CRC16, "hash slots")
Asynchronous replication with failover (variant of Raft‘s
leader election)
◦ Consistency: not guaranteed, last failover wins
◦ Availability: only on the majority partition
neither AP nor CP
Full-Mesh
Cluster Bus
8192-16384
Client
Redis Master
Redis Slave
Redis Master
Redis Slave
0-8192
- No multi-key operations
- Pinning via key: {user1}.followers
http://redis.io/topics/cluster-spec
Performance
Comparable to Memcache
> redis-benchmark -n 100000 -c 50
Requests per second by operation (benchmark chart)
Example Redis Use-Case: Twitter



Per User: one
materialized timeline in
Redis
Timeline = List
Key: User ID
>150 million users
~300k timeline queries/s
RPUSHX user_id tweet
http://www.infoq.com/presentations/Real-Time-Delivery-Twitter
Classification: Redis
Techniques
Sharding
RangeSharding
HashSharding
Entity-Group
Sharding
Replication
Transaction
Protocol
Sync.
Replication
Storage
Management
Logging
Updatein-Place
Caching
InMemory
Append-Only
Storage
Query
Processing
Global
Index
Local
Index
Query
Planning
Analytics
Materialized
Views
Async.
Replication
Consistent
Hashing
Primary
Copy
Shared
Disk
Update
Anywhere
Google BigTable (CP)


Published by Google in 2006
Original purpose: storing the Google search index
A Bigtable is a sparse,
distributed, persistent
multidimensional sorted map.

Data model also used in: HBase, Cassandra, HyperTable,
Accumulo
Chang, Fay, et al. "Bigtable: A distributed storage system
for structured data."
Wide-Column Data Modelling

Storage of crawled web-sites („Webtable“):
Column-Family:
contents
1. Dimension:
Row Key
com.cnn.www
2. Dimension:
CF:Column
content : "<html>…"
content : "<html>…"
content : "<html>…"
Column-Family:
anchor
t3
t5
t6
3. Dimension:
Timestamp
cnnsi.com : "CNN"
Sparse
Sorted
my.look.ca : "CNN.com"
Range-based Sharding
BigTable Tablets
Tablet: Range partition of ordered records
Rows
A-C
Tablet Server 1
Tablet Server 2
A-C
C-F
C-F
F-I
I-M
M-T
F-I
I-M
M-T
T-Z
T-Z
Controls Ranges, Splits, Rebalancing
Master
Tablet Server 3
Architecture
Master
Tablet Server
Chubby
Tablet Server
Tablet Server
SSTables
Commit
Log
GFS
Architecture
ACLs, Garbage
Collection,
Rebalancing
Master Lock, Root
Metadata Tablet
Master
Chubby
Stores Ranges,
Answers client
requests
Tablet Server
Stores data and
commit log
Tablet Server
Tablet Server
SSTables
Commit
Log
GFS
Storage: Sorted-String Tables




Goal: Append-Only IO when writing (no disk seeks)
Achieved through: Log-Structured Merge Trees
Writes go to an in-memory memtable that is periodically
persisted as an SSTable as well as a commit log
Reads query memtable and all SSTables
Row-Key
Key
Block
Key
Block
Key
Block
...
Block Index
Block (e.g. 64KB)
Key
Value
Key
Value
Key
Variable Length
Sorted String Table
Value
...
Storage: Optimization


Writes: In-Memory in Memtable
SSTable disk access optimized by Bloom filters
Write(x)
Memtable
Read(x)
Client
Bloom
filters
Main Memory
Hit
Periodic
Flush
Disk
SSTables
Periodic
Compaction
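A simplified log-structured storage sketch of the read/write path described above: writes go to an in-memory memtable (the commit log is omitted), reads check the memtable and then the SSTables from newest to oldest; names and the flush threshold are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class LsmStore {
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> ssTables = new ArrayList<>(); // newest first

    public void put(String key, String value) {
        memtable.put(key, value);            // in production also appended to the commit log
        if (memtable.size() >= 4) flush();   // tiny threshold just for the example
    }

    public String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (TreeMap<String, String> sst : ssTables) {  // Bloom filters would skip most SSTables
            v = sst.get(key);
            if (v != null) return v;
        }
        return null;
    }

    private void flush() {                   // memtable becomes an immutable, sorted SSTable
        ssTables.add(0, memtable);
        memtable = new TreeMap<>();
    }
}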
Apache HBase (CP)
HBase
Model:


Open-Source Implementation of BigTable
Hadoop-Integration
◦ Data source for Map-Reduce
◦ Uses Zookeeper and HDFS

Wide-Column
License:
Apache 2
Written in:
Java
Data modelling challenges: key design, tall vs wide
◦ Row Key: only access key (no indices)  key design important
◦ Tall: good for scans
◦ Wide: good for gets, consistent (single-row atomicity)


No typing: application handles serialization
Interface: REST, Avro, Thrift
HBase Storage

Logical to physical mapping:
Key cf1:c1 cf1:c2 cf2:c1 cf2:c2
r1
r2
r3
r4
r5
George, Lars. HBase: the definitive guide. 2011.
HBase Storage

Logical to physical mapping:
r1:cf2:c1:t1:<value>
r2:cf2:c2:t1:<value>
Key cf1:c1 cf1:c2 cf2:c1 cf2:c2
r1
r2
r3:cf2:c2:t2:<value>
r3:cf2:c2:t1:<value>
r5:cf2:c1:t1:<value>
HFile cf2
r3
r4
r5
r1:cf1:c1:t1:<value>
r2:cf1:c2:t1:<value>
r3:cf1:c2:t1:<value>
r3:cf1:c1:t2:<value>
r5:cf1:c1:t1:<value>
HFile cf1
George, Lars. HBase: the definitive guide. 2011.
HBase Storage

Logical to physical mapping:
In Value
In Key
In Column
Key Design – where to store data:
r2:cf2:c2:t1:<value>
r2-<value>:cf2:c2:t1:_
r2:cf2:c2<value>:t1:_
Key cf1:c1 cf1:c2 cf2:c1 cf2:c2
r1
r2
r1:cf2:c1:t1:<value>
r2:cf2:c2:t1:<value>
r3:cf2:c2:t2:<value>
r3:cf2:c2:t1:<value>
r5:cf2:c1:t1:<value>
HFile cf2
r3
r4
r5
r1:cf1:c1:t1:<value>
r2:cf1:c2:t1:<value>
r3:cf1:c2:t1:<value>
r3:cf1:c1:t2:<value>
r5:cf1:c1:t1:<value>
HFile cf1
George, Lars. HBase: the definitive guide. 2011.
Example: Facebook Insights
Log
Extraction
every 30 min
MD5(Reversed Domain) + Reversed Domain + URL-ID
Row Key
6PM
Total
6PM
Male
Total
Male
10
7
1000
Atomic HBase
Counter
567
CF:Daily
…
01.01
Total
01.01
Male
100
65
CF:Monthly
TTL – automatic deletion of
old rows
…
…
CF:All
Lars George: “Advanced
HBase Schema Design”
Schema Design

Tall vs Wide Rows:
◦ Tall: good for Scans
◦ Wide: good for Gets

Hotspots: sequential keys (e.g. timestamp) are dangerous
Performance
Sequential
Random
Key
George, Lars. HBase: the definitive guide. 2011.
Schema: Messages
User ID
CF
Column
Timestamp
Message
12345
12345
12345
12345
data
data
data
data
5fc38314-e290-ae5da5fc375d
725aae5f-d72e-f90f3f070419
cc6775b3-f249-c6dd2b1a7467
dcbee495-6d5e-6ed48124632c
1307097848
1307099848
1307101848
1307103848
"Hi Lars, ..."
"Welcome, and ..."
"To Whom It ..."
"Hi, how are ..."
Timestamp
Message
: 1307097848
: 1307099848
: 1307101848
: 1307103848
"Hi Lars, ..."
"Welcome, and ..."
"To Whom It ..."
"Hi, how are ..."
vs
ID:User+Message
CF
12345-5fc38314-e290-ae5da5fc375d
12345-725aae5f-d72e-f90f3f070419
12345-cc6775b3-f249-c6dd2b1a7467
12345-dcbee495-6d5e-6ed48124632c
data
data
data
data
Wide:
Atomicity
Scan over Inbox: Get
Column
Tall:
Fast Message Access
Scan over Inbox: Partial Key Scan
http://2013.nosql-matters.org/cgn/wp-content/uploads/2013/05/
HBase-Schema-Design-NoSQL-Matters-April-2013.pdf
API: CRUD + Scan
Setup Cloud Cluster:
> elastic-mapreduce --create -hbase --num-instances 2 --instancetype m1.large
> whirr launch-cluster --config
hbase.properties
Login, cluster size, etc.
HTable table = ...
Get get = new Get(Bytes.toBytes("my-row"));
get.addColumn(Bytes.toBytes("my-cf"), Bytes.toBytes("my-col"));
Result result = table.get(get);
table.delete(new Delete(Bytes.toBytes("my-row")));
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("my-row-0"));
scan.setStopRow(Bytes.toBytes("my-row-101"));
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) { }
API: Features

Row Locks (MVCC): table.lockRow(), unlockRow()
◦ Problem: timeouts, deadlocks, resources


Conditional Updates: checkAndPut(), checkAndDelete()
Coprocessors – registered Java classes for:
◦ Observers (prePut, postGet, etc.)
◦ Endpoints (Stored Procedures)

HBase can be a Hadoop Source:
TableMapReduceUtil.initTableMapperJob(
tableName, //Table
scan, //Data input as a Scan
MyMapper.class, ... //usually a TableMapper<Text,Text> );
Summary: BigTable, HBase






Data model: 𝑟𝑜𝑤𝑘𝑒𝑦, 𝑐𝑓: 𝑐𝑜𝑙𝑢𝑚𝑛, 𝑡𝑖𝑚𝑒𝑠𝑡𝑎𝑚𝑝 →
𝑣𝑎𝑙𝑢𝑒
API: CRUD + Scan(start-key, end-key)
Uses distributed file system (GFS/HDFS)
Storage structure: Memtable (in-memory data structure)
+ SSTable (persistent; append-only-IO)
Schema design: only primary key access  implicit
schema (key design) needs to be carefully planned
HBase: very literal open-source BigTable implementation
Classification: HBase
Techniques
Sharding
RangeSharding
HashSharding
Entity-Group
Sharding
Replication
Transaction
Protocol
Sync.
Replication
Storage
Management
Logging
Updatein-Place
Caching
InMemory
Append-Only
Storage
Query
Processing
Global
Index
Local
Index
Query
Planning
Analytics
Materialized
Views
Async.
Replication
Consistent
Hashing
Primary
Copy
Shared
Disk
Update
Anywhere
Apache Cassandra (AP)
Cassandra
Model:


Published 2007 by Facebook
Idea:
◦ BigTable‘s wide-column data model
◦ Dynamo ring for replication and sharding


Wide-Column
License:
Apache 2
Written in:
Java
Cassandra Query Language (CQL): SQL-like query- and
DDL-language
Compound indices: partition key (shard key) + clustering
key (ordered per partition key)  Limited range queries
Architecture
Thrift
Thrift
Session
Thrift
RPC
Session
or CQL
set_keyspace()
get_slice()
Stores SSTables
and Commit Log
Hashing:
Stateful
Communication
Local
Filesystem
Replication,
Gossip, etc.
Cassandra Node
TCP Cluster
Messages
Storage
Proxy
Column
Family Store
Row Cache
MemTable
Key Cache
Stores Rows
Stores Primary Key Index
(Seek Position)
MD5(key)
Random Partitioner
Order Preservering
Partitioner
key
Snitch: Rack, Datacenter,
EC2 Region Information
Consistency


No vector clocks but last-write-wins
→ clock synchronization required
No versioning that keeps old cells
Consistency levels (Write / Read):
Any / –
One / One
Two / Two
Quorum / Quorum
Local_Quorum or Each_Quorum / Local_Quorum or Each_Quorum
All / All
Consistency


Coordinator chooses newest version and triggers Read
Repair
Downside: upon conflicts, changes are lost
C1: writes B
Write(One)
C2: writes C
C3 : reads C
Write(One)
B
Version A
C
Version A
B
C
Read(All)
Version A
C
Storage Layer

Uses BigTables Column Family Format
KeySpace: music
Column Family: songs
Comparator determines
order
f82831…
title: Andante
144052…
title: Jailhouse
Rock
Row Key: Mapping to
Server
album: New
World Symphony
Type validated by
Validation Class UTFType
artist: Antonin
Dvorak
artist: Elvis
Presley
Sparse
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts
CQL Example: Compound keys

Enables Scans despite Random Partitioner
SELECT * FROM playlists
WHERE id = 23423
ORDER BY song_order DESC
LIMIT 50;
CREATE TABLE playlists (
id uuid,
song_order int,
song_id uuid, ...
PRIMARY KEY (id, song_order)
);
Clustering Columns:
sorted per node
Partition Key
id
song_order
song_id
artist
23423
1
64563
Elvis
23423
2
f9291
Elvis
Other Features





Distributed Counters – prevent update anomalies
Full-text Search (Solr) in Commercial Version
Column TTL – automatic garbage collection
Secondary indices: hidden table with mapping
 queries with simple equality condition
Lightweight Transactions: linearizable updates through a
Paxos-like protocol
INSERT INTO USERS (login, email, name, login_count)
values ('jbellis', '[email protected]', 'Jonathan Ellis', 1)
IF NOT EXISTS
Classification: Cassandra
Techniques
Sharding
RangeSharding
HashSharding
Entity-Group
Sharding
Replication
Transaction
Protocol
Sync.
Replication
Storage
Management
Logging
Updatein-Place
Caching
InMemory
Append-Only
Storage
Query
Processing
Global
Index
Local
Index
Query
Planning
Analytics
Materialized
Views
Async.
Replication
Consistent
Hashing
Primary
Copy
Shared
Disk
Update
Anywhere
MongoDB (CP)
MongoDB
Model:






Document
From humongous ≅ gigantic
License:
GNU AGPL 3.0
Schema-free document database with
Written in:
tunable consistency
C++
Allows complex queries and indexing
Sharding (either range- or hash-based)
Replication (either synchronous or asynchronous)
Storage Management:
◦ Write-ahead logging for redos (journaling)
◦ Storage Engines: memory-mapped files, in-memory, log-structured merge trees (WiredTiger), …
Basics
> mongod &
> mongo imdb
MongoDB shell version: 2.4.3
connecting to: imdb
> show collections
movies
Properties
tweets
> db.movies.findOne({title : "Iron Man 3"})
{
title : "Iron Man 3",
year : 2013 ,
Arrays, Nesting allowed
genre : [
"Action",
"Adventure",
"Sci -Fi"],
actors : [
"Downey Jr., Robert",
"Paltrow , Gwyneth",]
}
Data Modelling
Genre
n
Movie
title
year
rating
director
Actor
1
n
n
Tweet
text
coordinates
retweets
1
User
1
name
location
Data Modelling
Genre
n
Movie
title
year
rating
director
Actor
1
{
"_id" : ObjectId("51a5d316d70beffe74ecc940"),
title : "Iron Man 3",
year : 2013,
rating : 7.6,
director: "Shane Black",
genre : [ "Action",
"Adventure",
"Sci-Fi"],
actors : ["Downey Jr., Robert",
"Paltrow, Gwyneth"],
tweets : [ {
"user" : "Franz Kafka",
"text" : "#nowwatching Iron Man 3",
"retweet" : false,
"date" : ISODate("2013-05-29T13:15:51Z")
}]
}
Movie Document
n
n
Tweet
text
coordinates
retweets
1
User
1
name
location
Data Modelling
Genre
n
Movie
title
year
rating
director
Actor
1
{
"_id" : ObjectId("51a5d316d70beffe74ecc940"),
title : "Iron Man 3",
year : 2013,
rating : 7.6,
director: "Shane Black",
genre : [ "Action",
"Adventure",
"Sci-Fi"],
actors : ["Downey Jr., Robert",
"Paltrow, Gwyneth"],
tweets : [ {
"user" : "Franz Kafka",
"text" : "#nowwatching Iron Man 3",
"retweet" : false,
"date" : ISODate("2013-05-29T13:15:51Z")
}]
}
Movie Document
n
n
Tweet
1
text
coordinates
retweets
User
1
name
location
Denormalisation instead
of joins
Nesting replaces 1:n
and 1:1 relations
Schemafreeness:
Attributes per document
Unit of atomicity:
document
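Because the document is the unit of atomicity, a tweet can be appended to its movie's embedded array in a single atomic operation; a minimal PyMongo sketch (connection string and field names follow the example above and are illustrative):

from pymongo import MongoClient

movies = MongoClient("mongodb://localhost:27017")["imdb"]["movies"]

# $push on the embedded array is atomic per document
movies.update_one(
    {"title": "Iron Man 3"},
    {"$push": {"tweets": {"user": "Franz Kafka",
                          "text": "#nowwatching Iron Man 3",
                          "retweet": False}}})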
Principles
Sharding and Replication
Sharding:
-Sharding attribute
-Hash vs. range sharding
config
config
config
Client
mongos
mongos
Client
Controls Write Concern:
Unacknowledged, Acknowledged,
Journaled, Replica Acknowledged
Slave
Master
Slave
-Load-Balancing
-can trigger rebalancing of
chunks (64MB) and splitting
Replica Set
Slave
Master
Slave
Replica Set
-Receives all writes
-Replicates asynchronously
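The write concern in the figure is chosen by the client per operation or per collection; a hedged PyMongo sketch (the mongos host name is a placeholder):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://mongos-host:27017")

# w=1 ~ Acknowledged (default), w="majority" ~ Replica Acknowledged, j=True ~ Journaled
movies = client["imdb"].get_collection(
    "movies", write_concern=WriteConcern(w="majority", j=True))
movies.insert_one({"title": "Inception", "year": 2010})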
MongoDB Example App
Twitter
Firehose
@Johnny: Watching
Game of Thrones
@Jim: Star Trek
rocks.
REST API (Jetty)
Tweets
saveTweet()
Movies
3
Tweets
MongoDB
2
Streaming
1
GET
JSON
4
GridFS
HTTP
Movies
getTaggedTweets()
getByGenre()
searchByPrefix()
MovieService
Server
Search
Browser
Tweet Map
Searching
Queries
Client
MongoDB by Example
DBObject query = new BasicDBObject("tweets.coordinates",
new BasicDBObject("$exists", true));
db.getCollection("movies").find(query);
Or in JavaScript:
db.movies.find({"tweets.coordinates" : { "$exists" : true }})
Overhead caused by large results → projection
db.tweets.find({coordinates : {"$exists" : 1}},
{text:1, movie:1, "user.name":1, coordinates:1})
.sort({id:-1})
Projected attributes, ordered by insertion date
db.movies.ensureIndex({title : 1})
db.movies.find({title : /^Incep/}).limit(10)
Index usage:
db.movies.find({title : /^Incep/}).explain().millis = 0
db.movies.find({title : /^Incep/i}).explain().millis = 340
db.movies.update({_id: id}, {"$set" : {"comment" : c}})
or:
db.movies.save(changed_movie);
fs = new GridFs(db);
fs.createFile(inputStream).save();
File
GridFS
API
256 KB
Blocks
Mongo
DB
db.tweets.ensureIndex({coordinates : "2dsphere"})
db.tweets.find({"$near" : {"$geometry" : … }})
Geospatial Queries:
• Distance
• Intersection
• Inclusion
db.tweets.runCommand( "text", { search: "StAr trek" } )
Full-text Search:
• Tokenization, Stop Words
• Stemming
• Scoring
Analytic Capabilities

Aggregation Pipeline Framework:
Sort
Match: Selection
by query

Projection
Unwind:
elimination of
nesting
Skip and
Limit
Group
Grouping, e.g.
{ $group : { _id : "$author",
             docsPerAuthor : { $sum : 1 },
             viewsPerAuthor : { $sum : "$views" } } }
Alternative: JavaScript MapReduce
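For completeness, a full pipeline with these stages could look as follows in PyMongo (the articles collection and its fields are illustrative, following the grouping example above):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["blog"]
pipeline = [
    {"$match": {"views": {"$gt": 0}}},                   # selection by query
    {"$group": {"_id": "$author",
                "docsPerAuthor": {"$sum": 1},
                "viewsPerAuthor": {"$sum": "$views"}}},  # grouping
    {"$sort": {"viewsPerAuthor": -1}},                   # sort
    {"$limit": 10},                                      # skip and limit
]
for row in db.articles.aggregate(pipeline):
    print(row)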
Sharding

Range-based:

Hash-based:
In the optimal case, only one shard is asked per query; otherwise: scatter-and-gather
Even distribution,
no locality
docs.mongodb.org/manual/core/sharding-introduction/
Sharding


Splitting:
Migration:
Split chunks that are
too large
Mongos Load Balancer
triggers rebalancing
docs.mongodb.org/manual/core/sharding-introduction/
Classification: MongoDB
Techniques
Sharding
RangeSharding
HashSharding
Entity-Group
Sharding
Replication
Transaction
Protocol
Sync.
Replication
Storage
Management
Logging
Updatein-Place
Caching
InMemory
Append-Only
Storage
Query
Processing
Global
Index
Local
Index
Query
Planning
Analytics
Materialized
Views
Async.
Replication
Consistent
Hashing
Primary
Copy
Shared
Disk
Update
Anywhere
Other Systems
Graph databases

Neo4j (ACID, replicated, Query-language)

HypergraphDB (directed Hypergraph, BerkeleyDB-based)

Titan (distributed, Cassandra-based)

ArangoDB, OrientDB („multi-model“)

SparkleDB (RDF-Store, SPARQL)

InfinityDB (embeddable)

InfiniteGraph (distributed, low-level API, Objectivity-based)
Other Systems
Key-Value Stores

Aerospike (SSD-optimized)

Voldemort (Dynamo-style)

Memcache (in-memory cache)

LevelDB (embeddable, LSM-based)

HyperDex (Searchable, Hyperspace-Hashing, Transactions)

Oracle NoSQL Database (distributed frontend for BerkeleyDB)

HazelCast (in-memory data-grid based on Java Collections)

FoundationDB (ACID through Paxos)
Other Systems
Document Stores

CouchDB (Multi-Master, lazy synchronization)

CouchBase (distributed Memcache, N1QL~SQL, MR-Views)

RavenDB (single node, SI transactions)

RethinkDB (distributed CP, MVCC, joins, aggregates, real-time)

MarkLogic (XML, distributed 2PC-ACID)

ElasticSearch (full-text search, scalable, unclear consistency)

Solr (full-text search)

Azure DocumentDB (cloud-only, ACID, WAS-based)
Other Systems
Wide-Column Stores

Accumulo (BigTable-style, cell-level security)

HyperTable (BigTable-style, written in C++)
Other Systems
NewSQL Systems

CockroachDB (Spanner-like, SQL, no joins, transactions)

Crate (ElasticSearch-based, SQL, no transaction guarantees)

VoltDB (HStore, ACID, in-memory, uses stored procedures)

Calvin (log- & Paxos-based ACID transactions)

Google F1 (based on Spanner, SQL)

Microsoft Cloud SQL Server (distributed CP, MSSQL-comp.)

MySQL Cluster, Galera Cluster, Percona XtraDB Cluster
(distributed storage engine for MySQL)
Open Research Questions
For Scalable Data Management

Service-Level Agreements
◦ How can SLAs be guaranteed in a virtualized, multi-tenant
cloud environment?

Consistency
◦ Which consistency guarantees can be provided in a geo-replicated system without sacrificing availability?

Performance & Latency
◦ How can a database deliver low latency in the face of distributed
storage and application tiers?

Transactions
◦ Can ACID transactions be aligned with NoSQL and scalability?
Distributed Transactions
ACID and Serializability
Definition: A transaction is a sequence of operations transforming
the database from one consistent state to another.
Atomicity
Commit Handling
Consistency
Constraint Checking
Isolation
Concurrency Control
Durability
Logging & Recovery
Distributed Transactions
ACID and Serializability
Definition: A transaction is a sequence of operations transforming the database from one consistent state to another.
Atomicity → Commit Handling
Consistency → Constraint Checking
Isolation → Concurrency Control
Durability → Logging & Recovery
Isolation Levels:
1. Serializability
2. Snapshot Isolation
3. Read-Committed
4. Read-Atomic
5. …
Distributed Transactions
General Processing
Commit Protocol is not available
Commit Protocol
Needs to ensure globally
correct isolation
Strong Consistency –
needed by Concurrency
Control
Concurrency Control
Concurrency Control
Concurrency Control
Replication
Replication
Replication
Replicas
Replicas
Replicas
Shard
Shard
Shard
Distributed Transactions
In NoSQL Systems – An Overview

System        | Concurrency Control | Isolation      | Granularity  | Commit Protocol
Megastore     | OCC                 | SR             | Entity Group | Local
G-Store       | OCC                 | SR             | Key Group    | Local
Spanner / F1  | PCC / OCC           | SR / SI        | Multi-Shard  | 2PC
Percolator    | OCC                 | SI             | Multi-Shard  | 2PC
MDCC          | OCC                 | Read-Committed | Multi-Shard  | Custom – 2PC like
ElasTras      | PCC                 | SR             | Shard        | Local
CloudTPS      | TO                  | SR             | Multi-Shard  | 2PC
Cherry Garcia | OCC                 | SI             | Multi-Shard  | Client Coordinated
Omid          | OCC                 | SI             | Multi-Shard  | Local
FaRMville     | OCC                 | SR             | Multi-Shard  | 2PC
RAMP          | Custom              | Read-Atomic    | Multi-Shard  | Custom
Distributed Transactions
Megastore




Synchronous Paxos-based replication
Fine-grained partitions (entity groups)
Based on BigTable
Local commit protocol, optimistic concurrency control
User
ID
Name
Root Table
Photo
1
n
ID
User
URL
Child Table
EG: User + n Photos
• Unit of ACID transactions/
consistency
• Local commit protocol,
optimistic concurrency
control
Distributed Transactions
Spanner
Idea:
• Auto-sharded entity groups
• Paxos replication per shard
Transactions:
• Multi-shard transactions
• SI using the TrueTime API
• SR based on 2PL and 2PC
• Core of F1, powering Google's ad business
J. Corbett et al. "Spanner: Google’s globally distributed database." TOCS 2013
Distributed Transactions
MDCC – Multi Datacenter Concurrency Control
Properties:
• Read Committed Isolation
• Geo Replication
• Optimistic Commit
Transaction T1 = {v → v′, u → u′}: the app server (coordinator) sends v → v′ to the record master of v and u → u′ to the record master of u; each record master runs a Paxos instance with its replicas.
Distributed Transactions
RAMP – Read Atomic Multi Partition Transactions
Properties:
• Read Atomic Isolation
• Synchronization Independence
• Partition Independence
• Guaranteed Commit
Fractured read example: a reader observes r(x) before a concurrent transaction's w(x) but r(y) after its w(y). RAMP avoids this in three steps: (1) read objects, (2) validate the versions, (3) load the other version where a newer committed one was detected.
Distributed Transactions in the Cloud
The Latency Problem
Interactive Transactions:
Optimistic Concurrency Control
Optimistic Concurrency Control
The Abort Rate Problem
• 10,000 objects
• 20 writes per second
• 95% reads
Optimistic Concurrency Control
The Abort Rate Problem
• 10,000 objects
• 20 writes per second
• 95% reads
Distributed Cache-Aware Transactions
Scalable ACID Transactions

Solution: Conflict-Avoidant Optimistic Transactions
◦ Cached reads → Shorter transaction duration → less aborts
◦ Bloom Filter to identify outdated cache entries
Begin Transaction
Bloom Filter
Client
Reads
2
1
Cache
Cache
REST-Server
Cache
Commit: readset versions & writeset
Committed OR aborted + stale objects
REST-Server
3
REST-Server
Read all
DB
Writes (Public) 5
validation 4
Coordinator
prevent conflicting
validations
Distributed Cache-Aware Transaction
Speed Evaluation
• 10,000 objects
• 20 writes per second
• 95% reads
 16 times speedup
Distributed Cache-Aware Transaction
Abort Rate Evaluation
• 10,000 objects
• 20 writes per second
• 95% reads
 16 times speedup
 Significantly less aborts
 Highly reduced runtime
of retried transactions
Distributed Cache-Aware Transaction
Combined with RAMP Transactions
1 read objects
3
2 validate
load other version
3
Research Challenges
Encrypted Databases


Example: CryptDB
Idea: Only decrypt as much as necessary
SQL-Proxy
Encrypts and decrypts, rewrites queries
RDBMS
Research Challenges
Encrypted Databases
Example: CryptDB
Idea: only decrypt as much as necessary
SQL proxy: encrypts and decrypts, rewrites queries
RDBMS stores only encrypted data

Relational Cloud – DBaaS architecture:
• Encrypted with CryptDB
• Multi-tenancy through live migration
• Workload-aware partitioning (graph-based)
C. Curino, et al. "Relational Cloud: A Database-as-a-Service for the Cloud", CIDR 2011
• Early approach
• Not adopted in practice, yet
Dream solution: Fully Homomorphic Encryption
Research Challenges
Transactions and Scalable Consistency

System        | Consistency        | Transactional Unit | Commit Latency | Data Loss?
Dynamo        | Eventual           | None               | 1 RT           | –
Yahoo PNuts   | Timeline per key   | Single Key         | 1 RT           | possible
COPS          | Causality          | Multi-Record       | 1 RT           | possible
MySQL (async) | Serializable       | Static Partition   | 1 RT           | possible
Megastore     | Serializable       | Static Partition   | 2 RT           | –
Spanner/F1    | Snapshot Isolation | Partition          | 2 RT           | –
MDCC          | Read-Committed     | Multi-Record       | 1 RT           | –
Research Challenges
Transactions and Scalable Consistency
Multi-Data Center Consistency (MDCC)
Idea:
• Multi-data center commit protocol with a single round-trip
Implementation:
• Optimistic Commit Protocol
• Fast, Generalized Multi-Paxos
Result: almost as fast as Dynamo-style eventual consistency
T. Kraska et al. "MDCC: Multi-data center consistency." EuroSys, 2013.
Currently no NoSQL DB implements consistent Multi-DC replication.
Research Challenges
NoSQL Benchmarking

YCSB (Yahoo Cloud Serving Benchmark)
Workload:
1. Operation Mix
2. Record Size
3. Popularity Distribution
Threads
Stats
Client
Pluggable DB interface
Workload Generator
Runtime Parameters:
DB host name,
threads, etc.
Read()
Insert()
Update()
Delete()
Scan()
Data Store
DB protocol
Research Challenges
NoSQL Benchmarking
YCSB (Yahoo Cloud Serving Benchmark) – core workloads:

Workload          | Operation Mix        | Distribution    | Example
A – Update Heavy  | Read 50%, Update 50% | Zipfian         | Session Store
B – Read Heavy    | Read 95%, Update 5%  | Zipfian         | Photo Tagging
C – Read Only     | Read 100%            | Zipfian         | User Profile Cache
D – Read Latest   | Read 95%, Insert 5%  | Latest          | User Status Updates
E – Short Ranges  | Scan 95%, Insert 5%  | Zipfian/Uniform | Threaded Conversations
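A small sketch of how a Zipfian popularity distribution concentrates requests on a few hot keys (parameters are illustrative, not YCSB's exact generator):

import random

def zipf_weights(n, s=0.99):
    # weight of the k-th most popular key is proportional to 1 / k^s
    w = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(w)
    return [x / total for x in w]

keys = ["user%d" % i for i in range(1000)]
sample = random.choices(keys, weights=zipf_weights(len(keys)), k=10000)
print(sample.count("user0") / len(sample))  # the hottest key receives a large share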
Research Challenges
NoSQL Benchmarking

Example Result
(Read Heavy):
Research Challenges
NoSQL Benchmarking

Example Result
(Read Heavy):
Weaknesses:
• Single client can be a
bottleneck
• No consistency &
availability measurement
Research Challenges
NoSQL Benchmarking
Weaknesses of YCSB:
• Single client can be a bottleneck
• No consistency & availability measurement
• No transaction support
• No specific application → CloudStone, CARE, TPC extensions?

YCSB++
• Clients coordinate through Zookeeper
• Simple read-after-write checks
• Evaluation: HBase & Accumulo
S. Patil, M. Polte, et al. "YCSB++: benchmarking and performance debugging advanced features in scalable table stores", SOCC 2011

YCSB+T
• New workload: Transactional Bank Account
• Simple anomaly detection for Lost Updates
• No comparison of systems
A. Dey et al. "YCSB+T: Benchmarking Web-Scale Transactional Databases", CloudDB 2014
How can the choices for an appropriate system be narrowed down?
Outline
NoSQL Foundations and
Motivation
The NoSQL Toolbox:
Common Techniques
NoSQL Systems
Decision Guidance: NoSQL
Decision Tree
• Decision Tree
• Classification Summary
• Literature Recommendations
NoSQL Decision Tree
Access
Fast Lookups
Complex Queries
Volume
Volume
RAM
Unbounded
HDD-Size
Unbounded
CAP
Consistency
Query Pattern
AP
Redis
Memcache
Cache
CP
ACID
Availability
Cassandra
Riak
Voldemort
Aerospike
HBase
MongoDB
CouchBase
DynamoDB
RDBMS
Neo4j
RavenDB
MarkLogic
CouchDB
MongoDB
SimpleDB
Shopping basket
Order History
OLTP
Website
Example Applications
Ad-hoc
Analytics
MongoDB
RethinkDB
HBase,Accumulo
ElasticSearch, Solr
Hadoop, Spark
Parallel DWH
Cassandra, HBase
Riak, MongoDB
Social
Network
Big Data
NoSQL Decision Tree
Access
Fast Lookups
Complex Queries
Volume
Volume
RAM
Unbounded
HDD-Size
Unbounded
CAP
Consistency
Query Pattern
AP
Redis
Memcache
Cache
CP
Cassandra
Riak
Voldemort
Aerospike
HBase
MongoDB
CouchBase
DynamoDB
Shopping basket
ACID
RDBMS
Neo4j
RavenDB
MarkLogic
Availability
CouchDB
MongoDB
SimpleDB
Ad-hoc
MongoDB
RethinkDB
HBase,Accumulo
ElasticSearch, Solr
Analytics
Hadoop, Spark
Parallel DWH
Cassandra, HBase
Riak, MongoDB
Purpose:
• Application Architects: narrowing down the potential system candidates based on requirements
• Database Vendors/Researchers: clear communication and design of system trade-offs
Example Applications: Shopping basket, Order History, OLTP, Website, Social Network, Big Data
System Properties
According to the NoSQL Toolbox

For fine-grained system selection:
A feature matrix (figure) marks which functional requirements MySQL, Cassandra, Riak, HBase, Redis and MongoDB support: Scan Queries, ACID Transactions, Conditional Writes, Joins, Sorting, Filter Queries, Full-Text Search and Analytics.
System Properties
According to the NoSQL Toolbox

For fine-grained system selection:
A feature matrix (figure) rates MySQL, Cassandra, Riak, HBase, Redis and MongoDB on non-functional requirements: Data Scalability, Write Scalability, Read Scalability, Elasticity, Consistency, Write Latency, Read Latency, Write Throughput, Read Availability, Write Availability and Durability.
System Properties
According to the NoSQL Toolbox
For fine-grained system selection: a technique matrix (figure) marks which NoSQL Toolbox techniques MySQL, Cassandra, Riak, HBase, Redis and MongoDB employ – Range-Sharding, Hash-Sharding, Entity-Group Sharding, Consistent Hashing, Shared-Disk, Transaction Protocol, Sync./Async. Replication, Primary Copy, Update Anywhere, Logging, Update-in-Place, Caching, In-Memory, Append-Only Storage, Global/Local Indexing, Query Planning, Analytics Framework and Materialized Views.
Future Work
Online Collaborative Decision Support

Select Requirements in Web GUI:
Conditional Writes
Read Scalability

Consistent
System makes suggestions based on data from
practitioners, vendors and automated benchmarks:
(example scores per system: 4/5, 4/5, 3/5, 4/5, 5/5, 5/5)
Summary

High-Level NoSQL Categories:



Key-Value, Wide-Column, Document, Graph
Two out of {Consistent, Available, Partition Tolerant}
The NoSQL Toolbox: systems use similar techniques
that promote certain capabilities
Techniques
Sharding, Replication,
Storage Management,
Query Processing

Decision Tree
promote
Functional
Requirements
Non-functional
Requirements
Our NoSQL research at the
University of Hamburg
Presentation
is loading
The Latency Problem
Average: 9.3s
Loading…
-7% Conversions
-20% Traffic
-9% Visitors
-1% Revenue
If perceived speed is such an
important factor
...what causes slow page load times?
State of the Art
Two bottlenecks: latency and processing
High Latency
Processing Time
Network Latency: Impact
I. Grigorik, High performance browser networking.
O’Reilly Media, 2013.
Network Latency: Impact
2× Bandwidth ≈ Same Load Time
½ Latency ≈ ½ Load Time
I. Grigorik, High performance browser networking.
O’Reilly Media, 2013.
Our Low-Latency Vision
Data is served by ubiquitous web-caches
Low Latency
Less Processing
Innovation
Solution: Proactively Revalidate Data
5 Years
Research & Development
New Algorithms
Solve Consistency Problem
Bloom filter
0 1 1 1
0 1 1
0 0 1
Innovation
Solution: Proactively Revalidate Data
F. Gessert, F. Bücklers, and N. Ritter, "ORESTES: a Scalable Database-as-a-Service Architecture for Low Latency", in CloudDB 2014.
F. Gessert, S. Friedrich, W. Wingerath, M. Schaarschmidt, and N. Ritter, "Towards a Scalable and Unified REST API for Cloud Data Stores", in 44. Jahrestagung der GI, Bd. 232, S. 723–734.
F. Gessert and F. Bücklers, "ORESTES: ein System für horizontal skalierbaren Zugriff auf Cloud-Datenbanken", in Informatiktage 2013.
F. Gessert, M. Schaarschmidt, W. Wingerath, S. Friedrich, and N. Ritter, "The Cache Sketch: Revisiting Expiration-based Caching in the Age of Cloud Data Management", in BTW 2015.
F. Gessert and F. Bücklers, Performanz- und Reaktivitätssteigerung von OODBMS vermittels der Web-Caching-Hierarchie. Bachelor's thesis, 2010.
F. Gessert and F. Bücklers, Kohärentes Web-Caching von Datenbankobjekten im Cloud Computing. Master's thesis, 2012.
M. Schaarschmidt, F. Gessert, and N. Ritter, "Towards Automated Polyglot Persistence", in BTW 2015.
W. Wingerath, S. Friedrich, and F. Gessert, "Who Watches the Watchmen? On the Lack of Validation in NoSQL Benchmarking", in BTW 2015.
S. Friedrich, W. Wingerath, F. Gessert, and N. Ritter, "NoSQL OLTP Benchmarking: A Survey", in 44. Jahrestagung der Gesellschaft für Informatik, 2014, Bd. 232, S. 693–704.
F. Gessert, "Skalierbare NoSQL- und Cloud-Datenbanken in Forschung und Praxis", BTW 2015
Competitive Advantage
We measured page load times for users in four geographic regions (Tokyo, California, Frankfurt, Sydney). Our caching technology achieves on average 6.8x faster loading times compared to other BaaS providers.
Business Model
Backend-as-a-Service
Pay-per-use
or on-Premise
Customer
Simplified
development
Cached data with
minimal latency
Baqend
Cloud
Baqend
Enterprise
Backend
Caching infrastructure
End user
Orestes
Components
Content-DeliveryNetwork
Orestes
Components
Polyglot Persistence
Mediator
Content-DeliveryNetwork
Orestes
Components
Backend-as-a-Service Middleware:
Caching, Transactions, Schemas,
Invalidation Detection, …
Content-DeliveryNetwork
Orestes
Components
Standard HTTP Caching
Content-DeliveryNetwork
Orestes
Components
Unified REST API
Content-DeliveryNetwork
Bloom filters for Caching
End-to-End Example
Browser
Cache
CDN
0 2 1 4 0
Bloom filters for Caching
End-to-End Example
Gets Time-to-Live
Estimation by the server
Browser
Cache
CDN
0 2 1 4 0
Bloom filters for Caching
End-to-End Example
Browser
Cache
CDN
0 2 1 4 0
Bloom filters for Caching
End-to-End Example
Browser
Cache
CDN
0 2 1 4 0
Bloom filters for Caching
End-to-End Example
purge(obj)
Browser
Cache
CDN
hashA(oid)
hashB(oid)
0 3
2 1 4 1
0
Bloom filters for Caching
End-to-End Example
Browser
Cache
CDN
Flattened Counting Bloom filter
0 1 1 1 1
0 3
2 1 4 1
0
Bloom filters for Caching
End-to-End Example
Browser
Cache
hashA(oid)
CDN
hashB(oid)
0 1 1 1 1
0 3
2 1 4 1
0
Bloom filters for Caching
End-to-End Example
Browser
Cache
hashA(oid)
CDN
hashB(oid)
0 1 1 1 1
0 3
2 1 4 1
0
Bloom filters for Caching
End-to-End Example
Browser
Cache
0 1 1 1 1
CDN
0 3
2 1 4 1
0
Bloom filters for Caching
End-to-End Example
Browser
Cache
CDN
hashA(oid)
0 1 1 1 1
hashB(oid)
0 2 1 4 0
Bloom filters for Caching
End-to-End Example
False-positive rate: f ≈ (1 − e^(−kn/m))^k
Optimal number of hash functions: k = ln(2) · (m/n)
With 20,000 distinct updates and a 5% error rate: 11 KByte
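A small sizing sketch using the textbook formulas (n distinct elements, target false-positive rate p); the actually transferred filter may be encoded and sized differently:

import math

def bloom_parameters(n, p):
    # m = -n*ln(p) / (ln 2)^2 bits, k = (m/n) * ln 2 hash functions
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

bits, hashes = bloom_parameters(20000, 0.05)
print(bits // 8, "bytes,", hashes, "hash functions")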
Consistency Guarantees: Δ-Atomicity, Read-Your-Writes, Monotonic
Reads, Monotonic Writes, Causal Consistency
hashA(oid)
0 1 1 1 1
hashB(oid)
0 2 1 4 0
Want to try Baqend?
Free Baqend Cloud instance at baqend.com
Download the Community Edition
Literature Recommendations
Recommended Literature
Recommended Literature
Recommended Literature: Cloud-DBs
Recommended Literature: Blogs
http://medium.baqend.com/
http://www.nosqlweekly.com/
http://www.dzone.com/mz/nosql
http://www.infoq.com/nosql/
https://martin.kleppmann.com/
https://aphyr.com/
http://muratbuffalo.blogspot.de/
http://highscalability.com/
http://db-engines.com/en/ranking
Seminal NoSQL Papers
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
Lamport, Leslie. Paxos made simple., SIGACT News, 2001
S. Gilbert, et al., Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web
services, SIGACT News, 2002
F. Chang, et al., Bigtable: A Distributed Storage System For Structured Data, OSDI, 2006
G. DeCandia, et al., Dynamo: Amazon's Highly Available Key-Value Store, SOSP, 2007
M. Stonebraker, el al., The end of an architectural era: (it's time for a complete rewrite), VLDB, 2007
B. Cooper, et al., PNUTS: Yahoo!'s Hosted Data Serving Platform, VLDB, 2008
Werner Vogels, Eventually Consistent, ACM Queue, 2009
B. Cooper, et al., Benchmarking cloud serving systems with YCSB., SOCC, 2010
A. Lakshman, Cassandra - A Decentralized Structured Storage System, SIGOPS, 2010
J. Baker, et al., MegaStore: Providing Scalable, Highly Available Storage For Interactive Services, CIDR,
2011
M. Shapiro, et al.: Conflict-free replicated data types, Springer, 2011
J.C. Corbett, et al., Spanner: Google's Globally-Distributed Database, OSDI, 2012
Eric Brewer, CAP Twelve Years Later: How the "Rules" Have Changed, IEEE Computer, 2012
J. Shute, et al., F1: A Distributed SQL Database That Scales, VLDB, 2013
L. Qiao, et al., On Brewing Fresh Espresso: Linkedin's Distributed Data Serving Platform, SIGMOD, 2013
N. Bronson, et al., Tao: Facebook's Distributed Data Store For The Social Graph, USENIX ATC, 2013
P. Bailis, et al., Scalable Atomic Visibility with RAMP Transactions, SIGMOD 2014
Thank you
[email protected]
Twitter: @Baqendcom
Blog: medium.baqend.com
Slides: slideshare.net/felixgessert
Web: baqend.com
Polyglot Persistence
Current best practice
Application Layer
Billing Data
Friend
network
Nested
Application Data
Cached data
& metrics
Session data
Search Index
Files
Recommendation Engine
Google Cloud
Storage
Amazon Elastic
MapReduce
Polyglot Persistence
Current best practice
Application Layer: Billing Data, Nested Application Data, Session data, Friend network, Cached data & metrics, Files, Search Index, Recommendation Engine (backed by e.g. Google Cloud Storage, Amazon Elastic MapReduce)
Research Question:
Can we automate the data-to-database mapping problem?
Vision
Schemas can be annotated with requirements
-
Write Throughput > 10,000 RPS
Read Availability > 99.9999%
Scans = true
Full-Text-Search = true
Monotonic Read = true
Schema
DBs
Tables
Fields
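As an illustration, such annotations could be carried as plain data on schema elements (the structure and field names below are made up, not Baqend's actual format):

annotated_schema = {
    "Tweet": {
        "annotations": {
            "write_throughput_rps": 10000,   # continuous non-functional
            "read_availability": 0.999999,   # continuous non-functional
            "scans": True,                   # binary functional
            "full_text_search": True,        # binary functional
            "monotonic_reads": True,         # binary non-functional
        },
        "fields": {"text": "string", "coordinates": "geopoint", "retweets": "int"},
    }
}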
Vision
The Polyglot Persistence Mediator chooses the database
Application
Data and
Operations
Database
Metrics
Annotated
Schema
Polyglot Persistence
Mediator
Latency < 30ms
db1
db2
db3
Step I - Requirements
Expressing the application‘s needs

Tenant annotates schema
with his requirements
Tenant
1. Define
schema
2. Annotate
Database
Table
Table
Field Field Field Field
annotated
Inherits continuous
annotations
1
Annotations
 Continuous non-functional
e.g. write latency < 15ms
 Binary functional
e.g. Atomic updates
 Binary non-functional
e.g. Read-your-writes
Requirements
Step II - Resolution
Finding the best database
Provider



The Provider resolves the
requirements
RANK: scores available
database systems
Routing Model: defines the
optimal mapping from schema
elements to databases
Either:
Refuse or
Provision new DB
Capabilities for
available DBs
1. Find optimal
2a. If unsatisfiable
RANK(schema_root, DBs)
through recursive descent
using annotated schema and metrics
2b. Generates
routing model
Routing Model
Route schema_element → db: transform db-independent operations to db-specific operations
2
Resolution
Step III - Mediation
Application
Routing data and operations




The PPM routes data
Operation Rewriting:
translates from abstract to
database-specific operations
Runtime Metrics: Latency,
availability, etc. are reported
to the resolver
Primary Database Option: All
data periodically gets
materialized to designated
database
1. CRUD, queries,
transactions, etc.
Polyglot Persistence Mediator
 Uses Routing Model
Report  Triggers periodic
materialization
metrics
2. route
db1
db2
3
Mediation
db3
Evaluation: News Article
Prototype of Polyglot Persistence Mediator in ORESTES
Scenario: news articles with impression counts
Objectives: low-latency top-k queries, high-throughput counts, article queries
Article
Counter
Evaluation: News Article
Prototype built on ORESTES
Scenario: news articles with impression counts
Objectives: low-latency top-k queries, high-throughput counts, article queries
Mediator
Counter updates kill performance
Evaluation: News Article
Prototype built on ORESTES
Scenario: news articles with impression counts
Objectives: low-latency top-k queries, high-throughput counts, article queries
Mediator
No powerful queries
Evaluation: News Article
Prototype built on ORESTES
Scenario: news articles with impression counts
Objectives: low-latency top-k queries, high-throughput counts, article queries
Article
ID
Title
…
Document
Imp.
Imp.
ID
Sorted Set
Found Resolution
Cloud Data Management

New field tackling the design, implementation,
evaluation and application implications of database
systems in cloud environments:
Protocols, APIs,
Caching
Application
architecture,
Data Models
Load distribution, Auto-Scaling, SLAs
Workload Management, Metering
Multi-Tenancy,
Consistency, Availability,
Query Processing, Security
Replication,
Partitioning,
Transactions,
Indexing
Cloud-Database Models
Data model (unstructured, schema-free, relational/structured) vs. deployment model (machine image vs. managed/service):
• unstructured: Analytics machine image → Analytics-as-a-Service → Analytics/ML APIs
• schema-free: NoSQL machine image → Managed NoSQL → NoSQL Service
• relational/structured: RDBMS machine image → Managed RDBMS/DWH → RDBMS/DWH Service
The managed and service variants constitute Database-as-a-Service.
Cloud-Deployed Database
Database-image provisioned in IaaS/PaaS-cloud
IaaS/PaaS deployment of
database system
Does not solve:
IaaS-Cloud
Provisioning, Backups, Security,
Scaling, Elasticity, Performance
Tuning, Failover, Replication, ...
Managed RDBMS/DWH/NoSQL DB
Cloud-hosted database: the DBaaS provider takes care of provisioning, backups, security, scaling, elasticity, performance tuning, failover, replication, ...
Examples: SQL Azure, Google Cloud SQL (RDBMS), Amazon Redshift (DWH), managed NoSQL DBs – all running on an IaaS cloud.
Proprietary Cloud Database
Designed for and deployed in vendor-specific cloud environment
Amazon
SimpleDB
Azure Tables
Database.com
BigTable, Megastore, Spanner, F1, Dynamo,
PNuts, Relational Cloud, …
Azure Blob
Storage
Openstack
Swift
Google Cloud
Storage
Object Store
Cloud
Google Cloud
Datastore
Database
Managed by
Cloud Provider
Provider‘s API
Black-box system
Analytics-as-a-Service
Analytic frameworks and machine learning with service APIs
Analytics Cluster
Google
BigQuery
Google
Prediction API
Cloud
ML
Provisioning,
Data Ingest
Azure
HDInsight
Analytics
Amazon Elastic
MapReduce
Backend-as-a-Service
DBaaS with embedded custom and predefined application logic
Authentication,
Users, Validation,etc.
Data API
Service-Layer
IaaS-Cloud
AppCelerator
Cloud
(mobile) BaaS
Backend API
Maps to (different)
databases
Pricing Models
Pay-per-use and plan-based
Pay-per-use (e.g. DynamoDB):
• Parameters: Network, Bandwidth, Storage, CPU, Requests, etc.
• Payment: Pre-Paid, Post-Paid
• Variants: On-Demand, Auction, Reserved
Plan-based (e.g. Compose):
• Parameters: allocated plan (e.g. 2 instances + X GB storage)
In both cases the account is billed for its usage at the end of the month.
Database-as-a-Service
Approaches to Multi-Tenancy
Four approaches, sharing progressively more of the stack (schema, database, database process, VM, hardware resources) between tenants:
• Private OS – e.g. Amazon RDS
• Private Process/DB – e.g. Compose
• Private Schema – e.g. Google DataStore
• Shared Schema (virtual schema) – most SaaS apps
T. Kiefer, W. Lehner “Private table database virtualization for dbaas”
UCC, 2011
Multi-Tenancy: Trade-Offs
Dimensions compared: application independence, resource utilization, isolation, maintenance/provisioning – for Private OS, Private Process/DB, Private Schema and Shared Schema (ratings shown in the figure).
W. Lehner, U. Sattler “Web-scale Data Management for the Cloud”
Springer, 2013
Authentication & Authorization
Checking Permissions and Identity
Internal Schemes
External Identity
Provider
Federated Identity
(Single Sign On)
e.g. Amazon IAM
e.g. OpenID
e.g. SAML
Authenticate/Login
Token
Authenticated Request
Authentication
API
Authorization
Database-as-a-Service
Response
User-based Access
Control
Role-based Access
Control
Policies
e.g. Amazon S3 ACLs
e.g. Amazon IAM
e.g. XACML
Service Level Agreements (SLAs)
Specification of Application/Tenant Requirements
SLA
Technical Part
1. SLO
2. SLO
3. SLO
Legal Part
1. Fees
2. Penalties
Service Level Objectives:
• Availability
• Durability
• Consistency/Staleness
• Query Response Time
Service Level Agreements
Expressing application requirements
Functional Service Level Objectives
◦ Guarantee a „feature“
◦ Determined by database system
◦ Examples: transactions, join
Non-Functional Service Level Objectives
◦ Guarantee a certain quality of service (QoS)
◦ Determined by database system and service provider
◦ Examples:
 Continuous: response time (latency), throughput
 Binary: Elasticity, Read-your-writes
Service Level Objectives
Making SLOs measurable through utilities
Utility expresses the "value" of a continuous non-functional requirement:
f_utility : metric → [0, 1]
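For example, a latency SLO could be mapped to a utility like this (a made-up linear utility, purely illustrative):

def latency_utility(latency_ms, slo_ms=30.0):
    # full utility at or below the SLO, degrading linearly to 0 at twice the SLO
    if latency_ms <= slo_ms:
        return 1.0
    return max(0.0, 1.0 - (latency_ms - slo_ms) / slo_ms)

print(latency_utility(20), latency_utility(45), latency_utility(90))  # 1.0 0.5 0.0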
Workload Management
Guaranteeing SLAs
Typical approach (figure): maximize overall utility
W. Lehner, U. Sattler "Web-scale Data Management for the Cloud", Springer, 2013
Resource & Capacity Planning
From a DBaaS provider‘s perspective
Goal: minimize penalty and
resource costs
Resources
Expected
Load
Time
T. Lorido-Botran, J. Miguel-Alonso et al.: “Auto-scaling Techniques for
Elastic Applications in Cloud Environments”. Technical Report, 2013
Resource & Capacity Planning
From a DBaaS provider‘s perspective
Provisioned Resources:
• #No of Shard- or Replica
servers
• Computing, Storage,
Network Capacities
Goal: minimize penalty and
resource costs
Resources
Expected
Load
Time
T. Lorido-Botran, J. Miguel-Alonso et al.: “Auto-scaling Techniques for
Elastic Applications in Cloud Environments”. Technical Report, 2013
Resource & Capacity Planning
From a DBaaS provider‘s perspective
Goal: minimize penalty and
resource costs
Resources
Actual
Load
Time
T. Lorido-Botran, J. Miguel-Alonso et al.: “Auto-scaling Techniques for
Elastic Applications in Cloud Environments”. Technical Report, 2013
Resource & Capacity Planning
From a DBaaS provider‘s perspective
Goal: minimize penalty and
resource costs
Underprovisioning:
• SLAs violated
• Usage maximized
Resources
Actual
Load
Overprovisioning:
• SLAs met
• Excess Capacities
Time
T. Lorido-Botran, J. Miguel-Alonso et al.: “Auto-scaling Techniques for
Elastic Applications in Cloud Environments”. Technical Report, 2013
SLAs in the wild
Most DBaaS systems offer no SLAs, or only a simple uptime guarantee

Service            | Model                               | CAP | SLAs
SimpleDB           | Table-Store (NoSQL Service)         | CP  | –
DynamoDB           | Table-Store (NoSQL Service)         | CP  | –
Azure Tables       | Table-Store (NoSQL Service)         | CP  | 99.9% uptime
AE/Cloud DataStore | Entity-Group Store (NoSQL Service)  | CP  | –
S3, Az. Blob, GCS  | Object-Store (NoSQL Service)        | AP  | 99.9% uptime (S3)
Open Research Questions
in Cloud Data Management

Service-Level Agreements
◦ How can SLAs be guaranteed in a virtualized, multi-tenant
cloud environment?

Consistency
◦ Which consistency guarantees can be provided in a geo-replicated system without sacrificing availability?

Performance & Latency
◦ How can a DBaaS deliver low latency in the face of distributed
storage and application tiers?

Transactions
◦ Can ACID transactions be aligned with NoSQL and scalability?
DBaaS Example
Amazon RDS (Relational Database Service)
Model: Managed RDBMS
Pricing: Instance + Volume + License
Underlying DB: MySQL, Postgres, MSSQL, Oracle
API: DB-specific
• Synchronous replication and automatic failover → 99.95% uptime SLA
• Provisioned IOPS: network-optimized access to EBS volumes (up to 4,000 IOPS)
• EC2 instances: up to 32 cores, 244 GB RAM, 10 GbE
• Minor version upgrades are performed without downtime
• Backups are automated and scheduled
• Support for (asynchronous) Read Replicas
• Administration: web-based or via SDKs
• Only RDBMSs
• "Analytic brother" of RDS: Redshift (PDWH)
DBaaS Example
Azure Tables
REST API
No Index: Lookup only (!) by full table scan
Partition Key | Row Key (sorted) | Timestamp (automatic) | Property1 … Propertyn
intro.pdf     | v1.1             | 14/6/2013             | …
intro.pdf     | v1.2             | 15/6/2013             | …
präs.pptx     | v0.0             | 11/6/2013             | …
Properties are sparse; atomic "Entity Group Batch Transactions" are possible within a partition; partitions are hash-distributed to partition servers.

Similar to Amazon SimpleDB and DynamoDB
SimpleDB: indexes all attributes, rich(er) queries, many limits (size, RPS, etc.)
DynamoDB: provisioned throughput, on SSDs ("single digit latency"), optional indexes
DBaaS and PaaS Example
Heroku Addons


Many hosted NoSQL DBaaS providers represented
And search services
DBaaS and PaaS Example
Redis2Go
Heroku Addons
Model:
Create Heroku App:
Pricing:
Managed NoSQL
Plan-based
Underlying DB:
Add Redis2Go Addon:
Redis
API:
Redis
Use Connection URL (environment variable):
Deploy:
DBaaS and PaaS Example
Redis2Go
Heroku Addons
Model:
Create Heroku App:
Pricing:
Managed NoSQL
Plan-based
Underlying DB:
Add Redis2Go Addon:
Redis
API:
Redis
Use Connection URL (environment variable):
Deploy:
• Very simple
• Only suited for small to medium
applications (no SLAs, limited control)
Cloud-Deployed DB
An alternative to DBaaS-Systems

Idea: Run (mostly) unmodified DB on IaaS

Method I: DIY
1. Provision VM(s)

2. Install DBMS (manual, script,
Chef, Puppet)
Method II: Deployment Tools
> whirr launch-cluster --config
hbase.properties
Login, cluster-size etc.

Method III: Marketplaces
Amazon EC2
Google BigQuery
Model: Analytics-as-a-Service
Pricing: Storage + GBs Processed
API: REST
Idea: Web-scale analysis of nested data
Dremel: multi-level execution tree on a nested columnar data format (≥ 100 nodes)
• SLA: 99.9% uptime / month
• Fundamentally different from relational DWHs and MapReduce
• Design copied by Apache Drill, Impala, Shark
Melnik et al. "Dremel: Interactive analysis of web-scale datasets", VLDB 2010
Managed NoSQL services
Summary
Model
HBase
WideColumn
CAP
CP
Scans
Sec.
Indices
Over
Row Key
Largest
Cluster
~700
Learning
1/4
Lic.
DBaaS
Apache
(EMR)
MongoDB
Document
CP
Riak
KeyValue
AP
yes
>100
<500
4/4
GPL
~60
3/4
Apache
(Softlayer)
Cassandra
WideColumn
AP
With
Comp.
Index
Redis
KeyValue
CA
Through
Lists,
etc.
manual
>300
<1000
2/4
Apache
N/A
4/4
BSD
Managed NoSQL services
Summary
And there are many more:
• CouchDB (e.g. Cloudant)
• CouchBase (e.g. KuroBase Beta)
• ElasticSearch (e.g. Bonsai)
• Solr (e.g. WebSolr)
• …
Proprietary Database services
Summary

Service            | Model        | CAP | Scans                | Sec. Indices            | Queries                        | API                | Scale-out                    | SLA
SimpleDB           | Table-Store  | CP  | Yes (as queries)     | Automatic               | SQL-like (no joins, groups, …) | REST + SDKs        | Automatic                    | –
DynamoDB           | Table-Store  | CP  | By range key / index | Local Sec., Global Sec. | Key + Cond. on Range Key(s)    | REST + SDKs        | Automatic over Prim. Key     | –
Azure Tables       | Table-Store  | CP  | By range key         |                         | Key + Cond. on Range Key       | REST + SDKs        | Automatic over Part. Key     | 99.9% uptime
AE/Cloud DataStore | Entity-Group | CP  | Yes (as queries)     |                         | Conjunct. of Eq. Predicates    | REST/SDK, JDO, JPA | Automatic over Entity Groups | –
S3, Az. Blob, GCS  | Blob-Store   | AP  |                      |                         |                                | REST + SDKs        | Automatic over key           | 99.9% uptime (S3)
Hadoop Distributed FS (CP)
HDFS
Model:
File System
HDD trend – 1990: size 1.4 GB, reading 4.8 MB/s → ~5 min/HDD; 2013: size 1 TB, reading 100 MB/s → ~2.5 h/HDD


License:
Apache 2
Written in:
Java
Modelled after: Google's GFS (2003)
Master-Slave Replication
◦ Namenode: Metadata (files + block locations)
◦ Datanodes: Save file blocks (usually 64 MB)

Design goal: Maximum Throughput and data locality for
Map-Reduce
Sends data operations to
DataNodes and metadata
operations to the NameNode
Holds filesystem data and
block locations in RAM
DataNodes communicate to
perform 3-way replication
Files are split into blocks and
scattered over DataNodes
Holmes, Alex. Hadoop in Practice. Manning, 2012.
Hadoop
Hadoop
Model:






Batch-Analytics
Framework
For many, synonymous with Big Data analytics
License:
Large Ecosystem
Apache 2
Written in:
Creator: Doug Cutting (Lucene)
Java
Distributors: Cloudera, MapR, HortonWorks
Gartner Prognosis: By 2015 65% of all complex analytic
applications will be based on Hadoop
Users: Facebook, Ebay, Amazon, IBM, Apple, Microsoft,
NSA
http://de.slideshare.net/cultureofperformance/gartner-predictions-for-hadoop-predictions
MapReduce: Example
Constructing a reverse-index
Input
(HDFS)
doc1.txt
cat sat mat
doc2.txt
cat sat dog
Mappers
Intermediate
Output
cat, doc1.txt
sat, doc1.txt
mat, doc1.txt
cat, doc2.txt
sat, doc2.txt
dog, doc2.txt
Reducers
Output
part-r-0000
cat: doc1.txt, doc2.txt
part-r-0001
sat: doc1.txt, doc2.txt
dog: doc2.txt
part-r-0002
mat: doc1.txt
Holmes, Alex. Hadoop in Practice
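A minimal pure-Python simulation of the map, shuffle and reduce phases above (no Hadoop involved, just to make the data flow concrete):

from collections import defaultdict

docs = {"doc1.txt": "cat sat mat", "doc2.txt": "cat sat dog"}

# map phase: emit (word, document) pairs
pairs = [(word, name) for name, text in docs.items() for word in text.split()]

# shuffle phase: group intermediate values by key
groups = defaultdict(set)
for word, name in pairs:
    groups[word].add(name)

# reduce phase: one posting list per word
for word in sorted(groups):
    print(word + ": " + ", ".join(sorted(groups[word])))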
Cluster Architecture
The client sends job
and configuration to
the Jobtracker
The JobTracker
coordinates the cluster
and assigns tasks
TaskTrackers execute Mappers
and Reducers as child-processes
Arun Murthy “Apache Hadoop YARN”
Cluster Architecture
YARN – Abstracting from MR
The ResourceManager
is a pure scheduler
Only the ApplicationMaster is
Framework specific (e.g. MR)
Arun Murthy “Apache Hadoop YARN”
Summary: Hadoop Ecosystem




Hadoop: Ecosystem for Big Data Analytics
Hadoop Distributed File System: scalable, shared-nothing file
system for throughput-oriented workloads
Map-Reduce: Paradigm for performing scalable distributed
batch analysis
Other Hadoop projects:
◦
◦
◦
◦
◦
◦
◦
Hive: SQL(-dialect) compiled to YARN jobs (Facebook)
Pig: workflow-oriented scripting language (Yahoo)
Mahout: Machine-Learning algorithm library in Map-Reduce
Flume: Log-Collection and processing framework
Whirr: Hadoop provisioning for cloud environments
Giraph: Graph processing à la Google Pregel
Drill, Presto, Impala: SQL Engines
Spark
Spark
Model:


„In-Memory“ Hadoop that does not suck
for iterative processing (e.g. k-means)
Resilient Distributed Datasets (RDDs):
partitioned, in-memory set of records
Batch Processing
Framework
License:
Apache 2
Written in:
Scala
M. Zaharia, M. Chowdhury, T. Das, et al. „Resilient distributed
datasets: A fault-tolerant abstraction for in-memory cluster computing“
Spark
Example RDD Evaluation


Transformations: RDD  RDD
Actions: Reports an operation
inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesCount = errorsRDD.union(warningsRDD).count()
Runtime
Execution
RDD Lineage
H. Karau et al. „Learning Spark“
Storm
Storm
Model:


Distributed Stream Processing Framework
Topology is a DAG of:
◦ Spouts: Data Sources
◦ Bolts: Data Processing Tasks

Stream Processing
Framework
License:
Apache 2
Written in:
Java
Cluster:
◦ Nimbus (Master) ↔ Zookeeper ↔ Worker
Nathan Marz „Big Data“
Kafka
Kafka
Model:




Scalable, Persistent Pub-Sub
Log-Structured Storage
Guarantee: At-least-once
Partitioning:
Distributed PubSub-System
License:
Apache 2
Written in:
Scala
◦ By Topic/Partition
◦ Producer-driven
 Round-robin
 Semantic

Replication:
◦ Master-Slave
◦ Synchronous to majority
J. Kreps, N. Narkhede, J. Rao, und others, „Kafka:
A distributed messaging system for log processing“
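A sketch of the producer-driven partition choice described above (plain Python, not the Kafka client API; the partition count is arbitrary):

import itertools
from typing import Optional
from zlib import crc32

NUM_PARTITIONS = 4
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key: Optional[bytes]) -> int:
    if key is None:
        return next(_round_robin)           # round-robin when no key is given
    return crc32(key) % NUM_PARTITIONS      # "semantic": same key -> same partition

print(choose_partition(b"user42"), choose_partition(b"user42"))  # same partition
print(choose_partition(None), choose_partition(None))            # rotates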