Chapter 10
Data Models and Storage
A model’s just an imitation of the real thing.
—Mae West
The ocean flows of online information are all streaming together, and the access tools are
becoming absolutely critical. If you don’t index it, it doesn’t exist. It’s out there but you can’t
find it, so it might as well not be there.
—Barbara Quint
The relational database model provided a strong theoretical foundation for the representation of data that
eliminated redundancy and maximized flexibility of data access. Many next-generation database systems,
particularly those of the NewSQL variety, continue to embrace the relational model: their innovations
generally focus on the underlying physical storage of data.
However, databases of the NoSQL variety explicitly reject the fixed schema of the relational model. This
rejection does not arise from any theoretical disagreement with the principles of relational data modeling but, rather, from a practical desire to facilitate agility in application development by allowing schemas to evolve
with the application. Additionally, NoSQL databases seek to avoid the overhead created by operations such
as joins, which are a necessary consequence of relational normalization.
Underneath the data model, all databases adopt physical storage mechanisms designed to optimize
typical access paths. Following our review of new and traditional data models in this chapter, we’ll examine
how storage is designed to support the models.
Data Models
Today’s databases adopt a variety of data models:
•	Relational models serve as the inspiration for the representation of data in the traditional RDBMS (Oracle, SQL Server, etc.), as well as for NewSQL databases such as Vertica and VoltDB.
•	Key-value stores theoretically impose no structure or limitation on the "value" part of the key-value pair. However, in practice, most key-value stores provide additional support for certain types of data structures to allow for indexing and conflict resolution.
•	Databases based on Google's BigTable database implement the wide column store described in the BigTable specification. However, some significant innovations have been introduced in databases such as Cassandra.
•	Document databases use JSON or XML documents, which generally impose no restriction on the data that can be represented, but which provide a self-describing and predictable data representation.
•	Graph databases represent data as nodes, relationships, and properties. Graph databases were described in detail in Chapter 5.
Review of the Relational Model of Data
Before diving into nonrelational data models, let’s quickly recapitulate the relational model, which
dominated the last generation of database systems and which remains the most widely adopted data model
today. The relational model forms the basis not just for the traditional RDBMS but also for databases of
the NewSQL variety—databases such as Vertica and VoltDB, for instance. We provided an overview of the
relational model back in Chapter 1, and of course the relational data model is supported by a vast body of
literature. What follows is a brief summary.
The relational model organizes values into tuples (rows). Multiple tuples are used to construct relations
(tables). Rows are identified by key values and—at least in third normal form—all values will be locatable by
the entire primary key and nothing else. Foreign keys define relationships between tables by referencing the
primary keys in another table.
The process of eliminating redundancy from a relational model is known as normalization. Figure 10-1
provides an example of un-normalized and normalized data.
Figure 10-1. Normalized relational data model
The star schema represents a data modeling pattern commonly found—indeed, almost always found—
in data warehouses. In a star schema, a large "fact" table contains detailed business data together with foreign keys to smaller, more static "dimension" tables that categorize the fact items in business terms,
typically including time, product, customer, and so on.
Figure 10-2 shows an example of a star schema. The central SALES fact table contains sales totals that
are aggregated across various time periods, products, and customers. The detail and explanation of each
aggregation can be found by joining to the dimension tables TIMES, PRODUCTS, and CUSTOMERS.
Figure 10-2. Star schema
Key-value Stores
Unlike the relational model, there is no formal definition for how data is represented in a key-value store.
Typically a key-value store will accept any binary value within specific size limits as a key and is capable of
storing any binary data as a value. In this respect, we might say that the key-value store is data-type agnostic.
However, most key-value stores provide additional support for certain types of data. The support stems from
the desire to provide one of the following features:
•	Secondary indexes. In a pure key-value store, the only way to locate an object is via its key. However, most applications perform lookups on non-key terms. For instance, we may wish to locate all users within a geography, all orders for a specific company, and so on.
•	Conflict resolution. Key-value stores that provide eventual consistency or Amazon Dynamo-style tunable consistency may implement special data types that facilitate conflict resolution.
Riak, a key-value store based on Amazon’s Dynamo specification, illustrates both of these patterns.
As well as binary objects, Riak allows for data to be defined as one of the following:
•	A Riak convergent replicated data type (CRDT). These data types, described in more detail below, include maps, sets, and counters. Conflicting operations on these data types can be resolved by Riak without requiring application or human intervention.
•	A document type, such as XML, JSON, or Text.
•	A custom data type.
Riak includes Solr, an open-source text search engine. When a Riak value is defined as one of the built-in data types, or as a custom data type for which you have provided custom search code, then Solr will index
the value and provide a variety of lookup features, including exact match, regular expression, and range
searches.
In the case of JSON, XML, and Riak maps, searches can be restricted to specific fields within the
document. So, for instance, you could search for documents that have a specific value for a specific JSON
attribute.
Convergent Replicated Data Types
We touched on convergent replicated data types (CRDTs) in Chapter 9. As we noted there, Riak uses vector clocks to determine whether two updates potentially conflict. By default, if after the examination of the vector
clocks for two conflicting updates the system cannot determine which of the updates is most correct, then
Riak will maintain both versions of the update by creating sibling values, which must be resolved by the user
or application.
CRDTs allow two conflicting updates to be merged even if their vector clocks indicate that they are
concurrent. A CRDT encapsulates deterministic rules that can be used either to merge the conflicting values
or to determine which of two conflicting values should “win” and be propagated.
When a CRDT value is propagated between nodes, it includes not just the current value of the object
but also a history of operations that have been applied to the object. This history is somewhat analogous to
the vector clock data that accompanies each update in a Riak system, but unlike a vector clock, it contains
information specifically pertaining to the history of a specific object.
The simplest example of CRDT merging involves the g-counter (grow-only counter) data type. This is a monotonically increasing counter that cannot be decremented (that is, you can increase the counter, but you cannot decrease its value).
You might think that you could merge g-counter updates simply by adding up all the increment
operations that have occurred on every node to determine the total counter value. However, this approach
ignores the possibility that some of these increment operations are replicas of other increment operations.
To avoid such double counting, each node maintains an array of the counter values received from every
node. Upon incrementing, the node increments its element in the array and then transmits the entire array
to other nodes. Should a conflict between updates be detected (through the vector clock mechanism), we
take the highest element for each node in each version and add them up.
That last sentence was quite a mouthful, but it’s still simpler than the mathematical notation! Stepping
through the example shown in Figure 10-3 will, we hope, illuminate the process.
•	At time t0, each node has a value of 0 for the counter, and each element in the array of counters (one for each node) is also set to 0.
•	Around time t1, node 1 receives an increment operation on the counter of +1, while node 2 receives an increment operation of +2. Nodes 1 and 2 transmit their counter values to node 3.
•	Around time t2, node 1 receives an increment of +4 and an update from node 2, while node 3 receives an increment of +2. Each node now has a different value for the counter: 7 (5,0,2), 3 (1,0,2), or 4 (0,0,4).
•	Around time t3, node 2 receives updates from node 1 and node 3. These updates are potentially conflicting, so node 2 has to merge the three counter arrays. By taking the highest element for each node from each array, node 2 concludes that the correct value for the counter is (5,0,4), which adds up to 9.
•	At time t4, node 2 propagates the correct values of the counter to the other nodes and the cluster is now back in sync.
Figure 10-3. Convergent replicated data type g-counter
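To make the merge rule concrete, here is a minimal Java sketch of a g-counter, assuming a fixed and known number of nodes. It illustrates only the merge logic described above; it is not Riak's actual implementation.

import java.util.Arrays;

public class GCounter {
    private final long[] counts;   // one slot per node
    private final int myNodeId;    // index of the local node

    public GCounter(int numNodes, int myNodeId) {
        this.counts = new long[numNodes];
        this.myNodeId = myNodeId;
    }

    // A local increment only ever touches this node's own slot.
    public void increment(long amount) {
        counts[myNodeId] += amount;
    }

    // Merging a replica received from another node takes the element-wise
    // maximum, which avoids double counting replicated increments.
    public void merge(long[] other) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] = Math.max(counts[i], other[i]);
        }
    }

    // The observed counter value is the sum of all the per-node slots.
    public long value() {
        return Arrays.stream(counts).sum();
    }
}

Applied to the example above, merging (5,0,2), (1,0,2), and (0,0,4) element by element yields (5,0,4), for a total of 9.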
Other CRDTs are defined in academia and implemented in Riak or in other databases. For instance:
•	The PN-counter type allows counters to increment and decrement safely. It is implemented as two g-counters, one of which maintains increments and the other decrements.
•	The G-set type implements a collection of objects to which you can add elements but never remove them.
•	The 2P-set provides a collection in which elements can be removed as well as inserted. However, an object can be removed only once.
•	The LWW-set allows multiple insertions and deletes, with a last-write-wins policy in the event of conflicting operations.
Other CRDT types provide further flexibility in the operations they support, but in a manner similar to
the LWW-set type, specify winners and losers in the event of conflict. The winner might be determined by
timestamps or by comparing the relative number of operations to which the element has been subjected, or
by some other domain-specific logic.
Data Models in BigTable and HBase
Google’s BigTable paper was published in 2006 and was the basis for the data model used in HBase,
Cassandra, and several other databases. Google also makes BigTable storage available as a service in the
Google Cloud BigTable product.
BigTable tables have the following characteristics:
•	Data is organized as tables that—like relational tables—have columns and rows.
•	Tables are indexed and sorted by a single rowkey.
•	A table includes one or more column families, which are named and specified in the table definition.
•	Column families are composed of columns. Column names are dynamic, and new columns can be created dynamically upon insertion of a new value. Use of the term "column" is somewhat misleading: BigTable column families are more accurately described as sorted multidimensional maps, in which values are identified by column name and timestamp.
•	Data for a specific column family is stored together on disk.
•	Tables are sparse: empty columns do not take up space.
•	A cell (intersection of row and column) may contain multiple versions of a data element, indexed by timestamp.
Column Family Structure
Column families can be used to group related columns for convenience, to optimize disk IO by co-locating
columns that are frequently accessed together on disk, or to create a multidimensional structure that can be
used for more complex data.
Figure 10-4 illustrates a simple column family structure. Columns are grouped into three column families, and each row has identical column names. In this configuration, the table resembles a relational table that
has been vertically partitioned on disk for performance reasons.
Figure 10-4. Simple column family structure
The uniqueness of the BigTable data model becomes more apparent when we create a “wide” column
family. In this case, column names represent the name portion of a name:value pair. Any given row key may
have any arbitrary collection of such columns, and there need be no commonality between rowkeys with
respect to column names.
Figure 10-5 illustrates such a wide column family. In the FRIENDS column family, we have a variable
number of columns, each corresponding to a specific friend. The name of the column corresponds to the
name of the friend, while the value of the column is the friend’s email. In this example, both Guy and Joanna
have a common friend John, so each shares that column. Other columns represent friends who are not shared, and those columns appear only in the rows that require them.
Figure 10-5. Wide column family structure
BigTable/HBase column families are described as sparse because no storage is consumed by columns
that are absent in a given row. Indeed, a BigTable column family is essentially a “map” consisting of an
arbitrary set of sorted name:value pairs.
Versions
Each cell in a BigTable column family can store multiple versions of a value, indexed by timestamp.
Timestamps may be specified by the application or automatically assigned by the server. Values are stored
within a cell in descending timestamp order, so by default a read will retrieve the most recent timestamp.
A read operation can specify a timestamp range or specify the number of versions of data to return.
A column family configuration setting specifies the maximum number of versions that will be stored for
each value. In HBase, the default number of versions is three. There is also a minimum version count, which
is typically combined with a time to live (TTL) setting. The TTL setting instructs the server to delete values
that are older than a certain number of seconds. The minimum version count overrides the TTL, so typically
at least one copy of the data will be kept regardless of age.
Figure 10-6 illustrates multiple values with timestamps. For the row shown, the info:loc column has
only a single value, but the readings:temp column has five values corresponding perhaps to the last five
readings of a thermostat.
Figure 10-6. Multiple versions of cell data in BigTable
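As an illustration, the HBase Java client (covered in more detail in Chapter 11) lets a read request several versions of a cell. The sketch below assumes the HBase 1.x client API and an already-open Connection; the table and column names (thermostats, readings:temp) are hypothetical, chosen to match the example above.

// Assumes an open HBase Connection, as shown in Chapter 11.
Table myTable = connection.getTable(TableName.valueOf("thermostats"));

Get myGet = new Get(Bytes.toBytes("device0001"));
myGet.setMaxVersions(5);                  // request up to five versions of each cell
Result myResult = myTable.get(myGet);

// Cells are returned in descending timestamp order: the newest reading first.
List<Cell> readings = myResult.getColumnCells(Bytes.toBytes("readings"),
                                              Bytes.toBytes("temp"));
for (Cell cell : readings) {
    System.out.println(cell.getTimestamp() + " -> "
            + Bytes.toString(CellUtil.cloneValue(cell)));
}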
Deletes in a BigTable database are implemented by creating tombstone markers that indicate that all versions of a column or column family earlier than a given timestamp have been removed. By default, a delete
uses the current timestamp, thus eliminating all previous row values.
Deleted elements persist on disk until a compaction occurs. We’ll discuss compaction later in this
chapter.
Cassandra
Cassandra’s data model is based on the BigTable design, but has evolved significantly since its initial release.
Indeed, it can be hard to recognize the BigTable structure when working with Cassandra through the
Cassandra Query Language (CQL).
The CQL CREATE TABLE statement allows us to define composite primary keys, which look like
familiar multi-column keys in relational databases. For instance, in the CQL shown below, we create a table
FRIENDS, which is keyed on columns USER and FRIEND, corresponding to the user's name and the name
of each of his or her friends:
CREATE TABLE friends
(user text,
friend text,
email text,
PRIMARY KEY (user,friend));
CQL queries on this table return results that imply one row exists for each combination of user and
friend:
cqlsh:guy> SELECT * FROM friends;

 user | friend | email
------+--------+------------------
   Jo | George | [email protected]
   Jo |    Guy | [email protected]
   Jo |   John | [email protected]
  Guy |     Jo | [email protected]
  Guy |   John | [email protected]
But when we look at the column family using the (now deprecated) thrift client, we can see that we have two rows, one with six columns and the other with four (the output here has been edited for clarity):
RowKey: Jo
=> (name=George:,      value=, timestamp=...)
=> (name=George:email, value=[email protected], timestamp=...)
=> (name=Guy:,         value=, timestamp=...)
=> (name=Guy:email,    value=[email protected], timestamp=...)
=> (name=John:,        value=, timestamp=...)
=> (name=John:email,   value=[email protected], timestamp=...)
-------------------
RowKey: Guy
=> (name=Jo:,          value=, timestamp=...)
=> (name=Jo:email,     value=[email protected], timestamp=...)
=> (name=John:,        value=, timestamp=...)
=> (name=John:email,   value=[email protected], timestamp=...)

2 Rows Returned.
The first part of the CQL primary key (USER, in our example) is used to specify the rowkey for the
table and is referred to as the partition key. The remaining parts of the primary key (FRIEND, in our example)
are clustering keys and are used to create a wide column structure in which each distinct value of the CQL
key column is used as part of the name of a BigTable-style column. So for instance, the column Guy:email
is constructed from the value “Guy” within the CQL column “Friend” together with the name of the CQL
column “email.”
That’s quite confusing! So it’s no wonder that Cassandra tends to hide this complexity within a more
familiar, relational-style, SQL-like notation. Figure 10-7 compares the Cassandra CQL representation of
the data with the underlying BigTable structure: the apparent five rows as shown in CQL are actually
implemented as two BigTable-style rows in underlying storage.
Figure 10-7. Cassandra CQL represents wide column structure as narrow tables
■■Note Cassandra uses the term “column family” differently from HBase and BigTable. A Cassandra column
family is equivalent to a table in HBase. For consistency’s sake, we may refer to Cassandra “tables” when a
Cassandra purist would say “column family.”
The underlying physical implementation of Cassandra tables explains some of the specific behaviors
within the Cassandra Query Language. For instance, CQL requires that an ORDER BY clause refer only to
composite key columns. WHERE clauses in CQL also have restrictions that seem weird and arbitrary unless
you understand the underlying storage model. The partition key accepts only equality clauses (IN and “=”),
which makes sense when you remember that rowkeys are hash-partitioned across the cluster, as we discussed
in Chapter 8. Clustering key columns do support range operators such as “>” and “<”, which again makes
sense when you remember that in the BigTable model the column families are actually sorted maps.
Cassandra Collections
Cassandra’s partitioning and clustering keys implement a highly scalable and efficient storage model.
However, Cassandra also supports collection data types that allow repeating groups to be stored within
column values.
For instance, we might have implemented our FRIENDS table using the MAP data type, which would
have allowed us to store a hash map of friends and emails within a single Cassandra column:
CREATE TABLE friends2
(person text,
friends map<text,text>,
PRIMARY KEY (person ));
INSERT into friends2(person,friends)
VALUES('Guy',
{'Jo':'[email protected]',
'john':'[email protected]',
'Chris':'[email protected]'});
Cassandra also supports SET and LIST types, as well as the MAP type shown above.
JSON Data Models
JavaScript Object Notation (JSON) is the de facto standard data model for document databases. We devoted
Chapter 4 to document databases; here, we will just formally look at some of the elements of JSON.
JSON documents are built up from a small set of very simple constructs: values, objects, and arrays.
•	Arrays consist of lists of values enclosed by square brackets ("[" and "]") and separated by commas (",").
•	Objects consist of one or more name:value pairs in the format "name":"value", enclosed by braces ("{" and "}") and separated by commas (",").
•	Values can be Unicode strings, standard format numbers (possibly including scientific notation), Booleans, arrays, or objects.
The last few words in the definition are very important: because values may include objects or arrays,
which themselves contain values, a JSON structure can represent an arbitrarily complex and nested set
of information. In particular, arrays can be used to represent repeating groups of documents, which in a
relational database would require a separate table.
Document databases such as CouchBase and MongoDB organize documents into buckets or
collections, which would generally be expected to contain documents of a similar type. Figure 10-8
illustrates some of the essential JSON elements.
Figure 10-8. JSON documents
Binary JSON (BSON)
MongoDB stores JSON documents internally in the BSON format. BSON is designed to be a more compact
and efficient representation of JSON data, and it uses more efficient encoding for numbers and other
data types. In addition, BSON includes field length prefixes that allow scanning operations to “skip over”
elements and hence improve efficiency.
Storage
One of the fundamental innovations in the relational database model was the separation of logical data
representation from the physical storage model. Prior to the relational model it was necessary to understand
the physical storage of data in order to navigate the database. That strict separation has allowed the
relational representation of data to remain relatively static for a generation of computer scientists, while
underlying storage mechanisms such as indexing have seen significant innovation. The most extreme
example of this decoupling can be seen in the columnar database world. Columnar databases such as
Vertica and Sybase IQ continue to support the row-oriented relational database model, even while they have
tipped the data on its side by storing data in a columnar format.
We have looked at the underlying physical storage of columnar systems in Chapter 6, so we don’t need
to examine that particular innovation here. However, there has been a fundamental shift in the physical
layout of modern nonrelational databases such as HBase and Cassandra. This is the shift away from B-tree storage structures optimized for random access to the log-structured merge tree pattern, which is optimized
instead for sequential write performance.
Typical Relational Storage Model
Most relational databases share a similar high-level storage architecture.
Figure 10-9 shows a simplified relational database architecture. Database clients interact with the
database by sending SQL to database processes (1). The database processes retrieve data from database
files on disk initially (2), and store the data in memory buffers to optimize subsequent accesses (3). If data
is modified, it is changed within the in-memory copy (4). Upon transaction commit, the database process
writes to a transaction log (5), which ensures that the transaction will not be lost in the event of a system
failure. The modified data in memory is written out to database files asynchronously by a “lazy” database
writer process (6).
Figure 10-9. Relational database storage architecture
Much of the architecture shown in Figure 10-9 can be found in nonrelational systems as well. In
particular, some equivalent of the transaction log is present in almost any transactional database system.
Another ubiquitous RDBMS architectural pattern—at least in the operational database world—is the
B-tree index. The B-tree index is a structure that allows for random access to elements within a database
system.
Figure 10-10 shows a B-tree index structure. The B-tree index has a hierarchical tree structure. At the
top of the tree is the header block. This block contains pointers to the appropriate branch block for any given
range of key values. The branch block will usually point to the appropriate leaf block for a more specific
range or, for a larger index, point to another branch block. The leaf block contains a list of key values and the
physical addresses of matching table data.
Figure 10-10. B-tree index structure
Leaf blocks contain links to both the previous and the next leaf block. This allows us to scan the index in
either ascending or descending order, and allows range queries using the “>”, “<” or “BETWEEN” operators
to be satisfied using the index.
B-tree indexes offer predictable performance because every leaf node is at the same depth. Each
additional layer in the index exponentially increases the number of keys that can be supported, and for
almost all tables, three or four IOs will be sufficient to locate any row.
However, maintaining the B-tree when changing data can be expensive. For instance, consider inserting a
row with the key value “NIVEN” into the table index diagrammed in Figure 10-10. To insert the row, we must add
a new entry into the L-O block. If there is no free space within a leaf block for a new entry, then an index split is
required. A new block must be allocated and half of the entries in the existing block have to be moved into the
new block. As well as this, there is a requirement to add a new entry to the branch block (in order to point to the
newly created leaf block). If there is no free space in the branch block, then the branch block must also be split.
These index splits are an expensive operation: new blocks must be allocated and index entries moved
from one block to another, and during this split access times will suffer. So although the B-tree index is an
efficient random read mechanism, it is not so great for write-intensive workloads.
The inherent limitations of the B-tree structure are somewhat mitigated by the ability to defer disk
writes to the main database files: as long as a transaction log entry has been written on commit, data file
modifications—including index blocks—can be performed in memory and written to disk later. However,
during periods of heavy, intensive write activity, free memory will be exhausted and throughput will be
limited by disk IO to the database blocks.
There have been some significant variations on the B-tree pattern to provide for better throughput for
write-intensive workloads: Both Couchbase’s HB+-Trie and Tokutek’s fractal tree index claim to provide
better write optimization.
However, an increasing number of databases implement a storage architecture that is optimized from
the ground up to support write-intensive workloads: the log-structured merge (LSM) tree.
Log-structured Merge Trees
The log-structured merge (LSM) tree is a structure that seeks to optimize storage and support extremely high
insert rates, while still supporting efficient random read access.
The simplest possible LSM tree consists of two indexed “trees”:
•	An in-memory tree, which is the recipient of all new record inserts. In Cassandra, this in-memory tree is referred to as the MemTable and in HBase as the MemStore.
•	A number of on-disk trees, which represent copies of in-memory trees that have been flushed to disk. In Cassandra, this on-disk structure is referred to as the SSTable and in HBase as the StoreFile.
The on-disk tree files are initially point-in-time copies of the in-memory tree, but are merged
periodically to create larger consolidated stores. This merging process is called compaction.
■■Note The log-structured merge tree is a very widely adopted architecture and is fundamental to BigTable,
HBase, Cassandra, and other databases. However, naming conventions vary among implementations. For
convenience, we use the Cassandra terminology by default, in which the in-memory tree is called a MemTable
and the on-disk trees are called SSTables.
The LSM architecture ensures that writes are always fast, since they operate at memory speed. The
transfer to disk is also fast, since it occurs in append-only batches that allow for fast sequential writes. Reads
occur either from the in-memory tree or from the disk tree; in either case, reads are facilitated by an index
and are relatively swift.
Of course, if the server failed while data was in the in-memory store, then it could be lost. For this
reason database implementations of the LSM pattern include some form of transaction logging so that the
changes can be recovered in the event of failure. This log file is roughly equivalent to a relational database
transaction (redo) log. In Cassandra, it is called the CommitLog and in HBase, the Write-Ahead Log (WAL).
These log entries can be discarded once the in-memory tree is flushed to disk.
Figure 10-11 illustrates the log-structured merge tree architecture, using Cassandra terminology.
Writes from database clients are first applied to the CommitLog (1) and then to the MemTable (2). Once the
MemTable reaches a certain size, it is flushed to disk to create a new SSTable (3). Once the flush completes,
CommitLog records may be purged (4). Periodically, multiple SSTables are merged (compacted) into larger
SSTables (5).
Figure 10-11. LSM architecture (Cassandra terminology)
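The write path can be sketched in a few lines of Java. The following toy example, which assumes string keys and values and a fixed flush threshold, exists only to illustrate the flow in Figure 10-11; it is not how Cassandra or HBase are actually implemented.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class ToyLsmWriter {
    private static final int FLUSH_THRESHOLD = 1000;

    private final FileWriter commitLog;                         // (1) append-only commit log
    private TreeMap<String, String> memTable = new TreeMap<>(); // (2) sorted in-memory tree
    private int ssTableCount = 0;

    public ToyLsmWriter(String commitLogPath) throws IOException {
        this.commitLog = new FileWriter(commitLogPath, true);
    }

    public synchronized void put(String key, String value) throws IOException {
        commitLog.write(key + "=" + value + "\n");               // (1) write ahead for durability
        commitLog.flush();
        memTable.put(key, value);                                // (2) memory-speed insert
        if (memTable.size() >= FLUSH_THRESHOLD) {
            flush();                                             // (3) spill to a new SSTable
        }
    }

    private void flush() throws IOException {
        String ssTableName = "sstable-" + (ssTableCount++) + ".dat";
        try (FileWriter ssTable = new FileWriter(ssTableName)) {
            // Keys are written in sorted order, producing an immutable, sorted file.
            for (Map.Entry<String, String> entry : memTable.entrySet()) {
                ssTable.write(entry.getKey() + "=" + entry.getValue() + "\n");
            }
        }
        memTable = new TreeMap<>();  // start a fresh MemTable
        // (4) Commit log records covering the flushed data could now be purged;
        // (5) a background compaction process would later merge SSTables.
    }
}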
SSTables and Bloom Filters
The on-disk portion of the LSM tree is an indexed structure. For instance, in Cassandra, each SSTable is
associated with an index that contains all the rowkeys that exist in the SSTable and an offset to the location
of the associated value within the file. However, there may be many SSTables on disk, and this creates a
multiplier effect on index lookups, since we would theoretically have to examine every index for every
SSTable in order to find our desired row.
To avoid these multiple-index lookups, bloom filters are used to reduce the number of lookups that must
be performed.
Bloom filters are created by applying multiple hash functions to the key value. The outputs of the hash
functions are used to set bits within the bloom filter structure. When looking up a key value within the bloom
filter, we perform the same hash functions and see if the bits are set. If the bits are not set, then the search
value must not be included within the table. However, if the bits are set, it may have been as a result of a
value that happened to hash to the same values. The end result is an index that is typically reduced in size by
85 percent, but that provides false positives only 15 percent of the time.
Bloom filters are compact enough to fit into memory and are very quick to navigate. However, to
achieve this compression, bloom filters are “fuzzy” in the sense that they may return false positives. If you
get a positive result from a bloom filter, it means only that the file may contain the value. However, the bloom
filter will never incorrectly advise you that a value is not present. So if a bloom filter tells us that a key is not
included in a specific SSTable, then we can safely omit that SSTable from our lookup.
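A simple Java sketch of the idea follows. The hash scheme here is deliberately crude and purely illustrative; real implementations use much stronger hash functions and tune the bit-array size and hash count to the desired false-positive rate.

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // A family of simple seeded hash functions (illustrative only).
    private int hash(byte[] key, int seed) {
        int h = seed;
        for (byte b : key) {
            h = h * 31 + b;
        }
        return Math.abs(h % size);
    }

    // Adding a key sets one bit per hash function.
    public void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(hash(key, i));
        }
    }

    // false means the key is definitely absent; true means it *may* be present.
    public boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(hash(key, i))) {
                return false;
            }
        }
        return true;
    }
}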
Figure 10-12 shows the read pattern for a log-structured merge tree using Cassandra terminology. A
database request first reads from the MemTable (1). If the required value is not found, it will consult the
bloom filters for the most recent SSTable (2). If the bloom filter indicates that no matching value is present,
it will examine the next SSTable (3). If the bloom filter indicates a matching key value may be present in the
SSTable, then the process will use the SSTable index (4) to search for the value within the SSTable (5). Once a
matching value is found, no older SSTables need be examined.
Figure 10-12. Log-structured merge tree reads (Cassandra terminology)
Updates and Tombstones
SSTables are immutable—that is, once the MemTable is flushed to disk and becomes an SSTable, no further
modifications to the SSTable can be performed. If a value is modified repeatedly over a period of time, the
modifications will build up across multiple SSTables. When retrieving a value, the system will read SSTables
from the youngest to the oldest to find the most recent value of a column, or to build up a complete row.
Therefore, to update a value we need only insert the new value, since the older values will not be examined
when a newer version exists.
Deletions are implemented by writing tombstone markers into the MemTable, which eventually
propagates to SSTables. Once a tombstone marker for a row is encountered, the system stops examining
older entries and reports “row not found” to the application.
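The read-side rule can be expressed as a short Java sketch. It assumes, purely for illustration, that the MemTable and SSTables are simple maps ordered from youngest to oldest and that deletions are represented by a reserved tombstone value; it is not actual Cassandra or HBase code.

import java.util.List;
import java.util.Map;

public class LsmReader {
    // A reserved marker standing in for a tombstone (illustrative only).
    private static final String TOMBSTONE = "__TOMBSTONE__";

    // Search the MemTable first, then each SSTable from youngest to oldest.
    // The first entry found wins; a tombstone means the row was deleted.
    public static String get(String key,
                             Map<String, String> memTable,
                             List<Map<String, String>> ssTablesYoungestFirst) {
        String value = memTable.get(key);
        if (value != null) {
            return TOMBSTONE.equals(value) ? null : value;
        }
        for (Map<String, String> ssTable : ssTablesYoungestFirst) {
            value = ssTable.get(key);
            if (value != null) {
                // Stop at the first (most recent) entry; older versions are ignored.
                return TOMBSTONE.equals(value) ? null : value;
            }
        }
        return null;  // key not present in any table
    }
}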
Compaction
As SSTables multiply, read performance and storage efficiency degrade as the numbers of bloom filters, indexes, and
obsolete values increase. So periodically the system will compact multiple SSTables into a single file. During
compaction, rows that are fragmented across multiple SSTables are consolidated and deleted rows are
removed.
However, tombstones will remain in the system for a period of time to ensure that a delayed update to
a row will not resurrect a row that should remain deleted. This can happen if a tombstone is removed while
older updates to that row are still being propagated through the system. To avoid this possibility, default
settings prevent tombstones from being deleted for over a week, while hinted handoffs (see Chapter 8)
generally expire after only three hours. But in the event of these defaults being adjusted, or in the event of an
unreasonably long network partition, it is conceivable that a row that has been deleted will be resurrected.
Secondary Indexing
A secondary index allows us to quickly find rows that meet some criteria other than their primary key or
rowkey value.
Secondary indexes are ubiquitous in relational systems: it’s a fundamental characteristic of a relational
system that you be able to navigate primary key and foreign key relationships, and this would be impractical if
only primary key indexes existed. In relational systems, primary key indexes and secondary indexes are usually
implemented in the same way: most commonly with B-tree indexes, or sometimes with bitmap indexes.
We discussed in Chapter 6 how columnar databases often use projections as an alternative to indexes:
this approach works in columnar systems because queries typically aggregate data across a large number of
rows rather than seeking a small number of rows matching specific criteria. We also discussed in Chapter
5 how graph databases use index-free adjacency to perform efficient graph traversal without requiring a
separate index structure.
Neither of these solutions is suitable for nonrelational operational database systems. The underlying
design of key-value stores, BigTable systems, and document databases assumes data retrieval by a specific
key value, and indeed in many cases—especially in earlier key-value systems—lookup by rowkey value is the
only way to find a row.
However, for most applications, fast access to data by primary key alone is not enough. So most
nonrelational databases provide some form of secondary index support, or at least provide patterns for
“do-it-yourself” secondary indexing.
DIY Secondary Indexing
Creating a secondary index for a key-value store is conceptually fairly simple. You create a table in which the
key value is the column or attribute to be indexed, and that contains the rowkey for the primary table.
Figure 10-13 illustrates the technique. The table USERS contains a unique identifier (the rowkey) for
each user, but we often want to retrieve users by email address. Therefore, we create a separate table in
which the primary key is the user’s email address and that contains the rowkey for the source table.
Figure 10-13. Do-it-yourself secondary indexing
Variations on the theme allow for indexing of non-unique values. For instance, in a wide column
store such as HBase, an index entry might consist of multiple columns that point to the rows matching the
common value as shown in the “COUNTRY” index in Figure 10-13.
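As a sketch of the basic pattern, using the HBase Java API introduced in Chapter 11, the application writes the base row and then writes a corresponding index row. The table and column names here (users, users_by_email) are hypothetical, and an open Connection is assumed.

// Assumes an open HBase Connection, as in the Chapter 11 examples.
Table users        = connection.getTable(TableName.valueOf("users"));
Table usersByEmail = connection.getTable(TableName.valueOf("users_by_email"));

byte[] userKey = Bytes.toBytes("9990");
byte[] email   = Bytes.toBytes("user9990@example.com");

// 1. Write the base row, keyed on the user ID.
Put basePut = new Put(userKey);
basePut.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), email);
users.put(basePut);

// 2. Write the index row: the indexed value (the email) becomes the rowkey,
//    and the base table rowkey is stored as the column value.
Put indexPut = new Put(email);
indexPut.addColumn(Bytes.toBytes("info"), Bytes.toBytes("user"), userKey);
usersByEmail.put(indexPut);

// Note that these are two separate operations on two separate tables: there is
// no atomicity between them, which underlies the problems listed below.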
However, there are some significant problems with the do-it-yourself approach outlined above:
•	It's up to the application code to consistently maintain both the data in the base table and all of its indexes. Bugs in application code or ad hoc manipulation of data using alternative clients can lead to missing or incorrect data in the index.
•	Ideally, the operations that modify the base table and the operations that modify the index will be atomic: either both succeed or neither succeeds. However, in nonrelational databases, multi-object transactions are rarely available. If the index operation succeeds but not the base table modification (or vice versa), then the index will be incorrect.
•	Eventual consistency raises a similar issue: an application may read from the index an entry that has not yet been propagated to every replica of the base table (or vice versa).
•	The index table supports equality lookups, but generally not range operations, since unlike in the B-tree structure, there is no pointer from one index entry to the next logical entry.
Do-it-yourself indexing is not completely impractical, but it places a heavy burden on the application
and leads in practice to unreliable and fragile implementations. Most of these issues are mitigated when a
database implements a native secondary indexing scheme. The secondary index implementation can be
made independent of the application code, and the database engine can ensure that index entries and base
table entries are maintained consistently.
Global and Local Indexes
Distributed databases raise additional issues for indexing schemes. If index entries are partitioned using
normal mechanisms—either by consistent hashing of the key value or by using the key value as a shard
key—then the index entry for a base table row is typically going to be located on a different node. As a result,
most lookups will span two nodes and hence require two IO operations.
If the indexed value is unique, then the usual sharding or hashing mechanisms will distribute data
evenly across the cluster. However, if the key value is non-unique, and especially if there is significant skew
in the distribution of values, then index entries and hence index load may be unevenly distributed. For
instance, the index on COUNTRY in Figure 10-13 would result in the index entries for the largest country
(USA, in our example) being located on a single node.
To avoid these issues, secondary indexes in nonrelational databases are usually implemented as local
indexes. Each node maintains its own index, which references data held only on the local node. Index-dependent queries are issued to each node, which returns any matching data via its local index to the query
coordinator, which combines the results.
Figure 10-14 illustrates the local secondary indexing approach. A database client requests data for a
specific non-key value (1). A query coordinator sends these requests to each node in the cluster (2). Each
node examines its local index to determine if a matching value exists (3). If a matching value exists in the
index, then the rowkey is retrieved from the index and used to retrieve data from the base table (4). Each
node returns data to the query coordinator (5), which consolidates the results and returns them to the
database client (6).
Figure 10-14. Local secondary indexing
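The flow in Figure 10-14 can be sketched in Java as follows. The NodeClient interface is hypothetical; it simply stands in for whatever RPC mechanism a given database uses to talk to each node, and the numbered comments refer to the steps above.

import java.util.ArrayList;
import java.util.List;

public class LocalIndexQueryCoordinator {

    // Hypothetical per-node interface: each node consults its local index and
    // returns matching rows from its local portion of the base table.
    public interface NodeClient {
        List<String> queryLocalIndex(String column, String value);
    }

    private final List<NodeClient> nodes;

    public LocalIndexQueryCoordinator(List<NodeClient> nodes) {
        this.nodes = nodes;
    }

    // (2) Send the request to every node, (5) gather the partial results,
    // and (6) return the consolidated result to the database client.
    public List<String> query(String column, String value) {
        List<String> results = new ArrayList<>();
        for (NodeClient node : nodes) {
            results.addAll(node.queryLocalIndex(column, value)); // (3) and (4) happen node-side
        }
        return results;
    }
}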
Secondary Indexing Implementations in NoSQL Databases
Although most nonrelational databases implement local secondary indexes, the specific implementations
vary significantly.
•	Cassandra provides local secondary indexes. The implementation in Cassandra Query Language (CQL) uses syntax that is familiar to users of relational databases. However, internally the implementation involves column families on each node that associate indexed values with rowkeys in the base column family. Each index row is keyed on the indexed value, and a wide column structure in the index row contains the rowkeys of matching base column family rows. The architecture is similar to the COUNTRY index example shown in Figure 10-13.
•	MongoDB allows indexes to be created on nominated elements of documents within a collection. MongoDB indexes are traditional B-tree indexes similar to those found in relational systems and as illustrated in Figure 10-10.
•	Riak is a pure key-value store. Since the values associated with keys in Riak are opaque to the database server, there is no schema element for Riak to index. However, Riak allows tags to be associated with specific objects, and local indexes on each node allow fast retrieval of matching tags. Riak architects now recommend using the built-in Solr integration discussed earlier in this chapter instead of this secondary indexing mechanism.
•	HBase does not provide a native secondary index capability. If your HBase implementation requires secondary indexes, you are required to implement some form of DIY secondary indexing. However, HBase has a coprocessor feature that significantly improves the robustness of DIY indexes and reduces the overhead for the programmer. An HBase observer coprocessor acts like a database trigger in an RDBMS—it allows the programmer to specify code that will run when certain events occur in the database. Programmers can use observer coprocessors to maintain secondary indexes, thereby ensuring that the index is maintained automatically and without exception. Some third parties have provided libraries that further assist programmers who need to implement secondary indexes in HBase. A rough sketch of the observer approach follows this list.
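Below is a rough sketch of such an observer, written against the HBase 1.x observer API; class, table, and column names are hypothetical, and error handling, updates to existing values, and deletes are all ignored. It is intended only to show the shape of the approach, not a production-ready implementation.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class EmailIndexObserver extends BaseRegionObserver {

    private static final byte[] INFO  = Bytes.toBytes("info");
    private static final byte[] EMAIL = Bytes.toBytes("email");

    // Invoked by the region server after each Put against the base table.
    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        List<Cell> emailCells = put.get(INFO, EMAIL);
        if (emailCells.isEmpty()) {
            return;  // this Put did not touch the indexed column
        }
        byte[] email = CellUtil.cloneValue(emailCells.get(0));

        // Write an index row: indexed value as rowkey, base table rowkey as value.
        Put indexPut = new Put(email);
        indexPut.addColumn(INFO, Bytes.toBytes("user"), put.getRow());
        try (Table indexTable =
                 ctx.getEnvironment().getTable(TableName.valueOf("users_by_email"))) {
            indexTable.put(indexPut);
        }
    }
}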
Conclusion
In this chapter we've reviewed some of the data model patterns implemented by nonrelational next-generation databases. NoSQL databases are often referred to as schema-less, but in reality schemas are more
often flexible than nonexistent.
HBase and Cassandra data models are based on the Google BigTable design, which implements a
sparse distributed multidimensional map structure. Column names in BigTable-oriented tables are in
reality closer to the keys in a Java or .NET map structure than to the columns in relational systems. Although
Cassandra uses BigTable-oriented data structures internally, the Cassandra engineers have implemented a
more relational-style interface on top of the BigTable structure: the Cassandra Query Language.
Many nonrelational systems use the log-structured merge tree architecture, which can sustain higher
write throughput than the traditional relational B-tree structures.
The initial implementation of many nonrelational systems—those based on BigTable or Dynamo in
particular—supported only primary key access. However, the ability to retrieve data based on some other
key is an almost universal requirement, so most nonrelational systems have implemented some form of
secondary indexing.
Chapter 11
Languages and Programming Interfaces
Any fool can write code that a computer can understand. Good programmers write code
that humans can understand.
—Martin Fowler
As far as the customer is concerned, the interface is the product.
—Jef Raskin
Crucial to the dominance of the relational database was the almost universal adoption of the SQL language
as the mechanism for querying and modifying data. SQL is not a perfect language, but it has demonstrated
sufficient flexibility to meet the needs of both non-programming database users and professional database
programmers. Programmers embed SQL in programming languages, while non-programmers use SQL
either explicitly within a query tool or implicitly when a BI tool uses SQL under the hood to talk to the
database. Prior to the introduction of SQL, most IT departments labored with a backlog of requests for
reports; SQL allowed the business user to “self-serve” these requests.
SQL remains the most significant database language. Not only does it remain the universal language for
RDBMS, but it is also widely adopted in next-generation database systems of the NewSQL variety.
However, even though NoSQL has been retrospectively amended to mean “Not Only SQL” rather than
“Hell No SQL!,” SQL is not usually available for next-generation databases of the NoSQL variety. In this
chapter, we look at how we can interact with these databases in the absence of SQL, and see how SQL is
increasingly finding its way back into the world of nonrelational databases.
■■Note The code examples in this chapter are intended only to provide a flavor of database programming
languages—this chapter is not intended to serve as a tutorial or reference for any of the languages concerned.
SQL
SQL remains the most significant database programming language, even within the scope of next-generation
database systems. Hundreds of books have been written on the topic of SQL (indeed, I’ve written a couple), and
it would be superfluous to provide a full review of the language here. However, it is probably worth recapping
the variations that exist within SQL implementations today, as well as key ingredients of the SQL language.
The SQL language consists of these major categories of statements:
•	Data query language (DQL), represented by the ubiquitous SELECT statement.
•	Data manipulation language (DML), which includes statements that modify data in the database, such as UPDATE, DELETE, INSERT, and MERGE, together with transactional control statements (COMMIT, ROLLBACK, BEGIN TRANSACTION) and—for the purposes of this discussion—data control language (DCL) statements such as GRANT.
•	Data definition language (DDL), which includes statements that create or alter tables and other structures (indexes, materialized views, etc.). DDL also allows for the specification of stored procedures and triggers. These statements are usually highly vendor specific, since they incorporate support for proprietary programming languages (Oracle PL/SQL, for instance) or for storage clauses that are unique to the database in question.
The SQL language is the subject of several ANSI and ISO standard specifications. Some of these
standards are:
•	SQL-89: The first major standard to be widely adopted. This standard describes the core elements of the SQL language as we know it today.
•	SQL-92 (SQL2): Added the modern join syntax, in which join conditions are fully qualified within the FROM clause, and added a standard metadata interface that provides descriptions of objects in the database and which is implemented by at least some vendors. SQL-92 introduced the concept of an "entry-level" subset of the specification (which was very similar to SQL-89).
•	SQL:1999 (SQL3): An explosion on the moon propels the moon into interstellar space. Whoops, sorry, that was Space: 1999. SQL:1999 was somewhat less interesting, introducing object-oriented relational database features that almost nobody uses and a few minor language changes.
•	SQL:2003: Introduced analytic "window" functions—an important innovation for analytics—and SQL/XML. It also finalized a specification for stored procedures.
•	SQL:2006, SQL:2008, SQL:2011: Miscellaneous refinements, such as INSTEAD OF triggers, the MERGE statement, and temporal queries.
In my opinion, the various SQL standards have become increasingly disconnected from real-world SQL
language implementations. It’s common to hear vendors describe their SQL support as “entry-level SQL-92,”
effectively claiming credit for adhering to the minimum level of a specification that is now more than two decades old.
In practice, you can expect an RDBMS to implement everything in SQL-92 at least with respect to the
SELECT statement and DML. SQL:2003 windowing functions, which allow a calculation within a row to have
visibility into a “window” of adjacent rows, are widely implemented and are probably the most significant
single innovation introduced in the last 15 years of the SQL language standards.
DDL statements and stored procedure code will generally not be compatible across databases. DDL
statements such as CREATE TABLE share a common syntax, but usually contain vendor-specific constructs
such as custom data types or proprietary storage clauses. While the ANSI stored procedure syntax is
implemented by DB2 and MySQL, Oracle and Microsoft SQL Server each implement their own incompatible stored program languages.
NoSQL APIs
Databases that are described as NoSQL clearly have to provide a mechanism for inserting, modifying, and
retrieving data. Since most of these databases were developed “by programmers for programmers,” they
primarily provide low-level APIs supported in a language such as Java.
Riak
Riak is an open-source implementation of the Amazon Dynamo model. It implements a pure key-value
system: objects in Riak are located through the object’s key and the object retrieved by the key is a binary
object whose contents are opaque to the database engine.
Given the relatively simple interaction just described, we expect a fairly straightforward API, and that is
what we get. Let’s look at some Java code that inserts a value into a Riak bucket:
1. RiakClient myClient = RiakClient.newClient(myServer);
2. // Create the key, value and set the bucket
3. String myKey = Long.toString(System.currentTimeMillis());
4. String myValue = myKey + ":" + Thread.getAllStackTraces().toString();
5. Location myLocation = new Location(new Namespace("MyBucket"), myKey);
6. StoreValue sv = new StoreValue.Builder(myValue).withLocation(myLocation)
       .build();
7. StoreValue.Response svResponse = myClient.execute(sv);
8. System.out.println("response="+svResponse);
In lines 3 to 5, we define a key (set to the current time in milliseconds), a value (a string representation
of the current stack trace), and the bucket (“MyBucket”) that will receive the value. Line 6 prepares a
StoreValue object—this object contains the key-value pair and the bucket name that it will be associated
with. We execute the StoreValue in line 7, effectively adding the data to the database.
The StoreValue object takes options that can control optional behaviors such as quorums. In the
example that follows, we specify that the write can complete as long as at least one node completes the write
IO (see Chapter 9 for a discussion of quorums):
sv = new StoreValue.Builder(Thread.getAllStackTraces()).
withLocation(myLocation).
withOption(StoreValue.Option.W,Quorum.oneQuorum()).build();
This example also utilizes one of the cool features of the Riak API: if we pass a Java object as our value,
Riak will automatically convert it to a JSON document.
Here, we retrieve the data we just inserted. Note that we use the same Location object (MyLocation)
that we used for the insert:
FetchValue fv = new FetchValue.Builder(myLocation).build();
FetchValue.Response fvResp = myClient.execute(fv);
String myFetchedData = fvResp.getValue(String.class);
System.out.println("value=" + myFetchedData);
The value returned is a string containing a JSON representation of the stack trace object we inserted earlier:
value=
{"Thread[nioEventLoopGroup-2-8,10,main]":
[{"methodName":"poll0","fileName":null,
"lineNumber":-2,
"className":"sun.nio.ch.WindowsSelectorImpl$SubSelector" ...
For some applications, this put/get programming model may be sufficient. However, there are
significant nuances to Riak programming that arise from the Dynamo consistency model. You may
remember from Chapters 8 and 9 that conflicting updates may result in siblings that need to be resolved by
the application. Riak provides interfaces you can implement to automate the resolution of such conflicts.
Next is a simplistic conflict resolver class. This resolver will be invoked when siblings of type String are
detected. The class is passed a list of strings and returns the “resolved” result. Any application code could be
implemented here, but in my example I’ve just returned the first string in the list.
public class MyResolver implements ConflictResolver<String> {
    public String resolve(List<String> siblings) {
        return siblings.get(0);
    }
}
The conflict resolver needs to be registered in order to become active:
ConflictResolverFactory factory = ConflictResolverFactory.getInstance();
factory.registerConflictResolver(String.class, new MyResolver());
Now, if we fetch a string value that includes siblings, the conflict resolver will be invoked and will
resolve them. In our case, we used almost the simplest object type—string—and used trivial resolution logic:
in a production application, the object types would likely be an application-defined complex class, and the
resolution might involve highly complex logic to merge the two objects.
As well as resolving siblings, the Riak API allows for modifications to complex data types to be handled.
Application code need not provide a complete copy of the new object to be inserted, but instead specify only
a change vector that is to be applied to an existing object. For instance, if an object were a list of friends, we
could add a new friend simply by specifying the new friend’s name, without having to retrieve and reinsert
all existing friends.
By extending the UpdateValue super class, we can define how an update value is applied to a Riak
object. In the example that follows, the apply method defines that the update will simply be appended (after
a “\n” carriage return) to the Riak object.
public class MyRiakUpdater extends UpdateValue.Update<String> {
    private final String updatedString;

    public MyRiakUpdater(String updatedString) {
        this.updatedString = updatedString;
    }

    public String apply(String original) {
        return original + "\n" + updatedString;
    }
}
To invoke the updater, we create a new object instance from the updater class we created earlier and
apply it to the original version using the withUpdate method.
MyRiakUpdater myUpdatedData = new MyRiakUpdater(newData);
UpdateValue myUpdate = new UpdateValue.Builder(myLocation)
.withUpdate(myUpdatedData).build();
UpdateValue.Response updateResponse = myClient.execute(myUpdate);
This code will apply the update to the Riak object after first resolving siblings (providing we have
registered an appropriate conflict resolver).
HBase
The HBase API bears some resemblance to the put/get pattern we saw in the Riak API. However, because
HBase tables have a wide column structure, some additional complexity is introduced.
HBase supports a simple shell that allows access to basic data definition and manipulation commands.
Here, we launch the HBase shell and issue commands to create a table ourfriends with two column
families: info and friends:
$ hbase shell
hbase(main):003:0* create 'ourfriends', {NAME => 'info'}, {NAME => 'friends'}
0 row(s) in 1.3400 seconds
Each put command populates one cell within a row, so these four commands populate columns in the
row identified by rowkey ‘guy’:
hbase(main):005:0* put 'ourfriends', 'guy','info:email','[email protected]'
0 row(s) in 0.0900 seconds
hbase(main):006:0> put 'ourfriends', 'guy','info:userid','9990'
0 row(s) in 0.0070 seconds
hbase(main):007:0> put 'ourfriends', 'guy','friends:Jo','[email protected]'
0 row(s) in 0.0050 seconds
hbase(main):008:0> put 'ourfriends', 'guy','friends:John','[email protected]'
0 row(s) in 0.0040 seconds
The get command pulls the values for the specific rowkey, allowing us to see the cell values we just
input:
hbase(main):018:0* get 'ourfriends','guy'
COLUMN          CELL
 friends:Jo     timestamp=1444123707299, value=[email protected]
 friends:John   timestamp=1444123707324, value=[email protected]
 info:email     timestamp=1444123707214, value=[email protected]
 info:userid    timestamp=1444123707274, value=9990
4 row(s) in 0.0390 seconds
Note that this is a wide column structure, where the column names in the column family friends
represent the names of friends. This is the data structure that was illustrated in Chapter 10, Figure 10-5.
The shell is suitable for simple experimentation and data validation, but most real work in HBase is
done within Java programs. Here, we see simple code to connect to an HBase server from a Java program:
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", myServer);
config.set("hbase.zookeeper.property.clientport", "2181");
HBaseAdmin.checkHBaseAvailable(config);
Connection connection = ConnectionFactory.createConnection(config);
For column families with a fixed column structure, such as info in our example, we can retrieve the
data fairly simply:
1. Table myTable = connection.getTable(TableName.valueOf("ourfriends"));
2. byte[] myRowKey = Bytes.toBytes("guy");
3. byte[] myColFamily=Bytes.toBytes("info");
4. byte[] myColName=Bytes.toBytes("email");
5. Get myGet = new Get(myRowKey);
6. Result myResult = myTable.get(myGet);
7. byte[] myBytes = myResult.getValue(myColFamily, myColName);
8. String email = Bytes.toString(myBytes);
9. System.out.println("Email address="+email);
Lines 1 through 4 specify the table, rowkey, column family, and column name to be retrieved. Lines 5
and 6 retrieve the row for the specified rowkey, and line 7 extracts a cell value for the specified column family
and column name. Line 8 converts the value from a byte array to a string. This is one of the less endearing
features of the HBase API: you are constantly required to convert data to and from HBase native byte arrays.
For dynamic column names, we need to retrieve the names of the columns for each row before we can
retrieve the data. The getFamilyMap method returns a map structure from a result object identifying the
column names within a specific row:
NavigableMap<byte[], byte[]>
myFamilyMap = myResult.getFamilyMap(myColFamily);
We can then iterate through the column names and use the standard getValue call to retrieve them:
for (byte[] colNameBytes : myFamilyMap.keySet()) {
// Get the name of the column in the column family
String colName = Bytes.toString(colNameBytes);
byte[] colValueBytes = myResult.getValue(myColFamily, colNameBytes);
System.out.println("Column " + colName + "=" + Bytes.toString(colValueBytes));
}
The HBase put method is similar in form to the get method and allows us to place a new column
value into a cell. Remember, as discussed in Chapter 10, HBase cells may contain multiple versions of data,
identified by timestamp. So the put method really adds a new value rather than overwriting the existing value.
Here, we add a new friend:
myColFamily=Bytes.toBytes("friends");
byte [] myNewColName=Bytes.toBytes("paul");
byte [] myNewColValue=Bytes.toBytes("[email protected]");
Put myPut=new Put(myRowKey);
myPut.addColumn(myColFamily,myNewColName,myNewColValue);
myTable.put(myPut);
The HBase API does not need to include methods for sibling resolution as does Riak, since HBase uses
locking and strict consistency to prevent such siblings from ever being created.
The HBase API includes additional methods for scanning rowkey ranges, which can be useful when
the rowkey includes some kind of natural partitioning value, such as a customer ID, or where the rowkey
indicates some kind of temporal order. Also, you may recall from our discussion in Chapter 10 that HBase
supports a coprocessor architecture allowing code to be invoked when certain data manipulation operations
occur. This architecture provides a database trigger-like capability that allows us to perform validation on
input values, maintain secondary indexes, or maintain derived values.
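For example, a rowkey range scan might look something like the following sketch, which reuses the myTable object and Bytes helper from the earlier examples; the start and stop keys are arbitrary illustrations:

// Scan all rows whose rowkeys are >= 'g' and < 'h' (which includes 'guy'),
// returning only the info column family
Scan myScan = new Scan(Bytes.toBytes("g"), Bytes.toBytes("h"));
myScan.addFamily(Bytes.toBytes("info"));

ResultScanner myScanner = myTable.getScanner(myScan);
try {
    for (Result myRow : myScanner) {
        String rowkey = Bytes.toString(myRow.getRow());
        String email = Bytes.toString(
                myRow.getValue(Bytes.toBytes("info"), Bytes.toBytes("email")));
        System.out.println(rowkey + " " + email);
    }
} finally {
    myScanner.close();
}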
MongoDB
The APIs we’ve seen in Riak and HBase are not overly complex when compared to popular development
frameworks like Spring or J2EE. However, for the casual user—especially one not proficient in
programming—they are virtually inaccessible. The languages require that you explicitly navigate to the
desired values and do not provide any direct support for complex queries.
MongoDB goes some of the way toward addressing these restrictions by providing a rich query language
implemented in JavaScript, and exposed through the MongoDB shell. Collection objects within MongoDB
include a find() method that allows fairly complex queries to be constructed. Figure 11-1 provides a
comparison of a MongoDB query with an equivalent SQL statement.
Figure 11-1. Comparison of MongoDB JavaScript query and SQL
■■Note The example data for MongoDB and Couchbase is based on a port of the MySQL “Sakila” sample
schema, which describes a database for a DVD rental business. DVD rental businesses are largely gone now,
but the sample database provides a schema familiar to MySQL users. The converted database is available at
http://bit.ly/1LwY9xl.
Quite complicated queries can be constructed using the find() method, but for queries that want
to aggregate across documents, we need to use the aggregate() method. The aggregate() method
implements a logical equivalent of the SQL GROUP BY clause. Figure 11-2 compares a SQL GROUP BY
statement with its MongoDB equivalent.
Figure 11-2. MongoDB aggregation framework compared with SQL
The JavaScript-based MongoDB shell language is not directly embeddable within other programming
languages. Historically, MongoDB drivers for various languages implemented interfaces inspired by
the JavaScript interface, but with significant divergence across languages. More recently, a cross-driver
specification has been developed that improves consistency among the driver implementations for various
programming languages.
Using Java as an example, here is some code that connects to a server and selects the “test” collection
within the NGDBDemo database:
MongoClient mongoClient = new MongoClient(mongoServer);
MongoDatabase database = mongoClient.getDatabase("NGDBDemo");
MongoCollection<Document> collection = database.getCollection("test");
The following code creates and inserts a new document:
1.  Document people = new Document();                   // A document for a person
2.  people.put("Name", "Guy");
3.  people.put("Email", "guy@gmail.com");
4.  BasicDBList friendList = new BasicDBList();         // List of friends
5.  BasicDBObject friendDoc = new BasicDBObject();      // A single friend
6.  friendDoc.put("Name", "Jo");
7.  friendDoc.put("Email", "jo@gmail.com");
8.  friendList.add(friendDoc);                          // Add the friend
9.  friendDoc = new BasicDBObject();                    // Start another friend
10. friendDoc.put("Name", "John");
11. friendDoc.put("Email", "john@gmail.com");
12. friendList.add(friendDoc);                          // Add another friend
13. people.put("Friends", friendList);
14. collection.insertOne(people);
Line 1 creates an empty document. Lines 2 and 3 add some values to the document. In lines 4 through 8,
we create a List structure that represents an array of subdocuments and insert the first friend document into
that array. Lines 9 through 12 add a second friend document to the list, line 13 embeds the friend list in the
main document, and line 14 inserts the new document (which includes two embedded friend documents) into MongoDB.
This programming pattern reflects the underlying structure of JSON documents as described in
Chapter 10: JSON documents are composed of arbitrarily nested objects, values, and arrays. The MongoDB
interface requires that we build these programmatically, although there are several utility classes available
independently that allow Java objects to be converted to and from JSON documents.
Collection objects in the Java driver support a find() method that, although not syntactically
identical with the JavaScript version, allows us to execute the same operations that we can perform in the
JavaScript shell:
Document myDoc = collection.find(eq("Name", "Guy")).first();
System.out.println(myDoc.toJson());
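More selective queries can be assembled programmatically with the driver's Filters and Projections helper classes. The following sketch (which assumes the 3.x Java driver and the same collection handle as above) returns only the Email attribute of documents whose Name is either Guy or Jo:

import static com.mongodb.client.model.Filters.in;
import static com.mongodb.client.model.Projections.excludeId;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;

// Match either name and project out everything except the Email attribute
for (Document doc : collection.find(in("Name", "Guy", "Jo"))
        .projection(fields(include("Email"), excludeId()))) {
    System.out.println(doc.toJson());
}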
The API provides scrollable cursors that allow us to navigate through a result set or an entire collection.
This example iterates through all the documents in a collection:
MongoCursor<Document> cursor = collection.find().iterator();
try {
while (cursor.hasNext()) {
System.out.println(cursor.next().toJson());
}
}
finally {
cursor.close();
}
A more compact alternative fetch loop could be framed like this:
for (Document cur : collection.find()) {
System.out.println(cur.toJson());
}
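The aggregation framework described earlier is also exposed through the Java driver. The sketch below (which assumes a MongoCollection<Document> handle named films for the Sakila-derived films collection, together with the 3.x driver's Aggregates and Accumulators helpers) counts films by category, the logical equivalent of a SQL GROUP BY:

import java.util.Arrays;
import com.mongodb.client.AggregateIterable;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;

// Group films on the Category attribute and count the members of each group
AggregateIterable<Document> categoryCounts = films.aggregate(Arrays.asList(
        Aggregates.group("$Category", Accumulators.sum("count", 1))));
for (Document doc : categoryCounts) {
    System.out.println(doc.toJson());
}

Each output document contains an _id holding the category name together with the computed count.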
Cassandra Query Language (CQL)
Cassandra’s underlying data structures are based on Google’s BigTable model, which would lead us to expect
an API syntactically similar to that of the HBase API. Indeed, the early thrift-based Cassandra APIs were
easily as complex as the HBase programming API; arguably even more so since Cassandra had implemented
a “SuperColumn” structure that extended the range of possible column family configurations, but which was
hard to conceptualize and program against.
In version 0.8, the Cassandra team made a decisive shift from an API-centric interface to a language-based interface, inspired by SQL: the Cassandra Query Language (CQL). CQL uses familiar SQL idioms
for data definition, manipulation, and query tasks, and is now the preferred method for interacting with
Cassandra databases from query tools or within programs.
CQL provides Cassandra with an interactive ad hoc query capability through the cqlsh program. It also
simplifies programming tasks by allowing for more succinct and comprehensible data manipulation code,
which looks familiar to those who have coded in SQL-based interfaces such as JDBC.
But perhaps most significantly and most controversially, Cassandra CQL abstracts the underlying wide
column BigTable-style data model in favor of a more relational-like tabular scheme. We discussed this in
detail in Chapter 10 and won't repeat that discussion here; see, in particular, Figure 10-7 for a comparison
of the Cassandra CQL representation of data with the underlying wide column structure.
The best—and maybe also the worst—thing about this CQL abstraction is that users can interact with
Cassandra without understanding the nuances of the wide column data model. However, although you
can write functional Cassandra CQL without understanding the underlying Cassandra data model, the best
results will be attained if you do understand how the two relate.
Wide column structures in CQL are defined by using composite primary keys, where the first part of the
key defines the partitioning (e.g., the rowkey), and the second part of the key defines the clustering columns.
Clustering column values become the dynamic column names in the wide column family.
The CQL statements that follow—executed in the cqlsh shell—define and populate a Cassandra table
roughly equivalent to the HBase table we created earlier in this chapter:
cqlsh:guy> CREATE TABLE friends
       ... (name text,
       ...  friend_name text,
       ...  friend_email text,
       ...  PRIMARY KEY (name,friend_name));
cqlsh:guy> INSERT INTO friends (name,friend_name,friend_email)
       ...    VALUES('Guy','Jo','jo@gmail.com');
cqlsh:guy> INSERT INTO friends (name,friend_name,friend_email)
       ...    VALUES('Guy','Chris','chris@gmail.com');
cqlsh:guy> INSERT INTO friends (name,friend_name,friend_email)
       ...    VALUES('Guy','John','john@gmail.com');
Familiar SQL-like constructs allow us to perform updates and deletes, create indexes, or issue queries.
However, the CQL SELECT statement has limited capabilities when compared to standard SQL: in
particular, joins and aggregate (GROUP BY) operations are not supported. Furthermore, WHERE clauses
and ORDER BY clauses are severely restricted. Ordering and range queries are limited to clustering columns
within a specific partition key.
These limitations seem confusing if you think of CQL tables as relational structures. But if you
remember that the first part of the key is actually a rowkey that is consistently hashed across the cluster, then
the limitation seems more reasonable. Cassandra is unable to effectively perform a range scan across rowkey
values that are hashed across the entire cluster. Nor is it possible to access the partition columns without
accessing a specific row, since every row could have entirely distinct column values.
So this ORDER BY clause cannot be supported:
cqlsh:guy> SELECT * FROM friends ORDER BY name;
InvalidRequest: code=2200 [Invalid query] message="ORDER BY is only supported when the
partition key is restricted by an EQ or an IN."
But this is legal:
cqlsh:guy> SELECT * FROM friends WHERE name = 'Guy'
       ... ORDER BY friend_name;

 name | friend_name | friend_email
------+-------------+------------------
  Guy |       Chris | chris@gmail.com
  Guy |          Jo | jo@gmail.com
  Guy |        John | john@gmail.com
A similar restriction prevents range operations on the partition key:
cqlsh:guy> SELECT * FROM friends WHERE name > 'Guy' ;
InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on
the partition key (unless you use the token() function)"
But allows a range query on the clustering key, provided the partition key is also specified:
cqlsh:guy> SELECT * FROM friends
       ... WHERE name='Guy' AND friend_name > 'Guy';

 name | friend_name | friend_email
------+-------------+------------------
  Guy |          Jo | jo@gmail.com
  Guy |        John | john@gmail.com
CQL is used within Java programs or other languages using a driver syntax that is similar to JDBC: CQL
statements are passed as strings to methods that submit the CQL to the server and return result sets or
return codes.
The following Java code connects to a Cassandra server and specifies a keyspace (lines 1-3), submits a
CQL query (lines 5-6) and iterates through the results (lines 8-11):
1.  String myServer=args[0];
2.  Cluster cluster = Cluster.builder().addContactPoint(myServer).build();
3.  Session myKeySpace = cluster.connect("guy");
4.
5.  String cqlString = "SELECT * FROM friends where name='Guy'";
6.  ResultSet myResults = myKeySpace.execute(cqlString);
7.
8.  for (Row row : myResults.all()) {
9.      System.out.println(row.getString(0) + " " +
10.         row.getString(1) + " " + row.getString(2));
11. }
If we don’t know the structure of the result set in advance, then there is a metadata interface that allows
us to extract column names and data types:
List<Definition> colDefs = myResults.getColumnDefinitions().asList();
System.out.println("Column count=" + colDefs.size());
System.out.println("Column Names:");
for (Definition colDef : colDefs) {
System.out.println(colDef.getName());
}
MapReduce
The put and get methods provided by early NoSQL systems support only record-at-a-time processing and
place a heavy programming burden on an application that needs to perform even simple analytics on
the data. Google’s MapReduce algorithm—first published in 2004—provided a solution for parallelizing
computation across a distributed system, and it has been widely adopted not just by systems inspired by the
Google stack, such as Hadoop and HBase, but also by many early NoSQL systems, such as CouchDB and
MongoDB.
The canonical example of MapReduce is provided by the WordCount (https://wiki.apache.org/
hadoop/WordCount ) program, which represents almost the simplest possible MapReduce example. We
showed a diagrammatic representation of WordCount way back in Figure 2-4.
In the WordCount example, the map phase uses a tokenizer to break up the input into words, then
assigns a value of 1 to each word:
public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
The reducer class takes these name:value pairs (where the value is always 1) and calculates the sum of
counts for each word:
public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
The MapReduce job is invoked by mainline code that defines input and output types and files, specifies
the map and reducer classes, and invokes the job:
Job job = Job.getInstance(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
MapReduce coding in Java is somewhat cumbersome and involves a lot of boilerplate coding. Many
alternative implementations are much simpler. For instance, here is the WordCount algorithm implemented
in MongoDB’s JavaScript MapReduce framework:
db.films.mapReduce(
/* Map
*/ function() {emit (this.Category,1);},
/* Reduce */ function(key,values) {return Array.sum(values)} ,
{ out: "MovieRatings" }
)
This JavaScript reads from the films collection, and does a word count on the categories of each film,
which is output to the collection “MovieRatings”:
> db.MovieRatings.find();
{ "_id" : "Action", "value" : 64 }
{ "_id" : "Animation", "value" : 66 }
{ "_id" : "Children", "value" : 60 }
{ "_id" : "Classics", "value" : 57 }
{ "_id" : "Comedy", "value" : 58 }
MapReduce is a flexible programming paradigm capable of being adapted to a wide range of data
processing algorithms. However, it is rarely the most efficient algorithm for a given problem and is usually
not the most programmer-efficient approach. Consequently, there have been many frameworks that provide
alternative programming and processing paradigms.
Pig
It was realized early on that the full potential of Hadoop could not be unlocked if commonplace operations
required highly skilled Java programmers with experience in complex MapReduce programming. As we
will see later in this chapter, at Facebook the development of Hive—the original SQL on Hadoop—was
an important step toward finding a solution for this problem. At Yahoo! the Hadoop team felt that the
SQL paradigm could not address a sufficiently broad category of MapReduce programming tasks. Yahoo!
therefore set out to create a language that maximized productivity but still allowed for complex procedural
data flows. The result was Pig.
Pig superficially resembles scripting languages such as Perl or Python in that it offers flexible syntax and
dynamically typed variables. But Pig actually implements a fairly unique programming paradigm; it is best
described as a data flow language. Pig statements typically represent data operations roughly analogous
to individual operators in SQL—load, sort, join, group, aggregate, and so on. Typically, each Pig statement
accepts one or more datasets as inputs and returns a single dataset as an output. For instance, a Pig
statement might accept two datasets as inputs and return the joined set as an output. Users can add their
own operations through a Java-based user-defined function (UDF) facility.
For those familiar with SQL programming, programming in Pig turns the programming model upside
down. SQL is a nonprocedural language: you specify the data you want rather than outline the sequence
of events to be executed. In contrast, Pig is explicitly procedural: the exact sequence of data operations is
specified within your Pig code. For SQL gurus, it resembles the execution plan of a SQL statement more
than the SQL statement itself.
SQL compilers and Hive’s HQL compiler include optimizers that attempt to determine the most
efficient way to resolve a SQL request. Pig is not heavily reliant on such an optimizer, since the execution
plan is explicit. As the Pig gurus are fond of saying “Pig uses the optimizer between your ears.”
Here is the ubiquitous word count implemented in Pig:
file = load 'some file';
b = foreach file generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into 'pig_wordcount';
Pig can be used to perform complex workflows and provide an ad hoc query capability similar to
SQL. For instance, the example shown in Figure 11-3 performs joins, filters, and aggregations to provide a
summary of customers in the Asian region. Figure 11-3 also includes a comparable SQL statement.
Figure 11-3. Pig compared to SQL
Although Pig is capable of expressing virtually any data query that can be expressed in SQL syntax, it is
also capable of performing more complex data flows that would require multiple SQL statements chained
together with procedural code in an RDBMS.
Nevertheless, while Pig is more flexible than SQL, it is not a Turing complete programming language: it
lacks the control structures required for a complete general-purpose programming solution. However, Pig
can be embedded in Python and other languages.
Directed Acyclic Graphs
Hadoop 1.0 was based on the MapReduce pattern. Complex programs could link multiple MapReduce steps
to achieve their end result.
It’s long been acknowledged that while MapReduce is a broadly applicable model that can support a wide
range of job types, it is not the best model for all workloads. In particular, MapReduce has a very large startup
cost, which means that even the simplest “Hello World” MapReduce job typically takes minutes rather than
seconds; this alone makes MapReduce a poor choice for interactive workloads and low-latency operations.
Hadoop 2.0 introduced the YARN framework, which allows Hadoop to run workloads based on other
processing patterns—of which MapReduce is just one.
The Apache Tez project (Tez is Hindi for “speed”) is one of a number of YARN-based initiatives
that provide Hadoop with a processing framework supporting both the massive batch processing that
characterizes traditional Hadoop workloads and low-latency operations that allow Hadoop to support a
wider variety of solutions.
Tez is based on a flexible processing paradigm known as directed acyclic graph (DAG). This intimidating
term actually describes a familiar processing model. Anyone who has examined a SQL execution plan
will have encountered a DAG. These graphs describe how a complex request is decomposed into multiple
operations that are executed in a specific order and that can arbitrarily feed into each other. MapReduce
itself is a DAG, but the MapReduce paradigm severely limits the types of graphs that can be constructed.
Furthermore, MapReduce requires that each step in the graph be executed by a distinct set of processes,
while Tez allows multiple steps in the graph to be executed by a single process, potentially on a specific node
of the Hadoop cluster.
Cascading
Cascading is a popular open-source Java framework that abstracts Hadoop MapReduce or YARN-based
processing primitives. In some respects, it resembles Pig in that it works with high-level data flows and
transformations and spares the programmer the labor involved in constructing low-level parallelization
classes. However, unlike Pig, Cascading is integrated within the Java language and is capable of creating
solutions that are more modular and sophisticated.
The Cascading programming model is based on sources, sinks, and pipes assembled in data flows.
The programmer assembles these pipes to construct programs that are more sophisticated than MapReduce.
These workflows represent the DAG discussed in the previous section.
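To give a flavor of the model, here is a minimal word-count sketch along the lines of the canonical Cascading tutorial. It assumes the Cascading 2.x Hadoop planner and input and output paths supplied on the command line, so treat the class and method names as a guide to the style rather than a definitive implementation:

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CascadingWordCount {
    public static void main(String[] args) {
        String inPath = args[0];   // HDFS input directory
        String outPath = args[1];  // HDFS output directory

        // Source and sink taps define where the flow reads from and writes to
        Tap docTap = new Hfs(new TextLine(new Fields("num", "line")), inPath);
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), outPath, SinkMode.REPLACE);

        // Pipe assembly: split each line into words, group by word, count each group
        Pipe wcPipe = new Pipe("wordcount");
        wcPipe = new Each(wcPipe, new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"), Fields.RESULTS);
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, new Count());

        // Wire the taps and pipes into a flow and run it on the Hadoop cluster
        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, CascadingWordCount.class);
        FlowDef flowDef = FlowDef.flowDef().setName("wordcount")
                .addSource(wcPipe, docTap)
                .addTailSink(wcPipe, wcTap);
        Flow wcFlow = new HadoopFlowConnector(properties).connect(flowDef);
        wcFlow.complete();
    }
}

The source tap, the Each, GroupBy, and Every pipes, and the sink tap together form the directed acyclic graph that the planner compiles into one or more underlying MapReduce (or YARN-based) jobs.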
Spark
We looked at the origins and architecture of the Spark project in Chapter 7. Spark can be thought of as
“memory-based Hadoop,” but in truth it offers more than just an in-memory speed boost. The Spark API
operates at a higher level of abstraction than native YARN or MapReduce code, offering improvements in
programmer productivity and execution speed.
Spark supports APIs in Python, Java, and other languages, but it is native to the Scala language, so our
examples here will use the Scala API.
Here, we load text files (in this case, from HDFS) into Spark Resilient Distributed Datasets (RDDs):
val countries=sc.textFile("COUNTRIES")
val customers=sc.textFile("CUSTOMERS")
Spark RDDs are immutable: we can’t alter the contents of an RDD; rather, we perform operations that
create new RDDs.
The HDFS inputs are CSV files, so in our initial RDDs each line in the input file is represented as a single
string. Here’s the first element in the countries RDD:
scala> countries.first()
res9: String = "52790,United States of America,Americas"
In the next example, we use a map function to extract the key value from each of the CSV strings and
create key-value pair RDDs.
val countryRegions=countries.map(x=>(x.split(",")(0),x.split(",")(2)))
val AsianCountries=countryRegions.filter(x=> x._2.contains("Asia") )
// Country codes and country names
val countryNames=countries.map(x=>(x.split(",")(0),x.split(",")(1)))
The first RDD countryRegions contains all country codes and their associated regions. The second
(AsianCountries) uses the filter() method to create a RDD containing only Asian countries. The third
(countryNames) creates an RDD with country names keyed by country ID. Here’s the first element in the
countryNames RDD:
scala> countryNames.first()
res12: (String, String) = (52790,United States of America)
Aggregations can be created by performing map and reduce operations. The first line that follows uses
map to emit a country name and the numeral 1 for each customer. The second line invokes a reducer that
emits an RDD containing the counts of customers in each country:
val custByCountry=customers.map(x=>(x.split(",")(3),1))
val custByCountryCount=custByCountry.reduceByKey((x,y)=> x+y)
In the next statement, we join the RDD containing customer counts by country ID with the list of Asian
country IDs that we created earlier. Because we are using the default inner join method, this returns only
customer counts for Asian regions.
val AsiaCustCount=AsianCountries.join(custByCountryCount)
The next join operation joins our RDD containing country names keyed by country code to the previous
RDD. We now have an RDD that contains counts of customers by country name in the Asian region:
val AsiaCustCountryNames=AsiaCustCount.join(countryNames)
This Spark workflow is roughly equivalent to the Pig workflow shown in Figure 11-3. However, for
massive datasets, the Spark job could be expected to complete in a fraction of the time, since every operation
following the initial load from HDFS would complete in memory.
The Return of SQL
I don’t know about you, but as I created the examples in this chapter I was struck by the “Tower of Babel”
impression created by the new generation of database languages. Say what you will about SQL, but for
more than two decades database programmers have all been speaking the same language. As a result
of the explosion of nonrelational systems, we’ve all been forced to speak different languages. And while
some of these languages offer definite advantages for working with unstructured data or for parallelization
of programming tasks, in many cases the level of abstraction has been reduced and the work for the
programmer increased.
So it’s not surprising that within just a few years after the initial enthusiasm for NoSQL, we’ve seen SQL
return to almost every new database niche. SQL just has too many advantages: it’s a high-level abstraction
that simplifies data access and manipulation, it’s a language in which literally millions of database users are
conversant, and there are hundreds of popular business intelligence and analytic tools that use it under the
hood as the means for getting at data.
Hive
Hive is the original SQL on Hadoop. We discussed the origins and architecture of Hive in Chapter 2. From
the very early days of Hadoop, Hive represented the most accessible face of Hadoop for many users.
Hive Query Language (HQL) is a SQL-based language that comes close to SQL-92 entry-level compliance,
particularly within its SELECT statement. DML statements—such as INSERT, DELETE, and UPDATE—are
supported in recent versions, though the real purpose of Hive is to provide query access to Hadoop data usually
ingested via other means. Some SQL-2003 analytic window functions are also supported.
As discussed in Chapter 2, HQL is compiled to MapReduce or—in later releases—more sophisticated
YARN-based DAG algorithms.
The following is a simple Hive query that performs the same analysis as our earlier Pig and Spark
examples:
0: jdbc:Hive2://> SELECT country_name, COUNT (cust_id)
0: jdbc:Hive2://>   FROM countries co JOIN customers cu
0: jdbc:Hive2://>     ON(cu.country_id=co.country_id)
0: jdbc:Hive2://>  WHERE region = 'Asia'
0: jdbc:Hive2://>  GROUP BY country_name
0: jdbc:Hive2://> HAVING COUNT (cust_id) > 500;
2015-10-10 11:38:55 Starting to launch local task to process map join; maximum memory = 932184064
<<Bunch of Hadoop JobTracker output deleted>>
2015-10-10 11:39:05,928 Stage-2 map = 0%, reduce = 0%
2015-10-10 11:39:12,246 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.28 sec
2015-10-10 11:39:20,582 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 4.4 sec
+---------------+------+--+
| country_name  | _c1  |
+---------------+------+--+
| China         | 712  |
| Japan         | 624  |
| Singapore     | 597  |
+---------------+------+--+
3 rows selected (29.014 seconds)
HQL statements look and operate like SQL statements. There are a few notable differences between
HQL and commonly used standard SQL, however:
•	HQL supports a number of table generating functions which can be used to return multiple rows from an embedded field that may contain an array of values or a map of name:value pairs. The Explode() function returns one row for each element in an array or map, while json_tuple() explodes an embedded JSON document.
•	Hive provides a SORT BY clause that requests output be sorted only within each reducer within the MapReduce pipeline. Compared to ORDER BY, this avoids a large sort in the final reducer stage, but may not return results in sorted order.
•	DISTRIBUTE BY controls how mappers distribute output to reducers. Rather than distributing values to reducers based on hashing of key values, we can insist that each reducer receive contiguous ranges of a specific column. DISTRIBUTE BY can be used in conjunction with SORT BY to achieve an overall ordering of results without requiring an expensive final sort operation. CLUSTER BY combines the semantics of DISTRIBUTE BY and SORT BY operations that specify the same column list.
Hive can query data in HBase tables and data held in HDFS. Support for Spark is available, though still
under active development.
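Because the "0: jdbc:Hive2://>" prompt in the example above is simply the beeline JDBC client, the same HQL can also be submitted from a Java program through the HiveServer2 JDBC driver. The following is a minimal sketch; the connection URL, credentials, and the presence of the driver on the classpath are assumptions about the local environment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver and connect to the default database
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();

        // The same aggregate query issued earlier from the beeline shell
        ResultSet rs = stmt.executeQuery(
                "SELECT country_name, COUNT(cust_id) " +
                "  FROM countries co JOIN customers cu " +
                "    ON (cu.country_id = co.country_id) " +
                " WHERE region = 'Asia' " +
                " GROUP BY country_name HAVING COUNT(cust_id) > 500");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " " + rs.getLong(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}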
Impala
It’s difficult to overstate the significance of Hive to the early adoption of Hadoop. Hive allowed non-Java
programmers a familiar mechanism for accessing data held in Hadoop, and allowed third-party analytic
solutions to more easily integrate with Hadoop.
However, the similarity between Hive and RDBMS SQL led inevitably to unrealistic expectations.
RDBMS users had become used to SQL as a real-time query tool, whereas even the simplest Hive queries
would typically take minutes to complete, since even the simplest query had to undertake the overhead of
initiating a MapReduce job. Furthermore, caching of data in memory typically reduces overall SQL response
time by 90 percent or more, while Hive queries were totally I/O bound. Initial releases of Hive also employed
fairly primitive query optimization.
Disappointment with Hive performance led to a fairly intensive effort to improve Hive, and modern
versions of Hive outperform the initial releases by several orders of magnitude.
Cloudera’s Impala project aims to provide low-latency ANSI-compliant SQL on Hadoop. The key
difference between Impala’s approach and that of Hive is that while Hive programs are translated to native
Hadoop processing—initially MapReduce but today including Tez— Impala includes its own processing
architecture that reads directly from native HDFS or HBase storage. Although Impala bypasses the Hadoop
processing layer, it can access Hive and Pig metadata through the HCatalog interface, which means it knows
the structure of Hive and Pig tables.
Impala architecture is heavily influenced by traditional massively parallel processing (MPP) data
warehouse database architectures, which we discussed in Chapter 8. Impala deploys daemon processes—
typically on each Hadoop data node—and distributes work to these daemons using algorithms similar to
those used by RDBMS systems, such as Teradata or Oracle. These daemon processes are always ready for
action, so there is no initial latency involved in job creation as there is in Hive.
There’s no doubt that the Impala architecture offers a better performance experience for short-duration
queries than Hive. However, given improvements in recent releases of Hive, there are those who claim that
Hive still offers superior performance at greater scale. Some of this debate is driven by marketing groups
within commercial Hadoop companies, so careful evaluation of competing claims is warranted.
Spark SQL
As noted earlier, there is some support for Spark within recent releases of Hive. However, Spark includes its
own SQL dialect, called—not surprisingly—Spark SQL.
You may recall from Chapter 7 that Spark SQL works with data frames rather than RDDs. A data frame
can be thought of as a more tabular and schematized counterpart to the RDD. A data frame can be created from
an RDD, or from a Hive table.
SQL-92 compliance is a work in progress within Spark SQL.
Couchbase N1QL
So far we’ve seen SQL used within the context of analytic systems such as Hadoop and Spark. Operational
databases of the NoSQL variety have not been so quick to implement SQL; however, in 2015, Couchbase
announced Non-first Normal Form Query Language (N1QL), pronounced “Nickel,” a virtually complete SQL
language implementation for use with document databases and implemented within the Couchbase server 4.0.
For example, consider the sample data shown in Figure 11-4 (this is the same sample data we used for
MongoDB earlier).
Figure 11-4. Couchbase sample document
N1QL allows us to perform basic queries to retrieve selected documents or attributes of selected
documents:
cbq> SELECT `Title` FROM films WHERE _id=200;
{
"requestID": "0d5cff15-f5e7-434d-9dc4-d950ef5e21f8",
"signature": {
"Title": "json"
},
"results": [
{
"Title": "CURTAIN VIDEOTAPE"
}
],
"status": "success",
N1QL allows us to access nested documents within the JSON structure using array notation. So, for
instance, in the example that follows, Actors[0] refers to the first nested document within the actors array:
cbq> SELECT Actors[0].`First name`, Actors[0].`Last name`
   > FROM films where _id=200;
{
"requestID": "5aa27ec1-ce4d-4452-a137-2239b88e47fe",
"results": [
{
"First name": "JOE",
"Last name": "SWANK"
}
],
"status": "success",
We can query for subdocuments that match a search criteria using WHERE ANY syntax:
cbq> SELECT `Title` FROM films
   > WHERE ANY Actor IN films.Actors SATISFIES
   >    ( Actor.`First name`="JOE" AND Actor.`Last name`="SWANK" ) END;
{
"requestID": "f3d6dd05-912d-437b-984f-214770f87076",
"results": [
{
"Title": "CHOCOLAT HARRY"
},
{
"Title": "CHOCOLATE DUCK"
},
... ...
The UNNEST command allows embedded documents to be “joined” back up to the parent document. So
here we get one result for each actor who starred in film 200, with the film title included in the results:
cbq> SELECT f.`Title`, a.`First name`, a.`Last name`
   >   FROM films f
   > UNNEST f.Actors a
   >  WHERE f._id=200;
{
"requestID": "f8227647-3506-4bfd-a538-3f8a0d038198",
"results": [
{
"First name": "JOE",
"Last name": "SWANK",
"Title": "CURTAIN VIDEOTAPE"
},
{
"First name": "WALTER",
"Last name": "TORN",
"Title": "CURTAIN VIDEOTAPE"
},
... ...
],
"status": "success",
}
The UNNEST command allows us to perform the equivalent of joins between parent and child documents
when the child documents are nested within the parent. N1QL also allows us to join between independent
documents, providing that one of the documents contains a reference to the primary key in the other.
So, for instance, if we had a bucket of documents that contained the primary keys of “overdue” films in
our imaginary (and by now definitely struggling) DVD store, then we can join that to the films collection to
return just those films using the ON KEYS join syntax:
cbq> SELECT f.`Title` FROM overdues
   >    JOIN films f ON KEYS overdues.filmId;
{
"requestID": "6f0f505e-72f6-404d-9e20-953850dc9524",
"results": [
{
"Title": "CURTAIN VIDEOTAPE"
},
{
"Title": "HARPER DYING"
}
],
"status": "success",
N1QL also includes DML statements allowing us to manipulate the contents of documents and DDL
statements allowing creation and modification of indexes.
N1QL is an ambitious attempt to bring SQL into the world of document databases. It’s interesting
to consider that at the same time as companies like Couchbase are introducing SQL support into their
databases, companies like Oracle are introducing strong JSON support into their SQL-based databases. It
would seem that the two worlds are coming together.
Apache Drill
So far we have looked at SQL variants that are tightly coupled with their underlying technology. It’s true that
technologies such as Hive can access data in HBase and Spark, as well as HDFS, but this speaks more to the
integration of HDFS, HBase, and Spark than it does to some inherent heterogeneity of the Hive system.
The Apache Drill framework aims to provide a SQL engine that can operate across multiple distributed
data stores such as HDFS or Amazon S3, as well as NoSQL systems such as MongoDB and HBase. Drill’s
architecture is based on Google’s Dremel system, which provides the foundation for the Google BigQuery
product.
Drill incorporates a distributed heterogeneous cost-based optimizer that can intelligently distribute
data-access algorithms across multiple, disparate systems. This allows a SQL query to span Hadoop,
MongoDB, Oracle, or other databases and—at least in theory—to do so in an efficient and optimal manner.
Currently, Drill can query data from relational systems that have a JDBC or ODBC connector, from
systems that are supported by Hive, from a variety of cloud-based distributed file systems (Amazon S3,
Google Cloud Drive), and from MongoDB.
Let’s look at the MongoDB support, since it allows us to see how Drill deals with nontabular data. Here,
we use Drill to query our sample MongoDB collections.
Simple queries are, of course, simple:
0: jdbc:drill:zk=local> SELECT Title FROM films WHERE Rating='G' LIMIT 5;
+--------------------+
|       Title        |
+--------------------+
| ACE GOLDFINGER     |
| AFFAIR PREJUDICE   |
| AFRICAN EGG        |
| ALAMO VIDEOTAPE    |
| AMISTAD MIDSUMMER  |
+--------------------+
5 rows selected (1.365 seconds)
We can drill into subdocuments using a notation that is similar to the N1QL array notation. So, here we
retrieve data from the third entry (index 2) in the embedded Actors array for the film West Lion, using the array
notation Actors[2]:
0: jdbc:drill:zk=local> SELECT Actors[2].`First name`, Actors[2].`Last name`
. . . . . . . . . . . >    FROM films WHERE Title='WEST LION';
+---------+-----------+
| EXPR$0  |  EXPR$1   |
+---------+-----------+
| SEAN    | WILLIAMS  |
+---------+-----------+
The FLATTEN function returns one row for every document in an embedded array. It’s somewhat similar
to the Hive EXPLODE function or the N1QL UNNEST clause. Note that each document is returned in JSON
format; there doesn’t seem to be a way currently to schematize these results:
0: jdbc:drill:zk=local> SELECT Title, FLATTEN(Actors)
. . . . . . . . . . . >    FROM films WHERE Rating='G' LIMIT 5;
+-------------------+-------------------------------------------------------------
|       Title       |                          EXPR$1
+-------------------+-------------------------------------------------------------
| ACE GOLDFINGER    | {"First name":"BOB","Last name":"FAWCETT","actorId":19}
| ACE GOLDFINGER    | {"First name":"MINNIE","Last name":"ZELLWEGER",
| ACE GOLDFINGER    | {"First name":"SEAN","Last name":"GUINESS", "actorId":9
| ACE GOLDFINGER    | {"First name":"CHRIS","Last name":"DEPP","actorId":160}
| AFFAIR PREJUDICE  | {"First name":"JODIE","Last name":"DEGENERES",
+-------------------+-------------------------------------------------------------
5 rows selected (0.589 seconds)
We can see that Drill has a basic ability to navigate complex JSON documents, and we can expect this
capability to improve over time.
Drill can also navigate wide column store structures in HBase. Let’s look at the data that we inserted
into HBase earlier in this chapter, this time using Drill:
0: jdbc:drill:zk=local> SELECT * FROM friends;
+---------+---------+------+
| row_key | friends | info |
+---------+---------+------+
| [[email protected] | {"Jo":"am9AZ21haWwuY29t","John":"am9obkBnbWFpbC5jb20="} | {"email":"Z3V5QGdtYWlsLmNvbQ==","userid":"OTk5MA=="} |
| [[email protected] | {"John":"am9obkBnbWFpbC5jb20=","Guy":"Z3V5QGdtYWlsLmNvbQ==","Paul":"cGF1bEBnbWFpbC5jb20=","Ringo":"cmluZ29AZ21haWwuY29t"} | {"email":"am9AZ21haWwuY29t","userid":"OTk5MQ=="} |
+---------+---------+------+
2 rows selected (1.532 seconds)
Not very friendly output! Initially, Drill returns HBase data without decoding the internal byte array
structure and without flattening any of the maps that define our wide column family.
However, we can use the FLATTEN function to extract one row for each column in our wide column
family "friends", the KVGEN function to convert the column map into key-value pairs, and the CONVERT_FROM
function to cast the byte arrays into Unicode characters:
0: jdbc:drill:zk=local>
WITH friend_details AS
(SELECT info, FLATTEN(KVGEN(friends)) AS friend_info FROM friends)
SELECT CONVERT_FROM(friend_details.info.email,'UTF8') AS email,
CONVERT_FROM(friend_details.friend_info.`value`,'UTF8')
AS friend_email
FROM friend_details;
+----------------+-------------------+
|     email      |   friend_email    |
+----------------+-------------------+
| guy@gmail.com  | jo@gmail.com      |
| guy@gmail.com  | john@gmail.com    |
| jo@gmail.com   | john@gmail.com    |
| jo@gmail.com   | guy@gmail.com     |
| jo@gmail.com   | paul@gmail.com    |
| jo@gmail.com   | ringo@gmail.com   |
+----------------+-------------------+
Drill shows an enormous amount of promise. A single SQL framework capable of navigating the variety
of data structures presented by relational and nonrelational systems is just what we need to resolve the
Tower of Babel problem presented by the vast array of languages and interfaces characterizing the next
generation databases of today and the future.
Other SQL on NoSQL
There are a number of other notable SQL-on-NoSQL systems:
•	Presto is an open-source SQL engine similar in many respects to Drill that can query data in JDBC, Cassandra, Hive, Kafka, and other systems.
•	Many relational database vendors provide connectors that allow their SQL language to retrieve data from Hadoop or other systems. Examples include Oracle Big Data SQL, IBM BigSQL, and Teradata QueryGrid.
•	Apache Phoenix provides a SQL layer for HBase.
•	Dell Toad Data Point provides SQL access to a variety of nonrelational systems, including MongoDB, Cassandra, HBase, and DynamoDB. (Disclaimer: I lead the team at Dell that develops Toad.)
Conclusion
This chapter tried to give you a feel for the languages and APIs provided by next-generation database
systems. In many cases, low-level programming APIs are all that is provided. However, the trend for the
future seems clear: SQL is reasserting itself as the lingua franca of the database world. Virtually all new
databases are becoming accessible by SQL, and many systems are adopting SQL-like interfaces even for
low-level programming.
It’s unlikely that SQL will again become the sole interface to databases of the future: the unique
requirements of wide column systems and document databases suggest that non-SQL idioms will be
required, although Cassandra CQL and Couchbase N1QL do show how a SQL-like language remains
a useful abstraction for dealing with data that might not be in relational format. Nevertheless, it seems
increasingly likely that most next-generation databases will eventually support some form of SQL access,
even if only through an independent layer such as Drill.
Note
1. Code examples can be found at https://github.com/gharriso/NextGenDBSamples.
Chapter 12
Databases of the Future
The human brain had a vast memory storage. It made us curious and very creative.... And
that brain did something very special. It invented an idea called “the future.”
—David Suzuki
Every revolution has its counterrevolution—that is a sign the revolution is for real.
—C. Wright Mills
This book is the story of how a revolution in database technology saw the “one size fits all” traditional
relational SQL database give way to a multitude of special-purpose database technologies. In the past
11 chapters, we have reviewed the major categories of next-generation database systems and have taken a
deep dive into some of the internal architectures of those systems.
Most of the technologies we have reviewed are still evolving rapidly, and few would argue that we’ve
reached an end state in the evolution of database systems. Can we extrapolate from the trends that we
have reviewed in this book and the technology challenges we’ve observed to speculate on the next steps in
database technology? Furthermore, is there any reason to think that any of the revolutionary changes we’ve
reviewed here are moving along on the wrong track? Should we start a counterrevolution?
In this chapter I argue that the current state of play—in which one must choose between multiple
overlapping compromise architectures—should and will pass away. I believe that we will see an increasing
convergence of today’s disparate technologies. The database of the future, in my opinion, will be one that
can be configured to support all or most of the workloads that today require unique and separate database
technologies.
The Revolution Revisited
As we discussed in Chapter 1, the three major eras of database technology correspond to the three major
eras of computer applications: mainframe, client-server, and modern web. It’s not surprising, therefore,
that the story of the evolution of modern databases parallels the story of the development of the World
Wide Web. The predominant drivers for the latest generation of database systems are the drivers that
arose from the demands of Web 2.0, global e-commerce, Big Data, social networks, cloud computing,
and—increasingly—the Internet of Things (IoT). These buzzwords represent more than simple marketing
claims: they each demand and reflect significant changes in application architectures to which database
technologies must respond. Today’s application architectures and database technologies are continuously
challenged to meet the needs of applications that demand an unparalleled level of scale, availability, and
throughput.
The imperative represented by these challenges is irresistible: for most enterprises, Big Data, social
networks, mobile devices, and cloud computing represent key competitive challenges. The ability to
leverage data to create competitive advantage is key to the survival of the modern business organization,
as is the ability to deploy applications with global scope and with mobile and social context. It’s hard to
imagine a successful modern business that did not have a strategy to exploit data or to engage with users
via social networks and mobile channels. For some industries, the IoT represents a similar threat and
opportunity: Internet-enabled devices stand to revolutionize manufacturing, health care, transportation,
home automation, and many other industries.
The shift in application architectures that has emerged from these demands has been conveniently
summarized by the market-research company IDC and others as “the third platform.”
Many of the key database variations we have reviewed in this book are optimized to satisfy one or more
of the challenges presented by this third platform:
•	Hadoop and Spark exist to provide a platform within which masses of semi-structured and unstructured data can be stored and analyzed.
•	Nonrelational operational databases such as Cassandra and MongoDB exist to provide a platform for web applications that can support global scale and continuous availability, and which can rapidly evolve new features. Some, like Cassandra, appeal because of their ability to provide scalability and economies across a potentially global deployment. Others, like MongoDB, appeal because they allow more rapid iteration of application design.
•	Graph databases such as Neo4j and Graph Compute Engines allow for the management of network data such as is found in social networks.
However, within each of these domains we see continual demand for the key advantages provided by
traditional RDBMS systems. In particular:
•	SQL provides an interface for data query that has stood the test of time, that is familiar to millions of human beings, and that is supported by thousands of analytic tools.
•	The relational model of data represents a theoretically sound foundation for unambiguous and accessible data models. The relational model continues to be the correct representation for most computer datasets, even if the physical implementation takes a different form.
•	Transactions, potentially multi-object and ACID, continue to be mandatory in many circumstances for systems that strive to correctly represent all interactions with the system.
Counterrevolutionaries
It would be hard for anyone to argue that some sort of seismic shift in the database landscape has not
occurred. Hadoop, Spark, MongoDB, Cassandra, and many other nonrelational systems today form an
important and growing part of the enterprise data architecture of many, if not most, Fortune 500 companies.
It is, of course, possible to argue that all these new technologies are a mistake, that the relational model and
the transactional SQL relational database represent a better solution and that eventually the market will
“come to its senses” and return to the relational fold.
While it seems unlikely to me that we would make a complete return to a database architecture that
largely matured in the client-server era, it is I think fairly clear that most next-generation databases represent
significant compromises. Next-generation databases of today do not represent a “unified field theory” of
databases; quite the contrary. We still have a long way to go.
A critic of nonrelational systems might fairly claim that the latest breed of databases suffers from the
following weaknesses:
•	A return of the navigational model. Many of the new breed of databases have reinstated the situation that existed in pre-relational systems, in which logical and physical representations of data are tightly coupled in an undesirable way. One of the great successes of the relational model was the separation of logical representation from physical implementation.
•	Inconsistent to a fault. The inability in most nonrelational systems to perform a multi-object transaction, and the possibility of inconsistency and unpredictability in even single-object transactions, can lead to a variety of undesirable outcomes that were largely solved by the ACID transaction and multi-version concurrency control (MVCC) patterns. Phantom reads, lost updates, and nondeterministic behaviors can all occur in systems in which the consistency model is relaxed.
•	Unsuited to business intelligence. Systems like HBase, Cassandra, and MongoDB provide more capabilities to the programmer than to the business owner. Data in these systems is relatively isolated from normal business intelligence (BI) practices. The absence of a complete SQL layer that can access these systems isolates them from the broader enterprise.
•	Too many compromises. There are a wide variety of specialized database solutions, and in some cases these specialized solutions will be an exact fit for an application's requirements. But in too many cases the application will have to choose between two or more NQR (not quite right) database architectures.
Have We Come Full Circle?
It’s not unusual for relational advocates to claim that the nonrelational systems are a return to pre-relational
architectures that were discarded decades ago. This is inaccurate in many respects, but in particular because
pre-relational systems were nondistributed, whereas today’s nonrelational databases generally adopt a
distributed database architecture. That alone makes today’s nonrelational systems fundamentally different
from the pre-relational systems of the 1960s and ’70s.
However, there is one respect in which many newer nonrelational systems resemble pre-relational
databases: they entangle logical and physical representations of data. One of Edgar Codd’s key critiques of
pre-relational systems such as IDMS and IMS was that they required the user to be aware of the underlying
physical representation of data. The relational model decoupled these logical and physical representations
and allowed users to see the data in a logically consistent representation, regardless of the underlying
storage model. The normalized relational representation of data avoids any bias toward a particular
access pattern or storage layout: it decouples logical representation of data from underlying physical
representation.
Many advocates of modern nonrelational systems explicitly reject this decoupling. They argue that by
making all access patterns equal, normalization makes them all equally bad. But this assertion is dubious: it
ignores the possibility of de-normalization as an optimization applied atop a relational model. It’s clear that
the structure presented to the end user does not have to be the same as the structure on disk.
Indeed, the motivation for abandoning the relational model was driven less by these access pattern
concerns and more by a desire to align the databases’ representation of data with the object-oriented
representation in application code. An additional motivation was to allow for rapidly mutating schemas: to
avoid the usually lengthy process involved in making a change to a logical data model and propagating that
change through to a production system.
Was Codd wrong in 1970 to propose that the physical and logical representations of data should be
separate? Almost certainly not. The ability to have a logical data model that represents an unambiguous and
nonredundant view of the data remains desirable. Indeed, most nonrelational modeling courses encourage
the user to start with some form of “logical” model.
I argue, therefore, that the need for a high-level logical model of the data is still desirable and that to
allow users to interact with the database using that model remains as valid a requirement as it was when the
relational model was first proposed. However, modern applications need the ability to propagate at least
minor changes to the data model without necessitating an unwieldy database change control process.
An Embarrassment of Choice
Ten years ago, choosing the correct database system for an application was fairly straightforward: choose the
relational database vendor with which one had an existing relationship or which offered the best licensing
deals. Technical considerations or price might lead to a choice of Oracle over SQL Server, or vice versa, but
that the database would be an RDBMS was virtually a given. Today, choosing the best database system is
a much more daunting task, often made more difficult by the contradictory claims of various vendors and
advocates. Most of the time, some form of relational technology will be the best choice. But for applications
that may seek to break outside the RDBMS comfort zone and seek competitive advantage from unique
features of new database technologies, the choice of database can be decisive.
“Which is the best database to choose?” Often, there is no right answer to the question because each choice
implies some form of compromise. An RDBMS may be superior in terms of query access and transactional
capability, but may fail to deliver the network partition tolerance required if the application grows. Cassandra may
deliver the best cross-data-center availability, but may fail to integrate with BI systems. And so on ...
Figure 12-1 illustrates some of the decision points that confront someone trying to decide upon a
database today.
Figure 12-1. Decisions involved in choosing the correct database
Can We Have It All?
I’ve become convinced that we can “have it all” within a single database offering. For instance, there is no
architectural reason why a database system should not be able to offer a tunable consistency model that
includes at one end strict multi-record ACID transactions and at the other end an eventual consistency style
model. In a similar fashion, I believe we could combine the features of a relational model and the document
store, initially by following the existing trend toward allowing JSON data types within relational tables.
The resistance to this sort of convergence will likely be driven as much by market and competitive
considerations as by technological obstacles. It may not be in the interests of either the RDBMS incumbents or
the nonrelational upstarts to concede that a mixed model should prevail. And supporting such a mixed
model may involve technical complexity that will be harder for the smaller and more recent market entrants to absorb.
Nevertheless, this is what I believe would be the best outcome for the database industry as a whole.
Rather than offering dozens of incompatible technologies that involve significant compromises, it would be
better to offer a coherent database architecture in which the features that best meet application requirements
are available as configurable behaviors. Let’s look at each area of convergence and consider what would be required to
combine technologies into a single system.
Consistency Models
Databases like Cassandra that are built on the Dynamo model already provide a tunable consistency
model that allows the administrator or developer to choose the desired trade-off between consistency and performance.
Dynamo-based systems are well known for providing eventual consistency, but they are equally capable
of delivering strict consistency—at least within a single-object transaction. RDBMS systems also provide
control over isolation levels—providing levels that guarantee repeatable reads, for instance.
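To make the idea of tunable consistency concrete, the following sketch shows how a Cassandra client can choose a consistency level per operation in cqlsh; the keyspace, table, and values are invented for illustration:

-- Hypothetical keyspace and table, replicated three ways
CREATE KEYSPACE store
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE store.accounts (
  account_id uuid PRIMARY KEY,
  balance    decimal
);

-- In cqlsh, the CONSISTENCY command sets the level for subsequent statements:
-- ONE favors availability and latency; QUORUM and ALL favor consistency.
CONSISTENCY QUORUM;

SELECT balance FROM store.accounts
 WHERE account_id = 123e4567-e89b-12d3-a456-426655440000;

-- A lightweight transaction adds compare-and-set semantics, but only for a
-- single row/partition, not a multi-object transaction.
UPDATE store.accounts SET balance = 90.00
 WHERE account_id = 123e4567-e89b-12d3-a456-426655440000
 IF balance = 100.00;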
However, systems like Cassandra that are based on Dynamo cannot currently provide multi-object
transactions. Meanwhile, RDBMS consistency in existing implementations is influenced by ACID
transactional principles, which require that the database always present a consistent view to all users.
As we discussed in Chapter 3, the ultimate motivation for eventual consistency-type systems is the
ability to survive network partitions. If a distributed database is split in two by a network partition—the “split
brain” scenario—the database can only maintain availability if it sacrifices some level of strict consistency.
Implementing multi-row transactions within an eventually consistent, network partition-tolerant
database would undoubtedly be a significant engineering challenge, but it is not obviously impossible. Such
a database would allow the developer or the administrator to choose between strictly consistent ACID and
eventually consistent transactions that involve multiple objects (as shown in Figure 12-2).
Figure 12-2. A possible convergence of consistency models
Schema
While relational fundamentalists may claim that commercial RDBMS systems have not implemented a pure
version of the relational model, the traditional RDBMS has at least established a well-defined discipline for
the data modeling process. Data “normalization” eliminates redundancy and ambiguity from the logical
representation. By and large, the relational representation of data has proved invaluable in providing
non-programmers with a comprehensible view of data and—together with the adoption of SQL as
a common language for data access—has offered a predictable and accessible interface to business
intelligence and query tools.
Dissent with the relational model arose for at least two reasons:
•	Programmers desired a data store that could accept object-oriented data without the overhead and complexity involved in deconstructing it into, and reconstructing it from, normal form.
•	The lifecycle of a relational model represented somewhat of a waterfall process, requiring that the model be comprehensively defined at the beginning of a project and making it difficult to change once deployed to production. Modern agile application development practices reject this waterfall approach in favor of iterative development in which change to the design is expected. Furthermore, modern web applications need to iterate new features fast in order to adapt to intensely competitive environments. While modern RDBMS systems can perform online schema modifications, the coordination of code and schema changes creates a risk of application failure and generally requires careful and time-consuming change control procedures.
In short, modern applications require flexible schemas that can be modified if necessary on the fly by
changed application code.
However, the need for a comprehensible and unambiguous data model that can be used for business
intelligence is even more important in this world of Big Data than it was in the early relational era.
Nonrelational databases are unable to easily integrate into an organization’s business intelligence (BI)
capability; in the worst case, the data they contain are completely opaque to BI frameworks.
Providing a best-of-both-worlds solution seems within the capabilities of some existing databases.
A database that allows data to be represented at a high level in a relatively stable normal form, but which
also allows for the storage of dynamically mutating data, only requires that columns in a relational table
be able to store arbitrarily complex structures—JSON, for instance—and that these complex structures be
supported by efficient query mechanisms integrated into SQL.
As we will see, such hybrid capabilities already exist. Virtually all databases—relational and
nonrelational—are introducing support for JSON. For instance, Riak, Cassandra, PostgreSQL, and
Oracle all provide specific mechanisms for indexing, storing, and retrieving JSON structures. It will soon
be meaningless to describe a database as a “document” database, since all databases will provide strong
support for JSON.
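As a sketch of what such a hybrid looks like in practice, PostgreSQL (one of the databases just mentioned) allows a schema-flexible JSONB column to sit alongside conventional relational columns; the table, column, and index names here are invented for illustration:

-- A conventional relational table with one schema-flexible JSONB column
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    profile     JSONB              -- dynamically evolving attributes live here
);

-- A GIN index makes containment queries on the JSON column efficient
CREATE INDEX customers_profile_idx ON customers USING GIN (profile);

-- Relational columns and JSON attributes can be combined in a single query
SELECT name,
       profile ->> 'loyalty_tier' AS loyalty_tier
FROM   customers
WHERE  profile @> '{"country": "NZ"}';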
I’m sure that relational purists will argue such a hybrid solution is, in fact, the work of the devil in that
it compromises the theoretical basis of the relational model. Perhaps so, and a relational implementation
that allowed prototyping and rapid iteration of data models while preserving relational purity would be
welcome. But for now I believe the short-term direction is set: we’re going to embed JSON in relational
tables to balance flexibility with the advantages of the relational model. Figure 12-3 illustrates a vision for the
convergence of schema elements in a database of the future.
Figure 12-3. A possible convergence of schema models
Database Languages
Codd’s definition of the relational database did not specify the SQL language, and it is perfectly possible for a
formally relational database to use some other language. However, over time the advantages of a single
cross-platform data access language became obvious, and the industry united behind SQL.
The term “NoSQL” is unfortunate in that it implies a rejection of the SQL language rather than the more
fundamental issues that were at the core of the changes in database technology. However, it is true that in
many cases SQL was incompatible with or inappropriate for these new systems, which often supported record-at-a-time processing using the low-level APIs described in Chapter 11.
Today, SQL has regained its status as the lingua franca of the database world, and it seems clear that
most databases will allow for SQL queries, whether natively as in Couchbase’s N1QL or via a SQL processing
framework such as Apache Drill. Even for databases that cannot natively support the full range of SQL
operations, we can see that a reduced SQL syntax enhances usability and programmer efficiency: compare,
for instance, the API presented by HBase with the SQL-like interface provided by Apache Cassandra CQL
(both are described in Chapter 11).
The emerging challenge to unify databases is not so much to provide SQL access to nonrelational
systems as to allow non-SQL access to relational systems. While the SQL language can satisfy a broad variety
of data queries, it is not always the most suitable language. In particular, there are data science problems
that may call for a lower-level API such as MapReduce or a more complex directed acyclic graph (DAG)
algorithm. We’ve also seen how graph traversal operations cannot easily be specified in SQL, and can be
more easily expressed in an alternative syntax such as Gremlin or Cypher.
It’s a formal principle of the relational model that one ought not to be able to bypass the set-based query
language (e.g., SQL). This is specified in Codd’s nonsubversion rule (Rule 12). However, if the relational database
is going to maintain relevance across all realms, it may be necessary to either relax this rule or provide
alternative processing APIs on top of a SQL foundation.
Additionally, when hybrid databases are storing data primarily within embedded JSON structures
rather than in relational tables, requiring a SQL syntax to navigate these JSON documents may be unwieldy.
An API closer to that native to MongoDB may be more appropriate. Expressing queries as JSON documents
pushed to the database via a REST interface (as in existing JSON document databases) might be desirable.
In fact, some of the relational database vendors have already provided much of what is outlined above.
We will see later in this chapter how Oracle has provided support for JSON REST queries and the Cypher
graph language. Figure 12-4 illustrates the vision for integration of database languages and APIs.
Figure 12-4. A possible convergence path for database languages
Storage
The structure of data on disk or other persistent media has an enormous effect on performance and the
economics of the database. Chapter 7 discussed the need to be able to tier data in multiple storage layers
(memory, SSD, disk, HDFS). We’ve also seen in Chapter 6 how the columnar representation of data can lead
to significant improvements for analytical databases. Finally, we saw in Chapter 10 how the B-tree indexing
structures that dominated in the traditional RDBMS can be inferior for write-intensive workloads to the
log-structured merge tree architectures of Cassandra and HBase.
So, as with the other forks in our technology decision tree, there is no one correct storage layout that
suits all workloads. Today, we generally have to pick the database that natively supports the storage layout
that best suits our application.
Of course, the underlying storage mechanism for a database is hidden from an application: the access
methods to retrieve data must reflect the schematic representation of data, but the structure on disk is
generally opaque to the application. Furthermore, we have already seen databases that can maintain
multiple storage engines: MySQL has had a pluggable storage engine interface for over 10 years, and
MongoDB has recently announced a pluggable storage architecture supporting a handful of alternative
storage engines.
We can also see in systems like HANA the ability to choose row-based or columnar storage for specific
tables based on anticipated workload requirements.
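MySQL’s long-standing per-table storage engine clause gives a flavor of what “choose-able” storage looks like in SQL today; the table shown is purely illustrative (the JSON type assumes MySQL 5.7 or later):

-- The storage engine is selected per table at creation time
CREATE TABLE session_log (
    session_id BIGINT PRIMARY KEY,
    payload    JSON
) ENGINE = InnoDB;        -- transactional, B-tree based engine

-- The engine can be switched later; MySQL rebuilds the table's storage
ALTER TABLE session_log ENGINE = MyISAM;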
This pluggable—or at least “choose-able”—storage engine architecture seems to be the right direction
for databases of the future. I believe these databases should allow tables or collections to be managed as
columnar or row-based, supported by B-trees or log-structured merge trees, resident in memory or stored
on disk. In some cases we may even wish to store data in a native graph format to support real-time graph
traversal (although in most cases it will be sufficient to layer a graph compute engine over a less restrictive
structure). Figure 12-5 illustrates a vision for pluggable storage engines.
Figure 12-5. Convergence options for database storage
A Vision for a Converged Database
By consolidating the convergence visions for individual aspects of database technology, we are in a position
to outline the characteristics of an ideal database management system. The key requirement can be
summarized as follows:
■■An ideal database architecture would support multiple data models, languages, processing paradigms,
and storage formats within the one system. Application requirements that dictate a specific database feature
should be resolved as configuration options or pluggable features within a single database management
system, not as choices from disparate database architectures.
Specifically, an ideal database architecture would:
•	Support a tunable consistency model that allows for strict RDBMS-style ACID transactions, Dynamo-style eventual consistency, or any point in between.
•	Provide support for an extensible but relationally compatible schema by allowing data to be represented broadly by a relational model, while also allowing for application-extensible schemas, possibly by supporting embedded JSON data types.
•	Support multiple languages and APIs. SQL appears destined to remain the primary database access language, but it should be supplemented by graph languages such as Cypher, document-style queries based on REST, and the ability to express processing in MapReduce or other DAG algorithms.
•	Support an underlying pluggable data storage model, allowing the physical storage of data to be row-oriented or columnar as appropriate, and held on disk in B-trees, log-structured merge trees, or other optimal storage structures.
•	Support a range of distributed availability and consistency characteristics. In particular, the application should be able to determine the level of availability and consistency that is supported in the event of a network partition, and be able to fine-tune the replication of data across a potentially globally distributed system.
Meanwhile, Back at Oracle HQ ...
Oracle was, of course, the company that first commercialized relational database technology, and as
such might be expected to have the most vested interest in maintaining the RDBMS status quo. To some
extent this is true: Oracle continues to dominate the RDBMS market, and actively evangelizes SQL and the
relational model.
However, behind the scenes, Oracle has arguably come as close as any other vendor to pursuing the
vision for the converged database of the future that I outlined earlier. In particular, Oracle has:
•	Provided its own engineered system based on Hadoop: the Oracle Big Data Appliance. The Big Data Appliance includes the Cloudera distribution of Hadoop (including Spark). It also incorporates significant performance innovations, pushing predicate filtering down to data nodes using technology originally created for the Oracle Exadata RDBMS appliance. Oracle also provides connectors that facilitate data movement between the Big Data Appliance and the Oracle RDBMS, as well as Oracle Big Data SQL, which allows SQL from the Oracle RDBMS to target data held in the Oracle Big Data Appliance.
•	Provided very strong support for JSON embedded in the RDBMS. JSON documents may be stored in LOB (see the following section) or character columns, and may be retrieved using extensions to SQL. JSON in the database may also be accessed directly by a REST-based JSON query API, Simple Oracle Document Access (SODA). This API strongly resembles the MongoDB query API.
•	Offered Oracle REST Data Services (ORDS), which also provides a REST-based interface to data in relational tables. This supplies a non-SQL-based API for retrieving table data using embedded JSON query documents in REST calls, similar to the JSON-based query language supported by SODA.
•	Enhanced its graph compute engine. The new Oracle graph will allow graph analytics using openCypher (the graph language originated by Neo4j) to be performed on any data stored in the RDBMS or Big Data Appliance.
•	Supported a shared-nothing sharded distributed database, which provides an alternative distributed database model that can deliver a more linearly scalable OLTP solution than the shared-disk RAC clustered database architecture.
Let’s look at some of these offerings in more detail.
Oracle JSON Support
Oracle allows JSON documents to be stored in Oracle LOB (Large OBject) or character columns. Oracle
provides a check constraint that can be used to ensure the data in those columns is a valid JSON document:
CREATE TABLE ofilms
(
   id            INTEGER PRIMARY KEY,
   json_document BLOB
                 CONSTRAINT ensure_json CHECK (json_document IS JSON)
)
Oracle supports functions within its SQL dialect that allow for JSON documents to be queried.
Specifically:
•	JSON_QUERY returns a portion of a JSON document; it uses JSON path expressions that are similar to XPath for XML.
•	JSON_VALUE is similar to JSON_QUERY, but returns a single scalar value from a JSON document.
•	JSON_EXISTS determines whether a specific element exists in the JSON document.
•	JSON_TABLE projects a portion of a JSON document as a relational table. For instance, JSON_TABLE can be used to project each subelement of a JSON document as a distinct virtual row.
The example that follows uses the “films” JSON document collection that appeared in Chapter 11 to
illustrate MongoDB and Couchbase queries. To refresh your memory, these documents look something
like this:
{
    "Title": "ACE GOLDFINGER",
    "Category": "Horror",
    "Description": "A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China",
    "Length": "48",
    "Rating": "G",
    "Actors": [
        {
            "First name": "BOB",
            "Last name": "FAWCETT",
            "actorId": 19
        },
        ... ...
        {
            "First name": "CHRIS",
            "Last name": "DEPP",
            "actorId": 160
        }
    ]
}
Simple dot notation can be used to expand the JSON structures (o.json_document.Title, in the
example that follows), while JSON_QUERY can return a complete JSON structure for a selected path within
the document. In this example, JSON_QUERY returns the nested Actors array.
SQL> SELECT o.json_document.Title,
  2         JSON_QUERY (json_document, '$.Actors') actors
  3    FROM "ofilms" o
  4   WHERE id = 4
  5  /

TITLE
------------------------------------------------------------
ACTORS
------------------------------------------------------------
AFFAIR PREJUDICE
[{"First name":"JODIE","Last name":"DEGENERES","actorId":41}
,{"First name":"SCARLETT","Last name":"DAMON","actorId":81} ...
Returning a nested JSON document as a string isn’t helpful in a relational context, so this is where we
could use JSON_TABLE to return a single row for each actor. Here is a query that returns the film title and the
names of all the actors in the film:
SQL> SELECT f.json_document.Title, a.*
  2    FROM "ofilms" f,
  3         JSON_TABLE(json_document, '$.Actors[*]'
  4            COLUMNS("First" PATH '$."First name"',
  5                    "Last"  PATH '$."Last name"')) a
  6   WHERE id = 4;

TITLE                First           Last
-------------------- --------------- ---------------
AFFAIR PREJUDICE     JODIE           DEGENERES
AFFAIR PREJUDICE     SCARLETT        DAMON
AFFAIR PREJUDICE     KENNETH         PESCI
AFFAIR PREJUDICE     FAY             WINSLET
AFFAIR PREJUDICE     OPRAH           KILMER
The query returns five rows from the single film document because there are five actors within the
actors array in that document. JSON_TABLE is roughly equivalent to the UNNEST clause in N1QL or the
FLATTEN clause within Apache Drill (both discussed in Chapter 11).
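The remaining functions in the list, JSON_VALUE and JSON_EXISTS, can be sketched against the same collection; the following is illustrative rather than captured output:

SELECT o.id,
       JSON_VALUE(json_document, '$.Title')  AS title,
       JSON_VALUE(json_document, '$.Rating') AS rating
  FROM "ofilms" o
 WHERE JSON_EXISTS(json_document, '$.Actors[1]')        -- at least two actors
   AND JSON_VALUE(json_document, '$.Category') = 'Horror';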
Accessing JSON via Oracle REST
JSON documents and collections can be created, manipulated, and retrieved entirely without SQL, using the
REST-based Simple Oracle Document Access (SODA) protocol.
Although the collections can be created without any SQL-based data definition language statements,
under the hood the implementation is as we discussed in the last section—a database table is created
containing a LOB that stores the JSON documents.
Regardless of whether the collection was created by the REST API or through SQL, the documents may
be interrogated using REST calls. For instance, Figure 12-6 shows a REST command retrieving a document
with an ID of 4:
Figure 12-6. Oracle REST JSON query fetching a row by ID
The REST SODA interface can be used to perform CRUD (Create, Read, Update, Delete) operations.
Adding a “document” to a “collection” creates a new row in the table.
The interface also supports a query mechanism that can be used to retrieve documents matching
arbitrary filter criteria. Figure 12-7 illustrates a simple REST query issued through this interface (via the Google
Chrome “Postman” extension, which can be used to prototype REST calls).
Figure 12-7. Simple SODA REST query
The JSON document filter syntax is extremely similar to that provided by MongoDB. Indeed, the Oracle
developers intentionally set out to provide a familiar experience for users of MongoDB and other document
databases.
Figure 12-8 illustrates a more complex query in which we search for movies longer than 60 minutes with
a G rating, sorted by descending length.
Figure 12-8. Complex Oracle REST query
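Although the exact request appears only in the figure, the filter document for such a query would look roughly like the following MongoDB-style query-by-example (a sketch, assuming Length is compared numerically):

{
  "$query"   : { "Rating" : "G", "Length" : { "$gt" : 60 } },
  "$orderby" : { "Length" : -1 }
}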
REST Access to Oracle Tables
The Oracle REST Data Services (ORDS) API provides a REST interface for relational tables as well. The mechanism is
virtually identical to the JSON SODA interface we examined earlier.
In Figure 12-9, we see a REST query retrieving data from the Oracle CUSTOMERS table, providing a
simple query filter in the HTTP string.
Figure 12-9. Oracle REST interface for table queries
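The request in Figure 12-9 is of roughly the following shape; the host, port, schema alias, and filter column are placeholders rather than values from a real deployment:

GET http://server:8080/ords/hr/customers/?q={"cust_city":"Boston"}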
Oracle Graph
Oracle has long supported an RDF-based graph capability within its Oracle Spatial and Graph option.
This option was based on the RDF W3C standard outlined in Chapter 6. However, while this capability
was significant, it fell short of what is offered in popular property graph
databases such as Neo4j.
As you may recall from Chapter 6, an RDF system is an example of a triple store, which represents
relatively simple relationships between nodes. In contrast, a property graph represents the relationships
between nodes but also allows the model to store significant information (properties) within both the nodes and
the relationships. While RDF graphs provide strong support for the development of ontologies and support
distributed data sources, property graphs are often more attractive for custom application development
because they provide a richer data model.
Oracle has recently integrated a property graph compute engine that can perform graph analytics on
any data held in the Oracle RDBMS, the Oracle Big Data Hadoop system, or Oracle’s own NoSQL system.
The compute engine supports the openCypher graph language. Cypher is the language implemented within
Neo4J, the most popular dedicated graph database engine, and openCypher has been made available as an
open-source version of that language.
Oracle’s implementation does not represent a native graph database storage engine. As you may recall
from Chapter 6, a native graph database must implement index-free adjacency, effectively implementing the
graph structure in base storage. Rather, Oracle implements a graph compute engine capable of loading data
from a variety of formats into memory, where it can be subjected to graph analytics. The advantage of this
approach is that graph analytics can be applied to data held in any existing format, provided that format can
be navigated as a graph. Many existing relational schemas do in fact implement graph relationships. For
example, a typical organization schema will have foreign keys from employees to their managers, which can
be represented as a graph.
Oracle Sharding
Oracle sharding is a relatively new feature of the Oracle database, announced in late 2015. Oracle has long
provided a distributed database clustering capability. As far back as the early 1990s, Oracle was offering the
Oracle Parallel Server—a shared-disk clustered database. This database clustering technology eventually
evolved into Oracle Real Application Clusters (RAC), which was widely adopted and which represents the
most significant implementation of the shared-disk database clustering model. We discussed shared-disk
database clusters in more detail in Chapter 8.
However, the clustered architecture of RAC was more suited to data warehousing workloads than to
massively scalable OLTP. Massive scaling of RAC becomes problematic with write-intensive workloads,
as the overhead of maintaining a coherent view of data in memory leads to excessive “chatter” across the
private database network. For this reason, typical RAC clusters have only a small number of nodes—the vast
majority have fewer than 10.
The Oracle sharding option allows an OLTP workload to be deployed across as many as 1,000 separate
Oracle instances, using an architecture that is similar to the do-it-yourself sharding of MySQL, which was
reviewed in Chapter 3, and also quite similar to the MongoDB sharding architecture that was examined in
Chapter 8.
Figure 12-10 shows a representation of the Oracle sharding architecture. The coordinator database (1)
contains the catalog that describes how keys are distributed across shards. Each shard is implemented by
a distinct Oracle database instance. Queries that include the shard key can be sent directly to the relevant
database instance (3); the shard director is aware of the shard mappings and can provide the appropriate
connection details to the application (4). Queries that do not specify the shard key or that aggregate
across multiple shards (5) are mediated by the coordinator database, which acts as a proxy. The coordinator
database sends queries to various shards to retrieve the necessary data (6) and then aggregates or merges
the data to return the appropriate result set.
Figure 12-10. Oracle sharding
Oracle sharding supports distribution schemes similar to those supported by Oracle’s existing table
partitioning offerings. Data may be partitioned across shards by a hash, shard key ranges, or lists of shard
key values. There is also a composite scheme in which data is partitioned primarily by list or range, and
then hashed against a second shard key. When hash partitioning is used, Oracle can balance load
automatically by redistributing shard chunks as data volumes change or as nodes are added to or
removed from the sharded system. Range or list partitioning requires the user to balance shards manually.
Each shard may be replicated using Oracle replication technologies (Data Guard or GoldenGate).
Replicas can be used to satisfy read requests, though this may require that the application explicitly request
data from a replica rather than from the master.
Tables that don’t conform to the shard key or that are relatively small can be duplicated on each shard
rather than being sharded across the entire cluster. For instance, a products table might be duplicated across
all shards, while other data is sharded by customer ID.
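To give a feel for how these options surface in the SQL layer, a sharded table and a duplicated table might be declared roughly as follows (names invented and syntax abridged; the exact form is defined in Oracle's sharding documentation):

-- Sharded by consistent hash on the shard key; chunks are rebalanced
-- automatically as shards are added or removed.
CREATE SHARDED TABLE customers (
    cust_id   NUMBER        NOT NULL,
    name      VARCHAR2(100),
    region    VARCHAR2(30),
    CONSTRAINT customers_pk PRIMARY KEY (cust_id)
)
PARTITION BY CONSISTENT HASH (cust_id)
PARTITIONS AUTO
TABLESPACE SET ts_shard;

-- Small reference tables can be duplicated to every shard instead
CREATE DUPLICATED TABLE products (
    product_id NUMBER PRIMARY KEY,
    name       VARCHAR2(100)
);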
As with all sharding schemes, Oracle sharding breaks some of the normal guarantees provided by
a traditional RDBMS. While joins across shards are supported, there is no guarantee of point-in-time
consistency in results. Transactions that span shards are atomic only within a shard—there is no two-phase
commit for distributed transactions.
Oracle as a Hybrid Database
It’s surprising, and maybe even a little amusing, to see Oracle adopt a JSON interface that is clearly designed
to be familiar to MongoDB users, and to finally adopt a distributed database strategy that admits the
superiority of the shared-nothing architecture for certain workloads. But I for one find it encouraging to see
the leading RDBMS vendor learn from alternative approaches.
However, Oracle has yet to address one of the key challenges for an integrated database of the
future: balancing consistency and availability in the face of possible network partitions.
Oracle’s RAC clustered database explicitly chooses consistency over availability in the case of a network
partition: in a “split brain” scenario, isolated instances in the Oracle RAC cluster will be evicted or voluntarily
shut down rather than continue operating in an inconsistent state. Oracle sharding offers a potentially better
solution for an online system—theoretically, during a network partition some parts of the sharded database
may continue to be available in each partition. However, only a subset—selected shards—will be available to
each partition, and there is no mechanism to reconcile inconsistencies. Furthermore, in the sharded model,
transactional integrity—even when the entire database is available—is not guaranteed. Transactions or
queries that span shards may exhibit inconsistent behavior, for instance.
A mode in which a transactional relational database might maintain availability in the face of a network
partition would require some sort of merger between the transactional behavior implemented in ACID
RDBMS systems and Dynamo-style eventual consistency.
Oracle is not alone in the RDBMS world in its adoption of JSON or interest in nonrelational paradigms.
Whether the Oracle folks are sincerely attempting to move their flagship database product into the future or
simply trying to take the wind out of the sails of upstart competitors remains to be seen. But it is clear that
significant effort is going into engineering features that shift the Oracle RDBMS away from its traditional
RDBMS roots. And if nothing else, some of these features are suggestive of how a converged database system
might behave.
Other Convergent Databases
There are several other attempts to converge the relational and nonrelational models that are worth
mentioning here:
•	NuoDB is a SQL-based relational system that uses optimistic asynchronous propagation of transactions to achieve near-ACID consistency in a distributed context. Slight deviations from strict ACID consistency might result from this approach: not all replicas will be updated simultaneously across the cluster. In a manner somewhat similar to that of Dynamo systems, the user can tune the consistency levels required. NuoDB also separates the storage layer from the transactional layer, allowing for a pluggable storage engine architecture.
•	Splice Machine layers a relationally compatible SQL layer over an HBase-managed storage system. Although its key objective is to leverage HBase scalability to provide a more economically scalable SQL engine, it does allow for hybrid data access, because the data may be retrieved directly from HBase using MapReduce or YARN, as well as via SQL from the relational layer.
•	Cassandra has added strong support for JSON in its current release and is also integrating a graph compute engine into its enterprise edition. The Dynamo tunable consistency model, together with the Cassandra lightweight transaction feature, covers a broader range of transactional scenarios than other nonrelational competitors. However, there is no roadmap for full SQL or multi-object ACID transactions.
•	Apache Kudu is an attempt to build a nonrelational system that equally supports full-scan and record-based access patterns. The technological approach involves combining in-memory row-based storage and disk-based columnar storage. The stated intent is to bridge the gap between HDFS performance for complete “table” scans and HBase row-level access. Of itself, this doesn’t provide a truly hybrid solution, but coupled with Apache Impala, it could provide a SQL-enabled database that also provides key-value-style access and Hadoop compatibility. However, there is no plan as yet for multi-row transactions.
Disruptive Database Technologies
So far, I’ve described a future in which the recent divergence of database technologies is followed by a period
of convergence toward some sort of “unified model” of databases.
Extrapolating existing technologies is a useful pastime, and is often the only predictive technique
available. However, history teaches us that technologies don’t always continue along an existing trajectory.
Disruptive technologies emerge that create discontinuities that cannot be extrapolated and cannot always
be fully anticipated.
It’s possible that a disruptive new database technology is imminent, but it’s just as likely that the big
changes in database technology that have occurred within the last decade represent as much change as we
can immediately absorb.
That said, there are a few computing technology trends that extend beyond database architecture and
that may impinge heavily on the databases of the future.
Storage Technologies
Since the dawn of digital databases, there has been a strong conflict between the economies of speed and
the economies of storage. The media that offer the greatest economies for storing large amounts of data
(magnetic disk, tape) come with the slowest access times and therefore the worst economies for throughput and
latency. Conversely, the media that offer the lowest latencies and the highest throughput (memory, SSD) are
the most expensive per unit of storage.
As was pointed out in Chapter 7, although the price per terabyte of SSDs is dropping steadily, so is the
price per terabyte for magnetic disk. Extrapolation does not suggest that SSDs will be significantly cheaper
than magnetic disk for bulk storage anytime soon. And even if SSDs matched magnetic disk prices, the
price/performance difference between memory and SSD would still be decisive: memory remains many
orders of magnitude faster than SSD, but also many orders of magnitude more expensive per terabyte.
As long as these economic disparities continue, we will be encouraged to adopt database architectures
that use different storage technologies to optimize the economies of Big Data and the economies of “fast”
data. Systems like Hadoop will continue to optimize cost per terabyte, while systems like HANA will attempt
to minimize the cost of providing high throughput or low latency.
However, should a technology arise that simultaneously provides acceptable economies for mass
storage and latency, then we might see an almost immediate shift in database architectures. Such a universal
memory would provide access speeds equivalent to RAM, together with the durability, persistence, and
storage economies of disk.
Most technologists believe that it will be some years before such a disruptive storage technology arises,
though given the heavy and continuing investment, it seems likely that we will eventually create a persistent,
fast, and economical storage medium that can meet the needs of all database workloads. When this happens,
many of the database architectures we see today will have lost a key part of their rationale for existence. For
instance, the difference between Spark and Hadoop would become minimal if persistent storage (a.k.a. disk)
were as fast as memory.
There are a number of significant new storage technologies on the horizon, including Memristors and
Phase-change memory. However, none of these new technologies seems likely to imminently realize the
ambitions of universal memory.
Blockchain
You would have to have been living under a rock for the past few years not to have heard of bitcoin. The
bitcoin is an electronic cryptocurrency that can be used like cash in many web transactions. At the time of
this writing, there are about 15 million bitcoins in circulation, trading at approximately $US 360 each, for a
total value of about $US 5.3 billion.
The bitcoin combines peer-to-peer technology and public key cryptography. The owner of a bitcoin can
use a private key to assert ownership and authorize transactions; others can use the public key to validate
the transaction. As in other peer-to-peer systems, such as BitTorrent, there is no central server that maintains
bitcoin transactions; rather, there is a distributed public ledger called the blockchain.
The implications of cryptocurrencies are way beyond our scope here, but there are definite repercussions
for database technologies in the blockchain concept. Blockchain replaces the trusted third party that
must normally mediate any transfer of funds. Rather than there being a centralized database that records
transactions and authenticates each party, blockchain allows transactions and identities to be validated
by consensus within the blockchain network; that is, each transaction is confirmed by public-key-based
authentication from multiple nodes before being concluded.
The blockchain underlying the bitcoin is public, but there can be private (or permissioned) blockchains
that are “invitation only.” Whether private or public, blockchains arguably represent a new sort of shared
distributed database. Like systems based on the Dynamo model, the data in the blockchain is distributed
redundantly across a large number of hosts. However, the blockchain represents a complete paradigm shift
in how permissions are managed within the database. In an existing database system, the database owner
has absolute control over the data held in the database. However, in a blockchain system, ownership is
maintained by the creator of the data.
Consider a database that maintains a social network like Facebook: although the application is
programmed to allow only you to modify your own posts or personal details, the reality is that the Facebook
company actually has total control over your online data. The staff there can, if they wish, remove,
censor, or even modify your posts. In a blockchain-based database, you
would retain total ownership of your posts and it would be impossible for any other entity to modify them.
Applications based on blockchain have the potential to disrupt a wide range of social and economic
activities. Transfers of money, property, management of global identity (passports, birth certificates), voting,
permits, wills, health data, and a multitude of other transactional data could be regulated in the future by
blockchains. The databases that currently maintain records of these types of transactions may become
obsolete.
Most database owners will probably want to maintain control of the data in the database, and therefore
it’s unlikely that blockchain will completely transform database technology in the short term. However, it
does seem likely that database systems will implement blockchain-based authentication and authorization
protocols for specific application scenarios. Furthermore, it seems likely that formal database systems built
on a blockchain foundation will soon emerge.
Quantum Computing
The origins of quantum physics date back over 100 years, with the recognition that energy consists of
discrete packets, or quanta.
By the 1930s, most of the mindboggling theories of quantum physics had been fully articulated.
Individual photons of light appear to pass simultaneously through multiple slits in the famous twin-slit
experiment, provided they are not observed. The photons exist in a superposition of multiple states. Attempts
to measure the path of the photons cause them to collapse into a single state. Photons can be entangled, in
which case the state of one photon may be inextricably linked with the state of an otherwise disconnected
particle—what Albert Einstein called “spooky action at a distance.”
As Niels Bohr famously said, “If quantum mechanics hasn’t profoundly shocked you, you haven’t
understood it yet.” The conventional view of quantum physics is that multiple simultaneous probabilities
do not resolve until perceived by a conscious observer. This Copenhagen interpretation serves as the basis
for the famous Schrödinger’s Cat thought experiment, in which a cat is simultaneously dead and alive when
unobserved in an elaborate quantum contraption. Some have conjectured that Schrödinger’s dog actually
proposed this experiment.1
The undeniably mindboggling weirdness of various quantum interpretations does have significant
implications for real-world technology: most modern electronics are enabled directly or indirectly by
quantum phenomena, and this is especially true in computing, where the increasing density of silicon
circuitry takes us ever closer to the “realm of the very small” where quantum effects dominate.
Using quantum effects to create a new type of computer was popularized by physicist Richard Feynman
back in the 1980s. The essential concept is to use subatomic particle behavior as the building blocks of
computing. In essence, the logic gates and silicon-based building blocks of today’s physical computers
would be replaced by mechanisms involving superposition and entanglement at the subatomic level.
Quantum computers promise to provide a mechanism for leapfrogging the limitations of silicon-based
technology and raise the possibility of completely revolutionizing cryptography. The prospect that quantum
computers could break existing public/private key encryption schemes seems increasingly likely, while
quantum key distribution already provides a tamper-evident mechanism for transmitting keys over
distances of a few hundred kilometers.
If quantum computing realizes its theoretical potential, it would have enormous impact on all areas of
computing—databases included. There are also some database-specific quantum computing proposals:
•	Quantum transactions: Inspired by the concept of superposition, it’s proposed that data in a database could be kept in a “quantum” state, effectively representing multiple possible outcomes. The multiple states collapse into a definite outcome when “observed.” For example, seat allocations in an aircraft could be represented as the sum of all possible seating arrangements, which “collapse” when final seat assignments are made; the collapse could be mediated by various seating preferences: requests to sit together or for aisle or window seats. This approach leverages quantum concepts but does not require a quantum computing infrastructure, though a quantum computer could enable such operations on a potentially massive scale.2
•	Quantum search: A quantum computer could potentially accelerate search over a traditional database, more rapidly executing a full-table scan and finding matching rows for a complex non-indexed search term.3 The improvement is unlikely to be decisive when traditional disk access is the limiting factor, but for in-memory databases, it’s possible that quantum database search may become a practical innovation.
•	A quantum query language: The fundamental unit of processing in a classical (that is, non-quantum) computer is the bit, which represents one of two binary states. In a quantum computer, the fundamental unit of processing is the qubit, which represents the superposition of all possible states of a bit. To persistently store the information from a quantum computer would require a truly quantum-enabled database capable of executing logical operations using qubit logic rather than Boolean bit logic. Operations on such a database would require a new language that could represent quantum operations instead of the relational predicates of SQL. Such a language has been proposed: Quantum Query Language (QQL).4
Promises of practical quantum computing have been made for several decades now, but to date
concrete applications have been notably absent. It’s quite possible that true quantum computing will
turn out to be unattainable or will prove impractical for mainstream applications. But if quantum
computing achieves even some of its ambitions, it will change the computing landscape dramatically, and
databases will not be immune from the repercussions.
Conclusion
In this book I’ve tried to outline the market and technology forces that “broke” the decades-old dominance
of the RDBMS. In Chapter 1, I argued that there was no one-size-fits-all architecture that could meet all the
needs of diverse modern applications. I still believe that to be true.
I’d also like to reiterate that I believe the modern RDBMS—essentially a combination of relational
model, ACID transactions, and the SQL language—represents an absolute triumph of software engineering.
The RDBMS revolutionized computing, database management, and application development. I believe it is
apparent that the RDBMS remains the best single choice for the widest variety of application workloads, and
that it has stood, and will continue to stand, the test of time.
That having been said, it is also clear that the RDBMS architecture that was established in the 1990s
does not form a universal solution for the complete variety of applications being built today. Even if it could,
it is clear that in the world we live in, developers, application architects, CTOs, and CIOs are increasingly
looking to nonrelational alternatives. It behooves us as database professionals to understand the driving
forces behind these nonrelational alternatives, to understand their architectures, and to be in a position to
advise on, manage, and exploit these systems.
As I have learned more about the technologies underpinning modern next-generation database
architectures, I have come to believe that the Cambrian explosion of database diversity we’ve seen in the last
10 years would—or at least should—be followed by a period of convergence. There are too many choices
confronting database users, and too many compromises required when selecting the best database for
a given application workload. We shouldn’t have to sacrifice multi-row transactions in order to achieve
network partition tolerance, nor should we have to throw out the relational baby with the bathwater when
we want to allow an application to dynamically evolve schema elements. Thus, I hope for a future in which
these choices become a matter of configuration within a single database architecture rather than a Hobson’s
choice between multiple, not quite right systems.
Clear signs of convergence are around us: the universal acceptance of JSON into relational systems,
the widespread readoption of SQL as the standard language for database access, and the introduction of
lightweight transactions into otherwise nontransactional systems.
In this chapter I’ve tried to outline some of the options that an ideal database system should provide
in order to reach this Nirvana. We are clearly many years away from being able to deliver such a converged
system. But that doesn’t stop a database veteran from dreaming!
I feel cautiously confident that convergence will be the dominant force shaping databases over the
next few years. However, there are technologies on the horizon that could force paradigm shifts: quantum
computing, blockchain, and universal memory, to mention a few.