
Masaryk University
Faculty of Informatics
Master's Thesis
Database management
as a cloud-based service
for small and medium
organizations
Student:
Dime Dimovski
Brno, 2013
Statement
I declare that I have worked on this thesis independently, using only the sources listed in the
bibliography. All resources, sources, and literature that I used or drew upon in preparing this thesis
are properly quoted, with the full reference to the source stated.
Dime Dimovski
Resume
The goal of this thesis is to explore cloud computing, mainly focusing on database
management systems offered as a cloud service. It reviews some of the currently available
SQL and NOSQL database management systems delivered as a cloud service, along with the
advantages and disadvantages of cloud computing in general and the common
considerations.
Keywords
Cloud computing, SaaS, PaaS, Database management, SQL, NOSQL, DBaaS, Database.com, SQL
Azure, Amazon Web Services, SimpleDB, DynamoDB, Google SQL, MongoDB, CouchDB, Google
Datastore.
Contents

1. Introduction
2. Introduction to Cloud Computing
   2.1 Cloud computing – definition
   2.2 Cloud Types
      2.2.1 NIST model
   2.3 Cloud computing architecture
      2.3.1 Infrastructure
      2.3.2 Platform
      2.3.3 Application Platform as a Service (APaaS) or Virtual appliances
      2.3.4 Application
3. Scalability
4. Elasticity
5. Database Management Systems in the cloud (Database as a service)
6. Database.com
   6.1 Database.com Architecture
   6.2 Multitenant data model
   6.3 Multitenant indexes
   6.4 Multitenant relationships
   6.5 Multitenant field history
   6.6 Partitioning of metadata, data, and index data
   6.7 Application development
   6.8 Data Access
   6.9 Query languages
   6.10 Multitenant search processing
   6.11 Multitenant isolation and protection
   6.12 Deletes, undeletes
   6.13 Backup
   6.14 Pricing
7. Microsoft's SQL Azure
   7.1 Subscriptions
   7.2 Databases
   7.3 Security and Access to a SQL Azure Database
   7.4 SQL Azure architecture
   7.5 Logical Databases on a SQL Azure Server
   7.6 Network Topology
   7.7 High Availability with SQL Azure
   7.8 Failure Detection
   7.9 Reconfiguration
   7.10 Availability Guarantees
   7.11 Scalability with SQL Azure
   7.12 Throttling
   7.13 Load Balancer
   7.14 SQL Azure Management
   7.15 Pricing in SQL Azure
8. Amazon Web Services
   8.1 Amazon Relational Database Service (Amazon RDS)
   8.2 Amazon RDS Architecture/Features
   8.3 Scalability with Amazon RDS
   8.4 High Availability
   8.5 Pricing
9. Google Cloud SQL
   9.1 Pricing
10. Summary of RDBMSaaS and common considerations
11. NOSQL
12. Amazon SimpleDB and DynamoDB
   12.1 Dynamo History
   12.2 Amazon DynamoDB Data Model
   12.3 Amazon DynamoDB Features
   12.4 Amazon SimpleDB
   12.5 Pricing
13. Google Datastore
   13.1 Datastore Data Model
   13.2 Queries and indexes
   13.3 Transactions
   13.4 Scalability
   13.5 High Availability
   13.6 Data Access
   13.7 Quotas and Limits
14. MongoLab/MongoDB and Cloudant/Apache CouchDB
   14.1 Document-oriented databases
   14.2 MongoDB and CouchDB comparison
   14.3 MVCC – Multi-Version Concurrency Control
   14.4 Scalability
   14.5 Querying
   14.6 Atomicity and Durability
   14.7 Map Reduce
   14.8 Javascript
   14.9 REST
   14.10 MongoLab and Cloudant
15. What benefits do cloud databases and cloud computing bring for small and medium organizations?
   15.1 Advantages for Small Business
   15.2 Disadvantages of Cloud Computing
   15.3 Main things to be considered when moving to the cloud
16. Will cloud computing reduce the budget?
17. Conclusion
Appendix
   Case studies from the industry – Amazon RDS
   Case studies from the industry – Microsoft SQL Azure
   Case studies from the industry – Amazon DynamoDB
   Case studies from the industry – Amazon SimpleDB
References
1. Introduction
The boom of cloud computing over the past few years has made it a common foundation for many
innovations and new technologies. It has become common for enterprises and individuals to use the
services offered in the cloud and to recognize that cloud computing is a big deal, even if they are not
entirely clear why that is so. The phrase "in the cloud" has even entered our colloquial language. A huge
percentage of the world's developers are currently working on "cloud-related" products. The cloud has
thus become the amorphous entity that is supposed to represent the future of modern computing.

In an attempt to gain a competitive edge, businesses are looking for innovative new ways to cut costs
while maximizing value. They recognize the need to grow, but at the same time they are under pressure
to save money. The cloud gives businesses this opportunity, allowing them to focus on their core
business by offering hardware and software solutions they do not have to develop on their own.

In this thesis I give an overview of what cloud computing is. I describe its main concepts and
architecture, and take a look at the XaaS (something/everything as a service) paradigm and the options
currently available in the cloud, mostly focusing on databases in the cloud, or Database as a Service. I
take a closer look at how cloud computing in general, and Database as a Service in particular, can be
used by small and medium enterprises, what the main benefits are, and whether it will really help
businesses reduce their budget and focus on their core business.
2. Introduction to Cloud Computing
In reality the cloud is something we have been using for a long time: it is the Internet, with all the
standards and protocols that provide Web services to us. The Internet is usually drawn as a cloud, and
this represents one of the essential characteristics of cloud computing: abstraction. Cloud computing
refers to applications and services that run on a distributed network using virtualized resources and
are accessed by common Internet protocols and networking standards. It is distinguished by the
notion that resources are virtual and limitless and that the details of the physical systems on which
software runs are abstracted from the user.[1]
One of the main forces driving cloud computing is the recent advancement in wireless speed
and connectivity. Without these in place, cloud computing would not be practical or even possible. In
many ways, cloud computing was, and is, an eventuality. The influence of telecommunications
organizations and their push toward simplifying and miniaturizing virtually every electronic device
used by mobile users is accelerating cloud computing even further. This represents a major
breakthrough not only in computing but also in communication.
Cloud computing represents a real paradigm shift in the way in which systems are deployed. The
massive scale of cloud computing systems was enabled by the popularization of the Internet and the
growth of some large service companies.[1]
Cloud computing has been compared to the standard utility companies, and it does bear a striking
resemblance to these institutions. Just like water, electricity, or gas, cloud computing makes the
long-held dream of utility computing possible with a pay-as-you-go, infinitely scalable, universally
available system. In other words, the 'goods' come from one central location; we are just turning things
off and on. This may ultimately give more people access to a larger pool of resources at a greatly
reduced cost. One of the biggest benefits of cloud computing is its ability to offer users access to
off-site hardware and software. With cloud computing, the resources of the cloud itself are at your
disposal. This means all the hardware, software, processors, and networks combine to give individuals
much more computing power than has ever been possible. This will change nearly every facet of
information exchange and influence everything from social networking to web development. By
keeping things light and simple, individual access devices are going to last a lot longer and become
much more durable. And of course, losing or breaking a device is no longer of any particular concern,
as it is easily replaced, and there is no danger of losing your files or information either.
With cloud computing, you can start very small and become big very fast. That's why cloud computing
is revolutionary, even if the technology it is built on is evolutionary.
2.1 Cloud computing – definition
The use of the word "cloud" makes reference to two essential concepts:

• Abstraction
• Virtualization
Abstraction
Cloud computing is abstracting the details of the system implementation from the users and the
developers. Applications run on physical systems that aren't specified, data is stored in locations that
are unknown, administration of systems is outsourced to others, and access by users is ubiquitous.[1]
Virtualization
Cloud computing virtualizes systems by pooling and sharing resources. Systems and storage can be
provisioned as needed from a centralized infrastructure, costs are assessed on a metered basis, multitenancy is enabled, and resources are scalable with agility.
Cloud computing is an abstraction based on the notion of pooling physical resources and presenting
them as a virtual resource. It is a new model for provisioning resources, for staging applications, and
for platform-independent user access to services. Clouds can come in many different types, and the
services and applications that run on clouds may or may not be delivered by a cloud service provider.
2.2 Cloud Types
Cloud computing is usually separated into two distinct sets of models:

• Deployment models – refer to the location and management of the cloud's infrastructure.
• Service models – the particular types of services that can be accessed on a cloud computing platform.
2.2.1 NIST model
The NIST model is a set of working definitions published by the U.S. National Institute of Standards and
Technology. This cloud model is composed of five essential characteristics, three service models, and
four deployment models.[2]
Essential Characteristics:

• On-demand self-service – A consumer can unilaterally provision computing capabilities, such
as server time and network storage, as needed automatically without requiring human
interaction with each service provider.
• Broad network access – Capabilities are available over the network and accessed through
standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g.,
mobile phones, tablets, laptops, and workstations).
• Resource pooling – The provider's computing resources are pooled to serve multiple
consumers using a multi-tenant model, with different physical and virtual resources
dynamically assigned and reassigned according to consumer demand. There is a sense of
location independence in that the customer generally has no control or knowledge over the
exact location of the provided resources but may be able to specify location at a higher level
of abstraction (e.g., country, state, or datacenter). Examples of resources include storage,
processing, memory, and network bandwidth.
• Rapid elasticity – Capabilities can be elastically provisioned and released, in some cases
automatically, to scale rapidly outward and inward commensurate with demand. To the
consumer, the capabilities available for provisioning often appear to be unlimited and can be
appropriated in any quantity at any time.
• Measured service – Cloud systems automatically control and optimize resource use by
leveraging a metering capability at some level of abstraction appropriate to the type of service
(e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be
monitored, controlled, and reported, providing transparency for both the provider and
consumer of the utilized service.
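The pay-as-you-go, measured-service idea can be made concrete with a small sketch. The resource names and unit prices below are invented for illustration; they are not any provider's actual rates.

```python
# Sketch of measured service: usage is metered per resource and the
# consumer is billed only for what was actually consumed this period.
# Prices and units are made up for illustration.

PRICE_PER_UNIT = {
    "storage_gb_month": 0.10,
    "compute_hours": 0.05,
    "bandwidth_gb": 0.08,
}

def metered_bill(usage):
    """Compute a bill from metered usage; unmetered resources are rejected."""
    total = 0.0
    for resource, amount in usage.items():
        if resource not in PRICE_PER_UNIT:
            raise ValueError(f"unmetered resource: {resource}")
        total += PRICE_PER_UNIT[resource] * amount
    return round(total, 2)

print(metered_bill({"storage_gb_month": 50, "compute_hours": 200, "bandwidth_gb": 10}))
```

The same metering record that produces the bill also gives both parties the transparency the definition calls for: provider and consumer see identical usage numbers.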
Service Models:

• Software as a Service (SaaS) – The capability provided to the consumer is to use the provider's
applications running on a cloud infrastructure. The applications are accessible from various
client devices through either a thin client interface, such as a web browser (e.g., web-based
email), or a program interface. The consumer does not manage or control the underlying
cloud infrastructure including network, servers, operating systems, storage, or even individual
application capabilities, with the possible exception of limited user-specific application
configuration settings.
• Platform as a Service (PaaS) – The capability provided to the consumer is to deploy onto the
cloud infrastructure consumer-created or acquired applications created using programming
languages, libraries, services, and tools supported by the provider. The consumer does not
manage or control the underlying cloud infrastructure including network, servers, operating
systems, or storage, but has control over the deployed applications and possibly configuration
settings for the application-hosting environment.
• Infrastructure as a Service (IaaS) – The capability provided to the consumer is to provision
processing, storage, networks, and other fundamental computing resources where the
consumer is able to deploy and run arbitrary software, which can include operating systems
and applications. The consumer does not manage or control the underlying cloud
infrastructure but has control over operating systems, storage, and deployed applications; and
possibly limited control of select networking components (e.g., host firewalls).
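One way to see the difference between the three service models is to ask who manages which layer of the stack. The sketch below encodes the split described in the definitions above; the layer names are simplified labels chosen for this illustration, not NIST terminology.

```python
# Which layers the *consumer* manages under each NIST service model.
# Everything else falls to the provider.  Layer names are simplified.

STACK = ["networking", "servers", "os", "storage", "runtime", "application", "data"]

CONSUMER_MANAGED = {
    "SaaS": set(),                                   # provider manages everything
    "PaaS": {"application", "data"},                 # consumer deploys apps only
    "IaaS": {"os", "storage", "runtime", "application", "data"},
}

def provider_managed(model):
    """Layers the provider is responsible for under the given model."""
    return [layer for layer in STACK if layer not in CONSUMER_MANAGED[model]]

print(provider_managed("PaaS"))
```

Reading the output for each model top to bottom reproduces the "does not manage or control" clauses of the definitions: the provider's share shrinks as you move from SaaS to IaaS.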
Deployment Models:

• Private cloud – The cloud infrastructure is provisioned for exclusive use by a single
organization comprising multiple consumers (e.g., business units). It may be owned, managed,
and operated by the organization, a third party, or some combination of them, and it may
exist on or off premises.
• Community cloud – The cloud infrastructure is provisioned for exclusive use by a specific
community of consumers from organizations that have shared concerns (e.g., mission,
security requirements, policy, and compliance considerations). It may be owned, managed,
and operated by one or more of the organizations in the community, a third party, or some
combination of them, and it may exist on or off premises.
• Public cloud – The cloud infrastructure is provisioned for open use by the general public,
typically available via the Internet. It may be owned, managed, and operated by a business,
academic, or government organization, or some combination of them. It exists on the
premises of the cloud provider. Examples of public clouds: Google App Engine, Amazon Elastic
Compute Cloud, Microsoft Azure.
• Hybrid cloud – The cloud infrastructure is a composition of two or more distinct cloud
infrastructures (private, community, or public) that remain unique entities but are bound
together by standardized or proprietary technology that enables data and application
portability (e.g., cloud bursting for load balancing between clouds).[2]
2.3 Cloud computing architecture
Cloud computing is essentially a series of levels that function together in various ways to create a
system. This system is also referred to as cloud computing architecture. The cloud creates a system
where resources can be pooled and partitioned as needed. Cloud architecture can couple software
running on virtualized hardware in multiple locations to provide an on-demand service to user-facing
hardware and software. A cloud can be created within an organization's own infrastructure or
outsourced to another datacenter. Usually resources in a cloud are virtualized resources because
virtualized resources are easier to modify and optimize. A compute cloud requires virtualized storage
to support the staging and storage of data. From a user's perspective, it is important that the
resources appear to be infinitely scalable, that the service be measurable, and that the pricing be
metered.[1]
Figure 1 Cloud computing stack
Applications in the cloud are usually composable systems; that is, they use standard
components to assemble services tailored for a specific purpose. A composable component
must be:

• Modular: a self-contained and independent unit that is cooperative, reusable, and
replaceable.
• Stateless: a transaction is executed without regard to other transactions or requests.
In general, cloud computing does not require hardware and software to be composable, but it is a
highly desirable characteristic. It makes system design easier to implement, and solutions are more
portable and interoperable.
Some of the benefits of composable systems are:

• Easier to assemble systems
• Cheaper system development
• More reliable operation
• A larger pool of qualified developers
• A logical design methodology
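The modular and stateless properties listed above can be sketched in a few lines. In the toy component below, every call is handled without reference to any earlier call, so any replica of the component can serve any request; the request shape is invented for illustration.

```python
# A stateless component: the result depends only on the request itself,
# never on earlier requests or shared mutable state, so instances are
# interchangeable and can be replaced or replicated freely.

def handle(request):
    """Process one request in isolation; nothing is remembered between calls."""
    return {"user": request["user"], "total": sum(request["items"])}

# Identical requests give identical answers regardless of ordering and
# regardless of which replica handled any earlier request.
r = {"user": "alice", "items": [2, 3, 5]}
print(handle(r) == handle(r))
```

Statelessness is what lets a cloud scheduler route each transaction to whichever instance happens to be free, which is the operational payoff of the composability described above.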
The trend toward designing composable systems in cloud computing is visible in the widespread
adoption of what has come to be called Service-Oriented Architecture (SOA). The essence of a
service-oriented design is that services are constructed from a set of modules using standard
communications and service interfaces. One widely used set of standards describes the services
themselves in terms of the Web Services Description Language (WSDL), data exchange between
services using some form of XML, and communication between services using the SOAP protocol.
There are, of course, alternative sets of standards.[1]
What is not specified is the nature of the module itself; it can be written in any programming language
the developer wants. From the standpoint of the system, the module is a black box, and only the
interface is well specified. This independence of the internal workings of the module or component
means it can be swapped out for a different module, relocated, or replaced at will, provided that the
interface specification remains unchanged. That is a powerful benefit to any system or application
provider as their products evolve.
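The "black box behind a fixed interface" idea can be sketched as follows: two implementations satisfy the same interface, so a caller is unaffected when one is swapped for the other. The class and method names are invented for this illustration.

```python
# The service interface (here, a greet() method) is fixed; the module
# behind it is a black box and can be swapped at will, as long as the
# interface specification is unchanged.

class XmlGreeter:
    def greet(self, name):            # the agreed interface
        return f"<greeting>Hello, {name}</greeting>"

class PlainGreeter:
    def greet(self, name):            # same interface, different internals
        return f"Hello, {name}"

def client(service, name):
    """The client depends only on the interface, never the implementation."""
    return service.greet(name)

print(client(PlainGreeter(), "world"))
```

Either greeter can be handed to `client` without changing a line of client code, which is precisely the substitutability that makes SOA modules evolvable.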
Essentially there are three tiers in a basic cloud computing architecture:

• Infrastructure
• Platform
• Application
If we break the standard cloud computing architecture down further, there are really two areas to deal
with: the front end and the back end.
Front End - The front end includes all client (user) devices and hardware in addition to their computer
network and the application that they actually use to make a connection with the cloud.
Back End - The back end is populated with the various servers, data storage devices and hardware that
facilitate the functionality of a cloud computing network.
2.3.1 Infrastructure
The infrastructure of a cloud computing architecture is essentially all the hardware, data storage devices
(including virtualized hardware), networking equipment, applications, and software that operate and
drive the cloud.

Most Infrastructure as a Service (IaaS) providers use virtual machines to deliver servers that run
applications. Virtual machine images, or instances, are containers that have specific resources
assigned to them (number of CPU cycles, memory access, network bandwidth, etc.).

Figure 2 shows the cloud computing stack that is defined as the server. The Virtual Machine Monitor,
also called a hypervisor, is the low-level software that allows different operating systems to run in their
own memory space and manages I/O for the virtual machines.[1]
Figure 2 "Server" stack
2.3.2 Platform
A cloud computing platform is the actual programming, code, and implemented systems of interfacing
that help user-level devices (and applications) connect with the hardware and software resources of
the cloud. It is a software layer used to create higher-level services.

A cloud computing platform is generally divided between the front end and the back end of a network.
Its job is to provide a communication and access portal for clients so that they may effectively
utilize the resources of the cloud network. The platform may only be a set of directions, but it is in
actuality the most integral part of a cloud computing network; without it, cloud computing would not
be possible.
There are many different Platform as a Service (PaaS) providers; we will mention some of them:

• Salesforce.com's Force.com and Database.com platforms
• Windows Azure Platform
• Google Apps and Google App Engine
• Amazon Web Services
All platform services offer the hosted hardware and software needed to build and deploy Web
applications or services custom built by developers.
It makes sense for operating system vendors to move their development environments into the cloud
with the same technologies that have been successfully used to create Web applications. Thus, you
might find a platform based on an Oracle xVM hypervisor virtual machine that includes
a NetBeans Integrated Development Environment (IDE) and supports the Oracle GlassFish
Web stack, programmable using Perl or Ruby. For Windows, Microsoft would be similarly interested in
providing a platform that allows Windows developers to run on a Hyper-V VM, use the ASP.NET
application framework, support one of its enterprise applications such as SQL Server, and be
programmable within Visual Studio, which is essentially what the Azure Platform does. This approach
allows someone to develop a program in the cloud that can be used by others.
Platforms often come with tools and utilities to aid application design and deployment. Depending
on the vendor, these can include tools for team collaboration, testing tools, versioning tools, database
and web service integration, and storage tools. Platform providers begin by creating a developer
community to support the work done in the environment.

A platform is exposed to users through an API; likewise, an application built in the cloud using a
platform service would encapsulate the service through its own API. An API can control data flow,
communications, and other important aspects of the cloud application. To date there is no
standard API, and each cloud vendor has its own.
2.3.3 Application Platform as a Service (APaaS) or Virtual appliances
A virtual appliance is software that installs as middleware onto a virtual machine. These are usually Web servers, database servers, BPM engines, ESBs, messaging portals, and similar components running on a virtual machine image. This model, referred to by some as Application Platform as a Service, is more or less a horizontal extension of the PaaS offerings.
APaaS is a type of service model that gives cloud software developers the power to actually do their
jobs. This gives an opportunity to use APaaS/virtual appliances to build more complex services.
Within the APaaS system, the actual software architectures of applications are built and established. It
is also within this layer that overall portability (and the ability of an application to function alongside a
bevy of other cloud applications as well as operating systems) is established. Since most of the actual
developmental breakthroughs (both in terms of software and overall cloud usability) occur within the
realms of the middleware (PaaS, APaaS), it makes sense that a great deal of attention is paid to it. [3]
For example, Amazon Web Services offers more than 700 different virtual machine images preconfigured with enterprise applications like Oracle BPM and SQL Server, and even complete application stacks such as LAMP (Linux, Apache, MySQL, and PHP), which are used to create virtual machines within the Amazon Elastic Compute Cloud (EC2). Such an image serves as the basic unit of deployment for services delivered using EC2.
APaaS gives software developers a solid platform to stand on, with its own impressive workbench of tools, while they are constructing and envisioning new possibilities.
The true benefit from APaaS however is its ability to provide accurate feedback regarding the
functionality and compatibility of applications that are still under development. This is extremely
important to software developers, who can take serious losses (in terms of both money and time
spent) if they produce an application that simply won’t function in an environment, behave as
expected once deployed, or function in a compatible manner with other elements in a cloud
infrastructure. Companies that want to run their IT and/or software development projects through an APaaS need only pay subscription fees rather than licensing fees. A subscription is substantially cheaper than a license and offers additional benefits when paired with a cloud APaaS. Most APaaS packages that are put together for designers are often much easier to use than most standardized
design tools. These packages often allow software development teams to integrate and share their
work more smoothly as well as run the project from start to finish much faster than with other
systems.[3]
The global emergence of APaaS will no doubt lead to the creation of a number of companies that will
utilize the tools of APaaS to create their own business model, especially one that seeks to provide yet
another proprietary service aimed at delivering timely solutions to business software issues. One
particular area that could use the help is enterprise software, for example. Enterprise software is
often hard to manage, difficult to customize and frequently falls short in its functionalities. When you
couple these shortcomings with the fact that it is often quite expensive, there is a serious problem. An
obvious solution for dealing with enterprise software problems would be the deployment of an
APaaS-style service. Not only would this greatly increase the overall functionality of expensive
enterprise business software, but it would also allow for a great range of customization, as well as the
option for integrating it with other cloud services and/or networking opportunities. APaaS was
created to make the lives of software designers, developers and investors much easier. It is through
the use of APaaS that many excellent next generation apps have been developed and many experts in
the field of cloud computing agree that it is APaaS that will produce some of the upcoming “game
changing” applications that will actually shape the future of cloud computing in general.
2.3.4 Application
This area is comprised of the client hardware and the interface used to connect to the cloud. Big problems arise from the design of Internet protocols, which treat each request to a server as an independent transaction (stateless service) [1]. The standard HTTP commands are all atomic in nature. While stateless servers are easier to architect and stateless transactions are more resilient and can survive outages, much of the useful work that computer systems need to accomplish is stateful. Transaction servers, message-queuing servers, and similar middleware are meant to bridge this problem. Standard Service-Oriented Architecture methods that help solve this issue and that are used in cloud computing are:
- Orchestration – process flow can be choreographed as a service
- Use of a service bus that controls cloud components
There are many ways for clients to connect to a cloud service. The most common are:
- Web browser
- Proprietary application
These applications can run on a number of different devices: PCs, servers, smartphones, and tablets. They all need a secure way to communicate with the cloud. Some of the basic methods to secure the connection are:
- A secure protocol such as SSL/TLS (HTTPS), FTPS, IPSec, or SSH
- A virtual connection using a virtual private network (VPN)
- Remote data transfer such as Microsoft RDP or Citrix ICA, which use a tunneling mechanism
- Data encryption
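As a minimal illustration of the first method, a client can wrap its connection in TLS using Python's standard library. This is a sketch only; the hostname in the commented-out connection code is a placeholder, not a real service:

```python
import ssl

# Create a client-side TLS context. The library default enables both
# certificate verification and hostname checking, which is what makes
# the channel trustworthy.
context = ssl.create_default_context()

assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname is True

# Wrapping a TCP socket would then look like this (not executed here;
# "cloud.example.com" is a placeholder host):
# with socket.create_connection(("cloud.example.com", 443)) as sock:
#     with context.wrap_socket(sock, server_hostname="cloud.example.com") as tls:
#         tls.sendall(b"GET / HTTP/1.1\r\nHost: cloud.example.com\r\n\r\n")
```

Higher-level methods such as VPNs and RDP/ICA tunneling wrap traffic in a similar encrypted channel, but at the network or session layer rather than per-socket.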
3. Scalability
Scalability is the ability of a system to handle a growing amount of work in a capable manner, or its ability to improve when additional resources are added.
The scalability requirement arises from the constant load fluctuations that are common in the context of Web-based services. These load fluctuations occur at varying frequencies: daily, weekly, and over longer periods. The other source of load variation is unpredictable growth (or decline) in usage. Scalable design ensures that the system capacity can be augmented by adding hardware resources whenever warranted by load fluctuations. Thus, scalability has emerged both as a critical requirement and as a fundamental challenge in the context of cloud computing. [1][4]
Typically there are two ways to increase scalability:
- Vertical scalability (scaling up) – adding hardware resources to an existing node, usually more CPU, memory, etc. Scaling up enables virtualization technologies to be used more effectively by providing more resources for the hosted operating systems and applications to share.
- Horizontal scalability (scaling out) – adding more nodes to a system, such as adding a new node to a distributed software application or adding more access points within the current system. Hundreds of small computers may be configured in a cluster to obtain aggregate computing power. The scale-out model also creates an increased demand for shared data storage with very high I/O performance, especially where processing of large amounts of data is required. In general, the scale-out paradigm has served as the fundamental design paradigm for the large-scale data centers of today.
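Horizontal scaling can be sketched as partitioning work across interchangeable nodes, for example by hashing keys. This is a toy model of the scale-out idea, not any specific product; the node names and key format are invented:

```python
import hashlib

def node_for(key, nodes):
    """Map a key to one of the available nodes. Adding nodes spreads the
    keys over more machines (at the cost of remapping some keys)."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-a", "node-b"]
owner = node_for("user:42", nodes)

# Scaling out simply means extending the node list.
scaled = node_for("user:42", nodes + ["node-c"])
```

Real systems use more refined schemes (e.g. consistent hashing) so that adding a node remaps as few keys as possible, but the principle is the same: capacity grows by adding machines, not by enlarging one machine.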
Integrating multiple load balancers into your system is probably the best solution for dealing with scalability issues. There are many different forms of load balancers to choose from: server farms, software, and even hardware designed to handle and distribute increased traffic. Items that interfere with scalability [3]:
- Too much software clutter (no organization) within the hardware stack(s)
- Overuse of third-party scaling
- Reliance on the use of synchronous calls
- Not enough caching
- Database not being used properly
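A minimal sketch of the load-balancing idea, assuming a simple round-robin policy over interchangeable backend servers (the backend names are placeholders; production balancers also perform health checks and track per-node load):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distributes incoming requests evenly across a pool of backends."""

    def __init__(self, backends):
        self._pool = cycle(backends)

    def route(self, request):
        # Pick the next backend in circular order.
        backend = next(self._pool)
        return backend, request

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
targets = [balancer.route(f"req-{i}")[0] for i in range(6)]
# Six requests are spread evenly: each backend receives exactly two.
```

The same routing layer is what makes scaling out transparent to clients: new backends are added to the pool without changing the client-facing address.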
Creating a cloud network that offers the maximum level of scalability is entirely possible if we apply a more “diagonal” solution. By incorporating the best solutions present in both vertical and horizontal scaling, it is possible to reap the benefits of both models [3]. Once the servers reach the limit of diminishing returns (no further growth), we simply start cloning them. This allows us to keep a consistent architecture when adding new components, software, apps, and users. For most organizations, problems arise from a lack of resources, not from the inherent architecture of the cloud itself. A more diagonal approach should help the business deal with the current and growing demands it faces.
4. Elasticity
Of all the attributes possessed by cloud computing, the most important is certainly its elasticity: the ability to amplify and instantly upgrade resources and/or capacities at a moment's notice. Storage, processing, and the scalability of applications are all elastic in the cloud. The really remarkable thing about cloud computing is the real-time infrastructure that actively responds to user requests for resources. Without the real-time monitoring and support behind this elasticity, the effectiveness,
adaptability and muscle of cloud computing would be greatly undermined. It is this elastic ability that
the service providers possess which allows them to offer their users access to cloud computing
services at such reduced costs. Since users only pay for what they use, they can save money. For example, with a traditional grid computing network, every user has their own intensive hardware setup, of which most users rarely use more than 50% of the capacity. In the cloud, their combined resource usage might be 20-30% of the total resources available on the central cloud computing hardware stack.
What cloud computing is really offering is the ability for average users to retain their current
standards and expectations, while leaving the door open for instant expansion opportunities if they
desire it. This also gives a much more efficient way to use energy.
Elasticity offers the same computing experience to which we are accustomed, with the added benefit of near-limitless resources, while at the same time offering a way to manage energy consumption. [1][3]
The elastic capabilities offered by cloud computing make it perfectly suited to handling certain activities or processes:
- Establishing an “in office” communication and online networking infrastructure for employees. Setting up a system that gives those in the organization a cleaner and more efficient way of communicating and working often leads to greatly increased profits.
- Using cloud computing to handle overdrafting, i.e. high-volume data transfer periods and events. Some businesses only use cloud computing when they run out of their own resources, or anticipate that they might lack needed functionality. This can be scheduled on an annual or bi-annual basis, for example to meet a seasonal demand for a particular product.
- Assigning all customer data and transaction information to a cloud computing element. This allows an organization to keep its customers’ data safe even from its own employees. Utilizing a third party to handle all customer data can also pay off in the event of a catastrophe. Cloud computing providers tend to keep your information more securely backed up than most are even aware of. [3]
In other words, elasticity allows both user and provider to “do more with less”.
5. Database Management Systems in the cloud (Database as a
service)
Data and database management are an integral part of a wide variety of applications. Relational DBMSs in particular have been used massively due to the many features that they offer:
- Overall functionality, offering an intuitive and relatively simple model for modeling different types of applications
- Consistency, dealing with concurrent workloads without worrying about the data getting out of sync
- Performance, low latency and high throughput combined with many years of engineering and development
- Reliability, persistence of data in the presence of different types of failures and ensuring safety
The main concern is that traditional DBMSs and RDBMSs are not cloud-friendly: they are not as scalable as web servers and application servers, which can scale from a few machines to hundreds. Traditional DBMSs are not designed to run on top of a shared-nothing architecture (where a set of independent machines accomplishes a task with minimal resource overlap), and they do not provide the tools needed to scale out from a few to a large number of machines.
Technology leaders such as Google, Amazon, and Microsoft have demonstrated that data centers comprising thousands to hundreds of thousands of compute nodes provide unprecedented economies of scale, since multiple applications can share a common infrastructure. All three companies provide frameworks such as Amazon's AWS, Google's AppEngine, and Microsoft Azure for hosting third-party applications in their clouds (data-center infrastructures).
The RDBMSs or “transactional data management” databases that back banking, airline reservation, online e-commerce, and supply-chain management applications typically rely on the ACID (Atomicity, Consistency, Isolation, Durability) guarantees that databases provide. Because it is hard to maintain ACID guarantees in the face of data replication over large geographic distances1, these companies have developed proprietary data management technologies referred to as key-value stores, informally called NoSQL database management systems. [6] The need for web-based applications to support a virtually unlimited number of users and to respond to sudden load fluctuations raises the requirement to make them scalable on cloud computing platforms. Such scalability must be provisioned dynamically without causing any interruption in the service. Key-value stores and other NoSQL database solutions, such as the Google Datastore offered with Google AppEngine, Amazon SimpleDB and DynamoDB, MongoDB, and others, have been designed so that they can be elastic, i.e. dynamically provisioned in the presence of load fluctuations. We will explain some of these systems in more detail later on.
1 The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.
As we move to the cloud-computing arena which typically comprises data-centers with thousands of
servers, the manual approach of database administration is no longer feasible. Instead, there is a
growing need to make the underlying data management layer autonomic or self-managing especially
when it comes to load redistribution, scalability, and elasticity. [7]
Figure 3 Traditional vs. Cloud Data Services
This issue becomes especially acute in the context of pay-per-use cloud-computing platforms hosting multi-tenant applications. In this model, the service provider is interested in minimizing its operational cost by consolidating multiple tenants on as few machines as possible during periods of low activity and distributing these tenants across a larger number of servers during peak usage [7]. Due to the above
desirable properties of key-value stores in the context of cloud computing and large-scale data-centers,
they are being widely used as the data management tier for cloud-enabled Web applications. Although
it is claimed that atomicity at a single key is adequate in the context of many Web-oriented
applications, evidence is emerging that indicates that in many application scenarios this is not enough.
In such cases, the responsibility to ensure atomicity and consistency of multiple data entities falls on
the application developers. This results in the duplication of multi-entity synchronization mechanisms
many times in the application software. In addition, as it is widely recognized that concurrent programs
are highly vulnerable to subtle bugs and errors, this approach impacts the application reliability
adversely. The need to provide atomicity beyond single entities is widely discussed in developer
blogs. Recently, this problem has also been recognized by the senior architects from Amazon and
Google, leading to systems like MegaStore [10] that provide transactional guarantees on key-value
stores.
Both RDBMS and NoSQL DBMS offerings in the cloud will be explained in more detail: how they work, who offers them, and how they are provisioned.
I will first focus on relational databases offered in the cloud, starting with one of the first enterprise databases built for the cloud, Salesforce's Database.com.
6. Database.com
Database.com is a database management system built for cloud computing, with multitenancy inherent in its design. Traditional RDBMSs were designed to support on-premises deployments for a single organization. All core mechanisms, such as the system catalog, caching mechanisms, and the query optimizer, are built to support single-tenant applications and to run directly on a specifically tuned host operating system and hardware. The only way to build a multi-tenant cloud database service with a standard RDBMS is to use virtualization. Unfortunately, the extra overhead of the hypervisor typically hurts the performance of the RDBMS. Database.com instead combines several different persistence technologies, including a custom-designed relational database schema, which are innately designed for clouds and multitenancy, with no virtualization required.
6.1 Database.com Architecture
Database.com’s core relational database technology uses a runtime engine that materializes all application data from metadata, i.e. data about the data itself. In Database.com’s metadata-driven architecture, there is a clear separation between the compiled runtime database engine (kernel), tenant data, and the metadata that describes each application’s schema. These distinct boundaries make it possible to independently update the system kernel and tenant-specific application schemas.
Figure 4 Database.com Architecture [9]
Every logical database object is internally managed using metadata. Objects (“tables” in traditional relational database parlance), fields, stored procedures, and database triggers are all abstract constructs that exist merely as metadata in Database.com’s Universal Data Dictionary (UDD). The terminology used by Database.com is shown in Table 1.
Relational Database Term      Equivalent Term in Database.com
Database                      Organization
Table                         Object
Column                        Field
Row                           Record
Table 1 Database.com Terminology
When a new application object is defined or some procedural code is written, Database.com does not create an actual table in a database or compile any code; it simply stores metadata that the system’s engine can use to generate the virtual application components at runtime. When modification or customization of the application schema is needed, such as modifying an existing field in an object, all that’s required is a simple non-blocking update to the corresponding metadata [9].
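A toy sketch of this metadata-driven idea, assuming one generic storage table and a metadata dictionary (all names and structures here are invented for illustration and do not reflect Database.com's actual internal schema):

```python
# Metadata describing a tenant's virtual object; defining or changing
# a schema never creates a physical table.
metadata = {
    ("org1", "Contact"): ["Name", "Email"],
}

# One generic storage table holds every tenant's rows as flexible slots.
data_rows = [
    {"org": "org1", "obj": "Contact", "values": ["Ada", "ada@example.com"]},
]

def materialize(org, obj):
    """Combine metadata and raw value slots into virtual records at runtime."""
    fields = metadata[(org, obj)]
    return [dict(zip(fields, r["values"]))
            for r in data_rows
            if r["org"] == org and r["obj"] == obj]

# A "schema change" is just a non-blocking metadata update, not DDL.
metadata[("org1", "Contact")] = ["FullName", "Email"]
records = materialize("org1", "Contact")
```

Because the physical storage never changes shape, renaming a field takes effect immediately for the affected tenant without touching any other tenant's data.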
To avoid performance-sapping disk I/O and code recompilation, and to improve application response times, Database.com uses massive and sophisticated metadata caches that maintain the most recently used metadata in memory. The system runtime engine must be optimized for metadata access, because frequent metadata access would otherwise prevent the service from scaling.
At the heart of Database.com is its transactional database engine. Database.com uses a relational database engine with a specialized schema built for multitenancy. It also employs a search engine (separate from the transaction engine) that optimizes full-text indexing and searches. As applications update data, the search service’s background processes asynchronously update tenant- and user-specific indexes in near real time. This separation of duties between the transaction engine and the search service lets applications process transactions without the overhead of text index updates [9].
6.2 Multitenant data model
Database.com’s storage model manages virtual database structures using a set of metadata, data, and pivot tables, as illustrated in Figure 5.
Figure 5 Multitenant data model of Database.com [9]
When application schemas are created, the UDD keeps track of metadata concerning the objects, their
fields, their relationships, and other object attributes. A few large database tables store the structured and unstructured data for all virtual tables. A set of related multitenant indexes, implemented as simple pivot tables with denormalized data, makes the combined data set extremely functional.
Because Database.com manages object and field definitions as metadata rather than actual database
structures, the system can tolerate online multitenant application schema maintenance activities
without blocking the concurrent activity of other tenants and users [9].
6.3 Multitenant indexes
Database.com automatically indexes various types of fields to deliver scalable performance. Traditional database systems rely on native database indexes to quickly locate specific rows in a database table that have fields matching a specific condition. The index of MT_Data is maintained by synchronously copying field data marked for indexing to an appropriate column in a pivot table called MT_Indexes.
In some circumstances the external search engine may fail to respond to a search request. In such cases Database.com falls back to a secondary search mechanism. A fallback search is implemented as a direct
database query with search conditions that reference the Name field of target records. To optimize
global object searches (searches that span tables) without having to execute potentially expensive
union queries, a pivot table called MT_Fallback_Indexes that records the Name of all records is
maintained. Updates to MT_Fallback_Indexes happen synchronously, as transactions modify records,
so that fall-back searches always have access to the most current database information [9].
6.4 Multitenant relationships
Database.com provides “relationship” datatypes that an organization can use to declare relationships (referential integrity) among tables. When an organization declares an object’s field with a relationship type, the field is mapped to a Value field in MT_Data, which is then used to store the ObjID of a related object [9].
6.5 Multitenant field history
Database.com provides history tracking for any field. When a tenant enables auditing for a specific
field, the system asynchronously records information about the changes made to the field (old and
new values, change date, etc.) using an internal pivot table as an audit trail [9].
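Field-history tracking of this kind can be sketched as an append-only audit log written alongside each change. The record shape and function names below are invented for illustration; Database.com's internal pivot table is not publicly specified at this level of detail:

```python
from datetime import datetime

audit_trail = []  # plays the role of the internal audit pivot table

def update_field(record, field, new_value, now):
    """Record the old and new values plus the change date, then apply."""
    audit_trail.append({
        "field": field,
        "old": record.get(field),
        "new": new_value,
        "changed_at": now,
    })
    record[field] = new_value

rec = {"Status": "Open"}
update_field(rec, "Status", "Closed", datetime(2013, 1, 1))
```

In the real system this write happens asynchronously so that audit logging does not slow down the user's transaction.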
6.6 Partitioning of metadata, data, and index data
All Database.com data, metadata, and pivot table structures, including underlying database indexes,
are physically partitioned by tenant (OrgID) using native database partitioning mechanisms. Data
partitioning is a proven technique that database systems provide to physically divide large logical data
structures into smaller, more manageable pieces. Partitioning can also help to improve the
performance, scalability, and availability of a large database system such as a multitenant environment.
For example, by definition, every Database.com query targets a specific tenant’s information, so the query optimizer need only consider accessing the data partitions that contain that tenant’s data, rather than an entire table or index. This common optimization is sometimes referred to as “partition pruning.” [9]
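The effect of tenant-scoped queries can be sketched with a helper that forces every query to carry an OrgID predicate, so the engine only ever touches one tenant's slice of the data. The table and column names here are illustrative, not Database.com's actual schema:

```python
import sqlite3

# One shared table holds rows for all tenants, keyed by org_id.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mt_data (org_id TEXT, obj TEXT, value TEXT)")
db.executemany("INSERT INTO mt_data VALUES (?, ?, ?)", [
    ("org1", "Account", "Acme"),
    ("org2", "Account", "Globex"),
])

def tenant_query(org_id, obj):
    """Every query is scoped to one tenant; with tenant-partitioned
    storage, the engine can then skip all other partitions."""
    cur = db.execute(
        "SELECT value FROM mt_data WHERE org_id = ? AND obj = ?",
        (org_id, obj))
    return [row[0] for row in cur.fetchall()]

values = tenant_query("org1", "Account")
```

SQLite does not itself prune partitions, but the mandatory `org_id` predicate is exactly what lets a partitioned engine restrict each query to one tenant's partition.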
6.7 Application development
Developers can declaratively build server-side application components using the Database.com
Console. This point-and-click interface supports all facets of the application schema building process,
including the creation of an application’s data model (objects and their fields, relationships, etc.),
security and sharing model (users, profiles, role hierarchies, etc.), declarative logic (workflows), and
programmatic logic (stored procedures and triggers). The Console provides access to built-in system features, which make it easy to implement application functionality without the need to write code [9].
6.8 Data Access
Database.com provides the following tools to query and work with data.
Database.com REST API and Force.com Web Services API
The REST API and Web Services API can be used to interact with Database.com by creating, retrieving, updating, and deleting records, maintaining passwords, performing searches, and so on. These APIs can be used with any language that supports Web services.
The SOAP-based API is optimized for real-time client applications that update small numbers of records at a time [8] [9].
Force.com Bulk API
The Bulk API is based on REST principles, and is optimized for loading or deleting large sets of data. It
can be used to insert, update, delete, or restore a large number of records asynchronously by
submitting a number of batches that are processed in the background by Database.com. The Bulk
API is designed to simplify the processing of a few thousand to millions of records.
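The batch-oriented style of the Bulk API can be illustrated with a small helper that splits a large record set into fixed-size batches for asynchronous submission. The batch size and record shape are illustrative only, not the API's actual limits:

```python
def make_batches(records, batch_size=10_000):
    """Split a large list of records into batches suitable for
    asynchronous, background bulk processing."""
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]

# 25,000 hypothetical records become three batches: 10k, 10k, and 5k.
records = [{"id": i} for i in range(25_000)]
batches = make_batches(records)
```

Each batch would then be submitted as one unit of work, and the service processes the queued batches in the background rather than holding a client connection open.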
Apex Data Manipulation Language (DML)
DML statements are used to insert, delete, and update data from within your Apex code.
Apex Web Services
Apex methods can be exposed as Web service operations that can be called by external Web client
applications. This is a powerful tool for building efficient communication between the data service and the application tier. By aggregating business logic onto Database.com, it can:
- Prevent unnecessary communication between the data service and the client
- Simplify client development and maintenance by providing a coarse-grained application-level API
- Build more robust applications, since all of the logic implemented in Apex is executed within a transaction on Database.com [9]
6.9 Query languages
Database.com uses the Salesforce Object Query Language (SOQL) to construct database queries. Similar to the SELECT command in the Structured Query Language (SQL), SOQL allows you to specify the source object, a list of fields to retrieve, and conditions for selecting rows in the source object.
Database.com also includes a full-text, multilingual search engine that automatically indexes all text-related fields. Apps can leverage this pre-integrated search engine using the Salesforce Object Search Language (SOSL) to perform text searches.
Unlike SOQL, which can only query one object at a time, SOSL can search text, email, and phone fields for multiple objects simultaneously [9].
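A small sketch of how a client might assemble a SOQL query string before sending it through one of the APIs. The object and field names are placeholders, and real SOQL grammar has many more features than this fragment shows:

```python
def build_soql(obj, fields, where=None):
    """Assemble a basic SOQL SELECT statement from its parts."""
    query = f"SELECT {', '.join(fields)} FROM {obj}"
    if where:
        query += f" WHERE {where}"
    return query

soql = build_soql("Contact", ["Name", "Email"], where="Email != null")
```

The resulting string, e.g. a SELECT over the Contact object, would be passed as the query parameter of a REST or SOAP call rather than executed locally.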
6.10 Multitenant search processing
Web-based application users have come to expect an interactive search capability to scan the entire or
a selected scope of an application’s database, return ranked results that are up-to-date, and do it all
with sub-second response times. To provide such robust search functionality for applications, Database.com uses a search engine that is separate from its transaction engine. The relationship between the two engines is depicted in Figure 6.
Figure 6 Transaction and Search engine [9]
The search engine receives data from the transactional engine, with which it creates search indexes.
The transactional engine forwards search requests to the search engine, which returns results that the
transaction engine uses to locate rows that satisfy the search request.
As applications update data in text fields (CLOBs, Name, etc.), a pool of background processes called indexing servers asynchronously updates the corresponding indexes, which the search engine maintains outside the core transaction engine. To optimize the indexing process, Database.com synchronously copies modified chunks of text data to an internal “to-be-indexed” table as transactions commit, thus providing a relatively small data source that minimizes the amount of data that indexing servers must read from disk. The search engine automatically maintains separate indexes for each organization (tenant).
Depending on the current load and utilization of indexing servers, text index updates may noticeably
lag behind actual transactions. To avoid unexpected search results originating from stale indexes,
Database.com also maintains an MRU (most recently used) cache of recently updated rows that the
system considers when materializing full-text search results. In order to efficiently support possible
search scopes, MRU caches are maintained per-user and per-organization.
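The MRU cache behavior can be sketched as a bounded, recency-ordered map: recently touched rows stay visible to search even while the asynchronous index lags. The capacity and record identifiers below are invented for illustration:

```python
from collections import OrderedDict

class MRUCache:
    """Keeps the most recently updated rows, evicting the oldest entry
    once capacity is exceeded, so searches against a stale index can
    still surface fresh changes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._rows = OrderedDict()

    def touch(self, row_id, value):
        # Re-inserting moves the row to the most-recent position.
        self._rows.pop(row_id, None)
        self._rows[row_id] = value
        if len(self._rows) > self.capacity:
            self._rows.popitem(last=False)  # evict the least recent row

    def recent_ids(self):
        return list(self._rows)

cache = MRUCache(capacity=2)
cache.touch("row1", "v1")
cache.touch("row2", "v2")
cache.touch("row3", "v3")   # exceeds capacity, so row1 is evicted
ids = cache.recent_ids()
```

Maintaining one such cache per user and per organization, as the text describes, keeps each search scope's recent changes isolated and cheap to consult.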
Database.com’s search engine optimizes the ranking of records within search results using several
different methods. For example, the system considers the security domain of the user performing a
search and weighs those rows to which the current user has access more heavily. The system can also
consider the modification history of a particular row and rank more actively updated rows ahead of
those that are relatively static. The user can choose to weight search results as desired, for example,
placing more emphasis on recently modified rows.
6.11 Multitenant isolation and protection
To protect the overall scalability and performance of the shared database system for all concerned
applications, Database.com is using an extensive set of governors and resource limits associated with
code execution. Execution of a code script is monitored and limited how much CPU time it can use,
how much memory it can consume, how many queries and DML statements it can execute, how many
math calculations it can perform, how many outbound Web services calls it can make, and much more.
Individual queries that optimizer regards as too expensive to execute throw an exception to the caller
[9].
Before an organization can transition a new application from development to production status,
salesforce.com requires unit tests that validate the functionality of the application’s Database.com
code routines. Salesforce.com executes submitted unit tests in Database.com’s sandbox development
environment to ascertain if the application code will adversely affect the performance and scalability of
the multitenant population at large.
Once an application’s code is certified for production by salesforce.com, the deployment process copies all the application’s metadata into a production Database.com instance and reruns the corresponding unit tests.
After a production application is live, the performance profiler automatically analyzes and provides
associated feedback to administrators. Performance analysis reports include information about slow
queries, data manipulations, and sub-routines that you can review and use to tune application
functionality.
6.12 Deletes, undeletes
When an app deletes a record from an object, Database.com simply marks the row for deletion. Deleted rows can be restored from the Recycle Bin for up to 30 days before they are permanently removed. The total number of deleted records maintained for an organization is limited based on the storage limits for that organization.
The Recycle Bin also stores dropped fields and their data until an organization permanently deletes them or 45 days have elapsed, whichever happens first. Until that time, the entire field and all its data are available for restoration [9].
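A minimal soft-delete sketch of this behavior, assuming a per-row deletion timestamp and a 30-day restore window. The field names and clock handling are invented for illustration; the real system's storage details are not shown in the source:

```python
from datetime import datetime, timedelta

RESTORE_WINDOW = timedelta(days=30)

rows = {"r1": {"value": "hello", "deleted_at": None}}

def delete(row_id, now):
    # Mark for deletion instead of physically removing the row.
    rows[row_id]["deleted_at"] = now

def restore(row_id, now):
    """Undo a deletion if it is still within the restore window."""
    deleted_at = rows[row_id]["deleted_at"]
    if deleted_at is not None and now - deleted_at <= RESTORE_WINDOW:
        rows[row_id]["deleted_at"] = None
        return True
    return False  # not deleted, or past the window

t0 = datetime(2013, 1, 1)
delete("r1", t0)
ok = restore("r1", t0 + timedelta(days=10))    # within the 30-day window
late = restore("r1", t0 + timedelta(days=40))  # already restored; a no-op
```

Physical removal then becomes a background sweep over rows whose `deleted_at` timestamp has aged past the window, decoupled from the user-facing delete.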
6.13 Backup
Database.com uses a variety of methods to ensure that organizations do not experience any data loss.
Every transaction is stored to RAID disks in real-time with archive mode enabled, allowing the database
to recover all transactions prior to any system failure. Every night all data is backed up to a separate
backup server and automatic tape library. The backup tapes are cloned as an additional precautionary
measure, and the cloned tapes are transported to an off-site, fireproof vault twice a month [8].
6.14 Pricing
Database.com pricing is based on the number of users, records, and transactions per month.
Registration of a new account is free and includes:
• 3 Standard Users
• 3 Administration Users
• 100,000 records in the database
• 50,000 transactions per month
Additional storage and capacity can be purchased at any time with no downtime.
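The free-tier limits above can be expressed as a simple usage check. The limit values come from the text; the checking logic and names are an illustrative sketch, not Database.com's billing code.

```python
# Free-tier limits as listed in the text.
FREE_TIER = {
    "standard_users": 3,
    "admin_users": 3,
    "records": 100_000,
    "transactions_per_month": 50_000,
}

def within_free_tier(usage: dict) -> bool:
    """True if every measured usage value fits inside the free-tier limits."""
    return all(usage.get(key, 0) <= limit for key, limit in FREE_TIER.items())

print(within_free_tier({"records": 80_000, "transactions_per_month": 10_000}))  # True
print(within_free_tier({"records": 120_000}))  # over the record limit -> False
```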
7. Microsoft’s SQL AZURE
Microsoft SQL Azure Database is a cloud-based relational database service that is built on SQL Server
technologies and runs in Microsoft data centers on hardware that is owned, hosted, and maintained
by Microsoft.
SQL Azure is probably the most fully-featured relational database available in the cloud. It is based on
the SQL Server standalone database but the way data is managed and stored in SQL Azure is
significantly different.
Similar to an instance of SQL Server, SQL Azure Database exposes a tabular data stream (TDS)
interface for Transact-SQL-based database access. This allows your database applications to use SQL
Azure Database in the same way that they use SQL Server. Because SQL Azure Database is a service,
administration in SQL Azure Database is slightly different.
Unlike administration for an on-premises instance of SQL Server, SQL Azure Database abstracts the
logical administration from the physical administration. Users continue to administer databases,
logins, users, and roles, but Microsoft administers the physical hardware such as hard drives, servers,
and storage. This approach helps SQL Azure Database provide a large-scale multitenant database
service that offers enterprise-class availability, scalability, security, and self-healing [11].
7.1 Subscriptions
To use SQL Azure, a Windows Azure platform account is required. This account allows access to all
the Windows Azure-related services, such as Windows Azure, Windows Azure AppFabric, and SQL
Azure. The Windows Azure platform account is used to set up and manage subscriptions and to bill for
consumption of any of the Windows Azure services including SQL Azure, although running SQL Azure
does not require Windows Azure itself. With the Windows Azure platform account, the Windows Azure
Platform Management portal can be used to create SQL Azure servers, databases, and their associated
administrator accounts [11].
Each subscription allows one instance of SQL Server to be defined, which will initially include only a
master database. Firewall settings have to be configured for each server to determine which
connections will be allowed access.
7.2 Databases
Each SQL Azure server always includes a master database. Up to 149 additional databases can be
created for each SQL Azure server. Microsoft offers two editions of SQL Azure databases: Web
and Business. When you create a database using the Windows Azure Platform Management
portal, the maximum size you specify determines the edition you create. A Web Edition database can
have a maximum size of 1 GB or 5 GB. A Business Edition database can have a maximum size of up to
150 GB of data, in 10 GB increments up to 50 GB, and then in 50 GB increments [11][12]. If the size of
the database reaches the limit, it is not possible to insert data, update data, or create new database
objects. However, reading and deleting data, truncating tables, dropping tables and indexes, and
rebuilding indexes are still possible.
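The at-limit behavior just described can be modeled as a small guard: operations that grow the data are rejected once the maximum size is reached, while the remaining operations still succeed. The operation names and structure are an illustrative sketch, not SQL Azure's actual error handling.

```python
# Operations that still succeed once the database has hit its size limit,
# per the text: reads, deletes, truncates, drops, and index rebuilds.
ALLOWED_AT_LIMIT = {"read", "delete", "truncate", "drop", "rebuild_index"}

def operation_allowed(op: str, size_gb: float, max_size_gb: float) -> bool:
    """Decide whether an operation is permitted given the current database size."""
    if size_gb < max_size_gb:
        return True          # below the limit: everything is allowed
    return op in ALLOWED_AT_LIMIT

print(operation_allowed("insert", 4.2, 5))  # below the limit -> True
print(operation_allowed("insert", 5.0, 5))  # at the limit -> False
print(operation_allowed("delete", 5.0, 5))  # deletes still work -> True
```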
The SQL Azure data access model does not support cross-database queries; in the current version a
connection is made to a single database. If data from another database is needed, a new connection
must be created [11].
7.3 Security and Access to a SQL Azure Database
Most security issues for SQL Azure databases are managed by Microsoft within the SQL Azure data
center, with very little setup required by the users. A user must have a valid login and password in
order to connect to the SQL Azure database. Because SQL Azure supports only standard security, each
login must be explicitly created.
In addition, the firewall can be configured on each SQL Azure server to only allow traffic from
specified IP addresses to access the SQL Azure server. This helps to greatly reduce any chance of a
denial-of-service (DoS) attack. All communications between clients and SQL Azure must be SSL
encrypted, and clients should always connect with Encrypt = True to ensure that there is no risk of
man-in-the-middle attacks. DoS attacks are further reduced by a service called DoSGuard that actively
tracks failed logins from IP addresses; if it notices too many failed logins from the same IP address
within a period of time, that IP address is blocked from accessing any resources in the service [11].
The security model within a database is identical to that in SQL Server. Users are created and mapped
to login names. Users can be assigned to roles, and users can be granted permissions. Data in each
database is protected from users in other databases because the connections from the client
application are established directly to the connecting user’s database.
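The connection rules above (SSL via `Encrypt=True`, one connection per database) can be illustrated by building ODBC-style connection strings. The server, database, and credential values are hypothetical placeholders; this sketches the shape of such a string rather than any official template.

```python
def azure_conn_str(server: str, database: str, user: str, password: str) -> str:
    """Build an ODBC-style connection string for a single SQL Azure database."""
    return (
        f"Server=tcp:{server}.database.windows.net;"
        f"Database={database};Uid={user};Pwd={password};"
        # Per the text, clients should always connect with encryption enabled
        # to avoid man-in-the-middle attacks.
        "Encrypt=yes;TrustServerCertificate=no;"
    )

# Cross-database queries are unsupported, so querying a second database
# means opening a second, separate connection:
conn_a = azure_conn_str("myserver", "SalesDB", "admin1", "secret")
conn_b = azure_conn_str("myserver", "HRDB", "admin1", "secret")
print("Encrypt=yes" in conn_a)  # True
print(conn_a == conn_b)         # False: one connection per database
```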
7.4 SQL Azure architecture
Each SQL Azure database is associated with its own subscription. From the subscriber’s perspective,
SQL Azure provides logical databases for application data storage. In reality, each subscriber’s data is
replicated across three SQL Server databases that are distributed across three physical servers in a
single data center. Many subscribers may share the same physical database, but the data is presented
to each subscriber through a logical database that abstracts the physical storage architecture and uses
automatic load balancing and connection routing to access the data. The logical database that the
subscriber creates and uses for database storage is referred to as a SQL Azure database [11].
7.5 Logical Databases on a SQL Azure Server
SQL Azure subscribers access the actual databases, which are stored on multiple machines in the data
center, through the logical server. The SQL Azure Gateway service acts as a proxy, forwarding the
Tabular Data Stream (TDS) requests to the logical server. It also acts as a security boundary providing
login validation, enforcing the firewall and protecting the instances of SQL Server behind the gateway
against denial-of-service attacks. The Gateway is composed of multiple computers, each of which
accepts connections from clients, validates the connection information and then passes on the TDS to
the appropriate physical server, based on the database name specified in the connection. Figure 7
shows the physical architecture represented by the single logical server.
Figure 7 A logical server and its databases distributed across machines in the data center [11]
The machines with the SQL Server instances are called data nodes. Each data node contains a single
SQL Server instance, and each instance has a single user database, divided into partitions. Each
partition contains one SQL Azure client database, either a primary or secondary replica. Each database
hosted in the SQL Azure data center has three replicas: one primary replica and two secondary
replicas. All reads and writes go through the primary replica, and any changes are replicated to the
secondary replicas asynchronously. The replicas are the central means of providing high availability
for your SQL Azure databases.
The other SQL Azure database partitions that exist within the same SQL Server instances in the data
center are completely invisible and inaccessible to other subscribers [11].
For SQL Azure databases every commit needs to be a quorum commit. That is, the primary replica and
at least one of the secondary replicas must confirm that the log records have been written before the
transaction is considered to be committed.
Each data node machine hosts a set of processes referred to as the fabric. The fabric processes
perform the following tasks:
• Failure detection: notes when a primary or secondary replica becomes unavailable so that
the Reconfiguration Agent can be triggered
• Reconfiguration Agent: manages the re-establishment of primary or secondary replicas after
a node failure
• PM (Partition Manager) Location Resolution: allows messages to be sent to the Partition
Manager
• Engine Throttling: ensures that one logical server does not use a disproportionate amount of
the node’s resources, or exceed its physical limits
• Ring Topology: manages the machines in a cluster as a logical ring, so that each machine has
two neighbors that can detect when the machine goes down
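The ring-topology idea in the last bullet can be sketched in a few lines: machines form a logical ring, and each machine's two immediate neighbors watch it for failure. The modular-arithmetic layout is an assumption for illustration, not the fabric's actual data structure.

```python
def ring_neighbors(machine, cluster_size):
    """Return the two neighbors that monitor `machine` in a ring of `cluster_size` machines."""
    # The ring wraps around: machine 0's left neighbor is the last machine.
    return ((machine - 1) % cluster_size, (machine + 1) % cluster_size)

print(ring_neighbors(0, 6))  # (5, 1): the ring wraps around
print(ring_neighbors(3, 6))  # (2, 4)
```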
The machines in the data center are all commodity machines with components of low-to-medium
quality and low-to-medium performance capacity. The low cost and easily available configuration
make it easy to quickly replace machines in case of a failure. In addition, Windows Azure machines
use the same commodity hardware, so all machines in the data center, whether used for SQL Azure
or for Windows Azure, are interchangeable.
In Figure 7, the logical server contains three databases: DB1, DB3, and DB4. The primary replica for
DB1 is on Machine 6 and the secondary replicas are on Machine 4 and Machine 5. For DB3, the
primary replica is on Machine 4, and the secondary replicas are on Machine 5 and on another
machine not shown in this figure. For DB4, the primary replica is on Machine 5, and the secondary
replicas are on Machine 6 and on another machine not shown in this figure. Note that this diagram is
a simplification. Most production Microsoft SQL Azure data centers have hundreds of machines with
hundreds of actual instances of SQL Server to host the SQL Azure replicas, so it is extremely unlikely
that if multiple SQL Azure databases have their primary replicas on the same machine, their secondary
replicas will also share a machine [11].
The physical distribution of databases that all are part of one logical instance of SQL Server means
that each connection is tied to a single database, not a single instance of SQL Server.
7.6 Network Topology
Four distinct layers of abstraction work together to provide the logical database for the subscriber’s
application to use: the client layer, the services layer, the platform layer, and the infrastructure layer.
Figure 8 illustrates the relationship between these four layers.
The client layer resides closest to the application, and it is used by the application to communicate
directly with SQL Azure. The client layer can reside on-premises in a data center, or it can be hosted in
Windows Azure. Every protocol that can generate TDS over the wire is supported. Because SQL Azure
provides the same TDS interface as SQL Server, known and familiar tools and libraries can be used to
build client applications for data that is in the cloud.
The infrastructure layer represents the IT administration of the physical hardware and operating
systems that support the services layer.
Figure 8 Four layers of abstraction provide the SQL Azure logical database for a client application to use [11]
7.7 High Availability with SQL Azure
The goal for Microsoft SQL Azure is to maintain 99.9 percent availability for the subscribers’
databases. As stated earlier, this goal is achieved through the use of commodity hardware that can
be quickly and easily replaced in case of machine or drive failure, and through management of the
replicas, one primary and two secondary, for each SQL Azure database [12].
7.8 Failure Detection
Management in the data centers needs to detect not only a complete failure of a machine, but also
conditions where machines are slowly degrading and communication with them is affected. The
concept of quorum commit, discussed earlier, addresses these conditions. First, a transaction is not
considered to be committed unless the primary replica and at least one secondary replica can
confirm that the transaction log records were successfully written to disk. Second, if both a primary
replica and a secondary replica must report success, small failures that might not prevent a
transaction from committing but that might point to a growing problem can be detected [11].
7.9 Reconfiguration
The process of replacing failed replicas is called reconfiguration. Reconfiguration can be required
due to failed hardware or to an operating system crash, or to a problem with the instance of SQL
Server running on the node in the data center. Reconfiguration can also be necessary when an
upgrade is performed, whether for the operating system, for SQL Server, or for SQL Azure.
All nodes are monitored by six peers, each on a different rack from the monitored node. The peers
are referred to as neighbors. A failure is reported by one of the neighbors of the failed node, and
the process of reconfiguration is carried out for each database that has a replica on the failed node.
Because each machine holds replicas of hundreds of SQL Azure databases (some primary replicas
and some secondary replicas), if a node fails, the reconfiguration operations are performed
hundreds of times. There is no prioritization in handling the hundreds of failures when a node fails;
the Partition Manager randomly selects a failed replica to handle, and when it is done with that
one, it chooses another, until all of the replica failures have been dealt with.
If a node goes down because of a reboot, that is considered a clean failure, because the neighbors
receive a clear exception message.
Another possibility is that a machine stops responding for an unknown reason, and an ambiguous
failure is detected. In this case, an arbitrator process determines whether the node is really down.
Although this discussion centers on the failure of a single replica, it is really the failure of a node that is
detected and dealt with. A node contains an entire SQL Server instance with multiple partitions
containing replicas from up to 650 different databases. Some of the replicas will be primary and
some will be secondary. When a node fails, the processes described earlier are performed for each
affected database. That is, for some of the databases, the primary replica fails, and the arbitrator
chooses a new primary replica from the existing secondary replicas, and for other databases, a
secondary replica fails, and a new secondary replica is created.
The majority of the replicas of any SQL Azure database must confirm the commit. At this time, user
databases maintain three replicas, so a quorum commit would require two of the replicas to
acknowledge the transaction. A metadata store, which is part of the Gateway components in the
data centers, maintains five replicas and so needs three confirmations to satisfy a quorum commit.
The master cluster, which maintains seven replicas, needs four of them to confirm a transaction.
However, for the master cluster, even if all seven replicas fail, the information is recoverable,
because mechanisms are in place to rebuild the master cluster automatically in case of such a
massive failure [11].
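The quorum sizes quoted above (2 of 3, 3 of 5, 4 of 7) all follow a simple majority rule, which can be written as a one-line function. The function name is mine; the numbers come from the text.

```python
def quorum(replicas):
    """Minimum number of replica acknowledgements needed for a quorum commit."""
    return replicas // 2 + 1

print(quorum(3))  # user databases: 2 of 3 replicas must confirm
print(quorum(5))  # Gateway metadata store: 3 of 5
print(quorum(7))  # master cluster: 4 of 7
```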
7.10 Availability Guarantees
As mentioned earlier, the goal for Microsoft SQL Azure is to maintain 99.9 percent availability.
Because of the way that database replicas are distributed across multiple servers and the efficient
algorithms for promoting secondary replicas to primary, up to 15 percent of the machines in the
data center can be down and the availability can still be guaranteed [11].
7.11 Scalability with SQL Azure
As mentioned earlier, one of the biggest benefits of hosting your databases in the cloud is the built-in
scalability. With SQL Azure, as with most cloud database platforms, you add more databases only
when and if you need them, and if the need is only temporary, you can then drop the unneeded
databases. There are two components within SQL Azure that allow this scalability by continuously
monitoring the load on each node. One component is Engine Throttling, which ensures that the
server doesn’t get overloaded. The other component is the Load Balancer, which ensures that a
server isn’t continuously in the throttled state. In this section, we’ll look at these two components
and discuss how engine throttling applies when predefined limits are reached and how load
balancing works as the number of hosted databases increases. A third technique for achieving
greater scalability and performance is Federations [31], used in SQL Azure. One or more tables
within a database are split by row and partitioned across multiple databases (federation members).
This type of horizontal partitioning is often referred to as ‘sharding’. The primary scenarios in which
this is useful are where you need to achieve scale, performance, or to manage capacity [11].
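Row-level partitioning of the kind Federations use can be sketched as a routing table that maps a federation key to a member database. The key ranges and member names here are hypothetical; real federations define their own distribution ranges.

```python
# Hypothetical federation layout: each member covers a range of customer ids.
FEDERATION_MEMBERS = [
    (0, 100_000, "member_0"),        # ids [0, 100000)
    (100_000, 200_000, "member_1"),  # ids [100000, 200000)
    (200_000, None, "member_2"),     # everything above
]

def member_for(key):
    """Route a federation key to the member database holding its rows."""
    for low, high, name in FEDERATION_MEMBERS:
        if key >= low and (high is None or key < high):
            return name
    raise ValueError(f"no federation member covers key {key}")

print(member_for(42))         # member_0
print(member_for(150_000))    # member_1
print(member_for(5_000_000))  # member_2
```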
7.12 Throttling
Because of the multitenant use of each SQL Server in the data center, it is possible that one
subscriber’s application could render the entire instance of SQL Server ineffective by imposing
heavy loads. For example, under full recovery mode, inserting lots of large rows, especially ones
containing large objects, can fill up the transaction log and eventually the drive that the transaction
log resides on. In addition, each instance of SQL Server in the data center shares the machine with
other critical system processes that cannot be starved, most relevantly the fabric process that
monitors the health of the system.
To keep a data center server’s resources from being overloaded and jeopardizing the health of the
entire machine, the load on each machine is monitored by the Engine Throttling component. In
addition, each database replica is monitored to make sure that statistics such as log size, log write
duration, CPU usage, the actual physical database size limit, and the SQL Azure user database size
are all below target limits. If the limits are exceeded, the result can be that a SQL Azure database
rejects reads or writes for 10 seconds at a time. Occasionally, violation of resource limits may result
in the SQL Azure database permanently rejecting reads and writes (depending on the resource type
in question) [11].
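The throttling decision above can be sketched as a comparison of replica statistics against target limits. The limit values here are hypothetical placeholders; only the shape of the check (metrics over limit lead to a temporary 10-second rejection of reads/writes) follows the text.

```python
# Hypothetical target limits for the statistics named in the text.
LIMITS = {"log_size_gb": 2.0, "log_write_ms": 500, "cpu_pct": 90, "db_size_gb": 50}

def throttle_decision(stats):
    """Compare replica statistics against limits and decide whether to throttle."""
    over = [k for k, limit in LIMITS.items() if stats.get(k, 0) > limit]
    if not over:
        return "ok"
    # Per the text, a violation rejects reads/writes for 10 seconds at a time.
    return f"reject reads/writes for 10s (over: {', '.join(sorted(over))})"

print(throttle_decision({"cpu_pct": 45, "log_size_gb": 0.5}))  # ok
print(throttle_decision({"cpu_pct": 97}))                      # throttled
```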
7.13 Load Balancer
At this time, although there are availability guarantees with SQL Azure, there are no performance
guarantees. Part of the reason for this is the multitenant problem: many subscribers with their own
SQL Azure databases share the same instance of SQL Server and the same computer, and it is
impossible to predict the workload that each subscriber’s connections will be requesting. SQL Azure
provides load balancing services that evaluate the load on each machine in the data center. When a
new SQL Azure database is added to the cluster, the Load Balancer determines the locations of the
new primary and secondary replicas based on the current load on the machines.
If one machine gets loaded too heavily, the Load Balancer can move a primary replica to a machine
that is less loaded [11].
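A placement decision of the kind the Load Balancer makes can be sketched as a least-loaded heuristic: put the three replicas of a new database on the three least-loaded machines, each on a distinct machine. This is an illustrative heuristic, not Microsoft's actual algorithm.

```python
def place_replicas(machine_loads):
    """Pick the three least-loaded machines for a new database's replicas.

    machine_loads maps machine name -> current load (0.0 to 1.0).
    """
    ordered = sorted(machine_loads, key=machine_loads.get)
    primary, secondary_1, secondary_2 = ordered[:3]   # all distinct machines
    return {"primary": primary, "secondaries": [secondary_1, secondary_2]}

loads = {"m1": 0.9, "m2": 0.2, "m3": 0.5, "m4": 0.1, "m5": 0.7}
placement = place_replicas(loads)
print(placement["primary"])      # m4, the least-loaded machine
print(placement["secondaries"])  # ['m2', 'm3']
```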
7.14 SQL Azure Management
Because your SQL Azure databases are hosted within larger SQL Server instances on machines in
the data centers, the management work that needs to be done is very limited. However, some
maintenance tasks are still necessary.
All physical aspects of dealing with your databases are handled in the data center by Microsoft.
Also, all upgrades are handled in the data center, one replica at a time. The user is responsible for
troubleshooting poorly performing queries and concurrency problems, such as blocking.
Just like in SQL Server, some of the main tools available for troubleshooting are the dynamic
management views (DMVs) [11].
7.15 Pricing in SQL Azure
Billing in SQL Azure is per database, based on usage and database edition. This allows an organization
to start with a small investment and add space as the business grows. SQL Azure provides two
different database editions, Business Edition and Web Edition. SQL Azure edition features apply to
the individual database, and different database editions can be mixed and matched within the same
SQL Azure server.
Both editions offer scalability, automated high availability, and self-provisioning.
• The Web Edition database is suited for small Web applications and workgroup or
departmental applications. This edition supports a database with a maximum size of 1 or 5
GB of data.
• The Business Edition database is suited for independent software vendors (ISVs),
line-of-business (LOB) applications, and enterprise applications. This edition supports a
database of up to 150 GB of data, in 10 GB increments up to 50 GB, and then in 50 GB
increments.
Both editions charge an additional bandwidth-based fee when the data transfer includes a client
outside the Windows Azure platform or outside the region of the SQL Azure database.
You specify the edition and maximum size of the database when you create it; you can also change
the edition and maximum size after creation. The billing will be based on the new edition type (and
the peak size the database reaches, daily) [13].
Microsoft charges a monthly fee for each SQL Azure user database. The database fee is amortized
over the month and charged daily. The daily fee depends on the peak size that each database
reached that day, the edition of each database, and the number of databases you have. A 10
GB multiplier is used for pricing Business Edition databases, and a 1 GB or 5 GB multiplier is used for
pricing Web Edition databases. Users pay for the databases they have, for the days they have them
[13].
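The daily-amortized billing scheme above can be sketched numerically: the daily charge is derived from the peak size the database reached that day, rounded up to the edition's size multiplier. The rate value and function names are hypothetical placeholders, not Microsoft's actual prices.

```python
import math

def billable_units(peak_size_gb, edition):
    """Round a day's peak size up to the edition's billing multiplier, in GB."""
    if edition == "business":
        multiplier = 10                        # Business Edition: 10 GB steps
    else:
        multiplier = 1 if peak_size_gb <= 1 else 5  # Web Edition: 1 GB or 5 GB tier
    return max(1, math.ceil(peak_size_gb / multiplier)) * multiplier

def daily_fee(peak_size_gb, edition, rate_per_gb_month, days_in_month=30):
    """Amortize a (hypothetical) monthly per-GB rate into a daily charge."""
    return billable_units(peak_size_gb, edition) * rate_per_gb_month / days_in_month

print(billable_units(0.4, "web"))        # 1  -> billed as a 1 GB Web database
print(billable_units(3.2, "web"))        # 5  -> billed at the 5 GB Web tier
print(billable_units(23.0, "business"))  # 30 -> rounded up to the next 10 GB step
```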
Bandwidth used between SQL Azure and Windows Azure or Windows Azure AppFabric is free
within the same sub-region or data center.
8. Amazon Web Services
Amazon is another company offering a relational database service as part of its Amazon web
services. In the next section I will first describe the Amazon Relational Database Service, and
later give an overview of Amazon's NOSQL databases, SimpleDB and DynamoDB, and
other NOSQL solutions currently available.
8.1 Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS) is a web service that operates, and to some
extent scales, a relational database in the cloud. It provides cost-efficient and resizable capacity while
automating administration tasks. Amazon RDS gives users access to the capabilities of a
MySQL or Oracle database running on their own Amazon RDS database instance. The advantage is
that code and applications that use an on-premises MySQL or Oracle database can be easily
migrated to Amazon RDS.
8.2 Amazon RDS Architecture/Features
Amazon RDS takes a different approach than Database.com and SQL Azure. It offers the full
capabilities of a MySQL or Oracle database running on a separate database instance. The features
provided by Amazon RDS depend on the selected DB Engine. In general, it offers:
• Pre-configured Parameters – DB Instances are pre-configured with a sensible set of
parameters and settings appropriate for the selected DB Instance class. This makes it
possible to launch a MySQL or Oracle DB Instance and connect an application without
additional configuration.
• Monitoring and Metrics – Amazon RDS provides Amazon CloudWatch metrics for DB
Instance deployments. The AWS Management Console can be used to view key operational
metrics for the DB Instance deployments, including compute/memory/storage capacity
utilization, I/O activity, and DB Instance connections.
• Automatic Software Patching – Amazon RDS makes sure that the relational database
software stays up-to-date with the latest patches.
• Automated Backups – Turned on by default, the automated backup feature of Amazon
RDS enables point-in-time recovery for the DB Instance. Amazon RDS backs up the
database and transaction logs and stores both for a user-specified retention period. This
allows restores of the DB Instance to any second during the retention period, up to the
last five minutes. The automatic backup retention period can be configured up to
thirty-five days.
• DB Snapshots – DB Snapshots are user-initiated backups of the DB Instance.
These full database backups are stored by Amazon RDS until they are explicitly deleted.
Users can also create a new DB Instance from a DB Snapshot.
• Isolation and Security – Using Amazon VPC², it is possible to isolate DB Instances in their
own virtual network and connect to an existing IT infrastructure using an industry-standard
encrypted IPsec VPN. In addition, for both MySQL and Oracle, access to DB Instances can
be controlled using database security groups (DB Security Groups). A DB Security
Group acts like a firewall controlling network access to the DB Instance. By default,
network access to DB Instances is turned off. For applications to access a DB Instance, the
DB Security Group must be set to allow access from EC2³ Instances with specific EC2
Security Group membership or IP ranges [14].
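The deny-by-default DB Security Group behavior can be sketched as a simple check: access is granted only to sources in an authorized EC2 Security Group or an allowed IP range. The group name and CIDR values are hypothetical; this models the described rule, not the AWS API.

```python
import ipaddress

# Hypothetical authorizations for one DB Security Group.
ALLOWED_EC2_GROUPS = {"web-tier"}
ALLOWED_CIDRS = [ipaddress.ip_network("203.0.113.0/24")]

def access_allowed(ec2_group=None, source_ip=None):
    """True if a connection source is authorized by the DB Security Group."""
    if ec2_group in ALLOWED_EC2_GROUPS:
        return True
    if source_ip is not None:
        addr = ipaddress.ip_address(source_ip)
        return any(addr in net for net in ALLOWED_CIDRS)
    return False  # default: network access is turned off

print(access_allowed(ec2_group="web-tier"))      # True
print(access_allowed(source_ip="203.0.113.10"))  # True
print(access_allowed(source_ip="198.51.100.7"))  # False
```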
8.3 Scalability with Amazon RDS
Amazon RDS gives the flexibility to scale the compute resources or storage capacity
associated with a relational database instance by using the Amazon RDS APIs or through the AWS
Management Console. The compute and memory resources can be scaled up or down by using
predefined DB Instance classes. Currently Amazon offers five supported DB Instance classes:
• Small DB Instance: 1.7 GB memory, 1 ECU (1 virtual core with 1 ECU), 64-bit platform,
Moderate I/O Capacity
• Large DB Instance: 7.5 GB memory, 4 ECUs (2 virtual cores with 2 ECUs each), 64-bit
platform, High I/O Capacity
• High-Memory Extra Large DB Instance: 17.1 GB memory, 6.5 ECUs (2 virtual cores with
3.25 ECUs each), 64-bit platform, High I/O Capacity
• High-Memory Double Extra Large DB Instance: 34 GB memory, 13 ECUs (4 virtual cores
with 3.25 ECUs each), 64-bit platform, High I/O Capacity
• High-Memory Quadruple Extra Large DB Instance: 68 GB memory, 26 ECUs (8 virtual
cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity
For each DB Instance class, it is possible to select from 5 GB to 1 TB of associated storage capacity.
Additional storage can be provisioned on the fly with no downtime.
One ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon
processor [14].
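A simple helper can be sketched from the class list above: choose the smallest DB Instance class whose memory meets the application's requirement. The class data comes from the text; the selection logic and function name are illustrative.

```python
# (name, memory in GB) for the five DB Instance classes listed above,
# ordered from smallest to largest.
DB_INSTANCE_CLASSES = [
    ("Small", 1.7),
    ("Large", 7.5),
    ("High-Memory Extra Large", 17.1),
    ("High-Memory Double Extra Large", 34.0),
    ("High-Memory Quadruple Extra Large", 68.0),
]

def smallest_class_for(required_memory_gb):
    """Return the smallest DB Instance class with at least the required memory."""
    for name, memory_gb in DB_INSTANCE_CLASSES:
        if memory_gb >= required_memory_gb:
            return name
    raise ValueError("no DB Instance class offers that much memory")

print(smallest_class_for(4))   # Large
print(smallest_class_for(20))  # High-Memory Double Extra Large
```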
² Amazon Virtual Private Cloud (Amazon VPC) – an isolated section of the Amazon Web Services (AWS) Cloud where you
can launch AWS resources in a virtual network that you define, offering complete control over your virtual networking
environment, including selection of your own IP address range, creation of subnets, and configuration of route tables
and network gateways.
³ Amazon Elastic Compute Cloud (EC2) – a web service that provides resizable compute capacity in the cloud.
8.4 High Availability
Amazon RDS runs on the same highly reliable infrastructure as the other Amazon web services. It has
multiple features that enhance availability for critical production databases. Currently it offers
automatic host replacement and replication.
With the automatic host replacement, Amazon RDS will automatically replace the compute instance
powering the deployment in the event of a hardware failure.
Replication is currently supported only for MySQL, although it is planned to be available for
Oracle in the near future. For MySQL, Amazon RDS provides two replication features, Multi-AZ
deployments and Read Replicas.
With Multi-AZ deployments Amazon RDS will automatically provision and manage a “standby”
replica in a different Availability Zone (independent infrastructure in a physically separate location).
Database updates are made concurrently on the primary and standby resources to prevent
replication lag. In the event of planned database maintenance, DB Instance failure, or an Availability
Zone failure, Amazon RDS will automatically failover to the up-to-date standby so that database
operations can resume quickly without administrative intervention. Prior to failover you cannot
directly access the standby, and it cannot be used to serve read traffic.
Read Replicas make it easy to elastically scale out beyond the capacity constraints of a single DB
Instance for read-heavy database workloads. It is possible to create one or more replicas of a given
source DB Instance and serve high-volume application read traffic from multiple copies of the data,
thereby increasing aggregate read throughput. Amazon RDS uses MySQL’s native replication to
propagate changes made to a source DB Instance to any associated Read Replicas. Since Read
Replicas leverage standard MySQL replication, they may fall behind their sources, and they are
therefore not intended to be used for enhancing fault tolerance in the event of source DB Instance
failure or Availability Zone failure [14].
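The read-scaling pattern above can be sketched as a router: writes always go to the source DB Instance, while reads are spread across Read Replicas (here with simple round-robin). The instance names and class are hypothetical; since replicas may lag the source, anything that writes must hit the source.

```python
import itertools

class ReadScaledRouter:
    """Route writes to the source DB Instance and reads to Read Replicas."""

    def __init__(self, source, replicas):
        self.source = source
        self._replica_cycle = itertools.cycle(replicas)  # round-robin over replicas

    def route(self, operation):
        # Replicas may fall behind the source, so writes always hit the source.
        if operation == "write":
            return self.source
        return next(self._replica_cycle)

router = ReadScaledRouter("source-db", ["replica-1", "replica-2"])
print(router.route("write"))  # source-db
print(router.route("read"))   # replica-1
print(router.route("read"))   # replica-2
print(router.route("read"))   # replica-1 again (round-robin)
```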
8.5 Pricing
As with the other previously mentioned DBMS services, Amazon RDS pricing is based on
usage and the DB Instance class. It is possible to choose between hourly On-Demand pricing with
no up-front or long-term commitments and a reserved pricing option.
• On-Demand DB Instances let users pay for compute capacity by the hour with no
long-term commitments. This frees you from the costs and complexities of planning,
purchasing, and maintaining hardware and transforms what are commonly large fixed
costs into much smaller variable costs.
• Reserved DB Instances give users the option to make a low, one-time payment for each
DB Instance they want to reserve and in turn receive a discount on the hourly usage
charge for that DB Instance. Depending on usage, it is possible to choose between
three Reserved DB Instance types (Light, Medium, and Heavy Utilization) and receive
anywhere between 30% and 55% discount over On-Demand prices. Based on the
application workload and the amount of time they will run, Amazon RDS Reserved
Instances may provide substantial savings over running On-Demand DB Instances.
The prices differ depending on whether a standard or Multi-AZ deployment is used. For both standard
and Multi-AZ deployments, pricing is per DB Instance-hour consumed, from the time a DB Instance is
launched until it is terminated.
There is no additional charge for backup storage up to 100% of provisioned database storage for an
active DB Instance. After the DB Instance is terminated, backup storage is billed per GB-month.
Additional backup storage is also billable.
Data transferred between Amazon RDS and Amazon EC2 Instances in the same Availability Zone,
and data transferred between Availability Zones for replication of Multi-AZ deployments, is free.
Amazon RDS DB Instances outside VPC: For data transferred between an Amazon EC2 instance and
Amazon RDS DB Instance in different Availability Zones of the same Region, there is no Data
Transfer charge for traffic in or out of the Amazon RDS DB Instance. Charges apply only for the Data
Transfer in or out of the Amazon EC2 instance, and standard Amazon EC2 Regional Data Transfer
charges apply.
Amazon RDS DB Instances inside VPC: For data transferred between an Amazon EC2 instance and
Amazon RDS DB Instance in different Availability Zones of the same Region, Amazon EC2 Regional
Data Transfer charges apply on both sides of transfer.
Data transferred between Amazon RDS and AWS services in different regions is charged as Internet
Data Transfer on both sides of the transfer.
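The transfer rules above amount to a small decision table, which can be sketched as follows. The return strings are my own summaries of the cases described in the text, not AWS billing codes.

```python
def transfer_charges(same_az, same_region, inside_vpc):
    """Summarize which side of an EC2<->RDS transfer is charged, per the rules above."""
    if not same_region:
        return "internet data transfer charged on both sides"
    if same_az:
        return "free"
    if inside_vpc:
        return "EC2 regional transfer charged on both sides"
    return "charged only on the EC2 side"

print(transfer_charges(same_az=True, same_region=True, inside_vpc=False))    # free
print(transfer_charges(same_az=False, same_region=True, inside_vpc=True))    # both sides
print(transfer_charges(same_az=False, same_region=False, inside_vpc=False))  # internet rates
```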
Additionally, for Oracle databases there are two licensing models, “License Included” and
“Bring-Your-Own-License (BYOL)”. In the “License Included” service model, you do not need separately
purchased Oracle licenses; the Oracle Database software has been licensed by AWS.
Bring-Your-Own-License is suited for users who already own Oracle Database licenses. The “BYOL”
model is designed for customers who prefer to use existing Oracle database licenses or purchase
new licenses directly from Oracle [14].
9. Google Cloud SQL
Google Cloud SQL is a MySQL database in Google's cloud. It has all the capabilities and
functionality of MySQL. Google Cloud SQL is currently available for Google App Engine applications
written in Java or Python. It can also be accessed from a command-line tool.
Like the other database-as-a-service offerings, Google Cloud SQL is fully managed; patch
management, replication, and other database management chores are handled by Google.
High availability is offered by built-in automatic replication across multiple geographic regions, so the service remains available and data is preserved even when a whole data center becomes unavailable. Users can choose to create databases with synchronous or asynchronous replication in datacenters in the EU or the US.
Google Cloud SQL is tightly integrated with Google App Engine and other Google services, which allows users to work across multiple products and get more value out of their data. Database instances are not restricted to a single App Engine application; multiple applications can use the same instance and database. Data can be imported into the database using mysqldump. This allows users to easily move data, applications, and services in and out of the cloud.
As an initial trial, Google is offering instances with a small amount of RAM and 0.5GB of database storage. Additional RAM and storage can be purchased, up to 16GB of RAM and 100GB of storage
[15].
9.1 Pricing
Google offers two billing plans for Google Cloud SQL, Packages or Per Use. The packages offer is
shown in the table below:
| Tier | RAM   | Included Storage | Included I/O Per Day |
|------|-------|------------------|----------------------|
| D1   | 0.5GB | 1GB              | 850K                 |
| D2   | 1GB   | 2GB              | 1.7M                 |
| D4   | 2GB   | 5GB              | 4M                   |
| D8   | 4GB   | 10GB             | 8M                   |
| D16  | 8GB   | 10GB             | 16M                  |
| D32  | 16GB  | 10GB             | 32M                  |

Table 2 Google Cloud SQL Packages
Each database instance is allocated the RAM shown above, along with an appropriate amount of CPU. Storage is measured as the filespace used by the MySQL database. Bills are issued monthly, based on the number of days during which the database existed. Google does not charge for the storage of backups created using the scheduled backup service. The number of I/O requests to storage made by a database instance depends on the queries, workload and data set. Cloud SQL caches data in memory to serve queries efficiently and to minimize the number of I/O requests. Use of storage or I/O over the included quota is charged at the Per Use rate. The maximum storage for any instance is currently 100GB.
With the Per Use plan, the same tiers as with the packages are offered, with the difference that database instances are charged for periods of continuous use. Storage is charged per GB in hourly units (whether the database is active or not), measured as the largest number of bytes during that one-hour period, rounded up to the nearest GB; I/O is charged by number of requests, rounded to the nearest million.
Network use is charged for both the packages and per-use billing plans. Only outbound external traffic is charged; the network usage between Google App Engine applications and Cloud SQL is free [15].
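The per-use rounding rules can be sketched as a small calculation. This is a minimal illustration of the rounding only; the function names are mine, and no actual Google prices are used.

```python
import math

def billable_gb_hours(peak_bytes_per_hour):
    """Per Use storage: each hour is billed at the largest number of
    bytes seen during that hour, rounded up to the nearest GB."""
    return sum(math.ceil(peak / 2**30) for peak in peak_bytes_per_hour)

def billable_io_millions(num_requests):
    """Per Use I/O: charged by number of requests, rounded to the
    nearest million."""
    return round(num_requests / 1_000_000)
```

For example, three hours each peaking at 1.2 GB are billed as six GB-hours, since every hourly peak rounds up to 2 GB.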
10. Summary of RDBMS DBaaS and common considerations
As we can see from the previous sections, Relational Database as a Service (DBaaS) is currently found in the public marketplace in two broad forms: online general-purpose relational databases, and the ability to operate virtual machine images loaded with common databases such as MySQL, Oracle or similar commercial databases.
Database.com offers a relational multitenant database specially built for the cloud using Salesforce's metadata-driven architecture.
Microsoft SQL Azure offers a SQL Server-like relational database management system and controls many of the database configuration details, allowing users to focus on the schema, data and application layer.
Amazon RDS provides implementations of MySQL or Oracle on virtual machines built and tuned for that purpose, and Google Cloud SQL provides MySQL for the App Engine PaaS.
While all the presented RDBMS DBaaS offerings provide an opportunity to reduce cost, there are many considerations to take into account before moving data to a cloud-based solution. Table 3 presents the main considerations comparison.
Data Sizing - All of the RDBMS DBaaS offerings presented have limits on the size of the data set
that can be stored on their systems.
Portability - Portability and adherence to standards is a critical issue for ensuring Continuity of
Operations and to mitigate business risk (e.g., a provider going out of business or raising rates). The
ability to instantiate a replicated version of the data “off-cloud” or in another cloud offering can
provide the business owners with an extra level of assurance that they will not suffer a loss of data.
This can be facilitated by standards, such as the use of a standard database query language (SQL).
Transaction Capabilities - Transaction capabilities are an essential feature for databases that need
to provide guaranteed reads and writes (ACID).
| | Salesforce Database.com | Microsoft SQL Azure | Amazon RDS (MySQL or Oracle) | Google Cloud SQL |
|---|---|---|---|---|
| Maximum amount of data that can be stored | Limited by number of records per database, up to 22,300,000 records | 5GB with Web edition database and up to 150GB with Business edition database | 1 terabyte per database instance | 100GB per database instance |
| Ease of software portability with similar locally hosted capability | Low. Requires database to be specially built and tested by Salesforce before deployment. | High. Most SQL Server features are available in SQL Azure. | High. MySQL/Oracle instantiation in the cloud is very similar to the locally instantiated version. | Medium. MySQL instance in the cloud is very similar to the local instance, but accessible only by Google App Engine. |
| Transaction capabilities | Yes | Yes | Yes | Yes |
| Configurability and ability to tune databases | Low. Creates indexes automatically and keeps a record of the most recently accessed records, but does not allow control over this, nor over memory allocation and similar resources. | Medium. Can create indexes and stored procedures, but no control over memory allocation or similar resources. | High. MySQL/Oracle instantiation in the cloud on a virtual machine. | Low. Automatically tuned. |
| Database accessible as “stand-alone” offering | Yes | Yes | Yes | No. Requires Google App Engine application layer. |
| Possibility to designate where the data is stored (e.g. region or data center) | No | Yes | Yes | Yes |
| Replication | No | Yes | Yes | Yes |

Table 3 Main Considerations Comparison
Configurability - DBaaS offerings may provide capabilities that reduce the amount of configuration
options available to database administrators. For some applications, if more configurability options
are managed by the platform owner rather than the customer’s database administrator, this can be
a benefit and it can reduce the amount of effort expended to maintain the database. For others,
the inability to tune and control all aspects of the database, such as memory management, can be
a limiting constraint in obtaining performance.
Database Accessibility - Most DBaaS offerings provide a predefined set of connectivity mechanisms that will directly impact adoption and use. There are three general approaches. First, most RDBMS offerings are typically accessible through industry-standard database drivers such as Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). These drivers allow applications external to the service to access the database through a standard connection, facilitating interoperability. Second, services typically provide interfaces that use standards-based, Service-Oriented Architecture (SOA) protocols, such as SOAP or REST, with the Hypertext Transfer Protocol (HTTP) and a vendor-specific API definition. These services may provide software development kits in common source-code languages to facilitate adoption. Third, some databases may be restricted to accessing data through software running in the vendor's ecosystem. This approach may increase security, but it also significantly limits portability and interoperability.
Availability and Replication - The ability to ensure that data is available and not lost is a key consideration. Ensuring access to data can come through enforcement of service-level agreement (SLA) metrics such as uptime, replication across a cloud provider's regions, and replication or movement of the data across cloud providers or to the consuming organization's data center.
• Replication across a cloud provider's hardware within a region may ameliorate the effects of a localized hardware or software failure.
• Replication across a cloud provider's geographic regions may ameliorate the effects of a network outage, natural disaster, or other regional event.
• Replication across multiple cloud providers or back to the consuming organization's IT infrastructure may provide the greatest continuity-of-operations benefit, through full geographic and IT stack independence.
Many providers such as Microsoft and Amazon offer replication of the data across hardware within
a specific region as part of a packaged service. Within a given vendor, replication across
geographies is usually more expensive and may result in significant data transfer fees.
11. NOSQL
While RDBMS databases are widely deployed and successful, they have shortcomings for some
applications that have been filled by the growing use of NoSQL databases. Rather than conforming
to SQL standards and providing relational data modeling, NoSQL databases typically offer fewer
transactional guarantees than RDBMSs in exchange for greater flexibility and scalability. NoSQL
databases tend to be less complex than RDBMSs and scale horizontally across lower-cost hardware.
Unlike RDBMSs, which share a common relational data model, several different types of databases,
such as column-oriented, key-value, and document-oriented, are considered as “NoSQL”
databases. NoSQL databases tend to be used in applications that do not require the same level of
data consistency guarantees that RDBMS systems provide but that require throughput levels that
would be very expensive for RDBMSs to support.
12. Amazon SimpleDB and DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service. As I said in the introduction of
DBMSs in the clouds, the NoSQL databases are more suitable for situations where applications
experience explosive growth, when traditional databases require reworking to distribute their
workload across multiple servers.
DynamoDB has been created by taking Amazon’s in-house NoSQL database, Dynamo (incremental
scalability, predictable high performance), combining it with the best parts of SimpleDB (ease of
administration of a cloud service, consistency, and a table-based data model that is richer than a
pure key-value store) and putting it into a form suitable for external use as a service.
In the next section I will give a short overview of Dynamo and SimpleDB.
12.1 Dynamo History
The original Dynamo design was based on a core set of strong distributed systems principles, resulting in an ultra-scalable and highly reliable database system. It was developed as a response to the scaling challenges that Amazon.com faced, when direct database access was one of the major bottlenecks in scaling and operating the business. Many services need only primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalogs, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provided a simple primary-key-only interface to meet the requirements of these applications [17][18].
Dynamo was targeted mainly at applications that need an “always writeable” data store where no
updates are rejected due to failures or concurrent writes. It was built for an infrastructure within a
single administrative domain where all nodes are assumed to be trusted. Applications that use
Dynamo do not require support for hierarchical namespaces (a norm in many file systems) or
complex relational schema (supported by traditional databases). Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request directly to the appropriate node. This avoids routing requests through multiple nodes and meets the needs of latency-sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds [17].
While Dynamo gave developers a system that met their reliability, performance, and scalability needs, it did nothing to reduce the operational complexity of running large database systems. Since developers were responsible for running their own Dynamo installations, they had to become experts on the various components running in multiple data centers. Also, they needed to make complex tradeoff decisions between consistency, performance, and reliability. This operational complexity was a barrier that kept them from adopting Dynamo [17].
12.2 Amazon DynamoDB Data Model
Amazon DynamoDB organizes data into tables containing items, and each item has one or more
attributes.
Attributes
An attribute is a name-value pair. The name must be a string, but the value can be a string, number,
string set, or number set. The following are all examples of attributes:
"ImageID" = 1 "Title" = "flower"
"Tags" = "flower", "jasmine", "white" "Ratings" = 3, 4, 2
Item
A collection of attributes forms an item, and the item is identified by its primary key. An item's
attributes are a collection of name-value pairs, in any order. The item attributes can be sparse,
unrelated to the attributes of another item in the same table, and are optional (except for the
primary key attribute). The table has no schema other than its reliance on the primary key. Items
are stored in a table. The primary key uniquely identifies an item for a DynamoDB table. In the
following diagram, Figure 9, the ImageID is the attribute designated as the primary key:
Figure 9 Diagram of DynamoDB Data Model [18]
Notice that the table has a name, "my table", but the item does not: the item is identified by its primary key, "ImageID" = 1 [18].
Tables
Tables contain items, and organize information into discrete areas. All items in the table have the same primary key scheme. The attribute name (or names) to be used for the primary key is designated when a table is created, and the table requires each item to have a unique primary key value. The first step in writing data to DynamoDB is to create a table and designate a table name with a primary key. The following is a larger table that also uses the ImageID as the primary key to identify items.
DynamoDB also allows specifying a composite primary key, which enables designating two attributes in a table that collectively form a unique primary index. All items in the table must have both attributes. One serves as a “hash partition attribute” and the other as a “range attribute.” For example, there might be a “Status Updates” table with a composite primary key composed of “UserID” (hash attribute, used to partition the workload across multiple servers) and “Time” (range attribute). A query can then be executed to fetch either: 1) a particular item uniquely identified by the combination of UserID and Time values; 2) all of the items for a particular hash “bucket” - in this case UserID; or 3) all of the items for a particular UserID within a particular time range. Range queries against “Time” are only supported when the UserID hash bucket is specified [18].
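The three query patterns above can be modeled with a small in-memory sketch. This is a hypothetical illustration of hash/range key semantics, not the DynamoDB API; the class and method names are invented.

```python
from collections import defaultdict

class StatusUpdates:
    """In-memory model of a table with a composite primary key:
    UserID is the hash partition attribute, Time the range attribute."""

    def __init__(self):
        self._buckets = defaultdict(dict)  # UserID -> {Time: item}

    def put(self, user_id, time, item):
        self._buckets[user_id][time] = item

    def get(self, user_id, time):
        # 1) one item uniquely identified by (UserID, Time)
        return self._buckets[user_id].get(time)

    def query_user(self, user_id):
        # 2) all items in one hash bucket, ordered by the range attribute
        return sorted(self._buckets[user_id].items())

    def query_range(self, user_id, start, end):
        # 3) a Time range query, valid only within one UserID bucket
        return [(t, i) for t, i in sorted(self._buckets[user_id].items())
                if start <= t <= end]
```

Partitioning by UserID and ordering by Time within each bucket is exactly why range queries are only possible once the hash bucket is fixed.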
Table: My Images

| Primary Key | Other Attributes |
|---|---|
| ImageID = 1 | ImageLocation = https://s3.amazonaws.com/bucket/img_1.jpg; Date = 1260653179; Title = flower; Tags = Flower, Jasmine; Width = 1024; Depth = 768 |
| ImageID = 2 | ImageLocation = https://s3.amazonaws.com/bucket/img_2.jpg; Date = 1252617979; Rated = 3, 4, 2; Tags = Work, Seattle, Office; Width = 1024; Depth = 768 |
| ImageID = 3 | ImageLocation = https://s3.amazonaws.com/bucket/img_3.jpg; Date = 1285277179; Price = 10.25; Tags = Seattle, Grocery, Store; Author = you; Camera = phone |
| ImageID = 4 | ImageLocation = https://s3.amazonaws.com/bucket/img_4.jpg; Date = 1282598779; Title = Hawaii; Author = Joe; Colors = orange, blue, yellow; Tags = beach, blanket, ball |

Figure 10 DynamoDB Table
12.3 Amazon DynamoDB Features
As we said earlier, Amazon DynamoDB is based on the principles of Dynamo, a progenitor of NOSQL, and brings the power of the cloud to the NOSQL database world. It offers high availability, reliability, and incremental scalability, with no limits on dataset size or request throughput for a given table. Like all the previously explained services, DynamoDB is a managed, scalable system that handles all the complexities of scaling, partitioning and re-partitioning the data over more machine resources to meet the I/O performance requirements. It can scale the resources dedicated to a table across multiple servers spread over multiple Availability Zones, and there are no pre-defined limits to the amount of data each table can store.
In order to achieve high performance, all data items are stored on Solid State Drives (SSDs). Moreover, by not indexing all attributes, the cost of read and write operations is kept low, as write operations involve updating only the primary key index, thereby reducing the latency of both read and write operations.
One of the most important functionalities of DynamoDB is performance predictability. There are many applications that benefit from predictable performance as their workloads scale: online gaming, social graph applications, online advertising, and real-time analytics, to name a few.
DynamoDB offers “Provisioned Throughput”: users can specify the request throughput capacity they require for a given table, and DynamoDB will allocate sufficient resources to the table to predictably achieve this throughput with low-latency performance. Throughput reservations are elastic and can be increased or decreased on demand using the AWS Management Console or the DynamoDB APIs. CloudWatch metrics provide the ability to make informed decisions about the right amount of throughput to dedicate to a particular table.
Amazon DynamoDB also integrates with Amazon Elastic MapReduce (Amazon EMR) which allows
businesses to perform complex analytics on their large datasets using a hosted Hadoop framework
on AWS. [18]
Some of the ways in which EMR can be used with DynamoDB are as follows:
• Users can analyze data stored in DynamoDB using EMR and store the results of the analysis in S3, while leaving the original data in DynamoDB.
• Users can back up data from DynamoDB to S3 using EMR.
• Customers can also use Amazon EMR to access data in multiple stores, do complex analysis over this combined dataset, and store the results of this work.
12.4 Amazon SimpleDB
SimpleDB is another NOSQL DBaaS offered by Amazon. The data model used by Amazon SimpleDB makes it easy to store, manage and query structured data. Developers organize their data set into domains and can run queries across all of the data stored in a particular domain. Domains are collections of items that are described by attribute-value pairs. This can be thought of in terms analogous to concepts in a traditional spreadsheet table. For example, take the details of the customer management database shown in the table below and consider how they would be represented in Amazon SimpleDB. The whole table would be a domain named “customers.” Individual customers would be rows in the table, or items in the domain. The contact information would be described by column headers (attributes). Values are in individual cells.
| CustomerID | First name | Last name | Street address | City | State | Zip | Telephone |
|---|---|---|---|---|---|---|---|
| 123 | Bob | Smith | 123 Main St | Springfield | MO | 65801 | 222-333-4444 |
| 456 | James | Johnson | 456 Front St | Seattle | WA | 98104 | 333-444-5555 |

Figure 11 SimpleDB Table
Amazon SimpleDB differs from tables of traditional databases in important ways. It offers the flexibility to easily go back later and add new attributes that only apply to certain records. For example, to add customers' email addresses to enable real-time alerts on order status, it is possible to add the new records and any additional attributes to the existing “customers” domain. The resulting domain might look something like this:
| CustomerID | First name | Last name | Street address | City | State | Zip | Telephone | Email |
|---|---|---|---|---|---|---|---|---|
| 123 | Bob | Smith | 123 Main St | Springfield | MO | 65801 | 222-333-4444 | |
| 456 | James | Johnson | 456 Front St | Seattle | WA | 98104 | 333-444-5555 | |
| 789 | Deborah | Thomas | 789 Garfield | New York | NY | 10001 | 444-555-6666 | dthomas@xyz.com |

Figure 12 SimpleDB table after adding additional attributes
Domains have a finite capacity in terms of storage (10 GB) and request throughput, which is a considerable scaling limitation. Although it is possible to work around this limitation by partitioning workloads over many domains, this is not simple to implement. SimpleDB also fails to meet the requirement of incremental scalability, which is possible with DynamoDB.
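The workaround mentioned above, partitioning a workload over many domains, is commonly done by hashing each item's key to pick a domain. A minimal sketch; the domain names and the choice of hash are illustrative, not part of the SimpleDB service:

```python
import hashlib

# a fixed set of SimpleDB domains acting as partitions (hypothetical names)
DOMAINS = ["customers_0", "customers_1", "customers_2", "customers_3"]

def domain_for(item_key):
    """Deterministically map an item key to one of the domains,
    spreading data so each partition stays below the 10 GB cap."""
    digest = hashlib.md5(item_key.encode("utf-8")).hexdigest()
    return DOMAINS[int(digest, 16) % len(DOMAINS)]
```

The hidden cost is that queries must then be fanned out across all domains and merged by the application, which is the implementation complexity the text alludes to.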
Another limitation of SimpleDB is predictability of performance. SimpleDB indexes all attributes for each item stored in a domain. While this simplifies schema design and provides query flexibility, it has a negative impact on the predictability of performance. For example, every database write needs to update not just the basic record, but also all attribute indices (regardless of whether all indices are used for querying). Similarly, since a domain maintains a large number of indices, its working set does not always fit in memory. This impacts the predictability of a domain's read latency, particularly as dataset sizes grow.
SimpleDB's original implementation had taken the “eventually consistent”4 approach to the extreme and presented users with consistency windows that were up to a second in duration. This meant that developers used to a more traditional database solution had trouble adapting to it. The SimpleDB team eventually addressed this issue by enabling users to specify whether a given read operation should be strongly or eventually consistent. Since a consistent read can potentially incur higher latency and lower read throughput, it is best used only when an application scenario mandates that a read operation absolutely needs to see all writes that received a successful response prior to that read. For all other scenarios the default eventually consistent read yields the best performance [18].
12.5 Pricing
Like the other services, DynamoDB and SimpleDB keep the pay-only-for-what-you-use model. Pricing is calculated based on provisioned throughput capacity, indexed data storage and data transfer.
When a DynamoDB table is created or updated, the capacity to be reserved for reads and writes is specified, and it is charged hourly based on the reserved capacity. A unit of Write Capacity
enables users to perform one write per second for items of up to 1KB in size. Similarly, a unit of
Read Capacity enables users to perform one strongly consistent read per second (or two eventually
consistent reads per second) of items of up to 1KB in size.
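The capacity-unit rules just described translate directly into a small calculation. A sketch under the 1KB rules above; the function names are mine:

```python
import math

def write_capacity_units(writes_per_sec, item_size_kb):
    """One write unit = one write per second for an item up to 1KB;
    larger items consume proportionally more, rounded up per KB."""
    return writes_per_sec * math.ceil(item_size_kb)

def read_capacity_units(reads_per_sec, item_size_kb, strongly_consistent=True):
    """One read unit = one strongly consistent read per second (or two
    eventually consistent reads) for an item up to 1KB."""
    units = reads_per_sec * math.ceil(item_size_kb)
    return units if strongly_consistent else math.ceil(units / 2)
```

So a workload of ten eventually consistent reads per second of 1KB items needs only five read units, half the strongly consistent cost.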
Amazon DynamoDB is an indexed datastore, and the amount of disk space the data consumes will
exceed the raw size of the data uploaded. Amazon DynamoDB measures the size of the billable data
by adding up the raw byte size of the uploaded data, plus a per-item storage overhead of 100 bytes
to account for indexing. The first 100MB stored per month are offered free and after that the price
is calculated per GB depending on region.
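The billable-storage rule can likewise be sketched: raw bytes plus 100 bytes of indexing overhead per item, with the first 100MB per month free. The function names are illustrative:

```python
def billable_storage_bytes(item_sizes):
    """Raw uploaded size plus a 100-byte per-item indexing overhead."""
    return sum(size + 100 for size in item_sizes)

def chargeable_bytes(total_billable, free_bytes=100 * 2**20):
    """Only storage beyond the free 100MB monthly allowance is charged."""
    return max(0, total_billable - free_bytes)
```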
As with the other AWS services, there is no additional charge for data transferred between Amazon DynamoDB, SimpleDB and other Amazon Web Services within the same Region. Data transferred
across Regions (e.g. between Amazon DynamoDB in the US East (Northern Virginia) Region and
Amazon EC2 in the EU (Ireland) Region), is charged at Internet Data Transfer rates on both sides of
the transfer.
Amazon SimpleDB is billed based on machine-hour utilization and data transfer, depending on the region where the SimpleDB domains are established.
Amazon SimpleDB measures the machine utilization of each request and charges based on the
amount of machine capacity used to complete the particular request (SELECT, GET, PUT, etc.),
normalized to the hourly capacity of a circa 2007 1.7 GHz Xeon processor. [18]
4 Eventually consistent: given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system, and all the replicas will be consistent.
13. Google Datastore
The Google App Engine Datastore is a schemaless object datastore providing robust, scalable storage, mainly targeted at web applications. App Engine's Datastore is built on top of Bigtable. Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many Google projects, like Google Earth, Google Finance and web indexing, use Bigtable for storing data.
13.1 Datastore Data Model
Datastore is basically a key-value paired database. The Datastore holds data objects known as entities. An entity has one or more properties, named values of one of several supported data types: for instance, a property can be a string, an integer, or a reference to another entity. Each entity is identified by its kind, which categorizes the entity for the purpose of queries, and a key, which uniquely identifies it within its kind [19, 20]. Entities of the same kind can have different properties, and different entities can have properties with the same name but different value types. The key consists of the following components:
• The entity's kind
• An identifier, which can be either
  o a key name string
  o an integer numeric ID
• An optional ancestor path locating the entity within the Datastore hierarchy.
Entities in the Datastore form a hierarchically structured space similar to the directory structure of a file system. When an entity is created, it is possible to designate another entity as its parent; the new entity is then a child of the parent entity. This creates the ancestor path [20].
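The key structure described above, kind, identifier and optional ancestor path, can be modeled with a small sketch. This is not the App Engine API; the class is hypothetical:

```python
class Key:
    """A Datastore-style key: a path of (kind, identifier) pairs.
    Designating a parent at creation time fixes the ancestor path."""

    def __init__(self, kind, ident, parent=None):
        # ident may be a key name string or an integer numeric ID
        self.path = (parent.path if parent else ()) + ((kind, ident),)

    @property
    def kind(self):
        # the kind of the entity itself is the last element of the path
        return self.path[-1][0]

    def __eq__(self, other):
        return self.path == other.path

    def __hash__(self):
        return hash(self.path)
```

For instance, `Key("Photo", 42, parent=Key("Person", "alice"))` yields the path `(("Person", "alice"), ("Photo", 42))`, mirroring the file-system-like hierarchy the text describes.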
13.2 Queries and indexes
App Engine predefines a simple index on each property of an entity. An App Engine application can
define further custom indexes in an index configuration file. Because all queries on App Engine are
served by these pre-built indexes, the types of query that can be executed are more restrictive
than those allowed on a relational database with SQL [20]. In particular, the following are not
supported:
• Join operations
• Inequality filtering on multiple properties
• Filtering of data based on results of a subquery
All the queries in the Datastore are eventually consistent. A typical query includes the following:
• An entity kind to which the query applies
• Zero or more filters based on the entities' property values, keys, and ancestors
• Zero or more sort orders to sequence the results
In addition to retrieving entities from the Datastore directly by their keys, an application can
perform a query to retrieve them by the values of their properties [20].
13.3 Transactions
The Datastore can execute multiple operations in a single transaction. By definition, a transaction
cannot succeed unless every one of its operations succeeds. If any of the operations fails, the
transaction is automatically rolled back. This is especially useful for distributed web applications,
where multiple users may be accessing or manipulating the same data at the same time [20].
13.4 Scalability
The App Engine Datastore is designed to scale, allowing applications to maintain high performance
as they receive more traffic:
• Datastore writes scale by automatically distributing data as necessary.
• Datastore reads scale because the only queries supported are those whose performance scales with the size of the result set (as opposed to the data set). This means that a query whose result set contains 100 entities performs the same whether it searches over a hundred entities or a million. This property is the key reason some types of query are not supported [20].
13.5 High Availability
App Engine's primary data repository is the High Replication Datastore (HRD), in which data is
replicated across multiple data centers using a system based on the Paxos algorithm5. This provides
a high level of availability for reads and writes [20].
5 Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
13.6 Data Access
Development on the Datastore is done through Application Programming Interfaces (APIs), which can be accessed from either Python or Java. The App Engine Java SDK provides a low-level Datastore API with simple operations on entities. The SDK also includes implementations of the Java Data Objects (JDO) and Java Persistence API (JPA) interfaces for modeling and persisting data. These standard interfaces include mechanisms for defining classes for data objects and for performing queries [20].
The Python Datastore interface includes a rich data modeling API and a SQL-like query language
called GQL [21, 22].
13.7 Quotas and Limits
Google has defined quotas and limits on various aspects of an application's Datastore usage:
• Each call to the Datastore API counts toward the Datastore API Calls quota.
• Data sent to the Datastore by the application counts toward the Data Sent to Datastore API quota.
• Data received by the application from the Datastore counts toward the Data Received from Datastore API quota.
The total amount of data currently stored in the Datastore for the application cannot exceed the
Stored Data (billable) quota. This includes all entity properties and keys, as well as the indexes
needed to support querying those entities. The following table shows the limits that apply
specifically to the use of the Datastore [20]:
| Limit | Amount |
|---|---|
| Maximum entity size | 1MB |
| Maximum transaction size | 10MB |
| Maximum number of index entries for an entity | 2000 |
| Maximum number of bytes in composite indexes for an entity | 2MB |

Figure 13 Google Datastore Limits [20]
14. MongoLab/MongoDB and Cloudant/Apache CouchDB
Both CouchDB and MongoDB are document-oriented databases with schemaless JSON-style and BSON (binary JSON) style object data storage [26]. Because they offer similar functionality, I will write about them together and give a short overview of their differences. First, what is a document-oriented database?
14.1 Document-oriented database
A document-oriented database or data store does not use tables for storing data. It stores each record as a document with certain characteristics. Documents inside a document-oriented database are similar, in some ways, to records or rows in relational databases, but they are less rigid. They are not required to adhere to a standard schema, nor will they all have the same sections, slots, parts, keys, or the like [24, 25]. For example, here's a document:
FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing".
Another document could be:
FirstName:"Jonathan", Address:"15 Wanamassa Point Road", Children:[{Name:"Michael",Age:10},
{Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2}].
Both documents have some similar information and some different. Unlike a relational database
where each record would have the same set of fields and unused fields might be kept empty, there
are no empty 'fields' in either document (record) in this case. This system allows new information
to be added without requiring an explicit statement that other pieces of information are left out.
The benefit is that when a document-oriented database stores a large number of records, a change
in the number or type of fields does not require altering a table. All that is needed is to insert new
documents with the new structure, and they are automatically added to the current datastore.
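The schema-free insertion described above can be illustrated with a minimal in-memory sketch (a toy for illustration, not the API of MongoDB, CouchDB, or any other product):

```python
import uuid

class DocumentStore:
    """Toy in-memory document store: records are free-form dicts keyed by id."""

    def __init__(self):
        self._docs = {}

    def insert(self, document):
        """Store a document as-is; there is no schema to declare or alter."""
        doc_id = str(uuid.uuid4())
        self._docs[doc_id] = dict(document)
        return doc_id

    def count(self):
        return len(self._docs)

store = DocumentStore()
# Two documents with different shapes coexist; no ALTER TABLE is needed.
store.insert({"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
store.insert({"FirstName": "Jonathan",
              "Children": [{"Name": "Michael", "Age": 10}]})
```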
Documents are addressed in the database via a unique key that represents that document. Often,
this key is a simple string. In some cases, this string is a URI or path. Regardless, this key can be
used to retrieve the document from the database. Typically, the database retains an index on the
key such that document retrieval is fast.
One of the other defining characteristics of a document-oriented database is that, beyond the
simple key-document (or key-value) lookup that you can use to retrieve a document, the database
will offer an API or query language that will allow document retrieval based on their contents. For
example, you may want a query that gets you all the documents with a certain field set to a certain
value. The set of query APIs or query language features available, as well as the expected
performance of the queries, varies significantly from one implementation to the next.
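Both retrieval styles just described, lookup by key and query by document contents, can be sketched over a plain dictionary of documents (the data and helper names are illustrative assumptions):

```python
# A handful of documents addressed by a simple string key (hypothetical data).
docs = {
    "doc-1": {"FirstName": "Bob", "Hobby": "sailing"},
    "doc-2": {"FirstName": "Jonathan", "Hobby": "sailing"},
    "doc-3": {"FirstName": "Elena"},
}

def get(key):
    """Key-document lookup; real stores keep an index on the key for speed."""
    return docs[key]

def find(field, value):
    """Content query: ids of all documents with `field` set to `value`."""
    return sorted(k for k, d in docs.items() if d.get(field) == value)

assert get("doc-1")["FirstName"] == "Bob"
assert find("Hobby", "sailing") == ["doc-1", "doc-2"]
```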
Implementations offer a variety of ways of organizing documents, including notions of:
• Collections
• Tags
• Non-visible metadata
• Directory hierarchies
14.2 MongoDB and CouchDB comparison
As I said earlier, both MongoDB and CouchDB are document-oriented databases with schemaless
JSON-style object data storage. Table 4 shows a comparison between the two databases.
Feature                          CouchDB                                   MongoDB
Data Model                       Document-oriented (JSON)                  Document-oriented (BSON)
Interface                        HTTP/REST                                 Native drivers; REST
Large Objects (Files)            Yes (attachments)                         Yes (GridFS)
Horizontal Partitioning scheme   BigCouch, CouchDB Lounge, Pillow          Auto-sharding
Object Storage                   Database contains documents               Database contains collections;
                                                                           collections contain documents
Query Method                     Map/Reduce (JavaScript and others)        Map/Reduce (JavaScript) creating
                                 creating views, plus range queries        collections; object-based query language
Replication                      Master-master with custom                 Master-slave
                                 conflict resolution function
Concurrency                      MVCC (Multi-Version                       Update in-place
                                 Concurrency Control)
Distributed Consistency          Eventually consistent                     Strong consistency; eventually
                                                                           consistent reads from secondary replicas
Written in                       Erlang                                    C++
Table 4 Comparison of MongoDB and CouchDB
14.3 MVCC – Multi-Version Concurrency Control
One big difference is that CouchDB is MVCC-based, while MongoDB is more of a traditional update-in-place store [24, 25, 27]. MVCC is very good for certain classes of problems:
• Problems that need intense versioning, and problems with offline databases that re-sync later;
• Problems where a large amount of master-master replication is needed.
But MVCC comes with some considerations:
• The database must be compacted periodically if there are many updates;
• When conflicts occur on transactions, they must be handled by the programmer manually
(unless the database also does conventional locking, although then master-master replication
is likely lost) [25].
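The manual conflict handling just described can be sketched with a CouchDB-style revision check. This is a toy simplification of the real `_rev` mechanism, not CouchDB's actual API:

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""

class MVCCStore:
    """Toy multi-version store mimicking CouchDB-style revision checking."""

    def __init__(self):
        self._data = {}   # key -> (revision number, value)

    def put(self, key, value, based_on_rev=None):
        current = self._data.get(key)
        if current is not None and current[0] != based_on_rev:
            # Another writer updated the document first: the conflict must
            # be resolved manually, as described for MVCC above.
            raise ConflictError("stale revision for %r" % key)
        new_rev = 1 if current is None else current[0] + 1
        self._data[key] = (new_rev, value)
        return new_rev

store = MVCCStore()
rev1 = store.put("user:1", {"name": "Bob"})                       # revision 1
rev2 = store.put("user:1", {"name": "Bobby"}, based_on_rev=rev1)  # revision 2
# A second writer still holding rev1 would now get a ConflictError.
```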
MongoDB updates an object in place when possible. Problems requiring high update rates of
objects are a great fit, and compaction is not necessary. Mongo's replication, without the MVCC
model, is more oriented towards master/slave and auto-failover configurations than towards
master-master setups. MongoDB promises high write performance, especially for updates.
14.4 Scalability
One fundamental difference is that a number of Couch users use replication as a way scale. Mongo
uses auto - sharding as a way of scalability. There is couple of available options for sharding
CouchDB available as opensource or by third-party developers. The best known are CouchDB
Lounge and BigCouch used by cloudant.com [25, 26]
BigCouch can be seen as an Erlang/OTP application that allows creating a cluster of CouchDB
instances distributed across many nodes/servers [30]. Instead of a single monolithic CouchDB, the
result is an elastic data store which is fully CouchDB API-compliant.
The clustering layer is most closely modeled after Amazon's Dynamo, with consistent hashing,
replication, and quorum for read/write operations. CouchDB view indexing occurs in parallel on
each partition, and can achieve impressive speedups as compared to standalone serial indexing [25].
14.5 Querying
CouchDB uses a view model which acts as an ongoing incremental map-reduce function, providing
a constantly updated view of the database. From the HTTP interface different views can be
accessed, and data can be retrieved by key/index as well. The view model is well suited for
statically definable queries and job-style operations. There is an elegance to the approach,
although these structures must be pre-declared for each query to be executed. They can be
thought of as materialized views [27].
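The view model just described can be sketched in Python. CouchDB's real map functions are written in JavaScript and the server maintains the result incrementally; this toy version, with assumed data, only shows the shape of the idea:

```python
def map_by_hobby(doc):
    """Map function: emit (key, value) rows for one document."""
    if "Hobby" in doc:
        yield doc["Hobby"], doc["FirstName"]

def build_view(documents, map_fn):
    """Materialize the view: sorted rows, queryable by key or key range."""
    return sorted(row for doc in documents for row in map_fn(doc))

docs = [
    {"FirstName": "Bob", "Hobby": "sailing"},
    {"FirstName": "Jonathan", "Hobby": "sailing"},
    {"FirstName": "Elena", "Hobby": "chess"},
]
view = build_view(docs, map_by_hobby)
# view == [("chess", "Elena"), ("sailing", "Bob"), ("sailing", "Jonathan")]
```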
Mongo uses traditional dynamic queries. As with, say, MySQL, it can do queries where an index
does not exist, or where an index is helpful but only partially so. Mongo includes a query optimizer
which makes these determinations. This is very nice for inspecting the data administratively, and
this method is also good when indexes are not used, such as with insert-intensive collections. When
an index corresponds perfectly to the query, the Couch and Mongo approaches are conceptually
similar [24].
14.6 Atomicity and Durability
Both MongoDB and CouchDB support concurrent modifications of single documents. Both forego
complex transactions involving large numbers of objects.
CouchDB has a "crash-only" design, where the database can terminate at any time and remain
consistent [25, 27].
Previous versions of MongoDB used a storage engine that required a repair-database
operation when starting up after a hard crash. Newer versions offer durability via journaling [24].
14.7 Map Reduce
Both CouchDB and MongoDB support map/reduce operations. For CouchDB, map/reduce is
inherent to the building of all views [24]. With MongoDB, map/reduce is used only for data
processing jobs, not for traditional queries [25].
14.8 JavaScript
Both CouchDB and MongoDB make use of JavaScript. CouchDB uses JavaScript extensively,
including in the building of views.
MongoDB supports the use of JavaScript, but more as an adjunct. In MongoDB, query expressions
are typically expressed as JSON-style query objects; however, one may also specify a JavaScript
expression as part of the query. MongoDB also supports running arbitrary JavaScript functions
server-side and uses JavaScript for map/reduce operations.
14.9 REST
Couch uses REST as its interface to the database. MongoDB relies on language-specific database
drivers for access to the database over a custom binary protocol. Of course, a REST interface can
be added on top of an existing MongoDB driver at any time.
14.10 MongoLab and Cloudant
The most popular platforms offering managed instances of MongoDB and CouchDB as a service are
MongoLab and Cloudant, respectively.
MongoLab is offering two tiers of plans, shared and dedicated in order to accommodate a range of
use cases and budgets. The database can be hosted on Amazon AWS or in Rackspace Cloud. With
the shared plan MongoLab is offering one MongoDB database on a shared mongod server process
on a shared VM host and replication for backups [28]. The architecture is shown in the Figure 13
58
bellow.
[Diagram: two VM hosts, each running master and slave mongod server processes holding
databases DB0–DBN, with replication between master and slave for backup]
Figure 14 MongoLab Shared Plan
The shared plan is offered for free up to 250 MB, and three more options are available: Small,
Medium, and Large. Additional storage is available as an option.
The dedicated plan is offered in two variants: with one dedicated node, or with two or more
dedicated nodes. The dedicated plan with one node is a single dedicated VM with automatic
failover to a secondary on a shared VM. It offers high availability through the replicas, but it does
not allow reading from the replicas as a means to increase read throughput. It also offers
monitoring through the MongoDB Monitoring Service (MMS). MMS is a web service by 10gen (the
software company that develops and provides commercial support for the open-source MongoDB
database) that monitors
and graphs the performance of MongoDB clusters, servers and databases over time. It can monitor
important statistics such as resident memory usage, rate of database operations, write-lock queue
depth, and CPU, alongside any other MongoDB instances users might be running outside of
MongoLab [28].
The dedicated plan with two or more nodes can scale to as many dedicated nodes of equal size as
needed. In addition to providing high availability, it scales read throughput horizontally through
the creation of a Replica Set cluster of more than one member. The architectures of both dedicated
plans are shown in Figures 15 and 16. With dedicated plans, hosting is available on Amazon EC2 or
in the Rackspace Cloud.
Figure 15 Dedicated Plan Architecture: 1 Dedicated Node
Figure 16 Dedicated Plan Architecture: 2+ Dedicated Nodes
Cloudant.com offers multi-tenant and single-tenant (private) CouchDB database clusters that
are hosted and scaled within or across multiple top-tier data centers around the globe. In all
offered plans, Cloudant automatically replicates the data across this network as needed to push it
closer to the global user base, reduce network latency overhead, ensure 24x7 availability, and
provide disaster recovery capabilities [27].
Cloudant provides a domain through which to access the data layer. Behind that domain, Cloudant
stores the data in a horizontally scalable version of the CouchDB database. The horizontal scaling
is done with BigCouch [30], as mentioned earlier. The data layer automatically handles load
balancing, clustering, backup, growing/shrinking the clusters, and high availability. Cloudant also
provides private, single-tenant clusters that exist entirely within a data center or that span data
centers to provide real-time data distribution to multiple locations [29].
Regardless of whether it is a multi- or single-tenant data layer, data can be replicated and
synchronized between Cloudant data centers and:
• other Cloudant data centers, for high availability, backup, scalability, and performance;
• non-Cloudant data centers;
• disconnected devices/networks, which is great for mobile apps;
• edge databases such as data marts or spreadsheets, great for independent analytic projects.
The Cloudant Data Layer also includes a number of dashboards that allow viewing and controlling
data layer performance, usage, search indexing, billing, and other metrics [29].
Cloudant pricing is a little different from MongoLab's: it is based on data stored and millions of
requests per month (MReq/mo). There is a free starting plan that includes 250 MB of storage and
0.5 MReq/mo [25, 28].
Data storage is counted so that it includes only the size of the latest revision of all documents,
plus the size of the view indexes. Older revisions and deleted documents do not count towards size
quotas; they are purged automatically after a certain time.
Requests are approximately the number of document reads and writes from the database.
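To see how such usage-based pricing behaves, a back-of-the-envelope estimator can be sketched. The free-tier figures come from the text above, while the per-unit rates are hypothetical placeholders, not Cloudant's actual prices:

```python
FREE_STORAGE_MB = 250.0   # free-tier storage, from the plan described above
FREE_MREQ = 0.5           # free-tier millions of requests per month

# Per-unit rates are hypothetical placeholders for illustration only.
RATE_PER_GB = 1.00        # $ per GB-month above the free tier
RATE_PER_MREQ = 0.50      # $ per million requests above the free tier

def monthly_cost(storage_mb, mreq):
    """Usage-based bill: only usage above the free tier is charged."""
    billable_gb = max(0.0, storage_mb - FREE_STORAGE_MB) / 1024.0
    billable_mreq = max(0.0, mreq - FREE_MREQ)
    return billable_gb * RATE_PER_GB + billable_mreq * RATE_PER_MREQ

assert monthly_cost(200, 0.3) == 0.0   # entirely inside the free tier
```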
15. What benefits do cloud databases and cloud computing
bring to small and medium organizations?
For small and medium business owners, saving money and time whenever possible is critical to
success. Regardless of whether it is just a startup or a more mature business, cloud software and
services in general can help cut costs and allow the owner to concentrate on the core of the
business. The benefits of cloud computing for small business sound attractive, but that does not
mean it has no disadvantages or is right for every business. As I showed in the previous part of this
paper, there are a lot of options to choose from among the cloud database-as-a-service offerings,
and when all the other available cloud services are included as well, choosing the right provider and
the right services for the business needs is not an easy task. Here I refer to cloud computing in
general, as the benefits of the DBaaS solutions are part of the benefits of cloud computing. First I
will describe the main benefits.
15.1 Advantages for Small Business
I will speak about the advantages and disadvantages in the more general terms of cloud computing,
as the same apply to the cloud database. The main advantages include:
• Lower Initial Investment – the only things needed to start using the cloud are a computer and
an Internet connection. It is possible to take advantage of most cloud offerings without
investing in any new hardware or specialized software, or adding to staff. This is one cloud
computing advantage that has universal appeal regardless of the industry or the type of
business. It allows organizations, and especially startups, to invest in new projects and ideas
without the risk of a big loss.
• Easier to Manage – there are no power requirements or space considerations to think about,
and users do not have to understand the underlying technology in order to take advantage
of it. There is no need to maintain and update any new hardware or software. Planning time
is considerably shorter as well, since there are fewer logistical issues.
• Pay as You Go – large upfront fees are not the norm when it comes to cloud services. Most
cloud services, as I wrote earlier in this paper, are available on a month-to-month basis with
no long-term contracts. This also gives the benefit of keeping multiple projects running
without enormous expenses.
• Scalability – cloud computing can be scaled to match the changing needs of the small
business as it grows. Licenses, storage space, new instances and more can be added as
needed.
• Deploy Faster – it is usually possible to get up and running significantly faster with cloud
services than when there is a need to plan, buy, build, and implement in house. With many
software-as-a-service applications or other cloud offerings, it is possible to start using the
service within hours or days rather than weeks or months.
• Location Independent – because services are offered over the Internet, there are no limits to
using cloud software or services only at work or only on one computer. Access from
anywhere is a big advantage for people who travel a lot, like to be able to work from home,
or whose organization is spread out across multiple locations.
• Device Independent – most web-based software and cloud services are not designed
specifically for any one browser or operating system. Many can be accessed via PC or Mac,
on tablets, and through mobile phones.
15.2 Disadvantages of Cloud Computing
While the advantages of cloud computing are clear and easy enough to understand, there are a
few potential disadvantages that need to be considered carefully.
• Downtime – while we would like to think our data or the cloud-based services we use are
available on demand all day, every day, the truth is they are not. System uptime is entirely
out of our hands with cloud services. There are two types of downtime:
  o Scheduled downtime might be required to upgrade software, install new hardware,
    or perform other routine maintenance. Typically, scheduled downtime is infrequent,
    announced well in advance, and takes place at non-peak hours where usage is likely
    to be low, so as to minimize interruption to the customer.
  o Unscheduled downtime, otherwise known as an outage, is indicative of some sort of
    failure or problem. It is rare, but outages do happen even for the larger, more
    established cloud providers. If an outage does happen, there is not much that can be
    done other than wait.
• Security Issues – this is maybe one of the most discussed issues when considering moving to
the cloud. You are turning over data about your business and your customers to a third
party and entrusting them to keep it safe. Without the proper level of security, your data
could be exposed to users outside your company or accessed by a hacker.
• Less Control over Data Loss – with cloud services, you have to give up some degree of control
over the prevention of data loss. That is in the hands of the cloud service provider.
• Integration and Customization – some web-based software solutions and cloud services are
offered as a one-size-fits-all solution. If you need to customize the application or service to
fit specific needs, or integrate it with your existing systems, doing so may be challenging,
expensive, or not an option.
15.3 Main things to consider when moving to the cloud
Migrating to a cloud solution is usually fairly easy; the service provider usually helps with setting
everything up and transferring the information to the hosted environment. But there are some
considerations an organization should look at.
• Prioritize applications
Focus on the applications that provide the maximum benefit for the minimum cost/risk. Measure
the business criticality, business risk, and functionality of the services, as well as the impact on data
sovereignty, regulation, and compliance. Prioritize which applications to migrate to the cloud and
in which order.
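One way to make this prioritization concrete is a simple weighted score per application. The criteria follow the text above, while the weights, application names, and example scores are purely illustrative assumptions:

```python
# 0-10 scales; a negative weight means the criterion counts against
# migrating that application early (all values are illustrative).
WEIGHTS = {
    "benefit": 0.4,
    "cost_risk": -0.3,
    "criticality": -0.2,
    "compliance_risk": -0.1,
}

def migration_score(app):
    """Higher score = better early candidate for cloud migration."""
    return sum(WEIGHTS[c] * app[c] for c in WEIGHTS)

apps = {
    "marketing-site": {"benefit": 9, "cost_risk": 2,
                       "criticality": 3, "compliance_risk": 1},
    "core-billing":   {"benefit": 6, "cost_risk": 8,
                       "criticality": 9, "compliance_risk": 8},
}
ranked = sorted(apps, key=lambda name: migration_score(apps[name]),
                reverse=True)
# The low-risk, high-benefit application migrates first.
```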
• Consumption models
As can be seen from the different pricing models used by the services and providers described
earlier, each provider has a different consumption model for how the service is procured and used.
These consumption models need to be considered carefully from two perspectives: frequency of
change and volume.
• Data residency and legal jurisdiction
This issue is not recognized by many, but business information held outside the organization's
country is subject to the commercial law of the country it is held in. Most organizations therefore
decide to keep their data in the country of origin to ensure that local law still applies to their
business information.
• Performance and availability
When moving to a distributed IT landscape with some functionality in the cloud, and where there
is integration between cloud applications and on-premise applications, the performance of this
distributed functionality needs careful consideration and potentially increased processing to
ensure service delivery. Similarly, availability needs careful assessment, because an application that
is entirely in the cloud, or distributed across the cloud and on-premise, will have different
availability characteristics from the legacy on-premise application. Organizations also need to
ensure that their local and wide area networks are enabled for the cloud and will support the
associated increase in bandwidth and network traffic.
• Service integration
When moving an application to the cloud, continuity of service and service management needs to
be considered. The service management role changes to more of a service integration role. An
alternative to the in-house service management function providing this capability is using an
outsourcing organization to provide it.
• Architecting for the cloud and cloud application maturity
Cloud computing provides real benefits for organizations, but to realize these benefits the
applications being utilized sometimes need to be architected to take advantage of the scalable
nature of cloud computing. While new applications should be built with this in mind, legacy
applications are often built around legacy systems and hence may not be able to truly leverage the
benefits the cloud can bring without significant re-architecting. The amount of re-architecting
needed even differs between cloud providers, so the provider selection process should include
questions about the provider's technological underpinnings, so that if re-architecting is needed, it
does not come as a surprise. Currently, application maturity is extremely variable from one
application to the next.
• Exit strategy
Before adopting a cloud service provider or application, consider the exit strategy, e.g. data
extraction, and put the costs for this strategy into the business case and service costs. Many people
are rightly concerned about moving to cloud computing and being tied to one provider. This is
indeed a concern, and one which should not be brushed off lightly. That said, cloud computing
tends to be much more transparent when it comes to lock-in, so organizations should be able to
accurately gauge the risks. Organizations should look at a number of different factors:
  o Does the vendor use industry-standard APIs or proprietary ones?
  o Does the vendor provide quick and easy data extraction in the event that the
    customer wishes to switch?
  o Does the vendor use open standards, or have they created their own ways of doing
    things?
  o Can the cloud computing service be controlled by third-party control panels?
• Data migration
Moving data into or out of a SaaS or DBaaS application may require considerable transformation
and load effort.
• Service and transaction state
Maintaining continuity of the state of in-flight transactions at the point of transition into the cloud
needs consideration. This is also the case at the point of exit.
• Service Level Agreement (SLA)
Small business owners usually do not have experience with these types of agreements, and not
reviewing them might open up Pandora's box without their knowing it. The business impact
described in the SLA must be carefully considered and analyzed. Close attention should be paid to
the availability guarantees and penalty clauses:
  o Does the availability fit the organization's business model?
  o What needs to be done to receive the credits when the hosting provider fails to
    achieve the guaranteed service levels?
  o Are they processed automatically, or do they need to be requested in writing?
Usually cloud providers have one SLA for all users and do not provide customization of the SLA.
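When judging whether an availability guarantee fits the business model, it helps to translate the percentage into permitted downtime; the arithmetic is straightforward:

```python
def allowed_downtime_minutes(availability_pct, days=30):
    """Outage minutes per billing period permitted by an availability SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)

# A "three nines" guarantee still permits roughly 43 minutes per month.
print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
```

Seen this way, the difference between a 99% and a 99.9% guarantee is the difference between roughly seven hours and three-quarters of an hour of permitted outage per month.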
All these considerations must be evaluated carefully before moving to a cloud-based solution, in
order to mitigate the risk and be confident of choosing the right cloud services that will support
and ensure the growth of the business.
16. Will cloud computing reduce the budget?
A small business which decides to own and manage its own IT equipment sometimes fails to
recognize that over time this equipment and its components will deteriorate, causing the system to
crash or experience latency. This may pose a bigger problem if the company has remote users and
satellite offices. Without much thought, an entrepreneur will surely put in more money by
upgrading equipment and adding extra redundancy. Additional IT support personnel may be hired.
The cycle truly becomes vicious as new equipment depreciates and breaks down after a couple of
years.
In general, IT eats up a huge part of the company’s budget not only because of the costly equipment
but its maintenance and upgrade costs as well. Upgrades, security threats, and unexpected system
crashes often cost a lot of money. With cloud computing, all these IT capital investments and expenses
are borne by the third-party supplier. The business owner will just have to budget for the system’s
monthly subscription fees per user. There is also no need to invest in IT in anticipation of future
demand, because cloud computing can be deployed on demand when needed. An entrepreneur
can settle on a cloud computing service for better forecasting of the IT budget.
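The forecasting point can be made concrete with a toy comparison of cumulative costs; every figure below is a hypothetical placeholder, not a real quote:

```python
def on_premise_cost(months, capex=20_000.0, monthly_support=800.0):
    """Up-front hardware purchase plus ongoing support (hypothetical figures)."""
    return capex + monthly_support * months

def cloud_cost(months, users=10, fee_per_user=50.0):
    """Pure subscription: per user, per month, with no capital outlay."""
    return users * fee_per_user * months

# With these placeholder figures, year one costs 6,000 in the cloud versus
# 29,600 on-premise, and the cloud budget is a flat, predictable line.
assert cloud_cost(12) == 6_000
assert on_premise_cost(12) == 29_600
```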
Cloud computing simplifies budgeting. The business owner need not worry about merging projects or
complex expansion because he only needs to pay for the resources his company uses. Also, when users
are reduced, the accompanying cloud computing costs are reduced as well. The traditional IT
process of procurement, installation, management, protection, and support of an on-premise
system can be a vicious cycle and contradicts the company's goal of reducing recurring expenses.
Cloud computing services and resources are used only when needed, which greatly reduces
recurring expenditures and helps the company adapt to the frequently evolving conditions of the
market.
With cloud computing, a business owner can better manage uncertainties. He exposes his company
to greater risks if he invests a lot of money in IT. Because of growing demand, a lot of businesses
overinvest in information technology, which eventually increases the expenses and uncertainties
of IT management and maintenance. Cloud computing vendors reduce the company's reliance on
on-premise IT systems, thereby assuming the uncertainties and costs of IT support, security,
backups, and hardware. The business owner therefore has no more liability in the procurement,
management, and upgrade of IT equipment. Growth opportunities can then be pursued without
having to bear the uncertainties of large capital outlays.
One benefit usually overlooked by small entrepreneurs is the fact that cloud computing also
reduces energy costs, because the company has less IT equipment to maintain. IT servers require
specific temperatures to run properly. When a business owner decides to use cloud computing
services, energy bills are reduced because expensive IT equipment is moved to a safe, monitored,
and disaster-proof IT center.
When on-site IT problems arise, employee productivity is inevitably affected and stress levels rise.
When using cloud services, employees can do their work anywhere and anytime they wish. They
can work from home by accessing the software through an Internet connection, which also
improves morale. Travel time and costs are significantly reduced. Each employee who is given
access to the software can even ask the cloud computing supplier's team for support with
problems which may arise while using the system. Management can even remotely monitor each
employee's activity through the management consoles provided by the supplier.
17. Conclusion
Database management systems have for a long time been an integral part of computing. As the
whole IT world moves to the cloud, whether one is assembling, managing, or developing on a
cloud computing platform, a cloud-compatible database is needed. In this work I gave a short
overview of cloud computing and presented a couple of the companies that currently offer
database as a service in the cloud. Although they differ from the most widely used "traditional"
relational database systems, and most of them might require revision and recoding of existing
applications, it is obvious that they bring a lot of benefits, especially with the offer of fully
managed and automated database administration, tuning, and optimization.
Cloud database systems are built to use the power of the cloud. They are extremely scalable and
elastic, giving the opportunity to start small and expand as needed, mitigating the risk and
uncertainties of investing in IT equipment and professional IT support. Cloud computing in general,
with its flexible pricing models and different plans, presents one of the best solutions for startups
and small companies that are developing new products and do not have the financial power to risk
investing in uncertain projects.
Cloud database solutions are ideal for web and mobile applications. The fact that most DBaaS
offerings are tightly integrated with other PaaS offerings gives organizations the opportunity to
avoid wasting resources on administration of the platform and to focus fully on the development
of their products.
Despite the benefits offered by cloud-based DBMSs, many people still have apprehensions about
them. This is most likely due to the various security issues that have yet to be dealt with. Storing
critical business data in the cloud and entrusting its security to a third party, where the data will be
spread across multiple hardware stacks and multiple data centers, can be a big security issue. In my
opinion, the cloud is perhaps still not ready for moving critical enterprise applications which store
highly sensitive data, but it is definitely ready to be used for testing and development of new
projects.
Many companies, including some huge multinational corporations, have already moved to cloud
computing because it is less costly, more efficient, and more agile compared to on-site IT systems.
Therefore, small and medium-scale enterprises should follow suit: if cloud computing is proven to
work for these big enterprises, it will surely work for small and medium enterprises as well.
Appendix
Case studies from the industry – Amazon RDS
Airbnb, a vacation rental firm, kept its main database in Amazon RDS. The consistency between locally
hosted MySQL and Amazon RDS MySQL facilitated the migration to AWS.
A significant architectural consideration for Airbnb was that Amazon provided the underlying
replication infrastructure. "Amazon RDS supports asynchronous master-slave replication," wrote
Tobi Knaup, who added that the hot standby, which ran in a different AWS Availability Zone, was
updated synchronously with no replication lag. Therefore, if the master database failed, the
standby was promoted to the new master with no loss of data. [32]
Case studies from the industry – Microsoft SQL Azure
Xerox Corporation ported an on-premise enterprise print capability to a public cloud environment.
This capability allowed mobile users to find printers with their smartphones and route printouts. As
the on-premise version leveraged Microsoft SQL Server for the database component, Xerox selected
Microsoft SQL Azure for cloud storage. This approach allowed them to reuse their prior investments in
SQL Server-based technology and .NET, and minimize the technical challenges of porting to a
cloud-based environment. They were also able to minimize their skills-based challenges because
the development team was trained on Microsoft products.
Xerox used SQL Azure for "user account information, job information, device information, print job
metadata, and other such data," but the actual print files were stored in Azure Blob Storage, not
SQL Azure. Azure Blob Storage had different pricing and characteristics than SQL Azure; for
example, unlike SQL Azure, Blob Storage was not limited to 10 GB (Web edition) or 50 GB (Business
edition). [33]
Case studies from the industry – Amazon DynamoDB
"When IMDb launches features to our over 110MM monthly unique users worldwide, we want to be
prepared for rapid growth (1000x scale), and for customers to use our software in exciting and
different ways," said H.B. Siegel, CTO, IMDb. "To ensure we could scale quickly, we migrated IMDb’s
popular 10 star rating system to DynamoDB. We evaluated several technologies and chose DynamoDB
because it is a high-performance database system that scales seamlessly and is fully managed. This
saves us a ton of development time and allows us to focus our resources on building better products
for our customers, while still feeling confident in our ability to handle growth."[34]
Case studies from the industry – Amazon SimpleDB
Alexa Web Search crawled the Internet every night and generated a Web-scale datastore with
terabytes of data. They wanted to allow users to run custom queries against this data and generate up
to 10 million results.
To provide this service, Alexa’s architecture team leveraged a combination of AWS services that
included EC2, S3, SQS, and SimpleDB. SimpleDB was used for status information because it was
“schema-less.” AWS’ Jinesh Varia wrote, “There is no need to provide the structure of the record
beforehand. Every controller can define its own structure and append data to a ‘job’ item.” SimpleDB
allowed components of the architecture to independently and asynchronously read and write state
information (e.g., status of jobs in process). While a good fit for state information, SimpleDB, which
had a 10 GB limit per domain, was not used for the nightly multi-terabyte Internet crawl. [35]
References
[1]. Cloud Computing Bible - Barrie Sosinsky, January 2012. ISBN: 978-0-470-90356-8
[2]. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
[3]. Introduction to cloud computing - Ivanka Menken, Emereo Publishing 2011
[4]. Understanding PaaS - Michael P. McGrath, O'Reilly Media January 2012
[5]. Data Management Challenges in Cloud Computing Infrastructures - Divyakant Agrawal, A. E.,
University of California, Santa Barbara.
[6]. Database Scalability, Elasticity, and Autonomy in the Cloud - Divyakant Agrawal, A. E.,
Department of Computer Science, University of California at Santa Barbara.
[7]. Cloud Computing: Principles, Systems and Applications - Gillam, N. A., Springer 2010
[8]. http://relationalcloud.com/index.php?title=Database_as_a_Service
[9]. The multitenant, metadata-driven architecture of Database.com - Database.com Getting
Started Series White Paper
[10]. Megastore: Providing Scalable, Highly Available Storage for Interactive Services - Jason Baker,
C. B.-M. http://pdos.csail.mit.edu/6.824-2012/papers/jbaker-megastore.pdf
[11]. Inside SQL Azure. Microsoft TechNet.
http://social.technet.microsoft.com/wiki/contents/articles/1695.inside-windows-azure-sqldatabase.aspx
[12]. https://www.windowsazure.com/en-us/home/features/data-management/
[13]. https://www.windowsazure.com/en-us/pricing/details/#storage
[14]. http://aws.amazon.com/rds/
[15]. https://developers.google.com/appengine/docs
[16]. http://en.wikipedia.org/wiki/Paxos_algorithm
[17]. Werner Vogels' weblog on building scalable and robust distributed systems
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
[18]. http://aws.amazon.com/dynamodb/
[19]. http://www.databasejournal.com/features/mssql/article.php/3823471/Cloud-Computingwith-Google-DataStore.htm
[20]. Google AppEngine Documents
https://developers.google.com/appengine/docs/java/overview - Product page
[21]. Google AppEngine Documents
https://developers.google.com/appengine/docs/phyton/overview - Product page
[22]. Google AppEngine Documents
https://developers.google.com/appengine/docs/python/datastore/gqlreference
[23]. MongoDB - http://www.mongodb.org/ - Product Page
[24]. MongoDB blog: http://blog.mongodb.org – Product Blog
[25]. Cloudant Blog http://blog.cloudant.com/cloudant-bigcouch-is-open-source - Product Blog
[26]. http://bsonspec.org/
[27]. http://wiki.apache.org/couchdb/ Product wiki
[28]. http://www.mongolab.com – Product page
[29]. Technical Overview: Anatomy of the Cloudant Data Layer Service - 2012 Cloudant, Inc.
[30]. http://bigcouch.cloudant.com/
[31]. Building Scalable Database Solution with SQL Azure - Introducing Federation in SQL Azure.
http://blogs.msdn.com
[32]. http://aws.amazon.com/solutions/case-studies/airbnb/
[33]. https://www.windowsazure.com/en-us/home/case-studies/
[34]. http://aws.amazon.com/dynamodb/testimonials/#imdb
[35]. http://aws.amazon.com/solutions/case-studies/alexa/
[36]. White Paper - Top Ten Data Management Trends - Scalability Experts - Raj Gill, Y. B.
[37]. http://nosql.mypopescu.com/post/1669537044/sql-and-nosql-in-the-cloud
[38]. White Paper - NOSQL for the Enterprise - Neo Technology (2011)
[39]. White Paper - Database as a Cloud Service - Scalability Experts - Wolter, R. (2011)